Building Your Own Natural Language Processor - Tokenize

Now that we have sentences, we need to break them into words. This phase is called "Tokenize".

Tokenize: This is the process of taking each sentence and separating it into "words" or tokens.

For a basic provider, a simple split into words is enough. These functions can be found in the CSHARP.Text repository, but I have placed them here as well.

        /// <summary>
        /// Splits a string into its words for manipulation
        /// </summary>
        /// <param name="toSplit">String to split into words</param>
        /// <returns>The list of words found in the string</returns>
        /// <remarks>Uses default values to split words</remarks>
        public List<string> SplitStringIntoWords(string toSplit)
        {
            return SplitStringIntoWords(toSplit, new char[] { ' ', ',', ';', ':', '(', ')', '{', '}', '[', ']', '!', '.', '?' });
        }

        /// <summary>
        /// Splits a string into its words for manipulation
        /// </summary>
        /// <param name="toSplit">String to split into words</param>
        /// <param name="endOfWordToken">Characters that mark the end of a word</param>
        /// <returns>The list of words found in the string</returns>
        /// <remarks>v2.0.0.11 Strips string before splitting into words</remarks>
        public List<string> SplitStringIntoWords(string toSplit, char[] endOfWordToken)
        {
            var words = new List<string>();
            var splitBuffer = toSplit.Trim();

            while (string.IsNullOrEmpty(splitBuffer) == false)
            {
                string foundWord = GetBeforeOneOf(splitBuffer, endOfWordToken, "EXCLUDING");

                // only add word if not empty string.
                if (string.IsNullOrEmpty(foundWord) == false) words.Add(foundWord);

                splitBuffer = (foundWord == splitBuffer)
                    ? string.Empty
                    : GetAfterOneOf(splitBuffer, endOfWordToken, "EXCLUDING").Trim();
            }

            return words;
        }
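SplitStringIntoWords depends on GetBeforeOneOf and GetAfterOneOf from CSHARP.Text, which are not shown above. This is a minimal sketch of how they might behave, assuming "EXCLUDING" means the matched delimiter is dropped from both results; the real CSHARP.Text implementations may differ.

```csharp
using System;

// Hedged sketch of the CSHARP.Text helpers used by SplitStringIntoWords.
// Assumption: "EXCLUDING" means the matched delimiter is left out of the result.
static string GetBeforeOneOf(string text, char[] delimiters, string mode)
{
    int index = text.IndexOfAny(delimiters);
    // No delimiter found: the whole string is the word.
    return index < 0 ? text : text.Substring(0, index);
}

static string GetAfterOneOf(string text, char[] delimiters, string mode)
{
    int index = text.IndexOfAny(delimiters);
    // No delimiter found: nothing remains after the word.
    return index < 0 ? string.Empty : text.Substring(index + 1);
}

Console.WriteLine(GetBeforeOneOf("http://www.test.com", new[] { ':', '.' }, "EXCLUDING")); // http
```

With helpers like these, the while loop above repeatedly takes the text before the first delimiter as a word, then continues with the text after it until the buffer is empty.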

In your provider you would call SplitStringIntoWords(sentence) for each of your sentences and then store the tokens on the sentence's Tokens property.

            foreach(var sentence in Sentences)
            {
                sentence.Tokens.AddRange(SplitStringIntoWords(sentence.Sentence));
            }

For your purposes you may need to add more types of characters to split on, or even swap out the word-splitting logic for another implementation. The basic provider included in ByoNlpQuickStart simply uses the code above.

For DAIN and DIANA I started with this, but found that I had to adjust the code to support situations where these characters could either end a word or simply be part of one. The first exception I hit was URLs mentioned in text. Because ':' and '.' are both in the default delimiter set, a sentence containing http://www.test.com would be tokenized so that "http" became one token, "//www" another, "test" another, and "com" another. This is not correct, of course.
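You can reproduce the problem with .NET's built-in String.Split as a stand-in for the method above, using the same default delimiter set:

```csharp
using System;

// Demonstrates how the default delimiter set breaks a URL apart.
// String.Split stands in for SplitStringIntoWords here.
var delimiters = new[] { ' ', ',', ';', ':', '(', ')', '{', '}', '[', ']', '!', '.', '?' };
var tokens = "See http://www.test.com for details".Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
Console.WriteLine(string.Join(" | ", tokens));
// → See | http | //www | test | com | for | details
```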

I added a rules property that stored JSON rules. During Tokenize, the code looks for specific exceptions before parsing. If none apply, it simply splits using the code above; if there are exceptions, it runs the pre-processing step for the rules that apply, then splits into words, and then runs the post-processing steps for those same rules. This has worked so far for DAIN and DIANA.
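As an illustration of that pre/post-processing idea, here is a minimal sketch with one hard-coded URL rule: swap each URL for a delimiter-free placeholder before splitting, then restore it afterwards. The actual DAIN/DIANA rules are JSON-driven and more general; the regex, the placeholder naming, and the use of the built-in Split are all assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

var delimiters = new[] { ' ', ',', ';', ':', '(', ')', '{', '}', '[', ']', '!', '.', '?' };
var urlPattern = new Regex(@"https?://\S+");
var placeholders = new Dictionary<string, string>();

string sentence = "See http://www.test.com for details";

// Pre-processing: replace each URL with a placeholder that contains no delimiters.
string buffer = urlPattern.Replace(sentence, match =>
{
    var key = $"URLTOKEN{placeholders.Count}";
    placeholders[key] = match.Value;
    return key;
});

// Split into words (a plain Split stands in for SplitStringIntoWords here).
var tokens = new List<string>(buffer.Split(delimiters, StringSplitOptions.RemoveEmptyEntries));

// Post-processing: restore the original URLs.
for (int i = 0; i < tokens.Count; i++)
    if (placeholders.TryGetValue(tokens[i], out var url)) tokens[i] = url;

Console.WriteLine(string.Join(" | ", tokens));
// → See | http://www.test.com | for | details
```

The same pattern extends to other exceptions (abbreviations, version numbers, and so on): each rule contributes a pre-processing substitution and a matching post-processing restore.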

If you are interested in learning more about building your own, reach out and let's discuss. We have created a ByoNlpQuickStart that can help you get started. I will be providing that QuickStart and others to SitecoreDain subscribers soon. If you cannot wait, email me at chris.williams@readwatchcreate.com and, as a subscriber, you can get an early version to start now.
