Building Your Own Natural Language Processor - Splitting

We learned from the article on Natural Language Processors that there are multiple stages to building your processor. The first stage is splitting: the process of taking the text and separating it into sentences.

Adding the splitting functionality to your processor can be as simple as using this or some variation of it:

Regex.Split(toSplit, @"(?<=[\.!\?])\s+").ToList();

This works fine for most sentences, but depending on the source of your text or its language you may need to adjust it. The basic provider included in ByoNlpQuickStart simply uses this pattern as is.
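To make the trade-off concrete, here is a minimal sketch that wraps the same pattern in a reusable method. The class name BasicSentenceSplitter and the null/whitespace handling are my own assumptions for illustration, not necessarily how the ByoNlpQuickStart provider is written.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical wrapper around the one-line split shown above.
public static class BasicSentenceSplitter
{
    // Split on whitespace that directly follows '.', '!' or '?'.
    // The lookbehind keeps the punctuation attached to its sentence.
    private static readonly Regex Boundary =
        new Regex(@"(?<=[\.!\?])\s+", RegexOptions.Compiled);

    public static List<string> Split(string toSplit)
    {
        // Assumption: empty or missing input yields an empty list.
        if (string.IsNullOrWhiteSpace(toSplit))
        {
            return new List<string>();
        }

        // Trim first so trailing whitespace does not produce an empty entry.
        return Boundary.Split(toSplit.Trim())
                       .Where(s => s.Length > 0)
                       .ToList();
    }
}
```

Calling BasicSentenceSplitter.Split("Hello world. How are you? Fine!") gives three sentences. Abbreviations are one case where the pattern over-splits: "Dr. Smith arrived. He sat down." also comes back as three pieces, because the period after "Dr" looks like a sentence end. That is the kind of source-specific adjustment you may need to make.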

DAIN and DIANA use a couple of providers depending on the source, language and other criteria. I created a rules-based provider that uses JSON rules to determine which one to load and use for a given scenario. Sometimes it will run two providers and then compare the results as a quality-control check.
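The idea of a rules-based provider could be sketched roughly as follows. This is not the DAIN or DIANA code: the ISplitProvider interface, the SplitRule shape (language, source, providers, with "*" as a wildcard) and the RuleBasedSplitter class are all hypothetical names invented for this example. It assumes System.Text.Json is available (.NET Core 3.0 or later, or the NuGet package).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.Json;
using System.Text.RegularExpressions;

// Hypothetical provider contract for this sketch.
public interface ISplitProvider
{
    List<string> Split(string text);
}

// A provider driven by a regex boundary pattern, such as the one shown earlier.
public sealed class RegexSplitProvider : ISplitProvider
{
    private readonly Regex _boundary;

    public RegexSplitProvider(string pattern)
    {
        _boundary = new Regex(pattern);
    }

    public List<string> Split(string text) =>
        _boundary.Split(text.Trim()).Where(s => s.Length > 0).ToList();
}

// Hypothetical rule shape. "*" matches any language or source.
public sealed class SplitRule
{
    public string Language { get; set; } = "*";
    public string Source { get; set; } = "*";
    public List<string> Providers { get; set; } = new List<string>();
}

public sealed class RuleBasedSplitter
{
    private readonly List<SplitRule> _rules;
    private readonly Dictionary<string, ISplitProvider> _providers;

    public RuleBasedSplitter(string rulesJson, Dictionary<string, ISplitProvider> providers)
    {
        var options = new JsonSerializerOptions { PropertyNameCaseInsensitive = true };
        _rules = JsonSerializer.Deserialize<List<SplitRule>>(rulesJson, options)
                 ?? new List<SplitRule>();
        _providers = providers;
    }

    // Runs every provider named by the first matching rule. Returns the first
    // provider's sentences plus a flag saying whether all providers agreed.
    public (List<string> Sentences, bool Agreed) Split(string text, string language, string source)
    {
        var rule = _rules.FirstOrDefault(r =>
            (r.Language == "*" || r.Language == language) &&
            (r.Source == "*" || r.Source == source));

        if (rule == null || rule.Providers.Count == 0)
        {
            throw new InvalidOperationException("No splitting rule matches this text.");
        }

        var results = rule.Providers
                          .Select(name => _providers[name].Split(text))
                          .ToList();

        // Quality control: compare every other result against the first one.
        bool agreed = results.Skip(1).All(r => r.SequenceEqual(results[0]));
        return (results[0], agreed);
    }
}
```

With a rules file such as [{"language":"en","source":"chat","providers":["basic","strict"]},{"providers":["basic"]}], English chat text would be split by two providers and flagged when they disagree, while everything else falls through to the basic provider alone. What to do on disagreement (log it, prefer one provider, send it for review) is a design choice this sketch leaves open.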

If you are interested in learning more about building your own, reach out and let's discuss. We have created a ByoNlpQuickStart that can help you get started. I will be providing that QuickStart and others to SitecoreDain subscribers soon. If you cannot wait, email me at chris.williams@readwatchcreate.com and, if you are a subscriber, I can release an early version to get you started now.
