
Showing posts from January, 2019

Building Your Own Natural Language Processor - Splitting

We learned from the article on Natural Language Processors that there are multiple stages to building your processor. The first step is Splitting: the process of taking the text and separating it into sentences.

Adding the splitting functionality to your processor can be as simple as using this, or some variation of it: Regex.Split(toSplit, @"(?<=[\.!\?])\s+").ToList(); This works fine for most sentences, but depending on the source of your text or the language, you may need to adjust it. The basic provider included in ByoNlpQuickStart simply uses this. DAIN and DIANA use a couple of providers depending on the source, language, and other criteria. I created a rules-based provider that uses JSON rules to determine which one to load and use for a given scenario. Sometimes it will run 2 providers and then compare the results as a quality control. If you are interested in learning more about building your own, then reach out and let's discuss.
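As a minimal sketch, the one-liner above could be wrapped into a standalone splitter like this. The class and method names here are my own illustration; only the Regex.Split call itself comes from the post:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class SentenceSplitter
{
    // Split after sentence-ending punctuation (. ! ?) followed by
    // whitespace. The lookbehind keeps the punctuation attached to
    // the sentence it ends.
    public static List<string> Split(string toSplit)
    {
        return Regex.Split(toSplit, @"(?<=[\.!\?])\s+").ToList();
    }

    public static void Main()
    {
        var sentences = Split("Hello world. How are you? Fine!");
        foreach (var s in sentences)
            Console.WriteLine(s);
        // Hello world.
        // How are you?
        // Fine!
    }
}
```

Note that this naive pattern will also split after abbreviations like "Dr." followed by a space, which is one reason you may need a smarter provider depending on your source text.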

Building Your Own Natural Language Processor - Designing Your Model

In our case we will be building our processor to be compatible with the NlpQuickStart, so that will affect how our provider's external interface works, but we still have the flexibility to do our own thing internally and then expose the results in a manner consistent with other providers.

The base model for DAIN/DIANA has 2 elements:

Words: Everyone knows what these are, so I don't really have to explain them.

Grammar Blocks: A grammar block is a segment of words that takes on a given meaning. For example, "Lord Of The Rings" would be a grammar block.

Basic NLPs break sentences into words and then use those words to build out the processing. Some look for an action word and then build upon that. However, the flaw in this is that sometimes the noun is a series of words, or an action is a series of words. A grammar block allows you to relate the words to a possible grammar block that contains rules on when that block applies. You can later apply actions based on a grammar block.
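The two-element model described above might be sketched like this. All of the type and property names here are assumptions for illustration, not the actual DAIN/DIANA types:

```csharp
using System;
using System.Collections.Generic;

// A run of words that takes on a single meaning,
// e.g. the title "Lord Of The Rings".
public class GrammarBlock
{
    public List<string> Words { get; } = new List<string>();
    public string Meaning { get; set; }
}

// A processed sentence: the flat word list plus any
// grammar blocks recognized within it.
public class ProcessedSentence
{
    public List<string> Words { get; } = new List<string>();
    public List<GrammarBlock> GrammarBlocks { get; } = new List<GrammarBlock>();
}

public static class ModelDemo
{
    public static void Main()
    {
        var sentence = new ProcessedSentence();
        sentence.Words.AddRange(new[] { "I", "read", "Lord", "Of", "The", "Rings" });

        // Group the multi-word title into one block so later
        // processing treats it as a single noun.
        var title = new GrammarBlock { Meaning = "Title" };
        title.Words.AddRange(new[] { "Lord", "Of", "The", "Rings" });
        sentence.GrammarBlocks.Add(title);

        Console.WriteLine($"{sentence.Words.Count} words, {sentence.GrammarBlocks.Count} grammar block");
        // 6 words, 1 grammar block
    }
}
```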

Why To Build Your Own Natural Language Processor

The first question to ask yourself is: why do I want to build my own when there are so many others out there? There are a few reasons why:

Domain Knowledge Bias: A lot of the available processors are built on global shared models, and as everyone uses those models they learn the things that everyone is training them on. For your specific domain you may want to use your own model. Some providers may allow you to swap out the model, in which case you can simply build your own model rather than a full provider.

Language or Culture Bias: Sometimes a model does not support your language and you cannot swap out the model, or it does support your language but some of its cultural nuances are not handled. If the problem is related to the model then you can simply swap it out, but if it is part of the algorithm you cannot.

Algorithm Failure: The algorithm fails for given scenarios, or you need to inject additional rules to provide context and the processor does not allow for it.

Cost: Using your own means you can host it wherever you want and control the costs.

SpellCheck and other Cleaning Techniques

Just a reminder as we head into the Holiday season that we will not be doing blog posts for the next 2 weeks, as you are probably going to be too busy anyway. Have a Merry Christmas and a Happy New Year, and we will be back every Monday in 2019.

If you work a lot with data you have probably heard a lot about data cleansing. There are various ways to perform it. For CRM there are ways to split full names into first, middle, and last names. There are ways to validate postal codes. The same applies to AI and ML: you need to help the provider with a good data set. Before doing Natural Language Processing there are 3 general things you can do to improve the success of your processing:

1. Do language detection. This will help because if the text is in French then all your English models for NLP will fail.

2. Run SpellCheck on your text. Although this seems simple, it is often missed when doing NLP. If you use a very basic dictionary then it will catch any words that are not common words.
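A minimal sketch of the dictionary-based SpellCheck step described above might look like this. The tiny word list and the class name are assumptions for illustration; a real pipeline would use a full dictionary or a spell-check library:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class PreNlpCleaner
{
    // Stand-in for a real dictionary of common words.
    private static readonly HashSet<string> Dictionary =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        { "the", "quick", "brown", "fox", "jumps" };

    // Return the words not found in the dictionary so they can be
    // corrected (or flagged) before the text reaches the NLP provider.
    public static List<string> FindUnknownWords(string text)
    {
        var unknown = new List<string>();
        foreach (Match m in Regex.Matches(text, @"[A-Za-z]+"))
        {
            if (!Dictionary.Contains(m.Value))
                unknown.Add(m.Value);
        }
        return unknown;
    }

    public static void Main()
    {
        var unknown = FindUnknownWords("The quikc brown fox jumps");
        Console.WriteLine(string.Join(", ", unknown)); // quikc
    }
}
```

Catching misspellings like "quikc" before processing keeps the NLP provider from trying to interpret words that do not exist.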