Natural Language Processing

Another key concept in DAIN and DIANA is Natural Language Processing (NLP). A variety of libraries and algorithms can break a sentence or series of statements into their parts of speech. DAIN and DIANA use the provider model the same way we did for Language Detection. This lets you swap out your NLP provider, or use several at once, with weighting to determine which provider's result takes precedence.

Weighting can be a simple number, or it can include rules of varying complexity. For example: this provider works better for English than it does for French, or if this sentence contains these keywords then use this provider over that one.
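
A minimal sketch of how such weighting could work, written in Python for brevity. This is not the DAIN or DIANA implementation; the names Provider, prefers_language, prefers_keywords and pick_provider are all hypothetical and exist only for this illustration. Each provider has a base weight, and rules add a bonus when the language or keywords match.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A rule looks at the detected language and the text and returns a weight bonus.
Rule = Callable[[str, str], float]


@dataclass
class Provider:
    """Hypothetical NLP provider description: a name, a base weight and rules."""
    name: str
    base_weight: float
    rules: List[Rule] = field(default_factory=list)

    def weight_for(self, language: str, text: str) -> float:
        # Effective weight = base weight + every bonus that applies.
        return self.base_weight + sum(rule(language, text) for rule in self.rules)


def prefers_language(language_code: str, bonus: float) -> Rule:
    """Rule: add a bonus when the text is in the given language."""
    return lambda language, text: bonus if language == language_code else 0.0


def prefers_keywords(keywords: List[str], bonus: float) -> Rule:
    """Rule: add a bonus when the text contains any of the keywords."""
    lowered = [keyword.lower() for keyword in keywords]
    return lambda language, text: (
        bonus if any(keyword in text.lower() for keyword in lowered) else 0.0
    )


def pick_provider(providers: List[Provider], language: str, text: str) -> Provider:
    """Return the provider with the highest effective weight for this input."""
    return max(providers, key=lambda p: p.weight_for(language, text))


# Example setup (assumed values, purely illustrative).
providers = [
    Provider("general", 1.0, [prefers_language("en", 0.5)]),
    Provider("finance", 0.8, [prefers_keywords(["invoice", "refund"], 1.0)]),
]

if __name__ == "__main__":
    print(pick_provider(providers, "en", "Where is my refund?").name)
    print(pick_provider(providers, "en", "Hello there").name)
```

Keeping the rules as small functions means a new rule, such as a minimum sentence length, can be added without touching the selection logic.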

Natural Language Processing Steps

There are various phases in Natural Language Processing. This article explains the basic ones. Some libraries expose each step separately, while others group them together or offer a single method that does all of them:
  1. Splitting: This is the process of taking the text and separating it into sentences.
  2. Tokenizing: This is the process of taking each sentence and separating it into "words" or tokens.
  3. Parts of Speech Tagging: This process takes each token and assigns it a part of speech, e.g. noun, verb, etc.
  4. Token Grouping or Chunking: This is when you take the various tokens with their parts of speech and group them for easier processing.
  5. Parse, sometimes referred to as Full Parse: This takes all the grouped tokens and builds a tree to represent the sentences.
  6. Named Entity Recognition or Find Names: This takes the results of step 5 and uses them to find entities such as people, places, dates or money.
  7. Sentiment: Some Natural Language Processors provide sentiment built in; with others you will need to use another library for sentiment.
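
To make the first three steps concrete, here is a toy sketch in Python. It is not how SharpNLP or OpenNLP work internally (those use trained models); it assumes simple regular expressions for splitting and tokenizing and a tiny hand-made lexicon, TOY_LEXICON, for tagging. All function names are hypothetical.

```python
import re
from typing import List, Tuple


def split_sentences(text: str) -> List[str]:
    """Step 1: split after '.', '!' or '?' when followed by whitespace.

    Trailing text without final punctuation is kept as its own sentence.
    """
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def tokenize(sentence: str) -> List[str]:
    """Step 2: words (optionally with an internal apostrophe) or single punctuation marks."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)


# Toy lexicon, an assumption for this sketch; real taggers use statistical models.
TOY_LEXICON = {
    "the": "DET",
    "dog": "NOUN",
    "barks": "VERB",
    ".": "PUNCT",
}


def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Step 3: look each token up in the toy lexicon; unknown tokens get 'UNK'."""
    return [(token, TOY_LEXICON.get(token.lower(), "UNK")) for token in tokens]


if __name__ == "__main__":
    for sentence in split_sentences("The dog barks. It is loud"):
        print(tag(tokenize(sentence)))
```

Chunking, parsing and named entity recognition build on these tagged tokens, which is why libraries often bundle the early steps into a single call.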

Using SharpNLP

There are a few options available that can do Natural Language Processing. SharpNLP is an open source library ported from OpenNLP. The original SharpNLP is a .NET Framework 2.0 library and has not been updated in quite a while. There are some known bugs that will be resolved in the .NET Core port. If you are still using the .NET Framework version you need to be aware of them:
  1. SentenceDetect has an issue if you have more than one sentence and the last sentence does not end with punctuation: it simply leaves that sentence out of the results.

In writing this blog post I found I could not use the existing library against the new OpenNLP models. Also, because it is a .NET Framework 2.0 library, .NET Core and Xamarin developers could not use it. For these reasons I started porting the library to .NET Standard. I also started creating the demo application as a Console application, as this is more likely how you are going to use the library. Check out SharpNLPStandard on GitHub. If you have questions or would like to contribute, email me at chris.williams@readwatchcreate.com and I can grant you permissions.

To use SharpNLPStandard you have to grab the latest models. As SharpNLP is based on OpenNLP, you download the OpenNLP models and copy them to a folder.

There are a few steps you need to do to use these models:

1) The model .bin files are actually zip files, so rename them all to .zip.
2) Extract the models using an unzip tool.
3) Be aware that en-pos-perceptron.bin is not a GIS model, so either rename it to .pbin or put it in a different model folder.
4) Convert the models to the format SharpNLP uses. This is done with the ModelConverterCoreConsole. You pass in the path to the folder you copied the OpenNLP models to, and it looks for them recursively and converts them to .nbin files.
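
If you would rather script steps 1 and 2 than do them by hand, the sketch below shows one way in Python. It is an illustration only and not part of SharpNLPStandard; unpack_models is a hypothetical name, and the choice to extract each archive into a folder named after the model file is an assumption. Python's zipfile module reads an archive regardless of its extension, so no rename is needed. Steps 3 and 4 are left out, because they depend on the ModelConverterCoreConsole.

```python
import zipfile
from pathlib import Path
from typing import List


def unpack_models(model_dir) -> List[Path]:
    """Extract every zip-formatted .bin file in model_dir.

    Each archive is extracted into a sibling folder named after the file
    without its extension, e.g. en-token.bin -> en-token/.
    Returns the list of folders that were created or refreshed.
    """
    model_dir = Path(model_dir)
    extracted = []
    for bin_path in sorted(model_dir.glob("*.bin")):
        # Skip anything that is not really a zip archive.
        if not zipfile.is_zipfile(bin_path):
            continue
        target = model_dir / bin_path.stem
        with zipfile.ZipFile(bin_path) as archive:
            archive.extractall(target)
        extracted.append(target)
    return extracted


if __name__ == "__main__":
    # "models" is a placeholder path; point it at your own model folder.
    for folder in unpack_models("models"):
        print("extracted", folder)
```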

I started to use the models with DAIN to test, but sentence detection was only working part of the time, so I went back to using the one I specifically developed for DAIN. I will be discussing the ByoNlpProvider in a January article called Build Your Own NLP Provider.

Now you have something you can start with to determine a response. If you are ready to delve deeper, take a look at the NlpQuickStart. It provides two providers to start: the first wraps SharpNLP and the second uses Microsoft Cognitive Services. A third that will wrap Watson is coming soon.

Using NlpQuickStart

NlpQuickStart uses a pipeline to allow you to leverage one or more Natural Language Processing providers to determine a response.

If you are using the NlpQuickStart and decide to use SharpNLP, you have the option of using the NlpQuickStartSharpNlpProvider or the NlpQuickStartAutoConvertingSharpNlpProvider. With the AutoConverting one you simply copy the .bin files into the model folder and it takes care of all the unzipping and converting upon first use of the model. When you have a new version of a model, save it in the model folder and delete the converted folder, which has the same name as the model. The next use will automatically convert it.
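
The convert-on-first-use idea can be sketched in a few lines of Python. This is not the NlpQuickStartAutoConvertingSharpNlpProvider source; ensure_converted and the convert callback are hypothetical, and the sketch assumes the converted folder is named after the model file without its extension.

```python
import shutil
from pathlib import Path
from typing import Callable


def ensure_converted(model_file: Path, convert: Callable[[Path, Path], None]) -> Path:
    """Return the converted folder for model_file, converting only if it is missing.

    convert(source_file, target_folder) is supplied by the caller and does
    the actual unzip-and-convert work.
    """
    converted_dir = model_file.with_suffix("")  # e.g. en-sent.bin -> en-sent/
    if converted_dir.is_dir():
        # Already converted on an earlier use; nothing to do.
        return converted_dir
    converted_dir.mkdir(parents=True)
    try:
        convert(model_file, converted_dir)
    except Exception:
        # Remove the half-built folder so the next use retries the conversion.
        shutil.rmtree(converted_dir, ignore_errors=True)
        raise
    return converted_dir
```

Deleting the converted folder is what triggers a fresh conversion, which matches the update procedure described above.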

Note that the SharpNLP models are not robust, so you will run into issues with translation. We use Build Your Own NLP Provider for DAIN.

The Cognitive Services version requires a valid Speech API Key.

The Watson provider requires a valid Bluemix account. We used the article Create A Watson Conversation Service and Consume it from .NET and then made some modifications to fit our provider model.

If you are interested in Natural Language Processing (NLP) and adding it to your applications, reach out and let's discuss. I will be providing a basic QuickStart to SitecoreDain subscribers soon.

If you cannot wait, email me at chris.williams@readwatchcreate.com. If you are a subscriber, I can release an early version to get you started now.
