Building Your Own Natural Language Processor - Designing Your Model

In our case we will be building our processor to be compatible with the NlpQuickStart, which shapes our provider's external interface. Internally, however, we still have the flexibility to do our own thing and then expose the results in a manner consistent with other providers.

The base model for DAIN/DIANA has 2 elements:
  1. Words: Everyone knows what these are, so I don't really have to explain them.
  2. Grammar Blocks: A grammar block is a segment of words that takes on a given meaning. For example, "Lord Of The Rings" would be a grammar block.
Basic NLPs break sentences into words and then use those words to drive the processing. Some look for an action word and build upon that. The flaw in this approach is that sometimes the noun, or the action, is a series of words. A grammar block lets you relate a sequence of words to a single block that carries rules on when that block applies, and you can later apply actions based on the grammar block.
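
To make the idea concrete, here is a minimal sketch of the difference between word-level splitting and grammar-block matching. The phrase list and matching logic are purely illustrative and not part of the QuickStart:

        using System;

        var sentence = "I love Lord Of The Rings";

        // Word-level split: six separate tokens, losing the fact that four of them form one noun.
        var words = sentence.Split(' ');
        Console.WriteLine($"Words: {words.Length}");

        // Grammar-block pass: scan the sentence for known multi-word phrases and treat
        // each match as a single unit of meaning.
        var knownBlocks = new[] { "LORD OF THE RINGS" };
        foreach (var block in knownBlocks)
        {
            if (sentence.ToUpperInvariant().Contains(block))
            {
                Console.WriteLine($"Grammar block found: {block}");
            }
        }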

When dealing with sources like Twitter that have handles and hash tags, you may find that the name behind the handle, and not the handle itself, is the relevant term.

For example, @hamillhimself does not add relevance to your sentence unless you look up who they are and realize it is Mark Hamill.
A handle can also help by relating an inside joke known for that individual, e.g. Jeremy Davis is always placed in pictures as cameos, Akshay wonders why you won't be his friend, Corey loves poutine and there are lots of moustache references, and I love me a good chocolate milk.

You can extend your model later if there is more context you want to capture, but to start we will keep to these two simple elements: Word and GrammarBlock.

Word

Here are the initial properties for a Word. When we get into training you will see in more detail how each one is used, but this gives you an idea of what is necessary. DAIN and DIANA use extended models that contain more context to help them recognize the entity more accurately.

        /// <summary>
        /// Unique Id for this word
        /// </summary>
        string Id { get; set; }

        /// <summary>
        /// Primary Classification of the word
        /// </summary>
        string PrimaryClassification { get; set; }

        /// <summary>
        /// Additional classifications for the word
        /// </summary>
        List<string> Classifications { get; }

        /// <summary>
        /// The actual word text in upper case
        /// </summary>
        string UpperCaseText { get; set; }

        /// <summary>
        /// The actual word text
        /// </summary>
        string Text { get; set; }

        /// <summary>
        /// A list of JSON-formatted rules describing when this word applies
        /// </summary>
        List<string> Rules { get; }

        /// <summary>
        /// This is a list of domains the word applies to. If empty the word applies to all domains.
        /// </summary>
        List<string> Domain { get; }

        /// <summary>
        /// A list of json formatted context blocks.
        /// </summary>
        /// <remarks>This could be the full tweet or sentence. In the case of an article it could be a paragraph, a chapter. It could contain a contact, a character, a place, a thing mentioned in the grammar block.  Basically anything that could be used to more easily process the grammar block.</remarks>
        Dictionary<string, string> Context { get; }

Grammar Block

Here are the initial properties for a GrammarBlock. When we get into training you will see in more detail how each one is used, but this gives you an idea of what is necessary. DAIN and DIANA use extended models that contain more context to help them recognize the entity more accurately.

        /// <summary>
        /// Unique id for this grammar block
        /// </summary>
        string Id { get; set; }

        /// <summary>
        /// The text making up the grammar block. The format depends on the TextMode.
        /// </summary>
        /// <remarks>
        /// <remarks>
        /// For TextMode of DainEx, the following describes some of the basics:
        ///
        /// [*] means any characters so "The [*] car" would accept The Red Car, The Green Car, The Super Fast Car.
        /// [a,b,c] means that it will accept any one of the enclosed values so for "The [a,b,c] list" it would accept "The a list", "the b list", "the c list"
        ///
        /// </remarks>
        string Text { get; set; }

        /// <summary>
        /// How the text field gets treated, e.g. STARTS_WITH, CONTAINS, REGEX, DAINEX
        /// </summary>
        string TextMode { get; set; }

        /// <summary>
        /// A list of JSON-formatted rules that provide conditions and possible reactions.
        /// </summary>
        /// <remarks>As the conditions and reactions are both JSON, the grammar block processors can take what they support and leave the rest.</remarks>
        List<string> Rules { get;  }

        /// <summary>
        /// This is a list of domains the grammar block applies to. If empty, the grammar block applies to all domains.
        /// </summary>
        List<string> Domain { get; }

        /// <summary>
        /// A list of json formatted context blocks.
        /// </summary>
        /// <remarks>This could be the full tweet or sentence. In the case of an article it could be a paragraph, a chapter.
        /// It could contain a contact, a character, a place, a thing mentioned in the grammar block.  Basically anything that could be used to more easily process the grammar block.</remarks>
        Dictionary<string, string> Context { get; }

        /// <summary>
        /// List of words that make up the grammar block. Could be one or more words.
        /// </summary>
        List<IWord> ContainedWords { get; }

        /// <summary>
        /// Tags that represent an IS relationship, e.g. "Coveo is a company."
        /// </summary>
        List<string> IsTags { get; }

        /// <summary>
        /// Tags that represent a HAS relationship, e.g. "Tim has brown eyes."
        /// </summary>
        List<string> HasTags { get; }

Next Steps

The model has 2 purposes:
  1. The first is to provide a mechanism to train. The model may therefore have temporary properties, such as context, that let us hold supporting information with it until we establish rules and other details about it.
  2. Once we know the rules, we can remove the context to slim the model. Other properties will remain filled and will often grow fuller; for example, as we encounter words and grammar blocks in other contexts, more rules can be established.

Optimization: As your solution grows, you may want to separate Word and GrammarBlock into, for example, ActualWord and TrainingWord. The ActualWord would have the basic things you need to identify the word, while the TrainingWord contains the extra context we use to establish the rules. This keeps your live model slim, which reduces storage size and improves the speed at which it can determine results.
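
One possible shape for that split, as a sketch only (the property split shown here is an assumption; keep whatever your processor actually needs on the live side):

        using System.Collections.Generic;

        // The live model: only what the processor needs to recognize a word and apply rules.
        public class ActualWord
        {
            public string Id { get; set; }
            public string Text { get; set; }
            public string UpperCaseText { get; set; }
            public string PrimaryClassification { get; set; }
            public List<string> Rules { get; } = new List<string>();
        }

        // The training model: adds the heavier context used only while establishing rules.
        public class TrainingWord : ActualWord
        {
            public List<string> Domain { get; } = new List<string>();
            public Dictionary<string, string> Context { get; } = new Dictionary<string, string>();
        }

Training runs work with TrainingWord; once the rules are established you persist only the ActualWord portion so the live model stays small.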

There are a few things you will need to do for your model:
  1. You need to create the mechanism to store your model somewhere, whether that is the file system, SQL Server, a NoSQL database, or one or more JSON files (a simple JSON file option is sketched after this list). We will provide a couple of options with the ByoNlpQuickStart, and then you can write your own for more specific purposes or adapt them as you adjust your model to meet your needs.
  2. You will need to build mechanisms to train your model. In ByoNlpQuickStart we will offer a couple of training mechanisms as samples. You will be able to collect sentences from RSS/Atom feeds for blogs, use your Twitter feed to grab tweets, and type in sentences for it to process.
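
For the first item, a minimal sketch of the JSON file option might look like the following. This assumes System.Text.Json and a simple model class; it is not the ByoNlpQuickStart implementation:

        using System.Collections.Generic;
        using System.IO;
        using System.Text.Json;

        // Stores a model collection (words or grammar blocks) as a single JSON file.
        public static class JsonModelStore
        {
            public static void Save<T>(string path, List<T> items) =>
                File.WriteAllText(path, JsonSerializer.Serialize(items));

            public static List<T> Load<T>(string path) =>
                File.Exists(path)
                    ? JsonSerializer.Deserialize<List<T>>(File.ReadAllText(path)) ?? new List<T>()
                    : new List<T>();
        }

A single file is fine for a POC; as the model grows, the same Save/Load shape can sit in front of SQL Server or a NoSQL store instead.
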
Determining your models early, at least the major pieces, is important because it often takes time and a lot of work to fill them and clean any holes in the data. For a simple POC you can build your own models, but for a more advanced project we recommend you connect with a Data Scientist who can help you determine the right features to add and the right models to build. One extra column/feature multiplied by millions of rows can really grow your data set and cause performance concerns you may not need. However, missing a key column/feature can mean your model is biased and will not provide you with the right answers.

If you need help, let me know and I can put you in touch with someone who will work with you to build the right model.

DAIN and DIANA actually have multiple models and a Recognition Provider that determines which model to use based on a set of rules (a rough sketch of this idea follows the list below). For example, if the text you are processing comes from Twitter, there are 2 Twitter-specific models that we use:

  1. Twitter Followers: This will contain the handle, for example @sitecorechris, and additional information to tie it to a company or contact record in our CRM. This comes in handy because you can gather more context to give a more informed response.
  2. Hash Tags: This will contain #RobotLivesMatter and additional information about that hash tag and what it means. It could also collect other tweets for that hash tag to see if something is going on today that you should be aware of before responding.
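
As mentioned above, a rough sketch of rule-based model selection might look like this. The method and model names are illustrative only; the actual Recognition Provider in DAIN/DIANA is more involved:

        using System.Collections.Generic;

        public static class RecognitionProviderSketch
        {
            // Picks which model sets to consult based on where the text came from
            // and on simple patterns found in it.
            public static IEnumerable<string> SelectModels(string source, string text)
            {
                var models = new List<string> { "General" };
                if (source == "Twitter")
                {
                    if (text.Contains("@")) models.Add("TwitterFollowers");
                    if (text.Contains("#")) models.Add("HashTags");
                }
                return models;
            }
        }
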
Currently DAIN/DIANA have upwards of 50 models that they use depending on the source, patterns found in the text, and additional context information. We will get into that further in the blog series.


We will skip over reading and writing the model to storage, as that is the boring stuff, and continue on to the initial implementation of INlpProvider for your ByoNlpProvider in the next few posts. We will then follow that up with the mechanisms to train your model(s).

If you are interested in learning more about building your own, reach out and let's discuss. We have created a ByoNlpQuickStart that can help you get started. I will be providing that QuickStart and others to SitecoreDain subscribers soon. If you cannot wait, email me at chris.williams@readwatchcreate.com and, as a subscriber, I can release an early version to get you started now.

