Building Your Own Natural Language Processor - Parts of Speech

Although you may think tagging parts of speech is an easy dictionary lookup, it is not. A plain lookup works as a first attempt and will handle some simple sentences, but consider the word "lead". In "That dumbbell was as heavy as lead", lead is a noun. In "They lead the parade through town", lead is an action, a verb. This is where context comes in: you need to establish rules for when lead is a noun and when it is a verb.

If you look at our model, you will notice that we do have context and we do have rules. The context is used during training, and the rules are used to determine which Word is the right match. During training, you can play with different sentences, establish patterns for what works and what doesn't, and create additional rules to resolve the conflicts you find.

Tagging parts of speech is a nested for loop: for each sentence, and for each word in it, look up the word in the dictionary. If the word is listed once, check the rules and, if they match, use that entry. If it is listed more than once, add every entry whose rules match to the matching word list. If only one word ends up in that list, use it. If more than one matches, flag the word as an exception to be resolved. During training, the trainer can correct it by specifying another rule. During processing, the exception can be written to a log so the tagger can go on to the next word.
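To make the loop concrete, here is a minimal Python sketch. It is not the DAIN or DIANA code. The dictionary contents, the rule signature (a function that receives the words, the word index, and a candidate part of speech) and the names `tag` and `accept_all` are all assumptions made for illustration. Words with no matching entry are also flagged as exceptions, which is my own choice rather than something described above.

```python
# Hypothetical dictionary: word -> candidate parts of speech.
DICTIONARY = {
    "the": ["determiner"],
    "pipe": ["noun"],
    "is": ["verb"],
    "lead": ["noun", "verb"],
}


def accept_all(words, index, pos):
    """Placeholder rule (assumed signature) that accepts every candidate."""
    return True


def tag(sentences, dictionary, rules):
    tagged = []      # one list of (word, pos) pairs per sentence
    exceptions = []  # (sentence index, word index, word, matches) to resolve
    for s_idx, sentence in enumerate(sentences):
        words = sentence.lower().rstrip(".!?").split()
        result = []
        for w_idx, word in enumerate(words):
            candidates = dictionary.get(word, [])
            # Keep only the entries that every rule accepts.
            matches = [pos for pos in candidates
                       if all(rule(words, w_idx, pos) for rule in rules)]
            if len(matches) == 1:
                result.append((word, matches[0]))
            else:
                # Zero or several matches: flag it and move on.
                result.append((word, None))
                exceptions.append((s_idx, w_idx, word, matches))
        tagged.append(result)
    return tagged, exceptions


tagged, exceptions = tag(["The pipe is lead."], DICTIONARY, [accept_all])
print(tagged[0])   # [('the', 'determiner'), ('pipe', 'noun'), ('is', 'verb'), ('lead', None)]
print(exceptions)  # [(0, 3, 'lead', ['noun', 'verb'])]
```

With a rule that accepts everything, "lead" keeps both entries and lands in the exception list, which is exactly the case a trainer would resolve by adding a rule.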

For DAIN and DIANA, there are lots of rules. Some are tagged as requiring the rest of the sentence to be processed before the rule can be checked. Sometimes it takes 3 or 4 passes through the sentence to resolve all the rules and ensure every part of speech is tagged.
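The multi-pass idea can be sketched in Python as follows. This is only an illustration under my own assumptions, not how DAIN or DIANA implement it: the two toy rules, the helper `next_tag`, the function `resolve` and the `max_passes` limit are all hypothetical. Each rule returns a part of speech, or `None` when it cannot decide yet because a neighbouring word is still untagged.

```python
def next_tag(tags, i):
    """Tag of the following word, or None if it is missing or unresolved."""
    return tags[i + 1] if i + 1 < len(tags) else None


# Toy rules (hypothetical): each returns a tag, or None to defer.
def before_determiner_is_verb(words, tags, i):
    return "verb" if next_tag(tags, i) == "determiner" else None


def before_verb_is_noun(words, tags, i):
    return "noun" if next_tag(tags, i) == "verb" else None


def resolve(words, candidates, rules, max_passes=4):
    # Words with a single dictionary entry are tagged up front.
    tags = [c[0] if len(c) == 1 else None for c in candidates]
    passes_used = 0
    for _ in range(max_passes):
        passes_used += 1
        changed = False
        for i in range(len(words)):
            if tags[i] is not None:
                continue
            for rule in rules:
                decided = rule(words, tags, i)
                if decided in candidates[i]:
                    tags[i] = decided
                    changed = True
                    break
        # Stop when everything is tagged or a pass made no progress.
        if None not in tags or not changed:
            break
    return tags, passes_used


words = ["lead", "sinks", "the", "boat"]
candidates = [["noun", "verb"], ["noun", "verb"], ["determiner"], ["noun"]]
rules = [before_determiner_is_verb, before_verb_is_noun]
tags, passes = resolve(words, candidates, rules)
print(tags, passes)  # ['noun', 'verb', 'determiner', 'noun'] 2
```

On the first pass "lead" has to wait because "sinks" is still unresolved. Once "sinks" is tagged as a verb, the second pass can tag "lead" as a noun.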

Some rules are simple: if "lead" follows a noun or pronoun, it is a verb, and if it follows "as", it is a noun. Others are more complex and require a pattern of multiple parts of speech to make the determination.
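The two simple rules for "lead" could be written like this in Python. The function name `tag_lead` and its parameters are hypothetical, and a real rule set would need many more cases. When neither rule applies, the sketch returns `None` so the word can be deferred or flagged.

```python
def tag_lead(previous_word, previous_pos):
    """Decide the part of speech of "lead" from the word before it.

    previous_word: the preceding word, lower-cased.
    previous_pos:  the preceding word's part of speech, if already known.
    """
    if previous_word == "as":
        return "noun"   # "... as heavy as lead"
    if previous_pos in ("noun", "pronoun"):
        return "verb"   # "They lead the parade ..."
    return None         # no simple rule applies


print(tag_lead("as", None))          # noun
print(tag_lead("they", "pronoun"))   # verb
print(tag_lead("the", "determiner")) # None
```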

For the basic provider included in ByoNlpQuickStart we started with no rules in our model. The next version of this may include some of the basic rules processing we do in DAIN and DIANA.

If you are interested in learning more about building your own, reach out and let's discuss. We have created a ByoNlpQuickStart that can help you get started. I will be providing that QuickStart and others to SitecoreDain subscribers soon. If you cannot wait, email me at chris.williams@readwatchcreate.com and, if you are a subscriber, I can release an early version to get you started now.
