SpellCheck and other Cleaning Techniques

Just a reminder as we head into the holiday season: we will not be publishing blog posts for the next 2 weeks, as you are probably going to be too busy anyway. Have a Merry Christmas and a Happy New Year, and we will be back every Monday in 2019.

If you work a lot with data, you have probably heard a lot about data cleansing. There are various ways to perform it. For CRM data, there are ways to split full names into first, middle and last names, and ways to validate postal codes. The same applies to AI and ML: you need to help the provider by giving it a good data set.

Before doing Natural Language Processing there are 3 general things you can do to improve the success of your processing:
  1. Do language detection. This will help because if the text is in French, then all your English models for NLP will fail.
  2. Run SpellCheck on your text. Although this seems simple, it is often missed when doing NLP. If you use a very basic dictionary, it will flag any word that is not a common word in that language.

    Use the spelling errors to help determine whether a topic-specific model should be used. For example, if SpellCheck returned Sitecore as a spelling error, then you know the article or tweet is about Sitecore and can use that model for more success.

    Sometimes spelling errors are on purpose. For example, Mark Stiles (a Cognitive Services guru) has a Twitter handle of @maaakstiles, so people tweeting to him may stretch other words in a similar way, as in "aaaaiiii is awesome". There is also the running joke about things being "Legen ... wait for it ... dary". A spell check will find these, so you can flag them and then run your NLP on the corrected text while keeping the context.
  3. Run a word counter or popular word finder. There may be words in the dictionary that are still special. The most popular words could help you determine context and, as mentioned in the previous article, context will help improve your NLP.
With these 3 things you can dramatically improve the success of your Natural Language Processor.
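
To make steps 2 and 3 a bit more concrete, here is a minimal Python sketch. It is only an illustration under some assumptions: the tiny BASIC_DICTIONARY, the TOPIC_HINTS mapping, the STOP_WORDS set and the model name "sitecore-model" are all hypothetical names made up for this example, and a real solution would load a full dictionary and likely use a proper spell check library. Language detection (step 1) is left out because it normally needs a dedicated library or service.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny dictionary of common words (assumption).
BASIC_DICTIONARY = {"a", "ai", "and", "awesome", "is", "the"}
# Hypothetical mapping from unknown words to topic-specific models (assumption).
TOPIC_HINTS = {"sitecore": "sitecore-model"}
# Hypothetical stop words to ignore when counting popular words (assumption).
STOP_WORDS = {"a", "and", "is", "the"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collapse_repeats(word):
    """Collapse runs of the same letter, e.g. 'aaaaiiii' becomes 'ai'."""
    return re.sub(r"(.)\1+", r"\1", word)


def spell_check(text, dictionary=BASIC_DICTIONARY):
    """Return (corrected_words, flagged), keeping the original of each flagged word."""
    corrected, flagged = [], []
    for word in tokenize(text):
        if word in dictionary:
            corrected.append(word)
            continue
        collapsed = collapse_repeats(word)
        if collapsed in dictionary:
            # Probably stretched on purpose: correct it but keep the context.
            corrected.append(collapsed)
            flagged.append((word, collapsed))
        else:
            # Unknown word: leave it alone, it may point to a topic.
            corrected.append(word)
            flagged.append((word, None))
    return corrected, flagged


def topic_hints(flagged, hints=TOPIC_HINTS):
    """Suggest topic-specific models based on uncorrected unknown words."""
    return sorted({hints[w] for w, fix in flagged if fix is None and w in hints})


def popular_words(words, stop_words=STOP_WORDS, top_n=3):
    """Count the most frequent words, skipping stop words."""
    return Counter(w for w in words if w not in stop_words).most_common(top_n)


corrected, flagged = spell_check("aaaaiiii is awesome and Sitecore is awesome")
print(flagged)                      # [('aaaaiiii', 'ai'), ('sitecore', None)]
print(topic_hints(flagged))         # ['sitecore-model']
print(popular_words(corrected, top_n=1))  # [('awesome', 2)]
```

Only words that fail the dictionary lookup get their repeated letters collapsed, so normal double letters in known words are left alone. The flagged list keeps the original spelling next to the correction, which is one simple way to run NLP on the corrected text without losing the context.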



If you are interested in learning more, reach out and let's discuss. I will be providing a basic QuickStart to SitecoreDain subscribers soon. If you cannot wait, email me at chris.williams@readwatchcreate.com and, if you are a subscriber, I can release an early version to get you started now.

