Language Detection

As DIANA is from Canada, she needs to be bilingual, so it is important that the language is detected whenever she is asked a question or told something. If the text comes from Twitter, the tweet already carries a language tag; if it comes from another source, say Slack or text typed at a console, the language is not provided. This is where language detection providers become so important.

POC #1: Detection Via LanguageDetection NuGet Package

Initially, DAIN and DIANA used the LanguageDetection NuGet package:

Install-Package LanguageDetection

This package is described as "Detect the language of a text using a naive Bayesian filter with generated language profiles from Wikipedia abstract xml, 99% over precision for 53 languages. Original author: Nakatani Shuyo."
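
For reference, here is a minimal sketch of how the package is typically used, based on its published samples; the exact class and method names (LanguageDetector, AddAllLanguages, Detect) should be verified against the version you install:

using LanguageDetection;

var detector = new LanguageDetector();
detector.AddAllLanguages();                                  // load the generated language profiles

string language = detector.Detect("Bonjour tout le monde");  // expected to return "fr"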

POC #2: Cognitive Services

After some growth, and after chatting with Mark Stiles, a second provider was created for Cognitive Services, based on the article Quickstart: Identify language from text with the Translator Text REST API (C#).
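
At its core, that provider is a call to the Translator Text v3 Detect endpoint. The following is a condensed sketch in the spirit of that quickstart, not the actual DAIN/DIANA code; the subscription key is a placeholder, the response is returned as raw JSON rather than deserialized, and depending on your resource you may also need an Ocp-Apim-Subscription-Region header:

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class TranslatorLanguageDetection
{
    private const string Endpoint = "https://api.cognitive.microsofttranslator.com/detect?api-version=3.0";
    private const string SubscriptionKey = "<your-subscription-key>";   // placeholder

    public static async Task<string> DetectAsync(string text)
    {
        using (var client = new HttpClient())
        using (var request = new HttpRequestMessage(HttpMethod.Post, Endpoint))
        {
            // The Detect endpoint takes a JSON array of objects with a Text property.
            string body = "[{ \"Text\": \"" + text.Replace("\"", "\\\"") + "\" }]";
            request.Content = new StringContent(body, Encoding.UTF8, "application/json");
            request.Headers.Add("Ocp-Apim-Subscription-Key", SubscriptionKey);

            HttpResponseMessage response = await client.SendAsync(request);

            // The response is a JSON array with a "language" code and a confidence "score" per input.
            return await response.Content.ReadAsStringAsync();
        }
    }
}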

There are many other ways to detect language. DAIN and DIANA use a provider model, so you have the option of using different providers. You can also use multiple providers at once, weighting each by its reliability, and then apply those weights to determine which language is most probable. This is especially important when you are given a single statement, as some languages are too close to tell apart from one sentence.
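
As an illustration only (the interface and class below are hypothetical, not the actual DAIN and DIANA types), a weighted provider model could look something like this:

using System.Collections.Generic;
using System.Linq;

// Hypothetical abstraction: each provider reports a language and carries a trust weight.
public interface ILanguageDetectionProvider
{
    double Weight { get; }               // how much we trust this provider
    string Detect(string text);          // returns an ISO code such as "en", "fr", "es"
}

public static class WeightedLanguageDetector
{
    // Ask every provider, accumulate its weight behind the language it returned,
    // and pick the language with the highest total weight.
    public static string Detect(string text, IEnumerable<ILanguageDetectionProvider> providers)
    {
        var scores = new Dictionary<string, double>();

        foreach (var provider in providers)
        {
            string language = provider.Detect(text);
            scores.TryGetValue(language, out double current);
            scores[language] = current + provider.Weight;
        }

        return scores.OrderByDescending(pair => pair.Value).First().Key;
    }
}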

For example, Portuguese and Spanish may be detected as the same language if all you have is a single response of "Si"; if you look at previous responses, however, the Spanish and Portuguese ones would be quite different. That is why the language provider would also be passed what we believe the previous language to be.

So that is the other improvement that is necessary: you need to look at previous statements. The problem is when someone changes languages mid-conversation. For example, if you were reading a speech given in Parliament, it is possible that one sentence or paragraph is in English and then the same paragraph is repeated in French. The same could be said if the politician is tweeting: they may tweet in English and then follow it up with the same text in French.

Language context is important. You need to use the context of prior statements, but you also need to understand that just because the previous statement is in one language does not mean the next one will be in the same language. Just some things to ponder.
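
Continuing the hypothetical sketch above, one way to use that context is a method alongside WeightedLanguageDetector.Detect that nudges, rather than forces, the scoring toward the previously detected language:

// Hypothetical: bias toward the previous language without locking the conversation into it,
// so a mid-conversation switch (an English paragraph followed by the same paragraph in French)
// can still win when the providers agree on the new language.
public static string DetectWithContext(
    string text,
    string previousLanguage,
    IEnumerable<ILanguageDetectionProvider> providers,
    double previousLanguageBonus = 0.25)
{
    var scores = new Dictionary<string, double>();

    foreach (var provider in providers)
    {
        string language = provider.Detect(text);
        scores.TryGetValue(language, out double current);
        scores[language] = current + provider.Weight;
    }

    if (previousLanguage != null && scores.ContainsKey(previousLanguage))
    {
        scores[previousLanguage] += previousLanguageBonus;   // small, tunable prior
    }

    return scores.OrderByDescending(pair => pair.Value).First().Key;
}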

Language becomes even more important when you decide how you will do Natural Language Processing (NLP). There are three approaches when dealing with language:

  1. Detect the language, translate, and then run Natural Language Processing. Although this is the simpler way to do it, and is actually the way DAIN and DIANA worked for the POC, you lose some of the nuances the native language contains. You may also suffer some performance loss, since you have to translate both the request and the response back.
  2. Detect the language, then run language-specific NLP and language-specific processing of the response. This is more complex, so it takes more work, but you gain a more fluid conversation. The other disadvantage is maintaining multiple NLP models, and as the volume of data grows it is multiplied by the number of languages.
  3. Detect the language, run language-specific NLP, and then use shared processing of the response.

In the case of DAIN and DIANA, we use approach 3 because our Reputation Engine results are stored in English; however, any links to resources carry a language flag, so that factor can be used to determine the best answer for the asker.
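
To make approach 3 concrete, here is a hypothetical outline (the types below are illustrative, not the actual DAIN and DIANA classes): the detected language selects a language-specific NLP step that produces a language-neutral intent, and a single shared engine then answers it, using the language only to prefer resources flagged for the asker:

using System.Collections.Generic;

// Hypothetical language-neutral result of the NLP step.
public class Intent
{
    public string Name { get; set; }
    public string Language { get; set; }
}

// Hypothetical: one implementation per supported language.
public interface INaturalLanguageProcessor
{
    Intent Parse(string text);
}

public class LanguageAwarePipeline
{
    private readonly IDictionary<string, INaturalLanguageProcessor> _processorsByLanguage;

    public LanguageAwarePipeline(IDictionary<string, INaturalLanguageProcessor> processorsByLanguage)
    {
        _processorsByLanguage = processorsByLanguage;
    }

    public string Handle(string text, string detectedLanguage)
    {
        // 1. Language-specific NLP, chosen from the detected language.
        Intent intent = _processorsByLanguage[detectedLanguage].Parse(text);

        // 2. Shared processing: one engine answers the language-neutral intent,
        //    using the language flag only to pick the best resource for the asker.
        return AnswerIntent(intent);
    }

    private string AnswerIntent(Intent intent)
    {
        // Placeholder for the shared engine (e.g. a reputation/answer lookup).
        return "Answer for '" + intent.Name + "' (asked in " + intent.Language + ")";
    }
}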

For your application, you may find approach 1 or 2 is better. The base DAIN does allow you to downgrade to shared NLP, and it also allows you to have language-specific processing rather than a shared one.

If you are interested in language detection and adding it to your applications, reach out and let's discuss. I will be providing a basic QuickStart to SitecoreDain subscribers soon. If you cannot wait, email me at chris.williams@readwatchcreate.com and, as a subscriber, I can release an early version to get you started now.
