Providers and Pipeline, Oh Why!!!

I have mentioned to a few people that DAIN and DIANA are built with providers and pipelines. Some providers are weighted while others run conditionally based on rules. The question I often get is: why not just choose one algorithm and stick with it for simplicity?

There are three reasons to have different providers: Bias, Context, and Cost/Limits.

Bias

In the case of Natural Language Processing, you have heard a lot in the news about bias in AI and ML. Some of that is due to data, which is the subject of another article. Other times the algorithm has been trained with data that is not specific to your domain, so it can produce biased results on your data. That is why it is so important to use multiple algorithms to keep each other in check. You can weight one algorithm higher so it is the most likely to produce the result, but cases where it differs greatly from the other algorithms are something to track and learn from. Why does SharpNLP give me x while Cognitive Services always gives me y? By examining these anomalies more closely, you can set up rules that help determine when each algorithm is beneficial.
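To make the weighting idea concrete, here is a minimal sketch in Python. It is not the DAIN or DIANA code. The names WeightedProvider and run_weighted, the provider labels, and the weights are all assumptions made up for illustration, and the providers are stubs that stand in for real NLP calls.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class WeightedProvider:
    """Hypothetical wrapper: a name, a weight, and a function that labels text."""
    name: str
    weight: float
    classify: Callable[[str], str]


def run_weighted(text: str, providers: List[WeightedProvider]) -> Tuple[str, Dict[str, str], bool]:
    """Return the weighted winner, every provider's answer, and a disagreement flag."""
    votes = {p.name: p.classify(text) for p in providers}

    # Add up the weight behind each distinct answer.
    totals: Dict[str, float] = defaultdict(float)
    for p in providers:
        totals[votes[p.name]] += p.weight

    winner = max(totals, key=totals.get)
    # Any difference between providers is worth logging and reviewing later.
    disagreement = len(set(votes.values())) > 1
    return winner, votes, disagreement


# Stub providers standing in for real services (assumed names and weights).
providers = [
    WeightedProvider("provider_a", 0.6, lambda text: "positive"),
    WeightedProvider("provider_b", 0.3, lambda text: "negative"),
    WeightedProvider("provider_c", 0.1, lambda text: "negative"),
]

winner, votes, disagreement = run_weighted("some text", providers)
```

The heaviest provider still decides the result here, but the disagreement flag and the per-provider votes give you the record you need to work out rules for when each algorithm should be trusted.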

I highly recommend reading this article on Ethical AI. Thanks to Eric Ramseur for sharing.

Context

The other reason to swap out your Natural Language Provider, or maybe just the model, is context. If you are working with data from a given source, say Twitter, then there are certain words, terms, short forms and other things that are specific to Twitter. A provider for that source may do some pre-processing and then hand off to an existing provider. Depending on the topic, you may also have specific domain knowledge that the provider or model needs to be aware of. For instance, an article you know is on Sitecore would have different terms than an article on astrophysics.
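The pre-process-then-delegate pattern could look something like the sketch below. Everything in it is an assumption for illustration: the short-form table, normalize_tweet, and make_twitter_provider are hypothetical names, and the base provider is any function that takes text.

```python
import re
from typing import Callable, Dict

# Tiny illustrative short-form table; a real one would be far larger.
SHORT_FORMS: Dict[str, str] = {"u": "you", "gr8": "great", "thx": "thanks"}


def normalize_tweet(text: str) -> str:
    """Strip URLs, drop the @ and # markers, and expand known short forms."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[@#](\w+)", r"\1", text)
    words = [SHORT_FORMS.get(word.lower(), word) for word in text.split()]
    return " ".join(words)


def make_twitter_provider(base: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap an existing provider so it always sees normalized text."""
    def provider(text: str) -> str:
        return base(normalize_tweet(text))
    return provider


cleaned = normalize_tweet("thx @sitecore gr8 post https://example.com/x")
```

Because the wrapper only changes the input, the same base provider can be reused unchanged for other sources, each with its own pre-processing step.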

Cost/Limits

This is always a factor in software development. Even though we want the ideal solution, we want to keep things frugal too. If you are doing NLP on a large volume of text and documents, using metered services can put you over your subscription limits or escalate per-call costs. This is where you would apply rules about using metered providers. In some cases you may simply set USE_METERED to false, but in other cases you may use rules based on previous providers' responses. For example, if you run 2 free providers and they do not provide answers, then you would use the paid one, and you may then use its results to improve the models for the free ones. In the case of DAIN we run Language Detection using the BasicLanguageDetectionProvider. If the result is under a given ProbablilityWeight, we run it through the CognitiveServicesLanguageDetectionProvider and then use that result to update the BasicLanguageDetectionProvider.
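The language detection rule described above can be sketched roughly as follows. This is not the DAIN implementation. The Detection type, the detect_language function, the threshold value of 0.8, and the feedback list are assumptions for illustration, and the two detectors are stubs standing in for the basic and metered providers.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Detection:
    language: str
    probability: float


Detector = Callable[[str], Detection]


def detect_language(
    text: str,
    basic: Detector,
    metered: Detector,
    threshold: float = 0.8,          # assumed cut-off, plays the role of the weight in the text
    use_metered: bool = True,        # mirrors the USE_METERED switch
    feedback: Optional[List[Tuple[str, str]]] = None,
) -> Tuple[Detection, str]:
    """Try the free provider first; fall back to the metered one only when needed."""
    result = basic(text)
    if result.probability >= threshold or not use_metered:
        return result, "basic"

    paid = metered(text)
    if feedback is not None:
        # Keep the paid answer so the free provider can be improved later.
        feedback.append((text, paid.language))
    return paid, "metered"


# Stub detectors (hypothetical): the basic one is unsure, the metered one is confident.
def basic_stub(text: str) -> Detection:
    return Detection("en", 0.5)


def metered_stub(text: str) -> Detection:
    return Detection("fr", 0.99)


training_queue: List[Tuple[str, str]] = []
detection, source = detect_language("bonjour", basic_stub, metered_stub, feedback=training_queue)
```

With this shape the metered call only happens when the free answer is weak, and turning use_metered off caps spending at the cost of accepting lower-confidence results.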

The whole point of creating providers and pipelines is that you have the flexibility to improve as you go and use the proper provider for the proper situation.

If you are interested in learning more, reach out and let's discuss. I will be providing a basic QuickStart to SitecoreDain subscribers soon. If you cannot wait, email me at chris.williams@readwatchcreate.com and, if you are a subscriber, I can release an early version to get you started now.
