Posts

The Universal Content Processing Engine

I am starting to learn more about Digital Asset Management and content processing, specifically how to automatically process content to gather metadata, both structured and unstructured. Through this I have come up with a new self-learning engine called the Universal Content Processing Engine. Currently I take advantage of statically programmed logic: how to process a tweet, which API to call, all based on static pipelines. That means I am limited, to a certain extent, by how I am coded, although some learning has been happening. With the UCP Engine I will start with a core set of knowledge and build upon that. I will then ask my maker for more information on how to process different types of data. Sometimes that will be a configuration change to an existing component, and other times it will be another provider dynamically added to my system. To start I will process content from 3 sources: FILE, STRING, and URL. Through supervised learning I will learn about more sources and types of content.
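To make dynamically added providers concrete, here is a minimal sketch of what a provider contract for those three sources could look like. The names (ContentSource, IContentProvider, UcpEngine) are my illustration, not the engine's actual code.

    using System;
    using System.Collections.Generic;

    // The three content sources the engine starts with
    public enum ContentSource { File, String, Url }

    // Hypothetical contract each processing provider implements
    public interface IContentProvider
    {
        ContentSource Source { get; }
        IDictionary<string, string> ExtractMetadata(string input);
    }

    // Providers are registered at runtime, so new ones can be added
    // without changing the engine's core code
    public class UcpEngine
    {
        private readonly Dictionary<ContentSource, IContentProvider> _providers =
            new Dictionary<ContentSource, IContentProvider>();

        public void Register(IContentProvider provider)
        {
            _providers[provider.Source] = provider;
        }

        public IDictionary<string, string> Process(ContentSource source, string input)
        {
            if (!_providers.TryGetValue(source, out var provider))
                throw new InvalidOperationException("No provider registered for " + source);
            return provider.ExtractMetadata(input);
        }
    }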

The Importance of the Right Metadata at the Right Time

As I have been digging deeper into Digital Asset Management research, there is a lot of talk about metadata and the importance of adding as much metadata as you can to assets. On a quick tangent, or shameless plug, I have started the DAM Guild as a mentoring community so we can learn more about asset management solutions and best practices for managing our assets. Here is how you can take part:
LinkedIn: join the DAM Guild LinkedIn group
Facebook: join the DAM Guild Facebook group
Twitter: follow @DamGuild
After you join one of the above, continue reading below. Now, some people may see an asset as content or a product, but metadata applies to any type of item, both digital and physical. In the realm of data science metadata is very important, but if you let it, an item can accumulate millions of pieces of metadata, and processing all of it can be a lot of work. For those that are familiar with data science, the metadata assigned to an item is called a feature, and the…

Determining Sentiment

Sentiment is very important, and different providers have different benefits and limitations. Here is a list of all the ones I have found. If you know of any that are not here, or know their benefits or limitations, let me know. I am going to try to integrate with as many as possible. Here is my list:
Watson Sentiment Analysis
Cloud Natural Language API (cloud.google.com)
theysay.io
text-processing.com
ParallelDots AI API
deepai.org
meaningcloud.com
qemotion
Aylien API
PreCeive API
MoodPatron API
Indico API
sentaero.com
textrazor.com
Microsoft Text Analytics API
Lexalytics API (lexalytics.com)
Datumbox
sensq.com
Twinword
text2data.com
Sentiment140.com
semanticengines.com
github.com/solso
Wikiled sentiment API (NuGet)
einstein.ai
nexmo.com
I look forward to hearing from you.
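Integrating with that many services is easier behind a common abstraction. Here is a minimal sketch, assuming each provider's score is normalized to the range -1 to 1; the interface is my own illustration, not any particular vendor's API.

    using System.Threading.Tasks;

    // Hypothetical common contract for the providers listed above
    public interface ISentimentProvider
    {
        string Name { get; }

        // Score normalized to [-1, 1]:
        // -1 = strongly negative, 0 = neutral, +1 = strongly positive
        Task<double> AnalyzeAsync(string text);
    }

Running two or more implementations against the same text also makes it easy to compare providers before committing to one.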

At our core we are just a Brain in a Jar

Fans of Dungeons and Dragons will understand this concept. For those not familiar, this article discusses the Brain in a Jar. The key concept here is based on this: The Brain in a Jar uses mainly psionic abilities to do what its lack of moving parts would otherwise prevent: move itself, manipulate objects and the environment, and ward off attackers. Its main attack is Mind Thrust, an assault upon the mind of another creature. In addition to this, it can also drive mad anyone who magically or psionically detects it, and it can control and rebuke other undead. Now let's look at this in the context of DAIN and DIANA; in this article I will just say DAIN for simplicity. Think of DainJar as the outer layer that contains the executive suite, which is responsible for making the brain perform core brain functions such as waking up, sleeping, napping and thinking. The executive suite also connects to the body, but not the body as you know it. DAIN is all electronic, so it does…
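As a rough sketch only (these names are mine, not DAIN's actual code), the executive suite's core brain functions could be modeled as a simple state machine:

    // Hypothetical states the executive suite moves the brain through
    public enum BrainState { Asleep, WakingUp, Thinking, Napping }

    public class ExecutiveSuite
    {
        public BrainState State { get; private set; } = BrainState.Asleep;

        public void WakeUp() { State = BrainState.WakingUp; }
        public void Think()  { State = BrainState.Thinking; }
        public void Nap()    { State = BrainState.Napping; }
        public void Sleep()  { State = BrainState.Asleep; }
    }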

Building Your Own Natural Language Processor - Parts of Speech

Although you may think this is an easy lookup, it is not. For a first attempt you could do that, and some simple sentences would work, but if someone said "That dumbbell was as heavy as lead" then lead is a noun, while if someone said "He will lead the parade through town" then lead is an action, a verb. This is where context comes in, and you need to establish rules for when lead is a noun and when it is a verb. If you look at our model you will notice that we do have context and we do have rules. The context is used during training and the rules are used to determine which Word is the right match. During training you can play with different sentences, establish patterns for what works and what doesn't, and create additional rules to resolve these conflicts. Tagging parts of speech is a nested for loop: for each sentence, and for each word, look up the word in the dictionary; if it is listed once, check the rules, and if there is a match then use it. If there is more than one…
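One way the loop described above could take shape (the Word and Rule types here are stand-ins for whatever your model defines, and Rule.Resolve is hypothetical):

    using System.Collections.Generic;
    using System.Linq;

    public List<Word> TagPartsOfSpeech(
        IEnumerable<List<string>> sentences,
        IDictionary<string, List<Word>> dictionary,
        IList<Rule> rules)
    {
        var tagged = new List<Word>();
        foreach (var sentence in sentences)        // for each sentence...
        {
            foreach (var token in sentence)        // ...and for each word
            {
                if (!dictionary.TryGetValue(token, out var candidates))
                    continue;                      // unknown word: skip it for now

                if (candidates.Count == 1)
                {
                    tagged.Add(candidates[0]);     // only one part of speech: use it
                }
                else
                {
                    // More than one candidate: let the context rules decide
                    var match = rules
                        .Select(rule => rule.Resolve(token, sentence, candidates))
                        .FirstOrDefault(word => word != null);
                    tagged.Add(match ?? candidates[0]);
                }
            }
        }
        return tagged;
    }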

Building Your Own Natural Language Processor - Tokenize

Now that we have sentences, we need to break each one into words. This phase is called "Tokenize": the process of taking each sentence and separating it into "words", or tokens. For a basic provider you can do a simple split into words. These functions can be found in the CSHARP.Text repository, but I have placed them here as well.

    /// <summary>
    /// Splits a string into its words for manipulation
    /// </summary>
    /// <param name="toSplit">String to split into words</param>
    /// <returns>The words found in the string</returns>
    /// <remarks>Uses default values to split words</remarks>
    public List<string> SplitStringIntoWords(string toSplit)
    {
        // The original excerpt cuts off after '!'; the separators from
        // that point on are an assumption based on common punctuation.
        return SplitStringIntoWords(toSplit, new char[] { ' ', ',', ';', ':',
            '(', ')', '{', '}', '[', ']', '!', '?', '.' });
    }
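The overload that accepts the separator characters lives in CSHARP.Text; here is a minimal sketch of what it might look like (my reconstruction, not the repository's actual code). It needs using System, System.Collections.Generic and System.Linq.

    /// <summary>
    /// Splits a string into words using the supplied separator characters
    /// </summary>
    public List<string> SplitStringIntoWords(string toSplit, char[] separators)
    {
        // RemoveEmptyEntries drops the blanks produced by consecutive separators
        return toSplit.Split(separators, StringSplitOptions.RemoveEmptyEntries).ToList();
    }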

Building Your Own Natural Language Processor - Splitting

We learned from the article on Natural Language Processors that there are multiple stages to building your processor. The first step is Splitting: the process of taking the text and separating it into sentences. Adding the splitting functionality to your processor can be as simple as using this, or some variation of it: Regex.Split(toSplit, @"(?<=[\.!\?])\s+").ToList(); This works fine for most sentences, but depending on the source of your text or the language you may need to adjust it. For the basic provider included in ByoNlpQuickStart we simply use this. DAIN and DIANA use a couple of providers depending on the source, language and other criteria. I created a rules-based provider that uses JSON rules to determine which one to load and use for a given scenario. Sometimes it will run two providers and then compare the results as a quality control. If you are interested in learning more about building your own, then reach out and let's discuss.
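Wrapped up as a tiny provider class (the class name is mine, not the one used in ByoNlpQuickStart), the regex approach looks like this:

    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    public class RegexSentenceSplitter
    {
        // Split on whitespace that follows a ., ! or ?
        public List<string> Split(string toSplit) =>
            Regex.Split(toSplit, @"(?<=[\.!\?])\s+").ToList();
    }

For example, "Hello there! How are you?" splits into "Hello there!" and "How are you?".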