Wednesday, November 13, 2019

Accord.net Machine Learning

Machine learning has been out of reach of the common developer for much of its early life.  While R has stood out as "the" statistics language, it does not plug easily into the more mainstream languages.  Fortunately for us, in the last few years a library called accord-framework.net has grown up to fill this gap.

This framework is written in C# and gives the average .NET developer access to a large number of machine learning algorithms that require very little statistical knowledge to actually use.  I add the statistical qualifier because it does take a decent amount of basic C# experience to work around some of the rough edges, particularly in validation, that the library still has.

The website also includes an impressive amount of documentation with code examples.  Unfortunately, many of these examples are aging and have broken pieces in them due to changes in the software, but they are usually enough to get someone up and running after a little playing around.

Concepts and Implementation

The concept is fairly simple.  The machine learning algorithms in this library take a two-dimensional array of numbers (int or double), along with a one-dimensional array holding the correct outputs.  The algorithm then trains itself on these numbers.

After training, you send it another set of numbers in the same format, and this time it will give you back what it thinks the outputs will be.

Your process will look something like this (a rough sketch in code follows the list):
- Load a DataTable with your data.
- Codify the DataTable into integer values using the Codification library.
- Extract a two-dimensional integer array for all the columns you want to use as inputs, using a combination of the DataTable and the newly created code mapping object.
- Extract a one-dimensional integer array the same way, this time for the single column holding the values you are trying to guess.  It is important that the values in this array correspond by index to the rows of the input array; if both are extracted from the same DataTable, this happens naturally.
- Pass these two arrays into the desired algorithm to train it.  Some algorithms will require multiple training cycles to tune them.
- Once you have a trained algorithm object, you can then pass it another two-dimensional array of integers, and it will use those values to guess a one-dimensional array of output integers.  One gotcha here is that many of the algorithms can't handle input values they have not been trained on, so you can't throw just anything at them.
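
Here is a minimal sketch of that whole flow, using a made-up weather table and a Naive Bayes classifier.  The column names and data are invented for illustration, and any of the discrete classifiers follows the same learn-then-decide pattern:

    using System.Data;
    using Accord.MachineLearning.Bayes;
    using Accord.Math;
    using Accord.Statistics.Filters;

    // Hypothetical sample table; columns and values are made up.
    DataTable data = new DataTable("Sample");
    data.Columns.Add("Outlook", typeof(string));
    data.Columns.Add("Wind", typeof(string));
    data.Columns.Add("Play", typeof(string));
    data.Rows.Add("Sunny", "Weak", "No");
    data.Rows.Add("Sunny", "Strong", "No");
    data.Rows.Add("Overcast", "Weak", "Yes");
    data.Rows.Add("Rain", "Weak", "Yes");

    // Build the codebook and codify the table into integer symbols.
    var codebook = new Codification(data);
    DataTable symbols = codebook.Apply(data);

    // Two-dimensional input array and one-dimensional output array,
    // pulled from the same table so the rows line up by index.
    int[][] inputs = symbols.ToJagged<int>("Outlook", "Wind");
    int[] outputs = symbols.ToArray<int>("Play");

    // Train the algorithm on the encoded arrays.
    var learner = new NaiveBayesLearning();
    NaiveBayes classifier = learner.Learn(inputs, outputs);

    // Encode a new row and ask for a guess.  The values must be
    // ones the codebook has already seen during training.
    int[] question = codebook.Transform(new[,]
    {
        { "Outlook", "Rain" },
        { "Wind", "Strong" }
    });
    int answer = classifier.Decide(question);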

Because it works only with integers, any string values you want to use as input or output must first be converted to integers with no gaps in the numbering.  You can do this on your own, but for convenience they provide the special Codification library, which handles converting standard tables of data into encoded integers and back.
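
For example, with the codebook built above, each distinct string in a column maps to a consecutive integer code starting at zero.  The exact codes depend on the order in which values appear in the data, so the ones shown here are illustrative:

    int sunny    = codebook.Transform("Outlook", "Sunny");    // e.g. 0
    int overcast = codebook.Transform("Outlook", "Overcast"); // e.g. 1
    int rain     = codebook.Transform("Outlook", "Rain");     // e.g. 2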

Gotchas

There are multiple bugs and missing features in this Codification library, and working around them is one of the biggest challenges when working with these algorithms.  However, despite these issues, I have still chosen to use the conversion library rather than write my own.

I have discovered that its biggest shortcoming is that it is not capable of handling NULL values in the data.  So first you have to loop through every single value in your DataTable and remove or replace all the NULLs.  From a speed perspective, this flaw alone probably means rolling your own would make for faster code; but for the majority of developers out there, it is likely not worth the additional time.
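
A minimal scrub pass might look like the following, assuming string-typed columns.  The "MISSING" placeholder is an arbitrary choice of mine, not anything the library prescribes:

    // Replace every DBNull in the table with a placeholder string
    // before handing the table to Codification.
    foreach (DataRow row in data.Rows)
    {
        foreach (DataColumn col in data.Columns)
        {
            if (row.IsNull(col))
                row[col] = "MISSING";
        }
    }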

I have read that there is a per-column default value setting inside the library that is supposed to let it deal with NULLs.  For some reason that default either does not work or is not initialized on its own.

The next annoying issue this Codification library has is really more of a versioning problem.  It looks like over time, rather than particular issues being fixed, new methods get created to handle the new cases.  So you end up with multiple methods that run off different logic when encoding values.  Specifically, there seems to be a big difference in Codification between the Transform method and the Apply method.  Transform attempts to encode exactly the columns you request, much like the explicit constructor overload that accepts a list of columns.  Apply, on the other hand, logically processes a DataTable, detecting which columns need to be converted and which do not.
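
Roughly, based on the overloads I have used (this is my reading of the split, not official documentation):

    // Apply processes the whole DataTable on its own, deciding
    // which columns need encoding, and returns an encoded copy:
    DataTable symbols = codebook.Apply(data);

    // Transform encodes exactly what you ask for, e.g. a single
    // value from a single named column:
    int code = codebook.Transform("Outlook", "Sunny");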

Thoughts and Concepts

Most of my work in this area has been with the various classification algorithms in the accord.net library.  These all seem to share the limitation of not being able to accept a continuous (un-encoded int) value type as the output column they are supposed to be guessing.  The solution to this particular issue is probably to switch to using a regression algorithm.
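
As a sketch of that switch, the library's linear regression trainer accepts continuous doubles directly, so no codification of the output column is needed.  The numbers here are invented for illustration:

    using Accord.Statistics.Models.Regression.Linear;

    // Invented continuous data: two input features per row.
    double[][] inputs =
    {
        new double[] { 1, 1 },
        new double[] { 2, 1 },
        new double[] { 3, 2 },
        new double[] { 4, 3 }
    };
    double[] outputs = { 2, 3, 5, 7 };

    // Fit a multiple linear regression with ordinary least squares.
    var ols = new OrdinaryLeastSquares();
    MultipleLinearRegression regression = ols.Learn(inputs, outputs);

    // Predict a continuous value for a new input row.
    double predicted = regression.Transform(new double[] { 5, 3 });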

Something else that might not immediately occur to the average developer is that the output is not going to have any human-readable meaning.  Because the system works exclusively with integer arrays, the output will just be an array of numbers.  These output numbers must then be passed through the Codification library a second time, this time in reverse, to get back to their human-readable versions.
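
In code, that reverse trip is a single call on the same codebook used for encoding, shown here with the hypothetical "Play" column and the answer from the earlier sketch:

    // Translate the classifier's integer answer back into the
    // original string label, e.g. "Yes" or "No".
    string label = codebook.Revert("Play", answer);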