
AI – Pattern Recognition

In my last [post] I talked about Optical Character Recognition and how I got a solution working using machine learning, particularly the nearest neighbour algorithm. In this post I will briefly explain how the NN algorithm works, then introduce two more pattern recognition algorithms which I used to solve the OCR problem, and compare their strengths and weaknesses.

The live demo can be accessed [here].

In a future article, I will implement the same algorithms using ML.NET and see if I get better results than my own implementations. So stay tuned!

1. Nearest Neighbour

The algorithm is based on a distance metric, such as the Euclidean distance, which is used to find the closest match (nearest distance) among a set of vectors. Let's assume we have an array of integers and we want to compare it with other arrays of integers to determine which one is the most similar to our own. A loop would go through each of the arrays one by one and make a comparison using the simple formula below to obtain the Euclidean distance.
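The formula is the standard Euclidean distance between two n-element arrays:

d(x, y) = √( (x₁ − y₁)² + (x₂ − y₂)² + … + (xₙ − yₙ)² )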

At each iteration, every integer from array 1 is represented as x and every integer from array 2 is represented as y. In C# code, the formula would translate to something like [this]. By applying this formula to compare our array of integers against all the other arrays inside the data set, the array which produces the shortest distance is considered the most similar to ours. This is how the OCR problem in my previous [post] was solved: by comparing the array of bits produced by the scribbling on a canvas against all the other OCR images inside the CSV files.
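The linked C# implementation lives in the project repository; as a rough sketch of the same idea (illustrative names, not the repo's exact code), the calculation could look like this:

    // using System;
    // Computes the Euclidean distance between two equally sized integer arrays.
    static double EuclideanDistance(int[] x, int[] y)
    {
        double sum = 0;
        for (int i = 0; i < x.Length; i++)
        {
            int diff = x[i] - y[i];
            sum += diff * diff; // squared difference at each position
        }
        return Math.Sqrt(sum);
    }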

2. K-Nearest Neighbour (k-NN)

A very powerful way to optimise the accuracy of nearest neighbour is the k-NN algorithm: a non-parametric supervised learning method used for both classification and regression problems.

Rather than classifying by the single shortest distance, k-NN takes the closest K neighbours and forms a majority vote among them to determine which character occurs the most times (K being any number). Going back to the OCR problem, let's imagine that “3” is the number we are trying to predict using nearest neighbour, and we get a Euclidean distance of 0.2 on a character which is a “3” and 0.19 on a character which is an “8”. This can easily happen if the character is not written very clearly and thus gets mistaken for a number with similar pattern attributes.

Rather than choosing the single result with the shortest distance, k-NN takes the top four or five (K) results with the shortest distances. A vote can then be made to determine the most likely character from the majority. For the sake of our example, we could get the following for K=4:

{ 0.19 = “8”, 0.2 = “3”, 0.21 = “3”, 0.24 = “3” }

This means that although the result with the shortest Euclidean distance relative to our number is an 8, there are another 3 results which indicate it's a 3, and therefore the majority vote concludes that it is in fact a 3. This slightly improves the predictive performance over classical nearest neighbour. A C# implementation of the k-NN can be found [here].
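As a minimal sketch of the voting step (assuming the distances have already been computed into a list of (distance, label) pairs; this is illustrative, not the linked repo's code):

    // using System.Collections.Generic; using System.Linq;
    // Classifies by majority vote among the K nearest training samples.
    static char ClassifyKnn(List<(double Distance, char Label)> neighbours, int k)
    {
        return neighbours
            .OrderBy(n => n.Distance)           // nearest first
            .Take(k)                            // keep only the K closest
            .GroupBy(n => n.Label)              // group the votes by character
            .OrderByDescending(g => g.Count())  // biggest group wins
            .First().Key;
    }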

3. Multi-Layered Perceptron

The Multi-Layered Perceptron originates from the single perceptron node, which takes a set of input values as parameters and produces a binary output (usually 0 or 1). Each input is multiplied by a randomly initialised weight value (e.g. between -0.5 and +0.5). The weighted inputs are then summed together in a connecting node, and finally the output is produced by passing the total through an activation function such as the sigmoid (the classic perceptron uses a step function).
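As a hedged sketch (names are mine, not taken from any linked code), a single perceptron's forward pass could be written as:

    // using System;
    // Weighted sum of the inputs passed through the sigmoid activation.
    static double Perceptron(double[] inputs, double[] weights)
    {
        double total = 0;
        for (int i = 0; i < inputs.Length; i++)
            total += inputs[i] * weights[i];    // each input times its weight
        return 1.0 / (1.0 + Math.Exp(-total));  // sigmoid squashes to (0, 1)
    }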

Of course, the perceptron approach works very well for linearly separable problems, for example determining whether an object is a car or a truck (separating only 2 types), but our OCR problem is more complex than that. For this reason, multiple perceptrons can be combined to form a neural network which is able to identify 10 different characters, hence produce 10 different output types.

This approach works by connecting multiple middle layers (called hidden layers) in between, each with its own weight values, with values summed together from front to back (feed forward). However, a neural network is usually implemented with a back propagation function which goes backwards and corrects the weights at each iteration during the training phase in order to minimise the resulting error, thus drastically improving the prediction results. Back propagation is quite complex to understand, but many tutorials are available online which explain it in depth.

For our OCR solution, an MLP classifier example has been implemented in C# and can be found [here]. The following configuration has been applied (collected into a code sketch after the list).

  • Input layer (64 neurons): each OCR sample is represented as a sequence of 64 numeric values.

  • Output layer (10 neurons): the possible classifications, digits 0 – 9.

  • Hidden layers (3): by trial and error, 3 hidden layers turned out to give the most accurate results.

  • Hidden neurons (52): “The number of hidden neurons should be 2/3 of the input layer size, plus the size of the output layer”, Foram S. Panchal et al. (2014).

  • Momentum (0.4): the coefficient multiplied by the learning step which adds speed to the training of any multi-layer perceptron.

  • Learning step (0.04): the constant used by the algorithm for learning; 0.04 gave the best results while fine tuning, although for smaller datasets 0.06 also worked very well.

  • Least mean square error (0.01): the minimum error difference to reach in training precision before stopping.

  • Epochs (500): the number of times to repeat the back-propagation during training to get the weights adjusted correctly. The higher the value, the more precise the character recognition but the slower the training. NB: setting this value higher than 500 seems to make an insignificant difference in output accuracy for the data set we are working with.

  • Activation function: the sigmoid function has been used, as it is the most widely used in practice.
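Collected in one place, the settings above could be expressed as a simple C# object (a hypothetical shape for illustration, not the linked implementation's actual class):

    var config = new
    {
        InputNeurons = 64,           // one per value in the OCR sequence
        OutputNeurons = 10,          // digits 0 - 9
        HiddenLayers = 3,
        HiddenNeurons = 52,          // 2/3 * 64 + 10, per Panchal et al. (2014)
        Momentum = 0.4,
        LearningStep = 0.04,
        LeastMeanSquareError = 0.01,
        Epochs = 500
    };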

When passing the two-fold data through the perceptron, training can take quite long to complete since it needs to iterate many times for each given character. For this reason, filtering has been applied to limit the amount of training data for each represented digit from 0 – 9. However, it has been observed that while a smaller training set shortens the processing time, it also reduces the percentage accuracy of the results.

Results Comparison

Below are the accuracy results based on a 2-fold test, but you can also test the accuracy manually on the live demo page [here]. Which one gets it right most of the time for your handwriting?

Algorithm    Performance    % Accuracy
k-NN         2760 / 2810    98.22%
NN           2755 / 2810    98.04%
MLP          2727 / 2810    97.04%

Conclusion

The nearest neighbour and k-NN functions have shown by far the highest accuracy in the smallest amount of running time. However, such algorithms, being non-parametric, lack the ability to learn, and therefore require the training dataset to be available at all times. The multi-layer perceptron, on the other hand, is able to learn: over time it produces a neural network with adjusted weights, which it then uses to classify characters without needing the original training set.

Although this method takes a very long time to run compared to the others, the adjusted weight values can be saved to a file once training has completed, and the saved weights alone can then be used to run the OCR classification almost instantly. Theoretically, the MLP method also has the potential to achieve higher accuracy when fed a larger training dataset and with finer tuning of the other parameters, such as the epochs and learning step.

References

1. Multi-layer Perceptron Tutorial
http://blog.refu.co/?p=931

2. Multi-Layer Neural Networks with Sigmoid Function
https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f/

3. Review on Methods of Selecting Number of Hidden Nodes in Artificial Neural Network
http://www.ijcsmc.com/docs/papers/November2014/V3I11201499a19.pdf


Creating a hand written number recognition program using machine learning

Back when I was studying AI and Machine Learning, I was given an assignment where I had to create an Optical Character Recognition (OCR) program using the MNIST (http://yann.lecun.com/exdb/mnist/) data set. The requirement was to implement a couple of machine learning algorithms, such as Nearest Neighbour and Multi-Layered Perceptron (more on these later), and then run a two-fold test through them to train a model for classifying hand-written digits. The first half of the data set would be used to train the model and the second half to test the outcome prediction for the given characters, ultimately calculating the percentage accuracy of the results.
I wanted to take this a step further and create a GUI where a user could hand-write a digit and the program would try guessing the input by learning from the data set.

Check out the live demo [here].

Step 1: Examining Data

The training data which we will be working with is in CSV format; each row contains a set of 65 integer values, the first 64 being the data values (vectors) representing the character and the last being the actual identifier (a digit between 0 – 9). The data set can be found [here]. The scanned numbers have been converted from image bitmaps to a 32×32 pixel format, like a grid of ON and OFF values which depict the character black on white. These ON and OFF values are represented as 1s and 0s, and the bits are then divided into 4×4 non-overlapping blocks.
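For illustration, one row could be split into its vector and label like this (a hedged sketch; line is assumed to hold one raw CSV row):

    // using System.Linq;
    int[] parts = line.Split(',').Select(int.Parse).ToArray();
    int[] vector = parts.Take(64).ToArray(); // the 64 data values
    int label = parts[64];                   // the digit identifier (0 - 9)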

The number of ON pixels is counted in each of these blocks to give a value between 0 and 16. This is explained in more detail later, but for now the idea is to understand the format of the training data in order to recreate the same format when inserting the hand-drawn values into the program.

The sum of each 4×4 block gives one value of the 64-integer sequence which represents the character (0 – 9).
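To make the transformation concrete, here is a rough sketch of the block counting in C# (the actual project code is JavaScript; this is only illustrative):

    // Converts a 32x32 binary grid into 64 integers by counting the ON pixels
    // inside each non-overlapping 4x4 block.
    static int[] CountBlocks(int[,] grid)
    {
        var result = new int[64];
        for (int row = 0; row < 32; row += 4)
            for (int col = 0; col < 32; col += 4)
            {
                int sum = 0;
                for (int r = row; r < row + 4; r++)
                    for (int c = col; c < col + 4; c++)
                        sum += grid[r, c];               // 1 = ON, 0 = OFF
                result[(row / 4) * 8 + (col / 4)] = sum; // a value from 0 to 16
            }
        return result;
    }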

Step 2: Canvas to Pixels

The goal is to create a canvas onto which a user can write a digit using a mouse or touch-pad, and to convert the bitmap into the same 64-integer sequence so it can be run through a machine learning algorithm to predict the most likely character compared to the rest of the training data set. A 32×32 pixel canvas would be too tiny to allow any scribbling as input, so we will create a 100×100 pixel canvas and then resize it later using a preview canvas. The code for capturing user scribbling on an HTML canvas and then resizing it can be found on Github [here].

When processing the canvas data, it's vital to understand how pixels work. Each pixel is represented by 4 values: Red, Green, Blue and Alpha. I won't go into much detail on this since it can be looked up easily, but for our purposes the important thing to know is that when dealing with raw canvas data, each sequence of 4 values = 1 pixel. In this case, 32×32 pixels = 1024 pixels × 4 = 4096 values; below is the console log of canvas.data illustrated.

The convertToGrascaleMatrix() method loops through all these pixels and organises them into a 32×32 grid array. In each iteration, the Alpha value (the 4th value) is checked: a 0 means the pixel is OFF, and any value between 1 – 255 means it is ON, so a 0 or 1 is inserted into the array accordingly. Below is the console log of the end result; the binary output resembles the drawn image.
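The project code is JavaScript, but the same idea rendered in C# for illustration (hypothetical names) would be:

    // rgba: raw canvas data, 4 bytes per pixel (R, G, B, A), 32x32 pixels.
    static int[,] ToBinaryGrid(byte[] rgba)
    {
        var grid = new int[32, 32];
        for (int p = 0; p < 1024; p++)                 // 1024 pixels in total
        {
            byte alpha = rgba[p * 4 + 3];              // the 4th value of each pixel
            grid[p / 32, p % 32] = alpha > 0 ? 1 : 0;  // any opacity counts as ON
        }
        return grid;
    }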

The next step is to send the pixels array into count4X4MatrixSections() to create the 4×4 blocks, count the ON/OFF values inside each and produce the 64-integer sequence to send into the machine learning algorithm.

Example in this case: 3,15,14,15,15,3,0,0,3,7,1,2,11,12,0,0,0,0,0,2,12,12,0,0,0,0,5,16,16,4,0,0,0,0,2,9,15,13,1,0,0,0,0,0,2,15,9,0,0,15,12,2,0,13,12,0,0,5,13,16,14,16,7,0,

Step 3: Machine Learning

The nearest neighbour algorithm is a good starting place for attempting to solve the OCR problem, as it uses a distance function to calculate the distance between the vector of a given OCR character (in our case the 64 integers) and those in the training data set, classifying the number according to its closest neighbour's identifier. I won't go into much detail about the algorithm itself since I will be writing another [article] about machine learning algorithms soon; plus Google is your friend if you wish to look up the nearest neighbour algorithm and Euclidean distance.

For this project, I chose to implement this algorithm and use Euclidean Distance as the distance function to predict the inserted characters. The code is in C# and can be found [here] with inline comments explaining the behaviour at each step.
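A condensed sketch of the idea (the linked C# file is the real implementation; this only shows the overall shape, reusing the EuclideanDistance helper sketched earlier):

    // using System.Collections.Generic;
    // Returns the label of the training row closest to the input sequence.
    static int ClassifyNearest(int[] input, List<(int[] Vector, int Label)> trainingSet)
    {
        int best = -1;
        double bestDistance = double.MaxValue;
        foreach (var (vector, label) in trainingSet)
        {
            double d = EuclideanDistance(input, vector); // sketched earlier
            if (d < bestDistance) { bestDistance = d; best = label; }
        }
        return best;
    }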

Step 4: Refining Results

When I finished the first version of this project, I was getting pretty good predictions from the algorithm, but when I asked some friends to try it the results weren't so accurate anymore. After some analysis I realised that when writing characters on the canvas I was drawing big numbers that occupied all the space in the box, whilst other people weren't.

I concluded that, since the OCR training data set is based on bitmaps of numbers which are centered and without much white bordering or white space in general, it would be a good idea to introduce a centralise-and-crop function to give the algorithm more refined data to work with. The centraliseAndCropCanvas() method is based on a few concepts which I found through research and adapted to my project; I will leave references for these below. Basically, finding the bounding rectangle of the image eliminates the outer white space, and finding the centre of mass of the character helps reposition it in the middle of the canvas after cropping.
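Conceptually, the two ideas look like this on a binary grid (an illustrative C# sketch, not the real implementation, which is the JavaScript linked below; it assumes at least one ON pixel):

    // using System;
    // Finds the bounding box of the ON pixels and the character's centre of mass;
    // together these allow cropping the white space and recentring the digit.
    static ((int Top, int Bottom, int Left, int Right) Box,
            (double Row, double Col) Centre) Analyse(int[,] grid)
    {
        int top = int.MaxValue, bottom = -1, left = int.MaxValue, right = -1;
        double rowSum = 0, colSum = 0;
        int count = 0;
        for (int r = 0; r < grid.GetLength(0); r++)
            for (int c = 0; c < grid.GetLength(1); c++)
                if (grid[r, c] == 1)
                {
                    top = Math.Min(top, r); bottom = Math.Max(bottom, r);
                    left = Math.Min(left, c); right = Math.Max(right, c);
                    rowSum += r; colSum += c; count++;
                }
        return ((top, bottom, left, right), (rowSum / count, colSum / count));
    }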

The code for these functions can be found in this JavaScript file [here]. This has drastically improved the accuracy of the algorithm. The preview box illustrates the final version of the drawn character after undergoing all transformations, just before being sent to the algorithm.

Conclusion

Of course, it can never reach 100% accuracy, but if you think about it, even a human being has difficulty recognising a handwritten digit every now and then, let alone a machine!

You want to try yourself? Check out the demo [here].

Full source code can be found on [Github].

What next?

I will write another [post] comparing other algorithms such as k-NN and the Multi-Layered Perceptron, discussing their advantages and disadvantages, and trying to apply the GUI created in this project to them.

I will also try the ML.NET machine learning libraries and apply the GUI to check whether I get better results than with my own implementations.

References

1. Optical Recognition of Handwritten Digits Data Set:
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

2. The MNIST Database:
http://yann.lecun.com/exdb/mnist/

3. Using Touch Events with the HTML5 Canvas
http://bencentra.com/code/2014/12/05/html5-canvas-touch-events.html

4. Canvas bounding box
http://phrogz.net/tmp/canvas_bounding_box.html
https://stackoverflow.com/questions/9852159/calculate-bounding-box-of-arbitrary-pixel-based-drawing

5. JS – Center Image inside canvas element
https://stackoverflow.com/questions/39619967/js-center-image-inside-canvas-element