Summer 2019

one demo
two demo
three demo

This project was developed as a teaching module for AI4ALL 2019 at Boston University.
It is a script that identifies "air-written" digits by tracking a colored sticker and feeding the tracked data into a neural network. The project combines three separate components:

  1. Tracking sticker
  2. Training a neural network on MNIST dataset
  3. Integrating trained neural network to tracker

Tracking sticker

Detecting sticker

Choosing Color Spaces

In school or art class you may have mixed different colours of paint or dye together in order to make new colours. In painting it's common to use red, yellow and blue as three "primary" colours that can be mixed to produce many more colours. Mixing red and blue gives purple, red and yellow gives orange, and so on. For printing, printers commonly use three slightly different primary colours: cyan, magenta, and yellow (CMY). All the colours on a printed document are made by mixing these primary colours. Both these kinds of mixing are called "subtractive mixing", because they start with a white canvas or paper and "subtract" colour from it.

rgb and cmyk

Computer screens and related devices also rely on mixing three colours, except they need a different set of primary colours because they are additive, starting with a black screen and adding colour to it. For additive colour on computers, the colours red, green and blue (RGB) are used. Each pixel on a screen is typically made up of three tiny "lights": one red, one green, and one blue. By increasing and decreasing the amount of light coming out of each of these three, all the different colours can be made. Because a colour is simply made up of amounts of the primary colours (red, green and blue), three numbers can be used to specify how much of each primary colour is needed to make the overall colour. A commonly used scheme is to use numbers in the range 0 to 255. Those numbers tell the computer how fully to turn on each of the primary colour "lights" in an individual pixel. If red is set to 0, the red "light" is completely off. If red is set to 255, the red "light" is fully on. With 256 possible values for each of the three primary colours (don't forget to count 0!), that gives 256 x 256 x 256 = 16,777,216 possible colours, more than the human eye can detect!

The Color Space of Our Choice

The HSV (hue, saturation, value) color space was designed in the 1970s by computer graphics researchers to better align with the way human vision perceives color-making attributes. In this model, colors of each hue are arranged in a radial slice around a central axis of neutral colors, which ranges from black at the bottom to white at the top. The HSV representation models the way paints of different colors mix together, with the saturation dimension resembling various shades of brightly colored paint, and the value dimension resembling the mixture of those paints with varying amounts of black or white paint.

HSV cone

Why HSV?

We use the HSV color space rather than RGB for color detection because HSV is more robust to changes in external lighting. Under minor lighting changes (such as pale shadows), it is mostly the value component that shifts, so the hue of a color varies relatively less than its RGB values do.

For example, two shades of red might have similar hue values but widely different RGB values. In real-life scenarios such as color-based object tracking, we need our program to run well regardless of environmental changes as far as possible. So, we prefer HSV color thresholding over RGB.
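The claim about two shades of red can be checked with a small sketch using Python's standard `colorsys` module. The two RGB triples are made-up examples of a red sticker in bright light and in shadow, and `rgb_to_hsv_degrees` is a hypothetical helper, not part of the project script.

```python
import colorsys

def rgb_to_hsv_degrees(r, g, b):
    """Convert 0-255 RGB to (hue in degrees, saturation in %, value in %)."""
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    return round(h * 360), round(s * 100), round(v * 100)

# Illustrative values only: the same red seen in bright light and in shadow.
bright_red = (220, 30, 30)
shadowed_red = (110, 15, 15)

print(rgb_to_hsv_degrees(*bright_red))    # (0, 86, 86)
print(rgb_to_hsv_degrees(*shadowed_red))  # (0, 86, 43)
```

Every RGB channel halves between the two shades, yet hue and saturation stay the same and only the value changes. That is why a threshold on hue (with a loose range on value) tends to survive lighting changes better than a threshold on raw RGB.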


To track the sticker we want to ignore everything in the background. Suppose we choose to track a red sticker; masking then gives the result below:

colored stickers


There are occasions where the mask doesn't completely remove other unwanted colors. In this case erosion can be applied to remove them, essentially removing noise while making the "important" object smaller. An example erosion can be seen below:

unedited mask


In the case where the mask removes too much of the wanted color, dilation can be applied, essentially making the "important" object bigger. An example dilation can be seen below:

unedited mask


Contours can be explained simply as a curve joining all the continuous points (along the boundary) that have the same color or intensity. Contours are a useful tool for shape analysis and for object detection and recognition. Due to masking, erosion and dilation, the contour created isn't the smoothest shape. To turn it back into a circle, find the largest contour and compute its minimum enclosing circle.
bumpy contour
good contour


The centroid of a shape is the arithmetic mean (i.e. the average) of all the points in a shape. Suppose a shape consists of \(n\) distinct points \(x_1, \dots, x_n\); then the centroid is given by: $$c=\frac{1}{n}\sum_{i=1}^{n}x_i$$ In the context of image processing and computer vision, each shape is made of pixels, and the centroid is simply the weighted average of all the pixels constituting the shape.

We can find the center of the blob using moments in OpenCV. First, however, we should know what exactly an image moment is. An image moment is a particular weighted average of image pixel intensities; it is used to compute specific properties of an image, such as radius, area, and centroid. To find the centroid, we generally convert the image to binary format and then find its center.
The centroid is given by the formula: $$C_x=\frac{M_{10}}{M_{00}}$$ $$C_y=\frac{M_{01}}{M_{00}}$$ where \(C_x\) is the \(x\) coordinate and \(C_y\) is the \(y\) coordinate of the centroid, and \(M\) denotes the moment.


Writing the script

Tracking can be done by storing the position of the centroid for every frame. To replicate a "pen-up" and "pen-down" function, a key can be bound so that the centroid position is stored only while the key is pressed.
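The pen-up and pen-down logic can be sketched independently of the camera loop. `StrokeRecorder` is a hypothetical class for illustration, and the frames below are simulated rather than captured.

```python
class StrokeRecorder:
    """Collects centroid positions only while the pen is down."""

    def __init__(self):
        self.points = []

    def update(self, centroid, pen_down):
        """Store this frame's centroid if the pen key is pressed and a sticker was found."""
        if pen_down and centroid is not None:
            x, y = centroid
            self.points.append((int(round(x)), int(round(y))))

    def clear(self):
        """Forget the current stroke, e.g. after a prediction."""
        self.points = []

# Simulated frames: (centroid or None, whether the pen key was pressed).
frames = [((10.2, 20.7), False), ((11.0, 21.0), True), (None, True), ((13.6, 25.1), True)]
recorder = StrokeRecorder()
for position, pen_down in frames:
    recorder.update(position, pen_down)
print(recorder.points)  # [(11, 21), (14, 25)]
```

In a live OpenCV loop, `pen_down` could come from something like `(cv2.waitKey(1) & 0xFF) == ord("d")`, where the choice of the `d` key is arbitrary.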

Converting tracked data points into a suitable form for the neural network

The data from the tracker is a list of coordinates. To convert this into the written digit, connect consecutive coordinates with line segments. Then resize the image to \(28 \times 28\) and convert it to grayscale to match the MNIST dataset.

Training neural network on MNIST dataset

Create a Convolutional Neural Network

The architecture of my convolutional neural network for this task is as follows:

Train your Model

Training for 6 epochs with a learning rate of \(0.001\) was sufficient to achieve \(99\%\) test accuracy.

Test on Air-Written Digits

Apply the same data transform used on the MNIST data to our resized grayscale image from the previous steps and pass it into the neural network.

Air written prediction

Integrating trained neural network to tracker

To integrate the trained neural network into the tracker, download the trained neural network and load it into the tracker script. Also include the full transform and conversion process that turns the list of centroid coordinates into data that can be fed into the neural network. Bind a key that predicts from the tracked data and then empties the stored list, so that the script does not have to be re-run for each air-written digit.