Goal / Problem Definition
To explore the classification of hand gestures using OpenCV, color thresholding, and binary manipulation. Specifically, I chose to look at the following static hand gestures: closed-fist, spread-hand, thumbs-up/down, and peace-sign.
I decided to approach the problem of identifying different hand positions (gestures) by comparing hand contours with those from a set of known templates.
Initially, I had intended to use OpenCV's template matching functions instead of writing my own contour-matching function, but their sensitivity to differences in rotation and scale meant that I was getting very poor results, even with multiple templates at different rotations. My contour-matching function does not suffer from the rotation/scale sensitivity of OpenCV's template matching, but it is disproportionately affected by small differences in the contour, since errors can accumulate over the length of the contour.
First, my code runs over the image, discarding pixel values that do not (roughly) match human skin color, using a combination of RGB and HSV thresholds.
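As an illustration of this kind of combined RGB/HSV skin test, the sketch below checks a single pixel; the threshold values and the hue range are assumptions for demonstration, not the values used in the lab.

```cpp
#include <algorithm>
#include <cmath>

// Illustrative RGB skin heuristic (threshold values are assumptions):
// skin pixels tend to have R > G > B with a minimum spread between channels.
bool isSkinRGB(int r, int g, int b) {
    int mx = std::max({r, g, b});
    int mn = std::min({r, g, b});
    return r > 95 && g > 40 && b > 20 &&
           (mx - mn) > 15 && r > g && r > b &&
           std::abs(r - g) > 15;
}

// Convert RGB (0-255) to hue in degrees [0, 360).
double rgbToHue(int r, int g, int b) {
    double rf = r / 255.0, gf = g / 255.0, bf = b / 255.0;
    double mx = std::max({rf, gf, bf}), mn = std::min({rf, gf, bf});
    double d = mx - mn;
    if (d == 0) return 0.0;
    double h;
    if (mx == rf)      h = std::fmod((gf - bf) / d, 6.0);
    else if (mx == gf) h = (bf - rf) / d + 2.0;
    else               h = (rf - gf) / d + 4.0;
    h *= 60.0;
    if (h < 0) h += 360.0;
    return h;
}

// Combined test: pass the RGB heuristic AND fall in a red/orange hue band
// (the band chosen here is an illustrative assumption).
bool isSkin(int r, int g, int b) {
    double hue = rgbToHue(r, g, b);
    bool hueOk = hue < 50.0 || hue > 340.0;
    return isSkinRGB(r, g, b) && hueOk;
}
```

In a full pipeline this test would be applied per pixel, zeroing out anything that fails it before contour detection.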
Next, I use OpenCV's contour detection algorithm to generate contours, and discard contours that are unlikely to be hands (e.g. because they are too complex, or because they are bounded by another contour).
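A filtering step of this kind might look like the sketch below, which works on contour bounding boxes; the specific heuristics (a minimum box side, and rejecting boxes nested inside another box) are assumptions standing in for the lab's actual rules.

```cpp
#include <cstddef>
#include <vector>

struct Box { int x0, y0, x1, y1; };

// Keep the indices of bounding boxes that are large enough to be a hand
// and are not contained inside another contour's bounding box. (The lab
// also rejects overly complex contours; that test is omitted here.)
std::vector<size_t> filterContours(const std::vector<Box>& boxes, int minSide) {
    std::vector<size_t> keep;
    for (size_t i = 0; i < boxes.size(); ++i) {
        const Box& b = boxes[i];
        if (b.x1 - b.x0 < minSide || b.y1 - b.y0 < minSide) continue;
        bool nested = false;
        for (size_t j = 0; j < boxes.size() && !nested; ++j)
            if (j != i && boxes[j].x0 <= b.x0 && boxes[j].y0 <= b.y0 &&
                boxes[j].x1 >= b.x1 && boxes[j].y1 >= b.y1)
                nested = true;
        if (!nested) keep.push_back(i);
    }
    return keep;
}
```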
Once I have these contours, I compare them to contours extracted from the template images using a rotation-, translation-, and scaling-invariant contour representation of my own design. This representation describes a contour by the cumulative turning angle accumulated while moving from one contour pixel to the next, so that, ideally, the contour becomes a continuous two-dimensional function whose value at the "end" of the contour matches its value at the "start". (I say "ideally" because my implementation does not always produce a continuous function at the moment.)
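The core idea can be sketched as a cumulative turning-angle function; this is an assumed reconstruction in the spirit described above, not the lab's exact code. For a simple closed contour traversed once, the cumulative turning angle ends at ±2π (one full revolution), which is the sense in which the "end" of the signature matches the "start".

```cpp
#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Cumulative turning-angle signature of a closed contour. Translation drops
// out because only direction differences are used; rotation drops out because
// only *changes* in direction are summed; scale drops out because angles do
// not depend on segment length.
std::vector<double> invariantContour(const std::vector<Pt>& c) {
    std::vector<double> sig;
    if (c.size() < 3) return sig;
    const double PI = std::acos(-1.0);
    double sum = 0.0;
    for (size_t i = 0; i < c.size(); ++i) {
        const Pt& prev = c[(i + c.size() - 1) % c.size()];
        const Pt& cur  = c[i];
        const Pt& next = c[(i + 1) % c.size()];
        double a1 = std::atan2(cur.y - prev.y, cur.x - prev.x);
        double a2 = std::atan2(next.y - cur.y, next.x - cur.x);
        double turn = a2 - a1;
        // Wrap each turn into (-pi, pi] so the running sum is well defined.
        while (turn > PI)   turn -= 2.0 * PI;
        while (turn <= -PI) turn += 2.0 * PI;
        sum += turn;
        sig.push_back(sum);
    }
    return sig;
}
```

For a unit square traversed counter-clockwise, each corner contributes a turn of π/2 and the signature ends at 2π.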
Using these differences, I then color the object bounded by each contour according to the closest template contour match, and display this image to the user.
Color-coding is as follows:
Blue = Closed Fist gesture
Light-Blue = Spread Hand gesture
Magenta = Thumbs-Up / Thumbs-Down gestures
Yellow = Peace Sign gesture
Additionally, the invariant contours for two of the thumb template images are written out as graphs in the JPEG files tui.jpg and tui2.jpg.
Mat threshold_process(Mat image): Runs thresholding for skin colors
vector<Point> getLargestContour(Mat image): Gets the largest contour (by bounding box size) from an image
vector<float> getInvariantContour(vector<Point> contour): Gets a representation of a contour that attempts to be invariant to rotation, location, and scaling by tracking the sum of the angle difference when following the contour
float diffInvariantContours(vector<float> contour1, vector<float> contour2): Calculates the difference between two invariant contours
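As a sketch of how a function like diffInvariantContours might compare two signatures of different lengths (this is an assumed approach, not necessarily the lab's actual implementation): resample both signatures to a common number of samples by nearest-index lookup, then average the absolute difference.

```cpp
#include <cmath>
#include <vector>

// Compare two invariant-contour signatures of possibly different lengths.
// Lower values mean more similar contours; identical signatures give 0.
double diffInvariantContours(const std::vector<double>& a,
                             const std::vector<double>& b,
                             size_t samples = 64) {
    if (a.empty() || b.empty()) return 1e9; // nothing to compare
    double total = 0.0;
    for (size_t i = 0; i < samples; ++i) {
        double t = static_cast<double>(i) / samples; // position in [0, 1)
        double va = a[static_cast<size_t>(t * a.size())];
        double vb = b[static_cast<size_t>(t * b.size())];
        total += std::fabs(va - vb);
    }
    return total / samples;
}
```

A classifier would call this once per template and pick the template with the smallest difference.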
In order to generate usable results, I took a video of the program's output while performing the four tracked hand gestures, and then sampled the video at semi-regular intervals to get discrete data points.
Confusion Matrix: (rows = predicted, columns = observed)
H F T P
H 3 3 4 5
F 2 2 1 2
T 0 2 5 3
P 0 3 0 3
N 0 0 2 1
H: The spread Hand gesture
F: The Fist gesture
T: The Thumb gesture
P: The Peace gesture
N: No gesture recognized
Overall accuracy: 13/41 = 0.3171
Hand detection true-positive rate: 3/5 = 0.6000
Fist detection true-positive rate: 2/10 = 0.2000
Thumbs-Up detection true-positive rate: 5/12 = 0.4167
Peace Sign detection true-positive rate: 3/14 = 0.2143
Hand detection false-positive rate: 12/15 = 0.8000
Fist detection false-positive rate: 5/7 = 0.7143
Thumbs-Up detection false-positive rate: 5/10 = 0.5000
Peace Sign detection false-positive rate: 3/6 = 0.5000
(Here, "false-positive rate" means the fraction of predictions of a given gesture that turned out to be incorrect.)
Most of these values are extremely underwhelming, likely due to 1) oversights in my custom contour-matching implementation, and 2) issues with determining hand shape based solely on contour similarities.
It is of note that none of the true-positive rates is worse than random guessing (0.2, given five possible outputs), despite the fact that the false-positive rates render my gesture detection useless for almost all practical scenarios.
It is also possible that the rates could have been improved by providing more templates to match the contours for each gesture.
Based on my work in this lab, template matching as provided in OpenCV is not a good solution (at least on its own) for static gesture recognition. Likewise, while my contour-similarity code has the benefit of being invariant under common image translations, it would require a great deal of additional work / modification to be of realistic use in static gesture recognition.
Credits and Bibliography
OpenCV Documentation on Contour Hierarchies, https://docs.opencv.org/3.4/d9/d8b/tutorial_py_contours_hierarchy.html (Accessed February 12, 2020).
Wikipedia - Template Matching, https://en.wikipedia.org/wiki/Template_matching (Accessed February 12, 2020).