CS 585 HW 2
Ziliang ZHU

Problem Definition

Design and implement algorithms that recognize hand shapes (such as making a fist, thumbs up, thumbs down, pointing with an index finger, etc.).

Method and Implementation

Frames are read from the camera with the OpenCV API. Skin is detected using thresholds and relations among the R, G, and B channels, and each pixel is then classified as either black or white to produce a binary image.
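A rule-based skin test of this kind can be sketched as follows. This is only an illustration: the thresholds are a commonly cited daylight rule and are an assumption, not the values tuned in the script, and the names `is_skin` and `skin_mask` are hypothetical. Plain Python lists are used so the sketch runs without dependencies.

```python
def is_skin(r, g, b):
    """Return True if an (R, G, B) pixel passes a simple rule-based skin test.

    The thresholds are assumed (a commonly cited daylight rule),
    not the values tuned in the original script.
    """
    return (
        r > 95 and g > 40 and b > 20             # absolute channel thresholds
        and max(r, g, b) - min(r, g, b) > 15     # enough spread between channels
        and abs(r - g) > 15                      # red clearly differs from green
        and r > g and r > b                      # red is the dominant channel
    )


def skin_mask(frame):
    """Convert a frame (rows of (R, G, B) tuples) to a binary mask: 1 = skin, 0 = not skin."""
    return [[1 if is_skin(*px) else 0 for px in row] for row in frame]


# Tiny 2x2 example frame: two skin-like pixels and two background pixels.
frame = [
    [(220, 170, 140), (30, 30, 30)],
    [(10, 200, 10), (200, 150, 120)],
]
mask = skin_mask(frame)
```

Note that OpenCV delivers frames in BGR order, so the channels would need to be reordered before applying a rule written in terms of R, G, and B.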

An important step is environment detection. We first run skin detection on the environment for 5 seconds. Pixels detected as skin during this period (the "environmental skin") are set to negative in all later frames, so that they are excluded from the calculations. This greatly reduces the noise.
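The environment step can be sketched as below. How the calibration frames are combined is not stated in this report; the sketch assumes a pixel counts as environmental skin if it was positive in any calibration frame. The function names are hypothetical.

```python
def build_environment_mask(calibration_masks):
    """OR together the binary skin masks captured during the calibration period.

    Assumption: a pixel is "environmental skin" if it was positive in at least
    one calibration frame.
    """
    height = len(calibration_masks[0])
    width = len(calibration_masks[0][0])
    env = [[0] * width for _ in range(height)]
    for mask in calibration_masks:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    env[y][x] = 1
    return env


def suppress_environment(mask, env):
    """Set every pixel flagged during calibration to negative (0)."""
    return [
        [0 if e else m for m, e in zip(mask_row, env_row)]
        for mask_row, env_row in zip(mask, env)
    ]


# Two calibration masks, then one later frame in which every pixel is positive.
calibration = [
    [[0, 1], [0, 0]],
    [[0, 0], [1, 0]],
]
env = build_environment_mask(calibration)
cleaned = suppress_environment([[1, 1], [1, 1]], env)
```

Only the pixels that never looked like skin during calibration remain positive in `cleaned`.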

Positive pixels are projected onto the horizontal and vertical axes. The waves in both projections are counted to detect the open-hand gesture. The first position where the projection sum becomes large is taken as the starting point, and template matching is performed from there to detect the thumbs up, thumbs down, and fist gestures. Predictions are output based on template matching similarity. The similarity is the sum over pixels of (1 - xor), that is, the number of pixels on which the binary image and the template agree; this equals the pixel count minus the L1 distance.
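The projection and matching steps can be sketched as follows. Several details here are assumptions: a "wave" is read as a run of projection values above a threshold, the threshold values are arbitrary, and all function names are hypothetical. The (1 - xor) score counts agreeing pixels, so a higher score means a closer match (it is the pixel count minus the L1 distance).

```python
def projections(mask):
    """Return (column_sums, row_sums) of a binary mask."""
    col_sums = [sum(col) for col in zip(*mask)]
    row_sums = [sum(row) for row in mask]
    return col_sums, row_sums


def count_waves(profile, threshold):
    """Count separate runs where the profile rises above the threshold."""
    waves, inside = 0, False
    for value in profile:
        if value > threshold and not inside:
            waves += 1
        inside = value > threshold
    return waves


def first_index_above(profile, threshold):
    """Index of the first projection value above the threshold, or None."""
    for i, value in enumerate(profile):
        if value > threshold:
            return i
    return None


def agreement(patch, template):
    """Sum of (1 - xor) over pixels: number of pixels where patch and template agree."""
    return sum(
        1 - (p ^ t)
        for patch_row, template_row in zip(patch, template)
        for p, t in zip(patch_row, template_row)
    )


mask = [
    [0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
]
col_sums, row_sums = projections(mask)
waves = count_waves(col_sums, threshold=1)
start = first_index_above(col_sums, threshold=1)

# Hypothetical templates; the prediction is the one with the highest agreement.
templates = {
    "two bars": [[0, 1, 0, 1, 0], [0, 1, 0, 1, 0], [0, 1, 0, 1, 0]],
    "blank": [[0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]],
}
scores = {name: agreement(mask, tpl) for name, tpl in templates.items()}
best = max(scores, key=scores.get)
```

In this toy example the column projection has two separate runs above the threshold, and the mask differs from the "two bars" template in a single pixel.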

The program also displays a small logo indicating the predicted gesture.


The script was run many times to evaluate whether the template boundary was too large, whether skin detection was working properly, the size of the template, the running time, etc. The parameters used in the script are the ones that gave the best gesture predictions.

We ran 100 frames for each gesture and produced the confusion matrix.


Here are some frame snippets.

detection samples

Trial Source Image Result
trial 1 Thumbs up
trial 2 Fist
trial 3 Thumbs down
trial 4 Open hand

Confusion Matrix

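Rows are the true gesture, columns are the predicted gesture.

Truth       | open hand | thumbs up | thumbs down | fist
open hand   | 89        | 0         | 8           | 3
thumbs up   | 0         | 80        | 0           | 20
thumbs down | 0         | 0         | 73          | 27
fist        | 0         | 33        | 14          | 53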

Accuracy: 0.7375
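The accuracy figure can be reproduced from the confusion matrix. The short check below assumes that rows are the true gesture and columns the predicted gesture, which matches each row summing to the 100 frames run per gesture.

```python
labels = ["open hand", "thumbs up", "thumbs down", "fist"]

# Rows: true gesture, columns: predicted gesture (assumed orientation).
confusion = [
    [89, 0, 8, 3],
    [0, 80, 0, 20],
    [0, 0, 73, 27],
    [0, 33, 14, 53],
]

correct = sum(confusion[i][i] for i in range(len(labels)))  # diagonal entries
total = sum(sum(row) for row in confusion)                  # all evaluated frames
accuracy = correct / total

# Fraction of frames of each true gesture that were predicted correctly.
per_class = {
    label: confusion[i][i] / sum(confusion[i]) for i, label in enumerate(labels)
}
```

The diagonal sums to 295 out of 400 frames, which gives 0.7375. The per-gesture fractions show that the fist is the weakest class, with 53 of 100 frames correct.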


The main advantage of the method described above and demonstrated in the code is speed. Each frame is processed within milliseconds, so predictions can be made on real-time video. With proper lighting and parameter settings, and with the help of the noise cancelling method, the accuracy is rather satisfying: 0.7375. However, since the method is rather simple, it cannot cope when the background is too noisy. For example, when the user's face is in view of the camera, that whole region is cancelled from the calculation, making prediction very hard. Also, the template is not adjustable in size and no sliding window is performed. Both were implemented and tested, but they took too long for a simple project like this. These are the drawbacks in detection precision. Besides, the selected gestures are quite similar, so they are easily confused. Finally, the algorithm does not use motion or video information, which is a possible direction for future development.


This project illustrates some simple tracking and detection methods in image manipulation using basic traditional learning algorithms. It shows good results but leaves plenty of room for future study.

Credits and Bibliography

The material used comes mainly from lectures, or was inspired by and rebuilt from lab materials.

My teammates and I worked on this project separately.