Problem Definition
Give a concise description of current problem. What needs to be solved? Why is the result useful? Do you make any assumptions? What are the anticipated difficulties?
Method and Implementation
Each frame was captured from the default camera. First, a bilateral filter was applied to reduce noise while preserving the edges of the image. To detect skin in the video stream, we converted the frame to the YCbCr color space, where Y is the luma component and Cb and Cr are the blue-difference and red-difference chroma components. We then applied a Gaussian blur to the Cr component and used Otsu thresholding to turn it into a binary image. On the binary image, we used findContours to find the convex hull of the white pixels and determine the boundary of the gesture, then cropped that region and scaled it to a predetermined size.

We also tried a background subtractor to improve performance in more general environments. We used createBackgroundSubtractorMOG2, a Gaussian mixture-based background/foreground segmentation algorithm that uses a history of frames to compute a foreground mask. However, during testing we found that it added a lot of noise on clean backgrounds and did not perform well in general environments, which made skin detection even harder.

Gesture Recognition
After cropping the gesture from the captured frame, we compared it with prestored gesture templates (thumbs up, thumbs down, one, victory, five) using matchTemplate to compute a correlation score for each template, and returned the gesture with the highest score. matchTemplate offers several methods for computing the correlation, and we found that the TM_CCORR_NORMED matching mode (normalized cross-correlation) gave the best results.
Experiments
Describe your experiments, including the number of tests that you performed, and the relevant parameter values.
Define your evaluation metrics, e.g., detection rates, accuracy, running time.
Results
With five candidate gestures, we achieved good accuracy on clean backgrounds (with no other body parts in view). We can also detect whether a gesture is present in the frame. Because the algorithm runs recognition on every frame (30 fps), it is difficult to compute actual accuracy or recall values, and computing them on static images would not be representative either.
Results | ||
Trial | Template Image | Result Image |
Trial 1: Thumbs up | ![]() | ![]() |
Trial 2: Thumbs down | ![]() | ![]() |
Trial 3: One | ![]() | ![]() |
Trial 4: Victory | ![]() | ![]() |
Trial 5: Five | ![]() | ![]() |
Confusion Matrix
Discussion
Discuss your method and results:
- We used skin color detection (YCbCr) and background subtraction to extract the hand. This performed really well when the background was simple, such as a wall, but not when the background became more complex. The skin color detection also cannot distinguish hands from the face, which is why I tried not to show my face during the experiments. We used template matching for classification because it is easy to implement. However, it is not a very good classification algorithm: if you turn your hand a little, the result can change a lot.
- We used to think that human skin color looks very different from the environment, so hands would be easy to find in the image. It turned out to be much harder than we thought. Template matching only works well with a small number of gestures. I think increasing the number of templates could help.
- Potential future work: first, we need to change the classification algorithm. I think comparing skeletons may be much easier than comparing whole images. Alternatively, we could find some nodes (joints) in the hand, which turns the hand into a 'tree', and store the angles between joints. We could then run a BFS or DFS over the tree and look for patterns.
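As a rough illustration of the joint-angle idea, the sketch below walks a keypoint tree with BFS and records the angle at every joint. We have not implemented this: the function names, the `points` and `children` dictionaries, and the assumption that 2D keypoints are already available are all hypothetical.

```python
import math
from collections import deque


def angle_at(a, b, c):
    """Angle ABC in degrees, measured at vertex b."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos))


def joint_angles(points, children, root):
    """BFS over the keypoint tree; map (parent, joint, child) to an angle."""
    angles = {}
    parent = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            parent[child] = node
            queue.append(child)
            grand = parent[node]
            # The root has no incoming edge, so it has no angle to measure.
            if grand is not None:
                angles[(grand, node, child)] = angle_at(
                    points[grand], points[node], points[child])
    return angles
```

A straight finger would give angles near 180 degrees along its chain and a bent one smaller values, so the set of angles could serve as a feature vector that changes less under rotation than the raw image.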
Conclusion
Our algorithm works really well when the background is simple and no other body parts are in the camera's view. However, we need to change our classification algorithm and try other methods to find the face, maybe a CNN.
Credits and Bibliography
none