Problem Definition
We were asked to create a desktop program that reads video from a webcam and recognizes four hand gestures: "palm", "ok", "peace" and "rock". The program must detect and classify each hand separately and visualize the result in a separate window.
Method and Implementation
- The core method we used is template matching. It takes a binary image and a template as input and computes a similarity value for every window position using a sliding-window approach. As a result, it returns the similarity measure and a heatmap (an array holding the similarity for each corresponding patch of the image).
- To apply template matching to the webcam frames, we need a binary mask of the hands. We obtained it by thresholding: we converted each frame to HSV and kept only the pixels whose values lie between [0, 40, 80] and [20, 255, 255]. We then applied a morphological closing operation to reduce the noise and fill the holes in the mask.
- Using skincolor thresholding we created a set of templates that we used to classify the gestures.
- We used normalized cross-correlation (NCC) to measure the similarity between the input image and the templates.
- Since the similarity measure depends heavily on the size of the objects in the image, we used pyramid scaling: we ran the classification on the image rescaled to four different scales.
- All the templates were created from pictures of the right hand, so at this point our method should only classify the right hand correctly. To detect the left hand, we mirror the image and run the same pipeline again.
- We attempted to roughly estimate the hand contour using cross-frame difference, but did not achieve any noticeable results.
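The steps above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not our actual implementation: the function names, the scale factors, the nearest-neighbour resizing and the uncentered form of NCC are hypothetical choices, the HSV channel scale is assumed to be OpenCV-style, and the morphological closing step is omitted for brevity.

```python
from math import sqrt

# HSV bounds quoted in the text. The channel scale is an assumption
# (OpenCV-style: H in 0..179, S and V in 0..255).
LOWER, UPPER = (0, 40, 80), (20, 255, 255)


def skin_mask(hsv_image):
    """Binary mask: 1 where all three HSV channels lie within the bounds."""
    return [[int(all(lo <= c <= hi for c, lo, hi in zip(px, LOWER, UPPER)))
             for px in row] for row in hsv_image]


def ncc_heatmap(mask, template):
    """Slide the template over the mask; one NCC value per window position."""
    th, tw = len(template), len(template[0])
    t_norm = sqrt(sum(v * v for row in template for v in row))
    heat = []
    for y in range(len(mask) - th + 1):
        heat_row = []
        for x in range(len(mask[0]) - tw + 1):
            dot = p_sq = 0
            for dy in range(th):
                for dx in range(tw):
                    p = mask[y + dy][x + dx]
                    dot += p * template[dy][dx]
                    p_sq += p * p
            denom = sqrt(p_sq) * t_norm
            # An all-zero patch has no defined NCC; score it as 0.
            heat_row.append(dot / denom if denom else 0.0)
        heat.append(heat_row)
    return heat


def best_match(mask, template):
    """Return (score, (x, y)) for the highest-scoring window."""
    heat = ncc_heatmap(mask, template)
    return max((v, (x, y)) for y, row in enumerate(heat)
               for x, v in enumerate(row))


def mirror(image):
    """Flip horizontally so right-hand templates can match a left hand."""
    return [row[::-1] for row in image]


def rescale(image, factor):
    """Nearest-neighbour resize of a 2D list by `factor`."""
    h = max(1, round(len(image) * factor))
    w = max(1, round(len(image[0]) * factor))
    return [[image[min(len(image) - 1, int(y / factor))]
                  [min(len(image[0]) - 1, int(x / factor))]
             for x in range(w)] for y in range(h)]


def classify(mask, templates, scales=(1.0, 0.75, 0.5, 0.25)):
    """Return (score, label) of the best template over all scales.

    The four scale factors are hypothetical; the text only says "four scales".
    """
    best = (0.0, None)
    for s in scales:
        scaled = rescale(mask, s)
        for label, tmpl in templates.items():
            if len(scaled) < len(tmpl) or len(scaled[0]) < len(tmpl[0]):
                continue  # template no longer fits at this scale
            score = best_match(scaled, tmpl)[0]
            if score > best[0]:
                best = (score, label)
    return best
```

With a dictionary of templates keyed by gesture name, `classify(mask, templates)` would handle the right hand and `classify(mirror(mask), templates)` the left one. The nested loops are far too slow for real frames; they are written out only to make the sliding-window computation explicit.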
Experiments
We tested our method under various lighting conditions. The skincolor detection turned out to be highly dependent on the lighting and required some adjustments in order to work properly. Since the face is also present in the skincolor mask, our algorithm tends to confuse it with the hands and sometimes misdetects the actual hands.
Results
Templates: [images not preserved]
Results, good examples (columns: Input | Skincolor mask | NCC heatmap): [images not preserved]
Discussion
We would like to point out some observations about our approach:
- Our method is relatively fast: it can process the whole image within the frame time, which is rare for more complex solutions such as those based on convolutional neural networks.
- However, the accuracy of our method leaves much to be desired. It detects and classifies the hands only in "good" lighting and when the head is out of the frame. It also often detects both hands at the same location.
- There is considerable room for improvement in future work. First, we could analyze shape properties of the detected objects, such as curvature or circularity, to distinguish the head from the hands. Second, it would likely be beneficial to use a different matching approach, e.g. convolution. Third, we could preprocess the image so that the lighting conditions and other disturbances do not affect the classification accuracy. Finally, we could analyze the movement of the objects, which would help us track the hands more efficiently.
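As an illustration of the circularity idea mentioned above, the sketch below computes the standard measure 4πA/P² for a closed contour given as polygon vertices. We have not implemented or tested this; the function names and the two toy contours are hypothetical, and a real pipeline would first need a contour-extraction step on the skincolor mask.

```python
from math import cos, hypot, pi, sin


def circularity(contour):
    """4*pi*area / perimeter**2 for a closed polygon of (x, y) vertices."""
    area2 = 0.0      # twice the signed area (shoelace formula)
    perimeter = 0.0
    for (x0, y0), (x1, y1) in zip(contour, contour[1:] + contour[:1]):
        area2 += x0 * y1 - x1 * y0
        perimeter += hypot(x1 - x0, y1 - y0)
    return 4 * pi * (abs(area2) / 2) / perimeter ** 2 if perimeter else 0.0


def ring(radii, n):
    """Toy contour: n vertices around the origin, cycling through `radii`."""
    return [(radii[i % len(radii)] * cos(2 * pi * i / n),
             radii[i % len(radii)] * sin(2 * pi * i / n)) for i in range(n)]


head_like = ring([1.0], 64)        # near-circular blob
hand_like = ring([1.0, 0.4], 10)   # spiky blob, loosely like spread fingers

print(circularity(head_like), circularity(hand_like))
```

The value approaches 1 for a circle and drops for elongated or spiky shapes, so a blob with high circularity could be treated as a head candidate. Any concrete threshold would have to be tuned on real masks.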
Conclusions
The template matching based approach to hand gesture classification proved to be somewhat effective when no other body parts are present in the frame. It is a relatively simple solution that does not require a large dataset for training; however, it lacks accuracy and is therefore not suitable for enterprise products.
Credits and Bibliography
Mathematical_morphology
Template_matching
Skincolor Detection using HSV Color Space