For this assignment, I used computer vision techniques to detect hands in a video, classify their shape, and display the result in a GUI.
Method and Implementation
The method I developed to solve this problem follows these steps:
1. First, use skin-color detection to create a binary mask of the hand pixels
2. Median-blur the binary mask to smooth it and suppress noise
3. Use horizontal and vertical projections to extract the bounding box of the hand
4. Load the pre-processed templates for each hand shape, and resize them to match the size of the bounding box
5. Apply template matching with cross-correlation between the binarized hand and each template
6. Find the maximum score across the cross-correlation results and select the hand shape that produced it as the best guess
7. Draw a bounding box around the detected hand and write the name of the best-guess template to the GUI
I wrote all of the code in Python using NumPy and OpenCV. The process_video function handles the I/O and calls main, which performs the actual hand detection and overlay; main in turn relies on a number of helper functions I wrote. I did use OpenCV functions for some major parts of the project, such as matchTemplate for template matching, medianBlur for smoothing, and reduce for the x/y projections.
I recorded one 10-second video consisting of 204 frames, which is what I tested my code on. I measured the success of the project by classification accuracy.
Classification was done over 4 hand shapes. Here are the templates I used for these shapes:
Here are some screenshots of my GUI for the hand shapes.
Here is a video with a demo of my code working.
To measure the success of the classification, I looked at many frames where the hand shape was changed slightly, moved in the scene, or varied in distance to the camera. Here is the confusion matrix.
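For reference, a confusion matrix over labeled frames can be tallied with a few lines of NumPy; the class names below are just the four shapes from this report, and the helper is illustrative rather than part of my submission:

```python
import numpy as np

def confusion_matrix(true_labels, pred_labels, classes):
    """Counts matrix: rows are the true class, columns the predicted class.
    `classes` fixes the row/column order."""
    idx = {c: i for i, c in enumerate(classes)}
    m = np.zeros((len(classes), len(classes)), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        m[idx[t], idx[p]] += 1
    return m

def accuracy(m):
    """Overall accuracy: the diagonal over the total count."""
    return np.trace(m) / m.sum()
```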
The accuracy of the project is quite high, especially for the fist hand shape: whenever there was a fist, it was correctly classified. However, the code did sometimes confuse other shapes with a fist, especially the thumb, which is somewhat expected given their similarity. It also makes sense that the palm hand shape was sometimes mistaken for the peace sign (about 10% of the time). Interestingly, however, peace was rarely mistaken for palm.
Overall, I am happy with the classification results, but I recognize that my code depends on skin-tone detection, which can be finicky and fragile, as seen in the video above. Lighting changes caused parts of the background to be classified as skin. Even in these cases, though, the template matching performed reasonably well. If the skin were shadowed or glared on to the point that it evaded the detection thresholds, my code would no longer work. It would also likely perform much worse with multiple people or hands in the scene.
If I had more time to work on this project, I would improve my GUI. Beyond aesthetic changes, I would work on the consistency of the bounding box, which, as seen in the video, can jump around due to noise. One fix is to threshold the projections, though the right threshold may depend on the video; another is to find the largest contiguous run in each projection and draw the box around that. Since the classification accuracy was already high, I did not consider this a priority given the time constraints.
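The largest-contiguous-run idea could be sketched like this, with the threshold left as an assumption to tune per video:

```python
import numpy as np

def largest_run(projection, thresh=0):
    """Return (start, end) of the longest contiguous run where the
    projection exceeds `thresh` (end is exclusive). Applied to the
    row and column projections, this would ignore small noise blobs
    away from the hand."""
    active = projection > thresh
    best = (0, 0)
    start = None
    # Append False so a run touching the end is closed properly
    for i, a in enumerate(np.append(active, False)):
        if a and start is None:
            start = i
        elif not a and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best
```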
Overall, I think this experiment was a success, though it has much room for improvement. Obvious next steps would be to make it robust to orientation changes, to classify more hand shapes, or even to recognize moving gestures.
Credits and Bibliography
I had no teammate but discussed the assignment briefly with Nam Pham.