The main purpose of this assignment is to implement an algorithm that can recognize different hand shapes and gestures from a webcam (real-time) video as well as provide a visual representation of those results. More precisely, the tasks are to:
- Read and display each video frame
- Segment the critical parts of the frame for the recognition process
- Use the template matching algorithm to determine which hand shapes are in the frame
- Track the gestures using multiple frames
- Create a visual representation of the results
Method and Implementation
- Template generation
- Template matching
- Movement detection
- Motion energy template matching
- Object center tracking
- Visual representation
To create template images, we took photos of different hand shapes using both the left and right hand, all against a white background. The shapes were: index finger pointing up, closed fist, palm with closed fingers, palm with spread fingers, thumb up, and thumb down. We then used the skin color detection algorithm from lab 3 to extract the hand in each image, and removed noise by thresholding and keeping only the area inside the largest contour. Finally, we converted each hand template to grayscale and blurred it to obtain a more generic hand shape.
The pre-processing of each frame closely follows the way we generated the template images. First, we used skin color detection to extract any area of the frame with a hand, leaving the rest of the frame black. Then, we applied the mathematical morphology operation called opening to remove small skin-colored noise. We used an elliptical kernel so that the image stays "curvy" after opening and still resembles an actual hand shape. For the remaining larger blobs, we used thresholding to obtain binary representations and calculated the area of each blob from its contour. A blob must meet an area threshold to be identified as a hand; in this project, the threshold is set to 8000 pixels. Blobs that meet the threshold are then passed to template matching.
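The area filter described above can be sketched as follows. This is a dependency-free illustration, not the program's actual code: `polygon_area` and `filter_hand_candidates` are hypothetical names, contours are assumed to be lists of (x, y) vertices, and the shoelace formula stands in for whatever contour-area routine the program uses. Only the 8000-pixel threshold comes from the report.

```python
MIN_HAND_AREA = 8000  # area threshold in pixels, as stated in the report


def polygon_area(contour):
    """Area enclosed by a closed contour of (x, y) vertices (shoelace formula)."""
    total = 0.0
    n = len(contour)
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the contour
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def filter_hand_candidates(contours, min_area=MIN_HAND_AREA):
    """Keep only the contours whose enclosed area meets the threshold."""
    return [c for c in contours if polygon_area(c) >= min_area]


# Two illustrative square blobs: 50x50 (too small) and 100x100 (large enough)
small_blob = [(0, 0), (50, 0), (50, 50), (0, 50)]
large_blob = [(0, 0), (100, 0), (100, 100), (0, 100)]
candidates = filter_hand_candidates([small_blob, large_blob])
```

Here only `large_blob` survives the filter and would be handed to template matching.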
Once we had obtained the larger skin-colored blobs, we used the template matching algorithm to check whether each blob is a hand shape or not.
First, we resized all the template images to the blob size. Resizing the templates avoids the bias in NCC and SSD values caused by comparing images of different sizes. We chose to resize the templates rather than use an image pyramid because resizing is more efficient (one match per image) and more effective (we do not need to worry about choosing proper scale ratios).
Second, we compared the grayscale images using the normalized correlation coefficient (NCC). One benefit of the NCC is that it is normalized to the range [-1, 1], so it is easy to set a single threshold that decides whether a blob should be accepted as a hand shape, no matter how many hands are on the screen. Moreover, compared with computing the sum of squared differences (SSD) on binarized images, the NCC on grayscale images retains more information about the hand shapes, which can improve recognition performance.
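To make the [-1, 1] property concrete, here is a minimal sketch of the normalized correlation coefficient for two images of equal size. It assumes images are plain 2D lists of grayscale intensities; the name `ncc_sketch` is hypothetical, and the code is not the project's vectorized `ncc` function.

```python
import math


def ncc_sketch(img, template):
    """Normalized correlation coefficient of two equally sized grayscale images.

    NCC = sum((a - mean_a) * (b - mean_b))
          / sqrt(sum((a - mean_a)^2) * sum((b - mean_b)^2))
    """
    a = [p for row in img for p in row]
    b = [p for row in template for p in row]
    if len(a) != len(b):
        raise ValueError("images must have the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [p - mean_a for p in a]
    db = [p - mean_b for p in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a constant image carries no shape information
    return sum(x * y for x, y in zip(da, db)) / denom


blob = [[10, 200], [200, 10]]
brighter = [[20, 210], [210, 20]]  # same pattern with shifted brightness
inverted = [[200, 10], [10, 200]]  # opposite pattern

score_same = ncc_sketch(blob, brighter)
score_opposite = ncc_sketch(blob, inverted)
```

Because the means are subtracted and the result is divided by the norms, a uniform brightness shift leaves the score at 1, while the inverted pattern scores -1.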
Furthermore, to account for slight differences in angle between blobs and templates, we also applied image rotations before matching. We created rotated copies of each blob image, compared them with the template images, and kept the best-matching value.
For movement detection, we considered two methods: motion energy template matching and object center tracking.
For motion energy template matching, we created a template from the union of binary difference images over the area of a waving hand, and used that area as a sub-image. For the real-time input, we kept a list of recent binary images and matched their union against the waving-hand template. However, because template matching was already the main performance bottleneck, we eventually discarded this method.
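The union of binary difference images can be illustrated with a small dependency-free sketch. It assumes frames are 2D lists of grayscale intensities; the function names and the intensity-change threshold of 30 are hypothetical choices, not values from the project.

```python
def binary_difference(prev_frame, frame, threshold=30):
    """Mark pixels whose intensity changed by more than `threshold`."""
    return [
        [1 if abs(p - q) > threshold else 0 for p, q in zip(prev_row, row)]
        for prev_row, row in zip(prev_frame, frame)
    ]


def motion_energy(frames, threshold=30):
    """Pixel-wise union (OR) of the difference images of consecutive frames."""
    height, width = len(frames[0]), len(frames[0][0])
    energy = [[0] * width for _ in range(height)]
    for prev_frame, frame in zip(frames, frames[1:]):
        diff = binary_difference(prev_frame, frame, threshold)
        energy = [
            [e | d for e, d in zip(e_row, d_row)]
            for e_row, d_row in zip(energy, diff)
        ]
    return energy


# A bright "hand" pixel moving left to right across a 1x4 strip
frames = [
    [[255, 0, 0, 0]],
    [[0, 255, 0, 0]],
    [[0, 0, 255, 0]],
]
energy = motion_energy(frames)
```

The resulting image marks every pixel the moving object touched, which is the kind of region that would then be matched against a waving template.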
For each object found, we kept track of the center of its bounding box. Across a sequence of frames, we matched each object in the new frame to an object in the previous frame by the Euclidean distance between their centers. The threshold is set to 10000 pixels: if an object in the new frame is within 10000 pixels of an object in the previous frame, we assume that they are the same object. This gives a sequence of center coordinates for each object in the image.
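A minimal sketch of this frame-to-frame association is shown below. The report does not say how competing candidates are resolved, so the greedy nearest-neighbour choice, the function name `match_centers`, and the illustrative `max_distance=50` are all assumptions; the report's own threshold is 10000.

```python
import math


def match_centers(prev_centers, new_centers, max_distance):
    """Greedily pair each new center with the nearest unused previous center.

    Returns a dict mapping index in new_centers -> index in prev_centers,
    or None when no unused previous center lies within max_distance.
    """
    matches = {}
    used = set()
    for i, (x, y) in enumerate(new_centers):
        best_j, best_d = None, max_distance
        for j, (px, py) in enumerate(prev_centers):
            if j in used:
                continue
            d = math.hypot(x - px, y - py)  # Euclidean distance
            if d <= best_d:
                best_j, best_d = j, d
        matches[i] = best_j
        if best_j is not None:
            used.add(best_j)
    return matches


prev_centers = [(100, 100), (400, 300)]
new_centers = [(410, 305), (105, 98), (900, 50)]
matches = match_centers(prev_centers, new_centers, max_distance=50)
```

In this example the first two new centers are linked to their nearby predecessors, while the third has no match and would start a new track.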
From the sequence of center points of an object, we calculate the mean x coordinate and then the average absolute deviation of the x coordinates from that mean. If this average error lies between a lower bound and an upper bound, the object is considered to be waving. If the average error is smaller than the lower bound, the movement is attributed to the natural movement of the body.
If the average error is above the waving upper bound, the hand is considered to be drawing.
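The three-way decision can be sketched as follows. The report does not state the bounds, so `wave_lower=5` and `wave_upper=40` are hypothetical values, as are the function name and the "still" label used for natural body movement.

```python
def classify_motion(x_coords, wave_lower, wave_upper):
    """Label a track from the mean absolute deviation of its x coordinates."""
    mean_x = sum(x_coords) / len(x_coords)
    avg_error = sum(abs(x - mean_x) for x in x_coords) / len(x_coords)
    if avg_error < wave_lower:
        return "still"  # natural movement of the body
    if avg_error <= wave_upper:
        return "waving"
    return "drawing"


# Hypothetical bounds, chosen only for illustration
LOWER, UPPER = 5, 40

label_still = classify_motion([100, 101, 99, 100], LOWER, UPPER)
label_wave = classify_motion([100, 120, 100, 120], LOWER, UPPER)
label_draw = classify_motion([0, 100, 200, 300], LOWER, UPPER)
```

The average errors of the three example tracks are 0.5, 10, and 100, so they fall below, between, and above the bounds respectively.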
For areas detected to be a hand, the program will draw the bounding box of the object, along with the classification label on top or bottom of the box.
For waving gestures, the program will attach the "waving" label on the blobs on the screen. When there are drawing movements, we use the trace queue that we stored previously to draw the paths of the movement, so that it's easy to see what has been drawn.
The program has several bottlenecks, the most significant being template matching. For this reason, we replaced motion energy template matching with simple center tracking. The result was better than we expected.
Other optimizations involve vectorizing some functions. For example, vectorizing the NCC function reduced the running time of the template matching function by around 10%. Also, the code has been formatted for better performance.
imgMax(img): Takes an image and returns a matrix of the maximum values of a 3-channel image.
imgMin(img): Takes an image and returns a matrix of the minimum values of a 3-channel image.
skinDetect(img): Takes an image and returns it with a black background and the detected skin as foreground.
gesture_identifier(mirror=False): The main program, which runs the gesture classification algorithm.
ncc(img, template): Takes a grayscale sub-image and a template as input and outputs the normalized correlation coefficient value.
remove_padding(binary_imgs): Takes binary images as input and outputs de-noised images without padding (only the blob itself inside its bounding box).
template_matching(imgs, templates, method='binary_ssd'): Takes a sequence of grayscale images and templates, plus a method parameter (binary_ssd or grayscale_ncc), as input. Outputs the best-matched template for each image in the sequence along with the SSD/NCC values.
The experiment was implemented in Python. Source code can be found here.
We used several templates in our experiment; only some of them are shown on this page. All templates can be found here.
|Fist||Thumb Up||Palm||Index Finger|
|Hand||Fist||Thumb up||Thumb down|
|Index finger||Double hands||Waving||Drawing|
|Hand||Fist||Thumb up||Thumb down||Index finger|
One problem with thresholding out small objects that are similar to skin color is that the detected skin area depends on the distance between the hand and the camera. If the hand is too far from the camera, it might not be classified as a skin blob. In addition, the skin detection algorithm relies on hard-coded thresholds and is therefore very sensitive to lighting conditions: skin in a darker area may not be recognized as skin at all.
Our template matching implementation rotates the template and sub-image by plus or minus 15 degrees, which makes the algorithm somewhat more robust. We think a better-performing template matching algorithm would first calculate the orientation of the sub-image found and then align it with the orientation of the template.
A drawback of the drawing classification is that it only tracks the average error in the x coordinate, so if the hand moves vertically the algorithm cannot properly classify the movement as drawing.
This project allowed us to familiarize ourselves with traditional methods of object detection based on color detection. It also exposed many of the drawbacks of these methods, such as sensitivity to lighting, low efficiency, and sensitivity to image distortion. However, it is an easy solution that can be implemented quickly in many applications with moderate accuracy requirements. Further improvements to the project include vectorizing template matching and migrating from Python to C++.
Credits and Bibliography
CS 585 - Lab 3 Solution - Teaching Fellow Yifu Hu
Classmates: Shijie Zhao, Jamie Nelson