Gesture recognition can be seen as a way for computers to begin to understand human body language, thus building a richer bridge between machines and humans than text or graphical user interfaces, which still limit the majority of input to keyboard and mouse. I implemented basic hand gesture recognition in Python 3 with OpenCV. I built a working structure for detecting 2 static hand gestures and 1 dynamic hand gesture, and more gestures could be added if future requirements call for them.
Method and Implementation
I implemented the project in Python 3 with the packages opencv2, numpy, math, matplotlib, and collections. For the recognition capability itself I use only the OpenCV library; the other packages handle simple math calculations and plotting the result pictures.
In the code, I activate the computer's built-in camera. If no camera is detected, the program stops. The project contains two major parts: static hand gesture recognition and dynamic hand gesture recognition.
Static hand gesture recognition:
I use the computer's camera to capture live video (a stream of frames) and analyze each frame to check whether it shows a rock sign (a fist) or a paper sign (the same as the sign for the number five).
1. mySkinDetect(src): this method detects skin-colored pixels, turning them white and everything else black. The function traverses every pixel in the frame and checks whether the pixel's r, g, b values sit within a certain range; if they do, the pixel is treated as skin, otherwise as background. This step reduces the camera's sensitivity to unrelated background, which improves accuracy and efficiency during testing and saves time when the subject positions the hand. The function takes an original frame as input and outputs a black-and-white frame.
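This step can be sketched as follows. The specific r, g, b thresholds below are a commonly used skin-color heuristic rather than necessarily the exact values in my code, and the per-pixel loop is vectorized with numpy:

```python
import numpy as np

def my_skin_detect(src):
    """Return a binary mask: white where a pixel looks like skin, black elsewhere.
    The RGB rule below is a common skin-color heuristic (assumed thresholds)."""
    # OpenCV frames are BGR; cast channels to signed ints to avoid uint8 overflow
    b = src[:, :, 0].astype(int)
    g = src[:, :, 1].astype(int)
    r = src[:, :, 2].astype(int)
    skin = (r > 95) & (g > 40) & (b > 20) & (r > g) & (r > b) & (np.abs(r - g) > 15)
    return (skin * 255).astype(np.uint8)
```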
2. preprocessing(frame): preprocesses the frame captured by the camera. First, it calls mySkinDetect(frame) to get the black-and-white frame (white is the hand). Then it applies OpenCV's Gaussian blur to smooth the frame, decreasing the detection function's sensitivity so that minor sharp fragments do not interfere with the process. Finally, it applies OpenCV's threshold function to produce the final black-and-white frame for the next steps: the threshold function checks each pixel and, if its value is larger than a threshold, changes it to white.
3. Contours(thresh): finds the biggest contour, visualizes it, and returns both the contour and the visualized frame. It first calls OpenCV's findContours on the preprocessed frame to get all the contours, then scans them for the biggest one, which would ideally be the hand contour. It then draws both the hand contour and the convex hull bounding it, for both gesture detection and visualization.
4. Fingers(cnt, drawing): this function detects fingers. It takes the biggest contour in the frame and the frame output by the contour function, and relies on a widely used technique for finger detection. First it uses OpenCV's convexityDefects to get the defects, which correspond to the areas between fingers in the 2D image. It then goes through all the defects and computes the angle at each defect using math.acos. If the angle is less than 90 degrees, the two points flanking the defect are treated as fingers, and the function outputs the number of such defects in the frame.
5. The main part: this part uses all the functions above. First open the camera, then draw a small rectangular area for hand detection. This small area, the subframe, is passed through preprocessing, the contour function, and the finger function to get the number of defects and all the visualizations. The static gesture is then decided by the number of defects: if the number is 0 it's the rock shape, if it's 4 it's the paper shape, and otherwise it's chaos (not a gesture). The real-world frame, the thresholded frame, and the contour frame are all shown for an intuitive visualization.
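The decision rule at the end of the loop is simple enough to state directly (the camera and display plumbing is omitted here):

```python
def classify_static(num_defects):
    """Map the defect count from the finger step to a static gesture label."""
    if num_defects == 0:
        return "rock"        # a fist produces no gaps between fingers
    if num_defects == 4:
        return "paper"       # five spread fingers produce four gaps
    return "chaos"           # anything else is not a declared gesture
```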
Dynamic hand gesture recognition:
1. myFrameDifferencing(prev, curr): performs frame differencing between the current frame and the previous frame. prev is the previous color image and curr is the current color image. It returns a destination grayscale image in which a pixel is white if the corresponding pixel intensities in the current and previous images are not the same.
2. myMotionEnergy(mh): accumulates the frame differences over a certain number of pairs of frames. mh is a vector of frame-difference images. It returns a destination grayscale image storing the accumulation of the frame-difference images: if a pixel is white in any of the 3 frames (each mh holds 3 frames), the destination pixel is made white.
3. takeDynamicPics(n): takes n pictures from the camera when the key 'q' is pressed. The pictures are motion-energy frames built from frame differences. First we create template frames for the motion-energy frames; then we read new frames from the live video, perform frame differencing, and visualize the motion-energy history using the two helper functions above.
4. templateMatching(img): a method for searching for and finding the location of a template image within a larger image. It slides the template over the input image (like a 2D convolution) and compares the template with the patch of the input image beneath it. Both image and template are first resized to 50%. They are then converted to grayscale, and OpenCV's matchTemplate function finds the matching region, which is enclosed in a blue rectangle.
5. Finally, run a loop that performs template matching 20 times to calculate the accuracy via a confusion matrix.
To tune the code, I changed the GaussianBlur kernel from (9, 9) to (41, 41) and the threshold value from 25 to 125, which gave better detection results.
In the static part, I test 2 gestures: rock and paper. For each gesture, I run 20 tests. For example, when testing the rock gesture, in the first 10 tests I pose the target gesture (rock), in the next 5 tests I pose a paper gesture, and in the final 5 tests I pose another gesture (e.g., a victory sign). The results are then used to build a confusion matrix and compute the accuracies.
In the dynamic part, I test the hand-wave gesture, again with 20 tests. First I record an ideal template, then set the loop count to 20 and press 'q' 20 times during recording. All the output images are then plotted and used to build the confusion matrix.
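The accuracy bookkeeping for these tests reduces to simple arithmetic on the confusion matrix. The counts in the example below are hypothetical, not my actual test results:

```python
import numpy as np

def accuracy_from_confusion(cm):
    """Overall accuracy: correct predictions (the diagonal) over all tests."""
    cm = np.asarray(cm, dtype=float)
    return cm.trace() / cm.sum()

def per_class_accuracy(cm):
    """Per-gesture accuracy: each diagonal entry over its row total."""
    cm = np.asarray(cm, dtype=float)
    return np.diag(cm) / cm.sum(axis=1)
```

For example, a hypothetical 20-test run with rows and columns ordered (rock, paper) and counts [[10, 0], [2, 8]] gives an overall accuracy of 0.9 and per-class accuracies of 1.0 and 0.8.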
The accuracy of the algorithm depends on how accurately it can predict each of the chosen gestures, namely a hand wave, a paper sign, and a rock sign, with any other gesture labeled as chaos. The overall accuracy of the algorithm is 77.5%. The picture below shows the accuracy for each individual gesture. As can be seen, the gesture with the highest accuracy is the rock sign, at 100%. However, when I experimented further with how the algorithm recognizes objects, I realized that any object without 'dents' in its shape is categorized as a rock, which highlights a problem with my rock-sign classification.
The gesture with the lowest accuracy was the dynamic hand wave, at 65%. My algorithm took a picture of the motion-energy frame whenever the user hit the 'q' key and compared it to the picture taken earlier that was classified as a desirable output. I believe the accuracy is so low for dynamic gesture recognition because motion-energy frames show the frame differences as white pixels on a black background. Since any movement in the frame is captured by the motion difference, many pixels were recognized as the desired gesture simply because there was a lot of noise in the background.
The pair of pictures for the gesture Paper illustrates a case of success and a case of failure. The gesture is, overall, well recognized; however, as established, if the background is not a uniform color, the accuracy of the algorithm drops and it does not produce a correct result.
The pair of pictures for the gesture Rock illustrates a case of success and a case of failure. The gesture, when tested solely on its own, is 100% accurate. However, as mentioned earlier, the algorithm also classifies other objects with a fairly uniform shape as a rock. Additionally, as can be seen from the example above, the thumbs-up gesture is classified as a rock: the algorithm does not detect enough convexity defects in the gesture to classify it as chaos. Therefore, in the future, the classifier has to be adjusted to better characterize the rock gesture.
The pair of pictures for the gesture Handwave (Wave) illustrates a case of success and a case of failure. As mentioned before, the template-matching algorithm reported a template match for everything I tested. Although I tried the various comparison methods provided by the OpenCV library, none of them solved the issue. I believe the motion-energy function has to be changed in the future to become less sensitive to noise, which would help the template-matching function produce more accurate results.
The pair of pictures for the gesture Chaos illustrates a case of success and a case of failure. The algorithm fairly successfully classifies any non-declared gesture as chaos. However, as seen in the Rock failure example, a thumbs-up gesture that should be chaos is recognized as a rock sign. Therefore, going forward, I have to adjust how a chaos gesture is classified to improve my algorithm's accuracy.
Techniques and features
These are the techniques listed in the requirements that I used:
1. access video camera input with OpenCV
2. frame-to-frame differencing
3. template matching
4. motion energy templates
5. skin-color detection
Discussion of the method and results:
The overall accuracy of the algorithm is 77.5%, which can be classified as moderately accurate. The two main problems are the classification of a thumbs-up gesture as a rock sign and the recognition of the dynamic hand wave. I expected my dynamic gesture recognition to lack some accuracy, but not to fall as far short as it did in reality.
For the future, I would like to do more research on dynamic gesture recognition algorithms and on other computer vision techniques that could be added to improve the accuracy of the results. Additionally, I would like to correct how the rock sign is classified by specifying more precisely what we expect the shape to look like. Another way to increase the overall accuracy would be to add more gestures that the camera can recognize and classify; I would enumerate the differences between gestures and the angles between fingers for each gesture to cover more of the scenarios that could be presented to the camera.
The model I created for static and dynamic gestures is a working model with reasonable accuracy. For better performance, it should be extended with additional computer vision techniques.
Credits and Bibliography
Training a Neural Network to Detect Gestures with OpenCV in Python (Towards Data Science): https://towardsdatascience.com/training-a-neural-network-to-detect-gestures-with-opencv-in-python-e09b0a12bdf1. (Accessed March 28, 2019).
Hand Gesture Recognition using Python and OpenCV - Part 1 (GitHub): https://gogul09.github.io/software/hand-gesture-recognition-p1. (Accessed March 28, 2019).
Contour Features (OpenCV): https://docs.opencv.org/3.1.0/dd/d49/tutorial_py_contour_features.html. (Accessed March 28, 2019).
Convex Hull (OpenCV): https://docs.opencv.org/trunk/d7/d1d/tutorial_hull.html. (Accessed March 28, 2019).
Template matching using OpenCV in Python (GeeksforGeeks): https://www.geeksforgeeks.org/template-matching-using-opencv-in-python/. (Accessed April 1, 2019).
Template Matching (Read the Docs): https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_template_matching/py_template_matching.html#template-matching-in-opencv. (Accessed April 1, 2019).
Template Matching (OpenCV): https://docs.opencv.org/126.96.36.199/doc/tutorials/imgproc/histograms/template_matching/template_matching.html. (Accessed April 2, 2019).