Part 1 - Object Detection by Template Matching

CS 585 HW 2
Yizhang Bai
Teammates: Haoyu Zhang, Hui Li
Sep 21, 2016

Problem Definition

The problem in Part 1 is to track an object in a stream of webcam images by template matching.

Method and Implementation

The method we use to perform tracking is template matching. First, we prepare a template of the object we want to recognize. Then we slide the template over the source image pixel by pixel, both vertically and horizontally, so that it visits every location in the source image. At every location, the patch of the source image under the template is compared with the template and the difference between the two is calculated. After scanning all locations, we select the location with the least difference from the template as the best match. The selection criterion depends on the difference measure: for squared-difference measures the best match is the minimum of the result, while for correlation measures it is the maximum.
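The sliding comparison described above can be sketched without OpenCV. This is a minimal illustration only, assuming a sum-of-squared-differences measure and a hypothetical `Gray` image type; the report itself relies on OpenCV's matchTemplate, and the names `Gray`, `Match`, and `matchBySsd` are not from the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical minimal grayscale image: row-major pixel values.
struct Gray {
    int rows, cols;
    std::vector<int> px;
    int at(int r, int c) const { return px[static_cast<std::size_t>(r) * cols + c]; }
};

// Top-left corner of the best match and its difference score.
struct Match { int row, col; long long ssd; };

// Slide templ over img one pixel at a time and return the location with
// the smallest sum of squared differences (SSD).
Match matchBySsd(const Gray& img, const Gray& templ) {
    assert(templ.rows <= img.rows && templ.cols <= img.cols);
    Match best{0, 0, std::numeric_limits<long long>::max()};
    for (int r = 0; r + templ.rows <= img.rows; ++r) {
        for (int c = 0; c + templ.cols <= img.cols; ++c) {
            long long ssd = 0;
            for (int tr = 0; tr < templ.rows; ++tr) {
                for (int tc = 0; tc < templ.cols; ++tc) {
                    long long d = img.at(r + tr, c + tc) - templ.at(tr, tc);
                    ssd += d * d;
                }
            }
            if (ssd < best.ssd) best = Match{r, c, ssd};
        }
    }
    return best;
}
```

For a 4x4 image that contains the 2x2 template exactly at row 1, column 1, `matchBySsd` returns that corner with a score of 0. With a correlation measure the comparison would keep the largest score instead of the smallest.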

Additionally, instead of using a template of only one size, we apply multi-resolution processing to the template and obtain an image pyramid that collects templates at different scales. We compare the templates at each size with every location in the source image to find the best match. This handles objects that appear at different scales.

We also apply a simple heuristic to reduce the noise, or false positives, during detection. Assuming the object always moves slowly, the best match locations in two successive frames should not be far from each other. If the match location in the current frame is far from the previous one, we treat the current location as a false positive and keep the previous location as the match location for the current frame. To do so, we set a range of 400 pixels. If the absolute difference between the current location and the previous location is larger than 400 pixels along either axis, we treat it as a false positive and keep the previous location. Similarly, because the object moves slowly, its size also changes slowly. Thus, if the scale level suddenly increases or decreases by two or more levels, we also treat it as a false positive. When such a sudden jump happens, we use the previous scale level instead of the new value.


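Load the template and build the image pyramid.
     // templ[1] is the original template, loaded as a color image.
     templ[1] = imread("template.jpg", 1);
     // templ[2] is twice the size, templ[0] is half the size.
     pyrUp(templ[1], templ[2], Size(templ[1].cols * 2, templ[1].rows * 2));
     pyrDown(templ[1], templ[0], Size(templ[1].cols / 2, templ[1].rows / 2));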

Do the matching and normalizing.
     matchTemplate(img, templ[i], result[i], match_method);
     normalize(result[i], result[i], 0, 1, NORM_MINMAX, -1, Mat());

Find the minimum/maximum value and location for each template size.
     minMaxLoc(result[i], &minValTemp, &maxValTemp, &minLocTemp, &maxLocTemp, Mat());

Find the best match among the templates at three different sizes.
     for (int i = 0; i < 3; i++) {
          if (minVal[i] < min) {
               min = minVal[i];
               templateScale = i;
          }
     }

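Reduce false positives.
     // Reject a jump of two or more pyramid levels and keep the previous scale.
     if (abs(templateScale - previousScale) > 1)
          templateScale = previousScale;
     // Always take the location from the scale that was kept.
     matchLoc = minLoc[templateScale];
     // Reject a location that jumps more than windowSize pixels along either axis.
     if (abs(matchLoc.x - previousLoc.x) > windowSize || abs(matchLoc.y - previousLoc.y) > windowSize)
          matchLoc = previousLoc;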
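The same false-positive filter can be written as two small self-contained functions, which makes the thresholds easy to check in isolation. This is only a sketch: `Pt`, `filterScale`, and `filterLocation` are hypothetical names, and `maxJump` plays the role of `windowSize` (400 pixels in our setup).

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical stand-in for an image location.
struct Pt { int x, y; };

// Keep the previous pyramid level if the candidate jumps by two or more levels.
int filterScale(int prevScale, int candScale) {
    return std::abs(candScale - prevScale) > 1 ? prevScale : candScale;
}

// Keep the previous location if the candidate moved more than maxJump pixels
// along either axis.
Pt filterLocation(Pt prev, Pt cand, int maxJump) {
    assert(maxJump >= 0);
    bool jumped = std::abs(cand.x - prev.x) > maxJump ||
                  std::abs(cand.y - prev.y) > maxJump;
    return jumped ? prev : cand;
}
```

A caller would first filter the scale and then look up the candidate location for the scale that was kept, for example `filterLocation(previousLoc, minLoc[filterScale(previousScale, templateScale)], 400)`. Note that a move of exactly 400 pixels is still accepted, because the comparison is strict.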

Finally, draw the bounding box.
     rectangle(img_display, matchLoc, Point(matchLoc.x + templ[templateScale].cols, matchLoc.y + templ[templateScale].rows), Scalar::all(0), 2, 8, 0);
     rectangle(result[templateScale], matchLoc, Point(matchLoc.x + templ[templateScale].cols, matchLoc.y + templ[templateScale].rows), Scalar::all(0), 2, 8, 0);

We track the ball in the source scenes using the template shown above. 200 frames are processed. A detection is counted as correct if the bounding box delineates the object or part of the object. The result is evaluated by the detection rate, the percentage of correct detections.

Detection Rate = Number of correct detections / Number of all detections.


Detection Rate:
Number of correct detections: 147
Number of all detections: 200
Detection rate = 147/200 = 73.5%
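The same arithmetic as a tiny helper, in case the counts are collected automatically. The function name `detectionRatePercent` is a hypothetical one chosen for this sketch.

```cpp
#include <cassert>

// Percentage of frames in which the detection was judged correct.
double detectionRatePercent(int correct, int total) {
    assert(total > 0 && correct >= 0 && correct <= total);
    return 100.0 * correct / total;
}
```

With the counts reported here, `detectionRatePercent(147, 200)` gives 73.5.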

Detection of different scale:

Successful detection:

Unsuccessful detection:

Detection of different orientation:


The detection rate shows the strength of our program. With an acceptable detection rate of 73.5%, our program successfully tracks the object in nearly three quarters of the frames. By using an image pyramid on the template and basic noise reduction, we improve detection accuracy when the object size changes and reduce false detections. In terms of orientation, our red ball is successfully detected at different orientations.

However, limitations still exist in our program. Because we do not separate the object from the background, many false positives occur in the background and add strong noise to the detection. Thus, if we had more time, we would perform background subtraction to improve the detection rate.


Based on these results, we think the detection rate of our program is basically acceptable. Although techniques such as the template pyramid and noise reduction are applied, there are other factors affecting our result. Due to the time limit, we did not implement background subtraction, which we think might considerably improve the detection rate of our program.

Part 2 - Recognition of Hand Shapes or Gestures

Problem Definition

Design and implement algorithms that delineate hand shapes (such as making a fist, thumbs up, thumbs down, pointing with an index finger etc.) or gestures (such as waving with one or both hands, swinging, drawing something in the air etc.) and create a graphical display that responds to the recognition of the hand shapes or gestures. For your system, you could use some of the following computer vision techniques that were discussed in class:

Method and Implementation

Our system uses the following steps to recognize hand shapes and gestures:

  1. Skin detection: we use pixel color to recognize human skin.
  2. Template matching: we scan the source images exhaustively and use the matchTemplate function in OpenCV to match hand shapes.
  3. Compare results: we compare the outputs of the different matching results, take the best one as our final result, and draw a bounding box at the best matching point.
  4. Motion detection: we continually calculate the frame difference and accumulate the frame differences into a motion energy image. Using the bounding box obtained in step 3, we count the white pixels inside the bounding box on the motion energy image. We use this count to measure the intensity of movement.

Skin detection function: find the pixels whose color matches human skin.
Frame difference function: calculate the absolute value of the difference between the previous frame and the current frame.
Motion energy function: use bitwise_or to accumulate several frame difference images.
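The frame difference, motion energy, and white-pixel count can be sketched on plain arrays as follows. This is an illustration under stated assumptions: frames are row-major grayscale vectors, the difference is thresholded into a binary image (the threshold `thresh` is an assumption, not a value from our program), and the names `Frame`, `frameDifference`, `motionEnergy`, and `countWhite` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

using Frame = std::vector<int>;  // row-major grayscale, values 0..255

// Binary frame difference: 255 where |cur - prev| exceeds thresh, else 0.
Frame frameDifference(const Frame& prev, const Frame& cur, int thresh) {
    assert(prev.size() == cur.size());
    Frame out(cur.size());
    for (std::size_t i = 0; i < cur.size(); ++i)
        out[i] = std::abs(cur[i] - prev[i]) > thresh ? 255 : 0;
    return out;
}

// Motion energy: bitwise OR of several binary difference images.
Frame motionEnergy(const std::vector<Frame>& diffs) {
    assert(!diffs.empty());
    Frame out(diffs[0].size(), 0);
    for (const Frame& d : diffs) {
        assert(d.size() == out.size());
        for (std::size_t i = 0; i < out.size(); ++i) out[i] |= d[i];
    }
    return out;
}

// Count white pixels inside the box [x0, x0 + w) x [y0, y0 + h).
int countWhite(const Frame& img, int cols, int x0, int y0, int w, int h) {
    assert(cols > 0);
    int n = 0;
    for (int y = y0; y < y0 + h; ++y)
        for (int x = x0; x < x0 + w; ++x)
            if (img[static_cast<std::size_t>(y) * cols + x] != 0) ++n;
    return n;
}
```

The count returned by `countWhite` for the bounding box found by template matching is the movement intensity measure: the more pixels changed inside the box over the accumulated frames, the stronger the motion.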


We carried out some experiments on the skin detection function, and in the end we found suitable thresholds that produce the best output. Because there are still many noisy points in the image after skin detection, we tried methods such as blur() and erode(), and finally we use the erode function to eliminate the noise. For the template matching part, we found that the size of the templates and the position of the hand in the template images always influence the outcome. In the end we found a suitable way to generate templates, and it produces a good result.

Accuracy: images without hands: 83%, fist: 82%, palm: 90%, thumbs up: 91%, thumbs down: 83%
Complexity: O(width*height)




My contribution

The experiments have two basic parts:

  • Template matching
  • Gesture recognition

  • For each part, my contribution is:
    For the first part: create and improve the template matching algorithms. Beyond the basic algorithms described above, I added some ways to improve the precision of the template matching:

    1. Create a series of sizes of a template, such as 1/2, 3/4, 1, 3/2, etc. When doing the template matching, for each picture captured from the camera, we run the matching with each template size and compare the results: taking the largest MaxVal or the smallest MinVal gives the template size most similar to the object on the screen. We then use that size for the bounding rectangle of the template, and accordingly get the most precise template matching.

    2. Suppose the template we match has a single pure color. Inspired by skin detection, we can obtain the RGB value of the template's color. We then process the frame so that pixels with a color similar to the template are kept (for example, set to white) and pixels with other colors are set to black. With the processed picture, we can match the template more precisely.
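The color masking in item 2 can be sketched as a per-pixel test against the template color. This is a minimal illustration: the per-channel tolerance `tol` and the names `Rgb` and `colorMask` are assumptions made for the example, not values or identifiers from our program.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

struct Rgb { int r, g, b; };  // channel values 0..255

// Return a binary mask: 255 where every channel is within tol of the
// target color, 0 elsewhere.
std::vector<int> colorMask(const std::vector<Rgb>& img, Rgb target, int tol) {
    assert(tol >= 0);
    std::vector<int> mask(img.size(), 0);
    for (std::size_t i = 0; i < img.size(); ++i) {
        const Rgb& p = img[i];
        bool close = std::abs(p.r - target.r) <= tol &&
                     std::abs(p.g - target.g) <= tol &&
                     std::abs(p.b - target.b) <= tol;
        if (close) mask[i] = 255;
    }
    return mask;
}
```

For a red ball with template color around (200, 30, 30), a reddish pixel such as (210, 20, 40) is kept as white with a tolerance of 40, while a green pixel such as (30, 200, 30) becomes black. Template matching can then run on the mask instead of the raw frame.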

    For the second part: create a more efficient skin detection. For example, there are three main ways to represent colors in an image: RGB, YCrCb, and HSV. When detecting human skin, we can test each pixel in all three color spaces, and only if all three tests say the pixel is skin do we output it as skin.
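The three-way agreement test can be sketched as below. All thresholds are illustrative assumptions taken from commonly cited skin-color rules, not the thresholds tuned in our experiments, and the function names are hypothetical. The YCrCb conversion follows the usual 8-bit formulas; hue is in degrees and saturation in the range 0 to 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdlib>

struct Rgb { int r, g, b; };  // channel values 0..255

// RGB rule (assumed thresholds): bright enough, reddish, not gray.
bool skinRgb(Rgb p) {
    int mx = std::max({p.r, p.g, p.b});
    int mn = std::min({p.r, p.g, p.b});
    return p.r > 95 && p.g > 40 && p.b > 20 && (mx - mn) > 15 &&
           std::abs(p.r - p.g) > 15 && p.r > p.g && p.r > p.b;
}

// YCrCb rule (assumed ranges for Cr and Cb).
bool skinYCrCb(Rgb p) {
    double y = 0.299 * p.r + 0.587 * p.g + 0.114 * p.b;
    double cr = (p.r - y) * 0.713 + 128.0;
    double cb = (p.b - y) * 0.564 + 128.0;
    return cr >= 133.0 && cr <= 173.0 && cb >= 77.0 && cb <= 127.0;
}

// HSV rule (assumed ranges for hue and saturation).
bool skinHsv(Rgb p) {
    int mx = std::max({p.r, p.g, p.b});
    int mn = std::min({p.r, p.g, p.b});
    if (mx == mn) return false;  // gray or black: hue is undefined
    double delta = mx - mn;
    double h;
    if (mx == p.r)      h = 60.0 * std::fmod((p.g - p.b) / delta + 6.0, 6.0);
    else if (mx == p.g) h = 60.0 * ((p.b - p.r) / delta + 2.0);
    else                h = 60.0 * ((p.r - p.g) / delta + 4.0);
    double s = delta / mx;
    return h >= 0.0 && h <= 50.0 && s >= 0.23 && s <= 0.68;
}

// A pixel counts as skin only if all three color-space tests agree.
bool isSkin(Rgb p) {
    assert(p.r >= 0 && p.r <= 255 && p.g >= 0 && p.g <= 255 &&
           p.b >= 0 && p.b <= 255);
    return skinRgb(p) && skinYCrCb(p) && skinHsv(p);
}
```

Requiring agreement of all three tests trades recall for precision: a pixel that only one color space mistakes for skin is rejected, at the cost of missing skin pixels that fall just outside any one of the ranges.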


    Discuss your method and results: