Software Setup:

The following experiments were run in Visual Studio 2017 Community Edition with OpenCV 3.3.0 on Windows (7 Enterprise and 10).


CS 585 HW 3

Yifu Hu
Teammates: Mona Jalal, Yi Zheng 


Part 1 - Object Shape Analysis and Segmentation


Problem Definition

Given binary images of a hand, a head, and a stained tissue sample of a breast cancer tumor, we label the connected components and display each one in a different color. For each labeled object/region, we draw its boundary, compute its area, orientation, and circularity, and finally compute its skeleton.


Method and Experiment

  1. Connected component labeling

      Recursive labeling algorithm (“flood fill algorithm”)

We scan the image row by row. When we find an unlabeled object pixel, we assign it a new label and examine its 8 neighbours. Every non-zero neighbour receives the same label and is searched in turn, until no unlabeled connected neighbours remain.

      Filter trivial components by its size

We are not interested in small, trivial objects and do not want to display them, so we filter by object size: any object whose width or length is smaller than 100 pixels is discarded.

      Erosion and dilation

We use opening (erosion followed by dilation) to reduce the blur on object boundaries and to remove noise, especially for the breast cancer tumor image. This removes many small, unwanted background objects that we consider noise.
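The labeling and size-filter steps above can be sketched as follows. This is a minimal illustration in standard C++, not the report's actual implementation: the `Image` and `Component` types and the function names are hypothetical, and an explicit stack stands in for the recursion described in the text so that large blobs cannot overflow the call stack.

```cpp
#include <algorithm>
#include <utility>
#include <vector>

// Binary image, row-major: 0 = background, non-zero = object.
using Image = std::vector<std::vector<int>>;

// Bounding box and pixel count of one labeled component (hypothetical type).
struct Component {
    int label;
    int minRow, maxRow, minCol, maxCol;
    int area;
};

// Labels 8-connected components with 1, 2, ... in raster-scan order.
std::vector<Component> labelComponents(const Image& img, Image& labels) {
    const int rows = static_cast<int>(img.size());
    const int cols = rows > 0 ? static_cast<int>(img[0].size()) : 0;
    labels.assign(rows, std::vector<int>(cols, 0));
    std::vector<Component> comps;

    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            if (img[r][c] == 0 || labels[r][c] != 0) continue;
            Component comp = {static_cast<int>(comps.size()) + 1, r, r, c, c, 0};
            std::vector<std::pair<int, int>> stack;
            stack.push_back(std::make_pair(r, c));
            labels[r][c] = comp.label;
            while (!stack.empty()) {
                const std::pair<int, int> p = stack.back();
                stack.pop_back();
                ++comp.area;
                comp.minRow = std::min(comp.minRow, p.first);
                comp.maxRow = std::max(comp.maxRow, p.first);
                comp.minCol = std::min(comp.minCol, p.second);
                comp.maxCol = std::max(comp.maxCol, p.second);
                // Visit the 8 neighbours (the centre is already labeled).
                for (int dr = -1; dr <= 1; ++dr) {
                    for (int dc = -1; dc <= 1; ++dc) {
                        const int nr = p.first + dr, nc = p.second + dc;
                        if (nr < 0 || nr >= rows || nc < 0 || nc >= cols) continue;
                        if (img[nr][nc] == 0 || labels[nr][nc] != 0) continue;
                        labels[nr][nc] = comp.label;  // mark before pushing
                        stack.push_back(std::make_pair(nr, nc));
                    }
                }
            }
            comps.push_back(comp);
        }
    }
    return comps;
}

// Keeps components whose bounding box is at least minSize pixels in both
// width and length; the text uses minSize = 100.
std::vector<Component> filterBySize(const std::vector<Component>& comps, int minSize) {
    std::vector<Component> kept;
    for (std::size_t i = 0; i < comps.size(); ++i) {
        const int width = comps[i].maxCol - comps[i].minCol + 1;
        const int length = comps[i].maxRow - comps[i].minRow + 1;
        if (width >= minSize && length >= minSize) kept.push_back(comps[i]);
    }
    return kept;
}
```

Storing only a bounding box and an area per component keeps memory proportional to the number of components rather than to the number of pixels.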

  2. Draw boundary

      Boundary following algorithm

We delineate the boundary of each object with a boundary following algorithm. The first object pixel found serves as both the first boundary pixel and the starting point. We then search its 8 neighbours clockwise to find the next boundary pixel, and repeat this process until we return to the starting point.
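A minimal sketch of this kind of boundary following is shown below. It is written in standard C++ with hypothetical names (`traceBoundary`, `Point`), traces only the first object found in raster order, and uses the simple stop rule from the text (stop on returning to the start pixel), which can end early on shapes that pass through the start pixel more than once.

```cpp
#include <utility>
#include <vector>

using Image = std::vector<std::vector<int>>;  // 0 = background
using Point = std::pair<int, int>;            // (row, col)

static bool isObject(const Image& img, int r, int c) {
    return r >= 0 && r < static_cast<int>(img.size()) &&
           c >= 0 && c < static_cast<int>(img[r].size()) && img[r][c] != 0;
}

// Traces the outer boundary of the first object met in raster-scan order.
std::vector<Point> traceBoundary(const Image& img) {
    // 8 neighbours in clockwise order (rows grow downward):
    // W, NW, N, NE, E, SE, S, SW.
    static const int dr[8] = {0, -1, -1, -1, 0, 1, 1, 1};
    static const int dc[8] = {-1, -1, 0, 1, 1, 1, 0, -1};

    std::vector<Point> boundary;
    int sr = -1, sc = -1;
    for (int r = 0; r < static_cast<int>(img.size()) && sr < 0; ++r)
        for (int c = 0; c < static_cast<int>(img[r].size()); ++c)
            if (img[r][c] != 0) { sr = r; sc = c; break; }
    if (sr < 0) return boundary;  // no object pixel at all

    int r = sr, c = sc;
    int back = 0;  // the W neighbour of the start is background by construction
    boundary.push_back(Point(r, c));
    const std::size_t maxSteps = 4 * img.size() * img[0].size() + 8;  // safety cap
    while (boundary.size() < maxSteps) {
        // Search clockwise, starting just after the known background pixel.
        int found = -1;
        for (int k = 1; k <= 8; ++k) {
            const int d = (back + k) % 8;
            if (isObject(img, r + dr[d], c + dc[d])) { found = d; break; }
        }
        if (found < 0) break;  // isolated pixel
        // Last background pixel examined before the hit.
        const int prev = (found + 7) % 8;
        const int br = r + dr[prev], bc = c + dc[prev];
        r += dr[found];
        c += dc[found];
        if (r == sr && c == sc) break;  // back at the starting point
        boundary.push_back(Point(r, c));
        // Express that background pixel as a direction from the new pixel.
        for (int d = 0; d < 8; ++d)
            if (r + dr[d] == br && c + dc[d] == bc) { back = d; break; }
    }
    return boundary;
}
```

For a filled 3x3 square this returns the 8 perimeter pixels in clockwise order and skips the interior pixel.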

  3. Object area, orientation, circularity

      Area: the object area equals the zeroth-order moment
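As an illustration of how these three quantities can be derived from moments, the sketch below uses the standard second-moment formulation, with circularity taken as Emin / Emax. The struct, the function name, and the convention of returning circularity 1 when Emax is zero are assumptions made for this example, not details confirmed by the report.

```cpp
#include <cmath>
#include <utility>
#include <vector>

struct ShapeStats {
    double area;         // zeroth-order moment (pixel count)
    double centroidX;    // first-order moments divided by area
    double centroidY;
    double orientation;  // radians, axis of least second moment
    double eMin, eMax;   // extreme second moments about the centroid
    double circularity;  // eMin / eMax, in [0, 1]
};

// pixels holds (x, y) = (column, row) coordinates of one object.
ShapeStats shapeStats(const std::vector<std::pair<int, int>>& pixels) {
    ShapeStats s = {0, 0, 0, 0, 0, 0, 0};
    if (pixels.empty()) return s;

    s.area = static_cast<double>(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        s.centroidX += pixels[i].first;
        s.centroidY += pixels[i].second;
    }
    s.centroidX /= s.area;
    s.centroidY /= s.area;

    // Central second moments: a = sum x'^2, b = 2 sum x'y', c = sum y'^2.
    double a = 0, b = 0, c = 0;
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        const double dx = pixels[i].first - s.centroidX;
        const double dy = pixels[i].second - s.centroidY;
        a += dx * dx;
        b += 2.0 * dx * dy;
        c += dy * dy;
    }

    s.orientation = 0.5 * std::atan2(b, a - c);
    const double root = std::sqrt(b * b + (a - c) * (a - c));
    s.eMin = 0.5 * (a + c) - 0.5 * root;
    s.eMax = 0.5 * (a + c) + 0.5 * root;
    // Assumed convention: a single pixel (eMax == 0) counts as fully circular.
    s.circularity = s.eMax > 0 ? s.eMin / s.eMax : 1.0;
    return s;
}
```

A square gives a circularity of 1, while a one-pixel-wide line gives 0. The angle is measured in image coordinates, where y grows downward.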



  4. Skeleton

A morphological skeleton can be computed using only the two basic morphological operations, dilation and erosion, as described on Wikipedia. At each iteration, the image is eroded again and the skeleton is refined by taking the union of the skeleton so far with the current erosion minus the opening of that erosion. An opening is simply an erosion followed by a dilation.
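The iteration described above can be sketched without OpenCV as follows. The 3x3 cross structuring element, the `Mask` type, and the choice to treat pixels outside the image as background are assumptions of this example; the last one guarantees that repeated erosion eventually empties the image, so the loop terminates.

```cpp
#include <vector>

using Mask = std::vector<std::vector<unsigned char>>;  // 1 = object, 0 = background

// Erosion (erodeOp == true) or dilation (erodeOp == false) with a 3x3 cross.
// Pixels outside the image count as background.
static Mask morphCross(const Mask& in, bool erodeOp) {
    static const int dr[5] = {0, -1, 1, 0, 0};
    static const int dc[5] = {0, 0, 0, -1, 1};
    const int rows = static_cast<int>(in.size());
    const int cols = rows > 0 ? static_cast<int>(in[0].size()) : 0;
    Mask out(rows, std::vector<unsigned char>(cols, 0));
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            int hits = 0;
            for (int k = 0; k < 5; ++k) {
                const int nr = r + dr[k], nc = c + dc[k];
                if (nr >= 0 && nr < rows && nc >= 0 && nc < cols && in[nr][nc]) ++hits;
            }
            out[r][c] = erodeOp ? (hits == 5) : (hits > 0);
        }
    }
    return out;
}

// skeleton = union over iterations of (current image minus its opening).
Mask skeletonize(Mask img) {
    const int rows = static_cast<int>(img.size());
    const int cols = rows > 0 ? static_cast<int>(img[0].size()) : 0;
    Mask skel(rows, std::vector<unsigned char>(cols, 0));
    bool any = true;
    while (any) {
        const Mask eroded = morphCross(img, true);
        const Mask opened = morphCross(eroded, false);  // opening of img
        any = false;
        for (int r = 0; r < rows; ++r) {
            for (int c = 0; c < cols; ++c) {
                if (img[r][c] && !opened[r][c]) skel[r][c] = 1;
                if (eroded[r][c]) any = true;
            }
        }
        img = eroded;  // continue with the eroded image
    }
    return skel;
}
```

A one-pixel-wide line is its own skeleton under this scheme. Note that a morphological skeleton built this way is not guaranteed to be connected.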


OpenCV methods used

      erode - used for image erosion

      dilate - used for image dilation

      moments - used to get object moments

      threshold - used to convert image into binary image


External libraries used



Conclusion and Discussions

We successfully detect all objects of interest and draw the boundary of each one. We also calculate each object’s area, circularity, orientation, and compactness, and finally draw its skeleton.

For the breast cancer tumor image, we select the ROI manually and apply erosion 3 times to remove the trivial objects in the image. Once the tumor is detected, we use dilation to fill the holes inside it.


Bugs faced

The labeling algorithm uses an array to store the starting and ending points of each detected object. When the image contains too many connected objects, the algorithm uses too much memory.



                                 Object’s area, orientation, circularity, compactness

image1: open-bw-full


image2: open-bw-partial


image3: open_fist-bw


image4: tumor-fold





Part 2 - Hand Detection in Piano


Problem Definition

We are given a video sequence of 19 frames in which a pianist sits in front of a piano and plays with both hands. The camera is static and the primary moving object is the pianist, especially her hands. Given this dataset, the goal of the algorithm is to detect the pianist's hands in each frame of the video sequence.


Method and Experiment

      Locate the Region of Interest (RoI)

Because the scene is static, background subtraction removes most unwanted background objects, such as the piano, and the floor and carpet, whose color is similar to skin. All 19 frames are averaged to obtain an average scene, and the absolute difference between the average frame and each individual frame is computed. After the difference is converted to grayscale, absolute thresholding removes the regions with only minor change and keeps only the area containing the pianist.

The average frame of 19 frames

The difference between the first frame and the average frame

The located region of interest, which contains the hands
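The averaging, differencing, and thresholding steps can be sketched as below. To stay self-contained, the example works on grayscale frames stored as nested vectors, whereas the report differences the original frames and converts the result to grayscale afterwards. The `Frame` type, the function names, and the threshold value used in the usage note are hypothetical.

```cpp
#include <cstdlib>
#include <vector>

using Frame = std::vector<std::vector<int>>;  // grayscale values 0..255

// Pixel-wise mean of all frames (integer division); frames must be non-empty
// and of equal size.
Frame averageFrame(const std::vector<Frame>& frames) {
    Frame avg = frames.at(0);
    for (std::size_t i = 1; i < frames.size(); ++i)
        for (std::size_t r = 0; r < avg.size(); ++r)
            for (std::size_t c = 0; c < avg[r].size(); ++c)
                avg[r][c] += frames[i][r][c];
    const int n = static_cast<int>(frames.size());
    for (std::size_t r = 0; r < avg.size(); ++r)
        for (std::size_t c = 0; c < avg[r].size(); ++c)
            avg[r][c] /= n;
    return avg;
}

// 255 where the frame differs from the average by more than thresh, else 0.
Frame changeMask(const Frame& frame, const Frame& avg, int thresh) {
    Frame mask = frame;
    for (std::size_t r = 0; r < frame.size(); ++r)
        for (std::size_t c = 0; c < frame[r].size(); ++c)
            mask[r][c] = std::abs(frame[r][c] - avg[r][c]) > thresh ? 255 : 0;
    return mask;
}
```

With three 1x2 frames {10, 100}, {10, 100}, {10, 220}, the average is {10, 140}; using a threshold of 50, only the second pixel of the third frame is marked as changed.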


      Skin Color Detection

Skin color detection is a simple but powerful method for hand detection. One problem with this dataset is that many skin-colored components are neither hands nor human skin, such as the lid of the piano, the floor, and the carpet. Even though the RoI can remove the piano, parts of the shirt and the piano keys can still be detected as skin. The skin color condition is therefore tuned manually for this dataset, and the region of interest is then used to remove the piano. In the end, only the hands, part of the hair, and a small part of the sleeves are kept.

The scene after skin color detection

The skin-colored scene within the Region of Interest
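The report does not list its tuned skin condition, so the sketch below substitutes a commonly cited RGB rule purely for illustration. Every threshold in it is an assumption and would need retuning for this dataset, and both function names are hypothetical.

```cpp
#include <algorithm>
#include <cstdlib>

// Illustrative RGB skin rule (NOT the report's tuned condition).
bool isSkinRGB(int r, int g, int b) {
    const int maxC = std::max(r, std::max(g, b));
    const int minC = std::min(r, std::min(g, b));
    return r > 95 && g > 40 && b > 20 &&
           (maxC - minC) > 15 &&      // enough color spread, rejects gray
           std::abs(r - g) > 15 &&    // red clearly separated from green
           r > g && r > b;            // red is the dominant channel
}

// A pixel is kept only if it is skin-colored AND lies inside the RoI mask
// obtained from background subtraction.
bool isHandCandidate(int r, int g, int b, bool insideRoi) {
    return insideRoi && isSkinRGB(r, g, b);
}
```

Combining the color test with the RoI mask is what removes skin-colored background such as the piano lid.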


      Connected Component Labeling to Identify Hands

The final step is to locate the hand blobs. We use the prior knowledge that the hand blobs are the two large blobs at the far left of the RoI. After a connected component labeling algorithm labels each blob, the 4 blobs with the largest area (total number of pixels) are selected. The x values (column indices) of the centers of these four blobs (not the centroid but the averaged pixel value in each dimension) are sorted, and only the two blobs with the smallest values are kept. These two blobs are identified as the hands.

The final hand detection result
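The selection rule (four largest blobs, then the two leftmost) can be sketched as below. The `Blob` struct and function names are hypothetical, and the center x value is taken as an input rather than computed, since the report's exact definition of the center is not reproduced here.

```cpp
#include <algorithm>
#include <vector>

// One labeled blob (hypothetical type); centerX is its center column.
struct Blob {
    int label;
    int area;        // total number of pixels
    double centerX;
};

static bool largerArea(const Blob& a, const Blob& b) { return a.area > b.area; }
static bool moreLeft(const Blob& a, const Blob& b) { return a.centerX < b.centerX; }

// Keeps the 4 largest blobs, then the 2 of those with the smallest centerX.
std::vector<Blob> pickHandBlobs(std::vector<Blob> blobs) {
    std::sort(blobs.begin(), blobs.end(), largerArea);
    if (blobs.size() > 4) blobs.resize(4);
    std::sort(blobs.begin(), blobs.end(), moreLeft);
    if (blobs.size() > 2) blobs.resize(2);
    return blobs;
}
```

Filtering by area first means a tiny blob at the far left, such as a fragment of a white key, is dropped before the left-to-right ordering is applied.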


OpenCV methods used

      threshold:  used to convert image into binary image with absolute thresholding

      absdiff: used to calculate the absolute difference between two images

      sortIdx: used to sort the elements of a matrix

      floodFill: used to label the connected component


External libraries used




Conclusion and Discussions

We successfully detect the hands in every frame of the “Piano” dataset. Parts of the hands are in shadow and cannot be detected with skin color detection, so only the lit parts of the hands are used in the detection process. The rough boxes that enclose the hands are then drawn from the detected parts and extended in each direction so that the shadowed parts of the hands also fall within the boxes.


Background subtraction is a simple and useful way to eliminate unwanted regions in such a static scene. If the scene were not static, the problem would be different and more complex. The player’s sleeves also play a role in this detection: if she were wearing a T-shirt, it would be a challenge to separate the hand from the arm using skin color detection alone, since they have the same color.


Bugs faced

In the frame named “piano_33.png”, the two hands overlap, so connected component labeling detects only one blob for both hands. The green box marks the hands, but the red box marks just a small part of one of the piano’s white keys. This could be fixed by setting a threshold for the hand blob.



















Part 2 - Bat detection


Problem Definition

Detect the bats that appear in the grayscale image and, for each bat, decide whether its wings are spread or folded.


Method and Experiment

  1. We first apply histogram equalization to the images as preprocessing, before bat segmentation

      Histogram equalization

            It increases the image contrast, so that we can better separate the bats from the background.


Before histogram equalization:


After histogram equalization:
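To illustrate what histogram equalization does to pixel values, here is a minimal standard C++ sketch of the usual cumulative-histogram mapping. It assumes 8-bit values in the range 0..255 and is not a reproduction of OpenCV's equalizeHist internals; the function name is hypothetical.

```cpp
#include <cmath>
#include <vector>

// Maps each 8-bit value through the normalized cumulative histogram so that
// the output spans the full 0..255 range. Values must lie in 0..255.
std::vector<int> equalize(const std::vector<int>& pixels) {
    std::vector<long> cdf(256, 0);
    for (std::size_t i = 0; i < pixels.size(); ++i) ++cdf[pixels[i]];
    for (int v = 1; v < 256; ++v) cdf[v] += cdf[v - 1];

    const long total = static_cast<long>(pixels.size());
    long cdfMin = 0;  // cumulative count at the darkest value present
    for (int v = 0; v < 256; ++v)
        if (cdf[v] > 0) { cdfMin = cdf[v]; break; }
    if (total == cdfMin) return pixels;  // constant image: nothing to stretch

    std::vector<int> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i) {
        const double t = static_cast<double>(cdf[pixels[i]] - cdfMin) / (total - cdfMin);
        out[i] = static_cast<int>(std::lround(255.0 * t));
    }
    return out;
}
```

An image whose values cover only 50..150 is stretched so that its darkest value maps to 0 and its brightest to 255, which is the contrast gain exploited before thresholding.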




  2. After obtaining the histogram-equalized image, we use absolute thresholding to segment the bats from the background.

      Absolute thresholding




  3. Label each bat in the segmentation image

      Recursive labeling algorithm (“flood fill algorithm”)

      Decision rule for each labelled object:

            Each labeled object is filtered by a decision rule on its width and length, which discards objects that are too small or too large: 5 < width < 150 and 5 < length < 150.


  4. Wings classification - spread and folded

      Circularity: when the wings are spread, the circularity is small (close to 0), and when the wings are folded, the circularity is large (close to 1). We therefore calculate the circularity of each bat in the image and threshold it to classify the wings as spread or folded. To calculate the circularity, we first obtain Emin and Emax from the object moments.
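The size rule and the circularity decision can be combined in a few lines, as sketched below. The report does not state its circularity threshold, so the default of 0.5 is a placeholder assumption, as are the enum and function names.

```cpp
enum WingState { Rejected, Spread, Folded };

// width/length: bounding-box size in pixels; eMin/eMax: extreme second
// moments of the object. circThresh = 0.5 is a hypothetical default.
WingState classifyBat(int width, int length, double eMin, double eMax,
                      double circThresh = 0.5) {
    // Size rule from the text: 5 < width < 150 and 5 < length < 150.
    if (!(5 < width && width < 150 && 5 < length && length < 150)) return Rejected;
    const double circularity = eMax > 0 ? eMin / eMax : 1.0;
    return circularity < circThresh ? Spread : Folded;
}
```

For example, an object with Emin / Emax = 0.1 is labeled Spread, while one with 0.9 is labeled Folded. A spread bat whose outline is close to a square would also score high and be mislabeled, which is the failure mode a single-feature rule invites.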


OpenCV methods used

      equalizeHist - used for image histogram equalization

      moments - used to get object moments


External libraries used



Conclusion and Discussions

With this method we detect bats in the image (though not all of the bats shown) and can tell whether the wings are spread or folded with around 75% accuracy.


Since we use only circularity as the decision boundary, the classifier is not very robust to variations in the image. For example, one cause of misclassification is that some spread wings have a shape close to a rectangle or even a square, which leads to a relatively high circularity. Clearly, we need more topological or morphological properties of the object as features for our classifier in order to increase the accuracy.


Absolute thresholding can segment the bats from the background. At the bottom of the image, however, the bats and the background have very similar brightness, which makes it very difficult to select a threshold that separates them, so our method treats some bats as background.


Bugs faced

Because there are so many bats in the image, and our labeling algorithm also stores the starting and ending points of each bat in an array, the detection rate for each frame is very low.



Here are the results: the bats detected in the image and, for each bat, whether its wings are recognized as folded or spread.




Part 2 - Pedestrian Detection


Problem Definition
OpenCV offers two kinds of features for human detection: HoG and Haar. In our experiments, HoG proved more accurate than Haar.


Method and Experiment

In HoG/SVM we have to make a trade-off between speed and accuracy.

In the GIF below, a sliding window sweeps through the image for the purpose of face detection.

In our case, at each step of the sliding window, we extract the HoG features and pass them to the linear SVM classifier. The smaller the window stride, the more windows there are, and therefore the more SVM classifications are performed. SVM classification is expensive and we prefer a near real-time solution, so we should aim for fewer windows and a bigger stride. Scale and window stride both affect accuracy and speed significantly.

sliding_window_example (1).gif
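The effect of the stride on the number of SVM evaluations can be made concrete with a small calculation. The sketch below counts window positions at a single pyramid level and ignores padding; the function name and the 640x480 frame size used in the usage note are assumptions for illustration.

```cpp
// Number of sliding-window positions in one image at one scale,
// ignoring padding. Returns 0 if the window does not fit.
long countWindows(int imgW, int imgH, int winW, int winH, int stride) {
    if (imgW < winW || imgH < winH || stride <= 0) return 0;
    const long nx = (imgW - winW) / stride + 1;  // positions along x
    const long ny = (imgH - winH) / stride + 1;  // positions along y
    return nx * ny;
}
```

For a hypothetical 640x480 frame and a 64x128 window, a stride of 8 gives 73 * 45 = 3285 windows, while a stride of 4 gives 145 * 89 = 12905, roughly four times as many classifications per pyramid level.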


The HoG algorithm is described at a high level in the diagrams below:


The HOGDescriptor constructor looks like this:

HOGDescriptor(Size win_size=Size(64, 128), Size block_size=Size(16, 16),
Size block_stride=Size(8, 8), Size cell_size=Size(8, 8),
int nbins=9, double win_sigma=DEFAULT_WIN_SIGMA,
double threshold_L2hys=0.2, bool gamma_correction=true,
int nlevels=DEFAULT_NLEVELS);

Here are the parameters we used:

HOGDescriptor hog(cv::Size(64, 128),
  cv::Size(16, 16),
  cv::Size(8, 8),
  cv::Size(8, 8), 9, 1, -1, cv::HOGDescriptor::L2Hys, 0.2, true,
  cv::HOGDescriptor::DEFAULT_NLEVELS);
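As a worked example of what these sizes imply, the sketch below computes the length of the feature vector that one detection window produces. The function name is hypothetical, and square blocks and cells are assumed, as in the parameters above.

```cpp
// Length of the HoG feature vector for one detection window, assuming
// square blocks and cells and a block stride that tiles the window evenly.
long hogDescriptorLength(int winW, int winH, int blockSize, int blockStride,
                         int cellSize, int nbins) {
    const long blocksX = (winW - blockSize) / blockStride + 1;
    const long blocksY = (winH - blockSize) / blockStride + 1;
    const long cellsPerBlock = (blockSize / cellSize) * (blockSize / cellSize);
    return blocksX * blocksY * cellsPerBlock * nbins;
}
```

With a 64x128 window, 16x16 blocks, an 8x8 block stride, 8x8 cells, and 9 bins, this gives 7 * 15 blocks of 4 cells with 9 bins each, that is 3780 values per window, each of which the linear SVM must weigh.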


OpenCV libraries used




      hitThreshold : “Threshold for the distance between features and SVM classifying plane”.

      winStride: a 2-tuple giving the step size of the sliding window in the x and y directions.

      padding: a 2-tuple giving the number of pixels by which the window is padded in the x and y directions prior to HoG feature extraction. Typical values for padding according to Dalal’s HoG paper are (8, 8), (16, 16), (24, 24), (32, 32).

      scale: controls the number of layers in the image pyramid that the image is passed through. A smaller scale value increases the number of layers in the pyramid and therefore the total detection time.
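The relation between the scale parameter and the number of pyramid layers can be sketched as below. This is an approximation of the idea rather than OpenCV's exact loop: it simply counts levels while the downscaled image still contains one detection window, with a cap of 64 levels assumed to play the role of the nlevels parameter. The function name and the 640x480 frame size in the usage note are hypothetical.

```cpp
// Counts pyramid levels: the image is shrunk by scale0 at each level until
// it can no longer contain a single detection window.
int pyramidLevels(int imgW, int imgH, int winW, int winH, double scale0,
                  int maxLevels = 64) {
    int levels = 0;
    double s = 1.0;  // cumulative downscale factor of the current level
    while (levels < maxLevels && imgW / s >= winW && imgH / s >= winH) {
        ++levels;
        if (scale0 <= 1.0) break;  // no shrinking: a single level only
        s *= scale0;
    }
    return levels;
}
```

For a 640x480 frame and a 64x128 window, a scale of 1.05 yields 28 levels under this count, whereas 1.2 yields only 8, which is why the smaller scale value is so much slower.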





External libraries used

      dirent.h can be found at



Using HoG and SVM has the issue of producing multiple overlapping bounding boxes for one person, because the detector is triggered numerous times. In Dalal’s HoG paper, the author suggests using the Mean Shift algorithm; however, that works suboptimally. A better method is non-maxima suppression (NMS), which gives faster and more accurate final results.
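A minimal greedy form of non-maxima suppression is sketched below. It assumes that each box carries a confidence score and uses an intersection-over-union (IoU) overlap test; the `Box` struct, the function names, and the threshold of 0.5 in the usage note are hypothetical choices for this example.

```cpp
#include <algorithm>
#include <vector>

// Axis-aligned detection with a confidence score (hypothetical type).
struct Box {
    int x, y, w, h;
    double score;
};

// Intersection over union of two boxes, in [0, 1].
static double iou(const Box& a, const Box& b) {
    const int ix = std::max(0, std::min(a.x + a.w, b.x + b.w) - std::max(a.x, b.x));
    const int iy = std::max(0, std::min(a.y + a.h, b.y + b.h) - std::max(a.y, b.y));
    const double inter = static_cast<double>(ix) * iy;
    const double uni = static_cast<double>(a.w) * a.h +
                       static_cast<double>(b.w) * b.h - inter;
    return uni > 0 ? inter / uni : 0.0;
}

static bool higherScore(const Box& a, const Box& b) { return a.score > b.score; }

// Greedy NMS: visit boxes from highest to lowest score and keep a box only
// if it does not overlap an already kept box by more than iouThresh.
std::vector<Box> nonMaxSuppression(std::vector<Box> boxes, double iouThresh) {
    std::sort(boxes.begin(), boxes.end(), higherScore);
    std::vector<Box> kept;
    for (std::size_t i = 0; i < boxes.size(); ++i) {
        bool suppressed = false;
        for (std::size_t k = 0; k < kept.size() && !suppressed; ++k)
            suppressed = iou(boxes[i], kept[k]) > iouThresh;
        if (!suppressed) kept.push_back(boxes[i]);
    }
    return kept;
}
```

With a threshold of 0.5, two 10x10 boxes offset by one pixel (IoU of about 0.68) collapse into the higher-scoring one, while a distant box is left untouched. A lower threshold merges more aggressively, at the risk of merging two people who stand close together.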


      When two people get very close, their bounding boxes merge, so only one bounding box is shown and only one person is detected.

      HoG/SVM is not very robust when there is occlusion. Perhaps the use of optical flow or a Kalman filter could help with this problem.

      We thought histogram equalization would improve detection accuracy, but it did not.

      We thought feeding grayscale images to the HoG/SVM detector would speed up execution, but it did not.

      Haar cascade for full body detection is considerably faster overall than extracting HoG features and feeding them into an SVM classifier.



Bugs faced

      HoG/SVM with parameters hog.detectMultiScale(img, found, 0, Size(8, 8), Size(32, 32), 1.05, 2); is very slow and detects two or three people within a single bounding box. Also, the bounding box length sometimes jumps and becomes very long. We are not sure why this happens.

      Overlapping bounding boxes (that can be fixed using non-maxima suppression)



For hog.detectMultiScale(img, found, 0, Size(8, 8), Size(32, 32), 1.05, 2); we have:




Here’s the result for hog.detectMultiScale(img, found, 0, Size(4, 4), Size(32, 32), 1.2);

Decreasing the window stride (from 8 to 4) and increasing the scale (from 1.05 to 1.2) improved our accuracy and performance.






Additionally, we tried a Haar cascade classifier, using the Haar cascade XML file for the full body. The results were much worse: in many frames the camera leg is detected as a human, and many pedestrians that HoG with the SVM classifier detects are missed.






Finally, for HoG, we tried hog.setSVMDetector(cv::HOGDescriptor::getDaimlerPeopleDetector()); instead of hog.setSVMDetector(HOGDescriptor::getDefaultPeopleDetector());, changing the window size from the 64x128 of the original HoG paper by Dalal (2005) to the 48x96 of the Daimler people detector. As the screenshot below shows, the number of false positives increased greatly. It seems that NMS is not applied with the Daimler people detector, since multiple window proposals are accepted for each pedestrian.






      _nbins: number of orientation bins in each cell histogram (9)

      _win_sigma: Gaussian smoothing window parameter

      _L2HysThreshold: clipping threshold of the L2-Hys block normalization

      _gammaCorrection: whether to apply gamma correction

      _winSize: size of the detection window

      _blockSize: block size in pixels (16x16)

      _blockStride: block stride in pixels (8x8)

      _cellSize: cell size in pixels (8x8)


Further improvement suggestions

      Running the code on a GPU, especially for human detection, would greatly increase performance. Some of the methods are:






Navneet Dalal, Bill Triggs: Histograms of Oriented Gradients for Human Detection. CVPR (1) 2005: 886-893