Problem 1 deals with object shape analysis and segmentation over a few provided binary images. We implemented connected component labeling, boundary finding, moment calculation, and skeleton finding. Problem 2 covers a series of video datasets, with the goal of detecting, and in some cases classifying, specific objects in the frame via segmentation and other techniques.
Problem 1: Binary Image Analysis
Problem 1 is a series of tasks related to the following four images. We inverted these images for all of our analyses.
1.1 and 1.2
For problem 1 parts 1 and 2, we implemented a sequential labeling algorithm, similar to the one described in class but with a variant structuring element that includes the top-left pixel as well as the other three shown in class. This allowed the label equivalences to be "fixed" in a second pass. For part 2, we have a filter-by-area flag that can be used to prune small detections.
Here is the result of the connected components labeling method, with each component colored with a different random value.
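The two-pass scheme can be sketched as follows. This is an illustrative NumPy reimplementation under the assumptions stated above (prior 8-neighbors as the structuring element, union-find for equivalences, an optional area filter); details of the exact class variant may differ.

```python
import numpy as np

def label_components(binary, min_area=0):
    """Two-pass sequential labeling. First pass: each foreground pixel
    takes the smallest label among its already-visited neighbors
    (left, upper-left, upper, upper-right), recording equivalences in
    a union-find forest. Second pass: labels are replaced with their
    roots. min_area implements the filter-by-area pruning flag."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    parent = [0]  # union-find forest; parent[i] == i means root

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    next_label = 1
    for r in range(h):
        for c in range(w):
            if not binary[r, c]:
                continue
            neighbors = []
            if c > 0 and labels[r, c - 1]:
                neighbors.append(labels[r, c - 1])
            for dc in (-1, 0, 1):
                if r > 0 and 0 <= c + dc < w and labels[r - 1, c + dc]:
                    neighbors.append(labels[r - 1, c + dc])
            if not neighbors:
                labels[r, c] = next_label
                parent.append(next_label)
                next_label += 1
            else:
                roots = [find(n) for n in neighbors]
                m = min(roots)
                labels[r, c] = m
                for rt in roots:  # record equivalences
                    parent[rt] = m
    # second pass: replace each label with its equivalence-class root
    flat = np.array([find(l) for l in range(next_label)])
    labels = flat[labels]
    if min_area > 0:
        ids, counts = np.unique(labels, return_counts=True)
        for i, cnt in zip(ids, counts):
            if i != 0 and cnt < min_area:
                labels[labels == i] = 0
    return labels
```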
For problem 1 part 3, we implemented the boundary following algorithm described in class. The algorithm takes a label and an image and finds the first pixel with that label in the image. It then traverses the boundary of the object with that label in a counterclockwise fashion until it returns to the pixel it started with.
Here is the result of the boundary finding algorithm on all of the components detected by our connected components code:
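A sketch of the traversal, using a radial-sweep formulation (an assumed concrete variant; the class algorithm may differ in the sweep order or stopping criterion):

```python
import numpy as np

def trace_boundary(labels, target):
    """Boundary following by radial sweep: start at the first pixel
    with the target label (row-major scan), then repeatedly sweep the
    8-neighborhood counterclockwise starting just past the direction
    of the previous boundary pixel. Stops when the start pixel is
    revisited (a simplified stopping test)."""
    h, w = labels.shape
    mask = labels == target
    rs, cs = np.nonzero(mask)
    start = (int(rs[0]), int(cs[0]))
    # 8 neighbors in counterclockwise order (image coords, rows grow down)
    ccw = [(0, -1), (1, -1), (1, 0), (1, 1),
           (0, 1), (-1, 1), (-1, 0), (-1, -1)]

    def on(p):
        return 0 <= p[0] < h and 0 <= p[1] < w and mask[p[0], p[1]]

    boundary = [start]
    cur, back = start, 0  # back: index of the neighbor we came from
    while True:
        for i in range(1, 9):
            d = (back + i) % 8
            nxt = (cur[0] + ccw[d][0], cur[1] + ccw[d][1])
            if on(nxt):
                cur, back = nxt, (d + 4) % 8
                break
        else:
            break  # isolated single pixel: no foreground neighbors
        if cur == start:
            break
        boundary.append(cur)
    return boundary
```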
Part 4 asks for measurements of the area, orientation, circularity (Emin/Emax), perimeter, compactness, and the ratio of the area to the perimeter. These were straightforward calculations, given by formulas from class. Here are the measurements for the largest object in each image (excluding background):
a,b,c,x_bar,y_bar: (205022766.81464016, -61630637.649107769, 106679956.94220567, 274.04721076040943, 130.78026365146562)
Orientation: (-16.037518490714444, -212.43027008968218)
a,b,c,x_bar,y_bar: (135458797.28176308, -72981519.426295519, 95162681.245386541, 297.3215028543612, 134.46740717794398)
Orientation: (-30.54754729491003, -140.0657794640955)
open_fist-bw.png Object 1
a,b,c,x_bar,y_bar: (88831831.154353589, 19598256.200373203, 197567226.7842519, 307.99303529659124, 171.51735919472259)
Orientation: (84.89140767072008, -150.06406643827142)
open_fist-bw.png Object 2
a,b,c,x_bar,y_bar: (53001143.773689613, -31640776.315619115, 112023459.18301022, 83.211645803745412, 189.11956270436303)
Orientation: (-75.90253028647778, 205.78392546033115)
a,b,c,x_bar,y_bar: (1692927882.0930378, 605747001.29744184, 627976481.90510976, 468.49263602802091, 502.14109605124122)
Orientation: (14.81566323378246, -679.8849873296374)
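The a, b, c values above are the central second moments of each region (with b the doubled mixed moment), and the first orientation number follows from the axis-of-least-inertia formula theta = (1/2) atan2(b, a - c). A sketch assuming that convention, which reproduces the printed angles:

```python
import numpy as np

def second_moments(mask):
    """Central second moments of a binary region and the orientation
    of the axis of least inertia, theta = 0.5*atan2(b, a - c), with
    b defined as twice the mixed central moment. Also returns the
    circularity Emin/Emax from the extreme second moments."""
    ys, xs = np.nonzero(mask)
    x_bar, y_bar = xs.mean(), ys.mean()
    xp, yp = xs - x_bar, ys - y_bar
    a = np.sum(xp * xp)
    b = 2.0 * np.sum(xp * yp)
    c = np.sum(yp * yp)
    theta = 0.5 * np.arctan2(b, a - c)
    # extreme second moments E_min, E_max give circularity
    common = (a + c) / 2.0
    diff = np.hypot(b, a - c) / 2.0
    e_min, e_max = common - diff, common + diff
    return a, b, c, (x_bar, y_bar), theta, e_min / e_max
```

As a sanity check, plugging the first object's printed a, b, c into 0.5*atan2(b, a - c) gives about -16.04 degrees, matching the Orientation value above.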
For 1.5, we created the skeleton of each image with repeated applications of erosion.
Here is the result of the skeleton finding algorithm on all of the components detected by our connected components code:
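The repeated-erosion idea can be sketched with the morphological skeleton (Lantuéjoul's formulation): at each erosion level, keep the pixels that an opening would remove, and union them over all levels. This is an assumed concrete variant with a 3x3 structuring element; the exact kernel used in the report may differ.

```python
import numpy as np

def _erode(img):
    """3x3 erosion: a pixel survives only if all 8 neighbors are set."""
    h, w = img.shape
    p = np.pad(img, 1, constant_values=False)
    out = np.ones((h, w), dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out &= p[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
    return out

def _dilate(img):
    """3x3 dilation: a pixel is set if any 8-neighbor is set."""
    h, w = img.shape
    p = np.pad(img, 1, constant_values=False)
    out = np.zeros((h, w), dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            out |= p[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
    return out

def skeleton(img):
    """Morphological skeleton via repeated erosion: union, over all
    erosion levels, of the eroded set minus its opening."""
    cur = img.astype(bool)
    skel = np.zeros_like(cur)
    while cur.any():
        er = _erode(cur)
        opened = _dilate(er)          # opening of cur
        skel |= cur & ~opened         # ridge pixels at this level
        cur = er
    return skel
```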
2.1: Piano Hand Tracking
For part 2.1, the goal is to detect the piano player's hands in each frame. To achieve this, we implemented the following steps:
1. Compute the average frame of the video to detect areas of motion
2. For each frame, take the difference between it and the average frame to segment out a region that will contain the moving hands
3. Take an absolute threshold over the difference image and create a bounding box around any objects detected to isolate the area of interest
4. In that region of the original image, perform skin color detection to find the hands, and erode and dilate to clean it up
5. Find connected components of the skin color detection, and sort by area
6. Of the top 5 largest connected components that are in the skin color range, take the 2 whose centroids are leftmost within the area of detection
7. Designate these as the hands in the video and draw a bounding box
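Steps 2–4 above can be sketched as a per-frame mask. The difference threshold and RGB skin-color bounds here are illustrative assumptions, not the report's tuned values:

```python
import numpy as np

def hand_candidate_mask(frame, avg_frame, diff_thresh=40,
                        skin_lo=(90, 40, 20), skin_hi=(255, 180, 150)):
    """Combine motion (difference from the average frame, thresholded)
    with a simple RGB skin-color range test. frame and avg_frame are
    HxWx3 uint8 arrays; the thresholds are placeholder values."""
    diff = np.abs(frame.astype(int) - avg_frame.astype(int)).sum(axis=2)
    motion = diff > diff_thresh
    lo, hi = np.array(skin_lo), np.array(skin_hi)
    skin = np.all((frame >= lo) & (frame <= hi), axis=2)
    return motion & skin
```

Connected components of this mask, sorted by area, would then feed steps 5–7.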
Here is an example of detection over the first frame in the image:
Average frame over entire video:
Difference between avg and frame, after thresholding, to find potential hand region:
Skin color detection in region of interest:
The algorithm is correct in all but 2 of the frames in the video, giving very high accuracy (89.5%), but it sometimes has problems when the hands are touching (they are merged into one detection) and it then grabs the hair as the second hand. Example of this phenomenon:
2.2: Bat Detection
For problem 2.2, we were given a video of bats flying. Our task was to detect bats and classify their wing shape as spread or folded.
Here is an outline of our method:
1. Extract a column of clean pixels (manually discovered) in order to set row-specific thresholds for detection
2. For each frame, blur it and convert to grayscale
3. Perform Canny edge detection on the bottom of the grayscale image to detect tiny bats
4. Near the top of the frame, where detections are larger, perform adaptive thresholding using the clean pixel row +epsilon as a value
5. To finalize detection, use connected components labeling and filter by area threshold
6. Now that we have detections, we will use skeletons of each detection to classify shape
7. After extracting skeletons, compute the principal components of each skeleton's points and calculate the corresponding eigenvalues
8. If the shape is 11x as long as it is wide, we determine that the detected bat has its wings spread; if not, we determine that its wings are folded
9. To determine if bats might be overlapping (multiple bats in an area), we extract a bounding region around a detected bat. Using thresholding to find the centers of the bats (found to be ~155 from pixel values), we attempt to determine if multiple bats are in the bounding box with the associated object label
10. A rectangular bounding box is placed over each detected bat, where the color of the box indicates the classification: green indicates folded wings, magenta indicates spread wings, and red indicates multiple bats
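The shape test in steps 7–8 can be sketched as an eigenvalue-ratio check on the skeleton points. The 11x threshold is taken from the step above; whether it is applied to the eigenvalues directly or to their square roots is an assumption here (eigenvalues directly, as step 7 suggests):

```python
import numpy as np

def wings_spread(points, ratio=11.0):
    """PCA elongation test: eigendecompose the covariance of the
    skeleton's (row, col) points; if the leading eigenvalue is at
    least `ratio` times the other, classify as spread wings."""
    pts = np.asarray(points, dtype=float)
    evals = np.sort(np.linalg.eigvalsh(np.cov(pts.T)))
    if evals[0] <= 0:
        return True  # degenerate (a perfect line) counts as spread
    return evals[1] / evals[0] >= ratio
```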
An example of bat detection over a frame:
Bat Detection After Adaptive Thresholding:
Bat Skeletons for Shape Analysis:
The result, where green indicates folded wings, magenta indicates spread wings, and red indicates multiple bats:
Measuring the performance of the bat detection is tricky because it is hard to count bats manually, and in some cases a bat's shape is ambiguous. Overall, the algorithm qualitatively performs well. It does miss a significant number of very small bats, especially at the bottom of the frame, which were filtered out by the area threshold. This filtering is necessary to exclude artifacts like stars in the image, so it is a trade-off. Additionally, multiple-bat detection produces a noticeable number of false positives.
Here is an example of this:
2.3: People Detection
For problem 2.3, we were given a video of people walking across the frame. Our task was to count how many people were in the frame. Here is an outline of our method:
1. Create a background frame containing no people, from sliced portions of multiple frames for differencing
2. Blur the video frame and convert it to grayscale
3. Take the difference between that frame and the pre-processed background frame, and threshold that difference to obtain a binary image
4. Perform connected components analysis and skeletonize the detected objects
5. Erode and dilate the skeletons with vertical 3x3 kernels (shaped like I, J, \, etc.) to remove the horizontal components of the skeletons. This isolates each person's spine and eliminates problems like two people's hands touching
6. Extract connected components of the skeletons, and count how many people are in the frame
7. Overlay a dot on the centroid of the detection and the count of each frame for output
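Step 5 can be sketched with the simplest of the vertical kernels, a 3x1 column (an illustrative reduction of the I/J/\ kernel family above): erosion keeps only pixels with neighbors directly above and below, and dilation with the same kernel restores the surviving strokes.

```python
import numpy as np

def keep_vertical(skel):
    """Erode then dilate with a 3x1 vertical kernel. Horizontal runs
    of skeleton pixels have no vertical neighbors, so erosion removes
    them; vertical strokes survive and are restored by dilation."""
    p = np.pad(skel.astype(bool), ((1, 1), (0, 0)))
    eroded = p[:-2] & p[1:-1] & p[2:]      # pixel and both vertical neighbors
    q = np.pad(eroded, ((1, 1), (0, 0)))
    return q[:-2] | q[1:-1] | q[2:]        # vertical dilation
```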
The background frame:
The original frame:
Thresholded binary difference from the background frame:
Eroded and dilated vertical skeleton:
Overall, the count given by this algorithm had 86% accuracy, with an average error of 0.97. For this problem, the main difficulties arose when people were occluded by the sign in the center of the frame. Here is an example of a bad detection due to occlusion:
Credits and Bibliography
My teammate for this project was Sid Mysore. I also discussed the assignment briefly with Nam Pham.