Patrick Bellis

Assignment One


For the purposes of this assignment, I designed a fast and efficient algorithm for converting an RGB color image to a greyscale image. The algorithm applies the following formula to each pixel to perform the conversion.

dst_pixel = (src_pixel[0] >> 2) + (src_pixel[1] >> 1) + (src_pixel[2] >> 3);

The above algorithm is much faster than a typical algorithm that assigns floating-point weights to the channel values and multiplies them, since we are only shifting bits. The shifts correspond to integer weights of 0.25, 0.5, and 0.125 on the three channels.
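As a minimal sketch (assuming OpenCV and a 3-channel 8-bit input), the per-pixel loop could look like the following; the function name and Mat types are illustrative rather than the original code.

#include <opencv2/opencv.hpp>

// Apply the shift-based formula to every pixel of a 3-channel image.
// (p[0] >> 2) + (p[1] >> 1) + (p[2] >> 3) weights the channels by
// 0.25, 0.5, and 0.125 using integer arithmetic only.
cv::Mat toGreyscale(const cv::Mat& src) {
    cv::Mat dst(src.rows, src.cols, CV_8UC1);
    for (int y = 0; y < src.rows; ++y) {
        for (int x = 0; x < src.cols; ++x) {
            const cv::Vec3b& p = src.at<cv::Vec3b>(y, x);
            dst.at<uchar>(y, x) = (p[0] >> 2) + (p[1] >> 1) + (p[2] >> 3);
        }
    }
    return dst;
}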

The output looked as follows:

Input Image Output Image


Assignment Two


Problem Definition

Given a game in which two players use their hands as "guns", how can we detect gunfire and determine which player shot the gun?


Method and Implementation

To determine which gun belonged to which player, I created two template images: one of a leftward-facing hand and one of a rightward-facing hand. I then performed two template matches per frame to determine where each player's gun was. This could later be used to estimate where the gun's bullets would be headed, but for now it was used only to filter out unneeded points. To detect gunfire, I performed a second template match using the first template image with a motion blur applied. This second template was matched against the summed energy of successive frames. Although in principle this should work, it ended up being rather finicky.

The crucial parts of the game are the following (see the sketch after this list):

  1. Extract the green channel from the RGB frame.
  2. Template match the green channel against the hand template image.
  3. Filter out pixels that are not within the bounding box.
  4. Subtract the previous frame from the current frame.
  5. Sum the energy of successive frames.
  6. Template match the motion-blurred hand against the motion energy of successive frames within the hand's bounding box.
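The following is a minimal sketch of these steps using OpenCV. The function names, the matching mode, the detection threshold, and the assumption that the motion-blurred template fits inside the hand's bounding box are all illustrative rather than the exact code used in the game.

#include <opencv2/opencv.hpp>
#include <vector>

// Locate the hand by template matching and return its bounding box.
cv::Rect findHand(const cv::Mat& green, const cv::Mat& handTempl) {
    cv::Mat scores;
    cv::matchTemplate(green, handTempl, scores, cv::TM_CCOEFF_NORMED);
    double maxVal; cv::Point maxLoc;
    cv::minMaxLoc(scores, nullptr, &maxVal, nullptr, &maxLoc);
    return cv::Rect(maxLoc, handTempl.size());
}

// One step of the pipeline: returns true if a shot is detected this frame.
// 'energy' is a CV_32F running sum of frame differences kept by the caller.
bool detectShot(const cv::Mat& frameBGR, const cv::Mat& prevGreen,
                const cv::Mat& handTempl, const cv::Mat& blurTempl,
                cv::Mat& energy) {
    // 1. Extract the green channel from the frame.
    std::vector<cv::Mat> ch;
    cv::split(frameBGR, ch);
    cv::Mat green = ch[1];

    // 2-3. Find the hand; pixels outside this box are ignored below.
    cv::Rect handBox = findHand(green, handTempl);

    // 4-5. Difference against the previous frame and accumulate the energy.
    cv::Mat diff;
    cv::absdiff(green, prevGreen, diff);
    cv::accumulate(diff, energy);

    // 6. Match the motion-blurred template inside the hand's bounding box
    //    (this assumes blurTempl is no larger than handBox).
    cv::Mat roi;
    energy(handBox).convertTo(roi, CV_8U);
    cv::Mat scores;
    cv::matchTemplate(roi, blurTempl, scores, cv::TM_CCOEFF_NORMED);
    double maxVal;
    cv::minMaxLoc(scores, nullptr, &maxVal);
    return maxVal > 0.7;  // assumed detection threshold
}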


Experiments

The most important parameters in this game are the template images and the maximum difference coefficient allowed between a template image and a frame. The template images were therefore carefully constructed to mimic experimental conditions, and the coefficients were set to minimize the number of incorrect guesses. I would rather the game fail to detect a gun than detect something that is not a gun, or, worse, detect the red player's gun as the blue player's.

To evaluate my game, I counted the number of times it could not find a player and the number of times it identified a player (both when it should have and when it should not have). I also measured, in a similar fashion, whether the game detected a shot being fired.


Results


Trial Success Fail
Blue Gun
Red Gun
Shooting

ROC

Trial                    True   False
Blue Gun present            9       1
Blue Gun not present        0      10
Red Gun present             8       2
Red Gun not present         0      10
Shooting occurred           6       4
Shooting did not occur      3       7

Keep in mind that all conditions were optimal (including the distances).


Discussion


To improve the system, I would most likely create a template pyramid to account for differences in hand size. I would also track the hand over the entire time the gun is "shot" instead of just at the start and end.
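A hypothetical sketch of the template pyramid idea: match several rescaled copies of the template and keep the best-scoring one. The scale range, step, and matching mode here are all assumptions.

#include <opencv2/opencv.hpp>

// Match the template at several scales and return the best bounding box.
cv::Rect matchAtBestScale(const cv::Mat& image, const cv::Mat& templ) {
    double bestScore = -1.0;
    cv::Rect bestBox;
    for (double scale = 0.5; scale <= 2.0; scale *= 1.25) {
        cv::Mat scaled;
        cv::resize(templ, scaled, cv::Size(), scale, scale);
        if (scaled.cols > image.cols || scaled.rows > image.rows) break;
        cv::Mat scores;
        cv::matchTemplate(image, scaled, scores, cv::TM_CCOEFF_NORMED);
        double maxVal; cv::Point maxLoc;
        cv::minMaxLoc(scores, nullptr, &maxVal, nullptr, &maxLoc);
        if (maxVal > bestScore) {
            bestScore = maxVal;
            bestBox = cv::Rect(maxLoc, scaled.size());
        }
    }
    return bestBox;
}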


Conclusions

In conclusion, I created a fun, although simple, game. I have ideas for improving the interface, but for now a gun not being tracked means that you "missed" the other player, which makes sense: if you are too far away for the template matching to work, you would not be pointing your gun at the other person anyway. When the system does not correctly identify gunfire, that is attributed to a misfire or a jam. I used a combination of techniques, all involving template matching, extending it to track unique objects and give context to specific actions so that those actions could be attributed to a specific player. I think the game was a success, although this was quite a lot of work for one person, so in the future I will probably (hopefully) work with a group.


Assignment Three


Problem Definition

Given three data sets {bats, cells, aquarium}, find a way to segment the objects of interest in a fast and efficient manner.


Method and Implementation

For all three data sets, we used adaptive thresholding. Percentile thresholding worked well for the bat data set too, but we found that adaptive thresholding combined with a dilation worked far better and more consistently. For the bat data set, we converted the source image to greyscale and applied an adaptive threshold to the greyscale image. We then performed a dilation. Finally, we calculated the bounds and area of each component and used the area to decide whether it was a bat: the area must be greater than 50 but less than 500 pixels. With the cells we followed the same process, except the area must be greater than 250 pixels. Finally, for the fish we also calculated the circularity and required it to be greater than 0.5.
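A minimal sketch of the bat pipeline in OpenCV: it substitutes findContours for our own connected components code, and the adaptive threshold parameters are assumptions, while the 50-500 pixel area range comes from the text above.

#include <opencv2/opencv.hpp>
#include <vector>

// Greyscale -> adaptive threshold -> dilation -> area filter.
std::vector<cv::Rect> findBats(const cv::Mat& srcBGR) {
    cv::Mat grey, mask;
    cv::cvtColor(srcBGR, grey, cv::COLOR_BGR2GRAY);
    cv::adaptiveThreshold(grey, mask, 255, cv::ADAPTIVE_THRESH_MEAN_C,
                          cv::THRESH_BINARY, 11, -5);  // assumed parameters
    cv::dilate(mask, mask, cv::Mat());                 // default 3x3 dilation

    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL,
                     cv::CHAIN_APPROX_SIMPLE);

    std::vector<cv::Rect> bats;
    for (const auto& c : contours) {
        double area = cv::contourArea(c);
        if (area > 50 && area < 500)   // area bounds from the text
            bats.push_back(cv::boundingRect(c));
    }
    return bats;
}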


Experiments

The experiments were simple: for each data set, did the algorithm find approximately the right number of objects of interest, and how many of the detected objects were actually the object of interest? This gives us three confusion matrices (one for each data set).


Results


Output Example Output 1 Example Output 2
Bats
Cells
Aquarium

Bat ROC

             Bats   Not Bats
Bats           30          9
Not Bats        5          ?

Cell ROC

             Cells   Not Cells
Cells            6           0
Not Cells        2           ?

Aquarium ROC

             Fish   Not Fish
Fish           30          4
Not Fish       15          ?


Discussion


I would most likely design better filters that use information about the object of interest to reject other objects, for example perimeter-to-area ratios and/or compactness. We also relied on the cheap trick of downsampling the cell data to get it to work well; instead, I would like a more mathematically sound approach to modelling the data.
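For example, compactness-style measures can be computed directly from a blob's contour. The sketch below uses the common definition circularity = 4*pi*area / perimeter^2, which is 1 for a perfect circle; whether this matches the exact measure we used for the fish is an assumption.

#include <opencv2/opencv.hpp>
#include <vector>

// Circularity of a closed contour: 1.0 for a circle, lower for
// elongated or ragged shapes.
double circularity(const std::vector<cv::Point>& contour) {
    double area = cv::contourArea(contour);
    double perimeter = cv::arcLength(contour, /*closed=*/true);
    if (perimeter <= 0) return 0.0;
    return 4.0 * CV_PI * area / (perimeter * perimeter);
}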


Conclusions

We used the same segmentation algorithm, adaptive thresholding, for each data set, although we also implemented percentile and global binary thresholding. I would like to have tried more advanced segmentation algorithms but sadly did not have the time to research them. It might also have been good to use adaptive and percentile thresholding together.

Rather than use the stack-based or recursive connected components algorithm, I implemented an iterative connected components algorithm, largely because of the massive performance increase from the iterative approach. I believe this is because the stack and recursive versions are O(number of labels * size of image), whereas the iterative version is O(size of image). For the large numbers of labels we often reached, the iterative version was far superior.

For the bat data set we performed a dilation and checked the area. For the cell data set we downsampled the data by 1/3 to cause the outer lines of the cells to connect, applied connected components, and then upsampled by 3 to get our output. For the aquarium data set, we checked the circularity of each object and threw out all objects with circularity less than 0.5.

The bats were fairly simple, although their circularity changed depending on whether their wings were extended. The cells were very dynamic, and we found that their properties were often changing; the cell data set was also very noisy, so we filtered out all objects with small areas to achieve our results. Our code base was initially very buggy, which led us to believe we had achieved a solution when it actually arose from unstable bugs. This led to some very last-minute changes and some "hacks" to achieve our goals.

Seeing as the bats were among the few objects in their data set and their pixels were relatively brighter than the surrounding pixels, adaptive thresholding made perfect sense. Percentile thresholding would have worked well too, since the bats were the brightest pixels in the data set. Dilation helped because some pixels were lost due to their relative brightness. The cell data set worked in a similar way. Our solution to the aquarium data set was not very well implemented; we needed to spend more time analyzing the properties of the fish. We could have made better use of color, since some fish were very orange or very blue, but instead we (mostly) used a greyscale image and the shape of the object.
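As a rough illustration of the iterative connected components approach mentioned above, the following is a minimal sketch of queue-based labeling over a binary mask; it is not our exact implementation, and the 4-connectivity is an assumption. Each pixel is enqueued at most once, which is where the O(size of image) behavior comes from.

#include <opencv2/opencv.hpp>
#include <queue>

// Iterative connected components over a binary mask (0 or 255),
// using an explicit queue instead of recursion or an implicit stack.
cv::Mat labelComponents(const cv::Mat& mask) {
    cv::Mat labels = cv::Mat::zeros(mask.size(), CV_32S);
    int next = 0;
    for (int y = 0; y < mask.rows; ++y) {
        for (int x = 0; x < mask.cols; ++x) {
            if (mask.at<uchar>(y, x) == 0 || labels.at<int>(y, x) != 0)
                continue;
            ++next;                       // start a new component
            std::queue<cv::Point> q;
            q.push(cv::Point(x, y));
            labels.at<int>(y, x) = next;
            while (!q.empty()) {
                cv::Point p = q.front(); q.pop();
                const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
                for (int i = 0; i < 4; ++i) {
                    int nx = p.x + dx[i], ny = p.y + dy[i];
                    if (nx < 0 || ny < 0 || nx >= mask.cols || ny >= mask.rows)
                        continue;
                    if (mask.at<uchar>(ny, nx) != 0 &&
                        labels.at<int>(ny, nx) == 0) {
                        labels.at<int>(ny, nx) = next;
                        q.push(cv::Point(nx, ny));
                    }
                }
            }
        }
    }
    return labels;  // per-pixel label image, 0 = background
}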



Assignment Four


Problem Definition

Given footage of eels and crabs swimming in a water tank: segment the tanks from the rest of the footage, then segment and track the eels and crabs within them.


Method and Implementation

To segment the water tanks from the rest of the footage, I designed an algorithm that used simple thresholding to isolate the brightest areas of the footage. I then calculated the area and bounding boxes of the connected components and chose the two largest areas that were square-like. To segment and track the eels I used optical flow. Since eels are generally moving, I only looked at areas where the optical flow vectors had a magnitude of at least 0.5. The optical flow can then be used to track which eel is which when there is more than one. Since the crabs do not move much, we used thresholding to segment them and tracked them using centroids. Although not yet finished, it would be simple to determine whether an eel entered the tank, because we have a very robust and accurate tracker. Since the movement of an eel is related to its optical flow, we could also calculate the moments of the eels and use them to determine the frequency of the eels' oscillations.
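A minimal sketch of the flow-based filter, assuming dense Farneback optical flow between consecutive greyscale frames; the Farneback parameters are illustrative, and only the 0.5 magnitude threshold comes from the text.

#include <opencv2/opencv.hpp>
#include <vector>

// Mask of pixels whose optical flow magnitude exceeds 0.5.
cv::Mat movingPixels(const cv::Mat& prevGrey, const cv::Mat& currGrey) {
    cv::Mat flow;
    cv::calcOpticalFlowFarneback(prevGrey, currGrey, flow,
                                 0.5, 3, 15, 3, 5, 1.2, 0);
    std::vector<cv::Mat> xy;
    cv::split(flow, xy);                 // per-pixel (dx, dy)
    cv::Mat magnitude;
    cv::magnitude(xy[0], xy[1], magnitude);
    cv::Mat mask = magnitude > 0.5;      // 0.5 threshold from the text
    return mask;                         // CV_8U, 255 where motion occurs
}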


Experiments

The experiments measured whether we could accurately track an eel.


Results


Output Example Output 1 Example Output 2
Eels
Crabs

Eel ROC

             Eels   Not Eels
Eels            3          0
Not Eels        5          ?

Crab ROC

             Crabs   Not Crabs
Crabs            3           2
Not Crabs        0           ?


Discussion


I would most likely refine the segmentation further, as that seems to be the most important part of any computer vision process. Tracking a small number of objects is not difficult if we can get a very good segmentation. That said, I think we needed to focus more of our time on the remaining aspects of the assignment.


Conclusions

Any project that uses computer vision relies on a good segmentation. This is intuitive: we need a good understanding of what an object is in order to extract meaningful data about it. Any information about a blob is only as good as the algorithm that generates that blob. Segmentation is the hardest step to implement accurately, but with a good segmentation everything else follows.


Art Vision

Final Project Proposal

CS 585 Fall 2016

Patrick Bellis, Arjun Lamba, Qiwei (Victor) Zheng

Background and Objective

Large landscape paintings are typically drawn following perspective principles. This project will try to extract the lines that are "parallel" on the buildings, and construct the perspective lines and their converging point on the painting. We can then evaluate the perspective accuracy of the painting and even recreate a perspective-correct version of it. After we find the converging/vanishing point, we will run a segmentation method to extract the objects, people, and buildings in the foreground from the background. We will find a way to estimate the distance of the objects in the painting from the viewpoint. We can then use the z-values from the previous step to create a 2.5D space containing several layers at different distances from the painting's viewpoint. The space will be able to be rotated, transformed, and modified.

  1. Vanishing Point
  2. Segmentation
  3. 2.5D Conversion

Tools

Data Source

After browsing many paintings from WikiArt and other online galleries, we decided to use Renaissance perspective paintings with people in the foreground and buildings on the sides and in the background. We also want the paintings to be realistic and sharp, and we want the buildings and constructions to have parallel lines that help us determine the converging point. Additionally, the objects and people in the foreground must differ significantly from the background to allow decent segmentation. Here are a few paintings we might be using:

Converging Lines and Focal Point

We expect to be able to generate a set of lines that help define the projection of the image, and we expect that edge detection will be helpful in accomplishing this. Once we construct the edges, we can determine the focal point of the image and calculate the different layers within the image.
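A minimal sketch of how such lines could be found and intersected with OpenCV, assuming Canny edges and a probabilistic Hough transform; all thresholds are illustrative, and a real implementation would cluster the pairwise intersections rather than simply averaging them all.

#include <opencv2/opencv.hpp>
#include <cmath>
#include <vector>

// Estimate a rough vanishing point as the mean pairwise intersection
// of Hough line segments detected in the painting.
cv::Point2f estimateVanishingPoint(const cv::Mat& paintingGrey) {
    cv::Mat edges;
    cv::Canny(paintingGrey, edges, 50, 150);
    std::vector<cv::Vec4i> lines;
    cv::HoughLinesP(edges, lines, 1, CV_PI / 180, 80, 40, 10);

    cv::Point2f sum(0, 0);
    int count = 0;
    for (size_t i = 0; i < lines.size(); ++i) {
        for (size_t j = i + 1; j < lines.size(); ++j) {
            // Intersect the infinite lines through each segment.
            cv::Point2f a(lines[i][0], lines[i][1]), b(lines[i][2], lines[i][3]);
            cv::Point2f c(lines[j][0], lines[j][1]), d(lines[j][2], lines[j][3]);
            cv::Point2f r = b - a, s = d - c;
            float cross = r.x * s.y - r.y * s.x;
            if (std::abs(cross) < 1e-6f) continue;   // parallel segments
            float t = ((c - a).x * s.y - (c - a).y * s.x) / cross;
            sum += a + t * r;
            ++count;
        }
    }
    return count ? sum * (1.0f / count) : cv::Point2f(-1, -1);
}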

2.5D Conversion

The result of the third step will be something like the following figure: the people in the foreground are extracted as separate layers and spaced in the 3D space as flat objects, with the background placed at the far end of the field of view.

State of the Art

We have carefully studied this topic, and there are not many research projects on it. The closest work we have found is a paper called "A generative model for 2.5D vision: Estimating appearance, transformation, illumination, transparency and occlusion" by Jojic and Frey. We found a lot of material on linear perspective and aerial perspective in paintings that may help our project. Wikipedia has a useful article on 3D Reconstruction that discusses several different ways of reconstructing a 3D scene from a 2D image, which may be of use; it also covers reconstructing a scene using distortions to calculate the perspective of the image, which is very similar to the approach we discussed together.

Since projecting a 3D point onto a 2D plane is a non-reversible process, we need some notion of size, or at least of size ratios. We will not be able to give accurate readings in terms of distance, but it is very possible to output normalized coordinates. To do so we need knowledge of the size ratios of a person; in particular, the typical size ratios of a person as drawn by Renaissance artists. To use this information we must be able to accurately determine which "blobs" of pixels represent a person. This article goes through several different techniques for detecting humans, although sadly many rely on video data. The circular Hough transform looks somewhat promising, as it should allow us to detect human faces. From there we may be able to infer whether a person is closer or farther depending on the size of the detected circles (or we may choose to go a step further and find the exact dimensions of the face).
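A minimal sketch of the circular Hough transform idea, using OpenCV's HoughCircles; every parameter here is an assumption that would need tuning per painting.

#include <opencv2/opencv.hpp>
#include <vector>

// Detect head-sized circles in a greyscale painting.
std::vector<cv::Vec3f> findHeadCircles(const cv::Mat& paintingGrey) {
    cv::Mat blurred;
    cv::GaussianBlur(paintingGrey, blurred, cv::Size(9, 9), 2);
    std::vector<cv::Vec3f> circles;      // (x, y, radius) per detection
    cv::HoughCircles(blurred, circles, cv::HOUGH_GRADIENT,
                     1,                  // accumulator resolution
                     blurred.rows / 8,   // min distance between centers
                     100, 30,            // Canny / accumulator thresholds
                     5, 60);             // assumed min / max radius in pixels
    return circles;
}

Larger detected radii would suggest faces closer to the viewer, which is the depth cue described above.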