In basketball, the game revolves around moving the ball toward the goal. Teams run set plays to achieve this, and those plays are typically analyzed in terms of player movement and positioning. However, plays are drawn up with a focus on the ball itself! No existing technology analyzes video of a play and produces results like these, yet there is a massive amount of data available: over 200 possessions per game. Computer vision can help us reach conclusions based on the path of the ball over a possession. In this project we focus specifically on counting the number of passes made in a given play or game.
We approach this problem by using computer vision techniques learned in class to track the path of the ball through a given play and produce useful insights about it, such as the number of passes.
Live broadcast games of basketball proved to be tricky to work with, since they involve unnecessary (for our purpose) zooming in on players, the ball, etc., and awkward camera angles. As a result, it would be difficult to make much sense of the path of the basketball with the view changing so often.
To account for this, we generated our own synthetic data from a simulation. The video game NBA 2K20 provides a realistic and accurate depiction of basketball plays, and lets us capture video from the desired camera angle, so we chose NBA 2K20 footage as our dataset.
Detecting the ball
Initially, we tried implementing template matching and motion energy templates to detect the ball in the video. This did not provide very accurate results, since it relied on a specific template of the basketball that varied both from frame to frame (due to, say, occlusions from the player’s hand) and from video to video. Since it wasn’t a scalable solution, we moved on.
We also tried implementing Hough circle detection to find the ball, but this also proved inaccurate: it failed whenever the ball was partially covered by the player’s hand, and it often detected non-circular objects (such as the player’s shoulder) as the ball.
We had to make improvements, so we implemented our own custom ball detection algorithm using traditional computer vision methods we learned in class.
First, we process the video frame by frame, applying a Gaussian blur to smooth out features. Then, we filter the frame by an HSV color range that captures the ball and similarly colored regions. The filtered image is converted to binary, so only the HSV-filtered regions remain. We then apply the erosion and dilation morphological operations to remove small blobs.
Then, we find all the contours in this image and iterate over them to filter the selection by shape. Specifically, we look for near-circles within a certain threshold, to account for occlusions of the ball by a hand. Only contours that meet this near-circle threshold move on. We then filter by contour area, since the relatively constant height and zoom of the video dataset lets us predict the area of the basketball.
Finally, for the contours that survive this filtering, we calculate the circularity using moments, and the most circular one is taken as the most probable ball. This algorithm already detects the ball with high accuracy, provided the ball is at least partially visible in the frame. The next section on tracking handles the cases where it is not.
Tracking the ball
Sometimes the ball is completely occluded by the players, so the detection algorithm will not find it. In other cases, the ball may be slightly visible, but the algorithm finds other objects that look like the ball. Both cases break the tracking of the ball’s path, since the traced path jumps from one detected object to another. To solve this, we implemented a Kalman Filter.
The Kalman Filter uses prior positions of the ball to predict the next position. It has a distance threshold so that detections far from the prediction are not counted as the ball (minimizing glitches where the ball jumps across the screen), and a threshold on the number of consecutive frames without a detection before the ball is treated as a new object (this helps with occlusion).
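A minimal constant-velocity sketch of this tracker follows; the threshold values and noise matrices here are illustrative, not the tuned values from the project.

```python
# Constant-velocity Kalman tracker with the two thresholds described above:
# a distance gate on detections, and a missed-frame counter for occlusion.
import numpy as np

class BallTracker:
    def __init__(self, xy, dist_thresh=100.0, max_missed=15):
        self.x = np.array([xy[0], xy[1], 0.0, 0.0])  # state: [x, y, vx, vy]
        self.P = np.eye(4) * 10.0                    # state covariance
        self.F = np.eye(4); self.F[0, 2] = self.F[1, 3] = 1.0  # motion model
        self.H = np.eye(2, 4)                        # we only measure x, y
        self.Q = np.eye(4) * 0.1; self.R = np.eye(2) * 5.0
        self.dist_thresh, self.max_missed, self.missed = dist_thresh, max_missed, 0

    def step(self, detection):
        # Predict forward one frame.
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        pred = self.H @ self.x
        # Gate: ignore detections too far from the prediction (or none at all).
        if detection is None or np.linalg.norm(pred - detection) > self.dist_thresh:
            self.missed += 1    # after max_missed, the caller starts a new track
            return pred
        self.missed = 0
        # Correct with the accepted detection.
        y = np.asarray(detection, float) - pred
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P
        return self.H @ self.x
```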
We use these predicted values as the detected position of the ball to draw the trace of the path, and this results in very accurate tracking.
Detecting passes
Pass detection was done with a custom algorithm we designed. We first store the points where the ball was tracked in previous frames in a queue. Each frame, we take the three most recent points and calculate the angle between the two vectors they form. If the angle is between 0-60 degrees or 300-360 degrees and enough frames have passed since the last detected pass (30 frames), we label this as a possible pass. We then use the queue of points to find the average vertical and horizontal movement over the past 10 frames of video. If the vertical movement is under 120 pixels, we label this as a pass and update the pass count. This worked fairly well for our dataset, although the frames-passed and vertical-movement thresholds may need adjusting for different videos.
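The direction-change test can be sketched as below. The angle convention is an assumption: we take the angle at the middle of the last three points, so a near-reversal of the ball's direction gives an angle near 0 or 360 degrees. The 60-degree and 120-pixel thresholds mirror the prose but are dataset-specific.

```python
# Sketch of the pass test: a sharp turn at the middle of the last three
# tracked points plus limited vertical motion over the recent window.
import numpy as np

def turn_angle_deg(p1, p2, p3):
    """Signed angle between vectors p2->p1 and p2->p3, wrapped to [0, 360)."""
    v1 = np.asarray(p1, float) - np.asarray(p2, float)
    v2 = np.asarray(p3, float) - np.asarray(p2, float)
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    ang = np.degrees(np.arctan2(cross, np.dot(v1, v2)))
    return ang % 360.0

def looks_like_pass(points, vert_thresh=120.0):
    """points: the last ~10 tracked (x, y) positions, newest last."""
    ang = turn_angle_deg(points[-3], points[-2], points[-1])
    sharp_turn = ang < 60.0 or ang > 300.0
    vert_move = abs(points[-1][1] - points[0][1])  # vertical drift over window
    return sharp_turn and vert_move < vert_thresh
```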
Video 1: This video displays the results for tracking the ball in the video of the play.
Video 2: This video displays the results for detecting the number of passes in a play.
Here is a confusion matrix depicting the results and accuracy.
Simply run track_ball.py, modifying the video source within the code. Our results use:
Video 1: videos/2k_trim_1.mp4
CS 585 HW 4
April 3, 2020
The goal of this programming assignment is for us to learn more about the practical issues that arise when designing a tracking system. We are asked to track moving objects in video sequences, i.e., identify the same object from frame to frame:
Multiple frames are considered at a time, and an optimal data association algorithm (Hungarian method) is implemented and used.
To estimate the state of each tracked object, we implemented a Kalman filter.
There are two datasets to apply these tracking by detection methods to.
1) The bat dataset shows bats in flight, where the bats appear bright against a dark sky. We used the provided Segmentations and Detections to track the bats.
2) The cell dataset shows mouse muscle stem cells moving in hydrogel microwells; the brightness of the pixels within the cells is very similar to that of the background. Here, we implemented both the segmentation and the tracking.
Method and Implementation
Here, I implemented the program in python using what we learned from lab, lecture, and prior HWs.
The tracker.py file has two classes. The Track class is unique to each object in the frame and contains the ID, the path (for tracing), and a Kalman filter used to calculate predictions for that object.
The Tracker class is used to update the Track for every object. The Update function uses the Hungarian method to assign the correct detected measurements to the predicted track.
It takes as a parameter the true detections of the objects' centroids, and computes the sum of squared differences between these and the predicted values (from the Kalman Filter, explained later).
It stores these in a cost matrix, which we analyze to assign the correct prediction to each object using data association. The Hungarian method determines which assignment is optimal.
To solve the assignment given the cost matrix, we call the linear_sum_assignment function, which solves the minimum-weight matching problem in bipartite graphs to find the optimal assignment with minimum cost.
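A sketch of this association step, with the cost matrix built from sums of squared differences as described:

```python
# Data association: squared-distance cost matrix between predicted and
# detected centroids, solved with scipy's Hungarian-method implementation.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(predicted, detected):
    """Return (track_idx, detection_idx) pairs minimizing total cost."""
    pred = np.asarray(predicted, float)[:, None, :]  # shape (T, 1, 2)
    det = np.asarray(detected, float)[None, :, :]    # shape (1, D, 2)
    cost = ((pred - det) ** 2).sum(axis=2)           # sum of squared differences
    rows, cols = linear_sum_assignment(cost)         # optimal assignment
    return list(zip(rows, cols))
```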
It then updates each Kalman filter instance and stores the history needed to trace each path.
The kalman_filter.py file has a class that keeps track of the estimated state of the system, and the variance or uncertainty of the estimate.
It has two methods, Predict and Correct, which implement the functionality of the Kalman Filter.
Following the equations and method from https://en.wikipedia.org/wiki/Kalman_filter, we initialize the state variables. The Predict function returns the updated state vector, obtained by applying the state-transition matrix to the previous state vector.
It also updates the covariance matrix, accounting for process noise. After this function is called, the Correct method refines the prediction for the given object using the measurement.
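A compact sketch of the two methods, following the standard equations (matrix names F, H, Q, R as on the Wikipedia page); this is an illustration, not the exact contents of kalman_filter.py:

```python
# Predict/Correct steps of a Kalman filter: x is the state estimate, P its
# covariance; F, H, Q, R are the transition, measurement, process-noise,
# and measurement-noise matrices.
import numpy as np

class KalmanFilter:
    def __init__(self, x0, P0, F, H, Q, R):
        self.x, self.P = np.asarray(x0, float), np.asarray(P0, float)
        self.F, self.H, self.Q, self.R = map(np.asarray, (F, H, Q, R))

    def predict(self):
        self.x = self.F @ self.x                      # x' = F x
        self.P = self.F @ self.P @ self.F.T + self.Q  # P' = F P F^T + Q
        return self.x

    def correct(self, z):
        y = np.asarray(z, float) - self.H @ self.x    # innovation
        S = self.H @ self.P @ self.H.T + self.R       # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
        return self.x
```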
To start off, we create an object of the Tracker class to track the bats. After reading in the 150 frames from the provided Segmentation dataset, we convert them from grayscale to BGR, coloring by brightness within the frame. This makes it easier to see which bat is which.
Then, we load in the frames from the Localization dataset which provide the values of the centroid for each bat in each frame. This is the detections, which we use for the tracking.
To start the tracking, we iterate over the frames; each frame has a list of centroids. We pass these true centroid detections to our tracker, which uses the Kalman filter to predict the next value of each centroid.
After this, we use the trace list from the tracker to get the coordinates of the previous position, and the predicted one. Using these, we draw a line showing the trace, and draw a circle and the associated object number to see which object is which for tracking between frames.
We save the frames to a video, which visualizes the tracking of different bats.
Here, the methods are essentially the same as with the bats, except we are not provided with the Segmentation or Localization datasets.
As a result, we have to produce these results ourselves first.
For Segmentation, we simply set a threshold to distinguish between the cells and their background.
For Detection of the centroids for each cell in each frame, we find contours using cv2 and filter by size (area). We then generate bounding boxes around the detected objects, and calculate the central point which is the centroid coordinates. We store these in a list.
Now, we have the list of centroids for each cell in each frame, and we can proceed as we did for the bats by instantiating a Tracker object, using the Kalman Filter, and drawing the resultant lines and numbers.
I primarily experimented with finding the ideal value of the tracker distance threshold for each dataset: when the distance between a detection and its prediction exceeds this threshold, the track is deleted and a new one is created.
This value determines how often a new object is created after, say, an occlusion, versus the same object continuing to be tracked because the distance stays below the threshold.
For the bat dataset, I found a good value to be 250.
For the cell dataset, I found a good value to be 100.
Here are some results:
1) Bat Tracking
2) Cell Tracking
Conclusions and Discussion
1) As seen in the video above, the tracking of bats in the upper left corner was fairly successful. These bats were closer to the foreground of the video and did not overlap as often. You can see that at 8 seconds in, in the top middle the tracker succeeds at tracking the overlapping bats (bat 68 and 21). However, if you look at the bottom right corner where there are many more overlapping bats, the tracker fails in some instances.
2) By adjusting the value of dist_threshold when creating the Tracker object, we can control how large a jump is accepted before a new track is created instead of continuing an old one. I had to do some trial and error to get the results, as explained in the Experiments section. We also put a cap on the number of frames an object can go undetected before it gets labelled as a new object. This works fairly well but also fails in some instances.
3) Our implementation uses the Kalman filter and the Hungarian algorithm to deal with objects overlapping. The predicted location of the object in the next frame lets the algorithm decide which label belongs to which object after an overlap occurs. The optimal data association produced by the Hungarian algorithm allows us to assign the label to each object. This, along with the dist_threshold in 2), do the job fairly well.
4) If there are detections that do not match with the measurements in other frames, our algorithm has a threshold of frames for which they can exist. After that the label is removed since the object is assumed to have left the frame.
5) Our implementation does not take into account the velocity of the objects, but it could be a useful improvement since we threshold the distance an object can travel per frame before it is labelled as a new object. This creates some issues when there are objects very close in the foreground that move across many pixels very quickly.
Credits and Bibliography
Teammate: Andrew Spencer
CS 585 HW 3
February 24, 2020
This part of the homework involves Object Shape Analysis and Segmentation, where we perform binary image analysis on hands and a stained tissue sample of a breast cancer tumor.
Here, we have to:
1) Implement a connected component labeling algorithm and apply it to the data below.
2) If the output contains a large number of components, apply a technique to reduce the number of components
3) Implement the boundary following algorithm and apply it for the relevant regions/objects.
4) For each relevant region/object, compute the area, orientation, and circularity (Emin/Emax). Also, identify and count the boundary pixels of each region, and compute compactness, the ratio of the area to the perimeter.
5) Implement a skeleton finding algorithm. Apply it to the relevant regions/objects.
The second part of the assignment involves performing segmentation on video data.
Method and Implementation
Here, I implemented the program in python using what we learned from lab and prior HWs.
I analyzed the images and performed binary image analysis to produce the desired results.
Some notable techniques used:
1) Connected components labeling using my own implementation of the Sequential Labelling Algorithm (SLA)
2) Finding the contours of the images
3) Calculating the statistics: area, orientation, and circularity (Emin/Emax). Also, identify and count the boundary pixels of each region, and compute compactness, the ratio of the area to the perimeter.
4) Morphology (opening, closing, erosion, and dilation) to compare results and find the appropriate number of connected components
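The sequential labelling idea from 1) can be sketched as follows. This is a standard two-pass union-find version for illustration; my actual implementation is a one-pass variant that relabels prior pixels as it goes.

```python
# Sequential connected-component labeling (4-connectivity): raster scan
# assigning provisional labels, with a union-find to merge equivalences,
# then a second pass to resolve final labels.
import numpy as np

def label_components(binary):
    labels = np.zeros(binary.shape, int)
    parent = {}                       # union-find over provisional labels

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    nxt = 1
    h, w = binary.shape
    for i in range(h):
        for j in range(w):
            if not binary[i, j]:
                continue
            up = labels[i - 1, j] if i else 0
            left = labels[i, j - 1] if j else 0
            if up and left:
                labels[i, j] = find(up)
                parent[find(left)] = find(up)   # record equivalence
            elif up or left:
                labels[i, j] = find(up or left)
            else:
                parent[nxt] = nxt               # new provisional label
                labels[i, j] = nxt
                nxt += 1
    for i in range(h):                # second pass: resolve equivalences
        for j in range(w):
            if labels[i, j]:
                labels[i, j] = find(labels[i, j])
    return labels
```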
Here are the images used:
Here are some results :
(a) Passing through the SLA algorithm for connected components:
(b) finding contours:
(c) Morphology in this instance did more harm than good! Here is the result after applying 'closing', for example:
However, in this instance morphology produced better results, with fewer labels, as seen by applying erosion, for instance:
My program did a fairly good job of finding the connected components. It computed some extra small components because of how the algorithm was implemented. It only needs one pass, since it is a modified SLA that updates the labels of prior pixels as it goes along.
As seen, which morphological preprocessing step leads to more accurate results differs depending on the image. The contours found were very accurate and were useful for separating the objects from the background.
Results are obtained by changing the String of the image name in the line 'image = cv2.imread('1_hand.png',0)'
This is powerful in image analysis to be able to segment parts of images out and perform analysis.
Credits and Bibliography
Teammate: Andrew Spencer
CS 585 HW 2
February 12, 2020
This part of the homework involves programmatically designing and implementing algorithms that recognize hand shapes or gestures, and creating a graphical display that responds to the recognition of the hand shapes or gestures.
The algorithm should detect at least four different hand shapes or gestures. It must use skin-color detection and binary image analysis (e.g. centroids, orientation, etc.) to distinguish hand shapes or gestures.
Method and Implementation
Here, I implemented the program in python using the helper skeleton code given to us in C++ from lab.
I analyzed a live video feed from the webcam to solve the problem mentioned above.
Some notable techniques used:
1) template matching (created templates of three static hand positions, and one dynamic gesture)
2) background differencing: D(x,y,t) = |I(x,y,t)-I(x,y,0)|
3) frame-to-frame differencing: D’(x,y,t) = |I(x,y,t)-I(x,y,t-1)|
4) motion energy templates (union of binary difference images over a window of time)
5) skin-color detection (thresholding pixel values)
6) horizontal and vertical projections to find bounding boxes of "movement blobs" or "skin-color blobs"
7) tracking the position and orientation of moving objects
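Techniques 2) and 3) can be sketched directly in numpy, with images as grayscale arrays and t indexing frames:

```python
# Background differencing and frame-to-frame differencing, matching the
# formulas listed above.
import numpy as np

def background_difference(frames, t):
    """D(x, y, t) = |I(x, y, t) - I(x, y, 0)|"""
    return np.abs(frames[t].astype(int) - frames[0].astype(int)).astype(np.uint8)

def frame_difference(frames, t):
    """D'(x, y, t) = |I(x, y, t) - I(x, y, t-1)|"""
    return np.abs(frames[t].astype(int) - frames[t - 1].astype(int)).astype(np.uint8)
```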
The program handles 3 static hand signs:
1) A closed fist
2) An open palm
3) A peace sign (two fingers held up)
It also handles one dynamic hand gesture: hand wave
The program is executed by running "python hw2_gesture.py static" to obtain results for the three static hand signs, and "python hw2_gesture.py dynamic" for the hand gesture.
static hand signs:
First off, pre-captured templates are loaded in, converted to black and white, and resized appropriately. The video camera is started, and the program reads in the current frame and overlays a rectangle in which the hand sign is recognized.
Then, I use background subtraction (averaging the first 30 frames) to prepare the area for hand detection.
A contour and bounding box are drawn on the hand, following its shape.
OpenCV's matchTemplate function is used with every template to generate a score for how closely the user's hand sign matches each of the three templates. These scores are stored in an array, and the maximum, if above a certain threshold, is displayed as the final result in the live video feed.
dynamic hand gesture:
The video feed is read and resized to the template size.
The frame differencing function is used on the previous and current frames to detect dynamic motion and ignore the rest.
Then, skin-color detection is applied to the current frame and thresholded against our template.
After this, a contour is drawn bounding the areas in the frame that match the skin color. I then compute the motion history, accumulating the frame differences over a certain number of frame pairs. One feed displays the motion history, and another shows the bounding box drawn around object movement that matches the skin color.
OpenCV's matchTemplate function is used with the template to generate a score for how closely the user's hand wave matches the template hand wave, generated separately (using techniques similar to the lab). If the value is above a certain threshold, we inform the user that it is a hand wave.
Here are the template images used:
3. Fist Sign:
4. Hand Wave:
Here are some results (please look at the top left of the image to see what the program indicates the hand sign is):
But sometimes, a wrong result occurs due to some errors in contour detection and threshold values:
My program did a fairly good job of detecting hand signs and gestures, provided there is enough contrast in the images (a dark background works best).
This is powerful in image analysis to be able to detect hand positions and general motion!
Credits and Bibliography
CS 585 HW 1
January 29, 2020
This part of the homework involves programmatically modifying an image of a face. It has three parts:
1. Create a grayscale image of your face by converting your color image using one of the conversions we discussed in class last week.
2. Flip your face image horizontally, i.e. left to right, right to left.
3. Come up with a third way of manipulating your face that produces an interesting output. For example, you may create a blurred image of your grayscale face by assigning to each pixel the average grayscale pixel value of itself and its 8 neighbors. Hint: You may have to run your program a few times to make the blurring noticeable.
Method and Implementation
Here, I implemented the program in C++ and analyzed the image (my face) as a matrix of pixel values. These values were manipulated to solve the three above mentioned parts.
1. The grayscale() function converts the image into grayscale by taking the BGR pixel values and using them to calculate the grayscale value using the formula V = (B+G+R)/3.
2. The flip() function works by swapping pixel values from the first half of the image to its complement in the other half.
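The first two operations can be sketched in numpy (the original program is in C++ over raw pixel matrices, but the arithmetic is identical):

```python
# Grayscale conversion via V = (B + G + R) / 3, and a horizontal flip that
# mirrors each row.
import numpy as np

def to_grayscale(bgr):
    """Average the B, G, R channels per pixel."""
    return (bgr.astype(np.uint16).sum(axis=2) // 3).astype(np.uint8)

def flip_horizontal(img):
    """Swap each pixel with its mirror in the other half of the row."""
    return img[:, ::-1].copy()
```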
3. Here, I chose to mirror the face to create a new face that looks like a cyclops with one eye, and then tinted the whole image red.
Below are the resultant images generated by my code for the source image for each part:
3. Red Cyclops:
Images can be analyzed as matrices of pixel values, and these values can be programmatically manipulated to perform, say, face manipulation.
Credits and Bibliography