
CAS CS 640 Artificial Intelligence - Fall 2007

Programming Assignment 1

by ESRA CANSIZOGLU


PART-1: Video-based Activity Recognition

1.1. System Overview:

In this part, three different activities are recognized using shots from three cameras arranged in different orientations. The orientation of each camera can be seen in the following figure. As the figure shows, the orientations differ only slightly.

The third activity that I selected for evaluation is turning while the arms are extended sideways. I chose this activity because it resembles the second activity, waving arms that are extended sideways, and because it presents different views in different cameras. An example of the third activity can be seen in the following figure.

 

For recognition of activities, frame differencing and the projection profiles of motion models are used in a very simple way. First, motion models are created by summing up the frame differences. Then the projection profile of each motion model is used for feature extraction: the width of the largest peak on the projection profile is used as the feature. Below are example motion models and their projection profiles for the three activities. Projection profiles are smoothed with a Gaussian window so that peaks and valleys are clearer. Note that images are converted to gray scale before frame differencing.
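The motion model and its smoothed projection profile can be sketched as follows. This is a minimal illustration rather than the submitted Matlab code: it assumes that the frame differences are absolute differences of consecutive gray-scale frames, that the vertical projection is the sum over each image column, and that the Gaussian window has a small fixed radius with clamped edges. The function names, `sigma` and `radius` are hypothetical choices.

```python
import math

def motion_model(frames):
    """Sum the absolute differences of consecutive gray-scale frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    model = [[0.0] * cols for _ in range(rows)]
    for prev, curr in zip(frames, frames[1:]):
        for r in range(rows):
            for c in range(cols):
                model[r][c] += abs(curr[r][c] - prev[r][c])
    return model

def vertical_projection(model):
    """Sum each column of the motion model (assumed meaning of 'vertical')."""
    return [sum(column) for column in zip(*model)]

def gaussian_smooth(profile, sigma=1.0, radius=2):
    """Smooth a 1-D profile with a normalized Gaussian window (edges clamped)."""
    offsets = range(-radius, radius + 1)
    weights = [math.exp(-(k * k) / (2.0 * sigma * sigma)) for k in offsets]
    total = sum(weights)
    weights = [w / total for w in weights]
    n = len(profile)
    smoothed = []
    for i in range(n):
        acc = 0.0
        for k, w in zip(offsets, weights):
            j = min(max(i + k, 0), n - 1)  # clamp the index at the borders
            acc += w * profile[j]
        smoothed.append(acc)
    return smoothed

# Tiny synthetic sequence: three 2x4 frames with motion in the middle columns.
frames = [
    [[0, 0, 0, 0], [0, 0, 0, 0]],
    [[0, 10, 0, 0], [0, 10, 0, 0]],
    [[0, 10, 10, 0], [0, 10, 10, 0]],
]
profile = vertical_projection(motion_model(frames))
smoothed_profile = gaussian_smooth(profile)
```

Summing absolute differences keeps every pixel that changed at any point in the sequence, so the profile is wide when the motion spreads horizontally.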

The figure indicates that the width of the peak differs slightly between the activities. Throughout the experiments, I generally observed that

width_of_peak_crouch < width_of_peak_turn < width_of_peak_wave
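One way to measure the width of the largest peak is sketched below. The report says only that a threshold value is used, so the details here are assumptions: the threshold is taken as a fraction of the profile maximum (`rel_threshold`, a hypothetical parameter), and the width is the length of the contiguous run around the maximum that stays at or above it.

```python
def largest_peak_width(profile, rel_threshold=0.5):
    """Width (in bins) of the contiguous run around the global maximum
    whose values stay at or above rel_threshold * max(profile)."""
    peak = max(profile)
    if peak <= 0:
        return 0
    cutoff = rel_threshold * peak
    center = profile.index(peak)
    left = center
    while left > 0 and profile[left - 1] >= cutoff:
        left -= 1
    right = center
    while right < len(profile) - 1 and profile[right + 1] >= cutoff:
        right += 1
    return right - left + 1

# Synthetic profiles: a narrow peak and a wide peak.
narrow = [0, 1, 8, 10, 7, 1, 0]
wide = [0, 6, 8, 10, 9, 7, 6, 0]
narrow_width = largest_peak_width(narrow)
wide_width = largest_peak_width(wide)
```

With these synthetic profiles the narrow peak yields a smaller width than the wide one, which is the ordering the feature relies on.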

After feature extraction, 60% of the samples are used for training and the remaining 40% are used for testing. I use a form of cross validation based on repeated random splits: the test samples are selected randomly each time, and this is repeated five times. In the testing step, the distance between the feature of the query activity and the training samples is computed, and the query is recognized as the nearest activity. Let q be the feature vector of the query activity. The distance of the query activity to the i-th activity is computed from q and the feature vectors of the n training samples of activity i, where j indexes the training samples.

Finally, the query is assigned to the class for which this distance is minimum.
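The nearest-activity rule can be sketched as follows. The exact distance formula is not reproduced in this text, so the sketch assumes the mean Euclidean distance between q and the n training samples of each activity; the averaging, the function names and the numbers in the example are all assumptions made for illustration.

```python
import math

def class_distance(q, samples):
    """Mean Euclidean distance from q to the training samples of one activity
    (the averaging is an assumption, not taken from the report)."""
    return sum(math.dist(q, t) for t in samples) / len(samples)

def nearest_activity(q, training):
    """Return the activity whose training samples are closest to q on average."""
    return min(training, key=lambda name: class_distance(q, training[name]))

# Hypothetical one-dimensional features (peak widths in pixels).
training = {
    "crouch": [[10.0], [12.0]],
    "turn": [[20.0], [22.0]],
    "wave": [[30.0], [34.0]],
}
predicted = nearest_activity([19.0], training)
```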

 

The results coming from the cameras are combined by voting. Each camera votes for its nearest activity, and the activity with the maximum number of votes is recognized. If every activity receives the same number of votes, the result of the first camera is preferred, since it is positioned frontally and shows the body more clearly.
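The voting rule with its tie-break can be sketched as below. The function name `combine_votes` is hypothetical, and the votes are assumed to be listed with the first (frontal) camera first.

```python
from collections import Counter

def combine_votes(votes):
    """Majority vote over per-camera predictions; votes[0] is the first camera.
    If several activities share the top count, fall back to the first camera."""
    counts = Counter(votes)
    best, best_count = counts.most_common(1)[0]
    tied = [activity for activity, count in counts.items() if count == best_count]
    if len(tied) > 1:
        return votes[0]
    return best

majority = combine_votes(["wave", "turn", "turn"])
tie_break = combine_votes(["crouch", "wave", "turn"])
```

With three cameras and three activities, a tie can only occur when each camera votes for a different activity, which is exactly the case the first camera resolves.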

 

In summary, the system has three main parts: feature extraction, training and testing, which are described in detail below:

1. Feature extraction

    1.1. Create motion models by summing up frame differences.

    1.2. Smooth the vertical projection profile of the motion model.

    1.3. Compute the width of the largest peak on the profile (a threshold value is used here).

2. Training

    2.1. Randomly select training samples

    2.2. Train a system for each camera

3. Testing

    3.1. For each camera:

        3.1.1. Get the nearest activity to the query feature vector.

        3.1.2. Vote for that activity.

    3.2. Combine the results coming from the cameras:

        3.2.1. Recognize the activity as the one with the maximum number of votes.

        3.2.2. If all activities receive equal votes, use the result coming from the first camera, since it is positioned frontally.
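Step 2.1 (random selection of training samples) with the 60%/40% split repeated five times can be sketched as follows. The helper names and the fixed seed are assumptions added for reproducibility; the report does not say how the random selection was implemented.

```python
import random

def random_split(samples, train_fraction=0.6, rng=None):
    """Shuffle a copy of the samples and split it into (train, test)."""
    if rng is None:
        rng = random.Random()
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

def repeated_splits(samples, repeats=5, seed=0):
    """Generate several independent random train/test splits."""
    rng = random.Random(seed)
    return [random_split(samples, rng=rng) for _ in range(repeats)]

# Ten hypothetical sample identifiers for one activity.
splits = repeated_splits(list(range(10)))
```

Each split would be used to train one system per camera and test it on the held-out samples; averaging over the repeats gives the reported rates.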

1.2. Results

In order to show the effect of using multiple cameras, two different experiments are conducted:

1. Using the results coming from all three cameras, combined with the voting technique described in the previous part.

2. Using only the result of the first camera.

The confusion matrices for the first and second experiments are:

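Experiment 1 (three cameras); rows: actual activity, columns: recognized activity

         Crouch   Wave   Turn
Crouch   1        0      0
Wave     0        0.8    0.2
Turn     0        0.4    0.6

Experiment 2 (first camera only); rows: actual activity, columns: recognized activity

         Crouch   Wave   Turn
Crouch   1        0      0
Wave     0        0.8    0.2
Turn     0.3      0.4    0.3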

In the first experiment the recognition rate, which is the proportion of true positives (correctly recognized test samples), is 80%, while in the second experiment this rate decreases to 70%. This shows that using multiple cameras improves recognition by providing different views and different features. Another observation concerns the recognition of the turn activity: when only one camera is used, it is harder to distinguish the turn activity from the others (compare the third rows of the confusion matrices). Although the camera settings in my system do not provide strictly different views, using multiple cameras improves the system performance. This suggests that with a more suitable camera arrangement, e.g. one frontal, one at the top and one at the side, more accurate results could be obtained.
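The two recognition rates can be checked against the confusion matrices with a short calculation. The sketch assumes that each row is normalized to sum to 1 and that every activity has the same number of test samples, so the overall rate is the mean of the diagonal; the variable names are hypothetical.

```python
def recognition_rate(confusion):
    """Mean of the diagonal of a row-normalized confusion matrix
    (assumes the same number of test samples per activity)."""
    size = len(confusion)
    return sum(confusion[i][i] for i in range(size)) / size

# Rows and columns in the order Crouch, Wave, Turn.
three_cameras = [
    [1.0, 0.0, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.4, 0.6],
]
first_camera_only = [
    [1.0, 0.0, 0.0],
    [0.0, 0.8, 0.2],
    [0.3, 0.4, 0.3],
]
rate_three = recognition_rate(three_cameras)
rate_first = recognition_rate(first_camera_only)
```

Under these assumptions the diagonals give (1 + 0.8 + 0.6) / 3 = 0.8 and (1 + 0.8 + 0.3) / 3 = 0.7, matching the 80% and 70% figures.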

(a) An example recognition result for crouching. The first five views are the image sequence for the activity and the last one is its motion model. This is an example of a true positive detection.

(b) An example of a misdetected waving activity. This sequence is recognized as turning. Note that in the motion model the arms can be seen when they are at the sides and when they are up; the intermediate steps are not captured, perhaps because the action is performed so quickly. The projection profile of the motion model can therefore resemble the profiles of the turning activity. This misdetection could be prevented if there were more than five images per activity.

(c) An example of a misdetected turning activity, which is recognized as waving by the system. The motion model has a wide range of white pixels on the vertical projection profile, so it is detected as a wave activity. If one of my cameras had been placed at the side, this misdetection could have been avoided, since turning and waving look different from that camera.

1.3. Conclusion

In this part of the assignment, a simple video-based activity recognition system is proposed. The system uses the information coming from three different cameras and recognizes three activities. The results show that using multiple cameras improves the recognition rate, even though the orientations of the cameras in my system differ only slightly.

Difficulties in recognition of the activities:

1. For each camera, there are 5 images per activity. With more images, the motion models would be clearer and the system would perform better.

2. One important mistake I made in this assignment is that I could have arranged the cameras with clearly different orientations. But once I had taken the pictures and designed a system, I did not want to change it, since time was limited. With more distinct camera orientations, the design and performance of the system might be improved.

3. The features used for recognition are very simple, so it was difficult to classify an activity. With more sophisticated features, better results could be achieved.

One final remark that occurred to me while doing this assignment :) In social life, looking from different points of view brings success. For example, a decision made by a group of people is often more appropriate than the decision of a single person, because in a group everyone can look from a different point of view and capture a different aspect of the matter. Similarly, it is better to use the views coming from different cameras than to use a single camera. Just a final interpretation about the assignment :)


PART-2: Face Detector

2.1. System Overview

In this part of the assignment, the aim is to detect the face in the view of each camera and evaluate which camera provides the best view. So there are two steps: detecting the face, and selecting the best view among the three cameras for that face.

For detection of faces I trained Gaussian classifiers for face and non-face pixels of the image. For this purpose, I created a small set of training samples taken from the images of the first part. Each pixel is represented by its (R,G,B) values, and the Gaussian classifiers are learned from the pixels of the training samples. The training samples can be seen in the following figure: the left part shows the positive samples and the right part shows the negative samples.
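A Gaussian pixel classifier of this kind can be sketched as follows. The report does not state the form of the Gaussians, so this sketch assumes independent per-channel Gaussians (diagonal covariance) and equal priors for the face and non-face classes; the function names and the sample pixel values are hypothetical.

```python
import math

def fit_gaussian(pixels):
    """Per-channel mean and variance of a list of (R, G, B) pixels."""
    n = len(pixels)
    means = [sum(p[c] for p in pixels) / n for c in range(3)]
    variances = [
        max(sum((p[c] - means[c]) ** 2 for p in pixels) / n, 1e-6)  # avoid zero variance
        for c in range(3)
    ]
    return means, variances

def log_likelihood(pixel, model):
    """Log-density of a pixel under an independent per-channel Gaussian model."""
    means, variances = model
    return sum(
        -0.5 * math.log(2.0 * math.pi * v) - (x - m) ** 2 / (2.0 * v)
        for x, m, v in zip(pixel, means, variances)
    )

def is_face_pixel(pixel, face_model, nonface_model):
    """Label a pixel as face when the face model explains it better."""
    return log_likelihood(pixel, face_model) > log_likelihood(pixel, nonface_model)

# Hypothetical training pixels: skin-like tones versus dark background tones.
face_model = fit_gaussian([(200, 150, 130), (210, 160, 140), (190, 140, 120)])
nonface_model = fit_gaussian([(30, 30, 30), (50, 60, 70), (40, 45, 50)])
skin_like = is_face_pixel((205, 155, 135), face_model, nonface_model)
background_like = is_face_pixel((45, 50, 55), face_model, nonface_model)
```

Because the decision depends only on color, any skin-colored pixel is labeled as face, regardless of where it lies in the image.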

After the training step, the system labels face and non-face pixels in each camera view and computes the proportion of face pixels for each view. It then selects as the best view the image with the maximum proportion.
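The view selection step can be sketched as below, assuming each view has already been turned into a binary mask (1 for face pixels, 0 otherwise). The function names are hypothetical, and returning the first view on a tie is an assumption.

```python
def face_proportion(mask):
    """Fraction of pixels marked as face in a binary mask (list of rows)."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    return sum(sum(row) for row in mask) / total

def best_view(masks):
    """Index of the view with the highest face-pixel proportion
    (the first such view wins on a tie)."""
    proportions = [face_proportion(mask) for mask in masks]
    return max(range(len(masks)), key=lambda i: proportions[i])

# Three tiny hypothetical masks, one per camera.
masks = [
    [[0, 1], [0, 0]],
    [[1, 1], [0, 1]],
    [[0, 0], [0, 0]],
]
selected = best_view(masks)
```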

2.2. Experiments and Results

Below are some examples of detection and evaluation results.

(a) The system selects the view from the third camera as the best view. Note that since the training is based on color, some pixels on the table are recognized as face pixels.

 

(b) The best view is determined to be the first one, but the face is not detected properly due to brightness changes.

 

(c) The second view is evaluated as the best view, although we would expect the first one to be selected. The system detects non-face skin pixels as face pixels, and these pixels inflate the proportion of face pixels. The second view contains more skin pixels, so it is selected as the best one.

(d) The same problem occurs as in figure (c). Detected non-face skin pixels cause the third view to be selected as the best view.

          

As seen from the figures, the system is based on color, so it also detects skin pixels that do not belong to a face. This problem could be solved by using the relative position of the head and arms, or by using additional features besides color.

I also tested my system on other images that contain faces. Below is the face detection result for the image Betke.ppm. The training samples are collected from a single source of pictures, so they do not cover the sample space well. Therefore the system does not perform well at detecting faces under different illumination conditions.

2.3. Conclusion

A simple face detector is implemented by training Gaussian classifiers on the (R,G,B) values of image pixels.

1. The system is based on color, so it also detects some non-face skin pixels. This could be prevented by using additional features or the relative positioning of hands and face, etc.

2. The training set is small and cannot cover the search space accurately, so the detection performance is not very good on images from outside the training set.

3. The system also has difficulty detecting faces under different illumination conditions. This is also due to the fact that it uses the RGB color space as its feature.

4. Small groups of falsely detected non-skin pixels could be eliminated with a second step after detection. This is a future goal; time did not permit implementing this idea :)
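The clean-up step in item 4 was not implemented in this assignment. One possible way to do it, shown purely as an illustration, is to remove connected groups of detected pixels that are smaller than a minimum size. The 4-connectivity, the `min_size` parameter and the function name are all assumptions.

```python
from collections import deque

def remove_small_regions(mask, min_size):
    """Keep only 4-connected regions of 1-pixels with at least min_size pixels."""
    rows, cols = len(mask), len(mask[0])
    cleaned = [[0] * cols for _ in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search to collect one connected region.
            region = []
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if 0 <= ny < rows and 0 <= nx < cols and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) >= min_size:
                for y, x in region:
                    cleaned[y][x] = 1
    return cleaned

# A 2x2 block of detections plus one isolated pixel.
noisy = [
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
cleaned = remove_small_regions(noisy, min_size=3)
```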


For the activity recognition part, I included only 1 sample for each activity in the submission. You can reach all samples from here. Please unzip this file and replace the submitted version of the images directory with it.

The system is implemented in a Linux environment using Matlab 7.1.