Final Project

YIDA XIN






P1 – Proposal





Updated on 10/24/2018




P2 – Update


  1. Project Title:
    SIFT-Enabled Automatic Face Recognition

  2. Team Members:
    Vitali Petsiuk; Yida Xin

  3. Problem Definition:
    We focus on face recognition that is invariant to scale, rotation, and facial expression.  Consequently, we think SIFT is a particularly suitable mechanism, and we will attempt a SIFT-based approach on the LFW database used in the paper that inspired us (Lenc and Král, 2015), as discussed in our project proposal.  The authors of that paper also evaluated their method on the FERET and AR databases, neither of which is readily available to us.  Hence, we stick with LFW.

  4. Inputs and Desired Outputs:
    Both our inputs and our desired outputs are images from the LFW database: the desired outputs are pairs of matched images that belong to the same person, and both images in each matched pair come from LFW itself.

    Link to LFW examples: http://vis-www.cs.umass.edu/lfw/devTrain.html

  5. Background Research:
    At the time of writing, we do not know of any other researchers besides the authors of the paper above who have been actively working on this approach, which is why we have included only this one inspiration/baseline.  It is worth noting that the authors have since made some progress and published another paper in 2017, and they have made their code and datasets available on their website: http://home.zcu.cz/~pkral/sw/

  6. Other Things:
    Below is an example of how our implementation so far handles the matching between SIFT key points in a pair of images that are supposed to match.  The code so far: face_recognition.py, split_lfw.py
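
    For concreteness, the sketch below shows the kind of SIFT key-point matching involved, using OpenCV's built-in SIFT implementation (opencv-python >= 4.4) and a brute-force matcher with Lowe's ratio test.  The image paths are hypothetical placeholders, and the sketch is a simplified illustration rather than the actual contents of face_recognition.py.

      # Minimal sketch: SIFT key-point matching between two LFW images of the same person.
      # Assumes opencv-python >= 4.4 (cv2.SIFT_create); the image paths are placeholders.
      import cv2

      img1 = cv2.imread("lfw/George_W_Bush/George_W_Bush_0001.jpg", cv2.IMREAD_GRAYSCALE)
      img2 = cv2.imread("lfw/George_W_Bush/George_W_Bush_0002.jpg", cv2.IMREAD_GRAYSCALE)

      # Detect key points and compute 128-dimensional SIFT descriptors for both images.
      sift = cv2.SIFT_create()
      kp1, des1 = sift.detectAndCompute(img1, None)
      kp2, des2 = sift.detectAndCompute(img2, None)

      # Brute-force matching; Lowe's ratio test keeps only the distinctive matches.
      bf = cv2.BFMatcher(cv2.NORM_L2)
      good = [m for m, n in bf.knnMatch(des1, des2, k=2) if m.distance < 0.75 * n.distance]

      # Draw the surviving matches side by side and save the visualization.
      vis = cv2.drawMatches(img1, kp1, img2, kp2, good, None)
      cv2.imwrite("sift_matches.png", vis)
      print(len(good), "matches kept after the ratio test")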




Updated on 11/26/2018




P3 – Report


  1. Pipeline:
    The overall pipeline consists of two subpipelines that run in parallel and meet in the middle.  Each subpipeline consists of a pre-processing step and a SIFT-descriptor–extraction step.  The first subpipeline pre-processes the training images and extracts SIFT descriptors from the pre-processed training images; the second subpipeline does the same for the test images.  The two subpipelines meet in the middle at a “match box,” where the Adapted Kepenekci Matching algorithm attempts to correctly match the training images’ SIFT descriptors with the test images’ SIFT descriptors.
    1. Pre-processing:
      First, the Viola–Jones detector (a cascade classifier over Haar-like features) is used to detect where the face is in a training image, and a bounding box is drawn around the detected face.  Then, within this bounding box, a Haar cascade is used once again to detect where the eyes are, assuming there are two eyes, and a bounding box is drawn around each eye.  Next, the centers of the two eye boxes are connected by a line segment, which is then rotated either clockwise or counterclockwise until it is horizontal (note that only the eye boxes are rotated; the face box is not).  Finally, we crop out the region bounded by the face box.  A rough code sketch of the whole pipeline appears at the end of this Pipeline section.

      The purpose of this pre-processing mirrors that of data normalization in the setting of Deep Learning-based Face Recognition.  In both settings, the goal is to reduce the variance in training data, in hopes of obtaining a reasonable trained model.
    2. Scale-Invariant Feature Transformation (SIFT):
      The reader is referred to online resources for details about SIFT.  One thing to note is that we did not use the improved method proposed by [Kirchner, 2016] for extracting refined SIFT descriptors; we use only the original descriptor-extraction method proposed by [Lowe, 2004].
    3. Adapted Kepenekci Matching:
      This matching algorithm sets a threshold on the resulting SIFT descriptors and eliminates the ones that fall below the threshold; a similarity score is then computed from the descriptors that remain.  A simplified sketch follows below.
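
    To make the three steps above concrete, here is a rough end-to-end sketch.  It assumes OpenCV's bundled Haar cascades stand in for the Viola–Jones face and eye detectors, it rotates the whole face image (rather than only the eye boxes) to make the eye line horizontal, and its similarity function only illustrates the threshold-then-score idea; it is not the Adapted Kepenekci Matching algorithm of Lenc and Král (2015).

      # Rough, simplified sketch of the pipeline above: Haar-cascade face/eye detection,
      # eye-line alignment, face cropping, SIFT extraction, and a threshold-then-score
      # matcher.  Illustration only; not the Adapted Kepenekci Matching algorithm.
      import cv2
      import numpy as np

      face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
      eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")
      sift = cv2.SIFT_create()

      def preprocess(gray):
          """Detect the face, rotate so the eye line is horizontal, and crop the face box."""
          faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
          if len(faces) == 0:
              return gray                                   # fall back to the whole image
          x, y, w, h = faces[0]
          eyes = eye_cascade.detectMultiScale(gray[y:y + h, x:x + w])
          if len(eyes) >= 2:
              # Take the two left-most detections as the eyes (left eye first).
              (ex1, ey1, ew1, eh1), (ex2, ey2, ew2, eh2) = sorted(eyes, key=lambda e: e[0])[:2]
              c1 = (x + ex1 + ew1 / 2.0, y + ey1 + eh1 / 2.0)
              c2 = (x + ex2 + ew2 / 2.0, y + ey2 + eh2 / 2.0)
              # Angle of the segment joining the eye centers; rotate the image to level it.
              angle = np.degrees(np.arctan2(c2[1] - c1[1], c2[0] - c1[0]))
              M = cv2.getRotationMatrix2D((x + w / 2.0, y + h / 2.0), angle, 1.0)
              gray = cv2.warpAffine(gray, M, (gray.shape[1], gray.shape[0]))
          return gray[y:y + h, x:x + w]                     # crop the face box

      def descriptors(gray):
          """Pre-process, then extract SIFT descriptors from the cropped face."""
          _, des = sift.detectAndCompute(preprocess(gray), None)
          return des

      def similarity(des_train, des_test, threshold=0.9):
          """Keep each test descriptor's best cosine match only if it clears the
          threshold, then average the surviving scores (illustration only)."""
          a = des_train / np.linalg.norm(des_train, axis=1, keepdims=True)
          b = des_test / np.linalg.norm(des_test, axis=1, keepdims=True)
          best = (b @ a.T).max(axis=1)          # best match score per test descriptor
          kept = best[best >= threshold]        # eliminate matches below the threshold
          return float(kept.mean()) if len(kept) else 0.0

    A test image would then be assigned to the training identity whose image yields the highest similarity score; in our implementation this last step is the Adapted Kepenekci Matching rather than the simple cosine rule above.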

  2. Results:
    After about twenty sweeps through the LFW database, with parameter tuning after each sweep, our best accuracy was slightly over 50%, while our highest similarity score was slightly over 0.8.  This result was obtained on the same dataset as in the paper, using what is essentially a reimplementation of the same architecture.  Consequently, we are curious about which nuances caused such a sharp decline in prediction accuracy while still yielding a reasonably good similarity score whenever the prediction is correct, i.e. a relatively high true-positive rate.  One hypothesis is that Adapted Kepenekci Matching actually eliminated some of the important SIFT descriptors.
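
    To be explicit about what these numbers measure, the small sketch below (the names and inputs are placeholders; this is not our actual evaluation script) tallies accuracy, the true-positive rate, and the mean similarity over the correctly handled pairs.

      # Illustrative tally of the metrics quoted above, given per-pair ground-truth labels,
      # predicted labels, and the matcher's similarity scores (all names are placeholders).
      import numpy as np

      def evaluate(y_true, y_pred, scores):
          y_true, y_pred, scores = map(np.asarray, (y_true, y_pred, scores))
          accuracy = float((y_pred == y_true).mean())
          # True-positive rate: fraction of genuine (same-person) pairs predicted as matches.
          tpr = float((y_pred[y_true == 1] == 1).mean())
          # Mean similarity over the pairs we got right (the ~0.8 figure above).
          mean_sim_correct = float(scores[y_pred == y_true].mean())
          return accuracy, tpr, mean_sim_correct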

  3. Discussion:
    The steps that went into completing this project mirror the scattered steps I took to learn about some of these “old” Computer Vision techniques, such as Viola–Jones, Haar cascades, and SIFT.

    It seems to me that nowadays many people are either unwilling or afraid to ask the “why” question: why should anyone bother to learn all that “old” stuff when the new Deep Learning stuff works like a charm?  In reality, though, the AI community must now face the tradeoff between how much we want to achieve high accuracy quickly in highly specific task domains and how much we want to understand what is “under the hood,” so that generalization across task domains can be done in a human-interpretable and human-trustworthy way.

    SIFT is known for its invariance to scale, rotation, and translation, and for its robustness to changes in illumination and 3D viewpoint.  For each of these properties, there is a transparent mathematical explanation that corresponds clearly to a step in the algorithm.  However, these features are considered “low-level” today and can be learned easily using just the first few layers of a neural net, so we may think of SIFT features as a particular instance of the more general class of low-level features of visual data.  Neural nets, moreover, are capable of automatically learning many “higher-level” features in their hidden layers, but the tradeoff is that a neural net is a black box and we do not know what goes on inside.

    Can we, perhaps, think of interpretable low-level and mid-level features learned by “old” Computer Vision algorithms as seeds for developing neural nets that can learn, in humanly-interpretable ways, the much more general classes of such features, as well as perhaps even higher-level classes of features?

    My research concerns teaching computers to learn common-sense knowledge and to build models of the world using that knowledge.  My fellow travelers believe that all common-sense knowledge, be it on the symbolic level or the sub-symbolic level, must ultimately be grounded in perception.  On the one hand, it seems exciting and useful to figure out ways of quickly grounding all that symbolic stuff (e.g. rule-based systems, semantic nets, knowledge graphs, etc.) on top of Deep Learning-enabled representations of perceptual data taken from the world; on the other hand, it seems safer but much more time-consuming to figure out mathematically transparent ways of grounding symbols on top of more interpretable perceptual representations, many of which might still eventually be Deep Learning-based.  Which way is the right way remains to be seen, and of course it would be silly to deny the value of Deep Learning.  Either way, though, a valid concern remains: how far can, and should, things be allowed to go in the absence of proper seeds, such as the kind of seeds that “old” Computer Vision can provide for “modern” Computer Vision?





Updated on 12/11/2018