Exploiting phonological constraints for handshape recognition in sign language video
|
Handshape is a key articulatory parameter in sign language and thus
handshape recognition from signing video is essential for sign recognition and retrieval.
Handshape transitions in lexical signs (the largest class of signs in signed languages)
are governed by phonological rules that constrain the transitions to involve either
closing or opening of the hand (i.e., to exclusively use either folding or unfolding
of the palm and one or more fingers). Furthermore, akin to allophonic variations in
spoken languages, variations in handshape articulation are observed among different
signers. We propose a Bayesian network formulation that exploits these handshape co-occurrence
constraints and incorporates information about allophonic variations to aid handshape
recognition. We also propose a fast non-rigid image alignment method that improves
robustness to handshape appearance variations when computing observation likelihoods in
the Bayesian network. We evaluate our handshape recognition approach on a large corpus of
lexical signs (described in a subsequent project below). We
demonstrate improved handshape recognition accuracy leveraging linguistic constraints
on handshapes.
[CVPR2011 paper (to appear)]
|
Learning a family of detectors via multiplicative kernels
|
Object detection is challenging when the object class exhibits large
within-class variations. In this work, we show that foreground-background classification
(detection) and within-class classification of the foreground class (pose estimation)
can be jointly learned in a multiplicative form of two kernel functions. Model training
is accomplished via standard SVM learning. Our approach compares favorably to existing
methods on hand and vehicle detection tasks.
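The multiplicative form can be sketched as an elementwise (Schur) product of a foreground/background feature kernel and a within-class pose kernel; since the product of two positive semidefinite kernels is again positive semidefinite, standard SVM training applies to the combined Gram matrix. The RBF choice, bandwidths, and toy data below are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def rbf(A, B, gamma):
    """Gaussian (RBF) kernel matrix between row-vector sets A and B."""
    d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d)

def multiplicative_kernel(X, Theta, gx=0.5, gt=0.5):
    """Elementwise product of a feature kernel and a pose kernel.
    By the Schur product theorem the result is still a valid (PSD)
    kernel, so it can be plugged into a standard SVM solver."""
    return rbf(X, X, gx) * rbf(Theta, Theta, gt)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))      # toy image features
Theta = rng.normal(size=(20, 2))  # toy pose parameters
K = multiplicative_kernel(X, Theta)
```

A precomputed Gram matrix like `K` can be passed directly to an SVM implementation that accepts custom kernels.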
[T-PAMI 2011 paper]
[CVPR 2008 paper]
[CVPR 2007 paper]
|
Layers of graphical models for tracking partially-occluded objects
|
We propose a representation for scenes containing relocatable objects
that can cause partial occlusions of people in a camera's field of view. In this
representation, called a graphical model layer, a person's motion in the ground plane
is defined as a first-order Markov process on activity zones, while image evidence
is aggregated in 2D observation regions that are depth-ordered with respect to the
occlusion mask of the relocatable object. The effectiveness of our scene representation
is demonstrated on challenging parking-lot surveillance scenarios.
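The first-order Markov process over activity zones amounts to standard forward filtering; a minimal sketch is below. The three-zone layout, the transition matrix, and the attenuated observation likelihood for the occluded zone are toy assumptions for illustration, not the paper's learned model.

```python
import numpy as np

# Toy first-order Markov model over 3 activity zones in the ground plane.
# Rows of T sum to 1; T[i, j] is the probability of moving zone i -> j.
T = np.array([[0.8, 0.2, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.2, 0.8]])

def forward(obs_lik, T, prior):
    """HMM forward filtering: posterior over zones after each observation.
    obs_lik[t, j] is the image-evidence likelihood for zone j at time t
    (e.g., attenuated when zone j lies behind a relocatable occluder)."""
    belief = prior * obs_lik[0]
    belief /= belief.sum()
    for o in obs_lik[1:]:
        belief = (T.T @ belief) * o
        belief /= belief.sum()
    return belief

# Evidence repeatedly favors zone 2, e.g. the person emerges from behind
# an occluding object into a well-observed region.
obs = np.tile([0.1, 0.1, 0.8], (5, 1))
belief = forward(obs, T, np.ones(3) / 3)
```

In the full model, the observation regions feeding `obs_lik` are depth-ordered against the occlusion mask, so evidence from occluded zones is down-weighted rather than ignored.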
[CVPR2008 paper]
|
Large video corpus for American sign language retrieval and indexing algorithms
|
There is currently a dearth of large video datasets for sign language
research, in particular datasets that include variation among different native
signers, as well as linguistic annotations of phonological properties
for articulatory parameters such as handshape, hand location, orientation,
movement trajectory, and facial actions. Towards bridging this gap, we introduce the ASL Lexicon Video Dataset, a large
and expanding public dataset containing video sequences of approximately three thousand
distinct ASL signs produced by three native signers. The dataset includes annotations
for start/end video frames, a gloss label for every sign (an English descriptor label
for the sign), and start/end handshape labels. These annotations were coded using
SignStream. A
portion of the dataset corresponding to lexical signs used in
our handshape recognition project described above is displayed here.
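A per-sign annotation record in this dataset could be represented as follows. The field names and the example values are illustrative assumptions; the released annotations are coded in SignStream.

```python
from dataclasses import dataclass

@dataclass
class SignAnnotation:
    """One annotated sign occurrence (illustrative schema)."""
    gloss: str            # English descriptor label for the sign
    start_frame: int      # first video frame of the sign
    end_frame: int        # last video frame of the sign
    start_handshape: str  # handshape label at sign onset
    end_handshape: str    # handshape label at sign offset

    def duration(self):
        """Length of the sign in frames (inclusive of both endpoints)."""
        return self.end_frame - self.start_frame + 1

ann = SignAnnotation("BOOK", 120, 164, "B", "B")
```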
[CVPR4HB2008 paper]
|