Wednesday, February 6, 2008

A Similarity Measure for Motion Stream Segmentation and Recognition - Li and Prabhakaran

Summary
Li and Prabhakaran define a similarity measure for comparing two gestures, based on singular value decomposition. After collecting a matrix A of sensor observations (a column for each sensor, a row for each observation), they find the eigenvalues and eigenvectors of the square matrix A^T A. Using only the first k eigenvectors and eigenvalues, they compute a weighted sum of the dot products of corresponding eigenvectors from the two gestures. The measure ranges from 0 (not similar) to 1 (identical).

To recognize and segment gestures from a continuous stream, they begin at the start of the stream (or at the end of the last recognized gesture) and scan a section of the stream whose size varies between a minimum and a maximum window. For each window size, the similarity to each stored isolated gesture is computed, and the window most similar to a stored gesture is classified as that gesture. Testing with both CyberGlove and motion-capture data, on isolated gestures as well as gesture sequences, they found that this measure was considerably more accurate than previous metrics, especially on data streams, while taking time comparable to the fastest previous metric.
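The eigenvector-based measure described above can be sketched as follows. This is a minimal illustration, not the authors' exact formula: the function name is mine, and I'm assuming each dot product is weighted by the normalized eigenvalues of both matrices, which may differ from the paper's weighting.

```python
import numpy as np

def svd_similarity(A, B, k=3):
    """Sketch of an SVD-based similarity between two motion matrices
    A and B (one row per observation, one column per sensor).
    Assumption: weights are the averaged normalized top-k eigenvalues."""
    # Eigen-decomposition of the symmetric, positive-semidefinite A^T A.
    wa, Ua = np.linalg.eigh(A.T @ A)
    wb, Ub = np.linalg.eigh(B.T @ B)
    # eigh returns eigenvalues in ascending order; keep the k largest.
    wa, Ua = wa[::-1][:k], Ua[:, ::-1][:, :k]
    wb, Ub = wb[::-1][:k], Ub[:, ::-1][:, :k]
    # Weights sum to 1, so the result stays in [0, 1].
    weights = (wa / wa.sum() + wb / wb.sum()) / 2.0
    # Absolute dot products of corresponding eigenvectors
    # (eigenvector sign is arbitrary).
    dots = np.abs(np.sum(Ua * Ub, axis=0))
    return float(np.sum(weights * dots))
```

Comparing a gesture matrix with itself yields 1.0, since every eigenvector dot product is 1 and the weights sum to 1; unrelated gestures with nearly orthogonal dominant eigenvectors score near 0.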

Discussion
I liked that they not only gave overall accuracy comparisons among the three metrics, but also compared accuracy over a wide range of k values. However, while they discuss dead time (periods when no gesture is performed) between two gestures and note that it introduces noise, they don't segment these "no gesture" spans out; instead they fold them into the following gesture. The windowing procedure could also fail through aggressive recognition: if the first part of a gesture resembles a different gesture, the beginning could be misclassified and the remainder lumped into the next segment.
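To make the windowing concern concrete, here is a rough sketch of the segmentation loop as I understand it from the paper's description. Everything here is an assumption for illustration: the names, the greedy "commit to the best window and advance" step, and the generic `similarity` callback are mine, not the authors'.

```python
def segment_stream(stream, templates, similarity, w_min, w_max):
    """Hypothetical sketch of variable-window stream segmentation:
    from the current position, try every window size in [w_min, w_max],
    score each window against every stored isolated-gesture template,
    and greedily commit to the highest-scoring (window, template) pair."""
    pos, labels = 0, []
    n = len(stream)
    while pos + w_min <= n:
        best = (-1.0, None, w_min)  # (score, label, window size)
        for w in range(w_min, min(w_max, n - pos) + 1):
            window = stream[pos:pos + w]
            for label, tmpl in templates.items():
                s = similarity(window, tmpl)
                if s > best[0]:
                    best = (s, label, w)
        labels.append(best[1])
        pos += best[2]  # advance past the recognized gesture
    return labels
```

The greedy commit is exactly where aggressive recognition bites: if a short window at the start of a long gesture happens to score highest against some other template, the loop commits to it, and the rest of the gesture is forced into the next segment.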

1 comment:

Paul Taele said...

Yeah, it is also strange that they don't segment out the "no gesture" spans by classifying them as some junk gesture. Combined with the windowing approach, I see more problems being introduced than remedied. Even if they fine-tuned their thresholds just right, I'm guessing this technique doesn't generalize well to potentially common domains.