Sunday, May 4, 2008

Invariant features for 3-D gesture recognition

Campbell et al test a variety of features for gesture recognition. They combine various Cartesian and polar coordinates, as well as velocities, speed, and curvature, and use these features as input to HMMs. They find that radial speeds perform the best.
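
To make the feature set concrete, here is a minimal Python sketch (my own, not the authors' code) of turning a 3-D hand trajectory into per-frame observation vectors of the kind they feed to the HMMs: position, velocity, speed, radial distance and radial speed relative to a reference point, and curvature.

import numpy as np

def trajectory_features(xyz, origin=None):
    # xyz: (T, 3) array of hand positions; origin: reference point for the radial features
    if origin is None:
        origin = xyz.mean(axis=0)  # assumed body-centered reference point
    vel = np.gradient(xyz, axis=0)            # finite-difference velocity
    acc = np.gradient(vel, axis=0)
    speed = np.linalg.norm(vel, axis=1)
    r = np.linalg.norm(xyz - origin, axis=1)  # radial distance from the reference point
    radial_speed = np.gradient(r)             # rate of change of that distance
    # curvature kappa = |v x a| / |v|^3, guarded against zero speed
    curvature = np.linalg.norm(np.cross(vel, acc), axis=1) / np.maximum(speed, 1e-8) ** 3
    # one row per frame; each row would be one HMM observation vector
    return np.column_stack([xyz, vel, speed, r, radial_speed, curvature])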

Discussion
I was hoping for some novel features, but instead ended up with the usual speed, curvature, and position.

Campbell, L.W.; Becker, D.A.; Azarbayejani, A.; Bobick, A.F.; Pentland, A., "Invariant features for 3-D gesture recognition," Automatic Face and Gesture Recognition, 1996. Proceedings of the Second International Conference on, vol., no., pp. 157-162, 14-16 Oct 1996.

Interacting with human physiology

Pavlidis et al discuss the use of non-invasive techniques to monitor human physiology. Using thermal imaging, they measure heart rate, respiration rate, and stress level. This requires face tracking: generally, the area of interest is tracked with the Condensation method, with adjustments from a more global face tracker and by remembering the position of the area of interest with respect to the face.
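
The face-relative bookkeeping can be sketched roughly as follows (my own simplification, not the authors' implementation): a local tracker such as Condensation follows the measurement region, while the global face tracker supplies a frame of reference used to pull the region back when it drifts.

import numpy as np

class AnchoredROI:
    def __init__(self, roi_center, face_box):
        # remember the region of interest's position relative to the face box (x, y, w, h)
        fx, fy, fw, fh = face_box
        self.offset = ((roi_center[0] - fx) / fw, (roi_center[1] - fy) / fh)

    def update(self, local_estimate, face_box, max_drift=0.1):
        # expected position of the ROI given the current face location
        fx, fy, fw, fh = face_box
        expected = np.array([fx + self.offset[0] * fw, fy + self.offset[1] * fh])
        # trust the local (e.g. Condensation) estimate unless it drifts too far from the face-relative position
        if np.linalg.norm(np.asarray(local_estimate) - expected) > max_drift * fw:
            return expected
        return np.asarray(local_estimate)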

Discussion
This is an interesting area of research, but all these applications were presented to my Senior Seminar class 3 years earlier. I was hoping for a new application or method.

I. Pavlidis, J. Dowdall, N. Sun, C. Puri, J. Fei, and M. Garbey. "Interacting with human physiology." Computer Vision and Image Understanding.

Device Independence and Extensibility in Gesture Recognition

Eisenstein, J., et al (2003)

Eisenstein et al develop a multi-layer gesture recognition framework. From raw data, a set of postural predicates is recognized. These predicates are in turn combined into gestures by matching against gesture templates. They also implement a context history that can help predict which gesture follows a given gesture sequence. They compare to a baseline monolithic neural network.
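
The layering can be illustrated with a small sketch (predicate names, sensor fields, and thresholds are my own inventions, not the paper's): a device-specific layer maps raw frames to postural predicates, and a device-independent layer matches the active predicate set against gesture templates.

from typing import Callable, Dict, List

# Layer 1: device-specific predicate recognizers (these are what get retrained per device)
predicate_fns: Dict[str, Callable[[dict], bool]] = {
    "index_extended": lambda frame: frame["index_flex"] < 0.2,  # hypothetical glove sensor field
    "thumb_extended": lambda frame: frame["thumb_flex"] < 0.2,
    "palm_down":      lambda frame: frame["palm_pitch"] < -0.5,
}

# Layer 2: device-independent gesture templates defined over predicate sets
gesture_templates: Dict[str, frozenset] = {
    "point":     frozenset({"index_extended", "palm_down"}),
    "thumbs_up": frozenset({"thumb_extended"}),
}

def recognize(frame: dict) -> List[str]:
    active = {name for name, fn in predicate_fns.items() if fn(frame)}
    # a gesture matches when all of its required predicates are active
    return [g for g, required in gesture_templates.items() if required <= active]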

Discussion
This classifier isn't really device independent: for each different device you have to retrain the predicate recognizers. Furthermore, retraining the predicate recognizers takes longer (more weight updates) than retraining the single network. The extensibility comes from the fact that it uses an instance-based learner, and any instance-based learner should show similar performance; however, the use of predicates ultimately limits the number of gestures possible.

Feature Selection for Grasp Recognition from Optical Markers

Chang et al use feature subset selection to determine an optimal number of optical markers to track grasps. By sequentially adding markers to or removing them from the set of possible markers, they determine that using only 5 of the 30 possible markers does not significantly reduce recognition accuracy (a 3-8% reduction). Recognition also translates well to new objects when using either 5 or 30 markers, but not necessarily to new users, who may form grasps in entirely different ways.
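
A generic sequential forward selection loop in this spirit might look like the sketch below (the classifier, scoring, and data layout are my assumptions; the paper's own setup differs): at each step, add the marker whose inclusion most improves cross-validated grasp classification.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_select_markers(X_by_marker, y, n_keep=5):
    # X_by_marker: list of (N, d) arrays, one per optical marker; y: grasp labels of length N
    selected, remaining = [], list(range(len(X_by_marker)))
    while len(selected) < n_keep and remaining:
        best_score, best_m = -np.inf, None
        for m in remaining:
            X = np.hstack([X_by_marker[i] for i in selected + [m]])
            # score this candidate subset with a simple cross-validated classifier
            score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
            if score > best_score:
                best_score, best_m = score, m
        selected.append(best_m)
        remaining.remove(best_m)
    return selected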

Discussion
Shouldn't the other 2 markers that are used to define the local coordinate system also be excluded from the feature selection process, since their positions should be invariant as well? Also, several of the markers seem to be placed in highly redundant positions.

Reference
L.Y. Chang, N. Pollard, T. Mitchell, and E.P. Xing, "Feature Selection for Grasp Recognition from Optical Markers," Proceedings of the 2007 IEEE/RSJ Intl. Conference on Intelligent Robots and Systems (IROS 2007), October, 2007, pp. 2944-2950.

Glove-TalkII-a neural-network interface which maps gestures to parallel formant speech synthesizer controls.

Fels and Hinton create a hand-based artificial speech system using neural networks. They create 3 networks that determine the inputs to a speech synthesizer. The first determines whether the left hand is forming a vowel or a consonant. The second determines which vowel is being formed based on the hand's vertical and horizontal position. The last network determines which consonant the user is forming with the fingers of the right hand. The first is a fully feedforward network trained on 2600 samples, while the other two are RBF networks with their centers trained as class averages. Each network shows low expected error (<6%), and a user trained for 100 hours can speak intelligibly.
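
As an illustration of the RBF idea with class-average centers, here is a bare-bones sketch under my own simplifications (the paper's networks also learn output weights and widths, which this omits):

import numpy as np

class ClassAverageRBF:
    def __init__(self, width=1.0):
        self.width = width  # shared Gaussian width (a simplifying assumption)

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # one RBF center per class: the mean of that class's training samples
        self.centers_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Gaussian activation of each center; pick the class with the strongest response
        d2 = ((X[:, None, :] - self.centers_[None, :, :]) ** 2).sum(axis=2)
        activations = np.exp(-d2 / (2 * self.width ** 2))
        return self.classes_[activations.argmax(axis=1)]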

Discussion
100 hours of training? For someone they think will learn easily due to prior experience? How long will someone with no experience take to learn to speak with this system? It's also odd that they chose to use phoneme signing rather than interpreting sign language to text and using a text-to-speech converter. That system would undoubtedly have a small translation lag, but it could be used by someone who already knew how to sign alphabetically.

Reference

Fels, S. S. and G. E. Hinton (1998). "Glove-TalkII-a neural-network interface which maps gestures to parallel formant speech synthesizer controls." Neural Networks, IEEE Transactions on 9(1): 205-212.

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas

Kim et al

Uses RFID tracking to guide a mobile robot to a target. With an RFID transmitter on the target and 2 perpendicularly mounted RFID antennas, the tangent of the angle to the target can be computed as the ratio of the voltages induced in the antennas. This can be used to control the direction of the robot, orienting it toward the target and allowing it to follow the target. Rather than simply monitoring the voltage ratio, the controller turns the antenna array to maintain a constant ratio of 1 between the two antennas (orienting the array so it points at the target), then turns the robot so that it is aligned with the antenna array.
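
A toy version of that control loop (the gains, voltage model, and interface here are my assumptions, not the paper's): the voltage ratio gives a steering error for the antenna array, and the robot is then steered toward the antenna heading.

import math

def control_step(v1, v2, antenna_angle, robot_heading, k_ant=0.5, k_rob=0.2):
    # equal induced voltages (ratio of 1) mean the array points at the target,
    # so the deviation of atan2(v1, v2) from 45 degrees serves as the steering error
    error = math.atan2(v1, v2) - math.pi / 4
    antenna_angle += k_ant * error                            # 1) re-point the antenna array at the target
    robot_heading += k_rob * (antenna_angle - robot_heading)  # 2) align the robot with the array
    return antenna_angle, robot_heading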

Discussion
This only tracks in essentially 1-D, though it could be extended to 2-D with an additional antenna. It could be used for hand tracking in 2-D, but posture determination would still need a glove, or tracking of multiple transmitters, which would be bulky when placed on the fingers.

Reference
Kim, M., Chong, N.Y., Ahn, H.-S., and W. Yu. 2007. RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas. Proceedings of the 3rd Annual IEEE Conference on Automation Science and Engineering, Scottsdale, AZ, USA, Sept 22-25, 2007.

$1

Wobbrock et al

Does template matching starting with a single example for each class.

Step 1: Resample to 64 points
Step 2: Rotate about the centroid (the average (x, y) of the points) so that the line from the centroid to the starting point is horizontal.
Step 3: Scale to a reference square and translate so that the centroid becomes the origin.
Step 4: Compute the pathwise distance between a template and the input by averaging the distance between corresponding points, taking the minimum over several possible rotations of the input within +/-45 degrees (a sketch of these steps appears below).

Adding additional templates with the same name can capture variations of that class. $1 performs on par with Dynamic Time Warping, and both outperform Rubine.
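
Here is a condensed sketch of the four steps (it uses a brute-force rotation search in place of the paper's golden section search, and the constants are illustrative):

import math

N, SIZE = 64, 250.0  # number of resampled points and reference-square size

def resample(pts, n=N):
    # step 1: resample the stroke to n equidistant points along its path
    d = [0.0]
    for a, b in zip(pts, pts[1:]):
        d.append(d[-1] + math.dist(a, b))
    total, out, j = d[-1], [], 1
    for k in range(n):
        target = total * k / (n - 1)
        while j < len(pts) - 1 and d[j] < target:
            j += 1
        seg = d[j] - d[j - 1]
        t = 0.0 if seg == 0 else (target - d[j - 1]) / seg
        out.append((pts[j - 1][0] + t * (pts[j][0] - pts[j - 1][0]),
                    pts[j - 1][1] + t * (pts[j][1] - pts[j - 1][1])))
    return out

def centroid(pts):
    return (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))

def rotate(pts, theta, about):
    cx, cy = about
    c, s = math.cos(theta), math.sin(theta)
    return [((x - cx) * c - (y - cy) * s + cx,
             (x - cx) * s + (y - cy) * c + cy) for x, y in pts]

def normalize(pts):
    pts = resample(pts)
    cx, cy = centroid(pts)
    # step 2: rotate about the centroid so the centroid-to-first-point line is horizontal
    pts = rotate(pts, -math.atan2(pts[0][1] - cy, pts[0][0] - cx), (cx, cy))
    # step 3: scale to the reference square, then translate the centroid to the origin
    xs, ys = [x for x, _ in pts], [y for _, y in pts]
    w, h = (max(xs) - min(xs)) or 1e-9, (max(ys) - min(ys)) or 1e-9
    pts = [(x * SIZE / w, y * SIZE / h) for x, y in pts]
    cx, cy = centroid(pts)
    return [(x - cx, y - cy) for x, y in pts]

def path_distance(a, b):
    # step 4: average distance between corresponding points
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)

def recognize(stroke, templates):
    # templates: {class name: raw stroke points}; search rotations over +/-45 degrees
    cand = normalize(stroke)
    best = (float('inf'), None)
    for name, tmpl_pts in templates.items():
        tmpl = normalize(tmpl_pts)
        d = min(path_distance(rotate(cand, math.radians(a), centroid(cand)), tmpl)
                for a in range(-45, 46, 3))
        best = min(best, (d, name))
    return best[1], best[0]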

Discussion
The differences in error are not surprising, considering some gestures are very similar from the standpoint of the features specified by Rubine, especially after rotating and rescaling. The normalization also forces all classes to have the same value for 2 of the features, as they have the same bounding box after scaling.

Reference
Wobbrock, J., A. Wilson, and Y. Li. Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes. UIST, 2007.