Sunday, May 4, 2008

Invariant features for 3-D gesture recognition

Campbell et al test a variety of features for gesture recognition. They combine various Cartesian and polar coordinates, as well as velocities, speed, and curvature, and use these features as input to HMMs. They find that radial speeds perform the best.
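
To make the feature set concrete, here is a rough sketch (my own, not the authors' exact formulation) of how speed, radial speed, and curvature might be computed from a sampled 3-D hand trajectory:

```python
import numpy as np

def trajectory_features(points, origin, dt=1.0 / 30):
    """points: (T, 3) hand positions; origin: (3,) body-centered reference point.
    Returns per-sample speed, radial speed, and an approximate curvature."""
    vel = np.gradient(points, dt, axis=0)                 # finite-difference velocity
    speed = np.linalg.norm(vel, axis=1)
    radius = np.linalg.norm(points - origin, axis=1)      # distance from the body
    radial_speed = np.gradient(radius, dt)                # rate of change of that distance
    acc = np.gradient(vel, dt, axis=0)
    curvature = np.linalg.norm(np.cross(vel, acc), axis=1) / np.maximum(speed, 1e-8) ** 3
    return speed, radial_speed, curvature
```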

Discussion
I was hoping for some novel features, but instead ended up with the usual speed, curvature, and position.

Campbell, L.W., Becker, D.A., Azarbayejani, A., Bobick, A.F., and Pentland, A., "Invariant features for 3-D gesture recognition," Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pp. 157-162, 14-16 Oct 1996.

Interacting with human physiology

Pavlidis et al discuss the use of non-invasive techniques to monitor human physiology. Using thermal imaging, they measure heart rate, respiration rate, and stress level. This requires face tracking, which is generally accomplished with the Condensation method tracking the areas of interest, with adjustments from a more global face tracker and from remembering the position of each area of interest with respect to the face.

Discussion
This is an interesting area of research, but all these applications were presented to my Senior Seminar class 3 years earlier. I was hoping for a new application or method.

I. Pavlidis, J. Dowdall, N. Sun, C. Puri, J. Fei and M. Garbey.
"Interacting with human physiology." Computer Vision and Image
Understanding

Device Independence and Extensibility in Gesture Recognition

Device Independence and Extensibility in Gesture Recognition (2003) Eisenstein, J., et al

Eisenstein et al develop a multi-layer gesture recognition framework. From raw data, a set of postural predicates is recognized. These predicates are in turn combined into gestures by matching against gesture templates. They also implement a context history that can help predict which gesture follows a given gesture sequence. They compare against a baseline monolithic neural network.
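
As a rough illustration of the layered idea (hypothetical predicate names and scoring, not Eisenstein et al.'s implementation): a device-specific layer maps raw frames to boolean postural predicates, and a device-independent layer matches the predicate sequence against stored gesture templates.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical sketch: per-device predicate recognizers feed a
# device-independent template matcher over predicate sequences.
Predicates = Dict[str, bool]

def recognize_gesture(frames: Sequence[list],
                      predicate_fns: Dict[str, Callable[[list], bool]],
                      templates: Dict[str, List[Predicates]]) -> str:
    # Layer 1: raw frames -> postural predicates (this layer is retrained per device).
    pred_seq = [{name: fn(f) for name, fn in predicate_fns.items()} for f in frames]

    # Layer 2: match the predicate sequence against stored gesture templates.
    def score(template: List[Predicates]) -> float:
        hits = sum(all(step.get(k) == v for k, v in t_step.items())
                   for step, t_step in zip(pred_seq, template))
        return hits / max(len(template), 1)

    return max(templates, key=lambda g: score(templates[g]))
```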

Discussion
This classifier isn't really device independent: for each different device you have to retrain the predicate recognizers. Furthermore, retraining the predicate recognizers takes longer (more weight updates) than retraining the single network. The extensibility comes from the fact that it uses an instance-based learner. Any instance-based learner should show similar performance; however, the use of predicates ultimately limits the number of gestures possible.

Feature Selection for Grasp Recognition from Optical Markers

Chang et al use feature subset selection to determine an optimal set of optical markers for tracking grasps. By sequentially adding markers to or removing them from the set of possible markers, they determine that using only 5 of the 30 possible markers does not significantly reduce recognition accuracy (a 3-8% reduction). Recognition also translates well to new objects when using either 5 or 30 markers, but not necessarily to new users, who may form grasps in entirely different manners.

Discussion
Shouldn't the other 2 markers that are used to define the local coordinate system also be excluded from the feature selection process, since their positions should be invariant as well? Also, several of the markers seem to be placed in highly redundant positions.

Reference
L.Y. Chang, N. Pollard, T. Mitchell, and E.P. Xing, "Feature Selection for Grasp Recognition from Optical Markers," Proceedings of the 2007 IEEE/RSJ Intl. Conference on Intelligent Robots and Systems (IROS 2007), October, 2007, pp. 2944-2950.

Glove-TalkII-a neural-network interface which maps gestures to parallel formant speech synthesizer controls.

Fels and Hinton create a hand-based artificial speech system using neural networks. They create 3 networks that determine the inputs to a speech synthesizer. The first determines whether the left hand is forming a vowel or a consonant. The second determines which vowel is being formed, based on the hand's vertical and horizontal position. The last network determines which consonant the user is forming with the fingers of the right hand. The first is a feedforward network trained on 2600 samples, while the other two are RBF networks with centers trained as class averages. Each network shows low expected error (<6%), and a user trained for 100 hours can speak intelligibly.
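
The vowel and consonant networks are described as RBF networks whose centers are class averages; a minimal sketch of that kind of classifier (my simplification, with one RBF unit per class and a hand-picked width, not Fels and Hinton's exact networks):

```python
import numpy as np

class ClassMeanRBF:
    """Minimal RBF classifier whose centers are the per-class training means."""
    def __init__(self, sigma=1.0):
        self.sigma = sigma

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centers_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # squared distance from each sample to each class center
        d2 = ((X[:, None, :] - self.centers_[None, :, :]) ** 2).sum(axis=2)
        act = np.exp(-d2 / (2 * self.sigma ** 2))   # one RBF unit per class
        return self.classes_[act.argmax(axis=1)]

# e.g. X = (x, y) hand positions and y = vowel labels for the vowel network
```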

Discussion
100 hours of training? For someone they think will learn easily due to prior experience? How long will someone with no experience take to learn to speak with this system? It's also odd that they chose to use phoneme signing rather than interpreting sign language to text and using a text-to-speech converter. That system would undoubtedly have a small translation lag, but could be used by someone who already knew how to sign alphabetically.

Reference

Fels, S. S. and G. E. Hinton (1998). "Glove-TalkII-a neural-network
interface which maps gestures to parallel formant speech synthesizer
controls." Neural Networks, IEEE Transactions on 9(1): 205-212.

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas

Kim et al

They use RFID tracking to guide a mobile robot to a target. Using an RFID transmitter on the target and 2 perpendicularly mounted RFID antennas, the tangent of the angle to the target can be computed as the ratio of the voltages induced in the antennas. This can be used to control the direction of the robot, orienting it toward the target and allowing it to follow the target. Rather than simply monitoring the voltage ratio, the robot turns the antenna array to maintain a constant ratio of 1 between the two antennas (orienting the array so it points at the target) and then turns so that it is aligned with the antenna array.
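
A minimal sketch of the steering idea, assuming the induced voltage in each antenna varies with the cosine of the angle between its axis and the target direction (the paper's own antenna model may differ):

```python
import math

def bearing_from_voltages(v_a: float, v_b: float) -> float:
    """Bearing of the target (radians) from two perpendicular direction-finding
    antennas: tan(theta) is taken as the ratio of the induced voltages."""
    return math.atan2(v_a, v_b)

def steering_error(v_a: float, v_b: float) -> float:
    """The robot servos the antenna array until the voltage ratio is 1,
    i.e. until the bearing sits at 45 degrees between the two antenna axes."""
    return bearing_from_voltages(v_a, v_b) - math.pi / 4.0
```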

Discussion
This only tracks in essentially 1D, though it could be extended to 2D with an additional antenna. It could be used for hand tracking in 2D, but posture determination would still need a glove, or tracking of multiple transmitters, which would be bulky when placed on the fingers.

Reference
Kim, M., Chong, N.Y., Ahn, H.-S., and Yu, W. 2007. RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas. Proceedings of the 3rd Annual IEEE Conference on Automation Science and Engineering, Scottsdale, AZ, USA, Sept 22-25, 2007.

$1

Wobbrock et al

They perform template matching, starting with a single example for each class.

Step 1: Resample to 64 points
Step 2: Rotate so that the line between the starting point and the centroid (average (x,y)) is horizontal (centroid at origin)
Step 3: Scale to a reference square and translate so that the centroid becomes the origin.
Step 4: The pathwise distance between a template and an input is computed by averaging the distance between corresponding points. This distance is taken as the minimum over several possible rotations of the input within +/-45 degrees. Adding additional templates with the same name can capture variations of that class. $1 performs on par with Dynamic Time Warping, and both outperform Rubine. A rough sketch of these steps is given below.
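
A minimal sketch of the four steps, using a coarse rotation sweep in place of the paper's golden-section search (parameter values are illustrative):

```python
import numpy as np

def resample(pts, n=64):
    """Resample a stroke to n points evenly spaced along its arc length."""
    pts = np.asarray(pts, float)
    d = np.r_[0, np.cumsum(np.linalg.norm(np.diff(pts, axis=0), axis=1))]
    t = np.linspace(0, d[-1], n)
    return np.c_[np.interp(t, d, pts[:, 0]), np.interp(t, d, pts[:, 1])]

def normalize(pts, size=250.0):
    """Rotate so the first-point-to-centroid line is horizontal, scale to a
    reference square, and translate the centroid to the origin."""
    c = pts.mean(axis=0)
    ang = -np.arctan2(pts[0, 1] - c[1], pts[0, 0] - c[0])
    R = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
    pts = (pts - c) @ R.T
    span = pts.max(axis=0) - pts.min(axis=0)
    pts = pts * (size / np.maximum(span, 1e-8))
    return pts - pts.mean(axis=0)

def path_distance(a, b):
    """Average distance between corresponding points."""
    return np.linalg.norm(a - b, axis=1).mean()

def best_distance(candidate, template, angles=np.radians(np.arange(-45, 46, 5))):
    """Minimum path distance over a sweep of rotations within +/-45 degrees."""
    dists = []
    for ang in angles:
        R = np.array([[np.cos(ang), -np.sin(ang)], [np.sin(ang), np.cos(ang)]])
        dists.append(path_distance(candidate @ R.T, template))
    return min(dists)
```

A candidate stroke would then be classified by computing best_distance(normalize(resample(stroke)), normalize(resample(t))) against every stored template and taking the closest one.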

Discussion
The differences in error are not surprising, considering some gestures are very similar from the standpoint of the features specified by Rubine, especially after rotating and rescaling. This also forces all classes to have the same value for 2 of the features, since they have the same bounding box after scaling.

Reference
Wobbrock, J., A. Wilson, and Y. Li. Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes. UIST, 2007.

Enabling fast and effortless customisation in accelerometer based gesture interaction

Mäntyjärvi et al.

They create gesture recognizers using accelerometers. First the gesture is resampled to 40 points (?). Each resampled point is then vector quantized into one of eight codewords. The resampled, vector-quantized data is then input to a set of HMMs. To create their training set, rather than force each user to repeat gestures multiple times, the authors added noise to the data samples, testing both uniform and Gaussian noise distributions. The entire set was used to determine the vector quantization codebook, but was divided for HMM training and testing. The gestures represent DVD player instructions and are 2D symbols. The system was tested using various numbers of training examples for the HMMs, various amounts of noise, and various amounts of noisy data.
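
A sketch of the two preprocessing ideas: synthesizing extra training repetitions by adding noise to a single recording, and quantizing samples against a shared 8-codeword codebook (SciPy's k-means is used here; the paper's codebook construction may differ):

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

def augment(example, copies=10, sigma=0.05, uniform=False):
    """Create synthetic repetitions of one recorded gesture by adding noise.
    example: (T, 3) accelerometer samples, already resampled to a fixed length."""
    rng = np.random.default_rng(0)
    noise = (rng.uniform(-sigma, sigma, (copies,) + example.shape) if uniform
             else rng.normal(0.0, sigma, (copies,) + example.shape))
    return example[None, :, :] + noise

def quantize(examples, codebook_size=8):
    """Build a shared codebook from all samples, then map each gesture to a
    sequence of codeword indices for HMM training/testing."""
    flat = np.concatenate([e.reshape(-1, e.shape[-1]) for e in examples])
    codebook, _ = kmeans2(flat, codebook_size, minit='points')
    return [vq(e.reshape(-1, e.shape[-1]), codebook)[0] for e in examples], codebook
```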


Discussion
The gesture set is just 2D; any sketch recognition system could get as good or better results, with no need to train HMMs. The gesture set is also very simple, which is the only reason they can get away with using only 8 codewords and training HMMs on tiny datasets (1-2 examples).

Reference
Mäntyjärvi, J., Kela, J., Korpipää, P., and Kallio, S. Enabling fast and effortless customisation in accelerometer based gesture interaction. Proc. of MUM '04, ACM Press (2004).

Gesture Recognition with a Wii Controller

Schlömer et al recognize gestures using the accelerometer data of the Wiimote. First the data is vector quantized with k-means and k=14. Next the data is input to HMMs, one for each of 5 gestures. The accelerometer data is also filtered before processing: acceleration values below a threshold are considered idle and ignored, as are accelerations that differ too greatly on all components from the previous acceleration.

Discussion
Wii + HMM = gestures. Also, all but one of the gestures lie in a 2D plane, which should make the remaining one relatively simple to classify.

Reference
Schlömer et al. Gesture Recognition with a Wii Controller. 2007.

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction

The Spidar G&G is an input device that allows users to interact in 3D through both translation and rotation for 2 hands. It is composed of 2 Spidar Gs which feature a sphere that is supported by 8 strings attached to tensioning motors. Moving or rotating the sphere puts a certain tension on different strings which is converted into translation or rotation of the object on the screen or cursor. The Spidar also features a button on each sphere which when pressed selects an object. Feedback can also be provided to the user by motors pulling on the strings. This is featured when a held object intersects with another. The sphere is pulled in resistance to one object entering the other.

Discussion
While the users seem to adapt to the SPIDAR G&G, the strings wrapped around the sphere seem like they would wrap over the fingers, making the device uncomfortable to use, especially when the strings pull due to object interaction.

Hand gesture modelling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker

Patwardhan and Roy update the EigenTracker to use particle filtering to automatically select and track the hand. The tracker predicts the hand's position in each image, finds the hand using the eigenvectors of hand templates, and computes the reconstruction error of the predicted image relative to the previous two iterations. If the change is large enough, a change in hand shape has occurred. By tracking the shape between hand-shape changes, gestures can be determined.

Discussion
The change in eigen-templates used to find the hand seemingly just finds corners. Aren't there easier methods to find corners?

Kaustubh Srikrishna Patwardhan and Sumantra Dutta Roy, "Hand gesture modelling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker," Pattern Recognition Letters, Volume 28, Issue 3, 1 February 2007, Pages 329-334 (Special Issue on Advances in Visual Information Processing, ICVGIP 2004).

Taiwan sign language (TSL) recognition based on 3D data and neural networks

Lee and Tsai recognize Taiwan Sign Language with neural networks. The hands are tracked with a Vicon system with markers on the fingertips, and the features used are the distances between the fingertips and the wrist and between each pair of fingertips. These are input into a back-propagation neural network with three hidden layers and an output node for every class. At large hidden layer sizes, the network can achieve over 94% accuracy on test data.
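
The feature vector is easy to reproduce from the marker positions; a sketch (the marker names are mine) with 5 fingertip-to-wrist distances and 10 pairwise fingertip distances:

```python
import numpy as np
from itertools import combinations

def tsl_features(markers):
    """markers: dict of 3-D positions with keys 'wrist', 'thumb', 'index',
    'middle', 'ring', 'pinky'. Returns fingertip-to-wrist and
    fingertip-to-fingertip distances as one feature vector."""
    tips = ['thumb', 'index', 'middle', 'ring', 'pinky']
    to_wrist = [np.linalg.norm(np.subtract(markers[t], markers['wrist'])) for t in tips]
    between = [np.linalg.norm(np.subtract(markers[a], markers[b]))
               for a, b in combinations(tips, 2)]
    return np.array(to_wrist + between)   # 5 + 10 = 15 features per frame
```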

Discussion
Vicon hand tracking + neural network = posture classifier.

Yung-Hui Lee and Cheng-Yueh Tsai, Taiwan sign language (TSL) recognition based on 3D data and neural networks, Expert Systems with Applications, In Press, Corrected Proof, available online 17 November 2007.

Wiizards: 3D gesture recognition for game play input

Kratz et al use HMMs to classify accelerometer data from Wiimotes to create a game. Their system achieves high accuracy rates when tested on data generated by the same users who produced the training data, but only about 50% accuracy on other users.

Discussion
HMMs + Wii = gesture recognition.

Reference
Kratz, L., Smith, M., and Lee, F. J. 2007. Wiizards: 3D gesture
recognition for game play input. In Proceedings of the 2007 Conference
on Future Play (Toronto, Canada, November 14 - 17, 2007). Future Play
'07. ACM, New York, NY, 209-212.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning

Lieberman and Breazeal describe a suit that adds tactile feedback to the learning process for a user making arm gestures. The suit produces vibrations on the arm at positions that make the user feel as if the suit is pushing his arm towards the desired position. Rotation of the arm is suggested by a vibration sequence around the arm. The user's arm is tracked using a Vicon camera system that tracks targets on the arm and maps them to an arm model. The angles of the joints in this model are compared to a reference gesture, and the magnitude of the difference in angles produces a corresponding vibration in the suit. Once users were accustomed to the suit, it improved their learning rate and their ability to mimic the taught gestures.

Discussion
Not much gesture recognition going on. The tracking could be adapted to an instance based classification system. Its use as a learning tool is interesting, and would be cool to try.

Reference

J. Lieberman & C. Breazeal (in press) "TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning". IEEE Transactions in Robotics (T-RO).

Articulated Hand Tracking by PCA-ICA Approach

Kato et al perform PCA and ICA on hand motion data captured from a CyberGlove and determine that five components suffice to describe hand motion: motion of the pinky, ring finger, middle finger, index finger, and thumb. These components are translated into a 3D model of the hand, from which 2D visual templates are created. These templates allow a visual hand tracking system to project the image of the hand onto the five components and produce a representation of the hand.

Discussion
The input consists of touching various fingers to the palm, so of course the variation lies in how bent the fingers are. This seems like a lot of effort to get a system that models the hand by just how much the fingers are bent.

Reference
Makoto Kato, Yen-Wei Chen, and Gang Xu. Articulated Hand Tracking by PCA-ICA Approach. Computer Vision Laboratory, College of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan.

A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models

Bernardin et al add hand-mounted pressure sensors to the CyberGlove to determine grasps. The sensors are mounted at various positions on the hand that contact grasped objects. The CyberGlove plus pressure data is input into a set of HMMs. Systems trained on a single user achieve 75-92% accuracy, while systems trained on multiple users achieve about 90% on average.

Discussion
Not much to say. Pressure seems like a very useful piece of information for determining contact between the hand and objects.

Reference
Bernardin, K., K. Ogawara, et al. (2005). "A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models." Robotics, IEEE Transactions on 21(1): 47-57.

The 3D Tractus: a three-dimensional drawing board

Lapides et al create a tool for 3D sketching on a tablet PC. The tool is a table whose surface height the user can adjust; the height is transmitted to the PC. This allows users to sketch in 2D on the tablet and move the surface up and down for the third dimension. The table top is counter-balanced so it slides up and down easily. To convey where a point lies along the vertical axis, both a thickness gradient and image hiding are used: any ink "above" the current level of the screen is not displayed, and the further lines are below the level of the screen, the thinner they become. A perspective display also gives the user cues about depth in the drawing.

Discussion
Moving the table up and down to draw depth seems awkward. I think I'd prefer to just rotate a drawing to add depth.

Reference
Lapides, P., Sharlin, E., Sousa, M.C., and Streit, L. The 3D Tractus: a three-dimensional drawing board. Horizontal Interactive Human-Computer Systems, 2006 (TableTop 2006), First IEEE International Workshop on, 5-7 Jan. 2006.

Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures

Ogris et al classify actions taken in a bike shop using a combination of gyroscopic and ultrasonic sensors. They try several methods to classify the gestures: HMMs and 2 frame-based methods, kNN and C4.5 (a decision tree). For kNN and C4.5, the classifiers vote on a set of sliding windows, and the majority vote decides how to classify the gesture. They also test several sensor fusion methods: plausibility analysis, joint feature vector classification, and classifier fusion. The fusion techniques improve classification results greatly.
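
The frame-based voting scheme can be sketched roughly as follows (window and step sizes are illustrative, and classify stands in for the trained kNN or C4.5 model):

```python
from collections import Counter

def window_vote(samples, classify, window=10, step=5):
    """Frame-based classification with majority voting over sliding windows.
    `classify` maps a window of samples to a label; ties go to the label seen first."""
    votes = [classify(samples[i:i + window])
             for i in range(0, max(len(samples) - window + 1, 1), step)]
    return Counter(votes).most_common(1)[0][0]
```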

Discussion
While the authors discuss the limitations of ultrasonics, such as sensitivity to reflections, they just leave it to the classifier to filter the noise. Wouldn't a real shop environment have a great deal of reflections from the surroundings? This isn't really addressed in the paper; maybe they had to work in an empty room.

Reference
Ogris, G., Stiefmeier, T., Junker, H., Lukowicz, P., and Troster, G. 2005. Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures. In Proceedings of the Ninth IEEE international Symposium on Wearable Computers (October 18 - 21, 2005). ISWC. IEEE Computer Society, Washington, DC, 152-159.

Saturday, May 3, 2008

American Sign Language Recognition in Game Development for Deaf Children

Brashear et al combine visual and accelerometer-based hand tracking. Accelerometers provide x, y, and z position. Visual tracking provides x, y hand centers, mass, major and minor axes, eccentricity, and orientation. These are input to the Georgia Tech Gesture Toolkit. They achieve fairly high word accuracy but relatively low sentence accuracy.

Discussion
Another fairly straightforward gesture system: get data from a tracking system and plug it into an HMM. The lower sentence accuracy is easily explained: missing a single word makes an entire sentence incorrect, and there are multiple words per sentence.

Reference
Brashear, H., Henderson, V., Park, K., Hamilton, H., Lee, S., and Starner, T. 2006. American sign language recognition in game development for deaf children. In Proceedings of the 8th international ACM SIGACCESS Conference on Computers and Accessibility (Portland, Oregon, USA, October 23 - 25, 2006). Assets '06. ACM, New York, NY, 79-86.

A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence

Sagawa et al focus on gesture segmentation. They define potential segmentation points as minima of hand velocity and sufficiently large changes in hand direction. These points are filtered for noise. They also determine whether a gesture is one- or two-handed by computing the ratio of hand velocities. Lastly, sentences are formed by scoring combinations of recognized words and transitions.

Discussion
Sagawa et al use essentially the same segmentation points (speed and curvature) that we've seen before. Determining the number of hands used is also straightforward (are they moving at about the same speed?).

Reference
Hirohiko Sagawa and Masaru Takeuchi, "A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence," Automatic Face and Gesture Recognition (FG 2000), p. 434, 2000.

COMPUTER VISION-BASED GESTURE RECOGNITION FOR AN AUGMENTED REALITY INTERFACE

Störring et al present an augmented reality interface using hand gestures captured by a head-mounted camera. First they must segment the hand from the background. The images are projected into a chromaticity space so that color can be separated from intensity and other image features. The hands and objects are modeled as chromaticity distributions represented as Gaussians, and each pixel is classified as hand, background, or PHO object. Objects must fall within a specific size range and have missing pixels filled in using an opening filter. Next the hand is plotted radially from its center, and the number of protrusions beyond a certain radius is counted to determine the gesture.

Discussion
Projecting the hand on radial coordinates to determine the number of outstretched fingers is rather novel, but the "robustness" to how a gesture is formed is simply attempting to change a drawback into an advantage. They can't tell which fingers are outstretched, only the number, so they say that's a desired quality. Also, they only provide the generic "users adapted quickly" as evidence that the system works well.

Reference
Moritz Störring, Thomas B. Moeslund, Yong Liu, and Erik Granum. Computer Vision-Based Gesture Recognition for an Augmented Reality Interface. In 4th IASTED International Conference on Visualization, Imaging, and Image Processing, pages 766-771, Marbella, Spain, Sep 2004.

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition

Westeyn et al present the Georgia Tech Gesture Toolkit and several example applications created with it. GT2k is built on HTK, an HMM toolkit. Essentially, users create HMMs for each gesture and a grammar that combines isolated gestures into sequences with meaning. The toolkit automatically trains the HMMs with data that has been annotated by the application creator. The system also allows for cross-validation to be used to determine how well the system will perform on real data. They also present several applications such as a workshop activity recognizer.

Discussion
They really don't add much to HTK. It's just a new coat of paint so that HTK looks like it's doing something new.

Reference

Westeyn, Brashear, Atrash, Starner. Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition. Proceedings of the 5th International Conference on Multimodal Interfaces. Vancouver, British Columbia, Canada (2003), 85-92.

3D Visual Detection of Correct NGT Sign Production

Lichtenauer et al use 3D visual tracking to classify Dutch sign language. Using a single camera they can capture 2D features: x and y position, displacement, motion angle, and velocities. With 2 cameras these can be converted into 3D positions, angles, and velocities. DTW is used to map gestures to a reference gesture. Gestures are matched on an individual-feature basis, creating a classifier for each feature. Each feature classifier determines whether an example's feature falls within the central 90% of a Gaussian modeling the positive examples of the class, and a sign is accepted if the average of the feature classifications is above a threshold.
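
A sketch of the per-feature acceptance idea, assuming each feature of a (DTW-aligned) candidate is checked against the central 90% of a Gaussian fit to positive examples and the per-feature votes are averaged (the 0.7 acceptance threshold is mine, not the paper's):

```python
import numpy as np

def feature_votes(example, means, stds, z=1.645):
    """One weak classifier per feature: accept if the feature value falls inside
    the central ~90% of a Gaussian fit to positive examples of the sign."""
    inside = np.abs(np.asarray(example) - means) <= z * stds
    return inside.mean()                      # fraction of features that accept

def detect_sign(example, means, stds, threshold=0.7):
    """Accept the sign if enough individual feature classifiers agree."""
    return feature_votes(example, means, stds) >= threshold
```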

Discussion
This paper is interesting in that it is the first one we've seen using DTW. While the classification seems fairly accurate, they don't appear to decide between classes, only whether an example belongs to a given class.


Reference
Lichtenauer, J.F., ten Holt, G.A., Hendriks, E.A., and Reinders, M.J.T. 3D Visual Detection of Correct NGT Sign Production.

Television Control by Hand Gestures

Weissman and Freeman present an early gesture recognition system. Using hand tracking through video, they determine where the user's hand is by applying a hand template to the image. They overlay a control GUI on the TV screen, which the user manipulates by moving his hand. The hand is displayed on the GUI and acts like a buttonless mouse. Closing the hand ends the manipulation.

Discussion
This shows how gesture input started. It is a very simple interface with relatively simple tracking and interaction methods.

Reference

Television Control by Hand Gestures. William T. Freeman, Craig D. Weissman. TR94-24 December 1994

A Survey of Hand Posture and Gesture Recognition Techniques and Technology

LaViola presents an excellent literature survey of hand posture and gesture recognition.
Included are:

Devices - Tracking and gloves
Methods:
Features - Template Matching, PCA
Learning - Neural Nets, HMMs, instance-based methods
Applications:
Sign Language Recognition
Gesture to Speech
Virtual Reality
3D Modeling
Control Systems (robots, TV, etc.)

Discussion
A good starting point for finding useful methods. Summarizes the state of the art as of 1999. We're still using pretty much the same methods 8 years later.

Reference
Joseph J. LaViola, Jr. (1999). A Survey of Hand Posture and Gesture
Recognition Techniques and Technology, Brown University.

Real-Time Locomotion Control by Sensing Gloves

Komura and Lam map the motions of the fingers, as measured by a P5 glove, to the motion of a humanoid figure in a real-time game. The user first mimics a reference character so that finger bend and hand orientation can be mapped to various parts of the virtual figure. This is done by matching the period of change in finger flex to the period of the figure's motion. Users were successfully able to control a figure in a game.

Discussion
The authors subtly switch from the P5 to a CyberGlove for their experiments, suggesting that the P5 does not have the sensitivity necessary for this system. In the experiments users took more time to complete tasks but collided fewer times on average, suggesting that they were more careful when using the glove.

Reference
Taku Komura, Wai-Chun Lam; Real-time locomotion control by sensing gloves; Computer Animation and Virtual Worlds 17:5, 513-525, 2006

A dynamic gesture recognition system for the Korean sign language (KSL)

Kim et al use a combination of CyberGloves and 3D position sensors to capture Korean sign language gestures using template matching and neural networks. The overall gesture is matched to templates: the x and y axes are divided into 8 regions, and the motion of the gesture is tracked as positive or negative changes in region. Each gesture is matched to a set of template region changes to determine the gesture type. Posture recognition is performed using fuzzy min-max networks, where each class is defined by a min and max point forming a hyperbox together with a membership function. After matching a sequence of positions to a template, the posture is used to determine which sign is represented.
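
A rough sketch of the hyperbox idea behind the fuzzy min-max posture classifier (the membership function here is a simplified stand-in, not the exact one used in fuzzy min-max networks):

```python
import numpy as np

def hyperbox_membership(x, vmin, vmax, gamma=4.0):
    """Membership is 1 inside the box [vmin, vmax] and decays with distance
    outside it; gamma controls the decay rate (an illustrative choice)."""
    x, vmin, vmax = map(np.asarray, (x, vmin, vmax))
    below = np.maximum(vmin - x, 0.0)
    above = np.maximum(x - vmax, 0.0)
    return float(np.clip(1.0 - gamma * (below + above).mean(), 0.0, 1.0))

def classify_posture(x, boxes):
    """boxes: {label: (vmin, vmax)}; pick the class whose hyperbox fits best."""
    return max(boxes, key=lambda c: hyperbox_membership(x, *boxes[c]))
```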

Discussion
The gesture templates exist only in 2D; however, it should be fairly easy to extend them. Some of the templates seem not to match the motions in Figure 2. The templates also place a limit on the size of gestures performed.

Reference
J. S. Kim,W. Jang, and Z. Bien, "A dynamic gesture recognition system for the Korean sign language (KSL)," IEEE Trans. Syst., Man, Cybern. B, vol. 26, pp. 354–359, Apr. 1996.

Shape Your Imagination: Iconic Gestural-Based Interaction.

Marsh and Watt present a study of naturally made gestures. By presenting a set of cards with common items written on them and asking study members to describe the items nonverbally, the researchers could determine what types of gestures the study members used. They found that people tend to prefer virtual to substitutive gestures and to model shapes using 2 hands rather than one.

Discussion


Reference
T. Marsh and A. Watt, "Shape Your Imagination: Iconic Gestural-Based Interaction," Proceedings of VRAIS '98, p. 122, 1998.

A Survey of POMDP Applications

Anthony Cassandra presents the Partially Observable Markov Decision Process and summarizes several applications. This is largely a literature survey. Types of applications:

Machine Maintenance
Structural Inspection
Elevator Control Policies
Fishery Industry
Autonomous Robots
Behavioral Ecology
Machine Vision
Network Troubleshooting
Distributed Database Queries
Marketing
etc

Discussion
Looking at the references for applications similar to yours could provide useful information about applying POMDPs or HMMs to your situation.

Reference
Anthony Cassandra. A Survey of Partially-Observable Markov Decision Process Applications. Presented at the AAAI Fall Symposium, 1998

Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs

Song and Kim modify the usual HMM model, dividing the observation sequence into blocks through use of a sliding window. Each block of a gesture is used to train the corresponding HMM. Each HMM is used to recognize partial segments of the gesture over the block (train on [o1], then [o1, o2], etc.), and a gesture is recognized through majority voting over the block. They determined that the optimal window size was 3. After the gesture is selected from the set of gesture HMMs, it is compared either to a manually set threshold or to the output of an HMM trained on non-gestures. If the gesture HMM probability exceeds the threshold or the non-gesture HMM, it is determined to be a gesture. Testing demonstrates that using the non-gesture HMM spots gestures more accurately than manual thresholding.
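
The spotting decision reduces to a comparison of log-likelihoods; a minimal sketch, where the per-gesture and non-gesture scores are assumed to come from the trained HMMs (the manual threshold value is illustrative):

```python
def spot_gesture(loglik_by_gesture, loglik_non_gesture=None, manual_threshold=-50.0):
    """Pick the best gesture HMM for the current block, then accept it only if it
    beats the non-gesture model (or a hand-set threshold when no such model exists)."""
    best = max(loglik_by_gesture, key=loglik_by_gesture.get)
    score = loglik_by_gesture[best]
    bar = loglik_non_gesture if loglik_non_gesture is not None else manual_threshold
    return best if score > bar else None   # None means "not a gesture"
```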

Discussion
The gestures used in the experiment are very simple, mostly lift one arm or the other. Template matching could probably achieve similar results while being much less complex to implement.

Reference
Jinyoung Song and Daijin Kim, "Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs," ICPR 2006, pp. 1231-1235, 2006.

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation

Ip et al present a system that interprets hand and arm motion into music. They begin with a discussion of music theory that helps them determine what tones to play. Musical theory such as chord coherence and cadence can help determine which chord should follow previous ones. The system itself is composed of a pair of CyberGloves and a Polhemus 3D position tracker for each glove. The right hand determines the melody. New notes are generated when the user flexes his wrist, and the height of the hand determines the pitch of the note. Vertical movement of the hand can cause the pitch to shift with the motion. Finger flexion determines the dynamics and volume of the note. Lifting the left hand brings in a second instrument to play in either unison or harmony, and clenching the left hand initiates cadence and terminates the music.

Discussion
After describing the system, a usability discussion would have been useful. Waving the hand to generate notes seems like it would be very tiring. The use of chord coherence to determine the next chord suggests that picking the chord you want may be difficult, and only possible by trial and error through pitch shifting. Also, the use of CyberGloves is somewhat overkill, since all they are interested in is when the wrist bends and the general degree of finger flex.

Reference
Ip, H. H. S., K. C. K. Law, et al. (2005). Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation. Multimedia Modelling Conference, 2005. MMM 2005. Proceedings of the 11th International.

Wednesday, February 6, 2008

A Similarity Measure for Motion Stream Segmentation and Recognition - Li and Prabhakaran

Summary
Li and Prabhakaran define a metric that measures how similar two gestures are to one another, based on singular value decomposition. After collecting a matrix A of sensor observations (a column for each sensor, a row for each observation), they find the eigenvalues and eigenvectors of the square matrix A^T A. Using only the first k eigenvectors/eigenvalues, they compute a weighted sum of the dot products of the corresponding eigenvectors. The metric ranges from 0 (not similar) to 1 (identical). To recognize and segment gestures from a stream, beginning at the start of the stream or the end of the last gesture, they scan a section of the stream varying in size between a minimum and maximum window. For each window size, the similarity to stored isolated gestures is computed, and the window that is most similar to a stored gesture is classified as that gesture. After testing with both CyberGlove and motion capture data of both isolated gestures and gesture sequences, they determined that this metric was much more accurate, especially on data streams, than previous metrics, while taking time comparable to the fastest previous metric.
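
A sketch of the similarity metric as I understand it (the eigenvalue weighting here is a plausible choice, not necessarily the paper's exact weights):

```python
import numpy as np

def svd_similarity(A, B, k=3):
    """Similarity in [0, 1] between two motion streams A, B of shape
    (frames, sensors), using the top-k eigenvectors/eigenvalues of A^T A and B^T B."""
    def eig_top(M):
        w, V = np.linalg.eigh(M.T @ M)        # eigenvalues in ascending order
        order = np.argsort(w)[::-1][:k]
        return w[order], V[:, order]
    wa, Va = eig_top(A)
    wb, Vb = eig_top(B)
    weights = (wa / wa.sum() + wb / wb.sum()) / 2.0
    dots = np.abs(np.sum(Va * Vb, axis=0))    # |cos| between corresponding eigenvectors
    return float(np.sum(weights * dots))
```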

Discussion
I liked that they not only gave overall accuracy comparisons between the three metrics, but also compared accuracy over a wide range of k values. However, while they discuss dead time (no gestures performed) between two gestures, saying that it causes noise, they don't segment the "no gesture" portions out, but instead incorporate them into the following gesture. The windowing procedure could also have flaws related to aggressive recognition (the first part of a gesture being similar to another gesture), where the beginning of a gesture could be misclassified and the remainder lumped into the next sequence.

Reference

Monday, February 4, 2008

A multi-class pattern recognition system for practical finger spelling translation

Summary
Hernandez et al create a simple, cheap, accelerometer-based glove to track hand postures and classify gestures using dimensionality reduction and a decision tree. The glove consists of five accelerometers attached to the fingers between the second and third joints. By relating the accelerometers to the pull of gravity, the overall position of a finger (or at least that of the instrumented segment) can be estimated. By summing the x-components of the accelerometers and, separately, the y-components, they form a global x and y position. The y-position of the index finger is taken as a measure of the

Discussion
The amount of data reduction would seem to oversimplify hand posture, at least in general. I'm still not convinced that it's adequate to describe the position of the fingers simply by an average or overall curvature and spread when two distinct gestures may differ only in the bend of a single joint. While this hand measurement system seems to work well for signing, it doesn't seem to be useful for general gesturing, since you could in theory have two different gestures with the fingers in the same orientation but different hand positioning. Also, accelerometers tend to be very sensitive to noise, making dynamic, moving gestures difficult.

Reference
Hernandez-Rebollar, J. L., R. W. Lindeman, et al. (2002). A multi-class pattern recognition system for practical finger spelling translation. Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on.

Hand tension as gesture segmentation cue

Summary
Rather than just build a gesture recognizer, Harling and Edwards want to create a gesture framework (or interface). They begin by trying to group gestures into broad classes, initially arriving at four groups that contain most gestures: static posture, static location; dynamic posture, static location; static posture, dynamic location; and dynamic posture, dynamic location. Each class in this sequence is more complex than the previous one and builds upon the less complex classes. Since the first class has been solved adequately by previous work, they focus on the second. The key differentiation between the two classes is the problem of gesture segmentation, or separating one posture from the next. They define a "hand tension" metric as a method to segment one posture from the next. When assuming a hand posture, a person must exert effort to maintain that posture rather than return to a natural resting hand position, and between two postures the hand first tends to move toward this rest position. The hand tension metric increases as the hand moves away from the rest position. Gestures can be segmented by finding the minima of hand tension and taking the maximal tension between the minima as the intended postures. They provide two graphs of hand tension during sequences of gestures that suggest that hand tension can segment postures.
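
A sketch of how such a metric could drive segmentation, with tension approximated as the total deviation of the joint flex values from a relaxed-hand reference (Harling and Edwards model the tension itself more carefully than this):

```python
import numpy as np

def hand_tension(flex, rest):
    """Tension as total deviation of joint flex values from the relaxed hand.
    flex: (T, J) flex readings over time; rest: (J,) relaxed-hand reference."""
    return np.abs(np.asarray(flex) - np.asarray(rest)).sum(axis=-1)

def segment_by_tension(flex_stream, rest, min_gap=5):
    """Candidate posture boundaries at local minima of the tension curve."""
    t = hand_tension(flex_stream, rest)
    minima = [i for i in range(1, len(t) - 1) if t[i] <= t[i - 1] and t[i] <= t[i + 1]]
    kept = []
    for i in minima:
        # keep minima at least min_gap frames apart to suppress noise
        if not kept or i - kept[-1] >= min_gap:
            kept.append(i)
    return kept
```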

Discussion
I like the four gesture classes presented here. It seems to me that most of the gestures we perform fall into the two middle categories, though the most complex is certainly not negligible. The first class, SPSL, sounds too cumbersome to use (make the posture, then hit the recognize key). This paper provides what could be a fairly useful metric for the second class, DPSL, though it could use some updating for modern tools that could give more fine-tuned tension readings. We've previously seen an example of the third class, SPDL: simply use recognizers from SPSL and add location tracking. The fourth class gets harder, though a complex gesture could possibly be modeled as a sequence of sub-gestures from the 2nd or 3rd class.

Reference
Philip A. Harling and Alistair D. N. Edwards. Hand tension as a gesture segmentation cue. Progress in Gestural Interaction: Proceedings of Gesture Workshop '96, pages 75--87, Springer, Berlin et al., 1997

Wednesday, January 30, 2008

A Dynamic Gesture Interface for Virtual Environments Based on Hidden Markov Models - Qing et al.

Summary
Qing et al use HMMs to recognize hand gestures. First, to combat the difficulty of spotting a gesture in a continuous stream of input, they reduce the time sequence of each sensor to its standard deviation, though they don't say how this segments the gestures. Then the data is vector quantized. Next, the 20 sensor standard deviations are used as the observation sequence input to the HMM. Initially, the HMM for each gesture consists of 20 states, one corresponding to each sensor, with the transition probability from state i to i+1 equal to 1 and the rest equal to 0, always starting in the first state. They train the HMMs using 10 examples of each of three gestures: index finger bending, thumb bending, and index/middle fingers bending (difficult to separate, no?). The state transition and initial state probabilities are trained in addition to the observation probabilities. Lastly, they note that they successfully rotate a 3-D cube along 3 axes using this system and its 3 gestures.

Discussion
I don't like this paper. There is no reason to use an HMM in this setup. Some kind of probability distribution estimate for the standard deviation of each sensor value for each gesture class maybe, but HMM this is not. They already know what each state is, so Hidden is out. They aren't modeling a process over time with the states any more, so why use state transitions even? Don't tell me you're going to use an HMM then boil out all the complexity to something that you could just use nearest neighbors or a linear/quadratic classifier for.

Reference
Qing, C., A. El-Sawah, et al. (2005). A dynamic gesture interface for virtual environments based on hidden Markov models. Haptic Audio Visual Environments and their Applications, 2005. IEEE International Workshop on.

Online, Interactive Learning of Gestures for Human/Robot Interfaces - Lee and Xu

Summary
Lee and Xu seek to create an HMM-based system that recognizes hand gestures with little up-front training and that can learn from its mistakes and add new gestures on the fly. First they segment the input stream from a CyberGlove into discrete symbols using a fast Fourier transform and vector quantization. They collect one example of a set of gestures and train several left-to-right HMMs to recognize these gestures. Next, they classify several test gestures using a confidence measure. If this measure is below a threshold, the classifier is certain of its classification and an action is taken; otherwise, it is uncertain and prompts the user for the correct classification. The uncertain example is then either used to create a new HMM and class or to update the appropriate HMM by iterating through Baum-Welch with the additional example. Their iterative method achieves high accuracy (>99%) after a small number of examples and performs on par with batch methods (based on the likelihood that the HMMs would generate the training data).

Discussion
This is a good extension of HMMs, allowing the system to be tuned to a user while in use; however, they do not provide a test accuracy for batch-trained HMMs for comparison, making it difficult to determine which performs more accurately. Their idea of probabilistically determining the certainty of a classification seems like a very good (useful) idea. I'd like to know if their evaluation function is just something that they thought up that works pretty well or if it has some statistical basis.

Reference
Lee, C. and X. Yangsheng (1996). Online, interactive learning of gestures for human/robot interfaces. Robotics and Automation, 1996. Proceedings., 1996 IEEE International Conference on.

Wednesday, January 23, 2008

An Introduction to Hidden Markov Models

Summary
Rabiner and Juang provide an excellent beginner's guide to hidden Markov models. They begin with a bit of background information about HMMs before describing what an HMM is through an example. An HMM is essentially a set of hidden states, about which probabilistic observations can be made, and a set of rules governing how to move between states. Given an HMM, we can produce a series of observations by moving between states according to the rules and then probabilistically generating the observations. Rabiner and Juang also detail three problems that can be solved using HMMs. First is determining the likelihood of a sequence of observations given an HMM; this can be done using the forward or backward procedure, detailed on page 9. Next, given an HMM and an observation sequence, the most likely state sequence can be determined using the Viterbi algorithm on page 11. Lastly, an HMM can be estimated from one or more observation sequences using Baum-Welch re-estimation, also on page 11. Finally, they provide an example application of HMMs: recognition of single spoken words.
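
Since the forward procedure comes up repeatedly in these papers, here is a small scaled implementation of the first problem (evaluating the likelihood of an observation sequence under a discrete HMM); the notation follows the usual (pi, A, B) convention rather than anything specific to the article:

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of a discrete observation sequence under an HMM.
    pi: (N,) initial state probabilities, A: (N, N) transition matrix,
    B: (N, M) emission probabilities, obs: sequence of symbol indices."""
    alpha = pi * B[:, obs[0]]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = (alpha @ A) * B[:, obs[t]]   # induction step
        c = alpha.sum()
        log_p += np.log(c)
        alpha = alpha / c                        # rescale to avoid underflow
    return log_p
```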

Discussion
This paper is a great reference for HMMs. The algorithms are described in a straightforward, understandable manner. The only hard part is deciding when and how to apply an HMM to a given problem.

Reference
Rabiner, L. and B. Juang (1986). "An introduction to hidden Markov models." ASSP Magazine, IEEE 3(1): 4-16.

American Sign Language Finger Spelling Recognition System

Summary
Allen et al. seek to create a system that allows improved communication between the deaf community and the general public. To this end, they first create an automated translator from the alphabetic portion of American Sign Language to written and spoken letters. They use an 18-sensor CyberGlove to measure the position and orientation of the user's fingers and the orientation of the hand with respect to the rest of the arm. They trained a perceptron-based neural network to translate a single person's signs. With ten examples of each letter, they achieved a 90% translation accuracy for a single user.

Discussion
Not much to say about this one. It's essentially CyberGlove + neural network = translator. It's a good first step, but faces a few problems, starting with the hardware being somewhat expensive. Training to a specific user isn't too big of a problem, since it could be marketed to an individual user, but a version that achieves high accuracy for multiple users would be nice.

Reference

Tuesday, January 22, 2008

Flexible Gesture Recognition for Immersive Virtual Environments - Deller, Ebert, et al

Summary
Deller et al create a framework for interaction involving a data glove, analogous to the LADDER framework for geometric sketches. They use a P5 data glove for their system, but it is adaptable to any type of hardware. The data glove provides hand position and orientation information as well as finger flexion. Additionally, the glove has several buttons for additional input. Gestures are defined as a sequence of postures and orientations rather than as motions over time. Postures rely mainly on the flexion of the fingers, though orientation may be important as well; therefore posture information contains both flexion and orientation, as well as a relevance value for orientation. New postures are generated by example: a user simply moves their hand to the correct position to define the posture. Alternately, variations of the posture can be input to create an average posture. Recognition is divided into two phases: data acquisition and gesture management. As the data glove is very noisy, the data must be filtered to obtain adequate values. First a deadband filter is applied and extreme changes are rejected; then a dynamic average is taken to smooth the data. Next, matching posture candidates are found in the posture library, and if a posture is held briefly, a PostureChanged event is created. This contains both the previous and current posture as well as position and orientation. GloveMove and ButtonPressed events are created when the glove position changes enough or a button is pressed. Gesture management matches posture data to stored postures by treating the flexion values as a five-dimensional vector and finding the closest stored posture. If the posture is close enough to the stored one and the orientation is satisfied, it is assigned that posture class. Gestures are defined as a sequence of one or more postures, and the sequence of past postures is matched to possible gestures. The gesture system was demonstrated using a virtual desktop, where users naturally interacted with the environment, grasping objects by making a fist or pointing at objects to examine them more closely.
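
The posture-matching step amounts to a thresholded nearest-neighbor search in flexion space with an optional orientation check; a sketch with illustrative tolerances (the paper's thresholds and distance measure may differ):

```python
import numpy as np

def match_posture(flex, orientation, postures, flex_tol=0.15, ori_tol=0.35):
    """Match current glove data to the stored posture library.
    flex: (5,) finger flexion values; orientation: (3,) hand orientation.
    postures: {name: (flex5, orientation3, orientation_matters)}."""
    best, best_d = None, np.inf
    for name, (p_flex, p_ori, ori_matters) in postures.items():
        d = np.linalg.norm(np.asarray(flex) - np.asarray(p_flex))   # 5-D distance
        if d > flex_tol or d >= best_d:
            continue
        if ori_matters and np.linalg.norm(np.asarray(orientation) - np.asarray(p_ori)) > ori_tol:
            continue
        best, best_d = name, d
    return best   # None means no stored posture is close enough
```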

Discussion
Though it seems relatively simple, the authors do not test recognition accuracy extensively. Also, their demonstration uses only a handful of postures, all of which would seem to be fairly distinct, making posture recognition easy. It would be more interesting to see how accurate posture recognition is for a more expansive posture data set, such as sign language mentioned by the authors. A more robust posture recognizer may be required in the face of a greater number of possibly ambiguous postures.

Reference
Deller, M., A. Ebert, et al. (2006). Flexible Gesture Recognition for Immersive Virtual Environments. Information Visualization, 2006. IV 2006. Tenth International Conference on.

Environmental technology: making the real world virtual - Krueger

Summary
Krueger summarizes his work on shifting from a world in which users must learn up front how to use computers and software to one in which they learn by interacting with the system as they do with the real world. Unlike other researchers of the time, who used bulky hardware to measure how a user was interacting, Krueger focused on interaction through observation, using video and floor pressure sensors to perceive user actions. In creating VideoPlace, Krueger built a virtual shared space that overlapped video of the users' hands with virtual objects, in which multiple users could interact with each other as well as with shared objects via teleconference. In this environment, Krueger observed that users reacted to and interacted with virtual objects much as they would with real ones. His next project created a virtual world through which users could move based on the movement of their hands and body. This led to a variety of VideoPlace applications such as range-of-motion therapy, virtual tutoring, and other virtual educational experiences. Krueger then moved from a large-scale setup to a smaller one, creating the more contained VideoDesk and associated applications such as virtual modeling and sculpting. Throughout his research, Krueger sees teleconferencing as the primary beneficiary of this style of interaction.

Discussion


Reference
Krueger, M. W. (1993). "Environmental technology: making the real world virtual." Commun. ACM 36(7): 36-37.