Sunday, May 4, 2008

Invariant features for 3-D gesture recognition

Invariant features for 3-D gesture recognition

Campbell et al test a variety of features for gesture recognition. They combine various Cartesian and polar coordinates, as well as velocities, speed, and curvature, and use these features as input to HMMs. They find that radial speeds perform the best.
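A minimal sketch (not the authors' code) of how features like these could be computed from a tracked 3-D trajectory, assuming positions are expressed relative to a body-centered origin; the radial speed is just the time derivative of the distance to that origin:

```python
import numpy as np

def trajectory_features(xyz, dt=1.0 / 30.0):
    """xyz: (T, 3) array of 3-D positions sampled every dt seconds,
    expressed relative to an assumed body-centered origin."""
    vel = np.gradient(xyz, dt, axis=0)       # Cartesian velocity components
    speed = np.linalg.norm(vel, axis=1)      # scalar speed
    radius = np.linalg.norm(xyz, axis=1)     # polar radius from the origin
    radial_speed = np.gradient(radius, dt)   # "radial speed": d(radius)/dt
    return np.column_stack([vel, speed, radial_speed])
```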

Discussion
I was hoping for some novel features, but the winners ended up being the usual speed, curvature, and position.

Campbell, L.W.; Becker, D.A.; Azarbayejani, A.; Bobick, A.F.; Pentland, A., "Invariant features for 3-D gesture recognition," Proceedings of the Second International Conference on Automatic Face and Gesture Recognition, pp. 157-162, 14-16 Oct 1996.

Interacting with human physiology

Interacting with human physiology
Pavlidis et al discuss the use of non-invasive techniques to monitor human physiology. Using thermal imaging, they measure heart rate, respiration rate, and stress level. This requires face tracking, which is generally accomplished by tracking the areas of interest with the Condensation method, with adjustments from a more global face tracker and from remembering the position of each area of interest with respect to the face.

Discussion
This is an interesting area of research, but all these applications were presented to my Senior Seminar class 3 years earlier. I was hoping for a new application or method.

I. Pavlidis, J. Dowdall, N. Sun, C. Puri, J. Fei and M. Garbey. "Interacting with human physiology." Computer Vision and Image Understanding.

Device Independence and Extensibility in Gesture Recognition

Device Independence and Extensibility in Gesture Recognition (2003) Eisenstein, J., et al

Eisenstein et al develop a multi-layer gesture recognition framework. From raw data, a set of postural predicates is recognized. These predicates are in turn combined into gestures by matching to gesture templates. They also implement a context history that can help predict which gesture follows a given gesture sequence. They compare against a baseline monolithic neural network.

Discussion
This classifier isn't really device independent. For each different device you have to retrain the predicate recognizer. Furthermore, retraining the predicate recognizers takes longer (more weight updates) than retraining the single network. The extensibility comes from the fact that it uses an instance-based learner. Any instance-based learner should show similar performance; however, the use of predicates ultimately limits the number of gestures possible.

Feature Selection for Grasp Recognition from Optical Markers

Feature Selection for Grasp Recognition from Optical Markers

Chang et al use feature subset selection to determine an optimal number of optical markers to track grasps. By sequentially adding markers to or removing them from the set of possible markers, they determine that using only 5 of the 30 possible markers does not significantly reduce recognition accuracy (a 3-8% reduction). Recognition also translates well to new objects when using either 5 or 30 markers, but not necessarily to new users, who may form grasps in entirely different manners.
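A hedged sketch of greedy forward selection over markers, with scikit-learn cross-validation and a k-NN classifier standing in for the paper's actual classifier and evaluation protocol:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def forward_select_markers(X, y, n_markers, keep=5):
    """X: (N, 3*n_markers) NumPy array with one 3-column group per marker;
    y: grasp labels. Greedily add the marker that helps accuracy the most."""
    selected, remaining = [], list(range(n_markers))
    while len(selected) < keep:
        scores = []
        for m in remaining:
            cols = [3 * k + i for k in selected + [m] for i in range(3)]
            acc = cross_val_score(KNeighborsClassifier(n_neighbors=3),
                                  X[:, cols], y, cv=5).mean()
            scores.append((acc, m))
        best_acc, best_m = max(scores)     # marker giving the best CV accuracy
        selected.append(best_m)
        remaining.remove(best_m)
    return selected
```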

Discussion
Shouldn't the other 2 markers that are used to define the local coordinate system also be excluded from the feature selection process, since their positions should be invariant as well? Also, several of the markers seem to be placed in highly redundant positions.

Reference
L.Y. Chang, N. Pollard, T. Mitchell, and E.P. Xing, "Feature Selection for Grasp Recognition from Optical Markers," Proceedings of the 2007 IEEE/RSJ Intl. Conference on Intelligent Robots and Systems (IROS 2007), October, 2007, pp. 2944-2950.

Glove-TalkII-a neural-network interface which maps gestures to parallel formant speech synthesizer controls.

Glove-TalkII-a neural-network interface which maps gestures to parallel formant speech synthesizer controls.

Fels and Hinton create a hand-based artificial speech system using neural networks. They create 3 networks that determine the inputs to a speech synthesizer. The first determines whether the left hand is forming a vowel or a consonant. The second determines which vowel is being formed based on the hand's vertical and horizontal position. The last network determines which consonant the user is forming with the fingers of the right hand. The first is a feedforward network trained on 2600 samples, while the other two are RBF networks with centers trained as class averages. Each network shows low expected error (<6%), and a user trained for 100 hours can speak intelligibly.
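A rough sketch of an RBF classifier whose centers are the class averages, in the spirit of the vowel/consonant networks described above; the Gaussian width and the least-squares output weights are my assumptions, not details from the paper:

```python
import numpy as np

class ClassMeanRBF:
    """RBF classifier with one Gaussian unit per class, centered on the class mean."""
    def __init__(self, width=1.0):
        self.width = width

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.classes_ = np.unique(y)
        self.centers_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        Phi = self._phi(X)                                        # hidden activations
        Y = (y[:, None] == self.classes_[None, :]).astype(float)  # one-hot targets
        self.W_, *_ = np.linalg.lstsq(Phi, Y, rcond=None)         # output weights
        return self

    def _phi(self, X):
        d2 = ((X[:, None, :] - self.centers_[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.width ** 2))

    def predict(self, X):
        return self.classes_[np.argmax(self._phi(np.asarray(X)) @ self.W_, axis=1)]
```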

Discussion
100 hours of training? For someone they think will learn easily due to prior experience? How long would someone with no experience take to learn to speak with this system? It's also odd that they chose to use phoneme signing, rather than interpreting sign language to text and using a text-to-speech converter. That system would undoubtedly have a small translation lag, but could be used by someone who already knew how to sign alphabetically.

Reference

Fels, S. S. and G. E. Hinton (1998). "Glove-TalkII-a neural-network interface which maps gestures to parallel formant speech synthesizer controls." Neural Networks, IEEE Transactions on 9(1): 205-212.

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas

RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas

Kim et al

Uses RFID tracking to guide a mobile robot to a target. Using an RFID transmitter on the target and 2 perpendicularly mounted RFID antennas, the tangent of the angle to the target can be computed as the ratio of the voltages induced in the antennas. This can be used to control the direction of the robot, orienting it toward the target and allowing it to follow the target. Rather than simply monitoring the voltage ratios, the robot turns the antenna array to maintain a constant ratio of 1 between the two antennas (orienting the array so it points at the target), then turns the robot so that it is aligned with the antenna array.
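A minimal sketch of the bearing idea: with two perpendicular antennas, the induced voltages act roughly like two projections of the target direction, so their ratio gives the tangent of the bearing. The function names and the simple proportional steering rule are illustrative, not from the paper:

```python
import math

def bearing_from_voltages(v_a: float, v_b: float) -> float:
    """Bearing (radians) of the tag relative to antenna B's axis; the voltage
    ratio acts as the tangent of that angle."""
    return math.atan2(v_a, v_b)

def turn_rate_command(v_a: float, v_b: float, gain: float = 0.5) -> float:
    """Proportional command that drives the voltage ratio toward 1, i.e. toward
    the orientation where the array's bisector faces the tag."""
    return gain * (bearing_from_voltages(v_a, v_b) - math.pi / 4)
```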

Discussion
Only tracks in essentially 1D, though this could be increased to 2D with an additional antenna. Could be used for hand tracking in 2D, but posture determination would still need a glove, or tracking of multiple transmitters, which would be bulky when placed on the fingers.

Reference
Kim, M., Chong, N.Y., Ahn, H.-S., and W. Yu. 2007. RFID-enabled Target Tracking and Following with a Mobile Robot Using Direction Finding Antennas. Proceedings of the 3rd Annual IEEE Conference on Automation Science and Engineering, Scottsdale, AZ, USA, Sept 22-25, 2007.

$1

$1

Wobbrock et al

Does template matching starting with a single example for each class.

Step 1: Resample the gesture to 64 points.
Step 2: Rotate so that the line between the starting point and the centroid (the average (x,y)) is horizontal.
Step 3: Scale to a reference square and translate so that the centroid becomes the origin.
Step 4: Compute the pathwise distance between the templates and an input by averaging the distance between corresponding points. This pathwise distance is taken as the minimum distance over several possible rotations of the input within +/-45 degrees. Adding additional templates with the same name can capture variations of that class. $1 performs on par with Dynamic Time Warping, and both outperform Rubine.
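The four steps are simple enough to sketch directly. This is a compact (and slightly simplified) Python rendering: strokes are lists of (x, y) points, and the golden-section search of the original paper is replaced by a plain scan over candidate rotations:

```python
import math

def _path_length(pts):
    return sum(math.dist(pts[i - 1], pts[i]) for i in range(1, len(pts)))

def resample(points, n=64):
    interval = _path_length(points) / (n - 1)
    pts, out, D = list(points), [points[0]], 0.0
    i = 1
    while i < len(pts):
        d = math.dist(pts[i - 1], pts[i])
        if D + d >= interval:
            t = (interval - D) / d
            q = (pts[i - 1][0] + t * (pts[i][0] - pts[i - 1][0]),
                 pts[i - 1][1] + t * (pts[i][1] - pts[i - 1][1]))
            out.append(q)
            pts.insert(i, q)   # q becomes the next "previous" point
            D = 0.0
        else:
            D += d
        i += 1
    while len(out) < n:        # guard against floating-point shortfall
        out.append(pts[-1])
    return out[:n]

def _centroid(pts):
    return (sum(x for x, _ in pts) / len(pts), sum(y for _, y in pts) / len(pts))

def rotate_to_zero(pts):
    cx, cy = _centroid(pts)
    a = -math.atan2(cy - pts[0][1], cx - pts[0][0])   # indicative angle
    return [((x - cx) * math.cos(a) - (y - cy) * math.sin(a),
             (x - cx) * math.sin(a) + (y - cy) * math.cos(a)) for x, y in pts]

def scale_and_translate(pts, size=250.0):
    xs, ys = [x for x, _ in pts], [y for _, y in pts]
    w, h = (max(xs) - min(xs)) or 1e-9, (max(ys) - min(ys)) or 1e-9
    pts = [(x * size / w, y * size / h) for x, y in pts]
    cx, cy = _centroid(pts)
    return [(x - cx, y - cy) for x, y in pts]

def preprocess(points):
    return scale_and_translate(rotate_to_zero(resample(points)))

def path_distance(a, b):
    return sum(math.dist(pa, pb) for pa, pb in zip(a, b)) / len(a)

def best_distance(candidate, template, step_deg=3):
    # crude scan over +/-45 degrees in place of the golden-section search
    best = float("inf")
    for deg in range(-45, 46, step_deg):
        a = math.radians(deg)
        rotated = [(x * math.cos(a) - y * math.sin(a),
                    x * math.sin(a) + y * math.cos(a)) for x, y in candidate]
        best = min(best, path_distance(rotated, template))
    return best
```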

Discussion
The differences in error are not surprising, considering some gestures are very similar from the standpoint of the features specified by Rubine, especially after rotating and rescaling. The rescaling also forces all classes to have the same value for 2 of the features, as they have the same bounding box after scaling.

Reference
Wobbrock, J., A. Wilson, and Y. Li. Gestures without Libraries, Toolkits or Training: A $1 Recognizer for User Interface Prototypes. UIST, 2007.

Enabling fast and effortless customisation in accelerometer based gesture interaction

Enabling fast and effortless customisation in accelerometer based gesture interaction

Mäntyjärvi et al.

Create gestures using accelerometers. First the gesture is resampled to 40 points (?). Each resampled point is then vector quantized into one of eight codewords. The resampled, vector-quantized data is then input to a set of HMMs. To create their training set, rather than force each user to repeat gestures multiple times, the authors added noise to the data samples, testing both uniform and Gaussian noise distributions. The entire set was used to determine the vector quantization codebook, but was divided for HMM training and testing. The gestures used represent DVD player instructions and are 2D symbols. The system was tested using various numbers of training examples for the HMMs, various amounts of noise, and various amounts of noisy data.
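A hedged sketch of the training-set expansion and vector quantization described above; scikit-learn's KMeans stands in for whatever codebook algorithm the authors actually used, and the noise level is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def augment(gesture, n_copies=10, sigma=0.05, uniform=False):
    """gesture: (T, 3) resampled accelerometer samples; returns noisy copies."""
    rng = np.random.default_rng(0)
    noise = (rng.uniform(-sigma, sigma, (n_copies,) + gesture.shape) if uniform
             else rng.normal(0.0, sigma, (n_copies,) + gesture.shape))
    return gesture[None, :, :] + noise

def build_codebook(all_samples, n_codewords=8):
    """all_samples: (N, 3) pool of samples used to learn the 8-symbol codebook."""
    return KMeans(n_clusters=n_codewords, n_init=10, random_state=0).fit(all_samples)

def to_symbols(gesture, codebook):
    return codebook.predict(gesture)   # one discrete symbol per resampled point
```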


Discussion
The gesture set is just 2D; any sketch recognition system could get as good or better results, with no need to train HMMs. The gesture set is also very simple, which is the only reason they can get away with using only 8 codewords and training HMMs on tiny datasets (1-2 examples).

Reference
Mantyjarvi, J., Kela, J., Korpipaa, P., and Kallio, S. Enabling fast and effortless customisation in accelerometer based gesture interaction. Proc. of MUM '04, ACM Press (2004).

Gesture Recognition with a Wii Controller

Gesture Recognition with a Wii Controller

Schlömer et al recognize gestures using the accelerometer data of the Wiimote. First the data is vector quantized with k-means and k=14. Next the data is input to HMMs for each of 5 gestures. Additionally, the accelerometer data is filtered before processing: acceleration values below a threshold are considered idle and ignored, as are accelerations that differ too greatly on all components from the previous acceleration.

Discussion
Wii + HMM = gestures. Also, all but one of the gestures lie in a 2D plane, which should make classifying the remaining one relatively simple.

Gesture Recognition with a Wii Controller. Schlömer et al, 2007.

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction

SPIDAR G&G: A Two-Handed Haptic Interface for Bimanual VR Interaction

The Spidar G&G is an input device that allows users to interact in 3D through both translation and rotation with 2 hands. It is composed of 2 Spidar Gs, each featuring a sphere supported by 8 strings attached to tensioning motors. Moving or rotating a sphere puts a certain tension on different strings, which is converted into translation or rotation of the on-screen object or cursor. The Spidar also features a button on each sphere which, when pressed, selects an object. Feedback can also be provided to the user by the motors pulling on the strings. This occurs when a held object intersects with another: the sphere is pulled in resistance to one object entering the other.

Discussion
While the users seem to adapt to the Spidar G&G, the strings wrapped around the sphere seem like they would wrap over the fingers, making the device uncomfortable to use, especially when the strings pull due to object interaction.

Hand gesture modelling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker

Hand gesture modelling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker

Patwardhan and Roy update the EigenTracker to use particle filtering to automatically select the hand and track it. The tracker predicts the hand's position in each image, then finds the hand using the eigenvectors of hand templates, and computes the reconstruction error of the predicted image from the previous two iterations. If the change is large enough, a change in hand shape has occurred. By tracking the shape between hand-shape changes, gestures can be determined.
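A rough sketch of the shape-change cue: project each hand window onto an eigenbasis and flag a shape change when the reconstruction error jumps. PCA over flattened training images and the jump ratio are my assumptions, not the paper's exact formulation:

```python
import numpy as np
from sklearn.decomposition import PCA

def fit_eigenspace(train_images, n_components=10):
    """train_images: (N, H*W) flattened hand template images."""
    return PCA(n_components=n_components).fit(train_images)

def reconstruction_error(pca, image):
    """image: (H*W,) flattened hand window; error of its eigenspace reconstruction."""
    recon = pca.inverse_transform(pca.transform(image[None, :]))
    return float(np.linalg.norm(image - recon[0]))

def shape_changed(err_now, err_prev, ratio=1.5):
    return err_now > ratio * err_prev   # large jump => hand shape has changed
```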

Discussion
The change in eigen-templates used to find the hand seemingly just finds corners. Aren't there easier methods to find corners?

Kaustubh Srikrishna Patwardhan and Sumantra Dutta Roy, "Hand gesture modelling and recognition involving changing shapes and trajectories, using a Predictive EigenTracker," Pattern Recognition Letters, Volume 28, Issue 3, 1 February 2007, Pages 329-334. Special Issue on Advances in Visual Information Processing (ICVGIP 2004).

Taiwan sign language (TSL) recognition based on 3D data and neural networks

Taiwan sign language (TSL) recognition based on 3D data and neural networks

Lee and Tsai recognize Taiwan sign language with neural networks. The hands are tracked with a Vicon system with markers on the fingertips, and the features used are the distances between the fingertips and the wrist and between each pair of fingertips. These are input into a back-propagation neural network with three hidden layers and an output node for every class. At large hidden layer sizes, the network can achieve over 94% accuracy on test data.
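A minimal sketch of that feature vector (array layout is my assumption): five fingertip-to-wrist distances plus the ten pairwise fingertip distances, giving 15 features per frame:

```python
import numpy as np
from itertools import combinations

def tsl_features(wrist, fingertips):
    """wrist: (3,) marker position; fingertips: (5, 3) positions (thumb..pinky)."""
    to_wrist = np.linalg.norm(fingertips - wrist, axis=1)          # 5 distances
    between = [np.linalg.norm(fingertips[i] - fingertips[j])       # 10 distances
               for i, j in combinations(range(5), 2)]
    return np.concatenate([to_wrist, between])                     # 15 features
```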

Discussion
Vicon hand tracking + neural network = posture classifier.

Yung-Hui Lee and Cheng-Yueh Tsai, Taiwan sign language (TSL) recognition based on 3D data and neural networks, Expert Systems with Applications, In Press, Corrected Proof, Available online 17 November 2007.

Wiizards: 3D gesture recognition for game play input

Wiizards: 3D gesture recognition for game play input

Kratz et al use HMMs to classify accelerometer data from Wiimotes to create a game. Their system achieves high accuracy rates when tested on data generated by the same users who produced the training data, but only about 50% accuracy on other users.

Discussion
HMMs + Wii = gesture recognition.

Reference
Kratz, L., Smith, M., and Lee, F. J. 2007. Wiizards: 3D gesture recognition for game play input. In Proceedings of the 2007 Conference on Future Play (Toronto, Canada, November 14 - 17, 2007). Future Play '07. ACM, New York, NY, 209-212.

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning

TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning

Lieberman and Breazeal describe a suit that adds tactile feedback to the learning process for a user making arm gestures. The suit produces vibrations on the arm at positions that make the user feel like the suit is pushing his arm toward the desired position. Rotation of the arm is suggested by a vibration sequence around the arm. The user's arm is tracked using a Vicon camera system that tracks targets on the arm and maps them to an arm model. The angles of the joints in this model are compared to a reference gesture, and the magnitude of the difference in angles produces a corresponding vibration in the suit. Once users were accustomed to the suit, it improved their learning rate and their ability to mimic the taught gestures.

Discussion
Not much gesture recognition going on. The tracking could be adapted to an instance based classification system. Its use as a learning tool is interesting, and would be cool to try.

Reference

J. Lieberman & C. Breazeal (in press) "TIKL: Development of a Wearable Vibrotactile Feedback Suit for Improved Human Motor Learning". IEEE Transactions in Robotics (T-RO).

Articulated Hand Tracking by PCA-ICA Approach

Articulated Hand Tracking by PCA-ICA Approach

Kato et al perform PCA and ICA on hand motion data captured from a CyberGlove and determine that five components describe hand motion sufficiently: motion of the pinky, ring finger, middle finger, index finger, and thumb. These components are translated into a 3D model of the hand, from which 2D visual templates are created. These templates allow a visual hand tracking system to project the image of the hand onto the five components and produce a representation of the hand.
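A hedged sketch of the analysis pipeline: PCA for dimensionality reduction followed by ICA to pull out independent motion components, with scikit-learn's FastICA standing in for the authors' ICA variant:

```python
from sklearn.decomposition import PCA, FastICA

def hand_motion_components(joint_angles, n_components=5):
    """joint_angles: (T, 22) CyberGlove readings over time.
    Returns (T, n_components) independent sources, roughly one per finger."""
    reduced = PCA(n_components=n_components).fit_transform(joint_angles)
    ica = FastICA(n_components=n_components, random_state=0)
    sources = ica.fit_transform(reduced)
    return sources, ica
```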

Discussion
The input consists of touching various fingers to the palm, so of course the variation lies in how bent the fingers are. This seems like a lot of effort to get a system that models the hand by just how much the fingers are bent.

Reference
Articulated Hand Tracking by PCA-ICA Approach Makoto Kato, Yen-Wei Chen and Gang Xu Computer Vision Laboratory, College of Information Science and Engineering, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan

A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models

A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models

Bernardin et al add hand-mounted pressure sensors to the CyberGlove to determine grasps. These sensors are mounted at various positions on the hand that contact grasped objects. The CyberGlove + pressure data is input into a set of HMMs. Systems trained on a single user achieve between 75-92% accuracy, while systems trained on multiple users achieve 90% on average.

Discussion
Not much to say. Pressure seems like a very useful piece of information for determining contact between the hand and objects.

Reference
Bernardin, K., K. Ogawara, et al. (2005). "A sensor fusion approach for recognizing continuous human grasping sequences using hidden Markov models." Robotics, IEEE Transactions on [see also Robotics and Automation, IEEE Transactions on] 21(1): 47-57

The 3D Tractus: a three-dimensional drawing board

The 3D Tractus: a three-dimensional drawing board

Lapides et al create a tool for 3D sketching on a tablet PC. The tool is a table whose height the user can adjust; the height is transmitted to the PC. This allows users to sketch in 2D on the tablet and move the surface up and down for the third dimension. The table top is counterbalanced so it slides up and down easily. To convey where a point lies along the vertical axis, both a thickness gradient and ink hiding are used. Any ink "above" the current level of the screen is not displayed, and the further lines are below the level of the screen, the thinner they become. A perspective display also gives the user cues about the depth in the drawing.

Discussion
Moving the table up and down to draw depth seems awkward. I think I'd prefer to just rotate a drawing to add depth.

Reference
Lapides, P.; Sharlin, E.; Sousa, M.C.; Streit, L. The 3D Tractus: a three-dimensional drawing board. First IEEE International Workshop on Horizontal Interactive Human-Computer Systems (TableTop 2006), 5-7 Jan. 2006.

Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures

Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures

Ogris et al classify actions taken in a bike shop using a combination of gyroscopic and ultrasonic sensors. They try several methods to classify the gestures: HMMs and two frame-based methods, kNN and C4.5 (a decision tree). For kNN and C4.5, the classifiers vote on a set of sliding windows and the majority vote decides the gesture's class. They also test several sensor fusion methods: plausibility analysis, joint feature vector classification, and classifier fusion. The fusion techniques improve classification results greatly.
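A minimal sketch of the sliding-window voting idea; the window length, step, and window-mean feature are illustrative stand-ins for the paper's settings:

```python
import numpy as np
from collections import Counter

def classify_by_voting(clf, frames, window=10, step=5):
    """frames: (T, d) sensor features for one segmented action; clf is any fitted
    classifier (e.g. a k-NN or decision tree trained on window-mean features)."""
    votes = []
    for start in range(0, len(frames) - window + 1, step):
        feat = frames[start:start + window].mean(axis=0)   # simple window feature
        votes.append(clf.predict(feat[None, :])[0])
    return Counter(votes).most_common(1)[0][0]             # majority vote
```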

Discussion
While the authors discuss the limitations of ultrasonics, such as sensitivity to reflections, they just leave it to the classifier to filter the noise. Wouldn't a real shop environment have a great deal of reflections from the surroundings? This isn't really addressed in the paper; maybe they have to have an empty room to work in.

Reference
Ogris, G., Stiefmeier, T., Junker, H., Lukowicz, P., and Troster, G. 2005. Using Ultrasonic Hand Tracking to Augment Motion Analysis Based Recognition of Manipulative Gestures. In Proceedings of the Ninth IEEE international Symposium on Wearable Computers (October 18 - 21, 2005). ISWC. IEEE Computer Society, Washington, DC, 152-159.

Saturday, May 3, 2008

American Sign Language Recognition in Game Development for Deaf Children

American Sign Language Recognition in Game Development for Deaf Children

Brashear et al combine visual and accelerometer-based hand tracking. Accelerometers provide x, y, and z position. Visual tracking provides x, y hand centers, mass, major and minor axes, eccentricity, and orientation. These are input to the Georgia Tech Gesture Toolkit. They achieve fairly high word accuracy but relatively low sentence accuracy.

Discussion
Another fairly straightforward gesture system: get data from a tracking system and plug it into an HMM. The lower sentence accuracy is easily explained: missing a single word causes an entire sentence to be incorrect, and there are multiple words per sentence.

Reference
Brashear, H., Henderson, V., Park, K., Hamilton, H., Lee, S., and Starner, T. 2006. American sign language recognition in game development for deaf children. In Proceedings of the 8th international ACM SIGACCESS Conference on Computers and Accessibility (Portland, Oregon, USA, October 23 - 25, 2006). Assets '06. ACM, New York, NY, 79-86.

A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence

Sagawa et al focus on gesture segmentation. They define potential segmentation points as the minima of hand velocity and large enough hand direction changes. These points are filtered for noise. They also determine whether the gesture is one or two handed by finding the ratio of hand velocities. Lastly, sentences are formed by scoring recognized word and transition combinations.
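A minimal sketch of the boundary detection: candidate segmentation points at local minima of hand speed and at large direction changes. The thresholds are illustrative, not Sagawa et al's values:

```python
import numpy as np

def candidate_boundaries(xyz, dt=1 / 30, angle_thresh=np.radians(60), speed_eps=1e-3):
    """xyz: (T, 3) hand positions; returns sorted candidate boundary indices."""
    vel = np.gradient(xyz, dt, axis=0)
    speed = np.linalg.norm(vel, axis=1)
    # local minima of hand speed
    minima = [t for t in range(1, len(speed) - 1)
              if speed[t] <= speed[t - 1] and speed[t] <= speed[t + 1]]
    # large changes in movement direction
    turns = []
    for t in range(1, len(vel) - 1):
        a, b = vel[t - 1], vel[t + 1]
        if np.linalg.norm(a) > speed_eps and np.linalg.norm(b) > speed_eps:
            cosang = np.clip(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)), -1, 1)
            if np.arccos(cosang) > angle_thresh:
                turns.append(t)
    return sorted(set(minima) | set(turns))
```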

Discussion
Sagawa et al use essentially the same segmentation points (speed and curvature) that we've seen before. Determining the number of hands used is also straightforward (are they moving at about the same speed?).

Reference
Hirohiko Sagawa and Masaru Takeuchi, "A Method for Recognizing a Sequence of Sign Language Words Represented in a Japanese Sign Language Sentence," Automatic Face and Gesture Recognition (FG 2000), p. 434, 2000.

COMPUTER VISION-BASED GESTURE RECOGNITION FOR AN AUGMENTED REALITY INTERFACE

COMPUTER VISION-BASED GESTURE RECOGNITION FOR AN AUGMENTED REALITY INTERFACE

Störring et al present an augmented reality interface using hand gestures captured by a head-mounted camera. First they must segment the hand from the background. The images are projected into a chromaticity space so that color can be separated from intensity and other image features. The hands and objects are modeled as chromaticity distributions represented as Gaussians, and each pixel is classified as hand, background, or PHO object. Objects must fall within a specific size range and have missing pixels filled in using an opening filter. Next the hand is plotted radially from its center, and the number of protrusions beyond a certain radius is counted to determine the gesture.
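A rough sketch of the protrusion counting, assuming a binary hand mask is already available: take the maximum hand radius at each angle around the centroid and count the runs of angles that stick out past a palm-sized radius. The palm-radius estimate and threshold factor are my assumptions:

```python
import numpy as np

def count_protrusions(mask, n_angles=360, radius_factor=1.3):
    """mask: boolean image with True on hand pixels; returns the finger count."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()
    r = np.hypot(ys - cy, xs - cx)
    theta = np.arctan2(ys - cy, xs - cx)
    bins = ((theta + np.pi) / (2 * np.pi) * n_angles).astype(int) % n_angles
    boundary = np.zeros(n_angles)
    np.maximum.at(boundary, bins, r)                  # max hand radius per angle
    palm = np.median(boundary[boundary > 0])          # rough palm-sized radius
    out = (boundary > radius_factor * palm).astype(int)
    transitions = np.diff(out, append=out[0])         # wrap around the circle
    return int((transitions == -1).sum())             # one 1->0 edge per protrusion
```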

Discussion
Projecting the hand onto radial coordinates to determine the number of outstretched fingers is rather novel, but the "robustness" to how a gesture is formed is simply an attempt to turn a drawback into an advantage. They can't tell which fingers are outstretched, only how many, so they say that's a desired quality. Also, they only provide the generic "users adapted quickly" as evidence that the system works well.

Reference
COMPUTER VISION-BASED GESTURE RECOGNITION FOR AN AUGMENTED REALITY INTERFACE Moritz Störring, Thomas B. Moeslund, Yong Liu, and Erik Granum In 4th IASTED International Conference on VISUALIZATION, IMAGING, AND IMAGE PROCESSING, pages 766-771, Marbella, Spain, Sep 2004

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition

Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition

Westeyn et al present the Georgia Tech Gesture Toolkit and several example applications created with it. GT2k is built on HTK, an HMM toolkit. Essentially, users create HMMs for each gesture and a grammar that combines isolated gestures into sequences with meaning. The toolkit automatically trains the HMMs with data that has been annotated by the application creator. The system also allows for cross-validation to be used to determine how well the system will perform on real data. They also present several applications such as a workshop activity recognizer.

Discussion
They really don't add much to HTK. Really it's just a new coat of paint so that HTK looks like it's doing something new.

Reference

Westeyn, Brashear, Atrash, Starner. Georgia Tech Gesture Toolkit: Supporting Experiments in Gesture Recognition. Proceedings of the 5th International Conference on Multimodal Interfaces. Vancouver, British Columbia, Canada (2003), 85-92.

3D Visual Detection of Correct NGT Sign Production

3D Visual Detection of Correct NGT Sign Production

Lichtenauer et al use 3D visual tracking to classify Dutch sign language (NGT). Using a single camera they can capture 2D features: x and y position, displacement, motion angle, and velocities. With 2 cameras these can be converted into 3D positions, angles, and velocities. DTW is used to map gestures to a reference gesture. Gestures are mapped on an individual feature basis, creating a classifier for each feature, and the average classification is taken as the output. Each feature classifier determines whether an example feature falls within the 90% range of a Gaussian modeling the positive examples of the class, and a sign is accepted if the average over the features is above a threshold.
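A minimal DTW sketch for the per-feature matching: plain dynamic programming over one feature sequence and a reference, with no step-pattern or band constraints assumed:

```python
import numpy as np

def dtw_distance(a, b):
    """a, b: 1-D feature sequences (e.g. x-velocity over time)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)   # normalize by an upper bound on path length
```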

Discussion
This paper is interesting in that it is the first one using DTW. While the classification seems fairly accurate, they don't appear to decide between classes, only whether an example belongs to a given class.


Reference
J.F. Lichtenauer, G.A. ten Holt, E.A. Hendriks, and M.J.T. Reinders. "3D Visual Detection of Correct NGT Sign Production."

Television Control by Hand Gestures

Television Control by Hand Gestures

Weissman and Freeman present an early gesture recognition system. Using hand tracking through video, they determine where the user's hand is by applying a hand template to the image. They overlay a control GUI on the TV screen which the user manipulates by moving his hand. The hand is displayed on the GUI and acts like a buttonless mouse. Closing the hand ends the manipulation.

Discussion
This shows how gesture input started. It is a very simple interface with relatively simple tracking and interaction methods.

Reference

Television Control by Hand Gestures. William T. Freeman, Craig D. Weissman. TR94-24 December 1994

A Survey of Hand Posture and Gesture Recognition Techniques and Technology

A Survey of Hand Posture and Gesture Recognition Techniques and Technology

LaViola presents an excellent literature survey of hand posture and gesture recognition.
Included are:

Devices - Tracking and gloves
Methods:
Features - Template Matching, PCA
Learning - Neural Nets, HMMs, Instance-based methods
Applications:
Sign Language Recognition
Gesture to Speech
Virtual Reality
3D Modeling
Control Systems (robots, TV, etc.)

Discussion
A good starting point for finding useful methods. Summarizes the state of the art as of 1999. We're still using pretty much the same methods 8 years later.

Reference
Joseph J. LaViola, Jr. (1999). A Survey of Hand Posture and Gesture Recognition Techniques and Technology. Brown University.

Real-Time Locomotion Control by Sensing Gloves

Real-Time Locomotion Control by Sensing Gloves

Komura maps the motions of the fingers, as measured by a P5 glove, to the motion of a humanoid figure in a real-time game. The user first mimics a reference character so that finger bend and hand orientation can be mapped to various parts of the virtual figure. This is done by matching the period of change in finger flex to the period of the figure's motion. Users were successfully able to control a figure in a game.

Discussion
The authors subtly switch from the P5 to a CyberGlove for their experiments, suggesting that the P5 does not have the sensitivity necessary for this system. In the experiments users took more time to complete tasks but collided fewer times on average, suggesting that they were more careful when using the glove.

Reference
Taku Komura, Wai-Chun Lam; Real-time locomotion control by sensing gloves; Computer Animation and Virtual Worlds 17:5, 513-525, 2006

A dynamic gesture recognition system for the Korean sign language (KSL)

A dynamic gesture recognition system for the Korean sign language (KSL)

Kim et al use a combination of CyberGloves and 3D position sensors to capture Korean sign language gestures using template matching and neural networks. The overall gesture is matched to templates: the x and y axes are divided into 8 regions, and the motion of the gesture is tracked as positive or negative changes in region. Each gesture is matched against a set of template region-change sequences to determine its class. Posture recognition is performed using Fuzzy Min-Max Networks, where each class is defined by a min point and a max point forming a hyperbox, together with a membership function. After matching a sequence of positions to a template, the posture is used to determine which sign is represented.
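A hedged sketch of the hyperbox idea: membership is 1 inside the box and falls off with distance outside it. The linear falloff below is a stand-in for the exact fuzzy min-max membership function used in the paper:

```python
import numpy as np

def hyperbox_membership(x, vmin, vmax, gamma=4.0):
    """x, vmin, vmax: feature vectors; membership is 1 inside the box and decays
    linearly with distance outside it along each dimension."""
    below = np.clip(vmin - x, 0.0, None)      # how far below the box per dimension
    above = np.clip(x - vmax, 0.0, None)      # how far above the box per dimension
    per_dim = 1.0 - np.clip(gamma * (below + above), 0.0, 1.0)
    return per_dim.mean()

def classify_posture(x, boxes):
    """boxes: list of (class_label, vmin, vmax); returns label of best hyperbox."""
    return max(boxes, key=lambda b: hyperbox_membership(x, b[1], b[2]))[0]
```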

Discussion
The gesture templates exist only in 2D; however, it should be fairly easy to extend them. Some of the templates seem to not match the motions from Figure 2. The templates also place a limit on the size of gestures performed.

Reference
J. S. Kim,W. Jang, and Z. Bien, "A dynamic gesture recognition system for the Korean sign language (KSL)," IEEE Trans. Syst., Man, Cybern. B, vol. 26, pp. 354–359, Apr. 1996.

Shape Your Imagination: Iconic Gestural-Based Interaction.

Shape Your Imagination: Iconic Gestural-Based Interaction.

Marsh and Watt present a study of naturally made gestures. By presenting a set of cards with common items written on them and asking the study participants to describe the items nonverbally, the researchers could determine what types of gestures the participants used. They found that people tend to prefer virtual to substitutive gestures and to model shapes using 2 hands rather than one.

Discussion


Reference
T. Marsh and A. Watt, "Shape Your Imagination: Iconic Gestural-Based Interaction," Proceedings of the IEEE Virtual Reality Annual International Symposium (VRAIS '98), p. 122, 1998.

A Survey of POMDP Applications

A Survey of POMDP Applications

Anthony Cassandra presents the Partially Observable Markov Decision Process (POMDP) and summarizes several applications. This is largely a literature survey. Types of applications:

Machine Maintenance
Structural Inspection
Elevator Control Policies
Fishery Industry
Autonomous Robots
Behavioral Ecology
Machine Vision
Network Troubleshooting
Distributed Database Queries
Marketing
etc

Looking at the references for applications similar to yours could provide useful information about applying POMDPs or HMMs to your situation.

Reference
Anthony Cassandra. A Survey of Partially-Observable Markov Decision Process Applications. Presented at the AAAI Fall Symposium, 1998

Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs

Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs

Song and Kim modify the usual HMM model, dividing the observation sequence into blocks through use of a sliding window. Each block of a gesture is used to train the corresponding HMM. Each HMM is used to recognize partial segments of the gesture over the block (train on (o1), then (o1, o2), etc.), and a gesture is recognized through majority voting over the block. They determined that the optimal window size was 3. After the gesture is selected from the set of gesture HMMs, it is compared either to a manually set threshold or to the output of an HMM trained on non-gestures. If the gesture HMM probability exceeds the threshold or the non-gesture HMM, it is determined to be a gesture. Testing demonstrates that the non-gesture HMM spots gestures more accurately than manual thresholding.
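A minimal sketch of the spotting decision, assuming the per-block HMM log-likelihoods are already available (from HTK, hmmlearn, or similar); the accepted gesture must beat either a manual threshold or the non-gesture model:

```python
def spot_gesture(gesture_logliks, nongesture_loglik=None, manual_threshold=-50.0):
    """gesture_logliks: dict of {gesture_name: log-likelihood} for one block.
    Returns the spotted gesture name, or None if no gesture is accepted."""
    best = max(gesture_logliks, key=gesture_logliks.get)
    reference = nongesture_loglik if nongesture_loglik is not None else manual_threshold
    return best if gesture_logliks[best] > reference else None
```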

Discussion
The gestures used in the experiment are very simple, mostly lifting one arm or the other. Template matching could probably achieve similar results while being much less complex to implement.

Reference
Jinyoung Song and Daijin Kim, "Simultaneous Gesture Segmentation and Recognition based on Forward Spotting Accumulative HMMs," International Conference on Pattern Recognition (ICPR 2006), pp. 1231-1235, 2006.

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation

Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation

Ip et al present a system that translates hand and arm motion into music. They begin with a discussion of music theory that helps them determine what tones to play. Musical theory such as chord coherence and cadence can help determine which chord should follow previous ones. The system itself is composed of a pair of CyberGloves and a Polhemus 3D position tracker for each glove. The right hand determines the melody. New notes are generated when the user flexes his wrist, and the height of the hand determines the pitch of the note. Vertical movement of the hand can cause the pitch to shift with the motion. Finger flexion determines the dynamics and volume of the note. Lifting the left hand brings in a second instrument to play in either unison or harmony, and clenching the left hand initiates cadence and terminates the music.

Discussion
After describing the system, a usability discussion would have been useful. Waving the hand to generate notes seems like it would be very tiring. The use of chord coherence to determine the next chord suggests that picking the chord you want may be difficult, and only possible by trial and error through pitch shifting. Also, the use of CyberGloves is somewhat overkill, since all they are interested in is when the wrist bends and the general degree of finger flex.

Reference
Ip, H. H. S., K. C. K. Law, et al. (2005). Cyber Composer: Hand Gesture-Driven Intelligent Music Composition and Generation. Multimedia Modelling Conference, 2005. MMM 2005. Proceedings of the 11th International.