Oct 22, 2004 17:05
A Gesture Recognition System for Human Computer Interfacing
A Final Year Project by-
Sunil Pai and Sajith Shetty
Department of Computer Science Engineering
NITK, Surathkal
Objectives
1. The system should be able to track the movements of BOTH hands of the user, ignoring extraneous movement behind him/her.
2. The system should then be able to recognize the gestures made by the user with BOTH hands, taking care of occlusions (overlapping hands), and respond accordingly.
3. The system should perform the above in real-time, or at least a close approximation to it. Post-capture processing is a definite no-no.
4. The system should be future-compatible; i.e. it must be possible to upgrade the system by simply adding or replacing modules in the source code.
Software/ Hardware
- The system consists of a normal computer fitted with a cheap off-the-shelf webcam capable of delivering up to 30 fps at 640x480.
- Programming will be in Microsoft Visual C++ (say 6.0), along with the OpenCV libraries (Intel Open Computer Vision) and IPL (Intel's Image Processing Library). We're still considering whether to use MATLAB (or other signal-processing software) in conjunction with the above. A minimal capture loop is sketched below.
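For the curious, grabbing frames from the webcam with OpenCV's highgui C API might look something like this (a minimal sketch; camera index 0 and the window name are our assumptions, and error handling is trimmed):

    #include "cv.h"
    #include "highgui.h"

    int main()
    {
        // Open the first attached webcam (index 0 assumed).
        CvCapture* capture = cvCaptureFromCAM(0);
        if (!capture)
            return -1;

        cvNamedWindow("input", 1);
        while (1)
        {
            // The returned frame is owned by the capture structure,
            // so it must NOT be released by hand.
            IplImage* frame = cvQueryFrame(capture);
            if (!frame)
                break;
            cvShowImage("input", frame);
            if (cvWaitKey(10) == 27)  // ESC to quit
                break;
        }
        cvReleaseCapture(&capture);
        cvDestroyWindow("input");
        return 0;
    }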
Basic outline
Phase 1- Locate hands
We could use CamShift or Kalman filtering to locate the region of the hand(s), after which a Canny edge detector could be used to extract the actual contours of the hand. Another option is the CONDENSATION (particle filter) approach as described by Isard and Blake. Mammen, Chaudhuri and Agarwal (2001) present what they claim to be a robust two-hand tracking system; we are currently reading up on it. We would of course like to make sure that the head, torso and other details are ignored, since the regions of interest are the hands alone.
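As a rough sketch of the CamShift option (the function name and arguments below are our own; a skin-colour histogram is assumed to have been built beforehand from a sample patch of the palm, and search_window initialized over the hand in the first frame):

    #include "cv.h"

    // One tracking iteration per frame.
    void track_hand(IplImage* hue, CvHistogram* hist, CvRect* search_window)
    {
        // Back-project the pre-built skin-colour histogram onto the hue plane.
        IplImage* backproject = cvCreateImage(cvGetSize(hue), 8, 1);
        cvCalcBackProject(&hue, backproject, hist);

        // CamShift re-centres (and re-sizes) the search window on the skin blob.
        CvConnectedComp track_comp;
        CvBox2D track_box;
        cvCamShift(backproject, *search_window,
                   cvTermCriteria(CV_TERMCRIT_EPS | CV_TERMCRIT_ITER, 10, 1),
                   &track_comp, &track_box);
        *search_window = track_comp.rect;  // carry over to the next frame

        cvReleaseImage(&backproject);
    }

A Canny pass restricted to the resulting window would then give the hand contour without wasting cycles on the rest of the frame.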
Phase 2- Gesture recognition
We will be using either a Hidden Markov Model (HMM) approach or a Finite State Machine (FSM) approach to determine which gestures the user is actually making. The gestures can consist not only of various finger configurations (e.g. two fingers held up, fist closed, etc.), but also movements of the hand itself (e.g. sweeping motions, etc.). Thus, hypothetically, holding up two fingers of the left hand and then making a circular motion with a closed right fist could be a gesture that means 'Reboot'.
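To make the FSM idea concrete, here is a toy machine for the hypothetical 'Reboot' gesture above. The observation symbols are placeholders we invented; in practice they would be emitted once per frame by the Phase 1 tracker:

    // Toy FSM: two fingers on the left hand "arms" the gesture,
    // a circular motion of the right fist then fires it.
    enum Observation { LEFT_TWO_FINGERS, RIGHT_FIST_CIRCLE, OTHER };
    enum State { IDLE, ARMED, REBOOT };

    State step(State s, Observation obs)
    {
        switch (s)
        {
        case IDLE:  return (obs == LEFT_TWO_FINGERS) ? ARMED : IDLE;
        case ARMED: return (obs == RIGHT_FIST_CIRCLE) ? REBOOT
                         : (obs == LEFT_TWO_FINGERS)  ? ARMED : IDLE;
        default:    return IDLE;  // REBOOT fires, then reset
        }
    }

An HMM would replace these hard transitions with probabilities, which is exactly what makes it the more robust (and more training-hungry) of the two options.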
Issues
Calibration- Background subtraction?
We can't really decide whether to use a simple background subtraction scheme for calibration or to assume that there will be unwanted noise behind the user (people moving around, changing illumination conditions, etc.). This choice is important because it can't be modularized as a separate option; switching later would mean rewriting the entire pipeline. Therefore, for the sake of ease, we simply assume background subtraction to be acceptable. If time permits, we'll change the code to adapt to a dynamic background.
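A minimal sketch of the scheme we're assuming, in OpenCV's C API (the function name is ours, the background frame is captured once at startup, and the threshold of 30 is a guess that would need tuning):

    #include "cv.h"

    void subtract_background(IplImage* frame, IplImage* background,
                             IplImage* mask /* 8-bit, 1 channel */)
    {
        // Pixel-wise difference against the background captured at startup.
        IplImage* diff = cvCreateImage(cvGetSize(frame), 8, 3);
        cvAbsDiff(frame, background, diff);

        // Collapse to one channel; anything above the threshold is foreground.
        IplImage* gray = cvCreateImage(cvGetSize(frame), 8, 1);
        cvCvtColor(diff, gray, CV_BGR2GRAY);
        cvThreshold(gray, mask, 30, 255, CV_THRESH_BINARY);

        cvReleaseImage(&diff);
        cvReleaseImage(&gray);
    }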
Machine learning/training- ("I've got to save my ass"- Shrek)
The human brain goes through a period of intense learning in the first 1-2 years of its life, after which learning happens along with the normal course of life. And even though one goal of developing computer systems is to replicate ourselves (the ultimate flattery), we sometimes wonder if that's the way to go. Apoorva Patel (IISc) claims that evolution is taking us down the optimal path, and that eventually a highly optimized being will evolve (or so we deduced from his papers). Our question, getting back to the project, is whether we want to build a system that requires a mandatory period of training, or one that is immediately executable? An open question, of course. We figure that training would make the system more robust and adaptive to individual users, while the other option would make the product more user-friendly. We'll let this sleeping dog lie until the time comes for it to bite us on the ass.
New algorithms- Is the whole greater than the sum of the parts?
Are we doing this in a purely mechanical way? So far we have broken the project down into individual blocks and searched for the commonly accepted solutions to each block, without ever considering the system as a whole. This raises the question: can a customized algorithm be written exclusively for this problem, i.e. one that combines motion tracking/analysis and gesture recognition into a single problem? Does an optimal solution exist, and if so, what is it? This thought is a side project; comments are always welcome.
Future work
We envision a world where there is no PHYSICAL contact between humans and computers, where a computer system can be controlled simply by talking and gesturing at it, closing the gap between man and machine. We of course have the fantasies spun by plenty of science-fiction books, movies and whatnot. When Tom Cruise donned his three-diode gloves and juggled multiple video feeds on a transparent viewscreen, he set off a chain of thought among hundreds of researchers in the fields of Computer Vision and Pervasive Computing. The truth is that this future is closer than we think. Work is going on in labs across the world to make this dream a reality. We only hope that our work is appreciated as another step towards the ultimate Human-Machine Interface.
Please contact us if you have source code, literature, tips/tricks or just general advice on how to get ahead with this project-
Sunil Pai- sunil_gpai@yahoo.com
Sajith Shetty- sajithrshetty@yahoo.com