Stroke Gesture Recognition Algorithm
I wanted to consider the problem of how to recognize gestures made on a tablet device or computer screen using a mouse, stylus, or one's finger. Recognizing strokes is a useful and friendly way for users to interact with software running on keyboard-less devices such as phones, tablets, and whiteboards.
Gesture recognition at first appears like a subset of image recognition and many existing algorithms take some sort of 2 dimensional image recognition approach. I was looking for a simpler and more efficient solution. One that could be run directly in javascript on a web browser (for example). The image recognition approach is computationally expensive since it requires performing matches at many different scales and angles, requiring a large number of computations. However, gestures have several properties that make them simpler and more efficient to deal with:
- They have a clear start point, end point, and sequence
- They have no color or intensity dimension (they could in theory, but this is generally not considered)
This means a gesture can be thought of as a sequence of 2 dimensional coordinates. A mathematically convenient way to represent 2 dimensional coordinates is as a complex number. Representing a gesture as a sequence of complex numbers turns out to be a powerful approach that allows for a fast and simple matching algorithm that is scale and rotationally invariant, a highly desirable feature to improve matching efficiency.
The Basic Algorithm
-
Record pixel coordinates along the stroke path as a sequence of complex numbers.
-
Resample the sequence to a consistent length (i.e. 100 points)
-
Center the sequence around coordinate (0,0)
-
Calculate a matching function against each of a set of known sequences. The set of known sequences have also been resampled and centered and thus have the same number of samples as our unknown sequence.
1. Record pixel coordinates along the stroke path
Most computer systems record mouse movement at a fixed rate (60 times per second), with no event fired if the mouse doesn't move at all. The recorded sequences can then be thought of as being samples in time.
2. Resample the sequence to a consistent length
Here we have to decide how we want to resample. Do we want to resample in time, so that each sample has constant time between them, or do we want to resample in space so that each sample has a constant distance between them? There are good arguments for both approaches. On the one hand, time information can be valuable since different stroke features will generally be performed at different speeds. A user drawing a square box will tend to slow down at the corners for example. On the other hand, sampling in space would be more robust to a user pausing or slowing down unexpectedly while drawing a path. Users can see when they are drawing a shape badly, but they are unlikely to realize if their timing is also off. Resampling in space is more complicated than resampling in time, although not significantly slower. In most cases the difference in performance will likely be small, but only experimentation can say for sure. Time sampling can probably be safely used if one wants to keep their code as simple as possible. For my demonstration I decided to do resampling in distance.
3. Center the sequence around coordinate (0,0)
This part of our operation is similar to most image recognition techniques. Just as our eyes automatically center on an object we are attempting to recognize, we recenter the sampled sequence to make matching easier. In our case we simply subtract the average complex value.
4. Calculate matching function against known sequences
For the matching function (m) I used the normalized dot product of the unknown sequence (a) and the complex conjugate of each known sequence (b).
$$m = (a . bar b)/sqrt((a . bar a) * (b . bar b))$$
The match $m$ is a complex number. The magnitude $|m|$ represents how strong the match is and the phase $arg(m)$ represents the relative rotation of $a$ versus $b$. The magnitude $|m|$ is scale and rotationally invariant. We can choose to use or ignore the relative rotation however we like. We could (for example) use it to see whether the user meant to draw a vertical or horizontal line or one at another preset angle. I use the phase information to differentiate between matching a 6 and a 9, which have almost identical stroke patterns except that one is rotated 180° to the other. Unlike most other approaches this matching function only has to be calculated once, making it very efficient.
Demonstration
Below is a demonstration of this matching algorithm for the set of numbers from 0 to 9. The open circle represent the expected start of the stroke and the closed circle represent the end. The magnitude and phase of matching function is shown for our unknown (a) stroke versus each known stroke (b). The best match is shown by highlighting the match in red.
I wrote this demonstration using typescript, which contains a bit more typing information that may assist in reading and understanding the code. The source code for this demonstration can be seen here gesture_recognition.ts
Conclusion and discussion
I think this approach is a good first effort, but it isn't perfect. It's very efficient and makes stroke gesture recognition accessible to even interpreted language environment and slow devices. However, its main drawback is its dependence on centering. If a shape is drawn with some part of it extending much further than expected, it will not be centered properly and this will result in a poor match. More thought needs to be put into how to address this drawback.
In retrospect the 4 and the 9 stroke are very similar and often confused. This may be an area where time sampling might have produced superior results. In general it would be safest to design strokes that are more distinct than these two.
An important direction to investigate in the future is in matching multistroke patterns. Single stroke patterns are very limited. Being able to combine stroke patterns expands the stroke 'vocabulary' geometrically (the product of the number of strokes times the number of patterns). Here it is likely timing information (in addition to position) may be critical.