Microsoft has taken a metaphorical scalpel to Kinect's insides - giving us laymen an insight into the device's "secret sauce".
Kinect program managers Ron Forbes and Arjun Dayal go into great detail on the Xbox blog about the device's "brain".
They write: "In the analog world, it's not just about yes and no, but it's about maybe. It's not just about true or false, but it's about probability. Think briefly about all of the possible variations of a human waving his hand: the range of physical proportions of the body, the global diversity of environmental conditions, differences in clothing properties, cultural nuances to performing even a simple gesture. Quickly, you end up with a search space around 10^23, an unrealistic problem domain to solve through conditional-based programming.
"We knew early on that we had to invent a new way of approaching this problem, one that works like the human brain does. When you encounter someone in the world, your brain instantly focuses on him and recognizes him based on years of prior training. It doesn't crawl a decision tree hundreds of levels deep to discern one human from another. It just knows. Where a baby would have a hard time telling the two apart, you've learned to do so in a split second. In fact, you could probably make a reasonable guess about their age, gender, ethnicity, mood, or even their identity (but that's a blog post for another day). This is part of what makes us human.
"Kinect was created in the same way. It sees the world around it. It focuses on you. And even though it's never seen how you wave your hands before, it instantly approximates your movements to the terabytes of information it's already learned."
The pair liken the sensitivity of Kinect's sensor to "those pinpoint impression toys that used to be all the rage", and discuss how the peripheral can "segment" humans from other background items. They claim: "Kinect can actively track the full skeletons of up to two human players as well as passively track the shape and position of four passive players at once."
Perhaps the most interesting technical revelation is a discussion of 'The Brain Inside Kinect'.
Forbes and Dayal write: "Each pixel of the player segmentation is fed into a machine learning system that's been trained to recognize parts of the human body. This gives us a probability distribution of the likelihood that a given pixel belongs to a given body part. For example, one pixel may have an 80% chance of belonging to a foot, a 60% chance of belonging to a leg, and a 40% chance of belonging to the chest. It might make sense initially to keep the most probable proposal and throw out the rest, but that would be a bit premature. Rather, we send all of these possibilities (called multiple centroid proposals) down the rest of the pipeline and delay this judgement call until the very end.
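The "multiple centroid proposals" idea above can be illustrated with a toy sketch - this is not Microsoft's actual pipeline, just a minimal Python rendering of the principle: keep every plausible body-part proposal per pixel rather than the single argmax, and defer the final judgement until later evidence (here a made-up `context_weight`) is available.

```python
def centroid_proposals(pixel_probs, threshold=0.3):
    """Keep every body part whose probability clears the threshold,
    sorted from most to least likely, instead of taking the argmax."""
    return sorted(
        ((part, p) for part, p in pixel_probs.items() if p >= threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )

def resolve(proposals, context_weight):
    """The deferred judgement call: re-score each surviving proposal
    with later-pipeline evidence (a hypothetical per-part weight)
    and only then pick a winner."""
    best_part, _ = max(
        proposals,
        key=lambda pair: pair[1] * context_weight.get(pair[0], 1.0),
    )
    return best_part

# The example distribution from the blog post (the values need not
# sum to 1, since each is an independent per-part likelihood):
pixel = {"foot": 0.8, "leg": 0.6, "chest": 0.4}

proposals = centroid_proposals(pixel)
# -> [("foot", 0.8), ("leg", 0.6), ("chest", 0.4)]

# Later evidence (e.g. a skeletal-consistency check) can overturn
# the raw per-pixel argmax - which is exactly why the proposals
# were all kept:
winner = resolve(proposals, context_weight={"leg": 1.5})
# -> "leg"
```

Had the pipeline committed to the most probable part ("foot") at the first stage, the leg reading favoured by downstream context would have been lost.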
"As a brief aside, you might start to wonder how we taught this brain to recognize body parts. Training this artificial intelligence (called the Exemplar system) was no small feat: terabytes of data were fed into a machine cluster to teach Kinect a pixel-by-pixel technique for identifying arms, legs, and other body parts it sees. The harlequin-looking figures you see here are some of the data points we used to train and test the Exemplar system."
The duo also reveal that, during testing - as with all of its trials - Microsoft's philosophy was to "fail fast, fail often" - "so that we could push out the ideas that were not going to work and reveal the ones that would", including gesture control with feet.