Primates’ retinal ganglion cells receive visual information from photoreceptors and transmit it from the eye to the brain. But not all cells are created equal — an estimated 80% operate at low temporal frequency and capture fine spatial detail, while about 20% respond to swift changes. This biological dichotomy inspired scientists at Facebook AI Research to pursue what they call SlowFast. It’s a machine learning architecture for video recognition that they claim achieves “strong performance” for both action classification and detection in footage. An implementation in Facebook’s PyTorch framework — PySlowFast — is available on GitHub, along with trained models.
That’s where SlowFast comes in. It comprises two pathways: one operates at a low frame rate and slow refresh speed, optimized to capture the semantic information conveyed by a few sparse frames, while the other captures rapidly changing motion with a fast refresh speed and high temporal resolution.
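The two-rate idea can be sketched in a few lines of Python. This is an illustrative simplification, not the official PySlowFast API: the function name, the stride `tau`, and the speed ratio `alpha` are assumptions chosen to mirror the paper’s description, where the Fast pathway samples `alpha` times more densely than the Slow one.

```python
def sample_two_pathways(clip, tau=16, alpha=8):
    """Split one clip (a list of frames) into Slow and Fast pathway inputs.

    The Slow pathway subsamples with a large temporal stride tau, keeping
    only sparse frames (good for static spatial semantics). The Fast
    pathway uses a stride alpha times smaller, so it sees alpha times as
    many frames of the same clip (good for rapid motion).
    """
    slow = clip[::tau]            # sparse frames for the Slow pathway
    fast = clip[::tau // alpha]   # dense frames for the Fast pathway
    return slow, fast

# Example: a 64-frame clip yields 4 Slow frames and 32 Fast frames.
clip = list(range(64))
slow, fast = sample_two_pathways(clip)
print(len(slow), len(fast))  # 4 32
```

In the actual architecture, both streams are processed by separate convolutional networks whose features are fused, but the core trick — feeding the same raw video to two pathways at different temporal rates — is what this sketch captures.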
The researchers assert that by treating the raw video at different temporal rates, SlowFast lets its two pathways develop their own video modeling expertise. The slower pathway becomes better at recognizing static areas of the frame that change slowly or not at all, while the faster pathway learns to reliably suss out actions in dynamic areas.