Recognizing activities and anticipating which might come next is easy enough for humans, who make such predictions subconsciously all the time. But machines have a tougher go of it, particularly where there's a relative dearth of labeled data. (Action-classifying AI systems typically train on annotations paired with video samples.) That's why a team of Google researchers proposes VideoBERT, a self-supervised system that tackles various proxy tasks to learn temporal representations from unlabeled videos.
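The masked-prediction proxy task behind this kind of self-supervision can be sketched roughly as follows. This is a minimal illustration in PyTorch, not the researchers' code; the vocabulary size, model dimensions, and function names are assumptions. The idea is to cluster clip-level features into a vocabulary of "visual tokens," hide a fraction of them, and train a transformer to recover the missing tokens from their temporal context, with no human labels involved.

```python
# Minimal sketch of a BERT-style masked-token proxy task over quantized video
# "visual words." Assumes clip features have already been extracted and
# clustered into a fixed vocabulary of visual token ids (hypothetical setup).
import torch
import torch.nn as nn

VOCAB_SIZE = 20736   # hypothetical visual vocabulary (e.g., k-means centroids)
MASK_ID = 0          # reserved token id used to mask positions

class MaskedVisualTokenModel(nn.Module):
    def __init__(self, vocab_size=VOCAB_SIZE, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(512, d_model)  # positions along the clip sequence
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)  # predict the original token id

    def forward(self, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        return self.head(self.encoder(x))

# Self-supervised training step: mask ~15% of visual tokens and ask the model
# to recover them from temporal context -- no annotations required.
def masked_prediction_loss(model, tokens, mask_prob=0.15):
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_prob
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    return nn.functional.cross_entropy(
        logits[mask], tokens[mask]  # loss computed only on the masked positions
    )
```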
The researchers trained VideoBERT on over one million instructional videos across categories like cooking, gardening, and vehicle repair. To verify that it learned semantic correspondences between videos and text, the team tested its accuracy on a cooking video dataset in which neither the videos nor the annotations were used during pre-training. The results show that VideoBERT successfully predicted, for example, that a bowl of flour and cocoa powder may become a brownie or cupcake after baking in an oven, and that it could generate sets of instructions (such as a recipe) from a video, along with video segments (tokens) reflecting what's described at each step.
That said, VideoBERT's visual tokens tend to lose fine-grained visual information, such as smaller objects and subtle motions. The team addressed this with a model they call Contrastive Bidirectional Transformers (CBT), which removes the tokenization step. Evaluated on a range of datasets covering action segmentation, action anticipation, and video captioning, CBT reportedly outperformed the state of the art by "significant margins" on most benchmarks.
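Dropping the tokenization step means the model works directly with continuous clip features, so the proxy objective becomes contrastive rather than a classification over a fixed visual vocabulary. The sketch below shows one common way to set up such a noise-contrastive loss; it is an illustrative InfoNCE-style formulation rather than the paper's exact objective, and the function and argument names are assumptions.

```python
# Rough sketch of a contrastive objective on continuous clip features: pull
# each clip's feature toward the model's prediction for that time step and
# push it away from the other clips' features in the batch.
import torch
import torch.nn.functional as F

def contrastive_loss(predicted, target, temperature=0.1):
    """predicted, target: (batch, dim) continuous clip features."""
    predicted = F.normalize(predicted, dim=-1)
    target = F.normalize(target, dim=-1)
    # Similarity of every prediction against every target in the batch.
    logits = predicted @ target.t() / temperature
    # The matching (diagonal) pair is the positive; all others are negatives.
    labels = torch.arange(predicted.size(0), device=predicted.device)
    return F.cross_entropy(logits, labels)
```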