📹 🤖 Transformers for Video
Transformers are universally acknowledged as the most important development in deep learning architectures of the last decade. Their impact on natural language understanding (NLU) tasks has challenged the imagination of even the most hard-core believers in neural networks. In recent years, we have seen steady contributions of transformers to domains such as computer vision, though mostly in image-related tasks such as classification. Now transformer architectures are expanding into a new frontier: video intelligence.
The idea of using transformers for video intelligence tasks makes a lot of sense. Typically, video intelligence techniques require large amounts of labeled data to understand the actions in a video frame. Transformers excel at learning from unlabeled datasets, and there are plenty of videos available on the internet to learn from. Just like in NLU tasks, transformer models can be pretrained on large sets of unlabeled videos and fine-tuned for specific tasks. Last week, OpenAI unveiled its work on video pretraining (VPT) models, which adapt the principles of transformers to video intelligence tasks. To push the boundaries, OpenAI pretrained VPT on Minecraft videos, and the model was able to master tasks that previously required large training pipelines built on techniques such as reinforcement learning, which has produced some of the best results in video intelligence in recent years.
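To make the pretraining idea concrete: VPT first trains a small labeler on scarce labeled data, uses it to pseudo-label a huge unlabeled video corpus, and then trains a policy by imitation on those pseudo-labels. Here is a toy numpy sketch of that pseudo-labeling workflow, not OpenAI's actual pipeline: "frames" are random feature vectors, the models are simple softmax classifiers, and every name and number is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: "frames" are feature vectors, "actions" are class labels.
def make_frames(n, d=8):
    return rng.normal(size=(n, d))

true_w = rng.normal(size=(8, 3))
def true_action(frames):
    # Hidden ground-truth mapping from frames to actions.
    return np.argmax(frames @ true_w, axis=1)

def fit_softmax(X, y, classes=3, lr=0.5, steps=300):
    # Plain multinomial logistic regression via gradient ascent.
    W = np.zeros((X.shape[1], classes))
    onehot = np.eye(classes)[y]
    for _ in range(steps):
        z = X @ W
        z -= z.max(axis=1, keepdims=True)  # numerical stability
        p = np.exp(z)
        p /= p.sum(axis=1, keepdims=True)
        W += lr * X.T @ (onehot - p) / len(X)
    return W

# 1) Small labeled set: train a labeler (an inverse-dynamics-model stand-in).
X_small = make_frames(500)
W_idm = fit_softmax(X_small, true_action(X_small))

# 2) Pseudo-label a much larger "unlabeled" corpus with that labeler.
X_big = make_frames(5000)
y_pseudo = np.argmax(X_big @ W_idm, axis=1)

# 3) "Pretrain" a policy by imitating the pseudo-labels.
W_policy = fit_softmax(X_big, y_pseudo)

# 4) Evaluate the policy on held-out frames against the true actions.
X_test = make_frames(1000)
acc = np.mean(np.argmax(X_test @ W_policy, axis=1) == true_action(X_test))
print("held-out accuracy:", round(acc, 2))
```

The point of the sketch is the data flow: a small labeled set bootstraps labels for a large unlabeled one, and the final model never needs human annotations at scale. The real VPT system applies this with an inverse dynamics model over Minecraft video frames and keystrokes rather than linear classifiers.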
With GPT-3, OpenAI established something of a gold standard for transformers in NLU tasks. It followed up with DALL-E and DALL-E 2, applying transformers to combined image and language tasks. VPT appears to be its first major step in extending this work into the area of video intelligence. Maybe VPT is the foundation for OpenAI’s new supermodel.
🔺🔻TheSequence Scope – our Sunday edition with the industry’s development overview – is free. To receive high-quality content about the most relevant developments in the ML world every Tuesday and Thursday, please subscribe to TheSequence Edge 🔺🔻
🗓 Next week in TheSequence Edge:
Edge#203: we explain what Graph Recurrent Neural Networks are, discuss GNNs on dynamic graphs, and explore DeepMind’s Jraph, a GNN library for JAX.
Edge#204: we deep dive into Imagen, Google’s impressive text-to-image alternative to OpenAI’s DALLE-2.
Now, let’s review the most important developments in the AI industry this week.
🔎 ML Research
Mastering Minecraft with Video Pretraining
OpenAI published a paper detailing video pretraining (VPT), a semi-supervised imitation learning method that learned to play Minecraft from unlabeled datasets →read more on OpenAI blog
Quantum ML Progress
AI labs from Google, Microsoft, CalTech, Harvard and others collaborated on quantum ML (QML) techniques that show tangible improvements over classical counterparts →read more on Google Research blog
Swin Transformer Improvements
Microsoft Research published details about improvements to Swin Transformer, its 3 billion parameter computer vision model →read more on Microsoft Research blog
GODEL
Microsoft Research published a paper detailing GODEL, a new form of pretrained language model that also leverages external datasets, allowing it to focus on specific tasks or engage in open-ended conversation →read more on Microsoft Research blog
📌 Event: June 29th – Arize:Observe Unstructured
Only three days left to register for Arize:Observe Unstructured. This free, virtual event on Wednesday features an all-star lineup of speakers, including experts from OpenAI, Hugging Face, the creator of UMAP, and more! Register now.
🤖 Cool AI Tech Releases
GitHub Copilot GA
GitHub’s AI-based pair-programming agent reached general availability →read more on GitHub blog
TorchGeo
PyTorch open-sourced TorchGeo, a library for processing geospatial data in ML models →read more on PyTorch blog
🛠 Real World ML
PyTorch at Disney
Disney Media & Entertainment Distribution (DMED) detailed the PyTorch architecture it uses for activity recognition across video, audio, and text datasets →read more on PyTorch blog
💸 Money in AI
Predictive procurement orchestration platform Arkestro raised a $26 million Series A funding round led by NEA, Construct, Koch Disruptive Technologies (KDT) and Four More Capital. Hiring across the US.
AI quality management tools provider TruEra announced an investment from Hewlett Packard Enterprise. This extends the $25 million Series B round TruEra announced in March and brings TruEra’s total funding to date to over $45 million. Hiring in the US and India.