Edge#116: AI2-THOR is an Open-Source Framework for Embodied AI Research
The challenges of embodied AI are many and highly diverse
This is an example of TheSequence Edge, a Premium newsletter that our subscribers receive every Tuesday and Thursday. On Thursdays, we do deep dives into one of the freshest research papers or technology frameworks that is worth your attention. It helps you become smarter about ML and AI.
What's New in AI: AI2-THOR is an Open-Source Framework for Embodied AI Research
Training deep learning models to interact with visual environments is one of the most complex and expensive challenges in the modern AI ecosystem. Many scenarios in robotics require interaction with physical objects and environments, which are incredibly hard and costly to reproduce in a lab. The AI community refers to this branch of research as Embodied AI, and it remains a very active area of research.
The challenges of Embodied AI are many and highly diverse. For starters, training intelligent agents to interact with physical environments requires photorealistic 3D simulations, which are very difficult to produce. Modeling physics accurately is far harder than building most other deep learning environments. Additionally, the richness and complexity of physical environments often result in agents encountering situations they haven't seen during training. These challenges have constrained embodied AI to the big AI research labs, since it remains cost-prohibitive for many data science teams.
One of the fundamental building blocks needed to accelerate the research and implementation of embodied AI techniques is creating training environments that can provide realistic simulations of real-world scenarios.
Enter AI2-THOR
The Allen Institute for AI (AI2) is home to an initiative known as Perceptual Reasoning and Interaction Research (PRIOR). The goal of PRIOR is to advance research in different fields of computer vision. One of the first projects launched under the PRIOR umbrella was AI2-THOR, an open-source framework for training computer vision agents to interact with visual environments. AI2-THOR provides highly visually detailed environments that simulate the physical properties of objects in the real world. It leverages Unity 3D to create realistic simulations of real-world environments and abstracts programmatic interactions via a simple API. From a conceptual standpoint, AI2-THOR provides the following benefits:
Visually Complex Environments: AI2-THOR environments try to mimic the conditions of real-world scenarios, allowing computer vision agents to transition from the lab to production without major changes.
Bias Mitigation: The scenes in AI2-THOR are designed manually, which mitigates the biases incurred in automatically generated environments.
API: AI2-THOR provides a Python API that abstracts the programmatic interactions with environments.
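To make that command-style programming model concrete, here is a minimal sketch of the interaction loop the API exposes: issue an action, get back an event carrying updated scene metadata. The stub controller below only imitates that shape so the snippet runs without AI2-THOR installed; it is an illustration, not the framework's own code (the real entry point is the `Controller` class in the `ai2thor` Python package).

```python
# Stand-in that imitates the command-style loop of the AI2-THOR Python API:
# each call to step() takes an action name plus parameters and returns an
# event carrying updated scene metadata. Hypothetical stub, not the real API.

class Event:
    def __init__(self, metadata):
        self.metadata = metadata

class StubController:
    def __init__(self, scene):
        self.scene = scene
        self.position = {"x": 0.0, "y": 0.0, "z": 0.0}

    def step(self, action, **params):
        # Only one action is modeled here: moving the agent forward.
        if action == "MoveAhead":
            self.position["z"] += params.get("moveMagnitude", 0.25)
        return Event({"agent": {"position": dict(self.position)},
                      "lastActionSuccess": True})

controller = StubController(scene="FloorPlan1")
event = controller.step(action="MoveAhead", moveMagnitude=0.5)
print(event.metadata["agent"]["position"]["z"])  # 0.5
```

The essential idea carries over to the real framework: the agent is driven entirely through named actions, and every step returns structured metadata describing the resulting scene state.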
Despite their diversity, AI2-THOR environments are based on a basic set of concepts:
Scene: AI2-THOR scenes abstract virtual rooms that an agent can interact with.
Agent: An agent is an entity that is trained in a visual environment.
Action: AI2-THOR uses actions to abstract commands that need to be executed by specific agents.
Object: This concept abstracts any 3D model inside a scene. An object is visible if it is within the camera viewport. Similarly, an object is interactable if it is unobstructed by other objects.
Receptacle: This is an object that can contain other objects.
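A toy model can make these concepts and their relationships concrete. The classes below are hypothetical illustrations (not AI2-THOR's own types) that encode the visibility and interactability rules described above, with a receptacle modeled as an object that holds other objects:

```python
# Toy model of the AI2-THOR concepts: an Object is visible when inside the
# camera viewport and interactable when additionally unobstructed; a
# Receptacle is simply an Object that can contain other Objects.
# Hypothetical classes for illustration, not the framework's own types.
from dataclasses import dataclass, field

@dataclass
class Object:
    object_id: str
    in_viewport: bool = False   # inside the camera's field of view?
    obstructed: bool = False    # blocked by another object?

    @property
    def visible(self):
        return self.in_viewport

    @property
    def interactable(self):
        return self.in_viewport and not self.obstructed

@dataclass
class Receptacle(Object):
    contains: list = field(default_factory=list)  # objects placed inside

mug = Object("Mug_1", in_viewport=True, obstructed=True)
fridge = Receptacle("Fridge_1", in_viewport=True)
fridge.contains.append(mug)

print(mug.visible, mug.interactable)  # True False
```

The mug illustrates the distinction drawn above: it is visible (in the viewport) yet not interactable, because another object obstructs it.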
The architecture of AI2-THOR is based on an HTTP service, built with the Flask framework, that receives requests and executes actions within the Unity game engine via a controller interface. Responses are encoded in JSON format.
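As a rough illustration of that design, an action command and its reply would travel as JSON payloads between the client and the service. The field names below are assumptions chosen for illustration, not AI2-THOR's actual wire format:

```python
import json

# Hypothetical request/response payloads for an HTTP service that forwards
# actions to the Unity engine. Field names are illustrative only, not
# AI2-THOR's actual wire format.
request = json.dumps({"action": "RotateRight", "degrees": 90})

# The controller executes the action in Unity and replies with JSON
# describing the outcome and the updated agent state.
response = json.loads('{"lastActionSuccess": true, '
                      '"agent": {"rotation": {"y": 90}}}')

print(response["agent"]["rotation"]["y"])  # 90
```

Keeping the protocol as plain JSON over HTTP is what lets the Python API stay thin: it only needs to serialize action dictionaries and deserialize the resulting scene metadata.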
To provide specialized support for different embodied AI areas, the AI2-THOR team has launched three main subprojects that are available in the current version of the framework:
iTHOR: A framework that enables research in embodied common-sense reasoning.
ManipulaTHOR: A framework focused on object manipulation with a robotic arm.
RoboTHOR: A framework that simulates scenes with counterparts in the physical world.
iTHOR
In simple terms, we can think of iTHOR as an environment of interactive scenes and objects that simulate the physics of the real world. Built on the AI2-THOR framework, iTHOR includes a series of key building blocks that are essential to building agents that interact with the physics of real-world environments:
Object Manipulation: iTHOR provides actions such as dropping, pushing, and many other relevant manipulations of real-world objects.
Physics: iTHOR uses the Unity physics engine to model key characteristics of objects such as mass, volume, friction, and many others.
State Changes: iTHOR captures the state changes in objects due to the execution of a specific action. For instance, objects can go from open to closed, on to off, etc.
Multi-Agent: iTHOR can simulate environments with multiple agents performing different tasks.
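The state-change mechanism can be sketched as a function that maps an action to an update of an object's metadata. The action names echo iTHOR's toggle and open operations, but the implementation below is a simplified stand-in for illustration, not the framework's code:

```python
# Simplified sketch of state changes: applying an action flips the
# corresponding flag in the object's metadata dictionary, mirroring the
# open/closed and on/off transitions described above. Hypothetical
# stand-in, not iTHOR's implementation.
def apply_action(obj, action):
    if action == "ToggleObjectOn":
        obj["isToggled"] = True
    elif action == "ToggleObjectOff":
        obj["isToggled"] = False
    elif action == "OpenObject" and obj.get("openable"):
        obj["isOpen"] = True
    elif action == "CloseObject" and obj.get("openable"):
        obj["isOpen"] = False
    return obj

lamp = {"objectId": "FloorLamp_1", "isToggled": False}
fridge = {"objectId": "Fridge_1", "openable": True, "isOpen": False}

apply_action(lamp, "ToggleObjectOn")
apply_action(fridge, "OpenObject")
print(lamp["isToggled"], fridge["isOpen"])  # True True
```

Note how the open/close path checks an `openable` capability flag first: not every object supports every state transition, which is exactly why per-object state metadata matters for training agents.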
ManipulaTHOR
ManipulaTHOR is a framework specialized in object manipulations using a robotic arm. At a basic level, ManipulaTHOR is an extension of the AI2-THOR framework that adds an arm to the agents so that they can not only navigate environments but also manipulate objects within them. From a functional standpoint, ManipulaTHOR enables the following key capabilities:
Mobile Manipulation: ManipulaTHOR combines navigation and manipulation in a single framework.
Sensor Suite: ManipulaTHOR incorporates a suite of non-visual sensors such as touch, which are essential to simulate object manipulations.
Real Arm Design: ManipulaTHOR includes a robotic arm based on the Kinova Gen3 specification.
DOF Manipulation: ManipulaTHOR enables advanced six-degrees-of-freedom (6-DOF) manipulation of objects, which includes very specific actions such as grasping or rotating.
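To unpack "six degrees of freedom": a free end effector has three translational and three rotational degrees, so a target pose is fully specified by a position plus an orientation. The pose type below is a hypothetical illustration of that idea, not ManipulaTHOR's API:

```python
from dataclasses import dataclass

# Hypothetical pose type illustrating 6-DOF manipulation: three
# translational degrees (x, y, z) and three rotational degrees
# (roll, pitch, yaw) fully specify an end-effector target.
@dataclass
class Pose6DOF:
    x: float          # translation, meters
    y: float
    z: float
    roll: float       # rotation, degrees
    pitch: float
    yaw: float

    def rotated(self, d_yaw):
        """Return a new pose rotated about the vertical axis."""
        return Pose6DOF(self.x, self.y, self.z,
                        self.roll, self.pitch, (self.yaw + d_yaw) % 360)

grasp = Pose6DOF(0.3, 0.9, 0.1, 0.0, 0.0, 350.0)
print(grasp.rotated(20).yaw)  # 10.0
```

Actions like grasping or rotating an object then reduce to moving the arm through a sequence of such poses, which is what makes full 6-DOF control strictly richer than planar (position-only) manipulation.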
RoboTHOR
RoboTHOR is an environment built on AI2-THOR that provides a catalog of simulated scenes with counterparts in the physical world. The goal is to minimize the friction of transferring agents from simulated to real-world environments. Functionally, RoboTHOR enables the following key features:
Simulated-Real Pairings: In RoboTHOR, each synthetic room is accompanied by a real one, making it easier to study discrepancies.
Modular: The composition of rooms in RoboTHOR is highly modular and based on a configurable asset library. This allows the rapid creation of diverse room environments.
Reconfigurable: RoboTHOR's physical environments have also been built using modular components, facilitating the creation of many different layouts.
Open-Source: The RoboTHOR simulation environment is entirely open-source, which should help advance research in the space.
Conclusion
Embodied AI research is notoriously challenging and resource-intensive. AI2-THOR provides one of the most advanced open-source frameworks to help advance embodied AI research. Composed of three main projects (iTHOR, ManipulaTHOR, and RoboTHOR), the AI2-THOR framework provides the key components to simplify the implementation and training of embodied AI agents. With a simple programming model that leverages state-of-the-art simulation platforms such as Unity, AI2-THOR should be one of the key frameworks to consider for embodied AI projects.