
LLMs help robots understand vague instructions and focus on important details | MIT News

Imagine working in a warehouse or office in the near future, and you’re asked to help an intern learn the basics of the job. The catch: the intern is a robot. To teach it, you may want to play a game of “show and tell”: physically demonstrating how to do something in a few different ways, while also explaining what you’re doing.

Let’s say you ask a robot to put a coffee on your desk without interrupting you during a Zoom call. You would prefer that the robot not get too close to you or your laptop, so that it does not disrupt your meeting. To enable this behavior, the robot must be trained on data that clearly shows it the full task. Computer scientists have tried to explain manipulation tasks to robots by recording many visual demonstrations or by writing extensive instructions. But without both, the machine may not understand what to do.

All that showing and telling is a lot of work for humans, so researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) automated the process of teaching a robot, automatically clarifying instructions while using nearly five times less demonstration data. Their “Masked Inverse Reinforcement Learning” (Masked IRL) approach uses a large language model (LLM) to clarify ambiguous instructions based on data collected from a user’s demonstration. Another LLM then narrows down the details that the algorithm must factor into the robot’s motion plan, so that the robot can safely complete tasks in homes, offices, and factories.

“Our approach may be useful when a human interacts with a robot but does not want to explain all the details of the task,” said MIT PhD student and CSAIL researcher Minyoung Hwang, lead author of a paper introducing the project. “We reduce human effort by allowing machines to get to the bottom of what users really want.”

According to Hwang, Masked IRL can help robots navigate safely in settings with features that a human might not think to mention, but that are important nonetheless. For example, a robot fetching snacks in a kitchen may not know to avoid colliding with your laptop. Similarly, a factory robot that places items in different boxes must carefully navigate around shelves.

To learn new tasks in these situations, Masked IRL uses the robot’s sensors to capture information about its surroundings. These sensors also record each movement of a kinesthetic demonstration, a training method in which a person physically moves the robot to perform a specific action. It’s like being a physical therapist for a machine, bending its joints in a certain direction to show the robot how to grasp, move, and place things.

The MIT system then calls on an LLM to compare this sequence of movements (called a trajectory) to the shortest path. The model also spells out what may not be immediately obvious, turning a request like “stay close” into “stay close to the surface of the table.” By comparing the trajectory with the clarified instruction, the LLM begins to understand why the demonstrated movements matter for the task.
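As a rough illustration of the comparison described above, the sketch below measures how far each waypoint of a demonstrated trajectory strays from the straight line between its start and end points. The article does not say how the system represents trajectories or what it passes to the LLM, so the function names, the 2D waypoints, and the idea of summarizing the comparison as per-waypoint distances are all assumptions made for illustration.

```python
import math


def straight_line_path(start, goal, n_points):
    """Evenly spaced waypoints on the straight segment from start to goal."""
    if n_points < 2:
        raise ValueError("need at least two waypoints")
    return [
        tuple(s + (g - s) * i / (n_points - 1) for s, g in zip(start, goal))
        for i in range(n_points)
    ]


def deviation_from_shortest(trajectory):
    """Euclidean distance between each demo waypoint and the matching
    waypoint on the straight-line path between the demo's endpoints."""
    baseline = straight_line_path(trajectory[0], trajectory[-1], len(trajectory))
    return [math.dist(p, q) for p, q in zip(trajectory, baseline)]


# Hypothetical demo in made-up units: the arm detours around an obstacle
# instead of moving directly from (0, 0) to (3, 0).
demo = [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0), (3.0, 0.0)]
deviations = deviation_from_shortest(demo)
print(deviations)
```

A large deviation in the middle of the demonstration hints that the person deliberately steered away from the direct route, which is the kind of signal a model could use to infer an unstated preference.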

A second LLM then evaluates the details of the environment, such as the locations of obstacles and the position of the robot’s target. During this process, it “masks” (in other words, ignores) elements it deems unimportant to the task at hand, scoring each one as “1” (important) or “0” (not important). For example, whether or not the user was leaning on the table during the demonstration would be scored “0,” marking it as irrelevant. Any information scored “1” is factored into the final motion plan by the algorithm.
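The masking step can be pictured with a small sketch: each element of the environment gets a 1 or a 0, and anything scored 0 contributes nothing downstream. The feature names, the numeric values, and the weighted-sum score below are hypothetical and chosen only to make the idea concrete. The article does not describe how Masked IRL actually combines the kept information, so this should not be read as the authors' implementation.

```python
def apply_mask(features, mask):
    """Zero out every feature whose mask entry is 0; keep those marked 1."""
    if features.keys() != mask.keys():
        raise ValueError("features and mask must name the same elements")
    if any(flag not in (0, 1) for flag in mask.values()):
        raise ValueError("mask entries must be 0 or 1")
    return {name: value * mask[name] for name, value in features.items()}


def masked_score(features, mask, weights):
    """Weighted sum over the features that survive the mask (an assumed,
    simplified stand-in for a learned reward)."""
    masked = apply_mask(features, mask)
    return sum(weights[name] * value for name, value in masked.items())


# Hypothetical environment description for one moment of a demonstration.
features = {
    "distance_to_laptop": 0.4,
    "height_above_table": 0.1,
    "user_leaning_on_table": 1.0,
}
# 1 = important to the task, 0 = ignored, mirroring the article's example.
mask = {
    "distance_to_laptop": 1,
    "height_above_table": 1,
    "user_leaning_on_table": 0,
}
# Made-up weights; the masked element's large weight has no effect.
weights = {
    "distance_to_laptop": 2.0,
    "height_above_table": -1.0,
    "user_leaning_on_table": 5.0,
}

masked = apply_mask(features, mask)
score = masked_score(features, mask, weights)
```

Because the "user leaning on the table" entry is masked out, changing its value or its weight leaves the score unchanged, which is the point of the mask: irrelevant details cannot distract the planner.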

These masks gave Masked IRL a significant advantage over comparable baselines in both 3D simulations and the real world, because they taught the robot which information to prioritize. Thanks to the researchers’ system, virtual and physical robots alike are able to maneuver objects around obstacles, such as moving a coffee cup around a laptop to different locations on a table. In these tasks, Masked IRL correctly identified users’ preferences, which they had not explicitly stated in their instructions, up to 15 percent more often than comparable baselines.

During simulation tests, CSAIL researchers also found that Masked IRL was a fast learner: It needed only a few demonstrations to understand how to move the cup from its starting point. They also found that the robots performed better when the LLM clarified the instructions than when the machine tried to follow a vague request.

This targeted approach also transferred accurately to a real robotic arm, which carried out commands that the system had not encountered during its training phase. After being trained on 50 kinesthetic demonstrations, the robot carefully moved a cup to a human while avoiding a collision with the user’s computer, an obstacle it learned to avoid by elaborating on a simple request to “stay away.” It also cleared the table while “staying close” to it, and handed the user a bag of chips while “staying away” from both the person and the table.

Masked IRL hears and interprets what users leave unsaid, but soon, it may “see” as well. CSAIL researchers plan to make their approach more capable by equipping it with cameras, allowing the robot to capture images of its surroundings. It could then highlight and focus on certain nearby objects. For example, if you asked the machine to pick up a toy, it might see a banana nearby and ignore it before grabbing the target.

Hwang co-authored the paper with three CSAIL colleagues: PhD student Alexandra Forsey-Smerek ’20, SM ’22; postdoc Nathaniel Dennler; and MIT Assistant Professor Andreea Bobu, a member of the Aeronautics and Astronautics department and CSAIL. Their work was supported, in part, by the Tata Group through an MIT Generative AI Impact Consortium Award, and the Department of Defense. They will present the project at the 2026 IEEE International Conference on Robotics and Automation in June.
