
Google DeepMind Releases Gemini Robotics-ER 1.6: Bringing Advanced Reasoning to Physical AI

Google DeepMind’s research team has unveiled Gemini Robotics-ER 1.6, a significant update to its embodied reasoning model designed to act as the ‘cognitive brain’ of robots operating in real-world environments. The model focuses on the critical reasoning capabilities robots need, including visual and spatial understanding, task planning, and success detection. It serves as a robot’s high-level reasoning model, capable of orchestrating tasks by natively calling tools such as Google Search, vision-language-action (VLA) models, and any other third-party user-defined functions.

Here’s the important architectural idea to understand: Google DeepMind takes a two-model approach to robotics AI. Gemini Robotics 1.5 is a vision-language-action (VLA) model: it processes visual input and user instructions and translates them directly into motor commands. Gemini Robotics-ER 1.6, by contrast, is an embodied reasoning model: it focuses on understanding physical environments, planning, and making logical decisions, but it does not directly control the robot’s limbs. Instead, it produces high-level instructions that tell the VLA model what to do next. Think of it as the difference between a planner and a doer: Gemini Robotics-ER 1.6 is the planner.
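
To make that division of labor concrete, here is a minimal, hypothetical control loop in Python. None of these interfaces are official APIs; `reasoner`, `vla`, and `robot` are illustrative stand-ins for the embodied reasoning model, the VLA model, and the hardware layer:

```python
# Hypothetical sketch of the planner/doer split described above.
# `reasoner`, `vla`, and `robot` are illustrative stand-ins, not real APIs.

def run_task(reasoner, vla, robot, instruction: str, max_retries: int = 2) -> None:
    """The reasoner plans and verifies; the VLA turns each step into motion."""
    while True:
        frames = robot.capture_cameras()            # e.g. overhead + wrist views
        step = reasoner.plan_next_step(instruction, frames)
        if step is None:                            # reasoner judges the task done
            return
        for attempt in range(max_retries + 1):
            vla.execute(step, robot)                # low-level motor commands
            if reasoner.verify_success(step, robot.capture_cameras()):
                break                               # success: plan the next step
        # After retries are exhausted, the outer loop re-plans from scratch.
```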

What’s New in Gemini Robotics-ER 1.6

Gemini Robotics-ER 1.6 shows significant improvements over both Gemini Robotics-ER 1.5 and Gemini 3.0 Flash, particularly on spatial and physical reasoning skills such as pointing, counting, and success detection. But the most important addition is a capability that was entirely absent from previous versions: instrument reading.

Pointing as a Foundation for Spatial Reasoning

Pointing, the model’s ability to pinpoint precise pixel-level locations in an image, is more powerful than it sounds. Points can be used to express spatial reasoning (precise object detection and counting), relational reasoning (making comparisons such as identifying the smallest object in a set, or grounding a from-to relationship such as ‘move X to position Y’), motion reasoning (mapping trajectories and identifying suitable grasp points), and constraint compliance (localizing an object that satisfies a stated constraint, such as a particular color or size).
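
Here is what a pointing query might look like in practice, sketched with the google-genai Python SDK. The model id `gemini-robotics-er-1.6-preview` is an assumption (1.5 shipped as `gemini-robotics-er-1.5-preview`), and the `[y, x]` coordinates normalized to 0-1000 follow the output convention documented for ER 1.5, which may change in 1.6:

```python
# Hedged sketch: a pointing query with the google-genai SDK.
# Assumptions: the model id, and the [y, x] 0-1000 output convention
# documented for Gemini Robotics-ER 1.5.
import json
from PIL import Image
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")
image = Image.open("workbench.jpg")

prompt = (
    "Point to every screwdriver in the image. Answer with a JSON list of "
    '{"point": [y, x], "label": <name>} entries, coordinates normalized '
    "to 0-1000. Return [] if none are visible."
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.6-preview",  # assumed model id
    contents=[image, prompt],
)
points = json.loads(response.text)  # in practice, strip ```json fences first

# Map normalized points back to pixel coordinates for downstream use.
w, h = image.size
pixel_points = [(p["point"][1] / 1000 * w, p["point"][0] / 1000 * h) for p in points]
```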

In internal benchmarks, Gemini Robotics-ER 1.6 shows a clear advantage over its predecessor. It correctly identifies the number of hammers, scissors, paintbrushes, pliers, and garden tools in a test scene, and does not report requested items that are absent from the image, such as the wheelbarrow and the Ryobi drill. Gemini Robotics-ER 1.5, in comparison, fails to count the hammers and paintbrushes correctly, misses the scissors entirely, and hallucinates a wheelbarrow. For robotics engineers this matters because a hallucinated object in a robot’s perception pipeline can cause downstream failure: a robot that ‘sees’ an object that isn’t there will try to interact with empty space.
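
Continuing the sketch above, the same interface can be prompted to count and to report absences explicitly, so a downstream planner never acts on an object the model merely imagined (the model id is again assumed):

```python
# Hedged sketch, reusing `client` and `image` from the previous snippet.
inventory_prompt = (
    "Count the hammers, scissors, paintbrushes, pliers, wheelbarrows, and "
    "drills visible in this image. Answer with a JSON object mapping each "
    "category to an integer; use 0 for anything that is not visible."
)
response = client.models.generate_content(
    model="gemini-robotics-er-1.6-preview",  # assumed model id
    contents=[image, inventory_prompt],
)
counts = json.loads(response.text)

# Gate downstream grasp planning on a positive count instead of trusting
# every detection, so absent items ("wheelbarrow": 0) are never targeted.
targets = [name for name, n in counts.items() if n > 0]
```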

Success Detection and Multi-View Reasoning

For robots, knowing when a task is finished is as important as knowing how to start it. Success detection acts as a key decision-making signal, letting the agent intelligently decide whether to retry a failed attempt or move on to the next stage of the process.

This is a harder problem than it looks. Most modern robotic setups include multiple camera views, such as an overhead camera and a wrist-mounted camera, so the system needs to understand how these different views fit together into a single coherent picture of the scene at any given moment. Gemini Robotics-ER 1.6 improves multi-view reasoning, enabling it to better integrate information from multiple camera feeds, even in cluttered or dynamic environments.
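
A success-detection call under the same assumptions as the earlier snippets might combine both camera feeds into a single strict verdict the agent can branch on:

```python
# Hedged sketch: multi-view success detection with a strict JSON verdict.
overhead = Image.open("overhead_after.jpg")   # overhead camera frame
wrist = Image.open("wrist_after.jpg")         # wrist-mounted camera frame

verdict = client.models.generate_content(
    model="gemini-robotics-er-1.6-preview",  # assumed model id
    contents=[
        overhead,
        wrist,
        'The robot attempted: "place the mug on the coaster". Using both '
        'views, answer with exactly {"success": true} or {"success": false}.',
    ],
)
succeeded = json.loads(verdict.text)["success"]
next_action = "advance_plan" if succeeded else "retry_step"
```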

Instrument Reading: A Capability Born from Real-World Needs

A genuinely new capability in Gemini Robotics-ER 1.6 is instrument reading: the ability to interpret analog gauges, pressure meters, sight glasses, and digital readouts in industrial settings. This work stems from the needs of industrial inspection, a key focus area for Boston Dynamics. Spot, Boston Dynamics’ quadruped robot, can visit instruments throughout a facility and photograph them for Gemini Robotics-ER 1.6 to interpret.

Reading instruments requires complex visual reasoning: the model must accurately perceive a variety of inputs, including needles, fluid levels, container boundaries, and markings, and understand how they all relate to each other. For sight glasses, this means estimating how much liquid fills the glass while accounting for distortion from the camera’s perspective. Gauges often carry text that defines the unit, which must be read and interpreted, and some have multiple needles encoding different decimal places that need to be combined into a single reading.
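
To see why this is hard, consider the core geometry of a single-needle gauge. Once the relevant points are localized (the needle tip, the pivot, and the minimum and maximum tick marks), the reading reduces to linear interpolation over the needle’s sweep angle. This is an illustrative reconstruction, not DeepMind’s published method, and all coordinates below are made up:

```python
# Illustrative gauge geometry: linear interpolation over the sweep angle.
import math

def gauge_value(pivot, needle_tip, min_tick, max_tick, vmin, vmax):
    """Estimate a gauge reading from pointed pixel coordinates."""
    def angle(p):
        return math.atan2(p[1] - pivot[1], p[0] - pivot[0])

    # Fraction of the full min->max sweep covered by the needle.
    full_sweep = (angle(max_tick) - angle(min_tick)) % (2 * math.pi)
    needle_sweep = (angle(needle_tip) - angle(min_tick)) % (2 * math.pi)
    return vmin + (needle_sweep / full_sweep) * (vmax - vmin)

# Made-up coordinates for a 0-16 bar pressure gauge:
print(gauge_value(pivot=(200, 200), needle_tip=(150, 90),
                  min_tick=(90, 310), max_tick=(310, 310),
                  vmin=0.0, vmax=16.0))   # ~6.6 bar
```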

Gemini Robotics-ER 1.6 achieves its instrument-reading results using agentic vision, a skill that combines visual reasoning with coding, introduced in Gemini 3.0 Flash and extended in Gemini Robotics-ER 1.6. The model takes intermediate steps: first zooming in on the image to better read the small details on the gauge, then using pointing and code to estimate measurements and intervals, and finally applying world knowledge to interpret what the reading means.
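
The agentic-vision tooling itself runs inside the model, but the zoom-then-read pattern can be approximated client-side. In this sketch (same assumed model id and client as above), the model first points at the gauge, the client crops and enlarges that region, and a second call reads the now-legible dial:

```python
# Hedged sketch: approximate the zoom step of agentic vision client-side.
scene = Image.open("facility_photo.jpg")
w, h = scene.size

# Step 1: locate the gauge in the wide shot.
locate = client.models.generate_content(
    model="gemini-robotics-er-1.6-preview",  # assumed model id
    contents=[scene, 'Point to the pressure gauge. Answer with JSON '
                     '{"point": [y, x]}, coordinates normalized to 0-1000.'],
)
y, x = json.loads(locate.text)["point"]
cx, cy = int(x / 1000 * w), int(y / 1000 * h)

# Step 2: zoom in so tick marks and unit text become legible.
margin = 200
crop = scene.crop((cx - margin, cy - margin, cx + margin, cy + margin))

# Step 3: read the enlarged gauge, units included.
reading = client.models.generate_content(
    model="gemini-robotics-er-1.6-preview",  # assumed model id
    contents=[crop, "Read this gauge. Report the numeric value and its unit."],
)
print(reading.text)
```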

On the instrument-reading benchmark, Gemini Robotics-ER 1.5 achieves a 23% success rate, Gemini 3.0 Flash 67%, Gemini Robotics-ER 1.6 86%, and Gemini Robotics-ER 1.6 with agentic vision 93%. One important caveat: Gemini Robotics-ER 1.5 was tested without agentic vision because it does not support that capability, so its 23% baseline reflects both a raw performance gap and an architectural difference. For AI developers benchmarking across model generations, this distinction matters: you are not comparing apples to apples in a single results column.

Key Takeaways

  • Gemini Robotics-ER 1.6 is a reasoning model, not an action model: It acts as the robot’s high-level ‘brain’, handling spatial understanding, task planning, and success detection, while a separate VLA model (Gemini Robotics 1.5) handles the actual motor commands of the body.
  • Pointing is more powerful than it sounds: Gemini Robotics-ER 1.6’s pointing capability goes beyond simple object detection, enabling relational reasoning, motion trajectory mapping, grasp-point identification, and constraint-based localization, all fundamental to reliable robot manipulation.
  • Instrument reading is a genuinely new capability: Developed with Boston Dynamics’ Spot robot and industrial site inspection in mind, Gemini Robotics-ER 1.6 can now read analog gauges, pressure meters, and sight glasses with 93% accuracy using agentic vision, up from just 23% for Gemini Robotics-ER 1.5, which lacked agentic vision entirely.
  • Success detection is what enables real autonomy: Knowing when a task has actually been completed, across multiple camera views and in cluttered or dynamic environments, is what lets a robot decide whether to retry a step or move on to the next one without human intervention.

Check out the Technical details and Model Information.


