Google DeepMind Unveils Its First ‘Thinking’ Robotics AI, Opening the Era of Agentic Robots

Google DeepMind is pursuing a new frontier in robotics by pairing generative AI with embodied agents. The Gemini Robotics program introduces two complementary models that together enable robots to “think” before acting. DeepMind’s researchers describe this approach as the dawn of agentic robots—systems that can interpret, reason about, and plan actions in physical environments using the same generative techniques that now drive text, image, audio, and video generation. The idea is to move beyond the conventional, task-specific robotics that rely on narrow, painstakingly engineered capabilities and toward flexible, generalized behavior that can adapt to new situations without reprogramming. This shift reflects a broader industry trend where large language models and other foundation AI systems increasingly inform physical intelligence, enabling machines to perceive, reason, and respond with greater autonomy and versatility.

In the broader context of AI-enabled robotics, DeepMind argues that generative models bring a uniquely powerful capability: the capacity to generalize across tasks and contexts. Traditional robotics has often required extensive retooling, reconfiguration, and retraining to add even a single new capability. The cost and complexity of deploying a new function can stretch across months, with bespoke hardware and software integrations that tie a robot to a narrow set of tasks. According to Carolina Parada, who leads Google DeepMind’s robotics effort, “Robots today are highly bespoke and difficult to deploy, often taking many months in order to install a single cell that can do a single task.” The promise of a generative approach is that the same underlying AI can interpret new environments, generate the sequence of actions needed to accomplish a goal, and adapt plans as circumstances change. The thinking, in other words, is not hard-coded into every movement; it emerges from the model’s ability to reason with visual and linguistic cues.

The Gemini Robotics project rests on a two-model architecture designed to handle perception, planning, and action in a coordinated fashion. On one side sits the action-oriented model, which translates perceptual input into low-level robot movements. On the other side sits the reasoning-oriented model, which analyzes the task, weighs potential strategies, and generates a step-by-step plan. This separation is intentional: the thinking model can explore multiple pathways, call external tools to augment its understanding, and produce structured instructions that the action model can execute. The expectation is that such a pairing will yield robots capable of handling novel tasks with minimal, or even no, task-specific reprogramming.

Gemini Robotics 1.5 and Gemini Robotics-ER 1.5: a duo designed to think and to act

The two new models introduced are named Gemini Robotics 1.5 and Gemini Robotics-ER 1.5. The first, Gemini Robotics 1.5, is described as a vision-language-action (VLA) model. In practical terms, this means it ingests both visual information from the robot’s environment and textual data (which could include instructions, descriptions, or contextual cues) and then outputs actionable robot commands. The combination of what the robot sees, what it reads or is told, and how it interprets those inputs leads to a concrete sequence of motor actions that drive the robot’s behavior. The “action” component does not act in a vacuum; it relies on the ER model’s prior deliberations and the stream of inputs available during execution to inform and refine its movements.

The second model, Gemini Robotics-ER 1.5, carries the abbreviation ER for embodied reasoning. This model is a vision-language model (VLM) that processes visual inputs and textual context to reason through complex tasks and determine the steps needed to complete them. Unlike the action model, ER does not directly issue motor commands. Instead, it generates natural language instructions or structured task plans that outline how a task should be approached, step by step. In this sense, ER acts as a planning and reasoning engine, producing a blueprint for action rather than executing it.
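
To make this division of labor concrete, here is a minimal sketch of how the two roles could be expressed in code. The class names, fields, and method signatures are illustrative assumptions, not DeepMind’s API; they only capture the published description: the ER model emits natural-language steps, while the action model maps a step plus the current observation to motor commands.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Observation:
    """A snapshot of what the robot currently sees and is told."""
    camera_image: bytes   # encoded RGB frame from the robot's camera
    instruction: str      # the human-provided goal, e.g. "sort the laundry"

@dataclass
class PlanStep:
    """One natural-language step produced by the embodied-reasoning model."""
    description: str      # e.g. "pick up the white shirt and place it in bin A"

class EmbodiedReasoner:
    """Stands in for Gemini Robotics-ER 1.5: reasons about the task and emits a plan,
    but never issues motor commands itself."""
    def plan(self, obs: Observation) -> List[PlanStep]:
        # A real model would reason over the image and instruction (and may call tools).
        return [PlanStep("inspect the pile"), PlanStep("move whites to bin A")]

class ActionModel:
    """Stands in for Gemini Robotics 1.5: a vision-language-action model that turns
    a plan step plus the current observation into low-level motor commands."""
    def act(self, step: PlanStep, obs: Observation) -> List[float]:
        # A real VLA model would output joint targets or end-effector commands.
        return [0.0] * 7  # placeholder 7-DoF command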

Together, these two models form a loop: the ER model reasons about a task given the current environment and context, then passes its instructions to the 1.5 action model, which converts those instructions into concrete robot actions while incorporating real-time visual feedback. DeepMind emphasizes that the thinking process is not merely a superficial layer of commentary; it represents an integrated form of simulated reasoning that helps the robot anticipate challenges, select appropriate strategies, and adjust its approach as needed. The distinction between “thinking” and “doing” is important in this framework, as it signals a move toward more deliberate and adaptable robotic behavior.
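
Continuing the sketch above, the loop DeepMind describes could be wired together roughly as follows. The replanning rule, the get_observation and send_command callables, and the failure handling are assumptions added for illustration, not a description of the production system.

```python
def run_task(reasoner, actor, get_observation, send_command, instruction, max_replans=3):
    """Illustrative ER-to-action loop: plan, execute step by step against fresh
    observations, and ask the reasoning model for a new plan when a step fails."""
    plan = list(reasoner.plan(get_observation(instruction)))
    replans = 0
    while plan:
        step, *rest = plan
        obs = get_observation(instruction)   # fresh visual feedback for this step
        command = actor.act(step, obs)       # the VLA model turns the step into motion
        if send_command(command):            # hardware or simulator interface (assumed)
            plan = rest                      # step succeeded; move to the next one
        elif replans < max_replans:
            replans += 1
            # Step failed: hand the situation back to the reasoner for a new plan.
            plan = list(reasoner.plan(get_observation(instruction)))
        else:
            raise RuntimeError(f"could not complete step: {step}")
```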

Understanding the thinking machines: what ER 1.5 can do, and what it cannot

Gemini Robotics-ER 1.5 is described by DeepMind as the first robotics AI capable of simulated reasoning akin to contemporary chatbots. The company notes that while the ER model can demonstrate high performance on a range of academic and internal benchmarks, it does not itself perform physical actions. Instead, it makes informed decisions about how to interact with a physical space by evaluating sensory inputs, desired outcomes, and available tools. The emphasis is on accurate decision-making regarding spatial interaction and task feasibility rather than direct movement. In other words, ER’s strength lies in understanding, planning, and coordinating, while action is carried out by the corresponding action-focused model.

To illustrate Gemini Robotics-ER 1.5’s capability, consider a common household or workplace task: sorting a pile of laundry into whites and colors. The ER model would receive visual cues from the environment—images of the clothing, their textures, colors, and possible categorical labels—and the textual instruction describing the goal. It can also call upon external tools to augment its knowledge, such as performing a Google search to verify colorfastness, material properties, or handling constraints. From this input, ER generates natural language instructions that specify the exact steps needed to achieve the task. Those steps might include instructions like “sort by color, then check fabric type for delicate items, and place whites in bin A and colors in bin B,” augmented by precise sub-steps tailored to the robot’s capabilities. The ER model’s power lies in its capacity to reason over a broader context and to plan a sequence of steps that can adapt if the environment or the task conditions change.
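
As a hedged illustration of what such output might look like, the snippet below hard-codes a plan of the kind described for the laundry task. The schema, the tool name, and the exact wording of the steps are assumptions for clarity, not the model’s real output format.

```python
# A made-up plan of the sort an embodied-reasoning model might emit for the laundry task.
laundry_plan = {
    "goal": "sort the laundry pile into whites and colors",
    "tool_calls": [
        # ER can consult external tools before committing to a plan, e.g. a web
        # search to check how a particular fabric should be handled (assumed tool name).
        {"tool": "web_search", "query": "is wool colorfast in cold water"},
    ],
    "steps": [
        "pick up the next garment from the pile",
        "classify it as white or colored from the camera image",
        "if the fabric looks delicate, set it aside for hand washing",
        "place whites in bin A and colors in bin B",
        "repeat until the pile is empty",
    ],
}

# Print the plan the way it would be handed to the action model, one step at a time.
for i, step in enumerate(laundry_plan["steps"], start=1):
    print(f"{i}. {step}")
```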

The action model, Gemini Robotics 1.5, then picks up from those instructions and translates them into action. It processes the same set of inputs and uses its own internal thinking pathways to decide how best to execute each step. This means it not only follows the explicit instructions but also reasons through the execution strategy—for instance, adjusting grip strength, sequencing gripper movements, or altering trajectories in response to sensor feedback. DeepMind stresses that 1.5 can integrate a person’s intent with environmental cues to generate a robust plan, even when the environment includes partial occlusions, dynamic obstacles, or ambiguous inputs. In practice, this combination enables a more fluid, adaptable approach to physical tasks than traditional robotics pipelines that depend on rigid, pre-programmed sequences.
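
For a flavor of what execution-time adjustment could look like at the lowest level, here is a toy grasping routine that closes a gripper until a force sensor reports contact. The sensor and actuator callables, thresholds, and increments are assumptions; the model’s internal reasoning is far richer than this heuristic.

```python
def execute_grasp(read_force_sensor, set_gripper, max_force=10.0):
    """Illustrative feedback-driven grasp: close the gripper gradually and stop
    once the force sensor indicates the object is secured. The callables are
    assumed stand-ins for a robot's sensing and actuation interfaces."""
    grip = 0.0                       # 0.0 = fully open, 1.0 = fully closed
    while grip < 1.0:
        force = read_force_sensor()  # current contact force (assumed units)
        if force >= max_force:       # object secured; stop squeezing
            break
        grip = min(1.0, grip + 0.05) # close a little further
        set_gripper(grip)            # send the incremental close command
    return grip
```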

The two models are built on the same Gemini foundation, but they have been fine-tuned to operate in the physical world. This fine-tuning is essential because transitioning from purely digital tasks to embodied action introduces a host of challenges: real-world perception noise, latency in motor execution, uncertainty in tool use, and safety requirements. The pair is designed to learn across different embodiments, enabling a form of cross-robot transfer that reduces the need for bespoke tuning for each new platform.

Cross-embodiment learning: a key step toward general-purpose robotic intelligence

A standout claim from the DeepMind team is that Gemini Robotics 1.5 can perform cross-embodiment learning. In practical terms, this means that skills learned on one robot, such as the grasping strategy used by Aloha 2, can be transferred to another robot with a different architecture, like Apollo, without bespoke retuning for the latter. The developers describe a workflow where the system can generalize across embodiments, transferring competencies between a two-armed robot and a humanoid robot with more complex manipulation capabilities. The goal is to avoid the tedious process of creating and tuning separate models for each robot and each end-effector configuration.
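
DeepMind has not published the mechanism behind this transfer, but one common way to frame cross-embodiment control is a shared, embodiment-agnostic policy output combined with thin per-robot adapters. The sketch below illustrates that framing only; the adapter functions, robot keys, and placeholder arithmetic are assumptions, not DeepMind’s implementation.

```python
from typing import Callable, Dict, List

def shared_policy(observation: dict) -> List[float]:
    """Embodiment-agnostic output, e.g. a desired end-effector pose delta
    (x, y, z, roll, pitch, yaw). Placeholder values stand in for a learned model."""
    return [0.01, 0.0, -0.02, 0.0, 0.0, 0.1]

def aloha2_adapter(ee_delta: List[float]) -> List[float]:
    """Map the shared command onto Aloha 2's gripper setup (placeholder math)."""
    return ee_delta + [0.0]             # e.g. append a gripper channel

def apollo_adapter(ee_delta: List[float]) -> List[float]:
    """Map the same command onto Apollo's humanoid hand (placeholder math)."""
    return [v * 0.5 for v in ee_delta]  # e.g. rescale for different kinematics

ADAPTERS: Dict[str, Callable[[List[float]], List[float]]] = {
    "aloha2": aloha2_adapter,
    "apollo": apollo_adapter,
}

def command_for(robot: str, observation: dict) -> List[float]:
    """The same learned policy drives either robot; only the adapter changes."""
    return ADAPTERS[robot](shared_policy(observation))
```

The point of the sketch is that the learned behavior lives in the shared policy, while each new platform contributes only a small mapping layer rather than a freshly tuned model.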

This capability has significant implications for robotics as a field. If a single robust model or a small family of models can handle multiple hardware configurations, deployment becomes faster, more scalable, and more cost-efficient. It could enable organizations to maintain a fleet of diverse robotic platforms that share a common cognitive backbone, with only modest calibration layers to account for device-specific constraints. DeepMind argues that this approach could unlock more complex, multi-stage tasks that require coordinated perception, planning, and execution across different hardware platforms. The idea is to create a more flexible, resilient, and generalizable robotic system, capable of adapting to new devices and new tasks with relatively little human intervention.

Testing, deployment, and the path to practical adoption

In their demonstrations, DeepMind researchers tested Gemini Robotics with a variety of machines, including the two-armed Aloha 2 and the humanoid Apollo. Historically, researchers in robotics faced a persistent bottleneck: to deploy AI-driven control on a new robot, they often needed to build task-specific models and spend significant time engineering integrations to match the robot’s hardware and sensing modalities. The Gemini approach promises to remove much of this customization friction by providing a shared cognitive architecture that can operate across embodiments.

Although the thinking-enabled ER model is undergoing broader rollout within Google’s AI tools, the actual robotic control model—Gemini Robotics 1.5—remains restricted to trusted testers for the time being. This limitation underscores both safety concerns and the maturity timeline required to validate these systems in real-world environments that include unpredictable human-robot interactions, safety-critical constraints, and complex manipulation tasks. By contrast, the ER model is being deployed in Google AI Studio, a platform designed to let developers generate instructional plans for their own embodied experiments. This staged rollout reflects a cautious but deliberate approach to scaling a foundational capability that, if generalized successfully, could underpin a wide range of research and industrial activities.
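
For developers experimenting through AI Studio’s API surface, a planning request might look roughly like the Python sketch below, using the google-genai SDK. The SDK calls shown are real, but the model identifier and prompt are assumptions; consult AI Studio for the exact model name and for how to attach camera images to a request.

```python
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")  # placeholder credential

prompt = (
    "You are planning for a two-armed robot. "
    "Produce a numbered, step-by-step plan to sort a pile of laundry "
    "into whites and colors, one physical action per step."
)

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed identifier for the ER model
    contents=prompt,
)
print(response.text)  # the natural-language plan to hand to a separate controller layer
```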

From thinking to action: a symbiotic pipeline for robotic autonomy

The operational workflow in Gemini Robotics is designed to connect the reasoning stage with the execution stage in a tightly integrated loop. The ER model begins by interpreting the objective and considering environmental context, potential obstacles, and available tools. It can invoke external information sources to refine its understanding and to gather any missing data necessary to formulate a credible plan. Once ER has established a plan, it produces detailed, task-specific instructions that guide the robot’s behavior. The 1.5 action model then takes these instructions and translates them into motor commands, guided by real-time sensory feedback. Throughout this phase, the action model engages its own internal deliberation—reasoning about the best approach to each subtask, potential risk factors, and alternative strategies if a step proves difficult or ambiguous. This dual-layer approach aims to combine human-like planning with machine-executed precision, resulting in a more robust approach to complex tasks.
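
One way to picture the action model’s execution-time deliberation is a fallback chain over candidate strategies for a difficult step, as in the toy sketch below. The explicit strategy list and the try_strategy callable are assumptions; the real model reasons internally rather than through a hand-written fallback chain.

```python
def execute_step_with_fallbacks(step, strategies, try_strategy):
    """Illustrative deliberation at execution time: try candidate strategies for a
    step in order and fall back to the next one if an attempt fails."""
    for strategy in strategies:           # e.g. ["top grasp", "side grasp", "push then grasp"]
        if try_strategy(step, strategy):  # attempt execution; returns True on success
            return strategy
    raise RuntimeError(f"no strategy succeeded for step: {step!r}")
```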

The capability to “think” before acting distinguishes Gemini Robotics from earlier efforts that relied primarily on reactive control schemes or static planning pipelines. In a world where perception can be noisy, where tasks can unfold in unpredictable ways, and where objectives may shift in real time, having a reasoning layer that can pause, reassess, and recalibrate is a meaningful advancement. DeepMind frames this as a progression toward more agentic robots—systems that demonstrate a degree of initiative, adaptability, and strategic planning that previously belonged primarily to human operators. It is a conceptual shift as much as a technical one, signaling that robots may increasingly operate with a form of deliberation that mirrors human cognitive processes, albeit within the bounds of machine computation and safety constraints.

Operationalization across devices: the role of embodied learning and generalization

A central theme in Gemini Robotics is the emphasis on cross-embodiment learning. The team argues that their models can generalize from one hardware configuration to another, enabling a family of robots to share capabilities without bespoke reengineering for each new platform. In practice, this means that skills learned in the context of Aloha 2’s grippers could inform handling tasks on Apollo’s more intricate hands. The absence of specialized tuning for each new robot streamlines the expansion of robotic capabilities across fleets, laboratories, and industrial settings. This cross-embodiment generalization is an important step toward scalable robotics, where a single cognitive backbone can adapt to a diversity of actuation systems, sensor suites, and kinematic arrangements.

DeepMind also emphasizes that Gemini Robotics 1.5 is the model responsible for controlling robots in real-world tasks. Even though it is the action-capable model that executes on servo motors and grippers, its performance benefits from the planning and reasoning insights produced by the ER model. This division of labor allows for nuanced task decomposition, where the ER model can map out a plan that accounts for multiple potential contingencies, while the 1.5 model focuses on reliable physical execution. In practical terms, this could translate into more reliable manipulation, better handling of occlusions or partial visibility, and smoother adaptation when the workspace changes, all without requiring engineers to rewrite control policies for every new scenario.

Notes on current limitations and the roadmap ahead

Despite the advances, Gemini Robotics 1.5 remains inaccessible to broad consumer or even general enterprise use. Access is currently limited to trusted testers, reflecting a cautious approach to releasing systems that can operate in physical spaces, interact with humans, and potentially perform multi-step, high-stakes tasks. The ER model’s availability in Google AI Studio represents a more accessible avenue for developers to explore embodied instruction generation and task planning, but the actual embodiment—the robot acting in the world—remains a separate, guarded capability. This staged release pattern is common in nascent, safety-critical AI robotics work, where the risks and uncertainties of uncontrolled deployment could lead to unintended consequences or irresponsible usage if not properly contained.

The broader vision, however, is clear: a future in which robots can be deployed with a flexible cognitive backbone that can reason about tasks, select actions, call external data sources, and perform multi-step operations across a variety of physical devices. The emphasis on generalization across embodiments is intended to reduce the cost and friction of bringing new robotic capabilities online, enabling researchers and industry practitioners to scale up capabilities more rapidly and with fewer bespoke engineering efforts. If realized, this approach could shift the economics of robotics—from device-centric development toward platform-centric cognitive design, where a core set of models provides the intelligence for multiple machines.

Industry, society, and the evolving landscape of intelligent automation

The introduction of Gemini Robotics represents more than a technical milestone; it signals an inflection point for how robotics research intersects with advances in generative AI. The capacity to couple perception, reasoning, and action in a way that preserves generalization across devices opens up opportunities across manufacturing, logistics, service industries, health care, and home environments. For developers, the architecture offers a new paradigm for building robot software stacks: a reasoning module that can be trained once and deployed across a spectrum of hardware, paired with a robust action module that executes plans in real time. For organizations, the ability to transfer learned capabilities between robots can reduce deployment costs, accelerate experimentation, and enable more flexible work robots that can adapt to shifting operational needs.

Yet the path to widespread adoption will be shaped by safety, reliability, and governance considerations. The more capable robots become, the more important it is to ensure that their thinking processes are transparent, their decisions auditable, and their behavior aligned with human supervisors and policy constraints. The ongoing and cautious rollout—restricting direct robot control to trusted testers while opening planning and instruction generation tools through AI Studio—reflects a prudent approach to balancing innovation with risk management. In the near term, potential users can anticipate improvements in task adaptability, more resilient handling of dynamic environments, and a broader repertoire of activities that robots can perform with minimal customization.

Conclusion

In summary, Google DeepMind’s Gemini Robotics program represents a bold effort to fuse generative AI with embodied robotics in a way that aims to generalize capabilities across multiple hardware platforms. By pairing Gemini Robotics 1.5, the action-oriented model with vision-language-action capabilities, and Gemini Robotics-ER 1.5, the embodied reasoning model, the project seeks to create robots that can think through tasks, plan steps, and act with physical precision in real-world environments. The cross-embodiment learning capability is central to this effort, promising a more scalable path to multi-robot systems that share a common cognitive backbone while adapting to diverse devices and tasks. The staged deployment—ER planning tools available to developers via Google AI Studio, with the actual robot-control model restricted to trusted testers—reflects an emphasis on safety, reliability, and responsible innovation as the field advances toward more autonomous, agentic robotics.

As this research progresses, the robotics community and industry stakeholders will closely watch how the thinking-to-action loop holds up under real-world pressures: perception noise, dynamic obstacles, safety constraints, and the ethical implications of increasingly autonomous machines. If the Gemini approach fulfills its promises, it could redefine how robots are designed, deployed, and scaled across sectors, ushering in a new era where robotic systems are capable of flexible, goal-directed behavior with minimal bespoke programming. The coming years are likely to bring deeper integrations of planning, perception, and action—a move toward robotic intelligence that can adapt, reason, and collaborate in ways that are more closely aligned with human expectations and needs, while continuing to prioritize safety, governance, and responsible use.

Gemini Robotics marks a significant step in the ongoing evolution of AI-powered robotics. The project’s dual-model strategy, pairing an embodied reasoning component with an action-oriented controller that also reasons during execution, aims to create robots capable of generalizing tasks across different embodiments. This architecture promises faster deployment across diverse devices, more robust performance in complex environments, and the potential for a broader range of multi-stage, coordinated tasks. While practical, wide-scale deployment remains a work in progress, the approach sets a clear direction for how future robotic systems might be designed: with integrated perception, reasoning, and action that can adapt to changing goals and environments without extensive reprogramming. The road ahead will require careful attention to safety, ethics, and governance as these intelligent agents move from guarded tests toward broader real-world applications.