Earlier this week, Google DeepMind released Gemini Robotics-ER 1.6, a new vision-language model that helps robots make sense of their surroundings. To show off its capabilities, Boston Dynamics, which has an agreement to use Gemini in its humanoid robots, published a video of its robot dogs using the model to read a thermometer during an inspection of an industrial facility.
Despite the eye-catching demos, Google's own benchmarks show the new robotics model making only incremental gains over previous models at judging, from a single camera feed, whether it has finished a task. When it takes in multiple camera feeds, though, the model showed a clearer improvement. That matters, Google says, because many robotics setups today, such as those in factories and warehouses, use several camera views at once: an overhead camera, say, plus a camera mounted on the robot's arm. The robot must fuse all of those views into a coherent understanding of what it's doing so it can recognize when the task is complete.
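To make the multi-camera idea concrete, here is a minimal sketch of how a developer might pose a task-completion check to a vision-language model through the Gemini API, using Google's google-genai Python SDK. The model ID, file names, and prompt below are illustrative assumptions for the sake of the example, not details drawn from Google's announcement.

```python
# Minimal sketch: asking a vision-language model whether a task is
# complete, given two simultaneous camera views of one workspace.
# Assumes the google-genai SDK (pip install google-genai) and a
# GEMINI_API_KEY set in the environment. The model ID, image files,
# and task prompt are illustrative, not confirmed details.
from google import genai
from PIL import Image

client = genai.Client()  # picks up GEMINI_API_KEY from the environment

# Two views of the same workstation: an overhead camera and a
# camera mounted on the robot's arm.
overhead_view = Image.open("overhead_cam.jpg")
arm_view = Image.open("arm_cam.jpg")

response = client.models.generate_content(
    model="gemini-robotics-er-1.5-preview",  # assumed model ID
    contents=[
        overhead_view,
        arm_view,
        "These are two live views of the same workstation. "
        "The task is: place the red bin on the top shelf. "
        "Considering both views together, has the task been "
        "completed? Answer YES or NO, then explain briefly.",
    ],
)
print(response.text)
```

Sending both frames in a single request is what gives the model a chance to reconcile the views: the arm-mounted camera alone might be blocked by the gripper at the crucial moment, while the overhead camera alone might not show whether the object is actually seated on the shelf.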