1X Technologies Humanoid Robot AI Voice Commands & Chaining Tasks

As robots become increasingly capable of performing a wide range of tasks autonomously, a key challenge is enabling them to handle long-horizon behaviors that require chaining together multiple skills. At 1X Technologies, engineers have been developing an autonomous model that merges multiple tasks into a single goal-conditioned neural network, an approach that could change how robots are programmed and controlled.


The Challenge of Multi-Task Models

One of the main challenges in developing multi-task models is the issue of forgetting. When a model is trained on multiple tasks, adding data to improve performance on one task can often negatively impact the performance on other tasks. This problem can be mitigated by increasing the model's parameter count, but this approach comes with its own set of challenges. Larger models take longer to train, which slows down the process of identifying the most effective demonstrations to gather for improving robot behavior.
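To make the tension concrete, a goal-conditioned policy is typically trained on a mixture of per-task demonstration datasets, and shifting the sampling weights toward one task is exactly the kind of change that can quietly degrade the others. The sketch below is a minimal PyTorch illustration; the network shape, task names, and weights are assumptions for the example, not 1X's actual setup.

```python
# Minimal sketch: multi-task training with mixture sampling (PyTorch).
# All shapes, task names, and weights are illustrative assumptions.
import random
import torch
import torch.nn as nn

OBS_DIM, GOAL_DIM, ACT_DIM = 64, 16, 8

class GoalConditionedPolicy(nn.Module):
    """One network for all tasks, conditioned on a goal embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + GOAL_DIM, 256), nn.ReLU(),
            nn.Linear(256, ACT_DIM),
        )

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=-1))

# Per-task demonstration datasets (random stand-ins here).
tasks = {
    "pick_cup":  [(torch.randn(OBS_DIM), torch.randn(GOAL_DIM), torch.randn(ACT_DIM)) for _ in range(100)],
    "open_door": [(torch.randn(OBS_DIM), torch.randn(GOAL_DIM), torch.randn(ACT_DIM)) for _ in range(100)],
}
# Upweighting one task to improve it (e.g. 0.9 / 0.1) is the kind of change
# that can silently hurt the other task -- the forgetting problem.
weights = {"pick_cup": 0.5, "open_door": 0.5}

policy = GoalConditionedPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

for step in range(1000):
    name = random.choices(list(tasks), weights=[weights[t] for t in tasks])[0]
    obs, goal, act = random.choice(tasks[name])
    loss = nn.functional.mse_loss(policy(obs, goal), act)
    opt.zero_grad()
    loss.backward()
    opt.step()
```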


To address this challenge, the researchers at 1X Technologies have developed a novel approach that decouples the ability to quickly improve task performance from the ability to merge multiple capabilities into a single neural network. This is achieved through a voice-controlled natural language interface that allows humans to chain short-horizon capabilities across multiple small models into longer, more complex tasks.
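One can imagine such an interface as a dispatcher that maps each recognized voice command to the small single-task policy that handles it, then runs the policies in the order the operator speaks them. The command strings and the skill stubs below are hypothetical, not 1X's actual command set:

```python
# Hypothetical sketch: routing voice commands to per-skill policies.
from typing import Callable, Dict, List

# Each short-horizon skill is its own small model behind a common interface;
# calling it would step the policy until a termination signal.
SkillPolicy = Callable[[], None]

def pick_cup() -> None:
    print("running pick_cup policy to completion")

def place_on_shelf() -> None:
    print("running place_on_shelf policy to completion")

def open_door() -> None:
    print("running open_door policy to completion")

SKILLS: Dict[str, SkillPolicy] = {
    "pick up the cup": pick_cup,
    "place it on the shelf": place_on_shelf,
    "open the door": open_door,
}

def run_chain(commands: List[str]) -> None:
    """Execute operator-issued commands in order, one short skill at a time."""
    for cmd in commands:
        skill = SKILLS.get(cmd.lower().strip())
        if skill is None:
            print(f"unknown command: {cmd!r}")
            continue
        skill()  # each skill starts from wherever the previous one ended

# The operator chains short skills into one long-horizon behavior:
run_chain(["pick up the cup", "place it on the shelf", "open the door"])
```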

By having humans direct the skill chaining process, the robot can accomplish long-horizon behaviors that would otherwise be difficult to achieve through autonomous means. This is because each successive skill in the chain must be able to generalize to the slightly random starting positions that result from the completion of the previous skill. As the chain grows longer, the variation in outcomes compounds, making it increasingly challenging for the robot to adapt.
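To make the compounding concrete: if each skill succeeds on, say, 95% of the start states the previous skill leaves behind, a chain of n skills succeeds only about 0.95^n of the time. The per-skill rate below is an illustrative assumption, not a measured figure:

```python
# Illustrative only: how per-skill reliability compounds over a chain.
p_skill = 0.95  # assumed success rate per skill on perturbed start states
for n in (1, 5, 10, 20):
    print(f"{n:2d} skills -> chain success ~ {p_skill ** n:.2f}")
# 1 -> ~0.95, 5 -> ~0.77, 10 -> ~0.60, 20 -> ~0.36
```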

Abstracting Away the Complexity

From the user's perspective, the robot appears to be capable of performing a wide range of natural language tasks, while the actual number of models controlling the robot remains hidden. This abstraction allows the team to gradually merge single-task models into goal-conditioned models over time, without disrupting the user experience.

Single-task models also serve as a valuable baseline for conducting shadow mode evaluations, which involve comparing the predictions of a new model to those of an existing baseline at test-time. Once the goal-conditioned model's predictions closely match those of the single-task models, the team can switch to a more powerful, unified model without altering the user workflow.
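A shadow-mode evaluation might be sketched as follows: the trusted single-task baseline's action is the one actually executed on the robot, while the candidate goal-conditioned model predicts in parallel and only its divergence from the baseline is logged. The divergence metric, threshold, and window size are illustrative assumptions:

```python
# Hypothetical shadow-mode evaluation sketch (NumPy).
import numpy as np

def shadow_step(obs, goal, baseline_policy, candidate_policy, log):
    """Execute the trusted baseline; score the candidate offline."""
    a_base = baseline_policy(obs)         # single-task model: drives the robot
    a_cand = candidate_policy(obs, goal)  # goal-conditioned model: shadow only
    log.append(float(np.mean((a_base - a_cand) ** 2)))
    return a_base  # only the baseline's action is ever executed

def ready_to_switch(log, threshold=1e-3, window=1000):
    """Once recent divergence stays small, the candidate can replace the
    baseline without changing the user-facing workflow."""
    return len(log) >= window and np.mean(log[-window:]) < threshold
```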

The high-level language interface used to direct robots also offers a new approach to data collection. Instead of relying on virtual reality to control a single robot, an operator can direct multiple robots using high-level language commands, while the low-level policies execute the actions needed to achieve those goals. The approach is efficient enough that operators can even direct robots remotely.
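In data-collection terms, one operator's console might amount to a queue of high-level commands fanned out to several robots, each running its own low-level policies. The following is a hypothetical sketch of that fan-out, including the robot names, not 1X's actual tooling:

```python
# Hypothetical: one operator directing several robots with language commands.
import queue
import threading

def robot_worker(robot_id: str, commands: "queue.Queue[str]") -> None:
    """Each robot pulls its next high-level command and executes it
    with its own low-level policies (stubbed out here)."""
    while True:
        cmd = commands.get()
        print(f"[{robot_id}] executing: {cmd}")  # low-level policy runs here
        commands.task_done()

robots = {rid: queue.Queue() for rid in ("robot-01", "robot-02", "robot-03")}
for rid, q in robots.items():
    threading.Thread(target=robot_worker, args=(rid, q), daemon=True).start()

# The operator (possibly remote) issues commands; robots act in parallel.
robots["robot-01"].put("pick up the cup")
robots["robot-02"].put("open the door")
for q in robots.values():
    q.join()  # wait for all issued commands to finish
```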

The Next Step: Automating High-Level Actions


While the current system still requires human intervention to dictate when robots should switch tasks, the next logical step is to automate the prediction of high-level actions using vision-language models like GPT-4o, VILA, and Gemini Vision. By building a dataset of vision-to-natural language command pairs, researchers can work towards developing fully autonomous robots capable of understanding and executing complex, multi-step tasks.
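Such a dataset could be as simple as paired records of what the robot saw and which command the human issued next, with the vision-language model trained or prompted to predict that next command. The field names and the `predict_next_command` stub below are assumptions for illustration, not an actual 1X or VLM API:

```python
# Hypothetical schema for vision-to-natural-language command pairs.
from dataclasses import dataclass
from typing import List

@dataclass
class CommandPair:
    image_path: str         # robot's camera frame at decision time
    history: List[str]      # high-level commands already executed
    next_command: str       # label: the command the human operator issued

# Collected passively while operators direct robots by voice:
example = CommandPair(
    image_path="frames/000123.jpg",
    history=["pick up the cup"],
    next_command="place it on the shelf",
)

def predict_next_command(image_path: str, history: List[str]) -> str:
    """Stub: a vision-language model (e.g. GPT-4o, VILA, Gemini) would take
    the frame plus command history and emit the next high-level command."""
    raise NotImplementedError  # replace with a call to your VLM of choice
```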

By combining cutting-edge machine learning techniques with intuitive, user-friendly interfaces, we are moving closer to a future in which robots can seamlessly integrate into our daily lives, assisting us with a wide range of tasks and making our world a better place.

