Late nights with a newborn can lead to unexpected breakthroughs. Such was the case for OthersideAI developer Josh Bickett, who had an idea for a groundbreaking new “self-operating computer framework” while feeding his daughter in the middle of the night.
As Bickett explained to VentureBeat, “I’ve been really enjoying time with my daughter, who’s now four weeks old, and I had a lot of new lessons in fatherhood and all that stuff. But I also had a little bit of time, and this idea kind of came to me because I saw different demos of GPT-4 vision. The thing we’re working on now can actually happen with GPT-4 vision.”
With his daughter cradled in one arm, Bickett sketched out the basic framework on his computer. “I just found an initial implementation…it’s not super good at clicking the mouse in the right way. But what we’re doing is defining the problem: we need to figure out how to operate a computer.”
When OthersideAI co-founder and CEO Matt Shumer saw the new framework, he recognized its tremendous potential. As Shumer told VentureBeat, “This is a milestone on the road to getting to the equivalent of a self-driving car, but for a computer. We have the sensors now. We have the LIDAR systems. Next, we build the intelligence.”
An AI that decides where and what to click on your PC
As Bickett described, the framework “lets the AI control both the mouse, where it clicks, and all the keyboard triggers, essentially. It’s like an agent like AutoGPT, except it’s not text-based. It’s vision-based, so it takes a screenshot of the computer and then it decides mouse clicks and keyboard inputs, exactly like a person would.”
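In concrete terms, the loop Bickett describes could look something like the sketch below. This is illustrative only, not the project’s actual code: it assumes the OpenAI Python SDK (v1+), the PyAutoGUI library for screen capture and input control, and an OPENAI_API_KEY in the environment, and the JSON action schema in the prompt is invented for the example.

```python
# Minimal sketch of the screenshot -> vision model -> action loop.
# Illustrative only; not the self-operating computer framework's code.
import base64
import io
import json

import pyautogui
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def capture_screen_b64() -> str:
    """Grab the current screen as a base64-encoded PNG."""
    buf = io.BytesIO()
    pyautogui.screenshot().save(buf, format="PNG")
    return base64.b64encode(buf.getvalue()).decode()


def step(goal: str) -> None:
    """One iteration: screenshot in, a single mouse or keyboard action out."""
    # The action schema below is a hypothetical convention for this sketch.
    prompt = (
        "You operate a computer. Given this screenshot, reply with only a "
        'JSON object, either {"action": "click", "x": <px>, "y": <px>} or '
        '{"action": "type", "text": "..."}, that makes progress on: ' + goal
    )
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # any vision model could slot in here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {
                    "url": f"data:image/png;base64,{capture_screen_b64()}"}},
            ],
        }],
        max_tokens=300,
    )
    # A real agent would parse defensively; models may wrap JSON in prose.
    action = json.loads(response.choices[0].message.content)
    if action["action"] == "click":
        pyautogui.click(action["x"], action["y"])  # move and click the mouse
    elif action["action"] == "type":
        pyautogui.write(action["text"])            # send keystrokes


step("open a new browser tab")
```

A production agent would call `step` in a loop until the goal is met, and would validate the model’s reply far more carefully before clicking anything.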
Shumer elaborated on how this framework represents a major advance over previous approaches that relied solely on APIs.
“A lot of things that people do on computers, right, you can’t really do with APIs, which is how a lot of other people are approaching this problem [when] they want to build an agent. They build it on top of the publicly available APIs for a service, but that doesn’t extend to everything.” As Shumer asserted, “If you truly want to solve something that is autonomous [and] can actually help us or get more done, you have to allow it to work like a person, because the world is built for people.”
The framework takes screenshots as input and outputs mouse clicks and keyboard commands, just as a human would. But as both Bickett and Shumer acknowledged, the real potential lies not in the lightweight framework itself, but in the advanced computer vision and reasoning models that can be plugged into it. “The framework will just be like plug and play: you just plug in a better model and it gets better,” said Bickett.
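That plug-and-play point can be made concrete with a small interface sketch; the names here are hypothetical, not the framework’s actual API. The execution loop stays fixed, and anything that maps a screenshot to an action can be dropped in as the model:

```python
# Hypothetical sketch of the "plug and play" design: the execution loop is
# fixed, and any vision model implementing `decide` can be swapped in.
import io
from dataclasses import dataclass
from typing import Protocol

import pyautogui


@dataclass
class Action:
    """One step the framework can execute on the operating system."""
    kind: str        # "click" or "type"
    x: int = 0       # pixel coordinates for clicks
    y: int = 0
    text: str = ""   # keystrokes for typing


class VisionAgent(Protocol):
    """Anything that turns a screenshot into the next action qualifies."""
    def decide(self, screenshot_png: bytes, goal: str) -> Action: ...


def run(agent: VisionAgent, goal: str, max_steps: int = 20) -> None:
    """The loop never changes; only `agent` does as better models arrive."""
    for _ in range(max_steps):
        buf = io.BytesIO()
        pyautogui.screenshot().save(buf, format="PNG")
        action = agent.decide(buf.getvalue(), goal)
        if action.kind == "click":
            pyautogui.click(action.x, action.y)
        elif action.kind == "type":
            pyautogui.write(action.text)
```

Under this reading, swapping today’s vision model for a stronger future one amounts to writing one new `decide` implementation, which is the sense in which the framework “gets better” without itself changing.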
How AI agents will change computing as we know it
When asked by VentureBeat about the future implications, Shumer painted a bold vision: “Once this thing is sufficiently reliable, it is going to be your computer, it is going to be your interface to the digital world.”
With the self-operating computer framework in place, advanced AI models could learn to take over all computer interactions just through conversational commands.
As Shumer predicted, different types of specialized computer agent models will likely emerge to handle different tasks.
Some may focus on speed for simpler tasks, while others excel at complex reasoning. Models may also vary for enterprise vs. consumer use cases. But the overarching goal, according to Shumer, is to develop agents that enable a world “where people can say, this is what I hate doing. Now, I don’t have to do it anymore. And we want to make it so damn easy that somebody who can barely use a computer from the beginning can do it.”
Open source to fuel development
Bickett believes the open source nature of the framework will further accelerate progress, allowing developers worldwide to experiment with new applications. Shumer agreed there is “room for a lot of players in this space…a range of model providers. A range of applications. And there are going to be a lot of spaces in this industry to build really really big businesses.”
While Bickett and Shumer see enormous potential, realizing the vision of truly intelligent computer agents will require immense resources and continued innovation.
To that end, AI research company Imbue, formerly known as Generally Intelligent, recently secured a $150 million partnership with Dell to build a powerful AI training platform.
The massive cluster of around 10,000 Nvidia H100 GPUs will allow Imbue to develop new foundation models optimized specifically for reasoning abilities, a key focus of their work. As Imbue co-founder and CEO Kanjun Qiu noted, “reasoning is the core blocker to agents that work really well.”
Imbue believes robust reasoning is paramount for developing truly effective AI agents, as it allows machines to handle uncertainty, adapt approaches, gather new information, make complex decisions, and grapple with real-world complexities – abilities crucial for functioning autonomously beyond narrow tasks.
The company adopts a “full stack” methodology encompassing optimized foundation model training, experimental agent and interface prototyping, robust tool-building, and theoretical AI research, aiming to advance both the practical and fundamental understanding of deep learning with the goal of engineering AI capable of human-level reasoning and, eventually, artificial general intelligence.
While the self-operating computer framework is just the first step, Bickett and Shumer see it ushering in a new era in which sophisticated AI agents replace manual computer interaction entirely. Late nights may keep yielding paradigm-shifting ideas, but it will take focused work to realize the full vision of computers that just work, for anyone, anywhere, through ordinary language alone.