Self-Operating Computer Framework

A framework to enable multimodal models to operate a computer.

Using the same inputs and outputs as a human operator, the model views the screen and decides on a series of mouse and keyboard actions to reach an objective.
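The view, decide, act cycle described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual code: `Action`, `run_objective`, and the `capture`, `decide`, and `execute` callables are hypothetical names introduced here, and the screenshot is treated as opaque bytes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Action:
    """One step chosen by the model (hypothetical structure)."""
    kind: str                    # e.g. "click", "type", or "done"
    x: Optional[float] = None    # click position as a fraction of screen width
    y: Optional[float] = None    # click position as a fraction of screen height
    text: Optional[str] = None   # text to type, if any


def run_objective(
    objective: str,
    capture: Callable[[], bytes],
    decide: Callable[[str, bytes, List[Action]], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> List[Action]:
    """Repeat view -> decide -> act until the model reports "done"."""
    history: List[Action] = []
    for _ in range(max_steps):
        screenshot = capture()                            # view the screen
        action = decide(objective, screenshot, history)   # model picks the next action
        if action.kind == "done":
            break
        execute(action)                                   # perform the mouse or keyboard action
        history.append(action)
    return history
```

In a real setup, `capture` would take a screenshot, `decide` would call a multimodal model, and `execute` would drive the mouse and keyboard; here they are left as parameters so the loop can be exercised with stubs.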

Key Features

Compatibility: Designed for various multimodal models.
Integration: Currently integrated with GPT-4V as the default model, with extended support for Gemini Pro Vision.
Future Plans: Support for additional models.

Current Challenges

Note: GPT-4V's error rate in estimating XY mouse click locations is currently quite high. This framework aims to track the progress of multimodal models over time, aspiring to achieve human-level performance in computer operation.
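One way to make "error rate" concrete is to compare predicted click positions with known targets. The sketch below is an assumption about how such a measurement could look, not the project's evaluation method: `click_error`, `error_rate`, the normalized (0 to 1) coordinate convention, and the 0.05 tolerance are all hypothetical.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]  # (x, y) as fractions of screen width and height


def click_error(predicted: Point, target: Point) -> float:
    """Euclidean distance between a predicted click and its target."""
    return math.hypot(predicted[0] - target[0], predicted[1] - target[1])


def error_rate(pairs: Iterable[Tuple[Point, Point]], tolerance: float = 0.05) -> float:
    """Fraction of (predicted, target) pairs farther apart than `tolerance`."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (predicted, target) pair")
    misses = sum(1 for p, t in pairs if click_error(p, t) > tolerance)
    return misses / len(pairs)
```

Because the coordinates are normalized per axis, the same distance covers more pixels horizontally than vertically on a wide screen; a pixel-based variant would avoid that at the cost of depending on resolution.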

Ongoing Development

At HyperWriteAI, we are developing Agent-1-Vision, a multimodal model with more accurate click location predictions.

Agent-1-Vision Model API Access

We will soon be offering API access to our Agent-1-Vision model.

If you're interested in gaining access to this API, sign up here.

Demo

Demo video: final-low.mp4

Run `Self-Operating Computer`

Install the project

pip install self-operating-computer

Run the project

operate

Enter your OpenAI key: If you don't have one, you can obtain an OpenAI key here.

Give the Terminal app the required permissions: As a last step, the Terminal app will ask for "Screen Recording" and "Accessibility" permissions on the "Security & Privacy" page of the Mac's "System Preferences".

Alternative installation with `.sh`

Clone the repo to a directory on your computer:

git clone https://github.com/OthersideAI/self-operating-computer.git

Change into the directory:

cd self-operating-computer

Run the installation script:

./run.sh

Using `operate` Modes

Multimodal Models `-m`

An additional model is now compatible with the Self-Operating Computer Framework. Try Google's gemini-pro-vision by following the instructions below.

Start operate with the Gemini model

operate -m gemini-pro-vision

Enter your Google AI Studio API key when the terminal prompts you for it. If you don't have one, you can obtain a key here after setting up your Google AI Studio account. You may also need to authorize credentials for a desktop application. It took me a bit of time to get this working; if anyone knows a simpler way, please make a PR.

Set-of-Mark Prompting `-m gpt-4-with-som`

The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting through the `-m gpt-4-with-som` option. This visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper: here.
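The idea behind SoM prompting can be illustrated with a short sketch: detected elements are given numeric marks, the model answers with a mark instead of raw coordinates, and the mark is mapped back to a click point. The code below is illustrative only and is not the framework's implementation; `Box`, `assign_marks`, and `mark_to_click` are hypothetical names, and the boxes are assumed to come from some detector in pixel coordinates.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def assign_marks(boxes: List[Box]) -> Dict[str, Box]:
    """Label boxes "1", "2", ... ordered by top edge, then left edge."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {str(i): box for i, box in enumerate(ordered, start=1)}


def mark_to_click(marks: Dict[str, Box], label: str) -> Tuple[float, float]:
    """Return the center of the box carrying `label`."""
    if label not in marks:
        raise KeyError(f"unknown mark: {label!r}")
    left, top, right, bottom = marks[label]
    return ((left + right) / 2, (top + bottom) / 2)
```

The point of the indirection is that the model only has to pick a label drawn on the screenshot, which sidesteps the need to estimate pixel coordinates directly.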

For this initial version, a simple YOLOv8 model was trained for button detection, and the best.pt file is included under model/weights/. Users are encouraged to swap in their own best.pt file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
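Swapping in a different weights file can be done by hand; the helper below is just one cautious way to do it, keeping a backup of the shipped file. `swap_weights` and the `best.pt.bak` backup name are hypothetical, and the caller is assumed to pass the actual location of the model/weights/ directory in their checkout or installation.

```python
import shutil
from pathlib import Path


def swap_weights(candidate: Path, weights_dir: Path) -> Path:
    """Copy `candidate` over best.pt in `weights_dir`, backing up any existing file."""
    if not candidate.is_file():
        raise FileNotFoundError(candidate)
    weights_dir.mkdir(parents=True, exist_ok=True)
    target = weights_dir / "best.pt"
    if target.exists():
        # Keep the previous weights so the change is easy to undo.
        shutil.copy2(target, weights_dir / "best.pt.bak")
    shutil.copy2(candidate, target)
    return target
```

Restoring the original is then a matter of copying `best.pt.bak` back over `best.pt`.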

Start operate with the SoM model

operate -m gpt-4-with-som

Voice Mode `--voice`

The framework supports voice input for the objective. Try voice mode by following the instructions below.

Clone the repo to a directory on your computer:

git clone https://github.com/OthersideAI/self-operating-computer.git

Change into the directory:

cd self-operating-computer

Install the additional requirements from requirements-audio.txt:

pip install -r requirements-audio.txt

Install the device requirements.

For Mac users:

brew install portaudio

For Linux users:

sudo apt install portaudio19-dev python3-pyaudio

Run with voice mode

operate --voice

Contributions Are Welcome!

If you want to contribute yourself, see CONTRIBUTING.md.

Feedback

For any input on improving this project, feel free to reach out to Josh on Twitter.

Join Our Discord Community

For real-time discussions and community support, join our Discord server.

If you're already a member, join the discussion in #self-operating-computer.
If you're new, first join our Discord server and then navigate to #self-operating-computer.

Follow HyperWriteAI for More Updates

Stay updated with the latest developments:

Follow HyperWriteAI on Twitter.
Follow HyperWriteAI on LinkedIn.

Compatibility

This project is compatible with macOS, Windows, and Linux (with an X server installed).

OpenAI Rate Limiting Note

The gpt-4-vision-preview model is required. To unlock access to this model, your account needs to have spent at least $5 in API credits. Pre-paying for these credits will unlock access if you haven't already spent the minimum $5.
Learn more here.
