Start `operate` with the Gemini model

```
operate -m gemini-pro-vision
```
**Enter your Google AI Studio API key when the terminal prompts you for it.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need to [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get this working; if anyone knows a simpler way, please make a PR.
### Locally Hosted LLaVA Through Ollama
If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
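
As a rough sketch (assuming Ollama is already installed and running locally), pull the LLaVA model and then start `operate` with it:

```
ollama pull llava
operate -m llava
```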
Learn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama)
### Optical Character Recognition Mode `-m gpt-4-with-ocr`
The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements and their coordinates. GPT-4 can decide to `click` an element by its text, and the code then looks that text up in the hash map to get the coordinates of the element GPT-4 wants to click.
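
This is easiest to picture with a small, hypothetical sketch. The function names and the OCR output format below are illustrative assumptions, not the framework's actual code:

```python
# Illustrative sketch only: build a text -> coordinates hash map from OCR results,
# then resolve a model-issued "click by text" action into screen coordinates.

def build_click_map(ocr_results):
    """ocr_results: iterable of (text, (left, top, width, height)) tuples from any OCR engine."""
    click_map = {}
    for text, (left, top, width, height) in ocr_results:
        # Key each detected element by its normalized text, store the element's center point.
        click_map[text.strip().lower()] = (left + width / 2, top + height / 2)
    return click_map

def resolve_click(click_map, target_text):
    """Look up the coordinates of the element the model asked to click."""
    return click_map.get(target_text.strip().lower())

# Example: the model decides to click the element labeled "Submit".
ocr_results = [("Submit", (100, 200, 80, 30)), ("Cancel", (200, 200, 80, 30))]
click_map = build_click_map(ocr_results)
print(resolve_click(click_map, "Submit"))  # -> (140.0, 215.0)
```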
Based on recent tests, OCR performs better than `som` and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, you can simply run `operate`; `operate -m gpt-4-with-ocr` will also work.
### Set-of-Mark Prompting `-m gpt-4-with-som`
The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.

Learn more about SoM Prompting in the detailed arXiv paper: [here](https://arxiv.org/abs/2310.11441).

For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
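
If you want to sanity-check a replacement `best.pt` before opening a PR, one rough way is to run the weights directly with the `ultralytics` YOLOv8 package. This snippet is purely illustrative and not part of the framework; the screenshot path and confidence threshold are placeholders:

```python
# Illustrative only: inspect how a candidate best.pt detects buttons on a screenshot.
from ultralytics import YOLO

model = YOLO("model/weights/best.pt")         # swap in your own best.pt here
results = model.predict("screenshot.png", conf=0.25)

for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()     # bounding box corners
    print(f"button candidate at ({x1:.0f}, {y1:.0f})-({x2:.0f}, {y2:.0f}), confidence {float(box.conf):.2f}")
```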
Start `operate` with the SoM model
```
operate -m gpt-4-with-som
```
### Voice Mode `--voice`
The framework supports voice inputs for the objective. Try voice by following the instructions below.

**Clone the repo** to a directory on your computer: