
Commit 0e4f965

Move up LLaVA in README.MD
1 parent 3176711 commit 0e4f965

File tree: README.md

1 file changed (+21, -21 lines)

README.md

Lines changed: 21 additions & 21 deletions
@@ -86,27 +86,7 @@ Start `operate` with the Gemini model
 operate -m gemini-pro-vision
 ```
 
-**Enter your Google AI Studio API key when terminal prompts you for it** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working, if anyone knows a simpler way, please make a PR:
-
-### Optical Character Recognition Mode `-m gpt-4-with-ocr`
-The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements by coordinates. GPT-4 can decide to `click` elements by text and then the code references the hash map to get the coordinates for that element GPT-4 wanted to click.
-
-Based on recent tests, OCR performs better than `som` and vanilla GPT-4 so we made it the default for the project. To use the OCR mode you can simply write:
-
-`operate` or `operate -m gpt-4-with-ocr` will also work.
-
-### Set-of-Mark Prompting `-m gpt-4-with-som`
-The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
-
-Learn more about SoM Prompting in the detailed arXiv paper: [here](https://arxiv.org/abs/2310.11441).
-
-For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
-
-Start `operate` with the SoM model
-
-```
-operate -m gpt-4-with-som
-```
+**Enter your Google AI Studio API key when the terminal prompts you for it.** If you don't have one, you can obtain a key [here](https://makersuite.google.com/app/apikey) after setting up your Google AI Studio account. You may also need to [authorize credentials for a desktop application](https://ai.google.dev/palm_docs/oauth_quickstart). It took me a bit of time to get it working; if anyone knows a simpler way, please make a PR.
 
 ### Locally Hosted LLaVA Through Ollama
 If you wish to experiment with the Self-Operating Computer Framework using LLaVA on your own machine, you can with Ollama!
@@ -133,6 +113,26 @@ operate -m llava
 
 Learn more about Ollama at its [GitHub Repository](https://www.github.com/ollama/ollama)
 
+### Optical Character Recognition Mode `-m gpt-4-with-ocr`
+The Self-Operating Computer Framework now integrates Optical Character Recognition (OCR) capabilities with the `gpt-4-with-ocr` mode. This mode gives GPT-4 a hash map of clickable elements and their screen coordinates. GPT-4 can decide to `click` an element by its text, and the code then looks up that text in the hash map to get the coordinates of the element it wants to click.
+
+Based on recent tests, OCR performs better than `som` and vanilla GPT-4, so we made it the default for the project. To use the OCR mode, simply run:
+
+`operate` or `operate -m gpt-4-with-ocr` (either will work).
+
+### Set-of-Mark Prompting `-m gpt-4-with-som`
+The Self-Operating Computer Framework now supports Set-of-Mark (SoM) Prompting with the `gpt-4-with-som` command. This new visual prompting method enhances the visual grounding capabilities of large multimodal models.
+
+Learn more about SoM Prompting in the detailed arXiv paper [here](https://arxiv.org/abs/2310.11441).
+
+For this initial version, a simple YOLOv8 model is trained for button detection, and the `best.pt` file is included under `model/weights/`. Users are encouraged to swap in their own `best.pt` file to evaluate performance improvements. If your model outperforms the existing one, please contribute by creating a pull request (PR).
+
+Start `operate` with the SoM model:
+
+```
+operate -m gpt-4-with-som
+```
+
 ### Voice Mode `--voice`
 The framework supports voice inputs for the objective. Try voice by following the instructions below.
 **Clone the repo** to a directory on your computer:
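For readers trying the OCR mode described in the added section above, below is a minimal sketch of the idea behind the text-to-coordinates hash map. The function names and the sample OCR output are illustrative assumptions for this example, not the framework's actual code.

```
# Illustrative sketch only: the names and the sample OCR output below are
# assumptions for this example, not the framework's actual implementation.

def build_click_map(ocr_results):
    """Map each recognized text string to the center of its bounding box."""
    click_map = {}
    for text, (left, top, width, height) in ocr_results:
        click_map[text.strip().lower()] = (left + width / 2, top + height / 2)
    return click_map

def resolve_click(click_map, target_text):
    """Look up the coordinates for the element the model asked to click."""
    return click_map.get(target_text.strip().lower())

# Made-up OCR output: (text, (left, top, width, height)) in screen pixels.
ocr_results = [("Submit", (100, 200, 80, 30)), ("Cancel", (200, 200, 80, 30))]
click_map = build_click_map(ocr_results)
print(resolve_click(click_map, "Submit"))  # -> (140.0, 215.0)
```

In the framework itself the OCR results come from a screenshot of the screen; the sketch only shows the lookup step that turns a model's text-based `click` decision into coordinates.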
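Similarly, the SoM section above invites users to swap in their own `best.pt`. Here is a rough sketch of how a replacement YOLOv8 button detector could be sanity-checked before opening a PR, assuming the `ultralytics` package is installed and using an illustrative `screenshot.png`; neither detail is prescribed by the README.

```
# Rough sanity check for a replacement button-detection model. Assumes the
# `ultralytics` package is installed; `screenshot.png` is an illustrative path.
from ultralytics import YOLO

model = YOLO("model/weights/best.pt")  # the weights file the SoM mode ships with
results = model("screenshot.png")      # run detection on a sample screenshot

for box in results[0].boxes:
    x1, y1, x2, y2 = box.xyxy[0].tolist()  # bounding box in pixels
    print(f"button ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f}) conf={float(box.conf[0]):.2f}")
```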
