198 changes: 198 additions & 0 deletions SCROLLING_IMPLEMENTATION.md
@@ -0,0 +1,198 @@
# Scrolling Support Implementation

## Overview

This document describes the implementation of enhanced scrolling support for the self-operating-computer framework. The improvements enable the agent to better handle and reason about scrolling actions when interacting with interfaces that require scrolling to access content or controls.

## Features Implemented

### 1. Enhanced Prompt Templates

All three main system prompts have been updated with comprehensive scrolling guidance:

- **SYSTEM_PROMPT_STANDARD**: For basic coordinate-based interactions
- **SYSTEM_PROMPT_LABELED**: For labeled element interactions
- **SYSTEM_PROMPT_OCR**: For OCR text-based interactions

### 2. Scrolling Guidance Section

Each prompt now includes a dedicated "SCROLLING GUIDANCE" section that explains:

- **Available scrolling keys**:
  - `pagedown` / `down` for scrolling down
  - `pageup` / `up` for scrolling up
  - `end` for scrolling to bottom
  - `home` for scrolling to top

- **When to scroll**:
  - When elements are not visible on current screen
  - For long web pages, documents, or lists
  - When content appears cut off
  - For infinite scroll interfaces
  - When scroll bars indicate more content

### 3. Practical Examples

Multiple scrolling examples have been added to each prompt:

#### Standard Prompt Examples
- Scroll down to find submit button on long form
- Scroll up to find navigation menu

#### Labeled Prompt Examples
- Scroll down to find labeled submit button
- Scroll through list to find specific labeled content

#### OCR Prompt Examples
- Scroll down to find "Sign Up" button
- Navigate through long article content
- Scroll to bottom of form to find submit button
- Scroll to top to find navigation menu

### 4. Test Coverage

#### Enhanced Evaluation Tests
Added scrolling-specific test cases to `evaluate.py`:
- Google.com scrolling to find "I'm Feeling Lucky" button
- Wikipedia.org scrolling to find "Languages" section
- Long webpage scrolling to bottom
- Reddit.com scrolling through posts

#### Dedicated Test Suite
Created `test_scrolling.py` with comprehensive testing:
- **Unit tests**: Verify prompt content and key recognition
- **Integration tests**: Test scrolling in real scenarios
- **Evaluation framework**: Structured testing for different scroll types

#### Simple Validation Tests
Created `test_scrolling_simple.py` for basic validation without dependencies.

## Implementation Details

### Code Changes

1. **operate/models/prompts.py**
   - Added "SCROLLING GUIDANCE" sections to all three system prompts
   - Removed TODO comment about scrolling implementation
   - Added multiple practical scrolling examples
   - Enhanced "important notes" sections with scrolling considerations

2. **evaluate.py**
   - Extended TEST_CASES with scrolling-specific scenarios
   - Added test cases covering different scrolling use cases

3. **New test files**
   - `test_scrolling.py`: Comprehensive test suite
   - `test_scrolling_simple.py`: Basic validation tests
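
The scrolling examples added to `operate/models/prompts.py` write their JSON with doubled braces (`{{ ... }}`) next to an `{objective}` placeholder. A minimal sketch of why that matters, assuming the templates are filled in with Python's `str.format` (an inference from the placeholder syntax, not something this document states). `PROMPT_TEMPLATE` and `render_prompt` are hypothetical names used only for illustration:

```python
# Hypothetical, shortened stand-in for one of the system prompts.
PROMPT_TEMPLATE = """
Example: Scroll down to find a submit button
[
    {{ "thought": "Scroll down to look for the button", "operation": "press", "keys": ["pagedown"] }}
]

Objective: {objective}
"""


def render_prompt(objective: str) -> str:
    """Fill in the objective; str.format turns '{{' and '}}' into literal braces."""
    return PROMPT_TEMPLATE.format(objective=objective)


rendered = render_prompt("Submit the form")
print(rendered)
```

With single braces, `str.format` would try to treat the JSON object as a replacement field and raise an error, so any new example added to a prompt needs the same escaping.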

### Scrolling Key Mapping

The implementation uses standard keyboard scrolling keys:

```python
SCROLLING_KEYS = {
    "scroll_down": ["pagedown", "down"],
    "scroll_up": ["pageup", "up"],
    "scroll_to_bottom": ["end"],
    "scroll_to_top": ["home"],
}
```
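
One way such a mapping could be used is to classify a `press` action by its scrolling intent. The following is a sketch only: `KEY_TO_INTENT` and `scroll_intent` are hypothetical helpers and are not part of the changes described here.

```python
SCROLLING_KEYS = {
    "scroll_down": ["pagedown", "down"],
    "scroll_up": ["pageup", "up"],
    "scroll_to_bottom": ["end"],
    "scroll_to_top": ["home"],
}

# Reverse index (hypothetical helper): key name -> scrolling intent.
KEY_TO_INTENT = {
    key: intent for intent, keys in SCROLLING_KEYS.items() for key in keys
}


def scroll_intent(keys):
    """Return the scrolling intent for a single-key press, or None otherwise."""
    if len(keys) != 1:
        # Key combinations such as ["command", "space"] are not treated as scrolls.
        return None
    return KEY_TO_INTENT.get(keys[0].lower())


print(scroll_intent(["pagedown"]))
print(scroll_intent(["command", "space"]))
```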

### Example Usage

The agent can now handle scrolling scenarios like:

```json
[
  {
    "thought": "I need to find the submit button but don't see it. Let me scroll down",
    "operation": "press",
    "keys": ["pagedown"]
  },
  {
    "thought": "Perfect! Now I can see the submit button",
    "operation": "click",
    "x": "0.50",
    "y": "0.85"
  }
]
```
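
A response in this shape can be parsed with the standard `json` module. The sketch below separates scroll steps from other steps. The `is_scroll_action` helper and the `SCROLL_KEYS` set are assumptions made for illustration, and the thoughts are shortened:

```python
import json

# Keys listed in the scrolling guidance.
SCROLL_KEYS = {"pagedown", "pageup", "down", "up", "home", "end"}

response = """
[
  {"thought": "Scroll down to find the submit button", "operation": "press", "keys": ["pagedown"]},
  {"thought": "Now the submit button is visible", "operation": "click", "x": "0.50", "y": "0.85"}
]
"""


def is_scroll_action(action: dict) -> bool:
    """True when the action is a single-key press of one of the scrolling keys."""
    keys = action.get("keys", [])
    return (
        action.get("operation") == "press"
        and len(keys) == 1
        and keys[0] in SCROLL_KEYS
    )


actions = json.loads(response)
scroll_steps = [i for i, action in enumerate(actions) if is_scroll_action(action)]

# The click coordinates arrive as strings and need converting before use.
click_x = float(actions[1]["x"])
click_y = float(actions[1]["y"])
print(scroll_steps, click_x, click_y)
```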

## Scrolling Scenarios Covered

### 1. Form Navigation
- Long forms where submit buttons are below the fold
- Multi-step forms requiring scrolling between sections

### 2. Infinite Scroll Interfaces
- Social media feeds (Twitter, Instagram, Reddit)
- Product catalogs and search results
- News feeds and article lists
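
Because an infinite feed never reaches a bottom, a scroll-until-found loop needs an upper bound. The sketch below illustrates that idea under stated assumptions: `scroll_until`, its callbacks, and `FakePage` are hypothetical and only simulate a page. In the framework itself the agent decides step by step from screenshots.

```python
from typing import Callable


def scroll_until(
    is_visible: Callable[[], bool],
    press: Callable[[str], None],
    key: str = "pagedown",
    max_presses: int = 10,
) -> bool:
    """Press `key` until `is_visible()` is true, giving up after `max_presses`."""
    for _ in range(max_presses):
        if is_visible():
            return True
        press(key)
    # Final check after the last press.
    return is_visible()


class FakePage:
    """Simulated page: the target appears after a fixed number of presses."""

    def __init__(self, target_screen: int):
        self.screen = 0
        self.target_screen = target_screen
        self.presses = []

    def press(self, key: str) -> None:
        self.presses.append(key)
        self.screen += 1

    def target_visible(self) -> bool:
        return self.screen >= self.target_screen


page = FakePage(target_screen=3)
found = scroll_until(page.target_visible, page.press)

endless = FakePage(target_screen=100)
gave_up = not scroll_until(endless.target_visible, endless.press, max_presses=5)
print(found, len(page.presses), gave_up, len(endless.presses))
```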

### 3. Document Reading
- Long articles and documentation
- Wikipedia pages and technical documents
- Blog posts and content pages

### 4. Navigation Access
- Finding navigation menus at page top
- Accessing footer links and information
- Locating page controls and buttons

### 5. Search Results
- Scrolling through Google search results
- E-commerce product listings
- Directory and catalog browsing

## Testing Strategy

### 1. Unit Tests
- Verify scrolling guidance exists in all prompts
- Test scrolling key recognition
- Validate example content
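
A check of this kind can be sketched without importing the project. `SAMPLE_PROMPT` below is a shortened stand-in for a real system prompt, and `missing_scrolling_guidance` is a hypothetical helper. The actual tests in `test_scrolling.py` may be structured differently.

```python
# Snippets that the scrolling guidance is expected to contain.
REQUIRED_SNIPPETS = [
    "SCROLLING GUIDANCE",
    "WHEN TO SCROLL",
    '["pagedown"]',
    '["pageup"]',
    '["end"]',
    '["home"]',
]

# Shortened stand-in for one of the system prompts.
SAMPLE_PROMPT = """
SCROLLING GUIDANCE:
- Scroll down: `press` with keys `["pagedown"]` or `["down"]` (for smaller movements)
- Scroll up: `press` with keys `["pageup"]` or `["up"]` (for smaller movements)
- Scroll to bottom: `press` with keys `["end"]`
- Scroll to top: `press` with keys `["home"]`

WHEN TO SCROLL:
- When working with long web pages, documents, or lists
"""


def missing_scrolling_guidance(prompt: str) -> list:
    """Return the required snippets that do not appear in the prompt."""
    return [snippet for snippet in REQUIRED_SNIPPETS if snippet not in prompt]


print(missing_scrolling_guidance(SAMPLE_PROMPT))
```

Returning the list of missing snippets, rather than a bare boolean, makes a failing test report exactly which piece of guidance was dropped from a prompt.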

### 2. Integration Tests
- Test real scrolling scenarios
- Verify agent can complete scrolling objectives
- Test different scroll types and distances

### 3. Regression Tests
- Ensure existing functionality still works
- Verify no breaking changes to current operations
- Test backward compatibility

## Quality Assurance

### Code Quality
- Clear, descriptive scrolling examples
- Consistent formatting across all prompts
- Comprehensive documentation

### User Experience
- Intuitive scrolling behavior
- Scroll distance matched to the scenario: `pagedown` / `pageup` for fast navigation, `down` / `up` for precise control
- Clear guidance on when to scroll

### Performance
- Efficient scrolling operations
- Minimal impact on existing functionality
- Optimized for common scrolling patterns

## Future Enhancements

### Potential Improvements
1. **Smart Scrolling**: Detect optimal scroll distances based on content
2. **Scroll Position Memory**: Remember scroll positions across actions
3. **Advanced Scroll Types**: Support for horizontal scrolling, zoom scrolling
4. **Visual Scroll Indicators**: Better detection of scrollable areas

### Monitoring and Metrics
- Track scrolling success rates
- Monitor common scrolling patterns
- Measure impact on task completion times

## Conclusion

The scrolling support implementation improves the agent's ability to interact with web interfaces and applications whose content extends beyond the initial viewport. It does this through explicit prompt guidance, practical examples, and scrolling-specific test cases.

The implementation maintains backward compatibility: scrolling is expressed through the existing `press` operation, so the agent's action format does not change.
4 changes: 4 additions & 0 deletions evaluate.py
@@ -13,6 +13,10 @@
TEST_CASES = {
    "Go to Github.com": "A Github page is visible.",
    "Go to Youtube.com and play a video": "The YouTube video player is visible.",
    "Go to Google.com and scroll down to find the 'I'm Feeling Lucky' button": "Google's homepage is visible with the 'I'm Feeling Lucky' button shown on screen.",
    "Go to Wikipedia.org and scroll down to find the 'Languages' section": "Wikipedia homepage is visible with the Languages section displayed on screen.",
    "Go to a long webpage (like news.ycombinator.com) and scroll to the bottom": "The page is scrolled to show the bottom content, such as footer or pagination controls.",
    "Go to Reddit.com and scroll down to see more posts": "Reddit homepage is visible with multiple posts shown, indicating successful scrolling through the feed.",
}

EVALUATION_PROMPT = """
126 changes: 122 additions & 4 deletions operate/models/prompts.py
@@ -37,7 +37,22 @@

Return the actions in array format `[]`. You can take just one action or multiple actions.

Here a helpful example:
SCROLLING GUIDANCE:
When you need to scroll to find elements or content that are not currently visible on the screen, use the "press" operation with appropriate scrolling keys:

- Scroll down: `press` with keys `["pagedown"]` or `["down"]` (for smaller movements)
- Scroll up: `press` with keys `["pageup"]` or `["up"]` (for smaller movements)
- Scroll to bottom: `press` with keys `["end"]`
- Scroll to top: `press` with keys `["home"]`

WHEN TO SCROLL:
- If you cannot find a button, link, or element that should exist based on the objective
- When working with long web pages, documents, or lists
- If content appears to be cut off at the bottom or top of the screen
- When dealing with infinite scroll interfaces or paginated content
- If you see scroll bars indicating more content is available

Here are helpful examples:

Example 1: Searches for Google Chrome on the OS and opens it
```
@@ -57,10 +72,28 @@
]
```

Example 3: Scroll down to find a submit button on a long form
```
[
{{ "thought": "I can see a form on the page but don't see a submit button. I should scroll down to find it", "operation": "press", "keys": ["pagedown"] }},
{{ "thought": "Now I can see the submit button at the bottom of the form", "operation": "click", "x": "0.50", "y": "0.85" }}
]
```

Example 4: Scroll up to find navigation menu
```
[
{{ "thought": "I need to find the navigation menu which is likely at the top of the page. Let me scroll up", "operation": "press", "keys": ["home"] }},
{{ "thought": "Perfect, now I can see the navigation menu at the top", "operation": "click", "x": "0.20", "y": "0.15" }}
]
```

A few important notes:

- Go to Google Docs and Google Sheets by typing in the Chrome Address bar
- Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user.
- Always consider scrolling if you cannot find expected elements on the current view
- Use appropriate scrolling methods based on how much content you need to navigate

Objective: {objective}
"""
@@ -92,7 +125,23 @@
```
Return the actions in array format `[]`. You can take just one action or multiple actions.

Here a helpful example:
SCROLLING GUIDANCE:
When you need to scroll to find elements or content that are not currently visible on the screen, use the "press" operation with appropriate scrolling keys:

- Scroll down: `press` with keys `["pagedown"]` or `["down"]` (for smaller movements)
- Scroll up: `press` with keys `["pageup"]` or `["up"]` (for smaller movements)
- Scroll to bottom: `press` with keys `["end"]`
- Scroll to top: `press` with keys `["home"]`

WHEN TO SCROLL:
- If you cannot find a labeled element that should exist based on the objective
- When working with long web pages, documents, or lists
- If content appears to be cut off at the bottom or top of the screen
- When dealing with infinite scroll interfaces or paginated content
- If you see scroll bars indicating more content is available
- If the labeled elements visible don't include what you're looking for

Here are helpful examples:

Example 1: Searches for Google Chrome on the OS and opens it
```
@@ -119,16 +168,34 @@
]
```

Example 4: Scroll down to find a labeled submit button
```
[
{{ "thought": "I can see some labeled elements but no submit button. Let me scroll down to find more labeled elements", "operation": "press", "keys": ["pagedown"] }},
{{ "thought": "Great! Now I can see the submit button with label ~47", "operation": "click", "label": "~47" }}
]
```

Example 5: Scroll through a list to find specific content
```
[
{{ "thought": "I need to find a specific item in this list. Let me scroll down to see more options", "operation": "press", "keys": ["down"] }},
{{ "thought": "Still haven't found what I'm looking for, scrolling down more", "operation": "press", "keys": ["pagedown"] }},
{{ "thought": "Perfect, now I can see the item I was looking for with label ~23", "operation": "click", "label": "~23" }}
]
```

A few important notes:

- Go to Google Docs and Google Sheets by typing in the Chrome Address bar
- Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user.
- Always consider scrolling if the labeled elements visible don't include what you need
- After scrolling, new elements may be labeled with different IDs

Objective: {objective}
"""


# TODO: Add an example or instruction about `Action: press ['pagedown']` to scroll
SYSTEM_PROMPT_OCR = """
You are operating a {operating_system} computer, using the same operating system as a human.

@@ -155,7 +222,23 @@

Return the actions in array format `[]`. You can take just one action or multiple actions.

Here a helpful example:
SCROLLING GUIDANCE:
When you need to scroll to find elements or content that are not currently visible on the screen, use the "press" operation with appropriate scrolling keys:

- Scroll down: `press` with keys `["pagedown"]` or `["down"]` (for smaller movements)
- Scroll up: `press` with keys `["pageup"]` or `["up"]` (for smaller movements)
- Scroll to bottom: `press` with keys `["end"]`
- Scroll to top: `press` with keys `["home"]`

WHEN TO SCROLL:
- If you cannot find text to click that matches your objective
- When working with long web pages, documents, or lists
- If content appears to be cut off at the bottom or top of the screen
- When dealing with infinite scroll interfaces or paginated content
- If you see scroll bars indicating more content is available
- If the visible text doesn't include what you're looking for

Here are helpful examples:

Example 1: Searches for Google Chrome on the OS and opens it
```
@@ -184,13 +267,48 @@
]
```

Example 4: Scroll down to find a "Sign Up" button on a landing page
```
[
{{ "thought": "I need to find a 'Sign Up' button but don't see it on the current view. Let me scroll down to find it", "operation": "press", "keys": ["pagedown"] }},
{{ "thought": "Perfect! Now I can see the 'Sign Up' button", "operation": "click", "text": "Sign Up" }}
]
```

Example 5: Navigate through a long article to find specific content
```
[
{{ "thought": "I'm looking for information about pricing but it's not visible. This appears to be a long page, so I'll scroll down", "operation": "press", "keys": ["pagedown"] }},
{{ "thought": "Still looking for pricing information, continuing to scroll", "operation": "press", "keys": ["pagedown"] }},
{{ "thought": "Great! I found the pricing section. Now I'll click on the pricing link", "operation": "click", "text": "View Pricing" }}
]
```

Example 6: Scroll to bottom of a form to find submit button
```
[
{{ "thought": "I've filled out the visible form fields but need to find the submit button. Let me scroll to the bottom", "operation": "press", "keys": ["end"] }},
{{ "thought": "Perfect! Now I can see the submit button at the bottom of the form", "operation": "click", "text": "Submit" }}
]
```

Example 7: Scroll up to find navigation menu
```
[
{{ "thought": "I need to access the main navigation which should be at the top of the page. Let me scroll to the top", "operation": "press", "keys": ["home"] }},
{{ "thought": "Great! Now I can see the navigation menu. I'll click on About", "operation": "click", "text": "About" }}
]
```

A few important notes:

- Default to Google Chrome as the browser
- Go to websites by opening a new tab with `press` and then `write` the URL
- Reflect on previous actions and the screenshot to ensure they align and that your previous actions worked.
- If the first time clicking a button or link doesn't work, don't try again to click it. Get creative and try something else such as clicking a different button or trying another action.
- Don't respond saying you're unable to assist with requests. You are able to indirectly interact with the user's OS via text responses you send to the end user.
- Always consider scrolling if you cannot find the text you need to click
- Different scroll amounts (pagedown vs down) are useful for different situations - use pagedown for faster navigation, down for precise control

Objective: {objective}
"""