feat: Enhance smartscraper with OpenAPI spec compliance - add HTML/Markdown processing and new parameters
## New Features
Based on the official ScrapeGraphAI OpenAPI specification, smartscraper now supports:
### Three Input Modes (Mutually Exclusive)
1. **URL-based scraping** (website_url) - Scrape live websites
2. **HTML processing** (website_html) - Process local/pre-fetched HTML (max 2MB)
3. **Markdown processing** (website_markdown) - Extract from markdown documents (max 2MB)
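The three modes above can be pictured as three request shapes. This is an illustrative sketch only: the top-level key names mirror the parameter names listed here, but the exact wire format is an assumption, not the actual client code.

```python
# Illustrative request shapes for the three mutually exclusive input modes.
url_mode = {
    "user_prompt": "Extract the page title",
    "website_url": "https://example.com",  # mode 1: scrape a live site
}
html_mode = {
    "user_prompt": "Extract the page title",
    "website_html": "<html><h1>Hello</h1></html>",  # mode 2: pre-fetched HTML (max 2MB)
}
markdown_mode = {
    "user_prompt": "Extract the page title",
    "website_markdown": "# Hello",  # mode 3: markdown document (max 2MB)
}
# Exactly one of website_url / website_html / website_markdown may be set per request.
```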
### New Parameters
- **output_schema**: JSON schema for structured output definition
- **total_pages**: Pagination support (1-100 pages, default 1)
- **render_heavy_js**: Heavy JavaScript rendering for SPAs (default false)
- **stealth**: Stealth mode to avoid bot detection (default false)
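A sketch of a request combining the new parameters, assuming a plain-dict call style; the parameter names and ranges come from the list above, while the schema content itself is a made-up example.

```python
# Hypothetical request exercising the new parameters together.
request = {
    "user_prompt": "Extract all product names and prices",
    "website_url": "https://example.com/products",
    # output_schema: a JSON schema describing the desired structured output
    "output_schema": {
        "type": "object",
        "properties": {
            "products": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "name": {"type": "string"},
                        "price": {"type": "number"},
                    },
                },
            }
        },
    },
    "total_pages": 3,         # paginate across 3 pages (1-100 allowed, default 1)
    "render_heavy_js": True,  # wait for client-side rendering (default false)
    "stealth": False,         # stealth mode off (default)
}
```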
### Removed Parameters
- **markdown_only**: Removed in favor of the new markdown input mode
## Implementation Details
### Client Method Updates
- Updated `ScapeGraphClient.smartscraper()` method signature
- Added mutually exclusive validation for input sources
- All new parameters properly passed to API endpoint
- Maintains backward compatibility with existing website_url usage
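The mutually exclusive validation and the 2MB inline-content limit described above could look roughly like the following; `validate_inputs` and `MAX_INLINE_BYTES` are hypothetical names for illustration, not the actual client code.

```python
MAX_INLINE_BYTES = 2 * 1024 * 1024  # 2MB cap on website_html / website_markdown

def validate_inputs(website_url=None, website_html=None, website_markdown=None):
    """Hypothetical stand-in for the mutually exclusive input validation."""
    provided = [v for v in (website_url, website_html, website_markdown) if v is not None]
    if len(provided) != 1:
        raise ValueError(
            "Exactly one of website_url, website_html, or website_markdown is required"
        )
    for content in (website_html, website_markdown):
        if content is not None and len(content.encode("utf-8")) > MAX_INLINE_BYTES:
            raise ValueError("Inline content exceeds the 2MB limit")

validate_inputs(website_url="https://example.com")  # passes: exactly one source
try:
    validate_inputs(website_url="https://example.com", website_html="<p>x</p>")
except ValueError as exc:
    print(exc)  # two sources given, so the call is rejected
```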
### MCP Tool Updates
- Enhanced tool signature with all new parameters
- Comprehensive parameter descriptions for better UX
- JSON schema parsing for output_schema (accepts dict or JSON string)
- Clear documentation of mutually exclusive input modes
- Examples for each parameter
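The dict-or-JSON-string handling of `output_schema` mentioned above can be sketched as follows; `parse_output_schema` is a hypothetical helper name, shown only to illustrate the accepted input types.

```python
import json

def parse_output_schema(output_schema):
    """Hypothetical helper mirroring the dict-or-JSON-string behavior."""
    if output_schema is None:
        return None
    if isinstance(output_schema, dict):
        return output_schema  # already parsed
    if isinstance(output_schema, str):
        try:
            parsed = json.loads(output_schema)
        except json.JSONDecodeError as exc:
            raise ValueError(f"output_schema is not valid JSON: {exc}") from exc
        if not isinstance(parsed, dict):
            raise ValueError("output_schema JSON must decode to an object")
        return parsed
    raise TypeError("output_schema must be a dict or a JSON string")

assert parse_output_schema('{"type": "object"}') == {"type": "object"}
```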
## Use Cases Enabled
1. **Process pre-fetched HTML**: Useful for cached content or generated HTML
2. **Extract from markdown**: Process documentation or markdown exports
3. **Pagination**: Handle multi-page content automatically
4. **JavaScript-heavy sites**: Better support for SPAs and dynamic content
5. **Stealth scraping**: Bypass bot detection mechanisms
6. **Structured output**: Define exact output format with JSON schema
## Backward Compatibility
Existing usage with website_url continues to work without changes.
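One way to see why this holds: every new parameter is optional with a safe default, so a URL-only call resolves to the same behavior as before. The helper below is purely illustrative of that default-filling, not the actual signature.

```python
def smartscraper_defaults(**kwargs):
    """Hypothetical view of the updated signature's defaults, for illustration."""
    defaults = {
        "website_html": None,
        "website_markdown": None,
        "output_schema": None,
        "number_of_scrolls": 0,
        "total_pages": 1,
        "render_heavy_js": False,
        "stealth": False,
    }
    return {**defaults, **kwargs}

# A pre-existing URL-only call needs no changes:
call = smartscraper_defaults(
    user_prompt="Extract the article title and author",
    website_url="https://example.com/blog/post",
)
```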
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
## Docstring Changes
The `smartscraper` tool docstring, before and after:
```diff
-    Extract structured data from a webpage using AI-powered extraction.
+    Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

     This tool uses advanced AI to understand your natural language prompt and extract specific
-    structured data from any webpage. Ideal for extracting product information, contact details,
-    article metadata, or any structured content. Supports infinite scrolling for dynamic content.
-    Costs 10 credits per page. Read-only operation with no side effects.
+    structured data from web content. Supports three input modes: URL scraping, local HTML processing,
+    or local markdown processing. Ideal for extracting product information, contact details,
+    article metadata, or any structured content. Costs 10 credits per page. Read-only operation.

     Args:
-        user_prompt: Natural language instructions describing what data to extract from the webpage. Be specific about the fields you want. Example: 'Extract product name, price, description, and availability status'
-        website_url: The complete URL of the webpage to scrape. Must include protocol (http:// or https://). Example: https://example.com/products/item
-        number_of_scrolls: Number of infinite scrolls to perform on the page before scraping (useful for dynamically loaded content). Default is 0. Example: 3 for pages with lazy-loading
-        markdown_only: If true, returns only the markdown content of the page without AI processing. Useful for simple content extraction. Default is false (AI extraction enabled)
+        user_prompt: Natural language instructions describing what data to extract. Be specific about the fields you want. Example: 'Extract product name, price, description, and availability status'
+        website_url: The complete URL of the webpage to scrape (mutually exclusive with website_html and website_markdown). Must include protocol. Example: https://example.com/products/item
+        website_html: Raw HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB). Useful for processing pre-fetched or generated HTML
+        website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB). Useful for extracting from markdown documents
+        number_of_scrolls: Number of infinite scrolls to perform before scraping (0-50, default 0). Useful for dynamically loaded content. Example: 3 for pages with lazy-loading
+        total_pages: Number of pages to process for pagination (1-100, default 1). Useful for multi-page content
+        render_heavy_js: Enable heavy JavaScript rendering for Single Page Applications and dynamic sites (default false). Increases processing time but captures client-side rendered content
+        stealth: Enable stealth mode to avoid bot detection (default false). Useful for sites with anti-scraping measures

     Returns:
-        Dictionary containing the extracted data in structured format matching your prompt requirements,
-        or markdown content if markdown_only is enabled
+        Dictionary containing the extracted data in structured format matching your prompt requirements
```