
Commit 3acd879

VinciGit00 and claude committed
feat: Enhance smartscraper with OpenAPI spec compliance - add HTML/Markdown processing and new parameters
## New Features

Based on the official ScrapeGraphAI OpenAPI specification, smartscraper now supports:

### Three Input Modes (Mutually Exclusive)

1. **URL-based scraping** (website_url) - Scrape live websites
2. **HTML processing** (website_html) - Process local/pre-fetched HTML (max 2MB)
3. **Markdown processing** (website_markdown) - Extract from markdown documents (max 2MB)

### New Parameters

- **output_schema**: JSON schema for structured output definition
- **total_pages**: Pagination support (1-100 pages, default 1)
- **render_heavy_js**: Heavy JavaScript rendering for SPAs (default false)
- **stealth**: Stealth mode to avoid bot detection (default false)

### Removed Parameters

- **markdown_only**: Deprecated in favor of the new extraction modes

## Implementation Details

### Client Method Updates

- Updated `ScapeGraphClient.smartscraper()` method signature
- Added mutually exclusive validation for input sources
- All new parameters properly passed to the API endpoint
- Maintains backward compatibility with existing website_url usage

### MCP Tool Updates

- Enhanced tool signature with all new parameters
- Comprehensive parameter descriptions for better UX
- JSON schema parsing for output_schema (accepts dict or JSON string)
- Clear documentation of mutually exclusive input modes
- Examples for each parameter

## Use Cases Enabled

1. **Process pre-fetched HTML**: Useful for cached content or generated HTML
2. **Extract from markdown**: Process documentation or markdown exports
3. **Pagination**: Handle multi-page content automatically
4. **JavaScript-heavy sites**: Better support for SPAs and dynamic content
5. **Stealth scraping**: Bypass bot detection mechanisms
6. **Structured output**: Define exact output format with JSON schema

## Backward Compatibility

Existing usage with website_url continues to work without changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
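The mutually exclusive input handling described above can be sketched as a standalone helper. This is an illustration, not the committed code: `build_smartscraper_payload` is a hypothetical name, and it enforces strict one-of validation, whereas the committed client takes the first non-None source in if/elif order.

```python
from typing import Any, Dict, Optional


def build_smartscraper_payload(
    user_prompt: str,
    website_url: Optional[str] = None,
    website_html: Optional[str] = None,
    website_markdown: Optional[str] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Build a smartscraper request body, enforcing exactly one input source."""
    sources = {
        "website_url": website_url,
        "website_html": website_html,
        "website_markdown": website_markdown,
    }
    provided = {k: v for k, v in sources.items() if v is not None}
    if len(provided) != 1:
        raise ValueError(
            "Provide exactly one of: website_url, website_html, website_markdown"
        )
    payload: Dict[str, Any] = {"user_prompt": user_prompt, **provided}
    # Optional parameters (output_schema, total_pages, render_heavy_js, stealth, ...)
    # are only included when explicitly set, matching the diff's `is not None` checks.
    payload.update({k: v for k, v in options.items() if v is not None})
    return payload
```

Keeping unset parameters out of the payload lets the API apply its own defaults (e.g. total_pages=1, stealth=false) rather than sending explicit nulls.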
1 parent 32abb0a commit 3acd879

File tree

1 file changed: +90 additions, -28 deletions


src/scrapegraph_mcp/server.py

Lines changed: 90 additions & 28 deletions
```diff
@@ -62,32 +62,59 @@ def markdownify(self, website_url: str) -> Dict[str, Any]:

         return response.json()

-    def smartscraper(self, user_prompt: str, website_url: str, number_of_scrolls: int = None, markdown_only: bool = None) -> Dict[str, Any]:
+    def smartscraper(
+        self,
+        user_prompt: str,
+        website_url: str = None,
+        website_html: str = None,
+        website_markdown: str = None,
+        output_schema: Dict[str, Any] = None,
+        number_of_scrolls: int = None,
+        total_pages: int = None,
+        render_heavy_js: bool = None,
+        stealth: bool = None
+    ) -> Dict[str, Any]:
         """
         Extract structured data from a webpage using AI.

         Args:
             user_prompt: Instructions for what data to extract
-            website_url: URL of the webpage to scrape
-            number_of_scrolls: Number of infinite scrolls to perform (optional)
-            markdown_only: Whether to return only markdown content without AI processing (optional)
+            website_url: URL of the webpage to scrape (mutually exclusive with website_html and website_markdown)
+            website_html: HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB)
+            website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB)
+            output_schema: JSON schema defining expected output structure (optional)
+            number_of_scrolls: Number of infinite scrolls to perform (0-50, default 0)
+            total_pages: Number of pages to process for pagination (1-100, default 1)
+            render_heavy_js: Enable heavy JavaScript rendering for dynamic pages (default false)
+            stealth: Enable stealth mode to avoid bot detection (default false)

         Returns:
-            Dictionary containing the extracted data or markdown content
+            Dictionary containing the extracted data
         """
         url = f"{self.BASE_URL}/smartscraper"
-        data = {
-            "user_prompt": user_prompt,
-            "website_url": website_url
-        }
-
-        # Add number_of_scrolls to the request if provided
+        data = {"user_prompt": user_prompt}
+
+        # Add input source (mutually exclusive)
+        if website_url is not None:
+            data["website_url"] = website_url
+        elif website_html is not None:
+            data["website_html"] = website_html
+        elif website_markdown is not None:
+            data["website_markdown"] = website_markdown
+        else:
+            raise ValueError("Must provide one of: website_url, website_html, or website_markdown")
+
+        # Add optional parameters
+        if output_schema is not None:
+            data["output_schema"] = output_schema
         if number_of_scrolls is not None:
             data["number_of_scrolls"] = number_of_scrolls
-
-        # Add markdown_only to the request if provided
-        if markdown_only is not None:
-            data["markdown_only"] = markdown_only
+        if total_pages is not None:
+            data["total_pages"] = total_pages
+        if render_heavy_js is not None:
+            data["render_heavy_js"] = render_heavy_js
+        if stealth is not None:
+            data["stealth"] = stealth

         response = self.client.post(url, headers=self.headers, json=data)

@@ -836,33 +863,68 @@ def markdownify(website_url: str, ctx: Context) -> Dict[str, Any]:
 @mcp.tool(annotations={"readOnlyHint": True, "destructiveHint": False, "idempotentHint": True})
 def smartscraper(
     user_prompt: str,
-    website_url: str,
     ctx: Context,
+    website_url: Optional[str] = None,
+    website_html: Optional[str] = None,
+    website_markdown: Optional[str] = None,
+    output_schema: Optional[Union[str, Dict[str, Any]]] = None,
     number_of_scrolls: Optional[int] = None,
-    markdown_only: Optional[bool] = None
+    total_pages: Optional[int] = None,
+    render_heavy_js: Optional[bool] = None,
+    stealth: Optional[bool] = None
 ) -> Dict[str, Any]:
     """
-    Extract structured data from a webpage using AI-powered extraction.
+    Extract structured data from a webpage, HTML, or markdown using AI-powered extraction.

     This tool uses advanced AI to understand your natural language prompt and extract specific
-    structured data from any webpage. Ideal for extracting product information, contact details,
-    article metadata, or any structured content. Supports infinite scrolling for dynamic content.
-    Costs 10 credits per page. Read-only operation with no side effects.
+    structured data from web content. Supports three input modes: URL scraping, local HTML processing,
+    or local markdown processing. Ideal for extracting product information, contact details,
+    article metadata, or any structured content. Costs 10 credits per page. Read-only operation.

     Args:
-        user_prompt: Natural language instructions describing what data to extract from the webpage. Be specific about the fields you want. Example: 'Extract product name, price, description, and availability status'
-        website_url: The complete URL of the webpage to scrape. Must include protocol (http:// or https://). Example: https://example.com/products/item
-        number_of_scrolls: Number of infinite scrolls to perform on the page before scraping (useful for dynamically loaded content). Default is 0. Example: 3 for pages with lazy-loading
-        markdown_only: If true, returns only the markdown content of the page without AI processing. Useful for simple content extraction. Default is false (AI extraction enabled)
+        user_prompt: Natural language instructions describing what data to extract. Be specific about the fields you want. Example: 'Extract product name, price, description, and availability status'
+        website_url: The complete URL of the webpage to scrape (mutually exclusive with website_html and website_markdown). Must include protocol. Example: https://example.com/products/item
+        website_html: Raw HTML content to process locally (mutually exclusive with website_url and website_markdown, max 2MB). Useful for processing pre-fetched or generated HTML
+        website_markdown: Markdown content to process locally (mutually exclusive with website_url and website_html, max 2MB). Useful for extracting from markdown documents
+        output_schema: JSON schema dict or JSON string defining the expected output structure. Example: {'type': 'object', 'properties': {'title': {'type': 'string'}, 'price': {'type': 'number'}}}
+        number_of_scrolls: Number of infinite scrolls to perform before scraping (0-50, default 0). Useful for dynamically loaded content. Example: 3 for pages with lazy-loading
+        total_pages: Number of pages to process for pagination (1-100, default 1). Useful for multi-page content
+        render_heavy_js: Enable heavy JavaScript rendering for Single Page Applications and dynamic sites (default false). Increases processing time but captures client-side rendered content
+        stealth: Enable stealth mode to avoid bot detection (default false). Useful for sites with anti-scraping measures

     Returns:
-        Dictionary containing the extracted data in structured format matching your prompt requirements,
-        or markdown content if markdown_only is enabled
+        Dictionary containing the extracted data in structured format matching your prompt requirements
+        and optional output_schema
     """
     try:
         api_key = get_api_key(ctx)
         client = ScapeGraphClient(api_key)
-        return client.smartscraper(user_prompt, website_url, number_of_scrolls, markdown_only)
+
+        # Parse output_schema if it's a JSON string
+        normalized_schema: Optional[Dict[str, Any]] = None
+        if isinstance(output_schema, dict):
+            normalized_schema = output_schema
+        elif isinstance(output_schema, str):
+            try:
+                parsed_schema = json.loads(output_schema)
+                if isinstance(parsed_schema, dict):
+                    normalized_schema = parsed_schema
+                else:
+                    return {"error": "output_schema must be a JSON object"}
+            except json.JSONDecodeError as e:
+                return {"error": f"Invalid JSON for output_schema: {str(e)}"}
+
+        return client.smartscraper(
+            user_prompt=user_prompt,
+            website_url=website_url,
+            website_html=website_html,
+            website_markdown=website_markdown,
+            output_schema=normalized_schema,
+            number_of_scrolls=number_of_scrolls,
+            total_pages=total_pages,
+            render_heavy_js=render_heavy_js,
+            stealth=stealth
+        )
     except Exception as e:
         return {"error": str(e)}
```
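The MCP tool's dict-or-JSON-string handling for output_schema can be isolated as a small helper. This is a sketch with a hypothetical name (`normalize_schema`); unlike the committed tool, which returns an error dict to the caller, this version raises on bad input.

```python
import json
from typing import Any, Dict, Optional, Union


def normalize_schema(
    output_schema: Optional[Union[str, Dict[str, Any]]],
) -> Optional[Dict[str, Any]]:
    """Accept a JSON-schema dict or a JSON string; return a dict (or None)."""
    if output_schema is None or isinstance(output_schema, dict):
        return output_schema
    # A string must parse to a JSON object; json.loads raises
    # json.JSONDecodeError on malformed input.
    parsed = json.loads(output_schema)
    if not isinstance(parsed, dict):
        raise ValueError("output_schema must be a JSON object")
    return parsed
```

Accepting both forms matters for MCP clients: some serialize tool arguments as JSON strings, while others pass structured objects through directly.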
