
Commit 323f26a

Copilot and VinciGit00 committed
Add comprehensive timeout feature documentation
Co-authored-by: VinciGit00 <88108002+VinciGit00@users.noreply.github.com>
1 parent 9439fe5 commit 323f26a

File tree

1 file changed: +292 −0 lines changed

docs/timeout_configuration.md
@@ -0,0 +1,292 @@
# FetchNode Timeout Configuration

## Overview

The `FetchNode` in ScrapeGraphAI supports configurable timeouts for all blocking operations to prevent indefinite hangs when fetching web content or parsing files. This feature allows you to control execution time limits for:

- HTTP requests (when using `use_soup=True`)
- PDF file parsing
- ChromiumLoader operations

## Configuration

### Default Behavior

By default, `FetchNode` uses a **30-second timeout** for all blocking operations when a `node_config` is provided:

```python
from scrapegraphai.nodes import FetchNode

# Default 30-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={}
)
```

### Custom Timeout

You can specify a custom timeout value (in seconds) via the `timeout` parameter:

```python
# Custom 10-second timeout
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": 10}
)
```

### Disabling Timeout

To disable the timeout and allow operations to run indefinitely, set `timeout` to `None`:

```python
# No timeout - operations will wait indefinitely
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={"timeout": None}
)
```

### No Configuration

If you don't provide any `node_config`, the timeout defaults to `None` (no timeout):

```python
# No timeout (backward compatible)
node = FetchNode(
    input="url",
    output=["doc"],
    node_config=None
)
```
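Taken together, the three cases above imply a simple resolution rule. The sketch below captures that rule for illustration only; `DEFAULT_TIMEOUT` and `resolve_timeout` are hypothetical names, not part of the FetchNode API:

```python
# Sketch of the timeout resolution implied above; the real
# FetchNode internals may differ.
DEFAULT_TIMEOUT = 30  # hypothetical constant for illustration

def resolve_timeout(node_config):
    """Return the effective timeout in seconds, or None for no limit."""
    if node_config is None:
        # No configuration at all: no timeout (backward compatible).
        return None
    # A config dict without an explicit "timeout" key falls back to the
    # 30-second default; an explicit None disables the timeout.
    return node_config.get("timeout", DEFAULT_TIMEOUT)

assert resolve_timeout(None) is None
assert resolve_timeout({}) == 30
assert resolve_timeout({"timeout": 10}) == 10
assert resolve_timeout({"timeout": None}) is None
```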
## Use Cases

### HTTP Requests

When `use_soup=True`, the timeout applies to `requests.get()` calls:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": 15  # HTTP request will time out after 15 seconds
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)
```

If the timeout is `None`, no timeout parameter is passed to `requests.get()`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "use_soup": True,
        "timeout": None  # No timeout for HTTP requests
    }
)
```
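Note that when a timeout does fire on this path, the `requests` library itself raises `requests.exceptions.Timeout`. Whether `FetchNode` re-wraps that exception is not documented here, so a defensive pattern is to catch both (a sketch, not confirmed FetchNode behavior):

```python
import requests

try:
    result = node.execute({"url": "https://example.com"})
except (requests.exceptions.Timeout, TimeoutError) as e:
    # Covers both the requests-level timeout and a generic TimeoutError,
    # whichever the node lets propagate.
    print(f"Fetch timed out: {e}")
```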
### PDF Parsing

The timeout applies to PDF file parsing operations using `PyPDFLoader`:

```python
node = FetchNode(
    input="pdf",
    output=["doc"],
    node_config={
        "timeout": 60  # PDF parsing will time out after 60 seconds
    }
)

state = {"pdf": "/path/to/large_document.pdf"}
try:
    result = node.execute(state)
except TimeoutError as e:
    print(f"PDF parsing took too long: {e}")
```

If parsing exceeds the timeout, a `TimeoutError` is raised with a descriptive message:

```
TimeoutError: PDF parsing exceeded timeout of 60 seconds
```
### ChromiumLoader

The timeout is automatically propagated to `ChromiumLoader` via `loader_kwargs`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # ChromiumLoader will use a 30-second timeout
        "headless": True
    }
)

state = {"url": "https://example.com"}
result = node.execute(state)
```

If you need different timeout behavior for ChromiumLoader specifically, you can override it in `loader_kwargs`:

```python
node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 30,  # General timeout for other operations
        "loader_kwargs": {
            "timeout": 60  # ChromiumLoader gets a 60-second timeout
        }
    }
)
```
## Graph Examples

### SmartScraperGraph

```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "gpt-3.5-turbo",
        "api_key": "your-api-key"
    },
    "timeout": 20  # 20-second timeout for fetch operations
}

smart_scraper = SmartScraperGraph(
    prompt="Extract all article titles",
    source="https://news.example.com",
    config=graph_config
)

result = smart_scraper.run()
```

### Custom Graph with FetchNode

```python
from scrapegraphai.nodes import FetchNode
from langgraph.graph import StateGraph

# Create a custom graph with timeout
fetch_node = FetchNode(
    input="url",
    output=["doc"],
    node_config={
        "timeout": 15,
        "headless": True
    }
)

# Add to graph...
```
## Best Practices

1. **Choose appropriate timeouts**: Consider the expected response time of your target websites (see the sketch below):
   - Fast APIs: 5-10 seconds
   - Regular websites: 15-30 seconds
   - Large PDFs or slow sites: 60+ seconds
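These guidelines can be kept in one place as a small mapping from source type to timeout; the categories below are illustrative and not part of any FetchNode API:

```python
# Illustrative timeout budget (seconds) per source type; the category
# names are assumptions for this example only.
TIMEOUTS = {
    "api": 10,       # fast APIs
    "website": 30,   # regular websites
    "pdf": 120,      # large PDFs or slow sites
}

node_config = {"timeout": TIMEOUTS["website"]}
```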
2. **Handle `TimeoutError`**: Always wrap calls in try/except when using timeouts:

```python
try:
    result = node.execute(state)
except TimeoutError as e:
    logger.error(f"Operation timed out: {e}")
    # Handle the timeout gracefully
```

3. **Use different timeouts for different operations**: Set higher timeouts for PDF parsing and lower ones for HTTP requests:

```python
# For PDFs
pdf_node = FetchNode("pdf", ["doc"], {"timeout": 120})

# For web pages
web_node = FetchNode("url", ["doc"], {"timeout": 15})
```

4. **Monitor timeout occurrences**: Log timeout errors to identify problematic sources:

```python
import logging

logger = logging.getLogger(__name__)

try:
    result = node.execute(state)
except TimeoutError as e:
    logger.warning(f"Timeout for {state.get('url', 'unknown')}: {e}")
```
## Implementation Details

The timeout feature is implemented using:

- **HTTP requests**: the `timeout` parameter of `requests.get(url, timeout=X)`
- **PDF parsing**: `concurrent.futures.ThreadPoolExecutor` with `future.result(timeout=X)`
- **ChromiumLoader**: propagated via the `loader_kwargs` dictionary

When `timeout=None`, no timeout constraints are applied, allowing operations to run until completion.
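For reference, the executor-based pattern used for PDF parsing looks roughly like the sketch below. This is an illustration of the technique, not FetchNode's actual source; `load_pdf` is a hypothetical stand-in for the `PyPDFLoader` call:

```python
import concurrent.futures

def load_pdf(path):
    # Hypothetical stand-in for the PyPDFLoader call.
    ...

def parse_with_timeout(path, timeout=60):
    """Run a blocking parse in a worker thread, bounded by `timeout`."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(load_pdf, path)
    try:
        # result() raises concurrent.futures.TimeoutError if the worker
        # has not finished within `timeout` seconds.
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(
            f"PDF parsing exceeded timeout of {timeout} seconds"
        )
    finally:
        # wait=False so the caller is not blocked by a still-running
        # worker; it finishes (or is abandoned) in the background.
        executor.shutdown(wait=False)
```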
## Troubleshooting

### Timeout is too short

If you're seeing frequent timeout errors, increase the timeout value:

```python
node_config = {"timeout": 60}  # Increase from 30 to 60 seconds
```

### Need different timeouts for different operations

Use separate `FetchNode` instances with different configurations:

```python
fast_fetcher = FetchNode("url", ["doc"], {"timeout": 10})
slow_fetcher = FetchNode("pdf", ["doc"], {"timeout": 120})
```

### ChromiumLoader timeout not working

Ensure you're not overriding the timeout in `loader_kwargs`:

```python
# ❌ Wrong - explicit loader_kwargs timeout overrides node timeout
node_config = {
    "timeout": 30,
    "loader_kwargs": {"timeout": 10}  # This takes precedence
}

# ✅ Correct - let node timeout propagate
node_config = {
    "timeout": 30  # ChromiumLoader will use 30 seconds
}
```
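This precedence is what fill-if-absent merging of the two settings would produce; the sketch below illustrates that assumption (it is not a quote of FetchNode's source):

```python
def build_loader_kwargs(node_config):
    # Sketch of fill-if-absent propagation: the node-level timeout is
    # copied into loader_kwargs only when no explicit value is already
    # set there, which is why an explicit loader_kwargs timeout wins.
    loader_kwargs = dict(node_config.get("loader_kwargs", {}))
    loader_kwargs.setdefault("timeout", node_config.get("timeout", 30))
    return loader_kwargs

assert build_loader_kwargs({"timeout": 30})["timeout"] == 30
assert build_loader_kwargs(
    {"timeout": 30, "loader_kwargs": {"timeout": 10}}
)["timeout"] == 10
```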
## See Also

- [FetchNode API Documentation](../api/nodes/fetch_node.md)
- [Graph Configuration](./graph_configuration.md)
- [Error Handling](./error_handling.md)
