Description
The main motivation is to make the output more suitable for LLM ingestion, dataset creation, and reproducible text comparisons. Markdown provides a cleaner, more standardized structure compared to raw HTML, which is usually full of layout noise, scripts, and temporary attributes.
Basic behavior
- <h1> → #, <h2> → ##, <h3> → ###
- <p> → simple text line
- <ul>/<ol> → Markdown lists
- <a> → [text](url)
- <img> → ![alt](src) (optional, configurable)
- <pre><code> → fenced code blocks
- <table> → Markdown table, with CSV as a fallback
- Inline spans or styling without semantic meaning are discarded
- Scripts, styles, and invisible nodes are ignored
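A rough sketch of a few of these mappings (headings, paragraphs, lists, links, and skipping scripts and styles), built only on Python's standard `html.parser`. The names `MarkdownSketch` and `html_to_markdown` are hypothetical and are not part of Pydoll or the proposed exporter. Images, code blocks, tables, and nested structures are deliberately left out.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
SKIPPED = {"script", "style"}


class MarkdownSketch(HTMLParser):
    """Hypothetical converter covering a subset of the mappings above."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []      # finished Markdown lines
        self.buf = []        # text of the block currently being built
        self.prefix = ""     # "# ", "- ", "1. ", or "" for paragraphs
        self.lists = []      # stack of [list tag, item counter]
        self.href = None     # target of the <a> being read, if any
        self.link_text = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def _flush(self):
        # Collapse whitespace and emit the current block, if it has text.
        text = " ".join("".join(self.buf).split())
        if text:
            self.lines.append(self.prefix + text)
        self.buf, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        elif tag in HEADINGS:
            self._flush()
            self.prefix = HEADINGS[tag] + " "
        elif tag == "p":
            self._flush()
        elif tag in ("ul", "ol"):
            self._flush()
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            self._flush()
            self.lists[-1][1] += 1
            kind, count = self.lists[-1]
            self.prefix = "- " if kind == "ul" else f"{count}. "
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.link_text = []
        # Any other tag (e.g. <span>) is dropped; its text is still kept.

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADINGS or tag in ("p", "li"):
            self._flush()
        elif tag in ("ul", "ol") and self.lists:
            self._flush()
            self.lists.pop()
        elif tag == "a":
            text = "".join(self.link_text)
            self.buf.append(f"[{text}]({self.href})" if self.href else text)
            self.href = None

    def handle_data(self, data):
        if self.skip_depth:
            return
        if self.href is not None:
            self.link_text.append(data)
        else:
            self.buf.append(data)


def html_to_markdown(html):
    parser = MarkdownSketch()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.lines)
```

Calling `html_to_markdown("<h2>Intro</h2><p>Hello</p>")` yields a `## Intro` line followed by a `Hello` line. A real exporter would more likely walk the rendered DOM than re-parse an HTML string, so treat this only as an illustration of the tag-to-syntax table.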
Not yet decided how to handle sidebars, navigation blocks, and asides. Options: drop them entirely, append them at the bottom as “Notes,” or let the user configure with include/exclude. Needs discussion.
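To make the three options concrete for discussion, here is one possible shape for such a setting. Everything in it is an assumption for illustration: the `AuxPolicy` enum, the `arrange` function, and the idea that the converter yields `(tag, markdown_text)` pairs in document order.

```python
from enum import Enum


class AuxPolicy(Enum):
    DROP = "drop"      # discard auxiliary blocks entirely
    APPEND = "append"  # move them to a trailing "Notes" section


def arrange(blocks, policy=AuxPolicy.DROP, exclude=frozenset({"nav", "aside"})):
    """Order converted blocks according to a hypothetical auxiliary-content policy.

    blocks:  iterable of (tag, markdown_text) pairs in document order.
    exclude: tags treated as auxiliary; an empty set keeps everything inline.
    """
    main, notes = [], []
    for tag, text in blocks:
        if tag not in exclude:
            main.append(text)
        elif policy is AuxPolicy.APPEND:
            notes.append(text)
        # With AuxPolicy.DROP, excluded blocks are simply skipped.
    if notes:
        main.append("## Notes")
        main.extend(notes)
    return "\n\n".join(main)
```

With this shape, "drop" and "append as Notes" are two values of one setting, and the include/exclude option is the `exclude` set itself, so all three proposals can coexist behind a single call.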
Likely to be implemented as a separate library (e.g. pydoll-markdown-exporter) to keep Pydoll’s core lightweight. Pydoll will call this library internally. A minimal prototype will be released first, covering essential mappings and already useful for RAG/LLM scenarios.