[Feature]: Export HTML to Markdown #243

@thalissonvs

Description

The main motivation is to make exported page content more suitable for LLM ingestion, dataset creation, and reproducible text comparisons. Markdown gives a cleaner, more standardized structure than raw HTML, which is usually full of layout noise, scripts, and transient attributes.

Basic behavior

- <h1> → #, <h2> → ##, <h3> → ###
- <p> → plain text paragraph
- <ul>/<ol> → Markdown lists
- <a> → [text](url)
- <img> → ![alt](src) (optional, configurable)
- <pre><code> → fenced code blocks
- <table> → Markdown table, with CSV as a fallback
- Inline spans or styling without semantic meaning are discarded
- Scripts, styles, and invisible nodes are ignored
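
To make the mappings above concrete, here is a minimal sketch built only on Python's standard `html.parser`. It is not the proposed library: the names `MarkdownSketch`, `html_to_markdown`, and `include_images` are hypothetical, and tables and invisible-node detection are left out. Tags with no mapping (such as `<span>`) are dropped while their text is kept, which is one possible reading of the "discarded" rule.

```python
import re
from html.parser import HTMLParser

HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}
SKIPPED = {"script", "style"}
BLOCK_TAGS = {"p", "pre", "ul", "ol", "li", *HEADINGS}
FENCE = "`" * 3  # built at runtime so this example stays fence-safe


class MarkdownSketch(HTMLParser):
    """Hypothetical converter covering a subset of the listed mappings."""

    def __init__(self, include_images=True):
        super().__init__(convert_charrefs=True)
        self.include_images = include_images
        self.blocks = []      # finished Markdown blocks
        self.items = []       # lines of the list currently being built
        self.buf = []         # text fragments of the open block
        self.lists = []       # stack of [tag, item_counter]
        self.href = None      # target of the currently open <a>
        self.link_start = 0   # index in buf where the link text begins
        self.skip_depth = 0   # > 0 while inside <script>/<style>
        self.in_pre = False

    def _flush(self, prefix=""):
        # Collapse whitespace and emit the buffered text as a block or list item.
        text = re.sub(r"\s+", " ", "".join(self.buf)).strip()
        self.buf = []
        if not text:
            return
        if self.lists:
            tag, n = self.lists[-1]
            marker = f"{n}. " if tag == "ol" else "- "
            self.items.append("    " * (len(self.lists) - 1) + marker + text)
        else:
            self.blocks.append(prefix + text)

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
        if self.skip_depth:
            return
        attrs = dict(attrs)
        if tag in BLOCK_TAGS:
            self._flush()
        if tag == "pre":
            self.in_pre = True
        elif tag in ("ul", "ol"):
            self.lists.append([tag, 0])
        elif tag == "li" and self.lists:
            self.lists[-1][1] += 1
        elif tag == "a":
            self.href, self.link_start = attrs.get("href"), len(self.buf)
        elif tag == "img" and self.include_images:
            alt, src = attrs.get("alt") or "", attrs.get("src") or ""
            self.buf.append(f"![{alt}]({src})")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag == "pre" and self.in_pre:
            code = "".join(self.buf).strip("\n")  # keep inner whitespace as-is
            self.buf, self.in_pre = [], False
            self.blocks.append(f"{FENCE}\n{code}\n{FENCE}")
        elif tag in HEADINGS:
            self._flush(HEADINGS[tag] + " ")
        elif tag in ("p", "li"):
            self._flush()
        elif tag in ("ul", "ol"):
            self._flush()
            if self.lists:
                self.lists.pop()
            if not self.lists and self.items:  # outermost list just closed
                self.blocks.append("\n".join(self.items))
                self.items = []
        elif tag == "a" and self.href is not None:
            label = "".join(self.buf[self.link_start:]).strip()
            del self.buf[self.link_start:]
            self.buf.append(f"[{label}]({self.href})")
            self.href = None

    def handle_data(self, data):
        if not self.skip_depth:
            self.buf.append(data)


def html_to_markdown(html, include_images=True):
    parser = MarkdownSketch(include_images)
    parser.feed(html)
    parser.close()
    parser._flush()  # emit any trailing text outside a block element
    return "\n\n".join(parser.blocks)


if __name__ == "__main__":
    print(html_to_markdown("<h1>Title</h1><p>See <a href='/docs'>the docs</a>.</p>"))
```

A streaming parser keeps the sketch dependency-free. A real exporter would more likely walk a DOM tree, which makes tables and visibility checks easier to handle.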

How to handle sidebars, navigation blocks, and asides is not yet decided. Options: drop them entirely, append them at the bottom as “Notes”, or let the user configure them with include/exclude rules. Needs discussion.
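
As input for that discussion, the three options could be expressed as one small policy object. This is only a sketch under assumptions: `AuxiliaryPolicy`, `arrange_blocks`, the mode names, and the idea that the converter yields `(source_tag, markdown)` pairs are all hypothetical, and sidebars without a semantic tag would need some other detection.

```python
from dataclasses import dataclass

MODES = ("drop", "notes", "keep")


@dataclass(frozen=True)
class AuxiliaryPolicy:
    """Hypothetical setting for content outside the main flow."""
    mode: str = "drop"  # "drop", "notes", or "keep"
    auxiliary_tags: frozenset = frozenset({"nav", "aside"})

    def __post_init__(self):
        if self.mode not in MODES:
            raise ValueError(f"unknown mode: {self.mode!r}")


def arrange_blocks(blocks, policy):
    """Order converted blocks according to the policy.

    blocks: list of (source_tag, markdown_text) pairs in document order.
    """
    main, aux = [], []
    for tag, text in blocks:
        if policy.mode != "keep" and tag in policy.auxiliary_tags:
            aux.append(text)
        else:
            main.append(text)
    if policy.mode == "notes" and aux:
        # Option 2: append auxiliary content at the bottom under "Notes".
        main.append("## Notes")
        main.extend(aux)
    return "\n\n".join(main)


if __name__ == "__main__":
    page = [("nav", "[Home](/)"), ("main", "# Title"), ("aside", "Related links")]
    print(arrange_blocks(page, AuxiliaryPolicy(mode="notes")))
```

Here `mode` covers the drop and "Notes" options, while `auxiliary_tags` plays the role of the include/exclude setting: removing a tag from the set keeps that content inline.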

This will likely be implemented as a separate library (e.g. pydoll-markdown-exporter) to keep Pydoll’s core lightweight, with Pydoll calling it internally. A minimal prototype will be released first, covering the essential mappings, which should already be useful for RAG/LLM scenarios.

Metadata

Labels: enhancement (New feature or request), future planning (Ideas or features proposed for future development)
