[Feature]: Export HTML to Markdown

The main motivation is to make the output more suitable for LLM ingestion, dataset creation, and reproducible text comparisons. Markdown provides a cleaner, more standardized structure compared to raw HTML, which is usually full of layout noise, scripts, and temporary attributes.

### Basic behavior

```
- <h1> → #, <h2> → ##, <h3> → ###
- <p> → simple text line
- <ul>/<ol> → Markdown lists
- <a> → [text](url)
- <img> → ![alt](src) (optional, configurable)
- <pre><code> → fenced code blocks
- Tables converted to Markdown or CSV fallback
- Inline spans or styling without semantic meaning are discarded
- Scripts, styles, and invisible nodes are ignored
```

Not yet decided how to handle sidebars, navigation blocks, and asides. Options: drop them entirely, append them at the bottom as “Notes,” or let the user configure with include/exclude. Needs discussion.

Likely to be implemented as a separate library (e.g. pydoll-markdown-exporter) to keep Pydoll’s core lightweight. Pydoll will call this library internally. A minimal prototype will be released first, covering essential mappings and already useful for RAG/LLM scenarios.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Feature]: Export HTML to Markdown #243

Basic behavior

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

[Feature]: Export HTML to Markdown #243

Description

Basic behavior

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions