dylan-sutton-chavez/llm-web-crawler

A horizontally scalable web crawling engine designed for structured content extraction. It performs URL normalization, HTML parsing, Markdown conversion, and LLM post-processing (using xai_sdk with Grok). Output is serialized in line-delimited JSON (.jsonl).


1. Mathematical Fundamentals

"A type of non-linear data structure, a graph consists of nodes and edges. Each node, also called vertices, represents an entity, while each relationship in the graph, represented as an edge, signifies a relationship between two vertices. This fundamental concept in graph theory allows us to model a wide array of real-world scenarios" — (SaWang, PuppyGraph. Jan, 2025)

A crawler works much like a graph: each web page behaves like a node, a single node can point to any number of sites, and multiple nodes can point to the same site. Crawler navigation follows the same principle as graph traversal; in this case, I base it on a breadth-first search (BFS) model.

2. Crawler Conceptualization

  • Starts from a seed URL as the initial node
  • Maintains two collections:
    • visited: URLs already visited
    • queue: URLs waiting to be visited
  • Iterates by depth level, based on the pages discovered at each level
  • Explores all nodes at the current depth before moving on to the next one (see the sketch after this list)
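
A minimal sketch of that traversal in Python (not the module's actual implementation), where extract_links is a hypothetical helper that fetches a page and returns the URLs found on it:

from collections import deque

def bfs_crawl(seed: str, depth: int, extract_links) -> set:
    # Breadth-first traversal of the link graph, starting from the seed node.
    visited = {seed}       # URLs already visited
    queue = deque([seed])  # URLs waiting to be visited
    for _ in range(depth):
        # Explore every node at the current depth before descending a level.
        for _ in range(len(queue)):
            url = queue.popleft()
            for link in extract_links(url):  # hypothetical helper
                if link not in visited:
                    visited.add(link)
                    queue.append(link)
    return visited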

3. Crawler Module

The Crawler module implements a horizontally scalable architecture: you can create one Crawler instance with hundreds of nodes that share the same memory while managing hundreds of asynchronous processes.

3.1 Requirements

  • Python 3.9+
  • html-to-markdown 1.16.0
  • url_normalize 2.2.1
  • lxml 6.0.1
  • jsonlines 4.0.0
  • requests 2.32.3+
  • xai_sdk 1.1.0+

pip install -r requirements.txt
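
As a rough sketch of how the listed packages fit together for a single page (URL normalization, fetching, Markdown conversion, LLM post-processing), the function below is illustrative rather than the module's internals; the xai_sdk calls assume the 1.x chat interface and the grok-3 model name, both of which may change:

import requests
from url_normalize import url_normalize
from html_to_markdown import convert_to_markdown
from xai_sdk import Client
from xai_sdk.chat import user

client = Client()  # expects the xAI API key in the environment

def process_page(raw_url: str) -> str:
    url = url_normalize(raw_url)               # canonical URL form
    response = requests.get(url, timeout=30)
    response.raise_for_status()                # surfaces HTTP errors like those logged below
    markdown = convert_to_markdown(response.text)
    chat = client.chat.create(model='grok-3')  # model name is an assumption
    chat.append(user(f'Extract the main content of this page:\n\n{markdown}'))
    return chat.sample().content               # LLM post-processed text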

3.2 Usage

  1. Load the environment variable for xai_sdk
from dotenv import load_dotenv
load_dotenv()  # reads the local .env file (e.g. the xAI API key)
  2. Initialize the crawler object with a seed URL
constructor = Crawler('https://www.geeksforgeeks.org/machine-learning/what-is-perceptron-the-simplest-artificial-neural-network/')
  3. Start a node in the crawler object with a target depth
constructor.node(filename='crawled-sites.jsonl', depth=2)
Depth 1/2:
    1 Sites Crawled | 82.218219 Seconds

Error HTTP https://in.linkedin.com/company/geeksforgeeks: 999
Error HTTP https://www.geeksforgeeks.org/machine-learning/machine-learning-interview-questions/)_: 404
Error HTTP https://www.geeksforgeeks.org/deep-learning/deep-learning-interview-questions/)_: 404
Error HTTP https://www.geeksforgeeks.org/deep-learning/5-deep-learning-project-ideas-for-beginners/)_: 404
Error HTTP https://www.geeksforgeeks.org/machine-learning/machine-learning-projects/)_: 404
Error HTTP https://geeksforgeeksapp.page.link/: 400
Depth 2/2:
    154 Sites Crawled | 4113.7781 Seconds

You now have the extracted content in the crawled-sites.jsonl file
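
Because the output is line-delimited JSON, it can be read back with the jsonlines package from the requirements; the exact fields of each record depend on the module, so none are assumed here:

import jsonlines

with jsonlines.open('crawled-sites.jsonl') as reader:
    for record in reader:  # one crawled page per line
        print(record)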

Remember

  • You can run multiple nodes in threads that share the same memory (one Crawler instance), but this module is not optimized to handle race conditions (see the sketch after this list)
  • xai_sdk may change its API structure between versions
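
A sketch of that multithreaded setup, assuming the node signature from section 3.2; each node writes to its own file here precisely because the shared state is not protected against races:

import threading

crawler = Crawler('https://www.geeksforgeeks.org/')  # shared memory for all nodes

# Each thread runs one node against the same Crawler instance.
threads = [
    threading.Thread(
        target=crawler.node,
        kwargs={'filename': f'crawled-sites-{i}.jsonl', 'depth': 2},
    )
    for i in range(4)
]

for thread in threads:
    thread.start()
for thread in threads:
    thread.join()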
