Issue #315

Changes Proposed

  1. New Database Format for --save Flag
    - Added database as a new choice for the --save argument (a minimal sketch follows this list)
    - Location: main.py lines 138-139
  2. Core Database Module
    - Created src/torbot/modules/database.py
    - Implements the SearchResultsDatabase class for SQLite management
    - No external database server required (uses the built-in sqlite3 module)
  3. Integration with LinkTree
    - Added a saveDatabase() method in src/torbot/modules/linktree.py (lines 159-195)
    - Extracts all discovered links and metadata for persistent storage
  4. Query Utilities
    - Created src/torbot/modules/db_query.py for result retrieval
    - Created scripts/query_database.py, a CLI for database operations
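
The exact diff in main.py is not reproduced here, so the snippet below is only a minimal sketch of how a database choice can be added to an argparse --save flag. The other choices shown ("tree", "json") and the surrounding flag names are illustrative assumptions, not a claim about TorBot's actual argument list.

```python
# Sketch only: the real flag wiring in main.py (lines 138-139) may differ.
import argparse

parser = argparse.ArgumentParser(prog="torbot")
parser.add_argument("-u", "--url", help="root .onion URL to crawl")
parser.add_argument("--depth", type=int, default=1, help="crawl depth")
parser.add_argument(
    "--save",
    # "tree" and "json" are assumed pre-existing choices; "database" is the new one.
    choices=["tree", "json", "database"],
    help="persist results in the chosen format",
)

args = parser.parse_args()
if args.save == "database":
    # hand the crawl results off to SearchResultsDatabase (see Core Features below)
    pass
```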

Explanation of Changes

Database Engine & Architecture
- Engine: SQLite (file-based, no database server required)
- Database file: <project_root>/torbot_search_results.db
- Auto-initialized on first use

Database Schema
searches Table (Search Metadata)

- id (INTEGER PRIMARY KEY): Auto-incrementing search ID
- root_url (TEXT): The root URL that was crawled
- search_timestamp (DATETIME): ISO 8601 formatted timestamp of search
- depth (INTEGER): Crawl depth setting used
- total_links (INTEGER): Count of total links discovered
- links_data (TEXT): JSON array of all link metadata
- created_at (DATETIME): Record creation timestamp

links Table (Individual Link Records)

- id (INTEGER PRIMARY KEY): Auto-incrementing link ID
- search_id (INTEGER): Foreign key referencing searches table
- url (TEXT): Full URL of discovered link
- title (TEXT): Page title or hostname
- status_code (INTEGER): HTTP response code (200, 404, etc.)
- classification (TEXT): Content classification from NLP module
- accuracy (REAL): Classification confidence score (0.0-1.0)
- emails (TEXT): JSON array of emails found on page
- phone_numbers (TEXT): JSON array of phone numbers found

Relationship: One search has many links (1:N relationship with CASCADE delete)
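
For reference, a schema matching the columns and the 1:N CASCADE relationship described above would look roughly like the sketch below. The real DDL lives in src/torbot/modules/database.py and may differ in naming, constraints, or defaults.

```python
import sqlite3

# Sketch of the schema described above; the actual statements in
# src/torbot/modules/database.py may differ in detail.
SCHEMA = """
CREATE TABLE IF NOT EXISTS searches (
    id               INTEGER PRIMARY KEY AUTOINCREMENT,
    root_url         TEXT NOT NULL,
    search_timestamp DATETIME,          -- ISO 8601 string
    depth            INTEGER,
    total_links      INTEGER,
    links_data       TEXT,              -- JSON array of link metadata
    created_at       DATETIME DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE IF NOT EXISTS links (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    search_id      INTEGER NOT NULL,
    url            TEXT NOT NULL,
    title          TEXT,
    status_code    INTEGER,
    classification TEXT,
    accuracy       REAL,                -- confidence score, 0.0-1.0
    emails         TEXT,                -- JSON array
    phone_numbers  TEXT,                -- JSON array
    FOREIGN KEY (search_id) REFERENCES searches (id) ON DELETE CASCADE
);
"""

conn = sqlite3.connect("torbot_search_results.db")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs per connection
conn.executescript(SCHEMA)
```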

Metadata Captured Per Search

Root-Level Metadata:
✅ Root URL being crawled
✅ Exact timestamp of search (ISO 8601)
✅ Crawl depth configuration
✅ Total link count

Per-Link Metadata:
✅ Full URL
✅ Page title
✅ HTTP status code (connectivity indicator)
✅ Content classification (marketplace, forum, etc.)
✅ Classification accuracy/confidence
✅ Email addresses extracted
✅ Phone numbers extracted

Core Features:

  • Save Results -> SearchResultsDatabase.save_search_results() -> stores the search plus its links
  • Retrieve History -> get_search_history() -> queries past searches with an optional URL filter
  • Get Details -> get_search_by_id() -> full search details with all links
  • Close Connection -> close() -> proper resource cleanup
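
The method signatures are not shown in this description, so the following is only a sketch of how the class might be used; the keyword arguments to save_search_results() (root_url, depth, links) and its return value are assumptions for illustration.

```python
from torbot.modules.database import SearchResultsDatabase

# Sketch only: the real signatures in database.py may differ.
db = SearchResultsDatabase()  # opens/creates torbot_search_results.db

# Assumed call shape: root URL, crawl depth, and a list of per-link records.
search_id = db.save_search_results(
    root_url="http://example.onion",
    depth=2,
    links=[
        {
            "url": "http://example.onion/forum",
            "title": "Example Forum",
            "status_code": 200,
            "classification": "forum",
            "accuracy": 0.91,
            "emails": [],
            "phone_numbers": [],
        }
    ],
)

# Retrieve previous crawls, optionally filtered by root URL.
history = db.get_search_history(root_url="http://example.onion")

# Full details (search metadata plus every link row) for one search.
details = db.get_search_by_id(search_id)

db.close()  # release the SQLite connection
```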

Usage
Basic Save:

python main.py -u http://example.onion --depth 2 --save database
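
Because the output is a plain SQLite file, saved results can also be inspected with nothing but the standard library, independent of scripts/query_database.py. The sketch below pulls the most recent search and its links using the column names from the schema section above.

```python
import json
import sqlite3

# Inspect saved results directly; relies only on the schema described above.
conn = sqlite3.connect("torbot_search_results.db")
conn.row_factory = sqlite3.Row

search = conn.execute(
    "SELECT * FROM searches ORDER BY search_timestamp DESC LIMIT 1"
).fetchone()

if search is not None:
    print(search["root_url"], search["search_timestamp"], search["total_links"])
    for link in conn.execute(
        "SELECT url, status_code, classification, emails FROM links WHERE search_id = ?",
        (search["id"],),
    ):
        emails = json.loads(link["emails"] or "[]")  # stored as a JSON array
        print(link["url"], link["status_code"], link["classification"], emails)

conn.close()
```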

Benefits:

  • Persistence: Search results survive program restarts
  • Auditability: Full timestamp history of all crawls
  • Queryability: Filter and search previous results
  • Scalability: SQLite handles thousands of records efficiently
  • No Dependencies: Uses Python's built-in sqlite3 module
  • Relationship Integrity: Foreign keys prevent orphaned records
  • Export Ready: JSON data format enables easy integration with other tools
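
As a small illustration of the Export Ready point, the sketch below dumps the most recent search (including its links_data JSON) to a standalone file that other tools can consume; the output filename is arbitrary.

```python
import json
import sqlite3

# Sketch: export one saved search as JSON, using only the columns described above.
conn = sqlite3.connect("torbot_search_results.db")
conn.row_factory = sqlite3.Row

row = conn.execute("SELECT * FROM searches ORDER BY id DESC LIMIT 1").fetchone()
if row is not None:
    export = {
        "root_url": row["root_url"],
        "search_timestamp": row["search_timestamp"],
        "depth": row["depth"],
        "total_links": row["total_links"],
        "links": json.loads(row["links_data"] or "[]"),
    }
    with open("torbot_export.json", "w") as fh:  # arbitrary output path
        json.dump(export, fh, indent=2)

conn.close()
```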
