Image Deduplication and Grouping Tools

This is a collection of Python scripts designed to find duplicate images, index large collections for fast searches, and group visually similar images using various techniques (CNN, PHash, CLIP).

Key Features

  • Multiple Detection Methods: Utilizes Convolutional Neural Networks (CNN), Perceptual Hashing (PHash), and OpenAI's CLIP model for different types of similarity analysis (a short sketch of the PHash/CNN approach follows this list).
  • Fast Indexing Workflow: Scan a collection once with indexer.py to create an index, then find duplicates for a specific image almost instantly with find_duplicate.py.
  • Batch Deduplication: Find all duplicate sets in a folder and move them to a revision directory for easy review with main.py.
  • Semantic Grouping: Group images that are visually or conceptually similar (e.g., all sunset photos) using the powerful CLIP model with group_clip.py.
  • Broad Format Support: Compatible with common image formats, including JPG, PNG, WEBP, and AVIF.
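
For orientation, here is a minimal sketch (not the repository's own code) of the two duplicate-detection styles using the imagededup library, with AVIF decoding enabled through pillow-avif-plugin; the directory path and thresholds are placeholders:

    # Sketch only: PHash vs. CNN duplicate detection with imagededup.
    import pillow_avif  # noqa: F401 -- importing registers the AVIF plugin with Pillow
    from imagededup.methods import CNN, PHash

    # PHash: fast, catches near-identical copies (Hamming-distance threshold, 0-64).
    hash_duplicates = PHash().find_duplicates(image_dir="path/to/images",
                                              max_distance_threshold=10)

    # CNN: slower but more robust to resizing, recompression, and small edits
    # (cosine-similarity threshold between 0.0 and 1.0).
    cnn_duplicates = CNN().find_duplicates(image_dir="path/to/images",
                                           min_similarity_threshold=0.9)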

Installation

  1. Clone the Repository:

    git clone <repository-url>
    cd <repository-name>
  2. (Recommended) Create a Virtual Environment:

    # On Windows
    python -m venv venv
    .\venv\Scripts\Activate.ps1
    
    # On macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
  3. Install Dependencies (an optional environment check is sketched after this list):

    pip install -r requirements.txt

    Note: If requirements.txt is not available, install the dependencies manually:

    pip install imagededup pillow pillow-avif-plugin numpy scikit-learn torch torchvision torchaudio transformers

    For systems without a GPU, you can use a CPU-only version of PyTorch to speed up the installation:

    pip install torch --index-url https://download.pytorch.org/whl/cpu
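
After installation, an optional check (a suggestion, not part of the repository) can confirm that AVIF decoding is registered and show whether PyTorch sees a GPU:

    # Optional sanity check for the installed environment.
    import pillow_avif  # noqa: F401 -- importing registers the AVIF plugin with Pillow
    import torch
    from PIL import Image

    print("AVIF registered:", ".avif" in Image.registered_extensions())
    print("CUDA available:", torch.cuda.is_available())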

Scripts and Usage

The available workflows and scripts are detailed below.


Workflow 1: Index a Collection and Search for Duplicates

This is the most efficient method for managing large collections. You index everything once and then perform fast searches.

1. indexer.py

  • Purpose: Recursively scans a directory of images and creates an index file (image_database.pkl) containing the "fingerprints" (embeddings) of each image.
  • Usage:
    python indexer.py "path/to/your/collection"
  • Details: The script processes all images and saves the index in the same collection folder; this index is required by the next script. A sketch of the underlying idea follows this section.
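
The sketch below illustrates the idea behind the index (the real indexer.py may differ): compute one embedding per image with imagededup's CNN encoder and pickle the filename-to-embedding map as image_database.pkl inside the collection.

    # Sketch of the indexing idea only -- the real indexer.py may differ.
    import pickle
    from pathlib import Path

    from imagededup.methods import CNN

    collection = Path("path/to/your/collection")                 # placeholder path
    encodings = CNN().encode_images(image_dir=str(collection))   # {filename: embedding}

    with open(collection / "image_database.pkl", "wb") as f:
        pickle.dump(encodings, f)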

2. find_duplicate.py

  • Purpose: Finds duplicates of a single image by comparing it against the pre-computed index, which makes the lookup extremely fast (see the sketch at the end of this section).
  • Usage:
    python find_duplicate.py "path/to/image.jpg" "path/to/collection_with_index" [options]
  • Arguments:
    • input_image (Required): The path to the image for which you want to find duplicates.
    • collection_dir (Required): The path to the folder containing the image_database.pkl file.
  • Options:
    • -t, --threshold (Optional): Similarity threshold to consider an image a duplicate (range: 0.0 to 1.0). A higher value means the images must be more similar. Default: 0.98.
  • Example:
    python find_duplicate.py "D:\my_photos\vacation.jpg" "D:\my_photos" --threshold 0.99
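
A sketch of how such a lookup can work (the real find_duplicate.py may differ; paths and threshold are placeholders): embed the query image, then compare it against every stored embedding by cosine similarity and keep the matches at or above the threshold.

    # Sketch only -- assumes the pickled index produced by indexer.py.
    import pickle
    from pathlib import Path

    from imagededup.methods import CNN
    from sklearn.metrics.pairwise import cosine_similarity

    query_image = "path/to/image.jpg"
    collection = Path("path/to/collection_with_index")
    threshold = 0.98

    with open(collection / "image_database.pkl", "rb") as f:
        index = pickle.load(f)                                   # {filename: embedding}

    query_vec = CNN().encode_image(image_file=query_image).reshape(1, -1)
    for name, vec in index.items():
        score = cosine_similarity(query_vec, vec.reshape(1, -1))[0, 0]
        if score >= threshold:
            print(f"{name}: {score:.4f}")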

Workflow 2: Batch Deduplication

Useful for cleaning up a folder by finding all duplicate groups at once.

main.py

  • Purpose: Finds all groups of duplicate images in the imagenes folder and moves them to the revision folder for you to review.
  • Usage:
    python main.py
  • Configuration: This script does not use command-line arguments. It is configured by editing the variables at the top of the file (see the sketch after this list):
    • use_cnn: True to use the CNN method (more accurate, slower) or False to use PHash (faster, for nearly identical duplicates).
    • cnn_threshold / hash_threshold: The similarity threshold (CNN) or maximum hash distance (PHash), depending on the chosen method.
    • crear_subcarpetas: True to create a subfolder for each group of duplicates in revision, or False to move all files to revision with a numeric prefix.
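
As a rough guide, the configuration variables map onto imagededup calls roughly as follows (a sketch, not the actual main.py; the threshold values are placeholders):

    # Sketch only: how use_cnn and the thresholds select a detection method.
    from imagededup.methods import CNN, PHash

    use_cnn = True         # CNN = more accurate but slower; PHash = faster, near-identical only
    cnn_threshold = 0.90   # cosine similarity, 0.0-1.0 (higher = stricter)
    hash_threshold = 10    # Hamming distance, 0-64 (lower = stricter)

    if use_cnn:
        duplicates = CNN().find_duplicates(image_dir="imagenes",
                                           min_similarity_threshold=cnn_threshold)
    else:
        duplicates = PHash().find_duplicates(image_dir="imagenes",
                                             max_distance_threshold=hash_threshold)
    # 'duplicates' maps each filename to its detected duplicates; main.py then
    # moves those files into 'revision' (into one subfolder per group when
    # crear_subcarpetas is True).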

Workflow 3: Semantic Grouping with CLIP

This script goes beyond exact duplicates and groups images that look alike or share a theme.

group_clip.py

  • Purpose: Uses the OpenAI CLIP model to group semantically similar images from the imagenes folder and moves them to revision_sets.
  • Usage:
    python group_clip.py
  • Configuration: It has no arguments; it works directly on the imagenes (input) and revision_sets (output) folders. A sketch of the approach follows.
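
The sketch below shows the general approach (the real group_clip.py may differ): embed every image with CLIP and cluster the normalized embeddings so that similar images share a group. The checkpoint name and the clustering parameters are assumptions.

    # Sketch only -- checkpoint name and clustering parameters are assumptions.
    from pathlib import Path

    import pillow_avif  # noqa: F401 -- enable AVIF decoding in Pillow
    import torch
    from PIL import Image
    from sklearn.cluster import AgglomerativeClustering
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    exts = {".jpg", ".jpeg", ".png", ".webp", ".avif"}
    paths = sorted(p for p in Path("imagenes").iterdir() if p.suffix.lower() in exts)

    with torch.no_grad():
        inputs = processor(images=[Image.open(p).convert("RGB") for p in paths],
                           return_tensors="pt")
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)             # unit vectors for cosine distance

    labels = AgglomerativeClustering(n_clusters=None, distance_threshold=0.3,
                                     metric="cosine", linkage="average").fit_predict(feats.numpy())
    for path, label in zip(paths, labels):
        print(label, path.name)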

Utility Scripts

cnn.py

  • Purpose: A diagnostic script that uses CLIP to generate a heatmap visualizing the similarity between all images in the imagenes folder. Useful for analysis and experimentation (illustrated below).
  • Usage:
    python cnn.py
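
The heatmap boils down to a pairwise cosine-similarity matrix over the image embeddings. A minimal illustration (not the actual cnn.py; matplotlib is an assumed extra dependency that is not in the dependency list above):

    # Sketch only: visualize a pairwise similarity matrix as a heatmap.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Stand-in for the CLIP embeddings of the images in 'imagenes'
    # (one row per image); random data is used purely for illustration.
    embeddings = np.random.rand(8, 512)
    similarity = cosine_similarity(embeddings)

    plt.imshow(similarity, cmap="viridis")
    plt.colorbar(label="cosine similarity")
    plt.title("Pairwise image similarity")
    plt.show()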

Folder Structure

  • imagenes/: Default input folder for main.py and group_clip.py.
  • revision/: Output folder for duplicates found by main.py.
  • revision_sets/: Output folder for semantic groups found by group_clip.py.
  • image_database.pkl: Index file generated by indexer.py within the folder being indexed.

License

MIT.
