This is a collection of Python scripts designed to find duplicate images, index large collections for fast searches, and group visually similar images using various techniques (CNN, PHash, CLIP).
- Multiple Detection Methods: Utilizes Convolutional Neural Networks (CNN), Perceptual Hashing (PHash), and OpenAI's CLIP model for different types of similarity analysis.
- Fast Indexing Workflow: Scan a collection once with `indexer.py` to create an index, then find duplicates for a specific image almost instantly with `find_duplicate.py`.
- Batch Deduplication: Find all duplicate sets in a folder and move them to a `revision` directory for easy review with `main.py`.
- Semantic Grouping: Group images that are visually or conceptually similar (e.g., all sunset photos) using the powerful CLIP model with `group_clip.py`.
- Broad Format Support: Compatible with common image formats, including JPG, PNG, WEBP, and AVIF.
- Clone the Repository:

      git clone <repository-url>
      cd <repository-name>

- (Recommended) Create a Virtual Environment:

      # On Windows
      python -m venv venv
      .\venv\Scripts\Activate.ps1

      # On macOS/Linux
      python3 -m venv venv
      source venv/bin/activate

- Install Dependencies:

      pip install -r requirements.txt

  Note: If `requirements.txt` is not available, install the dependencies manually:

      pip install imagededup pillow pillow-avif-plugin numpy scikit-learn torch torchvision torchaudio transformers

  For systems without a GPU, you can use a CPU-only version of PyTorch to speed up the installation:

      pip install torch --index-url https://download.pytorch.org/whl/cpu
The available workflows and scripts are detailed below.
Indexing is the most efficient way to manage large collections: you index everything once and then run fast searches against the index.
- Purpose: `indexer.py` recursively scans a directory of images and creates an index file (`image_database.pkl`) containing the "fingerprints" (embeddings) of each image.
- Usage:

      python indexer.py "path/to/your/collection"

- Details: The script processes all images and saves the index inside the collection folder itself. `find_duplicate.py` requires this index.
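The indexing step described above can be sketched roughly as follows. This is a minimal illustration and not the actual `indexer.py`: the function names, the extension list, and the dictionary layout of the pickle are assumptions, and `placeholder_embedding` is a hypothetical stand-in for a real CNN embedding so that the sketch runs with the standard library only.

```python
import hashlib
import pickle
from pathlib import Path

# Assumed extension list, based on the formats named in this README.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp", ".avif"}
INDEX_NAME = "image_database.pkl"


def placeholder_embedding(path: Path, dims: int = 8) -> list[float]:
    """Hypothetical stand-in for a CNN embedding: a vector derived from the file bytes."""
    digest = hashlib.sha256(path.read_bytes()).digest()
    return [byte / 255.0 for byte in digest[:dims]]


def build_index(collection_dir: str, embed=placeholder_embedding) -> Path:
    """Recursively fingerprint every image and pickle the result in the collection folder."""
    root = Path(collection_dir)
    index = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            # Keys are paths relative to the collection root (an assumed layout).
            index[path.relative_to(root).as_posix()] = embed(path)
    index_path = root / INDEX_NAME
    with index_path.open("wb") as fh:
        pickle.dump(index, fh)
    return index_path
```

Passing the embedding function as a parameter keeps the directory walk and the serialization independent of whichever model produces the fingerprints.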
- Purpose: `find_duplicate.py` finds duplicates of a single image by comparing it against the pre-computed index. Because the collection's embeddings are already stored, the search is fast.
- Usage:

      python find_duplicate.py "path/to/image.jpg" "path/to/collection_with_index" [options]

- Arguments:
  - `input_image` (Required): The path to the image for which you want to find duplicates.
  - `collection_dir` (Required): The path to the folder containing the `image_database.pkl` file.
- Options:
  - `-t`, `--threshold` (Optional): Similarity threshold for considering an image a duplicate (range: 0.0 to 1.0). A higher value means the images must be more similar. Default: 0.98.
- Example:

      python find_duplicate.py "D:\my_photos\vacation.jpg" "D:\my_photos" --threshold 0.99
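To illustrate how a threshold like this behaves, here is a small sketch that compares a query embedding against an in-memory index. It assumes cosine similarity as the metric and a plain `{name: vector}` dictionary as the index; both are assumptions for illustration, and the real script may use a different measure or data layout.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(query, index, threshold=0.98):
    """Return (name, score) pairs scoring at or above the threshold, best match first."""
    scored = ((name, cosine_similarity(query, emb)) for name, emb in index.items())
    matches = [(name, score) for name, score in scored if score >= threshold]
    return sorted(matches, key=lambda match: match[1], reverse=True)
```

Raising the threshold narrows the result to closer matches, which is why the example command above uses `0.99` for a stricter search than the default.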
Batch deduplication is useful for cleaning up a folder by finding all duplicate groups at once.
- Purpose: Finds all groups of duplicate images in the `imagenes` folder and moves them to the `revision` folder for you to review.
- Usage:

      python main.py

- Configuration: This script does not use command-line arguments. It is configured by editing the variables at the top of the file:
  - `use_cnn`: `True` to use the CNN method (more accurate, slower) or `False` to use PHash (faster, for nearly identical duplicates).
  - `cnn_threshold` / `hash_threshold`: The similarity or distance threshold, depending on the chosen method.
  - `crear_subcarpetas`: `True` to create a subfolder for each group of duplicates in `revision`, or `False` to move all files to `revision` with a numeric prefix.
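The two output layouts controlled by `crear_subcarpetas` could look roughly like the sketch below. It assumes the duplicate groups have already been found and are passed in as lists of paths; the function name, the `group_001` folder naming, and the `001_` prefix format are hypothetical and may differ from what `main.py` actually produces.

```python
import shutil
from pathlib import Path


def move_duplicate_groups(groups, revision_dir, crear_subcarpetas=True):
    """Move each group of duplicate files into revision_dir and return the new paths."""
    revision = Path(revision_dir)
    revision.mkdir(parents=True, exist_ok=True)
    moved = []
    for number, group in enumerate(groups, start=1):
        if crear_subcarpetas:
            # One subfolder per duplicate group (hypothetical naming scheme).
            target_dir = revision / f"group_{number:03d}"
            target_dir.mkdir(exist_ok=True)
        for source in map(Path, group):
            if crear_subcarpetas:
                target = target_dir / source.name
            else:
                # Flat layout: the numeric prefix keeps a group's files adjacent when sorted.
                target = revision / f"{number:03d}_{source.name}"
            # Note: this sketch does not guard against file name collisions.
            shutil.move(str(source), str(target))
            moved.append(target)
    return moved
```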
`group_clip.py` goes beyond exact duplicates and groups images that look alike or share a theme.
- Purpose: Uses the OpenAI CLIP model to group semantically similar images from the `imagenes` folder and moves them to `revision_sets`.
- Usage:

      python group_clip.py

- Configuration: It has no arguments. It works directly on the `imagenes` (input) and `revision_sets` (output) folders.
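One way to turn pairwise similarities into groups is to link every pair that scores above a threshold and take the connected components. The sketch below shows that idea with a small union-find. It is an assumption about a plausible approach, not a description of how `group_clip.py` clusters images; the function name and the `0.8` default are hypothetical, and the similarity matrix is expected to be computed beforehand (for example from CLIP embeddings).

```python
def group_by_similarity(names, similarity, threshold=0.8):
    """Group names whose pairwise similarity links them, directly or through other images."""
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity[i][j] >= threshold:
                parent[find(j)] = find(i)  # Merge the two components.

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    # Images with no similar partner are left out of the result.
    return [members for members in groups.values() if len(members) > 1]
```

With this linking rule, two images can land in the same group without being directly similar, as long as a chain of similar images connects them. That suits loose thematic grouping, but it can merge unrelated images if the threshold is set too low.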
- Purpose: A diagnostic script that uses CLIP to generate a heatmap visualizing the similarity between all images in the `imagenes` folder. Useful for analysis and experimentation.
- Usage:

      python cnn.py
- `imagenes/`: Default input folder for `main.py` and `group_clip.py`.
- `revision/`: Output folder for duplicates found by `main.py`.
- `revision_sets/`: Output folder for semantic groups found by `group_clip.py`.
- `image_database.pkl`: Index file generated by `indexer.py` within the folder being indexed.
License: MIT.