Image Deduplication and Grouping Tools

This is a collection of Python scripts designed to find duplicate images, index large collections for fast searches, and group visually similar images using various techniques (CNN, PHash, CLIP).

Key Features

  • Multiple Detection Methods: Utilizes Convolutional Neural Networks (CNN), Perceptual Hashing (PHash), and OpenAI's CLIP model for different types of similarity analysis (a short sketch of the PHash/CNN approach follows this list).
  • Fast Indexing Workflow: Scan a collection once with indexer.py to create an index, then find duplicates for a specific image almost instantly with find_duplicate.py.
  • Batch Deduplication: Find all duplicate sets in a folder and move them to a revision directory for easy review with main.py.
  • Semantic Grouping: Group images that are visually or conceptually similar (e.g., all sunset photos) using the powerful CLIP model with group_clip.py.
  • Broad Format Support: Compatible with common image formats, including JPG, PNG, WEBP, and AVIF.
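
For orientation, here is a minimal sketch (not the repository's own code) of the two duplicate-detection styles using the imagededup library, with AVIF decoding enabled through pillow-avif-plugin; the directory path and thresholds are placeholders:

    # Sketch only: PHash vs. CNN duplicate detection with imagededup.
    import pillow_avif  # noqa: F401 -- importing registers the AVIF plugin with Pillow
    from imagededup.methods import CNN, PHash

    # PHash: fast, catches near-identical copies (Hamming-distance threshold, 0-64).
    hash_duplicates = PHash().find_duplicates(image_dir="path/to/images",
                                              max_distance_threshold=10)

    # CNN: slower but more robust to resizing, recompression, and small edits
    # (cosine-similarity threshold between 0.0 and 1.0).
    cnn_duplicates = CNN().find_duplicates(image_dir="path/to/images",
                                           min_similarity_threshold=0.9)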

Installation

  1. Clone the Repository:

    git clone <repository-url>
    cd <repository-name>
  2. (Recommended) Create a Virtual Environment:

    # On Windows
    python -m venv venv
    .\venv\Scripts\Activate.ps1
    
    # On macOS/Linux
    python3 -m venv venv
    source venv/bin/activate
  3. Install Dependencies (an optional environment check is sketched after this list):

    pip install -r requirements.txt

    Note: If requirements.txt is not available, install the dependencies manually:

    pip install imagededup pillow pillow-avif-plugin numpy scikit-learn torch torchvision torchaudio transformers

    For systems without a GPU, you can use a CPU-only version of PyTorch to speed up the installation:

    pip install torch --index-url https://download.pytorch.org/whl/cpu
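
After installation, an optional check (a suggestion, not part of the repository) can confirm that AVIF decoding is registered and show whether PyTorch sees a GPU:

    # Optional sanity check for the installed environment.
    import pillow_avif  # noqa: F401 -- importing registers the AVIF plugin with Pillow
    import torch
    from PIL import Image

    print("AVIF registered:", ".avif" in Image.registered_extensions())
    print("CUDA available:", torch.cuda.is_available())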

Scripts and Usage

The available workflows and scripts are detailed below.


Workflow 1: Index a Collection and Search for Duplicates

This is the most efficient method for managing large collections. You index everything once and then perform fast searches.

1. indexer.py

  • Purpose: Recursively scans a directory of images and creates an index file (image_database.pkl) containing the "fingerprints" (embeddings) of each image.
  • Usage:
    python indexer.py "path/to/your/collection"
  • Details: The script processes all images and saves the index in the same collection folder; this index is required by the next script. A sketch of the underlying idea follows this section.
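
The sketch below illustrates the idea behind the index (the real indexer.py may differ): compute one embedding per image with imagededup's CNN encoder and pickle the filename-to-embedding map as image_database.pkl inside the collection.

    # Sketch of the indexing idea only -- the real indexer.py may differ.
    import pickle
    from pathlib import Path

    from imagededup.methods import CNN

    collection = Path("path/to/your/collection")                 # placeholder path
    encodings = CNN().encode_images(image_dir=str(collection))   # {filename: embedding}

    with open(collection / "image_database.pkl", "wb") as f:
        pickle.dump(encodings, f)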

2. find_duplicate.py

  • Purpose: Finds duplicates of a single image by comparing it against the pre-computed index, which makes the lookup extremely fast (see the sketch at the end of this section).
  • Usage:
    python find_duplicate.py "path/to/image.jpg" "path/to/collection_with_index" [options]
  • Arguments:
    • input_image (Required): The path to the image for which you want to find duplicates.
    • collection_dir (Required): The path to the folder containing the image_database.pkl file.
  • Options:
    • -t, --threshold (Optional): Similarity threshold to consider an image a duplicate (range: 0.0 to 1.0). A higher value means the images must be more similar. Default: 0.98.
  • Example:
    python find_duplicate.py "D:\my_photos\vacation.jpg" "D:\my_photos" --threshold 0.99
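
A sketch of how such a lookup can work (the real find_duplicate.py may differ; paths and threshold are placeholders): embed the query image, then compare it against every stored embedding by cosine similarity and keep the matches at or above the threshold.

    # Sketch only -- assumes the pickled index produced by indexer.py.
    import pickle
    from pathlib import Path

    from imagededup.methods import CNN
    from sklearn.metrics.pairwise import cosine_similarity

    query_image = "path/to/image.jpg"
    collection = Path("path/to/collection_with_index")
    threshold = 0.98

    with open(collection / "image_database.pkl", "rb") as f:
        index = pickle.load(f)                                   # {filename: embedding}

    query_vec = CNN().encode_image(image_file=query_image).reshape(1, -1)
    for name, vec in index.items():
        score = cosine_similarity(query_vec, vec.reshape(1, -1))[0, 0]
        if score >= threshold:
            print(f"{name}: {score:.4f}")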

Workflow 2: Batch Deduplication

Useful for cleaning up a folder by finding all duplicate groups at once.

main.py

  • Purpose: Finds all groups of duplicate images in the imagenes folder and moves them to the revision folder for you to review.
  • Usage:
    python main.py
  • Configuration: This script does not use command-line arguments. It is configured by editing the variables at the top of the file (see the sketch after this list):
    • use_cnn: True to use the CNN method (more accurate, slower) or False to use PHash (faster, for nearly identical duplicates).
    • cnn_threshold / hash_threshold: The similarity threshold (CNN) or maximum hash distance (PHash), depending on the chosen method.
    • crear_subcarpetas: True to create a subfolder for each group of duplicates in revision, or False to move all files to revision with a numeric prefix.
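
As a rough guide, the configuration variables map onto imagededup calls roughly as follows (a sketch, not the actual main.py; the threshold values are placeholders):

    # Sketch only: how use_cnn and the thresholds select a detection method.
    from imagededup.methods import CNN, PHash

    use_cnn = True         # CNN = more accurate but slower; PHash = faster, near-identical only
    cnn_threshold = 0.90   # cosine similarity, 0.0-1.0 (higher = stricter)
    hash_threshold = 10    # Hamming distance, 0-64 (lower = stricter)

    if use_cnn:
        duplicates = CNN().find_duplicates(image_dir="imagenes",
                                           min_similarity_threshold=cnn_threshold)
    else:
        duplicates = PHash().find_duplicates(image_dir="imagenes",
                                             max_distance_threshold=hash_threshold)
    # 'duplicates' maps each filename to its detected duplicates; main.py then
    # moves those files into 'revision' (into one subfolder per group when
    # crear_subcarpetas is True).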

Workflow 3: Semantic Grouping with CLIP

This script goes beyond exact duplicates and groups images that look alike or share a theme.

group_clip.py

  • Purpose: Uses the OpenAI CLIP model to group semantically similar images from the imagenes folder and moves them to revision_sets.
  • Usage:
    python group_clip.py
  • Configuration: It has no arguments; it works directly on the imagenes (input) and revision_sets (output) folders. A sketch of the approach follows.
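
The sketch below shows the general approach (the real group_clip.py may differ): embed every image with CLIP and cluster the normalized embeddings so that similar images share a group. The checkpoint name and the clustering parameters are assumptions.

    # Sketch only -- checkpoint name and clustering parameters are assumptions.
    from pathlib import Path

    import pillow_avif  # noqa: F401 -- enable AVIF decoding in Pillow
    import torch
    from PIL import Image
    from sklearn.cluster import AgglomerativeClustering
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    exts = {".jpg", ".jpeg", ".png", ".webp", ".avif"}
    paths = sorted(p for p in Path("imagenes").iterdir() if p.suffix.lower() in exts)

    with torch.no_grad():
        inputs = processor(images=[Image.open(p).convert("RGB") for p in paths],
                           return_tensors="pt")
        feats = model.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)             # unit vectors for cosine distance

    labels = AgglomerativeClustering(n_clusters=None, distance_threshold=0.3,
                                     metric="cosine", linkage="average").fit_predict(feats.numpy())
    for path, label in zip(paths, labels):
        print(label, path.name)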

Utility Scripts

cnn.py

  • Purpose: A diagnostic script that uses CLIP to generate a heatmap visualizing the similarity between all images in the imagenes folder. Useful for analysis and experimentation (illustrated below).
  • Usage:
    python cnn.py
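
The heatmap boils down to a pairwise cosine-similarity matrix over the image embeddings. A minimal illustration (not the actual cnn.py; matplotlib is an assumed extra dependency that is not in the dependency list above):

    # Sketch only: visualize a pairwise similarity matrix as a heatmap.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.metrics.pairwise import cosine_similarity

    # Stand-in for the CLIP embeddings of the images in 'imagenes'
    # (one row per image); random data is used purely for illustration.
    embeddings = np.random.rand(8, 512)
    similarity = cosine_similarity(embeddings)

    plt.imshow(similarity, cmap="viridis")
    plt.colorbar(label="cosine similarity")
    plt.title("Pairwise image similarity")
    plt.show()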

Folder Structure

  • imagenes/: Default input folder for main.py and group_clip.py.
  • revision/: Output folder for duplicates found by main.py.
  • revision_sets/: Output folder for semantic groups found by group_clip.py.
  • image_database.pkl: Index file generated by indexer.py within the folder being indexed.

License

MIT.
