difPy Guide
v4.2.1

difPy.search

After the dif object has been built using difPy.build, the search can be started with difPy.search.

When difPy.search() is invoked, difPy compares the images to find duplicates or similar images, based on the MSE (Mean Squared Error) between the two image tensors of each pair. The target similarity, i.e. the MSE threshold, is set with the similarity (str, int, float) parameter.

After the search is completed, further actions can be performed using search.move_to and search.delete.

difPy.search(difPy_obj, similarity='duplicates', same_dim=True, rotate=True, processes=None, chunksize=None, show_progress=True, logs=True)
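As a minimal sketch of how the build and search steps fit together, the helper below wraps both calls. The helper name `run_search` and its `folder` argument are hypothetical and only serve the illustration; the keyword names are taken from the signature above. The import is deferred into the function so the snippet can be loaded even where difPy is not installed.

```python
def run_search(folder, similarity="duplicates"):
    """Build a dif object for `folder`, then search it (hypothetical helper)."""
    import difPy  # deferred import; assumes difPy has been installed

    # Step 1: build the dif object from the image folder.
    dif = difPy.build(folder)
    # Step 2: run the search on it, using keyword names from the signature above.
    return difPy.search(dif, similarity=similarity, same_dim=True, rotate=True)
```

The returned search object is what search.move_to and search.delete operate on.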

difPy.search supports the following parameters:

Parameter                    | Input Type      | Default Value  | Other Values
-----------------------------|-----------------|----------------|-------------------------------
difPy_obj                    | difPy_obj       |                |
similarity (str, int, float) | str, int, float | 'duplicates'   | 'similar', int or float >= 0
same_dim (bool)              | bool            | True           | False
rotate (bool)                | bool            | True           | False
show_progress (bool)         | bool            | True           | False
processes (int)              | int             | os.cpu_count() | int >= 1 and <= os.cpu_count()
chunksize (int)              | int             | None           | int >= 1

difPy_obj

The required difPy_obj parameter must point to the dif object that was built by invoking difPy.build.

similarity (str, int, float)

difPy compares the images to find duplicates or similar images, based on the MSE (Mean Squared Error) between the two image tensors of each pair. The target similarity, i.e. the MSE threshold, is set with the similarity parameter.
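As a rough illustration of the metric, the sketch below computes the MSE between two tiny "images" represented as flat lists of pixel values. The pixel values are made up, and this is not difPy's own implementation, which works on image tensors; it only shows why an exact copy yields an MSE of 0 while a slightly altered copy yields a small positive value.

```python
def mse(image_a, image_b):
    """Mean squared error between two equally sized flat lists of pixel values."""
    if len(image_a) != len(image_b):
        raise ValueError("images must contain the same number of pixel values")
    return sum((a - b) ** 2 for a, b in zip(image_a, image_b)) / len(image_a)


original = [10, 20, 30, 40]
exact_copy = [10, 20, 30, 40]
recompressed = [11, 19, 31, 42]  # made-up small differences, e.g. after re-encoding

print(mse(original, exact_copy))    # 0.0
print(mse(original, recompressed))  # 1.75
```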

"duplicates" = (default) searches for duplicates. MSE threshold is set to 0.

"similar" = searches for similar images. MSE threshold is set to 5.

The search for similar images can be useful when searching for duplicate files that:

  • have different file types (e.g. imageA.png has a duplicate imageA.jpg)

  • have different file sizes (e.g. imageA.png (100MB) has a duplicate imageA.png (50MB))

  • are cropped versions of one another (e.g. imageA.png is a cropped version of imageB.png); in this case, same_dim (bool) should be set to False

In these cases, the MSE between the two image tensors might not be exactly 0, so the images would not be classified as duplicates even though they are. Setting similarity to "similar" searches for duplicates with a certain tolerance, which increases the likelihood of finding duplicate images of different file types and sizes.

Manual setting: the MSE threshold can be adjusted manually by setting the similarity parameter to any int or float >= 0. difPy will then return image pairs whose MSE is equal to or lower than the specified threshold.
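The rules above can be summarized in a short sketch that maps a similarity argument to an MSE threshold and then applies it. The function names `resolve_mse_threshold` and `is_match` are hypothetical and not part of difPy; the preset values 0 and 5 are the ones stated in this section.

```python
def resolve_mse_threshold(similarity):
    """Map a `similarity` argument to an MSE threshold (illustrative only)."""
    presets = {"duplicates": 0, "similar": 5}  # preset values from this section
    if isinstance(similarity, str):
        if similarity not in presets:
            raise ValueError(f"unknown similarity preset: {similarity!r}")
        return presets[similarity]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(similarity, bool) or not isinstance(similarity, (int, float)):
        raise TypeError("similarity must be a str, int or float")
    if similarity < 0:
        raise ValueError("similarity must be >= 0")
    return similarity


def is_match(mse_value, similarity):
    """A pair matches when its MSE is equal to or lower than the threshold."""
    return mse_value <= resolve_mse_threshold(similarity)


print(is_match(1.75, "duplicates"))  # False
print(is_match(1.75, "similar"))     # True
print(is_match(1.75, 1.5))           # False
```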

same_dim (bool)

By default, when searching for matches, difPy assumes images to have the same dimensions (width x height).

True = (default) assumes matches have the same dimensions

False = assumes matches can have different dimensions

Note

same_dim should be set to False if you are searching for image matches that have different file types (e.g. imageA.png is a duplicate of imageA.jpg) and/or if images are cropped versions of one another.
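One way to picture the effect of same_dim is as a pre-check on each candidate pair. The sketch below is an assumption about what "assumes matches have the same dimensions" means in practice, not a description of difPy's internals; the function name `should_compare` is hypothetical.

```python
def should_compare(shape_a, shape_b, same_dim=True):
    """Decide whether two images, given as (width, height), are compared at all.

    Assumption: with same_dim=True, pairs with differing dimensions are skipped.
    """
    return (not same_dim) or shape_a == shape_b


full_size = (1920, 1080)
cropped = (1600, 900)

print(should_compare(full_size, cropped))                  # False
print(should_compare(full_size, cropped, same_dim=False))  # True
```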

rotate (bool)

By default, difPy rotates the images on comparison. In total, 3 rotations are performed: 90°, 180° and 270°.

True = (default) rotates images on comparison

False = images are not rotated before comparison
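The idea behind rotate can be sketched with small 2D grids standing in for images. This is a simplified illustration, not difPy's implementation: it uses exact equality instead of an MSE threshold, and the names `rotate_90` and `matches_with_rotation` are hypothetical.

```python
def rotate_90(grid):
    """Rotate a 2D grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]


def matches_with_rotation(image_a, image_b, rotate=True):
    """Compare image_a to image_b as-is and, if rotate is True, to image_b
    rotated by 90, 180 and 270 degrees."""
    if image_a == image_b:
        return True
    if not rotate:
        return False
    candidate = image_b
    for _ in range(3):  # 90, 180 and 270 degree rotations
        candidate = rotate_90(candidate)
        if image_a == candidate:
            return True
    return False


upright = [[1, 2], [3, 4]]
turned = [[3, 1], [4, 2]]  # `upright` rotated by 90 degrees clockwise

print(matches_with_rotation(upright, turned))                # True
print(matches_with_rotation(upright, turned, rotate=False))  # False
```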

show_progress (bool)

See show_progress (bool) in the difPy.build section.

processes (int)

See processes (int) in the difPy.build section.

chunksize (int)

Warning

It is recommended not to change the default value. Only adjust this value if you know what you are doing. See Adjusting ‘processes’ and ‘chunksize’.

chunksize is only used when dealing with image datasets of more than 5,000 images. See the “Using difPy with Large Datasets” section for further details.

difPy uses a different comparison algorithm depending on the size of the input dataset. If the dataset contains more than 5,000 images, the chunking algorithm is used, which relies on generators and vectorization for more efficient computation on large datasets. The chunksize parameter defines how many chunks of image sets are compared at once. Therefore, the higher the chunksize value, the faster the computation, but the higher the memory consumption.
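The trade-off between speed and memory can be illustrated with a generator that hands out comparison work in batches. This is a simplified sketch under stated assumptions, not difPy's chunking algorithm: here chunksize is treated as the number of image pairs per batch, and the function name `chunked_pairs` is hypothetical.

```python
from itertools import combinations, islice


def chunked_pairs(image_ids, chunksize):
    """Lazily yield lists of up to `chunksize` image pairs to compare.

    Only one batch is held in memory at a time, so a larger chunksize means
    fewer batches but more memory per batch.
    """
    if chunksize < 1:
        raise ValueError("chunksize must be >= 1")
    pairs = combinations(image_ids, 2)  # lazy iterator over all unique pairs
    while True:
        chunk = list(islice(pairs, chunksize))
        if not chunk:
            return
        yield chunk


batches = list(chunked_pairs(["a", "b", "c", "d"], chunksize=4))
print([len(batch) for batch in batches])  # [4, 2]
```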

By default, the chunksize parameter is set automatically to a value relative to the size of the dataset. It can also be adjusted manually to provide more control over multiprocessing strategies and memory consumption.

By default, chunksize is set to None, which implies a value of 1,000,000 / number of images in the dataset. The parameter must be >= 1.

Manual setting: chunksize can be manually adjusted by setting it to any int >= 1.
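The default rule can be written out as a small calculation. The sketch below assumes integer division and a lower bound of 1, neither of which is confirmed by this page; the function name `effective_chunksize` is hypothetical.

```python
def effective_chunksize(n_images, chunksize=None):
    """Return the chunksize to use for a dataset of `n_images` images.

    Assumptions: the default is 1,000,000 // n_images, floored at 1.
    """
    if n_images < 1:
        raise ValueError("n_images must be >= 1")
    if chunksize is None:
        return max(1, 1_000_000 // n_images)
    if chunksize < 1:
        raise ValueError("chunksize must be >= 1")
    return chunksize


print(effective_chunksize(10_000))                 # 100
print(effective_chunksize(2_000_000))              # 1
print(effective_chunksize(10_000, chunksize=250))  # 250
```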

© Copyright 2025, Elise Landman.
