difPy.build

Before difPy can perform any search, it needs to build its image repository by transforming the images in the provided directory into tensors. This happens when difPy.build() is invoked.

Upon completion, difPy.build() returns a dif object that can be used in difPy.search to start the search process.

difPy.build supports the following parameters:

difPy.build(*directory, recursive=True, in_folder=False, limit_extensions=True, px_size=50, show_progress=True, processes=None)

| Parameter | Input Type | Default Value | Other Values |
| --- | --- | --- | --- |
| directory (str, list) | str, list | | |
| recursive (bool) | bool | True | False |
| in_folder (bool) | bool | False | True |
| limit_extensions (bool) | bool | True | False |
| px_size (int) | int | 50 | int >= 10 and <= 5000 |
| show_progress (bool) | bool | True | False |
| processes (int) | int | os.cpu_count() | int >= 1 and <= os.cpu_count() |

Note

If you want to reuse the image tensors generated by difPy in your own application, you can access the generated repository by calling difPy.build._tensor_dictionary. To reverse the image IDs to the original filenames, use difPy.build._filename_dictionary.
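As a minimal sketch of how the two dictionaries might be combined, the helper below maps each original filename to its tensor. It assumes both dictionaries are keyed by the same image IDs, which the note implies but does not state. The function name tensors_by_filename and the stand-in object are hypothetical and only illustrate the lookup; with difPy installed, you would pass in the object returned by difPy.build instead.

```python
from types import SimpleNamespace


def tensors_by_filename(dif):
    """Map each original filename to its tensor, given a built dif object."""
    # Assumption: both dictionaries are keyed by the same image IDs.
    return {
        dif._filename_dictionary[image_id]: tensor
        for image_id, tensor in dif._tensor_dictionary.items()
    }


# Hypothetical stand-in for a built dif object, so the sketch runs on its own.
fake_dif = SimpleNamespace(
    _tensor_dictionary={0: [[0, 0], [1, 1]], 1: [[2, 2], [3, 3]]},
    _filename_dictionary={0: "a.jpg", 1: "b.jpg"},
)

result = tensors_by_filename(fake_dif)
```

Both attributes start with an underscore, so they are presumably internal and may change between difPy versions.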

directory (str, list)

difPy supports single and multi-folder search.

Single Folder Search:

import difPy
dif = difPy.build("C:/Path/to/Folder/")
search = difPy.search(dif)

Multi Folder Search:

import difPy
dif = difPy.build(["C:/Path/to/Folder_A/", "C:/Path/to/Folder_B/", "C:/Path/to/Folder_C/", ... ])
search = difPy.search(dif)

Folder paths can be specified as standalone Python strings, or within a list.
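To illustrate why both call styles can be treated the same way, here is a rough sketch of how a variadic directory argument could be flattened into a single list of paths. This is not difPy's actual implementation; normalize_directories is a hypothetical helper written only for this example.

```python
def normalize_directories(*directory):
    """Flatten any mix of path strings and lists of path strings into one list."""
    paths = []
    for entry in directory:
        if isinstance(entry, str):
            # A standalone string is a single folder path.
            paths.append(entry)
        else:
            # Assumption: any non-string argument is an iterable of path strings.
            paths.extend(entry)
    return paths


# Standalone strings and a single list yield the same flat list of folders.
as_strings = normalize_directories("C:/Path/to/Folder_A/", "C:/Path/to/Folder_B/")
as_list = normalize_directories(["C:/Path/to/Folder_A/", "C:/Path/to/Folder_B/"])
```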

recursive (bool)

By default, difPy will search for matching images recursively within the subdirectories of the directory (str, list) parameter. If set to False, subdirectories will not be scanned.

True = (default) searches recursively through all subdirectories in the directory paths

False = disables recursive search through subdirectories in the directory paths
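The difference between the two settings can be sketched with the standard library alone. The example below is an illustration of recursive versus non-recursive file listing, not difPy's own code; list_files is a hypothetical helper, and the directory tree is created in a temporary folder.

```python
import tempfile
from pathlib import Path


def list_files(directory, recursive=True):
    """Return the files under directory, optionally descending into subdirectories."""
    root = Path(directory)
    # rglob("*") walks the whole tree; glob("*") stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(path for path in candidates if path.is_file())


# Build a small throwaway tree: one file at the top level, one in a subfolder.
with tempfile.TemporaryDirectory() as tmp:
    top = Path(tmp)
    (top / "a.jpg").touch()
    (top / "sub").mkdir()
    (top / "sub" / "b.jpg").touch()

    recursive_names = [p.name for p in list_files(top, recursive=True)]
    flat_names = [p.name for p in list_files(top, recursive=False)]
```

With recursive=False, the file in the subfolder is never seen, so it cannot be reported as a match.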

in_folder (bool)

By default, difPy will search for matches in the union of all directories specified in the directory (str, list) parameter. To have difPy only search for matches within each folder separately, set in_folder to True. The structure of the search.result output will be slightly different if in_folder is set to True (see Output).

True = searches for matches only within each individual directory, including its subdirectories

False = (default) searches for matches in the union of all directories
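The effect of in_folder on which images get compared can be sketched as follows. This is a simplified model under stated assumptions, not difPy's implementation: candidate_pairs and the sample filenames are hypothetical, and real comparisons operate on tensors rather than names.

```python
from itertools import combinations


def candidate_pairs(images_by_folder, in_folder=False):
    """Return the image pairs that would be compared for a given in_folder setting."""
    if in_folder:
        # Compare images only with others from the same folder.
        return {
            folder: list(combinations(images, 2))
            for folder, images in images_by_folder.items()
        }
    # Pool every image together, then compare across the union.
    union = [image for images in images_by_folder.values() for image in images]
    return list(combinations(union, 2))


images_by_folder = {
    "Folder_A": ["a1.jpg", "a2.jpg"],
    "Folder_B": ["b1.jpg", "b2.jpg"],
}

union_pairs = candidate_pairs(images_by_folder, in_folder=False)
per_folder_pairs = candidate_pairs(images_by_folder, in_folder=True)
```

In this sketch the union yields six candidate pairs, while the per-folder mode yields one pair per folder and groups them by folder, loosely mirroring why the result structure differs between the two modes.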

limit_extensions (bool)

Warning

Recommended not to change the default value. Only adjust this value if you know what you are doing. difPy result accuracy cannot be guaranteed for file formats not covered by “limit_extensions”.

By default, difPy only searches for images with a predefined file type. This speeds up the process, since difPy does not have to attempt to decode files it might not support. Nonetheless, you can let difPy try to decode other file types by setting limit_extensions to False.

Note

Predefined image types include: apng, bw, cdf, cur, dcx, dds, dib, emf, eps, fli, flc, fpx, ftex, fits, gd, gd2, gif, gbr, icb, icns, iim, ico, im, imt, j2k, jfif, jfi, jif, jp2, jpe, jpeg, jpg, jpm, jpf, jpx, mic, mpo, msp, nc, pbm, pcd, pcx, pgm, png, ppm, psd, pixar, ras, rgb, rgba, sgi, spi, spider, sun, tga, tif, tiff, vda, vst, wal, webp, xbm, xpm.

True = (default) difPy’s search is limited to a set of predefined image types

False = difPy searches through all the input files

difPy supports most popular image formats. Nevertheless, since it relies on the Pillow library for image decoding, the supported formats are restricted to the ones listed in the Pillow Documentation. Unsupported file types will be marked as invalid and included in the process statistics output under invalid_files (see Process Statistics).
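The extension filter can be pictured as a simple pre-selection step before any decoding is attempted. The sketch below is an assumption-laden illustration, not difPy's code: select_files is a hypothetical helper, PREDEFINED_EXTENSIONS holds only a small subset of the types listed above, and case-insensitive matching is assumed.

```python
from pathlib import Path

# Assumption: a small subset of the predefined types, for illustration only.
PREDEFINED_EXTENSIONS = {"gif", "jpeg", "jpg", "png", "tiff", "webp"}


def select_files(filenames, limit_extensions=True):
    """Return the files that would be handed to the image decoder."""
    if not limit_extensions:
        # Every file is attempted; undecodable ones would end up as invalid.
        return list(filenames)
    # Assumption: extensions are matched case-insensitively.
    return [
        name
        for name in filenames
        if Path(name).suffix.lower().lstrip(".") in PREDEFINED_EXTENSIONS
    ]


files = ["cat.JPG", "notes.txt", "dog.png", "README"]
limited = select_files(files, limit_extensions=True)
unlimited = select_files(files, limit_extensions=False)
```

With the filter on, only the two image files are kept; with it off, all four files are passed along and the decoder has to reject the ones it cannot read.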

px_size (int)

Note

Recommended not to change default value.

Absolute size in pixels (width x height) of the images before being compared. The higher the px_size, the more precise the comparison, but in turn more computational resources are required for difPy to compare the images. The lower the px_size, the faster, but the more imprecise the comparison process gets.

By default, px_size is set to 50.

Manual setting: px_size can be manually adjusted by setting it to any int between 10 and 5000.
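A quick back-of-the-envelope calculation shows why larger px_size values cost more. The sketch below counts the values in one resized image tensor, assuming a square px_size x px_size image with three color channels; the channel count and the tensor_values helper are assumptions made for this example, while the valid range comes from the parameter table.

```python
def tensor_values(px_size=50, channels=3):
    """Count the values in one resized image tensor of px_size x px_size pixels."""
    # Range taken from the parameter table: int >= 10 and <= 5000.
    if not 10 <= px_size <= 5000:
        raise ValueError("px_size must be between 10 and 5000")
    # Assumption: three color channels per pixel (RGB).
    return px_size * px_size * channels


default_cost = tensor_values(50)    # 50 * 50 * 3 = 7500 values
doubled_cost = tensor_values(100)   # 100 * 100 * 3 = 30000 values
```

Under these assumptions, doubling px_size quadruples the number of values per image, so both memory use and per-comparison work grow roughly with the square of px_size.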

show_progress (bool)

By default, difPy will show a progress bar of the running process.

True = (default) displays the progress bar

False = disables the progress bar

processes (int)

Warning

Recommended not to change default value. Only adjust this value if you know what you are doing. See Adjusting ‘processes’ and ‘chunksize’.

difPy leverages Multiprocessing to speed up the image comparison process, meaning multiple comparison tasks are performed in parallel. The processes parameter defines the maximum number of worker processes (i.e. parallel tasks) to use when multiprocessing. The higher the value, the more performance can be achieved, but the more computing resources are required. To learn more, please refer to the Python Multiprocessing documentation.

By default, processes is set to os.cpu_count(). This means that difPy will spawn as many processes as there are CPUs in your machine, which can increase performance but can also cause a large computational overhead, depending on the size of your dataset. To reduce the required computing power, lower this value.

Manual setting: processes can be manually adjusted by setting it to any int from 1 up to os.cpu_count(). The accepted values depend on what the processes parameter of the Python Multiprocessing package supports.
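The default and the valid range can be expressed as a small resolution step. The sketch below is a hypothetical helper, not part of difPy: resolve_processes falls back to os.cpu_count() when no value is given and rejects values outside the range from the parameter table. It does not spawn any workers itself.

```python
import os


def resolve_processes(processes=None):
    """Return the number of worker processes to use, validating manual values."""
    # os.cpu_count() can return None when the CPU count cannot be determined.
    cpu_total = os.cpu_count() or 1
    if processes is None:
        # Default: one worker per CPU.
        return cpu_total
    # Range taken from the parameter table: int >= 1 and <= os.cpu_count().
    if not 1 <= processes <= cpu_total:
        raise ValueError(f"processes must be between 1 and {cpu_total}")
    return processes


default_workers = resolve_processes()
single_worker = resolve_processes(1)
```

A value resolved this way is the kind of number that would typically be passed as the processes argument of multiprocessing.Pool; whether difPy does exactly this internally is an assumption.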