.. _difPy.build:

difPy.build
^^^^^^^^^^^

Before difPy can search, it must build its image repository by transforming the images in the provided directory into tensors. Invoking ``difPy.build()`` performs this step and returns a ``dif`` object that can be passed to :ref:`difPy.search` to start the search process.

``difPy.build`` supports the following parameters:

.. code-block:: python

   difPy.build(*directory, recursive=True, in_folder=False, limit_extensions=True, px_size=50, show_progress=True, processes=None)

.. csv-table::
   :header: Parameter,Input Type,Default Value,Other Values
   :widths: 10, 10, 10, 20
   :class: tight-table

   :ref:`directory`,"``str``, ``list``",,
   :ref:`recursive`,``bool``,``True``,``False``
   :ref:`in_folder`,``bool``,``False``,``True``
   :ref:`limit_extensions`,``bool``,``True``,``False``
   :ref:`px_size`,``int``,``50``,"``int`` >= 10 and <= 5000"
   :ref:`show_progress`,``bool``,``True``,``False``
   :ref:`processes`,``int``,``os.cpu_count()``,"``int`` >= 1 and <= ``os.cpu_count()``"

.. note::

   If you want to reuse the image tensors generated by difPy in your own application, you can access the generated repository through ``difPy.build._tensor_dictionary``. To map the image IDs back to the original filenames, use ``difPy.build._filename_dictionary``.

.. _directory:

directory (str, list)
+++++++++++++++++++++

difPy supports single and multi-folder search.

**Single Folder Search**:

.. code-block:: python

   import difPy
   dif = difPy.build("C:/Path/to/Folder/")
   search = difPy.search(dif)

**Multi Folder Search**:

.. code-block:: python

   import difPy
   dif = difPy.build(["C:/Path/to/Folder_A/", "C:/Path/to/Folder_B/", "C:/Path/to/Folder_C/", ... ])
   search = difPy.search(dif)

Folder paths can be specified as standalone Python strings, or within a list.

.. _recursive:

recursive (bool)
++++++++++++++++

By default, difPy will search for matching images recursively within the subdirectories of the :ref:`directory` parameter.
If set to ``False``, subdirectories will not be scanned.

``True`` = (default) searches recursively through all subdirectories in the directory paths

``False`` = disables recursive search through subdirectories in the directory paths

.. _in_folder:

in_folder (bool)
++++++++++++++++

By default, difPy will search for matches in the union of all directories specified in the :ref:`directory` parameter. To have difPy only search for matches within each folder separately, set ``in_folder`` to ``True``. The structure of the ``search.result`` output will be slightly different if ``in_folder`` is set to ``True`` (see :ref:`output`).

``True`` = searches for matches only within each individual directory, including its subdirectories

``False`` = (default) searches for matches in the union of all directories

.. _limit_extensions:

limit_extensions (bool)
+++++++++++++++++++++++

.. warning::

   Recommended not to change default value. Only adjust this value if you know what you are doing. The accuracy of difPy results cannot be guaranteed for file formats not covered by ``limit_extensions``.

By default, difPy only searches for images with a predefined file type. This speeds up the process, since difPy does not have to attempt to decode files it might not support. Nonetheless, you can let difPy try to decode other file types by setting ``limit_extensions`` to ``False``.

.. note::

   Predefined image types include: ``apng``, ``bw``, ``cdf``, ``cur``, ``dcx``, ``dds``, ``dib``, ``emf``, ``eps``, ``fli``, ``flc``, ``fpx``, ``ftex``, ``fits``, ``gd``, ``gd2``, ``gif``, ``gbr``, ``icb``, ``icns``, ``iim``, ``ico``, ``im``, ``imt``, ``j2k``, ``jfif``, ``jfi``, ``jif``, ``jp2``, ``jpe``, ``jpeg``, ``jpg``, ``jpm``, ``jpf``, ``jpx``, ``mic``, ``mpo``, ``msp``, ``nc``, ``pbm``, ``pcd``, ``pcx``, ``pgm``, ``png``, ``ppm``, ``psd``, ``pixar``, ``ras``, ``rgb``, ``rgba``, ``sgi``, ``spi``, ``spider``, ``sun``, ``tga``, ``tif``, ``tiff``, ``vda``, ``vst``, ``wal``, ``webp``, ``xbm``, ``xpm``.

``True`` = (default) difPy's search is limited to a set of predefined image types

``False`` = difPy searches through all the input files

difPy supports most popular image formats. Nevertheless, since it relies on the Pillow library for image decoding, the supported formats are restricted to the ones listed in the `Pillow Documentation`_. Unsupported file types will be marked as invalid and included in the process statistics output under ``invalid_files`` (see :ref:`Process Statistics`).

.. _Pillow Documentation: https://pillow.readthedocs.io/en/stable/handbook/image-file-formats.html

.. _px_size:

px_size (int)
+++++++++++++

.. note::

   Recommended not to change default value.

Absolute size in pixels (width x height) to which images are resized before being compared. The higher the ``px_size``, the more precise the comparison, but the more computational resources difPy needs to compare the images. The lower the ``px_size``, the faster but less precise the comparison becomes.

By default, ``px_size`` is set to ``50``.

**Manual setting**: ``px_size`` can be manually adjusted by setting it to any ``int`` between 10 and 5000.

.. _show_progress:

show_progress (bool)
++++++++++++++++++++

By default, difPy will show a progress bar of the running process.

``True`` = (default) displays the progress bar

``False`` = disables the progress bar

.. _processes:

processes (int)
+++++++++++++++

.. warning::

   Recommended not to change default value. Only adjust this value if you know what you are doing. See :ref:`Adjusting processes and chunksize`.

difPy leverages `Multiprocessing`_ to speed up the image comparison process, meaning multiple comparison tasks will be performed in parallel. The ``processes`` parameter defines the maximum number of worker processes (i.e. parallel tasks) to run when multiprocessing. The higher the value, the more performance can be achieved, but the more computing resources will be required.
To learn more, please refer to the `Python Multiprocessing documentation`_.

.. _Multiprocessing: https://docs.python.org/3/library/multiprocessing.html

.. _Python Multiprocessing documentation: https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool

By default, ``processes`` is set to `os.cpu_count()`_. This means that difPy will spawn as many processes as there are CPUs in your machine, which can lead to increased performance, but can also cause a **big computational overhead** depending on the size of your dataset. To reduce the required computing power, lower this value.

.. _os.cpu_count(): https://docs.python.org/3/library/os.html#os.cpu_count

**Manual setting**: ``processes`` can be manually adjusted by setting it to any ``int`` between 1 and ``os.cpu_count()``. The accepted values depend on those supported by the ``processes`` parameter of the Python Multiprocessing package. To learn more about this parameter, please refer to the `Python Multiprocessing documentation`_.
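
As a minimal sketch of the manual setting described above, the hypothetical helper ``clamp_processes`` below (not part of difPy) keeps a requested worker count within the range from 1 to ``os.cpu_count()`` before it is passed to ``difPy.build``. The folder path is a placeholder, and the ``difPy.build`` call is left commented out so that the snippet runs without difPy installed.

```python
import os


def clamp_processes(requested, cpu_total=None):
    """Hypothetical helper (not part of difPy): keep a worker count in [1, cpu_total]."""
    if cpu_total is None:
        # os.cpu_count() may return None when the count cannot be determined
        cpu_total = os.cpu_count() or 1
    return max(1, min(int(requested), cpu_total))


# Example: use roughly half of the available CPUs to limit the overhead
half_of_cpus = clamp_processes((os.cpu_count() or 1) // 2)

# Assumed usage with a placeholder folder path:
# import difPy
# dif = difPy.build("C:/Path/to/Folder/", processes=half_of_cpus)
```

Using about half of the CPUs is only an illustrative starting point; a suitable value depends on the size of the dataset and on what else is running on the machine.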