The loky module manages a pool of workers that can be reused across time. It provides a robust and dynamic implementation of the ProcessPoolExecutor and a function get_reusable_executor() which hides the pool management under the hood.
loky.get_reusable_executor(max_workers=None, context=None, timeout=10, kill_workers=False, reuse='auto', job_reducers=None, result_reducers=None, initializer=None, initargs=())

Return the current ReusableExecutor instance.
Start a new instance if it has not been started already or if the previous instance was left in a broken state.
If the previous instance does not have the requested number of workers, the executor is dynamically resized to adjust the number of workers prior to returning.
Reusing a singleton instance spares the overhead of starting new worker processes and importing common Python packages each time.
max_workers
controls the maximum number of tasks that can be running in
parallel in worker processes. By default this is set to the number of
CPUs on the host.
Setting timeout (in seconds) makes idle workers automatically shut down so as to release system resources. New workers are respawned upon submission of new tasks so that max_workers are available to accept the newly submitted tasks. Setting timeout to around 100 times the time required to spawn new processes and import packages in them (on the order of 100ms) ensures that the overhead of spawning workers is negligible.
Setting kill_workers=True
makes it possible to forcibly interrupt
previously spawned jobs to get a new instance of the reusable executor
with new constructor argument values.
The job_reducers and result_reducers arguments are used to customize the pickling of tasks and results sent to the executor.
When provided, the initializer is run first in newly spawned processes with the arguments initargs.
To share function definitions across multiple Python processes, it is necessary to rely on a serialization protocol. The standard protocol in Python is pickle but its default implementation in the standard library has several limitations. For instance, it cannot serialize functions which are defined interactively or in the __main__ module.
To avoid this limitation, loky relies on cloudpickle when it is present. cloudpickle is a fork of the pickle protocol which allows the serialization of a greater number of objects; it can be installed using pip install cloudpickle. As this library is slower than the pickle module in the standard library, by default loky uses it only to serialize objects which are detected to be in the __main__ module.
There are three ways to customize the serialization in loky:

- Using the arguments job_reducers and result_reducers, it is possible to register custom reducers for the serialization process.
- Setting the environment variable LOKY_PICKLER to an available and valid serialization module. This module must present a valid Pickler object. Setting LOKY_PICKLER=cloudpickle will force loky to serialize everything with cloudpickle instead of just serializing the objects detected to be in the __main__ module.
- Wrapping unpicklable objects with the loky.wrap_non_picklable_objects decorator. In this case, all other objects are serialized as in the default behavior and the wrapped object is pickled through cloudpickle.

The benefits and drawbacks of each method are highlighted in this example.
loky.wrap_non_picklable_objects(obj)

Wrapper for non-picklable objects to use cloudpickle to serialize them.

Note that this wrapper tends to slow down the serialization process as it is done with cloudpickle, which is typically slower than pickle. The proper way to solve serialization issues is to avoid defining functions and objects in the main scripts and to implement __reduce__ functions for complex classes.
Processes start methods in loky

The API in loky provides a set_start_method() function to set the default start_method, which controls the way Process objects are started. The available methods are {'loky', 'loky_int_main', 'spawn'}. On unix, the start methods {'fork', 'forkserver'} are also available.

Note that loky is not compatible with the multiprocessing.set_start_method() function. The default start method needs to be set with the provided function to ensure proper behavior.
The memory size of long running worker processes can increase indefinitely if a memory leak is created. This can result in processes being shut down by the OS if those leaks are not resolved. To prevent this, loky provides leak detection, memory cleanups, and worker shutdown.
If psutil
is installed, each worker periodically [1] checks its
memory usage after it completes its task. If the usage is found to be
unusual [2], an additional gc.collect()
event is triggered to remove
objects with potential cyclic references.
If even after that, the memory usage of a process worker remains too high,
it will shut down safely, and a fresh process will be automatically spawned by
the executor.
If psutil is not installed, there is no easy way to monitor worker processes' memory usage. gc.collect() events will still be called periodically [1] inside each worker, but there is no guarantee that a leak is not happening.
Footnotes
[1] (1, 2) Every 1 second. This constant is defined in loky.process_executor._MEMORY_LEAK_CHECK_DELAY.
[2] An increase of 100MB compared to a reference, which is defined as the residual memory usage of the worker after it completed its first task.