Skip to content Skip to sidebar Skip to footer

Creating A Dask Bag From A Generator

I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators (when evaluated) are too large for memory. delayed_array = [delayed(

Solution 1:

A decent subset of Dask.bag can work with large iterators. Your solution is almost perfect, but you'll need to provide a function that creates your generators when called rather than the generators themselves.

In [1]: import dask.bag as db

In [2]: import dask

In [3]: b = db.from_delayed([dask.delayed(range)(i) for i in [100000000] * 5])

In [4]: b
Out[4]: dask.bag<bag-fro..., npartitions=5>

In [5]: b.take(5)
Out[5]: (0, 1, 2, 3, 4)

In [6]: b.sum()
Out[6]: <dask.bag.core.Item at 0x7f852d8737b8>

In [7]: b.sum().compute()
Out[7]: 24999999750000000

However, there are certainly ways that this can bite you. Some slightly more complex dask bag operations do need to make partitions concrete, which could blow out RAM.

Post a Comment for "Creating A Dask Bag From A Generator"