Creating A Dask Bag From A Generator
I would like to create a dask.Bag (or dask.Array) from a list of generators. The gotcha is that the generators (when evaluated) are too large for memory. delayed_array = [delayed(
Solution 1:
A decent subset of Dask.bag can work with large iterators. Your solution is almost perfect, but you'll need to provide a function that creates your generators when called rather than the generators themselves.
In [1]: import dask.bag as db
In [2]: import dask
In [3]: b = db.from_delayed([dask.delayed(range)(i) for i in [100000000] * 5])
In [4]: b
Out[4]: dask.bag<bag-fro..., npartitions=5>
In [5]: b.take(5)
Out[5]: (0, 1, 2, 3, 4)
In [6]: b.sum()
Out[6]: <dask.bag.core.Item at 0x7f852d8737b8>
In [7]: b.sum().compute()
Out[7]: 24999999750000000
However, there are certainly ways that this can bite you. Some slightly more complex dask bag operations do need to make partitions concrete, which could blow out RAM.
Post a Comment for "Creating A Dask Bag From A Generator"