
Understanding the Task Graph

Every operation in Dask, whether on a dataframe, an array, or a delayed object, creates tasks. Each task knows its dependencies, so Dask can build a graph recording which tasks depend on which. Put another way, a task graph is a DAG (Directed Acyclic Graph). For more detail, see the Dask documentation on task graphs and high level graphs.
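To make the idea of a task graph concrete, here is a minimal sketch in plain Python. It borrows the dict-of-tasks convention from Dask's graph specification (a key maps either to a literal value or to a tuple whose first element is a callable), but the `dependencies` and `run` helpers are hypothetical stand-ins written for illustration. They are not Dask's scheduler and only handle string keys with hashable arguments.

```python
from graphlib import TopologicalSorter
from operator import add, mul

# A hand-written graph: each key maps to a literal value or to a
# task tuple of the form (callable, *arguments).
graph = {
    "x": 1,
    "y": 2,
    "total": (add, "x", "y"),
    "scaled": (mul, "total", 10),
}


def is_task(value):
    """A task is a tuple whose first element is callable."""
    return isinstance(value, tuple) and len(value) > 0 and callable(value[0])


def dependencies(graph, value):
    """Return the keys of `graph` that a task refers to (hypothetical helper)."""
    if not is_task(value):
        return []
    return [arg for arg in value[1:] if isinstance(arg, str) and arg in graph]


def run(graph):
    """Execute every task after its dependencies (illustration only)."""
    # TopologicalSorter expects {node: predecessors} and raises CycleError
    # if the graph is not acyclic.
    order = TopologicalSorter(
        {key: dependencies(graph, value) for key, value in graph.items()}
    ).static_order()
    results = {}
    for key in order:
        value = graph[key]
        if is_task(value):
            func, *args = value
            # Replace argument keys with their computed results; other
            # arguments (such as the literal 10) pass through unchanged.
            resolved = [
                results[a] if isinstance(a, str) and a in results else a
                for a in args
            ]
            results[key] = func(*resolved)
        else:
            results[key] = value
    return results


results = run(graph)
print(results["scaled"])
```

Because "scaled" depends on "total", which in turn depends on "x" and "y", any valid execution order has to compute them in that sequence. This is the property that being a DAG guarantees.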

In [1]:

import dask.array as da

arr = da.random.random(size=(1_000, 1_000), chunks=(250, 250))
arr.dask

Out[1]:

HighLevelGraph

HighLevelGraph with 1 layers.

Layer 0: random_sample

random_sample-247f4bfed181679f520df03050ae4862

layer_type	MaterializedLayer
is_materialized	True
shape	(1000, 1000)
dtype	float64
chunksize	(250, 250)
type	dask.array.core.Array
chunk_type	numpy.ndarray

arr is itself backed by a graph, and each computation on arr produces a new graph that builds on it.

In [2]:

result = (arr + arr.T).sum(axis=0).mean()
result.dask

Out[2]:

HighLevelGraph

HighLevelGraph with 7 layers.

Layer 0: random_sample

random_sample-247f4bfed181679f520df03050ae4862

layer_type	MaterializedLayer
is_materialized	True
shape	(1000, 1000)
dtype	float64
chunksize	(250, 250)
type	dask.array.core.Array
chunk_type	numpy.ndarray

Layer 1: transpose

transpose-528fb1b5144ae969909c5aa7a6fc18ee

layer_type	Blockwise
is_materialized	False
shape	(1000, 1000)
dtype	float64
chunksize	(250, 250)
type	dask.array.core.Array
chunk_type	numpy.ndarray

Layer 2: add

add-9091506d774063d175eb93d5dd9699cd

layer_type	Blockwise
is_materialized	False
shape	(1000, 1000)
dtype	float64
chunksize	(250, 250)
type	dask.array.core.Array
chunk_type	numpy.ndarray

Layer 3: sum

sum-2d3c814648914f3a7d5935d8889873cf

layer_type	Blockwise
is_materialized	False
shape	(1000, 1000)
dtype	float64
chunksize	(250, 250)
type	dask.array.core.Array
chunk_type	numpy.ndarray

Layer 4: sum-aggregate

sum-aggregate-dea1aa6295b1188748f9c3ab24b06348

layer_type	MaterializedLayer
is_materialized	True
shape	(1000,)
dtype	float64
chunksize	(250,)
type	dask.array.core.Array
chunk_type	numpy.ndarray

Layer 5: mean_chunk

mean_chunk-96f1f080ecc4444b1e9f23e76f7a8d73

layer_type	Blockwise
is_materialized	False
shape	(1000,)
dtype	float64
chunksize	(250,)
type	dask.array.core.Array
chunk_type	numpy.ndarray

Layer 6: mean_agg-aggregate

mean_agg-aggregate-d461f56db078ce142b2daa1b1f15e194

layer_type	MaterializedLayer
is_materialized	True
shape	()
dtype	float64
chunksize	()
type	dask.array.core.Array
chunk_type	numpy.ndarray
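The layer listing above can also be inspected programmatically. The sketch below assumes that `result.dask` is a `HighLevelGraph` exposing `layers` and `dependencies` mappings, as in the output shown. The exact layer names and the number of layers may differ between Dask versions, so the code prints them rather than relying on specific values.

```python
import dask.array as da

arr = da.random.random(size=(1_000, 1_000), chunks=(250, 250))
result = (arr + arr.T).sum(axis=0).mean()

hlg = result.dask  # the HighLevelGraph behind `result`

# `layers` maps each layer name to a Layer object; `dependencies` maps
# each layer name to the set of layer names it reads from.
layer_deps = {name: sorted(hlg.dependencies[name]) for name in hlg.layers}
for name, deps in layer_deps.items():
    print(f"{name} <- {deps}")

num_layers = len(hlg.layers)
# Taking len() of the whole graph counts individual tasks, which
# materializes any layers that were still lazy (Blockwise).
num_tasks = len(hlg)
print(f"{num_layers} layers, {num_tasks} tasks")
```

A layer with no dependencies (here, the one that generates the random data) is a root of the graph. Every other layer lists the layers whose output it consumes.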

For small enough graphs, we can get a visual representation of the graph. Each layer in the drawing of the task graph (below) corresponds to a layer in the high level graph (above).

In [3]:

result.visualize()

Out[3]:

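Once the graph is constructed, it goes through optimization, scheduling, and execution. The next cell draws the optimized graph, with each node colored according to Dask's static ordering of the tasks.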

In [4]:

result.visualize(optimize_graph=True, color="order", node_attr={"penwidth": "4"})

Out[4]:
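The optimization and execution steps mentioned above can also be run by hand. This is a rough sketch, assuming `dask.optimize` returns a tuple of collections that share an optimized graph. How much the task count changes depends on the Dask version and its optimization settings, so the counts are printed rather than assumed.

```python
import dask
import dask.array as da

arr = da.random.random(size=(1_000, 1_000), chunks=(250, 250))
result = (arr + arr.T).sum(axis=0).mean()

# Optimization: returns equivalent collections backed by an optimized graph.
(optimized,) = dask.optimize(result)

tasks_before = len(result.dask)
tasks_after = len(optimized.dask)
print(f"tasks before optimization: {tasks_before}")
print(f"tasks after optimization:  {tasks_after}")

# Scheduling and execution: compute() hands the graph to a scheduler,
# which runs the tasks and returns a concrete value.
value = float(result.compute())
print(value)
```

Each element of `arr + arr.T` is the sum of two uniform samples from [0, 1), so every column sum is close to 1000, and so is the final mean.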