If you have experimented with Dask before, you have likely seen examples using Dask DataFrame and Dask Array. These are Dask's high-level interfaces, and they mimic pandas and NumPy respectively. Operations on these APIs build task graphs, which then get executed on the cluster. Dask takes care of that part for you, so you only have to think at the high level.
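To make the "task graphs get executed later" point concrete, here is a minimal sketch (using a small `da.ones` array rather than the random array below, so the result is predictable): building the expression produces a lazy object, and only `.compute()` actually runs the graph.

```python
import dask.array as da

# Nothing is computed here: `total` is a lazy task graph, not a number.
x = da.ones((4, 4), chunks=(2, 2))
total = x.sum()
print(total)            # still lazy: prints a dask.array<...> placeholder

# Only .compute() walks the graph and produces a concrete result.
print(total.compute())  # 16.0
```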
import dask.array as da
arr = da.random.random(size=(1_000, 1_000), chunks=(250, 500))
arr
arr.sum()
_.compute()
499807.1679973727
(arr + arr.T).sum(axis=0).mean()
_.compute()
999.6143359947454
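A quick sanity check on that number, verifiable with plain NumPy: each entry of `arr` is uniform on [0, 1) with mean 0.5, so each entry of `arr + arr.T` has expected value 1.0, each column sum over 1,000 rows is about 1,000, and the mean of the column sums stays about 1,000. A sketch with a fixed seed (the seed and array here are illustrative, not the transcript's):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.random((1_000, 1_000))

# Column sums of a + a.T each concentrate near 1000; their mean does too.
result = (a + a.T).sum(axis=0).mean()
print(result)  # close to 1000, matching the ~999.61 seen above
```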
import dask
ddf = dask.datasets.timeseries()
ddf
|                | id    | name   | x       | y       |
|----------------|-------|--------|---------|---------|
| npartitions=30 |       |        |         |         |
| 2000-01-01     | int64 | object | float64 | float64 |
| 2000-01-02     | ...   | ...    | ...     | ...     |
| ...            | ...   | ...    | ...     | ...     |
| 2000-01-30     | ...   | ...    | ...     | ...     |
| 2000-01-31     | ...   | ...    | ...     | ...     |
ddf.groupby("name").sum()
|               | id    | x       | y       |
|---------------|-------|---------|---------|
| npartitions=1 |       |         |         |
|               | int64 | float64 | float64 |
|               | ...   | ...     | ...     |
_.compute()
| name     | id        | x           | y           |
|----------|-----------|-------------|-------------|
| Alice    | 99988092  | 3.672683    | -603.242898 |
| Bob      | 99491666  | -30.556705  | -107.608403 |
| Charlie  | 99457254  | -58.842019  | 168.359914  |
| Dan      | 99888563  | -15.870235  | -162.695709 |
| Edith    | 99729490  | 83.402682   | -65.868199  |
| Frank    | 99516999  | 99.684014   | -267.065279 |
| George   | 99325146  | -44.165091  | -235.215796 |
| Hannah   | 99282032  | 70.649959   | -228.623198 |
| Ingrid   | 99679220  | -230.035396 | 136.336888  |
| Jerry    | 99702106  | 121.091422  | -205.846074 |
| Kevin    | 99516485  | -192.304167 | -22.732018  |
| Laura    | 100257897 | -281.199519 | 104.492034  |
| Michael  | 99622137  | -23.252458  | -33.149406  |
| Norbert  | 99976880  | -133.840093 | 82.190995   |
| Oliver   | 99923135  | -49.054660  | -69.711770  |
| Patricia | 99413582  | 167.956305  | -35.632570  |
| Quinn    | 99655323  | 81.625377   | -81.761760  |
| Ray      | 99445037  | -201.510445 | 129.305448  |
| Sarah    | 99224362  | 188.307865  | -227.253566 |
| Tim      | 99954279  | 173.205178  | -362.966166 |
| Ursula   | 99856910  | 115.580875  | -35.032542  |
| Victor   | 100077831 | 38.778105   | -234.513779 |
| Wendy    | 99980779  | 50.250366   | -204.591499 |
| Xavier   | 99468968  | 37.642749   | -153.014391 |
| Yvonne   | 99665690  | -188.781225 | -326.251246 |
| Zelda    | 99879498  | 61.840510   | 166.020667  |
ddf.x.resample("1H").mean().loc["2000-01-15"]
Dask Series Structure:
npartitions=1
2000-01-15 00:00:00.000000000    float64
2000-01-15 23:59:59.999999999        ...
Name: x, dtype: float64
Dask Name: loc, 151 tasks
_.compute()
timestamp
2000-01-15 00:00:00    0.000730
2000-01-15 01:00:00    0.002879
2000-01-15 02:00:00    0.000252
2000-01-15 03:00:00   -0.004852
2000-01-15 04:00:00   -0.004158
2000-01-15 05:00:00    0.006094
2000-01-15 06:00:00    0.004820
2000-01-15 07:00:00    0.008705
2000-01-15 08:00:00    0.014731
2000-01-15 09:00:00    0.007589
2000-01-15 10:00:00    0.005049
2000-01-15 11:00:00    0.000763
2000-01-15 12:00:00   -0.008005
2000-01-15 13:00:00    0.006928
2000-01-15 14:00:00    0.004477
2000-01-15 15:00:00   -0.005120
2000-01-15 16:00:00   -0.003076
2000-01-15 17:00:00   -0.004609
2000-01-15 18:00:00    0.010669
2000-01-15 19:00:00    0.001993
2000-01-15 20:00:00   -0.011027
2000-01-15 21:00:00    0.008612
2000-01-15 22:00:00    0.008686
2000-01-15 23:00:00   -0.009387
Freq: H, Name: x, dtype: float64