Xarray and Dask: An Alternative to GDAL and NumPy
A common task for GIS (geographic information systems) engineers is to extract regions from a map. Given a TIFF file and some coordinates (in dataframes or other tabular form), the regions of interest are cropped using slicing or some other indexing methods.
One approach to this task is to use the GDAL-NumPy solution. The TIFF file is loaded as a GDAL raster object. Then the raster object is converted into a NumPy array. Regions of interest can simply be cropped with Python slicing.
This kind of solution is fine for small examples, but would scale poorly with larger datasets.
Firstly, Python is a language designed for rapid development and prototyping at the expense of speed. Now imagine trying to crop out thousands of regions from maps that are tens or hundreds of gigabytes big.
Secondly, GDAL is notorious for its poor threading support. The GDAL docs has stated that the GDALDataset object should not be accessed from different threads at the same time. Any attempts to parallelize the task with Python’s threading, process or concurrent.futures module will just crash the program. Trying to crop big TIFF files sequentially would be prohibitively slow.
Lastly, loading the crops take up a lot of RAM. The solution would not work if the crops cannot fit inside the RAM. Ideally, the solution should also work for machines with limited RAM, not just beefy workstations.
Fortunately, Python has libraries like Xarray and Dask that exist to tackle this sort of problem.
Xarray is an open-source Python package that makes working with multi-dimensional arrays efficient. Think of it like NumPy on steroids.
Like GDAL, Xarray comes with the ability to load TIFF rasters as a DataArray. The DataArray object behaves a lot like GDAL’s DataSet object, supporting features like metadata querying and other GDAL operations. Because the DataArray operates like a NumPy array, typical NumPy indexing and operations can be applied to it. There is no intermediary conversion from DataArray to NumPy array.
Compared to GDAL, Xarray’s dataset has much better parallelization support, especially with Dask. As of this post, Dask is an optional feature, but the benefits are sufficiently strong that it may become a required dependency in the future.
Dask is a parallel computing library written to handle large NumPy-like data structures.
Unlike NumPy, which uses eager evaluation, operations on Dask are lazy. Dask functions are first converted into task graphs. As shown in the Dask Delayed docs:
Each inc operation run lazily. Execution is deferred until the value is needed. When the results of the inc operations are required, Dask will run both operations in parallel. This is a simple example, but can be extended for more complex functions.
Another benefit to using Dask is that it can work with datasets that are too huge to fit into RAM.
More on parallel computation with Dask can be found here.
TIFF croppings for 12000, 62000 and 120000 ROIs using GDAL (Python/C++) and Xarray-Dask are benchmarked:
GDAL uses GDAL_Translate to crop out ROIs. Cropping with GDAL is done sequentially due to the poor threading support. Even though the C++ implementation is faster, the cropping time using GDAL still increased exponentially.
Meanwhile, Xarray/Dask is blazing fast even at 120000 ROIs. The speedup is very noticeable: at 120000 ROIs, Xarray/Dask has a staggering 219 times speedup.
That wraps up today’s post. If you want to learn more about Xarray and Dask, this and this are good places to get started. Hope you learned something useful today.