Compressing atmospheric data into its real information content

by Milan Kloewer (University of Oxford)


Hundreds of petabytes of data are produced annually at weather and climate forecast centres worldwide. Compression is necessary to reduce storage and to facilitate data sharing. Current techniques neither exploit the spatio-temporal correlation in climate variables nor restrict compression to the real information in floating-point numbers. Here, the bitwise real information content is calculated for data from the Copernicus Atmosphere Monitoring Service (CAMS). Most variables contain fewer than 7 bits of real information per value, and these bits are also highly compressible due to spatio-temporal correlation. Identifying information-less bits with this technique is widely applicable to geoscientific variables. Rounding information-less bits to 0 makes the data far more compressible for available lossless compression algorithms, while bounding various error norms. Variables from the CAMS data set can consequently be compressed by a factor of 20-50x, relative to 64-bit floats. In comparison, the 24-bit linear quantisation currently used for storing CAMS data has a fixed compression factor of 2.7x. While most lossless compression algorithms act in one dimension only, multi-dimensional correlation can be exploited using modern compression libraries such as zfp. Our information-preserving approach is generalised to 4-dimensional space-time compression with zfp, achieving compression factors beyond 100x without an impact on forecast skill scores. Widely used modern data formats such as netCDF and HDF5 support this multi-dimensional and information-preserving compression, providing the basis to compress large climate data archives without a loss of real information.
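
The idea of identifying information-less bits, rounding them to 0 and then applying a lossless compressor can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the results above: the information content of each bit position is estimated as the mutual information between the bits of adjacent values along one dimension, the 99% threshold for preserved information is an assumed choice, the data are synthetic 32-bit floats rather than CAMS fields, zlib stands in for the lossless compressor, and all function names are hypothetical.

```python
import zlib
import numpy as np


def bitwise_information(a):
    """Mutual information (in bits) between adjacent values, per bit position.

    Index 0 is the sign bit, 1-8 the exponent bits, 9-31 the mantissa bits
    of the float32 representation.
    """
    u = np.ascontiguousarray(a, dtype=np.float32).ravel().view(np.uint32)
    info = np.zeros(32)
    for i in range(32):
        bits = ((u >> np.uint32(31 - i)) & np.uint32(1)).astype(np.int64)
        x, y = bits[:-1], bits[1:]
        # Joint probability of the four possible (bit, next bit) pairs.
        joint = np.bincount(2 * x + y, minlength=4).reshape(2, 2) / x.size
        # Product of the marginals, i.e. the joint under independence.
        indep = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        info[i] = np.sum(joint[nz] * np.log2(joint[nz] / indep[nz]))
    return np.maximum(info, 0.0)  # clip tiny negative rounding errors


def keepbits_for(info, fraction=0.99):
    """Number of mantissa bits needed to preserve `fraction` of the information."""
    cum = np.cumsum(info)
    last = int(np.searchsorted(cum, fraction * cum[-1]))  # last bit position kept
    return min(max(last - 8, 0), 22)  # subtract the sign and 8 exponent bits


def round_mantissa(a, keepbits):
    """Round float32 values to `keepbits` mantissa bits (round to nearest,
    ties to even) and set the trailing bits to 0. Assumes finite values."""
    if not 0 <= keepbits <= 22:
        raise ValueError("keepbits must be between 0 and 22")
    u = np.asarray(a).astype(np.float32).view(np.uint32)
    drop = 23 - keepbits
    half = np.uint32(1 << (drop - 1))
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    lsb = (u >> np.uint32(drop)) & np.uint32(1)  # lowest kept bit, for ties
    u = (u + (half - np.uint32(1)) + lsb) & mask
    return u.view(np.float32)


def compression_factor(a):
    """Compression factor of zlib relative to storing each value as a 64-bit float."""
    return 8 * a.size / len(zlib.compress(a.tobytes(), 9))


# Synthetic example: a smooth signal plus small noise (not real CAMS data).
rng = np.random.default_rng(0)
x = np.linspace(0, 20 * np.pi, 100_000)
data = (280 + 10 * np.sin(x) + 0.01 * rng.standard_normal(x.size)).astype(np.float32)

info = bitwise_information(data)
keepbits = keepbits_for(info)
rounded = round_mantissa(data, keepbits)

print("mantissa bits kept:", keepbits)
print("compression factor without rounding:", compression_factor(data))
print("compression factor with rounding:   ", compression_factor(rounded))
```

Rounding to nearest bounds the relative error of every value by 2^-(keepbits+1), which is one way in which this kind of rounding bounds error norms. The sketch treats the data as a one-dimensional sequence; exploiting correlation in several dimensions at once would require a multi-dimensional compressor such as zfp.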