Each Answer to this Q is separated by one/two green lines.
I tried various methods to do data compression when saving to disk some numpy arrays
.
These 1D arrays contain sampled data at a certain sampling rate (can be sound recorded with a microphone, or any other measurment with any sensor) : the data is essentially continuous (in a mathematical sense ; of course after sampling it is now discrete data).
I tried with HDF5
(h5py) :
f.create_dataset("myarray1", myarray, compression="gzip", compression_opts=9)
but this is quite slow, and the compression ratio is not the best we can expect.
I also tried with
numpy.savez_compressed()
but once again it may not be the best compression algorithm for such data (described before).
What would you choose for better compression ratio on a numpy array
, with such data ?
(I thought about things like lossless FLAC (initially designed for audio), but is there an easy way to apply such an algorithm on numpy data ?)
What I do now:
import gzip
import numpy
f = gzip.GzipFile("my_array.npy.gz", "w")
numpy.save(file=f, arr=my_array)
f.close()

Noise is incompressible. Thus, any part of the data that you have which is noise will go into the compressed data 1:1 regardless of the compression algorithm, unless you discard it somehow (lossy compression). If you have a 24 bits per sample with effective number of bits (ENOB) equal to 16 bits, the remaining 2416 = 8 bits of noise will limit your maximum lossless compression ratio to 3:1, even if your (noiseless) data is perfectly compressible. Nonuniform noise is compressible to the extent to which it is nonuniform; you probably want to look at the effective entropy of the noise to determine how compressible it is.

Compressing data is based on modelling it (partly to remove redundancy, but also partly so you can separate from noise and discard the noise). For example, if you know your data is bandwidth limited to 10MHz and you’re sampling at 200MHz, you can do an FFT, zero out the high frequencies, and store the coefficients for the low frequencies only (in this example: 10:1 compression). There is a whole field called “compressive sensing” which is related to this.

A practical suggestion, suitable for many kinds of reasonably continuous data: denoise > bandwidth limit > delta compress > gzip (or xz, etc). Denoise could be the same as bandwidth limit, or a nonlinear filter like a running median. Bandwidth limit can be implemented with FIR/IIR. Delta compress is just y[n] = x[n] – x[n1].
EDIT An illustration:
from pylab import *
import numpy
import numpy.random
import os.path
import subprocess
# create 1M data points of a 24bit sine wave with 8 bits of gaussian noise (ENOB=16)
N = 1000000
data = (sin( 2 * pi * linspace(0,N,N) / 100 ) * (1<<23) + \
numpy.random.randn(N) * (1<<7)).astype(int32)
numpy.save('data.npy', data)
print os.path.getsize('data.npy')
# 4000080 uncompressed size
subprocess.call('xz 9 data.npy', shell=True)
print os.path.getsize('data.npy.xz')
# 1484192 compressed size
# 11.87 bits per sample, ~8 bits of that is noise
data_quantized = data / (1<<8)
numpy.save('data_quantized.npy', data_quantized)
subprocess.call('xz 9 data_quantized.npy', shell=True)
print os.path.getsize('data_quantized.npy.xz')
# 318380
# still have 16 bits of signal, but only takes 2.55 bits per sample to store it
The HDF5 file saving with compression can be very quick and efficient: it all depends on the compression algorithm, and whether you want it to be quick while saving, or while reading it back, or both. And, naturally, on the data itself, as it was explained above.
GZIP tends to be somewhere in between, but with low compression ratio. BZIP2 is slow on both sides, although with better ratio. BLOSC is one of the algorithms that I have found to get quite compression, and quick on both ends. The downside of BLOSC is that it is not implemented in all implementations of HDF5. Thus your program may not be portable.
You always need to make, at least some, tests to select the best configuration for your needs.
What constitutes the best compression (if any) highly depends on the nature of the data. Many kinds of measurement data are virtually completely incompressible, if lossfree compression is indeed required.
The pytables docs contains a lot of useful guidelines on data compression. It also details speed tradeoffs and so on; higher compression levels are usually a waste of time, as it turns out.
http://pytables.github.io/usersguide/optimization.html
Note that this is probably as good as it will get. For integer measurements, a combination of a shuffle filter with a simple ziptype compression usually works reasonably well. This filter very efficiently exploits the common situation where the highestendian byte is usually 0, and only included to guard against overflow.
You might want to try blz. It can compress binary data very efficiently.
import blz
# this stores the array in memory
blz.barray(myarray)
# this stores the array on disk
blz.barray(myarray, rootdir="arrays")
It stores arrays either on file or compressed in memory. Compression is based on blosc.
See the scipy video for a bit of context.
First, for general data sets, the shuffle=True
argument to create_dataset
improves compression dramatically with roughly continuous datasets. It very cleverly rearranges the bits to be compressed so that (for continuous data) the bits change slowly, which means they can be compressed better. It slows the compression down a very little bit in my experience, but can substantially improve the compression ratios in my experience. It is not lossy, so you really do get the same data out as you put in.
If you don’t care about the accuracy so much, you can also use the scaleoffset
argument to limit the number of bits stored. Be careful, though, because this is not what it might sound like. In particular, it is an absolute precision, rather than a relative precision. For example, if you pass scaleoffset=8
, but your data points are less then 1e8
you’ll just get zeros. Of course, if you’ve scaled the data to max out around 1, and don’t think you can hear differences smaller than a part in a million, you can pass scaleoffset=6
and get great compression without much work.
But for audio specifically, I expect that you are right in wanting to use FLAC, because its developers have put in huge amounts of thought, balancing compression with preservation of distinguishable details. You can convert to WAV with scipy, and thence to FLAC.