Programmer Thoughts

By John Dickinson

Compressed File Reader

April 15, 2010

Recently, I needed to stream compressed data from an uncompressed file without buffering the entire file in memory. Python’s gzip library would have required me to write a new compressed file to disk first. The zlib module supports streaming, but by default it does not produce the gzip header and footer. I wanted something that would produce gzip-compatible output in a streaming fashion.
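To make the format difference concrete, here is a small sketch (the payload and variable names are only for illustration) comparing the first bytes of a default zlib stream with those of a gzip stream, and checking that a gzip reader rejects the zlib one. It uses `gzip.compress` and `gzip.decompress`, which exist in Python 3.2 and later.

```python
import gzip
import zlib

data = b"example payload " * 100

zlib_stream = zlib.compress(data)   # zlib framing: short header, Adler-32 trailer
gzip_stream = gzip.compress(data)   # gzip framing: magic bytes, CRC32 + size trailer

# The two formats start with different bytes.
assert zlib_stream[:1] == b"\x78"
assert gzip_stream[:2] == b"\x1f\x8b"

# A gzip reader refuses the zlib-framed stream.
try:
    gzip.decompress(zlib_stream)
    accepted = True
except OSError:
    accepted = False
```

Both streams carry deflate data; only the wrapper around it differs, which is why the class below can build gzip output from a raw deflate stream.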

To solve this, I wrote a class that wraps a file object and provides a read method to generate the compressed data.

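import struct
import zlib


class CompressedFileReader(object):
    """Wrap a file object; read() returns gzip-compressed data."""

    def __init__(self, file_obj, compresslevel=9):
        self._f = file_obj
        # Negative wbits selects a raw deflate stream; the gzip header
        # and footer are added by hand in read().
        self._compressor = zlib.compressobj(compresslevel,
                                            zlib.DEFLATED,
                                            -zlib.MAX_WBITS,
                                            zlib.DEF_MEM_LEVEL,
                                            0)
        self.done = False
        self.first = True
        self.crc32 = 0
        self.total_size = 0

    def read(self, *a, **kw):
        if self.done:
            return b''
        x = self._f.read(*a, **kw)
        if x:
            self.crc32 = zlib.crc32(x, self.crc32) & 0xffffffff
            self.total_size += len(x)
            compressed = self._compressor.compress(x)
            if not compressed:
                # Never return an empty string before EOF: callers treat
                # that as end of file, so force some output instead.
                compressed = self._compressor.flush(zlib.Z_SYNC_FLUSH)
        else:
            # End of input: finish the deflate stream and append the gzip
            # footer (CRC32 and size mod 2**32, both little-endian).
            compressed = self._compressor.flush(zlib.Z_FINISH)
            crc32 = struct.pack("<L", self.crc32 & 0xffffffff)
            size = struct.pack("<L", self.total_size & 0xffffffff)
            footer = crc32 + size
            compressed += footer
            self.done = True
        if self.first:
            self.first = False
            # gzip header: magic, deflate, no flags, mtime 0, XFL 2, OS unknown
            header = b'\037\213\010\000\000\000\000\000\002\377'
            compressed = header + compressed
        return compressed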
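The header and footer in the class follow the gzip member layout described in RFC 1952. As a rough sketch of the same framing written as a standalone generator (the names `GZIP_HEADER` and `gzip_chunks` are hypothetical and not part of the original script), the following builds the header with `struct` and checks that the result round-trips through the standard gzip module. It assumes Python 3.2 or later for `gzip.decompress`.

```python
import gzip
import io
import struct
import zlib

# Same 10 bytes as the octal literal used in the class:
# magic (1f 8b), CM=8 (deflate), FLG=0, MTIME=0, XFL=2, OS=255 (unknown).
GZIP_HEADER = struct.pack("<BBBBLBB", 0x1f, 0x8b, 8, 0, 0, 2, 255)


def gzip_chunks(file_obj, chunk_size=65536, compresslevel=9):
    """Yield gzip-framed bytes for file_obj, reading chunk_size at a time."""
    compressor = zlib.compressobj(compresslevel, zlib.DEFLATED,
                                  -zlib.MAX_WBITS, zlib.DEF_MEM_LEVEL, 0)
    crc = 0
    size = 0
    yield GZIP_HEADER
    while True:
        chunk = file_obj.read(chunk_size)
        if not chunk:
            break
        crc = zlib.crc32(chunk, crc) & 0xffffffff
        size += len(chunk)
        data = compressor.compress(chunk)
        if data:  # the compressor may buffer input and emit nothing yet
            yield data
    # Finish the deflate stream, then append CRC32 and size mod 2**32.
    yield compressor.flush() + struct.pack("<LL", crc, size & 0xffffffff)


original = b"hello gzip " * 10000
compressed = b"".join(gzip_chunks(io.BytesIO(original), chunk_size=4096))
assert gzip.decompress(compressed) == original
```

A generator can simply skip empty chunks, whereas a `read` method cannot return an empty string before end of file. That is the reason the class falls back to a `Z_SYNC_FLUSH` when the compressor has produced no output.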

This code, with some simple tests and examples, is available on GitHub: http://github.com/notmyname/python_scripts/blob/master/compressed_file_reader_test.py.

One potential use case is streaming compressed data to a web service. For example, one could use this class to compress data as it is streamed to Cloud Files.

conn = cloudfiles.get_connection(username, apikey)
container = conn.create_container('some_container')
test_object = container.create_object('file.gz')
test_object.content_type = 'application/x-gzip'
with open('path/to/large/uncompressed/file', 'rb') as f:
    compressed_f = CompressedFileReader(f)
    test_object.write(compressed_f)
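How `test_object.write` consumes the reader is not shown here. A reasonable assumption, not checked against the cloudfiles source, is that it calls `read()` with some chunk size until it gets an empty result, as most file-like consumers do. A hypothetical stand-in for such a consumer (the name `copy_stream` is made up for this sketch) might look like this:

```python
import io


def copy_stream(src, dst, chunk_size=4096):
    """Copy src to dst by calling src.read(chunk_size) until it returns nothing.

    Hypothetical stand-in for what an upload client might do internally.
    """
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:  # an empty result is the only end-of-file signal
            break
        dst.write(chunk)
        total += len(chunk)
    return total


source = io.BytesIO(b"x" * 10000)
sink = io.BytesIO()
copied = copy_stream(source, sink, chunk_size=1024)
assert copied == 10000
assert sink.getvalue() == source.getvalue()
```

Any object with a compatible `read` method can be passed as `src`, including a `CompressedFileReader`. With a loop like this, memory use stays bounded by the chunk size rather than the file size.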

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

The thoughts expressed here are my own and do not necessarily represent those of my employer.