diskcache with zlib
I noticed a segment in the diskcache docs that made me want to run a quick benchmark.
If you don't know it, diskcache is an amazing little library that gives you a SQLite-powered caching mechanism in Python. By default it uses pickle to store everything, but you can write a custom Disk class that lets you decide how data gets serialized.
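If you've never used it, the default setup is basically a dict that lives on disk. A minimal sketch (the directory name is just something I made up):

```python
from diskcache import Cache

cache = Cache("demo-cache")          # SQLite index + blob files in ./demo-cache
cache.set("answer", {"value": 42})   # pickled by the default Disk
print(cache.get("answer"))           # {'value': 42}


@cache.memoize()                     # or cache function results directly
def slow_square(x):
    return x * x
```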
This is the example from the docs:
```python
import json
import zlib

import diskcache
from diskcache import Cache
from diskcache.core import UNKNOWN  # sentinel used as the default `key` in store()


class JSONDisk(diskcache.Disk):
    def __init__(self, directory, compress_level=1, **kwargs):
        self.compress_level = compress_level
        super().__init__(directory, **kwargs)

    def put(self, key):
        # JSON-encode and compress keys before handing them to the base Disk.
        json_bytes = json.dumps(key).encode('utf-8')
        data = zlib.compress(json_bytes, self.compress_level)
        return super().put(data)

    def get(self, key, raw):
        # Reverse of put(): decompress and JSON-decode keys.
        data = super().get(key, raw)
        return json.loads(zlib.decompress(data).decode('utf-8'))

    def store(self, value, read, key=UNKNOWN):
        # Compress values too, unless they're streamed in as a file (read=True).
        if not read:
            json_bytes = json.dumps(value).encode('utf-8')
            value = zlib.compress(json_bytes, self.compress_level)
        return super().store(value, read, key=key)

    def fetch(self, mode, filename, value, read):
        # Decompress values on the way out, again skipping the raw-file case.
        data = super().fetch(mode, filename, value, read)
        if not read:
            data = json.loads(zlib.decompress(data).decode('utf-8'))
        return data


# disk_* settings are forwarded to the Disk constructor with the prefix stripped.
with Cache(disk=JSONDisk, disk_compress_level=6) as cache:
    pass
```
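Usage then stays the same as with a plain Cache, as long as your keys and values are JSON-serializable. A quick roundtrip (directory name made up by me):

```python
with Cache("llm-cache", disk=JSONDisk, disk_compress_level=6) as cache:
    cache.set("prompt-123", {"prompt": "hello", "completion": "world"})
    print(cache.get("prompt-123"))  # {'prompt': 'hello', 'completion': 'world'}
```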
I had Claude write me a quick benchmark to check how much disk space you might save, and here's the summary chart:
A few things to note:
- the compression mainly pays off if you're dealing with loads of text, so if you're caching LLM input/output you can get a lot of mileage out of this
- the compression is applied per JSON object, so don't expect extra savings if the same content is duplicated across many keys
- this trick works with JSON, but I can also imagine that you might be able to pull off something clever with embeddings too (rough sketch after this list)
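For what it's worth, here's what that embeddings variant could look like. This is my own speculation, not something from the docs or the notebook: a Disk that serializes numpy arrays with np.save and zlib-compresses the bytes, assuming the cache only ever stores arrays as values (so any bytes coming back from fetch() are treated as compressed arrays).

```python
import io
import zlib

import numpy as np
import diskcache
from diskcache.core import UNKNOWN


class NumpyDisk(diskcache.Disk):
    def __init__(self, directory, compress_level=1, **kwargs):
        self.compress_level = compress_level
        super().__init__(directory, **kwargs)

    def store(self, value, read, key=UNKNOWN):
        # Only intercept numpy arrays handed in as values, not file handles.
        if not read and isinstance(value, np.ndarray):
            buffer = io.BytesIO()
            np.save(buffer, value)  # .npy format keeps dtype and shape
            value = zlib.compress(buffer.getvalue(), self.compress_level)
        return super().store(value, read, key=key)

    def fetch(self, mode, filename, value, read):
        data = super().fetch(mode, filename, value, read)
        if not read and isinstance(data, bytes):
            data = np.load(io.BytesIO(zlib.decompress(data)))
        return data


with diskcache.Cache("embeddings-cache", disk=NumpyDisk, disk_compress_level=6) as cache:
    # Random float32 data barely compresses; the bigger win in practice is
    # probably casting to a lower-precision dtype before storing.
    cache.set("doc-1", np.random.rand(768).astype(np.float32))
```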
If you want to play with the notebook, you can find it here.
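If you just want the gist without opening the notebook, the comparison boils down to something like this. It's a rough sketch with synthetic data and made-up directory names, reusing the JSONDisk class from above, not the actual benchmark code:

```python
import random
import string

from diskcache import Cache


def fake_llm_record(n_words=400):
    # Text-heavy, JSON-serializable payload standing in for an LLM exchange.
    # Real English compresses better than this random gibberish.
    words = ["".join(random.choices(string.ascii_lowercase, k=7)) for _ in range(n_words)]
    return {"prompt": " ".join(words[:100]), "completion": " ".join(words[100:])}


records = [fake_llm_record() for _ in range(1000)]

with Cache("bench-pickle") as cache:
    for i, record in enumerate(records):
        cache.set(i, record)
    pickle_bytes = cache.volume()  # estimated on-disk size of the cache

with Cache("bench-json-zlib", disk=JSONDisk, disk_compress_level=6) as cache:
    for i, record in enumerate(records):
        cache.set(i, record)
    zlib_bytes = cache.volume()

print(f"default pickle Disk: {pickle_bytes:,} bytes")
print(f"JSONDisk + zlib:     {zlib_bytes:,} bytes")
```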