Programming horror stories:

Today I was working on some code that was doing hash-set operations in Python. Fairly normal stuff, unpickling, collecting and merging sets of keys to get the unique ones. The keys in this case are three-tuples of floats. I know, I know, hashing on floats is a bad idea (and I told them so myself) but we are where we are.

It was taking many hours to build the set of unique keys. The dataset is large but it's not that large, so I attached GDB to it and had a look around, poked at the CPython set code, made some educated guesses and eventually had my code print out the individual keys and their hashes.

It turns out that (nan, None, None) was a super common key in this data. So of course each one of those tuples would hash to the same value, but since nan != nan, they would not count as the same object, creating endless hash collisions as Python built up a million identical-but-not tuples and had to check each and every one for equality every time another one was inserted into the set.

Follow

@quietbrooke

The amusing (and slightly absurdist) way in which golang deals with that is that hash of a nan is a random value. See github.com/golang/go/blob/097b

Sign in to participate in the conversation
Qoto Mastodon

QOTO: Question Others to Teach Ourselves
An inclusive, Academic Freedom, instance
All cultures welcome.
Hate speech and harassment strictly forbidden.