I had a chat today with @xblasco about hash maps, and whether they really are O(1) on insertion, deletion and get value. Has anyone ever had this thought of "no way it can actually be O(1)"? Or run into benchmarks where the O(1) doesn't seem to hold?
@nosherwan @xblasco we are still discussing... It's not an easy topic.
The algorithm is O(1) in theory and practice.
If we say that an algorithm is O(1), it means that for any number of items in the dictionary, we can find a constant C which is an upper bound on the time it takes to do the look-up.
As n grows we get more cache misses, until essentially 100% of look-ups are cache misses.
In that case I would expect the look-up time to be capped by the time it takes to fetch the key from RAM (a little bit below 500 ns).
For larger dictionaries you might even find a second jump. This will happen when the dictionary doesn't fit into RAM anymore. Most modern operating systems will start writing parts of the dictionary to the hard-drive. This is called swapping.
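A rough sketch to see that curve yourself (my own benchmark code, not from the thread; the numbers and the exact plateau point will vary by machine):

```python
# Rough benchmark sketch: time dict look-ups at growing sizes and
# watch for the jump when the table outgrows the CPU caches.
import random
import time

def avg_lookup_ns(n, probes=100_000):
    """Average time of one dict look-up, in nanoseconds."""
    d = {i: i for i in range(n)}
    keys = [random.randrange(n) for _ in range(probes)]
    start = time.perf_counter()
    for k in keys:
        d[k]  # the look-up being measured
    return (time.perf_counter() - start) / probes * 1e9

for n in (1_000, 100_000, 1_000_000):
    print(f"n={n:>9,}  ~{avg_lookup_ns(n):.0f} ns per look-up")
```

The small sizes fit in cache, so they should look faster per look-up; past cache size the time should level off near the RAM latency described above.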
@erikdesmedt @nosherwan @xblasco wait a second... With a B-tree, several accesses per look-up would be out of cache, while a hash map needs only one RAM access. There's no way to make a tree perform close to a single memory access, so I'd guess we would always get an order of magnitude higher timings on the right side of the graph.
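A toy way to count those extra touches (a sketch of my own, using binary search over a sorted list as a stand-in for tree descent, not a real B-tree):

```python
# Binary search touches ~log2(n) positions per look-up;
# a hash map needs roughly one bucket probe.
def binary_search_steps(sorted_keys, key):
    """bisect_left-style search that also counts positions inspected."""
    lo, hi, steps = 0, len(sorted_keys), 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid
    found = lo < len(sorted_keys) and sorted_keys[lo] == key
    return found, steps

keys = list(range(1_000_000))
found, steps = binary_search_steps(keys, 765_432)
print(found, steps)  # roughly log2(1e6) ≈ 20 touches; a dict needs ~1
```

Each of those ~20 touches can be a separate cache miss, which is exactly why the right side of the graph should diverge. (B-trees soften this by packing many keys per node, so the number of *node* touches is much smaller than in a binary tree.)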
@erikdesmedt @nosherwan @xblasco And circling back, now it makes even more sense why B-trees (not binary, just B) are used in #Postgres https://en.wikipedia.org/wiki/B-tree
It seems to me that they won't perform that badly compared to hash maps in real life, even being O(log n) for access instead of O(1).
https://www.enterprisedb.com/postgres-tutorials/are-hash-indexes-faster-btree-indexes-postgres
(couldn't find a better article for PostgreSQL comparing both indexes, client count is not the metric I want)
@erikdesmedt @nosherwan @xblasco I think you're right, but a finite upper bound on time doesn't seem like a strong enough argument: with a high enough upper bound you could claim O(log n) to be O(1). For example, we could compare a bad hash map implementation against a specialized B-tree, in a scenario where the B-tree excels. Under the same upper bound, the B-tree looks like it is O(1) by the same rules. Which is wrong.
I'd like to hear more ideas on how to prove hashmaps are O(1) in practice
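One idea (my own sketch, not from the thread): separate the algorithm from the hardware by counting probes instead of nanoseconds. A toy open-addressing table with linear probing and a fixed load factor should show the average probes per look-up staying roughly constant as n grows, which is the O(1) claim stripped of cache effects:

```python
# Toy open-addressing hash table (linear probing, load factor 0.5).
# If look-ups are O(1) on average, avg probes should NOT grow with n.
import random

class ProbeCountingTable:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.capacity = capacity

    def _probe(self, key):
        """Yield (slot index, probes so far) in linear-probing order."""
        i = hash(key) % self.capacity
        probes = 0
        while True:
            probes += 1
            yield i, probes
            i = (i + 1) % self.capacity

    def insert(self, key):
        for i, _ in self._probe(key):
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return

    def lookup_probes(self, key):
        """Number of slots inspected to find a present key."""
        for i, probes in self._probe(key):
            if self.slots[i] == key:
                return probes

def avg_probes(n):
    t = ProbeCountingTable(capacity=2 * n)  # load factor fixed at 0.5
    keys = random.sample(range(10 * n), n)
    for k in keys:
        t.insert(k)
    return sum(t.lookup_probes(k) for k in keys) / n

for n in (1_000, 10_000, 100_000):
    print(f"n={n:>7,}  avg probes ≈ {avg_probes(n):.2f}")
```

If the average probe count stays flat while n grows 100x, the algorithm is doing O(1) work per look-up; any growth you then see in wall-clock benchmarks is the memory hierarchy, not the algorithm.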