**Paul Ganssle** @pganssle@qoto.org · Nov 03, 2021, 14:01

@schlink@octodon.social By the way, I dunno if this is helpful, but a while back I developed an application where I wanted to know if I had duplicate files anywhere on a disk.

Rather than hash the entire disk just to build an index, I used some deterministic algorithm to choose a *small* pseudorandom subset of each file and hash that to build my index. There were collisions among non-duplicate files, but they were few and far between (and often not even among files with the same size), and it was a simple matter to do a full hash on the small subset of files with collisions.
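A minimal sketch of that two-stage idea, not the original application's code. The function names, the sample count, the chunk size, the use of SHA-256, and seeding the offset generator with the file size are all assumptions made for illustration. Mixing the size into the digest is also a choice of this sketch.

```python
import hashlib
import os
import random
from collections import defaultdict


def sampled_digest(path, n_samples=8, chunk_size=4096):
    """Hash a small, deterministically chosen subset of a file's bytes."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    # Assumption of this sketch: include the size in the digest.
    h.update(str(size).encode())
    # Seeding with the size makes the offsets reproducible: two files of
    # the same length are always sampled at the same positions.
    rng = random.Random(size)
    max_offset = max(size - chunk_size, 0)
    offsets = sorted(rng.randint(0, max_offset) for _ in range(n_samples))
    with open(path, "rb") as f:
        for offset in offsets:
            f.seek(offset)
            h.update(f.read(chunk_size))
    return h.hexdigest()


def full_digest(path, block_size=1 << 20):
    """Hash the whole file, reading it in blocks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()


def find_duplicates(paths):
    """Group paths by sampled digest, then confirm with a full hash."""
    by_sample = defaultdict(list)
    for p in paths:
        by_sample[sampled_digest(p)].append(p)

    by_full = defaultdict(list)
    for group in by_sample.values():
        if len(group) > 1:  # only files whose samples collide get fully hashed
            for p in group:
                by_full[full_digest(p)].append(p)
    return [sorted(g) for g in by_full.values() if len(g) > 1]
```

The cheap pass reads at most `n_samples * chunk_size` bytes per file, and the expensive full hash only runs on files whose sampled digests collide.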

**Paul Ganssle** @pganssle@qoto.org · Nov 03, 2021, 14:04

@schlink@octodon.social The reason for choosing a pseudorandom subset rather than, say, a fixed number of bytes at the beginning is that a lot of files have a bunch of "header" or "footer" matter that will be identical. By choosing a random subset, you are more likely to encounter non-identical bits in similar files.
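A small illustration of that point, using made-up in-memory "files" rather than anything from the original application. The 64-byte header, the helper names, and the sampling parameters are hypothetical. Both inputs share a header and differ everywhere after it.

```python
import hashlib
import random


def prefix_digest(data, n_bytes=64):
    """Hash only the first n_bytes (the 'fixed amount at the beginning')."""
    return hashlib.sha256(data[:n_bytes]).hexdigest()


def sampled_digest_bytes(data, n_samples=8, chunk_size=64):
    """Hash chunks at pseudorandom offsets, seeded by length so it is deterministic."""
    rng = random.Random(len(data))
    max_offset = max(len(data) - chunk_size, 0)
    h = hashlib.sha256()
    for offset in sorted(rng.randint(0, max_offset) for _ in range(n_samples)):
        h.update(data[offset:offset + chunk_size])
    return h.hexdigest()


header = b"FAKEHDR!" * 8            # 64 bytes of shared "header" matter
file_a = header + b"\x01" * 65536   # same header, different bodies
file_b = header + b"\x02" * 65536

# The prefix hash only ever sees the shared header, so it cannot tell them apart.
same_prefix = prefix_digest(file_a) == prefix_digest(file_b)
# The sampled hash reads from across the whole file, so it reaches the bodies.
same_sample = sampled_digest_bytes(file_a) == sampled_digest_bytes(file_b)
```

Here `same_prefix` comes out true and `same_sample` false: the sampled digest only needs one chunk to land outside the shared header to tell the two apart.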

**Paul Ganssle** @pganssle@qoto.org · Nov 03, 2021, 14:06

@schlink@octodon.social It's a variation on this strategy: https://github.com/pganssle/python-norm-estimate
