@schlink The reason for choosing a pseudorandom subset rather than a fixed amount at the beginning or something is that a lot of files have a bunch of “header” or “footer” matter that will be identical. Choosing a random subset, you are more likely to encounter non-identical bits in similar files.
@schlink It’s a variation on this strategy: https://github.com/pganssle/python-norm-estimate