How to spot encrypted data with simple statistics:

Read every byte in the file and count how many times each of the 256 possible values occurs, then calculate the average, the median and the percentiles of those counts (in this case I calculated every 10th percentile).
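A minimal sketch of that measurement in Python (written for this post; the file name is a placeholder):

```python
# byte_stats: count how often each of the 256 byte values occurs in a file,
# then report the mean, median and every 10th percentile of those counts.
import statistics
from collections import Counter

def byte_stats(path):
    with open(path, "rb") as f:
        data = f.read()
    counts = Counter(data)
    freqs = [counts.get(b, 0) for b in range(256)]
    return {
        "mean": statistics.mean(freqs),
        "median": statistics.median(freqs),
        # 10th, 20th, ..., 90th percentiles of the per-value counts
        "percentiles": statistics.quantiles(freqs, n=10),
    }

print(byte_stats("somefile.bin"))  # placeholder file name
```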

If the counts are not uniform, with some values that stand out from the rest (far above or below the average and median), or the average and median are far apart, or the percentiles are all over the place, then the file is not encrypted. You can also often spot chains of bytes forming repeating patterns in the file, which is common for images and also for executable files.

However, if the counts for each value are very uniform (all very close to the average and median), the average and median are practically identical (you couldn't tell them apart without very high floating-point precision), and the percentiles are also very uniform, then your file is encrypted.
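A rough sketch of that decision rule; the thresholds here are my own guesses for illustration, not values from the post:

```python
# Heuristic sketch: tolerances are illustrative guesses, not measured values.
import statistics
from collections import Counter

def looks_encrypted(data: bytes,
                    spread_tolerance: float = 0.05,
                    mean_median_tolerance: float = 0.01) -> bool:
    counts = Counter(data)
    freqs = [counts.get(b, 0) for b in range(256)]
    mean = statistics.mean(freqs)
    median = statistics.median(freqs)
    if mean == 0:
        return False  # empty file
    # For uniform data the mean and median are almost indistinguishable.
    if abs(mean - median) / mean > mean_median_tolerance:
        return False
    # Every value's count should sit close to the mean (low relative spread).
    stdev = statistics.pstdev(freqs)
    return stdev / mean <= spread_tolerance

print(looks_encrypted(open("archive.7z", "rb").read()))  # placeholder file
```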

In these images you can tell how 90sset.blend has its byte-value distribution mostly dominated by 0, and the peak percentile is the 90th. In the case of the encrypted 7z file, all the counts are very close to the average and median, which are identical (you can't see the average because the median bar covers it), and the percentiles are very uniform.


@enigmatico

This isn't really correct.

A well-compressed file will have the same property.

If I take an encrypted file and interleave it with zeroes, this is still a well-encrypted version of the original file (you decrypt by removing the extraneous zeroes and then doing the original decryption). What you are saying is true only for encryption whose ciphertexts have the same length as the plaintexts.
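A tiny illustration of that counterexample (a sketch, not a real cipher; os.urandom stands in for actual ciphertext):

```python
# Sketch of the counterexample: take any ciphertext and interleave it with
# zero bytes. The result is just as hard to decrypt, but its byte histogram
# is now dominated by 0x00, so the uniformity test would reject it.
import os

ciphertext = os.urandom(1024)                           # stand-in ciphertext
padded = bytes(b for c in ciphertext for b in (c, 0))   # interleave with zeroes
recovered = padded[::2]                                 # drop them to "decrypt"

assert recovered == ciphertext
print(padded.count(0) / len(padded))   # roughly 0.5 -> wildly non-uniform
```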

@robryk I already did that experiment with compressed files and no, there are differences, although subtle.

Encryption and compression are different things: while encryption attempts to make everything look like completely random garbage, compression only attempts to remove repetitions to decrease the size of the file. It really is not the same.

@enigmatico

> while encryption attempts to make everything look like completely random garbage

This is not true, as long as you aren't asking the encryption not to increase the length of the message. If you also ask for that, I agree.

> compression will only attempt to remove repetitions to decrease the size of the file.

_No_. Compression is trying to make the output as small as possible. If the probability distribution over the output is not uniform[1], you could make it even smaller. Granted, doing that will cost CPU time (likely on both the compression and decompression sides).
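One way to check this on your own files (a sketch of mine, not from the thread; the file names are placeholders): compute the order-0 byte entropy. Anything noticeably below 8 bits per byte means a single-byte frequency model alone could shrink the data further; the footnote's argument applies to the full output distribution, not just single bytes.

```python
# Sketch: Shannon entropy of the byte-value distribution, in bits per byte.
# Well-encrypted data of any reasonable size sits right at ~8.0.
import math
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

for name in ("archive.7z", "archive.zip"):   # placeholder file names
    with open(name, "rb") as f:
        print(name, round(bits_per_byte(f.read()), 4))
```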

Have you tried any compression scheme that was optimized purely for compression ratio, as opposed to some middle point of the Pareto frontier between ratio and (de)compression runtime performance? For example, the winning entries of mattmahoney.net/dc/text.html

[1] In a way (if we ignore concerns of runtime performance), compression and prediction/modeling are the same problem: if we have a compressor, it provides us with a model of probabilities over inputs (it assigns probability 2^(-|comp(x)|), where |y| is the length of y). If we have a model that assigns probabilities, we can create a compression scheme such that |comp(x)| = -log2(p(x)) by Kraft's inequality.

Obviously, that assumes that the logarithms above are integers. That is an approximation that gets as close to equality as you wish by increasing the input length (more precisely, assume the input is an i.i.d. sequence of symbols; then taking a long enough i.i.d. sequence with the same per-symbol distribution brings us arbitrarily close to equality).
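A small numeric check of that correspondence (a toy distribution of my own choosing, not anything from the thread):

```python
# From a probability model we get code lengths len(x) = ceil(-log2 p(x));
# Kraft's inequality (sum over x of 2**-len(x) <= 1) guarantees a prefix
# code with those lengths exists, and the expected length is within 1 bit
# of the entropy of the model.
import math

model = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # toy distribution

lengths = {sym: math.ceil(-math.log2(p)) for sym, p in model.items()}
kraft_sum = sum(2 ** -l for l in lengths.values())

entropy = -sum(p * math.log2(p) for p in model.values())
expected_len = sum(model[s] * lengths[s] for s in model)

print(lengths)                 # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(kraft_sum <= 1)          # True -> a valid prefix code exists
print(entropy, expected_len)   # 1.75 and 1.75 for this toy distribution
```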
