How to spot encrypted data with simple statistics:

Read every byte in the file and count how many times each of the 256 possible values occurs, then calculate the average, the median and the percentiles of those counts (in this case I calculated every 10th percentile).
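A minimal sketch of that measurement in Python (written for this post; the file name is a placeholder):

```python
# byte_stats: count how often each of the 256 byte values occurs in a file,
# then report the mean, median and every 10th percentile of those counts.
import statistics
from collections import Counter

def byte_stats(path):
    with open(path, "rb") as f:
        data = f.read()
    counts = Counter(data)
    freqs = [counts.get(b, 0) for b in range(256)]
    return {
        "mean": statistics.mean(freqs),
        "median": statistics.median(freqs),
        # 10th, 20th, ..., 90th percentiles of the per-value counts
        "percentiles": statistics.quantiles(freqs, n=10),
    }

print(byte_stats("somefile.bin"))  # placeholder file name
```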

If the counts are not uniform, with some values that stand out from the rest (far above or below the average and median), or the average and median are far apart, or the percentiles are all over the place, then the file is not encrypted. You can also often spot chains of bytes forming repeating patterns in the file, which is common for images and also for executable files.

However, if the counts for each value are very uniform (all very close to the average and median), the average and median are practically identical (you couldn't tell them apart without very high floating-point precision), and the percentiles are also very uniform, then your file is encrypted.
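A rough sketch of that decision rule; the thresholds here are my own guesses for illustration, not values from the post:

```python
# Heuristic sketch: tolerances are illustrative guesses, not measured values.
import statistics
from collections import Counter

def looks_encrypted(data: bytes,
                    spread_tolerance: float = 0.05,
                    mean_median_tolerance: float = 0.01) -> bool:
    counts = Counter(data)
    freqs = [counts.get(b, 0) for b in range(256)]
    mean = statistics.mean(freqs)
    median = statistics.median(freqs)
    if mean == 0:
        return False  # empty file
    # For uniform data the mean and median are almost indistinguishable.
    if abs(mean - median) / mean > mean_median_tolerance:
        return False
    # Every value's count should sit close to the mean (low relative spread).
    stdev = statistics.pstdev(freqs)
    return stdev / mean <= spread_tolerance

print(looks_encrypted(open("archive.7z", "rb").read()))  # placeholder file
```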

In these images you can tell how 90sset.blend has its byte-value distribution mostly dominated by 0, and the peak percentile is the 90th. In the case of the encrypted 7z file, all the counts are very close to the average and median, which are identical (you can't see the average because the median bar covers it), and the percentiles are very uniform.


@enigmatico

This isn't really correct.

A well-compressed file will have the same property.

If I take an encrypted file and interleave it with zeroes, this is still a well-encrypted version of the original file (you decrypt by removing the extraneous zeroes and then doing the original decryption). What you are saying is true only for encryption whose ciphertexts have the same length as the plaintexts.
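A tiny illustration of that counterexample (a sketch, not a real cipher; os.urandom stands in for actual ciphertext):

```python
# Sketch of the counterexample: take any ciphertext and interleave it with
# zero bytes. The result is just as hard to decrypt, but its byte histogram
# is now dominated by 0x00, so the uniformity test would reject it.
import os

ciphertext = os.urandom(1024)                           # stand-in ciphertext
padded = bytes(b for c in ciphertext for b in (c, 0))   # interleave with zeroes
recovered = padded[::2]                                 # drop them to "decrypt"

assert recovered == ciphertext
print(padded.count(0) / len(padded))   # roughly 0.5 -> wildly non-uniform
```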

@robryk I already did that experiment with compressed files and no, there are differences, although subtle.

Encryption and compression are different things: while encryption attempts to make everything look like completely random garbage, compression only attempts to remove repetitions to decrease the size of the file. It really is not the same.

@enigmatico

> while encryption attempts to make everything look like completely random garbage

This is not true, as long as you aren't asking the encryption not to increase the length of the message. If you also ask for that, I agree.

> compression will only attempt to remove repetitions to decrease the size of the file.

_No_. Compression is trying to make the output as small as possible. If the probability distribution over the output is not uniform[1], you could make it even smaller. Granted, doing that will cost CPU time (likely on both the compression and decompression sides).
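One way to check this on your own files (a sketch of mine, not from the thread; the file names are placeholders): compute the order-0 byte entropy. Anything noticeably below 8 bits per byte means a single-byte frequency model alone could shrink the data further; the footnote's argument applies to the full output distribution, not just single bytes.

```python
# Sketch: Shannon entropy of the byte-value distribution, in bits per byte.
# Well-encrypted data of any reasonable size sits right at ~8.0.
import math
from collections import Counter

def bits_per_byte(data: bytes) -> float:
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

for name in ("archive.7z", "archive.zip"):   # placeholder file names
    with open(name, "rb") as f:
        print(name, round(bits_per_byte(f.read()), 4))
```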

Have you tried any compression scheme that was optimized purely for compression ratio, as opposed to some middle point of the Pareto frontier between ratio and (de)compression runtime performance? For example, the winning entries of mattmahoney.net/dc/text.html

[1] In a way (if we ignore concerns of runtime performance), compression and prediction/modeling are the same problem: if we have a compressor, it provides us with a model of probabilities over inputs (it assigns probability 2^(-|comp(x)|), where |y| is the length of y). If we have a model that assigns probabilities, we can create a compression scheme such that |comp(x)| = -log2(p(x)) by Kraft's inequality.

Obviously, that assumes that the logarithms above are integers. That is an approximation that gets as close to equality as you wish by increasing the input length (more precisely, assume the input is an i.i.d. sequence of symbols; then taking a long enough i.i.d. sequence with the same per-symbol distribution brings us arbitrarily close to equality).
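A small numeric check of that correspondence (a toy distribution of my own choosing, not anything from the thread):

```python
# From a probability model we get code lengths len(x) = ceil(-log2 p(x));
# Kraft's inequality (sum over x of 2**-len(x) <= 1) guarantees a prefix
# code with those lengths exists, and the expected length is within 1 bit
# of the entropy of the model.
import math

model = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}   # toy distribution

lengths = {sym: math.ceil(-math.log2(p)) for sym, p in model.items()}
kraft_sum = sum(2 ** -l for l in lengths.values())

entropy = -sum(p * math.log2(p) for p in model.values())
expected_len = sum(model[s] * lengths[s] for s in model)

print(lengths)                 # {'a': 1, 'b': 2, 'c': 3, 'd': 3}
print(kraft_sum <= 1)          # True -> a valid prefix code exists
print(entropy, expected_len)   # 1.75 and 1.75 for this toy distribution
```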
