Links:
exiftool: https://www.exiftool.org/
qpdf: https://qpdf.sourceforge.io/
dangerzone (GUI; renders the PDF as images, then re-OCRs everything): https://dangerzone.rocks/
mat2 (renders the PDF as images, no OCR): https://0xacab.org/jvoisin/mat2
Here's a shell script that recursively removes metadata from PDFs in a provided (or current) directory, as described above. It's for Mac/*nix-like computers and needs qpdf and exiftool installed:
https://gist.github.com/sneakers-the-rat/172e8679b824a3871decd262ed3f59c6
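For reference, a minimal sketch of the same idea (not the linked gist, which handles more cases): exiftool clears the PDF metadata, but it does so as an incremental update that leaves the old data recoverable, so the file is rewritten with qpdf afterwards to drop it for good. Paths and output names below are illustrative.

```bash
#!/usr/bin/env bash
# Minimal sketch: recursively strip metadata from PDFs under a directory.
# Requires exiftool and qpdf on PATH; output filenames are illustrative.
set -euo pipefail

dir="${1:-.}"   # directory to clean, defaults to the current one

find "$dir" -type f -iname '*.pdf' -print0 | while IFS= read -r -d '' pdf; do
  # exiftool removes the XMP/Info metadata, but only as an incremental
  # update, so the old data is still recoverable from the file...
  exiftool -all:all= -overwrite_original "$pdf"
  # ...rewriting with qpdf drops the orphaned objects for good.
  qpdf --linearize "$pdf" "${pdf%.pdf}.cleaned.pdf"
done
```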
The metadata appears to be preserved on papers from sci-hub. Since sci-hub works by using harvested academic credentials to download papers, this would allow publishers to identify which accounts need to be closed/secured.
https://twitter.com/json_dirs/status/1486135162505072641?t=Wg5XAzujycz79Cop_ap8vQ&s=19
@jonny I wonder whether uploading every paper to sci-hub twice would be feasible (i.e., would enough people still do that?). If we did, it would allow sci-hub to verify with reasonable certainty that whatever watermark-removal method they use still works.
@robryk
yes, definitely. all of the above. fix what you have now, adapt to changes; making double grabs part of the protocol makes sense :)
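If double grabs did become part of the protocol, the verification step could be as simple as running both copies of a paper through the same cleaning pipeline and comparing the results. A hypothetical sketch, with copy_a.pdf/copy_b.pdf standing in for two independent downloads of the same paper:

```bash
#!/usr/bin/env bash
# Hypothetical double-grab check: clean two independent downloads of the same
# paper and compare them. If the cleaned copies still differ, some per-download
# watermark (in metadata or in the page content) likely survives the cleaning.
set -euo pipefail

clean() {
  exiftool -all:all= -overwrite_original "$1"
  # --deterministic-id makes qpdf derive the document /ID from file contents,
  # so identical inputs produce identical outputs.
  qpdf --linearize --deterministic-id "$1" "${1%.pdf}.cleaned.pdf"
}

clean copy_a.pdf   # first download
clean copy_b.pdf   # second download

if cmp -s copy_a.cleaned.pdf copy_b.cleaned.pdf; then
  echo "cleaned copies are identical: no per-download watermark detected"
else
  echo "cleaned copies differ: residual watermarking is likely"
fi
```

Byte-identical output is a strict test; harmless differences between downloads could also trigger it, so a text- or image-level diff might be a better signal in practice.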