@clathrin @richardsever Let me tell you a story about data reuse
I have a student who wants to re-analyse some sequencing data from a paper.
1. Paper reports the data will be shared on GEO at the time of publication. No GEO ID is present, although the paper has been published for a while now.
2. After digging into GEO for longer than needed (it always feels like I'm back in the 90s on MySpace), I find the data. Well, part of it, the rest is nowhere to be found.
3. Several emails to the authors, no reply
4. In the meantime we manage to find the data. I confirm my hate for GEO.
5. Data is from human patients, which the paper indicates as patient 1, 2, 3, etc.
The patient numbers in the deposited data only partially match those in the paper. Luckily we're pretty confident we can tell who's who but still...
Data from one patient (out of 6) is missing
6. Out of three sets of experiments, data is available for only two; the third is missing
The student has pretty much finished her project.
We're still waiting for the authors to reply to our emails.
I have to say, I've had much better experiences with other papers, but still, I think we're a long way from having properly archived data.
@nicolaromano @clathrin @richardsever you’ve hit a touchy subject for the @roylelab! I had a similar experience trying to re-analyse proteomics data claimed to be “reposited in full in [public proteomic data repositories]”. It has proven very difficult, if not impossible, to match datasets with experimental conditions within a reasonable time, so lots of insights from comparative proteomics remain untapped!
@nicolaromano @clathrin @richardsever
It would be helpful to document reuse/replication attempts on PubPeer, especially the missing data. It might prevent others from wasting as much time as you. And - who knows - occasionally the authors do reply.
@BorisBarbour @clathrin @richardsever Absolutely, I have put a note on PubPeer, but no response so far!
@nicolaromano @clathrin @richardsever I completely share your pain about data reuse. I think one huge obstacle is that there isn't a standardized way to store data online. #OpenML (https://www.openml.org/) used to be a great initiative to ease access to and availability of data, but unfortunately the site doesn't work as expected anymore. I'm still looking for open alternatives.