There are two philosophies in #programming for handling questionable #data. The first is to check the #integrity of the data every time it's used. This takes a fair amount of #programmer time and, depending on the size of the data, may also take a fair amount of #computer time. It's a PITA to write, test, debug, and run.
The second is to say "I've already checked this data a bunch of times in the program, it's fine" and skip the integrity checks after the first time. In #scientific programming, this is particularly tempting: the data sets are huge, and writing checks is annoying. The whole thing feels like a waste of time when you're reasonably sure your code will never run on anything except this particular data set, which you already see more of than your family and your pets, and you just want to get the damned thing done.
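To make the contrast concrete, here's a minimal sketch of what the first philosophy can look like in Python/pandas. The dataset, the column assumptions, and the function names are all hypothetical; the point is only that every function that touches the data re-checks it before trusting it.

```python
# Minimal sketch of the "check every time" philosophy (pandas-based;
# the table layout and names here are hypothetical, just to illustrate).
import pandas as pd

def check_expression_table(df: pd.DataFrame) -> pd.DataFrame:
    """Validate the data every time it's about to be used."""
    if df.empty:
        raise ValueError("expression table is empty")
    if df.isna().any().any():
        raise ValueError("expression table contains missing values")
    if (df.select_dtypes("number") < 0).any().any():
        raise ValueError("negative expression values found")
    return df

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    # First philosophy: re-check before every use.
    df = check_expression_table(df)
    return df / df.sum()

    # Second philosophy: skip the check and trust the caller.
    # return df / df.sum()
```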
About 95% of the time, I take the first approach. Every time I do it, I'm grumbling to myself. Just finish it, already! And I am uneasily aware that those who take the second approach get their work done faster than I do.
Yes. This is true.
They also get a lot of #garbage results—many of which don't look like garbage at all. Here comes the ritual chest-thumping ... in #bioinformatics, and #biomedical #research generally, those mistakes don't just lead to flawed publications, as bad as that is. Garbage results kill people.
I just received a lesson in why the first approach is a really good idea. Let's be careful out there.
@medigoth @ChemBob
Only _slightly_ relevant...