steer October 12 2015, 12:02:02 UTC
Re: Replicability -- it's much harder than you think.

A lot of my papers are a combination of mathematics, computer code and data.

The mathematics (if I don't screw up) is replicable with effort (we rarely publish enough "steps" to follow an argument without intellectual work, because papers are limited in length). The data may be proprietary, but if it is not then I try to publish it somewhere. The code is the problem. Code is code... but a few months ago someone emailed me about a seven-year-old paper: "send me the code".

I still had the code. If I did not, I am relatively sure none of the other authors would have had it -- I wrote the code and did all the runs. The code was stored in a Subversion repository -- but not all academics use even that. Version control came late to academia compared with industry, because people are usually coding on their own.

Because I am still friends with the people I worked with then, I still have access to the Subversion repository, and it has not been shut down. Had it been shut down, the code would be gone except for local copies on my machines. I might be able to find a copy of the code from a backup -- but how often do you check your "seven years ago" backup?

Then, having got the code, I had to compile it. That was the big problem. The default parameters to the compiler had changed a lot in the intervening time, so the makefile didn't work. After a lot of effort I figured out that it needed a particular compiler flag (-std=c99, I think) to build.
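
For what it's worth, the eventual fix was just to spell out the language standard instead of relying on the compiler's defaults -- something along these lines (the file names and flag set are a hypothetical sketch, not the actual makefile):

    # pin the language standard explicitly rather than trusting the default,
    # which had changed between compiler versions (names are hypothetical)
    cc -std=c99 -Wall -O2 -o simulator simulator.c -lm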

Having done that, I then needed the results-analysis code -- this was in Perl, just not written for the version of Perl currently installed on my machine.
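
The cheapest insurance I can think of now is simply recording the toolchain versions next to the results at the time of the runs -- a sketch, with hypothetical paths:

    # note the interpreter and compiler versions alongside the results,
    # so a future re-run at least knows what it is trying to match
    perl -v > results/toolchain.txt
    cc --version >> results/toolchain.txt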

You *could* make this kind of thing replicable by having a VM image... and having something to run that VM image on that still exists seven years later. But at the time this would have been kind of a ridiculous proposition as the VM emulator would not have been powerful enough to run the code in reasonable time.

Nowadays I guess I could save a Vagrant image and hope that in seven years' time Vagrant is still a usable thing.
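
Something like the following, I suppose -- export the provisioned VM as a box file and archive it next to the code (the box name is hypothetical, and this assumes a provider that vagrant package supports, such as VirtualBox):

    # package the current Vagrant VM as a reusable .box file for archiving
    vagrant package --output paper-env-2015.box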

Networking and systems papers are even harder, as you're measuring the performance of your code on particular machines and hardware. Your paper won't be replicated unless the person also has access to that hardware, and after seven years there's not a chance of that.

Sure... it's not ideal, but it's a surprisingly hard problem to solve. Some colleagues are trying to solve it by insisting that papers come with a VM and scripts that do, essentially, "build paper". This only works if you don't need real hardware/networking performance.
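
The idea being that, inside the VM, regenerating the paper is one command -- roughly this shape, with every name below being hypothetical:

    # "build paper": re-run the experiments, redo the analysis,
    # regenerate the figures, then rebuild the PDF from source
    ./run_experiments.sh
    perl analyse.pl results/ > figures/summary.csv
    pdflatex paper.tex && bibtex paper && pdflatex paper.tex && pdflatex paper.tex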

andrewducker October 12 2015, 12:46:48 UTC
Oh, I'm sure it's harder with some subjects than others. And replicating it when you're dealing with specific machine configurations clearly isn't going to go well.

But with the current state of worries about the quality of research, I suspect that replication is going to be something that there's a bigger and bigger demand for.

steer October 12 2015, 13:32:57 UTC
I suspect that replication is going to be something that there's a bigger and bigger demand for.

Hmm... yes and no. My original degree is in physics and to some extent I still have the mindset. A lot of famous studies were non-replicable.

Millikan's oil drop is a pretty famous experiment to physicists -- not sure about outside the field. It's notoriously finicky. I'm a pretty bad experimenter (too impatient, I think) and I hated doing it -- but it's the classic experiment for measuring the charge on the electron. Apparently Millikan's original, with modern statistical analysis, had so much uncertainty that it essentially proved nothing. A completely unrepeatable junk result, in those terms. But in fact it was the "right" experiment. It was repeatable in the better sense: it was the right thing to do, and it could be repeated and refined over the years.

Another classic was Eddington's WWI eclipse measurements in Principe, which were the canonical first "confirmation" of the predictions of General Relativity (and also rather neatly defused the awkwardness of Eddington being a conscientious objector -- he was "forced" to do fieldwork in Principe instead of serving in WWI). Again the errors were (by modern standards) larger than the measurement, and it's not replicable without a time machine or another convenient eclipse.

Individual studies don't get replicated, but the ideas they propound are either confirmed by further experiment, refuted by further experiment, or ignored completely (in which case it doesn't matter).

I really enjoyed this book (The Golem, by Collins and Pinch):
http://www.amazon.co.uk/The-Golem-Second-Edition-Classics/dp/1107604656

It's got some great case studies of times when various disciplines have split on the possibility, or otherwise, of a certain experiment or method. One of the most compelling was about the flatworm memory experiments, which is a classic tale of how hard it is to repeat results, except that I'm not sure what the moral really is. Eventually the answer will be known, but sometimes it takes a long time to get there.

http://www.theverge.com/2015/3/18/8225321/memory-research-flatworm-cannibalism-james-mcconnell-michael-levin

simont October 12 2015, 14:09:20 UTC
Because I am still friends with the people I worked with then, I still have access to the Subversion repository, and it has not been shut down. Had it been shut down, the code would be gone except for local copies on my machines.

This is the sort of thing that a DVCS improves on, of course -- if you'd started a similar project today, you might (I'm guessing) have naturally used git rather than svn, in which case any local copy lying around on any machine or backup you could find would have automatically come with a copy of the complete history, and the availability of the upstream server wouldn't be so critical.
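
To make that concrete (the paths here are purely illustrative), any stray working copy recovered from a backup is itself the whole archive:

    # a recovered git working copy carries the complete history with it
    git clone /backups/2008-laptop/paper-code restored-paper-code
    cd restored-paper-code
    git log --oneline | tail -n 5    # the earliest commits are all present locally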

Of course that wouldn't solve the rest of the problems, like the scripts not running right in up-to-date Perl and similar bit-rot. But it would be a start, at least.

steer October 12 2015, 14:16:09 UTC
Hmm... I think the difference is marginal, as seven years ago I wasn't recording which checked-out revision I was using for the experiments -- actually I rarely do this even now, though I know I should (deadlines, deadlines). But yes, it's something. I think software replication is getting better. So often, though, you find that old links lead to dead services, whether privately or publicly hosted.
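
If I were being disciplined about it, it would only take a line at the start of each experimental run -- a hypothetical sketch, not what I actually did at the time:

    # record exactly which revision of the code produced these results
    svn info | grep '^Revision:' > results/code-revision.txt
    # or, with git:
    git rev-parse HEAD > results/code-revision.txt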

Today, for example, hosting repos on GitHub makes sense, and some of my papers point to it as the code repository. Will it still be there in seven years? I wouldn't bet on it.

simont October 12 2015, 14:20:30 UTC
Yes, I was just thinking that really you want the code to be stored alongside the paper, because if the paper itself isn't available any more then you have worse problems than the unavailability of the code.

In maths, for example, it seems increasingly that everybody who is anybody posts their papers on the arXiv, so I suppose the right answer would be that the arXiv should provide a means of hosting a git repository alongside the PDF, and that any paper on there with a vital computational component (which in maths, I expect, would be less about replicability and more in the 4CT 'computer-assisted proof' sort of space) would take advantage of that. (Bet they don't, though.)

In a discipline where papers are still mostly in hard-copy journals, that might be (even) harder to arrange...

steer October 12 2015, 14:25:59 UTC
I'm not sure that a git repository is likely to outlast arXiv though? Or do you mean arXiv should be a git host in itself?

Of course journals themselves don't last forever, and do fuck up. I discovered after about five years that the journal with my most-cited paper had screwed up and never put my paper online (a broken link). Nobody seemed to have noticed, as it was available on my website and IIRC on arXiv, so it continued to be cited at the journal, where it wasn't available except in hard copy.

simont October 12 2015, 14:28:16 UTC
Yes, I meant arXiv should be a git host itself. It needn't be a large and complicated sysadmin job -- if you're only hosting a repository for read-only access, you can just stick the repository directory somewhere it's accessible over straightforward HTTP and run 'git update-server-info' in it, and then it's basically no different from offering any other static file(s) for download.
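
Roughly this, on the hosting side (paths and URL purely illustrative):

    # make a bare copy of the repository and prime it for "dumb" HTTP serving
    git clone --bare paper-code paper-code.git
    cd paper-code.git
    git update-server-info    # writes the info/ files that dumb-HTTP clients need
    # then drop paper-code.git under the web server's document root; readers can
    #   git clone http://archive.example.org/papers/some-paper/paper-code.git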

steer October 12 2015, 14:40:06 UTC
Hmm... I'm not sure I'd care to bet on which will live longer, arXiv or GitHub.

simont October 12 2015, 15:14:19 UTC
But that's my point - if arXiv goes away, with all the actual papers on it, then nobody will be picking up a paper from it and saying 'help, I can't reproduce this result!' in the first place. The mathematical community is so dependent on arXiv that they'll need to salvage the data from it if the server itself vanishes, and if they can't, then they have bigger problems anyway.

The scenario you want to avoid is that the paper is still out there claiming some result, and the critical supporting code isn't.

steer October 12 2015, 15:30:56 UTC
Not very many successful academics are publishing on arXiv alone, simply because your funding rests on getting things into prestigious journals. So if you say "hey, look, I'm doing great stuff, I'm publishing on arXiv and it's brilliant" your HoD says "What the hell are you thinking, get that into somewhere decent right now".

There are exceptions, of course, but how many? Yes, Perelman's paper on the Poincaré conjecture -- but he's Perelman. He could scrawl it on a loo wall, someone would put it online for him, and it would live on.

So papers will (mostly) survive arXiv dying anyway if the journal they are in survives. :-)

simont October 12 2015, 15:37:59 UTC
A recently emerging phenomenon, and part of the reason I say mathematicians are close to critically dependent on the arXiv, is the concept of an 'arXiv overlay journal', which consists of a set of links to papers on the arXiv.

Rationale: the point of a journal is not the physical publication and distribution of the paper, which the arXiv does better anyway; the real added value is the selection and peer review which winnows the great mass of proto-papers out there into ones that are judged by sensible people to be both correct and important.

So you upload your preprint to the arXiv, you submit to the journal by sending them a link, and if it passes peer review, then a link to that arXiv entry appears on the journal website.

steer October 12 2015, 15:45:18 UTC
Ah... I wasn't aware of that. Does a paper copy exist or not (not that a paper copy is that much more reliable, but it means there's a good chance a copyright library has a copy)? Good idea though -- makes sense. Are any of these journals prestigious yet?

TBPH, the likelihood of arXiv going down without warning and with no backup is negligible, though.

(Goodness knows how many people analyse arXiv anyway as part of their research -- I bet a good chunk of the network science community is holding a local copy.)
