What do we want? Persistent identifiers! When do we want them? NOW!!: bungo

bungo

What do we want? Persistent identifiers! When do we want them? NOW!!

Sep 30, 2013 22:05

I find myself suddenly in the position of trying to understand the persistent identifier meme for our data centre. This is a perilous environment. DOI ("de-oh-eye"? "doiy"?) and their ilk are ostensibly about the more-perfect implementation of the scientific method - supporting reproducible research and acknowledging sources - but really seem to be about ensuring credit goes to ME! [In this, of course, they are much like the rest of institutionalised science, I suppose.] So there is potential for something of a land-grab unfolding at the moment.

Making something work is such a tangle of social and technical issues I hardly know where to begin in attempting a rational strategy, and expect we will end up muddling along with something ad hoc and last minute. This is fine, but we and our descendents will be (hope to be) left with the resulting persistent mess afterwards for a long time. Will they thank us for it?

First we must stake out the ground with identifiers for data sets, and my terminology problem begins: what's a "data set"?

Is it (a) an entire time series - such as temperature observations in Berlin since the dawn of time to t=+infinity? A series produced by a specific method/experimenter e.g. issued by the weather service? The volumes of these carved up into time periods? For now, for me, a data set is something like the second of these, but it is likely be a union of time series - the temperature, pressure, and rainfall observations in all German cities, say.

Or is it (b) the only the set of values I used in my analysis? "We computed monthly averages for January between 1980 and 2000 from four-hourly samples obtained from the Deutsches Wetter Dienst [DWD Climate Series, 1900-, doi:10.9999/argomigod.45rwfas4rq3453q]."

What are the adjectives/words to distinguish these cases? "data holdings" and "particular data set"? "specific data set"? "data archive"? I have no idea.

Eventually we need a system for citing "dynamic" data sets, in which the observations may be revised as more complete or better quality data arrives. So where I have that old persistent identifier from last year's paper, what does it actually identify now? It should be the state of things then, not now. But now the data centre operator (er, wait, that's me!) has to be able to dig out that old state. And, oops, we overwrote that part of the file with the new values. The magic medicine coming to the rescue that I'm hearing are "temporal database", but, yikes, that will be a hard pill to swallow, and a slow one to digest.

Ideas welcomed.

english, language, politics, work