Wednesday 25 January 2012

dropbox, evernote and the digital repository

Over lunch I watched a couple of videos from the DepositMo project and one thing that grabbed my attention was the way one of the speakers referred to repositories as 'dropboxes'.

Conventionally they are quite different and do different things:

A dropbox is where you put stuff you either want to move/share between different machines, eg home and work, or share with a small number of other people. The files have no context, ie no metadata at all, except any you create by way of the file name, eg picture_of_lily.jpg. Only you of course know if it's a picture of a lily or a person, dog, cat, hamster or whatever called Lily. In fact a quick look at wikipedia's disambiguation page for lily shows an even wider range of possibilities for 'lily'. And of course just because it's called .jpg doesn't mean it has to be a JPEG.
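To make the point about extensions concrete, here is a minimal sketch that compares what a file name claims with what the first few bytes of the file actually say. The function names and the small signature table are my own illustration, not part of any dropbox service, and the table only covers three common image formats.

```python
# Leading-byte signatures for a few common image formats.
# This table is deliberately tiny and purely illustrative.
MAGIC_SIGNATURES = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"GIF87a": "gif",
    b"GIF89a": "gif",
}

# What each extension claims the content to be.
EXTENSION_CLAIMS = {"jpg": "jpeg", "jpeg": "jpeg", "png": "png", "gif": "gif"}


def sniff_type(data: bytes) -> str:
    """Guess the format from the leading bytes, ignoring the file name."""
    for signature, kind in MAGIC_SIGNATURES.items():
        if data.startswith(signature):
            return kind
    return "unknown"


def extension_matches_content(filename: str, data: bytes) -> bool:
    """True only if the extension is known and agrees with the sniffed type."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    claimed = EXTENSION_CLAIMS.get(ext)
    return claimed is not None and claimed == sniff_type(data)


if __name__ == "__main__":
    png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
    # A PNG wearing a .jpg name: the name and the content disagree.
    print(extension_matches_content("picture_of_lily.jpg", png_bytes))
```

Even when the check passes, all it tells you is the format. It still says nothing about whether Lily is a flower or a hamster.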

So dropbox items are context free. You may create a set of naming and directory conventions but they mean nothing to anyone but you. An example might be the mp3 of a presentation that you transfer to home, download to an mp3 player and listen to on the bus. You might give it a meaningful name or you might simply call it preso.mp3. As long as the name means something to you that's all that matters.

Evernote, or indeed OneNote, is different. You could simply treat it as a dropbox, but in fact it lends itself to organising data, and the natural trend is to group data thematically. Therefore I have material grouped by project, so I can find anything that relates to DC7D. I can also add metadata as tags, eg 'invoices', so I can search for all invoices or indeed only the invoices referring to DC7D.
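As a rough sketch of what that grouping-plus-tagging buys you, the snippet below models notes with a notebook and a free set of tags. The Note class and find function are hypothetical and have nothing to do with Evernote's actual API; they only illustrate the idea of filtering by project, by tag, or by both.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    """A note lives in one notebook and carries any number of free-form tags."""
    title: str
    notebook: str
    tags: set = field(default_factory=set)


def find(notes, notebook=None, tag=None):
    """Return the notes matching the given notebook and/or tag (None = any)."""
    return [
        n for n in notes
        if (notebook is None or n.notebook == notebook)
        and (tag is None or tag in n.tags)
    ]


if __name__ == "__main__":
    notes = [
        Note("Venue deposit", "DC7D", {"invoices"}),
        Note("Programme draft", "DC7D", {"planning"}),
        Note("Laptop repair", "Admin", {"invoices"}),
    ]
    print([n.title for n in find(notes, tag="invoices")])
    print([n.title for n in find(notes, notebook="DC7D", tag="invoices")])
```

The tags here are whatever I choose to type, which is exactly why the result is a folksonomy rather than a controlled vocabulary.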

This is of course reliant on me being organised, but at least instead of picture_of_lily.jpg I have a pictures notebook, with an entry tagged 'Lily'.

If I've done my tagging right I end up with a pile of data organised as a de facto folksonomy. Thus in my pictures folder I have a picture of Wen Xiu, the mistress of Pu Yi, and it is tagged 'China' and 'Manchuria'. (I also have a folder of material related to Manchuria, some of which is tagged 'Russia' and 'Korea', which contains material relating to a personal project, which may or may not turn into something about writers and journalists in 1930s China - the point being it makes sense to me to classify things that way, not due to strict logic. A folksonomy is a contextual aid to organisation, not a substitute for a classification schema.)

A repository is of course something else. In the classic model it is a collection of published documents about which we need to know a number of standard things, basically who, what, where, and the format. In a digital preservation system - one holding electronic versions of historic documents, say early pictures and recordings of Yolngu ceremonial events - we never want to change things. In a repository of research preprints we may want to replace items with corrected versions of documents.

Of course we may want to transfer the content to a curated system such as that being developed by Project Bamboo and add value by creating a transcript of the sound recording and an English language translation of the transcript and annotations for the image data.

As this work is revisable we may of course want to put it in a transcripts and annotation repository separate from the preservation repository.

Just to muddy the waters we could imagine a work in progress repository, where updates to a document are regularly submitted but the basic metadata remains the same. In fact we should probably just admit that repositories are really (just) content management systems: it's a repository when used by librarians, a preservation system when used by archivists and a CMS when used by everyone else. Architecturally they're the same, it's just that the workflows around how content is ingested, retrieved, displayed and disposed of differ.

However, let's assume that when we say repository we mean a system that uses standard metadata and contains objects subject to little or no revision, as in any classic university research repository.

Digital asset management systems, or preservation repositories, are effectively the same. Not quite, as they also need systems in place to maintain the integrity of the data, along with more complex metadata and access control models.

This might lead you to think that a data repository was effectively the same. After all, if you digitise an audio recording using a specialised digitisation workstation like a Quadriga workstation you capture some machine based information and technical information which is typically embedded in the technical metadata section of a WAV file, perhaps with some added vendor extension fields.

The preservation repository ingest process would typically both extract the technical metadata from the file and capture some human created metadata - the who, what, where component.
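A minimal sketch of that two-part ingest, assuming a plain WAV file and using only Python's standard wave module. The function names and the shape of the record are my own invention, and the standard module only reads the basic header fields (channels, sample width, sample rate, frame count), not any vendor extension fields a digitisation workstation might add.

```python
import io
import wave


def technical_metadata(wav_file):
    """Read the basic technical fields from a WAV header."""
    with wave.open(wav_file, "rb") as w:
        return {
            "channels": w.getnchannels(),
            "sample_width_bytes": w.getsampwidth(),
            "sample_rate_hz": w.getframerate(),
            "frames": w.getnframes(),
        }


def build_ingest_record(wav_file, descriptive):
    """Combine machine-extracted and human-supplied metadata in one record."""
    return {
        "technical": technical_metadata(wav_file),
        "descriptive": dict(descriptive),
    }


def make_silent_wav(seconds=1, rate=8000):
    """Build a small mono 16-bit WAV in memory so the sketch is self-contained."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * rate * seconds)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    # The descriptive values are placeholders for what a human would enter.
    record = build_ingest_record(
        make_silent_wav(),
        {"who": "unknown recordist", "what": "test tone", "where": "unknown"},
    )
    print(record)
```

The technical half comes for free from the file. The descriptive half only exists if someone types it in.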

Functionally the process is the same as acquiring information from any other instrument, be it a seismometer, a radio telescope, or whatever.

Except it's not. When you are preserving data you are preserving 1s and 0s. Unlike TIFFs or WAVs there are no rules about data or format integrity checks; all you have is the metadata, either that entered by humans or acquired from instruments. Even though we pay lip service to using standard schemas, really it's much more like an Evernote notebook, with some tags and information that make sense to the user or groups of users, plus a human readable description of what all the columns in the data mean. Without that it's meaningless. Context is everything with data. At least with an image you can guess what it might be showing. A spreadsheet or set of spreadsheets can be utterly opaque.

As a system a data repository looks a lot more like a software archive, such as that run by Mirrorservice.org, than a classic DSpace implementation. Yes, it needs to speak something standard such as RIF-CS to produce standard descriptions of the items, but unlike a print or image repository where we know implicitly how to deal with the different media types, we have no idea of how to deal with the data stored in the object.

A data repository is a collection of data objects stored according to a standard set of rules. So, just as we expect a software tar file to unpack and show a readme, a manifest, and perhaps a makefile we should expect a data set to unpack to contain the data, the technical metadata, and a description of the files, both their structure and significance, a bit like you get with either SEED or FITS.
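One way to picture such a rule is a check at ingest that a data package unpacks to the expected pieces. The sketch below is only an illustration: the required member names (README, MANIFEST, technical_metadata.json) are assumptions of mine, not a convention taken from SEED, FITS or any existing repository.

```python
import io
import tarfile

# Hypothetical packaging rule: every data set must carry these files.
REQUIRED_MEMBERS = ("README", "MANIFEST", "technical_metadata.json")


def missing_members(tar_bytes, required=REQUIRED_MEMBERS):
    """List the required files absent from a tar archive held in memory."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        # Compare on the base name so files may sit inside a top-level folder.
        names = {name.split("/")[-1] for name in tar.getnames()}
    return [r for r in required if r not in names]


def make_package(files):
    """Build a tar archive in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()


if __name__ == "__main__":
    bare = make_package({"data/readings.csv": b"t,value\n0,1.5\n"})
    # A bare data file with no description fails the packaging rule.
    print(missing_members(bare))
```

A check like this only confirms the description is present. Whether the README actually explains what the columns mean is still down to the depositor.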

So, when we build a data repository are we really building something more like an Evernote for data, rather than a DSpace for data?

And in that context should we simply use off the shelf CMS technology, such as Alfresco, rather than a dedicated repository application?


