Tuesday, 20 February 2018

Provenance - it's all about provenance

Six months ago, on a plane between Singapore and Melbourne, I watched a remarkable documentary about the attempt by the city of Detroit to sell off the contents of its art museum to defray the city's debts.

The scheme ultimately foundered - because of provenance.

One of the original founders of the museum apparently used to go on collecting tours of Europe, buying paintings from cash-strapped aristocrats who had lost everything in the First World War - so you would think it would be easy to work out provenance.

But, no. The person in question was an art collector in his own right, and while he would sometimes use the museum's money to pay for items, sometimes he would use his own, and sometimes he would 'sell' a piece to the museum at below cost and claim it as a tax loss.

And the records were a complete mess.

It wasn't clear which had been bought on behalf of the museum, which were on loan, and which were donations - and of course the more saleable paintings' records were as confused as the less valuable.

In this case having an unclear provenance worked for the museum - they couldn't sell what wasn't theirs, and they didn't know what wasn't theirs.

And I suspect that this is the case with a lot of museums that developed their collections in the late nineteenth and early twentieth centuries - the documentation is quite unclear.

Not for everything, of course. The Elgin Marbles, for example, have a clear provenance, and the case really depends on the legality or otherwise of Elgin's actions and whether his firman from the Ottoman governor really gave him permission.

But then we have cases like the Nizam of Hyderabad's mummy, which I blogged about back in 2015, where provenance is unclear: we know he bought it, but not whether it was illegally acquired. Likewise, in Amelia Edwards' account of her trip up the Nile in the 1870s, she recounts the story of the tourists who bought a mummy at vast expense, and after a week or so found that they could not stand the sweet odour emanating from it, and (literally) jettisoned their losses by throwing it in the Nile.

And all this is the problem with artefacts acquired during the period of European colonialism.

Were the items acquired legally? Were they acquired under duress? Or what?

And of course rules change. Egypt, for example, started to license archaeological digs quite early and had clear rules about both documentation and ownership - basically that the more significant items were automatically the property of the Egyptian Department of Antiquities, which is why so much of the material is in the Egyptian Museum in Cairo. But as we know, there are significant collections elsewhere, and as the Nizam of Hyderabad case shows us, the system was not perfect.

Other countries, especially those under colonial rule, were not so strict.

And for this reason, probably one thing that should be done is to digitise the museum records and correspondence, as well as those of individual archaeologists and collectors, both to settle the question of provenance and to provide an unrivalled insight into the history of archaeology, and its relationship to the antiquities trade in the nineteenth century ...

Sunday, 21 January 2018

Lenovo Ideapad K1 six years on ...

Yesterday was ferociously hot, so I did what I usually do when it’s too hot for gardening, and played with some old hardware, this time J’s old Lenovo IdeaPad K1, an android tablet dating from late 2011.

In its day it was pretty slick, slicker than the zPad, and a pretty nice bit of kit with an excellent screen - being an artist J spends a lot of time looking at pictures and illustrations - but it was a bit heavy to hold, and even though we'd invested in a stand-cum-charging station for it, it could be a pain to use for extended periods. Not only that, it would occasionally lose its network connection, or more accurately not recover gracefully when our router flipped from adsl to the backup 3G connection, so eventually it was replaced by a Samsung Galaxy.

By the time it was replaced, Lenovo had more or less abandoned the K1, but had, unusually, provided an option to upgrade it to an unsupported version of Android 4 - the K1 having originally shipped with 3.2.

We never followed that up at the time, as the only thing I used it for was downloading podcasts, and gPodder was happy with things as they were.

In retrospect, this was probably not such a good idea, as the links to the generic version have now (understandably) disappeared off of Lenovo’s website.

So, what can you do with 3.2?

Well, no modern browser, but Opera mini installs and runs quite nicely.

The previously installed wikipedia, gmail and twitter apps still work, as does inoreader - an rss feed reader. You can’t, of course, install anything recent, which means no decent text editor or anything like that.

But, given that most of what I use my current tablet for is wikipedia, email and twitter, plus a bit of rss feed reading, it isn’t a disaster. Not having access to OneNote or Evernote is a bit of a pain, but were my existing tablet to unexpectedly come to a bad end it would be good enough for a stopgap, which isn’t too bad for a device over six years old running an old operating system ...

Friday, 5 January 2018

Transcribing a blot

One of the tasks in documenting artifacts as part of the project is transcribing labels on the bottles of materia medica in the pharmacy.

Mostly this is fairly straightforward - the labels are on the whole beautifully stencilled in india ink on good quality paper, and so while they may be a little yellowed they're perfectly legible. It's the early twentieth century ones that are more of a problem - cheaper paper and sloppily written in faded fountain pen ink.

To be sure, they have their peculiarities - the extensive use of Æ in nineteenth century pharmaceutical Latin and outdated abbreviations like TṚ for tincture - but it's all fairly straightforward.

Until a couple of days ago, when I came across the following

where the label had been corrected - if you look carefully you can see what appears to be an extra L which has been blotted out in a different, thinner ink, presumably at a later date.

This of course raises a number of questions about transcribing the label - should I transcribe the label as it was meant to be read, or include the blot, or transcribe it as the original text and note that the first L had been blotted out at (presumably) a later date?

I decided to go for the middle route and transcribe the label as you would read it today, blot and all.

While I knew about the Text Encoding Initiative and the Leiden Epigraphy conventions, which I'm using to indicate missing or illegible characters, I didn't know about blots.

My first thought was to simply insert a unicode blot symbol, except there isn't one - as a stopgap until I could spend more time with Google I decided to use the cyrillic Zhe (Ж) as

  • there was no cyrillic text involved in the pharmacy anywhere
  • it sort of looked like the H^HZ^HN sequence we used to use in Wordstar days to generate a cursor symbol on daisywheel printers when doing documentation
  • having learned to read and write Russian I could write it with a degree of fluidity

I guess I could have used the unicode block character ( █ ) but as I also keep a longhand paper workbook in parallel with the transcription spreadsheet Ж seemed a better choice.

I started off by searching for things like 'epigraphy blot' without much success - well I guess stone inscriptions don't have blots, although they do have erasures, so I don't think it was that silly a search. 

Changing the search terms to something like 'TEI transcription blot' was more useful and produced a lot of information on how to represent blots in XML as well as important questions such as whether it was a correction by the author or a correction at a later date and differentiating between the two, as well as what to do if you weren't sure.

The only problem was that all this information was for creating XML markup, and I was transcribing the labels to an Excel spreadsheet using unicode, so I needed a standard pre-XML way of doing this that was going to be intelligible to someone else.

In the end I found the answer in the EpiDoc documentation maintained by Stoa.org. Under 'erased and lost' it not only documented the TEI XML but also referenced the previous pre-XML paper technology conventions, in this case [[[...]]], which was ideal.
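As a rough illustration of how the stopgap could later be swapped for the agreed notation, here is a minimal Python sketch. It assumes the transcriptions have already been read out of the spreadsheet as plain unicode strings; the function names and the sample label are made up for the example and are not taken from the project.

```python
STOPGAP = "Ж"               # stopgap blot marker used while transcribing
ERASED_LOST = "[[[...]]]"   # pre-XML notation for erased and lost text


def count_blots(transcription: str) -> int:
    """Count how many stopgap markers a transcription still contains."""
    return transcription.count(STOPGAP)


def normalise_blots(transcription: str) -> str:
    """Replace every stopgap marker with the erased-and-lost notation."""
    return transcription.replace(STOPGAP, ERASED_LOST)


# Hypothetical label text, purely for illustration.
label = "PULV. IPECAC. COЖMP."
print(count_blots(label), normalise_blots(label))
```

Keeping the stopgap in a single named constant means the whole spreadsheet column can be converted in one pass once the convention is settled, provided (as here) the stopgap character never occurs in the genuine label text.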

This little journey has raised a whole lot of questions, including whether we should be using TEI XML encoding for the labels.

The short answer is probably not. Unicode in Excel plus some standard notation is more than adequate in 99.9% of cases, and the whole majestic edifice that is TEI seems like complete overkill. But this little diversion certainly shows the importance of discussing and agreeing on transcription standards before starting on something as seemingly straightforward as a sequence of nineteenth century materia medica labels ...

Thursday, 28 December 2017

Orage revisited

Way back in 2007 I wrote a fairly simple script to download a google calendar file in ics format and stuff it into Orage, a desktop calendar application bundled with the xfce window manager that came with xubuntu.

I did it just to see how easy it was to do. Nothing more.

Even though a year or so on I started using a ppc imac with Xubuntu as my principal desktop machine, I didn't really invest a lot of effort in the script, even though some people at the time found it useful, preferring to use evolution to handle mail and calendar type stuff.

Fast forward to 2017:

For no good reason other than it was the day after Xmas I decided to see if I could get jpilot to import a google calendar file with a bit of handwritten code to convert the ics file to a basic palm compatible csv file.

Well, I haven't yet got as far as doing the csv conversion bit, as I found my orage download script didn't work any more.

Orage now keeps its ics file in ~/.local/share/orage and google's calendar file syntax has changed.

So I fixed it:

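# make sure the target file exists (empty) and timestamp the log
touch ~/calendar/basic.ics
date >> ~/calendar/google_download.log
# keep trying every 30 seconds until the download is non-empty
while test ! -s ~/calendar/basic.ics
do
  wget -rK -nH \
    https://calendar.google.com/calendar/ical/yourprivateicsfile.ics \
    -O ~/calendar/basic.ics -a ~/calendar/google_download.log
  sleep 30
done
# keep the previous calendar as a backup, then install the new one
if test -s ~/calendar/basic.ics
then
  mv ~/.local/share/orage/orage.ics ~/.local/share/orage/orage_old.ics
  mv ~/calendar/basic.ics ~/.local/share/orage/orage.ics
fi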

Obviously you replace yourprivateicsfile.ics with the link to your private google calendar ics file. If you are unsure how to find this, check out this google help page - the bit you want is titled 'See your calendar...'.

I've also spread the wget command over three lines for improved readability. Depending on the unix shell you are using you may need to get rid of the backslashes and turn it back into a single very long line to get it to execute.

wget now whinges about the combination of command line options but you can cheerfully ignore that (or fix it if you want)...
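For the csv conversion step that I haven't got to yet, something along the following lines might be a starting point. This is only a sketch in Python: it pulls the summary, start and end of each VEVENT out of the ics text and writes them as csv. The choice and order of columns are purely illustrative and are not the layout jpilot expects, and folded (continuation) lines in the ics file are not handled.

```python
import csv
import io

FIELDS = ["SUMMARY", "DTSTART", "DTEND"]  # illustrative column choice


def ics_to_rows(ics_text):
    """Return one dict per VEVENT, keeping only the properties in FIELDS."""
    rows, current = [], None
    for line in ics_text.splitlines():
        if line == "BEGIN:VEVENT":
            current = {}
        elif line == "END:VEVENT":
            if current is not None:
                rows.append(current)
            current = None
        elif current is not None and ":" in line:
            name, value = line.split(":", 1)
            name = name.split(";", 1)[0]  # drop parameters such as ;VALUE=DATE
            if name in FIELDS:
                current[name] = value
    return rows


def rows_to_csv(rows):
    """Write the rows as csv text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({field: row.get(field, "") for field in FIELDS})
    return out.getvalue()


# Made-up sample event, just to show the shape of the input.
sample = (
    "BEGIN:VCALENDAR\r\nBEGIN:VEVENT\r\n"
    "DTSTART;VALUE=DATE:20171226\r\nDTEND;VALUE=DATE:20171227\r\n"
    "SUMMARY:Boxing Day\r\nEND:VEVENT\r\nEND:VCALENDAR\r\n"
)
print(rows_to_csv(ics_to_rows(sample)))
```

The dates are passed through untouched; converting them to whatever date format the importing program wants would be a separate step.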

Wednesday, 20 December 2017

Using open source products for data collection

Following on from my little to-do with Excel and the problems in getting a product activation key updated when I was off the corporate network, I'm even more strongly of the opinion that open source products are the way to go.

While the organisation that provides our IT support resolved the problem efficiently, professionally, and with good humour, it did take an hour of phone calls. Given that I'm IT literate, even though I'm no windows engineer, I do wonder how easy it would have been had I been a less expert user.

In contrast, with open source the maintenance overhead is so little - no licence keys to worry about, and while there is clearly still a day to day support cost, it's probably not much different from proprietary, and in these days of Google and StackOverflow, it's less than it might once have been.

There is of course a case for ensuring that the applications used are of suitable quality and perhaps also a need for standard toolkits. It is of course unrealistic to expect individual researchers to do this, which is where product directories such as the Dirt Tools directory play a crucial role in allowing researchers to select and use suitable tools, but equally we also need to think about putting together a set of standard toolkits as a means of enabling the development of a set of community knowledge as to how to resolve common problems ...

Friday, 15 December 2017

Laptops for data collection

Over the years, a number of people have asked me about what I would suggest in the way of a computer for fieldwork, or research work in dusty libraries without internet or convenient power sockets.

Fieldwork computers tend to have a hard life, carried about repeatedly, bounced about in trucks, and always at risk of the wet, either as rain or spillages, or from dust and dirt.

My advice has always been to aim for the longest battery life for the lowest cost to keep the replacement cost down. Also these devices don’t need to do a lot - run a spreadsheet to record data, some sort of note management program and a text editor.

I’ve tried the cheap android tablet and keyboard combo, and that’s pretty good for straight note taking or even creating structured text (eg markdown) but tends not to shine for creating tabular data. Which is a pity, as they are cheap enough to be treated as a consumable.

So recently I’ve swung back to the refurbished netbook or laptop with linux, and a combination of basic tools. The software base of linux is so large that you can find just about anything, but I tend to favour CherryTree for notes management, Gnumeric for recording tabular data, gedit or kate for basic text, and perhaps something more specialist such as ReText for structured text, although kate’s syntax checker is pretty good.

If you want something for writing up draft reports, Focuswriter is fast and lightweight.

The downside is that battery life is poor. Two hours, three hours at most. Not enough for a decent session.

However, there are a number of these cheap eMMC memory based windows laptops available. Mostly I’ve avoided these as the amount of storage, typically 32GB, is too small, given that Windows will take around 20GB, depending on exactly how it’s configured.

Add a few extra programs and a bit of data, and there’s not a lot of headroom there. However, devices with 64GB storage are beginning to appear at a reasonable price. For example, the Lenovo Yoga 310-6K can be picked up from the usual suspects at around $400 - 450, which is about the midway price for a refurbished laptop.

But there are two downsides to the refurbished laptop route - firstly, if you want to keep windows, you’ll probably end up having to pay for a Windows 10 upgrade, and secondly, battery life won’t be great. And if you go for an older or cheaper machine it’ll probably have a 5400 rpm SATA drive, so you won’t be getting lightning disk performance anyway.

These cheaper eMMC laptops come with Windows 10. Versions of CherryTree, Gnumeric, and Focuswriter are available for windows. There’s always notepad or windows Codewriter as an editor, and if you need something a little more flashy for structured text there’s Typora, or Texts.io which will cost you around US$15 for a licence key.

What of course you’re getting is the longer battery life. You also get the bonus of being able to use the device in tablet mode, which makes showing people images - be it of plants, finds, sites, or handwritten text - much easier than on a laptop. The other bonus is OneNote, Microsoft’s note management tool.

I didn’t use to like OneNote - it seemed clumsy and slow compared to Evernote, but since working on the Dow’s Pharmacy project I’ve warmed to it.

Evernote remains the best ragbag management tool ever for categorising snippets garnered from everywhere. OneNote really isn’t good at imposing structure on chaos. What it is good for is building up a collection or collections of related notes - a subtle difference but an important one.

And of course you can have the best of both worlds and have both Evernote and OneNote on your machine.

So, what would I choose?

A few months ago I would have gone down the refurbished laptop with linux route, and if we’re talking about clever stuff like using R or iPython notebooks for on site data management and analysis I still would. For pure data collection, I’m not so sure. The increased storage and longer battery life certainly make these eMMC based devices an interesting option ...

Update 16/12/2017

I've ignored iPads - deliberately - simply because they have the same problem as an android tablet: the lack of a decent software base for data entry.

Friday, 1 December 2017

More on spreadsheet preservation and normalisation

Yesterday, inspired by a post about preserving Google sheets, I blogged about spreadsheet preservation in general.

As is the way of these things, the question has been rumbling round my brain ever since.

A long time ago, the National Archives of Australia released Xena, a normalisation tool that converts files into open xml based formats - essentially the open office formats used by Libre Office and others. The reasoning was that the xml produced is both documented and readily parsable, so it would be possible to recover the data and the calculations from any preservation file.

And in fact when we built the original ANU data archive, we silently implemented this normalisation process as part of the workflow. We didn't use Xena, but after using Pronom to work out whether we could recognise the file type, and whether we had a normalisation engine for it - essentially an xml export tool - we would use that engine to produce a long term preservation copy which we would store, along with the original, in a bagit archive.

The idea of storing both, of course, is that as we didn't test the normalisation processes, and tended to trust the tools, it is just possible we could have produced garbage as part of the normalisation process.
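To make the 'store both' idea concrete, here is a minimal Python sketch of packing an original and its normalised copy into a bag-style directory with a checksum manifest. It is not the code we ran in the archive; the function names are invented for the example, it assumes the two files have different names, and it writes only the bare minimum of the BagIt layout (a bagit.txt declaration, a data directory and a sha256 manifest).

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the hex sha256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def make_bag(bag_dir, original, normalised):
    """Copy both files into bag_dir/data and write a checksum manifest."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for source in (original, normalised):
        target = data / Path(source).name
        shutil.copy2(source, target)
        manifest_lines.append(f"{sha256_of(target)}  data/{target.name}")
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag
```

Because the original travels alongside the normalised copy, a faulty conversion can be redone later from the untouched source, and the manifest lets either file be checked for corruption.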

In fact we deliberately ignored the year 1900 problem, as we reckoned that only a small number of spreadsheets would be affected.
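Assuming the problem meant here is the well-known quirk whereby Excel's 1900 date system treats 1900 as a leap year, the following Python sketch shows why it only bites a small number of spreadsheets: only day serials up to 60 are affected. The function name is made up for the example, and it handles whole-day serials only.

```python
from datetime import date, timedelta


def excel_serial_to_date(serial):
    """Convert a whole-day serial from Excel's 1900 date system to a date.

    Returns None for serial 60, which Excel displays as the
    non-existent 29 February 1900.
    """
    if serial < 1:
        raise ValueError("serials start at 1 (1 January 1900)")
    if serial == 60:
        return None
    # Before the phantom leap day the count is one day out
    # compared with every later serial.
    epoch = date(1899, 12, 31) if serial < 60 else date(1899, 12, 30)
    return epoch + timedelta(days=serial)


for serial in (1, 59, 60, 61):
    print(serial, excel_serial_to_date(serial))
```

Any spreadsheet whose dates all fall on or after 1 March 1900 converts cleanly, which is the overwhelming majority of research data.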

So what does this mean for Google sheets?

Exporting to an xml format such as ods would seem to be the way to go, but given that it's not possible to preserve the original document, the sensible thing would be to download the spreadsheet in two formats, both ods and xlsx, given that both are in xml and that parsers exist for both formats.

The reconstituted spreadsheets should of course give identical results when imported into the appropriate utilities.

Exporting a single sheet spreadsheet as csv, or whatever, is only appropriate where there are no calculations involved, an example being where the spreadsheet was used to record species abundances in a number of quadrats.

The decision about whether to use an ascii format such as csv is best left to the researcher - they know their data, and whether it's appropriate.

The standard procedure should be to use a richer xml based format, and preferably two of them.

Ideally there should be some sanity checking before ingest ...
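As an idea of what a very basic sanity check might look like, here is a Python sketch that relies only on the fact that both ods and xlsx files are zip containers holding xml. It checks that the container is intact, that the main xml member is present, and that it parses. The function name is invented, and passing this check says nothing about whether the data or calculations survived the export.

```python
import zipfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Main xml member expected inside each container type.
REQUIRED_MEMBER = {".ods": "content.xml", ".xlsx": "xl/workbook.xml"}


def sanity_check(path):
    """Return a list of problems found; an empty list means the file passed."""
    suffix = Path(path).suffix.lower()
    member = REQUIRED_MEMBER.get(suffix)
    if member is None:
        return [f"unrecognised extension {suffix!r}"]
    if not zipfile.is_zipfile(path):
        return ["not a zip container"]
    with zipfile.ZipFile(path) as archive:
        bad = archive.testzip()  # first member failing its CRC check, if any
        if bad is not None:
            return [f"corrupt member {bad}"]
        if member not in archive.namelist():
            return [f"missing {member}"]
        try:
            ET.fromstring(archive.read(member))
        except ET.ParseError as error:
            return [f"{member} is not well-formed XML: {error}"]
    return []
```

A check like this is cheap enough to run on every file at ingest, and anything that fails can be bounced back to the researcher while they still have access to the original.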