Writes Dr Victoria Pickering, Post-Doc Research Fellow (Natural History Museum).
Kexin Li, a UCL student in MSc Digital Humanities, worked with the NHM team on a part-time basis for six weeks. The NHM work package continues to focus on mobilising data from Sloane’s copy of John Ray’s Historia Plantarum, the publication in which he catalogued the specimens in his herbarium. Extracting plant names and specimen annotations from Ray will, for the first time, make the Sloane botanical collection digitally available and searchable.
The NHM has been working closely with the Sloane Lab tech team to experiment with ways of mobilising this data. On a sample of 10 page images from Ray, Amazon Textract was able to detect both printed text (plant names) and handwritten annotations (specimen locations). Kexin analysed the outputs of this process and reported on the successes and limitations of using Textract for this purpose. Focusing on the printed text (plant names), Kexin highlighted every error in the output and found that the letters most often affected included s, v, l, r, g, and q. For example, 54 instances of ‘s’ were misidentified as ‘f’. Likewise, every instance of an English common plant name appearing in Gothic print was either incorrectly captured or not detected at all. Kexin’s work has provided important evidence for our workflow discussion and enabled us to think clearly about next steps and priorities, such as comparing these outputs with those from Transkribus, a text extraction platform developed for historical handwritten sources in particular.
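A letter-level tally of this kind can be sketched in a few lines of Python. This is only an illustration, not the method Kexin used: it assumes a manually checked transcription is available for each page, the function name `substitution_counts` is hypothetical, and the example strings are invented rather than taken from the project data.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitution_counts(truth: str, ocr: str) -> Counter:
    """Count (expected, recognised) character pairs where the OCR text differs."""
    counts: Counter = Counter()
    # autojunk=False stops difflib from ignoring frequent characters in long texts.
    matcher = SequenceMatcher(None, truth, ocr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only equal-length replacements can be paired character by character;
        # insertions, deletions and uneven replacements are skipped here.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for expected, got in zip(truth[i1:i2], ocr[j1:j2]):
                counts[(expected, got)] += 1
    return counts


# Invented example strings (assumption), not real Textract output.
truth = "Rosa sylvestris"
ocr = "Rofa fylveftris"
print(substitution_counts(truth, ocr).most_common())
```

On the invented pair above, the tally attributes three substitutions to ‘s’ being read as ‘f’. Text that is missed entirely, as with the Gothic print, would show up as deletions rather than substitutions and would need to be counted separately.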