Columbia's Group for Experimental Methods in the Humanities

Middle English Manuscript OCR tool


03/18 The project received summer grant funding ($9,000) with generous support from the Data Science Institute Scholars Program and the Data, Media, & Society Center.

In collaboration with a team of international scholars at the Open Islamicate Texts Initiative (OpenITI), our project aims to train an OCR system on a corpus of medieval manuscripts. Kraken, the engine developed by OpenITI, recognizes text line by line rather than character by character, which makes it especially well suited to the density and occasional obscurity of Middle English handwriting.

We will train Kraken on a select set of manuscripts attributed to the same scribe (“Scribe D”). These include the Corpus Christi MS 198, the Plimpton MS 265, and the London Library MS V. 88. If this first stage succeeds, we will go on to train the system on a larger set of manuscripts. Such a tool would have an immense impact on medieval studies. Scholars could more easily compare manuscripts across a single textual tradition or create digital editions of lesser-known texts. Above all, it would let them work on the massive number of untranscribed texts that currently seem “lost” to scholarship.