Transcription aggregation update

Many of our regular contributors will have seen (and indeed been interviewed for!) this article by Roberta Kwok for the New Yorker back in January of 2017. Roberta briefly describes the algorithm that merges the transcriptions made by multiple independent volunteers in Shakespeare’s World into a single transcription for a given page. The algorithm is called MAFFT, and is typically used to align genetic sequences. The idea behind this approach is that it lets us combine shorter transcribed strings with longer ones from the same line. So, if I transcribe only one word on a line, and @parsfan and @mutabilitie transcribe the whole line, my small piece will still count: it is compared with theirs and then slotted into the right place in the transcription. Zooniverse’s former data scientist and I decided to implement this method so that people could contribute even small transcriptions to this relatively difficult project.
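To make the "slotting in" idea concrete, here is a minimal Python sketch. It is not the project's aggregation code and it does not use the sequence-alignment tool described above; it stands in Python's standard `difflib` module to find where a short fragment best overlaps a longer line. The function name `slot_fragment`, the `-` gap marker and the sample line are all made up for illustration, and the sketch assumes the fragment is no longer than the full line.

```python
from difflib import SequenceMatcher

GAP = "-"  # hypothetical gap marker; assumes "-" does not occur in the transcriptions


def slot_fragment(fragment: str, full_line: str) -> str:
    """Pad a short transcription with gaps so it lines up with a longer one.

    Assumes the fragment is no longer than full_line.
    """
    matcher = SequenceMatcher(None, full_line, fragment, autojunk=False)
    # Longest run of characters the two strings share.
    match = matcher.find_longest_match(0, len(full_line), 0, len(fragment))
    # Work back from the shared run to where the fragment would start in the line.
    start = max(0, match.a - match.b)
    padded = GAP * start + fragment
    # Fill the rest of the line with gaps so both strings have the same width.
    return padded.ljust(max(len(full_line), len(padded)), GAP)


if __name__ == "__main__":
    line = "take a pinte of creame"  # invented example line
    print(slot_fragment("pinte", line))
    print(slot_fragment("pynte", line))  # a misread letter still lands in the same place
```

A real multiple-sequence aligner does considerably more than this (it can open gaps inside a transcription, for instance), but the padded output shows what it means for a one-word contribution to sit in the right place next to a full line.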

The other reason we went for the MAFFT approach is that it means anyone who wants to can take part in the project, without prior paleography training. It’s ok if you make mistakes, because multiple people are transcribing the same material, so if you’ve transcribed a few letters correctly and some incorrectly, the correct ones will get taken through to the final transcription. This method of combining multiple volunteers’ efforts stems from the same scientific practices that underpin Galaxy Zoo, Penguin Watch, and all other Zooniverse projects, and is vital to the acceptance of crowdsourced data by experts in the sciences, humanities, and museum and library fields. Enabling independent crowdsourcing and then putting the results together is supposed to create a rigorous dataset, and I’m sure that it will for Shakespeare’s World in the near future, but we’re in a holding pattern until September, when a new data scientist will start working to unpick and rebuild our text aggregation process.
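As a rough illustration of how correct letters can "get taken through" when several people transcribe the same line, here is a small Python sketch of a per-character majority vote. It is an assumption about how such a merge could work, not a description of the Shakespeare’s World pipeline. It presumes the transcriptions have already been aligned to the same width, with `-` marking positions a volunteer did not transcribe; the function name `consensus` and the sample rows are invented.

```python
from collections import Counter

GAP = "-"  # hypothetical gap marker; assumes "-" does not occur in the transcriptions


def consensus(aligned_rows: list[str]) -> str:
    """Return the most common non-gap character at each position.

    All rows must already be aligned to the same width.
    """
    if not aligned_rows:
        return ""
    width = len(aligned_rows[0])
    if any(len(row) != width for row in aligned_rows):
        raise ValueError("all aligned rows must have the same length")

    merged = []
    for column in zip(*aligned_rows):
        votes = Counter(ch for ch in column if ch != GAP)
        if votes:
            # On a tie, Counter keeps the character it saw first in the column.
            merged.append(votes.most_common(1)[0][0])
    return "".join(merged)


if __name__ == "__main__":
    rows = [
        "take a pinte of creame",  # full line
        "take a pynte of creame",  # full line with one misread letter
        "-------pinte----------",  # a single word, padded to line up
    ]
    print(consensus(rows))
```

In this toy case the single-word contribution breaks the tie between "i" and "y", which is the sense in which even a very small transcription helps.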

Rest assured that we’re not losing anyone’s work—it’s all saved in a database—but we’re not currently able to piece the separate transcriptions together reliably. This is why, in part, we have not yet published much data in the Early Modern Manuscripts Online (EMMO) database at the Folger. However, as Philip announced in previous posts on this blog, quite a few interesting finds are already being incorporated into the Oxford English Dictionary, and he says that Talk is the most effective platform for getting crowdsourced updates into the dictionary. Meanwhile our guest scholars from the Early Modern Recipes Online Collective (EMROC) have also been gathering invaluable data for their work via Talk, and many of you have been helping me gather #catholic and #womanwriter sources.

So, for now, one of the best ways to contribute to Shakespeare’s World is to hop on Talk and use those hashtags, for example #paper, #oed, #Catholic and #womanwriter. We hope to have our aggregation sorted soon, and the transcriptions freely available to all.

-By Victoria Van Hyning, @vvh

