Digitizing the OED

I recently picked up a copy of John Simpson’s memoir The Word Detective: Searching for the Meaning of It All at the Oxford English Dictionary, largely because I like words and dictionaries, and the OED in particular. A nice surprise, however, is that Simpson, the former chief editor of the OED, oversaw the digitization of the OED to CD-ROM, beginning in the early 1980s, as well as its later migration online. This is exciting stuff, not only for the description of transferring a massive database from print to computer (~67 million characters), but also because Simpson nicely describes how digitization did not replace the original function of the OED, but rather added new dimensions to it.

One of the benefits of the OED was that its data was already structured; i.e., definitions, pronunciations, etymologies, etc. are distinguished by their formatting, “a change of typeface, size of print, special print characters, indentation, etc.,” consistently and repetitively. The OED teamed up with International Computaprint Corporation, IBM, and the University of Waterloo (all in North America). The first two helped with digitizing the data, while Waterloo’s Computer Science Department helped construct the database. The typing took 150 people working for 18 months. Afterwards, the 20,000 pages of type, each three columns of small print, had to be proofread, a task taken on by 50 freelancers.

Simpson’s descriptions of how this large project took shape and was organized are interesting, but he shines when describing the new possibilities that digitization would open for the OED. Up to this point, dictionaries were incredibly linear: you looked up the word you wanted, and there you were. But what if, as Simpson describes it, you were able “to search the entire content of the dictionary instantly for information relating to the language”? He gives the example of finding all the words in English that end in -ology (1,011 in the OED), followed by comparing them with all the words that end in -ography (508). Given how time-consuming doing this would be with the print dictionary, it wasn’t done, but digitization could make such a search feasible and quick.
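To make the idea concrete, here is a minimal sketch in Python of the kind of suffix query Simpson describes. It is only an illustration: the function name `count_by_suffix` and the tiny word list are my own assumptions, not anything from the OED's actual database or software.

```python
from collections import Counter


def count_by_suffix(headwords, suffixes):
    """Count how many headwords end with each suffix (case-insensitive)."""
    counts = Counter({suffix: 0 for suffix in suffixes})
    for word in headwords:
        lowered = word.lower()
        for suffix in suffixes:
            if lowered.endswith(suffix):
                counts[suffix] += 1
    return counts


# A tiny, made-up sample of headwords; this is NOT OED data.
sample_headwords = [
    "amphibology",
    "tropology",
    "Etymology",
    "geography",
    "lexicography",
    "dictionary",
]

counts = count_by_suffix(sample_headwords, ["ology", "ography"])
print(counts["ology"], counts["ography"])
```

Run against a full list of headwords, the same loop would answer the -ology versus -ography question in well under a second, where the print dictionary would demand a page-by-page read.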

“Hundreds of other questions which might have been asked about the language were not asked, or were only answered falteringly by considering just a sample of the data. What if you could dream up more or less any question you wanted about the language, ask it, and receive an answer seconds later?” Simpson writes. This seems to be the common-sense attraction of what is now collectively referred to as digital humanities: it opens up the possibility of new questions, new forms of analysis, and the ability to see patterns and meanings that would be impossible or extremely impractical to reach without digital tools. At the same time, the possibilities these new avenues offer do not mean we abandon other avenues of research and analysis. Just because we can search the entire corpus for all instances of -ology doesn’t mean that we won’t sometimes just need or want to know the specific meaning(s) and history of amphibology or tropology, just two of the many words that Simpson explores in his fascinating memoir.

 

Weekly Roundup: September 26

pietro mellini

  • Quadrigam, a tool to “create & publish data driven websites” (via Miriam Posner)