Parsing the Past: Data Extraction in Medieval Text
by Daniel Gordon
Letter collections can answer many of the questions we have about medieval communication. They can give valuable information about how fast letters moved, how long they took to travel between towns, and whether they were an effective and timely form of communication. These collections can be quite large, with hundreds, if not thousands, of individual documents, and manually searching every single item can take an excessive amount of time. To get around this, I resolved to write a program that could automatically search every document, pulling out key information like names and dates, and tagging important parts of the letters for data gathering and quick examination. I based this on the Cely Letter Collection, since it only has about 150 entries, which allows the work to be checked manually. If the approach succeeds in this case, then with slight modification it could be applied to other collections, providing a valuable tool for parsing information. To do this effectively, though, there is one major roadblock: the Middle English dialect the letters are written in.
In Middle English, a word can be spelled in a variety of ways, sometimes within the same letter. This creates problems when we need to find specific words or phrases. Writing a program to account for this can be difficult, but I used three techniques to get around it, often in combination.
- Create a list of possible spellings for a given word, so that whenever one of those spellings comes up, you know it’s that word. The advantage is that you’ll never pick up a wrong word by accident, so you can be sure that everything the program finds is a word you want. The drawback is that if you miss a spelling, the program won’t pick it up, and it can take time to sift through the letters to find every spelling.
- Look for patterns within the word. In my case, the word “September” reliably begins with the letters “Sept”, and not many other words begin that way. By looking for words beginning with those letters, you can find spellings of September. Looking for capitalization can also be a useful tool. Problems arise if you have spellings that break the pattern, or if your pattern isn’t specific enough to that word.
- Look for patterns around the word. This is connected to the last strategy but not exactly the same. In the Cely letters, dates are almost always presented in the form “the (number designation) day of (month)”, as in “the 3rd day of March”. What sticks out is the phrase “day of”, which doesn’t really have an alternate spelling aside from “daye of”, and that word combination doesn’t really appear in other contexts. Thus, by looking for it, we can find when a date is mentioned in a letter. The problems are the same as with the last strategy, with the addition that these kinds of patterns are rarer.
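As a rough illustration, the three techniques might look something like this in Python. This is only a sketch: the spelling list, the function names, and the regular expressions are assumptions made up for the example, not the exact ones used on the Cely letters.

```python
import re

# Technique 1: an explicit list of known spellings.
# These variants are invented examples, not a checked list from the letters.
MARCH_SPELLINGS = {"march", "marche", "marsche"}

# Technique 2: a pattern inside the word: anything beginning with "Sept".
SEPTEMBER_PATTERN = re.compile(r"\bSept\w*")

# Technique 3: a pattern around the word: "<number> day(e) of <month>".
DAY_OF_PATTERN = re.compile(r"\b(\w+)\s+daye?\s+of\s+(\w+)", re.IGNORECASE)


def find_march(text):
    """Return every word in the text that is a listed spelling of March."""
    words = re.findall(r"\w+", text.lower())
    return [w for w in words if w in MARCH_SPELLINGS]


def find_september(text):
    """Return every capitalized word that begins with 'Sept'."""
    return SEPTEMBER_PATTERN.findall(text)


def find_dates(text):
    """Return (number, month) pairs found around a 'day of' phrase."""
    return DAY_OF_PATTERN.findall(text)
```

In practice the techniques combine well: the “day of” pattern finds where a date is, and the spelling list or the prefix pattern then identifies which month it names.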
There are three other methods I considered but never implemented. The first was to look at string distance. There are Python libraries that can give you a ratio of how similar two words are. Hypothetically, you could decide that if a word is 80% similar to your target, it’s that word. However, some of the spellings can be drastically different from their modern-day ones, and if you lower the ratio too much you start picking up words outside of what you’re looking for. Since this approach is too vague, I favored more specific methods.
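For reference, a string distance check could be sketched with `difflib` from the Python standard library. I never implemented this, so the function names and the 0.8 threshold below are assumptions chosen only to mirror the 80% figure above.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a ratio between 0 and 1 describing how alike two words are."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_matches(words, target, threshold=0.8):
    """Keep the words whose similarity to the target meets the threshold."""
    return [w for w in words if similarity(w, target) >= threshold]
```

The threshold is the weak point: set it high and unusual spellings are missed, set it low and unrelated words start to match.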
The second was a sort of “translation” program. I noticed that the differences in spelling tend to follow certain rules: c is replaced by s, certain letters get doubled, and the letter e gets added to the end of words. Theoretically, if you apply these rules in reverse to any Middle English word, you get the modern spelling, making data extraction significantly easier. The program would look at a word, make a list of the rules that could apply to it, then try every combination of those rules, checking whether the result of a given combination was in the dictionary. It would take a while, but eventually you’d translate the whole letter, making it much easier to parse. I set this idea aside because I was unsure how effective a program like that would be, and wanted to try other methods before resorting to it.
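Since I never built this program, the following is only a sketch of how the idea might work. The three rules are reversed versions of the ones described above, and the tiny `MODERN_WORDS` set is a hypothetical stand-in for a real dictionary.

```python
import re
from itertools import combinations

# Each rule undoes one spelling habit. These are deliberately crude:
# for example, the first rule turns every "s" into "c".
RULES = [
    ("s to c", lambda w: w.replace("s", "c")),
    ("undouble letters", lambda w: re.sub(r"(.)\1", r"\1", w)),
    ("drop final e", lambda w: w[:-1] if w.endswith("e") else w),
]

# Hypothetical stand-in for a full modern English word list.
MODERN_WORDS = {"place", "ship", "letter"}


def modernize(word, dictionary=MODERN_WORDS):
    """Try every combination of rules until the result is a dictionary word."""
    word = word.lower()
    # Start with no rules at all, then one rule, then two, and so on.
    for size in range(len(RULES) + 1):
        for combo in combinations(RULES, size):
            candidate = word
            for _name, rule in combo:
                candidate = rule(candidate)
            if candidate in dictionary:
                return candidate
    return None  # no combination produced a known word
```

Trying the smaller combinations first means a word that is already modern is returned unchanged, and a word is only altered as much as it needs to be.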
The third has to do with more sophisticated data extraction techniques. Using tools designed for data extraction, such as the Natural Language Toolkit (NLTK), a third-party module for Python, you could theoretically make the job easier. However, these tools rely on tagging sentences with parts of speech and extracting information based on the structure of the sentence, a process complicated by the Middle English spellings. In addition, I found the NLTK module difficult to download and set up. Thus, I put it aside.
After developing methods to look at the letters, we can begin extracting information. The format of the letters can vary, and sometimes the information we want is simply not in the letter. But often, patterns hold true. This enables us to look for the first important piece of information: location. In other words, where the letter was sent, and where it came from. The former is a simple matter: there is an address line at the end of every letter, preceded by the word “Addressed:” (due to its consistent spelling, I suspect this is a modern addition). Find that word and the phrase after it, and you have an end location. The sending location can be a bit more complex. Though this information has no 100% consistent place in the letters, there is often a “Writ at” statement at the end, such as “Wryt at London.” As a bonus, these statements often have dates associated with them, such as “Wryt at London on the 3rd day of March.” By looking for the phrase “Writ at”, you can find this information. True, the phrase can appear in other parts of the letter, but the statement we’re looking for is usually at the end, so if we start looking from the end, we can reliably find it.
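A minimal sketch of both location searches follows. The spelling variants covered by the “Writ at” pattern are assumptions, and the function names are made up for the example. Taking the last match is one simple way to “start looking from the end”.

```python
import re

# The address line: the word "Addressed:" and the rest of that line.
ADDRESS_PATTERN = re.compile(r"Addressed:\s*(.+)")

# Assumed variants such as Writ, Wryt, Wrytt, Written and Wryten,
# followed by "at" and a place name.
WRIT_AT_PATTERN = re.compile(r"\bwr[iy]t+(?:en|yn)?\s+at\s+(\w+)", re.IGNORECASE)


def find_destination(letter):
    """Return the phrase after 'Addressed:', or None if there is no such line."""
    match = ADDRESS_PATTERN.search(letter)
    return match.group(1).strip() if match else None


def find_origin(letter):
    """Return the place named in the last 'Writ at' statement, or None."""
    places = WRIT_AT_PATTERN.findall(letter)
    # The closing statement is usually near the end, so prefer the last match.
    return places[-1] if places else None
```

This sketch only captures a one-word place name, so a longer place name would need a wider pattern.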
Now that we have a sense of distance and timing, what else gives us relevant information? Looking for mentioned dates can give us a sense of time frame. As I’ve said, we have a reliable way to find dates by looking for “day of” statements, and by using combinations of techniques one and two we can find different mentions of months. Another helpful approach is looking for words that imply urgency or refer to sequences of events. Words like “Haste” and “Tidings” are good candidates. I also found looking for “Understand” to be a good method, since it is often used to describe events, such as, “I understand that you’re trying to make a deal.” We can write a program that returns the number of appearances of the words, and use that to create statistics. For the Cely Letters we got these results:
‘Haste’ mentions: 33/147
‘Tidings’ mentions: 22/147
‘Understand’ mentions: 85/147
Tidings and understand mentions within same letter: 15/147
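Counts like these could be produced with something along the following lines. This sketch assumes each figure is the number of letters that mention the word at least once, and the spelling patterns are assumptions invented for the example rather than the ones actually used.

```python
import re

# Hypothetical patterns covering a few plausible spellings of each keyword.
KEYWORD_PATTERNS = {
    "haste": re.compile(r"\bhast\w*", re.IGNORECASE),
    "tidings": re.compile(r"\bt[iy]d[iy]ng\w*", re.IGNORECASE),
    "understand": re.compile(r"\bund[ey]rst[ao]nd\w*", re.IGNORECASE),
}


def count_mentions(letters):
    """Count how many letters mention each keyword at least once."""
    counts = {name: 0 for name in KEYWORD_PATTERNS}
    for letter in letters:
        for name, pattern in KEYWORD_PATTERNS.items():
            if pattern.search(letter):
                counts[name] += 1
    return counts


def count_together(letters, first, second):
    """Count how many letters mention both keywords."""
    return sum(
        1
        for letter in letters
        if KEYWORD_PATTERNS[first].search(letter)
        and KEYWORD_PATTERNS[second].search(letter)
    )
```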
If we want, we can also write a program that returns the words around each keyword (in this case I went with five words ahead and behind) to get a sense of the context of the word within the sentence.
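A context window of this kind can be sketched in a few lines. The function name and the way punctuation is stripped are assumptions for the example.

```python
import re


def keyword_in_context(text, pattern, window=5):
    """Return each keyword hit with up to `window` words on either side."""
    words = text.split()
    contexts = []
    for i, word in enumerate(words):
        # Strip surrounding punctuation so quoted or comma-trailed words match.
        bare = word.strip(".,;:!?\"'()")
        if re.match(pattern, bare, re.IGNORECASE):
            start = max(0, i - window)
            contexts.append(" ".join(words[start:i + window + 1]))
    return contexts
```

For example, `keyword_in_context(letter_text, r"hast")` would return one short snippet for every word in `letter_text` that begins with “hast”.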
The last thing I looked for is something I call “Receive statements”. Sometimes, a letter is written in response to another letter, or the sender wants the receiver to know that they were told a specific piece of information. To acknowledge this, a letter will often have a phrase along the lines of “I received your letter written at x place on y day”. This gives us a direct sense of time periods, especially when compared to when the letter in question was written. We can find these statements by looking for the various spellings of “received”, then returning a text chunk that begins there and ends at either a date or an arbitrary number of words (I chose 15). This way we account for formatting irregularities.
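The chunking logic could be sketched as follows. The pattern for “received” covers only a few assumed spellings (such as received, receyvyd, and resseyvyd), and the date is detected with the “day of” pattern described earlier.

```python
import re

# Assumed spellings of "received"; a real list would need checking by hand.
RECEIVE_PATTERN = re.compile(r"\bre(?:c|ss?)e[iy]?v[eiy]d\b", re.IGNORECASE)

# A date is recognized by "day(e) of" followed by the month word.
DATE_PATTERN = re.compile(r"\bdaye?\s+of\s+\w+", re.IGNORECASE)


def receive_statements(text, max_words=15):
    """Return a chunk for each 'received' that ends at a date or a word limit."""
    statements = []
    for match in RECEIVE_PATTERN.finditer(text):
        rest = text[match.start():]
        # Fallback: the first max_words words starting at "received".
        chunk = " ".join(rest.split()[:max_words])
        # If a date falls inside that chunk, cut the chunk off after it.
        date = DATE_PATTERN.search(chunk)
        if date:
            chunk = chunk[:date.end()]
        statements.append(chunk)
    return statements
```

A date that begins after the word limit is not picked up in this sketch, so the limit needs to be generous enough to reach it.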
After going through the process of extracting the data, we come away with a wealth of information about distance and timing, ready to be critically analyzed. On the surface the letters can be tedious and confusing to work through, and using programming to parse them allows us to pick up things that would otherwise have a large chance of being missed. In addition, though no letter collection has exactly the same format as the Cely letters, others share great similarities, and even if the programs already written cannot be directly applied, the techniques can be reimplemented to allow for quick and efficient information extraction. Overall, the use of programming languages can greatly aid our examination of letters and texts, teaching us more about travel in the medieval world.