A New GitHub Data-Set

Written by Adam Franklin-Lyons

In earlier articles on this blog, we have written a couple of times about extracting, analyzing and organizing medieval itineraries as a source of data for geographic studies of medieval movement and travel (see: “Itineraries, Gazetteers, and Roads” and “Notes on the Margins”). We have now compiled and organized a growing number of itinerary data-sets, including digitized versions of older itineraries compiled in the 19th or early 20th century. These data-sets include multiple royal itineraries from the Crown of Aragon and dozens of episcopal itineraries from England. We are planning to expand this project to include travel itineraries from new places around Europe. To facilitate this expansion, we have moved a portion of our data onto GitHub, a site that specializes in version control and is most often used for software development.

The GitHub site currently hosts several of our itineraries along with a small amount of code (written in Python with the pandas library) that allows for the conversion of itineraries to trip sets, the compilation of itineraries, the lookup of existing names in other itineraries or in geonames.org, and other transformations that will assist in data collection and visualization. This should help us better organize all of the diverse data points, especially by linking each location (when possible) with a geonames id, a broadly cross-referenced and machine-readable reference point. This leaves the door open to connecting our projects to linked open data sometime in the future (Wikidata is probably the best-known example of a linked open data project). There are also instructions for how to ensure that the geonames id is correct, as well as what to do when there is no obviously available geonames referent.
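To give a sense of what converting an itinerary to a trip set can look like, here is a minimal sketch in pandas. It is not the code from our repository: the function name `itinerary_to_trips`, the column names (`date`, `place`, `geonames_id`), and the sample rows are all hypothetical, and the ids are placeholders rather than real geonames ids. It assumes an itinerary is a table of dated stops and treats each change of place between consecutive stops as one trip. Dates are stored as plain `datetime.date` objects because the default pandas timestamp type cannot represent dates before 1677.

```python
import datetime as dt

import pandas as pd


def itinerary_to_trips(itinerary: pd.DataFrame) -> pd.DataFrame:
    """Turn dated stops into trips between consecutive, distinct places.

    Assumes hypothetical columns: date (datetime.date), place, geonames_id.
    """
    stops = itinerary.sort_values("date").reset_index(drop=True)
    # Nullable integers keep the ids as integers after shifting.
    ids = stops["geonames_id"].astype("Int64")

    # Pair every stop with the stop immediately before it.
    trips = pd.DataFrame(
        {
            "origin": stops["place"].shift(1),
            "destination": stops["place"],
            "origin_geonames_id": ids.shift(1),
            "destination_geonames_id": ids,
            "depart_date": stops["date"].shift(1),
            "arrive_date": stops["date"],
        }
    )

    # The first stop has no origin; rows with the same place are stays, not trips.
    trips = trips.dropna(subset=["origin"])
    trips = trips[trips["origin"] != trips["destination"]].reset_index(drop=True)

    # Elapsed days between the last attestation at the origin and the first
    # attestation at the destination.
    trips["days"] = [
        (arrive - depart).days
        for depart, arrive in zip(trips["depart_date"], trips["arrive_date"])
    ]
    return trips


# Invented sample rows for illustration only (placeholder ids, made-up dates).
example = pd.DataFrame(
    {
        "date": [
            dt.date(1356, 3, 1),
            dt.date(1356, 3, 2),
            dt.date(1356, 3, 4),
            dt.date(1356, 3, 7),
        ],
        "place": ["Barcelona", "Barcelona", "Girona", "Perpignan"],
        "geonames_id": [1001, 1001, 1002, 1003],
    }
)

trips = itinerary_to_trips(example)
print(trips[["origin", "destination", "days"]])
```

In this sketch the two consecutive days in Barcelona collapse into a stay, leaving two trips. A real itinerary would also need rules for gaps in the record and for stops that have no geonames id.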

The medium-term goal is to compile a larger bibliography along with a set of assignments that other scholars and teachers could use as samples in digital humanities courses or as a digital example in an appropriate history course. These assignments will provide a two-fold benefit. First, each assignment undertaken will add to the scope of the overall data-set. The bibliography will provide other opportunities for history students to create their own data, which we can then add to the collection. Second, because of the scope of the data already present, students will be able to apply digital tools to a larger data-set more quickly and see the potential of geographic and statistical analyses. Eventually, we will add some of the more successful visualizations, along with instructions for trying them with new data or for modifying and expanding on them.

A key difficulty in teaching digital humanities is that data collection on a scope large enough to produce compelling results must often be balanced against the class time needed to actually learn the digital tools used to run the analyses (GIS tools, stats packages like R, and other tools can each fill a whole course in their own right). With an iterative platform that offers model analyses and large amounts of already usable data, students and professors can participate in each stage of the project for only as long as their course allows, without sacrificing the students' ability to practice all the steps along the way and produce satisfying historical results. Ideally, the project components on GitHub will allow courses to contribute incomplete data that other groups can pick up and continue. This makes concrete how we have attempted to run the lab in the past: students work on a project at multiple stages and, when they are done, pass it off to a new group of students to move it further. Until now, however, the instructions, future work, and goals have generally lived in the heads of the individual professors overseeing each project rather than in more publicly usable formats.

The long-term goal is to structure the data into an SQL database for easier querying, but also to substantially increase the number of trips and itineraries available in the data-set. We are also aiming to add data from a number of letter collections, including some data-sets we have already worked on (see: “Parsing the Past”). Eventually, there should be enough data about small movements to ask broader questions about European mobility in the late medieval period. If we can reach a few hundred thousand individual data points (known short trips of less than a couple of days), we will be able to ask systemic questions about the nature of movement. We could look for patterns such as seasonality, the influence of topography or linguistic boundaries, and potential regional differences across Europe.
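As a rough illustration of the kind of query an SQL structure would make easy, the sketch below uses Python's built-in `sqlite3` module to count trips by month of departure, a first step toward looking at seasonality. We have not settled on a schema, so the table names (`places`, `trips`), the column names, and the sample rows here are all assumptions made up for the example, and the ids are placeholders rather than real geonames ids.

```python
import sqlite3

# In-memory database with a hypothetical two-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE places (
        geonames_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE trips (
        trip_id        INTEGER PRIMARY KEY,
        traveler       TEXT NOT NULL,
        origin_id      INTEGER NOT NULL REFERENCES places(geonames_id),
        destination_id INTEGER NOT NULL REFERENCES places(geonames_id),
        depart_date    TEXT NOT NULL,  -- ISO 8601 text, e.g. '1356-03-02'
        arrive_date    TEXT NOT NULL
    );
    """
)

# Invented sample rows for illustration only.
conn.executemany(
    "INSERT INTO places (geonames_id, name) VALUES (?, ?)",
    [(1001, "Place A"), (1002, "Place B"), (1003, "Place C")],
)
conn.executemany(
    "INSERT INTO trips (traveler, origin_id, destination_id, depart_date, arrive_date) "
    "VALUES (?, ?, ?, ?, ?)",
    [
        ("Traveler X", 1001, 1002, "1356-03-02", "1356-03-04"),
        ("Traveler X", 1002, 1003, "1356-03-04", "1356-03-07"),
        ("Traveler Y", 1003, 1001, "1356-07-10", "1356-07-12"),
    ],
)

# Count trips by month of departure.
trips_per_month = conn.execute(
    """
    SELECT CAST(strftime('%m', depart_date) AS INTEGER) AS month,
           COUNT(*) AS n_trips
    FROM trips
    GROUP BY month
    ORDER BY month
    """
).fetchall()

print(trips_per_month)
```

Storing dates as ISO 8601 text keeps medieval years usable with SQLite's date functions. With a few hundred thousand rows, the same query could be grouped by region or traveler to compare seasonal patterns.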

The extended vision of the project could resemble the Stanford Orbis project, an interactive map of the ancient world for which there is no medieval equivalent. However, while Orbis is based on original research and primary sources, those sources are thinner and more descriptive. The Orbis map is built on algorithms that encode assumptions based on those sources, whereas a large enough data-set of individual trips would allow for algorithms that estimate travel times and methods as inductive statistical guesses. These inductive guesses would be built on an extensive underlying source base of individual trips.

So, for the moment, we have tens of thousands of data points, not quite hundreds of thousands. If you want to contribute to the bibliography or have good suggestions for primary sources that could reasonably produce an itinerary, get in touch so we can get it on the site. If you are planning on teaching a digital history course of some sort, use the data, try out the instructions, or create your own itinerary that can complement the data already available. If you do try out some of the methods on the site, please let us know if any portion of the instructions is hard to follow or does not work as you move through them. We are always looking to update and improve the usability of the data.

So Check It Out!