Sonification and the Datini Letter Meta-data

Written by Adam Franklin-Lyons (history professor at Marlboro College) and Logan Davis (research and development engineer at Pairity)

Which means what exactly?  It’s like a visualization, but instead of something you see, it’s something you hear.  Let me start with a little background…

A couple of years ago, we attempted some “sonifications” (renderings of complex data in sound) using the metadata from the letters sent by the Datini Company in 14th- and 15th-century Italy. (“We” in this context are Adam Franklin-Lyons, professor of history at Marlboro College, and Logan Davis, a skilled programming student, now alum, at Marlboro with a strong background in music and sound.) The Datini collection contains over 100,000 letters with multiple variables, including origin, destination, sender, receiver, travel time, and others. An earlier blogpost has more about Datini, along with some regular old visualizations from a conference talk. We made a few preliminary experiments, often assigning an individual person a timbre and moving the pitch when that person changed locations. Here is a short version of one of our experiments, in which three different individuals each “inhabit” an octave of pitch space as they move around – we made both a midi-piano version and a synth-sound version. The sounds are built with a Python sound generator that attaches certain pieces of data (in this case, the locations of three named agents of the Datini company: Boni, Tieri, and Gaddi) to numeric markers, which the generator then translates into specific pitches, timbres, decay lengths, etc. What follows are some of our thoughts about what sonification is and how you might create your own. This post does not go into specific tools, which can be complicated; it is more of a general introduction to the idea. Hopefully in the future we will follow up with another couple of posts on the technical side of things.
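To make that mapping concrete, here is a minimal sketch of the idea in Python using only the standard library: each agent gets a base octave, each city an offset within it, and the generator renders a decaying sine tone per move. The specific octaves, cities, offsets, and envelope are illustrative assumptions, not the actual mapping from our experiments:

```python
import math
import struct
import wave

SAMPLE_RATE = 44100

# Hypothetical mapping: each agent "inhabits" one octave (a MIDI base note),
# and each city is a pitch offset within that octave. All values illustrative.
AGENT_OCTAVES = {"Boni": 48, "Tieri": 60, "Gaddi": 72}
CITY_OFFSETS = {"Prato": 0, "Firenze": 4, "Pisa": 7, "Avignon": 11}

def midi_to_hz(note):
    """Convert a MIDI note number to frequency in Hz (A4 = note 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)

def tone(freq, seconds=0.3, volume=0.4):
    """Render one sine tone with a simple linear decay envelope."""
    n = int(SAMPLE_RATE * seconds)
    return [volume * (1 - i / n) * math.sin(2 * math.pi * freq * i / SAMPLE_RATE)
            for i in range(n)]

def sonify(events, path="datini.wav"):
    """events: chronological (agent, city) pairs; writes one tone per move."""
    samples = []
    for agent, city in events:
        note = AGENT_OCTAVES[agent] + CITY_OFFSETS[city]
        samples.extend(tone(midi_to_hz(note)))
    with wave.open(path, "w") as w:
        w.setnchannels(1)
        w.setsampwidth(2)      # 16-bit samples
        w.setframerate(SAMPLE_RATE)
        w.writeframes(b"".join(struct.pack("<h", int(s * 32767)) for s in samples))

sonify([("Boni", "Prato"), ("Tieri", "Firenze"), ("Boni", "Pisa")])
```

A real sonification would replace the hard-coded event list with rows read from the letter metadata, but the pipeline shape – data row, numeric marker, synthesis parameter – is the same.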

Even though sonification is not intensively used, you are probably already familiar with its basic idea. Several well-known modern tools (the Geiger counter is the most widely cited example) use a sonic abstraction to portray data inputs that we cannot otherwise sense: for the Geiger counter, beeps or clicks indicate the quantity of radiation emitted, and basic metal detectors work similarly. By contrast, researchers portray vast amounts of data in visual forms – graphs, charts, maps, videos, and so on. Perhaps this is because of the dominance of visual input for most people, perhaps not. Either way, the goal is the same: how do you take a large quantity of data and distill or organize it into a form that reveals patterns or meaningful structures to the person trying to understand the data?

Fields like statistics and data science teach and use visualization constantly, drawing on many well-established methods for comparing data sets, measuring variance, or testing changes over time. Researchers have also studied the reliability of different types of visualizations. For example, visual perception judges distance much more accurately than area, so people consistently extract more accurate data from bar graphs than from pie charts. The goals of sonification thus present one important question: what types of patterns or structures in the data would actually become clearer when heard rather than seen? Are there particular types of patterns that lend themselves to abstraction in audio rather than in visuals? (And I will be honest here – I have talked to at least a couple of people who do stats work who have said, “well, there probably aren’t any. Visual is almost bound to be better.” But admittedly, neither of them were particularly “auditory” people anyway – they do not, for instance, share my love of podcasts…their loss.)

Thus, the most difficult aspect is not simply duplicating what visualizations already do well – a sonification of communication practices in which the volume tracks the number of messages, growing louder over the course of a 45-second clip and then dropping off precipitously, doesn’t actually communicate more than a standard bar graph. It would take less than 45 seconds to grasp the same concept in its visual form. Visualizations employ color, saturation, pattern, size, and other visual aspects to encode multiple variables. Combining aspects like the attack and decay of notes, pitch level, and volume could similarly allow multiple related pieces of data to become part of even a fairly simple sonic line. As with visualizations, certain forms of sound patterns will catch our attention better or provide a more accurate rendition of the data. Researchers have not studied the advantages and disadvantages of sound to the same extent, which makes these questions ripe for exploration.

So what are some examples? At least one professional group has been dedicated to this research for a number of years: the International Community for Auditory Display. Their website has a number of useful links and studies (look particularly at the examples). Although not the most recent, a good handbook from 2011 and a review article from 2005 describe some of the successes and failures of sonification. Many of their examples and suggestions recommend reducing the quantity of data so as not to overload the auditory output, much as you would not want to draw thousands of lines of data on a single graph. However, at least a couple of recent experiments have moved toward methods of including very large quantities of data. While promotional in nature, here is a video demonstrating the concept as used by Robert Alexander to help NASA analyze solar wind data.

So, how to proceed? First, the work of sonification does not escape the day-to-day tasks of data science, especially the normalization of data. If your pipeline cannot reasonably handle minor syntactic differences in the data (e.g., “PRATO” vs. “prato” vs. “Prato, Italy”), then your ability to leverage your dataset will be limited, just as it would be with visualizations. The normalization work, and the choices you make in the course of it, will go far more smoothly with a little legwork at the beginning.
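As a concrete illustration, here is a minimal normalization sketch in Python that collapses exactly the kinds of variants mentioned above. A real pipeline would add a curated alias table for spellings that differ by more than case, whitespace, or trailing qualifiers:

```python
import unicodedata

def normalize_place(raw):
    """Collapse superficial variants of a place name to one canonical key.

    A minimal sketch: handles case, stray whitespace, accents encoded in
    different forms, and trailing qualifiers like ", Italy". Genuinely
    different spellings still need a hand-built alias table on top of this.
    """
    name = unicodedata.normalize("NFKD", raw)  # unify accent encodings
    name = name.split(",")[0]                  # drop qualifiers after a comma
    name = " ".join(name.split())              # collapse stray whitespace
    return name.strip().title()

# The three variants from the text all collapse to the same key:
assert normalize_place("PRATO") == normalize_place("prato") \
    == normalize_place("Prato, Italy") == "Prato"
```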

Like visualizations, sonifications should be tailored to the data-set at hand. You will then have to make choices about which aspects of sound you relate to which data points. This is the main intellectual question of sonification. What are we voicing? What is time representing? What does timbre (or voice – different wave forms) give us here? Timbre and pitch nicely convey the proper nouns and verbs of a data-set. Timbre has a far more accessible (articulated) range of possible expressions for data with higher dimensions (though for a particularly trained ear, micro-tonality may erase a great deal of that advantage). Decay, in my experience, can carry interesting metadata, such as the confidence or freshness of a fact; the action of the tone reflects how concretely we know something in the data.
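One hedged sketch of such an assignment in Python: proper nouns (senders, places) map to timbre and pitch, while decay encodes confidence. The field names, instrument choices, and the confidence-to-decay scaling below are all hypothetical, meant only to show the shape of the mapping:

```python
# Hypothetical mapping from one letter record to synthesis parameters.
# These names and scales are illustrative, not the actual Datini schema.
TIMBRES = {"Boni": "sine", "Tieri": "square", "Gaddi": "sawtooth"}  # person -> timbre
PLACE_PITCH = {"Prato": 60, "Firenze": 64, "Pisa": 67}              # place -> MIDI pitch

def to_sound(record):
    """Map one data record to (timbre, pitch, decay) synthesis parameters."""
    return {
        "timbre": TIMBRES[record["sender"]],
        "pitch": PLACE_PITCH[record["origin"]],
        # Confidence in [0, 1] stretches the decay: well-attested facts
        # sustain, shaky ones fade quickly.
        "decay": 0.1 + 0.9 * record["confidence"],
    }

params = to_sound({"sender": "Boni", "origin": "Prato", "confidence": 0.8})
# A confident record from Prato sung by Boni: sine timbre, pitch 60,
# decay near the long end of the range.
```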

After cleaning and assigning pitch, timbre, decay, and the rest, you listen. Much of what sonification is good for is finding hot spots in data sets. What stands out? Are there motifs or harmonic patterns that seem especially prevalent? Some of these questions, obviously, will relate to how the data has been coded, but every time we have tried this, there have also been at least a few surprising elements. And finally, is it beautiful? (A question becoming more popular in visualization circles, too…) Particularly when working with some of the wild data-sets available today, what is the sound world created? Are there tweaks to the encoding that will make observations about the data clearer while also making the sound more enjoyable to listen to? When creating an auditory representation of data, you are quite literally choosing what parts are worth hearing.

A New GitHub Data-Set

Written by Adam Franklin-Lyons

In earlier articles on this blog, we have written a couple of times about extracting, analyzing, and organizing medieval itineraries as a source of data for geographic studies of medieval movement and travel (see: “Itineraries, Gazetteers, and Roads” and “Notes on the Margins“). We have compiled and organized a growing number of itinerary data-sets, including digitized versions of itineraries originally compiled in the 19th and early 20th centuries. These data-sets include multiple royal itineraries from the Crown of Aragon and dozens of episcopal itineraries from England. We are planning to expand this project to include new travel itineraries from other places around Europe. To facilitate this expansion, we have moved a portion of our data onto GitHub – a site that specializes in version control, most often used for software development.

The GitHub site currently hosts several of our itineraries along with a small amount of code (written in Python with the Pandas library) that converts itineraries into trip sets, compiles itineraries, looks up existing names in other itineraries or in geonames.org, and performs other transformations that will assist in data collection and visualization. This should help us better organize all of the diverse data points, especially by linking each location (when possible) with a geonames id, a broadly cross-referenced and machine-readable reference point. This leaves the door open to connecting our projects to linked open data sometime in the future (Wikidata is probably the most famous linked open data project). There are also instructions for how to ensure that the geonames id is correct, not to mention what to do when there is no obviously available geonames referent.
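The repository code does this with Pandas, but the core itinerary-to-trip-set conversion can be sketched with the standard library alone: consecutive dated sightings of one person become (origin, destination, duration) trips, with repeated sightings in the same place collapsed. The places and dates below are invented for illustration:

```python
from datetime import date

# A toy itinerary: dated sightings of one traveler, in chronological order.
itinerary = [
    (date(1395, 3, 1), "Prato"),
    (date(1395, 3, 4), "Firenze"),
    (date(1395, 3, 4), "Firenze"),   # duplicate sighting, same place
    (date(1395, 3, 9), "Pisa"),
]

def to_trips(rows):
    """Collapse consecutive sightings into (origin, destination, days) trips."""
    trips = []
    for (d1, p1), (d2, p2) in zip(rows, rows[1:]):
        if p1 != p2:                 # a move between places, not a repeat
            trips.append((p1, p2, (d2 - d1).days))
    return trips

print(to_trips(itinerary))
# → [('Prato', 'Firenze', 3), ('Firenze', 'Pisa', 5)]
```

Note that the computed "days" are upper bounds on travel time – the traveler may have left Prato well after March 1 – which is exactly the kind of source-driven uncertainty a larger data-set helps average out.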

The medium-term goal is to compile a larger bibliography along with a set of usable assignments that other scholars and teachers could use as samples in digital humanities courses or as digital examples in an appropriate history course. These assignments will provide a two-fold benefit. First, each assignment undertaken will add to the scope of the overall data-set, and the bibliography will give other history students opportunities to create their own data, which we can then add to the collection. Second, because of the scope of the data already present, students will be able to quickly apply digital tools to a larger data-set and see the potential of geographic and statistical analyses. Eventually, we will add some of the more successful visualizations, along with instructions, to try with new data or to modify and expand.

A key difficulty in teaching digital humanities is that data collection at a scope large enough to produce compelling results must often be balanced against the class time needed to actually learn the digital tools used to run the analyses (GIS tools, stats packages like R, and other tools each merit whole courses in their own right). By creating an iterative platform with model analyses and large amounts of already usable data, students and professors can participate in each stage of the project for only as long as their course allows, without sacrificing students’ ability to practice all the steps along the way and produce satisfying historical results. Ideally, the project components on GitHub will allow courses to contribute incomplete data that other groups can pick up and continue. This makes concrete how we have attempted to run the lab in the past – with students able to work on a project at multiple stages, passing it off to a new group of students to carry further when they are done. Until now, however, the instructions, future work, and goals have generally lived in the heads of the professors overseeing each project rather than in more publicly usable formats.

The long-term goal is to structure the data into an SQL database for easier querying, but also to increase substantially the number of trips and itineraries available in the data-set. We are also aiming to add data from a number of letter collections, including some data-sets we have already worked on (see: “Parsing the Past“). Eventually, there should be enough data about small movements to be able to ask broader questions about European mobility in the late medieval period. If we can reach a few hundred thousand individual data points (known short trips of less than a couple days), we will be able to ask systemic questions about the nature of movement. We could look for patterns such as seasonality, the influence of topography, linguistic boundaries, or observe potential regional differences across Europe.

The extended vision of the project could resemble the Stanford Orbis project – an interactive map of the ancient world for which there is no medieval equivalent. Orbis, however, rests on primary sources that are diaphanous and descriptive, and its map is built on algorithms that encode assumptions drawn from those sources. A large enough data-set of individual trips would instead allow algorithms that produce travel times and methods as inductive statistical estimates, built on an extensive underlying source base of documented trips.

So, for the moment, we have tens of thousands of data points, not quite hundreds of thousands. If you want to contribute to the bibliography or have good suggestions for primary sources that could reasonably produce an itinerary, get in touch so we can get it on the site. If you are planning to teach a digital history course of some sort, use the data, try out the instructions, or create your own itinerary to complement the data already available. And if you do try out some of the methods on the site, please let us know if any portion of the instructions is hard to follow or does not work as you move through it. We are always looking to update and improve the usability of the data.

So Check It Out!