Lab #2: Distant Reading


The idea for this lab was adapted (blatantly lifted, really) from Ryan Cordell of Northeastern University, who in turn borrowed it from Paul Fyfe of North Carolina State University. Consider this one more bud on the text analysis workshop tree.

Several of the readings we’ve examined so far have discussed a scholar and theorist named Franco Moretti. While we won’t be reading Moretti himself during this seminar, his theory of ‘distant reading’ is foundational to the collection of practices we call digital humanities. The basic idea is that a literary scholar should study a large number of texts, with the assistance of computers, in order to gain a broad understanding of genre and period. In this way, the scholar may produce a more inclusive literary history than would be possible by studying single texts or a small canon of texts. This is how Fyfe introduced his exercise to his students:

Franco Moretti was dissatisfied with how literary scholars accept just a handful of possible texts as representative of cultural eras. Even if those texts are diverse and interesting, how can they possibly represent broader trends at scale? Moretti wants to change our sense of literary history by enlarging it, or by increasing our critical distance from it. He coined the phrase “distant reading” as an approach to analyzing lots and lots of texts instead of an unrepresentative few. Distant reading uses other modes of analysis and models of interpretation than the “close reading” we are familiar with. In his own work, Moretti compiles textual information from lots and lots of novels into maps, graphs, and logical trees. Seen this way, texts can reveal new patterns and language trends than we could otherwise discover close up. An array of digital visualization and text analysis tools now make Moretti’s methods more accessible to the casual user. The first paper will be an experiment in using these tools. We will consider “distance” not only as the subject of our course but also as a potential mode of reading and interpretation. What does literary criticism and analysis look like if we accept distance “as a condition of knowledge”?

The idea of distance produces a bit of discomfort in most of us, who are accustomed to reading the traditional way, one word after another, without the (perceptible) mediation of digital tools and visualizations. This is especially true for humanists, who train in the practice of close reading, which is to say the careful reading of a text, analyzing it for style, structural elements, cultural references or rhetorical features. That said, distance is not a bad way to approach the nineteenth-century canon, since the nineteenth century produced more print than any period before it. The Culturomics paper assigned today makes clear that 1.8 billion words per year were being added to their corpus by 1900, a huge increase over the relatively paltry 98 million words per year it was gaining by 1800. Definitely not something you’d want to sit down and read. But perhaps we can experiment with… not reading them.

In today’s lab, you will use the tools of the HathiTrust Research Center portal to analyze a big Victorian novel and then write your second Lab Report explaining your questions and insights. There’s a twist: please choose a novel you have *not* read.

A few prefatory wishes, before we begin:

  • Treat this exercise as an experiment. You are testing the methods as much as you are learning about the text. Your goal is to “read” the novel in a totally new way, and explore the implications of doing so, not to uncover a hidden interpretation of the text.
  • Use frustration creatively. You will probably produce some garbage topic models and nonsense word clouds in your experiments. All of this is to be expected. The challenging but interesting part comes from turning these dead ends into a reflection on your critical approach. How might you fine-tune your questions, or the data you use to explore those questions, in order to get different or better results?

I’m assuming that you’ve already set up your account on the HTRC portal. If you still need to do this step, follow the instructions here. You might want to favor Google Chrome as your browser for this lab exercise. Firefox works as well, although it has a minor bug with the Dunning Log Likelihood algorithm. I have less experience with Internet Explorer, but it seems to work well with the HTRC tools.

  1. Choose a work not to read

    I’d like you to pick a nineteenth-century novel that you have not read. Consider this your chance to assuage your guilt about never having read that doorstop of a novel your artsy friend wouldn’t stop carrying on about, e.g. Anna Karenina, Great Expectations, Frankenstein, Moby Dick, or Pride and Prejudice. You may reuse any of these hyperlinked worksets in the HTRC portal, or you may create a workset of your own. To do that, follow the instructions on using the HTRC Workset Builder. It requires a second login, but the credentials are the same as those you created for the main portal.

  2. Make some predictions

    Following Prof. Cordell’s lead: “[w]hat do you think this work is about? You’ve never read it, but if it’s a well-known book you probably have some idea what it’s about. Before you begin your computational analysis, then, list some predicted themes, characters, plot elements, or stylistic characteristics of the text. Be sure to write your ideas down in a document you can refer back to later.”

  3. Create word clouds

    When provided with a bunch of text, a tag cloud or word cloud engine will produce a graphical display of the most common words, sized by the frequency with which they appear in the text (larger means higher frequency). Described as a gateway drug to text analysis, word clouds are a simple but often insightful way of visualizing the most significant themes or concepts of a text. Go to the Algorithms page of the HTRC portal and select the Meandre_Tagcloud_with_Cleaning algorithm (no. 9 on the list). Execute it on your chosen text (if you are reusing one of my worksets, you can get to it quickly in the “all worksets” dropdown menu by typing “fg” in quick succession). Have a look at the Ngram corrections and stop words supplied with this algorithm (load the URL in a separate tab in your browser). What do you think their purpose is? What kind of results might you get by using a different stop word list, like this one? The algorithm’s output takes a bit of time to appear on the Results page; refresh the page often to gauge its progress. How might you “read” your results? What kinds of words appear? Are there trends or in/consistencies? What words occur more or less frequently? I’ll note in passing that you can export your data from the HTRC portal, and, after some minor manipulation, paste it into Wordle, another excellent tool for making word clouds.
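
    To make the role of stop words concrete, here is a minimal Python sketch of the counting that sits behind any word cloud. It is not the HTRC algorithm itself: the tiny stop word list, the sample sentence and the function name word_frequencies are all invented for illustration, and the simple tokenizer only handles plain ASCII apostrophes.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list (an assumption for this
# sketch); the list supplied with the HTRC algorithm is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "it", "was", "is", "that"}

def word_frequencies(text, stop_words=STOP_WORDS):
    """Count lowercase word tokens in text, skipping stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(token for token in tokens if token not in stop_words)

# Invented sample text standing in for a whole novel.
sample = "It was the best of times, it was the worst of times."
frequencies = word_frequencies(sample)

# most_common() orders words from most to least frequent; a word cloud
# would draw the words at the top of this list in the largest type.
for word, count in frequencies.most_common(3):
    print(word, count)
```

    Try emptying STOP_WORDS and running it again: the most frequent words become the least informative ones, which is the problem a stop word list is meant to solve.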

  4. Compare two similar(ish) novels

    A word cloud is our first step. Next we’ll do a bit of comparative analysis using the Meandre_Dunning_LogLikelihood_to_Tagcloud algorithm (no. 4 in the Algorithms list). This algorithm will take your “analysis workset” (the one you chose in step 1) and compare it to a “reference workset,” which should ideally share some thread of commonality that you propose to investigate. The algorithm produces two tag clouds: the first shows the “overrepresented” words, which are more likely to occur in your analysis workset than in the reference workset, while the second shows the “underrepresented” words, which are more typical of your reference workset. As an example, you could look at novels with an element of the supernatural, and ask yourself how Frankenstein’s monster, a plausible zombie forerunner, compares to Bram Stoker’s Dracula, the quintessential vampire. Alternatively, you might know that Anna Karenina and Madame Bovary both had affairs, and you might wonder how the Russian and French tragic heroines compare to each other. Does this comparative analysis tell you anything new about your first text? What are some observations or speculations you could make now about two books you haven’t read, based on the tag clouds?
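
    If you are curious what a log-likelihood comparison actually computes, the following Python sketch scores a single word using the common two-corpus form of Dunning’s statistic, G2 = 2 * sum(observed * ln(observed / expected)). I am assuming this form for illustration; the HTRC implementation may differ in its details. The function name, the sign convention (positive for overrepresented, negative for underrepresented) and all of the counts are invented.

```python
import math

def log_likelihood(a, b, c, d):
    """Signed Dunning-style log-likelihood (G2) for one word.

    a, b: occurrences of the word in the analysis and reference texts
    c, d: total number of words in the analysis and reference texts
    """
    # Expected counts if the word were equally common in both texts.
    expected_a = c * (a + b) / (c + d)
    expected_b = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:  # a zero count contributes nothing to the sum
        g2 += a * math.log(a / expected_a)
    if b > 0:
        g2 += b * math.log(b / expected_b)
    g2 *= 2
    # Sign convention for this sketch: positive means the word is
    # overrepresented in the analysis text, negative underrepresented.
    return g2 if a / c >= b / d else -g2

# Invented counts: a word appearing 30 times in a 10,000-word analysis
# text and twice in a 10,000-word reference text.
score = log_likelihood(30, 2, 10_000, 10_000)
print(round(score, 2))
```

    The larger the absolute score, the less plausible it is that the difference between the two texts is down to chance. A word used at the same rate in both texts scores zero and would appear in neither cloud.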

  5. Create a topic model

    Topic modeling is useful for getting a sense of the contents of your workset, however many texts it may include. There’s an argument to be made that topic modeling as a technique only begins to become informative when applied to a very large collection of texts (n>=1000 or so, depending on whom you ask). But for today’s purposes, we are going to scale this method down to our one primary text. Select Meandre_Topic_Modeling from the algorithms list (it is no. 10). This algorithm creates a list of “topics” from the workset. A “topic” is simply a set of words that have a high probability of occurring together within the contents of the workset. The topics are displayed as tag clouds. What are the topic clusters that emerge from your primary text? Spend a bit of time trying to characterize each topic cluster. Is there a way to describe or label them? Are there any surprises or mysteries? Junk topics (full of OCR errors, or empty words)? Points of interest? Objects for further study? Can you think of a strategy to extend your study of any one topic?

  6. Explore Ngrams

    The HTRC has an Ngram viewer called the HathiTrust Bookworm. In its current iteration, HathiTrust Bookworm supports only 1-grams, or unigrams (single words), as search terms. As you learned from the Culturomics reading, a 1-gram is “a string of characters uninterrupted by a space” like “banana,” but also including acronyms, numbers and typos. Generally speaking, an n-gram is a sequence of n 1-grams, and its usage frequency is calculated by dividing the number of instances of the n-gram in a given year by the total number of words in the corpus in that year. Take some of the words you’ve identified from the previous steps and enter them into the HathiTrust Bookworm (you can add more than one word by pressing on the green + symbol). Pay attention to the frequency of those words through time, and be particularly mindful of their frequency when your chosen novel was published. Do any of them stand out, either as particularly common words during their time or, perhaps as interestingly, as particularly uncommon words during their time? The broader question here: can a tool like the HathiTrust Bookworm, which analyzes over 4 million texts, help you understand anything about the historical place of a novel you’ve never read? Another note in passing: the Google Ngram viewer supports 1- to 5-grams with a few additional features. It does not, to my knowledge, provide the same level of metadata faceting available through HathiTrust Bookworm.
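
    The frequency calculation described above is simple enough to sketch in a few lines of Python. This is only an illustration of the arithmetic, not of how Bookworm is built: the three-line “corpus,” its publication years and the function name yearly_frequency are all invented, and splitting on whitespace is the crudest possible way to find 1-grams.

```python
from collections import Counter

# Invented toy corpus of (publication year, text) pairs.
corpus = [
    (1850, "the whale the sea the ship"),
    (1850, "a whale of a tale"),
    (1851, "the sea was calm"),
]

def yearly_frequency(corpus, word):
    """Return {year: occurrences of word / total 1-grams in that year}."""
    hits, totals = Counter(), Counter()
    for year, text in corpus:
        # 1-grams: strings of characters uninterrupted by a space.
        tokens = text.lower().split()
        hits[year] += tokens.count(word)
        totals[year] += len(tokens)
    # Skip any year with no words at all to avoid dividing by zero.
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

whale = yearly_frequency(corpus, "whale")
for year, frequency in whale.items():
    print(year, round(frequency, 3))
```

    Dividing by the yearly total is what makes years comparable: without it, any word would seem to grow more popular simply because more books were printed each year.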

  7. Read the first chapter

    Quoting Cordell again:

    Now that you’ve not read the entire work, go back and actually read its first chapter or section. Did the textual analyses you performed prepare you to understand the themes, character, setting, or any other aspects of this first chapter? Did the trends you studied through “distant reading” cause you to focus on things in the chapter you would not otherwise have been paying attention to? Are there ideas you expected to encounter based on your textual analysis, but didn’t? Were there ideas in the first chapter that seem entirely unrelated to the analyses you performed beforehand?

    You will probably find it easiest to read your chapter in the HathiTrust Digital Library interface. Be sure to log in as Rutgers University if you want to download the novel as a PDF to your computer. You can click through to the novel from the HTRC portal by going to Worksets and clicking on the hyperlinked title of the novel. For example, if you chose Pride and Prejudice as your novel, you will be directed here.

  8. Write a Lab Report

    Lastly, please write your second lab report about what you did and what you learned in the process. Reflect specifically on what you learned about [1] your chosen text and [2] this method of “distant reading.” The goal of this assignment is to think about the kinds of knowledge a distant reading can or cannot produce. In other words, consider how textual analysis changes our attention to texts. It is completely acceptable to include unanswered questions.




Further Reading

Fyfe, Paul. “How to Not Read a Victorian Novel.” Journal of Victorian Culture (Routledge) 16, no. 1 (April 2011): 84-88.

Moretti, Franco. Distant Reading. London: Verso, 2013.

