Right now DH is all texts, but not enough perspectives. -Andrew Piper
Summary: Digital Humanities suffers from a lack of perspectives in two ways: we need to focus more on the perspectives of those who interact with the cultural objects we study, and we need more outside academic perspectives. In Part 1, I cover Russian Formalism, questions of validity, and what perspective we bring to our studies. In Part 2, 1 I call for pulling inspiration from even more disciplines, and for the adoption and exploration of three new-to-DH concepts: Appreciability, Agreement, and Appropriateness. These three terms will help tease apart competing notions of validity.
Let’s begin with the century-old Russian Formalism, because why not? 2 Syuzhet, in that context, is juxtaposed against fabula. Syuzhet is a story’s order, structure, or narrative framework, whereas fabula is the underlying fictional reality of the world. Fabula is the story the author wants to get across, and syuzhet is the way she decides to tell it.
It turns out elements of Russian Formalism are resurfacing across the digital humanities, enough so that there’s an upcoming Stanford workshop on DH & Russian Formalism, and even I co-authored a piece that draws on work of Russian formalists. Syuzhet itself has a new meaning in the context of digital humanities: it’s a piece of code that chews books and spits out plot structures.
You may have noticed a fascinating discussion developing recently on statistical analysis of plot arcs in novels using sentiment analysis. A lot of buzz especially has revolved around Matt Jockers and Annie Swafford, and the discussion has bled into larger academia and inspired 246 (and counting) comments on reddit. Eileen Clancy has written a two-part broad link summary (I & II).
The idea of deriving plot arcs from sentiment analysis has proven controversial on a number of fronts, and I encourage those interested to read through the links to learn more. The discussion I’ll point to here centers around “validity“, a word being used differently by different voices in the conversation. These include:
- Do sentiment analysis algorithms agree with one another enough to be considered valid?
- Do sentiment analysis results agree with humans performing the same task enough to be considered valid?
- Is Jockers’ instantiation of aggregate sentiment analysis validly measuring anything besides random fluctuations?
- Is aggregate sentiment analysis, by human or machine, a valid method for revealing plot arcs?
- If aggregate sentiment analysis finds common but distinct patterns and they don’t seem to map onto plot arcs, can they still be valid measurements of anything at all?
- Can a subjective concept, whether measured by people or machines, actually be considered invalid or valid?
In this particular iteration of the discussion, validity implies a connection between the algorithm’s results and some interpretive consensus among experts. Piper points out that consensus doesn’t yet exist, because:
We have the novel data, but not the reader data. Right now DH is all texts, but not enough perspectives.
And he’s right. So far, DH seems to focus its scaling up efforts on the written word, rather than the read word.
This doesn’t mean we’ve ignored studying large-scale reception. In fact, I’m about to argue that reception is built into our large corpora text analyses, even though it wasn’t by design. To do so, I’ll discuss the tension between studying what gets written and what gets read through distant reading.
The Great Unread
[…] the “lost best-sellers” of Victorian Britain: idiosyncratic works, whose staggering short-term success (and long-term failure) requires an explanation in their own terms.
The phrase has since become synonymous with large text databases like Google Books or HathiTrust, and is used in concert with distant reading to set digital literary history apart from its analog counterpart. Distant reading The Great Unread, it’s argued,
significantly increase[s] the researcher’s ability to discuss aspects of influence and the development of intellectual movements across a broader swath of the literary landscape. -Tangherlini & Leonard
Which is awesome. As I understand it, literary history, like history in general, suffers from an exemplar problem. Researchers take a few famous (canonical) books, assume they’re a decent (albeit shining) example of their literary place and period, and then make claims about culture, art, and so forth based on those novels which are available.
Matthew Lincoln raised this point the other day, as did Matthew Wilkins in his recent article on DH in the study of literature and culture. Essentially, both distant- and close-readers make part-to-whole generalized inferences, but the process of distant reading forces those generalizations to become formal and explicit. And hopefully, by looking at The Great Unread (the tens of thousands of books that never made it into the canon), claims about culture can better represent the nuanced literary world of the past.
But this is weird. Without exemplars, what the heck are we studying? This isn’t a representation of what’s stood the test of time—that’s the canon we know and love. It’s also not a representation of what was popular back then (well, it sort of was, but more on that shortly), because we don’t know anything about circulation numbers. Most of these Google-scanned books surely never caught the public eye, and many of the now-canonical pieces of literature may not have been popular at the time.
It turns out we kinda suck at figuring out readership statistics, or even at figuring out what was popular at any given time, unless we know what we’re looking for. A folklorist friend of mine has called this the Sophus Bauditz problem. An expert in 19th century Danish culture, my friend one day stumbled across a set of nicely-bound books written by Sophus Bauditz. They were in his era of expertise, but he’d never heard of these books. “Must have been some small print run”, he thought to himself, before doing some research and discovering copies of these books he’d never heard of were everywhere in private collections. They were popular books for the emerging middle class, and sold an order of magnitude more copies than most books of the era; they’d just never made it into the canon. In another century, 50 Shades of Grey will likely suffer the same fate.
In this light, I find The Great Unread to be a weird term. The Forgotten Read, maybe, to refer to those books which people actually did read but were never canonized, and The Great Tsundoku 4 for those books which were published, lasted to the present, and became digitized, but for which we have no idea whether anyone bothered to read them. The former would likely be more useful in understanding reception, cultural zeitgeist, etc.; the latter might find better use in understanding writing culture and perhaps authorial influence (by seeing whose styles the most other authors copy).
In the present data-rich world we live in, we can still only grasp at circulation and readership numbers. Library circulation provides some clues, as does the number, size, and sales of print editions. It’s not perfect, of course, though it might be useful in separating zeitgeist from actual readership numbers.
Mathematician Jordan Ellenberg recently coined the tongue-in-cheek Hawking Index, because Stephen Hawking’s books are frequently purchased but rarely read, to measure just that. In his Wall Street Journal article, Ellenberg looked at popular books sold on Amazon Kindle to see where people tended to socially highlight their favorite passages. Highlights from Kahneman’s “Thinking Fast and Slow”, Hawking’s “A Brief History of Time”, and Picketty’s “Capital in the Twenty-First Century” all tended to cluster in the first few pages of the books, suggesting people simply stopped reading once they got a few chapters in.
Kindle and other ebooks certainly complicate matters. It’s been claimed that one reason behind 50 Shades of Grey‘s success was the fact that people could purchase and read it discreetly, digitally, without worry about embarrassment. Digital sales outnumbered print sales for some time into its popularity. As Dan Cohen and Jennifer Howard pointed out, it’s remarkably difficult to understand the ebook market, and the market is quite different among different constituencies. Ebook sales accounted for 23% of the book market this year, yet 50% of romance books are sold digitally.
And let’s not even get into readership statistics for novels that are out copyright, or sold used, or illegally attained: they’re pretty much impossible to count. Consider It’s a Wonderful Life (yes, the 1946 Christmas movie). A clerical accident pushed the movie into the public domain (sort of) in 1974. It had never really been popular before then, but once TV stations could play it without paying royalties, and VHS companies could legally produce and sell copies for free, the movie shot to popularity. Importantly, it shot to popularity in a way that was impossible to see on official license reports, but which Google ngrams reveals quite clearly.
This ngram visualization does reveal one good use for The Great Tsundoku, and that’s to use what authors are writing about as finger on the pulse of what people care to write about. This can also be used to track things like linguistic influence. It’s likely no coincidence, for example, that American searches for the word “folks” doubled during the first month’s of President Obama’s bid for the White House in 2007. 5
Matthew Jockers has picked up on this capability of The Great Tsundoku for literary history in his analyses of 19th century literature. He compares books by various similar features, and uses that in a discussion of literary influence. Obviously the causal chain is a bit muddled in these cases, culture being ouroboric as it is, and containing a great deal more influencing factors than published books, but it’s a good set of first steps.
But this brings us back to the question of The Great Tsundoku vs. The Forgotten Read, or, what are we learning about when we distant read giant messy corpora like Google Books? This is by no means a novel question. Ted Underwood, Matt Jockers, Ben Schmidt, and I had an ongoing discussion on corpus representativeness a few years back, and it’s been continuously pointed to by corpus linguists 6 and literary historians for some time.
Surely there’s some appreciable difference when analyzing what’s often read versus what’s written?
Surprise! It’s not so simple. Ted Underwood points out:
we could certainly measure “what was printed,” by including one record for every volume in a consortium of libraries like HathiTrust. If we do that, a frequently-reprinted work like Robinson Crusoe will carry about a hundred times more weight than a novel printed only once.
if we’re troubled by the difference between “what was written” and “what was read,” we can simply create two different collections — one limited to first editions, the other including reprints and duplicate copies. Neither collection is going to be a perfect mirror of print culture. Counting the volumes of a novel preserved in libraries is not the same thing as counting the number of its readers. But comparing these collections should nevertheless tell us whether the issue of popularity makes much difference for a given research question.
While his claim skirts the sorts of issues raised by Ellenberg’s Hawking Index, it does present a very reasonable natural experiment: if you ask the same question of three databases (1. The entire messy, reprint-ridden corpus; 2. Single editions of The Forgotten Read, those books which were popular whether canonized or not; 3. The entire Great Tsundoku, everything that was printed at least once, regardless of whether it was read), what will you find?
Underwood performed 2/3rds of this experiment, comparing The Forgotten Read against the entire HathiTrust corpus on an analysis of the emergence of literary diction. He found that the trend results across both were remarkably similar.
Clearly they’re not precisely the same, but the fact that their trends are so similar is suggestive that the HathiTrust corpus at least shares some traits with The Forgotten Read. The jury is out on the extent of those shared traits, or whether it shares as much with The Great Tsundoku.
The cause of the similarities between historically popular books and books that made it into HathiTrust should be apparent: 7 historically popular books were more frequently reprinted and thus, eventually, more editions made it into the HathiTrust corpus. Also, as Allen Riddell showed, it’s likely that fewer than 60% of published prose from that period have been scanned, and novels with multiple editions are more likely to appear in the HathiTrust corpus.
This wasn’t actually what I was expecting. I figured the HathiTrust corpus would track more closely to what’s written than to what’s read—and we need more experiments to confirm that’s not the case. But as it stands now, we may actually expect these corpora to reflect The Forgotten Read, a continuously evolving measurement of readership and popularity. 8
Lastly, we can’t assume that greater popularity results in larger print runs in every case, or that those larger print runs would be preserved. Ephemera such as zines and comics, digital works produced in the 1980s, and brittle books printed on acidic paper in the 19th century all have their own increased likelihoods of vanishing. So too does work written by minorities, by the subjected, by the conquered.
The Great Unreads
There are, then, quite a few Great Unreads. The Great Tsundoku was coined with tongue planted firmly in-cheek, but we do need a way of talking about the many varieties of Great Unreads, which include but aren’t limited to:
- Everything ever written or published, along with size of print run, number of editions, etc. (Presumably Moretti’s The Great Unread.)
- The set of writings which by historical accident ended up digitized.
- The set of writings which by historical accident ended up digitized, cleaned up with duplicates removed, multiple editions connected and encoded, etc. (The Great Tsundoku.)
- The set of writings which by historical accident ended up digitized, adjusted for disparities in literacy, class, document preservation, etc. (What we might see if history hadn’t stifled so many voices.)
- The set of things read proportional to what everyone actually read. (The Forgotten Read.)
- The set of things read proportional to what everyone actually read, adjusted for disparities in literacy, class, etc.
- The set of writings adjusted proportionally by their influence, such that highly influential writings are over-represented, no matter how often they’re actually read. (This will look different over time; in today’s context this would be closest to The Canon. Historically it might track closer to a Zeitgeist.)
- The set of writings which attained mass popularity but little readership and, perhaps, little influence. (Ellenberg’s Hawking-Index.)
And these are all confounded by hazy definitions of publication; slowly changing publication culture; geographic, cultural, or other differences which influence what is being written and read; and so forth.
The important point is that reading at scale is not clear-cut. This isn’t a neglected topic, but nor have we laid much groundwork for formal, shared notions of “corpus”, “collection”, “sample”, and so forth in the realm of large-scale cultural analysis. We need to, if we want to get into serious discussions of validity. Valid with respect to what?
This concludes Part 1. Part 2 will get into the finer questions of validity, surrounding syuzhet and similar projects, and will introduce three new terms (Appreciability, Agreement, and Appropriateness) to approach validity in a more humanities-centric fashion.
- Coming in a few weeks because we just received our proofs for The Historian’s Macroscope and I need to divert attention there before finishing this. ↩
- And anyway I don’t need to explain myself to you, okay? This post begins where it begins. Syuzhet. ↩
- The phrase was originally coined by Margaret Cohen. ↩
- (see illustration below) ↩
- COCA and other corpus tools show the same trend. ↩
- Heather Froelich always has good commentary on this matter. ↩
- Although I may be reading this as a just-so story, as Matthew Lincoln pointed out. ↩
- This is a huge oversimplification. I’m avoiding getting into regional, class, racial, etc. differences, because popularity obviously isn’t universal. We can also argue endlessly about representativeness, e.g. whether the fact that men published more frequently than women should result in a corpus that includes more male-authored works than female-authored, or whether we ought to balance those scales. ↩