Enumerations: Data and Literary Study
Ebook, 393 pages, 5 hours

About this ebook

For well over a century, academic disciplines have studied human behavior using quantitative information. Until recently, however, the humanities have remained largely immune to the use of data—or vigorously resisted it. Thanks to new developments in computer science and natural language processing, literary scholars have embraced the quantitative study of literary works and have helped make Digital Humanities a rapidly growing field. But these developments raise a fundamental, and as yet unanswered, question: what is the meaning of literary quantity?
In Enumerations, Andrew Piper answers that question across a variety of domains fundamental to the study of literature. He focuses on the elementary particles of literature, from the role of punctuation in poetry, the matter of plot in novels, the study of topoi, and the behavior of characters, to the nature of fictional language and the shape of a poet’s career. How does quantity affect our understanding of these categories? What happens when we look at 3,388,230 punctuation marks, 1.4 billion words, or 650,000 fictional characters? Does this change how we think about poetry, the novel, fictionality, character, the commonplace, or the writer’s career? In the course of answering such questions, Piper introduces readers to the analytical building blocks of computational text analysis and brings them to bear on fundamental concerns of literary scholarship. This book will be essential reading for anyone interested in Digital Humanities and the future of literary study.

Language: English
Release date: Sep 4, 2018
ISBN: 9780226568898
    Book preview

    Enumerations

    Data and Literary Study

    Andrew Piper

    The University of Chicago Press

    Chicago and London

    The University of Chicago Press, Chicago 60637

    The University of Chicago Press, Ltd., London

    © 2018 by The University of Chicago. All rights reserved. No part of this book may be used or reproduced in any manner whatsoever without written permission, except in the case of brief quotations in critical articles and reviews. For more information, contact the University of Chicago Press, 1427 East 60th Street, Chicago, IL 60637.

    Published 2018

    Printed in the United States of America

    27 26 25 24 23 22 21 20 19 18    1 2 3 4 5

    ISBN-13: 978-0-226-56861-4 (cloth)

    ISBN-13: 978-0-226-56875-1 (paper)

    ISBN-13: 978-0-226-56889-8 (e-book)

    DOI: https://doi.org/10.7208/chicago/9780226568898.001.0001

    The University of Chicago Press gratefully acknowledges the generous support of the Social Sciences and Humanities Research Council of Canada toward the publication of this book.

    Library of Congress Cataloging-in-Publication Data

    Names: Piper, Andrew, 1973– author.

    Title: Enumerations : data and literary study / Andrew Piper.

    Description: Chicago ; London : The University of Chicago Press, 2018. | Includes bibliographical references and index.

    Identifiers: LCCN 2018003977 | ISBN 9780226568614 (cloth : alk. paper) | ISBN 9780226568751 (pbk. : alk. paper) | ISBN 9780226568898 (e-book)

    Subjects: LCSH: Criticism—Statistical methods. | Criticism—Data processing. | Digital humanities.

    Classification: LCC PN98.E4 P56 2018 | DDC 801/.95—dc23

    LC record available at https://lccn.loc.gov/2018003977

    This paper meets the requirements of ANSI/NISO Z39.48–1992 (Permanence of Paper).

    to my parents

    Contents

    Preface

    Introduction (Reading’s Refrain)

    1  Punctuation (Opposition)

    2  Plot (Lack)

    3  Topoi (Dispersion)

    4  Fictionality (Sense)

    5  Characterization (Constraint)

    6  Corpus (Vulnerability)

    Conclusion (Implications)

    Acknowledgments

    Appendix A

    Appendix B

    Data Sets

    Notes

    Index

    Preface

    This book is part of a longer exploration of the relationship between technology and reading, one that has occupied me for most of my career. In Dreaming in Books, I studied how romantic literature made sense of the bibliographic reorganizations that were sweeping across Europe and North America at the turn of the nineteenth century. Romanticism was, in this reading, a movement deeply invested in understanding material and technological changes that we are in many ways still grappling with. Book Was There sought to understand the recent technological upheavals around books by paying attention to more embodied dimensions of reading. Whether it is the touch or sight of the page or the places and practices of note-taking, game-playing, sharing, storing, or consuming books, I wanted to show how these experiences differ profoundly between print and digital media. Finally, Interacting with Print turned to the ways historical actors engaged with their reading material to produce new kinds of social communities, new models of creativity, and new structures of knowledge. Written with twenty-two coauthors, Interacting with Print put theory into practice in an elaborate process of scholarly interactivity of its own.

    This then is the intellectual background to the book you are reading. Enumerations is about how computation participates in the construction of meaning when we read. It argues that data and computation unquestionably have a role to play in understanding literature, but that the way we have so far approached this problem rests on a number of flawed premises. The notions of distance, bigness, or objectivity that are largely in circulation right now rely on overly binary models of reading, largely untethered from past practices. Enumerations tries to show how these frameworks do not adequately capture the nature of computational modeling and its place within the rich history of reading. We still do not have a clear picture of how emerging quantitative methods speak to the questions that matter within the discipline of literary studies. This book is an attempt to align new kinds of models with old kinds of questions.

    As I began to think about why I was interested in the question of literary quantity, I realized that it marked an even more general continuum with previous concerns. It belonged to my abiding interest in understanding the commensurability of seemingly incommensurable things. Instead of exploring the relationship between words and their objects, or bodies and reading, or paper and electronic books, as I had done previously, here I have moved to the relationship between letter, number, and image (in the form of the diagram). However disparate, behind each of these efforts lies the idea of translation, the act of moving between languages, cultures, and mentalities, as a core practice, but also an ideal, for humanistic scholarship. The first book I ever published was a translation, and it occurs to me now that I have been writing under this sign ever since. In the back of my mind, I keep trying to imagine an alternative future where students are not dutifully apportioned into silos of numeracy and literacy, but are placed in a setting where these worldviews mix more fluidly and interchangeably.

    As much as this book represents a continuum, it also marks a breaking point, both from my past work and in the sense of something being broken. The research for this book began concretely when I started retraining myself in the field of computational text analysis several years ago, combining the practice of computer programming with that of quantitative reasoning. As hard as this process was and continues to be, computation allowed me to gain two fundamental insights about our discipline that had so far been overlooked. The first is the pervasive quality of textual repetition. The vast bulk of any single text consists of elements that repeat themselves with great frequency. These repetitions in turn multiply out in the world, giving coherence to entire domains of writing, such as genres, periods, modes, topoi, and careers. And yet, we have had no way of accounting for this fact of recurrence. It was as though we had elected to orient ourselves around rare events to protect ourselves from the vast majority of textual features (not to mention texts themselves). Focusing on a single dash in Heinrich von Kleist makes sense only if you pretend that there are not tens of thousands of other ones floating around.

    The second problem is what I discuss in the introduction as a science of generalization. Until recently, we have had no way of testing our insights across a broader collection of texts, to move from observations about individual novels to arguments about things like the novel. And yet, we make these generalizations all the time. Indeed, one could argue that generalization is a crucial aspect of any scholarly method. It is what allows us to identify the significance of a particular instance as well as the social and historical significance of some larger set of practices. It is how we move between part and whole.

    As recent research has begun to suggest, those wholes have been expanding for some time. The scale of our categories (world literature, new media, post-canon) has been matched by an increasing attention to social critique, to questions of worldly mattering. And yet, our methods have remained largely unchanged. I will never forget the moment when I realized that the usual answer our field offers to initiates when faced with this problem—read more!—suddenly seemed incredibly, even senselessly, insufficient. As the Enlightenment scholar Johann Hamann once said, the imperative to read more feels like the punishment of carrying water through a sieve meted out to the daughters of Danaus. More reading could never by itself provide the evidentiary foundations to make categorical arguments—whether about Romanticism, modernity, the book, the novel, or even literature. We require some way of traversing scales, of testing our individual insights and observations against a set of texts that is more representative of the category about which we are speaking, especially as our analytical scales keep expanding. It is clear that more needs to be replaced by something we might call method. I had seen the crack in the table.

    I take this expression from a short 16 mm film produced by the artist Paul Sietsema (Anticultural Positions [2010]). In it, we see a number of still photographs of the surfaces upon which he worked while making the photographs and paintings used in an earlier film (Figure 3 [2008]), one that was itself largely concerned with the representation of surfaces, like paper and pottery. What makes these films so moving is Sietsema’s attention to the fissures and lines that corrugate any surface when looked at closely, the way he sees the furrows of surface. At one point, he shows us a marble tabletop in his studio, in which we see a slight crack. Behind the pristine surfaces of knowledge, the foundations upon which something else is made, Sietsema reminds us that we also need to see the cracks, the places of vulnerability within the whole. It was these cracks in the otherwise smooth surface of reading that computation had allowed me to see.

    Sietsema’s imagery offers a useful metaphor in another sense, too, because it draws attention to the visibility of the materials we use when we create. Filming the surfaces upon which the objects of a film are made highlights the infrastructures of how we know things. As I argue in the introduction, one of the affordances of computational reading is the way it makes the critical project more legible than has traditionally been the case. While there will always be tacit dimensions to knowledge (as Michael Polanyi was the first to remind us), computation can be far more exo- than endoskeletal when compared with inherited critical practices. It is in this spirit that I have tried to make as much of the data and code used in this book publicly available. This includes over 7,000 lines of code (paltry for some, elephantine for me), as well as hundreds of tables of derived data from the primary data sets, all organized by chapter. While many of the primary data sets, which are described more fully in the appendix, cannot be shared, due to copyright, I provide code and tables of metadata about the collections for you to extract and build your own versions of them (or at least understand what has been included in them). I am trying to set a standard of reproducibility that will, I hope, gradually become more of a norm.

    Throughout, I have adopted the convention of describing each model or calculation referenced in the text in the notes. I generally favor plain-language descriptions of models over formulas and equations. The notes also contain a subsequent reference to the accompanying piece of code beyond the book (i.e., see script 1.1), where the full implementation of a model can be reviewed in greater detail. In doing so, I am trying to strike a balance between the conventions of the humanities, which emphasize reading as a form of knowledge in its own right, and those of more quantitative disciplines, which put all the formulas and tables up front. Others may want a different approach, but my hope is that this allows for a thoughtful reading experience as well as the ability to replicate a model. It maintains the spirit of the foot- or endnote as a paratextual space with a difference. You are free to use the code for your own purposes or to try to reproduce the results I put forth here. I make no claims to elegance in programming, but I am confident that the scripts work, at least as of today. Durability has taken on a whole new scale of meaning when seen against the long timescales of bibliographic preservation.

    At its heart, this book is an attempt to bridge two very different intellectual worlds and ways of thinking and reading. It would not have been possible without much generosity on the part of people from both of these worlds, some of whom I explicitly name in the acknowledgments or notes, but there are many more. The field is too diverse to be captured by a single proper name. We would do well to acknowledge that. Throughout the research and writing of this book, I have received tremendous amounts of help from others. This work is unquestionably more collective than traditional scholarship in the humanities. But it is also more bootstrapped, to use a computing term borrowed from the world of horses, in the sense of being more improvisational. Much of what I have learned has been acquired through the meandering and chance encounters of someone making his way through new terrain. As Adam Hammond has argued, there is a DIY quality to programming and computational criticism that is inspiring and pedagogically encouraging. This book wants to convince you that if you are not already doing so, then you too can enter into the world of computational reading.

    If we are going to foster this sense of exploration (and the potential for getting lost), then we ultimately need a more flexible model of what it means to be an expert. Alongside the expertise of specialization, we need to value the expertise of synthesis and mediation, what it means to speak two different languages, or codes, or embody two different mentalities simultaneously. This book is dedicated to all of those people who don’t feel at home inside something, whether it is a culture, a club, or a discipline, and instead who think there is something important to be discovered, something novel and consequential, in the spaces between.

    Introduction (Reading’s Refrain)

    What is the sum of the text?

    ROLAND BARTHES

    Repetitions

    In The Jew’s Daughter, a now classic work of electronic fiction by Judd Morrissey, we see an image of a single, static page. As the cursor passes over a highlighted word, portions of the page change, while some of the words stay the same. Unlike the turning of the page in a book, where a visual space is entirely overwritten, here only parts of the page change, even as it maintains its overall formal stability. The pages of The Jew’s Daughter—if we can call them that—not only follow one another in a linear sequence. They are also woven into one another. We might say, drawing on a bibliographic metaphor, that they are interleaved.

    The title of Morrissey’s work was taken from the name of a ballad sung in James Joyce’s Ulysses. In one sense, it performs a familiar literary gesture where a more canonical work is cited by a later one, much as Joyce’s own work had done. At the same time, in the invocation of the genre of the ballad, The Jew’s Daughter also highlights the poetic device of the refrain as a cornerstone of literary expression, if not of the experience of reading itself. In this, it is perhaps drawing on the account from Augustine’s Confessions in which the child’s nearby refrain initiates the author’s personal conversion, one that takes the form of the repetitive command “Take it and read, take it and read.” Rhetoric and medium, the page and the refrain, recapitulate one another through their mutual investment in repetition and transformation. They too are interleaved.

    In taking seriously the long history of reading’s technics—in foregrounding this elementary association between reading and the refrain—Morrissey’s work invites us to think about reading in a profoundly new way. It asks us to reflect on the meaning of reading as a form of rereading, not in the Augustinian sense as that which comes after reading, as a form of personal transformation. Rather, Morrissey’s work asks us to think about the meaning of rereading within reading, to understand the redundancy that is reading. When an average of 56% of words repeat themselves with every turn of a novel’s page, what is the meaning, Morrissey’s work is asking, of the quantities that underlie such repetitions?¹ What does it mean to read the same thing again and again?
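
    A figure like the 56% above can be approximated in a few lines of code. What follows is only a minimal sketch under stated assumptions, not the calculation documented in the book’s notes: it treats a “page” as a fixed run of 250 word tokens (an arbitrary choice), lowercases everything, and uses a hypothetical helper named page_repetition_rate to report the average share of words on each page that already appeared on the page before it.

```python
import re


def page_repetition_rate(text, page_size=250):
    """Average share of word tokens on a page that also occur on the previous page.

    Assumption: a "page" is simply a run of `page_size` consecutive tokens.
    """
    # Lowercase and keep runs of ASCII letters/apostrophes as word tokens.
    words = re.findall(r"[a-z']+", text.lower())
    # Cut the token stream into consecutive pages (the last one may be shorter).
    pages = [words[i:i + page_size] for i in range(0, len(words), page_size)]
    rates = []
    for previous, current in zip(pages, pages[1:]):
        seen = set(previous)
        repeated = sum(1 for word in current if word in seen)
        rates.append(repeated / len(current))
    # With fewer than two pages there is nothing to compare.
    return sum(rates) / len(rates) if rates else 0.0
```

    Run over the plain text of a novel, page_repetition_rate(novel_text) returns a value between 0 and 1. The result depends heavily on the page size and on how words are tokenized, so it should not be expected to reproduce the reported figure exactly.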

    Ever since Pliny the Younger said, “Read much, not many,” debates about quantity have been central to debates about reading and literature. To read the right amount has often carried with it a moral imperative.² At the same time, the sheer volume of reading material available at any given time has often been seen as culturally significant in its own right. A great deal of research in the fields of book history and bibliography has drawn attention over the years to how the quantity of physical documents has shaped the literary and intellectual landscape.³ According to this point of view, more things matter.

    This book offers a new perspective on the significance of quantity for the study of literature. Inspired by the emerging fields of natural language processing, machine learning, and text and data mining, as well as a host of colleagues beginning to work in this area, Enumerations explores the quantitative dimensions within texts, the ways in which the repetitions of language lend meaning to our experience as readers. As Gilles Deleuze once remarked, the literary critic has traditionally been a purveyor of rarity, watching over the singular achievements of singular individuals, much like the gatekeeper of Kafka’s parable Before the Law.⁴ And yet so much of the activity of reading is built upon repetition, all of the ways that words, and the ideas and feelings they give birth to, have quantitative dimensions. We are now able to move beyond simply counting books to measuring the complex features that reside within and between them. But these developments raise a fundamental, and as yet unanswered, question: What is the meaning of literary quantity?

    Enumerations is an attempt to answer that “meaning question,” as Alan Liu has called it, across some of the most elementary dimensions of literature.⁵ In keeping with the embryonic state of the field, it concentrates on the building blocks of literary study, from the role of punctuation in poetry, to the matter of emplotment in novels, to the dispersion of topics, the behavior of characters, to the nature of fictional language and the shape of a poet’s career. It does so by assembling new kinds of data that represent different literary forms over the past two centuries. This includes over 230,000 poems, 15,000 novels, and another 12,000 works of nonfiction.⁶ Through data, it attempts to tell the deep story of these elementary literary features, where by “deep story” I mean all of the ways that cultural practices manifest themselves in repetitive, often predictable, and sometimes excessive ways.⁷ Paying attention to quantity reveals the grooves and channels of cultural expression, the deep connections among words, ideas, and forms.⁸ It brings us back in many ways to an originary sense of culture as an agricultural form of cultivation. Repetitions etch themselves into the social fabric, like so many furrows in the ground. They are often invisible, because so common. It is only when we take into account these repetitions—in my case 3,388,230 punctuation marks of twentieth-century poetry, 1.4 billion words of novels over the past two centuries, or 650,000 fictional characters that populate the nineteenth century—that we are able to see the deep story of poetry, the novel, fictionality, character, the commonplace, and the writer’s career begin to emerge.

    In drawing attention to the quantities of literature, Enumerations aims first and foremost to rethink how the computational study of literature has initially been framed.⁹ The seemingly endless array of debates surrounding the field—indeed, the way “debates” has emerged as a kind of default genre within the field—has led to hardened and quite often deeply anachronistic notions of disciplinary history. We are talking not only past each other, but also past the past itself, overlooking the numerous ways that this mode of inquiry grows out of a variety of different traditions.¹⁰ The emphasis on novelty, but also bigness, empiricism and, as we will see, overly simplified and often binary models of reading, has led us to miss the important ways that computational reading is inevitably tied to the norms and practices of the past. In misapprehending this disciplinary inheritance, we too easily misjudge the ways in which it does indeed offer distinctive challenges to disciplinary norms in the present.

    The notion of repetition behind Morrissey’s digital page is a case in point. Ever since Augustine, the act of rereading has served as a deeply effective practice of cultural stabilization.¹¹ It has stood alongside a variety of other repetitive cultural practices, such as copying, reprinting, note-taking, commonplacing, anthologizing, canonizing, and memorizing, which have all assumed cultural importance and have been the focus of much academic research.¹² This book argues that the repetitions of language are no less important, even if until now largely overlooked. Quantity signals, but it also distinguishes and maintains. If there is a basic dogma against which this book argues, it is that literature is not founded on the rare and the singular, but rather the common and the collective, the fabric of repetition from which it is made.¹³ In this, it situates computational research within a much longer tradition of literary study that attends to questions of cultural distinction and durability, but does so in new ways.¹⁴

    For many in our field, the introduction of quantity into the study of literature has been seen as nothing short of an act of intellectual colonization. Numeracy’s rise signals literacy’s eclipse, and with it a host of highly charged concepts like subjectivity, individuality, creativity, or even agency. Language plays subjective foil to number’s objectivity. According to a critic like Friedrich Kittler, there has been an ongoing 2,000-year battle between the forces of literacy and numeracy.¹⁵ The impoverished view that this takes on the history of literature and thought should be obvious. The list of philosophers who were mathematically trained is long, from Leibniz and Descartes to Ludwig Wittgenstein, just as the centrality of number to literature is equally rich. That there are nine circles of hell in Dante’s Inferno, 100 stories in Boccaccio’s Decameron, 365 chapters in Hugo’s Les Misérables, 108 lines in Poe’s The Raven, and a genre of poetry with exactly 14 lines that has lasted for over half a millennium (not to mention the entire field of prosody) indicates just some of the ways that quantity has been an essential component of literary meaning since its inception.

    Far from seeing the computational turn in literary studies as something distinctively new or even alien, we can and should understand it as part of the history of humanism itself. Humanism was in many ways founded upon the notion of studying linguistic and material differences, a practice of fostering the ability to understand the ways in which texts differ between times and places. Translation would emerge as one of its core practices as well as ideals. The knowledge gained in moving between languages, historical epochs, and systems of writing was seen as the highest form of knowledge. Crossing the divide of textual and linguistic difference was a means of potentially crossing the divide to something more spiritually transcendent. Erasmus’s bilingual New Testament might be considered to be one of this tradition’s most important founding documents.

    Today, there is a new translational imperative at work, one that aims to move between letters and numbers. Translating texts into quantities has emerged as the overwhelming feature of our cultural moment. Rather than see this as a kind of fallen state, I think we would do well to reposition it within a longer tradition of translational humanism, to see it as part of an ongoing intellectual drama that tries to understand the act of commensuration, of making different sign systems compatible with one another. Seen in this way, the literate and the numerate are not agons engaged in a duel. They are two integral components of a more holistic understanding of human mentality. In its most general sense, this book is an attempt to think through the value of translating between letters and numbers as a newly vital form of humanistic thought.

    If our debates have missed some of the more important ways that computational criticism is embedded in long-standing disciplinary practices, they have also obscured the need for more interdisciplinary conversation. Most of the methods for the computational study of literature are emerging from other disciplines, indeed from other faculties altogether. Our conversations have to date been far too hermetic. Where the humanistic side of digital humanities has been too removed from the norms and practices of the past, the digital side has been too detached from the present of computational research. Different researchers have emphasized different intellectual traditions on which to draw. For James English, Ted Underwood, Hoyt Long, and Richard Jean So, sociology and the social sciences offer an important framework through which to understand the computational turn within literary studies.¹⁶ For researchers like Berenike Herrmann, Gerhard Lauer, and Winfried Menninghaus associated with the International Society for the Empirical Study of Literature, it is the cognitive sciences and their emphasis on experimental method that provide a generative new framework through which to understand literature’s social and aesthetic importance.¹⁷ As you will see, this book is far more indebted to methods emerging from fields like computational linguistics and information science, with their focus on the algorithmic and statistical understanding of language.¹⁸ What is important in all cases is a deeper immersion in the literature and methods that these fields have to offer.

    The reason to do so, however, is not to adopt uncritically another field’s wares (or ultimately be subsumed by them), but rather to better understand the limitations of transferring one discipline’s methodological apparatus onto another, and in the process improve both. More collaboration and cross-talk is needed to better understand not only how these models might be able to address the kinds of questions we care about, but also the problematic assumptions we see in their current applications out in the world. Concepts like machine learning, artificial intelligence, and data science will undoubtedly change for the better as more humanists enter into the conversation.

    Enumerations is the product of many years of such conversations. It is born from listening, but also the work of transferring, and the inspirations and frustrations that accompany inhabiting this space in between. Ultimately it is about moving us away from a polemical mind-set to an analytical one, to continue the work that is already underway of presenting concrete insights into how the study of literary quantity changes our understanding of literature. Each of the chapters attempts to intervene in ongoing debates in literary study, whether it be twentieth-century poetics, the history of the novel, the study of character, or theories of authorship. And they do so through the application of a variety of techniques that will gradually grow in sophistication as the book progresses. The literary elements that the book studies are thus mirrored in the methodological elements used to understand them. As I move from extracting regular expressions (grep) like punctuation in chapter one, to using vector space models and social networks to approximate plot in chapter two, to examining topic models in chapter three, to using machine learning in chapter four, to dependency parsing and co-reference
