[libraries, technology, search engines] Hamlet.doc versioning & Electronic Organization

Aug 21, 2007 09:25

Hamlet.doc? Literature in a Digital Age
Matthew Kirschenbaum

http://chronicle.com/free/v53/i50/50b00801.htm

August 17, 2007 issue

We may never know if Shakespeare had a sister, but we can be certain he didn't have a hard drive. What if he had? Details of his writing process and his life, currently a mystery, might be pitilessly exposed.

As scholars will tell you, there are no manuscripts of the plays surviving in the Bard's own hand. The text of King Lear, for example, comes to us from two published quartos and the First Folio (1623), with hundreds of lines and thousands of words differing between them. In the so-called "bad quarto" of Hamlet, a certain soliloquy begins: "To be or not to be. Aye, there's the point, /To Die, to sleep, is that all, aye all." The speech is also placed differently, in Act II, Scene 2, rather than its accustomed place in Act III, Scene 1. Today it is typically thought that the bad quarto is a memorial reconstruction of the play by an actor or spectator, but we can't be sure. In any case, the texts are rife with ambiguities. Which versions are right? What was closest to Shakespeare's own original (or final) intentions?

If Shakespeare had had a hard drive, if the plays had been written with a word processor on a computer that had somehow survived, we still might not know anything definitive about Shakespeare's original or final intentions - these are human, not technological, questions - but we might be able to know some rather different things. We might be able to know, for example, the precise date on which he began composing Hamlet, indeed the precise minute and hour, time-stamped to the second. We would be able to know how long he had spent working on it, or at least how long the file containing the play had remained open on his desktop. We would very likely have access to multiple versions and states of the file, and if Shakespeare had "track changes" turned on while he wrote, we would be able to follow the composition of a soliloquy keystroke by keystroke, each revision also date- and time-stamped to the second. We might discover the play had originally been called GreatDane.doc instead of Hamlet.doc. We might also be able to know what else he had been working on that same day, or what Internet content he had browsed the night before (since we'll assume Shakespeare had Web access too). While he was online, he might have updated his blog or tagged some images in his Flickr account, or perhaps edited a Wikipedia entry or two. He might even have spent some time interacting with others by performing with an avatar in Second Life, an online place where all the world is truly a shared virtual stage.

This scenario will strike some readers as naïve, and not only because of the proposition of Shakespeare surfing the Internet. More significantly, the kind of data about electronic documents I have been describing - known as "metadata" in the trade - is by no means unimpeachable. Date and time stamps can be spoofed, and Mac and Windows systems handle them differently. (On a Macintosh, when a file is copied, its original creation date remains intact, whereas on a Windows machine, the creation date becomes the date of the copy.) The simple act of renaming a file will make it look like an entirely new creation, severing its relation to any earlier version saved under a different file name. A computer's internal clock can be wrong, and files can be erased beyond any practical hope of recovery.
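The timestamp metadata in question is easy to inspect from any scripting language. The sketch below, in Python (an arbitrary choice of language for illustration), reads back the stamps an operating system keeps for a file; note the platform-dependent meaning of the "creation" time, exactly the kind of ambiguity described above:

```python
import os
import datetime

def describe_timestamps(path):
    """Report the metadata timestamps the operating system keeps for a file.

    The meaning of st_ctime differs by platform -- creation time on Windows,
    inode-change time on Unix-like systems -- which is exactly the kind of
    ambiguity that makes file metadata less than unimpeachable.
    """
    st = os.stat(path)
    fmt = lambda t: datetime.datetime.fromtimestamp(t).isoformat(
        sep=" ", timespec="seconds")
    info = {
        "modified": fmt(st.st_mtime),         # last write to the contents
        "accessed": fmt(st.st_atime),         # last read (often coarse or disabled)
        "changed/created": fmt(st.st_ctime),  # platform-dependent meaning
    }
    # macOS (and some BSDs) expose a true creation time separately.
    birth = getattr(st, "st_birthtime", None)
    if birth is not None:
        info["born"] = fmt(birth)
    return info

if __name__ == "__main__":
    import tempfile
    with tempfile.NamedTemporaryFile(suffix=".doc", delete=False) as f:
        f.write(b"To be, or not to be")
        name = f.name
    for key, value in describe_timestamps(name).items():
        print(f"{key:>16}: {value}")
    os.remove(name)
```

On Linux the "changed/created" entry reflects the last metadata change, not creation; on Windows it is the creation date; and macOS reports a separate birth time. The same file, copied across systems, can therefore tell three different stories about when it came into being.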

But technical complications notwithstanding, while we know Shakespeare didn't have a hard drive, almost all writers working today do. Today nearly all literature is "born digital" in the sense that at some point in its composition, probably very early, the text is entered with a word processor, saved on a hard drive, and takes its place as part of a computer operating system. Often the text is also sent by e-mail to an editor, along with ancillary correspondence. Editors edit electronically, inserting suggestions and revisions and e-mailing the file back to the author to approve. Publishers use electronic typesetting and layout tools, and only at the very end of this process, almost arbitrarily and incidentally, one might say, is the electronic text of the manuscript (by now the object of countless transmissions and transformations) made into the static material artifact that is a printed book.

This new technological fact about writing is already having an impact, from office work to government and academe to literature and the creative arts. Sending a file as an attachment to an unwitting recipient without having first accepted or rejected track changes is a common workplace gaffe, since the recipient can view every step of the composition process. The Modern Language Association recently published a note in its newsletter informing its readership that the owner of the software on which a Microsoft Word file originates is identified in the document's easily accessible properties window, thereby jeopardizing blind peer review. Libraries have already been faced with the question of how to accession a hard drive as part of a literary estate (Emory University's acquisition last year of Salman Rushdie's papers, which include several of his laptops, is a good example), and an essay in The New York Times Book Review lamented the loss of literary heritage as correspondence among authors, editors, and publishers takes the form of evanescent e-mail or even instant messaging (or cellphone text messages).

To grasp what is at stake, we first have to understand a little about computers themselves. Computers are universal machines, meaning that they are machines designed to imitate other machines. This is the essence of software: Open one program and your computer is a 21st-century typewriter, open another and it is a video-production studio. It's the same physical machine, but a totally different virtual environment. There is therefore no natural state for electronic text on a computer. Often we're attracted to its perceived fluidity and flexibility, but that is an outcome of a particular way of modeling a document, not an inherent property of the medium - as anyone who has ever tried to edit a file for which they lack the appropriate permissions will know: suddenly all that malleable text becomes maddeningly resolute and unyielding.

Likewise, we can edit and change an electronic document without leaving a trace, but we can also arrange to have our every keystroke recorded. We can save the same file over and over again, thereby overwriting any changes and edits, but we can also check documents in and out of a secure repository that will safeguard and maintain all versions and allow users to reconstruct the branching paths between them. (Wikipedia works in much this way, something often overlooked in discussions of its pros and cons; for every entry in Wikipedia, you can access a page history, allowing you to see how it has been edited and including full access to every earlier version, thereby helping a user to determine the stability and reliability of the information.)
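The check-in model described above can be sketched in a few lines. The class below is a toy illustration, not any real repository software: each save writes a new, numbered version file instead of overwriting, so every earlier state remains reconstructible.

```python
import os

class VersionedStore:
    """A toy repository that keeps every saved state of a document
    instead of overwriting it -- a bare-bones version of the
    check-in/check-out model described above. All names here are
    illustrative."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def save(self, name, text):
        """Write a new numbered version file; never overwrite."""
        n = len(self.history(name)) + 1
        path = os.path.join(self.root, f"{name}.v{n:04d}")
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        return n

    def history(self, name):
        """All stored versions of a document, oldest first."""
        prefix = name + ".v"
        return sorted(f for f in os.listdir(self.root)
                      if f.startswith(prefix))

    def load(self, name, version):
        """Recover any earlier state by version number."""
        path = os.path.join(self.root, f"{name}.v{version:04d}")
        with open(path, encoding="utf-8") as f:
            return f.read()
```

Saving "hamlet" twice yields hamlet.v0001 and hamlet.v0002, and load() can recover either state. Real systems - wikis, version-control tools, document repositories - add authorship, diffs, and branching on top of this basic idea.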

Computers can even be programmed to display wear and tear on electronic documents, with frequently used files graphically altered to reflect their more frequent handling, in the same way the pages of a book become worn with use. Reading from a screen will never be the same as reading from the page, but that isn't the point; what we call an electronic document is actually a complex assemblage that is the constructed outcome of a set of assumptions and desires about how we want documents and text to behave in this environment. Often when attempting to make a definitive statement about the way computers work, we are really just commenting on an inherited set of conventions that are subject to change. If we don't like the way our electronic documents work today, in other words, we can decide to make them work otherwise tomorrow.

At the outset of the personal-computer era, authors tended to be drawn to the experimental aspect of the medium, and what emerged were the incunabula of the electronic age. In 1984, Robert Pinsky, the poet, wrote a piece of interactive fiction called Mindwheel, and a few years later, the author Michael Joyce, long fascinated by the prospect of "a story that changes every time you read it," wrote Afternoon, a Story after first helping to design the specialized authoring software on which it runs. Nowadays, however, photographs of an author at work often include a desktop computer or laptop in the same way that medieval portraits of scribes and saints show them in their studies, at their steep-sloped writing desks, with pens, inkwells, and other writing paraphernalia scattered about. Some writers, of course, still compose in longhand, and a few no doubt maintain a Rolodex of secondhand vendors from whom they can procure spare parts for their typewriter.

But by and large there has been a massive shift in the technological foundation of our writing, literary and otherwise; in the particular realm of literature and literary scholarship, this means that a writer working today will not and cannot be studied in the future in the same way as writers of the past, since the basic material evidence of their authorial activity - manuscripts and drafts, working notes, correspondence, journals - is, like all textual production, increasingly migrating to the electronic realm.

The implications here extend beyond scholarship to a need to reformulate our understanding of what becomes part of the public cultural record. If an author donates her laptop to a library, what are the boundaries of the collection? Old e-mail messages, financial records, Web-browser history files? Overwritten or erased data that is still recoverable from the hard drive? Since computers are now ground zero for so many aspects of our daily lives, the boundaries between our creative endeavors and more mundane activities are not nearly as clear as they might once have been in a traditional set of author's "papers." Indeed, what are the boundaries of authorship itself in an era of blogs, wikis, instant messaging, and e-mail? Is an author's blog part of her papers? What about a chat transcript or an instant message stored on a cellphone? What about a character or avatar the author has created for an online game? The question is analogous to Foucault's famous provocation about whether Nietzsche's laundry list ought to be considered part of his complete works, but the difference is not only in the extreme volume and proliferation of data but also in the relentless way in which everything on a computer operating system is indexed, stamped, quantified, and objectified.

With terabyte-scale drives due to start shipping in the next 18 months, it will soon be possible, indeed the norm, to save every version of every file by default because it will take more attention and energy to go to the trouble of locating and deleting it than to simply leave it on a disk whose storage capacity, at least for textual information, is for all practical purposes infinite. Future literary analysis may depend as much on data mining and visualization as on scholarly judgment and critical instinct. "To be or not to be. Aye, there's the point" might be shown to be statistically out of step with hundreds of other megabytes of textual data by dint of computational pattern recognition. More exotically, perhaps, literary editors may need to acquaint themselves with techniques of forensic information recovery, so as to help restore fragments of deleted or overwritten files that could prove vital in reconstructing a manuscript.
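The kind of computational pattern recognition imagined here can be sketched at toy scale. The measure below - cosine similarity over letter frequencies, a deliberately crude stand-in for real stylometric features such as word n-grams - simply quantifies how closely one version of a passage matches another:

```python
from collections import Counter
import math

def letter_profile(text):
    """Normalized letter-frequency vector for a text -- a crude
    stylometric fingerprint, used here only to illustrate the idea
    of measuring whether a passage is 'statistically out of step'."""
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values()) or 1
    return {c: n / total for c, n in counts.items()}

def cosine(p, q):
    """Cosine similarity between two frequency profiles (1.0 = identical)."""
    keys = set(p) | set(q)
    dot = sum(p.get(k, 0.0) * q.get(k, 0.0) for k in keys)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

folio = "To be, or not to be: that is the question"
bad_quarto = "To be or not to be. Aye, there's the point"
print(f"profile similarity: "
      f"{cosine(letter_profile(folio), letter_profile(bad_quarto)):.3f}")
```

Run against megabytes of an author's other drafts rather than a single line, and with richer features, scores like this are what would let software flag a passage as anomalous and worth a scholar's attention.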

The issues raised here are not merely speculative. In early 2006, I spent a week at the Harry Ransom Humanities Research Center at the University of Texas at Austin, where the Michael Joyce papers are in the process of being cataloged. The physical part of the Joyce collection resides in acid-free manila folders in turn housed within Hollinger boxes, some 50 of them, which I was able to request by exchanging handwritten paper slips with the collection staff. The first accession of virtual materials, however, has been lifted from the almost 400 diskettes that make up their original storage media and uploaded to an electronic repository system known as DSpace. They are online but can be accessed only from a dedicated laptop located in the center's reading room. To actually work with the files, I had to download them to the desktop of the machine, where I used what means and know-how I could to get the cranky old binaries to execute on the up-to-date operating system. Sometimes I was unsuccessful. I suggested the addition of a hex editor, emulators, and other forensic tools to the utilities available on the computer. Since DSpace maintains the integrity of a master copy of every file, I could do what I pleased with the derivative I downloaded to my local desktop - hack at it, tweak it, break it. (This is not covered in the instructional video all new users of the collections are required to watch before they are admitted to the reading room.)
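The forensic tools mentioned here start from something quite simple. A hex editor's core view - byte offsets, raw hexadecimal, and printable ASCII - can be sketched in a few lines of Python, and that view alone is often enough to begin identifying the format of a cranky old binary:

```python
def hex_dump(data, width=16):
    """Render bytes the way a hex editor does: offset, hex values,
    and a printable-ASCII column. A first step in inspecting a
    binary file whose format is undocumented."""
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexes = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{off:08x}  {hexes:<{width * 3}} {text}")
    return "\n".join(lines)

# The sample bytes are invented for illustration.
print(hex_dump(b"GreatDane.doc\x00\x00To be, or not to be"))
```

Embedded strings, magic numbers at offset zero, and runs of padding bytes all show up immediately in such a dump, which is why a hex editor belongs in an archival reading room's toolkit.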

My experience at the Ransom Center testifies to the extent to which our future understanding of our present (and future) literary imagination will depend on effective digital preservation. The challenges here are indeed momentous. Famously, in 1986, the British Broadcasting Corporation produced a digital edition of the Domesday Book on laser disc. The format was rendered unusable in just a few years, while the original has survived in legible form since the 11th century. What often goes unacknowledged in that story, however, is that the original has survived not only because of the inherent physical properties of ink and parchment and paper - it has also survived because we evolved the social practices necessary to recognize its significance and keep it safe, in a climate-controlled, limited-access vault.

There's nothing inherent in the technology that makes e-mail or other forms of electronic writing especially susceptible to vanishing into the electronic ether. On the contrary, as Oliver North and other malefactors have learned, the stuff is remarkably pesky and pernicious. A single e-mail message may leave traces of itself on a dozen different servers as it makes its way across the Internet, a potential for proliferation that is further exacerbated by backup services at each site. While I don't mean to minimize the very real challenges in the realm of digital preservation, in my view those challenges are best understood as at least as much social as technological. This brings me back to the fundamental nature of computers, which is that they really have no fundamental nature; to the extent that our current electronic records are fragile, unreliable, potentially misleading, and so forth, this is at least partly the result of implicit decisions made in how those records are constructed.

If we are worried that some modern-day Shakespeare isn't keeping early electronic drafts of her work, then we should build the capability to do so into the tools she is now working with. If we are worried that popular file formats are proprietary and hopelessly corporatized, then we should educate people about the benefits of standards and open source. This point is especially crucial for individual writers and authors, since effective preservation begins on the end-user's desktop. If the widespread perception is that electronic documents and records have no hope of surviving for posterity, then that will become a self-fulfilling prophecy as we all, individually and collectively, fail to take the steps necessary to ensure that they do survive.

The wholesale migration of literature to a born-digital state places our collective literary and cultural heritage at real risk. But for every problem that electronic documents create - problems for preservation, problems for access, problems for cataloging and classification and discovery and delivery - there are equal, and potentially enormous, opportunities. What if we could use machine-learning algorithms to sift through vast textual archives and draw our attention to a portion of a manuscript manifesting an especially rich and unusual pattern of activity, the multiple layers of revision captured in different versions of the file creating a three-dimensional portrait of the writing process? What if these revisions could in turn be correlated with the content of a Web site that someone in the author's MySpace network had blogged?

Literary scholars are going to need to play a role in decisions about what kind of data survive and in what form, much as bibliographers and editors have long been advocates in traditional library settings, where they have opposed policies that tamper with bindings, dust jackets, and other important kinds of material evidence. To this end, the Electronic Literature Organization, based at the Maryland Institute for Technology in the Humanities, is beginning work on a preservation standard known as X-Lit, where the "X-" prefix serves to mark a tripartite relationship among electronic literature's risk of extinction or obsolescence, the experimental or extreme nature of the material, and the family of Extensible Markup Language technologies that are the technical underpinning of the project. While our focus is on avant-garde literary productions, such literature has essentially been a test bed for a future in which an increasing proportion of documents will be born digital and will take fuller advantage of networked, digital environments. We may no longer have the equivalent of Shakespeare's hard drive, but we do know that we wish we did, and it is therefore not too late - or too early - to begin taking steps to make sure we save the born-digital records of the literature of today.

Will The Response Of The Library Profession To The Internet Be Self-Immolation?
by Martha M. Yee, with a great deal of help from Michael Gorman

http://www.slc.bc.ca/response.htm

(August?) 2007

There are two kinds of expertise that constitute the sole basis for our standing as a profession. The first is our expertise in imparting literacy to new generations, something we share with the teaching profession. The other is specific to our profession: human intervention for the organization of information, commonly known as cataloging. The greater goals of these kinds of expertise are an educated citizenry, maintenance of the cultural record for future generations, and support of research and scholarship for the greater good of society. If we cease to practice either of these kinds of expertise, we will lose the right to call ourselves a profession.

At the dawn of the modern age of our profession in the 19th century, heads of libraries were involved in cataloging (Antonio Panizzi and Charles Ammi Cutter among them). When the Library of Congress began to distribute catalog cards to libraries in 1901, fewer and fewer librarians learned to catalog. Now most LIS schools teach, at best, an introduction to information organization course in which students talk about such matters as how to organize supermarkets. That is the extent of the exposure of most new librarians to the principles of cataloging. Because so few librarians learn about or practice information organization any more, few librarians are aware of the danger that currently looms over the profession as a whole because influential people at the Library of Congress and our great research libraries want to do away with providing standard catalog records for trade publications to the nation's libraries. All librarians, not just catalogers, should take a look at the Calhoun report (Calhoun, Karen. The Changing Nature of the Catalog and Its Integration with Other Discovery Tools, http://www.loc.gov/catdir/calhoun-report-final.pdf) and follow the progress of the Working Group on the Future of Bibliographic Control (www.loc.gov/bibliographic-future/). There you will find the argument that we should cede our information organization responsibilities to the publishing industry and other content providers. All this because some research studies show that undergraduates prefer to use Amazon.com and Google rather than libraries and their catalogs.

These library leaders have forgotten, or never knew, the fact that expertise in organization of information is at the core of the profession of librarianship. Because of their blindness to the nature of our profession, we are now in danger of losing not just standardized cataloging records and the Library of Congress Subject Headings, but the profession itself.

The excuse used, the preference on the part of undergraduates for quick answers, is nothing new. Undergraduates have always tended to over-use ready reference sources until they are taught by both librarians and professors how to do effective research and critical thinking. What has changed, apparently, is the willingness of these library administrators to shoulder the responsibility of teaching information literacy, research skills and critical thinking skills. I haven't heard anyone in the teaching profession argue yet that we should let recalcitrant elementary school students decide for themselves not to learn to read or do math, but perhaps that is next.

The implications in the Calhoun report that Google and Amazon.com are comparable to a library catalog and that libraries are in competition with Google and Amazon.com are dangerous falsehoods. Google and Amazon.com are commercial entities. Their goal is not an educated citizenry, or maintenance of the cultural record for future generations, or support of research and scholarship for the greater good of society. Their goal is instead to get as much money as possible out of our pockets and into theirs, and to spend as little as possible on labor while making as much as possible in profit. Someday it is conceivable that their goal could evolve into that of quelling social unrest by limiting access to certain kinds of information.

Google and Amazon.com limit human intervention for information organization as much as possible in order to maximize profits. Computers are dumb machines. They cannot reason or make connections that a 2-year-old could make. The only logic available to a computer is based on either word counting or counting the number of times users gain access to a particular URL, the bases for their allegedly sophisticated search and display algorithms. A computer cannot discover broader and narrower term relationships, part-whole relationships, work-edition relationships, variant term or name relationships (the synonym or variant name or title problem), or the homonym problem in which the same string of letters means different concepts or refers to different authors or different works. In other words, a computer, by itself, cannot carry out the functions of a catalog.

I used Amazon.com to look for a novel by Fannie Hurst called Lummox. It was listed as being in print and for sale for about $5.00. I ordered it, but when it arrived several weeks later it turned out to be a play adapted from Hurst's novel by someone else; none of this appeared in the description.

Thomas Mann, the great reference librarian, has written a wonderful book published by Oxford University Press that introduces scholars and researchers to LCSH and the LC classification so that they can do more effective and efficient research in libraries. He tells the story of searching for his book on Amazon.com and being told "Readers interested in this book were also interested in Thus Spake Zarathustra and Death in Venice."

When you search Google using Twain and Sawyer, you get completely different results from a search using Clemens and adventures of Tom. The displays do not differentiate among the work itself, Adventures of Tom Sawyer, works about it, and works related to it.

When you search Google for power, Google does not ask you if you are interested in electrical power or in political power. When you search Google for cancer, you get 224 million hits. Even Google seems to realize that that is less than helpful; at the top of the screen it suggests that you refine your results by choosing Treatment, Symptoms, Tests, diagnosis, etc. When I investigated to see where refinement of results came from, it turned out that Google had asked for unpaid volunteers to break down large result sets such as this one.

It has become fashionable to criticize catalogs for not providing users with the evaluative information they desire, à la Amazon.com. Those who criticize seem unaware that catalogs already provide evaluative information, in that the presence of a work in the collection of a major research library implies (with some caveats) that that work was deemed of scholarly value. Catalogs can also help users identify the major authors in a field; if a user does a subject or classification search and notices that half the books listed under a particular subject or in a particular discipline are by the same author, that is a good clue that that author may be a major author in that field. All of this happens only when humans intervene to organize information; it doesn't happen in Amazon.com or Google.

I once went to a talk by a colleague who was working in the business world on an information portal. He indicated that the project had begun as an automatic indexing project with relevance ranking, but that the people paying for the work were so dissatisfied with the results that the project had morphed into a thesaurus development project employing human indexers. Is this a vision of the future? Information organization only for those who pay for it and Google for the rest, instead of information organization for all as a social good paid for with tax dollars?

It is a fact universally acknowledged that librarianship is a woman-dominated profession. As such, ours is a deferential culture that avoids conflict and encourages humility, otherwise known as low self-worth. After all, what we do is perceived by society at large as women's work, that is, work that anyone can do and that does not require any particular expertise (see Roma Harris. Librarianship: The Erosion of a Woman's Profession. 1992). The fact that Google and Amazon.com expect unpaid volunteers to do the work we do is evidence of this. Jeffrey Toobin's article on Google in The New Yorker (Feb. 5, 2007) casually and uncritically cedes to Google its claim to be the world's expert in information organization and is striking evidence of the ignorance of non-librarians about our work. Is it too much to ask for our colleagues in the profession, at least, to understand and acknowledge the value of human intervention for information organization, expensive though it is? Surely the richest country in the world can afford to pay for the human labor required to keep its cultural record in good order for future generations. The cost is peanuts compared to that of a missile defense system, and it would provide a much more effective defense for our way of life.

Many members of our profession, including catalogers, believe that information seekers prefer keyword access and that, for that reason, Amazon.com and Google are better designed than library catalogs. The reason catalog users seem to prefer keyword access is that system designers make keyword access the default search on the initial screen of nearly every OPAC in existence. It should be no surprise that transaction log studies then show that users do more keyword searches. The entities users seek when doing a catalog search (works, authors, and subjects) are actually much better represented by headings than by keywords. Keywords do not link synonyms (hypnosis vs. hypnotism) or variant names (Mark Twain vs. Samuel Clemens); keywords do not differentiate homonyms (electrical power vs. political power) or two different people of the same name (Bush, George, 1924- vs. Bush, George W. (George Walker), 1946-); keywords do not precoordinate complex concepts to indicate their relationships (e.g., Women in television broadcasting), and keywords do not suggest broader, narrower or related terms. However, *browse* searches with heading displays, which do all these things, are buried by system designers on advanced search screens, and put into indexes in which users are required to know the order of terms in a particular heading in order to find what they seek. The point I'm making here is that another major threat to our profession is posed by system designers who don't understand catalog records or catalog users. For the first time this year, our Voyager software has finally allowed us to provide users with a keyword in heading search of subject headings and cross references which responds with a display of matching headings and cross references, not an immediate display of bibliographic records. You can try it out at http://cinema.library.ucla.edu/. 
Try a topic/genre/form search on women or a topic/genre/form search on Poland, to see how useful it can be to let users see headings and cross references in response to a keyword search. When the same keyword in heading searching is applied to headings that identify works, users can search on both the author's name and the title and retrieve a sought work even when using a variant of the title (an ability denied to them in most current systems). Try a preexisting works search on Shakespeare in our file to see what I am talking about.
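The mechanics of a keyword-in-heading search can be sketched in a few lines. The authority entries below are invented for illustration - they are not actual LCSH or name-authority records - but they show how human-made cross references let a variant term lead the searcher to the authorized heading:

```python
# A toy authority file: each authorized heading lists the variant
# terms ("see" references) that should lead a searcher to it.
AUTHORITY = {
    "Twain, Mark, 1835-1910": ["Clemens, Samuel Langhorne"],
    "Hypnotism": ["Hypnosis", "Mesmerism"],
    "Power (Social sciences)": ["Political power"],
    "Electric power": ["Electrical power"],
}

def keyword_in_heading(query):
    """Return authorized headings whose heading or cross references
    contain the query keyword -- a sketch of the 'keyword in heading'
    search described above, which answers with headings rather than
    with an immediate dump of bibliographic records."""
    q = query.lower()
    hits = []
    for heading, variants in AUTHORITY.items():
        terms = [heading] + variants
        if any(q in term.lower() for term in terms):
            hits.append((heading, variants))
    return hits

for heading, refs in keyword_in_heading("clemens"):
    print(heading, "<- see references:", refs)
```

A search on the variant "Clemens" thus leads the user to the authorized heading "Twain, Mark, 1835-1910", with its cross references displayed, rather than to a flat list of bibliographic records; a search on "power" surfaces both homonymous headings so the user can choose between them.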

Close reading of catalog use research shows that users' searches almost always match LCSH headings, as long as the system provides access to the LCSH cross reference structure, as long as the system doesn't require users to know entry terms, and as long as the user knows how to type and spell (see: Yee, Martha M. and Sara Shatford Layne. Improving Online Public Access Catalogs. Chicago: American Library Association, 1998. pp. 133-134). The many people who say otherwise in the literature participate in the widespread anti-intellectualism characteristic of our society, since they don't critically read the research in their own field. Problems with typing and spelling are by far the most common cause of search failure; sadly, at a time when spelling is more important than ever before for success in keyword searching over the Internet, it seems to be becoming a lost art. Typing is a big problem for older library users who grew up when typing was taught only to those intending to be secretaries.

To sum up, the threats to our profession are not from the Internet per se, which is just another tool we can use to do our jobs better, if we use it sensibly. The real threats are posed by the large number of our fellow librarians, including prominent leaders in the profession, who do not grasp the nature of our profession and the fact that human intervention for information organization is at its core; the low self-image those librarians have; and the failure of online catalog designers to learn about the nature of catalog records and the nature of catalog users so as to design systems that allow users to search for the entities they seek (works, authors, and subjects), which are represented in catalogs by headings, not by keywords.

Even if you disagree with, do not understand, or are not convinced by these arguments about the value of human intervention for information organization as currently practiced by the last of the catalogers in our profession, think about the larger implications of leaving information organization in the hands of the commercial interests that control content in our society. Up until now, libraries have played the role of intermediary between commercial interests and society in the provision of information as a social good and as part of the intellectual commons; we have worked hard to ensure that people have access to the information they need regardless of their socio-economic level, because we recognize that democracy does not work when the electorate is unable to determine the facts or to hear the arguments on both sides of an issue, and because we recognize that research and scholarship that advance our society are not carried out only by the wealthy who can afford to purchase all of the materials they need to do research. Leaving information organization in the hands of commercial interests such as Google and Amazon.com would be the first step in the process of removing the library and the library profession from the information provision chain altogether. Publishers already have the ability to sell information directly to the consumer on a pay-per-view basis. If we move toward a society in which that is the only way users can get information, we will have a society that replicates in the information sphere our current huge economic gap between haves and have-nots, and that places all the power to control the availability of information in the hands of entities that are completely profit-driven and have no incentive to serve the greater good of society as a whole. Do we really want to follow our leaders down this path?

technology, libraries, 2007august, search engines
