Content Management Services, Metadata Cataloging, Taxonomy Design

Case Study
Media & Entertainment: The Harvard Crimson

Business Challenge:
Convert the Harvard Crimson newspaper archive to a fully searchable, tagged-text format for use on the Internet.

Business Team:
The Harvard Crimson, the nation's second-oldest daily college newspaper, has a distinguished history that traces its birth to the first issue of "The Magenta," published in 1873. The newspaper changed its name to "The Crimson" in 1875 to reflect the new color of the college. Some of America's greatest journalists and statesmen have served as editors of the newspaper; the roster of Crimson editors includes two future U.S. presidents, John F. Kennedy and Franklin D. Roosevelt.

Electronic Scriptorium, Ltd. solves complex information management and data conversion problems for corporations, government agencies, libraries, museums and other institutions. In a business association unique to Scriptorium, cloistered monks underpin a highly educated workforce that meets exacting standards of quality and accuracy. Electronic Scriptorium provides expertise in archives conversion, finding aids automation, bibliographic services, image cataloging, document conversion, XML/HTML encoding, offshore data conversion and more.


Technical Approach:
One hundred and twenty-eight years after its founding, the Crimson continues to flourish and remains the only daily newspaper in Cambridge. Although the Crimson covers local Harvard and Cambridge topics, it is also known for its editorial and political coverage. The Crimson's fully searchable text archive gives researchers and alumni access to over 100 years of reporting and a rare glimpse into history as viewed from the Harvard campus. To provide such extensive search and retrieval capabilities, each individual story was converted into a full-text file. The resulting online archive ensures that all names, places, dates and text can be retrieved accurately and consistently. The Crimson wanted to give its readers access to newspapers by specific date as well as by specific reporter. Finally, it was important that the archive be displayed in an eye-pleasing format consistent with the original newspaper layout.
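
As an illustration only, retrieval by date and by reporter could be modeled as two lookups over story records. The sketch below is not the Crimson's actual archive design: the Story fields, the build_indexes helper and the sample records are all hypothetical, chosen just to show the idea.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Story:
    """One converted story. Every field name here is hypothetical."""
    headline: str
    byline: str
    published: date
    text: str


def build_indexes(stories):
    """Group stories by publication date and by case-folded byline."""
    by_date = defaultdict(list)
    by_reporter = defaultdict(list)
    for story in stories:
        by_date[story.published].append(story)
        by_reporter[story.byline.casefold()].append(story)
    return by_date, by_reporter


# Placeholder records, not real Crimson stories.
stories = [
    Story("Sample headline A", "Reporter One", date(1950, 1, 5), "Body text A"),
    Story("Sample headline B", "Reporter Two", date(1950, 1, 5), "Body text B"),
    Story("Sample headline C", "Reporter One", date(1950, 1, 6), "Body text C"),
]
by_date, by_reporter = build_indexes(stories)
```

Case-folding the byline is one simple way to make reporter lookups tolerant of capitalization differences; a production archive would likely need more careful name normalization.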

The range of page layouts, story formats, typefaces, and organizational styles used over the life of a newspaper more than 125 years old is enormous. Changes in technology and editorial practices generally improve the newspaper reader's experience but generate significant interpretation challenges when converting the information into a standard archive format. For example, early newspapers usually did not contain photographs, while modern newspapers without photographs are nearly unheard of. Also, today's newspaper stories generally lead the reader through a series of page turns designed to ensure exposure to advertising sponsors. Older newspaper formats did not use this strategy. Developing a data structure with the flexibility to bridge the gap between content from different centuries was a formidable challenge. Accuracy was accorded paramount importance - no amount of technical ingenuity would ever compensate for incorrectly converted information that prevented access to a desired story. Because accuracy was so important, automated conversion processes such as optical character recognition (OCR) were quickly excluded. While appropriate for some applications, OCR techniques could not reach the 99.99% accuracy required by the Crimson.
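
To make the 99.99% figure concrete, the following sketch works out the error budget it implies. It assumes accuracy is measured per character, which the project description does not state; the function names are illustrative. Exact fractions are used so that floating-point rounding does not distort the threshold.

```python
from fractions import Fraction

# Target taken from the text; per-character measurement is an assumption.
TARGET = Fraction("0.9999")


def max_errors(total_chars, target=TARGET):
    """Largest number of wrong characters that still meets the target."""
    return int(total_chars * (1 - target))


def meets_target(errors, total_chars, target=TARGET):
    """True if the observed character accuracy is at or above the target."""
    return 1 - Fraction(errors, total_chars) >= target


# Under this assumption, 99.99% allows one wrong character per 10,000 keyed.
budget = max_errors(10_000)
```

Under this reading, a 25,000-character sample could contain at most two wrong characters and still pass.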

Electronic Scriptorium's extensive experience in multiple cataloging disciplines and database methodologies was invaluable in developing processing specifications that addressed the Crimson's broad range of needs. Electronic Scriptorium developed a proprietary text tagging system that met all of the Crimson's webpage display criteria and proved flexible enough to handle the changes required by the Crimson's evolution. The specification development phase included rigorous quality reviews to ensure excellent data quality. Because the conversion project encompassed 18,500 papers and more than 131,000 pages, it was impossible to anticipate every potential layout or format modification required by the Crimson. Periodic meetings between Electronic Scriptorium's management staff and the Crimson promoted open discussion of issues encountered during processing. Extensive use of electronic mail, along with a custom project status webpage maintained for the Crimson, allowed issues to be resolved quickly and successfully.


Implementation Methodology:
Achieving a 99.99% accuracy rate with such a diverse volume of material can only be accomplished with a process known as "double keying." In a double-keyed project, each document is entered twice. The two resulting documents are then electronically compared for errors. While some types of errors can still slip through under certain circumstances, double keying is the most reliable method for ensuring a virtually error-free database. Due to the sheer volume of data (over 930 megabytes of data were delivered during the course of the project), economics dictated that Electronic Scriptorium rely on trusted partners in India to create the initial database.
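
The electronic comparison step can be sketched in a few lines. This is a minimal illustration using Python's standard difflib module, not the tooling actually used on the project; the function name and the sample strings are hypothetical.

```python
from difflib import SequenceMatcher


def keying_discrepancies(first_pass, second_pass):
    """List (position, first_text, second_text) wherever two keyings differ.

    Each discrepancy would be sent to a reviewer who checks it against
    the source page. Positions are offsets into first_pass.
    """
    matcher = SequenceMatcher(None, first_pass, second_pass, autojunk=False)
    return [
        (i1, first_pass[i1:i2], second_pass[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Two hypothetical keyings of the same line; the second has a mistyped digit.
first = "published in 1873"
second = "published in 1878"
diffs = keying_discrepancies(first, second)  # [(16, "3", "8")]
```

A comparison of this kind flags only places where the two keyings disagree. If both entries contain the same mistake, nothing is flagged, which is one way errors can still survive double keying.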

Electronic Scriptorium and its Indian partners quickly implemented the Crimson processing specifications with good results. After a brief test phase, specifications were revised to produce truly excellent results. Then, using a carefully orchestrated process, Electronic Scriptorium systematically ramped up production to the level of output necessary to complete the project in just over 12 months.

Although our Indian partners provided an excellent starting point of double-keyed data, responsibility for overall consistency and quality review fell to Electronic Scriptorium's US-based staff. In addition to quality control and specification management, the US staff addressed the many anomalies encountered during processing, such as unreadable text due to discolored or torn paper, text obscured by improper binding and printing, and logic problems introduced through errors in the layout of a paper.

Electronic Scriptorium's team used its extensive problem-solving experience and broad general knowledge of text-based projects to create a coherent text archive that was delivered in "web-ready" format.


Summary:
Electronic Scriptorium's conversion allowed the Crimson to add a valuable, fully searchable archive to its website (www.TheCrimson.com). Thanks to the foresight of its student staff, the Crimson's unique literary contribution is no longer in danger of being lost to the ravages of decaying newsprint and misplaced issues. The project demonstrated the level of quality and complexity that can be achieved through a collaborative effort between a newspaper staff focused on making its daily print deadline but concerned with preserving its heritage and an innovative, quality-conscious firm. The historic value of the project is not to be understated. Stories with bylines like "Kennedy" and "Roosevelt" capture our imagination because their authors became the leaders of their day. Without a doubt, the Crimson newspaper today contains names that will someday be just as recognizable. Electronic Scriptorium is proud to have played a key role in preserving an important part of the nation's newspaper heritage.


Copyright © 1999-2008 Electronic Scriptorium, Ltd. All Rights Reserved