Creation of the Images and Database
Creation of the Images
The sheet music in the Historic American Sheet Music Project was scanned on UMAX Mirage II and Mirage IIse 11x17" flatbed scanners. These were connected to Power Macintosh 7300/200 workstations running Mac OS 8 and Adobe Photoshop 4.0. Over the course of two semesters, Duke students working in the Digital Scriptorium scanned over 16,500 images. These master images were created at 150 dpi in 24-bit RGB color and saved in JPEG format. Testing indicated that the 150 dpi color scans provided sufficient resolution for 1 mm characters (about 6 pixels tall at that resolution) to be adequately visible on both existing computer monitors and laser-quality prints.
Each master image was put through a quality control process that checked image quality, pagination, page orientation, amount of skew, cropping, color, and any other problems that arose. The biggest scanning problem encountered was the prevalence of Moiré patterns caused by halftone dots on the page being scanned. These appear most often on the illustrated title pages but are frequently found elsewhere in the pieces as well. Several techniques were devised to deal with this issue: the descreening feature of the image capture software frequently corrected the problem, and in more difficult cases a slight application of Photoshop's blur filter was used.
See Sheet Music Scanning Procedures for an overview of the process the student assistants used in scanning the sheet music.
Scripts to create 72 dpi images and thumbnails automatically from the original 150 dpi scans were developed using the Perl scripting language and ImageMagick, a freely available UNIX graphics package. The conversion consisted of several steps. First, all 16,596 of the 150 dpi images were transferred to the Scriptorium machine (a dual-processor Sun Sparc 20) by FTP and arranged according to the directory structure scheme devised at the beginning of the project. This scheme allows for quick server access and ease of file management by creating a tree-like structure in which each branch may contain no more than 100 subdirectories.
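The project's actual directory scheme is not described here, so the following is only a minimal sketch of one way to keep every branch at or below 100 subdirectories. It assumes each piece has a sequential index and writes that index in base 100, one two-digit component per level. The function name piece_directory, the root path, and the two-level depth are all hypothetical.

```python
from pathlib import PurePosixPath

MAX_ENTRIES = 100  # limit stated in the text: at most 100 subdirectories per branch


def piece_directory(root: str, piece_index: int, depth: int = 2) -> PurePosixPath:
    """Map a zero-based piece index to a nested directory path.

    Hypothetical layout: the index is written in base 100, one two-digit
    component per directory level, so no branch has more than 100 children.
    """
    if piece_index < 0 or piece_index >= MAX_ENTRIES ** depth:
        raise ValueError("piece_index does not fit in a tree of this depth")
    parts = []
    remaining = piece_index
    for _ in range(depth):
        remaining, digit = divmod(remaining, MAX_ENTRIES)
        parts.append(f"{digit:02d}")
    # Most significant component first, so consecutive pieces share a parent.
    return PurePosixPath(root).joinpath(*reversed(parts))
```

With two levels this layout holds up to 10,000 pieces; for example, piece_directory("/sheetmusic", 3041) yields /sheetmusic/30/41.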
During the scanning phase, the 150 dpi images had simply been identified by a unique identifier based on the call number followed by the image number. Next, all the files were renamed from their working names to a regular, easily identifiable naming scheme: the call number of the piece of music, followed by the image number, followed by the size of the image (expressed as "150dpi" in this step). The unique identifier serves as the key that ties the database records and the images together.
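The naming convention can be illustrated with a short sketch. The exact separators, zero-padding, and file extension used by the project are not given, so the pattern below (call number, three-digit image number, size label, ".jpg", joined by periods) is an assumption, as is the sample call number "a0001".

```python
import re

# Hypothetical pattern: <call number>.<3-digit image number>.<size>.jpg
NAME_PATTERN = re.compile(
    r"^(?P<call>.+)\.(?P<image>\d{3})\.(?P<size>[^.]+)\.jpg$"
)


def image_filename(call_number: str, image_number: int, size: str) -> str:
    """Build a file name from call number, image number, and size label."""
    return f"{call_number}.{image_number:03d}.{size}.jpg"


def parse_image_filename(filename: str) -> tuple[str, int, str]:
    """Split a file name back into (call number, image number, size label)."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a recognised image file name: {filename!r}")
    return match.group("call"), int(match.group("image")), match.group("size")
```

Because the call number and image number survive in every derivative name, a 72 dpi file such as image_filename("a0001", 1, "72dpi") can always be traced back to the same database record as its 150 dpi master.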
Finally, by taking advantage of the Scriptorium machine's ability to run multiple processes, and by employing a second machine running Linux, multiple Perl conversion scripts were run day and night, generating 22,680 additional images in approximately a week. These included 16,596 72 dpi images, 3,042 "small" images, and 3,042 thumbnails. It was decided that both a "small" image measuring 300 pixels in width and a thumbnail measuring 100 pixels in width would be produced for each of the illustrated title pages of the sheet music. The small image is embedded within the database record, and the thumbnail serves to maintain context while browsing through a piece.
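The pixel dimensions of the three derivative sizes follow from simple proportional scaling. The sketch below computes them without touching any image data; it is not the project's Perl code, and the function names are hypothetical. It assumes the aspect ratio is preserved and that results are rounded to the nearest pixel.

```python
def scale_to_dpi(width: int, height: int,
                 source_dpi: int = 150, target_dpi: int = 72) -> tuple[int, int]:
    """Pixel size of the same page when resampled from source_dpi to target_dpi."""
    return (round(width * target_dpi / source_dpi),
            round(height * target_dpi / source_dpi))


def scale_to_width(width: int, height: int, target_width: int) -> tuple[int, int]:
    """Pixel size when resized to a fixed width, preserving the aspect ratio."""
    return target_width, round(height * target_width / width)
```

As an illustration, a full 11x17" scan at 150 dpi is 1650 by 2550 pixels. Under these assumptions its 72 dpi version would be 792 by 1224 pixels, its "small" image 300 by 464, and its thumbnail 100 by 155.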
As a page-turning mechanism, wrappers for each piece were created in HTML by a Perl script. The wrappers supply a table of contents listing each page and a method of viewing both 72 and 150 dpi image sizes while maintaining the context and pagination of the piece.
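A wrapper of this kind might be generated along the following lines. This is a sketch only: the project's wrappers were produced by a Perl script whose output is not reproduced here, and the page layout, the function name build_wrapper, and the image file names are assumptions made for illustration.

```python
from html import escape


def build_wrapper(title: str, call_number: str, page_count: int) -> str:
    """Return a simple HTML table of contents for one piece of sheet music.

    Each entry links to the 72 dpi and 150 dpi images of a page.
    File names follow a hypothetical <call>.<page>.<size>.jpg pattern.
    """
    items = []
    for page in range(1, page_count + 1):
        low = f"{call_number}.{page:03d}.72dpi.jpg"
        high = f"{call_number}.{page:03d}.150dpi.jpg"
        items.append(
            f'<li>Page {page}: <a href="{escape(low)}">72 dpi</a> | '
            f'<a href="{escape(high)}">150 dpi</a></li>'
        )
    return "\n".join([
        "<html>",
        f"<head><title>{escape(title)}</title></head>",
        "<body>",
        f"<h1>{escape(title)}</h1>",
        "<ol>",
        *items,
        "</ol>",
        "</body>",
        "</html>",
    ])
```

Listing every page in order, with both sizes side by side, is what lets a reader switch resolution without losing their place in the piece.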
The sheet music database comprises a total of 39,276 individual images and uses 20.48 gigabytes of disk space, including HTML wrappers. Individual 150 dpi images have an average file size of 0.99 megabytes, and the average number of pages/images per piece is 5.46. A 37.8 gigabyte RAID disk array was added to the Digital Scriptorium's Sun Solaris Internet server, and the machine has been upgraded with additional enhancements designed to speed access to its digital resources.
Database Format
The flat-file database that contained the indexing information for the pieces was converted to SGML format, following the Encoded Archival Description (EAD) Version 1.0 DTD, using the Microsoft Word macro language and Perl. Each field in the database was mapped to an EAD element that was made unique by the use of attributes. See Map of Database Fields to EAD Elements, which shows the relationship of each field of the indexing template to a unique EAD element. The full EAD document instance of the 1850-1859 section of the database is also available. See EAD at Duke for more information on EAD.
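The idea of mapping each field to an element distinguished by an attribute can be sketched as follows. The field names, the choice of elements, and the use of a label attribute are illustrative placeholders only; the project's actual mapping is the one documented in Map of Database Fields to EAD Elements, and the original conversion was done with Word macros and Perl rather than Python.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical mapping: database field -> (element name, distinguishing label).
FIELD_TO_ELEMENT = {
    "title": ("unittitle", "title"),
    "composer": ("origination", "composer"),
    "publication_date": ("unitdate", "publication"),
}


def record_to_markup(record: dict[str, str]) -> str:
    """Render one flat-file record as a sequence of tagged elements."""
    lines = []
    for field, (element, label) in FIELD_TO_ELEMENT.items():
        value = record.get(field)
        if not value:
            continue  # skip fields that are empty or absent in this record
        lines.append(
            f"<{element} label={quoteattr(label)}>{escape(value)}</{element}>"
        )
    return "\n".join(lines)
```

Because the attribute value differs for every field, two fields that share an element name would still remain distinguishable, which is what makes field-limited searching possible later on.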
The resulting SGML database is presented in HTML for ordinary web browsers using DynaWeb software from INSO. Because each field is a unique element, searches can be limited to individual fields, which allows highly targeted searching (see Search). Access to the database is through both targeted searching and "canned searches" on the Subject Content, Illustration Type, and Advertising subject fields (see Browse Sheet Music). In addition, a user may perform a targeted or keyword search within the DynaWeb interface.
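The difference between a field-limited search and a keyword search can be shown with a small sketch. It does not reflect how DynaWeb works internally; the records are plain dictionaries, and the field names and sample values are hypothetical.

```python
def field_search(records: list[dict[str, str]], field: str, term: str) -> list[dict[str, str]]:
    """Return records whose given field contains the term (case-insensitive)."""
    needle = term.casefold()
    return [r for r in records if needle in r.get(field, "").casefold()]


def keyword_search(records: list[dict[str, str]], term: str) -> list[dict[str, str]]:
    """Return records in which any field contains the term (case-insensitive)."""
    needle = term.casefold()
    return [
        r for r in records
        if any(needle in value.casefold() for value in r.values())
    ]
```

For example, with one record whose title mentions "rose" and another whose illustration type is "roses", a keyword search for "rose" returns both, while a search limited to the illustration field returns only the second.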