Open Repositories 2009 May 26, 2009

Posted by David Kennedy in: Trident

I attended the Open Repositories 2009 conference this past week. Overall it was a very informative conference on the open source repository platforms (Fedora, dSpace, ePrints, Zentity), current projects and developments using these platforms, and future directions of repositories. Below are some relevant notes from the conference.

Repository Workflow

There were a few presentations that discussed how institutions were managing their repositories, in particular, repositories built with Fedora.  Two of these, eSciDoc and Hydra, had some very useful nuggets.

Hydra is a grant-funded collaboration between Hull University, University of Virginia and Stanford University to build a repository management toolkit that can manage their three very different workflows, and that is extensible enough to manage heterogeneous workflows across the Fedora community. There are a few practices or ideas that we might want to adopt from this project, as well as some possible points of convergence with Trident.

eSciDoc is an eResearch environment built on top of Fedora.

Cloud Storage

Sandy Payette and Michele Kimpton gave an update on the emerging DuraCloud services. They are currently in development, and will be tested with a few beta sites before general release. The DuraCloud services will definitely be worth Duke looking into; however, we will probably need to wait for more Akubra development before these services can be properly integrated into Fedora. For Duke’s repository, cloud storage should be evaluated for storage of preservation masters. Also on the topic of cloud storage, David Tarrant gave an update from ePrints, as well as a reminder: “Clouds do blow away.”

Smart storage underpinning repositories

JPEG2000

djatoka continues to impress me. It takes the math out of jpeg2000. Ryan Chute discussed how this can be integrated into Fedora, and the service definitions involved in doing so. He also showed some of the image viewers that have been built using djatoka. With djatoka, the primary use of jpeg2000 is as a presentation format. The integration with Fedora relies on a separate jpeg2000 “caching” server, living outside of Fedora, that serves up the jpeg2000 services. In this model, it may be that Fedora never even needs to hold a jpeg2000 file. I need a better understanding of how the caching server gets populated, and will be investigating this in the coming months.
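To give a feel for what "serving up jpeg2000 services" looks like from the client side, here is a minimal sketch of building a djatoka region request. It is not taken from the presentation. The key names (svc_id, svc.level, svc.region and so on) reflect my recollection of djatoka's OpenURL-style resolver and should be checked against the djatoka documentation; the resolver address, the image identifier, and the function name djatoka_region_url are all hypothetical.

```python
from urllib.parse import urlencode


def djatoka_region_url(resolver, image_id, level=3, region=None, fmt="image/jpeg"):
    """Build a getRegion request URL for a djatoka-style resolver.

    resolver and image_id are placeholders supplied by the caller.
    The parameter keys are assumptions based on djatoka's OpenURL
    conventions, not something confirmed in this post.
    """
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_id", image_id),
        ("svc_id", "info:lanl-repo/svc/getRegion"),
        ("svc_val_fmt", "info:ofi/fmt:kev:mtx:jpeg2000"),
        ("svc.format", fmt),          # format of the derivative sent back
        ("svc.level", str(level)),    # resolution level to extract
    ]
    if region is not None:
        # Assumed order: top offset, left offset, height, width.
        y, x, h, w = region
        params.append(("svc.region", f"{y},{x},{h},{w}"))
    return f"{resolver}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical resolver and image identifier, for illustration only.
    print(djatoka_region_url(
        "http://example.org/adore-djatoka/resolver",
        "info:example/image1",
        level=2,
        region=(0, 0, 256, 256),
    ))
```

The point of the sketch is that the viewer only ever asks for a region at a resolution level and gets back an ordinary image format, which is why Fedora itself may never need to hold the jpeg2000 file.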

Islandora

UPEI has packaged an integration of Drupal and Fedora. There is a mixed bag between what Drupal content is stored in Fedora and what content gets stored in Drupal. As new types of content are stored in Drupal, new content models need to be created in Fedora to support them. The presenter indicated that work still needs to be done on getting updates in Fedora reflected in Drupal and vice versa. Without more than a presentation to base my opinions on, this seems like an extensible model, but one that also requires continued hand-tuning and management.

Complex object packaging

METS and OAI-ORE, or should it be METS vs OAI-ORE? There has been a lot more discussion and work around OAI-ORE in the last year. It is a much more flexible packaging model for complex objects than METS is. And it has been the medium on which SWORD and other similar models are based. With flexibility, though, comes programmatic complexity. Our repository model is based on a METS-centric view of digital repositories. We did, though, generalize item structure in such a way that we could conceivably change the underlying structure from METS to something like ORE. More to come on this.

Cool stuff

@mire showed off some authoring tools integrated into Microsoft Office as add-ins. I’m told these won’t be released for at least six months, but they showed some real possibilities and the value that repositories can add for authors. The authoring tools decomposed PowerPoint presentations and Word documents and stored them in the repository, and then allowed for searching of the repository (from within PowerPoint and Word) to include slides, images, text, etc. from the repository in the working document.

Peter Sefton showed off his Fascinator. It features click-to-create portals that can then be customized fairly easily. He also talked about work he is currently doing on a “desktop sucker upper,” which extracts data from a laptop to store in a repository.

Programming notes

Fedora

FIZ Karlsruhe has done extensive performance testing and tuning of Fedora. They tested with data sets of up to 40 million objects. In terms of scaling, performance was not affected by the size of the repository. They were also able to increase performance by tuning the database, as well as by separating the database from the repository. They found that I/O was the limiting factor in all cases.

Fedora 3.2 highlights – beginnings of Akubra, SWORD integration, will be switching to a new development environment (Maven, OSGi/Spring DM)

dSpace

SWORD support, Shibboleth supported out of the box, new content model in dSpace 2.0 (based on entities and relationships)

“Wow! This job sure keeps us hopping!” May 13, 2009

Posted by Rich in: metadata, 2 comments

There are many steps involved in creating and publishing a new digital collection — it’s truly a team effort that requires a lot of hard work and coordination of efforts from people across the libraries, with many different skill sets, working in many different departments, in buildings across Duke’s campus.  People who aren’t familiar with the process often think that digitizing the materials is the most time-consuming part, and that once that’s done, the collection is ready to go.  The truth, though, is that our colleagues in the Digital Production Center, who do the digitizing, are so fast and wildly productive on their scanners and cameras that the rest of us are constantly trying to catch up with them.

One of the most time-consuming parts of the digital collections process, and the part that people often don’t think about, is creating the metadata.   Metadata is data about the materials we’ve digitized, and as part of the metadata process, we have to decide how to arrange the items in the digital collection, how to describe them, what information we need to collect about them, what kind of terminology to use so people can find them, and all sorts of other things.  We have to decide how we want users to be able to find and interact with the digital objects, and what metadata is necessary to make that possible.

To make things even trickier, not only is metadata perhaps the most time-consuming part of the process, but up until this point we’ve had only a small number of staff working on it.  Part of the problem has been that we haven’t had a good metadata creation/management tool, so the workflows and procedures we’ve concocted to get around that have been so unwieldy that it just didn’t make sense to throw tons of staff at them.  But now that our new metadata editor Trident is getting closer and closer to becoming a reality, we can finally think about bringing nearly all our catalogers and archivists into the metadata process, which has been our goal all along.  In early May, we brought two trainers in to teach a two-day metadata course for about 20 of our catalogers, archivists, and other staff to prepare them to do this work.  We’ll soon be putting a subset of that group to work on the huge Broadsides project we’ve been talking about elsewhere on this blog, and then once we really get going, we’ll bring even more of them into this project and others.

Our goal is that digital collections work will become just one of the many things our catalogers and archivists do as a regular part of their jobs.  These folks are already experts at describing, arranging, and providing access to the library’s collections, so now they’ll be applying that expertise to new types of materials.  Even if they only work on digital collections as a small part of their jobs, bringing all these new staff members into the process will allow us to create metadata — and therefore create digital collections — much faster than we ever have before.  And that means more images, more text, more audio, more video … more ideas and discoveries will be possible for users around the world than ever before.  The best is yet to come ….

Answering the important questions. May 5, 2009

Posted by nh48 in: Assessment, 1 comment so far

Recently we implemented Google Analytics to track usage of our digital collections.  Sean has already contributed several great posts about our digital collections use statistics, but one thing I find particularly interesting (and amusing) is that Google Analytics allows us to see the types of keywords our users are entering into Google, Yahoo, and other search engines, and where those keywords lead them in our digital collections.

Not surprisingly, some search queries are common and reveal the subject strengths of our digital collections.  For example, the top three queries that bring users to our collections are “sheet music,” “ad access,” and “history of advertising.”

After scanning through thousands of these search queries, several distinct categories emerge: the known-item query (an exact title in quotes), the URL as query (e.g.  http://library.duke.edu/digitalcollections/adaccess/), and the format query (e.g. “diaries” or “manuscripts”), among others.  The most entertaining category, however, is the query issued in the form of a question.
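For anyone who would rather not scan thousands of queries by eye, the categories above can be roughly approximated with a few heuristics. The following is only a sketch, not how we actually did the analysis: the function name classify_query, the list of question words, and the tiny set of format terms (taken from the two examples above) are all assumptions for illustration.

```python
import re

# Format terms taken from the examples in the post; a real list would be longer.
FORMAT_TERMS = {"diaries", "manuscripts"}

# Hypothetical list of words that tend to open a question.
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how",
                  "is", "are", "can", "does", "do", "did"}


def classify_query(query):
    """Assign a search-engine query to one of the rough categories."""
    q = query.strip()
    lower = q.lower()
    # URL as query: the user typed or pasted an address into the search box.
    if re.match(r"^(https?://|www\.)", lower):
        return "url"
    # Known-item query: the whole query is wrapped in quotation marks.
    if len(q) >= 2 and q[0] in '"\u201c' and q[-1] in '"\u201d':
        return "known-item"
    # Question: ends with a question mark or starts with a question word.
    first_word = lower.split(" ", 1)[0]
    if lower.endswith("?") or first_word in QUESTION_WORDS:
        return "question"
    # Format query: the query is nothing but a format term.
    if lower in FORMAT_TERMS:
        return "format"
    return "other"


if __name__ == "__main__":
    samples = [
        "http://library.duke.edu/digitalcollections/adaccess/",
        '"history of advertising"',
        "diaries",
        "sheet music",
    ]
    for sample in samples:
        print(classify_query(sample), "<-", sample)
```

Heuristics like these will misfile plenty of queries (a quoted phrase is not always a title, for instance), so the output is best treated as a first pass before reading the queries themselves.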

Below are some of the important questions our users have asked with links to where they’ve found answers to those questions in our digital collections.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States.