JCDL Conference 2008



Accepted Papers

Enhancing Digital Libraries Using Missing Content Analysis
David Carmel, Elad Yom-Tov and Haggai Roitman
Abstract: This work shows how the content of a digital library can be enhanced to better satisfy its users' needs. Missing content is identified by finding missing content topics in the system's query log or in a pre-defined taxonomy of required knowledge. The collection is then enhanced with new relevant knowledge, extracted from external sources, that satisfies those missing content topics. Experiments we conducted measure the precision of the system before and after content enhancement. The results demonstrate a significant improvement in system effectiveness as a result of content enhancement, and the superiority of the missing-content enhancement policy over several alternative policies.
Building a Dynamic Lexicon from a Digital Library
David Bamman and Gregory Crane
Abstract: We describe here in detail our work toward creating a dynamic lexicon from the texts in a large digital library. By leveraging a small structured knowledge source (a 30,537 word treebank), we are able to extract selectional preferences for words from a 3.5 million word Latin corpus. This is promising news for low-resource languages and digital collections seeking to turn a small human investment into a much larger gain. The library architecture in which this work is developed allows us to query customized subcorpora to report on lexical usage by author, genre or era, and allows us to continually update the lexicon as new texts are added to the collection.
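The extraction step can be illustrated with a toy sketch: given a treebank reduced to (verb, relation, argument) tuples, selectional preferences can be scored with pointwise mutual information. The miniature tuple set and the plain PMI scoring below are assumptions for illustration, not the authors' actual method:

```python
from collections import Counter
from math import log

# Hypothetical miniature "treebank": (verb, relation, argument) tuples.
tuples = [
    ("capio", "obj", "urbs"), ("capio", "obj", "urbs"),
    ("capio", "obj", "arma"), ("amo", "obj", "puella"),
    ("amo", "obj", "patria"), ("capio", "obj", "castra"),
]

def selectional_preferences(tuples):
    """Score how strongly each verb 'prefers' each argument using
    pointwise mutual information over (verb, argument) co-occurrences."""
    pair = Counter((v, a) for v, _, a in tuples)
    verb = Counter(v for v, _, _ in tuples)
    arg = Counter(a for _, _, a in tuples)
    n = len(tuples)
    return {
        (v, a): log((c / n) / ((verb[v] / n) * (arg[a] / n)))
        for (v, a), c in pair.items()
    }

prefs = selectional_preferences(tuples)
```

A real system would add smoothing and generalize arguments into semantic classes; the point here is only that a small set of annotated tuples yields usable association scores.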
On Content-Driven Search—Keyword Suggesters for Literature Digital Libraries
Sulieman Bani-Ahmad and Gultekin Ozsoyoglu
Abstract: We propose and evaluate a “content-driven search keyword suggester” for keyword-based search in literature digital libraries. Suggesting search keywords at an early stage, i.e., while the user is entering search terms, is helpful for constructing more accurate, less ambiguous, and focused search keywords for queries. Our search keyword suggestion approach is based on an a priori analysis of the publication collection in the digital library at hand, and consists of the following steps. We (i) parse the document collection using the Link Grammar parser, a syntactic parser of English, (ii) group publications based on their “most-specific” research topics, (iii) use the parser output to build a hierarchical structure of simple and compound tokens to be used to suggest search terms, (iv) use TextRank, a text summarization tool, to assign topic-sensitive scores to keywords, and (v) use the identified research topics to help users aggregate search keywords prior to the actual search query execution. We experimentally show that the proposed framework, which is optimized to work on literature digital libraries, promises a more scalable, high quality, and user-friendly search-keyword suggester when compared to its competitors. We validate our proposal experimentally using a subset of the ACM SIGMOD Anthology digital library as a testbed, and by employing the research-pyramid model to identify the “most-specific” research topics.
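Step (iv), TextRank-style keyword scoring, can be sketched as a PageRank-style iteration over a word co-occurrence graph. The sliding-window size, damping factor, and iteration count below are generic assumptions, not the paper's tuned configuration:

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iters=30):
    """Minimal TextRank-style keyword ranking: build an unweighted
    co-occurrence graph over a sliding window, then run PageRank-style
    score updates and return tokens from highest to lowest score."""
    graph = defaultdict(set)
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[i] != tokens[j]:
                graph[tokens[i]].add(tokens[j])
                graph[tokens[j]].add(tokens[i])
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        # Each node receives a share of its neighbors' current scores.
        score = {
            w: (1 - damping) + damping * sum(
                score[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    return sorted(score, key=score.get, reverse=True)
```

In a real suggester the tokens would be the simple and compound tokens from step (iii), and the scores would be made topic-sensitive per publication group.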
Unsupervised Semantic Markup of Literature for Biodiversity Digital Libraries
Hong Cui
Abstract: This paper reports the further development of machine learning techniques for semantic markup of biodiversity literature, especially morphological descriptions of living organisms such as those hosted at efloras.org and algaebase.org. Syntactic parsing and supervised machine learning techniques have been explored by earlier research. The limitations of these techniques prompted our investigation of an unsupervised learning approach that combines the strengths of earlier techniques while avoiding their limitations. Semantic markup at the organ and character levels is discussed. Research on semantic markup of natural heritage literature has a direct impact on the development of semantic-based access in biodiversity digital libraries.
Seeking information in realistic books: A user study
Veronica Liesaputra and Ian Witten
Abstract: There are opposing views on whether readers gain any advantage from using a computer model of a 3D physical book. There is enough evidence, both anecdotal and from formal user studies, to suggest that the usual HTML or PDF presentation of documents is not always the most convenient, or the most comfortable, for the reader. On the other hand, it is quite clear that while 3D book models have been prototyped and demonstrated, none are in routine use in today’s digital libraries. And how do 3D book models compare with actual books? This paper reports on a user study designed to compare the performance of a practical Realistic Book implementation with conventional formats (HTML and PDF) and with physical books. It also evaluates the annotation features that the implementation provides.
Understanding Cultural Heritage Experts’ Information Seeking Needs
Alia Amin, Jacco van Ossenbruggen, Lynda Hardman and Annelies van Nispen
Abstract: We report on our user study on the information seeking behavior of cultural heritage experts and the sources they use to carry out search tasks. Seventeen experts from nine cultural heritage institutes in the Netherlands were interviewed and asked to answer questionnaires about their daily search activities. The interviews helped us to better understand their search motivations, types, sources and tools. A key finding of our study is that the majority of search tasks involve relatively complex information gathering. This is in contrast to the relatively simple fact-finding oriented support provided by current tools. We describe a number of strategies that experts have developed to overcome the inadequacies of their tools. Finally, based on the analysis, we derive general trends of cultural heritage experts’ information seeking needs and discuss our preliminary experiences with potential solutions.
The Myth of Find: User Behaviour and Attitudes Towards the Basic Search Feature
Fernando Loizides and George Buchanan
Abstract: The ubiquitous within-document text search feature (Ctrl-F) is considered by users to be a key advantage of electronic information seeking [1]. However, what people say they do and what they actually do are not always consistent. It is necessary to understand, acknowledge and identify the cause of this inconsistency, and to identify the physical and cognitive factors involved, so that we can develop better methods and tools to assist the search process. This paper discusses the limitations and myths of Ctrl-F in information seeking. A prototype system for within-document search is introduced. Three user studies reveal behaviours and attitudes common among participants regarding within-document searching.
A Longitudinal Study of Exploratory and Keyword Search
Max L. Wilson and m.c. schraefel
Abstract: Digital libraries are concerned with improving the access to collections to make their service more effective and valuable to users. In this paper, we present the results of a four-week longitudinal study investigating the use of both exploratory and keyword forms of search within an online video archive, where both forms of search were available concurrently in a single user interface. While we expected early use to be more exploratory and subsequent use to be directed, over the whole period there was a balance of exploratory and keyword searches and they were often used together. Further, to support the notion that facets support exploration, there were more than five times as many facet clicks as more complex forms of keyword search (boolean and advanced). From these results, we can conclude that there is real value in investing in exploratory search support, which was shown to be both popular and useful for extended use of the system.
Exploring Educational Standard Alignment: In Search of 'Relevance'
Rene Reitsma, Byron Marshall and Michael Dalton
Abstract: The growing availability of online K-12 curriculum is increasing the need for meaningful alignment of this curriculum with state-specific standards. Promising automated and semi-automated alignment tools have recently become available. Unfortunately, recent alignment evaluation studies report low inter-rater reliability, e.g., 32% with two raters and 35 documents. While these results are in line with studies in other domains, low reliability makes it difficult to accurately train automatic systems and complicates comparison of different services. We propose that inter-rater reliability of broadly defined, abstract concepts such as ‘alignment’ or ‘relevance’ must be expected to be low due to the real-world complexity of teaching and the multidimensional nature of the curricular documents. Hence, we suggest decomposing these concepts into less abstract, more precise measures anchored in the daily practice of teaching.

This article reports on the integration of automatic alignment results into the interface of the TeachEngineering collection and on an evaluation methodology intended to produce more consistent document relevance ratings. Our results (based on 14 raters x 6 documents) show high inter-rater reliability (61 - 95%) on less abstract relevance dimensions while scores on the overall ‘relevance’ concept are (as expected) lower (64%). Despite a relatively small sample size, regression analysis of our data resulted in an explanatory (R2 = .75) and statistically stable (p-values < .05) model for overall relevance as indicated by matching concepts, related background material, adaptability to grade level, and anticipated usefulness of exercises. Our results suggest that more detailed relevance evaluation which includes several dimensions of relevance would produce better data for comparing and training alignment tools.

From NSDL 1.0 to NSDL 2.0: Towards a Comprehensive Cyberinfrastructure for Teaching and Learning
David McArthur and Lee Zia
Abstract: NSDL is a premier provider of digital educational collections and services, which has been supported by NSF for eight years. As a mature program, NSDL has reached a point where it could either change direction or wind down. In this paper we argue there are reasons to continue the program and we outline several possible new program directions. These build on NSDL’s learning platform, and they also look towards NSF’s emerging interest in supporting work at the intersection of cyberinfrastructure and education. We consider NSDL’s potential roles in several grand challenges that confront education, including: tailoring educational resources to students’ needs, providing educators with a cyber-teaching environment, developing a cyber-workbench for researchers, and integrating education research and practice.
Cross-Disciplinary Molecular Science Education in Introductory Science Courses: An NSDL MatDL Collection
David Yaron, Jodi Davenport, Michael Karabinos, Gaea Leinhardt, Laura Bartolo, John Portman, Cathy Lowe, Donald Sadoway, W. Craig Carter and Colin Ashe
Abstract: This paper discusses a digital library designed to help undergraduate students draw connections across disciplines, beginning with introductory discipline-specific science courses (including chemistry, materials science, and biophysics). The collection serves as the basis for a design experiment for interdisciplinary educational libraries and is discussed in terms of the three models proposed by Sumner and Marlino. As a cognitive tool, the library is organized around recurring patterns in molecular science, with one such pattern being developed for this initial design experiment. As a component repository, the library resources support learning of these patterns and how they appear in different disciplines. As a knowledge network, the library integrates design with use and assessment.
Curriculum Overlay Model for Embedding Digital Resources
Huda Khan, Keith Maull and Tamara Sumner
Abstract: This paper describes the design and implementation of a curriculum overlay model for representing adaptable curriculum built from educational digital library resources. We focus on representing curriculum to enable the incorporation of digital resources into curriculum, and to enable curriculum sharing and customization by educators. We defined this model as a result of longitudinal studies of educators' development and customization of curriculum, and of user interface design studies of prototypes representing curriculum. Like overlay journals or the information network overlay model, our curriculum overlay model defines curriculum as a compound object with internal semantic relationships and relationships to digital library metadata describing resources. We validated the model by instantiating it with science curriculum that uses digital library resources, and by using this instantiation within an application, built on FEDORA, that supports curriculum customization. Findings from this work can support the design of digital library services for customizing curriculum that embeds digital resources.
Gazetiki: Automatic Creation of a Geographical Gazetteer
Adrian Popescu, Gregory Grefenstette and Pierre-Alain Moëllic
Abstract: Geolocalized databases are becoming necessary in a wide variety of application domains. Thus far, the creation of such databases has been a costly, manual process. This drawback has stimulated interest in automating their construction, for example, by mining geographical information from the Web. Here we present and evaluate a new automated technique for creating and enriching a geographical gazetteer, called Gazetiki. Our technique merges disparate information from Wikipedia, Panoramio, and web search engines in order to identify geographical names, categorize these names, find their geographical coordinates and rank them. We show that our method provides a richer structure and an improved coverage compared to the other known attempt at automatically building a geographic database, TagMaps. Our technique correctly identifies 93% of geographical location candidates, with a much greater coverage than TagMaps, finding 2 to 30 times more items than TagMaps per location. The information produced in Gazetiki enhances and complements the Geonames database, using a similar domain model.
Discovering GIS Sources on the Web using Summaries
Ramaswamy Hariharan, Bijit Hore and Sharad Mehrotra
Abstract: In this paper, we consider the problem of discovering GIS data sources on the web. Source discovery queries for GIS data are specified using keywords and a region of interest. A source is considered relevant if it contains data that matches the keywords in the specified region. Existing techniques simply rely on textual metadata accompanying such datasets to compute relevance to user-queries. Such approaches result in poor search results, often missing the most relevant sources on the web. We address this problem by developing more meaningful summaries of GIS datasets that preserve the spatial distribution of keywords. We conduct experiments showing the effectiveness of proposed summarization techniques by significantly improving the quality of query results over previous approaches, while guaranteeing scalability and high performance.
SocialTrust: Tamper-Resilient Trust Establishment in Online Communities
James Caverlee, Ling Liu and Steve Webb
Abstract: Web 2.0 promises rich opportunities for information sharing, electronic commerce, and new modes of social interaction, all centered around the “social Web” of user-contributed content, social annotations, and person-to-person social connections. But the increasing reliance on this “social Web” also places individuals and their computer systems at risk. In this paper, we identify a number of vulnerabilities inherent in online communities and study opportunities for malicious participants to exploit the tight social fabric of these networks. With these problems in mind, we propose the SocialTrust framework for tamper-resilient trust establishment in online communities. Two of the salient features of SocialTrust are its dynamic revision of trust by (i) distinguishing relationship quality from trust; and (ii) incorporating a personalized feedback mechanism for adapting as the community evolves. We experimentally evaluate the SocialTrust framework using real online social networking data consisting of millions of MySpace profiles and relationships. We find that SocialTrust supports robust trust establishment even in the presence of large-scale collusion by malicious participants.
Personal & SME Archiving
Stephan Strodl, Florian Motlik, Kevin Stadler and Andreas Rauber
Abstract: Digital objects require appropriate measures for digital preservation to ensure that they can be accessed and used in the near and far future. While heritage institutions have been addressing the challenges posed by digital preservation needs for some time, private users and SMEs are far less prepared to handle these challenges. Yet both have increasing amounts of data that represent considerable value, be it office documents or family photographs. Backup, a common practice of home users, avoids the physical loss of data, but it does not prevent the loss of the ability to render and use the data in the long term. Research and development in the area of digital preservation is driven by memory institutions and large businesses; the available tools, services and models are developed to meet the demands of these professional settings.

This paper analyses the requirements and challenges of preservation solutions for private users and SMEs. Based on these requirements and supported by available tools and services, we are designing and implementing a home archiving system to provide digital preservation solutions specifically for digital holdings in the small office and home environment. It hides the technical complexity of digital preservation challenges and provides simple and automated services based on established best practices. The system combines bit preservation and logical preservation strategies to avoid both the loss of data and the loss of the ability to access and use it. A first software prototype, called Hoppla, is presented in this paper.

Recovering a Website's Server Components from the Web Infrastructure
Frank McCown and Michael Nelson
Abstract: Our previous research has shown that the collective behavior of search engine caches (e.g., Google, Yahoo, Live Search) and web archives (e.g., Internet Archive) results in the uncoordinated but large-scale refreshing and migrating of web resources. Interacting with these caches and archives, which we call the Web Infrastructure (WI), allows entire websites to be reconstructed in an approach we call lazy preservation. Unfortunately, the WI only captures the client-side view of a web resource. While this may be useful for recovering much of the content of a website, it is not helpful for restoring the scripts, web server configuration, databases, and other server-side components responsible for the construction of the web resource.

This paper proposes a novel technique for storing and recovering the server-side components of a website from the WI. Using erasure codes to embed the server-side components as HTML comments throughout the website, we can effectively reconstruct all the server components of a website when only a portion of the client-side resources have been extracted from the WI. We present the results of a preliminary study that baselines the lazy preservation of ten EPrints repositories and then examines the preservation of an EPrints repository that uses the erasure code technique to store the server-side EPrints software throughout the website. We found nearly 100% of the EPrints components were recoverable from the WI just two weeks after the repository came online, and it remained recoverable three months after it was "lost".
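The erasure-code idea can be illustrated with a deliberately simple single-parity scheme. The function names, the `<!-- c0:... -->` comment layout, and the toy payload below are hypothetical; the paper's technique uses stronger erasure codes that survive the loss of many chunks, not just one:

```python
import base64
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def embed(payload: bytes, k: int):
    """Split payload into k equal data chunks plus one XOR parity
    chunk, each wrapped as an HTML comment for embedding in pages."""
    size = -(-len(payload) // k)  # ceiling division
    chunks = [payload[i * size:(i + 1) * size].ljust(size, b"\0")
              for i in range(k)]
    chunks.append(reduce(xor, chunks))  # parity chunk at index k
    return ["<!-- c%d:%s -->" % (i, base64.b64encode(c).decode())
            for i, c in enumerate(chunks)]

def recover(comments, k, total_len):
    """Rebuild the payload from any k of the k+1 recovered comments:
    one missing chunk is the XOR of all the surviving ones."""
    found = {}
    for c in comments:
        label, data = c[len("<!-- c"):-len(" -->")].split(":", 1)
        found[int(label)] = base64.b64decode(data)
    missing = [i for i in range(k + 1) if i not in found]
    if missing:
        found[missing[0]] = reduce(xor, list(found.values()))
    return b"".join(found[i] for i in range(k))[:total_len]
```

Because the XOR of all k+1 chunks is zero, any single lost chunk equals the XOR of the survivors; generalizing to multiple losses is exactly what real erasure codes (e.g., Reed-Solomon) provide.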

A Data Model and Architecture for Long-Term Preservation
Greg Janee, Justin Mathena and James Frew
Abstract: The National Geospatial Digital Archive, one of eight initial projects funded under the Library of Congress’s NDIIPP program, has been researching how geospatial data can be preserved on a national scale and be made available to future generations. In this paper we describe an archive architecture that provides a minimal approach to the long-term preservation of digital objects based on co-archiving of object semantics, uniform representation of objects and semantics, explicit storage of all objects and semantics as files, and abstraction of the underlying storage system. This architecture ensures that digital objects can be easily migrated from archive to archive over time and that the objects can, in principle, be made usable again at any point in the future; its primary benefit is that it serves as a fallback strategy against, and as a foundation for, more sophisticated (and costly) preservation strategies. We describe an implementation of this architecture in a prototype archive running at UCSB that also incorporates a suite of ingest and access components.
HarvANA - Harvesting Community Tags to Enrich Collection Metadata
Jane Hunter, Imran Khan and Anna Gerber
Abstract: Collaborative, social tagging and annotation systems have exploded on the Internet as part of the Web 2.0 phenomenon. Systems such as Flickr, Del.icio.us, Technorati, Connotea and LibraryThing, provide a community-driven approach to classifying information and resources on the Web, so that they can be browsed, discovered and re-used. Although social tagging sites provide simple, user-relevant tags, there are issues associated with the quality of the metadata and the scalability compared with conventional indexing systems. In this paper we propose a hybrid approach that enables authoritative metadata generated by traditional cataloguing methods to be merged with community annotations and tags. The HarvANA (Harvesting and Aggregating Networked Annotations) system uses a standardized but extensible RDF model for representing the annotations/tags and OAI-PMH to harvest the annotations/tags from distributed community servers. The harvested annotations are aggregated with the authoritative metadata in a centralized metadata store. This streamlined, interoperable, scalable approach enables libraries, archives and repositories to leverage community enthusiasm for tagging and annotation, augment their metadata and enhance their discovery services. This paper describes the HarvANA system and its evaluation through a collaborative testbed with the National Library of Australia using architectural images from PictureAustralia.
Semi-Automated Metadata Extraction for Preprints Archives
Emma Tonkin and Henk Muller
Abstract: In this paper we present a system called paperBase that aids users in entering metadata for preprints. PaperBase extracts metadata from the preprint. Using a Dublin-Core based REST API, third-party repository software populates a web form that the user can then proofread and complete. PaperBase also predicts likely keywords for the preprints, based on a controlled vocabulary of keywords that the archive uses and a Bayesian classifier.

We tested the system on 12 individuals, measuring the time it took them to enter data and the accuracy of the entered metadata. We find that our system is not significantly faster than manual entry, even though all but two participants perceived it to be faster. However, some metadata, in particular the titles of preprints, contain significantly fewer mistakes when entered automatically; even though the automatic system is not perfect, people tend to correct the mistakes that paperBase makes but would leave their own mistakes in place.
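The keyword prediction component can be sketched as a multinomial naive Bayes classifier over a controlled vocabulary. The training pairs and function names below are invented for illustration and do not come from paperBase:

```python
from collections import Counter, defaultdict
from math import log

# Hypothetical training data: (document words, assigned vocabulary keyword).
train = [
    ("parallel compiler optimization loops", "compilers"),
    ("register allocation compiler backend", "compilers"),
    ("cache coherence multiprocessor memory", "architecture"),
    ("branch prediction pipeline memory", "architecture"),
]

def fit(train):
    """Count word frequencies per keyword and keyword priors."""
    word_counts, kw_counts, vocab = defaultdict(Counter), Counter(), set()
    for words, kw in train:
        for w in words.split():
            word_counts[kw][w] += 1
            vocab.add(w)
        kw_counts[kw] += 1
    return word_counts, kw_counts, vocab

def predict(model, text):
    """Return the most probable keyword, with add-one smoothing."""
    word_counts, kw_counts, vocab = model
    n = sum(kw_counts.values())
    def score(kw):
        total = sum(word_counts[kw].values())
        return log(kw_counts[kw] / n) + sum(
            log((word_counts[kw][w] + 1) / (total + len(vocab)))
            for w in text.split())
    return max(kw_counts, key=score)
```

In practice the classifier would be trained on the archive's existing records and would return a ranked list of candidate keywords for the user to confirm rather than a single label.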

A Metadata Generation System for Scanned Scientific Volumes
Xiaonan Lu, James Z. Wang and C. Lee Giles
Abstract: With advances in automatic document processing and the growing popularity of digital libraries, large-scale digitization projects have been conducted at many digital libraries. Scientific literature originally printed on paper has been converted into collections of digital resources for preservation and open-access purposes. In this work, we tackle the problem of extracting structural and descriptive metadata for scanned volumes of journals. These metadata describe the internal structure of a scanned volume, link objects across different sources, and describe the articles published within the volume. This structural and descriptive information is critical for digital libraries to provide effective content-access functionality to users. We propose methods for generating volume-level, issue-level, and article-level metadata using format and text features extracted from OCRed text. We have developed the system and integrated it into an operational digital library for real-world usage.
Exploring a Digital Library through Key Ideas
Bill N. Schilit and Okan Kolak
Abstract: Key Ideas is a technique for exploring digital libraries by navigating passages that repeat across multiple books. From these popular passages emerge quotations that authors have copied from book to book because they capture an idea particularly well: Jefferson on liberty; Stanton on women's rights; and Gibson on cyberpunk. We augment Popular Passages by extracting key terms from the surrounding context and computing sets of related key terms. We then create an interaction model where readers fluidly explore the library by viewing popular quotations on a particular key term, and follow links to quotations on related key terms. In this paper we describe our vision and motivation for Key Ideas, present an implementation running over a massive, real-world digital library consisting of over a million scanned books, and describe some of the technical and design challenges. The principal contribution of this paper is the interaction model and prototype system for browsing digital libraries of books using key terms extracted from the aggregate context of popularly quoted passages.
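Detecting passages that repeat across books can be sketched with word-shingle intersection; the function below is a toy approximation (exact n-gram matching over two texts) of the large-scale popular-passage mining described above:

```python
def shared_passages(book_a: str, book_b: str, n: int = 5):
    """Return the word n-grams (shingles) that occur in both texts --
    a simple stand-in for detecting popularly quoted passages."""
    def shingles(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}
    return {" ".join(s) for s in shingles(book_a) & shingles(book_b)}
```

At library scale this would run over millions of books with hashed shingles and fuzzy matching, and overlapping shared shingles would be merged into full quoted passages.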
Math Information Retrieval: User Requirements and Prototype Implementation
Jin Zhao, Min-Yen Kan and Yin Leng Theng
Abstract: We report on the user requirements study and preliminary implementation phases in creating a digital library that indexes and retrieves educational materials on math. We first review the current approaches and resources for math retrieval, then report on interviews of a small group of potential users to properly ascertain their needs. While preliminary, the results suggest that Meta-Search and Resource Categorization are two basic requirements for a math search engine. In addition, we implement a prototype categorization system and show that the generic features work well in identifying math content within a webpage but are weak in categorizing it. We believe this is mainly due to the training data and the segmentation. In the near future, we plan to improve it further while integrating it and Meta-Search into a search engine. As a long-term goal, we will also look into how math expressions and text may be best handled.
A Competitive Environment for Exploratory Query Expansion
David Milne, David Nichols and Ian Witten
Abstract: Most information workers query digital libraries many times a day. Yet people have little opportunity to hone their skills in a controlled environment, or compare their performance with others in an objective way. Conversely, although search engine logs record how users evolve queries, they lack crucial information about the user’s intent. This paper describes an environment for exploratory query expansion that pits users against each other and lets them compete, and practice, in their own time and on their own workstation. The system captures query evolution behavior on predetermined information-seeking tasks. It is publicly available, and the code is open source so that others can set up their own competitive environments.
How people find videos
Sally Jo Cunningham and David M. Nichols
Abstract: At present very little is known about how people locate and view videos. This study draws a rich picture of everyday video seeking strategies and video information needs, based on an ethnographic study of New Zealand university students. These insights into the participants’ activities and motivations suggest potentially useful facilities for a video digital library.
Selection and Context Scoping for Digital Video Collections: An Investigation of YouTube and Blogs
Robert Capra, Christopher Lee, Gary Marchionini, Terrell Russell, Chirag Shah and Fred Stutzman
Abstract: Digital curators are faced with decisions about what part of the ever-growing, ever-evolving space of digital information to collect and preserve. The recent explosion of web video on sites such as YouTube presents curators with an even greater challenge – how to sort through and filter a large amount of information to find, assess and ultimately preserve important, relevant, and interesting video. In this paper, we describe research conducted to help inform digital curation of on-line video. Since May 2007, we have been monitoring the results of 57 queries on YouTube related to the 2008 U.S. presidential election. We report results comparing these data to blogs that point to candidate videos on YouTube, and discuss the effects of query-based harvesting as a collection development strategy.
A Study of Awareness in Multimedia Search
Robert Villa, Nick Gildea and Joemon Jose
Abstract: Awareness of another's activity is an important aspect of facilitating collaboration between users, enabling an "understanding of the activities of others". Techniques such as collaborative filtering enable a form of asynchronous awareness, providing recommendations generated from the past activity of a community of users. In this paper we investigate the role of awareness and its effect on search behavior in collaborative multimedia retrieval. We focus on the scenario where two users are searching at the same time on the same task, and via the interface, can see the activity of the other user. The main research question asks: does awareness of another searcher aid a user when carrying out a multimedia search session?

To encourage awareness, an experimental study was designed in which two users were asked to find as many relevant video shots as possible under different awareness conditions. These were individual search (no awareness of each other), mutual awareness (where both users could see each other's search screen), and unbalanced awareness (where one user is able to see the other's screen, but not vice versa). Twelve pairs of users were recruited, and the four worst-performing TRECVID 2006 search topics were used as search tasks under the different awareness conditions. We present the results of this study, followed by a discussion of the implications for multimedia digital library systems.

Towards usage-based impact metrics: first results from the MESUR project
Johan Bollen, Herbert Van de Sompel and Marko A. Rodriguez
Abstract: Scholarly usage data holds the potential to be used as a tool to study the dynamics of scholarship in real time, and to form the basis for the definition of novel metrics of scholarly impact. However, the formal groundwork to reliably and validly exploit usage data is lacking, and the exact nature, meaning and applicability of usage-based metrics is poorly understood. The MESUR project funded by the Andrew W. Mellon Foundation constitutes a systematic effort to define, validate and cross-validate a range of usage-based metrics of scholarly impact. MESUR has collected nearly 1 billion usage events as well as all associated bibliographic and citation data from significant publishers, aggregators and institutional consortia to construct a large-scale usage data reference set. This paper describes some major challenges related to aggregating and processing usage data, and discusses preliminary results obtained from analyzing the MESUR reference data set. The results confirm the intrinsic value of scholarly usage data, and support the feasibility of reliable and valid usage-based metrics of scholarly impact.
Evaluating the Contributions of Video Representation for a Life Oral History Collection
Michael Christel and Michael Frisch
Abstract: A digital video library of over 900 hours of video and 18,000 stories from The HistoryMakers is used to investigate the role of motion video for users of recorded life oral histories. Stories in the library are presented in one of two ways in two within-subjects experiments: either as audio accompanied by a single still photographic image per story, or as the same audio within a motion video of the interviewee speaking. 24 participants, given a treasure-hunt fact-finding task (i.e., very directed search), showed no significant preference for either the still or video treatment, and no difference in task performance. 14 participants in a second study worked on an exploratory task in the same within-subjects experimental framework, and showed a significant preference for video. For exploratory work, video has a positive effect on user satisfaction. Implications for the use of video in collecting and accessing recorded life oral histories, in student assignments and more generally, are discussed, along with reflections on long-term use studies to complement the ones presented here.
From Writing and Analysis to the Repository: Taking the Scholars’ Perspective on Scholarly Archiving
Catherine C. Marshall
Abstract: This paper reports the results of a qualitative field study of the writing, collaboration, and archiving practices of researchers in a single organization; the researchers span five subdisciplines and bring different expertise to the papers they write together. The study focuses on the kinds of artifacts the researchers create in the process of writing a paper, how they exchange and store these artifacts over the short term, how they handle references and bibliographic materials, and the strategies they use to guarantee the long term safety of their scholarly materials. By attending to and supporting the upstream processes of writing and collaboration, we hope to facilitate personal digital archiving and deposit into institutional and disciplinary repositories as a side-effect to everyday aspects of research. The findings reveal a great range of scholarly materials, consequential differences in how researchers handle them now and what they expect to keep, and patterns of bibliographic practices and resource use. The findings also identify long term vulnerabilities for personal archives.
User-Assisted Ink-Bleed Correction for Handwritten Documents
Yi Huang and Michael S. Brown
Abstract: We describe a user-assisted framework for correcting ink-bleed in old handwritten documents housed at the National Archives of Singapore (NAS). Our approach departs from traditional correction techniques that strive for full automation. Fully-automated approaches make assumptions about ink-bleed characteristics that are not valid for all inputs. Furthermore, fully-automated approaches often have to set algorithmic parameters that have no meaning for the end-user. In our system, the user needs only to provide simple examples of ink-bleed, foreground ink, and background. These training examples are used to classify the remaining pixels in the document to produce a computer-generated result that is equal to or better than existing fully-automated approaches.

To offer a complete system we also provide tools that allow any errors in the computer-generated results to be quickly "cleaned up" by the user. The initial training markup, computer-generated results, and manual edits are all recorded with the final output, allowing subsequent viewers to see how a corrected document was created and to make changes or updates. While this is an on-going project, feedback from the NAS staff has been overwhelmingly positive that this user-assisted framework is a practical way to address the ink-bleed problem.
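As a toy illustration of the example-driven classification idea described above, the sketch below assigns each unknown pixel to the class whose labeled examples it most resembles. The grayscale intensities, class names, and nearest-class-mean rule are illustrative assumptions, not the paper's actual classifier.

```python
# Toy sketch of example-driven pixel classification (not the paper's actual
# algorithm): the user supplies a few labeled samples per class, and every
# remaining pixel is assigned to the class with the nearest mean intensity.
# Grayscale intensities in 0-255 are an illustrative assumption.

def classify_pixels(pixels, examples):
    """examples: {label: [sample intensities]}; label by nearest class mean."""
    means = {label: sum(v) / len(v) for label, v in examples.items()}
    return [min(means, key=lambda lab: abs(p - means[lab])) for p in pixels]

examples = {
    "foreground": [10, 20, 15],   # dark ink strokes
    "bleed":      [120, 130],     # mid-tone ink-bleed
    "background": [240, 250],     # light paper
}

labels = classify_pixels([12, 125, 245, 200], examples)
print(labels)  # each unknown pixel gets the closest class
```

A real system would of course use richer per-pixel features than a single intensity, but the workflow — user marks examples, system labels the rest — is the same.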

CRF-Based Authors' Name Tagging for Scanned Documents
Manabu Ohta and Atsuhiro Takasu
Abstract: Authors' names are a critical bibliographic element when searching or browsing academic articles stored in digital libraries. Therefore, those creating metadata for digital libraries would appreciate an automatic method to extract such bibliographic data from printed documents. In this paper, we describe an automatic author name tagger for academic articles scanned with optical character recognition (OCR) mark-up. The method uses conditional random fields (CRF) for labeling the unsegmented character strings in authors’ blocks as those of either an author or a delimiter. We applied the tagger to Japanese academic articles. The results of the experiments showed that it correctly labeled more than 99% of the author name strings, which compares favorably with the under 96% correct rate of our previous tagger based on a hidden Markov model (HMM).
Automatic Information Extraction from 2-Dimensional Plots in Digital Documents
William Browuer, Saurabh Kataria, Sujatha Das, Prasenjit Mitra and C. Lee Giles
Abstract: Most search engines index the textual content of documents in digital libraries. However, scholarly articles often report important findings in figures. The contents of the figures are not indexed. Often scholars need to search for data reported in figures and process them. Therefore, searching for data reported in figures and extracting them is an important problem. To the best of our knowledge, there exists no tool to automatically extract data from figures in digital documents. If we can perform extraction tasks from these images automatically, there is the potential for an end-user to query the data from multiple digital documents simultaneously and efficiently. We propose a framework of algorithms based on image analysis and machine learning that can extract all information from 2-D plot images and store them in a database. We show how to identify 2-D plot figures, how to segment the plots to extract the axes, the legend and the data sections, how to extract the labels of the axes, separate the data symbols from the text in the legend, identify data points and segregate overlapping data points. We also show that our algorithms can extract information from 2-D plots accurately and scalably using a testbed of images available from multiple real-life sources.
A simple method for citation metadata extraction using hidden Markov models
Erik Hetzner
Abstract: This paper describes a simple method for extracting metadata fields from citations using hidden Markov models. The method is easy to implement and can achieve levels of precision and recall for heterogeneous citations comparable to other HMM-based methods. The method consists largely of string manipulation and otherwise depends only on an implementation of the Viterbi algorithm, which is widely available, and so can be implemented by diverse digital library systems.
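Since the method depends only on an implementation of the Viterbi algorithm, a minimal sketch of Viterbi decoding applied to citation field labeling may help make this concrete. The states, probabilities, and tokens below are invented for illustration and are not taken from the paper.

```python
# Minimal Viterbi decoder: find the most likely sequence of hidden states
# (citation fields) for a sequence of observed tokens. All probabilities
# here are hypothetical; a real system would estimate them from data.

def viterbi(tokens, states, start_p, trans_p, emit_p):
    """Return the most likely state sequence for the observed tokens."""
    # V[t][s] = probability of the best path ending in state s at step t
    V = [{s: start_p[s] * emit_p[s].get(tokens[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(tokens)):
        V.append({})
        back.append({})
        for s in states:
            best_prev = max(states, key=lambda p: V[t - 1][p] * trans_p[p][s])
            V[t][s] = (V[t - 1][best_prev] * trans_p[best_prev][s]
                       * emit_p[s].get(tokens[t], 1e-6))
            back[t][s] = best_prev
    # Backtrack from the best final state.
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(tokens) - 1, 0, -1):
        last = back[t][last]
        path.insert(0, last)
    return path

states = ["AUTHOR", "TITLE"]
start_p = {"AUTHOR": 0.9, "TITLE": 0.1}
trans_p = {"AUTHOR": {"AUTHOR": 0.6, "TITLE": 0.4},
           "TITLE": {"AUTHOR": 0.1, "TITLE": 0.9}}
emit_p = {"AUTHOR": {"Smith,": 0.5, "J.": 0.4},
          "TITLE": {"Digital": 0.3, "libraries": 0.3}}

print(viterbi(["Smith,", "J.", "Digital", "libraries"],
              states, start_p, trans_p, emit_p))
```

In practice the emission model would be built from string-manipulation features of the citation tokens, as the abstract suggests, rather than from raw token probabilities.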
Identification of Time-Varying Objects on the Web
Satoshi Oyama, Kenichi Shirasuna and Katsumi Tanaka
Abstract: We have developed a method for determining whether data found on the Web refer to the same or to different objects, taking into account the possibility that their attribute values change over time. Specifically, we estimate the probability that the observed data were generated for the same object that has undergone changes in its attribute values over time and the probability that the data are for different objects, and we define similarities between observed data using these probabilities. By giving a specific form to the distributions of time-varying attributes, we can calculate the similarity between given data and identify objects by using agglomerative clustering on the basis of the similarity. Experiments in which we compared identification accuracies between our proposed method and a method that regards all attribute values as constant showed that the proposed method improves the precision and recall of object identification.
Using the Web for Creating Publication Venue Authority Files
Denilson Alves Pereira, Berthier Ribeiro-Neto, Nivio Ziviani and Alberto H. F. Laender
Abstract: Citations to publication venues in the form of journal, conference, and workshop names contain spelling variants, acronyms, abbreviated forms, and misspellings, all of which make it more difficult to retrieve the item of interest. The task of discovering and reconciling these variant forms of bibliographic references is known as authority work. The key goal is to create the so-called authority files, which maintain, for any given bibliographic item, a list of variant labels (i.e., variant strings) used as a reference to it. In this paper we propose to use the Web to create high-quality publication venue authority files. Our idea is to recognize (and extract) references to publication venues in the text snippets of the answers returned by a search engine. References to the same publication venue are then reconciled in an authority file. Each entry in this file is composed of a canonical name for the venue, an acronym, the venue type (i.e., journal, conference, workshop), and a mapping to various forms of writing its name. Experimental results show that our Web-based approach for creating authority files is superior to previous work based on straight string matching techniques. Considering the average precision in finding correct venue canonical names, we observe gains of up to 41.7%.
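The authority-file structure itself can be sketched with a toy grouping of variant venue strings. The paper derives variant groups from Web search snippets; here a hand-rolled normalization key stands in for that step, and the stopword list is an illustrative assumption.

```python
# Toy sketch of a publication venue authority file: variant strings are
# grouped under one canonical entry. The normalization key (lowercase, drop
# punctuation and common venue stopwords) is a stand-in for the paper's
# Web-based reconciliation; the stopword list is an assumption.
import re
from collections import defaultdict

STOP = {"the", "of", "on", "in", "intl", "international", "conf",
        "conference", "proc", "proceedings", "j", "journal"}

def normalize(venue):
    """Crude reconciliation key for a venue string."""
    words = re.findall(r"[a-z0-9]+", venue.lower())
    return " ".join(w for w in words if w not in STOP)

def build_authority_file(variants):
    """Group variant strings; use the longest variant as the canonical name."""
    groups = defaultdict(list)
    for v in variants:
        groups[normalize(v)].append(v)
    return {max(vs, key=len): vs for vs in groups.values()}

variants = [
    "ACM SIGMOD International Conference on Management of Data",
    "Proc. ACM SIGMOD Conf. on Management of Data",
    "J. of Digital Libraries",
    "Journal of Digital Libraries",
]
print(build_authority_file(variants))
```

Each resulting entry maps a canonical venue name to the variant labels that refer to it, mirroring the file structure the abstract describes (minus the acronym and venue-type fields).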
Application of Kalman Filters to Identify Unexpected Change in Blogs
Paul Bogen, Joshua Johnston, Unmil Karadkar, Richard Furuta and Frank Shipman
Abstract: Information on the Internet, especially blog content, changes rapidly. Users of information collections, such as the blogs hosted by technorati.com, have little, if any, control over the content or frequency of these changes. However, it is important for users to be able to monitor content for deviations from the expected pattern of change. If a user is interested in political blogs and a blog switches subject to literary reviews, the user would want to know of this change in behavior. Since pages may change too frequently for manual inspection for "unwanted" changes, an automated approach is needed. In this paper, we explore methods for identifying unexpected change by using Kalman filters to model blog behavior over time. Using this model, we examine the history of 77 blogs and determine methods for flagging the significance of a blog's change from one time step to the next. We are able to predict large deviations in blog content, and allow user-defined sensitivity parameters to tune a statistical threshold of significance for deviation from expectation.
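A minimal one-dimensional Kalman filter illustrates the flagging idea: track the expected signal, and flag time steps whose innovation (observation minus prediction) is improbably large. The noise parameters and the 3-sigma threshold below are illustrative assumptions, not the paper's settings.

```python
# 1-D Kalman filter sketch for flagging unexpected change in a time series,
# in the spirit of the blog-monitoring idea above. Process noise q,
# measurement noise r, and the 3-sigma rule are illustrative assumptions.

def kalman_flags(measurements, q=0.01, r=0.5, threshold=3.0):
    """Track a scalar signal; flag steps whose innovation exceeds
    `threshold` standard deviations of the predicted uncertainty."""
    x, p = measurements[0], 1.0   # state estimate and its variance
    flags = []
    for z in measurements[1:]:
        p += q                    # predict: variance grows by process noise
        innovation = z - x        # observation minus prediction
        s = p + r                 # innovation variance
        flags.append(abs(innovation) > threshold * s ** 0.5)
        k = p / s                 # Kalman gain
        x += k * innovation       # update the state toward the observation
        p *= (1 - k)              # update the variance
    return flags

# A stable signal with one abrupt jump at the sixth measurement:
series = [1.0, 1.1, 0.9, 1.0, 1.05, 8.0, 1.0]
print(kalman_flags(series))  # only the jump to 8.0 is flagged
```

A real blog monitor would apply this to a numeric representation of content (e.g., topic-distribution distances between successive posts) rather than to a raw scalar.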
NCore: Architecture and Implementation of a Flexible, Collaborative Digital Library
Dean Krafft, Aaron Birkland and Ellen Cramer
Abstract: NCore is an open source architecture and software platform for creating flexible, collaborative digital libraries. NCore was developed by the National Science Digital Library (NSDL) project, and it serves as the central technical infrastructure for NSDL. NCore consists of a central Fedora-based digital repository, a specific data model, an API, and a set of backend services and frontend tools that create a new model for collaborative, contributory digital libraries. This paper describes NCore, presents and analyzes its architecture, tools and services; and reports on the experience of NSDL in building and operating a major digital library on it over the past year and the experience of the Digital Library for Earth Systems Education in porting their existing digital library and tools to the NCore platform.
Acceptance and Use of Electronic Library Services in Ugandan University
Prisca Tibenderana
Abstract: Libraries, as old as civilization itself, were created to acquire, store, organise, and provide access to information for those in need, albeit using manual operations. However, with the information explosion and the coming of new technologies, libraries have opted to automate their operations and provide services using digital technology. For electronic library services to be utilized effectively, with special reference to developing countries, end-users need to accept them. This study is an effort to modify "The Unified Theory of Acceptance and Use of Technology" model to cater for electronic library services, addressing a recommendation by the Venkatesh et al. (2003) study (conducted in the USA) that the model be tested in a different setting (such as Uganda) and in a different context. The study developed, tested and validated a "Service Oriented Unified Theory of Acceptance and Use of Technology" (SOUTAUT) model.
Portable Digital Libraries on an iPod: Beyond the client-server model
David Bainbridge, Steve Jones, Sam McIntosh, Matt Jones and Ian Witten
Abstract: We have created an experimental prototype that enhances an ordinary iPod personal music player by adding digital library capabilities. It does not enable access to a remote DL from a user’s PDA; rather, it runs a complete, standard digital library server environment (Greenstone) on the iPod. Being optimized for multimedia information, this platform has truly vast storage capacity. It raises the possibility of not just personal collections but entire institutional-scale digital libraries that are fully portable. Our implementation even allows the iPod to be configured as a web server to provide digital library content over a network, inverting the standard mobile client-server configuration—and incidentally providing full-screen access.

Our system is not (yet) a practical implementation. Rather, it is a proof of concept intended to stimulate thinking on potential applications of a radically new DL configuration. This paper describes the facilities we built, focusing on interface issues and touching on the technical problems that were encountered and solved. It attempts to convey a feeling for the kind of issues that must be faced when adapting standard DL software for non-standard, leading-edge devices.

Annotated Program Examples as First Class Objects in an Educational Digital Library
Peter Brusilovsky, Michael Yudelson and I-Han Hsiao
Abstract: The paper analyzes three major problems encountered by our team as we endeavored to turn program examples into highly reusable educational activities, which could be included as first class objects in various educational digital libraries. It also suggests three specific approaches to resolving these problems, and reports on the evaluation of the suggested approaches. Our successful experience presented in the paper demonstrates how to make program examples self-sufficient, to provide students with personalized guidance to the most appropriate examples, and to increase the volume of annotated examples.
Annotating Historical Archives of Images
Xiaoyue Wang, Lexiang Ye, Eamonn Keogh and Christian Shelton
Abstract: Recent initiatives like the Million Book Project and Google Print Library Project have already archived several million books in digital format, and within a few years a significant fraction of the world's books will be online. While the majority of the data will naturally be text, there will also be tens of millions of pages of images. Many of these images will defy automatic annotation for the foreseeable future, but a considerable fraction may be amenable to automatic annotation by algorithms that can link the historical image with a modern contemporary, with its attendant metatags. In order to perform this linking we must have a suitable distance measure which appropriately combines the relevant features of shape, color, texture and text. However, the best combination of these features will vary from application to application and even from one manuscript to another. In this work we propose a simple technique to learn the distance measure by perturbing the training set in a principled way. We show the utility of our ideas on archives of manuscripts containing images from natural history and cultural artifacts.
sLab: Smart Labeling of Family Photos Through an Interactive Interface
Ehsan Fazl-Ersi, I. Scott MacKenzie and John K. Tsotsos
Abstract: A novel technique for semi-automatic photo annotation is proposed and evaluated. The technique, sLab, uses face processing algorithms and a simplified user interface for labeling family photos. A user study compared our system with two others. One was Adobe Photoshop Elements. The other was an in-house implementation of a face clustering interface recently proposed in the research community. Nine participants performed an annotation task with each system on faces extracted from a set of 150 images from their own family photo albums. As the faces were all well known to participants, accuracy was near perfect with all three systems. On annotation time, sLab was 25% faster than Photoshop Elements and 16% faster than the face clustering interface.
Autotagging to Improve Text Search for 3D Models
Corey Goldfeder and Peter Allen
Abstract: The most natural user interface for searching libraries of 3D models is to use standard text queries. However, text search on 3D models has traditionally worked poorly, as text annotations on 3D models are often unreliable or incomplete. In this paper we attempt to improve the recall of text search by automatically assigning appropriate tags to models. Our algorithm finds relevant tags by appealing to a large corpus of partially labeled example models, which does not have to be preclassified or otherwise prepared. For this purpose we use a copy of Google 3DWarehouse, a library of user-contributed models which is publicly available on the Internet. Given a model to tag, we find geometrically similar models in the corpus, based on distances in a reduced-dimensional space derived from Zernike descriptors. The labels of these neighbors are used as tag candidates for the model with probabilities proportional to the degree of geometric similarity. We show experimentally that text-based search for 3D models using our computed tags can reproduce the power of geometry-based search. Finally, we demonstrate our 3D model search engine that uses this algorithm and discuss some implementation issues.
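The neighbor-based tag propagation step can be sketched as follows. The two-dimensional descriptor vectors and the 1/(1+distance) weighting are stand-ins for the paper's Zernike-descriptor space and similarity weighting, and the tag sets are made up for illustration.

```python
# Sketch of neighbor-based autotagging: candidate tags for a query model are
# collected from its geometrically nearest labeled neighbors, weighted by
# similarity. Descriptors and tags here are illustrative; a real system
# would use Zernike descriptors in a reduced-dimensional space.
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+)

def suggest_tags(query_desc, corpus, k=3):
    """Score tags from the k nearest labeled models; weight = 1/(1+distance)."""
    neighbors = sorted(corpus, key=lambda m: dist(query_desc, m["desc"]))[:k]
    scores = defaultdict(float)
    for m in neighbors:
        w = 1.0 / (1.0 + dist(query_desc, m["desc"]))
        for tag in m["tags"]:
            scores[tag] += w
    return sorted(scores, key=scores.get, reverse=True)

corpus = [
    {"desc": (0.1, 0.2),   "tags": ["chair", "furniture"]},
    {"desc": (0.15, 0.22), "tags": ["chair"]},
    {"desc": (0.9, 0.8),   "tags": ["airplane"]},
]
print(suggest_tags((0.12, 0.21), corpus, k=2))
```

The returned ranking can then feed an ordinary text index, which is what lets text queries approximate geometry-based search.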
Slide Image Retrieval: A Preliminary Study
Guo Min Liew and Min-Yen Kan
Abstract: We consider the task of automatic slide image retrieval, in which slide images are ranked for relevance against a textual query. Our implemented system, SLIDIR, caters specifically for this task, using features specifically designed for synthetic images embedded within slide presentations. We show promising results in both the ranking and binary relevance tasks and analyze the contribution of different features to task performance.
Perception-based Online News Extraction
Jinlin Chen and Keli Xiao
Abstract: A novel online news extraction approach based on human perception is presented in this paper. The approach simulates how a human perceives and identifies online news content. It first detects news areas based on content function, space continuity, and formatting continuity of news information. It further identifies detailed news content based on the position, format, and semantics of detected news areas. Experimental results show that our approach achieves much better performance (on average more than 99% in terms of F1 value) compared to previous approaches such as Tree Edit Distance and Visual Wrapper based approaches. Furthermore, our approach does not assume the existence of Web templates in the tested Web pages, as required by the Tree Edit Distance based approach, nor does it need training sets, as required by the Visual Wrapper based approach. The success of our approach demonstrates the strength of the perception-based Web information extraction methodology and represents a promising approach for automatic information extraction from sources whose presentation is designed for humans.
Plato: a service-oriented decision support system for preservation planning
Christoph Becker, Hannes Kulovits, Andreas Rauber and Hans Hofman
Abstract: The fast pace of technological change in today's information landscape has considerably shortened the lifespan of digital objects. Digital preservation has become a pressing challenge. Different strategies such as migration and emulation have been proposed; however, the decision for a specific tool, e.g., a format migration tool or an emulator, is very complex. The process of evaluating potential solutions against specific requirements and building a plan for preserving a given set of objects is called preservation planning. So far, it is a mainly manual, sometimes ad-hoc process with little or no tool support. This paper presents a service-oriented architecture and decision support tool that implements a solid preservation planning process and integrates services for content characterisation, preservation action, and automatic object comparison to provide maximum support for preservation planning endeavours.
Usage Analysis of a Public Website Reconstruction Tool
Frank McCown and Michael Nelson
Abstract: The Web is increasingly the medium by which information is published today, but due to its ephemeral nature, web pages and sometimes entire websites are often "lost" due to server crashes, viruses, hackers, run-ins with the law, bankruptcy and loss of interest. When a website is lost and backups are not available, an individual or third party can use Warrick to recover the website from several search engine caches and web archives (the Web Infrastructure). In this short paper, we present Warrick usage data obtained from Brass, a queueing system for Warrick hosted at Old Dominion University and made available to the public for free. Over the last six months, 520 individuals have reconstructed more than 700 websites with 800K resources from the Web Infrastructure. Sixty-two percent of the static web pages were recovered, and 42% of all the website resources were recovered. The Internet Archive was the largest contributor of recovered resources (78%).
Using Web Metrics to Analyze Digital Libraries
Michael Khoo, Joe Pagano, Anne Washington, Mimi Recker, Bart Palmer and Robert Donahue
Abstract: Web metrics tools and digital libraries vary widely in form and function. Bringing the two together is often not a straightforward exercise. This paper discusses the use of web metrics in the Instructional Architect, the Library of Congress, the National Science Digital Library, and WGBH Teachers' Domain. We explore similarities and differences in the use of web metrics across these libraries, and introduce a discussion of an emerging focus of web metrics research, the analysis of session time and page popularity. We conclude by discussing some of the current limitations and future possibilities of using web metrics to analyze and evaluate digital library use and impact.
A Lightweight Metadata Quality Tool
David Nichols, Chu-Hsiang Chan, David Bainbridge, Dana McKay and Michael Twidale
Abstract: We describe a Web-based metadata quality tool that provides statistical descriptions and visualisations of Dublin Core metadata harvested via the OAI protocol. The lightweight nature of development allows it to be used to gather contextualized requirements and some initial user feedback is discussed.
Improving Navigation Interaction in Digital Documents
George Buchanan and Tom Owen
Abstract: This paper investigates novel interactions for supporting within-document navigation. We first study navigation more broadly through interviews with intensive users of document reader applications. We then focus on a specific interaction: the following of figure references. This interaction is used to illuminate factors also found in other forms of navigation. Several alternative interactions for supporting figure navigation are described and evaluated through a user study. The experiments demonstrate the advantages of our interaction design, which can be applied to other navigation needs.
Keeping Narratives of a Desktop to Enhance Continuity of On-going Tasks
Youngjoo Park and Richard Furuta
Abstract: We describe a novel interface by which a user can browse, bookmark and retrieve previously used working environments, i.e., desktop states, enabling the retention of the history of use of various sets of information. Significant tasks often require reuse of (sets of) information that was used earlier. In particular, if a task involves extended interaction, the task's environment undergoes many changes and can become complex. Under the current prevailing desktop-based computing environment, after an interruption to the task users gain little assistance in returning to the context in which they previously worked. A user thus encounters increased discontinuity in continuing extended tasks.
Note-Taking, Selecting, and Choice: Designing Interfaces that Encourage Smaller Selections
Aaron Bauer and Kenneth Koedinger
Abstract: Our research evaluates the use of copy-paste functionality in note-taking applications. While pasting can be more efficient than typing, our studies indicate that it reduces attention. An initial interface we designed to encourage attention by reducing selection-size, which is negatively associated with learning, was resisted by students and produced poor learning. In this paper we present a design study intended to learn more about how students interact with note-taking interfaces and develop more user-friendly restrictions. We also report an experimental evaluation of interfaces derived from this design study. While we were able to produce interfaces that reduced selection size and improved satisfaction, the new interfaces did not improve learning. We suggest design recommendations derived from these studies, and describe a “selecting-to-read” behavior we encountered, which has implications for the design of reading and note-taking applications.
A Fedora Librarian Interface
David Bainbridge and Ian Witten
Abstract: The Fedora content management system embodies a powerful and flexible digital object model. This paper describes a new open-source software front-end that enables end-user librarians to transfer documents and metadata in a variety of formats into a Fedora repository. The main graphical facility that Fedora itself provides for this task operates on one document at a time and is not librarian-friendly. A batch-driven alternative is possible, but requires documents to be converted beforehand into the XML format used by the repository, necessitating programming skills. In contrast, our new scheme allows arbitrary collections of documents residing on the user's computer (or the web at large) to be ingested into a Fedora repository in one operation, without the need for programming expertise. Provision is also made for editing existing documents and metadata, and adding new ones. The documents can be in a wide variety of different formats, and the user interface is suitable for practicing librarians. The design capitalizes on our experience in building the Greenstone librarian interface and participating in dozens of workshops with librarians worldwide.