User Interface in Digital Libraries

Project acronym: QUESTION-HOW
Project Full Title:Quality Engineering Solutions via Tools, Information and Outreach for the New Highly-enriched Offerings from W3C: Evolving the Web in Europe
Project/Contract No. IST-2000-28767
Workpackage 2, Deliverable D3-XML/RDF Digital Libraries

Project Manager: Daniel Dardailler <danield@w3.org>
Author of this document: Oreste Signore (<oreste.signore@isti.cnr.it>), Paolo Ferragina, Silvia Martelli, Cristian Lucchesi, Marco Andreini

Created: March 2003. Last updated: 04/09/2003 13.44


Document Versions:

Table of Content


Introduction

The main task has been to develop a user interface to query complex and specialized XML documents corpora (like juridical documents, cultural heritage cataloguing cards, user manuals, etc.).
At the lower level, we have the XCDE library developed at the Department of Computer Science, University of Pisa. The core of the library (indexing and compressing algorithms) remain property of their authors. The library is written in C and provides a set of efficient algorithms and data structures for indexing and searching an XML document collection.
The documents must be well-formed and may be heterogeneous in that they may reflect different DTDs. The library supports the storage and management of these XML files in native form, operating directly at the File System level. The main features of the library are: state-of-the-art algorithms and data structures for text indexing, compressed space occupancy, and novel succinct data structures for the management of the hierarchical structure of the XML document.
The user interface gets the document structure from the XMLSchema, and makes use of some additional info, represented as RDF statements, to fire access to thesauri or dictionaries, for better support to the user in composing the query, to achieve more effective searches.


User Requirements

A document is made by three essential components:

Users are concerned with all these aspects.
Content is the most widely addressed aspect. All traditional Information Retrieval Systems, and also many search engines, perform querying looking for some specific words contained in the document. This is often referred as search by content. However, there are some drawbacks. The main one is that searching for a word does not mean to search for a specific concept. In fact, the semantics of a single word is normally very poor, as we can have many cases of a single word having different meanings, as well as we can be faced with the case where the same word can be significant only if it appears in a definite context. Just to make a simple example, think to a search in a file of documents containing person addresses. The same word, that in the user's mind is relevant to select a person identified by his/her last name, can appear in the street or in the location fields. Therefore, not surprisingly, documents are often structured in parts, so that users can search terms in specific document's components, reducing the risk of false hits.

To get the goal of looking for the appropriate terms and concepts, it is essential to share the same knowledge base between user and indexer. This is usuallly done by having access to thesauri or dictionaries of terms, where the user can check the allowed terms and their semantic relationships. Sometimes, a description of the document is available, but in most cases the semantics of the fields is just left to their name: this is unacceptable in a truly World Wide Web environment. Even more complex is the case when we are faced not with a simple, flat document, but with a more sophisticated structured document. In this case, semantics is conveyed both by the field and by the structure, therefore, knowing the document structure is an essential need.

Just as a trivial example, think to a simple document representing a scientific paper. In a sequence like:

<paper>
<title>Metadata and Semantic Web</title>
<author>author1</author>
<author>author2</author>
<author>author3</author>
</paper>

the tag author conveys the semantics that author1 is the author of the article, and the same is for author2 and author3. However, their order can convey one additional information: author1 is the first author. Therefore, the user can be interested in searching for papers written by author1 AND author2, where author1 is the first author, and there are no more than three authors.

In passing, it is worthwhile noting that the tag author does not has any special meaning, as the user is supposed to be aware that it refers to the author of the paper. To have a really effective search, the user should be aware of the meaning of the tag, e.g. that its semantics is the same as the <dc:creator> tag as defined by the Dublin Core initiative. Hence, we need a semantic description of every tag and a clear understanding of the structure of the searched documents.

The Web is a World Wide Web, this means that it is not unusual the case when users are searching on multiple databases or document collections. As an obvious consequence of the intrinsic decentralized nature of the of the Web, we must expect that different document collections will have different tag names, as well as different structures, even if containing intrinsically homogeneous documents (like books, papers, laws, cultural heritage cataloguing cards). Document collections containing etherogeneous, but somehow related documents (e.g. laws, sentences, regulations) will have different structures, containing elements having the same semantics, but, quite obvioulsy, different names. Having the possibility of identifying semantic equivalences in different document collections, and hence having the possibility of searching and linking documents in these collections is a relevant issue in the Semantic Web panorama.


Architecture: general overview

The system architecture is intended to be as open as possible, leaving the possibility of interfacing any kind of XML document collections, without imposing any manipulation of the schema, that is externally annotated. Some processsing occurs invoking services, identified by URIs.

A rough sketch of the architecture is in the following figure:

a rough sketch of the architecture

From the functional point of view, we can distinguish three levels:

  1. Administration level
  2. User level
  3. Data Server level

All of them will be here briefly sketched. For a more complete description the reader is referred to the subsequent paragraphs.

At the Administration level, the administrator of the document collection is in charge of setting up and maintaining:

This information is used by the system to produce the Query Form, that will contain all the information needed by the user agent (the browser in a first implementation) to implement the User Interface. This means that the Query Form will contain all the constraints applicable to the single elements, as well as the information needed to enable the user to formulate semantically correct queries. A semantically correct query is a query that specifies consistent conditions both on the content of the elements, as well as on the structure. Needless to say, a semantically correct query need not necessarily return a document set, it may happen that the result set is empty, but this is different from the case when the returned document set is empty because user specified unacceptable conditions (e.g. a condition like: "PY=02" instead of "PY=2002", where PY is the Publication Year, coded, in the particular document collection, as a four digits number).

At the User level, we implement the human computer interaction. This implies:

At the Data Server level, we have, in the current implementation, the XCDE Library, that has been upgraded in this project. The main tasks to perform are:


Querying the Document Collection

The Query formalization is the "glue" that keeps together the User Interface and the Data Server. The User Interface must be capable of helping the user in expressing all the queries (s)he is interested in, assuring their correctness and supplying almost all the features supported by the search engine in the background.
The way of composing the query is absolutely general, and does not depend on the specific search engine, but only on the supported functionalities.
In the following sections we will biefly describe the query composition and the main features of queries in XCDE.

Query Composition

The query is composed by navigating the document structure and interacting via the Query Form.

A Query (identified as Q01, ..., Q99) is a boolean composition (with parentheses) of queryFragments. By default, queryFragments are ANDed. The user can edit the query. A query is produced when the user specifies: buildQuery.

A queryFragment is a boolean composition (with parentheses) of queryTokens. The queryFragment is constructed automatically in a navigation upon the documentTree. By default, the queryTokens are ANDed. The user can edit the queryFragment, inserting parentheses and modifying the Boolean operator. Every queryFragment is identified as: qF01, ..., qF99. A queryFragment is started when user selects: newQueryFragment (so starting a new navigation upon the documentTree).

The queryToken is the most elementary component of the query. A queryToken is built by a single interaction with the queryForm. Would the user like to specify a condition upon several instances of an element, (s)he will initiate a new interaction upon the same element.
For example, supposing to have a document where the element <author> contains just one of the authors, and is therefore repeated as many times as the authors are, to search for papers having as <author> "X" AND "Y", the user will have to specify two query tokens: <author>="X" AND <author>="Y", while <author>= "X" AND "Y" will be meaningful only if the element <author> contains all the authors. What will be the right way to formulate the query, will be clear by the awareness of the document structure and content, as explained by the longDescription property.
Every queryToken is identified as: qT01, ..., qT99.

Queries in XCDE

After the documents have been compressed by the XCDE library, and indexes have been built, users can issue queries over the original XML document by using the XCDE query engine. XQuery is a powerful query language but it seems to be designed having in mind XML documents derived from structured data. Conversely XML is text based and thus its query language should be oriented mainly to the processing and management of textual data. Our goal has been therefore to design a query language which is simple to be used and IR oriented. Its specialties are the following.


The Administration Level

Generalities

To build an effective user interface, we must consider two issues.
On one hand, the interface must be tailored to the user needs and characteristics, that can be specified in an appropriate User Profile.
On the other hand, characteristics of the elements must be described by appropriate Element Metadata. Such metadata will be used for:

In the following sections we will informally describe such metadata.

User Profile

User profile is needed to personalize the level of detail of information supplied at the user interface level, the interaction language, the presentation of returned documents.
In the present implementation, the User Profile has been kept as simple as possible, leaving room for subsequent improvements.

User Profile Elements

Value Description
level novice|expert.
Default=novice.
Novice user will get a full description of the fields. Expert user will get just labels (more expressive than just tag names).
lang the preferred language for interaction. This means that user will get field description, messages and help in the selected language (if available). If information in the specified language is not available, or the parameter is not set, the default language for system dependent messages will be the languase set for the browser, while for document collection related info it will be English (if available) or the first available.
defaultRecordFormat short|long|full|off
Default: off
Specifies the default record format.

In the Query Form, all the nodes belonging to the selected defaultRecordFormat will appear flagged as elements to return. The user can deselect unwanted elements and/or flag desired elements.
The combined effect of the specified defaultRecordFormat and of the user actions is summarized in the following table:

defaultRecordFormat Return (user specified) Returned elements
short|long|full no Short|long|full
short|long|full yes The system will return all the elements flagged as elements to return, either by the user or by default as belonging to the current defaultRecordFormat.
If, finally, no elements will result to be flagged as elements to return, an error message will be sent.
off no Queried nodes, with an explanatory message ("By default the system is returning all queried nodes") for novice users
off yes Nodes selected as nodes to return.

Every document element is assigned to a record format. Record format "full" includes record format "long", which, in turn, includes record format "short".
At least one element must be assigned to the short recordFormat.

Element Metadata

Metadata are essential to describe data and tailor user interface and system behaviour. Some of them can be derived by the XML Schema, while others can only be specified by an expert of the domain, and have no way to be properly coded in the schema itself.
In addition, we note that a basic design issue is to leave the XML Schema unchanged.
From the XMLSchema we can take some properties of elements and attributes, like range, multiplicity, pattern, list of admitted values, content type ( text, element, mixed).
We can have some additional metadata to describe properties of elements and attributes in a RDF file. The RDF should contain information to univocally identify elements and attributes, descriptions to help users to understand their meaning, tranformation rules (e.g. dates can be stored in gregorian calendar, but queried and displayed in a different calendar, like muslin or jewish), kind of control, if any (list of values, controlled vocabularies, ontologies, etc.).
In the following sections, we will supply a rather informal description of the properties.

Identification of elements

Property Description
qh:fullyQualifiedName XPath-like expression to unambiguously identify the element or attribute in the document context.
qh:localName Name of the element or attribute as declared in the schema.
qh:kind Specifies if the resource whose properties are going to be given is an element or an attribute.
qh:elementaryType To be used to closely define a data type in case it has been loosely defined in the XML schema. Admitted values (e.g. xs:gYear) are the primitive datatypes as specified in the XML Schema Part 2: Datatypes W3C Recommendation.

Metadata for interface building

To achieve a greater readability, the first column (L.) in the following table indicates the level of nesting of properties.
L. Property Description
1 qh:label
+
xml:lang attribute
An expression in the specified language, to be used instead of the tag name in the query form. It should be reasonably expressive, even if short. The full explanation will be in the longDescription.
1 qh:longDescription
+
xml:lang attribute
An expression in the specified language, to be used to better describe the element or attribute.
Can be showed to novice users, and used as help on the field for expert users.
1 qh:searchable YES | NO
Default: YES
States if the user is allowed to formulate a query on the element or attribute.
In the sample schema, it is easily seen that it doesn't make sense to issue a query on the content of <em> tag in a document chunk like:
...<p> anytext jhsdfhljfwaljl<em>kjssjhd</em>dh,khfdh</p>.
1 qh:moreDescendents YES | NO
Default: YES
Useful to state that all the descendent elements will not be showed/queried. A typical example is a <p> element with mixed anyXML content.
Coding <qh:searchable>no</qh:searchable> for all descendents has the same effect.
1 qh:controlledBy A multiple property describing control information on the content of elements or attributes.
2 qh:mode States the way the element is controlled:
thesaurus | dictionary | externalList | localList
2 qh:serviceParms URI referring info needed by an external service that will perform the check
(if qh:mode value is "thesaurus", "dictionary" or "externalList").
2 qh:valueList If qh:mode is "localList" the list of allowed values is a set of values identified by the <qh:vI> tag.
3 qh:vI A value in the list of allowed values
1 qh:constrains Inverse property of qh:constrainedBy.
The query form will show a message saying that the specified value constrains values taken by other elements or attributes.
1 qh:constrainedBy The query form will show the message "Constrained field" saying that allowed values are constrained by values taken by other elements or attributes.
2 qh:constraint Property describing the constraint.
3 qh:element The nodeID of the constrained element or of the constraining element, depending on the qh:constraint property occurs under the qh:constrains or qh:constrainedBy property.
3 qh:constraintExpr
+
xml:lang attribute
A natural language expression in the specified language, describing the constraint.
The expression is showed in the query form to the novice user, and to the expert user in case of error.
3 qh:checkedBy URI referring a service that will perform the check.

Constraints are described making use of a natural language expression to avoid excessive complexity in stating the constraint using RDF. In a future release, the possibility of stating in a purely declarative way some simple and common constraints (e.g. existence constraints, elements whose values must be in a well defined equivalence relation, etc.) will be considered.

Metadata for preprocessing

Property Description
qh:order relevant | notRelevant
Useful when the sequence of element occurrences conveys a semantic information (e.g. the first author, two deseases in a patient record, etc.)
qh:externalFormat Module to perform transformation (e.g. calendar transformation, date format). Some of these transformations can make use of library functions like: julianToGregorian, ddmmyyToMmddyyyy, etc.

Metadata for postprocessing

Property Description
qh:postProcService URI referring a service that will perform postprocessing.
A typical example could be create hyperlinks based on values taken by an element, or activate transformations, either via a direct processing as well as invoking XSLT.
qh:emphasize YES | NO
Default: NO
If YES, matching items (words, or truncated words, or phrases) are emphasized.
qh:recfm short|long|full.
Specifies if the element is included in short, long or full recordFormat. Default is "full". Any element included in short recfm is also included into long and full recfm. At least one element must be specified as to be included in the short recfm.

The User Level

The user interface can be tested looking at the demo.

The first step in using the system is to select the document collection and setting up the User Profile.

Subsequently, the user will get, on the left of the screen, the document tree, with nodes that can be expanded or compressed.
For each node, a click will activate, on the right part of the screen, the Query Form, where the user can enter the desired values and operand to build a queryToken.
In composing the queryToken, user can select/deselect the element as a returned element, asking for support from a dictionary or thesaurus.
queryTokens can be composed into a queryFragment, which, in turn, can be composed into queries to submit.
User can edit queryFragments and query, adding parentheses and inserting or modifying boolean operators.
To modify a queryToken, user must go back to the token and interact via the query Form.

The Data Server Level

The XCDE Library

The library provides a set of algorithms and data structures for indexing and searching a collection of XML documents. The documents must be well-formed and may be heterogeneous in that they may reflect different DTDs o XML Schema. The library supports the storage and management of these XML files in native form by operating directly at the File System level. The main features are: state-of-the-art algorithms and data structures for text indexing, compressed space occupancy, and novel succinct data structures for the management of the hierarchical structure of an XML document.

The main indexing data structure is the classical Inverted List where proper compression techniques are used to keep the posting lists in succinct space. Words, tags, attributes and their values are indexed using independent indices of this type. As a result prefix, suffix, substring, regular expression, approximate and proximity searches on the textual content of the XML document as well on the attribute values can be executed in an efficient way. Resolving structural queries on tag paths can be also done efficiently by using a novel indexing of the hierarchical structure of the XML document based on geometric data structures.

The original text is compressed in such a way that random accesses to some of its parts are supported without decompressing the whole file. This is helpful for fast displaying the context of a query occurrence when visualizing the query results (usually called the snippet of the occurrence). Overall both the compressed XML document plus all of its indices occupy about the original file size. This is a significant space reduction with respect to other XML native indexers, like NATIX, for which the factor is at least 3, or is much less than the indexers based on Relational or OO Databases which abundantly surpass the factor 10. Clearly space reduction has an impact on the final performance because it allows to increase the disk bandwidth and to better exploit the buffering/caching strategies of the underlying OS and the hierarchy of memory levels. Furthermore, space reduction plays a crucial role in applications which run on small-memory devices like, for example, PDA and e-book.

One of the basic choices of the XCDE library is to operate at the granularity of the single document. This choice has one significant disadvantage and various advantages. It is clear that this fine granularity makes the searches on collections formed by many small documents much time consuming. However, on the other hand, single file indexing allows to easily implement distributed indices, to trivially update the index as a result of insertion/deletion of a document, to easily take care of the heterogeneity of XML documents, and also, to customize the indices according to the individual document features. It goes without saying that we might still index numerous XML documents without incurring in a significant query slowdown, in two different ways: either by "merging" small documents into bigger ones in order to reduce their overall number, or by adding a "light" global index which filters out some parts of the collection before the search on single documents is executed. The former approach might deploy a properly devised xml tagging which preserves the individuality of each document within a unique physical file containing the whole collection.

The library provides an API with a rich set of C functions to operate on its whole collection of data structures and algorithms. It may implement most of the basic functionalities of XQuery, and support more complex IR-like searches. The aim of its design is to manage efficiently and effectively XML documents that contain significant textual parts and/or a deeply nested hierarchical structure, and whose attribute values are complicated strings of numbers and letters. This is the typical scenario encountered in literary texts tagged via XML-TEI, one of our motivating application. The single-document granularity offered by our indices perfectly fits in this framework because it easily allows the humanists to restrict the searches on portions of the collection according to some criteria specified at query time: writing period, writing type, author, collections,...

Of course, the XCDE library is intended just as a kernel of a more complex XML document engine. It has been developed with the aim of providing researchers with a basic library for designing more complex applications. Its functionalities have been deployed to implement a sort of a mixed query language: XQuery + IR. The driving document is the W3C Working Draft of XQuery and XPath Full-Text Requirements .


Implementation

The implementation environment is:



The present version depends on parsedXML, using its DOM functions to access the XML schema. Next releases will make use of SAX functions embedded in Python.

The following components have been completed:

See the demo at http://www.weblab.isti.cnr.it/projects/QH/demo/schema


Case Study

To provide a demo of our system we have used the XML Schema provided for the sample data in "XQuery and XPath Full-Text Use Cases - W3C Working Draft 14 February 2003".
For the data we have used a variant of the sample data due in that document.

Reference XML Schema fullText.xsd
Reference data books.xml
Reference RDF file fullText.rdf


A (partial) graphical representation of the schema and its properties is:

schema with its properties

Deviations from plan

Delayed due to late signature of the contract.