Project acronym: QUESTION-HOW
Project Full Title:Quality
Engineering Solutions via Tools, Information and Outreach for the
New Highly-enriched Offerings from W3C: Evolving the Web in
Europe
Project/Contract No.
IST-2000-28767
Workpackage 2, Deliverable D3-XML/RDF Digital
Libraries
Project Manager: Daniel Dardailler
<danield@w3.org>
Author of this document: Oreste Signore
(<oreste.signore@isti.cnr.it>), Paolo
Ferragina, Silvia Martelli,
Cristian Lucchesi, Marco
Andreini
Created: March 2003. Last updated: 04/09/2003 13.44
The main task has been to develop a user interface to query
complex and specialized XML documents corpora (like juridical
documents, cultural heritage cataloguing cards, user manuals,
etc.).
At the lower level, we have the XCDE library developed at the
Department of Computer Science, University of Pisa. The core of
the library (indexing and compressing algorithms) remain property
of their authors. The library is written in C and provides a set
of efficient algorithms and data structures for indexing and
searching an XML document collection.
The documents must be well-formed and may be heterogeneous in
that they may reflect different DTDs. The library supports the
storage and management of these XML files in native form,
operating directly at the File System level. The main features of
the library are: state-of-the-art algorithms and data structures
for text indexing, compressed space occupancy, and novel succinct
data structures for the management of the hierarchical structure
of the XML document.
The user interface gets the document structure from the
XMLSchema, and makes use of some additional info, represented as
RDF statements, to fire access to thesauri or dictionaries, for
better support to the user in composing the query, to achieve
more effective searches.
A document is made by three essential components:
Users are concerned with all these aspects.
Content is the most widely
addressed aspect. All traditional Information Retrieval Systems,
and also many search engines, perform querying looking for some
specific words contained in the document. This is often referred
as search by content. However, there are some drawbacks. The main
one is that searching for a word does not mean to search for a
specific concept. In fact, the semantics of a single word is
normally very poor, as we can have many cases of a single word
having different meanings, as well as we can be faced with the
case where the same word can be significant only if it appears in
a definite context. Just to make a simple example, think to a
search in a file of documents containing person addresses. The
same word, that in the user's mind is relevant to select a person
identified by his/her last name, can appear in the street or in
the location fields. Therefore, not surprisingly, documents are
often structured in parts, so that users can search terms in
specific document's components, reducing the risk of false
hits.
To get the goal of looking for the appropriate terms and concepts, it is essential to share the same knowledge base between user and indexer. This is usuallly done by having access to thesauri or dictionaries of terms, where the user can check the allowed terms and their semantic relationships. Sometimes, a description of the document is available, but in most cases the semantics of the fields is just left to their name: this is unacceptable in a truly World Wide Web environment. Even more complex is the case when we are faced not with a simple, flat document, but with a more sophisticated structured document. In this case, semantics is conveyed both by the field and by the structure, therefore, knowing the document structure is an essential need.
Just as a trivial example, think to a simple document representing a scientific paper. In a sequence like:
<paper>
<title>Metadata and Semantic Web</title>
<author>author1</author>
<author>author2</author>
<author>author3</author>
</paper>
the tag author conveys the semantics that author1 is the author of the article, and the same is for author2 and author3. However, their order can convey one additional information: author1 is the first author. Therefore, the user can be interested in searching for papers written by author1 AND author2, where author1 is the first author, and there are no more than three authors.
In passing, it is worthwhile noting that the tag author does not has any special meaning, as the user is supposed to be aware that it refers to the author of the paper. To have a really effective search, the user should be aware of the meaning of the tag, e.g. that its semantics is the same as the <dc:creator> tag as defined by the Dublin Core initiative. Hence, we need a semantic description of every tag and a clear understanding of the structure of the searched documents.
The Web is a World Wide Web, this means that it is not unusual the case when users are searching on multiple databases or document collections. As an obvious consequence of the intrinsic decentralized nature of the of the Web, we must expect that different document collections will have different tag names, as well as different structures, even if containing intrinsically homogeneous documents (like books, papers, laws, cultural heritage cataloguing cards). Document collections containing etherogeneous, but somehow related documents (e.g. laws, sentences, regulations) will have different structures, containing elements having the same semantics, but, quite obvioulsy, different names. Having the possibility of identifying semantic equivalences in different document collections, and hence having the possibility of searching and linking documents in these collections is a relevant issue in the Semantic Web panorama.
The system architecture is intended to be as open as possible, leaving the possibility of interfacing any kind of XML document collections, without imposing any manipulation of the schema, that is externally annotated. Some processsing occurs invoking services, identified by URIs.
A rough sketch of the architecture is in the following figure:
From the functional point of view, we can distinguish three levels:
All of them will be here briefly sketched. For a more complete description the reader is referred to the subsequent paragraphs.
At the Administration level, the administrator of the document collection is in charge of setting up and maintaining:
At the User level, we implement the human computer interaction. This implies:
At the Data Server level, we have, in the current implementation, the XCDE Library, that has been upgraded in this project. The main tasks to perform are:
The Query formalization is the "glue" that keeps together the
User Interface and the Data Server. The User Interface must be
capable of helping the user in expressing all the queries (s)he
is interested in, assuring their correctness and supplying almost
all the features supported by the search engine in the
background.
The way of composing the query is absolutely general, and does
not depend on the specific search engine, but only on the
supported functionalities.
In the following sections we will biefly describe the query
composition and the main features of queries in XCDE.
The query is composed by navigating the document structure and interacting via the Query Form.
A Query (identified as Q01, ..., Q99) is a boolean composition (with parentheses) of queryFragments. By default, queryFragments are ANDed. The user can edit the query. A query is produced when the user specifies: buildQuery.
A queryFragment is a boolean composition (with parentheses) of queryTokens. The queryFragment is constructed automatically in a navigation upon the documentTree. By default, the queryTokens are ANDed. The user can edit the queryFragment, inserting parentheses and modifying the Boolean operator. Every queryFragment is identified as: qF01, ..., qF99. A queryFragment is started when user selects: newQueryFragment (so starting a new navigation upon the documentTree).
The queryToken is the most
elementary component of the query. A queryToken is built by a
single interaction with the queryForm. Would the user like to
specify a condition upon several instances of an element, (s)he
will initiate a new interaction upon the same element.
For example, supposing to have a document where the element
<author> contains just one
of the authors, and is therefore repeated as many times as the
authors are, to search for papers having as <author> "X" AND "Y", the user
will have to specify two query tokens: <author>="X" AND <author>="Y", while <author>= "X" AND "Y" will be
meaningful only if the element <author> contains all the
authors. What will be the right way to formulate the query, will
be clear by the awareness of the document structure and content,
as explained by the longDescription property.
Every queryToken is identified as: qT01, ..., qT99.
After the documents have been compressed by the XCDE library, and indexes have been built, users can issue queries over the original XML document by using the XCDE query engine. XQuery is a powerful query language but it seems to be designed having in mind XML documents derived from structured data. Conversely XML is text based and thus its query language should be oriented mainly to the processing and management of textual data. Our goal has been therefore to design a query language which is simple to be used and IR oriented. Its specialties are the following.
To build an effective user interface, we must consider two
issues.
On one hand, the interface must be tailored to the user needs
and characteristics, that can be specified in an appropriate
User Profile.
On the other hand, characteristics of the elements must be
described by appropriate Element
Metadata. Such metadata will be used for:
User profile is needed to personalize the level of detail of
information supplied at the user interface level, the interaction
language, the presentation of returned documents.
In the present implementation, the User Profile has been kept as
simple as possible, leaving room for subsequent improvements.
| Value | Description |
|---|---|
| level | novice|expert. Default=novice. Novice user will get a full description of the fields. Expert user will get just labels (more expressive than just tag names). |
| lang | the preferred language for interaction. This means that user will get field description, messages and help in the selected language (if available). If information in the specified language is not available, or the parameter is not set, the default language for system dependent messages will be the languase set for the browser, while for document collection related info it will be English (if available) or the first available. |
| defaultRecordFormat | short|long|full|off Default: off Specifies the default record format. |
In the Query Form, all the nodes belonging to the selected
defaultRecordFormat will appear flagged as elements to return.
The user can deselect unwanted elements and/or flag desired
elements.
The combined effect of the specified defaultRecordFormat and of
the user actions is summarized in the following table:
| defaultRecordFormat | Return (user specified) | Returned elements |
|---|---|---|
| short|long|full | no | Short|long|full |
| short|long|full | yes | The system will return all the elements flagged as elements
to return, either by the user or by default as belonging to the
current defaultRecordFormat. If, finally, no elements will result to be flagged as elements to return, an error message will be sent. |
| off | no | Queried nodes, with an explanatory message ("By default the system is returning all queried nodes") for novice users |
| off | yes | Nodes selected as nodes to return. |
Every document element is assigned to a record format. Record
format "full" includes record format "long", which, in turn,
includes record format "short".
At least one element must be assigned to the short
recordFormat.
Metadata are essential to describe data and tailor user
interface and system behaviour. Some of them can be derived by
the XML Schema, while others can only be specified by an expert
of the domain, and have no way to be properly coded in the schema
itself.
In addition, we note that a basic design issue is to leave the
XML Schema unchanged.
From the XMLSchema we can take some properties of elements
and attributes, like range, multiplicity, pattern, list of
admitted values, content type ( text, element, mixed).
We can have some additional metadata to describe properties of
elements and attributes in a RDF file. The RDF should contain
information to univocally identify elements and attributes,
descriptions to help users to understand their meaning,
tranformation rules (e.g. dates can be stored in gregorian
calendar, but queried and displayed in a different calendar, like
muslin or jewish), kind of control, if any (list of values,
controlled vocabularies, ontologies, etc.).
In the following sections, we will supply a rather informal
description of the properties.
| Property | Description |
|---|---|
| qh:fullyQualifiedName | XPath-like expression to unambiguously identify the element or attribute in the document context. |
| qh:localName | Name of the element or attribute as declared in the schema. |
| qh:kind | Specifies if the resource whose properties are going to be given is an element or an attribute. |
| qh:elementaryType | To be used to closely define a data type in case it has been loosely defined in the XML schema. Admitted values (e.g. xs:gYear) are the primitive datatypes as specified in the XML Schema Part 2: Datatypes W3C Recommendation. |
| L. | Property | Description |
|---|---|---|
| 1 | qh:label + xml:lang attribute |
An expression in the specified language, to be used instead of the tag name in the query form. It should be reasonably expressive, even if short. The full explanation will be in the longDescription. |
| 1 | qh:longDescription + xml:lang attribute |
An expression in the specified language, to be used to better
describe the element or attribute. Can be showed to novice users, and used as help on the field for expert users. |
| 1 | qh:searchable | YES | NO Default: YES States if the user is allowed to formulate a query on the element or attribute. In the sample schema, it is easily seen that it doesn't make sense to issue a query on the content of <em> tag in a document chunk like: ...<p> anytext jhsdfhljfwaljl<em>kjssjhd</em>dh,khfdh</p>. |
| 1 | qh:moreDescendents | YES | NO Default: YES Useful to state that all the descendent elements will not be showed/queried. A typical example is a <p> element with mixed anyXML content. Coding <qh:searchable>no</qh:searchable> for all descendents has the same effect. |
| 1 | qh:controlledBy | A multiple property describing control information on the content of elements or attributes. |
| 2 | qh:mode | States the way the element is controlled: thesaurus | dictionary | externalList | localList |
| 2 | qh:serviceParms | URI referring info needed by an external service that will
perform the check (if qh:mode value is "thesaurus", "dictionary" or "externalList"). |
| 2 | qh:valueList | If qh:mode is "localList" the list of allowed values is a set of values identified by the <qh:vI> tag. |
| 3 | qh:vI | A value in the list of allowed values |
| 1 | qh:constrains | Inverse property of qh:constrainedBy. The query form will show a message saying that the specified value constrains values taken by other elements or attributes. |
| 1 | qh:constrainedBy | The query form will show the message "Constrained field" saying that allowed values are constrained by values taken by other elements or attributes. |
| 2 | qh:constraint | Property describing the constraint. |
| 3 | qh:element | The nodeID of the constrained element or of the constraining element, depending on the qh:constraint property occurs under the qh:constrains or qh:constrainedBy property. |
| 3 | qh:constraintExpr + xml:lang attribute |
A natural language expression in the specified language,
describing the constraint. The expression is showed in the query form to the novice user, and to the expert user in case of error. |
| 3 | qh:checkedBy | URI referring a service that will perform the check. |
Constraints are described making use of a natural language expression to avoid excessive complexity in stating the constraint using RDF. In a future release, the possibility of stating in a purely declarative way some simple and common constraints (e.g. existence constraints, elements whose values must be in a well defined equivalence relation, etc.) will be considered.
| Property | Description |
|---|---|
| qh:order | relevant | notRelevant Useful when the sequence of element occurrences conveys a semantic information (e.g. the first author, two deseases in a patient record, etc.) |
| qh:externalFormat | Module to perform transformation (e.g. calendar transformation, date format). Some of these transformations can make use of library functions like: julianToGregorian, ddmmyyToMmddyyyy, etc. |
| Property | Description |
|---|---|
| qh:postProcService | URI referring a service that will perform
postprocessing. A typical example could be create hyperlinks based on values taken by an element, or activate transformations, either via a direct processing as well as invoking XSLT. |
| qh:emphasize | YES | NO Default: NO If YES, matching items (words, or truncated words, or phrases) are emphasized. |
| qh:recfm | short|long|full. Specifies if the element is included in short, long or full recordFormat. Default is "full". Any element included in short recfm is also included into long and full recfm. At least one element must be specified as to be included in the short recfm. |
The user interface can be tested looking at the demo.
The first step in using the system is to select the document collection and setting up the User Profile.
Subsequently, the user will get, on the left of the screen,
the document tree, with nodes that can be expanded or
compressed.
For each node, a click will activate, on the right part of the
screen, the Query Form, where the user
can enter the desired values and operand to build a queryToken.
In composing the queryToken, user can select/deselect the element
as a returned element, asking for support from a dictionary or
thesaurus.
queryTokens can be composed into a queryFragment, which, in turn, can be
composed into queries to submit.
User can edit queryFragments and
query, adding parentheses and inserting or modifying boolean
operators.
To modify a queryToken, user must go back to the token and
interact via the query Form.
The library provides a set of algorithms and data structures for indexing and searching a collection of XML documents. The documents must be well-formed and may be heterogeneous in that they may reflect different DTDs o XML Schema. The library supports the storage and management of these XML files in native form by operating directly at the File System level. The main features are: state-of-the-art algorithms and data structures for text indexing, compressed space occupancy, and novel succinct data structures for the management of the hierarchical structure of an XML document.
The main indexing data structure is the classical Inverted List where proper compression techniques are used to keep the posting lists in succinct space. Words, tags, attributes and their values are indexed using independent indices of this type. As a result prefix, suffix, substring, regular expression, approximate and proximity searches on the textual content of the XML document as well on the attribute values can be executed in an efficient way. Resolving structural queries on tag paths can be also done efficiently by using a novel indexing of the hierarchical structure of the XML document based on geometric data structures.
The original text is compressed in such a way that random accesses to some of its parts are supported without decompressing the whole file. This is helpful for fast displaying the context of a query occurrence when visualizing the query results (usually called the snippet of the occurrence). Overall both the compressed XML document plus all of its indices occupy about the original file size. This is a significant space reduction with respect to other XML native indexers, like NATIX, for which the factor is at least 3, or is much less than the indexers based on Relational or OO Databases which abundantly surpass the factor 10. Clearly space reduction has an impact on the final performance because it allows to increase the disk bandwidth and to better exploit the buffering/caching strategies of the underlying OS and the hierarchy of memory levels. Furthermore, space reduction plays a crucial role in applications which run on small-memory devices like, for example, PDA and e-book.
One of the basic choices of the XCDE library is to operate at the granularity of the single document. This choice has one significant disadvantage and various advantages. It is clear that this fine granularity makes the searches on collections formed by many small documents much time consuming. However, on the other hand, single file indexing allows to easily implement distributed indices, to trivially update the index as a result of insertion/deletion of a document, to easily take care of the heterogeneity of XML documents, and also, to customize the indices according to the individual document features. It goes without saying that we might still index numerous XML documents without incurring in a significant query slowdown, in two different ways: either by "merging" small documents into bigger ones in order to reduce their overall number, or by adding a "light" global index which filters out some parts of the collection before the search on single documents is executed. The former approach might deploy a properly devised xml tagging which preserves the individuality of each document within a unique physical file containing the whole collection.
The library provides an API with a rich set of C functions to operate on its whole collection of data structures and algorithms. It may implement most of the basic functionalities of XQuery, and support more complex IR-like searches. The aim of its design is to manage efficiently and effectively XML documents that contain significant textual parts and/or a deeply nested hierarchical structure, and whose attribute values are complicated strings of numbers and letters. This is the typical scenario encountered in literary texts tagged via XML-TEI, one of our motivating application. The single-document granularity offered by our indices perfectly fits in this framework because it easily allows the humanists to restrict the searches on portions of the collection according to some criteria specified at query time: writing period, writing type, author, collections,...
Of course, the XCDE library is intended just as a kernel of a more complex XML document engine. It has been developed with the aim of providing researchers with a basic library for designing more complex applications. Its functionalities have been deployed to implement a sort of a mixed query language: XQuery + IR. The driving document is the W3C Working Draft of XQuery and XPath Full-Text Requirements .
The implementation environment is:
The present version depends on parsedXML, using its DOM functions to access the XML schema. Next releases will make use of SAX functions embedded in Python.
The following components have been completed:
See the demo at http://www.weblab.isti.cnr.it/projects/QH/demo/schema
To provide a demo of our system we have used the XML Schema
provided for the sample data in "XQuery
and XPath Full-Text Use Cases - W3C Working Draft 14 February
2003".
For the data we have used a variant of the sample data due in
that document.
| Reference XML Schema | fullText.xsd |
| Reference data | books.xml |
| Reference RDF file | fullText.rdf |
A (partial) graphical representation of the schema and its properties is:
Delayed due to late signature of the contract.