XSet: a Search Engine on Treaps for XML
Project Overview
Ben Yanbin Zhao
Last updated: January 13th, 1999
 

The official XSet homepage resides at:
http://www.cs.berkeley.edu/~ravenben/xset

Online Demo:
The demo makes a lot more sense if you read the demo documentation here beforehand.
Currently, the XSet demo is running on 3 Ninja machines, accessible here:

Ninja1  Ninja2   Ninja3
Please email me with any problems or suggestions.

Introduction:
This work was done as my Masters project, and owes its origin to my service discovery work in the Ninja framework. I would encourage anyone interested to read about the Ninja project (in particular the Service Discovery Service) to get better context on the XSet project.

Motivation:
Much has been said about the power of XML as a flexible framework, and I will leave it to others to demonstrate/expound on the virtues of XML.  Web sites such as Robin Cover's SGML/XML page provide a complete list of XML references and projects.

I think that there are several major advantages to XML that make it stand out from the plethora of new languages and standards.  In particular, there are three that I think make XML particularly useful to operating systems research and applications development.  First, XML is text based, and therefore human readable and low overhead.  Second, it has a major advantage over other coordination frameworks, or database type organization structures, in its extensibility and its ability to support evolving schemas.  And finally, it's self-describing, and lends itself in a natural way to the use of standards to bridge interoperability gaps.

Having gone over the basics, it's not a stretch to say that XML's characteristics make it useful and applicable to a wide array of computing applications and research, from top level programming model and language interface issues, to low level mechanisms such as file systems.  All of these levels of applications will need an easy way to access, store, and search XML documents.  With such a vertical slice through the computing structure, we need a mechanism that provides the XML functionality with minimum overhead and maximum performance.

It's interesting to note that the standard relational database model does not work well with the hierarchical structure of the XML language.  And while object-oriented databases have the ability to represent XML's recursive structure, their representation will have to deal with a serious issue of overhead.  So with this in mind, XSet's purpose is to fill that void, by implementing a brand new database model and providing a low-level utility layer for applications across the spectrum.

Related Works:
To be completed in full:


Internal Design:
The XSet search engine strives to define an efficient way to compare and search for XML documents. The query language derives naturally from the XML language. Queries are written as incomplete but well-formed XML documents. These documents represent the fields and values known at the time of the search, and the results are supersets of the queries.  Currently, range queries (and additional future query functionality) are encoded as special attributes processed by the XSet server.  This may change in time (see future work section).
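
To make the "results are supersets of the queries" rule concrete, here is a minimal sketch in Python. It is only an illustration of the matching rule, not the XSet server code: the function name matches() and the printer/location/room tags are hypothetical, and attributes are compared for plain equality (the special range-query attributes are not modeled).

```python
import xml.etree.ElementTree as ET

def matches(query, doc):
    """Return True if element `doc` contains everything in element `query`."""
    if query.tag != doc.tag:
        return False
    # Every attribute named in the query must appear in the document
    # with the same value (assumption: plain equality only).
    for name, value in query.attrib.items():
        if doc.attrib.get(name) != value:
            return False
    # If the query gives a text value, the document must give the same one.
    query_text = (query.text or "").strip()
    if query_text and query_text != (doc.text or "").strip():
        return False
    # Every child of the query must be matched by some child of the document.
    return all(any(matches(qc, dc) for dc in doc) for qc in query)

# Hypothetical document and queries, made up for this example.
doc = ET.fromstring(
    "<printer><name>lw-443</name>"
    "<location><building>Soda</building><room>443</room></location>"
    "<color>yes</color></printer>"
)
query = ET.fromstring("<printer><location><room>443</room></location></printer>")
miss = ET.fromstring("<printer><location><room>511</room></location></printer>")
```

With these inputs, matches(query, doc) holds because the document contains every field the query names, while matches(miss, doc) does not because the room value differs.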

The server's main function is to provide an interface for creating, updating and performing queries on a virtual database through a large number of indices. These indices are organized hierarchically, to encompass and maintain the recursive structure of XML elements.  This hierarchy uses hashtables (one for each parent tag), to provide fast access while maintaining efficient memory usage.  Hashtables are reasonable data structure choices here, since the number of distinct XML tag names has an upper bound, defined by the total cardinality of all DTDs supported.  (This does not change appreciably even after I integrate support for XML namespaces.)
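
The hierarchy of per-parent hashtables can be sketched roughly as follows. This is an assumption-laden illustration in Python: TagNode, add_document() and lookup() are hypothetical names, and a plain dictionary of values stands in for the leaf structure, which in XSet is a treap.

```python
import xml.etree.ElementTree as ET

class TagNode:
    """One level of the tag hierarchy (hypothetical name)."""
    def __init__(self):
        self.children = {}  # child tag name -> TagNode: one hashtable per parent tag
        self.values = {}    # leaf text value -> set of document ids (stand-in for a treap)

def _index(node, elem, doc_id):
    if len(elem) == 0:
        # Leaf tag: no child tags, only a text value.
        node.values.setdefault((elem.text or "").strip(), set()).add(doc_id)
        return
    for child in elem:
        _index(node.children.setdefault(child.tag, TagNode()), child, doc_id)

def add_document(root_index, doc_id, xml_text):
    """Index every leaf tag of one XML document under its tag path."""
    elem = ET.fromstring(xml_text)
    _index(root_index.children.setdefault(elem.tag, TagNode()), elem, doc_id)

def lookup(root_index, path, value):
    """Follow a tag path down the hierarchy and return the matching document ids."""
    node = root_index
    for tag in path:
        node = node.children.get(tag)
        if node is None:
            return set()
    return set(node.values.get(value, ()))
```

For example, after indexing two hypothetical printer documents, lookup(index, ["printer", "location", "room"], "443") walks three hashtables and returns the ids of the documents whose room is 443.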

At the bottom of the tag hierarchy are "leaf tags," tags which contain no additional child tags, and only text values.  These leaf nodes are represented by treaps, self-balancing data structures akin to a cross between trees and heaps. Treaps use a randomized priority element, sorted in heap order, to maintain probabilistic self-balancing within the tree with minimal node state overhead.  "Leaf tag" nodes are sorted by values of the tag associated with the treap, and contain pointers to a hashtable of search result documents. The XML search engine satisfies queries by traversing down the tag hierarchy for each tag structure in the query XML tree, and collecting a list of XML document sets, each of which corresponds to the result set of a tag search.  All result subsets are first filtered to eliminate documents with attributes not matching the query.  Then a global set intersection is applied to all remaining result sets, producing the solution set in O(n) intersection time.
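
As a rough sketch of the two pieces described above, the Python below shows a textbook treap insert (binary search tree order on the tag value, max-heap order on a random priority, restored by rotations) and a hash-set intersection over per-tag result sets. All names here are hypothetical and the code is not taken from XSet; in particular, intersect_all() is just one simple way to intersect the sets.

```python
import random

class Node:
    def __init__(self, key, doc_id):
        self.key = key                    # tag value, kept in search-tree order
        self.docs = {doc_id}              # hash set of documents with this value
        self.priority = random.random()   # random priority, kept in heap order
        self.left = None
        self.right = None

def rotate_right(n):
    l = n.left
    n.left, l.right = l.right, n
    return l

def rotate_left(n):
    r = n.right
    n.right, r.left = r.left, n
    return r

def insert(node, key, doc_id):
    """Insert (key, doc_id) and return the new root of this subtree."""
    if node is None:
        return Node(key, doc_id)
    if key == node.key:
        node.docs.add(doc_id)
    elif key < node.key:
        node.left = insert(node.left, key, doc_id)
        if node.left.priority > node.priority:   # restore heap order
            node = rotate_right(node)
    else:
        node.right = insert(node.right, key, doc_id)
        if node.right.priority > node.priority:  # restore heap order
            node = rotate_left(node)
    return node

def find(node, key):
    """Return the set of document ids stored under an exact tag value."""
    while node is not None:
        if key == node.key:
            return set(node.docs)
        node = node.left if key < node.key else node.right
    return set()

def intersect_all(result_sets):
    """Intersect per-tag result sets, starting from the smallest one."""
    if not result_sets:
        return set()
    ordered = sorted(result_sets, key=len)
    solution = set(ordered[0])
    for s in ordered[1:]:
        solution &= s
    return solution
```

A query with several leaf tags would call find() once per tag on the corresponding treap and pass the resulting sets to intersect_all(); only documents present in every set survive.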

Currently, the XSet server handles two-way range queries on tag values in addition to direct tag value matches.  The three range query types currently supported are Integer, String, and Float.
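
The reason the value type matters can be seen in a small Python sketch. It assumes that a two-way range query means an optional lower and upper bound, both inclusive, and it swaps in a sorted list with binary search (the standard bisect module) in place of a treap traversal. build_sorted_index() and range_query() are hypothetical names, and the way XSet encodes ranges as special attributes is not modeled here.

```python
import bisect

# Conversions for the three supported range query types.
CASTS = {"Integer": int, "String": str, "Float": float}

def build_sorted_index(pairs, kind):
    """pairs: iterable of (raw text value, doc id). Returns a sorted list of (typed key, doc id)."""
    cast = CASTS[kind]
    return sorted((cast(value), doc_id) for value, doc_id in pairs)

def range_query(index, kind, low=None, high=None):
    """Return doc ids whose key lies in [low, high]; a bound of None is open."""
    cast = CASTS[kind]
    keys = [key for key, _ in index]
    start = 0 if low is None else bisect.bisect_left(keys, cast(low))
    stop = len(keys) if high is None else bisect.bisect_right(keys, cast(high))
    return {doc_id for _, doc_id in index[start:stop]}

# Hypothetical tag values, made up for this example.
pairs = [("9", 1), ("10", 2), ("25", 3), ("100", 4)]
as_int = build_sorted_index(pairs, "Integer")
as_str = build_sorted_index(pairs, "String")
```

Compared as integers, the range 9 to 25 selects the values 9, 10 and 25. Compared as strings, the range "10" to "25" selects "10", "100" and "25", because string order is lexicographic, so the type named in the query changes the result set.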

Future Work:
There are several improvements I plan to make to the base query engine. First, we can leverage the hierarchical structure to add query refinement and query expansion to the standard search engine functionality, giving the user the ability to adjust the size of the return set by adjusting query tags.  Query expansion allows the XSet server to offer broader search terms or tag values as part of the search results, whereas query refinement returns terms or tag values that will help to narrow the solution space.  I also want to make the XSet server a fully distributed system. Part of this will involve partitioning the data into different content servers, and characterizing them in order to allow for efficient query routing between servers.  Availability, consistency, and scalability will be the major issues.  And finally, I will consider expanding the query model, by defining a DTD for encoding a meta-query model, including AND, OR and NOT operators as the internal tags.

As far as immediate applications go, I plan to integrate the XSet service into an email client, giving the user the ability to search archived email messages on various and sundry characteristics.  Another natural application is the use of XSet as the internal structure in building a high-performance TupleSpace implementation.

Please direct all comments and questions to Ben Zhao.
Thank you.

An internal incremental design page can be found here.

References:
Discover: A Resource Discovery System based on Content Routing
Mark A. Sheldon, Andrzej Duda, Ron Weiss, and David K. Gifford
Proceedings of the Third International World Wide Web Conference, Elsevier, North Holland
Computer Networks and ISDN Systems, April 1995

Fast Set Operations Using Treaps
Guy E. Blelloch and Margaret Reid-Miller
ACM SPAA '98, Puerto Vallarta, Mexico

Randomized Search Trees
Raimund Seidel and Cecilia R. Aragon
Algorithmica, 16:454-497, 1996

Content Routing: A Scalable Architecture for Network-Based Information Discovery
Mark A. Sheldon
MIT, PhD Thesis, 1995

The SGML/XML Web Page
Robin Cover
OASIS: Organization for the Advancement of Structured Information Standards