XSet Programmer Tutorial

Introduction:

This is the tutorial page for the XSet search engine.  This document covers both component usage of the XSet server and the RMI service interface.  For more information on internal design, operation, and applications, please see the XSet paper.

Index:

Basic Background
Query Model
First Example
Range Queries
Second Example
Enumerate Tags
Max and Min
Querying
Persistence Operation
Service Startup
XSet Client
Component Usage



Basic Background

This section covers the background needed to run and use XSet efficiently.

First, XSet is a performance-oriented XML search engine.  The primary goal is to bring XML querying functionality to all applications with absolute minimum latency.  To achieve this goal, we have started with an implementation that does not provide the traditional ACID semantics generally associated with database systems.  XSet provides durability guarantees via redo logs and checkpoints, but does not support transactions.  XSet is also a main memory database, which creates an in-memory index of a dataset.  The index is built from a combination of file system directories and XML strings supplied by the user.

In normal operation as an RMI service, the XSetService can be started in the background, and operations can be executed via Java RMI calls from clients.  Clients can insert documents into and delete documents from the dataset using several methods, query it using several different methods, and both control and inspect the properties of the current in-memory dataset.

During the indexing of a new XML document, the document is parsed by an XML parser into a DOM tree.  The DOM tree is then traversed, and references to the XML document are added to self-balancing trees (Treaps) that hang off an internal tag index tree.  Queries are processed in much the same way.  Each query is parsed into an XML DOM tree, and the path of each constraint in the query is followed down the internal tag index to collect a document set.  Each constraint in the query therefore results in a set of XML documents, and these sets are combined in a global join to form the desired result set.
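The indexing and join steps described above can be illustrated with a minimal sketch.  This is not XSet's implementation: the class TagIndexSketch and all of its methods are hypothetical, java.util.TreeMap stands in for the treaps, and the global join is assumed to be a plain set intersection.  Only standard JDK XML classes are used.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Hypothetical toy index: tag path -> text value -> ids of the documents containing that pair. */
class TagIndexSketch {
    // java.util.TreeMap (a red-black tree) stands in for the treaps mentioned in the text.
    private final Map<String, TreeMap<String, Set<Integer>>> index = new HashMap<>();

    /** Parses XML text into a DOM tree and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    /** Traverses the DOM tree, collecting a {path, text} pair for each element with non-blank text. */
    static void collect(Element e, String parentPath, List<String[]> out) {
        String path = parentPath + "/" + e.getTagName();
        StringBuilder text = new StringBuilder();
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node k = kids.item(i);
            if (k instanceof Element) {
                collect((Element) k, path, out);
            } else if (k instanceof Text) {
                text.append(k.getNodeValue());
            }
        }
        String value = text.toString().trim();
        if (!value.isEmpty()) {
            out.add(new String[] {path, value});
        }
    }

    /** Indexes a document: adds a reference to docId under every (path, text) pair it contains. */
    void insert(int docId, String xml) {
        List<String[]> pairs = new ArrayList<>();
        collect(parseRoot(xml), "", pairs);
        for (String[] p : pairs) {
            index.computeIfAbsent(p[0], k -> new TreeMap<>())
                 .computeIfAbsent(p[1], k -> new TreeSet<>())
                 .add(docId);
        }
    }

    /** Looks up the document set of each constraint in the query and intersects the sets. */
    Set<Integer> query(String queryXml) {
        List<String[]> constraints = new ArrayList<>();
        collect(parseRoot(queryXml), "", constraints);
        Set<Integer> result = null;
        for (String[] c : constraints) {
            TreeMap<String, Set<Integer>> byValue = index.get(c[0]);
            Set<Integer> docs = (byValue == null) ? null : byValue.get(c[1]);
            if (docs == null) {
                return new TreeSet<>(); // a constraint with no matches empties the join
            }
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return (result == null) ? new TreeSet<>() : result;
    }

    public static void main(String[] args) {
        TagIndexSketch idx = new TagIndexSketch();
        idx.insert(1, "<PRINTCAP><ROOM>443</ROOM><COLOR>YES</COLOR><DUPLEX>NO</DUPLEX></PRINTCAP>");
        idx.insert(2, "<PRINTCAP><ROOM>420</ROOM><COLOR>YES</COLOR><DUPLEX>YES</DUPLEX></PRINTCAP>");
        // Two constraints, one document set each, intersected: prints [1]
        System.out.println(idx.query("<PRINTCAP><COLOR>YES</COLOR><DUPLEX>NO</DUPLEX></PRINTCAP>"));
    }
}
```

Because the full tag path is the index key, a query constraint only matches when it is enclosed in the same hierarchy of tags as the indexed value.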

In Version 2.0, the parser was changed from IBM's XML4J parser to its descendant, the Apache Xerces XML parser.  Correct compilation of the XSet source distribution requires the IBM XML4J parser, version 2.0.6 or later, to be available on the local classpath.

Query Model

The XSet query language is a very simple query language.  XSet queries for XML documents by a subset model: the query you write to find a set of documents is itself a small subset of the contents of those documents.  XSet queries are therefore well-formed documents that do not validate against any given DTD (Document Type Definition), whether or not the documents being searched validate against one.  As a direct consequence of the subset model, the client performing the query must know the exact context of the information queried.  In other words, any tag and its value must be enclosed in the full set of hierarchical tags leading up to the root tag, and this tag hierarchy should match the one in the result document.  See the following example for a simple query demonstration.

NOTE: There has been some confusion about the use of attributes for searching in XSet. At this time, an attribute can only function as an additional filter on a tag that is already being queried; by itself it is not enough to constitute a constraint. This means that a query on an attribute alone (such as <TAGNAME attr1="one"></TAGNAME>) will not return any results. The attribute is only considered when it belongs to a tag with an actual text value (such as <TAGNAME attr1="one">text</TAGNAME>).

First Example

In our example, we will use XSet as a simple service discovery mechanism, and attempt to find a color printer within the current administrative domain.  An example of a result document would be the following, a phaser color printer in room 443:

           <?xml version="1.0"?>
           <PRINTCAP>
           <LOCAL/>
           <ROOM>443</ROOM>
           <FULLNAME>Phaser in 443 Soda</FULLNAME>
           <NAME>phaser443</NAME>
           <COLOR>YES</COLOR>
           <DUPLEX>NO</DUPLEX>
           <LOGFILE>/var/log/lpd-errs</LOGFILE>
           <SPOOLDIRECTORY>/var/spool/lpd/phaser443</SPOOLDIRECTORY>
           <MX NOLIMIT="TRUE"/>
           <REMOTE>
              <SERVER>phaser</SERVER>
              <PRINTER>phaser443</PRINTER>
           </REMOTE>
           </PRINTCAP>
Most of the information in this printer description can be ignored for our purposes.  This current description does not reference a DTD to which it validates.  For our simple example, we want to query on the <COLOR> tag, and the query would look like the following:
 
           <?xml version="1.0"?>
           <PRINTCAP>
           <COLOR>YES</COLOR>
           </PRINTCAP>
The only thing of note here is that the <PRINTCAP> tag is necessary, to provide context for the <COLOR> tag and its value.  Additional constraints such as <DUPLEX>NO</DUPLEX> can be added to the query to further constrain the result set of XML documents.

Range Queries

Range queries can be executed on documents by providing special attributes which define the range query to the XSet query processor.  To specify a range query on a given tag, use the special attributes XSetLE, XSetME, XSetLT, and XSetMT.  Each attribute names a relational operator that matching values must satisfy with respect to the attribute's value: LE = "less than or equal to" (inclusive upper bound), ME = "more than or equal to" (inclusive lower bound), LT = "less than" (exclusive upper bound), and MT = "more than" (exclusive lower bound).  You also need to specify the attribute XSetKTYPE as "INTEGER", "STRING", or "FLOAT".  The text inside a range query tag is irrelevant, but it needs to be of length > 0.  See the next section for a simple range query example.

Second Example

Still using our printer descriptions as our dataset, we want to issue a query to find all the color printers on the fourth floor.  So we take the previous query for color printers, and add a second constraint which specifies the room number as an integer between 400 (inclusive) and 500 (exclusive).
 

           <?xml version="1.0"?>
           <PRINTCAP>
           <COLOR>YES</COLOR>
           <ROOM XSetME="400" XSetLT="500" XSetKTYPE="INTEGER"> </ROOM>
           </PRINTCAP>
In the above query, notice the use of the range attributes to constrain the values, the XSetKTYPE attribute, and the space inside the <ROOM> tag.
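To make the meaning of the range attributes concrete, here is a small sketch of how such a constraint could be evaluated against a candidate value.  It is an illustration only: the class RangeSketch and its matches method are hypothetical, only the INTEGER key type is handled, and the operator semantics are read off the example above (XSetME="400" as an inclusive lower bound, XSetLT="500" as an exclusive upper bound), with XSetLE and XSetMT assumed by analogy.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical evaluator for XSet-style range attributes on an integer value. */
class RangeSketch {
    /** Returns true if value satisfies every range attribute present in attrs. */
    static boolean matches(Map<String, String> attrs, long value) {
        if (!"INTEGER".equals(attrs.get("XSetKTYPE"))) {
            throw new IllegalArgumentException("this sketch handles XSetKTYPE=\"INTEGER\" only");
        }
        for (Map.Entry<String, String> a : attrs.entrySet()) {
            boolean ok;
            switch (a.getKey()) {
                case "XSetME": ok = value >= Long.parseLong(a.getValue()); break; // more than or equal
                case "XSetMT": ok = value >  Long.parseLong(a.getValue()); break; // more than
                case "XSetLE": ok = value <= Long.parseLong(a.getValue()); break; // less than or equal
                case "XSetLT": ok = value <  Long.parseLong(a.getValue()); break; // less than
                default:       ok = true;                                         // not a range attribute
            }
            if (!ok) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The attributes of the <ROOM> tag in the query above.
        Map<String, String> room = new LinkedHashMap<>();
        room.put("XSetME", "400");
        room.put("XSetLT", "500");
        room.put("XSetKTYPE", "INTEGER");
        System.out.println(matches(room, 443)); // prints true
        System.out.println(matches(room, 500)); // prints false
    }
}
```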

Enumerate Tags

Occasionally, it may be useful to list the unique values for a given tag in the current data set.  For tags that have a small number of possible values, an enumeration of the tag would help the client focus the query better.  For instance, an enumeration request may be made on the tag <PRINTCAP><ROOM> in the printer descriptions dataset.  This would return all room numbers corresponding to rooms which contain accessible printers in the current building.
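The effect of an enumeration request can be sketched outside of XSet with the standard javax.xml.xpath API.  The class EnumerateSketch below is hypothetical and scans each document in turn, whereas XSet would answer the request from its in-memory index; the sketch only shows what "the unique values for a given tag" means for the printer dataset.

```java
import java.io.StringReader;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical tag enumeration: the distinct text values found at a tag path. */
class EnumerateSketch {
    /** Collects the distinct, non-blank values at the given path across all documents. */
    static SortedSet<String> enumerate(List<String> docs, String path) {
        SortedSet<String> values = new TreeSet<>();
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            for (String xml : docs) {
                NodeList hits = (NodeList) xpath.evaluate(
                        path, new InputSource(new StringReader(xml)), XPathConstants.NODESET);
                for (int i = 0; i < hits.getLength(); i++) {
                    String v = hits.item(i).getTextContent().trim();
                    if (!v.isEmpty()) {
                        values.add(v);
                    }
                }
            }
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("bad path or document", e);
        }
        return values;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "<PRINTCAP><ROOM>443</ROOM><COLOR>YES</COLOR></PRINTCAP>",
                "<PRINTCAP><ROOM>420</ROOM><COLOR>NO</COLOR></PRINTCAP>",
                "<PRINTCAP><ROOM>443</ROOM><COLOR>NO</COLOR></PRINTCAP>");
        // Room 443 appears twice in the data but only once in the enumeration: prints [420, 443]
        System.out.println(enumerate(docs, "/PRINTCAP/ROOM"));
    }
}
```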

Max and Min

The max and min operators can come in handy when dealing with ordered values in tags.  For example, had the printer descriptions included a pages-per-minute count, it might be useful to find the max or min value of that tag in a set of XML documents, in order to find the fastest printers in the set.  Since this functionality is largely orthogonal to the actual querying of documents, we have abstracted the max and min functionality out into the SETutils class.  These functions can be applied to the results of queries, in order to return the document(s) in a set that contain the max or min value for a given tag.
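As an illustration of applying max or min to a query result, the following sketch picks the document(s) with the largest or smallest numeric value for a tag.  It does not use or reproduce SETutils, whose signatures are documented in the Javadoc; the class MaxMinSketch, its methods, and the <PPM> tag are all hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Hypothetical max/min over a result set of XML documents. */
class MaxMinSketch {
    /** Reads the numeric text of the first element named tag in the document. */
    static double valueOf(String xml, String tag) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return Double.parseDouble(doc.getElementsByTagName(tag).item(0).getTextContent().trim());
        } catch (Exception e) {
            throw new IllegalArgumentException("no numeric <" + tag + "> value in document", e);
        }
    }

    /** Returns every document whose value for tag is the largest (wantMax) or smallest in the set. */
    static List<String> extreme(List<String> docs, String tag, boolean wantMax) {
        List<String> best = new ArrayList<>();
        double bestValue = 0;
        for (String xml : docs) {
            double v = valueOf(xml, tag);
            int cmp = wantMax ? Double.compare(v, bestValue) : Double.compare(bestValue, v);
            if (best.isEmpty() || cmp > 0) {
                best.clear();        // strictly better: start a new list of winners
                best.add(xml);
                bestValue = v;
            } else if (cmp == 0) {
                best.add(xml);       // tie: keep all documents sharing the extreme value
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<String> result = Arrays.asList(
                "<PRINTCAP><NAME>a</NAME><PPM>8</PPM></PRINTCAP>",
                "<PRINTCAP><NAME>b</NAME><PPM>12</PPM></PRINTCAP>",
                "<PRINTCAP><NAME>c</NAME><PPM>12</PPM></PRINTCAP>");
        System.out.println(extreme(result, "PPM", true).size());  // prints 2
        System.out.println(extreme(result, "PPM", false).size()); // prints 1
    }
}
```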

Querying

Before a query can be issued to XSet, the client must first create it, either as a DOM tree or as a simple XML text string.  Creating a query is a non-trivial process, but because of the wide variety of parameters available, query creation is left to the client.
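One way a client might create the two query forms mentioned above is with the standard JAXP classes.  The sketch below builds the color printer query as a DOM tree and then serializes it to an XML text string.  The class QueryBuilderSketch and its methods are hypothetical helpers and not part of the XSet API; how the finished query is handed to XSet is described in the Javadoc.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical client-side helper for building subset-model queries. */
class QueryBuilderSketch {
    /** Builds a query with the given root tag and one child per (tag, value) pair. */
    static Document buildQuery(String root, String... tagValuePairs) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element rootEl = doc.createElement(root);
            doc.appendChild(rootEl);
            for (int i = 0; i + 1 < tagValuePairs.length; i += 2) {
                Element constraint = doc.createElement(tagValuePairs[i]);
                constraint.setTextContent(tagValuePairs[i + 1]);
                rootEl.appendChild(constraint);
            }
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Serializes the DOM tree to an XML text string without the XML declaration. */
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document query = buildQuery("PRINTCAP", "COLOR", "YES", "DUPLEX", "NO");
        System.out.println(toXml(query));
    }
}
```

Building the query through the DOM API keeps the enclosing root tag explicit, which the subset model requires, and takes care of escaping special characters in values.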

Persistence Operation

Starting with version 2.0, XSet is a fully durable XML database.  During normal operation, inserted documents are immediately written to disk in the data directory.  Insert and delete operations are logged both before and after the operation completes.  The log records are held in a small in-memory buffer, which is flushed to disk when full.  XSet supports checkpoints, which in effect write the entire in-memory index to disk.  This allows old logs to be truncated, and provides fast recovery during restarts.  Checkpoints can be taken at regular intervals; the interval and the in-memory log buffer size are static variables set inside RecoveryMgr.java.
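The interplay between the log buffer, flushing, and checkpoints can be sketched as follows.  This is a simplified illustration and not the code in RecoveryMgr.java: the class RedoLogSketch, its methods, the record format, and the use of a temporary file are all assumptions, and the checkpoint here only truncates the log rather than writing a real index snapshot.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical buffered redo log with flush-when-full and checkpoint truncation. */
class RedoLogSketch {
    private final Path logFile;
    private final int capacity;                       // number of records the buffer holds
    private final List<String> buffer = new ArrayList<>();

    RedoLogSketch(Path logFile, int capacity) {
        this.logFile = logFile;
        this.capacity = capacity;
    }

    /** Convenience factory that logs to a fresh temporary file. */
    static RedoLogSketch onTempFile(int capacity) {
        try {
            return new RedoLogSketch(Files.createTempFile("xset-sketch", ".log"), capacity);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Buffers a log record in memory; the buffer is flushed to disk once it is full. */
    void append(String record) {
        buffer.add(record);
        if (buffer.size() >= capacity) {
            flush();
        }
    }

    /** Appends all buffered records to the log file and empties the buffer. */
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        try {
            Files.write(logFile, buffer, StandardCharsets.UTF_8,
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
            buffer.clear();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** After the index snapshot is safely on disk (not shown), old log records are unnecessary. */
    void checkpoint() {
        try {
            buffer.clear();
            Files.write(logFile, new byte[0]);        // default options truncate the existing file
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the records currently on disk. */
    List<String> onDisk() {
        try {
            return Files.readAllLines(logFile, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the number of records still waiting in the in-memory buffer. */
    int pending() {
        return buffer.size();
    }
}
```

The sketch also shows the trade-off behind the buffer: records that are still pending in memory when the process dies are lost unless the log is flushed explicitly, which is what the explicit control calls in the API are for.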

In addition to the automated facilities for data persistence, the API includes additional calls for explicit control.  Included are ForceSync(), Checkpoint(), ClearAll(), and Shutdown(boolean).  Please see the Javadoc for more information on how these calls are used.

Service Startup

To start up the RMI XSetService in the background, make sure that the xset package resides inside the current classpath, and type:
             java ninja.xset.XSetService [-b] DATASTOREDIR &
Here DATASTOREDIR is the directory where the log files and checkpoint files will be stored.  When XSetService starts, it will attempt to restore any existing checkpoints or log files in the DATASTOREDIR, unless the optional -b flag is given.  If -b is used, then XSet starts in cleanboot mode, and removes any existing log and checkpoint files from the DATASTOREDIR.

After the service has been started, commands can then be issued via RMI to modify the dataset, as well as perform queries, tag enumerations, max/min queries, and log flush and checkpoint operations.

XSet Client

Included with this release is an interactive XSet Client, which can be used to get a feel for the XSet operations, or used as a debugging tool.  To start XSet Client, first start the XSetService, then type:
              java ninja.xset.XSetClient hostname pathtofiles

Component Usage

Applications can also use the XSet functionality directly through the SETserver class.  The SETserver class contains the majority of the functionality provided to the XSetService.  By accessing the object directly, the application avoids the performance cost of RMI communication.  For an example of how the SETserver class is used directly, take a look at XSetProfile.java.