XSet
Introduction:
This is the tutorial page for the XSet search engine. It covers both the RMI service interface of the XSet server and its direct use as a component.
Index:
Basic Background
Query Language
First Example
Range Queries
Second Example
Enumerate Tags
Max and Min
Querying
Service Startup
Component Usage
Here we will mention some of the background information involved in running and using XSet efficiently.
First, XSet is a performance-oriented XML search engine. The primary goal is to bring XML querying functionality to all applications with the absolute minimum latency. To achieve this goal, we have started with an implementation that does not provide the traditional ACID semantics generally associated with database systems. XSet is currently a main-memory database: it builds an in-memory index of a dataset, which is a combination of file system directories and XML strings supplied by the user.
In normal operation as an RMI service, the XSetService can be started in the background, and operations can be executed via remote RMI calls from clients. Clients can add documents to and delete documents from the dataset using several methods, query it using several different methods, and find properties of the current in-memory dataset.
During the indexing of a new XML document, the document is parsed by an XML parser into a DOM tree. The DOM tree is then traversed, and references to the XML document are added to self-balancing trees (Treaps) that hang off an internal tag index tree. Queries are processed in much the same way. Each query is parsed into an XML DOM tree, and for each subpath in the query the corresponding path in the internal tag index is followed to collect a document set. Each constraint in the query thus yields a set of XML documents, and these sets are combined in a global join to form the desired result set.
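The indexing and join steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not XSet's implementation: it uses the JDK's built-in DOM parser rather than XML4J, a `TreeMap` stands in for the treaps, a flat map keyed by tag path stands in for the tag index tree, and the class `TagIndexSketch` and its methods are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not XSet's code: TreeMap stands in for the treaps.
class TagIndexSketch {
    // tag path (e.g. "/PRINTCAP/COLOR") -> value -> ids of documents containing it
    private final Map<String, TreeMap<String, Set<Integer>>> index = new HashMap<>();
    private int nextId = 0;

    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Collects a (path, text value) pair for every leaf element at or below e.
    static void leaves(Element e, String prefix, List<String[]> out) {
        String path = prefix + "/" + e.getTagName();
        boolean hasChildElement = false;
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            if (kids.item(i) instanceof Element) {
                hasChildElement = true;
                leaves((Element) kids.item(i), path, out);
            }
        }
        if (!hasChildElement) {
            out.add(new String[] { path, e.getTextContent().trim() });
        }
    }

    // Indexes one document and returns the id assigned to it.
    int add(String xml) {
        int id = nextId++;
        List<String[]> pairs = new ArrayList<>();
        leaves(parse(xml), "", pairs);
        for (String[] p : pairs) {
            index.computeIfAbsent(p[0], k -> new TreeMap<>())
                 .computeIfAbsent(p[1], k -> new TreeSet<>())
                 .add(id);
        }
        return id;
    }

    // Each leaf of the query is one constraint; the result is the
    // intersection (join) of the document sets of all constraints.
    Set<Integer> query(String xml) {
        List<String[]> pairs = new ArrayList<>();
        leaves(parse(xml), "", pairs);
        Set<Integer> result = null;
        for (String[] p : pairs) {
            Set<Integer> hits = index.getOrDefault(p[0], new TreeMap<>())
                                     .getOrDefault(p[1], Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }
}
```

Because the index is keyed by the full path from the root, a query only matches when its tags sit in the same hierarchy as in the indexed documents.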
In Version 1.1, the parser was changed from the Microsoft MSXML 1.9 parser to the IBM XML4J 2.0.6 parser. Correct compilation of the XSet source distribution requires the IBM XML4J parser, version 2.0.6 or later, to be installed in the local classpath.
The XSet query language is intentionally very simple. XSet queries for XML documents by a subset model; that is, the query you write to find a set of documents is itself a small subset of the contents of those documents. XSet queries are therefore well-formed documents that need not validate against any DTD (Document Type Definition), whether or not the documents searched for validate against one. As a direct consequence of the subset model, the client performing the query must know the exact context of the information queried. In other words, any tag and its value must be enclosed in the full set of hierarchical tags leading up to the root tag, matching the tag hierarchy in the result document. See the following example for a simple query demonstration.
In our example, we will use XSet as a simple service discovery mechanism and attempt to find a color printer within the current administrative domain. An example result document is the following description of a Phaser color printer in room 443:
<?xml version="1.0"?>
<PRINTCAP>
  <LOCAL/>
  <ROOM>443</ROOM>
  <FULLNAME>Phaser in 443 Soda</FULLNAME>
  <NAME>phaser443</NAME>
  <COLOR>YES</COLOR>
  <DUPLEX>NO</DUPLEX>
  <LOGFILE>/var/log/lpd-errs</LOGFILE>
  <SPOOLDIRECTORY>/var/spool/lpd/phaser443</SPOOLDIRECTORY>
  <MX NOLIMIT="TRUE"/>
  <REMOTE>
    <SERVER>phaser</SERVER>
    <PRINTER>phaser443</PRINTER>
  </REMOTE>
</PRINTCAP>

Most of the information in this printer description can be ignored for our purposes. This current description does not reference a DTD to which it validates. For our simple example, we want to query on the <COLOR> tag, and the query would look like the following:
<?xml version="1.0"?>
<PRINTCAP>
  <COLOR>YES</COLOR>
</PRINTCAP>

The only thing of note here is that the <PRINTCAP> tag is necessary, to provide context for the <COLOR> tag and its value. Additional constraints such as <DUPLEX>NO</DUPLEX> can be added to the query to further constrain the result set of XML documents.
Range queries can be executed on documents by adding special attributes that define the range to the XSet query processor. To specify a range query on a given tag, use the attributes XSetLE, XSetME, XSetLT, and XSetMT, which bound the tag's value as follows: LE = "less than or equal to" (inclusive upper bound), ME = "more than or equal to" (inclusive lower bound), LT = "less than" (exclusive upper bound), and MT = "more than" (exclusive lower bound). You also need to specify the attribute XSetKTYPE as "INTEGER", "STRING", or "FLOAT". The value inside a range query tag is irrelevant, but it must have length > 0. See the next section for a simple range query example.
Still using our printer descriptions as our dataset, we want to issue a query to find all the color printers on the fourth floor. So we take the previous query for color printers and add a second constraint, which specifies the room number as an integer between 400 (inclusive) and 500 (exclusive).
<?xml version="1.0"?>
<PRINTCAP>
  <COLOR>YES</COLOR>
  <ROOM XSetME="400" XSetLT="500" XSetKTYPE="INTEGER"> </ROOM>
</PRINTCAP>

In the above query, notice the use of the range attributes to constrain the values, the XSetKTYPE attribute, and the space inside the <ROOM> tag.
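To make the meaning of the range attributes concrete, here is a minimal sketch of how a single value could be tested against a range tag. It follows the example above, where XSetME="400" acts as an inclusive lower bound and XSetLT="500" as an exclusive upper bound. The class `RangeSketch` is hypothetical and is not part of XSet, and the lexicographic ordering used for "STRING" is an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical sketch, not XSet's query processor.
class RangeSketch {
    // Orders two values according to the XSetKTYPE attribute.
    static int compare(String a, String b, String ktype) {
        switch (ktype) {
            case "INTEGER":
                return Long.compare(Long.parseLong(a.trim()), Long.parseLong(b.trim()));
            case "FLOAT":
                return Double.compare(Double.parseDouble(a.trim()), Double.parseDouble(b.trim()));
            case "STRING":
                return a.compareTo(b); // assumption: plain lexicographic order
            default:
                throw new IllegalArgumentException("unknown XSetKTYPE: " + ktype);
        }
    }

    // True if value satisfies every range attribute present on the query tag.
    static boolean matches(String rangeTagXml, String value) {
        Element tag;
        try {
            tag = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rangeTagXml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
        String ktype = tag.getAttribute("XSetKTYPE");
        // ME: value must be more than or equal to the bound.
        if (tag.hasAttribute("XSetME") && compare(value, tag.getAttribute("XSetME"), ktype) < 0) return false;
        // MT: value must be strictly more than the bound.
        if (tag.hasAttribute("XSetMT") && compare(value, tag.getAttribute("XSetMT"), ktype) <= 0) return false;
        // LE: value must be less than or equal to the bound.
        if (tag.hasAttribute("XSetLE") && compare(value, tag.getAttribute("XSetLE"), ktype) > 0) return false;
        // LT: value must be strictly less than the bound.
        if (tag.hasAttribute("XSetLT") && compare(value, tag.getAttribute("XSetLT"), ktype) >= 0) return false;
        return true;
    }
}
```

With the <ROOM> tag from the query above, room 443 and room 400 fall inside the range, while room 500 does not.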
Occasionally, it may be useful to list the unique values for a given tag in the current data set. For tags that have a small number of possible values, an enumeration of the tag would help the client focus the query better. For instance, an enumeration request may be made on the tag <PRINTCAP><ROOM> in the printer descriptions dataset. This would return all room numbers corresponding to rooms which contain accessible printers in the current building.
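As an illustration of what such an enumeration computes, the sketch below collects the unique values found at a given tag path across a list of documents. It is not the XSet enumeration call itself, whose signature is not given here; the class `EnumerateSketch` and its methods are hypothetical, and the sketch reparses each document instead of consulting an index.

```java
import java.io.StringReader;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not XSet's enumeration code.
class EnumerateSketch {
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Follows path[depth..] below e and records the text of every element reached.
    static void collect(Element e, String[] path, int depth, SortedSet<String> out) {
        if (depth == path.length) {
            out.add(e.getTextContent().trim());
            return;
        }
        NodeList kids = e.getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node kid = kids.item(i);
            if (kid instanceof Element && ((Element) kid).getTagName().equals(path[depth])) {
                collect((Element) kid, path, depth + 1, out);
            }
        }
    }

    // Returns the sorted unique values found at the given tag path, root tag first.
    static SortedSet<String> enumerate(List<String> docs, String... path) {
        SortedSet<String> values = new TreeSet<>();
        for (String xml : docs) {
            Element root = parse(xml);
            if (path.length > 0 && root.getTagName().equals(path[0])) {
                collect(root, path, 1, values);
            }
        }
        return values;
    }
}
```

For the printer dataset, calling `enumerate(docs, "PRINTCAP", "ROOM")` would yield each room number once, however many printers share a room.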
The max and min operators come in handy when dealing with ordered values in tags. For example, had the printer descriptions included a pages-per-minute count, it might be useful to find the max or min of that tag across a set of XML documents, in order to find the fastest printers in the set. Since this functionality is largely orthogonal to the actual querying of documents, we have abstracted the max and min functionality out into the SETutils class. These functions can be applied to the results of queries, in order to return the document(s) in a set that contain the max or min value for a given tag.
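The idea of applying max or min to a result set can be sketched as follows. This does not reproduce the SETutils methods, whose signatures are not listed here; the class `MaxMinSketch`, the method `extreme`, and the <PPM> tag are hypothetical, and values are assumed to be numeric.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical sketch, not the SETutils API.
class MaxMinSketch {
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    // Returns every document whose first <tag> value is the largest (wantMax)
    // or smallest (!wantMax) in the set. Documents without the tag are skipped.
    static List<String> extreme(List<String> docs, String tag, boolean wantMax) {
        List<String> best = new ArrayList<>();
        double bestValue = 0;
        for (String xml : docs) {
            NodeList hits = parse(xml).getElementsByTagName(tag);
            if (hits.getLength() == 0) {
                continue;
            }
            double v = Double.parseDouble(hits.item(0).getTextContent().trim());
            boolean better = best.isEmpty() || (wantMax ? v > bestValue : v < bestValue);
            if (better) {
                best.clear();
                best.add(xml);
                bestValue = v;
            } else if (v == bestValue) {
                best.add(xml); // tie: keep all documents sharing the extreme value
            }
        }
        return best;
    }
}
```

Returning a list rather than a single document matches the "document(s)" wording above: several printers may share the same fastest speed.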
Before a query can be issued to XSet, it must first be created by the client, either as a DOM tree or as a simple XML text string. Creating the query is a non-trivial process, but because of the wide variety of parameters available, it is left to the client.
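Both forms of a query can be produced with a few lines of code. The sketch below builds the color printer query as a DOM tree and then serializes it to an XML text string. It uses the standard JDK DOM and transformer APIs rather than XML4J, and the class `QueryBuilderSketch` is a hypothetical name; how the finished query is handed to XSet is not shown.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch of client-side query construction.
class QueryBuilderSketch {
    // Builds <PRINTCAP><COLOR>value</COLOR></PRINTCAP> as a DOM tree.
    static Document colorQuery(String value) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("PRINTCAP"); // root tag gives COLOR its context
            Element color = doc.createElement("COLOR");
            color.setTextContent(value);
            root.appendChild(color);
            doc.appendChild(root);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("could not create DOM document", e);
        }
    }

    // Serializes a DOM tree to the equivalent XML text string.
    static String toXml(Document doc) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize DOM document", e);
        }
    }
}
```

Building the DOM tree programmatically avoids escaping mistakes when values come from user input, while the string form is convenient for fixed queries.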
To start up the RMI XSetService in the background, make sure that the xset package resides inside the current classpath, and type:
java xset.XSetService &
To start up the Ninja version of the XSet service, add the following line to your ispace.cfg and start the ispace as usual:
ninja.xset.XSetService XSetServer
After the service has been started, commands can be issued via RMI to modify the dataset and to perform queries, tag enumerations, and max/min queries.
Applications can also use the XSet functionality directly through the SETserver class in the XSet package. The SETserver class contains the majority of the functionality that the XSetService exposes. By accessing the object directly, the application avoids the performance hit caused by RMI communication. For an example of how the SETserver class is used directly, take a look at XSetProfile.java.