Team
Petko BogdanovDepartment: Computer Science
Affiliation: DBL lab
Email: petko at cs
Abstract
Graph querying is an important application in diverse domains due to the expressive power of graphs. Subgraph isomorphism is inevitably part of any querying method. A naive approach to answering graph queries is performing a linear scan of the database and testing it for subgraph isomorphism. As this test is an NP problem different approaches of indexing the database have been proposed to alleviate the cost of querying by reducing the number of subgraph isomorphism tests.
The aim of this work is to explore and extend a new index structure FG-Index [1] for graph querying in a database of graphs. The idea for extension is based on the recent analysis of using different frequent structures for the purposes of indexing [2]. The authors of the latter work argue that the trees are almost as selective as graphs when used for indexing.
The idea is to build a similar to FG-Index structure called FT-Index using spanning trees of the frequent features. A query is presented as a ranked set of low probable spanning trees LPSTs and corresponding sets of edges that complete the LPSTs to the original query graph. The index is iteratively searched from the least probable LPST until a non-empty candidate set is found. The verification of the obtained set is performed by checking for mappings of the removed edges. FT-Index is partially resident in memory and on disk similar to FG-Index.
Background and Motivation
The problem of graph querying could be stated in two settings:
1. The database is a collection of graphs Gi. The query q is a graph and the result of the query is set of graphs Dq.
2. The database is a single large graph G. In this case the query is a graph and the result of the query is a set of mappings of nodes.
This paper focuses on methods related to the case of a database of graphs. Several approaches for this problem have been proposed recently. He et al. proposed an indexing scheme called Closure-tree [5]. Closure-tree is a hierarchical index, with all database graphs in the leafs and graph closures in the internal nodes. The idea of graph closures is an extension of bounding box of spatial data that aggregates structural and annotation information of the underlined graphs.
Shasha et al. [4] proposed an indexing scheme based on enumeration of short paths called GraphGrep. An inverted index on the database graphs, containing additionally frequency information is built and used for filtering when a query is processed. The query is then tested for subgraph isomorphism with all graphs in the candidate set.
A later approach by Yan et al. (gIndex) [3] uses frequent and discriminant graphs patterns to build an inverted index for filtering purposes. The filtering rates are better than in GraphGrep due to the preserved structural information in the indexing entities. The built index is also smaller as compared to GraphGrep.
Zhao et al. [2] propose using frequent trees as indexing entities for the purposes of filtering. The authors analyse the filtering capabilities of Path, Tree and Graph features, as well as the sizes of the corresponding indices. The indexing scheme they propose is called Tree+delta and uses frequent trees for indexing entities and discriminative graphs, built on demand using a bounded heuristic.
The drawbacks of all methods based on filtering include: (i) they require a candidate subgraph isomorphism verification step and (ii) not all frequent features are indexed due to the in-memory index architecture.
A recent approach by Cheng et al. [1] called FG-Index is a step towards verification-free query processing. The index is used for direct query answering as opposed to filtering only. This is achieved by having a low frequency threshold for frequent graph patterns, and a hierarchical index resident both in memory and on disk. The ratio of in-memory and on-disk portions of the index could be parametrically controlled. The query processing first searches the index for direct containment of the query without isomorphism verification. If the query is not in the main index a second edge-index is used for filtering and verification of a small candidate set is performed.
FG-Index and Tree+delta are reported to outperform gIndex, but are based on orthogonal ideas in terms of query processing and index architecture. While Tree+delta uses the index for filtering, FG-index attempts to answer a query without verification. Tree+delta is designes as an in-memory index while FG-Index is distributed between memory and disk. The idea of this work is to take the best of both worlds and create an index based on frequent tree structures with the architecture of FG-Index.
Method Overview
The overview of the query processing using FT index is depicted in Figure 1.
Figure 1
Given a query graph, we first transform it into a tree. We use statistics from the Graph database in order to define the prior probability of occurrence for each distinct edge that is the tuple (v1, e, v2), where v1 and v2 are the labels of the adjacent nodes and e is the label of the edge. We use the edge prior as weigh and apply Kruskal's algorithm for Minimum Spanning Tree (MST). The MST of the query graph, as defined, comprises the most selective set of edges assuming edge independence when collecting the statistics. An output of this step is also a set of removed edges (RES).
We represent the MST in a canonical linear form using the minimum Breadth First Canonical Form (BFCF) proposed in [6]. A test for isomorphism of two trees is equivalent to comparing their BFCFs.
The BFCF is looked up in a hash table containing the BFCFs of frequent trees (FTT). Each object in the table stores the ids of the graphs in which the FT appears, together with their embedding. The lookup generates a candidate set Cq of DB graphs against which the query needs to be verified. In the verification step, we check for existence of the RES in each graph in Cq.
When the query BFCF is not matched in the FTT a different part of the index containing infrequent edges is used to construct a Cq in which the query graph is tested for isomorphism. note that in this case the Cq is of smaller cardinality than the frequency threshold T used when mining for frequent trees in the graph database.
We use an implementation of the gSpan [7] algorithm to mine for frequent trees in the graph database. The obtained trees are transformed and into BFCF and hashed in a the table FTT. The BFCFs are stored together with list of DB graphs in which they are contained as the positions of the embeddings.
Preliminary evaluation
We have analyzed the empirical performance of the query preprocessing step for a synthetic query set
containing 2000 query graphs of size 30 nodes and 40 edges on average. The average time for preprocessing a query graph is 1ms.
Future Steps
Explore the dependence of the index size on the parameter T for mining the frequent trees. Measure the size for the AIDS dataset(42 000 molecules of size 30 nodes on average) for different values of T. Decide on whether the index could fit in the memory for a small enough value of T.
Implement the query verification step 3.1 and measure the query answer time given that the index is in memory for the AIDS dataset.
References
[1] James Cheng, Yiping Ke, Wilfred Ng, An Lu, FG-Index: Towards Verification-Free Query Processing on Graph Databases, SIGMOD, 2007.
[2] Peixiang Zhao, Jeffrey Xu Yu, Philip S. Yu, Graph Indexing: Tree + Delta >= Tree, VLDB, 2007
[3] Xiafeng Yan, Philip S. Yu, Jiawei Han, Graph indexing: A Frequent Structure-based Approach, SIGMOD 2004
[4] D. Shasha, J. T. L. Wang, and R. Giugno. "Algorithmics and Applications of Tree and Graph Searching". In Proc. PODS'02 Proceeding of the International Conference in Pattern recognition (ICPR), Quebec, Canada, August 2002
[5] Huahai He, Ambuj K. Singh, Closure-Tree: An Index Structure for Graph Queries, ICDE, 2006
[6] Yun Chi, Yirong Yang, Richard Muntz, Mining Frequent Rooted Trees and Free Trees Using Canonical Forms, UCLA Technical Report, 2003
[7] X. Yan, J. Han, gSpan: Graph-based substructure pattern mining, ICDM, 2002