Welcome to the gDensity project page!
gDensity is a novel framework using graph density index to efficiently process graph proximity queries for very large graphs. This page provides source code, binaries and data sets for the purpose of experiment reproducibility.
For more information, refer to the following paper:
Nan Li, Xifeng Yan, Zhen Wen, Arijit Khan, "Density index and proximity search in large graphs", Proc. of the 2012 ACM International Conference on Information and Knowledge Management (CIKM'12), Maui, HI, USA, Oct. 29-Nov. 2, 2012, pp. 235-244.
This page consists of the following sections: query performance comparison, incremental indexing, partial indexing, and scalability test. The source code, executable binaries and data sets specifically needed for each section are provided individually. We provide sample graph proximity queries for the users to experiment. However, the users can generate their own queries to test our gDensity framework, using one of our routines provided below.
Due to proprietary reasons, we only publish two of our data sets: DBLP and WebGraph. Please refer to the paper for more details on the data sets. Variations of the original DBLP and WebGraph are also used in our expriments, such as synthetically labeled and incremented graphs. Subgraphs of WebGraph of various sizes are used in the scalability test. Those additional graph files will be given separately in the sections below.
The following table gives links for the two data sets: DBLP and WebGraph 10M.
The overall format of the data sets is described here.
All experiments are run on a machine that has a 2.5GHz Intel Xeon processor (only one core is used), 32G RAM, and runs 64-bit Fedora 8 with LEDA 6.0. A free edition of LEDA can be downloaded here. The package will contain README on how to install LEDA. Instructions on how to compile and link application programs using LEDA in a UNIX environment can be found here. The provided binaries in the following sections run in a 64-bit environment.
The source code and binary for indexing can be found here. The sample python script provides an example to run the indexing routine on various data sets under various parameter settings in a batch mode.
gDensity vs. Baselines
Query performance is compared between gDensity and two baseline methods: improved RarestFirst and gDensity without likelihood ranking. The source code and binary for all three methods can be found here. Both the original graphs and synthetically relabeled graphs are used in the comparison. Here we provide sample execution bundles for both DBLP and WebGraph 10M. Each bundle contains the graph data files, the index files, the sample query files and the python scripts to run the program in a batch mode.
- DBLP original graph
- DBLP synthetically labeled graphs
- WebGraph 10M original graph
- WebGraph 10M synthetically labeled graphs
Incremental Indexing Performance
The source code and binary provided here increases the graph by a certain percentage and then applies incremental indexing on the increased graph. We also provide a sample execution bundle for DBLP, which contains the graph data file, the index files, the sample query files and the python scripts to run the program in a batch mode.
Partial Indexing Performance
The opposite of partial indexing is all indexing. The default scheme for density indexing on large graphs (Section "Density Indexing") is partial indexing. In this section, we provide source code specifically for all/partial indexing comparison. All indexing is only feasible on small graphs.
Scalability test is done on a set of web graphs with various sizes. We provide the sample execution bundle, which contains the graph data files, the index files, the sample query files and the python script to run the program in a batch mode.
Useful Experimental Routines
In addition, we provide the source code for two useful routines: random query generator and synthetically labeled graph generator.