Welcome to the gIceberg project page!

gIceberg is an efficient framework to uncover graph icebergs. This page provides the source code, binaries and data for the gIceberg project. For more information, refer to the following paper:
Nan Li, Ziyu Guan, Lijie Ren, Jian Wu, Jiawei Han, Xifeng Yan. "gIceberg: towards iceberg analysis in large graphs", Proc. of the 2013 IEEE International Conference on Data Engineering (ICDE'13), Brisbane, Australia, Apr. 8-12, 2013.

Introduction

A graph iceberg is defined as a vertex whose local neighborhood contains a high concentration of an attribute of interest. gIceberg performs aggregation upon personalized PageRank vectors to assess the concentration level of the attribute, and ranks vertices by their aggregate scores. Two aggregation strategies, forward and backward aggregation, are proposed with corresponding optimization techniques and bounds. The former computes the aggregate score for each vertex via summing over the entries containing the attribute of interest, in the vertex's personalized PageRank vector; the latter starts from the vertices containing the attribute of interest, and back-propagates their personalized PageRank values to all vertices.

Data Sets

Due to proprietary reasons, we only publish two of our data sets: DBLP and R-MAT. The DBLP network is built from the DBLP computer science bibliography. The R-MAT network is generated by the GTgraph toolkit. Please refer to the paper for more details on the data sets.

The following table gives links for the two data sets: DBLP and WebGraph 10M.

Table 1: Data Sets
DBLP RMAT_500K

The overall format of the data sets is described here.

System Prerequisites

All experiments are run on a machine that has a 2.5GHz Intel Xeon processor (only one core is used), 32G RAM, and runs 64-bit Fedora 8 with LEDA 6.0. A free edition of LEDA can be downloaded here. The package will contain README on how to install LEDA. Instructions on how to compile and link application programs using LEDA in a UNIX environment can be found here. The provided binaries in the following sections run in a 64-bit environment.

Generating Ground Truth

We simulate 2K random walks on each vertex to generate its approximate ersonalized PageRank vector, and aggregate over such approximate vector to derive its aggregate score. Such score is used as the ground truth. Results in our paper suggeste that 2K random walks is enough to derive very accurate approximation of the real PageRank vectors. The source code and binary executable can be found here.

To run it, use the command:
$ ./gt_FA2K.out data_file -c0.15
where -c specifies the teleportation constant during random walks.

Indexing Pivot Vertices

A set of pivot vertices are selected in forward aggregation for efficient pruning of non-iceberg vertices. We find such pivot indices using BFS and store them into an index file. The source code and binary executable can be found here.

To run it, use the command:
$ ./pIndex.out data_file -s0.3
where -s specifies the neigborhood similarity threshold during pivot vertex selection.

Comparing Aggregation Schemes

We compare the three aggregation schemes, forward aggregation, pivot vertex-based forward aggregation, and backward aggregation, with respect to precision, recall, and run time. The source code and binary executable are provided here.

To run it, use the command:
$ ./fbc.out data_file query_file -c0.15 -m1 -t0.5 -s0.3
where the parameters are:
-c: teleportation constant
-m: number of runs (the average measure is taken over all runs)
-t: cut-off threshold value for iceberg vertices
-s: neighborhood similarity threshold for pivot vertex selection