|MINING AND SEARCHING GRAPHS AND STRUCTURES (TUTORIAL)|
The 12th ACM Conference on Knowledge Discovery and Data Mining (SIGKDD'2006)
Philadelphia, PA, August 20 - 23, 2006
by Jiawei Han*, Xifeng Yan*, and Philip S. Yu^
*Univ. of Illinois at Urbana-Champaign
^IBM T. J. Watson Research Center
Scalable methods for mining, indexing, and
similarity search in graphs and other complex structures, such
as sequences, trees, and networks, have become increasingly
important in data mining and database management with broad
applications in social science, the Web, computer vision,
software engineering, chem-informatics, bio-informatics, etc. Graph mining algorithms
such as mining network motif, structural pattern with
constraints and contrast graph pattern, graph clustering, and
graph classification, have been studied extensively in recent
years. The applications built on these algorithms, such as graph
indexing and similarity search, are evolving into new components
of data management system for handling complex structured data.
The motivation for this tutorial is to present to the data
engineering community a comprehensive overview of this growing
area and discuss its potential research and application topics.
In this tutorial, we will present a survey on the state of the art in graph and structural pattern mining, indexing and similarity search, and their applications. It will cover the following major themes: scalable methods for mining trees [Zak02, CXYM05], graphs [HCD94, IWM00, KK01, BB02, YH02,YH03], coherent graph patterns [HWB+04], network patterns with constraints [YZH05, PJZ05], graph indexing methods [WZJS94, PF97, GW97, CSF+01, SWG02, CLO03, SK03, YYH04], similarity search in tree and graph databases [WBD98, RGW02, KSBG02, KKSS04, YKT05, YYH05], mining networks in bioinformatics [HWB+04, KGS04, YZH05], mining social networks and the Web [KKR+99, LKF05], and other applications.
1. Why mining, indexing, and similarity search in
graphs and structured databases?
2. From database systems to data mining and information management systems that handle complex structured data: A road map
3. Mining complex structural patterns: sequences, trees, and graphs
(a) Apriori-based method
(b) Pattern-growth method
(c) Relational graph mining
4. Constrained graph pattern mining
(a) Density constraints
(b) Connectivity constraints
(c) The framework of constraint-based graph mining
5. Indexing complex structures: graph indexing
(a) Path-based indexing
(b) Frequent discriminative pattern-based indexing
6. Similarity search in tree and graph databases
(a) Feature-based distance
(b) Tree/Graph edit distance
(c) Maximum common subgraph
7. Graph classification
(a) Path-based classification
(b) Frequent graph-based classification
(c) Kernel method
8. Graph clustering and pattern summarization
(a) Structural motifs in biological networks
(b) Graph classification using structural patterns: (1) chemical compounds, and (2) protein structures
REFERENCES (mainly Data
Mining and Database) (somehow out of date, I'll try to update it
C. Borgelt and M. R. Berthold. Mining molecular
fragments: Finding relevant substructures of molecules. In Proc.
2002 Int. Conf. on Data Mining (ICDM’02), pages 211–218,
Maebashi, Japan, Dec. 2002.
S. Chakrabarti, B. E. Dom, S. R. Kumar, P.
Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M.
Kleinberg. Mining the web’s link structure. COMPUTER, 32:60–67,
Q. Chen, A. Lim, and K. Ong. D(k)-Index: An
adaptive structural summary for graph-structured data. In Proc.
2003 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’03), pages
B. Cooper, N. Sample, M. Franklin, G. Hjaltason,
and M. Shadmon. A fast index for semistructured data. In Proc.
2001 Int. Conf. Very Large Data Bases (VLDB’01), pages 341–350,
Y. Chi, Y. Xia, Y. Yang, and R.Muntz. Mining
closed and maximal frequent subtrees from databases of labeled
rooted trees. IEEE Trans. on Knowledge and Data Engineering,
M. Faloutsos, P. Faloutsos, and C.
Faloutsos. On power-law relationships of the internet topology.
In Proc. ACM SIGCOMM’99 Conf. Applications, Technologies,
Architectures, and Protocols for Computer Communication, pages
251–262, Cambridge, MA, Aug. 1999.
C. Faloutsos, K. McCurley, and A. Tomkins.
Fast discovery of connection subgraphs. In Proc. 2004 Int. Conf.
on Knowledge Discovery and Data Mining (KDD’04), pages 118–127,
R. Goldman and J.Widom. DataGuides:
Enabling query formulation and optimization in semistructured
databases. In Proc. 1997 Int. Conf. Very Large Data Bases
(VLDB’97), pages 436–445, 1997.
L. Holder, D. Cook, and S. Djoko.
Substructure discovery in the subdue system. In Proc. AAAI’94
Workshop on Knowledge Discovery in Databases (KDD’94), pages 169
– 180, 1994.
J. Han and M. Kamber. Data Mining:
Concepts and Techniques, 2ed. Morgan Kaufmann, 2006.
J. Huan, W.Wang, D. Bandyopadhyay, J.
Snoeyink, J. Prins, and A. Tropsha. Mining spatial motifs from
protein structure graphs. In Proc. of the 8th Annual Int. Conf.
on Research in Computational Molecular Biology (RECOMB’04),
pages 308–315, 2004.
A. Inokuchi, T. Washio, and H. Motoda. An
apriori-based algorithm for mining frequent substructures from
graph data. In Proc. 2000 European Symp. Principle of Data
Mining and Knowledge Discovery (PKDD’00), pages 13–23, Lyon,
France, Sept. 2000.
M. Koyuturk, A. Grama, and W. Szpankowski.
An efficient algorithm for detecting frequent subgraphs in
biological networks. In Proc. of the 12th International
Conference on Intelligent Systems for Molecular Biology
(ISMB’04), pages 200–207, 2004.
M. Kuramochi and G. Karypis. Frequent
subgraph discovery. In Proc. 2001 Int. Conf. on Data Mining
(ICDM’01), pages 313–320, 2001.
J. M. Kleinberg, R. Kumar, P. Raghavan,
S. Rajagopalan, and A. Tomkins. The web as a graph:
Measurements, models, and methods. In Proc. Int. Conf.
Combinatorics and Computing, pages 1–17, 1999.
K. Kailing, H. Kriegel, S. Schnauer, and
T. Seidl. Efficient similarity search for hierarchical data in
large databases. In Proc. 9th Int. Conf. on Extending Database
Technology (EDBT’04), pages 676–693, 2004.
D. Kempe, J. Kleinberg, and E. Tardos.
Maximizing the spread of influence through a social network. In
Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data
Mining (KDD’03), pages 137–146, Washington, DC, Aug 2003.
J. M. Kleinberg. Small world phenomena
and the dynamics of information. In Advances in Neural
Information Processing Systems 14 (NIPS’01), pages 431–438,
Vancouver, BC, Dec. 2001.
R. Kaushik, P. Shenoy, P. Bohannon, and
E. Gudes. Exploiting local similarity for efficient indexing
of paths in graph structured data. In Proc. 2002 Int. Conf. Data Engineering (ICDE’02), pages
J. Leskovec, J. Kleinberg, and C.
Faloutsos. Graphs over time: Densification laws, shrinking diam-
eters and possible explanations. In Proc. 2005 ACM SIGKDD Int. Conf. on Knowledge Discovery
and Data Mining (KDD’05), 2005.
E. Petrakis and C. Faloutsos. Similarity
searching in medical image databases. Knowledge and Data
Engineering, 9(3):435–447, 1997.
J. Pei, D. Jiang, and A. Zhang. On mining
cross-graph quasi-cliques. In Proc. 2005 Int. Conf.
Knowledge Discovery and Data Mining (KDD’05), Chicago, IL, Aug. 2005.
J. Raymond, E. Gardiner, and P. Willett.
Rascal: Calculation of graph similarity using maximum
common edge subgraphs. The Computer Journal, 45:631–644, 2002.
S. Srinivasa and S. Kumar. A platform
based on the multi-dimensional data model for analysis of
bio-molecular structures. In Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB’03), pages
D. Shasha, J. Wang, and R. Giugno.
Algorithmics and applications of tree and graph searching. In
Proc. 21th ACM Symp. on Principles of Database Systems (PODS’02), pages 39–52, 2002.
P. Willett, J. Barnard, and G. Downs.
Chemical similarity searching. J. Chem. Inf. Comput. Sci.,
J. Wang, K. Zhang, K. Jeong, and D.
Shasha. A system for approximate tree matching. IEEE
Trans. on Knowledge and Data Engineering, 6:559 – 571, 1994.
D. Xin, J. Han, X. Yan, and H. Cheng.
Mining compressed frequent-pattern sets. In Proc. 2005
Int. Conf. Very Large Data Bases (VLDB’05), Trondheim, Norway,, Aug. 2005.
X. Yan, H. Cheng, J. Han, and D. Xin.
Summarizing itemset patterns: A profile-based approach.
In Proc. 2005 Int. Conf. Knowledge Discovery and Data Mining (KDD’05), Chicago, IL, Aug. 2005.
X. Yan and J. Han. gSpan: Graph-based
substructure pattern mining. In Proc. 2002 Int. Conf. on
Data Mining (ICDM’02), pages 721–724, Maebashi, Japan, Dec. 2002.
X. Yan and J. Han. CloseGraph: Mining
closed frequent graph patterns. In Proc. 2003 Int. Conf.
Knowledge Discovery and Data Mining (KDD’03), pages 286–295, Washington, D.C., Aug. 2003.
R. Yang, P. Kalnis, and A. Tung.
Similarity evaluation on tree-structured data. In Proc. of 2005
Int. Conf. on Management of Data (SIGMOD’05), pages 754 – 765, 2005.
X. Yan, P. Yu, and J. Han. Graph
indexing: A frequent structure-based approach. In Proc. of 2004
Int. Conf. on Management of Data (SIGMOD’04), pages 335 – 346, Paris, France, 2004.
X. Yan, P. Yu, and J. Han. Substructure
similarity search in graph databases. In Proc. of 2005 Int.
Conf. on Management of Data (SIGMOD’05), pages 766 – 777, Baltimore, MD, 2005.
X. Yan, X. Zhou, and J. Han. Mining
closed relational graphs with connectivity constraints. In
Proc. 2005 Int. Conf. Knowledge Discovery and Data Mining (KDD’05), Chicago, IL, Aug. 2005.
M. Zaki. Efficiently mining frequent
trees in a forest. In Proc. 2002 ACM SIGKDD Int. Conf.
Knowledge Discovery in Databases (KDD’02), pages 71–80, Edmonton, Canada, 2002.