MINING AND SEARCHING GRAPHS AND STRUCTURES (TUTORIAL)

The 12th ACM Conference on Knowledge Discovery and Data Mining (SIGKDD'2006)
Philadelphia, PA, August 20 - 23, 2006

by Jiawei Han*, Xifeng Yan*, and Philip S. Yu^
*Univ. of Illinois at Urbana-Champaign
^
IBM T. J. Watson Research Center
 
TUTORIAL SLIDES [pdf]
INTRODUCTION
Scalable methods for mining, indexing, and similarity search in graphs and other complex structures, such as sequences, trees, and networks, have become increasingly important in data mining and database management with broad applications in social science, the Web, computer vision, software engineering, chem-informatics, bio-informatics, etc. Graph mining algorithms such as mining network motif, structural pattern with constraints and contrast graph pattern, graph clustering, and graph classification, have been studied extensively in recent years. The applications built on these algorithms, such as graph indexing and similarity search, are evolving into new components of data management system for handling complex structured data. The motivation for this tutorial is to present to the data engineering community a comprehensive overview of this growing area and discuss its potential research and application topics.

In this tutorial, we will present a survey on the state of the art in graph and structural pattern mining, indexing and similarity search, and their applications. It will cover the following major themes: scalable methods for mining trees [Zak02, CXYM05], graphs [HCD94, IWM00, KK01, BB02, YH02,YH03], coherent graph patterns [HWB+04], network patterns with constraints [YZH05, PJZ05], graph indexing methods [WZJS94, PF97, GW97, CSF+01, SWG02, CLO03, SK03, YYH04], similarity search in tree and graph databases [WBD98, RGW02, KSBG02, KKSS04, YKT05, YYH05], mining networks in bioinformatics [HWB+04, KGS04, YZH05], mining social networks and the Web [KKR+99, LKF05], and other applications.

 
OUTLINE
1. Why mining, indexing, and similarity search in graphs and structured databases?
2. From database systems to data mining and information management systems that handle complex structured data: A road map
3. Mining complex structural patterns: sequences, trees, and graphs
     (a) Apriori-based method
     (b) Pattern-growth method
     (c) Relational graph mining
4. Constrained graph pattern mining
     (a) Density constraints
     (b) Connectivity constraints
     (c) The framework of constraint-based graph mining
5. Indexing complex structures: graph indexing
     (a) Path-based indexing
     (b) Frequent discriminative pattern-based indexing
6. Similarity search in tree and graph databases
     (a) Feature-based distance
     (b) Tree/Graph edit distance
     (c) Maximum common subgraph
7. Graph classification
    (a) Path-based classification
    (b) Frequent graph-based classification
    (c) Kernel method
8. Graph clustering and pattern summarization
9. Applications
    (a) Structural motifs in biological networks
    (b) Graph classification using structural patterns: (1) chemical compounds, and (2) protein structures
 
 
REFERENCES (mainly Data Mining and Database) (somehow out of date, I'll try to update it later)
[BB02] C. Borgelt and M. R. Berthold. Mining molecular fragments: Finding relevant substructures of molecules. In Proc. 2002 Int. Conf. on Data Mining (ICDM’02), pages 211–218, Maebashi, Japan, Dec. 2002.
 
[CDK+99] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, and J. M. Kleinberg. Mining the web’s link structure. COMPUTER, 32:60–67, 1999.
 
[CLO03] Q. Chen, A. Lim, and K. Ong. D(k)-Index: An adaptive structural summary for graph-structured data. In Proc. 2003 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD’03), pages 134–144, 2003.
 

[CSF+01] B. Cooper, N. Sample, M. Franklin, G. Hjaltason, and M. Shadmon. A fast index for semistructured data. In Proc. 2001 Int. Conf. Very Large Data Bases (VLDB’01), pages 341–350, 2001.
 
[CXYM05] Y. Chi, Y. Xia, Y. Yang, and R.Muntz. Mining closed and maximal frequent subtrees from databases of labeled rooted trees. IEEE Trans. on Knowledge and Data Engineering, 17:190–202, 2005.
 
[FFF99] M. Faloutsos, P. Faloutsos, and C. Faloutsos. On power-law relationships of the internet topology. In Proc. ACM SIGCOMM’99 Conf. Applications, Technologies, Architectures, and Protocols for Computer Communication, pages 251–262, Cambridge, MA, Aug. 1999.
 
[FMT04] C. Faloutsos, K. McCurley, and A. Tomkins. Fast discovery of connection subgraphs. In Proc. 2004 Int. Conf. on Knowledge Discovery and Data Mining (KDD’04), pages 118–127, 2004.
 
[GW97] R. Goldman and J.Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB’97), pages 436–445, 1997.
 
[HCD94] L. Holder, D. Cook, and S. Djoko. Substructure discovery in the subdue system. In Proc. AAAI’94 Workshop on Knowledge Discovery in Databases (KDD’94), pages 169 – 180, 1994.
 
[HK05] J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2ed. Morgan Kaufmann, 2006.
 
[HWB+04] J. Huan, W.Wang, D. Bandyopadhyay, J. Snoeyink, J. Prins, and A. Tropsha. Mining spatial motifs from protein structure graphs. In Proc. of the 8th Annual Int. Conf. on Research in Computational Molecular Biology (RECOMB’04), pages 308–315, 2004.
 
[IWM00] A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. 2000 European Symp. Principle of Data Mining and Knowledge Discovery (PKDD’00), pages 13–23, Lyon, France, Sept. 2000.
 
[KGS04] M. Koyuturk, A. Grama, and W. Szpankowski. An efficient algorithm for detecting frequent subgraphs in biological networks. In Proc. of the 12th International Conference on Intelligent Systems for Molecular Biology (ISMB’04), pages 200–207, 2004.
 
[KK01] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In Proc. 2001 Int. Conf. on Data Mining (ICDM’01), pages 313–320, 2001.
 
[KKR+99] J. M. Kleinberg, R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. The web as a graph: Measurements, models, and methods. In Proc. Int. Conf. Combinatorics and Computing, pages 1–17, 1999.
 
[KKSS04] K. Kailing, H. Kriegel, S. Schnauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In Proc. 9th Int. Conf. on Extending Database Technology (EDBT’04), pages 676–693, 2004.
 
[KKT03] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In Proc. 2003 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (KDD’03), pages 137–146, Washington, DC, Aug 2003.
 
[Kle01] J. M. Kleinberg. Small world phenomena and the dynamics of information. In Advances in Neural Information Processing Systems 14 (NIPS’01), pages 431–438, Vancouver, BC, Dec. 2001.
 
[KSBG02] R. Kaushik, P. Shenoy, P. Bohannon, and E. Gudes. Exploiting local similarity for efficient indexing
of paths in graph structured data. In Proc. 2002 Int. Conf. Data Engineering (ICDE’02), pages
129–140, 2002.
 
[LKF05] J. Leskovec, J. Kleinberg, and C. Faloutsos. Graphs over time: Densification laws, shrinking diam-
eters and possible explanations. In Proc. 2005 ACM SIGKDD Int. Conf. on Knowledge Discovery
and Data Mining (KDD’05), 2005.
 
[PF97] E. Petrakis and C. Faloutsos. Similarity searching in medical image databases. Knowledge and Data
Engineering, 9(3):435–447, 1997.
 
[PJZ05] J. Pei, D. Jiang, and A. Zhang. On mining cross-graph quasi-cliques. In Proc. 2005 Int. Conf.
Knowledge Discovery and Data Mining (KDD’05), Chicago, IL, Aug. 2005.
 
[RGW02] J. Raymond, E. Gardiner, and P. Willett. Rascal: Calculation of graph similarity using maximum
common edge subgraphs. The Computer Journal, 45:631–644, 2002.
 
[SK03] S. Srinivasa and S. Kumar. A platform based on the multi-dimensional data model for analysis of
bio-molecular structures. In Proc. 2003 Int. Conf. on Very Large Data Bases (VLDB’03), pages
975–986, 2003.
 
[SWG02] D. Shasha, J. Wang, and R. Giugno. Algorithmics and applications of tree and graph searching. In
Proc. 21th ACM Symp. on Principles of Database Systems (PODS’02), pages 39–52, 2002.
 
[WBD98] P. Willett, J. Barnard, and G. Downs. Chemical similarity searching. J. Chem. Inf. Comput. Sci.,
38:983–996, 1998.
 
[WZJS94] J. Wang, K. Zhang, K. Jeong, and D. Shasha. A system for approximate tree matching. IEEE
Trans. on Knowledge and Data Engineering, 6:559 – 571, 1994.
 
[XHYC05] D. Xin, J. Han, X. Yan, and H. Cheng. Mining compressed frequent-pattern sets. In Proc. 2005
Int. Conf. Very Large Data Bases (VLDB’05), Trondheim, Norway,, Aug. 2005.
 
[YCHX05] X. Yan, H. Cheng, J. Han, and D. Xin. Summarizing itemset patterns: A profile-based approach.
In Proc. 2005 Int. Conf. Knowledge Discovery and Data Mining (KDD’05), Chicago, IL, Aug. 2005.
 
[YH02] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In Proc. 2002 Int. Conf. on
Data Mining (ICDM’02), pages 721–724, Maebashi, Japan, Dec. 2002.
 
[YH03] X. Yan and J. Han. CloseGraph: Mining closed frequent graph patterns. In Proc. 2003 Int. Conf.
Knowledge Discovery and Data Mining (KDD’03), pages 286–295, Washington, D.C., Aug. 2003.
 
[YKT05] R. Yang, P. Kalnis, and A. Tung. Similarity evaluation on tree-structured data. In Proc. of 2005
Int. Conf. on Management of Data (SIGMOD’05), pages 754 – 765, 2005.
 
[YYH04] X. Yan, P. Yu, and J. Han. Graph indexing: A frequent structure-based approach. In Proc. of 2004
Int. Conf. on Management of Data (SIGMOD’04), pages 335 – 346, Paris, France, 2004.
 
[YYH05] X. Yan, P. Yu, and J. Han. Substructure similarity search in graph databases. In Proc. of 2005 Int.
Conf. on Management of Data (SIGMOD’05), pages 766 – 777, Baltimore, MD, 2005.
 
[YZH05] X. Yan, X. Zhou, and J. Han. Mining closed relational graphs with connectivity constraints. In
Proc. 2005 Int. Conf. Knowledge Discovery and Data Mining (KDD’05), Chicago, IL, Aug. 2005.
 
[Zak02] M. Zaki. Efficiently mining frequent trees in a forest. In Proc. 2002 ACM SIGKDD Int. Conf.
Knowledge Discovery in Databases (KDD’02), pages 71–80, Edmonton, Canada, 2002.