DBL Current Research Projects

Dynamic networks evolution

Models for Dynamic Networks

Current social and scientific endeavors are generating data that can be modeled as graphs: high-throughput biological experiments, screening of chemical compounds, social networks, ecological networks and food-webs, database schemas and ontologies. Traditional graph theory and most current research in graph modeling, querying, and mining concentrates on problems where the graph structure is inherently static and does not change with time. But networks in the real world are dynamic in nature with a wide range of temporal changes — while the topology of networks such as social networks and transportation networks undergoes gradual change (or evolution), the content (information flow, annotations) changes more rapidly. A dynamic graph that comprises of a series of static graph snapshots has fundamentally different properties than the union of properties of constituent static graph snapshots.

The first research thrust examines inherent properties of dynamic graphs. One such property is the dynamic reachability structure (DRS) of nodes; we are examining high-fidelity methods for summarizing and generating dynamic graphs based on these properties. DRS histograms of multiple dynamic networks follow novel distribution patterns. We introduce a new generative dynamic graph model that captures the dynamic graph properties of connectivity and reachability, as well as generates time values for its edges.

In another research thrust, we are examining how to extrapolate metrics of an information cascade (e.g., growth rate, size) from observations obtained at a set of sample nodes. Using empirical observations on online social networks, we intend to estimate metrics (e.g., growth rate, size) of the current state of the network-wide information cascade, and to predict the evolution of these metrics over time.

The final research thrust examines dynamic graphs from the point of view of content and topic models in order to understand the relationship between content of a message and its flow in a network. We focus on the additional information associated with many graphs beyond just their structure. In particular, graphs (such as email, twitter, blog graphs, etc.) have information content associated with them. We examine the relationship between information content and structure, proposing a new modeling approach that combines both.

Significant Sub-Structures

Analysis of Significant Substructures in Dynamic Networks

Time evolving networks arise in multiple domains. They characterize traffic variations in transportation networks, information flow in communication networks, variation of trade rates in a network of trading agents, or phases of pathway switching in gene interaction networks. One model of a time-varying network is based on fixed network structure (edges and nodes) and a time-varying characteristic of the nodes and edges. Significant or unusual patterns can then be defined by considering the deviation from the expected behavior of a single node/edge, and aggregating the significance of neighboring entities over a subgraph. There are different ways of defining such aggregate patterns of interest based on whether the significant subgraph stays fixed or varies across time (and constraints on variations). Furthermore, it is interesting to consider higher-level patterns such as splitting, merging, and shifting of the significant patterns, and logics for defining and verifying such patterns.

Content-User Interactions

Models for Content and User Interactions

We are developing a “topic genotype” framework (a gene corresponding to each topic and a genotype corresponding to a user) to quantify the transmission of ideas and opinions of agents in a network. Our scientific objectives include: (i) development of a genotypic model, (ii) characterization of genotypes within a social network, and (iii) analysis of the spread of genotypes in a network. We are extending preliminary results on Twitter hashtags by including more topic classifications in order to construct more detailed genetic signatures. Preliminary results suggest that we can relate the follower and friendship structure to topic genotypes by determining if neighbors in the network are more similar in topic genotype than non-neighbors.

We are investigating influence, persuasion, and dynamics of users’ sentiments in social network datasets and developing metrics that are able to detect the change of positions and opinions of users. Specifically, we are  focusing on fine-grained and user-specific (rather than global) sentiment classification, as well as allowing for three (positive, negative, and neutral) or more sentiment states. We analyze the properties of user sentiment dynamics within real-world social network datasets, especially extending this analysis beyond just text and user connectivity, and including both message timestamp/topic-based user activity levels, and friendship connection strengths. We will utilize the discovered properties to construct a new generative model of sentiment dynamics, capable of predicting which users are most likely to change their opinions, and capturing the evolution of users' sentiments across time.

We are also studying how information is passed across and modified within large-scale social and information networks such as blogger networks, Twitter feeds, and Wikipedia contributor networks. Previous models for unstructured text information have mostly focused on a bag-of-words representation or n-grams. While useful for search and retrieval tasks, such representations miss the intrinsic narrative structure of the underlying communications. Recent methods in natural language processing have enabled representations that better capture the meanings of the texts. We plan to extract concept map representations of networked content, which can be viewed as small information networks that capture the cognitive (and interpretive) understanding of communicants in the network with respect to a given issue. Within the complex of evolving stories and evolving social networks, we seek to build models of influence that take into account both the concept map structures and their corresponding location within a sources-and-consumers network. We target a better understanding of how propagated content and the network structure mutually affect each other.

Biological Networks

Biological Networks

Intensive investigations over several decades have revealed the functions of many individual genes, proteins, and pathways. There has been an explosion of data of widely diverse types, arising from genome-wide characterization of transcriptional profiles, protein-protein interactions, genomic structure, genetic phenotype, gene interactions, gene expression, and proteomics. We are developing techniques that can integrate and analyze data from multiple sources and models efficiently. One research thrust quantifies phenotypic variation using image analysis and pattern recognition tools, develops a causal model for gene regulatory processes, and validates the model experimentally.  Specific aims are to leverage expression quantitative train loci (eQTL) datasets to elucidate the mechanisms underlying two processes integral to C. elegans physiology: programmed cell death (PCD) and the regulation of the germline stem cell proliferation decision boundary. We are developing methodology that simultaneously clusters transcripts into coherent modules, and learns the causal structure that optimally explains how genotypic variation alters physiological phenotypes through changes in gene expression. Ultimately, we intend to use our causal modeling methodology to generate biological hypotheses that explain how natural genetic variation leads to phenotypic changes.

Research in systems biology has shown that clinical outcomes, such as susceptibility to cancer, depend not only on the expression level of a single protein, but on pathways or network modules. This can be modeled using a network with dynamic node labels and a global dynamic state; the node labels indicate the protein expression levels of an individual and the global state indicates the presence or absence of the disease. To predict the biological outcome or to categorize an individual, we need to find the sub-networks whose local states accurately predict the global network state. Learning discriminative subgraphs is key to understanding the complex relationship that exists between the local and the global states. Furthermore, they provide the platform for higher-order modeling tasks such as network regularization, classification and regression.

Closure tree logo

Querying and Mining in High Dimensional Spaces

Search for near neighbors of a given query object in a collection of objects is a fundamental operation in numerous applications, e.g., multimedia similarity search, data mining, information retrieval, and pattern recognition. The most common model for search is to represent objects in a high-dimensional attribute space, and then retrieve points near a given query point using a distance measure. We study the problem of r-near neighbor (r-NN) queries that retrieve all the points within a distance r from the query point. Existing techniques solve this problem efficiently only for the approximate case. We propose a novel indexing and querying scheme called Spatial Intersection and Metric Pruning (SIMP) that efficiently supports r-near neighbor queries in very high-dimensional spaces for all query ranges with 100% quality guarantee and with practical storage costs.

High-dimensional objects often have descriptive text information such as a set of keywords (e.g., geo-locations). We study nearest keyword set search (NKS) queries in which a user provides a subset of keywords. The top-1 result of an NKS query is a set of data points which contains all the query
keywords and the points form the tightest cluster in the multidimensional space. NKS queries are useful for extending web and image search engines, map services, GIS systems, and for geo-tagging of objects and regions. We propose a novel method called ProMiSH (Projection and MultiScale Hashing) that uses random projection and hash-based index structures to achieve high scalability and speedup.

Probabilistic Feature

Mining and Modeling Brain Activity Networks

Data from brain activity measurements in the context of different cognitive or physical tasks is becoming available due to a multitude of experiments by neuroscientists, psychologists and from medical research. As a result, there has been an increasing effort in employing data mining techniques to understand more aspects of the brain activity patterns shared across subjects. These patterns hold promises to unveil the micro-architecture of the human brain organization and dynamics.

Our recent work focuses on neuroactivity data from tasks that involve learning. We model the human brain as a dynamic network with nodes corresponding to brain regions and edges corresponding to levels of cohesion between pairs of regions. The level of cohesion (coupled behavior) observed on edges varies over time with the learning experiment progressing. Our goal is to discover dynamic subgraph patterns associated with learning.

Cheminformatics and Motif


Increased availability of large repositories of chemical compounds has created new challenges and opportunities for the application of data-mining techniques to problems in chemical informatics. Applications of chemoinformatics lie in performing virtual high-throughput screening to identify active compounds in a virtual library and predicting structure-activity relationships.

Often chemical compounds are represented as graphs where the nodes represent atoms and the edges represent covalent bonds between them. Mining statistically significant subgraphs from such graph databases enable us to visualize over-represented molecular substructures in a molecular database. These over-represented substructures are highly information-rich and are useful in molecular classification and scaffold hopping. We studied this problem and developed efficient techniques to mine over-represented substructures in a scalable manner. Empirical evaluation confirmed their potential in molecular classification by performing better than well-known molecular fingerprints.

A common representation of molecules is based on the geometric location of important features (such as donors, acceptors, hydrophobic cores, etc.) in 3D space. The underlying geometry of the pharmacophores is responsible for binding between compounds and targets as well as properties of compounds such as Blood Brain Barrier permeability. We define a Joint Pharmacophore Space (JPS) of chemical compounds, targets, and physicochemical/biological properties using the 3D geometry of pharmacophoric features and mine this space directly to identifypharmacophoric patterns. The statistical significance of the mined patterns are substantiated by their low p-values. These patterns are extremely promising in molecular classification by achieving superior prediction quality compared to molecular fingerprints and 3-point pharmacophores.

Past Research Projects

For more information on the lab's previous research projects, please see our publication page.