Models for Dynamic Networks
Current social and scientific endeavors are
generating data that can be modeled as graphs:
high-throughput biological experiments, screening of
chemical compounds, social networks, ecological
networks and food-webs, database schemas and
ontologies. Traditional graph theory and most
current research in graph modeling, querying, and
mining concentrates on problems where the graph
structure is inherently static and does not change
with time. But networks in the real world are dynamic
in nature with a wide range of temporal changes — while
the topology of networks such as social networks and
transportation networks undergoes gradual change (or
evolution), the content (information flow,
annotations) changes more rapidly. A dynamic graph
that comprises of a series of static graph snapshots
has fundamentally different properties than the union
of properties of constituent static graph snapshots.
The first research thrust examines inherent
properties of dynamic graphs. One such property is the
dynamic reachability structure (DRS) of nodes; we are
examining high-fidelity methods for summarizing and
generating dynamic graphs based on these properties.
DRS histograms of multiple dynamic networks follow
novel distribution patterns. We introduce a new
generative dynamic graph model that captures the
dynamic graph properties of connectivity and
reachability, as well as generates time values for its
In another research thrust, we are examining how to
extrapolate metrics of an information cascade (e.g.,
growth rate, size) from observations obtained at a set
of sample nodes. Using empirical observations on
online social networks, we intend to estimate metrics
(e.g., growth rate, size) of the current state of the
network-wide information cascade, and to predict the
evolution of these metrics over time.
The final research thrust examines dynamic graphs from the point of view of content and topic models in order to understand the relationship between content of a message and its flow in a network. We focus on the additional information associated with many graphs beyond just their structure. In particular, graphs (such as email, twitter, blog graphs, etc.) have information content associated with them. We examine the relationship between information content and structure, proposing a new modeling approach that combines both.
Analysis of Significant Substructures in Dynamic Networks
Time evolving networks arise in multiple domains. They characterize traffic variations in transportation networks, information flow in communication networks, variation of trade rates in a network of trading agents, or phases of pathway switching in gene interaction networks. One model of a time-varying network is based on fixed network structure (edges and nodes) and a time-varying characteristic of the nodes and edges. Significant or unusual patterns can then be defined by considering the deviation from the expected behavior of a single node/edge, and aggregating the significance of neighboring entities over a subgraph. There are different ways of defining such aggregate patterns of interest based on whether the significant subgraph stays fixed or varies across time (and constraints on variations). Furthermore, it is interesting to consider higher-level patterns such as splitting, merging, and shifting of the significant patterns, and logics for defining and verifying such patterns.
Models for Content and User Interactions
We are developing a “topic genotype” framework (a
gene corresponding to each topic and a genotype
corresponding to a user) to quantify the transmission
of ideas and opinions of agents in a network. Our
scientific objectives include: (i) development of a
genotypic model, (ii) characterization of genotypes
within a social network, and (iii) analysis of the
spread of genotypes in a network. We are extending
preliminary results on Twitter hashtags by including
more topic classifications in order to construct more
detailed genetic signatures. Preliminary results
suggest that we can relate the follower and friendship
structure to topic genotypes by determining if
neighbors in the network are more similar in topic
genotype than non-neighbors.
We are investigating influence, persuasion, and
dynamics of users’ sentiments in social network
datasets and developing metrics that are able to
detect the change of positions and opinions of users.
Specifically, we are focusing on fine-grained
and user-specific (rather than global) sentiment
classification, as well as allowing for three
(positive, negative, and neutral) or more sentiment
states. We analyze the properties of user sentiment
dynamics within real-world social network datasets,
especially extending this analysis beyond just text
and user connectivity, and including both message
timestamp/topic-based user activity levels, and
friendship connection strengths. We will utilize the
discovered properties to construct a new generative
model of sentiment dynamics, capable of predicting
which users are most likely to change their opinions,
and capturing the evolution of users' sentiments
We are also studying how information is passed across
and modified within large-scale social and information
networks such as blogger networks, Twitter feeds, and
Wikipedia contributor networks. Previous models for
unstructured text information have mostly focused on a
bag-of-words representation or n-grams. While useful
for search and retrieval tasks, such representations
miss the intrinsic narrative structure of the
underlying communications. Recent methods in natural
language processing have enabled representations that
better capture the meanings of the texts. We plan to
extract concept map representations of networked
content, which can be viewed as small information
networks that capture the cognitive (and interpretive)
understanding of communicants in the network with
respect to a given issue. Within the complex of
evolving stories and evolving social networks, we seek
to build models of influence that take into account
both the concept map structures and their
corresponding location within a sources-and-consumers
network. We target a better understanding of how
propagated content and the network structure mutually
affect each other.
Intensive investigations over several decades have revealed the functions of many individual genes, proteins, and pathways. There has been an explosion of data of widely diverse types, arising from genome-wide characterization of transcriptional profiles, protein-protein interactions, genomic structure, genetic phenotype, gene interactions, gene expression, and proteomics. We are developing techniques that can integrate and analyze data from multiple sources and models efficiently. One research thrust quantifies phenotypic variation using image analysis and pattern recognition tools, develops a causal model for gene regulatory processes, and validates the model experimentally. Specific aims are to leverage expression quantitative train loci (eQTL) datasets to elucidate the mechanisms underlying two processes integral to C. elegans physiology: programmed cell death (PCD) and the regulation of the germline stem cell proliferation decision boundary. We are developing methodology that simultaneously clusters transcripts into coherent modules, and learns the causal structure that optimally explains how genotypic variation alters physiological phenotypes through changes in gene expression. Ultimately, we intend to use our causal modeling methodology to generate biological hypotheses that explain how natural genetic variation leads to phenotypic changes.
Research in systems biology has shown that clinical outcomes, such as susceptibility to cancer, depend not only on the expression level of a single protein, but on pathways or network modules. This can be modeled using a network with dynamic node labels and a global dynamic state; the node labels indicate the protein expression levels of an individual and the global state indicates the presence or absence of the disease. To predict the biological outcome or to categorize an individual, we need to find the sub-networks whose local states accurately predict the global network state. Learning discriminative subgraphs is key to understanding the complex relationship that exists between the local and the global states. Furthermore, they provide the platform for higher-order modeling tasks such as network regularization, classification and regression.
Querying and Mining in High Dimensional Spaces
Search for near neighbors of a given query object in a collection of objects is a fundamental operation in numerous applications, e.g., multimedia similarity search, data mining, information retrieval, and pattern recognition. The most common model for search is to represent objects in a high-dimensional attribute space, and then retrieve points near a given query point using a distance measure. We study the problem of r-near neighbor (r-NN) queries that retrieve all the points within a distance r from the query point. Existing techniques solve this problem efficiently only for the approximate case. We propose a novel indexing and querying scheme called Spatial Intersection and Metric Pruning (SIMP) that efficiently supports r-near neighbor queries in very high-dimensional spaces for all query ranges with 100% quality guarantee and with practical storage costs.
High-dimensional objects often have descriptive text
information such as a set of keywords (e.g.,
geo-locations). We study nearest keyword set search
(NKS) queries in which a user provides a subset of
keywords. The top-1 result of an NKS query is a set of
data points which contains all the query
Mining and Modeling Brain Activity Networks
Data from brain activity measurements in the context of different
cognitive or physical tasks is becoming available due to
a multitude of experiments by neuroscientists, psychologists and
from medical research. As a result, there has been an
increasing effort in employing data mining techniques
to understand more aspects of the brain activity patterns
shared across subjects. These patterns hold promises
to unveil the micro-architecture of the human brain
organization and dynamics.
Our recent work focuses on neuroactivity data from tasks that involve learning. We model the human brain as a dynamic network with nodes corresponding to brain regions and edges corresponding to levels of cohesion between pairs of regions. The level of cohesion (coupled behavior) observed on edges varies over time with the learning experiment progressing. Our goal is to discover dynamic subgraph patterns associated with learning.
Increased availability of large repositories of
chemical compounds has created new challenges and
opportunities for the application of data-mining
techniques to problems in chemical informatics.
Applications of chemoinformatics lie in performing
virtual high-throughput screening to identify active
compounds in a virtual library and predicting
Often chemical compounds are represented as graphs where the nodes represent atoms and the edges represent covalent bonds between them. Mining statistically significant subgraphs from such graph databases enable us to visualize over-represented molecular substructures in a molecular database. These over-represented substructures are highly information-rich and are useful in molecular classification and scaffold hopping. We studied this problem and developed efficient techniques to mine over-represented substructures in a scalable manner. Empirical evaluation confirmed their potential in molecular classification by performing better than well-known molecular fingerprints.
A common representation of molecules is based on the geometric location of important features (such as donors, acceptors, hydrophobic cores, etc.) in 3D space. The underlying geometry of the pharmacophores is responsible for binding between compounds and targets as well as properties of compounds such as Blood Brain Barrier permeability. We define a Joint Pharmacophore Space (JPS) of chemical compounds, targets, and physicochemical/biological properties using the 3D geometry of pharmacophoric features and mine this space directly to identifypharmacophoric patterns. The statistical significance of the mined patterns are substantiated by their low p-values. These patterns are extremely promising in molecular classification by achieving superior prediction quality compared to molecular fingerprints and 3-point pharmacophores.
For more information on the lab's previous research projects, please see our publication page.