Dynamic Composite NetworksFinding Significant Substructures:Time evolving networks arise in multiple domains. They can characterize traffic variations in transportation networks, information flow in communication networks, variation of trade rates in a network of trading agents or phases of pathway switching in gene interaction networks. Regardless of the rich representative power of time-evolving network in modeling processes in all the above domains, they have received limited attention from in the data mining and database communities. One model of a time-varying network is based on fixed network structure (edges and nodes) and a time-varying characteristic of the edges. This scenario is applicable to modeling traffic in transportation, communication and information networks. We propose a method for mining a time-evolving network for heavy-edge sub-networks. In the transportation network scenario such sub-networks correspond to congested regions, while in a communication network they can correspond to links that are overloaded with information traffic. Modeling Dynamic Graphs:The analysis of time varying graphs has become a large area of interest for researchers in recent years. Most graph datasets change over time, and the modeling and prediction of future graph structure can be highly useful for many applications. In this project we examine different methods to allow for the modeling of graph structure as it changes over time. In one approach, we focus on the additional information associated with many graphs beyond just their graph structure. In particular, communication graphs (such as email, twitter, blog graphs, etc.) have information content associated with them. Previous research in this area has mostly ignored this additional information, focusing instead largely on using only the past graph structure to predict future links. In this project we examine the link between information content and graph structure, proposing a new graph modeling approach which combines both. We then apply this model to real world communication graphs and demonstrate that the built models can used effectively to predict future graph structure. Project Participants:
|
|
Scalable Graph AnalysisScalable Graph Clustering:Large amounts of graph data are generated each day from applications such as social networks, communication graphs, biological networks, and more. These structures grow from the hundreds to the millions of nodes and are both useful and important to analyze. As graphs grow in size, it becomes difficult to use or inspect them without some form of summarization. Clustering the nodes of these networks is one technique shown to be of great practical importance. Not only are the small, dense subgraphs easier to visualize and analyze, but well-connected groupings of nodes within graphs have been found to correspond to many real world problems, like biological function prediction and social or web community detection. In this project, we focus on the development of new graph clustering algorithms which allow for the efficient discovery of strong, well-connected clusters. An important insight is that many clustering applications need only the subset of best clusters, and not all clusters in the entire graph. We introduce a new technique, Top Graph Clusters (TopGC), which uses a modification of Locality Sensitive Hashing to probabilistically searches large, edge weighted, directed graphs for their best clusters in linear time. In addition, the idea of grouping related nodes using repeated random walks is the focus behind RRW, a new clustering algorithm we propose that greedily grows clusters based on random walk distances, allowing for strongly connected, variably sized, overlapping clusters. Scalable Proximity Search in Large Composite Graphs:Pairwise relations among entities and agents in networks arise from different sources and can be based on multiple features. For example, people may have accounts in several online social networks, targeted towards different interaction types. A single individual can be a user of Facebook to keep up with her friends, LinkedIn to organize and maintain her business contacts, and Last.fm to discuss musical trends with fellow listeners. This diverse social ambiance is complemented by an information layer in the form of email communication, generation and discussion of blog entries or shared photographs. Supporting exploratory and analysis tasks in such composite systems has to be flexible and scalable in order to reflect frequent network changes and user-centric prioritization of the different components. In the context of composite networks, we address the problem of k-Nearest Neighbor search, according to a random walk proximity measure called Effective Importance. Our approach retrieves the exact top neighbors at query time without relying on off-line indexing or summaries of the entire network. This makes it suitable for very large dynamic networks, as well as for composite network overlays mixed at query time. Project Participants:
Publications:
|
|
Discovering influential Groups in Composite NetworksWe consider the problem of discovering effective collaboration cliques of agents in complex organizations in which agents are related based on one or multiple homophily principles, based on geographical, demographic or social commonalities. Independent of the aforementioned relations, agents are grouped in teams to perform a specific task. While there exists a pairwise method of assessing team performance as a whole for a given task, we assume there is no way of directly evaluating the performance of individuals as their success or ineffectiveness is dependent on their teammates. At task completion teams regroup and new tasks are assigned. The above scenario arises in many real-world situations. For example, it can represent the dynamics of employees in a company working on team projects. The management forms teams for specific projects and evaluate each team's performance at project completion. We propose a model of individual and group effectiveness, based on historical data and agent social and demographic ties. We take into account the performance of dynamic large groups for specific tasks and extrapolate individual performance, conditioned on group success. We discover core subgroups that drive team success and relate these to history of collaboration and proximity in the social graph. Project Participants:
|
|
Modeling and Mining Social InteractionsCollaborative systems such as Wikipedia produce high-quality content based on democratic principles of contribution. The high quality is due to a well-balanced self-regulated community of contributors. We propose a method to characterize and study the community structure of contributors of Wiki-based systems based on raw-content generation analysis. We leverage topic modelling in order to capture agreement and opposition of contributors and analyze these multi-modal relations to map communities in the contributor base. The key steps of our approach include (i) modeling of pairwise variable-strength contributor interactions that can be both positive and negative, (ii) synthesis of a global network incorporating all pairwise interactions, and (iii) detection and analysis of community structure encoded in such networks. Analysis of the discovered community structure reveals coalitions of common-interest editors who back each other in promoting certain topics and collectively oppose other coalitions or single authors. A different trust within this project addresses the problem of quality of content and its dependence on a timely convergence of opposing viewpoints to an unbiased state. One problem that hinders this process is unnoticed controversial content due to multiple succeeding contributions that move the focus away from it. We develop a predictive model of editor behavior based on the content dynamics throughout the revision history of an article. A central component of our approach is the detection of the level of local agreement between a pair of contributors based on consecutive revisions. We employ these pairwise interactions (i) to estimate the level of controversy of a new contribution and (ii) to rank contributors based on their sensitivity to a specific new edit. Our ranking is based on the underlying content and the social structure of the editors encoded in a global contributor interaction network. Project Participants:
Publications:
|
|
|
Querying and Mining in High Dimensional SpacesQuerying and Mining Spatial Patterns:High dimensional data occurs in numerous domains e.g. document retrieval, multimedia retrieval, graph exploration, and chemical modeling. Querying and mining of high dimensional data is of great practical value but at the same time a very challenging task. We have designed new index structures and search algorithms for querying spatial patterns in large image datasets and nearest neighbor search in very high dimensional spaces over large datasets. We are developing new techniques for querying and mining keyword set with spatial relationship from high dimensional spatial data. Keyword set search finds application in map services, product search, general data exploration and analysis, and geo tagging of documents and images. Indexing Top-k Queries on Attractive and Repulsive Dimensions:Given a molecular database and a query molecule, often, it is of interest to find database molecules that are close to the query in certain properties and distant in another set of properties. For example, in scaffold hopping, one would like to identify molecules that are structurally dissimilar to a query molecule but has similar biological properties. Often these molecules are represented as feature vector where each dimension corresponds to a certain property. We developed efficient index structures that answer these queries in a sub-linear manner. Empirical evaluation highlighted interesting patterns that deviate from the well-established Lipinski's rule in the domain of drug-likeness. Project Participants:
Publications:
|
|
Querying, Mining and Modeling Uncertainty in Bio-ImagesUncertainty is inherent in many types of data. Oftentimes this uncertainty is due to poor data acquisition, inaccurate measurements, or even the nature of the data itself. Discarding this uncertainty when mining and modeling the data can lead to inaccurate results. However, incorporating uncertainty into all aspects of knowledge discovery makes most problems intractable. Our work focuses on the querying, mining and modeling of biological images. Due to the difficulty in imaging microscopic objects, bio-images are inherently uncertain. Our goal is to enable accurate knowledge discovery on these images. Previously we have examined methods to query uncertain data from a database using the Earth Mover's Distance (EMD). The EMD has been shown to be an effective distance metric between probability distributions, yet suffers from significant computational overhead. Our method improved upon existing lower bounds to the EMD in order to decrease the query time on uncertain databases. Our recent work focuses on spatial modeling and mining of the mouse retina under uncertainty. Our problem here is two fold: we wish to characterize the spatial relationship between objects in the retina (blood vessels, cells, etc), and determine how this relationship changes under different experimental conditions, such as retinal detachment. We use a variety of methods to infer these spatial relationships, from spatial statistics to markov random fields. We further plan to explore methods to determine if a set of uncertain spatial/image features from one retina is statistically different than uncertain features from another retina. Challenges for this problem include the need to determine if the difference between two sets of uncertain features is due to some underlying cause (such as detachment) or simply a result of the uncertainty in the data. Project Participants:
|
|
ChemoinformaticsMining Significant Chemical Substructures in 2D and 3D SpacesIncreased availability of large repositories of chemical compounds has created new challenges and opportunities for the application of data-mining techniques to problems in chemical informatics. Applications of chemoinformatics lie in performing virtual high-throughput screening to identify active compounds in a virtual library and predicting structure-activity relationships. Typical querying and mining tasks involve clustering of large molecular libraries, developing index structures for fast answering of top-k searches, visualizing statistically significant molecular substructures, and predicting biological activity of molecules. Mining over-represented molecular substructures and their application in molecular classification:Often chemical compounds are represented as graphs where the nodes represent atoms and the edges represent covalent bonds between them. Mining statistically significant subgraphs from such graph databases enable us to visualize over-represented molecular substructures in a molecular database. These over-represented substructures are highly information rich and find further application in molecular classification and scaffold hopping. We studied this problem and developed efficient techniques to mine over-represented substructures in a scalable manner. Empirical evaluation confirmed their potential in molecular classification by performing better than well-known molecular fingerprints. Pharmacophore analysis by mining the joint pharmacophore space:A common representation of molecules is based on the geometric location of important features (such as donors, acceptors, hydrophobic cores, etc.) in the 3d space. The underlying geometry of the pharmacophores is responsible for binding between compounds and targets as well as properties of compounds such as Blood Brain Barrier permeability. Existing pharmacophore based techniques are targeted towards extracting and optimizing a specific pharmacophore. Such an approach is limited in terms of the search space it can investigate in the drug discovery process. Often, multiple pharmacophoric targets need to be analyzed in search for drugs against diseases such as cancer or aids. We designed a new technique that eliminates the need to optimize pharmacophores against a specific target by defining a Joint Pharmacophore Space of chemical compounds, targets, and physicochemical/biological properties using the 3D geometry of pharmacophoric features and mine this space directly to identify pharmacophoric patterns. The statistical significance of the mined patterns are substantiated by their low p-values. Further, these patterns demonstrate extremely promising performance in molecular classification by achieving superior prediction quality compared to molecular fingerprints and 3-point pharmacophores Project Participants:
Publications:
|
Summarizing Probabilistic DataMany real world applications produce data with uncertainties drawn from measurements over a continuous domain space. Recent research in the area of probabilistic databases has mainly focused on managing and querying discrete data in which the domain is limited to a small number of values (i.e. on the order of 10). When the size of the domain increases, current methods fail due to their nature of explicitly storing each value/probability pair. Such methods are not capable of extending their use to continuous-valued attributes. In this paper, we provide a scalable, accurate, space efficient probabilistic data synopsis for uncertain attributes defined over a continuous domain. Our synopsis construction methods are all error-aware to ensure that our synopsis provides an accurate representation of the underlying data given a limited space budget. Additionally, we are able to provide approximate query results over the synopsis with error bounds. We provide an extensive experimental evaluation to show that our proposed methods improve upon the current state of the art in terms of construction time and query accuracy. We also demonstrate the ability of our synopsis to answer a variety of interesting queries on a real data set and show that our query error is reduced by up to an order of magnitude over the previous state-of-the-art method. Project Participants:
Publications:
|
|
Mining and Learning in Uncertain Heterogeneous Networked DataData from many applications often contain multiple sources of noise. For instance, data could be missing from a survey or online form in a non-random fashion, records may be labeled incorrectly, or it may be necessary to extract information from unstructured data (e.g. images or free form text). In all of these scenarios, it is necessary to utilize methods that can appropriately deal with these various and often unpredictable forms of uncertainties. These problems have prompted a rise in the popularity of probabilistic methods for data mining. Until recently, much of the research effort in statistics, machine learning and data mining has focused on problems in which data is assumed to be independent and identically distributed (iid). The recent explosion of data from the internet, especially social network and interaction data, has similar issues in terms of uncertainty. The main difference in these sources, however, is the complex structure of dependencies that is provided by the network (eg. if all of your friends enjoy comedies, you are more likely to enjoy them as well due to network homophily). Such dependencies add information that has been shown to significantly improve the accuracy of mining methods (eg. collective classification) at the cost of increased computational complexity. Developing methods that optimize the trade-off between these two factors is crucial for improving the accuracy and scalability of mining networked data sources. The central goal for our current and future work is in developing new methods that can cope with both noisy data and the dependencies that exist within a network. Our approach is threefold: (i) develop methods to mine interesting patterns from noisy networked data, (ii) develop new techniques for learning from networked data that are able to handle the complexity of data uncertainty, and (iii) extend statistical models of networked data to infer additional information about the heterogeneous types of relationships that exist between nodes (eg. Trust). Understanding these aspects of the data introduces many more opportunities for new and different kinds of knowledge discovery in noisy, interdependent data. Project Participants:
|
For more information on the lab's previous research projects, please see our publication page.