|
Analysis and Mining of Biological Images Often, it is necessary to compare regions in an image instead of the entire image itself. Given a query region in an image, we ask the question: Which are the most similar regions in an existing image database? We break both the query and the database images into tiles and represent regions by a connected set of tiles. Based on the feature distance from the query tile and how random it is (i.e., whether it is more likely to be the background and therefore, less important), we assign a score to each database tile. Dynamic programming on the database tiles then returns the maximal connected set of tiles. The top scoring regions are ranked according to their p-values calculated from the sum of the distributions of each tile in the region. Project participants: Vishwakarma Singh |
|
Querying and Mining of Uncertain Data Elements of uncertainty arise in all stages of the analysis workflow including image acquisition, feature extraction, querying and management, and pattern discovery. Our premise is that aspects of uncertainty must be explicitly modeled and accounted for in all stages, and their precise impact understood in a holistic manner. While the main context of our work is biological images, the techniques we develop are general and target a variety of applications in other image-based domains, such as environmental management, geographical information science, remote sensing, and interactive digital multimedia. Our work focuses on a semi-automated ``segmentation-less'' analysis method that extracts features from images. Unlike other methods, however, we account for the uncertainty during the feature extraction stage of analysis and present results over the set of all possible interpretations of the image. The resulting measurements are presented as distributions. Project participants: Nick Larusso, Brian Ruttenberg |
|
Cheminformatics and Motif Mining Increased availability of large repositories of chemical compounds is creating new challenges and oppurtunities for the application of data-mining techniques to problems in chemical informatics. Applications of cheminformatics lie in performing virtual high-throughput screening to identify active compounds in a virtual library and predicting structure-activity relationships. Often chemical compounds are represented as graphs where the nodes represent atoms and the edges repesent covalent bonds between them. Thus the challenge is to develop techniques capable of answering queries of interest to chemists over large chemical libraries. Such queries include similarity searches, sub-structure queries and finding motifs. Finding motifs can be investigated from two perspectives: one of finding the frequent ones, and the other of finding the significant structures. We look at these problems and develop algorithms that combine domain knowledge of chemical compounds with computer science techniques to efficiently answer these queries. Project participants: Sayan Ranu |
|
Protein Network Synthesis and Analysis Understanding large-scale probabilistic networks such as protein interaction networks is a challenging task that often requires the detection of well-connected substructures (clusters) in the networks. These connected substructures, usually paths or trees, correspond to meaningful patterns (e.g., metabolic or signaling pathways under which a group of proteins work together to carry out certain biological activities). A significant number of proteins in the protein interaction networks for different model organisms remain uncharacterized and predicting the function of these proteins has been a major challenge. A number of existing techniques assume that proteins with similar functions are topologically close in the network. Our hypothesis is that proteins with similar function annotations have similar functional patterns in their neighborhood, regardless of the distance between them in the interaction network. We present a novel approach for protein function prediction based on similarity in the functional neighborhood. The functional neighborhood of a protein is summarized into a feature vector using Markov Random Walks. In order to compare the feature vectors that are defined over Gene Ontology (GO) terms, we formulate a novel distance metric based on the Earth Mover’s Distance. This measure incorporates the inherent hierarchical relationship between GO annotations. Project participants: Petko Bogdanov, Kyle Chipman, Kathy Macropol Software: RRW: A graph clustering library based on repeated random walks |
|
Graph Databases With the prevalence of graph data in a variety of domains, there is an increasing need for a language to query and manipulate graphs with heterogeneous attributes and structures. We propose a query language for graph databases that supports arbitrary attributes on nodes, edges, and graphs. In this language, graphs are the basic units of information and each query manipulates one or more collections of graphs. To allow for flexible compositions of graph structures, we extend the notion of formal languages from strings to the graph domain. We present a graph algebra extended from the relational algebra in which the selection operator is generalized to graph pattern matching and a composition operator is introduced for rewriting matched graphs. Then, we investigate access methods of the selection operator. Pattern matching over large graphs is a challenge due to NPcompleteness of subgraph isomorphism. We address this by a combination of techniques: neighborhood subgraphs and profiles, joint reduction of the search space, and optimization of the search order. Experimental results on real and synthetic large graphs demonstrate that our graph specific optimizations outperform an SQL-based implementation by orders of magnitude. Project participants: Petko Bogdanov, Kyle Chipman, Kathy Macropol, Sayan Ranu |
Analysis and Mining of Biological Images Given a large collection of biomedical images of several conditions and treatments, how can we succinctly describe the characteristics of each setting? For example, given a large collection of retinal images from several different experimental conditions (normal, detached, reattached, etc.), how can data mining help biologists focus on important regions in the images or on the differences between different experimental conditions? We proposed to automatically develop a visual vocabulary by breaking images into $n \times n$ tiles and deriving key tiles ("ViVos") for each image and condition. We experiment with numerous domain-independent ways of extracting features from tiles (color histograms, textures, etc.), and several ways of choosing characteristic tiles (PCA, ICA). We performed experiments on two disparate biomedical datasets. Our "ViVos" do an excellent job as "visual vocabulary terms": they have biological meaning, as corroborated by domain experts; they help spot characteristic regions of images, exactly like text vocabulary terms do for documents; and they highlight the differences between pairs of images. The paper describing our method was chosen as one of the top-5 best student papers in ICDM 2005. Comparison of images requires a distance metric that is sensitive to the spatial location of objects and features. Such sensitive distance measures can, however, be computationally infeasible due to the high dimensionality of feature spaces coupled with the need to model the spatial structure of the images. We developed a novel multi-resolution approach to indexing spatially sensitive distance measures. We derived practical lower bounds for the earth mover's distance (EMD). Multiple levels of lower bounds, one for each resolution of the index structure, are incorporated into algorithms for answering range queries and k-NN queries, both by sequential scan and using an M-tree index structure. Experiments showed that using the lower bounds reduces the running time of similarity queries by a factor of up to 36 compared to a sequential scan without lower bounds. Computing separately for each dimension of the feature vector yields a speedup of around 14. By combining the two techniques, similarity queries can be answered more than 500 times faster. The technique was published in EDBT 2006. Project participants: Arnab Bhattacharya, Vebjorn Ljosa |
|
|
Probabilistic Querying and Mining Microscopy is a cornerstone of many fields of biology. Digital imaging has increased the rate at which data are acquired, but analysis is still predominantly by visual inspection, and the images do not live on outside their original experiment or lab. We would like the corpus of existing images to be a source of large-scale analysis and mining, and to bring to other fields of biology the successes that data-driven research has brought to genetics. This requires a database of the measurements and observations---not in terms of pixels, but in terms of biological concepts: What is this object? How big is it? How are proteins distributed in it? How does it interact with other objects? How do its changes correlate? Classification and measurement, whether manual or automatic, are error-prone and inexact because the most interesting research stretches both the imaging techniques and our understanding of the biological systems. We must therefore be able to work with uncertain information, and this presents a host of new challenges: Image analysis techniques must produce probabilistic values. Data mining and machine learning techniques must work with these results, and will produce probabilistic values of their own. And the database must be able to store, index, and query these probabilistic values. Project participants: Vebjorn Ljosa, Nick Larusso |
|
Scalable Querying and Mining of Graphs We have developed an index structure, the Closure-Tree, for graph databases that can efficiently support motif queries and similarity queries. Motif queries find graphs that contain a specific graph pattern, whereas similarity queries find graphs that are similar to a query graph. For motif queries, we propose an approximation algorithm for the subgraph isomorphism problem. For similarity queries, we measure graph similarity by edit distance, and have implemented two types of similarity queries: k-NN queries and range queries. Project participants: Huahai He |
|
Modeling and Prediction of Microtubule Dynamicity Microtubules are cellular structures that are critical for many cellular processes including cell division. Microtubules grow and shorten by a stochastic process called the dynamic instability. Tau is a microtubule associated protein which tightly regulates the microtubule dynamics. The dysfunction of tau is related to many neuro-degenerative diseases including fronto-temporal dementia, Parkinsonism and Alzheimer. Different isoforms of tau and different mutants of the isoforms regulate the microtubule dynamics differently. Modeling the dynamic patterns of microtubules provide a better understanding of the differences among the different experimental conditions. For example, our results with pre steady state in vitro data show that the two main wild-type isoforms of tau -- 3-repeat tau and 4-repeat tau -- differ considerably in the growth rate of the microtubules but not appreciably in the time spent in growing. Two-dimensional embedding of these differences provide a good visualization of the relative distances among them. Since experimenting with a particular experimental condition is time-consuming and requires the choice of that particular condition from a large set of possibilities, predicting the model parameters of a hitherto unseen condition will save time and will guide the biologists in planning the next experiments, the predictions for which are the most interesting. Project participants: Arnab Bhattacharya |
|
Information Processing over Sensor Networks Currently, organizations are deploying large scale sensor networks for advanced civilian and military applications such as habitat monitoring, seismic monitoring, location tracking systems, etc. Although the sensors provide an unprecedented scale of access to monitor phenomena up front, their limited resources in terms of processing speed, communication power and memory render the conventional data management tools and techniques ineffective. Therefore, distributed protocols and algorithms are being employed to store, process, and query high level semantically rich data from the the raw and noisy sensor data. In the following projects, we developed novel distributed techniques to mine and summarize spatio-temporal data in such a resource constrained sensor network. In the DIST project, we designed an index-structure to track moving objects at various spatio-temporal resolutions. DIST can be used to answer range queries and is scalable with respect to update, storage and query costs. The next project, ELink was designed to capture spatio-temporal correlations in sensor data. Data is compressed in the temporal dimension locally at the sensor node using AR models, and then sensors with similar models are spatially clustered using in-network algorithms. In the third project, we transformed raw data into symbolic models, and further designed novel algorithms to aggregate two or more constituent models into a single composite model. For large scale sensor networks, there is a need to extract the high level information from the imprecise sensor signals. Towards this end, we introduce the transformation of numeric data into symbols and the modeling of resulting symbolic sequences using two statistical models, Markov Chains (MCs) and Hidden Markov Models (HMMs). The symbolic information generated from the distributed sensors require further summarization and aggregation. We design novel algorithms to aggregate two or more constituent models into a single composite model. We develop two kinds of composite models: the first kind captures the average behavior of the underlying models and the second kind captures the bounded behavior of the underlying models. In the light of communication and memory constraints of a sensor, the composite models are built only from the model parameters of the component models and hence do not require the transmission of the training sequences used to build each component model. A hierarchy of such in-network aggregate models can be used to efficiently answer queries on observation sequences, such as finding all sensors which have observed an unusual sequence of symbols with a high likelihood and whether a particular region has a high probability of observing such an unusual sequence. High level semantic queries such as the similarity of a sensor model with a given query model can also be answered. Our experiments on real-world and synthetic data sets show that transmission costs and communication costs are reduced by upto 6 times and 90% pruning of queries is achieved when the models are correlated. Project participants: Anand Meka and Arnab Bhattacharya |
|
Mining Spatial Patterns on Theme Maps of Phenotyped Cells Retinal cells contain micro molecular mixture of 100-200 metabolic reactant monomers like amino acids, and nucleic acids. Their molecular signatures give us a picture of their functional states. Excitation markers can be used to chemically discriminate cells and classify them. Tracking cell classes is indispensable to analysis of cellular systems [2-5]. There is a need to develop tools that mine the phenotyped cells for patterns. We present a tool that mines tissue samples to cluster cells of a particular class based on their connectivity to other classes and tracks their spatial distribution. The tool can be used to track information flow in cellular systems by studying biological circuits based on the spatial distribution of cells. For example, co-localized cells are likely to form part of a common pathway. |
|
Biological Sequence Analysis DNA sequences and protein sequences constitute an important type of biological data, and efficient analysis is paramount because the the amount of sequence data is already enormous, and increasing at an exponential rate. Three sequence analysis problems are of particular interest to us: whole-genome DNA sequence alignment, whole-genome shotgun sequence assembly, and multiple sequence alignment. Our approaches to these problems have in common that they transform the sequences into points in a multi-dimensional space, build index structures on these points, and solve the problem quickly using the index structures. Our method for whole-genome DNA sequence alignment was recently published in Bioinformatics. Project participants: Tamer Kahveci, Vebjorn Ljosa and Arnab Bhattacharya |
|
Protein Structure Analysis We have developed methods for efficient analysis of the protein structure data. One of our methods, PSI, uses databasee techniques for efficient querying of of similar protein structures in large protein databases. We also developed automated classification techniques for classification of proteins by using multiple data sources. In this context we used machine learning techniques to combine protein structure and sequence data to perform high quality classifications. Project participants: Orhan Camoglu and Tolga Can |
|
We develop a technique that summarizes a dynamic stream incrementally at multiple resolutions. This approximation can be used to answer point queries, range queries, and inner product queries. We extend the above technique to work in a distributed setting, specifically in a large network where a main site summarizes the stream and clients ask queries. We minimize the message overhead by deciding what and where to replicate by using an adaptive replication scheme. We maintain a hierarchy of approximations that change adaptively based on the query and update rates. Monitoring thousands of data streams online poses a challenge in many data-centric applications such as telecommunications networks, traffic management, trend-related analysis, web-click streams, intrusion detection, and sensor networks. Stream mining techniques employed in these applications have to be efficient in terms of space and per-item processing time, while providing a high quality of answers to queries such as finding similar patterns, monitoring specified conditions, detecting correlations, and computing aggregates. Stardust provides a scalable framework for stream mining applications. Project participants: Ahmet Bulut |
|
ComBAT ComBAT is a collection of computational biology analysis tools. The goal of this project is to find efficient methods for indexing and searching various types of biological data. Typical datasets considered are DNA/protein sequences, structure databases, and microarray/pathway databases. Current tools include: efficient alignment of DNA sequences and whole genomes, clustering of pathways and expression arrays and proteins structure alignment. These tools can be accessed at the UCSB Bioserver web site. Project participants: Arnab Bhattacharya, Orhan Camoglu, Tolga Can, Maureen Heymans, and Vebjorn Ljosa. |
|
The Mokan project focused on the following areas:
Project participants: Kris Kvilekval |
|
The lecture-on-demand project enables students to search for lectures based on content. In order to process user queries, we facilitate Virage's Video Application Server that provides web-based multimedia search and retrieval interfaces. For a given user query, we issue the corresponding subqueries to the T-rex (see below) index structures for color, texture, etc. and combine the results. The answer is then presented to the user as a list of video shots ranked by relevance from which she can pick the ones to be played back. The playback of the videos is also handled by the Video Application Server. Project participants: Christian Lang |
|
T-rex is a java-based extensible toolbox for multi-dimensional indexing. This toolbox supports the use of prediction models to guide index creation and accelerate query execution. The design of the toolbox is fully modular to allow for different prediction models, query types, index structures, and storage systems. Project participants: Christian Lang (in collaboration with Mirek Riedewald from Cornell University) |
|
KAN was designed to support both parallel and distributed applications. Examples of parallel applications include Sorting, TSP, Cholesky Factorization, and Barnes-Hut. Examples of distributed applications include Banking, Shared Calendar, and Shared Spaces. We also wanted to effectively use a combination of some old ideas like heterogeneity, reliability, object-orientedness, nested transactions, guarded actions, asynchronous calls and consistency. Kan provides certain keywords that extend the Java programming language to support distributed programming. They provide the following key properties: guards, asynchronous method calls, and nested atomic actions. Project participants: Jerry James, Kris Kvilekval |