Project Title

Histogram-based Query Estimation for Datasets with different Modalities

Project Award Number

0209112


Principal Investigator

Divyakant
Agrawal
Department of Computer Science
UC Santa Barbara
3117 Eng I
UCSB
Santa Barbara
CA
93106
805-893-4385
805-893-8553
agrawal@cs.ucsb.edu
http://www.cs.ucsb.edu/~agrawal

Co-PI
Amr
El Abbadi
Department of Computer Science
UC Santa Barbara
3115 Eng I
UCSB
Santa Barbara
CA
93106
805-893-4239
805-893-8553
amr@cs.ucsb.edu
http://www.cs.ucsb.edu/~amr


Collaborator
Subhash
Suri
Department of Computer Science
UC Santa Barbara
2111 Eng I
UCSB
Santa Barbara
CA
93106
805-893-8856
805-893-8553
suri@cs.ucsb.edu
http://www.cs.ucsb.edu/~suri

Collaborator
Flip
Korn
ATT Research Lab.
180 Park Avenue
Rm A101
Florham Park
NJ
07932
973-360-8722
973-360-8871
flip@research.att.com
http://www.research.att.com/~flip

Keywords
Data Summarization 
On Line Analytical Processing  
Query Estimation
Query Optimization 
Data Streams 

Project Summary

Data summarization and estimation can serve as a useful tool for a diverse set of applications ranging from traditional database query optimization to OLAP applications and the exploration of large data sets.  During the course of this project, data estimation and summarization techniques will be developed for datasets with different modalities: point datasets, datasets containing objects with spatial extents, and stream datasets.  The approach is specifically based on the use of histograms in different contexts.  One of the main problems of applying estimation techniques in a database management system is the unknown characteristic of the distribution of the data that will populate a given DBMS.  In fact, in order for such a technique to be useful, the estimation technique should be effective for different kinds of data distributions and query patterns.  This is because a DBMS is used for a variety of applications resulting in a wide spectrum of data that populates the database.  A new estimation technique called the golden estimator has been identified that employs cumulative probability distributions for creating histograms and captures the underlying data distribution. This technique can also be used to adapt to changes in the query patterns.  For objects with spatial extents, this project will lead to histograms derived from Euler's formulation for graphs. For stream datasets, a variety of approaches will be explored that are amenable to maintain histograms dynamically. The research results will be evaluated in the context of the Alexandria Digital Library and the Digital Campus projects at UCSB.

Publications and Products

·   Chengyu Sun, Divyakant Agrawal, Amr El Abbadi, "Selectivity estimation for spatial joins with geometric selections", Advances in  Database Technology - EDBT 2002, 8th International Conference on Extending Database Technology. Extended version appears as a UCSB Technical Report., vol. March, (2002), p. 609.

·   Mirek Riedewald, Divyakanth Agrawal, Amr El Abbadi and Flip Korn, "Accessing Scientific Data: Simpler is better", Proc. Int. Symp. on Spatial and Temporal Databases, vol. July, (2003), p. Accepted

·   H.-G. Li, D. Agrawal, A. El Abbadi, M. Riedewald, "Exploiting the Multi-Append-Only-Trend Property of Historical Data in Data Warehouses.", Proc. Int. Symp. on Spatial and Temporal Databases, vol. July, (2003).

·   Lin Qiao, Divyakant Agrawal and Amr El Abbadi, "RHist: Adaptive Summarization over Continuous Data Streams", Proceedings of the 2002 ACM CIKM International Conference on Information and Knowledge Management, vol. November, (2003), p. 469.

·   Lin Qiao, Divyakant Agrawal, and Amr El Abbadi, "Supporting Sliding Window Queries for Continuous Data Streams", Processings of the 15th International Conference on Scientific and Statistical Database Management, vol. July, (2003), p. 85. 

·   Abhishek Gupta, , Divyakant Agrawal, and Amr El Abbadi, "Approximate Range Selection Queries in Peer-to-peer ", CIDR 2003, First Biennial Conference on Innovative Data Systems Research, vol. January, (2003), p. 141.

·   Kothari, Agrawal, Gupta, Suri, "Range Addressable Network: A P2P Cache Architecture for Data Ranges", Proceedings of The Third IEEE International Conference on Peer-to-Peer Computing, vol. September, (2003).  

·   S.A. Aghili, D. Agrawal, A. El Abbadi, "Filtration of String Proximity Search via Transformation", Proceedings of IEEE Conference on Bioinformatics and Biomedical Engineering(BIBE), vol. March, (2003), p. 149.  

·   S.A. Aghili, D. Agrawal, A. El Abbadi, "BFT: Bit Filtration Technique for Approximate String Join in Biological Databases", Proceedings of. 10-th International Conference on String Processing and Information Retrieval (SPIRE), vol. October,  (2003). 

Project Impact

Overall, we are in the process of developing a basic set of tools, e.g., histograms, to support data summarization and estimation.  With the increasing and overwhelming increase in the types of data being collected and the sizes of these datasets, such tools are crucial for many different applications. We have developed such innovative and novel techniques for diverse datasets including point and spatial data, data streams, multidimensional scientific datasets, archival and temporal data, biological data as well as distributed peer-to-peer datasets.  By exploring such diverse datasets, we are able to exploit and explore similar properties in diverse applications and find similar underlying principles.  We believe this development is crucial and will help simplify and increase our understanding of the fundamental characteristics of data and information in general.

As for specific contributions:
RHist, is a relaxed histogram that is particularly appropriate for summarization of data streams. This technique is novel in that it adapts to both changes in the data as well as in the query workload characteristics.

MFS is a particularly innovative tool for summarization and estimation of aggregate operations over large multidimensional scientific datasets.  The most significant property of MFS is that it specifically exploits the inherent properties of sequential access in disks.  One important side effect is that it can be used as a basis for improving the performance of a large variety of standard multi-dimensional index structures such as R-trees.

For distributed datasets, our extensions to Chord and CAN promise to expand their applicability from simple exact match to range queries and potentially to more complex queries.

Goals, Objectives and Targeted Activities

Our findings have been very promising and are very interesting, especially given the diversity of the datasets we are using.

1) In the context of spatial datasets, we introduced the notion of browsing large datasets and developed a new model that expresses the various perspectives between queries and objects. The main results is the Euler Histogram (based on the mathematical properties of Euler graphs) which is very efficient in supporting the various types of operations needed in browsing, including intersection and containment. Variations of Euler histograms were developed and were shown to have very high estimation accuracy, while maintaining fast evaluation times.  We also developed a general framework, using the Euler histogram, for estimating the sizes of complex spatial operations including join operations. The framework allows us to develop novel estimation techniques and opens the way for future incorporation of other approximation methods.

2)We developed the R(elaxed)Hist(ogram), which provides summarization over continuous data streams. RHist adapts to both changes in the data stream as well as changes in the query patterns.  One of the main challanges in developing RHIST was to efficiently (both space and timewise) maintain the histograms in the presence of continuously changing data. We also introduced a workload decay model to efficiently capture global workload information and ensure that the query patterns from the recent past are weighted more than queries that are further in the past. Our results indicate that the adaptive approaches based on both workload and data distributions are robust and result in very low estimation errors.

3) For large multidimensional scientific data, we developed the Multiresolution File Scan (MFS) approach which is based on a surprisingly simple and flexible data structure which outperforms sophisticated multidimensional indices, even if they are bulk-loaded and hence optimized for query processing. Our approach also has the advantage that it can incorporate a priori knowledge about the query workload. It readily supports summarization using distributive (e.g., count, sum, max, min) and algebraic (e.g., avg) aggregate operators. In the experiments MFS outperformed existing popular indices, even when they are bulk-loaded and hence optimized for query performance. We also showed how principles developed for MFS can be applied to indices in order to increase their performance. Thanks to its use of simple flat files, MFS can also be combined with popular secondary indices like B+-trees.

4) Regarding temporal and archival datasets, we developed the notion of Multi-Append Only Trend (MAOT), which realistically captures the fact that for real datasets, one will rarely find multiple dimensions such that newly inserted data items have increasing values in all these dimensions. For instance, once an order for a product is placed, the retailer processes it and ships the product to the customer within a certain amount of time. Typically given a later order date, there will be a later shipping date. However, varying order processing speed will cause exceptions from the rule. On the other hand such exceptions are not arbitrarily bad. In the order example most outliers will be off by at most one or two days. Hence such data sets with multiple time-related dimensions may exhibit a non-decreasing trend in some time-related dimensions while in others may maintain this trend approximately.  We developed preliminary data structures that take advantage of these properties in some important special cases.  Our preliminary results indicate implicit knowledge of such MOAT properties can be exploited to give excellent results.

5) Our preliminary findings on biological datasets using approximation techniques based on transformations are quite promising. They indicate that, especially for large datasets that are disk resident, they may be applied to prune most of the non-desired sequences, and reduce the real search problem to only a fraction of the database, as a preprocessing phase for any of the known heuristic approaches.

6) Our finding for approximate range queries in P2P networks is quite promising.  We developed a novel approach for locating relevant data partitions in peer-to-peer systems using locality sensitive hashing. The benefit of such an approach is that it not only finds exact partitions in the system if they exist, but also can help locate partitions that nearly match.  Our findings are a first step towards the support of more complex database operations in P2P systems.

Area Background

A common and very useful technique for summarizing large dataset is a histogram where buckets summarize data over different ranges over the entire domain.  Point data represents traditional datasets for example tuples in relational databases.  Query size estimation plays an important role in the query optimizer of a relational database management system.  Depending on the size of the intermediate results, a particular query execution plan may be favored over another. Although very important, applications increasingly need to deal with non-point datasets.  In the Alexandria digital library project, we have datasets that include objects with spatial extents. Traditional histogram approaches are not appropriate for such datasets since typically a spatial object spans multiple histogram buckets giving rise to over-estimation.  A novel type of dataset that has been gaining importance recently are data streams where data is continuously delivered. Again traditional approaches are not appropriate since the straightforward adaptation of histogram maintenance would require multiple passes over the entire data stream. Furthermore, the types of queries using stream data differ significantly from queries over stored datasets. Often they are continuous queries or simply window queries referring to most recent data.

Area References

·   Ashraf Aboulnaga and Surajit Chaudhuri. Self-tuning histograms: Building histograms without looking at data. In  SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsylvania, USA, pages 181--192, 1999.

·   Ganti, Mong-Li Lee, and Raghu Ramakrishnan. Icicles: Self-tuning samples for approximate query answering. In  VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 176--187. Morgan Kaufmann, 2000.

·   Phillip B. Gibbons and Yossi Matias. New sampling-based summary statistics for improving approximate query answers. In  SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 331--342. ACM Press, 1998.

·   M. Muralikrishna and David J. DeWitt. Equi-depth histograms for estimating selectivity factors for  multi-dimensional queries. In Haran Boral and Per-Ake Larson, editors, Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, June 1-3, 1988, pages 28--36. ACM Press, 1988.

·   G. Piatetsky-Shapiro and C. Connell. Accurate Estimation of the number of Tuples Satisfying a Condition. Proceedings of 1984 ACM SIGMOD international conference on Management of data, pages 256--276, 1984.
 

Potential Related Projects

The Euler Histogram has many applications to disciplines that use spatial datasets as the basis for their research.  This includes diverse domains such as digital libraries, GIS, earth science, etc. The Euler histogram was motivated by the need for a digital library browser in the Alexandria Digital library project.

Our preliminary filtration techniques for bio-informatics promise to significantly improve the performance of search based as well as join based operations on biological datasets.

Project Websites

http://www.cs.ucsb.edu/~dsl/pasteur.html
Project Pasteur web site details our efforts in search and retrieval of strings in bio-informatics.
http://www.cs.ucsb.edu/~dsl/euclid.html 

Project Euclid web site includes our efforts to use histograms to support browsing of spatial digital libraries.