Principal
Investigator
Department of Computer
Science
3117 Eng
I
Santa Barbara
93106
805-893-8553
http://www.cs.ucsb.edu/~agrawal
· Mirek Riedewald, Divyakanth
Agrawal, Amr El Abbadi and Flip Korn, "Accessing Scientific Data: Simpler is
better", Proc. Int. Symp. on Spatial and Temporal Databases, vol. July, (2003),
p. Accepted
· H.-G. Li, D. Agrawal, A. El
Abbadi, M. Riedewald, "Exploiting the Multi-Append-Only-Trend Property of
Historical Data in Data Warehouses.", Proc. Int. Symp. on Spatial and
Temporal Databases, vol. July, (2003).
· Lin Qiao, Divyakant Agrawal
and Amr El Abbadi, "RHist: Adaptive Summarization over Continuous Data Streams",
Proceedings of the 2002 ACM CIKM International Conference on Information
and Knowledge Management, vol. November, (2003), p. 469.
· Lin Qiao, Divyakant Agrawal,
and Amr El Abbadi, "Supporting Sliding Window Queries for Continuous Data
Streams", Processings of the 15th International Conference on Scientific
and Statistical Database Management, vol. July, (2003), p.
85.
· Abhishek Gupta, , Divyakant
Agrawal, and Amr El Abbadi, "Approximate Range Selection Queries in
Peer-to-peer ", CIDR 2003, First Biennial Conference on Innovative Data
Systems Research, vol. January, (2003), p. 141.
· Kothari, Agrawal, Gupta,
Suri, "Range Addressable Network: A P2P Cache Architecture for Data Ranges",
Proceedings of The Third IEEE International Conference on Peer-to-Peer
Computing, vol. September, (2003).
· S.A. Aghili, D. Agrawal, A.
El Abbadi, "Filtration of String Proximity Search via Transformation",
Proceedings of IEEE Conference on Bioinformatics and Biomedical
Engineering(BIBE), vol. March, (2003), p. 149.
· S.A. Aghili, D. Agrawal, A.
El Abbadi, "BFT: Bit Filtration Technique for Approximate String Join in
Biological Databases", Proceedings of. 10-th International Conference on
String Processing and Information Retrieval (SPIRE), vol. October,
(2003).
As for specific
contributions:
RHist, is a relaxed
histogram that is particularly appropriate for summarization of data streams.
This technique is novel in that it adapts to both changes in the data as well as
in the query workload characteristics.
MFS is a
particularly innovative tool for summarization and estimation of aggregate
operations over large multidimensional scientific datasets. The most
significant property of MFS is that it specifically exploits the inherent
properties of sequential access in disks. One important side effect is
that it can be used as a basis for improving the performance of a large variety
of standard multi-dimensional index structures such as R-trees.
For distributed
datasets, our extensions to Chord and CAN promise to expand their applicability
from simple exact match to range queries and potentially to more complex
queries.
1) In the context
of spatial datasets, we introduced the notion of browsing large datasets
and developed a new model that expresses the various perspectives between
queries and objects. The main results is the Euler Histogram (based on the
mathematical properties of Euler graphs) which is very efficient in supporting
the various types of operations needed in browsing, including intersection and
containment. Variations of Euler histograms were developed and were shown to
have very high estimation accuracy, while maintaining fast evaluation
times. We also developed a general framework, using the Euler histogram,
for estimating the sizes of complex spatial operations including join
operations. The framework allows us to develop novel estimation techniques
and opens the way for future incorporation of other approximation methods.
2)We developed the
R(elaxed)Hist(ogram), which provides summarization over continuous data streams.
RHist adapts to both changes in the data stream as well as changes in the query
patterns. One of the main challanges in developing RHIST was to
efficiently (both space and timewise) maintain the histograms in the presence of
continuously changing data. We also introduced a workload decay model to
efficiently capture global workload information and ensure that the query
patterns from the recent past are weighted more than queries that are further in
the past. Our results indicate that the adaptive approaches based on both
workload and data distributions are robust and result in very low estimation
errors.
3) For large
multidimensional scientific data, we developed the Multiresolution File Scan
(MFS) approach which is based on a surprisingly simple and flexible data
structure which outperforms sophisticated multidimensional indices, even if they
are bulk-loaded and hence optimized for query processing. Our approach also has
the advantage that it can incorporate a priori knowledge about the query
workload. It readily supports summarization using distributive (e.g., count,
sum, max, min) and algebraic (e.g., avg) aggregate operators. In the experiments
MFS outperformed existing popular indices, even when they are bulk-loaded and
hence optimized for query performance. We also showed how principles developed
for MFS can be applied to indices in order to increase their performance. Thanks
to its use of simple flat files, MFS can also be combined with popular secondary
indices like B+-trees.
4) Regarding
temporal and archival datasets, we developed the notion of Multi-Append Only
Trend (MAOT), which realistically captures the fact that for real datasets, one
will rarely find multiple dimensions such that newly inserted data items have
increasing values in all these dimensions. For instance, once an order for a
product is placed, the retailer processes it and ships the product to the
customer within a certain amount of time. Typically given a later order date,
there will be a later shipping date. However, varying order processing speed
will cause exceptions from the rule. On the other hand such exceptions are not
arbitrarily bad. In the order example most outliers will be off by at most one
or two days. Hence such data sets with multiple time-related dimensions may
exhibit a non-decreasing trend in some time-related dimensions while in others
may maintain this trend approximately. We developed preliminary data
structures that take advantage of these properties in some important special
cases. Our preliminary results indicate implicit knowledge of such MOAT
properties can be exploited to give excellent results.
5) Our preliminary
findings on biological datasets using approximation techniques based on
transformations are quite promising. They indicate that, especially for large
datasets that are disk resident, they may be applied to prune most of the
non-desired sequences, and reduce the real search problem to only a fraction of
the database, as a preprocessing phase for any of the known heuristic
approaches.
6) Our finding for
approximate range queries in P2P networks is quite promising. We developed
a novel approach for locating relevant data partitions in peer-to-peer systems
using locality sensitive hashing. The benefit of such an approach is that it not
only finds exact partitions in the system if they exist, but also can help
locate partitions that nearly match. Our findings are a first step towards
the support of more complex database operations in P2P systems.
· Ganti, Mong-Li Lee, and
Raghu Ramakrishnan. Icicles: Self-tuning samples for approximate query
answering. In VLDB 2000, Proceedings of 26th International Conference on
Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 176--187.
Morgan Kaufmann, 2000.
· Phillip B. Gibbons and Yossi
Matias. New sampling-based summary statistics for improving approximate query
answers. In SIGMOD 1998, Proceedings ACM SIGMOD International Conference
on Management of Data, June 2-4, 1998, Seattle, Washington, USA, pages 331--342.
ACM Press, 1998.
· M. Muralikrishna and David
J. DeWitt. Equi-depth histograms for estimating selectivity factors for multi-dimensional queries. In Haran
Boral and Per-Ake Larson, editors, Proceedings of the 1988 ACM SIGMOD
International Conference on Management of Data, Chicago, Illinois, June 1-3,
1988, pages 28--36. ACM Press, 1988.
Our preliminary
filtration techniques for bio-informatics promise to significantly improve the
performance of search based as well as join based operations on biological
datasets.