|
|

Project Summary
Recent advances in hardware technology make applications requiring
large numbers of sensor devices possible, where each sensor device has
computation, memory, and communication capabilities. Such miniature
devices can be unobtrusively embedded into the physical world where
they constantly monitor the environment for critical events. Such
environments pose enormous research and development challenges ranging
from networking, operating system design, and information
processing. The award has been instrumental in creating a wireless
sensor network (WSN) comprising of approximately 40 fixed node Grid
sensor network and approximately24 portable node sensor network. The
culmination of the various projects that made up our original proposal
and the projects described herein, has lead to our implementation of a
novel heterogeneous distributed system that couples diverse embedded
sensor devices with high-end servers. Recent advances in server
technology have facilitated novel hardware support and software layers
in support of virtualization and chip multi-processing. For our
system to take advantage of these advances, we have extended our
infrastructure with high-end compute and data processing nodes which
we have integrated into the WSN infrastructure of our system. This
WSN infrastructure has been used to investigate research issues in
remote routing control, voice-over-sensor-network application, hybrid
simulation of large-scale sensor networks and adaptive feedback-based
energy estimation. The nature of information that is generated through
sensor networks is a continuous stream of data elements which becomes
too voluminous for being processed in the traditional manner
i.e. using traditional databases. Often granular data contained in the
streams needs to be aggregated and summarized in an online
manner. Furthermore, this data must be queried for un-anticipated
patterns. We have used this infrastructure to create a TCAM-based NPU
(network processor units) set-up for implementing non-traditional
query processing over data streams. In particular, we have used the
TCAMs for processing frequent-element, top-k, heavy-hitters, and
distinct heavy-hitters queries over data streams. More, recently we
have developed data-stream processing software using this WSN
infrastructure to enable highly-scalable and parallel processing of
sensor data-streams on multi-core processor architectures.
Principal Investigators
Graduate Student Researchers (Research partially supported by this NSF Instrumentation Award)
- Hailing Yu (Ph.D. 2004)
Currently at Oracle
Dissertation: Architecture Conscious Information Management
Systems
Abstract
Due to continuous advances in semiconductor manufacturing, the memory hierarchy has suffered from problems of latency, bandwidth, and cost gaps for decades. The processor-to-memory gap has been partially mitigated by the integration of multi-level caches. However the memory-to-disk gap has been constantly growing. This has resulted in a significant performance bottleneck for data intensive applications.
The goal of this dissertation is to address the I/O bottleneck by exploiting the symbiosis between advanced technologies in hardwares and algorithms capable of maximizing the benefits of the new features provided by these technologies. To this end, we study methods to replace disks with MEMS-based storage devices, which have emerged as the leading candidate for next generation storage devices. Since MEMS-based storage devices have very different characteristics from disks, we exploit their inherent properties when integrating them into existing systems. As a first step, we study the I/O scheduling problem and propose a new scheduling algorithm for MEMS-based storage. Next, we explore MEMS-based storage in the context of relational databases, two-dimensional datasets, and vector sets. We propose the Flexible Retrieval Model (FRM) to place relational data on MEMS-based storage, which enables data retrieval in both row-wise and column-wise manner. We evaluate the performance of FRM using the TPC-H benchmark and show significant I/O improvement. In the context of two-dimensional datasets, we re-evaluate the optimal cost of range queries based on the current disk technology. We establish that it is mpossible to achieve the new optimal cost using disk devices, because the new optimal cost requires two-dimensional sequential access. Therefore we propose to use MEMS-based storage and design a new placement scheme, which almost achieves the new optimal cost.
Considering the I/O bottleneck purely from the perspective of an application, we propose methods for reducing the number of I/O transfers for aggregation ranking queries in OLAP data cubes. These queries aggregate information over a specified range and return the ranked order of the aggregated values. Existing techniques for range aggregate queries are not able to process aggregation ranking queries efficiently due to the large number of I/O transfers. We handle this problem through a new ranking algorithm, which reduces I/O cost dramatically.
- Ozgur Sahin (Ph.D. 2005)
Currently at Google
Dissertation: Supporting Complex Queries Over Structured P2P
Networks
Abstract
Traditional distributed systems generally use the client-server model, where
data is served by dedicated servers and clients are consumers. This model has
limited scalability since the servers can become performance bottlenecks and
the maintenance costs are high. Peer-to-peer (P2P) systems have emerged as a
scalable model for distributed computing as they allow the sharing of resources
available at the client machines in a decentralized manner. They gained a lot of
popularity with file sharing applications such as Napster and Gnutella. These
applications, however, do not provide efficient search mechanisms and do not
scale well with the increasing number of users. As a result, Distributed Hash
Table (DHT) based structured P2P systems have been proposed. DHTs support
efficient exact key lookups in addition to offering very desirable properties
such as scalability, dynamic node insertion and departure, fault tolerance, and
efficient routing. However their applicability is limited as they only support
exact key lookups. We believe that DHTs can be extended to support more
complex querying operations, and thus provide the infrastructure for a larger
set of applications. In this thesis, we describe how various complex querying
operations can be supported over DHTs. Among the operations we explored
are range queries, content-based event dissemination, and similarity search. We
also describe several applications that can benefit from these operations such
as publish/subscribe systems and distributed Web service discovery.
- Alireza Aghili (Ph.D. 2005)
Currently at Teradata, a division of NCR
corporation
Dissertation: Sequence and Structure Similarity Search in
Biological and XML Databases
Abstract
The unprecedented growth of the Internet and biological databases has introduced
challenging and complex data formats and hence furnishing unique collaborative
venues for scientists of various disciplines. The set of such complex
databases includes, 1) XML (eXtended Markup Language) databases, 2) DNA
and Protein sequence and structure databases, 3) Microarray gene expressions, 4)
Biomedical images, and 5) Sensor data stream and Time series databases. Given a
source query pattern and a target database, the similarity search (range query or
top-k) seeks to identify those records of the database which match the given query.
The problem of similarity search in biological and textual databases has received
substantial attention in the past decade. Numerous filtration and indexing techniques
have been proposed to address the scalability issues and reduce the curse
of dimensionality. However, complex applications demand special customization
based on the inherent and underlying dynamics of the data. In this work, we study
the integration of various transformation and shape summarization techniques on
biological sequence and protein structure data, as well as path encoding in the
tree-structured XML data, for more efficient similarity search query processing.
- Nagendar Bandi (Ph.D. 2006)
Currently at Oracle
Dissertation: New Hardware Support for Compute-Intensive Database and Data Stream Operations
Abstract
High performance database systems require both the software and
hardware components of the system to deliver their best possible
performance. While years of research by the database community has
primarily focused on improving performance by developing better
software techniques, relatively little effort has gone into hardware
level innovations. Several important database problems such as
polygon-related spatial database queries, conditional database joins
and one-pass database summarization queries, are still computationally
quite expensive to solve. Although several architecture-conscious
solutions have been proposed, there is a limit to the performance
improvement that can be achieved on current architectures.
In this thesis, we take an alternative approach and propose novel
hardware-based techniques using off-the-shelf graphics and networking
hardware to augment the capabilities of modern database systems. We
first develop the dual-threaded framework, which enables the
integration of hardware-based techniques into a commercial
database. Using modern graphics hardware as a co-processor, we propose
a spatial query operator for answering intersection queries over
polygon datasets. This operator, built using the dual-threaded
framework, complements the existing index structures and optimizations
of a spatial database.
We next focus on developing faster database algorithms using Terenary
Content Addressable Memories (TCAMs), which enable giga-bit rate
forwarding at network routers. We propose the CAM-Cache architecture
which integrates an off-the-shelf TCAM into the memory hierarchy of a
conventional processor through the PCI interface. Using this
architecture, we develop a fast sorting algorithm, which also
functions as an index structure. We then develop a fast conditional
join algorithm which uses the sorting algorithm as a basic component.
Finally, we present fast TCAM-based solutions for solving the heavy
hitters and heavy distinct hitters problems over data streams. We
analyze several popular software solutions for detecting heavy hitters
and heavy distinct hitters, discuss their bottlenecks and propose
several TCAM-conscious solutions which address these issues.
- Chiranjeeb Bourgohain (Ph.D. 2006)
Currently at Google
Dissertation: Efficient Data Gathering in Sensor Networks
Systems
Abstract
Wireless sensor nodes are tiny self-contained devices which combine sensing, compu-
tation and radio communication in a single package powered by batteries. An ad hoc network
of such devices can allow efficient environmental monitoring over a large geographic area. The
primary characteristic of these sensor networks is their size (many thousands of nodes) and
limited resources of individual nodes, especially battery power (couple of AA batteries). Since
radio communication is the largest consumer of battery power, the utility and lifetime of such
a sensor network is limited by the communication load in the network. The constraints arising
from scalability and limited power require that we have to fundamentally rethink the protocols
and algorithms that are deployed over such a network.
Since the sensors continuously monitor physical variables, they generate large quantities
of data. Users of the sensor network make queries over this network with the data of interest
spread over thousands of nodes over a large geographic area. Thus the key problems in sensor
networks arise in gathering this data and processing it efficiently to answer queries. In this
dissertation we focus on designing efficient data gathering algorithms which ensure reliable
delivery of data while minimizing resource consumption over the network. Building upon these
foundations, we design algorithms which can carry out efficient data processing within a single
node, as well as over the entire network.
- Fatih Emekci (Ph.D. 2006)
Currently at Oracle
Dissertation: Privacy Preserving Data Sharing
Abstract
Recent economic trends are forcing enterprises to collaborate with each other
to analyze the market in a better way and make intelligent decisions. Therefore,
data integration from multiple autonomous data sources has emerged as
an important practical problem. The key requirement is that owners of such
data need to cooperate in a competitive landscape in most of the cases. The
research challenge in developing a query processing solution is that the answers
to the queries need to be provided while preserving the privacy of the
data sources. In general allowing unrestricted read access to the whole data
may give rise to potential vulnerabilities as well as may have legal implications.
Therefore, there is a need for privacy preserving query processing methods for
querying data residing at different parties. To satisfy these requirements we
propose new query processing techniques to execute declarative queries spanning
private data warehouses in a privacy preserving manner (i.e., only the
query answer is revealed but nothing else). Along this direction, we explore
different ways of building such a query processor with and without using third
parties. Our scheme is able to answer queries without revealing any useful
information to the data sources or to the third parties. Our approach uses
lightweight computation and communication overhead thus making it scalable
to large data set. Furthermore, we looked at the decision tree learning over the
union of multiple private data warehouses and propose scalable and efficient
solutions to this problem. Finally, this thesis also propose new data management
tools for secure data outsourcing to protect the content of data from a
database service provider. Overall, this work represents the initial steps in the
expanding area of privacy preserving data management.
- Huagang Li (Ph.D. 2006)
Currently at Oracle
Dissertation: Application semantics driven query processing
Abstract
Traditionally, query processing in relational database systems depends
primarily on syntactic analysis, which may not be able to generate
efficient solutions for diverse database applications. Nevertheless,
there are many database applications with semantic knowledge, which
can be exploited to customize the query processing to achieve greater
performance gains. In this dissertation, we study how to integrate the
application semantics to improve query processing in three prevalent
database applications. (i) Data warehouses with
multi-append-only-trend data: Data warehouses maintain historical data
for the discovery of trends to support decision making activities. In
a data warehouse, multiple time-related attributes are usually used to
describe data items, and their values tend to increase over time. This
tendency is referred to as the multi-append-only-trend property. We
show how taking advantage of this property can improve the performance
of range aggregate queries, an essential approach for summarizing
large data sets. (ii) XML databases with tree shaped data: XML has
become the standard for data exchange across different enterprises due
to its expressive tree structured data. An XPath expression, a
fundamental building block of queries over XML databases often
involves content and structure predicates. We focus on XPath
expression queries with content range predicates and study how to
build an efficient summarization index to facilitate query processing
by exploiting the semantic relationship between the range attributes
and the corresponding path structures of XML data. (iii) Streaming
databases with punctuation data: Punctuation has been recently
introduced as a new streaming semantics to address the stateful
problem of relational operators when adapted to data stream processing
systems. We consider how a given continuous join query (CJQ) can
benefit from a set of available punctuation schemes. In other words,
if one can identify that a CJQ requires unbounded storage, then this
query can be flagged as unsafe and prevented from running. We provide
sufficient and necessary conditions for checking whether a CJQ can be
safely executed under a given set of punctuation schemes by
introducing a novel graph construct, the punctuation graph. We show
that the safety checking problem can be done in polynomial time based
on this punctuation graph construct.
- Ahmed Metwally (Ph.D. 2007)
Currently at Google
Dissertation: On-line data forensics for fraud detection in Internet advertising
Abstract
Internet advertising is crucial for the thriving of the entire
Internet, since it allows producers to advertise their products, and
hence contributes to the well being of e-commerce. Some publishers are
dishonest, and use automation to generate traffic to defraud the
advertisers. Similarly, some advertisers automate clicks on the
advertisements of their competitors to deplete their competitors
advertising budgets. Moreover, some dishonest publishers and
advertisers form coalitions to circumvent the fraud detecting
techniques.
In this thesis, we describe the advertising network model, and
discuss the issue of click fraud that is an integral problem in
such setting. We explain the classical approach of detecting click
fraud, which judges publishersg the click fraud techniques into two
major classes based on the motivation of the fraudulent publishers
and advertisers. We devise traffic analysis algorithms that detect
both classes of fraud attacks. This motivated us to solve several
data analysis problems, such as detecting frequent stream elements,
finding association rules between elements in a stream, and
sampling large sets to find similar pairs of sets among numerous
sets. We conclude with some successful findings of our attempt to
detect fraud on a real
- Ying Feng (Ph.D. 2007)
Currently at Google
Dissertation: Analytical query processing in data intensive applications
Abstract
Recent advances in web applications and IT infrastructures have to
deal with large amounts of data collected from the Internet and
business transactions. Such large scale data sets need more advanced
database support for analyzing and processing and querying. Analytical
queries are one of the important operations for business analysis and
decision-making. These queries involve a large number of data entities
and pre-processing is usually used to improve online response time
along with efficient query processing algorithms. This dissertation
addresses the efficient and scalable online execution of analytical
query processing, including aggregate queries and rank queries.
As the widely used pre-computation method for aggregate queries, data
cubes are prohibitively expensive in terms of computation time and
storage space. A new data structure, range trie, is proposed to
capture and utilize data dependency, or data correlations, in data
sets. Furthermore, a new cube representation is proposed to compress
data cube storage without information loss by taking advantage of the
partial order relationships in a data cube. Therefore, the approach
effectively reduces both the computation and I/O cost for cube
computation. In addition, a hierarchical indexing scheme based on
range tries is also suggested to process point aggregation queries and
range aggregation queries. For temporal data sets in which data are
appended periodically, this dissertation addresses the challenges for
range aggregate queries due to the sparsity and incremental property
of temporal attributes. A new data structure named PBBT (Perfect
Binary Block Tree) is used to ensure logarithmic time complexity for
both query processing and data appending.
Moreover, the dissertation introduces a scalable and efficient
approach to ranking top-k preference queries. An optimal search
algorithm based on multidimensional index structures is investigated
and analyzed for performance bottlenecks due to memory consumption. A
notion of dominance rank is proposed to pre-process data sets to
adaptively reduce data accesses at runtime for a given top-k
query. Thus the method ensures high level of scalability and stable
performance with bounded memory for increasing data set size.
- Lingli Zhang (Ph.D. 2007)
Currently at UC Santa Barbara
Dissertation: Title
Abstract
Pending.
- Selim Gurun (Ph.D. 2007)
Currently at UC Santa Barbara
Dissertation: Modeling, predicting and reducing energy consumption in resource restricted computers
Abstract
Pending.
- Ping Wu (Ph.D. 2008)
Currently at Google
Dissertation: Skyline Queries
Abstract
Pending.
Acknowledgement
This work was supported by NSF award IIS-0423336 RR: Wireless Sensor Network Laboratory Infrastructure
Related Links

Copyright(c) 2008
DSL, University of California Santa Barbara. All rights reserved. dsl@cs.ucsb.edu
|