|
|

Project Summary
Recent advances in networking, databases, and Internet applications have resulted in several important trends.
Advanced network hardware has emerged to ensure the fast and efficient routing of messages. The increasing sizes of data have
led to the development of sophisticated one-pass data stream management algorithms by the database community. Network attacks, spam,
Distributed Denial of Service attacks, and Internet fraud in general are now recognized as major problems for diverse
internet applications. The confluence of these trends motivates the research in this proposal, which
intends to solve some of these internet problems using data streams algorithms implemented on advanced network hardware.
In particular, we propose to develop data stream algorithms that are specifically designed for network processors units (NPU)
that are integrated with content addressable memories. Given the ubiquitous availability of NPUs, the proposed approach will have the
potential for use in the on-line analytical processing of message flows as well as in the identification of network attacks and fraud.
The proposed research is high risk since it involves the integration of specialized hardware and software components from different
vendors, including Intel for the NPU, IDT or NetLogic for the content addressable memories, and Monta Vista and Teja
for the operating systems. This integration process is fraught with risks given the need for one-on-one communication
with the support personnel of the different vendors to successfully integrate these novel computing platforms.
However, the research will have high impact since, if successful, it will lay the foundations for the widespread use of efficient
hardware devices throughout the internet for the detection of spam, fraud and DDoS, as well as enable data-centric analysis of
network traffic for the networks of the future.
Principal Investigator
Collaborator
Graduate Student Researcher
Acknowledgement
This work was supported by NSF award IIS-0744539 SGER: Leveraging Advanced Hardware for Streaming Applications
Publications
- Sudipto Das, Divyakant Agrawal, Amr El Abbadi, CAM Conscious Integrated Answering of Frequent Elements and Top-k Queries over Data Streams, 4th International Workshop on Data Management on New Hardware, DaMoN 2008, in conjunction with SIGMOD/PODS 2008, pages 1 - 10.
Abstract, Paper
Frequent elements and top-k queries constitute an important class of queries for data stream analysis applications.
Certain applications require answers for both frequent elements and top-k queries on the same stream. In addition,
the ever increasing data rates call for providing fast answers to the queries, and researchers have been looking towards
exploiting specialized hardware for this purpose. Content Addressable Memory(CAM) provides an efficient way of looking up
elements and hence are well suited for the class of algorithms that involve lookups. In this paper, we present a fast and
efficient CAM conscious integrated solution for answering both frequent elements and top-k queries on the same stream.
We call our scheme CAM conscious Space Saving with Stream Summary (CSSwSS), and it can efficiently answer continuous queries.
We provide an implementation of the proposed scheme using commodity CAM chips, and the experimental evaluation demonstrates
that not only does the proposed scheme outperforms existing CAM conscious techniques by an order of magnitude at query loads
of about 10%, but the proposed scheme can also efficiently answer continuous queries.
- Sudipto Das, Shyam Antony, Divyakant Agrawal, Amr El Abbadi, CoTS: A Scalable Framework for Parallelizing Frequency Counting over Data Streams, UCSB CS Tecnical Report 2008-08, June 2008.
Abstract, Paper
Applications involving analysis of data streams have gained significant popularity and importance.
Frequency counting, frequent elements and top-k queries form a class of operators that are used for a wide range of
stream analysis applications. In spite of the abundance of these algorithms, all known techniques for answering data
stream queries are sequential in nature. The imminent ubiquity of Chip Multi-Processor (CMP) architectures requires
algorithms that can exploit the parallelism of such architectures. In this paper, we first explore the challenges in
parallelizing frequent elements and top-k queries in the context of the inherent parallelism available in multi-core
processors, evaluate different naive techniques for intra-operator parallelism, and summarize the insights obtained
from the different parallelization efforts. Our experimental analysis of the naive designs implemented in the paper
shows that intra-operator parallelism is not straightforward and requires a complete redesign of the system. Based on
the lessons learnt from this analysis, we design an efficient and scalable framework for parallelizing frequency counting,
frequent elements and top-k queries over data streams. The proposed CoTS (Co-operative Thread Scheduling) framework
is based on the principle of threads co-operating rather than contending. Our experiments on a state-of-the-art quad-core
chip multiprocessor architecture and synthetic data sets demonstrate the scalability of the proposed framework, and
the efficiency is demonstrated by peak throughput of more that 60 million elements per second. In addition, for skewed
data distributions, despite using heavy weight synchronization primitives, the implementation of the proposed framework
outperforms the sequential implementation by a factor of 2 - 4X.
- Sudipto Das, Shyam Antony, Divyakant Agrawal, Amr El Abbadi, CoTS: A Scalable Framework for Parallelizing Frequency Counting over Data Streams, 25th International Conference on Data Engineering (ICDE 2009), pages 1323-1326.
Abstract, Paper
Applications involving analysis of data streams have gained significant popularity and importance.
Frequency counting, frequent elements and top-k queries form a class of operators that are used for a wide range of
stream analysis applications. In spite of the abundance of these algorithms, all known techniques for answering data
stream queries are sequential in nature. The imminent ubiquity of Chip Multi-Processor (CMP) architectures requires
algorithms that can exploit the parallelism of such architectures. In this paper, we first explore the challenges in
parallelizing frequent elements and top-k queries in the context of the inherent parallelism available in multi-core
processors, evaluate different naive techniques for intra-operator parallelism, and summarize the insights obtained
from the different parallelization efforts. Our experimental analysis of the naive designs implemented in the paper
shows that intra-operator parallelism is not straightforward and requires a complete redesign of the system. Based on
the lessons learnt from this analysis, we design an efficient and scalable framework for parallelizing frequency counting,
frequent elements and top-k queries over data streams. The proposed CoTS (Co-operative Thread Scheduling) framework
is based on the principle of threads co-operating rather than contending. Our experiments on a state-of-the-art quad-core
chip multiprocessor architecture and synthetic data sets demonstrate the scalability of the proposed framework, and
the efficiency is demonstrated by peak throughput of more that 60 million elements per second. In addition, for skewed
data distributions, despite using heavy weight synchronization primitives, the implementation of the proposed framework
outperforms the sequential implementation by a factor of 2 - 4X.
-
Sudipto Das, Shyam Antony, Divyakant Agrawal, Amr El Abbadi, "Thread Cooperation in Multicore Architectures for Frequency Counting Over Multiple Data Streams",[Paper], [Talk] in the 35th International Conference on Very Large Data Bases (VLDB) 2009 [PVLDB 2(1) 217-228].
Many real-world data stream analysis applications such as network
monitoring, click stream analysis, and others, require combining
multiple streams of data arriving from multiple sources. This is
referred to as multi-stream analysis. To deal with high stream
arrival rates, it is desirable that such systems be capable of supporting
very high processing throughput. The advent of multicore
processors and powerful servers driven by these processors
calls for efficient parallel designs that can effectively utilize the
parallelism of the multicores, since performance improvement is
possible only through effective parallelism. In this paper, we address
the problem of parallelizing multi-stream analysis in the context
of multicore processors. Specifically, we concentrate on parallelizing
frequent elements, top-k, and frequency counting over
multiple streams. We discuss the challenges in designing an efficient
parallel system for multi-stream processing. Our evaluation
and analysis reveals that traditional contention based locking results
in excessive overhead and wait, which in turn leads to severe
performance degradation in modern multicore architectures.
Based on our analysis, we propose a cooperation based locking
paradigm for efficient parallelization of frequency counting. The
proposed cooperation based paradigm removes waits associated
with synchronization, and allows replacing locks by much cheaper
atomic synchronization primitives. Our implementation of the proposed
paradigm to parallelize a well known frequency counting algorithm
shows the benefits of the proposed cooperation based
locking paradigm when compared to the traditional contention
based locking paradigm. In our experiments, the proposed cooperation
based design outperforms the traditional contention based
design by a factor of 2 - 5.5X for synthetic zipfian data sets.
Hide Abstract
-
Sudipto Das, Divyakant Agrawal, Amr El Abbadi, "ElasTraS: An Elastic Transactional Data Store in the Cloud",[Paper], [Talk] USENIX Workshop on Hot Topics in Cloud Computing (HotCloud '09), in conjunction with USENIX '09.
Over the last couple of years, "Cloud Computing" or
"Elastic Computing" has emerged as a compelling and
successful paradigm for internet scale computing. One
of the major contributing factors to this success is the
elasticity of resources. But in spite of the elasticity provided
by the infrastructure and the scalable design of the
applications, the elephant (or the underlying database),
which drives most of these web-based applications, is
not very elastic and scalable, and hence limits scalability.
In this paper, we propose ElasTraS which addresses
this issue of scalability and elasticity of the data store in a
cloud computing environment to leverage from the elastic
nature of the underlying infrastructure, while providing
scalable transactional data access. This paper aims
at providing the design of a system in progress, highlighting
the major design choices, analyzing the different
guarantees provided by the system, and identifying
several important challenges for the research community
striving for computing in the cloud.
Hide Abstract
Related Links
- Euclid - Hardware Acceleration of Database Operations

Copyright(c) 2008
DSL, University of California Santa Barbara. All rights reserved. dsl@cs.ucsb.edu
|