Thread Cooperation in Multicore Architectures for Frequency Counting Over Multiple Data Streams


Sudipto Das, Shyam Antony, Divyakant Agrawal, and Amr El Abbadi
Department of Computer Science,
University of California, Santa Barbara
Santa Barbara, CA 93106 - 5110, USA
{sudipto, shyam, agrawal, amr}@cs.ucsb.edu


Abstract

Many real-world data stream analysis applications such as network monitoring, click stream analysis etc., require combining multiple streams of data arriving from multiple sources. This is referred to as multi-stream analysis. To deal with high stream arrival rates, it is desirable that such systems be capable of supporting very high processing throughput. The advent of multicore processors and powerful servers driven by these processors open many research challenges for exploiting parallelism for high throughput processing. This calls for efficient parallel designs that can effectively utilize the parallelism of the multicores. In this paper, we address this problem of parallelizing multi-stream analysis in the context of multicore processors. Parallelizing stream analysis is extremely important since in the multicore era, performance improvement is possible only through effective parallelization. Specifically, we concentrate on parallelizing frequent elements, top-k, and frequency counting over multiple streams. We discuss the challenges in designing an efficient parallel system for processing streams originating from multiple sources. Our evaluation and analysis reveals that traditional .contention. based locking results in excessive overhead which lead to severe performance degradation in modern multicore architectures. Based on our analysis, we propose a .cooperation. based locking paradigm for efficient parallelization of frequency counting. Our implementation of the proposed paradigm to parallelize a well known frequency counting algorithm shows the benefits of the proposed .cooperation. based locking paradigm when compared to the traditional .contention. based locking paradigm. In our experiments, the proposed .cooperation. based design outperforms the traditional .contention. based design by a factor of 2 - 5.5X for synthetic zipfian data sets.


Categories and Subject Descriptors
H.2.4 [Database Management]: Systems.Concurrency; H.4.m [Information Systems Applications]: Miscellaneous

General Terms
Algorithms, Design, Experimentation, Performance.

Keywords
Multicore processors, parallel algorithms, data streams, concurrent structures, multi-streams, frequency counting, frequent elements, top-k.


UCSB Computer Science Technical Report (CS-2009-04), March 2009.

Report:2009-04, Official copy