Report ID
2008-08
Report Authors
Sudipto Das, Shyam Antony, Divyakant Agrawal, Amr El Abbadi
Report Date
Abstract

Applications involving analysis of data streams have gained significant popularity and importance. Frequency counting, frequent elements and top-k queries form a class of operators that are used for a wide range of stream analysis applications. In spite of the abundance of these algorithms, all known techniques for answering data stream queries are sequential in nature. The imminent ubiquity of Chip Multi-Processor (CMP) architectures requires algorithms that can exploit the parallelism of such architectures. In this paper, we first explore the challenges in parallelizing frequent elements and top-k queries in the context of the parallelism offered by multi-core processors, evaluate different naive techniques for intra-operator parallelism, and summarize the insights obtained from the different parallelization efforts. Our experimental analysis of the naive designs implemented in the paper shows that intra-operator parallelism is not straightforward and requires a complete redesign of the system. Based on the lessons learnt from this analysis, we design an efficient and scalable framework for parallelizing frequency counting, frequent elements and top-k queries over data streams. The proposed CoTS (Co-operative Thread Scheduling) framework is based on the principle of threads co-operating rather than contending. Our experiments on a state-of-the-art quad-core chip-multiprocessor architecture and synthetic data sets demonstrate the scalability of the proposed framework, and the efficiency is demonstrated by peak throughput of more that 60 million elements per second. In addition, for skewed data distributions, despite using heavy weight synchronization primitives, the implementation of the proposed framework outperforms the sequential implementation by a factor of 2 - 4X.

Document
2008-08.pdf610.02 KB