Parallel Database Systems: DB2 Parallel Edition



Introduction

Parallel Database Systems Research focuses on finding the ways to make data loading and processing faster. The huge data sizes require a lot of time to load into the system, or execute queries on. As a result of research in this area, some parallel database applications were developed and are successfully used in industry. This web page describes one of them: IBM's DB2 Parallel Edition.

Parallel Platform

DB2 uses shared-nothing architecture, where a number of processors (nodes) are connected together, but each processor has its own memory and disk space. The processors communicate through interconnection network. Adding interprocessor communication feature (Communication Manager) to the application required significant changes to runtime mechanism. Communication manager was implemented basing on send/receive operators. Shared-nothing architecture is highly scalable, since there is no overlapping which is a bottleneck for shared-memory type parallel architecture.


Parallel Data Storage

DB2 PE stores large tables at several nodes, and the querying is performed by each node on the subset of the table, then the partial results are merged and passed for further processing. This strategy is called table partitioning, it speeds up the query execution. DB2 implements hash partitioning strategy, using 2 hash functions. First hash function determines to which node the tuple will go, the second hash function runs at the node and exact location is chosen for a tuple. Processor load balancing can be performed by adjusting hash partitioning functions in order to find optimal number of tuples per node.

Transferring for DB2 to DB2 PE

The SQL language didn't require significant changes. Since SQL is not procedural language, there was no need to modify it in order to transfer to parallel version. Query optimizer was modified to take into account the parallel structure of the system. It is designed to get the best advantage of available information about parallel database layout. Query optimizer uses cost estimation of each operation including interprocessor messaging in order to make the best decision about the query execution plans. DB2 implements function shipping. This means that 'tasks on data are executed at the same node where this data actually resides'[1]. The coordinator node is responsible for assembling the parts of each task and passing it to application.

Performance

Application testing was done the lab and external settings. Table1 summarizes parameters of the system used for performance testing. Table2 shows major results.

Table1. Testing System Parameters

Number of nodes Memory Disk Capacity Interprocessor Connection
8 to 64 128-256 MB 8-48 GB high-speed switch

Table. Performance testing results

Test Speed up
Data load time linear
Insert, Update Performance linear
Parallel Scan sublinear
Index Create Performance linear
Scalability linear or superlinear

As seen from the table, only Parallel Scan demonstrated a sublinear performance. In this test, the task was to execute a query on 100000 rows, returning 10% of the rows. The coordinator node in this test became a bottleneck and slowed down the performance.

Challenges and bottlenecks

- protocol currently used for network communication can be overwhelmed by large queries
- task coordinator node (for parallel query processing) can become a bottleneck

Conclusion

I believe that the application was able to reach its engineering goals and is successful. The performance analysis shows linear speed up for most of important database functionality. Moreover, in Database Systems, where processed data sizes reach critical amounts, the use of parallelism is a good solution.

References

1. DB2 Parallel Edition. Chaitanya Baru, Gilles Fecteau IBM SWSD, Toronto. Ambuj Goyal, Hui-I Hsiao, Anant Jhingran, Sriram Padmanabhan, Walter Wilson IBM TJ Watson Research Center. IBM Systems Journal, 1995. http://citeseer.ist.psu.edu/baru95db.html
2. An Overview of DB2 Parallel Edition. Chaitanya Baru, Gilles Fecteau IBM SWSD, Toronto. International Conference on Management of Data, 1995. http://portal.acm.org/citation.cfm?id=223876
3. Database Management Systems Third Edition, Raghu Ramakrishnan, Johannes Gehrke. 2002.


Created: 4/11/07, by Anna Sibirtseva #8216467.