Research towards my PhD(Advisors: Prof. Divyakant Agrawal and Prof. Amr El Abbadi, @ UCSB)
Scalable, Elastic, and Autonomic Transaction Processing Systems for Cloud Platforms
With the launch of Amazon Web Services (AWS), the Utility (or Cloud) Computing paradigm has become extremely popular. The pay-as-you-go model, and the Elastic computing model has been extremely successful for both small and medium sized corporations. As a result, many similar services have come up since the inception of AWS, salesforce.com, RackSpace, Microsoft Azure, Google AppEngine etc to name a few. All these services scale the entire spectrum from Insfrastructure as a Service (IaaS) to Platform as a Service (PaaS). In the high end services like Google AppEngine, the scalability is automatically taken care of by Google, but with raw services such as EC2 and S3, which provides a huge amount of elastic compute power and storage capability, comes the challenge of automatic scaling of resources based on request patterns. Companies like RightScale Inc. provide auto scaling for the compute cycles, but the back-end database still remains the bottle-neck for many data driven web-applications. In this project, we are looking towards scaling the data management insfrastructure in the cloud. From our understanding, there has been two types of workloads that are predominant for web-applications:
- Single Key Workloads: Which is the most predominant workload requiring very high scalability, and this had led to the development and widespread success of large non-relational key-value stores such as Bigtable or Dynamo.
- Traditional Transactional Workloads: This is not a significant fraction of the workload for web-applications, but still, many applications do require relational databases, and transactional guarantees. But the scalability requirements of these applications are not as demanding as the ones for the single key access patterns.
In addition, there has been a recent trend of Web 2.0 applications that require Scalable Transactional Access to Large Number of Small Groups of keys. Each of these applications require transactional access to more than a single key, the keys of keys are small, but the number of applications can be pretty large.
The Single Key Model has seen many successful and mature solutions, but the two latter models are still a mystery somewhere in the cloud. And this is the goal of this research project. Our focus is to develop a system which will act as the Data Management Layer in the Cloud, supporting the different kinds of workloads noted above. Even though from the application's perspective, the data management layer can have a uniform API, but the system data management system can itself be built on heterogenous components.
Selected Relevant Publications: VLDB 2011, SIGMOD 2011, EDBT 2011 and VLDB 2010 (Tutorial), SoCC 2010, HotCloud '09.
Towards Parallelizing Data Stream Operators
Recent years have seen the increase in popularity in applications where the data is processed as a stream of tuples. These applications are known as Data Stream applications and the algorithms processing these streams need to provides answers on-line. Most of the existing work in stream processing are sequential in nature. With the advent of multi-core processors and their ubiquitous presence, the research community is striving for parallel algorithms to effectively utilize the parallelism of the modern processor paradigms. This project is aimed at designing parallel algorithms for data stream applications.
Relevant Publications: VLDD 2009, ICDE 2009.
TCAM Conscious Algorithms for Frequent Element Queries
In this project, I am interested in designing TCAM conscious algorithms for answering frequent element and top-k queries over data streams. TCAM or Ternary Content Addressable Memory is characterized by constant time look-up operations in hardware. The current research is focussed on designing an efficient algorithm that can efficiently leverage this look-up property of the TCAMs.
Relevant Publication: DaMoN 2008.
Anonymizing Weighted Social Network Graphs
The increasing popularity of social networks has initiated a fertile research area in information extraction and data mining. Although such analysis can facilitate better understanding of sociological, behavioral, and other interesting phenomena, there is growing concern about personal privacy being breached, thereby requiring effective anonymization techniques. If we consider the social graph to be a weighted graph, then the problem of anonymization can be of various types: node identity anonymization, structural anonymization, or edge weight anonymization. In this paper, we consider edge weight anonymization. Our approach builds a linear programming (LP) model which preserves properties of the graph that are expressible as linear functions of the edge weights. Such properties form the foundations of many important graph-theoretic algorithms such as single source shortest paths tree, all-pairs shortest paths, k-nearest neighbors, minimum cost spanning tree, etc. Off-the-shelf LP solvers can then be used to find solutions to the resulting model where the computed solution forms the weights of the anonymized graph. As a proof of concept, we choose the shortest paths problem and its extensions, prove the correctness of the constructed models, analyze their complexity, and experimentally evaluate the proposed techniques using real social network data sets. Our experiments demonstrate that not only does the proposed technique anonymize the weights, but it also improves the k-anonymity of the graphs while scrambling the relative ordering of the edge-weights, thereby providing robust and effective anonymization of the sensitive edge-weights.
Relevant Publications: TKDE Journal, ICDE 2010.
Research in Wireless Networks
ReTiMon: A Real Time Network Monitor
This project involves designing a tool that would help monitor a wireless network in real time.
This is an important problem as with the increasing popularity of the wireless networks and
their large scale deployment, managing such networks has become a challenging task. To be particular,
if there are problems in operation of these networks, then diagnosing them and troubleshooting them manually
has become a big problem. Most of the existing methods involve off-line analysis of the network performance,
but that is of little help when it comes to dealing with the problems in real-time. Real-time analysis is required
to ensure reliable operation of the network and guarantee user satisfaction.
We have designed a tool, that would help the users/ network adminitrators to analyse the network in real time. This tool consists of a sniffer, that passively monitors the network, and a GUI that displays the various metrics of the network as observed by the sniffer. Details of the project can be found in the technical report.
QoS Routing in Wireless Mesh Networks
Wireless Mesh Networks are an emerging field of research in the recent years. But very little has been done for providing QoS enabled routing in Wireless Mesh Networks. This work consists of designing a routing protocol for Wireless Mesh Networks that provides "strong" QoS guarantees in terms of Bandwidth and Delay. This protocol determines the routes on demand and selects a route that can strictly comply with the bandwidth and delay requirements. The protocol takes the overall robustness of the routes when selecting a route amongst multiple candidate routes.
Relevant Publications: MONET Journal, QShine 2007.
Improvement of the performance of Ad-hoc routing Protocols
Various Ad-hoc routing protocol implementations suffer form certain implementation problems like Communication Gray Zone, Fluctuating Neighbors etc. We hope to provide a pragmatic approach towards improving the performance of the routing protocols in this respect. A new module has been implemented that keeps track of the stability of the neighbors to make route decisions and is also capable of detecting the presence of new nodes in its neighborhood, even if they are not participating in any data transfer, and explore possible routes through it. We are also exploring the possible applications of this new module in Multipath routing. In addition to obtaining the stability information, it gathers other neighbor statistics to help the routing entity choose the best route from a set of available routes.
Relevant Publications: EAIT 2006, AMOC 2006.
Improvement of the performance of TCP over Mixed Wireless (Wired-cum-Wireless) Networks
In this project we are working on improving the throughput / goodput of the TCP sources. We are exploring a change in the congestion avoidance algorithm of the traditional TCP-Reno flavor by using a constant congestion window to compensate for the packets lost in wireless losses, as well as congestion losses. The entire experimental work is mainly simulation based; all simulations are performed using the network simulator ns-2 (VINT Project of UCB, LBL & Xerox PARC). We are also exploring the impact of this modified congestion control algorithm on other aspects of TCP performance such as Fairness.
Relevant Publications: PIMRC 2007, SNCNW 2006, SNCNW 2005.
Intern @ Microsoft Research, Redmond ( Summer 2010)
(Mentor: Dr. Philip A. Bernstein)
Hyder supports reads and writes on indexed records within classi-cal multi-step transactions. It is designed to run on a cluster of servers that have shared access to a large pool of network-addressable raw flash chips. The flash chips store the indexed records as a multiversion log-structured database. Log-structuring leverages the high random I/O rate of flash and automatically wear-levels it. Hyder uses a data-sharing architecture that scales out without partitioning the database or application. Each transac-tion executes on a snapshot, logs its updates in one record, and broadcasts the log record to all servers. Each server rolls forward the log against its locally-cached partial-copy of the last committed state, using optimistic concurrency control to determine whether each transaction commits. This paper explains the architecture of the overall system and its three main components: the log, the index, and the roll-forward algorithm. Simulations and prototype measurements are presented that show Hyder can scale out to support high transaction rates.
Relevant Publication: CIDR 2011.
Intern @ IBM Almaden Research Labs, Almaden ( Summer 2009 )
(Mentor: Dr. Yannis Sismanis, Manager: Dr. John McPherson)
Many modern enterprises are collecting data at the most detailed level possible, creating data repositories ranging from terabytes to petabytes in size. The ability to apply sophisticated statistical analysis methods to this data is becoming essential for marketplace competitiveness. This need to perform deep analysis over huge data repositories poses a significant challenge to existing statistical software and data management systems. On the one hand, statistical software provides rich functionality for data analysis and modeling, but can handle only limited amounts of data; e.g., popular packages like R and SPSS operate entirely in main memory. On the other hand, data management systems.such as MapReduce-based systems.can scale to petabytes of data, but provide insufficient analytical functionality. We report our experiences in building Ricardo, a scalable platform for deep analytics. Ricardo is part of the eXtreme Analytics Platform (XAP) project at the IBM Almaden Research Center, and rests on a decomposition of data-analysis algorithms into parts executed by the R statistical analysis system and parts handled by the Hadoop data management system. This decomposition attempts to minimize the transfer of data across system boundaries. Ricardo contrasts with previous approaches, which try to get along with only one type of system, and allows analysts to work on huge datasets from within a popular, well supported, and powerful analysis environment. Because our approach avoids the need to re-implement either statistical or data-management functionality, it can be used to solve complex problems right now.
Relevant Publication: SIGMOD 2010.
Intern @ Google, Mountain View ( Summer 2007 )
(Mentor: Shilpa Kolhar, Manager: Dr. Johnny Chen) LocalGIS
Intern @ IBM Global Services, India ( Summer 2005 )
(Mentor: Dr. Anup K. Ghosh, Manager: Dr. Amitava Mukherjee) Integrating SIP with SAP Enterprise Portals
This work was carried out during the summer training at IGSI, Kolkata in which I had performed some research & developmental work towards integrating SIP phones (based on Session Initiation Protocol) with SAP enterprise portal. Our goal was to integrate a SIP phone with the other collaboration features of SAP Enterprise portals. This would equip the SAP portals to make or receive phone calls. The main objective of the project was to design a portal component that can be integrated with the SAP Collaboration Features in order to make a call from a computer to a mobile or basic landline phone without using PSTN. The goal is to use the Session Initiation Protocol (SIP) for this purpose. I had tried to incorporate a SIP phone as one of the SAP Collaboration Room features. SAP is one of the most popular ERPs - short for enterprise resource planning, a business management system that integrates all facets of the business.
Ruby under Scanner: Comparison with Java
( Technical Report )
Graph Media: a multi featured Software for Graphical Analysis of Mathematical Functions
DFD Generator: a Java based tool the helps the user draw Data Flow Diagrams.