The back-end databases of web-based applications are a major data security concern for enterprises. The problem becomes more critical with the proliferation of enterprise web applications hosted in the cloud. While much prior work has concentrated on malicious attacks that break into the database by exploiting vulnerabilities in web applications, little work has focused on the threat of data harvesting through web form interfaces. In such an attack, an adversary iteratively submits legitimate queries and analyzes the returned results to design new queries, thereby harvesting large collections of the underlying data and learning sensitive information. Although the individual data items in the database are public, data harvesting aims at accumulating large subsets of the underlying data, thus potentially revealing competitive information.
To defend against data harvesting, traditional prevention approaches such as inference control could be used, but they hurt usability. A detection approach should therefore be used either as an alternative or as a complement to prevention approaches. In this paper, we summarize the characteristics of data harvesting and propose the notions of query correlation and result coverage for data harvesting detection. We design a detection system called HengHa, in which Heng examines the correlation among queries in a session, and Ha evaluates the data coverage of the results of queries in the same session. Our experimental results verify the effectiveness and efficiency of HengHa for data harvesting detection.
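To make the two notions concrete, the following is a minimal illustrative sketch, not the HengHa implementation. It assumes that a query is a mapping from form fields to values, that query correlation can be approximated by the overlap of constrained fields between consecutive queries, and that result coverage is the fraction of distinct records returned within a session. All function names and thresholds here are hypothetical.

```python
def field_overlap(q1, q2):
    """Jaccard similarity of the form fields constrained by two queries (assumed measure)."""
    f1, f2 = set(q1), set(q2)
    if not f1 and not f2:
        return 0.0
    return len(f1 & f2) / len(f1 | f2)


def session_correlation(queries):
    """Average field overlap between consecutive queries in one session."""
    if len(queries) < 2:
        return 0.0
    scores = [field_overlap(a, b) for a, b in zip(queries, queries[1:])]
    return sum(scores) / len(scores)


def result_coverage(result_sets, total_records):
    """Fraction of the database's records returned at least once in the session."""
    if total_records <= 0:
        return 0.0
    seen = set().union(*result_sets)  # distinct record ids across all results
    return len(seen) / total_records


def is_suspicious(queries, result_sets, total_records,
                  corr_threshold=0.8, cov_threshold=0.5):
    """Flag a session only when both signals are high (thresholds are placeholders)."""
    return (session_correlation(queries) >= corr_threshold
            and result_coverage(result_sets, total_records) >= cov_threshold)


# Hypothetical session: the same form field is varied systematically.
harvest_queries = [{"state": "CA"}, {"state": "NY"}, {"state": "TX"}]
harvest_results = [{1, 2}, {3, 4}, {5, 6}]
flagged = is_suspicious(harvest_queries, harvest_results, total_records=10)

# Hypothetical ordinary session: unrelated queries, little data touched.
normal_queries = [{"state": "CA"}, {"name": "Smith"}]
normal_results = [{1}, {1}]
not_flagged = is_suspicious(normal_queries, normal_results, total_records=10)
```

Requiring both signals reflects the split described above: highly correlated queries alone may be ordinary browsing, and high coverage alone may come from a single broad query, whereas the combination is more characteristic of systematic harvesting.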