Cloud computing has drastically changed the way we perform scientific computing research. In the past, a scientist would run software on a local machine, generate and store results locally, and share the resulting data sets in a controlled (and non-scalable) setting. The cloud brings tremendous scale not only to computation and storage, but also to sharing and collaboration. As a result, new techniques and tools are needed to facilitate such sharing in a way that can be trusted and verified, both requirements for scientific advancement.
Toward this end, we present CloudTracker, a software platform that takes advantage of recent advances in public cloud technology to provide scientists with easy and automatic data provenance tracking and result reproduction. CloudTracker records the inputs and outputs of individual programs deployed to the cloud, as well as information about their execution state. The platform maintains and secures this information in the cloud and provides a web service with which scientists can regenerate the results of others for verification purposes. We demonstrate the utility of CloudTracker for scientific simulations and use it to evaluate the trade-off between storing program outputs and regenerating those outputs in a public cloud.
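The core idea, recording the inputs, outputs, and execution state of a program so that its results can later be verified or regenerated, can be illustrated with a minimal sketch. This is not CloudTracker's actual API; the function and record fields below are hypothetical, chosen to show one plausible shape for a provenance record.

```python
import hashlib
import platform
import subprocess
import sys
import time

def sha256(path):
    """Hash a file so inputs and outputs can be verified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def run_and_record(cmd, inputs, outputs):
    """Run a program and return a provenance record of the execution.

    The record captures the command line, content hashes of input and
    output files, wall-clock timing, exit status, and basic platform
    information -- enough to detect whether a later re-execution
    reproduced the same outputs from the same inputs.
    """
    record = {
        "command": cmd,
        "inputs": {p: sha256(p) for p in inputs},
        "started": time.time(),
        "platform": platform.platform(),
        "interpreter": sys.version,
    }
    result = subprocess.run(cmd, capture_output=True, text=True)
    record["finished"] = time.time()
    record["exit_code"] = result.returncode
    record["outputs"] = {p: sha256(p) for p in outputs}
    return record
```

In a cloud deployment, such a record would be stored alongside the referenced data (or instead of the outputs, when regeneration is cheaper than storage), which is exactly the store-versus-regenerate trade-off evaluated above.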