Publications
Rethinking data management for big data scientific workflows
Abstract
Scientific workflows consist of tasks that operate on input data to generate new data products that are used by subsequent tasks. Workflow management systems typically stage data to computational sites before invoking the necessary computations. In some cases, data may be accessed using remote I/O. There are limitations with these approaches, however. First, the storage at a computational site may be limited and unable to accommodate the necessary input and intermediate data. Second, even if there is enough storage, it is sometimes managed by a filesystem with limited scalability. In recent years, object stores have been shown to provide a scalable way to store and access large datasets; however, they provide a limited set of operations (retrieve, store, and delete) that do not always match the requirements of the workflow tasks. In this paper, we show how scientific workflows can take advantage of the …
- Date: October 6, 2013
- Authors: Karan Vahi, Mats Rynge, Gideon Juve, Rajiv Mayani, Ewa Deelman
- Conference: 2013 IEEE International Conference on Big Data
- Pages: 27-35
- Publisher: IEEE