Publications

A greedy consensus-based approach to distributed job selection: toward fully-decentralized workload management system

Abstract

Current approaches to resilience for highly distributed, heterogeneous, large-scale scientific workflows are limited. Most existing workflow and resource management systems have a single point of failure, and their resilience strategies are often static, depend on centralized control, and require considerable design effort from experts. The increasing scale and complexity of workflows, coupled with the limited resilience capabilities of centralized systems, necessitate a fully decentralized, adaptive resource management approach. This paper addresses an important slice of the overall problem by leveraging advances in multi-agent systems (MAS). In particular, we explore the suitability of a MAS consisting of globally distributed agents to perform distributed job selection from a dynamic job pool in a truly decentralized, performant, and resilient manner. We present a novel consensus formulation of the distributed job …

Date
May 1, 2025
Authors
Komal Thareja, Krishnan Raghavan, Anirban Mandal, Pawel Zuk, Imtiaz Mahmud, Mariam Kiran, Ewa Deelman
Conference
2025 IEEE 25th International Symposium on Cluster, Cloud and Internet Computing (CCGrid)
Pages
63-72
Publisher
IEEE Computer Society