Presentation

Error Analysis of Globally Distributed Workflow Management System
Description
The management of data-intensive workflows in globally distributed computing systems, such as those used in high-energy physics, presents significant challenges in scalability, resource allocation, and fault tolerance. Workflow Management Systems (WMS) provide a critical framework for addressing these challenges by automating, monitoring, and optimizing the execution of complex computational tasks across heterogeneous resources. The Production and Distributed Analysis (PanDA) system is a sophisticated WMS engineered to handle the immense data processing and analysis demands of the ATLAS experiment. It operates on the Worldwide LHC Computing Grid (WLCG), one of the largest distributed computing infrastructures in the world. However, errors frequently occur when distributing and managing workloads on such a globally distributed computing grid, and they take various forms across different sites. Analysis is the first step toward understanding and mitigating these errors. In this work, we analyze the errors that occur across the globally distributed grid, which will serve as a stepping stone toward designing effective mitigation strategies.