Software: Offline Parallel Debugging: A Case Study Report
Debugging is difficult; debugging parallel programs at large scale is particularly so. Interactive debugging tools continue to improve in ways that mitigate the difficulties, and the best such systems will continue to be mission-critical. Such tools have their limitations, however. They are often unable to operate across many thousands of cores. Even when they do function correctly, mining and analyzing the right data from the results of thousands of processes can be daunting, and it is not easy to design interfaces that are useful and effective at large scale. One additional challenge goes beyond the functionality of the tools themselves. Leadership-class systems typically operate in a batch mode intended to maximize utilization and throughput. It is generally unrealistic to expect to schedule a large block of time to operate interactively across a substantial fraction of such a system. Even when large-scale interactive sessions are possible, they can be expensive and can impact system access for others.
Given these challenges, there is potential value in research into non-traditional debugging models. Here we describe our progress with one such approach: offline debugging. In Section \ref{Concept} we describe the concept of offline debugging in general terms. We then provide in Section \ref{Implementation} an overview of GDBase, a prototype offline debugger. Section \ref{Case Studies} describes proof-of-concept demonstrations of GDBase and focuses on the first attempts to deploy GDBase in large-scale debugging efforts of major research codes. Section \ref{Conclusions} highlights lessons learned and recommendations.