A16: Diagnosing Parallel I/O Bottlenecks in HPC Applications
Author
Event Type
ACM Student Research Competition
Poster
Time
Wednesday, November 15th, 3:55pm - 4:05pm
Location
701
Description
HPC applications generate increasingly large volumes of
data (up to hundreds of TBs), which must be written in
parallel for storage to scale. Parallel I/O is a
significant bottleneck in HPC applications, and is
especially challenging in Adaptive Mesh Refinement (AMR)
applications because the structure of the output files
changes dynamically at runtime. Data-intensive AMR
applications run on the Cori supercomputer show variable
and often poor I/O performance, but diagnosing the root
cause remains difficult. Here we analyze logs from
multiple levels of Cori's parallel I/O subsystems and
find bottlenecks, both during file metadata operations
and during the writing of file contents, that reduced
I/O bandwidth by up to 40x. These bottlenecks appear to
be system-dependent rather than caused by the
application. Finer-grained file-system performance data
would help establish conclusive causal relationships
between file-system servers and metadata bottlenecks.




