P92: Characterization and Comparison of Application
Resilience for Serial and Parallel Executions
Session: Poster Reception
Authors
Event Type: ACM Student Research Competition, Poster, Reception
Time: Tuesday, November 14th, 5:15pm - 7pm
Location: Four Seasons Ballroom
Description: Soft errors in exascale applications are a
challenging problem in modern HPC. To quantify an
application’s resilience and vulnerability, HPC users
widely adopt application-level fault injection. However,
this approach is costly, since users need to inject a
large number of faults to ensure statistical
significance, especially for the parallel version of a
program. Parallel execution is normally more complex and
requires more hardware resources than the corresponding
serial execution. It is therefore essential to be able
to predict the error rate of a parallel application from
its serial version. In this poster, we characterize
fault patterns in serial and parallel executions. We
find, first, that serial and parallel executions share
common fault sources. Second, parallel execution also
has some unique fault sources that serial execution does
not. These unique fault sources are important for
understanding how fault patterns differ between serial
and parallel executions.
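
To make the cost of a statistically significant campaign concrete, the following is a minimal sketch of application-level fault injection on a toy serial kernel (a plain sum). It is not the poster's methodology: the helper names (`flip_bit`, `required_injections`, `run_campaign`), the single-bit-flip fault model, the tolerance, and the normal-approximation sample-size formula are all assumptions chosen for illustration.

```python
import math
import random
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit (0..63) of its IEEE 754 double encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))
    return flipped


def required_injections(margin: float, z: float = 1.96, p: float = 0.5) -> int:
    """Injections needed to estimate an error rate within +/- margin.

    Assumes the normal approximation to the binomial; p = 0.5 is the
    worst case and z = 1.96 corresponds to roughly 95% confidence.
    """
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


def run_campaign(data, n_trials, tol=1e-9, seed=0):
    """Inject one random bit flip per trial into a toy kernel (sum of `data`).

    Returns the fraction of trials whose result is non-finite or deviates
    from the fault-free result by more than a relative tolerance `tol`.
    """
    rng = random.Random(seed)
    golden = sum(data)  # fault-free reference result
    corrupted = 0
    for _ in range(n_trials):
        faulty = list(data)
        i = rng.randrange(len(faulty))
        faulty[i] = flip_bit(faulty[i], rng.randrange(64))
        result = sum(faulty)
        if not math.isfinite(result) or abs(result - golden) > tol * abs(golden):
            corrupted += 1
    return corrupted / n_trials


if __name__ == "__main__":
    n = required_injections(margin=0.01)
    rate = run_campaign([1.0, 2.5, 3.25, 4.0], n_trials=n)
    print(f"injections: {n}, observed error rate: {rate:.3f}")
```

Under these assumptions, a 1% margin at about 95% confidence already calls for several thousand injections per kernel. Each injection is a full program run, which is why repeating the campaign for the parallel version is expensive.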




