P38: Benchmarking Parallelized File Aggregation Tools for
Large Scale Data Management
Session: Poster Reception
Event Type: ACM Student Research Competition, Poster, Reception
Time: Tuesday, November 14th, 5:15pm - 7pm
Location: Four Seasons Ballroom
Description: Large-scale genomic data analyses produce many small
files, which creates bottlenecks in data management. Existing
file-archiving utilities, such as tar, cannot efficiently package
large datasets of upward of multiple terabytes and hundreds of
thousands of files. As parallelized and multi-threaded
alternatives, the Blue Waters team and the NCSA Genomics team
developed ParFu (parallel archiving file utility), MPItar, and
ptgz (parallel tar gzip): data management tools that perform
parallel archiving (and, eventually, extraction). We tested the
scalability of each tool as a function of the number of ranks
executed and the stripe count on a Lustre filesystem. We used two
datasets typical of genomic analyses to measure the effects of
different file-size distributions. These tests suggest the best
user parameters, and the resulting costs, for using these tools
as efficient replacements for existing data-packaging tools.
Authors




