GUIDE: A Scalable Information Directory Service to
Collect, Federate, and Analyze Logs for Operational
Insights into a Leadership HPC Facility
Authors
Event Type
Paper
State of the Practice
TimeWednesday, November 15th4pm -
4:30pm
Location405-406-407
DescriptionIn this paper, we describe the GUIDE framework used to
collect, federate, and analyze log data from the Oak
Ridge Leadership Computing Facility (OLCF), and how we
use that data to derive insights into facility
operations. We collect system logs and extract
monitoring data at every level of the various OLCF
subsystems, and have developed a suite of pre-processing
tools to make the
raw data consumable. The cleansed logs are then ingested and federated into a central, scalable data warehouse, Splunk, that offers storage, indexing, querying, and visualization capabilities. We have further developed and deployed a set of tools to analyze these multiple disparate log streams in concert and derive operational insights. We describe our experience from developing and deploying the GUIDE infrastructure, and deriving valuable insights on the various subsystems, based on two years of operations in the production OLCF environment.
raw data consumable. The cleansed logs are then ingested and federated into a central, scalable data warehouse, Splunk, that offers storage, indexing, querying, and visualization capabilities. We have further developed and deployed a set of tools to analyze these multiple disparate log streams in concert and derive operational insights. We describe our experience from developing and deploying the GUIDE infrastructure, and deriving valuable insights on the various subsystems, based on two years of operations in the production OLCF environment.
Download PDF:
here
Authors




