U.S. DOE AMASE Project

Architecture and Management for Autonomic Science Ecosystems


About

The AMASE project is funded by the U.S. Department of Energy’s Smart High-Performance Computing and Communications (SmartHPCC) program, managed by Thomas Ndousse-Fetter. It was launched in September 2017.

Motivation

Extraordinary advances in computing, communication networks, and information technologies have produced explosive growth in highly interconnected systems, which are becoming increasingly complex, dynamic, heterogeneous, and challenging to operate and manage with existing approaches. For large organizations such as DOE’s science community, with thousands of geographically distributed, interconnected systems, traditional distributed systems operation and management based on static behaviors, interactions, and configurations is proving inadequate. Many of these conventional systems are labor-intensive to operate because they require a human in the loop, and they often demand expertise that cannot be easily sustained, replaced, or scaled, especially as we approach the era of the Internet of Things (IoT).

Today's interconnected conventional systems are governed by local, static policies (e.g., cyber security, SLA, and scheduling policies) that are difficult to negotiate and synchronize dynamically in order to obtain the desired level of end-to-end service. In summary, conventional systems are rapidly becoming fragile, unmanageable, and insecure when configured as a federation of resources to provide globally accessible services. These concerns have led researchers to explore alternative strategies for designing, operating, and managing complex engineered systems with massively interacting heterogeneous components.

Our mission

Scientific computing systems are becoming significantly more complex and have reached a critical limit in manageability using current human-in-the-loop techniques. The current state of the art for managing HPC infrastructures does not leverage the remarkable advances in machine learning to more accurately predict, diagnose, and improve the behavior of computational resources in response to user computation. The DOE science complex consists of thousands of interconnected systems that are geographically distributed. As distributed teams and complex workflows now span resources from telescopes and light sources to fast networks and smart IoT sensor systems, it is clear that a single centralized administrative team and software stack cannot coordinate and manage all of the resources. Instead, resources must begin to respond autonomically, managing and tuning their behavior in response to scientific workflows. This project outlines a plan to explore the architecture, methods, and algorithms needed to support future scientific computing systems that self-tune and self-manage. We propose to make the science ecosystem smart by incorporating the functions of sensing, intelligence, and control. Our aim is threefold:

  1. Design a scalable architecture for smart science ecosystems.
  2. Embed intelligence in relevant sub-systems via light-weight machine learning.
  3. Explore methods for distributed and autonomous management of the systems.

We believe that this research, which aims to design and prototype a smart distributed science ecosystem, will have many benefits:

  1. Scientists using DOE computing infrastructure will be able to run workflows on automatically selected resources that are dynamically configured and tuned for their application.
  2. Facility and network operators will have the ability to predict and diagnose problems before they cause downtime.