Call for Papers

HPCRI: 1st Workshop on High Performance Computing Reliability Issues

To be held in conjunction with
the 11th International Symposium on High Performance Computer Architecture (HPCA-11)
Palace Hotel, San Francisco
February 12-16, 2005

 

Application demands and trends in the hardware and software industry have made it necessary to develop high performance computing (HPC) systems with high availability and reliability. Clusters and grids are a cost-effective way of meeting high performance computing needs. They are built from thousands of commodity systems that are geographically distributed and that provide fast, dependable and reliable access to data and resources. The reliability of the underlying platform is therefore essential to high performance computing. Design trends in circuit technology, with smaller die sizes and faster clock speeds, have made components more prone to random errors and crosstalk, so fault-tolerant approaches in the hardware and microarchitecture are necessary to mitigate component-level errors.

The workshop will address the reliability and availability needs of HPC and the integration of reliability techniques into enterprise and technical server systems. Recent developments and future directions will be presented in order to share and stimulate ideas for developing highly reliable systems for high performance tasks. The workshop will have several areas of focus: trends in error rates and error types, error detection and recovery mechanisms, fault prediction, and fault-driven provisioning of large-scale systems.

 

Topics of interest include (but are not limited to):

  • Error Detection, Mitigation and Recovery
    • Error Types and Rates
    • Characterization of Errors (e.g., radiation-induced, hard, intermittent, etc.)
    • Detection Mechanisms
    • Monitoring Tools and Techniques
    • Error Detection Latencies
    • Performance Impacts / Overhead
    • Techniques for Error Recovery
  • Reliability Design
    • Caches/memory architecture
    • Interconnect architecture
  • Fault Prediction
  • Fault Modeling
  • Self-healing / Autonomics for error handling in cluster/grid environments
  • Hardware Redundancy Techniques to decrease error rates

 

Paper Submission Information:

 

We welcome submissions in the form of abstracts (less than 1 page) and short papers (5-7 pages). Submissions that describe ongoing research in the above areas are encouraged. Please e-mail your submissions (preferably in PDF) to padmashree.k.apparao@intel.com and gregory.s.averill@intel.com.

 

Workshop Chairs:

Padma Apparao           Intel Labs         padmashree.k.apparao@intel.com    

Greg Averill                  Intel Labs         gregory.s.averill@intel.com

 

Organizing / Program Committee:

Sanjay Kale                              University of Illinois at Urbana-Champaign

Emre Kiciman                           Stanford University

Vittal Kini                                 Intel Corporation

Sanjay Ranka                           University of Florida

Sartaj Sahni                              University of Florida           

Darius Tanksalvala                    HP

                                     

Important Dates:

Abstract Submission:             1st November 2004

Paper Submission:                15th November 2004

Notification of Acceptance:      3rd December 2004

Camera-ready due:                7th January 2005

                                                                       

Workshop Activities:

Workshop activities will include a keynote, an invited-papers session and peer-reviewed technical sessions.

 

Workshop Publication:

We will distribute published proceedings at the workshop.

For further details on the workshop, or with any questions, please contact Padma Apparao (padmashree.k.apparao@intel.com) or Greg Averill (gregory.s.averill@intel.com).