Continuous Reliability Monitoring Using Adaptive Critical Path Testing

Bardia Zandian*, Waleed Dweik, Suk Hun Kang, Thomas Punihaole, Murali Annavaram
Electrical Engineering Department, University of Southern California
Mail Code: EEB200, University of Southern California, Los Angeles, CA 90089. Tel: (213) 740-3299

Abstract

As processor reliability becomes a first order design constraint, this research argues for a need to provide continuous reliability monitoring. We present an adaptive critical path monitoring architecture which provides an accurate and real-time measure of the processor's timing margin degradation. Special test patterns check a set of critical paths in the circuit-under-test. By activating the actual devices and signal paths used in normal operation of the chip, each test will capture the up-to-date timing margin of these paths. The monitoring architecture dynamically adapts testing interval and complexity based on analysis of prior test results, which increases efficiency and accuracy of monitoring. Experimental results based on FPGA implementation show that the proposed monitoring unit can be easily integrated into existing designs. Monitoring overhead can be reduced to zero by scheduling tests only when a unit is idle.

Keywords: Reliability, Critical Paths, Timing Margins

Submission Category: Regular Paper for DCCS

1. Introduction

Reduced processor reliability is one of the negative repercussions of silicon scaling. Reliability concerns stem from multiple factors, such as manufacturing imprecision that leads to several within-die and die-to-die variations [7, 8, 10], ultra-thin gate-oxide layers that break down under high thermal stress, Negative Bias Temperature Instability (NBTI) [2], and electromigration. Many of these reliability concerns lead to timing degradation first and eventually lead to processor breakdown [7]. Timing degradation occurs extremely slowly over time and can even be reversed in some instances, such as those caused by NBTI effects. When individual device variations are taken into consideration, timing degradation is hard to predict or accurately model analytically. Most commercial products solve the timing margin degradation problem by inserting a guardband at design and fabrication time. The guardband reduces the performance of a chip during the entire lifetime just to ensure correct functionality in a small fraction of time during the late stages of the chip lifetime.

Our inability to precisely and continuously monitor timing degradation is the primary reason for over-provisioning of resources. Without an accurate and real-time measure of timing margin, designers are forced to use conservative guardbands and/or use expensive error detection and recovery methods. Processors currently provide performance, power, and thermal monitoring capabilities. These capabilities are exploited by designers and developers to understand and debug performance and power problems and design solutions to address them. In this paper, we argue that providing reliability monitoring to improve visibility into the timing degradation process is equally important to future processors, as there will be an ever increasing need to monitor device reliability. Such monitoring capability enables just-in-time activation of error detection and recovery methods, such as those proposed in [1, 13, 16, 17, 24, 25].

The primary contribution of this research is to propose a runtime reliability monitoring framework that uses adaptive critical path testing. The proposed mechanism injects specially designed test vectors into a circuit-under-test (CUT) that measure not only the functional correctness of the CUT but also its timing margin. The outcomes of these tests are analyzed to get a measure of the current timing margin of the CUT. Furthermore, a partial set of interesting events from test injection results is stored in flash memory to provide an unprecedented view into the timing margin degradation process over long time scales. For monitoring to be effective, we believe, it must satisfy the following three criteria:

1. Continuous Monitoring: Unlike performance and power, reliability must be monitored continuously over extended periods of time, possibly many years.

2. Adaptive Monitoring: Monitoring must dynamically adapt to changing operating conditions. Due to differences in device activation factors and device variability, the timing degradation rate may differ from one CUT to the other. Even a chip in the early stages of its expected lifetime can become vulnerable due to aggressive runtime power and performance optimizations such as operating at near-threshold voltages [15] and higher frequency operation [17, 24]. Some effects, such as NBTI related timing degradation, are even reversible [2]. Thus there is a need to adapt the monitoring mechanisms to match the operating conditions at runtime.

3. Low Overhead Monitoring: The monitoring architecture should have low area overhead and design complexity. The monitoring framework should be implementable with minimal modifications to existing processor structures, and should not in itself be a source of errors.

The reliability monitoring approach introduced in this paper is designed to satisfy the above criteria. Using the proposed monitoring mechanism has many benefits. With continuous, adaptive and low-overhead monitoring, conservative preset guardbands can be tightened. Unlike current approaches where error detection and correction mechanisms are continuously enabled, the processor can deploy preemptive error correction measures during in-field operations only when the timing margin is small enough to affect circuit functionality. Furthermore, the unprecedented view provided by continuous monitoring will enable designers to correlate predicted behavior from analytical models with [...]

[...] elements. The four shaded boxes in the figure are the key components in our proposed RMU. The first component is the Test Vector Repository (TVR). TVR holds a set of test patterns and the expected correct outcomes when these test patterns are injected into the CUT. TVR will be filled once with CUT-specific test vectors at the post-fabrication phase. We describe the process for test vector selection in more detail in Section [number missing]. Multiplexer MUX1 is used to select either the regular operating frequency of the CUT or one test frequency from a small set of testing frequencies. Test frequency selection will be described in Section [number missing]. Multiplexer MUX2, on the input path of the CUT, allows the CUT to receive inputs either from the normal execution trace or from TVR. MUX1 input selection is controlled by the Freq. Select signal and MUX2 input selection is controlled by the Test Enable signal. Both these signals are generated by the Dynamic Test Control (DTC) unit.

DTC selects a set of test vectors from TVR to inject into the CUT. After each test vector injection the CUT output will be compared with the expected correct output and a test pass/fail signal is generated. For every test vector injection an entry is filled in the Reliability History Table (RHT). Each RHT entry stores a time stamp of when the test is conducted, test vector index, testing frequency, pass/fail result, and environment conditions such as CUT temperature.
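The test injection and logging flow described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the names `RHTEntry`, `run_test_phase`, `estimate_margin`, and `make_toy_cut` are hypothetical, the CUT is replaced by a toy function with a single critical-path delay, and the margin metric (slack as a fraction of the nominal clock period, taken from the highest test frequency with no failures) is an assumed definition that the text does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

TestVector = Tuple[int, int]  # (test pattern, expected correct output) as held in the TVR


@dataclass(frozen=True)
class RHTEntry:
    """One Reliability History Table row with the fields listed in the text."""
    timestamp: float
    vector_index: int
    test_freq_mhz: float
    passed: bool
    temperature_c: float


def run_test_phase(
    cut: Callable[[int, float], int],  # hypothetical CUT model: (input, freq) -> output
    tvr: Sequence[TestVector],
    vector_indices: Sequence[int],     # subset of the TVR chosen by the DTC
    test_freq_mhz: float,              # frequency selected through MUX1
    now: float,
    temperature_c: float,
    rht: List[RHTEntry],
) -> bool:
    """Inject the chosen vectors, log one RHT entry per injection, and
    return True only if every injected vector passed."""
    all_passed = True
    for idx in vector_indices:
        pattern, expected = tvr[idx]
        passed = cut(pattern, test_freq_mhz) == expected
        rht.append(RHTEntry(now, idx, test_freq_mhz, passed, temperature_c))
        all_passed = all_passed and passed
    return all_passed


def estimate_margin(rht: Sequence[RHTEntry], nominal_freq_mhz: float) -> Optional[float]:
    """Assumed metric: 1 - f_nominal / f_best, where f_best is the highest
    test frequency with at least one pass and no failures in the RHT.
    Returns None when no such frequency has been logged."""
    failing = {e.test_freq_mhz for e in rht if not e.passed}
    clean = {e.test_freq_mhz for e in rht if e.passed} - failing
    if not clean:
        return None
    return 1.0 - nominal_freq_mhz / max(clean)


def make_toy_cut(path_delay_ns: float) -> Callable[[int, float], int]:
    """Toy stand-in for a CUT: the output is correct only when the clock
    period is at least as long as the critical-path delay."""
    def cut(pattern: int, freq_mhz: float) -> int:
        correct = pattern ^ 0xFF
        period_ns = 1000.0 / freq_mhz
        return correct if period_ns >= path_delay_ns else correct ^ 1
    return cut


def demo() -> Tuple[List[RHTEntry], Optional[float]]:
    tvr = [(p, p ^ 0xFF) for p in (0x0F, 0xA5, 0x3C)]
    cut = make_toy_cut(path_delay_ns=4.5)
    rht: List[RHTEntry] = []
    for step, freq in enumerate((200.0, 210.0, 220.0, 230.0)):
        run_test_phase(cut, tvr, range(len(tvr)), freq,
                       now=float(step), temperature_c=55.0, rht=rht)
    return rht, estimate_margin(rht, nominal_freq_mhz=200.0)
```

In this toy run the 4.5 ns path passes at 200, 210, and 220 MHz and fails at 230 MHz, so `demo()` logs 12 entries and reports a margin of about 0.09 of the nominal 5 ns clock period. A real monitor would derive the frequency set, vector subset, and test interval from prior RHT contents, which this sketch leaves out.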

