A New Approach to In-System Silicon Validation and Debug

Part 1: The Problem

By Miron Abramovici and Paul Bradley, Dafca    
September 16, 2007 -- edadesignline.com

A difficult problem

Silicon validation (proving the chip works correctly at speed and in system under different operating conditions) is always necessary, even for a "perfect" design. Silicon debug (finding the root cause of a malfunction) is necessary whenever the design turns out to be less than flawless. First-silicon validation and debug require a labor-intensive engineering effort of several months, and have become the least predictable and most time-consuming part of the development cycle of a new chip at 90 nm, 35% of it on average (Figure 1). The difficulty of silicon validation is expected to increase at 65 nm and below because existing ad-hoc methodologies do not scale with the unprecedented levels of SoC device complexity.
Figure 1. Silicon validation times increase as feature size decreases.

Even the most sophisticated SoC design methodology cannot fully account for all the parameters that impact silicon behavior, or for all logic "corner cases" that occur in the real life of a chip working at speed and in system. For example, the simultaneous occurrence of two unlikely events may not be anticipated pre-silicon, so it is never simulated or analyzed; however, it may cause unexpected behavior when it occurs in system. Pre-silicon verification methods (simulation, emulation, FPGA prototyping, timing analysis, and formal verification) do not address many deep sub-micron problems that occur in the actual device.

To pick just one example from a large universe, consider a scenario in which an unanticipated usage model of a circuit is exercised in system, causing more extensive system activity in a confined area of the chip. This results in an increase in the local on-chip temperature and introduces additional delays in that region, and consequently affects the timing of a critical path, resulting in erroneous logic behavior. Such a problem, first revealed in system, is nearly impossible to detect pre-silicon. Many other integration problems, configuration problems, and unexpected behaviors resulting from signal integrity, power, noise, cross-talk, thermal stress, or process-related issues may be similarly difficult to anticipate pre-silicon.

Because complete system-level verification of a complex SoC at 90 nm or below is not feasible pre-silicon, post-silicon validation has become an essential step in the design implementation methodology. The most important phase of this process is in-system, at-speed validation, which is the first opportunity for a newly manufactured chip to work in its intended environment and to interact with other chips on a system board while operating at its target frequency. At-speed, in-system usage under stress conditions introduces many new functional patterns and explores deep states and corner cases never encountered previously, exposing errors that escaped pre-silicon verification. Moreover, because visibility into the internal dynamic behavior of the chip is severely limited, locating such problems is much more difficult and far more expensive than during pre-silicon verification or on a tester.

Unlike tester-based experiments, which are fully repeatable, in-system operation is not completely deterministic because of unpredictable interactions among independent events such as external interrupts, irregular network traffic, bus arbitration, and transactions involving asynchronous clock domains. This non-determinism makes many problems appear intermittent, and severely complicates silicon validation and debug.

The non-deterministic operation and the lack of control over stimuli applied to the chip combine to create another difficulty: we no longer know the expected values. Even if we had complete observability, determining whether the observed values are correct requires a more complex analysis than "at vector number k, the values of BUSX should be YYY."
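The contrast between the two kinds of check can be sketched in a few lines of Python. This is only an illustration, not a method described in the article: the request/grant bus, the trace format, and the names `vector_compare` and `grant_follows_request` are all hypothetical. The sketch shows two runs of the same system that differ cycle by cycle, so a vector-style comparison rejects one of them, while a property stated over the behavior accepts both.

```python
from typing import List, Tuple

# Hypothetical trace format: one (request, grant) sample per clock cycle.
Trace = List[Tuple[int, int]]


def vector_compare(observed: Trace, expected: Trace) -> bool:
    """Tester-style check: every cycle must match the expected vector."""
    return observed == expected


def grant_follows_request(trace: Trace, max_latency: int) -> bool:
    """Property-style check: every request sees a grant in the same cycle
    or within the next max_latency cycles.

    A request too close to the end of the trace to be granted is
    reported as a failure (an assumption made to keep the sketch short).
    """
    for i, (req, _) in enumerate(trace):
        if req:
            window = trace[i:i + max_latency + 1]
            if not any(gnt for _, gnt in window):
                return False
    return True


# Two hypothetical in-system runs: same behavior, different timing.
run_a: Trace = [(1, 0), (0, 1), (0, 0), (1, 0), (0, 0), (0, 1)]
run_b: Trace = [(1, 0), (0, 0), (0, 1), (1, 1), (0, 0), (0, 0)]
# A hypothetical faulty run: the request is granted one cycle too late.
run_bad: Trace = [(1, 0), (0, 0), (0, 0), (0, 1)]

if __name__ == "__main__":
    print(vector_compare(run_a, run_b))        # False
    print(grant_follows_request(run_a, 2))     # True
    print(grant_follows_request(run_b, 2))     # True
    print(grant_follows_request(run_bad, 2))   # False
```

The property check needs no knowledge of which cycle each event lands on, which is the information that non-deterministic in-system operation takes away.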

Yet another complicating factor is the difficulty of determining the nature of the problem we are dealing with. In addition to functional problems (such as corner cases or deep-state bugs) and timing errors, SoC operation may also be affected by defects that eluded manufacturing tests and are detected only in system. Taking a chip that fails in system through manufacturing testing often results in a "No Trouble Found" outcome because the operational conditions that create the failures cannot be reproduced on a tester. This situation also occurs with chips that fail in the field. Understanding how such problems manifest functionally is of significant benefit in later stages of development.

It is clear that the semiconductor industry needs a new, systematic, and scalable approach to solve these problems effectively.
