Simulation vs lab validation for IC debug

Article By : Manzil Shah, Bipin Patel

Learn about the challenges associated with the post-silicon lab validation phases of ASIC cycles. We will compare them with functional simulation wherever required.

Design verification has been the main challenge for any IC design company for decades now. EDA tools and verification methods have evolved a lot over the last two decades, but the size and complexity of chips have also increased along with them. As a result, we don't see measurable time savings in the design verification phase of an ASIC or FPGA cycle. Functional verification still takes around 70% of the overall cycle for most complex ASICs.
Now the question that arises is: within a given verification cycle, which activities consume the majority of the time? The common answer we get from DV engineers is "debugging". The kinds of issues faced differ from one sub-phase of the DV cycle to the next. In this article, we talk about the challenges we have been facing at eInfochips in the post-silicon lab validation phases of ASIC projects, and compare them with functional simulation wherever required.
During the verification environment (VE) development phase, we face issues related to VE architecture and spend a lot of time making the VE more robust and user-controllable. Once we have the design in hand for simulations, we face issues related to VE integration with the design. During the real RTL verification phase, we face issues related to functional bugs and spend the majority of our time debugging. This is the sub-phase where we spend almost 70-80% of the overall verification cycle time finding design bugs and verifying the related design fixes.
Often we find that a bug fix opens a Pandora's box of more bugs in the design, which in turn need more debugging time before the design works as expected. In the later phase, our objective should be to face and debug issues related to functional coverage closure.
It's pretty straightforward to achieve functional coverage of around 75-85%, but as we progress towards 100% functional coverage, it becomes really difficult. For a design at any level of complexity, achieving 95% to 100% coverage is the real challenge. For the ASIC DV cycle, most companies prefer to run gate-level simulations (GLS) to gain enough confidence in the final netlist drop they have. Running GLS is itself a time-consuming sub-phase.
Given the complexity of the design, simulation slows down considerably, and we run into many other issues, including tracing X's, just to get our first test up and running with gates. Timing back-annotated gate-level simulations add further issues and debugging time. Once verification is over, we enter what is known as the verification "soak phase". Here, any failure that pops up comes with its own challenges, because the majority of such failures are due to corner-case issues that we have to debug and zero in on.
In short, at every step of the design verification phase, we are involved in "debugging". We would say that debugging is an art expected of a DV engineer. Depending on those debugging skills, a DV engineer may take anywhere from a few minutes to days or even weeks to close pending issues.
To illustrate the points made later in this article, please refer to the figures below.
Figure 1 shows a basic example diagram of an ASIC SoC that we target for lab validation. This is just an example; such ASICs may have more or fewer blocks than what is shown.

[EEIOL 2016JUL07 TA 01Fig1]*Figure 1: Example of ASIC SoC design.*

Figure 2 shows what a typical test bench setup looks like for lab validation.

[EEIOL 2016JUL07 TA 01Fig2]*Figure 2: Typical setup for lab Validation*

In Figure 2:
PC1/PC2/PC3 are the machines used by engineers to carry out lab validation activities. They run the targeted tests against the chip on the board. Please note that, as we have just one targeted board, only one test can be run in the lab at a time.
Host PC is the machine interacting with the actual board that carries the ASIC under validation. It listens for requests from the PCs and acts on them by accessing the board, which is connected through some interface protocol. This can be a standard protocol like PCIe or USB, or some proprietary protocol. The host computer runs socket software to catch and respond to the test case commands coming in over Ethernet.
Board is the end product. It has the chip under validation mounted on it, along with multiple FPGAs and an MEP (Module Embedded Processor).
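The host PC role described above can be sketched roughly as follows. This is only a minimal illustration, not the actual host software: the newline-terminated command framing, the `run_on_board` function and the reply format are all assumptions made for the example. The one point it is meant to show is that a lock serialises access, because there is only one board.

```python
import socket
import socketserver
import threading

BOARD_LOCK = threading.Lock()  # one board, so one test command at a time


def run_on_board(command: str) -> str:
    """Hypothetical stand-in for the PCIe/USB/proprietary board access."""
    return f"DONE {command}"


class CommandHandler(socketserver.StreamRequestHandler):
    def handle(self):
        # Assumed framing: one newline-terminated command per connection.
        command = self.rfile.readline().decode().strip()
        with BOARD_LOCK:  # serialise access to the single board
            reply = run_on_board(command)
        self.wfile.write((reply + "\n").encode())


def start_host_server(host="127.0.0.1", port=0):
    """Start the listener in a background thread; port 0 picks a free port."""
    server = socketserver.ThreadingTCPServer((host, port), CommandHandler)
    server.daemon_threads = True
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def send_command(address, command: str) -> str:
    """What PC1/PC2/PC3 would do: send one command and wait for the reply."""
    with socket.create_connection(address, timeout=5) as conn:
        conn.sendall((command + "\n").encode())
        return conn.makefile().readline().strip()
```

A test PC would call `send_command(server.server_address, "RUN some_test")` and block until the host PC has finished with the board.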

1. Reference Model related challenges
Most product companies make sure that the Reference Model (RM) components they developed during functional verification are also reused later during lab validation.
Such reuse may be done as is or with small changes, depending on the design and the way some block-level RMs are coded. It is a general trend nowadays to reuse such RMs at a higher level of integration during lab validation. In simulation, at block or chip level, such RMs receive the input transactions for prediction from one or more monitors. In the lab, however, there are no such internal interface monitors (we can't probe internal interfaces in a real chip), so we have to rely on Back to Back (B2B) RM connections, where one block-level RM's output goes to the next block-level RM's input. This is the first challenge that we face during pre-lab validation work. For B2B RM connections, the type of transaction passed from one RM to the next must match. An RM may work fine in a simulation environment where monitors are feeding it, but as soon as we tie it to other RMs, it may start misbehaving. This is because the RM now receives all transactions in zero time, whereas in a monitor-based connection it receives its input transactions spread over a long period of time. Some prediction logic within the RM may fail because of this and need changes.
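A B2B connection of this kind could be sketched as below. The two block RMs (`rm_b1`, `rm_b2`), their behaviour and the `Txn` type are hypothetical and exist only for illustration. What matters is that both RMs agree on one transaction type and that the whole chain is evaluated in zero time, with no timestamps involved.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Txn:
    """Common transaction type that every RM in the chain must agree on."""
    kind: str
    payload: int


def rm_b1(txn: Txn) -> List[Txn]:
    """Hypothetical block-1 RM: doubles the payload."""
    return [Txn(txn.kind, txn.payload * 2)]


def rm_b2(txn: Txn) -> List[Txn]:
    """Hypothetical block-2 RM: drops zero payloads, otherwise adds 1."""
    if txn.payload == 0:
        return []
    return [Txn(txn.kind, txn.payload + 1)]


def run_b2b(stages: Iterable[Callable[[Txn], List[Txn]]],
            stimulus: Iterable[Txn]) -> List[Txn]:
    """Feed each RM's output straight into the next RM (no monitors)."""
    txns = list(stimulus)
    for stage in stages:
        # Every transaction reaches the next RM at once, i.e. in zero time.
        txns = [out for txn in txns for out in stage(txn)]
    return txns


predicted = run_b2b([rm_b1, rm_b2], [Txn("A", 1), Txn("A", 0), Txn("B", 3)])
```

Here `predicted` holds the chip-level expectation that is compared against what the board returns.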
If an RM depends on the time at which input transactions arrive in order to predict the output transactions accurately (arbitration logic, for instance), then these scenarios can't be predicted with B2B RM integration. For example, the RM logic may predict different output results when Type-A transactions arrive before Type-B transactions than when Type-B transactions arrive before Type-A. This kind of order dependency may not work as expected with B2B RMs. We have to come up with different directed scenarios with a self-checking mechanism to validate such features in the lab.
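The order dependency can be shown with a toy model. Everything here is assumed for illustration: a "first arrival wins the grant" rule stands in for real arbitration logic, and the B2B path arbitrarily assumes Type-A before Type-B because it has no timestamps to go by.

```python
def predict(arrival_order):
    """Hypothetical order-sensitive RM: the first kind to arrive is granted."""
    return f"grant_{arrival_order[0]}" if arrival_order else "idle"


def monitor_fed(times_a, times_b):
    """Simulation: monitors timestamp each transaction, so order is known."""
    merged = sorted([(t, "A") for t in times_a] + [(t, "B") for t in times_b])
    return predict([kind for _, kind in merged])


def b2b_fed(count_a, count_b):
    """Lab: upstream RMs deliver everything in zero time, without timestamps.

    The chain has to assume some order; here it is A before B.
    """
    return predict(["A"] * count_a + ["B"] * count_b)


sim_prediction = monitor_fed(times_a=[30], times_b=[10])  # B really came first
lab_prediction = b2b_fed(count_a=1, count_b=1)            # order was lost
```

With the same traffic, the two predictions disagree, which is why such features need directed, self-checking scenarios in the lab.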

2. No waveforms for ease of debugging, only Logs
During simulation, we depend heavily on waveforms to debug and narrow down any failures, but in the lab we can't dump waves, so it becomes very difficult to identify the cause of a failure. We have to depend on log dumping to understand the possible cause. Effective log dumping, with good control to enable or disable the debug messages for a specific block RM or a specific VE component, helps to narrow down certain issues very quickly.
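If the lab-side VE or test harness happens to be written in Python, per-component message control could look something like the sketch below, which uses the standard `logging` module. The logger names (`lab.rm.b5`, `lab.rm.b2`) and the `configure_lab_logging` helper are assumptions for this example.

```python
import logging


def configure_lab_logging(debug_components=(), logfile=None):
    """Log INFO everywhere, DEBUG only for the named components.

    Component names are hypothetical, e.g. "rm.b5" or "ve.host_if".
    """
    base = logging.getLogger("lab")
    base.setLevel(logging.INFO)
    if not base.handlers:  # avoid duplicate handlers on repeated calls
        handler = (logging.FileHandler(logfile) if logfile
                   else logging.StreamHandler())
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"))
        base.addHandler(handler)
    for name in debug_components:
        logging.getLogger(f"lab.{name}").setLevel(logging.DEBUG)
    return base


configure_lab_logging(debug_components=["rm.b5"])
logging.getLogger("lab.rm.b5").debug("predicted signature updated")  # shown
logging.getLogger("lab.rm.b2").debug("noisy detail")                 # hidden
```

Because child loggers inherit the level of `lab` unless overridden, a failing run can be repeated with debug messages enabled for just the suspect block.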

3. Failures due to the false initial state of the Chip
This is the most common issue that we face during the initial phase of lab validation. In simulation, every test run starts from the same initial state. In the lab, this may not be true. Let's say you run your first scenario in the lab and leave certain registers in the chip at values other than their default initial values. You then run your next test, which, given its scenario, programs a different set of registers. Because the RM always starts each run from the same initial state for all registers, the actual hardware is now out of sync with the RM, and we may see lots of failures. Narrowing down the root cause of such issues consumes a lot of time, especially if your chip has thousands of registers. To address this, you either have to update your tests so that all affected registers are reset to their initial values, or write all registers with their default values at the beginning of each test. Depending on how big your design is, you have to decide which alternative is more efficient.
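The first alternative (restoring only the registers a test touched) could be sketched as follows. The register map, the `Chip` stand-in and the `TrackedAccess` wrapper are hypothetical. Note that this approach only covers registers written by the test itself, not state that the hardware changes on its own.

```python
DEFAULTS = {0x00: 0x0, 0x04: 0x1F, 0x08: 0x0}  # hypothetical register map


class Chip:
    """Stand-in for the real chip: register state survives across tests."""

    def __init__(self):
        self.regs = dict(DEFAULTS)

    def write(self, addr, value):
        self.regs[addr] = value


class TrackedAccess:
    """Wraps register writes and remembers which addresses a test touched."""

    def __init__(self, chip):
        self.chip = chip
        self.dirty = set()

    def write(self, addr, value):
        self.chip.write(addr, value)
        self.dirty.add(addr)

    def restore_defaults(self):
        """Put every touched register back to its default value."""
        for addr in sorted(self.dirty):
            self.chip.write(addr, DEFAULTS[addr])
        self.dirty.clear()


chip = Chip()
access = TrackedAccess(chip)
access.write(0x04, 0x03)            # the first test leaves this register dirty
out_of_sync = chip.regs != DEFAULTS  # the RM would still assume DEFAULTS
access.restore_defaults()            # run before the next test starts
```

For a chip with thousands of registers, tracking the touched set like this avoids rewriting the full map before every test, at the cost of missing anything not written through the wrapper.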
Many chips have some sort of debug feature that helps in lab validation. One such debug feature is the "Signature Generator" (Siggen) at a block's input or output boundary. It implements some sort of polynomial to generate a signature over N cycles (N being programmable), starting from a programmable initial seed.

[EEIOL 2016JUL07 TA 01Fig3]*Figure 3: B5 block having Siggen sitting on output signals.*

Let's take an example of such a Siggen (as in Figure 3 above) sitting at the output of a major block. It generates a signature using the signals o_fire_out [12:0] and o_fire_valid leaving the block. As far as the functionality of the next block is concerned, o_fire_out is meaningful only if o_fire_valid is non-zero. But the Siggen works on all 13+1 bits (out + valid) of data for N cycles to calculate the final signature.

[EEIOL 2016JUL07 TA 01Fig4]*Figure 4: o_fire_out changes upon each o_fire_valid in first test.*

For example, suppose that during one test o_fire_out [12:0] toggles, with Value-3 being the last value driven, as shown in Figure 4 above.

[EEIOL 2016JUL07 TA 01Fig5]*Figure 5: o_fire_out holds the last driven value from previous test. o_fire_valid stays 0 for second test.*

Now in the next test, the scenario is such that the output o_fire_out shouldn't change at all, as in Figure 5 above. This means that in the hardware, the final signature for the second test is calculated with Value-3 on o_fire_out, which is the value driven in the previous test. o_fire_valid stays at zero throughout the second test. But this is not the case for the RM, as it starts the new test assuming that o_fire_out holds its power-on reset (POR) value. As o_fire_out is an intermediate signal between two blocks, we may never realise what caused the signature mismatch. Such back-to-back tests always pass in simulation, but in the lab they do not. To fix this, we have to run some operation that puts o_fire_out at a known initial value before we run the next test. In-depth knowledge of the design is a MUST to root-cause such issues; otherwise it may take several days. There can be more scenarios like this that consume huge debugging time.
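The mismatch can be reproduced with a small model. The actual Siggen polynomial, seed and signature width are not given here, so the sketch below assumes a 16-bit CRC-style signature with an illustrative polynomial and seed, an arbitrary Value-3, and N = 8 cycles.

```python
def signature(samples, seed=0xFFFF, poly=0x1021):
    """CRC-16 style signature over (o_fire_out, o_fire_valid) samples.

    seed and poly are illustrative only, not the real Siggen's values.
    """
    sig = seed
    for out, valid in samples:
        word = ((valid & 1) << 13) | (out & 0x1FFF)  # 13 data bits + valid
        for bit in range(13, -1, -1):                # shift in MSB first
            feedback = ((sig >> 15) & 1) ^ ((word >> bit) & 1)
            sig = (sig << 1) & 0xFFFF
            if feedback:
                sig ^= poly
    return sig


N = 8                # programmed number of cycles (assumed)
POR_VALUE = 0x0000   # value the RM assumes on o_fire_out after POR
VALUE_3 = 0x0ABC     # hypothetical last value driven by the previous test

# Second test: o_fire_valid stays 0 for all N cycles.
rm_sig = signature([(POR_VALUE, 0)] * N)        # RM starts from POR state
hw_sig_stale = signature([(VALUE_3, 0)] * N)    # chip still holds Value-3
hw_sig_clean = signature([(POR_VALUE, 0)] * N)  # after forcing a known value
```

Even though o_fire_valid is zero throughout, the stale data bits alone are enough to change the hardware signature, while forcing o_fire_out to a known value first brings it back in line with the RM.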

4. Test bench and Test cases portability issues
In a simulation environment, we may have used clock delay loops or hard "#" delays in many places, either for synchronisation purposes or to introduce some desired delay. Such delays don't take effect in the lab, so if the code has a real dependency on them, this has to be addressed before we reuse it in the lab. One solution could be to add dummy writes or reads to a register in the chip in order to introduce the delay.
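As a rough sketch of the dummy-access idea, a wait helper could convert a desired delay into a number of register reads. The `read_reg` callback, the scratch register address and the measured per-read latency are all assumptions that would have to come from the actual lab setup.

```python
def reads_for_delay(delay_us: int, read_latency_us: int) -> int:
    """Number of dummy reads needed to cover delay_us (rounded up)."""
    if read_latency_us <= 0:
        raise ValueError("read latency must be positive")
    return max(1, -(-delay_us // read_latency_us))  # integer ceiling


def wait_with_dummy_reads(read_reg, delay_us, read_latency_us,
                          scratch_addr=0x0):
    """Replace a simulation '#' delay with reads of a harmless register.

    read_reg is a hypothetical callable that performs one register read
    over the host interface; scratch_addr must be safe to read repeatedly.
    """
    count = reads_for_delay(delay_us, read_latency_us)
    for _ in range(count):
        read_reg(scratch_addr)
    return count
```

The resulting delay is only approximate, since the latency of each access over the host interface can vary from one read to the next.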
To conclude, it can safely be said that while debugging issues and failures in lab validation, one should think at a different level and not just at the micro level within the specific module where the failure appears. As the example above showed, the cause of a failure may be something else entirely and not a real functional failure of the chip. The thought process needed to root-cause a lab issue is completely different from the one needed to narrow down issues at simulation level.

About the authors
Manzil Shah and Bipin Patel are with eInfochips.
