WHITE PAPER





Error Prone Areas in PCI Express Design

Jitendra Puri, Engineering Director, nSys Design Systems



Verification is a process that presumes no “developer” is perfect. In other words, the whole verification industry thrives on human imperfection! Tell any design engineer that he or she is coding bugs along with the design, and they might be offended. But that is the reality. In this “Tips and Tricks” article we discuss error-prone areas in PCI Express designs, based on our findings during the verification of several PCI Express designs.

Error Prone Areas in PCI Express Design
An experienced PCI designer is expected to know Configuration Space, Configuration cycles, Memory cycles, Device number, Bus number, Base Addresses, TAG, Split cycles and so on.
The majority of designers are also very comfortable with Transaction Layer concepts and terminology. It is the “newer” concepts that need special attention and are more error-prone.
We analyzed the bugs unearthed during the verification of several designs and found the following error-prone areas:

Physical Layer
Many of the designs had one error or another related to LTSSM state transitions. At times the designers had ignored the Training Control bits (Hot Reset, Loopback, Disable Link and Disable Scrambling) that are received as part of the TS Ordered Sets. As a result, the DUT made an incorrect transition to L0 instead of the desired state. This also resulted in LTSSM deadlocks and unnecessary timeouts.
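
The check that was missing in these designs can be sketched as follows. This is a minimal illustration, not a full LTSSM: the names `TrainingControl`, `LtssmState` and `next_state_after_training` are hypothetical, and the priority order among the bits is an assumption made for the sketch. The point is only that the received control bits must be examined before the state machine defaults to L0.

```python
from enum import Enum, IntFlag


class TrainingControl(IntFlag):
    """Training Control bits carried in a received TS Ordered Set."""
    HOT_RESET = 0x01
    DISABLE_LINK = 0x02
    LOOPBACK = 0x04
    DISABLE_SCRAMBLING = 0x08


class LtssmState(Enum):
    L0 = "L0"
    HOT_RESET = "Hot Reset"
    DISABLED = "Disabled"
    LOOPBACK = "Loopback"


def next_state_after_training(control_byte: int) -> LtssmState:
    """Pick the next state from the received Training Control byte.

    The priority order (Disable Link, Hot Reset, Loopback) is an
    assumption of this sketch, not taken from the article.
    """
    bits = TrainingControl(control_byte & 0x0F)
    if TrainingControl.DISABLE_LINK in bits:
        return LtssmState.DISABLED
    if TrainingControl.HOT_RESET in bits:
        return LtssmState.HOT_RESET
    if TrainingControl.LOOPBACK in bits:
        return LtssmState.LOOPBACK
    # Disable Scrambling affects data handling, not the target state.
    return LtssmState.L0
```

A design that skips this decode behaves as if the function always returned `LtssmState.L0`, which is the incorrect transition described above.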

Data Link Layer
Flow control is another key area where several issues were observed. DUT behavior was correct for VC0, possibly because this channel is initialized by default. The designers took good care of the credit information for VC0, but credit handling for the other Virtual Channels (VCx) was not correct in all the designs. UpdateFC DLLPs were not sent properly for VCx, leading to starvation.

Another key issue was that credit information was not updated for message packets. Certain message packets that the DUT does not support, e.g. SSPL (Set_Slot_Power_Limit) or Vendor Defined messages, are simply ignored. They are then left out of the updated FC information as well, so the credits they consumed are never freed, which leads to starvation.
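
Both flow control issues come down to receiver-side bookkeeping, which might be modeled as below. This is a simplified sketch: the class `RxCreditTracker` and its methods are hypothetical, and header/data credits and the Posted, Non-Posted and Completion types are collapsed into a single counter per Virtual Channel for brevity.

```python
from collections import defaultdict


class RxCreditTracker:
    """Receiver-side credit bookkeeping, one counter per Virtual Channel."""

    def __init__(self):
        self.pending = defaultdict(int)   # credits freed, not yet advertised
        self.dropped = defaultdict(int)   # unsupported TLPs seen per VC

    def tlp_consumed(self, vc: int, credits: int, supported: bool = True) -> None:
        if not supported:
            self.dropped[vc] += 1
        # The TLP occupied buffer space whether it was processed or
        # ignored, so its credits are released in both cases.
        self.pending[vc] += credits

    def build_update_fc(self) -> dict:
        """Return {vc: credits_to_return} for every VC with freed credits."""
        updates = {vc: n for vc, n in self.pending.items() if n > 0}
        self.pending.clear()
        return updates
```

Because `build_update_fc` iterates over every VC that has freed credits, VCx is treated exactly like VC0, and an ignored message still contributes to the next update.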

In some designs the Data Link Layer does not get reset on entry to the DL_Inactive state. NEXT_TRANSMIT_SEQ, ACKD_SEQ, REPLAY_NUM, NEXT_RCV_SEQ, etc. are not set to their default values, and the Retry buffer is not freed when the link is re-established. TL packets received from the application layer are remembered and are sent out even while the link layer is in the DL_Down state (FC_INIT1). As a result, the first packet is not sent with sequence number 0, the stale contents of the Retry buffer get transmitted, and the credit logic gets totally out of sync.
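
A reset routine along the following lines addresses these symptoms. It is a sketch under stated assumptions: `DataLinkTx` and its attributes are hypothetical names, and the default values shown (sequence counters at 0, ACKD_SEQ at all ones) reflect our reading of the specification and should be checked against the revision being targeted.

```python
from collections import deque

SEQ_MODULUS = 4096  # sequence numbers are 12 bits wide


class DataLinkTx:
    def __init__(self):
        self.retry_buffer = deque()
        self.reset()

    def reset(self) -> None:
        """Apply on every entry to DL_Inactive."""
        self.next_transmit_seq = 0
        self.ackd_seq = SEQ_MODULUS - 1   # 0xFFF
        self.replay_num = 0
        self.next_rcv_seq = 0
        self.retry_buffer.clear()         # drop stale TLPs
        self.link_up = False

    def send(self, tlp: bytes):
        """Return (seq, tlp) if transmitted, or None while the link is down."""
        if not self.link_up:
            return None                   # do not remember TLPs while DL_Down
        seq = self.next_transmit_seq
        self.next_transmit_seq = (seq + 1) % SEQ_MODULUS
        self.retry_buffer.append((seq, tlp))
        return seq, tlp
```

After `reset()` and a fresh link-up, the first call to `send` uses sequence number 0 and the Retry buffer holds only that packet.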

Replay Mechanism
During replay, some designs do not block new packets from the Transaction Layer, so these new packets appear in the middle of the replay. Other designs do not handle ACK/NAK DLLPs received during the course of a replay and end up resending packets that have already been ACKed, causing performance degradation.
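
The two behaviors can be illustrated with a small model. The class `ReplayEngine` and its methods are hypothetical, and the sequence comparison ignores 12-bit wraparound to keep the sketch short; a real design needs modulo arithmetic there.

```python
from collections import deque


class ReplayEngine:
    def __init__(self):
        self.retry_buffer = deque()   # (seq, tlp) pairs, oldest first
        self.replaying = False

    def accept_from_tl(self, seq: int, tlp: bytes) -> bool:
        """Refuse new TLPs while a replay is in progress."""
        if self.replaying:
            return False
        self.retry_buffer.append((seq, tlp))
        return True

    def on_ack(self, acked_seq: int) -> None:
        """Purge every TLP up to and including acked_seq (no wraparound)."""
        while self.retry_buffer and self.retry_buffer[0][0] <= acked_seq:
            self.retry_buffer.popleft()

    def replay(self):
        """Yield unacknowledged TLPs, oldest first.

        The buffer is re-read before each transmission, so an ACK that
        arrives mid-replay removes TLPs that have not been resent yet.
        """
        self.replaying = True
        try:
            last_sent = -1
            while True:
                remaining = [e for e in self.retry_buffer if e[0] > last_sent]
                if not remaining:
                    break
                last_sent = remaining[0][0]
                yield remaining[0]
        finally:
            self.replaying = False
```

For example, with sequence numbers 0 to 3 in the buffer, an ACK for 2 that arrives after packet 0 has been resent leaves only packet 3 to replay.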

Transaction Layer
When AER is implemented, the bits in the AER registers are updated correctly, but certain registers (like the Status register) that also need to be updated in the standard PCI configuration space are left out.

EP designs assume that they can also send 64-byte aligned completions when the RCB bit is set to 0. As a consequence, they incorrectly send out completions broken at 64-byte boundaries. The receiver treats these as Malformed.

The Max_Read_Request_Size register is wrongly interpreted and its value is used in the receive logic. The size of each received read request is wrongly compared with this register and, as a consequence, a well-formed read request gets treated as a Malformed read request.
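
A verification environment can check both points with helpers like the ones below. The function names are hypothetical; the default RCB of 128 bytes is an assumption chosen to match the Endpoint case described above, and the field decode (0 meaning 128 bytes, doubling with each step) reflects our reading of the specification.

```python
def completions_respect_rcb(start_addr: int, request_len: int,
                            completion_lens: list, rcb: int = 128) -> bool:
    """Check the payload lengths (bytes) of the completions for one read."""
    if not completion_lens or sum(completion_lens) != request_len:
        return False
    addr = start_addr
    # Every completion except the last must end on an RCB boundary.
    for length in completion_lens[:-1]:
        addr += length
        if length <= 0 or addr % rcb != 0:
            return False
    return completion_lens[-1] > 0


def max_read_request_bytes(field: int) -> int:
    """Decode the 3-bit Max_Read_Request_Size field (0 -> 128 bytes)."""
    if not 0 <= field <= 5:
        raise ValueError("reserved Max_Read_Request_Size encoding")
    return 128 << field


def outgoing_read_allowed(length: int, field: int) -> bool:
    """Limit applies to requests the device issues, not ones it receives."""
    return length <= max_read_request_bytes(field)
```

With these helpers, a 256-byte read at address 0x1000 answered as 64 + 192 bytes is flagged, while 128 + 128 passes. Note that `outgoing_read_allowed` belongs in the transmit path only; calling it on received requests reproduces the bug described above.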

Power Management
Tx_L0s and Rx_L0s are not kept independent. As a result, the transmitter is forced into L0s whenever the receiver is in L0s, which blocks packet transmission. In other designs, the transmitter sends a number of FTS equal to the N_FTS value it advertised itself rather than the value it received from its link partner. The partner's receiver therefore does not get the time it requires for the transition from L0s to L0, and it might go to Recovery after the N_FTS timeout.
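
The intended behavior can be sketched as follows. The names `L0sPort`, `advertised_n_fts` and `partner_n_fts` are hypothetical and the model tracks only the two L0s state variables, nothing else from the LTSSM.

```python
from dataclasses import dataclass
from enum import Enum


class L0sState(Enum):
    L0 = "L0"
    L0S = "L0s"


@dataclass
class L0sPort:
    advertised_n_fts: int   # what this port asked its partner to send
    partner_n_fts: int      # what the partner asked this port to send
    tx_state: L0sState = L0sState.L0
    rx_state: L0sState = L0sState.L0

    def rx_enters_l0s(self) -> None:
        # Only the receive side changes; the transmitter keeps its state.
        self.rx_state = L0sState.L0S

    def tx_enters_l0s(self) -> None:
        self.tx_state = L0sState.L0S

    def tx_exits_l0s(self) -> int:
        """Leave Tx_L0s and return the number of FTS to transmit."""
        self.tx_state = L0sState.L0
        return self.partner_n_fts   # not advertised_n_fts
```

For instance, a port that advertised 32 but received 200 from its partner must send 200 FTS on exit from Tx_L0s, and its transmitter stays in L0 when only the receiver has entered L0s.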

Conclusion
To summarize, review the error-prone areas more thoroughly and run Verification IP in back-to-back operation to improve understanding. Use Verification IPs with a proven ability to detect bugs. Make the Compliance test suite one of the key components of the verification effort. When all is said and done, “To Err is Human”: the bugs are going to be there! So the effort should be focused on defect prevention and on remedies for the defects that remain.




 

 

 
