Error Prone Areas in PCI Express Design
Jitendra Puri, Engineering
Director, nSys Design Systems
Verification is a process that presumes no “developer” is perfect.
In other words, the whole verification industry thrives on the
imperfection of 'Man'! If you tell design engineers that they are
coding bugs along with the design, they might get offended! But
that's the reality. In this “Tips and Tricks” article we talk about
error-prone areas in PCI Express designs, based on our findings
during the verification of several PCI Express designs.
Error Prone Areas in PCI Express Design
An experienced PCI designer is expected to know Configuration
Space, Configuration cycles, Memory cycles, Device number, Bus
number, Base Addresses, TAG, Split cycles, etc.
The majority of designers are also very comfortable with Transaction
Layer concepts and terminology. It is the “newer” concepts that
need special attention and are more error-prone.
We analyzed the bugs unearthed during the verification of several
designs and found the following error-prone areas:
Physical Layer
Many of the designs had one error or another related to LTSSM
state transitions. At times the designers had ignored the Training
Control bits (Hot Reset, Loopback, Disable Link, Disable Scrambling)
that are received as part of the TS Ordered Sets. As a result, the
DUT made an incorrect transition to L0 instead of the desired
state. This also resulted in LTSSM deadlocks and unnecessary
time-outs.
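
The check that was missing in those designs can be sketched roughly as
follows. This is a simplified illustration, not a full LTSSM: the bit
positions follow the Training Control symbol of the TS Ordered Sets,
while the function names, the reduced state list and the priority
order among the bits are assumptions made for this example.

```python
from enum import Enum, IntFlag


class TrainingControl(IntFlag):
    """Bits of the Training Control symbol carried in TS1/TS2."""
    HOT_RESET = 0x01
    DISABLE_LINK = 0x02
    LOOPBACK = 0x04
    DISABLE_SCRAMBLING = 0x08


class NextState(Enum):
    L0 = "L0"
    HOT_RESET = "Hot Reset"
    DISABLED = "Disabled"
    LOOPBACK = "Loopback"


def next_state_after_training(training_control: int) -> NextState:
    """Pick the next state from the received Training Control bits.

    Hypothetical helper: a real LTSSM also looks at the current
    substate, the lane, and how many consecutive TS were received.
    The priority order used here is an assumption.
    """
    bits = TrainingControl(training_control & 0x0F)
    if bits & TrainingControl.DISABLE_LINK:
        return NextState.DISABLED
    if bits & TrainingControl.HOT_RESET:
        return NextState.HOT_RESET
    if bits & TrainingControl.LOOPBACK:
        return NextState.LOOPBACK
    # Only when no redirecting bit is set may the link proceed to L0.
    return NextState.L0


def scrambling_enabled(training_control: int) -> bool:
    """Disable Scrambling does not change the state, only the data path."""
    return not (training_control & TrainingControl.DISABLE_SCRAMBLING)
```

A design that skips these checks behaves as if the function always
returned L0, which matches the incorrect transitions described above.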
Data Link Layer
Flow control is another key area where several issues were observed
in various designs. DUT behavior was correct for VC0, possibly
because this VC is initialized by default. The designers took good
care of the credit flow information for VC0, but VCx credit
handling was not perfect in all the designs. UpdateFC DLLPs were
not sent properly for VCx, leading to starvation.
Another key issue we came across was that the credit information
was not updated for message packets. Certain message packets that
are not supported by the DUT (e.g. SSPL, Vendor Defined) are simply
ignored. Because they are dropped, they are not taken into account
when the updated FC information is sent, so their credits are never
freed up and the link partner starves.
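
A minimal sketch of the intended receiver-side bookkeeping is shown
here. It assumes one credit pool per VC and credit type and ignores
counter wrap-around; the class, the function and the list of supported
messages are hypothetical names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass
class RxCreditPool:
    """Receiver-side credit bookkeeping for one VC and one credit type."""
    allocated: int            # total credits advertised to the link partner
    pending_return: int = 0   # credits freed but not yet advertised

    def tlp_consumed(self, credits: int) -> None:
        """Record that a received TLP has left the receive buffer."""
        self.pending_return += credits

    def build_update_fc(self) -> int:
        """Return the credit total to place in the next UpdateFC."""
        self.allocated += self.pending_return
        self.pending_return = 0
        return self.allocated


# Hypothetical set of messages this DUT acts on.
SUPPORTED_MESSAGES = {"PME_Turn_Off"}


def receive_message(pool: RxCreditPool, name: str, header_credits: int = 1) -> bool:
    """Handle or drop a message TLP; the credit is returned either way."""
    handled = name in SUPPORTED_MESSAGES
    # The bug described above is skipping this call for dropped messages.
    pool.tlp_consumed(header_credits)
    return handled
```

With this structure an ignored Vendor Defined message still advances
the advertised credit total, so the transmitter is not starved.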
In some designs the Data Link Layer does not get reset on entry to
the DL_Inactive state: NEXT_TRANSMIT_SEQ, ACKD_SEQ, REPLAY_NUM,
NEXT_RCV_SEQ, etc. are not set to their default values, and the
Retry buffer is not freed when the link is re-established. The TL
packets received from the application layer are remembered and are
sent out even when the link layer is in the DL_Down state
(FC_INIT1). As a result, the first packet is not sent with sequence
number 0, the stale contents of the retry buffer get transmitted,
and the credit logic gets totally out of sync.
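
The reset behavior can be sketched as below. The counter defaults are
the values the counters take in DL_Inactive; the class layout, the
`pending_tlps` queue and the choice to drop queued TLPs on reset are
assumptions of this example rather than a description of any
particular design.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

SEQ_MOD = 4096  # sequence numbers are 12 bits wide


@dataclass
class DataLinkState:
    next_transmit_seq: int = 0x000
    ackd_seq: int = 0xFFF
    replay_num: int = 0
    next_rcv_seq: int = 0x000
    retry_buffer: deque = field(default_factory=deque)
    # Hypothetical queue of TLPs accepted from the Transaction Layer.
    pending_tlps: deque = field(default_factory=deque)
    link_up: bool = False

    def enter_dl_inactive(self) -> None:
        """Restore defaults and drop every stale packet."""
        self.next_transmit_seq = 0x000
        self.ackd_seq = 0xFFF
        self.replay_num = 0
        self.next_rcv_seq = 0x000
        self.retry_buffer.clear()
        self.pending_tlps.clear()  # assumption: nothing is remembered
        self.link_up = False

    def submit_tlp(self, tlp: str) -> Optional[int]:
        """Send a TLP and return its sequence number, or None if link is down."""
        if not self.link_up:
            return None
        seq = self.next_transmit_seq
        self.retry_buffer.append((seq, tlp))
        self.next_transmit_seq = (seq + 1) % SEQ_MOD
        return seq
```

After `enter_dl_inactive()` and a fresh link-up, the first TLP is
again sent with sequence number 0 and the retry buffer starts empty.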
Replay Mechanism
During Replay, some designs do not block new packets from the
Transaction Layer, so these new packets appear in the middle of the
replay. Other designs do not handle ACK/NAK during the course of a
replay and end up sending already-ACKed packets again, causing
performance degradation.
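
Both replay issues can be illustrated with a small model. This is a
sketch under simplifying assumptions (no NAK handling, no replay
timer, no REPLAY_NUM); the class and its `sent` log are hypothetical
and exist only to show the ordering on the wire.

```python
from collections import deque

SEQ_MOD = 4096  # sequence numbers are 12 bits wide


class ReplayTransmitter:
    def __init__(self) -> None:
        self.retry_buffer = deque()   # (seq, tlp) in transmit order
        self.replaying = False
        self._last_replayed = None
        self.sent = []                # wire log, for illustration only

    @staticmethod
    def _is_after(a: int, b: int) -> bool:
        """True if sequence number a is newer than b (modulo 4096)."""
        return 0 < (a - b) % SEQ_MOD < SEQ_MOD // 2

    def send_new(self, seq: int, tlp: str) -> bool:
        """Accept a new TLP unless a replay is in progress."""
        if self.replaying:
            return False
        self.retry_buffer.append((seq, tlp))
        self.sent.append((seq, tlp))
        return True

    def on_ack(self, acked_seq: int) -> None:
        """Purge every TLP up to and including acked_seq, even mid-replay."""
        while self.retry_buffer:
            seq = self.retry_buffer[0][0]
            if (acked_seq - seq) % SEQ_MOD < SEQ_MOD // 2:
                self.retry_buffer.popleft()
            else:
                break

    def start_replay(self) -> None:
        self.replaying = bool(self.retry_buffer)
        self._last_replayed = None

    def replay_step(self) -> bool:
        """Resend the oldest unacknowledged TLP not yet resent."""
        for seq, tlp in self.retry_buffer:
            if self._last_replayed is None or self._is_after(seq, self._last_replayed):
                self.sent.append((seq, tlp))
                self._last_replayed = seq
                return True
        self.replaying = False
        return False
```

Because `replay_step()` walks the live retry buffer, a TLP that is
ACKed while the replay is running is simply never resent, and
`send_new()` refuses new TLPs until the replay has finished.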
Transaction Layer
When AER is implemented, the bits in the AER registers are updated
correctly, but certain registers (such as the Status register) that
need to be updated in the standard PCI configuration space are left
out. Some EP designs assume that they can send 64-byte aligned
completions when the RCB bit is set to 0. As a consequence, they
incorrectly send out completions broken at 64-byte boundaries, and
the receiver treats these as Malformed. In other designs the
Max_Read_Request_Size value is wrongly interpreted and used in the
receive logic: the size of each received read request is compared
against it, and as a consequence a well-formed read request gets
treated as a Malformed read request.
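
The Max_Read_Request_Size point can be sketched as follows. The
encoding (128 bytes shifted left by the field value) is the standard
one; the function names are hypothetical, and the receive-side check
is deliberately reduced to a single rule (a memory request must not
cross a 4 KB boundary) to keep the example short.

```python
def max_read_request_bytes(mrrs_field: int) -> int:
    """Decode the 3-bit Max_Read_Request_Size field into bytes."""
    if not 0 <= mrrs_field <= 5:
        raise ValueError("reserved Max_Read_Request_Size encoding")
    return 128 << mrrs_field


def may_issue_read(length_bytes: int, mrrs_field: int) -> bool:
    """Transmit side: the limit applies to requests this device generates."""
    return 0 < length_bytes <= max_read_request_bytes(mrrs_field)


def received_read_is_malformed(address: int, length_bytes: int) -> bool:
    """Receive side: note that the local MRRS setting is not an input.

    Simplified illustration that only checks the 4 KB boundary rule.
    """
    first_page = address // 4096
    last_page = (address + length_bytes - 1) // 4096
    return first_page != last_page
```

For instance, a device programmed for 512-byte reads must still accept
a 1024-byte read request from its link partner, since the partner's
own setting governs that request.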
Power Management
Tx_L0s and Rx_L0s are not kept independent. As a result, the TX
is forced to go to L0s whenever the RX is in L0s, which blocks
packet transmission. In other designs the transmitter sends out a
number of FTS equal to the N_FTS it advertised itself rather than
the N_FTS it received from its link partner. The receiving device
then does not get the time it requires for the transition from L0s
to L0, and it might go to Recovery after the N_FTS timeout.
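
A small model of the two rules, assuming a port that tracks its
transmit and receive states separately. All names below are
hypothetical and the state set is reduced to what the example needs.

```python
from dataclasses import dataclass


@dataclass
class L0sPort:
    advertised_n_fts: int     # N_FTS this port sent in its own TS Ordered Sets
    received_n_fts: int = 0   # N_FTS taken from the link partner's TS
    tx_state: str = "L0"
    rx_state: str = "L0"

    def rx_enter_l0s(self) -> None:
        """The link partner idled its transmitter; only our RX changes."""
        self.rx_state = "Rx_L0s"

    def tx_enter_l0s(self) -> None:
        """Our own transmitter has nothing to send; only our TX changes."""
        self.tx_state = "Tx_L0s"

    def can_transmit(self) -> bool:
        return self.tx_state == "L0"

    def fts_count_for_tx_exit(self) -> int:
        """FTS to send when leaving Tx_L0s: what the partner asked for."""
        return self.received_n_fts
```

Keeping `tx_state` and `rx_state` as separate fields makes it hard to
couple them by accident, and `fts_count_for_tx_exit()` never consults
`advertised_n_fts`.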
Conclusion
To summarize, we should review error-prone areas more thoroughly
and use back-to-back operation of Verification IP to improve understanding.
Use Verification IPs with a proven ability to detect bugs. Make
the compliance test suite one of the key components of the
verification effort. When all is said and done, “To Err is Human”…
the bugs are going to be there! So the effort should be focused
on defect prevention and the remedies thereof!