Troubleshooting fabric errors due to faulty linecard on ASR9000

    Intro

     

    Some time ago I worked on a case in which extreme latency was reported for traffic passing through one specific ASR9K router. We discovered CRC errors on packets received from the crossbar fabric. This post outlines the steps we took to identify the faulty component.

     

     

    Troubleshooting

     

    The output of 'show controllers fabric fia drops egress location ..' showed that the FIA (Fabric Interface ASIC) was dropping packets in the egress direction (towards the NP) because of CRC errors on packets received from the fabric. These counters were increasing on nearly all linecards.

     

    RP/0/RSP1/CPU0:A9K#sh controllers fabric fia drops egress location 0/0/CPU0 | e \\ 0$
    Thu Jun  8 13:30:16.472 CEST
    ********** FIA-0 ********
    Category: eg_drop-0
                               From Xbar Uc Crc-0                   123338
                               From Xbar Uc Crc-1                   123287
                               From Xbar Uc Crc-2                   123698
           Uc dq pkt-len-crc/RO-seq/len error drp                   370323
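
    Since the key symptom is that these counters keep increasing, it can help to capture the output twice and compare the values. The following is a minimal sketch, assuming the output has been saved as text and that counter lines look like the ones above; the function names parse_fia_drops and crc_deltas are hypothetical helpers, not part of any Cisco tooling.

```python
import re

# Sample copied from the command output shown above.
SAMPLE = """\
********** FIA-0 ********
Category: eg_drop-0
                           From Xbar Uc Crc-0                   123338
                           From Xbar Uc Crc-1                   123287
                           From Xbar Uc Crc-2                   123698
       Uc dq pkt-len-crc/RO-seq/len error drp                   370323
"""

# Banner line that names the FIA, e.g. "********** FIA-0 ********".
FIA_RE = re.compile(r"\*+\s*(FIA-\d+)\s*\*+")
# Counter line: a name, at least two spaces, then an integer value.
COUNTER_RE = re.compile(r"^\s*(\S.*?)\s{2,}(\d+)\s*$")


def parse_fia_drops(text):
    """Return {(fia, counter name): value} for one saved command output."""
    counters = {}
    fia = None
    for line in text.splitlines():
        m = FIA_RE.search(line)
        if m:
            fia = m.group(1)
            continue
        m = COUNTER_RE.match(line)
        if m:
            counters[(fia, m.group(1))] = int(m.group(2))
    return counters


def crc_deltas(before, after):
    """Return the CRC-related counters that grew between two snapshots."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if "crc" in key[1].lower() and after[key] > before.get(key, 0)
    }


if __name__ == "__main__":
    for key, value in parse_fia_drops(SAMPLE).items():
        print(key, value)
```

    Running parse_fia_drops on two captures taken a few minutes apart and passing both results to crc_deltas lists only the CRC counters that are still incrementing, which separates an active problem from old, stale counts.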

     

    Working backwards along the path, we looked at the fabric crossbar statistics on the RSPs and found that CRC errors were increasing on specific crossbar ports of the fabric instances.

     

    RP/0/RSP1/CPU0:A9K#sh controllers fabric crossbar statistics instance 0 location 0/RSP0/CPU0

    Port statistics for xbar:0 port:6
    ==============================
    Internal Error Count: 14028
    Hi priority stats (unicast)
    ===========================
        Ingress Packet Count Since Last Read       : 37555537110
        Ingress Channel Utilization Count          : 3
        Packet CRC Error Count                     : 18042566
        Egress Packet Count Since Last Read        : 30208290980
        Egress Channel Utilization Count           : 3

    Port statistics for xbar:0 port:9
    ==============================
    Internal Error Count: 14028
    Hi priority stats (unicast)
    ===========================
        Ingress Packet Count Since Last Read       : 37656309502
        Ingress Channel Utilization Count          : 3
        Packet CRC Error Count                     : 18121186
        Egress Packet Count Since Last Read        : 30209843187
        Egress Channel Utilization Count           : 3

     

    RP/0/RSP1/CPU0:A9K#sh controllers fabric crossbar statistics instance 1 location 0/RSP0/CPU0

    Port statistics for xbar:1 port:2
    ==============================
    Internal Error Count: 14035
    Hi priority stats (unicast)
    ===========================
        Ingress Packet Count Since Last Read       : 22822817397
        Ingress Channel Utilization Count          : 2
        Packet CRC Error Count                     : 11045205
        Egress Packet Count Since Last Read        : 20251200154
        Egress Channel Utilization Count           : 2

    Port statistics for xbar:1 port:24
    ==============================
    Internal Error Count: 14035
    Hi priority stats (unicast)
    ===========================
        Ingress Packet Count Since Last Read       : 22821790124
        Ingress Channel Utilization Count          : 2
        Packet CRC Error Count                     : 11045315
        Egress Packet Count Since Last Read        : 30909392742
        Egress Channel Utilization Count           : 3
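
    On a busy chassis the statistics output is long, so filtering it down to the ports that report CRC errors saves time. Here is a rough sketch, assuming the output format shown above; crc_errors_by_port is a hypothetical helper name, and the sample is abbreviated from the instance 0 output. If a port reports more than one CRC line (for example in sections not shown in this post), the sketch simply adds them up.

```python
import re

# Abbreviated from the instance 0 output shown above.
SAMPLE = """\
Port statistics for xbar:0 port:6
==============================
Internal Error Count: 14028
Hi priority stats (unicast)
===========================
    Ingress Packet Count Since Last Read       : 37555537110
    Packet CRC Error Count                     : 18042566

Port statistics for xbar:0 port:9
==============================
Internal Error Count: 14028
Hi priority stats (unicast)
===========================
    Ingress Packet Count Since Last Read       : 37656309502
    Packet CRC Error Count                     : 18121186
"""

PORT_RE = re.compile(r"Port statistics for xbar:(\d+) port:(\d+)")
CRC_RE = re.compile(r"Packet CRC Error Count\s*:\s*(\d+)")


def crc_errors_by_port(text):
    """Return {(xbar, port): CRC error count} for ports with errors."""
    errors = {}
    current = None
    for line in text.splitlines():
        m = PORT_RE.search(line)
        if m:
            # Start of a new per-port section.
            current = (int(m.group(1)), int(m.group(2)))
            continue
        m = CRC_RE.search(line)
        if m and current is not None:
            errors[current] = errors.get(current, 0) + int(m.group(1))
    # Keep only ports that actually report CRC errors.
    return {port: count for port, count in errors.items() if count > 0}


if __name__ == "__main__":
    print(crc_errors_by_port(SAMPLE))
```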

     

    In the crossbar link-status output, the affected ports (6 and 9 on instance 0, 2 and 24 on instance 1) all connect to 0/5/CPU0, which let us trace the CRC errors back to the linecard in slot 5 of the chassis.

     

    RP/0/RSP1/CPU0:A9K#sh controllers fabric crossbar link-status instance 0 location 0/RSP0/CPU0

    PORT    Remote Slot  Remote Inst    Logical ID  Status
    ======================================================
    00      0/1/CPU0            00             1        Up
    01      0/1/CPU0            00             0        Up
    02      0/6/CPU0            00             0        Up
    04      0/2/CPU0            00             1        Up
    05      0/2/CPU0            00             0        Up
    06      0/5/CPU0            00             1        Up
    07      0/3/CPU0            00             1        Up
    09      0/5/CPU0            00             0        Up
    11      0/3/CPU0            00             0        Up
    13      0/4/CPU0            00             1        Up
    14      0/4/CPU0            00             0        Up
    16      0/RSP0/CPU0         00             0        Up
    18      0/7/CPU0            00             1        Up
    19      0/0/CPU0            00             1        Up
    20      0/7/CPU0            00             0        Up
    22      0/0/CPU0            00             0        Up
    24      0/6/CPU0            00             1        Up

     

    RP/0/RSP1/CPU0:A9K#sh controllers fabric crossbar link-status instance 1 location 0/RSP0/CPU0

    PORT    Remote Slot  Remote Inst    Logical ID  Status
    ======================================================
    00      0/2/CPU0            00             1        Up
    01      0/2/CPU0            00             0        Up
    02      0/5/CPU0            00             0        Up
    04      0/1/CPU0            00             1        Up
    05      0/1/CPU0            00             0        Up
    06      0/6/CPU0            00             1        Up
    07      0/0/CPU0            00             1        Up
    09      0/6/CPU0            00             0        Up
    11      0/0/CPU0            00             0        Up
    13      0/7/CPU0            00             1        Up
    14      0/7/CPU0            00             0        Up
    16      0/RSP0/CPU0         00             0        Up
    18      0/4/CPU0            00             1        Up
    19      0/3/CPU0            00             1        Up
    20      0/4/CPU0            00             0        Up
    22      0/3/CPU0            00             0        Up
    24      0/5/CPU0            00             1        Up
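
    The lookup from crossbar port to remote slot can also be scripted. The sketch below assumes the link-status table format shown above and uses hypothetical helper names (port_to_slot, suspect_slots); the samples are abbreviated to a few rows from each instance. It takes the (xbar, port) pairs that showed CRC errors in the statistics output and counts how often each remote slot appears.

```python
import re
from collections import Counter

# A few rows copied from the instance 0 and instance 1 tables above.
LINKS_XBAR0 = """\
PORT    Remote Slot  Remote Inst    Logical ID  Status
======================================================
05      0/2/CPU0            00             0        Up
06      0/5/CPU0            00             1        Up
09      0/5/CPU0            00             0        Up
"""
LINKS_XBAR1 = """\
PORT    Remote Slot  Remote Inst    Logical ID  Status
======================================================
02      0/5/CPU0            00             0        Up
06      0/6/CPU0            00             1        Up
24      0/5/CPU0            00             1        Up
"""

# Table row: port, remote slot, remote instance, logical ID, status.
LINK_RE = re.compile(r"^\s*(\d+)\s+(\S+)\s+(\d+)\s+(\d+)\s+(\S+)\s*$")


def port_to_slot(text):
    """Return {port number: remote slot} from one link-status table."""
    mapping = {}
    for line in text.splitlines():
        m = LINK_RE.match(line)
        if m:
            mapping[int(m.group(1))] = m.group(2)
    return mapping


def suspect_slots(crc_ports, link_maps):
    """Count remote slots behind the (xbar, port) pairs that show CRC errors."""
    return Counter(
        link_maps[xbar][port]
        for xbar, port in crc_ports
        if port in link_maps.get(xbar, {})
    )


if __name__ == "__main__":
    maps = {0: port_to_slot(LINKS_XBAR0), 1: port_to_slot(LINKS_XBAR1)}
    # Ports with CRC errors, taken from the statistics output in this post.
    crc_ports = [(0, 6), (0, 9), (1, 2), (1, 24)]
    print(suspect_slots(crc_ports, maps))
```

    When every port with errors resolves to the same remote slot, as it does here, that linecard is the obvious first suspect.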

     

     

    Conclusion

     

    After we replaced the linecard in slot 5 of the ASR9K chassis, the high latency and the CRC errors disappeared. The most peculiar observation in this case was that traffic not traversing the slot 5 linecard was still affected. The faulty linecard somehow destabilized the entire system, which made it harder to pinpoint the exact cause of the issue.