10 Replies Latest reply: May 3, 2019 3:50 PM by Yoj

    Nexus Interview Question

    Yoj

      Got this question a couple of days ago. I'm curious how you guys would respond.

       

      The site had a power outage. When the core (N7K with modules and VDCs) came back up, it couldn't detect the modules and all links were down. What would be your approach to try and restore service?

         
        • 1. Re: Nexus Interview Question
          Kevin Santillan

          There are several possible causes for this, such as failed modules, a failed fabric module, a backplane issue, or old code. Whatever the cause, I'd first try the fixes that are likely to restore service fastest and leave the most tedious method for last. In times like this, the goal is to restore service ASAP, workaround or not. Perhaps something like this:

           

          1.) Reload one module from the CLI. If this works reload the other modules.

          2.) If step 1 doesn't work, power off one module and physically reseat it. Do this for the others.

          3.) Assuming step 2 fails, at this point have someone open a TAC case and have the engineer join the call. I would still continue to steps 4 and 5 though while waiting for TAC.

          4.) Reload the chassis.

          5.) If step 4 doesn't work, de-allocate the interfaces in each VDC and re-allocate. Reconfigure all ports and port-channels for each VDC.

          6.) If port re-allocation doesn't work, consult TAC as you might need to RMA or upgrade the code.
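
          The ordering above can be sketched as a simple escalation ladder: try each fix in turn, stop at the first one that brings the modules back, and open the TAC case after a set number of failures while still working through the remaining steps. This is only an illustration. `Step`, `run_ladder`, and the stub actions are hypothetical names, and nothing here talks to a real switch.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One recovery attempt. `action` is a hypothetical stub that
    returns True if the modules came back online."""
    name: str
    action: Callable[[], bool]


def run_ladder(steps: List[Step],
               open_tac_case: Callable[[], None],
               tac_after: int = 2) -> List[str]:
    """Run steps in order and stop at the first success.

    After `tac_after` failed steps, open a TAC case once and keep
    going with the remaining steps instead of waiting on it.
    """
    log: List[str] = []
    for attempt, step in enumerate(steps, start=1):
        if step.action():
            log.append(f"{step.name}: service restored")
            return log
        log.append(f"{step.name}: failed")
        if attempt == tac_after:
            open_tac_case()
            log.append("TAC case opened; continuing in parallel")
    log.append("all steps failed: work with TAC on RMA or code upgrade")
    return log


if __name__ == "__main__":
    # Assumed outcomes for illustration only: the first three fixes
    # fail and re-allocating the VDC interfaces succeeds.
    ladder = [
        Step("reload one module from the CLI", lambda: False),
        Step("power off and reseat the module", lambda: False),
        Step("reload the chassis", lambda: False),
        Step("de-allocate and re-allocate VDC interfaces", lambda: True),
    ]
    for line in run_ladder(ladder, open_tac_case=lambda: None):
        print(line)
```

          The point of the structure is that the TAC call is a step that runs alongside the others, not a last resort that only happens after everything else has failed.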

           

          Looks to me that the goal of the question is to determine how experienced you are and whether you have a "super hero mentality" or know when to seek help if the situation calls for it.

          • 2. Re: Nexus Interview Question
            Micheline

            Hello Yoj--that's a great question.  I agree with Kevin that the question is testing experience and judgment.  Lack of experience can be fixed; lack of judgment cannot.  I also agree with Kevin's approach that the first priority is to get traffic flowing.

             

            With that in mind, I'd first see what you could do to route traffic around the problem.  Can you pull the box from production and replace it to get traffic flowing?  If no traffic is flowing, there's no additional harm in pulling the box. What about changing a static route to redirect traffic?  These sorts of alternate paths are often pre-built into a resilient network, but if they aren't, you could add them by hand.

             

            Just some thoughts.  MM

            • 3. Re: Nexus Interview Question
              Yoj

              Thanks Mich. I guess pulling the switch out of prod would be your last resort, as this entails rerouting cables regardless of whether you have a collapsed core or a three-tier design. But it's still a good suggestion since it's better than waiting for replacement hardware to arrive.

              • 4. Re: Nexus Interview Question
                Yoj

                Kevin, your response is quite similar to what I answered during the interview. The only difference is I didn't mention calling TAC which might've put me in jeopardy. lol

                • 5. Re: Nexus Interview Question
                  Kevin Santillan

                  Well, that's why you're paying for Smart Net, right? For support. Leverage it. But also, don't neglect your duties as an engineer. Do your own due diligence, as you should be capable of resolving the issue or providing a workaround to restore service.

                   

                  I've experienced this scenario twice, and Step 5 worked for me both times. But that doesn't mean I'd go straight to reallocating interfaces if it happens again, as there's always a chance you could resolve the issue with one of the earlier steps.

                  • 6. Re: Nexus Interview Question
                    Micheline

                    Actually, with a malfunctioning switch passing no traffic, there's no harm in pulling it since the traffic is already stopped.  Swap in a like-for-like switch (the big question is whether you have the dead switch's configs saved).  If that's an option, I'd do it first rather than last because that will be the fastest route to 100% functionality.

                     

                    MM

                    • 7. Re: Nexus Interview Question
                      Yoj

                      I see where you're coming from. But wouldn't you want to first try to bring the line cards back up before completely removing the switch from prod? Doing a like-for-like swap would mean you'd need to configure the new chassis, make sure all cables are properly labeled, unmount the old chassis, mount the new one, and then reconnect all cables. That alone would surely take you at least an hour to complete, depending on the number of links. Plus, if you have SFPs plugged into the problematic modules, you'd need to transfer those too. Not to mention it's not every day that you have a cold standby 7K just lying around.

                      • 8. Re: Nexus Interview Question
                        arteq

                        well put kevin

                        • 9. Re: Nexus Interview Question
                          Micheline

                          Yes, Yoj, all of the things you mentioned are considerations.  But when you are brainstorming solutions to a problem, there are no negatives.  The idea is to get as many ideas up as possible first, and *then* assess each idea's viability.  Certainly if you don't have hardware to swap out you cannot use this option.

                           

                          One other thing I would add about not mentioning TAC: TAC is a service that Cisco offers just like any other.  I'm sure its cost has already been paid, and if you need to call TAC, you should not be afraid to.  In my opinion, a work environment that discourages its workers from calling for help when it is needed is a very poor one, and it is probably not an employer you would want to work for.  Saying that you would call TAC shows that you have the emotional intelligence to recognize your own limitations and to know when you need to call for help.

                           

                          MM

                          • 10. Re: Nexus Interview Question
                            Yoj

                            I fully agree. Thanks for your inputs!