Troubleshooting Cisco Nexus Switches and NX-OS
Published by:
Cisco Press
800 East 96th Street
Indianapolis, IN 46240 USA
All rights reserved. No part of this book may be reproduced or transmitted in any form or by any
means, electronic or mechanical, including photocopying, recording, or by any information storage
and retrieval system, without written permission from the publisher, except for the inclusion of brief
quotations in a review.
01 18
ISBN-13: 978-1-58714-505-6
ISBN-10: 1-58714-505-7
The information is provided on an “as is” basis. The authors, Cisco Press, and Cisco Systems, Inc. shall
have neither liability nor responsibility to any person or entity with respect to any loss or damages
arising from the information contained in this book or from the use of the discs or programs that may
accompany it.
The opinions expressed in this book belong to the author and are not necessarily those of Cisco
Systems, Inc.
Trademark Acknowledgments
All terms mentioned in this book that are known to be trademarks or service marks have been
appropriately capitalized. Cisco Press or Cisco Systems, Inc., cannot attest to the accuracy of this
information. Use of a term in this book should not be regarded as affecting the validity of any
trademark or service mark.
Special Sales
For information about buying this title in bulk quantities, or for special sales opportunities (which
may include electronic versions; custom cover designs; and content particular to your business,
training goals, marketing focus, or branding interests), please contact our corporate sales department at
corpsales@pearsoned.com or (800) 382-3419.
For questions about sales outside the U.S., please contact intlcs@pearson.com.
Feedback Information
At Cisco Press, our goal is to create in-depth technical books of the highest quality and value. Each book
is crafted with care and precision, undergoing rigorous development that involves the unique expertise of
members from the professional technical community.
Readers’ feedback is a natural continuation of this process. If you have any comments regarding how we
could improve the quality of this book, or otherwise alter it to better suit your needs, you can contact
us through email at feedback@ciscopress.com. Please make sure to include the book title and ISBN in
your message.
Cisco has more than 200 offices worldwide. Addresses, phone numbers, and fax numbers are listed on the Cisco Website at www.cisco.com/go/offices.
Cisco and the Cisco logo are trademarks or registered trademarks of Cisco and/or its affiliates in the U.S. and other countries. To view a list of Cisco trademarks,
go to this URL: www.cisco.com/go/trademarks. Third party trademarks mentioned are the property of their respective owners. The use of the word partner does
not imply a partnership relationship between Cisco and any other company. (1110R)
Brad Edgeworth, CCIE No. 31574 (R&S & SP), is a systems engineer at Cisco Systems.
Brad is a distinguished speaker at Cisco Live, where he has presented on various topics.
Before joining Cisco, Brad worked as a network architect and consultant for various
Fortune 500 companies. Brad’s expertise is based on enterprise and service provider
environments with an emphasis on architectural and operational simplicity. Brad holds a
Bachelor of Arts degree in Computer Systems Management from St. Edward’s University
in Austin, Texas. Brad can be found on Twitter as @BradEdgeworth.
Richard Furr, CCIE No. 9173 (R&S & SP), is a technical leader with the Cisco Technical
Assistance Center (TAC), supporting customers and TAC teams around the world. For
the past 17 years, Richard has worked for the Cisco TAC and High Touch Technical
Support (HTTS) organizations, supporting service provider, enterprise, and data center
environments. Richard specializes in resolving complex problems found with routing
protocols, MPLS, multicast, and network overlay technologies.
Matt Esau, CCIE No. 18586 (R&S) is a graduate from the University of North Carolina
at Chapel Hill. He currently resides in Ohio with his wife and two children, ages three
and one. Matt is a Distinguished Speaker at Cisco Live. He started with Cisco in 2002
and has spent 15 years working closely with customers on troubleshooting issues and
product usability. For the past eight years, he has worked in the Data Center space, with a
focus on Nexus platforms and technologies.
Dedications
This book is dedicated to three important women in my life: my mother, my wife,
Khushboo, and Sonal. Mom, thanks for being a friend and a teacher in different phases
of my life. You have given me the courage to stand up and fight every challenge
that comes my way in life. Khushboo, I want to thank you for being so patient with
my madness and craziness. I couldn’t have completed this book or any other project
without your support, and I cannot express in words how much it all means to me. This
book is a small token of love, gratitude and appreciation for you. Sonal, thank you
for being the driver behind my craziness. You have inspired me to reach new heights
by setting new targets every time we met. This book is a small token of my love and
gratitude for all that you have done for me.
I would further like to dedicate this book to my dad and my brother for believing in me
and standing behind me as a wall whenever I faced challenges in life. I couldn’t be where
I am today without your invincible support.
—Vinit Jain
This book is dedicated to David Kyle. Thank you for taking a chance on me. You will
always be more than a former boss. You mentored me with the right attitude and foundational
skills early in my career.
In addition to stress testing the network with Quake, you let me start my path with
networking under you. Look where I am now!
—Brad Edgeworth
This book is dedicated to my loving wife, Sandra, and my daughter, Calianna. You are
my inspiration. Your love and support drive me to succeed each and every day. Thank
you for providing the motivation for me to push myself further than I thought possible.
Calianna, you are only two years old now. When you are old enough to read this, you
will have long forgotten about all the late nights daddy spent working on this project.
When you hold this book, I want you to remember that anything is possible through
dedication and hard work.
I would like to further dedicate this book to my mother and father. Mom, thanks for
always encouraging me, and for teaching me that I can do anything I put my mind to.
Dad, thank you for always supporting me, and teaching me how to be dedicated and
work hard. Both of you have given me your best.
—Richard Furr
Acknowledgments
Vinit Jain:
Brad and Richard: Thank you for being part of this yearlong journey. This project
wouldn’t have been possible without your support. It was a great team effort, and it was
a pleasure working with both of you.
I would like to thank our technical editors, Ramiro and Matt, for your in-depth
verification of the content and insightful input to make this project a successful one.
I couldn’t have completed the milestone without the support from my managers, Chip
Little and Mike Stallings. Thank you for enabling us with so many resources, as well as
being flexible and making an environment that is full of opportunities.
I would like to thank David Jansen, Lukas Krattiger, Vinayak Sudame, Shridhar
Dhodapkar, and Ryan McKenna for your valuable input during the course of this book.
Most importantly, I would like to thank Brett Bartow and Marianne Bartow for their
wonderful support on this project. This project wouldn’t have been possible without your
support.
Brad Edgeworth:
Vinit, thanks again for asking me to co-write another book with you. Richard, thanks
again for your insight. I’ve always enjoyed our late-night conference calls.
Ramiro and Matt, thank you for hiding all my mistakes, or at least pointing them out
before they made it to print!
This is the part of the book that you look at to see if you have been recognized. Well,
many people have provided feedback, suggestions, and support to make this a great
book. Thanks to all who have helped in the process, especially Brett Bartow, Marianne
Bartow, Jay Franklin, Katherine McNamara, Dustin Schuemann, Craig Smith, and my
managers.
P.S. Teagan, this book does not contain dragons or princesses, but the next one might!
Richard Furr:
I’d like to thank my coauthors, Vinit Jain and Brad Edgeworth, for the opportunity to work
on this project together. It has been equally challenging and rewarding on many levels.
Brad, thank you for all the guidance and your ruthless red pen on my first chapter. You
showed me how to turn words and sentences into a book. Vinit, your drive and ambition
are contagious. I look forward to working with both of you again in the future.
I would also like to thank our technical editors, Matt Esau and Ramiro Garza Rios,
for their expertise and guidance. This book would not be possible without your
contributions.
I could not have completed this project without the support and encouragement of my
manager, Mike Stallings. Mike, thank you for allowing me to be creative and pursue
projects like this one. You create the environment for us to be our best.
Contents at a Glance
Foreword
Introduction
Index
Reader Services
Register your copy at www.ciscopress.com/title/9781587145056 for convenient access
to downloads, updates, and corrections as they become available. To start the registration
process, go to www.ciscopress.com/register and log in or create an account*. Enter
the product ISBN 9781587145056 and click Submit. When the process is complete, you
will find any available bonus content under Registered Products.
*Be sure to check the box that you would like to hear from us to receive exclusive
discounts on future editions of this product.
Contents
Foreword
Introduction
FabricPath
FabricPath Terminologies and Components
FabricPath Packet Flow
FabricPath Configuration
FabricPath Verification and Troubleshooting
FabricPath Devices
Emulated Switch and vPC+
vPC+ Configuration
vPC+ Verification and Troubleshooting
Summary
References
RA Guard
IPv6 Snooping
DHCPv6 Guard
First-Hop Redundancy Protocol
HSRP
HSRPv6
VRRP
GLBP
Summary
OPEN
UPDATE
NOTIFICATION
KEEPALIVE
BGP Neighbor States
Idle
Connect
Active
OpenSent
OpenConfirm
Established
BGP Configuration and Verification
Troubleshooting BGP Peering Issues
Troubleshooting BGP Peering Down Issues
Verifying Configuration
Verifying Reachability and Packet Loss
Verifying ACLs and Firewalls in the Path
Verifying TCP Sessions
OPEN Message Errors
BGP Debugs
Demystifying BGP Notifications
Troubleshooting IPv6 Peers
BGP Peer Flapping Issues
Bad BGP Update
Hold Timer Expired
BGP Keepalive Generation
MTU Mismatch Issues
BGP Route Processing and Route Propagation
BGP Route Advertisement
Network Statement
Redistribution
Route Aggregation
Default-Information Originate
BGP Best Path Calculation
BGP Multipath
EBGP and IBGP Multipath
Index
■ Boldface indicates commands and keywords that are entered literally as shown. In
actual configuration examples and output (not general command syntax), boldface
indicates commands that are manually input by the user (such as a show command).
■ Braces within brackets ([{ }]) indicate a required choice within an optional element.
Note This book covers multiple Nexus switch platforms (5000, 7000, 9000, and so on).
A generic NX-OS icon is used along with a naming syntax for differentiation of devices.
Platform-specific topics use a platform-specific icon and major platform number in the
system name.
Foreword
The data center is at the core of all companies in the digital age. It processes bits and
bytes of data that represent products and services to its customers. The data storage and
processing capabilities of a modern business have become synonymous with the ability
to generate revenue. Companies in all business sectors are storing and processing more
information digitally every year, regardless of their vertical affiliation (construction,
medical, entertainment, and so on). This means that the network must be designed for
speed, capacity, and flexibility.
The Nexus platform was built with speed and bandwidth capacity in mind. When the
Nexus 7000 launched in 2008, it provided high-density 10 Gigabit interfaces at a low
per-port cost. In addition, the Nexus switch operating system, NX-OS, brought forth evolutionary
technologies like virtual port channels (vPC) that increased available bandwidth
and redundancy while overcoming the inefficiencies of Spanning Tree Protocol (STP).
NX-OS introduced technologies such as Overlay Transport Virtualization (OTV), which
revolutionized the design of the data center network by enabling host mobility between
sites and allowing full data center redundancy. Today, the Nexus platform continues
to evolve by supporting 25/40/100 Gigabit interfaces in a high-density compact form
factor, and brings other innovative technologies such as VXLAN and Application Centric
Infrastructure (ACI) to the market.
NX-OS was built with the mindset of operational simplicity and includes additional tools
and capabilities that improve the operational efficiency of the network. Today, websites
and applications are expected to be available 24 hours a day, 7 days a week, and 365 days
a year. Downtime in the data center directly translates to a financial impact. The move
toward digitization and the potential impact the network has to a business makes it more
important than ever for network engineers to attain the skills to troubleshoot data center
network environments efficiently.
As the leader of Cisco’s technical services for more than 25 years, I have the benefit of
working with the best network professionals in the industry. This book is written by
Brad, Richard, and Vinit: “Network Rock Stars,” who have been in my organization for
years supporting multiple Cisco customers. This book provides a complete reference for
troubleshooting Nexus switches and the NX-OS operating system. The methodologies
taught in this book are the same methods used by Cisco’s technical services to solve a
variety of complex network problems.
Joseph Pinto
SVP, Technical Services, Cisco, San Jose
Introduction
The Nexus operating system (NX-OS) contains a modular software architecture that
primarily targets high-speed/high-density network environments like data centers.
NX-OS provides virtualization, high availability, scalability, and upgradeability features
for Nexus switches.
The Nexus 7000 switch debuted in 2008, providing more than 512 10 Gbps ports. Over
the years, Cisco has released other Nexus switch families that include the Nexus 5000,
Nexus 2000, Nexus 9000, and virtual Nexus 1000. NX-OS has grown in features,
allowing Nexus switch deployments in enterprise routing and switching roles.
This book is a single source for mastering the techniques used to troubleshoot features
and issues on Nexus platforms running the NX-OS operating system. Bringing together
content previously spread across multiple sources and Cisco Press titles, it provides
updated, architecture-level information on how these features function on Nexus
platforms and how to leverage the capabilities of NX-OS to troubleshoot them.
Part III of the book, “Troubleshooting Layer 3 Routing,” explains the underlying
IP components of NX-OS. This includes the routing protocols EIGRP, OSPF, IS-IS, BGP,
and the selection of routes for filtering or path manipulation.
Part IV of the book, “Troubleshooting High Availability,” discusses and explains the high
availability components of NX-OS.
■ Chapter 12, “High Availability”: This chapter explains how to troubleshoot high
availability components such as Bidirectional Forwarding Detection (BFD), Stateful
Switchover (SSO), In-Service Software Upgrade (ISSU), and Graceful Insertion and
Removal (GIR).
Part V of the book, “Multicast Network Traffic,” explains the operational components of
multicast network traffic on Nexus switches.
Part VI of the book, “Troubleshooting Nexus Tunneling,” discusses the various tunneling
techniques that NX-OS provides.
Part VII of the book, “Network Programmability,” provides details on the methods by
which NX-OS can be configured through APIs and automation.
On the product web page you also will find a bonus chapter, “Troubleshooting VxLAN
and VxLAN BGP EVPN.”
Additional Reading
The authors tried to keep the size of the book manageable while providing only
necessary information for the topics involved.
Some readers may require additional reference material and may find the following books
a great supplementary resource for the topics in this book.
■ Fuller, Ron, David Jansen, and Matthew McPherson. NX-OS and Cisco Nexus
Switching. Indianapolis: Cisco Press, 2013.
■ Edgeworth, Brad, Aaron Foss, and Ramiro Garza Rios. IP Routing on Cisco IOS,
IOS XE, and IOS XR. Indianapolis: Cisco Press, 2014.
■ Krattiger, Lukas, Shyam Kapadia, and David Jansen. Building Data Centers with
VXLAN BGP EVPN. Indianapolis: Cisco Press, 2017.
Chapter 1
■ NX-OS Architecture
■ NX-OS Virtualization Features
At the time of its release in 2008, the Nexus operating system (NX-OS) and the Nexus
7000 platform provided a substantial leap forward in terms of resiliency, extensibility,
virtualization, and system architecture compared to other switching products of the time.
Wasteful excess capacity in bare metal server resources had already given way to the efficiency
of virtual machines, and now that wave was beginning to wash over to the network
as well. Networks were evolving from traditional 3-Tier designs (access layer, distribution
layer, core layer) to designs that required additional capacity, scale, and availability. It was
no longer acceptable to have links sitting idle due to Spanning Tree Protocol blocking
while that capacity could be utilized to increase the availability of the network.
As network topologies evolved, so did the market’s expectation of the network infrastructure
devices that connected their hosts and network segments. Network operators
were looking for platforms that were more resilient to failures, offered increased switching
capacity, and allowed for additional network virtualization in their designs to better
utilize physical hardware resources. Better efficiency was also needed in terms of
reduced power consumption and cooling requirements as data centers grew larger with
increased scale.
The Nexus 7000 series was the first platform in Cisco’s Nexus line of switches created to
meet the needs of this changing data center market. NX-OS combines the functionality
of Layer 2 switching, Layer 3 routing, and SAN switching into a single operating system.
2 Chapter 1: Introduction to Nexus Operating System (NX-OS)
From the initial release, the operating system has continued to evolve, and the portfolio
of Nexus switching products has expanded to include several series of switches that
address the needs of a modern network. Throughout this expansion, the following four
fundamental pillars of NX-OS have remained unchanged:
■ Resiliency
■ Virtualization
■ Efficiency
■ Extensibility
This chapter introduces the different types of Nexus platforms along with their placement
into the modern network architecture, and the major functional components of
NX-OS. In addition, some of the advanced serviceability and usability enhancements are
introduced to prepare you for the troubleshooting chapters that follow. This enables you
to dive into each of the troubleshooting chapters with a firm understanding of NX-OS
and Nexus switching to build upon.
The following sections introduce each Nexus platform and provide a high-level overview
of their features and placement depending on common deployment scenarios.
■ Extend the fabric to hosts without the need for spanning tree
The Nexus 2000 FEX products do not function as standalone devices; they require a
parent switch to function as a modular system. Several models are available to meet the
host port physical connectivity requirements with various options for 1 GE, 10 GE
connectivity as well as Fibre Channel over Ethernet (FCoE). On the fabric side of the
FEX, which connects back to the parent switch, different options exist for 1 GE, 10 GE,
and 40 GE interfaces. The current FEX Models are as follows:
When deciding on a FEX platform, consider the host connectivity requirements, the
parent switch connectivity requirements, and compatibility of the parent switch model.
The expected throughput and performance of the hosts should also be a consideration
because the addition of a FEX allows oversubscription of the fabric-side interfaces based
on the front panel bandwidth available for hosts.
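To make the oversubscription consideration concrete, the following Python sketch computes the ratio of host-facing bandwidth to fabric-side bandwidth. The function name, port counts, and speeds are hypothetical and chosen only for illustration; they do not describe a specific FEX model.

```python
from fractions import Fraction


def oversubscription_ratio(host_ports: int, host_speed_gbps: int,
                           fabric_ports: int, fabric_speed_gbps: int) -> Fraction:
    """Return front-panel (host) bandwidth divided by fabric-side bandwidth."""
    host_bw = host_ports * host_speed_gbps
    fabric_bw = fabric_ports * fabric_speed_gbps
    if fabric_bw <= 0:
        raise ValueError("at least one fabric uplink is required")
    return Fraction(host_bw, fabric_bw)


# Hypothetical FEX: 48 x 10 GE host ports, 4 x 10 GE fabric uplinks.
ratio = oversubscription_ratio(48, 10, 4, 10)
print(f"{ratio.numerator}:{ratio.denominator}")  # prints "12:1"
```

A ratio of 1:1 means the hosts cannot offer more traffic than the fabric links can carry; higher ratios are acceptable only if the hosts are not expected to transmit at line rate simultaneously.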
Each of these models has advantages depending on the intended role. For example,
the Nexus 3500 series are capable of ultra-low-latency switching (sub-250ns),
Note All Nexus 3000 series, with the exception of the Nexus 3500 series, run the same
NX-OS software release as the Nexus 9000 series switches.
The Nexus 5000 series is well suited as a Top of Rack (ToR) or End of Row (EoR) switch
for high-density and high-scale environments. It supports 1 GE, 10 GE, and 40 GE
connectivity for Ethernet and FCoE. Superior port densities are achieved when it is used as a
parent switch for FEX aggregation. The 5696Q supports 100 GE uplinks with the addition
of expansion modules. The platform naming convention is the model family, then the
supported number of ports at 10 GE or 40 GE, depending on the model. A Nexus 5672 is
a 5600 platform that supports 72 ports of 10 GE, and the UP characters indicate
the presence of 40 GE uplink ports.
The support for Layer 3 features combined with a large number of ports, FEX
aggregation, and the flexibility of supporting Ethernet, FCoE, and Fibre Channel in a
single platform make the Nexus 5000 series a very attractive ToR or EoR option for
many environments.
The different chassis configurations allow for optimal sizing in any environment. The
7000 series has five fabric module slots, whereas the 7700 has six fabric module slots.
The 7004 and the 7702 do not use separate fabric modules because the crossbar fabric
on the Input/Output (I/O) modules is sufficient for handling the platform’s requirements.
Access to the fabric is controlled by a central arbiter on the supervisor. This grants access
to the fabric for ingress modules to send packets toward egress modules. Virtual output
queues (VOQ) are implemented on the ingress I/O modules that represent the fabric
capacity of the egress I/O module. These VOQs minimize head-of-line blocking that
could occur waiting for an egress card to accept packets during congestion.
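The effect of virtual output queues on head-of-line blocking can be illustrated with a small Python model. This is a simplified sketch, not a description of the hardware: the per-egress "credits" stand in for the fabric capacity of each egress module, and the function names are hypothetical.

```python
from collections import defaultdict, deque


def drain_single_fifo(packets, egress_credits):
    """Single ingress FIFO: stop at the first packet whose egress has no credit."""
    credits = dict(egress_credits)
    queue, sent = deque(packets), []
    while queue and credits.get(queue[0], 0) > 0:
        dest = queue.popleft()
        credits[dest] -= 1
        sent.append(dest)
    return sent


def drain_voq(packets, egress_credits):
    """One virtual output queue per egress: congestion stalls only its own queue."""
    credits = dict(egress_credits)
    voqs = defaultdict(deque)
    for dest in packets:
        voqs[dest].append(dest)
    sent = []
    for dest, queue in voqs.items():
        # The arbiter grants fabric access only while the egress has credit.
        while queue and credits.get(dest, 0) > 0:
            credits[dest] -= 1
            sent.append(queue.popleft())
    return sent


# Egress module 1 is congested (no credit); egress module 2 can accept packets.
packets = [1, 2, 2]
credits = {1: 0, 2: 2}
print(drain_single_fifo(packets, credits))  # [] because the head packet blocks the rest
print(drain_voq(packets, credits))          # [2, 2]
```

With a single queue, the packet destined for the congested module holds up traffic for a module that could have accepted it; with one queue per egress module, only the traffic for the congested module waits.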
The Nexus 7000 and 7700 utilize a supervisor module that is responsible for running the
management and control plane of the platform as well as overseeing the platform health.
The supervisor modules have increased in CPU power, memory capacity, and switching
performance with each generation, starting with the Supervisor 1, then the Supervisor 2,
and then the current Supervisor 2E.
Because the Nexus 7000 is a distributed system, the I/O modules run their own software,
and they are responsible for handling all the data plane traffic. All Nexus 7000 I/O modules
fall into one of two families of forwarding engines: M Series or F Series. Both families
of line cards have port configurations at speeds of 1 GE, 10 GE, 40 GE,
and 100 GE. They are commonly referred to by their forwarding engine generation (M1,
M2, M3 and F1, F2, F3), with each generation offering improvements in forwarding
capacity and features over the previous. The M series generally has larger forwarding
table capacity and larger packet buffers. Previously the M series also supported more
Layer 3 features than the F series, but with the release of the F3 cards, the feature gap
has closed with support for features like Locator-ID Separation Protocol (LISP) and
MPLS. Figure 1-1 explains the I/O module naming convention for the Nexus 7000 series.
Figure 1-1 Nexus 7000 I/O Module Naming Convention (example part number: N77-F348XP-23)
The Nexus 7000 is typically deployed in an aggregation or core role; however, using
FEXs with the Nexus 7000 provides high-density access connectivity for hosts. The
Nexus 7000 is also a popular choice for overlay technologies like MPLS, LISP, Overlay
Transport Virtualization (OTV), and VXLAN due to its wide range of feature availability
and performance.
■ Supervisor A with a 4-core 1.8 GHz CPU, 16 GB of RAM, and 64 GB of SSD storage
■ Supervisor B with a 6-core 2.2 GHz CPU, 24 GB of RAM, and 256 GB of SSD storage
The Nexus 9000 series uses a mix of commodity merchant switching application-specific
integrated circuits (ASIC) as well as Cisco-developed ASICs to reduce cost
where appropriate. The Nexus 9500 was followed by the Nexus 9300 and Nexus 9200
series. Interface speeds of 1 GE, 10 GE, 25 GE, 40 GE, and 100 GE are possible, depending
on the model, and FCoE and FEX aggregation are also supported on select models.
The 9500 is flexible and modular, and it can serve as a leaf/aggregation or core/spine
layer switch, depending on the size of the environment.
The 9300 and 9200 function well as high-performance ToR/EoR/leaf switches. The
Nexus 9000 series varies in size from 1RU to 21RU with various module and connectivity
options that match nearly any connectivity and performance requirements. The available
models are as follows:
Figure: Nexus 9000 series naming convention, using N9K-C93180YC-EX as the example
■ Product type: C – Chassis/ToR; X – Line Card
■ Platform: [92–93] – 9200 or 9300 Platform; [94–97] – 9500 LC ASIC Type
■ Port count: Number of Ports If They Are the Same Speed, or Total Bandwidth in 10s of Gb
■ Port type: P – 10G SFP+; T – 10G Copper; Y – 25G SFP+; Q – 40G QSFP+; C – 100G QSFP28
■ Suffix: E – Enhanced; F – MAC SEC; X – Analytics (NetFlow); S – Merchant Silicon 100G; U – Unified Ports; R – Deep Buffers
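The naming convention above can be exercised with a short Python sketch that decodes a model string into its fields. The regular expression assumes the layout of the single example shown (N9K-C93180YC-EX, with a hyphen before the suffix letters); other model strings may not follow this exact layout, and the function and dictionary names are hypothetical.

```python
import re

PORT_TYPES = {
    "P": "10G SFP+",
    "T": "10G Copper",
    "Y": "25G SFP+",
    "Q": "40G QSFP+",
    "C": "100G QSFP28",
}
SUFFIXES = {
    "E": "Enhanced",
    "F": "MAC SEC",
    "X": "Analytics (NetFlow)",
    "S": "Merchant Silicon 100G",
    "U": "Unified Ports",
    "R": "Deep Buffers",
}
# Assumed layout: N9K-<C|X><2 platform digits><count><port letters>-<suffix letters>
PATTERN = re.compile(r"^N9K-([CX])(9[2-7])(\d+)([PTYQC]+)-([EFXSUR]+)$")


def decode_n9k(model: str) -> dict:
    """Split a Nexus 9000 model string into the fields of the naming legend."""
    match = PATTERN.match(model)
    if match is None:
        raise ValueError(f"unrecognized model string: {model}")
    kind, platform, count, ports, suffix = match.groups()
    return {
        "kind": "Chassis/ToR" if kind == "C" else "Line Card",
        "platform": platform,
        # Port count, or total bandwidth in 10s of Gb when speeds are mixed.
        "count_or_bandwidth": int(count),
        "port_types": [PORT_TYPES[p] for p in ports],
        "suffixes": [SUFFIXES[s] for s in suffix],
    }


info = decode_n9k("N9K-C93180YC-EX")
print(info["port_types"])  # ['25G SFP+', '100G QSFP28']
print(info["suffixes"])    # ['Enhanced', 'Analytics (NetFlow)']
```

Because the example model mixes 25G and 100G ports, the value 180 is read as total bandwidth in tens of gigabits rather than as a port count.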
The Nexus 9000 series is popular in a variety of network deployments because of its
speed, broad feature sets, and versatility. The series is used in high-frequency trading,
high-performance computing, large-scale leaf/spine architectures, and it is the most
popular Cisco Nexus platform for VXLAN implementations.
The portfolio of Nexus switching products is always evolving. Check the product data
sheets and documentation available on www.cisco.com for the latest information about
each product.
NX-OS Architecture
Since its inception, the four fundamental pillars of NX-OS have been resiliency, virtualization,
efficiency, and extensibility. The designers also wanted to provide a user interface
with an IOS-like look and feel so that customers migrating to NX-OS from legacy
products would feel comfortable deploying and operating it. The greatest improvements to
the core operating system over IOS were in the following areas:
■ Process scheduling
■ Memory management
■ Process isolation
■ Management of feature processes
In NX-OS, feature processes are not started until they are configured by the user. This
saves system resources and allows for greater scalability and efficiency. The features use
their own memory and system resources, which adds stability to the operating system.
Although similar in look and feel, under the hood, the NX-OS operating system has
improved in many areas over Cisco’s IOS operating system.
[Figure: NX-OS architecture. Labels shown: VLAN, UDLD, CLI, MGR, OSPF, GLBP, SYSMGR, PSS, IGMP, 802.1x, EIGRP, VRRP, SNMP, Hardware Drivers, Netstack, Kernel]
Note The next section covers some of the fundamental NX-OS components that are of
the most interest. Additional NX-OS services and components are explained in the context
of specific examples throughout the remainder of this book.
The Kernel
The primary responsibility of the kernel is to manage the resources of the system
and interface with the system hardware components. The NX-OS operating system
uses a Linux kernel to provide key benefits, such as support for symmetric
multiprocessing (SMP) and pre-emptive multitasking. Multithreaded processes can
be scheduled and distributed across multiple processors for improved scalability.
Each component process of the OS was designed to be modular, self-contained,
and memory protected from other component processes. This approach results in
a highly resilient system in which process faults are isolated and therefore easier to
recover from when a failure occurs. Because individual processes are restarted
without requiring a reload, the system recovers from such a condition with little or
no interruption.
Note Historically, access to the Linux portion of NX-OS required the installation of a
“debug plugin” by Cisco support personnel. However, on some platforms NX-OS now
offers the feature bash-shell command, which allows users to access the underlying Linux portion of
NX-OS.
The command show system internal sysmgr service all displays all the services, their
UUID, and PID as shown in Example 1-1. Notice that the Netstack service has a PID of
6427 and a UUID of 0x00000221.
Additional details about a service, such as its current state, how many times it has restarted,
and how many times it has crashed, are viewed by using the UUID obtained in the
output of the previous command. The syntax for the command is show system internal
sysmgr service uuid uuid as demonstrated in Example 1-2.
Note If a service has crashed, the process name, PID, and date/time of the event are found
in the output of show cores.
For NX-OS platforms with redundant supervisor modules, another important role of the
system manager is to coordinate state between services on the active and standby super-
visors. The system manager ensures synchronization in the event the active fails and the
standby needs to take over.
As the name implies, MTS is used for interprocess communication in NX-OS. This is
facilitated using service access points (SAP) to allow services to exchange messages. To
use an analogy, if the MTS is the postal service, think of the SAP as a post office box for
a process. Messages are sent and received by a process using its SAP over MTS.
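The post office analogy can be sketched as a toy Python model in which each SAP number owns one mailbox. This illustrates the concept only and is not the MTS implementation or its API; the class name and the sending SAP number are hypothetical.

```python
from collections import defaultdict, deque


class ToyMessageBus:
    """Toy stand-in for MTS: each SAP number owns one mailbox (queue)."""

    def __init__(self):
        self._mailboxes = defaultdict(deque)

    def send(self, src_sap: int, dst_sap: int, payload: str) -> None:
        # Deliver the message into the destination SAP's mailbox.
        self._mailboxes[dst_sap].append((src_sap, payload))

    def receive(self, sap: int):
        # Return the oldest pending message for this SAP, or None if empty.
        box = self._mailboxes[sap]
        return box.popleft() if box else None


bus = ToyMessageBus()
bus.send(src_sap=100, dst_sap=320, payload="hello")  # SAP 100 is hypothetical
print(bus.receive(320))  # (100, 'hello')
print(bus.receive(320))  # None
```

The sender needs to know only the destination SAP number, not the process behind it, which is the property the post office box analogy describes.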
The system manager table output referenced previously is used again to reference a
service name and find its UUID, PID, and SAP. This SAP number is then used to
get details from MTS on the number of messages exchanged and the state of the
MTS buffers for this service. To illustrate this, an OSPF process is configured with a
process tag of 32. Example 1-3 shows the OSPF process in the output of show system
internal sysmgr service all. This output is used to locate the UUID 0x41000119, the PID
of 13198, and the SAP of 320.
In Example 1-4, the show system internal mts sup sap sap-id [description | uuid | stats]
command is used to obtain details about a particular SAP. To examine a particular SAP,
first confirm that the service name and UUID match the values from the show system
internal sysmgr service all command. This is a sanity check to ensure the correct SAP
is being investigated. The output of show system internal mts sup sap sap-id description
should match the service name, and the output of show system internal mts sup
sap sap-id uuid should match the UUID in the sysmgr output. Next, examine the MTS
statistics for the SAP. This output is useful to determine what the maximum value of the
MTS queue was (high-water mark), as well as the number of messages this
service has exchanged. If the max_q_size ever reached is equal to the hard_q_limit, it is
possible that MTS has dropped messages for that service.
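The high-water-mark comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not an NX-OS tool: it assumes the two counters have already been read from the MTS statistics output as integers, and the function name and sample values are hypothetical. Only the counter names max_q_size and hard_q_limit come from the text.

```python
def mts_queue_hit_limit(max_q_size: int, hard_q_limit: int) -> bool:
    """Return True when the queue high-water mark reached the hard limit.

    A True result only suggests that MTS may have dropped messages for
    the service; it is a prompt for further investigation, not proof.
    """
    if max_q_size < 0 or hard_q_limit <= 0:
        raise ValueError("max_q_size must be >= 0 and hard_q_limit must be > 0")
    return max_q_size >= hard_q_limit


# Hypothetical counter values, not taken from a real switch.
print(mts_queue_hit_limit(max_q_size=12, hard_q_limit=1024))
print(mts_queue_hit_limit(max_q_size=1024, hard_q_limit=1024))
```

A check like this is most useful when run against every SAP of interest, so that only the services whose queues reached the limit need a closer look.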
Note In the output of Example 1-4, the UUID is displayed as a decimal value, whereas in
the output from the system manager it is given as hexadecimal. NX-OS has a built-in utility
to do the conversion using the hex value or dec value command.
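When comparing outputs off the switch, the same conversion is easy to script. The following Python sketch converts the OSPF UUID from Example 1-3 between the hexadecimal form shown by the system manager and the decimal form shown by MTS. The helper names are hypothetical and exist only for this illustration.

```python
def uuid_hex_to_dec(hex_uuid: str) -> int:
    """Convert a hexadecimal UUID string such as '0x41000119' to an integer."""
    return int(hex_uuid, 16)


def uuid_dec_to_hex(dec_uuid: int) -> str:
    """Convert a decimal UUID to the 0x-prefixed, 8-digit hexadecimal form."""
    return f"0x{dec_uuid:08X}"


sysmgr_uuid = "0x41000119"               # UUID as shown by the system manager
as_decimal = uuid_hex_to_dec(sysmgr_uuid)
print(as_decimal)                        # 1090519321
print(uuid_dec_to_hex(as_decimal))       # 0x41000119
```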
The NX-OS MTS service is covered in more detail in Chapter 3, “Troubleshooting Nexus
Platform Issues,” along with additional troubleshooting examples.
The PSS provides reliable and persistent storage for NX-OS services in a lightweight
key/value pair database. PSS offers two types of storage: volatile and nonvolatile.
Volatile storage is in RAM and is used to store service state that needs
to survive a process restart or crash. Nonvolatile storage is in
flash and is used to store service state that needs to survive a system reload.
Example 1-5 uses the show system internal flash command to examine the flash file
system and demonstrates how to verify the current available space for the nonvolatile PSS.
Example 1-5 Verify the Size and Location of PSS in the Flash File System
An NX-OS service utilizes volatile and nonvolatile PSS to checkpoint its run-time data
as needed. Consistent with the modular nature of NX-OS, PSS does not dictate what is
stored in which type of PSS and leaves that decision to the service. PSS simply provides
the infrastructure to allow services to store and retrieve their data.
Feature Manager
Features in NX-OS are enabled on demand and consume system resources such as
memory, CPU time, MTS queues, and PSS only when they have been enabled. If a feature is
in use and is later shut down by the operator, the resources associated with that
feature are freed and reclaimed by the system. The task of enabling or disabling features
is handled by the NX-OS infrastructure component known as the feature manager. The
feature manager is also responsible for maintaining and tracking the operational state of
all features in the system.
To better understand the role of the feature manager and its interaction with other
services, consider a specific example. An operator wants to enable BGP on a particular
Nexus switch. Because services in NX-OS are not started until they are enabled, the user
must first enter the feature bgp command in configuration mode. The feature manager
acts on this request by ensuring the proper license is in place for the feature and then
sends a message to the system manager to start the service. When the
BGP service is started, it binds to an MTS SAP, creates its PSS entries to store run-time
state, and then informs the system manager. The BGP service then registers itself with the
feature manager, which changes the operational state to enabled.
When a feature is disabled by a user, a similar set of events occurs in reverse order. The
feature manager asks the service to disable itself. The service empties its MTS buffers and
destroys its PSS data and then communicates with the system manager and feature
manager, which sets the operational state to disabled.
It is important to note that some services have dependencies on other services. If a
service is started and its dependencies are not satisfied, additional services are started so
the feature operates correctly. An example is the BGP feature, which depends on the
route policy manager (RPM). The most important concept to understand is that
services implement one or more features and that dependencies exist between services.
Except for the fact that a user must enable features, the rest is transparent to the user,
and NX-OS takes care of the dependencies automatically.
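The idea of starting unsatisfied dependencies before the requested service can be illustrated with a toy model. The following Python sketch is not how NX-OS implements dependency handling; it is a simple depth-first ordering under stated assumptions. The dependency table, the service labels "bgp" and "rpm", and the function name are hypothetical, with only the BGP-to-RPM relationship taken from the text.

```python
from typing import Dict, List, Set


def services_to_start(service: str,
                      depends_on: Dict[str, List[str]],
                      running: Set[str]) -> List[str]:
    """Return the services to start, with dependencies listed first.

    Services that are already running are skipped.
    """
    order: List[str] = []
    visiting: Set[str] = set()

    def visit(name: str) -> None:
        if name in running or name in order:
            return
        if name in visiting:
            raise ValueError(f"dependency cycle detected at {name!r}")
        visiting.add(name)
        for dependency in depends_on.get(name, []):
            visit(dependency)
        visiting.discard(name)
        order.append(name)

    visit(service)
    return order


# Hypothetical dependency table: BGP depends on the route policy manager.
DEPENDS_ON = {"bgp": ["rpm"]}

print(services_to_start("bgp", DEPENDS_ON, running=set()))    # ['rpm', 'bgp']
print(services_to_start("bgp", DEPENDS_ON, running={"rpm"}))  # ['bgp']
```

The second call shows the case where the dependency is already satisfied, so only the requested service needs to be started.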
Certain complex features require the user to specifically install a feature set before the
associated feature is enabled. MPLS, FEX, and FabricPath are a few examples. To enable
these features, the user must first install the feature set with the install feature-set [feature]
command. The feature set is then enabled with the feature-set [feature] command.
Note The license manager tracks all the feature licenses on the system. When a license
expires, the license manager notifies the feature manager to shut down the feature.
In Example 1-6, the current state of a feature is verified using the show system
internal feature-mgr feature state command. The output is provided in a table format that
lists the feature name, along with its UUID, state, and reason for the current state. In
Example 1-6, several features have been enabled successfully by the feature manager,
including two instances of EIGRP. The output also displays instances of a feature that
have not yet been enabled, such as EIGRP instances 3 through 16.
Although problems with feature manager are not common, NX-OS does provide a way to
verify whether errors have occurred using the command-line interface (CLI). Example 1-7
shows how to obtain an error code for a specific feature using the show system internal
feature-mgr feature action command, although no error codes are present in this output.
Note NX-OS maintains a running log of events for many features and services referred to
as event history logs, which are discussed later in this chapter and referenced throughout
this book. Feature manager provides two event history logs (errors and messages) that pro-
vide additional detail for troubleshooting purposes. The output is obtained using the show
system internal feature-mgr event-history [msgs | errors] command.
[Figure: software component stack. Labels: FIB Manager, ACL/QoS Manager, Port Manager,
Management Infrastructure, HA Infrastructure, Hardware Drivers, Netstack, Kernel]
During system boot, or if a card is inserted into the chassis, the supervisor decides
whether to power on the card. This is done by checking the card type and verifying
that the required power, software, and hardware resources are in place for the card to
operate correctly. If so, the decision to power on the card is made. From that point, the
line card powers on, executes its Basic Input/Output System (BIOS) and power-on
self-tests, and starts its system manager. Next, all the line card services required for
normal operation are started. Communication and messaging channels are established to
the supervisor that allow the supervisor to push the configuration and line card software
upgrades as needed. Additional services are started for local handling of exception
logging, management of environmental sensors, the card LEDs, health monitoring, and so
on. After the critical system services are started, the individual ASICs are started, which
allows the card to forward traffic.
In the operational state, packets are forwarded and communications occur as needed with
the supervisor to update counters, statistics, and environmental data. The line card has
local storage for PSS as well as for On-Board Failure Logging (OBFL). The OBFL data is
stored in nonvolatile memory so that it can survive reloads and is an excellent source of
data for troubleshooting problems specific to the line card. Information such as exception
history, card boot history, environmental history, and much more is stored in the OBFL
storage.
For day-to-day operations, there is typically no need to enter the line card CLI. The
NX-OS operating system and distributed platforms are designed to be configured and
managed from the supervisor module. There are some instances where direct access
to the CLI of a line card is required. Typically, these scenarios also involve working
with Cisco TAC to collect data and troubleshoot the various line card subsystems.
In Example 1-8, the line card CLI is entered from the supervisor module using the
attach module command. Notice that the prompt changes to indicate which module
the user is currently connected to. After the user has entered the line card CLI, the
show hardware internal dev-port-map command is issued, which displays the map-
ping of front panel ports to the various ASICs of the card on this Nexus 7000 M2
series card.
Example 1-8 Use of the attach module CLI from the Supervisor
9 2 0 0 0,1 0 0 0,1 0
10 2 0 0 0,1 0 0 0,1 0
11 2 0 0 0,1 0 0 0,1 0
12 2 0 0 0,1 0 0 0,1 0
13 3 1 1 2,3 1 1 2,3 0
14 3 1 1 2,3 1 1 2,3 0
15 3 1 1 2,3 1 1 2,3 0
16 3 1 1 2,3 1 1 2,3 0
17 4 1 1 2,3 1 1 2,3 0
18 4 1 1 2,3 1 1 2,3 0
19 4 1 1 2,3 1 1 2,3 0
20 4 1 1 2,3 1 1 2,3 0
21 5 1 1 2,3 1 1 2,3 0
22 5 1 1 2,3 1 1 2,3 0
23 5 1 1 2,3 1 1 2,3 0
24 5 1 1 2,3 1 1 2,3 0
+-----------------------------------------------------------------------+
+-----------------------------------------------------------------------+
Note A common reason to access a line card’s CLI is to run embedded logic
analyzer module (ELAM) packet captures on the local forwarding engine. ELAM
is a tool used to troubleshoot data plane forwarding and hardware forwarding table
programming problems. ELAM capture is outside the scope of this book.
File Systems
The file system is a vital component of any operating system, and NX-OS is no exception.
The file system contains the directories and files needed by the operating system to boot,
log events, and store data generated by the user, such as support files, debug outputs, and
scripts. It is also used to store the configuration and any data that services store in non-
volatile PSS, which aids in system recovery after a failure.
Working with the NX-OS file system is similar to working with files in Cisco’s IOS, with
some improvements. Files and directories are created and deleted from bootflash: or the
external USB memory referred to as slot0:. Archive files can be created to compress large
files, such as show tech outputs, to save space. Table 1-1 provides a list of file system
commands that are needed to manage and troubleshoot an NX-OS switch.
Note The gzip and tar options are useful when working with data collected during trou-
bleshooting. Multiple files are combined into an archive and compressed for easy export to
a central server for analysis.
This provides the list of files and subdirectories on the currently active supervisor. For
platforms with redundant supervisors, directories of the standby supervisor are accessed
as demonstrated in Example 1-10 by appending //sup-standby/ to the directory path.
an M2 I/O module. Having this persistent historical information is extremely useful for
troubleshooting a module problem.
Note The output in Example 1-11 is from a distributed platform; however, OBFL data is
available for nondistributed platforms as well. The items enabled depend on the platform.
Configure the OBFL options using the hw-module logging onboard configuration com-
mand with various subcommand options. There is typically no reason to disable OBFL.
Logflash
Logflash is a persistent storage location used to store system logs, syslog messages,
debug output, and core files. On some Nexus platforms the logflash is an external compact
flash or USB device that may not have been installed, or was removed at some point. The
system prints a periodic message indicating the logflash is missing to alert the operator
about this condition so that it can be corrected. It is recommended to have the logflash
mounted and available for use by the system so that any operational data is stored
there. In the event of a problem, the persistent nature of logflash means that this data is
available for analysis. Example 1-12 uses the show system internal flash command to verify
that logflash: is mounted and how much free space is available.
Example 1-12 Verifying the State and Available Space for the logflash:
The contents of the logflash directory are examined using the dir logflash: command, as
shown in Example 1-13.
Example 1-14 demonstrates using the show file command to print the contents of a file in
logflash:.
■ Minor releases enhance the features and functions of an existing major release.
Depending on the Nexus platform and release, the naming convention of the software
version varies. In early versions of NX-OS, each platform was built on its own NX-OS
operating system code base. Today the majority of platforms use a common NX-OS
base operating system. This common base code is then modified or augmented as
needed to meet the feature requirements or hardware support of a specific platform.
The advantage of this approach is that fixes for software defects in the
platform-independent base code are now incorporated back into the common base, and all
platforms benefit from those fixes.
Figure 1-5 explains how to interpret the NX-OS software naming convention to recognize
the Major/Minor/Maintenance release portions of the image name for a 6.2 release of
NX-OS for the Nexus 7000 platform.
6.2(8a)
6 = Major Release Identifier
2 = Minor Release Identifier
8 = Maintenance Release Identifier
a = Rebuild Identifier
Figure 1-6 explains how to interpret the NX-OS software naming convention with the
common platform-independent base code and platform-dependent release details for the
Nexus 7000 platform.
7.3(1) D1(1)
■ D—Nexus 7000/7700
■ N—Nexus 5000/6000
■ A—Nexus 3548
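The naming conventions in Figures 1-5 and 1-6 can be expressed as a small parser. The following Python sketch is illustrative only: the regular expression, the class, and the function name are assumptions made for this example, and it handles only the two version formats shown here, 6.2(8a) and 7.3(1)D1(1). It also assumes the base portion of the second format follows the same major/minor/maintenance pattern as the first. The platform designator table is taken from the list above.

```python
import re
from typing import NamedTuple, Optional

# Platform designators as listed in the text.
PLATFORM_FAMILIES = {
    "D": "Nexus 7000/7700",
    "N": "Nexus 5000/6000",
    "A": "Nexus 3548",
}

# Hypothetical pattern covering only the two formats shown in the figures.
_VERSION_RE = re.compile(
    r"^(?P<major>\d+)\.(?P<minor>\d+)"
    r"\((?P<maint>\d+)(?P<rebuild>[a-z]*)\)"
    r"\s*(?P<platform>[A-Z]\d+\(\d+[a-z]*\))?$"
)


class NxosVersion(NamedTuple):
    major: int
    minor: int
    maintenance: int
    rebuild: str                      # empty string when there is no rebuild
    platform_release: Optional[str]   # for example "D1(1)", or None
    platform_family: Optional[str]    # looked up from the designator letter


def parse_version(text: str) -> NxosVersion:
    """Split a version string into the release identifiers it contains."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    platform = match.group("platform")
    family = PLATFORM_FAMILIES.get(platform[0]) if platform else None
    return NxosVersion(
        major=int(match.group("major")),
        minor=int(match.group("minor")),
        maintenance=int(match.group("maint")),
        rebuild=match.group("rebuild"),
        platform_release=platform,
        platform_family=family,
    )


print(parse_version("6.2(8a)"))
print(parse_version("7.3(1)D1(1)"))
```

The parser deliberately rejects anything it does not recognize, such as the dotted nxos image names used on other platforms, rather than guessing at their structure.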
Note The Nexus 3000 and Nexus 9000 series platforms now share a common
platform-dependent software base, and the image name begins with nxos; for example,
nxos.7.0.3.I6.1.bin.
Upgrading the Erasable Programmable Logic Device (EPLD) image is also possible on
some platforms. The EPLD image is packaged separately from the NX-OS operating
system. The EPLD image upgrades firmware on the hardware components of the I/O
modules or line cards to offer new hardware functionality or to resolve known problems
without having to replace the hardware.
Note Not every NX-OS system upgrade requires an EPLD upgrade. The procedures for
installing NX-OS software and EPLD images are documented with examples for each
Nexus platform on www.cisco.com. Refer to the Software Upgrade and Installation Guides
for more details.
The Software Maintenance Upgrade (SMU) feature allows network operators to apply
a specific bug fix to their Nexus switch without requiring a system reload or in-service
software upgrade (ISSU). Critical network environments do not upgrade software
without extensive qualification testing specific to their architecture and configured features.
Previously, if a bug fix was needed, a new maintenance release of NX-OS had to undergo
qualification testing and then be rolled out to the network. This adds delay, both in
waiting for the fix to be released in a maintenance release of NX-OS and in the
qualification testing required before the network is finally patched to eliminate
the problem. The SMU concept reduces this delay because only the SMU changes
are applied to the already qualified base image. The SMU installation procedure leverages
process restart or ISSU when possible to minimize impact to the network during
installation. The Nexus switch then runs with the SMU applied until the specific bug fix is
available in a qualified NX-OS maintenance release on Cisco.com.
Note An SMU is valid only for the image it was created for. If the NX-OS software is
upgraded to another release, the SMU is deactivated. It is critical to ensure any applicable
software defects are fixed in the new version of software before performing an upgrade.
The SMU files are packaged as a binary and a README.txt that details the
bugs addressed by the SMU. The naming convention of the SMU file is
platform-package_type.release_version.Bug_ID.file_type. For example,
n7700-s2-dk9.7.3.1.D1.1.CSCvc44582.bin. The general procedure for installing an SMU follows:
Step 1. Copy the package file or files to a local storage device or file server.
Step 2. Add the package or packages on the device using the install add command.
Step 3. Activate the package or packages on the device using the install activate
command.
Step 4. Commit the current set of packages using the install commit command.
For a reload or ISSU SMU, however, commit the packages after the
reload or ISSU.
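Before copying an SMU to the switch, it can help to confirm which release and bug a package file targets. The following Python sketch splits a file name according to the naming convention given before the installation steps. It is a minimal illustration under stated assumptions: the class and function names are hypothetical, the platform is assumed to be everything before the first hyphen, and the bug ID is assumed to begin with CSC, as it does in the example file name.

```python
from typing import NamedTuple


class SmuName(NamedTuple):
    platform: str
    package_type: str
    release_version: str
    bug_id: str
    file_type: str


def parse_smu_filename(filename: str) -> SmuName:
    """Split platform-package_type.release_version.Bug_ID.file_type."""
    parts = filename.split(".")
    if len(parts) < 4:
        raise ValueError(f"too few fields in SMU file name: {filename!r}")
    # First field is platform-package_type, last two are bug ID and file type,
    # and everything in between is the dotted release version.
    head, *release, bug_id, file_type = parts
    platform, separator, package_type = head.partition("-")
    if not separator or not bug_id.startswith("CSC"):
        raise ValueError(f"unexpected SMU file name layout: {filename!r}")
    return SmuName(platform, package_type, ".".join(release), bug_id, file_type)


smu = parse_smu_filename("n7700-s2-dk9.7.3.1.D1.1.CSCvc44582.bin")
print(smu.platform)         # n7700
print(smu.release_version)  # 7.3.1.D1.1
print(smu.bug_id)           # CSCvc44582
```

Because an SMU is valid only for the image it was created for, the release_version field is the one to compare against the running software before installation.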
Note Before attempting the installation of an SMU, please review the detailed examples
on www.cisco.com for the platform.
Licensing
NX-OS requires that the operator obtain and install appropriate license files for the
features being enabled. Typically, Nexus platforms support a base feature set with no
additional license requirements. This includes most Layer 2 functionality and
generally some form of Layer 3 routing support. To enable advanced features, such as MPLS,
OTV, FabricPath, FCoE, advanced routing, or VXLAN, additional licenses may need to
be installed depending on the platform. In addition to feature licenses, several Nexus
platforms also offer licenses to provide additional hardware capabilities. For example,
SCALABLE_SERVICES_PKG on the Nexus 7000 series enables XL-capable I/O
modules to operate in XL mode and take full advantage of their larger table sizes. Another
example is the port upgrade licenses available for some Nexus 3000 platforms.
License enforcement is built in to the NX-OS operating system by the feature manager,
which disables services if the appropriate licenses are not present. If a specific feature is
not configurable, the most likely culprit is a missing license. Cisco does allow for feature
testing without a license by configuring the license grace-period in global configuration,
which allows features to function for up to 120 days without the license installed. This
does not cover all feature licenses on all platforms, however. Most notably the Nexus
9000 and Nexus 3000 do not support license grace-period.
License files are downloaded from www.cisco.com. To obtain a license file you need
the serial number that is found with the show license host-id command. Next, use the
product authorization key (PAK) from your software license claim to retrieve the license
file and copy it to your switch. Installation of a license is a nondisruptive task and is
accomplished with the install license command. For platforms that support virtual device
contexts (VDC) the license is installed and managed on the default VDC and applies for
all VDCs present on the chassis. The license installation is verified with the show license
command.
NX-OS is capable of restarting a service to recover and resume normal operation while
minimizing impact to the data plane traffic being forwarded. This process restart event is
either stateful or stateless and occurs when initiated by the user, or automatically when
the system manager identifies a process failure.
In the event of a stateless restart, all the run-time data structures associated with the failed
process are lost, and the system manager quickly spawns a new process to replace the one
that failed. A stateful restart means that a portion of the run-time data is used to recover
and seamlessly resume functioning where the previous process left off after a process fail-
ure or restart. Stateful restart is possible because the service updates its state in PSS while
active and then recovers the important run-time data structures from PSS after a failure.
Persistent MTS messages left in the process queue are picked up by the restarted service
to allow a seamless recovery. The capability to resume processing persistent messages in
the MTS queue means the service restart is transparent to other services that were com-
municating with the failed process.
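The difference between the two restart types can be made concrete with a toy model. The following Python sketch is not NX-OS code: a plain dictionary stands in for PSS, the class and function names are hypothetical, and MTS message recovery is left out. It shows only that a stateful service checkpoints its run-time data while active and reads it back after a restart, whereas a stateless service starts empty.

```python
from typing import Any, Dict


class ToyService:
    """A service whose run-time state is lost whenever the object is replaced."""

    def __init__(self, name: str, pss: Dict[str, Dict[str, Any]], stateful: bool):
        self.name = name
        self.pss = pss            # stands in for PSS; it outlives the process
        self.stateful = stateful
        # A stateful service recovers its checkpointed data on startup.
        self.state: Dict[str, Any] = dict(pss.get(name, {})) if stateful else {}

    def update(self, key: str, value: Any) -> None:
        """Change run-time state and, if stateful, checkpoint it to PSS."""
        self.state[key] = value
        if self.stateful:
            self.pss.setdefault(self.name, {})[key] = value


def restart(service: ToyService) -> ToyService:
    """Model a process restart: a new process replaces the old one."""
    return ToyService(service.name, service.pss, service.stateful)


pss: Dict[str, Dict[str, Any]] = {}

stateful = ToyService("svc-a", pss, stateful=True)
stateful.update("neighbors", 3)
print(restart(stateful).state)    # {'neighbors': 3}

stateless = ToyService("svc-b", pss, stateful=False)
stateless.update("neighbors", 3)
print(restart(stateless).state)   # {}
```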
NX-OS provides the infrastructure to the individual processes so that they can choose
the type of recovery mechanism to implement. In some cases, a stateful recovery does
not make sense, because a recovery mechanism is built in to the higher layers of a proto-
col. Consider a routing protocol process, such as OSPF or BGP, that has a protocol level
graceful restart or nonstop forwarding implementation. For those protocols, it does not
make sense to checkpoint the routing updates into the PSS infrastructure because they
are recovered by the protocol.
Note The reason for a reset is reviewed in the output of show system reset-reason.
Process crash or restart details are viewed with the show processes log pid and show
cores commands.
Supervisor Redundancy
Nexus platforms with redundant supervisor modules operate in an Active/Standby
redundancy mode. This means that only one of the supervisors is active at a time, and
the standby is ready and waiting to take over when a fatal failure of the active occurs.
Active/Standby supervisor redundancy provides a fully redundant control plane for the
device and allows for stateful switchover (SSO) and in-service software upgrades (ISSU).
The current redundancy state and the identity of the active supervisor are viewed in the
output of show module, as well as the output of show system redundancy status, as shown in
Example 1-15.
2. The system manager process on the standby announces itself to the system manager
process of the active.
3. The system manager of the standby synchronizes the startup configuration from the
active and starts all services on the standby to mirror the active.
4. The services on the standby synchronize state with a snapshot of the services state
on the active.
5. MTS messages from the services on the active are copied to the standby.
7. Process events are now copied to the standby so the services on both supervisors
remain in sync during normal operation (event-based synchronization).
In the event of a supervisor switchover, services on the standby supervisor are
notified by the system manager to recover state and prepare to take over the active role.
Because the process events are synchronized to the standby by MTS during normal
operation, the recovery occurs quickly. After the switchover is complete, the
supervisor that was previously active is restarted, and it undergoes normal boot diagnostic
tests. If diagnostic tests pass, and it boots successfully, it synchronizes with the
current active supervisor using the procedure previously outlined.
Figure 1-7 shows the relationship of the NX-OS services that make up the supervisor
redundancy model.
[Figure 1-7: supervisor redundancy model. Labels: Active Supervisor, System Manager,
Service, MTS, PSS, Redundancy Driver (shown for each supervisor)]
In rare circumstances, the standby supervisor may fail to reach the HA Standby state.
One possible reason is that a service on the standby is not able to synchronize state with
the active. To check for this condition, verify the sysmgr state on the active and standby
supervisor to confirm which service is not able to synchronize state. If multiple VDCs are
configured, perform this verification for each VDC. To verify the synchronization state
of the supervisors, use the show system internal sysmgr state command, as shown in
Example 1-16.
HA info:
slotid = 5 supid = 0
cardstate = SYSMGR_CARDSTATE_ACTIVE .
cardstate = SYSMGR_CARDSTATE_ACTIVE (hot switchover is configured enabled).
Configured to use the real platform manager.
Configured to use the real redundancy driver.
Redundancy register: this_sup = RDN_ST_AC, other_sup = RDN_ST_SB.
EOBC device name: veobc.
Remote addresses: MTS - 0x00000601/3 IP - 127.1.1.6
MSYNC done.
Remote MSYNC not done.
Module online notification received.
Local super-state is: SYSMGR_SUPERSTATE_STABLE
Standby super-state is: SYSMGR_SUPERSTATE_STABLE
Swover Reason : SYSMGR_UNKNOWN_SWOVER
Total number of Switchovers: 0
Swover threshold settings: 5 switchovers within 4800 seconds
Switchovers within threshold interval: 0
Last switchover time: 0 seconds after system start time
Cumulative time between last 0 switchovers: 0
Start done received for 1 plugins, Total number of plugins = 1
Statistics:
Message count: 0
Total latency: 0 Max latency: 0
Total exec: 0 Max exec: 0
The show system internal sysmgr gsync-pending command is used to verify that syn-
chronization is complete. Any services that are still pending synchronization are listed in
the output. Example 1-17 confirms that no services are pending synchronization on the
active supervisor.
The sysmgr output confirms that the superstate is stable for both supervisors, which
indicates there is no problem currently. If there were a problem, the superstate would
display as unstable. The superstate on the standby supervisor is verified by attaching to
the standby supervisor module, as shown in Example 1-18.
Debugging info:
HA info:
slotid = 6 supid = 0
cardstate = SYSMGR_CARDSTATE_STANDBY .
cardstate = SYSMGR_CARDSTATE_STANDBY (hot switchover is configured enabled).
Configured to use the real platform manager.
Configured to use the real redundancy driver.
Redundancy register: this_sup = RDN_ST_SB, other_sup = RDN_ST_AC.
EOBC device name: veobc.
Remote addresses: MTS - 0x00000501/3 IP - 127.1.1.5
MSYNC done.
Remote MSYNC done.
Module online notification received.
Local super-state is: SYSMGR_SUPERSTATE_STABLE
Standby super-state is: SYSMGR_SUPERSTATE_STABLE
Swover Reason : SYSMGR_UNKNOWN_SWOVER
Total number of Switchovers: 0
Swover threshold settings: 5 switchovers within 4800 seconds
Switchovers within threshold interval: 0
Last switchover time: 0 seconds after system start time
Cumulative time between last 0 switchovers: 0
Start done received for 1 plugins, Total number of plugins = 1
Statistics:
Message count: 0
Total latency: 0 Max latency: 0
Total exec: 0 Max exec: 0
The superstate is stable, and the redundancy register indicates that this supervisor is
in the standby redundancy state (RDN_ST_SB). Verify there are no services pending
synchronization on the standby, as shown in Example 1-19.
If a service pending synchronization were found in this output, the next step in
the investigation would be to verify the MTS queues for that particular service. An example
of verifying the MTS queues for a service was demonstrated earlier in this chapter and is
also shown in Chapter 3. If the MTS queue had messages pending for the service, further
investigation into why those messages are pending is the next step in solving the problem.
Network or device instability could be causing frequent MTS updates to the service that
prevent the synchronization from completing.
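When the sysmgr state output is collected from many switches, the fields examined in Examples 1-16 and 1-18 can be extracted with a short script. The following Python sketch is a minimal illustration, assuming the output has been saved as text and that the field labels match those shown in the examples; the function name and the returned dictionary keys are hypothetical.

```python
import re
from typing import Dict, Union

STABLE = "SYSMGR_SUPERSTATE_STABLE"


def summarize_sysmgr_state(output: str) -> Dict[str, Union[str, bool]]:
    """Pull the redundancy register and superstate fields out of saved output."""
    register = re.search(r"this_sup\s*=\s*(\w+),\s*other_sup\s*=\s*(\w+)", output)
    local = re.search(r"Local super-state is:\s*(\w+)", output)
    standby = re.search(r"Standby super-state is:\s*(\w+)", output)
    if not (register and local and standby):
        raise ValueError("expected sysmgr state fields were not found")
    return {
        "this_sup": register.group(1),
        "other_sup": register.group(2),
        "local_stable": local.group(1) == STABLE,
        "standby_stable": standby.group(1) == STABLE,
    }


# Lines copied from the standby supervisor output in Example 1-18.
SAMPLE = """\
Redundancy register: this_sup = RDN_ST_SB, other_sup = RDN_ST_AC.
Local super-state is: SYSMGR_SUPERSTATE_STABLE
Standby super-state is: SYSMGR_SUPERSTATE_STABLE
"""

summary = summarize_sysmgr_state(SAMPLE)
print(summary["this_sup"], summary["local_stable"], summary["standby_stable"])
```

Any result where either stable flag is False identifies a supervisor that warrants the pending-synchronization and MTS queue checks described in this section.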
ISSU
NX-OS allows for in-service software upgrade (ISSU) as a high-availability feature. ISSU
makes use of the NX-OS stateful switchover (SSO) capability with redundant supervisors
and allows the system software to be updated without an impact to data traffic. During
an ISSU, all components of the chassis are upgraded.
ISSU is initiated using the install all command, which performs the following steps to
upgrade the system.
Step 1. Determines whether the upgrade is disruptive and asks if you want to
continue
Step 3. Copies the kickstart and system images to the standby supervisor module
Step 5. Reloads the standby supervisor module with the new Cisco NX-OS software
Step 6. Reloads the active supervisor module with the new Cisco NX-OS software,
which causes a switchover to the newly upgraded standby supervisor module
For platforms that do not have a redundant supervisor, such as the Nexus 5000 series, a
different method is used to achieve ISSU. The control plane becomes inactive while the
data plane continues to forward packets. This allows the supervisor CPU to reset without
causing a traffic disruption and load the new NX-OS software version. After the CPU is
booted on the new software release, the control plane is restored from the previous con-
figuration and run-time state. The switch then synchronizes the control plane state to the
data plane.
Nexus 9000 and Nexus 3000 platforms introduced an enhanced ISSU feature begin-
ning in release 7.0(3)I5(1). Normally the NX-OS software runs directly on the hardware.
However, with enhanced ISSU, the NX-OS software runs inside of a separate Linux con-
tainer (LXC) for the supervisor and line cards. During enhanced ISSU, a third container is
created to act as the standby supervisor so that the primary supervisor and line cards are
upgraded without disruption to data traffic. This feature is enabled with the boot mode
lxc configuration command on supported platforms.
Note ISSU has restrictions on some platforms, and ISSU may not be supported between
certain releases of NX-OS. Please reference the documentation on www.cisco.com to
ensure ISSU is supported before attempting an upgrade with this method.
A common use case for VDC is with OTV or LISP, where a dedicated VDC is configured
for the overlay encapsulation protocol, and another VDC functions as a distribution
layer switch performing traditional Layer 2 and Layer 3 functions. Another popular
use of the VDC concept is to have a production VDC and a test/development VDC to
allow separation of these different environments in a single chassis. After appropriate
planning and VDC creation, operators allocate ports to each VDC and then intercon-
nect those ports to allow control plane protocols and data plane traffic to be exchanged
between the VDCs.
The VDC architecture inherently means that some resources are global to the switch;
other resources are shared between VDCs or dedicated to a specific VDC. For example,
an OSPF process in VDC-1 is independent of an OSPF process in VDC-2, although they
share the common CPU resources of the switch. The management Ethernet on the super-
visor is shared among all VDCs. Specific ports on an I/O module are dedicated to a VDC,
whereas the NX-OS kernel is global to the switch.
The logical separation between VDCs extends to the protocol stack; however, all VDCs
on the switch share the same kernel resources and infrastructure. The system infrastruc-
ture is designed to allow fair resource allocation of shared resources, as well as the con-
trol plane queues from the kernel to the protocol stack of each VDC. Other resources are
dedicated to a particular VDC, such as VLANs and routing table space. Figure 1-8 pro-
vides a visual representation of the VDC architecture of the Nexus 7000 series.
[Figure 1-8: VDC architecture. A physical switch hosts VDC A and VDC B. Each VDC runs its
own Layer 2 protocols (VLAN MGR, UDLD, STP, CDP) and Layer 3 protocols (OSPF, GLBP, BGP,
HSRP) on top of the shared Infrastructure and Kernel.]
With appropriate licenses, the Supervisor 1 and Supervisor 2 allow for four VDCs plus
an admin VDC. The Supervisor 2E allows for eight VDCs plus an admin VDC. The admin
VDC does not handle any data plane traffic and serves only switch management func-
tions. In the context of operating or troubleshooting in a VDC environment, note that
certain tasks can be performed only from the default VDC.
4. Licensing operations
8. Ethanalyzer captures
Although VDCs allow additional versatility, some restrictions exist. For instance, all
VDCs run on the same NX-OS version and kernel. Restrictions also exist on which I/O
modules can be in the same VDC, and which ports of a line card can be allocated to a
VDC based on the hardware application-specific integrated circuit (ASIC) architecture of
the I/O module and forwarding engine. Before attempting to create VDCs, check the doc-
umentation for the specific supervisor and I/O modules that are installed in the switch so
that any limitations are dealt with in the design and planning phase.
Note At the time of this writing, multiple VDCs are supported only on the Nexus
7000 series.
The concept of VRF-lite defines multiple routing and forwarding tables with logical
separation on a single device without a Multiprotocol Label Switching (MPLS) transport.
MPLS VPNs use VRFs on provider edge (PE) nodes to separate multiple routing and for-
warding tables logically with an MPLS transport between PEs.
NX-OS supports both VRF-lite and MPLS VPN for virtualization and separation of rout-
ing tables and forwarding state. Importing and exporting routes between VRF contexts is
supported, as well as import and export from the global routing table to a VRF table. In
addition to the user-defined VRFs, NX-OS puts the management Ethernet interface of the
switch into its own management VRF by default. This provides a desirable separation of
data plane and management plane services.
In the virtualization hierarchy, a VRF exists locally within a VDC, and multiple VDCs can
exist in a physical switch. If VRFs configured in different VDCs need to communicate,
a control plane routing protocol is required to exchange information between the VRFs
in different VDCs. This is done in the same manner as routing between VDCs using the
default VRF. Routing traffic between VRFs is achieved with a control plane protocol to
exchange routing information, or if the VRFs exist in the same VDC, route leaking is
used to exchange routes between them.
Note Support for MPLS VPN is dependent upon the capabilities of the platform and the
installed feature licenses.
switch pair, which makes this technology a very attractive option to remove STP blocking
ports from the access layer. When two switches are configured as a vPC pair, one switch
is elected primary (lowest priority wins). The primary role comes into play for STP, as
well as during certain failure scenarios.
Figure 1-9 is an example of a vPC-enabled switch pair connected with two additional
switches using vPC.
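[Figure 1-9: vPC topology. NX-1 and NX-2 are joined by the vPC peer-link and the vPC peer keepalive link. vPC ports form Port-channel 10 and Port-channel 20 toward the access switches; a non-vPC (orphan) port is also shown.]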
In Figure 1-9, the vPC pair is using vPC Port-channel 10 and vPC Port-channel 20 to
connect with two access switches. A third access switch to the right of NX-2 is not con-
nected in vPC mode. This non-vPC enabled interface is known as an orphan port in vPC
terminology. The vPC terms in Figure 1-9 are as follows:
■ vPC Peer Keepalive link should be a separate path from the peer-link. It does not
have to be a point-to-point link and can traverse a routed infrastructure. The peer
keepalive link is used to ensure liveness of the vPC peer switch.
■ vPC Port (vPC member ports) are ports assigned to a vPC Port-channel group. The
ports of the vPC are split between the vPC peers.
Because all links are up and forwarding in vPC, the decision of which vPC member-link
interface to forward a packet on is made by the port-channel load-balance hash of the
device sending the packet. The sending switch looks at the frame source and destina-
tion addresses of a traffic flow and feeds these details into an algorithm. The algorithm
performs a hash function and returns a selected member port of the port-channel for the
frame to exit on. This allows all member link interfaces of the port-channel to share the
load of traffic.
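The hash-based member selection described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `select_member`, the use of CRC32 as the hash, and the interface names are hypothetical, and real switches compute the hash in hardware with platform-specific inputs.

```python
import zlib

def select_member(src_mac, dst_mac, members):
    """Pick a port-channel member link for a flow (illustrative only)."""
    # Normalize the addresses so the same flow always produces the same key.
    key = "{}-{}".format(src_mac.lower(), dst_mac.lower()).encode()
    # CRC32 stands in for the hardware hash; it is NOT the NX-OS algorithm.
    index = zlib.crc32(key) % len(members)
    return members[index]

members = ["Ethernet1/1", "Ethernet1/2"]  # hypothetical member links
first = select_member("0000.0c07.ac01", "0000.0c07.ac02", members)
second = select_member("0000.0c07.ac01", "0000.0c07.ac02", members)
print(first, second)
```

Because the hash input is the same for every frame of a flow, both calls return the same member link, while different flows can land on different links and share the load.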
With each vPC peer making independent frame forwarding decisions, there is a need for
state synchronization between the peers. Cisco Fabric Services (CFS) is the protocol that is used to synchronize
Layer 2 (L2) state. It operates over the peer-link and is enabled automatically. Its role is to
ensure compatibility of the vPC member ports between vPC peers. It is also used to syn-
chronize MAC address tables and IGMP snooping state between vPC peers so that any
table entries exist on both vPC peers. Layer 3 (L3) forwarding tables and protocol state
are independent on each vPC peer.
Note vPC is introduced here as a virtualization concept and is covered in detail later in
this book.
During the process of investigating a problem, data is often collected for offline analysis.
This means that the user needs to execute commands, capture the output, and then transfer
that output from the device to somewhere else for review. NX-OS provides the > and >>
operators, as shown in Example 1-20, which allow output to be redirected to a new file or
appended to an existing file, respectively. This is especially useful when collecting a lengthy
show tech-support file.
The parsing utilities (i.e., “| count” and “| wc”) provide a way of counting the
number of lines or words in the show command being executed. Example 1-21 shows
how to use this for situations where a simple count provides verification of the current
state; for example, comparing the number of lines in show ip ospf neighbor before and
after a configuration change.
There are many troubleshooting scenarios where command output needs to be taken mul-
tiple times and compared to see what has changed or which counters have incremented
since the previous execution. Example 1-22 demonstrates the use of the diff utility. Do
not use this for large outputs such as show tech, because it consumes system resources
while retaining the output for comparison. It is better to compare large outputs after
transferring the data off-box.
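For the off-box comparison of large outputs, a minimal Python sketch using the standard `difflib` module might look like the following. The helper name `changed_lines` and the sample counter text are assumptions made for illustration; they are not NX-OS output.

```python
import difflib

def changed_lines(before, after):
    """Return unified-diff lines between two captured command outputs."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="before", tofile="after", lineterm=""))

# Hypothetical captures of the same command taken at two different times.
before = "Eth1/1 rx_packets 100\nEth1/2 rx_packets 50\n"
after = "Eth1/1 rx_packets 180\nEth1/2 rx_packets 50\n"

for line in changed_lines(before, after):
    print(line)
```

Lines prefixed with `-` and `+` show the old and new values, so a counter that incremented between the two captures stands out immediately.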
In some cases it is desirable to obtain only the last few lines of a command instead of
paging through the output or using an include/exclude utility option. The last count
utility displays the last few entries and is used when parsing the accounting log, system
log buffer, or event history logs. Example 1-23 shows how to print only the last line in the
log buffer.
Another nice feature is the “,” utility, which is used to execute a command with multiple
arguments. Example 1-24 shows how this is useful for checking the interface rate simulta-
neously on two different ports when combined with egrep.
The egrep and grep utilities are extremely useful for removing clutter in a command out-
put to show only the character pattern of interest. Example 1-25 demonstrates a common
use for egrep, which is to review an event history log and look for a specific event. In this
example, egrep is used to find each time OSPF has run its shortest path first (SPF) algo-
rithm. The prev 1 option is used so that the line previous to the pattern match is printed,
which indicates a full or partial SPF run. The next option is used to get lines after the
egrep pattern match.
2017 Jul 9 23:44:00.094085 ospf 12 [16161]: : SPF run 10 STARTED with flags
0x1, vpn superbackbone changed flag is FALSE
--
2017 Jul 9 23:43:56.074094 ospf 12 [16161]: : This is a full SPF
2017 Jul 9 23:43:56.074091 ospf 12 [16161]: : SPF run 9 STARTED with flags
0x1, vpn superbackbone changed flag is FALSE
--
Egrep has several other useful options that you should become familiar with. They are
count, which returns a count of the number of matches, invert-match, which prints only
lines that do not match the pattern, and line-number, which appends the line number of
the match to each line.
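When an event-history log has already been saved off-box, the effect of the prev option can be approximated with a few lines of Python. This is a rough sketch, not the NX-OS implementation: the function name `grep_prev` is hypothetical, and the sample lines are abridged from the output shown earlier.

```python
import re

def grep_prev(lines, pattern, prev=1):
    """Return each matching line preceded by up to `prev` earlier lines."""
    regex = re.compile(pattern)
    groups = []
    for i, line in enumerate(lines):
        if regex.search(line):
            # Slice from `prev` lines before the match through the match itself.
            groups.append(lines[max(0, i - prev):i + 1])
    return groups

log = [
    "2017 Jul 9 23:44:00.094085 ospf 12 [16161]: : SPF run 10 STARTED with flags 0x1",
    "2017 Jul 9 23:43:56.074094 ospf 12 [16161]: : This is a full SPF",
    "2017 Jul 9 23:43:56.074091 ospf 12 [16161]: : SPF run 9 STARTED with flags 0x1",
]

groups = grep_prev(log, r"SPF run \d+ STARTED", prev=1)
for group in groups:
    print("\n".join(group))
    print("--")
```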
Example 1-26 demonstrates show cli list [string], which returns all CLI commands that
match the given string input. This saves time over using the “?” to figure out which com-
mands exist.
Complementary to the previous example, the syntax for the identified CLI commands is
available with the show cli syntax [string] command, as shown in Example 1-27.
The show running-config diff command is useful to quickly compare the running-con-
figuration and the startup-configuration of the switch. In Example 1-28 a logging logfile
was configured and is highlighted with the “!” in the output as a changed line between
the two files.
Cisco IOS requires that you prepend any user or exec level command with do while in
the configuration mode. NX-OS eliminates the need for the do command because it
allows the execution of exec level commands from within the configuration mode.
Each feature enabled in NX-OS has the capability to provide a tech-support file that
obtains the most useful information about that specific feature. The show tech-support
[feature] command obtains the feature configuration, show commands, data structures, and event
histories needed for offline analysis of a problem with a specific feature. Be aware of fea-
ture dependencies when performing data collection so that all relevant information about
the problem is gathered.
For example, a unicast routing problem with OSPF as the routing protocol requires you
to collect show tech-support ospf, but for a complete analysis the output of show
tech-support routing ip unicast is also needed to get Unicast Routing Information Base
(URIB) events. Feature dependency and what to collect is determined on a case-by-case
basis, depending on the problem under investigation. Many feature show tech-support
outputs do not include a full show running-config.
It is always a good idea to collect the full show running-config along with any
specific feature show techs that are needed. Example 1-29 shows the collection of
Notice that each show tech is collected as an individual file by redirecting the output to
bootflash: with the “>” operator. After collecting all the relevant feature show techs, they
are combined into an archive using the tar command, which makes the data easy to copy
from the switch for later analysis.
Note In addition to the show tech-support [feature], NX-OS also provides show
running-config [feature], which prints only the running-configuration of the given feature.
Accounting Log
NX-OS keeps a history of all configuration changes made to the device in the accounting
log. This is a useful piece of information to determine what has changed in a switch, and
by whom. Typically, problems are investigated based on the time when they started to
occur, and the accounting log can answer the question, What has changed? An example
of reviewing the accounting log is shown in Example 1-30. Because only the last few lines
are of interest, the start-seqnum option is used to jump to the end of the list.
The accounting log is stored persistently in logflash: so that it is available even if the
switch is reloaded.
Note The terminal log-all configuration command enables the logging of show com-
mands in the accounting log.
Feature Event-History
One very useful serviceability feature of NX-OS is that it keeps circular event-history
buffers for each configured feature. The event-history is similar in many ways to an
always-on debug for the feature that does not have any negative CPU impact on the
switch. The granularity of events stored in the event-history depends on the individual
feature, but many are equivalent to the output that is obtained with debugging. In many
cases, the event-history contains enough data to determine what sequence of events has
occurred for the feature without the need for additional debug logs, which makes them a
great troubleshooting resource.
Event-history buffers are circular, which means that the possibility exists for events to
be overwritten by the time a problem condition is recognized, leaving no event history
evidence to investigate. For some features, the event-history size is configurable as [small |
medium | large]. If a problem with a particular feature is occurring regularly, increase the
event-history size to improve the chance of catching the problem sequence in the buffer.
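The overwrite behavior of a circular buffer, and why a larger size helps, can be illustrated with Python's `collections.deque`. The buffer sizes and event names below are hypothetical; the actual capacities behind small, medium, and large are not stated here.

```python
from collections import deque

def record_events(events, buffer_size):
    """Store events in a fixed-size circular buffer; oldest entries drop first."""
    history = deque(maxlen=buffer_size)
    for event in events:
        history.append(event)  # silently evicts the oldest entry when full
    return list(history)

events = ["event-%d" % n for n in range(1, 8)]  # seven hypothetical events

small = record_events(events, 3)    # older events are already overwritten
large = record_events(events, 10)   # the full sequence is still available
print(small)
print(large)
```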
Most feature event-histories are viewed with the show {feature} internal event-history
command, as shown in Example 1-31.
Note Troubleshooting scenarios may require a periodic dump of the feature tech support
file. This is done with Embedded Event Manager (EEM), or another method of scripting
the data collection. bloggerd is another tool for such scenarios, but it is recommended for
use only under guidance from Cisco Technical Assistance Center (TAC).
■ Atomic: Perform the rollback only if no errors occur. This is the default option.
■ Verbose mode: Shows the detailed execution log during the rollback operation.
When performing a configuration rollback, the changes to be applied are viewed with the
show diff rollback-patch checkpoint command. This allows a checkpoint file to be com-
pared to another checkpoint file, or to the running configuration. During a rollback, if an
error is encountered, you need to decide to cancel or continue. If the rollback is canceled,
a list of changes that were applied is provided, and those changes need to be backed out
manually to return to the pre-rollback configuration. Example 1-33 provides an example
of a configuration checkpoint and rollback operation.
!!
no router ospfv3 1
NX-1# rollback running-config checkpoint known_good
Note: Applying config parallelly may fail Rollback verification
Collecting Running-Config
#Generating Rollback Patch
Executing Rollback Patch
Generating Running-config for verification
Generating Patch for verification
Verification is Successful.
In Example 1-33, an OSPFv3 process was deleted after creating the initial configuration
checkpoint. The difference between the checkpoint and the running configuration was
highlighted with the show diff rollback-patch checkpoint command. The configuration
change was then rolled back successfully and the OSPFv3 process was restored.
Consistency Checkers
Consistency checkers are an example of how NX-OS platforms are improving serviceabil-
ity in each release. Certain software bugs or race conditions may result in a mismatch of
state between the control plane, data plane, or forwarding ASICs. Finding these problems
is nontrivial and usually requires in-depth knowledge of the platform. Consistency check-
ers were introduced to deal with these situations, and they are the result of feedback
from TAC and customers. Example 1-34 shows the usage of the forwarding consistency
checker on a Nexus 3172 platform.
The types of consistency checkers that are available vary by platform. In addition, the
capabilities are continuing to evolve, and support is being added for new protocols and
platforms with each release. Consistency checkers are run on-demand while investigating
a problem that has been isolated to the specific device. They quickly validate that a prob-
lem is not caused by a state mismatch in the platform.
The scheduler is a useful feature for backing up configurations, copying files, or collecting
data at a specified time interval. The scheduler feature can be combined with the NX-OS
python or Embedded Event Manager (EEM) to provide a powerful method of automating
tasks. In Example 1-35, a scheduler job is defined that executes a python script every day
at midnight. The scheduler configuration requires that feature scheduler is enabled. The job
and the schedule for the job are then configured to determine when the job executes.
NX-OS provides access to the python interpreter from the exec mode CLI by using the
python command, as shown in Example 1-36.
NX-1# python
Python 2.7.5 (default, Jun 3 2016, 03:57:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "hello world!"
hello world!
>>> quit()
In addition to python, NX-OS also supports EEM, which is another way of automating tasks such
as data collection, or dynamically modifying the configuration if a defined event has occurred.
Note Chapter 15, “Programmability and Automation,” covers the programming and
automation capabilities of NX-OS in more detail.
Bash Shell
A recent addition to NX-OS is the bash shell feature. After feature bash-shell is enabled,
the user can enter the Linux bash shell of the NX-OS operating system. Example 1-37
shows how to access the bash shell from the exec prompt.
To gain access to the bash shell, your user account must be associated to the dev-ops or
network-admin role. Bash commands can also be run from the exec mode CLI of the
switch with the run bash [command] option. This command takes the command argu-
ment specified, runs it in the bash shell, and returns the output. Python scripts can run
from within the bash shell as well. It is recommended to use caution while utilizing the
bash shell to manage the device.
Summary
NX-OS is a powerful, feature-rich network operating system that is deployed in thou-
sands of networks around the world. For the past 10 years, it has remained under constant
development to meet the needs of those networks as they evolve over time. The modular
architecture allows for rapid feature development and source code sharing between the
different Nexus switching platforms. NX-OS and Nexus switches are designed for resil-
ience of both the hardware and software to allow for uninterrupted service even in the
event of a component failure.
The important management and operational features of NX-OS were introduced in this
chapter. That foundational knowledge enables you to apply the troubleshooting tech-
niques from the remaining chapters to your own network environment.
Chapter 2: NX-OS Troubleshooting Tools
the far end device. Understanding the packet flow between two directly connected
devices requires taking three perspectives:
■ Determining whether the originating router is transmitting the packet across the
network medium
This is where the concept of network sniffing comes into play. Network sniffing is the
technique of intercepting the traffic that passes over the transmission medium for
protocol and deep packet analysis. Not only does packet sniffing help with
troubleshooting packet forwarding issues, but security experts also heavily use it to
perform deep analysis of the network and find security holes.
Performing a network sniffer capture requires a PC with a packet capture tool, such as
Wireshark, attached to the switch. The relevant traffic is mirrored and
sent to the destination interface, where it is captured by the packet capture tool and is
available for analysis. Figure 2-1 shows a Nexus switch connected between two routers
and a capture PC that has Wireshark installed to capture the traffic flowing between
routers R1 and R2.
[Figure 2-1: SPAN topology. R1 (Source) connects to Eth4/1 and R2 (Destination) connects to Eth4/2 on the Nexus switch; host H1 with the sniffer connects to Eth4/3.]
On Cisco devices, the sniffing capability is called the Switched Port Analyzer (SPAN)
feature. The source port is called the monitored port and the destination port is called
the monitoring port. The SPAN feature on NX-OS is similar to that in Cisco IOS, but different
Nexus switches have different capabilities, based on the hardware support. The following
interfaces can be used as SPAN source interfaces:
■ Ethernet
■ Port-channel
■ Inband interfaces to the control plane CPU (on Nexus 7000, this feature is supported
only on default virtual device context [VDC])
■ FCoE ports
Note These features can vary on each Nexus platform, based on the hardware support.
The number of active sessions and the source and destination interfaces per session vary
on different Nexus platforms. Be sure to verify relevant Cisco documentation before
configuring a SPAN session on any Nexus switch.
To enable a port to forward the spanned traffic to the capture PC, the destination inter-
face is enabled for monitoring with the interface parameter command switchport moni-
tor. The destination ports are either an Ethernet or Port-Channel interface configured
in access or trunk mode. The SPAN session is configured using the command monitor
session session-number, under which the source interface is specified with the command
source interface interface-id [rx|tx|both]. The rx option is used to capture the ingress
(incoming) traffic, whereas the tx option is used to capture the egress (outgoing) traffic.
By default, the option is set to both, which captures both ingress and egress traffic on
the configured source interface. The destination interface is specified with the command
destination interface interface-id. By default, the monitor session is in shutdown state
and must be manually un-shut for the SPAN session to function.
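The configuration steps just described can be assembled programmatically. The following Python sketch builds the command lines for a local SPAN session using only the commands named above. The function name `span_session_config`, the interface names, and the ordering and indentation of the lines are assumptions; verify the result against the platform documentation before applying it.

```python
def span_session_config(session, sources, destination, direction="both"):
    """Build CLI lines for a local SPAN session (illustrative sketch)."""
    if direction not in ("rx", "tx", "both"):
        raise ValueError("direction must be rx, tx, or both")
    lines = [
        "interface %s" % destination,
        "  switchport monitor",           # enable the destination for monitoring
        "monitor session %d" % session,
    ]
    for source in sources:
        lines.append("  source interface %s %s" % (source, direction))
    lines.append("  destination interface %s" % destination)
    lines.append("  no shut")             # sessions are shut down by default
    return lines

# Hypothetical interfaces matching the topology in Figure 2-1.
config = span_session_config(1, ["Ethernet4/1", "Ethernet4/2"], "Ethernet4/3", "rx")
print("\n".join(config))
```

Generating the lines from parameters makes it harder to forget the final no shut, which is a common reason a newly configured session stays down.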
Note The SPAN features can vary across different Nexus platforms. For instance,
features such as SPAN-on-Drop and SPAN-on-Latency are supported on Nexus 5000 and
Nexus 6000 series but not on Nexus 7000 series. Refer to the platform documentation for
more about the feature support.
Example 2-1 illustrates a SPAN session configuration on a Nexus switch. Notice that, in this
example, the source interface is a range of interfaces, along with the direction of the capture.
Note On FCoE ports, the SPAN destination interface is configured with the command
switchport mode SD, which is similar to the command switchport monitor.
Example 2-2 displays the status of the monitor session. In this example, the rx, tx, and
both fields are populated for interface Eth4/1 and Eth4/2, but the interface Eth5/1 is
listed only for the rx direction. There is also an option to filter VLANs under the monitor
session using the filter vlan vlan-id command.
The default behavior of a SPAN session is to mirror all traffic to the destination
port, but NX-OS also provides the capability to perform a filter on the traffic
to be mirrored to the destination port. To filter the relevant traffic, an access
control list (ACL) is created, to be referenced in the SPAN session configuration
by using the filter access-group acl command. Example 2-3 illustrates the filtering
configuration on the SPAN session and verification using the show monitor session
command.
Note ACL filtering varies on different Nexus platforms. Refer to the CCO documentation
for ACL filtering support on respective Nexus platforms.
■ ERSPAN ID
■ GRE-encapsulated traffic
The ERSPAN ID is used to distinguish among multiple source devices sending spanned
traffic to a single centralized server.
Figure 2-2 shows a network topology with ERSPAN setup. Two Nexus switches are
connected by a routed network. The N6k-1 switch is configured as the ERSPAN-source
with a local source SPAN port, and the destination port is located in an IP network on
the N7k-1 switch. The GRE-encapsulated packets are transmitted across the IP network
toward the destination switch, where they are decapsulated and sent to the traffic
analyzer.
[Figure 2-2: ERSPAN topology. Labels in the figure: N6k-1 (Eth1/10, data traffic), IP cloud (L3, ERSPAN traffic), N7k-1 (192.168.1.10, Eth1/3, mirrored traffic), network analyzer.]
The source and destination sessions can be configured on different switches separately
for the source traffic in ingress, egress, or both directions. The ERSPAN is configured
to span traffic on Ethernet ports, VLANs, VSANs, and FEX ports. The destination port
remains in monitoring state and does not participate in the spanning tree or any Layer 3
protocols. Example 2-4 illustrates the configuration of both the source ports and
destination ports on two different Nexus switches. Note that the ERSPAN-ID should be
the same on both switches.
For the ERSPAN source session to come up, the destination IP should be present in the
routing table. The ERSPAN session status is verified using the command show monitor
session session-id. Example 2-5 demonstrates the verification of both the source and
destination ERSPAN sessions.
Note Refer to the Cisco documentation before configuring ERSPAN on any Nexus
switch, to verify any platform limitations.
NX-OS provides the capability to span the traffic based on the specified latency thresh-
olds or based on drops noticed in the path. These capabilities are available for both SPAN
and ERSPAN.
SPAN-on-Latency
The SPAN-on-Latency (SOL) feature works a bit differently than the regular SPAN session.
In SOL, the source port is the egress port on which latency is monitored. The destination
port is still the port where the network analyzer is connected on the switch. The latency
threshold is defined on the interface that is being monitored using the command packet
latency threshold threshold-value. When the packet latency reaches or exceeds the specified
threshold, the SPAN session is triggered and captures the packets. If the threshold value is
not a multiple of 8, it is truncated to the nearest multiple of 8.
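The truncation to a multiple of 8 can be shown with a short calculation. This sketch assumes that the threshold is expressed in nanoseconds and that "truncated" means rounded down; the function name `effective_latency_threshold` is hypothetical.

```python
def effective_latency_threshold(requested_ns):
    """Truncate a requested threshold (in ns) down to a multiple of 8."""
    if requested_ns < 0:
        raise ValueError("threshold must not be negative")
    # Integer division discards the remainder, which rounds down.
    return (requested_ns // 8) * 8

for requested in (1000, 1003, 15):
    print(requested, "->", effective_latency_threshold(requested))
```

A requested value of 1000 is already a multiple of 8 and is unchanged, whereas 1003 becomes 1000 and 15 becomes 8.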
Example 2-6 illustrates the SOL configuration, in which packets are sniffed only at
the egressing interface Eth1/1 and Eth1/2 for flows that have latency more than 1μs
(microsecond). The packet latency threshold configuration is per port for 40G interfaces
but if there are 4x10G interfaces, they share the same configuration. For this reason,
Example 2-6 displays the log message that interfaces Eth1/1 to Eth1/4 are configured
with a latency threshold of 1000 ns.
Interfaces Eth1/1, Eth1/2, Eth1/3 and Eth1/4 are configured with latency
threshold 1000
SPAN-on-Drop
SPAN-on-Drop is a new feature that enables the spanning of packets that were dropped
because of unavailable buffer or queue space upon ingress. This feature provides the
capability to span packets that would otherwise be dropped because the copy of the
spanned traffic is transferred to a specific destination port. A SPAN-on-Drop session is
configured by specifying the type as span-on-drop in the monitor session configura-
tion. Example 2-7 demonstrates the SPAN-on-Drop monitor session configuration. The
source interface Eth1/1 specified in the configuration is the interface where congestion is
present.
Note The SPAN-on-Drop feature captures only drops in unicast flows that result from
buffer congestion.
Unlike other SPAN features, SPAN-on-Drop does not have any ternary content
addressable memory (TCAM) programming involved. Programming for the source
side is in the buffer or queue space. Additionally, only one instance of SPAN-on-Drop
can be enabled on the switch; enabling a second instance brings down the session
with the syslog message “No hardware resource error.” If the SPAN-on-Drop session
is up but no packets are spanned, it is vital to verify that the drop is happening in
the unicast flow. This is verified by using the command show platform software
qd info interface interface-id and checking that the counter IG_RX_SPAN_ON_
DROP is incrementing and is nonzero. Example 2-8 shows the output for the
counter IG_RX_SPAN_ON_DROP, confirming that no drops are occurring in the
unicast flows.
N6k-1# show plat software qd info interface ethernet 1/1 | begin BM-INGRESS
BM-INGRESS BM-EGRESS
-------------------------------------------------------------------------------
IG_RX 364763|TX 390032
SP_RX 1491|TX_MCAST 0
LB_RX 15689|CRC_BAD 0
IG_RX_SPAN_ON_DROP 0|CRC_STOMP 0
IG_RX_MCAST 14657|DQ_ABORT_MM_XOFF_DROP 0
LB_RX_SPAN 15689|MTU_VIO 0
IG_FRAME_DROP 0|
SP_FRAME_DROP 0|
LB_FRAME_DROP 0|
IG_FRAME_QS_EARLY_DROP 0|
ERR_IG_MTU_VIO 0|
ERR_SP_MTU_VIO 0|
ERR_LB_MTU_VIO 0|
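When this output is collected repeatedly, checking the counter by eye becomes tedious. The following Python sketch parses rows of the form shown above into a dictionary so that IG_RX_SPAN_ON_DROP can be compared between runs. The function name `parse_qd_counters` is hypothetical, and the parser assumes the two-column "NAME value|NAME value" layout seen in this example; other platforms or releases may format the output differently.

```python
def parse_qd_counters(output):
    """Parse 'NAME value|NAME value' rows into a dict of integer counters."""
    counters = {}
    for row in output.splitlines():
        for cell in row.split("|"):
            parts = cell.split()
            # Keep only 'NAME <integer>' cells; skip headers and separators.
            if len(parts) == 2 and parts[1].isdigit():
                counters[parts[0]] = int(parts[1])
    return counters

# Abridged from the output of Example 2-8.
sample = """BM-INGRESS BM-EGRESS
-------------------------------------------------------------------------------
IG_RX 364763|TX 390032
IG_RX_SPAN_ON_DROP 0|CRC_STOMP 0
IG_FRAME_DROP 0|
"""

counters = parse_qd_counters(sample)
print("IG_RX_SPAN_ON_DROP =", counters["IG_RX_SPAN_ON_DROP"])
```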
Note At the time of writing, SOL and SPAN-on-Drop are supported only on Nexus 5600
and Nexus 6000 series switches.
Nexus Platform Tools
■ Ethanalyzer
■ Packet Tracer
These tools are capable of performing packet capture for the traffic destined for the CPU
or transit hardware-switched traffic. They are helpful in understanding the stages the packet
goes through in a switch, which helps narrow down the issue very quickly. The main benefit
of these features is that they do not require time to set up an external sniffing device.
Note The Embedded Logic Analyzer Module (ELAM) capture is supported on most Nexus
switches, but because it requires a deeper understanding of the ASICs and the configuration
differs among Nexus platforms, it is outside the scope of this book. Additionally, ELAM is
best performed under the supervision of a Cisco Technical Assistance Center (TAC)
engineer. ELAM is not supported on N5000 or N5500 switches.
Ethanalyzer
Ethanalyzer is an NX-OS implementation of TShark, a terminal version of Wireshark.
TShark uses the libpcap library, which gives Ethanalyzer the capability to capture and
decode packets. It can capture inband and management traffic on all Nexus platforms.
Ethanalyzer provides the users with the following capabilities:
■ Avoid the requirement of using an external sniffing device to capture the traffic
The next step is to set the filters. With a working knowledge of Wireshark, configuring
filters for Ethanalyzer is fairly simple. Two kinds of filters can be set up for configur-
ing Ethanalyzer: capture filter and display filter. As the name suggests, when a capture
filter is set, only frames that match the filter are captured. The display filter is used to
display the packets that match the filter from the captured set of packets. That means
Ethanalyzer still captures frames that do not match the display filter, but it does not
display them in the output. By default, Ethanalyzer supports capturing up to 10 frames and
then stops automatically. This value is changed by setting the limit-captured-frames
option, where 0 means no limit.
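The difference between the two filter types, and its interaction with the frame limit, can be modeled in a few lines of Python. This is a simplified sketch and not how Ethanalyzer is implemented: the function name `run_capture`, the dictionary-based frames, and the assumption that the limit counts captured frames (before the display filter is applied) are all illustrative.

```python
def run_capture(frames, capture_filter=None, display_filter=None, limit=10):
    """Model capture-filter versus display-filter behavior (illustrative only)."""
    captured = []
    for frame in frames:
        if capture_filter and not capture_filter(frame):
            continue                      # frame is never stored
        captured.append(frame)
        if limit and len(captured) >= limit:
            break                         # a limit of 0 means no limit
    # The display filter only hides frames; they were still captured.
    displayed = [f for f in captured
                 if display_filter is None or display_filter(f)]
    return captured, displayed

def is_bgp(frame):
    return frame["proto"] == "bgp"

# Twenty hypothetical frames alternating between OSPF and BGP.
frames = [{"proto": "ospf"}, {"proto": "bgp"}] * 10

cap_d, shown_d = run_capture(frames, display_filter=is_bgp)
cap_c, shown_c = run_capture(frames, capture_filter=is_bgp)
cap_all, shown_all = run_capture(frames, limit=0)
print(len(cap_d), len(shown_d), len(cap_c), len(shown_c), len(cap_all))
```

In this model, a display filter with the default limit shows fewer than 10 matching frames because nonmatching frames consume part of the limit, whereas a capture filter spends the whole limit on frames of interest.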
Note All in-band Ethernet ports that send or receive data to the switch supervisor are
captured with the inbound-hi or inbound-low option. However, display or capture filtering
can be applied.
To start a packet capture with Ethanalyzer, use the command ethanalyzer local interface
[inbound-hi | inbound-low | mgmt] options, with the following options:
While using Ethanalyzer, specifying the filters is easier for someone who is familiar with
Wireshark filters. The capture filter and the display filter use different syntax.
Table 2-1 lists some of the common filters and their syntax with the capture-filter and
display-filter options.
Filter on manufacturer (display-filter):
  eth.src[0:2]==vendor-mac-addr

Packet length (capture-filter):
  less length
  greater length

Layer 4
  capture-filter:
    udp port 53
    udp dst port 53
    udp src port 53
    tcp port 179
    tcp portrange 2000-2100
  display-filter:
    tcp.port==53
    udp.port==53

FabricPath
  capture-filter:
    proto 0x8903
  display-filter (Dest HMAC/MC destination):
    cfp.d_hmac==mac
    cfp.d_hmac_mc==mac

ICMP-Types (capture-filter):
  icmp-echoreply
  icmp-unreach
  icmp-sourcequench
  icmp-redirect
  icmp-echo
  icmp-routeradvert
  icmp-routersolicit
  icmp-timxceed
  icmp-paramprob
  icmp-tstamp
  icmp-tstampreply
  icmp-ireq
  icmp-ireqreply
  icmp-maskreq
  icmp-maskreply
Example 2-9 illustrates the use of Ethanalyzer to capture all packets hitting the
inbound-low as well as the inbound-hi queue on a Nexus 6000. From the following outputs,
notice that the TCP SYN/SYN ACK packets, even for a BGP peering, are part of the
inbound-low queue, whereas the regular BGP updates and keepalives (that is, the TCP
packets exchanged after the BGP peering is established) and the acknowledgments are part
of the inbound-hi queue.
Capturing on inband
2017-05-21 21:34:42.821141 10.162.223.34 -> 10.162.223.33 BGP KEEPALIVE Message
2017-05-21 21:34:42.932217 10.162.223.33 -> 10.162.223.34 TCP bgp > 14779 [ACK]
Seq=1 Ack=20 Win=17520 Len=0
2017-05-21 21:34:43.613048 10.162.223.33 -> 10.162.223.34 BGP KEEPALIVE Message
2017-05-21 21:34:43.814804 10.162.223.34 -> 10.162.223.33 TCP 14779 > bgp [ACK]
Seq=20 Ack=20 Win=15339 Len=0
2017-05-21 21:34:46.005039 10.1.12.2 -> 224.0.0.5 OSPF Hello Packet
2017-05-21 21:34:46.919884 10.162.223.34 -> 10.162.223.33 BGP KEEPALIVE Message
2017-05-21 21:34:47.032215 10.162.223.33 -> 10.162.223.34 TCP bgp > 14779 [ACK]
Seq=20 Ack=39 Win=17520 Len=0
! Output omitted for brevity
As stated earlier, optimal practice is to write the captured frames in a file and then read
it after the frames are captured. The saved file in a local bootflash is read using the
command ethanalyzer local read location [detail].
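The write-then-read workflow described above can be sketched as a small Python helper that assembles the two command strings. This is only an illustration under stated assumptions: the function names build_capture_cmd and build_read_cmd and the file name bgp.pcap are hypothetical, and the keywords used (capture-filter, display-filter, limit-captured-frames, write, detail) should be confirmed against the CLI help on the platform in use.

```python
# Sketch: assemble Ethanalyzer command strings (helper names are hypothetical).

def build_capture_cmd(interface, capture_filter=None, display_filter=None,
                      limit=None, write_to=None):
    """Return an 'ethanalyzer local interface ...' command line as a string."""
    parts = ["ethanalyzer", "local", "interface", interface]
    if capture_filter:
        # Only frames matching this filter are captured.
        parts += ["capture-filter", f'"{capture_filter}"']
    if display_filter:
        # Frames that do not match are still captured but are not displayed.
        parts += ["display-filter", f'"{display_filter}"']
    if limit is not None:
        if limit < 0:
            raise ValueError("limit must be 0 (no limit) or a positive count")
        parts += ["limit-captured-frames", str(limit)]
    if write_to:
        parts += ["write", write_to]
    return " ".join(parts)


def build_read_cmd(location, detail=False):
    """Return the command that reads a saved capture file."""
    cmd = f"ethanalyzer local read {location}"
    return cmd + " detail" if detail else cmd


capture_cmd = build_capture_cmd("inband", capture_filter="tcp port 179",
                                limit=0, write_to="bootflash:bgp.pcap")
read_cmd = build_read_cmd("bootflash:bgp.pcap", detail=True)
print(capture_cmd)
print(read_cmd)
```

Keeping the filter, the frame limit, and the destination file as separate arguments makes it harder to forget the limit, which otherwise stops the capture after the default 10 frames.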
Nexus 7000 offers no option for inbound-hi or inbound-low. The CLI supports cap-
tures on the mgmt interface or the inband interface. The inband interface captures
both high- and low-priority packets. Example 2-10 illustrates how to write and read the
saved packet capture data. In this example, Ethanalyzer is run with a capture-filter on
STP packets.
Logical-Link Control
DSAP: Spanning Tree BPDU (0x42)
IG Bit: Individual
SSAP: Spanning Tree BPDU (0x42)
CR Bit: Command
Control field: U, func=UI (0x03)
000. 00.. = Command: Unnumbered Information (0x00)
.... ..11 = Frame type: Unnumbered frame (0x03)
Spanning Tree Protocol
Protocol Identifier: Spanning Tree Protocol (0x0000)
Protocol Version Identifier: Rapid Spanning Tree (2)
BPDU Type: Rapid/Multiple Spanning Tree (0x02)
BPDU flags: 0x3c (Forwarding, Learning, Port Role: Designated)
0... .... = Topology Change Acknowledgment: No
.0.. .... = Agreement: No
..1. .... = Forwarding: Yes
...1 .... = Learning: Yes
.... 11.. = Port Role: Designated (3)
.... ..0. = Proposal: No
.... ...0 = Topology Change: No
Root Identifier: 4096 / 1 / 50:87:89:4b:bb:42
Root Bridge Priority: 4096
Root Bridge System ID Extension: 1
Root Bridge System ID: 50:87:89:4b:bb:42 (50:87:89:4b:bb:42)
Root Path Cost: 0
Bridge Identifier: 4096 / 1 / 50:87:89:4b:bb:42
Bridge Priority: 4096
Bridge System ID Extension: 1
Bridge System ID: 50:87:89:4b:bb:42 (50:87:89:4b:bb:42)
Port identifier: 0x9000
Message Age: 0
Max Age: 20
Hello Time: 2
Forward Delay: 15
Version 1 Length: 0
! Output omitted for brevity
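The BPDU flags line in the detailed output packs seven fields into a single byte. As a reading aid, the following Python sketch decodes that byte using the bit layout printed in the output above. The function name decode_rstp_flags is hypothetical, and the role names for values other than 3 are taken from IEEE 802.1D-2004 rather than from this capture.

```python
# Sketch: decode the RSTP BPDU flags byte (bit layout as in the output above).

PORT_ROLES = {0: "Unknown", 1: "Alternate/Backup", 2: "Root", 3: "Designated"}


def decode_rstp_flags(flags):
    """Split an RSTP BPDU flags byte into its individual fields."""
    return {
        "topology_change_ack": bool(flags & 0x80),
        "agreement": bool(flags & 0x40),
        "forwarding": bool(flags & 0x20),
        "learning": bool(flags & 0x10),
        "port_role": PORT_ROLES[(flags >> 2) & 0x03],
        "proposal": bool(flags & 0x02),
        "topology_change": bool(flags & 0x01),
    }


decoded = decode_rstp_flags(0x3C)
for field, value in decoded.items():
    print(f"{field}: {value}")
```

For the value 0x3c seen in the capture, the sketch reports forwarding and learning as set and the port role as Designated, which agrees with the decoded lines in the example.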
The saved .pcap file can also be transferred to a remote server via File Transfer Protocol
(FTP), Trivial File Transfer Protocol (TFTP), Secure Copy Protocol (SCP), Secure FTP
(SFTP), and Universal Serial Bus (USB), after which it can be easily analyzed using a
packet analyzer tool such as Wireshark.
Note If multiple VDCs exist on the Nexus 7000, the Ethanalyzer runs only on the admin
or default VDC. In addition, starting with Release 7.2 on Nexus 7000, you can use the
option to filter on a per-VDC basis.
Packet Tracer
During troubleshooting, it can be difficult to understand what action the system is
taking on a particular packet or flow. For such instances, the packet tracer feature is
used. The packet tracer utility was introduced on the Nexus 9000 switch in NX-OS Version
7.0(3)I2(2a). It is used when intermittent or complete packet loss is observed.
Note At the time of writing, the packet tracer utility is supported only on the line cards
or fabric modules that come with Broadcom Trident II ASICs. More details about the Cisco
Nexus 9000 ASICs can be found at http://www.cisco.com.
To set up the packet tracer, use the command test packet-tracer [src-ip src-ip | dst-ip
dst-ip ] [protocol protocol-num | l4-src-port src-port | l4-dst-port dst-port]. Then start
the packet tracer, using the command test packet-tracer start. To view the statistics of
the specified traffic and the action on it, use the command test packet-tracer show.
Finally, stop the packet tracer using the command test packet-tracer stop. Example 2-11
illustrates the use of the packet tracer to analyze the ICMP statistics between two hosts.
Packet-tracer stats
---------------------
Module 1:
Filter 1 installed: src-ip 192.168.2.2 dst-ip 192.168.1.1 protocol 1
ASIC instance 0:
Entry 0: id = 9473, count = 120, active, fp,
Entry 1: id = 9474, count = 0, active, hg,
Filter 2 uninstalled:
Filter 3 uninstalled:
Filter 4 uninstalled:
Filter 5 uninstalled:
! Second iteration of the Output
N9000-1# test packet-tracer show
Packet-tracer stats
---------------------
Module 1:
Filter 1 installed: src-ip 192.168.2.2 dst-ip 192.168.1.1 protocol 1
ASIC instance 0:
Entry 0: id = 9473, count = 181, active, fp,
Entry 1: id = 9474, count = 0, active, hg,
Filter 2 uninstalled:
Filter 3 uninstalled:
Filter 4 uninstalled:
Filter 5 uninstalled:
! Stopping the Packet-Tracer
N9000-1# test packet-tracer stop
Even if the incoming traffic is dropped because of an ACL, the packet tracer helps
determine whether the packet is reaching the router incoming interface. To remove all
the filters from the packet tracer, use the command test packet-tracer remove-all.
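Because the packet tracer counters are cumulative, the useful signal is how much a counter grows between two runs of test packet-tracer show. The following Python sketch compares two outputs; the helper name entry_counts is hypothetical, and the sample text is copied from the two iterations in Example 2-11.

```python
import re

# Sketch: compare two 'test packet-tracer show' outputs (helper name is hypothetical).

ENTRY_RE = re.compile(r"Entry\s+(\d+):\s+id\s*=\s*(\d+),\s+count\s*=\s*(\d+)")


def entry_counts(output):
    """Map each entry id to its hit count for the text of one show output."""
    return {int(m.group(2)): int(m.group(3)) for m in ENTRY_RE.finditer(output)}


first = """
Filter 1 installed: src-ip 192.168.2.2 dst-ip 192.168.1.1 protocol 1
ASIC instance 0:
Entry 0: id = 9473, count = 120, active, fp,
Entry 1: id = 9474, count = 0, active, hg,
"""
second = """
Filter 1 installed: src-ip 192.168.2.2 dst-ip 192.168.1.1 protocol 1
ASIC instance 0:
Entry 0: id = 9473, count = 181, active, fp,
Entry 1: id = 9474, count = 0, active, hg,
"""

before, after = entry_counts(first), entry_counts(second)
deltas = {entry_id: after[entry_id] - before[entry_id] for entry_id in before}
print(deltas)
```

An entry whose count keeps growing between iterations shows that packets matching the filter are still arriving; an entry that stays at zero has not matched any traffic.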
NetFlow
NetFlow is a Cisco feature that provides the capability to collect statistics and infor-
mation on IP traffic as it enters or exits an interface. NetFlow provides operators with
network and security monitoring, network planning, traffic analysis, and IP accounting
capabilities. Network traffic is often asymmetrical, even on small networks, whereas
probes typically require engineered symmetry. NetFlow does not require engineering the
network around the instrumentation; it follows the traffic through the network over its
natural path. In addition to traffic rate, NetFlow provides QoS markings, TCP flags, and
so on for specific applications, services, and traffic flows at each point in the network.
NetFlow assists with validating traffic engineering or policy enforcement at any point in
the topology.
Cisco NX-OS supports both traditional NetFlow (Version 5) and Flexible NetFlow
(Version 9) export formats, but using Flexible NetFlow is recommended on Nexus
platforms. With traditional NetFlow, all the keys and exported fields are fixed, and
only IPv4 flows are supported. By default, a flow is defined by seven unique keys:
■ Source IP address
■ Destination IP address
■ Source port
■ Destination port
■ Layer 3 protocol type
■ TOS byte (DSCP markings)
■ Input logical interface (ifindex)
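To make the idea of a flow key concrete, the following Python sketch groups packets into flows using the seven keys listed above. It is a conceptual model only, not how the switch implements its flow table; the packet tuples and field names are hypothetical.

```python
from collections import Counter, namedtuple

# Sketch: group packets into flows by the seven traditional NetFlow keys.
FlowKey = namedtuple(
    "FlowKey",
    "src_ip dst_ip src_port dst_port protocol tos input_ifindex",
)

# Hypothetical packets: (seven key fields..., byte length)
packets = [
    ("10.1.1.1", "10.2.2.2", 14779, 179, 6, 0xC0, 3, 64),
    ("10.1.1.1", "10.2.2.2", 14779, 179, 6, 0xC0, 3, 83),
    ("10.1.1.1", "10.2.2.2", 14779, 179, 6, 0x00, 3, 64),  # different ToS byte
]

packet_count = Counter()
byte_count = Counter()
for *key_fields, length in packets:
    key = FlowKey(*key_fields)
    packet_count[key] += 1
    byte_count[key] += length

for key, count in packet_count.items():
    print(key, "packets:", count, "bytes:", byte_count[key])
```

The third packet differs from the first two only in its ToS byte, yet it lands in a separate flow, because a change in any one of the seven keys defines a new flow.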
The user can select a few other fields, but NetFlow Version 5 has limitations on the
details it can provide. Flexible NetFlow (FNF) is standardized on NetFlow Version 9
and gives users more flexibility in defining flows and the exported fields for each
flow type. Flexible NetFlow provides support for IPv6 as well as L2 NetFlow records.
Version 9 is template based, so users can specify what data has to be exported.
■ Flexibility to choose the definition of a flow (the key and nonkey fields)
Network operators and architects often wonder where to attach the NetFlow monitor. For
such challenges, answering the following questions can assist:
■ What type of information are users looking for? MAC fields or IPv4/v6 fields?
■ Is the box switching packets within VLANs or routing them across VLANs using
Switched Virtual Interfaces (SVI)?
NetFlow Configuration
These questions help users make the right choice of applying a Layer 3 or Layer 2
NetFlow configuration. Configuring NetFlow on a Nexus switch consists of the following
steps:
Step 1. Enable the NetFlow feature.
Step 2. Define a flow record by specifying key and nonkey fields of interest.
Step 3. Define one or many flow exporters by specifying export format, protocol,
destination, and other parameters.
Step 4. Define a flow monitor based on the previous flow record and flow exporter(s).
Step 5. Apply the flow monitor to an interface with a sampling method specified.
Note NetFlow consumes hardware resources such as TCAM and CPU. Thus, understand-
ing the resource utilization on the device is recommended before enabling NetFlow.
A flow record also specifies the fields of interest that have to be collected for a flow. The
following match keys are supported for identifying flows in NetFlow:
■ IPv6 options
■ ToS field
■ L4 protocol
■ L4 source/destination ports
■ Ethertype
■ VLAN
A user has the flexibility to select the collect parameters that can be used in either
Version 5 or Version 9, except for IPv6 parameters, which can be used only with
Version 9. The following parameters are collected using NetFlow:
■ TCP flags
Example 2-12 shows the configuration for a flow record for both Layer 3 and Layer 2
traffic. In this flow record, multiple match entries are created, along with the
parameters to be collected.
that is configurable by the user. The default flow timeout value is 30 minutes. Under the
flow export, the following fields are defined:
■ Source interface
■ Version
The NetFlow configuration is viewed using the command show run netflow. To validate
the NetFlow configuration, use the command show flow [record record-name | exporter
exporter-name | monitor monitor-name].
To view the statistics of the flows entering and leaving interface E1/4 as configured
in the previous example, use the command show hardware flow [ip | ipv6] [detail].
Example 2-15 displays the statistics of the ingress and egress traffic flowing across the
interfaces Eth3/31-32. This example shows both ingress (I) and egress (O) traffic. NetFlow
displays the statistics for OSPF and ICMP traffic, along with the protocol number
and packet count.
The statistics in Example 2-15 are collected on the N7k platform, which supports
hardware-based flows. However, not all Nexus platforms support hardware-based flow
matching. Nexus switches such as the Nexus 6000 do not, so flow matching must be
performed in software. Because software-based matching is resource intensive and can
impact performance, such platforms support only Sampled NetFlow (see the following
section).
Note Nexus 5600 and Nexus 6000 support only ingress NetFlow applied to the inter-
face; Nexus 7000 supports both ingress and egress NetFlow statistics collection.
NetFlow Sampling
NetFlow supports sampling on the data points to reduce the amount of data collected.
This implementation of NetFlow is called Sampled NetFlow (SNF). SNF supports M:N
packet sampling, where only M out of every N packets are sampled.
A sampler is configured using the command sampler name. Under the sampler
configuration, the sampler mode is defined using the subcommand mode sample-number
out-of packet-number, where sample-number ranges from 1 to 64 and packet-number
ranges from 1 to 65536 packets. After the sampler is defined, it is used in conjunction
with the flow monitor configuration under the interface, as in Example 2-16.
sampler NF-SAMPLER1
mode 1 out-of 1000
!
interface Eth3/31-32
ip flow monitor FL_MON input sampler NF-SAMPLER1
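The effect of the mode 1 out-of 1000 setting can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the function names are hypothetical, and picking the first M packets of every window of N is an assumption made for illustration, because the text does not say how the hardware chooses which packets to sample.

```python
# Sketch: M-out-of-N packet sampling (function names are hypothetical).

def validate_sampler(samples, out_of):
    """Check the ranges quoted in the text for 'mode M out-of N'."""
    if not 1 <= samples <= 64:
        raise ValueError("sample-number must be between 1 and 64")
    if not 1 <= out_of <= 65536:
        raise ValueError("packet-number must be between 1 and 65536")
    # Assumption: sampling more packets than the window holds makes no sense.
    if samples > out_of:
        raise ValueError("sample-number cannot exceed packet-number")


def sample(packets, samples, out_of):
    """Keep the first `samples` packets of every window of `out_of` packets."""
    validate_sampler(samples, out_of)
    return [p for i, p in enumerate(packets) if i % out_of < samples]


def scale_up(sampled_count, samples, out_of):
    """Estimate the real packet count from a sampled count."""
    return sampled_count * out_of // samples


kept = sample(range(5000), samples=1, out_of=1000)
estimate = scale_up(len(kept), 1, 1000)
print(len(kept), estimate)
```

The same ratio works in reverse on the collector side: counters derived from sampled flows have to be multiplied by N/M to approximate the actual traffic volume.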
Users can also define the active and inactive timer for the flows using the command flow
timeout [active | inactive] time-in-seconds.
Starting with NX-OS Version 7.3(0)D1(1), NetFlow is also supported on the control plane
policing (CoPP) interface. NetFlow on the CoPP interface enables users to monitor and
collect statistics of different packets that are destined for the supervisor module on the
switch. NX-OS allows an IPv4 flow monitor and a sampler to be attached to the con-
trol plane interface in the output direction. Example 2-17 demonstrates the NetFlow
configuration under CoPP interface and the relevant NetFlow statistics on the Nexus
7000 platform.
Control-plane
ip flow monitor FL_MON output sampler NF-SAMPLER1
Note If any problems occur with NetFlow, collect the output of the command show
tech-support netflow while the device is in the problematic state.
sFlow
Defined in RFC 3176, sFlow is a technology for monitoring traffic using sampling
mechanisms that are implemented as part of an sFlow agent in data networks that contain
switches and routers. The sFlow agent is a new software feature for the Nexus 9000 and
Nexus 3000 platforms. The sFlow agent on these platforms collects the sampled packet
from both ingress and egress ports and forwards it to the central collector, known as the
sFlow Analyzer. The sFlow agent can periodically sample or poll the counters associated
with a data source of the sampled packets.
When sFlow is enabled on an interface, it is enabled for both ingress and egress direc-
tions. sFlow can be configured only for Ethernet and port-channel interfaces. sFlow is
enabled by configuring the command feature sflow. Various parameters can be defined
as part of the configuration (see Table 2-2).
Example 2-18 illustrates the configuration of sFlow on a Nexus 3000 switch. The running
configuration of sFlow is viewed using the command show run sflow.
feature sflow
sflow sampling-rate 1000
sflow max-sampled-size 200
sflow counter-poll-interval 100
sflow max-datagram-size 2000
sflow collector-ip 172.16.1.100 vrf management
sflow collector-port 2020
sflow agent-ip 170.16.1.130
sflow data-source interface ethernet 1/1-2
To verify the configuration, use the command show sflow. This command output
displays all the information that is configured for the sFlow (see Example 2-19).
When sFlow is configured, the sFlow agent starts collecting the statistics. Although the
actual flow is viewed on the sFlow collector tools, you can still see the sFlow statistics on
the switch using the command show sflow statistics and also view both internal informa-
tion about the sFlow and statistics using the command show system internal sflow info.
Example 2-20 displays the statistics for the sFlow. Notice that although the total packet
count is high, the number of sampled packets is very low. This is because the configura-
tion defines sampling taken per 1000 packets. The system internal command for sFlow
also displays the resource utilization and its present state.
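The relationship between the configured sampling rate and the low sample count can be checked with a few lines of Python. The sketch below parses the configuration lines from Example 2-18 and estimates how many samples to expect; the function name parse_sflow and the total packet count are hypothetical, and because sFlow sampling is statistical, the result is only an approximation.

```python
# Sketch: read the sFlow settings from Example 2-18 and estimate sample counts.

config = """\
feature sflow
sflow sampling-rate 1000
sflow max-sampled-size 200
sflow counter-poll-interval 100
sflow max-datagram-size 2000
sflow collector-ip 172.16.1.100 vrf management
sflow collector-port 2020
sflow agent-ip 170.16.1.130
sflow data-source interface ethernet 1/1-2
"""


def parse_sflow(text):
    """Return {option: value-string} for every 'sflow <option> <value...>' line."""
    settings = {}
    for line in text.splitlines():
        words = line.split()
        if len(words) >= 3 and words[0] == "sflow":
            settings[words[1]] = " ".join(words[2:])
    return settings


settings = parse_sflow(config)
rate = int(settings["sampling-rate"])

total_packets = 2_500_000  # hypothetical interface packet count
expected_samples = total_packets // rate
print(rate, expected_samples)
```

If the sampled count reported by the switch is far below this kind of estimate, the sampling configuration and the data-source interfaces are the first things to recheck.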
Note If any problems occur with sFlow, collect the output of the command show
tech-support sflow while the device is in the problematic state.
Network Time Protocol
When NTP is configured, the client automatically synchronizes its clock with the
server. To check the status of the NTP server or peer, use the command show ntp
peer-status. The * beside the peer address indicates that NTP has synchronized with
the server. Example 2-22 displays the output from both the server and the client. On the
NTP server, notice that the peer address is 127.127.1.0, which means that the device itself
is the NTP server. On the client, the * is beside 172.16.1.10, which is configured as the
preferred NTP server in the configuration. Note that all the devices in this example are
part of the same management subnet.
After the NTP has been synchronized, the time is verified using the show clock
command.
NX-OS also has a built-in proprietary feature known as Cisco Fabric Services (CFS) that
can be used to distribute data and configuration changes to all Nexus devices. CFS
distributes the local NTP configuration across all the Nexus devices in the network. It
applies a network-wide lock for NTP when the NTP configuration is started. After the
configuration changes are made, users can discard or commit the changes, and the
committed configuration replicates across all Nexus devices. CFS for NTP is enabled using
the command ntp distribute. The configuration is committed to all the Nexus devices by
using the ntp commit command and is aborted using the ntp abort command. When either
command is executed, CFS releases the lock on NTP across network devices. To check that
the fabric distribution is enabled for NTP, use the command show ntp status.
NX-OS also provides a CLI to verify the statistics of the NTP packets. Users can
view input-output statistics for NTP packets, local counters maintained by NTP,
memory-related NTP counters (which are useful in case of a memory leak in the
NTP process), and per-peer NTP statistics. If the NTP packets are getting dropped for
some reason, those statistics can be viewed from the CLI itself. To view these statistics,
use the command show ntp statistics [io | local | memory | peer ipaddr ip-address].
Example 2-23 displays the IO and local statistics for NTP packets. If bad NTP packets or
bad authentication requests are received, those counters are viewed under local statistics.
Embedded Event Manager
of trigger input and enables the user to define what actions can be taken. This includes
capturing various show commands or performing actions such as executing a Tool
Command Language (TCL) or Python script when the event gets triggered.
An EEM consists of two major components:
Another component of EEM is the EEM policy, which is simply an event paired
with one or more actions to help troubleshoot or recover from an event. Some
system-defined policies look out for certain system-level events, such as a line card
reload or supervisor switchover event, and then perform predefined actions based on those
events. These system-level policies are viewed using the command show event manager
system-policy. The policies are also overridable, which can be verified using the same
command. The system policies help prevent a larger impact on the device or the network.
For instance, if a module has gone bad and keeps crashing continuously, it can severely
impact services and cause major outages. A system policy for powering down the module
after N crashes can reduce the impact.
Example 2-24 lists some of the system policy events and describes the actions on those
events. The command show event manager policy-state system-policy-name checks
how many times an event has occurred.
Name : __pfm_fanabsent_any_singlefan
Description : Shutdown if any fanabsent for 5 minute(s)
Overridable : Yes
Name : __pfm_fanbad_any_singlefan
Description : Syslog when fan goes bad
Overridable : Yes
Name : __pfm_power_over_budget
Description : Syslog warning for insufficient power overbudget
Overridable : Yes
Name : __pfm_tempev_major
Description : TempSensor Major Threshold. Action: Shutdown
Overridable : Yes
Name : __pfm_tempev_minor
Description : TempSensor Minor Threshold. Action: Syslog.
Overridable : Yes
NX-1# show event manager policy-state __lcm_module_failure
Policy __lcm_module_failure
Cfg count : 3
Hash Count Policy will trigger if
----------------------------------------------------------------
default 0 3 more event(s) occur
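The Cfg count and Count columns in the policy-state output describe a simple threshold: the policy triggers once the number of recorded events reaches the configured count. The following Python sketch models that relationship. It is a simplified illustration, not the EEM implementation; the class name PolicyState is hypothetical, and whether the real counters reset or age out after a trigger is not covered here.

```python
# Sketch: how a count-based policy threshold behaves (class name is hypothetical).

class PolicyState:
    def __init__(self, cfg_count):
        self.cfg_count = cfg_count  # events required before the policy triggers
        self.count = 0              # events seen so far

    def remaining(self):
        """Events still needed before the policy triggers."""
        return max(self.cfg_count - self.count, 0)

    def record_event(self):
        """Count one event; return True once the threshold is reached."""
        self.count += 1
        return self.count >= self.cfg_count


state = PolicyState(cfg_count=3)
initial_remaining = state.remaining()
results = [state.record_event() for _ in range(3)]
print(initial_remaining, results, state.remaining())
```

With a configured count of 3 and no events recorded, the model reports three events remaining, which corresponds to the "3 more event(s) occur" text in the output above.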
An event can be either a system event or a user-triggered event, such as a configuration change.
Actions are defined as the workaround or notification that should be triggered when an
event occurs. EEM supports the following actions, which are defined in the action statement:
■ Logging exceptions
■ Reloading devices
For example, an action can be taken when high CPU utilization is being seen on the
router, or logs can be taken when a BGP session has flapped. Example 2-25 shows the
EEM configuration on a Nexus platform. The EEM has the trigger event set for the high
CPU condition (for instance, the CPU utilization is 70% or higher); the actions include
BGP show commands that are captured when the high CPU condition is noticed. The
policy is viewed using the command show event manager policy internal policy-name.
Similarly, a Python script can be referenced in the EEM script. The Python script is also
saved in the bootflash with the .py extension. Example 2-27 illustrates a Python script
and its reference in the EEM script. In this example, the EEM script is triggered when the
traffic on the interface exceeds the configured storm-control threshold. In such an event,
the triggered Python script collects multiple commands.
Note Refer to the CCO documentation at www.cisco.com for more details on configur-
ing EEM on various Cisco Operating Systems. If any behavioral issues arise with EEM,
capture the show tech-support eem output from the device.
Logging
Network issues are hard to troubleshoot and investigate if the device contains no
information. For instance, if an OSPF adjacency goes down and no correlating alert
exists, determining when the problem happened and what caused the problem is difficult.
For these reasons, logging is important. All Cisco routers and switches support logging
functionality. Logging capabilities are also available for specific features and protocols.
For example, logging can be enabled for BGP session state changes or OSPF adjacency
state changes.
Table 2-3 lists the various logging levels that can be configured.
When a higher level is set, all the lower-numbered logging levels are enabled as well. If the
logging level is set to 5 (Notifications), for example, all events falling under the categories
from 0 to 5 (Emergency to Notifications) are logged. For troubleshooting purposes,
setting the logging level to 7 (Debugging) is good practice.
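The rule that a configured level also enables every lower-numbered level can be expressed in a few lines of Python. In this sketch, the severity names follow the standard syslog numbering from 0 to 7 and common Cisco naming; they are an assumption and might differ slightly from the wording in Table 2-3. The function names are hypothetical.

```python
# Sketch: which syslog severities are logged at a configured level.

SEVERITIES = [
    "Emergency",      # 0
    "Alert",          # 1
    "Critical",       # 2
    "Error",          # 3
    "Warning",        # 4
    "Notifications",  # 5
    "Informational",  # 6
    "Debugging",      # 7
]


def is_logged(message_severity, configured_level):
    """A message is logged when its severity number is <= the configured level."""
    return message_severity <= configured_level


def enabled_severities(configured_level):
    """Names of all severities that are logged at the configured level."""
    return [name for sev, name in enumerate(SEVERITIES)
            if is_logged(sev, configured_level)]


level_5 = enabled_severities(5)
print(level_5)
```

At level 5, Informational and Debugging messages are suppressed, which is why level 7 is the setting to use when debug-level detail is needed.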
■ Console logging
■ Buffered logging
Console logging is important when the device is experiencing crashes or a high CPU
condition and access to the terminal session via Telnet or Secure Shell (SSH) is not
available. However, having console logging enabled when running debugs is not a good
practice because some debug outputs are chatty and can flood the device console.
As a best practice, console logging should always be disabled when running debugs.
Example 2-28 illustrates how to enable console logging on Nexus platforms.
NX-OS not only provides robust logging, but its logs are also persistent across reloads. All
the buffered logging is present in the /var/log/external/ directory. To view the internal
directories, use the command show system internal flash. This command lists all the
internal directories that are part of the flash along with their utilization. The buffered
log messages are viewed using the command show logging log.
Example 2-29 displays the directories present in the flash and the contents of the /var/
log/external/ directory. If the show logging log command does not display output or the
logging gets stopped, check the /var/log/ directory to ensure that space is available for
that directory.
The logging level is also defined for various NX-OS components so that the user can
control logging for chatty components or disable certain logging messages for less
chatty or less important components. This is achieved by setting the logging level of
the component using the command logging level component-name level. Example 2-30
demonstrates setting the logging level of the ARP and Ethpm components to 3 to reduce
unwanted log messages.
The most persistent form of logging is to use a syslog server to log all the device logs.
A syslog server is anything from a text file to a custom application that actively stores
device logging information in a database.
Example 2-31 illustrates the syslog logging configuration. Before configuring
syslog-based logging on NX-OS, the command logging timestamp [microseconds
| milliseconds | seconds] must be enabled for the logging messages so that all log
messages have time stamps. This helps when investigating the log messages. Generally,
management interfaces are configured with a management VRF. In such cases, the
syslog host must be specified using the logging server ip-address use-vrf vrf-name
command on NX-OS so that the router knows from which VRF routing table the server
is reachable. If the VRF option is not specified, the system does a lookup in default VRF
(the global routing table).
Debug Logfiles
NX-OS provides the user with an option to redirect debug output to a file. This
is useful when running debugs and segregating debug outputs from regular log
messages. Use the debug logfile file-name size size command. Example 2-32
demonstrates using the debug logfile command to capture debugs in a logfile. In this
example, a debug logfile named bgp_dbg is created with a size of 10000 bytes. The
size of the logfile ranges from 4096 bytes to 4194304 bytes. All the debugs that are
enabled are logged under the logfile. To filter the debug output further to capture
more precise debug output, use the debug-filter option. In the following example,
a BGP update debug is enabled and the update debug logs are filtered for neighbor
10.12.1.2 in a VRF context VPN_A.
The NX-OS software creates the logfile in the log: file system root directory, so all
the created logfiles are viewed using dir log:. After the debug logfile is created, the
respective debugs are enabled and all the debug outputs are redirected to the debug
logfile. To view the contents of the logfile, use the show debug logfile file-name
command.
Accounting Log
During troubleshooting, it is important to identify the trigger of the problem, which
could be a normal show command or a configuration change. For such issues, examining
all the configuration and show commands executed around the time of the problem
provides vital information.
NX-OS logs all this information into the accounting logfile, which is readily available
to the users. Using the command show accounting log, users can view all the commands
executed and configured on the system, along with the time stamp and user information.
The accounting logs are persistent across reloads. By default, the accounting logs capture
only the configuration commands. To allow the capture of show commands along
with configuration commands, configure the command terminal log-all. Example 2-33
displays the output of the accounting log, highlighting the various configuration changes
made on the device.
Note The accounting logs and show logging logfiles are both stored on logflash and are
accessible across reloads.
Event-History
NX-OS provides continuous logging for all events that occur in the system for both
hardware and software components as event-history logs. The event-history logs are
VDC local and are maintained on a per-component basis. These logs reduce the need
for running debugs in a live production environment and are useful for investigating a
service outage even after the services are restored. The event-history logs are captured in
the background for each component, and capturing them has no impact on CPU
utilization.
■ Large
■ Medium
■ Small
The event-history logs are viewed from the CLI of each component. For instance, the
event-history is viewed for all ARP events using the command show ip arp internal
event-history event. Example 2-34 displays the event-history logs for ARP and shows
how to modify the event-history size. Disable the event-history logs by using the
disabled keyword while defining the size of the event-history. Disabling event-history is
not a recommended practice, however, because it reduces the chances of root causing a
problem and understanding the sequence of events that occurred.
Summary
This chapter focused on various NX-OS tools that can be used to troubleshoot com-
plex problems. It examined various packet capture capabilities with Nexus platforms,
including SPAN and ERSPAN. NX-OS provides the following capabilities, which are
useful for troubleshooting latency and drops from buffer congestion:
■ SPAN-on-Latency
■ SPAN-on-Drop
The chapter explained how to use internal platform tools such as Ethanalyzer and packet
tracer; it also described NetFlow and sFlow use cases, deployment, and configuration
for collecting statistics and network planning. NTP ensures that all clocks are synchro-
nized across multiple devices, to properly correlate timing of events across devices. EEM
scripts are useful for troubleshooting on a daily basis or collecting information after an
event. Finally, the chapter looked at the logging methods available with NX-OS, including
accounting and event-history logging.
References
RFC 3176, InMon Corporation’s sFlow: A Method for Monitoring Traffic in Switched
and Routed Networks. P. Phaal, S. Panchen, and N. McKee. IETF, https://www.ietf.org/
rfc/rfc3176.txt, September 2001.
Chapter 3
Troubleshooting Nexus
Platform Issues
■ NX-OS
Before delving into troubleshooting for Nexus platform hardware, it is important to know
which series of Nexus device is being investigated and what kinds of cards are present in
the chassis. The first step is to view the information of all the cards present in the chassis.
Use the command show module [module-number] to view all the cards present on the
Nexus device; here, module-number is optional for viewing the details of a specific line
card. Examine the output of the show module command from Nexus 7009 and Nexus
3548P in Example 3-1. The first section of the output is from Nexus 7000. It shows two
SUP cards, one active and one standby, along with three other cards: One is running
fine, and the other two are powered down. The command output also shows the software
and hardware version for each card and displays the online diagnostic status of those
cards. The command output shows the reason a card is in a powered-down state. At
the end, the command displays the fabric modules present in the chassis, along with the
software and hardware versions and their status.
The second section of the output is from a Nexus 3500 switch that shows only a single
SUP card. This is because the Nexus 3548P is a single rack unit (RU) switch. The number
of modules present in the chassis depends on the device being used and the kind of cards
it supports.
Nexus 7000
N7K1# show module
Mod Ports Module-Type Model Status
--- ----- ----------------------------------- ------------------ ----------
1 0 Supervisor Module-2 N7K-SUP2E active *
2 0 Supervisor Module-2 N7K-SUP2E ha-standby
5 48 10/100/1000 Mbps Ethernet XL Module powered-dn
6 48 1/10 Gbps Ethernet Module N7K-F248XP-25E ok
7 32 10 Gbps Ethernet XL Module powered-dn
Mod Sw Hw
--- --------------- ------
1 8.0(1) 0.403
2 8.0(1) 1.0
6 8.0(1) 1.2
Xbar Sw Hw
--- --------------- ------
1 NA 2.0
2 NA 3.0
3 NA 2.0
4 NA 2.0
5 NA 2.0
Nexus 3500
N3K1# show module
Mod Ports Module-Type Model Status
--- ----- ----------------------------------- ---------------------- -----------
1 48 48x10GE Supervisor N3K-C3548P-10G-SUP active *
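When many chassis have to be checked, the status column of show module can be extracted with a short script. The following Python sketch lists the powered-down modules from the Nexus 7000 output shown above. The helper name module_status is hypothetical, and the parsing assumes the column layout of this example, which can differ between platforms and releases.

```python
# Sketch: list powered-down modules from 'show module' text (helper is hypothetical).

SHOW_MODULE = """\
Mod Ports Module-Type Model Status
--- ----- ----------------------------------- ------------------ ----------
1 0 Supervisor Module-2 N7K-SUP2E active *
2 0 Supervisor Module-2 N7K-SUP2E ha-standby
5 48 10/100/1000 Mbps Ethernet XL Module powered-dn
6 48 1/10 Gbps Ethernet Module N7K-F248XP-25E ok
7 32 10 Gbps Ethernet XL Module powered-dn
Mod Sw Hw
--- --------------- ------
1 8.0(1) 0.403
"""


def module_status(text):
    """Return {module-number: status} from the table that has a Status column."""
    status = {}
    in_table = False
    for line in text.splitlines():
        words = line.split()
        if not words or words[0].startswith("-"):
            continue  # skip blank lines and the dashed separator rows
        if not words[0].isdigit():
            # A header row: only the one containing 'Status' starts the wanted table.
            in_table = "Status" in words
            continue
        if in_table:
            # The active supervisor carries a trailing '*' marker.
            state = words[-2] if words[-1] == "*" else words[-1]
            status[int(words[0])] = state
    return status


status = module_status(SHOW_MODULE)
powered_down = [mod for mod, state in status.items() if state == "powered-dn"]
print(powered_down)
```

Taking the status from the end of each row avoids having to split the free-form Module-Type column, which contains spaces.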
Note A fabric module is not required for all Nexus 7000 chassis types. The Nexus 7004
chassis has no fabric module, for example. However, higher slot chassis types do require
fabric modules for the Nexus 7000 switch to function successfully.
One of the most common issues noticed with Nexus 7000/7700 installations or hardware
upgrades involves interoperability. For example, the network operator might try to
install a line card in a VDC that does not function well in combination with the existing
line cards. M3 cards operate only in combination with M2 or F3 cards in the same VDC.
Similarly, Nexus Fabric Extender (FEX) cards are not supported in combination with
certain line cards. Refer to the compatibility matrix to avoid possible interoperability
issues. The show module command output in Example 3-1 for Nexus 7000 switches highlights
a similar problem, with two line cards powered down because of incompatibility.
Note Nexus I/O module compatibility matrix CCO documentation is available at
http://www.cisco.com/c/dam/en/us/td/docs/switches/datacenter/nexus7000/sw/matrix/technical/reference/Module_Comparison_Matrix.pdf.
The referenced CCO documentation also lists the compatibility of the FEX modules with
different line cards.
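The powered-down check described above lends itself to a quick script. The following Python sketch is a minimal illustration, not a Cisco tool. It assumes that the first table of the show module output has been captured as text (the sample is copied from the Nexus 7000 output in Example 3-1) and that the status is the last column, optionally followed by the asterisk that marks the active supervisor. The function name module_status is hypothetical.

```python
import re

# Sample copied from the Nexus 7000 output in Example 3-1 (first table only).
SHOW_MODULE = """\
Mod Ports Module-Type Model Status
--- ----- ----------------------------------- ------------------ ----------
1 0 Supervisor Module-2 N7K-SUP2E active *
2 0 Supervisor Module-2 N7K-SUP2E ha-standby
5 48 10/100/1000 Mbps Ethernet XL Module powered-dn
6 48 1/10 Gbps Ethernet Module N7K-F248XP-25E ok
7 32 10 Gbps Ethernet XL Module powered-dn
"""

# Slot, port count, free-form description, then the status as the last word
# (an optional trailing "*" marks the active supervisor).
ROW = re.compile(r"^(\d+)\s+(\d+)\s+.*?(\S+)(?:\s+\*)?$")


def module_status(text):
    """Return {slot: status} from the first table of `show module` output."""
    status = {}
    in_table = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Mod") and line.endswith("Status"):
            in_table = True
            continue
        if in_table and line.startswith("Mod"):
            break  # the next table (Sw/Hw versions) begins here
        match = ROW.match(line) if in_table else None
        if match:
            status[int(match.group(1))] = match.group(3)
    return status


if __name__ == "__main__":
    powered_down = [s for s, v in module_status(SHOW_MODULE).items() if v == "powered-dn"]
    print("Powered-down slots:", powered_down)
```

With the sample above, the slots reported as powered down are 5 and 7, which are the two cards to check against the compatibility matrix.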
The show hardware command is used to get detailed information about both the software
and the hardware on the Nexus device. The command displays the status of the
Nexus switch, as well as the uptime, the health of the cards (both line cards and fabric
cards), and the power supply and fans present in the chassis.
Bootup Diagnostics
Bootup diagnostics detect hardware faults such as soldering errors, loose connections,
and faulty modules. These tests are run when the system boots up and before the hardware
is brought online. Table 3-1 shows some of the bootup diagnostic tests.
Note The FIPS test is not supported on the F1 series modules on Nexus 7000.
Bootup diagnostics are configured to run at one of the following levels:
■ None (Bypass): The module is put online without running any bootup diagnostic
tests, for faster card bootup.
■ Complete: All bootup diagnostic tests are run for the module. This is the
default and the recommended level for bootup diagnostics.
The diagnostic level is configured using the command diagnostic bootup level [bypass |
complete] in global configuration mode. The diagnostic level must be configured within
individual VDCs, where applicable. The bootup diagnostic level is verified using the
command show diagnostic bootup level.
Runtime Diagnostics
The runtime diagnostics are run when the system is in running state (that is, on a live
node). These tests help detect runtime hardware errors such as memory errors, resource
exhaustion, and hardware faults/degradation. The runtime diagnostics are further
classified into two categories:
■ Health-monitoring diagnostics
■ On-demand diagnostics
Health-monitoring (HM) tests are nondisruptive and run in the background on each module.
The main aim of these tests is to ensure that the hardware and software components
are healthy while the switch is running network traffic. Some specific HM tests, marked
as HM-always, start by default when the module goes online. Users can easily enable
and disable all HM tests except HM-always tests on any module via the configuration
command-line interface (CLI). Additionally, users can change the interval of all HM tests
except the fixed-interval tests marked as HM-fixed. Table 3-2 lists the HM tests available
across SUP and line card modules.
The interval for HM tests is set using the global configuration command diagnostic
monitor interval module slot test [name | test-id | all] hour hour min minutes second
sec. Note that the name of the test is case sensitive. To enable or disable an HM test,
use the global configuration command [no] diagnostic monitor module slot test [name |
test-id | all]. Use the command show diagnostic content module [slot | all] to display
information about the diagnostics and their attributes on a given module. Example 3-2
illustrates how to view the diagnostics information on a module on a Nexus 7000 switch
and how to disable an HM test. The module in the output of Example 3-2 is the SUP
card, so the test names listed are relevant only for the SUP card, not for line cards. For
example, with the ExternalCompactFlash test, notice that the attribute in the first output
is set to A, which indicates that the test is Active. When the test is disabled from
configuration mode, the output displays the attribute as I, indicating that the test is
Inactive.
Nexus 7000
N7K1# show diagnostic content module 1
Diagnostics test suite attributes:
B/C/* - Bypass bootup level test / Complete bootup level test / NA
P/* - Per port test / NA
M/S/* - Only applicable to active / standby unit / NA
D/N/* - Disruptive test / Non-disruptive test / NA
H/O/* - Always enabled monitoring test / Conditionally enabled test / NA
F/* - Fixed monitoring interval test / NA
X/* - Not a health monitoring test / NA
E/* - Sup to line card test / NA
L/* - Exclusively run this test / NA
T/* - Not an ondemand test / NA
A/I/* - Monitoring is active / Monitoring is inactive / NA
Z/D/* - Corrective Action is enabled / Corrective Action is disabled / NA
Testing Interval
ID Name Attributes (hh:mm:ss)
____ __________________________________ ____________ _________________
1) ASICRegisterCheck-------------> ***N******A* 00:00:20
2) USB---------------------------> C**N**X**T** -NA-
3) NVRAM-------------------------> ***N******A* 00:05:00
4) RealTimeClock-----------------> ***N******A* 00:05:00
5) PrimaryBootROM----------------> ***N******A* 00:30:00
6) SecondaryBootROM--------------> ***N******A* 00:30:00
7) CompactFlash------------------> ***N******A* 00:30:00
8) ExternalCompactFlash----------> ***N******I* 00:30:00
9) PwrMgmtBus--------------------> **MN******A* 00:00:30
10) SpineControlBus---------------> ***N******A* 00:00:30
11) SystemMgmtBus-----------------> **MN******A* 00:00:30
12) StatusBus---------------------> **MN******A* 00:00:30
13) PCIeBus-----------------------> ***N******A* 00:00:30
14) StandbyFabricLoopback---------> **SN******A* 00:00:30
15) ManagementPortLoopback--------> C**D**X**T** -NA-
16) EOBCPortLoopback--------------> C**D**X**T** -NA-
17) OBFL--------------------------> C**N**X**T** -NA-
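The attribute strings in this output can be decoded mechanically. The following Python sketch is only an illustration: it assumes that each of the 12 characters in the Attributes column corresponds, in order, to one line of the legend printed at the top of the output, which is consistent with the rows shown here. The names LEGEND and decode_attributes are hypothetical.

```python
# One entry per legend line, in the order printed by
# `show diagnostic content module`; "*" means not applicable.
LEGEND = [
    {"B": "bypass bootup level test", "C": "complete bootup level test"},
    {"P": "per port test"},
    {"M": "only applicable to active unit", "S": "only applicable to standby unit"},
    {"D": "disruptive test", "N": "non-disruptive test"},
    {"H": "always enabled monitoring test", "O": "conditionally enabled test"},
    {"F": "fixed monitoring interval test"},
    {"X": "not a health monitoring test"},
    {"E": "sup to line card test"},
    {"L": "exclusively run this test"},
    {"T": "not an ondemand test"},
    {"A": "monitoring is active", "I": "monitoring is inactive"},
    {"Z": "corrective action is enabled", "D": "corrective action is disabled"},
]


def decode_attributes(attrs):
    """Translate an attribute string such as '***N******A*' into text."""
    if len(attrs) != len(LEGEND):
        raise ValueError(f"expected {len(LEGEND)} flags, got {len(attrs)}")
    meanings = []
    for flag, options in zip(attrs, LEGEND):
        if flag == "*":
            continue  # not applicable for this position
        meanings.append(options.get(flag, f"unknown flag {flag!r}"))
    return meanings


if __name__ == "__main__":
    # Attribute strings copied from the ExternalCompactFlash and
    # ManagementPortLoopback rows of the output above.
    for attrs in ("***N******I*", "C**D**X**T**"):
        print(attrs, "->", decode_attributes(attrs))
```

Decoding by position matters because the letter D appears in two legend lines (disruptive test and corrective action disabled), so the letter alone is ambiguous.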
The command show diagnostic content module [slot | all] displays not only the HM
tests but also the bootup diagnostic tests. In the output of Example 3-2, notice the tests
whose attributes begin with C. Those tests are complete bootup-level tests. To view all
the test results and statistics, use the command show diagnostic result module [slot |
all] [detail]. When verifying the diagnostic results, ensure no test has a Fail (F) or Error
(E) result. Example 3-3 displays the diagnostic test results of the SUP card both in brief
format and in detailed format. The output shows that the bootup diagnostic level is set
to complete. The first output lists all the tests the SUP module went through along with
its results, where “.” indicates that the test has passed. The detailed version of the output
lists more specific details, such as the error code, the previous execution time, the next
execution time, and the reason for failure. This detailed information is useful when issues
are observed on the module and investigation is required to isolate a transient issue or a
hardware issue.
1) ASICRegisterCheck-------------> .
2) USB---------------------------> .
3) NVRAM-------------------------> .
4) RealTimeClock-----------------> .
5) PrimaryBootROM----------------> .
6) SecondaryBootROM--------------> .
7) CompactFlash------------------> .
8) ExternalCompactFlash----------> U
9) PwrMgmtBus--------------------> .
10) SpineControlBus---------------> .
11) SystemMgmtBus-----------------> .
12) StatusBus---------------------> .
13) PCIeBus-----------------------> .
14) StandbyFabricLoopback---------> U
15) ManagementPortLoopback--------> .
16) EOBCPortLoopback--------------> .
17) OBFL--------------------------> .
N7K1# show diagnostic result module 1 detail
Current bootup diagnostic level: complete
Module 1: Supervisor Module-2 (Active)
______________________________________________________________________
1) ASICRegisterCheck .
2) USB .
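The check for Fail (F) or Error (E) results can be automated when many modules need to be reviewed. The following Python sketch is a hedged illustration that assumes the brief result lines have the form shown in this example (test number, test name, a dashed arrow, and a one-character result). The sample lines are a subset copied from the output above; parse_results and failed_tests are hypothetical names. Result codes other than ".", F, and E are returned without interpretation.

```python
import re

# Subset of the brief result lines shown in Example 3-3.
RESULTS = """\
7) CompactFlash------------------> .
8) ExternalCompactFlash----------> U
9) PwrMgmtBus--------------------> .
14) StandbyFabricLoopback---------> U
15) ManagementPortLoopback--------> .
"""

LINE = re.compile(r"^\s*(\d+)\)\s+(\w+)-*>\s+(\S)\s*$")


def parse_results(text):
    """Return {test_name: result_code} for each brief result line."""
    results = {}
    for line in text.splitlines():
        match = LINE.match(line)
        if match:
            results[match.group(2)] = match.group(3)
    return results


def failed_tests(results):
    """Return the names of tests whose result is Fail (F) or Error (E)."""
    return [name for name, code in results.items() if code in ("F", "E")]


if __name__ == "__main__":
    results = parse_results(RESULTS)
    print("Failed or errored:", failed_tests(results))
    print("Other non-pass codes:", {n: c for n, c in results.items() if c not in ".FE"})
```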
On-demand diagnostics have a different focus. Some tests are not required to be run
periodically, but they might be run in response to certain events (such as faults) or in
anticipation of an event (such as exceeded resources). Such on-demand tests are useful in
localizing faults and applying fault-containment solutions.
Both disruptive and nondisruptive on-demand diagnostic tests are run from the CLI. An
on-demand test is executed using the command diagnostic start module slot test [test-id
| name | all | non-disruptive] [port port-number | all]. The test-id variable is the ID
number of a test supported on a given module. The test can also be run on a per-port basis
(depending on the kind of test) by specifying the optional keyword port. The command
diagnostic stop module slot test [test-id | name | all] is used to stop an on-demand test.
On-demand tests default to a single execution, but the number of iterations can be
increased using the command diagnostic ondemand iteration number, where number specifies
the number of iterations. Be careful when running disruptive on-demand diagnostic tests on
a switch that carries production traffic.
==============================================
Card:(6) 1/10 Gbps Ethernet Module
==============================================
Current running test Run by
PortLoopback OD
______________________________________________________________________
6) PortLoopback:
Port 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
-----------------------------------------------------
U U U U U U U U U U U U U . . .
Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
-----------------------------------------------------
U U . . U U U U U U U U U U U U
Port 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
-----------------------------------------------------
U U U U U U U U U U U U U U U U
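Per-port results such as the PortLoopback table above are easier to review once they are mapped to port numbers. The following Python sketch is a minimal illustration that assumes each Port header row is followed by a dashed separator and one row of result codes with the same number of columns. The sample is copied from the output above; per_port_results is a hypothetical name.

```python
# PortLoopback result rows copied from the example output above.
PORT_LOOPBACK = """\
Port 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
-----------------------------------------------------
U U U U U U U U U U U U U . . .
Port 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
-----------------------------------------------------
U U . . U U U U U U U U U U U U
Port 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48
-----------------------------------------------------
U U U U U U U U U U U U U U U U
"""


def per_port_results(text):
    """Return {port_number: result_code} from a per-port result table."""
    results = {}
    ports = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or set(line.strip()) == {"-"}:
            continue  # skip blank lines and dashed separators
        if tokens[0] == "Port":
            ports = [int(t) for t in tokens[1:]]
        elif ports and len(tokens) == len(ports):
            results.update(zip(ports, tokens))
            ports = []
    return results


if __name__ == "__main__":
    results = per_port_results(PORT_LOOPBACK)
    passed = [port for port, code in sorted(results.items()) if code == "."]
    print("Ports with a passing result:", passed)
```

For the sample above, the ports with a passing result are 14, 15, 16, 19, and 20.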
During troubleshooting, if the number of iterations is set to a higher value and an action
needs to be taken if the test fails, use the command diagnostic ondemand action-on-failure
[continue failure-count num-fails | stop]. When the continue keyword is used, the
failure-count parameter sets the number of failures allowed before stopping the test. This
value defaults to 0, which means to never stop the test, even in case of failure. The
on-demand diagnostic settings are verified using the command show diagnostic ondemand
setting. Example 3-5 illustrates how to set the action upon failure for on-demand
diagnostic tests. In this example, the action-on-failure is set to continue until the
failure count reaches the value of 2.
Note Diagnostic tests can also be run in offline mode. Use the command hardware module
slot offline to put the module in offline mode, and then use the command diagnostic start
module slot test [test-id | name | all] offline to execute the diagnostic test with the
offline attribute.
■ RewriteEngineLoopback
■ StandbyFabricLoopback
■ Internal PortLoopback
■ SnakeLoopback
On the Supervisor module, if the StandbyFabricLoopback test fails, the system reloads
the standby supervisor card. If the standby supervisor card does not come back up
online in three retries, the standby supervisor card is powered off. After the reload of the
standby supervisor card, the HM diagnostics start by default. The corrective actions are
disabled by default and are enabled by configuring the command diagnostic eem action
conservative.
Note The command diagnostic eem action conservative is not configurable on a per-
test basis; it applies to all four of the previously mentioned GOLD tests.
■ Packet drops
The previous section covered module state and diagnostics. This section focuses on
commands used across different Nexus platforms to perform health checks.
When the core file is identified, it can be copied to bootflash or any external location,
such as a File Transfer Protocol (FTP) or Trivial FTP (TFTP) server. On Nexus 7000, the
core files are located in the core: file system. The relevant core files are located by
following this URL:
core://<module-number>/<process-id>/<instance-number>
For instance, in Example 3-6, the location for the core files is core://6/4298/1. If the
Nexus 7000 switch rebooted or a switchover occurred, the core files would be located in
the logflash://[sup-1 | sup-2]/core directory. On other Nexus platforms, such as Nexus
5000, 4000, or 3000, the core files would be located in the volatile: file system instead
of the logflash: file system; thus, they can be lost if the device reloads. In newer versions
of software for platforms that store core files in the volatile: file system, the capability
was added to write the core files to bootflash: or to a remote file location when they occur.
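The core: URL format described above has three numeric components, which makes it simple to build or split in a script. The following Python sketch is illustrative only; the CoreLocation type and both function names are hypothetical, and the only value taken from the text is the location core://6/4298/1.

```python
from typing import NamedTuple


class CoreLocation(NamedTuple):
    """The three components of a core: URL on Nexus 7000."""
    module: int
    pid: int
    instance: int


def parse_core_url(url):
    """Split core://<module-number>/<process-id>/<instance-number>."""
    prefix = "core://"
    if not url.startswith(prefix):
        raise ValueError(f"not a core: URL: {url!r}")
    parts = url[len(prefix):].strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected module/pid/instance, got {url!r}")
    module, pid, instance = (int(part) for part in parts)
    return CoreLocation(module, pid, instance)


def build_core_url(module, pid, instance):
    """Compose the URL for a given module, process ID, and instance."""
    return f"core://{module}/{pid}/{instance}"


if __name__ == "__main__":
    location = parse_core_url("core://6/4298/1")
    print(location)
    print(build_core_url(*location))
```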
If a process crashed but no core files were generated for the crash, a stack trace might
have been generated for the process. But if neither a core file nor a stack trace exists for
the crashed service, use the command show processes log vdc-all to identify which
processes were impacted. Such crashed processes usually are marked with the N flag. Using
the process ID (PID) values from the previous command and using the command show
processes log pid pid can identify the reason the service went down. The command output
displays the reason the process failed in the Death reason field. Example 3-7 displays
using the show processes log and show processes log pid commands to identify crashes
on the Nexus platform.
PID: 5656
Exit code: signal 6 (core dumped)
cgroup: 1:devices,memory,cpuacct,cpu:/1
CWD: /var/sysmgr/work
RLIMIT_AS: 1936268083
! Output omitted for brevity
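The show processes log pid output is a series of "field: value" lines, so the fields of interest can be collected with a few lines of code. The following Python sketch is a hedged illustration that uses only the lines visible in the excerpt above; parse_fields is a hypothetical name, and the sketch assumes that the field name ends at the first colon followed by a space.

```python
# Lines copied from the excerpt of Example 3-7 shown above.
PROCESS_LOG = """\
PID: 5656
Exit code: signal 6 (core dumped)
cgroup: 1:devices,memory,cpuacct,cpu:/1
CWD: /var/sysmgr/work
RLIMIT_AS: 1936268083
! Output omitted for brevity
"""


def parse_fields(text):
    """Return {field: value} for each 'field: value' line."""
    fields = {}
    for line in text.splitlines():
        if line.startswith("!") or ": " not in line:
            continue  # skip comments and lines without a field separator
        key, value = line.split(": ", 1)
        fields[key.strip()] = value.strip()
    return fields


if __name__ == "__main__":
    fields = parse_fields(PROCESS_LOG)
    print("PID:", fields["PID"])
    print("Exit code:", fields["Exit code"])
    print("Core dumped:", "core dumped" in fields["Exit code"])
```

Splitting on the first separator only keeps values such as the cgroup line intact even though they contain additional colons.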
For quick verification of the last reset reason, use the show system reset-reason
command. Additional commands to capture and identify the reset reason when core files were
not generated follow:
Packet Loss
Packet loss is a complex issue to troubleshoot in any environment. Packet loss happens
for multiple reasons:
■ Bad hardware
■ Drops on a platform
The packet drops that result from routing and switching issues can be fixed by rectifying
the configuration. Bad hardware, on the other hand, impacts all traffic on a partial port or
on the whole line card. Nexus platforms provide various counters that can be viewed to
determine the reason for packet loss on the device (see the following sections).
MAC address information, and switchport and trunk information. Example 3-8 displays
the output of the show interface command, with various fields highlighting the information
to be verified on an interface. The second part of the output displays information
on the various capabilities of the interface.
To view just the various counters on the interfaces, use the command show interface
counters errors. The counters errors option can also be used with the specific show
interface interface-number command. Example 3-9 displays the error counters for
the interface. If any counter is increasing, the interface needs further troubleshooting,
based on the kind of errors received. The error can point to Layer 1 issues, a bad
port issue, or even buffer issues. Some counters indicated in the output are not errors,
but instead indicate a different problem: The Giants counter, for instance, indicates
that packets are being received that are larger than the MTU configured on
the interface.
--------------------------------------------------------------------------------
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth2/1 0 0 0 0 0 0
--------------------------------------------------------------------------------
Port Single-Col Multi-Col Late-Col Exces-Col Carri-Sen Runts
--------------------------------------------------------------------------------
Eth2/1 0 0 0 0 0 0
--------------------------------------------------------------------------------
Port Giants SQETest-Err Deferred-Tx IntMacTx-Er IntMacRx-Er Symbol-Err
--------------------------------------------------------------------------------
Eth2/1 0 -- 0 0 0 0
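Because the error counters are split across several header and value rows, it helps to merge them into one record per port before looking for nonzero values. The following Python sketch is a minimal illustration based on the layout shown above; the function names are hypothetical, and a value of "--" is stored as None because the output gives no number for it.

```python
# Output copied from the error counters shown above.
COUNTERS = """\
--------------------------------------------------------------------------------
Port Align-Err FCS-Err Xmit-Err Rcv-Err UnderSize OutDiscards
--------------------------------------------------------------------------------
Eth2/1 0 0 0 0 0 0
--------------------------------------------------------------------------------
Port Single-Col Multi-Col Late-Col Exces-Col Carri-Sen Runts
--------------------------------------------------------------------------------
Eth2/1 0 0 0 0 0 0
--------------------------------------------------------------------------------
Port Giants SQETest-Err Deferred-Tx IntMacTx-Er IntMacRx-Er Symbol-Err
--------------------------------------------------------------------------------
Eth2/1 0 -- 0 0 0 0
"""


def parse_error_counters(text):
    """Return {port: {counter_name: value}} merged across all header blocks."""
    counters = {}
    header = []
    for line in text.splitlines():
        tokens = line.split()
        if not tokens or set(line.strip()) == {"-"}:
            continue  # skip blank lines and dashed separators
        if tokens[0] == "Port":
            header = tokens[1:]
        elif header and len(tokens) == len(header) + 1:
            port = counters.setdefault(tokens[0], {})
            for name, value in zip(header, tokens[1:]):
                port[name] = int(value) if value.isdigit() else None
    return counters


def nonzero_counters(counters):
    """Keep only ports and counters with a value greater than zero."""
    return {
        port: {name: value for name, value in values.items() if value}
        for port, values in counters.items()
        if any(values.values())
    }


if __name__ == "__main__":
    parsed = parse_error_counters(COUNTERS)
    print("Counters per port:", {port: len(values) for port, values in parsed.items()})
    print("Nonzero:", nonzero_counters(parsed))
```

A nonzero value is only a starting point. As noted above, what matters is whether the counter keeps increasing between successive readings.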
To view the details of the hardware interface resources and utilization, use the command
show hardware capacity interface. This command displays not only buffer information
but also any drops in both the ingress and egress directions on multiple ports across each
line card. The output varies a bit among Nexus platforms, such as between the Nexus
7000 and the Nexus 9000, but this command is useful for identifying interfaces with the
highest drops on the switch. Example 3-10 displays the hardware interface resources on
the Nexus 7000 switch.
Interface drops:
Module Total drops Highest drop ports
3 Tx: 0 -
3 Rx: 101850 Ethernet3/37
4 Tx: 0 -
4 Rx: 64928 Ethernet4/4
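The Interface drops section above can be ranked to find the module and direction with the most drops. The following Python sketch is illustrative and assumes the four-column row layout shown in the excerpt (module, direction, total drops, highest drop port); parse_drops and worst_row are hypothetical names.

```python
import re

# Rows copied from the Interface drops section of Example 3-10.
DROPS = """\
Module Total drops Highest drop ports
3 Tx: 0 -
3 Rx: 101850 Ethernet3/37
4 Tx: 0 -
4 Rx: 64928 Ethernet4/4
"""

ROW = re.compile(r"^(\d+)\s+(Tx|Rx):\s+(\d+)\s+(\S+)$")


def parse_drops(text):
    """Return a list of (module, direction, total_drops, highest_drop_port)."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            port = None if match.group(4) == "-" else match.group(4)
            rows.append((int(match.group(1)), match.group(2), int(match.group(3)), port))
    return rows


def worst_row(rows):
    """Return the row with the highest total drop count."""
    return max(rows, key=lambda row: row[2])


if __name__ == "__main__":
    print(worst_row(parse_drops(DROPS)))
```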
One of the most common problems on an interface is input and output discards. These
errors usually take place when congestion occurs on the ports. The previous interface
commands and the show hardware internal errors [module slot] command are useful
in identifying input or output discards. If input discards are identified, you must try to
discover congestion on the egress ports. Input discards can also occur when SPAN is
configured on the device and the egress ports are oversubscribed. Thus, ensure that SPAN
is not configured on the device unless it is required for performing a SPAN capture; in
that case, remove it afterward. If the egress-congested port is a Gig port, the
problem could result from a many-to-one unicast traffic flow causing congestion. This
issue can be overcome by upgrading the port to a 10-Gig port or by bundling multiple
Gig ports into a port-channel interface.
The output discards are usually caused by drops in the queuing policy on the interface.
This is verified using the command show system internal qos queueing stats interface
interface-id. The queueing policy configuration information is viewed using the
command show queueing interface interface-id or show policy-map interface
interface-id [input | output]. Tweaking the QoS policy prevents the output discards or
drops. Example 3-11 displays the queueing statistics for interface Ethernet1/5, indicating
drops in various queues on the interface.
Transmit queues
----------------------------------------
Queue 1p7q4t-out-q-default
Total bytes 0
Total packets 0
Current depth in bytes 0
Min pg drops 0
No desc drops 0
WRED drops 0
Taildrop drops 0
Queue 1p7q4t-out-q2
Total bytes 0
Total packets 0
Current depth in bytes 0
Min pg drops 0
No desc drops 0
WRED drops 0
Taildrop drops 0
Queue 1p7q4t-out-q3
Total bytes 0
Total packets 0
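To spot which queue is dropping traffic, the per-queue counters can be grouped and the drop counters summed. The following Python sketch is a hedged illustration that assumes each queue block starts with a "Queue <name>" line followed by "label value" lines, as in the excerpt above. The sample contains the first two queues from that excerpt; parse_queues and drop_totals are hypothetical names.

```python
# First two transmit queues copied from the excerpt of Example 3-11.
QUEUE_STATS = """\
Transmit queues
----------------------------------------
Queue 1p7q4t-out-q-default
Total bytes 0
Total packets 0
Current depth in bytes 0
Min pg drops 0
No desc drops 0
WRED drops 0
Taildrop drops 0
Queue 1p7q4t-out-q2
Total bytes 0
Total packets 0
Current depth in bytes 0
Min pg drops 0
No desc drops 0
WRED drops 0
Taildrop drops 0
"""


def parse_queues(text):
    """Return {queue_name: {counter_label: value}}."""
    queues = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) == 2 and tokens[0] == "Queue":
            current = queues.setdefault(tokens[1], {})
        elif current is not None and len(tokens) >= 2 and tokens[-1].isdigit():
            current[" ".join(tokens[:-1])] = int(tokens[-1])
    return queues


def drop_totals(queues):
    """Sum every counter whose label ends in 'drops', per queue."""
    return {
        name: sum(value for label, value in counters.items() if label.endswith("drops"))
        for name, counters in queues.items()
    }


if __name__ == "__main__":
    print(drop_totals(parse_queues(QUEUE_STATS)))
```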
Platform-Specific Drops
Nexus platforms provide in-depth information on various platform-level counters to
identify problems with hardware and software components. If packet loss is noticed on
a particular interface or line card, the platform-level commands provide information on
what is causing the packets to be dropped. For instance, on the Nexus 7000 switch, the
command show hardware internal statistics [module slot | module-all] pktflow dropped
is used to identify the reason for packet drops. This command details the information per
line card module and packet drops across all interfaces on the line card. Example 3-12
displays the packet drops across various ports on the line card in slot 3. The command
output displays packet drops resulting from bad packet length, error packets from Media
Access Control (MAC), a bad cyclic redundancy check (CRC), and so on. Using the diff
keyword along with the command helps identify drops that are increasing on particular
interfaces and that result from specific reasons, for further troubleshooting.
|---------------------------------------|
|Executed at : 2017-06-02 10:09:16.914 |
|---------------------------------------|
Hardware statistics on module 03:
|------------------------------------------------------------------------|
| Device:Flanker Eth Mac Driver Role:MAC Mod: 3 |
| Last cleared @ Fri Jun 2 00:28:46 2017
|------------------------------------------------------------------------|
Instance:0
Cntr Name Value Ports
----- ----- ----- -----
0 igr in upm: pkts rcvd, len(>= 64B, <= mtu) with bad crc 0000000000000001 3 -
1 igr rx pl: received error pkts from mac 0000000000000001 3 -
2 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000004 3 -
3 igr rx pl: cbl drops 0000000000002818 3 -
4 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000002 4 -
Instance:1
Cntr Name Value Ports
----- ----- ----- -----
5 igr in upm: pkts rcvd, len > MTU with bad CRC 0000000000000001 10 -
6 igr in upm: pkts rcvd, len > MTU with bad CRC 0000000000000001 11 -
7 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000002 9 -
8 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000011 10 -
Instance:3
Cntr Name Value Ports
----- ----- ----- -----
13 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000003 26 -
14 igr rx pl: cbl drops 0000000000000008 26 -
15 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000001 31 -
Instance:4
Cntr Name Value Ports
----- ----- ----- -----
16 igr in upm: pkts rcvd, len > MTU with bad CRC 0000000000000027 35 -
17 igr in upm: pkts rcvd, len > MTU with bad CRC 0000000000000044 36 -
18 igr in upm: pkts rcvd, len(>= 64B, <= mtu) with bad crc 0000000000000001 36 -
19 igr in upm: pkts rcvd, len > MTU with bad CRC 0000000000005795 37 -
20 igr in upm: pkts rcvd, len > MTU with bad CRC 0000000000000034 38 -
21 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000008 33 -
22 igr rx pl: cbl drops 0000000000002801 33 -
23 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000004 34 -
24 egr out pl: total pkts dropped due to cbl 0000000000001769 34 -
25 igr rx pl: received error pkts from mac 0000000000000003 35 -
26 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000200 35 -
27 igr rx pl: cbl drops 0000000000002813 35 -
28 igr rx pl: dropped pkts cnt 0000000000000017 35 -
29 igr rx pl: received error pkts from mac 0000000000000093 36 -
30 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000002515 36 -
31 igr rx pl: cbl drops 0000000000002894 36 -
32 igr rx pl: dropped pkts cnt 0000000000000166 36 -
33 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000047337 37 -
34 igr rx pl: dropped pkts cnt 0000000000001371 37 -
35 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000000212 38 -
36 igr rx pl: dropped pkts cnt 0000000000000012 38 -
|------------------------------------------------------------------------|
| Device:Flanker Xbar Driver Role:XBR-INTF Mod: 3 |
| Last cleared @ Fri Jun 2 00:28:46 2017
|------------------------------------------------------------------------|
|------------------------------------------------------------------------|
| Device:Flanker Queue Driver Role:QUE Mod: 3 |
| Last cleared @ Fri Jun 2 00:28:46 2017
|------------------------------------------------------------------------|
Instance:4
Cntr Name Value Ports
----- ----- ----- -----
0 igr ib_500: pkt drops 0000000000000003 35 -
1 igr ib_500: pkt drops 0000000000000010 36 -
2 igr ib_500: vq ib pkt drops 0000000000000013 33-40 -
3 igr vq: l2 pkt drop count 0000000000000013 33-40 -
4 igr vq: total pkts dropped 0000000000000013 33-40 -
Instance:5
Cntr Name Value Ports
----- ----- ----- -----
5 igr ib_500: de drops, shared by parser and de 0000000000000004 41-48 -
6 igr ib_500: vq ib pkt drops 0000000000000004 41-48 -
7 igr vq: l2 pkt drop count 0000000000000004 41-48 -
8 igr vq: total pkts dropped 0000000000000004 41-48 -
|------------------------------------------------------------------------|
| Device:Lightning Role:ARB-MUX Mod: 3 |
| Last cleared @ Fri Jun 2 00:28:46 2017
|------------------------------------------------------------------------|
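With this many counters, totaling the drops per port quickly shows where to focus. The following Python sketch is a minimal illustration that uses a handful of rows copied from the output above. It assumes each row has the form "counter-number name value ports -" and that the zero-padded values are decimal; drops_by_port is a hypothetical name.

```python
import re
from collections import defaultdict

# A few rows copied from the pktflow dropped output in Example 3-12.
PKTFLOW = """\
3 igr rx pl: cbl drops 0000000000002818 3 -
19 igr in upm: pkts rcvd, len > MTU with bad CRC 0000000000005795 37 -
33 igr rx pl: EM-IPL i/f dropped pkts cnt 0000000000047337 37 -
34 igr rx pl: dropped pkts cnt 0000000000001371 37 -
2 igr ib_500: vq ib pkt drops 0000000000000013 33-40 -
"""

# Counter number, counter name, zero-padded value, port or port range, "-".
ROW = re.compile(r"^(\d+)\s+(.+?)\s+(\d{10,})\s+(\S+)\s+-$")


def drops_by_port(text):
    """Return {port_or_range: total_drops} summed over all counters."""
    totals = defaultdict(int)
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            totals[match.group(4)] += int(match.group(3))
    return dict(totals)


if __name__ == "__main__":
    totals = drops_by_port(PKTFLOW)
    for port, count in sorted(totals.items(), key=lambda item: item[1], reverse=True):
        print(f"port {port}: {count}")
```

Totals describe a single reading only. Comparing successive readings, as the diff keyword does on the switch, is what shows whether a drop reason is still active.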
Communication among the supervisor card, line card, and fabric cards occurs over the
Ethernet out-of-band channel (EOBC). If errors occur on the EOBC channel, the Nexus
switch can experience packet loss and major service loss. EOBC errors are verified using
the command show hardware internal cpu-mac eobc stats. The Error Counters section
displays a list of errors that occur on the EOBC interface. In most instances, physically
reseating the line card fixes the EOBC errors. Example 3-13 displays the EOBC stats
for Error Counters on a Nexus 7000 switch. Filter the output for checking just the error
counters by using the grep keyword (see Example 3-13).
Nexus platforms also provide in-band stats for packets that the central processing unit
(CPU) processes. If an error counter in the in-band stats increases frequently, it
could indicate a problem with the supervisor card and might lead to packet loss. To view
the CPU in-band statistics, use the command show hardware internal cpu-mac inband
stats. This command displays various statistics on packets and length of packets received
by or sent from the CPU, interrupt counters, error counters, and present and maximum
punt statistics. Example 3-14 displays the output of the in-band stats on the Nexus
7000 switch. This command is also available on the Nexus 9000 switch, as the second
output shows.
RMON counters Rx Tx
----------------------+--------------------+--------------------
total packets 1154193 995903
good packets 1154193 995903
64 bytes packets 0 0
65-127 bytes packets 432847 656132
128-255 bytes packets 429319 8775
256-511 bytes packets 236194 328244
512-1023 bytes packets 619 18
1024-max bytes packets 55214 2734
broadcast packets 0 0
multicast packets 0 0
good octets 262167681 201434260
total octets 0 0
XON packets 0 0
XOFF packets 0 0
management packets 0 0
Interrupt counters
-------------------+--
Assertions 1176322
Rx packet timer 1154193
Rx absolute timer 0
Rx overrun 0
Rx descr min thresh 0
Tx packet timer 0
Tx absolute timer 1154193
Tx queue empty 995903
Tx descr thresh low 0
Error counters
--------------------------------+--
CRC errors ..................... 0
Alignment errors ............... 0
Symbol errors .................. 0
Sequence errors ................ 0
RX errors ...................... 0
Missed packets (FIFO overflow) 0
Throttle statistics
-----------------------------+---------
Throttle interval ........... 2 * 100ms
Packet rate limit ........... 64000 pps
Rate limit reached counter .. 0
Tick counter ................ 193078
Active ...................... 0
Rx packet rate (current/max) 3 / 182 pps
Tx packet rate (current/max) 2 / 396 pps
NAPI statistics
----------------+---------
Weight Queue 0 ......... 512
Weight Queue 1 ......... 256
Weight Queue 2 ......... 128
Weight Queue 3 ......... 16
Weight Queue 4 ......... 64
Weight Queue 5 ......... 64
Weight Queue 6 ......... 64
Weight Queue 7 ......... 64
Poll scheduled . 1176329
Poll rescheduled 0
Poll invoked ... 1176329
Weight reached . 0
qdisc stats:
----------------+---------
Tx queue depth . 10000
qlen ........... 0
packets ........ 995903
bytes .......... 197450648
drops .......... 0
Inband stats
----------------+---------
Tx src_p stamp . 0
N9396PX-5# show hardware internal cpu-mac inband stats
================ Packet Statistics ======================
Packets received: 58021524
Bytes received: 412371530221
Packets sent: 57160641
Bytes sent: 409590752550
Rx packet rate (current/peak): 0 / 281 pps
Peak rx rate time: 2017-03-08 19:03:21
Tx packet rate (current/peak): 0 / 289 pps
Peak tx rate time: 2017-04-24 14:26:36
Note The output varies among Nexus platforms. For instance, the previous output is
brief and comes from the Nexus 9396 PX switch. The same command output on the
Nexus 9508 switch is similar to the output displayed for the Nexus 7000 switch. This
command is available on all Nexus platforms.
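For the longer Nexus 7000 style of output, the Error counters section is usually the first place to look. The following Python sketch is a hedged illustration that extracts just that section from captured text. It assumes the section starts at a line reading "Error counters" and ends at the first line that does not finish with a number, as in Example 3-14; error_counters is a hypothetical name.

```python
import re

# Error counters section (plus the start of the next section) copied
# from the Nexus 7000 output in Example 3-14.
INBAND_STATS = """\
Error counters
--------------------------------+--
CRC errors ..................... 0
Alignment errors ............... 0
Symbol errors .................. 0
Sequence errors ................ 0
RX errors ...................... 0
Missed packets (FIFO overflow) 0
Throttle statistics
-----------------------------+---------
Throttle interval ........... 2 * 100ms
"""

# Counter name, then dot or space padding, then the value at end of line.
LINE = re.compile(r"^(.*?)[ .]+(\d+)$")


def error_counters(text):
    """Return {counter_name: value} from the 'Error counters' section."""
    counters = {}
    in_section = False
    for line in text.splitlines():
        line = line.strip()
        if line == "Error counters":
            in_section = True
            continue
        if not in_section or line.startswith("-"):
            continue
        match = LINE.match(line)
        if not match:
            break  # the first non-counter line ends the section
        counters[match.group(1)] = int(match.group(2))
    return counters


if __name__ == "__main__":
    counters = error_counters(INBAND_STATS)
    print({name: value for name, value in counters.items() if value} or "no errors")
```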
In the previous output, the in-band stats command on Nexus 9396, though brief, displays
the time when the traffic hit the peak rate; such information is not available on the
command for the Nexus 7000 switch. Nexus 7000 provides the show hardware internal
cpu-mac inband events command, which displays the event history of the traffic rate in the
ingress (Rx) or egress (Tx) direction of the CPU, including the peak rate. Example 3-15
displays the in-band events history for the traffic rate in the ingress or egress direction of
the CPU. The time stamp of the peak traffic rate is useful when investigating high CPU or
packet loss on the Nexus 7000 switches.
NX-OS also provides a brief in-band counters CLI that displays the number of in-band
packets in both ingress (Rx) and egress (Tx) directions, errors, dropped counters, overruns,
and more. These are used to quickly determine whether the in-band traffic is getting dropped.
Example 3-16 displays the output of the command show hardware internal cpu-mac inband
counters. If nonzero counters appear for errors, drops, or overruns, use the diff keyword to
determine whether they are increasing frequently. This command is available on all platforms.
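When counters are collected off-box, the same "is it still increasing?" question can be answered by comparing two readings. The following Python sketch shows only the idea and does not describe how the diff keyword is implemented on the switch. The function name, the counter names, and the values in the usage example are all hypothetical.

```python
def increasing_counters(before, after):
    """Return {name: delta} for counters that grew between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] > before[name]
    }


if __name__ == "__main__":
    # Hypothetical snapshots taken a few minutes apart (illustrative values).
    first = {"rx_errors": 0, "rx_dropped": 12, "rx_overruns": 3}
    second = {"rx_errors": 0, "rx_dropped": 20, "rx_overruns": 3}
    print(increasing_counters(first, second))
```

A counter that is nonzero but unchanged between readings, such as rx_overruns in this hypothetical example, points to a past event rather than an ongoing problem.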
Packet drops on the Nexus switch happen because of various errors in the hardware. The drops
happen either at the line card or on the supervisor module itself. To view the various errors and
their counters across all the modules on a Nexus switch, use the command show hardware
internal errors [all | module slot]. Example 3-17 displays the hardware internal errors on the
Nexus 7000 switch. Note that the command is applicable for all Nexus platforms.
|------------------------------------------------------------------------|
| Device:Clipper XBAR Role:QUE Mod: 1 |
| Last cleared @ Wed May 31 12:59:42 2017
| Device Statistics Category :: ERROR
|------------------------------------------------------------------------|
|------------------------------------------------------------------------|
| Device:Clipper FWD Role:L2 Mod: 1 |
| Last cleared @ Wed May 31 12:59:42 2017
| Device Statistics Category :: ERROR
|------------------------------------------------------------------------|
! Output omitted for brevity
Note Each Nexus platform has different ASICs where errors or drops are observed.
However, these are outside the scope of this book. It is recommended to capture show
tech-support detail and tac-pac command output during problematic states, to identify
the platform-level problems leading to packet loss.
specifically designed to extend the architecture and functionality of the Nexus switches.
FEX is connected to Nexus 9000, 7000, 6000, and 5000 series parent switches. The uplink
ports connecting the FEX to the parent switch are called the Fabric ports or network-facing
interface (NIF) ports; the ports on the FEX module that connect the servers (front-panel
ports) are called the satellite ports or host-facing interface (HIF) ports. Cisco released FEX
models in three categories, according to their capabilities and capacity:
■ 1 GE Fabric Extender
■ N2224TP, 24 port
■ N2248TP, 48 port
■ N2248TP-E, 48 port
■ N2332TQ, 32 port
■ N2348TQ, 48 port
■ N2348TQ-E, 48 port
■ N2232TM, 32 port
■ N2232TM-E, 32 port
■ N2348UPQ, 48 port
■ N2248PQ, 48 port
■ N2232PP, 32 port
Note Compatibility between an FEX and its parent switch is based on the software
release notes of the software version being used on the Nexus switch.
Connectivity between the parent switch and an FEX occurs in three different modes:
■ Pinning: In pinning mode, a one-to-one mapping takes place between HIF ports
and uplink ports. Thus, traffic from a specific HIF port can traverse only a specific
uplink. Failures on uplink ports bring down the mapped HIF ports.
■ Port-channeling: In this mode, the uplink is treated as one logical interface. All the
traffic between the parent switch and FEX is hashed across the different links of
the port-channel.
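The difference between the pinning and port-channel modes can be sketched in a few lines of Python. This is a minimal illustration, not the FEX implementation: the contiguous-block pinning policy, the CRC-based hash, and the interface names are all assumptions chosen for the example.

```python
from zlib import crc32

def pin_hif_ports(hif_ports, uplinks):
    """Pin every HIF port to exactly one uplink (contiguous blocks, an assumed policy)."""
    per_uplink = -(-len(hif_ports) // len(uplinks))  # ceiling division
    return {hif: uplinks[i // per_uplink] for i, hif in enumerate(hif_ports)}

def hifs_down_on_failure(pinning, failed_uplink):
    """In pinning mode, the HIF ports mapped to a failed uplink go down with it."""
    return [hif for hif, uplink in pinning.items() if uplink == failed_uplink]

def port_channel_uplink(flow, active_uplinks):
    """In port-channel mode, a flow is hashed across whichever members are still up."""
    return active_uplinks[crc32(flow.encode()) % len(active_uplinks)]

# Hypothetical interface names, used only for illustration.
hifs = [f"Eth101/1/{n}" for n in range(1, 9)]
uplinks = ["Eth1/1", "Eth1/2"]

pinning = pin_hif_ports(hifs, uplinks)
down = hifs_down_on_failure(pinning, "Eth1/1")
surviving = port_channel_uplink("10.0.0.1->10.0.0.2", ["Eth1/2"])
```

With pinning, losing Eth1/1 takes down the four HIF ports mapped to it; with a port-channel, the same flow simply hashes onto the remaining member.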
Note Chapter 4, “Nexus Switching,” has more details on the supported and unsupported
FEX designs.
To enable FEX, NX-OS first requires installing the feature set using the command install
feature-set fex. Then the feature set for FEX must be enabled using the command
feature-set fex. If the FEX is being enabled on the Nexus 7000, the FEX feature set is
installed in the default VDC along with the command no hardware ip verify address
reserved; the feature-set fex then is configured under the relevant VDC. The command
no hardware ip verify address reserved is required only when the intrusion detection
system (IDS) reserved address check is enabled. This is verified using the command show
hardware ip verify. If the check is already disabled, the command no hardware ip verify
address reserved does not need to be configured.
When the feature-set fex is enabled, interfaces are configured as FEX fabric ports using the
command switchport mode fex-fabric. The next step is to assign an ID to the FEX, which is
then used to identify the FEX on the switch. Example 3-18 illustrates the configuration
on the Nexus switch for connecting to an FEX.
When FEX configuration is complete, the FEX is accessible on the parent switch and
its interfaces are available for further configuration. To verify the status of the FEX, use
the command show fex. This command shows the status of the FEX, along with the
FEX module number and the ID associated by the parent switch. To determine which
FEX interfaces are accessible on the parent switch, use the command show interface
interface-id fex-intf. Note that the interface-id in this command is the NIF port-channel
interface. Example 3-19 examines the output of the show fex and the show interface
fex-intf commands to verify the FEX status and its interfaces.
Further details on the FEX are viewed using the command show fex fex-number detail.
This command displays the status of the FEX and all the FEX interfaces. Additionally,
it displays the details of pinning mode and information regarding the FEX fabric ports.
Example 3-20 displays the detailed output of the FEX 101.
When the FEX satellite ports are available, configure them as either Layer 2 or Layer 3
ports; they also can act as active-active ports when they are made part of a vPC
configuration.
If issues arise with the fabric ports or the satellite ports, the state change information is
viewed using the command show system internal fex info fport [all | interface-number]
or show system internal fex info satport [all | interface-number]. Example 3-21 displays
the internal information of both the satellite and fabric ports on the Nexus 7000 switch.
In the first section of the output, the command displays a list of events that the system
goes through to bring up the FEX. It lists all the finite state machine events, which is use-
ful while troubleshooting in case the FEX does not come up and gets stuck in one of the
states. The second section of the output displays information about the satellite ports and
their status information.
Note If any issues arise with the FEX, it is useful to collect show tech-support fex
fex-number during the problematic state. The issue might also result from the Ethpm
component on Nexus as the FEX sends state change messages to Ethpm. Thus, capturing
the show tech-support ethpm output during problematic state could be relevant. Ethpm is
discussed later in this chapter.
■ Only users with the network-admin role can create a VDC and allocate resources to it.
Three primary kinds of VDCs are supported on the Nexus 7000 platform:
The VDC resource template is configured using the command vdc resource template
name. This puts you in resource template configuration mode, where you can limit the
resources previously mentioned by using the command limit-resource resource mini-
mum value maximum value, where resource can be any of the six listed resources. To
view the configured resources within a template, use the command show vdc resource
template [vdc-default | name], where vdc-default is for the default VDC template.
Example 3-22 demonstrates configuration of a VDC template and the show vdc resource
template command output displaying the configured resources within the template.
vdc-default
-------------
Resource Min Max
---------- ----- -----
monitor-rbs-product 0 12
monitor-rbs-filter 0 12
monitor-session-extended 0 12
monitor-session-mx-exception-src 0 1
monitor-session-inband-src 0 1
port-channel 0 768
monitor-session-erspan-dst 0 23
monitor-session 0 2
vlan 16 4094
anycast_bundleid 0 16
m6route-mem 5 20
m4route-mem 8 90
u6route-mem 4 4
u4route-mem 8 8
vrf 2 4096
N7K-1(config)# vdc resource template DEMO-TEMPLATE
N7K-1(config-vdc-template)# limit-resource port-channel minimum 1 maximum 4
N7K-1(config-vdc-template)# limit-resource vrf minimum 5 maximum 100
N7K-1(config-vdc-template)# limit-resource vlan minimum 20 maximum 200
N7K-1# show vdc resource template DEMO-TEMPLATE
DEMO-TEMPLATE
---------------
Resource Min Max
---------- ----- -----
vlan 20 200
vrf 5 100
port-channel 1 4
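When many templates are maintained, it can help to generate the limit-resource lines from data and reject impossible ranges before they reach the switch. The following Python sketch is one way to do that under stated assumptions: the ResourceLimit class and render_template function are hypothetical helpers, and the only NX-OS syntax used is the vdc resource template and limit-resource commands shown above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ResourceLimit:
    """One limit-resource entry (hypothetical helper, not an NX-OS object)."""
    resource: str
    minimum: int
    maximum: int

    def __post_init__(self):
        # Reject ranges that cannot be valid before any command is rendered.
        if not 0 <= self.minimum <= self.maximum:
            raise ValueError(f"{self.resource}: need 0 <= minimum <= maximum")

def render_template(name, limits):
    """Render the configuration lines for a VDC resource template."""
    lines = [f"vdc resource template {name}"]
    for limit in limits:
        lines.append(
            f"  limit-resource {limit.resource} "
            f"minimum {limit.minimum} maximum {limit.maximum}"
        )
    return lines

demo = render_template(
    "DEMO-TEMPLATE",
    [
        ResourceLimit("port-channel", 1, 4),
        ResourceLimit("vrf", 5, 100),
        ResourceLimit("vlan", 20, 200),
    ],
)
```

The rendered lines reproduce the DEMO-TEMPLATE configuration shown in the example; the sketch does not check platform-specific upper bounds, which vary by resource.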
If the network requires the VDCs on the Nexus switch to perform different tasks and to
have different kinds of resources allocated to them, it is better not to configure VDC
templates. Instead, limit the VDC resources using the limit-resource command under VDC
configuration mode.
Configuring VDC
VDC creation is broken down into four simple steps:
Step 1. Define a VDC. A VDC is defined using the command vdc name [id id] [type
Ethernet | storage]. By default, a VDC is created as an Ethernet VDC.
Step 2. Allocate interfaces. Single or multiple interfaces are allocated to a VDC. The
interfaces are allocated using the command allocate interface interface-id.
Note that the allocate interface configuration is mandatory; the interface
allocation cannot be negated. Interfaces are allocated only from one VDC to
another and cannot be released back to the default VDC. If the user deletes the
VDC, the interfaces also get unallocated and are then made part of VDC ID 0.
For 10G interfaces, some modules require all the ports tied to the same port
ASIC to be moved together. This retains the integrity of the port group, in
which each port group can switch between dedicated and shared mode. An error
message is displayed if not all members of the same port group are allocated
together. Beginning with NX-OS Release 5.2(1), all members of a port group
are automatically allocated to the VDC when only one member of the port
group is added to the VDC.
Step 3. Define the HA policy. The high availability (HA) policy is determined based
on whether Nexus is running on a single supervisor or a dual supervisor card.
The HA policy is configured using the command ha-policy [single-sup | dual-
sup] policy under the VDC configuration. Table 3-3 lists the different HA
policies based on single or dual supervisor cards.
Example 3-23 demonstrates the configuration of an Ethernet VDC. Notice that
if a particular interface is added to the VDC and other members of the port group are
not part of the list, NX-OS automatically tries to add the remaining ports to the VDC.
The VDC defined in Example 3-23 is limited to F3 series modules; for instance, adding
ports from an F2 or M2 series module would result in an error.
N7K-1(config-vdc)#
N7K-1(config-vdc)# limit-resource module-type f3
This will cause all ports of unallowed types to be removed from this vdc. Continue
(y/n)? [yes] yes
N7K-1(config-vdc)# allocate interface ethernet 3/1
Entire port-group is not present in the command. Missing ports will be included
automatically
Additional Interfaces Included are :
Ethernet3/2
Ethernet3/3
Ethernet3/4
Ethernet3/5
Ethernet3/6
Ethernet3/7
Ethernet3/8
Moving ports will cause all config associated to them in source vdc to be removed.
Are you sure you want to move the ports (y/n)? [yes] yes
N7K-1(config-vdc)# ha-policy dual-sup ?
bringdown Bring down the vdc
restart Bring down the vdc, then bring the vdc back up
switchover Switchover the supervisor
N7K-1(config-vdc)# ha-policy dual-sup restart
N7K-1(config-vdc)# ha-policy single-sup bringdown
N7K-1(config-vdc)# limit-resource port-channel minimum 3 maximum 5
N7K-1(config-vdc)# limit-resource vlan minimum 20 maximum 100
N7K-1(config-vdc)# limit-resource vrf minimum 5 maximum 10
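The automatic port-group expansion seen in the output above can be anticipated before the allocate interface command is issued. The Python sketch below assumes contiguous port groups of eight ports, which matches the Ethernet3/1 through Ethernet3/8 grouping in this example only; the actual group size depends on the module, and both function names are hypothetical.

```python
def port_group_members(port, group_size=8):
    """Return all port numbers in the group that contains `port`.

    Assumes contiguous groups starting at port 1; group_size=8 is taken
    from the example output and is not valid for every module.
    """
    first = ((port - 1) // group_size) * group_size + 1
    return list(range(first, first + group_size))

def additional_interfaces(slot, requested_ports, group_size=8):
    """List the interfaces that would be pulled in alongside the requested ones."""
    requested = set(requested_ports)
    needed = set()
    for port in requested:
        needed.update(port_group_members(port, group_size))
    return [f"Ethernet{slot}/{p}" for p in sorted(needed - requested)]

extra = additional_interfaces(3, [1])
```

For a request of Ethernet3/1 alone, the sketch lists Ethernet3/2 through Ethernet3/8, the same additional interfaces that NX-OS reports.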
VDC Initialization
A VDC is initialized before VDC-specific configuration is applied. After the VDC is
created and before it is initialized, perform a copy run start so that the newly created
VDC is part of the startup configuration. The VDC is initialized using the switchto vdc
name command from the default or admin VDC (see Example 3-24). The initialization process
of the VDC has steps similar to those seen when a new Nexus switch is brought up. It
prompts for the admin password and then the basic configuration dialog. Use the dialog to
perform the basic configuration setup for the VDC, or configure the VDC manually by
replying no to the basic configuration dialog. The command switchback is used to switch
back to the default or admin VDC.
This setup utility will guide you through the basic configuration of
the system. Setup configures only enough connectivity for management
of the system.
Would you like to enter the basic configuration dialog (yes/no): yes
In Example 3-24, after the VDC is initialized, the hostname of the VDC is seen as
N7k-1-N7k-2; that is, the hostnames of both the default VDC and the new VDC are
concatenated. To avoid this behavior, configure the command no vdc combined-hostname
in the default or admin VDC.
VDCs also support in-band management. A VDC is accessed using one of the Ethernet
interfaces that are allocated to the VDC. In-band management uses separate management
networks, which ensures separation of the AAA servers and syslog servers among the VDCs.
VDC Management
NX-OS software provides a CLI to easily manage the VDCs when troubleshooting problems.
The configuration of all the VDCs is seen from the default or admin VDC. Use
the command show run vdc to view all the VDC-related configuration. Additionally,
when saving the configuration, use the command copy run start vdc-all to save the
configuration of all VDCs.
NX-OS provides a CLI to view further details of the VDC without looking at the config-
uration. Use the command show vdc [detail] to view the details of each VDC. The show
vdc detail command displays various lists of information for each VDC, such as ID, name,
state, HA policy, CPU share, creation time and uptime of the VDC, VDC type, and line
cards supported by each VDC (see Example 3-25). On a Nexus 7000 switch, some VDCs
might be running critical services. By default, NX-OS allocates an equal CPU share (CPU
resources) to all the VDCs. On SUP2 and SUP2E supervisor cards, NX-OS allows users
to allocate a specific amount of the switch’s CPU, to prioritize more critical VDCs.
vdc id: 1
vdc name: N7k-1
vdc state: active
vdc mac address: 50:87:89:4b:c0:c1
vdc ha policy: RELOAD
vdc dual-sup ha policy: SWITCHOVER
vdc boot Order: 1
CPU Share: 5
CPU Share Percentage: 50%
vdc create time: Fri Apr 21 05:57:30 2017
vdc reload count: 0
vdc uptime: 1 day(s), 0 hour(s), 35 minute(s), 41 second(s)
vdc restart count: 1
vdc id: 2
vdc name: N7k-2
vdc state: active
vdc mac address: 50:87:89:4b:c0:c2
vdc ha policy: RESTART
vdc dual-sup ha policy: SWITCHOVER
vdc boot Order: 1
CPU Share: 5
CPU Share Percentage: 50%
vdc create time: Sat Apr 22 05:05:59 2017
vdc reload count: 0
vdc uptime: 0 day(s), 1 hour(s), 28 minute(s), 12 second(s)
vdc restart count: 1
vdc restart time: Sat Apr 22 05:05:59 2017
vdc type: Ethernet
vdc supported linecards: f3
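When this output is collected from several chassis, a small parser makes the fields easy to compare. The Python sketch below splits the key and value on the first colon, so MAC addresses and timestamps survive intact. The sample is an abridged copy of the output above; the proportional-share formula in cpu_share_percentages is an assumption that happens to agree with the 50% values shown, and both function names are hypothetical.

```python
SAMPLE = """\
vdc id: 1
vdc name: N7k-1
vdc state: active
vdc mac address: 50:87:89:4b:c0:c1
vdc ha policy: RELOAD
CPU Share: 5
CPU Share Percentage: 50%
vdc id: 2
vdc name: N7k-2
vdc state: active
vdc mac address: 50:87:89:4b:c0:c2
vdc ha policy: RESTART
CPU Share: 5
CPU Share Percentage: 50%
"""

def parse_vdc_detail(text):
    """Return one dict per VDC; a new record starts at every 'vdc id' line."""
    vdcs = []
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split on the first colon only
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "vdc id":
            vdcs.append({})
        if vdcs:
            vdcs[-1][key] = value
    return vdcs

def cpu_share_percentages(vdcs):
    """Assumed model: each VDC receives share / sum(shares) of the CPU."""
    total = sum(int(v["CPU Share"]) for v in vdcs)
    return {v["vdc name"]: 100 * int(v["CPU Share"]) / total for v in vdcs}

vdcs = parse_vdc_detail(SAMPLE)
shares = cpu_share_percentages(vdcs)
```

The calculation is meaningful only when the parsed text contains every VDC on the chassis, because the total is taken over the records that were parsed.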
To further view the details of resources allocated to each VDC, use the command show
vdc resource [detail]. This command displays the configured minimum and maximum
values and the used, unused, and available values for each resource. The output can be
limited to an individual VDC using the command show vdc name resource [detail].
Example 3-26 displays the resource configuration and utilization for each VDC on a
Nexus 7000 chassis running two VDCs (N7k-1 and N7k-2).
u4route-mem 2 used 102 unused 514 free 412 avail 516 total
-------------
Vdc Min Max Used Unused Avail
--- --- --- ---- ------ -----
N7k-1 96 96 1 95 95
N7k-2 8 8 1 7 7
! Output omitted for brevity
Based on the kind of line cards the VDC supports, interfaces are allocated to each VDC.
To view the member interfaces of each VDC, use the command show vdc membership.
Example 3-27 displays the output of the show vdc membership command. In Example 3-27,
notice the various interfaces that are part of VDC 1 (N7k-1) and VDC 2 (N7k-2). If a par-
ticular VDC is deleted, the interfaces become unallocated and are thus shown under the
VDC ID 0.
NX-OS also provides internal event history logs to view errors or messages related to a
VDC. Use the command show vdc internal event-history [errors | msgs | vdc_id id] to
view the debugging information related to VDCs. Example 3-28 demonstrates creating
a new VDC (N7k-3) and shows relevant event history logs that display events the VDC
creation process goes through before the VDC is created and active for use. The events in
Example 3-28 show the VDC creation in progress and then show that it becomes active.
Note If a problem arises with a VDC, collect the show tech-support vdc and show tech-
support detail command output during problematic state to open a TAC case.
■ If M2 module interfaces are working with M3 module interfaces, interfaces from the
M2 module cannot be allocated to the other VDC.
■ If interfaces from both M2 and M3 series line cards are present in the VDC, the M2
module must operate in M2-M3 interop mode.
■ If interfaces from both F2E and M2 series line cards are present in the VDC, the M2
module must operate in M2-F2E mode.
The M2 series line cards support both M2-F2E and M2-M3 interop modes, with the
default being M2-F2E mode. M3 series line cards, on the other hand, support M2-M3
interop mode only. To allocate interfaces from both M2 and M3 modules that are part of
the same VDC, use the command system interop-mode m2-m3 module slot to change the
operating mode of M2 line cards to M2-M3. Use the no option to disable M2-M3 mode
and fall back to the default M2-F2E mode on the M2 line card.
To support both M and F2E series modules in the same VDC, F2E series modules oper-
ate in proxy mode. In this mode, all Layer 3 traffic is sent to the M series line card in the
same VDC.
Table 3-4 reinforces which module type mix is supported on Ethernet VDCs.
Note For more details on supported module combinations and the behavior of modules
running in different modes, refer to the CCO documentation listed in the “References”
section, at the end of the chapter.
Troubleshooting PD issues requires having knowledge about not only various system
components but also dependent services or components. For instance, Route Policy
Manager (RPM) is a process that is dependent on the Address Resolution Protocol (ARP)
and Netstack processes (see Example 3-29). These processes are further dependent on
other processes. The hierarchy of dependency is viewed using the command show
system internal sysmgr service dependency srvname name.
Of course, knowledge of all components is not possible, but problem isolation becomes
easier with knowledge of some primary system components that perform major tasks in
the NX-OS platforms. This section focuses on some of these primary components:
■ Forwarding components
■ Unicast Routing Information Base (URIB), Unicast Forwarding Information Base
(UFIB), and Unicast Forwarding Distribution Manager (UFDM)
■ High performance and low latency (provides low latency for exchanging messages
between processes)
■ Buffer management (manages the buffer for respective processes that are queued up
to be delivered to other processes)
■ Message delivery
MTS guarantees independent process restarts, so a restarting process does not impact
other client or nonclient processes running on the system, and it ensures that the
messages from other processes are received after a restart.
A physical switch can be partitioned to multiple VDCs for resource partitioning, fault
isolation, and administration. One of the main features of the NX-OS infrastructure is
to make virtualization transparent to the applications. MTS provides this virtualization
transparency using the virtual node (vnode) concept and an architecturally clean com-
munication model. With this concept, an application thinks that it is running on a switch,
with no VDC.
MTS works by allocating a predefined chunk of system memory when the system boots
up. This memory exists in the kernel address space. When applications start up, the mem-
ory gets automatically mapped to the application address space. When an application
tries to send some data to the queue, MTS makes one copy of the data and copies the
payload into a buffer. It then posts a reference to the buffer into the application’s receive
queue. When the application tries to read its queue, it gets a reference to the payload,
which it reads directly as it’s already mapped in its address space.
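The buffer-and-reference mechanism can be pictured with a toy model. The Python sketch below is not how MTS is implemented; it only mirrors the description above, with a dictionary standing in for the preallocated shared memory and per-SAP queues that hold references rather than payloads. The class name and the SAP number are hypothetical.

```python
from collections import deque
from itertools import count

class ToyMts:
    """Toy model: payloads are stored once, and queues carry only references."""

    def __init__(self):
        self._buffers = {}        # stands in for the shared memory region
        self._queues = {}         # SAP number -> deque of buffer references
        self._next_ref = count(1)

    def bind(self, sap):
        """A process binds to a SAP before it can receive messages."""
        self._queues.setdefault(sap, deque())

    def send(self, dst_sap, payload):
        """Store the payload once, then post a reference to the receive queue."""
        ref = next(self._next_ref)
        self._buffers[ref] = bytes(payload)  # bytes() copies mutable input
        self._queues[dst_sap].append(ref)
        return ref

    def receive(self, sap):
        """Dequeue the oldest reference and read the payload it points to."""
        ref = self._queues[sap].popleft()
        return self._buffers.pop(ref)

    def pending(self, sap):
        return len(self._queues[sap])

URIB_SAP = 5000                   # hypothetical SAP number for the receiver
mts = ToyMts()
mts.bind(URIB_SAP)
mts.send(URIB_SAP, bytearray(b"prefix=10.1.1.0/24 nexthop=10.0.0.2"))
queued = mts.pending(URIB_SAP)
message = mts.receive(URIB_SAP)
```

A queue whose pending count never returns to zero in this model corresponds to a receiver that has stopped draining its messages.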
Consider a simple example. OSPF learns a new route from an LSA update from its adja-
cent neighbor. The OSPF process requires that the route be installed in the routing table.
The OSPF process puts the needed information (prefix, next hop, and so on) into an MTS
message, which it then sends to URIB. In this example, MTS is taking care of exchanging
the information between the OSPF and the URIB components.
MTS facilitates the interprocess communication using Service Access Points (SAP) to
allow services to exchange messages. Each card in the switch has at least one instance
of MTS running, also known as the MTS domain. The node address is used to identify
which MTS domain is involved in processing a message. The MTS domain is a kind of
logical node that provides services only to the processes inside that domain. Inside the
MTS domain, a SAP represents the address used to reach a service. A process needs to
bind to a SAP before it communicates with another SAP. SAPs are divided into three
categories:
Note A client is required to know the server’s SAP (usually a static SAP) to communicate
with the server.
An MTS address is divided into two parts: a 4-byte node address and a 2-byte SAP
number. Because an MTS domain provides services to the processes associated with
that domain, the node address in the MTS address is used to decide the destination
MTS domain. Thus, the SAP number resides in the MTS domain identified by the node
address. If the Nexus switch has multiple VDCs, each VDC has its own MTS domain; this
is reflected as SUP for VDC1, SUP-1 for VDC2, SUP-2 for VDC3, and so on.
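The two-part address layout can be illustrated with Python's struct module. Only the field sizes (a 4-byte node address followed by a 2-byte SAP number) come from the text; the network byte order and the node address value used here are assumptions for the example.

```python
import struct

# 4-byte node address followed by a 2-byte SAP number.
# "!" selects network byte order with no padding (byte order is an assumption).
_MTS_ADDRESS = struct.Struct("!IH")

def pack_mts_address(node, sap):
    """Pack a node address and SAP number into a 6-byte MTS address."""
    return _MTS_ADDRESS.pack(node, sap)

def unpack_mts_address(raw):
    """Split a 6-byte MTS address into (node address, SAP number)."""
    return _MTS_ADDRESS.unpack(raw)

EXAMPLE_NODE = 0x00000101         # hypothetical node address
address = pack_mts_address(EXAMPLE_NODE, 2938)
node, sap = unpack_mts_address(address)
```

Because the SAP field is 2 bytes wide, SAP numbers above 65535 cannot be represented, and struct raises an error if one is supplied.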
MTS also has various operational codes to identify different kinds of payloads in the
MTS message:
Various symptoms can indicate problems with MTS, and different symptoms mean different
problems. If a feature or process is not performing as expected, high CPU is noticed
on the Nexus switch, or ports are bouncing on the switch for no reason, then MTS
messages might be stuck in a queue. The easiest check is the MTS buffer utilization,
using the command show system internal mts buffers summary. This output needs to be
taken several times to see which queues are not clearing. Example 3-30 demonstrates how
the MTS buffer summary looks when the queues are not clearing. The process with SAP
number 2938 seems to be stuck because the messages remain in the receive queue; the
other process, with SAP number 2592, seems to have cleared the messages from its
receive queue.
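Comparing repeated snapshots by eye is error prone, so the comparison is worth scripting. The Python sketch below assumes each snapshot has already been reduced to a mapping of SAP number to receive-queue depth; it does not parse the command output itself. The depths are invented for illustration, and only the two SAP numbers come from the discussion above.

```python
def stuck_saps(snapshots):
    """Return SAPs whose receive queue is nonzero and never shrinks.

    `snapshots` is a list of {sap: receive_queue_depth} dicts taken at
    different times, oldest first.
    """
    if len(snapshots) < 2:
        raise ValueError("need at least two snapshots to compare")
    # Only SAPs present in every snapshot can be judged.
    common = set(snapshots[0])
    for snapshot in snapshots[1:]:
        common &= set(snapshot)
    stuck = []
    for sap in sorted(common):
        depths = [snapshot[sap] for snapshot in snapshots]
        never_shrinks = all(later >= earlier
                            for earlier, later in zip(depths, depths[1:]))
        if depths[0] > 0 and never_shrinks:
            stuck.append(sap)
    return stuck

# Invented queue depths for the two SAPs discussed in the text.
snapshots = [
    {2938: 367, 2592: 2},
    {2938: 380, 2592: 0},
    {2938: 391, 2592: 0},
]
suspects = stuck_saps(snapshots)
```

A SAP flagged this way is only a candidate; the age of its oldest queued message is the stronger indication that the receiver has stopped processing.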
Messages stuck in the queue lead to various impacts on the device. For instance, if the
device is running BGP, you might randomly see BGP flaps or BGP peering not even com-
ing up, even though the BGP peers might have reachability and correct configuration.
Alternatively, the user might not be able to perform a configuration change, such as add-
ing a new neighbor configuration.
After determining that the messages are stuck in one of the queues, identify the process
associated with the SAP number. The command show system internal mts sup sap
sapno description obtains this information. The same information also can be viewed
from the sysmgr output using the command show system internal sysmgr service all.
For details about all the queued messages, use the command show system internal mts
buffers detail. Example 3-31 displays the description of the SAP 2938, which shows the
statsclient process. The statsclient process is used to collect statistics on supervisor or
line card modules. The second section of the output displays all the messages present in
the queue.
Note The SAP description information in Example 3-31 is taken from the default VDC.
For the information on a nondefault VDC, use the command show system internal mts
node sup-[vnode-id] sap sapno description.
The first and most important field to check in the previous output is the SAP number and
its age. If the duration of the message stuck in the queue is fairly long, those messages
need to be investigated; they might be causing services to misbehave on the Nexus plat-
form. The other field to look at is OPC, which refers to the operational code. After the
messages in the queue are verified from the buffers detail output, use the command show
system internal sup opcodes to determine the operational code associated with the mes-
sage, to understand the state of the process.
SAP statistics are also viewed to verify different queue limits of various SAPs and to
check the maximum queue limit that a process has reached. This is done using the com-
mand show system internal mts sup sap sapno stats (see Example 3-32).
msg rx: 30
byte rx: 6883
Along with these verification checks, MTS error messages are seen in OBFL logs or sys-
logs. When the MTS queue is full, the error logs in Example 3-33 appear. Use the com-
mand show logging onboard internal kernel to ensure that no error logs are reported as a
result of MTS.
The MTS errors are also reported in the MTS event history logs and can be viewed using
the command show system internal mts event-history errors.
If the MTS queue is stuck or an MTS buffer leak is observed, performing a supervisor
switchover clears the MTS queues and helps recover from service outages caused by a
stuck MTS queue.
Note If SAP number 284 appears in the MTS buffer queue, ignore it: It belongs to the
TCPUDP process client and is thus expected.
component runs as a separate process with multiple threads. In-band packets and fea-
tures specific to NX-OS, such as vPC- and VDC-aware capabilities, must be processed
in software. Netstack is the NX-OS component in charge of processing software-
switched packets. As stated earlier, the Netstack process has three main roles:
Netstack is made up of both Kernel Loadable Module (KLM) and user space components.
The user space components are VDC local processes containing Packet Manager, which
is the Layer 2 processing component; IP Input, the Layer 3 processing component; and
TCP/UDP functions, which handle the Layer 4 packets. The Packet Manager (PktMgr)
component is mostly isolated from IP Input and TCP/UDP, even though they share the
same process space. Figure 3-1 displays the Netstack architecture and the components
that are part of the KLM and user space.
[Figure 3-1 Netstack Architecture. User space: AM, ARP, TCP, IP, ICMPv6, IPv6, Layer 2. Kernel space: Packet KLM.]
allowing multiple instances of the Netstack process (one per each VDC) and restart-
ability in case of a process crash.
The PktMgr is the lower-level component within the Netstack architecture that takes care
of processing all in-band or management frames received from and sent to KLM. The
PktMgr demultiplexes the packets based on Layer 2 (L2) packets and platform header
information and passes them to the L2 clients. It also dequeues packets from L2 clients
and sends the packets out the appropriate driver. All the L2 or non-IP protocols, such
as Spanning Tree Protocol (STP), Cisco Discovery Protocol (CDP), Unidirectional Link
Detection (UDLD), Cisco Fabric Services (CFS), Link Aggregation Control Protocol
(LACP), and ARP, register directly with PktMgr. IP protocols register directly with the IP
Input process.
The Netstack process runs on the supervisor, so the following packets are sent to the
supervisor for processing:
■ Exception packets
■ Glean adjacency
■ Supervisor-terminated packets
The Netstack process is stateful across restarts and switchovers. The Netstack process
depends on Unicast Routing Information Base (URIB), IPv6 Unicast Routing Information
Base (U6RIB), and the Adjacency Manager (ADJMGR) process for bootup. Netstack uses
a CLI server process to restore the configuration and uses persistent storage services
(PSS) to restore the state of processes that were restarted. It uses RIB shared memory for
performing L3 lookup; it uses an AM shared database (SDB) to perform the L3-to-L2
lookup. For troubleshooting purposes, Netstack provides various internal show commands
and debugs that can help determine problems with different processes bound to
Netstack:
■ Packet Manager
■ IP/IPv6
■ TCP/UDP
■ ARP
■ Adjacency Manager (AM)
To understand the workings of the Packet Manager component, consider an example with
ICMPv6. ICMPv6 is a client of PktMgr. When the ICMPv6 process first initializes, it
registers with PktMgr and is assigned a client ID and control (Ctrl) SAP ID and Data SAP
ID. MTS handles communication between the PktMgr and ICMPv6. The Rx traffic from
PktMgr toward ICMPv6 is handed off to MTS with the destination of the data SAP ID.
The Tx traffic from ICMPv6 toward PktMgr is sent to the Ctrl SAP ID. PktMgr receives
the frame from ICMPv6, builds the correct header, and sends it to KLM to transport to the
hardware.
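The registration and the two traffic directions can be modeled in a few lines. The Python sketch below is a toy, not the PktMgr implementation: the class names, the way IDs are assigned, and the SAP numbering are all hypothetical, and only the roles of the Ctrl and Data SAP IDs follow the description above.

```python
from dataclasses import dataclass
from itertools import count

@dataclass(frozen=True)
class PktMgrClient:
    name: str
    client_id: int
    ctrl_sap: int   # destination for Tx traffic (client -> PktMgr)
    data_sap: int   # destination for Rx traffic (PktMgr -> client)

class ToyPktMgr:
    """Toy registry; real client IDs and SAP numbers are assigned by NX-OS."""

    def __init__(self, first_sap=1000):
        self._ids = count(1)
        self._saps = count(first_sap)   # hypothetical SAP numbering
        self.clients = {}

    def register(self, name):
        """Assign a client ID, a Ctrl SAP ID, and a Data SAP ID to a new client."""
        client = PktMgrClient(name, next(self._ids),
                              next(self._saps), next(self._saps))
        self.clients[name] = client
        return client

    def rx_destination(self, name):
        """Rx: PktMgr hands the packet to MTS, addressed to the Data SAP."""
        return self.clients[name].data_sap

    def tx_destination(self, name):
        """Tx: the client sends toward PktMgr using the Ctrl SAP."""
        return self.clients[name].ctrl_sap

pktmgr = ToyPktMgr()
icmpv6 = pktmgr.register("icmpv6")
```

Keeping the two SAP roles distinct matters during troubleshooting, because the SAP that shows queued messages indicates which direction is affected.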
To troubleshoot any of the PktMgr clients, identify the processes that are clients of
the PktMgr component. This is done by issuing the command show system internal pktmgr
client. This command returns the UUIDs and the Ctrl SAP ID for the PktMgr clients. The
next step is to view the processes under the Service Manager, to get the information on
the respective Universally Unique Identifier (UUID) and SAP ID. Example 3-35 illustrates
these steps. When the correct process is identified, use the command show system
internal pktmgr client uuid to verify the statistics for the PktMgr client, including drops.
If the packets being sent to the supervisor are from a particular interface, verify the
PktMgr statistics for the interface using the command show system internal pktmgr
interface interface-id (see Example 3-36). This example explicitly shows how many
unicast, multicast, and broadcast packets were sent and received.
PktMgr accounting (statistics) is useful in determining whether any low-level drops are
occurring because of bad encapsulation or other kernel interaction issues. This is verified
using the command show system internal pktmgr stats [brief] (see Example 3-37). This
command shows the PktMgr driver interface to the KLM. The omitted part of the output
also shows details about other errors and the management driver.
--------------------------------------------
Driver:
--------------------------------------------
State: Up
Filter: 0x0
For IP processing, Netstack queries the URIB (that is, the routing table) and all other
necessary components, such as the Route Policy Manager (RPM), to make a forwarding
decision for the packet. Netstack performs all the accounting in the show ip traffic
command output. The IP traffic statistics are used to track fragmentation, Internet
Control Message Protocol (ICMP), TTL, and other exception packets. This command also
displays the RFC 4293 traffic statistics. An easy way to figure out whether the IP packets
are hitting the NX-OS Netstack component is to observe the statistics for exception
punted traffic, such as fragmentation. Example 3-38 illustrates the different sections of
the show ip traffic command output.
■ TCP
■ UDP
Consider now how TCP socket creation happens on NX-OS. When it receives the TCP
SYN packet, Netstack builds a stub INPCB entry into the hash table. The partial informa-
tion is then populated into the protocol control block (PCB). When the TCP three-way
handshake is completed, all TCP socket information is populated to create a full socket.
This process is verified by viewing the output of the debug command debug sockets
tcp pcb. Example 3-39 illustrates the socket creation and Netstack interaction with the
help of the debug command. From the debug output, notice that when the SYN packet
is received, it gets added into the cache; when the three-way handshake completes, a full-
blown socket is created.
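The two-stage behavior described above, a stub entry on SYN and a full socket only after the handshake completes, can be sketched as a small state model. This Python example is a simplification under stated assumptions: the class, the dictionary fields, the state names, and the addresses are hypothetical and do not reflect Netstack's internal data structures.

```python
class ToyTcpListener:
    """Toy model of stub-then-full socket creation."""

    def __init__(self):
        self.syn_cache = {}   # stub entries created when a SYN arrives
        self.sockets = {}     # full sockets created after the handshake

    def on_syn(self, four_tuple):
        """A received SYN only creates a stub entry with partial information."""
        self.syn_cache[four_tuple] = {"state": "SYN_RCVD"}

    def on_handshake_complete(self, four_tuple, mss):
        """The final ACK promotes the stub to a fully populated socket."""
        stub = self.syn_cache.pop(four_tuple, None)
        if stub is None:
            return False      # no matching SYN was seen; nothing to promote
        self.sockets[four_tuple] = {"state": "ESTABLISHED", "mss": mss}
        return True

# Hypothetical BGP session: (local IP, local port, remote IP, remote port).
session = ("192.0.2.1", 179, "192.0.2.2", 41000)
listener = ToyTcpListener()
listener.on_syn(session)
half_open = session in listener.syn_cache and session not in listener.sockets
promoted = listener.on_handshake_complete(session, mss=1460)
```

A connection that stays in the stub stage in this model corresponds to a handshake that never completes, which is the condition the debug output helps to confirm.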
Necessary details of the TCP socket connection are verified using the command show
sockets connection tcp [detail]. The output with the detail option provides information
such as TCP windowing information, the MSS value for the session, and the socket state.
The output also provides the MTS SAP ID. If the TCP socket is having a problem, look
up the MTS SAP ID in the buffer to see whether it is stuck in a queue. Example 3-40 dis-
plays the socket connection details for BGP peering between two routers.
Netstack socket clients are monitored with the command show sockets client detail.
This command explains the socket client behavior and shows how many socket library
calls the client has made. This command is useful in identifying issues a particular socket
client is facing because it also displays the Errors section, where errors are reported for a
problematic client. As Example 3-41 illustrates, the output displays two clients, syslogd
and bgp. The output shows the associated SAP ID with the client and statistics on how
many socket calls the process has made. The Errors section is empty because no errors
are seen for the displayed sockets.
Netstack also has an accounting capability that gives statistics on UDP, TCP, raw sockets,
and internal tables. The Netstack socket statistics are viewed using the command show
sockets statistics all. This command helps view TCP drops, out-of-order packets, or
duplicate packets; the statistics are maintained on a per-Netstack instance basis. At the
end of the output, statistics and error counters are also viewed for INPCB and IN6PCB
tables. The table statistics provide insight into how many socket connections are being
created and deleted in Netstack. The Errors part of the INPCB or IN6PCB table indicates
a problem while allocating socket information. Example 3-42 displays the Netstack socket
accounting statistics.
TCP v4 Received:
402528 total packets received, 203911 packets received in sequence,
3875047 bytes received in sequence, 8 out-of-order packets received,
10 rcvd duplicate acks, 208189 rcvd ack packets,
3957631 bytes acked by rcvd acks, 287 Dropped no inpcb,
203911 Fast recv packets enqueued, 16 Fast TCP can not recv more,
208156 Fast TCP data ACK to app,
TCP v4 Sent:
406332 total packets sent, 20 control (SYN|FIN|RST) packets sent,
208162 data packets sent, 3957601 data bytes sent,
198150 ack-only packets sent,
INPCB Statistics:
in_pcballoc: 38 in_pcbbind: 9
in_pcbladdr: 18 in_pcbconnect: 14
in_pcbdetach: 19 in_pcbdetach_no_rt: 19
in_setsockaddr: 13 in_setpeeraddr: 14
in_pcbnotify: 1 in_pcbinshash_ipv4: 23
in_pcbinshash_ipv6: 5 in_pcbrehash_ipv4: 18
in_pcbremhash: 23
INPCB Errors:
IN6PCB Statistics:
in6_pcbbind: 5
in6_pcbdetach: 4 in6_setsockaddr: 1
in6_pcblookup_local: 2
IN6PCB Errors:
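The counters in this output lend themselves to a scripted check. The following is a minimal Python sketch, not a Cisco tool. It assumes the output of show sockets statistics all has been saved as text in the comma-separated "value label" layout shown above. The function name parse_socket_counters and the choice of labels to watch are illustrative assumptions.

```python
import re

# Counter labels worth watching, copied from the sample output above.
WATCHED = ("out-of-order packets received", "Dropped no inpcb")

def parse_socket_counters(text):
    """Return {label: value} for every '<number> <label>' item in the text."""
    counters = {}
    for line in text.splitlines():
        for item in line.split(","):
            match = re.match(r"\s*(\d+)\s+(\S.*?)\s*$", item)
            if match:
                counters[match.group(2)] = int(match.group(1))
    return counters

sample = """\
TCP v4 Received:
402528 total packets received, 203911 packets received in sequence,
3875047 bytes received in sequence, 8 out-of-order packets received,
10 rcvd duplicate acks, 208189 rcvd ack packets,
3957631 bytes acked by rcvd acks, 287 Dropped no inpcb,
"""

counters = parse_socket_counters(sample)
for label in WATCHED:
    print(f"{label}: {counters.get(label, 0)}")
```

Taking two snapshots a few minutes apart and comparing the parsed values is usually more informative than a single reading, because the counters are cumulative.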
Multiple clients (ARP, STP, BGP, EIGRP, OSPF, and so on) interact with the Netstack com-
ponent. Thus, while troubleshooting control plane issues, if you are able to see the packet
in Ethanalyzer but the packet is not received by the client component itself, the issue might
be related to the Netstack or the Packet Manager (Pktmgr). Figure 3-2 illustrates the control
plane packet flow and placement of the Netstack and Pktmgr components in the system.
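[Figure 3-2: control plane packet flow (image not reproduced). Labels in the figure: ARP, OTV, STP, LACP, OSPF, EIGRP, BGP, MSDP, IPv4/IPv6, Packet Manager, Netstack, Inband/Ethanalyzer, CoPP, HW Rate-Limiters, Line Card.]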
Note If an issue arises with any Netstack component or Netstack component clients,
such as OSPF or TCP failure, collect output from the commands show tech-support net-
stack and show tech-support pktmgr, along with the relevant client show tech-support
outputs, to aid in further investigation by the Cisco TAC.
■ Punts the glean adjacency packets to the CPU, which then triggers ARP resolution
■ Has clients listening for ARP packets such as ARP snooping, HSRP, VRRP, and GLBP
All the messaging and communication with the ARP component happens with the help
of MTS. ARP packets are sent to PktMgr via MTS. The ARP component does not sup-
port the Reverse ARP (RARP) feature, but it does support features such as proxy ARP,
local proxy ARP, and sticky ARP.
Note If the router receives packets destined to another host in the same subnet and local
proxy ARP is enabled on the interface, the router does not send the ICMP redirect mes-
sages. Local proxy ARP is disabled by default.
If the Sticky ARP option is set on an interface, any new ARP entries that are learned are
marked so that they are not overwritten by a new adjacency (for example, gratuitous ARP).
These entries also do not get aged out. This feature helps prevent a malicious user from
spoofing an ARP entry.
Glean adjacencies can cause packet loss and also cause excessive packets to get punted to
CPU. Understanding the treatment of packets when a glean adjacency is seen is vital. Let’s
assume that a switch receives IP packets where the next hop is a connected network. If an
ARP entry exists but no host route (/32 route) is installed in the FIB or in the AM shared
database, the FIB lookup points to glean adjacency. The glean adjacency packets are rate-
limited. If no network match is found in FIB, packets are silently dropped in hardware
(known as a FIB miss).
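The three outcomes described above (normal forwarding, glean, and FIB miss) can be pictured with a toy model. The Python sketch below is illustrative only: it considers just host routes and connected networks, whereas a real FIB performs a longest-prefix match over all routes. The function name classify and the sample addresses are assumptions.

```python
import ipaddress

def classify(dest, host_routes, connected_networks):
    """Toy model of the lookup outcome for a destination IP address."""
    addr = ipaddress.ip_address(dest)
    if addr in host_routes:
        return "forward"    # a /32 adjacency is programmed
    if any(addr in net for net in connected_networks):
        return "glean"      # punted to the CPU, subject to rate-limiting
    return "fib-miss"       # no matching network: dropped in hardware

host_routes = {ipaddress.ip_address("10.1.12.10")}
connected = [ipaddress.ip_network("10.1.12.0/24")]

for dest in ("10.1.12.10", "10.1.12.2", "192.0.2.1"):
    print(dest, classify(dest, host_routes, connected))
```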
To protect the CPU from high bandwidth flows with no ARP entries or adjacencies
programmed in hardware, NX-OS provides rate-limiters for glean adjacency traffic on
Nexus 7000 and 9000 platforms. The configuration for the preset hardware rate-limiters
for glean adjacency traffic is viewed using the command show run all | include glean.
Example 3-43 displays the hardware rate-limiters for glean traffic.
The control plane installs a temporary adjacency drop entry in hardware while ARP is
being resolved. All subsequent packets are dropped in hardware until ARP is resolved.
The temporary adjacency remains until the glean timer expires. When the timer expires,
the normal process of punt/drop starts again.
The ARP entries on the NX-OS are viewed using the command show ip arp [interface-
type interface-num]. The command output shows not only the learned ARP entries but
also the glean entries, which are marked as incomplete. Example 3-44 displays the ARP
table for VLAN 10 SVI interface with both learned ARP entry and INCOMPLETE entry.
IP ARP Table
Total number of entries: 2
Address Age MAC Address Interface
10.1.12.10 00:10:20 5087.894b.bb41 Vlan10
10.1.12.2 00:00:09 INCOMPLETE Vlan10
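When the ARP table is large, incomplete entries are easier to spot with a small script. This is a hedged Python sketch that assumes the four-column row layout shown above (address, age, MAC, interface); the function name incomplete_arp_entries is an illustrative assumption.

```python
def incomplete_arp_entries(text):
    """Return (address, interface) pairs whose MAC column reads INCOMPLETE."""
    pending = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have four columns: address, age, MAC, interface.
        if len(fields) == 4 and fields[2] == "INCOMPLETE":
            pending.append((fields[0], fields[3]))
    return pending

sample = """\
IP ARP Table
Total number of entries: 2
Address Age MAC Address Interface
10.1.12.10 00:10:20 5087.894b.bb41 Vlan10
10.1.12.2 00:00:09 INCOMPLETE Vlan10
"""

print(incomplete_arp_entries(sample))
```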
When an incomplete ARP is seen, the internal trace history is used to determine whether
the problem is with the ARP component or something else. When an ARP entry is popu-
lated, two operations (Create and Update) occur to populate the information in the FIB.
If a problem arises with the ARP component, you might only see the Create operation,
not the Update operation. To view the sequence of operations, use the command show
forwarding internal trace v4-adj-history [module slot] (see Example 3-45). This example
shows that for the next hop of 10.1.12.2, only a Create operation is happening after the
Destroy operation (drop adjacency); no Update operation occurs after that, causing the
ARP entry to be marked as glean.
To view the forwarding adjacency, use the command show forwarding ipv4 adjacency
interface-type interface-num [module slot]. If the adjacency for a particular next hop
appears as unresolved, there is no adjacency; FIB then matches the network glean adja-
cency and performs a punt operation. Example 3-46 illustrates the output of the show
forwarding ipv4 adjacency command with an unresolved adjacency entry.
The ARP component also provides an event history to be used to further understand
whether any errors could lead to problems with ARP and adjacency. To view the ARP
event history, use the command show ip arp internal event-history [events | errors].
Example 3-47 displays the output of the command show ip arp internal event-history
events, displaying the ARP resolution for the host 10.1.12.2/24. In the event history,
notice that the switch sends out an ARP request; based on the reply, the adjacency is built
and further updated into the AM database.
10) Event:E_DEBUG, length:49, at 713054 usecs after Sun May 7 17:31:30 2017
[116] [4200]: ARP request for 10.1.12.2 on Vlan10
Note The ARP packets are also captured using Ethanalyzer in both ingress and egress
directions.
The ARP component is closely coupled with the Adjacency Manager (AM)
component. The AM takes care of programming the /32 host routes in the hardware.
AM provides the following functionalities:
■ Adds host routes (/32 routes) into URIB/U6RIB for learned adjacencies
■ Performs IP/IPv6 lookups in the AM database while forwarding packets out of the interface
■ Handles adjacencies restart by maintaining the adjacency SDB for restoration of the
AM state
■ Provides a single interface for URIB/UFDM to learn routes from multiple sources
When an ARP is learned, the ARP entry is added to the AM SDB. AM then communi-
cates directly with URIB and UFDM to install a /32 adjacency in hardware. The AM
database queries the state of active ARP entries. The ARP table is not persistent across a
process restart, so the ARP process must requery the AM SDB afterward. AM registers various clients that
can install adjacencies. To view the registered clients, use the command show system
internal adjmgr client (see Example 3-48). One of the most common clients of AM
is ARP.
Any unresolved adjacency is verified using the command show ip adjacency ip-address
detail. If the adjacency is resolved, the output populates the correct MAC address for the
specified IP; otherwise, it has 0000.0000.0000 in the MAC address field. Example 3-49
displays the difference between the resolved and unresolved adjacencies.
! Resolved Adjacency
N7k-1# show ip adjacency 10.1.12.10 detail
No. of Adjacency hit with type INVALID: Packet count 0, Byte count 0
No. of Adjacency hit with type GLOBAL DROP: Packet count 0, Byte count 0
No. of Adjacency hit with type GLOBAL PUNT: Packet count 0, Byte count 0
No. of Adjacency hit with type GLOBAL GLEAN: Packet count 0, Byte count 0
No. of Adjacency hit with type GLEAN: Packet count 0, Byte count 0
No. of Adjacency hit with type NORMAL: Packet count 0, Byte count 0
Address : 10.1.12.10
MacAddr : 5087.894b.bb41
Preference : 50
Source : arp
Interface : Vlan10
Physical Interface : Ethernet2/1
Packet Count : 0
Byte Count : 0
Best : Yes
Throttled : No
! Unresolved Adjacency
N7k-1# show ip adjacency 10.1.12.2 detail
! Output omitted for brevity
Address : 10.1.12.2
MacAddr : 0000.0000.0000
Preference : 255
Source : arp
Interface : Vlan10
Physical Interface : Vlan10
Packet Count : 0
Byte Count : 0
Best : Yes
Throttled : No
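The resolved-versus-unresolved check can also be scripted. The Python sketch below assumes the "Key : Value" layout shown in the output above; the function names parse_adjacency and is_resolved are illustrative assumptions, not part of NX-OS.

```python
UNRESOLVED_MAC = "0000.0000.0000"

def parse_adjacency(text):
    """Collect 'Key : Value' lines from the detail output into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def is_resolved(fields):
    """An adjacency counts as resolved when its MAC is not all zeros."""
    return fields.get("MacAddr", UNRESOLVED_MAC) != UNRESOLVED_MAC

unresolved = parse_adjacency("""\
Address : 10.1.12.2
MacAddr : 0000.0000.0000
Source : arp
Interface : Vlan10
""")

print(unresolved["Address"], "resolved:", is_resolved(unresolved))
```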
Step 5. The AM independently calls the UFDM API to install the adjacency in the
hardware.
The series of events within the AM component is viewed using the command show sys-
tem internal adjmgr internal event-history events. Example 3-50 displays the output of
this command, to illustrate the series of events that occur during installation of the adja-
cency for host 10.1.12.2. Notice that the prefix 10.1.12.2 is being added to the RIB buffer
for the IPv4 address family.
Note If an issue arises with the ARP or AM component, capture the show tech arp and
show tech adjmgr outputs while the problem is occurring.
them. The URIB process uses several clients, which are also viewed using the command
show routing clients (see Example 3-51):
■ AM
■ RPM
CLIENT: ospf-100
index mask: 0x0000000000008000
epid: 23091 MTS SAP: 320 MRU cache hits/misses: 2/1
Stale Time: 2100
Routing Instances:
VRF: "default" routes: 1, rnhs: 0, labels: 0
Messages received:
Register : 1 Convergence-notify: 1 Modify-route : 1
Messages sent:
Modify-route-ack : 1
Each routing protocol has its own region of shared URIB memory space. When a routing
protocol learns routes from its neighbor, it installs those learned routes in its own region
of shared URIB memory space. URIB then copies updated routes to its own protected
region of shared memory, which is read-only memory and is readable only to Netstack
and other components. The routing decisions are made from the entry present in URIB
shared memory. It is vital to note that URIB itself does not perform any of the add,
modify, or delete operations in the routing table. URIB clients (the routing protocols and
Netstack) handle all updates, except when the URIB client process crashes. In such a case,
URIB might then delete abandoned routes.
OSPF CLI provides users with the command show ip ospf internal txlist urib to view
the OSPF routes sent to URIB. For all other routing protocols, the information is viewed
using event history commands. Example 3-52 displays the output, showing the source
SAP ID of OSPF process and the destination SAP ID for MTS messages.
Server up : L3VM|IFMGR|RPM|AM|CLIS|URIB|U6RIB|IP|IPv6|SNMP
Server required : L3VM|IFMGR|RPM|AM|CLIS|URIB|IP|SNMP
Server registered: L3VM|IFMGR|RPM|AM|CLIS|URIB|IP|SNMP
Server optional : none
Early hello : OFF
Force write PSS: FALSE
OSPF mts pkt sap 324
OSPF mts base sap 320
9: 10.1.12.0/24
10: 1.1.1.1/32
11: 2.2.2.2/32
11: RIB marker
N7k-1# show system internal mts sup sap 320 description
ospf-100
N7k-1# show system internal mts sup sap 324 description
OSPF pkt MTS queue
The routes being updated from an OSPF process or any other routing process to URIB
are recorded in the event history logs. To view the updates copied by OSPF from OSPF
process memory to URIB shared memory, use the command show ip ospf internal event-
history rib. Use the command show routing internal event-history msgs to examine
URIB updating the globally readable shared memory. Example 3-53 shows the learned
OSPF routes being processed and updated to URIB and also the routing event history
showing the routes being updated to shared memory.
After the routes are installed in the URIB, they can be viewed using the command show
ip route routing-process detail, where routing-process is the NX-OS process for the
respective routing protocols, as in Example 3-53 (ospf-100).
Note URIB stores all routing information in shared memory. Because the memory space
is shared, it can be exhausted by large-scale routing issues or memory leak issues. Use the
command show routing memory statistics to view the shared URIB memory space.
The UFDM has four sets of APIs performing various tasks in the system:
■ FIB API: URIB and U6RIB modules use this to add, update, and delete routes in
the FIB.
■ Statistics collection API: This is used to collect adjacency statistics from the
platform.
In this list of tasks, the first three functions happen in a top-down manner (from supervi-
sor to line card); the fourth function happens in a bottom-up direction (from line card to
supervisor).
Note NX-OS no longer has Cisco Express Forwarding (CEF). It now relies on hardware
FIB, which is based on AVL trees, a type of self-balancing binary search tree.
The UFDM component distributes AM, FIB, and RPF updates to IPFIB on each line card
in the VDC and then sends an acknowledgment route-ack to URIB. This is verified using
the command show system internal ufdm event-history debugs (see Example 3-54).
808) Event:E_DEBUG, length:129, at 711230 usecs after Sun May 14 03:12:14 2017
[104] ufdm_route_distribute(615):TRACE: v4_rt_upd # 24 rt_count: 1, urib_xid
: 0x58f059ec, fib_xid: 0x58f059ec recp_cnt: 0 rmask: 0
809) Event:E_DEBUG, length:94, at 652231 usecs after Sun May 14 03:12:09 2017
[104] ufdm_route_send_ack(185):TRACE: sent route nack, xid: 0x58f059ec,
v4_ack: 0, v4_nack: 23
810) Event:E_DEBUG, length:129, at 651602 usecs after Sun May 14 03:12:09 2017
[104] ufdm_route_distribute(615):TRACE: v4_rt_upd # 23 rt_count: 1, urib_xid
: 0x58f059ec, fib_xid: 0x58f059ec recp_cnt: 0 rmask: 0
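To sift a long UFDM event history for acknowledgment results, the wrapped log lines can be joined and searched. The following Python sketch is a hedged illustration: the regular expression is modeled on the "sent route nack" line in the sample above, and the assumption that a positive acknowledgment is logged as "sent route ack" in the same format is not confirmed by the output shown.

```python
import re

# Pattern modeled on the sample line; the 'ack' variant is an assumption.
ACK_RE = re.compile(r"sent route (ack|nack), xid: (0x[0-9a-fA-F]+)")

def route_ack_results(text):
    """Return a list of (result, xid) tuples in the order they appear."""
    # Wrapped log lines are joined first so a match cannot be split in two.
    flat = " ".join(text.split())
    return ACK_RE.findall(flat)

sample = """\
809) Event:E_DEBUG, length:94, at 652231 usecs after Sun May 14 03:12:09 2017
[104] ufdm_route_send_ack(185):TRACE: sent route nack, xid: 0x58f059ec,
v4_ack: 0, v4_nack: 23
"""

print(route_ack_results(sample))
```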
After the hardware FIB has been programmed, the forwarding information is verified
using the command show forwarding route ip-address/len [detail]. The command out-
put displays the information of the next hop to reach the destination prefix and the out-
going interface, as well as the destination MAC information. This information is also veri-
fied at the platform level to get more details on it from the hardware/platform perspective
using the command show forwarding ipv4 route ip-address/len platform [module slot].
Then the information must be propagated in the relevant line card. This is verified using
the command show system internal forwarding route ip-address/len [detail]. This com-
mand output also provides interface hardware adjacency information; this is further veri-
fied using the command show system internal forwarding adjacency entry adj, where
the adj value is the adjacency value received from the previous command.
Note The previous outputs can be collected on the supervisor card as well as at
the line card level by logging into the line card console using the command attach module
slot and then executing the forwarding commands as already described.
Example 3-56 displays step-by-step verification of the route programmed in the FIB and
on the line card level.
----+---------------------+----------+----------+-----------
Dev | Prefix | PfxIndex | AdjIndex | LIF
----+---------------------+----------+----------+-----------
0 2.2.2.2/32 0x6320 0x5f 0x3
Note In case of any forwarding issues, collect the following show tech outputs during
problematic state:
■ Abstraction: Provides an abstraction layer for other components that want to inter-
act with the interfaces that EthPM manages
■ Port Finite State Machine (FSM): Provides an FSM for interfaces that it manages, as
well as handling interface creation and removal
The EthPM component interacts with other components, such as the Port-Channel
Manager, VxLAN Manager, and STP, to program interface states. The EthPM process
is also responsible for managing interface configuration (duplex, speed, MTU, allowed
VLANs, and so on).
Port-Client is a line card global process (specific to Nexus 7000 and Nexus 9000
switches) that closely interacts with the EthPM process. It maintains global information
received from EthPM across different VDCs. It receives updates from the local hardware
port ASIC and updates the EthPM. It has both platform-independent (PI) and platform-
dependent (PD) components. The PI component of the Port-Client process interacts with
EthPM, which is also a PI component, and the PD component is used for line
card-specific hardware programming.
The EthPM component CLI enables you to view platform-level information, such as the
EthPM interface index, which it receives from the Interface Manager (IM) component;
interface admin state and operational state; interface capabilities; interface VLAN state;
and more. All this information is viewed using the command show system internal
ethpm info interface interface-type interface-num. Example 3-57 displays the EthPM
information for the interface Ethernet 3/1, which is configured as an access port for
VLAN 10.
bundle_bringup_id(5)
service_xconnect(0)
Platform Information:
Local IOD(0xd7), Global IOD(0) Runtime IOD(0xd7)
Capabilities:
Speed(0xc), Duplex(0x1), Flowctrl(r:0x3,t:0x3), LinkDebounce(0x1)
udld(0x1), SFPCapable(0x1), TrunkEncap(0x1), AutoNeg(0x1)
channel(0x1), suppression(0x1), cos_rewrite(0x1), tos_rewrite(0x1)
dce capable(0x4), l2 capable(0x1), l3 capable(0x2) qinq capable(0x10)
ethertype capable(0x1000000), Fabric capable (y), EFP capable (n)
slowdrain congestion capable(y), slowdrain pause capable (y)
slowdrain slow-speed capable(y)
Num rewrites allowed(104)
eee capable speeds () and eee flap flags (0)
eee max wk_time rx(0) tx(0) fb(0)
Operational Vlans: 10
Pacer Information:
Pacer State: released credits
ISSU Pacer State: initialized
The port-client command show system internal port-client link-event tracks interface
link events from the software perspective on the line card. This command is a line card-
level command that requires you to get into the line card console. Example 3-58 displays
the port-client link events for ports on module 3. In this output, the events at different
time stamps are seen for various links going down and coming back up.
May 15 05:47:35 2017 00553866 Ethernet3/11 ---- DOWN Link down debounce
timer stopped and link is down
May 15 05:47:35 2017 00454119 Ethernet3/11 ---- DOWN Link down debounce
timer started(0x40e50006)
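Link-event output grows quickly on a busy module, so a per-port tally can help identify the ports that flap most. The Python sketch below is an assumption-laden illustration: the column layout (timestamp, numeric field, port, separator, state, message) is inferred only from the two sample records above, and the function name count_link_records is hypothetical.

```python
import re
from collections import Counter

# Layout inferred from the sample: date, time, year, number, port, '----', state.
EVENT_RE = re.compile(
    r"^\w{3}\s+\d+ \d\d:\d\d:\d\d \d{4} \d+ (\S+) \S+ (UP|DOWN) "
)

def count_link_records(text):
    """Count log records per (port, state); wrapped continuation lines are skipped."""
    counts = Counter()
    for line in text.splitlines():
        match = EVENT_RE.match(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts

sample = """\
May 15 05:47:35 2017 00553866 Ethernet3/11 ---- DOWN Link down debounce
timer stopped and link is down
May 15 05:47:35 2017 00454119 Ethernet3/11 ---- DOWN Link down debounce
timer started(0x40e50006)
"""

print(count_link_records(sample))
```

Note that a single link transition can produce more than one record (here, debounce timer started and stopped), so the tally counts records rather than flaps.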
For these link events, relevant messages are seen in the port-client event history logs for
the specified port using the line card-level command show system internal port-client
event-history port port-num.
Note If issues arise with ports not coming up on the Nexus chassis, collect the output of
the command show tech ethpm during problematic state.
■ Loss of line protocol keepalives, which cause a line to go down and lead to route
flaps and major network transitions.
■ Excessive packet processing because packets are being punted to the CPU.
■ Loss of routing protocol updates, which leads to route flaps and major network tran-
sitions.
■ RP at near 100% utilization, which slows the response time at the user command line
(CLI) or locks out the CLI. This prevents the user from taking corrective action to
respond to the attack.
■ Router crashes
■ Policy-based traffic policing using control plane policing (CoPP) for traffic that has
passed rate-limiters
The hardware rate-limiters and CoPP policy together increase device security by protecting
the CPU (Route-Processor) from unnecessary traffic or DoS attacks and by giving priority
to relevant traffic destined for the CPU. Note that the hardware rate-limiters are available
only with Nexus 7000 and Nexus 9000 series switches and are not available on other
Nexus platforms.
Packets that hit the CPU or reach the control plane are classified into these categories:
■ Received packets: These packets are destined for the router (such as keepalive messages)
■ Multicast packets: These packets are further divided into three categories:
■ Copy packets: For supporting features such as ACL-log, a copy of the original
packet is made and sent to the supervisor. Thus, these are called copy packets.
■ ACL-log copy
■ Multicast copy
■ NetFlow copy
■ TTL expiry
■ MTU failure
■ Unsupported rewrite
■ Glean packets: When an L2 MAC for the destination IP or next hop is not present in
the FIB, the packet is sent to the supervisor. The supervisor then takes care of gener-
ating an ARP request for the destination host or next hop.
■ Broadcast, non-IP packets: The following packets fall under this category:
Remember that both the CoPP policy and rate-limiters are applied on per-module, per-
forwarding engine (FE) basis.
Note On the Nexus 7000 platform, CoPP policy is supported on all line cards except F1
series cards. F1 series cards exclusively use rate-limiters to protect the CPU. HWRL is sup-
ported on Nexus 7000/7700 and Nexus 9000 series platforms.
Example 3-59 displays the output of the command show hardware rate-limiters [module
slot] to view the rate-limiter configuration and statistics per each line card module pres-
ent in the chassis.
Module: 3
Module: 2
R-L Class Config Allowed Dropped Total
+------------------+--------+---------------+---------------+-----------------+
L3 glean 100 0 0 0
L3 mcast loc-grp 3000 0 0 0
access-list-log 100 0 0 0
bfd 10000 0 0 0
exception 50 0 0 0
fex 3000 0 0 0
span 50 0 0 0
dpss 6400 0 0 0
sflow 40000 0 0 0
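With many classes and several modules, it helps to filter the table down to classes that are actually dropping traffic. The Python sketch below assumes the five-column layout shown above (class, Config, Allowed, Dropped, Total). The function names are illustrative, and the row with non-zero counters is hypothetical data added only to exercise the check.

```python
def parse_rate_limiters(text):
    """Return {class_name: counters} for each data row of the table."""
    rows = {}
    for line in text.splitlines():
        # Class names may contain spaces, so split off the four numbers from the right.
        parts = line.rsplit(None, 4)
        if len(parts) == 5 and all(p.isdigit() for p in parts[1:]):
            config, allowed, dropped, total = map(int, parts[1:])
            rows[parts[0].strip()] = {
                "config": config, "allowed": allowed,
                "dropped": dropped, "total": total,
            }
    return rows

def classes_with_drops(rows):
    """Names of rate-limiter classes whose Dropped counter is non-zero."""
    return sorted(name for name, row in rows.items() if row["dropped"] > 0)

sample = """\
R-L Class Config Allowed Dropped Total
L3 glean 100 0 0 0
L3 mcast loc-grp 3000 0 0 0
access-list-log 100 0 0 0
"""
# Hypothetical row with non-zero drops; the values are not from a real device.
hypothetical = sample + "L3 mtu 500 1200 300 1500\n"

print(classes_with_drops(parse_rate_limiters(hypothetical)))
```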
For verifying the rate-limiter statistics on F1 module on Nexus 7000 switches, use
the command show hardware rate-limiter [f1 rl-1 | rl-2 | rl-3 | rl-4 | rl-5].
The Nexus 7000 series switches also enable you to view the rate-limiters for the SUP
bound traffic and its usage. Different modules determine what exceptions match each
rate-limiter. These differences are viewed using the command show hardware internal
forwarding rate-limiter usage [module slot]. Example 3-60 displays the output of this
command, showing not only the different rate-limiters but also which packet streams or
rate-limiters are handled by either CoPP or the L2 or L3 rate-limiters.
Note: The rate-limiter names have been abbreviated to fit the display.
-------------------------+------+------+--------+------+--------+--------
Packet streams | CAP1 | CAP2 | DI | CoPP | L3 RL | L2 RL
-------------------------+------+------+--------+------+--------+--------
L3 control (224.0.0.0/24) Yes x sup-hi x control copy
L2 broadcast x x flood x x strm-ctl
ARP request Yes x sup-lo Yes x copy
Mcast direct-con Yes x x Yes m-dircon copy
ISIS Yes x sup-lo x x x
L2 non-IP multicast x x x x x x
Access-list log x Yes acl-log x x acl-log
L3 unicast control x x sup-hi Yes x receive
L2 control x x x x x x
Glean x x sup-lo x x glean
Port-security x x port-sec x x port-sec
IGMP-Snoop x x m-snoop x x m-snoop
-------------------------+------+------+--------+------+--------+--------
Exceptions | CAP1 | CAP2 | DI | CoPP | L3 RL | L2 RL
-------------------------+------+------+--------+------+--------+--------
IPv4 header options 0 0 x Yes x
FIB TCAM no route 0 0 x Yes x
Same interface check 0 0 x x ttl x
IPv6 scope check fail 0 0 drop x x
Unicast RPF more fail 0 0 drop x x
Unicast RPF fail 0 0 drop Yes x
Multicast RPF fail 0 0 drop x x
Multicast DF fail 0 0 drop x x
TTL expiry 0 0 x x ttl x
Drop 0 0 drop x x
L3 ACL deny 0 0 drop x x
L2 ACL deny 0 0 drop x x
IPv6 header options 0 0 drop Yes x
MTU fail 0 0 x x mtu x
DHCP ACL redirect 0 0 x Yes mtu x
ARP ACL redirect 0 0 x Yes mtu x
Smac IP check fail 0 0 x x mtu x
Hardware drop 0 0 drop x x
Software drop 0 0 drop x x
Unsupported RW 0 0 x x ttl x
Invalid packet 0 0 drop x x
L3 proto filter fail 0 0 drop x x
Information about specific exceptions is seen using the command show hardware inter-
nal forwarding l3 asic exceptions exception detail [module slot].
The configuration settings for both l2 and l3 ASIC rate-limiters are viewed using the com-
mand show hardware internal forwarding [l2 | l3] asic rate-limiter rl-name detail [module
slot], where the rl-name variable is the name of the rate-limiter. Example 3-61 displays the
output for L3 ASIC exceptions, as well as the L2 and L3 rate-limiters. The first output shows
the configuration and statistics for packets that fail the RPF check. The second and third out-
puts show the rate-limiter and exception configuration for packets that fail the MTU check.
! L2 Rate-Limiter
N7K-1# show hardware internal forwarding l2 asic rate-limiter layer-3-glean detail
Device: 1
Device: 1
Enabled: 0
Packets/sec: 0
Match fields:
Cap1 bit: 0
Cap2 bit: 0
DI select: 0
DI: 0
Flood bit: 0
slot 3
=======
Egress exception priority table programming:
Reserved: 0
Disable LIF stats: 0
Trigger: 0
Mask RP: 0x1
Dest info sel: 0
Clear exception flag: 0x1
Egress L3 : 0
Same IF copy disable: 0x1
Mcast copy disable: 0x1
Ucast copy disable: 0
Exception dest sel: 0x6
Enable copy mask: 0
Disable copy mask: 0x1
CoPP in Nexus platforms is also implemented in hardware, which helps protect the
supervisor from DoS attacks. It controls the rate at which the packets are allowed to
reach the supervisor CPU. Remember that traffic hitting the CPU on the supervisor mod-
ule comes in through four paths:
2. Management interface
3. Control and monitoring processor (CMP) interface, which is used for the console
Only the traffic sent through the in-band interface is sent to the CoPP because this is the
only traffic that reaches the supervisor module through different forwarding engines (FE)
on the line cards. CoPP policing is implemented individually on each FE.
When any Nexus platform boots up, the NX-OS installs a default CoPP policy named
copp-system-policy. NX-OS also comes with different profile settings for CoPP, to pro-
vide different protection levels to the system. These CoPP profiles include the following:
■ Strict: Defines a BC value of 250 ms for regular classes and 1000 ms for the impor-
tant class.
■ Moderate: Defines a BC value of 310 ms for regular classes and 1250 ms for the
important class.
■ Lenient: Defines a BC value of 375 ms for regular classes and 1500 ms for the
important class.
■ Dense: Recommended when the chassis has more F2 line cards than other I/O mod-
ules. Introduced in release 6.0(1).
If one of the policies is not selected during initial setup, NX-OS attaches the Strict pro-
file to the control plane. You can choose not to use one of these profiles and instead
create a custom policy to be used for CoPP. The NX-OS default CoPP policy categorizes
policy into various predefined classes:
■ Management: All management traffic, such as Telnet, SSH, FTP, NTP, and Radius
Example 3-62 shows a sample strict CoPP policy when the system comes up for the first
time. The CoPP configuration is viewed using the command show run copp all.
class copp-system-p-class-management
set cos 2
police cir 10000 kbps bc 250 ms conform transmit violate drop
class copp-system-p-class-normal
set cos 1
police cir 680 kbps bc 250 ms conform transmit violate drop
class copp-system-p-class-exception
set cos 1
police cir 360 kbps bc 250 ms conform transmit violate drop
class copp-system-p-class-monitoring
set cos 1
police cir 130 kbps bc 1000 ms conform transmit violate drop
class class-default
set cos 0
police cir 100 kbps bc 250 ms conform transmit violate drop
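To get a feel for what a BC value expressed in milliseconds means, it can be converted to bytes. The Python sketch below rests on an assumption that is not stated in the text above: that a BC in milliseconds corresponds to the volume of traffic arriving at the CIR during that interval, with 1 kbps taken as 1000 bits per second. The function names are illustrative.

```python
import re

POLICE_RE = re.compile(r"police cir (\d+) kbps bc (\d+) ms")

def burst_bytes(cir_kbps, bc_ms):
    """Bytes that arrive in bc_ms at cir_kbps (kbps x ms = bits)."""
    return cir_kbps * bc_ms // 8

def bursts_from_policy(text):
    """Return (cir_kbps, bc_ms, burst_bytes) for each police statement."""
    return [
        (int(cir), int(bc), burst_bytes(int(cir), int(bc)))
        for cir, bc in POLICE_RE.findall(text)
    ]

sample = """\
class copp-system-p-class-normal
set cos 1
police cir 680 kbps bc 250 ms conform transmit violate drop
class copp-system-p-class-monitoring
set cos 1
police cir 130 kbps bc 1000 ms conform transmit violate drop
"""

for cir, bc, size in bursts_from_policy(sample):
    print(f"cir {cir} kbps, bc {bc} ms -> about {size} bytes of burst")
```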
To view the differences in the different CoPP profiles, use the command show copp diff
profile profile-type profile profile-type. The command displays the policy-map configu-
ration differences of both specified profiles.
Both HWRL and CoPP are done at the forwarding engine (FE) level. An aggregate
amount of traffic from multiple FEs can still overwhelm the CPU. Thus, both the HWRL
and CoPP are best-effort approaches. Another important point to keep in mind is that
the CoPP policy should not be too aggressive; it also should be designed based on the
network design and configuration. For example, if the rate at which routing protocol
packets are hitting the CoPP policy is more than the policed rate, even the legitimate
sessions can be dropped and protocol flaps can be seen. If the predefined CoPP poli-
cies must be modified, create a custom CoPP policy by copying a preclassified CoPP
policy and then edit the new custom policy. None of the predefined CoPP profiles can be
edited. Additionally, the CoPP policies are hidden from the show running-config output.
The CoPP policies are viewed from the show running-config all or show running-config
copp all commands. Example 3-63 shows how to use the CoPP policy configuration and
create a custom strict policy.
Example 3-63 Viewing a CoPP Policy and Creating a Custom CoPP Policy
One limitation of the access lists referenced by a CoPP policy is that the
statistics per-entry command is not supported for IP and MAC access control lists
(ACL); thus, it has no effect when applied under the ACLs. To view the CoPP policy–
referenced IP and MAC ACL counters on an input/output (I/O) module, use the com-
mand show system internal access-list input entries detail. Example 3-65 displays the
output of the command show system internal access-list input entries detail, showing
the hits on the MAC ACL for the FabricPath MAC address 0180.c200.0041.
n7k-1# show system internal access-list input entries detail | grep 0180.c200.0041
[020c:4344:020a] qos 0000.0000.0000 0000.0000.0000 0180.c200.0041 ffff.ffff.ffff
[0]
[020c:4344:020a] qos 0000.0000.0000 0000.0000.0000 0180.c200.0041 ffff.ffff.ffff
[20034]
[020c:4344:020a] qos 0000.0000.0000 0000.0000.0000 0180.c200.0041 ffff.ffff.ffff
[19923]
[020c:4344:020a] qos 0000.0000.0000 0000.0000.0000 0180.c200.0041 ffff.ffff.ffff
[0]
Starting with NX-OS Release 5.1, a threshold value can be configured to generate a syslog
message for the drops enforced by the CoPP policy on a particular class. The syslog
messages are generated when the drops within a traffic class exceed the user-configured
threshold value. The threshold is configured using the logging drop threshold
dropped-bytes-count [level logging-level] command. Example 3-66 demonstrates how
to configure the logging threshold value to be set for 100 drops and logging at level 7.
It also demonstrates how the syslog message is generated in case the drop threshold is
exceeded.
Scale factor configuration was introduced in NX-OS starting with Version 6.0. The scale
factor is used to scale the policer rate of the applied CoPP policy on a per-line card basis
without changing the actual CoPP policy configuration. The scale factor configuration
ranges from 0.10 to 2.0. To configure the scale factor, use the command scale-factor value
[module slot] under the control-plane configuration mode. Example 3-67 illustrates how
to configure the scale factor for various line cards present in the Nexus chassis. The scale
factor settings are viewed using the command show system internal copp info. This com-
mand displays other information as well, including the last operation that was performed
and its status, CoPP database information, and CoPP runtime status, which is useful
while troubleshooting issues with CoPP policies.
n7k-1(config)# control-plane
n7k-1(config-cp)# scale-factor 0.5 module 3
n7k-1(config-cp)# scale-factor 1.0 module 4
n7k-1# show system internal copp info
Runtime Info:
--------------
Config FSM current state: IDLE
Modules online: 3 4 5 7
Linecard Configuration:
-----------------------
Scale Factors
Module 1: 1.00
Module 2: 1.00
Module 3: 0.50
Module 4: 1.00
Module 5: 1.00
Module 6: 1.00
Module 7: 1.00
Module 8: 1.00
Module 9: 1.00
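The effect of the scale factor can be illustrated with a short calculation. The following Python sketch is not NX-OS code; it assumes that the configured policer rate of a class is simply multiplied by the module's scale factor, and the base rate of 1000 kbps is a hypothetical value chosen for illustration.

```python
# Sketch: effective CoPP policer rate per module after applying a scale factor.
# The base rate is illustrative and is not taken from a real CoPP policy.

MIN_SCALE = 0.10  # lower bound stated in the text
MAX_SCALE = 2.0   # upper bound stated in the text


def effective_rate(base_rate_kbps: float, scale_factor: float = 1.0) -> float:
    """Return the policer rate a line card would enforce for one class."""
    if not MIN_SCALE <= scale_factor <= MAX_SCALE:
        raise ValueError(
            f"scale factor {scale_factor} is outside {MIN_SCALE}-{MAX_SCALE}"
        )
    return base_rate_kbps * scale_factor


# Scale factors as configured in the example: module 3 halved, module 4 unchanged.
scale_factors = {3: 0.5, 4: 1.0}
base_rate_kbps = 1000.0  # hypothetical class policer rate

for module, factor in sorted(scale_factors.items()):
    print(f"module {module}: {effective_rate(base_rate_kbps, factor):.0f} kbps")
```

Because the policy itself is unchanged, removing the scale factor (or setting it back to 1.0) returns the module to the configured rate.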
Note Refer to the CCO documentation for the scale factor recommendation that is
appropriate for your Nexus 7000 chassis.
A few best practices need to be kept in mind for NX-OS CoPP policy configuration:
■ Use the copp profile strict command after each NX-OS upgrade, or at least after
each major NX-OS upgrade. If a CoPP policy modification was previously done, it
must be reapplied after the upgrade.
■ The dense CoPP profile is recommended when the chassis is fully loaded
with F2 series modules or loaded with more F2 series modules than any other
I/O modules.
■ Monitor unintended drops, and add or modify the default CoPP policy in accor-
dance with the expected traffic.
MTU Settings
The MTU settings on a Nexus platform work differently than on other Cisco platforms.
Two kinds of MTU settings exist: Layer 2 (L2) MTU and Layer 3 (L3) MTU. The L3
MTU is manually configured under the interface using the mtu value command. On the
other hand, the L2 MTU is configured either through the network QoS policy or by set-
ting the MTU on the interface itself on the Nexus switches that support per-port MTU.
The L2 MTU settings are defined under the network-qos policy type, which is then
applied under the system qos policy configuration. Example 3-68 displays the sample
configuration to enable jumbo L2 MTU on the Nexus platforms.
Having the jumbo L2 MTU enabled before applying jumbo L3 MTU on the interface is
recommended.
Note Not all platforms support jumbo L2 MTU at the port level. The port-level L2 MTU
configuration is supported only on the Nexus 7000, 7700, 9300, and 9500 platforms. All
the other platforms (such as Nexus 3048, 3064, 3100, 3500, 5000, 5500, and 6000) support
only network QoS policy-based jumbo L2 MTU settings.
The MTU settings on the Nexus 3000, 7000, 7700, and 9000 (platforms that support
per-port MTU settings) can be viewed using the command show interface interface-type
x/y. On the Nexus 3100, 3500, 5000, 5500, and 6000 (platforms supporting network
QoS policy-based MTU settings), these are verified using the command show queuing
interface interface-type x/y.
Note Beginning with NX-OS Version 6.2, the per-port MTU configuration on FEX ports
is not supported on Nexus 7000 switches. A custom network QoS policy is required to
configure these (see Example 3-69).
The first step for MTU troubleshooting is to verify the MTU settings on the interface
using the show interface or the show queuing interface interface-type x/y com-
mands. The devices supporting network QoS policy-based MTU settings use the
command show policy-map system type network-qos to verify the MTU settings (see
Example 3-70).
In NX-OS, the Ethernet Port Manager (ethpm) process manages the port-level MTU con-
figuration. The MTU information under the ethpm process is verified using the command
show system internal ethpm info interface interface-type x/y (see Example 3-71).
NX-1# show system internal ethpm info interface ethernet 2/1 | egrep MTU
medium(broadcast), snmp trap(on), MTU(9216),
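When the ethpm output is collected from many interfaces, the MTU value can be extracted programmatically. The following Python sketch assumes only the MTU(value) format seen in the output above; the function name parse_mtu is hypothetical.

```python
import re


def parse_mtu(output: str):
    """Return the MTU found in ethpm output as an int, or None if absent."""
    match = re.search(r"MTU\((\d+)\)", output)
    return int(match.group(1)) if match else None


# Line copied from the ethpm output shown above.
sample = "medium(broadcast), snmp trap(on), MTU(9216),"
print(parse_mtu(sample))
```

The returned value can then be compared against the MTU expected on the interface.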
The MTU settings also can be verified on the Earl Lif Table Manager (ELTM) process,
which maintains Ethernet state information. The ELTM process also takes care of manag-
ing the logical interfaces, such as switch virtual interfaces (SVI). To verify the MTU set-
tings under the ELTM process on a particular interface, use the command show system
internal eltm info interface interface-type x/y (see Example 3-72).
Note If MTU issues arise across multiple devices or a software issue is noticed with
the ethpm process or MTU settings, capture the show tech-support ethpm and show
tech-support eltm [detail] output in a file and open a TAC case for further investigation.
Summary
This chapter focused on troubleshooting various hardware- and software-related prob-
lems on Nexus platforms. From the hardware troubleshooting perspective, this chapter
covered the following topics:
■ GOLD tests
This chapter detailed how VDCs work and how to troubleshoot them, including the
issues that can arise when modules are combined within a VDC. This chapter also
demonstrated how to limit the resources on a VDC and covered in depth various
NX-OS components, such as Netstack, UFDM and IPFIB, EthPM, and Port-Client.
Finally, the chapter addressed CoPP and how to troubleshoot drops in the CoPP
policy, as well as how to fix MTU issues on the Ethernet and FEX ports.
Chapter 4
Nexus Switching
■ Virtual LANs
■ Private VLANs
■ Port Security
When Cisco launched the Nexus product line, it introduced a new category of net-
working devices called data center switching. Data center switching products provide
high-density, high-speed switching capacity to serve the needs of the servers (physical
and virtual) in the data center. This chapter focuses on the core components of network
switching and how to verify which components are working properly to isolate and
troubleshoot Layer 2 forwarding issues.
The more devices that were added to a cable, the less efficient the network became. All
these devices were in the same collision domain (CD). Network hubs exacerbated the
problem because they added port density while repeating traffic. Network hubs do not
have any intelligence in them to direct network traffic.
Network switches enhance scalability and stability in a network through the creation
of virtual channels. Switches maintain a table that associates a host's Ethernet MAC
address to the port that sourced the network traffic. Instead of flooding all traffic out
of every port, a switch uses the MAC address table to forward network traffic only to
the destination port associated to the destination MAC address of the packet. Packets
are forwarded out of all network ports for that LAN only if the destination MAC
address is not known on the switch (known as unicast flooding).
Network broadcasts (MAC address ff:ff:ff:ff:ff:ff) cause the switch to forward the
packet out of every LAN switch port interface. This is disruptive because it reduces
the efficiency of a network switch to that of a hub: communication between network
devices stops because of CSMA/CD. Network broadcasts do not cross Layer 3
boundaries (that is, from one subnet to another subnet). All devices that reside in the
same Layer 2 (L2) segment are considered to be in the same broadcast domain.
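The learning and forwarding behavior described above can be sketched in a few lines of Python. This is a simplified model of a single-VLAN switch, not NX-OS code; the class name, port names, and MAC addresses are hypothetical.

```python
BROADCAST = "ffff.ffff.ffff"


class LearningSwitch:
    """Minimal model of MAC learning and forwarding in one VLAN."""

    def __init__(self, ports):
        self.ports = set(ports)
        self.mac_table = {}  # MAC address -> port where it was learned

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the source MAC, then return the set of egress ports."""
        self.mac_table[src_mac] = in_port
        if dst_mac == BROADCAST or dst_mac not in self.mac_table:
            # Broadcast or unknown unicast: flood out of every other port.
            return self.ports - {in_port}
        out_port = self.mac_table[dst_mac]
        # A frame is never sent back out of the port it arrived on.
        return set() if out_port == in_port else {out_port}


switch = LearningSwitch(["Eth1/1", "Eth1/2", "Eth1/3"])
# PC-A sends to PC-B before PC-B is known: the frame is flooded.
print(sorted(switch.receive("Eth1/1", "0000.0000.000a", "0000.0000.000b")))
# PC-B replies: PC-A was already learned, so only Eth1/1 receives the frame.
print(sorted(switch.receive("Eth1/2", "0000.0000.000b", "0000.0000.000a")))
```

The first frame is flooded because the destination is unknown; once both hosts have sent traffic, frames between them use a single egress port.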
Figure 4-1 displays PC-A’s broadcast traffic that is being advertised to all devices on
that network, which include PC-B, PC-C, and R1. R1 does not forward the broadcast
traffic from one broadcast domain (192.168.1.0/24) to the other broadcast domain
(192.168.2.0/24).
Figure 4-1 (diagram): PC-A's broadcast floods the 192.168.1.0/24 segment and stops at
R1; it does not reach the 192.168.2.0/24 segment.
The local MAC address table contains the list of MAC addresses and the ports on which
those MAC addresses were learned. The MAC address table is displayed with the
command show mac address-table [address mac-address]. To ensure that the switch
hardware ASICs are programmed correctly, the hardware MAC address table is displayed
with the command show hardware mac address-table module [dynamic] [address
mac-address].
Network Layer 2 Communication Overview 199
Example 4-1 displays the MAC address table on a Nexus switch. Locating the
switch port the network device is attached to is the first step of troubleshooting L2
forwarding. If multiple MAC addresses appear on the same port, a switch is connected
to that port, and connecting to that switch may be required as part of the
troubleshooting process to identify the port the network device is attached to.
Note The terms network device and hosts are considered interchangeable in this text.
Virtual LANs
Adding a router between LAN segments helps shrink broadcast domains and provides
for optimal network communication. Host placement on a LAN segment varies because
of network addressing. This could lead to inefficient usage of hardware because some
switch ports could go unused.
VLANs are defined in the Institute of Electrical and Electronics Engineers (IEEE)
802.1Q standard, which states that 32 bits are added to the packet header and are com-
posed of the following:
■ Tag protocol identifier (TPID): A 16-bit field set to 0x8100 to identify the packet as
an 802.1Q packet.
■ Priority code point (PCP): A 3-bit field to indicate a class of service (CoS) as part of
Layer 2 quality of service (QoS) between switches.
■ Drop Eligible Indicator (DEI): A 1-bit field that indicates if the packet can be
dropped when there is bandwidth contention.
■ VLAN identifier (VID): A 12-bit field that specifies the VLAN associated to a
network packet.
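The four fields listed above fit into 32 bits as follows. This Python sketch packs and unpacks the tag with the standard struct module; the function names pack_tag and unpack_tag are hypothetical, and the example values (CoS 5, VLAN 10) are chosen only for illustration.

```python
import struct

TPID_8021Q = 0x8100  # value that identifies an 802.1Q tag


def pack_tag(pcp: int, dei: int, vid: int) -> bytes:
    """Build the 4-byte 802.1Q tag: TPID (16) + PCP (3) + DEI (1) + VID (12)."""
    if not 0 <= pcp <= 7:
        raise ValueError("PCP is a 3-bit field (0-7)")
    if dei not in (0, 1):
        raise ValueError("DEI is a 1-bit field (0 or 1)")
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VID is a 12-bit field (0-4095)")
    tci = (pcp << 13) | (dei << 12) | vid
    return struct.pack("!HH", TPID_8021Q, tci)  # network byte order


def unpack_tag(tag: bytes):
    """Return (pcp, dei, vid) from a 4-byte 802.1Q tag."""
    tpid, tci = struct.unpack("!HH", tag)
    if tpid != TPID_8021Q:
        raise ValueError("not an 802.1Q tag")
    return tci >> 13, (tci >> 12) & 0x1, tci & 0xFFF


tag = pack_tag(pcp=5, dei=0, vid=10)
print(tag.hex(), unpack_tag(tag))
```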
802.1Q Fields
The VLAN identifier has only 12 bits, which provide 4094 unique VLANs. NX-OS uses
the following logic for VLAN identifiers:
■ VLANs 2 to 1005 are in the normal VLAN range and can be added, deleted, or
modified as necessary.
■ VLANs 1006 to 3967 and 4048 to 4093 are in the extended VLAN range and can
be added, deleted, or modified as necessary.
■ VLANs 3968 to 4047 and 4094 are considered internal VLANs and are used inter-
nally by NX-OS. These cannot be added, deleted, or modified.
■ VLAN 4095 is reserved by the 802.1Q standard and cannot be used.
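The ranges above can be expressed as a small lookup, which is handy when validating VLAN IDs in a change script. The Python sketch below follows the ranges in the list; the handling of VLAN 0 (reserved by 802.1Q) and VLAN 1 (the default VLAN) is an assumption, because the list above does not cover them.

```python
def classify_vlan(vid: int) -> str:
    """Classify a VLAN ID according to the NX-OS ranges listed above."""
    if not 0 <= vid <= 4095:
        raise ValueError("a VLAN ID must fit in 12 bits (0-4095)")
    if vid in (0, 4095):
        return "reserved"  # 4095 per the list; 0 is an assumption (802.1Q)
    if vid == 1:
        return "default"   # assumption: not covered by the list above
    if vid <= 1005:
        return "normal"
    if 3968 <= vid <= 4047 or vid == 4094:
        return "internal"
    return "extended"      # 1006-3967 and 4048-4093


for vid in (10, 1006, 3968, 4048, 4094, 4095):
    print(vid, classify_vlan(vid))
```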
VLAN Creation
VLANs are created by using the global configuration command vlan vlan-id. A friendly
name (32 characters) is associated to the VLAN by using the VLAN submode configuration
command name name. The VLAN is not created until the CLI has been moved back to the
global configuration context or a different VLAN identifier. Example 4-2 demonstrates the
creation of VLAN 10 (Accounting), VLAN 20 (HR), and VLAN 30 (Security) on NX-1.
NX-1(config)# vlan 10
NX-1(config-vlan)# name Accounting
NX-1(config-vlan)# vlan 20
NX-1(config-vlan)# name HR
NX-1(config-vlan)# vlan 30
NX-1(config-vlan)# name Security
VLANs and their port assignment are verified with the show vlan [id vlan-id] command,
as demonstrated in Example 4-3. The output is reduced to a specific VLAN by using
the optional id keyword. Notice that the output is broken into three separate areas:
Traditional VLANs, Remote Switched Port Analyzer (RSPAN) VLANs, and Private
VLANs.
Note Most engineers assume that a VLAN maintains a one-to-one ratio of subnet-to-
VLAN. Multiple subnets can exist in the same VLAN by assigning a secondary IP address
to a router’s interface or by connecting multiple routers to the same VLAN. In situations
like this, both subnets are part of the same broadcast domain.
Access Ports
Access ports are the fundamental building block of a managed switch. An access port is
assigned to only one VLAN. It carries traffic from the VLAN to the device connected to
it, or from the device to other devices on the same VLAN on that switch.
NX-OS configures an L2 switch port as an access port by default. The port is explicitly
configured as an access port with the command switchport mode access. A specific VLAN
is associated to the port with the command switchport access vlan vlan-id. If the VLAN
is not specified, it defaults to VLAN 1. The 802.1Q tags are not included on packets
transmitted or received on access ports.
The switchport mode access command does not appear when looking at the tra-
ditional running configuration and requires the optional all keyword, as shown in
Example 4-4.
The command show interface interface-id displays the mode that the port is using. The
assigned VLAN for the port is viewed with the show vlan command, as shown earlier in
Example 4-3, or with show interface status. Example 4-5 demonstrates the verification
of an access port and the associated VLAN. It is important to verify that both hosts are
on the same VLAN for L2 forwarding to work properly.
--------------------------------------------------------------------------------
Port Name Status Vlan Duplex Speed Type
--------------------------------------------------------------------------------
mgmt0 -- connected routed full 1000 --
Eth1/1 -- connected trunk full 1000 10g
Eth1/2 -- connected 10 full 1000 10g
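When many ports must be checked, the show interface status output can be parsed to compare the VLAN of two ports. The following Python sketch assumes the seven-column layout shown above and port descriptions that contain no spaces; the function names are hypothetical.

```python
SAMPLE = """\
Port Name Status Vlan Duplex Speed Type
mgmt0 -- connected routed full 1000 --
Eth1/1 -- connected trunk full 1000 10g
Eth1/2 -- connected 10 full 1000 10g
"""


def parse_interface_status(output: str) -> dict:
    """Map each port to the content of its Vlan column."""
    vlans = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have seven columns; skip ruler lines and the header row.
        if len(fields) != 7 or fields[0] == "Port":
            continue
        port, _name, _status, vlan = fields[:4]
        vlans[port] = vlan
    return vlans


def same_access_vlan(vlans: dict, port_a: str, port_b: str) -> bool:
    """True only if both ports are access ports in the same VLAN."""
    vlan_a, vlan_b = vlans.get(port_a), vlans.get(port_b)
    return vlan_a is not None and vlan_a.isdigit() and vlan_a == vlan_b


vlans = parse_interface_status(SAMPLE)
print(vlans)
print(same_access_vlan(vlans, "Eth1/1", "Eth1/2"))
```

Trunk and routed ports report a keyword instead of a VLAN number, so the comparison deliberately treats them as not matching.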
Trunk Ports
Trunk ports can carry multiple VLANs across them. Trunk ports are typically used when
multiple VLANs need connectivity between a switch and another switch, router, or fire-
wall. VLANs are identified by including the 802.1Q headers in the packets as the packet
is transmitted across the link. The headers are examined upon the receipt of the packet,
associated to the proper VLAN, and then removed.
Trunk ports must be statically defined on Nexus switches with the interface com-
mand switchport mode trunk. Example 4-6 displays Eth1/1 being converted to a
trunk port.
NX-1# config t
Enter configuration commands, one per line. End with CNTL/Z.
NX-1(config)# int eth1/1
NX-1(config-if)# switchport mode trunk
NX-1# show interface eth1/1 | include Port
Port mode is trunk
The command show interface trunk provides valuable information when troubleshooting
connectivity between network devices. The output is divided into the following sections:
■ The first section lists all the interfaces that are trunk ports, along with their status,
association to a port-channel, and native VLAN.
■ The second section of the output displays the list of VLANs that are allowed
on the trunk port. Traffic can be minimized on trunk ports to restrict VLANs
to specific switches, thereby restricting broadcast traffic, too. Other use cases
involve a form of load balancing between network links where select VLANs are
allowed on one trunk link, and a different set of VLANs are allowed on a differ-
ent trunk port.
■ The third section displays any ports or VLANs that are in an error-disabled
(Err-disabled) state. Typically, these errors are related to an incomplete virtual
port channel (vPC) configuration. vPCs are explained in detail in Chapter 5,
“Port Channels, Virtual Port-Channels, and FabricPath.”
■ The fourth section displays the VLANs that are in a forwarding state on the switch.
Ports that are in blocking state are not listed under this section.
Example 4-7 demonstrates the use of the show interface trunk command.
--------------------------------------------------------------------------------
! Section 2 displays all of the VLANs that are allowed to be transmitted across
! the trunk port
Port Vlans Allowed on Trunk
--------------------------------------------------------------------------------
Eth1/1 1-4094
--------------------------------------------------------------------------------
! Section 3 displays ports that are disabled due to an error.
Port Vlans Err-disabled on Trunk
--------------------------------------------------------------------------------
Eth1/1 none
--------------------------------------------------------------------------------
! Section 4 displays all of the VLANs that are allowed across the trunk and are
! in a spanning tree forwarding state
Port STP Forwarding
--------------------------------------------------------------------------------
Eth1/1 1,10,20,30,99
--------------------------------------------------------------------------------
Port Vlans in spanning tree forwarding state and not pruned
--------------------------------------------------------------------------------
Feature VTP is not enabled
Eth1/1 1,10,20,30,99
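The VLAN lists in this output use a compact notation of comma-separated IDs and ranges. The following Python sketch expands that notation and reports where a given VLAN stands on a trunk; the function names and the three state labels are hypothetical and are used only for illustration.

```python
def expand_vlan_list(text: str) -> set:
    """Expand a list such as '1,10,20-22' into a set of VLAN IDs."""
    vlans = set()
    for part in text.split(","):
        part = part.strip()
        if not part or part == "none":
            continue
        if "-" in part:
            low, high = part.split("-")
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def trunk_vlan_state(vlan: int, allowed: str, forwarding: str) -> str:
    """Compare the allowed and STP forwarding sections for one VLAN."""
    if vlan not in expand_vlan_list(allowed):
        return "not allowed"
    if vlan not in expand_vlan_list(forwarding):
        return "allowed but not forwarding"
    return "forwarding"


# Values taken from the Eth1/1 rows in the output above.
print(trunk_vlan_state(20, "1-4094", "1,10,20,30,99"))
print(trunk_vlan_state(40, "1-4094", "1,10,20,30,99"))
```

A VLAN that is allowed but not forwarding usually points to a VLAN that does not exist on the switch or to a spanning tree blocking state, so those are the next items to check.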
Native VLANs
Traffic on a trunk port’s native VLAN does not include the 802.1Q tags. The native
VLAN is a port-specific configuration and is changed with the interface command
switchport trunk native vlan vlan-id.
The native VLAN should match on both ports, or traffic can change VLANs.
Although connectivity between hosts is feasible (assuming that they are on different
VLAN numbers), this causes confusion for most network engineers and is not a
best practice.
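The reason traffic can change VLANs is that the native VLAN is sent without a tag, and the receiving port places untagged frames into its own native VLAN. The following Python sketch models only that tagging rule; the function names are hypothetical.

```python
def transmit(frame_vlan: int, native_vlan: int):
    """Return the tag sent on the wire: None for the native VLAN."""
    return None if frame_vlan == native_vlan else frame_vlan


def receive(tag, native_vlan: int) -> int:
    """Untagged frames are placed into the receiving port's native VLAN."""
    return native_vlan if tag is None else tag


def cross_trunk(vlan: int, tx_native: int, rx_native: int) -> int:
    """Return the VLAN a frame lands in after crossing the trunk."""
    return receive(transmit(vlan, tx_native), rx_native)


# Matching native VLANs: the frame stays in VLAN 10.
print(cross_trunk(10, tx_native=10, rx_native=10))
# Mismatched native VLANs: the frame leaves VLAN 10 and arrives in VLAN 20.
print(cross_trunk(10, tx_native=10, rx_native=20))
```

Tagged VLANs are unaffected by the mismatch, which is why the problem often goes unnoticed until a host in the native VLAN is involved.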
Note All switch control-plane traffic is advertised using VLAN 1. As part of Cisco’s
security hardening guide, it is recommended to change the native VLAN to something
other than VLAN 1. More specifically, it should be set to a VLAN that is not used at all to
prevent VLAN hopping.
Allowed VLANs
As stated earlier, VLANs can be restricted from certain trunk ports as a method of
traffic engineering. This can cause problems if traffic between two hosts is expected to
traverse a trunk link, and the VLAN is not allowed to traverse that trunk port. The
interface command switchport trunk allowed vlan vlan-ids specifies the VLANs that are
allowed to traverse the link. Example 4-8 displays sample configuration to limit the
VLANs that can cross the Eth1/1 trunk link to 1, 10, 30, and 99.
Example 4-8 Viewing the VLANs that Are Allowed on a Trunk Link
Note The full command syntax, switchport trunk allowed vlan {vlan-ids | all | none |
add vlan-ids | remove vlan-ids | except vlan-ids}, provides a lot of power in a single
command.
When scripting configuration changes, it is best to use the add or remove keywords
because they are more prescriptive. A common mistake is using the switchport trunk
allowed vlan vlan-ids command, where only the VLAN that is being added is listed. This
results in the current list being overwritten, causing traffic loss for the VLANs that were
omitted.
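The difference between overwriting the list and using the add or remove keywords can be modeled with set operations. This Python sketch is an illustration of the semantics only, not of how NX-OS stores the list; the function name and the "replace" label for the bare vlan-ids form are hypothetical.

```python
ALL_VLANS = set(range(1, 4095))  # 1-4094, as shown in the trunk output


def apply_allowed(current: set, keyword: str, vlans=()) -> set:
    """Return the allowed VLAN set after one allowed-VLAN command."""
    vlans = set(vlans)
    if keyword == "replace":   # bare vlan-ids form: overwrites the list
        return vlans
    if keyword == "add":
        return current | vlans
    if keyword == "remove":
        return current - vlans
    if keyword == "except":
        return ALL_VLANS - vlans
    if keyword == "all":
        return set(ALL_VLANS)
    if keyword == "none":
        return set()
    raise ValueError(f"unknown keyword: {keyword}")


current = {1, 10, 30, 99}
# Intended change: also allow VLAN 20.
print(sorted(apply_allowed(current, "add", [20])))
# Common mistake: the list is overwritten and VLANs 1, 10, 30, and 99 are lost.
print(sorted(apply_allowed(current, "replace", [20])))
```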
Private VLANs
Some network designs require segmentation between network devices. This is easily
accomplished by two techniques:
■ Creating unique subnets for every security domain and restricting network traffic
with an ACL. Using this technique can waste IP addresses when a host range falls
outside of a subnet range (that is, a security zone with 65 hosts requires /25 and
results in wasting 63 IP addresses; this does not take into consideration the broad-
cast and network addresses).
■ Promiscuous: Ports associated to this VLAN are a primary PVLAN (the first tier)
and are allowed to communicate to all hosts. Typically, these are ports assigned
to a router, firewall, or server that is providing centralized services (DHCP, DNS,
and so on).
■ Isolated: These ports are in a secondary PVLAN (in the second tier of the
hierarchy) and are allowed to communicate only with ports associated to the
promiscuous PVLAN. Traffic is not transmitted between ports in the same
isolated VLAN.
Figure 4-3 demonstrates the usage of PVLANs for a service provider. R1 is the
router for every host in the 10.0.0.0/24 network segment and is connected with a
promiscuous PVLAN. Host-2 and Host-3 are from different companies and should
not be able to communicate with any other host. They should only be able to
communicate with R1.
Host-4 and Host-5 are from the same third company and need to talk with each other
along with R1. Host-6 and Host-7 are from the same fourth company and need to talk
with each other along with R1. All other communication is not allowed.
Figure 4-3 (diagram): R1 (10.0.0.1) connects to NX-1 in VLAN 10 (promiscuous).
Host-2 (10.0.0.2) and Host-3 (10.0.0.3) are in VLAN 20 (isolated). Host-4 (10.0.0.4)
and Host-5 (10.0.0.5) are in VLAN 30 (community). Host-6 (10.0.0.6) and Host-7
(10.0.0.7) are in VLAN 40 (community).
Table 4-1 displays the communication capability between hosts. Notice that Host-4 and
Host-5 communicate with each other but cannot communicate with Host-2, Host-3,
Host-6, or Host-7.
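The communication rules for the topology in Figure 4-3 can be expressed as a short function. This Python sketch models only the PVLAN port-type rules described in this section (promiscuous, isolated, and community); the data structure and function name are hypothetical.

```python
# Port type and VLAN for each device, as described for Figure 4-3.
PORTS = {
    "R1": ("promiscuous", 10),
    "Host-2": ("isolated", 20),
    "Host-3": ("isolated", 20),
    "Host-4": ("community", 30),
    "Host-5": ("community", 30),
    "Host-6": ("community", 40),
    "Host-7": ("community", 40),
}


def can_communicate(a: str, b: str) -> bool:
    """Apply the PVLAN rules to two different devices."""
    type_a, vlan_a = PORTS[a]
    type_b, vlan_b = PORTS[b]
    if "promiscuous" in (type_a, type_b):
        return True  # promiscuous ports reach every host
    if type_a == type_b == "community":
        return vlan_a == vlan_b  # only within the same community PVLAN
    return False  # isolated ports reach only promiscuous ports


for peer in ("R1", "Host-2", "Host-5", "Host-6"):
    print(f"Host-4 -> {peer}: {can_communicate('Host-4', peer)}")
```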
Step 1. Enable the private VLAN feature. Enable the PVLAN feature with the com-
mand feature private-vlan in the global configuration mode.
Step 2. Define the isolated PVLAN. Create the isolated PVLAN with the command
vlan vlan-id. Underneath the VLAN configuration context, identify the
VLAN as an isolated PVLAN with the command private-vlan isolated.
Step 3. Define the promiscuous PVLAN. Create the promiscuous PVLAN with the
command vlan vlan-id. Underneath the VLAN configuration context, identify
the VLAN as a promiscuous PVLAN with the command private-vlan primary.
Step 4. Associate the isolated PVLAN to the promiscuous PVLAN. Underneath the
promiscuous PVLAN configuration context, associate the secondary (isolated
or community) PVLANs with the command private-vlan association
secondary-pvlan-id. If multiple secondary PVLANs are used, delineate with
the use of a comma.
Step 5. Configure the switchport(s) for the promiscuous PVLAN. Change the
configuration context to the switch port for the promiscuous host with the
command interface interface-id. Change the switch port mode to promiscu-
ous PVLAN with the command switchport mode private-vlan promiscuous.
The switch port must then be associated to the promiscuous PVLAN with the
command switchport access vlan promiscuous-vlan-id. A mapping between
the promiscuous PVLAN and any secondary PVLANs must be performed
using the command switchport private-vlan mapping promiscuous-vlan-id
secondary-pvlan-vlan-id. If multiple secondary PVLANs are used, delineate
with the use of a comma.
Step 6. Configure the switchport(s) for the isolated PVLAN. Change the configuration
context to the switch port for the isolated host with the command interface
interface-id. Change the switch port mode to the secondary PVLAN
type with the command switchport mode private-vlan host.
The switch port must then be associated to the isolated PVLAN with the
command switchport access vlan isolated-vlan-id. A mapping between the
promiscuous PVLAN and the isolated PVLAN must be performed using
the command switchport private-vlan host-association
promiscuous-vlan-id isolated-pvlan-vlan-id.
NX-1(config-vlan)# vlan 10
NX-1(config-vlan)# name PVLAN-PROMISCOUS
NX-1(config-vlan)# private-vlan primary
NX-1(config-vlan)# private-vlan association 20
NX-1(config-vlan)# exit
NX-1(config)# interface Ethernet1/1
NX-1(config-if)# switchport mode private-vlan promiscuous
NX-1(config-if)# switchport access vlan 10
NX-1(config-if)# switchport private-vlan mapping 10 20
NX-1(config-if)# interface Ethernet1/2
NX-1(config-if)# switchport mode private-vlan host
NX-1(config-if)# switchport access vlan 20
NX-1(config-if)# switchport private-vlan host-association 10 20
NX-1(config-if)# interface Ethernet1/3
NX-1(config-if)# switchport mode private-vlan host
NX-1(config-if)# switchport access vlan 20
NX-1(config-if)# switchport private-vlan host-association 10 20
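When the same isolated PVLAN design is deployed on many switches, the configuration steps can be rendered from a few parameters. The following Python sketch only assembles the commands used in this section into a list of lines; the function name is hypothetical, and the output should be reviewed against the target platform before it is applied.

```python
def isolated_pvlan_config(primary, isolated, promiscuous_port, host_ports):
    """Return the CLI lines for one isolated PVLAN, following Steps 1 to 6."""
    lines = [
        "feature private-vlan",
        f"vlan {isolated}",
        "  private-vlan isolated",
        f"vlan {primary}",
        "  private-vlan primary",
        f"  private-vlan association {isolated}",
        f"interface {promiscuous_port}",
        "  switchport mode private-vlan promiscuous",
        f"  switchport access vlan {primary}",
        f"  switchport private-vlan mapping {primary} {isolated}",
    ]
    for port in host_ports:
        lines += [
            f"interface {port}",
            "  switchport mode private-vlan host",
            f"  switchport access vlan {isolated}",
            f"  switchport private-vlan host-association {primary} {isolated}",
        ]
    return lines


config = isolated_pvlan_config(10, 20, "Ethernet1/1", ["Ethernet1/2", "Ethernet1/3"])
print("\n".join(config))
```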
In Example 4-10, the primary VLAN correlates to the promiscuous PVLAN, the sec-
ondary VLAN correlates to the isolated PVLAN, the PVLAN type is confirmed, and all
active ports are listed off to the side. The promiscuous ports are always included. If a
port is missing, recheck the interface configuration because an error probably exists in
the PVLAN mapping configuration.
! Notice how there are not any ports listed in the regular VLAN section because
! they are all in the PVLAN section.
VLAN Name Status Ports
---- -------------------------------- --------- -------------------------------
1 default active Eth1/4, Eth1/5, Eth1/6, Eth1/7
10 PVLAN-PROMISCOUS active
20 PVLAN-ISOLATED active
..
Primary Secondary Type Ports
------- --------- --------------- -------------------------------------------
Note An isolated or community VLAN can be associated with only one primary VLAN.
PVLAN ports require a different port type and are set by the switchport mode private-
vlan {promiscuous | host} command. This setting is verified by examining the interface
using the show interface command. Example 4-11 displays the verification of the
PVLAN switch port type setting.
Another technique is to verify that the isolated PVLAN host devices can reach
the promiscuous host device. This is achieved with a simple ping test, as shown in
Example 4-12.
Step 1. Enable the private VLAN feature. Enable the PVLAN feature with the com-
mand feature private-vlan in the global configuration mode.
Step 2. Define the community PVLAN. Create the community PVLAN with the
command vlan vlan-id. Underneath the VLAN configuration context, iden-
tify the VLAN as a community PVLAN with the command private-vlan
community.
Step 3. Define the promiscuous PVLAN. Create the promiscuous PVLAN with the
command vlan vlan-id. Underneath the VLAN configuration context, iden-
tify the VLAN as a promiscuous PVLAN with the command private-vlan
primary.
Step 5. Configure the switch port(s) for the promiscuous PVLAN. Change the con-
figuration context to the switch port for the promiscuous host with the com-
mand interface interface-id. Change the switch port mode to promiscuous
PVLAN with the command switchport mode private-vlan promiscuous.
The switch port must then be associated to the promiscuous PVLAN with the
command switchport access vlan promiscuous-vlan-id. A mapping between
the promiscuous PVLAN and any secondary PVLANs needs to be performed
using the command switchport private-vlan mapping promiscuous-vlan-id
secondary-pvlan-vlan-id. If multiple secondary PVLANs are used, delineate
with the use of a comma.
Step 6. Configure the switch port(s) for the community PVLAN. Change the
configuration context to the switch port for the community host with the
command interface interface-id. Change the switch port mode to the
secondary PVLAN type with the command switchport mode private-vlan host.
The switch port must then be associated to the community PVLAN with the
command switchport access vlan community-vlan-id. A mapping between the
promiscuous PVLAN and the community PVLAN needs to be performed
using the command switchport private-vlan host-association
promiscuous-vlan-id community-pvlan-vlan-id.
Example 4-13 displays the deployment of VLAN 30 as a community PVLAN for Host-4
and Host-5 along with VLAN 40 for Host-6 and Host-7, according to Figure 4-3. VLAN
10 is the promiscuous PVLAN.
NX-1(config)# vlan 30
NX-1(config-vlan)# name PVLAN-COMMUNITY1
NX-1(config-vlan)# private-vlan community
NX-1(config-vlan)# vlan 40
NX-1(config-vlan)# name PVLAN-COMMUNITY2
NX-1(config-vlan)# private-vlan community
NX-1(config-vlan)# vlan 10
NX-1(config-vlan)# name PVLAN-PROMISCOUS
NX-1(config-vlan)# private-vlan primary
NX-1(config-vlan)# private-vlan association 20,30,40
NX-1(config-vlan)# exit
NX-1(config)# interface Ethernet1/1
NX-1(config-if)# switchport mode private-vlan promiscuous
NX-1(config-if)# switchport access vlan 10
NX-1(config-if)# switchport private-vlan mapping 10 20,30,40
NX-1(config-if)# interface Ethernet1/4
NX-1(config-if)# switchport mode private-vlan host
NX-1(config-if)# switchport access vlan 30
NX-1(config-if)# switchport private-vlan host-association 10 30
NX-1(config-if)# interface Ethernet1/5
NX-1(config-if)# switchport mode private-vlan host
NX-1(config-if)# switchport access vlan 30
NX-1(config-if)# switchport private-vlan host-association 10 30
NX-1(config-if)# interface Ethernet1/6
NX-1(config-if)# switchport mode private-vlan host
NX-1(config-if)# switchport access vlan 40
NX-1(config-if)# switchport private-vlan host-association 10 40
NX-1(config-if)# interface Ethernet1/7
NX-1(config-if)# switchport mode private-vlan host
NX-1(config-if)# switchport access vlan 40
NX-1(config-if)# switchport private-vlan host-association 10 40
Note VLAN 20 was a part of the promiscuous port configuration to demonstrate how
isolated and community PVLANs co-exist as a continuation of the previous configuration
to provide the solution shown in Figure 4-3.
Example 4-14 displays all the PVLANs and associated ports. Notice how VLAN 10 is
the primary VLAN for VLANs 20, 30, and 40.
Example 4-15 provides basic verification that all hosts in the isolated and community
PVLANs can reach R1. No host is allowed to reach any other host in the isolated
PVLAN, whereas hosts in community PVLANs can reach only hosts in the same
community PVLAN.
! Verification that both hosts can ping other hosts in the same community PVLAN
Host-4# ping 10.0.0.5
Sending 5, 100-byte ICMP Echos to 10.0.0.5, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/5/9 ms
! Verification that both hosts cannot ping hosts in the other community PVLAN
Host-4# ping 10.0.0.6
Sending 5, 100-byte ICMP Echos to 10.0.0.6, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
! Verification that both hosts cannot ping hosts in the isolated PVLAN
Host-4# ping 10.0.0.2
Sending 5, 100-byte ICMP Echos to 10.0.0.2, timeout is 2 seconds:
.....
Success rate is 0 percent (0/5)
NX-1# conf t
NX-1(config)# interface vlan 10
NX-1(config-if)# ip address 10.0.0.10/24
NX-1(config-if)# private-vlan mapping 20,30,40
NX-1(config-if)# no shut
NX-1(config-if)# do show run vlan
! Output omitted for brevity
vlan 10
name PVLAN-PROMISCOUS
private-vlan primary
private-vlan association 20,30,40
vlan 20
name PVLAN-ISOLATED
private-vlan isolated
vlan 30
name PVLAN-COMMUNITY1
private-vlan community
vlan 40
name PVLAN-COMMUNITY2
private-vlan community
The promiscuous PVLAN SVI port mapping is confirmed with the command show
interface vlan promiscuous-vlan-id private-vlan mapping, as shown in Example 4-17.
Example 4-18 demonstrates the connectivity between the hosts with the promiscuous
PVLAN SVI. The two promiscuous devices (NX-1 and R1) can ping each other. In
addition, all the hosts (demonstrated by Host-2) can ping both NX-1 and R1 without
impacting the PVLAN functionality assigned to isolated or community PVLAN ports.
! Verification that the promiscuous SVI can ping the other promiscuous
! host (R1)
NX-1# ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1): 56 data bytes
! Verification that an isolated PVLAN host can ping the physical and SVI
! promiscuous ports
Host-2# ping 10.0.0.1
Sending 5, 100-byte ICMP Echos to 10.0.0.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 1/8/25 ms
! Verification that an isolated host cannot ping another host in the isolated PVLAN
■ All switches support PVLANs: In this scenario, all the PVLANs and their pri-
mary/secondary mappings must be configured on both upstream and downstream
switches. A normal 802.1Q trunk link is established between the devices. The switch
with the promiscuous port is responsible for directing traffic to/from the promiscu-
ous port. In this scenario, spanning tree maintains a separate instance for each of the
PVLANs.
■ The upstream switch does not support PVLANs: In this scenario, the PVLANs and
their primary/secondary mappings must be configured on the downstream switch.
Because the upstream switch does not support PVLANs, the downstream switch must
merge/separate the secondary PVLANs to the primary PVLANs so that devices on the
upstream switch only need to use the primary PVLAN-ID. The upstream trunk switch
port is configured with the command switchport mode private-vlan trunk promiscuous.
These trunk ports are often referred to as promiscuous PVLAN trunk ports.
■ The downstream switch does not support PVLANs: In this scenario, the PVLANs
and their primary/secondary mappings must be configured on the upstream switch.
Because the downstream switch does not support PVLANs, the upstream switch
must merge/separate the secondary PVLANs to the primary PVLANs so that
devices on the downstream switch only need to use the secondary PVLAN-ID. The
downstream trunk switch port is configured with the command switchport mode
private-vlan trunk secondary. These trunk ports are often referred to as isolated
PVLAN trunk ports.
Note In all three scenarios, regular VLANs are transmitted across the trunk link.
Note Not all Nexus platforms support the promiscuous or isolated PVLAN trunk ports.
Check www.cisco.com for feature parity.
However, these topologies cause problems when a switch must flood broadcast or
unknown unicast frames. Network broadcasts are forwarded in a continuous loop
until the link becomes saturated and the switch is forced to drop packets. In addition,
MAC address table entries constantly move between ports as the packets loop,
which increases CPU and memory consumption and can crash the switch.
The Spanning Tree Protocol is the protocol that builds an L2 loop-free topology in
an environment by temporarily blocking traffic on specific ports. The Spanning Tree
Protocol has multiple iterations:
Nexus switches operate in RSTP or MST mode only. Both of these are backward
compatible with 802.1D standards.
■ Listening: The switch port has transitioned from a blocking state and can now send
or receive BPDUs. It cannot forward any other network traffic.
■ Learning: The switch port can now modify the MAC address table with any
network traffic that it receives. The switch still does not forward any other network
traffic besides BPDUs. The switch port transitions into this state after the forward
delay has expired.
■ Forwarding: The switch port can forward all network traffic and can update the
MAC address table as expected. This is the final state for a switch port to forward
network traffic.
The original Spanning Tree Protocol defined the following three port types:
■ Designated port: A network port that receives and forwards frames to other switches.
Designated ports provide connectivity to downstream devices and switches.
■ Root port: A network port that connects to the root switch or an upstream switch in
the spanning-tree topology.
■ Blocking port: A network port that is not forwarding traffic because of Spanning Tree
Protocol.
Within the Spanning Tree Protocol are a couple of key terms that must be understood:
■ Root bridge: The root bridge is the most important switch in the L2 topology. All
ports are in a forwarding state. This switch is considered the top of the spanning-
tree for all path calculations by other switches. All ports on the root bridge are
categorized as designated ports.
■ Bridge Protocol Data Unit (BPDU): This network frame is used strictly for detecting
the STP topology so that switches can identify the root bridge, root ports, designated
ports, and blocking ports. The BPDU consists of the following fields: STP Type, Root
Path Cost, Root Bridge Identifier, Local Bridge Identifier, Max Age, Hello Time,
Forward Delay. The BPDU uses a destination MAC address of 01:80:c2:00:00:00.
■ Root Path Cost: The cumulative cost of a specific path toward the root
switch.
■ Root Bridge Identifier: Combination of the root bridge system MAC, system-ID
extension, and system priority of the root bridge.
■ Max Age: The timer that controls the maximum length of time that a bridge port
retains its saved BPDU information before discarding it. On Nexus switches, this is
relevant for backward compatibility with switches using traditional 802.1D STP.
■ Hello Time: The interval at which BPDUs are advertised out of a port. The default
value is 2 seconds and can be configured to a value of 1 to 10 seconds with the command
spanning-tree vlan vlan-id hello-time hello-time.
■ Forward Delay: The amount of time that a port stays in each of the listening and
learning states. The default value is 15 seconds and can be changed to a value of 4 to 30
seconds with the command spanning-tree vlan vlan-id forward-time forward-time.
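The timers above combine into the commonly cited 50-second convergence figure for traditional 802.1D. The following Python sketch shows the arithmetic. It assumes the default Max Age of 20 seconds and that stored BPDU information must age out before the port moves through the listening and learning states; the function name is hypothetical.

```python
def legacy_convergence_delay(max_age=20, forward_delay=15):
    """Worst-case seconds before an 802.1D port forwards after a failure.

    Assumption: the stored BPDU must age out (max_age) before the port
    spends one forward delay in listening and one in learning.
    """
    listening = forward_delay
    learning = forward_delay
    return max_age + listening + learning


print(legacy_convergence_delay())  # 20 + 15 + 15
```

With the default timers the result is 50 seconds, which is the delay that RSTP handshakes are designed to avoid.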
Note A lot of STP terminology uses the term bridge, even though STP runs on switches.
The terms bridge and switch are interchangeable in this context.
PVST and PVST+ were proprietary spanning-tree protocols. The concepts in these protocols
were incorporated with other enhancements to form the IEEE 802.1W specification.
The 802.1W specification incorporated additional enhancements to provide faster
convergence and is called Rapid Spanning Tree Protocol (RSTP).
■ Designated port: A network port that receives and forwards frames to other switches.
Designated ports provide connectivity to downstream devices and switches.
■ Root port (RP): A network port that connects to the root switch or an upstream
switch in the spanning-tree topology.
■ Alternate port: A network port that provides alternate connectivity toward the root
switch via a different switch.
■ Backup port: A network port that provides link redundancy toward the current root
switch. The backup port cannot guarantee connectivity to the root bridge in the
event the upstream switch fails. A backup port exists only when multiple links
connect between the same switches.
With RSTP, switches exchange handshakes with other switches to transition
through the following Spanning Tree Protocol states faster:
■ Discarding: The switch port is enabled, but the port is not forwarding any traffic to
ensure that a loop is not created. This state combines the traditional Spanning Tree
Protocol states of Disabled, Blocking, and Listening.
■ Learning: The switch port now modifies the MAC address table with any network
traffic that it receives. The switch still does not forward any other network traffic
besides BPDUs.
■ Forwarding: The switch port forwards all network traffic and updates the MAC
address table as expected. This is the final state for a switch port to forward
network traffic.
Note A switch tries to establish an RSTP handshake with the device connected to the
port. If a handshake does not occur, the other device is assumed to be non-RSTP
compatible, and the port defaults to regular 802.1D behavior. This means that host devices such
as computers and printers still encounter a significant transmission delay (~50 seconds)
after the network link is established.
Note RSTP is enabled by default for any L2 switch port with a basic configuration.
Additional configuration can be applied to the switch to further tune RSTP.
■ If the neighbor’s BPDU is inferior to its own BPDU, the switch ignores that BPDU.
■ If the neighbor’s BPDU is preferred to its own BPDU, the switch updates its BPDUs
to include the new root bridge identifier along with a new root path cost that
reflects the total path cost to reach the new root bridge. This process continues until
all switches in a topology have identified the root bridge switch.
Spanning Tree Protocol deems a switch more preferable if the priority in its bridge
identifier is lower than the priority in other BPDUs. If the priority is the same, the
switch prefers the BPDU with the lower system MAC.
Note Generally, older switches have a lower MAC address and are considered more
preferable. Configuration changes can be made for optimizing placement of the root
switch in an L2 topology.
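The comparison rule (lower priority first, then lower system MAC) can be sketched in Python. The system MAC addresses below come from this chapter's examples. The priority of 32769 is shown in the output for NX-1 and is assumed here for the other switches, and the helper names are hypothetical.

```python
def bridge_id(priority, mac):
    """Return a sortable bridge identifier: priority first, then system MAC."""
    return (priority, int(mac.replace(".", ""), 16))


def elect_root(bridges):
    """bridges maps a switch name to a (priority, system MAC) pair."""
    return min(bridges, key=lambda name: bridge_id(*bridges[name]))


switches = {
    "NX-1": (32769, "5e00.4000.0007"),
    "NX-2": (32769, "5e00.4001.0007"),  # priority assumed equal to NX-1
    "NX-3": (32769, "5e00.4002.0007"),  # priority assumed equal to NX-1
}
print(elect_root(switches))
```

With equal priorities the lowest MAC decides the election, so lowering the priority on any one switch is enough to override the MAC tiebreaker.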
[Figure 4-4: Spanning Tree Protocol topology of switches NX-1 through NX-5. The figure labels each switch with its system MAC address and each inter-switch port with its role: designated port (DP), root port (RP), or alternate port (ALTN).]
The same command is run on NX-2 and NX-3 with the output displayed in Example 4-20.
The Root ID field is the same as NX-1; however, the root path cost has changed to 2
because both switches must use the 10 Gbps link to reach NX-1. Eth1/1 has been
identified on both of these switches as the root port.
2. The interface associated with the lowest system priority of the advertising switch.
3. The interface associated with the lowest system MAC address of the advertising
switch.
4. When multiple links are associated with the same switch, the lowest port priority from
the advertising switch is preferred.
5. When multiple links are associated with the same switch, the lower port number from
the advertising switch is preferred.
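The tiebreakers above amount to an ordered comparison, which the following Python sketch expresses as a tuple sort. It assumes that the first criterion, which is not shown in the list above, is the lowest root path cost. The class name, the sample cost, and the port values are illustrative only.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    local_port: str
    root_path_cost: int        # assumed first criterion (not listed above)
    sender_priority: int       # step 2: advertising switch system priority
    sender_mac: str            # step 3: advertising switch system MAC
    sender_port_priority: int  # step 4: port priority on the advertising switch
    sender_port_number: int    # step 5: port number on the advertising switch


def select_root_port(candidates):
    """Return the local port whose received BPDU wins every tiebreaker."""
    best = min(candidates, key=lambda c: (
        c.root_path_cost,
        c.sender_priority,
        int(c.sender_mac.replace(".", ""), 16),
        c.sender_port_priority,
        c.sender_port_number,
    ))
    return best.local_port


# Two links to the same upstream switch; values are illustrative only.
links = [
    Candidate("Eth1/1", 4, 32769, "5e00.4003.0007", 128, 1),
    Candidate("Eth1/4", 4, 32769, "5e00.4003.0007", 128, 5),
]
print(select_root_port(links))
```

Because both links tie on cost, priority, and MAC, the port priority and port number advertised by the upstream switch decide the outcome.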
The command show spanning-tree root is run on NX-4 and NX-5 with the output
displayed in Example 4-21. The Root ID field is the same as NX-1 from Example 4-20;
however, the root path cost has changed to 4 because both switches must traverse two
10 Gbps links to reach NX-1. Eth1/3 was identified as the RP on both switches.
2. The system priority of the local switch is compared to the system priority of the
remote switch. The local port is moved to a blocking state if the remote system pri-
ority is lower than the local switch.
3. The system MAC address of the local switch is compared to the system MAC address of
the remote switch. The local port is moved to a blocking state if the remote system
MAC address is lower than the local switch.
Note Step 3 is the last step of the selection process. If a switch has multiple links toward
the root switch, the downstream switch always identifies the RP. All other ports will
match the criteria for Step 2 or Step 3 and are placed into a blocking state.
The command show spanning-tree [vlan vlan-id] is used to provide useful information
for locating a port’s Spanning Tree Protocol state. Example 4-22 displays NX-1’s
Spanning Tree Protocol information for VLAN 1. The first portion of the output
displays the relevant root bridge’s information, which is then followed by the local bridge’s
information. The associated interface’s Spanning Tree Protocol port cost, port priority,
and port type are displayed as well. All of NX-1’s ports are designated ports (Desg)
because it is the root bridge.
■ Point-to-Point (P2P): This port type connects with another network device (PC or
RSTP switch).
■ P2P Peer (STP): This port type detects that it is connected to an 802.1D switch and
is operating with backward compatibility.
■ Network P2P: This port type is specifically configured to connect with another
RSTP switch and to provide bridge assurance.
■ Edge P2P: This port type is specifically configured to connect with a host
device (PC, not a switch). Portfast is enabled on this port.
VLAN0001
Spanning tree enabled protocol rstp
! The section displays the relevant information for the STP Root Bridge
Root ID Priority 32769
Address 5e00.4000.0007
This bridge is the root
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
! The section displays the relevant information for the Local STP Bridge
Bridge ID Priority 32769 (priority 32768 sys-id-ext 1)
Address 5e00.4000.0007
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Note If the Type field includes *TYPE_Inc –, this indicates a port configuration
mismatch between the Nexus switch and the switch to which it is connected. Either
the port type or the port mode (access versus trunk) is misconfigured.
Example 4-23 displays the Spanning Tree Protocol topology from NX-2 and NX-3.
Notice that in the first root bridge section, the output provides the total root path cost
and the port on the switch that is identified as the RP.
All the ports on NX-2 are in a forwarding state, but port Eth1/2 on NX-3 is in a blocking
(BLK) state. Specifically, that port has been designated as an alternate port to reach the
root in the event that the Eth1/1 connection fails.
The reason that NX-3’s Eth1/2 port was placed into a blocking state versus NX-2’s
Eth1/3 port is that NX-2’s system MAC address (5e00.4001.0007) is lower than NX-3’s
system MAC address (5e00.4002.0007). This was deduced by looking at Figure 4-4
and the system MAC addresses in the output.
VLAN0001
Spanning tree enabled protocol rstp
Root ID Priority 32769
Address 5e00.4000.0007
Cost 2
Port 1 (Ethernet1/1)
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
Example 4-24 Viewing VLANs Participating with Spanning Tree Protocol on an Interface
Selecting the primary keyword sets the priority to 24,576 and the secondary
keyword sets the priority to 28,672.
Note The best way to prevent erroneous devices from taking over the root role is to set
the priority to zero on the desired root bridge switch.
Example 4-25 demonstrates NX-1 being set as the root primary and NX-2 being set as
the root secondary. Notice on NX-2’s output that it displays the root system priority,
which is different from its system priority.
Address 5e00.4000.0007
Note Notice that the priority on NX-1 is off by one. That is because the priority in the
BPDU packets is the priority plus the value of the Sys-Id-Ext (which is the VLAN
number). So the priority for VLAN 1 is 24,577, and the priority for VLAN 10 is 24,586.
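The arithmetic in this note can be checked with a short Python sketch. The function name is hypothetical, and the check that configured priorities are multiples of 4096 reflects the usual extended system ID convention rather than anything stated in this section.

```python
def advertised_priority(configured_priority, vlan_id):
    """Priority carried in the BPDU: configured priority plus Sys-Id-Ext."""
    # Assumption: with the extended system ID, configured priorities are
    # multiples of 4096 and the VLAN ID fills the remaining low-order bits.
    if configured_priority % 4096 != 0:
        raise ValueError("priority must be a multiple of 4096")
    if not 1 <= vlan_id <= 4094:
        raise ValueError("VLAN ID must be between 1 and 4094")
    return configured_priority + vlan_id


print(advertised_priority(24576, 1))   # root primary, VLAN 1
print(advertised_priority(24576, 10))  # root primary, VLAN 10
```

The same calculation explains the default value of 32769 shown for VLAN 1 in the earlier output (32768 plus 1).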
Root Guard
Root guard is a Spanning Tree Protocol feature that prevents a configured port from
becoming a root port by placing a port in ErrDisabled state if a superior BPDU is
received on a configured port. Root guard prevents a downstream switch (often
misconfigured or rogue) from becoming a root bridge in a topology.
Modifying Spanning Tree Protocol Root Port and Blocked Switch Port Locations
The Spanning Tree Protocol port cost is used for calculating the Spanning Tree Protocol
tree. When a switch generates the BPDUs, the total path cost includes only its calculated
metric to the root and does not include the cost of the port that the BPDU is advertised
out of. The receiving switch then adds the port cost of the interface on which the BPDU
was received to the total path cost in the BPDU.
In Figure 4-4, NX-1 advertises its BPDUs to NX-3 with a total path cost of zero. NX-3
receives the BPDU and adds its Spanning Tree Protocol port cost of 2 to the total path
cost in the BPDU (zero), resulting in a value of 2. NX-3 then advertises the BPDU toward
NX-5 with a total path cost of 2, which NX-5 then adds to its port cost of 2. NX-5
reports a cost of 4 to reach the root bridge via NX-3. The logic is confirmed in the
output of Example 4-26. Notice that there is not a total path cost in NX-1’s output.
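The accumulation described for the NX-1, NX-3, NX-5 path can be sketched in Python. The function name is hypothetical, and the port costs of 2 are the values quoted in the paragraph above.

```python
def root_path_costs(receiving_port_costs):
    """Return the root path cost seen at each switch along one path.

    The first entry is the root bridge, which advertises a cost of zero.
    Each later switch adds the cost of the port that received the BPDU.
    """
    total = 0
    costs = [total]
    for port_cost in receiving_port_costs:
        total += port_cost
        costs.append(total)
    return costs


# NX-1 (root) -> NX-3 (port cost 2) -> NX-5 (port cost 2)
print(root_path_costs([2, 2]))
```

The result lists a cost of 0 at NX-1, 2 at NX-3, and 4 at NX-5, matching the values described for the example output.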
The interface path cost is modified to impact which ports are designated or alternate ports
with the interface configuration command spanning-tree [vlan vlan-id] cost cost. The cost is
set for all VLANs by omitting the optional vlan keyword, or for a specific VLAN by including it.
Example 4-27 demonstrates the modification of NX-3’s port cost for Eth1/1, which
ultimately impacts the Spanning Tree Protocol topology because the Eth1/2 port is no
longer an alternate port, but is now a designated port. NX-2’s Eth1/3 port changed from
a designated port to an alternate port.
Example 4-28 verifies that this change has made NX-5’s Eth1/1 the RP toward NX-4.
Remember that the system ID and port cost are the same, so the next check is port priority,
followed by the port number. Both the port priority and port number are controlled
by the upstream switch.
Modify the port priority on NX-4 with the command spanning-tree [vlan vlan-id] port-
priority priority. The optional vlan keyword allows changing the priority on a VLAN-
by-VLAN basis. Example 4-29 displays changing the port priority on NX-4’s Eth1/5 port
to 64, and the impact it has on NX-5. Notice how NX-5’s Eth1/4 port is now the RP.
Example 4-29 Verification of Port Priority Impact on a Spanning Tree Protocol Topology
The switch that detects the link status change sends a topology change notification
(TCN) toward the root bridge. The root bridge creates a new TCN, which is then
flooded toward all the switches in the L2 forwarding domain. Upon receipt of the root
bridge’s TCN, all switches flush their MAC address table. This results in traffic being
flooded out all ports while the MAC address table is rebuilt. Remember that hosts
communicate using CSMA/CD, so this behavior causes a delay in communication while the
switch rebuilds its MAC address table.
TCNs are generated on a VLAN basis, so the impact of TCNs directly correlates to the
number of hosts in a VLAN. As the number of hosts increases, TCNs are more likely to be
generated frequently, and more hosts are impacted by the resulting flooding.
Topology changes should be checked as part of the troubleshooting process.
Topology changes are seen with the command show spanning-tree [vlan vlan-id] detail
on the root bridge. In the output, examine the topology change count and time since the
last change has occurred. A sudden or continuous increase in TCNs indicates a potential
problem and should be investigated further.
Example 4-30 displays the output of the show spanning-tree vlan 10 detail command.
Notice that the time since the last TCN was detected and the interface from which the TCN
originated are included. The next step is to locate the switch that is connected to the
port causing the TCN. This is found by looking at CDP tables or your network
documentation. The show spanning-tree [vlan vlan-id] detail command is then executed on
that switch, and the process is repeated until the last switch in the topology is reached
and the problematic port is identified.
Viewing the NX-OS event-history provides additional insight into the Spanning Tree Protocol
activities on a switch. The Spanning Tree Protocol event-history is displayed with the
command show spanning-tree internal event-history all, as demonstrated in Example 4-31.
The generation of TCN for hosts does not make sense because they generally have only
one connection to the network. Restricting TCN creation to only ports that connect with
other switches and network devices increases the L2 network’s stability and efficiency.
The Spanning Tree Protocol portfast feature disables TCN generation for access ports.
Another benefit of the Spanning Tree Protocol portfast feature is that the access ports
bypass the earlier 802.1D Spanning Tree Protocol states (listening and learning) and
forward traffic immediately. This is beneficial in environments where computers use Dynamic
Host Configuration Protocol (DHCP) or Preboot Execution Environment (PXE).
The portfast feature is enabled on a specific port with the command spanning-tree port
type edge, or globally on all access ports with the command spanning-tree port type
edge default.
Example 4-32 demonstrates enabling portfast for NX-1’s Eth1/6 port along with its
verification. Notice how the portfast ports are displayed with Edge P2P. The last section
demonstrates how portfast is enabled globally for all access ports.
MST Configuration
MST is configured by the following process:
Step 1. Set the Spanning Tree Protocol mode as MST. Define MST as the spanning-
tree protocol with the command spanning-tree mode mst.
Step 2. Define the MST instance priority (optional). The MST instance priority is set
for an MST region by one of two methods:
Selecting the primary keyword sets the priority to 24,576 and the
secondary keyword sets the priority to 28,672.
Step 3. Associate VLANs to an MST instance. By default all VLANs are associated
to the MST 0 instance. The MST configuration submode must be entered
with the command spanning-tree mst configuration. Then the VLANs are
assigned to a different MST instance with the command instance instance-
number vlan vlan-id.
Step 4. Specify the MST version number. The MST version number must match for
all switches in the same MST region. The MST version number is configured
with the submode configuration command revision version.
Step 5. Define the MST region name (optional). MST regions are recognized by
switches that share a common name. By default, a region name is an empty
string. The MST region name is set with the command name mst-region-name.
Example 4-33 demonstrates the MST configuration on NX-1. MST instance 2 contains
VLAN 30, MST instance 1 contains VLANs 10 and 20, and MST instance zero contains
all other VLANs.
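The VLAN-to-instance mapping described for NX-1 can be modeled with a small Python lookup. This is only an illustration of the mapping logic; the dictionary and function names are hypothetical.

```python
# MST instance membership as described for NX-1: instance 1 carries
# VLANs 10 and 20, instance 2 carries VLAN 30.
MST_INSTANCES = {1: {10, 20}, 2: {30}}


def mst_instance_for(vlan_id, instances=MST_INSTANCES):
    """Return the MST instance that carries a VLAN."""
    for instance, vlans in instances.items():
        if vlan_id in vlans:
            return instance
    # Any VLAN not explicitly assigned remains in MST instance 0.
    return 0


for vlan in (10, 20, 30, 40):
    print(vlan, mst_instance_for(vlan))
```

Because unassigned VLANs fall back to instance 0, a new VLAN created later joins instance 0 until it is explicitly mapped.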
MST Verification
The relevant spanning tree information can still be obtained with the command show
spanning-tree. The primary difference is that the VLAN numbers are not shown, but
the MST instance is provided instead. In addition, the priority value for a switch is the MST
instance plus the switch priority. Example 4-35 displays the output of this command.
A consolidated view of the MST topology table is displayed with the command
show spanning-tree mst [instance-number]. The optional instance-number can be
included to restrict the output to a specific instance. The command is demonstrated
in Example 4-36. Notice that the VLANs are displayed next to the MST instance,
simplifying any steps for troubleshooting.
The specific MST settings are viewed for a specific interface with the command show
spanning-tree mst interface interface-id. Example 4-37 demonstrates the command.
Notice that the output also includes additional information about optional Spanning
Tree Protocol features, like BPDU Filter and BPDU Guard.
MST Tuning
MST supports the tuning of port cost and port priority. The interface configuration
command spanning-tree mst instance-number cost cost sets the interface cost.
Example 4-38 demonstrates the configuration of NX-3’s Eth1/1 port being modified to
a cost of 1, and verification of the interface cost before and after the change.
NX-4’s Eth1/5 port being modified to a priority of 64, and verification of the interface
priority before and after the change.
■ Misconfigured load-balancer that transmits traffic out multiple ports with the same
MAC address.
■ Misconfigured virtual switch that bridges two physical ports. Virtual switches
typically do not participate in Spanning Tree Protocol.
Cisco added a protection mechanism that keeps CPU utilization from increasing during an
L2 forwarding loop. When the MAC address move threshold is crossed (three times back
and forth across a set of ports in a 10-second interval), the Nexus switch flushes the MAC
address table and stops learning MAC addresses for a specific amount of time. Packets
continue to be forwarded in a loop, but the CPU does not max out, which allows
other diagnostic commands to be executed so that the situation can be remediated.
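The move-threshold idea can be illustrated with a sliding-window counter in Python. This is a simplified sketch, not the NX-OS algorithm: it counts every port change for a MAC address as one move and trips after three moves within 10 seconds, which is one possible reading of the threshold described above. The class and method names are hypothetical.

```python
from collections import deque


class MacMoveDetector:
    """Simplified MAC move detector using a per-address sliding window."""

    def __init__(self, threshold=3, window=10.0):
        self.threshold = threshold  # moves allowed inside the window
        self.window = window        # window length in seconds
        self.last_port = {}         # MAC address -> last learned port
        self.moves = {}             # MAC address -> timestamps of moves

    def learn(self, mac, port, now):
        """Record a learn event; return True when the threshold is crossed."""
        previous = self.last_port.get(mac)
        self.last_port[mac] = port
        if previous is None or previous == port:
            return False  # first sighting or same port: not a move
        events = self.moves.setdefault(mac, deque())
        events.append(now)
        # Discard moves that fall outside the sliding window.
        while events and now - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold


detector = MacMoveDetector()
for second, port in enumerate(["Eth1/1", "Eth1/2", "Eth1/1", "Eth1/2"]):
    print(second, port, detector.learn("0000.1111.2222", port, float(second)))
```

A host that legitimately moves once between ports never reaches the threshold, whereas an address that flaps between two ports inside the window does.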
Example 4-40 displays the detection of the forwarding loop on VLAN 1 and the
flushing of the MAC address table.
NX-OS provides an enhancement to this detection and places the port in a shutdown
state when it detects a flapping MAC address. This functionality is enabled with the
command mac address-table loop-detect port-down. Example 4-41 demonstrates the
configuration of this feature, an occurrence where the feature is engaged, and how
the interface is confirmed to be in a down state.
Note Some platforms do not display the MAC notifications by default and require the
following additional configuration commands:
logging level spanning-tree 6
logging level fwm 6
logging monitor 6
BPDU Guard
BPDU guard is a safety mechanism that shuts down ports configured with Spanning Tree
Protocol portfast upon receipt of a BPDU. This ensures that loops cannot accidentally
be created if an unauthorized switch is added to a topology.
BPDU guard is enabled globally on all Spanning Tree Protocol portfast ports with the
command spanning-tree port type edge bpduguard default. BPDU guard can be enabled
or disabled on a specific interface with the command spanning-tree bpduguard {enable |
disable}. Example 4-42 displays the BPDU guard configuration for a specific port or
globally on all access ports. Upon examination of the spanning-tree port details, the by default
keyword indicates that the global configuration is what applied BPDU guard to that port.
Note BPDU guard should be configured on all host-facing ports. However, do not
enable BPDU guard on PVLAN promiscuous ports.
By default, ports that are put in ErrDisabled because of BPDU guard do not automatically
restore themselves. The Error Recovery service can be used to reactivate ports that
are shut down for a specific problem, thereby reducing administrative overhead. The
Error Recovery service recovers ports shut down by BPDU guard with the command
errdisable recovery cause bpduguard. The interval at which the Error Recovery service
checks for ports is configured with the command errdisable recovery interval time-seconds.
Example 4-43 demonstrates the configuration of the Error Recovery service for BPDU
guard and Error Recovery in action.
BPDU Filter
BPDU filter quite simply blocks BPDUs from being transmitted out of a port. BPDU
filter can be enabled globally or on a specific interface. The behavior changes depending
upon the configuration:
■ If BPDU filter is enabled globally with the command spanning-tree port type edge
bpdufilter default, the port sends a series of at least 10 BPDUs. If the remote port has
BPDU guard on it, that generally shuts down the port as a loop prevention mechanism.
Note Be careful with the deployment of BPDU filter because it could cause problems.
Most network designs do not require BPDU filter, and the use of BPDU filter adds an
unnecessary level of complexity while introducing risk.
Example 4-44 verifies the BPDU filter was enabled globally on the Eth1/1 interface. This
configuration sends the 10 BPDUs when the port first becomes active.
■ Bridge Assurance
Loop guard is enabled globally using the command spanning-tree loopguard default, or
it can be enabled on an interface basis with the interface command spanning-tree guard
loop. It is important to note that loop guard should not be enabled on portfast-enabled
ports (it directly conflicts with the root/alternate port logic), nor should it be enabled on
virtual port-channel (vPC) ports.
Example 4-45 demonstrates the configuration of loop guard on NX-2’s Eth1/1 port.
Placing BPDU filter on NX-2’s Eth1/1 port that connects to NX-1 (the root bridge)
triggers loop guard. This is demonstrated in Example 4-46.
At this point in time, the port is considered in an inconsistent state. Inconsistent ports
are viewed with the command show spanning-tree inconsistentports, as shown in
Example 4-47. Notice how an entry exists for all the VLANs carried across the Eth1/1 port.
The UDLD feature must be enabled first with the command feature udld. UDLD is then
enabled under the specific interface with the command udld enable. Example 4-48
demonstrates NX-1’s UDLD configuration on the link to NX-2.
UDLD must be enabled on the remote switch as well. Once configured, the status of UDLD
for an interface is checked using the command show udld interface-id. Example 4-49
displays the output of UDLD status for an interface. The output contains the current state,
Device-IDs (Serial Numbers), originating interface-IDs, and return interface-IDs.
Interface Ethernet1/49
--------------------------------
Port enable administrative configuration setting: enabled
Port enable operational state: enabled
Current bidirectional state: bidirectional
Current operational state: advertisement - Single neighbor detected
Message interval: 15
Timeout interval: 5
Entry 1
----------------
Expiration time: 35
Cache Device index: 1
Current neighbor state: bidirectional
Device ID: FDO1348R0VM
Port ID: Eth1/2
Neighbor echo 1 devices: FOC1813R0C
Neighbor echo 1 port: Ethernet1/1
Message interval: 15
Timeout interval: 5
CDP Device name: NX-2
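When this output is collected by a script, the key/value layout lends itself to simple parsing. The following Python sketch assumes the "name: value" line format shown above and checks the bidirectional state. The function names are hypothetical, and the sample text is an excerpt of the output above.

```python
def parse_udld(output):
    """Parse "name: value" lines; the first occurrence of a name wins."""
    fields = {}
    for line in output.splitlines():
        name, separator, value = line.partition(":")
        if separator and name.strip() not in fields:
            fields[name.strip()] = value.strip()
    return fields


def udld_is_bidirectional(output):
    """Return True when the interface reports a bidirectional state."""
    return parse_udld(output).get("Current bidirectional state") == "bidirectional"


sample = """Port enable operational state: enabled
Current bidirectional state: bidirectional
Current operational state: advertisement - Single neighbor detected
Message interval: 15"""
print(udld_is_bidirectional(sample))
```

Keeping the first occurrence of each name means the interface-level fields take precedence over the repeated fields in the neighbor entry.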
After a UDLD failure, the interface state indicates that the port is down because of
UDLD failure, as shown in Example 4-50.
The event-history provides relevant information for troubleshooting UDLD errors. The
history is viewed with the command show udld internal event-history errors. This
provides a time stamp and preliminary indication as to the cause of the problem. The history
is displayed in Example 4-51.
There are two common UDLD failures, which are described in the following sections:
■ Empty Echo
■ Tx-Rx Loop
Empty Echo
The Empty echo UDLD problem occurs in the following circumstances:
Example 4-52 demonstrates the syslog messages that appear with a UDLD Empty Echo
Detection.
Tx-Rx Loop
This condition occurs when a UDLD frame appears to be received on the same port that
it was advertised on. This means that the system-ID and port-ID in the received UDLD
packet match the system-ID and port-ID on the receiving switch (that is, what was
transmitted by the other switch). The Tx-Rx loop occurs in the following circumstances:
Example 4-53 demonstrates the syslog messages that appear with a UDLD Tx-Rx Loop
Detection.
Bridge Assurance
Bridge assurance overcomes some of the limitations that loop guard and UDLD are
affected by. Bridge assurance works on Spanning Tree Protocol designated ports (which
loop guard cannot) and overcomes issues when a port starts off in a unidirectional state.
Bridge assurance makes Spanning Tree Protocol operate like a routing protocol (EIGRP/
OSPF, and so on) where it requires health-check packets to occur bidirectionally.
The bridge assurance process is enabled by default, but requires that the trunk ports
are explicitly configured with the command spanning-tree port type network.
Example 4-54 demonstrates bridge assurance being configured on the interfaces
connecting NX-1, NX-2, and NX-3 with each other.
Example 4-55 displays the Spanning Tree Protocol port type after configuring bridge
assurance. Notice how the Network keyword has been added to the P2P type.
Example 4-55 Viewing the Spanning Tree Protocol Type of Ports with Bridge Assurance
Example 4-56 demonstrates a BPDU filter being applied on NX-2’s link to NX-3. Almost
instantly, bridge assurance has engaged on NX-2 and NX-3 because it cannot maintain a
mutual handshake of BPDU packets.
The Spanning Tree Protocol port types now include the comment *BA_Inc*, which
refers to the fact that those interfaces are now in an inconsistent port state for bridge
assurance. Example 4-57 displays the new interface port types.
The command show spanning-tree inconsistentports lists all the interfaces and the reason
that each port is identified as inconsistent. Example 4-58 demonstrates the use of the command
on NX-2. This is relevant to cross-referencing the event-history as shown earlier.
Upon removal of the BPDU filter, bridge assurance disengages and returns the port
to a forwarding state, as shown in Example 4-59.
Note Bridge assurance is the preferred method for unidirectional link detection and
protection and should be used when all platforms support it.
Summary
This chapter provided a brief review of the Ethernet communication standards and the
benefits that a managed switch provides to an L2 topology. Troubleshooting L2 forwarding
issues involves many components. The first step in troubleshooting L2 forwarding
Summary 253
is to identify both the source and destination switch ports. From there it is best to follow
the flowchart in Figure 4-5 for troubleshooting. Depending upon the outcome, the
flowchart will redirect you back to the appropriate section in this chapter.
[Figure 4-5: Flowchart for troubleshooting L2 forwarding. Recoverable steps and decision
points: identify the access ports that the network device is associated with; is either port
in an ErrDisabled state?; were the ports shut down because of MAC address flapping?
(identify the cause of the loop and remediate, removing any hubs and BPDU filtering to help
isolate the location of the loop); investigate STP at the access layer for a rogue switch and
resolve the problem; assign both access ports to the same VLAN; are the hosts connected to
the same switch?; take a SPAN capture to verify each host is sending traffic; configure
PVLAN mappings; configure trunk links to accommodate PVLANs.]
Proper network design takes into account single points of failure by ensuring that alter-
nate paths and devices can forward traffic in case of failure. Routing protocols make sure
that redundant paths can still be consumed because of equal-cost multipath (ECMP).
However, Spanning Tree Protocol (STP) stops forwarding on redundant links between
switches to prevent forwarding loops.
Port-Channels
A port-channel is a logical link that consists of one or more physical member links.
Port-channels are defined in the IEEE 802.3ad Link Aggregation specification and are
sometimes referred to as EtherChannels. The physical interfaces that are used to assemble
the logical port-channel are called member interfaces. A port-channel operates as either a
Layer 2 (L2) switching interface or a Layer 3 (L3) routed interface.
Figure 5-1 visualizes some of the key components of a port-channel (member inter-
face and logical interface), along with the advantages it provides over individual links.
256 Chapter 5: Port-Channels, Virtual Port-Channels, and FabricPath
In Figure 5-2, NX-1 and NX-2 have combined their Ethernet1/1 and Ethernet1/2 interfaces
into Port-Channel1. A failure on Link-A between the optical transport devices DWDM-1
and DWDM-2 is not propagated to the Eth1/1 interface on NX-1 or NX-2. The Nexus
switches continue to forward traffic out the Eth1/1 interface because those ports still
maintain physical state to DWDM-1 or DWDM-2. A port-channel whose ports are statically
set to “on” has no health-check mechanism. However, if LACP were configured,
NX-1 and NX-2 would detect that traffic cannot flow end-to-end on the upper path
and would remove that link from the logical port-channel.
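[Figure 5-2: NX-1 and NX-2 are connected by two paths. Eth1/1 on each switch connects
through DWDM-1, Link-A, and DWDM-2; Eth1/2 on each switch connects through DWDM-3,
Link-B, and DWDM-4.]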
A member link becomes active within a port-channel after establishing an LACP adjacency.
The LACP packets exchanged during this process carry the following flags:
■ Sync (S): Initial flag, indicating that the local switch includes the member interface as
part of the port-channel
■ Collecting (C): Second flag, indicating that the local switch processes network traf-
fic that is received on this interface
■ Distributing (D): Third flag, indicating that the local switch transmits network traffic
using this member interface
When a port comes up, messages are exchanged following these steps:
Step 1. Both switches (source and destination) advertise LACP packets with the Sync,
Collecting, and Distributing flags set to zero (off).
Step 2. As the source switch receives an LACP packet from the destination switch, it
collects the system-ID and port-ID from the initial LACP packet. The source
switch then transmits a Sync LACP packet indicating that it is willing to
participate in the port-channel. The initial LACP Sync packet includes the
local system-ID, port-ID, and port-priority, along with the detected remote
switches’ information (system-ID, port-ID, and port-priority). LACP members
for the port-channel are selected at this time.
Step 3. Upon receipt of the Sync LACP packet, the source switch verifies that the
local and remote (destination switch) system-IDs match the Sync LACP
packets to ensure that the switch-ID is the same across all member links and
that no multiple devices exist on a link (that is, no device is operating in the
middle, providing connectivity to a third switch). The source switch then
transmits a Collecting LACP packet indicating that the source switch is ready
to receive traffic on that interface.
Step 4. The destination switch verifies the accuracy of the Sync LACP packet for the
source switch against what was performed by the source switch. The destina-
tion switch then sends a Collecting LACP packet indicating that the destina-
tion switch is ready to receive traffic on that interface.
Step 5. The source switch receives the Collecting LACP packet from the destination
switch and transmits a Distributing LACP packet to the destination switch
indicating that it is transmitting data across that member link.
Step 6. The destination switch receives the Collecting LACP packet from the source
switch and transmits a Distributing LACP packet to the source switch indicat-
ing that it is transmitting data across that member link.
Step 7. Both switches transmit data across the member link interface that has com-
pleted the previous steps successfully.
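The flag progression in the preceding steps can be sketched in Python. This is a simplified illustration of the order in which the Sync, Collecting, and Distributing flags are set, not the LACP state machine from the standard. The names LacpFlags, handshake, and can_forward are hypothetical and exist only for this sketch.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LacpFlags:
    """The three LACP flags described above; all start at zero (off)."""
    sync: bool = False
    collecting: bool = False
    distributing: bool = False


def handshake():
    """Return (step, switch, flags) tuples in the order the steps describe."""
    src, dst = LacpFlags(), LacpFlags()
    # Step 1: both switches advertise with every flag off.
    events = [(1, "source", src), (1, "destination", dst)]
    # Step 2: the source sets Sync after learning the peer's IDs.
    src = replace(src, sync=True)
    events.append((2, "source", src))
    # Step 3: the source is ready to receive traffic (Collecting).
    src = replace(src, collecting=True)
    events.append((3, "source", src))
    # Step 4: the destination verifies and also reaches Collecting.
    dst = replace(dst, sync=True, collecting=True)
    events.append((4, "destination", dst))
    # Steps 5 and 6: each side announces that it is Distributing.
    src = replace(src, distributing=True)
    events.append((5, "source", src))
    dst = replace(dst, distributing=True)
    events.append((6, "destination", dst))
    return events


def can_forward(local, remote):
    """Step 7: traffic flows only when both sides have set all three flags."""
    return all((local.sync, local.collecting, local.distributing,
                remote.sync, remote.collecting, remote.distributing))


for step, switch, flags in handshake():
    print(step, switch, flags)
```

Comparing the flags seen in the LACP packets of a stuck member link against this sequence shows which step the adjacency failed to complete.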
Note The LACP packets in Step 7 happen independently of other switches, assuming that
the requirements are met.
Figure 5-3 demonstrates the exchange of LACP messages between NX-1 (source switch)
and NX-2 (destination switch).
[Figure 5-3: LACP flag values exchanged between NX-1 and NX-2, by step]
Step 1: Sync 0, Collecting 0, Distributing 0
Step 3: Sync 1, Collecting 1, Distributing 0
Step 4: Sync 1, Collecting 1, Distributing 0
Step 5: Sync 1, Collecting 1, Distributing 1
Step 6: Sync 1, Collecting 1, Distributing 1
Step 7: Traffic is exchanged
Note This process occurs on every member link when it joins a port-channel interface.
■ Passive: An interface does not initiate the establishment of a port-channel and does not
transmit LACP packets on its own. If it receives an LACP packet from the remote switch, this
interface responds and then establishes an LACP adjacency. If both devices are LACP
passive, no LACP adjacency forms.
The LACP feature must first be enabled with the global command feature lacp. Then the
interface parameter command channel-group portchannel-number mode {on | active |
passive} converts a regular interface into a member interface.
Example 5-1 demonstrates the configuration of port-channel 1 using the member interfaces
Eth1/1 and Eth1/2. Notice that the trunk configuration is applied to the port-channel
interface, not to the individual member interfaces.
NX-1# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-1(config)# feature lacp
NX-1(config)# interface ethernet 1/1-2
NX-1(config-if-range)# channel-group 1 mode active
! Output omitted for brevity
03:53:14 NX-1 %$ VDC-1 %$ %ETH_PORT_CHANNEL-5-CREATED: port-channel1 created
03:53:14 NX-1 %$ VDC-1 %$ %ETHPORT-5-IF_DOWN_CHANNEL_MEMBERSHIP_UPDATE_IN_PROGRESS:
Interface Ethernet1/2 is down (Channel membership update in progress)
03:53:14 NX-1 %$ VDC-1 %$ %ETHPORT-5-IF_DOWN_CHANNEL_MEMBERSHIP_UPDATE_IN_PROGRESS:
Interface Ethernet1/1 is down (Channel membership update in progress)
..
03:53:16 NX-1 %$ VDC-1 %$ %ETHPORT-5-SPEED: Interface port-channel1, operational
speed changed to 10 Gbps
03:53:16 NX-1 %$ VDC-1 %$ %ETHPORT-5-IF_DUPLEX: Interface port-channel1, operational
duplex mode changed to Full
03:53:21 NX-1 %$ VDC-1 %$ %ETH_PORT_CHANNEL-5-PORT_UP: port-channel1: Ethernet1/1
is up
03:53:21 NX-1 %$ VDC-1 %$ %ETH_PORT_CHANNEL-5-FOP_CHANGED: port-channel1: first
operational port changed from none to Ethernet1/1
03:53:21 NX-1 %$ VDC-1 %$ %ETH_PORT_CHANNEL-5-PORT_UP: port-channel1: Ethernet1/2
is up
03:53:21 NX-1 %$ VDC-1 %$ %ETHPORT-5-IF_UP: Interface Ethernet1/1 is up in mode
access
When viewing the output of the show port-channel summary command, check the port-
channel status, which is listed below the port-channel interface. The status should be “U,”
as in Example 5-2.
Table 5-2 briefly explains the fields related to the member interfaces.
The logical interface is viewed with the command show interface port-channel port-
channel-id. The output includes data fields that are typically displayed with a traditional
Ethernet interface, with the exception of the member interfaces and the fact that
the bandwidth reflects the combined throughput of all active member interfaces. As this
changes, factors such as QoS policies and interface costs for routing protocols adjust
accordingly.
Example 5-3 displays the use of the command on NX-1. Notice that the bandwidth is
20 Gbps and correlates to the two 10-Gbps interfaces in the port-channel interface.
The output of the show lacp counters command includes a list of the port-channel
interfaces, their associated member interfaces, counters for LACP packets sent/received,
and any errors. An interface should see
the Sent and Received columns increment over a time interval. If the counters do not
increment, this indicates a problem. The problem could be related to the physical link
or an incomplete/incompatible configuration with the remote device. Check the LACP
counters on that device to see if it is transmitting LACP packets.
Example 5-4 demonstrates the command. Notice that the Received column
does not increment on Ethernet1/2 for port-channel 1, but the Sent column
does.
------------------------------------------------------------------------------
LACPDUs Markers/Resp LACPDUs
Port Sent Recv Recv Sent Pkts Err
------------------------------------------------------------------------------
port-channel1
Ethernet1/1 5753 5660 0 0 0
Ethernet1/2 5319 0 0 0 0
------------------------------------------------------------------------------
LACPDUs Markers/Resp LACPDUs
Port Sent Recv Recv Sent Pkts Err
------------------------------------------------------------------------------
port-channel1
Ethernet1/1 5755 5662 0 0 0
Ethernet1/2 5321 0 0 0 0
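The comparison of two counter samples can be automated. The following Python sketch assumes that the Sent and Recv values have already been extracted from two runs of the command into dictionaries; the function name stalled_members and the dictionary layout are hypothetical, and the values are taken from Example 5-4.

```python
def stalled_members(before, after):
    """Return interfaces whose Sent or Recv LACPDU counter did not increase.

    Each snapshot maps an interface name to a (sent, recv) tuple.
    """
    stalled = {}
    for port, (sent_after, recv_after) in after.items():
        sent_before, recv_before = before[port]
        problems = []
        if sent_after <= sent_before:
            problems.append("sent")
        if recv_after <= recv_before:
            problems.append("recv")
        if problems:
            stalled[port] = problems
    return stalled


# Counter values from the two samples in Example 5-4.
first = {"Ethernet1/1": (5753, 5660), "Ethernet1/2": (5319, 0)}
second = {"Ethernet1/1": (5755, 5662), "Ethernet1/2": (5321, 0)}

print(stalled_members(first, second))  # {'Ethernet1/2': ['recv']}
```

A member that sends but never receives, as Ethernet1/2 does here, points toward the remote device or the path in between rather than the local LACP process.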
Another method involves using the command show lacp internal info interface
interface-id. This command includes a time stamp for the last time a packet was
transmitted or received out of an interface. Example 5-5 demonstrates the use of this
command.
Example 5-6 demonstrates the use of this command. The output includes the neighbor’s
system ID, system priority, remote port number, remote port-priority, and details on
whether it is using fast or slow LACP packet intervals.
Partner's information
            Partner                  Partner                     Partner
Port        System ID                Port Number     Age         Flags
Eth1/2      32768,18-9c-5d-11-99-80  0x139           985         SA
Note Use the LACP system identifier to verify that the member interfaces are connected
to the same device and are not split between devices. The local LACP system-ID is
viewed using the command show lacp system-identifier.
The NX-OS Ethanalyzer tool is used to view the LACP packets being transmitted and
received on the local Nexus switch by capturing packets with the LACP MAC destina-
tion address. The command ethanalyzer local interface inband capture-filter "ether host
0180.c200.0002" [detail] captures LACP packets that are received. The optional detail
keyword provides additional information. Example 5-7 demonstrates the technique.
Capturing on inband
2017-10-23 03:58:11.213625 88:5a:92:de:61:58 -> 01:80:c2:00:00:02 LACP Link Aggregation Control Protocol
2017-10-23 03:58:11.869668 88:5a:92:de:61:59 -> 01:80:c2:00:00:02 LACP Link Aggregation Control Protocol
2017-10-23 03:58:23.381249 00:62:ec:9d:c5:1c -> 01:80:c2:00:00:02 LACP Link Aggregation Control Protocol
2017-10-23 03:58:24.262746 00:62:ec:9d:c5:1b -> 01:80:c2:00:00:02 LACP Link Aggregation Control Protocol
2017-10-23 03:58:41.218262 88:5a:92:de:61:58 -> 01:80:c2:00:00:02 LACP Link Aggregation Control Protocol
NX-1# conf t
NX-1(config)# interface port-channel 1
NX-1(config-if)# lacp min-links 2
NX-1(config-if)# interface Eth1/1
NX-1(config-if)# shut
Note The minimum number of port-channel member interfaces does not need to be
configured on both devices to work properly. However, configuring it on both switches is
recommended to accelerate troubleshooting and assist operational staff.
The port-channel master switch controls which member interfaces (and associated links)
are active by examining the LACP port priority. A lower port priority is preferred. If the
port-priority is the same, the lower interface number is preferred.
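The selection rule can be expressed as a sort on two keys. The following Python sketch is an illustration only: the function name select_active, the interface list, and the priority values are hypothetical, and the sketch assumes interface names of the form EthX/Y.

```python
def select_active(members, max_links):
    """Split members into (active, hot_standby) lists.

    members is a list of (interface, lacp_port_priority) tuples.
    A lower port priority wins; a tie goes to the lower interface number.
    """
    def sort_key(member):
        name, priority = member
        slot, port = name.replace("Eth", "", 1).split("/")
        return (priority, int(slot), int(port))

    ordered = [name for name, _ in sorted(members, key=sort_key)]
    return ordered[:max_links], ordered[max_links:]


# Hypothetical member links: Eth1/4 has been given a lower (better) priority.
members = [("Eth1/1", 32768), ("Eth1/2", 32768),
           ("Eth1/3", 32768), ("Eth1/4", 1)]

active, standby = select_active(members, max_links=2)
print(active)   # ['Eth1/4', 'Eth1/1']
print(standby)  # ['Eth1/2', 'Eth1/3']
```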
Example 5-10 demonstrates how the LACP system priority is verified and changed.
NX-1# configuration t
Enter configuration commands, one per line. End with CNTL/Z.
NX-1(config)# lacp system-priority 1
Example 5-11 changes the port priority on NX-1 for Eth1/8 so that it is the most pre-
ferred interface. Because NX-1 is the master switch for port-channel 2, the Eth1/8 inter-
face becomes active, and ports Eth1/6 and Eth1/7 are in Hot-Standby because of the
previous configuration of maximum links set to four.
LACP Fast
The original LACP standards sent out LACP packets every 30 seconds. A link is deemed
unusable if an LACP packet is not received after three intervals. This results in poten-
tially 90 seconds of packet loss for a link before that member interface is removed from
a port-channel.
An amendment to the standards was made so that LACP packets are advertised
every second. This is known as LACP fast because a link is identified and removed
in 3 seconds, compared to the 90 seconds of the initial LACP standard. LACP fast
is enabled on the member interfaces with the interface configuration command
lacp rate fast.
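The two failure-detection times follow directly from the advertisement interval multiplied by the three missed intervals. A short Python sketch of that arithmetic (the constant and function names are hypothetical):

```python
SLOW_INTERVAL_S = 30   # original LACP advertisement interval, in seconds
FAST_INTERVAL_S = 1    # LACP fast advertisement interval, in seconds
MISSED_INTERVALS = 3   # intervals without an LACP packet before removal


def detection_time(interval_s, missed=MISSED_INTERVALS):
    """Worst-case seconds before a failed member link is removed."""
    return interval_s * missed


print(detection_time(SLOW_INTERVAL_S))  # 90
print(detection_time(FAST_INTERVAL_S))  # 3
```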
Note All interfaces on both switches must be configured the same, either LACP fast or
LACP slow, for the port-channel to successfully come up.
Note When using LACP fast, check your respective platform’s release notes to ensure
that in-service software upgrade (ISSU) and graceful switchover are still supported.
Example 5-12 demonstrates identifying the current LACP state on the local and neighbor
interface, along with converting an interface to LACP fast.
Example 5-12 Configuring LACP Fast and Verifying LACP Speed State
NX-1# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-1(config)# interface Eth1/1
NX-1(config-if)# lacp rate fast
Graceful Convergence
Nexus switches have LACP graceful convergence enabled by default with the port-
channel interface command lacp graceful-convergence. When a Nexus switch is con-
nected to a non-Cisco peer device, its graceful failover defaults can delay the time to
bring down a disabled port.
Another scenario involves forming LACP adjacencies with devices that do not fully
support the LACP specification. For example, a non-compliant LACP device might
start to transmit data upon receiving the Sync LACP message (step 2 from forming
LACP adjacencies) before transmitting the Collecting LACP message to a peer.
Because the local switch still has not reached a Collecting state, these packets are
dropped.
Sync LACP message to the peer. This ensures that the port receives packets upon sending
the Sync LACP message.
Suspend Individual
By default, Nexus switches place an LACP port in a suspended state if it does not receive
an LACP PDU from the peer. Typically, this behavior helps prevent loops that occur with
a bad switch configuration. However, it can cause some issues with some servers that
require LACP to logically bring up the port.
This behavior is changed by disabling the feature with the port-channel interface com-
mand no lacp suspend-individual.
■ Native virtual local area network (VLAN): The member interfaces on an L2 trunk
port-channel must be configured with the same native VLAN with the command
switchport trunk native vlan vlan-id. Otherwise, the error message “port not com-
patible [port native VLAN]” appears.
■ Speed: All member interfaces must be the same speed. Otherwise, the interface
is placed into a suspended state and the syslog message “%ETH_PORT_CHANNEL-
5-IF_DOWN_SUSPENDED_BY_SPEED” appears.
■ Duplex: The duplex must be the same for all member interfaces. Otherwise, the sys-
log message “command failed: port not compatible [Duplex Mode]” appears. This is
applicable only for interfaces operating at 100 Mbps or slower.
■ MTU: All L3 member interfaces must have the same maximum transmission unit
(MTU) configured. The interface cannot be added to the port-channel if the MTU
does not match the other member interfaces. The syslog message “command failed:
port not compatible [Ethernet Layer]” appears in this scenario. This message matches
the Port Type message and requires examining the member interface configuration
to identify the mismatched MTU.
■ Load Interval: The load interval must be configured identically on all member
interfaces. Otherwise, the syslog message “command failed: port not compatible
[load interval]” appears.
■ Storm Control: The port-channel member ports must be configured with the same
storm control settings. Otherwise, the syslog message “port not compatible [Storm
Control]” appears.
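A consistency check of this kind amounts to comparing a candidate interface against an existing member, one parameter at a time. The following Python sketch covers only the parameters named in the list above; the dictionary keys and the function name incompatible_parameters are hypothetical and do not correspond to NX-OS output fields.

```python
# Parameters named in the list above (hypothetical key names).
COMPAT_KEYS = ("native_vlan", "speed", "duplex", "mtu",
               "load_interval", "storm_control")


def incompatible_parameters(reference, candidate):
    """Return the parameters on which candidate differs from reference."""
    return [key for key in COMPAT_KEYS
            if reference.get(key) != candidate.get(key)]


existing_member = {"native_vlan": 1, "speed": "10G", "duplex": "full",
                   "mtu": 1500, "load_interval": 30, "storm_control": None}
new_member = dict(existing_member, native_vlan=10, mtu=9216)

print(incompatible_parameters(existing_member, new_member))
# ['native_vlan', 'mtu']
```

An empty result suggests that the interface can join as far as these parameters are concerned; the full list of parameters is wider than this sketch.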
Note A full list of compatibility parameters that must match is included with the com-
mand show port-channel compatibility-parameters.
As a general rule, when configuring port-channels on a Nexus switch, place the mem-
ber interfaces into the appropriate switch port type (L2 or L3) and then associate the
interfaces with a port-channel. All other port-channel configuration is done via the port-
channel interface.
If a consistency error occurs, locate member interfaces with the show port-channel sum-
mary command, view a member interface configuration, and apply it to the interface you
want to join the port-channel group.
■ Determine that both end links are statically set to “on” or are LACP enabled, with at
least one side set to “active.”
■ Ensure that all member interface ports are consistently configured (except for LACP
port priority).
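The first check in the list can be written as a small truth function. This Python sketch encodes only the rule stated above (both sides statically “on,” or both sides running LACP with at least one side active); the function name channel_forms is hypothetical.

```python
VALID_MODES = {"on", "active", "passive"}


def channel_forms(local_mode, remote_mode):
    """Return True if the two channel-group modes can bring up the link."""
    modes = {local_mode, remote_mode}
    if not modes <= VALID_MODES:
        raise ValueError(f"unknown mode in {sorted(modes)}")
    if "on" in modes:
        # Static mode sends no LACP packets, so both sides must be static.
        return modes == {"on"}
    # Both sides run LACP; at least one must initiate.
    return "active" in modes


for pair in [("on", "on"), ("active", "active"), ("active", "passive"),
             ("passive", "passive"), ("on", "active")]:
    print(pair, channel_forms(*pair))
```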
Some member links in a port-channel might have a higher utilization than other links.
This scenario can occur depending on the port-channel configuration and the traffic
crossing it.
The load-balancing hash is seen with the command show port-channel load-balance, as
Example 5-14 shows. The default system hash is source-dest-ip, which calculates the hash
based upon the source and destination IP address in the packet header.
If the links are unevenly distributed, changing the hash value might provide a different
distribution ratio across member-links. For example, if the port-channel is established
with a router, using a MAC address as part of the hash could impact the traffic flow
because the router’s MAC address does not change (the MAC address for the source or
destination is always the router’s MAC address). A better choice is to use the source/
destination IP address or base it off session ports.
Note: Add member links to a port-channel in powers of 2 (2, 4, 8, 16) to ensure that the
hash is calculated consistently.
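The effect of the hash inputs can be illustrated with a toy model. The following Python sketch does not reproduce the hardware hash used by NX-OS; it uses a CRC32 from the standard library purely to show that identical hash inputs always select the same member link. The function name pick_link, the MAC addresses, and the flow list are hypothetical.

```python
import zlib


def pick_link(fields, member_links):
    """Map the chosen header fields to one member link (toy hash)."""
    key = "|".join(fields).encode()
    return member_links[zlib.crc32(key) % len(member_links)]


links = ["Eth1/1", "Eth1/2"]

# Router-to-router traffic: the source and destination MAC never change.
router_macs = ("0000.0000.0001", "0000.0000.0002")

# Eight flows from one source IP address to eight destination IP addresses.
flows = [("192.168.2.2", f"192.168.1.{host}") for host in range(1, 9)]

mac_based = {pick_link(router_macs, links) for _ in flows}
ip_based = {pick_link(flow, links) for flow in flows}

print("MAC-based hash uses:", sorted(mac_based))
print("IP-based hash uses: ", sorted(ip_based))
```

With the MAC-based inputs every flow lands on a single member link, because the key never varies. The IP-based inputs vary per flow, so they have the opportunity to spread across both links.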
In rare cases, troubleshooting is required to determine which member link a packet is tra-
versing on a port-channel. This involves checking for further diagnostics (optic, ASIC, and
so on) when dealing with random packet loss. A member link is identified with the com-
mand show port-channel load-balance [ forwarding-path interface port-channel number
{ . | vlan vlan_ID } [ dst-ip ipv4-addr ] [ dst-ipv6 ipv6-addr ] [ dst-mac dst-mac-addr ]
[ l4-dst-port dst-port ] [ l4-src-port src-port ] [ src-ip ipv4-addr ] [ src-ipv6 ipv6-addr ]
[ src-mac src-mac-addr ]].
Example 5-15 demonstrates how the member link is identified on NX-1 for a packet com-
ing from 192.168.2.2 toward 192.168.1.1 on port-channel 1.
Virtual Port-Channel
Port-channels lend many benefits to a design, but only two devices (one local and
one remote) can be used. NX-OS includes a feature called virtual port-channel
(vPC) that enables two Nexus switches to create a virtual switch in what is called
a vPC domain. vPC peers then provide a logical Layer 2 (L2) port-channel to a
remote device.
Figure 5-4 provides a topology to demonstrate vPCs. NX-2 and NX-3 are members of the
same vPC domain and are configured with a vPC providing a logical port-channel toward
NX-1. From the perspective of NX-1, it is connected to only one switch.
[Figure 5-4: vPC topology. NX-2 and NX-3 form a vPC domain and present a virtual
port-channel toward NX-1, which uses a traditional port-channel.]
Note Unlike switch stacking or Virtual Switching Systems (VSS) clustering technolo-
gies, the configuration of the individual switch ports remains separate. In other words, the
Nexus switches are configured independently.
vPC Fundamentals
Only two Nexus switches can participate in a vPC domain. The vPC feature also includes
a vPC peer-keepalive link, vPC member links, and the actual vPC interface. Figure 5-5
shows a topology with these components.
vPC Domain
A Nexus switch can have regular port-channel and vPC interfaces at the same time. A
different LACP system ID is used in the LACP advertisements between the port-channel
and vPC interfaces. Both Nexus peer switches use a virtual LACP system ID for the vPC
member link.
One of the switches is the primary device and the other is the secondary device. The
Nexus switches select the switch with the lower role priority as the primary device. If a
tie occurs, the Nexus switch with the lower MAC address is preferred. No pre-emption
takes place in identifying the primary device, so the concept of operational primary
device and operational secondary device is introduced.
[Figure 5-5: vPC components. Recoverable labels: vPC peer link(s), vPC member links,
switches NX-1, NX-2, and NX-3, and interfaces Eth1/46, Eth1/47, and Eth1/48 on NX-2
and NX-3.]
This concept is demonstrated in the following steps by imagining that NX-2 and NX-3
are in the same vPC domain, and NX-2 has a lower role priority.
Step 1. As both switches boot and initialize, neither switch has been elected as the
vPC domain primary device. Then NX-2 becomes the primary device and the
operational primary device, while NX-3 becomes the secondary device and
the operational secondary device.
Step 2. NX-2 is reloaded. NX-3 then becomes the primary device and the operational
primary device.
Step 3. When NX-2 completes its initialization, it again has the lower role priority
but does not preempt NX-3. At this stage, NX-2 is the primary device and
the operational secondary device, and NX-3 is the secondary device and the
operational primary device. Only when NX-3 reloads or shuts down all vPC
interfaces does NX-2 become the operational primary device.
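The non-preemptive behavior can be modeled with two separate pieces of state: the configured role, which depends only on role priority and MAC address, and the operational role, which changes only when the current operational primary goes away. The following Python sketch is a simplified model of the end state after each step, with hypothetical class names, priorities, and MAC addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    name: str
    role_priority: int
    mac: str  # dotted hex, used only to break a priority tie


def configured_primary(a, b):
    """Lower role priority wins; a tie goes to the lower MAC address."""
    return min((a, b), key=lambda p: (p.role_priority,
                                      int(p.mac.replace(".", ""), 16)))


class VpcDomain:
    def __init__(self, a, b):
        self.peers = (a.name, b.name)
        self.primary = configured_primary(a, b).name
        # At initial boot the operational role follows the configured role.
        self.operational_primary = self.primary

    def reload(self, name):
        """A reloaded peer loses the operational primary role and does not
        take it back when it returns (no preemption)."""
        if self.operational_primary == name:
            self.operational_primary = next(p for p in self.peers if p != name)


domain = VpcDomain(Peer("NX-2", 100, "0000.0000.0002"),
                   Peer("NX-3", 200, "0000.0000.0003"))
domain.reload("NX-2")
print(domain.primary, domain.operational_primary)  # NX-2 NX-3
```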
vPC Peer-Keepalive
The vPC peer-keepalive link monitors the health of the peer vPC device. It sends
keepalive messages on a periodic basis (system default of 1 second). The heartbeat packet
is 96 bytes in length, using UDP port 3200. If the peer link fails, connectivity is checked
across the vPC peer-keepalive link. Little network traffic crosses the peer-
keepalive link, so a 1-Gbps interface is sufficient.
A vPC peer device detects a peer failure by not receiving any peer-keepalive messages.
A hold-timeout timer starts as soon as the vPC peer is deemed unavailable. During the
hold-timeout period (system default of 3 seconds), the secondary vPC device ignores any
vPC keepalive messages to ensure that the network can converge before action is taken
against vPC interfaces. After the hold-timeout period expires, the timeout timer begins
(system default of 5 seconds). If a vPC keepalive message is not received during this
interval, the vPC interfaces on the secondary vPC switch are shut down. This behavior
prevents a split-brain scenario.
Note Although using a VLAN interface for the peer-keepalive interface is technically
feasible, this approach is discouraged because it can cause confusion. Additionally, the
link should be directly connected where possible (with the exception of the management
ports).
The vPC peer link must be on a 10-Gbps or higher Ethernet port. Typically, a port-
channel is used to ensure that enough bandwidth exists for traffic sent from one vPC
peer to be redirected where appropriate to the remote vPC peer. In addition, on modular
Nexus switches, the links should be spread across different line cards/modules to ensure
that the peer link stays up during a hardware failure.
■ Traffic received on a vPC peer link is never advertised out a vPC member port. This
is part of a loop-prevention mechanism.
The Hot Standby Router Protocol (HSRP) runs on network devices and provides a
fault-tolerant virtual IP for hosts on a network segment. With HSRP, only one network
device actively forwards traffic for the virtual IP. However, on some Nexus platforms
that are deployed with vPC, both Nexus switches actively forward traffic for the virtual
gateway. This improves bandwidth and reduces sending Layer 3 (L3) network traffic
across the vPC peer link.
vPC Configuration
The vPC configuration contains the following basic steps:
Step 1. Enable the vPC feature. The vPC feature must be enabled with the command
feature vpc.
Step 2. Enable the LACP feature. vPC port-channels require the use of LACP, so the
LACP feature must be enabled with the command feature lacp.
Step 3. Configure the peer-keepalive link. The peer-keepalive link must be config-
ured. Cisco recommends creating a dedicated virtual routing and forwarding
(VRF) for the peer-keepalive link. Then an IP address must be associated with
that interface using the command ip address ip-address mask.
Note Using the management interface for the peer-keepalive link is possible, but this
requires a management switch to provide connectivity between peer devices. If a system
has multiple supervisors (as with Nexus 7000/9000), both the active and standby manage-
ment ports on each vPC peer need to connect to the management switch.
Step 4. Configure the vPC domain. The vPC domain is the logical construct that
both Nexus peers use. The vPC domain is created with the command vpc
domain domain-id. The domain ID must match on both devices.
NX-OS automatically creates a vPC system MAC address for the LACP mes-
saging, but the MAC address can be defined manually with the system-mac
mac-address command. The default LACP system priority for the vPC domain
is 32768, but it can be modified with the command system-priority priority
to raise or lower the virtual LACP priority.
Step 5. Configure the vPC device priority (optional). The vPC device priority is
configured with the command role priority priority. The priority can be set
from 1 to 65,535, with the lower value more preferred. The preferred node is
the primary vPC node; the other node is the secondary.
The vPC autorecovery feature provides a method for one of the vPC peers to
start forwarding traffic. Upon initialization, if the vPC peer link is down and
three consecutive peer-keepalive messages are not responded to, the second-
ary device assumes the operational primary role and initializes vPC interfaces
to allow some traffic to be forwarded. vPC autorecovery is explained later in
this chapter.
Step 8. Configure the vPC. Ports are assigned to the port-channel with the
command channel-group portchannel-number mode active.
The port-channel interface is assigned a unique vPC identifier with the
command vpc vpc-id. The vpc-id needs to match on the remote peer
device.
Example 5-16 demonstrates the vPC configuration of NX-2 from Figure 5-5.
NX-2# configuration t
Enter configuration commands, one per line. End with CNTL/Z.
! Enable the vPC and LACP features
NX-2(config)# feature vpc
NX-2(config)# feature lacp
! Creation of the vPC Peer-KeepAlive VRF and association of IP address
NX-2(config)# vrf context VPC-KEEPALIVE
NX-2(config-vrf)# address-family ipv4 unicast
NX-2(config-vrf-af-ipv4)# interface Ethernet1/48
NX-2(config-if)# description vPC-KeepAlive
NX-2(config-if)# no switchport
NX-2(config-if)# vrf member VPC-KEEPALIVE
vPC Verification
Now that both Nexus switches are configured, the health of the vPC domain must be
examined.
Example 5-17 demonstrates the output of the show vpc command for NX-2.
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
1 Po1 up success success 1
As stated earlier, the peer link should be in a forwarding state. This is verified by
examining the STP state with the command show spanning-tree, as Example 5-18 dem-
onstrates. Notice that the vPC peer link (port-channel 100) is in a forwarding
state and is identified as a network point-to-point port.
VLAN0001
Spanning tree enabled protocol rstp
Root ID Priority 28673
Address 885a.92de.617c
Cost 1
Port 4096 (port-channel1)
Hello Time 2 sec Max Age 20 sec Forward Delay 15 sec
If the status shows as “down,” verify that each switch can ping the other switch from
the VRF context that is configured. If the ping fails, troubleshooting basic connectivity
between the two switches needs to be performed.
vPC Consistency-Checker
Just as with port-channel interfaces, certain parameters must match on both Nexus
switches in the vPC domain. NX-OS contains a specific process called the consistency-
checker to ensure that the settings are compatible and to prevent unpredictable packet
loss. The consistency-checker has two types of errors:
■ Type 1
■ Type 2
Type 1
When a Type 1 vPC consistency-checker error occurs, the vPC instance and vPC mem-
ber ports on the operational secondary Nexus switch enter a suspended state and stop
forwarding network traffic. The operational primary Nexus switch still forwards network
traffic. These settings must match to avoid a Type 1 consistency error:
■ Native VLAN
■ VLANs allowed on trunk
■ Tagging of native VLAN traffic
■ STP mode
■ STP region configuration for Multiple Spanning Tree
■ Loop Guard
■ Root Guard
■ MTU
Note NX-OS version 5.2 introduced a feature called graceful consistency checker that
changes the behavior for Type 1 inconsistencies. The graceful consistency checker enables
the operational primary device to forward traffic. If this feature is disabled, the vPC is shut
down completely. This feature is enabled by default.
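The Type 1 behavior, with and without the graceful consistency checker, can be summarized in a few lines of Python. This is a sketch of the logic described in this section, not of the NX-OS implementation; the dictionary keys and the function name type1_result are hypothetical, and the key list covers only the settings listed above.

```python
# Settings listed above as Type 1 parameters (hypothetical key names).
TYPE1_KEYS = ("native_vlan", "allowed_vlans", "native_vlan_tagging",
              "stp_mode", "mst_region", "loop_guard", "root_guard", "mtu")


def type1_result(primary_cfg, secondary_cfg, graceful=True):
    """Return (mismatched settings, vPC state per operational role)."""
    mismatches = [key for key in TYPE1_KEYS
                  if primary_cfg.get(key) != secondary_cfg.get(key)]
    if not mismatches:
        state = {"primary": "forwarding", "secondary": "forwarding"}
    elif graceful:
        # Graceful consistency checker: only the secondary is suspended.
        state = {"primary": "forwarding", "secondary": "suspended"}
    else:
        # Feature disabled: the vPC is shut down completely.
        state = {"primary": "suspended", "secondary": "suspended"}
    return mismatches, state


nx2 = {"native_vlan": 1, "stp_mode": "rstp", "mtu": 1500}
nx3 = {"native_vlan": 1, "stp_mode": "mst", "mtu": 1500}

print(type1_result(nx2, nx3))
print(type1_result(nx2, nx3, graceful=False))
```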
Type 2
A Type 2 vPC consistency-checker error indicates the potential for undesired forwarding
behavior, such as having a VLAN interface on one node and not another.
vPC status
----------------------------------------------------------------------------
Id Port Status Consistency Reason Active vlans
-- ------------ ------ ----------- ------ ---------------
1 Po1 up failed Global compat check 1,10,20
failed
Example 5-21 displays the output for the show vpc consistency-parameters global
command.
Legend:
Type 1 : vPC will be suspended in case of mismatch
Example 5-22 displays the output for the show vpc consistency-parameters vlan command.
Configuration inconsistencies in this output can introduce undesirable forwarding behaviors.
vPC consistency parameters that are directly related to a port-channel interface are dis-
played with the command show vpc consistency-parameters {vpc vpc-id | port-channel
port-channel-identifier} options. The port-channel is viewed by identifying the vpc-id
(which might be different from the port-channel interface number). The output is exactly
the same for either iteration of the command. Example 5-23 displays the output for the
show vpc consistency-parameters vpc vpc-id command.
Legend:
Type 1 : vPC will be suspended in case of mismatch
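When a vPC reports a failed consistency status, the three views described in this section can be collected in one pass. This is a sketch only: the vPC ID of 1 is an assumed value, and the vpc-id does not have to match the port-channel number.

```
! Global parameters
NX-2# show vpc consistency-parameters global
! VLAN-related parameters
NX-2# show vpc consistency-parameters vlan
! Parameters tied to a single vPC (vPC ID 1 is assumed)
NX-2# show vpc consistency-parameters vpc 1
```

Running the same commands on the vPC peer makes it easier to spot which side holds the mismatched value.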
■ Prune VLANs from all the vPC interfaces and vPC peer link interfaces with the
switchport trunk allowed vlan command. Those VLANs are then associated with
the interface when a device can connect to only one network link. This is more of a
design change.
Example 5-25 displays this feature being enabled on Ethernet 1/44 and 1/45 on NX-2.
vPC Autorecovery
As a safety mechanism, a vPC peer does not enable any vPC interfaces until it detects the
other vPC peer. In some failure scenarios, such as power failures, both vPC devices are
restarted and do not detect each other. This can cause a loss of traffic because neither
device forwards traffic.
The vPC autorecovery feature provides a method for one of the vPC peers to start
forwarding traffic. Upon initialization, if the vPC peer link is down and three
consecutive peer-keepalive messages go unanswered, the secondary device assumes
the operational primary role and initializes its vPC interfaces so that some traffic
can be forwarded.
This feature is enabled with the vPC domain configuration command auto-recovery
[reload-delay delay]. By default, the feature waits 240 seconds before engaging,
but this can be changed with the optional reload-delay keyword. The delay is a value
between 240 and 3600 seconds. Example 5-26 displays the configuration and
verification of vPC autorecovery.
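The configuration described above can be sketched as follows. The domain number 100 and the delay of 300 seconds are illustrative assumptions; any value in the 240 to 3600 range is accepted per the text.

```
NX-2# configure terminal
NX-2(config)# vpc domain 100
! 300 is an illustrative delay inside the 240-3600 second range
NX-2(config-vpc-domain)# auto-recovery reload-delay 300
NX-2(config-vpc-domain)# end
! show vpc reports the vPC domain state, including auto-recovery
NX-2# show vpc
```

The same command is normally applied on both vPC peers so that either switch can recover on its own after a dual restart.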
vPC Peer-Gateway
The vPC peer-gateway capability allows a vPC device to route packets that are
addressed to the router MAC address of the vPC peer. This functionality is used
to overcome scenarios with misconfigurations and issues that arise with load
balancers or network attached storage (NAS) devices that try to optimize packet
forwarding.
For example, Figure 5-6 demonstrates a topology in which NX-2 and NX-3 are acting
as the gateway for VLAN 100 and VLAN 200. NX-2 and NX-3 have a vPC configured
for the web server and NX-1, which connects to the NAS. NX-1 is only switching (not
routing) packets to or from the NAS device.
[Figure 5-6 labels: NX-1, NX-2, NX-3, vPC #60, and the web server (172.32.200.22). VLAN 100 is 172.32.100.0/24 and VLAN 200 is 172.32.200.0/24.]
When the web server sends a packet to the NAS device (172.32.100.22), it computes a
hash to identify which link it should send the packet on to reach the NAS device. Assume
that the web server sends the packet to NX-2, which then changes the packet’s source
MAC address to 00c1.5c00.0011 (part of the routing process) and forwards the packet on
to NX-1. NX-1 forwards (switches) the packet on to the NAS device.
Now the NAS device creates the reply packet and, when generating the packet head-
ers, uses the destination MAC address of the HSRP gateway 00c1.1234.0001 and
forwards the packet to NX-1. NX-1 computes a hash based on the source and destina-
tion IP address and forwards the packet toward NX-3. NX-2 and NX-3 both have the
destination MAC address for the HSRP gateway and can then route the packet for the
172.32.200.0/24 network and forward it back to the web server. This is the correct and
normal forwarding behavior.
The problem occurs when the NAS server enables a feature for optimizing packet flow.
After the NAS device receives the packet from the web server and generates the reply
packet headers, it just uses the source and destination MAC addresses from the packet
it originally received. When NX-1 receives the reply packet, it calculates the hash
and forwards the packet toward NX-3. Now NX-3 does not have the MAC address
00c1.5c00.0011 (NX-2’s VLAN 100 interface) and cannot forward the packet toward
NX-1. The packet is dropped because packets received on a vPC member port cannot be
forwarded across the peer link, as a loop-prevention mechanism.
Enabling a vPC peer-gateway on NX-2 and NX-3 allows NX-3 to route packets destined
for NX-2’s MAC addresses, and vice versa. The vPC peer-gateway feature is enabled with
the command peer-gateway under the vPC domain configuration. The vPC peer-gateway
functionality is verified with the show vpc command. Example 5-27 demonstrates the
configuration and verification of the peer-gateway feature.
Note In addition, NX-OS automatically disables IP redirects on SVIs where the VLAN is
enabled on a vPC trunk link.
Note Packets that are forwarded by the peer-gateway feature have their time to live (TTL)
decremented. Packets carrying a TTL of 1 thus might get dropped in transit because of
TTL expiration.
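The peer-gateway configuration described above can be sketched as follows. The domain number 100 is an assumption taken from the other examples in this chapter; the same command would be entered on NX-3.

```
NX-2# configure terminal
NX-2(config)# vpc domain 100
NX-2(config-vpc-domain)# peer-gateway
NX-2(config-vpc-domain)# end
! Verify the peer-gateway state in the vPC domain summary
NX-2# show vpc
```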
ARP synchronization is enabled with the command ip arp synchronize under the vPC
domain configuration. Example 5-28 demonstrates enabling ARP synchronization on NX-2.
NX-2# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-2(config)# vpc domain 100
NX-2(config-vpc-domain)# ip arp synchronize
Figure 5-7 demonstrates a simple topology in which NX-2 and NX-3 have an SVI
interface for VLAN 200 that acts as a gateway for the web server. NX-2, NX-3, and R4 are all
running OSPF so that NX-2 and NX-3 can forward packets to R4. NX-3 is the operational
primary Nexus switch.
[Figure 5-7 labels: NX-2 (operational secondary Nexus) and NX-3 (operational primary Nexus), with Eth1/46 and Eth1/47 on each switch; R4 Gi0/0 facing Eth1/22; a 172.16.0.0/30 network link; web server 172.32.200.22 in VLAN 200 (172.32.200.0/24).]
If the vPC peer link is broken (physically or through an accidental change that triggers
a Type 1 consistency checker error), NX-2 suspends activity on its vPC member port
and shuts down the SVI for VLAN 200. NX-3 drops its routing protocol adjacency
with NX-2 and then cannot provide connectivity to the corporate network for the web
server. Any packets from the web server for the corporate network received by NX-3
are dropped.
Note Remember that the vPC peer link does not support the transmission of routing
protocols as transit traffic. For example, suppose that Eth1/22 on NX-2 is a switch port
that belongs to VLAN 200 and R4’s Gi0/0 interface is configured with the IP address of
172.32.200.5. R4 pings NX-3, but it does not establish an OSPF adjacency with NX-3
because the OSPF packets are not transmitted across the vPC peer link. This is resolved by
deploying the second solution listed previously.
vPC functionality was never meant to provide a logical L2 link to be used
to form routing protocol adjacencies. However, the release of NX-OS version 7.3 provides
the capability for the SVIs to form a routing protocol adjacency with a router over a
vPC interface.
Note L3 Routing over vPC is specific only to unicast and does not include support for
multicast network traffic.
Figure 5-8 demonstrates the concept in which NX-2 and NX-3 want to exchange routes
using OSPF with R4 across the vPC interface. NX-2 and NX-3 enable Layer 3 rout-
ing over vPC to establish an Open Shortest Path First (OSPF) neighborship with R4. In
essence, this design places NX-2, NX-3, and R4 on the same LAN segment.
[Figure 5-8: physical and logical views of the vPC domain, showing NX-2, NX-3, and R4 connected through Po1.]
Layer 3 routing over vPC is configured under the vPC domain with the command layer3
peer-router. The peer-gateway feature must also be enabled when using this feature.
The feature is verified with the command show vpc.
Example 5-29 demonstrates the configuration and verification of Layer 3 routing over vPC.
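A minimal sketch of the configuration just described follows. The domain number 100 is an assumption carried over from the other examples in this chapter, and the same commands would be applied on NX-3.

```
NX-2# configure terminal
NX-2(config)# vpc domain 100
! peer-gateway is configured alongside Layer 3 routing over vPC
NX-2(config-vpc-domain)# peer-gateway
NX-2(config-vpc-domain)# layer3 peer-router
NX-2(config-vpc-domain)# end
NX-2# show vpc
```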
Note If vPC peering is not being established or vPC inconsistencies result, collect the
show tech vpc command output and contact Cisco technical support.
FabricPath
Until recently, all L2 networks traditionally were enabled with STP to build a loop-free
topology. However, the STP-based L2 network design introduces some limitations. One
limitation is the inability of STP to leverage parallel forwarding paths. STP blocks addi-
tional paths, forcing the traffic to take only one path as the STP forms a forwarding tree
rooted at a single device, even though redundant paths are physically available. Other
limitations include the following:
To overcome these challenges, vPC was introduced in 2008. An Ethernet device then
could connect simultaneously to two discrete Nexus switches while bundling these links
into a logical port-channel. vPC provided users with active-active forwarding paths, thus
overcoming the limitation of STP. Still, although vPC overcame most of the challenges,
others remained. For example, no provision was made for adding third or fourth aggre-
gation layer switches to further increase the density or bandwidth on the downstream
switch. In addition, vPC doesn’t overcome the traditional STP design limitation of
extending the VLANs.
The Cisco FabricPath feature provides a foundation for building a simplified, scalable,
and multipath-enabled L2 fabric. From the control plane perspective, FabricPath uses a
shortest path first (SPF)–based routing protocol, which helps with best path selection to
reach a destination within the FabricPath domain. It uses the L2 IS-IS protocol, which
provides all IS-IS capabilities for handling unicast, broadcast, and multicast packets.
Enabling a separate process for the L2 IS-IS is not needed; this is automatically enabled
on the FabricPath-enabled interfaces.
FabricPath provides Layer 3 routing benefits to flexible L2 bridged Ethernet networks. It
provides the following benefits of both routing and switching domains:
■ Routing
   ■ Multipathing (ECMP), with up to 256 links active between any two devices
   ■ Fast convergence
   ■ High scalability
■ Switching
   ■ Easy configuration
   ■ Plug and Play
   ■ Provision flexibility
Because the FabricPath core runs on L2 IS-IS, no STP is enabled between the spine and
the leaf nodes, thus providing reliable L2 any-to-any connectivity. A single MAC address
lookup at the ingress edge device identifies the exit port across the fabric. The traffic is
then switched using the shortest path available.
FabricPath-based design allows hosts to leverage the benefit of multiple active Layer 3
default gateways, as Figure 5-9 shows. The hosts see a single default gateway. The fabric pro-
vides forwarding toward the active default gateways transparently and simultaneously, thus
extending the multipathing from inside the fabric to the Layer 3 domain outside the fabric.
[Figure 5-9 labels: the L3 domain above the FabricPath fabric, two default gateways (dg), switch s1, and host A.]
The fabric also is used to extend Layer 3 networks. An arbitrary number of routed inter-
faces can be created at the edge or within the fabric. The attached Layer 3 devices
peer with those interfaces, thus providing a seamless Layer 3 network integration.
[Figure labels: FP core ports on spine switches S10, S20, S30, and S40.]
The FP core ports provide connectivity to the spine and are FabricPath-enabled inter-
faces. The FP core network is used to perform the following functions:
■ Avoid STP, require no MAC learning, and require no MAC address table maintained
by FP Core ports
The CE edge ports are regular trunk or access ports that provide connectivity to the
hosts or other classical switches. The CE ports perform the following functions:
■ Run STP, perform MAC address learning, and maintain a MAC address table
The FP edge device maintains the association of MAC addresses and switch-IDs (which
IS-IS automatically assigns to all switches). FP also introduces a new data plane
encapsulation by adding a 16-byte FP header on top of the classical Ethernet header.
Figure 5-11 displays the FP encapsulation header, which is also called the MAC-in-MAC
header. The external FP header consists of the Outer Destination Address, Outer Source
Address, and FP tag. Important fields of the Outer Source or Destination address fields
within the FP header include the following:
[Figure 5-11 FabricPath frame layout: Outer DA (48 bits), Outer SA (48 bits), FP Tag (32 bits), DMAC, SMAC, 802.1Q, Etype, Payload, CRC (new). Labeled subfields include OOO/DL and RSVD.]
■ TTL: The TTL is decremented at each switch hop, to prevent frames from looping
infinitely.
Note If more than 1024 topologies are required, the FTAG value is set to 0 and the
VLAN is used to identify the topology for multidestination trees.
[Figure: broadcast frame flow from host A (S100, e1/1) to host B (S300, e2/29) across spine switches S10, S20, S30, and S40, with S10 as the root for Tree 1. Multidestination trees: Tree 1 on S100 = po10, po20, po30, po40; Tree 1 on S10 = po100, po200, po300; Tree 1 on S300 = po10. FP header on the flooded frame: DA FF, Ftag 1, SA 100.0.12, with inner DMAC FF and SMAC A. MAC table on S100: A = e1/1 (Local). Callouts: learn MACs of directly connected devices unconditionally; don’t learn MACs from flood frames.]
When hosts A and B do not know about each other’s MAC addresses, the first packet
is a broadcast ARP request. The following steps describe the packet flow for the broad-
cast frame from host A to host B:
Step 1. Host A sends an ARP request for host B. Because the ARP request is a broad-
cast packet, the source MAC is set to A and the destination MAC is set to
FF (the broadcast address).
Step 2. When the packet reaches the CE edge port on leaf S100, the MAC address
table of S100 is updated with MAC address A and the interface from which it
is learned. In this case, it is Ethernet1/1.
Step 3. The leaf switch S100 then encapsulates the Ethernet frame with an FP header.
Because the FP core ports are enabled, IS-IS has already precalculated multi-
destination trees on S100. Tree 1 represents the multidestination tree and indi-
cates that the packet must be sent over the FP core ports (po10, po20, po30,
and po40). Because this is a broadcast frame, the Ftag is set to 1. Note that the
broadcast graph uses Tree ID 1 (Ftag 1) and the packet is forwarded to S10.
Step 4. When the FP encapsulated frame reaches S10, it honors the Ftag 1 and does a
lookup for the Tree ID.
Step 6. The egress FP switch (S300) receives the packet on link po10, performs an
RPF check to validate the reception of the packet, and floods the packet to
its CE ports.
Step 7. S300 then removes the FP header and floods the packet within the VLAN
based on the broadcast frame. Note that the egress FP switch (S300, in this
case) does not update its MAC address table with A. This is because the edge
devices don’t learn the MAC address from flood frames received from the FP
core where the destination MAC address is set to FF. The original broadcast
packet is then sent to host B.
When the ARP request reaches host B, the ARP reply is a unicast packet. Figure 5-13
depicts the packet flow for ARP reply from host B to host A.
[Figure 5-13: ARP reply flow from host B (S300, e2/29) to host A (S100, e1/1), with S10 as the root for Tree 1. FP header: DA MC1, Ftag 1, SA 300.0.64, with inner DMAC A and SMAC B. Multidestination trees: Tree 1 on S300 = po10; Tree 1 on S10 = po100, po200, po300; Tree 1 on S100 = po10, po20, po30, po40. MAC table on S300: A = miss (unknown), B = e2/29 (Local). MAC table on S100: A = e1/1 (Local), B = 300.0.64 (Remote). Callout: if the DMAC is known, then learn the remote MAC. MC1 = 01:0f:ff:c1:01:c0.]
The following steps describe the packet flow for ARP reply from B to A across the fabric.
Step 1. Host B with MAC address B sends the ARP reply back to host A. In the ARP
response, the source MAC is set to B and the destination MAC is set to A.
Step 2. When the packet reaches the leaf switch S300, it updates its MAC address
table with MAC address B, but it still does not have information about MAC
address A. This makes the packet an unknown unicast packet.
Step 3. S300, the ingress FP switch, determines which tree to use. Unknown unicast
typically uses the first Tree ID (Ftag 1). The Tree ID 1 points to all the FP
core interfaces on switch S300 (po10, po20, po30, and po40). The ingress FP
switch also sets the outer destination MAC address to the well-known “flood
to fabric” multicast address represented as MC1—01:0F:FF:C1:01:C0.
Step 4. The FP encapsulated unknown unicast packet is sent to all the spine switches.
Other FP switches honor the Tree ID selected by the ingress switch (Tree 1, in
this case). When the packet reaches the root for Tree 1 (S10), it uses the same
Ftag 1 and forwards the packet out of interfaces po100 and po200. (It does
not forward the packet on po300 because this is the interface from which the
frame was received).
Step 5. When the packet reaches S100, it performs a lookup on the FP trees and uses
Tree ID 1, which is set to po10. Because the packet from S10 was received on
po10 on the S100 switch, the packet is not forwarded back again to the fabric.
Step 6. The FP header is then decapsulated and the ARP reply is forwarded to the
host with MAC A. At this point, the MAC address table on S100 is updated
with MAC address B, with the IF/SID pointing to S300. This is because the
destination MAC is known inside the frame.
The next time host A sends a packet to host B, the packet from A is sent with source
MAC A and destination MAC B. The switch S100 receives the packet on the CE port, and
the destination MAC is already known and points to the switch S300 in an FP-enabled
network. The FP routing table is looked up to find the shortest path to S300 using a flow-
based hash because multiple paths to S300 exist. The packet is encapsulated with the FP
header with a source switch-ID (SWID) of S100 and a destination SWID of S300, and the
FTAG is set to 1. The packet is received on one of the spine switches. The spine switch
then performs an FP routing lookup for S300 and sends the packet to an outgoing inter-
face toward S300. When the packet reaches S300, the MAC address for A is updated in
the MAC address table with the IF/SID pointing to S100.
FabricPath Configuration
To configure FabricPath and verify a FabricPath-enabled network, examine the topology
shown in Figure 5-14. This figure has two spine nodes (NX-10 and NX-20) and three leaf
nodes (NX-1, NX-2, and NX-3). The end host nodes, host A and host B, are connected to
leaf nodes NX-1 and NX-3.
[Figure 5-14: FabricPath topology with spine nodes NX-10 and NX-20 above the leaf nodes, and Host A and Host B attached at the edge.]
Enabling the FabricPath feature is a bit different from enabling other features. First
the FabricPath feature set is installed, then the feature-set fabricpath is enabled, and
then the FabricPath feature is enabled. Example 5-30 demonstrates the configuration
for enabling the FabricPath feature. FabricPath uses the Dynamic Resource Allocation
Protocol (DRAP) for the allocation of switch-IDs. However, a switch-ID can be manually
configured on a Nexus switch using the command fabricpath switch-id [1-4094].
Every switch in the FabricPath domain is required to be configured with a unique
switch-ID.
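The enablement sequence described above can be sketched as follows. The commands mirror those used in the vPC+ configuration later in this chapter; the switch-ID value of 100 is illustrative, and the static switch-ID line is optional because DRAP can allocate one.

```
NX-1# configure terminal
! Install the feature set, then enable it
NX-1(config)# install feature-set fabricpath
NX-1(config)# feature-set fabricpath
! Optional: pin a static switch-ID (100 is an illustrative value)
NX-1(config)# fabricpath switch-id 100
```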
FP VLANs are the VLANs that are carried over the FP-enabled links; CE VLANs are
regular VLANs carried over the classical Ethernet links, such as trunk or access ports.
To enable a VLAN as an FP VLAN, use the command mode fabricpath under VLAN
configuration mode. When the FP VLAN is configured, configure the FP core links using
the command switchport mode fabricpath. Finally, configure the CE link as a trunk or
access port. Example 5-31 examines the configuration of the FP VLAN, FP ports, and CE
ports on NX-1, as shown in the topology.
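A minimal sketch of those three steps follows. The interface roles are assumptions inferred from the verification outputs later in this section (Ethernet6/5 as an FP core link and Ethernet6/6 as the host-facing CE port in VLAN 100); substitute the interfaces from the actual topology.

```
NX-1# configure terminal
! 1. Mark the VLAN as a FabricPath VLAN
NX-1(config)# vlan 100
NX-1(config-vlan)# mode fabricpath
NX-1(config-vlan)# exit
! 2. FP core link toward the spine (assumed interface)
NX-1(config)# interface Ethernet6/5
NX-1(config-if)# switchport
NX-1(config-if)# switchport mode fabricpath
NX-1(config-if)# exit
! 3. CE link toward the host (assumed interface), as an access port
NX-1(config)# interface Ethernet6/6
NX-1(config-if)# switchport
NX-1(config-if)# switchport mode access
NX-1(config-if)# switchport access vlan 100
```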
Various timers can also be configured with FabricPath, ranging from 1 to 1200 seconds:
■ allocate-delay: This timer is used when a new switch-ID is allocated and is required
to be propagated throughout the network. The allocate-delay defines the delay
before the new switch-ID is propagated and becomes available and permanent.
■ linkup-delay: This timer configures the delay before the link is brought up, to detect
any conflicts in the switch-ID.
■ transition-delay: This timer sets the delay for propagating the transitioned
switch-ID value in the network. During this period, all old and new switch-ID values
exist in the network.
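As a hedged sketch, the three timers above are set in global configuration mode. The fabricpath timers prefix is an assumption, because the text names only the timer keywords, and the value of 10 seconds is purely illustrative within the 1 to 1200 second range.

```
NX-1# configure terminal
! Assumed command prefix; values are illustrative
NX-1(config)# fabricpath timers allocate-delay 10
NX-1(config)# fabricpath timers linkup-delay 10
NX-1(config)# fabricpath timers transition-delay 10
```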
FabricPath does not require a specific IS-IS configuration. Authentication and other IS-IS-
related configuration settings (such as IS-IS hello timers, hello-padding, and metrics) can be
configured using the command fabricpath isis under interface configuration mode.
Example 5-32 illustrates the configuration for enabling IS-IS MD5 authentication for
FabricPath. The example also displays the various IS-IS settings defined under the inter-
face. Ensure that the configuration matches on both ends of the interface.
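A minimal sketch of interface-level MD5 authentication follows. The key chain name FP-KEYS, the key number, the placeholder secret, and the interface are all hypothetical, and the keywords after fabricpath isis are assumptions to be checked against the platform's command reference.

```
NX-1# configure terminal
! Hypothetical key chain holding the shared secret
NX-1(config)# key chain FP-KEYS
NX-1(config-keychain)# key 1
NX-1(config-keychain-key)# key-string <shared-secret>
NX-1(config-keychain-key)# exit
NX-1(config-keychain)# exit
! Apply MD5 authentication on the FP core interface
NX-1(config)# interface Ethernet6/5
NX-1(config-if)# fabricpath isis authentication-type md5
NX-1(config-if)# fabricpath isis authentication key-chain FP-KEYS
```

The matching key chain and interface commands must exist on the neighbor at the other end of the link, or the adjacency does not form.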
Interface: Ethernet6/5
Status: protocol-up/link-up/admin-up
Index: 0x0003, Local Circuit ID: 0x01, Circuit Type: L1
No authentication type/keychain configured
Authentication check specified
Extended Local Circuit ID: 0x1A284000, P2P Circuit ID: 0000.0000.0000.00
Retx interval: 5, Retx throttle interval: 66 ms
LSP interval: 33 ms, MTU: 1500
P2P Adjs: 1, AdjsUp: 1, Priority 64
Hello Interval: 10, Multi: 3, Next IIH: 00:00:03
Level Adjs AdjsUp Metric CSNP Next CSNP Last LSP ID
1 1 1 40 60 Inactive ffff.ffff.ffff.ff-ff
Topologies enabled:
Level Topology Metric MetricConfig Forwarding
0 0 40 no UP
1 0 40 no UP
The IS-IS adjacency between the leaf and the spine nodes is also verified using the
command show fabricpath isis adjacency [detail]. Example 5-35 displays the adjacency on
NX-1 and NX-10. The command displays the system ID, circuit type, interface participating
in the IS-IS adjacency for FabricPath, topology ID, and forwarding state. The command
also displays the last time the adjacency transitioned to its current state (that is, the
last time the adjacency flapped).
Next, validate whether the necessary FabricPath VLANs are configured on the edge/leaf
switches. This is verified by using the command show fabricpath isis vlan-range. When
the FP VLANs are configured and CE-facing interfaces are configured, the edge devices
learn about the MAC addresses of the hosts attached to the edge node. This is verified
using the traditional command show mac address-table vlan vlan-id. Example 5-36
verifies the FP VLAN and the MAC addresses learned from the hosts connected to the
FP VLAN 100.
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
age - seconds since last seen,+ - primary entry using vPC Peer-Link,
(T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve
age info
VLAN MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 100 30e4.db97.e8bf dynamic ~~~ F F Eth6/6
100 30e4.db98.0e7f dynamic ~~~ F F 300.0.97
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
age - seconds since last seen,+ - primary entry using vPC Peer-Link,
(T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve
age info
VLAN MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
100 30e4.db97.e8bf dynamic ~~~ F F 100.0.85
* 100 30e4.db98.0e7f dynamic ~~~ F F Eth6/18
Similar to Layer 3 IS-IS, Layer 2 IS-IS maintains multiple topologies within the network.
Each topology is represented as a tree ID in the FabricPath domain. These trees are the
multidestination trees within the fabric. To view the IS-IS topologies in the FabricPath
domain, use the command show fabricpath isis topology [summary]. Example 5-37
displays the different IS-IS topologies in the present topology.
If issues arise with traffic forwarding or MAC addresses not being learned, it is important
to check whether the FP IS-IS adjacency has been established and whether the FP IS-IS
routes are present in the Unicast Routing Information Base (URIB). This is easily validated
through the command show fabricpath route [detail | switchid switch-id]. This command
displays the routes for the remote nodes (leaf or spine nodes). The route is seen in the
form of ftag/switch-id/subswitch-id. In Example 5-38, the route for remote edge device
NX-3 is seen with FTAG 1, switch-ID 300, and Subswitch-ID 0 (because no vPC+ con-
figuration was enabled).
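The control-plane checks covered so far can be run as one sequence from an edge switch. This is a sketch that strings together the commands named in this section; VLAN 100 and switch-ID 300 are the values used in the surrounding examples.

```
! 1. IS-IS adjacency between the leaf and spine nodes
NX-1# show fabricpath isis adjacency detail
! 2. FabricPath VLANs configured on the edge switch
NX-1# show fabricpath isis vlan-range
! 3. Host MAC addresses learned in FP VLAN 100
NX-1# show mac address-table vlan 100
! 4. Multidestination trees (IS-IS topologies)
NX-1# show fabricpath isis topology summary
! 5. Route toward the remote edge device (switch-ID 300)
NX-1# show fabricpath route switchid 300
```

Working in this order narrows the problem from adjacency, to VLAN and MAC state, to the route itself before any hardware checks are attempted.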
The previous output makes it clear that the route for NX-3 has an FTAG value of 1.
When the route is verified in URIB, validate that the route is installed in the Forwarding
Information Base (FIB). To verify the route present in the FIB, use the line card command
show fabricpath unicast routes vdc vdc-number [ftag ftag] [switchid switch-id]. This
command displays hardware route information along with its RPF interface in the soft-
ware table on the line card. As part of the platform-dependent information, the command
output returns the hardware table address, which is further used to verify the hardware
forwarding information for the route. In the output shown in Example 5-39, the software
table shows that the route is a remote route with the RPF interface of Ethernet6/5. It also
returns the hardware table address of 0x18c0.
Note The commands in Example 5-39 are relevant for F2 and F3 line card modules on
Nexus 7000/7700 series switches. The verification commands vary among line cards
and also platforms (for instance, Nexus 5500).
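As a sketch of the line card check described above, the command is run after attaching to the module. Module 6 matches the module-6# prompt used in this section, the VDC number of 1 is an assumed value, and FTAG 1 with switch-ID 300 is the route discussed for NX-3.

```
! Attach to the line card that holds the FabricPath interfaces
NX-1# attach module 6
! VDC number 1 is an assumed value
module-6# show fabricpath unicast routes vdc 1 ftag 1 switchid 300
```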
Using the hardware address in the software table, execute the command show hardware
internal forwarding instance instance-id table sw start hw-entry-addr end
hw-entry-addr. The instance-id value is obtained from the FE num field in the previous
example. The hw-entry-addr address is the address highlighted in the previous
example output. This command output displays the switch-ID (swid), the Subswitch-ID
(sswid), and various other fields. One of the important fields to note is ssw_ctrl. If the
ssw_ctrl field is 0x0 or 0x3, the switch does not have subswitch-IDs (available only in
the case of vPC+). If vPC+ configuration is available, the value is usually 0x1. Another
field to look at is the local field. If the local field is set to n, multipath is available for
the route, so a multipath table is required for verification. Example 5-40 demonstrates
this command.
module-6# show hardware internal forwarding inst 0 table sw start 0x18c0 end 0x18c0
-----------------------------------------------------------
------------------------ SW Table ------------------------
(INST# 0)
-----------------------------------------------------------
[18c0]| KEY
[18c0]| vdc : 0 sswid : 0
[18c0]| swid : 12c ftag : 1
[18c0]| DATA
[18c0]| valid : y mp_mod : 1
[18c0]| mp_base : 24 local : n
[18c0]| cp_to_sup1 : n cp_to_sup2 : n
[18c0]| drop : n dc3_si : 11c1
[18c0]| data_tbl_ptr : 0 ssw_ctrl : 0
[18c0]| iic_port_idx : 54
[18c0]| l2tunnel_remote (CR only) : 0
Duplicate switch-IDs can cause forwarding issues and instability in the FabricPath-
enabled network. To check whether the network has duplicate or conflicting switch-IDs,
use the command show fabricpath conflict all. In case of any FabricPath-related errors,
event-history logs for a particular switch-ID can be verified using the command show
system internal fabricpath switch-id event-history errors. Alternatively, the show
tech-support fabricpath command output can be collected for further investigation.
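The switch-ID checks above can be sketched as a short sequence. Every command here is named in this section; only the order and the NX-1 prompt are assumptions.

```
! Look for duplicate or conflicting switch-IDs
NX-1# show fabricpath conflict all
! List the switch-IDs known in the FabricPath domain
NX-1# show fabricpath switch-id
! Review switch-ID error events
NX-1# show system internal fabricpath switch-id event-history errors
! Collect the full output for further investigation
NX-1# show tech-support fabricpath
```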
Note If an issue arises with FabricPath, collect the following show tech-support outputs
during the problematic state:
show tech u2rib
show tech pixm
show tech eltm
show tech l2fm
show tech fabricpath isis
show tech fabricpath topology
show tech fabricpath switch-id
Along with these show tech outputs, the show tech details output is useful in
investigating issues in the FabricPath environment.
FabricPath Devices
FabricPath is supported on Nexus 7000/7700 and Nexus 5500 series switches. Check the
FabricPath Configuration Guide for scalability and supported switch modules.
[Figure labels: spine switches S10 and S20; edge switches S1, S2, and S3; Host A and Host B. On S3, MAC-A alternates between S1 and S2 (MAC flap).]
With emulating switches, it is also important to understand the forwarding mechanism for
multidestination packets. In a FabricPath network, eliminating duplication of multides-
tination frames is achieved by computing multidestination trees rooted at shared nodes
that guarantee a loop-free path to any switch. This implies that only one of the emulating
switches should announce connectivity to the emulated switch for a particular multidestina-
tion tree and should be responsible for forwarding packets to the emulated switch. Likewise,
traffic from the emulated switch can ingress from one of the emulating switches into the
FabricPath network along the graph path that has reachability to the emulating switch.
Otherwise, the packet will be dropped by ingress interface check (IIC). For this reason, for
each multidestination tree, one emulating switch is used for ingress and egress traffic.
vPC+ Configuration
To configure vPC+, two primary features must be enabled on the Nexus switch:
■ FabricPath
■ vPC
To understand how the vPC+ feature works, examine the topology in Figure 5-16. In
this topology, NX-1 and NX-2 are forming a vPC with SW-12, and NX-3 and NX-4 are
forming a vPC with SW-34. All the links among the four Nexus switches are FabricPath-
enabled links, including the vPC peer link.
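[Figure 5-16: vPC+ topology with NX-1, NX-2, NX-3, and NX-4. SW-34 (VLAN 100, 10.100.1.0/24) attaches to NX-3 and NX-4 on Eth6/18; SW-12 (VLAN 100, 10.100.1.0/24) attaches to NX-1 and NX-2 on Eth6/6. Other labeled interfaces: Eth6/13-14 and Eth5/13 at the NX-3/NX-4 pair, Eth6/1-2 and Eth5/1 at the NX-1/NX-2 pair, and Eth6/4, Eth6/5, Eth6/16, and Eth6/17 on the links between the pairs.]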
Examine the vPC and FabricPath configuration for NX-1 and NX-3 in Example 5-41.
Most of the configuration is similar to the configuration shown in the sections on vPC
and FabricPath. The main differentiating configuration is the fabricpath switch-id
switch-id command configured under vPC configuration mode. The same emulated
switch-ID is assigned to both switches in each vPC pair: NX-1 and NX-2 are assigned the
switch-ID of 100, and NX-3 and NX-4 are assigned the switch-ID of 200.
NX-1
install feature-set fabricpath
feature-set fabricpath
feature vpc
vlan 100,200,300,400,500
  mode fabricpath
!
fabricpath switch-id 100
!
vpc domain 10
  peer-keepalive destination 10.12.1.2 source 10.12.1.1 vrf default
  fabricpath switch-id 100
!
interface port-channel1
  switchport mode fabricpath
  vpc peer-link
!
interface Ethernet6/4
  switchport mode fabricpath
!
interface Ethernet6/5
  switchport mode fabricpath
!
interface port-channel10
  switchport
  switchport mode trunk
  vpc 10
NX-3
install feature-set fabricpath
feature-set fabricpath
feature vpc
vlan 100,200,300,400,500
  mode fabricpath
!
fabricpath switch-id 200
!
vpc domain 20
  peer-keepalive destination 10.34.1.4 source 10.34.1.3 vrf default
  fabricpath switch-id 200
!
interface port-channel1
  switchport mode fabricpath
  vpc peer-link
!
interface Ethernet6/16
  switchport mode fabricpath
!
interface Ethernet6/17
  switchport mode fabricpath
!
interface port-channel20
  switchport
  switchport mode trunk
  vpc 20
When verifying the emulated FabricPath switch-IDs, the command show fabricpath
switch-id displays not only the static switch-IDs, but also the emulated switch-IDs.
The Flag (E) is set beside the node representing the local emulated switch. Example 5-43
displays the emulated switch-IDs using the show fabricpath switch-id command.
When the edge devices running the emulated switch learn MAC addresses from the
remote edge nodes, each remote address is shown against the FP MAC route assigned
to the vPC interface of the remote edge node. Example 5-44 displays the MAC address
table on both the NX-1 and NX-3 nodes. Notice that, on NX-1, the MAC address of the
remote host connected on the NX-3/NX-4 vPC link is learned with FP MAC route
200.11.65535. On NX-3, the MAC address of the host connected on the NX-1/NX-2 vPC
link is learned with FP MAC route 100.11.65535.
NX-1
Legend:
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link, E - EVPN entry
        (T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve age info
   VLAN/BD   MAC Address      Type      age     Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
  100      0022.56b9.007f    dynamic     ~~~      F    F  200.11.65535
* 100      24e9.b3b1.8cff    dynamic     ~~~      F    F  Po10
NX-3
Legend:
        * - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
        age - seconds since last seen,+ - primary entry using vPC Peer-Link, E - EVPN entry
        (T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve age info
   VLAN/BD   MAC Address      Type      age     Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 100      0022.56b9.007f    dynamic     ~~~      F    F  Po20
  100      24e9.b3b1.8cff    dynamic     ~~~      F    F  100.11.65535
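When many MAC table entries must be checked, the Ports/SWID.SSID.LID column can be split into its fields with a script. The following Python sketch is illustrative only: the function name parse_fp_route and the FpRoute type are hypothetical, and it assumes the dotted decimal form shown in the output above.

```python
from typing import NamedTuple, Optional


class FpRoute(NamedTuple):
    switch_id: int      # SWID: FabricPath (or emulated) switch-ID
    subswitch_id: int   # SSID: sub-switch ID
    local_id: int       # LID: local ID


def parse_fp_route(port_field: str) -> Optional[FpRoute]:
    """Return the SWID.SSID.LID triple, or None for a local port such as Po10."""
    parts = port_field.split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    return FpRoute(*(int(p) for p in parts))


remote = parse_fp_route("200.11.65535")   # entry learned from the remote vPC+ pair
local = parse_fp_route("Po10")            # locally attached vPC port-channel
print(remote, local)
```

An entry that parses to a switch-ID of 200 on NX-1 points at the emulated switch formed by NX-3 and NX-4, whereas a result of None indicates a locally learned address.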
If issues arise with MAC learning, check that an IS-IS adjacency exists between the
devices. The IS-IS adjacency is established with the vPC peer device and with the other
spine or edge nodes, based on their connectivity. After verifying the adjacency, confirm
that the FP routes are learned through IS-IS. The route for the emulated switch from the
vPC peer is learned through the vPC Manager (vPCM) and is seen in the URIB as learned
through vpcm, as Example 5-45 shows. Looking deeper into the URIB, notice that the
route learned from the remote emulated switch has the Flag or Attribute set to E.
The edge ports are usually vPC ports in a vPC+-based design, so verifying the vPCM
information related to the vPC link is vital. Verify it using the command show
system internal vpcm info interface interface-id. This command displays the outer FP
MAC address of the port-channel interface, VLANs, vPC peer information, and the
information stored in the Persistent Storage Service (PSS). The PSS information helps
restore the state after link flaps or VDC/switch reloads. Example 5-46 displays
the vPCM information for port-channel 10 on the NX-1 node, highlighting the FP MAC
addresses and the information from the vPC peer.
IF Elem Information:
IF Index: 0x16000009
MCEC NUM: 10 Is MCEC
Allowed/Config VLANs : 6 - [1,100,200,300,400,500]
Allowed/Config BDs : 0 - []
MCECM DB Information:
IF Index : 0x16000009
vPC number : 10
Num members : 0
vPC state : Up
Internal vPC state: Up
Compat Status :
Old Compat Status : Pass
Current Compat Status: Pass
Up VLANs : 5 - [100,200,300,400,500]
Suspended VLANs : 1 - [1]
Compat check pass VLANs: 4096 - [0-4095]
Compat check fail VLANs: 0 - []
Up BDs : 0 - []
Suspended BDs : 0 - []
Compat check pass BDs : 0 - []
Compat check fail BDs : 0 - []
Compat check pass VNIs : 0 - []
Compat check fail VNIs : 0 - []
Peer Number : 10
Peer IF Index: 0x16000009
Peer state : Up
Card type : F2
Fabricpath outer MAC address info of peer: 100.11.0
Peer configured VLANs : 6 - [1,100,200,300,400,500]
Peer Up VLANs : 5 - [100,200,300,400,500]
Peer configured VNIs : 0 - []
Peer Up BDs : 0 - []
PSS Information:
IF Index : 0x16000009
vPC number: 10
vPC state: Up
Internal vPC state: Up
Old Compat Status: Pass
Compat Status: Pass
Card type : F2
Fabricpath outer MAC address info: 100.11.65535
Designated forwarder state: Allow
Up VLANs : 5 - [100,200,300,400,500]
Suspended VLANs : 1 - [1]
Up BDs : 0 - []
Suspended BDs : 0 - []
Peer number: 10
Peer if_index: 0x16000009
Peer state: Up
Card type : F2
Fabricpath outer MAC address info of peer: 100.11.0
Peer configured VLANs : 6 - [1,100,200,300,400,500]
Peer Up VLANs : 5 - [100,200,300,400,500]
..
Note The platform-dependent commands vary among platforms and also depend on the
line card present on the Nexus 7000/7700 chassis. If you encounter any issues with
vPC+, collect the following show tech-support command outputs:
show tech-support fabricpath
show tech-support vpc
Other show tech-support commands are collected as covered in the FabricPath section.
Summary
This chapter covered the technologies and features that provide resiliency and increased
capacity between switches from an L2 forwarding perspective. Port-channels and virtual
port-channels enable switches to create a logical interface with physical member ports.
Inconsistent port configuration across member ports is the most common problem
encountered with port-channels. This chapter detailed additional techniques and error
messages to look for when troubleshooting such issues.
FabricPath provides a different approach for removing spanning tree while increasing
link throughput and scalability and minimizing broadcast issues related to spanning tree.
Quite simply, FabricPath involves routing packets in an L2 realm in an encapsulated state;
the packet is later decapsulated before being forwarded to a host. Troubleshooting packet
forwarding in a FabricPath topology uses some of the basic concepts from troubleshoot-
ing STP and port forwarding while combining them with concepts involved in trouble-
shooting an IS-IS network.
References
Fuller, Ron, David Jansen, and Matthew McPherson. NX-OS and Cisco Nexus
Switching (Indianapolis: Cisco Press, 2013).
Troubleshooting IP and IPv6 Services
■ Object Tracking
■ IPv4 Services
■ IPv6 Services
■ Troubleshooting for First-Hop Redundancy Protocols
IP SLA
IP Service Level Agreement (SLA) is a network performance monitoring application
that enables users to do service-level monitoring, troubleshooting, and resource
planning. It is an application-aware synthetic operation agent that monitors
network performance by measuring response time, network reliability, resource
availability, application performance, jitter, connect time, and packet loss. The
statistics gained from this feature help with SLA monitoring, troubleshooting,
problem analysis, and network topology design. The IP SLA feature consists of
two main entities:
■ IP SLA sender: The IP SLA sender generates active measurement traffic based
on the operation type configured by the user, and reports metrics. In addition
to reporting metrics, the IP SLA sender detects threshold violations and
sends notifications. Figure 6-1 shows the various measurements for different
operation types.
322 Chapter 6: Troubleshooting IP and IPv6 Services
[Figure 6-1: IP SLA operations and the measurements they provide: network latency, packet loss, distribution of statistics, connectivity, and jitter.]
■ IP SLA responder: The responder runs on a separate switch from the sender. It
responds to the User Datagram Protocol/Transmission Control Protocol (UDP/TCP)
probes and reacts to control packets. The control packets and the command-line
interface (CLI) configuration determine the TCP/UDP ports and addresses on which
the responder listens for packets from the sender.
The IP SLA feature is not enabled by default. To enable it, use the command
feature sla [responder | sender]. The sender and responder features do not both need
to be enabled on the same device unless the IP SLA sender also acts as a
responder for a remote device.
■ ICMP Echo
■ ICMP Jitter
■ UDP Echo
■ UDP Jitter
■ TCP Connect
After it is configured, the probe does not start on its own. It can be started immediately
or after a certain period of time using the global configuration command ip sla schedule
number start-time [now | after time | time], where time is specified in hh:mm:ss format.
For ICMP echo probes, the IP SLA responder does not need to be configured on the
remote device to which the probe is destined. After the probe is started, the statistics for
the probe are verified using the command show ip sla statistics [number] [aggregated
| details]. The aggregated option displays the aggregated statistics, whereas the details
option displays the detailed statistics. Example 6-2 displays the statistics of the ICMP
echo probe configured in Example 6-1. In the show ip sla statistics command output,
carefully verify fields such as the RTT value, return code, number of successes, and
number of failures. In the aggregated command output, the RTT value is shown as an
aggregated value (for example, the Min/Avg/Max values of RTT for the probe).
RTT Values:
Number Of RTT: 694 RTT Min/Avg/Max: 2/2/7 milliseconds
Number of successes: 694
Number of failures: 0
Note The configuration for an IP SLA probe can be viewed using either the command
show running-config sla sender or the command show ip sla configuration number.
To define a UDP Echo IP SLA probe, use the command udp-echo [dest-ip-address |
dest-hostname] dest-port-number source-ip [src-ip-address | src-hostname] source-port
src-port-number [control [enable | disable]]. Example 6-3 illustrates a UDP Echo
probe on the NX-1 switch and a responder configured on the NX-2 switch. The example
also displays the statistics after the probe is enabled. Note that unless the
responder is configured on the remote end, the probe results in failures. To enable the
IP SLA responder, use the command ip sla responder. To configure the UDP Echo probe
responder, use the command ip sla responder udp-echo ipaddress ip-address port port-number.
NX-1 (Sender)
ip sla 11
  udp-echo 192.168.2.2 5000 source-ip 192.168.1.1 source-port 65000
    tos 180
    frequency 10
ip sla schedule 11 start-time now
NX-2 (Responder)
ip sla responder
ip sla responder udp-echo ipaddress 192.168.2.2 port 5000
NX-1
NX-1# show ip sla statistics 11 details
Note When a UDP Echo probe responder is configured, the responder
continuously listens on the specified UDP port.
The UDP jitter operation (also called UDP Plus) is a superset of the UDP echo operation.
In addition to measuring UDP RTT, it measures per-direction packet loss and jitter. Jitter
is interpacket delay variance. Jitter statistics are useful for analyzing traffic in a voice over
IP (VoIP) network.
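To make the definition of jitter concrete, the following Python sketch computes the delay change between consecutive packets of a probe. It is an illustration under stated assumptions rather than the IP SLA implementation: the function name interpacket_jitter and the timestamps are hypothetical, and sender and receiver clocks are assumed to be synchronized.

```python
from statistics import mean


def interpacket_jitter(send_ms, recv_ms):
    """Return the delay change between each pair of consecutive packets.

    A positive value means the later packet took longer than the one before it.
    """
    delays = [r - s for s, r in zip(send_ms, recv_ms)]
    return [later - earlier for earlier, later in zip(delays, delays[1:])]


# Hypothetical probe: packets sent every 20 ms, received with varying delay.
send_ms = [0, 20, 40, 60]
recv_ms = [5, 27, 44, 66]          # one-way delays: 5, 7, 4, and 6 ms
jitter = interpacket_jitter(send_ms, recv_ms)
avg_abs_jitter = mean(abs(j) for j in jitter)
print(jitter, avg_abs_jitter)
```

If every packet experienced the same delay, each jitter value would be zero regardless of how large that delay is, which is why latency and jitter are reported separately.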
The UDP jitter probe is defined using the command udp-jitter [dest-ip-address |
dest-hostname] dest-port-number codec codec-type [codec-numpackets
number-of-packets] [codec-size number-of-bytes] [codec-interval milliseconds]
[advantage-factor value] source-ip [src-ip-address | src-hostname] source-port
src-port-number [control [enable | disable]]. The default request packet data size for an
IP SLA UDP jitter operation is 32 bytes. Use the request-data-size option under the
ip sla command to modify this value.
Table 6-1 shows some of the options that are specified in the udp-jitter configuration.
Option                       Description
codec-size number-of-bytes   (Optional) Specifies the number of bytes in each packet
                             transmitted (also called the payload size or request size).
                             The range is from 16 to 1500. The default varies by codec.
codec-interval milliseconds  Specifies the interval (delay) between packets that should be
                             used for the operation, in milliseconds (ms). The range is from
                             1 to 60000. The default is 20.
advantage-factor value       Specifies the expectation factor to be used for ICPIF
                             calculations. This value is subtracted from the measured
                             impairments to yield the final ICPIF value (and corresponding
                             MOS value).
Example 6-4 illustrates the configuration of a UDP jitter probe that uses the g729a codec
and is set with a type of service (ToS) value of 180. Specify the life of the probe
with the ip sla schedule command by using the option life [time-in-seconds | forever].
For a UDP jitter probe, more detailed information is maintained as part of the
statistics: one-way latency, jitter time, packet loss, and voice score values.
NX-1
ip sla 15
  udp-jitter 192.168.2.2 5000 codec g729a codec-numpackets 50 codec-interval 100
    tos 180
    verify-data
    frequency 10
ip sla schedule 15 life forever start-time now
NX-1# show ip sla statistics 15 details
Note To cause an IP SLA operation to check each reply packet for data corruption, use
the verify-data command under ip sla configuration mode.
As latency and jitter increase, the MOS goes down. Such statistics help the
network design and implementation team optimize the network for the applications.
The TCP connection operation is used to discover the time required to connect to the
target device. This operation is used to test virtual circuit availability or application
availability. If the target is a Cisco router, the IP SLA probe makes a TCP connection to
any port number the user specifies. If the destination is a non-Cisco IP host, you must
specify a known target port number (for example, 21 for File Transfer Protocol [FTP], 23
for Telnet, or 80 for Hypertext Transfer Protocol [HTTP] server). This operation is useful
in testing Telnet or HTTP connection times.
To define a TCP connect IP SLA probe, use the command tcp-connect [dest-ip-address |
dest-hostname] dest-port-number source-ip [src-ip-address | src-hostname] source-port src-
port-number [control [enable | disable]]. For the TCP connect probe, the responder must be
configured on the destination router/switch using the command ip sla responder tcp-connect
ipaddress ip-address port port-number. Example 6-5 demonstrates the configuration of an
IP SLA TCP connect probe to probe a TCP connection between NX-1 and NX-2 switches.
NX-1 (Sender)
ip sla 20
  tcp-connect 192.168.2.2 10000 source-ip 192.168.1.1
ip sla schedule 20 life forever start-time now
NX-2 (Responder)
ip sla responder tcp-connect ipaddress 192.168.2.2 port 10000
NX-1
NX-1# show ip sla statistics 20 details
Note Refer to Nexus Cisco Connection Online (CCO) documentation for additional
information on other command options available with IP SLA.
Object Tracking
Several IP and IPv6 services, such as First-Hop Redundancy Protocol (FHRP), are
deployed in a network for reliability and high availability purposes, to ensure load bal-
ancing and failover capability. In spite of all these capabilities, network uptime is not
guaranteed when, for example, the WAN link goes down, which is more likely to occur in
a network than router failure. This results in considerable downtime on the link.
Object tracking offers a flexible and customizable mechanism for affecting and control-
ling the failovers in the network. With this feature, you can track specific objects in the
network and take necessary action when any object’s state change affects the network
traffic. The main objective of the object tracking feature is to allow the processes and
protocols in a router system to monitor the properties of other unrelated processes and
protocols in the same system, to accomplish the following goals:
Clients such as Hot Standby Router Protocol (HSRP), Virtual Router Redundancy
Protocol (VRRP), and Gateway Load Balancing Protocol (GLBP) can register their interest
in specific tracked objects and take action when the state of the object changes. Along
with these protocols, other clients that use this feature include the following:
■ Route reachability
Object tracking has the configuration syntax of track number <object-type> <object-
instance> <object-parameter>, where the number value ranges from 1 to 1000.
The object-type indicates one of the supported tracked objects (interface, ip route, or
track list). Object-instance refers to an instance of a tracked object (interface-name, route
prefix, mask, and so on). The object-parameter indicates the parameters related to the
object-type.
NX-1
NX-1(config)# track 1 interface ethernet 2/5 line-protocol
NX-1(config)# track 2 interface ethernet 2/5 ip routing
NX-1(config)# interface ethernet2/5
NX-1(config-if)# shut
Track 2
Interface Ethernet2/5 IP Routing
IP Routing is DOWN
2 changes, last change 00:00:08
Track 2
Interface Ethernet2/5 IP Routing
IP Routing is DOWN
4 changes, last change 00:00:42
the reachability of the route. Example 6-7 demonstrates the configuration for both IPv4
and IPv6 route status tracking objects. If the reachability for the tracked route is lost for
any reason (such as packet loss or routing protocol flap), the track goes down. You can also
configure the delay for the up and down events. The command delay [down | up] time-
in-seconds sets the track down delay and track up delay in seconds. The delay command
option prevents transient or nonpersistent events from triggering the track to go down.
NX-1
NX-1(config)# track 5 ip route 192.168.2.2/32 reachability
NX-1(config-track)# delay down 3
NX-1(config-track)# delay up 1
NX-1# show track 5
Track 5
IP Route 192.168.2.2/32 Reachability
Reachability is UP
3 changes, last change 00:02:07
Delay up 1 secs, down 3 secs
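The damping effect of the delay command can be modeled in a few lines of Python. This is a simplified sketch, not the NX-OS implementation: the function name apply_track_delay is hypothetical, the raw state is assumed to be sampled once per second, and a change is assumed to be reported once the raw state has disagreed with the reported state for the full delay.

```python
def apply_track_delay(raw_states, delay_up, delay_down, initial=True):
    """Return the reported track state for each one-second sample.

    raw_states: list of booleans (True = UP), one sample per second.
    A change is reported only after the raw state has differed from the
    reported state for the whole configured delay.
    """
    reported = initial
    pending = 0                      # seconds the raw state has disagreed
    out = []
    for raw in raw_states:
        if raw == reported:
            pending = 0              # the transient ended; nothing to report
        else:
            pending += 1
            needed = delay_up if raw else delay_down
            if pending >= needed:
                reported = raw
                pending = 0
        out.append(reported)
    return out


# Values from the example above: delay down 3, delay up 1.
raw = [True, False, False, True, False, False, False, False, True]
reported = apply_track_delay(raw, delay_up=1, delay_down=3)
print(reported)
```

In this run the two-second loss of reachability never reaches the clients of the track, whereas the four-second loss does.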
A tracking object can also be configured for an IP SLA probe using the command track
number ip sla sla-number [reachability | state]. The tracking object can thus indirectly
verify reachability to the remote prefix. The benefit of using IP SLA probes is that network
operators can use IP SLA not only to verify reachability, but also to track the status of
other probes for UDP echo, UDP jitter, and TCP connection.
Example 6-8 displays the configuration for object tracking with IP SLA probes. Notice
that the show track command output not only displays the state information, but also
returns the operation code and RTT information, which is actually part of the show ip sla
statistics command output.
NX-1
ip sla 10
  icmp-echo 192.168.2.2 source-interface loopback0
    request-data-size 1400
    frequency 5
ip sla schedule 10 start-time now
!
track 10 ip sla 10 state
NX-1# show track 10
Track 10
  IP SLA 10 State
  State is UP
  1 changes, last change 00:01:01
  Latest operation return code: OK
  Latest RTT (millisecs): 3
Example 6-9 illustrates the configuration for object tracking on a track list using both
and and or Boolean expressions. Note that, in the second section, the show track
command output for track list object 20 shows the state as DOWN. This is because
IP routing is enabled on interface Eth2/5, so object 2 is UP and the object 2 not
condition is not met.
NX-1
! Previous track configurations
NX-1# show run track
track 1 interface Ethernet2/5 line-protocol
track 2 interface Ethernet2/5 ip routing
track 5 ip route 192.168.2.2/32 reachability
delay up 1 down 3
track 10 ip sla 10
! Track List with Boolean AND for matching track 1 and not matching track 2.
NX-1(config)# track 20 list boolean and
NX-1(config-track)# object 1
NX-1(config-track)# object 2 not
List Boolean or
Boolean or is UP
1 changes, last change 00:01:23
Track List Members:
object 5 UP
object 1 UP
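The evaluation of a Boolean track list can be sketched in Python as follows. The function name eval_track_list and the data layout are hypothetical; the object numbers and the negated member mirror the track 20 configuration shown above, and all three objects are assumed to be UP.

```python
def eval_track_list(op, members, states):
    """Evaluate a Boolean track list.

    op: "and" or "or".
    members: list of (object_id, negated) pairs; negated mirrors 'object 2 not'.
    states: dict mapping object_id to True (UP) or False (DOWN).
    """
    if op not in ("and", "or"):
        raise ValueError("op must be 'and' or 'or'")
    # A negated member is satisfied when the underlying object is DOWN.
    results = [states[obj] != negated for obj, negated in members]
    return all(results) if op == "and" else any(results)


states = {1: True, 2: True, 5: True}     # assumed current object states

# track 20 list boolean and / object 1 / object 2 not
track_20 = eval_track_list("and", [(1, False), (2, True)], states)

# A Boolean or list with objects 5 and 1, as in the output above
or_list = eval_track_list("or", [(5, False), (1, False)], states)
print(track_20, or_list)
```

With object 2 UP, the negated member evaluates to false and the and list is DOWN, whereas the or list is UP as long as any one member is UP.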
You can also specify the threshold value used to maintain the state of the track list. The
threshold is defined in two forms:
■ Percentage
■ Weight
Either of these methods can be used with a track list, using the command track number
list threshold [percentage | weight].
Similarly, the weight-based threshold value is configured using the command threshold
weight up value down value. The combined weight of the objects in the UP state must
exceed the configured threshold weight for the track to remain in the UP state.
Example 6-10 displays the sample configuration for the percentage- and weight-based
threshold for track list objects. In the first configuration, with the percentage threshold,
at least two objects should be in the UP state because the UP percentage is configured to
be 60. In the second example, with a weight-based threshold, the track remains in the UP
state only if object 1 and either of the other two objects (object 2 or object 5) are in the
UP state because the weight for UP state is configured to be 45.
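The arithmetic behind both threshold types can be checked with a short Python sketch. The function names are hypothetical, and so are the weights (30 for object 1 and 20 each for objects 2 and 5), chosen only to reproduce the behavior described above. Only the up threshold is modeled; the down threshold is ignored, and the sample values are chosen so that none falls exactly on a threshold.

```python
def percentage_up(states, up_percent):
    """True when the share of UP objects reaches the 'up' percentage."""
    share = 100 * sum(states.values()) / len(states)
    return share >= up_percent


def weight_up(states, weights, up_weight):
    """True when the combined weight of UP objects reaches the 'up' weight."""
    total = sum(weights[obj] for obj, up in states.items() if up)
    return total >= up_weight


weights = {1: 30, 2: 20, 5: 20}          # hypothetical per-object weights

two_of_three = {1: True, 2: True, 5: False}
one_of_three = {1: True, 2: False, 5: False}
without_obj1 = {1: False, 2: True, 5: True}

print(percentage_up(two_of_three, 60))       # 2 of 3 objects is about 67 percent
print(percentage_up(one_of_three, 60))       # 1 of 3 objects is about 33 percent
print(weight_up(two_of_three, weights, 45))  # 30 + 20 = 50
print(weight_up(without_obj1, weights, 45))  # 20 + 20 = 40
```

With these weights, no combination that excludes object 1 can reach 45, which matches the requirement that object 1 and one other object be UP.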
NX-1
NX-1(config)# ip route 192.168.2.2/32 10.12.1.2 track 1
NX-1(config)# ip route 192.168.2.2/32 10.13.1.3 254
NX-1# show ip route 192.168.2.2/32
192.168.2.2/32, ubest/mbest: 1/0
*via 10.12.1.2, [1/0], 00:00:48, static
Note If any issues with object tracking arise, collect the show tech track command
output and share it with the Cisco Technical Assistance Center (TAC).
IPv4 Services
NX-OS contains a wide array of critical network services that provide flexibility, scalability,
reliability, and security in the network and solve critical problems that enterprises and
data centers face. This section discusses the following IP services:
■ DHCP relay
■ DHCP snooping
■ IP source guard
■ Unicast RPF
DHCP Relay
Unlike traditional Cisco IOS or Cisco IOS XE software, NX-OS does not support the
Dynamic Host Configuration Protocol (DHCP) server feature. However, you can enable
the NX-OS device to function as a DHCP relay agent. A DHCP relay agent is a device
that helps in relaying DHCP requests/replies between the DHCP client and the DHCP
server when they are on different subnets. The relay agent listens for the client’s request
and adds vital data, such as the client’s link information, which the server needs to allocate
an address for the client. When the server replies, the relay agent forwards the information
back to the client.
The DHCP relay agent is a useful feature, but some security concerns do arise:
■ A host on one port cannot see other hosts’ traffic on other ports.
■ Hosts connected to the metro port can no longer be trusted. Therefore, a mechanism
is needed to identify them more securely.
DHCP option 82 helps overcome these issues. Defined in RFC 3046, DHCP option 82 is
a new type of container option that contains suboption information gathered by the relay
agent. Figure 6-2 shows the format of the DHCP relay agent information option.
The length N gives the total number of bytes in the Agent Information Field, which con-
sists of a sequence of SubOpt/Length/Value tuples for each suboption.
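The nested layout described above can be illustrated by encoding and decoding the option in Python. The option code (82) and the suboption codes for the Agent Circuit ID (1) and Agent Remote ID (2) come from RFC 3046; the function names and the sample ID values are hypothetical, because the actual circuit and remote ID contents are platform specific.

```python
import struct

RELAY_AGENT_INFO = 82      # option code from RFC 3046
CIRCUIT_ID = 1             # Agent Circuit ID suboption
REMOTE_ID = 2              # Agent Remote ID suboption


def build_option82(subopts):
    """Encode [(subopt_code, value_bytes), ...] as a complete option 82."""
    body = b"".join(struct.pack("BB", code, len(val)) + val for code, val in subopts)
    return struct.pack("BB", RELAY_AGENT_INFO, len(body)) + body


def parse_option82(data):
    """Decode an option 82 byte string back into (subopt_code, value) pairs."""
    code, length = data[0], data[1]
    if code != RELAY_AGENT_INFO:
        raise ValueError("not a relay agent information option")
    body, pos, out = data[2:2 + length], 0, []
    while pos < len(body):
        sub, sub_len = body[pos], body[pos + 1]
        out.append((sub, body[pos + 2:pos + 2 + sub_len]))
        pos += 2 + sub_len
    return out


# Hypothetical circuit and remote IDs, each 6 bytes long.
subopts = [(CIRCUIT_ID, b"Eth7/1"), (REMOTE_ID, b"\x00\x11\x22\x33\x44\x55")]
opt = build_option82(subopts)
print(opt.hex())
```

Here the length N is 16: each of the two suboptions contributes a one-byte code, a one-byte length, and a six-byte value.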
Following is a sample sequence of a DHCP message flow when the DHCP option
82 feature is enabled on the access switch:
■ The relay agent on the switch intercepts the broadcast request and inserts the option
82 data (circuit ID and remote ID). It places the relay agent IP address in the DHCP
packet giaddr field, replaces UDP source port 68 with relay agent server port 67, and
then unicasts the client request to the DHCP server on the same UDP destination
port. The DHCP server IP address is configured as an IP helper address when the option
82 feature is enabled on the relay agent interface.
■ The DHCP server receives and processes the relayed DHCP client request. A DHCP
server that is capable of handling option 82 data responds with a DHCPOFFER mes-
sage that includes an available network address in the message yiaddr field, along
with all the option 82 data. The response is a UDP unicast message directly routed
to the relay agent with the UDP destination port as 67.
■ The relay agent receives the server reply. It removes the option 82 data and either
unicasts or broadcasts the DHCPOFFER message back to the client.
■ The client might receive multiple DHCPOFFERs from multiple DHCP servers.
When it decides to accept an offer from a particular DHCP server, it broadcasts a
DHCPREQUEST message to the server with a UDP destination port of 67. It uses
the server identifier option in the message to indicate which server it has selected.
■ Similarly, the relay agent intercepts the broadcast request, inserts the option 82 data,
and relays the request to the DHCP server.
■ The selected DHCP server acknowledges the request by committing the assigned IP
address and it unicasts a DHCPACK message to the relay agent with the UDP desti-
nation port as 67.
■ The relay agent receives the reply, removes the option 82 data from the message, and
relays the message back to the client.
■ When the client receives the DHCPACK message with the configuration parameters,
it performs an Address Resolution Protocol (ARP) check for the allocated IP address
to make sure it is not being used by another host. If it detects that the IP address
is already in use, it sends a DHCPDECLINE message to the server and restarts the
configuration process. Similarly, the relay agent intercepts the message and relays it.
■ When the client chooses to relinquish its lease on the IP address, it sends a
DHCPRELEASE message to the server. The DHCPRELEASE message is always a
unicast message to the server, so no relay agent is used here.
■ Upon receipt of DHCPRELEASE message, the server marks the IP address as not
allocated.
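The two transformations that the relay agent applies in this flow can be modeled in Python with packets represented as plain dictionaries. This is a conceptual sketch, not a packet implementation: the field names, the function names, and the relay address 10.1.1.1 are hypothetical, and the option 82 payload is a placeholder byte string.

```python
BOOTPS, BOOTPC = 67, 68     # DHCP server and client UDP ports
OPTION_82 = 82


def relay_client_request(packet, relay_ip, server_ip, option82):
    """Return the unicast copy of a client broadcast that the relay forwards."""
    relayed = dict(packet)                 # leave the original untouched
    relayed["giaddr"] = relay_ip           # relay agent address in giaddr
    relayed["src_port"] = BOOTPS           # source port 68 replaced by 67
    relayed["dst_ip"] = server_ip          # unicast to the helper address
    relayed["options"] = {**packet.get("options", {}), OPTION_82: option82}
    return relayed


def relay_server_reply(packet):
    """Return the server reply with option 82 removed, ready for the client."""
    reply = dict(packet)
    reply["options"] = {k: v for k, v in packet.get("options", {}).items()
                        if k != OPTION_82}
    return reply


request = {"type": "DHCPREQUEST", "src_port": BOOTPC, "dst_port": BOOTPS,
           "dst_ip": "255.255.255.255", "giaddr": "0.0.0.0", "options": {}}
to_server = relay_client_request(request, "10.1.1.1", "192.168.2.2", b"placeholder")

ack = {"type": "DHCPACK", "dst_port": BOOTPS, "options": {OPTION_82: b"placeholder"}}
to_client = relay_server_reply(ack)
print(to_server, to_client)
```

The same request function applies to every client broadcast in the sequence, and the same reply function applies to both the DHCPOFFER and the DHCPACK.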
To enable the DHCP relay agent on NX-OS, the DHCP feature must be enabled on the
system using the command feature dhcp. To enable the device to act as a DHCP relay
agent, configure the global command ip dhcp relay. The DHCP relay is configured on the
interface using the command ip dhcp relay address ip-address, where the ip-address
variable is the address of the DHCP server. To enable DHCP option 82, configure the
global command ip dhcp relay information option.
To further understand the DHCP Relay feature, examine the topology in Figure 6-3. In
this topology, NX-1 is acting as the relay agent.
[Figure 6-3: topology diagram with NX-1 as the relay agent. Labels: E7/1, E7/13, 10.12.1.0/24, R2, 192.168.2.2/32.]
Example 6-12 displays the configuration of the DHCP relay agent on NX-1. Note that
the DHCP server must be reachable from the interface on which the
DHCP relay is configured.
NX-1
NX-1(config)# feature dhcp
NX-1(config)# ip dhcp relay
NX-1(config)# interface e7/1
NX-1(config-if)# ip dhcp relay address 192.168.2.2
When the configuration is done and the client tries to request an IP address, the DHCP
relay agent helps exchange the messages between the client and the server. Use the com-
mand show ip dhcp relay to verify that the interface is enabled with DHCP relay. After the
messages are exchanged, verify the statistics of all the messages received and forwarded
by the relay agent in both directions (between server and client) using the command show
ip dhcp relay statistics. Example 6-13 examines the DHCP relay configuration and statis-
tics on NX-1. The show ip dhcp relay statistics command output displays the statistics
for all the different kinds of DHCP packets received, forwarded, and dropped by the relay
agent. Along with this information, the command output displays the various reasons why
a relay agent drops the packet, along with its statistics.
DHCP L3 FWD:
Total Packets Received : 0
Total Packets Forwarded : 0
Total Packets Dropped : 0
Non DHCP:
Total Packets Received : 0
Total Packets Forwarded : 0
Total Packets Dropped : 0
DROP:
DHCP Relay not enabled : 0
Invalid DHCP message type : 0
Interface error : 0
Tx failure towards server : 0
Tx failure towards client : 0
Unknown output interface : 0
Unknown vrf or interface for server : 0
Max hops exceeded : 0
Option 82 validation failed : 0
Packet Malformed : 0
Relay Trusted port not configured : 0
* - These counters will show correct value when switch
receives DHCP request packet with destination ip as broadcast
address. If request is unicast it will be HW switched
When a DHCP relay address is configured, access control list (ACL) programming happens
on the Nexus switch:
■ If the L3 interface is a switched virtual interface (SVI), the VLAN ACL (VACL) is
programmed on the hardware.
■ Filter
■ Permit source ports 67 and 68 to any destination port
■ Permit any port to destination ports 67 and 68
■ Action
■ Redirect to DHCP Snoop on supervisor
The DHCP process registers with Netstack for this particular exception cause. As a result,
all DHCP requests/replies captured by the line card (LC) come to the DHCP snooping
process via the Netstack fast MTS queue.
When the DHCP relay is configured on the interface, the command show system internal
access-list interface interface-id [module slot] checks that the ACL is programmed in
hardware (see Example 6-14). Notice that, in the output, the policy type is DHCP and
the policy name is Relay. The command output displays the number of ternary content-
addressable memory (TCAM) entries held by the ACL and the number of adjacencies.
Example 6-14 Verifying ACL on the Line Card for DHCP Relay
INSTANCE 0x0
---------------
Label_b = 0x201
Bank 0
------
IPv4 Class
Policies: DHCP(Relay) [Merged]
Netflow profile: 0
Netflow deny profile: 0
5 tcam entries
No egress policies
No Netflow profiles in egress direction
When the ACL is programmed on the line card, view the hardware statistics for the ACL
using the command show system internal access-list input statistics [module slot].
Example 6-15 displays the statistics for the DHCP relay ACL, where five hits match the
traffic coming from source port 67. If DHCP is not functioning properly during regular
operation, use this command together with the command in the previous example to
ensure both that the DHCP relay ACL is programmed in hardware and that the statistics
counters are incrementing.
Example 6-15 Verifying ACL Statistics on the Line Card for DHCP Relay
INSTANCE 0x0
---------------
Entries:
[Index] Entry [Stats]
---------------------
DHCP Snooping
DHCP snooping is an L2 security feature. It mitigates some types of denial-of-service (DoS)
attacks that can be engineered with DHCP messages and helps prevent IP spoofing, in which
a malicious host tries to use the IP address of another host. DHCP snooping works at two levels:
■ Discovery
■ Enforcement
Discovery includes the functions of intercepting DHCP messages and building a database
of {IP address, MAC address, Port, VLAN} records. This database is called the binding
table. Enforcement includes the functions of DHCP message validation, rate limiting, and
conversion of DHCP broadcasts to unicasts.
Note DHCP snooping is associated with the DHCP relay agent, which helps extend the
same security features when the DHCP client and server are in different subnets.
To understand how DHCP snooping works, examine the topology in Figure 6-4.
In this topology, both the DHCP server and the client are part of the same VLAN, VLAN 100.
Nexus switch NX-1 provides Layer 2 connectivity between the DHCP server and the
client host.
[Figure 6-4: topology diagram with the DHCP server and client in VLAN 100. Labels: E7/1, E7/13, R2.]
To enable DHCP snooping, configure the command ip dhcp snooping globally on the
Nexus switch and then enable the DHCP snooping for the VLAN using the command ip
dhcp snooping vlan vlan-id. Usually, the ports connected to the DHCP server are con-
figured as trusted ports and the ports connecting the clients are untrusted ports. To con-
figure the port connecting the server as a trusted port, enable the interface configuration
command ip dhcp snooping trust. Example 6-16 illustrates the configuration of DHCP
snooping on NX-1. When DHCP snooping is enabled, use the command show ip dhcp
snooping to validate the status of DHCP snooping on the switch. Notice that, in the out-
put of the command show ip dhcp snooping, NX-1 shows the DHCP snooping feature
being enabled and operational for VLAN 100.
NX-1
NX-1(config)# ip dhcp snooping
NX-1(config)# ip dhcp snooping vlan 100
NX-1(config)# interface e7/13
NX-1(config-if)# ip dhcp snooping trust
NX-1# show ip dhcp snooping
Switch DHCP snooping is enabled
DHCP snooping is configured on the following VLANs:
100
DHCP snooping is operational on the following VLANs:
100
Insertion of Option 82 is disabled
Verification of MAC address is enabled
DHCP snooping trust is configured on the following interfaces:
Interface Trusted
------------ -------
Ethernet7/13 Yes
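When checking many switches, the comparison between configured and operational VLANs can be scripted. The sketch below is one possible approach, assuming the output layout matches the sample shown above, with each VLAN list on the line that follows its header. The function names are hypothetical, and VLAN ranges such as 100-102 are not handled.

```python
import re

SAMPLE = """\
Switch DHCP snooping is enabled
DHCP snooping is configured on the following VLANs:
100
DHCP snooping is operational on the following VLANs:
100
Insertion of Option 82 is disabled
"""


def vlans_after(header, text):
    """Return the VLAN IDs listed on the line that follows `header`."""
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip().startswith(header) and i + 1 < len(lines):
            candidate = lines[i + 1].strip()
            # Accept the next line only if it looks like a list of numbers.
            if re.fullmatch(r"[\d,\s]+", candidate):
                return {int(v) for v in re.findall(r"\d+", candidate)}
    return set()


def not_operational(text):
    """VLANs that are configured for snooping but not operational."""
    configured = vlans_after("DHCP snooping is configured", text)
    operational = vlans_after("DHCP snooping is operational", text)
    return configured - operational


print(not_operational(SAMPLE))  # set()
```

A non-empty result points to a VLAN where snooping is configured but has not become operational.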
After the requests/replies are exchanged between the client and the server, a bind-
ing entry is built on the device with DHCP snooping configuration for the untrusted
port. The binding table is also used by IP source guard (IPSG) and the Dynamic ARP
Inspection (DAI) feature. To view the binding table, use the command show ip dhcp
snooping binding (see Example 6-17). In this example, notice that the entry is built for
the untrusted port Eth7/1 and also shows the IP address assigned to the host with the
listed MAC address.
■ Allow request (BOOTREQUEST) from client (source port 68) to server (destination
port 67).
■ For server (src port 67) to client (dst port 68) response (BOOTREPLY), perform
binding table updates, strip off option 82, and forward the packet.
■ For client (src port 68) to server (dst port 67), just forward the request without any
validation.
To perform these validations, an ACL is installed on the
line card for the DHCP snooping feature, and you can view the statistics for the different
entries that are part of the programmed ACL. Example 6-18 displays the DHCP snooping ACL
programmed in hardware and its statistics.
INSTANCE 0x0
---------------
IPv4 Class
Policies: DHCP(Snooping) [Merged]
Netflow profile: 0
Netflow deny profile: 0
5 tcam entries
INSTANCE 0x0
---------------
INSTANCE 0x3
---------------
DAI is enabled on a per-VLAN basis and supports enabling src-MAC, dst-MAC, and IP
address validation. The [Source, Destination] and [MAC, IP] addresses of the ARP packets
are validated against the snooping binding entry for valid unicast IP addresses. If a device
has no binding entry, a DAI trust port needs to be configured for that ingress interface
before ARP inspection works on that device.
Example 6-19 displays the configuration for DAI on VLAN 100 and the use of the
command show ip arp inspection statistics vlan vlan-id to display the statistics of the
ARP requests/responses and the number of packets forwarded. DAI is configured for a
VLAN using the command ip arp inspection vlan vlan-id. The port toward the server is
configured as the trusted port, so it is enabled using the interface-level command ip arp
inspection trust. On a DAI trusted port, no checks are placed for the rx and tx packets.
Vlan : 100
-----------
ARP Req Forwarded = 2
ARP Res Forwarded = 3
ARP Req Dropped = 0
ARP Res Dropped = 0
DHCP Drops = 0
DHCP Permits = 5
SMAC Fails-ARP Req = 0
SMAC Fails-ARP Res = 0
DMAC Fails-ARP Res = 0
IP Fails-ARP Req = 0
IP Fails-ARP Res = 0
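The counters above lend themselves to a quick automated check. The following sketch assumes the "name = value" layout shown in the example. The helper names are hypothetical, and treating any counter whose name contains "Drop" or "Fails" as a problem indicator is a simplification.

```python
SAMPLE = """\
ARP Req Forwarded = 2
ARP Res Forwarded = 3
ARP Req Dropped = 0
ARP Res Dropped = 0
DHCP Drops = 0
DHCP Permits = 5
SMAC Fails-ARP Req = 0
IP Fails-ARP Res = 0
"""


def parse_counters(text):
    """Turn 'name = value' lines into a dictionary of integers."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def suspicious(counters):
    """Return the drop and validation-failure counters that are nonzero."""
    return {name: value for name, value in counters.items()
            if value and ("Drop" in name or "Fails" in name)}


counters = parse_counters(SAMPLE)
print(suspicious(counters))  # {}
```

An empty result matches the healthy state in the example. A nonzero SMAC, DMAC, or IP failure counter indicates which validation is rejecting ARP packets.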
For DAI, an ARP snooping ACL (VACL) is programmed on the line card. Note that
because the DAI feature is enabled along with the DHCP snooping feature, both the
ACLs are seen on the line card. Example 6-20 displays the ACL programmed on the line
card and the relevant statistics for the same.
INSTANCE 0x0
---------------
INSTANCE 0x0
---------------
ARP Class
Policies: ARP(Snooping) [Merged]
Netflow profile: 0
Netflow deny profile: 0
Entries:
[Index] Entry [Stats]
---------------------
[0062:0018:0018] prec 1 redirect(0x0) arp/response ip 0.0.0.0/0 0.0.0.0/0 0000.0000.0000 0000.0000.0000 [2]
[0063:0019:0019] prec 1 redirect(0x0) arp/request ip 0.0.0.0/0 0.0.0.0/0 0000.0000.0000 0000.0000.0000 [1]
[0064:001a:001a] prec 1 permit arp-rarp/all ip 0.0.0.0/0 0.0.0.0/0 0000.0000.0000 0000.0000.0000 [0]
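Because the useful signal in this output is whether the hit counters move, it can help to capture the statistics twice and compare them. The sketch below assumes each entry sits on a single line in the "[Index] Entry [Stats]" layout. The function names are hypothetical and the second snapshot is fabricated for the demonstration.

```python
import re

ENTRY = re.compile(r"^\[([0-9a-fA-F:]+)\]\s+(.*?)\s+\[(\d+)\]$")

BEFORE = """\
[0062:0018:0018] prec 1 redirect(0x0) arp/response ip 0.0.0.0/0 0.0.0.0/0 0000.0000.0000 0000.0000.0000 [2]
[0063:0019:0019] prec 1 redirect(0x0) arp/request ip 0.0.0.0/0 0.0.0.0/0 0000.0000.0000 0000.0000.0000 [1]
[0064:001a:001a] prec 1 permit arp-rarp/all ip 0.0.0.0/0 0.0.0.0/0 0000.0000.0000 0000.0000.0000 [0]
"""
# Made-up second capture in which only the first entry gained a hit.
AFTER = BEFORE.replace("[2]", "[3]")


def parse_entries(text):
    """Return (index, rule, hits) tuples for every ACL entry line."""
    entries = []
    for line in text.splitlines():
        match = ENTRY.match(line.strip())
        if match:
            index, rule, hits = match.groups()
            entries.append((index, rule, int(hits)))
    return entries


def incremented(before, after):
    """Indexes of entries whose hit counter grew between two captures."""
    old = {index: hits for index, _, hits in before}
    return [index for index, _, hits in after if hits > old.get(index, 0)]


entries = parse_entries(BEFORE)
print(incremented(entries, parse_entries(AFTER)))  # ['0062:0018:0018']
```

If no counter increments while clients are actively sending traffic, the packets are probably not reaching the programmed ACL.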
ARP ACLs
In non-DHCP scenarios (no DHCP snooping enabled), you can define ARP ACLs to
filter out malicious ARP requests and responses. No packets are redirected to the
supervisor. ARP packets arriving on the line card are forwarded or dropped on the line
card according to the user-configured ARP ACL that is applied as the ARP inspection filter.
The ARP ACL filters are configured on a per-VLAN basis. An ARP ACL is configured using the
command arp access-list acl-name. It accepts entries in the format [permit | deny]
[request | response] ip ip-address subnet-mask mac [mac-address mac-address-range].
Example 6-21 demonstrates an ARP ACL that is applied as an ARP inspection filter
for VLAN 100. After it is configured, the ARP ACL is programmed in hardware,
and you can verify the hardware statistics using the same command, show system
internal access-list input statistics [module slot].
INSTANCE 0x0
---------------
IP Source Guard
IP Source Guard (IPSG) provides IP and MAC filters to restrict IP traffic on DHCP
snooping untrusted ports. IP traffic with source IP and MAC addresses that correspond
to a valid IP source binding (both static IP and DHCP binding) is permitted; all other IP
traffic except DHCP is dropped. Traditionally, this prevents IP spoofing by allowing only
IP addresses obtained through DHCP snooping on a particular port.
The IPSG feature is enabled on a DHCP snooping untrusted Layer 2 port. Initially, all IP
traffic on the port is blocked except for DHCP packets that are captured by the DHCP
snooping process. When a client receives a valid IP address from the DHCP server, IP
traffic from hosts connected to the switch is allowed only if the MAC–IP address pair
matches what is programmed by the IPSG module. The IPSG feature picks up the
MAC–IP bindings from the binding table and programs the source MAC (SMAC)–IP
binding check in the reverse path forwarding (RPF) table on the line card, thus providing
a per-port IP traffic filter in hardware.
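The per-port filter logic described above can be summarized in a few lines. This Python sketch is a simplified model, not the hardware implementation. The function name, the representation of bindings as a set of (MAC, IP) pairs, and the sample addresses are assumptions for illustration.

```python
def ipsg_permits(src_ip, src_mac, is_dhcp, port_bindings):
    """Return True if IPSG would let the packet through on an untrusted port.

    port_bindings is a set of (mac, ip) pairs learned for that port.
    """
    if is_dhcp:
        # DHCP packets are always allowed so the client can obtain a lease.
        return True
    return (src_mac.lower(), src_ip) in port_bindings


# Hypothetical binding for the host connected to the untrusted port.
bindings = {("0000.0c12.3456", "10.12.1.3")}
print(ipsg_permits("10.12.1.3", "0000.0C12.3456", False, bindings))   # True
print(ipsg_permits("10.12.1.99", "0000.0c12.3456", False, bindings))  # False
print(ipsg_permits("0.0.0.0", "0000.0c12.3456", True, bindings))      # True
```

With an empty binding set, only DHCP traffic passes, which matches the initial state of a port before the client obtains an address.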
IPSG is enabled on the L2 port on the switch connecting the host using the command
ip verify source dhcp-snooping-vlan. Verify the IPSG table after the host has been
assigned an IP address using the command show ip verify source interface interface-id.
Example 6-22 demonstrates IPSG being enabled on port Ethernet 7/1, which is the port
facing the host (untrusted port) and the IPSG table, after the DHCP server has assigned a
DHCP address.
When IPSG is enabled and IP–MAC entries have been programmed through the
Forwarding Information Base (FIB), all traffic is checked against the IP–MAC binding. For
instance, a ping from that client (with a valid IP–MAC binding) should work. When the
IPSG binding entry is removed, the ping fails and the FIB drops any such invalid traffic.
The SMAC–IP binding in the RPF table is programmed through Security Abstraction
Layer (SAL), a virtual device context (VDC) local and conditional compulsory process
running on a Nexus system. It uses the NX-OS infrastructure for system startup, restart,
and high availability (HA) capability and interprocess communication. SAL is thus treated
as a hardware abstraction layer in the supervisor for programming the IPSG bindings
database in FIB, which ensures security in the packet forwarding stage.
The SAL database information is viewed using the command show system internal sal
info database vlan vlan-id. This command provides the IPv4 and IPv6 table IDs, which
are further used to verify the information in the FIB using the command show system
internal forwarding table table-id route ip-address/mask [module slot] (here, table-id
is the field received from the SAL database output). Example 6-23 demonstrates how to
verify the IPSG FIB programming using SAL database info.
Example 6-23 SAL Database Info and FIB Verification for IPSG
----+---------------------+----------+----------+-----------
Dev | Prefix | PfxIndex | AdjIndex | LIF
----+---------------------+----------+----------+-----------
0 10.12.1.3/32 0x406 0x4d 0xfff
Unicast RPF
Unicast Reverse Path Forwarding (URPF) is a technique that matches on source IP
addresses to drop spoofed traffic at the edge of the network. In other words, URPF protects
the network against source IP spoofing attacks while still allowing legitimate sources to send
their traffic toward the destination server. URPF is implemented in two different modes:
■ Loose mode: A loose mode check is successful when a lookup of a packet source
address in the FIB returns a match and the FIB result indicates that the source is
reachable through at least one real interface. The ingress interface through which the
packet is received is not required to match any of the interfaces in the FIB result.
■ Strict mode: A strict mode check is successful when Unicast RPF finds a match in
the FIB for the packet source address and the ingress interface through which the
packet is received matches one of the Unicast RPF interfaces in the FIB match. If
this check fails, the packet is discarded. Use this type of Unicast RPF check when
packet flows are expected to be symmetrical.
Strict mode URPF is used on up to eight ECMP interfaces; if more than eight are in use,
it reverts to loose mode. Loose mode URPF is used on up to 16 ECMP interfaces. URPF
is applied on L3 interfaces, SVIs, L3 port-channels, and subinterfaces. One caveat of URPF
strict mode is that it is incompatible with /32 ECMP routes. Thus, using URPF strict mode on
the uplink to the core is not recommended because the /32 route could be dropped.
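The difference between the two modes comes down to what is checked after the source-address lookup. The following Python sketch models that decision with a dictionary standing in for the FIB. The function name, the FIB representation, and the handling of the default route (counted in loose mode only when allow_default is set) are assumptions for illustration. ECMP limits are not modeled.

```python
import ipaddress


def urpf_permits(src_ip, ingress_if, fib, mode, allow_default=False):
    """Model a URPF check. fib maps prefix strings to sets of interfaces."""
    addr = ipaddress.ip_address(src_ip)
    best_net, best_ifs = None, set()
    for prefix, interfaces in fib.items():
        net = ipaddress.ip_network(prefix)
        # Longest-prefix match on the packet's source address.
        if addr in net and (best_net is None or net.prefixlen > best_net.prefixlen):
            best_net, best_ifs = net, set(interfaces)
    if best_net is None or not best_ifs:
        return False
    if mode == "any":
        # Loose mode: any real interface will do; the ingress port is ignored.
        return best_net.prefixlen > 0 or allow_default
    if mode == "rx":
        # Strict mode: the packet must arrive on an interface in the FIB result.
        return ingress_if in best_ifs
    raise ValueError("mode must be 'any' or 'rx'")


fib = {"10.1.0.0/16": {"Eth7/1"}, "0.0.0.0/0": {"Eth7/2"}}
print(urpf_permits("10.1.2.3", "Eth7/1", fib, "rx"))   # True
print(urpf_permits("10.1.2.3", "Eth7/2", fib, "rx"))   # False
print(urpf_permits("10.1.2.3", "Eth7/2", fib, "any"))  # True
```

The second call shows why strict mode suits only symmetrical flows: a legitimate packet that returns over a different interface fails the check.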
URPF is configured using the command ip verify unicast source reachable-via [any
[allow-default] | rx]. The rx option enables strict mode; the any option enables loose
mode URPF. The allow-default option is used with loose mode to include IP addresses
that are not specifically contained in the routing table. Example 6-24 demonstrates the
configuration for enabling URPF strict mode on an L3 interface. After configuration, use
the command show ip interface interface-id to check whether URPF has been enabled
on the interface. In the following example, the URPF mode enabled on interface Eth7/1 is
strict mode.
IPv6 Services
With data centers growing so rapidly, IPv6 has become more relevant in the network to
overcome addressing as well as security challenges. NX-OS provides various IPv6 services
that provide reliability as well as security in a scaled data center environment. This sec-
tion discusses the following IPv6 services:
■ Neighbor discovery
This section details those features and shows how to troubleshoot them on NX-OS switches.
Neighbor Discovery
Defined in RFC 4861, IPv6 Neighbor Discovery (ND) is a set of messages and processes
that determine the relationships between two IPv6 neighboring nodes. The IPv6 ND is
built on top of ICMPv6, which is defined in RFC 4443 (which obsoletes RFC 2463). IPv6 ND replaces protocols such
as ARP, ICMP redirect, and ICMP router discovery messages, used in IPv4. Both IPv6
ND and ICMPv6 are critical for operations of IPv6.
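IPv6 ND defines five ICMPv6 packet types that give nodes the information they need
before establishing communication:
■ Router Solicitation (RS), ICMPv6 type 133
■ Router Advertisement (RA), ICMPv6 type 134
■ Neighbor Solicitation (NS), ICMPv6 type 135
■ Neighbor Advertisement (NA), ICMPv6 type 136
■ Redirect Message (RM), ICMPv6 type 137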
When an interface is enabled, hosts can send out a Router Solicitation (RS) that
requests routers to generate Router Advertisements immediately instead of at their
next scheduled time. In the Ethernet header of the RS message, the source address field
is set to the MAC address of the sending network interface card (NIC) and the destination
address field is set to 33:33:00:00:00:02. In the IPv6 header, the source
address field is set to either the link-local IPv6 address assigned to the sending interface
or the IPv6 unspecified address (::). The destination address is set to the all-routers multicast
address with link-local scope (FF02::2), and the hop limit is set to 255.
Routers advertise their presence together with various link and Internet parameters either
periodically or in response to a Router Solicitation message. Router Advertisements
(RAs) contain prefixes that are used for on-link determination and/or address configuration,
a suggested hop limit value, maximum transmission unit (MTU), and so on. In the
Ethernet header of the RA message, the source address field is set to the MAC address of
the sending NIC; the destination address field is set to 33:33:00:00:00:01 or the unicast MAC
address of the host that sent an RS message from a unicast address. In the IPv6 header, the
source address field is set to the link-local address assigned to the sending interface; the
destination address is set to either the all-nodes multicast address with link-local scope
(FF02::1) or the unicast IPv6 address of the host that sent the RS message. The hop limit
field is set to 255.
A Redirect Message (RM) is used by routers to inform hosts of a better first hop for a
destination. In the Ethernet header, the destination MAC is set to the unicast MAC of the
originating sender. In the IPv6 header, the source address field is set to the unicast IPv6
address of the sending interface and the destination address is set to the unicast address
of the originating host.
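The Ethernet destination addresses quoted above follow the standard mapping of IPv6 multicast groups onto MAC addresses: the prefix 33:33 followed by the low-order 32 bits of the group address. The short Python sketch below reproduces that mapping. The function name is chosen for this example only.

```python
import ipaddress


def ipv6_multicast_mac(group):
    """Map an IPv6 multicast address to its Ethernet multicast MAC."""
    addr = ipaddress.IPv6Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast address")
    low32 = addr.packed[-4:]  # last four bytes of the 16-byte address
    return "33:33:" + ":".join(f"{byte:02x}" for byte in low32)


print(ipv6_multicast_mac("ff02::2"))  # 33:33:00:00:00:02
print(ipv6_multicast_mac("ff02::1"))  # 33:33:00:00:00:01
```

When reading a packet capture, this makes it easy to confirm that a frame addressed to 33:33:00:00:00:02 carries traffic for the all-routers group.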
To enable neighbor discovery, the first step is to enable IPv6 or configure an IPv6 address
on an interface. An IPv6 address is configured using either the command ipv6 address
ipv6-address [eui64] or the command ipv6 address use-link-local-only. The command
option eui64 configures the IPv6 address in EUI64 format. The command option use-
link-local-only manually configures a link-local address on the interface instead of using
the automatically assigned link-local address. Examine an IPv6-enabled link between two
switches NX-1 and NX-2, as in Figure 6-5. In this topology, the link is configured with
the IPv6 address of subnet 2002:10:12:1::/64.
[Figure 6-5: NX-1 (E4/1) connected to NX-2 (E4/13) over subnet 2002:10:12:1::/64]
When the IPv6 address is configured on both sides of the link and one of the sides initi-
ates a ping, the ND process starts and an IPv6 neighborship is established. An IPv6 neigh-
bor is viewed using the command show ipv6 neighbor [detail]. Example 6-25 demon-
strates an IPv6 neighborship between two switches. Notice that when the IPv6 address is
configured and the user initiates a ping to either the IPv6 unicast address or the link-local
address of the remote peer, the IPv6 ND process is initiated, messages are exchanged,
and an IPv6 neighborship is formed.
NX-1
NX-1(config)# interface Eth4/1
NX-1(config-if)# ipv6 address 2002:10:12:1::1/64
NX-2
NX-2(config)# interface Eth4/13
NX-2(config-if)# ipv6 address 2002:10:12:1::2/64
NX-1
! IPv6 neighbor output after initiating ipv6 ping
NX-1# show ipv6 neighbor
To understand the whole process of neighbor discovery, use the Ethanalyzer tool.
Ethanalyzer is used not only to identify the process of IPv6 ND, but also to assist with
any ND issues. Example 6-26 displays the Ethanalyzer output when an ICMPv6 ping
is initiated to the peer device from NX-1. When the ping is initiated from NX-1, an NS
message is sent toward NX-2. The reply packet is an NA message received from NX-2
on NX-1. Notice that, as part of the NA message, the Router (rtr), Solicited (sol), and
Override (ovr) flags are set; the target address is set to 2002:10:12:1::2 and is reached at
MAC address 0002.0002.0012.
NX-1
NX-1# ethanalyzer local interface inband display-filter "ipv6" limit-captured-frames 0
Capturing on inband
2017-10-14 21:25:51.314297 2002:10:12:1::1 -> ff02::1:ff00:2 ICMPv6 86 Neighbor
Solicitation for 2002:10:12:1::2 from 00:01:00:01:00:12
4 2017-10-14 21:25:51.315476 2002:10:12:1::2 -> 2002:10:12:1::1 ICMPv6 86 Neighbor
Advertisement 2002:10:12:1::2 (rtr, sol, ovr) is at 00:02:00:02:00:12
2017-10-14 21:25:53.319291 2002:10:12:1::1 -> 2002:10:12:1::2 ICMPv6 118 Echo (ping)
request id=0x1eaf, seq=1, hop limit=255
2017-10-14 21:25:53.319620 2002:10:12:1::2 -> 2002:10:12:1::1 ICMPv6 118 Echo (ping)
reply id=0x1eaf, seq=1, hop limit=2 (request in 2580)
! Output omitted for brevity
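The Neighbor Solicitation in the capture is sent to ff02::1:ff00:2 rather than to the target itself because NS messages for address resolution go to the target's solicited-node multicast address. That address is formed from the prefix ff02::1:ff00:0/104 plus the low-order 24 bits of the target. The Python sketch below shows the calculation. The function name is chosen for this example only.

```python
import ipaddress


def solicited_node(address):
    """Return the solicited-node multicast address for an IPv6 unicast address."""
    low24 = int(ipaddress.IPv6Address(address)) & 0xFFFFFF
    prefix = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(prefix | low24))


print(solicited_node("2002:10:12:1::2"))  # ff02::1:ff00:2
```

If the destination of an NS in a capture does not match this value for the address being resolved, the solicitation was meant for a different target.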
NX-1
NX-1# show ipv6 nd interface ethernet 4/1
ICMPv6 ND Interfaces for VRF "default"
Ethernet4/1, Interface status: protocol-up/link-up/admin-up
IPv6 address:
2002:10:12:1::1/64 [VALID]
IPv6 link-local address: fe80::201:ff:fe01:12 [VALID]
ND mac-extract : Disabled
ICMPv6 active timers:
Last Neighbor-Solicitation sent: 00:07:38
Last Neighbor-Advertisement sent: 00:06:26
Last Router-Advertisement sent: 00:01:34
Next Router-Advertisement sent in: 00:06:50
Router-Advertisement parameters:
Periodic interval: 200 to 600 seconds
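The link-local address in this output can be derived from the interface MAC address seen in the Ethanalyzer capture (00:01:00:01:00:12) using the modified EUI-64 procedure: flip the universal/local bit of the first byte and insert ff:fe in the middle. The Python sketch below is a minimal illustration. The function name is an assumption, and it covers only automatically derived link-local addresses.

```python
import ipaddress


def link_local_from_mac(mac):
    """Build the EUI-64 based link-local address for a 48-bit MAC address."""
    digits = mac.replace(".", "").replace(":", "").replace("-", "")
    raw = bytearray.fromhex(digits)
    if len(raw) != 6:
        raise ValueError("expected a 48-bit MAC address")
    raw[0] ^= 0x02  # flip the universal/local bit
    interface_id = bytes(raw[:3]) + b"\xff\xfe" + bytes(raw[3:])
    # fe80::/64 prefix: fe80 followed by six zero bytes, then the interface ID.
    return str(ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id))


print(link_local_from_mac("0001.0001.0012"))  # fe80::201:ff:fe01:12
```

The result matches the link-local address reported for Ethernet4/1, which is a quick way to confirm that the address was auto-generated rather than manually configured.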
■ Manual: An IPv6 address can be manually configured using CLI or a graphical user
interface (GUI).
To enable the DHCPv6 relay agent, configure the global command ipv6 dhcp relay.
After enabling the DHCPv6 relay agent globally, it must be enabled on the client-facing
When the DHCPv6 relay agent is configured and messages are exchanged between the
client and the server, view the statistics for the relay agent using the command show ipv6
dhcp relay statistics. Example 6-28 displays the DHCPv6 relay statistics on module 7,
where the client is connected. Remember that when the client sends a DHCPv6 request, it
sends a DHCPv6 SOLICIT message to the router or switch connected to it. When received
by the first hop or the relay agent, these SOLICIT messages are relayed to the DHCPv6
server as RELAY-FORWARD messages. The server sends a RELAY-REPLY message to return a
response to the client if the original message from the client was relayed to the server in a
RELAY-FORWARD message. If any DHCPv6 packets are dropped, the output also shows those
drop counters.
--------------------------------------------------------------------------------
Relay Address        VRF name    Dest. Interface    Request    Response
--------------------------------------------------------------------------------
2001:10:12:1::1      ---         ---                7          4
DROPS:
------
DHCPv6 Relay is disabled : 0
Max hops exceeded : 0
Packet validation fails : 0
Unknown output interface : 0
Invalid VRF : 0
Option insertion failed : 0
Direct Replies (Recnfg/Adv/Reply) from server: 0
IPv6 addr not configured : 0
Interface error : 0
VPN Option Disabled : 0
IPv6 extn headers present : 0
Similar to the IPv4 DHCP relay agent, an ACL gets programmed in hardware for the
DHCPv6 relay. View the statistics using the command show system internal access-list
input statistics [module slot] (see Example 6-29).
INSTANCE 0x0
---------------
A DHCPv6 relay agent adds an interface identifier option in the upstream DHCPv6 mes-
sage (from client to server) to identify the interface on which the client is connected. The
DHCPv6 relay agent uses this information while forwarding the downstream DHCPv6
message to the DHCPv6 client.
This works fine when end hosts are directly connected to DHCPv6 relay agents. In
some network configurations, however, one or more Layer 2 devices reside between
DHCPv6 clients and the relay agent. In these scenarios, using the DHCPv6
relay agent Interface-ID option for client identification is difficult. A Layer 2 device
thus needs to append an Interface-ID option to DHCPv6 messages because it is
closer to the end hosts. Such a device is typically known as a Lightweight DHCPv6
Relay Agent (LDRA).
When clients do not have an IPv6 address or do not know the location of DHCPv6 servers,
the DHCPv6 client sends an Information-Request message to the reserved, link-scoped multicast
address ff02::1:2. The clients listen for DHCPv6 messages on UDP port 546. Servers and
relay agents listen for DHCPv6 messages on UDP port 547. LDRA checks whether the
incoming interface is L3 or L2. If it is L3, the packet is sent to the DHCPv6 Relay Agent;
if it is L2, the following checks are performed:
■ The LDRA feature checks whether LDRA is enabled or disabled on the incoming
interface or VLAN. If it is disabled, the packet is switched normally.
■ The incoming interface on which LDRA is enabled needs to be classified among the
following categories; if it is not one of these, LDRA drops the packet.
■ Client-facing trusted
■ Client-facing untrusted
■ Server-facing untrusted
■ ADVERTISE
■ REPLY
■ RECONFIGURE
■ RELAY-REPLY
■ If hop count is greater than the maximum allowed value, the packet is dropped.
■ If the packet passes all the validation checks, a new frame is created and relayed to
the server.
■ msg-type: RELAY-FORWARD
■ hop-count:
■ If the received message is not RELAY-FORWARD, hop-count is set to 0.
■ Otherwise, hop-count increases by 1.
■ link-address: Unspecified (::)
■ peer-address: Client’s link-local address (source IP address received in the incom-
ing frame’s IP header)
■ Interface-ID option: Fill in the Interface-ID details to identify the interface on
which the packet was received.
As previously shown, the link-address field must be set to the unspecified address (::). LDRA
includes the Interface-ID option and the Relay-Message option in Relay-Forward messages. All
other options are optional. LDRA uses the Interface-ID to denote both the switch and
the interface on which the packet is received. Interface-ID is an opaque value; the server
does not try to parse the contents of the Interface-ID option. LDRA creates a string with
the switch MAC address, along with the interface ifindex, and uses it as Interface-ID. If
the incoming message is a RELAY-FORWARD message and is received on a client-trusted
interface, then a Layer 2 or Layer 3 agent is already available in the network and precedes
the local relay agent.
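The RELAY-FORWARD construction described above can be sketched in Python as follows. The message type (12) and option codes (18 for Interface-ID, 9 for Relay-Message) come from the DHCPv6 specification. Everything else is an assumption for illustration: the function names, the max_hops parameter, the sample transaction ID, and the exact layout of the Interface-ID string, whose real format on NX-OS is not shown in this text.

```python
import ipaddress
import struct

RELAY_FORW = 12           # DHCPv6 RELAY-FORWARD message type
OPTION_RELAY_MSG = 9      # Relay-Message option code
OPTION_INTERFACE_ID = 18  # Interface-ID option code


def option(code, data):
    # DHCPv6 options are encoded as 2-byte code, 2-byte length, then data.
    return struct.pack("!HH", code, len(data)) + data


def ldra_relay_forward(message, peer_address, interface_id, max_hops):
    """Wrap a received DHCPv6 message in RELAY-FORWARD, or return None to drop."""
    if message[0] == RELAY_FORW:
        if message[1] > min(max_hops, 254):
            return None  # hop count above the allowed maximum
        hop_count = message[1] + 1
    else:
        hop_count = 0
    header = bytes([RELAY_FORW, hop_count])
    link_address = ipaddress.IPv6Address("::").packed  # unspecified address
    peer = ipaddress.IPv6Address(peer_address).packed  # client link-local
    return (header + link_address + peer
            + option(OPTION_INTERFACE_ID, interface_id)
            + option(OPTION_RELAY_MSG, message))


# SOLICIT (type 1) with a made-up 3-byte transaction ID.
solicit = bytes([1, 0xAA, 0xBB, 0xCC])
# Hypothetical Interface-ID string: switch MAC plus an ifindex.
interface_id = b"0002.0002.0012/0x1a300000"
frame = ldra_relay_forward(solicit, "fe80::201:ff:fe01:12", interface_id, max_hops=8)
print(frame[0], frame[1])  # 12 0
```

Passing the resulting frame back through the same function models a second relay in the path, where the hop count increases by 1 instead of being reset to 0.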
When the LDRA-enabled device receives a response from the server, the device performs
the following actions:
To enable LDRA globally, configure the command ipv6 dhcp-ldra. Then enable
LDRA on the L2 interface using the command ipv6 dhcp-ldra. LDRA also enables
you to specify the interface policy using the command ipv6 dhcp-ldra attach-policy
[client-facing-untrusted | client-facing-trusted | client-facing-disabled | server-facing-
trusted]. The policy options perform different actions:
■ Client-facing-trusted: Any other L2 or L3 relay agent that precedes this box in the
network is connected on this interface. The relay agent connected on this interface
should be between the actual client and this L2 relay agent (that is, on the upstream
network).
■ Server-facing-trusted: Any DHCP server or L3 relay agent that follows this box
toward the server end should be connected on this interface. If a relay agent is con-
nected on this interface, it should reside on the network path from this box toward
the actual DHCP server.
Note If any issues arise with DHCPv6 relay agents, collect the show tech dhcp command
output to be further analyzed by Cisco TAC.
Some of the IPv6 First Hop Security features are just as applicable as in IPv4. For
instance, some malware installed on a VM could send Router Advertisements to pretend
to be the default gateway for other VMs on the link. Note that although this scenario is
plausible, rogue Router Advertisements in the enterprise and the campus networks mainly
come from careless users. This problem is much less critical in the data center because
the VMs and servers are usually managed. Still, some VMs might be unmanaged, so the
careless user issue then becomes relevant. Table 6-2 lists common attacks in IPv6 net-
works and the associated FHS mitigation technique.
Note Other FHS techniques exist, but this chapter does not address them. Refer to the
Cisco.com documentation for more details.
RA Guard
RA Guard is a feature that enables the user of the Layer 2 switch to configure which
switch ports face routers. Router Advertisements received on any other port are dropped,
so they never reach the end hosts of the link. RA Guard performs further deep packet
inspection to validate the source of the RA, the prefix list, the preference, and any other
information carried. RA Guard is specified in RFC 6105. The goal of this feature is to
inspect Router Neighbor Discovery (ND) traffic (such as Router Solicitations [RS], Router
Advertisements [RA], and redirects) and to drop bogus messages. The feature introduces
the capability to block unauthorized messages based on policy configuration (for exam-
ple, RAs are not allowed on a Host port).
To enable IPv6 RA Guard, first an RA Guard policy is defined and then the policy is
applied on an interface. The RA Guard policy is defined using the command ipv6 nd
raguard policy policy-name. Table 6-3 displays all the options available as part of the
RA Guard policy.
When the policy is defined, it is applied to the interface using the interface-level configu-
ration command ipv6 nd raguard attach-policy policy-name. Example 6-30 displays the
sample RA Guard configuration. The command show ipv6 nd raguard policy policy-
name shows the RA Guard policy attached on different interfaces.
Note To debug any issues with IPv6 RA Guard, use the debug command debug ipv6
snooping raguard, which is captured in a debug logfile.
IPv6 Snooping
IPv6 Snooping is a combination of two features: ND Snooping and DHCPv6 Snooping.
IPv6 ND Snooping analyzes IPv6 neighbor discovery traffic and determines whether
it is harmless for nodes on the link. During this inspection, it gleans address bindings
(IP, MAC, port) when available and stores them in a binding table. The binding entry is
then used to determine address ownership, in case of contention between two clients.
IPv6 DHCP Snooping traps DHCPv6 packets between the client and the server. From
the packets snooped, assigned addresses are learned and stored in the binding table.
The IPv6 Snooping feature can also limit the number of addresses that any node on the
link can claim. This helps protect the switch binding table against DoS flooding attacks.
Figure 6-6 explains the role of IPv6 snooping and shows how it protects the device from
invalid or unwanted hosts.
[Figure 6-6: IPv6 snooping. The switch monitors NDP and DHCP messages between end nodes A, B, and C and enforces address ownership. The binding table maps A to 2001::1 and B to 2001::2, and an NA with target address 2001::2 is shown as disallowed.]
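The ownership and address-limit behavior can be modeled with a small table. The Python sketch below is a simplified illustration in which the first node to claim an address becomes its owner and each node may hold a limited number of addresses. The class name, the per-node limit parameter, and the use of single letters as node identifiers are assumptions for the example.

```python
class SnoopingTable:
    """Toy IPv6 snooping table: the first claimant owns an address."""

    def __init__(self, max_addresses_per_node):
        self.limit = max_addresses_per_node
        self.owner = {}  # IPv6 address -> node identifier

    def claim(self, address, node):
        current = self.owner.get(address)
        if current is not None:
            # Contention: only the recorded owner may keep using the address.
            return current == node
        if sum(1 for n in self.owner.values() if n == node) >= self.limit:
            # The node already holds its quota; refuse to grow the table.
            return False
        self.owner[address] = node
        return True


table = SnoopingTable(max_addresses_per_node=2)
print(table.claim("2001::1", "A"))  # True
print(table.claim("2001::2", "B"))  # True
print(table.claim("2001::2", "C"))  # False, B already owns 2001::2
```

The last call corresponds to the disallowed Neighbor Advertisement in the figure: node C claims an address that the table already binds to node B.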
The IPv6 snooping policy is configured using the command ipv6 snooping policy
policy-name. Within the IPv6 snooping policy, you can specify various options, as in
Table 6-4.
guard: Works like inspect, but also drops IPv6, ND, RA,
and IPv6 DHCP server packets in case of a threat.
When the policy is defined, it can be attached using the command ipv6 snooping attach-
policy policy-name under the vlan configuration vlan-id subconfiguration mode.
Example 6-31 displays the configuration of IPv6 snooping policy for VLAN 100.
Similar to other FHS features, IPv6 snooping programs an ACL in the hardware, which
is verified using the command show system internal access-list interface interface-id
[module slot]. The command show system internal access-list input statistics [module
slot] shows the statistics (see Example 6-32).
INSTANCE 0x0
---------------
Entries:
[Index] Entry [Stats]
---------------------
[0058:000e:000e] prec 1 redirect(0x0) icmp 0x0/0 0x0/0 137 0 flow-label 35072 [0]
[0059:000f:000f] prec 1 redirect(0x0) icmp 0x0/0 0x0/0 136 0 flow-label 34816 [0]
[005a:0010:0010] prec 1 redirect(0x0) icmp 0x0/0 0x0/0 135 0 flow-label 34560 [0]
[005b:0011:0011] prec 1 redirect(0x0) icmp 0x0/0 0x0/0 134 0 flow-label 34304 [0]
[005c:0012:0012] prec 1 redirect(0x0) icmp 0x0/0 0x0/0 133 0 flow-label 34048 [0]
[005d:0013:0013] prec 1 redirect(0x0) udp 0x0/0 0x0/0 eq 547 flow-label 547 [0]
[005e:0014:0014] prec 1 redirect(0x0) udp 0x0/0 0x0/0 eq 546 flow-label 546 [0]
[005f:0015:0015] prec 1 redirect(0x0) udp 0x0/0 eq 547 0x0/0 flow-label 196608 [0]
[0060:0016:0016] prec 1 permit ip 0x0/0 0x0/0 [0]
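The numeric values in these entries are ICMPv6 type numbers and UDP ports. Types 133 through 137 are the five Neighbor Discovery messages, and ports 546 and 547 belong to DHCPv6. The Python sketch below labels an entry accordingly. The regular expressions assume the entry layout shown in the example output, and the function name is hypothetical.

```python
import re

ND_TYPES = {
    133: "Router Solicitation",
    134: "Router Advertisement",
    135: "Neighbor Solicitation",
    136: "Neighbor Advertisement",
    137: "Redirect",
}
DHCPV6_PORTS = {546: "DHCPv6 client port", 547: "DHCPv6 server and relay port"}


def describe(entry):
    """Return a human-readable label for one hardware ACL entry."""
    icmp = re.search(r"\bicmp\s+\S+\s+\S+\s+(\d+)\b", entry)
    if icmp:
        return ND_TYPES.get(int(icmp.group(1)), "other ICMPv6")
    udp = re.search(r"\budp\b.*?\beq\s+(\d+)\b", entry)
    if udp:
        return DHCPV6_PORTS.get(int(udp.group(1)), "other UDP")
    return "not an ND or DHCPv6 entry"


print(describe("prec 1 redirect(0x0) icmp 0x0/0 0x0/0 137 0 flow-label 35072"))
print(describe("prec 1 redirect(0x0) udp 0x0/0 0x0/0 eq 547 flow-label 547"))
print(describe("prec 1 permit ip 0x0/0 0x0/0"))
```

Reading the ACL this way shows that the redirect entries cover exactly the ND and DHCPv6 traffic that IPv6 snooping needs to inspect, while the final permit entry lets all other IPv6 traffic pass.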
DHCPv6 Guard
The main purpose of the DHCPv6 Guard feature is to block DHCPv6 replies or advertisements
that do not come from a legitimate DHCPv6 server or relay agent. Based on the
deployed configuration, it decides whether to bridge, switch, or block these messages. It also
verifies information found in the message, such as whether the addresses and prefixes in
the message are in the specified range. The device can be configured in client or server
mode, which protects the clients from receiving replies from rogue DHCP servers.
The default mode of the box is to guard, so by default, all ports configured with
DHCPv6 Guard are in client mode. Thus, all ports drop any DHCPv6 server messages by
default. For a meaningful DHCPv6 deployment, at least one port should be assigned to
the dhcp-server role, which then permits DHCPv6 server messages. This is the simplest
configuration for a reasonable level of security.
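The default behavior described above reduces to a simple rule: server-originated messages are accepted only on ports assigned the server role. The Python sketch below models that rule. The function name and the role strings are assumptions for illustration, and checks such as address ranges and server preference are not modeled.

```python
# DHCPv6 message types that originate from servers or upstream relay agents.
SERVER_MESSAGES = {"ADVERTISE", "REPLY", "RECONFIGURE", "RELAY-REPLY"}


def guard_permits(message_type, port_role="client"):
    """Return True if a DHCPv6 message is accepted on a port with this role."""
    if message_type.upper() in SERVER_MESSAGES:
        # Server messages are honored only on ports in the server role.
        return port_role == "server"
    # Client messages such as SOLICIT are not blocked by this rule.
    return True


print(guard_permits("ADVERTISE"))            # False
print(guard_permits("ADVERTISE", "server"))  # True
print(guard_permits("SOLICIT"))              # True
```

The first call illustrates the default state: with no port assigned the server role, every DHCPv6 server message is dropped, so no client can complete an exchange.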
When DHCP Guard is enabled on an interface or VLAN, ACLs are programmed in the
hardware with ACL action to punt DHCP packets to the Supervisor. The ACL has the fol-
lowing filter:
■ The source port should be either the DHCP client port (port 546) or the server port
(port 547).
■ The destination port should be either the DHCP client port (port 546) or the server
port (port 547).
The ACL can be verified on the line card or hardware using the command show system
internal access-list input interface interface-id module slot.
When both DHCPv6 Guard and DHCPv6 relay features are configured on the same
device, DHCPv6 request packets received by FHS are first handled by the DHCPv6
Guard feature. After all processing is done by the DHCPv6 Guard process, the packet is
given to DHCPv6 Relay feature, which relays it to the specified server.
Similarly, the DHCP relay agent first processes DHCP reply packets from another relay
agent or DHCP server. If the enclosed packet is not a relay packet (Relay forward or Relay
reply), it is passed on to DHCP Guard feature.
Note DHCP Guard and the DHCP relay agent essentially work together only at the first
hop. In later hops, DHCP relay agent is given priority over DHCP Guard. Statistics are
maintained separately for both features.
To configure DHCPv6 Guard policy, use the command ipv6 dhcp guard policy policy-
name. Under the policy, the first step is to define the device role, which is client, server,
or monitor. Then you define the advertised minimum and maximum allowed server pref-
erence. You can also specify whether the device is connected on a trusted port using the
trusted port command option. After configuring the policy, use the command ipv6 dhcp
guard attach-policy policy-name to attach the policy to a port or a VLAN.
Example 6-33 shows the configuration of DHCPv6 Guard. To check the policy configu-
ration, use the command show ipv6 dhcp guard policy or use the command show ipv6
snooping policies to verify both the IPv6 snooping and the DHCPv6 guard policies;
DHCPv6 Guard works in conjunction with IPv6 snooping.
This section explains how those FHRP protocols work and details how to troubleshoot
them.
HSRP
Defined in RFC 2281, Hot Standby Router Protocol (HSRP) provides transparent
failover of the first-hop device, which typically acts as a gateway to the hosts. HSRP
provides routing redundancy for IP hosts on Ethernet networks configured with a default
gateway IP address. HSRP requires a minimum of two devices; one device
acts as the active device and takes care of forwarding the packets, and the other acts as a
standby, ready to take over the role of active device in case of any failure.
HSRP-enabled interfaces send and receive multicast UDP-based hello messages to detect
any failure and designate active and standby routers. If the standby device does not
receive a hello message from the active device before the hold time expires, the standby
device, which holds the second-highest priority, becomes HSRP active. The transition of
HSRP active between the devices is transparent to all hosts on the segment.
HSRP supports two versions: version 1 and version 2. Table 6-5 includes some of the dif-
ferences between HSRP versions.
Note Transitioning from HSRP version 1 to version 2 can be disruptive, given the change
in MAC address between both versions.
When the HSRP is configured on the segment and both the active and standby devices
are chosen, the HSRP control packets contain the following fields:
■ Source MAC: Virtual MAC of the active device or the interface MAC of the standby
or listener device
To understand the functioning of HSRP, examine the topology in Figure 6-7. Here, HSRP
is running on VLAN 10.
[Figure 6-7: HSRP topology. Devices: NX-1, NX-2, NX-3, and Host; interface labels E4/1 and E4/13; VLAN 10.]
To enable HSRP, use the command feature hsrp. When configured, HSRP runs version 1 by
default. To manually change the HSRP version, use the command hsrp version
[1 | 2] under the interface where HSRP is configured.
Example 6-34 illustrates the configuration of HSRP for VLAN 10. In this example, HSRP
is configured with the group number 10 and a VIP of 10.12.1.1. NX-1 is set to a priority
of 110, which means that NX-1 acts as the active HSRP gateway. HSRP is also configured
with preemption; if NX-1 fails and the HSRP active role fails over to NX-2, NX-1
regains the active role when it becomes available again.
NX-1
interface Vlan10
no shutdown
no ip redirects
ip address 10.12.1.2/24
hsrp version 2
hsrp 10
preempt
priority 110
ip 10.12.1.1
NX-2
interface Vlan10
no shutdown
no ip redirects
ip address 10.12.1.3/24
hsrp version 2
hsrp 10
ip 10.12.1.1
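The effect of the priority values in Example 6-34 can be sketched in a few lines of Python. This is only an illustration, not NX-OS code: the function name elect_active and the tuple layout are hypothetical, the default priority of 100 is the value NX-2 reports in the show hsrp detail output, and the tie-break on the highest interface IP address reflects standard HSRP behavior rather than anything shown in this example.

```python
import ipaddress

DEFAULT_PRIORITY = 100  # NX-2 has no priority configured and reports 100


def elect_active(routers):
    """Return the name of the router that wins the HSRP active role.

    routers: list of (name, priority, interface_ip) tuples (hypothetical layout).
    Highest priority wins; the highest interface IP address breaks a tie.
    """
    winner = max(
        routers,
        key=lambda r: (r[1], int(ipaddress.ip_address(r[2]))),
    )
    return winner[0]


group_10 = [
    ("NX-1", 110, "10.12.1.2"),
    ("NX-2", DEFAULT_PRIORITY, "10.12.1.3"),
]
print(elect_active(group_10))  # NX-1
```

The sketch models a fresh election only. Without preempt, a router that is already active keeps the role even when a higher-priority peer comes online.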
To view the status of HSRP groups and determine which device is acting as an active or
standby HSRP device, use the command show hsrp brief. This command displays the
group information, the priority of the local device, the active and standby HSRP inter-
face address, and also the group address, which is the HSRP VIP. You can also use the
command show hsrp [detail] to view more details about the HSRP groups. This com-
mand not only details information about the HSRP group, but it also lists the timeline of
the state machine a group goes through. This command is useful when troubleshooting
any HSRP finite state machine issues. The show hsrp [detail] command also displays
any authentication configured for the group, along with the virtual IP (VIP) and virtual
MAC address for the group. Example 6-35 displays both the show hsrp brief and show
hsrp detail command outputs. One important point to note in the following output is
that if no authentication is configured, the show hsrp detail command displays it as
Authentication text “cisco”.
NX-1
NX-1# show hsrp brief
*:IPv6 group   #:group belongs to a bundle
                     P indicates configured to preempt.
                     |
 Interface   Grp  Prio P State    Active addr      Standby addr     Group addr
  Vlan10      10   110  P Active   local            10.12.1.3        10.12.1.1    (conf)
NX-1# show hsrp detail
Vlan10 - Group 10 (HSRP-V2) (IPv4)
Local state is Active, priority 110 (Cfged 110), may preempt
Forwarding threshold(for vPC), lower: 1 upper: 110
Hellotime 3 sec, holdtime 10 sec
Next hello sent in 0.951000 sec(s)
Virtual IP address is 10.12.1.1 (Cfged)
Active router is local
Standby router is 10.12.1.3 , priority 100 expires in 9.721000 sec(s)
Authentication text "cisco"
Virtual mac address is 0000.0c9f.f00a (Default MAC)
When HSRP is configured on an interface, the interface automatically joins the HSRP
multicast group based on the HSRP version. This information is viewed using the com-
mand show ip interface interface-id. This command does not provide information on the
HSRP virtual IP on the interface. To view the virtual IP along with the HSRP multicast
group, use the command show ip interface interface-id vaddr. Example 6-36 displays
the output of both commands.
The active HSRP gateway device also populates the ARP table with the virtual IP and
the virtual MAC address, as in Example 6-37. Notice that the virtual IP 10.12.1.1 maps to
MAC address 0000.0c9f.f00a, which is the virtual MAC of group 10.
Example 6-37 HSRP Virtual MAC and Virtual IP Address in ARP Table
IP ARP Table
Total number of entries: 1
Address Age MAC Address Interface
10.12.1.1 - 0000.0c9f.f00a Vlan10
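The relationship between the group number and the default virtual MAC can be sketched as follows. The function name hsrp_virtual_mac is hypothetical. The version 2 base address matches the 0000.0c9f.f00a entry for group 10 above, and the IPv6 base matches the 0005.73a0.0014 address shown for group 20 in Example 6-40; the version 1 base (0000.0c07.acXX) comes from general HSRP behavior and is not shown in this section.

```python
def hsrp_virtual_mac(group, version=2, ipv6=False):
    """Return the default HSRP virtual MAC in Cisco dotted notation."""
    if ipv6:
        base, limit = 0x000573A00000, 4095  # HSRP for IPv6 (version 2 only)
    elif version == 2:
        base, limit = 0x00000C9FF000, 4095  # HSRP version 2, IPv4
    else:
        base, limit = 0x00000C07AC00, 255   # HSRP version 1, IPv4
    if not 0 <= group <= limit:
        raise ValueError(f"group {group} is outside the range 0-{limit}")
    digits = f"{base + group:012x}"
    return ".".join(digits[i:i + 4] for i in range(0, 12, 4))


print(hsrp_virtual_mac(10))             # 0000.0c9f.f00a
print(hsrp_virtual_mac(10, version=1))  # 0000.0c07.ac0a
```

Comparing the two printed values shows why moving a group from version 1 to version 2 changes the gateway MAC address that hosts have cached.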
If the HSRP is down or flapping between the two devices, or if the HSRP has not estab-
lished the proper states between the two devices (for example, both devices are showing
in Active/Active state), it might be worth enabling packet capture or running a debug to
investigate whether the HSRP hello packets are making it to the other end or whether
they are being generated locally on the switch. Because the HSRP control packets are
destined for the CPU, use Ethanalyzer to capture those packets. The display-filter of hsrp
helps capture HSRP control packets and determine whether any are not being received.
Along with Ethanalyzer, you can enable HSRP debug to see whether the hello packet
is being received. The HSRP debug for the hello packet is enabled using the command
debug hsrp engine packet hello interface interface-id group group-number. The com-
mand displays the hello packet from and to the peer, along with other information such as
authentication, hello, and the hold timer.
Example 6-38 displays the Ethanalyzer and HSRP debug for capturing hello packets.
Note that HSRP version 2 assigns a 6-byte ID to identify the sender of the HSRP hello
packet, which is usually the interface MAC address.
NX-2
NX-2# ethanalyzer local interface inband display-filter hsrp limit-captured-frames 0
Capturing on inband
1 2017-10-21 07:45:18.646334 10.12.1.2 -> 224.0.0.102 HSRPv2 94 Hello (state
Active)
2 2017-10-21 07:45:18.915261 10.12.1.3 -> 224.0.0.102 HSRPv2 94 Hello (state
Standby)
2 2017-10-21 07:45:21.503535 10.12.1.2 -> 224.0.0.102 HSRPv2 94 Hello (state
Active)
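When a capture is long, it can help to summarize it offline. The following is a minimal sketch, assuming the brief Ethanalyzer lines have been saved as text with each packet on a single line, as in the sample string. The function name hello_gaps and the regular expression are illustrative assumptions, and the 10-second hold time is the default shown in the show hsrp detail output.

```python
import re
from collections import defaultdict
from datetime import datetime

HOLD_TIME = 10.0  # seconds, the holdtime shown in show hsrp detail

# Timestamp, then source address, then "-> destination HSRPv2".
LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+(\S+)\s+->\s+\S+\s+HSRPv2"
)


def hello_gaps(capture):
    """Map each sender to the largest gap, in seconds, between its hellos."""
    seen = defaultdict(list)
    for line in capture.splitlines():
        match = LINE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
            seen[match.group(2)].append(stamp)
    return {
        sender: max(
            ((b - a).total_seconds() for a, b in zip(times, times[1:])),
            default=0.0,  # a single hello has no gap to measure
        )
        for sender, times in seen.items()
    }


# The three hellos from the capture above, joined onto single lines.
sample = """\
1 2017-10-21 07:45:18.646334 10.12.1.2 -> 224.0.0.102 HSRPv2 94 Hello (state Active)
2 2017-10-21 07:45:18.915261 10.12.1.3 -> 224.0.0.102 HSRPv2 94 Hello (state Standby)
2 2017-10-21 07:45:21.503535 10.12.1.2 -> 224.0.0.102 HSRPv2 94 Hello (state Active)
"""
for sender, gap in hello_gaps(sample).items():
    status = "ok" if gap < HOLD_TIME else "exceeds hold time"
    print(f"{sender}: largest gap {gap:.3f}s ({status})")
```

A sender whose largest gap approaches or exceeds the hold time is a candidate for the flapping or Active/Active symptoms described earlier.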
One of the most common problems with HSRP is the group remaining in down state. This
can happen for the following reasons:
Thus, while troubleshooting any HSRP group down-state issues, these points should all
be checked.
HSRPv6
HSRP for IPv6 (HSRPv6) provides the same functionality to IPv6 hosts as HSRP for IPv4.
An HSRP IPv6 group has a virtual MAC address that is derived from the HSRP group
number and has a virtual IPv6 link-local address that is, by default, derived from the
HSRP virtual MAC address. When the HSRPv6 group is active, periodic RA messages are
sent for the HSRP virtual IPv6 link-local address. These RA messages stop after a final
RA is sent, when the group leaves the active state (moves to standby state).
HSRPv6 has a different MAC address range and UDP port than HSRP for IPv4. Consider
some of these values:
■ HSRP version 2
No separate feature is required to enable HSRPv6. Feature hsrp enables HSRP for both
IPv4 and IPv6 address families. Example 6-39 illustrates the configuration of HSRPv6
between NX-1 and NX-2 on VLAN 10. In this example, NX-2 is set with a priority of
110, which means NX-2 acts as the active switch and NX-1 acts as the standby. In this
example, a virtual IPv6 address is defined using the command ip ipv6-address, but this
virtual IPv6 address is a secondary virtual IP address. The primary virtual IPv6 address is
automatically assigned for the group.
NX-1
interface Vlan10
no shutdown
no ipv6 redirects
ipv6 address 2001:db8::2/48
hsrp version 2
hsrp 20 ipv6
ip 2001:db8::1
NX-2
interface Vlan10
no shutdown
no ipv6 redirects
ipv6 address 2001:db8::3/48
hsrp version 2
hsrp 20 ipv6
preempt
priority 110
ip 2001:db8::1
Similar to IPv4, HSRPv6 group information is viewed using the command show hsrp
[group group-number] [detail]. The command displays information related to the current
state of the device, priority, the primary and secondary virtual IPv6 address, the virtual
MAC address, and the state history for the group. Example 6-40 displays the detailed
output of HSRP group 20 configured on VLAN 10. Notice that the virtual IPv6 address
is calculated based on the virtual MAC address assigned for the group. The configured
virtual IPv6 address is under the secondary VIP list.
NX-2
NX-2# show hsrp group 20 detail
Vlan10 - Group 20 (HSRP-V2) (IPv6)
Local state is Active, priority 110 (Cfged 110), may preempt
Forwarding threshold(for vPC), lower: 1 upper: 110
Hellotime 3 sec, holdtime 10 sec
Next hello sent in 1.621000 sec(s)
Virtual IP address is fe80::5:73ff:fea0:14 (Implicit)
Active router is local
Standby router is fe80::5287:89ff:fe40:2042 , priority 100 expires in 9.060000
sec(s)
Authentication text "cisco"
Virtual mac address is 0005.73a0.0014 (Default MAC)
2 state changes, last state change 00:02:40
IP redundancy name is hsrp-Vlan10-20-V6 (default)
Secondary VIP(s):
2001:db8::1
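The implicit virtual IP in this output can be reproduced from the virtual MAC. The sketch below is an inference from the two values in Example 6-40 (0005.73a0.0014 and fe80::5:73ff:fea0:14): the MAC is split in half, ff:fe is inserted in the middle, and the result is appended to the fe80::/64 link-local prefix. Note that the addresses in the example imply that no bit of the first byte is inverted. The function name hsrpv6_link_local is hypothetical.

```python
import ipaddress


def hsrpv6_link_local(virtual_mac):
    """Derive the implicit HSRPv6 link-local VIP from a dotted virtual MAC."""
    mac = bytes.fromhex(virtual_mac.replace(".", ""))
    if len(mac) != 6:
        raise ValueError("expected a 48-bit MAC address")
    # Insert ff:fe between the two MAC halves; no bit is inverted here,
    # which is what the addresses in Example 6-40 imply.
    interface_id = mac[:3] + b"\xff\xfe" + mac[3:]
    prefix = bytes.fromhex("fe80") + bytes(6)  # fe80::/64
    return str(ipaddress.IPv6Address(prefix + interface_id))


print(hsrpv6_link_local("0005.73a0.0014"))  # fe80::5:73ff:fea0:14
```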
HSRPv6 does not come up unless the virtual IPv6 address is configured and assigned on the
interface. This information is verified using the command show ipv6 interface interface-id.
In addition, the virtual IPv6 address and virtual MAC addresses must be added to ICMPv6.
This information is validated using the command show ipv6 icmp vaddr [link-local | global].
The keyword link-local displays the primary virtual IPv6 address, which is automatically
calculated using the virtual MAC. The keyword global displays the manually configured
virtual IPv6 address. Example 6-41 examines the output of both these commands.
NX-2
NX-2# show ipv6 interface vlan 10
IPv6 Interface Status for VRF "default"(1)
Vlan10, Interface status: protocol-up/link-up/admin-up, iod: 121
IPv6 address:
2001:db8::3/48 [VALID]
IPv6 subnet: 2001:db8::/48
IPv6 link-local address: fe80::e6c7:22ff:fe1e:9642 (default) [VALID]
IPv6 virtual addresses configured:
fe80::5:73ff:fea0:14 2001:db8::1
IPv6 multicast routing: disabled
! Output omitted for brevity
NX-2# show ipv6 icmp vaddr link-local
Virtual IPv6 addresses exists:
Interface: Vlan10, context_name: default (1)
Group id: 20, Protocol: HSRP, Client UUID: 0x196, Active: Yes (1) client_state:1
Virtual IPv6 address: fe80::5:73ff:fea0:14
Virtual MAC: 0005.73a0.0014, context_name: default (1)
For flapping HSRPv6 neighbors, the same Ethanalyzer display filter can be used as for IPv4.
Example 6-42 displays the Ethanalyzer output for HSRPv6 control packets, showing
packets from both HSRP active and standby switches.
NX-2
NX-2# ethanalyzer local interface inband display-filter hsrp limit-captured-frames 0
Capturing on inband
20:32:29.596977 fe80::5287:89ff:fe40:2042 -> ff02::66 HSRPv2 114 Hello (state
Standby)
20:32:29.673860 fe80::e6c7:22ff:fe1e:9642 -> ff02::66 HSRPv2 114 Hello (state
Active)
20:32:32.307507 fe80::5287:89ff:fe40:2042 -> ff02::66 HSRPv2 114 Hello (state
Standby)
20:32:32.333125 fe80::e6c7:22ff:fe1e:9642 -> ff02::66 HSRPv2 114 Hello (state
Active)
Note For any failure or problem with HSRP or HSRPv6, collect the show tech hsrp
output while the device is in the problematic state.
VRRP
Virtual Router Redundancy Protocol (VRRP) was initially defined in RFC 2338, which
defines version 1. RFC 3768 and RFC 5798 define version 2 and version 3, respectively.
NX-OS supports only VRRP version 2 and version 3. VRRP works on a concept similar to
that of HSRP. VRRP provides box-to-box redundancy by enabling multiple devices to elect
a member as a VRRP master that assumes the role of default gateway, thus eliminating a
single point of failure. The nonmaster members of the VRRP group take the
role of backup. If the VRRP master fails, a VRRP backup assumes the role of VRRP
master and acts as the default gateway.
VRRP is enabled using the command feature vrrp. VRRP configuration is similar to that of
HSRP. VRRP is configured using the command vrrp group-number. Under the interface
VRRP configuration mode, network operators can define the virtual IP, priority, authenti-
cation, and so on. A no shutdown is necessary under the vrrp configuration to enable the
VRRP group. Example 6-43 displays the VRRP configuration between NX-1 and NX-2.
NX-1
interface Vlan10
no shutdown
no ip redirects
ip address 10.12.1.2/24
vrrp 10
priority 110
authentication text cisco
address 10.12.1.1
no shutdown
NX-2
interface Vlan10
no shutdown
no ip redirects
ip address 10.12.1.3/24
vrrp 10
authentication text cisco
address 10.12.1.1
no shutdown
To verify the VRRP state, use the command show vrrp [master | backup]. The master and
backup options display information on the respective nodes. The show vrrp [detail] com-
mand output is used to gather more details about the VRRP. Example 6-44 displays the
detailed VRRP output, as well as VRRP state information. Notice that, in this example,
the command show vrrp detail output displays the virtual IP as well as the virtual MAC
address. The VRRP virtual MAC address is of the format 0000.5e00.01xy, where xy is
the hex representation of the group number.
NX-1
NX-1# show vrrp master
Interface VR IpVersion Pri Time Pre State VR IP addr
---------------------------------------------------------------
Vlan10 10 IPV4 110 1 s Y Master 10.12.1.1
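The 0000.5e00.01xy format described above can be checked quickly for any group number. This is a minimal sketch for IPv4 VRRP groups; the function name vrrp_virtual_mac is hypothetical, and the 1 to 255 range is the standard VRRP group (VRID) range rather than a value taken from this example.

```python
def vrrp_virtual_mac(group):
    """Return the IPv4 VRRP virtual MAC for a group, in dotted notation."""
    if not 1 <= group <= 255:
        raise ValueError("VRRP group numbers range from 1 to 255")
    # The last byte is the group number in hexadecimal.
    return f"0000.5e00.01{group:02x}"


print(vrrp_virtual_mac(10))  # 0000.5e00.010a
```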
For any VRRP flapping issues, use the command show vrrp statistics to determine
whether the flapping is the result of some kind of error or a packet being wrongly
received. The command displays the number of times the device has become a master,
along with other error statistics such as TTL errors, invalid packet length, and a mismatch
in address list. Example 6-45 displays the output of the show vrrp statistics command. Notice
that NX-1 recorded five authentication failures for group 10.
NX-1
NX-1# show vrrp statistics
VRRP version 2 supports only the IPv4 address family, but VRRP version 3
(VRRPv3) supports both the IPv4 and IPv6 address families. On NX-OS, VRRP
and VRRPv3 cannot both be enabled on the same device. If the feature VRRP is already
enabled on the Nexus switch, enabling the feature VRRPv3 displays an error stating
that VRRPv2 is already enabled. Thus, a migration must be performed from VRRP to
VRRPv3, which has minimal impact on the services. Refer to the following steps to per-
form the migration from VRRP version 2 to version 3.
Step 1. Disable the feature VRRP using the command no feature vrrp.
Step 2. Enable the feature VRRPv3 using the command feature vrrpv3.
Step 3. Under the interface, configure the VRRPv3 group using the command vrrpv3
group-number address-family [ipv4 | ipv6].
Step 4. Use the address command to define the VRRPv3 primary and secondary
virtual IP.
Step 5. Use the command vrrpv2 to enable backward compatibility with VRRP version 2.
This helps in exchanging state information with other VRRP version 2 devices.
NX-1
NX-1(config)# feature vrrpv3
Cannot enable VRRPv3: VRRPv2 is already enabled
The command show vrrpv3 [brief | detail] verifies the information of the VRRPv3
groups. The show vrrpv3 brief command option displays the brief information related
to the group, such as group number, address family, priority, preemption, state, master
address, and group address (which is the virtual group IP). The show vrrpv3 detail
command displays additional information, such as advertisements sent and received for
both VRRPv2 and VRRPv3, virtual MAC address, and other statistics related to errors
and transition states. Example 6-47 displays both the brief and detailed command
output of show vrrpv3.
NX-1
NX-1# show vrrpv3 brief
Interface Grp A-F Pri Time Own Pre State Master addr/Group addr
Vlan10 10 IPv4 100 0 N Y MASTER 10.12.1.2(local) 10.12.1.1
You can also use the show vrrpv3 statistics command output to view the error statistics.
This command displays counters for packets dropped for various reasons, such as
invalid TTL, invalid checksum, or invalid message type. The second
half of the output is similar to the show vrrpv3 detail output. Example 6-48 displays the
output of the command show vrrpv3 statistics.
NX-1
NX-1# show vrrpv3 statistics
Note For any failure or issues with VRRP, collect the output from the commands show
tech vrrp [brief] or show tech vrrpv3 [detail] during problematic state for further investi-
gation by Cisco TAC.
GLBP
As the name suggests, Gateway Load-Balancing Protocol (GLBP) provides gateway redun-
dancy and load balancing to the network segment. It provides redundancy with an active/
standby gateway and supplies load balancing by ensuring that each member of the GLBP
group takes care of forwarding the traffic to the appropriate gateway. GLBP is enabled
on NX-OS using the command feature glbp. When defining a GLBP group, the following
parameters can be configured:
■ Initial weighting value, upper and lower threshold values for a secondary gateway to
become AVG
■ Gateway load-balancing method
■ Interface tracking
■ Round-robin: Each virtual forwarder MAC address is used to sequentially reply for
the virtual IP address.
■ Weighted: Weights are determined for each device in the GLBP group, to define the
ratio of load balancing between the devices.
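The difference between the two methods above can be sketched with two small generators. This is an illustration only: the function names, the 100 and 50 weights, and the credit scheme used for the weighted case are assumptions and do not describe the NX-OS implementation. The forwarder MAC addresses are the ones that appear for group 10 in Example 6-50.

```python
from itertools import cycle, islice


def round_robin(forwarder_macs):
    """Yield forwarder MACs in turn, one per reply for the virtual IP."""
    return cycle(forwarder_macs)


def weighted(forwarder_weights):
    """Yield forwarder MACs in proportion to their weights.

    forwarder_weights: dict of MAC -> positive integer weight.
    Uses a simple credit scheme (an assumption, not the NX-OS algorithm).
    """
    credit = {mac: 0 for mac in forwarder_weights}
    total = sum(forwarder_weights.values())
    while True:
        for mac, weight in forwarder_weights.items():
            credit[mac] += weight
        chosen = max(credit, key=credit.get)
        credit[chosen] -= total
        yield chosen


macs = ["0007.b400.0a01", "0007.b400.0a02"]
print(list(islice(round_robin(macs), 4)))
print(list(islice(weighted({macs[0]: 100, macs[1]: 50}), 6)))
```

With weights of 100 and 50, the first forwarder is handed out twice as often as the second over any six consecutive replies.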
Example 6-49 displays the GLBP configuration between NX-1 and NX-2.
NX-1
NX-1(config)# interface vlan 10
NX-1(config-if)# glbp 10
NX-1(config-if-glbp)# timers 1 4
NX-1(config-if-glbp)# priority 110
NX-1(config-if-glbp)# preempt
NX-1(config-if-glbp)# load-balancing ?
host-dependent Load balance equally, source MAC determines forwarder choice
round-robin Load balance equally using each forwarder in turn
weighted Load balance in proportion to forwarder weighting
Similar to HSRP version 2, GLBP communicates its hello packets over the multicast address
224.0.0.102. However, it uses the UDP source and destination port number of 3222.
To view the details of the GLBP group, use the command show glbp [brief]. The com-
mand displays the configured virtual IP, the group state, and all the other information
related to the group. The command output also displays information regarding the for-
warders, their MAC address, and their IP addresses. Example 6-50 examines the output
of both the command show glbp and the command show glbp brief, displaying the infor-
mation of the GLBP group 10 along with its forwarder information and their states.
Example 6-50 show glbp and show glbp brief Command Output
NX-1
NX-1# show glbp
Vlan10 - Group 10
State is Active
4 state change(s), last state change(s) 00:01:54
Virtual IP address is 10.12.1.1
Hello time 1 sec, hold time 4 sec
Next hello sent in 990 msec
Redirect time 600 sec, forwarder time-out 14400 sec
Preemption enabled, min delay 0 sec
Active is local
Standby is 10.12.1.3, priority 100 (expires in 3.905 sec)
Priority 110 (configured)
Weighting 100 (default 100), thresholds: lower 1, upper 100
Load balancing: host-dependent
Group members:
5087.8940.2042 (10.12.1.2) local
E4C7.221E.9642 (10.12.1.3)
There are 2 forwarders (1 active)
Forwarder 1
State is Active
2 state change(s), last state change 00:01:50
MAC address is 0007.B400.0A01 (default)
Owner ID is 5087.8940.2042
Preemption enabled, min delay 30 sec
Active is local, weighting 100
Forwarder 2
State is Listen
1 state change(s), last state change 00:00:40
MAC address is 0007.B400.0A02 (learnt)
Owner ID is E4C7.221E.9642
Redirection enabled, 599.905 sec remaining (maximum 600 sec)
Time to live: 14399.905 sec (maximum 14400 sec)
Preemption enabled, min delay 30 sec
Active is 10.12.1.3 (primary), weighting 100 (expires in 3.905 sec)
NX-1# show glbp brief
Interface Grp Fwd Pri State Address Active rtr Standby rtr
Vlan10 10 - 110 Active 10.12.1.1 local 10.12.1.3
For troubleshooting GLBP issues, use tools such as Ethanalyzer to capture GLBP control
packets. The detailed Ethanalyzer output shows what information is being
received or sent as part of the GLBP control packet. Example 6-51 displays the output of
Ethanalyzer for GLBP packets.
NX-2
NX-2# ethanalyzer local interface inband display-filter glbp limit-captured-frames 0
Capturing on inband
2017-10-22 20:33:43.857524 10.12.1.2 -> 224.0.0.102 GLBP 102 G: 10, Hello, IPv4, Request/Response?
2017-10-22 20:33:43.857934 10.12.1.3 -> 224.0.0.102 GLBP 102 G: 10, Hello, IPv4, Request/Response?
2 2017-10-22 20:33:44.858861 10.12.1.2 -> 224.0.0.102 GLBP 102 G: 10, Hello, IPv4, Request/Response?
4 2017-10-22 20:33:44.859474 10.12.1.3 -> 224.0.0.102 GLBP 102 G: 10, Hello, IPv4, Request/Response?
NX-2# ethanalyzer local interface inband display-filter glbp limit-captured-frames 1 detail
Capturing on inband
1
Frame 1: 102 bytes on wire (816 bits), 102 bytes captured (816 bits) on interface 0
Interface id: 0
Encapsulation type: Ethernet (1)
Arrival Time: Oct 22, 2017 20:33:54.873326000 UTC
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1508704434.873326000 seconds
[Time delta from previous captured frame: 0.000000000 seconds]
[Time delta from previous displayed frame: 0.000000000 seconds]
[Time since reference or first frame: 0.000000000 seconds]
Frame Number: 1
Frame Length: 102 bytes (816 bits)
Capture Length: 102 bytes (816 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ip:udp:glbp]
Note In case of an issue with GLBP, collect the show tech glbp command output for
further investigation by Cisco TAC.
Summary
NX-OS supports multiple IP and IPv6 services that complement the routing and switching
capabilities of the Nexus platforms within the data center and position the
Nexus switches at different layers. This chapter detailed how IP SLA is leveraged to
track reachability, measure jitter between a specified source and destination, and support
both UDP- and TCP-based probes. Along with IP SLA, the object tracking feature is lev-
eraged to perform conditional actions in the system. The object tracking feature supports
tracking an interface, an IP or IPv6 route, and a track list, as well as using them with static
routes.
As part of the IPv4 services, NX-OS provides support for DHCP relay, snooping, and
other IPv4 security–related features. This chapter covered in detail how DHCP Relay and
DHCP Snooping can be used in data center environments to extend the capability of
DHCP server and, at the same time, protect the network from attacks. The DHCP Relay
feature can be used when the DHCP server and the host are extended across different
VLANs or subnets. This chapter also showed how to use security features such as DAI, IP
Source Guard, and URPF. When enabling all these services, NX-OS configures ACLs in
the hardware to permit relevant traffic.
For IPv6 services, this chapter covered the IPv6 neighbor discovery process and IPv6
first-hop security features such as RA Guard, IPv6 snooping, and DHCPv6 Guard.
Additionally, the chapter looked at FHRP protocols such as HSRP for both IPv4 and
IPv6, VRRP, and GLBP. The FHRP protocols provide hosts with gateway redundancy.
Finally, the chapter looked at how different FHRP protocols work and how to configure
and troubleshoot them.
Chapter 7
Troubleshooting Enhanced Interior Gateway Routing Protocol (EIGRP)
This chapter focuses on identifying and troubleshooting issues with forming EIGRP
neighbor adjacencies, path selection, missing routes, and convergence.
EIGRP Fundamentals
A Nexus switch can run multiple EIGRP processes. Each process forms adjacencies with
other routers or NX-OS switches under the same common routing domain—otherwise
known as an autonomous system (AS). EIGRP devices within the same AS exchange
routes only with members of the same AS and use the same metric calculation formula.
EIGRP uses factors beyond hop count and adds logic to the route-selection algorithm.
EIGRP uses the Diffusing Update Algorithm (DUAL) to identify network paths and pro-
vides fast convergence using precalculated loop-free backup paths.
Figure 7-1 is used as a reference topology for NX-1 calculating the best path to the
10.4.4.0/24 network.
[Figure 7-1: EIGRP reference topology. Devices: NX-1, NX-2, NX-3, and NX-4. Links and
metric values: 10.12.1.0/24 (7680), 10.13.1.0/24 (256), 10.14.1.0/24 (2560), 10.23.1.0/24 (256),
10.24.1.0/24 (2560), 10.34.1.0/24 (256), 10.4.4.0/24 (2816).]
Table 7-1 contains key terms, definitions, and their correlation to Figure 7-1.
Topology Table
EIGRP contains a topology table that is a vital component to DUAL and contains infor-
mation to identify loop-free backup routes. The topology table contains all the network
prefixes advertised within an EIGRP AS. Each entry in the table contains the following:
■ Network prefix
■ Nearby EIGRP neighbors that have advertised that prefix
■ Metrics from each neighbor (reported distance, hop-count)
■ Values used for calculating the metric (load, reliability, total delay, minimum bandwidth)
[Figure: annotated topology table entry. Labels: Feasible Successor, Path Metric, Reported
Distance, Passes Feasibility Condition (2816 < 3328).]
Upon examining the network 10.4.4.0/24, notice that NX-1 calculates an FD of 3,328 for
the successor route. The successor (upstream router) advertises the successor route with
a reported distance (RD) of 3,072. The second path entry has a metric of 5,376 and has
an RD of 2,816. Because 2,816 is less than the FD of 3,328, the second entry passes the
feasibility condition and is classified as the feasible successor for the prefix.
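The feasibility check can be expressed in a few lines of Python. This is a minimal sketch: the function name classify_paths and the path labels are hypothetical, the first two entries use the path metric and reported distance values for 10.4.4.0/24 quoted in this section, and the third entry is an invented path added only to show a failing case.

```python
def classify_paths(paths):
    """Label each path as successor, feasible successor, or neither.

    paths: dict of label -> (path_metric, reported_distance).
    """
    successor = min(paths, key=lambda label: paths[label][0])
    feasible_distance = paths[successor][0]  # FD is the best path metric
    labels = {}
    for label, (metric, reported_distance) in paths.items():
        if label == successor:
            labels[label] = "successor"
        elif reported_distance < feasible_distance:
            labels[label] = "feasible successor"  # RD < FD
        else:
            labels[label] = "fails feasibility condition"
    return labels


paths_to_10_4_4_0 = {
    "path 1": (3328, 3072),
    "path 2": (5376, 2816),
    "path 3": (13056, 5376),  # hypothetical path, not from the topology table
}
for label, result in classify_paths(paths_to_10_4_4_0).items():
    print(label, "->", result)
```

The comparison is always against the feasible distance of the successor route, not against the successor's own reported distance.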
The 10.4.4.0/24 route is Passive (P), which means that the topology is stable. During a
topology change, routes go into an Active (A) state when computing a new path.
The path metric formula shown in Figure 7-3 is described in RFC 7868, which explains
EIGRP.
\[
\text{Metric} = \left[\left(K_1 \cdot BW + \frac{K_2 \cdot BW}{256 - \text{Load}} + K_3 \cdot \text{Delay}\right) \cdot \frac{K_5}{K_4 + \text{Reliability}}\right]
\]
EIGRP uses K values to define which coefficients the formula uses and the associated
impact with that coefficient when calculating the metric. A common misconception
is that the K values directly apply to bandwidth, load, delay, or reliability; this is not
accurate. For example, K1 and K2 both reference bandwidth (BW).
BW represents the slowest link in the path scaled to a 10 Gigabit per second link (10^7 Kbps).
Link speed is collected from the configured interface bandwidth on an interface. Delay
is the total measure of delay in the path measured in tens of microseconds (μs).
The EIGRP formula is based off the IGRP metric formula, except the output is multi-
plied by 256 to change the metric from 24 bits to 32 bits. Taking these definitions into
consideration, the formula for EIGRP is shown in Figure 7-4.
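\[
\text{Metric} = 256 \cdot \left[\left(K_1 \cdot \frac{10^7}{\text{Min. Bandwidth}} + \frac{K_2 \cdot \frac{10^7}{\text{Min. Bandwidth}}}{256 - \text{Load}} + K_3 \cdot \frac{\text{Total Delay}}{10}\right) \cdot \frac{K_5}{K_4 + \text{Reliability}}\right]
\]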
By default, K1 and K3 have a value of 1, and K2, K4, and K5 are set to 0. Figure 7-5 places
default K values into the formula and then shows a streamlined version of the formula.
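\[
\text{Metric} = 256 \cdot \left(1 \cdot \frac{10^7}{\text{Min. Bandwidth}} + \frac{0 \cdot \frac{10^7}{\text{Min. Bandwidth}}}{256 - \text{Load}} + 1 \cdot \frac{\text{Total Delay}}{10}\right)
\]
Equals
\[
\text{Metric} = 256 \cdot \left(\frac{10^7}{\text{Min. Bandwidth}} + \frac{\text{Total Delay}}{10}\right)
\]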
Note EIGRP includes a second formula, called EIGRP wide metrics, to address high-speed
interfaces; it adds a sixth K value. EIGRP wide metrics are explained later in the
chapter.
The EIGRP update packet includes path attributes associated with each prefix. The
EIGRP path attributes can include hop count, cumulative delay, minimum bandwidth
link speed, and reported distance. The attributes are updated at each hop along the way,
allowing each router to independently identify the shortest path.
Table 7-2 shows some of the common network types, link speeds, delays, and EIGRP
metrics using the streamlined formula from Figure 7-5.
Note Notice how the delay is the same between Serial and T1 interfaces, so the only
granularity is the link speed. In addition, there is not a differentiation of delay between
the Gigabit Ethernet and 10 Gigabit Ethernet interfaces.
Using the topology from Figure 7-1, the metric from NX-1 for the 10.4.4.0/24 network
is calculated using the formula in Figure 7-5. The link speed for both Nexus switches
is 1 Gbps, and the total delay is 30 μs (10 μs for the 10.4.4.0/24 link, 10 μs for the
10.34.1.0/24 link, and 10 μs for the 10.13.1.0/24 link to NX-3).
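The calculation can be verified with a short Python sketch of the classic metric formula. The function name eigrp_classic_metric is hypothetical, bandwidth is assumed to be in Kbps and delay in microseconds, and integer division is a simplifying assumption. The K5 term is skipped when K5 is 0, which is how the default K values reduce the formula to its streamlined form.

```python
def eigrp_classic_metric(min_bandwidth_kbps, total_delay_us,
                         load=1, reliability=255, k=(1, 0, 1, 0, 0)):
    """Compute the classic EIGRP composite metric.

    k holds K1 through K5; the default is K1 = K3 = 1 and K2 = K4 = K5 = 0.
    """
    k1, k2, k3, k4, k5 = k
    bandwidth = 10**7 // min_bandwidth_kbps  # scaled to a 10-Gbps link
    delay = total_delay_us // 10             # tens of microseconds
    metric = k1 * bandwidth + (k2 * bandwidth) // (256 - load) + k3 * delay
    if k5 != 0:
        # The reliability term applies only when K5 is nonzero.
        metric = metric * k5 // (k4 + reliability)
    return 256 * metric


# 1-Gbps minimum bandwidth (1,000,000 Kbps) and 30 microseconds of total delay
print(eigrp_classic_metric(1_000_000, 30))  # 3328
```

Using 20 μs and 10 μs of total delay with the same bandwidth yields 3,072 and 2,816, which match the reported distances quoted for the two paths to 10.4.4.0/24.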
The EIGRP metric for a specific prefix is queried directly from EIGRP’s topology table
with the command show ip eigrp topology network/prefix-length. Example 7-1 shows
NX-1’s topology table output for the 10.4.4.0/24 network. Notice that the output
includes the successor route, any feasible successor paths, and the EIGRP state for the
prefix. Each path contains the EIGRP attributes minimum bandwidth, total delay, inter-
face reliability, load, and hop count.
Note The EIGRP topology table maintains other paths besides the successor and fea-
sible successor. The command show ip eigrp topology all-links displays the other ones.
EIGRP Communication
EIGRP uses five packet types to communicate with other routers, as shown in
Table 7-3. EIGRP uses its own IP protocol number (88), and uses multicast packets
where possible and unicast packets when necessary. Communication between EIGRP
devices is accomplished using the multicast group address of 224.0.0.10 or MAC
address of 01:00:5e:00:00:0a when possible.
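The multicast MAC address follows from the group address through the standard IPv4 multicast mapping, in which the low-order 23 bits of the IP address are placed behind the fixed 01:00:5e prefix. The following sketch shows the mapping; the function name multicast_mac is hypothetical.

```python
import ipaddress


def multicast_mac(group):
    """Map an IPv4 multicast group address to its Ethernet MAC address."""
    address = ipaddress.IPv4Address(group)
    if not address.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low_23_bits = int(address) & 0x7FFFFF
    digits = f"{0x01005E000000 | low_23_bits:012x}"
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


print(multicast_mac("224.0.0.10"))  # 01:00:5e:00:00:0a
```

This is useful when matching EIGRP frames by destination MAC address in a Layer 2 capture.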
EIGRP uses the Reliable Transport Protocol (RTP) to ensure that packets are delivered
in order and that routers receive specific packets. A sequence number is included in
all of the EIGRP packets. A sequence value of zero does not require a response from
the receiving EIGRP router; all other values require an Acknowledgement packet that
includes the original sequence number.
Ensuring that packets are received makes the transport method reliable. All Updates,
Queries, and Reply packets are deemed reliable, whereas Hello and Acknowledgement
packets do not require acknowledgement and could be unreliable.
If the originating router does not receive an acknowledgement packet from the neighbor
before the retransmit timeout expires, it notifies the nonresponsive router to stop pro-
cessing its multicast packets. The originating router sends all traffic via unicast, until the
neighbor is fully synchronized. Upon complete synchronization, the originating router
notifies the destination router to start processing multicast packets again. All unicast
packets require acknowledgement. EIGRP will retry up to 16 times for each packet that
requires confirmation and will reset the neighbor relationship when the neighbor reaches
the retry limit of 16.
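The acknowledgement and retry behavior can be modeled with a small sketch. It is a simplification under stated assumptions: the function names and the transmit callable are hypothetical, the limit is treated as 16 transmission attempts in total, and the switch from multicast to unicast delivery is not modeled.

```python
MAX_ATTEMPTS = 16  # retry limit stated in the text


def needs_ack(sequence_number):
    """A sequence value of zero does not require an acknowledgement."""
    return sequence_number != 0


def send_reliably(transmit, sequence_number, max_attempts=MAX_ATTEMPTS):
    """Send one packet and report the outcome as a short string.

    transmit is a hypothetical callable that sends the packet and returns
    True if an acknowledgement arrived before the retransmit timeout.
    """
    if not needs_ack(sequence_number):
        transmit(sequence_number)
        return "sent, no acknowledgement expected"
    for attempt in range(1, max_attempts + 1):
        if transmit(sequence_number):
            return f"acknowledged on attempt {attempt}"
    return "retry limit reached, neighbor reset"


replies = iter([False, False, True])  # the neighbor answers the third attempt
print(send_reliably(lambda seq: next(replies), sequence_number=42))
```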
Step 1. Enable the EIGRP feature. The EIGRP feature must be enabled with the
global configuration command feature eigrp.
Step 2. Define an EIGRP process tag. The EIGRP process must be defined with the
global configuration command router eigrp process-tag. The process-tag can
be up to 20 alphanumeric characters in length.
Step 4. Define the address family. EIGRP supports IPv4 and IPv6 address-
families under the same EIGRP process. Therefore, the address-family
should be defined with the command address-family [ipv4 | ipv6]
unicast.
Step 5. Define the Autonomous System Number (ASN) for the EIGRP process. The
autonomous system must be defined for the EIGRP process with the com-
mand autonomous-system as-number.
This step is optional if the EIGRP process tag is only numeric and matches
the ASN used by the EIGRP process.
Note Unlike IOS devices, enabling EIGRP on an interface advertises any secondary
connected network into the topology table.
Figure 7-6 provides a simple topology with two Nexus switches that are used to explain
how to troubleshoot EIGRP adjacency problems.
The first step is to verify which devices have successfully established an EIGRP adjacency
with the command show ip eigrp neighbors [detail] [interface-id | neighbor-ip-address |
vrf {vrf-name | all}]. Example 7-3 demonstrates the command being run on NX-1.
Table 7-4 provides a brief explanation of the key fields shown in Example 7-3.
Besides enabling EIGRP on the network interfaces of an NX-OS device, the following
parameters must match for the two routers to become neighbors:
■ Authentication parameters
Table 7-5 provides a brief explanation of the key fields shown with the EIGRP
interfaces.
Passive Interface
Some network topologies require advertising a network segment into EIGRP, but need
to prevent neighbors from forming adjacencies on that segment. Example scenarios
involve advertising access layer networks in a campus topology.
To illustrate how this can cause problems, consider a case in which NX-1 and NX-2 cannot
establish an EIGRP adjacency. When the EIGRP interfaces are viewed on both switches, the
peering link E1/1 is not displayed as expected, as shown in Example 7-5.
A passive interface is not displayed when displaying the EIGRP interfaces as explained
in the previous section. Examining the EIGRP process with the command show ip eigrp
[process-tag] provides a count of active and passive interfaces as seen in Example 7-6.
Example 7-7 displays the configuration on NX-1 and NX-2 that prevents the two Nexus
switches from forming an EIGRP adjacency. The Ethernet1/1 interfaces must be active
on both switches for an adjacency to form. The command no ip passive-interface eigrp
NXOS should be moved to Interface E1/1 on NX-1, and the command ip passive-
interface eigrp NXOS should be moved from E1/1 to Vlan20 on NX-2.
NX-1
interface Vlan10
  ip router eigrp NXOS
  no ip passive-interface eigrp NXOS
interface loopback0
  ip router eigrp NXOS
interface Ethernet1/1
  ip router eigrp NXOS
NX-2
interface Vlan20
  ip router eigrp NXOS
interface loopback0
  ip router eigrp NXOS
interface Ethernet1/1
  ip router eigrp NXOS
  ip passive-interface eigrp NXOS
Note In addition to placing an interface into a passive state, an interface can have EIGRP
temporarily shut down with the command ip eigrp process-tag shutdown. This disables
EIGRP on that interface while leaving EIGRP configuration on that interface.
request | update | verbose] enables debug functionality for the type of packet that is
selected. Example 7-9 displays the use of the EIGRP packet debugs.
Performing EIGRP debugs shows only the packets that have reached the supervisor
CPU. If packets are not displayed in the debugs, further troubleshooting is needed:
examine quality of service (QoS) policies, access control lists (ACL), and control
plane policing (CoPP), or simply verify that the packet is leaving or entering an
interface.
QoS policies may or may not be deployed on an interface. If they are deployed, the
policy-map must be examined for any dropped packets, which must then be referenced
to a class-map that matches the EIGRP routing protocol. The same logic applies to CoPP
policies because they are based on QoS settings.
Example 7-10 displays the process for checking the CoPP policy with the following
logic:
■ Examine the CoPP policy with the command show running-config copp all. This
displays the relevant policy-map name, classes defined, and the police rate for
each class.
■ Investigate the class-maps to identify the conditional matches for that class-map.
■ After the class-map has been verified, examine the policy-map drops for that class
with the command show policy-map interface control-plane.
Troubleshooting EIGRP Neighbor Adjacency 407
Note This CoPP policy was taken from a Nexus 7000 switch; the policy-name and class-
maps may vary depending on the platform.
Example 7-11 demonstrates the configuration of an ACL to detect EIGRP traffic on the
Ethernet1/1 interface. Notice that the ACL includes a permit ip any any command to
allow all traffic to pass through this interface. Failing to do so could result in the loss of
traffic.
Example 7-11 uses an Ethernet interface, which generally indicates a one-to-one relation-
ship, but on multi-access interfaces like Switched Virtual Interfaces (SVI) (a.k.a. Interface
VLANs) the neighbor may need to be specified in a specific ACE.
Example 7-12 displays the configuration for an ACL that is placed on an SVI with an ACE
for the neighbor 10.12.100.200. EIGRP packets from other neighbors are collected
with the second entry, line 20.
ip access-list EIGRP
  statistics per-entry
  10 permit eigrp 10.12.100.200/32 any
  20 permit eigrp any any
  30 permit icmp any any
  40 permit ip any any
interface vlan 10
  ip access-group EIGRP in
An alternative to using an ACL is to use the built-in NX-OS Ethanalyzer to capture the
EIGRP packets. Example 7-13 demonstrates the command syntax. The optional detail
keyword is used to view the contents of the packets.
Example 7-14 demonstrates that NX-1 detects NX-2 and registers it as a neighbor,
whereas NX-2 does not detect NX-1.
In addition, NX-1 keeps changing the neighbor state for NX-2 (10.12.1.200) after a retry
limit was exceeded, as shown in Example 7-15.
NX-1
13:28:06 NX-1 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [26809] (default-base)
IP-EIGRP(0) 12: Neighbor 10.12.1.200 (Ethernet1/1) is down: retry limit exceeded
13:28:09 NX-1 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [26809] (default-base)
IP-EIGRP(0) 12: Neighbor 10.12.1.200 (Ethernet1/1) is up: new adjacency
21:19:00 NX-1 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [26809] (default-base)
IP-EIGRP(0) 123: Neighbor 10.12.1.200 (Ethernet1/1) is down: retry limit exceeded
21:19:00 NX-1 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [26809] (default-base)
IP-EIGRP(0) 123: Neighbor 10.12.1.200 (Ethernet1/1) is up: new adjacency
Note NX-OS does not provide the syslog message “is blocked: not on common subnet”
that is included with IOS routers.
Remember that EIGRP will retry up to 16 times for each packet that requires confirma-
tion, and it will reset the neighbor relationship when the neighbor reaches the retry limit
of 16. The actual retry values are examined on NX-OS by using the command show ip
eigrp neighbor detail, as demonstrated in Example 7-16.
The next step is to try to ping the primary IP address between nodes, as shown in
Example 7-17.
NX-1 cannot ping NX-2, and NX-2 cannot ping NX-1 because it does not have a route
to the host. This also means that NX-1 might have been able to send the packets to
NX-2, but NX-2 did not have a route to send the ICMP response.
Example 7-18 displays the routing table on NX-1 and NX-2 to help locate the reason.
At this point, check the IP address configuration on both devices. This should reveal
the mismatched prefix length (subnet mask). Correcting it allows the EIGRP
devices to communicate properly.
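The common-subnet check can be reproduced offline with Python's standard ipaddress module. The sketch below is only an illustration: the addresses 10.12.1.100 (NX-1) and 10.12.1.200 (NX-2) come from the log messages, but the /24 and /25 prefix lengths are hypothetical values chosen to show how a mismatch lets one side see its neighbor while the other side does not.

```python
import ipaddress


def on_common_subnet(local_interface: str, neighbor_ip: str) -> bool:
    """True when neighbor_ip falls inside the subnet configured on local_interface."""
    interface = ipaddress.ip_interface(local_interface)
    return ipaddress.ip_address(neighbor_ip) in interface.network


# Hypothetical prefix lengths: NX-1 uses /24, NX-2 was mistyped as /25.
nx1_sees_nx2 = on_common_subnet("10.12.1.100/24", "10.12.1.200")
nx2_sees_nx1 = on_common_subnet("10.12.1.200/25", "10.12.1.100")

print("NX-1 considers NX-2 on its subnet:", nx1_sees_nx2)
print("NX-2 considers NX-1 on its subnet:", nx2_sees_nx1)
```

With these assumed masks, NX-2's interface belongs to 10.12.1.128/25, so 10.12.1.100 is outside its connected network.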
Example 7-19 displays a configuration that might be confusing to junior network engi-
neers. Is the EIGRP ASN 12 or 1234?
router eigrp 12
  autonomous-system 1234
interface Ethernet1/1
  ip router eigrp 12
Unfortunately, no debugs or log messages are provided if the EIGRP ASNs are
mismatched. Check the EIGRP ASN on both sides to verify that it is the same.
The ASN for an EIGRP instance is found by examining the EIGRP protocol with
the command show ip eigrp, where it is listed beside the router-id. The ASN is also
displayed when viewing the EIGRP interfaces with the command show ip eigrp
interfaces brief.
Note Specifying the AS in the EIGRP configuration removes any potential for confusion
by network engineers of all skill levels. This is considered a best practice.
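The relationship between the process tag and the ASN can be summarized as a small decision function. This is a hedged sketch, not NX-OS behavior taken from this section: it assumes that an explicit autonomous-system value takes precedence over a numeric process tag, and it returns None for a non-numeric tag with no ASN configured because that case is not described here. The function name effective_asn is hypothetical.

```python
from typing import Optional


def effective_asn(process_tag: str, configured_asn: Optional[int] = None) -> Optional[int]:
    """Return the ASN a process is assumed to use, or None if undetermined."""
    if configured_asn is not None:
        # Assumption: the autonomous-system command overrides the tag.
        return configured_asn
    if process_tag.isascii() and process_tag.isdigit():
        # A purely numeric tag doubles as the ASN.
        return int(process_tag)
    return None


print(effective_asn("12", 1234))  # tag 12 with autonomous-system 1234
print(effective_asn("12"))        # numeric tag only
print(effective_asn("NXOS", 12))  # alphanumeric tag with explicit ASN
```

Comparing the value returned for both peers is a quick way to spot the silent ASN mismatch described above.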
Mismatch K Values
EIGRP uses K values to define which factors the best path formula uses. To
ensure consistent routing logic and prevent routing loops from forming, all EIGRP
neighbors must use the same K values. The K values are included as part of the EIGRP
Hello packets.
Example 7-21 displays the syslog message that indicates a mismatch of K values. The
K values are identified on the local router by looking at the EIGRP process with the
command show ip eigrp.
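When comparing the output of show ip eigrp from two switches by hand, it helps to think of the K values as a tuple that must match field for field. The sketch below is only an illustration of that comparison; the KValues type and the mismatched_k helper are hypothetical names, and the defaults (K1 and K3 set to 1, the rest set to 0) are the classic values.

```python
from typing import List, NamedTuple


class KValues(NamedTuple):
    k1: int = 1
    k2: int = 0
    k3: int = 1
    k4: int = 0
    k5: int = 0


def mismatched_k(local: KValues, neighbor: KValues) -> List[str]:
    """Return the names of the K values that differ between two routers."""
    return [name for name in KValues._fields
            if getattr(local, name) != getattr(neighbor, name)]


nx1 = KValues()      # defaults
nx2 = KValues(k2=1)  # hypothetical custom value on the neighbor

print(mismatched_k(nx1, nx2))  # non-empty list means no adjacency forms
```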
The K values on Nexus switches are configured with the command metric weights TOS
K1 K2 K3 K4 K5 [K6] under the EIGRP process. The K6 value is optional unless EIGRP
wide metrics are configured. TOS is not used and should be set to zero. Example 7-22
displays an EIGRP configuration with custom K values.
EIGRP uses a second timer called the hold time, which is the amount of time EIGRP
deems the router reachable and functioning. The hold time value defaults to three times
the Hello interval. The default value is 15 seconds, and 180 seconds for slow-speed
interfaces. The hold time decrements, and upon receipt of a Hello packet, the hold time
resets and restarts the countdown. If the hold time reaches zero, EIGRP declares the
neighbor unreachable and notifies the DUAL algorithm of a topology change.
If the EIGRP Hello timer is greater than the Hold timer on the other EIGRP neighbor,
the session will continuously flap. Example 7-23 demonstrates NX-1 periodically reset-
ting the adjacency with NX-2 because of the holding time expiring on NX-1.
NX-1
03:11:35 NX-1 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [30489] (default-
base) IP-EIGRP(0) 12: Neighbor 10.12.1.200 (Ethernet1/1) is down: holding time
expired
03:11:39 NX-1 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [30489] (default-
base) IP-EIGRP(0) 12: Neighbor 10.12.1.200 (Ethernet1/1) is up: new adjacency
03:11:54 NX-1 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [30489] (default-
base) IP-EIGRP(0) 12: Neighbor 10.12.1.200 (Ethernet1/1) is down: holding time
expired
03:11:59 NX-1 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [30489] (default-
base) IP-EIGRP(0) 12: Neighbor 10.12.1.200 (Ethernet1/1) is up: new adjacency
NX-2
03:11:35 NX-2 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [26807] (default-
base) IP-EIGRP(0) 12: Neighbor 10.12.1.100 (Ethernet1/1) is down: Interface
Goodbye received
03:11:39 NX-2 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [26807] (default-
base) IP-EIGRP(0) 12: Neighbor 10.12.1.100 (Ethernet1/1) is up: new adjacency
03:11:54 NX-2 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [26807] (default-
base) IP-EIGRP(0) 12: Neighbor 10.12.1.100 (Ethernet1/1) is down: Interface
Goodbye received
03:11:59 NX-2 %$ VDC-1 %$ %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [26807] (default-
base) IP-EIGRP(0) 12: Neighbor 10.12.1.100 (Ethernet1/1) is up: new adjacency
The EIGRP Hello and Hold timers for an interface are seen with the command show ip
eigrp interface [interface-id] [vrf {vrf-name | all}]. The optional brief keyword cannot be
used to view the timers. Example 7-24 displays sample output for NX-1 and NX-2.
NX-2 is displaying a Hello timer of 120 seconds, which exceeds NX-1’s Hold timer of
15 seconds. This is the reason that NX-1 keeps tearing down the EIGRP adjacency.
Example 7-25 verifies that the Hello interval was modified with the interface command
ip hello-interval eigrp process-tag hello-time. Changing the Hello time back to the
default value or a value less than 15 seconds (NX-1’s Hold timer) allows the switches to
form an adjacency.
interface Ethernet1/1
  ip router eigrp NXOS
  ip hello-interval eigrp NXOS 120
Note The EIGRP interface Hold timer is modified with the command ip hold-time eigrp
process-tag hold-time.
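The timer relationship described in this section reduces to a single comparison. The following sketch encodes it with two hypothetical helper functions; it follows the rule as stated in the text (a neighbor's Hello interval larger than the local Hold timer causes flapping) and is not a model of how the timers are exchanged on the wire.

```python
def default_hold_time(hello_interval: int) -> int:
    """The hold time defaults to three times the Hello interval."""
    return 3 * hello_interval


def adjacency_will_flap(neighbor_hello: int, local_hold: int) -> bool:
    """True when the neighbor's Hellos arrive less often than the local hold time allows."""
    return neighbor_hello > local_hold


# Values from the example: NX-2 sends Hellos every 120 s, NX-1 holds for 15 s.
print(adjacency_will_flap(neighbor_hello=120, local_hold=15))
# Default timers on both sides.
print(adjacency_will_flap(neighbor_hello=5, local_hold=default_hold_time(5)))
```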
EIGRP hashes the password with MD5 by using the keychain function. Keychains
allow the configuration of multiple passwords and sequences that can have a validity
period set so that passwords can be rotated. When using time-based keychains, it is
important that the time on the Nexus switches is synchronized with NTP and that some
overlap of time is provided between key iterations.
The hash is composed of the key number and a password. EIGRP authentication does
not encrypt the entire EIGRP packet, just the password. The password is seen with the
command show key chain [mode decrypt]. The optional keywords mode decrypt dis-
play the password in plain text between a pair of quotation marks, which is helpful to
detect unwanted characters such as spaces. Example 7-27 displays how the keychain
password is verified.
Note The hash does not match between EIGRP devices if the key number is different,
even if the password is identical. So the key number and password must match.
Step 1. Create the keychain. The command key chain key-chain-name creates the
local keychain.
Step 2. Identify the key sequence. The key sequence is specified with the command key
key-number, where the key number can be anything from 0 to 2147483647.
Step 3. Specify the password. The pre-shared password is entered with the command
key-string text. Steps 2 and 3 could be repeated as needed to accommodate
multiple key strings.
Step 4. Identify the keychain for an interface. The keychain used by the interface
must be specified with the command ip authentication key-chain eigrp
process-tag key-chain-name.
interface Ethernet1/1
  ip router eigrp NXOS
  ip authentication key-chain eigrp NXOS EIGRP
  ip authentication mode eigrp NXOS md5
[Figure 7-7: sample topology with six Nexus switches. Upper path: NX-1 → NX-2 → NX-3 → NX-6
over 10.12.1.0/24, 10.23.1.0/24, and 10.36.1.0/24. Lower path: NX-1 → NX-4 → NX-5 → NX-6
over 10.14.1.0/24, 10.45.1.0/24, and 10.56.1.0/24. NX-1 connects the 10.1.1.0/24 and
10.11.11.0/24 networks; NX-6 connects the 10.6.6.0/24 network.]
Example 7-30 displays a portion of NX-1 and NX-6’s routing table. Notice that two
paths exist between NX-1 and NX-6 in both directions for the corresponding advertised
network prefixes.
EIGRP routes that are installed into the RIB are seen with the command show ip route
[eigrp]. The optional eigrp keyword only shows EIGRP learned routes. EIGRP routes are
indicated by the eigrp-process-tag.
EIGRP routes originating within the autonomous system have an administrative distance
(AD) of 90 and have the internal flag listed after the process-tag. Routes that originate
from outside of the AS are external EIGRP routes. External EIGRP routes have an AD of
170, and have the external flag listed after the process-tag. Placing external EIGRP routes
into the RIB with a higher AD acts as a loop prevention mechanism.
Example 7-31 displays the EIGRP routes from the sample topology in Figure 7-7. The
metric for the selected route is the second number in brackets.
Load Balancing
EIGRP allows multiple successor routes (same metric) to be installed into the RIB.
Installing multiple paths into the RIB for the same prefix is called equal-cost multipath
(ECMP) routing. At the time of this writing, the default maximum ECMP paths value for
Nexus nodes is eight.
The default ECMP setting is changed with the command maximum-paths maximum-paths
under the EIGRP process; for example, the value can be increased from the default to 16.
NX-OS does not support EIGRP unequal-cost load balancing, which allows installation
of both successor routes and feasible successors into the EIGRP RIB. Unequal-cost load
balancing is supported in other Cisco operating systems with the variance command.
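The equal-cost selection rule can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the next-hop addresses and metrics are hypothetical, and the order in which paths are kept when more equal-cost paths exist than maximum-paths allows is implementation specific (this sketch simply sorts the next hops).

```python
from typing import Dict, List


def install_ecmp(paths: Dict[str, int], maximum_paths: int = 8) -> List[str]:
    """Return the next hops that share the lowest metric, capped at maximum_paths.

    paths maps a next-hop address to its path metric. Paths with a higher
    metric are never installed, because unequal-cost load balancing is not
    supported.
    """
    if not paths:
        return []
    best = min(paths.values())
    successors = sorted(hop for hop, metric in paths.items() if metric == best)
    return successors[:maximum_paths]


# Hypothetical next hops and metrics for a single prefix.
candidate_paths = {"10.36.1.3": 1280, "10.56.1.5": 1280, "10.99.1.9": 1536}
print(install_ecmp(candidate_paths))
print(install_ecmp(candidate_paths, maximum_paths=1))
```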
Stub
EIGRP stub functionality allows an EIGRP router to conserve router resources.
An EIGRP stub router announces itself as a stub within the EIGRP Hello packet.
Neighboring routers detect the stub field and update the EIGRP neighbor table to
reflect the router’s stub status.
If a route goes active, EIGRP does not send EIGRP Queries to an EIGRP stub router.
This provides faster convergence within an EIGRP AS because it decreases the size of the
Query domain for that prefix.
EIGRP stubs do not advertise routes that they learn from other EIGRP peers. By default,
EIGRP stubs advertise only connected and summary routes, but can be configured so
that they only receive routes or advertise any combination of redistributed routes, con-
nected routes, or summary routes.
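The stub advertisement rules above can be modeled as a simple filter. The sketch is illustrative only: the route-source labels ("connected", "summary", "redistributed", "eigrp") and the "receive-only" option are names chosen for this model rather than CLI keywords, and the function name stub_advertises is hypothetical.

```python
from typing import FrozenSet

DEFAULT_STUB_OPTIONS: FrozenSet[str] = frozenset({"connected", "summary"})


def stub_advertises(route_source: str,
                    stub_options: FrozenSet[str] = DEFAULT_STUB_OPTIONS) -> bool:
    """Decide whether a stub router advertises a route of the given source."""
    if "receive-only" in stub_options:
        return False  # a receive-only stub advertises nothing
    if route_source == "eigrp":
        return False  # routes learned from other EIGRP peers are never advertised
    return route_source in stub_options


print(stub_advertises("connected"))      # advertised by default
print(stub_advertises("eigrp"))          # transit routes are suppressed
print(stub_advertises("redistributed"))  # needs an explicit stub option
```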
The routing tables in Example 7-32 look different on NX-1 and NX-6 from the baseline
routing table that was displayed in Example 7-30.
The routes from NX-1 and NX-2 seem to be available only on the lower path (NX-1 →
NX-4 → NX-5 → NX-6). Has a problem occurred on the upper path (NX-1 → NX-2 →
NX-3 → NX-6)? The first step is to check the EIGRP adjacency, which is shown in
Example 7-33.
All the routers have established adjacency. Using the optional detail keyword may
provide more insight into the problem. Example 7-34 displays the command show ip eigrp
neighbors detail.
NX-1 was able to detect that the 10.12.1.2 peer (NX-2) has the EIGRP stub feature con-
figured. The stub feature prevented NX-2 from advertising routes learned on the E1/2
interface toward the E1/1 interface and vice versa.
The next step is to verify and remove the EIGRP stub configuration. The EIGRP
command eigrp stub {direct | leak-map leak-map-name | receive-only | redistributed
| static | summary} configures stub functionality on a switch and is displayed in
Example 7-35. Removing the stub configuration allows for the routes to transit
across NX-2.
Note The receive-only option cannot be combined with other EIGRP stub options.
Give the network design special consideration to ensure bidirectional connectivity for any
networks connected to an EIGRP router with the receive-only stub option, so that
routers know how to send return traffic.
interface Ethernet1/1
  ip router eigrp NXOS
interface Ethernet1/2
  ip router eigrp NXOS
Note At the time of this writing, full EIGRP support is available only in Enterprise
Services, whereas only EIGRP Stub functionality is included in LAN Base licensing for
specific platforms. Please check current licensing options, because this could cause issues.
Maximum-Hops
EIGRP is a hybrid distance vector routing protocol and does keep track of hop counts.
Just as before, a change is noted in the routing table of NX-1, where paths appear to
have disappeared. The routing table for NX-1 and NX-6 is provided in Example 7-36.
NX-1 is missing the upper (NX-1 → NX-2 → NX-3 → NX-6) path for the 10.6.6.0/24
network, whereas NX-6 maintains full paths to the 10.1.1.0/24 and 10.11.11.0/24
networks. This means that there is connectivity in both directions and that EIGRP stub
functionality has not been deployed. It also indicates that there is EIGRP adjacency along all
paths, so some form of filtering or path manipulation was performed.
Examining the EIGRP configuration on NX-1, NX-2, NX-3, and NX-6 identifies the
cause of the problem. NX-2 has configured the maximum-hops feature and set it to 1, as
shown in Example 7-37. This allows for the relevant routes (from NX-6’s perspective) to
be seen equally. Removing the metric maximum-hops command or changing the value
to a normal value returns the routing table to normal.
interface Ethernet1/1
  ip router eigrp NXOS
interface Ethernet1/2
  ip router eigrp NXOS
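The effect of a maximum-hops value of 1 on NX-2 can be reasoned through with a small model. This is a sketch under assumptions: it counts the hop count at a router as the number of EIGRP hops from the originating switch, and it assumes a route is rejected when that count exceeds the configured maximum. The exact off-by-one behavior on a given platform should be confirmed on the device; the helper names are hypothetical.

```python
from typing import List


def hop_count_at(path: List[str], router: str) -> int:
    """Number of hops from the originating router (path[0]) to router."""
    return path.index(router)


def accepts_route(hop_count: int, maximum_hops: int) -> bool:
    """A route is kept only while its hop count stays within the limit."""
    return hop_count <= maximum_hops


NX2_MAXIMUM_HOPS = 1  # value configured on NX-2 in the example

# 10.6.6.0/24 originates on NX-6 and reaches NX-2 through NX-3.
to_nx1 = ["NX-6", "NX-3", "NX-2", "NX-1"]
# 10.1.1.0/24 originates on NX-1 and reaches NX-2 directly.
to_nx6 = ["NX-1", "NX-2", "NX-3", "NX-6"]

print(accepts_route(hop_count_at(to_nx1, "NX-2"), NX2_MAXIMUM_HOPS))
print(accepts_route(hop_count_at(to_nx6, "NX-2"), NX2_MAXIMUM_HOPS))
```

Under these assumptions NX-2 discards 10.6.6.0/24 (two hops away) but keeps 10.1.1.0/24 (one hop away), which matches the asymmetric routing tables described above.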
Distribute List
EIGRP supports filtering of routes with a distribute list that is placed on an individual
interface. The distribute list uses the command ip distribute-list eigrp process-tag
{route-map route-map-name | prefix-list prefix-list-name} {in | out}. The following
rules apply:
■ If the direction is set to in, inbound filtering drops routes prior to the DUAL
processing; therefore, the routes are not installed into the RIB.
■ If the direction is set to out, the filtering occurs during outbound route
advertisement; the routes are processed by DUAL and are installed into the local RIB of the
receiving router.
■ Any routes that pass the prefix-list are advertised or received. Routes that do not
pass the prefix-list are filtered.
A network engineer has identified that a path for the 10.1.1.0/24 route has disappeared
on NX-6 while the 10.11.11.0/24 route has both paths in it. Example 7-38 displays the
current routing table of NX-6, which is different from the original routing table dis-
played in Example 7-30.
Because the 10.11.11.0/24 network has two paths and it is connected to the same
Nexus switch (NX-1), some form of path manipulation is enabled. Checking the
routing table along the missing path should identify the router causing this
behavior.
Example 7-39 displays NX-2’s routing table that shows the path for 10.1.1.0/24 coming
from NX-3 when the path from NX-1 appears to be more optimal.
Troubleshooting Path Selection and Missing Routes 427
This means that the filtering is happening either on NX-1 (outbound) or on NX-2
(inbound). Example 7-40 displays the configuration on NX-2 that filters the path for
the 10.1.1.0/24 inbound. Notice that sequence 5 blocks the 10.1.1.0/24 route, while
sequence 10 allows all other routes to pass.
NX-2
interface Ethernet1/2
  description To NX-1
  ip router eigrp NXOS
  ip distribute-list eigrp NXOS prefix-list DISTRIBUTE in
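The way the prefix list evaluates each route can be checked offline with Python's ipaddress module. The sketch below is a hedged reconstruction: the two entries mirror the description in the text (sequence 5 denies 10.1.1.0/24, sequence 10 permits everything else), but the tuple layout, the le value of 32, and the function name are assumptions made for illustration.

```python
import ipaddress
from typing import List, Optional, Tuple

# (sequence, action, prefix, le) - assumed layout of the DISTRIBUTE prefix list
PrefixEntry = Tuple[int, str, str, Optional[int]]

DISTRIBUTE: List[PrefixEntry] = [
    (5, "deny", "10.1.1.0/24", None),
    (10, "permit", "0.0.0.0/0", 32),
]


def prefix_list_permits(route: str, prefix_list: List[PrefixEntry]) -> bool:
    """Evaluate entries in sequence order; the first match decides the result."""
    network = ipaddress.ip_network(route)
    for _seq, action, prefix, le in sorted(prefix_list, key=lambda e: e[0]):
        entry = ipaddress.ip_network(prefix)
        max_len = le if le is not None else entry.prefixlen
        if network.subnet_of(entry) and entry.prefixlen <= network.prefixlen <= max_len:
            return action == "permit"
    return False  # implicit deny when nothing matches


for candidate in ("10.1.1.0/24", "10.11.11.0/24"):
    print(candidate, prefix_list_permits(candidate, DISTRIBUTE))
```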
Offset Lists
Modifying the EIGRP path metric provides traffic engineering in EIGRP. Modifying the
delay setting for an interface modifies all routes that are received and advertised from
that router’s interface. Offset lists allow for the modification of route attributes based
upon direction of the update, specific prefix, or combination of direction and prefix.
The offset list is applied under the interface with the command ip offset-list eigrp
process-tag {route-map route-map-name | prefix-list prefix-list-name} {in | out}
offset-value. The following rules apply:
■ If the direction is set to in, the offset value is added as routes are added to the
EIGRP topology table.
■ If the direction is set to out, the path metric increases by the offset value specified
in the offset list as advertised to the EIGRP neighbor.
■ Any routes that pass the route-map or the prefix-list will have the metric added to
the path attributes.
The offset-value is calculated from an additional delay value that is added to the existing
delay in the EIGRP path attribute. Figure 7-8 shows the modified path metric formula
when an offset delay is included.
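\[
\text{Metric} + \text{Offset Value} = 256 \times \left( \left( \frac{10^7}{\text{Min. Bandwidth}} + \frac{\text{Total Delay}}{10} \right) + \text{Offset Delay} \right)
\]
Equals
\[
\text{Offset Value} = 256 \times \text{Offset Delay}
\]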
Example 7-41 displays an offset list configuration on NX-2 that adds 256 to the path
metric for only the 10.1.1.0/24 prefix received from NX-1.
NX-2
interface Ethernet1/2
  description To NX-1
  ip router eigrp NXOS
  ip offset-list eigrp NXOS prefix-list OFFSET in 256
Example 7-42 displays the topology for the 10.1.1.0/24 prefix that is advertised from
NX-1 toward NX-2 from Figure 7-8. Notice that the path metric has increased from
768 to 1,024 and that the delay increased by 10 microseconds.
The metric value added in Example 7-41 was explicitly calculated using the EIGRP path
metric formula so that a delay value of 10 was added. Adding a metric value at one point
in the path may not be the same metric increase later on, depending on whether the
bandwidth changes further downstream on that path.
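The arithmetic behind the offset value can be verified with a short calculation. This is a sketch that assumes default K values and the classic metric formula; the minimum bandwidth of 10,000,000 Kbps and the starting delay of 20 microseconds are assumed inputs chosen because they reproduce the metric of 768 quoted above.

```python
def classic_metric(min_bandwidth_kbps: int, total_delay_us: int) -> int:
    """Classic EIGRP path metric with default K values (K1 = K3 = 1)."""
    return 256 * (10**7 // min_bandwidth_kbps + total_delay_us // 10)


def offset_for_delay(extra_delay_us: int) -> int:
    """Offset value that corresponds to an added delay, in microseconds."""
    return 256 * (extra_delay_us // 10)


before = classic_metric(10_000_000, 20)  # assumed starting attributes
offset = offset_for_delay(10)            # add 10 microseconds of delay
after = before + offset

print(before, offset, after)
# The result equals the metric of the same path with 30 microseconds of delay.
print(after == classic_metric(10_000_000, 30))
```

Because the offset is expressed in metric units, the same value corresponds to a different share of the total metric once the minimum bandwidth changes further along the path.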
Example 7-43 displays how the increase of the metric (256) has impacted only the path
from 10.1.1.0/24 and not the path from 10.11.11.0/24.
Interface-Based Settings
EIGRP assigns the delay and bandwidth to an interface automatically based on the inter-
face’s negotiated connection speed. In some instances these values are modified for traf-
fic engineering. If the traffic flow is not as expected, check the EIGRP configuration for
the following commands:
■ ip bandwidth eigrp process-tag bandwidth changes the value used by the EIGRP
process when calculating the minimum bandwidth path attribute.
The usage of these commands affects all prefixes that are received or advertised from the
associated interface, whereas with an offset list, the prefixes can be selectively chosen.
Note As stated earlier, the path metric can be manipulated with an EIGRP offset list
or the use of a distribute-list when a route-map is used. In both scenarios, EIGRP
modifies the metric through the total delay path attribute. When small values are scaled for
EIGRP, precision can be lost on IOS-based routers because they use
integer math. These devices may not be able to register a difference between the value of
4007 and 4008, whereas a Nexus switch can.
In general, use larger values where the rounding does not have an effect on the path
decision. Be sure to accommodate decisions that could be impacted further away from
where the change is being made.
Redistribution
Every routing protocol has a different methodology for calculating the best path for a
route. For example, EIGRP can use bandwidth, delay, load, and reliability for calculating
its best path, whereas OSPF primarily uses the path metric for calculating the shortest
path first (SPF) tree (SPT). OSPF cannot calculate the SPF tree using EIGRP path
attributes, and EIGRP cannot run the Diffusing Update Algorithm (DUAL) using only the total
path metric. Relevant metrics must therefore be provided to the destination protocol at the
time of redistribution so that it can calculate the best path for the redistributed
routes.
Redistributing into EIGRP uses the command redistribute [bgp asn | direct | eigrp
process-tag | isis process-tag | ospf process-tag | rip process-tag | static] route-map
route-map-name. A route-map is required as part of the redistribution process on Nexus
switches.
Every protocol provides a seed metric at the time of redistribution that allows the desti-
nation protocol to calculate a best path. EIGRP uses the following logic when setting the
seed metric:
■ The default seed metric on Nexus switches is 100,000 Kbps for minimum band-
width, 1000 μs of delay, reliability of 255, load of 1, and MTU of 1492.
■ The default seed metric is not needed, and path attributes are preserved when
redistributing between EIGRP processes.
Note The default seed metric behavior on Nexus switches is different from IOS and IOS
XR routers that use a default seed value of infinity. Setting the seed metric to infinity pre-
vents routes from being installed into the topology table.
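To see what the default seed values mean in practice, the classic metric they produce can be computed directly. The sketch assumes default K values (only bandwidth and delay contribute) and the classic formula; the SeedMetric type and function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedMetric:
    """Default seed metric on Nexus switches, as listed above."""
    bandwidth_kbps: int = 100_000
    delay_us: int = 1000
    reliability: int = 255
    load: int = 1
    mtu: int = 1492


def classic_composite(seed: SeedMetric) -> int:
    """Classic composite metric with default K values (K1 = K3 = 1)."""
    return 256 * (10**7 // seed.bandwidth_kbps + seed.delay_us // 10)


default_seed = SeedMetric()
print(classic_composite(default_seed))
# A hypothetical custom seed with a 1 Gbps bandwidth and 10 microseconds of delay.
print(classic_composite(SeedMetric(bandwidth_kbps=1_000_000, delay_us=10)))
```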
The default seed metrics can be changed to different values for bandwidth, load, delay,
reliability, and maximum transmission unit (MTU) if desired. The EIGRP process
command default-metric bandwidth delay reliability load mtu changes the value for
all routes that are redistributed into that process, or the command set metric
bandwidth delay reliability load mtu can be used for selective manipulation within a
route-map.
Example 7-44 provides the necessary configuration to demonstrate the process of redis-
tribution. NX-1 redistributes the connected routes for 10.1.1.0/24 and 10.11.11.0/24 in
lieu of them being advertised with the EIGRP routing protocol. Notice that the route-
map can be a simple permit statement without any conditional matches.
Example 7-45 displays the routing table on NX-2. The 10.1.1.0/24 and 10.11.11.0/24
routes are tagged as external, and the AD is set to 170. The topology table is shown to
display the EIGRP path metrics. Notice that EIGRP contains an attribute for the source
protocol (Connected) as part of the route advertisement from NX-1.
Note EIGRP router-ids are used as a loop prevention mechanism for external routes. An
EIGRP router does not install an external route that contains the router-id that matches
itself. Ensuring unique router-ids on all devices in an EIGRP AS prevents problems with
external EIGRP routes.
Example 7-46 provides some metric calculations for common LAN interface speeds.
Notice that there is no differentiation between an 11 Gbps interface and a 20 Gbps
interface. The composite metric stays at 256 despite having different bandwidth rates.
GigabitEthernet:
Scaled Bandwidth = 10,000,000 / 1,000,000
Scaled Delay = 10 / 10
Composite Metric = (10 + 1) * 256 = 2816
10 GigabitEthernet:
Scaled Bandwidth = 10,000,000 / 10,000,000
Scaled Delay = 10 / 10
Composite Metric = (1 + 1) * 256 = 512
11 GigabitEthernet:
Scaled Bandwidth = 10,000,000 / 11,000,000
Scaled Delay = 10 / 10
Composite Metric = (0 + 1) * 256 = 256
20 GigabitEthernet:
Scaled Bandwidth = 10,000,000 / 20,000,000
Scaled Delay = 10 / 10
Composite Metric = (0 + 1) * 256 = 256
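The same calculations can be reproduced in Python to make the truncation visible. The sketch assumes default K values and uses integer division for the scaled bandwidth, following the results shown in the example; the function name and the interface list are illustrative.

```python
def classic_composite_metric(bandwidth_kbps: int, delay_us: int = 10) -> int:
    """Classic EIGRP composite metric with default K values."""
    scaled_bandwidth = 10_000_000 // bandwidth_kbps  # truncates below 1 to 0
    scaled_delay = delay_us // 10
    return (scaled_bandwidth + scaled_delay) * 256


interfaces = [
    ("GigabitEthernet", 1_000_000),
    ("10 GigabitEthernet", 10_000_000),
    ("11 GigabitEthernet", 11_000_000),
    ("20 GigabitEthernet", 20_000_000),
]

for name, kbps in interfaces:
    print(f"{name}: {classic_composite_metric(kbps)}")
```

Any interface faster than 10 Gbps yields a scaled bandwidth of zero, so only the delay term remains in the metric.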
EIGRP includes support for a second set of metrics, known as wide metrics, that
addresses the issue of scalability with higher-capacity interfaces. EIGRP wide metrics are
supported on NX-OS but must be explicitly configured to be enabled.
Note IOS routers support EIGRP wide metrics only in named configuration mode, and
IOS-XR routers use wide metrics by default.
Figure 7-9 shows the explicit EIGRP wide metric formula. Notice that an additional
K value (K6) is included that adds an extended attribute to measure jitter, energy, or
other future attributes.
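\[
\text{Wide Metric} = \left[ \left( K_1 \cdot BW + \frac{K_2 \cdot BW}{256 - \text{Load}} + K_3 \cdot \text{Latency} + K_6 \cdot \text{Extended} \right) \cdot \frac{K_5}{K_4 + \text{Reliability}} \right]
\]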
Just as EIGRP scaled by 256 to accommodate IGRP, EIGRP wide metrics scale by
65,535 to accommodate higher-speed links. This provides support for interface speeds
up to 655 terabits per second (65,535 * 10^7) without encountering any scalability issues.
Latency is the total interface delay measured in picoseconds (10^-12) instead of measuring
in microseconds (10^-6), which scales as well with higher speed interfaces. Figure 7-10
displays the updated formula that takes into account the conversions in latency and
scalability.
[Figure 7-10: updated EIGRP wide metric formula with the latency and scalability conversions]
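The benefit of the larger scaling factor is easiest to see on the bandwidth term alone. The sketch below is not the full wide metric formula: it only multiplies the classic scaled bandwidth by the 65,535 factor quoted in the text, with integer division, to show that 11 Gbps and 20 Gbps interfaces no longer collapse to the same value. The function names are hypothetical.

```python
WIDE_SCALE = 65_535  # scaling factor as given in the text


def classic_scaled_bandwidth(bandwidth_kbps: int) -> int:
    """Classic bandwidth term: 10^7 divided by the bandwidth in Kbps."""
    return 10**7 // bandwidth_kbps


def wide_scaled_bandwidth(bandwidth_kbps: int) -> int:
    """Bandwidth term after applying the wide metric scaling factor."""
    return WIDE_SCALE * 10**7 // bandwidth_kbps


for label, kbps in (("11 Gbps", 11_000_000), ("20 Gbps", 20_000_000)):
    print(label, classic_scaled_bandwidth(kbps), wide_scaled_bandwidth(kbps))
```

The scaled value only drops to 1 at 65,535 * 10^7 Kbps, which is the roughly 655 terabits per second ceiling mentioned above.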
EIGRP wide metrics were designed with backward compatibility in mind. EIGRP wide
metrics set K1 and K3 to a value of 1, and K2, K4, K5, and K6 are set to 0, which allows
backward compatibility because the K value metrics match with Classic metrics. As
long as K1–K5 are the same and K6 is not set, the two metric styles allow an adjacency
between routers.
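The compatibility rule in this paragraph can be expressed as a small check. This is a sketch of the rule exactly as stated in the text (K1 through K5 must match and K6 must not be set); the KValues type and the function name are hypothetical and are not part of any Cisco API.

```python
from typing import NamedTuple


class KValues(NamedTuple):
    """K values with the defaults described in the text (K1 = K3 = 1)."""
    k1: int = 1
    k2: int = 0
    k3: int = 1
    k4: int = 0
    k5: int = 0
    k6: int = 0


def can_form_adjacency(local: KValues, remote: KValues) -> bool:
    """Adjacency is allowed when K1-K5 match and K6 is not set on either side."""
    k1_to_k5_match = local[:5] == remote[:5]
    k6_unset = local.k6 == 0 and remote.k6 == 0
    return k1_to_k5_match and k6_unset


if __name__ == "__main__":
    print(can_form_adjacency(KValues(), KValues()))      # defaults on both sides
    print(can_form_adjacency(KValues(k2=1), KValues()))  # K2 mismatch
```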
Note The metric style used by a Nexus switch is identified with the command show ip
eigrp. If a K6 metric is present, the router is using wide style metrics.
EIGRP can detect when it is peering with a router that uses classic metrics, and it
unscales the metric using the formula in Figure 7-11.
This conversion results in a loss of clarity if routes pass through a mixture of classic
metric and wide metric devices. An end result of this intended behavior is that paths
learned via wide metric peers always look better than paths learned via classic metric
peers. This could lead to suboptimal routing.
Returning to the topology from Figure 7-7, let’s examine the effects of changing the Nexus
switches to EIGRP wide metrics. Example 7-47 displays how the path metrics have
changed for the 10.1.1.0/24 network that is advertised (Ethernet1/3). Notice that the
minimum bandwidth has not changed, but the delay is now measured in picoseconds.
Example 7-48 displays the EIGRP topology table for the 10.1.1.0/24 network on NX-6
while classic metric values are configured on the entire network. Notice that both paths
have the same FD of 1,280.
Example 7-49 displays the EIGRP topology table for the 10.1.1.0/24 network on NX-6
after wide metrics have been enabled on NX-1 and NX-2. EIGRP classic metric values are
configured on the remaining switches in the network. Notice that the total delay has
changed on the path NX-1 → NX-2 → NX-3 → NX-6 to 30 μs. This is because the first two
hops of this path were calculated using picoseconds instead of microseconds, resulting
in a 10 μs reduction. NX-6 now uses only this path for forwarding traffic.
Vector metric:
Minimum bandwidth is 10000000 Kbit
Total delay is 40 microseconds
..
Hop count is 3
Example 7-50 displays the EIGRP topology table for the 10.1.1.0/24 network on NX-5
and NX-6 after wide metrics have been enabled on NX-1, NX-2, and NX-3. The
delay is now reduced to 20 μs along the NX-1 → NX-2 → NX-3 → NX-6 path. The path
NX-1 → NX-4 → NX-5 → NX-6 no longer passes the feasible successor condition on
NX-6 and does not show up in the topology table.
Notice that NX-5 has now calculated the path NX-1 → NX-2 → NX-3 → NX-6 → NX-5
with the same amount of delay as NX-1 → NX-4 → NX-5. When traffic is load balanced,
a portion of it is forwarded suboptimally along the longer path.
The number of classic or wide metric EIGRP neighbors is identified by looking at the
EIGRP interfaces in nonbrief format. Example 7-51 displays the command and relevant
output on NX-6.
Example 7-52 displays the EIGRP topology table for the 10.1.1.0/24 network on NX-6
and NX-5 after wide metrics were enabled on NX-1, NX-2, NX-3, and NX-6. NX-6
contains only the wide metric path, and the delay is shown only in picoseconds.
NX-5 has now calculated the path NX-1 → NX-2 → NX-3 → NX-6 → NX-5 as the best path
due to the unscaling formula. All traffic to the 10.1.1.0/24 network takes the longer path.
Careful planning is needed when enabling EIGRP wide metrics. When enabling wide
metrics, it is best to enable all the devices in an area or along the same path to a destina-
tion to ensure optimal routing.
When EIGRP detects that it has lost its successor for a path, the feasible successor
instantly becomes the successor route, providing a backup route. The Nexus switch sends
out an Update packet for that path because of the new EIGRP path metrics. Downstream
switches run their own DUAL algorithm for any impacted prefixes to account for the
new EIGRP metrics. A switch's successor route or feasible successor for a prefix can
change when it receives new EIGRP metrics from a successor switch.
Figure 7-12 demonstrates such a scenario when the link between NX-1 and NX-3 fails.
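[Figure 7-12: topology for the failure of the NX-1 to NX-3 link, with 10.1.1.0/24 attached to NX-1. Link metrics: NX-1 to NX-2 (9), NX-2 to NX-3 (10), NX-1 to NX-3 (10), NX-3 to NX-5 (10), NX-1 to NX-4 (20), NX-4 to NX-5 (5). Reported distances shown: NX-2 RD 9, NX-3 RD 19, NX-4 RD 20.]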
■ NX-3 installs the feasible successor path advertised from NX-2 as the successor route.
■ NX-3 sends an Update packet with a new RD of 19 for the 10.1.1.0/24 prefix to NX-5.
■ NX-5 receives the Update packet from NX-3 and calculates a FD of 29 for the
NX-3 → NX-2 → NX-1 path to 10.1.1.0/24.
■ NX-5 compares that path to the one received from NX-4, which has a path metric of 25.
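The comparison NX-5 performs can be sketched in a few lines. The reported distances (19 and 20) come from the text; the link costs from NX-5 to NX-3 (10) and to NX-4 (5) are derived from the quoted totals of 29 and 25 and should be treated as assumptions. The feasibility test (a neighbor's RD must be lower than the successor's path metric) is the standard EIGRP condition, simplified here to a single point-in-time comparison; the function name is illustrative.

```python
def evaluate_paths(paths):
    """paths maps next hop -> (reported_distance, cost_of_link_to_next_hop)."""
    # Total path metric = neighbor's reported distance + cost to reach that neighbor.
    totals = {hop: rd + link for hop, (rd, link) in paths.items()}
    successor = min(totals, key=totals.get)
    feasible_distance = totals[successor]
    # A non-successor neighbor qualifies as a feasible successor only when
    # its reported distance is lower than the feasible distance.
    feasible_successors = sorted(
        hop for hop, (rd, _) in paths.items()
        if hop != successor and rd < feasible_distance
    )
    return successor, feasible_distance, feasible_successors


# NX-5's view of 10.1.1.0/24 after the NX-1 to NX-3 link fails.
NX5_PATHS = {"NX-3": (19, 10), "NX-4": (20, 5)}

if __name__ == "__main__":
    print(evaluate_paths(NX5_PATHS))
```

Under these assumptions, NX-5 keeps the path through NX-4 (metric 25) as the successor, and the path through NX-3 remains a feasible successor because its RD of 19 is lower than 25.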
Example 7-53 provides simulated output of NX-5’s EIGRP topology table for the
10.1.1.0/24 prefix after the link between NX-1 and NX-3 fails.
Vector metric:
..
Hop count is 3
Originating router is 192.168.1.1
If a feasible successor is not available for the prefix, DUAL must perform a new
route computation. The route state changes from Passive (P) to Active (A) in the EIGRP
topology table.
Active Query
The router detecting the topology change sends out Query packets to EIGRP neighbors
for the route. The Query packet includes the network prefix with the delay set to
infinity so that other routers are aware that it has gone Active. When the router sends
the EIGRP Query packets, it sets the Reply status flag for each neighbor on a
per-prefix basis.
Upon receipt of a Query packet, an EIGRP router does one of the following:
■ Replies to the Query that it does not have a route to the prefix.
■ If the Query did not come from the successor for that route, it detects the delay set
to infinity but ignores it because it did not come from the successor. The receiving
router replies with the EIGRP attributes for that route.
■ If the Query came from the successor for the route, the receiving router detects the
delay set to infinity, sets the prefix as Active in the EIGRP topology, and sends out
a Query packet to all downstream EIGRP neighbors for that route.
The Query process continues from router to router until a router establishes the Query
boundary. A Query boundary is established when a router does not mark the prefix as
Active, meaning that it responds to a Query in one of the following ways:
■ Replying that it does not have a route to the prefix
■ Replying with EIGRP attributes because the Query did not come from the
successor
When a router receives a Reply for all its downstream Queries, it completes the
DUAL algorithm, changes the route to Passive, and sends a Reply packet to any
upstream routers that sent a Query packet to it. Upon receipt of a Reply packet
for a prefix, the Reply is recorded against that neighbor and prefix. The Reply
process continues upstream until the router that originated the first Queries has
received a Reply to each of them.
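The decision a router makes on receiving a Query can be summarized as a small function. This is a simplified sketch of the cases listed above, not an implementation of DUAL. The names are hypothetical, and the handling of a feasible successor (reply with the backup path instead of going Active) is an assumption drawn from the condition that a router goes Active only when no feasible successor exists.

```python
from enum import Enum


class Action(Enum):
    REPLY_NO_ROUTE = "reply: no route to the prefix"
    REPLY_WITH_ROUTE = "reply: EIGRP attributes for the route"
    GO_ACTIVE_AND_QUERY = "mark prefix Active and query downstream neighbors"


def handle_query(has_route: bool, from_successor: bool,
                 has_feasible_successor: bool = False) -> Action:
    """Return the action a router takes for a received Query (simplified)."""
    if not has_route:
        return Action.REPLY_NO_ROUTE
    if not from_successor:
        # The infinite delay is ignored because the sender is not the successor.
        return Action.REPLY_WITH_ROUTE
    if has_feasible_successor:
        # Assumption: a feasible successor lets the router answer immediately.
        return Action.REPLY_WITH_ROUTE
    return Action.GO_ACTIVE_AND_QUERY


def is_query_boundary(action: Action) -> bool:
    """A router forms the Query boundary when it does not go Active."""
    return action is not Action.GO_ACTIVE_AND_QUERY
```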
Figure 7-13 represents a topology where the link between NX-1 and NX-2 has failed.
[Figure 7-13: topology of NX-1 (10.1.1.0/24), NX-2, NX-3, NX-4, and NX-5, with link metrics of 10 and 35 and the Query packets annotated.]
The following steps are processed in order from the perspective of NX-2 calculating a
new route to the 10.1.1.0/24 network.
Step 1. NX-2 detects the link failure. NX-2 does not have a feasible successor for
the route, sets the 10.1.1.0/24 prefix as active, and sends Queries to NX-3
and NX-4.
Step 2. NX-3 receives the Query from NX-2 and processes the delay field that is
set to infinity. NX-3 does not have any other EIGRP neighbors and sends
a Reply to NX-2 that a route does not exist. NX-4 receives the Query
from NX-2 and processes the delay field that is set to infinity. Because
the Query was received from the successor, and a feasible successor for the
prefix does not exist, NX-4 marks the route as active and sends a Query
to NX-5.
Step 3. NX-5 receives the Query from NX-4 and detects that the delay field is set to
infinity. Because the Query was received from a nonsuccessor, and a successor
exists on a different interface, a Reply for the 10.1.1.0/24 network is sent
back to NX-4 with the appropriate EIGRP attributes.
Step 4. NX-4 receives NX-5’s Reply, acknowledges the packet, and computes a new
path. Because this is the last outstanding Query packet on NX-4, NX-4 sets
the prefix as passive. With all Queries satisfied, NX-4 responds to NX-2’s
Query with the new EIGRP metrics.
Step 5. NX-2 receives NX-4’s Reply, acknowledges the packet, and computes a new
path. Because this is the last outstanding Query packet on NX-2, NX-2 sets
the prefix as passive.
Stuck in Active
DUAL is very efficient at finding loop-free paths quickly, and normally finds a backup
path in seconds. Occasionally an EIGRP Query is delayed because of packet loss,
slow neighbors, or a large hop count. EIGRP waits half of the active timer (90 seconds
default) for a Reply. If the router does not receive a response within 90 seconds, the
originating router sends a Stuck In Active (SIA) Query to EIGRP neighbors that have
not responded.
Upon receipt of a SIA-Query, the router should respond within 90 seconds with a
SIA-Reply. A SIA-Reply contains the route information or provides information on
the Query process itself. If a router fails to respond to a SIA-Query by the time the
active timer expires, EIGRP deems the router Stuck In Active (SIA). If the SIA state
is declared for a neighbor, DUAL deletes all routes from that neighbor and treats the
situation as if the neighbor responded with unreachable messages for all routes. Active
Queries are shown with the command show ip eigrp topology active.
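The timing described above can be laid out as a simple timeline. This sketch assumes an active timer of 180 seconds (twice the 90-second half-timer in the text) and models only the two thresholds; the function name and the returned labels are illustrative, not NX-OS output.

```python
def query_status(elapsed_s: int, active_timer_s: int = 180) -> str:
    """Classify an unanswered Query by how long the router has waited."""
    sia_query_at = active_timer_s // 2  # a SIA-Query is sent at half the active timer
    if elapsed_s >= active_timer_s:
        # No SIA-Reply arrived before the active timer expired.
        return "stuck-in-active"
    if elapsed_s >= sia_query_at:
        return "sia-query-sent"
    return "waiting-for-reply"


if __name__ == "__main__":
    for seconds in (30, 90, 179, 180):
        print(seconds, query_status(seconds))
```

Changing active_timer_s shows how a shorter or longer active timer moves both thresholds together.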
Figure 7-14 shows a topology where the link between NX-1 and NX-2 has failed. NX-2
sends out Queries to NX-4 and NX-3 for the 10.1.1.0/24 and 10.12.1.0/24 networks.
NX-4 sends a Reply back to NX-2, and NX-3 sends a Query on to R5, which then sends
a Query on to R6.
[Figure 7-14: topology of NX-1, NX-2, NX-3, NX-4, R5, and R6 for the 10.1.1.0/24 and 10.12.1.0/24 networks, annotated with the Query and Reply packets.]
A network engineer sees the syslog message for the down link, immediately runs
the show ip eigrp topology active command on NX-2, and sees the output from
Example 7-54.
The “r” next to 10.23.1.3 indicates that NX-2 is still waiting on the Reply from NX-3.
NX-1 is registered as down, and the path is set to infinity. The show ip eigrp topology
command can then be executed on NX-3, which indicates it is waiting on a response
from R5. Then the command can be run again on R5, which indicates it is waiting on
R6. Executing the command on R6 does not show any active prefixes, implying that R6
never received a Query from R5. R5’s Query could have been dropped on the wireless
connection.
After the 90-second window has passed, the switch sends out a SIA-Query, which is seen
by examining the EIGRP traffic counters. Example 7-55 displays the traffic counters
before and after the 90-second window.
Example 7-55 EIGRP Traffic Counters with SIA Queries and Replies
Example 7-56 displays the EIGRP topology table after the SIA Replies are received. And
just after that, the SIA message appears in the syslog, and the EIGRP peering is reset.
NX-2
03:57:41 NX-2 %EIGRP-3-SIA_DUAL: eigrp-NXOS [8394] (default-base) Route
10.12.1.0/24 stuck-in-active state in IP-EIGRP(0) 100. Cleaning up
03:57:41 NX-2 %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [8394] (default-base) IP-EIGRP(0)
100: Neighbor 10.23.1.3 (Ethernet1/1) is down: stuck in active
03:57:42 NX-2 %EIGRP-5-NBRCHANGE_DUAL: eigrp-NXOS [8394] (default-base) IP-EIGRP(0)
100: Neighbor 10.23.1.3 (Ethernet1/1) is up: new adjacency
Having an invalid route stuck in the routing table because of a busy router can be
frustrating. There are two possible solutions:
■ Change the active timer to a different value with the command timers active-time
{disabled | 1-65535_minutes} under the EIGRP process.
■ Use network summarization within the network design. EIGRP summarization is
useful for creating Query boundaries, which limit how far a Query propagates.
The active timer is shown by examining the EIGRP process with the show ip eigrp com-
mand. The SIA timer is displayed in the Active Timer field. Example 7-57 displays the
active timer value of three minutes.
Summary
This chapter provided a logical overview of how the most common issues with EIGRP
can be identified so that any issue can be remediated.
The following parameters must match when troubleshooting EIGRP adjacency with
other devices:
■ Authentication parameters.
EIGRP is a distance vector routing protocol, which creates a topology map based on the
information it has received from downstream neighbors. When troubleshooting subopti-
mal path selection or missing routes, it is best to start at the destination and work toward
the source of the route. Along each hop, the following items should be checked to see if
there is explicit modification of path information:
■ Manipulation of metrics. This can be an offset list to increase the metric for that
path, or the explicit configuration of bandwidth or delay for an interface.
■ A router that is using two different processes to contain the upstream and down-
stream routing interface. In these instances the routes need to be mutually redistrib-
uted between the processes.
■ Poorly planned implementation of EIGRP wide metrics that does not take into
account the scale factor on higher speed interfaces.
EIGRP’s DUAL algorithm is extremely intelligent and overcomes barriers that apply to
most vector-based routing protocols. DUAL provides fast convergence, but occasionally
has difficulties during convergence when remote routers become unresponsive. The con-
vergence time period can be reduced by implementing lower SIA timers, or through the
deployment of route summarization.
References
RFC 7868, Cisco’s Enhanced Interior Gateway Routing Protocol (EIGRP).
Savage, D., J. Ng, S. Moore, et al. IETF, https://tools.ietf.org/html/rfc7868, May 2016.
Edgeworth, Brad, Aaron Foss, and Ramiro Garza Rios. IP Routing on Cisco IOS, IOS
XE and IOS XR. Indianapolis: Cisco Press, 2014.
Chapter 8
Open Shortest Path First (OSPF) is a link-state routing protocol that provides every
router with a complete map for all destination networks. Every router in the network
calculates the best, shortest, loop-free paths using this complete map of the network.
This chapter focuses on identifying and troubleshooting issues with forming
OSPF neighbor adjacencies, path selection, and missing routes.
OSPF Fundamentals
OSPF advertises link-state advertisements (LSA), which contain the link state and metric,
to neighboring routers. Received LSAs are stored in a local database called the link-state
database (LSDB) and are then advertised to neighboring routers exactly as they were
received. The same LSA is flooded throughout the OSPF area just as the advertising
router advertised it. The LSDB provides the topology of the network, in essence
providing the router a complete map of the network.
All routers run the Dijkstra Shortest Path First (SPF) algorithm to construct a loop-free
topology of shortest paths. Each router sees itself as the top of the tree, and the tree
contains all network destinations within the OSPF domain. The SPF Tree (SPT) is dif-
ferent for each OSPF router, but the LSDB used to calculate the SPT is identical for all
OSPF routers in that area.
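The idea that every router runs SPF over the same LSDB but places itself at the root can be illustrated with a compact Dijkstra implementation. This is a generic textbook sketch, not the NX-OS implementation; the three-router LSDB and its link costs are hypothetical.

```python
import heapq


def shortest_paths(lsdb, root):
    """Dijkstra SPF: lowest cost from root to every reachable router in lsdb."""
    dist = {root: 0}
    heap = [(0, root)]
    while heap:
        cost, node = heapq.heappop(heap)
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry; a cheaper path was already found
        for neighbor, link_cost in lsdb.get(node, {}).items():
            new_cost = cost + link_cost
            if new_cost < dist.get(neighbor, float("inf")):
                dist[neighbor] = new_cost
                heapq.heappush(heap, (new_cost, neighbor))
    return dist


# Hypothetical LSDB shared by all routers in the area: router -> {neighbor: cost}
LSDB = {
    "R1": {"R2": 10, "R3": 30},
    "R2": {"R1": 10, "R3": 10},
    "R3": {"R1": 30, "R2": 10},
}

if __name__ == "__main__":
    print(shortest_paths(LSDB, "R1"))  # tree rooted at R1
    print(shortest_paths(LSDB, "R3"))  # same LSDB, different root
```

Both calls use the identical LSDB, yet each produces a tree from its own root's perspective, which mirrors the distinction between the shared LSDB and the per-router SPT.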
450 Chapter 8: Troubleshooting Open Shortest Path First (OSPF)
Inter-Router Communication
OSPF runs directly over IP as its own protocol (protocol number 89) and uses multicast
where possible to reduce unnecessary traffic. The two OSPF multicast addresses are as
follows:
■ AllSPFRouters: 224.0.0.5
■ AllDRouters: 224.0.0.6
Within the OSPF protocol are five types of packets. Table 8-1 provides an overview of
the OSPF packet types and a brief description for each type.
Neighbor States
An OSPF neighbor is a router that shares a common OSPF-enabled network link. OSPF
routers discover other neighbors via the OSPF Hello packets. An adjacent OSPF neigh-
bor is an OSPF neighbor that shares a synchronized OSPF database between the two
neighbors.
Each OSPF process maintains a table for adjacent OSPF neighbors and the state of each
router. Table 8-3 provides an overview of the OSPF neighbor states.
State      Description
Init       A Hello packet has been received from another router, but bidirectional communication has not been established.
2-Way      Bidirectional communication has been established. If a Designated Router or Backup Designated Router is needed, the election occurs during this state.
ExStart    This is the first state of forming an adjacency. Routers identify which router will be the master or slave for the LSDB synchronization.
Exchange   During this state, routers are exchanging link states via DBD packets.
Loading    LSR packets are sent to the neighbor asking for the more recent LSAs that have been discovered (but not received) in the Exchange state.
Full       Neighboring routers are fully adjacent.
Designated Routers
Multi-access networks such as Ethernet (LANs) allow more than two routers to exist
on a network segment. This could cause scalability problems with OSPF as the number
of routers on a segment increases. Additional routers flood more LSAs on the segment,
and OSPF traffic becomes excessive as OSPF neighbor adjacencies increase. If 6 routers
share the same multi-access network, 15 OSPF adjacencies would form along with 15
occurrences of database flooding on that one network.
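The figure of 15 adjacencies follows from the number of router pairs on the segment, n(n - 1) / 2. The short sketch below simply evaluates that expression; the function name is illustrative.

```python
def full_mesh_adjacencies(routers: int) -> int:
    """Number of router pairs on a segment where every router peers with every other."""
    return routers * (routers - 1) // 2


if __name__ == "__main__":
    for count in (2, 4, 6, 10):
        print(count, full_mesh_adjacencies(count))
```

The count grows quadratically, so doubling the routers on a segment roughly quadruples the adjacencies.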
The DROther is a router on the DR-enabled segment that is not the DR or the BDR; it is
simply the other router.
Note Neighbors are selected as the DR and BDR based on the highest OSPF priority,
followed by higher Router ID (RID) when the priority is a tie. The OSPF priority is set on
an interface with the command ip ospf priority 0-255. Setting the value to zero prevents
that router from becoming a DR for that segment.
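The selection rule in the note can be sketched as a sort. This is a deliberate simplification that ranks all routers at once: it ignores the fact that an existing DR is not preempted and that the real election evaluates the BDR first. The RIDs and priorities are hypothetical.

```python
from ipaddress import IPv4Address


def elect(routers):
    """routers: list of (router_id, priority). Returns (dr, bdr) router IDs."""
    # Priority 0 removes a router from the election entirely.
    eligible = [r for r in routers if r[1] > 0]
    # Highest priority wins; the higher RID breaks a tie.
    ranked = sorted(eligible, key=lambda r: (r[1], IPv4Address(r[0])), reverse=True)
    dr = ranked[0][0] if ranked else None
    bdr = ranked[1][0] if len(ranked) > 1 else None
    return dr, bdr


SEGMENT = [
    ("192.168.1.1", 1),
    ("192.168.2.2", 1),
    ("192.168.3.3", 0),   # never becomes DR or BDR
    ("192.168.4.4", 5),
]

if __name__ == "__main__":
    print(elect(SEGMENT))
```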
Areas
OSPF provides scalability for the routing table by using multiple OSPF areas within the
routing domain. Each OSPF area provides a collection of connected networks and hosts
that are grouped together. OSPF uses a two-tier hierarchical architecture where Area 0
is a special area known as the backbone, and all other OSPF areas must connect to
Area 0. In other words, Area 0 provides transit connectivity between nonbackbone areas.
Nonbackbone areas advertise routes into the backbone, and the backbone then adver-
tises routes into other nonbackbone areas.
The exact topology of the area is invisible from outside of the area while still providing
connectivity to routers outside of the area. This means that routers outside the area do
not have a complete topological map for that area, which reduces OSPF network traffic
in that area. By segmenting an OSPF routing domain into multiple areas, it is no longer
true that all OSPF routers will have identical LSDBs; however, all routers within the
same area will have identical area LSDBs. The reduction in routing traffic uses less router
memory and resources providing scalability.
Area Border Routers (ABR) are OSPF routers connected to Area 0 and another OSPF
area. ABRs are responsible for advertising routes from one area and injecting them into a
different OSPF area. Every ABR needs to participate in Area 0; otherwise, routes are not
advertised into another area.
When a router redistributes external routes into an OSPF domain, the router is called an
Autonomous System Boundary Router (ASBR). An ASBR can be any OSPF router, and
the ASBR function is independent of the ABR function.
Router links are classified as either stub or transit. A stub router link
includes a netmask, whereas a transit link does not.
2 Network Link—Type-2 LSAs represent multi-access network segments that
use a DR. The DR always advertises the Type-2 LSA, and connects the Type-1
transit link type LSAs together. Type-2 LSAs also provide the network mask
for the Type-1 transit link types.
When an ABR receives a Type-1 LSA, it creates a Type-3 LSA referencing the
network in the original Type-1 LSA. A Type-2 LSA is used to determine the
network mask of a multi-access network. The ABR then advertises the Type-3
LSA into other areas.
Only Type-1 or Type-2 LSAs provide a method to locate the RID within an
area. The Type-4 LSA provides a way for routers to locate the ASBR when
the router is in a different area from the ASBR.
5 AS External—When a route is redistributed into OSPF on the ASBR, the
external route is flooded throughout the entire OSPF domain as a Type-5 LSA.
Type-5 LSAs are not associated to a specific area and are flooded across all
ABRs. Only the LSA age is modified during flooding.
7 NSSA External—Not So Stubby Areas (NSSA) areas are a method to reduce
the LSDB within an area by preventing Type-4 and Type-5 LSAs while allow-
ing redistribution of networks into the area. A Type-7 LSA exists only in
NSSA areas where the route redistribution is occurring.
An ASBR injects external routes as Type-7 LSAs into an NSSA area. The
ABR does not advertise Type-7 LSAs outside of the originating NSSA
area, but advertises a Type-5 LSA for the other OSPF areas. If the Type-5
LSA crosses Area 0, then the second ABR creates a Type-4 LSA for the
Type-5 LSA.
Note Every LSA contains the advertising router’s RID. The router RID represents the
router and is how links are connected to each other.
Figure 8-1 displays a multi-area OSPF topology with an external route redistributed
into Area 56. On the left of the figure is the network prefix for the topology,
and the appropriate LSA type is displayed underneath the segment where it is advertised.
This demonstrates where each LSA is located. Notice that Area 1234 is a broadcast
area and contains a DR, which generates a Type-2 LSA. NX-6 is redistributing the
100.65.0.0/16 network into OSPF, whereas NX-5 advertises the first Type-4 LSA for
the ASBR (NX-6).
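[Figure 8-1: multi-area OSPF topology with NX-1 through NX-6, the networks 10.1.1.0/24, 10.123.1.0/24, 10.34.1.0/24, 10.45.1.0/24, and 10.56.1.0/24, three point-to-point links, and the external network 100.65.0.0/16. Beneath the topology, the figure lists the LSA types (Type-1, Type-3, Type-4, and Type-5) that carry 10.1.1.0/24 and 100.65.0.0/16 on each segment.]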
Note The Cisco Press book IP Routing on Cisco IOS, IOS XE and IOS XR
describes OSPF LSAs and how a router builds the actual topology table using LSAs in
a visual manner.
■ Intra-area routes: Routes for networks that exist in that OSPF area and contain
intra beside them in the routing table.
■ Inter-area routes: Routes for networks that exist in the OSPF domain from a
different OSPF area and contain inter beside them in the routing table.
■ External routes: Routes that were redistributed into the OSPF domain and contain
a type-1 or type-2 beside them in the routing table.
Example 8-1 displays the routing table from NX-1 from Figure 8-1 that includes intra-
area, inter-area, and external OSPF routes.
Step 1. Enable the OSPF feature. The OSPF feature must be enabled with the global
configuration command feature ospf.
Step 2. Define an OSPF process tag. The OSPF process must be defined with the
global configuration command router ospf process-tag. The process-tag can
be up to 20 alphanumeric characters in length.
Step 3. Enable OSPF neighbor logging (recommended). NX-OS does not log OSPF
neighbor adjacencies forming or dissolving by default. The OSPF configura-
tion command log-adjacency-changes [detail] enables logging and is recom-
mended. The optional detail keyword lists out the OSPF neighbor states
from Table 8-3 as they are entered.
Secondary networks are advertised by default after OSPF is enabled on that interface.
This behavior is disabled with the command ip router ospf process-tag area area-id
secondaries none.
Loopback interfaces are advertised as a /32 regardless of the actual subnet mask. The
command ip ospf advertise-subnet changes the behavior so that the subnet mask is
advertised with the LSA.
Note Typically, an interface can exist in only one area at a time. However, recent chang-
es allow an interface to exist in multiple areas across only point-to-point OSPF links with
the command ip router ospf process-tag multi-area area-id.
OSPF requires a neighbor relationship to form before routes are processed and added
to the RIB. The neighbor adjacency table is vital for tracking neighbor status and the
updates sent to each neighbor. This section explains the process for troubleshooting
OSPF neighbor adjacencies on NX-OS switches.
Figure 8-2 provides a simple topology with two Nexus switches that are used to explain
how to troubleshoot OSPF adjacency problems.
Example 8-3 demonstrates a couple of iterations of the command being run on NX-1.
Notice the additional information, such as the Dead timer and last change, that is
included with the detail keyword.
Table 8-5 provides a brief overview of the fields that appear in Example 8-3.
The second field is the DR, BDR, or DROther role if the interface
requires a DR. For non-DR network links, the second field will show "-".
Besides enabling OSPF on the network interfaces of an NX-OS device, the following
parameters must match for the two routers to become neighbors:
Table 8-6 provides an overview of the fields in the output from Example 8-3.
Example 8-5 displays the output of the show ip ospf interface command in non-
brief format. It is important to note that the primary IP address, interface network
type, DR, BDR, and OSPF interface timers are included as part of the information
provided.
Passive Interface
Some network topologies require advertising a network segment into OSPF, but need
to prevent neighbors from forming adjacencies on that segment. A passive interface
still appears in the OSPF interface output, so the quickest way to find out whether any
passive interfaces are configured is to check the OSPF process with the command
show ip ospf [process-tag]. Example 8-6 displays the command and where the passive
interface count is provided.
Now that a passive interface has been identified, the configuration must be examined for
the following:
■ The interface parameter command ip ospf passive-interface, which makes only that
interface passive.
Example 8-7 displays the configuration on NX-1 and NX-2 that prevents the two Nexus
switches from forming an OSPF adjacency. The Ethernet1/1 interfaces must be active on
both switches for an adjacency to form. Move the command ip ospf passive-interface
from Eth1/1 to VLAN10 on NX-1, and the command no ip ospf passive-interface from
VLAN20 to Interface Eth1/1 on NX-2 to allow an adjacency to form.
NX-1
interface loopback0
  ip router ospf NXOS area 0.0.0.0
interface Ethernet1/1
  ip ospf passive-interface
  ip router ospf NXOS area 0.0.0.0
interface VLAN10
  ip router ospf NXOS area 0.0.0.0
NX-2
interface loopback0
  ip router ospf NXOS area 0.0.0.0
interface Ethernet1/1
  ip router ospf NXOS area 0.0.0.0
interface VLAN20
  no ip ospf passive-interface
  ip router ospf NXOS area 0.0.0.0
Example 8-8 displays the use of this command. Notice that there is a separation of
errors and valid packets in the output. Executing the command while specifying an
interface provides more granular visibility to the packets received or transmitted for
an interface.
Example 8-9 displays the use of the OSPF hello and packet debugs.
Note Debug output can also be redirected to a logfile, as shown earlier in Chapter 2,
“NX-OS Troubleshooting Tools.”
Table 8-7 provides a brief description of the fields that are provided in the debug output
from Example 8-9.
Debug commands are generally the least preferred method for finding root cause
because of the amount of data that could be generated while the debug is enabled.
NX-OS provides event-history, which runs in the background without a performance hit
and offers another method of troubleshooting. The command show ip ospf
event-history [hello | adjacency | event] provides helpful information when
troubleshooting OSPF adjacency problems. The hello keyword provides the same
information as the debug command in Example 8-9.
Example 8-10 displays the show ip ospf event-history hello command. Examine the dif-
ference in the sample output on NX-1.
Performing OSPF debugs on a switch shows only the packets that have reached the
supervisor. If packets are not displayed in the debugs or event-history, further
troubleshooting must be performed by examining quality of service (QoS) policies, control
plane policing (CoPP), or simply verifying that packets are leaving or entering an interface.
QoS policies may or may not be deployed on an interface. If they are deployed, the
policy-map must be examined for any dropped packets, which must then be referenced to
a class-map that matches the OSPF routing protocol. The same process applies to CoPP
policies because they are based on QoS settings as well.
Example 8-11 displays the process for checking a switch’s CoPP policy with the
following logic:
1. Examine the CoPP policy with the command show running-config copp all. This
displays the relevant policy-map name, classes defined, and the police rate.
2. Investigate the class-maps to identify the conditional matches for that class-map.
3. After the class-map has been verified, examine the policy-map drops for that class
with the command show policy-map interface control-plane.
Note This CoPP policy was taken from a Nexus 7000 switch, and the policy-name and
class-maps may vary depending on the platform.
Because CoPP operates at the RP level, it is possible that the packets were received
on an interface but were not forwarded to the RP. The next phase is to identify whether
packets were transmitted or received on an interface. This technique involves creating a
specific access control entry (ACE) for the OSPF protocol. The ACE for OSPF should
appear before any other ambiguous ACEs to ensure a proper count. The ACL
configuration command statistics per-entry is required to display the specific hits that
are encountered per ACE.
Example 8-12 demonstrates the configuration of an ACL to detect OSPF traffic on the
Ethernet1/1 interface. Notice that the ACL includes a permit ip any any command to allow
all traffic to pass through this interface. Failing to do so could result in the loss of traffic.
Note There are three ACE entries for OSPF. The first two are tied to the multicast
groups for DR and BDR communication. The third ACE applies to the initial Hello
packets.
Note Example 8-12 uses an Ethernet interface, which generally indicates a one-to-one
relationship, but on multi-access interfaces like switched virtual interfaces (SVI), also
known as interface VLANs, the neighbor may need to be specified in a specific ACE.
An alternative to using an ACL is to use the built-in NX-OS Ethanalyzer to capture the
OSPF packets. Example 8-13 demonstrates the command syntax. The optional detail
keyword can be used to view the contents of the packets.
The subnet mask was changed on NX-2 from 10.12.1.200/24 to 10.12.1.200/25 for
this section. This places NX-2 on the 10.12.1.128/25 network, which is different from
NX-1’s (10.12.1.100) network.
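The effect of the mask change can be confirmed with Python's standard ipaddress module. This is only an offline illustration of the subnet arithmetic, using the addresses from the text; the helper name same_subnet is hypothetical.

```python
from ipaddress import ip_interface


def same_subnet(a: str, b: str) -> bool:
    """True when both interface addresses resolve to the same network."""
    return ip_interface(a).network == ip_interface(b).network


if __name__ == "__main__":
    print(ip_interface("10.12.1.100/24").network)  # NX-1's network
    print(ip_interface("10.12.1.200/25").network)  # NX-2's network after the change
    print(same_subnet("10.12.1.100/24", "10.12.1.200/25"))
```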
The OSPF neighbor table does not show any entries on either switch. Now
examine the OSPF Hello packets with the command show ip ospf event-history, as
shown in Example 8-14. Notice that OSPF was able to detect the wrong subnet mask
between the routers.
If the problem is due to a blatant subnet mismatch, the Hello packets
are not recognized in the OSPF debug or event-history output. Verifying connectivity with
the commands ping neighbor-ipaddress or show ip route neighbor-ipaddress shows that the
two devices are not on matching networks. Ensuring that the OSPF routers’ primary interface
addresses are on a common subnet allows proper communication.
Note OSPF RFC 2328 allows neighbors to form an adjacency using disjointed networks
only when using the ip unnumbered command on point-to-point OSPF network types.
NX-OS does not support IP unnumbered addressing, so this use case is not applicable.
MTU Requirements
OSPF DBD packets include the interface MTU. OSPF DBDs are
exchanged in the EXSTART and EXCHANGE neighbor states. Routers check the interface
MTU that is included in the DBD packets to ensure that they match. If the MTUs
do not match, the OSPF devices do not form an adjacency.
Example 8-15 shows that NX-1 and NX-2 started to form a neighbor adjacency
more than 3 minutes ago and are stuck in the EXSTART state.
Examine the OSPF event-history to identify the reason the switches are stuck in the
EXSTART state. Example 8-16 displays the OSPF adjacency event-history on NX-1, in
which the MTU from NX-2 has been detected as larger than the MTU on NX-1’s interface.
Note The MTU messages appear only on the device with the smaller MTU.
MTU is examined on both switches by using the command show interface interface-id
and looking for the MTU value, as shown in Example 8-17. The MTU on NX-2 is larger
than the MTU on NX-1.
The OSPF protocol itself does not know how to handle fragmentation. It relies on IP
fragmentation when packets are larger than the interface. It is possible to ignore the
MTU safety check by placing the interface parameter command ip ospf mtu-ignore on
the switch with the smaller MTU. Example 8-18 displays the configuration command on
NX-1 that allows it to ignore the larger MTU from NX-2.
interface Ethernet1/1
  ip ospf mtu-ignore
  ip router ospf NXOS area 0.0.0.0
interface VLAN10
  ip ospf passive-interface
  ip router ospf NXOS area 0.0.0.0
This technique allows for adjacencies to form, but may cause problems later. The sim-
plest solution is to change the MTU to match on all devices.
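The MTU safety check can be sketched as a small decision function. This is an illustrative Python model, not NX-OS code: the function name dbd_accepted and the MTU values 1500 and 9000 are hypothetical, and the rule that only a larger neighbor MTU is rejected follows the note that the MTU messages appear only on the device with the smaller MTU.

```python
def dbd_accepted(local_mtu: int, neighbor_mtu: int, mtu_ignore: bool = False) -> bool:
    """Model the DBD MTU check performed by the receiving router.

    A DBD advertising an MTU larger than the local interface MTU is rejected
    unless the check is disabled (the effect of ip ospf mtu-ignore).
    """
    if mtu_ignore:
        return True
    return neighbor_mtu <= local_mtu


# Hypothetical values: NX-1 has the smaller MTU, NX-2 the larger one.
NX1_MTU, NX2_MTU = 1500, 9000

print(dbd_accepted(NX1_MTU, NX2_MTU))                   # NX-1 evaluating NX-2's DBD
print(dbd_accepted(NX2_MTU, NX1_MTU))                   # NX-2 evaluating NX-1's DBD
print(dbd_accepted(NX1_MTU, NX2_MTU, mtu_ignore=True))  # NX-1 with mtu-ignore
```

The model shows why the command belongs on the switch with the smaller MTU: that is the only side that rejects the neighbor's DBD packets.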
Note If the OSPF interface is a VLAN interface (SVI), make sure that all the Layer 2 (L2)
ports support the MTU configured on the SVI. For example, if VLAN 10 has an MTU of
9000, configure all the trunk ports to support an MTU of 9000 as well.
Unique Router-ID
The RID provides a unique identifier for an OSPF router. A Nexus switch drops
packets that have the same RID as itself as part of a safety mechanism. The syslog
message using our routerid, packet dropped is displayed along with the interface
and RID of the other device. Example 8-19 displays what the syslog message looks
like on NX-1.
The RID is checked by viewing the OSPF process with the command show ip ospf, as
displayed in Example 8-20.
Using the command router-id router-id in the OSPF process sets the RID statically and
is considered a best practice. After changing the RID on one of the Nexus switches, an
adjacency should form.
Note The RID is a key component of the OSPF topology table that is built from the
LSDB. All OSPF devices should maintain a unique RID.
More information on how to interpret the OSPF topology table is found in Chapter 7,
“Advanced OSPF” of the Cisco Press book IP Routing on Cisco IOS, IOS-XE, and
IOS XR.
Example 8-21 Syslog Message with Neighbors Configured with Different Areas
When this happens, check the OSPF interfaces to detect which area-ids are configured
by using the command show ip ospf interface brief. Example 8-22 shows the output
from NX-1 and NX-2. Notice that the area is different on NX-1 and NX-2 for the
Ethernet1/1 interface.
Changing the interface areas to the same value on NX-1 and NX-2 allows for an adja-
cency to form between them.
Note The area-id is always stored in dot-decimal format on Nexus switches. This may
cause confusion when working with other devices that store the area-id in decimal format.
To convert decimal to dot-decimal, follow these steps:
Step 1. Convert the decimal value to binary.
Step 2. Split the binary value into four octets starting with the furthest right number.
Step 3. Add leading zeros as needed to complete each octet.
Step 4. Convert each octet to decimal format, which provides dot-decimal format.
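The conversion can also be scripted. The following Python sketch follows the same idea as the steps above; the function name area_id_to_dotted is hypothetical.

```python
def area_id_to_dotted(area_id: int) -> str:
    """Convert a decimal OSPF area-id to dot-decimal notation."""
    if not 0 <= area_id <= 0xFFFFFFFF:
        raise ValueError("area-id must fit in 32 bits")
    # Convert to binary, zero-padded on the left to a full 32 bits.
    bits = format(area_id, "032b")
    # Split the binary value into four 8-bit octets.
    octets = [bits[i:i + 8] for i in range(0, 32, 8)]
    # Convert each octet back to decimal and join with dots.
    return ".".join(str(int(octet, 2)) for octet in octets)


for area in (0, 120, 256, 65536):
    print(area, "->", area_id_to_dotted(area))
```

For example, area 120 fits entirely in the last octet, whereas area 256 carries over into the third octet.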
■ Not So Stubby Area (NSSA)/Totally NSSA: External LSAs (Type-5 LSAs) are not
allowed in this area. Redistribution is allowed in this area.
The OSPF Hello event-history detects a mismatched OSPF area setting. Example 8-23
displays the concept where NX-1 has detected a different area flag from what is config-
ured on its interface.
Verify the area settings on the two routers that cannot form an adjacency. Example 8-24
displays that NX-1 has Area 1 configured as a stub, whereas NX-2 does not.
Setting the area to the same stub setting on both routers allows for the area flag check to
pass and the routers to form an adjacency.
DR Requirements
Different media types can provide different characteristics or might limit the number
of nodes allowed on a segment. Table 8-8 defines the five OSPF network types—
which ones are configurable on NX-OS and which network types can peer with other
network types.
Ethernet provides connectivity to more than two OSPF devices on a network segment,
therefore requiring a DR. The default OSPF network type for Nexus switches is the
Broadcast OSPF network type because all its interfaces are Ethernet, and the Broadcast
network type provides a DR.
There are times when a Nexus switch forms only one OSPF adjacency for that inter-
face. An example is two Ethernet ports configured as Layer 3 (L3) with a direct cable.
In scenarios like this, setting the OSPF network type to point-to-point (P2P) provides
advantages of faster adjacency (no DR Election) and not wasting CPU cycles for DR
functionality.
OSPF can form an adjacency only if the DR and BDR Hello options match.
Example 8-25 displays NX-1 stuck in INIT state with NX-2. NX-2 does not con-
sider NX-1 an OSPF neighbor. Scenarios like this indicate incompatibility in OSPF
network types.
The Ethernet1/1 OSPF interface network type is confirmed with the command show
ip ospf interface. NX-1 is configured for Broadcast (DR required), whereas NX-2 is
configured as a point-to-point (DR not required). The mismatch of DR requirements is
the reason that the adjacency failed. Example 8-26 displays the discrepancy in OSPF
network types.
The OSPF network type needs to be changed on one of the devices, because both Nexus
switches are using L3 Ethernet ports. Configuring both switches to use an OSPF point-
to-point network type is recommended. The command ip ospf network point-to-point
configures NX-1’s Ethernet1/1 interface as an OSPF point-to-point network type. This
allows for both switches to form an adjacency. Example 8-27 displays the configuration
for NX-1 and NX-2 that allows them to form an adjacency.
Timers
A secondary function of the OSPF Hello packets is to ensure that adjacent OSPF neighbors
are still healthy and available. OSPF sends Hello packets at set intervals called the Hello
Timer. OSPF uses a second timer called the OSPF Dead Interval Timer, which defaults
to four times (4x) the Hello Timer. Upon receipt of the Hello packet from a neighboring
router, the OSPF Dead Timer resets to the initial value and starts to decrement again.
Note The default OSPF Hello Timer interval varies by OSPF network type.
Changing the Hello Timer interval modifies the default Dead Interval, too.
If a router does not receive a Hello before the OSPF Dead Interval Timer reaches
zero, the neighbor state changes to Down. The OSPF router immediately sends out the
appropriate LSA reflecting the topology change, and the SPF algorithm runs on
all routers within the area.
The OSPF Hello Timer and OSPF Dead Interval Timer must match for an adjacency
to form. If the timers do not match, the mismatched timers are displayed in the OSPF Hello
packet event-history. Example 8-28 shows that NX-1 is receiving a Hello packet with
different OSPF timers.
The OSPF interfaces of both switches need to be examined with the command show
ip ospf interface to view the Hello and Dead Timers. Example 8-29 displays NX-1 and
NX-2 OSPF timers for Ethernet1/1. Notice that the Hello and Dead Timers are different
between the two switches.
Example 8-30 displays the configuration on both switches for examination to identify
a fix. NX-2 has the command ip ospf hello-interval 15 on the Ethernet1/1 interface to
modify the Hello interval. Removing the ip ospf hello-interval command on NX-2 or
setting the same timers on NX-1 allows the switches to form an adjacency.
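The timer relationship can be modeled in a few lines of Python. This is a sketch only: the helper names are hypothetical, and the 10-second default Hello interval for NX-1's broadcast interface is an assumption, because the example output is not reproduced here.

```python
from typing import Optional, Tuple

DEFAULT_HELLO = 10  # assumed default Hello interval (seconds) for a broadcast interface


def effective_timers(hello: int = DEFAULT_HELLO, dead: Optional[int] = None) -> Tuple[int, int]:
    """Return (hello, dead); the Dead Interval defaults to four times the Hello Timer."""
    return hello, dead if dead is not None else 4 * hello


def timers_match(local: Tuple[int, int], remote: Tuple[int, int]) -> bool:
    """Both the Hello and Dead timers must be identical for an adjacency to form."""
    return local == remote


nx1 = effective_timers()          # no timer commands configured
nx2 = effective_timers(hello=15)  # ip ospf hello-interval 15

print(nx1, nx2, timers_match(nx1, nx2))
print(timers_match(effective_timers(hello=15), nx2))  # after setting the same timer on NX-1
```

Because the Dead Interval is derived from the Hello Timer, changing only the Hello interval on NX-2 makes both timers differ from NX-1's values.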
Note IOS routers support OSPF fast-packet Hellos for subsecond detection of
neighbors with issues. Nexus and IOS XR do not support OSPF fast-packet Hellos.
The use of bidirectional forwarding detection (BFD) provides fast convergence
across IOS, IOS XR, and Nexus devices and is the preferred method of subsecond
failure detection.
Authentication
OSPF supports two types of authentication: plaintext and an MD5 cryptographic hash.
Plaintext mode provides little security, because anyone with access to the link can see
the password with a network sniffer. MD5 cryptographic hash mode sends a hash instead, so the
password is never sent across the wire, and this technique is widely accepted as being the
more secure mode.
Plaintext authentication is enabled for an OSPF area with the command area area-id
authentication, and the interface parameter command ip ospf authentication sets plain-
text authentication only on that interface. The plaintext password is configured with the
interface parameter command ip ospf authentication-key password.
Example 8-31 displays plaintext authentication on NX-1’s Ethernet1/1 interface and all
Area 0 interfaces on NX-2 using both commands explained previously.
NX-1# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-1(config)# int eth1/1
NX-1(config-if)# ip ospf authentication
NX-1(config-if)# ip ospf authentication-key CISCO
NX-1 %OSPF-4-AUTH_ERR: ospf-NXOS [8792] (default) Received packet from 10.12.1.200
on Ethernet1/1 with bad authentication 0
NX-2# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-2(config)# router ospf NXOS
NX-2(config-router)# area 0 authentication
NX-2(config-router)# int eth1/1
NX-2(config-if)# ip ospf authentication-key CISCO
Notice the authentication error that NX-1 produced upon enabling authentication.
When there is a mismatch of OSPF authentication parameters, the Nexus switch pro-
duces the syslog message that contains bad authentication, which requires verification
of the authentication settings.
Authentication is verified by looking at the OSPF interface and looking for the authenti-
cation option. Example 8-32 verifies the use of OSPF plaintext passwords on NX-1 and
NX-2 interfaces.
It is important to note that the password is stored in encrypted format. It may be easier
to reconfigure the password when explicitly configured on an interface. Example 8-33
displays how the password can be viewed.
interface loopback0
  ip router ospf NXOS area 0.0.0.0
interface Ethernet1/1
  ip ospf authentication-key 3 bdd0c1a345e1c285
  ip router ospf NXOS area 0.0.0.0
MD5 authentication is enabled for an OSPF area with the command area area-id
authentication message-digest, and the interface parameter command ip ospf authenti-
cation message-digest sets MD5 authentication for that interface. The MD5 password
is configured with the interface parameter command ip ospf message-digest-key key#
md5 password or set by using a key-chain with the command ip ospf authentication
key-chain key-chain-name. The MD5 authentication is a hash of the key number and
password combined. If the keys do not match, the hash is different between the nodes.
NX-1# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-1(config)# int eth1/1
NX-1(config-if)# ip ospf authentication message-digest
NX-1(config-if)# ip ospf message-digest-key 2 md5 CISCO
NX-2# conf t
NX-2(config)# key chain OSPF-AUTH
NX-2(config-keychain)# key 2
NX-2(config-keychain-key)# key-string CISCO
NX-2(config-keychain-key)# router ospf NXOS
NX-2(config-router)# area 0 authentication message-digest
NX-2(config-router)# int eth1/1
NX-2(config-if)# ip ospf authentication key-chain OSPF-AUTH
A benefit to using keychains is that passwords are verified as shown in Example 8-36.
This allows for network engineers to examine a password, versus forcing them to reenter
the password.
Upon enabling authentication, it is important to check the syslog for error messages that
indicate bad authentication. For those that do, the authentication options and password
need to be verified on all peers for that network link.
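Why both the key number and the password must agree can be illustrated with a simplified Python model. This sketch is loosely based on the keyed-MD5 approach of RFC 2328 Appendix D (the secret is appended to the packet before hashing); it does not build real OSPF packets, and the function names and the sample packet bytes are hypothetical.

```python
import hashlib
import hmac


def ospf_md5_digest(packet: bytes, password: str) -> bytes:
    """Hash the packet with the secret appended; the secret itself is never sent."""
    key = password.encode("ascii")[:16].ljust(16, b"\x00")  # pad or truncate to 16 bytes
    return hashlib.md5(packet + key).digest()


def packet_accepted(packet: bytes, rx_key_id: int, rx_digest: bytes,
                    local_key_id: int, local_password: str) -> bool:
    """Accept only when the key number matches and the recomputed digest is identical."""
    if rx_key_id != local_key_id:
        return False
    return hmac.compare_digest(ospf_md5_digest(packet, local_password), rx_digest)


hello = b"hypothetical-ospf-hello"         # placeholder payload, not a real packet
digest = ospf_md5_digest(hello, "CISCO")   # sender uses key 2 with password CISCO

print(packet_accepted(hello, 2, digest, 2, "CISCO"))   # same key number and password
print(packet_accepted(hello, 2, digest, 1, "CISCO"))   # key number mismatch
print(packet_accepted(hello, 2, digest, 2, "cisco"))   # password mismatch
```

Either kind of mismatch leads to the same symptom on the switch: the packet is dropped and a bad authentication message is logged.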
Discontiguous Network
Network engineers who do not fully understand OSPF design may create a topology
such as the one illustrated in Figure 8-3. Although NX-2 and NX-3 have OSPF interfaces
in Area 0, traffic from Area 12 must cross Area 23 to reach Area 34. An OSPF network
with this design is discontiguous because interarea traffic is trying to cross a nonback-
bone area.
[Figure 8-3: topology diagram. Loopback0 addresses shown: 192.168.1.1/32, 192.168.2.2/32, 192.168.3.3/32, and 192.168.4.4/32]
Example 8-37 shows that NX-2 and NX-3 appear to have full connectivity to all net-
works in the OSPF domain. NX-2 maintains connectivity to the 10.34.1.0/24 network
and 192.168.4.4/32 network, and NX-3 maintains connectivity to the 10.12.1.0/24 net-
work and 192.168.1.1/32 network.
Example 8-38 shows the route tables for NX-1 and NX-4. NX-1 is missing route entries
for Area 34, and NX-4 is missing route entries for Area 12. When Area 12’s Type-1 LSAs
reach NX-2, NX-2 generates a Type-3 LSA into Area 0 and Area 23. NX-3 receives the
Type-3 LSA and inserts it into the LSDB for Area 23. NX-3 does not create a new
Type-3 LSA for Area 0 or Area 34.
OSPF ABRs use the following logic for Type-3 LSAs when entering another OSPF Area:
■ Type-1 LSAs received from a nonbackbone area create Type-3 LSAs into backbone
area and nonbackbone areas.
■ Type-3 LSAs received from Area 0 are created for the nonbackbone area.
■ Type-3 LSAs received from a nonbackbone area only insert into the LSDB for the
source area. ABRs do not create a Type-3 LSA for the other nonbackbone areas.
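The ABR rules listed above can be expressed as a small Python function. This is a sketch of the decision logic only: the function name is hypothetical, and treating a Type-1 LSA from Area 0 the same way as one from a nonbackbone area is an assumption that the list does not state explicitly.

```python
from typing import Set

BACKBONE = 0


def type3_target_areas(lsa_type: int, source_area: int, attached_areas: Set[int]) -> Set[int]:
    """Return the areas into which an ABR originates a Type-3 LSA."""
    other_areas = attached_areas - {source_area}
    if lsa_type == 1:
        # Type-1 LSAs create Type-3 LSAs into every other attached area.
        return other_areas
    if lsa_type == 3 and source_area == BACKBONE:
        # Type-3 LSAs from Area 0 are re-created for the nonbackbone areas.
        return other_areas
    # Type-3 LSAs from a nonbackbone area stay in that area's LSDB.
    return set()


NX2_AREAS = {0, 12, 23}
NX3_AREAS = {0, 23, 34}

print(type3_target_areas(1, 12, NX2_AREAS))  # NX-2 handling Area 12's Type-1 LSAs
print(type3_target_areas(3, 23, NX3_AREAS))  # NX-3 handling the Type-3 LSA from Area 23
print(type3_target_areas(3, 0, NX3_AREAS))   # NX-3 handling a Type-3 LSA from Area 0
```

The second call returns an empty set, which is why Area 12's networks never reach Area 34 in the discontiguous design.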
The simplest fix for a discontiguous network is to install a virtual link between NX-2 and
NX-3. Virtual links overcome the ABR limitations by extending Area 0 into a nonback-
bone area. It is similar to running a virtual tunnel for OSPF between an ABR and another
multi-area OSPF router. The virtual link extends Area 0 across Area 23, making Area 0 a
contiguous OSPF area.
The virtual link configuration is applied to the OSPF routing process with the command
area area-id virtual-link endpoint-rid. The configuration is applied on both end devices
as shown in Example 8-39.
NX-2
router ospf NXOS
  area 0.0.0.23 virtual-link 192.168.3.3
NX-3
router ospf NXOS
  area 0.0.0.23 virtual-link 192.168.2.2
Example 8-40 displays the routing table of NX-1 after the virtual link is configured
between NX-2 and NX-3. Notice that the 192.168.4.4 network is present. In addition,
the virtual link appears as an OSPF interface.
Duplicate Router ID
Router IDs (RID) play a critical role in the creation of the topology. If two adjacent
routers have the same RID, an adjacency does not form, as shown earlier. However, if
two routers have the same RID and are separated by an intermediary router, the duplicate
RID prevents those routes from being installed in the topology.
The RID acts as a unique identifier in the OSPF LSAs. When two different routers advertise
LSAs with the same RID, it causes confusion in the OSPF topology, which can result
in routes not populating or packets being forwarded toward the wrong router. It also
prevents LSA propagation because the receiving router may assume that a loop exists.
Figure 8-4 provides a sample topology in which all Nexus switches are advertising their
peering network and their loopback addresses in the 192.168.0.0/16 network space.
NX-2 and NX-4 have been configured with the same RID of 192.168.4.4. NX-3 sits
between NX-2 and NX-4 and has a different RID, therefore allowing NX-2 and NX-4 to
establish full neighbor adjacencies with their peers.
[Figure 8-4: topology diagram. NX-2 and NX-4 are both labeled with RID 192.168.4.4]
From NX-1’s perspective, the first apparent issue is that NX-4’s loopback interface
(192.168.4.4/32) is missing. Example 8-41 displays NX-1’s routing table.
Example 8-41 NX-1’s Routing Table with Missing NX-4’s Loopback Interface
On NX-2 and NX-4, there are complaints about LSAs and Possible router-id collision
syslog messages, as shown in Example 8-42.
Example 8-43 displays the routing tables of the two Nexus switches with the Possible
router-id collision syslog messages. Notice that NX-2 is missing NX-1’s loopback
interface (192.168.1.1/32) and NX-4’s loopback interface (192.168.4.4/32), whereas
NX-4 is missing the 10.12.1.0/24 network and NX-2’s loopback interface
(192.168.2.2/32).
A quick check of the RIDs is done by examining the OSPF processes on both Nexus
switches that reported the Possible router-id collision using the show ip ospf command.
Notice that in Example 8-44, NX-2 and NX-4 have the same RID.
Remember that the RID can be dynamically set or statically set. Generally, this problem
is a result of a configuration being copied from one router to another and not changing
the RID. The RID is changed using the command router-id router-id under the OSPF
process. The OSPF process restarts upon changing the RID on a Nexus switch.
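When many devices are involved, RIDs collected from show ip ospf can be audited with a short script. The following Python sketch assumes the RIDs have already been gathered into a dictionary; the function name is hypothetical, and the RIDs for NX-1 and NX-3 are assumed values used for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def find_duplicate_rids(rids: Dict[str, str]) -> Dict[str, List[str]]:
    """Group routers by RID and return only the RIDs used by more than one router."""
    routers_by_rid: Dict[str, List[str]] = defaultdict(list)
    for router, rid in rids.items():
        routers_by_rid[rid].append(router)
    return {rid: sorted(routers)
            for rid, routers in routers_by_rid.items()
            if len(routers) > 1}


collected = {
    "NX-1": "192.168.1.1",  # assumed
    "NX-2": "192.168.4.4",
    "NX-3": "192.168.3.3",  # assumed
    "NX-4": "192.168.4.4",
}

print(find_duplicate_rids(collected))
```

A check like this is most useful right after configurations are copied between switches, which is the usual source of the problem.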
Filtering Routes
NX-OS provides multiple methods of filtering networks after they are entered into
the OSPF database. Filtering of routes occurs on ABRs for internal OSPF networks
and ASBRs for external OSPF networks. The following includes some configurations
that should be examined when routes are present in one area but not present in a
different area.
■ Area Filtration: Routes are filtered upon receipt or advertisement to an ABR with
the process level configuration command area area-id filter-list route-map route-
map-name {in|out}.
■ Route Summarization: Internal routes are summarized on ABRs using the command
area area-id range summary-network [not-advertise]. If the not-advertise keyword
is configured, a Type-3 LSA is not generated for any of the component routes;
thereby hiding them to only the area of origination.
Note ABRs for NSSA areas act as an ASBR when the Type 7 LSAs are converted to
Type 5 LSA. External summarization is performed only on ABRs when they match this
scenario.
Redistribution
Redistributing into OSPF uses the command redistribute [bgp asn | direct | eigrp
process-tag | isis process-tag | ospf process-tag | rip process-tag | static] route-map
route-map-name. A route-map is required as part of the redistribution process on Nexus
switches.
Every protocol provides a seed metric at the time of redistribution that allows the des-
tination protocol to calculate a best path. OSPF uses the following default settings for
seed metrics:
■ The default redistribution metric is set to 20 unless the source protocol is BGP,
which provides a default seed metric of 1.
The default seed metrics can be changed to different values for OSPF external network
type (1 versus 2), redistribution metric, and a route-tag if desired.
Example 8-45 provides the necessary configuration to demonstrate the process of redis-
tribution. NX-1 redistributes the connected routes for 10.1.1.0/24 and 10.11.11.0/24 in
lieu of them being advertised with the OSPF routing protocol. Notice that the route-map
can be a simple permit statement without any conditional matches.
■ Users are trying to connect to the proxy server that is located in a DMZ (172.16.1.1)
off of the firewall.
■ NX-1 has a static route for the 172.16.1.0/24 network pointing toward the firewall
(10.120.1.10).
■ NX-1 and NX-2 have direct connectivity using VLAN 120 (10.120.1.0/24) to the
firewall.
Example 8-46 displays NX-1’s configuration for advertising the 172.16.1.0/24 network
into the OSPF domain. In addition, NX-1’s static route is verified for installation into the
OSPF database and is then checked on NX-3.
[Figure: topology diagram. Labels shown: Proxy Server, Traffic Flow, Static Route Redistribution into OSPF, Internet, VLAN 120 (10.120.1.0/24) with the firewall at .10, 10.24.1.0/24, 10.34.1.0/24, NX-2, NX-4, and Area 0]
NX-1
ip route 172.16.1.0/24 10.120.1.10
!
route-map REDIST permit 10
  set metric-type type-1
!
router ospf NXOS
  redistribute static route-map REDIST
  log-adjacency-changes
!
interface Ethernet1/1
  ip router ospf NXOS area 0.0.0.0
Example 8-47 displays the Type-5 LSA for the external route for the 172.16.1.0/24
network to the proxy server. The ASBR is identified as NX-1 (192.168.1.1), which is the
device that all Nexus switches forward packets to in order to reach the 172.16.1.0/24
network. Notice that the forwarding address is the default value of 0.0.0.0.
LS age: 199
Options: 0x2 (No TOS-capability, No DC)
LS Type: Type-5 AS-External
Link State ID: 172.16.1.0 (Network address)
Advertising Router: 192.168.1.1
LS Seq Number: 0x80000002
Checksum: 0x7c98
Length: 36
Network Mask: /24
Metric Type: 1 (Same units as link state path)
TOS: 0
Metric: 20
Forward Address: 0.0.0.0
External Route Tag: 0
Traffic from NX-2 (and NX-4) takes the non-optimal route (NX-2→NX-4→NX-3→
NX-1→FW), as shown in Example 8-48. The optimal route would allow NX-2 to use the
directly connected 10.120.1.0/24 network toward the firewall.
The forwarding address in OSPF Type-5 LSAs is specified in RFC 2328 for scenarios
such as this. When the forwarding address is 0.0.0.0, all routers forward packets to the
ASBR, introducing the potential for suboptimal routing.
The OSPF forwarding address changes from 0.0.0.0 to the next-hop IP address in the
source routing protocol when the following occurs:
■ OSPF is enabled on the ASBR’s interface that points to the next-hop IP address. In
this scenario, NX-1’s VLAN120 interface has OSPF enabled, which correlates to the
172.16.1.0/24 static route’s next-hop address of 10.120.1.10.
Now OSPF is enabled on the NX-1’s and NX-2’s VLAN120 interface, which has been
associated to area 120. Figure 8-6 illustrates the current topology. VLAN interfaces
default to the broadcast OSPF network type, and all conditions were met to set the FA
to an explicit IP address.
[Figure 8-6: topology diagram. Labels shown: Proxy Server (172.16.1.0/24), Traffic Flow, Static Route Redistribution into OSPF, Internet, VLAN 120 (10.120.1.0/24) with the firewall at .10, Area 120, Area 0, NX-1, NX-2, NX-3, NX-4, 10.13.1.0/24, 10.24.1.0/24, and 10.34.1.0/24]
Example 8-49 displays the Type-5 LSA for the 172.16.1.0/24 network. Now that OSPF
is enabled on NX-1’s 10.120.1.1 interface and the interface is a broadcast network type,
the forwarding address changed from 0.0.0.0 to 10.120.1.10.
Example 8-50 verifies that connectivity from NX-2 and NX-4 now takes the optimal
path because the forwarding address changed to 10.120.1.10.
A junior network engineer identified that the 10.120.1.0/24 network is no longer needed.
The engineer implemented filtering to prevent Area 120 LSAs from being advertised into Area 0,
as shown in Example 8-51.
NX-1
router ospf NXOS
  redistribute static route-map REDIST
  area 0.0.0.120 range 10.0.0.0/8 not-advertise
  log-adjacency-changes
NX-2
router ospf NXOS
  area 0.0.0.120 range 10.0.0.0/8 not-advertise
  log-adjacency-changes
After the junior network engineer made the change, the 172.16.1.0/24 network disap-
peared on all the routers in Area 0. Only the other peering network is present, as shown
in Example 8-52.
If the Type-5 LSA forwarding address is not a default value, the address must be
an intra-area or inter-area OSPF route. If the FA is not resolved, the LSA is ignored
and does not install into the RIB. The FA provides a mechanism to introduce
multiple paths to the external next-hop address. Otherwise, there is no reason
to include the FA in the LSA. Removing the filtering on NX-1 and NX-2 restores
connectivity.
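The forwarding address check can be modeled with a short Python sketch. The function name and the simplified route list are hypothetical; the sketch only captures the rule that a nonzero FA must resolve through an intra-area or inter-area OSPF route before the external route is installed.

```python
import ipaddress
from typing import List, Tuple


def external_route_usable(forwarding_address: str,
                          ospf_routes: List[Tuple[str, str]]) -> bool:
    """Return True when a Type-5 LSA with this FA can be installed into the RIB."""
    fa = ipaddress.ip_address(forwarding_address)
    if fa == ipaddress.ip_address("0.0.0.0"):
        return True  # default FA: traffic is simply forwarded toward the ASBR
    for prefix, route_type in ospf_routes:
        if route_type in ("intra", "inter") and fa in ipaddress.ip_network(prefix):
            return True
    return False


# Simplified view from a router in Area 0 (route list is illustrative only).
before_filter = [("10.13.1.0/24", "intra"), ("10.120.1.0/24", "inter")]
after_filter = [("10.13.1.0/24", "intra")]  # 10.120.1.0/24 suppressed at the ABRs

print(external_route_usable("10.120.1.10", before_filter))
print(external_route_usable("10.120.1.10", after_filter))
print(external_route_usable("0.0.0.0", after_filter))
```

Once the 10.120.1.0/24 route is filtered, the FA of 10.120.1.10 no longer resolves, which matches the disappearance of the 172.16.1.0/24 network from Area 0.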
Note In the scenario provided, there was no redundancy to provide connectivity
in the event that NX-1 failed. Typically, the configuration is repeated on other routers,
which provides resiliency. Be considerate of the external networks when applying filtering
of routes on ABRs.
■ Intra-Area
■ Inter-Area
■ External Type-1
■ External Type-2
Intra-Area Routes
Routes advertised via a Type-1 LSA for an Area are always preferred over Type-3 and
Type-5 LSAs. If multiple intra-area routes exist, the path with the lowest total path met-
ric is installed in the RIB. If there is a tie in metric, both routes install into the RIB.
Note Even if the path metric from an intra-area route is higher than an inter-area path
metric, the intra-area path is selected.
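The preference order listed above can be sketched in Python as a two-part sort key: route type first, then metric. This is a simplified model with hypothetical names and metrics; in particular, it ignores the additional external route rules (forwarding cost and RFC mode).

```python
from typing import List, Tuple

# Lower number = more preferred route type.
PREFERENCE = {"intra": 0, "inter": 1, "E1": 2, "E2": 3}

Route = Tuple[str, int, str]  # (route type, path metric, next hop)


def best_paths(candidates: List[Route]) -> List[Route]:
    """Return every candidate that ties for the best (type, metric) combination."""
    best_key = min((PREFERENCE[rtype], metric) for rtype, metric, _ in candidates)
    return [route for route in candidates
            if (PREFERENCE[route[0]], route[1]) == best_key]


# Hypothetical candidates for one prefix.
print(best_paths([("inter", 10, "NX-2"), ("intra", 100, "NX-3")]))
print(best_paths([("intra", 40, "NX-2"), ("intra", 40, "NX-3"), ("intra", 50, "NX-4")]))
```

The first call selects the intra-area path despite its higher metric, and the second returns both equal-cost paths, mirroring the behavior where both routes install into the RIB.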
Inter-Area Routes
Inter-area routes take the lowest total path metric to the destination. If there is a tie in
metric, both routes install into the RIB. All inter-area paths for a route must go through
Area 0.
In Figure 8-7, NX-1 is computing the path to NX-6. NX-1 uses the path NX-1→
NX-3→NX-5→NX-6 because its total path metric is 35 versus the NX-1→NX-2→
NX-4→NX-6 path with a metric of 40.
■ The Type-1 path metric equals the following: redistribution metric + total path
metric to reach the ASBR.
Another critical factor to identify is whether the devices are operating in RFC 1583
or RFC 2328 mode. Cisco NX-OS switches operate in RFC 2328 mode by default, whereas
Cisco IOS, IOS XE, and IOS XR operate only in RFC 1583 mode. The following subsection
explains the path selection logic depending on whether the device is operating in RFC
1583 or RFC 2328 mode.
■ RFC 1583 Mode: External OSPF Type-1 route calculation uses the redistribution
metric + the lowest path metric to reach the ASBR that advertised the network.
Type-1 path metrics are lower for routers closer to the originating ASBR, whereas
generally the path metric is higher for a router 10 hops away from the ASBR.
If there is a tie in the path metric, both routes install into the RIB. If the ASBR is in
a different area, the path of the traffic must go through Area 0. An ABR router does
not install an O E1 and O N1 route into the RIB at the same time. O N1 is given pref-
erence for a typical NSSA area and prevents the O E1 from installing on the ABR.
■ RFC 2328 Mode: Preference first goes to the ASBR in the same area as the calculating
router. In the event that the ASBR is not in the same area as the calculating
router, the rules for calculating the best path follow those of RFC 1583 Mode.
Note There is an option with NSSA areas that prevents the redistributed routes from
being advertised outside of the NSSA area (setting the P-bit to zero), which may change
the behavior. This concept is outside of the scope of this book; it is explained in depth in
RFC 2328 and RFC 3101.
Figure 8-8 shows the topology for NX-1 and NX-3 computing a path to the external
network (100.65.0.0/16) that is being redistributed on NX-6 and NX-7.
[Figure 8-8: topology diagram. Labels shown: Area 0, Area 246, NX-1, link costs of 10 and 20, and External Routes 100.65.0.0/16]
The path NX-1→NX-2→NX-4→NX-6 has a metric of 50, which is less than the
path NX-1→NX-3→NX-5→NX-7, which has a path metric of 90. NX-1 selects the
NX-1→NX-2→NX-4→NX-6 path to reach the 100.65.0.0/16 network, whereas NX-3
selects the NX-3→NX-5→NX-7 path that has a higher metric.
The decisions were made based upon RFC 2328 logic because NX-1 is not in an area
with an ASBR, whereas NX-3 is in the same area as NX-7. Example 8-53 displays the
routing tables and path metrics from NX-1 and NX-3’s perspective.
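The two selection modes can be compared with a short Python sketch. The function names are hypothetical. NX-1's costs to the ASBRs (30 and 70) are chosen to be consistent with the stated path metrics of 50 and 90, assuming the default redistribution metric of 20; NX-3's costs (40 and 60) are assumptions for illustration only.

```python
from typing import Dict, List

REDISTRIBUTION_METRIC = 20  # default seed metric assumed for this example


def e1_metric(cost_to_asbr: int, redistribution_metric: int = REDISTRIBUTION_METRIC) -> int:
    """External Type-1 metric = redistribution metric + path metric to the ASBR."""
    return redistribution_metric + cost_to_asbr


def select_asbr(paths: List[Dict], rfc1583: bool) -> Dict:
    """Pick the ASBR path; RFC 2328 mode first prefers an ASBR in the local area."""
    candidates = paths
    if not rfc1583:
        same_area = [p for p in paths if p["same_area"]]
        if same_area:
            candidates = same_area
    return min(candidates, key=lambda p: p["cost_to_asbr"])


nx1_paths = [
    {"asbr": "NX-6", "cost_to_asbr": 30, "same_area": False},
    {"asbr": "NX-7", "cost_to_asbr": 70, "same_area": False},
]
nx3_paths = [
    {"asbr": "NX-6", "cost_to_asbr": 40, "same_area": False},  # assumed cost
    {"asbr": "NX-7", "cost_to_asbr": 60, "same_area": True},   # assumed cost
]

print([e1_metric(p["cost_to_asbr"]) for p in nx1_paths])
print(select_asbr(nx1_paths, rfc1583=False)["asbr"])
print(select_asbr(nx3_paths, rfc1583=False)["asbr"])
print(select_asbr(nx3_paths, rfc1583=True)["asbr"])
```

NX-1 has no ASBR in its own area, so both modes give the same answer for it; only NX-3's choice depends on the mode.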
■ RFC 1583 Mode: External OSPF Type-2 routes do not increment in metric respec-
tive of the path metric to the ASBR. If there is a tie in the redistribution metric, the
router compares the forwarding cost. The forwarding cost is the metric to the ASBR
that advertised the network, and the lower forwarding cost is preferred. If there is a
tie in forwarding cost, both routes install into the routing table. An ABR router does
not install an O E2 and O N2 route into the RIB at the same time. O N2 is given pref-
erence for a typical NSSA area and prevents the O E2 from installing on the ABR.
■ RFC 2328 Mode: Preference first goes to the ASBR in the same area as the calculating
router. In the event that the ASBR is not in the same area as the calculating
router, the rules for calculating the best path follow those of RFC 1583 Mode.
Reusing the topology from Figure 8-8, all paths reflect a metric of 20. The first deciding
step is to check whether the calculating router is in the same area as an ASBR for the
100.65.0.0/16 network. NX-1 is not, so it selects the path based on forwarding cost.
Forwarding cost is calculated on NX-OS.
Example 8-54 walks through the steps for calculating the forwarding cost.
Step 1. The ASBRs must be identified by looking at the OSPF LSDB with the com-
mand show ip ospf database external network.
Step 2. The metric reported by the ABR for the ASBR address (Type-4 LSA) is exam-
ined with the command show ip ospf database asbr-summary detail. (This
provides the path metric from the ASBR to the area’s ABR.)
Step 3. Find the metric to the ABR of the Type-4 LSA with the command show ip
ospf database router abr-ip-address detail.
Step 4. Combine the two metrics to calculate a forwarding cost of 30 from NX-1 to
NX-6, and a forwarding cost of 70 from NX-1 to NX-7. The path to NX-6 is
the lowest and is selected by NX-1.
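The arithmetic in the steps above can be sketched in Python. Only the totals of 30 and 70 come from the text; the split between the metric to the ABR and the Type-4 LSA metric is an assumption chosen to be consistent with those totals, and the function name is hypothetical.

```python
def forwarding_cost(metric_to_abr: int, type4_metric: int) -> int:
    """Forwarding cost = metric to the ABR + metric the ABR reports for the ASBR."""
    return metric_to_abr + type4_metric


# (metric from NX-1 to the ABR, Type-4 LSA metric from that ABR to the ASBR)
components = {
    "NX-6": (10, 20),  # assumed split, totals 30
    "NX-7": (10, 60),  # assumed split, totals 70
}

costs = {asbr: forwarding_cost(*parts) for asbr, parts in components.items()}
preferred = min(costs, key=costs.get)

print(costs)
print(preferred)
```

Because both external Type-2 routes carry the same redistribution metric of 20, the lower forwarding cost is the tiebreaker.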
Example 8-54 NX-1 External OSPF Path Selection for Type-2 Network
NX-3’s path is selected based on RFC 2328’s guidelines because NX-3 is in the same
area as NX-7. Example 8-55 confirms the path from NX-3 → NX-5 → NX-7.
Example 8-55 NX-3 External OSPF Path Selection for Type-2 Network
[Topology diagram. Labels shown: Area 0, Area 246, Area 357, R1, R2, R4, R6, NX-3, R5, R7, link costs of 10, 20, and 50, and External Routes 100.65.0.0/16]
Figure 8-9 External Type-2 Route Selection with Nexus and IOS Devices
NX-3 selects R7 as the ASBR for the 100.65.0.0/16 network using RFC 2328 standards
and forwards packets toward R5. R5 uses RFC 1583 standards and forwards packets
back to NX-3, causing a loop. Example 8-56 verifies that the loop exists using a simple
traceroute from NX-3 toward the 100.65.0.0/16 network.
The solution involves placing the Nexus switches into RFC 1583 mode with the OSPF
command rfc1583compatibility. Example 8-57 displays the configuration to remove
the routing loop.
Note Another significant change between RFC 1583 and RFC 2328 is the summarization
metric. With RFC 1583, an ABR uses the lowest metric from any of the component routes
for the metric of the summarized network. RFC 2328 uses the highest metric from any of
the component routes for the metric of the summarized route. Deploying
rfc1583compatibility on the ABR changes the behavior.
The default reference bandwidth for NX-OS is 40 Gbps, whereas for other Cisco OSs
(IOS and IOS XR) it is 100 Mbps. Table 8-9 provides the OSPF cost for common net-
work interface types using the default reference bandwidth.
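The effect of the two default reference bandwidths can be sketched as follows. This is a minimal illustration that assumes the usual OSPF interface cost formula (reference bandwidth divided by interface bandwidth, truncated, with a minimum of 1); the function name ospf_cost and the list of interface speeds are chosen for the example.

```python
NXOS_REFERENCE_MBPS = 40_000   # 40 Gbps default on NX-OS
IOS_REFERENCE_MBPS = 100       # 100 Mbps default on IOS and IOS XR


def ospf_cost(interface_mbps: int, reference_mbps: int) -> int:
    """Reference bandwidth / interface bandwidth, never lower than 1."""
    return max(1, reference_mbps // interface_mbps)


INTERFACES = {
    "FastEthernet": 100,
    "GigabitEthernet": 1_000,
    "10-Gigabit Ethernet": 10_000,
    "40-Gigabit Ethernet": 40_000,
    "100-Gigabit Ethernet": 100_000,
}

for name, mbps in INTERFACES.items():
    ios = ospf_cost(mbps, IOS_REFERENCE_MBPS)
    nxos = ospf_cost(mbps, NXOS_REFERENCE_MBPS)
    print(f"{name:22} IOS={ios:<4} NX-OS={nxos}")
```

Under the 100 Mbps reference every interface at FastEthernet speed or faster collapses to a cost of 1, which is why the two platforms can disagree about which path is best.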
Notice in Table 8-9 that there is no differentiation in the link cost associated with a
FastEthernet interface and a 100-Gigabit Ethernet interface on IOS routers. This can
result in suboptimal path selection and is magnified when an NX-OS switch is inserted
into a path.
Figure 8-10 displays a topology that introduces problems because of the reference
bandwidth not being set properly. Connectivity between the two WAN service providers
should take the 10 Gigabit path (R1→NX-3→NX-4→R2) and use the 1 Gigabit link
between R1 and R2 only as a backup path, because the QoS policy on that link supports
only business-critical traffic and is likely to drop other traffic.
[Figure 8-10: R1 attaches to WAN Service Provider 1 (172.16.1.0/24) and R2 attaches to WAN Service Provider 2 (172.32.2.0/24). R1 and R2 connect directly over a Gigabit link (10.12.1.0/24). The 10-Gigabit links are R1 to NX-3 (10.13.1.0/24), NX-3 to NX-4 (10.34.1.0/24), and NX-4 to R2 (10.24.1.0/24).]
Example 8-58 displays the routing table of R1 with the default reference bandwidth.
Traffic between 172.16.1.0/24 and 172.32.2.0/24 flows across the backup 1 Gigabit
link (10.12.1.0/24), which does not follow the intended traffic patterns. Notice that the
OSPF path metric is 2 to the 172.32.2.0/24 network using the 1 Gigabit link.
Example 8-58 R1’s Routing Table with Default OSPF Auto-Cost Bandwidth
Now let’s shut down the 1 Gigabit link and examine the OSPF metrics using the 10
Gigabit path. Example 8-59 displays the process. Notice that R1’s path metric is 10 to
the 172.32.2.0/24 network using the 10 Gigabit link path.
R1# conf t
Enter configuration commands, one per line. End with CNTL/Z.
R1(config)# int gi0/1
R1(config-if)# shut
16:04:43.107: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.2.2 on GigabitEthernet0/1 from
FULL to DOWN, Neighbor Down: Interface down or detached
16:04:45.077: %LINK-5-CHANGED: Interface GigabitEthernet0/1, changed state to
administratively down
16:04:46.077: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1,
changed state to down
R1(config-if)# do show ip route ospf | b Gatewa
Gateway of last resort is not set
R1 and R2 are taking the suboptimal path because of the differences in reference
bandwidth. Change the reference bandwidth to match the NX-OS’s default setting of
40 Gbps. The reference bandwidth on IOS and NX-OS devices is set with the command
auto-cost reference-bandwidth speed-in-megabits. Example 8-60 displays the reference
bandwidth being changed on R1 and R2.
R2# conf t
Enter configuration commands, one per line. End with CNTL/Z.
R2(config)# router ospf 1
R2(config-router)# auto-cost reference-bandwidth 40000
% OSPF: Reference bandwidth is changed.
Please ensure reference bandwidth is consistent across all routers.
Now let’s examine the new OSPF metric cost using the 10 Gigabit path, and then reacti-
vate the 1 Gigabit link on R1. Example 8-61 demonstrates this change and then verifies
which path is now used to connect the 172.16.1.0/24 and 172.32.2.0/24 networks.
Example 8-61 Verification of New Path After New Reference OSPF Bandwidth Is
Configured on R1 and R2
R1# conf t
Enter configuration commands, one per line. End with CNTL/Z.
R1(config)# int gi0/1
R1(config-if)# no shut
16:09:10.887: %LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to up
16:09:11.887: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1,
changed state to up
16:09:16.623: %OSPF-5-ADJCHG: Process 1, Nbr 192.168.2.2 on GigabitEthernet0/1 from
LOADING to FULL, Loading Done
R1(config-if)# do show ip route ospf | b Gate
Gateway of last resort is not set
The path between 172.16.1.0/24 and 172.32.2.0/24 continues to use the 10 Gigabit path
because the path metric using the 1 Gigabit path would be 41: a link cost of 40
(40,000/1,000) plus 1 (for the loopback).
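The path metrics discussed in this section can be reproduced with a short sketch. It assumes the usual cost formula (reference bandwidth divided by interface bandwidth, minimum 1), that each hop contributes the outgoing interface cost of the router that owns it, and that the final cost of 1 is the same on both paths. The helper names are invented for the example, and the value computed for the 10 Gigabit path after the change is derived by the sketch rather than quoted from the examples.

```python
NXOS_REFERENCE_MBPS = 40_000


def ospf_cost(interface_mbps: int, reference_mbps: int) -> int:
    return max(1, reference_mbps // interface_mbps)


def path_metric(hops, final_cost=1):
    """hops: (interface Mbps, reference Mbps of the router owning the outgoing interface)."""
    return sum(ospf_cost(bw, ref) for bw, ref in hops) + final_cost


def metrics_from_r1(ios_reference_mbps: int):
    gigabit_path = [(1_000, ios_reference_mbps)]          # R1 -> R2
    ten_gigabit_path = [
        (10_000, ios_reference_mbps),                     # R1 -> NX-3
        (10_000, NXOS_REFERENCE_MBPS),                    # NX-3 -> NX-4
        (10_000, NXOS_REFERENCE_MBPS),                    # NX-4 -> R2
    ]
    return path_metric(gigabit_path), path_metric(ten_gigabit_path)


print("IOS default reference:", metrics_from_r1(100))
print("Reference set to 40000:", metrics_from_r1(40_000))
```

With the IOS default the sketch yields metrics of 2 and 10, matching the values noted for the two paths, so the 1 Gigabit link wins. After the change the 1 Gigabit path costs 41 and loses to the 10 Gigabit path.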
Note Another solution involves statically setting the OSPF cost on an interface with the
command ip ospf cost 1-65535 for NX-OS and IOS devices.
Summary
This chapter provided a brief review of the OSPF routing protocol, and then explored
the methods for troubleshooting adjacency issues between devices, missing routes, and
path selection.
The following parameters must be compatible for the two routers to become neighbors:
OSPF is a link state routing protocol that builds a complete map based on LSAs. Routes
are missing from the OSPF routing domain typically because of bad network design or
through filtering of routes as they are advertised across area boundaries. This chapter
provided some common bad OSPF designs that cause loss of path information.
OSPF builds a loop-free topology from the computing router to all destination net-
works. All routers use the same logic to calculate the shortest-path for each network.
Path selection prioritizes paths by using the following logic:
■ Intra-Area
■ Inter-Area
■ External Type-1
■ External Type-2
When the redistribution metric is the same, Nexus switches select external paths using
RFC 2328 by default, which states to prefer intra-area connectivity over inter-area
connectivity when multiple ASBRs are present. Cisco IOS and IOS XR routers use RFC 1583
external path selection, which selects an ASBR by the lowest forwarding cost. This can
cause routing loops when Nexus switches are intermixed with IOS or IOS XR routers,
but the Nexus switches can be placed in RFC 1583 compatibility mode.
References
RFC 1583, OSPF Version 2. IETF, http://www.ietf.org/rfc/rfc1583.txt, March 1994.
Edgeworth, Brad, Aaron Foss, Ramiro Garza Rios. IP Routing on Cisco IOS, IOS XE
and IOS XR. Indianapolis: Cisco Press, 2014.
Chapter 9
Troubleshooting Intermediate System-Intermediate System (IS-IS)
This chapter focuses on identifying and troubleshooting issues with forming
IS-IS neighbor adjacencies, path selection, missing routes, and problems with convergence.
IS-IS Fundamentals
IS-IS uses a two-level hierarchy consisting of Level 1 (L1) and Level 2 (L2) connections.
IS-IS communication occurs at L1, L2, or both (L1-L2). L2 routers communicate only
with other L2 routers, and L1 routers communicate only with other L1 routers. L1-L2
routers provide connectivity between the L1 and L2 levels. An L2 router can communi-
cate with L2 routers in the same or a different area, whereas an L1 router communicates
only with other L1 routers within the same area. The following list indicates the type of
adjacencies that are formed between IS-IS routers:
■ L1 ← → L1
■ L2 ← → L2
■ L1-L2 ← → L1
■ L1-L2 ← → L2
■ L1-L2 ← → L1-L2
Note The terms L1 and L2 are used frequently in this chapter, and refer only to the IS-IS
levels. They should not be confused with the OSI model.
IS-IS uses link-state packets (LSP) for building a link-state packet database (LSPDB)
similar to OSPF’s link-state database (LSDB). IS-IS then runs the Dijkstra Shortest Path
First (SPF) algorithm to construct a loop-free topology of shortest paths.
Areas
OSPF and IS-IS both use a two-level hierarchy, but the hierarchy works differently in each protocol.
OSPF provides connectivity between areas by allowing a router to participate in mul-
tiple areas, whereas IS-IS places the entire router and all its interfaces in a specific area.
OSPF’s hierarchy is based on areas advertising prefixes into the backbone, which then
are advertised into nonbackbone areas. Level 2 is the IS-IS backbone and can cross mul-
tiple areas, unlike OSPF, as long as the L2 adjacencies are contiguous.
Figure 9-1 demonstrates these basic differences between OSPF and IS-IS. Notice that the IS-IS
backbone extends across four areas, unlike OSPF’s backbone, which is limited to Area 0.
[Figure 9-1: OSPF backbone compared with the IS-IS backbone]
In Figure 9-2, NX-1 and NX-2 form an L1 adjacency with each other, and NX-4 and
NX-5 form an L1 adjacency with each other. Although NX-2 and NX-4 are L1-L2 rout-
ers, NX-1 and NX-5 support only an L1 connection. The area address must be the same
to establish an L1 adjacency. NX-2 establishes an L2 adjacency with NX-3, and NX-3
establishes an L2 adjacency with NX-4. Because NX-2 and NX-4 are L1-L2 routers, they
can form both L1 and L2 adjacencies.
All L1 IS-IS routers in the same area maintain an identical copy of the LSPDB, and
L1 routers do not know about any routers or networks outside of their level (area). In a
similar fashion, L2 routers maintain a separate LSPDB that is identical to that of other
L2 routers. L2 routers are aware only of other L2 routers and networks in the L2 LSPDB.
L1-L2 routers inject L1 prefixes into the L2 topology. L1-L2 routers do not advertise L2
routes into the L1 area, but they set the attached bit in their L1 LSP, indicating that the
router has connectivity to the IS-IS backbone network. If an L1 router does not have a
route for a network, it searches the LSPDB for the closest router with the attached bit,
which acts as a route of last resort.
NET Addressing
IS-IS routers share an area topology through link-state packets (LSP) that allows them
to build the LSPDB. IS-IS uses NET addresses to build the LSPDB topology. The NET
address is included in the IS header for all the LSPs. Ensuring that a router is unique in
an IS-IS routing domain is essential for properly building the LSPDB. NET addressing is
based on the OSI model’s Network Service Access Point (NSAP) address structure, which
is between 8 and 20 bytes in length. NSAP addressing is variable based on the logic for
addressing domains.
The dynamic length in the Inter-Domain Part (IDP) portion of the NET address
causes unnecessary confusion. Instead of reading the NET address left to right, most
network engineers read the NET address from right to left. In the simplest
form, the first byte from the right is always the selector (SEL) (with a value of 00),
the next 6 bytes are the system ID, and the remaining 1 to 13 bytes are the Area Address, as
shown in Figure 9-3.
[Figure 9-3: NET address fields from left to right: Area Address, System ID, Selector (SEL)]
■ A simple 8-byte NET address structure. The Authority and Format Identifier (AFI) is
not needed because the length does not enter into the Inter-Domain Part (IDP) por-
tion of the NSAP address. Notice that the Area Address is 1-byte, which provides
up to 256 unique areas.
■ A common 10-byte NET address structure. The private AFI (49) is used, and
the area uses 2 bytes, providing up to 65,536 unique areas. Notice that the Area
Address is 49.1234.
■ Typical Open System Interconnection (OSI) NSAP address that includes the domain
address. Notice that the Area Address is 49.0456.1234 and that the private AFI (49)
is used.
Note In essence, the router’s System ID is equivalent to the router ID in EIGRP or OSPF. The
NET address is used to construct the network topology and must be unique.
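The right-to-left reading of a NET address can be sketched in Python. This is a minimal illustration, not how NX-OS parses the value. The function name parse_net is invented, and the sample addresses combine the area values mentioned above with an assumed system ID of 1921.6800.1001.

```python
def parse_net(net: str) -> dict:
    """Split a dotted-hex NET into area address, system ID, and selector."""
    digits = net.replace(".", "").lower()
    if len(digits) % 2 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError("a NET must be whole bytes of hexadecimal digits")
    if not 8 <= len(digits) // 2 <= 20:
        raise ValueError("a NET must be 8 to 20 bytes long")
    system_id = digits[-14:-2]          # 6 bytes before the selector
    return {
        "area": digits[:-14],           # whatever remains on the left (1 to 13 bytes)
        "system_id": ".".join(system_id[i:i + 4] for i in range(0, 12, 4)),
        "sel": digits[-2:],             # last byte, read first
    }


for net in ("01.1921.6800.1001.00", "49.1234.1921.6800.1001.00"):
    print(net, parse_net(net))
```

Working from the right keeps the parsing independent of the area length, which is the only variable-length portion.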
Inter-Router Communication
Unlike other routing protocols, intermediate system (IS) communication is protocol
independent because inter-router communication is not encapsulated in the third layer
(network) of the OSI model. IS communication uses the second layer of the OSI model.
IP, IPv6, and other protocols all use the third layer addressing in the OSI model.
IS protocol data units (PDU) (packets) follow a common header structure that identifies the
type of the PDU. Data specific to each PDU type follows the header, and the last fields use
optional variable-length fields that contain information specific to the IS PDU type.
IS packets are categorized into three PDU types, with each type differentiating between
L1 and L2 routing information:
■ IS-IS Hello (IIH) Packets: IIH packets are responsible for discovering and maintain-
ing neighbors.
■ Link State Packets (LSP): LSPs provide information about a router and associated
networks. Similar to an OSPF LSA, except OSPF uses multiple LSAs.
■ Sequence Number Packets (SNP): Sequence number packets (SNP) control the
synchronization process of LSPs between routers.
■ Complete sequence number packets (CSNP) provide the LSP headers for the
LSPDB of the advertising router to ensure the LSPDB is synchronized.
IS Protocol Header
Every IS packet includes a common header that describes the PDU. All eight fields are
1-byte long and are in all packets.
Table 9-1 provides an explanation for the fields listed in the IS Protocol Header.
Field Description
PDU Type 1-byte representation of the PDU type.
ISO 10589 states that a value of 0 in the IS packet header is treated in a special way
for the IS Type, LSPF Database Overload Bit, and Maximum Area Addresses fields.
A value of zero implies the default setting indicated in Table 9-1.
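The eight 1-byte header fields can be decoded with a small sketch. The field order and PDU type numbers below follow ISO 10589 as commonly documented and are stated here as assumptions, because the header layout is not reproduced in this text. The function name and the sample bytes are invented for illustration.

```python
import struct

PDU_TYPES = {
    15: "L1 LAN IIH", 16: "L2 LAN IIH", 17: "P2P IIH",
    18: "L1 LSP", 20: "L2 LSP",
    24: "L1 CSNP", 25: "L2 CSNP", 26: "L1 PSNP", 27: "L2 PSNP",
}


def parse_common_header(data: bytes) -> dict:
    """Decode the 8-byte common header found at the start of every IS-IS PDU."""
    if len(data) < 8:
        raise ValueError("the common header is 8 bytes long")
    (discriminator, header_length, _version_ext, id_length,
     pdu_type, _version, _reserved, max_areas) = struct.unpack("!8B", data[:8])
    return {
        "discriminator": hex(discriminator),
        "header_length": header_length,
        # A value of 0 means the default: 6-byte system IDs
        "system_id_length": 6 if id_length == 0 else id_length,
        # Only the low 5 bits carry the PDU type
        "pdu_type": PDU_TYPES.get(pdu_type & 0x1F, "unknown"),
        # A value of 0 means the default: 3 area addresses
        "max_area_addresses": 3 if max_areas == 0 else max_areas,
    }


sample = bytes([0x83, 27, 1, 0, 15, 1, 0, 0])   # hand-built example, not a capture
print(parse_common_header(sample))
```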
TLVs
A portion of IS PDUs uses variable modules that contain routing information. Each
module specifies the type of information, the length of the data, and the value itself, and is
commonly referred to as a type, length, and value (TLV) tuple. Every TLV begins with a
1-byte numeric label that identifies the type (function), followed by a 1-byte length of
the data. TLVs support the capability of nesting, so a sub-TLV can exist inside another TLV.
TLVs provide functionality and scalability to the IS protocol. Developing new features
for the IS protocol involves the addition of TLVs to the existing structure. For example,
IPv6 support was added to the IS protocol by adding TLV #232 (IPv6 Interface Address)
and #236 (IPv6 Reachability).
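Walking a TLV section can be sketched as a simple loop. This is an illustrative parser that assumes a 1-byte type followed by a 1-byte length; the function name parse_tlvs and the sample payload are invented, and the values are returned as raw bytes rather than decoded.

```python
def parse_tlvs(data: bytes):
    """Return a list of (type, value) pairs from a run of TLVs."""
    tlvs = []
    offset = 0
    while offset < len(data):
        if offset + 2 > len(data):
            raise ValueError("truncated TLV header")
        tlv_type, length = data[offset], data[offset + 1]
        value = data[offset + 2:offset + 2 + length]
        if len(value) != length:
            raise ValueError("truncated TLV value")
        tlvs.append((tlv_type, value))
        offset += 2 + length        # skip type, length, and value
    return tlvs


# Hand-built payload: TLV 137 (hostname) followed by TLV 232 with 16 zero bytes
payload = bytes([137, 4]) + b"NX-1" + bytes([232, 16]) + bytes(16)
for tlv_type, value in parse_tlvs(payload):
    print(tlv_type, len(value), value)
```

Because an unknown type can be skipped by its length, a receiver that does not understand a newer TLV can still process the rest of the PDU, which is what makes the format extensible.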
IS PDU Addressing
Communication between IS devices uses Layer 2 addresses. The source address is always
the network interface’s Layer 2 address, and the destination address varies depending
upon the network type. Nexus switches are Ethernet based and therefore use Layer 2
MAC addresses for IS-IS communication.
ISO standards classify network media into two categories: broadcast and general
topology.
Table 9-2 provides a list of destination MAC addresses used for IS communication.
General topology networks are based on network media that allows communication
with only one other device when a single packet is sent out. General topology networks are
often referred to in IS-IS documentation as point-to-point networks. Point-to-point
networks communicate with a directed destination address that matches the Layer 2 address
for the remote device. NBMA technologies such as Frame Relay may not guarantee
communication to all devices with a single packet. A common best practice is to use
point-to-point subinterfaces on NBMA technologies to ensure proper communication
between IS-IS nodes.
Routers that form an L1-L2 adjacency with another IS-IS router send both L1 and L2
IIHs on broadcast links. To save bandwidth on WAN links, point-to-point links use the
Point-to-Point Hello, which services both L1 and L2 adjacencies.
Table 9-3 provides a brief overview of the five IS Hello packet types.
Table 9-4 provides a brief list of information included in the IIH Hello Packet.
Link-State Packets
Link-state packets (LSP) are similar to OSPF LSAs in that they advertise neighbors and
attached networks, except that IS-IS uses only two types of LSPs. IS-IS defines an LSP
type for each level. L1 LSPs are flooded throughout the area in which they originate, and
L2 LSPs are flooded throughout the Level 2 network.
LSP ID
The LSP ID is a fixed 8-byte field that provides a unique identification of the LSP
originator. The LSP ID is composed of the following:
■ System ID (6 bytes): The system ID is extracted from the NET address configured
on the router.
■ Pseudonode ID (1 byte): The pseudonode ID identifies the LSP for a specific
pseudonode (virtual router) or for the physical router. LSPs with a pseudonode ID
of zero describe the links from the system and can be called non-pseudonode LSPs.
LSPs with a nonzero number indicate that the LSP is a pseudonode LSP. The
pseudonode ID correlates to the router’s circuit ID for the interface performing
the designated intermediate system (DIS) function. The pseudonode ID is unique
among any other broadcast segments for which the same router is the DIS on that
level. Pseudonodes and DIS are explained later in this chapter.
■ Fragment ID (1 byte): If an LSP is larger than the max MTU value of the interface it
needs to be sent out of, that LSP must be fragmented. IS-IS fragments the LSP as it is
created, and the fragment-ID allows the receiving router to process fragmented LSPs.
Figure 9-5 shows two LSP IDs. The LSP ID on the left indicates that it is for a spe-
cific IS router, and the LSP ID on the right indicates that it is for the DIS because the
pseudonode ID is not zero.
Regular LSP ID: 1921.6800.1001.00-00
Pseudonode LSP ID: 1921.6800.1001.01-00
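The three parts of an LSP ID can be separated with a short sketch. It assumes the display format used in Figure 9-5 (system ID, a dot, the pseudonode ID, a hyphen, and the fragment ID, with the last two shown in hexadecimal). The function name parse_lsp_id is invented for the example.

```python
def parse_lsp_id(lsp_id: str) -> dict:
    """Split 'system.pseudonode-fragment' into its components."""
    node, fragment = lsp_id.rsplit("-", 1)       # fragment ID follows the last hyphen
    system, pseudonode = node.rsplit(".", 1)     # pseudonode ID follows the last dot
    pseudonode_id = int(pseudonode, 16)
    return {
        "system": system,
        "pseudonode_id": pseudonode_id,
        "fragment": int(fragment, 16),
        "is_pseudonode_lsp": pseudonode_id != 0,
    }


for lsp_id in ("1921.6800.1001.00-00", "1921.6800.1001.01-00"):
    print(lsp_id, parse_lsp_id(lsp_id))
```

Splitting from the right also handles LSP IDs that display a hostname in place of the system ID, such as NX-2.01-00.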
Attribute Fields
The last portion of the LSP header is an 8-bit section that references four components of
the IS-IS specification:
■ Partition Bit: The partition bit identifies whether a router supports the capability for
partition repair. Partition repair allows a broken L1 area to be repaired by L2 routers
that belong to the same area as the L1 routers. Cisco and most other network ven-
dors do not support partition repair.
■ Attached Bit: The next four bits reflect the attached bit set by an L1-L2 router
connected to other areas via the L2 backbone. The attached bit is in L1 LSPs.
■ Overload Bit: The overload bit indicates when a router is in an overloaded condi-
tion. During SPF calculation, routers should avoid sending traffic through this router.
Upon recovery, the router advertises a new LSP without the overload bit, and the
SPF calculation occurs normally without avoiding routes through the previously
overloaded node.
■ Router Type: The last two bits indicate whether the LSP is from an L1 or L2 router.
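The 8-bit attribute section described in the list above can be decoded with simple bit masks. The bit order follows the list (partition bit first, router type last). The router type values of 1 for L1 and 3 for L2 are taken from ISO 10589 as commonly documented and are an assumption here, as is the function name.

```python
def decode_lsp_attributes(octet: int) -> dict:
    """Decode the partition, attached, overload, and router-type bits."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("the attribute section is a single byte")
    return {
        "partition": bool(octet & 0x80),     # first bit
        "attached": (octet >> 3) & 0x0F,     # next four bits
        "overload": bool(octet & 0x04),      # overload bit
        "router_type": {1: "L1", 3: "L2"}.get(octet & 0x03, "unused"),
    }


# Hand-picked values for illustration
for octet in (0x01, 0x0B, 0x07):
    print(f"{octet:#04x}", decode_lsp_attributes(octet))
```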
IS-IS overcomes this inefficiency by creating a pseudonode (virtual router) to manage the
synchronization issues that arise on the broadcast network segment. A router on the broad-
cast segment, known as the Designated Intermediate System (DIS), assumes the role of
the pseudonode. If the acting DIS router fails, another router becomes the new DIS and
assumes the responsibilities. A pseudonode and DIS exist for each IS-IS level (L1 and L2)
which means that a broadcast segment can have two pseudonodes and two DISs.
By inserting the logical pseudonode into a broadcast segment, the multi-access network
segment is converted into multiple point-to-point networks in the LSPDB.
Note There is a natural tendency to associate IS-IS DIS behavior with OSPF’s designated
router (DR) behavior, but they operate differently. All IS-IS routers form a full
neighbor adjacency with each other. Any router can advertise non-pseudonode LSPs to all
other IS-IS routers on that segment, whereas OSPF specifies that LSAs are sent to the DR
to be advertised to the network segment.
The DIS advertises a pseudonode LSP that indicates the routers that attach to the
pseudonode. The pseudonode LSP acts like an OSPF Type-2 LSA because it indicates
the attached neighbors and informs the nodes which router is acting as the DIS. The
system IDs of the routers connected to the pseudonode are listed in the IS Reachability
TLV with an interface metric set to zero because SPF uses the metric for the non-
pseudonode LSPs for calculating the SPF tree.
The pseudonode advertises the complete sequence number packets (CSNP) every 10
seconds. IS-IS routers check their LSPDBs to verify that all LSPs listed in the CSNP exist,
and that the sequence number matches the version in the CSNP.
■ If an LSP is missing or the router has an outdated (lower sequence number) LSP than
what is contained in the CSNP, the router advertises a partial sequence number
packet (PSNP) requesting the correct or missing LSP. All IS-IS routers receive the
PSNP, but only the DIS sends out the correct LSP, thereby reducing traffic on that
network segment.
■ If a router detects that the sequence number in the CSNP is lower than the sequence
number for any LSP that is stored locally in its LSPDB, it advertises the local LSP
with the higher sequence number. All IS-IS routers receive the LSP and process it
accordingly. The DIS should send out an updated CSNP with the updated sequence
number for the advertised LSP.
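The two checks a router performs against a received CSNP can be sketched as a comparison of sequence numbers. This illustration models only the two cases in the list above. LSPs that exist locally but are absent from the CSNP are not handled, and the function name, LSP IDs, and sequence numbers are invented for the example.

```python
def compare_with_csnp(local_lspdb: dict, csnp: dict):
    """Return (LSP IDs to request with a PSNP, LSP IDs to re-advertise)."""
    request = []
    advertise = []
    for lsp_id, csnp_seq in csnp.items():
        local_seq = local_lspdb.get(lsp_id)
        if local_seq is None or local_seq < csnp_seq:
            request.append(lsp_id)      # missing or outdated locally
        elif local_seq > csnp_seq:
            advertise.append(lsp_id)    # local copy is newer than the CSNP entry
    return request, advertise


local_lspdb = {"NX-1.00-00": 5, "NX-2.00-00": 7, "NX-2.01-00": 3}
csnp = {"NX-1.00-00": 5, "NX-2.00-00": 6, "NX-2.01-00": 4, "NX-3.00-00": 2}

request, advertise = compare_with_csnp(local_lspdb, csnp)
print("request via PSNP:", request)
print("advertise newer LSP:", advertise)
```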
Path Selection
Note that the IS-IS path selection is quite straightforward after reviewing the following
key definitions:
■ Intra-area routes are routes that are learned from another router within the same
level and area address.
■ Inter-area routes are routes that are learned from another L2 router that came from
an L1 router or from an L2 router from a different area address.
■ External routes are routes that are redistributed into the IS-IS domain. External
routes can choose between two metric types:
■ Internal metrics are directly comparable with IS-IS path metrics and are selected
by default by Nexus switches. IS-IS treats these routes with the same preferences
as those advertised normally via TLV #128.
IS-IS best-path selection follows the processing order shown in the following steps to
identify the route with the lowest path metric for each stage.
Note Under normal IS-IS configuration, only the first three steps are used. External
routes with external metrics require the external metric-type to be explicitly specified in
the route-map at the time of redistribution.
Step 1. Enable the IS-IS feature. The IS-IS feature must be enabled with the global
configuration command feature isis.
Step 2. Define an IS-IS process tag. The IS-IS process must be defined with the glob-
al configuration command router isis instance-tag. The instance-tag can be
up to 20 alphanumeric characters in length.
Step 3. Define the IS-IS NET address. The NET address must be configured with the
command net net-address.
Step 4. Define the IS-IS type (optional). By default, Nexus switches operate as L1-L2
routers. This means that an L1 adjacency is formed with L1 neighbors, an
L2 adjacency is formed with L2 neighbors, and two sessions (L1 and L2) are
formed with another L1-L2 IS-IS peer.
The IS-IS router type is changed with the command is-type {level-1 | level-
1-2 | level-2}.
Step 6. Enable IS-IS on interfaces. The interface that IS-IS is enabled on is selected
with the command interface interface-id. The IS-IS process is then enabled
on that interface with the command ip router isis instance-tag.
NX-2# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-2(config)# feature isis
NX-2(config)# router isis NXOS
IS-IS requires that neighboring routers form an adjacency before LSPs are processed. The
IS-IS neighbor adjacency process consists of three states: down, initializing, and up. This
section explains the process for troubleshooting IS-IS neighbor adjacencies on Nexus
switches.
Figure 9-6 provides a simple topology with two Nexus switches that are used to explain
how to troubleshoot IS-IS adjacency problems.
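[Figure 9-6: NX-1 (.100) and NX-2 (.200) are L1-L2 routers connected to each other through their Ethernet1/1 interfaces. VLAN 10 attaches to NX-1 and VLAN 20 attaches to NX-2.]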
Example 9-2 displays the output of the nondetailed command on NX-1. Notice that
there is an entry for the L1 adjacency and a separate entry for the L2 adjacency. This is
expected behavior for L1-L2 adjacencies with other routers.
Table 9-7 provides a brief overview of the fields used in Example 9-2. Notice that the
Holdtime for NX-2 is relatively low because NX-2 is the DIS for the 10.12.1.0/24 network.
Note Notice that the system ID actually references the router’s hostname instead of the
6-byte system ID. IS-IS provides a name to system ID mapping under the optional TLV
#137 that is found as part of the LSP. This feature is disabled under the IS-IS router con-
figuration with the command no hostname dynamic.
Example 9-3 displays the show isis adjacency command using the summary and detail
keywords. Notice that the optional detail keyword provides accurate timers for transi-
tion states for a particular neighbor.
Example 9-3 Display of IS-IS Neighbors with summary and detail Keywords
L1-2 0 0 0 0
SubTotal 0 0 0 0
Total 2 0 0 2
Besides enabling IS-IS on the network interfaces on Nexus switches, the following
parameters must match for the two switches to become neighbors:
■ MTU matches.
■ L1 adjacencies require the area address to match the peering L1 router, and the
system ID must be unique between neighbors.
■ L1 routers can form adjacencies with L1 or L1-L2 routers, but not L2.
■ L2 routers can form adjacencies with L2 or L1-L2 routers, but not L1.
■ DIS requirements match.
■ IIH Authentication Type & Credentials (if any).
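Several of the requirements above can be expressed as a small compatibility check. This is a simplified sketch: it models only the level, area address, system ID, and MTU rules, assumes one area address per router, and ignores DIS requirements and authentication. The class, function, and sample values are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class IsisRouter:
    system_id: str
    area: str
    is_type: str      # "L1", "L2", or "L1-L2"
    mtu: int = 1500


def levels(router: IsisRouter) -> set:
    return {"L1": {1}, "L2": {2}, "L1-L2": {1, 2}}[router.is_type]


def adjacency_levels(a: IsisRouter, b: IsisRouter) -> set:
    """Return the set of levels (1, 2) on which a and b could become neighbors."""
    if a.system_id == b.system_id or a.mtu != b.mtu:
        return set()                      # duplicate system ID or MTU mismatch
    common = levels(a) & levels(b)        # both sides must run the level
    if a.area != b.area:
        common.discard(1)                 # L1 requires a matching area address
    return common


nx1 = IsisRouter("1921.6800.1001", "49.0012", "L1-L2")
nx2 = IsisRouter("1921.6800.2002", "49.0012", "L1-L2")
other_area = IsisRouter("1921.6800.3003", "49.0034", "L1-L2")
l1_only = IsisRouter("1921.6800.4004", "49.0012", "L1")
l2_only = IsisRouter("1921.6800.5005", "49.0012", "L2")

print(adjacency_levels(nx1, nx2))
print(adjacency_levels(nx1, other_area))
print(adjacency_levels(l1_only, l2_only))
```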
Some of the output in Example 9-4 has been omitted for brevity, but the following rel-
evant information is shown in the output:
L1 L2 L1 L2 L1 L2
--------------------------------------------------------------------------------
Topology: TopoID: 0
Vlan10 Bcast 3 Down/Ready 0x02/L1-2 1500 4 4 64 64 0/0 0/0
Topology: TopoID: 0
loopback0 Loop 1 Up/Ready 0x01/L1-2 1500 1 1 64 64 0/0 0/0
Topology: TopoID: 0
VLAN10 Bcast 2 Up/Ready 0x01/L1-2 1500 4 4 64 64 1/0 1/0
Topology: TopoID: 0
VLAN10 Bcast 4 Up/Ready 0x03/L1-2 1500 4 4 64 64 0/0 0/0
The command show isis lists the IS-IS interfaces and provides an overview of the IS-IS
configuration for the router, which can be a more efficient starting point. Example 9-5
displays the command. Notice that the System ID, MTU, metric styles, area address, and
topology mode are provided.
Passive Interface
Some network topologies require advertising a network segment into IS-IS, but need to
prevent routers in that segment from forming neighbor adjacencies on that segment. A
passive interface is displayed as Inactive when displaying the IS-IS interfaces. The com-
mand show isis interface displays all IS-IS interfaces and the current status. Example 9-6
displays the use of this command. Notice that the Ethernet1/1 interface is passive for L1
only, whereas it is active for L2.
Level1
No auth type and keychain
Auth check set
Level2
No auth type and keychain
Auth check set
Index: 0x0002, Local Circuit ID: 0x01, Circuit Type: L1-2
BFD IPv4 is locally disabled for Interface Ethernet1/1
BFD IPv6 is locally disabled for Interface Ethernet1/1
MTR is disabled
Passive level: level-1
LSP interval: 33 ms, MTU: 1500
Level-2 Designated IS: NX-2
Level Metric-0 Metric-2 CSNP Next CSNP Hello Multi Next IIH
1 4 0 10 Inactive 10 3 Inactive
2 4 0 10 00:00:06 10 3 00:00:03
Level Adjs AdjsUp Pri Circuit ID Since
1 0 0 64 0000.0000.0000.00 00:01:55
2 1 1 64 NX-2.01 00:01:57
Topologies enabled:
L MT Metric MetricCfg Fwdng IPV4-MT IPV4Cfg IPV6-MT IPV6Cfg
1 0 4 no UP DN yes DN no
2 0 4 no UP UP yes DN no
Now that a passive interface has been identified, the configuration must be examined for
the following:
Example 9-7 displays the configuration on NX-1 and NX-2 that prevents the two Nexus
switches from forming an IS-IS adjacency on L1 or L2. The Ethernet1/1 interfaces must
be active on both switches per IS-IS level for an adjacency to form. The interfaces can be
made active by removing the command isis passive-interface level-1 from Ethernet1/1
on NX-1 and configuring the command no isis passive-interface level-1-2 on interface
Ethernet1/1 on NX-2 to allow an adjacency to form on L1 and L2.
interface loopback0
ip router isis NXOS
interface Ethernet1/1
ip router isis NXOS
isis passive-interface level-1
interface VLAN10
ip router isis NXOS
interface loopback0
ip router isis NXOS
interface Ethernet1/1
ip router isis NXOS
no isis passive-interface level-1
interface VLAN20
ip router isis NXOS
Example 9-8 displays the use of this command. Notice that there is a separation of
authentication errors from other errors. Executing the command while specifying an
interface provides more granular visibility to the packets received or transmitted for an
interface.
Example 9-9 displays the transmission and receipt of L1 and L2 IIH packets.
03:25:37.583037 isis: NXOS Receive L1 LAN IIH over Ethernet1/1 from NX-2 (0021.21ae.
c123) len 1497 prio 0
03:25:37.583102 isis: NXOS Failed to find IPv6 address TLV MT-0
03:25:37 NX-1 %ISIS-5-ADJCHANGE: isis-NXOS LAN adj L1 NX-2 over Ethernet1/1 - INIT
(New) on MT--1
03:25:37.583158 isis: NXOS isis_iih_find_bfd_enable: MT 0 : isis_topo_bfd_required =
FALSE
03:25:37.583176 isis: NXOS isis_iih_find_bfd_enable: MT 0 : isis_topo_usable = TRUE
03:25:37.583193 isis: NXOS isis_receive_lan_iih: isis_bfd_required = 0, isis_
neighbor_useable 1
03:25:37.583229 isis: NXOS Set adjacency NX-2 over Ethernet1/1 IPv4 address to
10.12.1.200
03:25:37.583271 isis: NXOS isis_receive_lan_iih BFD TLV: Bring UP adjacency
03:25:37.583295 isis: NXOS 2Way Advt pseudo-lsp : LAN adj L1 NX-2 over Ethernet1/1
03:25:37 NX-1 %ISIS-5-ADJCHANGE: isis-NXOS LAN adj L1 NX-2 over Ethernet1/1 - UP
on MT-0
03:25:37.583365 isis: NXOS Obtained Restart TLV RR=0, RA=0, SA=0
03:25:37.583383 isis: NXOS Process restart tlv for adjacency NX-2 over Ethernet1/1
address 10.12.1.200
03:25:37.583397 isis: NXOS Process restart info for NX-2 on Ethernet1/1: RR=no,
RA=no SA=no
03:25:37.583410 isis: NXOS Restart TLV present SA did not change SA state unsuppress
adj changed
03:25:37.583467 isis: NXOS Timer started with holding time 30 sec
03:25:37.583484 isis: NXOS Sending triggered LAN IIH on Ethernet1/1
03:25:37.583501 isis: NXOS Sending triggered LAN IIH on Ethernet1/1
03:25:37.583516 isis: NXOS isis_receive_lan_iih: Triggering DIS election
03:25:37.583571 isis: NXOS LAN IIH parse complete
03:25:37.604100 isis: NXOS Receive L2 LAN IIH over Ethernet1/1 from NX-2 (0021.21ae.
c123) len 1497 prio 0
Debug commands are generally the least preferred method for finding root cause because
of the amount of data that could be generated while the debug is enabled. NX-OS
provides event-history, which runs in the background without a performance hit and offers
another method of troubleshooting. The command show isis event-history [adjacency |
dis | iih | lsp-flood | lsp-gen] provides helpful information when troubleshooting IS-IS. The
iih keyword provides the same information as the debug command in Example 9-9.
Example 9-10 displays the show isis event-history iih command. Compare the
sample output on NX-1 with the previous debug output. There is not much
difference in the information.
03:33:27.593010 isis NXOS [11140]: [11145]: 2Way Advt pseudo-lsp : LAN adj L1 NX-2
over Ethernet1/1
03:33:27.592977 isis NXOS [11140]: [11145]: Set adjacency NX-2 over Ethernet1/1 IPv4
address to 10.12.1.200
03:33:27.592957 isis NXOS [11140]: [11145]: isis_receive_lan_iih: isis_bfd_required
= 0, isis_neighbor_useable 1
03:33:27.592904 isis NXOS [11140]: [11145]: Failed to find IPv6 address TLV MT-0
03:33:27.592869 isis NXOS [11140]: [11145]: Receive L1 LAN IIH over Ethernet1/1 from
NX-2 (0021.21ae.c123) len 1497 prio 0
03:33:27.590316 isis NXOS [11140]: [11141]: isis_elect_dis(): Sending triggered LAN
IIH on Ethernet1/1
03:33:27.590253 isis NXOS [11140]: [11141]: Advertising MT-0 adj 0000.0000.0000.00
for if Ethernet1/1
03:33:27.590241 isis NXOS [11140]: [11141]: Advertising MT-0 adj NX-2.01 for if
Ethernet1/1
03:33:27.590181 isis NXOS [11140]: [11141]: Send L1 LAN IIH over Ethernet1/1 len
1497 prio 6,dmac 0180.c200.0014
03:33:27.582343 isis NXOS [11140]: [11145]: Sending triggered LAN IIH on Ethernet1/1
03:33:27.582339 isis NXOS [11140]: [11145]: Sending triggered LAN IIH on Ethernet1/1
03:33:27.582307 isis NXOS [11140]: [11145]: Process restart tlv for adjacency NX-2
over Ethernet1/1 address 10.12.1.200
03:33:27.582242 isis NXOS [11140]: [11145]: 2Way Advt pseudo-lsp : LAN adj L2 NX-2
over Ethernet1/1
03:33:27.582207 isis NXOS [11140]: [11145]: Set adjacency NX-2 over Ethernet1/1 IPv4
address to 10.12.1.200
03:33:27.582154 isis NXOS [11140]: [11145]: isis_receive_lan_iih: isis_bfd_required
= 0, isis_neighbor_useable 1
03:33:27.582101 isis NXOS [11140]: [11145]: Failed to find IPv6 address TLV MT-0
03:33:27.582066 isis NXOS [11140]: [11145]: Receive L2 LAN IIH over Ethernet1/1 from
NX-2 (0021.21ae.c123) len 1497 prio 0
03:33:27.579283 isis NXOS [11140]: [11141]: Send L2 LAN IIH over Ethernet1/1 len
1497 prio 6,dmac 0180.c200.0015
Performing IS-IS debugs shows only the packets that have reached the supervisor CPU.
If packets are not displayed in the debugs or event-history, further troubleshooting is
required: examine quality of service (QoS) policies and control plane policing (CoPP),
or verify that the packet is actually leaving or entering the interface.
QoS policies may or may not be deployed on an interface. If they are deployed, the
policy-map must be examined for dropped packets, which must then be traced back to
a class-map that matches the IS-IS routing protocol. The same process applies to CoPP
policies because they are based on QoS settings as well.
Example 9-11 displays the process for checking a switch’s CoPP policy with the follow-
ing logic:
1. Examine the CoPP policy with the command show running-config copp all. This
displays the relevant policy-map name, classes defined, and the police rate.
2. Investigate the class-maps to identify the conditional matches for that class-map.
3. After the class-map has been verified, examine the policy-map drops for that class
with the command show policy-map interface control-plane. If drops are found,
the CoPP policy needs to be modified to accommodate a higher IS-IS packet flow.
Note This CoPP policy was taken from a Nexus 7000 switch, and the policy-name and
class-maps may vary depending on the platform.
Another technique to see if the packets are reaching the Nexus switch is to use the
built-in Ethanalyzer. The Ethanalyzer is used because IS-IS uses Layer 2 addressing, which
restricts packet captures on Layer 3 ports. The command ethanalyzer local interface
inband [capture-filter “ether host isis-mac-address”] [detail] is used. The capture-filter
option restricts the capture to specific types of traffic, and the filter ether host
isis-mac-address restricts it to IS-IS traffic based on the values from Table 9-2. The
optional detail keyword provides a packet-level view of any matching traffic. The use of
Ethanalyzer is shown in Example 9-12 to identify L2 IIH packets.
Capturing on inband
09:08:42.979127 88:5a:92:de:61:7c -> 01:80:c2:00:00:15 ISIS L2 HELLO,
System-ID: 0000.0000.0001
09:08:46.055807 88:5a:92:de:61:7c -> 01:80:c2:00:00:15 ISIS L2 HELLO,
System-ID: 0000.0000.0001
09:08:47.489024 88:5a:92:de:61:7c -> 01:80:c2:00:00:15 ISIS L2 CSNP,
Source-ID: 0000.0000.0001.00, Start LSP-ID: 0000.0000.0000.00-00, End LSP-ID:
ffff.ffff.ffff.ff-ff
09:08:48.570401 00:2a:10:03:f2:80 -> 01:80:c2:00:00:15 ISIS L2 HELLO,
System-ID: 0000.0000.0002
09:08:49.215861 88:5a:92:de:61:7c -> 01:80:c2:00:00:15 ISIS L2 HELLO,
System-ID: 0000.0000.0001
09:08:52.219001 88:5a:92:de:61:7c -> 01:80:c2:00:00:15 ISIS L2 HELLO,
System-ID: 0000.0000.0001
Capturing on inband
Frame 1 (1014 bytes on wire, 1014 bytes captured)
Arrival Time: May 22, 2017 09:07:16.082561000
[Time delta from previous captured frame: 0.000000000 seconds]
[Time delta from previous displayed frame: 0.000000000 seconds]
[Time since reference or first frame: 0.000000000 seconds]
Frame Number: 1
Frame Length: 1014 bytes
Capture Length: 1014 bytes
[Frame is marked: False]
IS Neighbor: 00:2a:10:03:f2:80
Restart Signaling (1)
Restart Signaling Flags : 0x00
.... .0.. = Suppress Adjacency: False
.... ..0. = Restart Acknowledgment: False
.... ...0 = Restart Request: False
Padding (255)
Padding (255)
Padding (255)
Padding (171)
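The capture filter described above needs the IS-IS destination MAC address in the notation that the filter expects. The following Python sketch converts the Cisco dotted format that appears in the event-history (dmac 0180.c200.0014 for L1 and 0180.c200.0015 for L2) into the colon notation that appears in the Ethanalyzer output. The function names are hypothetical, and the sketch assumes that the capture filter takes tcpdump-style colon notation.

```python
def cisco_mac_to_colon(mac: str) -> str:
    """Convert a Cisco dotted MAC such as 0180.c200.0015 to colon notation."""
    digits = mac.replace(".", "").lower()
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def isis_capture_filter(mac: str) -> str:
    """Build the capture-filter expression for one IS-IS destination MAC."""
    return f"ether host {cisco_mac_to_colon(mac)}"


# Destination MACs taken from the dmac fields in the IIH event-history output.
L1_IIH_DMAC = "0180.c200.0014"
L2_IIH_DMAC = "0180.c200.0015"

for level, dmac in (("L1", L1_IIH_DMAC), ("L2", L2_IIH_DMAC)):
    print(level, isis_capture_filter(dmac))
```

The resulting string is what goes between the quotation marks of the capture-filter option.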
The subnet mask was changed on NX-2 from 10.12.1.200/24 to 10.12.1.200/25 for this
section. This places NX-2 in the 10.12.1.128/25 network, which is different from NX-1
(10.12.1.100).
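The effect of the mask change can be checked offline. The sketch below uses the Python standard library ipaddress module to test whether each switch sees the other's address inside its own subnet. NX-1's /24 mask is an assumption (the text gives only its address), and the function name is hypothetical.

```python
import ipaddress

# NX-1's /24 mask is an assumption; the text gives only its address.
NX1 = ipaddress.ip_interface("10.12.1.100/24")
NX2 = ipaddress.ip_interface("10.12.1.200/25")


def neighbor_is_usable(local, remote):
    """Return True if the remote address falls inside the local subnet."""
    return remote.ip in local.network


print("NX-2 subnet:", NX2.network)
print("NX-1 accepts NX-2:", neighbor_is_usable(NX1, NX2))
print("NX-2 accepts NX-1:", neighbor_is_usable(NX2, NX1))
```

Under these assumptions only NX-1 considers its neighbor's address usable, which is consistent with a one-sided adjacency state.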
When examining the IS-IS neighbor table, note that NX-1 is in INIT state with NX-2,
but NX-2 does not detect NX-1. This is shown in Example 9-13.
The next plan of action is to check the IS-IS event-history for adjacency and IIH on
NX-1 and NX-2. NX-1 has adjacency entries for NX-2, whereas NX-2 does not have any
adjacency entries. After checking the IIH event-history, NX-2 displays that it cannot find
a usable IP address, as shown in Example 9-14.
The next step is to check and correct IP addressing/subnet masks on the two IS-IS
routers’ interfaces so that connectivity is established.
MTU Requirements
IS-IS hellos (IIH) are padded with TLV #8 to reach the maximum transmission unit
(MTU) size of the network interface. Padding IIHs provides the benefit of detecting
errors with large frames or mismatched MTU on remote interfaces. Broadcast
interfaces transmit both L1 and L2 IIHs, so the padding wastes bandwidth when both
interfaces already use the same MTU.
To demonstrate the troubleshooting process for mismatched MTU, the MTU on NX-1 is
set to 1000, whereas the MTU remains at 1500 for NX-2.
The first step is to check the IS-IS adjacency state as shown in Example 9-15. NX-1 does
not detect NX-2, whereas NX-2 detects NX-1.
The next step is to examine the IS-IS IIH event-history to identify the problem. In
Example 9-16, NX-1 is sending IIH packets with a length of 997, and they are received
on NX-2. NX-2 is sending IIH packets with a length of 1497 to NX-1, which are not
received. The difference in IIH packet length indicates an MTU problem.
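The IIH lengths map directly to the interface MTU: a padded IIH is 3 bytes shorter than the MTU because the LLC header (DSAP, SSAP, and control) precedes the IS-IS PDU, so 1497 implies an MTU of 1500 and 997 implies an MTU of 1000. The following Python sketch extracts the length from an event-history entry and derives the implied MTU. The regular expression is modeled on the event-history lines shown earlier in this chapter, the 997-byte line is a hypothetical entry built from the lengths described in the text, and the function name is an assumption.

```python
import re

LLC_HEADER_BYTES = 3  # DSAP, SSAP, and control bytes that precede the IS-IS PDU

IIH_EVENT = re.compile(
    r"(?P<direction>Send|Receive) (?P<level>L[12]) LAN IIH over "
    r"(?P<interface>\S+).*?len\s+(?P<length>\d+)"
)


def implied_mtu(event_line):
    """Return (direction, level, interface, mtu) for an IIH event, else None."""
    match = IIH_EVENT.search(event_line)
    if match is None:
        return None
    mtu = int(match["length"]) + LLC_HEADER_BYTES
    return match["direction"], match["level"], match["interface"], mtu


# The first line mirrors the earlier event-history output; the second is hypothetical.
sent_by_nx2 = "Send L1 LAN IIH over Ethernet1/1 len 1497 prio 6,dmac 0180.c200.0014"
sent_by_nx1 = "Send L1 LAN IIH over Ethernet1/1 len 997 prio 6,dmac 0180.c200.0014"

for line in (sent_by_nx2, sent_by_nx1):
    print(implied_mtu(line))
```

When the implied MTU differs between the two ends of a link, the side with the smaller MTU cannot accept the larger padded IIH.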
The MTU is verified on both switches with the command show interface interface-id,
as shown in Example 9-17. The MTU on NX-2 is larger than the MTU on NX-1.
Cisco introduced a feature that disables the MTU padding after the router sends
the first five IIHs out of an interface. This eliminates wasted bandwidth while still
providing a mechanism for checking the MTU between routers. Nexus switches
disable the IIH padding with the interface parameter command no isis hello padding
[always]. The always keyword does not pad any IIH packets, which allows
NX-1 to form an adjacency but could result in problems later. The best solution is
to modify the interface MTU to the highest MTU that is acceptable between the
two devices’ interfaces.
Note If the IS-IS interface is a VLAN interface (SVI), make sure that all the L2 ports sup-
port the MTU configured on the SVI. For example, if VLAN 10 has an MTU of 9000, all
the trunk ports should be configured to support an MTU of 9000 as well.
Unique System-ID
The System-ID provides a unique identifier for an IS-IS router in the same area. A
Nexus switch drops packets that have the same System-ID as itself as part of a safety
mechanism. The syslog message Duplicate system ID is displayed along with the inter-
face and System-ID of the other device. Example 9-18 displays what the syslog message
looks like on NX-2.
Typically, a duplicate System-ID occurs when the IS-IS configuration from another
switch is copied. The System-ID portion of the NET address needs to be changed for an
adjacency to form.
Example 9-19 displays NX-1 and NX-2’s IS-IS adjacency tables. Notice that both Nexus
switches have established an L2 adjacency, but there is not an L1 adjacency like those
shown previously in this chapter.
Through logical deduction, NX-1 and NX-2 can establish and maintain bidirectional
transmission of IS-IS packets because the L2 adjacency is established. The problem is
therefore specific to L1 and indicates incorrect authentication parameters, invalid
timers, or area numbers that do not match.
Example 9-20 displays the IS-IS event-history for NX-1 and NX-2. Notice that the error
message No common area is displayed before the message indicating that the L1 IIH
is received.
The final step is to verify the configuration and check the NET addressing.
Example 9-21 displays the NET entries for NX-1 and NX-2. NX-1 has an area of
49.0012 and NX-2 has an area of 49.0002.
Changing the area portion of the NET address to match on either Nexus switch allows
for the L1 adjacency to form.
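The area comparison can be sketched in Python. The parser below relies on the fixed width of the trailing NET fields (a 6-byte system ID followed by a 1-byte NSEL) and treats everything before them as the area address. The full NET values are hypothetical; only the area portions (49.0012 and 49.0002) come from the text. The sketch also assumes a single NET per router.

```python
def parse_net(net):
    """Split a NET into (area, system_id, nsel) using the fixed-width trailing fields."""
    digits = net.replace(".", "").lower()
    if len(digits) < 16 or len(digits) % 2:
        raise ValueError(f"NET too short or odd length: {net!r}")
    int(digits, 16)  # raises ValueError if the NET is not hexadecimal
    area_hex, sysid_hex, nsel = digits[:-14], digits[-14:-2], digits[-2:]
    area = ".".join([area_hex[:2]] + [area_hex[i:i + 4] for i in range(2, len(area_hex), 4)])
    system_id = ".".join(sysid_hex[i:i + 4] for i in range(0, 12, 4))
    return area, system_id, nsel


def l1_adjacency_blockers(net_a, net_b):
    """List the NET-related reasons that prevent an L1 adjacency."""
    area_a, sysid_a, _ = parse_net(net_a)
    area_b, sysid_b, _ = parse_net(net_b)
    problems = []
    if area_a != area_b:
        problems.append(f"no common area ({area_a} vs {area_b})")
    if sysid_a == sysid_b:
        problems.append(f"duplicate system ID ({sysid_a})")
    return problems


# Hypothetical NETs; only the area values are taken from the text.
NX1_NET = "49.0012.0000.0000.0001.00"
NX2_NET = "49.0002.0000.0000.0002.00"

print(parse_net(NX1_NET))
print(l1_adjacency_blockers(NX1_NET, NX2_NET))
```

An empty list means that the NETs alone do not prevent the L1 adjacency.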
The IS-IS level that a Nexus switch operates at is set with the IS-IS configuration
command is-type {level-1 | level-1-2 | level-2-only}.
The setting is verified by looking at the IS-IS process as shown in Example 9-22.
Other topology designs may specify that a specific interface should establish only a
specific IS-IS level adjacency. This is accomplished with the interface parameter com-
mand isis circuit-type {level-1 | level-1-2 | level-2-only}.
This setting is verified by looking at the IS-IS process as shown in Example 9-23. Notice
that Ethernet1/1 is set to allow only L1 connections.
interface loopback0
ip router isis NXOS
interface Ethernet1/1
isis circuit-type level-1
ip router isis NXOS
interface EthernetVlan10
ip router isis NXOS
It is possible to set the Nexus switch to a specific IS-IS level functionality with a different
setting for a circuit from the global IS-IS setting. When the settings are combined, the Nexus
switch uses the most restrictive level when forming an adjacency. Table 9-8 displays the
adjacencies that a router is capable of forming, based solely on the IS-IS router type and
the IS-IS circuit type.
If IIH packets are missing from the event-history, the IS-IS router and interface-level
settings need to be verified on both routers.
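The "most restrictive level" rule can be modeled as a set intersection. The Python sketch below is a simplified model, not NX-OS logic: it intersects the router-level and circuit-level settings on each side and then intersects the two sides. The keywords come from the is-type and isis circuit-type commands in the text, and the function names are hypothetical. An L1 adjacency additionally requires a matching area, which this sketch does not check.

```python
LEVELS = {
    "level-1": frozenset({1}),
    "level-1-2": frozenset({1, 2}),
    "level-2-only": frozenset({2}),
}


def effective_levels(is_type, circuit_type="level-1-2"):
    """Levels an interface can use after combining router and circuit settings."""
    return LEVELS[is_type] & LEVELS[circuit_type]


def adjacency_levels(local, remote):
    """Levels at which two (is_type, circuit_type) pairs can form an adjacency."""
    return effective_levels(*local) & effective_levels(*remote)


# An L1-L2 router with an L1-only circuit facing an L1-L2 neighbor.
print(sorted(adjacency_levels(("level-1-2", "level-1"), ("level-1-2", "level-1-2"))))
# An L1 router facing an L2-only router shares no level.
print(sorted(adjacency_levels(("level-1", "level-1-2"), ("level-2-only", "level-1-2"))))
```

An empty result corresponds to the case in which IIH packets for the expected level never appear in the event-history.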
DIS Requirements
The default IS-IS interface on Nexus switches is a broadcast interface and requires a
DIS. Broadcast IS-IS interfaces that directly connect only two IS-IS
routers do not benefit from the use of a pseudonode. Resources are wasted on electing
a DIS, CSNPs are continuously flooded into the segment, and an unnecessary pseudonode
LSP is included in the LSPDB of all routers in that level. IS-IS allows general topology
interfaces to behave like a point-to-point interface with the interface command isis
network point-to-point.
An adjacency will not form between Nexus switches when one side is a broadcast
interface and the other side is an IS-IS point-to-point interface. Neither device shows
an IS-IS adjacency, but the general topology switch reports the message Fail: Receiving
P2P IIH over LAN interface xx in the IS-IS IIH event-history. The IS-IS event-history
indicates which neighbor has advertised the P2P interface. When those messages are
detected, the interface type needs to be changed on one node to ensure that they are
consistent.
Example 9-24 displays NX-2’s IS-IS event-history and the relevant configurations for
NX-1 and NX-2.
interface loopback0
ip router isis NXOS
interface Ethernet1/1
isis network point-to-point
ip router isis NXOS
interface EthernetVlan10
ip router isis NXOS
interface loopback0
ip router isis NXOS
interface Ethernet1/1
ip router isis NXOS
interface EthernetVlan20
ip router isis NXOS
Adding the command isis network point-to-point to NX-2’s Ethernet1/1 interface sets
both interfaces to the same type, and then an adjacency forms.
IIH Authentication
IS-IS allows for the authentication of IIH packets that are required to form an adjacency.
IIH authentication is configured on an interface-by-interface basis and uses different
settings for each IS-IS level. Authenticating on one PDU type is
sufficient for most designs.
IS-IS provides two types of authentication: plaintext and an MD5 cryptographic hash.
Plaintext mode provides little security because anyone with access to the link can see
the password with a network sniffer. MD5 mode sends a cryptographic hash instead, so
the password is never included in the PDUs, and this technique is widely accepted as
being the more secure mode. All IS-IS authentication is stored in TLV #10, which is part of
the IIH.
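The reason that the MD5 mode never exposes the password can be illustrated with a keyed hash. IS-IS cryptographic authentication uses HMAC-MD5, where the digest is computed over the PDU with the digest field zero-filled. The Python sketch below shows the idea with the standard hmac module. The PDU bytes are placeholders rather than a real IIH, and the function names are hypothetical.

```python
import hashlib
import hmac

MD5_DIGEST_BYTES = 16


def iih_hmac_md5(password: bytes, pdu_with_zeroed_digest: bytes) -> bytes:
    """Compute the 16-byte digest that is carried in the authentication TLV."""
    return hmac.new(password, pdu_with_zeroed_digest, hashlib.md5).digest()


def iih_is_authentic(password, pdu_with_zeroed_digest, received_digest):
    """Recompute the digest locally and compare it with the received one."""
    expected = iih_hmac_md5(password, pdu_with_zeroed_digest)
    return hmac.compare_digest(expected, received_digest)


# Placeholder bytes standing in for an IIH whose digest field is zero-filled.
fake_iih = b"\x83\x1b\x01\x00\x0f\x01\x00\x00" + bytes(MD5_DIGEST_BYTES)
digest = iih_hmac_md5(b"CISCO", fake_iih)

print(len(digest))
print(iih_is_authentic(b"CISCO", fake_iih, digest))
print(iih_is_authentic(b"cisco", fake_iih, digest))
```

Only the digest travels on the wire, so a receiver with a different password computes a different digest and rejects the IIH.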
Nexus switches enable IIH authentication with the interface parameter command
isis authentication key-chain key-chain-name {level-1 | level-2}. The authentication
type is identified with the command isis authentication-type {md5 | cleartext}
{level-1 | level-2}.
NX-1# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-1(config)# key chain IIH-AUTH
NX-1(config-keychain)# key 2
NX-1(config-keychain-key)# key-string CISCO
NX-1(config-keychain-key)# interface Ethernet1/1
NX-1(config-if)# isis authentication key-chain IIH-AUTH level-1
NX-1(config-if)# isis authentication-type md5 level-1
The password in the keychain is viewed with the command show key chain key-chain-
name [mode decrypt]. The optional mode decrypt keywords display the password in
plaintext as displayed in Example 9-28.
Upon enabling authentication, it is important to check the syslog for error messages that
indicate bad authentication. If any appear, verify the authentication options and
password on all peers for that network link.
Duplicate System ID
The IS-IS system ID plays a critical role in the creation of the topology. If two adjacent
routers have the same system ID in the same L1 area, an adjacency does not form, as shown
earlier. However, if two routers in the same L1 area have the same system ID and are
separated by an intermediary router, the duplicate system ID prevents routes from being
installed in the topology.
Figure 9-7 provides a sample topology in which all Nexus switches are in the same
area with only L1 adjacencies. NX-2 and NX-4 have been configured with the same system
ID of 0000.0000.0002. NX-3 sits between NX-2 and NX-4 and has a different system ID,
therefore allowing NX-2 and NX-4 to establish full neighbor adjacencies.
[Figure 9-7: topology in Area 49.1234; two switches are labeled with system ID 0000.0000.0002]
From NX-1’s perspective, the first apparent issue is that NX-4’s 10.4.4.0/24 network is
missing, as shown in Example 9-29.
Example 9-29 NX-1’s Routing Table with Missing NX-4’s 10.4.4.0/24 Network
On NX-2 and NX-4, there are complaints about LSPs with duplicate system IDs: L1
LSP—Possible duplicate system ID, as shown in Example 9-30.
Example 9-30 Syslog Messages with LSPs with Duplicate System IDs
Example 9-31 displays the routing table of the two Nexus switches with the Possible
duplicate system ID syslog messages. Notice that NX-2 is missing only NX-4’s interface
(10.4.4.0/24), whereas NX-4 is missing the 10.12.1.0/24 and NX-1’s Ethernet interface
(10.1.1.0/24). Examining the IS-IS database displays a flag (*) that indicates a problem
with NX-2.
A quick check of the router’s system ID is done by examining the IS-IS processes
on both Nexus switches that reported the Possible duplicate system ID using the
show isis | i system command. Notice that in Example 9-32, NX-2 and NX-4 have the
same system ID.
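When many switches are involved, the same check can be scripted after the system IDs are collected. The Python sketch below groups hostnames by system ID and reports any ID that appears more than once. The system IDs for NX-2 and NX-4 come from the text; the values for NX-1 and NX-3 and the function name are hypothetical.

```python
from collections import defaultdict


def find_duplicate_system_ids(system_ids):
    """Return {system_id: [hostnames]} for every system ID used more than once."""
    owners = defaultdict(list)
    for hostname, system_id in system_ids.items():
        owners[system_id.lower()].append(hostname)
    return {sid: sorted(hosts) for sid, hosts in owners.items() if len(hosts) > 1}


# NX-2 and NX-4 values are from the text; NX-1 and NX-3 are hypothetical.
collected = {
    "NX-1": "0000.0000.0001",
    "NX-2": "0000.0000.0002",
    "NX-3": "0000.0000.0003",
    "NX-4": "0000.0000.0002",
}

print(find_duplicate_system_ids(collected))
```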
The default reference bandwidth for NX-OS is 40 Gbps, whereas other Cisco OSs (IOS
and IOS XR) statically set the interface link metric to 10 regardless of interface speed.
Table 9-9 provides the default IS-IS metric for common network interface types using
the default reference bandwidth.
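The relationship between reference bandwidth and interface metric can be sketched as a simple division. The Python function below assumes that the metric is the reference bandwidth divided by the interface bandwidth, which agrees with the values used later in this section (40 for a 1 Gbps link and 4 for a 10 Gbps link at a 40 Gbps reference). The integer rounding and the floor of 1 for links faster than the reference are assumptions of this sketch, and the function name is hypothetical.

```python
NXOS_DEFAULT_REFERENCE_MBPS = 40_000  # 40 Gbps, as stated in the text
IOS_STATIC_METRIC = 10                # IOS and IOS XR default, regardless of speed


def nxos_default_metric(interface_mbps, reference_mbps=NXOS_DEFAULT_REFERENCE_MBPS):
    """Approximate the NX-OS default IS-IS metric for an interface speed."""
    # Integer division and the minimum of 1 are assumptions of this sketch.
    return max(1, reference_mbps // interface_mbps)


for name, mbps in (("1 Gbps", 1_000), ("10 Gbps", 10_000), ("40 Gbps", 40_000)):
    print(name, "NX-OS:", nxos_default_metric(mbps), "IOS:", IOS_STATIC_METRIC)
```

The contrast between the two columns is the source of the metric mismatch that appears when Nexus switches and IOS routers share a topology.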
Notice in Table 9-9 that there is no differentiation in the link cost associated with
a Fast Ethernet interface and a 40-Gigabit Ethernet interface on IOS routers. In
essence, suboptimal routing can exist when Nexus switches interact with IOS-based
devices in an IS-IS topology. For example, Figure 9-8 displays a topology in
which connectivity between R1 and R2 should take the 10 Gigabit path (R1→
NX-3→NX-4→R2) because the 1 Gigabit link between R1 and R2 should be used
only as a backup path.
[Figure 9-8: R1 and R2 connect directly over a Gigabit link (10.12.1.0/24). A second
path uses 10-Gigabit links: R1 to NX-3 (10.13.1.0/24), NX-3 to NX-4 (10.34.1.0/24), and
NX-4 to R2 (10.24.1.0/24). WAN Service Provider 1 (172.16.1.0/24) attaches to R1 and
WAN Service Provider 2 (172.32.2.0/24) attaches to R2. All four devices are L1.]
Example 9-33 displays the routing table of R1 with the default interface metrics on
all the devices. Traffic between 172.16.1.0/24 and 172.32.2.0/24 flows across the
backup 1 Gigabit link (10.12.1.0/24), which does not follow the intended traffic pat-
terns. Notice that the IS-IS path metric is 20 to the 172.32.2.0/24 network using the
1 Gigabit link.
Example 9-33 R1’s Routing Table with Default Interface Metrics Bandwidth
A useful property of IS-IS is that it structures networks as
objects that exist on top of the routers themselves. Instead of viewing the routing
table, the IS-IS topology table is viewed with the command show isis topology.
The IS-IS topology table lists the total path metric to reach the destination router,
next-hop node, and outbound interface. Example 9-34 displays the topology table
from R1 and NX-3’s perspective. R1 is selecting the path to R2 via the direct link
on Gi0/1.
Example 9-34 R1’s and NX-3’s IS-IS Topology Table with Default Metric
Notice how R1 and NX-3 have conflicting metric values when they point to each other.
Three options ensure that routing takes the optimal path:
■ Statically set the IS-IS metric on IS-IS devices that are not Nexus switches. IOS-
based devices use the interface parameter command isis metric metric-value.
■ Statically set the IS-IS metric on a Nexus interface to reflect network links that are
more preferred with the interface parameter command isis metric metric-value
{level-1 | level-2}.
■ Change the reference bandwidth on Nexus switches to a higher value to make those
links more preferred. The reference bandwidth is set with the IS-IS process configu-
ration command reference-bandwidth reference-bw {gbps | mbps}.
There are not any intermediary routers between R1 and R2, so the only option that
makes sense is to modify the IS-IS metrics on R1 and R2. Example 9-35 displays the met-
ric for the 10.12.1.0/24 link being statically set to 40, and the metric being set to 4 for
the 10 Gbps interface. The value correlates to a reference bandwidth of 40 Gbps.
R1# conf t
Enter configuration commands, one per line. End with CNTL/Z.
R1(config)# interface GigabitEthernet0/1
R1(config-if)# isis metric ?
<1-16777214> Default metric
maximum Maximum metric. All routers will exclude this link from their
SPF
Now that the change has been made, let’s examine the IS-IS routing table and topology
table on R1 and NX-3, as shown in Example 9-36. Now the interface metrics match
for the 10.13.1.0/24 and 10.24.1.0/24 networks. In addition, R1 is now selecting the
10 Gbps path as the preferred path to reach R2.
Example 9-36 IS-IS Routing and Topology Table After Static Metric Configuration
VRF: default
IS-IS Level-1 IS routing table
R1.00, Instance 0x00000023
*via R1, Ethernet1/2, metric 4
R2.00, Instance 0x00000023
*via NX-4, Ethernet1/1, metric 8
R2.02, Instance 0x00000023
*via NX-4, Ethernet1/1, metric 8
NX-4.00, Instance 0x00000023
*via NX-4, Ethernet1/1, metric 4
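The path selection before and after the metric change can be reproduced with a small shortest-path calculation. The Python sketch below runs Dijkstra's algorithm over per-direction interface metrics. The "before" values assume the defaults described in this section (10 on every IOS interface and 4 on the Nexus 10 Gbps interfaces), and the "after" values use the static metrics of 40 and 4 on R1 and R2. The graph layout and function name are assumptions for illustration and do not represent how NX-OS implements SPF.

```python
import heapq


def shortest_path(graph, source, target):
    """Return (total_metric, path) from source to target, or None if unreachable."""
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node in visited:
            continue
        visited.add(node)
        if node == target:
            return cost, path
        for neighbor, metric in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(queue, (cost + metric, neighbor, path + [neighbor]))
    return None


# Outbound interface metrics per router: {router: {neighbor: metric}}.
before = {
    "R1": {"R2": 10, "NX-3": 10},
    "R2": {"R1": 10, "NX-4": 10},
    "NX-3": {"R1": 4, "NX-4": 4},
    "NX-4": {"NX-3": 4, "R2": 4},
}
after = {
    "R1": {"R2": 40, "NX-3": 4},
    "R2": {"R1": 40, "NX-4": 4},
    "NX-3": {"R1": 4, "NX-4": 4},
    "NX-4": {"NX-3": 4, "R2": 4},
}

print(shortest_path(before, "R1", "R2"))
print(shortest_path(after, "R1", "R2"))
```

With the default metrics the direct link wins with a metric of 10 against 18 for the 10 Gbps path. After the change, the 10 Gbps path totals 12 against 40 for the direct link.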
RFC 3784 provided a method for the interface metric to use a 24-bit number that allows
for the metric to be set between 1 and 16,777,214. The 24-bit metrics are available in the
Extended IS Reachability TLV (22) and the Extended IP Reachability TLV (135), and are
commonly referred to as wide metrics.
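The difference in range follows from the width of the metric field. The Python sketch below compares the 6-bit narrow interface metric with the 24-bit wide metric. The highest 24-bit value is excluded from the configurable range because a link advertised with the maximum metric is left out of the SPF calculation, which matches the maximum keyword in the isis metric help output shown earlier. The constant and function names are hypothetical.

```python
NARROW_METRIC_BITS = 6
WIDE_METRIC_BITS = 24

NARROW_MAX = 2 ** NARROW_METRIC_BITS - 1      # largest narrow interface metric
WIDE_FIELD_MAX = 2 ** WIDE_METRIC_BITS - 1    # reserved: link is excluded from SPF
WIDE_CONFIGURABLE_MAX = WIDE_FIELD_MAX - 1    # upper bound quoted in the text


def fits_narrow(metric):
    """Return True if an interface metric can be carried in a narrow-metric TLV."""
    return metric <= NARROW_MAX


print(NARROW_MAX, WIDE_CONFIGURABLE_MAX)
# 40 fits in a narrow metric; a larger value such as 400 needs wide metrics.
print(fits_narrow(40), fits_narrow(400))
```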
Nexus switches accept narrow or wide metrics and advertise only wide metrics by
default. IOS and IOS XR accept and advertise only narrow metrics by default, which
causes problems when integrating non-Nexus switches in a topology. Figure 9-9 displays
a simple L1 topology with multiple device types. All devices and interfaces have IS-IS
enabled on them.
[Figure 9-9: L1 topology with multiple device types in Area 49.1234]
Example 9-37 displays R1’s and NX-2’s IS-IS routing entries. R1 does not have any
IS-IS routes in the routing table, whereas NX-2 has routes to all the networks in the
topology.
The first step to identify missing routes is to verify neighbor adjacencies and then check
the IS-IS topology table. Example 9-38 displays the topology table on R1 and NX-2. R1
displays double asterisks (**) for all the metrics to the other routers, whereas NX-2 has
populated metrics. This is because R1 is configured only for narrow metrics, which use
different TLVs than the wide metric TLVs that are advertised from NX-2.
To confirm the theory, the metric types are checked on R1 and NX-2 by looking at the IS-IS
protocol, as shown in Example 9-39. R1 is set to accept and generate only narrow metrics,
whereas NX-OS accepts both narrow and wide metrics while advertising only wide metrics.
The Nexus switches are placed in metric transition mode using the command metric-
style transition, which makes the Nexus switch populate the LSP with narrow and wide
metric TLVs. This allows other routers that operate in narrow metric mode to compute a
total path metric for a topology.
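The interaction between metric styles can be modeled as a comparison of TLV sets. In the Python sketch below, each style lists the reachability TLVs that it advertises and accepts: TLVs 2 and 128 carry narrow metrics, and TLVs 22 and 135 carry wide metrics. The style names "nxos-default" and "transition" are labels invented for this sketch, and the model is a simplification of the behavior described in the text.

```python
NARROW_TLVS = frozenset({2, 128})   # IS Neighbors, IP Internal Reachability
WIDE_TLVS = frozenset({22, 135})    # Extended IS Reachability, Extended IP Reachability

METRIC_STYLES = {
    "narrow": {"advertise": NARROW_TLVS, "accept": NARROW_TLVS},
    "nxos-default": {"advertise": WIDE_TLVS, "accept": NARROW_TLVS | WIDE_TLVS},
    "transition": {"advertise": NARROW_TLVS | WIDE_TLVS, "accept": NARROW_TLVS | WIDE_TLVS},
}


def usable_tlvs(receiver_style, sender_style):
    """TLVs from the sender's LSP that the receiver is able to process."""
    return METRIC_STYLES[sender_style]["advertise"] & METRIC_STYLES[receiver_style]["accept"]


# A narrow-only router cannot use an LSP that carries only wide-metric TLVs.
print(sorted(usable_tlvs("narrow", "nxos-default")))
# Transition mode adds the narrow TLVs, so the same router can compute metrics.
print(sorted(usable_tlvs("narrow", "transition")))
```

An empty result corresponds to the double asterisks in the topology table of the narrow-only router.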
Example 9-40 displays the configuration and verification on NX-2 for IS-IS metric transi-
tion mode.
Example 9-41 displays the IS-IS topology table and routing table for the IOS routers
now that the Nexus switches are placed in IS-IS metric transition mode.
Example 9-41 Verification of IOS Devices After NX-OS Metric Transition Mode
L1 to L2 Route Propagations
IS-IS operates on a two-level hierarchy. A primary function of the L1-L2 routers is
to act as a gateway for L1 routers to the L2 IS-IS backbone. Figure 9-10 displays
a simple topology with NX-1 and NX-2 in Area 49.0012 while NX-3 and NX-4
are in Area 49.0034. NX-1’s 10.1.1.0/24 network should be advertised to Area
49.0034 by NX-2, and NX-4’s 10.4.4.0/24 network should be advertised to Area 49.0012
by NX-3.
Example 9-42 displays all four Nexus switches' routing tables. Notice that NX-3 is missing
the 10.1.1.0/24 network, even though it exists in NX-2’s routing table as an IS-IS L1
route. The same behavior exists for NX-4’s 10.4.4.0/24 network, which appears on NX-3
but is missing from NX-2.
The next step is to examine the IS-IS database with the command show isis database [level-1
| level-2] [detail] [lsp-id] to make sure that the appropriate LSPs are in the LSPDB. The output
is restricted by specifying an IS-IS level or the specific LSP ID of an advertising router.
Example 9-43 displays all the LSPs for L1 and L2 in NX-2’s LSPDB. From the output,
NX-2 has received NX-1’s L1 LSP and has received NX-3’s L2 LSP.
Table 9-10 explains some of the key fields in the output from Example 9-43.
Field               Description
Partition Bit (P)   Indicates whether the partition repair bit is set on this LSP.
Overload Bit (O)    Indicates whether the overload bit is set on the advertising router. The
                    overload bit indicates that system maintenance is being performed or the
                    router has just started up and is waiting to fully converge. The overload
                    bit acts as a form of traffic engineering and directs traffic via other paths
                    where possible, and in essence provides the same effect as costing out
                    (placing high interface costs) on all links.
                    Nexus switches set the overload bit with the command set-overload-bit.
Topology Bit (T)    Indicates the function of the router. A value of 1 indicates that the router
                    is an IS-IS L1 router. The value of 3 indicates that the router could be an
                    L1 or L1-L2, depending on whether the LSPID exists in both IS-IS levels.
Using the optional detail keyword provides a list of all the networks, metrics, and TLV
types when viewing the LSPDB. Example 9-44 displays all of NX-2’s L2 IS-IS LSP infor-
mation in detail. The output includes every network that NX-2 advertises to other L2
neighbors. Notice that the 10.1.1.0/24 network entry is not present on NX-2’s LSP, nor
is the 10.4.4.0/24 network entry on NX-3’s LSP.
Note Remember that the pseudonode portion of the LSP ID is zero for the actual router
and contains all its links. If the pseudonode portion of the LSP ID is nonzero, it reflects
the DIS for the segment and lists the LSP IDs for the routers connected to it. The LSP ID
NX-3.02-00 is the DIS for the NX-2 to NX-3 network link.
The IS-IS LSPDB indicates that NX-1’s L1 routes are not propagating to NX-2’s L2
database, and the same behavior is occurring between NX-4 and NX-3. This is caused
by a difference in operational behavior between NX-OS and other Cisco operating
systems (IOS, IOS XR, etc.). Nexus switches require explicit configuration with the
command distribute level-1 into level-2 {all | route-map route-map-name} on L1-L2
routers to insert L1 routes into the L2 topology.
Example 9-45 displays the relevant IS-IS configuration on NX-2 and NX-3 to enable L1
route propagation into the L2 LSPDB.
Example 9-46 displays NX-3’s LSP that was advertised to NX-2, now that L1 route
propagation has been configured on NX-2 and NX-3. Notice that it now includes the L1
route 10.4.4.0/24.
Example 9-47 displays NX-2’s and NX-4’s routing table after the L1 route propagation
was configured on NX-2 and NX-3. Now the 10.1.1.0/24 and 10.4.4.0/24 networks are
reachable on both the L1-L2 switches.
Example 9-47 NX-2 and NX-4’s Routing Table After L1 Route Propagation
Suboptimal Routing
As mentioned in the previous section, L1-L2 routers act as a gateway for L1 routers to
the L2 IS-IS backbone. L1-L2 routers do not advertise L2 routes into the L1 area, but they
set the attached bit in their L1 LSP indicating that the router has connectivity to the IS-IS
backbone network. If an L1 router does not have a route for a network, it searches the
LSPDB for the closest router with the attached bit, which acts as a route of last resort.
In Figure 9-11, Area 49.1234 connects to Area 49.0005 and Area 49.0006. NX-1 and
NX-3 are L1 routers, and NX-2 and NX-4 are L1-L2 routers.
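[Figure 9-11: NX-3 (L1) connects over 10.34.1.0/24 to NX-4 (L1-L2), which connects over
172.16.46.0/24 to NX-6 (L2, 10.6.6.0/24) in Area 49.0006. The 10.24.1.0/24 link is labeled
with a cost of 40, and two other links are labeled with a cost of 4.]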
The problem comes from the suboptimal routing that occurs when NX-1 tries to reach
the 10.6.6.0/24 network, because the traffic crosses the higher-cost 10.24.1.0/24 network
link. The same problem occurs for NX-3 connecting with the 10.5.5.0/24 network.
Example 9-48 displays the suboptimal path taken by both NX-1 and NX-3.
Example 9-49 displays the IS-IS database on NX-1 and NX-3. The attached bit ‘A’ is
detected for NX-2 and NX-4. In essence, the attached bit provides an L1 default route
toward the advertising L1-L2 router.
Now NX-1 and NX-3 must identify the closest router with the attached bit. Normally
this is a manual process of cross-referencing the IS-IS database with the IS-IS topology
table, but NX-OS does this for you automatically. The IS-IS topology table for NX-1
and NX-3 is displayed in Example 9-50.
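The cross-referencing step can be sketched in Python: take the path metrics from the topology table, keep only the routers that set the attached bit, and choose the lowest metric. The metric values below are hypothetical and stand in for NX-1's topology table; only the set of attached routers (NX-2 and NX-4) comes from the text. The function name is an assumption.

```python
def closest_attached_router(path_metrics, attached_routers):
    """Return (router, metric) for the nearest router with the attached bit, or None."""
    candidates = [
        (metric, router)
        for router, metric in path_metrics.items()
        if router in attached_routers
    ]
    if not candidates:
        return None
    metric, router = min(candidates)
    return router, metric


# Hypothetical path metrics as seen from NX-1.
nx1_topology = {"NX-2": 4, "NX-3": 4, "NX-4": 8}
attached = {"NX-2", "NX-4"}

print(closest_attached_router(nx1_topology, attached))
```

Because the selection considers only the distance to the attached router, it ignores the cost of the path beyond that router, which is why the default route can lead across a higher-cost link.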
Example 9-51 displays the routing table of NX-1 and NX-3. Notice that an entry does not
exist for the 10.5.5.0/24 or 10.6.6.0/24 networks, so the default network is used instead.
Notice that the default route correlates with the IS-IS topology table entry from Example 9-50.
Note Route leaking normally uses a restrictive route map to control which routes are
leaked; otherwise, running all the area routers in L2 mode makes more sense.
Let’s verify the change by checking the IS-IS database to see if the 10.5.5.0/24 and
10.6.6.0/24 networks are being advertised by NX-2 and NX-4 into IS-IS L1 for
Area 49.1234. After that is verified, check the routing table to verify that those
entries are added to the RIB. Example 9-53 displays the IS-IS Database with L2
Route Leaking.
Example 9-54 verifies that NX-1 and NX-3 are forwarding traffic using the optimal path.
Redistribution
Redistributing into IS-IS uses the command redistribute [bgp asn | direct | eigrp
process-tag | isis process-tag | ospf process-tag | rip process-tag | static] route-map
route-map-name. A route-map is required as part of the redistribution process on Nexus
switches. Every protocol provides a seed metric at the time of redistribution that allows
the destination protocol to calculate a best path. IS-IS provides a default redistribution
metric of 10.
Example 9-55 provides the necessary configuration to demonstrate the process of redis-
tribution. NX-1 redistributes the connected routes for 10.1.1.0/24 and 10.11.11.0/24 in
lieu of them being advertised with the IS-IS routing protocol. Notice that the route-map
is a simple permit statement without any conditional matches.
The routes are redistributed on NX-1 and are injected into the IS-IS database as the
10.1.1.0/24 and 10.11.11.0/24 prefixes. The redistribution of prefixes is verified by
looking at the LSPDB on other devices, such as NX-2, as shown in Example 9-56.
Summary
This chapter provided a brief review of the IS-IS routing protocol and then explored
the methods for troubleshooting adjacency issues between devices, missing routes, and
path selection.
The following parameters must match for the two routers to become neighbors:
■ MTU matches.
■ L1 adjacencies require the area address to match, and the system ID must be unique
between neighbors.
■ L1 routers can form adjacencies with L1 or L1-L2 routers, but not L2.
■ L2 routers can form adjacencies with L2 or L1-L2 routers, but not L1.
IS-IS is a link-state routing protocol that creates a complete map based on LSPs. Routes
are missing from the routing database typically because of bad network design, mis-
match of metric types, or through configurations that do not support L1-to-L2 route
propagation. This chapter provided some common bad IS-IS designs and their solutions
to prevent the loss of path information.
References
RFC 1195, Use of OSI IS-IS for Routing in TCP/IP and Dual Environments. R. Callon.
IETF, https://tools.ietf.org/html/rfc1195, December 1990.
RFC 3784, Intermediate System to Intermediate System (IS-IS) Extensions for Traffic
Engineering (TE). Tony Li, Henk Smit. IETF, https://tools.ietf.org/html/rfc3784,
June 2004.
Edgeworth, Brad, Aaron Foss, Ramiro Garza Rios. IP Routing on Cisco IOS, IOS XE
and IOS XR. Indianapolis: Cisco Press, 2014.
Troubleshooting Nexus
Route-Maps
■ Route-Maps
■ Troubleshooting RPM
■ Redistribution
■ Policy-Based Routing
Nexus Operating System (NX-OS) route-maps provide the capability to filter routes and
modify route attributes and routing behavior. These technologies use conditional match
criteria to allow actions to occur based upon route characteristics.
Before route-maps are explained, the concepts involved with conditional matching using
access control lists (ACL), prefix lists, and conditional matching of BGP communities
must be explained.
Conditional Matching
Route-maps typically use some form of conditional matching so that only certain pre-
fixes are blocked, accepted, or modified. Network prefixes are conditionally matched
by a variety of routing protocol attributes, but the following sections explain the most
common techniques for conditionally matching a prefix.
Today, ACLs provide a method of identifying networks within routing protocols. ACLs
are also useful to isolate the direction of the problem or identify where the packet is
getting dropped while troubleshooting a complex network environment.
ACLs in NX-OS are generic expressions for filtering traffic based on Layer 2, Layer 3,
or Layer 4 information. ACLs are composed of access control entries (ACE), which
are entries in the ACL that identify the action to be taken (permit or deny) and the rel-
evant packet classification. Packet classification starts at the top (lowest sequence) and
proceeds down (higher sequence) until a matching pattern is identified. After a match
is found, the appropriate action (permit or deny) is taken and processing stops. At the
end of every ACL is an implicit deny ACE, which denies all packets that did not match
earlier in the ACL.
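The top-down, first-match behavior described above can be modeled in a few lines. The following is a rough Python sketch rather than the NX-OS implementation: the evaluate_acl function and the sample ACL are hypothetical, and only source and destination addresses are classified.

```python
import ipaddress


def evaluate_acl(aces, src, dst):
    """Return (action, sequence) for the first ACE that matches src and dst.

    aces: iterable of (sequence, action, source_network, destination_network).
    """
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    # Classification starts at the lowest sequence number and moves upward.
    for seq, action, src_net, dst_net in sorted(aces, key=lambda ace: ace[0]):
        if (src_ip in ipaddress.ip_network(src_net)
                and dst_ip in ipaddress.ip_network(dst_net)):
            return action, seq      # first match wins; processing stops
    return "deny", None             # implicit deny at the end of every ACL


# Hypothetical ACL used only for illustration.
ACL = [
    (20, "deny", "192.168.33.0/24", "0.0.0.0/0"),
    (10, "permit", "192.168.33.33/32", "192.168.3.3/32"),
]
```

A sequence of None in the result indicates that the packet fell through to the implicit deny.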
■ Standard ACLs: Define the packets based solely on the source network.
■ Extended ACLs: Define the packet based upon source, destination, protocol,
port, or a combination of other packet attributes.
Standard ACLs use the numbered entry 1–99, 1300–1999, or a named ACL. Extended
ACLs use the numbered entry 100–199, 2000–2699, or a named ACL. Named ACLs
provide relevance to the functionality of the ACL, are used with standard or
extended ACLs, and are generally preferred.
The behavior for selecting a network prefix with an extended ACL varies depending on
whether the protocol is an IGP such as Enhanced Interior Gateway Routing Protocol
(EIGRP), Open Shortest Path First (OSPF), Intermediate System-to-Intermediate System
(IS-IS) or Border Gateway Protocol (BGP).
■ Time ranges
■ Per-entry statistics
Along with applying ACLs on the interface or using ACLs along with route-maps, which
are then used by routing protocols for route filtering purposes, ACLs have the following
applications:
■ IPv4/IPv6 ACLs
■ Media access control (MAC) ACL
In NX-OS, when an ACL is applied to a target, a policy is created. NX-OS supports the
following types of ACL policies:
Note PACL can be applied only on ingress packets for L2/L3 physical Ethernet inter-
faces (including L2 port-channel interfaces).
Example 10-1 illustrates the various ACL configurations supported on NX-OS. In the
following example, the command statistics per-entry is configured to enable the sta-
tistics for the ACEs configured under the ACLs. If the command statistics per-entry is
not configured, the command show ip access-list does not display any statistics for the
packets hitting a particular ACE.
IP ACL
NX-1(config)# ip access-list TEST
NX-1(config-acl)# permit ip host 192.168.33.33 host 192.168.3.3
NX-1(config-acl)# permit ip any any
NX-1(config-acl)# statistics per-entry
IPv6 ACL
NX-1(config)# ipv6 access-list TESTv6
NX-1(config-ipv6-acl)# permit icmp host 2001::33 host 2001::3
NX-1(config-ipv6-acl)# permit ipv6 any any
NX-1(config-ipv6-acl)# statistics per-entry
MAC ACL
NX-1(config)# mac access-list TEST-MAC
NX-1(config-mac-acl)# permit 00c0.cf00.0000 0000.00ff.ffff any
NX-1(config-mac-acl)# permit any any
NX-1(config-mac-acl)# statistics per-entry
ARP ACL
NX-1(config)# arp access-list TEST-ARP
NX-1(config-arp-acl)# deny ip host 192.168.10.11 mac 00c0.cf00.0000 ffff.ff00.0000
NX-1(config-arp-acl)# permit ip any mac any
VLAN Access-map
NX-1(config)# vlan access-map TEST-VLAN-MAP
NX-1(config-access-map)# match ip address TEST
NX-1(config-access-map)# action drop
NX-1(config-access-map)# statistics per-entry
Note Validate the ACL-related configuration using the command show run aclmgr. This
command displays both the ACL configuration and the ACL attach points.
Example 10-2 illustrates the difference between the output of the show ip access-list
command when the statistics per-entry command is configured, compared to when it is
not configured. In the following example, the ACL configuration that has the statistics
per-entry command configured displays the statistics for the confirmed hits.
When an ACL is attached to an interface or any other component, the ACL gets pro-
grammed in the ternary content addressable memory (TCAM). The TCAM program-
ming for the access-list is verified using the command show system internal access-list
interface interface-id input statistics [module slot]. This command displays under
which bank the ACL is programmed and what kind of policy is created when the ACL is
attached to an attach point. Along with this, the command displays the statistics of each
ACE entry in the ACL.
If the statistics per-entry command is not configured, the counters in the TCAM
increment only for the entry for all traffic; that is, permit ip 0.0.0.0/0 0.0.0.0/0.
Example 10-3 demonstrates the ACL entry on TCAM and the TCAM statistics when
the statistics per-entry command is not configured.
Label_b = 0x3
Bank 0
------
IPv4 Class
Policies: RACL(TEST) [Merged]
Netflow profile: 0
Netflow deny profile: 0
Entries:
[Index] Entry [Stats]
---------------------
[001b:15262:0007] prec 1 permit-routed ip 0.0.0.0/0 0.0.0.0/0 [33]
! Output after 5 packets are sent between the host 192.168.3.3 and 192.168.33.33
NX-1# show system internal access-list interface e4/2 input statistics module 4
INSTANCE 0x0
---------------
Tcam 1 resource usage:
----------------------
Label_b = 0x3
Bank 0
------
IPv4 Class
Policies: RACL(TEST) [Merged]
Netflow profile: 0
Netflow deny profile: 0
Entries:
[Index] Entry [Stats]
---------------------
[001b:15262:0007] prec 1 permit-routed ip 0.0.0.0/0 0.0.0.0/0 [38]
As stated before, the ACLMGR takes care of creating the policies when an ACL is
attached to an attach point. The policies created by the ACLMGR are verified using the
command show system internal aclmgr access-lists policies interface interface-id. This
command displays the policy type and interface index, which points to the interface
where the ACL is attached, as shown in Example 10-4.
2 policies: {
ACLMGR_POLICY_INBOUND_IPV4_GHOST_RACL: 0x4400282
ACLMGR_POLICY_INBOUND_IPV4_RACL: 0x4400283
}
no links
}
NX-OS has a packet processing filter (PPF) API, which is used to filter the security rules
received and processed by the ACLMGR to the relevant clients. The clients can be an
interface, a port-channel, a VLAN manager, VSH, and so on. It is important to remember
that the ACLMGR stores all the data in the form of a PPF database, where each element
is a node. Based on the node ID received from the previous command, more details about
the policy can be verified by performing a lookup in the PPF database on that node.
Example 10-5 illustrates the use of the command show system internal aclmgr ppf node
node-id to perform a lookup on the PPF database of the ACLMGR for the policy node
created when the policy is attached to an attach point. This command is useful when
troubleshooting ACL/filtering-related issues, such as an ACL not filtering the traffic
properly or traffic not matching the ACL entry at all on the NX-OS platform.
.nrefs = 0
.id = 0x4400283
.group = 0x0
.flags = 0x0
.priv_data_size = 0
.type = Policy Instance
.dest.vdc = 1
.dest.vrf = 0
.dest.vlan = 0
.dest.ifindex = 0x1a181000
.dir = IN
.u.pinst.type = 0x400010 (racl_ipv4)
.u.pinst.policy.head = 0x4400265
.u.pinst.policy.tail = 0x0
.u.pinst.policy.size = 0x0
.u.pinst.policy.el_field = 0
.u.pinst.policy.el_field = 0
Note When troubleshooting any ACL-related issues, it is recommended that you collect
the command show tech aclmgr [detail] or show tech aclqos [detail] during a problem.
The ACLQOS component on the line card provides statistics for ACLs on a per-line card
basis and is important when you are troubleshooting ACL-related issues.
Note Extended ACLs that are used for distribute-list use the source fields to identify
the source of the network advertisement, and the destination fields identify the network
prefix.
Table 10-2 demonstrates the concept of the wildcard for the network and subnet mask.
Prefix Matching
The structure for a prefix match specification contains two parts: high-order bit pattern
and high-order bit count, which determine the high-order bits in the bit pattern that are
to be matched. Some documentation refers to the high-order bit pattern as the address
or network, and the high-order bit count as length or mask length.
In Figure 10-2, the prefix match specification has the high-order bit pattern of
192.168.0.0 and a high-order bit count of 16. The high-order bit pattern has been
converted to binary to demonstrate where the high-order bit count lies. Because no
additional matching length parameters are included, the high-order bit count is an
exact match.
[Figure 10-2: prefix match specification in binary, marking the 16 high-order bits and the high-order bit count boundary]
The prefix match specification logic might look identical to the functionality of an
access-list. The true power and flexibility come from using matching length parameters
to identify multiple networks with specific prefix lengths with one statement. The
matching length parameter options are as follows:
Figure 10-3 demonstrates the prefix match specification with a high-order bit pattern of
10.168.0.0, high-order bit count of 13, and the matching length of the prefix must be
greater than or equal to 24.
The 10.168.0.0/13 prefix does not qualify because the prefix length is less than the
minimum of 24 bits, whereas the 10.168.0.0/24 prefix does meet the matching length
parameter. The 10.173.1.0/28 prefix qualifies because the first 13 bits match the high-
order bit pattern, and the prefix length is within the matching length parameter. The
10.104.0.0/24 prefix does not qualify because the high-order bit-pattern does not match
within the high-order bit count.
[Figure 10-3: prefix match specification labeled with the high-order bit pattern (network), high-order bit count (length), matching length parameters, the 13 high-order bits, and the high-order bit count boundary]
Figure 10-4 demonstrates a prefix match specification with a high-order bit pattern
of 10.0.0.0, high-order bit count of 8, and the matching length must be between
22 and 26.
[Figure 10-4: prefix match specification labeled with the high-order bit pattern (network), high-order bit count (length), matching length parameters, the 8 high-order bits, and the high-order bit count boundary]
The 10.0.0.0/8 prefix does not match because the prefix length is too short. The
10.0.0.0/24 prefix qualifies because the bit pattern matches and the prefix length is
between 22 and 26. The 10.0.0.0/30 prefix does not match because the prefix length is
too long. Any prefix that starts with 10 in the first octet and has a prefix length
between 22 and 26 will match the prefix match specification.
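The two checks described in this section (high-order bits, then matching length) can be expressed as a short function. This is a minimal Python sketch using the standard ipaddress module; the function name prefix_matches and its ge and le parameters are illustrative and simply mirror the matching length keywords.

```python
import ipaddress


def prefix_matches(pattern, prefix, ge=None, le=None):
    """Check a prefix against a high-order bit pattern/count and length range."""
    pat = ipaddress.ip_network(pattern)
    net = ipaddress.ip_network(prefix)
    if ge is None and le is None:
        # No matching length parameters: the bit count is an exact match.
        low = high = pat.prefixlen
    else:
        low = ge if ge is not None else pat.prefixlen
        high = le if le is not None else pat.max_prefixlen
    if not low <= net.prefixlen <= high:
        return False
    # The first pat.prefixlen bits of the prefix must equal the bit pattern.
    return net.network_address in pat


# 10.168.0.0/13 with a matching length of at least 24
print(prefix_matches("10.168.0.0/13", "10.168.0.0/13", ge=24))
print(prefix_matches("10.168.0.0/13", "10.173.1.0/28", ge=24))
# 10.0.0.0/8 with a matching length between 22 and 26
print(prefix_matches("10.0.0.0/8", "10.0.0.0/24", ge=22, le=26))
```

The same function covers the exact-match case of 192.168.0.0/16 by omitting both length arguments.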
Prefix Lists
Prefix lists contain multiple prefix matching specification entries that contain a permit
or deny action. Prefix lists process in sequential order in a top-down fashion, and the
first prefix match processes with the appropriate permit or deny action.
NX-OS prefix lists are configured with the global configuration command ip prefix-list
prefix-list-name [seq sequence-number] {permit | deny} high-order-bit-pattern/high-
order-bit-count [{eq match-length-value | le le-value | ge ge-value [le le-value]}].
Example 10-6 provides a sample prefix list named RFC1918 for all the networks in
the RFC 1918 address range. The prefix list allows only /32 prefixes to exist in the
192.168.0.0 network range and not in any other network range in the prefix list.
Notice that sequence 5 permits all /32 prefixes in the 192.168.0.0/13 bit pattern, then
sequence 10 denies all /32 prefixes in any bit pattern, and then sequences 15, 20, and
25 permit routes in the appropriate network ranges. The sequence order is important
for the first two entries to ensure that only /32 prefixes exist in the 192.168.0.0
network range in the prefix list.
The command show ip prefix-list prefix-list-name high-order-bit-pattern/high-order-
bit-count first-match provides the capability for a specific network prefix to be checked
against a prefix-list to identify the matching sequence, if any.
Example 10-7 displays the command being executed against three network prefix pat-
terns based upon the RFC1918 prefix-list created earlier. The first command uses a high-
order bit count of 32, which matches against sequence 5, whereas the second command
uses a high-order bit count of 16, which matches against sequence 25. The last command
matches against sequence 10, which has a deny action.
Note This command demonstrated in Example 10-7 is useful for verifying that the
network prefix matches the intended sequence in a prefix-list.
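The first-match lookup can be approximated offline with a short script. This is a hedged Python sketch, not the NX-OS logic itself: sequences 5 and 10 follow the description in the text, while the patterns for sequences 15, 20, and 25 and the test prefixes are assumptions chosen to fit the RFC 1918 ranges.

```python
import ipaddress

# (sequence, action, pattern, minimum length, maximum length)
RFC1918 = [
    (5, "permit", "192.168.0.0/13", 32, 32),
    (10, "deny", "0.0.0.0/0", 32, 32),
    (15, "permit", "10.0.0.0/8", 8, 32),       # assumed entry
    (20, "permit", "172.16.0.0/12", 12, 32),   # assumed entry
    (25, "permit", "192.168.0.0/16", 16, 32),  # assumed entry
]


def first_match(prefix_list, prefix):
    """Return (sequence, action) of the first matching entry, or None."""
    net = ipaddress.ip_network(prefix)
    # Entries are processed top-down in sequence order.
    for seq, action, pattern, low, high in sorted(prefix_list, key=lambda e: e[0]):
        pat = ipaddress.ip_network(pattern)
        if low <= net.prefixlen <= high and net.network_address in pat:
            return seq, action
    return None


for candidate in ("192.168.1.1/32", "192.168.0.0/16", "10.1.1.1/32"):
    print(candidate, first_match(RFC1918, candidate))
```

A result of None means that no sequence matched, so the prefix would fall through to the implicit deny.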
Route-Maps
Route-maps provide many features to a variety of routing protocols. At the simplest
level, route-maps filter networks similar to an ACL, but also provide additional
capability by adding or modifying a network attribute. Route-maps must be referenced
within a routing protocol to influence it. Route-maps are critical to BGP because they
are the main component for applying a unique routing policy on a neighbor-by-neighbor
basis.
■ Processing within a route-map stops after all optional actions have processed
(if configured) after matching a conditional matching criteria.
■ If a route is not conditionally matched, there is an implicit deny for that route.
Note When deleting a specific route-map statement, include the sequence number to
prevent deleting the entire route-map.
Conditional Matching
Now that the components and processing order of a route-map have been explained, this
section expands upon the aspect of how to match a route. Example 10-9 shows the various
options available within NX-OS.
As you can see, a number of conditional matching options are available. Some of the
options, like vlan and mac-list, are applicable only for policy-based routing. Table 10-3
provides the command syntax for the most common methods for matching prefixes and
describes their usage.
Note Sequence 20 is redundant because of the implicit deny for any prefixes that are
not matched in sequence 10. However, it provides clarity for junior network engineers.
If multiple match options are configured for a specific route-map sequence, both match
options must be met for the prefix to qualify for that sequence. The Boolean logic uses
an and operator for this configuration.
In Example 10-11, sequence 10 requires that the prefix match ACL ACL-ONE and that
the metric be a value between 500 and 600. If the prefix does not qualify for both match
options, the prefix does not qualify for sequence 10 and is denied because another
sequence does not exist with a permit action.
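The Boolean and between different match options can be shown with a small model. This is an illustrative Python sketch only: the contents of ACL-ONE are not reproduced here, so the 172.16.0.0/16 range below is a hypothetical stand-in, and the function names are invented for the example.

```python
import ipaddress

# Hypothetical stand-in for the networks permitted by ACL-ONE.
ACL_ONE = ipaddress.ip_network("172.16.0.0/16")


def sequence_10_matches(prefix, metric):
    """Both match options must be true for the prefix to qualify."""
    in_acl = ipaddress.ip_network(prefix).subnet_of(ACL_ONE)
    in_metric_range = 500 <= metric <= 600
    return in_acl and in_metric_range


def route_map_result(prefix, metric):
    # Sequence 10 is the only sequence, so anything else hits the implicit deny.
    return "permit" if sequence_10_matches(prefix, metric) else "deny"
```

A prefix inside the assumed ACL range with a metric of 700 is still denied, because only one of the two conditions is met.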
Complex Matching
Some network engineers find route-maps too complex if the conditional matching
criteria uses an ACL, AS-Path ACL, or prefix list that contains a deny statement in it.
Example 10-12 demonstrates a configuration where the ACL uses a deny statement for
the 172.16.1.0/24 network range.
Reading configurations like this must follow the sequence order first, conditional match-
ing criteria second, and only after a match occurs should the processing action and
optional action be used. Matching a deny statement in the conditional match criteria
excludes the route from that sequence in the route-map.
The prefix 172.16.1.0/24 is denied by ACL-ONE, which implies that there is not a match
in sequence 10 and 20; therefore, the processing action (permit or deny) is not needed.
Sequence 30 does not contain a match clause, so any remaining routes are permitted.
The prefix 172.16.1.0/24 passes on sequence 30 with the metric set to 20. The prefix
172.16.2.0/24 matches ACL-ONE and passes in sequence 10.
Note Route-maps process in the order of evaluation of the sequence, conditional match
criteria, processing action, and optional action, in that order. Any deny statements in the
match component are isolated from the route-map sequence action.
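The order of evaluation in the note can be traced with a small model. This is a Python sketch under stated assumptions: the ACL entries, the sequence actions (permit 10, deny 20, permit 30), and the absence of set actions in sequence 10 are guesses that reproduce the outcome described in the text, not a copy of the configuration in Example 10-12.

```python
import ipaddress

# Assumed ACL-ONE: deny 172.16.1.0/24, then permit the rest of 172.16.0.0/16.
ACL_ONE = [("deny", "172.16.1.0/24"), ("permit", "172.16.0.0/16")]

# (sequence, processing action, match ACL or None, optional set actions)
ROUTE_MAP = [
    (10, "permit", ACL_ONE, {}),
    (20, "deny", ACL_ONE, {}),
    (30, "permit", None, {"metric": 20}),
]


def acl_permits(acl, prefix):
    """A deny (or no match) in the ACL means 'not matched' for the sequence."""
    net = ipaddress.ip_network(prefix)
    for action, entry in acl:
        if net.subnet_of(ipaddress.ip_network(entry)):
            return action == "permit"
    return False


def process(route_map, prefix):
    """Return (sequence, result, set actions) for the first qualifying sequence."""
    for seq, action, acl, set_actions in route_map:
        if acl is None or acl_permits(acl, prefix):
            if action == "permit":
                return seq, "permit", dict(set_actions)
            return seq, "deny", {}
    return None, "deny", {}   # implicit deny


print(process(ROUTE_MAP, "172.16.1.0/24"))
print(process(ROUTE_MAP, "172.16.2.0/24"))
```

Note that the deny inside the ACL never triggers the deny action of sequence 20; it only prevents the prefix from qualifying for sequences 10 and 20.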
Optional Actions
In addition to permitting the prefix to pass, route-maps modify route attributes.
Table 10-4 provides a brief overview of the most popular attribute modifications.
Example 10-13 displays the relevant EIGRP configuration that redistributes OSPF and
directly connected routes into EIGRP. The route-map is selective with the routes that are
being redistributed into EIGRP.
NX-2
ip prefix-list PRE1 seq 5 permit 100.1.1.0/24
ip prefix-list PRE2 seq 5 permit 100.64.1.0/24
!
route-map REDIST-CONNECTED-2-EIGRP permit 10
match interface Vlan10
route-map REDIST-OSPF-2-EIGRP permit 10
match ip address prefix-list PRE1 PRE2
set metric 10000 1 255 1 1500
!
router eigrp NXOS
router-id 192.168.2.2
address-family ipv4 unicast
autonomous-system 100
redistribute direct route-map REDIST-CONNECTED-2-EIGRP
redistribute ospf NXOS route-map REDIST-OSPF-2-EIGRP
When routes are not installed in EIGRP as anticipated, the first step is to check to make
sure that any relevant policies were bound to the destination routing protocol. The com-
mand show system internal rpm event-history rsw displays low-level events that are
handled by RPM.
Example 10-14 displays the command. Notice that two different route-maps were
applied to EIGRP: one was for OSPF and the other for directly connected interfaces as
the source redistribution protocols.
It might be simpler to get a count of the number of RPM processes that have attached
to a protocol using the command show system internal rpm clients, as demonstrated in
Example 10-15. If the Bind-count does not match the anticipated number of route-maps
used by that protocol, viewing the RPM event-history will indicate the error.
RF/RD 0 ospf-NXOS
RF/RD 0 icmpv6
RF/RD 0 igmp
RF/RD 0 u6rib
RF/RD 0 tcp
RF/RD 0 urib
In addition to viewing the accuracy of a prefix-list as shown in Example 10-7, the
capability to view the internal programming for a prefix-list is beneficial. The command show
system internal rpm ip-prefix-list displays all the prefix-lists configured on the Nexus
switch for the route-map from Example 10-13. Notice that the clients show you the
route-map referencing the prefix list. In addition, each prefix-list entry displays the num-
ber of sequences and version history for that prefix-list. Example 10-16 displays the use
of the show system internal rpm ip-prefix-list command for NX-2.
The last method is to view relevant changes from a debug perspective. Traditionally, the
route-map option needs to be enabled from the destination protocol. Example 10-17
displays the debugs for the route-maps associated to the redistribution into EIGRP.
Policy-Based Routing
A router makes forwarding decisions based upon the destination address of the
IP packet. Some scenarios accommodate other factors, such as packet length or
source address, when deciding where the router should forward a packet.
Policy-based routing (PBR) allows for conditional forwarding of packets based on the
following packet characteristics:
■ Manually assigning different network paths to the same destination based upon
tolerance for latency, link-speed, or utilization for specific transit traffic
Packets are examined for PBR processing as they are received on the router interface.
PBR verifies the existence of the next-hop IP address and then forwards packets using
the specified next-hop address. Additional next-hop addresses are configured so that
in the event that the first next-hop address is not in the RIB, the secondary next-hop
addresses are used. If none of the specified next-hop addresses exist in the routing table,
the packets are not conditionally forwarded.
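The next-hop selection described above reduces to a simple ordered lookup. This is a simplified Python sketch, not the NX-OS forwarding logic: the function name, the next-hop addresses, and the RIB contents are hypothetical, and "exists in the RIB" is modeled as "covered by any prefix in the list".

```python
import ipaddress


def pbr_next_hop(configured_next_hops, rib_prefixes):
    """Return the first configured next-hop that the RIB can reach.

    Returns None when no next-hop qualifies, meaning the packet is not
    conditionally forwarded and follows the normal routing table instead.
    """
    rib = [ipaddress.ip_network(p) for p in rib_prefixes]
    for next_hop in configured_next_hops:
        address = ipaddress.ip_address(next_hop)
        if any(address in network for network in rib):
            return next_hop
    return None


# Hypothetical values: the primary next-hop is missing from the RIB.
print(pbr_next_hop(["10.23.1.3", "10.24.1.4"], ["10.12.1.0/24", "10.24.1.0/24"]))
```

The order of the configured next-hops matters, because the first reachable address is always preferred.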
Note PBR policies do not modify the RIB because the policies are not universal for all
packets. This often complicates troubleshooting because the routing table displays the
next-hop address learned from the routing protocol, but does not accommodate for a dif-
ferent next-hop address for the conditional traffic.
NX-OS PBR configuration uses a route-map with match and set statements that is then
attached to the inbound interface. The following steps are used:
Step 1. Enable the PBR feature. The PBR feature is enabled with the global configu-
ration command feature pbr.
Step 2. Define a route-map. The route-map is configured with the global configura-
tion command route-map route-map-name [permit | deny] [sequence-
number].
Step 3. Identify the conditional match criteria. The conditional match criteria
is based upon packet length with the command match length minimum-
length maximum-length, or by using the packet ip address fields
with an ACL using the command match ip address {access-list-number |
acl-name}.
Step 5. Apply the route-map to the inbound interface. The route-map is applied with
the interface parameter command ip policy route-map route-map-name.
Step 6. Enable PBR statistics (optional). Statistics of PBR forwarding are enabled with
the command route-map route-map-name pbr-statistics.
Figure 10-6 displays a topology to demonstrate how PBR operates. The default path
between NX-1 and NX-6 is NX-1 → NX-2 → NX-3 → NX-5 → NX-6 because the link
cost of 10.23.1.0/24 is lower than that of the 10.24.1.0/24 link. However, specific
traffic sourced from NX-1’s Loopback 0 (192.168.1.1) to NX-6’s Loopback 0 (192.168.6.6)
must not forward through NX-3. These packets must forward through NX-4 even though it
has a higher path cost.
[Figure 10-6: PBR topology. NX-1 (Lo0: 192.168.1.1) connects to NX-2 over 10.12.1.0/24 (cost 40). NX-2 connects to NX-3 over 10.23.1.0/24 and to NX-4 over 10.24.1.0/24. NX-3 and NX-4 connect to NX-5 over 10.35.1.0/24 and 10.45.1.0/24. NX-5 connects to NX-6 (Lo0: 192.168.6.6) over 10.56.1.0/24 (cost 40).]
NX-2
feature pbr
!
ip access-list R1-TO-R6
10 permit ip 192.168.1.1/32 192.168.6.6/32
!
route-map PBR pbr-statistics
route-map PBR permit 10
match ip address R1-TO-R6
Example 10-19 displays a traceroute from NX-1, which displays traffic flowing through
NX-3. A source interface was not specified, so traffic is sourced from the 10.12.1.1 IP
address. This is confirmed based upon the routing table on NX-2 for the 192.168.6.6
network prefix.
PBR statistics were enabled on NX-2, which allows network engineers to see how much
traffic was forwarded by PBR. Example 10-21 displays output for PBR statistics before
and after traffic was conditionally forwarded.
Note The PBR configuration shown is for transit traffic. For PBR on locally generated
traffic, use the command ip local policy route-map route-map-name.
Summary
This chapter covered several important building block features that are necessary for
understanding the conditional matching process used within NX-OS route-maps:
■ Access control lists provide a method of identifying networks. Extended ACLs
provide the capability to select the network and advertising router for IGPs
and provide the capability to use wildcards for the network and subnet mask for
BGP routes.
■ Prefix lists identify networks based upon the high-order bit pattern, high-order bit
count, and required prefix length requirements.
Route-maps filter routes similar to an ACL and provide the capability to modify route
attributes. Route-maps are composed of sequence numbers, matching criteria, processing
action, and optional modifying actions. They use the following logic:
■ If matching criteria is not specified, all routes qualify for that route-map
sequence.
■ Multiple conditional matching requirements of the same type are a Boolean or, and
multiple conditional matching requirements of different type are a Boolean and.
■ NX-OS uses RPM, which operates as a separate process and memory space from the
actual protocols. This provides an additional method for diagnosing unintentional
behaviors in a protocol.
Policy-based routing bypasses the default packet-forwarding decisions made by routing
protocols altogether, placing specific network traffic onto a path that is different
from the one selected by the routing protocol.
References
Edgeworth, Brad, Aaron Foss, and Ramiro Garza Rios. IP Routing on Cisco IOS, IOS
XE and IOS XR. Indianapolis: Cisco Press, 2014.
Chapter 11
Troubleshooting BGP
BGP Fundamentals
Defined in RFC 1654, Border Gateway Protocol (BGP) is a path-vector routing protocol
that provides scalability, flexibility, and network stability. When BGP was first developed,
the primary design consideration was for IPv4 inter-organizational routing information
exchange across the public networks, such as the Internet, or for private dedicated net-
works. BGP is often referred to as the protocol for the Internet, because it is the only
protocol capable of holding the Internet routing table, which has more than 600,000 IPv4
routes and over 42,000 IPv6 routes, both of which continue to grow.
Two blocks of private ASNs are available for any organization to use as long as they are
never exchanged publicly on the Internet. ASNs 64,512 to 65,535 are private ASNs within
the 16-bit ASN range, and 4,200,000,000 to 4,294,967,294 are private ASNs within the
extended 32-bit range.
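The two private ranges are easy to check programmatically. This is a minimal Python sketch; the function name is_private_asn is illustrative, and the boundaries are taken directly from the ranges stated above.

```python
# Private ASN blocks as stated in the text (both ends inclusive).
PRIVATE_16_BIT = range(64512, 65535 + 1)
PRIVATE_32_BIT = range(4200000000, 4294967294 + 1)


def is_private_asn(asn):
    """Return True if the ASN falls in either private block."""
    return asn in PRIVATE_16_BIT or asn in PRIVATE_32_BIT


for asn in (100, 65000, 65536, 4200000000):
    print(asn, is_private_asn(asn))
```

A check like this can be useful when auditing configurations for private ASNs that might leak into public advertisements.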
Note It is imperative that you use only the ASN assigned by IANA, the ASN assigned by
your service provider, or private ASNs. In addition, public prefixes are mapped to
the ASNs of the organizations that own them. Thus, mistakenly or maliciously
advertising a prefix using the wrong ASN could result in traffic loss and cause havoc
on the Internet.
Address Families
Originally, BGP was intended for routing of IPv4 prefixes between organizations, but
RFC 2858 added Multi-Protocol BGP (MP-BGP) capability by adding extensions called
address-family identifier (AFI). An address-family correlates to a specific network
protocol, such as IPv4, IPv6, and so on, with additional granularity provided through a
subsequent address-family identifier (SAFI), such as unicast and multicast. MBGP
achieves this separation by using the BGP path attributes (PA) MP_REACH_NLRI and
MP_UNREACH_NLRI. These attributes are carried inside BGP update messages and are used
to carry network reachability information for different address families.
Note Some network engineers refer to Multi-Protocol BGP as MP-BGP and other net-
work engineers use the term MBGP. Both terms are the same thing.
Network engineers and vendors continue to add functionality and feature enhance-
ments to BGP. BGP now provides a scalable control plane for signaling for overlay tech-
nologies like Multiprotocol Label Switching (MPLS) Virtual Private Networks (VPN),
IPsec Security Associations, and Virtual Extensible LAN (VXLAN). These overlays
provide Layer 3 connectivity via MPLS L3VPNs, or Layer 2 connectivity via Ethernet
VPNs (eVPN).
Every address-family maintains a separate database and configuration for each protocol
(address-family + subaddress-family) in BGP. This allows for a routing policy in one
address-family to be different from a routing policy in a different address-family, even
though the router uses the same BGP session to the other router. BGP includes an AFI
and SAFI with every route advertisement to differentiate between the AFI and SAFI
databases. Table 11-1 provides a small list of common AFI and SAFIs used with BGP.
Path Attributes
BGP attaches path attributes (PA) associated with each network path. The PAs provide
BGP with granularity and control of routing policies within BGP. The BGP prefix PAs are
classified as follows:
■ Well-known mandatory
■ Well-known discretionary
■ Optional transitive
■ Optional nontransitive
Per RFC 4271, well-known attributes must be recognized by all BGP implementations.
Well-known mandatory attributes must be included with every prefix advertisement,
whereas well-known discretionary attributes may or may not be included with the prefix
advertisement.
Loop Prevention
BGP is a path vector routing protocol and does not contain a complete topology of the
network like link state routing protocols. BGP behaves similarly to distance vector
protocols to ensure a path is a loop-free path.
The BGP attribute AS_PATH is a well-known mandatory attribute and includes a com-
plete listing of all the ASNs that the prefix advertisement has traversed from its source
AS. The AS_PATH is used as a loop-prevention mechanism in the BGP protocol. If a BGP
router receives a prefix advertisement with its AS listed in the AS_PATH, it discards the
prefix because the router thinks the advertisement forms a loop.
Note The other IBGP-related loop-prevention mechanisms are discussed later in this chapter.
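The AS_PATH loop check can be illustrated with a few lines of Python. This is a conceptual sketch only: the function names and ASN values are hypothetical, and the AS_PATH is modeled as a plain list with the most recently traversed AS first.

```python
def accept_advertisement(local_asn, as_path):
    """Discard the prefix if the local AS already appears in the AS_PATH."""
    return local_asn not in as_path


def advertise_to_ebgp_peer(local_asn, as_path):
    """Prepend the local AS when advertising to a router in another AS."""
    return [local_asn] + list(as_path)


# A prefix originated in AS 65100 and advertised onward by AS 65200.
path = advertise_to_ebgp_peer(65200, [65100])
print(path)
print(accept_advertisement(65300, path))   # AS 65300 is not in the path
print(accept_advertisement(65100, path))   # AS 65100 sees its own ASN
```

Because every AS adds itself before passing the prefix along, a router that sees its own ASN knows the advertisement has already passed through its AS.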
BGP Sessions
A BGP session refers to the established adjacency between two BGP routers. BGP ses-
sions are always point-to-point and are categorized into two types:
■ Internal BGP (iBGP): Sessions established with an iBGP router that are in the same
AS or participate in the same BGP confederation. iBGP sessions are considered more
secure, and some of BGP’s security measures are lowered in comparison to EBGP
sessions. iBGP prefixes are assigned an administrative distance (AD) of 200 upon
installing into the router’s Routing Information Base (RIB).
■ External BGP (EBGP): Sessions established with a BGP router that are in a different
AS. EBGP prefixes are assigned an AD of 20 upon installing into the router’s RIB.
BGP uses TCP port 179 to communicate with other routers. Transmission Control
Protocol (TCP) allows for handling of fragmentation, sequencing, and reliability (acknowl-
edgement and retransmission) of communication (control plane) packets. Although BGP
can form neighbor adjacencies that are directly connected, it can also form adjacencies
that are multiple hops away. Multihop sessions require that the router use an underlying
route installed in the RIB (static or from any routing protocol) to establish the TCP ses-
sion with the remote endpoint.
Note BGP neighbors connected via the same network use the ARP table to locate the IP
address of the peer. Multihop BGP sessions require route table information for finding the
IP address of the peer. It is common to have a static route or Interior Gateway Protocol (IGP)
running between iBGP peers for providing the topology path information for establishing the
BGP TCP session. A default route is not sufficient to establish a multihop BGP session.
BGP Identifier
The BGP Router-ID (RID) is a 32-bit unique number that identifies the BGP router in the
advertised prefixes as the BGP Identifier. The RID is also used as a loop prevention
mechanism for routers advertised within an autonomous system. The RID can be set manually
or dynamically for BGP. A nonzero value must be set for routers to become neighbors.
When the RID is not set manually, NX-OS nodes use the IP address of the lowest up
loopback interface. If there are no up loopback interfaces, then the IP address of the
lowest active up interface becomes the RID when the BGP process initializes.
Router-IDs typically represent an IPv4 address that resides on the router, such as a loop-
back address. Any IPv4 address can be used, including IP addresses not configured on
the router. NX-OS uses the command router-id router-id under the BGP router configu-
ration to statically assign the BGP RID. Upon changing the router-id, all BGP sessions
reset and need to reestablish.
BGP Messages
BGP communication uses four message types as shown in Table 11-2.
OPEN
The OPEN message is used to establish a BGP adjacency. Both sides negotiate session
capabilities before a BGP peering establishes. The OPEN message contains the BGP
version number, ASN of the originating router, Hold Time, BGP Identifier, and other
optional parameters that establish the session capabilities.
The Hold Time attribute sets the Hold Timer in seconds for each BGP neighbor. Upon
receipt of an UPDATE or KEEPALIVE, the Hold Timer resets to the initial value. If the
Hold Timer reaches zero, the BGP session is torn down, routes from that neighbor are
removed, and an appropriate update route withdraw message is sent to other BGP neigh-
bors for the impacted prefixes. The Hold Time is a heartbeat mechanism for BGP neigh-
bors to ensure that the neighbor is healthy and alive.
When establishing a BGP session, the routers use the smaller Hold Time value contained
in the two routers’ OPEN messages. The Hold Time value must be set to at least
3 seconds, or zero. For Cisco routers the default hold timer is 180 seconds.
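The negotiation rule above can be sketched as follows. This is a minimal illustration, not router code: negotiate_hold_time is a hypothetical name, and the 65535 upper bound is an assumption based on the 2-octet Hold Time field in RFC 4271.

```python
def negotiate_hold_time(local: int, remote: int) -> int:
    """Return the Hold Time both peers use: the smaller of the two offers."""
    for value in (local, remote):
        # Assumption: Hold Time is a 16-bit field, so 65535 is the ceiling.
        if not 0 <= value <= 65535:
            raise ValueError(f"Hold Time {value} is out of range")
        # The text allows zero, or any value of at least 3 seconds.
        if value != 0 and value < 3:
            raise ValueError(f"Hold Time {value} must be 0 or at least 3")
    return min(local, remote)


print(negotiate_hold_time(180, 90))  # the smaller value is selected
```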
UPDATE
The UPDATE message advertises any feasible routes, withdraws previously advertised
routes, or can do both. The UPDATE message includes the Network Layer Reachability
Information (NLRI) that includes the prefix and associated BGP PAs when advertising
prefixes. Withdrawn NLRIs include only the prefix. An UPDATE message can act as a
KEEPALIVE message to reduce unnecessary traffic.
NOTIFICATION
A NOTIFICATION message is sent when an error is detected with the BGP session,
such as an expired Hold Timer, a change in neighbor capabilities, or a requested BGP
session reset. This causes the BGP connection to close.
Note More details on the BGP messages are discussed in the troubleshooting sections.
KEEPALIVE
BGP does not rely upon the TCP connection state to ensure that the neighbors are still
alive. KEEPALIVE messages are exchanged every 1/3 of the Hold Timer agreed upon
between the two BGP routers. Cisco devices have a default Hold Time of 180 seconds,
so the default KEEPALIVE interval is 60 seconds. If the Hold Time is set for zero, no
KEEPALIVE messages are sent between the BGP neighbors.
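As a quick worked example of the one-third rule, the sketch below derives the KEEPALIVE interval from a negotiated Hold Time. The function name keepalive_interval is hypothetical, and returning None for a zero Hold Time is just one way to represent "no keepalives".

```python
from typing import Optional


def keepalive_interval(hold_time: int) -> Optional[float]:
    """Return the KEEPALIVE interval in seconds, or None when none are sent."""
    if hold_time == 0:
        return None  # a Hold Time of zero disables KEEPALIVE messages
    return hold_time / 3  # one third of the agreed Hold Time


print(keepalive_interval(180))  # Cisco default Hold Time
print(keepalive_interval(0))
```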
A BGP session progresses through the following finite state machine (FSM) states:
■ Idle
■ Connect
■ Active
■ OpenSent
■ OpenConfirm
■ Established
Figure 11-1 displays the BGP FSM and the states in order of establishing a BGP session.
[Figure 11-1: BGP FSM states in order: 1 Idle, 2 Connect, 3 Active, 4 OpenSent,
5 OpenConfirm, 6 Established]
Idle
This is the first stage of the BGP FSM. BGP detects a start event and tries to initiate
a TCP connection to the BGP peer and also listens for a new connection from a peer router.
If an error causes BGP to go back to the Idle state for a second time, the
ConnectRetryTimer is set to 60 seconds and must decrement to zero before
the connection is initiated again. Further failures to leave the Idle state result in
the ConnectRetryTimer doubling in length from the previous time.
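The doubling behavior described above can be illustrated with a short sketch. The function name is hypothetical, and no upper cap on the timer is modeled because the text does not state one.

```python
def connect_retry_timers(failures: int, initial: int = 60) -> list:
    """Timer values (seconds) for successive failures to leave Idle."""
    # Each further failure doubles the previous value: 60, 120, 240, ...
    return [initial * 2 ** n for n in range(failures)]


print(connect_retry_timers(4))
```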
Connect
In this state, BGP initiates the TCP connection. If the 3-way TCP handshake completes,
the BGP process resets the ConnectRetryTimer, sends the Open message to the neighbor,
and changes to the OpenSent state.
If the ConnectRetryTimer depletes before this stage is complete, a new TCP connection is
attempted, the ConnectRetryTimer is reset, and the state is moved to Active. If any other
input is received, the state is changed to Idle.
During this stage, the neighbor with the higher IP address manages the connection.
The router initiating the request uses a dynamic source port, but the destination port is
always 179.
Note Service providers consistently assign their customers the higher or lower IP address
for their networks. This helps the service provider create proper instructions for ACLs or
firewall rules, or for troubleshooting them.
Active
In this state, BGP starts a new 3-way TCP handshake. If a connection is established,
an Open message is sent, the Hold Timer is set to 4 minutes, and the state moves to
OpenSent. If this attempt for TCP connection fails, the state moves back to the Connect
state and resets the ConnectRetryTimer.
OpenSent
In this state, the originating router has sent an Open message and is awaiting
an Open message from the other router. After the originating router receives the
OPEN message from the other router, both OPEN messages are checked for errors.
The following items are checked:
■ The source IP Address of the OPEN message must match the IP address that is con-
figured for the neighbor.
■ The AS number in the OPEN message must match what is configured for the neighbor.
■ BGP Identifiers (RID) must be unique. If a RID does not exist, this condition is not met.
If the Open messages do not have any errors, the Hold Time is negotiated (using the
lower value), and a KEEPALIVE message is sent (assuming the value is not set to zero).
The connection state is then moved to OpenConfirm. If an error is found in the OPEN
message, a Notification message is sent, and the state is moved back to Idle.
If TCP receives a disconnect message, BGP closes the connection, resets the
ConnectRetryTimer, and sets the state to Active. Any other input in this process results in
the state moving to Idle.
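The OPEN message checks listed for the OpenSent state can be sketched as a small validation function. This is an illustrative model only: the class and field names are hypothetical, and treating 0.0.0.0 as a missing RID is an assumption.

```python
from dataclasses import dataclass


@dataclass
class OpenMessage:
    source_ip: str   # source IP address of the received OPEN
    asn: int         # AS number carried in the OPEN
    router_id: str   # BGP Identifier (RID) carried in the OPEN


@dataclass
class NeighborConfig:
    peer_ip: str     # neighbor address configured locally
    remote_as: int   # remote-as configured for that neighbor


def check_open(msg: OpenMessage, neighbor: NeighborConfig, local_rid: str) -> list:
    """Return a list of error strings; an empty list means the OPEN passed."""
    errors = []
    if msg.source_ip != neighbor.peer_ip:
        errors.append("source IP does not match the configured neighbor")
    if msg.asn != neighbor.remote_as:
        errors.append(
            f"bad remote-as, expecting {neighbor.remote_as} received {msg.asn}"
        )
    # Assumption: an empty or all-zero RID counts as "does not exist".
    if msg.router_id in ("", "0.0.0.0") or msg.router_id == local_rid:
        errors.append("BGP Identifier missing or not unique")
    return errors


received = OpenMessage("10.46.1.6", 65002, "192.168.6.6")
print(check_open(received, NeighborConfig("10.46.1.6", 65001), "192.168.4.4"))
```

An empty result corresponds to the move to OpenConfirm; any entry corresponds to sending a Notification and returning to Idle.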
OpenConfirm
In this state, BGP waits for a Keepalive or Notification message. Upon receipt of a
neighbor’s Keepalive, the state is moved to Established. If the Hold Timer expires, a stop
event occurs, or a Notification message is received, the state is moved to Idle.
Established
In this state, the BGP session is established. BGP neighbors exchange routes via Update
messages. As Update and Keepalive messages are received, the Hold Timer is reset. If
the Hold Timer expires, an error is detected, and BGP moves the neighbor back to the
Idle state.
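The state descriptions above can be condensed into a transition table. The sketch below is a simplification under stated assumptions: the event names are hypothetical labels, timers are not modeled, and every input not listed in the table falls back to Idle, which the text states explicitly only for some states.

```python
# (current state, event) -> next state, following the descriptions above.
TRANSITIONS = {
    ("Idle", "start"): "Connect",
    ("Connect", "tcp_established"): "OpenSent",
    ("Connect", "connect_retry_expired"): "Active",
    ("Active", "tcp_established"): "OpenSent",
    ("Active", "tcp_failed"): "Connect",
    ("OpenSent", "open_ok"): "OpenConfirm",
    ("OpenSent", "tcp_disconnect"): "Active",
    ("OpenConfirm", "keepalive"): "Established",
    ("Established", "keepalive"): "Established",
    ("Established", "update"): "Established",
}


def next_state(state: str, event: str) -> str:
    """Look up the next state; unlisted inputs fall back to Idle (assumption)."""
    return TRANSITIONS.get((state, event), "Idle")


def run(events: list, state: str = "Idle") -> str:
    """Feed a sequence of events through the table and return the final state."""
    for event in events:
        state = next_state(state, event)
    return state


print(run(["start", "tcp_established", "open_ok", "keepalive"]))
```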
Step 1. Create the BGP routing process. Initialize the BGP process with the global
configuration command router bgp as-number.
Step 2. Assign a BGP router-id. Assign a unique BGP router-id under the BGP router
process. The router-id can be an IP address assigned to a physical interface or
a Loopback interface.
Step 3. Initialize the address-family. Initialize the address-family with the BGP router
configuration command address-family afi safi so it can be associated to a
BGP neighbor.
Step 4. Identify the BGP neighbor’s IP address and autonomous system number.
Identify the BGP neighbor’s IP address and autonomous system number with
the BGP router configuration command neighbor ip-address remote-as
as-number.
Step 5. Activate the address-family for the BGP neighbor. Activate the address-
family for the BGP neighbor with the BGP neighbor configuration command
address-family afi safi.
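The five steps can be sketched as a small template function that emits the commands in order. This is only an illustration: render_bgp_config is a hypothetical helper, the indentation is cosmetic, and the output uses only the command forms named in the steps above.

```python
def render_bgp_config(local_as, router_id, neighbors, afi="ipv4", safi="unicast"):
    """Build configuration text; neighbors is a list of (ip, remote_as) pairs."""
    lines = [
        f"router bgp {local_as}",          # Step 1: create the BGP process
        f"  router-id {router_id}",        # Step 2: assign the router-id
        f"  address-family {afi} {safi}",  # Step 3: initialize the address-family
    ]
    for ip, remote_as in neighbors:
        # Step 4: identify the neighbor and its AS number
        lines.append(f"  neighbor {ip} remote-as {remote_as}")
        # Step 5: activate the address-family for that neighbor
        lines.append(f"    address-family {afi} {safi}")
    return "\n".join(lines)


print(render_bgp_config(65000, "192.168.4.4",
                        [("10.46.1.6", 65001), ("192.168.1.1", 65000)]))
```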
Examine the topology shown in Figure 11-2. This topology is used as reference for the
next section as well. In this topology, Nexus devices NX-1, NX-2, and NX-4 are part of
AS 65000, whereas router NX-6 belongs to AS 65001.
Example 11-1 displays the BGP configuration for router NX-4 demonstrating both IBGP
and EBGP peering. For this example, NX-4 is trying to establish an IBGP peering with
NX-1 and an EBGP peering with NX-6. While configuring a BGP peering, it is important
to ensure the following information is correct:
■ Source peering IP
■ Remote peering IP
In Example 11-1, NX-4 is forming an IBGP peering with NX-1 and an EBGP peering with
NX-6 router. The NX-4 device is also advertising its loopback address under the IPv4
address family using the network command.
NX-4
feature bgp
router bgp 65000
  router-id 192.168.4.4
  address-family ipv4 unicast
    network 192.168.4.4/32
    redistribute direct route-map conn
  neighbor 10.46.1.6
    remote-as 65001
    address-family ipv4 unicast
  neighbor 192.168.1.1
    remote-as 65000
    update-source loopback0
    address-family ipv4 unicast
      next-hop-self
!
ip prefix-list connected-routes seq 5 permit 10.46.1.0/24
!
route-map conn permit 10
  match ip address prefix-list connected-routes
After the BGP peering is established, the BGP prefixes are verified using the command show
bgp afi safi. This command lists all the BGP prefixes in the respective address families.
Example 11-3 displays the output of the BGP prefixes on NX-4. In the output, the BGP table
holds locally advertised prefixes with the next-hop value of 0.0.0.0, the next-hop IP address,
and a flag to indicate whether the prefix was learned from an IBGP (i) or EBGP (e) peer.
On NX-OS, the BGP process is instantiated the moment the router bgp asn command
is configured. The details of the BGP process and the summarized configuration are
viewed using the command show bgp process. This command displays the BGP process
ID, state, number of configured and active BGP peers, BGP attributes, VRF information,
redistribution and relevant route-maps used with various redistribute statements, and so
on. If there is a problem with the BGP process, use this command to verify the
state of BGP along with the memory information of the BGP process. Example 11-4
displays the output of the command show bgp process and highlights some of the
important fields.
Redistribution
direct, route-map conn
Nexthop trigger-delay
critical 3000 ms
non-critical 10000 ms
Redistribution
None
Nexthop trigger-delay
critical 3000 ms
non-critical 10000 ms
BGP peering issues are among the most common problems that network operators experience
in production environments. The impact of a down or flapping BGP peer ranges from
minimal (if there is redundancy in the network) to severe (where the peering to the
Internet provider is completely down). This section focuses on troubleshooting both
down and flapping peers.
A down BGP peer is in either the Idle or the Active state. From the peer state standpoint,
these states point to the following possible problems:
■ Idle State
■ Active State
■ Idle/Active State
The following subsections list the various steps involved in troubleshooting BGP peering
down issues.
Verifying Configuration
The very first step in troubleshooting BGP peering issues is verifying the configuration
and understanding the design. Many times, a basic configuration mistake causes a BGP
peering not to establish. The following items should be checked when a new BGP session
is configured:
■ Local AS number
■ Remote AS number
It is important to understand the traffic flow of BGP packets between peers. By default,
the source IP address of BGP packets reflects the IP address of the outbound interface.
When a BGP packet is received, the router correlates the source IP address of the packet
to the BGP neighbor table. If the BGP packet source does not match an entry in the
neighbor table, the packet cannot be associated to a neighbor and is discarded.
In most deployments, the iBGP peering is established over a loopback interface,
and if the update-source interface is not specified, the session does not come up.
Verify the explicit sourcing of BGP packets from an interface by ensuring that the
update-source interface-id command under the neighbor ip-address configuration
section is correctly configured for the peer.
If there are multiple hops between the EBGP peers, a proper hop count is required. Ensure
that ebgp-multihop [hop-count] is configured with the correct hop count. If the hop-count
is not specified, the default value is set to 255. Note that the default TTL value for IBGP
sessions is 255, whereas the default value for EBGP sessions is 1. If an EBGP peering is
established between two directly connected devices but over the loopback address, users can
also use the disable-connected-check command instead of using the ebgp-multihop 2 command.
This command disables the connection verification mechanism, which by default, prevents
the session from getting established when the EBGP peer is not in the directly connected
segment.
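The TTL defaults described above can be summarized in a short sketch. The function and parameter names are hypothetical; the values come from the text (255 for IBGP, 1 for EBGP, and 255 for ebgp-multihop without a hop count).

```python
from typing import Optional


def session_ttl(is_ebgp: bool, multihop: bool = False,
                hop_count: Optional[int] = None) -> int:
    """Return the TTL a BGP session uses under the defaults in the text."""
    if not is_ebgp:
        return 255          # IBGP sessions default to TTL 255
    if not multihop:
        return 1            # EBGP sessions default to TTL 1
    # ebgp-multihop without an explicit hop count defaults to 255
    return 255 if hop_count is None else hop_count


print(session_ttl(is_ebgp=True))                           # plain EBGP
print(session_ttl(is_ebgp=True, multihop=True, hop_count=2))  # loopback peering
```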
Note At times, users may experience packet loss when performing a ping test. If there is
a pattern in the loss seen during the ping test, it is most likely due to a CoPP policy
that is dropping those packets.
Using the preceding ping methods, reachability is verified for both the IBGP and EBGP
peers. But if there is a problem with the reachability, use the following procedure to
isolate the problem or direction of the problem.
Identify the direction of packet loss. The show ip traffic command on NX-OS is used to
identify the packet loss or direction of the packet loss. If there is a complete or random
packet loss of the ping (ICMP) packets from source to destination, use this method. The
command output has the section of ICMP Software Processed Traffic Statistics, which
consists of two subsections: Transmission and Reception. Both the sections consist of
statistics for echo request and echo reply packets. To perform this test, first ensure that
the sent and receive counters are stable (not incrementing) on both the source and the
destination devices. Then initiate the ping test toward the destination by specifying the
source interface or IP address. After the ping is completed, verify the show ip traffic
command to validate the increase in counters on both sides to understand the direction
of the packet loss. Example 11-6 demonstrates the method for isolating the direction of
packet loss. In this example, the ping is initiated from NX-1 to the NX-4 loopback. The
first output shows 10 echo request packets received and 10 echo reply packets sent.
After the ping test from NX-1 to the NX-4 loopback, both counters increase to 15.
NX-4
NX-1
NX-1# ping 192.168.4.4 source 192.168.1.1
PING 192.168.4.4 (192.168.4.4) from 192.168.1.1: 56 data bytes
64 bytes from 192.168.4.4: icmp_seq=0 ttl=253 time=3.901 ms
64 bytes from 192.168.4.4: icmp_seq=1 ttl=253 time=2.913 ms
64 bytes from 192.168.4.4: icmp_seq=2 ttl=253 time=2.561 ms
64 bytes from 192.168.4.4: icmp_seq=3 ttl=253 time=2.502 ms
64 bytes from 192.168.4.4: icmp_seq=4 ttl=253 time=2.571 ms
NX-4
NX-4# show ip traffic | in Transmission:|Reception:|echo
Transmission:
Redirect: 0, unreachable: 0, echo request: 33, echo reply: 15,
Reception:
Redirect: 0, unreachable: 0, echo request: 15, echo reply: 29,
Similarly, the outputs are verified on NX-1 for the echo reply received counters. In the
previous example, the ping test is successful, and thus both the echo request received and
echo reply sent counters incremented. In situations where the ping test is failing, it is
worth checking these counters closely over multiple iterations of the test. If the ping to
the destination device is failing but both counters still increment on the destination
device, the problem could be with the return path, and the users may have to check the
path for the return traffic.
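The counter comparison described above can be automated with a small parser. The sketch below is an assumption-laden illustration: it expects the filtered show ip traffic text in the layout shown for NX-4, the BEFORE snapshot is synthesized from the values quoted in the text (10 and 10), and other ICMP traffic on the device would skew the deltas.

```python
import re

SECTIONS = ("Transmission:", "Reception:")


def parse_icmp_counters(output: str) -> dict:
    """Map (section, counter name) to its value from filtered show ip traffic text."""
    counters = {}
    section = None
    for raw in output.splitlines():
        line = raw.strip()
        if line in SECTIONS:
            section = line.rstrip(":").lower()
        elif section is not None:
            pairs = re.findall(r"(echo request|echo reply): (\d+)", line)
            for name, value in pairs:
                counters[(section, name)] = int(value)
    return counters


def diagnose(before: dict, after: dict, pings_sent: int) -> str:
    """Compare destination-side counters taken before and after a ping test."""
    key_rx = ("reception", "echo request")
    key_tx = ("transmission", "echo reply")
    received = after[key_rx] - before[key_rx]
    replied = after[key_tx] - before[key_tx]
    if received < pings_sent:
        return "forward path: not all echo requests reached the destination"
    if replied < received:
        return "destination: requests arrived but not all were answered"
    return "destination answered every request: check the return path if the ping still fails"


AFTER = """Transmission:
 Redirect: 0, unreachable: 0, echo request: 33, echo reply: 15,
Reception:
 Redirect: 0, unreachable: 0, echo request: 15, echo reply: 29,"""
# Hypothetical "before" snapshot built from the values quoted in the text.
BEFORE = AFTER.replace("echo reply: 15", "echo reply: 10").replace(
    "echo request: 15", "echo request: 10")

print(diagnose(parse_icmp_counters(BEFORE), parse_icmp_counters(AFTER), pings_sent=5))
```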
ACLs are useful when troubleshooting packet loss or reachability issues. Configuring an
ACL that matches the source and the destination IP helps confirm whether the packet has
actually reached the destination router. The one caution is that permit ip any any should
be configured at the end of the ACL; otherwise, other packets are dropped, causing a
service impact. Example 11-7 shows how the ACL configuration should look if BGP is passing
through that link. The example shows the configuration for both IPv4 and IPv6 access
lists, the latter for IPv6 BGP sessions. To apply an IPv4 ACL on an interface, use the
ip access-group access-list-name {in|out} command on all platforms. For an IPv6 ACL, use
the ipv6 traffic-filter access-list-name {in|out} interface command on NX-OS.
Other than having ACLs configured on the edge devices, many deployments have firewalls
to protect the network from unwanted and malicious traffic. It is a better option to
have a firewall installed than to have a huge ACL configured on the routers and switches.
Firewalls can be configured in two modes:
■ Routed mode
■ Transparent mode
In routed mode, the firewall has routing capabilities and is considered to be a routed hop
in the network. In transparent mode, the firewall is not considered as a router hop to the
connected device but merely acts like a bump in the wire. Thus, if an EBGP session is
being established across a transparent firewall, ebgp-multihop might not be required, and
even if it is required to configure ebgp-multihop due to multiple devices in the path, the
firewall is not counted as another routed hop.
Firewalls implement various security levels for the interfaces. For example, the ASA
Inside interface is assigned a security level of 100 and the Outside interface is assigned
security level 0. An ACL needs to be configured to permit the relevant traffic from the
lower-security interface toward the higher-security interface. This rule applies to both
routed and transparent mode firewalls, so an ACL is required in both cases.
Bridge groups are configured in a transparent mode firewall for each network to help
minimize the overhead on security contexts. The interfaces are made part of a bridge
group, and a Bridge Virtual Interface (BVI) is configured with a management
IP address.
Example 11-8 displays an ASA ACL configuration that allows ICMP as well as BGP
packets to traverse across the firewall and shows how to assign the ACL to the interface.
Any traffic that is not part of the ACL is dropped.
interface GigabitEthernet0/0
 nameif Inside
 bridge-group 200
 security-level 100
!
interface GigabitEthernet0/1
 nameif Outside
 bridge-group 200
 security-level 0
!
! Creating BVI with Management IP and should be the same subnet
! as the connected interface subnet
interface BVI200
 ip address 10.1.13.10 255.255.255.0
!
access-list Out extended permit icmp any any
access-list Out extended permit tcp any eq bgp any
access-list Out extended permit tcp any any eq bgp
!
access-group Out in interface Outside
In the access-list named Out, only one of the two statements permitting BGP packets is
strictly required, but it is good practice to configure both.
Another problem users might run into with a firewall in the middle involves a couple of
features on an ASA firewall:
ASA firewalls by default perform TCP sequence number randomization and thus can cause
BGP sessions to flap. Also, if the BGP peering is secured using MD5 authentication,
allow TCP option 19 in the firewall’s policy.
If BGP peering is not getting established, it may be possible that there is a stale entry
in the TCP table. The stale entry may show the TCP session to be in established state
and thus prevent the router from initiating another TCP connection, thus preventing the
router from establishing a BGP peering.
A good troubleshooting technique for down BGP peers is using Telnet on TCP port 179
toward the destination peer IP and using local peering IP as the source. This technique
helps ensure that the TCP is not getting blocked or dropped between the two BGP peer-
ing devices. This test is useful for verifying any TCP issues on the destination router and
also helps verify any ACL that could possibly block the BGP packets.
Example 11-10 shows the use of Telnet on port 179 from NX-1 (192.168.1.1) to NX-4
(192.168.4.4) to verify BGP session. When this test is performed, the BGP TCP session
gets established but is closed/disconnected immediately.
If the Telnet is not sourced from the interface or IP address that the remote device is
configured to form a BGP neighborship with, the Telnet request is refused. This is another
way to confirm whether the peering device configuration matches the documentation.
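The same TCP reachability test can be scripted from a management host. The sketch below is an illustration using Python's standard socket module, not an NX-OS feature; tcp_port_open is a hypothetical helper, and it only confirms that the TCP handshake completes, not that BGP comes up.

```python
import socket
from typing import Optional


def tcp_port_open(peer_ip: str, port: int = 179,
                  source_ip: Optional[str] = None, timeout: float = 3.0) -> bool:
    """Return True if a TCP handshake to peer_ip:port completes."""
    # Bind to the local peering IP when given, mirroring a sourced Telnet.
    source = (source_ip, 0) if source_ip else None
    try:
        with socket.create_connection((peer_ip, port), timeout=timeout,
                                      source_address=source):
            return True
    except OSError:
        # Refused, filtered, timed out, or the source address is not local.
        return False


# Example call using the chapter's addresses (requires a host that owns 192.168.1.1):
# tcp_port_open("192.168.4.4", source_ip="192.168.1.1")
```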
When troubleshooting TCP connection issues, it is also important to check the event-
history logs for a netstack process as well. Netstack is an implementation of a Layer-2
to Layer-4 stack on NX-OS. It is one of the critical components involved in the control
plane on NX-OS. If there is a problem with establishing a TCP session on a Nexus device,
it could be a problem with the netstack process. The show sockets internal event-history
events command helps understand what TCP state transitions happened for the BGP
peer IP.
Example 11-11 demonstrates the use of the show sockets internal event-history events
command to see the TCP session getting closed for BGP peer IP 192.168.2.2, but it does
not show any request coming in.
Note For any problems encountered with TCP-related protocol such as BGP, capture
show tech netstack [detail] and share the information with Cisco TAC.
Out of the reasons listed, a wrong peer AS and a bad BGP Identifier are the most common
OPEN message errors and are usually caused by documentation or human error.
The notification messages for these two errors are self-explanatory and clearly indicate
the received value and the expected value, as shown in Example 11-12. In this example,
the router expects the peer to be in AS 65001 but receives AS 65002.
04:51:33 NX-4 %BGP-3-BADPEERAS: bgp-100 [9544] VRF default, Peer 10.46.1.6 - bad
remote-as, expecting 65001 received 65002.
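When many such messages must be reviewed, the expected and received AS numbers can be pulled out with a regular expression. The sketch below assumes the message layout shown in the example log, including the line wrap after "bad"; the function name is hypothetical.

```python
import re

BAD_PEER_AS = re.compile(
    r"%BGP-3-BADPEERAS:.*?Peer (?P<peer>\S+) - bad\s+remote-as, "
    r"expecting (?P<expected>\d+) received (?P<received>\d+)"
)


def parse_bad_peer_as(log: str):
    """Return peer, expected AS, and received AS, or None if no match."""
    match = BAD_PEER_AS.search(log)
    if match is None:
        return None
    return {
        "peer": match["peer"],
        "expected": int(match["expected"]),
        "received": int(match["received"]),
    }


LOG = ("04:51:33 NX-4 %BGP-3-BADPEERAS: bgp-100 [9544] VRF default, "
       "Peer 10.46.1.6 - bad\nremote-as, expecting 65001 received 65002.")
print(parse_bad_peer_as(LOG))
```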
During the initial BGP negotiation between the BGP speakers, certain capabilities are
exchanged. If any of the BGP speakers are receiving a capability that they do not support,
BGP detects an OPEN message error for unsupported capability (or unsupported optional
parameter). For instance, if one of the BGP speakers supports enhanced route refresh but
the BGP speaker on the receiving end runs older software that lacks the capability, the
receiver detects this as an OPEN message error. The following optional capabilities are
negotiated between the BGP speakers:
■ 4-byte AS capability
■ Multiprotocol capability
■ Single/Multisession capability
BGP Debugs
Running debugs should always be the last resort for troubleshooting any network problem
because debugs can sometimes cause an impact in the network if not used carefully.
But sometimes they are the only option when other troubleshooting techniques do not
help explain the problem. Using the NX-OS debug logfile, users can mitigate the
impact of chatty debug outputs. Along with using debug logfile, network
operators can put a filter on the debugs using the debug-filter and filter the output for
a specific neighbor, prefix, and even the address-family, thus further reducing the
possibility of an impact on the Nexus switch.
When a BGP peer is down and the other troubleshooting steps do not reveal where the
problem is, enable debugs to see whether the router is generating and sending the
necessary BGP packets and whether it is receiving the relevant packets.
However, debugs are often not required on NX-OS because the BGP traces usually hold
sufficient information to debug the problem. Several debugs are available for BGP.
Depending on the state in which BGP is stuck, certain debug commands are helpful.
For a BGP peering down situation, one of the key debugs used is for BGP keepalives. The
BGP keepalive debug is enabled using the command debug bgp keepalives. In the debug
output, the two important factors to consider for ensuring a successful BGP peering are
as follows:
If the BGP keepalive is being generated at regular intervals but the BGP peering still
remains down, it may be possible that the BGP keepalive could not make it to the other
end, or that it reached the peering router but was dropped or not processed. In such cases,
BGP keepalive debugs are useful. Enable the debug command debug bgp keepalives to
verify whether the BGP keepalives are being sent and received. Example 11-13 illustrates
the use of BGP keepalive debug. The first output helps the user verify that the BGP
keepalive is being generated every 60 seconds. The second output shows the keepalive
being received from the remote peer 192.168.1.1.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                                                               |
+                                                               +
|                                                               |
+                                                               +
|                           Marker                              |
+                                                               +
|                                                               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|          Length               |      Type     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
In addition to the fixed-size BGP message header, a notification contains the following, as
shown in Figure 11-4.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
| Error Code    | Error Subcode |   Data (Variable)             |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
The Error code and Error subcode values are defined in RFC 4271. Table 11-3 shows all
the Error codes and Error subcodes along with their interpretation.
Whenever a notification is generated, the error code and the subcode are always printed
in the message. These notification messages are really helpful when troubleshooting
down peering issues or flapping peer issues.
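To make the layout concrete, the sketch below decodes a NOTIFICATION from raw bytes: the 19-byte header (16-byte Marker, 2-byte Length, 1-byte Type) followed by the Error code, Error subcode, and Data. The error code names follow RFC 4271; the function name and the sample packet are hypothetical, and subcode decoding is left out.

```python
import struct

# Error code names as defined in RFC 4271.
ERROR_CODES = {
    1: "Message Header Error",
    2: "OPEN Message Error",
    3: "UPDATE Message Error",
    4: "Hold Timer Expired",
    5: "Finite State Machine Error",
    6: "Cease",
}
NOTIFICATION_TYPE = 3  # BGP message type value for NOTIFICATION


def parse_notification(packet: bytes) -> dict:
    """Decode the fixed header plus the NOTIFICATION error fields."""
    marker, length, msg_type = struct.unpack("!16sHB", packet[:19])
    if msg_type != NOTIFICATION_TYPE:
        raise ValueError(f"not a NOTIFICATION (type {msg_type})")
    code, subcode = struct.unpack("!BB", packet[19:21])
    return {
        "length": length,
        "code": code,
        "subcode": subcode,
        "error": ERROR_CODES.get(code, "Unknown"),
        "data": packet[21:length],  # variable-length diagnostic data
    }


# Hypothetical sample: all-ones marker, length 21, type 3, code 4, subcode 0.
sample = b"\xff" * 16 + struct.pack("!HBBB", 21, 3, 4, 0)
print(parse_notification(sample))
```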
The methodology for troubleshooting IPv6 BGP peers is the same as that for IPv4 BGP peers.
Here are a few steps you can use to troubleshoot down peering issues for IPv6 BGP
neighbors:
Step 1. Verify the configuration for correct peering IPv6 addresses, AS numbers,
update-source interface, authentication passwords, and EBGP multihop configuration.
Step 3. Verify the TCP connections using the command show sockets connection tcp
on NX-OS. In the case of IPv6, check for TCP connections between the source and
destination IPv6 addresses with port 179 as one of the ports.
Step 4. Verify any IPv6 ACLs in the path. As with IPv4, the IPv6 ACLs in the path should
permit TCP connections on port 179 and the ICMPv6 packets that help in
verifying reachability.
Step 5. Debugs. On NX-OS switches, use the debug bgp ipv6 unicast neighbors
ipv6-neighbor-address command to capture IPv6 BGP packets. Before
enabling the debugs, enable the debug logfile for BGP debug. To filter the
debugs for a particular IPv6 neighbor, use an IPv6 ACL.
■ MTU mismatch
■ High CPU
Whenever a BGP update is corrupted, a BGP notification is generated with the error code
of 3, as shown in Table 11-3. When an error is noticed in the BGP update, BGP generates
a hex-dump of the bad update message, which can be further decoded to understand
which section of the update was corrupted. Along with the hex-dump, BGP also gener-
ates a log message that explains what kind of update error has occurred, as shown in
Example 11-14.
Use the command debug bgp packets to view the BGP messages in hexdump, which can
be further decoded. If too many BGP updates and messages are being exchanged on the
NX-OS devices, a better option is to use Ethanalyzer or SPAN to capture the malformed
BGP update packet for further analysis.
Note The hexdump in the BGP message can be further analyzed using some online tools,
such as http://bgpaste.convergence.cx.
■ Interface/platform drops
■ MTS queue stuck
■ Control-plane policy drops
■ BGP Keepalive generation
■ MTU issues
One reason may be interface/platform drops. Interface issues such as a physical layer
problem or drops on the interface can cause the BGP session to flap due to Hold
Timer expiry. If the interface is carrying excessive traffic, or the line card itself is
overloaded or busy, packets may get dropped at the interface level or on the line card
ASIC. If BGP keepalive or update packets are dropped in such instances, BGP may
notify the peer of Hold Timer expiry.
Another possibility is that the MTS queue is stuck. Sometimes, BGP keepalives have
arrived at the TCP receive queue but are not being processed and moved to the BGP
InQ. This is noticed when the BGP InQ is empty and a BGP neighbor goes
down due to Hold Timer expiry. The most common reason for such a scenario on Nexus
switches is an MTS queue that is stuck on either the BGP or the TCP process. MTS is
the main component that carries information from one component to
another within NX-OS. In such scenarios, multiple BGP peers on the system may be
impacted. To recover, a supervisor switchover or a reload may be required.
In addition, CoPP policy drops can also be a cause. The CoPP policy is designed to protect
the CPU from excessive and unwanted traffic. But a poorly designed CoPP policy
causes control-plane protocol flaps. If the CoPP policy has not been sized to
handle all the BGP control-plane packets and the number of BGP peers on the router,
there might be instances where those packets get dropped. In such situations, users
might experience random BGP flaps due to the CoPP policy dropping certain packets.
Note MTS, CoPP, and other platform troubleshooting is covered in detail in Chapter 3,
“Troubleshooting Nexus Platform Issues.”
■ How is the traffic load on the interface/system when the flap occurs?
■ Is the CPU high during the time of the flap? If yes, is it due to traffic or a particular
process?
These questions help lay out a pattern for the BGP flaps so that relevant troubleshooting
can be performed around the time they occur. To further troubleshoot the problem,
understand that the BGP flap is due to one of two reasons:
■ Either keepalives getting generated at regular intervals but not leaving the router or
not making it to the other end.
If the keepalives are getting generated at regular intervals but are not leaving the
router, the OutQ for the BGP peer keeps piling up. The OutQ keeps incrementing
due to keepalive generation, but MsgSent does not increase, which may be an indication
that the messages are stuck in the OutQ. Example 11-15 illustrates such a scenario
where the BGP keepalives are generated at regular intervals but do not leave the router,
leading to a BGP flap due to hold timer expiry. Notice that in this example, the OutQ
value increases from 10 to 12, but the MsgSent counter is stagnant at 3938. In this
scenario, the peering may flap each time the BGP hold timer expires.
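The symptom above, a growing OutQ with a stagnant MsgSent, is easy to check across two snapshots. The sketch below assumes the counters have already been collected into dictionaries; the function name and the dictionary layout are hypothetical.

```python
def outq_stuck(before: dict, after: dict) -> bool:
    """True when OutQ grew between snapshots while MsgSent did not move."""
    return after["OutQ"] > before["OutQ"] and after["MsgSent"] == before["MsgSent"]


# Values quoted in the text: OutQ goes from 10 to 12, MsgSent stays at 3938.
first = {"OutQ": 10, "MsgSent": 3938}
second = {"OutQ": 12, "MsgSent": 3938}
print(outq_stuck(first, second))
```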
But if the device experiences random BGP flaps at irregular intervals, it is possible
that the BGP keepalives are getting generated at regular intervals, although the flaps may
still happen frequently. For instance, a BGP peering flaps every 4 to 10 minutes. These
issues are hard to troubleshoot and may require a different technique than just running
show commands. The reason is that it is not easy to isolate which device is not generating
the keepalive in a timely manner, or whether the keepalive is generated in a timely manner
but is delayed on its way to the remote peer. To troubleshoot, follow the two-step
process between the two ends of the BGP connection.
Step 1. Enable BGP keepalive debug on both routers along with the debug logfile.
Step 2. Enable Ethanalyzer or another packet capture tool on both routers.
The purpose of enabling Ethanalyzer or any other packet capture tool (based on the
underlying platform) is that the BGP keepalives may reach the other end in a timely
manner but then be delayed before reaching the BGP process itself. By matching the
timelines from the BGP keepalive debug and the Ethanalyzer capture on the far-end device,
it is possible to conclude where exactly the delay that causes BGP to flap is occurring.
It may be the BGP process that is delaying keepalive generation, or it may be other
components that interact with BGP that delay keepalive processing.
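The timeline matching described above can be sketched as follows. This is a rough illustration, not a tool from the book: the function names and timestamps are hypothetical, and the sketch assumes that both routers have synchronized clocks and that all three timestamps fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"  # assumed timestamp format, for example 10:15:00.100000


def gap_ms(earlier: str, later: str) -> float:
    """Milliseconds elapsed between two timestamps."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds() * 1000.0


def locate_delay(generated: str, captured: str, processed: str) -> str:
    """Report which segment of the keepalive's journey took longer.

    generated: keepalive debug timestamp on the sending router
    captured:  packet capture timestamp on the far-end router
    processed: keepalive debug timestamp on the far-end router
    """
    before_capture = gap_ms(generated, captured)
    after_capture = gap_ms(captured, processed)
    if before_capture >= after_capture:
        return "before far-end capture"
    return "after far-end capture"


# Hypothetical case: the packet arrives in 2 ms but reaches BGP 4.5 s later.
print(locate_delay("10:15:00.100000", "10:15:00.102000", "10:15:04.602000"))
```

A result of "after far-end capture" would point at the receiving router's internal path to the BGP process rather than at the network or the sender.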
BGP sends updates based on the Maximum Segment Size (MSS) value calculated by TCP.
If Path MTU Discovery (PMTUD) is not enabled, the MSS value for the BGP session defaults
to 536 bytes, as defined in RFC 879. The problem is that if a large number of updates are
exchanged between the two routers at an MSS of 536 bytes, convergence is slow and the
network is used inefficiently. An interface with an MTU of 1500 bytes can carry nearly
three times that MSS value, and far more if the interface supports jumbo MTU, yet the
updates still have to be broken into chunks of 536 bytes.
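The cost of a small MSS can be illustrated with simple arithmetic. The sketch below counts how many TCP segments are needed for the same amount of update data at different MSS values; the 1,000,000-byte volume is an arbitrary, hypothetical figure.

```python
def segments_needed(payload_bytes: int, mss: int) -> int:
    """Number of TCP segments needed to carry payload_bytes (ceiling division)."""
    return -(-payload_bytes // mss)


update_bytes = 1_000_000  # hypothetical volume of BGP update data
for mss in (536, 1460, 9060):
    print(f"MSS {mss}: {segments_needed(update_bytes, mss)} segments")
```

The segment count is only a proxy for convergence effort; real behavior also depends on TCP windowing and how BGP packs prefixes into update messages.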
Defined in RFC 1191, PMTUD was introduced to reduce the chances of IP packets being
fragmented along the path, which helps with faster convergence. Using PMTUD, the
source identifies the lowest MTU along the path to the destination and thus decides what
packet size should be sent.
How does PMTUD work? When the source generates a packet, it sets the packet size equal
to the MTU of the outgoing interface and sets the DF (Do-Not-Fragment) bit. Any
intermediate device that receives the packet and whose egress interface MTU is lower
than the packet size drops the packet and sends an ICMP error message back toward the
source. The message has Type 3 (Destination Unreachable) and Code 4 (Fragmentation
needed and DF bit set) and carries the MTU of the outgoing interface in the Next-Hop MTU
field. When the source receives the ICMP unreachable error message, it reduces the size
of the outgoing packet to the value specified in the Next-Hop MTU field. This process
continues until the packet successfully reaches the final destination.
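The PMTUD loop described above can be simulated in a few lines. This is a simplified model, not an implementation of RFC 1191: the function name is hypothetical, the path is reduced to a list of egress MTUs, and every ICMP error is assumed to reach the source.

```python
from __future__ import annotations


def discover_path_mtu(source_mtu: int, hop_mtus: list[int]) -> tuple[int, int]:
    """Return (path_mtu, icmp_errors) for a path described by egress MTUs."""
    size = source_mtu
    icmp_errors = 0
    while True:
        for mtu in hop_mtus:
            if size > mtu:
                # The hop drops the packet and reports its egress MTU
                # (ICMP Type 3, Code 4); the source retries at that size.
                size = mtu
                icmp_errors += 1
                break
        else:
            # No hop objected, so the packet reached the destination.
            return size, icmp_errors


# Hypothetical path: jumbo first hop, then two progressively smaller links.
print(discover_path_mtu(9100, [9100, 1500, 1400]))
```

If the ICMP errors are filtered somewhere along the path, the source never learns the smaller MTU, which is the failure mode behind the flapping scenario discussed in this section.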
BGP also supports PMTUD. PMTUD allows a BGP router to discover the best MTU size
along the path to a neighbor so that packets are exchanged efficiently. With path
MTU discovery enabled, the initial TCP negotiation between two neighbors uses an MSS
value equal to (IP MTU − 20-byte IP header − 20-byte TCP header), and the DF bit is set.
Thus, if the IP MTU value is 1500 (equal to the interface MTU), the MSS value is 1460.
If a device in the path, or even the destination router, has a lower MTU (for example,
1400), the MSS value is based on 1400 − 40 bytes = 1360 bytes. To calculate the MSS,
use the following formulas:
■ MSS without MPLS = MTU − IP Header (20 bytes) − TCP Header (20 bytes)
■ MSS over MPLS = MTU − IP Header − TCP Header − n*4 bytes (where n is the
number of labels in the label stack)
■ MSS across GRE Tunnel = MTU − IP Header (Inner) − TCP Header − [IP Header
(Outer) + GRE Header (4 bytes)]
Note MPLS VPN providers should increase the MPLS MTU to at least 1508 (assuming a
minimum of 2 labels) or to 1516 (to accommodate up to 4 labels).
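The MSS formulas above translate directly into code. The sketch below is a worked example only; the function names are hypothetical, and the header sizes assume IPv4 and TCP headers without options and a 4-byte GRE header, as in the formulas.

```python
IP_HEADER = 20    # bytes, IPv4 header without options
TCP_HEADER = 20   # bytes, TCP header without options
MPLS_LABEL = 4    # bytes per label in the label stack
GRE_HEADER = 4    # bytes, basic GRE header


def mss_plain(mtu: int) -> int:
    """MSS without MPLS."""
    return mtu - IP_HEADER - TCP_HEADER


def mss_mpls(mtu: int, labels: int) -> int:
    """MSS over MPLS with the given number of labels."""
    return mtu - IP_HEADER - TCP_HEADER - labels * MPLS_LABEL


def mss_gre(mtu: int) -> int:
    """MSS across a GRE tunnel (inner IP + TCP + outer IP + GRE)."""
    return mtu - IP_HEADER - TCP_HEADER - (IP_HEADER + GRE_HEADER)


print(mss_plain(1500), mss_plain(1400))
print(mss_mpls(1500, 2), mss_mpls(1508, 2))
print(mss_gre(1500))
```

The second line shows why raising the MPLS MTU to 1508 for a two-label stack restores the full 1460-byte MSS.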
Why does an MTU mismatch cause BGP sessions to flap? When the BGP connection is
established, the MSS value is negotiated over the TCP session. When BGP generates
updates, they are packaged into BGP update messages, which hold prefixes and header
information up to the maximum capacity of MSS bytes. These BGP update messages are then
sent to the remote peer with the do-not-fragment (DF) bit set. If a device in the path,
or even the destination, is not able to accept packets of that size, it sends an ICMP
error message back to the BGP speaker. If that ICMP message never reaches the speaker,
the oversized updates keep being dropped. Meanwhile, the destination router waits for
either a BGP keepalive or a BGP update packet to reset its hold timer. After 180 seconds
(the default hold time), the destination router sends a Notification back to the source
with a Hold Time expired error message.
Note When a BGP router sends an update to a BGP neighbor, it does not send a BGP
keepalive separately. Instead, it resets the keepalive timer for that neighbor. During the
BGP update process, the update message is treated as a keepalive by the BGP speakers.
Example 11-16 illustrates a BGP peer flapping problem when there is an MTU mismatch in
the path. Consider the same set of devices NX-1, NX-2, NX-4, and NX-6 from the topology
shown in Figure 11-2. In this topology, assume the devices have ICMP unreachables
disabled on their interfaces. The NX-6 device is advertising 10,000 prefixes to NX-4,
which are further advertised toward NX-1. The interface MTU on NX-1 and NX-4 is set to
9100, whereas the MTU on the NX-2 interface facing NX-1 is still set to the default;
that is, 1500. Because path MTU discovery (PMTUD) is enabled, the MSS is negotiated to
the value 9060. Because ICMP unreachables are disabled, the ICMP unreachable message
that reports the lower MTU setting on the NX-2 interface is never received by NX-1.
NX-4
NX-1
NX-1# show bgp ipv4 unicast summary
BGP summary information for VRF default, address family IPv4 Unicast
BGP router identifier 192.168.1.1, local AS number 65000
BGP table version is 37, IPv4 Unicast config peers 1, capable peers 1
4 network entries and 4 paths using 576 bytes of memory
BGP attribute entries [4/576], BGP AS path entries [1/6]
BGP community entries [0/0], BGP clusterlist entries [0/0]
NX-1
! Logs showing BGP flap after hold timer expiry
00:56:27.873 NX-1 %BGP-5-ADJCHANGE: bgp-65000 [6884] (default) neighbor 192.168.4.4
Down - holdtimer expired error
00:57:26.627 NX-1 %BGP-5-ADJCHANGE: bgp-65000 [6884] (default) neighbor 192.168.4.4 Up
The BGP flap does not occur when a small number of prefixes is exchanged between
the peers because the BGP packet size stays under 1460 bytes. One symptom of BGP flaps
due to MSS/MTU issues is a repetitive BGP flap caused by Hold Timer expiry.
The following are a few possible causes of BGP session flapping due to
MTU mismatch:
■ PMTUD did not calculate the correct MSS for the BGP TCP session.
To verify whether there are MTU mismatch issues in the path, perform an extended ping
test toward the remote peer with the DF bit set and the packet size based on the MTU of
the outgoing interface, as shown in Example 11-17. Also, ensure that ICMP messages are
not being blocked in the path so that PMTUD can function properly, and review the
configuration to ensure that the MTU values are consistent throughout the network.
Note The Nexus platform adds 28 bytes (20-byte IP header + 8-byte ICMP header) to the
packet size specified in the ping. Thus, when the ping test is performed with the DF bit
set, a ping with a size of 1500 fails on an interface with an MTU of 1500. To
successfully test the ping with the interface MTU packet size and the DF bit set,
subtract 28 bytes from the MTU value on the interface. In this case,
1500 − 28 = 1472.
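The 28-byte adjustment, and a way to search for the largest payload that a path accepts, can be sketched as follows. The probe callback is hypothetical: in practice it would wrap a DF-bit ping to the peer, whereas here it is simulated with a fixed path MTU of 1500.

```python
from typing import Callable

PING_OVERHEAD = 28  # 20-byte IP header + 8-byte ICMP header


def ping_payload_for_mtu(mtu: int) -> int:
    """Ping size to specify so that the resulting packet equals the MTU."""
    return mtu - PING_OVERHEAD


def largest_payload(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Binary search for the largest payload that passes a DF-bit ping.

    probe(size) is assumed to return True when the ping succeeds, and
    probe(low) is assumed to succeed.
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low


def simulated_probe(size: int) -> bool:
    """Stand-in for a real DF-bit ping across a path with an MTU of 1500."""
    return size + PING_OVERHEAD <= 1500


print(ping_payload_for_mtu(1500))
print(largest_payload(simulated_probe, 0, ping_payload_for_mtu(9100)))
```

A binary search keeps the number of pings small even when the interface MTU is a jumbo value and the path MTU is much lower.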
(Figure: BGP route-processing diagram with the labels Inbound Policy, RIB, and Outbound Policy.)
Let’s now understand the various fundamentals of route advertisement in the sections that
follow. For this section, examine the topology shown in Figure 11-6.
BGP Route Processing and Route Propagation 631
(Figure 11-6, topology; labels recovered from the figure: NX-2 with Lo0 192.168.2.2/32;
NX-5 with Lo0 192.168.5.5/32; AS 65000 running OSPF as the IGP with IBGP; 10.25.1.0/24;
EBGP; AS 65001.)
■ Aggregate Route: Summarizing a route, though the component route must exist in
the BGP table.
Network Statement
A BGP prefix is advertised via BGP using a network statement. For the network statement
to function properly, the route must be present in the routing table. If the route is not
present in the routing table, the network statement neither installs the route in the BGP
table nor advertises it to the BGP peers. Example 11-18 illustrates the use of network
statements to advertise two prefixes. One prefix belongs to a loopback configured
locally on the router, and the other prefix has no route present in the routing table.
The output of the command show bgp ipv4 unicast neighbors ip-address advertised-routes
makes it clear that the prefix 192.168.4.4/32 is advertised to the BGP peer 192.168.1.1
but the prefix 192.168.44.44/32 is not. When looking at the BGP table for any
address-family, it is important to verify the status flags, which indicate how the
prefix was learned on the router. These status flags and their meanings are listed
before the prefixes in the BGP table. In Example 11-18, the prefix is a local prefix and
thus has the status flag L along with the flag *>, which indicates that the route is
selected as the best route.
NX-4
router bgp 65000
  router-id 192.168.4.4
  log-neighbor-changes
  address-family ipv4 unicast
    network 192.168.4.4/32
    network 192.168.44.44/32
  neighbor 192.168.1.1
    remote-as 65000
    update-source loopback0
    address-family ipv4 unicast
      next-hop-self
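The network statement rule (a prefix is advertised only when an exactly matching route exists in the routing table) can be modeled in a few lines. This is a conceptual sketch, not how NX-OS implements it; the function name and the second routing-table entry are hypothetical.

```python
def advertised_by_network_statements(network_statements, routing_table):
    """Return the configured prefixes that have an exact match in the RIB."""
    return [prefix for prefix in network_statements if prefix in routing_table]


# Prefixes configured with network statements on NX-4, as in the configuration.
statements = ["192.168.4.4/32", "192.168.44.44/32"]

# Simplified routing table: the local loopback plus one hypothetical route.
rib = {"192.168.4.4/32", "192.168.1.1/32"}

print(advertised_by_network_statements(statements, rib))
```

Only 192.168.4.4/32 survives the check, which matches the behavior described for the advertised-routes output.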
Redistribution
Redistributing routes into BGP is a common method of populating the BGP table.
Examine the same topology shown in Figure 11-6. On router NX-1, OSPF is being
redistributed into BGP. The route-map used for the redistribution permits the prefixes
192.168.4.4/32 and 192.168.44.44/32, although the routing table learns only
192.168.4.4/32 from NX-4. Example 11-19 demonstrates the redistribution process into
BGP. Notice in the output that the prefix 192.168.4.4/32 has an r flag, which indicates
a redistributed prefix. Also, the redistributed prefix shows a question mark (?), the
incomplete origin code, at the end of the AS path.
NX-1
router bgp 65000
  address-family ipv4 unicast
    redistribute ospf 100 route-map OSPF-BGP
!
ip prefix-list OSPF-BGP seq 5 permit 192.168.4.4/32
ip prefix-list OSPF-BGP seq 10 permit 192.168.44.44/32
!
route-map OSPF-BGP permit 10
  match ip address prefix-list OSPF-BGP
Note The redistribution process is the same for other routing protocols, static routes, and
directly connected links, as shown in Example 11-19.
There are a few caveats when performing redistribution for OSPF and IS-IS:
■ OSPF: When redistributing OSPF into BGP, the default behavior includes only routes
that are internal to OSPF. The redistribution of external OSPF routes requires a
conditional match on route-type under the route-map.
■ IS-IS: When IS-IS is redistributed into another routing protocol, directly connected
subnets are not included. This behavior is overcome by redistributing the connected
networks into BGP.
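The OSPF caveat above can be modeled as two filters applied in sequence: the prefix-list referenced by the route-map, and the route types that redistribution admits. The sketch below is a simplified model with hypothetical names; the external route 172.16.0.0/16 is invented for illustration and is not part of the book's configuration.

```python
from dataclasses import dataclass

INTERNAL = frozenset({"intra-area", "inter-area"})


@dataclass(frozen=True)
class OspfRoute:
    prefix: str
    route_type: str  # "intra-area", "inter-area", or "external"


def redistribute_ospf(routes, permitted, route_types=INTERNAL):
    """Prefixes that pass both the prefix-list and the route-type check."""
    return [r.prefix for r in routes
            if r.prefix in permitted and r.route_type in route_types]


rib = [
    OspfRoute("192.168.4.4/32", "intra-area"),
    OspfRoute("172.16.0.0/16", "external"),   # hypothetical external route
]
permitted = {"192.168.4.4/32", "192.168.44.44/32", "172.16.0.0/16"}

print(redistribute_ospf(rib, permitted))                            # default: internal only
print(redistribute_ospf(rib, permitted, INTERNAL | {"external"}))   # with route-type match
```

The prefix 192.168.44.44/32 never appears in either result because it is permitted by the filter but absent from the routing table.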
Example 11-20 displays the various match route-type options available under the
route-map. The route-type options are available for both OSPF and IS-IS route types.
Route Aggregation
Not all devices in the network are powerful enough to hold all the routes learned via
BGP or other routing protocols. Also, having multiple paths in the network leads to
consumption of more CPU and memory resources. To overcome this challenge, route
aggregation or summarization can be performed. Route aggregation in BGP is performed
using the command aggregate-address aggregate-prefix/length [advertise-map | as-set
| attribute-map | summary-only | suppress-map]. Table 11-4 describes the options
available with the aggregate-address command.
Example 11-21 demonstrates the use of the summary-only keyword with the
aggregate-address command. Notice that NX-2 has three prefixes, but only a single
aggregate prefix is advertised to NX-5. When the summary-only keyword is configured on
NX-2, the more specific routes are suppressed.
NX-2
router bgp 65000
  address-family ipv4 unicast
    network 192.168.2.2/32
    aggregate-address 192.168.0.0/16 summary-only
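The effect of summary-only can be modeled with Python's standard ipaddress module. This is a conceptual sketch with a hypothetical function name and sample table: the aggregate is generated only if at least one component route exists, and the component routes are then suppressed from the advertisement.

```python
import ipaddress


def advertise_with_summary_only(bgp_table, aggregate):
    """Return the prefixes advertised when summary-only is configured."""
    agg = ipaddress.ip_network(aggregate)
    nets = [ipaddress.ip_network(p) for p in bgp_table]
    # Component routes are the more specific prefixes covered by the aggregate.
    components = [n for n in nets if n != agg and n.subnet_of(agg)]
    if not components:
        # Without a component route, the aggregate is not generated.
        return [str(n) for n in nets]
    others = [n for n in nets if n not in components and n != agg]
    return [str(agg)] + [str(n) for n in others]


# Hypothetical BGP table: two component routes and one unrelated prefix.
table = ["192.168.2.2/32", "192.168.1.1/32", "10.25.1.0/24"]
print(advertise_with_summary_only(table, "192.168.0.0/16"))
```

Prefixes outside the aggregate are unaffected, so only the more specific routes within 192.168.0.0/16 disappear from the advertisement.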
Default-Information Originate
Not every external route can be redistributed and advertised within the network. In
such instances, the gateway or edge device advertises a default route to other parts of
the network using a routing protocol. To advertise a default route using BGP, use the
command default-information originate under the BGP address-family configuration mode.
It is important to note that the command advertises the default route only if the
default route is present in the routing table. If there is no default route present,
create a default route pointing to the null0 interface.
BGP uses three tables for maintaining the network prefixes and path attributes (PAs) for
a route. The BGP tables are briefly explained as follows:
■ Adj-RIB-in: Contains the NLRIs in original form before inbound route policies are
processed. The table is purged after all route policies are processed to save memory.
■ Loc-RIB: Contains all the NLRIs that originated locally or were received from other
BGP peers. After NLRIs pass the validity and next-hop reachability check, the BGP
best path algorithm selects the best NLRI for a specific prefix. The Loc-RIB table
is the table used for presenting routes to the IP routing table.
■ Adj-RIB-out: Contains the NLRIs after outbound route policies have been processed.
Inside the BGP Loc-RIB table, all the routes and their path attributes are maintained with
the best path calculated. The best path is then installed in the RIB of the router. In the
event the best path is no longer available, the router can use the existing paths to quickly
identify a new best path. BGP recalculates the best path for a prefix upon four possible
events:
■ Redistribution change
The BGP best path selection algorithm influences how traffic enters or leaves an
autonomous system (AS). BGP does not use metrics to identify the best path in a network;
it uses path attributes instead. But even before the BGP path attributes come into play,
the router looks for the longest prefix match among the routes present in the RIB and
prefers that route for installation in the forwarding information base (FIB).
BGP path attributes are modified upon receipt or advertisement to influence routing
in the local AS or neighboring AS. A basic rule for traffic engineering with BGP is that
modifications in outbound routing policies influence inbound traffic, and modifications
to inbound routing policies influence outbound traffic.
BGP installs the first received path as the best path automatically. When additional paths
are received, the newer paths are compared against the current best path. If there is a tie,
processing continues onto the next step, until a best path winner is identified.
The following list provides the attributes that the BGP best path algorithm uses for
the best route selection process. These attributes are processed in the order listed in
Table 11-5.
The best path algorithm is used to manipulate network traffic patterns for a specific
route by modifying various path attributes on BGP routers. Changing BGP PAs influences
traffic flow into, out of, and around an autonomous system (AS). The BGP routing policy
varies from organization to organization based upon the manipulation of the BGP PAs.
Because some PAs are transitive and carry from one AS to another AS, those changes could
impact downstream routing for other SPs, too. Other PAs are nontransitive and influence
only the routing policy within the organization. Network prefixes are conditionally
matched on a variety of factors, such as AS-Path length, specific ASN, BGP communities,
or other attributes.
Examining the topology shown in Figure 11-6, NX-5 and NX-6 advertise their loopbacks
toward AS 65000. NX-1 receives the loopbacks via both NX-2 and NX-3, but only one of
the paths is chosen as the best. The command show bgp afi safi ip-address/length
displays both received paths and, for the path that was not chosen as the best path,
the reason, as shown in Example 11-22. In this example, the path for 192.168.5.5/32 is
initially chosen via NX-2 due to the lowest RID, but when an inbound policy that sets a
higher local preference is defined on NX-3, the path via NX-3 is chosen as the best.
Advertised path-id 1
Path type: internal, path is valid, is best path
AS-Path: 65001 , path sourced external to AS
192.168.2.2 (metric 41) from 192.168.2.2 (192.168.2.2)
Origin IGP, MED not set, localpref 100, weight 0
NX-3(config-router-neighbor-af)# route-map LP in
NX-3(config-router-neighbor-af)# end
Path type: internal, path is invalid, not best reason: Local Preference, is deleted, no labeled nexthop
AS-Path: 65001 , path sourced external to AS
192.168.2.2 (metric 41) from 192.168.2.2 (192.168.2.2)
Origin IGP, MED not set, localpref 100, weight 0
Advertised path-id 1
Path type: internal, path is valid, is best path
AS-Path: 65001 , path sourced external to AS
192.168.3.3 (metric 41) from 192.168.3.3 (192.168.3.3)
Origin IGP, MED not set, localpref 200, weight 0
Advertised path-id 1
Path type: internal, path is valid, is best path
AS-Path: 65001 , path sourced external to AS
192.168.3.3 (metric 41) from 192.168.3.3 (192.168.3.3)
Origin IGP, MED not set, localpref 200, weight 0
Note While a prefix is being removed from the BGP RIB (BRIB), the prefix is marked as
deleted and the path is never used for forwarding. After the update is complete, the BRIB
does not show the path/prefix that was removed.
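The comparison behind Example 11-22 can be sketched as a tuple comparison. This is a deliberately simplified model and not the full algorithm: it covers only weight, local preference, AS-path length, origin, MED, IGP metric, and router ID, it treats MED as always comparable, and it omits steps such as locally originated routes, EBGP versus IBGP, and path age. The Path class and its field names are hypothetical; the attribute values mirror the show output in the example.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Path:
    neighbor: str          # router ID of the advertising peer
    weight: int = 0
    local_pref: int = 100
    as_path_len: int = 1
    origin: int = 0        # 0 = IGP, 1 = EGP, 2 = incomplete
    med: int = 0
    igp_metric: int = 0


def preference_key(p: Path):
    """Smaller tuple wins; 'higher is better' attributes are negated."""
    return (-p.weight, -p.local_pref, p.as_path_len, p.origin, p.med,
            p.igp_metric, int(ipaddress.ip_address(p.neighbor)))


def best_path(paths):
    return min(paths, key=preference_key)


via_nx2 = Path("192.168.2.2", igp_metric=41)
via_nx3 = Path("192.168.3.3", igp_metric=41)
print(best_path([via_nx2, via_nx3]).neighbor)  # tie broken by lowest router ID

via_nx3.local_pref = 200  # effect of the inbound route-map LP
print(best_path([via_nx2, via_nx3]).neighbor)  # higher local preference wins
```

Because local preference is evaluated early in the tuple, raising it overrides the router ID tiebreak that decided the first comparison.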
BGP Multipath
BGP’s default behavior is to advertise only the best path to the RIB, which means that
only one path for a network prefix is used when forwarding network traffic to a
destination. BGP multipath allows multiple paths to be presented to the RIB, so that
those paths can forward traffic to a network prefix at the same time. BGP multipath is
an enhanced form of BGP multihoming.
Note It is vital to understand that the primary difference between BGP multihoming and
BGP multipath is how load balancing works. BGP multipath attempts to distribute the
load of the traffic dynamically. BGP multihoming is distributed somewhat by the nature of
the BGP best path algorithm, but manipulation to the inbound/outbound routing policies is
required to reach a more equally distributed load among the links.
BGP supports three types of equal cost multipath (ECMP): EBGP multipath, IBGP
multipath, and eiBGP multipath. In all three types of BGP multipath, the following BGP
path attributes (PAs) must match for a path to be eligible for multipath:
■ Weight
■ Local Preference
■ Origin
■ MED
■ Advertisement method must match (IBGP or EBGP); if the prefix is learned via an
IBGP advertisement, the IGP cost must match to be considered equal
Note NX-OS does not support the eiBGP multipath feature at the time of writing.
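The eligibility rules listed above can be expressed as a small predicate. This is an illustrative sketch only: paths are modeled as dictionaries with hypothetical keys, and only the attributes named in the list are compared.

```python
MULTIPATH_ATTRS = ("weight", "local_pref", "origin", "med", "learned_via")


def multipath_eligible(best: dict, candidate: dict) -> bool:
    """True when candidate matches best on every attribute in the list."""
    if any(best[attr] != candidate[attr] for attr in MULTIPATH_ATTRS):
        return False
    if best["learned_via"] == "ibgp":
        # For IBGP-learned prefixes, the IGP cost must match as well.
        return best["igp_metric"] == candidate["igp_metric"]
    return True


# Values mirror two internal paths with metric 41 and default attributes.
via_nx2 = {"weight": 0, "local_pref": 100, "origin": "IGP", "med": None,
           "learned_via": "ibgp", "igp_metric": 41}
via_nx3 = dict(via_nx2)  # identical attributes, learned from another peer

print(multipath_eligible(via_nx2, via_nx3))
```

Changing any one compared attribute on the candidate, such as the local preference or the IGP metric, makes the function return False.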
Examine the topology shown in Figure 11-6. In this topology, NX-1 learns the same
prefixes from both NX-2 and NX-3. Because there are IBGP peerings between NX-1, NX-2,
and NX-3, the paths learned on NX-1 are internal. To have multiple BGP paths installed
in the RIB and BRIB, IBGP multipath is configured on NX-1. Example 11-23 demonstrates
the IBGP multipath functionality.
Advertised path-id 1
Path type: internal, path is valid, is best path
AS-Path: 65001 , path sourced external to AS
192.168.2.2 (metric 41) from 192.168.2.2 (192.168.2.2)
Origin IGP, MED not set, localpref 100, weight 0
Advertised path-id 1
Path type: internal, path is valid, is best path
AS-Path: 65001 , path sourced external to AS
192.168.2.2 (metric 41) from 192.168.2.2 (192.168.2.2)
Origin IGP, MED not set, localpref 100, weight 0
Path type: internal, path is valid, not best reason: Router Id, multipath
AS-Path: 65001 , path sourced external to AS
192.168.3.3 (metric 41) from 192.168.3.3 (192.168.3.3)
Origin IGP, MED not set, localpref 100, weight 0
The BGP event-history logs are used to verify the second-best path being added to the
Unicast Routing Information Base (URIB). Use the command show bgp event-history
detail to view the details for both the best path and the second-best path of a prefix
being added to URIB, as shown in Example 11-24. In Example 11-24, first the best path
is selected, which is via 192.168.2.2, and then another path is added to the URIB, which is
learned via nexthop 192.168.3.3.
Example 11-25 Debugs for BGP Update and Route Installation in BRIB
22:40:31.707254 bgp: 65000 [10739] (default) UPD: Received UPDATE message from
192.168.4.4
22:40:31.707422 bgp: 65000 [10739] (default) UPD: 192.168.4.4 parsed UPDATE
message from peer, len 55 , withdraw len 0, attr len 32, nlri len 0
22:40:31.707499 bgp: 65000 [10739] (default) UPD: Attr code 1, length 1,
Origin: IGP
22:40:31.707544 bgp: 65000 [10739] (default) UPD: Attr code 5, length 4,
Local-pref: 100
22:40:31.707601 bgp: 65000 [10739] (default) UPD: Peer 192.168.4.4 nexthop
length in MP reach: 4
22:40:31.707672 bgp: 65000 [10739] (default) UPD: Recvd NEXTHOP 192.168.4.4
22:40:31.707716 bgp: 65000 [10739] (default) UPD: Attr code 14, length 14,
Mp-reach
22:40:31.707787 bgp: 65000 [10739] (default) UPD: [IPv4 Unicast] Received prefix
192.168.44.44/32 from peer 192.168.4.4, origin 0, next hop 192.168.4.4,
localpref 100, med 0
22:40:31.707859 bgp: 65000 [10739] (default) BRIB: [IPv4 Unicast] Installing
prefix 192.168.44.44/32 (192.168.4.4) via 192.168.4.4 into BRIB with extcomm
22:40:31.707915 bgp: 65000 [10739] (default) BRIB: [IPv4 Unicast] Created new
path to 192.168.44.44/32 via 0.0.0.0 (pflags=0x0)
22:40:31.707962 bgp: 65000 [10739] (default) BRIB: [IPv4 Unicast]
(192.168.44.44/32 (192.168.4.4)): bgp_brib_add: handling nexthop
22:40:31.708054 bgp: 65000 [10739] (default) BRIB: [IPv4 Unicast]
(192.168.44.44/32 (192.168.4.4)): returning from bgp_brib_add, new_path: 1,
change : 1, undelete: 0, history: 0, force: 0, (pflags=0x2010), reeval=0
22:40:31.708292 bgp: 65000 [10739] (default) BRIB: [IPv4 Unicast]
192.168.44.44/32, no Label AF
22:40:31.709891 bgp: 65000 [10739] (default) UPD: [IPv4 Unicast] Starting update
run for peer 192.168.3.3 (#65)
22:40:31.709917 bgp: 65000 [10739] (default) UPD: [IPv4 Unicast] consider
sending 192.168.44.44/32 to peer 192.168.3.3, path-id 1, best-ext is off
22:40:31.709948 bgp: 65000 [10739] (default) UPD: 192.168.3.3 Sending attr
code 1, length 1, Origin: IGP
22:40:31.709974 bgp: 65000 [10739] (default) UPD: 192.168.3.3 Sending attr
code 5, length 4, Local-pref: 100
22:40:31.709998 bgp: 65000 [10739] (default) UPD: 192.168.3.3 Sending attr
code 9, length 4, Originator: 192.168.4.4
22:40:31.710149 bgp: 65000 [10739] (default) UPD: 192.168.3.3 Sending attr
code 10, length 4, Cluster-list
22:40:31.710180 bgp: 65000 [10739] (default) UPD: 192.168.3.3 Sending attr
code 14, length 14, Mp-reach
22:40:31.710204 bgp: 65000 [10739] (default) UPD: 192.168.3.3 Sending nexthop
address 192.168.4.4 length 4
22:40:31.710231 bgp: 65000 [10739] (default) UPD: [IPv4 Unicast] 192.168.3.3
Created UPD msg (len 69) with prefix 192.168.44.44/32 ( Installed in HW)
path-id 1 for peer
22:40:31.710261 bgp: 65000 [10739] (default) UPD: [IPv4 Unicast] 192.168.3.3:
walked 0 nodes and packed 0/0 prefixes
22:40:31.710286 bgp: 65000 [10739] (default) UPD: [IPv4 Unicast] (#66)
Finished update run for peer 192.168.3.3 (#66)
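The timestamps in a debug capture like Example 11-25 can be compared programmatically to see how long an update took to be processed and re-advertised. The sketch below uses a hypothetical helper name and assumes that each debug entry begins with an HH:MM:SS.ffffff timestamp, as in the output above.

```python
from datetime import datetime, timedelta


def debug_time(line: str) -> datetime:
    """Parse the leading HH:MM:SS.ffffff timestamp of a debug line."""
    return datetime.strptime(line.split()[0], "%H:%M:%S.%f")


def elapsed_between(first_line: str, last_line: str) -> timedelta:
    """Time elapsed between two debug entries from the same capture."""
    return debug_time(last_line) - debug_time(first_line)


# First and last entries of the debug output, shortened to one line each.
received = "22:40:31.707254 bgp: 65000 [10739] (default) UPD: Received UPDATE message from"
finished = "22:40:31.710286 bgp: 65000 [10739] (default) UPD: [IPv4 Unicast] (#66)"

elapsed = elapsed_between(received, finished)
print(elapsed / timedelta(milliseconds=1), "ms")
```

The same approach applies to any pair of entries, for example the BRIB installation line and the start of the update run for a peer.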
On NX-OS, debugs are not necessarily required to understand the update generation
process. Use the command show bgp event-history detail to view the detailed event
logs. The detail option is not enabled by default and must be configured under the
router bgp configuration using the command event-history detail [size
large | medium | small]. Example 11-26 displays the detailed output of the BGP
event-history logs showing the same update process. In this example, the update is
being generated for NX-3. If the event-history logs have rolled over and the issue
keeps recurring, debugs can be enabled, as demonstrated in Example 11-25.
BGP Convergence
BGP convergence depends on various factors. BGP convergence is all about the speed of
the following:
■ Locally generating all the BGP paths (via network statement or redistribution of
static, connected, or IGP routes) and/or receiving them from other components for
other address-families (for example, Multicast VPN (MVPN) from multicast, L2VPN from
the l2vpn manager, and so on).
■ Sending and receiving multiple BGP tables; that is, different BGP address-families
to and from each peer.
■ Upon receiving all the paths from peers, performing the best path calculation to find
the best path and/or multipath, additional path, and backup path.
■ Installing the best path into multiple routing tables, such as the default or VRF
routing table.
■ Performing the import and export mechanism.
■ For other address-families, such as L2VPN or multicast, passing the path calculation
result to different lower-layer components.
BGP uses a lot of CPU cycles when processing BGP updates and requires memory for
maintaining BGP peers and routes in BGP tables. Based on the role of the BGP router in
the network, appropriate hardware should be chosen. The more memory a router has,
the more routes it can support, much like how a router with a faster CPU supports a
larger number of peers.
Note Because BGP updates rely on TCP, optimizing router resources, such as memory, and
TCP session parameters, such as maximum segment size (MSS), path MTU discovery,
interface input queues, and TCP window size, helps improve convergence.
There are various steps that should be followed to verify whether the BGP has converged
and the routes are installed in the BRIB.
If there is traffic loss before BGP has completed its convergence for a given
address-family, verify the routing information in the URIB and the forwarding
information in the FIB. Example 11-27 demonstrates a BGP route getting refreshed. The
command show bgp event-history [event | detail] is used to validate that the prefix is
installed in the BRIB table, and the command show routing event-history [add-route |
modify-route | delete-route] is used to check that the route has been installed in the
URIB. In the URIB, verify the timestamp of when the route was downloaded to the URIB.
If the prefix was recently downloaded to the URIB, there might have been an event that
caused the route to be refreshed. Also, the difference between the time when the prefix
was installed in the BRIB and the time when it was downloaded to the URIB helps in
understanding the convergence time.
BGP convergence for the relevant address-family is checked using the command show bgp
convergence detail vrf all. Example 11-28 shows the output of the show bgp convergence
detail vrf all command. This command shows when the best-path selection process
started and the time taken to complete it. The command also displays the time taken to
converge the prefixes to the URIB, which helps in understanding how the device is
performing from a BGP and URIB convergence perspective.
IPv4 Unicast:
First bestpath signalled 0.068443 after start
First bestpath completed 0.069397 after start
Convergence to URIB sent 0.082041 after start
IPv6 Unicast:
First bestpath signalled 0.068467 after start
First bestpath completed 0.069574 after start
Note If the BGP best-path has not run yet, the problem is likely not related to BGP on
that node.
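The convergence timings in output such as Example 11-28 can be extracted and compared with a short script. This is a sketch under the assumption that the lines follow the "<event> <seconds> after start" pattern shown above; the function name is hypothetical, and real output contains additional fields that are ignored here.

```python
import re
from decimal import Decimal

LINE = re.compile(r"^\s*(?P<event>.+?)\s+(?P<secs>\d+\.\d+) after start\s*$")


def parse_convergence(text: str) -> dict:
    """Map each address-family heading to its {event: seconds} timings."""
    result, family = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.endswith(":"):
            family = stripped[:-1]
            result[family] = {}
            continue
        match = LINE.match(line)
        if match and family is not None:
            result[family][match.group("event")] = Decimal(match.group("secs"))
    return result


sample = """\
IPv4 Unicast:
First bestpath signalled 0.068443 after start
First bestpath completed 0.069397 after start
Convergence to URIB sent 0.082041 after start
"""

ipv4 = parse_convergence(sample)["IPv4 Unicast"]
# Time between best-path completion and convergence being sent to the URIB.
print(ipv4["Convergence to URIB sent"] - ipv4["First bestpath completed"])
```

Decimal is used instead of float so that the subtraction reproduces the six-digit precision of the command output exactly.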
If the best-path runs before the End-of-RIB (EOR) marker is received, or if a peer fails
to send the EOR marker, it can lead to traffic loss. In such situations, enable debug
for BGP updates with the relevant debug filters for VRF, address-family, and peer, as
shown in Example 11-29.
Example 11-29 Debug Commands with Filter
From the debug output, check the timestamps in the event log to see when the most
recent EOR was sent to the peer. The output also shows how many routes were advertised
to the peer before the EOR was sent. A premature EOR sent to the peer can also
lead to traffic loss if the peer flushes stale routes early.
If the route has not been downloaded to the URIB, it needs to be investigated further
because it may not be a problem with BGP. The following commands can be run to check
the activity in the URIB that could explain the loss:
Scaling BGP
BGP is one of the most feature-rich protocols ever developed, and it provides ease of
routing and control using policies. Although BGP has many built-in features that allow
the protocol to scale very well, these enhancements are often not utilized properly.
This poses various challenges when BGP is deployed in a scaled environment.
BGP is considered a heavy protocol because it can consume a large share of the CPU and
memory resources on a router. Many factors explain why it keeps utilizing more and more
resources. The three major factors for BGP memory consumption are as follows:
■ Prefixes
■ Paths
■ Attributes
BGP can hold many prefixes, and each prefix consumes some amount of memory. But
when the same prefix is learned via multiple paths, that information is also maintained
in the BGP table, and each path consumes additional memory. Because BGP was designed to
give each AS control over the flow of traffic through various attributes, each prefix
can have various attributes per path. This growth can be expressed as a mathematical
function, where N represents the number of prefixes, M represents the number of paths
for a given prefix, and L represents the number of attributes attached to a given path:
■ Prefixes: O(N)
■ Paths: O(M × N)
■ Attributes: O(L × M × N)
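The growth terms above can be turned into a quick upper-bound estimate. The function name and the sample values of N, M, and L are hypothetical, and the result counts table entries rather than bytes; in practice, identical attribute sets are typically shared between paths, so the real attribute count is usually lower.

```python
def bgp_table_entries(prefixes: int, paths_per_prefix: int, attrs_per_path: int) -> dict:
    """Upper-bound entry counts following the N, M x N, and L x M x N terms."""
    return {
        "prefixes": prefixes,
        "paths": paths_per_prefix * prefixes,
        "attributes": attrs_per_path * paths_per_prefix * prefixes,
    }


# Hypothetical scale: 100,000 prefixes, 2 paths each, 5 attributes per path.
print(bgp_table_entries(prefixes=100_000, paths_per_prefix=2, attrs_per_path=5))
```

Doubling the number of paths per prefix doubles both the path and attribute terms, which is why reducing redundant peerings has such a visible effect on memory.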
Prefixes
BGP memory consumption becomes huge when BGP is holding a large number of pre-
fixes or holding the Internet routing table. In most cases, not all the BGP prefixes are
required to be maintained by all the routers running BGP in the network. To reduce the
number of prefixes, take the following actions:
■ Aggregation
■ Filtering
With the use of aggregation, multiple specific routes can be aggregated into one route.
But aggregation is challenging when attempted on a fully deployed, running network.
After the network is up and running, the complete IP addressing scheme has to be
reviewed to perform aggregation. Aggregation is a good option for greenfield
deployments, which give more control over the IP addressing scheme and thus make it
easier to apply aggregation.
Filtering provides control over the number of prefixes maintained in the BGP table or
advertised to BGP peers. BGP provides filtering based on prefix, BGP attributes, and
communities. One important point to remember is that complex route filtering, or route
filtering applied to a large number of prefixes, helps reduce memory consumption, but
the router takes a hit on the CPU.
Many deployments do not require all the BGP speakers to maintain a full BGP routing
table. Especially in enterprise and data center deployments, there is no real need to
hold the full Internet routing table. The BGP speakers can instead maintain a partial
routing table containing the most relevant and required prefixes, or just a default
route toward the Internet gateway. Such designs greatly reduce the resources being used
throughout the network and increase scalability.
Paths
Sometimes the BGP table carries fewer prefixes but still consumes more memory because
of multiple paths. A prefix can be learned via multiple paths, but only the best or
multiple best paths are installed in the routing table. To reduce the memory consumption
by BGP due to multiple paths, the following solutions should be adopted:
Multiple BGP paths are a direct effect of multiple BGP peerings. Especially in an
IBGP full-mesh environment, the number of BGP sessions grows quadratically with the
number of routers, and the number of paths grows with it. Many customers increase the
number of IBGP neighbors to have more redundant paths, but two paths are sufficient to
maintain redundancy. Increasing the number of peerings can cause scaling issues both
from the perspective of the number of sessions and from the perspective of BGP memory
utilization.
It is a well-known fact that IBGP needs to be in full mesh. Figure 11-7 illustrates an IBGP
full-mesh topology. In an IBGP full-mesh deployment of n nodes, there are a total of
n*(n−1)/2 IBGP sessions and (n−1) sessions per BGP speaker.
[Figure 11-7: IBGP full-mesh topology with routers R1 through Rn]
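The session counts for an IBGP full mesh follow directly from the n*(n−1)/2 formula. The short Python sketch below (the function name is hypothetical) shows how quickly the total grows as routers are added.

```python
def ibgp_full_mesh(n):
    """Return (total IBGP sessions, sessions per speaker) for n routers."""
    total_sessions = n * (n - 1) // 2
    sessions_per_speaker = n - 1
    return total_sessions, sessions_per_speaker


for routers in (4, 10, 100):
    total, per_speaker = ibgp_full_mesh(routers)
    print(f"{routers} routers: {total} sessions, {per_speaker} per speaker")
```

Going from 10 to 100 routers multiplies the router count by 10 but the session count by 110, which is the scaling problem that confederations and route reflectors address.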
This affects the scalability not only of an individual node or router but of the whole
network. To increase the scalability of an IBGP network, two design approaches can be used:
■ Confederations
■ Route Reflectors
Note BGP Confederations and Route Reflectors are discussed in another section later in
this chapter.
Attributes
A BGP route is a “bag” of attributes. Every BGP prefix has certain default or mandatory
attributes that are assigned automatically, such as next-hop or AS-PATH, and can also
carry attributes that are configured manually, such as Multi-Exit Discriminator (MED).
Each attribute attached to the prefix adds to memory utilization. Along with attributes,
communities (both standard and extended) add to memory consumption. To reduce the BGP
memory consumption due to various attributes and communities, the following solutions
can be adopted:
On NX-OS, use the command show bgp private attr detail to view the various attributes
attached to the BGP prefixes. Example 11-30 displays the various global BGP attributes
on NX-1. These attributes were learned across various prefixes, including the community
attached to the prefix learned from NX-4.
localpref : 100
weight : 0
Extcommunity presence mask: (nil)
There is no method to get rid of the default BGP attributes, but the use of other
attributes can be controlled. Attributes that add complexity without a clear benefit
are best avoided. For example, using MED and various MED-related commands, such as the
command bgp always-compare-med or bgp deterministic-med, can have an adverse impact on
the network and can lead to route instability or routing loop conditions. At the same
time, user-assigned attributes consume additional BGP memory, which can easily be avoided.
A peer-policy defines the address-family dependent policy aspects for a peer, including
inbound and outbound policy, filter-list and prefix-lists, soft-reconfiguration, and so on.
A peer-session template defines session attributes, such as transport details and session
timers. Both the peer-policy and peer-session templates are inheritable; that is, a peer-
policy or peer-session can inherit attributes from another peer-policy or peer-session,
respectively. A peer template pulls the peer-session and peer-policy sections together to
allow cookie-cutter neighbor definitions. Example 11-31 illustrates the configuration of
BGP templates on NX-1.
■ Hard Reset: Dropping and reestablishing BGP session. Performed by the command
clear bgp afi safi [* | ip-address].
■ Soft Reset: A soft reset uses filtered prefixes stored in memory to reconfigure
and activate BGP routing tables without tearing down the BGP session. Performed
using the command clear bgp afi safi [* | ip-address] soft [in | out].
To manually perform a soft reset, use the command clear bgp ipv4 unicast
[* | ip-address] soft [in | out]. The soft-reconfiguration feature is useful when the
operator wants to know which prefixes have been sent to a router prior to the
application of any inbound policy.
0       7       15      23      31
+-------+-------+-------+-------+
|      AFI      | Res.  | SAFI  |
+-------+-------+-------+-------+
The AFI and SAFI in the ROUTE-REFRESH message point to the address-family for which
the configured peer negotiated the route refresh capability. The Reserved bits are
unused and are set to 0 by the sender and ignored by the receiver.
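The field layout described above (a 16-bit AFI, an 8-bit Reserved field, and an 8-bit SAFI) can be sketched with Python's struct module. The function names are hypothetical, and the sketch covers only this 4-byte body, not the BGP message header that precedes it on the wire.

```python
import struct

# Network byte order: 16-bit AFI, 8-bit Reserved, 8-bit SAFI.
ROUTE_REFRESH_BODY = struct.Struct("!HBB")


def encode_route_refresh(afi, safi):
    """Build the 4-byte ROUTE-REFRESH body; Reserved is always sent as 0."""
    return ROUTE_REFRESH_BODY.pack(afi, 0, safi)


def decode_route_refresh(body):
    """Return (afi, safi); the Reserved field is ignored on receipt."""
    afi, _reserved, safi = ROUTE_REFRESH_BODY.unpack(body)
    return afi, safi


body = encode_route_refresh(afi=1, safi=1)   # IPv4 unicast
print(body.hex(), decode_route_refresh(body))
```

The decoder deliberately discards the Reserved byte instead of validating it, mirroring the rule that the receiver ignores those bits.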
A BGP speaker sends a ROUTE-REFRESH message only if it has negotiated the route
refresh capability with its peer. This implies that all the participating routers should
support the route refresh capability. The router sends a route refresh request
(REFRESH_REQ) to the peer. After the peer receives the route refresh request, it
readvertises to the requester the Adj-RIB-Out of the AFI and SAFI carried in the message.
If the peer has an outbound route filtering policy, the updates are filtered
accordingly, and the requesting router receives the filtered routes.
The clear ip bgp ip-address in or clear bgp afi safi ip-address in command tells the
peer to resend the full BGP announcement by sending a route-refresh request. The
clear bgp afi safi ip-address out command, by contrast, resends the full BGP announcement
to the peer and does not initiate a route refresh request. The route refresh capability
is verified using the show bgp afi safi neighbor ip-address command. Example 11-32 displays
the route refresh capability negotiated between the two BGP peers.
NX-1
NX-1# show bgp ipv4 unicast neighbors 192.168.2.2
BGP neighbor is 192.168.2.2, remote AS 65000, ibgp link, Peer index 1
Inherits peer configuration from peer-template IBGP-RRC
BGP version 4, remote router ID 192.168.2.2
BGP state = Established, up for 01:10:46
Using loopback0 as update source for this peer
Last read 00:00:43, hold time = 180, keepalive interval is 60 seconds
Last written 00:00:31, keepalive timer expiry due 00:00:28
Received 77 messages, 0 notifications, 0 bytes in queue
Sent 80 messages, 0 notifications, 0 bytes in queue
Connections established 1, dropped 0
Last reset by us never, due to No error
Last reset by peer never, due to No error
Neighbor capabilities:
Dynamic capability: advertised (mp, refresh, gr) received (mp, refresh, gr)
Dynamic capability (old): advertised received
Route refresh capability (new): advertised received
Route refresh capability (old): advertised received
4-Byte AS capability: advertised received
Address family IPv4 Unicast: advertised received
Graceful Restart capability: advertised received
! Output omitted for brevity
Note When the soft-reconfiguration feature is configured, BGP route refresh capability
is not used, even though the capability is negotiated. The soft-reconfiguration
configuration controls whether a route refresh is processed or initiated.
The BGP refresh request (REFRESH_REQ) is sent in one of the following cases:
■ Adding a route-target import to a VRF in MPLS VPN (for AFI/SAFI value 1/128
or 2/128).
RFC 1966 introduces the concept that an iBGP peering can be configured so that it
reflects routes to another iBGP peer. The router reflecting routes is known as a route
reflector (RR), and the router receiving reflected routes is a route reflector client. The RR
design turns an IBGP mesh into a hub-and-spoke design where the RR is the hub router.
The RR clients are either regular IBGP peers—that is, they are not directly connected
to each other—or the other design could have RR clients that are interconnected. Three
basic rules involve route reflectors and route reflection:
■ Rule #1: If an RR receives an NLRI from a non-RR client, the RR advertises the
NLRI to a RR client. It will not advertise the NLRI to a non-RR client.
■ Rule #3: If an RR receives a route from an EBGP peer, it advertises the route to
RR client(s) and non-RR client(s).
Only route-reflectors are aware of this change in behavior because no additional BGP
configuration is performed on route-reflector clients. BGP route reflection is specific
to each address-family. The command route-reflector-client is used on NX-OS devices
under the neighbor address-family configuration.
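The reflection rules can be summarized as a small lookup, sketched here in Python. The names are hypothetical. Rule #1 and Rule #3 follow the text above; the entry for routes learned from an RR client is not listed in this passage and is taken from the standard route reflection behavior in RFC 4456, so treat it as an assumption of this sketch. Only IBGP targets are modeled.

```python
# Which IBGP peer types an RR advertises a route to, keyed by where the
# route was learned. Advertisement to EBGP peers is not modeled here.
REFLECTION_RULES = {
    "non_client": {"client"},              # Rule #1
    "client": {"client", "non_client"},    # assumption: per RFC 4456
    "ebgp": {"client", "non_client"},      # Rule #3
}


def reflect_targets(learned_from):
    """Return the set of IBGP peer types that receive the route."""
    return REFLECTION_RULES[learned_from]


for source in ("non_client", "client", "ebgp"):
    print(source, "->", sorted(reflect_targets(source)))
```

The only case in which a peer type is skipped is a route learned from a non-client IBGP peer, which is never passed to another non-client; the regular IBGP full-mesh rule still applies between non-clients.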
Examine the two RR design scenarios shown in Figure 11-9. The topology in (a) has R1
acting as the RR, whereas R2, R3, and R4 are the RR clients. The topology shown in
(b) has a similar setup to that of (a) with a difference that the RR clients are fully meshed
with each other.
[Figure 11-9: Route reflector topologies (a) and (b), each with NX-1 as the route reflector]
The RR and the client peers form a cluster and are not required to be fully meshed.
Because the topology in (b) has an RR along with fully meshed IBGP client peers, which
defeats the purpose of having an RR, the client-to-client reflection behavior should be
disabled. BGP RR client-to-client reflection is disabled using the command
no bgp client-to-client reflection. This command is required only on the RR and not on
the RR clients. Example 11-33 displays the configuration for disabling BGP
client-to-client reflection.
ORIGINATOR_ID
This optional nontransitive BGP attribute is created by the first route-reflector, which
sets its value to the RID of the router that injected/advertised the route into the AS.
If the ORIGINATOR_ID is already populated on an NLRI, it should not be overwritten.
If a router receives an NLRI with its own RID in the Originator attribute, the NLRI is
discarded.
CLUSTER_LIST
This nontransitive BGP attribute is updated by the route-reflector. This attribute is
appended (not overwritten) by the route-reflector with its cluster-id. By default, this is the
BGP identifier. The cluster-id is set with the BGP configuration command cluster-id.
If a route reflector receives an NLRI with its cluster-id in the Cluster List attribute, the
NLRI is discarded.
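The two loop-prevention checks described for ORIGINATOR_ID and CLUSTER_LIST can be sketched as a single acceptance test. The function and parameter names are hypothetical; the addresses are the ones used for NX-1 and NX-2 in this chapter's examples.

```python
def accept_reflected_route(local_rid, local_cluster_id, originator_id,
                           cluster_list, is_route_reflector):
    """Return False when either route-reflection loop check fails."""
    # ORIGINATOR_ID check: a router discards a route it originated itself.
    if originator_id == local_rid:
        return False
    # CLUSTER_LIST check: an RR discards a route that already carries its
    # own cluster-id, because the route has passed through this cluster.
    if is_route_reflector and local_cluster_id in cluster_list:
        return False
    return True


# An RR client with RID 192.168.3.3 accepts a route originated by
# 192.168.2.2 and reflected by the RR with cluster-id 192.168.1.1.
print(accept_reflected_route("192.168.3.3", None, "192.168.2.2",
                             ["192.168.1.1"], is_route_reflector=False))
```

Because the cluster-id defaults to the BGP identifier, two RRs with default settings never reject each other's reflected routes on the CLUSTER_LIST check; they do so only when they are configured with the same cluster-id.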
Example 11-34 provides a sample prefix output from a route that was reflected by the
route reflector NX-1, as shown in Figure 11-9. Notice that the originator ID is the adver-
tising router and that the cluster list contains the route-reflector ID. The cluster list con-
tains the route-reflectors that the prefix traversed in the order of the last route-reflector
that advertised the route.
Advertised path-id 1
Path type: internal, path is valid, is best path
AS-Path: 65001 , path sourced external to AS
192.168.2.2 (metric 81) from 192.168.1.1 (192.168.1.1)
Origin IGP, MED not set, localpref 100, weight 0
Originator: 192.168.2.2 Cluster list: 192.168.1.1
If a topology contains more than one RR and the RRs are configured with different
cluster IDs, the second RR holds the path from the first RR and hence consumes more
memory and CPU resources. Both the single cluster-id design and the multiple cluster-id
design have their own disadvantages.
If the RR clients are fully meshed within the cluster, the no bgp client-to-client
reflection command can be configured on the RR.
Maximum Prefixes
By default, a BGP peer holds all the routes advertised by the peering router. The number
of routes can be reduced by filtering either inbound on the local router or outbound on
the peering router. But there can still be instances where the number of routes is more
than what a router needs or can handle.
NX-OS supports the BGP maximum-prefix feature that allows you to limit the number
of prefixes on a per-peer basis. Generally, this feature is enabled for EBGP sessions, but it
is also used for IBGP sessions. Although this feature helps with scaling and protects the
network from an excess number of routes, it is very important to understand when to use
it. Before enabling the BGP maximum-prefix feature, answer the following questions:
■ How many BGP routes are anticipated from the peer?
■ What action needs to be taken if the number of routes is exceeded? Should the
BGP connection be reset or should a warning message be logged?
To limit the number of prefixes, use the command maximum-prefix maximum [threshold]
[restart restart-interval | warning-only] for each neighbor. Table 11-6 describes
each of the fields in the command.
An important point to remember is that when the restart option is configured with the
maximum-prefix command, the only way to reestablish the BGP connection, apart from
waiting for the restart-interval timer to expire, is to perform a manual reset of the
peer using the clear bgp afi safi ip-address command.
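The decision that the maximum-prefix options describe can be sketched as follows. This is an illustration under stated assumptions, not NX-OS internals: the function name, the returned strings, the 75 percent default threshold, and the exact comparison at the threshold boundary are all assumptions of this sketch.

```python
def evaluate_maximum_prefix(received, maximum, threshold_pct=75,
                            warning_only=False):
    """Return the action for a peer that has sent `received` prefixes."""
    if received > maximum:
        # Over the limit: log only, or tear the session down.
        return "log-warning" if warning_only else "tear-down-session"
    if received * 100 > maximum * threshold_pct:
        # Under the limit but past the warning threshold.
        return "log-warning"
    return "ok"


# Hypothetical peer with a limit of 10 prefixes.
for count in (7, 8, 11):
    print(count, evaluate_maximum_prefix(count, maximum=10))
```

With the restart option, the tear-down case would additionally start the restart-interval timer before the session is allowed to come back up; that timer is not modeled here.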
Example 11-35 illustrates the use of the maximum-prefix command. NX-2 is receiving
more than 10 prefixes from neighbor 10.25.1.5, but the device has set the maximum-prefix
limit to 10 prefixes. In such an instance, the BGP peering is shut on the device where
maximum-prefix is set, but the remote end peer remains in Idle state. While
troubleshooting BGP peering issues, check the output of the show bgp afi safi neighbors
ip-address command to verify the reason for the last reset.
BGP Max AS
Various attributes are, by default, assigned to every BGP prefix. The total length of
the attributes attached to a single prefix can grow up to 64 KB, which can cause
scaling as well as convergence issues for BGP.
Often, the as-path prepend option is used to lengthen the AS-PATH list so that another
path with a shorter AS-PATH list is preferred. This operation by itself does not have
much of an impact. But from the perspective of the Internet, a longer AS-PATH list can
not only cause convergence issues but can also open security loopholes. The AS-PATH
list actually signifies a router’s position on the Internet.
To limit the AS-PATH length accepted in the network, the maxas-limit command was
introduced. With the command maxas-limit 1-512 configured in NX-OS, any route with an
AS-PATH length higher than the specified number is discarded.
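The maxas-limit check amounts to a length comparison on the AS-PATH. The Python sketch below uses a hypothetical function name and represents the AS-PATH as a plain list of AS numbers; how NX-OS counts AS_SET segments is not modeled.

```python
def passes_maxas_limit(as_path, maxas_limit):
    """Return True if the route is kept, False if it would be discarded."""
    if not 1 <= maxas_limit <= 512:
        raise ValueError("maxas-limit must be in the range 1-512")
    return len(as_path) <= maxas_limit


normal_path = [65001, 100, 228, 274]
prepended_path = [65001] + [100] * 60      # heavily prepended path
print(passes_maxas_limit(normal_path, 50))
print(passes_maxas_limit(prepended_path, 50))
```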
■ Prefix-lists
■ Filter-lists
■ Route-maps
BGP route-maps provide more dynamic capability than prefix-lists and filter-lists,
because they not only perform route filtering but also allow network operators to
define policies and set attributes that can be further used to control traffic flow
within the network. All these route filtering and route policy methods are discussed
in the sections that follow.
Example 11-36 displays the BGP table of Nexus switch NX-2 in the topology shown in
Figure 11-9. The NX-2 switch is used as the base to demonstrate all the filtering tech-
niques shown further in this chapter.
Prefix-List-Based Filtering
As explained in Chapter 10, “Troubleshooting Nexus Route-Maps,” prefix lists provide
another method of identifying networks in a routing protocol. They identify a specific IP
address, network, or network range, and allow for the selection of multiple networks with
a variety of prefix lengths (subnet masks) by using a prefix match specification.
The prefix-list can be applied directly to a BGP peer and also as a match statement
within the route-map. A prefix-list is configured using the command ip prefix-list
name [seq sequence-number] [permit ip-address/length | deny ip-address/length] [le
length | ge length | eq length]. Examine the same topology as shown in Figure 11-6.
Example 11-37 illustrates the configuration of BGP inbound and outbound route filtering
using prefix-lists on NX-2. The inbound prefix-list permits five networks, whereas the
outbound prefix-list permits host entries, that is, /32 prefixes within the subnet
192.168.0.0/16. When the prefix-lists are configured, use the command show bgp afi safi
neighbor ip-address to ensure that the prefix-lists have been attached to the neighbor.
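The matching logic of a prefix-list entry, including the ge, le, and eq options, can be sketched in Python with the standard ipaddress module. The names and the sample list are hypothetical (the actual configuration is in Example 11-37), and the sketch assumes the usual first-match-wins evaluation with an implicit deny at the end.

```python
import ipaddress


def entry_matches(route, prefix, ge=None, le=None, eq=None):
    """Check one prefix-list entry against a route such as '10.0.0.0/24'."""
    route_net = ipaddress.ip_network(route)
    entry_net = ipaddress.ip_network(prefix)
    if not route_net.subnet_of(entry_net):
        return False                      # network bits do not match
    if eq is not None:
        low = high = eq
    elif ge is None and le is None:
        low = high = entry_net.prefixlen  # exact match only
    else:
        low = ge if ge is not None else entry_net.prefixlen
        high = le if le is not None else route_net.max_prefixlen
    return low <= route_net.prefixlen <= high


def prefix_list_permits(route, entries):
    """First matching entry decides; no match means implicit deny."""
    for action, prefix, options in entries:
        if entry_matches(route, prefix, **options):
            return action == "permit"
    return False


# Hypothetical outbound list: only /32 host routes inside 192.168.0.0/16.
OUTBOUND = [("permit", "192.168.0.0/16", {"eq": 32})]
print(prefix_list_permits("192.168.2.2/32", OUTBOUND))
print(prefix_list_permits("192.168.44.0/24", OUTBOUND))
```

The second lookup fails on prefix length rather than on the network bits, which is the same reason a /24 inside 192.168.0.0/16 is not advertised by a list that permits only host routes.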
Example 11-38 displays the output of the BGP table after the prefix-lists have been
configured and attached to BGP neighbor 10.25.1.5. Notice that in this example, on the
NX-2 switch, only 5 prefixes are seen from neighbor 10.25.1.5. On NX-5, all the loopback
addresses of the nodes in AS 65000 are advertised apart from 192.168.44.0/24.
In the inbound direction on NX-2, use the command show bgp event-history detail to
view the details of the prefixes being matched against the prefix-list Inbound. Based on
the match, the prefixes are either permitted or denied. If no entry exists for the prefix in
the prefix-list, it is dropped by BGP and will not be part of the BGP table. Example 11-39
displays the event-history detail output, showing the prefix 100.1.30.0/24 being
rejected by the BGP prefix-list and the prefix 100.1.5.0/24 being permitted at
the same time.
For the outbound direction, the show bgp event-history detail command output dis-
plays the prefixes in the BGP table being permitted and denied based on the matching
entries in the outbound prefix-list named Outbound. After the filtering is performed, the
prefixes are then advertised to the BGP peer along with relevant attributes, as shown in
Example 11-40.
NX-OS also provides CLI to verify policy-based statistics for prefix-lists. The statistics
are available for the policy applied in both inbound and outbound directions and show
the number of prefixes permitted and denied in either direction. Use the command
show bgp afi safi policy statistics neighbor ip-address prefix-list [in | out] to view
the policy statistics for a prefix-list applied on a BGP neighbor. The counters of the
policy statistics command increment every time a BGP neighbor flaps or a soft clear is
performed on the neighbor. Example 11-41 demonstrates the use of the policy statistics
command for BGP peer 10.25.1.5 to understand how many prefixes are being permitted and
dropped in both inbound and outbound directions. In this example, a soft clear is
performed in the outbound direction, and the counters for the outbound prefix-list
policy statistics increment by 4 for permitted prefixes and by 1 for a dropped prefix.
NX-2# show bgp ipv4 unicast policy statistics neighbor 10.25.1.5 prefix-list in
Total count for neighbor rpm handles: 1
NX-2# show bgp ipv4 unicast policy statistics neighbor 10.25.1.5 prefix-list out
Total count for neighbor rpm handles: 1
After verifying the prefix-list and its clients, use the command show system internal rpm
event-history rsw to ensure the correct prefix-list has been bound to the BGP process.
An incorrect binding or a missing binding event-history log can indicate that the prefix-
list is not properly associated with the BGP process or the BGP neighbor.
Filter-Lists
BGP filter-lists allow for filtering of prefixes based on AS-Path lists. A BGP filter-list
can be applied in both inbound and outbound directions. A BGP filter-list is configured
using the command filter-list as-path-list-name [in | out] under the neighbor address-
family configuration mode. Example 11-43 illustrates a sample configuration of filter-
list on NX-2 switch in the topology referenced in Figure 11-6. In this example, an
inbound filter-list is configured to allow the prefixes that have AS 274 in the AS_PATH
list. The second output of the example shows that the filter-list is applied on the
inbound direction.
Example 11-44 displays the prefixes in the BGP table received from peer 10.25.1.5 after
being filtered by the filter-list. Notice that all the prefixes shown in the BGP table have
AS 274 in their AS_PATH list.
Note If a BGP peer is configured with the soft-reconfiguration inbound command, you
can also use the command show bgp afi safi neighbor ip-address received-routes to view
the received BGP prefixes.
The easiest way to verify which prefixes are being permitted and denied is to use the
show bgp event-history detail command output, but if the event-history detail com-
mand is not enabled under the router bgp configuration, you can enable debugs to verify
the updates. The debug bgp updates command can be used to verify both the inbound
and the outbound updates. Example 11-45 demonstrates the use of debug bgp updates
to verify which prefixes are being permitted and which are being denied. The action of
permit or deny is always based on the entries present in the AS-path list.
21:39:01.723538 bgp: 65000 [10743] (default) UPD: [IPv4 Unicast] 10.25.1.5 Inbound
as-path-list ALLOW_274, action permit
21:39:01.723592 bgp: 65000 [10743] (default) UPD: [IPv4 Unicast] Received prefix
100.1.21.0/24 from peer 10.25.1.5, origin 1, next hop 10.25.1.5, localpref 0, med 0
21:39:01.723687 bgp: 65000 [10743] (default) UPD: [IPv4 Unicast] Received prefix
100.1.22.0/24 from peer 10.25.1.5, origin 1, next hop 10.25.1.5, localpref 0, med 0
Similar to policy statistics for prefix-lists, statistics are also available for filter-list
entries. When executing the command show bgp afi safi policy statistics neighbor
ip-address filter-list [in | out], notice the relevant AS-path access list referenced as part
of the filter-list command and the number of matches for each entry. The output also
displays the number of prefixes accepted and rejected by the filter-list, as displayed in
Example 11-46.
NX-2# show bgp ipv4 unicast policy statistics neighbor 10.25.1.5 filter-list in
Total count for neighbor rpm handles: 1
Because the filter-list uses an AS-path access-list, RPM information can be verified for
the AS-path access-list using the command show system internal rpm as-path-access-list
as-path-acl-name. This command confirms whether the AS-path access-list is associated with
the BGP process. The command show system internal rpm event-history rsw is used to
validate whether the AS-path access-list is bound to the BGP process. Example 11-47 displays
both the command outputs.
Clients:
bgp-65000 (Route filtering/redistribution) ACN version: 0
! RPM Event-History
NX-2# show system internal rpm event-history rsw
BGP Route-Maps
BGP uses route-maps to provide route filtering capability and traffic engineering by set-
ting various attributes to the prefixes that help control the inbound and outbound traffic.
Route-maps typically use some form of conditional matching so that only certain pre-
fixes are blocked or accepted. At the simplest level, route-maps can filter networks similar
to an AS-Path filter/prefix-list, but also provide additional capability by adding or modi-
fying a network attribute. Route-maps are referenced to a specific route-advertisement
or BGP neighbor and require specifying the direction of the advertisement (inbound/
outbound). Route-maps are a critical component of BGP because they allow for a unique
routing policy on a neighbor-by-neighbor basis.
Example 11-49 shows the BGP table after inbound route-map filtering. Notice that the
prefixes 100.1.1.0/24 to 100.1.5.0/24 are set with the local preference of 200, whereas the
prefixes that match AS 274 in the AS-path list are set with the local preference of 300.
Because there is no route-map entry matching sequence 30, all the other prefixes are
denied by the inbound route-map filtering.
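The behavior described above (local preference 200 for 100.1.1.0/24 through 100.1.5.0/24, local preference 300 for paths containing AS 274, and everything else denied) can be modeled with a short Python sketch. The route-map structure, names, and sample routes are hypothetical stand-ins for the configuration in the preceding example; the point is the ordered evaluation and the implicit deny at the end.

```python
import ipaddress

# Hypothetical stand-in for the prefix-list matched by sequence 10.
LOCAL_NETS = [ipaddress.ip_network(f"100.1.{i}.0/24") for i in range(1, 6)]

ROUTE_MAP = [
    {"seq": 10, "action": "permit",
     "match": lambda r: ipaddress.ip_network(r["prefix"]) in LOCAL_NETS,
     "set": {"local_pref": 200}},
    {"seq": 20, "action": "permit",
     "match": lambda r: 274 in r["as_path"],
     "set": {"local_pref": 300}},
]


def apply_route_map(route, sequences):
    """Return the modified route, or None when the route is denied."""
    for entry in sorted(sequences, key=lambda e: e["seq"]):
        if entry["match"](route):
            if entry["action"] != "permit":
                return None
            return {**route, **entry.get("set", {})}
    return None  # no sequence matched: implicit deny


route = {"prefix": "100.1.21.0/24", "as_path": [65001, 100, 228, 274]}
print(apply_route_map(route, ROUTE_MAP))
```

Adding a final permit sequence with no match condition would turn the implicit deny into a permit-all, which is a common way to set attributes on some prefixes while still accepting the rest.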
The show bgp event-history detail command can be used again to verify which prefixes
are being permitted or denied based on the route-map policy. Based on the underlying
match statements, relevant set actions are taken (if any). Example 11-50 displays the event-
history detail output demonstrating prefixes being permitted and denied by route-map.
You can also validate the policy statistics for the route-map similar to prefix-list and filter-list.
The command show bgp ipv4 unicast policy statistics neighbor ip-address route-map [in |
out] displays the matching prefix-list or AS-path access-list or any other attributes under each
route-map sequence and its matching statistics, as shown in Example 11-51.
NX-2# show bgp ipv4 unicast policy statistics neighbor 10.25.1.5 route-map in
Total count for neighbor rpm handles: 1
Within route-maps, various conditional matching features are used, such as prefix-
lists, regular expressions (regex), AS-Path access-list, BGP communities, and
community-lists. When multiple filtering mechanisms are configured under the same
neighbor, the following order of preference is used for both inbound and outbound
filtering:
■ Inbound Filtering
■ Route-map
■ Filter-list
■ Prefix-list, distribute-list
■ Outbound Filtering
■ Filter-list
■ Route-map
■ Prefix-list, distribute-list
To parse through the large number of available ASNs (4,294,967,295), regular
expressions (regex) are used. Regular expressions are based upon query modifiers to
select the appropriate content. The BGP table is parsed with regex using the command
show bgp afi safi regexp “regex-pattern” on Nexus switches.
Note NX-OS devices require the regex-pattern to be placed within a pair of double-quotes “”.
Table 11-7 provides a brief list and description of the common regex query modifiers.
Note The .^$*+()[]? characters are special control characters that cannot be used without
using the backslash \ escape character. For example, to match on the * in the output use
the \* syntax.
The following section provides a variety of common tasks to help demonstrate each of
the regex modifiers. Example 11-52 provides a reference BGP table for displaying scenari-
os of each regex query modifier for querying the prefixes learned via Figure 11-10.
Note The AS-Path for the prefix 172.16.129.0/24 has the AS 300 twice nonconsecutively
for a specific purpose. This is not seen in real life, because it indicates a routing loop.
_ Underscore
Query Modifier Function: Matches a space
Scenario: Only display routes that passed through AS 100. The first assumption is that
the syntax show bgp ipv4 unicast regexp “100” as shown in Example 11-53 is ideal.
The regex query includes the following unwanted ASNs: 1100, 2100, 21003, and
10010.
Example 11-54 uses the underscore (_) to imply a space to the left of the 100 to remove
the unwanted ASNs. The regex query still includes the following unwanted ASN: 10010.
Example 11-55 provides the final query by using the underscore (_) before and after the
ASN (100) to finalize the query for the routes that pass through AS 100.
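The three queries in this scenario can be reproduced offline with Python's re module. This is a sketch under assumptions: Python has no underscore modifier, so the helper below (a hypothetical name) rewrites _ into an alternation that matches a space, a comma, a brace, a parenthesis, or the start or end of the string, which is how the underscore behaves in Cisco regex. The AS paths are hypothetical and are not the table from Example 11-52.

```python
import re

# Underscore equivalent: delimiter characters or either end of the string.
UNDERSCORE = r"(?:^|$|[ ,{}()])"


def aspath_matches(cisco_pattern, as_path):
    """Apply a Cisco-style AS-path regex to an AS-path string."""
    python_pattern = cisco_pattern.replace("_", UNDERSCORE)
    return re.search(python_pattern, as_path) is not None


AS_PATHS = ["100 200", "1100 300", "300 2100", "300 10010", "300 100 40"]

for pattern in ("100", "_100", "_100_"):
    hits = [p for p in AS_PATHS if aspath_matches(pattern, p)]
    print(f"{pattern!r:8} -> {hits}")
```

Only the last pattern isolates the paths that really contain AS 100, because both sides of the number are anchored to a delimiter. The simple string replacement does not handle an escaped underscore, which is acceptable for AS-path patterns made of digits and modifiers.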
^ Caret
Query Modifier Function: Indicates the start of the string.
Scenario: Only display routes that were advertised from AS 300. At first glance, the
command show bgp ipv4 unicast regexp “_300_” might be acceptable for use, but in
Example 11-56 the route 192.168.129.0/24 is also included.
Because AS 300 is directly connected, it is more efficient to ensure that AS 300 was the
first AS listed. Example 11-57 shows the caret (^) in the regex pattern.
$ Dollar Sign
Query Modifier Function: Indicates the end of the string.
Scenario: Only display routes that originated in AS 40. In Example 11-58 the regex
pattern “_40_” was used. Unfortunately, this also includes routes that originated
in AS 50.
Example 11-59 provides the solution using the dollar sign ($) in the regex pattern
“_40$”.
[ ] Brackets
Query Modifier Function: Matches a single character or nesting within a range.
Scenario: Only display routes with an AS that contains 11 or 14 in it. The regex filter
“1[14]” can be used as shown in Example 11-60.
- Hyphen
Query Modifier Function: Indicates a range of numbers in brackets.
Scenario: Only display routes with the last two digits of the AS of 40, 50, 60, 70, or
80. Example 11-61 uses the regex query “[4-8]0_”.
Scenario: Only display routes where the second AS from AS 100 or AS 300 does not
start with 3, 4, 5, 6, 7, or 8. The first component of the regex query is to restrict the
AS to the AS 100 or 300 with the regex query “^[13]00_”, and the second component
is to filter out AS starting with 3-8 with the regex filter “_[^3-8]”. The complete regex
query is “^[13]00_[^3-8]” as shown in Example 11-62.
Scenario: Only display routes where the AS_PATH ends with AS 40 or AS 45. The
regex filter “_4(5|0)$” is shown in Example 11-63.
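The anchored, bracketed, and grouped patterns from the preceding scenarios can be checked the same way. As before, this is a sketch: the underscore is rewritten for Python's re module (an assumption about equivalence, not an NX-OS feature), the helper name is hypothetical, and the AS paths are invented for illustration.

```python
import re

UNDERSCORE = r"(?:^|$|[ ,{}()])"   # Python stand-in for the _ modifier


def aspath_matches(cisco_pattern, as_path):
    """Apply a Cisco-style AS-path regex to an AS-path string."""
    return re.search(cisco_pattern.replace("_", UNDERSCORE), as_path) is not None


# (pattern, AS path expected to match, AS path expected not to match)
CASES = [
    ("^300_",          "300 40",    "100 300 40"),   # first AS is 300
    ("_40$",           "300 40",    "300 40 50"),    # originated in AS 40
    ("1[14]",          "100 11",    "100 12"),       # contains 11 or 14
    ("[4-8]0_",        "300 40",    "300 30"),       # AS ending in 40-80
    ("^[13]00_[^3-8]", "100 20 40", "300 40"),       # second AS not 3x-8x
    ("_4(5|0)$",       "300 45",    "300 44"),       # ends with 40 or 45
]

for pattern, hit, miss in CASES:
    print(f"{pattern:16} {aspath_matches(pattern, hit)!s:6}"
          f"{aspath_matches(pattern, miss)}")
```

Testing a pattern against a path that should not match is as important as testing one that should, because most AS-path regex mistakes show up as unwanted extra matches rather than as missing ones.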
. Period
Query Modifier Function: Matches a single character, including a space.
Scenario: Only display routes with an originating AS of 1–99. In Example 11-64, the
regex query “_..$” requires a space, and then any character after that (including other
spaces).
+ Plus Sign
Query Modifier Function: One or more instances of the character or pattern.
Scenario: Only display routes that contain one or more instances of ‘11’ in the AS
path. The regex pattern is “(11)+” as shown in Example 11-65.
? Question Mark
Query Modifier Function: Matches one or no instances of the character or pattern.
Scenario: Only display routes from the neighboring AS or its directly connected AS
(that is, restrict to two ASs away). This query is more complicated and requires you
to define an initial query for identifying the AS, which is “[0-9]+”. The second com-
ponent includes the space and an optional second AS. The “?” limits the AS match to
one or two ASs as shown in Example 11-66.
Note The CTRL+V escape sequence must be used before entering the ?.
* Asterisk
Query Modifier Function: Matches zero or more characters or patterns.
Scenario: Display all routes from any AS. This may seem like a useless task, but may
be a valid requirement when using AS-Path access lists, which are explained later in
this chapter. Example 11-67 shows the regex query.
Example 11-68 provides two sample AS-Path access lists. AS-Path access-list 1 matches
against any local IBGP prefix or any prefix that passes through AS 300, whereas AS-Path
access-list 2 is a more complicated AS-Path access list that matches the 16-bit
private ASN range (64,512 – 65,535).
BGP Communities
BGP communities provide additional capability for tagging routes and are considered
either well-known or private BGP communities. Private BGP communities are used for
conditional matching for a router’s route-policy, which could influence routes during
inbound or outbound route-policy processing. There are four well-known communities
that affect only outbound route-advertisement:
■ Internet: Advertise this route to the Internet community and all the routers that
belong to it.
NX-OS devices do not advertise BGP communities to peers by default. Communities are
enabled on a neighbor-by-neighbor basis with the command send-community [standard |
extended | both] under the neighbor’s address-family configuration. When the command
is entered without a keyword, only standard communities are sent; the optional extended
or both keywords are needed to send extended communities.
Conditionally matching on NX-OS devices requires the creation of a community list. A
community list shares a similar structure to an ACL, is standard or expanded, and is ref-
erenced via number or name. Standard community lists match either well-known commu-
nities or a private community number (as-number:16-bit-number), whereas Expanded
community lists use regex patterns.
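A standard community in the as-number:16-bit-number notation is carried as a single 32-bit value, with the AS number in the upper 16 bits and the locally defined number in the lower 16 bits. The Python sketch below (hypothetical function names) converts between the two forms, using the value 65001:274 from this chapter's examples.

```python
def community_to_int(text):
    """Convert 'as-number:value' to the 32-bit community value."""
    asn, value = (int(part) for part in text.split(":"))
    if not (0 <= asn <= 0xFFFF and 0 <= value <= 0xFFFF):
        raise ValueError("both halves must fit in 16 bits")
    return (asn << 16) | value


def int_to_community(number):
    """Convert a 32-bit community value back to 'as-number:value'."""
    return f"{number >> 16}:{number & 0xFFFF}"


encoded = community_to_int("65001:274")
print(hex(encoded), int_to_community(encoded))
```

The 16-bit limit on the first half is the reason a 4-byte ASN cannot be written directly into a standard community.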
Examine the same topology as shown in Figure 11-6. In this topology, NX-5 assigns a
community value of 65001:274 to the prefixes that have AS 274 in their AS_Path list.
Example 11-69 illustrates the configuration on NX-5 to attach a community value to
prefixes.
Advertised path-id 1
Path type: external, path is valid, is best path
AS-Path: 65001 100 228 274 {300 243} , path sourced external to AS
10.25.1.5 (metric 0) from 10.25.1.5 (192.168.5.5)
On NX-2, if an operator wants to set a BGP attribute based on the matching community
value, community-list is used in the matching statement under route-map. Example 11-70
illustrates the configuration for using BGP community values for influencing route policy.
Logs Collection
In the event of a BGP failure, the following show tech logs can be collected:
If an issue is seen with BGP route policies, collect the following logs along with
show tech bgp:
If routes are present in the BGP table but are not being installed in the routing
table, also collect the following show tech output:
Collect and share these logs with Cisco TAC for a root-cause analysis of the problem.
Summary
BGP is a powerful path vector routing protocol that provides scalability and flexibility
that cannot be compared to any other routing protocol. BGP uses TCP port 179 for
establishing neighbors, which allows for BGP to establish sessions with directly attached
routers or with routers that are multiple hops away.
Originally, BGP was intended for routing IPv4 prefixes between organizations, but over
the years it has gained significant functionality and feature enhancements. BGP
has expanded from being an Internet routing protocol to other aspects of the network,
including the data center.
BGP provides scalable control plane signaling for overlay topologies, including MPLS
VPNs, IPsec SAs, and VXLAN. These overlays provide Layer 3 services, such as L3VPNs,
or Layer 2 services, such as EVPNs, across a widely used scalable control plane for
everything from provider-based services to data center overlays. Every AFI/SAFI combination
maintains an independent BGP table and routing policy, which makes BGP the perfect
control plane application.
This chapter focused on various techniques for troubleshooting BGP peering issues and
flapping peers caused by an MTU mismatch or bad BGP updates. It then examined BGP route
processing and convergence issues, covering route processing concepts such as BGP update
generation, route advertisement, best path calculation, and multipath. The chapter also
covered various scaling techniques for BGP, including BGP route reflectors.
Finally, the chapter focused on route filtering concepts using prefix-lists, filter-lists, and
route-maps and went over the various matching criteria available with route-maps, such as
prefix-lists, community-lists, and regular expressions.
Further Reading
Some of the topics involving validity checks and next-hop resolution are explained
further in the following books:
Zhang, Randy, and Micah Bartell. BGP Design and Implementation. Indianapolis: Cisco
Press, 2003.
White, Russ, Alvaro Retana, and Don Slice. Optimal Routing Design. Indianapolis: Cisco
Press, 2005.
Jain, Vinit, and Brad Edgeworth. Troubleshooting BGP. Indianapolis: Cisco Press, 2016.
References
Jain, Vinit, and Brad Edgeworth. Troubleshooting BGP. Indianapolis: Cisco Press, 2016.
Edgeworth, Brad, Aaron Foss, and Ramiro Garza Rios. IP Routing on Cisco IOS, IOS
XE and IOS XR. Indianapolis: Cisco Press, 2014.
High Availability
Nexus OS (NX-OS) is a resilient OS that has been designed on the paradigms of high
availability not just at the system level, but at the network and process levels as well.
Some Nexus switches provide system-level high availability through redundant hardware,
such as redundant fabric modules, supervisor cards, and power supplies. Network-level high
availability is provided by features such as virtual port-channels (vPC) and First Hop
Redundancy Protocol (FHRP), which give users backup paths to fail over to if the primary
path fails. NX-OS leverages various system components to provide process restartability
and virtualization capability, thus providing process-level high availability. This chapter
covers some of the important features and components within NX-OS that provide high
availability in the network.
690 Chapter 12: High Availability
its neighbor(s) cannot send packets at a smaller interval. The following features of BFD
make it a desirable protocol for failure detection:
■ Capability to run over User Datagram Protocol (UDP), data protocol independence
(IPv4, IPv6, Label Switched Path [LSP])
When an application (Border Gateway Protocol [BGP], Open Shortest Path First [OSPF],
and so on) creates or modifies a BFD session, it provides the following information:
■ Local address
■ Desired interval
■ Multiplier
The product of the desired interval and the multiplier indicates the desired failure
detection interval. The operational workflow of BFD for a given protocol P is as
follows:
■ If a link failure occurs, BFD detects the failure in the desired failure detection
interval (desired interval × multiplier) and informs both the peer and the local
BFD client (such as BGP) of the failure.
■ The session for P goes down immediately instead of waiting for the Hold Timer
to expire.
■ Asynchronous mode
■ Demand mode
Bidirectional Forwarding Detection 691
Note Demand mode is not supported on Cisco platforms. In demand mode, no hello
packets are exchanged after the session is established. In this mode, BFD assumes there is
another way to verify connectivity between the two endpoints. Either host may still send
hello packets if needed, but they are not generally exchanged.
Asynchronous Mode
Asynchronous mode is the primary mode of operation and is mandatory for BFD to func-
tion. In this mode, each system periodically sends BFD control packets to one another.
For example, packets sent by router NX-1 have a source address of NX-1 and a destination
address of router NX-2, as Figure 12-1 shows.
[Figure 12-1: NX-1 and NX-2 each send "Is Alive" BFD control packets to the other.]
Each stream of BFD control packets is independent and does not follow a request-
response cycle. If the other system does not receive the configured number of packets in
a row (based on the BFD timer and multiplier), the session is declared down. An adaptive
failure detection time is used to avoid false failures if a neighbor is sending packets slower
than what it is advertising.
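The timer arithmetic behind this behavior can be sketched in a few lines of Python. The sketch follows the asynchronous-mode calculation described in RFC 5880 (the remote multiplier times the larger of the local Required Min RX Interval and the remote Desired Min TX Interval); the function name and the sample values are illustrative assumptions, not NX-OS output.

```python
def detection_time_ms(local_required_min_rx_ms: int,
                      remote_desired_min_tx_ms: int,
                      remote_detect_mult: int) -> int:
    """Return the asynchronous-mode detection time in milliseconds.

    The remote system transmits no faster than the slower of the two
    advertised intervals, so the local system waits for that interval
    multiplied by the remote Detect Mult before declaring the session down.
    """
    negotiated_interval = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * negotiated_interval


if __name__ == "__main__":
    # Both sides use 300 ms timers with a multiplier of 3.
    print(detection_time_ms(300, 300, 3))
    # The peer would like to send every 50 ms, but the local side accepts
    # packets only every 300 ms, so the slower value governs.
    print(detection_time_ms(300, 50, 3))
```

In both sample calls the session would be declared down after 900 ms without a received control packet, which is why a mismatch in advertised timers does not by itself produce false failures.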
BFD async packets are sent on UDP port 3784. The BFD source port must be in the
range of 49152 through 65535. The BFD control packets contain the fields in Table 12-1.
Figure 12-2 shows the BFD control packets defined by the IETF.
 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|Vers |  Diag   |Sta|P|F|C|A|D|M|  Detect Mult  |    Length     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                       My Discriminator                        |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                      Your Discriminator                       |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                    Desired Min TX Interval                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                   Required Min RX Interval                    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|                 Required Min Echo RX Interval                 |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
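To make the field layout concrete, the following Python sketch packs and unpacks the 24-byte mandatory section of a BFD control packet with the standard struct module. It assumes the RFC 5880 encoding (version 1, state value 3 for Up, intervals in microseconds) and no authentication section; the function names and the sample discriminator values are hypothetical.

```python
import struct

# Two bit-packed bytes, Detect Mult, Length, then five 32-bit words.
BFD_HEADER = struct.Struct("!BBBBIIIII")


def pack_bfd(vers, diag, state, flags, detect_mult,
             my_disc, your_disc, min_tx_us, min_rx_us, min_echo_rx_us):
    """Build the mandatory section of a BFD control packet."""
    byte0 = ((vers & 0x7) << 5) | (diag & 0x1F)    # Vers (3 bits) | Diag (5 bits)
    byte1 = ((state & 0x3) << 6) | (flags & 0x3F)  # Sta (2 bits) | P F C A D M
    return BFD_HEADER.pack(byte0, byte1, detect_mult, BFD_HEADER.size,
                           my_disc, your_disc,
                           min_tx_us, min_rx_us, min_echo_rx_us)


def unpack_bfd(data):
    """Decode the mandatory section into a dictionary of fields."""
    (byte0, byte1, detect_mult, length, my_disc, your_disc,
     min_tx_us, min_rx_us, min_echo_rx_us) = BFD_HEADER.unpack(data[:BFD_HEADER.size])
    return {
        "vers": byte0 >> 5, "diag": byte0 & 0x1F,
        "state": byte1 >> 6, "flags": byte1 & 0x3F,
        "detect_mult": detect_mult, "length": length,
        "my_disc": my_disc, "your_disc": your_disc,
        "min_tx_us": min_tx_us, "min_rx_us": min_rx_us,
        "min_echo_rx_us": min_echo_rx_us,
    }


if __name__ == "__main__":
    packet = pack_bfd(1, 0, 3, 0, 3, 0x41000004, 0x41000001, 300000, 300000, 0)
    print(len(packet), unpack_bfd(packet))
```

Decoding a captured packet this way makes it easier to compare the advertised timers and discriminators on both ends when a session refuses to come up.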
Note BFD supports keyed SHA-1 authentication on NX-OS beginning with Release 5.2.
[Figure: BFD echo function between NX-1 and NX-2, labeled "Echo Packets Destination NX-2".]
Because echo packets do not require application or host stack processing on the remote
end, this function can be used with aggressive detection timers. Another benefit of using
the echo function is that the sender has complete control of the response time. For the echo
function to work, the remote node must also support the echo function. BFD echo packets
are sent as UDP packets with source and destination port 3785.
BFD configuration must be enabled under the routing protocol configuration and also
under the interface that will be participating in BFD. To enable configuration under the
routing protocol (for instance, OSPF), use the command bfd under the router ospf con-
figuration. Under the interface, two important BFD commands are defined:
■ BFD interval
■ BFD echo function
The BFD interval can be defined both under the interface and in global configuration
mode. It is defined using the command bfd interval tx-interval min_rx rx-interval
multiplier number. The BFD echo function is enabled by default. To disable or enable
the BFD echo function, use the command [no] bfd echo. Example 12-2 illustrates the
configuration for enabling BFD for OSPF.
NX-1
NX-1(config)# int e4/1
NX-1(config-if)# no ip redirects
NX-1(config-if)# no ipv6 redirects
NX-1(config-if)# bfd interval 300 min_rx 300 multiplier 3
NX-1(config-if)# ip ospf bfd
NX-1(config-if)# no bfd echo
NX-1(config-if)# exit
NX-1(config)# router ospf 100
NX-1(config-router)# bfd
Note To enable BFD for other routing protocols, refer to the Cisco documentation for
the configuration on different Nexus devices.
When BFD is enabled, a BFD session gets established. Use the command show bfd neigh-
bors [detail] to verify the status of BFD. The show bfd neighbors command displays the
state of the BFD neighbor, along with the interface, local and remote discriminator, and
Virtual Routing and Forwarding (VRF) details. The output with the detail keyword dis-
plays all the fields that are part of the BFD control packet, which is useful for debugging
purposes to see whether a mismatch could cause the BFD session to flap. Ensure that the
State bit is set to Up instead of AdminDown. The output also shows whether the echo
function is enabled or disabled. Example 12-3 displays the output of the command show bfd
neighbors [detail].
Before troubleshooting any BFD-related issue, it is important to verify the state of the
feature. This is done by using the command show system internal feature-mgr feature
feature-name current status. If a problem arises with the process (for instance, a pro-
cess is not running or has crashed), the state of the process does not show as Running.
Example 12-4 displays the state of BFD feature. Here, the BFD is currently in the
Running state.
As with other features, BFD also maintains internal event-history logs that are useful in
debugging state machine-related issues or BFD flaps. The event-history for BFD provides
various command-line options. To view the BFD event-history, use the command show
system internal bfd event-history [all | errors | logs | msgs | session [discriminator]]. The
all option shows all the event-history (that is, all the events and error event-history logs).
The errors option shows only the BFD-related errors. The logs option shows all the
events for BFD. The msgs option shows BFD-related messages, and the session option
shows the logs related to errors, log messages, and app-events for a particular session.
Example 12-5 displays the BFD event-history logs for a BFD session hosted on
an interface on module 4 with the discriminator 0x41000004. This example also helps
you understand the information exchange and steps the system goes through in bringing
up a BFD session. These are the steps, listed in sequence:
Step 2. The BFD client (BFDC) adds a BFD session with the interface and IP addresses
of the devices between which the session will be established.
Step 3. The BFD component sends an MTS message to the BFDC component on the
line card.
Note The BFD process runs on the supervisor, whereas the BFDC runs on the line card.
Example 12-6 displays the detailed information about the session using the command
show system internal bfd event-history session discriminator. The discriminator value
is taken from the LD (the local device's discriminator) value in the show bfd neighbors
detail output and converted to hex, as shown in Example 12-6, before it is used
with the event-history command. The event-history session command shows the
errors, logs (such as parameters exchanged and state changes), and app events related to a
given BFD session.
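The hex conversion itself is simple arithmetic. The following Python sketch assumes that the LD value is displayed in decimal in the show bfd neighbors output; the function name is hypothetical.

```python
def discriminator_to_hex(decimal_ld: int) -> str:
    """Format a decimal discriminator as the 32-bit hex value used by event-history."""
    return f"0x{decimal_ld:08x}"


if __name__ == "__main__":
    # 1090519044 in decimal corresponds to the discriminator 0x41000004.
    print(discriminator_to_hex(1090519044))  # prints 0x41000004
```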
The command show system internal bfd transition-history shows the different internal
state machine-related events that the BFD session goes through (see Example 12-7).
Note that the final state a BFD session should reach is BFD_SESS_ST_SESSION_UP. If
the BFD session is stuck in one of the other states, this command can identify where the
session is stuck.
1) FSM:<Proto Sess 0x41000004> Transition at 292788 usecs after Sat Oct 28 13:13:15
2017
Previous state: [BFD_SESS_ST_INIT]
Triggered event: [BFD_SESS_EV_INTERFACE]
Next state: [BFD_SESS_ST_INSTALLING_SESSION]
2) FSM:<Proto Sess 0x41000004> Transition at 293898 usecs after Sat Oct 28 13:13:15
2017
Previous state: [BFD_SESS_ST_INSTALLING_SESSION]
Triggered event: [BFD_SESS_EV_SESSION_INSTALL_SUCCESS]
Next state: [BFD_SESS_ST_INSTALLING_ACL]
3) FSM:<Proto Sess 0x41000004> Transition at 347878 usecs after Sat Oct 28 13:13:15
2017
Previous state: [BFD_SESS_ST_INSTALLING_ACL]
Triggered event: [BFD_SESS_EV_ACL_RESPONSE]
Next state: [FSM_ST_NO_CHANGE]
4) FSM:<Proto Sess 0x41000004> Transition at 347948 usecs after Sat Oct 28 13:13:15
2017
Previous state: [BFD_SESS_ST_INSTALLING_ACL]
Triggered event: [BFD_SESS_EV_ACL_INSTALL_SUCCESS]
Next state: [BFD_SESS_ST_SESSION_DOWN]
5) FSM:<Proto Sess 0x41000004> Transition at 769773 usecs after Sat Oct 28 13:13:18
2017
Previous state: [BFD_SESS_ST_SESSION_DOWN]
Triggered event: [BFD_SESS_EV_SESSION_UP]
Next state: [BFD_SESS_ST_SESSION_UP]
6) FSM:<Proto Sess 0x41000004> Transition at 399361 usecs after Sat Oct 28 13:14:02
2017
Previous state: [BFD_SESS_ST_SESSION_UP]
Triggered event: [BFD_SESS_EV_SESSION_DOWN]
Next state: [BFD_SESS_ST_SESSION_DOWN]
7) FSM:<Proto Sess 0x41000004> Transition at 315593 usecs after Sat Oct 28 13:14:06
2017
Previous state: [BFD_SESS_ST_SESSION_DOWN]
Triggered event: [BFD_SESS_EV_CLIENT_ADD]
Next state: [FSM_ST_NO_CHANGE]
8) FSM:<Proto Sess 0x41000004> Transition at 92563 usecs after Sat Oct 28 13:14:08
2017
Previous state: [BFD_SESS_ST_SESSION_DOWN]
Triggered event: [BFD_SESS_EV_SESSION_UP]
Next state: [BFD_SESS_ST_SESSION_UP]
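When the transition history is long, a small script can summarize it. The following Python sketch is an offline helper, not an NX-OS feature: it assumes the output has been saved as text in the format shown above, reports the state the session ended in, and counts Up-to-Down transitions. The function names and the embedded sample are illustrative.

```python
import re
from typing import Optional

TRANSITION = re.compile(
    r"Previous state: \[(?P<prev>\w+)\]\s+"
    r"Triggered event: \[(?P<event>\w+)\]\s+"
    r"Next state: \[(?P<next>\w+)\]"
)


def final_state(output: str) -> Optional[str]:
    """Return the last real state, ignoring FSM_ST_NO_CHANGE entries."""
    state = None
    for match in TRANSITION.finditer(output):
        if state is None:
            state = match["prev"]
        if match["next"] != "FSM_ST_NO_CHANGE":
            state = match["next"]
    return state


def count_flaps(output: str) -> int:
    """Count transitions from SESSION_UP to SESSION_DOWN."""
    return sum(
        1 for match in TRANSITION.finditer(output)
        if match["prev"] == "BFD_SESS_ST_SESSION_UP"
        and match["next"] == "BFD_SESS_ST_SESSION_DOWN"
    )


SAMPLE = """\
Previous state: [BFD_SESS_ST_SESSION_DOWN]
Triggered event: [BFD_SESS_EV_SESSION_UP]
Next state: [BFD_SESS_ST_SESSION_UP]
Previous state: [BFD_SESS_ST_SESSION_UP]
Triggered event: [BFD_SESS_EV_SESSION_DOWN]
Next state: [BFD_SESS_ST_SESSION_DOWN]
Previous state: [BFD_SESS_ST_SESSION_DOWN]
Triggered event: [BFD_SESS_EV_CLIENT_ADD]
Next state: [FSM_ST_NO_CHANGE]
"""

if __name__ == "__main__":
    print(final_state(SAMPLE), count_flaps(SAMPLE))
```

Applied to the eight transitions shown above, the same logic would report one flap (transition 6) and a final state of BFD_SESS_ST_SESSION_UP.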
When a BFD session is configured, an access list is installed in the hardware; it is verified
using the command show system internal access-list interface interface-id module slot.
The relevant statistics for the hardware Access Control List (ACL) can be viewed using
the command show system internal access-list input statistics module slot. Note that
when the BFD is enabled on an interface, the ACL gets installed for both IPv4 and IPv6 in
the hardware. Example 12-8 illustrates the ACL programmed in the hardware for BFD on the
Nexus 7000 switch.
INSTANCE 0x0
---------------
INSTANCE 0x0
---------------
Note The ACL programming on the hardware is dependent on the underlying line
card hardware and the Nexus platform. The behavior might differ among Nexus hardware
platforms.
To enable the BFD echo function, configure the command bfd echo under the interface.
When the session is configured with the echo function, the BFD session starts in asyn-
chronous mode using a slow interval of 2 seconds. When the session is up, and if the
interval specified by the client is less than 2 seconds, the echo function gets activated
(assuming that the echo function is enabled on the remote peer as well).
Example 12-9 illustrates the configuration of the BFD echo function between NX-1 and
NX-2 and the changes in the show bfd neighbors detail command output after the BFD
session is established.
If a failure occurs, NX-OS logs a syslog message for BFD failure along with a reason code
for the failure and the session discriminator value. Example 12-10 displays the syslog
message of a BFD failure on NX-1. Notice that, in this case, the reason is 0x2, which
indicates “Echo Function Failed.”
Table 12-2 lists all the BFD failure reason codes, along with their description.
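For quick lookups while reading syslog messages, the reason code can be translated with a small helper. The following Python sketch uses the diagnostic codes defined in RFC 5880 and assumes that the NX-OS reason codes follow them, which is consistent with 0x2 meaning "Echo Function Failed"; it is not a reproduction of Table 12-2, and the function name is hypothetical.

```python
# Diagnostic codes as defined in RFC 5880 (assumed to match the NX-OS reason codes).
BFD_DIAG = {
    0: "No Diagnostic",
    1: "Control Detection Time Expired",
    2: "Echo Function Failed",
    3: "Neighbor Signaled Session Down",
    4: "Forwarding Plane Reset",
    5: "Path Down",
    6: "Concatenated Path Down",
    7: "Administratively Down",
    8: "Reverse Concatenated Path Down",
}


def describe_reason(code: str) -> str:
    """Translate a hex reason code such as '0x2' into its description."""
    value = int(code, 16)
    return BFD_DIAG.get(value, f"Unassigned ({value})")


if __name__ == "__main__":
    print(describe_reason("0x2"))
```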
Note In case of any BFD failure event, capturing show tech bfd soon after the BFD flap
event is recommended. It is also necessary to capture the show tech feature output for the
relevant feature with which BFD is associated; for instance, in case of OSPF, this is show
tech ospf.
Nexus also supports BFD over L3 port-channels and BFD on SVI interfaces over L2
port-channels. In both cases, Link Aggregation Control Protocol (LACP) must be enabled for
the port-channel interface. BFD is enabled on L3 port-channel interfaces using one of two methods:
■ BFD per-link
■ Micro BFD session
To enable BFD per-link, use the command bfd per-link under the port-channel interface
along with the no ip redirects command. This enables BFD for the client protocol
configured on that L3 port-channel interface. When BFD per-link mode is used, BFD
creates a session for each link in the port-channel and provides the aggregated
result to the client protocol. Example 12-11 demonstrates the per-link BFD configuration
on a port-channel interface and its verification using the show bfd
neighbors [detail] command output. Use the command show port-channel summary to
verify the member ports of the port-channel interface.
NX-1
NX-1(config)# interface port-channel1
NX-1(config-if)# no ip redirects
NX-1(config-if)# bfd per-link
NX-1(config-if)# ip router ospf 100 area 0.0.0.0
NX-1(config-if)# ip ospf network point-to-point
NX-1(config-if)# exit
NX-1(config)# router ospf 100
NX-1(config-router)# bfd
Session state is Up
Local Diag: 0
Registered protocols: ospf
Uptime: 0 days 0 hrs 0 mins 9 secs
Hosting LC: 0, Down reason: None, Reason not-hosted: None
Parent session, please check port channel config for member info
Nexus 9000 also supports BFD on every link aggregation group (LAG) member
interface, as defined in RFC 7130. This method is called an IETF micro BFD session. The echo
function is not supported on micro BFD sessions. The benefit of using micro BFD
sessions is that if any member port goes down, the port is removed from the forwarding
table and traffic disruption is prevented on that member link.
Micro BFD sessions are configured using the commands port-channel bfd track-
member-link and port-channel bfd destination ip-address on an active L3 port-channel
interface. Example 12-12 demonstrates the micro BFD session configuration
on Nexus 9000 switches N9k-1 and N9k-2.
N9k-1
N9k-1(config)# interface port-channel2
N9k-1(config-if)# port-channel bfd track-member-link
N9k-1(config-if)# port-channel bfd destination 172.16.0.1
N9k-2
N9k-2(config)# interface port-channel2
N9k-2(config-if)# port-channel bfd track-member-link
N9k-2(config-if)# port-channel bfd destination 172.16.0.0
Verification shows that a BFD session is established on each member
port of the port-channel. In this method, the BFD client is the port-channel itself.
Example 12-13 verifies the BFD session on the port-channel interface configured with
the micro BFD session. Notice that the client is Ethernet port-channel.
N9k-1
N9k-1# show bfd neighbors
Session state is Up
Local Diag: 0
Registered protocols: eth_port_channel
Uptime: 0 days 0 hrs 9 mins 56 secs
Hosting LC: 0, Down reason: None, Reason not-hosted: None
Parent session, please check port channel config for member info
Note In case of any issues with a per-link BFD or micro BFD session, collect the show
tech bfd and show tech lacp all output and share the captured logs with Cisco Technical
Assistance Center (TAC) for investigation purposes.
This section discusses in detail these features and shows how they provide HA capability
to Nexus devices.
Stateful Switchover
Various Nexus platforms (including the Nexus 7000, Nexus 7700, and Nexus 9500) have
support for fabric as well as supervisor redundancy. The benefit of the hardware-based
redundancy is that if the active hardware (fabric or supervisor card) fails, the standby
hardware takes over the role of active and prevents any kind of traffic and service
With redundant hardware, the supervisor cards must stay in active/ha-standby mode. The
supervisor states are verified using the command show module. This command displays
all the supervisor cards, line cards, and fabric cards present in the chassis. Example 12-14
displays the show module output on the Nexus 7000 switch. Notice that, in the output,
the supervisor card in slot 1 is in ha-standby state and the one in slot 2 is in active state.
The HA state is also verified using the command show system redundancy status. When
the standby supervisor is booting up, or after a switchover event when the active
supervisor moves to a standby role, the ha-standby state is not achieved immediately. The
standby supervisor must first synchronize its state with that of the active supervisor. This is
achieved with the system manager (sysmgr) component on the active supervisor, which
initiates a global sync (gsync) of the active supervisor state to the standby
supervisor. During the synchronization process, the state is seen as HA synchronization
in progress. The standby should not remain in this state for long; if it does, the
synchronization has likely failed or stalled.
When all the components and states are synchronized between the active and standby
supervisor, the Module-Manager is informed that the standby supervisor is up. The
Module-Manager then informs all the software components on active supervisor about
the availability of the standby supervisor and configures them. This event is known as the
Standby Sup Insertion Sequence. Any error encountered during this sequence results in a
reboot of the standby supervisor.
Example 12-15 displays the system redundancy status. An ideal state for redundancy is
active/standby state. In this example, the standby supervisor is currently synchronizing its
states with the active supervisor in slot 2.
Nexus High Availability 709
Note In case of failure during Standby Sup Insertion Sequence, collect the following
commands to help identify where the failure has occurred:
On the Nexus 7000 or Nexus 7700 series platform, where virtual device context (VDC)
is supported, the HA state should also be maintained across all VDCs configured on
the system. This is verified using the command show system redundancy ha status.
Example 12-16 verifies the system redundancy state across all VDCs.
Synchronization is achieved using the sysmgr component, so the state information can
also be verified using the sysmgr state command show system internal sysmgr state.
In this command, verify that the sysmgr state is set to Active/HotStandby, as shown in
Example 12-17. This command also shows the current state of the active supervisor card,
which is set to Active (SYSMGR_CARDSTATE_ACTIVE) here.
The master System Manager has PID 4967 and UUID 0x1.
Last time System Manager was gracefully shutdown.
The state is SRV_STATE_MASTER_ACTIVE_HOTSTDBY entered at time Thu Oct 26 13:20:54 2017.
Debugging info:
HA info:
slotid = 2 supid = 0
cardstate = SYSMGR_CARDSTATE_ACTIVE .
cardstate = SYSMGR_CARDSTATE_ACTIVE (hot switchover is configured enabled).
Configured to use the real platform manager.
Configured to use the real redundancy driver.
Redundancy register: this_sup = RDN_ST_AC, other_sup = RDN_ST_SB.
EOBC device name: veobc.
Remote addresses: MTS - 0x00000101/3 IP - 127.1.1.1
MSYNC done.
Statistics:
Message count: 0
Total latency: 0 Max latency: 0
Total exec: 0 Max exec: 0
NX-1 SUP-1
NX-1# system switchover
NX-1#
User Access Verification
NX-1 login:
User Access Verification
NX-1 login:
>>>
>>>
>>>
NX7k SUP BIOS version ( 2.12 ) : Build - 05/29/2013 11:58:20
Note During manual switchover, while the initial active supervisor is being rebooted
to take over the role as standby, if the newly active supervisor crashes or reloads, it can
lead to a whole system reload and cause major outages. Thus, a manual switchover should
always take place during a planned maintenance window.
ISSU
Performing upgrades in any network deployment, especially in a large data center or
enterprise, is challenging. In most cases, when a device needs to be upgraded, services and
traffic are shifted to the backup or redundant devices, boot variables are set, and then the
device is brought down using the reload command to perform the upgrade. This becomes
more challenging on devices such as the Nexus 7000, with multiple VDCs running on a
single box, acting as individual devices and playing different roles. To overcome the
challenges of upgrades in the network, leverage the ISSU feature.
ISSU is not a new concept. It is available on multiple Cisco Catalyst platforms, including
4500 and 6500 switches. ISSU follows the same concept on Nexus 7000 series devices.
The whole ISSU process takes place in a few simple steps:
Step 1. Upgrade the Basic Input and Output System (BIOS) on supervisors and line
card modules.
Step 3. Switch over from the active to the standby supervisor, which is running on the
new image.
Step 4. Bring up the old active supervisor card with the new image.
Note Starting with NX-OS Release 5.2(1), simultaneous multiple line card upgrades
happen on Nexus switches, thus reducing the upgrade time using ISSU.
Before ISSU is performed, especially when the software is being downgraded, perform a
sanity check of the configuration compatibility between the software version currently
running on the system and the old image to which the system is being downgraded. This
check informs network administrators about the features and configurations that
are available in the new release but not in the old release; those configurations must
then be removed. The incompatibilities are verified using the command show incompatibility-
all system nx-os-file-name, as in Example 12-19.
An ISSU upgrade is performed using the command install all kickstart kickstart-image
system system-image [parallel]. The parallel keyword upgrades the I/O modules in
parallel. ISSU is intended to perform a nondisruptive software upgrade, which
upgrades the software on the Nexus switch without affecting the data plane. For a
nondisruptive upgrade, the software must be compatible across releases. If the image is not
compatible, the upgrade can be disruptive. Example 12-20 illustrates a
disruptive software upgrade from the 6.2(16) image to the 7.3(2)D1(1) image. The output
shows that the image is incompatible, so the upgrade is disruptive.
NX-1#
>>>
>>>
>>>
NX7k SUP BIOS version ( 2.12 ) : Build - 05/29/2013 11:58:20
PM FPGA Version : 0x00000025
Power sequence microcode revision - 0x00000009 : card type - 10156EEA0
Booting Spi Flash : Primary
CPU Signature - 0x000106e4: Version - 0x000106e0
CPU - 2 : Cores - 4 : HTEn - 1 : HT - 2 : Features - 0xbfebfbff
FSB Clk - 532 Mhz : Freq - 2144 Mhz - 2128 Mhz
MicroCode Version : 0x00000002
Memory - 32768 MB : Frequency - 1067 MHZ
Loading Bootloader: Done
IO FPGA Version : 0x1000d
PLX Version : 861910b5
Bios digital signature verification - Passed
USB bootflash status : [1-1:1-1]
When an ISSU upgrade fails, it is important to determine which component caused the
failure. At this point, the first step is to collect the following logs:
■ Installer log
After capturing the relevant logs, it is important to restore services after the ISSU failure.
This is done using the command install all. This command ensures that the system
normalizes on the running image and that all the modules are running the same image.
It is important to remember that an ISSU upgrade might not be compatible with all
scenarios, such as OTV (in certain releases), LACP Fast rate, and continuous TCNs in the
network. Reviewing the ISSU caveats on CCO is thus recommended before performing
an upgrade.
Note In case of ISSU failure, it is also important to collect show tech-support issu and
show tech-support ha outputs before the services are recovered.
■ Maintenance mode
■ Normal mode
In maintenance mode (also known as the Graceful Removal phase), all data traffic bypass-
es the node. A parallel path should be available for the GIR to function properly. If no
available parallel path exists, service disruptions to the network can arise. Maintenance
mode is used to perform maintenance-related activities such as software/hardware
upgrades, swaps for bad hardware, or other disruptive activities on the node. The node
then can go back to normal mode (also known as Graceful Insertion phase).
To understand the functioning of GIR, examine the topology in Figure 12-4. This
topology is a typical spine-leaf topology with two spine nodes and six leaf nodes. The
connectivity between spine and leaf is via OSPF.
[Figure 12-4: Spine-leaf topology with Spine1 and Spine2; Spine1 is in maintenance mode and advertises the OSPF max-metric.]
In this topology, suppose that the spine node Spine1 is set to maintenance mode for per-
forming a software upgrade. The first step in GIR is to advertise costly metrics within
the routing protocols. Thus, Spine1 advertises the OSPF max-metric to all its OSPF
neighbors. When the leaf nodes receive the max-metric, they alter their forwarding path
to push all the traffic through Spine2. At this point, the OSPF neighborship is still up
between Spine1 and all six leaf nodes (assuming the default Isolate mode, to be dis-
cussed), but no data forwarding is happening via Spine1.
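The effect of the max-metric advertisement on leaf path selection can be shown with simple cost arithmetic. The following Python sketch assumes that a spine in maintenance mode advertises its own links with the maximum OSPF metric of 65535 and that a leaf adds its uplink cost to the cost advertised by the spine; the interface cost of 40 and the function name are hypothetical.

```python
MAX_METRIC = 0xFFFF  # 65535, the largest OSPF link metric


def best_spine(leaf_uplink_costs, spine_downlink_costs):
    """Return the spine with the lowest leaf-to-leaf cost and all path costs."""
    costs = {
        spine: leaf_uplink_costs[spine] + spine_downlink_costs[spine]
        for spine in leaf_uplink_costs
    }
    return min(costs, key=costs.get), costs


if __name__ == "__main__":
    uplinks = {"Spine1": 40, "Spine2": 40}
    # Spine1 is in maintenance mode and advertises its links with max-metric.
    downlinks = {"Spine1": MAX_METRIC, "Spine2": 40}
    print(best_spine(uplinks, downlinks))
```

Because the path through Spine1 is still present, only far more expensive, the adjacency stays up while traffic moves to Spine2, which matches the Isolate behavior described above.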
Maintenance mode is supported on Nexus 7000 and 7700 series platforms starting with
Release 7.2.0 and on Nexus 5500/5600 platforms starting with Release 7.1.0. Maintenance
mode is configured using the command system mode maintenance [shutdown]. When
the command system mode maintenance is configured, GIR is enabled in default mode,
also known as Isolate mode. In this mode, the protocol neighborship is maintained and
traffic is diverted to the backup or parallel path. When the command system mode main-
tenance shutdown is configured, the GIR is enabled in shutdown mode; the protocols go
into shutdown state, links are shut down, and traffic loss can occur. Isolate mode for GIR
is recommended over shutdown mode.
shutdown
address-family ipv6 unicast
shutdown
router ospf 100
shutdown
router isis IS-IS
shutdown
system interface shutdown
NOTE: 'system interface shutdown' will shutdown all interfaces excluding mgmt 0
Do you want to continue (yes/no)? [no] yes
When the system goes into maintenance mode, the processes that were influenced by
maintenance mode change their running state to Isolate or Shutdown. Example 12-22
displays the different routing protocol processes and their current state on the system.
Server up : L3VM|IFMGR|RPM|AM|CLIS|URIB|U6RIB|IP|IPv6|SNMP|BGP|MMODE
Server required : L3VM|IFMGR|RPM|AM|CLIS|URIB|IP|SNMP
Server registered: L3VM|IFMGR|RPM|AM|CLIS|URIB|IP|SNMP|BGP|MMODE
Server optional : BGP|MMODE
Early hello : OFF
Force write PSS: FALSE
OSPF mts pkt sap 324
OSPF mts base sap 320
After the maintenance activity is performed, the no system mode maintenance configu-
ration command brings the system out of maintenance mode. When this command is
configured, the system is rolled back to normal mode and all the configuration changes
made during the isolate or shutdown maintenance mode are rolled back. Example 12-23
illustrates moving the system from maintenance mode to normal mode. Another snapshot
then is taken, with the name after_maintenance.
When the system is back in normal mode, verify that the services have returned to their
pre-maintenance state, including routes in the Routing Information Base (RIB), VLANs,
and so on. The snapshots taken before and after maintenance allow this verification with
a single command. The currently available snapshots are listed using the command show
snapshots. When both the before and after maintenance snapshots are available, use the
command show snapshots compare before_maintenance after_maintenance [summary] to
compare the system for any differences. Example 12-24 demonstrates the comparison of
before and after maintenance snapshots.
================================================================================
Feature before_maintenance after_maintenance changed
================================================================================
basic summary
# of interfaces 63 63
# of vlans 1 1
# of ipv4 routes vrf default 43 43
# of ipv4 paths vrf default 46 46
# of ipv4 routes vrf management 9 9
# of ipv4 paths vrf management 9 9
# of ipv6 routes vrf default 3 3
# of ipv6 paths vrf default 3 3
interfaces
# of eth interfaces 60 60
# of eth interfaces up 7 7
# of eth interfaces down 53 53
# of eth interfaces other 0 0
# of vlan interfaces 1 1
# of vlan interfaces up 0 0
# of vlan interfaces down 1 1
# of vlan interfaces other 0 0
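The comparison performed by show snapshots compare can be pictured as a per-counter
diff of the two snapshots. The following Python sketch is an offline illustration only and
does not represent how NX-OS implements the feature. The function name diff_snapshots
and the dictionary layout are hypothetical; the counter values are copied from the
summary output above, and the "loss" case is invented for demonstration.

```python
# Illustrative only: compare two snapshot summaries counter by counter.
# The function name and data layout are hypothetical, not an NX-OS API.

def diff_snapshots(before, after):
    """Return {counter: (before_value, after_value)} for counters that differ."""
    changed = {}
    for counter in sorted(set(before) | set(after)):
        old, new = before.get(counter), after.get(counter)
        if old != new:
            changed[counter] = (old, new)
    return changed


before_maintenance = {
    "interfaces": 63,
    "vlans": 1,
    "ipv4 routes vrf default": 43,
    "ipv4 paths vrf default": 46,
    "eth interfaces up": 7,
}

# Same counters after maintenance: nothing changed, as in the output above.
after_maintenance = dict(before_maintenance)

# Invented failure case: one route did not return after maintenance.
after_with_loss = dict(before_maintenance)
after_with_loss["ipv4 routes vrf default"] = 42

print(diff_snapshots(before_maintenance, after_maintenance))
print(diff_snapshots(before_maintenance, after_with_loss))
```

An empty result corresponds to an empty changed column; any entry in the result points
to a counter that deserves a closer look before the maintenance window is closed.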
Most production environments have a limit on the duration of the maintenance window.
To set the time limit of the system for the maintenance window, configure the timeout
value for the maintenance mode using the command system mode maintenance timeout
time-in-minutes. When the timeout value is reached, the system automatically rolls back
to normal mode from maintenance mode. Example 12-25 examines configuring the main-
tenance timeout to 30 minutes and verifying the timeout value using the command show
maintenance timeout.
Timer will be started for 30 minutes when the system switches to maintenance mode.
N7k-1# show maintenance timeout
Maintenance mode timeout value: 30 minutes
bitmap = 0xc0
Note If any issues arise with maintenance mode, collect the command show tech-
support mmode output during or just after the problem is seen.
■ Maintenance-mode
■ Normal-mode
Configuration for these profiles is first generated after the system has been put in
maintenance mode and switched back to normal mode. When custom profiles are created, the
profile names remain the same, but the configuration inside the profiles can be modified.
Creating a custom profile appends the commands to the existing maintenance profile.
Hence, the first step is to check whether a maintenance profile has already been defined.
This is verified using the command show maintenance profile, as in Example 12-27.
[Maintenance Mode]
router bgp 100
isolate
router eigrp 100
isolate
router ospf 100
isolate
router isis IS-IS
isolate
If the maintenance-mode and normal-mode profiles are not empty, it is better to remove
the existing profile content and then create the custom profile from scratch. To remove
the maintenance profiles, use the command no configure maintenance profile
[maintenance-mode | normal-mode]. These commands are executed from exec mode.
After removing the existing profile configuration, the command configure maintenance
profile [maintenance-mode | normal-mode] configures custom profiles from configuration
mode. When both the custom maintenance and normal profiles are configured, it is
important to also configure the command system mode maintenance
always-use-custom-profile so that the system-generated profile configuration is not
generated and used. Example 12-28 demonstrates all the steps to configure the custom
profiles for both maintenance and normal modes. In this example, the maintenance-mode
profile is configured to isolate the BGP and Intermediate System-to-Intermediate System
(IS-IS) protocols but shut down OSPF, Enhanced Interior Gateway Routing Protocol (EIGRP),
and interface Ethernet 3/1. Along with configuring custom maintenance profiles, it is
important to save the configuration so that the custom profiles are retained after a
reload.
N7k-1(config-mm-profile)#
N7k-1(config-mm-profile)# router bgp 100
N7k-1(config-mm-profile-router)# isolate
N7k-1(config-mm-profile-router)# router ospf 100
N7k-1(config-mm-profile-router)# shutdown
N7k-1(config-mm-profile-router)# router eigrp 100
N7k-1(config-mm-profile-router)# shutdown
N7k-1(config-mm-profile-router)# router isis IS-IS
N7k-1(config-mm-profile-router)# isolate
N7k-1(config-mm-profile-router)# interface e3/1
N7k-1(config-mm-profile-if-verify)# shutdown
N7k-1(config-mm-profile-if-verify)# end
[Maintenance Mode]
router bgp 100
isolate
router ospf 100
shutdown
router eigrp 100
shutdown
router isis IS-IS
isolate
interface Ethernet3/1
shutdown
Note Use the command show running-config mmode to validate all the configuration
settings related to maintenance mode.
To activate maintenance mode with custom profiles, configure the command system
mode maintenance dont-generate-profile. This command uses the configuration
from the custom profile created on the Nexus switch to get into maintenance
mode. Example 12-29 illustrates activating maintenance mode using custom profile
configurations.
Note To debug maintenance mode, use the command debug mmode logfile. Enabling
this debug also enables logging of the debug logs into a logfile that is viewed using the
command show system internal mmode logfile. Collecting show tech-support mmode
command output is also recommended, in case of any failures with GIR.
Summary
NX-OS, the operating system for data center switches, was built on the paradigms of high
availability (HA). This chapter focused on some of the high availability features that are
commonly used on Nexus switches, including achieving high availability using BFD, which
is used with various routing protocols and features. This chapter detailed verifying the
hardware programming and using event-history logs to troubleshoot any BFD issues. The
following areas should be verified while troubleshooting BFD session issues:
■ Verify the error code, which explains the reason for the BFD failure:
■ No Diag
■ 5: Path down
■ 7: Administratively down
In addition, this chapter covered the system high availability features, such as SSO and
ISSU, which are critical in a production environment. Performing incremental ISSU
upgrades that are nondisruptive is better than performing upgrades using the reload
command.
The chapter also examined Graceful Insertion and Removal (GIR) and looked at how GIR
is used to perform maintenance activities in the network without requiring too many
changes. With GIR, maintenance mode is enabled in two modes:
■ Isolate mode
■ Shutdown mode
Isolate mode is recommended for use with GIR. Finally, this chapter elaborated on how
to create and use custom profiles for maintenance windows instead of using system-
generated profiles.
References
RFC 5880, Bidirectional Forwarding Detection. D. Katz and D. Ward. IETF,
http://tools.ietf.org/html/rfc5880, June 2010.
RFC 5881, Bidirectional Forwarding Detection for IPv4 and IPv6 (Single Hop). D. Katz
and D. Ward. IETF, http://tools.ietf.org/html/rfc5881, June 2010.
RFC 5883, Bidirectional Forwarding Detection for Multihop Paths. D. Katz and
D. Ward. IETF, http://tools.ietf.org/html/rfc5883, June 2010.
RFC 5884, Bidirectional Forwarding Detection for MPLS Label Switched Paths.
R. Aggarwal, K. Kompella, T. Nadeau, and G. Swallow. IETF,
http://tools.ietf.org/html/rfc5884, June 2010.
Troubleshooting Multicast
Multicast traffic is found in nearly every network deployed today. The concept of
multicast communication is easy to understand. A host transmits a message that is
intended for multiple recipients. Those recipients are enabled to listen specifically for
the multicast traffic of interest and ignore the rest, which supports the efficient use of
system resources. However, bringing this simple concept to life in a modern network
can be confusing and is often misunderstood. This chapter introduces multicast communication
using Cisco NX-OS. After discussing the fundamental concepts, it presents examples
to demonstrate how to verify that the control plane and data plane are functioning as
intended. Multicast is a broad topic, and including an example for every feature is not
possible. The chapter primarily focuses on the most common deployment options for
IPv4; it does not cover multicast communication with IPv6.
734 Chapter 13: Troubleshooting Multicast
Multicast Fundamentals
Network communication is often described as being one of the following types:
■ Unicast (one-to-one)
■ Broadcast (one-to-all)
■ Anycast (one-to-nearest-one)
■ Multicast (one-to-many)
The concept of unicast traffic is simply a single source host sending packets to a single
destination host. Anycast is another type of unicast traffic, with multiple destination
devices sharing the same network layer address. The traffic originates from a single host
with a destination anycast address. Packets follow unicast routing to reach the nearest
anycast host, where routing metrics determine the nearest device.
[Figure 13-1: topology diagram showing NX-1, NX-2, hosts H1 through H6, and the subnets
10.12.1.0/24 and 10.12.2.0/24]
NX-2 is configured to route between the two L3 subnets in Figure 13-1. Host 3 sent a
broadcast packet with a destination IP address 255.255.255.255 and destination MAC
address of ff:ff:ff:ff:ff:ff. The broadcast traffic is represented by the black arrows. The
broadcast packet is flooded from all ports in the L2 switch and received by each device
in the 10.12.1.0/24 subnet. Host 1 is the only device running an application that needs
to receive this broadcast. Receiving the packets on every other device results in wasted
bandwidth and packet processing. NX-2 receives the broadcast but does not forward
the packet to the 10.12.2.0/24 subnet. This behavior limits the scope of communication
to only devices that are within the same broadcast domain or L3 subnet. Figure 13-1
demonstrates the potential inefficiency of using broadcasts when certain hosts do not
need to receive those packets.
Host 4 is sending multicast traffic represented by the white arrows to a group address
of 239.1.1.1. These multicast packets are handled differently by the L2 switch and
flooded only to Host 6 and NX-2, which is acting as an L3 multicast router (mrouter).
NX-2 performs multicast routing and forwards the traffic to the L2 switch, which finally
forwards the packets to Host 2. Because NX-1 is not receiving multicast traffic, the L2
switch does not consider it to be an mrouter. If NX-1 is reconfigured to be a multicast
router with interested receivers attached, the packet is received and again multicast
routed by NX-1 toward its receivers on other subnets. This theoretical behavior of NX-1
is mentioned to demonstrate that the scope of a multicast packet is limited by the time
to live (TTL) value set in the IP header by the multicast source, not by an L3 subnet
boundary as with broadcasts. Scope is also limited by administrative boundaries, access
lists (ACL), or protocol-specific filtering techniques.
Multicast Terminology
The terminology used to describe the state and behaviors of multicast must be defined
before diving further into concepts. Table 13-1 lists the multicast terms used throughout
this chapter with their corresponding definitions.
Term: Definition
L2 replication: The act of duplicating a multicast packet at the branch points along a multicast distribution tree. Replication for multicast traffic at L2 is done without rewriting the source MAC address or decrementing the TTL, and the packets stay inside the same broadcast domain.
L3 replication: The act of duplicating a multicast packet at the branch points along a multicast distribution tree. Replication for multicast traffic at L3 requires PIM state and multicast routing. The source MAC address is updated and the TTL is decremented by the multicast router.
Reverse Path Forwarding (RPF) check: Compares the IIF for multicast group traffic to the routing table entry for the source IP address or the RP address. Ensures that multicast traffic flows only away from the source.
Multicast distribution tree (MDT): Multicast traffic flows from the source to all receivers over the MDT. This tree can be shared by all sources (a shared tree), or a separate distribution tree can be built for each source (a source tree). The shared tree can be one-way or bidirectional.
Protocol Independent Multicast (PIM): Multicast routing protocol that is used to create MDTs.
RP Tree (RPT): The MDT between the last-hop router (LHR) and the PIM RP. Also referred to as the shared tree.
Shortest-path tree (SPT): The MDT between the LHR and the first-hop router (FHR) to the source. Typically follows the shortest path as determined by unicast routing metrics. Also known as the source tree.
Divergence point: The point where the RPT and the SPT diverge toward different upstream devices.
Upstream: A device that is relatively closer to the source along the MDT.
Downstream: A device that is relatively closer to the receiver along the MDT.
Sparse mode: Protocol Independent Multicast Sparse mode (PIM SM) relies on explicit joins from a PIM neighbor before sending traffic toward the receiver.
Dense mode: PIM dense mode (PIM DM) relies on flood-and-prune forwarding behavior. All possible receivers are sent the traffic until a prune is received from uninterested downstream PIM neighbors. NX-OS does not support PIM DM.
Rendezvous point (RP): The multicast router that is the root of the PIM SM shared multicast distribution tree.
Join: A type of PIM message, but more generically, the act of a downstream device requesting traffic for a particular group or source. This can result in an interface being added to the OIL.
Prune: A type of PIM message, but more generically, the act of a downstream device indicating that traffic for the group or source is no longer requested by a receiver. This can result in the interface being removed from the OIL if no other downstream PIM neighbors are present.
First-hop router (FHR): The L3 router that is directly adjacent to the multicast source. The FHR performs registration of the source with the PIM RP.
Last-hop router (LHR): The L3 router that is directly adjacent to the multicast receiver. The LHR initiates a join to the PIM RP and initiates switchover from the RPT to the SPT.
Intermediate router: An L3 multicast-enabled router that forwards packets for the MDT.
The example multicast topology in Figure 13-2 illustrates the terminology in Table 13-1.
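[Figure 13-2: multicast topology with NX-1, NX-2, and NX-5, labeling the PIM RP, the RP
tree, the SPT, the divergence point of the SPT and RPT, the L3 replication point, the
intermediate routers, the FHR, the LHR, and the L2 replication point]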
Figure 13-2 illustrates a typical deployment of PIM Sparse mode any-source multicast
(ASM). The end-to-end traffic flow from the source to the receiver is made possible
through several intermediate steps to build the MDT:
Note Figure 13-2 shows both the RP tree and the source tree in the diagram, for
demonstration purposes. This state does not persist in reality because NX-3 prunes itself
from the RP tree and receives the group traffic from the source tree.
The MAC address used by a host is typically assigned by the manufacturer and is
referred to as the Burned-In-Address (BIA). When two hosts in the same IP subnet
communicate, the destination address of the L2 frame is set to the target device’s MAC
address. As frames are received, if the target MAC address matches the BIA of the host,
the frame is accepted and handed to higher layers for further processing.
Broadcast messages between hosts are sent to the reserved address of FF:FF:FF:FF:FF:FF.
A host receiving a broadcast message must process the frame and pass its contents to a
higher layer for additional processing where the frame is either discarded or acted upon
by an application. As mentioned previously, when traffic does not need to be received by
every host on the network, the inefficiencies of broadcast communication can be avoided
by using multicast.
The multicast MAC address differentiates multicast from unicast or broadcast frames at
Layer 2. The reserved range of multicast MAC addresses designated in RFC 1112 is
from 01:00:5E:00:00:00 to 01:00:5E:7F:FF:FF. The first 24 bits are always 01:00:5E. The
first byte contains the individual/group (I/G) bit, which is set to 1 to indicate a multicast
MAC address. The 25th bit is always 0, which leaves 23 bits of the address remaining. The
Layer 3 group address is mapped to the remaining 23 bits to form the complete multicast
MAC address (see Figure 13-3).
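The mapping just described can be expressed in a few lines of Python. This is a minimal
sketch for checking addresses by hand during troubleshooting; the function name
multicast_mac is an assumption made for illustration and is not part of any Cisco tool.

```python
import ipaddress


def multicast_mac(group: str) -> str:
    """Map an IPv4 multicast group address to its Layer 2 multicast MAC."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not in 224.0.0.0/4")
    low23 = int(addr) & 0x7FFFFF      # keep the low-order 23 bits of the group
    mac = 0x01005E000000 | low23      # fixed 01:00:5E prefix; the 25th bit stays 0
    # Format the 48-bit value as six colon-separated bytes.
    return ":".join(f"{(mac >> shift) & 0xFF:02X}" for shift in range(40, -8, -8))


print(multicast_mac("239.1.1.1"))   # group used in the Figure 13-1 discussion
print(multicast_mac("224.0.0.5"))   # All OSPF routers
```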
When expanded in binary format, it is clear that multiple L3 group addresses must map
to the same multicast MAC address. In fact, 32 L3 multicast group addresses map to
each multicast MAC address. This is because 9 bits from the L3 group address do not get
mapped to the multicast MAC address. The 4 high-order bits of the first octet are always
1110, and the remaining 4 bits of the first octet are variable. Remember that the multicast
group IP address has the first octet in the range of 224 to 239. The high-order bit of
the second octet is also ignored when the L3 group address is mapped to the multicast MAC
address; its position corresponds to the 25th bit of the multicast MAC address, which is
always set to zero. Combined, the potential variability of those 5 bits is 32 (2^5), which
explains why 32 multicast groups map to each multicast MAC address.
For a host, this overlap means that if its NIC is programmed to listen to a particular multicast
MAC address, it could receive frames for multiple multicast groups. For example, imagine that
a source is active on a LAN segment and is generating multicast group traffic to 233.65.1.1,
239.65.1.1 and 239.193.1.1. All these groups are mapped to the same multicast MAC address.
If the host is interested only in packets for 239.65.1.1, it cannot differentiate the different
groups at L2. All the frames are passed to a higher layer where the uninteresting frames get
discarded, while the interesting frames are sent to the application for processing. The 32:1
overlap must be considered when deciding on a multicast group addressing scheme. It is also
advisable to avoid using groups X.0.0.Y and X.128.0.Y because the multicast MAC overlaps
with 224.0.0.Y. Frames with these MAC addresses are flooded by switches on all ports in
the same VLAN.
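The 32:1 overlap can be enumerated directly, which is helpful when evaluating a proposed
group addressing scheme. The following Python sketch is illustrative; the function names
mac_suffix and overlapping_groups are assumptions and do not correspond to any NX-OS
command.

```python
import ipaddress


def mac_suffix(group: str) -> int:
    """Return the 23 low-order bits that are copied into the multicast MAC."""
    return int(ipaddress.IPv4Address(group)) & 0x7FFFFF


def overlapping_groups(group: str) -> list:
    """List every IPv4 group that maps to the same multicast MAC as `group`."""
    low23 = mac_suffix(group)
    # 0xE0000000 is 224.0.0.0 (the fixed 1110 prefix). The 5 unmapped,
    # variable bits occupy bit positions 23 through 27 of the address.
    return [
        str(ipaddress.IPv4Address(0xE0000000 | (high5 << 23) | low23))
        for high5 in range(32)
    ]


groups = overlapping_groups("239.65.1.1")
print(len(groups))
print([g for g in groups if g in ("233.65.1.1", "239.65.1.1", "239.193.1.1")])

# A group of the form X.0.0.Y collides with the link-local group 224.0.0.Y.
print(mac_suffix("239.0.0.5") == mac_suffix("224.0.0.5"))
```

Running the same check against every group in a candidate addressing plan quickly shows
whether two applications would share a multicast MAC address on the same VLAN.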
does not exist with multicast because each address identifies an individual multicast
group address. However, various address blocks within the 224.0.0.0/4 multicast range sig-
nify a specific purpose based on their address. The Internet Assigned Numbers Authority
(IANA) lists the multicast address ranges provided in Table 13-2.
The Local Network Control Block is used for protocol communication traffic. Examples
are the All routers in this subnet address of 224.0.0.2 and the All OSPF routers address
of 224.0.0.5. Addresses in this range should not be forwarded by any multicast router,
regardless of the TTL value carried in the packet header. In practice, protocol packets
that utilize the Local Network Control Block are almost always sent with a TTL of 1.
The Internetwork Control Block is used for protocol communication traffic that is
forwarded by a multicast router between subnets or to the Internet. Examples include
Cisco-RP-Announce 224.0.1.39, Cisco-RP-Discovery 224.0.1.40, and NTP 224.0.1.1.
Table 13-3 provides the well-known multicast addresses used by control plane protocols
from the Local Network Control Block and from the Internetwork Control Block. It is
important to become familiar with these specific reserved addresses so they are easily
identifiable while troubleshooting a control plane problem.
The Source-Specific Multicast Block is used by SSM, an extension of PIM Sparse mode
that is described later in this chapter. It is optimized for one-to-many applications when
the host application is aware of the specific source IP address of a multicast group.
Knowing the source address eliminates the need for a PIM RP and does not require any
multicast routers to maintain state on the shared tree.
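When a capture or log contains an unfamiliar group address, a quick classification
against the blocks discussed here narrows down what kind of traffic it is. The Python
sketch below covers only the three blocks named in this section; the prefixes are taken
from the IANA IPv4 multicast address registry, and the function name classify_group is
an assumption made for illustration.

```python
import ipaddress

# Only the blocks discussed in this section are listed; the remaining
# IANA multicast blocks are intentionally omitted from this sketch.
BLOCKS = [
    (ipaddress.IPv4Network("224.0.0.0/24"), "Local Network Control Block"),
    (ipaddress.IPv4Network("224.0.1.0/24"), "Internetwork Control Block"),
    (ipaddress.IPv4Network("232.0.0.0/8"), "Source-Specific Multicast Block"),
]


def classify_group(group: str) -> str:
    """Return the name of the address block that contains the group."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not in 224.0.0.0/4")
    for network, name in BLOCKS:
        if addr in network:
            return name
    return "other multicast block"


for group in ("224.0.0.5", "224.0.1.39", "232.1.1.1", "239.1.1.1"):
    print(group, "->", classify_group(group))
```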
The NX-OS HA architecture allows for stateful process restart and in-service software
upgrades (ISSU) with minimal disruption to the data plane. As Figure 13-4 shows, the
architecture is distributed with platform-independent (PI) components running on the
supervisor module and hardware-specific components that forward traffic running on
the I/O modules or system application-specific integrated circuits (ASIC).
[Figure 13-4: architecture diagram; surviving labels: Supervisor, PI State Database,
MRIB, MFDM]
This common architecture is used across all NX-OS platforms. However, each platform
can implement the forwarding components differently, depending on the capabilities of
the specific hardware ASICs.
The MRIB interacts with the Unicast Routing Information Base (URIB) to obtain routing
protocol metrics and next-hop information used during Reverse Path Forwarding
(RPF) lookups. Any multicast packets that are routed by the supervisor in the software
forwarding path are also handled by the MRIB.
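Conceptually, the RPF lookup is a longest-prefix match on the source (or RP) address in
the unicast table, followed by a comparison with the interface on which the packet
arrived. The following Python sketch models that logic only; the URIB contents, the
interface names, and the function names are hypothetical and do not reflect NX-OS
internal data structures.

```python
import ipaddress

# Hypothetical unicast routing table: prefix -> interface toward that prefix.
URIB = {
    "10.12.1.0/24": "Ethernet1/1",
    "10.12.2.0/24": "Ethernet1/2",
    "0.0.0.0/0": "Ethernet1/3",
}


def rpf_interface(source, urib=URIB):
    """Longest-prefix match on the source address; returns the RPF interface."""
    addr = ipaddress.IPv4Address(source)
    matches = [
        ipaddress.IPv4Network(prefix)
        for prefix in urib
        if addr in ipaddress.IPv4Network(prefix)
    ]
    if not matches:
        return None                       # no route to the source: RPF fails
    best = max(matches, key=lambda net: net.prefixlen)
    return urib[str(best)]


def rpf_check(source, incoming_interface, urib=URIB):
    """Pass only if the packet arrived on the interface that leads to the source."""
    return rpf_interface(source, urib) == incoming_interface


print(rpf_check("10.12.2.4", "Ethernet1/2"))
print(rpf_check("10.12.2.4", "Ethernet1/1"))
```

A packet that fails this comparison arrived on an interface that does not lead back
toward the source, so forwarding it could create a loop or duplicate traffic.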
platform components understand. The data structures are then pushed from MFDM to
each I/O module, in the case of a distributed platform such as the Nexus 7000 series. In
a nonmodular platform, MFDM distributes its information to the platform-forwarding
components.
The Multicast Forwarding Information Base (MFIB) programs the (*, G), (S, G), and
RPF entries it receives from MFDM into hardware forwarding tables known as the FIB
ternary content-addressable memory (TCAM). The TCAM is a high-speed memory space
that is used to store a pointer to the adjacency. The adjacency is then used to obtain the
Multicast Expansion Table (MET) index. The MET index contains information about the
OIFs and how to replicate and forward the packet to each downstream interface. Many
platforms and I/O modules have dedicated replication ASICs. The steps described here
vary based on the type of hardware a platform uses, and troubleshooting at this depth
typically involves working with Cisco TAC Support. Table 13-4 provides a mapping of
multicast components to show commands used to verify the state of each component
process.
show ip mroute
MFDM show forwarding distribution ip multicast route
When Virtual Device Contexts (VDC) are used with the Nexus 7000 series, all of the
previously mentioned PI components are unique to the VDC. Each VDC has its own
PIM, IGMP, MRIB, and MFDM processes. However, in each I/O module, the system
resources are shared among the different VDCs.
Replication
Multicast communication is efficient because a single packet from the source can be
replicated many times as it traverses the MDT toward receivers located along different
branches of the tree. Replication can occur at L2 when multiple receivers are in the same
VLAN on different interfaces, or at L3 when multiple downstream PIM neighbors have
joined the MDT from different OIFs.
[Figure: egress replication across Module 1, Module 2, and Module 3 through the fabric
module; surviving labels: Replication Engine, MET, Fabric ASIC, Local Copy, Fabric Copy]
The benefit of egress replication is that it allows all modules of the system to share the
load of packet replication, which increases the forwarding capacity and scalability of the
platform. As traffic arrives from the IIF, the following happens:
■ The fabric module replicates additional copies of the packet, one for each module
that has an OIF.
■ At each egress module, additional packet copies are made for each local receiver
based on the contents of the MET table.
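The two replication stages in the preceding list can be modeled with a short Python
sketch. It is a simplified illustration of the concept, not a description of the actual
ASIC behavior; the function name egress_replicate and the module and port values are
hypothetical.

```python
from collections import defaultdict


def egress_replicate(oifs):
    """Split replication work between the fabric and the egress modules.

    oifs: iterable of (module, port) pairs that make up the outgoing interfaces.
    Returns (fabric_copies, local_copies), where fabric_copies lists the modules
    that each receive one copy from the fabric, and local_copies maps each
    module to the ports it replicates to locally.
    """
    per_module = defaultdict(list)
    for module, port in oifs:
        per_module[module].append(port)
    fabric_copies = sorted(per_module)            # one copy per module with an OIF
    local_copies = {m: sorted(ports) for m, ports in per_module.items()}
    return fabric_copies, local_copies


fabric, local = egress_replicate([(2, "Eth2/1"), (3, "Eth3/1"), (2, "Eth2/5")])
print(fabric)   # modules that receive a fabric copy
print(local)    # copies made on each egress module
```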
The MET tables on each module contain a list of local OIFs. For improved scalability,
each module maintains its own MET tables. In addition, multicast forwarding entries that
share the same OIFs can share the same MET entries, which further improves scalability.
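The sharing of MET entries can be pictured as a table keyed by the set of OIFs, so that
two forwarding entries with identical OIFs resolve to the same index. The Python sketch
below illustrates the idea under that assumption; the class name MetTable, the mroute
tuples, and the interface names are hypothetical and do not describe the hardware
implementation.

```python
class MetTable:
    """Toy model of a per-module MET in which identical OIF sets share an entry."""

    def __init__(self):
        self._index_by_oifs = {}   # frozenset of OIFs -> MET index
        self.entries = []          # MET index -> frozenset of OIFs

    def index_for(self, oifs):
        """Return the MET index for this OIF set, allocating one only if needed."""
        key = frozenset(oifs)
        if key not in self._index_by_oifs:
            self._index_by_oifs[key] = len(self.entries)
            self.entries.append(key)
        return self._index_by_oifs[key]


met = MetTable()
routes = {
    ("10.12.2.4", "239.1.1.1"): ["Eth2/1", "Eth2/5"],
    ("10.12.2.4", "239.1.1.2"): ["Eth2/5", "Eth2/1"],   # same OIFs, different order
    ("10.12.2.4", "239.1.1.3"): ["Eth2/1"],
}
indexes = {route: met.index_for(oifs) for route, oifs in routes.items()}
print(indexes)
print(len(met.entries))   # number of MET entries consumed by three routes
```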
■ The initial packet from a new source used to create a PIM register message
■ IGMP membership reports used to create entries in the snooping table
NX-OS uses control plane policing (CoPP) policies to protect the supervisor CPU from
excessive traffic. The individual CoPP classes used for multicast traffic vary from plat-
form to platform, but they all serve an important role: to protect the device. Leaving
CoPP enabled is always recommended, although exceptional cases require modifying
some of the classes or policer rates. The currently applied CoPP policy is viewed with the
show policy-map interface control-plane command. Table 13-5 provides additional detail
about the default CoPP classes related to multicast traffic.
In addition to CoPP, which polices traffic arriving at the supervisor, the Nexus 7000
series uses a set of hardware rate limiters (HWRL). The hardware rate limiters exist
on each I/O module and control the amount of traffic that can be directed toward the
supervisor. The status of the HWRL is viewed with the show hardware rate-limiter command
(see Example 13-1).
Module: 3
As with the CoPP policy, disabling any of the HWRLs that are enabled by default is
not advised. In most deployments, no modification to the default CoPP or HWRL
configuration is necessary.
■ Punted multicast data packets are not replicated by default (this is enabled by
configuring ip routing multicast software-replicate only if needed).
Static Joins
In general, static joins should not be required when multicast has been correctly
configured. However, this is a useful option for troubleshooting in certain situations. For
example, if a receiver is not available, a static join is used to build multicast state in the
network.
NX-OS offers the ip igmp join-group [group] [source] interface command, which
configures the NX-OS device as a multicast receiver for the group. Providing the source
address is not required unless the join is for IGMPv3. This command forces NX-OS to
issue an IGMP membership report and join the group as a host. All packets received for
the group address are processed in the control plane of the device. This command can
prevent packets from being replicated to other OIFs and should be used with caution.
The second option is the ip igmp static-oif [group] [source] interface command, which
statically adds an OIF to an existing mroute entry and forwards packets to the OIF in
hardware. The source option is used only with IGMPv3. It is important to note that if this
command is being added to a VLAN interface, you must also configure a static IGMP
snooping table entry with the ip igmp snooping static-group [group] [source] interface
[interface name] VLAN configuration command to actually forward packets.
■ clear ip pim route * clears PIM entries created by PIM join messages.
■ clear ip igmp route * clears IGMP entries created by IGMP membership reports.
In addition, the ip pim border command can be configured on an interface to prevent the
forwarding of any Auto-RP, bootstrap, or candidate-RP messages.
Each feature or service related to forwarding multicast traffic in NX-OS has its own show
tech-support [feature] output. These commands are typically used to collect the major-
ity of data for a problem in a single output that can be analyzed offline or after the fact.
The tech support file contains configurations, data structures, and event-history output
for each specific feature. If a problem is encountered and the time to collect information
is limited, the following list of NX-OS tech support commands can be captured and redi-
rected to individual files in bootflash for later review:
Knowing what time the problem might have occurred is critical so that the various
system messages and protocol events can be correlated in the event-history output.
If the problem occurred in the past, some or all of the event-history buffers might
have wrapped and the events related to the problem condition could be gone. In such
situations, increasing the size of certain event-history buffers might be useful for when
the problem occurs again.
After collecting all the data, the files can be combined into a single archive and com-
pressed for Cisco support to investigate the problem.
IGMP
Hosts use the IGMP protocol to dynamically join and leave a multicast group through
the LHR. With IGMP, a host can join or leave a group at any time. Without IGMP, a
multicast router has no way of knowing when interested receivers reside on one of its
interfaces or when those receivers are no longer interested in the traffic. It should be
obvious that, without IGMP, the efficiencies in bandwidth and resource utilization in
a multicast network would be severely diminished. Imagine if every multicast router
sent traffic for each group on every interface! For that reason, hosts and routers must
support IGMP if they are configured to support multicast communication. In the NX-OS
implementation of IGMP, a single IGMP process serves all virtual routing and forwarding
(VRF) instances. If Virtual Device Contexts (VDC) are being used, an IGMP process runs
on each VDC.
IGMPv1 was defined in RFC 1112 and provided a state machine and the messaging
required for hosts to join and leave multicast groups by sending membership reports to
the local router. Finding a device using IGMPv1 in a modern network is uncommon, but
an overview of its operation is provided for historical purposes so that the differences
and evolution in IGMPv2 and IGMPv3 are easier to understand.
A multicast router configured for IGMPv1 periodically sends query messages to the
All-Hosts address of 224.0.0.1. The host then waits for a random time interval, within the
bounds of a report delay timer, to send a membership report using the group address
as the destination address for the membership report. The multicast router receives the
message indicating that traffic for a specific group should be sent. When the router
receives the membership report, it knows that a host on the segment is a current
member of the multicast group and starts forwarding the group traffic onto the segment.
A functional reason for using the group address as the destination of the membership
report is so that hosts are aware of the presence of other receivers for the group on the
same network. This allows a host to suppress its own report message, to reduce the
volume of IGMP traffic on a segment. A multicast router needs to receive only a single
membership report to begin sending traffic onto the segment.
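Report suppression can be illustrated with a small simulation: every member starts a
random timer when the query arrives, the member whose timer expires first sends the
report to the group address, and the remaining members cancel their own reports when
they hear it. The Python sketch below is a simplified model; the function name, the
host names, and the max_delay value are assumptions chosen for illustration.

```python
import random


def igmpv1_query_round(hosts, max_delay=10.0, rng=None):
    """Simulate one query round for a single group on one segment.

    Returns (reporter, suppressed): the host whose report delay timer expires
    first, and the hosts that hear that report and suppress their own.
    """
    if rng is None:
        rng = random.Random()
    # Each member picks a random delay within the report delay timer bound.
    timers = {host: rng.uniform(0, max_delay) for host in hosts}
    reporter = min(timers, key=timers.get)
    suppressed = sorted(host for host in hosts if host != reporter)
    return reporter, suppressed


reporter, suppressed = igmpv1_query_round(["H1", "H2", "H3"], rng=random.Random(7))
print("report sent by:", reporter)
print("reports suppressed on:", suppressed)
```

Regardless of which host wins the timer race, the router sees exactly one membership
report for the group, which is all it needs to keep forwarding onto the segment.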
When a host wants to join a new multicast group, it can immediately send a member-
ship report for the group; it does not have to wait for a query message from a multicast
router. However, when a host wants to leave a group, IGMPv1 does not provide a way to
indicate this to the local multicast router. The host simply stops responding to queries.
If the router receives no further membership reports, it sends three queries before
determining that interested receivers are no longer present and pruning the interface
from the OIL.
IGMPv2
Defined in RFC 2236, IGMPv2 provides additional functionality over IGMPv1. It
required an additional message to be defined to implement the new functionality.
Figure 13-6 shows the IGMP message format.
[Figure 13-6: IGMP message format; surviving field label: Group Address]
■ Type:
■ Max Response Time: Used only in membership query messages and is set to zero in
all other message types. This is used to tune the response time of hosts and the leave
latency observed when the last member decides to leave the group.
Note IP packets carrying IGMP messages have the TTL set to 1 and the router alert
option set in the IP header, to force routers to examine the packet contents.
In IGMPv2, an election to determine the IGMP querier is specified whenever more than
one multicast router is present on the network segment. Upon startup, a multicast router
sends an IGMP general query message to the All-Hosts group 224.0.0.1. When a router
receives a general query message from another multicast router, a check is performed and
the router with the lowest IP address assumes the role of the querier. The querier is then
responsible for sending query messages on the network segment.
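The election rule can be illustrated with a minimal Python sketch. The function name and the router addresses are hypothetical and chosen only for illustration; the sketch simply selects the numerically lowest IPv4 address among the routers that sent a general query.

```python
import ipaddress


def elect_querier(router_addresses):
    """Return the router address that wins the querier election (lowest IP)."""
    # Compare numerically; comparing strings would rank 10.1.1.10 below 10.1.1.9.
    return str(min(ipaddress.IPv4Address(addr) for addr in router_addresses))


# Hypothetical multicast routers on one segment.
print(elect_querier(["10.115.1.254", "10.115.1.253", "10.115.1.1"]))
```

Converting to IPv4Address objects before comparing avoids the ordering mistakes that plain string comparison would introduce.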
an unsolicited membership report when a new group is joined to initiate the flow of
multicast traffic on the segment.
The leave group message was defined to address the IGMPv1 problem in which a host could
not explicitly inform the network after deciding to leave a group. This message type is used
to inform a router when the multicast group is no longer needed on the segment and all
members have left the group. If a host is the last member to send a membership report on the
segment, it should send a leave group message when the host no longer wants to receive the
group traffic. This leave group message is sent to the All-Routers multicast address 224.0.0.2.
When the querier receives this message, it sends a group-specific query in response, which is
also a new functionality enhancement over IGMPv1. The group-specific query message uses
the multicast group’s destination IP address, to ensure that any host listening on the group
receives the query. These messages are sent based on the last member query interval. If a
membership report is not received, the router prunes the interface from the OIL.
IGMPv3
IGMPv3 was specified in RFC 3376. It allows a host to support the functionality required
for Source Specific Multicast (SSM). SSM allows a receiver to join not only the multicast
group address, but also a specific source address for that group.
Applications running on a multicast receiver host can now request specific sources.
In IGMPv3, the interface state of the host includes a filter mode and source list. The filter
mode can be include or exclude. When the filter mode is include, traffic is requested only
from the sources in the source list. If the filter mode is exclude, traffic is requested for
any source except the ones present in the source list. The source list is an unordered list
of IP unicast source addresses, which can be combined with the filter mode to implement
source-specific logic. This allows IGMPv3 to signal only the sources of interest to the
receiver in the protocol messages.
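The filter mode logic can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, the address 10.215.1.2 is made up for the example, and the sketch models the per-interface state rather than any protocol encoding.

```python
def source_requested(filter_mode, source_list, source):
    """Return True if the host state requests traffic from this source."""
    if filter_mode == "include":
        # Include: only sources present in the list are requested.
        return source in source_list
    if filter_mode == "exclude":
        # Exclude: every source except those in the list is requested.
        return source not in source_list
    raise ValueError(f"unknown filter mode: {filter_mode!r}")


print(source_requested("include", {"10.215.1.1"}, "10.215.1.1"))
print(source_requested("exclude", {"10.215.1.1"}, "10.215.1.2"))
# An exclude filter with an empty source list requests every source.
print(source_requested("exclude", set(), "10.215.1.1"))
```

Note that an exclude filter with an empty source list behaves like a traditional any-source join for the group.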
Figure 13-7 provides the IGMPv3 membership query message format, which includes sev-
eral new fields when compared to the IGMPv2 membership query message, although the
message type remains the same (0x11).
[Figure 13-7: IGMPv3 membership query message format]
■ Type 0x11: Membership query (general query, group specific query, or group and
source specific query). These messages are differentiated by the contents of the
group address and source address fields.
■ Max Resp Code: The maximum time allowed for a host to send a responding
report. It enables the operator to tune the burstiness of IGMP traffic and the leave
latency.
■ Checksum: Ensures the integrity of the IGMP message. It is calculated over the
entire IGMP message.
■ Group Address: Set to zero for general query and is equal to the group address for
group specific or source and group specific queries.
■ QQIC: Querier’s query interval code. Provides the querier’s query interval (QQI).
■ Number of Sources: Specifies how many sources are present in the query.
■ Source Address: Specific source unicast IP addresses.
Several differences appear when compared to IGMPv2. The most significant is the capa-
bility to have group and source specific queries, enabling query messages to be sent for
specific sources of a multicast group.
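Because all three query variants share message type 0x11, a receiver tells them apart by the group address and the number of sources. The following Python sketch shows that decision under the field semantics listed above; the function name is hypothetical and the sketch works on already-decoded field values, not on raw packets.

```python
import ipaddress


def classify_query(group_address, number_of_sources):
    """Classify an IGMPv3 membership query from two decoded header fields."""
    if ipaddress.IPv4Address(group_address).is_unspecified:
        # Group address of zero: the query applies to all groups.
        return "general query"
    if number_of_sources == 0:
        return "group-specific query"
    return "group-and-source-specific query"


print(classify_query("0.0.0.0", 0))
print(classify_query("239.215.215.1", 0))
print(classify_query("239.215.215.1", 1))
```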
The membership report message type for IGMPv3 is identified by the message type
0x22 and involves several changes when compared to the membership report message
used in IGMPv2. Receiver hosts use this message type to report the current member-
ship state of their interfaces, as well as any change in the membership state to the local
multicast router. Hosts send this message to multicast routers using the group IP destina-
tion address of 224.0.0.22. Figure 13-8 shows the format of the membership report for
IGMPv3.
[Figure 13-8: IGMPv3 membership report message format]
Each group record in the membership report uses the format shown in Figure 13-9.
[Figure 13-9: IGMPv3 group record format]
The IGMPv3 membership report message fields are defined in the following list:
■ Number of Group Records: Provides the number of group records present in this
membership report
■ Group Record: A block of fields that provides the sender’s membership in a single
multicast group on the interface from which the report was sent
■ Number of Sources: How many sources are present in this group record.
■ Multicast Address: The multicast group this record pertains to.
■ Source Address: The unicast IP address of a source for the group.
■ Auxiliary Data: Not defined for IGMPv3. The Aux Data Len should be set to zero,
and any auxiliary data should be ignored.
■ Additional Data: Accounted for in the IGMP checksum, but any data beyond the
last group record is ignored.
The most significant difference in the IGMPv3 membership report when compared to the
IGMPv2 membership report is the inclusion of the group record block data. This is where
the IGMPv3-specific functionality for the filter mode and source list is implemented.
IGMPv3 is backward compatible with previous versions of IGMP and still follows the
same general state machine mechanics. When a host or router running an older version of
IGMP is detected, the queries and report messages are translated from IGMPv2 into their
IGMPv3 equivalent. For example, an IGMPv3-compatible representation of an IGMPv2
membership report for 239.1.1.1 includes all sources in IGMPv3.
As in IGMPv2, general queries are still sent to the All-Hosts group 224.0.0.1 from the
querier. Hosts respond with a membership report message, which now includes specific
sources in a source list and include or exclude logic in the record type field.
Hosts that want to join a new multicast group or source use unsolicited membership
reports. When leaving a group or specific source, a host sends an updated current state
group record message to indicate the change in state. The leave group message found
in IGMPv2 is not used in IGMPv3. If no other members are in the group or source,
the querier sends a group or group and source-specific query message before pruning
off the source tree. The multicast router keeps an interface state table for each group
and source and updates it as needed when an include or exclude update is received in a
group record.
IGMP Snooping
Without IGMP snooping, a switch must flood multicast packets to each port in a VLAN
to ensure that every potential group member receives the traffic. Obviously, bandwidth
and processing efficiency are reduced if ports on the switch do not have an interested
receiver attached. IGMP snooping inspects (or “snoops on”) the higher-layer protocol
communication traversing the switch. Looking into the contents of IGMP messages
allows the switch to learn where multicast routers and interested receivers for a group
are attached. IGMP snooping operates in the control plane by optimizing and suppress-
ing IGMP messages from hosts, and operates in the data plane by installing multicast
MAC address and port-mapping entries into the local multicast MAC address table of
the switch. The entries created by IGMP snooping are installed in the same MAC address
table as unicast entries. Despite the fact that different commands are used for viewing
the entries installed by normal unicast learning and IGMP snooping, they share the same
hardware resources provided by the MAC address table.
An IGMP snooping switch listens for IGMP query messages and PIM hello messages
to determine which ports are connected to mrouters. When a port is determined to be
an mrouter port, it receives all multicast traffic in the VLAN so that appropriate control
plane state on the mrouter is created and sources are registered with the PIM RP, if appli-
cable. The snooping switch also forwards IGMP membership reports to the mrouter to
initiate the flow of multicast traffic to group members.
Host ports are discovered by listening for IGMP membership report messages. The
membership reports are evaluated to determine which groups and sources are being
requested, and the appropriate forwarding entries are added to the multicast MAC
address table or IP-based forwarding table. An IGMP snooping switch should not forward
membership reports to hosts because it results in hosts suppressing their own member-
ship reports for IGMPv1 and IGMPv2.
If a multicast packet for the Network Control Block 224.0.0.0/24 arrives, it might need to
be flooded on all ports. This is because devices can listen for groups in this range without
sending a membership report for the group, and suppressing those packets could inter-
rupt control plane protocols.
IGMP snooping is a separate process from the IGMP control plane process and is enabled
by default in NX-OS. No user configuration is required to have the basic functionality
running on the device. NX-OS builds its IGMP snooping table based on the group IP
address instead of the multicast MAC address for the group. This behavior allows for
optimal forwarding even if the L3 group addresses of multiple groups overlap to the
same multicast group MAC address. The output in Example 13-2 demonstrates how to
verify the IGMP snooping state and lookup mode for a VLAN.
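The address overlap that IP-based lookup avoids can be demonstrated with a short Python sketch. It assumes the standard IPv4 mapping, in which the low-order 23 bits of the group address are copied into the 01:00:5e MAC prefix; the function name is hypothetical, and 224.87.215.1 is an arbitrary group chosen because it collides with the group used in this chapter.

```python
import ipaddress


def group_to_mac(group):
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not a multicast group address")
    # Only the low-order 23 bits of the group survive the mapping.
    mac = 0x01005E000000 | (int(addr) & 0x7FFFFF)
    return ":".join(f"{(mac >> shift) & 0xFF:02x}" for shift in range(40, -1, -8))


# Two different L3 groups that share one L2 multicast MAC address.
print(group_to_mac("239.215.215.1"))
print(group_to_mac("224.87.215.1"))
```

Five bits of the group address are discarded by the mapping, so 32 different groups share each multicast MAC address. A MAC-based lookup cannot distinguish them, whereas an IP-based lookup can.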
If multicast traffic arrives for a group that a host has not requested via a membership
report message, those packets are forwarded to the mrouter ports only, by default. This
is called optimized multicast flooding in NX-OS and is shown as enabled by default in
Example 13-2. If this feature is disabled, traffic for an unknown group is flooded to all
ports in the VLAN.
Note Optimized multicast flooding should be disabled in IPv6 networks to avoid prob-
lems related to neighbor discovery (ND) that rely specifically on multicast communication.
This feature is disabled with the no ip igmp snooping optimised-multicast-flood com-
mand in VLAN configuration mode.
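The forwarding behavior described in this section can be summarized as a simplified decision function. The Python sketch below is a conceptual model only, not the NX-OS implementation: the function and parameter names are hypothetical, Eth3/20 is an invented port, ingress-port exclusion is ignored, and the Network Control Block is always flooded even though the text says only that it might need to be.

```python
import ipaddress

NETWORK_CONTROL_BLOCK = ipaddress.IPv4Network("224.0.0.0/24")


def egress_ports(group, vlan_ports, mrouter_ports, members, omf_enabled=True):
    """Return the set of ports that receive a multicast packet for a group."""
    if ipaddress.IPv4Address(group) in NETWORK_CONTROL_BLOCK:
        # Simplification: always flood, because devices may listen here
        # without ever sending a membership report.
        return set(vlan_ports)
    if group in members:
        # Known group: interested receivers plus the mrouter ports.
        return set(members[group]) | set(mrouter_ports)
    if omf_enabled:
        # Unknown group with optimized multicast flooding: mrouter ports only.
        return set(mrouter_ports)
    # Unknown group with the feature disabled: flood the VLAN.
    return set(vlan_ports)


ports = {"Eth3/19", "Eth3/20", "Po1"}
state = {"239.215.215.1": {"Eth3/19"}}
print(sorted(egress_ports("239.215.215.1", ports, {"Po1"}, state)))
print(sorted(egress_ports("239.1.1.1", ports, {"Po1"}, state)))
```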
IGMP membership reports are suppressed by default to reduce the number of messages
the mrouter receives. Recall that the mrouter needs to receive a membership report from
only one host for the interface to be added to the OIL for a group.
NX-OS has several options available when configuring IGMP snooping. Most of the
configuration is applied per VLAN, but certain parameters can be configured only
globally. Global values apply to all VLANs. Table 13-7 provides the default configuration
parameters for IGMP snooping that apply globally on the switch.
Table 13-8 provides the IGMP snooping configuration parameters, which are configured
per VLAN. The per-VLAN configuration is applied in the vlan configuration [vlan-id]
submode.
Note When vPC is configured with IGMP snooping, configuring the same IGMP param-
eters on both vPC peers is recommended. IGMP state is synchronized between vPC peers
with Cisco Fabric Services (CFS).
IGMP Verification
IGMP is enabled by default when PIM is enabled on an interface. Troubleshooting
IGMP problems typically involves scenarios in which the LHR does not have an
mroute entry populated by IGMP and the problem needs to be isolated to the LHR,
the L2 infrastructure, or the host itself. Often IGMP snooping must be verified during
this process because it is enabled by default and therefore plays an important role in
delivering the queries to hosts and delivering the membership report messages to the
mrouter.
In the topology in Figure 13-10, NX-1 is acting as the LHR for receivers in VLAN 115
and VLAN 116. NX-1 is also the IGMP querier for both VLANs. NX-2 is an IGMP
snooping switch that is not performing any multicast routing. All L3 devices are
configured for PIM ASM, with an anycast RP address shared between NX-3 and NX-4.
[Figure 13-10: Example topology with NX-1 through NX-6; NX-3 and NX-4 are PIM RPs sharing 10.99.99.99, and the L3 PIM links run in OSPF area 0]
If a receiver is not getting multicast traffic for a group, verify IGMP for correct state and
operation. To begin the investigation, the following information is required:
■ Scope of the problem: The groups, sources, and receivers that are not functioning
The purpose of IGMP is to inform the LHR that a receiver is interested in group traffic. At
the most basic level, this is communicated through a membership report message from the
receiver and should create a (*, G) state at the LHR. In most circumstances, checking the
mroute at the LHR for the presence of the (*, G) is enough to verify that at least one mem-
bership report was received. The OIL for the mroute should contain the interface on which
the membership report was received. If this check passes, typically the troubleshooting
follows the MDT to the PIM RP or source to determine why traffic is not arriving at the
receiver.
In the following examples, no actual IGMP problem condition is present because the
(*, G) state exists on NX-1. Instead of troubleshooting a specific problem, this section
reviews the IGMP protocol state and demonstrates the command output, process events,
and methodology used to verify functionality.
Verification begins from NX-2, which is the IGMP snooping switch connected to
the receiver 10.115.1.4, and works across the L2 network toward the mrouter NX-1.
Example 13-4 contains the output of show ip igmp snooping vlan 115, which is where
the receiver is connected to NX-2. This output is used to verify that IGMP snooping is
enabled and that the mrouter port is detected.
The Number of Groups field indicates that one group is present. The show ip igmp
snooping groups vlan 115 command is used to obtain additional detail about the group,
as in Example 13-5.
The last reporter is seen using the detail keyword, shown in Example 13-6.
Note If MAC-based multicast forwarding was configured for VLAN 115, the multicast
MAC table entry can be confirmed with the show hardware mac address-table [module]
[VLAN identifier] command. There is no software MAC table entry in the output of show
mac address-table multicast [VLAN identifier], which is expected.
NX-2 is configured to use IP-based lookup for IGMP snooping. The show forwarding
distribution ip igmp snooping vlan [VLAN identifier] command in Example 13-7 is
used to find the platform index, which is used to direct the frames to the correct output
interfaces. The platform index is also known as the Local Target Logic (LTL) index. This
command provides the Multicast Forwarding Distribution Manager (MFDM) entry, which
was discussed in the “NX-OS Multicast Architecture” section of this chapter.
NX-2# show forwarding distribution ip igmp snooping vlan 115 group 239.215.215.1 detail
Vlan: 115, Group: 239.215.215.1, Source: 0.0.0.0
Route Flags: 0
Outgoing Interface List Index: 13
Reference Count: 2
Platform Index: 0x7fe8
Vpc peer link exclude flag clear
Number of Outgoing Interfaces: 2
port-channel1
Ethernet3/19
The Ethernet3/19 interface is populated by the membership report from the receiver. The
Port-channel 1 interface is included as an outgoing interface because it is the mrouter
port. Verify the platform index as shown in Example 13-8 to ensure that the correct inter-
faces are present and match the previous MFDM output. The show system internal pixm
info ltl [index] command obtains the output from the Port Index Manager (PIXM). The
IFIDX/RID is 0xd, which matches the Outgoing Interface List Index of 13.
Member info
------------------
IFIDX LTL
---------------------------------
Eth3/19 0x0012
Po1 0x0404
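When this comparison has to be repeated for many groups, it can be scripted. The following Python sketch parses the MFDM output shown in Example 13-7 and compares the Outgoing Interface List Index with the IFIDX/RID value quoted from PIXM. The function name is hypothetical, the regular expressions assume the exact field labels shown in the example, and output formats may differ between platforms and releases.

```python
import re

MFDM_OUTPUT = """\
Vlan: 115, Group: 239.215.215.1, Source: 0.0.0.0
Route Flags: 0
Outgoing Interface List Index: 13
Reference Count: 2
Platform Index: 0x7fe8
Vpc peer link exclude flag clear
Number of Outgoing Interfaces: 2
port-channel1
Ethernet3/19
"""


def parse_mfdm_entry(text):
    """Extract the OIL index, platform (LTL) index, and interface list."""
    oil = re.search(r"Outgoing Interface List Index:\s*(\d+)", text)
    ltl = re.search(r"Platform Index:\s*(0x[0-9a-fA-F]+)", text)
    count = int(re.search(r"Number of Outgoing Interfaces:\s*(\d+)", text).group(1))
    lines = [line.strip() for line in text.strip().splitlines()]
    return {
        "oil_index": int(oil.group(1)),
        "platform_index": int(ltl.group(1), 16),
        # The interface names are the last `count` lines of the entry.
        "interfaces": lines[-count:] if count else [],
    }


entry = parse_mfdm_entry(MFDM_OUTPUT)
pixm_rid = int("0xd", 16)  # IFIDX/RID value reported by PIXM
print(entry["oil_index"] == pixm_rid, entry["interfaces"])
```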
Note If the IFIDX of interest is a port-channel, the physical interface is found by examin-
ing the LTL index of the port-channel. Chapter 5, “Port-Channels, Virtual Port-Channels,
and FabricPath,” demonstrates the port-channel load balance hash and how to find the
port-channel member link that will be used to transmit the packet.
At this point, the IGMP snooping control plane was verified in addition to the forwarding
plane state for the group with the available show commands. NX-OS also provides several
useful event-history records for IGMP, as well as other multicast protocols. The event-
history output collects significant events from the process and stores them in a circular
buffer. In most situations, for multicast protocols, the event-history records provide the
same level of detail that is available with process debugs.
The show ip igmp snooping internal event-history vlan command provides a sequence
of IGMP snooping events for VLAN 115 and the group of interest, 239.215.215.1.
Example 13-9 shows the reception of a general query message from Port-channel 1, as
well as the membership report message received from 10.115.1.4 on Eth3/19.
The Ethanalyzer tool provides a way to capture packets at the netstack component
level in NX-OS. This is an extremely useful tool for troubleshooting any control plane
protocol exchange. In Example 13-10, an Ethanalyzer capture filtered for IGMP packets
clearly shows the receipt of the general query messages, as well as the membership
report from 10.115.1.4. Ethanalyzer output is directed to local storage with the write
option. The file can then be copied off the device for a detailed protocol examination, if
needed.
NX-OS maintains statistics for IGMP snooping at both the global and VLAN level.
These statistics are viewed with either the show ip igmp snooping statistics global
command or the show ip igmp snooping statistics vlan [VLAN identifier] command.
Example 13-11 shows the statistics for VLAN 115 on NX-2. The VLAN statistics also
include global statistics, which are useful for confirming how many and what type of
IGMP and PIM messages are being received on a VLAN. If additional packet-level details
are needed, using Ethanalyzer with an appropriate filter is recommended.
With NX-2 verified, the examination moves to the LHR, NX-1. NX-1 is the mrouter for
VLAN 115 and the IGMP querier. The IGMP state on NX-1 is verified with the show ip
igmp interface vlan 115 command, as in Example 13-12.
The membership report NX-2 forwarded from the host is received on Port-channel 1.
The query messages and membership reports are viewed in the show ip igmp internal
event-history debugs output in Example 13-13. When the membership report message is
received, NX-1 determines that state needs to be created.
IGMP creates a route entry based on the received membership report in VLAN 115. The
IGMP route entry is shown in the output of Example 13-14.
IGMP must also inform the MRIB so that an appropriate mroute entry is created. This is
seen in the show ip igmp internal event-history igmp-internal output in Example 13-15.
An IGMP update is sent to the MRIB process buffer through Message and Transactional
Services (MTS). Note that IGMP receives notification from MRIB that the message was
processed and the message buffer gets reclaimed.
The message identifier 0xffff000c is used to track this message in the MRIB process
events. Example 13-16 shows the MRIB processing of this message from the show
routing ip multicast event-history rib output.
When the MRIB process receives the MTS message from IGMP, an mroute is created
for (*, 239.215.215.1/32) and the MFDM is informed. The RPF toward the PIM RP
(10.99.99.99) is then confirmed and added to the entry.
The output of show ip mroute in Example 13-17 confirms that a (*, G) entry has been
created by IGMP and the OIF was also populated by IGMP.
Note Additional events occur after this point when traffic arrives from the source,
10.215.1.1. The arrival of data traffic from the RP triggers a PIM join toward the source and
creation of the (S, G) mroute. This is explained in the “PIM Any Source Multicast” section
later in this chapter.
PIM Multicast
PIM is the multicast routing protocol used to build the shared trees and shortest-path
trees that facilitate the distribution of multicast traffic in an L3 network. As the name
suggests, PIM was designed to be protocol independent. PIM essentially creates a
multicast overlay network built upon the information available from the underlying
unicast routing topology. The term protocol independent is based on the fact that PIM
can use the unicast routing information in the Routing Information Base (RIB) from any
source protocol, such as EIGRP, OSPF, or BGP. The unicast routing table provides PIM
with the relative location of sources, rendezvous points, and receivers, which is essential
to building a loop-free MDT.
PIM is designed to operate in one of two modes, dense mode or sparse mode. Dense
mode (DM) operates under the assumption that receivers are densely dispersed through
the network. In dense mode, the assumption is that all PIM neighbors should receive
the traffic. In this mode of operation, multicast traffic is flooded to all downstream
neighbors. If the group traffic is not required, the neighbor prunes itself from the tree.
This is referred to as a push model because traffic is pushed from the root of the tree
toward the leaves, with the assumption that there are many leaves and they are all
interested in receiving the traffic. NX-OS does not support PIM dense mode because
PIM sparse mode offers several advantages and is the most popular mode deployed in
modern data centers.
PIM sparse mode (SM) is based on a pull model. The pull model assumes that receivers
are sparsely dispersed through the network and that it is therefore more efficient to
forward traffic only to the PIM neighbors that are explicitly requesting it. PIM
sparse mode works well for the distribution of multicast when receivers are sparsely or
densely populated in the topology. Because of its explicit join behavior, it has become
the preferred mode of deploying multicast.
The role of PIM in the process of distributing multicast traffic from a source to a receiver
is described by the following responsibilities:
■ If multiple PIM routers exist on the same L3 network, determining which PIM router
will forward traffic
This section of the chapter introduces the PIM protocol and messages PIM uses to build
MDTs and create forwarding state. The different operating models of PIM SM are exam-
ined, including ASM, SSM, and Bi-Directional PIM (Bidir).
Note RFC 2362 initially defined PIM as an experimental protocol that was later made
obsolete by RFC 4601. RFC 4601 was in turn obsoleted by RFC 7761. The NX-OS
implementation of PIM is based on RFC 4601.
The mroute state is often referred to when discussing multicast forwarding. With PIM
multicast, the (*, G) state is created by the receiver at the LHR and represents the RPT’s
relationship to the receiver. The (S, G) state is created by the receipt of multicast data
traffic and represents the SPT’s relationship to the source.
As packets arrive on a multicast router, they are checked against the unicast route to the
root of the tree. This is known as the Reverse Path Forwarding (RPF) check. The RPF
check ensures that the MDT remains loop-free. When a router sends a PIM join-prune
message to create state, it is sent toward the root of the tree from the RPF interface that
is determined by the best unicast route to the root of the tree. Figure 13-11 illustrates the
concepts of mroute state and PIM MDTs.
[Figure 13-11: Mroute state and PIM MDTs, with receivers as leaves and the source as the root of the SPT]
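The RPF check can be modeled as a longest-prefix-match lookup toward the root of the tree, followed by a comparison with the interface on which the packet arrived. The Python sketch below is a conceptual illustration: the function names are hypothetical, and the route table, including which interface leads to the RP or the source, is invented for the example.

```python
import ipaddress


def rpf_interface(routes, root):
    """Return the interface of the longest-prefix match toward the tree root."""
    root_addr = ipaddress.ip_address(root)
    matches = [
        (ipaddress.ip_network(prefix), interface)
        for prefix, interface in routes.items()
        if root_addr in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]


def rpf_check(routes, root, arrival_interface):
    """Pass only if the packet arrived on the interface used to reach the root."""
    expected = rpf_interface(routes, root)
    return expected is not None and expected == arrival_interface


# Hypothetical unicast routes on an LHR.
routes = {
    "10.99.99.99/32": "Ethernet3/18",   # PIM RP, root of the RPT
    "10.215.1.0/24": "Ethernet3/17",    # source subnet, root of the SPT
}
print(rpf_check(routes, "10.99.99.99", "Ethernet3/18"))
print(rpf_check(routes, "10.215.1.1", "Ethernet3/18"))
```

The same lookup determines the interface out of which a PIM join-prune message is sent, which is why unicast routing changes directly affect the shape of the MDT.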
The PIM control message header format fields are defined in the following list:
■ Reserved: This field is set to zero on transmit and is ignored upon receipt.
■ Checksum: The checksum is calculated on the entire PIM message, except for the
multicast data packet portion of a register message.
The type field of the control message header identifies the type of PIM message being
sent. Table 13-9 describes the various PIM message types listed in RFC 6166.
Note This chapter does not cover the PIM messages specific to PIM DM because
NX-OS does not support PIM DM. Interested readers should review RFC 3973 to learn
about the various PIM DM messages.
The value of the DR priority option is used in the Designated Router (DR) election
process. The default value is one, and the neighbor with the numerically higher priority
is elected as the PIM DR. If the DR priority is equal, then the higher IP address wins the
election. The PIM DR is responsible for registering multicast sources with the PIM RP
and for joining the MDT on behalf of the multicast receivers on the interface.
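The DR election rule can be expressed as a single comparison. The following Python sketch is illustrative only: the function name and neighbor values are hypothetical, and it assumes every router on the segment advertises the DR priority option.

```python
import ipaddress


def elect_dr(neighbors):
    """Return the DR address from (ip_address, dr_priority) pairs."""
    # Highest priority wins; the higher IP address breaks a tie.
    winner = max(neighbors, key=lambda n: (n[1], ipaddress.IPv4Address(n[0])))
    return winner[0]


# Equal (default) priority: the higher IP address is elected.
print(elect_dr([("10.115.1.254", 1), ("10.115.1.253", 1)]))
# A higher configured priority overrides the IP address comparison.
print(elect_dr([("10.115.1.254", 1), ("10.115.1.253", 10)]))
```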
The hello message carries different option types in a Type, Length, Value (TLV) format.
The various hello message option types follow:
■ Option Type 1: Holdtime is the amount of time to keep the neighbor reachable.
A value of 0xffff indicates that the neighbor should never be timed out, and a value
of zero indicates that the neighbor is about to go down or has changed its IP address.
■ Option Type 2: LAN prune delay is used to tune prune propagation delay on
multiaccess LAN networks. It is used only if all routers on the LAN support this
option, and it is used by upstream routers to figure out how long they should wait
for a join override message before pruning an interface.
■ Option Type 24: Address list is used to inform neighbors about secondary IP addresses
on an interface.
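To show how the TLV structure is walked, the following Python sketch parses a hello option payload and interprets the holdtime value. It assumes the 2-byte type and 2-byte length encoding from the PIM specification and uses option type 19 for the DR priority, also taken from the specification rather than from the list above. The function names and the sample values (a holdtime of 105 seconds and a DR priority of 1) are chosen for illustration.

```python
import struct


def parse_hello_options(payload):
    """Split a PIM hello option payload into (type, value) pairs."""
    options = []
    offset = 0
    while offset + 4 <= len(payload):
        opt_type, opt_len = struct.unpack_from("!HH", payload, offset)
        offset += 4
        value = payload[offset:offset + opt_len]
        if len(value) < opt_len:
            raise ValueError("truncated hello option")
        options.append((opt_type, value))
        offset += opt_len
    return options


def describe_holdtime(value):
    """Interpret the 2-byte value of option type 1 (holdtime)."""
    (holdtime,) = struct.unpack("!H", value)
    if holdtime == 0xFFFF:
        return "never time out the neighbor"
    if holdtime == 0:
        return "neighbor is going down or changed its IP address"
    return f"keep the neighbor for {holdtime} seconds"


# Sample payload: holdtime (type 1) followed by DR priority (type 19).
payload = struct.pack("!HHH", 1, 2, 105) + struct.pack("!HHI", 19, 4, 1)
for opt_type, value in parse_hello_options(payload):
    if opt_type == 1:
        print(describe_holdtime(value))
    else:
        print(f"option type {opt_type}, {len(value)} bytes")
```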
1. The multicast data packet arrives from the source and is sent to the supervisor.
2. The supervisor creates hardware forwarding state for the group, builds the register
message, and then sends the register message to the PIM RP.
3. Subsequent packets that the router receives from the source after the hardware
forwarding state is built are not sent to the supervisor to create register messages.
This is done to limit the amount of traffic sent to the supervisor control plane.
In contrast, a Cisco IOS PIM DR continues to send register messages until it receives a
register-stop message from the PIM RP. NX-OS provides the ip pim register-until-stop
global configuration command that modifies the default NX-OS behavior to behave like
Cisco IOS. In most cases, the default behavior of NX-OS does not need to be modified.
■ The Border Bit (B-Bit): This is set to zero on transmit and ignored on receipt (RFC
7761). RFC 4601 described PIM Multicast Border Router (PMBR) functionality that
used this bit to designate a local source when set to 0, or set to 1 for a source in a
directly connected cloud on a PMBR.
■ The Null-Register Bit: This is set to 1 if the packet is a null register message. The
null register message encapsulates a dummy IP header from the source, not the full
encapsulated packet that is present in a register message.
■ Multicast Data Packet: In a register message, this is the original packet sent by the
source. The TTL of the original packet is decremented before encapsulation into the
register message. If the packet is a null register, this portion of the register message
contains a dummy IP header containing the source and group address.
■ Group Address: This is the group address of the multicast packet encapsulated in
the register message.
■ Source Address: This is the IP address of the source in the encapsulated multicast
data packet from the register message.
Two types of group sets exist, and both types have a join source list and a prune source
list. The wildcard group set represents the entire multicast group range (224.0.0.0/4), and
the group-specific set represents a valid multicast group address. A single join-prune mes-
sage can contain multiple group-specific sets but may contain only a single instance of
the wildcard group set. A combination of a single wildcard group set and one or more
group-specific sets is also valid in the same join-prune message. The join-prune message
contains the following fields:
■ Unicast Neighbor Upstream Address: The address of the upstream neighbor that is
the target of the message.
■ Number of Groups: The number of multicast group sets contained in the message.
■ Multicast Group Address: The multicast group address identifies the group set. This
can be wildcard or group specific.
■ Number of Joined Sources: The number of joined sources for the group.
■ Joined Source Address 1 .. n: The source list that provides the sources being joined
for the group. Three flags are encoded in this field:
■ S: Sparse bit. This is set to a value of 1 for PIM SM.
■ W: Wildcard bit. This is set to 1 to indicate that the encoded source address
represents the wildcard in a (*, G) entry. When set to 0, it indicates that the
encoded source address represents the source address of an (S, G) entry.
■ R: RP Bit. When set to 1, the join is sent to the PIM RP. When set to 0, the join is
sent toward the source.
■ Number of Pruned Sources: The number of pruned sources for the group.
■ Pruned Source Address 1 .. n: The source list that provides the sources being
pruned for the group. The same three flags are found here as in the joined source
address field (S, W, R).
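The meaning of the three flags can be illustrated by decoding the flags octet of an encoded source address. The Python sketch below assumes the bit positions from the encoded-source format in the PIM specification, where S, W, and R occupy the three low-order bits in that order. The function names are hypothetical.

```python
S_BIT = 0x04  # Sparse bit
W_BIT = 0x02  # Wildcard bit
R_BIT = 0x01  # RP bit


def decode_source_flags(flags_octet):
    """Return the S, W, and R flags from the encoded source flags octet."""
    return {
        "S": bool(flags_octet & S_BIT),
        "W": bool(flags_octet & W_BIT),
        "R": bool(flags_octet & R_BIT),
    }


def describe_source_entry(flags_octet):
    """Summarize which mroute entry the source represents and where it is sent."""
    flags = decode_source_flags(flags_octet)
    entry = "(*, G)" if flags["W"] else "(S, G)"
    direction = "PIM RP" if flags["R"] else "source"
    return f"{entry} sent toward the {direction}"


print(describe_source_entry(0x07))  # S, W, and R set
print(describe_source_entry(0x04))  # only S set
```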
Note In theory, the group sets in a join-prune message could exceed the maximum IP
packet size of 65,535 bytes. In this case, multiple join-prune messages are used. It is
important to ensure that PIM neighbors have a matching L3 MTU size because a neighbor
could send a join-prune message that is too large for the receiving interface to
accommodate. This results in missing multicast state on the receiving PIM neighbor and a
broken MDT.
contents and builds a new packet to forward the bootstrap message to all PIM neigh-
bors per interface. It is possible for a bootstrap message to be fragmented into multiple
Bootstrap Message Fragments (BSMF). Each fragment uses the same format as the boot-
strap message. The PIM bootstrap message contains the following fields:
■ No-Forward Bit: Instruction that the bootstrap message should not be forwarded.
■ Fragment Tag: Randomly generated number used to distinguish BSMFs that belong
to the same bootstrap message. Each fragment carries the same value.
■ Hash Mask Length: The length, in bits, of the mask to use in the hash function.
■ BSR Priority: The priority value of the originating BSR. The value can be 0 to 255
(higher is preferred).
■ BSR Address: The address of the bootstrap router for the domain.
■ RP Address 1 .. m: The address of the candidate-RP for the corresponding group range.
■ RP1 .. m Priority: The priority of the corresponding RP and group address. This
field is copied from the candidate-RP advertisement message. The highest priority is
zero and is per RP and per group address.
■ Group Address: The group address for which the forwarder conflict needs to be
resolved.
■ Source Address: The source address for which the forwarder conflict needs to be
resolved. A value of zero indicates a (*, G) assert.
■ RPT-Bit: This value is set to 1 for (*, G) assert messages and 0 for (S, G) assert
messages.
■ Metric Preference: The preference value assigned to the unicast routing pro-
tocol that provided the route to the source or PIM RP. This value refers to the
administrative distance of the unicast routing protocol.
■ Metric: The unicast routing table metric for the route to the source or PIM RP.
■ Holdtime: The amount of time, in seconds, for which the advertisement is valid.
■ Type: The value is 10 for the PIM DF election message and has four subtypes.
■ Offer: Subtype 1. Sent by routers that believe they have a better metric to the
RPA than the metric that has been seen in offers so far.
■ Winner: Subtype 2. Sent by a router when assuming the role of the DF or when
reasserting in response to worse offers.
■ Sender Metric: The unicast routing table metric that the message sender used to
reach the RPA.
The Backoff message adds the following fields to the common election message
format:
■ Offering Address: The address of the router that made the last (best) offer.
■ Offering Metric Preference: The preference value assigned to the unicast routing
protocol that the offering router used for the route to the RPA.
■ Offering Metric: The unicast routing table metric that the offering router used to
reach the RPA.
The Pass message adds the following fields to the common election message format:
■ New Winner Address: The address of the router that made the last (best) offer.
■ New Winner Metric Preference: The preference value assigned to the unicast
routing protocol that the offering router used for the route to the RPA.
■ New Winner Metric: The unicast routing table metric that the offering router used
to reach the RPA.
version 7.2(2)D1(2)
feature pim
interface Vlan115
  ip pim sparse-mode
interface Vlan116
  ip pim sparse-mode
interface Ethernet3/17
  ip pim sparse-mode
interface Ethernet3/18
  ip pim sparse-mode
After PIM is enabled on an interface, hello packets are sent and PIM neighbors form if
there is another router on the link that is also PIM enabled.
Note The hello interval for PIM is configured in milliseconds. The minimum accepted
value is 1000 ms, which is equal to 1 second. If an interval lower than the default is needed
to detect a failed PIM neighbor, use BFD for PIM instead of a reduced hello interval.
In the output of Example 13-19, NX-1 has formed PIM neighbors with NX-3 and NX-4.
The output shows whether the neighbor is BiDIR capable and also provides the priority
value of each neighbor which is used for DR election.
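The way the DR priority value is used can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard PIM sparse-mode rule that the highest DR priority wins and the highest IP address breaks a tie. The function name, the sample addresses, and the sample priorities are hypothetical and are not taken from the example output.

```python
import ipaddress

def elect_dr(routers):
    """Return the address of the DR from (ip_address, dr_priority) pairs.

    Assumption: every router on the link advertises a DR priority, so the
    comparison is priority first, then numerically highest IP address.
    """
    return max(routers, key=lambda r: (r[1], int(ipaddress.ip_address(r[0]))))[0]

# Equal priorities: the comparison falls through to the IP address.
print(elect_dr([("10.1.13.1", 1), ("10.1.13.3", 1)]))
# A higher priority overrides the address comparison.
print(elect_dr([("10.1.13.1", 10), ("10.1.13.3", 1)]))
```

When the DR on a segment is not the expected router, comparing the priority column for each neighbor in this way is usually the quickest check.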
PIM has several interface-specific parameters that determine how the protocol oper-
ates. The specific details are viewed for each PIM enabled interface with the show ip
pim interface [interface identifier] command (see Example 13-20). The most interesting
aspects of this output for troubleshooting purposes are the per-interface statistics, which
provide useful counters for the different PIM message types and the fields related to the
hello packets. The DR election state is also useful for determining which device registers
sources on the segment for PIM sparse mode and which device forwards traffic to receiv-
ers known through IGMP membership reports.
In addition to the per-interface statistics, NX-OS provides statistics aggregated for the
entire PIM router process (global statistics). This output is viewed with the show ip pim
statistics command (see Example 13-21). These statistics are useful when troubleshooting
PIM RP-related message activity.
If a specific PIM neighbor is not forming on an interface, investigate the problem using
the event-history or Ethanalyzer facilities available in NX-OS. The show ip pim internal
event-history hello output in Example 13-22 confirms that PIM hello messages are being
sent from NX-1 and that hello messages are being received on Ethernet 3/18 from NX-3.
If additional detail about the PIM message contents is desired, the packets can be
captured using the Ethanalyzer tool (see Example 13-23). The packet detail is examined
locally using the detail option, or the capture may be saved for offline analysis with the
write option.
Capturing on inband
Frame 1: 64 bytes on wire (512 bits), 64 bytes captured (512 bits)
Encapsulation type: Ethernet (1)
Arrival Time: Oct 29, 2017 00:48:35.186687000 UTC
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1509238115.186687000 seconds
[Time delta from previous captured frame: 0.029364000 seconds]
[Time delta from previous displayed frame: 0.029364000 seconds]
[Time since reference or first frame: 3.751505000 seconds]
Frame Number: 5
Frame Length: 64 bytes (512 bits)
Capture Length: 64 bytes (512 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ip:pim]
<>
Internet Protocol Version 4, Src: 10.1.13.3 (10.1.13.3), Dst: 224.0.0.13
(224.0.0.13)
<>
Protocol Independent Multicast
0010 .... = Version: 2
.... 0000 = Type: Hello (0)
Reserved byte(s): 00
Checksum: 0x3954 [correct]
PIM options: 4
Option 1: Hold Time: 105s
Type: 1
Length: 2
Holdtime: 105s
Option 19: DR Priority: 1
Type: 19
Length: 4
DR Priority: 1
Option 22: Bidir Capable
Type: 22
Length: 0
Option 20: Generation ID: 765622359
Type: 20
Length: 4
Generation ID: 765622359
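The option layout shown in the capture (a type, a length, and a value for each option) can be illustrated with a short Python sketch. It is an offline illustration only, not part of Ethanalyzer: the function names are hypothetical, the checksum is not verified, and the sample bytes are built by hand from the values in the capture rather than taken from a real packet.

```python
import struct

# Option type numbers as they appear in the capture.
OPTION_NAMES = {1: "holdtime", 19: "dr_priority", 20: "generation_id", 22: "bidir_capable"}

def parse_pim_hello(payload: bytes) -> dict:
    """Parse a PIMv2 hello: a 4-byte header followed by type/length/value options."""
    ver_type, _reserved, _checksum = struct.unpack("!BBH", payload[:4])
    version, msg_type = ver_type >> 4, ver_type & 0x0F
    if version != 2 or msg_type != 0:
        raise ValueError("not a PIMv2 hello")
    options = {}
    offset = 4
    while offset + 4 <= len(payload):
        opt_type, opt_len = struct.unpack("!HH", payload[offset:offset + 4])
        value = payload[offset + 4:offset + 4 + opt_len]
        if len(value) != opt_len:
            raise ValueError("truncated option")
        name = OPTION_NAMES.get(opt_type, f"option_{opt_type}")
        # Zero-length options (such as Bidir Capable) act as flags;
        # the others are treated as unsigned integers.
        options[name] = True if opt_len == 0 else int.from_bytes(value, "big")
        offset += 4 + opt_len
    return options

def build_sample_hello() -> bytes:
    """Hand-built hello carrying the option values from the capture (illustrative)."""
    header = struct.pack("!BBH", 0x20, 0, 0x3954)    # version 2, type 0 (hello)
    opts = struct.pack("!HHH", 1, 2, 105)            # hold time 105 s
    opts += struct.pack("!HHI", 19, 4, 1)            # DR priority 1
    opts += struct.pack("!HH", 22, 0)                # bidir capable, no value
    opts += struct.pack("!HHI", 20, 4, 765622359)    # generation ID
    return header + opts

print(parse_pim_hello(build_sample_hello()))
```

A missing `bidir_capable` key for a neighbor is the kind of detail that explains why BiDIR does not become operational on a link.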
Note NX-OS supports PIM neighbor authentication, as well as BFD for PIM neighbors.
Refer to the NX-OS configuration guides for information on these features.
With PIM ASM, all sources are registered to the PIM RP by their local FHR. This makes
the PIM RP the device in the topology with knowledge of all sources. When a receiver
joins a group, its local router (LHR) joins the RPT. When multicast traffic arrives at the
LHR from the RPT, the source address for the group is known and a PIM join message
is sent toward the source to join the SPT. This is referred to as the SPT switchover. After
receiving traffic on the SPT, the RPT is pruned from the LHR so that traffic is arriving
only from the SPT. Each of these events has corresponding state in the mroute table,
which is used to determine the current state of the MDT for the receiver. Figure 13-13
shows an example topology configured with PIM ASM, to better visualize the events
that have occurred.
[Figure 13-13: PIM ASM example topology. Diagram not reproduced; surviving labels: LHR,
Peer Link (Port-Ch 1), SPT Root, NX-5, Source.]
Step 1. Source 10.115.1.4 starts sending traffic to group 239.115.115.1. NX-2 receives
the traffic and creates an (S,G) mroute entry for (10.115.1.4, 239.115.115.1).
Step 2. NX-2 registers the source with PIM RP NX-1 (10.99.99.1). The PIM RP creates
an (S, G) mroute and sends a register-stop message in response. NX-2 contin-
ues to periodically send null register messages to the PIM RP as long as data
traffic is arriving from the source.
Step 4. NX-4 sends a PIM join to the PIM RP NX-1 and traffic arrives on the RPT.
Step 5. NX-4 receives traffic from the RPT and then switches to the SPT by sending
a PIM join to NX-2. When NX-2 receives this PIM join message, an OIF for
Eth3/17 is added to the (S,G) mroute entry.
Step 6. Although Figure 13-13 does not explicitly show it, NX-4 prunes itself from
the RPT and traffic continues to flow from NX-2 on the SPT.
The order of these steps can vary if the receiver joins the RPT before the source is active,
but each step is required and still occurs. Knowledge of these mandatory
events can be combined with the mroute state on the FHR, LHR, PIM RP, and intermedi-
ate routers to determine exactly where the MDT is broken when a receiver is not getting
traffic. It is important to remember that multicast state is created by control plane events
in IGMP and PIM, as well as the receipt of multicast traffic in the data plane.
Note The SPT switchover is optional in PIM ASM. The ip pim spt-threshold infinity
command is used to force a device to remain on the RPT.
feature pim
interface Vlan1101
  ip pim sparse-mode
interface loopback99
  ip pim sparse-mode
interface Ethernet3/17
  ip pim sparse-mode
interface Ethernet3/18
  ip pim sparse-mode
The presence of a (*, G) state at the LHR indicates that a receiver sent a valid membership
report and that the LHR sent an RPT join toward the PIM RP, using the unicast route for
the PIM RP to choose the interface. Note that a (*, G) indicates only that some receiver
sent a membership report; the problematic receiver might not have sent one. Verify the
IGMP snooping forwarding tables on each switch that carries the VLAN to be sure that the
receiver's port is programmed to receive the traffic. A receiver host or L2 forwarding
problem can be confirmed if other receivers in the same VLAN can get the group traffic.
If the LHR has only a (*, G), it typically indicates that traffic is not arriving from the
RPT. In that case, verify the mroute state between the LHR and the PIM RP and on any
intermediate PIM routers along the tree. If the PIM RP has a valid OIF toward the LHR
and packet counts are incrementing, a data plane problem might be keeping traffic from
arriving at the LHR on the RPT, or the TTL of the packets might be expiring in transit.
Tools such as Switch Port Analyzer (SPAN) capture, the ACL hit counter, or even the
Embedded Logic Analyzer Module (ELAM) can isolate the problem to a specific device
along the RPT.
After traffic arrives at the LHR on the RPT, it attempts to switch to the SPT. This step
involves a routing table lookup for the source address to determine which PIM interface
to send the SPT join message on. The LHR has (S, G) state for the SPT at this point with
an OIL that contains the interface toward the receiver. The IIF for the SPT can be differ-
ent than the IIF for the RPT, but it does not have to be.
The LHR sends a PIM SPT join toward the source. Each intermediate router along the
path also has an (S, G) state with an OIF toward the LHR and an IIF toward the source
for the SPT. At the FHR, the IIF is the interface where the source is attached and the OIF
contains the interface on which the PIM SPT join was received, pointing in the direction
of the LHR.
The same methodology can be used to troubleshoot multicast forwarding along the
SPT. Determine whether any receivers, perhaps on another branch of the SPT, can
receive traffic. Determine which device in the SPT is the merge point where the prob-
lem branch and working branch converge. The mroute state on that device should
indicate that the interfaces for both branches are in the OIL. If they are not, verify
PIM to determine why the SPT join was not received. If the OIL does contain both
OIFs, the problem could be related to a data plane packet drop issue. In that case,
SPAN, ACL, or ELAM is the best option to isolate the problem further. When the
problem is isolated to a specific device along the tree, verify the control plane and
platform-specific hardware forwarding entries to determine the root cause of the
problem.
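The troubleshooting order described in this section can be summarized as a small decision function. The Python sketch below is a deliberate simplification for illustration: the function name, its inputs, and the returned strings are hypothetical, and it ignores cases such as a device that is intentionally held on the RPT with ip pim spt-threshold infinity.

```python
def triage_lhr(has_star_g, has_s_g, receiver_gets_traffic):
    """Suggest where to look next, based on the mroute state seen at the LHR."""
    if receiver_gets_traffic:
        return "no fault: receiver is getting traffic"
    if not has_star_g:
        # No (*, G): the membership report never produced state on the LHR.
        return "check IGMP: membership report and snooping tables on the receiver VLAN"
    if not has_s_g:
        # Only (*, G): traffic is probably not arriving from the RPT.
        return "check RPT: mroute state and data plane between the LHR and the PIM RP"
    # Both entries exist: follow the SPT toward the source.
    return "check SPT: mroute OIL on each router between the LHR and the FHR"

print(triage_lhr(has_star_g=True, has_s_g=False, receiver_gets_traffic=False))
```

The value of writing the logic down is the ordering: confirm the receiver side first, then the shared tree, then the source tree.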
NX-2 then registers this source with the PIM RP NX-1 (10.99.99.99) by sending a PIM
register message with an encapsulated data packet from the source. NX-1 receives this
register message, as the output of show ip pim internal event-history null-register in
Example 13-26 shows. The first register message has pktlen 84, which creates the mroute
state at the PIM RP. Subsequent null-register messages that do not have the encapsulated
source packet are only 20 bytes. NX-1 responds to each register message with a
register-stop.
Note NX-OS can have a separate event-history for receiving encapsulated data register
messages, depending on the version. The command is show ip pim internal event-history
data-register-receive. In older NX-OS releases, debug ip pim data-register send and
debug ip pim data-register receive are used to debug the PIM registration process.
Because no receivers currently exist in the PIM domain, NX-1 adds an (S, G) mroute with
an empty OIL (see Example 13-27). The IIF is the L3 interface between NX-1 and NX-2,
Vlan1101, which is carried over Port-channel 1. The mroute has the PIM flag to indicate
that PIM created this mroute state.
After adding the mroute entry, NX-1 sends a register-stop message back to NX-2
(see Example 13-28). NX-2 suppresses its first null register message because it has
just received a register-stop for a recent encapsulated data register message. After the
register-stop, NX-2 starts its Register-Suppression timer. Just before expiring the timer,
another null-register is sent. If the timer expires without a register stop from the RP, the
DR resumes sending full encapsulated packets.
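The register behavior on the FHR can be pictured as a small state machine. The sketch below is a simplified model, assuming the state names used by the PIM sparse-mode specification (Join, Prune, Join-Pending); the event names and the table layout are hypothetical, and real timers are not modeled.

```python
# (state, event) -> (next_state, action); unlisted pairs leave the state unchanged.
TRANSITIONS = {
    # Register-stop received while sending encapsulated registers.
    ("join", "register_stop"): ("prune", "start Register-Suppression timer"),
    # Shortly before the suppression timer expires, probe the RP.
    ("prune", "probe_time"): ("join_pending", "send null-register"),
    # The RP answers the null-register, so stay suppressed.
    ("join_pending", "register_stop"): ("prune", "restart Register-Suppression timer"),
    # No register-stop arrived before the timer expired.
    ("join_pending", "timer_expired"): ("join", "resume encapsulated registers"),
}

def step(state, event):
    """Return (next_state, action) for one event on the FHR register state machine."""
    return TRANSITIONS.get((state, event), (state, None))

state = "join"
for event in ["register_stop", "probe_time", "register_stop", "probe_time", "timer_expired"]:
    state, action = step(state, event)
    print(event, "->", state, "|", action)
```

In a healthy network the FHR cycles between the suppressed and pending states; returning to full encapsulated registers suggests that register-stop messages from the RP are being lost.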
The source has been successfully registered with the PIM RP. This state persists until
a receiver joins the group, with NX-2 periodically informing NX-1 via null register
messages that the source is still actively sending to the group address.
A receiver in VLAN 215 connected to NX-4 sends a membership report to initiate the
flow of multicast for the 239.115.115.1 group. When this message arrives at NX-4, it
triggers the creation of a (*, G) mroute entry by IGMP with an OIL containing VLAN
215 (see Example 13-29). The IIF Ethernet 3/29 is the interface used to reach the PIM RP
address on NX-1.
The mroute entry corresponds to a PIM RPT join being sent from NX-4 toward NX-1
(see Example 13-30).
When NX-1 receives this RPT Join from NX-4, the OIF Ethernet 3/17 is added to the OIL
of the mroute (see Example 13-31).
The receipt of the join triggers the creation of a (*, G) mroute state on NX-1 and also trig-
gers a join from NX-1 to NX-2 over VLAN 1101 for the source (see Example 13-32).
The result of this join from NX-1 to NX-2 is that NX-2 adds an OIF of VLAN 1101 (see
Example 13-33).
Traffic now flows from the source, through NX-2 toward NX-1. NX-1 receives the
traffic and forwards it through the RPT to NX-4. At NX-4, traffic is now received on
the RPT and the SPT switchover occurs, as seen in the PIM event-history output in
Example 13-34. NX-4 first sends the SPT join to NX-2 (10.2.23.2) and then prunes itself
from the RPT to NX-1 (10.2.13.1).
The resulting mroute state on NX-4 is that the (S, G) was created and the OIL contains
VLAN215. The IIF for the (S, G) points toward NX-2, while the IIF for the (*, G) points to
the PIM RP at NX-1. Example 13-35 shows the show ip mroute output from NX-4.
NX-2 has an (S, G) mroute with the IIF of VLAN 115 and the OIF of Ethernet 3/17 that
is connected to NX-4. Example 13-36 shows the mroute state of NX-2.
NX-1 has (*, G) state from NX-4 but no OIF for the (S, G) state. Example 13-37 contains
the mroute table of NX-1 after the SPT switchover. The IIF of the (*, G) is the RP
interface of Loopback99, which is the root of the RPT.
As the previous section demonstrates, the mroute state and the event-history in NX-OS
make it possible to determine whether the problem involves the RPT or the SPT and to
determine which device along the tree is causing trouble.
An example verification is provided here for reference using NX-2, which is a Nexus 7700
with an F3 module. The verification steps provided here are similar on other NX-OS plat-
forms until the Input/Output (I/O) module is reached. When troubleshooting reaches that
level, the verification commands vary significantly, depending on the platform.
The platform-independent (PI) components, such as the mroute table, the mroute table
clients (PIM, IGMP, and MSDP), and the Multicast Forwarding Distribution Manager
(MFDM), are similar across NX-OS platforms. The way that those entries get programmed
into the forwarding and replication ASICs varies. Troubleshooting to the ASIC program-
ming level is best left to Cisco TAC because it is easy to misinterpret the information pre-
sented in the output without a firm grasp on the platform-dependent (PD) architecture.
The mroute provides the IIF and OIF, dictating which modules need to be verified.
Knowing which modules are involved is important because the Nexus 7000 series per-
forms egress replication for multicast traffic. With egress replication, packets arrive on
the ingress module and a copy of the packet is sent to any local receivers on the same I/O
module. Another copy of the packet is directed to the fabric toward the I/O module of
the interfaces in the OIL of the mroute. When the packet arrives at the egress module,
another lookup is done to replicate the packet to the egress interfaces.
The OIL contains L3 interface Ethernet 3/17, and the IIF is VLAN 115. To confirm which
physical interface the traffic is arriving on in VLAN 115, the ARP cache and MAC
address table entries are checked for the multicast source. The show ip arp command
provides the MAC address of the source (see Example 13-39).
IP ARP Table
Total number of entries: 1
Address         Age       MAC Address     Interface
10.115.1.4      00:10:53  64a0.e73e.12c2  Vlan115
Now check the MAC address table to confirm which interface packets should be arriving
on from 10.115.1.4. Example 13-40 shows the output of the MAC address table.
Example 13-40 MAC Address Table Entry for the Multicast Source
Note: MAC table entries displayed are getting read from software.
Use the 'hardware-age' keyword to get information related to 'Age'
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
age - seconds since last seen,+ - primary entry using vPC Peer-Link, E - EVPN entry
(T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve age info
VLAN/BD MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 115 64a0.e73e.12c2 dynamic ~~~ F F Eth3/19
It has now been confirmed that packets are coming into NX-2 on Ethernet 3/19 and
egressing on Ethernet 3/17 toward NX-4. The next step in the verification is to check the
MFDM entry for the group to ensure that it is present with the correct IIF and OIL (see
Example 13-41).
The MFDM entry looks correct. The remaining steps are performed from the LC console,
which is accessed with the attach module [module number] command. If the verification
is being done in a nondefault VDC, it is important to use the vdc [vdc number] command
to enter the correct context after logging into the module. After logging into the correct
ingress module, confirm the correct L3LKP ASIC.
Note Verification can be completed without logging into the I/O module by using the
slot [module number] quoted [LC CLI command] to obtain output from the module.
The F3 module uses a switch-on-chip (SOC) architecture, where groups of front panel
ports are serviced by a single SOC. Example 13-42 demonstrates this mapping with the
show hardware internal dev-port-map command.
In this particular scenario, the ingress port and egress port are using the same SOC
instance (2), and are on the same module. If the module or SOC instance were different,
each SOC on each module would need to be verified to ensure that the correct
information is present.
With the SOC numbers confirmed for the ingress and egress interfaces, now check the
forwarding entry on the I/O module. This entry has the correct incoming interface of
Vlan115 and the correct OIL, which contains Ethernet 3/17 (see Example 13-43). Verify
the outgoing packets counter to ensure that it is incrementing periodically.
All information so far has the correct IIF and OIF, so the final step is to check the
programming from the SOC (see Example 13-44).
Cisco TAC should interpret the various fields present. These fields represent the pointers to
the various table lookups required to replicate the multicast packet locally, or to the fabric if
the egress interface is on a different module or SOC. Verification of these indexes requires
multiple ELAM captures at the various stages of forwarding lookup and replication.
PIM Bidirectional
PIM BiDIR is another version of PIM SM in which several modifications to traditional
ASM behavior have been made. The differences between PIM ASM and PIM BiDIR follow:
■ BiDIR uses bidirectional shared trees, whereas ASM relies on unidirectional shared
and source trees.
■ BiDIR does not use any (S, G) state. ASM must maintain (S, G) state for every source
sending traffic to a group address.
■ BiDIR does not need any source registration process, which reduces processing
overhead.
■ Both ASM and BiDIR must have every group mapped to a rendezvous point (RP).
The RP in BiDIR does not actually do any packet processing. In BiDIR, the RP
address (RPA) is just a route vector that is used as a reference point for forwarding up
or down the shared tree.
■ BiDIR uses the concept of a Designated Forwarder (DF) that is elected on every link
in the PIM domain.
Because BiDIR does not require any (S, G) state, only a single (*, G) mroute entry is
required to represent a group. This can dramatically reduce the number of mroute
entries in a network with many sources, compared to ASM. With a reduction of mroute
entries, the potential scalability of the network is higher because any router platform has
a finite number of table entries that can be stored before resources become exhausted.
The increase in scale does come with a trade-off of losing visibility into the traffic of
individual sources because there is no (S, G) state to track them. However, in very large,
many-to-many environments, this downside is outweighed by the reduction in state and
the elimination of the registration process.
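The reduction in state can be made concrete with simple arithmetic. The sketch below is a rough model under stated assumptions: a router on the tree holds one (*, G) entry per group, plus one (S, G) entry per active source in ASM. The function names and the sample numbers are hypothetical, and platform-specific entries are ignored.

```python
def asm_mroute_entries(groups, sources_per_group):
    """One (*, G) per group plus one (S, G) per source sending to that group."""
    return groups * (1 + sources_per_group)

def bidir_mroute_entries(groups):
    """BiDIR keeps only the (*, G) entry for each group."""
    return groups

groups, sources = 100, 50
print("ASM entries:  ", asm_mroute_entries(groups, sources))
print("BiDIR entries:", bidir_mroute_entries(groups))
```

With these sample numbers the ASM design needs 5100 entries and the BiDIR design needs 100, which shows why many-to-many applications benefit most.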
BiDIR has important terminology that must be defined before looking further into how it
operates. Table 13-10 provides these definitions.
PIM neighbors that can understand BiDIR set the BiDIR capable bit in their PIM hello
messages. This is a foundational requirement for BiDIR to become operational. As the
PIM process becomes operational on each router, the group-to-RP mapping table is
populated by either static configuration or through Auto-RP or BSR. When the RPA(s)
are known, the router determines its unicast routing metric for the RPA(s) and moves to
the next phase, to elect the DF on each interface.
Initially, all routers begin sending PIM DF election messages that carry the offer sub-
type. The offer message contains the sending router’s unicast routing metric to reach
the RPA. As these messages are exchanged, all routers on the link become aware of each
other and what each router’s metric is to the RPA. If a router receives an offer message
with a better metric, it stops sending offer messages, to allow the router with the bet-
ter metric to become elected as the DF. However, if the DF election does not occur, the
election process restarts. The result of this initial DF election should be that all routers
except for the one with the best metric stop sending offer messages. This allows the
router with the best metric to assume the DF role after sending three offers and not
receiving additional offers from any other neighbor. After assuming the DF role, the
router transmits a DF election message with the winner subtype, which tells all routers
on the link which device is the DF and informs them of the winning metric.
During normal operation, a new router might come online or metrics toward the RPA
could change. This essentially results in offer messages sent to the current DF. If the
current DF still has the best metric to the RPA, it responds with a winner message. If
the received metric is better than the current DF, the current DF sends a backoff mes-
sage. The backoff message tells the challenging router to wait before assuming the DF
role so that all routers on the link have an opportunity to send an offer message. During
this time, the original DF is still acting as the DF. After the new DF is elected, the old
DF transmits a DF election message with the pass subtype, which hands over the DF
responsibility to the new winner. After the DF is elected, the PIM BiDIR network is
ready to begin forwarding multicast packets bidirectionally using shared trees rooted at
the RPA.
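The metric comparison behind the offer, winner, and backoff messages can be sketched in Python. This is a minimal model, assuming the comparison order of lower metric preference, then lower metric, then higher IP address as the tie breaker. The class and function names are hypothetical, and the sample metric values are invented for illustration.

```python
import ipaddress
from typing import NamedTuple

class Offer(NamedTuple):
    address: str            # address of the offering router
    metric_preference: int  # administrative distance of the route to the RPA
    metric: int             # unicast metric to the RPA

def sort_key(offer):
    # Lower preference and metric are better; a higher address wins a tie,
    # so the address is negated to keep "smaller key is better".
    return (offer.metric_preference, offer.metric,
            -int(ipaddress.ip_address(offer.address)))

def is_better(a, b):
    """True if offer a beats offer b."""
    return sort_key(a) < sort_key(b)

def elect_df(offers):
    """Return the offer that wins the initial election on a link."""
    return min(offers, key=sort_key)

def df_reply(current_df, challenger):
    """Subtype the current DF sends in response to a received offer."""
    return "backoff" if is_better(challenger, current_df) else "winner"

nx1 = Offer("10.2.13.1", 0, 0)      # hypothetical: RPL is directly attached
nx4 = Offer("10.2.13.3", 110, 41)   # hypothetical: learned through an IGP
print(elect_df([nx4, nx1]).address)
print(df_reply(current_df=nx1, challenger=nx4))
```

The pass message and the wait period that follows a backoff are not modeled here; only the comparison that decides which message is sent.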
Packets arriving from a downstream link are forwarded upstream until they reach the
router with the RPL, which contains the RPA. Because no registration process occurs
and no switchover to an SPT takes place, the RPA does not need to be on a router. This
is initially confusing, but it works because packets are forwarded out the RPL toward
the RPA, and (*, G) state is built from every FHR connected to a source and from every
LHR with an interested receiver toward the RPA. In other words, with BiDIR, packets do
not have to actually traverse the RP as they do in ASM. The intersecting branches of the
bidirectional (*, G) tree can distribute multicast directly between source and receiver.
In NX-OS, up to eight BiDIR RPAs are supported per VRF. Redundancy for the RPA is
achieved using a concept referred to as a phantom RP. The term is used because the RPA
is not assigned to any router in the PIM domain. For example, assume an RPA address of
10.1.1.1. NX-1 could have a 10.1.1.0/30 subnet configured on its Loopback10 interface, and NX-3
could have a 10.1.1.0/29 subnet configured on its Loopback10 interface. All routers in the PIM
domain follow the longest-prefix-match rule in their routing table to prefer NX-1. If NX-1
failed, NX-3 would then become the preferred path to the RPL and thus the RP as soon
as the unicast routing protocol converges.
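The failover behavior of the phantom RP follows directly from longest-prefix matching, which the following Python sketch reproduces with the prefixes from the example. The routing table is modeled as a plain dictionary and the function name is hypothetical; a real router also considers administrative distance and metric.

```python
import ipaddress

def best_route(destination, routes):
    """Return the next hop of the longest matching prefix, or None if no route matches."""
    addr = ipaddress.ip_address(destination)
    matches = [(ipaddress.ip_network(prefix), next_hop)
               for prefix, next_hop in routes.items()
               if addr in ipaddress.ip_network(prefix)]
    if not matches:
        return None
    return max(matches, key=lambda match: match[0].prefixlen)[1]

rpa = "10.1.1.1"
routes = {"10.1.1.0/30": "NX-1", "10.1.1.0/29": "NX-3"}
print(best_route(rpa, routes))   # the /30 from NX-1 is the longer match

del routes["10.1.1.0/30"]        # NX-1 fails and its route is withdrawn
print(best_route(rpa, routes))   # the /29 from NX-3 takes over
```

Convergence time for the RPA is therefore the convergence time of the unicast routing protocol; PIM itself takes no part in the failover.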
[Figure: PIM BiDIR example topology. Diagram not reproduced; surviving labels: Receiver,
NX-3, NX-4, NX-5, Eth3/28, Eth3/29, Bidirectional Shared Tree, RPT Join, RPL, Source.]
When a receiver attached to VLAN 215 on NX-4 joins 239.115.115.1, a (*, G) mroute
entry is created on NX-4. On the link between NX-4 and NX-1, NX-1 is the elected DF
because it has a better unicast metric to the RPA. Therefore the (*, G) join from NX-4 is
sent to NX-1 upstream toward the primary RPA.
NX-1 and NX-3 are both configured with a link (Loopback99) to the phantom RP
10.99.99.99. However, NX-1 has a more specific route to the RPA through its RPL and is
used by all routers in the topology to reach the RPA.
When 10.115.1.4 begins sending multicast traffic to 239.115.115.1, the traffic arrives on
VLAN 115 on NX-2. Because NX-2 is the elected DF on VLAN 115, the traffic is for-
warded upstream toward the RPA on its RPF interface, VLAN 1101. NX-1 is the elected
DF for VLAN 1101 between NX-2 and NX-1 because it has a better metric to the RPA.
NX-1 receives the traffic from NX-2 and forwards it based on the current OIL for its
(*, G) mroute entry. The OIL contains both the Ethernet 3/17 link to NX-4 and also the
Loopback99 interface, which is the RPL. As traffic flows from the source to the receiver,
the shared tree is used end to end, and NX-4 never uses the direct link it has to NX-2
because no SPT switchover takes place with BiDIR. No source needs to be registered
with a PIM RP and no (S, G) state needs to be created because all traffic for the group
flows along the shared tree.
BiDIR Configuration
The configuration for PIM BiDIR is similar to the configuration of PIM ASM. PIM
sparse mode must be enabled on all interfaces. The BiDIR capable bit is set in PIM hello
messages by default, so no interface-level command is required to specifically enable PIM
BiDIR. An RP is designated as a BiDIR RPA when it is configured with the bidir keyword
in the ip pim rp-address [RP address] group-list [groups] bidir command.
Example 13-45 shows the phantom RPA configuration that was previously described.
Loopback99 is the RPL, which is configured with a subnet that contains the RPA. The
RPA is not actually configured on any router in the topology, which is a major differ-
ence between PIM BiDIR and PIM ASM. This RPA is advertised to the PIM domain with
OSPF; because you want OSPF to advertise the link as 10.99.99.96/29, the ip ospf
network point-to-point command is used. This forces OSPF on NX-1 to advertise this as
a stub-link in the type 1 router link-state advertisement (LSA).
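The relationship between the RPL subnet and the phantom RPA can be checked offline with Python's ipaddress module. The sketch uses the Loopback99 address that NX-1 is configured with (10.99.99.98/29) and the RPA 10.99.99.99 from this topology; the helper name is hypothetical.

```python
import ipaddress

def is_phantom_rpa(rpa, rpl_interfaces):
    """True if the RPA falls inside an RPL subnet but is not assigned to any interface."""
    covered = any(rpa in interface.network for interface in rpl_interfaces)
    assigned = any(rpa == interface.ip for interface in rpl_interfaces)
    return covered and not assigned

rpl = ipaddress.ip_interface("10.99.99.98/29")   # NX-1 Loopback99
rpa = ipaddress.ip_address("10.99.99.99")        # phantom RPA

print(rpl.network)                 # the prefix OSPF advertises for the RPL
print(is_phantom_rpa(rpa, [rpl]))
```

The same check catches a common mistake: if the RPA is accidentally configured as the loopback address itself, it is no longer a phantom address.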
feature pim
interface Vlan1101
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface loopback99
  ip pim sparse-mode
interface Ethernet3/17
  ip pim sparse-mode
interface Ethernet3/18
  ip pim sparse-mode
NX-1# show run interface loopback99
! Output omitted for brevity
interface loopback99
  ip address 10.99.99.98/29
  ip ospf network point-to-point
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
NX-1# show ip pim group-range 239.115.115.1
PIM Group-Range Configuration for VRF "default"
Group-range Action Mode RP-address Shrd-tree-range Origin
Note All other routers in the topology have the same BiDIR-specific configuration,
which is the static RPA with the BiDIR keyword. NX-1 and NX-3 are the only routers con-
figured with an RPL to the RPA.
BiDIR Verification
To understand the mroute state and BiDIR events, verification begins from NX-4, where a
receiver is connected in VLAN 215. Example 13-46 gives the output of show ip mroute
from NX-4, which is the LHR. The (*, G) mroute was created as a result of the IGMP
membership report from the receiver. Because this is a bidirectional shared tree, notice
that the RPF interface Ethernet 3/29 used to reach the RPA is also included in the OIL for
the mroute.
The DF election process in BiDIR determines which PIM router on each interface is
responsible for sending join-prune messages and routing packets from upstream to
downstream and vice versa on the bidirectional shared tree. The output of show ip pim
df provides a concise view of the current DF state on each PIM-enabled interface (see
Example 13-47). On VLAN 215, this router is the DF; on the RPF interface toward the
RPA, this router is not the DF because the peer has a better metric to the RPA.
If additional detail is needed about the BiDIR DF election process, the output of show ip
pim internal event-history bidir provides information on the interface state machine and
its reaction to the received PIM DF election messages. Example 13-48 shows the event-
history output from NX-4. The DF election is seen for VLAN 215; no other offers are
received and NX-4 becomes the winner. On Ethernet 3/29, NX-4 (10.2.13.3) has a worse
metric (-1/-1) than the current DF (10.2.13.1) and does not reply with an offer message.
This allows NX-1 to become the DF on this interface.
Because NX-4 is the DF election winner on VLAN 215, it sends a PIM join for the
shared tree to the DF on the RPF interface Ethernet 3/29. The show ip pim internal
event-history join-prune command is used to view these events (see Example 13-49 for
the output).
In addition to the detailed information in the event-history output, the interface statistics
can be checked to view the total number of BiDIR messages that were exchanged (see
Example 13-50).
The next hop in the bidirectional shared tree is NX-1, which is NX-4’s RPF neighbor to
the RPA. The join-prune event-history confirms that the (*, G) join was received from
NX-4 (see Example 13-51).
The mroute state for NX-1 contains Ethernet3/17 as well as Loopback99, which is the
RPL in Example 13-52. All groups that map to the RPA are forwarded on the RPL toward
the RPA.
Example 13-53 gives the output of show ip pim df. Because the RPL is local to this
device, it is the DF winner on all interfaces except for the RPL. No DF is elected on the
RPL in PIM BiDIR.
No (S, G) join exists from the RPA toward the source as there would have been in PIM
ASM. In BiDIR, all traffic from the source is forwarded from NX-2, which is the FHR
toward the RPA. Therefore, a join from NX-1 to NX-2 is not required to pull the traffic to
NX-1 across VLAN1101. This fact highlights one troubleshooting disadvantage of BiDIR.
No visibility from the RPA to the FHR is available about this particular source because
the (S, G) state does not exist.
An ELAM capture can be used on NX-1 to verify that traffic is arriving from NX-2.
Another useful technique is to configure a permit line in an ACL to match the traffic.
Configure the ACL with statistics per-entry, which provides a counter to verify
that traffic has arrived. In the output of Example 13-54, the ACL named verify was
configured to match the source connected on NX-2. The ACL is applied ingress on
VLAN 1101, which is the interface traffic should be arriving on.
interface Vlan1101
  description L3 to 7009-B-NX-2
  no shutdown
  mtu 9216
  ip access-group verify in
  no ip redirects
  ip address 10.1.11.1/30
  no ipv6 redirects
  ip ospf cost 1
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
NX-1# show access-list verify
In this exercise, the source is connected to NX-2, so the mroute entry can be verified
to ensure that VLAN 1101 to NX-1 is included in the OIL. Example 13-55 shows the
mroute from NX-2. The mroute entry covers all groups mapped to the RPA.
Because NX-2 is the DF winner on VLAN 115, it is responsible for forwarding multicast
traffic from VLAN 115 toward the RPF interface for the RPA that is on VLAN 1101.
With BiDIR, NX-2 has no need to register its source with the RPA; it simply forwards
traffic from VLAN 115 up the bidirectional shared tree.
This section explained PIM BiDIR and detailed how to confirm the DF and mroute
entries at each multicast router participating in the bidirectional shared tree. BiDIR and
ASM have several differences with respect to multicast state and forwarding behavior.
When faced with troubleshooting a BiDIR problem, it is important to know which RPA
should be used for the group and which devices along the tree are functioning as the
DF. It should then be possible to trace from the receiver toward the source and isolate
the problem to a particular device along the path.
PIM RP Configuration
When PIM SM is configured for ASM or BiDIR, each multicast group must map to a
PIM RP address. This mapping must be consistent in the network, and each router in the
PIM domain must know the RP address–to–group mapping. Three options are available
for configuring the PIM RP address in a multicast network:
1. Static RP: The RP address-to-group mapping is configured manually on each router in
the PIM domain.
2. Auto-RP: PIM RPs announce themselves to a mapping agent. The mapping agent
advertises the RP to group mapping to all routers in the PIM domain. Cisco created
Auto-RP before the PIM BSR mechanism was standardized.
3. BSR: Candidate RPs announce themselves to the bootstrap router. The bootstrap
router advertises the group to RP mapping in a bootstrap message to all routers in
the PIM domain.
Static RP Configuration
Static RP is the simplest mechanism to implement. Each router in the domain is
configured with a PIM RP address, as shown in Example 13-56.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode
The simplicity has drawbacks, however. Any change to the group mapping requires the
network operator to update the configuration on each router. In addition, a single static
PIM RP could become a scalability bottleneck as hundreds or thousands of sources are
being registered. If the network is small in scale, or if a single PIM RP address is being
used for all groups, a static RP could be a good option.
Multiple mapping agents could exist in the network, so a deterministic method is needed
to determine which mapping agent routers should listen to. Routers in the network use
the mapping agent with the highest IP address to populate their group-to-RP mapping
tables. See Figure 13-15 for the topology used here to discuss the operation and
verification of Auto-RP.
[Figure 13-15: Auto-RP topology. NX-1 (Loopback99, 10.99.99.99/32) is an Auto-RP candidate
for 224.0.0.0/4; NX-3 (Loopback1, 10.3.3.3/32) is an Auto-RP candidate for 239.0.0.0/8 and
a mapping agent (10.2.1.3); NX-4 is a mapping agent (10.2.2.3); NX-2 is an Auto-RP
listener. RP-announce messages use 224.0.1.39, and RP-discovery messages use 224.0.1.40.]
populate the local RP-to-group mapping information. This example was built to illustrate
the fact that multiple candidate RPs (and multiple mapping agents) can coexist.
When the PIM domain has overlapping or conflicting information, such as two candidate RPs
announcing the same group, the mapping agent must decide which RP is advertised in the
RP-discovery messages. The tie-breaking rule is as follows:
1. Choose the RP announcing the most specific group range (the longest mask).
2. If the groups are announced with an equal number of mask bits, choose the RP with
the higher IP address.
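The mapping agent preference and the RP tie-breaking rule can be sketched in a few lines
of Python. This is an illustrative model only, not NX-OS code. The function names are
hypothetical, the sample addresses are borrowed from Figure 13-15, and the sketch assumes
that a longer mask is preferred before the higher RP address is used as the tie breaker.

```python
import ipaddress


def preferred_mapping_agent(agents):
    """Routers listen to the mapping agent with the highest IP address."""
    return max(agents, key=ipaddress.ip_address)


def select_rp(group, announcements):
    """Pick the RP a mapping agent would advertise for one group.

    announcements is a list of (rp_address, group_range) strings.
    Assumed rule: longest mask first, then the higher RP address.
    """
    g = ipaddress.ip_address(group)
    matches = [
        (ipaddress.ip_network(rng).prefixlen, ipaddress.ip_address(rp))
        for rp, rng in announcements
        if g in ipaddress.ip_network(rng)
    ]
    if not matches:
        return None  # no candidate RP announced a range covering this group
    # Tuples compare by mask length first, then by RP address.
    return str(max(matches)[1])


announcements = [("10.99.99.99", "224.0.0.0/4"), ("10.3.3.3", "239.0.0.0/8")]
print(preferred_mapping_agent(["10.2.1.3", "10.2.2.3"]))  # 10.2.2.3
print(select_rp("239.1.1.1", announcements))              # 10.3.3.3
print(select_rp("225.1.1.1", announcements))              # 10.99.99.99
```

Comparing the selection computed this way with the output of show ip pim rp on each
router is one way to spot a device that is listening to an unexpected mapping agent.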
Example 13-57 shows the PIM configuration for NX-1. The ip pim auto-rp rp-candidate
command configures NX-1 to send Auto-RP RP-announce messages with a TTL of 16 for
all multicast groups. NX-OS does not listen to or forward Auto-RP messages by default.
The ip pim auto-rp forward listen command instructs the device to listen for and
forward the Auto-RP groups 224.0.1.39 and 224.0.1.40. The local PIM RP-to-group mapping
is shown with the show ip pim rp command. It displays the current group mapping for
each RP, along with the RP-source, which is the mapping agent NX-4 (10.2.2.3).
feature pim
interface Vlan1101
  ip pim sparse-mode
interface loopback99
  ip pim sparse-mode
interface Ethernet3/17
  ip pim sparse-mode
interface Ethernet3/18
  ip pim sparse-mode
The group range can be configured for additional granularity using the group-list,
prefix-list, or route-map options.
Note The interface used as an Auto-RP candidate-RP or mapping agent must be
configured with ip pim sparse-mode.
Example 13-58 shows the Auto-RP mapping agent configuration from NX-4. This
configuration results in NX-4 sending RP-discovery messages with a TTL of 16. In the output
of show ip pim rp, because NX-4 is the current mapping agent, a timer is displayed to
indicate when the next RP-discovery message will be sent.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode
NX-4# show ip pim rp
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP RPA: 10.2.2.3*, next Discovery message in: 00:00:29
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None
Note Do not use an anycast IP address for the mapping agent address. This could result
in frequent refreshing of the RP mapping in the network.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface loopback1
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode
NX-3# show ip pim rp
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP RPA: 10.2.2.3, uptime: 01:21:50, expires: 00:02:49
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None
Finally, NX-2 is configured simply to act as an Auto-RP listener and forwarder.
Example 13-60 shows the configuration, which allows NX-2 to receive the Auto-RP
RP-discovery messages from NX-4 and NX-3.
feature pim
interface Vlan115
  ip pim sparse-mode
interface Vlan116
  ip pim sparse-mode
interface Vlan1101
  ip pim sparse-mode
interface Ethernet3/17
  ip pim sparse-mode
interface Ethernet3/18
  ip pim sparse-mode
NX-2# show ip pim rp
PIM RP Status Information for VRF "default"
BSR disabled
Auto-RP RPA: 10.2.2.3, uptime: 00:07:29, expires: 00:02:25
BSR RP Candidate policy: None
BSR RP policy: None
Because the Auto-RP messages are bound by their configured TTL scope, care must be
taken to ensure that all RP-announce messages can reach all mapping agents in the
network. It is also important to ensure that the scope of the RP-discovery messages is large
enough for all routers in the PIM domain to receive the messages. If multiple mapping
agents exist and the TTL is misconfigured, it is possible to have inconsistent RP-to-group
mapping throughout the PIM domain, depending on the proximity to the mapping agent.
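A rough way to reason about TTL scope is to compare the configured TTL with the hop
distance to each router. The sketch below is a simplified model under stated
assumptions: the function name and the hop counts are hypothetical, and a router is
treated as reachable when its hop count from the sender does not exceed the TTL.

```python
def routers_in_scope(ttl, hops_from_sender):
    """Return the routers that a TTL-scoped Auto-RP message can reach.

    hops_from_sender maps a router name to its hop distance from the
    sender. Each forwarding router decrements the TTL, so this model
    treats a router as reachable when hops <= ttl.
    """
    return sorted(name for name, hops in hops_from_sender.items() if hops <= ttl)


# Hypothetical hop counts from a mapping agent.
hops = {"NX-1": 1, "NX-2": 2, "remote-edge": 20}
print(routers_in_scope(16, hops))  # ['NX-1', 'NX-2']
print(routers_in_scope(32, hops))  # ['NX-1', 'NX-2', 'remote-edge']
```

A router that falls outside the scope of one mapping agent but inside the scope of
another is the kind of device that ends up with an inconsistent RP-to-group mapping.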
Auto-RP state is dynamic and must be refreshed periodically by sending and receiving
RP-announce and RP-discovery messages in the network. If RP state is lost on a device or
is incorrect, the investigation should follow the appropriate Auto-RP message back to its
source to identify any misconfiguration. The NX-OS event-history and Ethanalyzer
utilities are the primary tools for finding the root cause of the problem.
BSR relies on candidate-RPs (C-RPs) and a bootstrap router (BSR), which is elected based
on the highest priority. If priority is equal, the highest IP address is used as a tie breaker
to elect a single BSR. When a router is configured as a candidate-BSR (C-BSR), it begins
sending bootstrap messages that allow all the C-BSRs to hear each other and determine
which should become the elected BSR. After the BSR is elected, it should be the only
router sending bootstrap messages in the PIM domain.
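The BSR election rule can be expressed as a single comparison. The following is a
minimal sketch, not the NX-OS implementation; the function name is hypothetical, and the
candidate addresses and the priority of 64 are taken from the BSR example in this
section.

```python
import ipaddress


def elect_bsr(candidates):
    """Elect a BSR from a list of (c_bsr_address, priority) tuples.

    The highest priority wins; if priorities are equal, the highest
    IP address is used as the tie breaker.
    """
    winner = max(candidates, key=lambda c: (c[1], ipaddress.ip_address(c[0])))
    return winner[0]


# Equal priority: the higher address wins.
print(elect_bsr([("10.2.1.3", 64), ("10.2.2.3", 64)]))   # 10.2.2.3
# A higher priority overrides the address comparison.
print(elect_bsr([("10.2.1.3", 100), ("10.2.2.3", 64)]))  # 10.2.1.3
```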
C-RPs listen for bootstrap messages from the elected BSR to discover the unicast address
the BSR is using. This allows the C-RPs to announce themselves to the elected BSR by
sending unicast candidate-RP messages. The messages from the C-RP include the RP
address and groups for which it is willing to become an RP, along with other details, such
as the RP priority. The BSR receives RP information from all C-RPs and then builds a PIM
bootstrap message to advertise this information to the rest of the network. The same
bootstrap message that is used to advertise the list of group-to-RP mappings in the
network is also used by C-BSRs to determine the elected BSR, offering a streamlined
approach. This approach also allows another C-BSR to assume the role of the elected BSR
in case the active BSR stops sending bootstrap messages for some reason.
Until now, the process sounds similar to Auto-RP. However, unlike the Auto-RP mapping
agent, the BSR does not attempt to perform any selection of RP-to-group mappings to
include in the bootstrap message. Instead, the BSR includes the data received from all
C-RPs in the bootstrap message.
When a router receives the bootstrap message from the BSR, it must determine which RP
address will be used for each group range. This process is summarized as follows:
1. Perform a longest match on the group range and mask length to obtain a list of RPs.
2. From that list, keep the RPs with the best (numerically lowest) priority value.
3. If only one RP remains, the RP selection process is finished for that group range.
4. If multiple RPs are in the list, use the PIM hash function to choose the RP.
The hash function is applied when multiple RPs for a group range have the same longest
match mask length and priority. The hash function on each router in the domain returns
the same result so that a consistent group-to-RP mapping is applied in the network.
Section 4.7.2 of RFC 4601 describes the hash function as follows:
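Value(G,M,C(i)) = (1103515245 * ((1103515245 * (G&M) + 12345) XOR C(i)) + 12345) mod 2^31
■ G = The multicast group address
■ M = The hash length provided by the bootstrap message from the BSR
■ C(i) = The address of the candidate RP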
The calculation is done for each C-RP matching the group range, and it returns a hash
value for that C-RP. The RP with the highest calculated hash value is chosen for
the group. If two C-RPs happen to have the same hash result, the RP with the higher IP
address is used. The default hash length of 30 results in four consecutive multicast group
addresses being mapped to the same RP address.
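The hash calculation is easy to reproduce offline. The sketch below implements the
formula from Section 4.7.2 of RFC 4601 in Python. The function names are hypothetical,
and the two C-RP addresses are assumed, for illustration only, to be candidates for the
same group range with equal priority. It is not a substitute for the show ip pim rp-hash
command on the device.

```python
import ipaddress


def hash_value(group, hash_len, crp):
    """Compute the RFC 4601 section 4.7.2 hash value for one candidate RP."""
    g = int(ipaddress.ip_address(group))
    c = int(ipaddress.ip_address(crp))
    # Convert the hash length (0 to 32) into a 32-bit mask.
    mask = (0xFFFFFFFF << (32 - hash_len)) & 0xFFFFFFFF
    inner = (1103515245 * (g & mask) + 12345) ^ c
    return (1103515245 * inner + 12345) % 2**31


def select_rp(group, hash_len, crps):
    """Highest hash value wins; the higher RP address breaks a tie."""
    return max(
        crps,
        key=lambda rp: (hash_value(group, hash_len, rp), ipaddress.ip_address(rp)),
    )


crps = ["10.3.3.3", "10.99.99.99"]  # assumed to cover the same group range
groups = ["239.1.1.0", "239.1.1.1", "239.1.1.2", "239.1.1.3"]
# With a hash length of 30, these four groups share the same masked value,
# so they all select the same RP.
print({select_rp(g, 30, crps) for g in groups})
```

Because the mask removes the low-order two bits of the group address, the four
consecutive groups produce identical hash values for a given C-RP, which is why they map
to the same RP.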
The topology in Figure 13-16 is used here in reviewing the configuration and verification
steps for BSR.
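[Figure 13-16: BSR topology. NX-4 (10.2.2.3) is the elected BSR, and NX-3 (10.2.1.3) is a
C-BSR; NX-3 is also a C-RP for 239.0.0.0/8 (Loopback1, 10.3.3.3/32); NX-1 is a C-RP for
224.0.0.0/4 (Loopback99, 10.99.99.99/32); NX-2 is a BSR listener. Bootstrap messages use
224.0.0.13, and C-RP messages are sent as unicast.]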
NX-1 is configured to be a C-RP for the 224.0.0.0/4 multicast group range (see
Example 13-62). Because routers do not listen for or forward BSR messages by default,
the device is configured with the ip pim bsr listen forward command. After NX-1
learns of the BSR address through a received bootstrap message, it begins sending
unicast C-RP messages advertising the willingness to be an RP for 224.0.0.0/4.
The output of show ip pim rp provides the RP-to-group mapping selection being
used, based on the information received from the bootstrap message originated by the
elected BSR.
feature pim
interface Vlan1101
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface loopback99
  ip pim sparse-mode
interface Ethernet3/17
  ip pim sparse-mode
interface Ethernet3/18
  ip pim sparse-mode
The elected BSR is NX-4 because its BSR IP address is higher than that of NX-3
(10.2.2.3 vs. 10.2.1.3); both C-BSRs have an equal default priority of 64. The
ip pim bsr-candidate loopback0 command configures NX-4 to be a C-BSR and allows it to
begin sending periodic bootstrap messages. The output of show ip pim rp confirms that the
local device is the current BSR and provides a timer value that indicates when the next
bootstrap message is sent. The hash length is the default value of 30, but it is
configurable in the range of 0 to 32. Example 13-63 shows the configuration and RP mapping
information for NX-4.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode
NX-4# show ip pim rp
PIM RP Status Information for VRF "default"
BSR: 10.2.2.3*, next Bootstrap message in: 00:00:53,
priority: 64, hash-length: 30
Auto-RP disabled
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None
Example 13-64 shows the configuration of NX-3, which is configured to be both a C-RP
for 239.0.0.0/8 and a C-BSR. NX-3 has a lower C-BSR address than NX-4, so it does not
send any bootstrap messages after losing the BSR election.
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface loopback1
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode
NX-3# show ip pim rp
The final router to review is NX-2, which is acting only as a BSR listener and forwarder.
In this configuration, NX-2 receives the bootstrap message from NX-4 and inspects its
contents. It then selects the RP-to-group mapping for each group range and installs the
entry in the local RP cache. Note that NX-4, NX-3, and NX-1 are BSR clients as well, but
they are also acting as C-RPs or C-BSRs. Example 13-65 shows the configuration and RP
mapping from NX-2.
feature pim
interface Vlan115
  ip pim sparse-mode
interface Vlan116
  ip pim sparse-mode
interface Vlan1101
  ip pim sparse-mode
interface Ethernet3/17
  ip pim sparse-mode
interface Ethernet3/18
  ip pim sparse-mode
NX-2# show ip pim rp
PIM RP Status Information for VRF "default"
BSR: 10.2.2.3, uptime: 07:11:35, expires: 00:01:39,
priority: 64, hash-length: 30
Auto-RP disabled
BSR RP Candidate policy: None
BSR RP policy: None
Auto-RP Announce policy: None
Auto-RP Discovery policy: None
Unlike Auto-RP, BSR messages are not constrained by a configured TTL scope. In
a complex BSR design, defining which C-RPs are allowed to communicate with a
particular BSR might be desirable. This is achieved by filtering the bootstrap messages
and the RP-Candidate messages using the ip pim bsr [bsr-policy | rp-candidate-policy]
commands and using a route map for filtering purposes.
In addition to the event-history output, the show ip pim statistics command is useful for
viewing device-level aggregate counters for the various messages associated with BSR
and for troubleshooting. Example 13-67 shows the output from NX-4.
When multiple C-RPs exist for a particular group range, determining which group range
is mapped to which RP can be challenging. NX-OS provides two commands to assist the
user (see Example 13-68).
The first command is the show ip pim group-range [group address] command, which
provides the current PIM mode used for the group, the RP address, and the method
used to obtain the RP address. The second command is the show ip pim rp-hash [group
address] command, which runs the PIM hash function on demand and provides the hash
result and selected RP among all the C-RPs for the group range.
Running both Auto-RP and BSR in the same PIM domain is not supported. Auto-RP and
BSR both are capable of providing dynamic and redundant RP mapping to the network.
If third-party vendor devices are also participating in the PIM domain, BSR is the IETF
standard choice and allows for multivendor interoperability.
Fortunately, another approach is available for administrators who favor the simplicity
of a static PIM RP but also desire RP redundancy. Anycast RP configuration involves
multiple PIM routers sharing a single common IP address. The IP address is configured
on a Loopback interface using a /32 mask. Each router that is configured with the
anycast address advertises the connected host address into the network’s chosen routing
protocol. Each router in the PIM domain is configured to use the anycast address as
the RP. When an FHR needs to register a source, the network’s unicast routing protocol
automatically routes the PIM message to the closest device configured with the
anycast address. This allows many devices to share the load of PIM register messages and
provides redundancy in the case of an RP failure.
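The effect of unicast routing on anycast RP selection can be modeled as a lowest-cost
lookup. This is only a sketch: the function name and the cost values are hypothetical,
and a cost of None stands for an RP whose anycast host route has been withdrawn.

```python
def closest_anycast_rp(igp_cost_to_rp):
    """Return the anycast RP instance that unicast routing would reach.

    igp_cost_to_rp maps an RP name to the IGP cost from this router,
    or to None when that RP no longer advertises the anycast address.
    """
    reachable = {rp: cost for rp, cost in igp_cost_to_rp.items() if cost is not None}
    if not reachable:
        return None
    return min(reachable, key=reachable.get)


# Hypothetical costs as seen from an FHR.
print(closest_anycast_rp({"NX-3": 20, "NX-4": 10}))    # NX-4
# If NX-4 fails, registers are routed to the remaining RP.
print(closest_anycast_rp({"NX-3": 20, "NX-4": None}))  # NX-3
```

When troubleshooting, the equivalent check on a real device is a unicast route lookup
for the anycast RP address, which shows which physical RP a given router reaches.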
MSDP allows each PIM RP configured with the Anycast RP address to act independently,
while still sharing active source information with all other Anycast RPs in the domain.
For example, in the topology in Figure 13-17, an FHR can register a source for a multicast
group with Anycast RP NX-3, and then a receiver can join that group through Anycast
RP NX-4. After traffic is received through the RPT, normal PIM SPT switchover behavior
occurs on the LHR.
[Figure 13-17: Anycast RP with MSDP topology. Labels: MSDP peer TCP session, PIM register
message, NX-1 (static RP 10.99.99.99), NX-2 (static RP 10.99.99.99).]
Anycast RP with MSDP requires that each Anycast RP have an MSDP peer with every
other Anycast RP. The MSDP peer session is established over Transmission Control
Protocol (TCP) port 639. When the TCP session is established, MSDP can send keepalive
and source-active (SA) messages between peers, encoded in a TLV format.
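The TLV header is small enough to build and parse by hand. The sketch below follows the
layout in RFC 3618, which uses a 1-byte type followed by a 2-byte length that covers the
whole TLV; the function names are hypothetical. A keepalive is type 4 with a length
of 3, so it consists of the header alone.

```python
import struct

SOURCE_ACTIVE = 1  # IPv4 Source-Active TLV type (RFC 3618)
KEEPALIVE = 4      # KeepAlive TLV type (RFC 3618)


def build_keepalive():
    """Build a keepalive TLV: type 4, length 3, no value."""
    return struct.pack("!BH", KEEPALIVE, 3)


def parse_tlv_header(data):
    """Return (type, length) from the first 3 bytes of an MSDP message."""
    tlv_type, length = struct.unpack("!BH", data[:3])
    return tlv_type, length


msg = build_keepalive()
print(len(msg))               # 3
print(parse_tlv_header(msg))  # (4, 3)
```

The 3-byte size is worth remembering when reading the MSDP TCP event-history: a 3-byte
read or write between established peers is consistent with a keepalive, while larger
messages are more likely to carry SA information.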
When an Anycast RP learns of a new source, it uses the SA message to inform all its MSDP
peers about that source. The SA message contains the following information:
■ Source address of the data source
■ Group address the data source is sending to
■ IP address of the originating PIM RP
When the peer receives the MSDP SA, it subjects the message to an RPF check, which
compares the IP address of the PIM RP in the SA message to the MSDP peer address.
This address must be a unique IP address on each MSDP peer and cannot be an anycast
address. NX-OS provides the ip msdp originator-id [address] command to configure the
originating RP address that gets used in the SA message.
Note Other considerations for the MSDP SA message RPF check are not relevant to the
MSDP example used in this chapter. Section 10 of RFC 3618 gives the full explanation of
the MSDP SA message RPF check.
If the SA message is accepted, it is sent to all other MSDP peers except the one from
which the SA message was received. A concept called a mesh group can be configured
to reduce the SA message flooding when many anycast RPs are configured with MSDP
peering. The mesh group is a group of MSDP peers that have an MSDP neighbor with
every other mesh group peer. Therefore, any SA message received from a mesh group
peer does not need to be forwarded to any peers in the mesh group because all peers
should have received the same message from the originator.
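The forwarding rule for accepted SA messages, with and without a mesh group, can be
sketched as follows. The function name and the peer names are hypothetical, and the model
covers only the flooding decision, not the RPF check or SA filters.

```python
def sa_forward_targets(peers, received_from, mesh_group=frozenset()):
    """Return the MSDP peers to which an accepted SA message is forwarded.

    An SA is never sent back to the peer it came from. If it came from
    a mesh group member, it is not forwarded to any mesh group member.
    """
    targets = set(peers) - {received_from}
    if received_from in mesh_group:
        targets -= set(mesh_group)
    return sorted(targets)


peers = ["RP-A", "RP-B", "RP-C", "RP-D"]
mesh = {"RP-A", "RP-B", "RP-C"}
print(sa_forward_targets(peers, "RP-A"))        # ['RP-B', 'RP-C', 'RP-D']
print(sa_forward_targets(peers, "RP-A", mesh))  # ['RP-D']
print(sa_forward_targets(peers, "RP-D", mesh))  # ['RP-A', 'RP-B', 'RP-C']
```

The second call shows the reduction in flooding: an SA learned from a mesh group member
is passed only to peers outside the mesh group.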
MSDP supports the use of SA filters, which can be used to enforce specific design
parameters through message filtering. SA filters are configured with the ip msdp
sa-policy [peer address] [route-map | prefix-list] command. It is also possible to limit
the total number of SA messages from a peer with the ip msdp sa-limit [peer address]
[number of SAs] command.
The example network in Figure 13-17 was configured with anycast RPs and MSDP
between NX-3 and NX-4. NX-3 and NX-4 are both configured with the Anycast RP
address of 10.99.99.99 on their Loopback99 interfaces. The Loopback0 interface on NX-3
and NX-4 is used to establish the MSDP peering. NX-1 and NX-2 are statically
configured to use the anycast RP address of 10.99.99.99.
The output of Example 13-69 shows the configuration for anycast RP with MSDP from
NX-3. As with PIM, before MSDP can be configured, the feature must be enabled with
the feature msdp command. The originator-id and the MSDP connect source are both
using the unique IP address configured on interface Loopback0, while the PIM RP is
configured to use the anycast IP address of Loopback99. The MSDP peer address is the
Loopback0 interface of NX-4.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface loopback99
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode
NX-3# show run msdp
! Output omitted for brevity
!Command: show running-config msdp
feature msdp
interface loopback0
  ip address 10.2.1.3/32
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
interface loopback99
  ip address 10.99.99.99/32
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
The configuration of NX-4 is similar to that of NX-3; the only difference is the
Loopback0 IP address and the IP address of the MSDP peer, which is NX-3’s Loopback0
address. Example 13-70 contains the anycast RP with MSDP configuration for NX-4.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface loopback99
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode
NX-4# show run msdp
! Output omitted for brevity
!Command: show running-config msdp
feature msdp
interface loopback0
  ip address 10.2.2.3/32
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
!Command: show running-config interface loopback99
interface loopback99
  ip address 10.99.99.99/32
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
After the configuration is applied, NX-3 and NX-4 establish the MSDP peering session
between their Loopback0 interfaces using TCP port 639. The MSDP peering status can
be confirmed with the show ip msdp peer command (see Example 13-71). The output
provides an overview of the MSDP peer status and how long the peer has been
established. It also lists any configured SA policy filters or limits and provides counters for the
number of MSDP messages exchanged with the peer.
When 10.115.1.4 starts sending traffic to 239.115.115.1, NX-2 sends a PIM register
message to NX-4. When the source is registered, the resulting MSDP events are recorded in
the output of the show ip msdp internal event-history route and show ip msdp internal
event-history tcp commands, as shown in Example 13-73. This event-history has the
following interesting elements:
04:06:04.659887 msdp [1621]: : TCP at peer 10.2.1.3 accepted 104 bytes, 0 bytes
left to send from buffer, total send bytes: 0
04:06:04.659484 msdp [1621]: : 104 bytes enqueued for send (104 bytes in buffer)
to peer 10.2.1.3
04:05:17.778269 msdp [1621]: : Read 3 bytes from TCP with peer 10.2.1.3 ,
buffer offset 0
04:05:17.736188 msdp [1621]: : TCP at peer 10.2.1.3 accepted 3 bytes, 0 bytes
left to send from buffer, total send bytes: 0
04:04:20.111337 msdp [1621]: : Connection established on passive side
04:04:13.085442 msdp [1621]: : We are listen (passive) side of connection, using
local address 10.2.2.3
Even if the MSDP SA message is correctly generated and advertised to the peer, it can
still be discarded because of an RPF failure, an SA failure, or an SA limit. The same event-
history output on the peer is used to determine why MSDP is discarding the message
upon receipt. Remember that the PIM RP is the root of the RPT. If an LHR has an (S, G)
state for a problematic source and group, the problem is likely to be on the SPT rooted at
the source.
All examples in the “Anycast RP with MSDP” section of this chapter used a static PIM
RP configuration. Using the anycast RP with MSDP functionality in combination with
Auto-RP or BSR is fully supported; this combination allows dynamic group-to-RP mapping
and provides the additional benefits of an anycast RP.
PIM Anycast RP
RFC 4610 specifies PIM anycast RP. The design goal of PIM anycast RP is to remove
the dependency on MSDP and to achieve anycast RP functionality using only the PIM
protocol. The benefit of this approach is that the end-to-end process has one fewer
control plane protocol and one less point of failure or misconfiguration.
PIM anycast RP relies on the PIM register and register-stop messages between the
anycast RPs to achieve the same functionality that MSDP provided previously. PIM
anycast is designed around the following requirements:
■ Each anycast RP also has a unique address to use for PIM messages between the
anycast RPs.
■ Every anycast RP is configured with the addresses of all the other anycast RPs.
The example network in Figure 13-18 helps in understanding PIM anycast RP
configuration and troubleshooting.
[Figure 13-18: PIM anycast RP topology with NX-1, NX-2, NX-3, and NX-4. Labels: PIM
register message from NX-2, PIM register message from NX-4, PIM register stop,
Loopback0 10.1.1.1.]
As with the previous examples in this chapter, a multicast source 10.115.1.4 is attached
to NX-2 on VLAN 115 and begins sending to group 239.115.115.4. This is not illustrated
in Figure 13-18, for clarity. NX-2 is the FHR and is responsible for registering the source
with the RP. When NX-2 builds the register message, it performs a lookup in the unicast
routing table to find the anycast RP address 10.99.99.99. The anycast address 10.99.99.99
is configured on NX-1, NX-3, and NX-4, which are all members of the same anycast RP
set. The register message is sent to NX-4, following the best route in the unicast routing table.
When the register message arrives at NX-4, the PIM anycast RP functionality implements
additional checks and processing on the received message. NX-4 builds its (S, G) state just
as any PIM RP would. However, NX-4 looks at the source of the register message and
determines that because the address is not part of the anycast RP set, it must be an FHR.
NX-4 must then build a register message originated from its own Loopback0 address
and send it to all other anycast RPs that are in the configured anycast RP set. NX-4 then
sends a register-stop message to the FHR, NX-2. When NX-1 and NX-3 receive the
register message from NX-4, they also build an (S, G) state in the mroute table and reply back
to NX-4 with a register stop. Because NX-4 is part of the anycast RP set on NX-1 and
NX-3, they recognize NX-4 as a member of the anycast RP set and no additional register
messages are required to be built on NX-1 and NX-3.
The PIM anycast RP configuration uses the standard PIM messaging of register and
register-stop that happens between FHRs and RPs and applies it to the members of the
anycast RP set. The action of building a register message to inform the other anycast RPs
is based on the source address of the register. If it is not a member of the anycast RP set,
then the sender of the message must be an FHR, so a register message is sent to the other
members of the anycast RP set. The approach is elegant and straightforward.
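The decision an anycast RP makes when a register arrives can be written as a short
function. This is a sketch of the logic described above, not NX-OS code. The function
name and action strings are hypothetical; 10.2.2.3 and 10.2.1.3 are the Loopback0
addresses of NX-4 and NX-3, 10.115.1.254 is the address NX-2 uses as the register
source, and 10.1.1.1 is assumed to be the Loopback0 address of NX-1.

```python
def handle_register(local_addr, register_src, anycast_rp_set):
    """Return the ordered actions an anycast RP takes for a received register."""
    members = set(anycast_rp_set)
    actions = [("create (S, G) state", local_addr)]
    if register_src not in members:
        # The sender is not an anycast RP, so it must be an FHR:
        # copy the register to every other member of the set.
        for member in sorted(members - {local_addr}):
            actions.append(("send register", member))
    # A register-stop always goes back to whoever sent the register.
    actions.append(("send register-stop", register_src))
    return actions


rp_set = {"10.1.1.1", "10.2.1.3", "10.2.2.3"}
# NX-4 receives a register from the FHR (NX-2).
for action in handle_register("10.2.2.3", "10.115.1.254", rp_set):
    print(action)
# NX-3 receives the copy from NX-4 and does not forward it again.
for action in handle_register("10.2.1.3", "10.2.2.3", rp_set):
    print(action)
```

The second call shows why the anycast RP set must be configured identically on every
member: a member missing from the set would be mistaken for an FHR, and its register
would be forwarded again.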
Example 13-74 shows the configuration for NX-4. The static RP of 10.99.99.99 for groups
224.0.0.0/4 is configured on every PIM router in the domain. The anycast RP set is exactly
the same on NX-1, NX-3, and NX-4 and includes all anycast RP Loopback0 interface
addresses, including the local device’s own IP.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface loopback99
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode
The same debugging methodology used for the PIM source registration process can be
applied to the PIM Anycast RP set. The show ip pim internal event-history null-register
and show ip pim internal event-history data-header-register outputs provide a record of
the messages being exchanged between the Anycast-RP set and any FHRs that are
sending register messages to the device.
Example 13-75 shows the event-history output from NX-4. The null register message
from 10.115.1.254 is from NX-2, which is the FHR. After adding the mroute entry, NX-4
forwards the register message to the other members of the anycast RP set and then
receives a register stop message in response.
All examples in the PIM anycast RP section of this chapter used a static PIM RP
configuration. Using the PIM anycast RP functionality in combination with Auto-RP
or BSR is fully supported; this combination allows dynamic group-to-RP mapping while
retaining the advantages of anycast RP.
SSM functions without a PIM RP because the receiver has knowledge of each source and
group address that it will join. This knowledge can be preconfigured in the application,
resolved through a Domain Name System (DNS) query, or mapped at the LHR. Because no
PIM RP exists in SSM, the entire concept of the RPT or shared tree is eliminated along with
the SPT switchover. The process of registering a source with the RP is also no longer required,
which results in greater efficiency and less protocol overhead, compared to PIM ASM.
PIM SSM refers to a (source, group) combination as a uniquely identifiable channel. In PIM
ASM mode, any source may send traffic to a group. In addition, the receiver implicitly joins
any source that is sending traffic to the group address. In SSM, the receiver requests each
source explicitly through an IGMPv3 membership report. This allows different
applications to share the same multicast group address by using a unique source address. Because
NX-OS implements an IP-based IGMP snooping table by default, it is possible for hosts to
receive traffic for only the sources requested. A MAC-based IGMP snooping table has no
way to distinguish different source addresses sending traffic to the same group.
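The difference between the two table types can be shown with the standard mapping of an
IPv4 multicast group to an Ethernet MAC address (01:00:5e followed by the low-order 23
bits of the group). The sketch below is illustrative only: the function name is
hypothetical, and the second source address is invented for the example.

```python
import ipaddress


def group_mac(group):
    """Map an IPv4 multicast group to its Ethernet multicast MAC address."""
    low23 = int(ipaddress.ip_address(group)) & 0x7FFFFF
    return "01:00:5e:%02x:%02x:%02x" % (low23 >> 16, (low23 >> 8) & 0xFF, low23 & 0xFF)


# Two SSM channels that share a group but use different sources.
channels = [("10.115.1.4", "232.115.115.1"), ("10.116.1.4", "232.115.115.1")]

ip_table = {(source, group) for source, group in channels}
mac_table = {group_mac(group) for _, group in channels}

print(group_mac("232.115.115.1"))  # 01:00:5e:73:73:01
print(len(ip_table))               # 2 entries: one per (source, group) channel
print(len(mac_table))              # 1 entry: both channels collapse to one MAC
```

The MAC address carries no source information, so a table keyed on it delivers both
channels to every host that joined either one.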
Note SSM can natively join a source in another PIM domain because the source address
is known to the receiver. PIM ASM and BiDIR require the use of additional protocols and
configuration to enable interdomain multicast to function.
The topology in Figure 13-19 applies to the discussion on the configuration and
verification of PIM SSM.
[Figure 13-19: PIM SSM topology. Recoverable labels: NX-3, NX-4, NX-5, Eth3/28, Eth3/29,
SPT join, source (.4).]
The (S, G) on NX-2 is created by either receiving the PIM join from NX-4 or
receiving data traffic from the source, depending on which event occurs first. If no receiver
exists for an SSM group, the FHR silently discards the traffic and the OIL of the mroute
becomes empty. When the (S, G) SPT state is built, traffic flows downstream from the
source 10.115.1.4 directly to the receiver on the SSM group 232.115.115.1.
SSM Configuration
The configuration for PIM SSM requires ip pim sparse-mode on each interface participating
in multicast forwarding. No PIM RP is defined, but any interface connected to a receiver
must be configured with ip igmp version 3. The SSM group range, set with the ip pim ssm
range command, defaults to the IANA-reserved range of 232.0.0.0/8. Configuring a different
range of addresses is supported, but the range must be consistent throughout the PIM
domain. Otherwise, forwarding breaks because a router whose SSM range does not cover the
group treats it as an ASM group, for which it has no valid PIM RP-to-group mapping.
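The effect of an inconsistent SSM range can be sketched in a few lines of Python. This is only a model of the decision described above, not NX-OS code; the function name, the result labels, and the misconfigured range on NX-4 are assumptions made for illustration.

```python
import ipaddress

DEFAULT_SSM_RANGE = ipaddress.ip_network("232.0.0.0/8")  # IANA-reserved SSM range

def tree_mode(group, ssm_ranges, rp_mappings):
    """Return how a router would treat a group: 'ssm', 'asm', or 'asm-no-rp'."""
    addr = ipaddress.ip_address(group)
    if any(addr in net for net in ssm_ranges):
        return "ssm"
    if any(addr in net for net in rp_mappings):
        return "asm"
    return "asm-no-rp"  # treated as ASM, but no RP-to-group mapping exists

# Per-router SSM ranges; the 233.0.0.0/8 entry is a hypothetical misconfiguration.
routers = {
    "NX-2": [DEFAULT_SSM_RANGE],
    "NX-4": [ipaddress.ip_network("233.0.0.0/8")],
}

group = "232.115.115.1"
modes = {name: tree_mode(group, ranges, rp_mappings=[]) for name, ranges in routers.items()}
inconsistent = len(set(modes.values())) > 1

print(modes)
print("inconsistent SSM range:", inconsistent)
```

A check of this kind, run against the configured ranges collected from each device, points directly at the router that disagrees with the rest of the domain.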
Example 13-76 shows the output of the complete SSM configuration for NX-2.
feature pim
interface Vlan115
  ip pim sparse-mode
interface Vlan116
  ip pim sparse-mode
interface Vlan1101
  ip pim sparse-mode
interface Ethernet3/17
  ip pim sparse-mode
interface Ethernet3/18
  ip pim sparse-mode

interface Vlan115
  no shutdown
  no ip redirects
  ip address 10.115.1.254/24
  ip ospf passive-interface
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
  ip igmp version 3
Example 13-77 shows the corresponding SSM configuration for NX-4.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode

interface Vlan215
  no shutdown
  no ip redirects
  ip address 10.215.1.253/24
  ip ospf passive-interface
  ip router ospf 1 area 0.0.0.0
  ip pim sparse-mode
  ip igmp version 3
NX-1 and NX-3 are configured in a similar way. Because they do not play a role in for-
warding traffic in this example, the configuration is not shown.
SSM Verification
To verify the SPT used in SSM, it is best to begin at the LHR where the receiver is
attached. If the receiver sent an IGMPv3 membership report, an (S, G) state is present
on the LHR. If this entry is missing, check the host for the proper configuration. SSM
requires that the host have knowledge of the source address, and it works correctly only
when the host knows which source to join, or when a correct translation is configured
when the receiver is not using IGMPv3.
If any doubt arises that the host is sending a correct membership report, perform an
Ethanalyzer capture on the LHR. In addition, the output of show ip igmp groups and
show ip igmp snooping groups can be used to confirm that the interface has received a
valid membership report. Example 13-78 shows this output from NX-4. Because this is
IGMPv3 and NX-OS uses an IP-based table, both the source and the group information are
present.
When NX-4 receives the membership report, an (S, G) mroute entry is created. The (S, G)
mroute state is created because the receiver is already aware of the precise source address
it wants to join for the group. In contrast, PIM ASM builds a (*, G) state because the LHR
does not yet know the source. Example 13-79 shows the mroute table for NX-4.
The RPF interface to 10.115.1.4 is Ethernet 3/28, which connects directly to NX-2. The
show ip pim internal event-history join-prune command can be checked to confirm
that the SPT join has been sent from NX-4. Example 13-80 shows the output of this
command.
The PIM Join is received on NX-2, and the OIL of the mroute entry is updated to include
Ethernet 3/17, which is directly connected with NX-4. Example 13-81 gives the event-
history for PIM join-prune and the mroute entry from NX-2.
Most problems with SSM result from a misconfigured SSM group range on a subset of
devices or stem from a receiver host that is misconfigured or that is attempting to join the
wrong source address. The troubleshooting methodology is similar to the one to address
problems with the SPT in PIM ASM: Start at the receiver and work through the network
hop by hop until the FHR connected to the source is reached. Packet capture tools such
as ELAM, ACLs, or SPAN can be used to isolate any packet forwarding problems on a
router along the tree.
Although L2 state is synchronized between the vPC peers through Cisco Fabric Services
(CFS), both peers have an independent L3 control plane. As with standard port-channels,
a hash table is used to determine which member link is chosen to forward packets of a
particular flow. Traffic arriving from a vPC-connected host is received on either vPC peer,
depending on the hash result. Because of this, both peers must be capable of forwarding
traffic to or from a vPC-connected host. NX-OS supports both multicast sources and
receivers connected behind vPC. Support for multicast traffic over vPC requires the
following:
■ IGMP is synchronized between peers with the CFS protocol. This populates the
IGMP snooping forwarding tables on both vPC peers with the same information.
PIM and mroutes are not synchronized with CFS.
■ The vPC peer link is an mrouter port in the IGMP snooping table, which means that
all multicast packets received on a vPC VLAN are forwarded across the peer link to
the vPC peer.
■ Packets received from a vPC member port and sent across the peer link are not sent
out of any vPC member port on the receiving vPC peer.
■ With vPC-connected multicast sources, both vPC peers can forward multicast traffic
to an L3 OIF.
■ With vPC-connected receivers, the vPC peer with the best unicast metric to the
source will forward packets. If the metrics are the same, the vPC operational primary
forwards the packets. This vPC assert mechanism is implemented through CFS.
■ PIM SSM and PIM BiDIR are not supported with vPC because of the possibility of
incorrect forwarding behavior.
Multicast and Virtual Port-Channel 849
Note Although multicast source and receiver traffic is supported over vPC, an L3 PIM
neighbor from the vPC peers to a vPC-connected multicast router is not yet supported.
vPC-Connected Source
The example network topology in Figure 13-20 illustrates the configuration and verifica-
tion of a vPC-connected multicast source.
[Figure 13-20: vPC-connected source topology diagram. Recoverable labels: Sources, NX-6,
NX-5, Receiver; interfaces Eth3/17 and Eth3/18.]
In Figure 13-20, the multicast sources are 10.215.1.1 in VLAN 215 and 10.216.1.1 in
VLAN 216, both sending to group 239.215.215.1. Both sources are attached to L2 switch
NX-6, which uses its local hash algorithm to choose the member link on which to forward
the traffic. NX-3 and NX-4 are vPC peers and act as FHRs for VLAN 215 and VLAN 216, which
are trunked across the vPC with NX-6.
The receiver is attached to VLAN 115 on NX-2, which is acting as the LHR. The network
was configured with a static PIM anycast RP of 10.99.99.99, which is Loopback 99 on
NX-1 and NX-2.
When vPC is configured, no special configuration commands are required for vPC and
multicast to work together. Multicast forwarding is integrated into the operation of vPC
by default and is enabled automatically. CFS handles IGMP synchronization, and PIM
does not require the user to enable any vPC-specific configuration beyond enabling ip
pim sparse-mode on the vPC VLAN interfaces.
Example 13-82 shows the PIM and vPC configuration for NX-4.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode

feature vpc
vpc domain 2
  peer-switch
  peer-keepalive destination 10.33.33.1 source 10.33.33.2 vrf peerKA
  peer-gateway
interface port-channel1
  vpc peer-link
interface port-channel2
  vpc 2
Example 13-83 shows the PIM and vPC configuration on the vPC peer NX-3.
feature pim
interface Vlan215
  ip pim sparse-mode
interface Vlan216
  ip pim sparse-mode
interface Vlan303
  ip pim sparse-mode
interface loopback0
  ip pim sparse-mode
interface Ethernet3/28
  ip pim sparse-mode
interface Ethernet3/29
  ip pim sparse-mode

feature vpc
vpc domain 2
  peer-switch
  peer-keepalive destination 10.33.33.2 source 10.33.33.1 vrf peerKA
  peer-gateway
interface port-channel1
  vpc peer-link
interface port-channel2
  vpc 2
After implementing the configuration, the next step is to verify that PIM and IGMP are
operational on the vPC peers. The output of show ip pim interface from NX-4 indicates
that VLAN 215 is a vPC VLAN (see Example 13-84). Note that NX-3 (10.215.1.254) is
the PIM DR and handles registration of the source with the PIM RP. PIM neighbor verifi-
cation on NX-3 and NX-4 for the non-vPC interfaces and for NX-1 and NX-2 is identical
to the previous examples shown in the PIM ASM section of this chapter.
The show ip igmp interface command in Example 13-85 indicates that VLAN 215 is a vPC
VLAN. The output also identifies the PIM DR as the vPC peer, not the local interface.
Identifying which device is acting as the PIM DR for the VLAN of interest is important
because this device is responsible for registering the source with the RP, as with tradi-
tional PIM ASM. What differs in vPC for source registration is the interface on which
the DR receives the packets from the source. Packets can arrive either directly on the vPC
member link or from the peer link. Packets are forwarded on the peer link because it is
programmed in IGMP snooping as an mrouter port (see Example 13-86).
When the multicast source in VLAN 216 begins sending traffic to 239.215.215.1, the
traffic arrives on NX-4. NX-4 creates an (S, G) mroute entry and forwards the packet
across the peer link to NX-3. NX-3 receives the packet and also creates an (S, G) mroute
entry and registers the source with the RP. Traffic from 10.215.1.1 in VLAN 215 arrives at
NX-3 on the vPC member link. NX-3 creates an (S, G) mroute and then forwards a copy
of the packets to NX-4 over the peer link. In response to receiving the traffic on the peer
link, NX-4 also creates an (S, G) mroute entry.
Example 13-87 shows the mroute entries on NX-3 and NX-4. Even though traffic from
10.216.1.1 for group 239.215.215.1 is hashing only to NX-4, notice that both vPC peers
created (S, G) state. This state is created because of the packets received over the peer link.
Example 13-87 Multicast vPC Source MROUTE Entry on NX-3 and NX-4
When the (S, G) mroutes are created on NX-3 and NX-4, both devices realize that the
sources are directly connected. Both devices then determine the forwarder for each
source. In this example, the sources are vPC connected, which makes the forwarding state
for both sources Win-force (forwarding). The result of the forwarding election is found
in the output of show ip pim internal vpc rpf-source (see Example 13-88). This output
indicates which vPC peer is responsible for forwarding packets from a particular source
address. In this case, both are equal; because the source is directly attached through vPC,
both NX-3 and NX-4 are allowed to forward packets in response to receiving a PIM join
or IGMP membership report message.
Example 13-88 PIM vPC RPF-Source Cache Table on NX-3 and NX-4
PIM vPC RPF-Source Cache for Context "default" - Chassis Role Primary
Source: 10.215.1.1
Pref/Metric: 0/0
Ref count: 1
In MRIB: yes
Is (*,G) rpf: no
Source role: primary
Forwarding state: Win-force (forwarding)
MRIB Forwarding state: forwarding
Source: 10.216.1.1
Pref/Metric: 0/0
Ref count: 1
In MRIB: yes
Is (*,G) rpf: no
Source role: primary
Forwarding state: Win-force (forwarding)
MRIB Forwarding state: forwarding
NX-3# show ip pim internal vpc rpf-source
! Output omitted for brevity
PIM vPC RPF-Source Cache for Context "default" - Chassis Role Secondary
Source: 10.215.1.1
Pref/Metric: 0/0
Ref count: 1
In MRIB: yes
Is (*,G) rpf: no
Source role: secondary
Source: 10.216.1.1
Pref/Metric: 0/0
Ref count: 1
In MRIB: yes
Is (*,G) rpf: no
Source role: secondary
Forwarding state: Win-force (forwarding)
MRIB Forwarding state: forwarding
Note The historical vPC RPF-Source Cache creation events are viewed in the output of
show ip pim internal event-history vpc.
NX-3 is the PIM DR for both VLAN 215 and VLAN 216 and is responsible for register-
ing the sources with the PIM RP (NX-1 and NX-2). NX-3 sends PIM register messages
to NX-1, as shown in the output of show ip pim internal event-history null-register in
Example 13-89. Because NX-1 is part of an anycast RP set, it then forwards the register
message to NX-2 and sends a register-stop message to NX-3. At this point, both vPC
peers have an (S, G) for both sources, and both anycast RPs have an (S, G) state.
After the source has been registered with the RP, the receiver in VLAN 115 sends an
IGMP membership report requesting all sources for group 239.215.215.1, which arrives at
NX-2. NX-2 joins the RPT and then initiates switchover to the SPT after the first packet
arrives. NX-2 has two equal-cost routes to reach the sources (see Example 13-90), and it
chooses to join 10.215.1.1 through NX-3 and 10.216.1.1 through NX-4. NX-OS is enabled
for multipath multicast by default, which means it can send a PIM join on either valid
RPF interface toward the source when joining the SPT.
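The idea of spreading (S, G) joins across equal-cost RPF interfaces can be sketched as follows. This is an illustration only: the SHA-256 based selection is a stand-in chosen for this sketch and is not the hash NX-OS uses, and the function name is hypothetical. The interface names are the NX-2 uplinks toward the vPC peers.

```python
import hashlib

def pick_rpf(source, group, candidates):
    """Deterministically pick one equal-cost RPF interface for an (S, G).

    Stand-in logic: hash the (source, group) pair and index into the
    sorted candidate list. Not the actual NX-OS algorithm.
    """
    ordered = sorted(candidates)
    digest = hashlib.sha256(f"{source},{group}".encode()).digest()
    return ordered[int.from_bytes(digest[:4], "big") % len(ordered)]

uplinks = ["Ethernet3/17", "Ethernet3/18"]
for src in ("10.215.1.1", "10.216.1.1"):
    print(src, "->", pick_rpf(src, "239.215.215.1", uplinks))
```

The property that matters for troubleshooting is that the choice is stable for a given (S, G): the same source and group always map to the same RPF interface while the set of equal-cost paths is unchanged, so the join path can be predicted and then confirmed in the event-history output.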
Example 13-90 Unicast Routes from NX-2 for VLAN 215 and VLAN 216
The output of show ip pim internal event-history join-prune confirms that NX-2
has joined the VLAN 215 source through NX-3 and has joined the VLAN 216 source
through NX-4 (see Example 13-91).
Example 13-91 PIM SPT Joins from NX-2 for vPC-Connected Sources
When these PIM joins arrive at NX-3 and NX-4, both are capable of forwarding packets
from VLAN 215 and VLAN 216 to the receiver on the SPT. Because NX-2 chose to
join (10.216.1.1, 239.215.215.1) through NX-4, the OIL of that entry on NX-4 is populated
with Ethernet 3/28, and NX-3 forwards (10.215.1.1, 239.215.215.1) in response to the PIM
join from NX-2. Example 13-92 shows the mroute entries from NX-3 and NX-4 after receiving
the SPT joins from NX-2.
Example 13-92 MROUTE Entries from NX-3 and NX-4 after SPT Join
The final example for a vPC-connected source demonstrates what occurs when a vPC-connected
receiver joins the group. To create this state on the vPC pair, 10.216.1.1 sends an IGMP
membership report to join group 239.215.215.1. The L2 switch NX-6 sends this membership
report to either NX-3 or NX-4. When the IGMP membership report arrives on vPC
port-channel 2 at NX-3 or NX-4, two events occur:
1. The IGMP membership report message is forwarded across the vPC peer link
because the vPC peer is an mrouter.
2. A CFS message is sent to the peer. The CFS message informs the vPC peer to pro-
gram vPC port-channel 2 with an IGMP OIF. vPC port-channel 2 is the interface on
which the original IGMP membership report was received.
These events create a synchronized (*, G) mroute with an IGMP OIF on both NX-3 and
NX-4 (see Example 13-93). The OIF is also added to the (S, G) mroutes that existed
previously.
Example 13-93 MROUTE Entries from NX-3 and NX-4 after IGMP Join
We now have a (*, G) entry because the IGMP membership report was received, and both
(S, G) mroutes now contain VLAN 216 in the OIL. In this scenario, packets are hashed by
NX-6 from the source 10.215.1.1 to NX-3. While the traffic is being received at NX-3, the
following events occur:
■ NX-3 forwards the packets across the peer link in VLAN 215.
■ NX-3 replicates the traffic and multicast-routes the packets from VLAN 215 to
VLAN 216, based on its mroute entry.
■ NX-3 sends packets toward the receiver in VLAN 216 on Port-channel 2 (vPC).
■ NX-4 receives the packets from NX-3 in VLAN 215 from the peer link. NX-4 forwards
the packets to any non-vPC receivers but does not forward the packets out a vPC VLAN.
The (RPF) flag on the (10.216.1.1, 239.215.215.1) mroute entry signifies that a source and
receiver are in the same VLAN.
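The forwarding events listed above follow from the vPC rule that a packet received over the peer link is never sent out a vPC member port. A minimal Python sketch of that rule is shown here; it is a simplified model rather than the actual forwarding logic, and the function name and the orphan port Ethernet3/1 are hypothetical.

```python
def egress_entries(oil, ingress_port, peer_link, vpc_member_ports):
    """Return the (port, vlan) OIL entries a vPC peer may send a packet out of."""
    if ingress_port != peer_link:
        # Received from a vPC member or other local port: use the full OIL.
        return list(oil)
    # Received over the peer link: the peer has already served the vPC member
    # ports, so only non-vPC ports remain (and never the peer link itself).
    return [
        (port, vlan)
        for port, vlan in oil
        if port != peer_link and port not in vpc_member_ports
    ]

PEER_LINK = "port-channel1"
VPC_PORTS = {"port-channel2"}
oil = [
    ("port-channel1", 215),  # copy across the peer link in the source VLAN
    ("port-channel2", 216),  # vPC-connected receiver
    ("Ethernet3/1", 216),    # hypothetical orphan (non-vPC) receiver port
]

print(egress_entries(oil, "port-channel2", PEER_LINK, VPC_PORTS))  # NX-3 view
print(egress_entries(oil, "port-channel1", PEER_LINK, VPC_PORTS))  # NX-4 view
```

From the NX-3 point of view (traffic arrives on the vPC member link) every entry is used. From the NX-4 point of view (traffic arrives over the peer link) only the orphan port remains, which matches the behavior described for non-vPC receivers.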
vPC-Connected Receiver
The same topology used to verify a vPC-connected source is reused to understand how
a vPC-connected receiver works. Although the location of the source and receivers
changed, the rest of the topology remains the same (see Figure 13-21).
[Figure 13-21: vPC-connected receiver topology diagram. Recoverable labels: VLAN 215,
VLAN 216, Receiver, NX-6, NX-5, Source.]
The configuration is not modified in any way from the vPC-connected source example,
with the exception of one command. The ip pim pre-build-spt command was configured
on both NX-4 and NX-3. When configured, both vPC peers initiate an SPT join for each
source, but only the elected forwarder forwards traffic toward vPC-connected receivers.
The purpose of this command is to allow for faster failover in case the current vPC
forwarder suddenly stops sending traffic as the result of a failure condition.
When the multicast source 10.115.1.4 begins sending traffic to the group 239.115.115.1,
the traffic is forwarded by L2 switch NX-5 to NX-2. Upon receiving the traffic, NX-2
creates an (S, G) entry for the traffic. Because no receivers exist yet, the OIL is empty at
this time. However, NX-2 informs NX-1 about the source using a PIM register message
because NX-1 and NX-2 are configured as PIM anycast RPs in the same RP set.
The receiver is 10.215.1.1 and is attached to the network in vPC VLAN 215. NX-6
forwards the IGMP membership report message to its mrouter port on Port-channel 2.
This message can hash to either NX-3 or NX-4. When NX-4 receives the message,
IGMP creates a (*, G) mroute entry. The membership report from the receiver is then sent
across the peer link to NX-3, along with a corresponding CFS message. Upon receiving
the message, NX-3 also creates a (*, G) mroute entry. Example 13-94 shows the IGMP
snooping state, IGMP group state, and mroute on NX-4.
Example 13-95 shows the output from NX-3 after receiving the CFS messages from
NX-4. Both vPC peers are synchronized to the same IGMP state, and IGMP is correctly
registered with the vPC manager process.
The number of CFS messages sent between NX-3 and NX-4 can be seen in the output of
show ip igmp snooping statistics (see Example 13-96). CFS is used to synchronize IGMP
state and allows each vPC peer to communicate and elect a forwarder for each source.
Note IGMP control plane packet activity is seen in the output of show ip igmp snooping
internal event-history vpc.
PIM joins are sent toward the RP from both NX-3 and NX-4, which can be seen in the
show ip pim internal event-history join-prune output of Example 13-97.
Upon receiving the (*, G) join messages from NX-3 and NX-4, the mroute entry on NX-2
is updated to include the Ethernet 3/17 and Ethernet 3/18 interfaces to NX-3 and NX-4 in
the OIL. Traffic then is sent out on the RPT.
As the traffic arrives on the RPT at NX-3 and NX-4, the source address of the group
traffic becomes known, which triggers the creation of the (S, G) mroute entry. NX-3 and
NX-4 then determine which device will act as the forwarder for this source using CFS.
The communication for the forwarder election is viewed in the output of show ip pim
internal event-history vpc. Because both NX-3 and NX-4 have equal metrics and route
preference to the source, a tie occurs. However, because NX-4 is the vPC primary, it wins
over NX-3 and acts as the forwarder for 10.115.1.4.
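The election logic described here can be summarized in a short Python sketch. It models only the comparison rules stated in the text (best unicast route to the source wins, and the vPC operational primary breaks a tie) and assumes that a lower preference and metric is better, as in unicast routing. The class and function names are hypothetical, and the sketch does not represent the CFS message exchange.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Peer:
    name: str
    role: str        # vPC operational role: "primary" or "secondary"
    preference: int  # unicast route preference toward the source
    metric: int      # unicast route metric toward the source

def elect_forwarder(a: Peer, b: Peer) -> Peer:
    """Pick the vPC peer that forwards traffic from a source to vPC receivers."""
    key_a = (a.preference, a.metric)
    key_b = (b.preference, b.metric)
    if key_a != key_b:
        return a if key_a < key_b else b    # better unicast route wins
    return a if a.role == "primary" else b  # tie: operational primary wins

nx3 = Peer("NX-3", "secondary", preference=110, metric=44)
nx4 = Peer("NX-4", "primary", preference=110, metric=44)
print(elect_forwarder(nx3, nx4).name)
```

With the equal 110/44 values used in this example the result is decided by the vPC role. If one peer had a lower metric to the source, it would win regardless of role.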
After the election results are obtained, an entry is created in the vPC RPF-Source cache,
which is seen with the show ip pim internal vpc rpf-source command. Example 13-98
contains the PIM vPC forwarding election output from NX-4 and NX-3.
Source: 10.115.1.4
Pref/Metric: 110/44
Ref count: 1
In MRIB: yes
Is (*,G) rpf: no
Source role: primary
Forwarding state: Tie (forwarding)
MRIB Forwarding state: forwarding
Source: 10.115.1.4
Pref/Metric: 110/44
Ref count: 1
In MRIB: yes
Is (*,G) rpf: no
Source role: secondary
Forwarding state: Tie (not forwarding)
MRIB Forwarding state: not forwarding
For this election process to work correctly, PIM must be registered with the vPC manager
process. This is indicated in the highlighted output of Example 13-99.
With ip pim pre-build-spt, both NX-3 and NX-4 initiate (S, G) joins toward NX-2
following the RPF path toward the source. However, because NX-3 is not the forwarder,
it simply discards the packets it receives on the SPT. NX-4 forwards packets toward the
vPC receiver and across the peer link to NX-3.
Example 13-100 shows the (S, G) mroute state and resulting PIM SPT joins from NX-3
and NX-4. Only NX-4 has an OIL containing VLAN 215 for the (S, G) mroute entry.
More detail about the mroute state is seen in the output of the show routing ip multicast
source-tree detail command. This command provides additional information that can
be used for verification. The output confirms that NX-4 is the RPF-Source Forwarder
for this (S, G) entry (see Example 13-101). NX-3 has the same OIL, but its status is set to
inactive, which indicates that it is not forwarding.
■ Increase the PIM SG-Expiry timer with the ip pim sg-expiry-timer command. The
value should be sufficiently large so that the (S, G) state does not time out during
business hours.
■ Use multicast source-generated probe packets to populate the (S, G) state in the net-
work before each business day.
The purpose of these steps is to have the SPT trees built before any business-critical data
is sent each day. The increased (S, G) expiry timer allows the state to remain in place dur-
ing critical times and avoid state timeout and re-creation for intermittent multicast send-
ers. This avoids state transitions and the potential for duplicate traffic.
Reserved VLAN
The Nexus 5500 and Nexus 6000 series platforms utilize a reserved VLAN for the pur-
poses of multicast routing when vPC is configured. When traffic arrives from a vPC-
connected source, the following events occur:
■ The traffic is replicated to any receivers in the same VLAN, including the peer link.
■ A copy is sent across the peer link using the reserved VLAN.
As packets arrive from the peer link at the vPC peer, traffic received in any VLAN other
than the reserved VLAN is not multicast routed. If the vpc bind-vrf [vrf name] vlan
[VLAN ID] command is not configured on both vPC peers, orphan ports and L3-connected
receivers do not receive traffic. This command must be configured for each VRF
participating in multicast routing.
Ethanalyzer Examples
Various troubleshooting steps in this chapter have relied on the NX-OS Ethanalyzer
facility to capture control plane protocol messages. Table 13-11 provides examples of
Ethanalyzer protocol message captures for the purposes of troubleshooting. In general,
when performing an Ethanalyzer capture, you must decide whether the packets should
be displayed in the session, decoded in the session, or written to a local file for offline
analysis. The basic syntax of the command is ethanalyzer local interface [inband]
capture-filter [filter-string in quotes] write [location:filename]. Many variations of the
command exist, depending on which options are desired.
Ethanalyzer syntax might vary slightly, depending on the platform. For example, some
NX-OS platforms such as Nexus 3000 have inband-hi and inband-lo interfaces. For most
control plane protocols, the packets are captured on the inband-hi interface. However, if the
capture fails to collect any packets, the user might need to try a different interface option.
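When captures are scripted, the command string can be assembled from the syntax given above. The following Python sketch is one hedged way to do that; the helper name and the output filename are hypothetical, and the example filter relies only on the fact that PIM is IP protocol 103.

```python
def ethanalyzer_cmd(interface="inband", capture_filter=None, write_to=None):
    """Build an Ethanalyzer command string from the basic syntax.

    interface      -- "inband", or a platform-specific option such as "inband-hi"
    capture_filter -- filter string; it is wrapped in double quotes
    write_to       -- optional location:filename for offline analysis
    """
    parts = ["ethanalyzer", "local", "interface", interface]
    if capture_filter:
        parts += ["capture-filter", f'"{capture_filter}"']
    if write_to:
        parts += ["write", write_to]
    return " ".join(parts)

# Capture PIM (IP protocol 103) and write it to a hypothetical file on bootflash.
print(ethanalyzer_cmd(capture_filter="ip proto 103", write_to="bootflash:pim.pcap"))

# Same capture on a platform that exposes an inband-hi interface.
print(ethanalyzer_cmd(interface="inband-hi", capture_filter="ip proto 103"))
```

Keeping the interface as a parameter makes it easy to retry with a different interface option when the first capture collects no packets.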
Summary
Multicast communication using NX-OS was covered in detail throughout this chapter.
The fundamental concepts of multicast forwarding were introduced before delving into
the NX-OS multicast architecture. The IGMP and PIM protocols were examined in detail
to build a foundation for the detailed verification examples. The supported PIM operat-
ing modes (ASM, BiDIR, and SSM) were explored, including the various message types
used for each and the process for verifying each type of multicast distribution tree.
Finally, multicast and vPC were reviewed and explained, along with the differences in
protocol behavior that are required when operating in a vPC environment. The goal of
this chapter was not to cover every possible multicast forwarding scenario, but instead to
provide you with a toolbox of fundamental concepts that can be adapted to a variety of
troubleshooting situations in a complex multicast environment.
Chapter 14
Troubleshooting Overlay
Transport Virtualization (OTV)
OTV Fundamentals
The desire to connect data center sites at L2 is driven by the need for Virtual
Machine (VM) and workload mobility, or for creating geographically diverse redun-
dancy. Critical networks may even choose to have a fully mirrored disaster recovery site
that synchronizes data and services between sites. Having the capability to put services
from multiple locations into the same VLAN allows mobility between data centers with-
out reconfiguring the network layer addressing of the host or server when it is moved.
The challenges and considerations associated with connecting two or more data centers
at L2 are the following:
Before OTV, L2 data center interconnect (DCI) was achieved with the use of direct fiber
links configured as L2 trunks, IEEE 802.1Q Tunneling (Q-in-Q), Ethernet over MPLS
(EoMPLS), or Virtual Private LAN Service (VPLS). These options rely on potentially
complex configuration by a transport service provider to become operational. Adding a
site with those solutions means the service provider needs to be involved to complete the
necessary provisioning.
OTV, however, can provide an L2 overlay network between sites using only an L3 routed
underlay. Because OTV is encapsulated inside an IP packet for transport, it can take
advantage of the strengths of L3 routing; for example, IP Equal Cost Multipath (ECMP)
routing for load sharing and redundancy as well as optimal packet paths between OTV
edge devices (ED) based on routing protocol metrics. Troubleshooting is simplified as
well because traffic in the transport network is traditional IP with established and famil-
iar troubleshooting techniques.
Solutions for L2 DCI such as Q-in-Q, EoMPLS, and VPLS all require the service pro-
vider to perform some form of encapsulation and decapsulation on the traffic for a site.
With OTV, the overlay encapsulation boundary is moved from the service provider to
the OTV site, which provides greater visibility and control for the network operator.
The overlay configuration can be modified at will and does not require any interac-
tion with or dependence on the underlay service provider. Modifications to the overlay
include actions like adding new OTV sites or changing which VLANs are extended
across the OTV overlay.
The previously mentioned transport protocols rely on static or stateful tunneling. With
OTV, encapsulation of the overlay traffic happens dynamically based on MAC address
to IP next-hop information supplied by OTV’s Intermediate System to Intermediate
System (IS-IS) control plane. This concept is referred to as MAC address routing, and it
is explored in detail throughout this chapter. The important point to understand is that
OTV maps a MAC address to a remote IP next-hop dynamically using a control plane
protocol.
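MAC address routing can be pictured as a lookup table whose entries are installed by the control plane rather than by flooding. The following Python sketch is a toy model under that assumption; the class name, the MAC address, and the edge device IP address are hypothetical, and the sketch does not represent the IS-IS message format.

```python
class MacRoutingTable:
    """Toy model of an OTV MAC routing table: (VLAN, MAC) -> remote edge device IP."""

    def __init__(self):
        self._routes = {}

    def advertise(self, vlan, mac, edge_ip):
        # Stands in for a MAC advertisement learned through the control plane.
        self._routes[(vlan, mac.lower())] = edge_ip

    def withdraw(self, vlan, mac):
        # Stands in for the control plane removing a MAC that is no longer reachable.
        self._routes.pop((vlan, mac.lower()), None)

    def lookup(self, vlan, mac):
        # Returns the IP next hop to encapsulate toward, or None if unknown.
        return self._routes.get((vlan, mac.lower()))

table = MacRoutingTable()
table.advertise(100, "0000.0C12.3456", "10.2.2.2")  # hypothetical MAC and ED address

print(table.lookup(100, "0000.0c12.3456"))  # known: encapsulate toward this ED
print(table.lookup(101, "0000.0c12.3456"))  # same MAC, different VLAN: unknown
```

The key is the (VLAN, MAC) pair, so the same MAC address in two extended VLANs resolves independently, and an unknown destination simply returns no next hop.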
had to be planned and configured carefully to avoid L2 loops and Spanning-Tree Protocol
(STP) blocking ports. OTV has considerations for multihoming built in to the protocol.
For example, multiple OTV edge devices can be deployed in a single site, and each can
actively forward traffic for different VLANs. Between data centers, multiple L3 routed
links exist and provide L3 ECMP redundancy and load sharing between the OTV edge
devices in each data center site.
Having redundant data centers is useful only if they exist in different fault domains, and
problems from one data center do not affect the other. This implies that each data center
must be isolated in terms of STP, and traffic forwarding loops between sites must be
avoided. OTV allows each data center site to contain an independent STP Root Bridge
for the VLANs extended across OTV. This is possible because OTV does not forward
STP Bridge Protocol Data Units (BPDU) across the overlay, allowing each site to function
independently.
Note OTV is also supported on Cisco ASR1000 series routers. The protocol functional-
ity is similar but there may be implementation differences. This chapter focuses only on
OTV on the Nexus 7000 series switches.
VLANs are aggregated into a distribution switch and then fed into a dedicated OTV
VDC through an L2 trunk. Any traffic in a VLAN that needs to reach the remote data
center is switched to the OTV VDC, where it is encapsulated by the edge device.
The packet then traverses the routed VDC as an L3 IP packet and is routed toward
the remote OTV edge device for decapsulation. Traffic that requires L3 routing is fed
from the L2 distribution to a routing VDC. The routing VDC typically runs a First Hop
Redundancy Protocol (FHRP) such as Hot Standby Router Protocol (HSRP) or Virtual Router
Redundancy Protocol (VRRP) to provide a default-gateway address to the hosts in the
attached VLANs and to perform inter-VLAN routing.
Note Configuring multiple VDCs may require the installation of additional licenses,
depending on the requirements of the deployment and the number of VDCs.
OTV Terminology
An OTV network topology example is shown in Figure 14-1. There are two data center
sites connected by an L3 routed network that is enabled for IP multicast. The L3 routed
network must provide IP connectivity between the OTV edge devices for OTV to func-
tion correctly. The placement of the ED is flexible as long as the OTV ED receives L2
frames for the VLANs that require extension across OTV. Usually the OTV ED is con-
nected at the L2 and L3 boundary.
Data center 1 contains redundant OTV VDCs NX-2 and NX-4, which are the edge
devices. NX-1 and NX-3 perform the routing and L2 VLAN aggregation and con-
nect the access switch to the OTV VDC internal interface. The OTV join interface is a
Layer 3 interface connected to the routing VDC. Data center 2 is configured as a mirror
of Data center 1; however, the port-channel 3 interface is used as the OTV internal
interface instead of the OTV join interface as in Data center 1. VLANs 100–110 are being
extended with OTV between the data centers across the overlay.
[Figure 14-1: OTV network topology. Labels recovered from the figure: NX-5, NX-7; Eth3/41 L3
OTV Join Interface; Po1 (PL); Layer 3 ECMP.]
Term: Definition
Join Interface: Interface on the OTV edge device that connects to the L3 routed network and
is used to source OTV encapsulated traffic. It can be a loopback, an L3 point-to-point
interface, or an L3 port-channel interface. Subinterfaces may also be used. Multiple
overlays can use the same join interface.
Overlay Interface: Interface on the OTV ED. The overlay interface is used to dynamically
encapsulate the L2 traffic for an extended VLAN in an IP packet for transport to a remote
OTV site. Multiple overlay interfaces are supported on an edge device.
Site VLAN: A VLAN that exists in the local site and connects the OTV edge devices at L2.
The site VLAN is used to discover other edge devices in the local site and allows them to
form an adjacency. After the adjacency is formed, the Authoritative Edge Device (AED) for
each VLAN is elected. The site VLAN should be dedicated to OTV and not extended across the
overlay. The site VLAN should be the same VLAN number at all OTV sites.
Site Identifier: The site-id must be the same for all edge devices that are part of the
same site. The value ranges from 0x1 to 0xffffffff. The site-id is advertised in IS-IS
packets, and it allows edge devices to identify which edge devices belong to the same site.
Edge devices form an adjacency on the overlay as well as on the site VLAN (dual adjacency).
This allows the adjacency between edge devices in a site to be maintained even if the site
VLAN adjacency is broken by a connectivity problem. The overlay interface will not come up
until a site identifier is configured.
Site Adjacency: Formed across the site VLAN between OTV edge devices that are part of the
same site. If an IS-IS hello is received from an OTV ED on the site VLAN with a site-id
that differs from the local site-id, the overlay is disabled. This prevents a loop between
the OTV internal interface and the overlay, and it is why using the same site VLAN at each
site is recommended.
Overlay Adjacency: OTV adjacency established on the OTV join interface. Adjacencies on the
overlay interface are formed between sites, as well as between edge devices that are part
of the same site. Edge devices form a dual adjacency (site and overlay) for resiliency.
For devices in the same site to form an overlay adjacency, the site-id must match.
Deploying OTV
The configuration of the OTV edge device consists of the OTV internal interface, the
join interface, and the overlay virtual interface. Before attempting to configure OTV,
the capabilities of the transport network must be understood, and it must be correctly
configured to support the OTV deployment model.
■ Adjacency Server Mode: Neighbors must be manually configured for the overlay
interface. Unicast control plane packets are created for each individual neighbor and
routed through the transport.
The OTV deployment model should be chosen during the planning phase, after verifying
the capabilities of the transport network. If the transport supports multicast, the
multicast deployment model is recommended. If no multicast support is available in the
transport network, use the adjacency server model.
The transport network must provide IP routed connectivity for unicast and multicast
communication between the OTV EDs. The unicast connectivity requirements are
achieved with any L3 routing protocol. If the OTV ED does not form a dynamic routing
adjacency with the data center, it must be configured with static routes to reach the join
interfaces of the other OTV EDs.
Multicast routing in the transport must be configured to support Protocol
Independent Multicast (PIM). An Any Source Multicast (ASM) group is used for the
OTV control-group, and a range of PIM Source Specific Multicast (SSM) groups is
used for OTV data-groups. IGMPv3 should be enabled on the join interface of the
OTV ED.
Note It is recommended to deploy PIM Rendezvous Point (RP) redundancy in the trans-
port network for resiliency.
With the deployment model determined and the OTV VDC created with the
TRANSPORT_SERVICES_PKG license installed, the following steps are used
to enable OTV functionality. The following examples are based upon a multicast
enabled transport.
OTV Configuration
Before any OTV configuration is entered, the feature must be enabled with the feature
otv command. Example 14-1 shows the configuration associated with the OTV internal
interface, which is the L2 trunk port that participates in traditional switching with the
existing data center network. The VLANs to be extended over OTV are VLAN 100–110.
The site VLAN for both data centers is VLAN 10, which is being trunked over the OTV
internal interface, along with VLANs 100–110.
vlan 1,10,100-110
interface Ethernet3/5
description To NX-1 3/19, OTV internal interface
switchport
switchport mode trunk
mtu 9216
no shutdown
The OTV internal interface should be considered as an access switch in the design of the
data center’s STP domain.
After the OTV internal interface is configured, the OTV join interface can be configured.
The OTV join interface can be configured on M1, M2, M3, or F3 modules and can be
a Loopback interface or an L3 point-to-point link. It is also possible to use an L3 port-
channel, or a subinterface, depending on the deployment requirements. Example 14-2
shows the relevant configuration for the OTV join interface.
interface port-channel3
description To NX-1 Po3, OTV Join interface
mtu 9216
ip address 10.1.12.1/24
ip router ospf 1 area 0.0.0.0
ip igmp version 3
interface Ethernet3/7
description To NX-1 Eth3/22, OTV Join interface
mtu 9216
channel-group 3 mode active
no shutdown
interface Ethernet3/8
description To NX-1 Eth3/23, OTV Join interface
mtu 9216
channel-group 3 mode active
no shutdown
The OTV join interface is a Layer 3 point-to-point interface and is configured for IGMP
version 3. IGMPv3 is required so the OTV ED can join the control-group and data-groups
required for OTV functionality.
Open Shortest Path First (OSPF) is the routing protocol in this topology and is used in
both data centers. The OTV ED learns the unicast routes to reach all other OTV EDs
through OSPF. The entire data center was configured with MTU 9216 on all infrastruc-
ture links to allow full 1500 byte frames to pass between applications without the need
for fragmentation.
Beginning in NX-OS Release 8.0(1), a loopback interface can be used as the OTV join
interface. If this option is used, the configuration will differ from this example, which
utilizes an L3 point-to-point interface. At least one L3 routed interface must connect the
OTV ED to the data center network. A PIM neighbor needs to be established over this
L3 interface, and the OTV ED needs to be configured with the correct PIM Rendezvous
Point (RP) and SSM-range that matches the routed data center devices and the transport
network. Finally, the loopback interface used as the join interface must be configured
with ip pim sparse-mode so that it can act as both a source and receiver for the OTV
control-group and data-groups. The loopback also needs to be included in the dynamic
routing protocol used for Layer 3 connectivity in the data center so that reachability
exists to other OTV EDs.
Note OTV encapsulation increases the size of L2 frames as they are transported across
the IP transport network. The considerations for OTV MTU are further discussed later in
this chapter.
With the OTV internal interface and join interface configured, the logical interface
referred to as the overlay interface can now be configured and bound to the join
interface. The overlay interface is used to dynamically encapsulate VLAN traffic
between OTV sites. The number assigned to the overlay interface must be the same on
all OTV EDs participating in the overlay. It is possible for multiple overlay interfaces
to exist on the same OTV ED, but the VLANs extended on each overlay must not
overlap.
The OTV site VLAN is used to form a site adjacency with any other OTV EDs located in
the same site. Even for a single OTV ED site, the site VLAN must be configured for the
overlay interface to come up. Although not required, it is recommended that the same
site VLAN be configured at each OTV site. This is to allow OTV to detect if OTV sites
become merged, either on purpose or in error. The site VLAN should not be included in
the OTV extended VLAN list. The site identifier should be configured to the same value
for all OTV EDs that belong to the same site. The otv join-interface [interface] com-
mand is used to bind the overlay interface to the join interface. The join interface is used
to send and receive the OTV multicast control plane messaging used to form adjacencies
and learn MAC addresses from other OTV EDs.
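The planning rules above (a site identifier in the supported range, a site VLAN that is not extended, and extended VLAN lists that do not overlap between overlays) can be checked offline before any configuration is applied. The following Python sketch is illustrative only: the function names and the dictionary layout are assumptions made for this example and are not part of NX-OS.

```python
def parse_vlan_list(text):
    """Expand a VLAN list such as "100-110,120" into a set of integers."""
    vlans = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            start, end = part.split("-")
            vlans.update(range(int(start), int(end) + 1))
        else:
            vlans.add(int(part))
    return vlans


def check_overlay_plan(site_vlan, site_id, overlays):
    """Return a list of problems found in a planned OTV configuration.

    overlays maps an overlay interface name to its extended VLAN list.
    """
    problems = []
    # The site identifier range quoted in this chapter is 0x1 to 0xffffffff.
    if not 0x1 <= site_id <= 0xFFFFFFFF:
        problems.append("site identifier out of range")
    seen = {}
    for name, vlan_text in overlays.items():
        vlans = parse_vlan_list(vlan_text)
        # The site VLAN should not be part of the extended VLAN list.
        if site_vlan in vlans:
            problems.append(f"{name}: site VLAN {site_vlan} is in the extended list")
        # VLANs extended on different overlays must not overlap.
        for other, other_vlans in seen.items():
            overlap = sorted(vlans & other_vlans)
            if overlap:
                problems.append(f"{name} overlaps {other} on VLANs {overlap}")
        seen[name] = vlans
    return problems


# Values from this chapter's topology: site VLAN 10, site-id 0x1, VLANs 100-110.
print(check_overlay_plan(10, 0x1, {"Overlay0": "100-110"}))
```

An empty list means that none of the three rules is violated. The sketch does not confirm that the site-id is consistent across the EDs of a site, which still has to be verified on each device.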
Because this configuration uses a multicast-capable transport network, the otv
control-group [group number] command is used to declare which PIM ASM group will be used
for the OTV control plane group. The control plane group carries OTV control plane
traffic such as IS-IS hellos across the transport and allows the OTV EDs to communicate.
The group number should match on all OTV EDs and must be multicast routed in the
transport network. Each OTV ED acts as both a source and receiver for this multicast
group.
The otv data-group [group number] command is used to configure which Source Specific
Multicast (SSM) groups are used to carry multicast data traffic across the overlay.
These groups transport multicast traffic within a VLAN across the
OTV overlay between sites. The number of multicast groups included in the data-group
is a balance between optimization and scalability. If a single group is used,
all OTV EDs receive all multicast traffic on the overlay, even if there is no receiver
at the site. If a large number of groups is defined, multicast traffic can be for-
warded optimally, but the number of groups present in the transport network could
become a scalability concern. Presently, 256 multicast data groups are supported
for OTV.
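The group choices can be sanity-checked with simple address arithmetic. The following Python sketch uses the standard ipaddress module with the control-group and data-group values from this chapter's topology. It assumes the transport uses the default SSM range of 232.0.0.0/8; a network that configures a different SSM range would need that constant changed. The function name is hypothetical.

```python
import ipaddress

# Assumption: the transport uses the default SSM range. Adjust if the
# routed network is configured with a different ssm-range.
SSM_DEFAULT = ipaddress.ip_network("232.0.0.0/8")


def describe_groups(control_group, data_group_range):
    """Summarize whether the OTV groups fit the ASM/SSM expectations."""
    control = ipaddress.ip_address(control_group)
    data = ipaddress.ip_network(data_group_range)
    return {
        "control_is_multicast": control.is_multicast,
        # The control-group is an ASM group, so it should not fall in the SSM range.
        "control_outside_ssm": control not in SSM_DEFAULT,
        # The data-groups are SSM groups, so the whole range should be inside it.
        "data_inside_ssm": data.subnet_of(SSM_DEFAULT),
        "data_group_count": data.num_addresses,
    }


print(describe_groups("239.12.12.12", "232.1.1.0/24"))
```

For the values used in this chapter, the /24 data-group range contains 256 groups, which matches the number of data groups stated as supported.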
has been properly configured for both unicast and multicast routing. Example 14-3
contains the configuration for interface Overlay0 on NX-2 as well as the site-VLAN and
site-identifier configurations.
otv site-vlan 10
interface Overlay0
description Site A
otv join-interface port-channel3
otv control-group 239.12.12.12
otv data-group 232.1.1.0/24
otv extend-vlan 100-110
no shutdown
OTV uses the existing functionality of IS-IS as much as possible. This includes the forma-
tion of neighbors and the use of LSPs and PDUs to exchange reachability information.
OTV EDs discover each other with IS-IS hello packets and form adjacencies on the site
VLAN as well as on the overlay, as shown in Figure 14-2.
OTV is an overlay protocol, which means its operation is dependent upon the under-
lying transport protocols and the reachability they provide. As the control plane is
examined in this chapter, it will become apparent that to troubleshoot OTV, the net-
work operator must be able to segment the different protocol layers and understand
the interaction between them. The OTV control plane consists of L2 switching, L3
routing, IP multicast, and IS-IS. If troubleshooting is being performed in the transport
network, the OTV control plane packets must now be thought of as data plane pack-
ets, where the source and destination hosts are actually the OTV EDs. The transport
network has control plane protocols that may also need investigation to solve an
OTV problem.
The transport network’s multicast capability allows OTV to form IS-IS adjacencies as if
each OTV ED were connected to a common LAN segment. In other words, think of the
control-group as a logical multipoint connection from one OTV ED to all other OTV
EDs. The site adjacency is formed over the site VLAN, which connects both OTV EDs in
a site across the internal interface using direct L2 communication.
[Figure (caption not recoverable). Labels recovered from the figure: OTV edge devices NX-2
and NX-4; NX-1 and NX-3; NX-5 through NX-8; Po2 (Access); OTV join interfaces; IGMP join to
NX-3; PIM join (10.1.12.1, 239.12.12.12); IS-IS hello replicated for NX-4, NX-6, NX-8.]
Note The behavior of forming Dual Adjacencies on the site VLAN and the overlay began
with NX-OS release 5.2(1). Prior to this, OTV EDs in a site only formed site adjacencies.
The IS-IS protocol used by OTV does not require any user configuration for basic
functionality. When OTV is configured, the IS-IS process is enabled and configured
automatically. Adjacencies form provided that the underlying transport is functional and the
configured parameters for the overlay are compatible between OTV EDs.
The IS-IS control plane is fundamental to the operation of OTV. It provides the mecha-
nism to discover both local and remote OTV EDs, form adjacencies, and exchange MAC
address reachability between sites. MAC address advertisements are learned through the
IS-IS control plane. An SPF calculation is performed, and then the OTV MAC routing
table is populated based on the result. When investigating a MAC address reachability
issue, the advertisement is tracked through the OTV control plane to ensure that the ED
has the correct information from all IS-IS neighbors. If a host-to-host reachability problem
exists across the overlay, it is recommended to begin the investigation with a validation of
the control plane configuration and operational state before moving into the data plane.
The output of Example 14-4 verifies the Overlay0 interface is operational, which VLANs
are being extended, the transport multicast groups for the OTV control-group and data-
groups, the join interface, site VLAN, and AED capability. This information should match
what has been configured in the overlay interface on the local and remote site OTV EDs.
Example 14-5 demonstrates how to verify that the IS-IS adjacencies are properly formed
for OTV on the overlay interface.
Overlay-Interface Overlay0 :
The output of the show otv site command, as shown in Example 14-6, is used to verify
the site adjacency. The adjacency with NX-4 is in the Full state, which indicates that
both the overlay and site adjacencies are functional (Dual Adjacency).
-------------------------------------------------------------------------------
NX-4 64a0.e73e.12c2 Full 13:50:52 Yes
Examples 14-5 and 14-6 show a different adjacency uptime for the site and overlay adja-
cencies because these are independent IS-IS interfaces, and the adjacencies form indepen-
dently of each other. The site-id for an IS-IS neighbor is found in the output of show otv
internal adjacency, as shown in Example 14-7. This provides information about which
OTV EDs are part of the same site.
Overlay-Interface Overlay0 :
System-ID Dest Addr Adj-State TM_State Adj-State inAS Site-ID
Version
64a0.e73e.12c2 10.1.22.1 default default UP UP 0000.0000.0001*
HW-St: Default N backup (null)
Note OTV has several event-history logs that are useful for troubleshooting. The show
otv isis internal event-history adjacency command is used to review recent adjacency
changes.
A point-to-point tunnel is created for each OTV ED that has an adjacency. These
tunnels are used to transport OTV unicast packets between OTV EDs. The output
of show tunnel internal implicit otv brief should have a tunnel present for each
OTV ED reachable on the transport network. The output from NX-2 is shown in
Example 14-8.
-------------------------------------------------------------------------------
Interface Status IP Address Encap type MTU
-------------------------------------------------------------------------------
Tunnel16384 up -- GRE/IP 9178
Tunnel16385 up -- GRE/IP 9178
Tunnel16386 up -- GRE/IP 9178
Additional details about a specific tunnel are viewed with show tunnel internal implicit
otv tunnel_num [number]. Example 14-9 shows detailed output for tunnel 16384. The
MTU, transport protocol source, and destination address are shown, which allows a
tunnel to be mapped to a particular neighbor. This output should be verified if a
specific OTV ED is having a problem.
When the OTV Adjacencies are established, the AED role is determined for each VLAN
that is extended across the overlay using a hash function. The OTV IS-IS system-id is
used along with the VLAN identifier to determine the AED role for each VLAN based
on an ordinal value. The device with the lower system-id becomes AED for the even-
numbered VLANs, and the device with the higher system-id becomes AED for the odd
numbered VLANs.
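The even and odd split described above can be expressed as a short Python sketch for a site with two edge devices. This illustrates only the outcome stated in the text, not the NX-OS implementation; the function name is hypothetical, and the two system-ids are the ones that appear in this chapter's example output.

```python
def elect_aed(system_ids, vlans):
    """Map each VLAN to the system-id expected to be its AED (two-ED site only)."""
    if len(system_ids) != 2:
        raise ValueError("this sketch only covers a site with two edge devices")
    # System-ids such as "64a0.e73e.12c2" are fixed-width hexadecimal strings,
    # so string ordering matches numeric ordering.
    lower, higher = sorted(s.lower() for s in system_ids)
    # Lower system-id takes the even VLANs, higher system-id takes the odd VLANs.
    return {vlan: (lower if vlan % 2 == 0 else higher) for vlan in vlans}


aed = elect_aed(["6c9c.ed4d.d942", "64a0.e73e.12c2"], range(100, 111))
for vlan, owner in aed.items():
    print(vlan, owner)
```

With these inputs, 64a0.e73e.12c2 is expected to be the AED for the even VLANs in 100 through 110, and 6c9c.ed4d.d942 for the odd VLANs. Comparing this expectation with the show otv vlan output on each device is a quick way to confirm that the correct ED is being investigated.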
The show otv vlan command from NX-2 is shown in Example 14-10. The VLAN state
column lists the current state as Active or Inactive. An Active state indicates this OTV ED
is the AED for that VLAN and is responsible for forwarding packets across the overlay
and advertising MAC address reachability for the VLAN. This is an important piece of
information to know when troubleshooting to ensure the correct device is being investi-
gated for a particular VLAN.
Legend:
(NA) - Non AED, (VD) - Vlan Disabled, (OD) - Overlay Down
(DH) - Delete Holddown, (HW) - HW: State Down
(NFC) - Not Forward Capable
Adjacency problems are typically caused by configuration error, a packet delivery prob-
lem for the OTV control-group in the transport network, or a problem with the site
VLAN for the site adjacency.
For problems with an overlay adjacency, check the IP multicast state on the multicast
router connected to the OTV ED’s join interface. Each OTV ED should have a corre-
sponding (S,G) mroute for the control-group. The L3 interface that connects the multicast
router to the OTV ED should be populated in the Outgoing Interface List (OIL) for the
(*, G) and all active sources (S,G) of the OTV control-group because of the IGMP join
from the OTV ED.
The show ip mroute [group] command from NX-1 is shown in Example 14-11. The
(*, 239.12.12.12) entry has Port-channel 3 populated in the OIL by IGMP. For all active
sources sending to 239.12.12.12, the OIL is populated with Port-channel 3 as well, which
allows NX-2 to receive IS-IS hello and LSP packets from NX-4, NX-6, and NX-8. The
source address of each (S,G) entry is the join interface of another OTV ED that is
sending multicast packets to the group.
The presence of a (*, G) from IGMP for a group indicates that at minimum an IGMP
join message was received by the router, and there is at least one interested receiver on
that interface. A PIM join message is sent toward the PIM RP from the last hop router,
and the (*, G) join state should be present along the multicast tree to the PIM RP.
When a data packet for the group is received on the shared tree by the last hop router,
in this case NX-1, a PIM (S, G) join message is sent toward the source. This messaging
forms what is called the source tree, which is built to the first-hop router connected to
the source. The source tree remains in place as long as the receiver is still interested in
the group.
Example 14-12 shows how to verify the receipt of traffic with the show ip mroute
summary command, which provides packet counters and bit-rate values for each
source.
Because IS-IS adjacency failures for the overlay are often caused by multicast pack-
et delivery problems in the transport, it is important to understand what the mul-
ticast state on each router is indicating. The multicast role of each transport router
must also be understood to provide context to the multicast routing table state. For
example, is the device a first-hop router (FHR), PIM RP, transit router, or last-hop
router (LHR)? In the network example, NX-1 is a PIM LHR, FHR, and RP for the
control-group.
If NX-1 had no multicast state for the OTV control-group, it indicates that the IGMP
join has not been received from NX-2. Because NX-1 is also a PIM RP for this group,
it also indicates that none of the sources have been registered. If a (*, G) was present,
but no (S, G), it indicates that the IGMP join was received from NX-2, but multicast
data traffic from NX-4, NX-6, or NX-8 was not received by NX-1; therefore, the
switchover to the source tree did not happen. At that point, troubleshooting moves
toward the source and first-hop routers until the cause of the multicast problem is
identified.
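The reasoning in the preceding paragraph can be summarized as a small decision helper. The Python sketch below is a hypothetical aid for interpreting the control-group state on the last-hop router, not an NX-OS tool. The inputs are assumed to have been read manually from the show ip mroute output, and the remote addresses are the join-interface addresses of NX-4, NX-6, and NX-8 from this chapter's topology.

```python
def interpret_control_group_state(has_star_g, sg_sources, expected_sources):
    """Suggest where to look next based on control-group mroute state.

    has_star_g       -- True if a (*, G) entry exists for the control-group
    sg_sources       -- source addresses that have an (S, G) entry
    expected_sources -- join-interface addresses of the other OTV EDs
    """
    missing = sorted(set(expected_sources) - set(sg_sources))
    if not has_star_g and not sg_sources:
        return ("no state", "IGMP join from the local OTV ED not received; "
                            "on a PIM RP, no sources have registered")
    if not has_star_g:
        # This combination is not discussed in the text above.
        return ("unexpected", "(S, G) present without (*, G)")
    if missing:
        return ("partial", "IGMP join received; no traffic seen from "
                           + ", ".join(missing) + "; continue toward those sources")
    return ("complete", "(*, G) and an (S, G) for every remote OTV ED are present")


remote_eds = ["10.1.22.1", "10.2.34.1", "10.2.43.1"]
print(interpret_control_group_state(True, ["10.1.22.1"], remote_eds))
```

In this run, the helper reports a partial state and names 10.2.34.1 and 10.2.43.1 as the sources to trace toward their first-hop routers.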
The site adjacency is formed across the site VLAN. There must be connectivity between
the internal interfaces of the OTV EDs across the data center network for the IS-IS adjacency
to form successfully. Example 14-13 contains the output of show otv site where the site
adjacency is down. The Partial state indicates that the site adjacency has failed while the
overlay adjacency with NX-4 remains UP.
Overlay-Interface Overlay0 :
Hostname System-ID Dest Addr Up Time State
NX-4 64a0.e73e.12c2 10.1.22.1 00:01:57 UP
NX-8 64a0.e73e.12c4 10.2.43.1 00:01:57 UP
NX-6 6c9c.ed4d.d944 10.2.34.1 00:02:09 UP
The show otv isis site output confirms that the adjacency was lost on the site VLAN as
shown in Example 14-14.
BFD: Disabled
The IS-IS adjacency being down indicates that IS-IS hellos (IIH packets) are not being
exchanged properly on the site VLAN. The transmission and receipt of IIH packets are recorded
in the output of show otv isis internal event-history iih. Example 14-15 confirms that
IIH packets are being sent, but none are being received across the site VLAN.
This event-history log confirms that the IIH packets are created, and the process
is sending them out to the site VLAN. The same event-history can be checked on
NX-4 to verify if the IIH packets are received. The output from NX-4 is shown in
Example 14-16, which indicates the IIH packets are being sent, but none are received
from NX-2.
The output in Example 14-15 and Example 14-16 confirms that both NX-2 and NX-4
are sending IS-IS hellos on the site VLAN, but neither side is receiving packets from
the other OTV ED. At this point of the investigation, troubleshooting should follow the
VLAN across the L2 data center infrastructure to confirm the VLAN is properly con-
figured and trunked between NX-2 and NX-4. In this case, a problem was identified on
NX-3 where the site VLAN, VLAN 10, was not being trunked across the vPC peer-link.
This resulted in a Bridge Assurance inconsistency problem over the peer-link, as shown in
the output of Example 14-17.
After correcting the trunked VLAN configuration of the vPC peer-link, the OTV site
adjacency came up on the site VLAN, and the dual adjacency state was returned to
FULL. The adjacency transitions are viewed in the output of show otv isis internal
event-history adjacency as shown in Example 14-18.
The first troubleshooting step for an adjacency problem is to ensure that both neighbors
are generating and transmitting IS-IS hellos properly. If they are, start stepping through
the transport or underlay network until the connectivity problem is isolated.
If the site VLAN is verified to be functional across the data center, the next step in
troubleshooting an adjacency problem is to perform packet captures to determine which
device is not forwarding the frames correctly. Chapter 2, “NX-OS Troubleshooting Tools,”
covers the use of various packet capture tools available on NX-OS platforms that can be
utilized to isolate the problem. An important concept to grasp is that even though these
are control plane packets for OTV IS-IS on NX-2 and NX-4, as they are traversing the L3
transport network, they are handled as ordinary data plane packets.
consistent view of the topology. After LSPs are exchanged, the Shortest Path First (SPF)
algorithm runs and constructs the topology with MAC addresses as leafs. Entries are then
installed into the OTV MAC routing table for the purpose of traffic forwarding.
An example of the OTV IS-IS database is shown in Example 14-19. This output shows
the LSP for NX-4 from the IS-IS database on NX-2.
The LSP lifetime shows that LSPs are only a few seconds old because the Lifetime counts
from 1200 to zero. Issuing the command a few times may also show the Seq Number
field incrementing, which indicates that the LSP is being updated by the originating
IS-IS neighbor with changed information. This could cause OTV MAC routes to be
refreshed and reinstalled as the SPF algorithm executes constantly. LSPs may refresh and
get updated as part of normal IS-IS operation, but in this case the updates are happening
constantly, which is abnormal in a steady-state.
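Because the Lifetime field counts down from 1200, the age of an LSP is simply 1200 minus the displayed value. The Python sketch below shows that arithmetic and counts how often the sequence number increases across repeated readings. The sample readings are hypothetical values chosen for illustration and are not taken from the example output.

```python
MAX_LSP_LIFETIME = 1200  # seconds; the Lifetime field counts down from this value


def lsp_age(lifetime_remaining):
    """Seconds since the LSP was last originated or refreshed."""
    return MAX_LSP_LIFETIME - lifetime_remaining


def sequence_increases(samples):
    """Count how many times the sequence number increased between readings."""
    seqs = [seq for seq, _ in samples]
    return sum(1 for prev, cur in zip(seqs, seqs[1:]) if cur > prev)


# Hypothetical (sequence number, lifetime) readings taken a few seconds apart.
samples = [(0x0A10, 1198), (0x0A11, 1199), (0x0A13, 1197)]
for seq, lifetime in samples:
    print(hex(seq), "age:", lsp_age(lifetime), "seconds")
print("sequence increases:", sequence_increases(samples))
```

A sequence number that increases on every reading, combined with an age that never grows beyond a few seconds, matches the constant-update pattern described above.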
To investigate the problem, check the LSP contents for changes over time. To understand
which OTV ED is advertising which LSP, check the hostname to system-id mapping. The
Hostname TLV provides a way to dynamically learn the system-id to hostname mapping
for a neighbor. To identify which IS-IS database entries belong to which neighbors, use
the show otv isis hostname command, as shown in Example 14-20. The asterisk (*) indi-
cates the local system-id.
The contents of an individual LSP are verified with the show otv isis database detail
[lsp-id] command. Example 14-21 shows the LSP received from NX-4 at NX-2, which contains
several important pieces of information, such as neighbor and MAC address reachability, the
site-id, and which device is the AED for a particular VLAN.
AED-Server-ID : 64a0.e73e.12c2
Version 57
ED Summary : Device ID : 6c9c.ed4d.d942 : fwd_ready : 1
ED Summary : Device ID : 64a0.e73e.12c2 : fwd_ready : 1
Site ID : 0000.0000.0001 : Partition ID : ffff.ffff.ffff
Device ID : 64a0.e73e.12c2 Cluster-ID : 0
Vlan Status : AED : 0 Back-up AED : 1 Fwd ready : 1 Priority : 0 Delete : 0
Local : 1 Remote : 1 Range : 1 Version : 9
Start-vlan : 101 End-vlan : 109 Step : 2
AED : 1 Back-up AED : 0 Fwd ready : 1 Priority : 0 Delete : 0 Local : 1
Remote : 1 Range : 1 Version : 9
Start-vlan : 100 End-vlan : 110 Step : 2
Site ID : 0000.0000.0001 : Partition ID : ffff.ffff.ffff
Device ID : 64a0.e73e.12c2 Cluster-ID : 0
AED SVR status : Old-AED : 64a0.e73e.12c2 New-AED : 6c9c.ed4d.d942
old-backup-aed : 0000.0000.0000 new-backup-aed : 64a0.e73e.12c2
Delete-flag : 0 No-of-range : 1 Version : 9
Start-vlan : 101 End-vlan : 109 Step : 2
Old-AED : 64a0.e73e.12c2 New-AED : 64a0.e73e.12c2
old-backup-aed : 0000.0000.0000 new-backup-aed : 6c9c.ed4d.d942
Delete-flag : 0 No-of-range : 1 Version : 9
Start-vlan : 100 End-vlan : 110 Step : 2
Digest Offset : 0
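The Start-vlan, End-vlan, and Step fields in the VLAN status entries describe a compact VLAN range. Assuming the range is inclusive and Step is the increment between members (an interpretation consistent with the odd and even AED split, but not confirmed by the output itself), the ranges expand as in this Python sketch. The function name is hypothetical.

```python
def expand_vlan_range(start, end, step):
    """Expand Start-vlan/End-vlan/Step into an explicit VLAN list (inclusive)."""
    return list(range(start, end + 1, step))


print(expand_vlan_range(101, 109, 2))  # odd VLANs in the example LSP
print(expand_vlan_range(100, 110, 2))  # even VLANs in the example LSP
```

Under this interpretation, the entry with AED : 1 covers VLANs 100, 102, 104, 106, 108, and 110 for device 64a0.e73e.12c2, and the entry with Back-up AED : 1 covers VLANs 101, 103, 105, 107, and 109.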
To determine what information is changing in the LSP, use the NX-OS diff utility. As
shown in Example 14-22, the diff utility reveals that the Sequence Number is updated,
and the LSP Lifetime has refreshed again to 1198. The changing LSP contents are related
to HSRP MAC addresses in several VLANs extended by OTV.
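The same comparison can be made offline on saved command output. The Python sketch below uses the standard difflib module to list only the lines that differ between two captures. It is an analog of the on-box diff utility, not a replacement for it, and the snapshot lines are simplified, hypothetical stand-ins rather than actual NX-OS output.

```python
import difflib


def changed_lines(before, after):
    """Return only the removed (-) and added (+) lines between two captures."""
    diff = list(difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0))
    # The first two entries are the "---" and "+++" file headers; skip them,
    # then keep change lines and drop the "@@" hunk markers.
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Simplified, hypothetical snapshots of one LSP taken a few seconds apart.
before = "\n".join([
    "seq 0x10 lifetime 1150",
    "hostname NX-4",
    "mac 0103-0000.0c07.ac67 metric 0",
])
after = "\n".join([
    "seq 0x11 lifetime 1198",
    "hostname NX-4",
    "mac 0103-0000.0c07.ac67 metric 1",
])

for line in changed_lines(before, after):
    print(line)
```

Lines that appear in every comparison, other than the sequence number and lifetime, point to the information that is forcing the LSP to be regenerated.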
The MAC reachability information from the LSP is installed into the OTV MAC routing
table. Each MAC address is installed with a next-hop known either via the site VLAN
or from an OTV ED reachable across the overlay interface. The OTV MAC routing table
in Example 14-23 confirms that MAC address entries are unstable and are refreshing.
The Uptime for several entries is less than 1 minute and some were dampened with
the (D) flag.
Additional information is obtained from the OTV event-traces. Because you are
interested in the changes being received in the IS-IS LSP from a remote OTV ED, the
show otv isis internal event-history spf-leaf is used to view what is changing and
causing the routes to be refreshed in the OTV route table. This output is provided in
Example 14-24.
NX-2# show otv isis internal event-history spf-leaf | egrep "Process 0103-0000.0c07.ac67"
20:12:48.699301 isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 contained in 6c9c.ed4d.d944.00-00 with metric 0
20:12:45.060622 isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 contained in 6c9c.ed4d.d944.00-00 with metric 0
20:12:32.909267 isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 contained in 6c9c.ed4d.d944.00-00 with metric 1
20:12:30.743478 isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 contained in 6c9c.ed4d.d944.00-00 with metric 1
20:12:28.652719 isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 contained in 6c9c.ed4d.d944.00-00 with metric 0
20:12:26.470400 isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 contained in 6c9c.ed4d.d944.00-00 with metric 0
20:12:25.978913 isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 contained in 6c9c.ed4d.d944.00-00 with metric 0
20:12:13.239379 isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 contained in 6c9c.ed4d.d944.00-00 with metric 0
It is now apparent what is changing in the LSPs and why the lifetime is continually
resetting to 1200. The metric advertised for 0103-0000.0c07.ac67 is alternating between zero and one.
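When the event-history is long, counting the metric changes per entry makes the flapping easier to see. The Python sketch below parses spf-leaf lines in the format shown in Example 14-24, with each entry joined onto a single line, and counts how many times the metric changes for each key. The regular expression and function name are assumptions written for this example, and the sample reuses four of the entries from the output above.

```python
import re

LINE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}\.\d+) .*Process (?P<key>\S+) "
    r"contained in (?P<lsp>\S+) with metric (?P<metric>\d+)$"
)


def metric_flips(log_text):
    """Count metric changes per VLAN-MAC key in spf-leaf event-history text."""
    events = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match:
            events.append((match["time"], match["key"], int(match["metric"])))
    # The event-history lists the newest entry first; sort to oldest first.
    # Sorting on the time string is only valid within a single day.
    events.sort()
    last, flips = {}, {}
    for _, key, metric in events:
        if key in last and last[key] != metric:
            flips[key] = flips.get(key, 0) + 1
        last[key] = metric
    return flips


TEMPLATE = ("{time} isis_otv default [13901]: [13911]: Process 0103-0000.0c07.ac67 "
            "contained in 6c9c.ed4d.d944.00-00 with metric {metric}")
sample = "\n".join(
    TEMPLATE.format(time=t, metric=m)
    for t, m in [("20:12:48.699301", 0), ("20:12:32.909267", 1),
                 ("20:12:30.743478", 1), ("20:12:28.652719", 0)]
)
print(metric_flips(sample))
```

For these four entries, the metric changes twice within about 20 seconds (zero to one, then back to zero), which is the instability that forces the LSP to be regenerated.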
The next step is to further investigate the problem at the remote AED that is originating
the MAC advertisements across the overlay. In this particular case, the problem is caused
by an incorrect configuration. The HSRP MAC addresses are being advertised across the
overlay through OTV incorrectly. The HSRP MAC should be blocked using the First Hop
Redundancy Protocol (FHRP) localization filter, as described later in this chapter, but instead
it was advertised across the overlay, resulting in the observed instability.
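FHRP virtual MAC addresses follow well-known patterns, so entries that should have been filtered are easy to spot in a list of advertised routes. The Python sketch below splits a VLAN-MAC key and tags HSRP and VRRP virtual MACs. It assumes that the first field of the key is the VLAN ID written in decimal, as the keys in this chapter suggest. The function name is hypothetical, and the prefix table covers only IPv4 HSRP version 1, HSRP version 2, and VRRP.

```python
# Well-known virtual MAC prefixes (IPv4), in the dotted notation used by NX-OS.
FHRP_PREFIXES = {
    "0000.0c07.ac": "HSRPv1",
    "0000.0c9f.f": "HSRPv2",
    "0000.5e00.01": "VRRP",
}


def classify_route_key(key):
    """Split a 'VLAN-MAC' key such as '0103-0000.0c07.ac67' and tag FHRP MACs."""
    vlan_text, mac = key.split("-", 1)
    mac = mac.lower()
    protocol = next(
        (name for prefix, name in FHRP_PREFIXES.items() if mac.startswith(prefix)),
        None,
    )
    # Assumption: the leading field is the VLAN ID in decimal.
    return int(vlan_text), mac, protocol


for key in ["0103-0000.0c07.ac67", "0101-64a0.e73e.12c1"]:
    print(classify_route_key(key))
```

The first key is identified as an HSRP version 1 virtual MAC in VLAN 103, which is the entry seen flapping in the event-history. The second key is an ordinary MAC address and is not tagged.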
The previous example demonstrated a problem with the receipt of a MAC advertisement
from a remote OTV ED. If MAC addresses are not being advertised
to other OTV EDs from the local AED, the first step is to verify that OTV is passing
the MAC addresses into IS-IS for advertisement. The show otv isis mac redistribute
route command shown in Example 14-25 is used to verify that MAC addresses were
passed to IS-IS for advertisement to other OTV EDs.
0101-64a0.e73e.12c1, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0101-6c9c.ed4d.d941, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0101-c464.135c.6600, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0103-64a0.e73e.12c1, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0103-6c9c.ed4d.d941, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0105-64a0.e73e.12c1, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0105-6c9c.ed4d.d941, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0107-64a0.e73e.12c1, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0109-64a0.e73e.12c1, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0109-6c9c.ed4d.d941, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
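When the redistribution output is long, a short script can answer whether a given VLAN and MAC address were handed to IS-IS. The sketch below assumes the two-line entry format shown above and assumes the four-digit prefix is the decimal VLAN ID (0103 for VLAN 103, as used elsewhere in this chapter). The function names are illustrative only.

```python
import re

# One entry: "VVVV-mac, all" followed by the "Advertised into L1" line.
ENTRY = re.compile(
    r"(?P<vlan>\d{4})-(?P<mac>[0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4}), all\s+"
    r"Advertised into L1, metric (?P<metric>\d+) LSP-ID (?P<lsp>\S+)"
)

SAMPLE = """\
0101-64a0.e73e.12c1, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
0103-6c9c.ed4d.d941, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d942.00-00
"""

def advertised_macs(output):
    """Return a list of (vlan, mac, lsp_id) tuples found in the command output."""
    return [(int(m["vlan"]), m["mac"], m["lsp"]) for m in ENTRY.finditer(output)]

def is_advertised(output, vlan, mac):
    """True if the VLAN/MAC pair appears in the redistribution output."""
    return any(v == vlan and a == mac.lower() for v, a, _ in advertised_macs(output))

print(is_advertised(SAMPLE, 103, "6c9c.ed4d.d941"))
```

A MAC address that is present in the local MAC address table but missing from this list has not been passed to IS-IS, which narrows the problem to the local AED.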
The integrity of the IS-IS LSP is a critical requirement for the reliability and stability of
the OTV control plane. Packet corruption or loss in the transport can affect both the
OTV IS-IS adjacencies and the advertisement of LSPs. Separate IS-IS statistics are
available for the overlay and site VLAN, as shown in Examples 14-26 and 14-27, and they
provide valuable clues when troubleshooting an adjacency or LSP issue.
SPF calculations: 0
LSPs sourced: 2
LSPs refreshed: 13
LSPs purged: 0
Incrementing receive errors or retransmits indicate a problem with IS-IS PDUs, which
may result in MAC address reachability problems. Incrementing RcvAuthErr indicates an
authentication mismatch between OTV EDs.
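Because the counters are only meaningful when they increment, it helps to compare two snapshots taken some time apart. The following sketch assumes the counters have already been collected into dictionaries by some other means; the function name and the dictionary layout are hypothetical, and only the RcvAuthErr counter name comes from the text.

```python
def incrementing(before, after, watch=("RcvAuthErr",)):
    """Return {counter: delta} for watched counters that grew between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in watch
        if after.get(name, 0) > before.get(name, 0)
    }

# Two hypothetical snapshots of the same interface, taken a minute apart.
first = {"RcvAuthErr": 10}
second = {"RcvAuthErr": 25}
print(incrementing(first, second))
```

An empty result means the watched counters were stable over the interval, so historical errors can be ruled out as the cause of a current problem.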
otv site-vlan 10
key chain OTV-CHAIN
  key 0
    key-string 7 073c046f7c2c2d
interface Overlay0
  description Site A
  otv isis authentication-type md5
  otv isis authentication key-chain OTV-CHAIN
  otv join-interface port-channel3
  otv control-group 239.12.12.12
  otv data-group 232.1.1.0/24
  otv extend-vlan 100-110
  no shutdown
otv-isis default
otv site-identifier 0x1
OTV IS-IS authentication is enabled as verified with the show otv isis interface overlay
[overlay-number] output in Example 14-29.
All OTV sites need to be configured with the same authentication commands for the
overlay adjacency to form. Incrementing RcvAuthErr for LAN-IIH frames, as shown in
the output of Example 14-30, indicates the presence of an authentication mismatch.
The output of show otv adjacency and show otv site varies depending on which adjacen-
cies are down. The authentication configuration is applied only to the overlay interface,
so it is possible the site adjacency is up even if one OTV ED at a site has authentication
misconfigured for the overlay.
Example 14-31 shows that the overlay adjacency is down, but the site adjacency is still
valid. In this scenario, the state is shown as Partial.
A multicast transport allows the ED to generate only a single multicast packet, which is
then replicated by the transport network. Therefore, it is preferred to use multicast mode
whenever possible because of the increase in efficiency. However, in deployments where
only two sites exist, or where multicast is not possible in the transport, adjacency server
mode allows for a completely functional OTV deployment over IP unicast.
Each ED’s OTV overlay is configured to use the unicast IP address of the adjacency
server, as shown in Example 14-32. The role of the adjacency server is handled
by a user-designated OTV ED. Each OTV ED registers itself with the adjacency server
by sending OTV IS-IS hellos, which are transmitted from the OTV join interface as OTV
encapsulated IP unicast packets. As the adjacency server forms adjacencies with
remote OTV EDs, it dynamically builds a list of OTV EDs. The adjacency server then
advertises the list of known EDs to each neighbor. This gives every ED a mechanism
to dynamically learn about all other OTV EDs, so that update messages can be created
and replicated to each remote ED.
interface Overlay0
  otv join-interface port-channel3
  otv extend-vlan 100-110
  otv use-adjacency-server 10.1.12.1 unicast-only
  no shutdown
otv site-identifier 0x1
Example 14-33 shows the configuration for NX-2, which is now acting as the adjacency
server. When configuring an OTV ED in adjacency server mode, the otv control-group
[multicast-group] and otv data-group [multicast-group] configuration shown on each OTV
ED in the previous examples must be removed. The otv use-adjacency-server
[IP address] command is then configured to enable OTV adjacency server mode, and the
otv adjacency-server unicast-only command specifies that NX-2 will be the adjacency
server. The join interface and internal interface configurations remain unchanged from
the previous examples in this chapter.
interface port-channel3
  description 7009A-Main-OTV Join
  mtu 9216
  ip address 10.1.12.1/24
  ip router ospf 1 area 0.0.0.0
  ip igmp version 3
interface Overlay0
  description Site A
  otv join-interface port-channel3
  otv extend-vlan 100-110
  otv use-adjacency-server 10.1.12.1 unicast-only
  otv adjacency-server unicast-only
  no shutdown
otv site-identifier 0x1
Dynamically advertising a list of known OTV EDs saves the user from having to config-
ure every OTV ED with all other OTV ED addresses to establish adjacencies. The process
of registration with the adjacency server and advertisement of the OTV Neighbor List is
shown in Figure 14-4. The site adjacency is still present but not shown in the figure for
clarity.
[Figure 14-4: Registration with the adjacency server. NX-6 and NX-8 send IS-IS hellos
from their OTV join interfaces (10.2.34.1 and 10.2.43.1).]
After the OTV Neighbor List (oNL) is built, it is advertised to each OTV ED from the
adjacency server as shown in Figure 14-5.
[Figure 14-5: The adjacency server advertises the OTV Neighbor List (oNL) to each OTV ED.]
Each OTV ED then establishes IS-IS adjacencies with all other OTV EDs. Updates are
sent with OTV encapsulation in IP unicast packets from each OTV ED. Each OTV ED
must replicate its message to all other neighbors. This step is shown in Figure 14-6.
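The registration, advertisement, and replication steps can be modeled in a few lines of Python. This is a toy model under stated assumptions, not the OTV implementation: the class and function names are invented for illustration, and 10.1.22.1 is assumed to be the join interface address of NX-4.

```python
class AdjacencyServer:
    """Toy model: collects registrations and hands out the OTV Neighbor List (oNL)."""

    def __init__(self, address):
        self.address = address
        self.neighbor_list = {address}  # the adjacency server is itself an OTV ED

    def register(self, ed_address):
        """Models an ED's unicast IS-IS hello arriving from its join interface."""
        self.neighbor_list.add(ed_address)

    def advertise(self):
        """Return each ED's view of the oNL: every known ED except itself."""
        return {ed: sorted(self.neighbor_list - {ed}) for ed in self.neighbor_list}


def copies_per_update(neighbor_view, ed_address):
    """Head-end replication: one unicast copy of each update per remote ED."""
    return len(neighbor_view[ed_address])


server = AdjacencyServer("10.1.12.1")                    # NX-2
for join_ip in ("10.1.22.1", "10.2.34.1", "10.2.43.1"):  # NX-4 (assumed), NX-6, NX-8
    server.register(join_ip)

views = server.advertise()
print(views["10.1.22.1"], copies_per_update(views, "10.1.22.1"))
```

With four EDs, each update is sent three times by its originator. That linear growth is the cost of unicast-only mode compared with a multicast transport, where the ED sends a single packet.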
Example 14-34 contains the output of show otv adjacency from NX-4. After NX-4 receives
the OTV Neighbor List from the adjacency server, IS-IS adjacencies are formed with all
other OTV EDs.
Overlay-Interface Overlay0 :
Hostname System-ID Dest Addr Up Time State
NX-8 64a0.e73e.12c4 10.2.43.1 00:20:35 UP
NX-2 6c9c.ed4d.d942 10.1.12.1 00:20:35 UP
NX-6 6c9c.ed4d.d944 10.2.34.1 00:20:35 UP
[Figure 14-6: Each OTV ED replicates its unicast-encapsulated updates to all other OTV EDs.]
An OTV IS-IS site adjacency is still formed across the site VLAN, as shown in the output of
show otv site in Example 14-35.
-------------------------------------------------------------------------------
NX-2 6c9c.ed4d.d942 Full 00:42:04 Yes
Redundant OTV adjacency servers are supported for resiliency purposes. However,
the two adjacency servers operate independently, and they do not synchronize state
with each other. If multiple adjacency servers are present, each OTV ED registers
with each adjacency server. An OTV ED uses the replication list from the primary
adjacency server until it is no longer available. If the adjacency with the primary
adjacency server goes down, the OTV ED starts using the replication list received
from the secondary adjacency server. If the primary adjacency server comes back up
before a 10-minute timeout, the OTV EDs revert to the primary replication list. If more
than 10 minutes pass, a new replication list is pushed by the primary when it finally
becomes active again.
The importance of CoPP is realized when the OTV ARP-ND-Cache is enabled. ARP
Reply messages are snooped and added to the local cache so the OTV AED can answer
ARP requests on behalf of the target host. These packets must be handled by the con-
trol plane and could cause policing drops or high CPU utilization if the volume of ARP
traffic is excessive. The OTV ARP-ND-Cache is discussed in more detail later in this
chapter.
The show policy-map interface control-plane command from the default VDC provides
statistics for each control plane traffic class. If CoPP drops are present and ARP
resolution failure is occurring, the solution is typically not to adjust the control plane
policy to allow more traffic, but to instead track down the source of excessive ARP
traffic. Ethanalyzer is a good tool for this type of problem along with the event
histories for OTV.
The default overlay encapsulation for OTV is GRE, shown in Figure 14-7. This is also
referred to as OTV 1.0 encapsulation.
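[Figure 14-7: OTV 1.0 encapsulation. The original Layer 2 frame (DMAC, SMAC, 802.1Q,
EtherType, payload, CRC) arrives from the VLAN on the OTV internal interface Eth3/5 of
NX-2 and is sent to the transport from the OTV join interface Po3 with a 4-byte GRE
header and a 4-byte MPLS header carrying the overlay number and VLAN ID.]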
When a frame arrives on the internal interface, a series of lookups is used to determine
how to rewrite the packet for transport across the overlay. The original payload,
ethertype, source MAC address, and destination MAC address are copied into the new
OTV encapsulated frame. The 802.1Q header is removed, and an OTV SHIM header
is inserted. The SHIM header contains information about the VLAN and the overlay it
belongs to. In OTV 1.0, this field is actually an MPLS-in-GRE encapsulation, where the
MPLS label is used to derive the VLAN. The value of the MPLS label is equal to 32 +
VLAN identifier. For example, VLAN 101 is encapsulated as MPLS label 133. The
outer IP header is then added, which contains the source IP address of the local OTV ED
and the destination IP address of the remote OTV ED.
Control plane IS-IS frames are encapsulated in a similar manner between OTV EDs across
the overlay and also carry the same 42 bytes of OTV Overhead. The MPLS label used for
IS-IS control plane frames is the reserved label 1, which is the Router Alert label.
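The label arithmetic is useful when reading a packet capture taken in the transport. The helper below is a small sketch of the rule stated above (label = 32 + VLAN, label 1 for IS-IS). The function names and the VLAN range check of 1 to 4094 are additions for illustration.

```python
ROUTER_ALERT_LABEL = 1   # reserved MPLS label carried by OTV IS-IS control plane frames
VLAN_LABEL_OFFSET = 32   # OTV 1.0: MPLS label = 32 + VLAN identifier

def vlan_to_label(vlan_id):
    """Return the MPLS label expected in the OTV 1.0 shim for an extended VLAN."""
    if not 1 <= vlan_id <= 4094:
        raise ValueError(f"invalid VLAN ID: {vlan_id}")
    return VLAN_LABEL_OFFSET + vlan_id

def describe_label(label):
    """Classify an MPLS label seen in a capture of OTV 1.0 traffic."""
    if label == ROUTER_ALERT_LABEL:
        return "control-plane (IS-IS)"
    vlan = label - VLAN_LABEL_OFFSET
    if 1 <= vlan <= 4094:
        return f"data, VLAN {vlan}"
    return "unknown"

print(vlan_to_label(101), describe_label(133), describe_label(1))
```

For instance, a capture filtered on MPLS label 135 would isolate data frames for VLAN 103, the VLAN used in the unicast examples later in this chapter.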
Note If a packet capture is taken in the transport, OTV 1.0 encapsulation is decoded
as MPLS Pseudowire with no control-word using analysis tools, such as Wireshark.
Unfortunately, at the time of this writing, Wireshark is not able to decode all the IS-IS
PDUs used by OTV.
NX-OS release 7.2(0)D1(1) introduced the option of UDP encapsulation for OTV when
using F3 or M3 series modules in the Nexus 7000 series switches. The OTV 2.5 UDP
encapsulation is shown in Figure 14-8.
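[Figure 14-8: OTV 2.5 UDP encapsulation. The original Layer 2 frame (DMAC, SMAC, 802.1Q,
EtherType, payload, CRC) arrives from the VLAN on the OTV internal interface Eth3/5 of
NX-2 and is sent to the transport from the OTV join interface Po3 with an OTV header
carrying the instance and overlay numbers.]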
Ethernet Frames arriving from the OTV internal interface have the original payload,
ethertype, 802.1Q header, source MAC address, and destination MAC address copied
into the new OTV 2.5 Encapsulated frame. The OTV 2.5 encapsulation uses the same
packet format as Virtual Extensible LAN (VxLAN), which is detailed in RFC 7348.
The OTV SHIM header contains information about the Instance and Overlay. The
instance is the table identifier that should be used at the destination OTV ED to look up
the destination, and the overlay identifier is used by the control plane packets to identify
packets belonging to a specific overlay. A control plane packet has the VxLAN Network
ID (VNI) bit set to False (zero), while an encapsulated data frame has this value set to
True (one). The UDP header contains a variable source port and a destination port of 8472.
Fragmentation of OTV frames containing data packets becomes a concern if the trans-
port MTU is not at least 1550 bytes with OTV 2.5, or 1542 bytes with OTV 1.0. This
is based on the assumption that a host in the data center has an interface MTU of
1500 bytes and attempts to send full MTU sized frames. When the OTV encapsulation is
added, the packet no longer fits into the available MTU size.
The minimum transport MTU requirement for control plane packets is either 1442 for
multicast transport, or 1450 for unicast transport in adjacency server mode. OTV sets the
Don’t Fragment bit in the outer IP header to ensure that no OTV control plane or data
plane packets become fragmented in the transport network. If MTU restrictions exist, it
could result in OTV IS-IS adjacencies not forming, or the loss of frames for data traffic
when the encapsulated frame size exceeds the transport MTU.
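The data plane MTU requirement is simple addition, sketched below. The 42-byte overhead for GRE is stated in this chapter. The 50-byte value for UDP is not stated directly and is inferred here from the 1550-byte requirement, so treat it as an assumption. The function names are illustrative.

```python
# Encapsulation overhead in bytes. The "udp" value is inferred from 1550 - 1500.
OTV_OVERHEAD = {"gre": 42, "udp": 50}

def required_transport_mtu(host_mtu, encapsulation):
    """Smallest transport MTU that carries a full-size host packet without fragmentation."""
    return host_mtu + OTV_OVERHEAD[encapsulation]

def fits(transport_mtu, host_mtu=1500, encapsulation="gre"):
    """True if full-size host packets survive a transport with the given MTU."""
    return transport_mtu >= required_transport_mtu(host_mtu, encapsulation)

print(required_transport_mtu(1500, "gre"), required_transport_mtu(1500, "udp"))
print(fits(1500), fits(9216, host_mtu=9000, encapsulation="udp"))
```

Because the Don’t Fragment bit is set, a transport that fails this check drops the oversized frames rather than fragmenting them, so small packets such as pings with default sizes can succeed while full-size transfers fail.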
Note The OTV encapsulation format must be the same between all sites (GRE or UDP)
and is configured with the global configuration command otv encapsulation-format ip
[gre | udp].
[Figure: ARP request and ARP reply across OTV. Data Center 1 contains Host A
(10.101.1.1/16, c464.135c.6600) and Host B (10.101.1.3/16, c464.135c.6601). Data Center 2
contains Host C (10.101.2.1/16, 442b.03ec.cb00).]
Host A broadcasts an ARP request message to the destination MAC address ff:ff:ff:ff:ff:ff
with a target IP address of 10.101.2.1. This frame is sent out of all ports that belong to
the same VLAN in the L2 switch, including the OTV internal interface of NX-2 and the
port connected to Host B. Because NX-2 is an OTV ED for Data Center 1, it receives
the frame and encapsulates it using the OTV control-group of 239.12.12.12. NX-2 also
creates a MAC address table entry for Host A, known via the internal interface. Host A’s
MAC is advertised from NX-2 across the overlay through the IS-IS control plane, provid-
ing reachability information to all other OTV EDs.
The control-group multicast frame from NX-2 traverses the transport underlay network
until it reaches NX-6 where the multicast OTV encapsulation is removed and the frame
is sent out of the OTV internal interface toward Host C. Host C processes the broadcast
frame and recognizes the IP address as its own. Host C then issues the ARP reply to Host
A, which is sent to NX-6. NX-6 at this point has an entry in the OTV MAC routing table
for Host A with an IP next-hop of NX-2 since the IS-IS update was received. There is also
a MAC address table entry for Host A in VLAN101 pointing to the overlay interface.
As the ARP reply from Host C is received at NX-6, a local MAC address table entry is
created pointing to the OTV internal interface. This MAC address entry is then adver-
tised to all remote OTV EDs through IS-IS, just as NX-2 did for Host A.
NX-6 then encapsulates the ARP reply and sends it across the overlay to NX-2 in
Data Center 1. NX-2 removes the OTV encapsulation from the frame and sends it out
of the internal interface where it reaches Host A, following the MAC address table of
the VLAN.
The OTV ARP-ND-Cache is populated by listening to ARP reply messages. The initial
ARP request is sent to all OTV EDs via the OTV control-group. When the ARP reply
comes back using the OTV control-group, each OTV ED snoops the reply and builds an
entry in the cache. If Host B were to send an ARP request for Host C, NX-2 replies to
the ARP request on behalf of Host C, using the cached entry created previously, which
reduces unnecessary traffic across the overlay.
Note If multiple OTV EDs exist at a site, only the AED forwards packets onto the overlay,
including ARP requests and replies. The AED is also responsible for advertising MAC
address reachability to other OTV EDs through the IS-IS control plane.
The ARP-ND-Cache is populated in the same way for multicast mode or adjacency server
mode. With adjacency server mode, the ARP request and response are encapsulated as
OTV Unicast packets and replicated for the remote OTV EDs.
If hosts are unable to communicate with other hosts across the overlay, verify the ARP-
ND-Cache to ensure it does not contain any stale information. Example 14-36 demon-
strates how to check the local ARP-ND-Cache on NX-2.
OTV also keeps an event-history for ARP-ND cache activity, which is viewed with show
otv internal event-history arp-nd. Example 14-37 shows this output from the AED for
VLAN 100.
The OTV ARP-ND cache timer is configurable from 60 to 86400 seconds. The default
value is 480 seconds or 8 minutes, plus an additional 2-minute grace-period. During the
grace-period an AED forwards ARP requests across the overlay so that the reply refreshes
the entry in the cache. It is recommended to have the ARP-ND cache time value lower
than the MAC aging timer. By default, the MAC aging timer is 30 minutes.
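The interaction between the cache timer and the grace period can be illustrated with a small model. This is a sketch under assumptions, not the NX-OS implementation: the class name and return values are invented, and the model assumes that during the grace period the AED only forwards the request and does not also answer it locally.

```python
class ArpNdCache:
    """Toy model of the OTV ARP-ND cache on an AED. Times are in seconds."""

    def __init__(self, timeout=480, grace=120):
        self.timeout = timeout   # default cache timer
        self.grace = grace       # additional grace period
        self.entries = {}        # target IP -> (MAC, time the reply was snooped)

    def snoop_reply(self, ip, mac, now):
        """Learn or refresh an entry from an ARP reply seen on the overlay."""
        self.entries[ip] = (mac, now)

    def handle_request(self, target_ip, now):
        """Return (action, mac) for an ARP request received on the internal interface."""
        entry = self.entries.get(target_ip)
        if entry is None:
            return ("forward", None)
        mac, learned = entry
        age = now - learned
        if age < self.timeout:
            return ("proxy-reply", mac)      # answered locally on behalf of the target
        if age < self.timeout + self.grace:
            return ("forward", mac)          # grace period: let the reply refresh the entry
        del self.entries[target_ip]          # fully expired
        return ("forward", None)


cache = ArpNdCache()
cache.snoop_reply("10.101.2.1", "442b.03ec.cb00", now=0)
print(cache.handle_request("10.101.2.1", now=100))
print(cache.handle_request("10.101.2.1", now=500))
```

The model also shows why a stale entry matters: until the timer expires, requests for the target are answered from the cache and never reach the remote site.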
Note The ARP-ND-Cache is enabled by default. In some environments with a lot of ARP
activity, it may cause the CPU of the OTV ED to become high or experience CoPP drops
because the supervisor CPU must handle the ARP traffic to create the cache entries.
Broadcasts
Broadcast frames received by an OTV ED on the internal interface are forwarded across
the overlay by the AED for the extended VLAN. Broadcast frames, such as ARP requests,
are encapsulated into an L3 multicast packet where the source address is the local OTV
ED’s join interface, and the group is the OTV control-group address. The multicast
packet is sent to the transport where it gets replicated to each remote OTV ED that has
joined the control-group.
When using a multicast enabled transport, OTV allows for the configuration of a dedicated
otv broadcast-group, as shown in Example 14-38. This allows the operator to separate
the OTV control-group from the broadcast group for easier troubleshooting and to allow
different handling of the packets based on group address. For example, a different PIM
rendezvous point could be defined for each group, or a different Quality of Service (QoS)
treatment could be applied to the control-group and broadcast-group in the transport.
With either multicast or unicast transport, when the packet is received by the remote
OTV ED, the outer L3 packet encapsulation is removed. The broadcast frame is then for-
warded to all internal facing L2 ports in the VLAN by the AED.
There are situations where a silent host is unavoidable. To allow these hosts to function,
OTV provides a configuration option to allow selective unicast flooding beginning in
NX-OS 6.2(2). Example 14-39 provides a configuration example to allow flooding of
packets to a specific destination MAC address in VLAN 101 across the overlay.
feature otv
otv site-identifier 0x1
otv flood mac C464.135C.6600 vlan 101
The result of adding this command is a static OTV route entry for the VLAN, which
causes traffic to flow across the overlay, as shown in Example 14-40.
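[Figure: Unicast IP traffic across OTV (MAC-in-IP). Host A (10.103.1.1/16, c464.135c.6600)
in Data Center 1 reaches Host C (10.103.2.1/16, 442b.03ec.cb00) in Data Center 2 through
the OTV internal interfaces of NX-2 and NX-6 and the Layer 3 transport between them.]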
Traffic from Host A is first sent to the L2 switch where it has an 802.1Q VLAN tag added
for VLAN 103. The frames follow the MAC address table entries at the L2 switch across
the trunk port to reach NX-2 on the OTV internal interface Ethernet3/5. When the pack-
ets arrive at NX-2, it performs a MAC address table lookup in the VLAN to determine
how to reach Host C’s MAC address 442b.03ec.cb00. The MAC address table of NX-2 is
shown in Example 14-41.
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
age - seconds since last seen,+ - primary entry using vPC Peer-Link, E - EVPN
entry
(T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve age info
VLAN/BD MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 103 0000.0c07.ac67 dynamic ~~~ F F Eth3/5
O 103 442b.03ec.cb00 dynamic - F F Overlay0
* 103 64a0.e73e.12c1 dynamic ~~~ F F Eth3/5
O 103 64a0.e73e.12c3 dynamic - F F Overlay0
O 103 6c9c.ed4d.d943 dynamic - F F Overlay0
* 103 c464.135c.6600 dynamic ~~~ F F Eth3/5
The MAC address table indicates that Host C’s MAC is reachable across the overlay,
which means that the OTV MAC Routing table (ORIB) should be used to obtain the
IP next-hop and encapsulation details. The ORIB indicates how to reach the remote
OTV ED that advertised the MAC address to NX-2 via IS-IS, which is NX-6 in this
example.
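The decision made from the MAC address table can be expressed as a short script: an entry flagged O points to the overlay and requires an ORIB lookup, while a primary entry points to a local port. The sketch below assumes the eight-column row layout shown in Example 14-41. The function names and returned strings are illustrative.

```python
SAMPLE = """\
VLAN/BD MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 103 0000.0c07.ac67 dynamic ~~~ F F Eth3/5
O 103 442b.03ec.cb00 dynamic - F F Overlay0
* 103 c464.135c.6600 dynamic ~~~ F F Eth3/5
"""

def parse_mac_table(output):
    """Return {(vlan, mac): (flag, port)} from show mac address-table output."""
    table = {}
    for line in output.splitlines():
        fields = line.split()
        # Entry rows have 8 fields: flag, VLAN, MAC, type, age, Secure, NTFY, port.
        if len(fields) == 8 and fields[1].isdigit():
            flag, vlan, mac, *_, port = fields
            table[(int(vlan), mac)] = (flag, port)
    return table

def next_lookup(table, vlan, mac):
    """Say where troubleshooting should continue for a destination MAC."""
    entry = table.get((vlan, mac.lower()))
    if entry is None:
        return "no entry"
    flag, port = entry
    return f"ORIB (via {port})" if flag == "O" else f"local port {port}"

table = parse_mac_table(SAMPLE)
print(next_lookup(table, 103, "442b.03ec.cb00"))
```

A result of "no entry" for a remote host means the MAC was never learned from the overlay, which shifts the investigation to the IS-IS advertisement from the remote AED.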
Note If multiple OTV EDs exist at a site, ensure the data path is being followed to the
AED for the VLAN. This is verified with the show otv vlan command. Under normal con-
ditions the MAC forwarding entries across the L2 network should lead to the AED’s inter-
nal interface.
Legend:
(NA) - Non AED, (VD) - Vlan Disabled, (OD) - Overlay Down
(DH) - Delete Holddown, (HW) - HW: State Down
(NFC) - Not Forward Capable
VLAN Auth. Edge Device Vlan State Overlay
---- ----------------------------------- ---------------------- -------
After verifying the AED state for VLAN 103 to ensure you are looking at the correct
device, check the ORIB to determine which remote OTV ED will receive the encapsu-
lated frame from NX-2. The ORIB for NX-2 is shown in Example 14-43.
Recall that the ORIB data is populated by the IS-IS LSP received from NX-6, which indi-
cates MAC address 442b.03ec.cb00 is an attached host. This is confirmed by obtaining
the system-id of NX-6 in show otv adjacency, and then finding the correct LSP in the
output of show otv isis database detail.
At the AED originating the advertisement, the redistribution from the local MAC table
into OTV IS-IS is verified on NX-6 using the show otv isis mac redistribute route
command, which is shown in Example 14-44.
At this point, it has been confirmed that NX-6 is the correct remote OTV ED to receive
frames with a destination MAC address of 442b.03ec.cb00 in VLAN 103. The next
step in delivering the packet to Host C is for NX-2 to rewrite the packet to impose the
OTV header and send the encapsulated frame into the transport network from the join
interface.
OTV uses either UDP or GRE encapsulation, and in this example the default GRE
encapsulation is being used. There is a point-to-point tunnel created dynamically for
each remote OTV ED that has formed an adjacency with the local OTV ED. These
tunnels are viewed with show tunnel internal implicit otv detail, as shown in
Example 14-45.
0103-442b.03ec.cb00, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d944.00-00
0103-64a0.e73e.12c3, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d944.00-00
0103-6c9c.ed4d.d943, all
Advertised into L1, metric 1 LSP-ID 6c9c.ed4d.d944.00-00
The dynamic tunnels represent the software forwarding component of the OTV encap-
sulation. The hardware forwarding component for the OTV encapsulation is handled by
performing multiple passes through the line card forwarding engine to derive the correct
packet rewrite that includes the OTV encapsulation header.
Note The verification of the packet rewrite details in hardware varies depending on the
type of forwarding engine present in the line card. Verify the adjacencies, MAC address
table, ORIB, and tunnel state before suspecting a hardware programming problem. If con-
nectivity fails despite correct control plane programming, and MAC addresses are learned,
engage the Cisco TAC for support.
After the OTV MAC-in-IP encapsulation is performed by NX-2, the packet traverses the
Layer 3 transport network with a unicast OTV header appended. The source IP address
is the join interface of NX-2 and the destination IP address is the join interface of NX-6.
The Layer 3 packet arrives on the OTV join interface of NX-6, which must remove the
OTV encapsulation and look up the destination.
The destination IP address of the outer packet header is the OTV join interface address
of NX-6, 10.2.34.1. In a similar manner to the encapsulation of OTV, removing the OTV
encapsulation also requires multiple forwarding engine passes on the receiving line card.
Because the outer destination IP address belongs to NX-6, it strips the outer IP header
and looks into the OTV shim header, where the VLAN ID is found. The information used
for this lookup originates from the ORIB, which contains the VLAN, MAC address, and
destination interface, as shown in Example 14-46.
The next pass through the forwarding engine performs a lookup on the VLAN MAC
address table to find the correct outgoing interface and physical port. The MAC address
table of NX-6 is shown in Example 14-47.
Legend:
* - primary entry, G - Gateway MAC, (R) - Routed MAC, O - Overlay MAC
age - seconds since last seen,+ - primary entry using vPC Peer-Link, E - EVPN
entry
(T) - True, (F) - False , ~~~ - use 'hardware-age' keyword to retrieve age info
VLAN/BD MAC Address Type age Secure NTFY Ports/SWID.SSID.LID
---------+-----------------+--------+---------+------+----+------------------
* 103 0000.0c07.ac67 dynamic ~~~ F F Po3
* 103 442b.03ec.cb00 dynamic ~~~ F F Po3
O 103 64a0.e73e.12c1 dynamic - F F Overlay0
* 103 64a0.e73e.12c3 dynamic ~~~ F F Po3
* 103 6c9c.ed4d.d943 dynamic ~~~ F F Po3
O 103 c464.135c.6600 dynamic - F F Overlay0
The frame exits Port-channel 3 on the L2 trunk with a VLAN tag of 103. The L2 switch
in data center 2 receives the frame and performs a MAC address table lookup to find the
port where Host C is connected and delivers the frame to its destination.
Note Troubleshooting unicast data traffic when using the adjacency server mode follows the
same methodology used for a multicast enabled transport. The difference is that any control
plane messages are exchanged between OTV EDs using a unicast encapsulation method and
replicated by the advertising OTV ED to all adjacent OTV EDs. The host-to-host data traffic
is still MAC-in-IP unicast encapsulated from source OTV ED to the destination OTV ED.
IGMP snooping must also learn where multicast routers (mrouters) are connected. Any
multicast traffic must be forwarded to an mrouter so that interested receivers on other L3
networks can receive it. The mrouter is also responsible for registering the source with
a rendezvous point if PIM ASM is being used. IGMP snooping discovers mrouters by
listening for Protocol Independent Multicast (PIM) hello messages, which indicate an L3
capable mrouter is present on that port. The L2 forwarding table is then updated to send
all multicast group traffic to the mrouter, as well as any interested receivers. OTV EDs use
a dummy PIM Hello message to draw multicast traffic and IGMP membership reports to
the OTV ED’s internal interface.
OTV maintains its own mroute table for multicast forwarding just as it maintains an OTV
routing table for unicast forwarding. There are three types of OTV mroute entries, which are
described as VLAN, Source, and Group. The purpose of each type is detailed in Table 14-2.
The OTV IS-IS control plane protocol is utilized to allow hosts to send and receive mul-
ticast traffic within an extended VLAN between sites without the need to send IGMP
messages across the overlay. Figure 14-11 shows a simple OTV topology where Host A
is a multicast source for group 239.100.100.100, and Host C is a multicast receiver. Both
Host A and Host C belong to VLAN 103.
In this example, the L3 transport network is enabled for IP multicast. Each OTV ED is con-
figured with a range of Source Specific Multicast (SSM) groups, referred to as the Delivery
Group or data-group, which may be used interchangeably. The delivery group configura-
tion of NX-6 is highlighted in the configuration sample provided in Example 14-48.
The delivery group must be coordinated with the L3 transport to ensure that PIM SSM is
supported and that the correct range of groups is defined for use as SSM groups. Each
OTV ED is configured with the same range of otv data-groups, and each OTV ED can
be a source for the SSM group. Remote OTV EDs join the SSM group in the transport
to receive multicast frames from a particular OTV ED acting as source. The SSM group
to use is signaled with IS-IS advertisements between OTV EDs, which allows
for discovery of active sources and receivers at each site.
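A quick sanity check on the configured data-group range can be done with Python's standard ipaddress module. The sketch below assumes the transport uses the IANA-assigned SSM range of 232.0.0.0/8. If the transport routers are configured with a different SSM range, pass that range instead. The function names are illustrative.

```python
import ipaddress

# IANA-assigned SSM range, assumed here to be what the transport uses.
DEFAULT_SSM_RANGE = ipaddress.ip_network("232.0.0.0/8")

def data_group_ok(data_group, ssm_range=DEFAULT_SSM_RANGE):
    """True if the otv data-group range falls entirely inside the transport SSM range."""
    return ipaddress.ip_network(data_group).subnet_of(ssm_range)

def delivery_groups(data_group):
    """List every group address available in the configured data-group range."""
    return [str(addr) for addr in ipaddress.ip_network(data_group)]

print(data_group_ok("232.1.1.0/24"), len(delivery_groups("232.1.1.0/24")))
```

Running the same check with every OTV ED's configured range is a simple way to confirm that all sites agree, since each OTV ED is expected to carry the same range of data-groups.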
The site group is the multicast group that is being transported across the overlay using
the delivery group. In Figure 14-11, the site group is 239.100.100.100 sourced by Host
A and received by Host C. Essentially, OTV is using a multicast-in-multicast OTV
encapsulation scheme to send the site group across the overlay using the delivery group
in the transport network.
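[Figure: PIM dummy hello packets sent on the OTV internal interfaces of NX-2 and NX-6,
which are connected across the Layer 3 transport. Host MAC addresses shown:
c464.135c.6600 and 442b.03ec.cb00.]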
An Ethanalyzer capture of the PIM dummy hello packet from NX-6 on VLAN 103 is
shown in Example 14-49.
Type: IP (0x0800)
Internet Protocol Version 4, Src: 0.0.0.0 (0.0.0.0),Dst: 224.0.0.13 (224.0.0.13)
Version: 4
Example 14-50 shows the IGMP snooping status of the L2 switch in Data Center 2 after
receiving the PIM dummy hello packets on VLAN103 from NX-6.
When Host C’s IGMP membership report message reaches NX-6, it is snooped on
the internal interface and added to the OTV mroute table as an IGMP created entry.
Remember that any switch performing IGMP snooping must forward all IGMP member-
ship reports to mrouter ports.
Example 14-51 shows the OTV mroute table from NX-6 with the IGMP created (V, *, G)
entry and Outgoing Interface (OIF) of Port-channel 3 where the membership report was
received.
NX-6 then builds an IS-IS message to advertise the group membership (GM-Update)
to all OTV EDs. NX-2 in Data Center 1 receives the IS-IS GM-Update, as shown in
Example 14-52. NX-6 is identified by the IS-IS system-id of 6c9c.ed4d.d944. The correct
LSP to check is confirmed with the output of show otv adjacency, which lists the
system-id of each OTV ED IS-IS neighbor.
Note At this point only Host C joined the multicast group, and there are no sources
actively sending to the group.
NX-2 installs an OTV mroute entry in response to receiving the IS-IS GM-Update from
NX-6, as shown in Example 14-53. The OIF on NX-2 is the overlay interface. The r
indicates the receiver is across the overlay.
Host A now begins sending traffic to the site group 239.100.100.100 in Data Center 1.
Because of the PIM dummy hello packets being sent by NX-2, the L2 switch creates an
IGMP snooping mrouter entry for the port. The L2 switch forwards all multicast traffic
to NX-2, where it is received by the OTV internal interface. The receipt of this traffic
creates an OTV mroute entry, as shown in Example 14-54. The delivery group (S, G)
is visible with the addition of the detail keyword. The source of the delivery group is
the AED’s OTV join interface, and the group address is one of the configured OTV
data-groups.
The OTV mroute is redistributed automatically into IS-IS, as shown in Example 14-55,
where the VLAN, site (S,G), delivery (S,G), and LSP-ID are provided.
The redistributed route is advertised to all OTV EDs through IS-IS. Example 14-56 shows
the LSP originated by NX-2, as received by NX-6.
Note The show otv isis internal event-history mcast command is useful for trouble-
shooting the IS-IS control plane for OTV multicast and the advertisement of groups and
sources for a particular VLAN.
NX-6 updates this information into its OTV mroute table, as shown in Example 14-57.
The s indicates the source is located across the overlay.
The show otv data-group command is used to verify the site group and delivery group
information for NX-2 and NX-6, as shown in Example 14-58. This should match what is
present in the output of show otv mroute.
OTV EDs act as source hosts and receiver hosts for the delivery groups used on the
transport network. An IGMPv3 membership report from the join interface is sent to
the transport to allow the OTV ED to start receiving packets from the delivery group
(10.1.12.1, 232.1.1.0).
Verification in the transport is done based on the PIM SSM delivery group information
obtained from the OTV EDs. Each AED’s join interface is a source for the delivery group.
The AED joins only delivery group sources that are required based on the OTV mroute
table and the information received through the IS-IS control plane. This mechanism
allows OTV to optimize the multicast traffic in the transport so that only the needed
data is received by each OTV ED. The use of PIM SSM allows specific source addresses
to be joined for each delivery group.
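The selection logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data structures, the function name needed_ssm_joins, the Host A address 10.103.1.1, and the second (VLAN 104) entry are hypothetical and do not reflect NX-OS internals. Only the delivery group (10.1.12.1, 232.1.1.0) and the site group 239.100.100.100 come from the walk-through.

```python
# Hypothetical model: which SSM delivery groups does an AED need to join?

def needed_ssm_joins(advertised, local_receivers):
    """Return the delivery (S, G) pairs to join in the transport.

    advertised: dict mapping (vlan, site_source, site_group) to
                (delivery_source, delivery_group), as learned through IS-IS.
    local_receivers: set of (vlan, site_group) with receivers at this site.
    """
    joins = set()
    for (vlan, _site_source, site_group), delivery in advertised.items():
        # Join only when a local receiver exists for the advertised site group.
        if (vlan, site_group) in local_receivers:
            joins.add(delivery)
    return joins


# Entries advertised by NX-2, whose join interface is 10.1.12.1.
advertised = {
    (103, "10.103.1.1", "239.100.100.100"): ("10.1.12.1", "232.1.1.0"),
    (104, "10.104.1.1", "239.200.200.200"): ("10.1.12.1", "232.1.1.1"),  # made up
}
local_receivers = {(103, "239.100.100.100")}  # Host C joined this group

joins = needed_ssm_joins(advertised, local_receivers)
print(sorted(joins))
```

Because no receiver exists for the VLAN 104 group, its delivery group is never joined, which is the optimization the transport benefits from.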
Example 14-59 shows the mroute table of a transport router. In this output 10.1.12.1 is
NX-2’s OTV join interface, which is a source for the delivery group 232.1.1.0/32. The
incoming interface should match the routing table path toward the source to pass the
Reverse Path Forwarding (RPF) check. Interface Ethernet3/30 is the OIF and is connected
to the OTV join interface of NX-6.
ED must perform head-end replication of the traffic and send a copy to each site, which
becomes inefficient at scale.
OTV Unicast
Figure 14-13 Multicast Traffic Across OTV with Adjacency Server Mode
In this example, Host A and Host C are both members of VLAN 103. Host A is sending
traffic to the site group 239.100.100.100, and Host C sends an IGMP membership report
message to the Data Center 2 L2 switch. The L2 switch forwards the membership report
to NX-6 because it is an mrouter port in IGMP snooping. The same PIM dummy hello
packet mechanism is used on the OTV internal interface, just as with a multicast enabled
transport. The arrival of the IGMP membership report on NX-6 triggers an OTV mroute
to be created, as shown in Example 14-60, with the internal interface Port-channel 3
as an OIF.
The OTV mroute is then redistributed automatically into IS-IS for advertisement to all
other OTV EDs, as shown in Example 14-61. The LSP ID should be noted so that it
can be checked on NX-2, which is the OTV ED for the multicast source Host A in Data
Center 1.
Because IGMP packets are not forwarded across the overlay, the IS-IS messages used to
signal an interested receiver are counted as IGMP proxy-reports. Example 14-62 shows
the IGMP snooping statistics of NX-6, which indicate the proxy-report being originated
through IS-IS. The IGMP proxy-report mechanism is not specific to OTV adjacency
server mode.
Following the path from receiver to the source in Data Center 1, the IS-IS database is veri-
fied on NX-2. This is done to confirm that the overlay is added as an OIF for the OTV
mroute. Example 14-63 contains the GM-LSP received from NX-6 on NX-2.
The IGMP Snooping table on NX-2 confirms that the overlay is included in the port list,
as shown in Example 14-64.
The OTV mroute on NX-2 contains the (V, *, G) entry, which is populated as a result of
receiving the IS-IS GM-LSP from NX-6. This message indicates Host C is an interested
receiver in Data Center 2 and that NX-2 should add the overlay as an OIF for the group.
The OTV mroute table from NX-2 is shown in Example 14-65. The r indicates the
receiver is reachable across the overlay. The (V, S, G) entry is also present, which indicates
Host A is actively sending traffic to the site group 239.100.100.100.
Note The OTV mroute table lists an OIF of NX-6 installed by OTV. This is a result of the
OTV Unicast encapsulation used in adjacency server mode. The delivery group has values
of all zeros for the group address. This information is populated with a valid delivery group
when multicast transport is being used.
NX-2 encapsulates the site group packets in an OTV unicast packet with a destination
address of NX-6’s join interface. The OTV unicast packets traverse the transport network
until they arrive at NX-6. When the packets arrive at NX-6 on the OTV join interface,
the outer OTV unicast encapsulation is removed. The next lookup is done on the inner
multicast packet, which results in an OIF for the mroute installed by IGMP on the OTV
internal interface. Example 14-66 shows the OTV mroute table of NX-6. The site group
multicast packet leaves on Po3 toward the L2 switch in Data Center 2 and ultimately
reaches Host C.
With adjacency server mode, the source is not advertised to the other OTV EDs by
NX-2. This is because there is no delivery group used across the transport for remote
OTV EDs to join. NX-2 only needs to know that there is an interested receiver across the
overlay and which OTV ED has the receiver. The join interface of that OTV ED is used
as the destination address of the multicast-in-unicast OTV packet across the transport.
The actual encapsulation of the site group multicast frame is done using the OTV unicast
point-to-point dynamic tunnel, as shown in Example 14-67.
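The head-end replication behavior of adjacency server mode can be modeled roughly as follows. This is a minimal sketch, not the NX-OS implementation: the function name replicate and the dictionaries are assumptions for illustration, and the join interface addresses are taken from the adjacency output shown in this chapter.

```python
# Hypothetical model of multicast-in-unicast replication by the source-side ED.

def replicate(frame, vlan, group, receivers, join_interfaces, local_join_ip):
    """Build one unicast packet per remote OTV ED that has a receiver.

    Each packet is modeled as (outer_source, outer_destination, inner_frame).
    """
    packets = []
    for ed in sorted(receivers.get((vlan, group), ())):
        packets.append((local_join_ip, join_interfaces[ed], frame))
    return packets


# Receiver state learned through IS-IS: only NX-6 has an interested receiver.
receivers = {(103, "239.100.100.100"): {"NX-6"}}
join_interfaces = {"NX-4": "10.1.22.1", "NX-6": "10.2.34.1", "NX-8": "10.2.43.1"}

packets = replicate(b"site-group-frame", 103, "239.100.100.100",
                    receivers, join_interfaces, "10.1.12.1")
print(packets)
```

The number of copies grows with the number of remote EDs that have receivers, which is why a multicast-enabled transport scales better for this traffic.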
FHRP isolation is configured on the OTV EDs to allow each site’s FHRP to operate inde-
pendently. The purpose of this configuration is to filter any FHRP protocol traffic, as
well as ARP from hosts trying to resolve the virtual IP across the overlay. A configuration
example from NX-2 is shown in Example 14-68.
ip access-list ALL_IPs
  10 permit ip any any
ipv6 access-list ALL_IPv6s
  10 permit ipv6 any any
mac access-list ALL_MACs
  10 permit any any
ip access-list HSRP_IP
  10 permit udp any 224.0.0.2/32 eq 1985
  20 permit udp any 224.0.0.102/32 eq 1985
ipv6 access-list HSRP_IPV6
  10 permit udp any ff02::66/128
mac access-list HSRP_VMAC
  10 permit 0000.0c07.ac00 0000.0000.00ff any
  20 permit 0000.0c9f.f000 0000.0000.0fff any
  30 permit 0005.73a0.0000 0000.0000.0fff any
arp access-list HSRP_VMAC_ARP
  10 deny ip any mac 0000.0c07.ac00 ffff.ffff.ff00
  20 deny ip any mac 0000.0c9f.f000 ffff.ffff.f000
  30 deny ip any mac 0005.73a0.0000 ffff.ffff.f000
  40 permit ip any mac any
vlan access-map HSRP_Localization 10
  match mac address HSRP_VMAC
  match ip address HSRP_IP
  match ipv6 address HSRP_IPV6
  action drop
service dhcp
otv-isis default
  vpn Overlay0
    redistribute filter route-map OTV_HSRP_filter
otv site-identifier 0x1
ip arp inspection filter HSRP_VMAC_ARP vlan 100-110
Recall the topology depicted in Figure 14-1. In Data Center 1 HSRP is configured on
NX-1 and NX-3 for all VLANs. HSRP is also configured between NX-5 and NX-7 for
all VLANs in Data Center 2. The configuration in Example 14-68 is composed of three
filtering components:
■ VLAN Access Control List (VACL) to filter and drop HSRP Hellos
■ ARP Inspection Filter to drop ARP sourced from the HSRP Virtual MAC
■ Redistribution Filter Route-Map on the overlay to filter HSRP Virtual MAC (VMAC)
from being advertised through OTV IS-IS
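To make the wildcard masks in the HSRP_VMAC access list concrete, the following Python sketch tests whether a MAC address falls inside one of the three ranges. The helper names mac_to_int and is_hsrp_vmac are made up for this illustration; the address and mask pairs are copied from Example 14-68, and the sketch assumes the usual wildcard convention in which a 1 bit in the mask means "ignore this bit."

```python
# Address/wildcard pairs copied from the HSRP_VMAC MAC access list.
HSRP_VMAC_ACL = [
    ("0000.0c07.ac00", "0000.0000.00ff"),  # HSRP version 1 virtual MACs
    ("0000.0c9f.f000", "0000.0000.0fff"),  # HSRP version 2 virtual MACs
    ("0005.73a0.0000", "0000.0000.0fff"),  # HSRP for IPv6 virtual MACs
]


def mac_to_int(mac):
    """Convert a dotted MAC such as 0000.0c07.ac0a to an integer."""
    return int(mac.replace(".", ""), 16)


def is_hsrp_vmac(mac):
    """Return True when the MAC matches any entry in HSRP_VMAC_ACL."""
    value = mac_to_int(mac)
    for base, wildcard in HSRP_VMAC_ACL:
        # Bits set in the wildcard are ignored; all other bits must match.
        care_bits = ~mac_to_int(wildcard) & 0xFFFFFFFFFFFF
        if value & care_bits == mac_to_int(base) & care_bits:
            return True
    return False


print(is_hsrp_vmac("0000.0c07.ac0a"))  # HSRPv1 group 10
print(is_hsrp_vmac("0011.2233.4455"))  # ordinary host MAC
```

A check like this can help when deciding whether a MAC address seen in the OTV route table should have been filtered at the local site.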
Multihoming
A multihomed site in OTV refers to a site where two or more OTV EDs are configured to
extend the same range of VLANs. Because OTV does not forward STP BPDUs across the
overlay, L2 loops form without the election of an AED.
When multiple OTV EDs exist at a site, the AED election runs using the OTV IS-IS
system-id and VLAN identifier. This is done by using a hash function where the
result is an ordinal value of zero or one. The ordinal value is used to assign the
AED role for each extended VLAN to one of the forwarding capable OTV EDs at
the site.
When two OTV EDs are present, the device with the lower system-id is the AED for
the even-numbered VLANs, and the higher system-id is the AED for the odd-numbered
VLANs. The AED is responsible for advertising MAC addresses and forwarding traffic
for an extended VLAN across the overlay.
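The two-ED outcome can be expressed as a short Python sketch. It models only the result described above (lower system-id for even VLANs, higher system-id for odd VLANs); the modulo operation stands in for the actual hash, which is not documented here, and the function name aed_for_vlan is hypothetical. The system-ids are those of NX-8 and NX-6 from the adjacency output, on the assumption that these two EDs share a site.

```python
def aed_for_vlan(vlan, system_ids):
    """Pick the AED for a VLAN among the OTV EDs at one site.

    The EDs are ordered by system-id to obtain an ordinal (0, 1, ...).
    The VLAN is then mapped to an ordinal with a modulo, which reproduces
    the even/odd split for a two-ED site.
    """
    ordered = sorted(system_ids, key=lambda sid: int(sid.replace(".", ""), 16))
    return ordered[vlan % len(ordered)]


site_eds = ["6c9c.ed4d.d944", "64a0.e73e.12c4"]  # NX-6 and NX-8

for vlan in (100, 101, 102, 103):
    print(vlan, aed_for_vlan(vlan, site_eds))
```

With these inputs, VLAN 103 maps to 6c9c.ed4d.d944 (NX-6), which matches the role NX-6 plays for VLAN 103 in the multicast walk-through.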
Beginning in NX-OS 5.2(1) the dual site adjacency concept is used. This allows OTV
EDs with the same site identifier to communicate across the overlay as well as across
the site VLAN, which greatly reduces the chance of one OTV ED being isolated and
creating a dual active condition. In addition, the overlay interface of an OTV ED is
disabled until a site identifier is configured, which ensures that OTV is able to detect
any mismatch in site identifiers. If a device becomes non-AED capable, it proactively
notifies the other OTV ED at the site so it can take over the role of AED for all
VLANs.
Figure 14-14 shows that NX-11 has Equal Cost Multipath (ECMP) routes to reach the
10.103.0.0/16 subnet through either NX-9 or NX-10. Depending on the load-sharing hash,
packets originating behind NX-11 reach either Data Center 1 or Data Center 2. If, for
example, the destination of the traffic is Host C and NX-11 chooses to send the traffic
to NX-9 as the next hop, a suboptimal forwarding path is used. NX-9 then has to try to
resolve where Host C is located to forward the traffic. The packets reach the internal
interface of NX-2, which then performs an OTV encapsulation and routes the packets
back across the overlay to reach Host C.
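[Figure 14-14 (labels recovered from the original image): OTV ED NX-2, OTV ED NX-6,
NX-9, NX-10, and NX-11 at the non-OTV site. NX-9 and NX-10 each advertise
10.103.0.0/16, and the NX-11 routing table lists 10.103.0.0/16 through NX-9 and
NX-10, each with a value of 10.]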
Another solution to this problem is to advertise more specific, smaller subnets from each
site along with the /16 summary to the rest of the routing domain. Routing follows the
more specific subnet to Data Center 1 or Data Center 2, and if either partially fails, the
/16 summary can still be used to draw in traffic. Assuming OTV is still functional in
the partially failed state through a backdoor link, the traffic then relies on the overlay to
cross from Data Center 1 to Data Center 2. The best solution to this problem depends on
the deployment scenario and if the two OTV sites are acting as Active/Standby or if they
are Active/Active from a redundancy perspective.
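The effect of advertising more specific subnets alongside the summary can be illustrated with a longest-prefix-match sketch in Python. The /17 split, the Host C address 10.103.200.10, and the function name best_next_hops are assumptions chosen for the example; only the /16 summary and the router names come from the text.

```python
import ipaddress


def best_next_hops(destination, routes):
    """Return the next hops of the longest matching prefix, sorted by name."""
    dest = ipaddress.ip_address(destination)
    matching = [
        (ipaddress.ip_network(prefix), next_hop)
        for prefix, next_hop in routes
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matching:
        return []
    longest = max(network.prefixlen for network, _ in matching)
    return sorted(nh for network, nh in matching if network.prefixlen == longest)


routes = [
    ("10.103.0.0/16", "NX-9"),     # summary from Data Center 1
    ("10.103.0.0/16", "NX-10"),    # summary from Data Center 2
    ("10.103.0.0/17", "NX-9"),     # hypothetical more specific, Data Center 1
    ("10.103.128.0/17", "NX-10"),  # hypothetical more specific, Data Center 2
]

# Normal operation: traffic follows the more specific route.
print(best_next_hops("10.103.200.10", routes))

# Partial failure: the /17 from Data Center 2 is withdrawn, so the /16 is used.
degraded = [r for r in routes if r != ("10.103.128.0/17", "NX-10")]
print(best_next_hops("10.103.200.10", degraded))
```

In the degraded case both summaries match again, so traffic may arrive at either site and rely on the overlay to reach the host.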
VLAN Translation
In some networks, a VLAN configured at an OTV site may need to communicate with a
VLAN at another site that is using a different VLAN numbering scheme. There are two
solutions to this problem:
VLAN mapping on the overlay interface is not supported with Nexus 7000 F3 or M3
series modules. If VLAN mapping is required with F3 or M3 modules, VLAN mapping
on the OTV internal interface, which is an L2 trunk, must be used.
Example 14-69 demonstrates the configuration of VLAN mapping on the overlay inter-
face. VLAN 200 is extended across the overlay. The local VLAN 200 is mapped to
VLAN 300 at the other OTV site.
If F3 or M3 modules are being used, the VLAN mapping must be performed on the OTV
internal interface, as shown in Example 14-70. This configuration translates VLAN 200
to VLAN 300, which is then extended across OTV to interoperate with the remote site
VLAN scheme.
■ L3 Source Address
■ L3 Destination Address
■ Layer 4 Protocol
This polarization problem happens when each layer of the transport network applies the
same hash function. Using the same inputs results in the same output interface decision at
each hop. For example, if a router chose an even-numbered interface, the next router also
chooses an even-numbered interface, and the next one also chooses an even-numbered
interface, and so on.
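A toy model in Python makes the polarization effect visible. The hash below is a deliberately simple stand-in (a sum of the three inputs named in the list above), not the function used by any real platform, and the destination addresses are made up for the example. It assumes Python 3 for the ipaddress module.

```python
import ipaddress


def toy_hash(flow, links):
    """Toy load-sharing hash over L3 source, L3 destination, and protocol."""
    src, dst, proto = flow
    total = int(ipaddress.ip_address(src)) + int(ipaddress.ip_address(dst)) + proto
    return total % links


# Eight GRE (protocol 47) flows from one source to different destinations.
flows = [("10.1.12.1", "10.2.34.%d" % host, 47) for host in range(1, 9)]

# Hop 1 spreads the flows over two links.
first_hop = {flow: toy_hash(flow, 2) for flow in flows}

# Only the flows sent on link 0 reach the router behind link 0.
arrivals = [flow for flow, link in first_hop.items() if link == 0]

# Hop 2 applies the same hash to the same inputs.
second_hop_links = {toy_hash(flow, 2) for flow in arrivals}
print(second_hop_links)  # link 1 at the second hop is never used
```

Every flow that arrives at the second router already hashed to link 0, so the second router repeats that decision and leaves its other link idle.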
OTV provides a solution to this problem. When using the default GRE/IP encapsulation
for the overlay, secondary IP addresses can be configured in the same subnet on the OTV
join interface, as shown in Example 14-71. This allows OTV to build secondary dynamic
tunnels between different pairs of addresses. The secondary address allows the transport
network to provide different hash results and load-balance the overlay traffic more
effectively.
The status of the secondary OTV adjacencies is seen with the show otv adjacency
detail command, as shown in Example 14-72.
Overlay-Interface Overlay0 :
Hostname System-ID Dest Addr Up Time State
NX-4 64a0.e73e.12c2 10.1.22.1 00:03:07 UP
Secondary src/dest: 10.1.12.4 10.1.22.1 UP
HW-St: Default
NX-8 64a0.e73e.12c4 10.2.43.1 00:03:07 UP
Secondary src/dest: 10.1.12.4 10.2.43.1 UP
HW-St: Default
NX-6 6c9c.ed4d.d944 10.2.34.1 00:03:06 UP
Secondary src/dest: 10.1.12.4 10.2.34.1 UP
HW-St: Default
Note OTV tunnel depolarization is enabled by default. It is disabled with the
otv depolarization disable global configuration command.
When OTV UDP encapsulation is used, the depolarization is applied automatically with
no additional configuration required. The Ethernet frames are encapsulated in a UDP
packet that uses a variable UDP source port and a UDP destination port of 8472. By hav-
ing a variable source port, the OTV ED is able to influence the load-sharing hash of the
transport network.
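The following Python sketch shows why a variable UDP source port helps. It uses a toy sum-based hash as a stand-in for the transport's real load-sharing function, and the source port range starting at 49152 is an assumption; how the OTV ED actually selects the source port is not described here. Only the destination port 8472 and the join interface addresses come from this chapter.

```python
import ipaddress

OTV_UDP_DST_PORT = 8472


def toy_hash(src, dst, proto, sport, dport, links):
    """Toy load-sharing hash over the outer header fields."""
    total = (int(ipaddress.ip_address(src)) + int(ipaddress.ip_address(dst))
             + proto + sport + dport)
    return total % links


# GRE (protocol 47) has no ports: one address pair always gives one result.
fixed = {toy_hash("10.1.12.1", "10.2.34.1", 47, 0, 0, 2) for _ in range(4)}

# UDP (protocol 17) with a varying source port gives different results.
varied = {
    toy_hash("10.1.12.1", "10.2.34.1", 17, sport, OTV_UDP_DST_PORT, 2)
    for sport in range(49152, 49156)
}

print(fixed, varied)
```

The same reasoning applies to secondary join interface addresses with GRE/IP encapsulation: each additional address pair gives the transport another set of hash inputs.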
Note OTV UDP encapsulation is supported starting in NX-OS release 7.2(0)D1(1) for F3
and M3 modules.
The site VLAN IS-IS adjacency can be configured to use Bidirectional Forwarding
Detection (BFD) on the site VLAN to detect IS-IS neighbor loss. This is useful to detect
any type of connectivity failure on the site VLAN. Example 14-73 shows the configura-
tion required to enable BFD on the site VLAN.
otv site-vlan 10
interface Vlan10
  no shutdown
  bfd interval 250 min_rx 250 multiplier 3
  no ip redirects
  ip address 10.111.111.1/30
The status of BFD on the site VLAN is verified with the show otv isis site command, as
shown in Example 14-74. Any BFD neighbor is also present in the output of the show
bfd neighbors command.
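As a quick worked example, the BFD timers in Example 14-73 translate into a failure detection time as follows. The sketch assumes both OTV EDs are configured with the same values, so the negotiated interval equals the configured one; the function name is hypothetical.

```python
def bfd_detection_time_ms(interval_ms, min_rx_ms, multiplier):
    """Approximate BFD detection time for identically configured peers.

    The session runs at the slower of the transmit interval and the
    minimum receive interval, and the neighbor is declared down after
    'multiplier' consecutive packets are missed.
    """
    return max(interval_ms, min_rx_ms) * multiplier


# bfd interval 250 min_rx 250 multiplier 3
detection_ms = bfd_detection_time_ms(250, 250, 3)
print(detection_ms)  # 750
```

A loss of connectivity on the site VLAN is therefore detected in well under a second with these values.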
For the overlay adjacency, the presence of a route to reach the peer OTV ED’s join
interface can be tracked to detect a reachability problem that eventually causes the
IS-IS neighbor to go down. Example 14-75 shows the configuration to enable next-
hop adjacency tracking for the overlay adjacency of OTV EDs, which use the same site
identifier.
otv-isis default
  track-adjacency-nexthop
  vpn Overlay0
    redistribute filter route-map OTV_HSRP_filter
Example 14-76 contains the output of show otv isis track-adjacency-nexthop, which
verifies the feature is enabled and tracking next-hop reachability of NX-4.
This feature depends on a nondefault route, learned from a dynamic routing protocol for
the peer OTV ED’s join interface. When the route disappears, OTV IS-IS brings down the
adjacency without waiting for the hold timer to expire, which allows the other OTV ED
to assume the AED role for all VLANs.
Summary
OTV was introduced in this chapter as an efficient and flexible way to extend L2 VLANs
to multiple sites across a routed transport network. The concepts of MAC routing and
the election of an AED were explained as an efficient way to solve the challenges pre-
sented by other DCI solutions without relying on STP. The examples and end-to-end
walk-through for the control plane, unicast traffic, and multicast traffic provided in this
chapter can be used as a basis for troubleshooting the various types of connectivity
problems that may be observed in a production network environment.
Chapter 15
Programmability and Automation
■ NX-API
Either option is time consuming, but the second one, applying a configuration work-
around, involves the least amount of time. Applying a workaround on 100 nodes is
not an easy task, however. This is where automation comes into play. If the process
The following sections discuss in detail the multiple automation and programmability
tools available with NX-OS to give network engineers more control and flexibility in
performing various actions and running third-party applications.
One reason Linux has achieved success in all aspects of computing and networking is its
flexibility and vast user support community. With Open NX-OS, Linux applications can
run on the switch to complement the feature-rich NX-OS operating system without a
wrapper library or customization. The major components of Open NX-OS are listed here:
■ Kernel version 3.4: This is a 64-bit kernel that provides a balance of features and
stability.
■ Kernel stack: The user space Netstack process that previous versions of NX-OS used
has been replaced with the kernel stack. This allows the interfaces on the switch to
be mapped to the kernel as standard Linux netdevs and namespaces. Interfaces are
managed using standard Linux commands such as ifconfig and tcpdump from the
Bash shell.
■ Open package management: Tools such as RPM Package Manager (RPM) and
Yellowdog Updater, Modified (YUM) aid in installing or patching software on the
switch and provide extensibility.
■ Container support: Linux containers (LXCs) run directly on the platform and
provide access to a CentOS 7–based Guest shell. This enables users to customize their
switch in a secure and isolated environment.
Open NX-OS provides the foundation for a true DevOps-managed data center switch
by providing Linux capabilities such as modularity, fault isolation, resiliency, and
much more.
Note For more details on Open NX-OS architecture, refer to the book Programmability
and Automation with Cisco Open NX-OS, at Cisco.com.
The NX-OS operating system provides a shell more commonly known as the command-
line interface (CLI). As the practice of automation through scripting and network manage-
ment techniques has evolved, the capability to have direct shell access to the underlying
Linux operating system of NX-OS has become desirable. The following section provides
examples of the NX-OS Bash shell, the Guest shell, and Python capabilities. These pow-
erful tools enable the automation of many operational tasks, reducing the administrative
burden.
Bash Shell
Bourne-Again Shell (Bash) is a modern UNIX shell, a successor of the Bourne shell. It
provides a rich feature set and built-in capability to interact with the low-level compo-
nents of the underlying operating system. The Bash shell is currently available on the
Nexus 9000, Nexus 3000, and Nexus 3500 series platforms. The Bash shell provides
shell access to the underlying Linux operating system, which has additional capabili-
ties that the standard NX-OS CLI does not provide. To enable the Bash shell on Nexus
9000 switches, configure the command feature bash-shell. Then use the command run
bash cli to execute any Bash CLI commands. Users can also move into shell mode by
using the NX-OS CLI command run bash and then can execute the relevant Bash CLI
commands from the Bash shell. Example 15-1 illustrates how to enable the bash-shell
feature and use the Bash shell command pwd to display the current working directory.
To check whether the bash-shell feature is enabled, use the command show bash-shell.
Example 15-1 also demonstrates various basic commands on the Bash shell. The Bash
command id -a is used to verify the current user, as well as Group and Group ID infor-
mation. You can also use echo commands to print various messages based on the script
requirements.
Example 15-1 Enabling the bash-shell Feature and Using Bash Commands
bash-4.2$ id -a
uid=2002(admin) gid=503(network-admin) groups=503(network-admin)
bash-4.2$
bash-4.2$ echo "First Example on " `uname -n` " using bash-shell " $BASH_VERSION
First Example on N9k-1 using bash-shell 4.2.10(1)-release
Note It is recommended that you become familiar with the UNIX/Linux bash shell com-
mands for this section.
On NX-OS, only users with the roles network-admin, vdc-admin, and dev-ops can use
the Bash shell. Other users are restricted from using Bash unless it is specifically allowed in
their role. To check which roles are permitted to use the Bash shell, use the command
show role [name role-name]. Example 15-2 displays the permissions for the network-admin
and dev-ops user roles.
Role: network-admin
Description: Predefined network admin role has access to all commands
on the switch
-------------------------------------------------------------------
Rule Perm Type Scope Entity
-------------------------------------------------------------------
1 permit read-write
Role: dev-ops
Description: Predefined system role for devops access. This role
cannot be modified.
-------------------------------------------------------------------
Rule Perm Type Scope Entity
-------------------------------------------------------------------
6 permit command conf t ; username *
5 permit command attach module *
4 permit command slot *
3 permit command bcm module *
2 permit command run bash *
1 permit command python *
With the NX-OS bash-shell feature, it becomes possible to create Bash shell scripts con-
sisting of multiple Bash commands that execute in sequence on the underlying Linux
operating system. The Bash script is created and saved with the extension .sh. The Bash
shell also provides a trace option, which is useful for debugging a shell script while it
executes. Tracing is activated by adding the -x option to the #!/bin/bash statement or by
running the script with /bin/bash -x. Example 15-3 illustrates how to create a shell script and
verify its execution with script debugging enabled via the -x option.
bash-4.2$ pwd
/bootflash/home/admin
then
echo "Following Routes Flapped @ " `date`
vsh -c "show tech ospf >> bootflash:shtechospf"
vsh -c "show tech routing ip unicast >> bootflash:shtechrouting_unicast"
else
echo "No Flapping Routes at this point"
bash-4.2$ /bin/bash -x test.sh
+ echo 'Troubleshooting Route Flapping Issue Using Bash Shell'
Troubleshooting Route Flapping Issue Using Bash Shell
++ vsh -c 'show ip route ospf | grep 00:00:0 | count'
+ counter=0
+ echo 'Printing Counter - ' 0
Printing Counter - 0
+ '[' 0 -gt 0 ']'
+ echo 'No Flapping Routes at this point'
No Flapping Routes at this point
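The same check can be written in Python, which is convenient when the logic needs to grow beyond a simple count. The sketch below is an illustration rather than a translation of the book's script: the function name, the stricter regular expression, and the sample route output are assumptions. On the switch, the text would typically come from the cli library, for example cli("show ip route ospf").

```python
import re

# Matches route ages from 00:00:00 through 00:00:09, like the grep on "00:00:0".
RECENT_AGE = re.compile(r"\b00:00:0\d\b")


def count_recent_routes(show_output):
    """Count route lines whose age is under 10 seconds."""
    return sum(1 for line in show_output.splitlines() if RECENT_AGE.search(line))


# Illustrative output only; real output depends on the network.
sample = """\
10.1.1.0/24, ubest/mbest: 1/0
    *via 10.12.1.2, Eth1/1, [110/41], 00:00:05, ospf-1, intra
10.2.2.0/24, ubest/mbest: 1/0
    *via 10.12.1.2, Eth1/1, [110/41], 2d03h, ospf-1, intra
"""

count = count_recent_routes(sample)
if count > 0:
    print("Routes flapped recently:", count)
else:
    print("No flapping routes at this point")
```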
In addition, the Bash shell is used to install RPM packages on NX-OS. Use the yum command
from the Bash shell to perform various RPM-related operations such as listing, installing,
and removing packages. Example 15-4 demonstrates how to view the list of all installed
packages on the Nexus switch, as well as how to install and remove a package. In this
example, the BFD package is installed and removed. Note that when the package is
removed, the feature becomes unavailable from the NX-OS CLI; the packages determine
which features are made available to NX-OS.
Example 15-4 Installing and Removing RPM Packages from the Bash Shell
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
Installing:
bfd lib32_n9000 2.0.0-7.0.3.I6.1 groups-repo 483 k
Transaction Summary
================================================================================
Install 1 Package
Installed:
bfd.lib32_n9000 0:2.0.0-7.0.3.I6.1
Complete!
Dependencies Resolved
================================================================================
Package Arch Version Repository Size
================================================================================
Removing:
bfd lib32_n9000 2.0.0-7.0.3.I6.1 @groups-repo 1.8 M
Transaction Summary
================================================================================
Remove 1 Package
Removed:
bfd.lib32_n9000 0:2.0.0-7.0.3.I6.1
Complete!
Guest Shell
The network paradigm has moved from hardware, software, and management network
elements to extensible network elements. The built-in Python and Bash execution envi-
ronments enable network operators to execute custom scripts in NX-OS environments
using the Cisco-supplied APIs and classes to interact with some of the major NX-OS
components. However, in some scenarios, network operators want to integrate third-
party applications and host the application on NX-OS. To meet those needs, NX-OS
provides a third-party application hosting framework that enables users to host their
applications in a dedicated Linux user space environment. Network operators must use
the Cisco Application Development Toolkit (ADT) to cross-compile their software and
package it with a Linux root file system into a Cisco Open Virtual Appliance (OVA) pack-
age. These OVAs are then deployed on the NX-OS network element using the application
hosting feature.
NX-OS software introduces the NX-OS Guest shell feature on the Nexus 9000 and
Nexus 3000 series switches. The Guest shell is an open source and secure Linux environ-
ment for rapid third-party software development and deployment. The guestshell feature
leverages the benefits of the Python and Bash execution environments and the NX-OS
application hosting framework.
The Guest shell is enabled by default on Nexus 9000 and Nexus 3000. You can explic-
itly enable or destroy the guestshell feature on NX-OS. Table 15-1 describes the various
guest shell commands.
When the Guest shell is up and running, you can check the details using the command
show guestshell detail. This command displays the path of the OVA file, the status of
the Guest shell service, resource reservations, and the file system information of the
Guest shell. Example 15-5 displays the detailed information of the Guest shell on a
Nexus 9000 switch.
Attached devices
Type Name Alias
---------------------------------------------
Disk _rootfs
Disk /cisco/core
Serial/shell
Serial/aux
Serial/Syslog serial2
Serial/Trace serial3
If the Guest shell does not come up, check the log for any error messages using the show
logging logfile command. To troubleshoot issues with the Guest shell, use the command
show virtual-service [list] to view both the status of the Guest shell and the resources
the Guest shell is using. Example 15-6 displays the virtual service list and the resources
being utilized by the current Guest shell on the Nexus 9000 switch.
Note If you cannot resolve the Guest shell problem, collect the output of show virtual-
service tech-support and contact the Cisco Technical Assistance Center (TAC) for further
investigation.
Python
With the networking industry’s push toward software-defined networking (SDN), mul-
tiple doors have opened for integrating scripting and programming languages with net-
work devices. Python has gained industry-wide acceptance as the programming language
of choice. Python is a powerful and easy-to-learn programming language that provides
efficient high-level data structures and object-oriented features. These features make it an
ideal language for rapid application development on most platforms.
Python integration is available on most Nexus platforms and does not require the instal-
lation of any special license. The interactive Python interpreter is invoked from the CLI
on Nexus platforms by typing the python command. On Nexus 9000 and Nexus 3000
platforms, Python can also be used through the Guest shell. After executing the python
command, the user is placed directly into the Python interpreter. Example 15-7 demon-
strates the use of the Python interpreter from both the CLI and the guest shell.
Example 15-7 Python Interpreter from CLI and the Guest Shell
N9k-1# python
Python 2.7.5 (default, Nov 5 2016, 04:39:52)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "Hello World...!!!"
Hello World...!!!
N9k-1# guestshell
[admin@guestshell ~]$ python
Python 2.7.5 (default, Jun 17 2014, 18:11:42)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print "Hello Again..!!!"
Hello Again..!!!
Note Readers are advised to become familiar with the Python programming language.
This chapter does not focus on writing specific Python programs, however; instead, it
focuses on how to use Python on Nexus platforms.
In addition to the standard Python libraries, NX-OS provides the Cisco and CLI librar-
ies, which you can import into your Python script to perform Cisco-specific functions
on the Nexus switch. The Cisco library provides access to Cisco Nexus components. The
CLI library provides the capability to execute commands from the Nexus CLI and return
the result. Example 15-8 displays the package contents of both the Cisco and CLI librar-
ies on NX-OS.
NAME
cisco
FILE
/usr/lib64/python2.7/site-packages/cisco/__init__.py
PACKAGE CONTENTS
acl
bgp
buffer_depth_monitor
check_port_discards
cisco_secret
dohost
feature
history
interface
ipaddress
key
line_parser
mac_address_table
nxapi
nxcli
ospf
routemap
section_parser
ssh
system
tacacs
transfer
vlan
vrf
NAME
cli
FILE
/usr/lib64/python2.7/site-packages/cli.py
FUNCTIONS
cli(cmd)
clid(cmd)
clip(cmd)
Noninteractive Python scripts are created and saved in the bootflash:scripts/ directory
and are invoked with the source [script name] command. Another option is to utilize
the Guest shell to create and invoke a Python script. The first line of your Python script
must be a shebang line that locates the Python interpreter, such as #!/usr/bin/env python.
Example 15-9 provides a sample Python script to configure a loopback interface and also to list all
the interfaces in UP state on the Nexus switch. This script is created and invoked from
within the Guest shell environment.
#!/usr/bin/env python
import sys
from cli import *
import json
sys.exit(0)
[admin@guestshell ~]$ python test.py
mgmt0
Ethernet1/4
Ethernet1/5
Ethernet1/13
Ethernet1/14
Ethernet1/15
Ethernet1/16
Ethernet1/19
Ethernet1/32
Ethernet1/37
Ethernet2/1
port-channel10
port-channel101
port-channel600
loopback0
loopback5
loopback100
Vlan100
Vlan200
Vlan300
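A script of this kind usually parses the structured output returned by clid(). The following sketch shows one way the "list interfaces in UP state" step might look; it is not the body of the script in Example 15-9. The JSON key names (TABLE_interface, ROW_interface, interface, state) are assumed from typical show interface brief output and should be verified on the target release, and the sample data is made up so that the code also runs off-box.

```python
import json


def interfaces_up(brief_json):
    """Return the names of interfaces whose state is 'up'.

    brief_json is a JSON string such as the one returned on the switch by
    clid("show interface brief") from the cli library.
    """
    data = json.loads(brief_json)
    rows = data["TABLE_interface"]["ROW_interface"]
    if isinstance(rows, dict):  # a single row may arrive as an object
        rows = [rows]
    return [row["interface"] for row in rows if row.get("state") == "up"]


# Stand-in for clid("show interface brief") so the sketch runs anywhere.
sample = json.dumps({
    "TABLE_interface": {
        "ROW_interface": [
            {"interface": "mgmt0", "state": "up"},
            {"interface": "Ethernet1/1", "state": "down"},
            {"interface": "Ethernet1/4", "state": "up"},
        ]
    }
})

for name in interfaces_up(sample):
    print(name)
```

Keeping the parsing in a function that accepts a string makes it possible to test the logic on a workstation before running it in the Guest shell.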
A Python script can also be invoked from an Embedded Event Manager (EEM) applet
as part of the action statement. Because multiple actions can be performed per event,
multiple Python scripts can be called at different action steps, providing flexibility in
the logic used to build the EEM applet. Example 15-10 illustrates the configuration of
an EEM applet that triggers the previous Python script from an action statement. In this
example, because the Python script is configured within the Guest shell, the Python
script in the EEM script is invoked from the Guest shell. If the Python script is present
in the bootflash:scripts/ directory, the command action number cli source python
file-name must be used.
NX-SDK
The NX-OS software development kit (NX-SDK) is a C++ plug-in library that allows
custom, native applications to access NX-OS functions and infrastructure. Using the
NX-SDK, you can create custom CLI commands, syslog messages, event handlers, and
error handlers. An example of using this functionality would be creating your custom
application to register with the route manager to receive routing updates from the routing
information base (RIB) and then taking some action based on the presence of the route.
Three primary requirements must be met for using NX-SDK:
■ Docker
Note NX-SDK can also be integrated with Python. Thus, Cisco SDK is not required for
Python applications.
The NX-SDK must be installed before it can be used in the development environment.
The installation steps follow:
1. export ENXOS_SDK_ROOT=/enxos-sdk
2. cd $ENXOS_SDK_ROOT
3. source environment-setup-x86-linux
Explore the API after forking the NX-SDK from GitHub and use it to create custom
application packages to be installed on the Nexus switch.
Note When creating custom applications, refer to the documentation and custom sample
application code available as part of the NX-SDK.
Once the applications are built, use the rpm_gen.py Python script to automatically
generate the RPM package. The script is present in the /NX-SDK/scripts directory.
After the RPM package is built, copy it to the bootflash: directory on the Nexus switch
and install it there. Example 15-11 demonstrates the installation steps for an RPM
package on the Nexus 9000 switch, using the sample RPM package named customCliApp
that is available as part of the NX-SDK kit. To start a custom application, first enable
feature nxsdk. Then add the custom application as a service using the command
nxsdk service-name app-name. You can check the status of the application using
the command show nxsdk internal service.
N9k-1# conf t
! Output omitted for brevity
Enter configuration commands, one per line. End with CNTL/Z.
N9k-1(config)# install add bootflash:customCliApp-1.0-1.0.0.x86_64.rpm
[####################] 100%
Install operation 1 completed successfully at Sun Nov 26 06:12:49 2017
Inactive Packages:
customCliApp-1.0-1.0.0.x86_64
Active Packages:
customCliApp-1.0-1.0.0.x86_64
Note In Example 15-11, the RPM package is installed using the Virtual shell (VSH). The
RPM package can also be installed from the Bash shell.
If any failure or erroneous events occur with the custom application installation, you can
check the NX-SDK event history logs using the command
show nxsdk internal event-history [events | error]. Example 15-12 displays the event
history logs for NX-SDK and highlights the logs that indicate the successful activation
and startup of the application customCliApp.
Note In addition to the event history logs, you can collect the output of show tech
nxsdk if custom applications are failing to install or are not working.
NX-API
NX-OS provides an API known as the NX-API that enables you to interact with the
switch using a standard request/response language. The traditional CLI was designed for
human-to-switch interaction. Requests are made by typing a CLI command, and the switch
responds with output to the client terminal. This response data is unstructured and
requires the human operator to evaluate the output line by line to find the interesting
piece of information. Operators who use the traditional CLI to automate tasks through
scripting are forced to follow the same data interpretation method by screen-scraping
the output for the interesting data. This is not only inefficient, but also cumbersome
because it requires output iteration and specific text matching through regular expressions.
The benefit of using the NX-API is the capability to send requests and receive responses
that are optimized for machine-to-machine communication. In other words, when
communicating through the NX-API, the request and response are formatted as structured
data. The response received from the NX-API is provided as either Extensible Markup
Language (XML) or JavaScript Object Notation (JSON). This is much more efficient and
less error prone than parsing the entire human-readable CLI output for only a small
percentage of interesting data. NX-API is used to obtain output from show commands, as
well as to add or remove configuration, thus streamlining and automating operations and
management in a large-scale network.
Communication between the client and NX-API running on the switch uses the Transmission
Control Protocol (TCP) and can be either Hypertext Transfer Protocol (HTTP) or
Hypertext Transfer Protocol Secure (HTTPS), depending on the requirements. NX-API
uses HTTP basic authentication. Requests must carry the username and password in
the HTTP header. After a successful authentication, NX-API provides a session-based
cookie using the name nxapi_auth. That session cookie should be included in subsequent
NX-API requests. The privilege of the user is checked to confirm that the request is being
made by a user with a valid username and password on the switch who also has the
proper authorization for the commands being executed through the NX-API.
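The authentication flow described here can be sketched from the client side. The following is a minimal illustration under stated assumptions, not a complete client: the function names are hypothetical, the credentials and cookie value are placeholders, and the actual HTTP exchange with the switch is left out. It shows how a basic authentication header is formed and how the nxapi_auth cookie from a Set-Cookie value could be carried into later requests.

```python
import base64
from http.cookies import SimpleCookie


def basic_auth_header(username, password):
    """Build the Authorization header used by HTTP basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    return {"Authorization": "Basic " + token.decode("ascii")}


def session_cookie_header(set_cookie_value, name="nxapi_auth"):
    """Turn a Set-Cookie value into a Cookie header for subsequent requests.

    Returns an empty dict when the named cookie is not present.
    """
    jar = SimpleCookie()
    jar.load(set_cookie_value)
    if name not in jar:
        return {}
    return {"Cookie": f"{name}={jar[name].value}"}


if __name__ == "__main__":
    # Placeholder credentials and cookie value, for illustration only.
    print(basic_auth_header("admin", "secret"))
    print(session_cookie_header("nxapi_auth=abc123; Path=/; HttpOnly"))
```

Because basic authentication only encodes the credentials and does not encrypt them, HTTPS is the safer transport when the credentials must cross an untrusted network.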
After successful authentication, you can start sending requests. The NX-API request
object is either in JSON-RPC or a Cisco proprietary format. Table 15-2 describes the
fields present in the JSON-RPC request object.
Field     Description
params    A structured value that holds the parameter values to be used during the
          invocation of the method. It must contain the following fields:
          “cmd”: A CLI command
          “version”: NX-API request version identifier
id        An optional identifier established by the client that must contain a string,
          a number, or a NULL value. If the user does not specify the id parameter,
          the server assumes that the request is simply a notification and provides
          no response.
Figure 15-1 shows an example JSON-RPC request object used to query the switch for its
configured switch name.
JSON-RPC Format
[
  {
    "jsonrpc": "2.0",
    "method": "cli",
    "params": {
      "cmd": "show switchname",
      "version": 1
    },
    "id": 1
  }
]
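A client would normally generate this request object rather than write it by hand. The following minimal sketch builds the same structure shown in Figure 15-1. The function name jsonrpc_request is hypothetical, and sending the body to the switch over HTTP or HTTPS is intentionally left out.

```python
import json


def jsonrpc_request(command, request_id=1, version=1):
    """Build one JSON-RPC request object for a single CLI command."""
    return {
        "jsonrpc": "2.0",
        "method": "cli",
        "params": {"cmd": command, "version": version},
        "id": request_id,
    }


# The figure wraps the request object in a JSON array, so the same is done here.
payload = [jsonrpc_request("show switchname")]
body = json.dumps(payload, indent=2)

if __name__ == "__main__":
    print(body)
```

Supplying the id field matters: as noted in Table 15-2, a request without it is treated as a notification and receives no response.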
The second type of request object is the Cisco proprietary format, which is either XML
or JSON. Table 15-3 provides a description for the fields used in the Cisco proprietary
request object.
Field           Description
sid             Valid when the response message is chunked. To retrieve the next
                chunk of a message, the user should send a request with the sid set
                to the sid in the previous response message.
input           The input can be one or multiple commands. Multiple commands
                should be separated with “ ;” (a blank character followed by
                semicolon).
output_format   The expected output format of the request message (XML
                or JSON).
Figure 15-2 shows an example Cisco proprietary request object in both JSON and
XML formats. This request object is used to query the switch for its configured
switch name.
JSON format:
{
  "ins_api": {
    "version": "1.0",
    "type": "cli_show",
    "chunk": "0",
    "sid": "1",
    "input": "show switchname",
    "output_format": "json"
  }
}
XML format:
<?xml version="1.0"?>
<ins_api>
  <version>1.0</version>
  <type>cli_show</type>
  <chunk>0</chunk>
  <sid>sid</sid>
  <input>show switchname</input>
  <output_format>xml</output_format>
</ins_api>
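Both encodings of the Cisco proprietary request carry the same fields, so they can be generated from one dictionary. The sketch below is a minimal illustration using only the Python standard library. The helper names are hypothetical, the field values mirror the JSON side of Figure 15-2, and the XML output omits the XML declaration line.

```python
import json
import xml.etree.ElementTree as ET


def ins_api_fields(command, output_format="json"):
    """Return the request fields, using the values shown in Figure 15-2."""
    return {
        "version": "1.0",
        "type": "cli_show",
        "chunk": "0",
        "sid": "1",
        "input": command,
        "output_format": output_format,
    }


def as_json(fields):
    """Encode the fields as the JSON form of the request object."""
    return json.dumps({"ins_api": fields})


def as_xml(fields):
    """Encode the fields as the XML form of the request object."""
    root = ET.Element("ins_api")
    for name, value in fields.items():
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(as_json(ins_api_fields("show switchname")))
    print(as_xml(ins_api_fields("show switchname", output_format="xml")))
```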
The request object is sent to the switch on the configured HTTP (TCP port 80) or
HTTPS (TCP port 443) port. The received request object is validated by the web
server and the appropriate software object is provided with the request. The response
object is then sent back from the switch in either JSON-RPC or Cisco proprietary
formats to the client. Table 15-4 provides the field descriptions of the JSON-RPC
response object.
Field     Description
error     This field is included only on an errored request. The error object contains
          the following fields:
          “code”: An integer error code specified by the JSON-RPC specification
          “message”: A human-readable string that corresponds to the error code
          “data”: An optional structure that contains other useful information for
          the user.
id        This field contains the same value as the id field in the corresponding
          request object. If a problem occurred while parsing the id field in the
          request, this value is null.
{
  "jsonrpc": "2.0",
  "result": {
    "body": {
      "hostname": "NX02"
    }
  },
  "id": 1
}
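On the client side, handling this response usually means checking the id, looking for an error object, and only then reading the result. The following is a minimal sketch of that order of checks. The exception class NxapiError and the function result_body are hypothetical names, and the error object in the usage example is illustrative (code -32601 is the "Method not found" code from the JSON-RPC specification).

```python
import json


class NxapiError(Exception):
    """Raised when a JSON-RPC response cannot be used (hypothetical helper)."""


def result_body(response_text, expected_id):
    """Return result.body from a JSON-RPC response, or raise NxapiError."""
    response = json.loads(response_text)
    # The id in the response should match the id sent in the request.
    if response.get("id") != expected_id:
        raise NxapiError("response id does not match request id")
    # The error field is present only when the request failed.
    if "error" in response:
        err = response["error"]
        raise NxapiError(f"{err.get('code')}: {err.get('message')}")
    return response["result"]["body"]


if __name__ == "__main__":
    ok = '{"jsonrpc": "2.0", "result": {"body": {"hostname": "NX02"}}, "id": 1}'
    print(result_body(ok, expected_id=1))

    bad = '{"jsonrpc": "2.0", "error": {"code": -32601, "message": "Method not found"}, "id": 1}'
    try:
        result_body(bad, expected_id=1)
    except NxapiError as exc:
        print("request failed:", exc)
```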
Table 15-5 describes the fields included in the Cisco proprietary response object.
Figure 15-4 shows an example Cisco proprietary response object in both JSON and
XML formats.
JSON format:
{
  "ins_api": {
    "type": "cli_show",
    "version": "1.0",
    "sid": "eoc",
    "outputs": {
      "output": {
        "input": "show switchname",
        "msg": "Success",
        "code": "200",
        "body": {
          "hostname": "NX02"
        }
      }
    }
  }
}
XML format:
<?xml version="1.0"?>
<ins_api>
  <type>cli_show</type>
  <version>1.0</version>
  <sid>eoc</sid>
  <outputs>
    <output>
      <body>
        <hostname>NX02</hostname>
      </body>
      <input>show switchname</input>
      <msg>Success</msg>
      <code>200</code>
    </output>
  </outputs>
</ins_api>
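A client can inspect the code and msg fields of each output to decide whether a command succeeded. The sketch below is a minimal illustration using the JSON response from Figure 15-4. It assumes that a request with several commands returns output as a list while a single command returns one object, and it treats any code other than "200" as a failure. Both points are assumptions, and the function names are hypothetical.

```python
import json

# Response text taken from the JSON side of Figure 15-4.
RESPONSE = json.loads("""
{
  "ins_api": {
    "type": "cli_show",
    "version": "1.0",
    "sid": "eoc",
    "outputs": {
      "output": {
        "input": "show switchname",
        "msg": "Success",
        "code": "200",
        "body": {"hostname": "NX02"}
      }
    }
  }
}
""")


def outputs(response):
    """Return the output entries as a list, whether one or many."""
    out = response["ins_api"]["outputs"]["output"]
    return out if isinstance(out, list) else [out]


def failed_commands(response):
    """Return (input, code, msg) for every output whose code is not '200'."""
    return [
        (entry.get("input"), entry.get("code"), entry.get("msg"))
        for entry in outputs(response)
        if entry.get("code") != "200"
    ]


if __name__ == "__main__":
    for entry in outputs(RESPONSE):
        print(entry["input"], "->", entry["body"])
    print("failures:", failed_commands(RESPONSE))
```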
Multiple commands can be sent in a single request. For the JSON-RPC request object,
this is done by linking an unlimited number of single JSON-RPC requests into a single
JSON-RPC array. For the Cisco proprietary request object, up to 10 semicolon-separated
commands can be linked in the input object. With either request object type, if a request
fails for any reason, the subsequent requests are not executed.
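The two batching styles can be sketched side by side. This is a minimal illustration with hypothetical function names. It numbers the JSON-RPC requests sequentially from 1, which is a choice made here rather than a requirement, and it enforces the limit of 10 commands for the Cisco proprietary input string, joined with a blank character followed by a semicolon as described in Table 15-3.

```python
MAX_INS_API_COMMANDS = 10  # limit stated for the Cisco proprietary request


def jsonrpc_batch(commands):
    """Build a JSON-RPC array with one request object per command."""
    return [
        {
            "jsonrpc": "2.0",
            "method": "cli",
            "params": {"cmd": command, "version": 1},
            "id": index,
        }
        for index, command in enumerate(commands, start=1)
    ]


def ins_api_input(commands):
    """Join commands into a single input string for the ins_api request."""
    if len(commands) > MAX_INS_API_COMMANDS:
        raise ValueError(
            f"at most {MAX_INS_API_COMMANDS} commands per request, "
            f"got {len(commands)}"
        )
    return " ;".join(commands)


if __name__ == "__main__":
    batch = ["show switchname", "show clock"]
    print(jsonrpc_batch(batch))
    print(ins_api_input(batch))
```

Because a failed request stops execution of the ones after it, ordering the commands so that independent checks come first keeps a single failure from hiding unrelated results.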
The NX-API feature must be enabled in the global configuration of the switch using the
feature nxapi command, as shown in the output of Example 15-13.
NX-2# conf t
Enter configuration commands, one per line. End with CNTL/Z.
NX-2(config)# feature nxapi
Note The default HTTP and HTTPS ports are changed using the nxapi http port and
nxapi https port configuration commands.
When the NX-API feature is enabled, you may authenticate and begin sending requests
to the appropriate HTTP or HTTPS port. NX-OS also provides a sandbox environment
for testing the functions of the API; this is accessed by using a standard web browser
and connecting through HTTP to the switch management address.
NX-2# ethanalyzer local interface mgmt capture-filter "tcp port 443" limit-captured-frames 0
Capturing on mgmt0
192.168.1.50 -> 192.168.1.201 TCP 52018 > https [SYN] Seq=0 Win=65535 Len=0
MSS=1460 WS=5 TSV=568065210 TSER=0
192.168.1.201 -> 192.168.1.50 TCP https > 52018 [SYN, ACK] Seq=0 Ack=1
Win=16768 Len=0 MSS=1460 TSV=264852 TSER=568065210
192.168.1.50 -> 192.168.1.201 TCP 52018 > https [ACK] Seq=1 Ack=1 Win=65535
Len=0 TSV=568065211 TSER=264852
192.168.1.50 -> 192.168.1.201 SSL Client Hello
192.168.1.201 -> 192.168.1.50 TLSv1.2 Server Hello, Certificate, Server Key
Exchange, Server Hello Done
192.168.1.50 -> 192.168.1.201 TCP 52018 > https [ACK] Seq=518 Ack=1294
Win=65535 Len=0 TSV=568065232 TSER=264852
192.168.1.50 -> 192.168.1.201 TLSv1.2 Client Key Exchange, Change Cipher Spec,
Hello Request, Hello Request
192.168.1.201 -> 192.168.1.50 TLSv1.2 Encrypted Handshake Message, Change
Cipher Spec, Encrypted Handshake Message
After confirming that the TCP session from the client is established, additional
information about the NX-API communication with the client is found with the
show nxapi-server logs command. The server logs in Example 15-15 show the connection
attempt, as well as the details of the request that was received. The execution of the
CLI command is also shown in the log file, which is helpful in identifying why a
particular batch of commands is failing. Finally, the response object sent to the client
is also provided.
Message {
"ins_api": {
"version": "1.0",
"type": "cli_show",
"chunk": "0",
"sid": "1",
"input": "show switchname",
"output_format": "json"
}
}
parse_user_from_request:41 2017 November 17 02:18:25.292 : cookie had user
‘admin’
parse_user_from_request:55 2017 November 17 02:18:25.292 : auth header had user
‘admin’
pterm_idle_vsh_sweep:667 2017 November 17 02:18:25.292 : pterm_idle_vsh_sweep
pterm_get_vsh:710 2017 November 17 02:18:25.292 : vsh found: child_pid = 10558,
fprd = 0x98d0800, fpwr = 0x98d0968, fd = 14, user = admin, vdc id = 1
pterm_write_to_vsh:446 2017 November 17 02:18:25.292 : In vsh [14] Writing cmd
"show switchname | xml "
pterm_write_to_vsh:522 2017 November 17 02:18:25.302 : Cmd ‘show switchname | xml
‘ returned with ‘0’
pterm_write_to_vsh:627 2017 November 17 02:18:25.302 : Done processing vsh output
(ret=0)
_ins_api_cli_cmd:288 2017 November 17 02:18:25.302 : Incorrect XML data,
replacing special characters
_ins_api_cli_cmd:304 2017 November 17 02:18:25.302 : found ns vdc_mgr and copied
it to blob vdc_mgr len 7
pterm_write_to_vsh:446 2017 November 17 02:18:25.302 : In vsh [14] Writing cmd
"end"
pterm_write_to_vsh:522 2017 November 17 02:18:25.304 : Cmd ‘end’ returned with
‘0’
pterm_write_to_vsh:627 2017 November 17 02:18:25.304 : Done processing vsh output
(ret=0)
ngx_http_ins_api_post_body_handler:675 2017 November 17 02:18:25.304 : Sending
response {
"ins_api": {
"type": "cli_show",
"version": "1.0",
"sid": "eoc",
"outputs": {
"output": {
"input": "show switchname",
"msg": "Success",
"code": "200",
"body": {
"hostname": "NX02"
}
}
}
}
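When a batch of commands fails, the lines that record each command and its return value are the quickest ones to check in the server logs. The following minimal sketch extracts them with a regular expression. It is written against the line format visible in Example 15-15, accepts both straight and typographic quotes, and assumes that a nonzero return value marks a failed command. The function names are hypothetical.

```python
import re

# Matches entries such as:
#   ... : Cmd 'show switchname | xml ' returned with '0'
CMD_RESULT = re.compile(
    r"Cmd\s+['‘’](?P<cmd>.*?)['‘’]\s+returned with\s+['‘’](?P<ret>-?\d+)['‘’]"
)

# Two entries from Example 15-15, wrapped the way the terminal shows them.
SAMPLE = """\
pterm_write_to_vsh:522 2017 November 17 02:18:25.302 : Cmd 'show switchname | xml
' returned with '0'
pterm_write_to_vsh:522 2017 November 17 02:18:25.304 : Cmd 'end' returned with
'0'
"""


def command_results(log_text):
    """Return (command, return_value) pairs found in the log text."""
    # Rejoin wrapped lines so that each entry can be matched as one string.
    joined = " ".join(line.strip() for line in log_text.splitlines())
    return [
        (match.group("cmd").strip(), int(match.group("ret")))
        for match in CMD_RESULT.finditer(joined)
    ]


def failed_results(log_text):
    """Return only the commands whose return value is nonzero."""
    return [(cmd, ret) for cmd, ret in command_results(log_text) if ret != 0]


if __name__ == "__main__":
    for cmd, ret in command_results(SAMPLE):
        print(f"{cmd!r} returned {ret}")
```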
Note Any activity from the NX-API is logged in the switch accounting log just like in
the traditional CLI. The username associated with the NX-API is listed in the accounting
log as nginx.
In addition to the NX-API server logs, NX-OS has a detailed show tech nxapi command
that provides the server logs in addition to the nginx web server logs from the Linux
process.
Summary
Automation and programmability are the defining building blocks for the future of
networking. Open NX-OS was conceived to meet the future needs of SDN and the desire
for users to natively execute third-party applications directly on Nexus switches. Open
NX-OS provides the architecture that allows network operators and developers to create
and deploy custom applications on their network devices. Integration of the powerful
Bash shell and Guest shell has made it easy to create scripts for automating tasks on
Nexus switches. This chapter covered in detail how you can leverage the Bash shell and
the Guest shell to deploy third-party applications. Integration of Python with NX-OS
enables you to create dynamic applications that enhance the functionality and
manageability of Nexus switches. In addition to Python support, Cisco provides the NX-SDK,
which supports building applications in both the C++ and Python languages and compiling
them as RPM packages. Finally, this chapter covered NX-API, an API that enables users
to interact with the Nexus switch using a standard request/response language.