Hardware accelerating Linux network functions
Roopa Prabhu, Wilson Kok
Proceedings of netdev 0.1, Feb 14-17, 2015, Ottawa, On, Canada
Agenda
● Recap: offload models, offload drivers
● Introduction to switch ASIC hardware
● L2 offload to switch ASIC
○ MAC learning, ageing
○ STP handling
○ IGMP snooping
○ VXLAN
● L3 offload to switch ASIC
Offload models ...
● Single consistent netlink-based UAPI
● Single kernel offload API to offload to a variety of hardware (NICs, switch ASICs, ...)
rtnetlink API:
bridge vlan add
bridge fdb add
[Figure: rtnetlink API path and offload API path. Kernel: bridges with their FDB (in sync with hw) over port1, port2 ... portn. Hardware: NIC1 and NIC2, each with an FDB; switch ASIC with FDB, CPU and MEM.]
The bigger picture...
[Figure: the full stack.
User space: iproute2, bridge, brctl, tc, nftables, OVSdb, mstpd, snmpd, lldpd, quagga, bird.
Kernel: swp1 ... swpN, bonds, bridges, VXLAN, hw driver; routing tables, ARP tables, bridge FDB/MDB, netfilter tables, tc.
HW: routing tables, ARP tables, bridge FDB/MDB, ACLs; CPU, MEM.]
HW offload driver (kernel)
[Figure: hardware offload driver in the kernel.
User space: mstp daemon, routing daemon; rtnetlink API.
Kernel: br0 over switch ports swp1, swp2 ... swpN; bridge FDB/MDB; FIB; switchdev offload API; hw driver.
HW: routing tables, ARP tables, bridge FDB/MDB, ACLs; CPU, ASIC, MEM.]
hw driver entry points:
netdev_ops {
.ndo_fdb_add/del
.ndo_fib_add/del
}
HW offload driver (user space)
[Figure: hardware offload driver in user space.
User space: mstp daemon, routing daemon; rtnetlink API; hw driver with an rtnetlink notifications listener.
Kernel: br0 over switch ports swp1, swp2 ... swpN; bridge FDB/MDB; FIB.
HW: routing tables, ARP tables, bridge FDB/MDB, ACLs; CPU, ASIC, MEM.]
switch hardware
switch driver:
● Creates netdevs for front panel ports
● Port netdevs only see traffic forwarded to the CPU port
● Sets hardware offload flag NETIF_F_HW_SWITCH_OFFLOAD on netdevs
[Figure: kernel: netdevs swp1, swp2, swp3 ... swpn, one for each front panel port, created by the switch driver. Switch hardware: cpu port and front panel ports 1, 2, 3 ... n.]
ip link show
eth0 is the management port; swp1 ... swp54s3 are switch ports.
# ip link show
1: lo: <LOOPBACK> mtu 16436 qdisc noqueue state DOWN mode DEFAULT
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
    link/ether 00:e0:ec:27:4e:b6 brd ff:ff:ff:ff:ff:ff
3: swp1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP mode DEFAULT qlen 500
    link/ether 44:38:39:00:27:ac brd ff:ff:ff:ff:ff:ff
4: swp2: <BROADCAST,MULTICAST> mtu 9000 qdisc pfifo_fast state DOWN mode DEFAULT qlen 500
    link/ether 00:e0:ec:27:4e:b8 brd ff:ff:ff:ff:ff:ff
[snip]
55: swp53: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 500
    link/ether 00:e0:ec:27:4e:f7 brd ff:ff:ff:ff:ff:ff
56: swp54s0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 500
    link/ether 00:e0:ec:27:4e:fb brd ff:ff:ff:ff:ff:ff
57: swp54s1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 500
    link/ether 00:e0:ec:27:4e:fc brd ff:ff:ff:ff:ff:ff
58: swp54s2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 500
    link/ether 00:e0:ec:27:4e:fd brd ff:ff:ff:ff:ff:ff
59: swp54s3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 500
    link/ether 00:e0:ec:27:4e:fe brd ff:ff:ff:ff:ff:ff
ethtool on switch port
$ ethtool swp1
Settings for swp1:
    Supported ports: [ FIBRE ]
    Supported link modes: 1000baseT/Full
                          10000baseT/Full
    Supported pause frame use: Symmetric Receive-only
    Supports auto-negotiation: Yes
    Advertised link modes: 1000baseT/Full
    Advertised pause frame use: No
    Advertised auto-negotiation: No
    Speed: 10000Mb/s
    Duplex: Full
    Port: FIBRE
    PHYAD: 0
    Transceiver: external
    Auto-negotiation: off
    Current message level: 0x00000000 (0)
    Link detected: yes
Creating a hardware accelerated Linux bridge device
# ip link add br0 type bridge
# ip link set dev swp1 master br0
# ip link set dev swp2 master br0
# bridge vlan add vid 10-20 dev swp1
# bridge vlan add vid 20-30 dev swp2
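The same sequence can be generated for any set of ports. The sketch below is only an illustration: the helper name `bridge_commands` and its argument layout are hypothetical, and it only builds the command strings shown above without running them.

```python
def bridge_commands(bridge, port_vlans):
    """Return the iproute2 commands that build `bridge` over the given ports.

    port_vlans maps a port name to a (first_vid, last_vid) VLAN range.
    The function only formats strings; nothing is executed.
    """
    cmds = [f"ip link add {bridge} type bridge"]
    # Enslave every switch port to the bridge first.
    for port in port_vlans:
        cmds.append(f"ip link set dev {port} master {bridge}")
    # Then add the VLAN range for each port.
    for port, (first, last) in port_vlans.items():
        vids = str(first) if first == last else f"{first}-{last}"
        cmds.append(f"bridge vlan add vid {vids} dev {port}")
    return cmds


cmds = bridge_commands("br0", {"swp1": (10, 20), "swp2": (20, 30)})
for cmd in cmds:
    print("# " + cmd)
```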
Bonds as bridge ports
● switch ASICs support link aggregation
● bonding driver LAG config is offloaded to the switch ASIC
● fdb and vlan offloads go through the bonding driver
rtnetlink API:
bridge vlan add
bridge fdb add
[Figure: rtnetlink API and switchdev offload API. Kernel: bridge with FDB (in sync with hw), bonding driver, bond0; ports port1, port2 ... portn-1, portn. Switch ASIC: port1, port2, LAG bond0 (portn-1, portn); FDB, CPU, MEM.]
Bridging hardware offload: packet path
[Figure: kernel bridge over swp1 and swp2 above the switch ASIC VLAN over swp1 and swp2. Flows shown: known unicast (transit), BUM*, and traffic that is system generated or destined to the system.]
* BUM: broadcast, unknown unicast and multicast
Bridging hardware offload: packet path
● Known unicast traffic not destined to the system is forwarded only in hardware
● BUM traffic is forwarded in hardware, and a copy MAY be sent to the kernel
● BUM traffic that reaches the kernel should not be forwarded again, otherwise receivers get duplicate copies from hardware and software
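A minimal model of these three rules, as a sketch only. The `hw_forwarded` mark, the `Frame` type and both function names are hypothetical, and the model always sends the BUM copy to the kernel although the hardware is only allowed to (MAY) do so.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    dst_mac: str
    in_port: str
    hw_forwarded: bool = False  # hypothetical mark set when hardware already forwarded


def is_multicast(mac):
    # Group bit: least significant bit of the first octet (broadcast included).
    return (int(mac.split(":")[0], 16) & 1) == 1


def hardware_path(frame, fdb, ports, system_macs):
    """Return (egress ports used by hardware, copy handed to the kernel or None)."""
    if frame.dst_mac in system_macs:
        return [], frame  # destined to the system: CPU only
    if not is_multicast(frame.dst_mac) and frame.dst_mac in fdb:
        return [fdb[frame.dst_mac]], None  # known unicast: hardware only
    flood = [p for p in ports if p != frame.in_port]
    return flood, Frame(frame.dst_mac, frame.in_port, hw_forwarded=True)


def kernel_flood(frame, ports):
    """Ports the software bridge floods a BUM frame to."""
    if frame.hw_forwarded:
        return []  # hardware already flooded it: avoid duplicates
    return [p for p in ports if p != frame.in_port]


ports = ["swp1", "swp2", "swp3"]
fdb = {"00:11:22:33:44:55": "swp2"}
egress, copy = hardware_path(Frame("ff:ff:ff:ff:ff:ff", "swp1"), fdb, ports, set())
print(egress, kernel_flood(copy, ports))
```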
Bridging hardware offload: fdb learn
[Figure: fdb learn flow. HW: CPU, ASIC, MEM raise hw events (learn/move) for 00:11:22:33:44:55 on intf_id 9876. Kernel: the switch driver issues an fdb add/update (00:11:22:33:44:55, vlan 10, br0, swp2) to the bridge br0 FDB/MDB. User space: br0 with swp1, swp2 ... swpN; rtnetlink notification.]
Bridging hardware offload: learning in HW
● Turn off learning in the bridge driver
● The switch driver listens to learn notifications from hardware
● It converts the hardware interface id and vlan to the kernel ifindex of the bridge port (and vlan) and of the bridge
● It sends a netlink fdb update to the kernel (userspace driver) or calls the bridge driver learn sync switchdev API (kernel driver)
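The translation step can be sketched as follows. All type and field names here are hypothetical, the hardware vlan is assumed to map one to one onto the kernel vlan unless a `vlan_map` is given, and delivering the resulting update (netlink or switchdev) is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HwLearnEvent:
    mac: str
    hw_intf_id: int
    hw_vlan: int


@dataclass(frozen=True)
class FdbUpdate:
    mac: str
    vlan: int
    port_ifindex: int
    bridge_ifindex: int


class LearnTranslator:
    """Maps hardware learn events to kernel-side fdb updates."""

    def __init__(self, intf_to_ifindex, port_to_bridge, vlan_map=None):
        self.intf_to_ifindex = intf_to_ifindex  # hw intf id -> port ifindex
        self.port_to_bridge = port_to_bridge    # port ifindex -> bridge ifindex
        self.vlan_map = vlan_map or {}          # hw vlan -> kernel vlan

    def translate(self, event):
        port = self.intf_to_ifindex.get(event.hw_intf_id)
        if port is None:
            return None  # not a port known to the kernel
        bridge = self.port_to_bridge.get(port)
        if bridge is None:
            return None  # port is not a bridge member
        vlan = self.vlan_map.get(event.hw_vlan, event.hw_vlan)
        return FdbUpdate(event.mac, vlan, port, bridge)


# Example ifindex values (4 for swp2, 10 for br0) are made up.
translator = LearnTranslator({9876: 4}, {4: 10})
update = translator.translate(HwLearnEvent("00:11:22:33:44:55", 9876, 10))
print(update)
```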
Bridging hardware offload: kernel ageing
[Figure: kernel ageing. Elements: br0 with swp1, swp2 ... swpN; rtnetlink; switch driver; bridge br0 FDB/MDB; HW (CPU, ASIC, MEM). Labelled flows: get fdb hit status, fdb update, fdb delete.]
Bridging hardware offload: hardware ageing
[Figure: hardware ageing. Elements: br0 with swp1, swp2 ... swpN; rtnetlink; switch driver; bridge br0 FDB/MDB; HW (CPU, ASIC, MEM). Labelled flows: fdb delete.]
Bridging hardware offload: ageing
With hardware offload, the bridge driver very seldom sees packets, so its FDB age is not up to date.
Hardware ageing
● bridge driver should not do ageing if hardware is doing it
● fdb show will need to get the age from hardware during 'show', or needs a periodic age update from the switch driver
Kernel ageing
● definitely needs a periodic age update from the switch driver
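For the kernel ageing case, one periodic pass could look like the sketch below. The function name, the dictionary layout and the 300 second ageing time are illustrative assumptions; the hit bits stand in for whatever the switch driver reads back from the ASIC.

```python
def kernel_ageing_pass(fdb, hw_hit_bits, now, ageing_time):
    """Refresh entries hit in hardware and expire the stale ones.

    fdb:         dict mac -> time the entry was last seen (seconds)
    hw_hit_bits: dict mac -> True if hardware forwarded traffic for it
                 since the previous pass
    Returns the list of expired MACs, which are also removed from fdb.
    """
    expired = []
    for mac in list(fdb):
        if hw_hit_bits.get(mac, False):
            fdb[mac] = now  # traffic seen in hardware: refresh the age
        elif now - fdb[mac] >= ageing_time:
            del fdb[mac]    # idle for a full ageing interval: age out
            expired.append(mac)
    return expired


fdb = {"00:11:22:33:44:55": 0, "00:11:22:33:44:66": 0}
expired = kernel_ageing_pass(fdb, {"00:11:22:33:44:55": True}, now=300, ageing_time=300)
print(expired, fdb)
```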
STP offload
STP
● bridge driver maintains STP states (either kernel STP or
userspace STP)
● bridge driver communicates STP states to switch driver
using switchdev offload API
● OR a switch driver in userspace can listen to STP state
notifications to update HW state
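Either way, the driver has to map bridge STP states onto hardware port behaviour. A sketch of that mapping: the numeric values mirror the BR_STATE_* constants in linux/if_bridge.h, while `HwPortState`, `FakeAsic` and `on_stp_state_change` are hypothetical names, and real ASICs may expose different port state knobs.

```python
from dataclasses import dataclass
from enum import IntEnum


class BrState(IntEnum):
    DISABLED = 0
    LISTENING = 1
    LEARNING = 2
    FORWARDING = 3
    BLOCKING = 4


@dataclass(frozen=True)
class HwPortState:
    learn: bool
    forward: bool


# Assumed mapping: only LEARNING and FORWARDING learn, only FORWARDING forwards.
HW_STATE = {
    BrState.DISABLED: HwPortState(learn=False, forward=False),
    BrState.LISTENING: HwPortState(learn=False, forward=False),
    BrState.LEARNING: HwPortState(learn=True, forward=False),
    BrState.FORWARDING: HwPortState(learn=True, forward=True),
    BrState.BLOCKING: HwPortState(learn=False, forward=False),
}


class FakeAsic:
    """Stand-in for the hardware; records the last state set per port."""

    def __init__(self):
        self.ports = {}

    def set_port_state(self, port, state):
        self.ports[port] = state


def on_stp_state_change(asic, port, br_state):
    # Called from the offload API path or from a userspace notification listener.
    asic.set_port_state(port, HW_STATE[BrState(br_state)])


asic = FakeAsic()
on_stp_state_change(asic, "swp1", BrState.FORWARDING)
on_stp_state_change(asic, "swp2", 4)  # raw value as carried in a notification
print(asic.ports)
```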
IGMP snooping offload
[Figure: kernel bridge over swp1 and swp2 above the switch ASIC. A host behind swp1 joins 224.1.2.3 (report); a querier behind swp2 sends the 224.1.2.3 query; data flows through the switch ASIC.]
Bridge mdb state:
dev bridge port swp1 grp 224.1.2.3 temp
router ports on bridge: swp2
IGMP snooping offload
● switch driver configures hardware to send IGMP reports and queries to software
● bridge driver maintains IGMP group membership
● in some cases the reports or queries need to be re-forwarded in the kernel
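The membership state kept by the bridge driver can be pictured with the small model below. It is a sketch with hypothetical names; timers, leaves triggered by timeouts and the handling of unregistered groups are not modeled.

```python
class SnoopingBridge:
    """Toy IGMP snooping state: group members plus multicast router ports."""

    def __init__(self):
        self.mdb = {}             # group address -> set of member ports
        self.router_ports = set()

    def on_report(self, port, group):
        # A join/report seen on `port` makes it a member of `group`.
        self.mdb.setdefault(group, set()).add(port)

    def on_leave(self, port, group):
        members = self.mdb.get(group, set())
        members.discard(port)
        if not members:
            self.mdb.pop(group, None)

    def on_query(self, port):
        # A query seen on `port` marks it as a router port.
        self.router_ports.add(port)

    def egress_ports(self, group, in_port):
        """Ports that receive data for a registered group."""
        ports = self.mdb.get(group, set()) | self.router_ports
        return sorted(ports - {in_port})


br = SnoopingBridge()
br.on_report("swp1", "224.1.2.3")
br.on_query("swp2")
print(br.egress_ports("224.1.2.3", "swp2"))
```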
VXLAN offload - hardware vtep
[Figure: hardware VTEP with lo: 172.16.20.103, uplink swp3, and a bridge with ports swp1, swp2 and vxlan100. Local hosts: macA (20.0.0.3) on swp1, macB (20.0.0.5) on swp2. Remote host: macC (20.0.0.2) behind 172.16.21.150.]
vxlan100 remote destinations:
MAC       Destination
macC      172.16.21.150
unknown   172.16.22.125
bridge FDB:
MAC       Interface
macA      swp1
macB      swp2
macC      vxlan100
VXLAN offload - hardware vtep
Model
● VXLAN link as bridge port
○ bridging between local ports
○ VXLAN tunneling for remote MACs
● BUM traffic handling
○ multicast
○ using off-system replicator
■ could have a list of redundant replicators; need to choose ONE out of the list of remote dests (per flow, per VNI, etc.)
○ self replication
■ VTEP sends to a list of remote VTEPs; need to choose ALL of the list of remote dests
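The difference between the two unicast-based BUM modes is only how the destination list is consumed. A sketch, assuming a CRC32 hash over a caller-supplied flow key for the replicator choice; the function names and the key format are hypothetical, and the addresses are illustrative.

```python
import zlib


def pick_replicator(replicators, flow_key):
    """Off-system replication: choose ONE replicator, stable per flow key."""
    if not replicators:
        raise ValueError("no replicator configured")
    index = zlib.crc32(flow_key.encode()) % len(replicators)
    return replicators[index]


def self_replicate(remote_vteps):
    """Self replication: the VTEP sends a copy to ALL remote VTEPs."""
    return list(remote_vteps)


replicators = ["172.16.22.125", "172.16.22.126"]
vteps = ["172.16.21.150", "172.16.21.151", "172.16.21.152"]

# Per-VNI choice: the flow key is just the VNI here.
chosen = pick_replicator(replicators, "vni:100")
print(chosen, self_replicate(vteps))
```

Keying the hash on the VNI gives one replicator per VNI; a per-flow choice would put more header fields into the key.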
VXLAN offload - ovsdb integration
Agent to translate ovsdb schema objects to kernel constructs.
OVSDB                                                Linux kernel
logical switch                                       vxlan link + bridge
physical switch tunnel_ip                            vxlan link local ip
logical port binding                                 bridge member port, vlan
unicast remote mac + physical locator                bridge fdb (mac, vlan, dst <remote ip>)
mcast remote mac “unknown” + physical locator list   vxlan link default dest
unicast local mac + physical locator                 bridge fdb (mac, vlan, local dev)
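The MAC rows of this mapping can be sketched as a small translation layer. Everything below is hypothetical (record type, function names, argument layout); it produces plain records and does not talk to OVSDB or netlink.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FdbEntry:
    """Kernel-side bridge fdb record as described in the mapping."""
    mac: str
    vlan: int
    dev: Optional[str] = None  # local bridge port, or the vxlan link
    dst: Optional[str] = None  # remote VTEP ip for remote MACs


def ucast_remote(mac, vlan, locator_ip, vxlan_dev):
    # unicast remote mac + physical locator -> bridge fdb (mac, vlan, dst)
    return FdbEntry(mac, vlan, dev=vxlan_dev, dst=locator_ip)


def ucast_local(mac, vlan, local_dev):
    # unicast local mac + physical locator -> bridge fdb (mac, vlan, local dev)
    return FdbEntry(mac, vlan, dev=local_dev)


def mcast_remote(mac, locator_ips):
    # mcast remote mac "unknown" + locator list -> vxlan link default dests
    if mac != "unknown":
        raise ValueError("only the 'unknown' mcast remote mac is modeled")
    return list(locator_ips)


remote = ucast_remote("00:11:22:33:44:55", 10, "172.16.21.150", "vxlan100")
local = ucast_local("00:11:22:33:44:66", 10, "swp1")
defaults = mcast_remote("unknown", ["172.16.22.125"])
print(remote, local, defaults)
```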
L3 offloads
ip route add 1.1.1.1/32 nexthop via 192.168.200.3 nexthop via 192.168.200.4
[Figure: rtnetlink API path and offload API path.
User space: iproute, Quagga/Bird, network manager; arping for unresolved nexthop.
Kernel: swp1, swp2 ... swpN; FIB; neigh table; switch driver.
HW: routing tables, neigh tables; CPU, ASIC, MEM.]
L3 hardware offload
● Routes from routing daemons go to the kernel
● Unresolved next hops point to the CPU in HW
● The switch driver tries to resolve them with probes (arping)
● Neigh entries are refreshed for packets routed through hardware (hit bit provided by hardware)
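How unresolved next hops move from the CPU to a real neighbour can be sketched with the model below. The class and method names are hypothetical, the probe itself (arping) and the hit-bit refresh are left out, and the MAC address is illustrative.

```python
CPU = "CPU"  # placeholder target for next hops hardware cannot resolve yet


class L3Offload:
    """Toy model of route programming with unresolved next hops."""

    def __init__(self):
        self.routes = {}     # prefix -> list of nexthop ips
        self.neigh = {}      # nexthop ip -> mac
        self.hw_routes = {}  # prefix -> list of macs, or CPU

    def add_route(self, prefix, nexthops):
        self.routes[prefix] = list(nexthops)
        self._program(prefix)

    def _program(self, prefix):
        # Unresolved next hops point to the CPU in hardware.
        self.hw_routes[prefix] = [
            self.neigh.get(nh, CPU) for nh in self.routes[prefix]
        ]

    def unresolved_nexthops(self):
        """Next hops the switch driver still has to probe."""
        return sorted(
            {nh for nhs in self.routes.values() for nh in nhs
             if nh not in self.neigh}
        )

    def on_neigh_resolved(self, ip, mac):
        self.neigh[ip] = mac
        for prefix, nhs in self.routes.items():
            if ip in nhs:
                self._program(prefix)


l3 = L3Offload()
l3.add_route("1.1.1.1/32", ["192.168.200.3", "192.168.200.4"])
before = list(l3.hw_routes["1.1.1.1/32"])
l3.on_neigh_resolved("192.168.200.3", "00:11:22:33:44:55")
print(before, l3.hw_routes["1.1.1.1/32"], l3.unresolved_nexthops())
```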