Commit f389a40

Merge branch 'ipv4-nexthop-link-status'
Andy Gospodarek says:

====================
changes to make ipv4 routing table aware of next-hop link status

This series adds the ability to have the Linux kernel track whether or not a particular route should be used based on the link-status of the interface associated with the next-hop.

Before this patch any link-failure on an interface that was serving as a gateway for some systems could result in those systems being isolated from the rest of the network as the stack would continue to attempt to send frames out of an interface that is actually linked-down. When the kernel is responsible for all forwarding, it should also be responsible for taking action when the traffic can no longer be forwarded -- there is no real need to outsource link-monitoring to userspace anymore.

This feature is only enabled with the new per-interface or ipv4 global sysctls called 'ignore_routes_with_linkdown'.

net.ipv4.conf.all.ignore_routes_with_linkdown = 0
net.ipv4.conf.default.ignore_routes_with_linkdown = 0
net.ipv4.conf.lo.ignore_routes_with_linkdown = 0
...

When the above sysctls are set, the kernel will not only report to userspace that the link is down, but it will also report to userspace that a route is dead. This will signal to userspace that the route will not be selected.

With the new sysctls set, the following behavior can be observed (interface p8p1 is link-down):

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 dead linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 dead linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
90.0.0.1 via 70.0.0.2 dev p7p1 src 70.0.0.1
    cache
local 80.0.0.1 dev lo src 80.0.0.1
    cache <local>
80.0.0.2 via 10.0.5.2 dev p9p1 src 10.0.5.15
    cache

While the route does remain in the table (so it can be modified if needed rather than being wiped away as it would be if IFF_UP was cleared), the proper next-hop is chosen automatically when the link is down. Now interface p8p1 is linked-up:

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2
192.168.56.0/24 dev p2p1 proto kernel scope link src 192.168.56.2
90.0.0.1 via 80.0.0.2 dev p8p1 src 80.0.0.1
    cache
local 80.0.0.1 dev lo src 80.0.0.1
    cache <local>
80.0.0.2 dev p8p1 src 80.0.0.1
    cache

and the output changes to what one would expect.

If the global or interface sysctl is not set, the following output would be expected when p8p1 is down:

default via 10.0.5.2 dev p9p1
10.0.5.0/24 dev p9p1 proto kernel scope link src 10.0.5.15
70.0.0.0/24 dev p7p1 proto kernel scope link src 70.0.0.1
80.0.0.0/24 dev p8p1 proto kernel scope link src 80.0.0.1 linkdown
90.0.0.0/24 via 80.0.0.2 dev p8p1 metric 1 linkdown
90.0.0.0/24 via 70.0.0.2 dev p7p1 metric 2

If the dead flag does not appear there should be no expectation that the kernel would skip using this route due to link being down.

v2: Split kernel changes into 2 patches: first to add linkdown flag and second to add new sysctl settings. Also took suggestion from Alex to simplify code by only checking sysctl during fib lookup and suggestion from Scott to add a per-interface sysctl. Added iproute2 patch to recognize and print linkdown flag.

v3: Code cleanups along with reverse-path checks suggested by Alex and small fixes related to problems found when multipath was disabled.

v4: Drop binary sysctls

v5: Whitespace and variable declaration fixups suggested by Dave

v6: Style changes noticed by Dave and checkpatch suggestions.

v7: Last checkpatch fixup.

Though there were some that preferred not to have a configuration option and to make this behavior the default when it was discussed in Ottawa earlier this year since "it was time to do this," I wanted to propose the config option to preserve the current behavior for those that desire it. I'll happily remove it if Dave and Linus approve.

An IPv6 implementation is also needed (DECnet too!), but I wanted to start with the IPv4 implementation to get people comfortable with the idea before moving forward. If this is accepted the IPv6 implementation can be posted shortly.

There was also a request for switchdev support for this, but that will be posted as a followup as switchdev does not currently handle dead next-hops in a multi-path case and I felt that infra needed to be added first.

FWIW, we have been running the original version of this series with a global sysctl and our customers have been happily using a backported version for IPv4 and IPv6 for >6 months.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
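
The selection behavior described in the cover letter can be modeled in a few lines of user-space C. This is only an illustrative sketch, not the kernel's lookup code: `struct route_model`, `pick_route` and `MODEL_F_LINKDOWN` are hypothetical names, and the model simply assumes that the lowest-metric candidate wins and that a link-down candidate is skipped only when the sysctl is enabled.

```c
#include <stddef.h>

/* Hypothetical flag for this model; the value mirrors RTNH_F_LINKDOWN. */
#define MODEL_F_LINKDOWN 16u

/* Hypothetical, simplified stand-in for one route to a prefix. */
struct route_model {
	const char *dev;       /* outgoing interface name */
	unsigned int metric;   /* lower is preferred */
	unsigned int flags;    /* MODEL_F_* bits */
};

/*
 * Return the preferred route among n candidates, or NULL if none is usable.
 * ignore_linkdown models the ignore_routes_with_linkdown sysctl: when it is
 * nonzero, candidates whose link is down are skipped.
 */
static const struct route_model *
pick_route(const struct route_model *routes, size_t n, int ignore_linkdown)
{
	const struct route_model *best = NULL;
	size_t i;

	for (i = 0; i < n; i++) {
		const struct route_model *r = &routes[i];

		if (ignore_linkdown && (r->flags & MODEL_F_LINKDOWN))
			continue;
		if (!best || r->metric < best->metric)
			best = r;
	}
	return best;
}
```

Fed the two 90.0.0.0/24 routes from the example (p8p1 at metric 1 with its link down, p7p1 at metric 2), the model picks p7p1 when the sysctl is on and p8p1 when it is off, which matches the outputs quoted in the message.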
2 parents 5c8079d + 0eeb075 commit f389a40

12 files changed, +130 -47 lines changed

include/linux/inetdevice.h

Lines changed: 3 additions & 0 deletions
@@ -120,6 +120,9 @@ static inline void ipv4_devconf_setall(struct in_device *in_dev)
 	 || (!IN_DEV_FORWARD(in_dev) && \
 	  IN_DEV_ORCONF((in_dev), ACCEPT_REDIRECTS)))
 
+#define IN_DEV_IGNORE_ROUTES_WITH_LINKDOWN(in_dev) \
+	IN_DEV_CONF_GET((in_dev), IGNORE_ROUTES_WITH_LINKDOWN)
+
 #define IN_DEV_ARPFILTER(in_dev) IN_DEV_ORCONF((in_dev), ARPFILTER)
 #define IN_DEV_ARP_ACCEPT(in_dev) IN_DEV_ORCONF((in_dev), ARP_ACCEPT)
 #define IN_DEV_ARP_ANNOUNCE(in_dev) IN_DEV_MAXCONF((in_dev), ARP_ANNOUNCE)

include/net/fib_rules.h

Lines changed: 2 additions & 1 deletion
@@ -36,7 +36,8 @@ struct fib_lookup_arg {
 	void *result;
 	struct fib_rule *rule;
 	int flags;
-#define FIB_LOOKUP_NOREF 1
+#define FIB_LOOKUP_NOREF		1
+#define FIB_LOOKUP_IGNORE_LINKSTATE	2
 };
 
 struct fib_rules_ops {

include/net/ip_fib.h

Lines changed: 11 additions & 9 deletions
@@ -226,15 +226,15 @@ static inline struct fib_table *fib_new_table(struct net *net, u32 id)
 }
 
 static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
-			     struct fib_result *res)
+			     struct fib_result *res, unsigned int flags)
 {
 	struct fib_table *tb;
 	int err = -ENETUNREACH;
 
 	rcu_read_lock();
 
 	tb = fib_get_table(net, RT_TABLE_MAIN);
-	if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
+	if (tb && !fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF))
 		err = 0;
 
 	rcu_read_unlock();
@@ -249,28 +249,30 @@ void __net_exit fib4_rules_exit(struct net *net);
 struct fib_table *fib_new_table(struct net *net, u32 id);
 struct fib_table *fib_get_table(struct net *net, u32 id);
 
-int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res);
+int __fib_lookup(struct net *net, struct flowi4 *flp,
+		 struct fib_result *res, unsigned int flags);
 
 static inline int fib_lookup(struct net *net, struct flowi4 *flp,
-			     struct fib_result *res)
+			     struct fib_result *res, unsigned int flags)
 {
 	struct fib_table *tb;
 	int err;
 
+	flags |= FIB_LOOKUP_NOREF;
 	if (net->ipv4.fib_has_custom_rules)
-		return __fib_lookup(net, flp, res);
+		return __fib_lookup(net, flp, res, flags);
 
 	rcu_read_lock();
 
 	res->tclassid = 0;
 
 	for (err = 0; !err; err = -ENETUNREACH) {
 		tb = rcu_dereference_rtnl(net->ipv4.fib_main);
-		if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
+		if (tb && !fib_table_lookup(tb, flp, res, flags))
 			break;
 
 		tb = rcu_dereference_rtnl(net->ipv4.fib_default);
-		if (tb && !fib_table_lookup(tb, flp, res, FIB_LOOKUP_NOREF))
+		if (tb && !fib_table_lookup(tb, flp, res, flags))
 			break;
 	}
 
@@ -305,9 +307,9 @@ void fib_flush_external(struct net *net);
 
 /* Exported by fib_semantics.c */
 int ip_fib_check_default(__be32 gw, struct net_device *dev);
-int fib_sync_down_dev(struct net_device *dev, int force);
+int fib_sync_down_dev(struct net_device *dev, unsigned long event);
 int fib_sync_down_addr(struct net *net, __be32 local);
-int fib_sync_up(struct net_device *dev);
+int fib_sync_up(struct net_device *dev, unsigned int nh_flags);
 void fib_select_multipath(struct fib_result *res);
 
 /* Exported by fib_trie.c */
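
The new `flags` argument to `fib_lookup()` is a plain bit mask: callers pass 0 or FIB_LOOKUP_IGNORE_LINKSTATE, and the helper always ORs in FIB_LOOKUP_NOREF before the table lookup. A minimal user-space sketch of that composition follows; the two constants are copied from the fib_rules.h hunk, while `effective_lookup_flags` and `lookup_ignores_linkstate` are hypothetical helper names that do not exist in the kernel.

```c
/* Values as defined in include/net/fib_rules.h by this series. */
#define FIB_LOOKUP_NOREF            1u
#define FIB_LOOKUP_IGNORE_LINKSTATE 2u

/* Hypothetical helper: the flags a table lookup sees for a given caller value. */
static unsigned int effective_lookup_flags(unsigned int caller_flags)
{
	/* fib_lookup() always adds NOREF on top of whatever the caller asked for. */
	return caller_flags | FIB_LOOKUP_NOREF;
}

/* Hypothetical helper: does this lookup ask to disregard link state? */
static int lookup_ignores_linkstate(unsigned int flags)
{
	return (flags & FIB_LOOKUP_IGNORE_LINKSTATE) != 0;
}
```

Keeping the two options as independent bits means existing callers can pass 0 and keep their old behavior, while a caller such as the reverse-path check can opt out of link-state filtering without affecting NOREF.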

include/uapi/linux/ip.h

Lines changed: 1 addition & 0 deletions
@@ -164,6 +164,7 @@ enum
 	IPV4_DEVCONF_ROUTE_LOCALNET,
 	IPV4_DEVCONF_IGMPV2_UNSOLICITED_REPORT_INTERVAL,
 	IPV4_DEVCONF_IGMPV3_UNSOLICITED_REPORT_INTERVAL,
+	IPV4_DEVCONF_IGNORE_ROUTES_WITH_LINKDOWN,
 	__IPV4_DEVCONF_MAX
 };
 

include/uapi/linux/rtnetlink.h

Lines changed: 3 additions & 0 deletions
@@ -338,6 +338,9 @@ struct rtnexthop {
 #define RTNH_F_PERVASIVE 2 /* Do recursive gateway lookup */
 #define RTNH_F_ONLINK 4 /* Gateway is forced on link */
 #define RTNH_F_OFFLOAD 8 /* offloaded route */
+#define RTNH_F_LINKDOWN 16 /* carrier-down on nexthop */
+
+#define RTNH_COMPARE_MASK (RTNH_F_DEAD | RTNH_F_LINKDOWN)
 
 /* Macros to handle hexthops */
 
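
RTNH_COMPARE_MASK groups the two bits that describe run-time state (dead, link down) rather than configuration. The code that consumes the mask is not part of this excerpt, so the following is only a guess at the intended use: comparing two nexthop flag words while ignoring those state bits. `nh_flags_match` is a hypothetical name, and RTNH_F_DEAD = 1 is assumed from the upstream header because its definition lies outside the hunk shown.

```c
/* RTNH_F_DEAD is assumed to be 1; the other values appear in the hunk above. */
#define RTNH_F_DEAD       1u
#define RTNH_F_ONLINK     4u
#define RTNH_F_LINKDOWN   16u
#define RTNH_COMPARE_MASK (RTNH_F_DEAD | RTNH_F_LINKDOWN)

/*
 * Hypothetical helper: return 1 if two nexthop flag words agree on every
 * bit outside RTNH_COMPARE_MASK, i.e. they differ at most in dead/linkdown.
 */
static int nh_flags_match(unsigned int a, unsigned int b)
{
	return ((a ^ b) & ~RTNH_COMPARE_MASK) == 0;
}
```

Under this reading, a nexthop whose carrier just dropped still compares equal to its configured counterpart, while one that differs in a configuration bit such as RTNH_F_ONLINK does not.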

net/ipv4/devinet.c

Lines changed: 2 additions & 0 deletions
@@ -2169,6 +2169,8 @@ static struct devinet_sysctl_table {
 					"igmpv2_unsolicited_report_interval"),
 		DEVINET_SYSCTL_RW_ENTRY(IGMPV3_UNSOLICITED_REPORT_INTERVAL,
 					"igmpv3_unsolicited_report_interval"),
+		DEVINET_SYSCTL_RW_ENTRY(IGNORE_ROUTES_WITH_LINKDOWN,
+					"ignore_routes_with_linkdown"),
 
 		DEVINET_SYSCTL_FLUSHING_ENTRY(NOXFRM, "disable_xfrm"),
 		DEVINET_SYSCTL_FLUSHING_ENTRY(NOPOLICY, "disable_policy"),
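
Once registered, the entry shows up as net.ipv4.conf.<ifname>.ignore_routes_with_linkdown, which corresponds to a file under /proc/sys. A small user-space sketch for reading it is shown below, assuming a kernel that carries this series; `linkdown_sysctl_path` and `read_linkdown_sysctl` are hypothetical helper names, not an existing API.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build the /proc/sys path for the per-interface
 * sysctl. ifname may also be "all" or "default".
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int linkdown_sysctl_path(char *buf, size_t len, const char *ifname)
{
	int n = snprintf(buf, len,
			 "/proc/sys/net/ipv4/conf/%s/ignore_routes_with_linkdown",
			 ifname);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: read the current sysctl value for ifname.
 * Returns the value, or -1 if the file is missing or unreadable
 * (for example on a kernel without this feature).
 */
static int read_linkdown_sysctl(const char *ifname)
{
	char path[256];
	int value = -1;
	FILE *f;

	if (linkdown_sysctl_path(path, sizeof(path), ifname) != 0)
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}
```

Calling `read_linkdown_sysctl("all")` would be the programmatic counterpart of running `sysctl net.ipv4.conf.all.ignore_routes_with_linkdown`.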

net/ipv4/fib_frontend.c

Lines changed: 18 additions & 11 deletions
@@ -280,7 +280,7 @@ __be32 fib_compute_spec_dst(struct sk_buff *skb)
 		fl4.flowi4_tos = RT_TOS(ip_hdr(skb)->tos);
 		fl4.flowi4_scope = scope;
 		fl4.flowi4_mark = IN_DEV_SRC_VMARK(in_dev) ? skb->mark : 0;
-		if (!fib_lookup(net, &fl4, &res))
+		if (!fib_lookup(net, &fl4, &res, 0))
 			return FIB_RES_PREFSRC(net, res);
 	} else {
 		scope = RT_SCOPE_LINK;
@@ -319,7 +319,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	fl4.flowi4_mark = IN_DEV_SRC_VMARK(idev) ? skb->mark : 0;
 
 	net = dev_net(dev);
-	if (fib_lookup(net, &fl4, &res))
+	if (fib_lookup(net, &fl4, &res, 0))
 		goto last_resort;
 	if (res.type != RTN_UNICAST &&
 	    (res.type != RTN_LOCAL || !IN_DEV_ACCEPT_LOCAL(idev)))
@@ -354,7 +354,7 @@ static int __fib_validate_source(struct sk_buff *skb, __be32 src, __be32 dst,
 	fl4.flowi4_oif = dev->ifindex;
 
 	ret = 0;
-	if (fib_lookup(net, &fl4, &res) == 0) {
+	if (fib_lookup(net, &fl4, &res, FIB_LOOKUP_IGNORE_LINKSTATE) == 0) {
 		if (res.type == RTN_UNICAST)
 			ret = FIB_RES_NH(res).nh_scope >= RT_SCOPE_HOST;
 	}
@@ -1063,9 +1063,9 @@ static void nl_fib_lookup_exit(struct net *net)
 	net->ipv4.fibnl = NULL;
 }
 
-static void fib_disable_ip(struct net_device *dev, int force)
+static void fib_disable_ip(struct net_device *dev, unsigned long event)
 {
-	if (fib_sync_down_dev(dev, force))
+	if (fib_sync_down_dev(dev, event))
 		fib_flush(dev_net(dev));
 	rt_cache_flush(dev_net(dev));
 	arp_ifdown(dev);
@@ -1081,7 +1081,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 	case NETDEV_UP:
 		fib_add_ifaddr(ifa);
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-		fib_sync_up(dev);
+		fib_sync_up(dev, RTNH_F_DEAD);
 #endif
 		atomic_inc(&net->ipv4.dev_addr_genid);
 		rt_cache_flush(dev_net(dev));
@@ -1093,7 +1093,7 @@ static int fib_inetaddr_event(struct notifier_block *this, unsigned long event,
 			/* Last address was deleted from this interface.
 			 * Disable IP.
 			 */
-			fib_disable_ip(dev, 1);
+			fib_disable_ip(dev, event);
 		} else {
 			rt_cache_flush(dev_net(dev));
 		}
@@ -1107,9 +1107,10 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
 	struct in_device *in_dev;
 	struct net *net = dev_net(dev);
+	unsigned int flags;
 
 	if (event == NETDEV_UNREGISTER) {
-		fib_disable_ip(dev, 2);
+		fib_disable_ip(dev, event);
 		rt_flush_dev(dev);
 		return NOTIFY_DONE;
 	}
@@ -1124,16 +1125,22 @@ static int fib_netdev_event(struct notifier_block *this, unsigned long event, vo
 			fib_add_ifaddr(ifa);
 		} endfor_ifa(in_dev);
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
-		fib_sync_up(dev);
+		fib_sync_up(dev, RTNH_F_DEAD);
 #endif
 		atomic_inc(&net->ipv4.dev_addr_genid);
 		rt_cache_flush(net);
 		break;
 	case NETDEV_DOWN:
-		fib_disable_ip(dev, 0);
+		fib_disable_ip(dev, event);
 		break;
-	case NETDEV_CHANGEMTU:
 	case NETDEV_CHANGE:
+		flags = dev_get_flags(dev);
+		if (flags & (IFF_RUNNING | IFF_LOWER_UP))
+			fib_sync_up(dev, RTNH_F_LINKDOWN);
+		else
+			fib_sync_down_dev(dev, event);
+		/* fall through */
+	case NETDEV_CHANGEMTU:
 		rt_cache_flush(net);
 		break;
 	}
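
The new NETDEV_CHANGE branch can be read as a small state update on each nexthop that uses the device. The sketch below models that in user space under stated assumptions: the bodies of `fib_sync_up()` and `fib_sync_down_dev()` are not shown in this excerpt, so the model assumes they respectively clear and set RTNH_F_LINKDOWN. `apply_carrier_change` and the `MODEL_IFF_*` constants are hypothetical names; the constant values mirror IFF_RUNNING and IFF_LOWER_UP.

```c
/* Local stand-ins so the sketch needs no kernel headers. */
#define MODEL_IFF_RUNNING  0x40u     /* mirrors IFF_RUNNING */
#define MODEL_IFF_LOWER_UP 0x10000u  /* mirrors IFF_LOWER_UP */
#define RTNH_F_LINKDOWN    16u

/*
 * Hypothetical model of the NETDEV_CHANGE branch: given a nexthop's flags
 * and the device flags, return the nexthop flags after the event.
 */
static unsigned int apply_carrier_change(unsigned int nh_flags,
					 unsigned int dev_flags)
{
	/* Same test as the diff: either bit set counts as "link is up". */
	if (dev_flags & (MODEL_IFF_RUNNING | MODEL_IFF_LOWER_UP))
		return nh_flags & ~RTNH_F_LINKDOWN;  /* assumed effect of fib_sync_up() */

	return nh_flags | RTNH_F_LINKDOWN;           /* assumed effect of fib_sync_down_dev() */
}
```

Other flag bits pass through unchanged in this model, which is consistent with the route staying in the table while its link is down.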

net/ipv4/fib_rules.c

Lines changed: 3 additions & 2 deletions
@@ -47,11 +47,12 @@ struct fib4_rule {
 #endif
 };
 
-int __fib_lookup(struct net *net, struct flowi4 *flp, struct fib_result *res)
+int __fib_lookup(struct net *net, struct flowi4 *flp,
+		 struct fib_result *res, unsigned int flags)
 {
 	struct fib_lookup_arg arg = {
 		.result = res,
-		.flags = FIB_LOOKUP_NOREF,
+		.flags = flags,
 	};
 	int err;
 
