Route leak between a VRF and default VRF points to dev lo and packets are time exceeded in-transit #15909

Closed
2 tasks done
EasyNetDev opened this issue May 2, 2024 · 20 comments · Fixed by #16044
Labels
triage Needs further investigation

Comments

@EasyNetDev
Contributor

EasyNetDev commented May 2, 2024

Description

A simple local BGP route leak (via VPN import/export) between a VRF and the default VRF (GRT) results in ICMP time exceeded in-transit, because all routes leaked from the GRT into the VRF point to dev lo.

Version

FRRouting 10.1-dev (R05) on Linux(6.6.15-amd64).

How to reproduce

Configure BGP route import between the GRT and a VRF; the result looks something like this:

R05# show ip route vrf default
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure
..
B>* 172.17.64.0/24 [20/0] is directly connected, mgmt (vrf mgmt), weight 1, 19:42:56
..
C>* 192.168.163.0/24 is directly connected, lan0.100, 19:43:03
L * 192.168.163.1/32 is directly connected, v0860fc3b-64-4, 19:42:58
C>* 192.168.163.1/32 is directly connected, v0860fc3b-64-4, 19:42:58

And in mgmt:

R05# show ip route vrf mgmt
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF mgmt:
..
C>* 172.17.64.0/24 is directly connected, lan0.17, 19:43:34
L>* 172.17.64.1/32 is directly connected, lan0.17, 19:43:34
B>* 192.168.163.0/24 [20/0] is directly connected, lo (vrf default), weight 1, 19:43:27
B>* 192.168.163.1/32 [20/0] is directly connected, lo (vrf default), weight 1, 19:43:27

Expected behavior

There should be connectivity between the VRF and the default VRF (GRT), not an L3 loop.

Actual behavior

A Layer 3 loop appears on dev lo.

A simple ping from the default network to mgmt leads to this:

18:21:49.651970 gi0-2 P   IP (tos 0x0, ttl 128, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    192.168.163.200 > 172.17.64.3: ICMP echo request, id 1, seq 497, length 40
18:21:49.651970 lan0  P   IP (tos 0x0, ttl 128, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    192.168.163.200 > 172.17.64.3: ICMP echo request, id 1, seq 497, length 40
18:21:49.651970 lan0.100 P   IP (tos 0x0, ttl 128, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    192.168.163.200 > 172.17.64.3: ICMP echo request, id 1, seq 497, length 40
18:21:49.651970 v0860fc3b-64-4 In  IP (tos 0x0, ttl 128, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    192.168.163.200 > 172.17.64.3: ICMP echo request, id 1, seq 497, length 40
18:21:49.652148 mgmt  Out IP (tos 0x0, ttl 127, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    192.168.163.200 > 172.17.64.3: ICMP echo request, id 1, seq 497, length 40
18:21:49.652186 lan0.17 Out IP (tos 0x0, ttl 127, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    192.168.163.200 > 172.17.64.3: ICMP echo request, id 1, seq 497, length 40
18:21:49.652200 lan0  Out IP (tos 0x0, ttl 127, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    192.168.163.200 > 172.17.64.3: ICMP echo request, id 1, seq 497, length 40
18:21:49.652227 gi0-2 Out IP (tos 0x0, ttl 127, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    192.168.163.200 > 172.17.64.3: ICMP echo request, id 1, seq 497, length 40
18:21:49.654207 gi0-2 P   IP (tos 0x0, ttl 255, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.654207 lan0  In  IP (tos 0x0, ttl 255, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.654207 lan0.17 In  IP (tos 0x0, ttl 255, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.654386 lo    In  IP (tos 0x0, ttl 254, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.654443 lo    In  IP (tos 0x0, ttl 253, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.654487 lo    In  IP (tos 0x0, ttl 252, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.654528 lo    In  IP (tos 0x0, ttl 251, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.654567 lo    In  IP (tos 0x0, ttl 250, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
..
18:21:49.664103 lo    In  IP (tos 0x0, ttl 3, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.664141 lo    In  IP (tos 0x0, ttl 2, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.664177 lo    In  IP (tos 0x0, ttl 1, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.664297 mgmt  Out IP (tos 0xc0, ttl 64, id 21992, offset 0, flags [none], proto ICMP (1), length 88)
    172.17.64.1 > 172.17.64.3: ICMP time exceeded in-transit, length 68
        IP (tos 0x0, ttl 1, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.664332 lan0.17 Out IP (tos 0xc0, ttl 64, id 21992, offset 0, flags [none], proto ICMP (1), length 88)
    172.17.64.1 > 172.17.64.3: ICMP time exceeded in-transit, length 68
        IP (tos 0x0, ttl 1, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.664344 lan0  Out IP (tos 0xc0, ttl 64, id 21992, offset 0, flags [none], proto ICMP (1), length 88)
    172.17.64.1 > 172.17.64.3: ICMP time exceeded in-transit, length 68
        IP (tos 0x0, ttl 1, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
18:21:49.664370 gi0-2 Out IP (tos 0xc0, ttl 64, id 21992, offset 0, flags [none], proto ICMP (1), length 88)
    172.17.64.1 > 172.17.64.3: ICMP time exceeded in-transit, length 68
        IP (tos 0x0, ttl 1, id 40334, offset 0, flags [none], proto ICMP (1), length 60)
    172.17.64.3 > 192.168.163.200: ICMP echo reply, id 1, seq 497, length 40
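
To quantify the loop in a capture like the one above, one option is to count how often the same IP id is seen on lo. This is only a minimal sketch: it assumes the text was produced by something like `tcpdump -i any -n -v` (the exact capture command is not stated in this report), and `count_lo_hops` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: reads verbose tcpdump text on stdin and prints how many
# packet header lines on interface "lo" carry the IP id given as $1.
# Assumes the "-i any" layout seen above: timestamp, interface, direction, ...
count_lo_hops() {
    awk -v want="$1" '
        $2 == "lo" {
            # Header lines contain "... id <number>, ..."; match the id field.
            for (i = 1; i < NF; i++)
                if ($i == "id" && $(i + 1) == (want ","))
                    n++
        }
        END { print n + 0 }
    '
}
```

With a capture saved to a file, `count_lo_hops 40334 < capture.txt` would print the number of passes through lo for that packet; a healthy forwarding path should show at most a handful, not hundreds.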

Additional context

No response

Checklist

  • I have searched the open issues for this bug.
  • I have not included sensitive information in this report.
@EasyNetDev EasyNetDev added the triage Needs further investigation label May 2, 2024
@EasyNetDev
Contributor Author

EasyNetDev commented May 16, 2024

After a bit of research I found the issue.
On the systems where I noticed this behavior, I have a script that does two things, leaving the rules like this:

1000:   from all lookup [l3mdev-table]
2000:   from all lookup [l3mdev-table] unreachable
32765:  from all lookup local
32766:  from all lookup main
32767:  from all lookup default

So, on a normal Linux system you will have:

0:  from all lookup local
1000:   from all lookup [l3mdev-table]
2000:   from all lookup [l3mdev-table] unreachable
32766:  from all lookup main
32767:  from all lookup default

Here the lookup for local comes first. Before kernel 6.6, or perhaps even earlier (I don't know exactly when), we were in theory supposed to make this change.
There is a pretty nice description here:
https://stbuehler.de/blog/article/2020/02/29/using_vrf__virtual_routing_and_forwarding__on_linux.html

I still find this VRF preparation in the Linux kernel sources:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/testing/selftests/net/forwarding/lib.sh
Look for the functions vrf_prepare() and vrf_cleanup():

vrf_prepare()
{
	ip -4 rule add pref 32765 table local
	ip -4 rule del pref 0
	ip -6 rule add pref 32765 table local
	ip -6 rule del pref 0
}

vrf_cleanup()
{
	ip -6 rule add pref 0 table local
	ip -6 rule del pref 32765
	ip -4 rule add pref 0 table local
	ip -4 rule del pref 32765
}

If you check the kernel sources, you will still find many calls to these functions in the tests.

But it seems that this problem was fixed as of some kernel version, and if you move the lookup for local after the VRF rule you end up in this situation.

Does anybody know why we should or should not use this trick with the kernel rules?
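
To tell quickly which of the two rule orderings a system is using, the `ip rule` output can be parsed. The following is a minimal sketch, not part of the original script; `local_rule_position` is a hypothetical helper name, and it only looks at the first `local` rule and the first `[l3mdev-table]` rule.

```shell
#!/bin/sh
# Hypothetical helper: reads `ip rule show` output on stdin and reports whether
# the "local" table rule is evaluated before or after the l3mdev (VRF) rule.
local_rule_position() {
    awk '
        # Field 1 is the preference followed by a colon, e.g. "1000:".
        /lookup local/ && !have_local {
            sub(/:/, "", $1); local_pref = $1 + 0; have_local = 1
        }
        /lookup \[l3mdev-table\]/ && !have_l3 {
            sub(/:/, "", $1); l3_pref = $1 + 0; have_l3 = 1
        }
        END {
            if (!have_local || !have_l3) { print "unknown"; exit }
            if (local_pref < l3_pref) print "local-before-l3mdev"
            else print "local-after-l3mdev"
        }
    '
}
```

On a live system this would be used as `ip rule show | local_rule_position`; the default layout gives `local-before-l3mdev`, while the layout produced by vrf_prepare() gives `local-after-l3mdev`.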

@taspelund

The reason you'd want the local rule after the l3mdev rule is so that routed packets ingressing a VRF slave won't be terminated if they target an address assigned only to the default VRF.

If you do ip route show table <vrf-table> you'll see that a VRF collapses both normal and local routes into a single table, while the default VRF splits them across tables main and local. The idea behind the move is to consolidate the local/main lookups and to avoid unintentional leaking of local routes from the default VRF into other VRFs.

I'm not sure I understand why a loop is occurring just from this output, but hopefully that gives a little context around the ip rule behavior.

Maybe it would be helpful to see the output of ip route show, ip route show table local, and ip route show table <mgmt table id> to better understand what the routes look like in the kernel

@EasyNetDev
Contributor Author

Hi @taspelund ,

Thanks for the explanation. What you explain makes sense, indeed. Now I have a better idea of why we need to move rule 0 after the l3mdev rule.
Let me build a small lab, moving the rules the way I have them on my router, and test with smaller routing tables (R01 and R02 have full IPv4 tables).

@EasyNetDev
Contributor Author

EasyNetDev commented May 17, 2024

I created a testlab:

R01:

root@FRR01:/# uname -a
Linux FRR01 6.7.12-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.7.12-1 (2024-04-24) x86_64 GNU/Linux

FRRouting config:

frr version 10.1-dev
frr defaults traditional
hostname FRR01
service integrated-vtysh-config
!
vrf red
 ip route 0.0.0.0/0 blackhole
 ipv6 route ::/0 blackhole
exit-vrf
!
vrf green
 ip route 0.0.0.0/0 192.168.1.1
exit-vrf
!
vrf blue
exit-vrf
!
interface ens33
 description MGMT
 ip address 192.168.1.200/24
exit
!
interface ens37
 description to C01 / ens36
 ip address 10.0.0.1/30
exit
!
interface ens38
 description to C02 / ens36
 ip address 10.0.1.1/30
exit
!
interface lo
 ip address 10.100.0.1/32
exit
!
router bgp 65001
 bgp router-id 10.100.0.1
 !
 address-family ipv4 unicast
  redistribute connected route-map rm-GRT-C-export
  label vpn export auto
  rd vpn export 65001:10001
  rt vpn import 65001:10000 65001:10999
  rt vpn export 65001:10000
  export vpn
  import vpn
 exit-address-family
exit
!
router bgp 65001 vrf red
 bgp router-id 10.100.0.1
 !
 address-family ipv4 unicast
  redistribute connected route-map rm-RED-C-export
  label vpn export auto
  rd vpn export 65001:11001
  rt vpn import 65001:11000 65001:11999
  rt vpn export 65001:11000
  export vpn
  import vpn
 exit-address-family
exit
!
route-map rm-GRT-C-export permit 1000
 description Exported connected from GRT must be imported in RED
 set extcommunity rt 65001:11999
exit
!
route-map rm-RED-C-export permit 1000
 description Exported connected from RED must be imported in GRT
 set extcommunity rt 65001:10999
exit
!
segment-routing
 traffic-eng
 exit
exit
!
end

Output on R01:

FRR01# show vrf
vrf blue id 32 table 4000
vrf green id 36 table 5000
vrf red id 31 table 3000 (configured)

FRR01# show ip forwarding
IP forwarding is on

FRR01# show bgp vrf default ipv4 10.0.0.0/30
BGP routing table entry for 10.0.0.0/30, version 17
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Local
    0.0.0.0 from 0.0.0.0 (10.100.0.1)
      Origin incomplete, metric 0, weight 32768, valid, sourced, best (First path received)
      Extended Community: RT:65001:11999
      Last update: Fri May 17 12:01:57 2024
FRR01# show bgp vrf default ipv4 10.0.1.0/30
BGP routing table entry for 10.0.1.0/30, version 19
Paths: (1 available, best #1, table default)
  Not advertised to any peer
  Imported from 65001:11001:10.0.1.0/30
  Local
    0.0.0.0 from 0.0.0.0 (10.100.0.1) vrf red(31) announce-nh-self
      Origin incomplete, metric 0, weight 32768, valid, sourced, local, best (First path received)
      Extended Community: RT:65001:10999 RT:65001:11000
      Originator: 10.100.0.1
      Last update: Fri May 17 12:02:02 2024
FRR01# show bgp vrf red ipv4 10.0.0.0/30
BGP routing table entry for 10.0.0.0/30, version 21
Paths: (1 available, best #1, vrf red)
  Not advertised to any peer
  Imported from 65001:10001:10.0.0.0/30
  Local
    0.0.0.0 from 0.0.0.0 (10.100.0.1) vrf default(0) announce-nh-self
      Origin incomplete, metric 0, weight 32768, valid, sourced, local, best (First path received)
      Extended Community: RT:65001:11999 RT:65001:10000
      Originator: 10.100.0.1
      Last update: Fri May 17 12:01:57 2024
FRR01# show bgp vrf red ipv4 10.0.1.0/30
BGP routing table entry for 10.0.1.0/30, version 23
Paths: (1 available, best #1, vrf red)
  Not advertised to any peer
  Local
    0.0.0.0 from 0.0.0.0 (10.100.0.1)
      Origin incomplete, metric 0, weight 32768, valid, sourced, best (First path received)
      Extended Community: RT:65001:10999
      Last update: Fri May 17 12:02:03 2024


FRR01# show ip route
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

C>* 10.0.0.0/30 is directly connected, ens37, 04:12:34
L>* 10.0.0.1/32 is directly connected, ens37, 04:12:34
B>* 10.0.1.0/30 [20/0] is directly connected, red (vrf red), weight 1, 00:03:12
L * 10.100.0.1/32 is directly connected, lo, 00:08:16
C>* 10.100.0.1/32 is directly connected, lo, 00:08:16

FRR01# show ip forwarding
IP forwarding is on
FRR01# show ip route vrf red
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF red:
S>* 0.0.0.0/0 [1/0] unreachable (blackhole), weight 1, 03:36:01
B>* 10.0.0.0/30 [20/0] is directly connected, lo (vrf default), weight 1, 00:03:49
C>* 10.0.1.0/30 is directly connected, ens38, 04:08:07
L>* 10.0.1.1/32 is directly connected, ens38, 04:08:07
B>* 10.100.0.1/32 [20/0] is directly connected, lo (vrf default), weight 1, 00:03:49

So, as far as the route leak itself goes, everything looks fine.
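
The leak works because each side's exported route targets overlap with the other side's import list: the GRT route carries RT 65001:11999, which red imports, and the red route carries RT 65001:10999, which the GRT imports. A small sketch of that matching logic follows; `rt_intersection` is a hypothetical helper for illustration only and is not how bgpd implements the check.

```shell
#!/bin/sh
# Hypothetical helper: prints every route target that appears both in the
# exported set ($1) and in the import list ($2). Both arguments are
# space-separated lists such as "65001:11999 65001:10000".
rt_intersection() {
    for rt in $1; do
        for want in $2; do
            if [ "$rt" = "$want" ]; then
                echo "$rt"
            fi
        done
    done
    return 0
}
```

For the values shown above, `rt_intersection "65001:11999 65001:10000" "65001:11000 65001:11999"` prints `65001:11999`, so the GRT route qualifies for import into red; an empty result would mean the route is not imported.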

R01 nexthop tables:

root@FRR01:/# ip nexthop show
id 2 dev lo0 scope link proto zebra
id 4 dev lo1 scope link proto zebra
id 6 dev lo2 scope link proto zebra
id 8 dev lo3 scope link proto zebra
id 10 dev ens36 scope host proto zebra
id 17 dev ens37 scope host proto zebra
id 18 dev ens38 scope host proto zebra
id 19 dev lo scope host proto zebra
id 20 dev ens33 scope host proto zebra
id 21 blackhole proto zebra
id 22 blackhole proto zebra
id 24 via 192.168.1.1 dev ens33 scope link proto zebra
id 32 dev ens33 scope link proto zebra
id 34 dev lo scope host proto zebra
id 36 dev red scope host proto zebra

root@FRR01:/# ip nexthop show vrf red
id 18 dev ens38 scope host proto zebra
id 25 dev ens38 scope link proto zebra

R01: routing tables:

root@FRR01:/# ip route list
10.0.0.0/30 dev ens37 proto kernel scope link src 10.0.0.1
10.0.1.0/30 nhid 36 dev red proto bgp metric 20

root@FRR01:/# ip route show table local
local 10.0.0.1 dev ens37 proto kernel scope host src 10.0.0.1
broadcast 10.0.0.3 dev ens37 proto kernel scope link src 10.0.0.1
local 10.100.0.1 dev lo proto kernel scope host src 10.100.0.1
broadcast 10.100.0.1 dev lo proto kernel scope link src 10.100.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1

root@FRR01:/# ip route show vrf red
blackhole default proto static metric 20
10.0.0.0/30 nhid 34 dev lo proto bgp metric 20
10.0.1.0/30 dev ens38 proto kernel scope link src 10.0.1.1
10.100.0.1 nhid 34 dev lo proto bgp metric 20

root@FRR01:/# ip route show table red
blackhole default proto static metric 20
10.0.0.0/30 nhid 34 dev lo proto bgp metric 20
10.0.1.0/30 dev ens38 proto kernel scope link src 10.0.1.1
local 10.0.1.1 dev ens38 proto kernel scope host src 10.0.1.1
broadcast 10.0.1.3 dev ens38 proto kernel scope link src 10.0.1.1
10.100.0.1 nhid 34 dev lo proto bgp metric 20

root@FRR01:/# ip ru l
0:      from all lookup local
1000:   from all lookup [l3mdev-table]
32766:  from all lookup main
32767:  from all lookup default

C02:

root@Client02:/# ip r l
default nhid 8 via 10.0.1.1 dev ens36 proto static metric 20
10.0.1.0/30 dev ens36 proto kernel scope link src 10.0.1.2

root@Client02:/# ip next show
id 3 dev ens33 scope link proto zebra
id 4 dev ens36 scope link proto zebra
id 6 dev ens36 scope host proto zebra
id 8 via 10.0.1.1 dev ens36 scope link proto zebra

C01:

root@Client01:/# ip r l
default nhid 8 via 10.0.0.1 dev ens36 proto static metric 20
10.0.0.0/30 dev ens36 proto kernel scope link src 10.0.0.2
root@Client01:/# ip next l
id 3 dev ens33 scope link proto zebra
id 4 dev ens36 scope link proto zebra
id 6 dev ens36 scope host proto zebra
id 8 via 10.0.0.1 dev ens36 scope link proto zebra

But I can't ping the endpoints:

12:26:27.928748 ens37 In  IP (tos 0x0, ttl 64, id 26402, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.2 > 10.0.1.2: ICMP echo request, id 47463, seq 1, length 64
12:26:27.928784 red   Out IP (tos 0x0, ttl 63, id 26402, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.2 > 10.0.1.2: ICMP echo request, id 47463, seq 1, length 64
12:26:27.928797 ens38 Out IP (tos 0x0, ttl 63, id 26402, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.2 > 10.0.1.2: ICMP echo request, id 47463, seq 1, length 64
12:26:27.929984 ens38 In  IP (tos 0x0, ttl 64, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930035 lo    In  IP (tos 0x0, ttl 63, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930050 lo    In  IP (tos 0x0, ttl 62, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930065 lo    In  IP (tos 0x0, ttl 61, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930079 lo    In  IP (tos 0x0, ttl 60, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930092 lo    In  IP (tos 0x0, ttl 59, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930106 lo    In  IP (tos 0x0, ttl 58, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930120 lo    In  IP (tos 0x0, ttl 57, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930134 lo    In  IP (tos 0x0, ttl 56, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930147 lo    In  IP (tos 0x0, ttl 55, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930161 lo    In  IP (tos 0x0, ttl 54, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930175 lo    In  IP (tos 0x0, ttl 53, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930189 lo    In  IP (tos 0x0, ttl 52, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930202 lo    In  IP (tos 0x0, ttl 51, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930216 lo    In  IP (tos 0x0, ttl 50, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930230 lo    In  IP (tos 0x0, ttl 49, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930245 lo    In  IP (tos 0x0, ttl 48, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930259 lo    In  IP (tos 0x0, ttl 47, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930274 lo    In  IP (tos 0x0, ttl 46, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930288 lo    In  IP (tos 0x0, ttl 45, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930302 lo    In  IP (tos 0x0, ttl 44, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930317 lo    In  IP (tos 0x0, ttl 43, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930331 lo    In  IP (tos 0x0, ttl 42, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930345 lo    In  IP (tos 0x0, ttl 41, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930359 lo    In  IP (tos 0x0, ttl 40, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930373 lo    In  IP (tos 0x0, ttl 39, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930388 lo    In  IP (tos 0x0, ttl 38, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930403 lo    In  IP (tos 0x0, ttl 37, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930418 lo    In  IP (tos 0x0, ttl 36, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930434 lo    In  IP (tos 0x0, ttl 35, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930613 lo    In  IP (tos 0x0, ttl 34, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930629 lo    In  IP (tos 0x0, ttl 33, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930643 lo    In  IP (tos 0x0, ttl 32, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930657 lo    In  IP (tos 0x0, ttl 31, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930671 lo    In  IP (tos 0x0, ttl 30, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930684 lo    In  IP (tos 0x0, ttl 29, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930697 lo    In  IP (tos 0x0, ttl 28, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930710 lo    In  IP (tos 0x0, ttl 27, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930725 lo    In  IP (tos 0x0, ttl 26, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930738 lo    In  IP (tos 0x0, ttl 25, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930752 lo    In  IP (tos 0x0, ttl 24, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930765 lo    In  IP (tos 0x0, ttl 23, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930779 lo    In  IP (tos 0x0, ttl 22, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930793 lo    In  IP (tos 0x0, ttl 21, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930807 lo    In  IP (tos 0x0, ttl 20, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930820 lo    In  IP (tos 0x0, ttl 19, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930834 lo    In  IP (tos 0x0, ttl 18, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930847 lo    In  IP (tos 0x0, ttl 17, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930861 lo    In  IP (tos 0x0, ttl 16, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930875 lo    In  IP (tos 0x0, ttl 15, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930888 lo    In  IP (tos 0x0, ttl 14, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930901 lo    In  IP (tos 0x0, ttl 13, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930915 lo    In  IP (tos 0x0, ttl 12, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930928 lo    In  IP (tos 0x0, ttl 11, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930941 lo    In  IP (tos 0x0, ttl 10, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930955 lo    In  IP (tos 0x0, ttl 9, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930969 lo    In  IP (tos 0x0, ttl 8, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930983 lo    In  IP (tos 0x0, ttl 7, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.930996 lo    In  IP (tos 0x0, ttl 6, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.931010 lo    In  IP (tos 0x0, ttl 5, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.931024 lo    In  IP (tos 0x0, ttl 4, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.931038 lo    In  IP (tos 0x0, ttl 3, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.931052 lo    In  IP (tos 0x0, ttl 2, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.931066 lo    In  IP (tos 0x0, ttl 1, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.931101 red   Out IP (tos 0xc0, ttl 64, id 7029, offset 0, flags [none], proto ICMP (1), length 112)
    10.0.1.1 > 10.0.1.2: ICMP time exceeded in-transit, length 92
        IP (tos 0x0, ttl 1, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64
12:26:27.931112 ens38 Out IP (tos 0xc0, ttl 64, id 7029, offset 0, flags [none], proto ICMP (1), length 112)
    10.0.1.1 > 10.0.1.2: ICMP time exceeded in-transit, length 92
        IP (tos 0x0, ttl 1, id 24964, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 47463, seq 1, length 64

C02 is receiving from R01:

12:28:02.498584 IP (tos 0x0, ttl 63, id 37569, offset 0, flags [DF], proto ICMP (1), length 84)
    10.0.0.2 > 10.0.1.2: ICMP echo request, id 26026, seq 1, length 64
12:28:02.498612 IP (tos 0x0, ttl 64, id 46570, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 26026, seq 1, length 64
12:28:02.499278 IP (tos 0xc0, ttl 64, id 13366, offset 0, flags [none], proto ICMP (1), length 112)
    10.0.1.1 > 10.0.1.2: ICMP time exceeded in-transit, length 92
        IP (tos 0x0, ttl 1, id 46570, offset 0, flags [none], proto ICMP (1), length 84)
    10.0.1.2 > 10.0.0.2: ICMP echo reply, id 26026, seq 1, length 64

This behavior is extremely weird when I'm trying to route-leak between the GRT and VRFs.

If I move rule 0 after the l3mdev rule, I can no longer ping the R01 loopback addresses from the GRT. It runs into the same issue.

@EasyNetDev
Contributor Author

EasyNetDev commented May 17, 2024

As far as I remember, previous versions of FRRouting installed leaked routes with the source interface as the nexthop, like this:
VRF default:

B>* 10.0.1.0/30 [20/0] is directly connected, ens38 (vrf red), weight 1, 00:03:12

instead of:

B>* 10.0.1.0/30 [20/0] is directly connected, red (vrf red), weight 1, 00:21:36

VRF red:

B>* 10.0.0.0/30 [20/0] is directly connected, ens37 (vrf default), weight 1, 00:03:49

instead of:

B>* 10.0.0.0/30 [20/0] is directly connected, lo (vrf default), weight 1, 00:21:01

As FRR shows in show ip route, all traffic towards the default VRF goes via interface lo in the default VRF, not via the interface where the prefix is attached.
With this routing in place, the packet goes to the lo interface and ends up in an L3 loop.
I also tested with kernel 6.1.90-1.
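
The loop described here can be illustrated with a toy model. This is only a sketch under the assumption discussed in this thread, namely that a packet forwarded to dev lo is received again on lo and forwarded again with the same decision. The function name `forward` and its return values are made up for illustration and are not kernel or FRR APIs.

```python
def forward(ttl, out_dev):
    """Toy model of the router forwarding one packet.

    Assumption (taken from this thread, not verified against kernel code):
    a packet sent to dev lo is received again on lo and forwarded again
    toward the same nexthop device.
    """
    seen_on_lo = []  # TTL values a capture on lo would show
    while True:
        if ttl <= 1:
            # A router cannot forward a packet whose TTL would reach 0.
            return "time exceeded", seen_on_lo
        ttl -= 1
        if out_dev != "lo":
            return "sent out " + out_dev, seen_on_lo
        seen_on_lo.append(ttl)


# Leaked route with nexthop dev lo (the 10.x behaviour reported here).
result, ttls = forward(64, "lo")
print(result, "after", len(ttls), "passes through lo")

# Leaked route with the real interface as nexthop (the older behaviour).
print(forward(64, "ens37")[0])
```

In this model the last TTL values seen on lo are 5, 4, 3, 2, 1, followed by a time exceeded result, which is the same pattern as in the capture above.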

In the past, when the VRF feature was introduced in Linux, I remember trying to route via the VRF interface and ending up in similar situations.

I don't know whether this is just how the show command formats its output or whether traffic is really forwarded this way.

@EasyNetDev
Contributor Author

Yes, as I suspected. After downgrading to version 9.0.2 stable, the ping works and I get this:

FRR01# show version
FRRouting 9.0.2 (FRR01) on Linux(6.7.12-amd64).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--localstatedir=/var/run/frr' '--sbindir=/usr/lib/frr' '--sysconfdir=/etc/frr' '--with-vtysh-pager=/usr/bin/pager' '--libdir=/usr/lib/x86_64-linux-gnu/frr' '--with-moduledir=/usr/lib/x86_64-linux-gnu/frr/modules' '--disable-dependency-tracking' '--enable-rpki' '--disable-scripting' '--enable-pim6d' '--with-libpam' '--enable-doc' '--enable-doc-html' '--enable-snmp' '--enable-fpm' '--disable-protobuf' '--disable-zeromq' '--enable-ospfapi' '--enable-bgp-vnc' '--enable-multipath=256' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' 'build_alias=x86_64-linux-gnu' 'PYTHON=python3'

FRR01# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

C>* 10.0.0.0/30 is directly connected, ens37, 00:00:11
B>* 10.0.1.0/30 [20/0] is directly connected, ens38 (vrf red), weight 1, 00:00:05
C>* 10.100.0.1/32 is directly connected, lo, 00:00:11

FRR01# show ip route vrf red
Codes: K - kernel route, C - connected, S - static, R - RIP,
       O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF red:
S>* 0.0.0.0/0 [1/0] unreachable (blackhole), weight 1, 00:01:19
B>* 10.0.0.0/30 [20/0] is directly connected, ens37 (vrf default), weight 1, 00:01:14
C>* 10.0.1.0/30 is directly connected, ens38, 00:01:20
B>* 10.100.0.1/32 [20/0] is directly connected, lo (vrf default), weight 1, 00:01:14

All leaked routes point to the real interface in the source VRF and not to the VRF interface itself.

root@FRR01:/opt/Kitts/frr/9.0.2# ip nexthop show
id 14 dev lo scope host proto zebra
id 15 dev ens33 scope host proto zebra
id 16 dev ens36 scope host proto zebra
id 17 dev ens37 scope host proto zebra
id 18 dev ens38 scope host proto zebra
id 19 dev ens33 scope link proto zebra
id 21 dev ens36 scope link proto zebra
id 23 dev ens37 scope link proto zebra
id 25 dev ens38 scope link proto zebra
id 26 dev lo3 scope link proto zebra
id 30 blackhole proto zebra
id 31 blackhole proto zebra
id 32 via 192.168.1.1 dev ens33 scope link proto zebra
id 36 dev ens37 scope host proto zebra
id 37 dev lo scope host proto zebra
id 38 dev ens38 scope host proto zebra

root@FRR01:/opt/Kitts/frr/9.0.2# ip nexthop show vrf red
id 18 dev ens38 scope host proto zebra
id 25 dev ens38 scope link proto zebra
id 38 dev ens38 scope host proto zebra

root@FRR01:/opt/Kitts/frr/9.0.2# ip route list
10.0.0.0/30 dev ens37 proto kernel scope link src 10.0.0.1
10.0.1.0/30 nhid 38 dev ens38 proto bgp metric 20

root@FRR01:/opt/Kitts/frr/9.0.2# ip route show table local
local 10.0.0.1 dev ens37 proto kernel scope host src 10.0.0.1
broadcast 10.0.0.3 dev ens37 proto kernel scope link src 10.0.0.1
local 10.100.0.1 dev lo proto kernel scope host src 10.100.0.1
broadcast 10.100.0.1 dev lo proto kernel scope link src 10.100.0.1
local 127.0.0.0/8 dev lo proto kernel scope host src 127.0.0.1
local 127.0.0.1 dev lo proto kernel scope host src 127.0.0.1
broadcast 127.255.255.255 dev lo proto kernel scope link src 127.0.0.1

root@FRR01:/opt/Kitts/frr/9.0.2# ip route show vrf red
blackhole default proto static metric 20
10.0.0.0/30 nhid 36 dev ens37 proto bgp metric 20
10.0.1.0/30 dev ens38 proto kernel scope link src 10.0.1.1
10.100.0.1 nhid 37 dev lo proto bgp metric 20

root@FRR01:/opt/Kitts/frr/9.0.2# ip route show table red
blackhole default proto static metric 20
10.0.0.0/30 nhid 36 dev ens37 proto bgp metric 20
10.0.1.0/30 dev ens38 proto kernel scope link src 10.0.1.1
local 10.0.1.1 dev ens38 proto kernel scope host src 10.0.1.1
broadcast 10.0.1.3 dev ens38 proto kernel scope link src 10.0.1.1
10.100.0.1 nhid 37 dev lo proto bgp metric 20

root@FRR01:/opt/Kitts/frr/9.0.2# ip route show vrf red
blackhole default proto static metric 20
10.0.0.0/30 nhid 36 dev ens37 proto bgp metric 20
10.0.1.0/30 dev ens38 proto kernel scope link src 10.0.1.1
10.100.0.1 nhid 37 dev lo proto bgp metric 20

root@FRR01:/opt/Kitts/frr/9.0.2# ip rule list
0:      from all lookup local
1000:   from all lookup [l3mdev-table]
32766:  from all lookup main
32767:  from all lookup default
root@FRR01:/opt/Kitts/frr/9.0.2#

@EasyNetDev
Contributor Author

EasyNetDev commented May 17, 2024

I've also tested with 10.0 stable and I have the same issue.
Could this be a regression in 10.x?

@EasyNetDev
Contributor Author

EasyNetDev commented May 17, 2024

I've tested 9.1 and it still works fine. I presume that something changed between 9.1 and 10.0 and introduced a regression.

@taspelund

taspelund commented May 17, 2024

I suspect the issue is with this route using dev lo as its nexthop.

root@FRR01:/# ip route show vrf red
blackhole default proto static metric 20
10.0.0.0/30 nhid 34 dev lo proto bgp metric 20   <<<<<
10.0.1.0/30 dev ens38 proto kernel scope link src 10.0.1.1
10.100.0.1 nhid 34 dev lo proto bgp metric 20

I know using dev <vrf> as the nexthop will result in the packet doing a route lookup in the VRF's FIB table, but I'm not sure if lo behaves the same way.

I think we need to understand what the expected kernel behavior is here (should routing to dev lo trigger a route lookup) so we know if FRR is programming the kernel in accordance with how the data plane actually works for the default VRF, or if FRR has made an invalid assumption about how the data plane works for the default VRF
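
To spot such routes quickly, the output of ip route show vrf <name> can be scanned for BGP routes whose nexthop device is lo. The helper below is a hypothetical sketch (the name `bgp_routes_via` and the regular expression are not part of FRR or iproute2) and it only understands the simple line format shown in this thread.

```python
import re

# Matches lines such as "10.0.0.0/30 nhid 34 dev lo proto bgp metric 20".
ROUTE_RE = re.compile(
    r"^(?P<prefix>\S+) .*\bdev (?P<dev>\S+) .*\bproto (?P<proto>\S+)"
)


def bgp_routes_via(output, dev="lo"):
    """Return prefixes of proto-bgp routes whose nexthop device is `dev`."""
    hits = []
    for line in output.splitlines():
        m = ROUTE_RE.match(line.strip())
        if m and m.group("proto") == "bgp" and m.group("dev") == dev:
            hits.append(m.group("prefix"))
    return hits


# Sample copied from the "ip route show vrf red" output quoted above.
SAMPLE = """\
blackhole default proto static metric 20
10.0.0.0/30 nhid 34 dev lo proto bgp metric 20
10.0.1.0/30 dev ens38 proto kernel scope link src 10.0.1.1
10.100.0.1 nhid 34 dev lo proto bgp metric 20
"""

print(bgp_routes_via(SAMPLE))
```

A hit is only a hint. 10.100.0.1 is configured on lo itself and 9.0.2 also installs it via dev lo, so in this sample only 10.0.0.0/30 is suspicious.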

@louis-6wind
Contributor

I'll have a look at this issue

@EasyNetDev
Contributor Author

EasyNetDev commented May 17, 2024

That's exactly what I was thinking. I think dev lo is not reinjecting the packet back into the network stack.
Between VRFs it is OK. I saw that the lookup is done if you point to the VRF interface.

I was trying to build a multi-loopback interface driver for Linux and faced similar issues.
The driver I created is based on parts of the Linux loopback, dummy and VRF interfaces.
From what I understood of the VRF and loopback code, the loopback interface doesn't reinject traffic back into the network stack. That's why I tried to use parts of the VRF interface.
The VRF code takes some additional steps, like lo_process_v4_outbound, while loopback doesn't.

Here is the project where I'm trying to implement multiple loopback interfaces: https://github.com/EasyNetDev/linux-multi-loopback
I wrote some short documentation on how I was planning to build this driver.
Don't take it as fully working, but maybe it gives you an idea of why this is happening.

@louis-6wind
Contributor

Try this patch (for the latest master)

diff --git a/bgpd/bgp_mplsvpn.c b/bgpd/bgp_mplsvpn.c
index d237a00e04..90881621b3 100644
--- a/bgpd/bgp_mplsvpn.c
+++ b/bgpd/bgp_mplsvpn.c
@@ -2208,8 +2208,9 @@ static void vpn_leak_to_vrf_update_onevrf(struct bgp *to_bgp,   /* to */
 	 * Let the kernel to decide with double lookup the real next-hop
 	 * interface when installing the route.
 	 */
-	if (src_bgp || bpi_ultimate->sub_type == BGP_ROUTE_STATIC ||
-	    bpi_ultimate->sub_type == BGP_ROUTE_REDISTRIBUTE) {
+	if (src_vrf->vrf_id != VRF_DEFAULT &&
+	    (src_bgp || bpi_ultimate->sub_type == BGP_ROUTE_STATIC ||
+	     bpi_ultimate->sub_type == BGP_ROUTE_REDISTRIBUTE)) {
 		ifp = if_get_vrf_loopback(src_vrf->vrf_id);
 		if (ifp)
 			static_attr.nh_ifindex = ifp->ifindex;
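
The effect of the changed condition can be paraphrased outside of FRR. The sketch below restates the two conditions from the diff as plain predicates. The constant values and function names are stand-ins chosen for illustration and are not the bgpd definitions.

```python
VRF_DEFAULT = 0  # stand-in value for this sketch

# Stand-ins for the BGP route sub-types named in the diff.
BGP_ROUTE_NORMAL = "normal"
BGP_ROUTE_STATIC = "static"
BGP_ROUTE_REDISTRIBUTE = "redistribute"


def use_vrf_loopback_before(src_vrf_id, has_src_bgp, sub_type):
    """Condition on the removed ('-') lines of the diff."""
    return has_src_bgp or sub_type in (BGP_ROUTE_STATIC, BGP_ROUTE_REDISTRIBUTE)


def use_vrf_loopback_after(src_vrf_id, has_src_bgp, sub_type):
    """Condition on the added ('+') lines: never true for the default VRF."""
    return src_vrf_id != VRF_DEFAULT and use_vrf_loopback_before(
        src_vrf_id, has_src_bgp, sub_type
    )


for vrf_id in (VRF_DEFAULT, 7):  # 7 is an arbitrary non-default VRF id
    before = use_vrf_loopback_before(vrf_id, False, BGP_ROUTE_REDISTRIBUTE)
    after = use_vrf_loopback_after(vrf_id, False, BGP_ROUTE_REDISTRIBUTE)
    print(vrf_id, before, after)
```

With the patch, a redistributed route leaked from the default VRF no longer has its nexthop interface replaced by the VRF loopback, while routes leaked from other VRFs keep the previous behaviour.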

@EasyNetDev
Contributor Author

OK. I will try this evening.
I think I saw this code:

	if (src_bgp || bpi_ultimate->sub_type == BGP_ROUTE_STATIC ||
	    bpi_ultimate->sub_type == BGP_ROUTE_REDISTRIBUTE) {
 		ifp = if_get_vrf_loopback(src_vrf->vrf_id);
 		if (ifp)
 			static_attr.nh_ifindex = ifp->ifindex;

when I was skimming through the PRs in 10.0.

@louis-6wind
Contributor

Good catch. This is the code I am patching

@EasyNetDev
Contributor Author

EasyNetDev commented May 17, 2024

Yep, it's working:

FRR01# show version
FRRouting 10.1-dev (FRR01) on Linux(6.7.12-amd64).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
    '--build=x86_64-linux-gnu' '--prefix=/usr' '--includedir=${prefix}/include' '--mandir=${prefix}/share/man' '--infodir=${prefix}/share/info' '--sysconfdir=/etc' '--localstatedir=/var' '--disable-option-checking' '--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' '--libexecdir=${prefix}/lib/x86_64-linux-gnu' '--disable-maintainer-mode' '--sbindir=/usr/lib/frr' '--with-vtysh-pager=/usr/bin/pager' '--libdir=/usr/lib/x86_64-linux-gnu/frr' '--with-moduledir=/usr/lib/x86_64-linux-gnu/frr/modules' '--disable-dependency-tracking' '--enable-rpki' '--disable-scripting' '--enable-pim6d' '--with-libpam' '--enable-doc' '--enable-doc-html' '--enable-snmp' '--enable-fpm' '--disable-protobuf' '--disable-zeromq' '--enable-ospfapi' '--enable-bgp-vnc' '--enable-multipath=256' '--enable-user=frr' '--enable-group=frr' '--enable-vty-group=frrvty' '--enable-configfile-mask=0640' '--enable-logfile-mask=0640' '--enable-config-rollbacks' 'build_alias=x86_64-linux-gnu' 'PYTHON=python3'

FRR01# show ip route vrf red
Codes: K - kernel route, C - connected, L - local, S - static,
       R - RIP, O - OSPF, I - IS-IS, B - BGP, E - EIGRP, N - NHRP,
       T - Table, v - VNC, V - VNC-Direct, A - Babel, F - PBR,
       f - OpenFabric, t - Table-Direct,
       > - selected route, * - FIB route, q - queued, r - rejected, b - backup
       t - trapped, o - offload failure

VRF red:
S>* 0.0.0.0/0 [1/0] unreachable (blackhole), weight 1, 00:00:15
B>* 10.0.0.0/30 [20/0] is directly connected, ens37 (vrf default), weight 1, 00:00:10
C>* 10.0.1.0/30 is directly connected, ens38, 00:00:16
L>* 10.0.1.1/32 is directly connected, ens38, 00:00:16
B>* 10.100.0.1/32 [20/0] is directly connected, lo (vrf default), weight 1, 00:00:10
root@Client02:/home/adrian# cd /
root@Client02:/# ping 10.0.0.2 -c2
PING 10.0.0.2 (10.0.0.2) 56(84) bytes of data.
64 bytes from 10.0.0.2: icmp_seq=1 ttl=63 time=1.58 ms
64 bytes from 10.0.0.2: icmp_seq=2 ttl=63 time=0.848 ms

--- 10.0.0.2 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1002ms
rtt min/avg/max/mdev = 0.848/1.212/1.577/0.364 ms
root@Client02:/#

@EasyNetDev
Contributor Author

@taspelund will you create a PR?
Or I can do it for you, if you want.

@taspelund

You go ahead. I didn't do any of the coding, just have my 2 cents on what I thought was wrong :-)

@EasyNetDev
Contributor Author

Ah, sorry @taspelund. I tagged you by mistake. I wanted to ask @louis-6wind.

@louis-6wind
Contributor

I will open the pull request. Let me check a few things first.

@louis-6wind
Contributor

If you do ip route show table <vrf-table> you'll see that a VRF collapses both normal and local routes into a single table, while the default VRF splits them across tables main and local. The idea behind the move is to consolidate the local/main lookups and to avoid unintentional leaking of local routes from the default VRF into other VRFs.

You are correct. If you add a route to the local table, you can solve the routing loop issue.
ip route add X.X.X.X/XX dev XX table local
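
Why a route in table local helps can be sketched with a toy model of the rule walk: rules are evaluated in priority order and the first table that contains a matching route decides the nexthop. This is a simplified illustration, not kernel behaviour in full. The l3mdev rule is modelled as a fixed lookup in table red, and the concrete prefix and device used for the workaround (10.0.0.0/30 dev ens37) are assumptions picked from the test setup in this thread.

```python
import ipaddress


def lookup(rules, tables, dst):
    """Walk rules by priority; the first table with a matching route wins."""
    addr = ipaddress.ip_address(dst)
    for _priority, table in sorted(rules):
        matches = [
            (net, dev)
            for net, dev in tables[table]
            if addr in ipaddress.ip_network(net)
        ]
        if matches:
            # Longest-prefix match inside the selected table.
            net, dev = max(
                matches, key=lambda m: ipaddress.ip_network(m[0]).prefixlen
            )
            return table, dev
    return None, None


# Rule order from "ip rule list"; the l3mdev rule is modelled as table "red".
rules = [(0, "local"), (1000, "red"), (32766, "main")]
tables = {
    "local": [("10.0.0.1/32", "ens37"), ("10.100.0.1/32", "lo")],
    "red": [("10.0.0.0/30", "lo"), ("10.0.1.0/30", "ens38")],
    "main": [("10.0.0.0/30", "ens37")],
}

print(lookup(rules, tables, "10.0.0.2"))  # resolved by the leaked route in "red"

# Workaround modelled here: ip route add 10.0.0.0/30 dev ens37 table local
tables["local"].append(("10.0.0.0/30", "ens37"))
print(lookup(rules, tables, "10.0.0.2"))  # now resolved by rule 0
```

Because rule 0 is consulted before the l3mdev rule, the extra route in table local takes precedence over the leaked route that points to lo.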

louis-6wind added a commit to louis-6wind/frr that referenced this issue May 20, 2024
Leaked route from the l3VRF are installed with the loopback as the
nexthop interface instead of the real interface.

> B>* 10.0.0.0/30 [20/0] is directly connected, lo (vrf default), weight 1, 00:21:01

Routing of packet from a L3VRF to the default L3VRF destined to a leak
prefix fails because of the default routing rules on Linux.

> 0:      from all lookup local
> 1000:   from all lookup [l3mdev-table]
> 32766:  from all lookup main
> 32767:  from all lookup default

When the packet is received in the loopback interface, the local rules
are checked without match, then the l3mdev-table says to route to the
loopback. A routing loop occurs (TTL is decreasing).

> 12:26:27.928748 ens37 In  IP (tos 0x0, ttl 64, id 26402, offset 0, flags [DF], proto ICMP (1), length 84)
>     10.0.0.2 > 10.0.1.2: ICMP echo request, id 47463, seq 1, length 64
> 12:26:27.928784 red   Out IP (tos 0x0, ttl 63, id 26402, offset 0, flags [DF], proto ICMP (1), length 84)
>     10.0.0.2 > 10.0.1.2: ICMP echo request, id 47463, seq 1, length 64
> 12:26:27.928797 ens38 Out IP (tos 0x0, ttl 63, id 26402, offset 0, flags [DF], proto ICMP (1), length 84)
>     10.0.0.2 > 10.0.1.2: ICMP echo request, id 47463, seq 1, length 64

Do not set the lo interface as a nexthop interface. Keep the real
interface where possible.

Fixes: db7cf73 ("bgpd: fix interface on leaks from redistribute connected")
Fixes: 067fbab ("bgpd: fix interface on leaks from network statement")
Fixes: 8a02d9f ("bgpd: Set nh ifindex to VRF's interface, not the real")
Fixes: FRRouting#15909
Signed-off-by: Louis Scalbert <louis.scalbert@6wind.com>
mergify bot pushed a commit that referenced this issue May 24, 2024
(cherry picked from commit 31fc89b)