Category: Thought experiment


This article isn’t written with knowledge usable for a production network in mind. It’s more of an “I have not failed. I’ve just found 10,000 ways that won’t work.” kind of article.

I’m currently in a mailing group with fellow network engineers who are setting up GRE tunnels to each other’s home networks over the public internet. Over those tunnels we speak (external) BGP with each other, and each engineer announces his own private address range. With around 10 engineers so far and a partial mesh of tunnels, it makes for a useful topology to troubleshoot and experiment with. Just like the real internet, you don’t know what happens day to day: neighborships may go down, new ones may suddenly be added, and other next hops may suddenly become more interesting for some routes.

SwitchRouting1

But of course this requires a device at home capable of both GRE and BGP. A Cisco router will do, as will Linux with Quagga and many other industrial routers. But the only device I currently have running 24/7 is my WS-C3560-8PC switch. Although it runs an IP Services IOS, is already routing, and can do GRE and BGP, it doesn’t do NAT. Easy enough: allow GRE through on the router that does the NAT in the home network. Turns out the old DD-WRT version on my current router doesn’t support that. Sure, I could replace it, but that would cost me a new router, and it would not be a challenge.

SwitchRouting2

Solution: give the switch a direct public IP address and do the tunnels from there. After all, the internal IP addresses are encapsulated in GRE for transport, so no NAT is required for them. Since the switch already has a default route towards the router, set up host routes (a /32) per remote GRE endpoint. However, this introduces asymmetric routing: the provider subnet is a connected subnet for the switch, so incoming traffic will arrive through the router while outgoing traffic will go directly from the switch to the internet, without NAT. Of course that will not work.

SwitchRouting3

So yet another problem to work around. This can be solved for the most part using Policy-Based Routing (PBR): on the client VLAN interface, redirect all traffic not destined for a private range towards the router. But again, this has complications: the routing table no longer reflects the actual forwarding, there is more administrative overhead, and all packets originated by the switch itself still follow the default route (the 3560 does not support PBR for locally generated packets).
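
A minimal sketch of what that PBR would look like; the VLAN number and the router’s address here are assumptions, and if I recall correctly the 3560 also needs an SDM template that supports PBR:

ip access-list extended INTERNET-BOUND
 deny ip any 10.0.0.0 0.255.255.255
 deny ip any 172.16.0.0 0.15.255.255
 deny ip any 192.168.0.0 0.0.255.255
 permit ip any any
!
route-map PBR-TO-ROUTER permit 10
 match ip address INTERNET-BOUND
 set ip next-hop 192.168.0.254
!
interface Vlan10
 ip policy route-map PBR-TO-ROUTER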

Next idea: it would be nice to have an extra device that can do GRE and BGP directly towards the internet, so my switch can route private-range packets towards it. But the constraint is: no new device. That brings me to VRFs: split the current 3560 switch in two, with one routing table for the internal routing (vrf MAIN) and one for the GRE tunnels (vrf BGP). However, to connect the two VRFs on the same physical device, I would need to loop a cable from one switchport to another, and I only have 8 ports. The rest would work out fine: point private ranges from a VLAN interface in one VRF to a next-hop VLAN interface in the other VRF over that cable. That second VRF can have a default route towards the internet and terminate the GRE tunnels. The two VRFs would share one subnet over the loop.

SwitchRouting4

Since I don’t want to sacrifice ports to that extra cable, would it be possible to route between VRFs internally? I’ve tried similar things before, but those required a route-map and a physical incoming interface, and I might as well use PBR if I go that way. Internal interfaces for routing between VRFs exist on the ASR series, but not on my simple 8-port 3560. But what if I replace the cable with tunnel interfaces? Is it possible to put the two endpoints in different VRFs? Yes, the 15.0(2) IOS supports it!

SwitchRouting5

The tunnel interfaces have two commands that are useful for this:

  • vrf forwarding : just like on any other layer 3 interface, it specifies the routing table for the packets inside the tunnel.
  • tunnel vrf : specifies the underlying VRF from which the packets will be sent after GRE encapsulation.

With these two commands, it’s possible to have tunnels in one VRF transporting packets for another VRF. The concept is vaguely similar to MPLS VPN, where the intermediate (provider) routers have only one routing table, which is used to transport packets towards the VRF-aware routers (the provider edge).

interface Vlan2
 ip address 192.168.2.1 255.255.255.0
!
interface Vlan3
 ip address 192.168.3.1 255.255.255.0
!
interface Tunnel75
 vrf forwarding MAIN
 ip address 192.168.7.5 255.255.255.252
 tunnel source Vlan2
 tunnel destination 192.168.3.1
!
interface Tunnel76
 vrf forwarding BGP
 ip address 192.168.7.6 255.255.255.252
 tunnel source Vlan3
 tunnel destination 192.168.2.1

So I configure two tunnel interfaces, both in the main routing table. Source and destination are two IP addresses locally configured on the switch itself. I chose VLAN interfaces; loopbacks would likely work as well. Inside the tunnels, one is set to the first VRF, the other to the second. One of the VRFs may be shared with the main (outside-the-tunnels) routing table, but it’s not a requirement. Configure both tunnel interfaces as the two sides of a point-to-point connection and they come up. Ping works, and even an MTU of 1500 works over the tunnels, despite the show interface command showing an MTU of only 1476!

Next, I set up BGP to be VRF-aware. Logically, there are two ‘routers’: one is the endpoint of the GRE tunnels, and the other sits behind it for the internal routing. If these were two physical routers, I would set up internal BGP between them, since I’m already using that protocol anyway. There’s no difference here: you can make the two VRFs speak BGP to each other in one single configuration.

router bgp 65000
 address-family ipv4 vrf MAIN
  neighbor 192.168.7.6 remote-as 65000
  network 192.168.0.0 mask 255.255.248.0
  neighbor 192.168.7.6 activate
 exit-address-family
 !
 address-family ipv4 vrf BGP
  bgp router-id 192.168.7.6
  neighbor 192.168.7.5 remote-as 65000
  neighbor 192.168.7.5 activate
 exit-address-family

A few points did surface: you need to specify the neighbors (the IP addresses of the local device in the different VRFs) under the correct address families. You also need to specify a route distinguisher under each VRF definition, as it is required for VRF-aware BGP. And maybe the most ironic one: you need to set a bgp router-id inside one of the VRF address families so it differs from the other VRF (by default, both take the highest interface IP address), otherwise the two ‘BGP peers’ will notice the duplicate router-id and the session will not come up. But after all of that, BGP comes up and routes are exchanged between the two VRFs! Finally, for the GRE tunnels towards the internet, the tunnel vrf command is required so they use the correct routing table for routing over the internet.
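
For completeness, a sketch of what the VRF definitions would look like (the RD values are arbitrary examples), followed by an internet-facing GRE tunnel using the tunnel vrf command; the interface names and addresses here are placeholders, not my actual configuration:

vrf definition MAIN
 rd 65000:1
 address-family ipv4
 exit-address-family
!
vrf definition BGP
 rd 65000:2
 address-family ipv4
 exit-address-family
!
interface Tunnel100
 vrf forwarding BGP
 ip address 10.99.0.1 255.255.255.252
 tunnel source Vlan100
 tunnel destination 203.0.113.1
 tunnel vrf BGP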

So what makes this not production-worthy? The software switching.

The ASIC can only do a set number of actions in a certain sequence without punting to the switch CPU. A layer 2 CAM table lookup or a layer 3 RIB lookup is one thing, but receiving a packet, having the RIB point it into a GRE tunnel, encapsulating it, decapsulating it again, and doing a RIB lookup in another VRF is too much. The software follows the expected steps in the code exactly: IOS does not ‘see’ what the point is and takes no shortcuts. GRE headers are actually calculated for each packet traversing the ‘internal tunnel’ link. I did a stress test, and the CPU maxed out at 100% at… 700 kBps, about 5.6 Mbps. So while this is a very interesting configuration and an ideal situation to learn from, it’s just lab stuff.

So that’s the lesson, as stated in the beginning: how not to do it. Can you route between VRFs internally on a Cisco switch or router (not including ASR series)? Yes. Would you want to do it? No!

And no, not FabricPath either. This one works without any active protocol involved, and with no blocked links. Too good to be true? Of course!

LAN-NoSTP

Take the above example design: three switches connected by port channels. Let’s assume users connect to these switches with desktops.

Using a normal design, spanning tree would be configured (MST, RPVST+, you pick) and one of the three port-channel links would go into blocking. The root switch would be the one connecting to the rest of the network or a WAN uplink, assuming you set bridge priorities right.

Easy enough. And it would work. Any user in a VLAN would be able to reach another user on another switch in the same VLAN. They would always have to pass through the root switch though, either by being connected to it, or because spanning tree blocks the direct link between the non-root switches.

Disabling spanning tree would make all links active, and a loop would definitely follow. However, wouldn’t it be nice if a switch would not forward a frame received from another switch on to a third switch? That would require some sort of split horizon, which VMware vSwitches already do: if a frame enters from a pNIC (physical NIC), it will not be sent out another pNIC again, preventing the vSwitch from becoming a transit switch. Turns out this split horizon functionality exists on a Cisco switch: ‘switchport protect’ on the interface, which prevents any frame from being sent out of a port if it came in through another port with the same command.

Configuring it on the port channels of all three switches without disabling spanning tree proves the point: the two non-root switches can’t reach each other anymore, because the root switch no longer forwards frames between the port channels. Disabling spanning tree after that creates a working situation: all switches can reach each other directly! No loops are formed, because no switch forwards between the port channels. The configuration per switch is shown below.
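
A lab sketch, assuming Po1 and Po2 are that switch’s two inter-switch port channels; the same lines go on all three switches:

interface Port-channel1
 switchport protect
!
interface Port-channel2
 switchport protect
!
no spanning-tree vlan 1-4094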

Result: a working network with active-active links and optimal bandwidth usage. So what are the caveats? Well…

  • It doesn’t scale: you need a full mesh between switches, which is n(n−1)/2 links for n switches. Three inter-switch links for three switches, six for four switches, ten for five switches,… After a few extra switches, you’ll run out of ports just for the uplinks.
  • Any link failure breaks the network: if the link between two switches goes down, those two switches can no longer reach each other. This is why my example uses port channels: as long as one member link is active, it will work. But there will not be any failover to an alternative path.

Again, a disclaimer: I don’t recommend this in any production environment. And I’m sure someone will ignore (or already has ignored) this.

If you’re working in a large enterprise with its own AS and public range(s), you’ll probably recognize the following image:

MultiClientDataCenter

On top, the internet, with BGP routers peering with multiple upstream providers and advertising the public range(s) owned by the company. Below that, multiple firewalls or routers (but I do hope firewalls). Those devices either provide internet access to different parts of the company network, or to different customers. Whatever the case, they will have a default route pointing towards the BGP routers (a nice place for HSRP/VRRP).

Vice versa, the BGP routers have connectivity to the firewalls in the connected subnet, but they must also have a route for each NAT range towards the right firewall. This is often done using static routes: for each NAT range, a static route is defined on the BGP routers with a firewall as next hop. On that firewall, those NAT addresses can then be used: e.g. if the BGP routers have a route for 192.0.2.0/30, all four of those addresses (yes, including the network and broadcast addresses) can be used to NAT servers or users behind it, even if those addresses aren’t present on any interface of that firewall.
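
On an IOS-based BGP router, such static routes would look something like this; all addresses here are hypothetical, with 198.51.100.10 and .11 being firewalls in the shared subnet:

ip route 192.0.2.0 255.255.255.252 198.51.100.10
ip route 192.0.2.4 255.255.255.252 198.51.100.11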

The problem with this setup is that it quickly becomes a great administrative burden, and since all the firewalls have a default route pointing towards the BGP routers, traffic between firewalls travels an extra hop. Setting up static routing for all the NAT ranges on each firewall is even more of a burden, and forgotten routes introduce asymmetric routing, where the return path runs through the BGP routers. And while allocating one large public NAT range to each firewall sounds great in theory, the reality is that networks tend to grow beyond their designed capacity, and new NAT ranges must be allocated from time to time as more servers are deployed. Perhaps customers even pay for each public IP they have. Can this be solved with dynamic routing? After all, the NAT ranges aren’t present on any connected subnet.

Yes, it’s possible! First, set up a dynamic routing protocol with all routers in the connected public subnet. Personally, I still prefer OSPF. In this design, the benefit of OSPF is the DR and BDR concept: make the BGP routers DR and BDR so they efficiently multicast routing updates to all firewalls. Next, on all firewalls, allow the redistribution of static routes, preferably with a prefix filter that allows only subnets of your own public IP range used for NAT. Finally, if you need a NAT range, you just create a Null route on the firewall that needs it (e.g. 192.0.2.0/30 to Null0), and the route will automatically be redistributed into OSPF towards the BGP routers, which in turn send updates to all other firewalls. Problem solved!
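
If the firewall were an IOS device, a sketch could look like this, assuming 198.51.100.0/24 as the shared public subnet and 192.0.2.0/24 as the company’s public range:

ip prefix-list OWN-PUBLIC seq 5 permit 192.0.2.0/24 le 32
!
route-map NAT-RANGES permit 10
 match ip address prefix-list OWN-PUBLIC
!
router ospf 1
 network 198.51.100.0 0.0.0.255 area 0
 redistribute static subnets route-map NAT-RANGES
!
ip route 192.0.2.0 255.255.255.252 Null0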

But what about that Null route? Won’t packets be discarded? No, because this is where a particular property of NAT comes into play: NAT translations are done before the next hop is calculated. A packet entering with a destination address covered by the Null route (e.g. 192.0.2.1) will first be translated to the private IP address, after which the route lookup gives a connected subnet or another route further into the private network. The Null route is never actually used!
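
To illustrate with IOS-style NAT (hypothetical addresses): the static translation below is applied to an incoming packet for 192.0.2.1 before the route lookup, so the lookup is done on 10.1.1.10 instead and the Null0 route never matches.

ip nat inside source static 10.1.1.10 192.0.2.1
ip route 192.0.2.0 255.255.255.252 Null0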

This one’s for you, Chris.

I’ve read countless articles, comments and posts about IPv6 claiming ‘there are x IPv6 addresses for every human/square meter/grain of sand… on earth’. Okay, but let’s go hypothetical now, make some assumptions, and try to predict when the IPv6 address pool will be depleted. It’s just for fun; don’t expect any educational value in this article.

I’m going to use powers of ten to keep it somewhat imaginable. The total number of possible combinations with 128 bits is 2^128, or 3.40*10^38. However, some ranges are unusable for internet-routable traffic, as they are reserved by the IETF:

  • FE80::/10 – Link local addresses.
  • FC00::/7 – Unique local addresses, somewhat resembling RFC 1918 addresses in IPv4.
  • FF00::/8 – Multicast addresses.
  • 2002::/16 – 6to4 addresses, which are more of a transition mechanism and not really host addresses.

Currently, only the 2000::/3 range is global unicast, but other ranges may be assigned as such in the future. The currently assigned range alone gives 4.25*10^37 addresses. For simplicity, I’m going to assume that eventually the entire address range will be used, except 2002::/16 and the entire F000::/4 block; that should cover all current and future reserved assignments. That still leaves 3.19*10^38 addresses.

It should be clear by now that we will run out of usable MAC addresses much sooner than IPv6 addresses, but let’s leave that out of the equation here and continue with our 39-digit number. Even if we assume that in the near future every mobile device has an IP address, and that the number of mobile devices doubles (e.g. one smartphone plus one tablet each), we have about 10 billion devices, according to Wikipedia. That’s 10^10, which is nothing next to 10^38. Even multiplying that number by 1,000 to cover all servers, point-to-point links, multiple home computers, household devices with an IP (refrigerators, ovens, photo frames, you name it), the total number of cars,… it will ‘only’ add up to 10^13, which doesn’t even scratch the surface of the available address pool.

The answer doesn’t lie in counting the number of devices that could possibly have an IP address. So far, I have been assuming perfectly filled subnets with no wasted addresses. That’s certainly not going to happen in IPv6: the EUI-64 mechanism for stateless autoconfiguration requires a subnet to have 64 host bits. This means most, if not all, subnets (except point-to-point links) will be a /64 in the future. Highly inefficient for the conservation of addresses. Using the values I assumed earlier, that gives me 1.73*10^19 subnets. So from a 39-digit number, we now have a more realistic 20-digit number.

Let’s try to deplete that by, for example, giving one /56 per household (recommended, but not likely to happen, as described in detail already by network instructor Chris Welsh) and one /48 per company. I haven’t found any definitive numbers, but I think it’s a conservative assumption that about 3 people on average share one home worldwide, with fewer in the Western world (about 2.5 on average) and more in other parts of the world. As far as companies go, I’ve only found one quote saying there were 56 million companies worldwide in 2004, without anything to back it up, but it’s the best I have.

Assuming 60 million companies in 2012, each with a /48, and 7.2 billion people at an average of three per household, for a total of 2.4 billion households each with a /56, I can calculate:

  • A /56 leaves 8 bits of subnetting, or 256 /64 subnets; times 2.4 billion households makes 614.4 billion, or 6.14*10^11.
  • A /48 leaves 16 bits, or 65,536 /64 subnets; times 60 million companies makes 3.93*10^12.
  • Together, that’s 4.55*10^12 subnets used.

That still doesn’t touch the 10^19 value. However, we now have a number that represents the subnets used by the whole population (assuming the whole world has access to IPv6 technology and needs it, but hey, we’re making assumptions about the future, and I’m hoping the best for humanity). Dividing the number of used subnets by the population, we get roughly 632 /64 subnets per human on this planet.

1.73*10^19 total subnets, divided by 632 subnets per human, gives 27.37*10^15 humans on the planet before the subnets are depleted. That’s 27 million billion. At this point, I was going to pull up a chart of human population growth, but none of them make predictions past 20 billion.
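
Summarized in one place, the arithmetic from the assumptions above:

\[
\frac{3.19\times 10^{38}}{2^{64}} \approx 1.73\times 10^{19}\ \text{subnets},\qquad
\frac{256\cdot 2.4\times 10^{9} + 65536\cdot 6\times 10^{7}}{7.2\times 10^{9}} \approx 632\ \text{subnets/person},\qquad
\frac{1.73\times 10^{19}}{632} \approx 2.74\times 10^{16}\ \text{people}
\]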

Conclusion: unless we reach for the stars and spread our IPv6 address space across the entire galaxy (even our solar system wouldn’t do), or finally perfect nanotechnology and decide to give each nanobot its own IPv6 address, we will not run out of IPv6 addresses. Even with large error margins, there are simply too many addresses and too few things to give them to. ISPs, hand out those /56s already!

I’ve described in an earlier post how routing works in a hub-and-spoke Frame Relay network. The idea is that you have to get layer 3 services (routing) to work in an environment where no direct layer 2 contact is possible. You do this by using the hub router as a relay for all data.

In this post, I’m going to attempt to do the same in an Ethernet network. Normally, an Ethernet is a broadcast network, but Private VLANs form an exception to this rule: any data, even broadcasts, sent from an isolated port will not be received on other isolated ports. This is very similar to the hub-and-spoke Frame Relay, where spokes can’t communicate directly with each other.

I’m using the following setup:
PVLAN-Routing

After configuring the PVLANs on the switch (for details about this, see an earlier blog post), I configure all routers with EIGRP. Just a simple setup:

Router(config)#hostname Rx
Rx(config)#router eigrp 1
Rx(config-router)#network 10.0.0.0 0.0.255.255
Rx(config-router)#exit
Rx(config)#interface Loopback0
Rx(config-if)#ip address 10.0.1.x 255.255.255.255
Rx(config-if)#exit
Rx(config)#interface Ethernet0/0
Rx(config-if)#ip address 10.0.0.x 255.255.255.0
Rx(config-if)#no shutdown

The ‘x’ represents the router number (1, 2 or 3 for R1, R2, R3). The loopback is configured to show that the routing protocol is working. After all this, EIGRP adjacencies form between R1 and R2 and between R1 and R3, but not between R2 and R3, as they can’t see each other because of the isolated PVLAN. A ‘show ip route’ on R2 and R3 shows the connected subnet and one EIGRP-learned route: the loopback of R1. Clearly, R2 and R3 don’t exist for each other.

But just like with hub-and-spoke Frame Relay, this can be solved by disabling split horizon on R1: in interface configuration mode, this is the ‘no ip split-horizon eigrp 1’ command (with the EIGRP AS number at the end), as shown below. Neighbor adjacencies are lost after this command and quickly form again. R2 and R3 still aren’t neighbors, but their routing tables now also contain routes for each other’s loopbacks, learned through EIGRP. But they can’t reach those loopbacks: pings don’t work. The reason is of course the PVLAN configuration, which prevents direct communication. And this is where it gets interesting: in the hub-and-spoke Frame Relay configuration, this was solved by adding static mappings to the PVCs so all frames are sent to the hub router, which then relays them to the correct spoke router. Can this be done here too?
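
On R1, using AS 1 from the earlier configuration, that is:

R1(config)#interface Ethernet0/0
R1(config-if)#no ip split-horizon eigrp 1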

It can: configure routers R2 and R3 with static ARP information. But not with the ‘correct’ information: if you, for example, add a static ARP entry on R2 for 10.0.0.3 with the MAC address of R3, it still won’t work, as no direct communication is possible. Just like with Frame Relay, where you use the PVC towards the hub router instead of the PVC of the spoke router, you make a static ARP entry on R2 for 10.0.0.3 with the MAC address of R1! So first do a ‘show interface Ethernet0/0’ on R1 to reveal its MAC address, in my case 0030.85e0.1d40. Next, on the other routers, make the static ARP entries:

R2(config)#arp 10.0.0.3 0030.85e0.1d40 arpa

R3(config)#arp 10.0.0.2 0030.85e0.1d40 arpa

Now all pings work! R1 relays the frames to the correct router and rewrites the destination MAC address of the frames. I said ‘thought experiment’ at the start because this has little real-world use, but it does allow for interesting configurations, like ACL filtering on the central router (R1) within the same subnet.

Every time I see private-range addresses somewhere, I automatically think of Network Address Translation. But NAT and private addresses don’t always need to be used together.

Take the following network:

Example network

Now suppose you have received a /22 public address range for your company, e.g. 123.45.68.0/22, which you then split up into subnets for your users and servers. Since IPv4 addresses are limited and you have three point-to-point links in the ‘Internal’ part of the network, you’re hesitant to waste 12 addresses (three times a /30) on them. Sure, you could use /31s and burn only 6 IPs, but as the number of links increases, so do the wasted addresses.

But if you give these links IPs in the ranges 192.168.1.0/30, 192.168.1.4/30 and 192.168.1.8/30, and advertise them internally with whatever IGP you’re using, things will work too. You will need to filter packets originating from these private ranges at the WAN edge router so they don’t reach the internet. Hosts on the internet can’t reach the IPs on the point-to-point links anyway, as they aren’t advertised outside the company (added security!). Internally, the private ranges simply become part of the network, without NAT, and they can be pinged perfectly. You could even set up a subnet using private ranges for servers that must only be accessed internally (very secure, though automatic updates would not be so easy, I can imagine). For remote connectivity, VPN should allow access.
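
A sketch of that WAN edge filter on an IOS router; the interface name is an assumption, and 192.168.1.0 with wildcard 0.0.0.15 covers all three /30s:

ip access-list extended NO-PRIVATE-OUT
 deny ip 192.168.1.0 0.0.0.15 any
 permit ip any any
!
interface GigabitEthernet0/1
 ip access-group NO-PRIVATE-OUT out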

All in all, a real-world implementation of this design may have some flaws, but it proves a point, and adds security in some sense.