
Am I running IPv6 on my network?

[Image: RunningIPv6]


Setting up a routing protocol neighborship isn’t hard. In fact, it’s so easy I’ve made one by accident! How? There were already two OSPF neighbors in a subnet, and I was configuring a third router for OSPF with a fourth router. But because the third router had an interface in that same subnet and I used the command ‘network 0.0.0.0 255.255.255.255 area 0’, the neighborships came up anyway. It’s a good example of how securing a neighborship is not only about stopping malicious intent, but also about minimizing human error.
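A simple guard against such accidents (a hedged sketch; the interface and subnet are hypothetical): make all interfaces passive by default and explicitly open only those that should form neighborships.

Router(config)#router ospf 1
Router(config-router)#passive-interface default
Router(config-router)#no passive-interface FastEthernet0/0
Router(config-router)#network 10.0.0.0 0.0.0.255 area 0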

Session authentication
The most straightforward way to secure a neighborship is adding a password to the session. However, this is less perfect than it sounds: it doesn’t encrypt the session, so everyone can still read it, and the hash is usually MD5, which can easily be broken at the time of writing. Nevertheless, a quick overview of the password protection for EIGRP, OSPF and BGP:

Router(config)#key chain KEY-EIGRP
Router(config-keychain)#key 1
Router(config-keychain-key)#key-string <password>
Router(config-keychain-key)#exit
Router(config-keychain)#exit
Router(config)#interface Fa0/0
Router(config-if)#ip authentication mode eigrp 65000 md5
Router(config-if)#ip authentication key-chain eigrp 65000 KEY-EIGRP
Router(config-if)#exit

Router(config)#interface Fa0/1
Router(config-if)#ip ospf message-digest-key 1 md5 <key>
Router(config-if)#exit
Router(config)#router ospf 1
Router(config-router)#area 0 authentication message-digest
Router(config-router)#exit

Router(config)#router bgp 65000
Router(config-router)#neighbor 10.10.10.10 remote-as 65000
Router(config-router)#neighbor 10.10.10.10 password <password>

Note a few differences. EIGRP uses a key chain. The upside is that multiple keys can be used, each with its own lifetime. The downside: administrative overhead, and unless the keys change every 10 minutes it’s not of much use. I doubt anyone uses this in a production network.
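For completeness, a sketch of what such a rotating key chain could look like (the key strings and times are hypothetical); the accept-lifetimes overlap slightly to allow for clock skew between neighbors:

Router(config)#key chain KEY-EIGRP
Router(config-keychain)#key 1
Router(config-keychain-key)#key-string FirstKey
Router(config-keychain-key)#send-lifetime 00:00:00 Jan 1 2015 01:00:00 Jan 1 2015
Router(config-keychain-key)#accept-lifetime 00:00:00 Jan 1 2015 01:05:00 Jan 1 2015
Router(config-keychain-key)#exit
Router(config-keychain)#key 2
Router(config-keychain-key)#key-string SecondKey
Router(config-keychain-key)#send-lifetime 01:00:00 Jan 1 2015 02:00:00 Jan 1 2015
Router(config-keychain-key)#accept-lifetime 00:55:00 Jan 1 2015 02:05:00 Jan 1 2015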

BGP does the configuration per neighbor (or per peer-group of multiple peers, for scalability). Although there’s no mention of hashing in the command, it still uses MD5. It works with eBGP as well, but you’ll need to agree on the password with the service provider.

OSPF sets the authentication key on the interface and can activate authentication there as well, but here it’s shown under the routing process, as you’ll likely want it on all interfaces. It would have been even better if the key itself could be configured under the routing process, saving some commands and avoiding possible misconfigurations on the interfaces. OSPF authentication commands can be confusing, as Jeremy points out. However:

OSPFv3 authentication using IPsec
The new OSPF version allows for more. Before you decide to skip this section because you don’t run IPv6: OSPFv3 can be used for IPv4 as well. OSPFv3 does run on top of IPv6, but only over link-local addresses. This means you need IPv6 enabled on the interfaces, but you don’t need IPv6 routing and there’s no need to think about an IPv6 addressing scheme.

Router(config)#router ospfv3 1
Router(config-router)#address-family ipv4 unicast
Router(config-router-af)#area 0 authentication ipsec spi 256 sha1 8a3fe4a551b81dc24f6148b03e865b803fec49f7
Router(config-router-af)#exit
Router(config-router)#exit
Router(config)#interface Fa0/0
Router(config-if)#ipv6 enable
Router(config-if)#ospfv3 1 area 0 ipv4
Router(config-if)#ospfv3 bfd

This new OSPF version shows two advantages: you can configure authentication per area instead of per interface, and you can use SHA1 for hashing. The key has to be a 40-digit hex string; it will not accept anything else. A non-hex character, or 39 or 41 digits, gives a confusing ‘command not recognized’ error. The SPI value needs to be the same on both sides, just like the key, of course. The final command shows the optional BFD support.

EIGRP static neighbors
For EIGRP you can define the neighbors locally on the router, instead of discovering them using multicast. This way, the router will not allow neighborships with untrusted routers.

Router(config)#router eigrp 65000
Router(config-router)#network 10.0.0.0 0.0.0.255
Router(config-router)#neighbor 10.0.0.2 Fa0/0
Router(config-router)#exit

Static neighbor definition is one command, but there is a consequence: EIGRP will stop multicasting hello packets on the interface where the static neighbor is. This is expected behavior, but easily forgotten when setting it up. Also, the routing process still needs the ‘network’ command to include that interface, or nothing will happen.
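To verify, the following should list the neighbor as static (the exact output differs per IOS version):

Router#show ip eigrp neighbors detail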

BGP Secure TTL
Small yet useful: checking the TTL of eBGP packets. By default an eBGP session uses a TTL of 1. By issuing ‘neighbor <ip> ebgp-multihop <hops>’ you can change this value. The problem is that an attacker can send SYN packets towards a BGP router with a spoofed source address of a BGP peer. This forces the BGP router to respond to the session request (SYN) with a half-open session (SYN-ACK). Many half-open sessions can overwhelm the BGP process and bring it down entirely.

[Image: TTL-Security]

Secure TTL solves this by changing the way the TTL is checked: instead of expecting a TTL of 1 on arrival, the sender sets the TTL to 255 to begin with, and the peer checks on arrival that the TTL is at least 255 minus the configured number of hops. Result: an attacker can still send spoofed SYN packets, but since he’ll be more hops away and the TTL can’t be set higher than 255, the packets will arrive with a TTL that is too low and are dropped without any notification. The configuration needs to be done on both sides:

Router(config)#router bgp 1234
Router(config-router)#neighbor 2.3.4.5 remote-as 2345
Router(config-router)#neighbor 2.3.4.5 ttl-security hops <hops>

These simple measures can help defend against the unexpected, and although they’re difficult to retrofit into a live network, they’re good to know when (re)designing one.

Another series of articles. So far in my blog, I’ve concentrated on how to get routed networks running with basic configuration. But at some point, you may want to refine the configuration to provide better security, better failover, less chance for unexpected issues, and if possible make things less CPU and memory intensive as well.

While I was recently designing and implementing an MPLS network, it became clear that using defaults everywhere wasn’t the best way to proceed. As visible in the MPLS-VPN article, several different protocols are used: BGP, OSPF and LDP. Each of these establishes a neighborship with the next hop, all using different hello timers to detect issues: 60 seconds for BGP, 10 seconds for OSPF and 5 seconds for LDP.

The first thing that comes to mind is synchronizing these timers, e.g. setting them all to 5 seconds with a 15-second dead time. While this does improve failover, there are now three keepalives going over the link to check whether it works, and failover still takes several seconds. It would be better to bind all these protocols to one common keepalive. UDLD comes to mind, but that checks whether a fiber works in both directions, needs seconds to detect a link failure, and only works between two adjacent layer 2 interfaces. The ideal solution would check layer 3 connectivity between two routing protocol neighbors, regardless of the switched path in between. This would be useful for WAN links, where the fiber signal (the laser) tends to stay active even if there’s a failure in the provider network.

[Image: BFD-Session]

Turns out this is possible: Bidirectional Forwarding Detection (BFD) can do this. BFD is a vendor-independent protocol (RFC 5880) that establishes a session between two layer 3 devices and periodically sends hello packets or keepalives. If the packets are no longer received, the connection is considered down. Configuration is fairly straightforward:

Router(config-if)#bfd interval 50 min_rx 50 multiplier 3

The values used above are all minimums. The first 50, for ‘interval’, is the time in milliseconds between hello packets. The ‘min_rx’ is the expected receive rate for hello packets; documentation isn’t clear on this, and I was unable to see any difference in behavior in my tests when changing this parameter. The ‘multiplier’ value is how many hello packets can be missed before flagging the connection as down. The above configuration will therefore signal a connection issue after 150 ms. The configuration needs to be applied on the remote interface as well, but that alone will not activate BFD: it needs to be attached to a routing process on both sides before it starts to function, as it takes its neighbors from those routing processes. Below are the commands for OSPF, EIGRP and BGP:

Router(config)#router ospf 1
Router(config-router)#bfd all-interfaces
Router(config-router)#exit
Router(config)#router eigrp 65000
Router(config-router)#bfd all-interfaces
Router(config-router)#exit
Router(config)#router bgp 65000
Router(config-router)#neighbor <ip> fall-over bfd
Router(config-router)#exit

This makes the routing protocols much more responsive to link failures. For MPLS, the LDP session cannot be coupled with BFD on a Cisco device, but on a Juniper it’s possible. This is not mandatory, as no frames will be sent on the link anymore once the routing protocol neighborships break and the routing table (well, the FIB) is updated.

Result: fast failover, relying on a dedicated protocol rather than some out-of-date default timers:

Router#show bfd neighbor

NeighAddr                         LD/RD    RH/RS     State     Int
10.10.10.1                       1/1     Up        Up        Fa0/1

Jun 29 14:16:21.148: %OSPF-5-ADJCHG: Process 1, Nbr 10.10.10.1 on FastEthernet0/1 from FULL to DOWN, Neighbor Down: BFD node down

Not bad for a WAN line.

My cat’s syslog.

[Image: CatSyslog]

I’d show the next day, but it’s almost the same.

If you’ve ever managed a campus LAN, you’ll know what happens when a lot of end users have access to ethernet cables on desks: the occasional rogue hub, a loop now and then, and if they have access to some more advanced tools, some BPDUs and a rogue DHCP server. Most of these events are not intended to be malicious (even the BPDUs and rogue DHCP), but they happen because end users are not aware of the impact some devices have on the network.

But given malicious intent, what are the possibilities for attacking a switched Cisco network from a directly attached interface? The operating system for all the upcoming attacks: BackTrack Linux, which has many interesting tools installed.

MAC Flood
The classic attack first: flooding the switch’s CAM table with random source MAC addresses.
Tool: macof
Countermeasure: port-security

First attacking without port security: as expected, CAM table fills, CPU increases and everything is flooded. Congestion everywhere.

[Image: L2Attack-1]

Detecting this attack without port-security is feasible by checking CPU processes: HLFM address learning doesn’t normally consume that much CPU. Turning off MAC address learning stops the attack as well, but without the CPU impact.

So does turning on port-security solve the problem? It depends. Turning it on and setting it to only block new MAC addresses, without shutting down the port, actually makes things worse for the CPU:

[Image: L2Attack-2]

The best solution: port-security with shutdown of the port in case of too many MAC addresses. No flooding, no CPU hogging.
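What that last option could look like (a sketch; the interface and the maximum of two addresses are hypothetical):

Switch(config)#interface FastEthernet0/1
Switch(config-if)#switchport mode access
Switch(config-if)#switchport port-security
Switch(config-if)#switchport port-security maximum 2
Switch(config-if)#switchport port-security violation shutdown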

CDP Flood
A Cisco-only attack: flooding CDP frames with fake neighbors, causing not only CPU spikes but also clogging the memory with all the neighbor entries. ‘show cdp neighbor’ becomes like showing the routing table on a BGP router: endless.
Tool: Yersinia
Countermeasure: disabling CDP on the port or globally.

The attack with CDP turned on (the default) is very effective:

[Image: L2Attack-3]

Detecting the attack is easy, as both CPU and memory will clearly show the CDP process using up resources. However, without CDP on the port, the attack does nothing. So the best solution: always turn CDP off on user-facing ports, even behind an IP Phone, although some functionality will be lost.
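Disabling CDP on a port is a single command (‘no cdp run’ would disable it switch-wide); the interface is hypothetical:

Switch(config)#interface FastEthernet0/2
Switch(config-if)#no cdp enable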

Root BPDU inject
A funny one: inject a BPDU claiming to be root, causing spanning-tree recalculations and suboptimal paths in the network.
Tool: Yersinia
Countermeasure: Root Guard.

[Image: L2Attack-4]

Notice the root ID, which has a nearly identical MAC address to make the change difficult to spot, and the aging time of two days, which makes the attack persist even after the attacker disconnects. Root Guard on the port counters this attack easily, though.
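Root Guard, too, is a single interface command (interface hypothetical):

Switch(config)#interface FastEthernet0/3
Switch(config-if)#spanning-tree guard root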

BPDU Flood
This attack doesn’t try to change the spanning-tree topology, but rather overload the STP process. Consequence is high CPU and eventually spanning tree inconsistencies.
Tool: Yersinia
Countermeasure: BPDU Guard

[Image: L2Attack-5]

Spanning tree should not use that much CPU on a switch. HLFM address learning will increase too due to the random source MAC addresses, and depending on the switch, the Hulc LED Process will increase as well. This is the process that governs the LED status of all switchports: the more ports a switch has, the more CPU this process consumes during flooding attacks.

BPDU Guard stops this effectively by shutting down the port. BPDU Filter, not so much: the switch still needs to inspect each BPDU to drop it rather than forward it in hardware. BPDU Filter is generally not recommended anyway.
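BPDU Guard can be enabled per interface, or globally on all PortFast-enabled ports (a sketch; the interface is hypothetical):

Switch(config)#interface FastEthernet0/4
Switch(config-if)#spanning-tree bpduguard enable
Switch(config-if)#exit
Switch(config)#spanning-tree portfast bpduguard default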

DHCP Discover Flood
Not really a layer 2 attack, but still impactful for the local subnet: sending a flood of DHCP Discover messages, quickly overloading the DHCP server(s) for the subnet.
Tool: Yersinia
Countermeasure: DHCP Snooping and DHCP Snooping Rate Limit

If DHCP Snooping isn’t enabled on the switch, it behaves like a MAC Flood attack and can be countered accordingly. Simply enabling DHCP Snooping, which is against rogue servers and not against flooding, makes things worse.

[Image: L2Attack-6]

Not only does it make the CPU spike, but it’s one of the few attacks that makes the switch unresponsive in the data plane: not only is management lost, the switch also stops forwarding most frames, with packet loss on all ports. Plain snooping does prevent execution from a virtual machine:

[Image: L2Attack-7]

But to really protect against this attack, DHCP Snooping rate-limiting helps:

[Image: L2Attack-8]
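What the countermeasure could look like (a sketch; the VLAN, interface and rate of 15 DHCP packets per second are hypothetical values):

Switch(config)#ip dhcp snooping
Switch(config)#ip dhcp snooping vlan 10
Switch(config)#interface FastEthernet0/5
Switch(config-if)#ip dhcp snooping limit rate 15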

OSPF Flood
Sending a flood of OSPF Hello packets over a switch.
Tool: a virtual machine running ospfd (Vyatta, OpenBSD), and a hub between switch and computer with a cable loop to cause the flood.
Countermeasure: ACL

For this one I didn’t use any specific tool. I just made my computer send out an OSPF Hello, and made sure the hub between computer and switch was wired so it would flood the frame. Result: spectacular. The switch CPU rises to 100% and management connections, including console, are dropped. Reason is that the OSPF process has higher priority. But now the shocking part: this was done on a layer 3 Cisco switch without OSPF configured, and without an IP address in the attacker VLAN.

Explanation: Cisco switches use something called pak_priority. Certain packets are labeled on ingress by the interface driver as priority, to be checked by the CPU. This is done to make sure network control packets reach the CPU in case of congestion. It’s the case for RIP, OSPF and EIGRP packets, but not for BGP packets.

I retried it with EIGRP (although this required a second Cisco device to generate the EIGRP hello) and the result is the same: no EIGRP configuration on the switch, still impact. The data plane is not affected: forwarding mostly continues as usual.

The solution? Strangely enough, an ACL on each port blocking EIGRP (IP protocol 88) and OSPF (IP protocol 89) and allowing everything else seems to work. The ACL is checked in hardware as long as the ‘log’ parameter isn’t present. So for better security, it seems you’re stuck with an ACL on each switchport of a layer 3 switch.
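A minimal sketch of such an ACL (the name and interface are hypothetical); IOS has ‘eigrp’ and ‘ospf’ as protocol keywords, so the protocol numbers don’t have to be typed out:

Switch(config)#ip access-list extended AL4-BLOCK-ROUTING
Switch(config-ext-nacl)#deny eigrp any any
Switch(config-ext-nacl)#deny ospf any any
Switch(config-ext-nacl)#permit ip any any
Switch(config-ext-nacl)#exit
Switch(config)#interface FastEthernet0/6
Switch(config-if)#ip access-group AL4-BLOCK-ROUTING in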

Conclusion
I’m sure most of the readers now conclude that there’s still a security leak somewhere in their network. Just for reference, I’ll include the CPU graph of an hour of testing all these attacks.

[Image: L2Attack-9]

In Europe, the cheapest WAN links start around 2 Mbps these days. While this makes some WAN optimizations covered in Cisco’s QoS guides unnecessary, it’s good to know them and the effects of slower links on traffic.

Serialization delay
Putting a frame on the wire from a switch or router takes time. The amount of time is directly related to the line speed of the link. Note ‘line speed’: the actual negotiated speed on layer 1. For a 1 Gbps interface negotiated to 100 Mbps full duplex and QoS rate-limited to 10 Mbps, the line speed is 100 Mbps. The formula is the following:

serialization delay (seconds) = frame size (bits) / line speed (bits per second)

This means that if you have a 1514-byte frame (the standard MTU of 1500 bytes plus the 14-byte layer 2 header) and send it out of a 100 Mbps interface, it will take (1,514 × 8)/(10^8) ≈ 0.121 ms, or 121 µs. If a small voice frame arrives in the egress queue of a switch or router, it can incur up to 121 µs of additional latency if a 1514-byte frame is just being sent out. Consequence: even under near-perfect conditions and a good QoS configuration where voice frames are given absolute priority over the network, there’s possible jitter per hop. The higher the general bandwidth throughout the network, the lower the jitter, so latency-sensitive traffic does benefit from high bandwidth and fewer hops. Over a 10 GE interface that same frame would be serialized in just 1.21 µs per hop.

There are some consequences for WAN links: at slower speeds, the serialization delay increases rapidly. At 2 Mbps for a 1514 byte frame it’s 6 ms. At 64 kbps, it’s 190 ms. And in case you’re enabling jumbo frames: 9014 bytes over 10 Mbps is 7.2 ms.

Link Fragmentation and interleaving
Generally, at 768 kbps and below, jitter for voice becomes unacceptable. This is where Link Fragmentation and Interleaving (LFI) comes in. It works by splitting up large frames into smaller parts and putting low-latency packets between these parts.

[Image: LFI]

Configuration of LFI on a Cisco router is as follows:

Router(config)#interface Multilink1
Router(config-if)#ip address 192.0.2.1 255.255.255.0
Router(config-if)#ppp multilink
Router(config-if)#ppp multilink fragment delay 3
Router(config-if)#ppp multilink interleave
Router(config-if)#ppp multilink group 1
Router(config-if)#exit
Router(config)#interface Serial0/0
Router(config-if)#encapsulation ppp
Router(config-if)#clock rate 768000
Router(config-if)#ppp multilink
Router(config-if)#ppp multilink group 1
Router(config-if)#exit

First, create a Multilink interface. It serves as an overlay for the actual physical interface, as this is where LFI is configured. The Multilink interface holds all the configuration: IP address, service policies,… everything except the layer 1 configuration (notice the clock rate command on the serial interface).

The ‘ppp multilink interleave’ activates LFI. The ‘ppp multilink fragment delay 3’ means LFI will automatically try to split up large packets so no packet has to wait longer than 3 ms while another is serialized. On the serial interface, encapsulation has to be set to ppp first. Next, it becomes possible to associate the interface with a Multilink overlay interface using the ‘ppp multilink group’ command.

The configuration has to be done on both sides of the WAN link, of course. The other side needs to use PPP encapsulation as well, and needs LFI enabled to reassemble the split-up packets.

This concludes the series of QoS articles on this blog. Up next, I’ll try out different attacks on a Catalyst switch and see how it reacts.

QoS part VII: wireless.

Back to the definition of QoS in part I: a mechanism to determine which packet to process next in case of unavailable resources. While ‘unavailable resources’ usually meant bandwidth (and ASIC throughput in the case of some 6500 line cards), for wireless the limiting factor is transmit opportunities.

Contention Window
Wireless is a half-duplex medium, and as such uses CSMA/CA to check whether the frequency is occupied by another wireless node before transmitting a frame. Before sending, the wireless node waits a random period of time, between a fixed back-off time and the Contention Window (CW). The time is expressed in ‘slots’; the 802.11 standard defines the slot duration per PHY (for example 9 µs for OFDM and 20 µs for legacy DSSS).

The CW has a minimum and a maximum value. Before a frame is transmitted, the wireless device waits the fixed back-off time plus a random extra time between zero and the minimum contention window, CWmin. If the transmission fails (due to a collision in the half-duplex medium), the process is retried, again with the fixed back-off time, but with a larger contention window. If transmission fails again, this continues until a set number of tries is reached, after which the frame is dropped. The CW increases with every attempt until it reaches a fixed maximum value: CWmax.

[Image: WifiQoS-1]

Using low CW values means low latency, but an increased chance of collisions. A higher CW means higher average latency and more possible jitter. CW is a number between 1 and 10; that number is not the number of slots, but the exponent in the formula 2^CW − 1 = slots. A CW of 3 is 7 slots (2^3 − 1), a CW of 5 is 31 slots (2^5 − 1). For example, frames transmitted with a fixed back-off time of 5 slots, a CWmin of 4 and a CWmax of 5 will see on average 12.5 slots of latency when there’s little traffic (5 + 15/2), and as traffic increases it can reach an average of 20.5 slots (5 + 31/2).

Traffic classes
This is where wireless QoS comes into play: the 802.11e standard allows traffic to be split into classes that each receive different back-off times and CWmin/CWmax values. Contrary to switches and routers, it’s a fixed number of classes with fixed names. Ironically, voice is given a CoS value of 6 in the wireless world, not 5. The table gives these values:

[Image: WifiQoS-2]

This table is directly from the Wikipedia page about 802.11e.

Configuration
The configuration on an Aironet is the same as on other Cisco devices as far as classification goes; the difference is in the output queueing with the contention windows. Let’s assume we want to classify an RTP voice stream, UDP port 16384, as CoS 6, and everything else as CoS 0:

AP1142N(config)#ip access-list extended AL4-RTP
AP1142N(config-ext-nacl)#permit udp any any eq 16384
AP1142N(config-ext-nacl)#exit
AP1142N(config)#class-map CM-RTP
AP1142N(config-cmap)#match access-group name AL4-RTP
AP1142N(config-cmap)#exit
AP1142N(config)#policy-map PM-Wifi
AP1142N(config-pmap)#class CM-RTP
AP1142N(config-pmap-c)#set cos 6
AP1142N(config-pmap-c)#exit
AP1142N(config-pmap)#class class-default
AP1142N(config-pmap-c)#set cos 0
AP1142N(config-pmap-c)#exit
AP1142N(config-pmap)#exit
AP1142N(config)#interface Dot11Radio 0.10
AP1142N(config-subif)#service-policy output PM-Wifi

Note that the policy is applied on a subinterface in the output direction: it’s for sending only, and on a per-SSID basis. This is not the case for the actual CW configuration: that goes on the main interface, because it governs how the radio itself transmits.

AP1142N(config)#interface Dot11Radio 0
AP1142N(config-if)#dot11 qos class best-effort local
AP1142N(config-if-qosclass)#fixed-slot 5
AP1142N(config-if-qosclass)#cw-min 3
AP1142N(config-if-qosclass)#cw-max 5
AP1142N(config-if-qosclass)#exit
AP1142N(config-if)#dot11 qos class voice local
AP1142N(config-if-qosclass)#fixed-slot 1
AP1142N(config-if-qosclass)#cw-min 1
AP1142N(config-if-qosclass)#cw-max 3

The classes have fixed names which can’t be changed. CoS 0 maps to the best-effort class, CoS 6 maps to the voice class. By giving lower values to the voice class, voice frames will on average experience less latency and be transmitted sooner. They are also more likely to be transmitted at all, because the back-off timer expires faster. Result: even when the wireless network is dealing with a lot of traffic, voice frames will be transmitted faster and with less jitter.

On to a bigger platform: the 6500 series, Cisco’s flagship Campus LAN switch. Unlike the previously discussed platform, the 6500 series uses the older Weighted Round Robin (WRR) queueing mechanism, and uses CoS internally to put packets in queues.

Queues and thresholds
The capabilities per port differ per line card, and unlike the 3560/3750 series, a 6500 uses multiple ASICs per line card.

Switch#show interfaces GigabitEthernet 1/2/1 capabilities | include tx|rx|ASIC
Flowcontrol:                  rx-(off,on,desired),tx-(off,on,desired)
QOS scheduling:           rx-(1q8t), tx-(1p3q8t)
QOS queueing mode:    rx-(cos), tx-(cos)
Ports-in-ASIC (Sub-port ASIC) : 1-24 (1-12)

The above output is from a WS-X6748-GE-TX line card. The ‘1p3q8t’ for egress (tx) means one fixed priority queue and three normal queues, each with eight thresholds. The fixed priority queue cannot be changed to a normal queue: if a packet is in that queue, it will be transmitted next.

[Image: QoS9]

The ingress ‘1q8t’ means there is one ingress queue with eight thresholds. Unlike the 3560/3750, there is some oversubscription on the line card. It has two ASICs, one per 24 ports (the line card is 48 ports total). Each of these ASICs has a 20 Gbps connection to the 6500 backplane. If all 24 gigabit ports together on that part of the line card start receiving more than 20 Gbps of traffic, the ASIC and backplane connection will not be able to handle all the traffic. Granted, this is a rare event: 24 Gbps maximum throughput on a 20 Gbps capable ASIC is an oversubscription of 1.2 to 1. But in case this happens, the different thresholds can help decide which traffic to drop. However, to drop on ingress, the decision must be made on existing markings.  The ASIC does classification and remarking, and the ingress queue is before the ASIC. This is not a problem usually, since classification and marking is best done at the access layer and 6500s are best used for distribution and core layer.

Switch#show interfaces TengigabitEthernet 1/9/1 capabilities | include tx|rx|ASIC
Flowcontrol:                  rx-(off,on),tx-(off,on)
QOS scheduling:           rx-(1p7q2t), tx-(1p7q4t)
QOS queueing mode:    rx-(cos,dscp), tx-(cos,dscp)
Ports-in-ASIC (Sub-port ASIC) : 1-8 (1-4)

The WS-X6716-10GE, a 10 GE line card, has different queues, especially for ingress. This line card has a high oversubscription of 4:1 and one ASIC per eight ports, for a total of two ASICs on the 16-port line card. This means that, while eight ports can deliver up to 80 Gbps, the ASIC and backplane connection behind them are still just 20 Gbps. The ASIC is much more likely to get saturated, so ingress queueing becomes important here. The fixed priority queue allows some traffic to be handled by the ASIC with low latency, even when it is saturated.

I’m only going to explain these two line cards, the rest is similar. A full list with details per line card can be found here. The logic is similar to the 3560/3750 platform: configure the buffers and the thresholds, but this time for both ingress and egress. First the ingress queue on the gigabit interface. The ingress queue has no buffer sizing command, as this line card has only one ingress queue.

Switch(config)#interface Gi1/2/1
Switch(config-if)#rcv-queue threshold 1 65 70 75 80 85 90 95 100
Warning: rcv thresholds will not be applied in hardware.
To modify rcv thresholds in hardware, all of the interfaces below
must be put into ‘trust cos’ state:
Gi1/2/1 Gi1/2/2 Gi1/2/3 Gi1/2/4 Gi1/2/5 Gi1/2/6 Gi1/2/7 Gi1/2/8 Gi1/2/9 Gi1/2/10 Gi1/2/11 Gi1/2/12
Switch(config-if)#

That configures the eight thresholds for the first and only queue: threshold 1 at 65%, threshold 2 at 70%, and so on. Note the warning: for ingress queueing, existing CoS markings have to be trusted. Also, remember that the 3560/3750 ingress and buffer allocation commands work switch-wide, because that platform has one ASIC per switch. The X6748 line card in a 6500 has two ASICs, which for QoS are subdivided into two sub-ASICs, one per twelve ports. Applying a command that changes the ASIC QoS allocations means the command automatically applies to the other eleven interfaces on the same sub-ASIC as well.

Next, egress queueing. First configuring buffer allocations, next the thresholds for the first queue, similar to the ingress queue.

Switch(config-if)#wrr-queue queue-limit 70 20 10
Switch(config-if)#wrr-queue threshold 1 65 70 75 80 85 90 95 100
Switch(config-if)#exit

Again something special here: the buffer allocation command ‘wrr-queue queue-limit’ takes only three values despite there being four queues. That’s because queue 4, the priority queue, is a strict priority queue: any packet entering it will be serviced next. This means that if a lot of traffic ends up in the priority queue, it can starve the other queues, as those will no longer be serviced. The only way to counter this is to tightly control what ends up in that queue.

On to the 10 GE line card. First ingress, this time with a buffer command because there are multiple queues on ingress.

Switch(config)#interface Te1/9/1
Switch(config-if)#rcv-queue queue-limit 40 20 20 0 0 0 20
Warning: rcv queue-limit will not be applied in hardware.
To modify rcv queue-limit in hardware, all of the interfaces below
must be put into ‘trust cos’ state:
Te1/9/1 Te1/9/2 Te1/9/3 Te1/9/4
Switch(config-if)#
Switch(config-if)#rcv-queue threshold 2 80 100
Switch(config-if)#rcv-queue threshold 3 90 100
HW-QOS: Rx high threshold 2 is fixed at 100 percent

Propagating threshold configuration to:  Te1/9/1 Te1/9/2 Te1/9/3 Te1/9/4
Warning: rcv thresholds will not be applied in hardware.
To modify rcv thresholds in hardware, all of the interfaces below
must be put into ‘trust cos’ state:
Te1/9/1 Te1/9/2 Te1/9/3 Te1/9/4
Switch(config-if)#end

The 10 GE line card has one ASIC per eight ports, one sub-ASIC for QoS per four ports. It has two thresholds per queue. The rest isn’t any different from previous configurations.

Queue mappings
As mentioned already, the 6500 platform internally uses CoS to determine which queue (and thus threshold) a packet ends up in, although some newer line cards can work with both CoS and DSCP. The mappings are again similar to previous configurations:

Switch(config)#interface Gi1/2/1
Switch(config-if)#rcv-queue cos-map 1 4 5
Propagating cos-map configuration to:  Gi1/2/1 Gi1/2/2 Gi1/2/3 Gi1/2/4 Gi1/2/5 Gi1/2/6 Gi1/2/7 Gi1/2/8 Gi1/2/9 Gi1/2/10 Gi1/2/11 Gi1/2/12
Warning: rcv cosmap will not be applied in hardware.
To modify rcv cosmap in hardware, all of the interfaces below
must be put into ‘trust cos’ state:
Gi1/2/1 Gi1/2/2 Gi1/2/3 Gi1/2/4 Gi1/2/5 Gi1/2/6 Gi1/2/7 Gi1/2/8 Gi1/2/9 Gi1/2/10 Gi1/2/11 Gi1/2/12
Switch(config-if)#wrr-queue cos-map 2 3 6
Propagating cos-map configuration to:  Gi1/2/1 Gi1/2/2 Gi1/2/3 Gi1/2/4 Gi1/2/5 Gi1/2/6 Gi1/2/7 Gi1/2/8 Gi1/2/9 Gi1/2/10 Gi1/2/11 Gi1/2/12
Switch(config-if)#priority-queue cos-map 1 1 5
Propagating cos-map configuration to:  Gi1/2/1 Gi1/2/2 Gi1/2/3 Gi1/2/4 Gi1/2/5 Gi1/2/6 Gi1/2/7 Gi1/2/8 Gi1/2/9 Gi1/2/10 Gi1/2/11 Gi1/2/12

The following things are configured here: CoS 5 is mapped to ingress queue 1, threshold 4. Next, CoS 6 is mapped to egress queue 2, threshold 3. And the third command is the mapping of CoS 5 to the first (and only) priority queue, first threshold.

DSCP to CoS mapping
Mapping CoS to a queue is okay, but what if you’re using DSCP for marking? And what if you have access ports on the 6500? CoS is part of the 802.1Q header, so untagged frames on access ports don’t carry it. For this you can use a DSCP-to-CoS mapping. For example, to map DSCP EF to CoS 5 and DSCP AF41 to CoS 3:

Switch(config)#mls qos map dscp-cos 46 to 5
Switch(config)#mls qos map dscp-cos 34 to 3

Now packets incoming or remarked on ingress as DSCP EF will be treated as CoS 5 in the queueing.

Bandwidth sharing & random-detect.
There are no shaping commands on the 6500 platform, only sharing of bandwidth. Again, only three values are possible for the four queues, as the priority queue simply takes the bandwidth it needs. You can use shared weights with the ‘wrr-queue bandwidth’ command, but it’s clearer to add the ‘percent’ keyword and let the values total 100:

Switch(config-if)#wrr-queue bandwidth percent 80 10 10

80% for the first queue, 10% for each of the other two.

The 6500 platform also supports random early detection in hardware, a function borrowed from routers. It can be activated for a non-priority queue, for example the second queue:

Switch(config-if)#wrr-queue random-detect 2

The thresholds for RED can be modified using the ‘wrr-queue random-detect min-threshold’ and ‘wrr-queue random-detect max-threshold’ commands. They configure the thresholds (eight for the gigabit line card) with a minimum value, at which RED starts to drop some packets, and a maximum value, at which RED drops all packets entering the queue.
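A hedged sketch for the second queue (the percentages are hypothetical); each command takes the queue number followed by eight threshold values:

Switch(config-if)#wrr-queue random-detect min-threshold 2 40 50 60 70 80 90 100 100
Switch(config-if)#wrr-queue random-detect max-threshold 2 70 80 90 100 100 100 100 100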

Show command
So far I haven’t listed a ‘show’ command. This is because everything you need to know about a certain port is all gathered in one command: ‘show queueing interface’. It’s a command with a very long output, showing the queue buffers, thresholds and drops for both ingress and egress.

The DSCP to CoS mapping is switch-wide, so this is still a separate command:

Switch#show mls qos maps dscp-cos
Dscp-cos map:                                  (dscp= d1d2)
d1:d2 0   1   2   3   4   5   6   7   8   9
——————————————————
0 :    00 00 00 00 00 00 00 00 01 01
1 :    01 01 01 01 01 01 02 02 02 02
2 :    02 02 02 02 03 03 03 03 03 03
3 :    03 03 04 04 03 04 04 04 04 04
4 :    05 05 05 05 05 05 05 05 06 06
5 :    06 06 06 06 06 06 07 07 07 07
6 :    07 07 07 07

Again, d1 is the first digit, d2 the second: for DSCP 46, d1 is 4, d2 is 6.

While in part IV a router used software queues, this is not the case on a switch:

Switch(config-if)#service-policy output PM-Optimize
Warning: Assigning a policy map to the output side of an interface not supported

Why not? Because a switch forwards frames with ASICs, not with the CPU, which means queueing is done in hardware too. And because the hardware contains a fixed number of queues, configuration is not done with a policy-map, but with commands that manipulate these queues directly.

There are both ingress and egress queues, but this article will only explain egress queues, as ingress queueing has little relevance on a 3560/3750 platform. Also, I will only talk about DSCP values and ignore CoS, as this platform can use DSCP end-to-end. This article’s intent is to get a basic understanding of QoS on this platform. For a more detailed approach, this document in the Cisco Support community has proven very useful for me.

Queues and thresholds
The number of egress queues can be checked on a per-port basis:

Switch#show interfaces FastEthernet 0/1 capabilities | include tx|rx
Flowcontrol:              rx-(off,on,desired),tx-(none)
QoS scheduling:        rx-(not configurable on per port basis),
                                 tx-(4q3t) (3t: Two configurable values and one fixed.)

Notice the ‘4q3t’ number: this means the port supports for queues, each with three thresholds. Although the value can be checked on a per-port basis, the 3560/3750 series uses one ASIC for all it’s ports, so the number of queues and thresholds is the same on all ports.

[Image: QoS6]

The four queues are hard-coded: no more, no less. A queue can be left unused, but no extra queues can be allocated. The thresholds are used for tail drops (the dropping of a frame when the queue is full) and allow differentiation between traffic flows inside a queue.

An example: the third queue has thresholds at 80%, 90% and 100% (the third threshold is always 100% and can’t be changed). You put packets with DSCP values AF31, AF32 and AF33 in the third queue, but on different thresholds: AF31 on 3, AF32 on 2, AF33 on 1. The consequence is that packets with these DSCP values are put into the queue until it is 80% full (the first threshold). At that point, frames with DSCP AF33 are dropped, while the other two are still placed in the queue. If the queue reaches 90%, packets with AF32 are dropped as well, so the remaining 10% of the queue can only be filled with AF31-marked packets.
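In commands, the mapping part of that example would look something like this (a sketch using the ‘mls qos srr-queue output dscp-map’ command explained further below; AF31, AF32 and AF33 are DSCP 26, 28 and 30):

Switch(config)#mls qos srr-queue output dscp-map queue 3 threshold 3 26
Switch(config)#mls qos srr-queue output dscp-map queue 3 threshold 2 28
Switch(config)#mls qos srr-queue output dscp-map queue 3 threshold 1 30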

Each queue also has a buffer: the buffer size determines the number of packets a queue can hold. The allocation of these buffers can be checked:

Wolfberry#show mls qos queue-set 1
Queueset: 1
Queue     :       1       2       3       4
———————————————-
buffers   :        25      25      25      25
threshold1:    100     200    100    100
threshold2:    100     200    100    100
reserved  :      50      50      50      50
maximum   :  400     400    400    400

[Image: QoS7]

A little explanation: everything is percentages here. The ‘buffers’ line indicates how the buffers are allocated: by default 25% for each queue. The exact amount can’t be found in the data sheets and supposedly depends on the exact type of switch.

The ‘reserved’ line means how much of those buffers are actually guaranteed to the queue. By default 50% of 25%, so 12.5% of the buffer pool is actually reserved for one queue. The other 50% of the total buffer pool can be used for any of the four queues that needs it. If all queues are filled and need it, it ends up at 25-25-25-25 again.

The other three lines are relative to the reserved value. The default first threshold of 100% means traffic assigned to threshold 1 of queue 1 will be dropped as soon as the queue is filled to its reserved value (50% of its 25% share of the pool). The second threshold of queue 2, default 200%, means that queue can fill up to its full allocated value (100% of its 25% of the buffer pool). The maximum is the implicit third threshold, and is the maximum amount of buffer space that queue can use.

These values can all be changed with the ‘mls qos queue-set output’ command. For example, let’s allocate more buffers to the second queue, as the intention is to use it for TCP traffic later on. Let’s also change the other parameters: the queue will receive 50% of the buffer pool, 60% of this allocation will be reserved, and thresholds will be at 60% (100% of the reserved value), 90% (150% of the reserved value) and 120% (200% of the reserved value). Queue 1 receives 26% of the buffer pool, queues 3 and 4 each 12%. The thresholds for queue 1 also change, to 80% (160% of the reserved value), 90% (180% of the reserved value) and 100% (200% of the reserved value).

Switch(config)#mls qos queue-set output 1 buffers 26 50 12 12
Switch(config)#mls qos queue-set output 1 threshold 2 100 150 60 200
Switch(config)#mls qos queue-set output 1 threshold 1 160 180 50 200
Switch(config)#exit
Switch#show mls qos queue-set 1
Queueset: 1
Queue     :       1       2       3       4
———————————————-
buffers   :         26      50      12      12
threshold1:     160     100    100    100
threshold2:     180     150    100    100
reserved  :       50      60      50      50
maximum   :   200     200    400    400

[Image: QoS8]

Queue mappings
Now that the queues have been properly defined, how do you put packets in them? Well, assuming you’ve marked them as explained in part III, all packets have a DSCP marking. The switch automatically puts packets with a certain marking into a certain queue according to the DSCP-to-output-queue table:

Switch#show mls qos maps dscp-output-q
Dscp-outputq-threshold map:
d1 :d2    0       1        2         3         4         5         6         7         8         9
————————————————————
0 :    02-01 02-01 02-01 02-01 02-01 02-01 02-01 02-01 02-01 02-01
1 :    02-01 02-01 02-01 02-01 02-01 02-01 03-01 03-01 03-01 03-01
2 :    03-01 03-01 03-01 03-01 03-01 03-01 03-01 03-01 03-01 03-01
3 :    03-01 03-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01
4 :    01-01 01-01 01-01 01-01 01-01 01-01 01-01 01-01 04-01 04-01
5 :    04-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01 04-01
6 :    04-01 04-01 04-01 04-01

Again an explanation: this table shows which DSCP value maps to which queue and threshold. For example, DSCP 0 (first row, first column) maps to queue 2, threshold 1 (02-01). DSCP 46 (fifth row, seventh column) maps to queue 1, threshold 1 (01-01). To map DSCP values to a certain queue and threshold, use the ‘mls qos srr-queue output dscp-map’ command:

Switch(config)#mls qos srr-queue output dscp-map queue 2 threshold 2 34

This maps packets with DSCP value 34 to queue 2, threshold 2.

Bandwidth sharing and shaping
So the queues are properly configured, and packets are put into them according to their DSCP values… Just one more thing: what do the queues actually do? This is where the Shaped Round Robin (SRR) mechanism comes into play. It’s one of the few egress QoS configurations done on a per-port basis on the 3560/3750 platform. There are two commands: ‘srr-queue bandwidth shape’ and ‘srr-queue bandwidth share’, each followed by four values for the four queues.

The ‘shape’ command polices: it gives bandwidth to a queue, but at the same time limits that queue to that bandwidth. Ironically it’s an inverted scale: 25, for example, means 1 in 25 packets, or 4%. 5 means 1 in 5 packets, or 20% bandwidth. If a zero is used, that queue is not shaped. The ‘share’ command does not limit bandwidth and gives it in a relative scale: if the total of the queues is 20 and queue 1 has value 5, that’s 25% bandwidth. If the total is 50 and queue 1 has value 5, that’s 10% bandwidth.

Switch(config-if)#srr-queue bandwidth share 10 150 30 20
Switch(config-if)#srr-queue bandwidth shape 20 0 0 0

The above gives 5% of bandwidth to the first queue (one in 20 packets). The other three queues receive 75%, 15% and 10% respectively: 150+30+20 is 200 (the 10 is not counted here, because that queue is already shaped), and 150 of 200 is 75%, 30 of 200 is 15%, 20 of 200 is 10%. How the shaped and shared queues are counted together is not clear; after all, 105% of bandwidth is now allocated. But it would take all queues being filled at the same time to reach that situation.

Low latency queuing
And finally, the 3560/3750 allows for one priority egress queue. If a packet is placed in this queue, it will be sent out next, regardless of what’s in the other queues, until the queue reaches its maximum allowed bandwidth. This makes it ideal for voice and other low-latency traffic. By design, the priority queue has to be the first queue, so the command doesn’t take a queue number:

Switch(config-if)#priority-queue out

Sounds logical, right? You’re correct: absolutely not. Unfortunately, Cisco uses a different value system for each QoS command, and the only way to get a feel for them is to try them out. I hope this helps in understanding the workings of QoS in hardware. In the next article, we’ll review another platform.

Assuming you’ve marked packets on ingress as detailed in part III, it’s now time to continue to the actual prioritization. First a router: a router, e.g. a 2800 platform, forwards packets using the CPU and uses software queues for prioritization. This means packets are stored in RAM while they are queued, and the router configuration defines how many queues are used and which ones are given priority.

[Image: QoS3]

This queueing in RAM means you can customize the number of queues. By default there is only one queue, using the simple first-in, first-out (FIFO) method, but if there’s a need for different treatment of certain traffic classes, new queues can be allocated and given different parameters. While a policy-map allows a large array of commands, for basic QoS on ethernet three commands will do: ‘bandwidth’, ‘police’ and ‘priority’.

Bandwidth
The bandwidth parameter defines the amount of bandwidth a queue is guaranteed, configured in kbps. It does not set a limit: if the interface is not congested, the queue will receive all the bandwidth it needs. But in case of congestion, the bandwidth of the queue will not drop below the configured value.

Router(config)#class-map CM-FTP
Router(config-cmap)#match dscp af12
Router(config-cmap)#exit
Router(config)#policy-map PM-Optimize
Router(config-pmap)#class CM-FTP
Router(config-pmap-c)#bandwidth 10000
Router(config-pmap-c)#exit
Router(config-pmap)#exit
Router(config)#interface Eth0/0
Router(config-if)#service-policy output PM-Optimize
I/f FastEthernet0/0 class CM-FTP requested bandwidth 10000 (kbps), available only 7500 (kbps)

The configuration and error message above show a weak point: it’s easy to misjudge the amount of bandwidth available. The ‘bandwidth percent’ command makes this easier. Also, while it’s a 10 Mbps interface, it shows only 7.5 Mbps of available bandwidth. The reason is that by default only 75% of the interface bandwidth is used for QoS calculations; the rest is reserved for control traffic (OSPF, CDP,…). The ‘max-reserved bandwidth’ command on the interface can change this, and a modern high-speed interface will get by with a few percent for control traffic.

Router(config)#policy-map PM-Optimize
Router(config-pmap)#class CM-FTP
Router(config-pmap-c)#bandwidth percent 50
Router(config-pmap-c)#exit
Router(config-pmap)#exit
Router(config)#interface Eth0/0
Router(config-if)#max-reserved bandwidth 90

The above would guarantee a bandwidth of 4.5 Mbps for the class CM-FTP: 90% of the 10 Mbps interface is 9 Mbps, and 50% of that.

Police
The bandwidth guarantee of the ‘police’ command is the same as with the ‘bandwidth’ command. The only difference is that it is a maximum at the same time: even if there is no congestion on the link, bandwidth for the queue is still limited. It is configured in bps (not kbps), in increments of 8000: configuring ‘police 16200’ will actually configure ‘police 16000’. This can be useful: if there is no congestion, available bandwidth is divided evenly over the queues, except the ones that use policing.

Router(config)#class-map CM-Fixed
Router(config-cmap)#match dscp af13
Router(config-cmap)#exit
Router(config)#policy-map PM-Optimize
Router(config-pmap)#class CM-Fixed
Router(config-pmap-c)#police 32000

Priority
The ‘priority’ command is nearly equal to the bandwidth command. Also measured in Kbps, also a minimum guarantee of bandwidth. The difference is that this queue will always be serviced first, resulting in low-latency queueing. Even if packets are dropped due to congestion, the ones going through will have spent the least amount of time in a queue.

Router(config)#class-map CM-Voice
Router(config-cmap)#match dscp ef
Router(config-cmap)#exit
Router(config)#policy-map PM-Optimize
Router(config-pmap)#class CM-Voice
Router(config-pmap-c)#priority 32000

TCP optimization
So far, mainly latency-sensitive traffic like UDP voice has been given priority. But that doesn’t mean optimizations for TCP aren’t possible: FTP, or any other TCP protocol that uses windowing, behaves in a typical pattern on a congested link: windowing up to the point of congestion, losing frames and rewindowing to a smaller value, after which the process starts again.

[Image: QoS4]

If multiple similar TCP connections share a link, they tend to converge: when congestion occurs, the queue fills up, packets are eventually dropped, and many TCP connections rewindow to a lower value at the same time. The consequence is that the link is suddenly only partially used. It would be better if the rewindowing of each flow happened at different times, so there are no sudden drops in total bandwidth usage. This can be achieved with Random Early Detection (RED): by dropping some packets before the queues are full, some flows rewindow before the link is 100% full, avoiding further problems. RED starts working once the queue has filled beyond a certain percentage, and will only drop one in every x packets. A complete explanation of RED would take another article, but a simple and effective starting point is the following configuration:

Router(config)#policy-map PM-Optimize
Router(config-pmap)#class CM-TCP
Router(config-pmap-c)#random-detect dscp-based

The ‘dcsp-based’ parameter is optional, but will cause the router to follow the DSCP markings as explained in part II: AF11 has a lower drop probability than AF13, so packets with value AF13 will be dropped more often compared to AF11.

[Image: QoS5]

The result is more even distribution of bandwidth, and overall better throughput.

One last command can also help TCP: ‘queue-limit’. While the queue length for a priority queue is best kept low, TCP traffic is usually tolerant of latency: it’s better to have a packet in a queue than to have it dropped.

Router(config)#policy-map PM-Optimize
Router(config-pmap)#class CM-TCP
Router(config-pmap-c)#queue-limit 100

A larger queue in combination with RED allows for good throughput even under congestion. The default queue length is 64.

So that’s the basics for QoS on a router. Up next: different switch platforms, which all have their own different QoS mechanisms.