Most of the time, Wi-Fi engineers such as ourselves focus on over-the-air QoS since that is typically more critical to performance than wired QoS. However, a recent support incident highlighted the need for careful wired QoS policy definition, especially when supporting an LWAPP/CAPWAP wireless environment.
Shiny New Equipment
A recent project at our organization involved the deployment of several hundred new Cisco 3502 CleanAir access points, which run on the Cisco Unified Wireless Network using the CAPWAP protocol. (For an overview of the CAPWAP protocol, see my previous blog posts on the subject.)
This project involved replacing existing 802.11a/b/g "legacy" access points with new 802.11n access points, as well as installing a few hundred net-new APs. The replacement APs were to be installed and patched into the existing AP switch ports, while the new APs were to be patched into open switch ports in the existing data VLAN, which provides DHCP services. This would allow the new APs to be deployed with zero-touch configuration, simply taken out of the box and installed by the contractor, minimizing operational expense. After the net-new APs were installed and registered to the controller, an administrator would then move them to the management VLAN and apply the standard port configuration settings for APs in our environment.
Help Desk, "We Have a Problem"
However, almost immediately after installation of the new APs began, support tickets started rolling in. Users were reporting horribly slow wireless network performance, severe enough to make the network unusable.
A quick trip to the affected building (only 5 min. away) confirmed the issue. A simple ping from a wireless client to the default gateway would drop numerous packets, sometimes as much as 10% packet loss. And that was when the client was otherwise idle, with no other applications running. The issue would get even worse when attempting to load more traffic over the connection, such as pulling down a file over an SMB share or browsing a webpage with video content, with packet loss spiking to 25-30%. Clearly something was going on.
Sample pings from the distribution switch (housing the WiSM controller) to the wireless client showed the same symptoms in the reverse direction as well:
CAT6K#ping
Target IP address: 172.16.10.20
Repeat count : 100
Datagram size : 1400
Timeout in seconds :
Extended commands [n]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 100, 1400-byte ICMP Echos to 172.16.10.20, timeout is 2 seconds:
Success rate is 86 percent (86/100), round-trip min/avg/max = 1/7/84 ms
Initial suspicions fell on the wireless network; after all, with changes comes the opportunity for new problems. However, a parallel deployment of the new CleanAir APs in several other office buildings, as well as a previous warehouse deployment, had gone off without a hitch, and no issues were experienced in those locations. Numerous wireless packet captures were performed that showed little to no issues over-the-air. Issues were experienced on both 2.4 GHz and 5 GHz frequency bands, very few retransmissions were present, and no interference was observed. Configuration changes were backed out and testing was performed with the legacy APs, but the issue persisted.
Additionally, the packet loss experienced to/from wireless clients was not observed when communicating directly with the AP (ping, SSH, etc.). It appeared that the wired network was performing normally.
Even stranger, we had one brand-new 3502 access point that performed great, without issue. So we tried moving this good AP to a switch port where another AP experiencing problems had been connected. Still no issue with this single AP.
How Wired QoS Killed Our CAPWAP Tunnels
Reviewing the gathered data, we began investigating switches between the APs and controller for interface errors and packet drops. All counters appeared normal, and no dropped packets were seen in the queues. However, given the predictable pattern of packet loss (as shown above) the issue smelled of rate-limiting of some sort.
Our support case with Cisco was raised to TAC Escalation engineers (aka the Big Dogs in house), who proceeded to run through numerous hidden commands on our Cat6500 switches, looking at ASIC performance and low-level debugs.
Still finding nothing, we took a shot in the dark. We disabled QoS globally on one of the access layer switches that had APs attached with issues.
no mls qos
Immediately... multi-megabit performance! SMB file transfers took seconds where they had previously taken close to an hour. No packet loss. We had found our culprit!
(Some of you may be thinking, if they suspected QoS issues before, why wasn't this tested earlier? In a large enterprise, testing changes to an existing configuration in a production environment is risky business. Established processes governing change management don't exactly allow for this type of activity by administrators. It's one thing to have a back-out plan for new changes, but an entirely different scenario when changing established configuration baselines.)
Questions still remained. What QoS settings were causing the problems and why were the QoS queue counters not showing any dropped packets?
Analysis of the data also revealed that only CAPWAP encapsulated packets were being dropped, not packets or traffic destined directly to the access point outside of the tunnel (pings or SSH, as mentioned). So what is unique to the CAPWAP tunnel? Well, we know that CAPWAP uses UDP packets with destination ports 5246 and 5247 to the controller. But the source ports used by the APs are randomly selected ephemeral (high-numbered) ports above 1,024.
What other application traffic uses ephemeral UDP ports? ... Voice Bearer Traffic (RTP)! A quick review of the QoS policy revealed a fairly typical looking configuration:
ip access-list extended QOS_VOICE_BEARER
permit udp any range 16384 32767 any
permit udp any any range 16384 32767
class-map match-all MARK_EF
match access-group name QOS_VOICE_BEARER
police flow 128000 8000 conform-action set-dscp-transmit ef exceed-action drop
description **Wireless Management**
service-policy input MARK_VLAN_TRAFFIC
In the policy, voice bearer traffic is identified, in line with typical design guide best practices, by an ACL matching UDP ports 16,384 through 32,767. Matching traffic is then marked with the EF classification and policed per flow to 128 kbps, with excess traffic dropped.
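To get a feel for why a 128 kbps per-flow policer is so damaging, note that all client traffic tunneled between one AP and the controller shares a single UDP 5-tuple, so it looks like one flow. The following is a minimal Python sketch of a single-rate token-bucket policer using the rate (128000 bps) and burst (8000 bytes) from the `police flow` line above. It is a simplified model, not a description of how the Catalyst 6500 hardware actually implements policing, and the packet timings are hypothetical values chosen for illustration.

```python
def police(packet_times, packet_bytes, rate_bps=128_000, burst_bytes=8_000):
    """Simulate a simplified single-rate policer.

    Returns a (conformed, dropped) tuple of packet counts.
    """
    fill_rate = rate_bps / 8          # bytes of credit added per second
    tokens = float(burst_bytes)       # assume the bucket starts full
    last = None
    conformed = dropped = 0
    for t in packet_times:
        if last is not None:
            # Refill credit for the elapsed time, capped at the burst size.
            tokens = min(burst_bytes, tokens + (t - last) * fill_rate)
        last = t
        if tokens >= packet_bytes:
            tokens -= packet_bytes    # conform-action: transmit
            conformed += 1
        else:
            dropped += 1              # exceed-action: drop
    return conformed, dropped


# Hypothetical offered loads, both using 1400-byte packets:
busy = [i * 0.01 for i in range(100)]   # one packet every 10 ms (~1.1 Mbps)
idle = [i * 1.0 for i in range(100)]    # one packet per second (~11 kbps)

print("busy (conformed, dropped):", police(busy, 1400))
print("idle (conformed, dropped):", police(idle, 1400))
```

In this model, the light load passes untouched while the heavier load exhausts the burst allowance after the first few packets and then loses most of what follows, which is at least consistent with loss getting worse as more traffic is pushed through the tunnel.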
A quick verification of CAPWAP traffic from previous wired packet captures taken during troubleshooting efforts revealed overlapping port usage between the two applications.
And that one AP that never had an issue just chose a source port that was below the lower-bound of the ACL entry. There were likely other APs unaffected as well, but it would be highly variable based on the port chosen during the CAPWAP join process.
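The overlap can be illustrated with a minimal Python sketch that models the two entries of the voice bearer ACL (UDP source or destination port in 16384 through 32767) and tests a few AP source ports against it. The specific source ports below are hypothetical, chosen only to show ports below, inside, and above the range; the function name is likewise made up for this example.

```python
# UDP port range matched by the QOS_VOICE_BEARER ACL (16384..32767 inclusive).
VOICE_PORT_RANGE = range(16384, 32768)

# CAPWAP destination ports on the controller.
CAPWAP_CONTROL_PORT = 5246
CAPWAP_DATA_PORT = 5247


def matches_voice_acl(src_port: int, dst_port: int) -> bool:
    """Return True if a UDP packet would match either ACL entry.

    Entry 1 matches on source port, entry 2 on destination port.
    """
    return src_port in VOICE_PORT_RANGE or dst_port in VOICE_PORT_RANGE


# Hypothetical AP source ports, for illustration only.
for ap_src_port in (1500, 16384, 20000, 32767, 40000):
    hit = matches_voice_acl(ap_src_port, CAPWAP_DATA_PORT)
    verdict = "matched as voice" if hit else "not matched"
    print(f"AP source port {ap_src_port} -> {verdict}")
```

Because the ACL matches on either the source or the destination port, return traffic from the controller (source port 5247, destination set to the AP's ephemeral port) is caught by the second entry in the same way.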
The workaround was to re-enable QoS globally on the switch and to prevent the switch from re-classifying traffic from the AP ports. This is accomplished by trusting the DSCP value of packets received on switch ports connected to wireless access points, using the following command:
mls qos trust dscp
Additionally, the packet drops were eventually found by looking at the policed packet counter in the QoS NetFlow statistics on the switch.
A Cautionary Example
This incident highlights the importance of complete end-to-end QoS configuration for wireless networks. Although not directly a wireless issue, the wireless network relies on other network components to deliver traffic in a fashion congruent with unique application and traffic characteristics found in various wireless architectures.
Having a thorough understanding of both wireless and wired QoS configuration and best practices is critical for network engineers designing and deploying these systems. In addition, best practices don't always work in every environment. Sure, they are a rule-of-thumb, but engineers should examine their unique requirements and adjust accordingly.
For wireless networks, this means, at a minimum, that wired QoS policies should have explicit provisions to handle wireless traffic appropriately. This may be as simple as trusting QoS values coming out of the APs, as implemented with our workaround. Or it may mean rewriting the base QoS policy to ensure correct identification and classification of traffic. ACLs and port-matching are a broad brush that can easily snag victim application traffic. We will be reviewing the method(s) by which voice traffic is identified within our organization.
My recommendation: be fluent in end-to-end QoS and its best practices, and know the applications flowing across your network so you can make sound design decisions.
QoS is a tricky trickster. Put your game face on when dealing with him!