Friday, January 7, 2011

[QoS] It's Tricky, Tricky, Tricky, Tricky...

You may have noticed somewhat of a recurring theme across several of my posts - Quality of Service. Since wireless networks are inherently a shared medium, and with Wi-Fi in particular using distributed contention protocols (DCF, EDCA), it stands to reason that implementing QoS controls and having some form of differentiated access to the network is just a bit more critical than on a switched LAN.

Most of the time, Wi-Fi engineers such as ourselves focus on over-the-air QoS since that is typically more critical to performance than wired QoS. However, a recent support incident highlighted the need for careful wired QoS policy definition, especially when supporting an LWAPP/CAPWAP wireless environment.

Shiny New Equipment
A recent project at our organization involved the deployment of several hundred new Cisco 3502 CleanAir access points which run on the Cisco Unified Wireless Network using the CAPWAP protocol. (For an overview of the CAPWAP protocol, see my previous blog posts here, here, here, here, and here.)

This project involved replacing existing 802.11a/b/g "legacy" access points with new 802.11n access points, as well as installing a few hundred net-new APs. The replacement APs were to be installed and patched into the existing AP switch ports, while the net-new APs were to be patched into open switch ports in the existing data VLAN, which provides DHCP services. This would allow the new APs to be deployed with zero-touch configuration, simply taken out of the box and installed by the contractor, minimizing operational expense. After the net-new APs were installed and registered to the controller, an administrator would then move them to the management VLAN and apply our standard AP port configuration settings.

Help Desk, "We Have a Problem"
However, almost immediately after the new APs began to be installed, support tickets started rolling in. Users were reporting horribly slow wireless performance, severe enough to make the network unusable.

A quick trip to the affected building (only 5 min. away) confirmed the issue. A simple ping from a wireless client to the default gateway would drop numerous packets, sometimes as much as 10% packet loss, and that was with the client otherwise idle, no other applications running. The issue got even worse when loading more traffic onto the connection, such as pulling down a file over an SMB share or browsing a webpage with video content, with packet loss spiking upwards of 25-30%. Clearly something was going on.

Sample pings from the distribution switch (housing the WiSM controller) to the wireless client showed the same symptoms in the reverse direction as well:
CAT6K#ping        
Protocol [ip]:
Target IP address: 172.16.10.20
Repeat count [5]: 100
Datagram size [100]: 1400
Timeout in seconds [2]:
Extended commands [n]:
Sweep range of sizes [n]:
Type escape sequence to abort.
Sending 100, 1400-byte ICMP Echos to 172.16.10.20, timeout is 2 seconds:
!!!!!!.!!!!!!.!!!!!!.!!!!!!.!!!!!!.!!!!!!.!!!!!!.!!!!!!.!!!!!!.!!!!!!.
!!!!!!.!!!!!!.!!!!!!.!!!!!!.!!
Success rate is 86 percent (86/100), round-trip min/avg/max = 1/7/84 ms
CAT6K#

Initial suspicions fell on the wireless network; after all, with change comes the opportunity for new problems. However, a parallel deployment of the new CleanAir APs in several other office buildings, as well as a previous warehouse deployment, had gone off without a hitch. Numerous wireless packet captures showed little to no issues over the air: the problem appeared on both the 2.4 GHz and 5 GHz bands, very few retransmissions were present, and no interference was observed. Configuration changes were backed out and testing was performed with the legacy APs, but the issue persisted.

Additionally, the packet loss experienced to/from wireless clients was not observed when communicating directly with the AP (ping, SSH, etc.). It appeared that the wired network was performing normally.

Even stranger, we had one brand-new 3502 access point that performed great, without issue. So we tried moving this good AP to a switch port where another AP experiencing problems had been connected. Still no issue with this single AP.

How Wired QoS Killed Our CAPWAP Tunnels
Reviewing the gathered data, we began investigating switches between the APs and controller for interface errors and packet drops. All counters appeared normal, and no dropped packets were seen in the queues. However, given the predictable pattern of packet loss (as shown above) the issue smelled of rate-limiting of some sort.

Our support case with Cisco was raised to TAC Escalation engineers (aka the Big Dogs in house), who proceeded to run through numerous hidden commands on our Cat6500 switches, looking at ASIC performance and low-level debugs.

Still finding nothing, we took a shot in the dark. We disabled QoS globally on one of the access layer switches that had APs attached with issues.

no mls qos

Immediately... multi-megabit performance! SMB file transfers took seconds where they had previously taken close to an hour. No packet loss. We had found our culprit!

(Some of you may be thinking, if they suspected QoS issues before, why wasn't this tested earlier? In a large enterprise, testing changes to an existing configuration in a production environment is risky business. Established processes governing change management don't exactly allow for this type of activity by administrators. It's one thing to have a back-out plan for new changes, but an entirely different scenario when changing established configuration baselines.)

Questions still remained. What QoS settings were causing the problems and why were the QoS queue counters not showing any dropped packets?

Analysis of the data also revealed that only CAPWAP-encapsulated packets were being dropped, not traffic destined directly to the access point outside of the tunnel (pings or SSH, as mentioned). So what is unique to the CAPWAP tunnel? Well, we know that CAPWAP uses UDP packets with destination ports 5246 and 5247 to the controller. But the source ports used by the APs are randomly selected ephemeral (high-numbered) ports above 1,024.
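
(For reference, that tunnel traffic can be matched with an ACL along these lines; this is purely an illustrative sketch using the well-known CAPWAP ports, and the ACL name is made up rather than part of our production configuration.)

ip access-list extended CAPWAP_TUNNEL_EXAMPLE
 permit udp any any eq 5246
 permit udp any any eq 5247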

What other application traffic uses ephemeral UDP ports? ... Voice Bearer Traffic (RTP)! A quick review of the QoS policy revealed a fairly typical looking configuration:
ip access-list extended QOS_VOICE_BEARER
 permit udp any range 16384 32767 any
 permit udp any any range 16384 32767


class-map match-all MARK_EF
  match access-group name QOS_VOICE_BEARER


policy-map MARK_VLAN_TRAFFIC
  class MARK_EF
     police flow 128000 8000 conform-action set-dscp-transmit ef exceed-action drop


interface Vlan589
 description **Wireless Management**
 service-policy input MARK_VLAN_TRAFFIC

In the policy, voice bearer traffic is identified, in line with typical design guide best practices, by an ACL matching UDP ports 16,384 through 32,767. Matching voice traffic is then marked with the EF classification and policed to 128 kbps.

A quick verification of CAPWAP traffic from previous wired packet captures taken during troubleshooting efforts revealed overlapping port usage between the two applications.
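
(If you want to reproduce that check, a Wireshark display filter along these lines will surface the AP-to-controller CAPWAP data packets whose source port lands inside the voice bearer range; the only environment-specific values are the port numbers already discussed.)

udp.dstport == 5247 && udp.srcport >= 16384 && udp.srcport <= 32767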


And that one AP that never had an issue had simply chosen a source port below the lower bound of the ACL's port range. There were likely other unaffected APs as well; with roughly a quarter of the ephemeral range above 1,024 falling inside that ACL, it came down to the luck of the draw each time an AP went through the CAPWAP join process.

The workaround was to re-enable QoS globally on the switch and prevent the switch from re-classifying traffic from the AP ports. This is accomplished by trusting the DSCP value of packets received on switch ports connected to wireless access points, using the following command:

mls qos trust dscp
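
(In context, an AP-facing access port ends up looking something like the following; the interface number and description are placeholders rather than our actual configuration, with the trust statement being the key line.)

interface GigabitEthernet1/0/1
 description **Wireless Access Point**
 switchport mode access
 mls qos trust dscp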


Additionally, the packet drops were eventually found by looking at the policed packet counter through the QoS netflow statistics on the switch.


A Cautionary Example
This incident highlights the importance of complete end-to-end QoS configuration for wireless networks. Although not directly a wireless issue, the wireless network relies on other network components to deliver traffic in a manner that accounts for the unique application and traffic characteristics of the various wireless architectures.

Having a thorough understanding of both wireless and wired QoS configuration and best practices is critical for network engineers designing and deploying these systems. In addition, best practices don't always work in every environment. Sure, they are a rule-of-thumb, but engineers should examine their unique requirements and adjust accordingly.

For wireless networks, this means, at a minimum, that wired QoS policies should have explicit provisions to handle wireless traffic appropriately. This may be as simple as trusting QoS values coming out of the APs, as implemented in our workaround. Or it may mean re-writing the base QoS policy to ensure correct identification and classification of traffic; ACLs and port matching are a broad brush that can easily snag innocent application traffic. We will be reviewing the method(s) by which voice traffic is identified within our organization.
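
As a rough sketch of what such a re-write could look like (illustrative only, not our final policy), simply carving the CAPWAP control and data ports out of the voice bearer ACL would have prevented this particular collision:

ip access-list extended QOS_VOICE_BEARER
 remark exclude CAPWAP tunnel traffic (UDP 5246/5247)
 deny   udp any any eq 5246
 deny   udp any any eq 5247
 deny   udp any eq 5246 any
 deny   udp any eq 5247 any
 permit udp any range 16384 32767 any
 permit udp any any range 16384 32767

A tighter approach still would scope the match to known voice subnets or the voice VLAN rather than relying on port ranges alone.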

My recommendation: be fluent in end-to-end QoS and its best practices, and know the applications flowing across your network so you can make sound design decisions.

QoS is a tricky trickster. Put your game face on when dealing with him!

Cheers,
Andrew

12 comments:

  1. Just wanted to say that I really enjoy reading your blog. The info I take away from it is great, and I learn a lot about the more structured, corporate environment. Thanks for sharing. dropops

    ReplyDelete
  2. Yeah, me too. I've learned a lot, and I'm glad I'm kind of in a smaller environment! I probably would've just burned the place down...

    ReplyDelete
  3. Great explanation! Shows how just about any change, no matter how simple, can have consequences!

    Any idea why this problem hadn't been experienced with the previous APs?

    Also, out of curiosity, what class of service should be used for CAPWAP?

    ReplyDelete
  4. Andrew,

    Thanks for sharing your experience with that high level of detail!

    ReplyDelete
  5. Actually, the problem did exist previously; it just wasn't as severe or noticeable. Once we identified the issue, we tested with the previous 1242 a/g access points and found the issue existed with those as well.

    My guess is that the change from 11a/g to 11n, coupled with the Ethernet signalling change from 100 Mbps to Gigabit, was enough to make the QoS policing on the switch more dramatic. The newer 3502 APs, using Gigabit signalling, hit the 128 kbps policing threshold more quickly than the 100 Mbps APs did, despite sending the same traffic over the wire. This happens because policing breaks down the per-second CIR (committed information rate) into a series of intervals within that second.

    This is a tough topic to understand, since policing does not actually slow down the signalling rate on the wire.
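
    To put rough numbers to it (my own back-of-the-napkin math using the 128 kbps / 8,000-byte values from the policy in the post, so treat it as illustrative): the policing interval works out to Tc = Bc / CIR = 8,000 bytes / 16,000 bytes-per-second = 0.5 seconds. Only about 8 KB of CAPWAP traffic conforms in each half-second window regardless of link speed; a Gigabit-attached AP simply burns through that allowance, and starts getting dropped, that much sooner.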

    See this great video by Kevin Wallace, starting at about the 3:00 minute mark. The 5:00 mark highlights the timing interval for measurements very well.

    http://www.youtube.com/watch?v=dSEEwHCvOnA

    Cheers,
    Andrew

    ReplyDelete
  6. That's a great story Andrew.

    I had a question concerning Cisco's implementation of CAPWAP since I've only had the opportunity to work with Motorola. If a wireless client transmits a DiffServ marked packet (say IP phone) will the Cisco AP mark the CAPWAP packet with the same DiffServ "priority" as the internally tunneled traffic?

    Thanks for sharing!

    ReplyDelete
  7. Nice article! A good wireless engineer can't stop where the wire starts.

    ReplyDelete
  8. Hi Michael,
    Wired QoS mappings for LWAPP and CAPWAP protocols are complicated to say the least.

    Essentially, lightweight APs in "Local" mode where all client data is tunneled to the controller only use DSCP for the outer LWAPP/CAPWAP tunnel. The value varies based on the contents.

    For control traffic between the AP and controller, a value of CS6 is used, except for image download which is marked best effort (0).

    For tunneled client data traffic, the WLAN QoS setting dictates the maximum allowed DSCP value for the tunnel IP header. The client's original packet DSCP value will be copied into the tunnel IP header, capped at the maximum value allowed by the WLAN QoS setting. A mapping of WLAN QoS settings to DSCP values can be found here:
    http://www.cisco.com/en/US/docs/solutions/Enterprise/Mobility/vowlan/41dg/vowlan_ch10.html#wp1065774

    If the AP is in H-REAP mode, then layer 2 CoS may be supported if VLAN support and tagging is enabled for the AP. The administrator must choose if CoS or DSCP should be trusted from the AP.

    Cisco controllers never modify the original client DSCP values. It is recommended to enable 802.1p support in the controller QoS profiles, and always trust layer 2 CoS. The WLC will perform similar "policing" of the CoS value as the AP does with DSCP values.

    Another good reference is here:
    http://www.cisco.com/en/US/docs/solutions/Enterprise/Mobility/emob41dg/ch5_QoS.html#wp1022491

    They've made it so difficult that it's ridiculous. A serious re-design is required.

    Cheers,
    Andrew

    ReplyDelete
  9. Hi,

    I'm having trouble with the session timeout between the WLC and AP.
    Our stateful firewall accepts the connection between the WLC and AP, but somehow it drops the replies from the WLC to the AP over UDP 5247, treating them as a new connection.

    When the AP initiates a connection over UDP 5247 to the WLC, it is accepted, as seen in my logs.
    But the firewall timeout is set to 1 minute; when the connection does not see a reply in time, the firewall closes it. The WLC does send a reply, but by then it is seen as a new connection.

    How can we change the session timeout for the UDP 5247 connection?

    thanks

    ReplyDelete
    Replies
    1. The UDP session timeout is dependent on your firewall configuration. You should reference the vendor documentation for your firewall to change that setting.

      For example on a Cisco ASA firewall with 8.0 code, the command syntax is
      timeout conn 1:00:00 half-closed 0:10:00 udp 0:02:00 icmp 0:00:02

      Andrew

      Delete
  10. Andrew,
    I don't understand why trusting the QoS on the switchport resolved the issue. The issue was related to the overlapping UDP ports of CAPWAP and RTP. Trusting the DSCP of the APs doesn't seem to fit the issue.

    Can you explain what I'm missing?

    Thanks,
    Ron

    ReplyDelete
    Replies
    1. Hi Ron,
      By trusting the DSCP markings on the switchport directly connected to the AP, we prevent the switch from re-classifying that traffic with the policy map. All inter-switch links must also be "trusted" end-to-end throughout the network; otherwise re-classification could occur at any point along the path.

      That was the short-term fix. The long-term fix was to also re-engineer the QoS policy so that we weren't relying solely on UDP port ranges to identify voice traffic. But that was a much longer, more involved process.

      Cheers,
      Andrew

      Delete