Wednesday, November 17, 2010

Cisco WLC QoS Profile Bugs

There is a pretty devious little bug lurking in the Cisco WLC QoS profile settings that can lead to problematic traffic forwarding for users and a severe disruption to the user experience.

Recently, during a guest wireless deployment, our wireless engineering team designed a solution to minimize the potential impact of guest Wi-Fi devices on our production network by applying QoS bandwidth contracts to guest users. This is accomplished through the Wireless > QoS > Profiles section of the wireless controller configuration, as shown below.


When this QoS profile is applied to a WLAN, each user's bandwidth is limited to the specified average and burst data rates (in Kbps). Additionally, the wired QoS protocol tagging feature uses 802.1p for layer 2 class of service integration with the wired network switches.
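
For the CLI-inclined, the same settings can be made from the controller command line. The exact syntax varies between AireOS releases (later code adds per-SSID/per-client and upstream/downstream keywords), so treat the following as a rough sketch of the 7.0-era commands rather than a copy-and-paste recipe; the WLAN ID and rate values are placeholders, not recommendations:

    config wlan disable 3
    config qos average-data-rate bronze 512
    config qos burst-data-rate bronze 1024
    config qos protocol-type bronze dot1p
    config qos dot1p-tag bronze 1
    config wlan qos 3 bronze
    config wlan enable 3
    show qos bronze

Keep in mind that some releases also insist the 802.11a and 802.11b/g networks be disabled before a QoS profile can be edited, so plan this kind of change for a maintenance window.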

Once implemented, we began experiencing random TCP sessions dropping as well as random UDP packet loss. Scratching our heads, we started digging into what had changed recently. Besides upgrading to 7.0.98.0 code, the only other change was enforcement of QoS bandwidth contracts using the Bronze QoS profile applied to the guest wireless WLAN. It also happened that the guest user base was the only group reporting problems. They were having IPSec and SSL VPN sessions drop numerous times throughout each day. Definitely not a productive environment for the many corporate business partners we rely on to help us get work done.

Many packet traces at *multiple* points throughout the network later, we found that return packets to clients were going missing, apparently dropped by the local controller. We expected that tracing packets inside both CAPWAP and Mobility Ether-IP tunnels would be fairly complex, but were pleasantly surprised to see that Wireshark has protocol dissectors for both protocols! Yay Wireshark!
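
If you need to do the same digging from the command line, tshark applies the same dissectors. A rough sketch, assuming a reasonably recent tshark build (older builds use -R instead of -Y for display filters); the capture file names and client address are placeholders:

    # Client traffic inside the AP-to-controller CAPWAP data tunnel (UDP 5247)
    tshark -r local_controller.pcap -Y "capwap && ip.addr == 10.20.30.40"

    # The same client inside the Ether-IP (IP protocol 97) mobility tunnel toward the anchor
    tshark -r dmz_anchor.pcap -Y "etherip && ip.addr == 10.20.30.40"

Comparing what leaves one capture point but never shows up at the next is how you narrow the drop down to a single device.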

Also, having a network of remote sniffers deployed at various points throughout your network is a godsend for remote configuration and packet capture capabilities. Thank you to our internal performance services team for having this capability! I can only imagine how painful it would have been to physically visit each point in the network to set up a SPAN port or another manual sniffer solution.

So, knowing the local controller (as opposed to the DMZ anchor, or any other network equipment) was dropping the packets, it was a fairly simple matter of figuring out what changed to cause the issue. Rather than downgrading code, we first reverted the configuration to its state prior to the problems by removing the QoS bandwidth contracts. Voila, the problem magically disappeared.
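
Backing the contracts out is just a matter of setting the per-user rates on the profile back to 0, the default "off" value; a hedged sketch of the 7.0-era CLI, again with a placeholder WLAN ID:

    config wlan disable 3
    config qos average-data-rate bronze 0
    config qos burst-data-rate bronze 0
    config wlan enable 3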

Searching the Cisco documentation revealed no clues or warnings around best-practice use of this feature (normally, if Cisco does not want a customer to modify a default value without contacting TAC or Advanced Services, they will explicitly state that in the configuration guide). No warnings. Okay, so what is going on here? Why is this feature not working correctly? Hmmm... BUG?!

How about that, numerous bugs exist for the QoS profile settings for 802.1p tagging and bandwidth contracts. Here's an interesting one:

CSCsz20162 - WLC5500: QoS rate limiting feature is not accurate
This bug was resolved in 7.0.98.0 code, or was it?

Additionally, here are some other bugs that were not fixed until code versions after 7.0.98.0:

CSCth94887 - 5500: 802.1p markings not working for pings, EoIP and FastCache mode
CSCth90962 - Document QoS 802.1p tagging blocks traffic on untagged interfaces
CSCte64638 - Clients cannot ping or get IP address with 802.1p QoS profile
CSCti62070 - Per-User bandwidth contract blocks all traffic when set to 0
CSCte53175 - Per-User bandwidth contract blocks all traffic when set to 0

None of these directly relate to our issue, but there are enough bugs on the feature to make me think I may have found a new one. Additionally, since our issue cleared right up when removing the bandwidth contracts, a bug seems to be the likely bet here.

I also find it interesting that Cisco's Bug Toolkit does not include lookups for their 5508 series wireless controllers. So I'm stuck looking up bugs for the 4400 series, hoping that everything affecting 5508s also affects 4400s.

As Ethan Banks posted over in his blog, engineers are constantly opening support cases with vendors for bugs. "All the bugs. All the time, bugs. Bugs and bugs and bugs. Buggy bug bugs. AHHH!" I can relate to that.


So, what have we learned from this experience?
  1. Document all changes to systems. There is a reason that change management practices and processes exist in most organizations. Use it, live by it, it will save your butt.
  2. Test your changes prior to production implementation. We did test this change and still failed to catch the issue, so we will be updating our testing procedures. We catch most issues with testing, but some still fall through the cracks. Learn from them and update your testing plans accordingly.
  3. Prepare troubleshooting tools ahead of time. Without remote packet capture capability, it might have taken us 10x longer to figure out where the issue was occurring. We had new addressing space, routing, firewalls, proxies, etc. all involved that I didn't discuss for brevity, but the problem could have been anywhere. Take a proactive approach to deploying troubleshooting and monitoring tools. That way, when a problem comes up you can react quickly and execute an emergency response plan. They do this for fire drills and emergency responders; take the time to do it for your network!
  4. Admit mistakes, take responsibility, and fix the issue. I've seen too many individuals in IT attempt to deny issues and gloss over relevant information for fear of looking bad, incompetent, or just plain "not perfect" at what they do. Some even go so far as to lie about log data (or selectively focus on data that reinforces their position while discounting data that does not). This leads to longer issue resolution times and worse business impact, and once the root cause is determined, which it will be, they end up losing the credibility and trust of everyone involved. Instead, present all data that could possibly be involved in the issue up front, and if it is your issue, take responsibility. Everyone makes mistakes or simply cannot do a *perfect* job. Even the best engineers, including CCIEs (yes, you're fallible too)!
For the time being, I would recommend avoiding per-user bandwidth contracts in your wireless LAN controller QoS profiles. 


Best of luck out there with your bugs ;)
Andrew

9 comments:

  1. Awesome tip! I've only *read* that per-user bandwidth restrictions were possible, never had the need to implement one. Now I know to steer people away from that type of configuration!

  2. Nice post! Agree with you about points 1-4. Document, test, have the right tools and be accountable. Having worked in Cisco TAC, I know bugs do exist ;-)

  3. Thank you for this article. We have spent the better part of 2 weeks with Cisco trying to figure out the drops. Come to find out, it was the QoS settings that were migrated from our 2106 controllers. We turned QoS off and our problems went away.

  4. Anybody tried this with the latest code 7.0.116.0?

  5. I just read through the release notes for 7.0.116.0 and didn't see anything about this being fixed. Too bad, I was just about to turn on QoS for my guest wireless network.

    http://www.cisco.com/en/US/docs/wireless/controller/release/notes/crn7_0_116_0.html

  6. Hi Luke,
    The primary bug that I referenced, CSCsz20162, was resolved in version 7.0.98.0 code.

    To check on bug ID statuses, you should use the Bug Toolkit which can be found under the Support Tools on Cisco's website.

    Cheers,
    Andrew

  7. Sadly, release notes do not always reflect all fixed bugs.

    Thanks
    -Van

  8. It still doesn't work. I had the same problem.
    Running 7.0.230.0

  9. Running 7.1.91.0.

    Still an issue. 56/108/56/108 were the values I used on our Guest SSID at Bronze QoS. Web Auth/passthrough. You can connect and accept on the Guest warning page, but then you can't do anything and it drops you almost immediately.

    Change it back to Silver and it works fine...
