Recently, during a guest wireless deployment, our wireless engineering team designed a solution to minimize the potential impact of guest Wi-Fi devices on our production network by applying QoS bandwidth contracts to guest users. This is accomplished through the Wireless > QoS > Profiles section of the wireless controller configuration, as shown below.
When this QoS profile is applied to a WLAN, each user's bandwidth is limited to the specified average and burst data rates (in Kbps). Additionally, the wired QoS protocol tagging feature uses 802.1p for layer 2 class of service integration with the wired network switches.
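To make the idea of an average rate plus a burst allowance concrete, here is a minimal sketch of a generic token-bucket policer in Python. This is not how the controller implements bandwidth contracts (that is not documented here); the class name, the parameters, and the choice to treat the burst setting as a bucket depth in kilobits are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Generic token-bucket policer (illustrative only, not Cisco's code)."""

    rate_kbps: float    # assumed: long-term average rate allowed per user
    burst_kbits: float  # assumed: bucket depth, i.e. how much may pass back-to-back
    tokens: float = field(init=False)     # available credit, in kilobits
    last_time: float = field(init=False)  # timestamp of the previous packet, in seconds

    def __post_init__(self) -> None:
        # Start with a full bucket so an idle client can burst immediately.
        self.tokens = self.burst_kbits
        self.last_time = 0.0

    def allow(self, now: float, packet_bytes: int) -> bool:
        """Return True to forward the packet, False to drop it."""
        # Refill credit for the elapsed time, capped at the bucket depth.
        elapsed = now - self.last_time
        self.tokens = min(self.burst_kbits, self.tokens + elapsed * self.rate_kbps)
        self.last_time = now

        cost = packet_bytes * 8 / 1000  # packet size in kilobits
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


# Hypothetical contract: 1000 Kbps average, 24 kilobits of burst credit.
bucket = TokenBucket(rate_kbps=1000, burst_kbits=24)
burst_results = [bucket.allow(0.0, 1500) for _ in range(3)]
after_idle = bucket.allow(1.0, 1500)
```

With these made-up numbers, two full-size 1500-byte packets pass back-to-back, the third is dropped, and traffic flows again once the bucket has refilled. A policer like this drops excess packets rather than queuing them, which is why a misbehaving rate limiter can look like random packet loss to the client.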
Once this was implemented, we began experiencing random TCP session drops as well as random UDP packet loss. Scratching our heads, we started digging into what had changed recently. Besides upgrading to 18.104.22.168 code, the only other change was enforcement of QoS bandwidth contracts using the Bronze QoS profile applied to the guest WLAN. It also happened that guest users were the only ones reporting problems. Their IPsec and SSL VPN sessions were dropping numerous times throughout each day. Definitely not a productive environment for the many corporate business partners we rely on to help us get work done.
Many packet traces at *multiple* points throughout the network later, we found return packets to clients missing, and apparently being dropped by the local controller. We expected that tracing packets in both CAPWAP and Mobility Ether-IP tunnels would be fairly complex, but were pleasantly surprised to see that Wireshark had protocol dissectors for both protocols! Yeah Wireshark!
Also, having a network of remote sniffers deployed at various points in your network is a godsend for remote configuration and packet capture. Thank you to our internal performance services team for having this capability! I can only imagine how painful it would have been to physically visit each point in the network to set up a SPAN port or some other manual sniffer solution.
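The comparison across capture points boils down to a simple question: walking along the path a packet should take, where is the last place it was seen and the first place it was not? A rough Python sketch of that logic follows. The function name, the capture point names, and the packet identifiers are hypothetical, and extracting identifiers from real capture files (for example from fields Wireshark decodes inside the tunnels) is left out.

```python
from typing import Hashable, Optional, Sequence, Set, Tuple


def locate_drop(
    path: Sequence[Tuple[str, Set[Hashable]]],
    packet_id: Hashable,
) -> Optional[Tuple[Optional[str], str]]:
    """Bracket the point where a packet disappears.

    `path` lists capture points in the direction the packet travels, each
    paired with the set of packet identifiers seen there. Returns
    (last point that saw the packet, first point that did not), or None
    if every capture point saw it.
    """
    last_seen: Optional[str] = None
    for name, seen_ids in path:
        if packet_id not in seen_ids:
            return last_seen, name
        last_seen = name
    return None


# Hypothetical return-traffic path toward a guest client. Identifiers here
# are made-up (IP ID, TCP sequence number) pairs.
return_path = [
    ("dmz_anchor_uplink", {(101, 5000), (102, 6460)}),
    ("local_controller_wired_side", {(101, 5000), (102, 6460)}),
    ("ap_switchport", {(101, 5000)}),
]

dropped_at = locate_drop(return_path, (102, 6460))
delivered = locate_drop(return_path, (101, 5000))
```

In this made-up data the second packet reaches the wired side of the local controller but never shows up at the AP switch port, which points at the controller as the device dropping it.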
So, knowing the local controller (as opposed to the DMZ anchor, or any other network equipment) was dropping the packets, it was a fairly simple matter of figuring out what changed to cause the issue. Rather than downgrading code, we first reverted the configuration to its state prior to the problems by removing the QoS bandwidth contracts. Voila, the problem magically disappeared.
Searching the Cisco documentation for this feature revealed no clues or warnings around best-practice use (normally, if Cisco does not want a customer to modify a default value without contacting TAC or Advanced Services, they will explicitly state that in the configuration guide). No warnings, okay. So what is going on here? Why is this feature not working correctly? Hmmm.... BUG?!
How about that, numerous bugs exist for the QoS profile settings for 802.1p tagging and bandwidth contracts. Here's an interesting one:
CSCsz20162 - WLC5500: QoS rate limiting feature is not accurate
This bug was resolved in 22.214.171.124 code, or was it?
Additionally, here are some other bugs not fixed until code versions after 126.96.36.199 (all fixes apply to a later release of code):
CSCth94887 - 5500: 802.1p markings not working for pings, EoIP and FastCache mode
CSCth90962 - Document QoS 802.1p tagging blocks traffic on untagged interfaces
CSCte64638 - Clients cannot ping or get IP address with 802.1p QoS profile
CSCti62070 - Per-User bandwidth contract blocks all traffic when set to 0
CSCte53175 - Per-User bandwidth contract blocks all traffic when set to 0
None of these directly relate to our issue, but there are enough bugs on the feature to make me think I may have found a new one. Additionally, since our issue cleared right up when removing the bandwidth contracts, a bug seems to be the likely bet here.
I also find it interesting that Cisco's Bug Toolkit does not include lookups for their 5508 series wireless controllers. So I'm stuck looking up bugs for the 4400 series, hoping that everything affecting 5508s also affects 4400s.
As Ethan Banks posted over in his blog, engineers are constantly opening support cases with vendors for bugs. "All the bugs. All the time, bugs. Bugs and bugs and bugs. Buggy bug bugs. AHHH!" I can relate to that.
So, what have we learned from this experience?
- Document all changes to systems. There is a reason that change management practices and processes exist in most organizations. Use them, live by them; they will save your butt.
- Test your changes prior to production implementation. We did test this change and still failed to catch the issue. But we will update our testing procedures. We catch most issues with testing, but some still fall through the cracks. Learn from them and update your testing plans accordingly.
- Prepare troubleshooting tools ahead of time. Without remote packet capture capability, it might have taken us 10x longer to figure out where the issue was occurring. We had new addressing space, routing, firewalls, proxies, etc. all involved that I didn't discuss for brevity, so the problem could have been anywhere. Take a proactive approach to deploying troubleshooting and monitoring tools. This way, when a problem comes up you can react quickly and execute an emergency response plan. Emergency responders run fire drills for exactly this reason; take the time to do the same for your network!
- Admit mistakes, take responsibility, and fix the issue. I've seen too many individuals in IT attempt to deny issues and gloss over relevant information for fear of looking bad, incompetent, or just plain "not perfect" at what they do. Some even go so far as to lie about log data (or selectively focus on data that reinforces their position while discounting data that does not). This leads to longer issue resolution times and worse business impact, and once the root cause is determined, which it will be, they end up losing the credibility and trust of everyone involved. Instead, present all data that could possibly be involved in the issue up front, and if it is your issue, take responsibility. Everyone makes mistakes or simply cannot do a *perfect* job. Even the best engineers, including CCIEs (yes, you're fallible too)!
Best of luck out there with your bugs ;)