Wednesday, September 7, 2011

Microsoft Lync QoS

This week I'm back to my favorite topic, quality of service. Engineers across multiple teams at my current employer have been working on a project to enable wireless VoIP using softphones running Microsoft Lync on both Windows 7 and XP. This project has provided an opportunity to review how our organization handles multi-function wireless devices and network performance, with particular focus on quality of service mechanisms.

We've found some interesting things...

Wireless QoS - Not a "One-Size Fits All" Policy (Anymore)
Since Wi-Fi is a shared medium with control distributed to all active nodes in the system, proper network arbitration and performance are heavily dependent on client behavior. When dealing with QoS using 802.11e/WMM, this means accurate application traffic marking and queuing by client devices themselves. 802.11e prescribes 8 user priorities and 4 priority queues (access categories) to provide a base level of differentiated services for traffic (you can read more about this in my Wireless QoS 5-part series).
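
As a quick reference, the relationship between the eight user priorities and the four queues can be written out as a small lookup. This is only a sketch: the table follows the usual 802.11e/WMM assignment, and the names UP_TO_AC and access_category are mine, chosen for illustration.

```python
# 802.11e/WMM user priority (UP) to access category (AC) lookup.
UP_TO_AC = {
    1: "AC_BK", 2: "AC_BK",  # background
    0: "AC_BE", 3: "AC_BE",  # best effort
    4: "AC_VI", 5: "AC_VI",  # video
    6: "AC_VO", 7: "AC_VO",  # voice
}


def access_category(user_priority: int) -> str:
    """Return the WMM queue that a frame with this user priority lands in."""
    if user_priority not in UP_TO_AC:
        raise ValueError(f"user priority must be 0-7, got {user_priority}")
    return UP_TO_AC[user_priority]


for up in range(8):
    print(f"UP {up} -> {access_category(up)}")
```

Note that priorities 1 and 2 rank below 0, so the queues are not simply the priority number divided by two.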

Many wireless LAN vendors have only provided basic support for wireless QoS by reading the QoS values within frames and queuing traffic for downstream transmission to clients. Some vendors have begun going beyond the basics to provide customers with a feature called "Airtime Fairness", a set of proprietary extensions that give the infrastructure more in-depth control of clients. I particularly like how Devin Akin at Aerohive highlights that this feature is really manipulation of the environment to suit a specific policy, whether it be equitable between clients or not. However, these are still mainly downlink traffic mechanisms, and the uplink flow is still controlled by the client. (There are some exceptions to this that involve infrastructure vendors monkeying with client TCP windows, acknowledgments, etc., but let's not get into that now.)

Device Convergence
A Swiss-Army Knife of Capabilities
Traditionally, vendors have implemented QoS on a per-SSID basis. This worked decently once upon a time, when all devices had only a single purpose in life. It was fairly easy for administrators to segment data-only devices such as laptops from voice-capable devices such as IP phones. No problem. Everything in the SSID gets slapped with a QoS template and we're done.

But what happens now that we have veritable Swiss-Army devices that perform multiple functions? What SSID do we put those in? How do we differentiate network performance and QoS based on application flows rather than device or SSID?

The answer is that wireless clients and the network must both have more intelligence to handle dynamic QoS requirements. Device convergence and the use of multi-purpose devices eliminate the ability to effectively use static 1:1 QoS policies tied to wireless SSIDs. We need something better!

Microsoft Lync QoS - A Case Study
Microsoft Lync is a software platform for unified communications providing data, voice, and video collaboration on Windows workstations. Many organizations are exploring device convergence to expand capabilities available to all employees while controlling capital and recurring costs.

As part of our lab verification of Lync prior to production deployment, we tested Wi-Fi quality of service integration. Microsoft Windows Vista, Windows 7, and Server 2008 platforms support policy-based QoS, while older Windows XP systems only support two service levels with the QoS Packet Scheduler. I'll focus on Windows 7 workstations with the more robust policy-based QoS capabilities. We set up an Active Directory GPO to classify and mark all voice traffic coming from Lync with a DSCP value of 46 (expedited forwarding), which is a general best practice on IP-based networks (see RFC 3246 section 2.7) and conforms to Cisco's QoS Baseline recommendations.

The resulting traffic analysis verified that Lync correctly marked DSCP in the packets, but we also noticed that the layer 2 Class of Service (CoS) marking for 802.11e/WMM was set to 5. We expected to see 6, since the 802.11-2007 standard clearly states in table 9-1 that user priorities 6 and 7 are reserved for voice, while 4 and 5 are reserved for video (note that the Wikipedia entry on 802.11e contains incorrect data). This is distinctly in conflict with the 802.1p CoS values for wired LANs (using the revised 802.1Q-2005 values), which place voice traffic in priority 5.

Microsoft Lync marks layer 2 CoS values based on mapping IP Precedence values (3 most-significant bits in IP ToS header field)

Moreover, the mappings between layer 2 and layer 3 markings are distinctly different between vendors. Microsoft maps layer 2 CoS values, whether 802.1p or 802.11e, to the same set of layer 3 values based on IP Precedence (not DSCP) and does not differentiate between wired and wireless networks. Mappings in the opposite direction (DSCP to layer 2) also rely on only the IP Precedence values and not on the full DSCP codepoint. Therefore, a DSCP value of 46 is translated as an IP Precedence of 5 and a layer 2 priority of 5. Cisco, on the other hand, maps values differently for wired and wireless networks (see Table 2-7) to accommodate the variations between standards.
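
The precedence-based mapping described here amounts to keeping only the top three bits of the six-bit DSCP field. A minimal sketch, assuming that is all the Windows mapping does (the function names are mine, not a Microsoft API):

```python
def ip_precedence(dscp: int) -> int:
    """Top 3 bits of the 6-bit DSCP field, i.e. the legacy IP Precedence."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be 0-63, got {dscp}")
    return dscp >> 3


def tos_byte(dscp: int) -> int:
    """DSCP occupies the upper 6 bits of the IP ToS byte."""
    if not 0 <= dscp <= 63:
        raise ValueError(f"DSCP must be 0-63, got {dscp}")
    return dscp << 2


for name, dscp in (("EF", 46), ("CS6", 48), ("AF41", 34)):
    print(f"{name}: DSCP {dscp} -> ToS 0x{tos_byte(dscp):02X} "
          f"-> layer 2 priority {ip_precedence(dscp)}")
```

EF (binary 101110) collapses to 5, which is why the capture showed the voice frames in the video queue, while any DSCP from 48 through 55 collapses to 6.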

Differences in layer 2 Class of Service (CoS) values between IEEE standards
and layer 2 to layer 3 mapping implementations by vendors create complexity

The root of the issue is two-fold. First, the 802.1p and 802.11e QoS values are clearly in conflict. This makes accurate QoS implementation difficult to achieve because variations need to be dealt with correctly by each and every solution implementation. This introduces plenty of room for error. Second, Microsoft's QoS implementation in Windows maps DSCP to layer 2 CoS based on legacy IP Precedence values. It does not differentiate between wired and wireless network connections and adjust markings appropriately.

Network-Wide Ramifications
The standard practice of marking voice with DSCP 46 will result in the improper classification, marking and queuing of the traffic throughout the network.

First, wireless client transmissions (upstream) by the Windows workstations will get placed into the video queue and will not receive the appropriate contention window values. This will reduce the statistical advantage that voice frames receive for transmission over the air and could impact voice latency and jitter, especially in video-rich networks. As video adoption in the enterprise increases, especially with increasing mobile device usage, this could have serious effects on voice quality.
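
To put a rough number on that statistical advantage, here is a sketch comparing the average wait before a first transmission attempt in each queue. It assumes the default WMM station EDCA parameters for an OFDM PHY (aCWmin of 15); an AP can advertise different values, so treat the table and the helper name as illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdcaParams:
    aifsn: int   # fixed wait, in slots, before backoff starts
    cw_min: int  # initial contention window
    cw_max: int  # contention window ceiling after retries


# Assumed defaults for client stations; not read from any real AP.
DEFAULT_EDCA = {
    "AC_VO": EdcaParams(aifsn=2, cw_min=3, cw_max=7),
    "AC_VI": EdcaParams(aifsn=2, cw_min=7, cw_max=15),
    "AC_BE": EdcaParams(aifsn=3, cw_min=15, cw_max=1023),
    "AC_BK": EdcaParams(aifsn=7, cw_min=15, cw_max=1023),
}


def mean_first_attempt_slots(params: EdcaParams) -> float:
    """AIFSN slots plus the mean of a backoff drawn uniformly from
    0..cw_min. SIFS is ignored because it is common to every queue."""
    return params.aifsn + params.cw_min / 2


for ac, params in DEFAULT_EDCA.items():
    print(f"{ac}: {mean_first_attempt_slots(params):.1f} slots on average")
```

Under these assumptions a frame in the video queue waits about two slots longer on average than one in the voice queue, and its window grows to twice the size after collisions.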

Second, many wireless network infrastructure vendors do not provide the ability to inspect and re-classify traffic and are forced to trust the client markings. For example, the Cisco Unified Wireless Network simply enforces a maximum QoS value for each SSID that clients cannot exceed. If the SSID is configured for Platinum QoS, then the client marking cannot exceed the maximum and traffic will be translated based on the client marking. The CAPWAP tunnel will map the client's layer 2 value of 5 (video) to a DSCP value of 34 for the outer tunnel IP header. Once the traffic is de-encapsulated by the controller, it applies an 802.1p value of 4 based on the CAPWAP packet DSCP value 34 mapped back to a layer 2 wired value, while leaving the client packet's original DSCP value of 46 untouched. Given the best practice of configuring switches to trust DSCP from CAPWAP APs and trust CoS from controllers, this traffic will be mishandled by intermediate switches and routers as well. In fact, the default Cisco switch behavior (when QoS is enabled) is for DSCP transparency to be disabled. As a result, the trusted layer 2 value coming from the WLC overrides the original client DSCP marking, which is re-written to 32 based on the default switch CoS-to-DSCP mapping.
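
The chain of translations is easier to follow when traced step by step. The sketch below only encodes the table entries quoted in this paragraph; the dictionaries and the trace function are my own illustration, not Cisco configuration or an API.

```python
# Only the entries mentioned in the text are filled in.
WMM_UP_TO_CAPWAP_DSCP = {5: 34}  # AP: client layer 2 value -> outer tunnel DSCP
CAPWAP_DSCP_TO_8021P = {34: 4}   # WLC: tunnel DSCP -> wired 802.1p value
SWITCH_COS_TO_DSCP = {4: 32}     # switch: trusted CoS -> rewritten DSCP


def trace(client_up: int, client_dscp: int, dscp_transparency: bool = False) -> dict:
    """Follow one upstream client packet from the AP to the first switch."""
    tunnel_dscp = WMM_UP_TO_CAPWAP_DSCP[client_up]
    wired_cos = CAPWAP_DSCP_TO_8021P[tunnel_dscp]
    # With transparency disabled (the default described above), the switch
    # rewrites the inner DSCP from the CoS value it trusts.
    final_dscp = client_dscp if dscp_transparency else SWITCH_COS_TO_DSCP[wired_cos]
    return {
        "capwap_outer_dscp": tunnel_dscp,
        "wlc_8021p": wired_cos,
        "dscp_after_switch": final_dscp,
    }


print(trace(client_up=5, client_dscp=46))
```

A Lync packet that left the application as DSCP 46 therefore arrives in the core as DSCP 32 unless transparency is enabled or the mappings are changed.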

Third, advanced wireless voice control features will be ineffective since the voice traffic is not properly identified within the voice queue. This includes TSPEC, call admission control (CAC), traffic bandwidth reservation, expedited bandwidth requests to facilitate emergency 911 calls, and voice stream metrics collection and reporting. Many of these features are only available to traffic streams within the voice queue. Additionally, off-channel scanning used by RRM and rogue detection is configured to defer if traffic in user priorities 4, 5, or 6 has been received in the last 100ms. If this policy has been changed to only defer for priority 6 traffic, it could negatively impact Lync traffic in priority 5 by increasing network re-transmissions and network latency.
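
The scan deferral rule can be sketched as a simple check against the time each priority was last seen. The function name, the data structure, and the millisecond timestamps are hypothetical; only the priorities and the 100ms window come from the description above.

```python
DEFAULT_DEFER_PRIORITIES = frozenset({4, 5, 6})


def should_defer_scan(last_rx_ms: dict, now_ms: int,
                      defer_priorities=DEFAULT_DEFER_PRIORITIES,
                      window_ms: int = 100) -> bool:
    """True if any deferring priority was received within the window.

    last_rx_ms maps a user priority to the time (ms) it was last received.
    """
    return any(
        up in defer_priorities and now_ms - seen_ms <= window_ms
        for up, seen_ms in last_rx_ms.items()
    )


# A Lync call marked with priority 5, last frame received 20ms ago.
lync_call = {5: 980}
print(should_defer_scan(lync_call, now_ms=1000))
print(should_defer_scan(lync_call, now_ms=1000, defer_priorities=frozenset({6})))
```

With the default set the AP stays on channel for the call; with the policy narrowed to priority 6 it goes off-channel in the middle of it.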

Finally, WAN bandwidth across the network could be negatively impacted. Typically, network administrators will design WAN circuits to reserve bandwidth for a specific number of voice calls (based on Erlang calculations) as well as enforce a per-call bandwidth limitation based on concurrent call volume. Since the voice calls will not be placed into the voice queue, no traffic admission or policing can occur. Should the environment grow to a point where concurrent voice calls exceed the design, WAN bandwidth may not be sufficient to handle the additional calls, yet the network will not be able to restrict call admission, resulting in poor voice performance.
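
For readers who have not done the Erlang sizing before, here is a rough sketch using the standard Erlang B recursion. The offered load, blocking target, and per-call bandwidth are made-up inputs for illustration, not figures from our environment.

```python
def erlang_b(offered_erlangs: float, circuits: int) -> float:
    """Blocking probability for a given offered load and call capacity."""
    blocking = 1.0  # with zero circuits every call is blocked
    for n in range(1, circuits + 1):
        blocking = offered_erlangs * blocking / (n + offered_erlangs * blocking)
    return blocking


def circuits_needed(offered_erlangs: float, max_blocking: float) -> int:
    """Smallest number of concurrent calls that meets the blocking target."""
    if offered_erlangs <= 0 or not 0 < max_blocking < 1:
        raise ValueError("need offered_erlangs > 0 and 0 < max_blocking < 1")
    n = 0
    while erlang_b(offered_erlangs, n) > max_blocking:
        n += 1
    return n


# Hypothetical site: 10 erlangs of busy-hour voice, 1% blocking target,
# 100 kbps reserved per call.
calls = circuits_needed(offered_erlangs=10.0, max_blocking=0.01)
print(f"reserve for {calls} calls, {calls * 100} kbps in the voice queue")
```

That reservation only protects calls that actually land in the voice queue, which is exactly what breaks when the traffic is classified as video.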

Let's also not forget the risk of human error somewhere down the road. Assuming documentation is in proper order, policy accommodations have been made throughout the network, and engineer transitions include adequate knowledge transfer, a clear risk still exists that future changes will inadvertently affect voice traffic since it isn't supposed to be within the video queue.

Integrating Lync Into a Broader Network Architecture
Our networks aren't created in a vacuum; they are a shared resource, and engineers must design them to handle the varying capabilities and demands of all connected endpoints and traffic flows. So, what options are available to mitigate this issue and effectively handle Microsoft Lync voice traffic on a converged wireless network?

We could ignore the problem and apply the standard voice QoS value of DSCP 46 (EF) in our Windows policy. This will result in accurate marking and traffic handling for wired clients, but suffers the ramifications previously described for wireless clients and broader network resource impacts. Hardly an optimal solution.

Instead, consider applying a non-standard DSCP marking. Using this approach we use policy-based QoS on Windows platforms to mark voice traffic as DSCP 48 (CS6), which allows it to be mapped to the correct wireless layer 2 CoS value of 6 and be queued correctly for transmission by the client.

The problem is that this DSCP value is reserved for Internetwork Control traffic by most networking vendors (including Cisco). Network administrators will need to ensure that wired and wireless integration is configured correctly, otherwise incorrect classification could propagate throughout the network for these traffic flows. For Cisco wireless networks this means that switch ports should be configured to trust DSCP from APs and CoS from controllers. This trust, coupled with the Cisco wireless mappings between CoS and DSCP, ensures that the correct DSCP value of 46 (EF) will be used both in the CAPWAP tunnel and re-written by the switch (based on WLC CoS trust) on the client packet once forwarded upstream out of the controller.

If using another wireless vendor, engineers should review the QoS capabilities of the solution and integration options to ensure accurate handling of this traffic. For example, many vendors support robust inspection and classification of traffic flows to match configured QoS policy directly in the access point. In these instances, configure APs to identify voice traffic and apply QoS based on defined policy.
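
In vendor-neutral terms, the re-classification step boils down to identifying the voice flow by something other than its marking and then rewriting DSCP to the network's standard value. The sketch below assumes voice can be recognized by a UDP source port range; the range, the Packet type, and the remark function are all hypothetical, and a real deployment would use whatever match criteria the platform and the Lync port configuration provide.

```python
from dataclasses import dataclass, replace

EF = 46   # standard voice marking used inside the network
CS6 = 48  # marking applied by the client so frames reach the voice queue

# Hypothetical audio port range; substitute the range configured for Lync.
VOICE_UDP_PORTS = range(20000, 20040)


@dataclass(frozen=True)
class Packet:
    protocol: str
    src_port: int
    dscp: int


def remark(pkt: Packet) -> Packet:
    """Rewrite identified voice traffic to EF; leave everything else alone."""
    is_voice = pkt.protocol == "udp" and pkt.src_port in VOICE_UDP_PORTS
    return replace(pkt, dscp=EF) if is_voice else pkt


print(remark(Packet("udp", 20010, CS6)))
print(remark(Packet("tcp", 443, 0)))
```

The closer to the AP this happens, the fewer devices ever see the client's CS6 marking.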

This configuration will also cause incorrect markings when the workstations are connected to a wired network, which is still a likely occurrence for laptops in most environments. Network administrators should ensure that switches are configured to strip the client QoS markings on switch ports throughout the network, which is standard practice. This way the switch will ignore the client markings and re-apply QoS policy using traffic inspection techniques. Ethernet switching largely eliminates medium contention concerns on wired networks, so having an incorrect policy on the first-hop upstream is not a large concern.

I prefer to implement this solution. With proper end-to-end network QoS implemented, accurate traffic handling can be accomplished on both wired and wireless networks.

Revolution or Evolution? - Andrew's Take
Clearly something went wrong with standards-based QoS development. Divergent layer 2 QoS definitions exist from the IEEE standards. Furthermore, no standard mappings exist between layer 2 and layer 3 QoS values, as evidenced by differences in vendor implementations.

As mobile device convergence proliferates and wired and wireless networks continue to become more closely integrated, consistent end-to-end QoS policy definition and traffic handling will be critical in supporting increasing network demands.

Microsoft Lync highlights these challenges. Increasingly, organizations are looking for converged solutions to provide improved business capabilities and service. And many client platforms won't have as robust control or policy definition options as Windows. How will your organization handle voice, video and data over iOS, Android, Blackberry, Windows Phone 7, and other platforms?

Cheers,
Andrew



9 comments:

  1. Andrew,

    Great article and summary on QoS needs for the wired/wireless to handle apps like Microsoft Lync. Quite a bit of marketing lingo but I thought you might be interested in our (Aruba) integration best practices with Microsoft Lync. We aim to overcome some of the challenges you highlighted with signature matching on encrypted Lync sessions to enable application awareness - then apply FW, QoS (marking or re-marking when necessary) policies... delay scanning, enable CAC.. and some more. Let me know what you think. For us the biggest puzzle was to figure out exactly which packets belong to a Lync session running on a laptop... after we figured out how to identify them in a long stream of packets from high density of clients, things got much easier. We are applying that same technique to BlackBerry MVS encrypted sessions, Apple FaceTime traffic, etc. Here is the Lync paper: http://bit.ly/prLSrH

    (via @ozwifi - Sorry, I accidentally deleted this comment in the moderation queue and couldn't get it back in its original form. Andrew)

  2. Thanks Andrew, for the recovery! :)

  3. I must have missed something.
    I work with Motorola WLAN equipment. On it, you can create ACLs that may include 802.1p/DSCP tags as match conditions and re-mark them as an ACL action. You can also alter the default WMM<->802.1p<->DSCP mapping table in order to maintain consistent L2/L3 QoS mapping across all of your network.
    All of this looks pretty generic to me and, basically, would solve the Lync case (alter the WLAN infrastructure QoS mapping table to match the LAN infrastructure mapping table, capture and re-mark Lync packets).
    Does Cisco not have it, or is there some point that I'm missing (like, you're pointing out that network integrators SHOULD keep this issue on the checklist and not just simply trust default settings)?

  4. Hi Arsen,
    Many vendors offer granular QoS policy definition and control. Unfortunately, Cisco does not. Having that level of control does make it easier to integrate wireless and wired policies to ensure end-to-end consistency, and is definitely a good feature to have.

    However, this does not negate the need for proper over-the-air scheduling and queuing performed by the wireless client itself. Arguably, appropriate traffic handling over the air is the most critical piece since RF is a shared medium. Microsoft Windows and Lync QoS policy settings on the client need to be accurately defined, but the way DSCP to CoS mapping is implemented by Microsoft presents opportunity for mis-configuration if network administrators are not careful to verify behavior. This is especially true since the default behavior does not conform to wireless networking "best practices".

    Thanks,
    Andrew

  5. Andrew, thanks for explanation. 100% agree.

  6. Hi Andrew, thanks for taking the time to write this article. I have a question.

    Firstly, my environment is MS Lync marking voice packets with EF (46), currently being handled correctly by our Cisco LAN/WAN for wired clients. For our wifi clients we have Aruba infrastructure. I want to switch on WMM on the controller and for the Cisco LAN port to take the EF (46) marked voice packet from the client.
    If we do as you suggest :-
    'Instead, consider applying a non-standard DSCP marking. Using this approach we use policy-based QoS on Windows platforms to mark voice traffic as DSCP 48 (CS6), which allows it to be mapped to the correct wireless layer 2 CoS value of 6 and be queued correctly for transmission by the client.' as you rightly state, this would affect the wired connection, so it is not really an option as the clients are used on both wired and wireless connections. In which case, are you implying there is no way around this to enable WMM to correctly interpret our EF (46) markings in this scenario?

    Replies
    1. I recommend using DSCP 48 because in Windows the Layer 2 QoS is mapped from the Layer 3 DSCP value, and Layer 2 over-the-air link is a shared medium, so it's more important to ensure prioritized delivery from the client over the air than it is over a wired Ethernet link at Layer 2.

      You will need to couple this with network infrastructure policies on upstream equipment (wireless AP, switch, router, etc.) that inspect the traffic and re-write the correct Layer 3 DSCP value before forwarding the packet deeper into the network. This will be true for both wired and wireless traffic from clients. For example, in a Cisco wireless or wired network you would have to do this with a policy-map on a layer 3 switch or a router. On an Aerohive wireless network you could do this right in the AP with a QoS Classification Map and Marker Map. You just need to re-classify and mark the traffic once it has passed over the air, ideally as close to the AP as possible, which is why I prefer Aerohive QoS implementation because the policy is defined and implemented right in the AP at the edge, without having to traverse layer 2 switches to get to a layer 3 device before the correct QoS is re-applied.

      Cheers,
      Andrew

  7. Hi Guys,
    A bit of a simplistic view as I am not as technical as you guys, but this solution has worked very well for me in very high density and application demand environments.

    I have recently discovered Meru which takes the client roaming decision away from the client by creating a virtual cell, therefore allowing efficient air traffic control of QOS, and because it can be deployed on a single channel you can deploy access points close to each other to ensure maximum throughput, and if required layer other channels up like a cat switch.

    Like I said it may be too simplistic, but a thought.

    Doug - Prodec Networks

  8. Hi Doug,

    Deploying Meru APs close to each other does not ensure maximum system throughput. When APs are on the same channel, EVEN IF COORDINATED, they either: 1) interfere with each other due to CCI, or 2) have to coordinate which AP will transmit. So, either you have collisions or you have a TDMA-like environment. Neither situation enhances system throughput, but deploying more APs in this manner means 1) more AP cost, 2) more Ethernet ports, 3) more PoE power, 4) more Ethernet cables, 5) more controllers with AP and feature licenses, 6) more time to deploy.

    Other drawbacks of Meru's Virtual Cell (aka: Single Channel Architecture) include 1) lack of scaling, and 2) being mandatorily controller-based.

    Meru's SCA doesn't positively or negatively impact the QoS topics within this blog, as best I can tell. All of the same issues would still apply to Meru, just as they do with Cisco, Aerohive, or any other. The one issue to consider, like Andrew mentioned, is whether QoS is done in the AP or in a controller. It's always better at the AP.

    Thanks!
    Devin
