Aug 11 2009

Cisco UCS and Nexus 1000V design diagram with Palo adapter

This is a follow-up to, and an enhancement of, a previous design diagram in which I showed Cisco UCS running the standard VMware vSwitch.  In this post I am once again showing Cisco UCS utilizing the Cisco (Palo) virtualized adapter with an implementation of VMware vSphere 4.0. In this design, however, we are running ESXi and the Cisco Nexus 1000V distributed virtual switch (vDS).

The Cisco adapter on the UCS B-200 blade is using its Network Interface Virtualization (NIV) capabilities and presenting (4) virtual Ethernet NICs and (2) virtual Fibre Channel HBAs to the operating system, vSphere 4.0 ESXi.  The vSphere 4.0 hypervisor sees the virtual adapters as unique physical adapters and identifies them as VMNICs and VMHBAs.  The vSphere VMNICs are then associated to the Cisco Nexus 1000V software switch to be used as uplinks.  The NIV capabilities of the Cisco adapter allow the designer to use a familiar VMware multi-NIC design on a server that in reality has (2) 10GE physical interfaces with complete Quality of Service, bandwidth sharing, and VLAN portability among the virtual adapters.

Aside from visualizing how all the connectivity works, this diagram is also intended to illustrate some key concepts and capabilities.

Cisco Virtualization Adapter preserving familiar ESX multi-NIC network designs

In this design we have used the NIV capabilities of the Cisco “Palo” adapter to present multiple adapters to the vSphere hypervisor in an effort to preserve the familiar and well known (4) NIC design where (2) adapters are dedicated to VMs, and (2) adapters are dedicated to management connections. The vSphere hypervisor scans the PCIe bus and sees what it believes to be (4) discrete physical adapters, when in reality there is only (1) physical dual-port 10GE adapter. Just as we would with a server with (4) physical NICs, we can dedicate (2) virtual Ethernet adapters to the virtual machine traffic by creating a port profile called “VM-Uplink” and associating it to the Cisco adapter vNIC1 and vNIC2. Similarly, we can dedicate (2) virtual Ethernet adapters to the management traffic by creating a port profile called “System-Uplink” and associating it to the Cisco adapter vNIC3 and vNIC4.

We will configure the “VM-Uplink” port profile to only forward VLANs belonging to VMs, and configure the “System-Uplink” port profile to only forward VLANs belonging to management traffic.

Creating separate uplink Port Profiles for VMs and Management
Nexus1000V# config
Nexus1000V(config)# port-profile System-Uplink
Nexus1000V(config-port-prof)# capability uplink
Nexus1000V(config-port-prof)# vmware port-group
Nexus1000V(config-port-prof)# switchport mode trunk
Nexus1000V(config-port-prof)# switchport trunk allowed vlan 90,100,260-261
Nexus1000V(config-port-prof)# no shutdown
Nexus1000V(config-port-prof)# state enabled

Nexus1000V(config)# port-profile VM-Uplink
Nexus1000V(config-port-prof)# capability uplink
Nexus1000V(config-port-prof)# vmware port-group
Nexus1000V(config-port-prof)# switchport mode trunk
Nexus1000V(config-port-prof)# switchport trunk allowed vlan 10,20
Nexus1000V(config-port-prof)# no shutdown
Nexus1000V(config-port-prof)# state enabled

The VMware administrator will now be able to associate vmnic0 and vmnic1 to the “VM-Uplink” port group. Additionally, vmnic2 and vmnic3 can be associated to the “System-Uplink” port group. This action puts those NICs in the control of Nexus 1000V, which assigns them to a physical interface number: Eth1/1 for vmnic0, Eth1/2 for vmnic1, and so on.

Nexus 1000V VSM running on top of one of its own VEMs

In this diagram the UCS blade is running the Nexus 1000V VSM in a virtual machine connected to a VEM managed by the VSM itself.  Sounds like a chicken and egg brain twister, doesn’t it?  So how does that work?  Well, pretty simple actually.  We use the ‘system vlan’ command on the uplink port profile “System-Uplink”.  This allows the VLANs stated in this command to be up and forwarding prior to connecting with the VSM for ‘critical connections’ such as those needed to reach the VSM and other critical VMware management ports such as the VM Kernel.  We can also use the same ‘system vlan’ command on the port profiles facing the locally hosted VSM on this blade.

Identifying important management connections
Nexus1000V# config
Nexus1000V(config)# port-profile System-Uplink
Nexus1000V(config-port-prof)# capability uplink
Nexus1000V(config-port-prof)# system vlan 90,100,260-261
! These VLANs are forwarding on the uplink prior to locating the VSM

Nexus1000V(config)# port-profile VMKernel
Nexus1000V(config-port-prof)# switchport mode access
Nexus1000V(config-port-prof)# switchport access vlan 100
Nexus1000V(config-port-prof)# system vlan 100
! This allows access to VMKernel if VSM is down

Nexus1000V(config)# port-profile N1K-Control
Nexus1000V(config-port-prof)# switchport mode access
Nexus1000V(config-port-prof)# switchport access vlan 260
Nexus1000V(config-port-prof)# system vlan 260
! Allows vNICs for the VSM to be up prior to connecting to the VSM itself
! Do the same for the other VSM-facing port profiles, such as N1K-Packet
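
The VLAN lists above are easy to get out of sync between the ‘switchport trunk allowed vlan’ and ‘system vlan’ lines. As I understand it, a system VLAN on a trunk uplink also needs to be in that profile's allowed list, so a small offline check can help. This is only a sketch in Python; the function names are hypothetical helpers of mine and not part of any Cisco tool.

```python
def expand_vlans(spec):
    """Expand a VLAN list such as "90,100,260-261" into a set of VLAN IDs."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            vlans.update(range(int(first), int(last) + 1))
        else:
            vlans.add(int(part))
    return vlans


def system_vlans_not_allowed(allowed_spec, system_spec):
    """Return system VLANs that are missing from the trunk's allowed list."""
    return sorted(expand_vlans(system_spec) - expand_vlans(allowed_spec))


# Values taken from the System-Uplink profile in this post.
missing = system_vlans_not_allowed("90,100,260-261", "90,100,260-261")
print(missing or "all system VLANs are allowed on the trunk")
```

An empty result means every system VLAN is also carried on the uplink; anything else is a list of VLANs to double check.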

Virtual Port Channel “Host Mode” on the Nexus 1000V VEM uplinks (vPC-HM)

In this design the uplink port profiles “System-Uplink” and “VM-Uplink” are each establishing a single logical port channel interface to two separate upstream switches.  The two separate upstream switches in this case are (Fabric Interconnect LEFT) and (Fabric Interconnect RIGHT).  While the server adapter is physically wired to the UCS “Fabric Extenders” (aka IOM), the fabric extender is simply providing a remote extension of the upstream master switch (the Fabric Interconnect). Therefore the server adapter and Nexus 1000V VEM see themselves as being connected directly to the two Fabric Interconnects.  Having said that, the two Fabric Interconnects are not vPC peers, which would normally be required for them to share a single port channel facing a server or upstream switch.  So how does the Nexus 1000V form a single port channel across two separate switches not enabled for vPC?  This is done with a simple configuration on the Nexus 1000V called vPC-HM.

The Nexus 1000V VEM learns via CDP that Eth 1/1 and Eth 1/2 are connected to separate physical switches and creates a “Sub Group” unique to each physical switch.  If there are multiple links to the same physical switch they will be added to the same Sub Group.  When a virtual machine is sending network traffic the Nexus 1000V will first pick a Sub Group and pin that VM to it.  If there are multiple links within the chosen Sub Group the Nexus 1000V will load balance traffic across those links on a per-flow basis.
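
The sub-group behavior described above can be sketched in a few lines of Python. This is a toy model, not the VEM's actual implementation: the function names, the switch names, and the use of CRC32 for pinning and for per-flow hashing are all assumptions I made for illustration.

```python
from collections import defaultdict
from zlib import crc32


def build_sub_groups(cdp_neighbors):
    """Group uplinks by the upstream switch each one reports via CDP."""
    by_switch = defaultdict(list)
    for uplink in sorted(cdp_neighbors):
        by_switch[cdp_neighbors[uplink]].append(uplink)
    # Number the sub-groups in sorted switch-name order (an arbitrary choice).
    return {sg_id: by_switch[switch] for sg_id, switch in enumerate(sorted(by_switch))}


def pin_vm(vm_name, sub_groups):
    """Pin a VM to one sub-group; every flow from that VM stays there."""
    return crc32(vm_name.encode()) % len(sub_groups)


def pick_uplink(vm_name, flow, sub_groups):
    """Choose the uplink for one flow within the VM's pinned sub-group."""
    links = sub_groups[pin_vm(vm_name, sub_groups)]
    return links[crc32(repr(flow).encode()) % len(links)]


# Hypothetical CDP results: one uplink per Fabric Interconnect, as in the diagram.
neighbors = {"Eth1/1": "FI-LEFT", "Eth1/2": "FI-RIGHT"}
sub_groups = build_sub_groups(neighbors)
print(sub_groups)
print(pick_uplink("web-vm-01", ("10.0.0.5", "10.0.0.9", 49152, 443), sub_groups))
```

With one link per Fabric Interconnect, each sub-group holds a single uplink, so the per-flow hash only matters once a sub-group has more than one member link.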

Enabling vPC-HM on Nexus 1000V
Nexus1000V# config
Nexus1000V(config)# port-profile VM-Uplink
Nexus1000V(config-port-prof)# channel-group auto mode on sub-group cdp

Nexus1000V(config)# port-profile System-Uplink
Nexus1000V(config-port-prof)# channel-group auto mode on sub-group cdp

With this configuration the Nexus 1000V will automatically create two Port Channel interfaces and associate them to the chosen Port Profiles:

Nexus1000V# show run
! unnecessary output omitted

interface port-channel1
  inherit port-profile VM-Uplink

interface port-channel2
  inherit port-profile System-Uplink

Cisco Virtualization Adapter per vNIC Quality of Service

Our multi-NIC design is enhanced by the fact that Cisco UCS can apply different Quality of Service (QoS) levels to each individual vNIC on any adapter. In this design, the virtual adapters vNIC3 and vNIC4 dedicated to management connections are given the QoS profile “Gold”. The “Gold” QoS setting can, for example, define a minimum guaranteed bandwidth of 1Gbps.  This works out nicely because it matches the VMware best practice of providing at least 1Gbps of guaranteed bandwidth to the VM Kernel interface. Similarly, the “Best Effort” QoS profile assigned to the NICs used by VMs can also be given a minimum guaranteed bandwidth.

It is important to understand that this is NOT rate limiting. Interface rate limiting is an inferior and suboptimal approach that wastes unused bandwidth. Rather, if the VM Kernel wants 10G of bandwidth it will have access to all 10G of bandwidth if available. If the VMs happen to be using all 10G of bandwidth and the VM Kernel needs the link, the VM Kernel will get its minimum guarantee of 1Gbps and the VMs will be able to use the remaining 9Gbps, and vice versa. The net result is that Cisco UCS provides a fair sharing of available bandwidth combined with minimum guarantees.
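
To make the difference between a minimum guarantee and a rate limit concrete, here is a small Python model of a single 10G link. It is a simplified sketch, not how the UCS hardware schedules traffic: the function names are hypothetical, the 9Gbps guarantee for VM traffic is an assumed value, and spare bandwidth is handed out in proportion to each class's guarantee.

```python
def share_link(capacity, guarantees, demands):
    """Work-conserving sharing: guarantees first, then spare capacity by weight."""
    assert sum(guarantees.values()) <= capacity, "guarantees oversubscribe the link"
    alloc = {name: float(min(demands[name], guarantees[name])) for name in guarantees}
    spare = capacity - sum(alloc.values())
    hungry = {n for n in alloc if demands[n] - alloc[n] > 1e-9}
    while spare > 1e-9 and hungry:
        total_weight = sum(guarantees[n] for n in hungry)
        if total_weight == 0:
            break  # classes without a guarantee get no spare share in this simple model
        granted = 0.0
        for name in sorted(hungry):
            offer = spare * guarantees[name] / total_weight
            take = min(offer, demands[name] - alloc[name])
            alloc[name] += take
            granted += take
        spare -= granted
        hungry = {n for n in hungry if demands[n] - alloc[n] > 1e-9}
    return alloc


def rate_limit(caps, demands):
    """Hard partitioning: no class may ever exceed its cap."""
    return {name: float(min(demands[name], caps[name])) for name in caps}


guarantees = {"vmkernel": 1, "vm": 9}  # Gbps; the 9 for VM traffic is an assumption
print(share_link(10, guarantees, {"vmkernel": 10, "vm": 10}))  # both classes busy
print(share_link(10, guarantees, {"vmkernel": 10, "vm": 0}))   # VMs idle
print(rate_limit(guarantees, {"vmkernel": 10, "vm": 0}))       # same load, hard caps
```

When both classes are busy, each is held to its guarantee. When the VMs go idle, the sharing model lets the VM Kernel take the whole link, while the rate-limited model leaves the unused bandwidth stranded.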

QoS policies for the individual adapters are defined and applied centrally in the UCS Manager GUI.

Read the Cisco.com UCS Manager QoS configuration example for more information.

True NIV goes both ways: (Server and Network)

Obtaining true NIV requires virtualizing the adapter towards both the Server and the Network.  In this design we are providing NIV to the Server by means of SR-IOV based PCIe virtualization, which fools the server into seeing more than one adapter, all from a single physical adapter.  So the virtual adapters vNIC1, vNIC2, and so on, are identifying and distinguishing themselves to the server system with PCIe mechanisms.  This accomplishes the goal of adapter consolidation and virtualization from the Server perspective.

The next challenge is differentiating the virtual adapters towards the Network.  Remember that each virtual adapter is sharing the same physical cable with other virtual adapters.  In this case vNIC1 and vNIC3 are sharing the same 10GE physical cable.  When traffic is received by the adapter on this shared 10GE cable, how does the physical adapter know to which vNIC the traffic belongs?  Furthermore, when a vNIC transmits traffic towards the Network, how does the upstream network know which vNIC the traffic came from and apply a unique policy to it, such as our “Gold” QoS policy?

Cisco UCS and Nexus 5000 solve this problem with the use of a unique tag dedicated for NIV identification purposes, shown here as a VNTag.  Each virtual adapter has its own unique tag# assigned by UCS Manager.  When traffic is received by the physical adapter on the shared 10GE cable it simply looks at the NIV tag# to determine to which vNIC the traffic belongs.  When a vNIC is transmitting traffic towards the network it applies its unique NIV tag# and the upstream switch (Fabric Interconnect) is able to identify which vNIC the traffic was received from and apply a unique policy to it.
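
A toy Python model of this tag lookup is shown here. The tag numbers, table names, and function names are hypothetical values I picked for illustration; the real tag format and assignments are handled by UCS Manager and the hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    niv_tag: int  # per-vNIC tag (hypothetical numeric values)
    vlan: int
    payload: str


# Hypothetical tag assignments for two vNICs sharing one 10GE cable.
TAG_TO_VNIC = {101: "vNIC1", 103: "vNIC3"}
VNIC_QOS = {"vNIC1": "Best Effort", "vNIC3": "Gold"}


def adapter_receive(frame):
    """Adapter side: deliver a received frame to the vNIC named by its tag."""
    return TAG_TO_VNIC[frame.niv_tag]


def switch_classify(frame):
    """Switch side: identify the sending vNIC by tag and look up its policy."""
    vnic = TAG_TO_VNIC[frame.niv_tag]
    return vnic, VNIC_QOS[vnic]


# The same VLAN (100) on both vNICs is not a problem: the tag, not the VLAN, is the key.
print(adapter_receive(Frame(niv_tag=101, vlan=100, payload="to vNIC1")))
print(adapter_receive(Frame(niv_tag=103, vlan=100, payload="to vNIC3")))
print(switch_classify(Frame(niv_tag=103, vlan=100, payload="from vNIC3")))
```

Because the lookup key is independent of the VLAN, the VLAN field stays free to carry whatever VLANs the design calls for on each vNIC.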

Not all implementations of NIV adequately address the Network side of the equation, and as a result can impose some surprising restrictions on the data center designer.  A perfect example of this is Scott Lowe’s discovery that HP Virtual Connect Flex-10 FlexNICs cannot have the same VLAN present on two virtual adapters (FlexNICs) sharing the same LOM.  Because HP did not adequately address the Network side of NIV (such as implementing an NIV tag), HP is forcing the system to use the existing VLAN tag as the means to determine which FlexNIC is receiving or sending traffic on a shared 10GE cable, resulting in the limitation Scott Lowe discovered and wrote about on his blog.  Furthermore, HP’s Flex-10 has a rate limiting requirement that imposes a hard partitioning of bandwidth, resulting in waste and inefficiency.  Each FlexNIC must be given a not-to-exceed rate limit, and the sum of those limits must not exceed 10Gbps.  For example, I could have (4) FlexNICs sharing one 10GE port and I could give each FlexNIC 2.5Gbps of max bandwidth.  However, if the link is idle, FlexNIC #1 still could not transmit any faster than 2.5Gbps (wasted bandwidth).
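
The VLAN restriction follows directly from using the VLAN ID as the lookup key. The Python sketch below is my own illustration of that reasoning, not HP's implementation; the function name and the FlexNIC labels are hypothetical.

```python
def build_vlan_demux(nic_vlans):
    """Build a VLAN -> virtual NIC table for one shared physical port."""
    table = {}
    for nic, vlans in nic_vlans.items():
        for vlan in vlans:
            if vlan in table:
                # One key can only point at one virtual NIC.
                raise ValueError(
                    f"VLAN {vlan} already maps to {table[vlan]}; cannot also map to {nic}"
                )
            table[vlan] = nic
    return table


# Distinct VLANs per virtual NIC work fine.
print(build_vlan_demux({"FlexNIC-a": [10, 20], "FlexNIC-b": [30]}))

# The same VLAN on two virtual NICs of one port cannot be represented.
try:
    build_vlan_demux({"FlexNIC-a": [10], "FlexNIC-b": [10]})
except ValueError as err:
    print(err)
```

A dedicated per-adapter tag avoids the conflict because the key no longer has anything to do with the VLAN.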

Cisco UCS addresses NIV from both the Server side and Network side, and provides actual Quality of Service with fair sharing of bandwidth secured by minimum guarantees (not max limits).  As a result there are no VLAN or bandwidth limitations.  In the design shown here with Cisco UCS and Nexus 1000V, any VLAN can be present on any number of vNICs on any port, and any vNIC can use the full 10GE of link bandwidth, giving the Data Center Architect tremendous virtualization design flexibility and simplicity.

I hope you enjoyed this post.  Feel free to submit any questions or feedback in the comments below.

Other related posts:
Cisco UCS and VMWare vSwitch design with Cisco 10GE Virtual Adapter
Nexus 1000V with FCoE CNA and VMWare ESX 4.0 deployment diagram

###

Disclaimer: This is not an official Cisco publication.  The views and opinions expressed are solely those of the author as a private individual and do not necessarily reflect those of the author’s employer (Cisco Systems, Inc.).  This is not an official Cisco Validated Design.  Contact your local Cisco representative for assistance in designing a data center solution that meets your specific requirements.

19 Responses to “Cisco UCS and Nexus 1000V design diagram with Palo adapter”

  1. Brilliant. Keep it up, Brad! The troops are hungry for this kind of manna :-)

  2. Rodos says:

    Brad, love your work as always. I like the slight change to the storage from the last diagram, or maybe I just understand it better now.

    You use the label Fabric Extender, often referred to as the FEX. As I now understand it the FEX has been officially renamed as the IO Module or IOM for short. Some of the docs still refer to it as the FEX but that’s to be updated. As I suspect this diagram will be well used, it could be good to see it use the new name to avoid confusion.

    Great stuff, I look forward to seeing more.

    Rodos

  3. Brad Hedlund says:

    Rodos,
    The feedback I received from you about the previous draft I sent you influenced changes in this diagram. Thank you very much Sir.

    Cheers,
    Brad

  4. Duncan says:

    Great article Brad!

  5. Mike says:

    Brad, when you talk about QoS, is it also possible to give Fibre Channel a minimum of guaranteed bandwidth, and what happens if not enough bandwidth is available, because it is needed for Networking(pause the FCoE traffic)?

    thx for info,
    Mike

  6. Brad Hedlund says:

    Mike,
    In both UCS and Nexus 5000, Fibre Channel has QoS on by default. The default setting for FCoE traffic is a guarantee of 50% link bandwidth (5Gbps) and no packet drops.

    Cheers,
    Brad

  7. HP says:

    Your article is a little biased. It is easy to stress some HP Flex-10 limitations, but the same is true for Cisco:
    * HP Flex-10 bandwidth divisions are fixed today. That is true. But at least it is available now. That can’t be said of the Cisco Palo NIC. HP Flex-10 divisions will be dynamic in future releases also.
    * You go through a lot of effort to setup everything redundantly. However, the hardware is not redundant to chip-level. You mention it yourself: “The vSphere hypervisor scans the PCIe bus and see’s what it believes to be (4) discreet phsyical adapters, when in reality there is *** only (1) physical *** dual-port 10GE adapter” and “….which fools the server into seeing more than one adapter, all from *** a single physical adapter ***”. In a half-width server, the CNA remains a SPOF.
    HP at least provided a full redundant system up to chip-level for half-height servers with two *** physically seperated *** Flex10-10GE interfaces.

  8. Brad Hedlund says:

    Calling this article biased is quite funny coming from “HP”, as you proceed to espouse Flex-10 with bias. Furthermore, you are stating the obvious. Any intelligent reader knows that a Cisco employee writing about a Cisco product has bias, and expects it.

    >In a half-width server, the CNA remains a SPOF.

    Uh, yeah, that’s obvious. Any customer briefed on UCS understands that. There’s no secret there. More adapters for the sake of redundancy translates into higher cost. The customer will make the judgement call if leveraging the HA capabilities of the virtualization or clustering software can afford savings in infrastructure costs.

    p.s. Don’t be ashamed to use a real name.

    Cheers,
    Brad

  9. [...] links for Cisco UCSCisco UCS and Nexus 1000V design diagram with Palo adapterMore on Cisco UCSWhat are the hardware components of UCS – Ciscowiki The good and bad of [...]

  10. Burg Rahja says:

    Brad

    Thanks for this article I’ve learned a lot from reading it.

    I have a follow-up question. How is the live migration of VMs handled so that the minimum bandwidth guarantees are enforced when the VM moves to another host?

    Thx
    Burg

  11. Brad Hedlund says:

    Burg,
    An excellent question that highlights the innovation and value of UCS and Nexus 1000V. Any minimum bandwidth guarantees as they existed on the source machine would be preserved at the destination machine provided the destination system was identically configured and the QoS policies followed the VM during vMotion.

    How are UCS and Nexus 1000V special in enabling this?

    Destination system identically configured:
    The complete server and network configuration of this system, from the server blade settings themselves, to its Palo adapter, and all of the LAN/SAN settings provisioned on the Fabric Interconnect for this server, is captured in a UCS Service Profile. This Service Profile could be made into a Service Profile Template. Any new blades I bring into the environment can be provisioned with a Service Profile that was cloned from the template. Following this practice ensures that my configuration is consistent among all blades. The configuration of the Palo adapter and all its QoS settings, the LAN/SAN settings and QoS on the Fabric Interconnect are all the same with no configuration drift or inconsistencies.

    QoS policies following the VM:
    This is where the Nexus 1000V shines. When a VM is migrated via vMotion to another system within the Nexus 1000V domain, any QoS settings specific to that VM are migrated along with it, resulting in consistent QoS behavior and policies regardless of the VM’s actual location. This automated migration of network QoS policies is something that was never possible prior to Nexus 1000V.

    Hope that helps.

    Cheers,
    Brad

  12. Brad, this is a great overview. Thanks.

    Would you elaborate on the “per-flow” hashing that Nexus 1000V performs in vPC-HM? What aspects of a packet/frame are used in identifying a “flow” and how accurately does this result in load-balanced traffic across the redundant virtual adapters?

    I’d also be interested to hear your thoughts on the pros/cons of Nexus 1000V attaching to the Palo NIV devices as you’ve described here, versus PCIe device “pass-through” in the VMM to expose NIV devices directly to each VM instance. I gather that local switching between VMs would be impacted on the one hand, though perhaps hardware-assist features in the NIC would be impacted conversely. I’m not sure how overall management and scale would be affected. What other considerations come to mind?

    Thanks again,
    -Benson

  13. Brad Hedlund says:

    Benson,
    The Nexus 1000V has tons of options for hashing what constitutes a flow:
    http://www.cisco.com/en/US/docs/switches/datacenter/nexus1000/sw/4_0/command/reference/n1000v_cmds_p.html#wp1284857

    The more granular your hashing algo is, the more likely you are to get even Steven load balancing. However, before you pick a granular method such as source & dest TCP ports, your member links should be landing on the same physical switch, or a single “logical” switch created by vPC, VSS, or StackWise.

    The pros of hypervisor bypass are better I/O performance and lower latency for the VMs (more like bare metal), nice for high I/O VMs such as Oracle or Exchange etc. The tradeoff of hypervisor bypass is scalability, as the # of vNICs you can provision on your physical adapter (Palo in this case) is hardware limited (128, with realistic numbers in the 50 range or less). The software based approach with Nexus 1000V has no hardware limits and scalability in terms of # of VMs per blade is much higher.

    Cheers,
    Brad

  14. scott owens says:

    How does the impact of 10Gb with Jumbo frames fit into this ?
    The 7K & 5K both support jumbos – should we expect greater performance increases between backup servers and targets along with iSCSI improvements too .
    Also … does the Palo have iSCSI offloading along with ethernet offload ?

    thanks

  15. Got on this thread by chance. Interesting stuff.

    BTW

    >HP at least provided a full redundant system up to chip-level for
    >half-height servers with two *** physically seperated *** Flex10-10GE
    >interfaces.

    Well that’s what we (IBM) say about our BladeCenter Vs the HP BladeSystem (i.e. we have a redundant backplane while you do not bla bla bla).

    Funny. I guess the glass is always half full for vendors…. isn’t it?

    Massimo.

  16. Josh says:

    My thought would be that most implementations would have either a CNA/1000V or the Palo/bypass arrangement. Does that sound right to you Brad?

    Also, you must be in switching mode with the 6100’s and are you statically pinning with separate VLAN’s per 6100? Why not use the native switching mode and keep the VLAN configs in the 6100 consistent? This config just seems much more involved than it needs to be. Am I missing something?

  17. [...] will recommend you to take a full view about the article here. Tags: Cisco UCS, menlo, Nexus, nexus 1000v, [...]

  18. Brad Hedlund says:

    Josh,

    This design diagram works with the 6100 in End-Host mode, or switching mode. Nothing in this diagram requires the use of switching mode on the 6100 Fabric Interconnects.

  19. Brad Hedlund says:

    Scott,

    Jumbo frames can easily be enabled on the Fabric Interconnect, and that can only help iSCSI and vMotion performance, for example.
    The Palo adapter does TCP segmentation offloading but does not do any special HW offloading for iSCSI specific payloads, nor does it support iSCSI booting.
