Introduction
Service providers with multiple datacenters often want to offer their customers the choice of multiple virtual datacenters in different availability zones. A failure of one availability zone should not impact another availability zone. The customer can then deploy an application resiliently across both virtual datacenters by leveraging load balancing and application clustering.
Depending on the distance between the datacenters and the network latency between them, it is possible to make multiple availability zones accessible from within a single vCloud Director instance. That means a single GUI or API endpoint and very easy consumption from the customer's perspective. Read vCloud Architecture Toolkit – Architecting vCloud for more detail on latency and supportability considerations.
Multiple vCenter Server Design
The typical approach in a single-instance vCloud Director deployment is to give each availability zone its own vCenter Server and vCNS Manager. vCloud Director 5.5 can connect to up to 20 vCenter Servers.
The following diagram shows how the management and resource clusters are typically placed between two sites.
Each site has a management cluster. The shared cloud management VMs (vCloud Director cells, databases, Chargeback, AMQP, etc.) run primarily in site 1 with failover to site 2. Provider VDC management resources (vCenter Server, vCNS/NSX Managers, databases) are distributed to each site. No resource group components are shared, which makes for a very clean availability zone design.
One problem for customers is that they cannot stretch Org VDC networks between the sites. Although VXLAN networks can be stretched over routed Layer 3 networks between sites, they cannot be stretched between different vCenter Servers. A single vCNS/NSX Manager is the boundary of a VXLAN network, and there is a 1:1 relationship between vCenter Server and vCNS/NSX Manager. A customer who wants VMs in VDCs from different availability zones to communicate therefore has to create an Edge Gateway IPsec VPN or provide external network connectivity between the VDCs. Either option results in quite complicated routing configuration. The following diagram shows a typical example of such a setup.
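The boundary rule above can be written down as a tiny model. This is only a sketch for reasoning about the design, not vCloud Director or NSX API code; the class, the function and all names such as vc-site1 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderVdc:
    name: str
    site: str
    # Each vCenter Server is paired 1:1 with a vCNS/NSX Manager,
    # so the vCenter name also identifies the VXLAN domain.
    vcenter: str


def can_share_vxlan_network(a: ProviderVdc, b: ProviderVdc) -> bool:
    """A VXLAN network cannot cross a vCNS/NSX Manager boundary."""
    return a.vcenter == b.vcenter


# Multi-vCenter design: one vCenter Server per availability zone.
multi = (
    ProviderVdc("pvdc-site1", "site1", "vc-site1"),
    ProviderVdc("pvdc-site2", "site2", "vc-site2"),
)
# Single-vCenter design: both zones managed by the same vCenter Server.
single = (
    ProviderVdc("pvdc-site1", "site1", "vc-shared"),
    ProviderVdc("pvdc-site2", "site2", "vc-shared"),
)

multi_can_stretch = can_share_vxlan_network(*multi)
single_can_stretch = can_share_vxlan_network(*single)
```

In this model the multi-vCenter pair cannot share a network, which is why the VPN or external network workaround is needed.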
Single vCenter Server Design with Stretched Org Networks
I have come up with an alternative approach. The goal is to stretch an Org VDC network between two sites and to have only one highly available Edge Gateway to manage. The desired target state is shown in the following diagram.
To accomplish this we need only one resource group vCenter Server instance, and thus one VXLAN domain, while still separating resources into two availability zones. vCenter Server can be made resilient with vSphere HA (stretched cluster), vCenter Server Heartbeat or Site Recovery Manager.
Could we keep the same cluster design as in the multi-vCenter scenario, with each Provider VDC having its own set of clusters based on site? To answer this question I first need to describe the VXLAN transport zone (VXLAN scope) concept. VXLAN network pools created by vCloud Director have only Provider VDC scope. This means that any Org VDC network created from such a VXLAN network pool can span only the clusters used by that Provider VDC. When a cluster is added to or removed from the Provider VDC, the VXLAN transport zone scope is expanded or reduced by that cluster. The scope can be viewed in vCNS Manager or in NSX under the Transport Zone menu.
So with site-specific Provider VDCs the network would not span both sites by default. There are two ways to expand the VXLAN transport zone.
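The scope rule can be illustrated with a small model. It is a sketch that assumes the transport zone scope simply mirrors the Provider VDC's cluster list, as described above; the class, the function and the cluster names are hypothetical and do not come from any VMware API.

```python
class ProviderVdc:
    """Toy model: the VXLAN scope mirrors the Provider VDC's clusters."""

    def __init__(self, name, clusters):
        self.name = name
        self.clusters = set(clusters)

    def add_cluster(self, cluster):
        self.clusters.add(cluster)

    def remove_cluster(self, cluster):
        self.clusters.discard(cluster)

    def vxlan_scope(self):
        # The transport zone follows the cluster list automatically.
        return frozenset(self.clusters)


def network_can_span(pvdc, *clusters):
    """An Org VDC network can span only clusters inside the scope."""
    return set(clusters) <= pvdc.vxlan_scope()


pvdc = ProviderVdc("pvdc-site1", ["site1-cluster-a"])
before = network_can_span(pvdc, "site1-cluster-a", "site2-cluster-a")

pvdc.add_cluster("site2-cluster-a")
after_add = network_can_span(pvdc, "site1-cluster-a", "site2-cluster-a")

pvdc.remove_cluster("site2-cluster-a")
after_remove = network_can_span(pvdc, "site1-cluster-a", "site2-cluster-a")
```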
Manual VXLAN Scope Expanding
The first one is simple enough: manually extend the VXLAN transport zone scope in vCNS or NSX Manager. The drawback is that any reconfiguration of the Provider VDC clusters or resource pools will remove this manual change. As Provider VDC reconfiguration does not happen very often, this is a viable option.
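Because the manual change is lost on reconfiguration, it helps to think of the desired scope as something to re-check after every Provider VDC change. The sketch below models that drift. It assumes the scope is reset to exactly the Provider VDC's clusters on reconfiguration, and every name in it is hypothetical rather than part of vCNS, NSX or vCloud Director.

```python
class TransportZone:
    """Toy model of a VXLAN transport zone owned by one Provider VDC."""

    def __init__(self, pvdc_clusters):
        self.scope = set(pvdc_clusters)

    def manual_expand(self, cluster):
        # Change made by hand in vCNS/NSX Manager.
        self.scope.add(cluster)

    def reconfigure_pvdc(self, pvdc_clusters):
        # Assumed behaviour: the scope is rewritten from the
        # Provider VDC's clusters, dropping manual additions.
        self.scope = set(pvdc_clusters)


def missing_clusters(zone, desired):
    """Clusters that must be re-added by hand to restore the stretch."""
    return sorted(set(desired) - zone.scope)


desired = ["site1-cluster-a", "site2-cluster-a"]

zone = TransportZone(["site1-cluster-a"])
zone.manual_expand("site2-cluster-a")
drift_before = missing_clusters(zone, desired)

# The provider later adds a second cluster in site 1.
zone.reconfigure_pvdc(["site1-cluster-a", "site1-cluster-b"])
drift_after = missing_clusters(zone, desired)
```

After the reconfiguration the model reports site2-cluster-a as missing, which is the manual step that would have to be repeated.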
Stretching Provider VDCs
The second solution involves stretching at least one Provider VDC into the other site so that its VXLAN scope covers both sites. The resulting network pool (which created the VXLAN transport zone) can then be assigned to the Org VDCs that need to span networks between sites. This can be achieved by creating multiple resource pools inside the clusters and assigning them to Provider VDCs. As we want to stretch only the VXLAN networks and not the actual compute resources (we do not want vCloud Director to deploy VMs into the wrong site), we use site-specific storage policies. Although a Provider VDC will have access to a resource pool in the other site, it will not have access to storage there, because only storage from its own site is assigned to it.
Hopefully the following diagram describes the second solution better:
The advantage of the second approach is that it is a much cleaner solution from a support perspective, although the actual setup is more complex.
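The interplay of resource pools and storage policies in the second approach can be sketched as follows. This is a simplified model that assumes a VM can only land where the Provider VDC has both a resource pool and an assigned storage policy; the classes and names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourcePool:
    name: str
    cluster: str
    site: str


@dataclass(frozen=True)
class StoragePolicy:
    name: str
    site: str  # site-specific: backed only by datastores of this site


@dataclass(frozen=True)
class ProviderVdc:
    name: str
    resource_pools: tuple
    storage_policies: tuple

    def vxlan_scope(self):
        # The scope covers every cluster that contributes a resource pool.
        return {rp.cluster for rp in self.resource_pools}

    def placement_sites(self):
        # VMs need compute and storage in the same site.
        rp_sites = {rp.site for rp in self.resource_pools}
        storage_sites = {sp.site for sp in self.storage_policies}
        return rp_sites & storage_sites


pvdc1 = ProviderVdc(
    name="pvdc-site1",
    resource_pools=(
        ResourcePool("rp-pvdc1", "site1-cluster", "site1"),
        ResourcePool("rp-pvdc1", "site2-cluster", "site2"),  # the stretch
    ),
    storage_policies=(StoragePolicy("gold-site1", "site1"),),
)

scope = pvdc1.vxlan_scope()
sites = pvdc1.placement_sites()
```

In the model the VXLAN scope covers both clusters while the placement sites stay limited to site 1, which is the separation the design is after.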
Highly Available Edge Gateway
Now that we have successfully stretched the Org VDC network between both sites, we also need to solve Edge Gateway site resiliency. Resilient applications without connectivity to the external world are useless. The Edge Gateway (and the actual Org VDC) is created inside one Provider VDC (let's call it primary). The Org VDC network is marked as shared so that other Org VDCs can use it as well. The Edge Gateways are deployed by the service provider, who deploys each Edge Gateway in a high availability configuration. This results in two Edge Gateway VMs deployed in the same primary Provider VDC (in the System VDC sub-resource pool). The VMs use an internal Org VDC network for heartbeat communication. The trick to making the pair site resilient is to go into the vCNS/NSX Edge configuration and change the resource pool (the System VDC resource pool with the same name but in a different cluster) and the datastore of the second instance to those of the other site. vCNS/NSX Manager then immediately redeploys the reconfigured Edge instance to the other site. This change survives Edge Gateway redeploys from within vCloud Director without any problems.
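The placement change can be summarized with a short model that checks whether the HA pair spans both sites. It is purely illustrative: the data class, the helper functions and the resource pool and datastore names are hypothetical, and no vCNS/NSX API calls are involved.

```python
from dataclasses import dataclass


@dataclass
class EdgeInstance:
    index: int
    resource_pool: str
    datastore: str
    site: str


def is_site_resilient(instances):
    """True when the HA pair is spread over more than one site."""
    return len({inst.site for inst in instances}) > 1


def relocate(instance, resource_pool, datastore, site):
    """Mimics editing the second instance's placement in the Edge config."""
    instance.resource_pool = resource_pool
    instance.datastore = datastore
    instance.site = site


# Default HA deployment: both instances land in the primary Provider VDC.
pair = [
    EdgeInstance(0, "System vDC (site1-cluster)", "ds-site1", "site1"),
    EdgeInstance(1, "System vDC (site1-cluster)", "ds-site1", "site1"),
]
resilient_before = is_site_resilient(pair)

# Move the second instance to the matching resource pool in the other site.
relocate(pair[1], "System vDC (site2-cluster)", "ds-site2", "site2")
resilient_after = is_site_resilient(pair)
```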