A VI administrator is used to monitoring the performance of the workloads running in his environment through the rich vSphere client interface. The performance tab presents multiple graphs displaying metrics related to CPU, memory, network, datastore, disk and more. This helps him when troubleshooting, rightsizing the workloads or doing capacity calculations.
So what happens when the VI admin becomes a Cloud admin and starts deploying workloads to a public cloud? No access to vCenter Server means he has to rely solely on guest OS metrics (perfmon, top) or on monitoring interfaces provided by his service provider. Although vCloud Director has a monitoring dashboard, it does not show any performance data (see my Org VDC Monitoring post).
What about those guest OS metrics? Any vSphere admin who went through VCP training knows that guest OS metrics like CPU utilization are never to be trusted in a virtual environment. The OS does not know how much actual physical CPU time has been scheduled to its vCPUs, so high CPU utilization could mean either that a demanding workload is running in the OS or that the VM is competing with other VMs in a highly overallocated environment.
Should the VI/Cloud admin be concerned? It depends on the way the provider is oversubscribing (overselling) his compute resources. I have identified three schools of thought:
- ISP model: similar to how an internet provider oversubscribes the line 1:20, the IaaS provider will sell you a CPU/RAM allocation with a certain percentage guaranteed (e.g. 10% for CPU). The consumer knows that during quiet times he might get 100% of the requested resources, but during busy times he might get only his 10%. The consumer pays for the whole allocation.
- Telco model: the consumer commits to a certain level of consumption and is charged extra for bursting above it. So again a guaranteed percentage of resources is defined and known, but the difference from the ISP model is that the consumer is charged a flat rate for the guaranteed percentage plus a premium when he bursts above it.
- SLA model: the consumer pays for the whole allocation but does not know what resource oversubscription the provider is using. The provider must monitor the tenants to understand how much he can oversell the compute to get the highest ROI while keeping the SLAs.
All three models are achieved with the same allocation model: Allocation Pool. Only the chargeback, the amount of disclosed information and the SLA differ.
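To make the difference between the ISP and Telco chargeback concrete, here is a minimal sketch in Python. All names, rates and the billing formula are my own assumptions for illustration only; real providers will meter and bill differently.

```python
from dataclasses import dataclass


@dataclass
class Allocation:
    cpu_mhz: float         # CPU allocation purchased for the Org VDC
    guaranteed_pct: float  # percentage of the allocation the provider guarantees


def guaranteed_mhz(alloc: Allocation) -> float:
    """CPU the tenant can count on even during busy times."""
    return alloc.cpu_mhz * alloc.guaranteed_pct / 100.0


def isp_charge(alloc: Allocation, rate_per_mhz: float) -> float:
    """ISP model: the whole allocation is paid for, whatever is delivered."""
    return alloc.cpu_mhz * rate_per_mhz


def telco_charge(alloc: Allocation, used_mhz: float,
                 rate_per_mhz: float, burst_rate_per_mhz: float) -> float:
    """Telco model: flat rate for the guaranteed part plus a burst premium."""
    floor = guaranteed_mhz(alloc)
    # Usage cannot exceed the allocation; only usage above the floor is a burst.
    burst = max(0.0, min(used_mhz, alloc.cpu_mhz) - floor)
    return floor * rate_per_mhz + burst * burst_rate_per_mhz


# Hypothetical example: 10 GHz allocation with 10% guaranteed.
alloc = Allocation(cpu_mhz=10000, guaranteed_pct=10)
print(guaranteed_mhz(alloc))               # guaranteed floor in MHz
print(isp_charge(alloc, 2))                # hypothetical rate of 2 per MHz
print(telco_charge(alloc, 3000, 2, 5))     # hypothetical burst rate of 5 per MHz
```

In both cases the guaranteed floor is the same number; what changes is whether the tenant pays for capacity he may not get (ISP) or for bursting he could have avoided (Telco).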
It is obvious that in all three models we need performance monitoring for rightsizing the individual workloads and to correctly size the whole Org VDC. In the ISP model we need to understand whether we should buy more allocation because our workloads suffer during busy times. In the Telco model we need to avoid the expensive bursting, and in the SLA model we need to check that the provider is keeping his SLAs. On top of that it would be nice to be able to peek under the provider’s kimono to find out what the overcommit level of his cloud is.
By the way, the need for performance monitoring still applies to Reservation Pool, where the tenant is in full control of Org VDC overallocation and needs to understand whether he went too far or not. In Pay-As-You-Go Pool it is again about understanding whether my VMs are starving for CPU resources because of too aggressive oversubscription on the provider’s side.
Guest SDK
One of the lesser-known features of VMware Tools is the Guest SDK, which provides a read-only API for monitoring various virtual machine statistics. An example of such an implementation is a pair of additional Windows PerfMon libraries: VM Memory and VM Processor. Both contain a number of counters showing very interesting information that will be familiar to a VI admin, as the same values are exposed in the vSphere client.
A Linux alternative (although not as powerful) is the vmware-toolbox-cmd stat command.
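As a rough illustration of how the Linux command could be polled from a script, here is a small Python sketch. It assumes that the stat subcommands listed in STATS exist in the installed VMware Tools version and that each prints a number followed by a unit (for example "2400 MHz"); both are assumptions to verify on your own guest, and the helper names are mine.

```python
import re
import subprocess

# Subcommands of `vmware-toolbox-cmd stat` assumed to be available; the exact
# set depends on the VMware Tools / open-vm-tools version.
STATS = ("speed", "cpures", "cpulimit", "memres", "memlimit", "balloon", "swap")


def parse_stat(output):
    """Parse output such as '2400 MHz' into (2400, 'MHz'); None if unparseable."""
    match = re.match(r"\s*(\d+)\s*(\w+)?", output)
    if not match:
        return None
    return int(match.group(1)), match.group(2) or ""


def run_toolbox(stat):
    """Run one stat subcommand and return its raw stdout ('' on failure)."""
    try:
        result = subprocess.run(
            ["vmware-toolbox-cmd", "stat", stat],
            capture_output=True, text=True, check=False,
        )
    except OSError:
        # Tools not installed or the binary is not on PATH.
        return ""
    return result.stdout


def collect(run=run_toolbox):
    """Return {stat name: (value, unit) or None} for every entry in STATS."""
    return {stat: parse_stat(run(stat)) for stat in STATS}


if __name__ == "__main__":
    for name, value in collect().items():
        print(name, value)
```

The runner is passed in as a parameter so the parsing can be tried out on a machine without VMware Tools; a stat that is missing or disabled simply shows up as None.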
We can find out what the CPU or memory reservation is and whether memory ballooning (or even VM swapping) is active. We can also see the actual physical processor speed and the effective VM processor speed. This gives us quite an interesting peek into the hypervisor. By the way, access to the Guest SDK could be disabled by the provider via an advanced VM .vmx configuration parameter (not a standard practice):
tools.guestlib.enableHostInfo = "FALSE"
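When the counters are available, the values described above can be combined into a few quick hints. The following Python sketch is only an illustration: the function name, its parameters and the 50% threshold are my own assumptions, and a low effective speed is a hint worth investigating while the guest is busy, not proof of contention.

```python
def pressure_hints(effective_mhz, host_mhz, ballooned_mb, swapped_mb,
                   min_ratio=0.5):
    """Return a list of hint names derived from Guest SDK style counters.

    effective_mhz / host_mhz: effective VM and physical processor speed.
    ballooned_mb / swapped_mb: memory reclaimed by ballooning / VM swap.
    min_ratio: arbitrary threshold (assumption), tune it for your workloads.
    """
    hints = []
    if host_mhz > 0 and effective_mhz / host_mhz < min_ratio:
        hints.append("cpu")      # VM runs well below the physical core speed
    if ballooned_mb > 0:
        hints.append("balloon")  # hypervisor is reclaiming guest memory
    if swapped_mb > 0:
        hints.append("swap")     # hypervisor is swapping VM memory to disk
    return hints


# Hypothetical sample: 600 MHz effective on a 2400 MHz host, 128 MB ballooned.
print(pressure_hints(600, 2400, 128, 0))
```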
In the second part I will describe how these metrics can be collected, monitored and analyzed. Stay tuned…