Kevin Christopher, Cloud Platform Security Architect, VMware
Yanlei Zhao, Staff Engineer 2, VMware
vMotion is the live migration of a running virtual machine between two physical servers with no interruption in service.
The first generation of vMotion focused on performance. The first non-negotiable goal was cutover time: keeping the cutover under one second prevents TCP or operating systems from noticing the migration. The key innovation to minimize cutover was a pre-copy mechanism: keep running the workload, and use page table dirty bits to determine what memory needed re-transmission. Eventually virtual machines became large enough that vMotions could take a full day, so the second goal became keeping the network fully saturated. As VMs got larger and networks got faster, keeping up with these advances was a full-time job for an entire team of engineers. Any feature which might slow down a vMotion was aggressively deprioritized.
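The pre-copy loop described above can be sketched as a small simulation. This is a toy model rather than the vMotion implementation: the function name, the `dirty_ratio` parameter (pages the guest dirties per page sent), and the cutover threshold are all assumptions chosen for illustration.

```python
import random

def simulate_precopy(num_pages, dirty_ratio, cutover_threshold=64,
                     max_rounds=30, seed=0):
    """Toy model of iterative pre-copy (all names and numbers are hypothetical).

    Returns (pages sent in each pre-copy round, pages left for the cutover).
    """
    rng = random.Random(seed)
    dirty = set(range(num_pages))      # round 1: every page still has to be sent
    sent_per_round = []
    for _ in range(max_rounds):
        if len(dirty) <= cutover_threshold:
            break                      # small enough to send while the VM is paused
        # Snapshot and clear the dirty bits, then transmit the snapshot.
        to_send, dirty = dirty, set()
        sent_per_round.append(len(to_send))
        # The guest keeps running during the transfer. Assumption: the number
        # of writes is proportional to how long the round took, which is
        # proportional to the number of pages sent.
        for _ in range(int(len(to_send) * dirty_ratio)):
            dirty.add(rng.randrange(num_pages))
    return sent_per_round, len(dirty)

sent, remaining = simulate_precopy(num_pages=4096, dirty_ratio=0.25)
print("pages sent per round:", sent)
print("pages left for cutover:", remaining)
```

In this model each round is shorter than the last as long as the guest dirties memory more slowly than the network can send it, so the set of pages left for the paused cutover phase shrinks until it fits the threshold.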
In the past five years, those tradeoffs have changed. VM sizes have plateaued as customers see large VMs as single points of failure. Kubernetes has made authoring horizontally scalable applications accessible to an entirely new category of DevOps engineer. And finally, cybersecurity has become front and center in most corporate boardrooms. The idea of tampering with a live migration took only a couple of years to reach Black Hat. This author vividly remembers asking a customer if a 4x slowdown to get encryption would be okay, and getting enthusiastic assent.
The second generation of vMotion pivots towards security. The obvious candidate for encryption is a well-tested protocol like TLS. This has two major problems. First, TLS uses a maximum record size of 16KiB, which is incompatible with the multi-megabyte messages vMotion uses for efficiency. Second, for performance vMotion uses a direct kernel-to-kernel transfer (for zero-copy transmissions), and implementing TLS in a kernel is challenging. The industry does not actually have a good alternative for high-bandwidth, encrypted data transfer.
Looking deeper, what does TLS provide? An authentication handshake using asymmetric cryptography, followed by bulk encryption using a symmetric cipher such as AES. Why not go straight to bulk encryption?
- The complicated parts of TLS in a kernel are the asymmetric algorithms (RSA, ECDSA, and ECDHE). Most security vulnerabilities are in the asymmetric code. Symmetric crypto is simpler and well understood, provided it is paired with secure key distribution.
- AES-GCM pipelines very well. Older AES-CBC ciphers use “ENC(IV ⊕ DATA)” as a primitive, but AES-GCM uses “ENC(IV) ⊕ DATA” as a primitive. Decoupling the slow encryption from the data dependency allows some fascinating possibilities for just-in-time data access.
- AES-GCM can overlay on top of an existing message-based protocol. It’s not hard to find 16 bytes for an authentication tag in a message header, then add an encryption or decryption pass just before or after network transmission.
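The last two points can be illustrated with a short sketch. It is deliberately not AES-GCM: to stay runnable with only the standard library, a SHA-256 based function stands in for the AES block cipher and a truncated HMAC stands in for the GCM tag, so nothing here is secure or representative of the vMotion wire format. The header layout and every function name are assumptions made for the example.

```python
import hashlib
import hmac
import struct

BLOCK = 16                          # AES block size in bytes
TAG_LEN = 16                        # tag size mentioned in the text
HEADER = struct.Struct("!I16s")     # hypothetical header: payload length + tag

def toy_prf(key, block):
    """Stand-in for one AES block encryption. NOT AES and NOT secure."""
    return hashlib.sha256(key + block).digest()[:BLOCK]

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_style_encrypt(key, iv, data):
    """ENC(IV xor DATA): each block's input needs the previous ciphertext
    block, so encryption cannot start before the data exists."""
    out, prev = [], iv
    for i in range(0, len(data), BLOCK):
        prev = toy_prf(key, xor(prev, data[i:i + BLOCK]))
        out.append(prev)
    return b"".join(out)

def keystream(key, nonce, nbytes):
    """ENC(counter) blocks: depends only on key, nonce and length, never on
    the data, so it can be computed ahead of time or in parallel."""
    blocks = (nbytes + BLOCK - 1) // BLOCK
    stream = b"".join(toy_prf(key, nonce + struct.pack("!I", i))
                      for i in range(blocks))
    return stream[:nbytes]

def seal(enc_key, mac_key, nonce, payload, stream=None):
    """Encrypt the payload and place a 16-byte tag in the message header."""
    if stream is None:
        stream = keystream(enc_key, nonce, len(payload))
    ciphertext = xor(payload, stream)
    # HMAC is a stand-in for the GCM tag computation.
    tag = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()[:TAG_LEN]
    return HEADER.pack(len(ciphertext), tag) + ciphertext

def open_message(enc_key, mac_key, nonce, message):
    """Verify the tag from the header, then decrypt the payload."""
    length, tag = HEADER.unpack_from(message)
    ciphertext = message[HEADER.size:HEADER.size + length]
    expected = hmac.new(mac_key, nonce + ciphertext, hashlib.sha256).digest()[:TAG_LEN]
    if not hmac.compare_digest(tag, expected):
        raise ValueError("authentication tag mismatch")
    return xor(ciphertext, keystream(enc_key, nonce, length))

enc_key, mac_key, nonce = b"k" * 16, b"m" * 16, b"n" * 12
stream = keystream(enc_key, nonce, 64)     # computed before the payload exists
payload = bytes(range(64))
message = seal(enc_key, mac_key, nonce, payload, stream)
assert open_message(enc_key, mac_key, nonce, message) == payload
```

The point of the sketch is the shape of the work: the expensive keystream step runs before the payload is available, leaving only an XOR and a tag computation for the moment the data is ready. With real AES-GCM, a nonce must never be reused under the same key.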
Now, security experts know to be very cautious about inventing a custom security protocol, and this article is too brief to cover the analysis, tradeoffs, and assumptions that justified doing so. Use caution, but “difficult” does not mean “impossible”.
The encrypted vMotion performance data was excellent. The two most important metrics, cutover time and total migration time, barely changed (<10% increase in cutover time, still sub-second). What did change was resource overhead: across both source and destination, a normal vMotion of a large VM needs about 1.5 CPU cores per 10Gbps of network bandwidth to saturate the network, but an encrypted vMotion needs about 3.0 CPU cores per 10Gbps of network bandwidth. This capacity either has to be left idle, or has to be borrowed from running virtual machines during a vMotion. Nonetheless, this is an amazing result: modern CPUs encrypt at about 10Gbps, so the team hit the theoretical maximum – by finding a third resource to trade off.
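As a rough planning aid, the per-10Gbps figures above can be turned into a small calculation. This sketch assumes the overhead scales linearly with link speed, which the figures suggest but do not guarantee; the function and constant names are hypothetical.

```python
# Figures quoted above: CPU cores, summed across source and destination,
# needed to saturate 10 Gbps of network bandwidth.
PLAIN_CORES_PER_10GBPS = 1.5
ENCRYPTED_CORES_PER_10GBPS = 3.0

def cores_needed(link_gbps, encrypted):
    """Estimated cores to saturate the link, assuming linear scaling."""
    per_10g = ENCRYPTED_CORES_PER_10GBPS if encrypted else PLAIN_CORES_PER_10GBPS
    return link_gbps / 10.0 * per_10g

for gbps in (10, 25, 40):
    plain = cores_needed(gbps, encrypted=False)
    secure = cores_needed(gbps, encrypted=True)
    print(f"{gbps} Gbps: {plain} cores plain, {secure} cores encrypted, "
          f"{secure - plain} extra")
```

Under that linear assumption, encryption doubles the CPU reserved for a migration at any link speed, while leaving cutover time and total migration time nearly unchanged.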
What lessons can we take from this? First, security is not always an expense. Careful engineering can transfer the cost of security to some more tolerable metric (like resource overhead). Second, look carefully at what a technology provides for free – and whether “free” justifies hidden limitations. And finally, as the industry changes there are opportunities for innovation everywhere – even within “solved” problems.
The post Security in vMotion appeared first on Thrive-WiSE.