Nested Virtualization with KVM and AMD

After my previous attempt the other day to create a nested guest (KVM on KVM) on Intel arch, I got hold of an AMD server machine with virt extensions enabled and gave it a whirl. This went slightly smoother than the Intel attempt.

Some config info about the physical host, regular-guest and nested-guest. (All of them are Fedora-16; x86_64)

  • Physical Host (Host hypervisor/Bare metal)
    • 
      [root@phy-host-amd]# virsh nodeinfo
      CPU model:           x86_64
      CPU(s):              16
      CPU frequency:       2000 MHz
      CPU socket(s):       2
      Core(s) per socket:  8
      Thread(s) per core:  1
      NUMA cell(s):        1
      Memory size:         8173352 kB
      
  • Regular Guest (Or Guest Hypervisor)
    • Config: 4GB Memory; 6 vcpus; 22GB Raw disk image w/ cache=’none’ enabled in the libvirt xml
  • Nested Guest
    • Config: 2GB Memory; 3 vcpus; 10G Raw disk image

Ensure nesting is enabled on the physical host

Let’s ensure kvm_amd kernel module is enabled with ‘nested’ virt.


[root@phy-host-amd ~]# modinfo kvm_amd | grep -i nested
parm:           nested:int
[root@phy-host-amd ~]# 

[root@phy-host-amd ~]# cat /sys/module/kvm_amd/parameters/nested
1
[root@phy-host-amd ~]# 

[root@phy-host-amd ~]# systool -m kvm_amd -v   | grep -i nested
    nested              = "1"
[root@phy-host-amd ~]# 
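The sysfs check above is easy to wrap in a small helper for scripting. This is only a sketch: the function name is made up here, and it assumes the parameter file prints 1 (as kvm_amd does above) or Y (as kvm_intel does later in these notes) when nesting is on.

```shell
# Return 0 if the given "nested" parameter file says nesting is on,
# 1 if it is off, 2 if the file is missing (module not loaded).
nested_enabled() {
    param_file=$1
    [ -r "$param_file" ] || return 2
    case "$(cat "$param_file")" in
        1|Y|y) return 0 ;;   # kvm_amd prints 1; kvm_intel prints Y
        *)     return 1 ;;
    esac
}
```

For example: nested_enabled /sys/module/kvm_amd/parameters/nested && echo "nesting is on"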

CAVEAT: To make life a little easier, I configured bridged networking on the physical host so that the regular-guest gets a bridged IP and, later, the nested-guest gets a NATed IP. I’m noting it here because the physical host initially had no bridging. The default libvirt bridge virbr0 uses the 192.168.122.0/24 IP space, so once we set up the regular-guest (or guest-hypervisor), its own virbr0 would end up in the same IP space. I tried to fix this problem by creating another ‘persistent’ libvirt network and enabling autostart for it [virsh net-add; virsh net-define; virsh net-autostart]. But it wasn’t elegant and messed up networks on reboot.
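For reference, a second NAT network outside 192.168.122.0/24 can be described with a small XML file and handed to virsh net-define. The sketch below only generates that XML. The function name, the network name ‘nested-net’, the bridge ‘virbr1’ and the 192.168.123 prefix are all hypothetical values picked for illustration, and this does not address the reboot trouble mentioned above.

```shell
# Print a libvirt NAT network definition for the /24 given by its first
# three octets. All argument values shown in the usage are hypothetical.
nat_network_xml() {
    name=$1 bridge=$2 prefix=$3
    cat <<EOF
<network>
  <name>$name</name>
  <bridge name='$bridge'/>
  <forward mode='nat'/>
  <ip address='$prefix.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='$prefix.2' end='$prefix.254'/>
    </dhcp>
  </ip>
</network>
EOF
}
```

Usage would be along the lines of: nat_network_xml nested-net virbr1 192.168.123 > /tmp/nested-net.xml, followed by virsh net-define /tmp/nested-net.xml, virsh net-start nested-net and virsh net-autostart nested-net.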

Set up the guest hypervisor
Create a minimal regular-guest using virt-install. The one I used is posted here

Now, add the cpu element to the regular-guest’s libvirt xml to expose AMD’s svm instructions, which come with the Opteron_G3 model.

Edit the xml using virsh:

# virsh edit regular-guest

(which will also define the xml)

Here is the attribute to be added to the guest hypervisor’s libvirt xml:

   <cpu>
      <arch>x86_64</arch>
      <model>Opteron_G3</model>
      <vendor>AMD</vendor>
      <topology sockets='2' cores='8' threads='1'/>
      <feature name='wdt'/>
      <feature name='skinit'/>
      <feature name='osvw'/>
      <feature name='3dnowprefetch'/>
      <feature name='cr8legacy'/>
      <feature name='extapic'/>
      <feature name='cmp_legacy'/>
      <feature name='3dnow'/>
      <feature name='3dnowext'/>
      <feature name='pdpe1gb'/>
      <feature name='fxsr_opt'/>
      <feature name='mmxext'/>
      <feature name='ht'/>
      <feature name='vme'/>
    </cpu>

And I restarted the regular-guest, so that it boots w/ the -cpu flags which carry the AMD virt extensions:


[root@phy-host-amd ~]# ps -ef | grep -i qemu-kvm
qemu     26677     1 14 10:39 ?        00:00:30 /usr/bin/qemu-kvm -S -M pc-0.14 -cpu phenom,+wdt,+skinit,+osvw,+3dnowprefetch,+misalignsse,+sse4a,+abm,+cr8legacy,+extapic,+cmp_legacy,+lahf_lm,+rdtscp,+pdpe1gb,+popcnt,+cx16,+ht,+vme -enable-kvm -m 4096 -smp 6,sockets=2,cores=8,threads=1 -name regular-guest -uuid 8f6a4478-496b-51d8-2de2-ff7fdb964af3 -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/regular-guest.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -drive file=/var/lib/libvirt/images/regular-guest.img,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=24,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:5f:c6:5f,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

Now, let’s fetch the IP of the regular-guest using virt-cat


[root@phy-host-amd ~]# virsh list
 Id Name                 State
----------------------------------
  5 regular-guest        running
[root@phy-host-amd ~]# 
[root@phy-host-amd ~]# virt-cat regular-guest /var/log/messages | grep 'dhclient.*bound to'
Jan 17 10:13:06 dhcpyy-zz dhclient[732]: bound to ww.xx.yy.zz -- renewal in 32578 seconds.

(Note: ‘ww.xx.yy.zz’ above will be a bridged IP address)
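The grep above prints the whole log line. If only the address is wanted (say, to feed it to ssh), a small awk filter can pull it out. This is a sketch with a made-up function name; it assumes the dhclient log format shown above and keeps the last lease it sees.

```shell
# Read syslog lines on stdin and print the address from the last
# "dhclient ... bound to <address>" entry; exit non-zero if none is found.
last_bound_ip() {
    awk '
        /dhclient/ {
            for (i = 1; i < NF - 1; i++)
                if ($i == "bound" && $(i + 1) == "to")
                    ip = $(i + 2)
        }
        END {
            if (ip == "") exit 1
            print ip
        }
    '
}
```

For example: virt-cat regular-guest /var/log/messages | last_bound_ip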

Create the nested guest
Now, install the virt packages in the regular-guest. Also, let’s check that the /dev/kvm char device is exposed in the regular-guest, and start the libvirtd service.


[root@regular-guest ~]# file /dev/kvm 
/dev/kvm: character special
[root@regular-guest ~]# systemctl status libvirtd.service 
libvirtd.service - LSB: daemon for libvirt virtualization API
          Loaded: loaded (/etc/rc.d/init.d/libvirtd)
          Active: active (running) since Tue, 17 Jan 2012 10:49:25 -0500; 5s ago
         Process: 1440 ExecStart=/etc/rc.d/init.d/libvirtd start (code=exited, status=0/SUCCESS)
        Main PID: 1448 (libvirtd)
          CGroup: name=systemd:/system/libvirtd.service
                  ├ 1448 libvirtd --daemon
                  └ 1501 /usr/sbin/dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid --conf-file= --exce...
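Besides /dev/kvm, the CPU flags seen inside the regular-guest are another quick way to confirm that the virt extensions made it through. A minimal sketch, with a made-up function name, that looks for svm (AMD) or vmx (Intel) in a cpuinfo-style file:

```shell
# Print the first hardware virt flag (svm or vmx) listed in a
# cpuinfo-style file; exit non-zero when neither is present.
virt_flag() {
    cpuinfo=${1:-/proc/cpuinfo}
    awk '
        /^flags/ {
            for (i = 1; i <= NF; i++)
                if ($i == "svm" || $i == "vmx") { print $i; found = 1; exit }
        }
        END { exit !found }
    ' "$cpuinfo"
}
```

Run without arguments inside the regular-guest, it reads /proc/cpuinfo; no output and a non-zero exit status mean the guest CPU does not advertise either flag.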

Proceed with installing a minimal F16 nested-guest w/ virt-install. The script I used is here

Debugging note: Once the guest install is finished, fix the serial console access by disabling plymouth-service using this workaround. This will let us login via virsh serial console(to get kernel and boot messages) w/o any line breaks while entering credentials:

 # ln -s /dev/null /etc/systemd/system/plymouth-start.service

Get the (NATed) IP of the nested-guest. (Also, grepped for the qemu-kvm command-line of the nested-guest.)


[root@regular-guest ~]# virsh list
 Id Name                 State
----------------------------------
  2 nested-guest         running
[root@regular-guest ~]# ps -ef | grep qemu-kvm
qemu      2245     1  2 Jan17 ?        00:20:11 /usr/bin/qemu-kvm -S -M pc-0.14 -enable-kvm -m 2048 -smp 3,sockets=3,cores=1,threads=1 -name nested-guest -uuid 2aae2ab5-ddb6-2585-aa16-7fe97296f34b -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/nested-guest.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -drive file=/var/lib/libvirt/images/nested-guest.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=24,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:0e:4e:53,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

[root@regular-guest ~]# virt-cat nested-guest /var/log/messages | grep 'dhclient.*bound to'                                                            
Jan 17 11:08:30 localhost dhclient[721]: bound to 192.168.122.220 -- renewal in 1393 seconds.
[root@regular-guest ~]# 

SSH into the nested-guest, install the virt-what package, and run it to see if we’re on a hypervisor:


[root@localhost ~]# cat /etc/fedora-release 
Fedora release 16 (Verne)
[root@localhost ~]# ifconfig eth0 | grep inet
          inet addr:192.168.122.220  Bcast:192.168.122.255  Mask:255.255.255.0
          inet6 addr: fe80::5054:ff:fe0e:4e53/64 Scope:Link
[root@localhost ~]# 
[root@localhost ~]# virt-what 
kvm

Wooo!! so we’re on an OS which is inside an OS which is inside an OS.

Nested Virtualization with KVM Intel

Some context: In regular virtualization, your physical Linux host is the hypervisor and runs multiple operating systems. Nested virtualization lets you run a guest inside a regular guest (essentially a guest hypervisor). For AMD, nested support has been available for a while, and some people have reported success w/ nesting KVM guests. For Intel arch, support arrived more recently (a year-ish ago) and some work is still in progress, so I thought I’d give it a whirl when Adam Young started a discussion about it in the context of the OpenStack project.

Some of the common use cases being discussed for nested virtualization:
- For instance, a cloud user gets a beefy regular guest (which she completely controls). This user can turn the regular guest into a hypervisor, and can cheerfully run/manage multiple guests for developing or testing w/o the hassle and intervention of the cloud provider.
- Possibility of having many instances of a virtualization setup (hypervisor and its guests) on a single bare-metal machine.
- Ability to debug and test hypervisor software

I have immediate access to moderately beefy Intel hardware, so the rest of the post is based on Intel’s CPU virt extensions. Before proceeding, let’s settle on some terminology for clarity:

  • Physical Host (Host hypervisor/Bare metal)
    • Config: Intel(R) Xeon(R) CPU (4 cores/socket); 10GB Memory; CPU Freq – 2GHz; Running latest Fedora-16 (minimal footprint, @core only with virt pkgs); x86_64; kernel-3.1.8-2.fc16.x86_64
  • Regular Guest (Or Guest Hypervisor)
    • Config: 4GB Memory; 4vCPU; 20GB Raw disk image with cache =’none’ to have decent I/O; Minimal, @core F16; And same virt-packages as Physical Host; x86_64
  • Nested Guest (Guest installed inside the Regular Guest)
    • Config: 2GB Memory; 1vCPU; Minimal(@core only) F16; x86_64

Enabling Nesting on the Physical Host

Node Info of the Physical Host.

 
# virsh nodeinfo
CPU model:           x86_64
CPU(s):              4
CPU frequency:       1994 MHz
CPU socket(s):       1
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         10242864 kB

Let us first ensure the kvm_intel kernel module has nesting enabled. By default, it’s disabled for Intel arch (but enabled for AMD’s SVM, Secure Virtual Machine, extensions).

 
# modinfo kvm_intel | grep -i nested
parm:           nested:bool
# 

We need to pass kvm-intel.nested=1 on the kernel command line when rebooting the host to enable nesting for the Intel KVM kernel module. This can be verified after boot by doing:

 
# cat /sys/module/kvm_intel/parameters/nested 
Y
# systool -m kvm_intel -v   | grep -i nested
    nested              = "Y"
# 
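If the sysfs value is not what you expect, it can help to confirm that the option really reached the running kernel. Here is a small sketch (the function name is made up) that greps a kernel command line for it. It accepts both kvm-intel and kvm_intel, since the kernel treats the dash and underscore alike in module parameter names, and it accepts 1, Y or y as the value.

```shell
# Return 0 if a kernel command line (default: /proc/cmdline) carries
# kvm-intel.nested=1; "-" and "_" in the module name are treated alike.
cmdline_enables_nesting() {
    cmdline_file=${1:-/proc/cmdline}
    grep -Eq '(^|[[:space:]])kvm[-_]intel\.nested=(1|Y|y)([[:space:]]|$)' "$cmdline_file"
}
```

For example: cmdline_enables_nesting && echo "nested=1 was passed at boot"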

Alternatively, Adam Young identified that nesting can be enabled persistently by adding the directive ‘options kvm_intel nested=1’ to the end of the /etc/modprobe.d/dist.conf file and rebooting the host.

Set up the Regular Guest(or Guest hypervisor)
Install a regular guest using virt-install, the oz tool, or any other preferred way. I made a quick script here. Make sure to have cache=’none’ in the disk element of the guest hypervisor’s xml file. (Observation: an install via the virt-install tool didn’t seem to pick this option by default.) Here is the ‘disk’ element libvirt xml snippet:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/var/lib/libvirt/images/regular-guest.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

Now, let’s try to enable Intel VMX(Virtual Machine Extensions) in the regular guest’s CPU. We can do it by running the below on the Physical host(aka Host Hypervisor), and adding the ‘cpu’ attribute to the regular-guest’s libvirt xml file, and start the guest.

# virsh  capabilities | virsh cpu-baseline /dev/stdin 
<cpu match='exact'>
  <model>Penryn</model>
  <vendor>Intel</vendor>
  <feature policy='require' name='dca'/>
  <feature policy='require' name='xtpr'/>
  <feature policy='require' name='tm2'/>
  <feature policy='require' name='vmx'/>
  <feature policy='require' name='ds_cpl'/>
  <feature policy='require' name='monitor'/>
  <feature policy='require' name='pbe'/>
  <feature policy='require' name='tm'/>
  <feature policy='require' name='ht'/>
  <feature policy='require' name='ss'/>
  <feature policy='require' name='acpi'/>
  <feature policy='require' name='ds'/>
  <feature policy='require' name='vme'/>
</cpu>

The o/p of the above cmd has a variety of options. Since we need only vmx extensions, I tried the simple way by adding to the regular-guest’s libvirt xml(virsh edit ..) and started it.

<cpu match='exact'>
  <model>core2duo</model>
 <feature policy='require' name='vmx'/>
</cpu>

Thanks to Jiri Denemark for the above hint. Also note that, there is a very detailed and informative post from Dan P Berrange on host/guest CPU models in libvirt.

As we enabled vmx in the guest-hypervisor, let’s confirm that vmx is exposed in the emulated CPU by ensuring qemu-kvm is invoked with -cpu core2duo,+vmx :


[root@physical-host ~]# ps -ef | grep qemu-kvm
qemu     17102     1  4 22:29 ?        00:00:34 /usr/bin/qemu-kvm -S -M pc-0.14 
-cpu core2duo,+vmx -enable-kvm -m 3072
-smp 3,sockets=3,cores=1,threads=1 -name f16test1 
-uuid f6219dbd-f515-f3c8-a7e8-832b99a24b5d -nographic -nodefconfig 
-nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/f16test1.monitor,server,nowait 
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
-drive file=/export/vmimgs/f16test1.img,if=none,id=drive-virtio-disk0,format=raw,cache=none
-device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=21,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:e6:cc:4e,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
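Eyeballing that long command line gets old quickly. A tiny awk filter can print just the -cpu value of each qemu-kvm process. This is a sketch with a made-up function name; it simply prints whatever word follows -cpu on each input line.

```shell
# Read qemu-kvm command lines on stdin and print the value that follows -cpu.
qemu_cpu_arg() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "-cpu") print $(i + 1) }'
}
```

For example: ps -ef | grep '[q]emu-kvm' | qemu_cpu_arg should print core2duo,+vmx for the guest above.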

Now, let’s attempt to create a nested guest

Here comes the more interesting part, the nested-guest config. will be 2G RAM; 1vcpu; 8GB virtual disk. And let’s invoke a virt-install cmdline with a minimal kickstart install:


[root@regular-guest ~]# virt-install --connect=qemu:///system \
    --network=bridge:virbr0 \
    --initrd-inject=/root/fed.ks \
    --extra-args="ks=file:/fed.ks console=tty0 console=ttyS0,115200 serial rd_NO_PLYMOUTH" \
    --name=nested-guest --disk path=/var/lib/libvirt/images/nested-guest.img,size=6 \
    --ram 2048 \
    --vcpus=1 \
    --check-cpu \
    --hvm \
    --location=http://download.foo.bar.com/pub/fedora/linux/releases/16/Fedora/x86_64/os/ \
    --nographics

Starting install...
Retrieving file .treeinfo...                                                                                                 | 1.7 kB     00:00 ... 
Retrieving file vmlinuz...                                                                                                   | 7.9 MB     00:08 ... 
Retrieving file initrd.img...                               28% [==============                                   ] 647 kB/s |  38 MB     02:25 ETA 

virt-install proceeds fine(to a certain extent), doing all regular things like getting access to network, create devices, create file-systems, dep checks performed, and finally package install proceeds:


Welcome to Fedora for x86_64



     ┌─────────────────────┤ Package Installation ├──────────────────────┐
     │                                                                   │
     │                                                                   │
     │                                 24%                               │
     │                                                                   │
     │                   Packages completed: 52 of 390                   │
     │                                                                   │
     │ Installing glibc-common-2.14.90-14.x86_64 (112 MB)                │
     │ Common binaries and locale data for glibc                         │
     │                                                                   │
     │                                                                   │
     │                                                                   │
     └───────────────────────────────────────────────────────────────────┘

And now, it’s stuck like that forever. Doesn’t budge, trying to install pkgs for eternity. Let’s see what state the guest is in from a separate terminal:


[root@regular-guest ~]# virsh list
 Id Name                 State
----------------------------------
  1 nested-guest         paused

[root@regular-guest ~]# 
[root@regular-guest ~]#  virsh domstate nested-guest --reason
paused (unknown)

[root@regular-guest ~]# 

So our nested-guest seems to be paused, and the package install on the nested-guest’s serial console is still hung. I gave up at this point. I need to see if I can get any helpful info w/ the virt-dmesg tool or any other ways to debug this further.

Just to note, there is enough disk space and memory on the ‘regular-guest’, so that case is ruled out here. And, I tried to destroy the broken nested-guest, and attempted to create a fresh one(repeated twice). Still no dice.

So not much luck yet with Intel arch, I’d have to try on an AMD machine.

UPDATE (on Intel arch): After trying a couple of times, I was finally able to ssh to the nested guest, but after a reboot the nested-guest loses its IP, rendering it inaccessible. (Info: the regular-guest has a bridged IP, and the nested-guest has a NATed IP.) And I couldn’t log in via the serial console, as it’s broken due to a regression (which has a workaround). Also, refer to comments below for further discussion on NATed networking caveats.

Little more disk I/O perf. improvement with ‘fallocate’ing a qcow2 disk

Recently I’ve started using the ‘preallocation=metadata’ flag while creating qcow2 disk images to extract some decent I/O performance. Today, while discussing qcow2 disk image performance with Stefan Hajnoczi (thank you!) on irc, I found that using fallocate, which preallocates all the blocks of a file, on a qcow2 disk image would improve disk I/O performance a little more, as all the blocks are allocated to the file ahead of time. (Just to note: fallocate comes w/ the standard Linux pkg ‘util-linux-ng’.)

Let’s run a quick test to see the disk I/O performance improvement by preallocating all the space in a qcow2 disk.

Create the disk image with ‘preallocation=metadata’

 
$ qemu-img create -f qcow2 -o preallocation=metadata /export/vmimgs/f16-test1.qcow2 8G
Formatting '/export/vmimgs/f16-test1.qcow2', fmt=qcow2 size=8589934592 encryption=off cluster_size=65536 preallocation='metadata' 
 

Let’s check the size of the image in bytes


$ ls -l /export/vmimgs/f16-test1.qcow2
-rw-r--r--. 1 root root 8591507456 Dec  2 16:55 /export/vmimgs/f16-test1.qcow2

# Also, print the allocated file size in blocks
$ ls -lash /export/vmimgs/f16-test1.qcow2
1.4M -rw-r--r--. 1 root root 8.1G Dec  2 16:55 /export/vmimgs/f16-test1.qcow2
 

Run fallocate to preallocate space to the disk image:


$ fallocate -l 8591507456 /export/vmimgs/f16-test1.qcow2 
 

Now, re-run ‘ls’ to print the allocated file size in blocks. (Notice that all the disk size, 8G, is now allocated.)


$ ls -lash /export/vmimgs/f16-test1.qcow2
8.1G -rw-r--r--. 1 root root 8.1G Dec  2 16:55 /export/vmimgs/f16-test1.qcow2
$ 
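To avoid copying the byte count by hand, the same step can be scripted. The helper names below are made up for this sketch; it assumes GNU stat (for -c) and a filesystem on which fallocate is supported.

```shell
# Apparent size of a file in bytes (what "ls -l" shows).
apparent_bytes() { stat -c %s "$1"; }

# Space actually allocated on disk, in bytes (block count times block unit).
allocated_bytes() { echo $(( $(stat -c %b "$1") * $(stat -c %B "$1") )); }

# Preallocate every block of an existing image without changing its length.
preallocate_in_place() { fallocate -l "$(apparent_bytes "$1")" "$1"; }
```

For example: preallocate_in_place /export/vmimgs/f16-test1.qcow2, after which allocated_bytes should report at least as much as apparent_bytes.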
 

Also, let’s run ‘qemu-img info’ to get the disk size, virtual size.


$ qemu-img info f16-test1.qcow2 
image: f16-test1.qcow2
file format: qcow2
virtual size: 8.0G (8589934592 bytes)
disk size: 8.0G
cluster_size: 65536
$ 
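A side note on the numbers: the file length from ‘ls -l’ (8591507456 bytes) is a bit larger than the virtual size (8589934592 bytes). The difference, 1572864 bytes or 1.5 MiB, should roughly be the qcow2 metadata that ‘preallocation=metadata’ wrote up front. A small sed filter (made-up function name, assuming the ‘qemu-img info’ output format shown above) extracts the virtual size for such comparisons:

```shell
# Read "qemu-img info" output on stdin and print the virtual size in bytes.
virtual_size_bytes() {
    sed -n 's/^virtual size:.*(\([0-9][0-9]*\) bytes).*/\1/p'
}
```

For example: qemu-img info f16-test1.qcow2 | virtual_size_bytes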
 

As a simple test, I used the above disk image to create an @core-only Fedora-16 guest (on a Fedora-16 host) and clocked the timing: it took roughly 5 min 32 sec to finish. Previously, w/o fallocating the disk image, the same F16 install took nearly 8 minutes. So there is a decent improvement here.

With this, Stefan noted, disk write speed inside the guest machine should also be improved, when blocks are written for the first time. And also, due to less disk fragmentation — as all the space was preallocated in one operation — there would be fewer disk seeks during large read operations.

Creating a Qcow2 virtual machine

Qcow2 is an interesting disk image format which supports features like internal and external snapshots, backing files, image compression, and encryption. However, its I/O performance is very slow compared to the RAW format. Here are a couple of settings which can extract reasonable performance out of Qcow2 disk images.

Create a qcow2 disk image
First, let’s create a qcow2 disk image using ‘qemu-img’ tool

$ /usr/bin/qemu-img create -f qcow2 -o preallocation=metadata /export/vmimgs/glacier.qcow2 8G

NOTE: At this point in time, preallocation=metadata option is the best we can do to extract max. possible (near RAW) I/O performance out of QCOW2 format. (hint from Kevin Wolf – Qemu/Qcow2 developer )

The listing below shows that 970M is the allocated (used) size of the image, and 8.1G is the max size the image can ‘grow to’.


[root@moon tmp]# ls -lash /export/vmimgs/glacier.qcow2 
970M -rw-r--r--. 1 qemu qemu 8.1G Sep 24 23:45 /export/vmimgs/glacier.qcow2
[root@moon tmp]# 

Create the guest

# Create an unattended minimal guest install using a qcow2 disk image
virt-install --connect=qemu:///system \
    --network=bridge:br0 \
    --initrd-inject=/var/tmp/fed-minimal.ks \
    --extra-args="ks=file:/fed-minimal.ks console=tty0 console=ttyS0,115200" \
    --name=glacier \
    --disk path=/export/vmimgs/glacier.qcow2,format=qcow2 \
    --ram 2048 \
    --vcpus=2 \
    --check-cpu \
    --hvm \
    --location=http://download.fedora.redhat.com/pub/fedora/linux/releases/15/Fedora/x86_64/os/ \
    --nographics 

The above will create a minimal guest w/ a qcow2 disk image format. Content of the fed-minimal kickstart is here

Once the guest is created, ensure the cache=’none’ parameter is present in the ‘disk’ element of the guest’s xml file (if not present, add it and redefine the xml; it looks like below). This is another aspect which can improve the disk I/O performance.


[root@moon ~]# grep cache /etc/libvirt/qemu/glacier.xml -A 4 -B 1
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2' cache='none'/>
      <source file='/export/vmimgs/glacier.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
[root@moon ~]# virsh define /etc/libvirt/qemu/glacier.xml
Domain glacier defined from /etc/libvirt/qemu/glacier.xml
[root@moon ~]# virsh start glacier
Domain glacier started

[root@moon ~]#
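To check a guest for this setting without opening the xml, a one-line grep wrapper is enough. This is a sketch with a made-up function name; it reads the domain xml on stdin and only looks for a ‘driver’ element carrying cache set to none.

```shell
# Read a libvirt domain XML on stdin; succeed if a disk <driver> element
# sets cache to none.
has_cache_none() {
    grep -Eq "<driver[^>]*cache=['\"]none['\"]"
}
```

For example: virsh dumpxml glacier | has_cache_none && echo "cache=none is set"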

I’m still trying to wrap my head around the caching and preallocation mechanisms of the qcow2 format. Meanwhile, work on Qcow2 version-3 is in progress in upstream qemu.

Unattended guest install with a local kickstart

Up until recently, I was doing all my unattended installs to provision Linux guests with a kickstart hosted over http. Now, with the neat ‘--initrd-inject’ flag in virt-install, guests can be created using a local kickstart file. (This is available in python-virtinst-0.500.4-1 and above.)

Here is the invocation; it assumes network bridging is configured. Refer to this to quickly configure bridging (if you haven’t already).

This creates a minimal guest w/ these specs: 20G disk image(RAW format); 2GB ram ; 2 Virtual CPUs.

#!/bin/bash
set -x

# Note: Replace this with your local Fedora tree if you have one.
tree=http://download.fedoraproject.org/pub/fedora/linux/releases/15/Everything/x86_64/os

virt-install --connect=qemu:///system \
    --network=bridge:br0 \
    --initrd-inject=/export/fed-minimal.ks \
    --extra-args="ks=file:/fed-minimal.ks \
      console=tty0 console=ttyS0,115200" \
    --name=f15testbox \
    --disk /var/lib/libvirt/images/f15testbox.img,size=20 \
    --ram 2048 \
    --vcpus=2 \
    --check-cpu \
    --accelerate \
    --hvm \
    --location=$tree \
    --nographics 

Also, a serial console (ttyS0) is configured so that the whole boot/package-install process (a complete install should take around 4-5 mins) can be seen from the shell.
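One detail that is easy to trip over: ‘--initrd-inject’ places the file at the root of the installer’s initrd, which is why the script passes ks=file:/fed-minimal.ks rather than the host path /export/fed-minimal.ks. A tiny helper (made-up name, just a sketch) derives the matching argument from the host path:

```shell
# Print the ks= kernel argument that matches a file passed to --initrd-inject.
ks_arg() {
    echo "ks=file:/$(basename "$1")"
}
```

For example: ks_arg /export/fed-minimal.ks prints ks=file:/fed-minimal.ks, which can be dropped into --extra-args.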

Later on, for remote management using libvirt, you can connect to guests using:

  # virsh console f15testbox 

And the contents of the kickstart file ‘fed-minimal.ks’. This is the smallest possible install:

  
# Minimal Kickstart file
install
text
reboot
lang en_US.UTF-8
keyboard us
network --bootproto dhcp
#Choose a saner password here.
rootpw testpasswd
firewall --enabled --ssh
selinux --enforcing
timezone --utc America/New_York
#firstboot --disable
bootloader --location=mbr --append="console=tty0 console=ttyS0,115200 rd_NO_PLYMOUTH"
zerombr
clearpart --all --initlabel
autopart

#Just core packages
%packages
@core
%end

 

I placed all the above in a quick script

A note on bridging, network device names:
With the recent Fedora 15 feature Consistent Network Device Naming, network device names from here on will be renamed from ethX to something like emX. Refer to the Fedora test day for more information and a script to determine if your system is impacted by the Biosdevname change. (I had to do this for my DELL optiplex Xeon test box.)

Experiment with ‘Native Linux KVM Tool’

Over the weekend I was tinkering around with the recently announced version 2 of ‘Native Linux KVM tool’. My aim was to boot a minimal Linux (RAW) disk image and a QCOW2 based image. Read on for details:

A little bit of context
‘Native Linux KVM tool’ was first announced by Pekka Enberg on lkml/kvm-upstream lists. From the announce email:

“Right now it can boot a Linux image and provide you output via a serial
console, over the host terminal, i.e. you can use it to boot a guest
Linux image in a terminal or over ssh and log into the guest without
much guest or host side setup work needed.”

Essentially, the initial goal of the tool appears to be a featherlight-weight userspace alternative to QEMU, which can boot Linux guests. And this Native tool lives inside the kernel tree under /tools(this means, if/when this tool is merged in the mainline kernel tree, a linux distro will by default get minimal userspace tool to boot linux guests). However, QEMU does plenty more than booting a linux guest.(Quick googling will provide all the info.)

Getting to the matter
On Saturday, I started off by pulling Penberg’s kernel git tree[1], and proceeded to configure it with the build-time options required for ‘Native Linux KVM tool’ by following the Version 2 announcement, enabling each of the necessary options in the kernel configuration. Before I proceeded, Sasha Levin (Native KVM tool dev.) suggested that I compile the guest kernel without any modules. The reason: if I built it with modules, all of them would need to be explicitly loaded into the disk image before booting. If they are all built in, we just use the ‘bzImage’. So, I did a quick `sed -i 's/=m/=y/' .config` to *include* everything and compiled the kernel with ‘make -j5’ (so that it builds with 5 parallel jobs).
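A slightly tighter variant of that sed is sketched below (this is not what I ran at the time; the function names are made up). Anchoring the match at end-of-line leaves alone string values that merely contain ‘=m’, such as a built-in command line with root=mtdblock0.

```shell
# Turn every "=m" (module) option in a kernel config into "=y" (built in).
# Anchoring on end-of-line leaves string values that contain "=m" untouched.
modules_to_builtin() {
    sed -i 's/=m$/=y/' "$1"
}

# Count the options still set to build as modules (prints 0 when none).
count_modules() {
    grep -c '=m$' "$1" || true
}
```

For example: modules_to_builtin .config, then count_modules .config should print 0.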

Build complete, bzImage generated. Without any delay I went ahead and launched the hypervisor to boot the minimal disk image:

  [root@moon kvm]#./kvm run -d /export/test-images/linux-0.2.img 
 # kvm run -k ../../arch/x86/boot/bzImage -m 448 -c 4 --name guest-9408
.
.
.
[   63.903734] Copyright (c) 2009 - 2010 Intel Corporation.
[   63.903753] ixgbe: Intel(R) 10 Gigabit PCI Express Network Driver - version
3.3.8-k2
[   63.903755] ixgbe: Copyright (c) 1999-2011 Intel Corporation.
[   63.903774] ixgbevf: Intel(R) 10 Gigabit PCI Exp[root@moon kvm]# 

No dice! As can be seen, it just exits abruptly w/o providing any useful error. (Later it was pointed out to me that it didn’t go through the regular exit path either.) So I provided my kernel .config and bzImage for debugging.

Meanwhile, Sasha provided me his kernel .config file to get a working ‘kvm’ tool. Rinse and repeat – re-compilation and booting the disk image.

Woot! this time it boots into the minimal linux disk image.

 [root@moon kvm]# ./kvm run -d /export/test-images/linux-0.2.img
 # kvm run -k ../../arch/x86/boot/bzImage -m 448 -c 4 --name guest-9408
.
.
.
[    1.386747] Bluetooth: Virtual HCI driver ver 1.3
[    1.387838] Bluetooth: HCI UART driver ver 2.2
[    1.388615] Bluetooth: HCI H4 protocol initialized
[    1.389481] Bluetooth: HCI BCSP protocol initialized
[    1.390366] Bluetooth: HCILL protocol initialized
sh-2.05b# mkdir /export 
sh-2.05b# pwd 
/export
sh-2.05b# echo test > foo.txt
sh-2.05b# cat foo.txt 
test
sh-2.05b# ls /
bin   dev  export  lost+found  proc  sbin  usr
boot  etc  lib	   mnt	       root  tmp   var
sh-2.05b# df -hT
Filesystem    Type    Size  Used Avail Use% Mounted on
rootfs      rootfs     20M   17M  1.7M  92% /
/dev/root     ext4     20M   17M  1.7M  92% /
devtmpfs  devtmpfs    150M     0  150M   0% /dev
sh-2.05b# 

Next morning, on Sunday, I tried a couple of regular virtual guest operations like pause, resume, list, stop. I first chose to ‘pause’ the guest. Poof! I’m thrown out of my ssh session to the *host*, where I’m doing all of this. I can’t re-ssh any more; the SSH session was killed. All I see is a connection refused. Confused, I checked with Penberg; he replied that it’s a nasty bug (which sends the pause signal to the wrong process), but already fixed, and pointed me to the git commit. OK, that confirmed I didn’t screw up anything. I re-pulled and re-compiled the kernel tree on my Lenovo X200 laptop. (Later in the day, I went and fixed the host test machine; this reminded me to tie this machine to a remote power management console.)

With the up2date git tree (and post recompilation), I was able to do all the regular virtual guest operations fine.

Results
Test env.
OS: Fedora -15
Processor: 4 Intel(R) Xeon(R) CPU E5405 @ 2.00GHz
RAM: 8GB

I tested w/ both minimal RAW and a debian squeeze (de-compressed) QCOW2 disk images.

With minimal RAW image: I was able to run, pause, resume, list, and stop the minimal RAW Linux disk image. Also, from my (limited) test results, guest networking is a little fragile (understandably) at the moment.
Posted my results here.

With QCOW2 image: Currently a QCOW2 image boots in ‘read-only’ mode. To try it, below is the working syntax.


# ./kvm run -p "root=/dev/vda1" -d /export/test-images/debian_squeeze_i386_standard_decompressed.qcow2


For more verbose details, errors I encountered, conversion of compressed to de-compressed images, I posted my notes and results here.

NOTE: I’m yet to boot any of the Fedora/RHEL RAW disk images which I use in my daily work. Also, I haven’t tried out the ‘virtio balloon’ feature. Will write more on these when I get to it.

For a quick shot
If you don’t want to compile your own kernel and do all the stuff, to give a quick shot, all that needs to be done is:

Download the working Kernel x86 boot executable ‘bzImage’(I built with Sasha’s config); compiled ‘kvm’ tool binary ; and a minimal linux disk image(linux-0.2.img)
– ‘bzImage’ and ‘kvm’ tool binary are located on my fedora people page
– minimal linux disk image — # wget http://wiki.qemu.org/download/linux-0.2.img.bz2 && bunzip2 linux-0.2.img.bz2

And, run the hypervisor to boot into the minimal disk image:

 
# ./kvm run -k ~/tinker/native-linux-kvm/linux-kvm/arch/x86/boot/bzImage  -d /export/testimages-nlt/linux-0.2.img
 

I jotted down a README here.
If you’re interested to compile your own kernel from scratch w/ latest git, I uploaded a working kernel .config (from Sasha) on my fedora people page.

Thanks a lot to Sasha Levin and Pekka Enberg. They were very helpful answering all my questions and also walked me through some of the issues I was facing.

References

  1. git://github.com/penberg/linux-kvm.git
  2. http://kashyapc.fedorapeople.org/native-linux-kvm-tool/working-native-linux-kvm-tool/sashal-linux-config

Update:
With commit 6533f7913743742fdd690eee0930fb7ba1bcbb1f, Pekka introduced an ‘init’ target. So if there are any errors (like /usr/bin/ld: cannot find -lc) while compiling the kvm binary on Fedora, ensure the glibc-static package is installed to get them resolved.