Nikola Tesla – Man Out of Time

Just finished reading Nikola Tesla – Man Out of Time, the fascinating biography by Margaret Cheney, which I had wanted to read for a while. It offers terrific insight into this enigmatic scientist, his spectacular inventions, his personality and his eccentricities.

One of his truly unique skills (which blew my mind) was his ability to conceptualize, develop, iterate on and perfect his complex and intricate inventions (mostly induction motors, coils, turbines, and other electrical/mechanical equipment) entirely in his head, without any blueprints on paper whatsoever, before going for the actual implementation. And, more often than not, they worked just as he expected them to.

A quick quote from a chapter titled ‘Robots’:
“Inventors of modern computer technology in the last half of twentieth century repeatedly have been surprised, when seeking patents, to encounter Tesla’s basic ones, already on file.”

Another interesting (but a little scary) theory of his (from a chapter called ‘To Mars’) was that he could ‘split open the earth itself in the same way as a boy would split an apple’ by applying the principles of mechanical resonance.

Among other things, the book outlines many of his great inventions in electrical, mechanical, wireless and other engineering fields. It also covers other interesting aspects, like the war of the currents (DC vs. AC), the invention of radio (for which he got the patent posthumously), wireless transmission of electricity, and illuminating entire oceans and their depths (so that catastrophes like the ‘Titanic’ could be avoided), all in a methodical way and in just enough detail without being overly esoteric.

I thoroughly enjoyed reading it. Highly recommended, especially for those pursuing (or interested in) any kind of engineering discipline.

FOSDEM 2012 (Feb 4–5) Trip Report

Gentle warning: Very long post. (I tried to organize it by talk, so you can skip the ones that don’t interest you :) )

I was fortunate to attend the FOSDEM conference for the first time, in the frozen city of Brussels, Belgium. For those unfamiliar with FOSDEM: it is a two-day conference (Saturday and Sunday) about free and open source software, organized entirely by volunteers (with no room for commercial talks). I believe this is the first FOSDEM with more than 25 talks from RH folks. I’ll try to outline the two days of the event and the talks I managed to attend.

My day started off with the FOSDEM welcome keynote. Some statistics from the welcome note: 4 keynotes; 6 main tracks; 35 stands; 25 rooms; 7 main track talks; 428 scheduled events; 418 speakers; 31 lightning talks; 361 devroom talks.

This much choice may overwhelm people: “Oh, I want to attend this session, but I also want to attend that one and the other one, which may all happen concurrently”. But I guess people realize the physical impossibility anyway and stick to a couple of dev rooms or so.

== Day 1 ==

OpenStack News: Last year retrospective
—————————————
This was in the ‘Virtualization and Cloud’ dev room. Thierry Carrez (Release Manager for the OpenStack project[1]) gave an overview of how the project evolved over the past year, the number of contributors, the components involved and what’s coming ahead. He briefly talked about the main OpenStack projects:
- ‘Nova’ (also called OpenStack Compute): the central part of an IaaS system, which provides an interface, via the web, to the virtualization software installed on the host;
- ‘Swift’ (OpenStack Object Storage): a scalable storage system;
- ‘Glance’ (OpenStack Image Service): which retrieves the disk images;
- ‘Keystone’ (OpenStack Identity): which provides unified authentication across projects.

There appears to have been a near-exponential rise in contributions since last year (given the number of companies involved and the many individual contributors). He also discussed the OpenStack ‘Horizon’ project, the dashboard that provides a web interface to the OpenStack services noted above.

Common Criteria Certification of Open Source Software
—————————————————–
I then walked into the ‘Hardware Security and Cryptography’ dev room, where Tomas Gustavson (PrimeKey, CTO) discussed Common Criteria and open source software[2]. He talked about the process and procedures involved, and explained how the several CC documents are linked together. He then outlined the pains involved: time, money, the technical level required, the fact that only a specific version is certified, and the need to keep track of all the minute details, documents and their links. However, he conceded that it is important, as the certification assures that the certified product works as it is /documented/ to work. He then moved on to the intricacies involved with open source and CC, and mentioned Red Hat, IBM and PrimeKey as providers of certified open source products. His conclusion: however tedious the process may be, the end result is satisfying and gives confidence to governments, federal agencies, major banks and related customers deploying such software.

I missed Richard W Jones’ talk on ‘libguestfs’ as my own talk was in the same slot.

Overview of Dogtag Certificate System
————————————–
After the Common Criteria talk, I gave my brief talk about the Dogtag Certificate System[3] to an audience of 50-60 people in the 100-capacity dev room. I started off with the different subsystems involved, a configuration overview, and a couple of possible deployment scenarios. Then I talked a little about cloning subsystems for high availability, the different security mechanisms available, and some command-line tools. After that, I discussed upcoming plans for a REST-based design, tighter integration of the subsystems with the freeIPA project, and the refactoring work in progress. Then I briefly showed a small demo (the talk was 25-30 mins, so I didn’t manage to do a proper demo in time) of pre-installed subsystems (CA, KRA, OCSP) and the web interface on a virtual machine on my laptop. I was a little nervous while presenting, but I did get a few questions.

Unfortunately, I couldn’t get to meet Kai Engert (upstream/RH Mozilla-nss maintainer), who organized the dev room. He actually sent me an email to meet up on Saturday night (1st day of the conf.), but I wasn’t able to check it in time (as I didn’t use the prohibitively expensive internet at the hotel). He was also handling other talks in the Mozilla dev rooms.

I then moved to the ‘Hypervisors’ track.

Ganeti:(A look inside the Virtualization Cluster Management system)
——————————————————————-
Guido Trotter (of Google) discussed their project ‘Ganeti’[4], which manages clusters of physical machines running virtualization software on commodity hardware. It supports both the Xen and KVM hypervisors. Live migration appears to be one of its critical features. Guido started with some terminology, the components involved, the possible configurations, the roles of virt nodes, and some customizations that can be done. He also talked about storage management and replication.

I was wondering why there was a pressing need for Google to start yet another management layer project for virtual machines (be it for clusters or something else), as there are already many existing management projects catering to several virtualization use-cases.

Virtualization with KVM: bottom to top, past to future
——————————————————
Paolo Bonzini (of Red Hat) gave a complete overview[5] of the entire virt. stack, covering several use-cases relating to desktop, server and cloud virtualization. He started low in the stack with the KVM hypervisor’s entry into the Linux kernel and its integration with the QEMU project. From there, he moved up the stack, discussing libvirt for management, its features, and the several libvirt APIs available for other applications to use. Then he covered desktop virtualization management software like virt-manager and the more recent ‘Gnome-Boxes’ (more on this below), and several other virt-tools for disk manipulation. Moving on, he discussed large-scale virtualization problems and the available solutions (oVirt, OpenStack, Ganeti), and did some comparison of these technologies. He concluded with a roadmap for the KVM, QEMU, libvirt, oVirt node and oVirt engine projects.

Linux Containers and OpenVZ
—————————
Kir Kolyshkin (OpenVZ project maintainer) started off by introducing the concept of Linux containers, which deals with operating-system-level virtualization, as opposed to whole-system (or full machine) virtualization like QEMU/KVM. This means that, with containers, there is only the one real hardware (no virtual hardware to deal with), a single kernel and many user-space instances. Container technology is primarily used by hosting providers for deploying web applications. An alternative technology (which Red Hat supports and actively contributes to) is LXC (Linux Containers). As there is no hypervisor overhead, higher density is possible with Linux containers. Each container has its own files (chroot), process tree, network, devices and IPC objects. He discussed some OpenVZ[6] features and how it compares with LXC, and also discussed dynamic resource allocation using ‘cgroups’ technology. He then mentioned some of the tools and other new features/related projects upcoming in OpenVZ:

- vzctl: A tool to control OpenVZ containers.
- VSwap: A new approach to memory management, which requires only two parameters to configure: RAM and swap.
- ploop: A reimplementation of the Linux loop device, which supports ‘plain’ raw and qcow2 formats, n/w storage, snapshots and fast provisioning via stacked images.
- CRIU (Checkpoint/Restore (mostly) In User-space): http://criu.org

LXC has been gaining more and more traction, though, as it doesn’t require a ‘patched’ kernel (which OpenVZ needs) to work with containers. OpenVZ, on the other hand, appears to have more deployments since it’s been around a little longer.

Native Linux KVM Tool(NLKT)
—————————
Sasha Levin (NLKT developer) introduced the NLKT project[7]: a feather-light user-space alternative to QEMU that lives in the kernel tree, written from scratch for running KVM hypervisor based guests (Linux only at the moment). The project sits inside the kernel tree under the /tools directory. It was born out of a long thread (100+ emails, initiated by Ingo Molnar) on the upstream kvm list, an RFC about unifying KVM user space (QEMU) and kernel space into a single project, as it is a single experience to the end user. After lots of heated discussion, the QEMU/KVM maintainers and contributors each had their own reservations and no consensus was reached. NLKT already works, but the project is still in the development phase, with several active contributors. It supports only the minimal set of legacy devices required for booting (for simplicity and maintenance’s sake). Also to note: it doesn’t support the many architectures that QEMU supports. He also outlined upcoming features.

NLKT has been submitted for inclusion into the mainline kernel (but is not yet accepted). What this means is that, if it is merged, a Linux distro will get a minimal user-space tool to boot Linux guests by default.

Having said that, QEMU is light years ahead, with thousands of man-hours spent on development and testing, support for plenty of enterprise features, and the wide deployment base it already has.
(I experimented with NLKT a couple of times out of curiosity, during free time, to see how it works and to learn a different perspective on KVM.)

OpenStack developers meeting and Distribution panel
—————————————————
I was still hanging around the ‘Virtualization and Cloud’ devroom, so I joined this last talk of the day just to observe how things progress. Thierry Carrez and a couple of other OpenStack contributors moderated this session, attended by 40-50 folks, including people representing different distributions and upstream projects. The discussions mostly revolved around the concerns of distributions, the governance model and further improvements. From my observation, I don’t think there was any concrete consensus on any of the topics. A few Red Hat engineers discussed the work Red Hat is doing, while the moderator was keener on hearing a clear statement of Red Hat’s stance on OpenStack, and on other surrounding areas relating to the budget for OpenStack conferences.

That ends Day 1.

== Day 2 ==

USB redirection over network
—————————-
I came in half-way through this talk by Hans de Goede (of Red Hat). His talk was primarily about USB redirection[8], as in using USB devices (which are plugged into the physical machine) inside a virtual machine. I missed the part where he talked about the special case where the USB device being redirected is not on the physical machine but on a machine located elsewhere, and how that device is accessed over the network inside a guest.

He also gave a small demo of USB redirection where he plugged a mouse into the physical machine and was able to use it inside the virtual machine.

I also briefly attended a talk on ‘Tool kits and Wayland’ in the ‘CrossDesktop Devroom’, presented by Rob Bradford (of Intel): a discussion about the next-generation display server, which promises a much smoother user experience.

GNOME Boxes, use other systems with ease
—————————————-
In the ‘CrossDesktop Devroom’, Zeeshan Ali and Marc-André Lureau (of Red Hat) talked about ‘Gnome Boxes’[9], desktop virtualization software which is integrated into Gnome-3. ‘Boxes’ uses libvirt under the hood. virt-manager, by contrast, is a separate application which needs to be invoked on its own. A super-quick demo was also provided by Zeeshan.

For more info, refer to Daniel P Berrange’s post on this and future of virt-manager.[9.1]

Virtualization Management the oVirt way
—————————————
Itamar Heim (of Red Hat) presented a high-level overview[10] of the oVirt project, which targets large-scale virtualization/data center management, leveraging many existing (KVM-based) virtualization technologies. He started by discussing the goals of building a community around the virt. stack and a little bit about the governance model. Then he went over the life cycle of virtual machine management in the oVirt interface using screenshots, and discussed several management features available for live migration, system scheduling, power/image management, monitoring, etc. He then showed a high-level architecture (which shows IPA as a component). Then he briefly discussed ‘hooks’, which can modify a VM definition as desired just before the VM starts. Some example hooks he mentioned are:
- CPU pinning
- Single Root I/O Virtualization (SR-IOV): gives a guest a performance benefit similar to assigning it a physical PCI device (like a n/w port).
- Smart card
- Hugepages (related to memory)
- NUMA (Non-Uniform Memory Access)

He also outlined several upcoming features: live snapshots, live storage migration, hot plug, multiple storage domains, shared disks, iSCSI disks, shared file system support, storage array integration, Gluster support, libguestfs integration…

oVirt – Engine Core
——————-
Omer Frenkel (of Red Hat) discussed the oVirt ‘Engine Core’[11], the central part of the oVirt platform, which provides the administration interfaces. He talked about other responsibilities of Engine Core and several internal details. Then he discussed authentication, where user management is done via LDAP servers, with Kerberos authentication to those LDAP servers, and mentioned IPA/AD as what is currently supported. He concluded with some administration details and the road map.

VDSM — the oVirt node management agent
(VDSM: Virtual Desktop and Server Management Daemon)
—————————————————
Federico Simoncelli (of Red Hat) discussed VDSM[12], a high-level API for managing the cluster nodes, which was originally tailored to the needs of oVirt. It is written in Python, and is multi-threaded and multi-process. He outlined some responsibilities of VDSM: it is used to dynamically manage anything from a few VMs on a single host to 1000s of VMs on a cluster of 100s of hosts using multiple storage targets. He concluded by discussing the storage architecture and thin provisioning.

He offered Red Hat swag to audience members who asked questions.

Building app. sandboxes on top of LXC and KVM w/ libvirt
——————————————————-
Daniel P. Berrange (of Red Hat) gave an excellent talk[13] on building sandboxes on top of LXC and KVM using libvirt to an almost full crowd of 500. He started off by differentiating the DAC and MAC access control mechanisms, and then discussed the idea of ‘application sandboxes’, where the goal is to isolate any kind of regular application, thus providing multiple layers of defense. Before going further, he distinguished the ‘selinux sandbox’ from the ‘libvirt sandbox’, which his talk was about. He then talked about the start-up mechanisms for the different libvirt drivers (KVM, LXC) and their performance overheads (CPU execution, start-up/shutdown penalties, device access). Then he discussed some real-life use cases where sandboxing can be applied:
- Deploying multiple Apache virtual hosts (and providing strong isolation);
- Audio transcoding: converting an ‘ogg’ file obtained from an untrusted source into ‘raw’ in a sandboxed environment, thus avoiding file-system and n/w access;
- Running browser instances in a sandboxed environment (one for banking, one for general use, etc.);
- mock RPM build (the chroot is installed using ‘rpm’; running this inside a sandbox keeps malicious %post/%pre scripts from escaping the build environment).

He discusses it in a bit more detail, with some examples of the virt-sandbox command, on his blog[13.1].

That’s it for talks.

After that, I headed to the Fedora stand and did some booth duty: I (politely) answered a couple of questions (hey, why isn’t Fedora nice to me, and when can we expect to see this bug fixed) and handed some swag over to folks. Then we dismantled the booth and headed out into the chill for dinner.

Social
——
After dinner, Tom Callaway (Fedora Engineering Manager), Jonathan Blandford (Gnome Desktop Manager), the Gnome ‘Boxes’ team, a couple of other community members and I went to watch the Super Bowl (American football) at a place called “Fat Boy’s” (it could probably be named better). Though I don’t follow the game at all, I supported the NY Giants because a character in a book I read likes them. And I was diligently warned by Tom Callaway (a Patriots supporter) that I could, but I may get hurt :) . I left exactly at half-time and walked back to the hotel, as it was already 02:00 AM and I couldn’t keep myself awake despite having 2 cups of strong tea. Later I was told the NY Giants won.

Conclusion:
———–
This is the first FOSDEM I attended. I felt it was a great conference (minus the cold wave), where a diverse set of groups converges in one place.

I also had a chance to meet some Red Hatters and several community members (though only very briefly, given the tight schedule of the conference): Daniel P Berrange, Richard W Jones, Paolo Bonzini, Tom Callaway, Lennart Poettering, Zeeshan Ali, Marc-André Lureau. And a couple of others working on the Aeolus/DeltaCloud projects (Michal Fojtik, Francesco Vollera, Marios), as well as Sasha Levin, Pekka Enberg, Christopher Wickert, Jeorg Simon, Bert Desmet, Thorsten Leemhuis, Jonathan Blandford, Jeron Van Meun and many others.

I tried to review the post thrice over. Please forgive if there are any grammatical errors.

Some pictures: http://www.flickr.com/photos/kashyapchamarthy/

References:
———–

[1] http://openstack.org

[2] http://wwwpriv.primekey.se/~tomas/presentations/commoncriteria-opensource-5-FOSDEM.odp
[2.1] http://www.primekey.se/Community/The+CESeCore+project/

[3] http://kashyapc.fedorapeople.org/fosdem2012-dogtag-pki-demo/

[4] http://fosdem.org/2012/schedule/event/427/83_ganeti_internals.pdf

[5] http://fosdem.org/2012/schedule/event/444/82_fosdem12.pdf

[6] http://openvz.org/

[7] http://fosdem.org/2012/schedule/event/360/2_2011-forum-native-linux-kvm-tool.pdf

[8] http://fedoraproject.org/wiki/Features/UsbNetworkRedirection

[9] https://live.gnome.org/Design/Apps/Boxes
[9.1] http://berrange.com/posts/2011/11/22/gnome-3-desktop-virtualization-support-from-gnome-boxes-and-the-future-for-virt-manager/

[10] http://www.ovirt.org/w/images/b/b0/Fosdem2012-ovirt-clean.pdf

[11] http://www.ovirt.org/wiki/File:Ovirt-engine-core_fosdem_2012.pdf

[12] http://www.ovirt.org/wiki/Category:Vdsm

[13] http://people.redhat.com/berrange/fosdem-2012/libvirt-sandbox-fosdem-2012.pdf
[13.1] http://berrange.com/posts/2012/01/17/building-application-sandboxes-with-libvirt-lxc-kvm/

Short post: FOSDEM 2012

Just done with FOSDEM in frozen Brussels. Here is some slide and demo info for the talk I gave in the ‘Hardware and Crypto Dev Room’.

I mostly spent time shuffling between the packed ‘Cloud and Virt Dev room’, the Hypervisors track, and a couple of sessions at the Cross-Desktop Dev Room. As always, and more importantly, I spent time in the hallways matching faces to IRC nicks, and met a lot of people whom I’ve mostly worked with over email and IRC.

Meanwhile, I’ve taken a couple of days off and am having a decent time in a little place here in Belgium. I will make a more detailed post about the talks I had a chance to attend.

Nested Virtualization with KVM and AMD

After my previous attempt the other day to create a nested-guest(kvm on kvm) with Intel arch, I got hold of an AMD server machine with virt-extensions enabled and gave it a whirl. This went slightly smoother than the Intel attempt.

Some config info about the physical host, regular-guest and nested-guest. (All of them are Fedora-16; x86_64)

  • Physical Host (Host hypervisor/Bare metal)
    • 
      [root@phy-host-amd]# virsh nodeinfo
      CPU model:           x86_64
      CPU(s):              16
      CPU frequency:       2000 MHz
      CPU socket(s):       2
      Core(s) per socket:  8
      Thread(s) per core:  1
      NUMA cell(s):        1
      Memory size:         8173352 kB
      
  • Regular Guest (Or Guest Hypervisor)
    • Config: 4GB Memory; 6 vcpus; 22GB Raw disk image w/ cache=’none’ enabled in the libvirt xml
  • Nested Guest
    • Config: 2GB Memory; 3 vcpus; 10G Raw disk image

Ensure nesting is enabled on the physical host

Let’s ensure the kvm_amd kernel module is loaded with ‘nested’ virt enabled.


[root@phy-host-amd ~]# modinfo kvm_amd | grep -i nested
parm:           nested:int
[root@phy-host-amd ~]# 

[root@phy-host-amd ~]# cat /sys/module/kvm_amd/parameters/nested
1
[root@phy-host-amd ~]# 

[root@phy-host-amd ~]# systool -m kvm_amd -v   | grep -i nested
    nested              = "1"
[root@phy-host-amd ~]#
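
The same check can be scripted. Below is a minimal Python sketch, assuming the sysfs layout shown above. The function names nested_enabled and read_nested are made up for illustration, and treating both 1 and Y as "enabled" is an assumption based on how the kernel prints integer and boolean module parameters.

```python
from pathlib import Path


def nested_enabled(raw):
    """Interpret the contents of a KVM module's 'nested' parameter file.

    Assumption: integer parameters read back as "1", boolean ones as "Y".
    """
    return raw.strip() in {"1", "Y", "y"}


def read_nested(module, sysfs_root="/sys/module"):
    """Return True if <sysfs_root>/<module>/parameters/nested says nesting is on.

    A missing or unreadable file (module not loaded, no such parameter)
    is reported as False rather than raising.
    """
    path = Path(sysfs_root) / module / "parameters" / "nested"
    try:
        return nested_enabled(path.read_text())
    except OSError:
        return False


if __name__ == "__main__":
    for module in ("kvm_amd", "kvm_intel"):
        print(module, "nested:", read_nested(module))
```

On a host without either module loaded, both lines simply report False.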

CAVEAT: To make life a little easier, I configured bridged networking on the physical host so that the regular-guest gets a bridged IP and, later, the nested-guest gets a NATed IP. I’m noting it here because the physical host initially had no bridging. The default libvirt network (virbr0) uses the 192.168.122.0/24 IP space, so once libvirt is set up inside the regular-guest (or guest-hypervisor), its default network would end up with the same IP space. I tried to fix this problem by creating another ‘persistent’ libvirt network and enabling autostart for it [virsh net-add; virsh net-define; virsh net-autostart ]. But it wasn’t elegant and messed up the networks on reboot.

Set up the guest hypervisor
Create a minimal regular-guest using virt-install . The one I used is posted here

Now, add the cpu attribute to the regular-guest’s libvirt xml to expose AMD’s svm instructions, which come with the Opteron_G3 model.

Edit the xml using virsh:

# virsh edit regular-guest

(which will also define the xml)

Here is the attribute to be added to the guest hypervisor’s libvirt xml:

    <cpu>
      <arch>x86_64</arch>
      <model>Opteron_G3</model>
      <vendor>AMD</vendor>
      <topology sockets='2' cores='8' threads='1'/>
      <feature name='wdt'/>
      <feature name='skinit'/>
      <feature name='osvw'/>
      <feature name='3dnowprefetch'/>
      <feature name='cr8legacy'/>
      <feature name='extapic'/>
      <feature name='cmp_legacy'/>
      <feature name='3dnow'/>
      <feature name='3dnowext'/>
      <feature name='pdpe1gb'/>
      <feature name='fxsr_opt'/>
      <feature name='mmxext'/>
      <feature name='ht'/>
      <feature name='vme'/>
    </cpu>

Then I restarted the regular-guest, so that it boots with the -cpu flag which exposes the AMD virt extensions:


[root@phy-host-amd ~]# ps -ef | grep -i qemu-kvm
qemu     26677     1 14 10:39 ?        00:00:30 /usr/bin/qemu-kvm -S -M pc-0.14 -cpu phenom,+wdt,+skinit,+osvw,+3dnowprefetch,+misalignsse,+sse4a,+abm,+cr8legacy,+extapic,+cmp_legacy,+lahf_lm,+rdtscp,+pdpe1gb,+popcnt,+cx16,+ht,+vme -enable-kvm -m 4096 -smp 6,sockets=2,cores=8,threads=1 -name regular-guest -uuid 8f6a4478-496b-51d8-2de2-ff7fdb964af3 -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/regular-guest.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -drive file=/var/lib/libvirt/images/regular-guest.img,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=24,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:5f:c6:5f,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
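
Picking the CPU model and flags out of that long command line by eye is tedious. Here is a small Python sketch, assuming the -cpu argument has the "model,+flag,+flag" shape seen above. The helper name parse_cpu_arg is hypothetical, and SAMPLE is a shortened copy of the ps output, not the full line.

```python
import shlex

# Shortened copy of the qemu-kvm command line from the ps output above.
SAMPLE = (
    "/usr/bin/qemu-kvm -S -M pc-0.14 "
    "-cpu phenom,+wdt,+skinit,+osvw,+3dnowprefetch,+ht,+vme "
    "-enable-kvm -m 4096 -name regular-guest"
)


def parse_cpu_arg(cmdline):
    """Return (model, enabled_flags) from a qemu command line.

    Returns (None, []) when no -cpu option (or no value for it) is present.
    """
    args = shlex.split(cmdline)
    if "-cpu" not in args:
        return None, []
    idx = args.index("-cpu") + 1
    if idx >= len(args):
        return None, []
    model, *flags = args[idx].split(",")
    # Keep only explicitly enabled flags, dropping the leading '+'.
    enabled = [flag[1:] for flag in flags if flag.startswith("+")]
    return model, enabled


if __name__ == "__main__":
    model, flags = parse_cpu_arg(SAMPLE)
    print("model:", model)
    print("flags:", ", ".join(flags))
```

The string to parse could come from /proc/<pid>/cmdline or from ps, as in the transcript.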

Now, let’s fetch the IP of the regular-guest using virt-cat


[root@phy-host-amd ~]# virsh list
 Id Name                 State
----------------------------------
  5 regular-guest        running
[root@phy-host-amd ~]#
[root@phy-host-amd ~]# virt-cat regular-guest /var/log/messages | grep 'dhclient.*bound to'
Jan 17 10:13:06 dhcpyy-zz dhclient[732]: bound to ww.xx.yy.zz -- renewal in 32578 seconds.

(Note: ‘ww.xx.yy.zz’ above will be a bridged IP address)

Create the nested guest
Now, install the virt packages in the regular-guest. Also, let’s check that the /dev/kvm char device is exposed in the regular-guest, and start the libvirtd service.


[root@regular-guest ~]# file /dev/kvm
/dev/kvm: character special
[root@regular-guest ~]# systemctl status libvirtd.service
libvirtd.service - LSB: daemon for libvirt virtualization API
          Loaded: loaded (/etc/rc.d/init.d/libvirtd)
          Active: active (running) since Tue, 17 Jan 2012 10:49:25 -0500; 5s ago
         Process: 1440 ExecStart=/etc/rc.d/init.d/libvirtd start (code=exited, status=0/SUCCESS)
        Main PID: 1448 (libvirtd)
          CGroup: name=systemd:/system/libvirtd.service
                  ├ 1448 libvirtd --daemon
                  └ 1501 /usr/sbin/dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid --conf-file= --exce...

Proceed with installing a minimal F16 nested-guest w/ virt-install. The script I used is here

Debugging note: Once the guest install is finished, fix the serial console access by disabling plymouth-service using this workaround. This will let us login via virsh serial console(to get kernel and boot messages) w/o any line breaks while entering credentials:

 # ln -s /dev/null /etc/systemd/system/plymouth-start.service

Get the (NATed) IP of the nested-guest. (Also, grepped for the qemu-kvm command-line of the nested-guest.)


[root@regular-guest ~]# virsh list
 Id Name                 State
----------------------------------
  2 nested-guest         running
[root@regular-guest ~]# ps -ef | grep qemu-kvm
qemu      2245     1  2 Jan17 ?        00:20:11 /usr/bin/qemu-kvm -S -M pc-0.14 -enable-kvm -m 2048 -smp 3,sockets=3,cores=1,threads=1 -name nested-guest -uuid 2aae2ab5-ddb6-2585-aa16-7fe97296f34b -nographic -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/nested-guest.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -drive file=/var/lib/libvirt/images/nested-guest.img,if=none,id=drive-virtio-disk0,format=raw -device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=24,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:0e:4e:53,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

[root@regular-guest ~]# virt-cat nested-guest /var/log/messages | grep 'dhclient.*bound to'
Jan 17 11:08:30 localhost dhclient[721]: bound to 192.168.122.220 -- renewal in 1393 seconds.
[root@regular-guest ~]#
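
The grep step can also be done programmatically, for example when a script needs the address to SSH into the guest. This is a rough Python sketch, assuming dhclient log lines look like the one above. The regular expression and the function name last_bound_ip are my own illustration, not part of any libguestfs tool.

```python
import re

# Matches e.g. "dhclient[721]: bound to 192.168.122.220 -- renewal in ..."
BOUND_RE = re.compile(r"dhclient(?:\[\d+\])?: bound to (\d{1,3}(?:\.\d{1,3}){3})")


def last_bound_ip(log_text):
    """Return the most recent IPv4 address dhclient reported, or None."""
    matches = BOUND_RE.findall(log_text)
    return matches[-1] if matches else None


if __name__ == "__main__":
    sample = (
        "Jan 17 11:08:30 localhost dhclient[721]: bound to 192.168.122.220 "
        "-- renewal in 1393 seconds.\n"
    )
    print(last_bound_ip(sample))
```

The log text itself would still come from virt-cat, as in the transcript. Taking the last match matters because the lease may have been renewed with a different address.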

SSH into the nested-guest, install the virt-what package and run it to see if we’re on a hypervisor:


[root@localhost ~]# cat /etc/fedora-release
Fedora release 16 (Verne)
[root@localhost ~]# ifconfig eth0 | grep inet
          inet addr:192.168.122.220  Bcast:192.168.122.255  Mask:255.255.255.0
          inet6 addr: fe80::5054:ff:fe0e:4e53/64 Scope:Link
[root@localhost ~]#
[root@localhost ~]# virt-what
kvm

Wooo!! so we’re on an OS which is inside an OS which is inside an OS.

Nested Virtualization with KVM Intel

Some context: In regular virtualization, your physical Linux host is the hypervisor and runs multiple operating systems. Nested virtualization lets you run a guest inside a regular guest (essentially a guest hypervisor). For AMD, nested support has been available for a while, and some people have reported success with nesting KVM guests. For Intel arch., support has been available only recently (a year-ish), with some work still in progress, so I thought I’d give it a whirl when Adam Young started a discussion about it in the context of the OpenStack project.

Some of the common use-cases being discussed for nested virtualization:
- A cloud user gets a beefy regular guest (which she completely controls). This user can turn the regular guest into a hypervisor and cheerfully run/manage multiple guests for development or testing, without the hassle and intervention of the cloud provider.
- The possibility of having many instances of a virtualization setup (a hypervisor and its guests) on one single bare-metal machine.
- The ability to debug and test hypervisor software.

I have immediate access to moderately beefy Intel hardware, so the rest of the post is based on Intel’s CPU virt extensions. Before proceeding, let’s settle on some terminology for clarity:

  • Physical Host (Host hypervisor/Bare metal)
    • Config: Intel(R) Xeon(R) CPU (4 cores/socket); 10GB Memory; CPU Freq – 2GHz; Running latest Fedora-16 (minimal footprint, @core only with virt pkgs); x86_64; kernel-3.1.8-2.fc16.x86_64
  • Regular Guest (Or Guest Hypervisor)
    • Config: 4GB Memory; 4 vCPUs; 20GB Raw disk image with cache=’none’ to have decent I/O; Minimal, @core F16; same virt-packages as Physical Host; x86_64
  • Nested Guest (Guest installed inside the Regular Guest)
    • Config: 2GB Memory; 1vCPU; Minimal(@core only) F16; x86_64

Enabling Nesting on the Physical Host

Node Info of the Physical Host.

 
# virsh nodeinfo
CPU model:           x86_64
CPU(s):              4
CPU frequency:       1994 MHz
CPU socket(s):       1
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         10242864 kB
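
As a quick sanity check on how these numbers relate: the CPU count is the product of NUMA cells, sockets, cores per socket and threads per core (here 1 × 1 × 4 × 1 = 4). The Python sketch below parses this output and recomputes that product. The function names are made up for illustration, and the parsing assumes the "key: value" layout shown above.

```python
SAMPLE = """\
CPU model:           x86_64
CPU(s):              4
CPU frequency:       1994 MHz
CPU socket(s):       1
Core(s) per socket:  4
Thread(s) per core:  1
NUMA cell(s):        1
Memory size:         10242864 kB
"""


def parse_nodeinfo(text):
    """Turn 'key: value' lines of virsh nodeinfo output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def max_cpus(info):
    """Recompute the logical CPU count from the topology fields."""
    return (
        int(info["NUMA cell(s)"])
        * int(info["CPU socket(s)"])
        * int(info["Core(s) per socket"])
        * int(info["Thread(s) per core"])
    )


if __name__ == "__main__":
    info = parse_nodeinfo(SAMPLE)
    print("reported CPUs:", info["CPU(s)"], "computed:", max_cpus(info))
```

This matters when sizing the regular guest's vCPUs, since asking for more vCPUs than the host has logical CPUs only adds contention.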

Let us first ensure the kvm_intel kernel module has nesting enabled. By default, it’s disabled for Intel arch. [but enabled for AMD’s SVM (Secure Virtual Machine) extensions].

 
# modinfo kvm_intel | grep -i nested
parm:           nested:bool
#

Next, we need to pass kvm-intel.nested=1 on the kernel command line when rebooting the host to enable nesting for the Intel KVM kernel module. This can be verified after boot by doing:

 
# cat /sys/module/kvm_intel/parameters/nested
Y
# systool -m kvm_intel -v   | grep -i nested
    nested              = "Y"
#

Alternatively, Adam Young identified that nesting can be enabled persistently by adding the directive options kvm_intel nested=1 to the end of the /etc/modprobe.d/dist.conf file and rebooting the host.
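
Module options in modprobe.d files take the form "options <module> <param>=<value>". To confirm that a given config file really enables nesting, something like the following Python sketch could be used. The function name nesting_option_set is hypothetical, and the parsing is deliberately simplified (it handles only "options" lines and "#" comments).

```python
def nesting_option_set(conf_text, module="kvm_intel"):
    """Return True if conf_text contains 'options <module> nested=1' (or =Y).

    Hyphens and underscores in module names are treated as equivalent,
    as modprobe does.
    """
    wanted = module.replace("-", "_")
    for line in conf_text.splitlines():
        words = line.split("#", 1)[0].split()  # drop trailing comments
        if len(words) < 3 or words[0] != "options":
            continue
        if words[1].replace("-", "_") != wanted:
            continue
        for option in words[2:]:
            key, _, value = option.partition("=")
            if key == "nested" and value in {"1", "Y", "y"}:
                return True
    return False


if __name__ == "__main__":
    print(nesting_option_set("options kvm_intel nested=1\n"))
```

In practice the text would be read from the file under /etc/modprobe.d/ that was edited.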

Set up the Regular Guest(or Guest hypervisor)
Install a regular guest using virt-install, the oz tool, or any other preferred way. I made a quick script here. Make sure to have cache=’none’ in the disk driver attributes of the Guest Hypervisor’s XML file. (Observation: an install via the virt-install tool didn’t seem to pick this option by default.) Here is the ‘disk’ element snippet from the libvirt XML:

    <disk type='file' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source file='/var/lib/libvirt/images/regular-guest.img'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>

Now, let’s try to enable Intel VMX (Virtual Machine Extensions) in the regular guest’s CPU. We can do this by running the command below on the Physical Host (aka Host Hypervisor), adding the resulting ‘cpu’ element to the regular guest’s libvirt XML file, and starting the guest.

# virsh  capabilities | virsh cpu-baseline /dev/stdin
<cpu match='exact'>
  <model>Penryn</model>
  <vendor>Intel</vendor>
  <feature policy='require' name='dca'/>
  <feature policy='require' name='xtpr'/>
  <feature policy='require' name='tm2'/>
  <feature policy='require' name='vmx'/>
  <feature policy='require' name='ds_cpl'/>
  <feature policy='require' name='monitor'/>
  <feature policy='require' name='pbe'/>
  <feature policy='require' name='tm'/>
  <feature policy='require' name='ht'/>
  <feature policy='require' name='ss'/>
  <feature policy='require' name='acpi'/>
  <feature policy='require' name='ds'/>
  <feature policy='require' name='vme'/>
</cpu>

The output of the above command lists a variety of CPU features. Since we need only the vmx extension, I tried the simple way: adding the following to the regular guest’s libvirt XML (virsh edit ..) and starting it.

<cpu match='exact'>
  <model>core2duo</model>
  <feature policy='require' name='vmx'/>
</cpu>

Thanks to Jiri Denemark for the above hint. Also note that there is a very detailed and informative post from Dan P Berrange on host/guest CPU models in libvirt.

As we enabled vmx in the guest-hypervisor, let’s confirm that vmx is exposed in the emulated CPU by ensuring qemu-kvm is invoked with -cpu core2duo,+vmx :


[root@physical-host ~]# ps -ef | grep qemu-kvm
qemu     17102     1  4 22:29 ?        00:00:34 /usr/bin/qemu-kvm -S -M pc-0.14
-cpu core2duo,+vmx -enable-kvm -m 3072
-smp 3,sockets=3,cores=1,threads=1 -name f16test1
-uuid f6219dbd-f515-f3c8-a7e8-832b99a24b5d -nographic -nodefconfig
-nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/f16test1.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
-drive file=/export/vmimgs/f16test1.img,if=none,id=drive-virtio-disk0,format=raw,cache=none
-device virtio-blk-pci,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1
-netdev tap,fd=21,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:e6:cc:4e,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5
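The same confirmation can be done from inside the regular guest, by looking for the vmx flag in its /proc/cpuinfo. A minimal sketch, assuming the usual "flags : ..." line layout of /proc/cpuinfo on x86; the function name guest_has_vmx is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a "flags" line of cpuinfo lists "vmx".
# Defaults to /proc/cpuinfo; a different file can be passed for testing.
guest_has_vmx() {
    cpuinfo="${1:-/proc/cpuinfo}"
    # Require whitespace (or end of line) around "vmx" so that only the
    # exact flag name matches.
    grep -qE '^flags.*[[:space:]]vmx([[:space:]]|$)' "$cpuinfo"
}
```

Run inside the regular guest as `guest_has_vmx && echo "vmx is exposed"`. If the flag is there and the kvm_intel module loads in the guest, /dev/kvm should also exist.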

Now, let’s attempt to create a nested guest

Here comes the more interesting part. The nested-guest config will be 2G RAM; 1vcpu; 8GB virtual disk. Let’s invoke a virt-install command line with a minimal kickstart install:


[root@regular-guest ~]# virt-install --connect=qemu:///system \
    --network=bridge:virbr0 \
    --initrd-inject=/root/fed.ks \
    --extra-args="ks=file:/fed.ks console=tty0 console=ttyS0,115200 serial rd_NO_PLYMOUTH" \
    --name=nested-guest --disk path=/var/lib/libvirt/images/nested-guest.img,size=6 \
    --ram 2048 \
    --vcpus=1 \
    --check-cpu \
    --hvm \
    --location=http://download.foo.bar.com/pub/fedora/linux/releases/16/Fedora/x86_64/os/ \
    --nographics

Starting install...
Retrieving file .treeinfo...                                                                                                 | 1.7 kB     00:00 ...
Retrieving file vmlinuz...                                                                                                   | 7.9 MB     00:08 ...
Retrieving file initrd.img...                               28% [==============                                   ] 647 kB/s |  38 MB     02:25 ETA

virt-install proceeds fine (to a certain extent), doing all the regular things: getting access to the network, creating devices, creating file systems and performing dependency checks. Finally, the package install proceeds:


Welcome to Fedora for x86_64

     ┌─────────────────────┤ Package Installation ├──────────────────────┐
     │                                                                   │
     │                                                                   │
     │                                 24%                               │
     │                                                                   │
     │                   Packages completed: 52 of 390                   │
     │                                                                   │
     │ Installing glibc-common-2.14.90-14.x86_64 (112 MB)                │
     │ Common binaries and locale data for glibc                         │
     │                                                                   │
     │                                                                   │
     │                                                                   │
     └───────────────────────────────────────────────────────────────────┘

And now, it’s stuck like that forever. It doesn’t budge, trying to install packages for eternity. Let’s check the state of the guest in a separate terminal:


[root@regular-guest ~]# virsh list
 Id Name                 State
----------------------------------
  1 nested-guest         paused

[root@regular-guest ~]#
[root@regular-guest ~]#  virsh domstate nested-guest --reason
paused (unknown)

[root@regular-guest ~]#

So our nested-guest seems to be paused, and the package install on the nested-guest’s serial console is still hung. I gave up at this point. I need to see if I can get any helpful info w/ the virt-dmesg tool, or find other ways to debug this further.
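To notice this state without watching the console, the virsh list output can be filtered for paused domains. A minimal sketch, assuming the three-column Id/Name/State layout shown above and domain names without spaces; the function name paused_domains is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read `virsh list` output on stdin and print the
# names of all domains whose state is "paused".
paused_domains() {
    # Skip the two header lines; column 2 is the name, column 3 the state.
    awk 'NR > 2 && $3 == "paused" { print $2 }'
}
```

Usage: `virsh list --all | paused_domains`. For each name it prints, the per-domain QEMU log (by default under /var/log/libvirt/qemu/ on the machine running that guest) is usually the next place to look.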

Just to note, there is enough disk space and memory on the ‘regular-guest’, so that case is ruled out here. And, I tried to destroy the broken nested-guest, and attempted to create a fresh one(repeated twice). Still no dice.

So, not much luck yet with Intel arch. I’ll have to try on an AMD machine.

UPDATE (on Intel arch): After trying a couple of times, I was finally able to ssh to the nested guest. But after a reboot, the nested-guest loses its IP, rendering it inaccessible. (Info: the regular-guest has a bridged IP, and the nested-guest has a NATed IP.) And I couldn’t log in via the serial console, as it’s broken due to a regression (which has a workaround). Also, refer to the comments below for further discussion on NATed networking caveats.

Revisiting Native Linux KVM Tool

Just a quick revisit of the native Linux KVM tool (nlkt). There have been quite a few improvements upstream, so I pulled the latest git tree and built the kernel and the binary executable. The nlkt binary is now named ‘lkvm’ (thanks Pekka, it’ll improve searchability a lot).

Some enhancements I noticed from my testing:
- 9pfs enhancements
- Writable support for qcow2 disk-images
- sandbox support — this seems to be mostly a wrapper around ‘run’ command

After building, I posted the latest kvm tool binary lkvm, kernel bzImage, Linux .config and init binaries over here, along with a couple of simple test results with the latest git.

To take the slightly longer route, clone the nlkt git tree (and make sure the correct directives are enabled in the Linux config; I posted mine above), then build the kernel and the kvm tool.

Build:

 
# cd linux-kvm
# make -j5
# cd tools/kvm
# make
 

To give the binaries I posted above a quick try, let’s first set up the default rootfs by running the setup command. Note that we also need a guest directory with the init and init_stage2 binaries: init mounts the host file system read-only and runs init_stage2, which sets up a tty console and calls the shell executable /bin/sh.

 
--------------------------------------------
[kashyap@tesla nlkt-jan11]$ #./lkvm setup default
--------------------------------------------
[kashyap@tesla nlkt-jan11]$ pwd
/var/tmp/nlkt-jan11
--------------------------------------------
[kashyap@tesla nlkt-jan11]$ tree
.
├── bzImage
├── guest
│   ├── init
│   └── init_stage2
└── lkvm

1 directory, 4 files
[kashyap@tesla nlkt-jan11]$
 

Once the default rootfs is set up, let’s boot into the kernel:

 
[kashyap@tesla nlkt-jan11]$ ./lkvm run -d default
  # lkvm run -k ./bzImage -m 448 -c 4 --name default
.
.
.
Starting '/bin/sh'...
sh-4.2#
 

We can also notice the host file system being mounted read-only in the guest:

 
--------
sh-4.2# pwd
/
--------
sh-4.2# ls
bin  etc   host  lib64	root  sys  usr	virt
dev  home  lib	 proc	sbin  tmp  var
--------
sh-4.2# ls host/ ; cd host
bin   dev  home  lib64	     media  opt   root	sbin  sys  usr
boot  etc  lib	 lost+found  mnt    proc  run	srv   tmp  var
--------
sh-4.2# touch foo
touch: cannot touch `foo': Read-only file system
sh-4.2#
--------
 

Now, let’s try the sandbox, which runs a command as part of the init and then exits gracefully. In this case, it’s a simple ls command.

 
--------
[kashyap@tesla nlkt-jan11]$ ./lkvm sandbox -k ./bzImage -- ls
  # lkvm run -k ./bzImage -m 448 -c 4 --name guest-9990
.
.
.
Mounting...
Starting '/bin/sh'...
bin  etc   host  lib64	root  sys  usr	virt
dev  home  lib	 proc	sbin  tmp  var
[    2.052463] Unregister pv shared memory for cpu 1
[    2.052546] Unregister pv shared memory for cpu 0
[    2.052578] Unregister pv shared memory for cpu 3
[    2.055887] Unregister pv shared memory for cpu 2
[    2.057093] Restarting system.
[    2.057407] machine restart

  # KVM session ended normally.
[kashyap@tesla nlkt-jan11]$
--------
 

NOTE: I trimmed some of the stdout for brevity.

UPDATE: Pekka Enberg reminded me in a comment below that I forgot to mention two more user-visible features: PPC64 architecture support, and much faster serial console emulation. (I totally agree there.)