PCI passthrough with NVIDIA card as primary

A lot of writeups don't handle the intricacies of NVIDIA GPUs when that GPU is the primary GPU and should be passed through to the guest. This writeup collects the information I needed to get the setup working. I do not use virt-manager, just a simple script which contains the qemu command plus a few preparatory steps.

Note: This setup is currently lacking hugepages.

The system

The BIOS of the x299-E provides no option to switch the primary GPU to the secondary PCIe slot intended for GPUs. I didn't want to swap the GPUs, since with this processor that would mean the 1080ti would only get 8 PCIe lanes.

Can I even pass through PCI devices?

It makes no sense to continue setting this up if that is not possible. To answer the question, boot into your BIOS and enable VT-d (Intel) or AMD-Vi (AMD). After that you need to enable the IOMMU: edit your kernel boot parameters to include intel_iommu=on if you have an Intel CPU, or amd_iommu=on for an AMD one.

I for instance use systemd-boot, and my boot entry (located at /boot/loader/entries/arch.conf) looks like this:

title Arch Linux
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options intel_iommu=on iommu=on root=/dev/mapper/cryptroot rw usbcore.old_scheme_first=1 fbcon=map:1
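
If you use GRUB instead of systemd-boot, the equivalent change would look roughly like this (a sketch assuming the usual /etc/default/grub layout - adjust the parameter for AMD):

# /etc/default/grub
# append intel_iommu=on (or amd_iommu=on) to the existing parameters
GRUB_CMDLINE_LINUX_DEFAULT="... intel_iommu=on"

# afterwards regenerate the config
grub-mkconfig -o /boot/grub/grub.cfg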

After booting, check dmesg to confirm that virtualization is enabled:

# # for intel:
# dmesg | grep "Directed I/O"
# # for amd
# dmesg | grep "AMD-Vi"

This should output something like DMAR: Intel(R) Virtualization Technology for Directed I/O or PCI-DMA: Intel(R) Virtualization Technology for Directed I/O for Intel.

Or for amd:

AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40
AMD-Vi: Lazy IO/TLB flushing enabled
AMD-Vi: Initialized for Passthrough Mode

Another interesting thing: I remapped all ttys to the AMD card with fbcon=map:1. This way I can stay on the HDMI input for normal operation (except when I want to do something in the BIOS) and only need to switch the input source when starting the guest VM.

More documentation about the framebuffer console at git.kernel.org

Preparation

Install your second graphics card and configure Xorg to use this one. I used the nvidia configuration assistant and made some minimal changes to use the AMD card instead of the NVIDIA one.

20-amd.conf

Section "ServerLayout"
    Identifier     "Layout0"
    Screen      0  "Screen1" 0 0
    InputDevice    "Keyboard0" "CoreKeyboard"
    InputDevice    "Mouse0" "CorePointer"
    Option         "Xinerama" "0"
    Option         "BlankTime" "180"
    Option         "StandbyTime" "180"
    Option         "SuspendTime" "180"
    Option         "OffTime" "180"
EndSection

Section "Monitor"
    # HorizSync source: edid, VertRefresh source: edid
    Identifier     "Monitor0"
    VendorName     "Unknown"
    ModelName      "Acer XB271HU"
    HorizSync       34.0 - 222.0
    VertRefresh     30.0 - 165.0
    Option         "DPMS"
EndSection

# AMD gpu
Section "Device"
    Identifier     "Device1"
    Driver         "amdgpu"
    Option         "TearFree" "true"
    BusID          "PCI:23:0:0" # the PCI slot of the graphics card
EndSection

# AMD gpu
Section "Screen"
    Identifier     "Screen1"
    Device         "Device1"
    Monitor        "Monitor0"
    DefaultDepth    24
    Option         "Stereo" "0"
    Option         "metamodes" "2560x1440_60 +0+0" # connected via HDMI -> 60Hz
    Option         "SLI" "Off"
    Option         "MultiGPU" "Off"
    Option         "BaseMosaic" "off"
    SubSection     "Display"
        Depth       24
    EndSubSection
EndSection

Since my monitor only has one DP and one HDMI port I connected the AMD card via HDMI and the 1080ti via DP.

$ lspci | grep VGA
17:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)

Notice the disparity between the bus IDs (17 <-> 23)? lspci prints the bus IDs in hex notation, while Xorg requires them in decimal. To convert the numbers we can run something like python3 -c 'print(int("17", base=16))'
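
Alternatively, printf can do the conversion directly, since it interprets a 0x prefix as hexadecimal:

$ printf '%d\n' 0x17
23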

Once this is done (and you have confirmed that it works) I recommend uninstalling the nvidia driver you currently have installed - this just makes things simpler later on.

Install Required Software

Since I am not using virt-manager, I don't need to install it.
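
What does need to be installed can be read off the script further down: qemu itself, the OVMF UEFI firmware, dnsmasq for DHCP on the bridge, nftables, plus scream and barrier. On Arch that amounts to roughly the following (a sketch - package names may differ, and scream may have to be built from the AUR):

pacman -S qemu edk2-ovmf dnsmasq nftables barrier
# scream (the audio receiver on the host) may have to come from the AUR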

Setting up passthrough

vfio-pci is a stub driver which claims your PCI devices before another driver can access them; it creates device nodes on the filesystem which are then used by qemu. A simple way to check whether the capturing worked is to run ls /dev/vfio/ and verify that all relevant IOMMU groups show up there.
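
Roughly what to expect (shown here with just the GPU group 75 from further down; every additional passed-through group appears as another number next to the vfio control node):

$ ls /dev/vfio/
75  vfio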

IOMMU groups

An IOMMU group contains hardware which should only be passed to the VM together. It is possible to pass through only specific devices of a group, but I didn't try that. This means that for every device we want to pass to the VM, we need to check that it is either alone in its IOMMU group or shares the group only with devices we intend to pass through anyway.

Let's do it then:

  1. Note the PCI "ids" at the start of each line: the part of the form <domain>:<bus>:<device>.<func>

    $ lspci | grep VGA
    17:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)
    65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
    

    For me that would be "65:00.0" (the domain is not shown by default; lspci -D includes it).

    $ lspci | grep Samsung
    02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
    08:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
    

    In case you have 2 SSDs with the same controller, unplug the guest SSD and use the id which is missing after a second lspci run.

  2. Find the corresponding IOMMU groups: $ find /sys/kernel/iommu_groups | sort -V
    Confirm that every group contains only devices which should be passed through (the helper script after this list prints each group with readable device names).

    /sys/kernel/iommu_groups/75/devices/0000:65:00.0
    /sys/kernel/iommu_groups/75/devices/0000:65:00.1
                    group id ^          ^ PCI addr ^
    
  3. Isolate the PCI devices:
    The Arch Linux wiki generally has a great article about this: wiki.archlinux.org

    These instructions are Arch Linux specific, but they can easily be adapted to other distros.

    You can either write a small script which is then included in the initramfs, or pass an option to vfio-pci with the vendor IDs of the devices. Since there are enough resources online about the second option and I used the first one, I will only describe that.

    1. Write a script (don't forget to chmod +x it): /usr/local/bin/vfio-pci-override.sh

      #!/bin/sh
      
      # put your PCI addresses here (including the domain)
      # this is my line
      # ssd, gpu, gpu (audio)
      DEVS="0000:02:00.0 0000:65:00.0 0000:65:00.1"
      
      if [ ! -z "$(ls -A /sys/class/iommu)" ]; then
          for DEV in $DEVS; do
              echo "vfio-pci" > /sys/bus/pci/devices/$DEV/driver_override
          done
      fi
      modprobe -i vfio-pci
      

      This script is taken directly from wiki.archlinux.org

    2. Create a file: /etc/modprobe.d/vfio.conf

      install vfio-pci /usr/local/bin/vfio-pci-override.sh
      
    3. Add the modconf hook to the HOOKS array in /etc/mkinitcpio.conf. For the install line above to work during early boot, the script itself also has to end up in the initramfs (e.g. via the FILES array).

    4. Add the modules to the initramfs (MODULES array in /etc/mkinitcpio.conf):
      vfio_pci vfio vfio_iommu_type1 vfio_virqfd. In case you also have your graphics driver in there, the graphics driver should come after these modules, just to be sure.

    5. And lastly regenerate your initramfs (on Arch: mkinitcpio -P).
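
For step 2, a small helper makes checking the groups much less tedious - it prints every IOMMU group together with human-readable device names (a variant of the listing snippet from the Arch wiki article linked above):

#!/bin/sh
# print all IOMMU groups and the devices they contain
for group in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${group##*/}:"
    for device in "$group"/devices/*; do
        echo "    $(lspci -nns "${device##*/}")"
    done
done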

Reboot and check that everything still works. Then check in dmesg whether vfio-pci claimed every device: dmesg | less

For every device id you can run cat /sys/bus/pci/devices/$DEV/driver_override. If one device doesn't have vfio-pci set, then you can do the following:

#!/bin/sh

# example DEV: 0000:65:00.0
DEV=...
# find the driver which currently holds the device
kernel_driver="$(lspci -nnk -s "$DEV" | grep 'in use' | sed -n 's/[^:]*: \(.*$\)/\1/p')"
# detach the device from that driver
echo "$DEV" > /sys/bus/pci/drivers/"$kernel_driver"/unbind
# make sure vfio-pci will accept the device, then bind it
echo "vfio-pci" > /sys/bus/pci/devices/"$DEV"/driver_override
echo "$DEV" > /sys/bus/pci/drivers/vfio-pci/bind

This has to be done after every reboot, so creating a script is certainly a good idea (maybe even build that into the initramfs and execute it after module loading).
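
Instead of touching the initramfs again, a oneshot systemd unit that runs such a rebind script at boot also does the job. A rough sketch, assuming the snippet above was saved as /usr/local/bin/vfio-rebind.sh (both names are made up - adjust to taste):

# /etc/systemd/system/vfio-rebind.service
[Unit]
Description=Rebind passthrough devices to vfio-pci

[Service]
Type=oneshot
ExecStart=/usr/local/bin/vfio-rebind.sh

[Install]
WantedBy=multi-user.target

Enable it once with systemctl enable vfio-rebind.service.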

Nvidia as primary card

For this to work you have to have a second graphics card and use that as the primary card. Then boot into the system and grab the ROM of the NVIDIA card (it is later handed to qemu via the romfile= option).

  1. get the address of the card: lspci
  2. unbind the card: echo <address> > /sys/bus/pci/drivers/vfio-pci/unbind
  3. dump the rom: echo 1 > /sys/bus/pci/devices/<address>/rom
  4. read the rom (and save it somewhere): cat /sys/bus/pci/devices/<address>/rom > gpu_bios.rom
  5. rebind the card: echo <address> > /sys/bus/pci/drivers/vfio-pci/bind

This rom will then be passed to the virtual machine.
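
Put together (and run as root), the whole dance looks roughly like this - writing 0 to the rom file afterwards disables reading again:

# example address - adjust to your GPU
ADDR=0000:65:00.0

echo "$ADDR" > /sys/bus/pci/drivers/vfio-pci/unbind
echo 1 > /sys/bus/pci/devices/$ADDR/rom
cat /sys/bus/pci/devices/$ADDR/rom > gpu_bios.rom
echo 0 > /sys/bus/pci/devices/$ADDR/rom
echo "$ADDR" > /sys/bus/pci/drivers/vfio-pci/bind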

Note: Nvidia now actually allows using its GPUs inside a VM and no longer shows the infamous Error 43 with recent display drivers, which makes this step unnecessary.

Stitching things together

By now all prerequisites are met to write our script which will start the VM.

Here is mine; I didn't bother to add cleanup code as of now.

#!/usr/bin/env bash

set -x

vmname="win10vm"
gpu_id="0000:65:00.0"
audio_id="0000:65:00.1"
ssd_id="0000:02:00.0"
bridge_name="br-qemu-win-vm"
tap_name="tap-qemu-win-vm"
dhcp_subnet=172.254.0.1/16
dhcp_range=172.254.0.2,172.254.255.254
iface_forward="wlp4s0"

# graphical sudo
if [ -n "$SUDO" ]; then
    SUDO="$SUDO"
elif [ -t 1 ]; then
    SUDO=sudo
else
    export SUDO_ASKPASS=/usr/lib/ssh/ssh-askpass
    SUDO='sudo --askpass'
fi
if ! which "${SUDO%% *}" >/dev/null 2>&1; then
    SUDO=sudo
fi

if ps -A | grep -q $vmname; then
	echo "$vmname is already running." &
	exit 1
else
    efi_vars="/tmp/$vmname-efi-vars.fs"
    cp /usr/share/ovmf/x64/OVMF_VARS.fd "$efi_vars"

    # rebind ssd to vfio-pci driver
    # this is needed since the nvme driver claims this device
    # before the module load vfio-pci can claim it.
    echo $ssd_id | sudo tee '/sys/bus/pci/drivers/nvme/unbind'
    echo $ssd_id | sudo tee '/sys/bus/pci/drivers/vfio-pci/bind'

    # unbind the efifb driver from the primary gpu
    echo efi-framebuffer.0 | sudo tee \
        '/sys/bus/platform/devices/efi-framebuffer.0/driver/unbind'

    # create bridge
    $SUDO ip link add name $bridge_name type bridge
    $SUDO ip addr add "$dhcp_subnet" dev $bridge_name
    $SUDO ip link set dev $bridge_name up
    $SUDO ip tuntap add $tap_name mode tap
    $SUDO ip link set $tap_name up
    $SUDO ip link set dev $tap_name master $bridge_name
    $SUDO dnsmasq --interface=$bridge_name --bind-interfaces --dhcp-range=$dhcp_range
    # instead just specify dns server for adapter in windows

    # docker and our nft rules interfere
    ## stop docker and reload nft base rules
    $SUDO systemctl stop docker
    $SUDO nft -f /etc/nftables.conf

    # for dhcp and stuff
    $SUDO nft add rule ip filter INPUT udp dport 67 accept
    $SUDO nft add rule ip filter INPUT tcp dport 67 accept
    $SUDO nft add rule ip filter INPUT udp dport 53 accept
    $SUDO nft add rule ip filter INPUT tcp dport 53 accept
    # barrier and scream
    $SUDO nft add rule ip filter INPUT udp dport 4010 accept
    $SUDO nft add rule ip filter INPUT tcp dport 4010 accept
    $SUDO nft add rule ip filter INPUT udp dport 24800 accept
    $SUDO nft add rule ip filter INPUT tcp dport 24800 accept
    # forward to interfaces
    $SUDO nft add rule ip filter FORWARD iifname "$bridge_name" counter packets 0 bytes 0 accept
    $SUDO nft add rule ip filter FORWARD oifname "$bridge_name" counter packets 0 bytes 0 accept
    for iface in $iface_forward; do
        $SUDO nft add rule ip nat POSTROUTING oifname "$iface" counter masquerade
    done

    scream -i $bridge_name &
    if ! pgrep barrier; then
        barrier &
    fi

    $SUDO nice --adjustment=-20 qemu-system-x86_64 \
        -name $vmname,process=$vmname \
        -machine type=q35,accel=kvm \
        -cpu host,kvm=off,hv-vendor-id=null,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff \
        -smp 10,sockets=1,cores=5,threads=2 \
        -m 12G \
        -mem-prealloc \
        -rtc clock=host,base=localtime \
        -serial none \
        -parallel none \
        -vga none -nographic \
        -netdev tap,id=net0,br=$bridge_name,ifname=$tap_name,script=no,downscript=no \
        -device e1000,netdev=net0 \
        -audiodev pa,id=snd0,server=unix:/run/user/$(id -u)/pulse/native \
        -device intel-hda -device hda-duplex \
        -drive if=pflash,format=raw,readonly,file=/usr/share/ovmf/x64/OVMF_CODE.fd \
        -drive if=pflash,format=raw,file="$efi_vars" \
        -device vfio-pci,host=$gpu_id,multifunction=on,id=gpu,romfile=/opt/vm/1080ti_asus.rom \
        -device vfio-pci,host=$audio_id,id=audio \
        -device vfio-pci,host=$ssd_id,id=ssd \
        -drive file=/dev/disk/by-id/path-to-my-hdd \
        -boot order=dc \
        -drive file=/opt/vm/virtio-win-0.1.185.iso,media=cdrom \
        -drive file=/opt/vm/Win10_1809Oct_EnglishInternational_x64.iso,media=cdrom
        # -mem-path /dev/hugepages \
	exit $?
fi

The script performs the following actions in order:

  1. Define variables: the VM name, the PCI IDs of the devices to pass through and the network (which operates in bridge mode)
  2. Define sudo to use a graphical frontend if no terminal is detected
  3. Create the bridge network
  4. Add firewall rules for the bridge and services which enable audio and mouse control
  5. Start scream and barrier. Scream passes the guest's audio through to the host; Barrier is a fork of Synergy and is used to pass the mouse over to the VM.
  6. Start the virtual machine
  7. Define the processor emulation mode and allocate 12GiB of RAM
  8. Disable the qemu inbuilt graphics adapter
  9. Define the network for the VM.
  10. Connect various pci devices
  11. Specify drives

I won't explain everything in detail, since other people before me have already done a better job at explaining these options than I would.
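
Since the script has no cleanup code yet, here is a rough sketch of what at least the network teardown could look like, reusing the variable names from above (the nft rules and rebinding the SSD to the nvme driver are left out):

# stop the dnsmasq instance serving the bridge
$SUDO pkill -f "dnsmasq --interface=$bridge_name"

# remove tap and bridge again
$SUDO ip link set "$tap_name" down
$SUDO ip tuntap del dev "$tap_name" mode tap
$SUDO ip link del "$bridge_name"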

The following lines are the "juicy" ones:

Things that this script is still lacking: