PCI passthrough with NVIDIA card as primary
A lot of writeups don't handle the intricacies of a setup where an NVIDIA GPU is the primary GPU and should be passed through to the guest. This writeup collects the information I needed to get such a setup working. I do not use virt-manager, just a simple script which contains the qemu command and a few preparatory steps.
Note: This setup is currently lacking hugepages.
The system
The BIOS of the X299-E offers no way to select the GPU in the secondary PCIe slot (the one intended for GPUs) as the primary GPU. I didn't want to swap the GPUs either, since - with this processor - the 1080 Ti would then only get 8 PCIe lanes.
- Motherboard: ASUS ROG STRIX X299-E
- CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
- Primary GPU (guest gpu): NVIDIA GeForce GTX 1080 Ti
- Secondary GPU (host gpu): ROG STRIX Radeon RX560
- SSD (guest): Samsung EVO 970 500GB
Can I even pass through PCI devices?
It makes no sense to continue setting this up if that is not possible.
To answer that, boot into your BIOS and enable VT-d (intel) or AMD-Vi (AMD).
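A quick preliminary check (my addition, not strictly necessary) is whether the CPU advertises the basic virtualization extensions at all:
$ grep -E -c '(vmx|svm)' /proc/cpuinfo
A non-zero count means VT-x/AMD-V is present; whether VT-d/AMD-Vi (the IOMMU part) is available additionally depends on the CPU model, chipset and BIOS.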
After that you need to enable the IOMMU. To do that, edit your kernel boot parameters to include
intel_iommu=on
if you have an Intel CPU, or
amd_iommu=on
if you have an AMD CPU.
I, for instance, use the systemd-boot loader and my config (located at /boot/loader/entries/arch.conf) looks like this:
title Arch Linux
linux /vmlinuz-linux
initrd /intel-ucode.img
initrd /initramfs-linux.img
options intel_iommu=on iommu=on root=/dev/mapper/cryptroot rw usbcore.old_scheme_first=1 fbcon=map:1
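After rebooting you can verify that the parameter really ended up on the kernel command line (just a sanity check on my part):
$ cat /proc/cmdline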
After booting, check dmesg to see whether virtualization is enabled:
# # for intel:
# dmesg | grep "Directed I/O"
# # for amd
# dmesg | grep "AMD-Vi"
This should output something like
DMAR: Intel(R) Virtualization Technology for Directed I/O
or
PCI-DMA: Intel(R) Virtualization Technology for Directed I/O
for Intel, or for AMD:
AMD-Vi: Enabling IOMMU at 0000:00:00.2 cap 0x40
AMD-Vi: Lazy IO/TLB flushing enabled
AMD-Vi: Initialized for Passthrough Mode
Another interesting thing: I remapped all ttys to the AMD card with fbcon=map:1. This way I can stay on the HDMI input for normal operation (except when I want to do something in the BIOS) and only need to switch the input source when starting the guest VM.
More documentation about the framebuffer console at git.kernel.org
Preparation
Install your second graphics card and configure Xorg to use it. I used the NVIDIA configuration assistant as a starting point and made some minimal changes to use the AMD card instead of the NVIDIA one.
20-amd.conf
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen1" 0 0
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
Option "Xinerama" "0"
Option "BlankTime" "180"
Option "StandbyTime" "180"
Option "SuspendTime" "180"
Option "OffTime" "180"
EndSection
Section "Monitor"
# HorizSync source: edid, VertRefresh source: edid
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Acer XB271HU"
HorizSync 34.0 - 222.0
VertRefresh 30.0 - 165.0
Option "DPMS"
EndSection
# AMD gpu
Section "Device"
Identifier "Device1"
Driver "amdgpu"
Option "TearFree" "true"
BusID "PCI:23:0:0" # the PCI slot of the graphics card
EndSection
# AMD gpu
Section "Screen"
Identifier "Screen1"
Device "Device1"
Monitor "Monitor0"
DefaultDepth 24
Option "Stereo" "0"
Option "metamodes" "2560x1440_60 +0+0" # connected via HDMI -> 60Hz
Option "SLI" "Off"
Option "MultiGPU" "Off"
Option "BaseMosaic" "off"
SubSection "Display"
Depth 24
EndSubSection
EndSection
Since my monitor only has one DP and one HDMI port I connected the AMD card via HDMI and the 1080ti via DP.
$ lspci | grep VGA
17:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
Notice the disparity between the bus ids (17 <-> 23)? lspci returns the bus ids in hex notation, while Xorg requires them in decimal. To convert the numbers we can run something like
python3 -c 'print(int("17", base=16))'
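If you prefer to stay in the shell, printf can do the same conversion (just an alternative to the Python one-liner above):
$ printf 'PCI:%d:%d:%d\n' 0x17 0x00 0x00
PCI:23:0:0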
Once this is done (and you have confirmed that it works) I recommend uninstalling the NVIDIA driver you currently have installed - this just makes things simpler later on.
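One way to confirm that Xorg really ended up on the AMD card before removing the NVIDIA driver (my suggestion, the exact provider names vary between driver versions):
$ xrandr --listproviders
The provider driving your screen should be the one belonging to the amdgpu driver.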
Install Required Software
Since I am not using virt-manager, I don't need to install it.
- Install qemu and ovmf:
pacman -Syu qemu edk2-ovmf
Setting up passthrough
vfio-pci is a stub driver which claims your PCI devices before another driver can access them. It creates device mappings on the filesystem which are then used by qemu. A simple way to check whether the claiming worked is to run
ls /dev/vfio/
and check that all required IOMMU groups are present.
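For illustration, the output looks roughly like this (the group numbers are board-specific and the 40 here is made up; 75 is the group of my GPU as shown further below, and vfio is the control device of the driver itself):
$ ls /dev/vfio/
40  75  vfio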
IOMMU groups
An IOMMU group contains hardware which should only be passed through to the VM together. It is possible to pass through only specific devices of a group, but I didn't try that. This means that for every device we want to pass to the VM we need to check that it is either alone in its IOMMU group or shares the group only with devices we intend to pass through anyway.
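To get an overview of which devices share a group, a small helper like the following is handy (a slight variant of the well-known snippet from the Arch Linux wiki):
#!/bin/bash
# print every IOMMU group and the devices it contains
shopt -s nullglob
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        echo -e "\t$(lspci -nns "${d##*/}")"
    done
done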
Let's do it then:
-
Note the PCI "ids" at the start of each line: the thing of the form
<domain>:<bus>:<device>.<func>
$ lspci | grep VGA
17:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Baffin [Radeon RX 460/560D / Pro 450/455/460/555/555X/560/560X] (rev e5)
65:00.0 VGA compatible controller: NVIDIA Corporation GP102 [GeForce GTX 1080 Ti] (rev a1)
For me that would be "65:00.0" (domain is not shown by default).
$ lspci | grep Samsung
02:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
08:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM981/PM981/PM983
In case you have 2 SSDs with the same controller, unplug the guest SSD and use the id which is missing after a second lspci run.
-
Find the corresponding IOMMU groups:
$ find /sys/kernel/iommu_groups | sort -V
Confirm that only devices which should be passed through are in a group.
/sys/kernel/iommu_groups/75/devices/0000:65:00.0
/sys/kernel/iommu_groups/75/devices/0000:65:00.1
                group id ^          ^ PCI addr
-
Isolate the PCI devices:
The Arch Linux wiki has a great article about this: wiki.archlinux.org. These instructions are Arch Linux specific, but they can easily be adapted to other distros.
You can either write a small script which is then included in the initramfs, or pass the vendor ids of the devices as an option to vfio-pci. Since there are enough resources online on how to do the second option and I used the first one, I will only describe that.
-
Write a script (don't forget to chmod +x it):
/usr/local/bin/vfio-pci-passthrough
#!/bin/sh
# put your PCI addresses here (including the domain)
# this is my line
# ssd, gpu, gpu (audio)
DEVS="0000:02:00.0 0000:65:00.0 0000:65:00.1"

if [ ! -z "$(ls -A /sys/class/iommu)" ]; then
    for DEV in $DEVS; do
        echo "vfio-pci" > /sys/bus/pci/devices/$DEV/driver_override
    done
fi

modprobe -i vfio-pci
This script is taken directly from wiki.archlinux.org
-
create a file:
/etc/modprobe.d/vfio.conf
install vfio-pci /usr/local/bin/vfio-pci-passthrough
-
add the modconf hook to the HOOKS array in /etc/mkinitcpio.conf
-
add modules to the initramfs (MODULES array in /etc/mkinitcpio.conf):
vfio_pci vfio vfio_iommu_type1 vfio_virqfd
In case you also have your graphics driver in there: the graphics driver should come after these modules, just to be sure.
-
And lastly regenerate your initramfs (on Arch Linux e.g. with mkinitcpio -P).
-
Reboot and check that everything still works.
Then check whether vfio-pci claimed every device, for example with
dmesg | less
For every device id you can also run
cat /sys/bus/pci/devices/$DEV/driver_override
If one device doesn't have vfio-pci set, then you can do the following:
#!/bin/sh
# example DEV: 0000:65:00.0
DEV=...
# determine the driver currently bound to the device
kernel_driver="$(lspci -nnk -s "$DEV" | grep 'in use' | sed -n 's/[^:]*: \(.*$\)/\1/p')"
# unbind from that driver, then bind to vfio-pci
echo "$DEV" > /sys/bus/pci/drivers/"$kernel_driver"/unbind
echo "$DEV" > /sys/bus/pci/drivers/vfio-pci/bind
This has to be done after every reboot, so creating a script for it is certainly a good idea (maybe even building it into the initramfs and executing it after module loading).
Nvidia as primary card
For this to work you need the second graphics card installed and used as the host's primary card. Then boot into the system and grab the rom of the NVIDIA card (this is later passed to qemu).
- get the address of the card:
lspci
- unbind the card:
echo <address> > /sys/bus/pci/drivers/vfio-pci/unbind
- dump the rom:
echo 1 > /sys/bus/pci/devices/<address>/rom
- read the rom (and save it somewhere):
cat /sys/bus/pci/devices/<address>/rom > gpu_bios.rom
- rebind the card:
echo <address> > /sys/bus/pci/drivers/vfio-pci/bind
This rom will then be passed to the virtual machine.
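Put together, the dump looks roughly like this (a sketch of the steps above, run as root; 0000:65:00.0 is the address of my card, substitute your own):
#!/bin/sh
# address of the GPU whose rom should be dumped (mine, adjust to yours)
ADDR="0000:65:00.0"
# release the card from vfio-pci so the rom can be read
echo "$ADDR" > /sys/bus/pci/drivers/vfio-pci/unbind
# enable the rom, save it, then hand the card back to vfio-pci
echo 1 > "/sys/bus/pci/devices/$ADDR/rom"
cat "/sys/bus/pci/devices/$ADDR/rom" > gpu_bios.rom
echo "$ADDR" > /sys/bus/pci/drivers/vfio-pci/bind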
Note: NVIDIA now actually allows using its GPUs inside a VM, and recent display drivers no longer show the infamous error 43, which makes this step no longer necessary.
Stitching things together
By now all prerequisites are met to write our script which will start the VM.
Here is mine; I haven't bothered to add cleanup code yet.
#!/usr/bin/env bash
set -x
vmname="win10vm"
gpu_id="0000:65:00.0"
audio_id="0000:65:00.1"
ssd_id="0000:02:00.0"
bridge_name="br-qemu-win-vm"
tap_name="tap-qemu-win-vm"
dhcp_subnet=172.254.0.1/16
dhcp_range=172.254.0.2,172.254.255.254
iface_forward="wlp4s0"
# graphical sudo
if [ -n "$SUDO" ]; then
SUDO="$SUDO"
elif [ -t 1 ]; then
SUDO=sudo
else
export SUDO_ASKPASS=/usr/lib/ssh/ssh-askpass
SUDO='sudo --askpass'
fi
# fall back to plain sudo if the chosen command is not available
if ! which "${SUDO%% *}" >/dev/null 2>&1; then
SUDO=sudo
fi
if ps -A | grep -q $vmname; then
echo "$vmname is already running." &
exit 1
else
efi_vars="/tmp/$vmname-efi-vars.fs"
cp /usr/share/ovmf/x64/OVMF_VARS.fd "$efi_vars"
# rebind ssd to vfio-pci driver
# this is needed since the nvme driver claims this device
# before the module load vfio-pci can claim it.
echo $ssd_id | $SUDO tee '/sys/bus/pci/drivers/nvme/unbind'
echo $ssd_id | $SUDO tee '/sys/bus/pci/drivers/vfio-pci/bind'
# unbind the efifb driver from the primary gpu
echo efi-framebuffer.0 | $SUDO tee \
'/sys/bus/platform/devices/efi-framebuffer.0/driver/unbind'
# create bridge
$SUDO ip link add name $bridge_name type bridge
$SUDO ip addr add "$dhcp_subnet" dev $bridge_name
$SUDO ip link set dev $bridge_name up
$SUDO ip tuntap add $tap_name mode tap
$SUDO ip link set $tap_name up
$SUDO ip link set dev $tap_name master $bridge_name
$SUDO dnsmasq --interface=$bridge_name --bind-interface --dhcp-range=$dhcp_range
# instead just specify dns server for adapter in windows
# docker and our nft rules interfere
## stop docker and reload nft base rules
$SUDO systemctl stop docker
$SUDO nft -f /etc/nftables.conf
# for dhcp and stuff
$SUDO nft add rule ip filter INPUT udp dport 67 accept
$SUDO nft add rule ip filter INPUT tcp dport 67 accept
$SUDO nft add rule ip filter INPUT udp dport 53 accept
$SUDO nft add rule ip filter INPUT tcp dport 53 accept
# barrier and scream
$SUDO nft add rule ip filter INPUT udp dport 4010 accept
$SUDO nft add rule ip filter INPUT tcp dport 4010 accept
$SUDO nft add rule ip filter INPUT udp dport 24800 accept
$SUDO nft add rule ip filter INPUT tcp dport 24800 accept
# forward to interfaces
$SUDO nft add rule ip filter FORWARD iifname "$bridge_name" counter packets 0 bytes 0 accept
$SUDO nft add rule ip filter FORWARD oifname "$bridge_name" counter packets 0 bytes 0 accept
for iface in $iface_forward; do
$SUDO nft add rule ip nat POSTROUTING oifname "$iface" counter masquerade
done
scream -i $bridge_name &
if ! pgrep barrier; then
barrier &
fi
$SUDO nice --adjustment=-20 qemu-system-x86_64 \
-name $vmname,process=$vmname \
-machine type=q35,accel=kvm \
-cpu host,kvm=off,hv-vendor-id=null,hv_time,hv_relaxed,hv_vapic,hv_spinlocks=0x1fff \
-smp 10,sockets=1,cores=5,threads=2 \
-m 12G \
-mem-prealloc \
-rtc clock=host,base=localtime \
-serial none \
-parallel none \
-vga none -nographic \
-netdev tap,id=net0,br=$bridge_name,ifname=$tap_name,script=no,downscript=no \
-device e1000,netdev=net0 \
-audiodev pa,id=snd0,server=unix:/run/user/$(id -u)/pulse/native \
-device intel-hda -device hda-duplex \
-drive if=pflash,format=raw,readonly,file=/usr/share/ovmf/x64/OVMF_CODE.fd \
-drive if=pflash,format=raw,file="$efi_vars" \
-device vfio-pci,host=$gpu_id,multifunction=on,id=gpu,romfile=/opt/vm/1080ti_asus.rom \
-device vfio-pci,host=$audio_id,id=audio \
-device vfio-pci,host=$ssd_id,id=ssd \
-drive file=/dev/disk/by-id/path-to-my-hdd \
-boot order=dc \
-drive file=/opt/vm/virtio-win-0.1.185.iso,media=cdrom \
-drive file=/opt/vm/Win10_1809Oct_EnglishInternational_x64.iso,media=cdrom
# -mem-path /dev/hugepages \
exit $?
fi
The script performs the following actions in order:
- Define variables for the network (which operates in bridge mode), the PCI ids of the devices to pass through, and the VM name
- Define sudo to use a graphical frontend if no terminal is detected
- Create the bridge network
- Add firewall rules for the bridge and services which enable audio and mouse control
- Start scream and barrier. Scream is used for audio passthrough to the host; Barrier is a fork of Synergy and is used to pass the mouse over to the VM.
- Start the virtual machine
- Define the processor emulation mode and allocate 12GiB of RAM
- Disable the qemu inbuilt graphics adapter
- Define the network for the VM.
- Connect various pci devices
- Specify drives
I won't explain everything in detail, since other people have already done a better job of explaining these options than I would.
The following lines are the "juicy" ones:
- -vga none -nographic: This disables the qemu built-in VGA adapter; you will want to comment this line out (keeping the emulated VGA) while initially setting up Barrier.
- -device vfio-pci,host=$gpu_id,multifunction=on,id=gpu,romfile=/opt/vm/1080ti_asus.rom: Pass through the graphics function of the GPU and point romfile at the rom previously dumped in the step Nvidia as primary card.
- -device vfio-pci,host=$audio_id,id=audio: Pass through the audio function of the GPU.
Things that this script is still lacking:
- Removing the added firewall rules
- Stopping the started background processes.
- Tearing down the bridge network
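A teardown could look roughly like this (an untested sketch based on what the start script above sets up; the names are the ones defined in that script):
#!/usr/bin/env bash
set -x

bridge_name="br-qemu-win-vm"
tap_name="tap-qemu-win-vm"

# stop the helpers started for the VM
# (barrier is left running since it may be in use outside the VM as well)
pkill scream
sudo pkill -f "dnsmasq --interface=$bridge_name"

# tear down the tap device and the bridge
sudo ip link set "$tap_name" down
sudo ip tuntap del "$tap_name" mode tap
sudo ip link set dev "$bridge_name" down
sudo ip link del dev "$bridge_name" type bridge

# drop the added firewall rules by reloading the base ruleset,
# then start docker again (it re-creates its own rules)
sudo nft -f /etc/nftables.conf
sudo systemctl start docker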