Scope

In this post, I intend to help you understand how a Linux system typically boots so that you can modify it and come up with your own solutions to the problem of network booting. Huge shout-out to Apalrd’s Adventures for their awesome YouTube videos and blog posts, which helped tremendously with this process. I followed their post on setting up a PXE-boot thin client for a significant chunk of this adventure. That guide walked me through setting up a TFTP server, configuring my DNS, building iPXE for chainloading, and getting something working quickly enough to build the motivation to keep going. The moment I saw a bash prompt on my mini PC that I had literally ripped the SSD out of, I was hooked. This post is meant as a companion to theirs, and goes more into how it works than how to build it. I may circle back and do a how-to later, but for now I just wanted to document the things their guide didn’t explain, or that were hard to find on the internet even when you know what to search for, and aggregate it all into a long-form article that looks at the whole picture in detail.

Since they didn’t go into how to do this with a RHEL-based system, I would like to eventually create a YouTube video covering the specifics. However, this post is not a how-to. It’s an explanation of the pieces and how they fit together, designed to give you a deeper understanding of the whole process and equip you to figure out the rest. I’ve been researching and experimenting for two weeks now and still don’t have a complete solution, even though I deeply understood the boot process going in, so I want to aggregate all the information and document the pieces that were missing from the internet.

Update: I no longer use this approach to storage, as it ended up being too problematic. iSCSI doesn’t really like its network going down, even for a moment, and the target driver in the kernel kept causing random lock-ups. If I were to approach this again, I would go with something built to be a little more fault-tolerant like Ceph RBD. However, this was still a fun experiment and I’m leaving this post for educational purposes.

Background and story

This was a bit of an adventure.

I use Rocky Linux 9 to power the servers in my homelab. It’s a nice, stable OS derived from CentOS and Red Hat Enterprise Linux. This has the upside of being similar to what I use at work, and it allows a decent amount of customizability without going full Arch, but it has the downside of being based on Red Hat (you can Google the recent controversy; I don’t want any libel claims), and much of their more specific documentation is stuck behind a paywall. I still love Arch for my desktop machines (I use Arch, btw), but it doesn’t offer the stability that I want in a server that I share with my family, so I stuck with Enterprise Linux.

Anyway, shameful justification of OS choices aside, I am trying to build an unusual system for a homelab. I don’t have the money to invest in a true high-availability setup for my purposes (nor do I really need it), but I want to dabble a little bit and increase the security of my lab, so I have split my server into two machines: compute and storage. This will allow me to create copy-on-write snapshots of my server so I can have “soft backups” to restore from in the event my system is compromised.

In iteration 1, the storage server handled all long-term, non-recoverable storage like VM images, documents, and security footage, serving it to the compute server over a 10GbE connection via NFS and iSCSI, and to end users via Samba, with SELinux properly set up because I don’t actually trust Samba’s security.

Now it’s time for iteration 2. I bought an AMD Epyc CPU and a motherboard to go with it, and I want to run it without any storage attached whatsoever. Naturally, I immediately turned to PXE, having worked with it before on consumer hardware during my high school’s IT internship. We used software that managed it for us, though, so I never fully understood how it worked, and I started researching.

From there, I started looking through the iPXE documentation to figure out how their config worked and how to write my own. Alpine Linux is a great choice for PXE boot, because it runs completely in RAM. Unfortunately, I wanted to boot Rocky Linux, which is a bit more difficult. To understand why, I’ll go through how a Linux-based OS typically boots.

The (usual) Linux boot process

When a Linux system boots, the system kind of stumbles along, loading loaders to load loaders. Here is the general process for a UEFI system:

  1. The system boots and loads the UEFI firmware that’s built into the motherboard.
  2. The UEFI firmware starts scanning attached storage devices for an EFI system partition containing executables it can run.
  3. The UEFI firmware, having found GRUB (the most common Linux bootloader) on your boot disk, loads it into memory and executes it.
  4. GRUB then scans the devices on your system to locate a valid grub.cfg file and reads it to determine where it can find OS kernels, cmdline values, and other boot files.
  5. GRUB locates the Linux kernel (usually a file whose name starts with vmlinuz, a self-extracting compressed kernel image) and the initrd (also called the initramfs or init ramdisk), loads them into memory (the kernel first, then the initrd), and executes the kernel’s code, passing it the cmdline from grub.cfg.
  6. The kernel starts, sets up a minimal userspace environment, and extracts the contents of the initrd into an in-memory root filesystem, then executes the file at /init or /sbin/init, or another path specified using the rdinit= cmdline argument.
  7. The init binary occupies PID 1, and is responsible for starting the rest of the system.
  8. The init process begins loading anything it needs to mount the root filesystem (i.e. the rest of the OS). This can include device drivers, software interfaces such as virtual RAID controllers, networking, and so on. It then mounts the root filesystem on a subdirectory within the virtual filesystem.
  9. The init process then calls pivot_root (or, on initramfs-based systems, switch_root) in order to “zoom in” to the newly-mounted root filesystem, and execs the init program contained within it to finish the boot process.
  10. The rest of the system’s services are started, and eventually you get either a console, or a graphical interface.
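To make steps 4 and 5 concrete, here is a minimal sketch of what a grub.cfg menu entry might look like; the kernel version and root device below are hypothetical placeholders:

menuentry 'Rocky Linux (5.14.0-503.el9.x86_64)' {
    linux /vmlinuz-5.14.0-503.el9.x86_64 root=/dev/nvme0n1p2 ro
    initrd /initramfs-5.14.0-503.el9.x86_64.img
}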

Why is this relevant?

It’s important to understand how a traditional Linux system boots in order to modify it and eliminate the dependency on a hard disk, booting entirely over the network.

iPXE is a bootloader that can be served over PXE and takes the place of GRUB. However, it’s not a drop-in replacement. It doesn’t read GRUB config files, and because we’re booting over the network, there is no disk to read a config, kernel image, or initrd from, so it needs to be told where to get all of this. Because I’m chainloading iPXE (booting iPXE over PXE), I need to embed a script so it doesn’t enter an infinite loop. See the iPXE documentation on chainloading. My embed script sets up DHCP and tells iPXE to grab further instructions from a script hosted on the same server via HTTP.
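For reference, a minimal embedded script along these lines might look like the following; the path is a hypothetical placeholder, and ${next-server} is iPXE’s built-in setting for the boot server address handed out by DHCP:

#!ipxe
dhcp
chain http://${next-server}/boot/main.ipxe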

How will the new boot process work?

First, we need to figure out where to get our filesystem from. There are a few options for this:

  • iSCSI
    • This emulates a physically-attached drive, but wraps I/O commands in IP packets and sends them to a server over the network. This allows the system to emulate the device and make it appear exactly as if there were a physical disk attached to the system.
  • NBD
    • NBD is short for Network Block Device. This is like iSCSI, but simpler to implement in software. However, it isn’t as versatile, and we’re not writing our own drivers, so the simplicity doesn’t do anything for us.
  • NFS
    • NFS is short for Network File System. It is a standard way of sharing files on UNIX-based systems, allowing for more advanced filesystem features than something like SMB.
  • SMB
    • This is what Windows file sharing is built on. However, it can be extremely problematic as a root Linux filesystem for a variety of reasons, and should pretty much always be avoided.

I chose to go with iSCSI, because it offers good performance and is generally less problematic since it looks to the rest of the system just like a physically-attached drive. With this in mind, the new boot process should look like this:

  1. The system boots and loads the UEFI firmware that’s built into the motherboard.
  2. The UEFI firmware finds no storage in the machine and loads the PXE ROM from the system’s network card, which requests an IP address, TFTP server IP, and EFI filename from the network via DHCP.
  3. The PXE ROM connects to the TFTP server, downloads the specified file, and executes it. This file is a copy of iPXE with the embedded script.
  4. iPXE grabs the boot script from the HTTP server and executes it. This script tells it where to find the kernel and initrd on the web server and boots it.
  5. The kernel extracts the initrd, and systemd takes over: it sets up networking, logs into the iSCSI server, mounts the virtual disk, and pivots into it.
  6. The rest of the system boots like normal.
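As a concrete illustration of steps 4 and 5, the script iPXE fetches over HTTP might look roughly like this sketch (the server address and kernel version are hypothetical placeholders, and ${cmdline} is a variable set by an earlier script, described below):

#!ipxe
set base http://192.168.1.10/boot/compute
kernel ${base}/vmlinuz-5.14.0-503.el9.x86_64 initrd=initramfs-5.14.0-503.el9.x86_64.img ${cmdline}
initrd ${base}/initramfs-5.14.0-503.el9.x86_64.img
boot

Note that on UEFI systems, the initrd= argument on the kernel line should match the name of the fetched initrd image so the kernel’s EFI stub can find it.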

Problem #1: Keeping the kernel accessible

Typically, the kernel is distributed with your system’s package manager and stored in /boot. We need it to be stored on a web server that exists somewhere other than the current machine so we can download it over the network, and it has to be stored such that both the OS and the HTTP server can access it.

I chose to solve this problem using NFS. By creating a directory under the HTTP server’s document root on the storage server and exporting it via NFS, the OS can mount it as /boot, both the OS and the HTTP boot server can access the same files, and new kernels are automatically exposed via an external HTTP URL.
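As a sketch, assuming the web root on the storage server is /srv/http (all paths, networks, and hostnames here are hypothetical), the export and the client-side mount might look like this:

# /etc/exports on the storage server
/srv/http/boot/compute  192.168.1.0/24(rw,no_root_squash,sync)

# /etc/fstab on the netbooted OS
storage.lan:/srv/http/boot/compute  /boot  nfs  defaults,_netdev  0 0

The no_root_squash option matters here because the package manager writes to /boot as root.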

Perfect!

Problem #2: Unpredictable kernel names

This was a massive problem. In many Linux distros, kernel filenames vary by version (and sometimes hardware IDs). Most distros handle this by regenerating the GRUB config file on every update, but we don’t use GRUB. How do we get iPXE to find kernel images and ramdisks?

Enter: BootLoaderSpec

Luckily, a group of people already got together to solve the fragmented boot loader configuration problem, and came up with a specification called BLS, or BootLoaderSpec. This defines a bunch of files under /boot/loader/entries that contain variables for each installed kernel/initrd combination. Now we have a way to obtain the information we need, but how do we get iPXE to use this information?
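For example, a BLS entry on an Enterprise Linux system lives at a path like /boot/loader/entries/<machine-id>-<kernel-version>.conf and looks roughly like this (the version here is a hypothetical placeholder; on RHEL-family systems the options line typically references the $kernelopts variable stored in grubenv):

title Rocky Linux (5.14.0-503.el9.x86_64) 9
version 5.14.0-503.el9.x86_64
linux /vmlinuz-5.14.0-503.el9.x86_64
initrd /initramfs-5.14.0-503.el9.x86_64.img
options $kernelopts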

This is where we start to get into the weeds of Enterprise Linux, whose documentation on the boot process kind of sucks. The generic Linux boot process is well-known and you can find plenty of resources on it, but you can’t build something useful on that understanding alone unless you want to roll your own updates by hand, which is extremely tedious in an era where software updates land every week. Enterprise Linux has poorly-documented but well-functioning tooling, wrappers, and generators that scaffold this boot process for you, so you can edit that config file, but it’ll be gone in a week, overwritten the next time any related software updates.

I wanted something that could generate an iPXE script containing the location of the new kernel whenever I updated. I couldn’t find what I needed, so I made my own. I want to keep the story easy to follow though, so I’ll document the missing details of RPM triggers, kernel-install, dracut, and nm-initrd-generator down below.
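To give a flavor of the approach, here is a heavily simplified sketch of such a generator; this is my own illustration under assumed paths, not the actual ipxe-reconfigure code. It picks the newest BLS entry and templates it into an iPXE script:

#!/usr/bin/bash
# Simplified sketch: find the newest BLS entry and emit an iPXE script for it
set -euo pipefail

# Version-sort the entries and take the last (newest) one
entry=$(ls -1v /boot/loader/entries/*.conf | tail -n 1)
kernel=$(awk '$1 == "linux"  { print $2 }' "$entry")
initrd=$(awk '$1 == "initrd" { print $2 }' "$entry")

# ${base} and ${cmdline} are left for the calling iPXE script to define
cat > /boot/main.ipxe <<EOF
#!ipxe
kernel \${base}${kernel} initrd=$(basename "$initrd") \${cmdline}
initrd \${base}${initrd}
boot
EOF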

Ok, we have an iPXE script. Now what?

Now that we understand how the system is going to boot, it’s time to install the OS. I’d like to follow up with more detail once I find a better approach, but I essentially installed Rocky Linux like normal; when setting up storage, I told it to connect to my iSCSI server and use that disk as the main install destination. Anaconda then complained that it wasn’t a suitable boot drive, and I’d have to agree. Unfortunately, I’m not entirely sure how to use an NFS share as the boot partition in Anaconda yet, so I plugged in a flash drive and told it to use that for /boot. The system installed like normal, and I copied all the files from the flash drive onto the HTTP server and bootstrapped it with a hand-written iPXE script mimicking the grub.cfg file, with HTTP URLs instead of file paths.

Now I can boot into the system via iPXE! The system comes online, downloads iPXE, grabs the hand-rolled config, finds the kernel image and initrd, then it reconfigures the network, logs into the iSCSI server, mounts the drive, and boots into the OS! Victory!!!

What about updates?

I kind of glossed over updates previously. There are still a few more steps to build a reliable system that functions this way. You now need to hook into all the generators and automatically reconfigure iPXE when the system updates. In order to do this, I split the configs up.

  • /boot/main.ipxe
    • This script contains a list of options for debugging and troubleshooting, or whatever else you want. iPXE lands here by default, but I set up Nginx to redirect to different scripts when it recognizes the client IP (using static leases in the DHCP server); see the sketch after this list.
  • /boot/compute.ipxe
    • This defines the new OS as the destination boot target. It defines some variables to assist the auto-generated config in looking for files and adds any cmdline options that could vary from system to system, then calls /boot/compute/main.ipxe
  • /boot/compute/main.ipxe
    • This is the iPXE script that gets generated by ipxe-reconfigure. /boot/compute in the HTTP server is the same folder as /boot in the OS. This selects a kernel and starts the system.
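For the IP-based redirect mentioned above, one way to do it in Nginx is a map on the client address; the addresses and paths in this sketch are hypothetical:

# Hypothetical Nginx fragment (included in the http context)
map $remote_addr $ipxe_script {
    default       "";
    192.168.1.20  /boot/compute.ipxe;
}

server {
    listen 80;
    root /srv/http;

    location = /boot/main.ipxe {
        # Known machines get redirected to their own script;
        # everyone else falls through and is served the default menu
        if ($ipxe_script) {
            return 302 http://$host$ipxe_script;
        }
    }
}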

With this setup, the ipxe-reconfigure package only has to worry about details local to the OS itself, and doesn’t concern itself with the specifics of your PXE boot environment such as the location of the HTTP server, where on the HTTP server the /boot filesystem is hosted, etc.

Now, you can install ipxe-reconfigure, and it will generate the /boot/main.ipxe (/boot/compute/main.ipxe on the HTTP server) file every time the kernel updates.

kernel-install: Hooking into kernel updates

In Enterprise Linux, the kernel-install utility is how initrd and bootloader configurations get updated. By default, it updates GRUB directly. Since we won’t be using GRUB, you’ll need to reconfigure this a little bit. First, you’ll need to configure kernel-install to create BLS files.

Add the following line to /usr/lib/kernel/install.conf (or to /etc/kernel/install.conf, which overrides it and won’t be touched by package updates):

layout=bls

Next, you’ll need to configure kernel-install to invoke ipxe-reconfigure as the last stage of the process. Add the following to /usr/lib/kernel/install.d/99-ipxe-reconfigure.install and make it executable (chmod +x):

#!/usr/bin/bash

# kernel-install passes the verb (add or remove) as the first argument
COMMAND="$1"

case "$COMMAND" in
    add|remove)
        ipxe-reconfigure
        ;;
    *)
        ;;
esac

This tells kernel-install to run ipxe-reconfigure whenever a kernel is added or removed.

Why do we need an initial ramdisk?

Usually, the Linux kernel is configured to know very little up-front. This keeps binary sizes small and avoids loading large numbers of unnecessary drivers, but it means the kernel usually doesn’t know how to assemble a system on its own, especially a highly customized one with a lot of moving parts.

The job of the initial ramdisk in Linux is to bootstrap the kernel and prepare the system for the final stage of the boot process. This includes loading device drivers like NVMe, setting up early-boot networking, starting software storage layers like LVM, mdraid, and iSCSI, running fsck to fix errors on your storage devices before they’re mounted, and mounting the various storage devices into the VFS (virtual filesystem).

Finally, once everything is set up, the initramfs calls pivot_root (or switch_root) to “zoom in” to the new filesystem, and executes /sbin/init from the new root. From there, the system can continue to boot as normal.
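If you’re curious what dracut (covered next) actually packed into your initramfs, it ships an inspection tool; the version in this path is a hypothetical placeholder:

lsinitrd /boot/initramfs-5.14.0-503.el9.x86_64.img | less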

Dracut

Dracut is the initial ramdisk generator used in Enterprise Linux. Building an initramfs is somewhat difficult, and the more technologies you involve in your boot process, the harder it gets. Dracut breaks the initrd down into “modules” that get assembled together with a miniature systemd to build the image. Adding modules to dracut is out of scope for this article, but you can read about it on the Arch Wiki. For our purposes, we only need to know how to configure the NetworkManager module and iSCSI, which are covered in the dracut.cmdline(7) man page: https://man7.org/linux/man-pages/man7/dracut.cmdline.7.html
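To make sure the relevant modules get baked into the image, you can drop a fragment into /etc/dracut.conf.d; the module names below follow dracut’s conventions, but verify them against your version with dracut --list-modules:

# /etc/dracut.conf.d/90-netboot.conf
add_dracutmodules+=" network-manager iscsi "
hostonly="no"

Setting hostonly="no" keeps the image generic instead of tailoring it to the hardware dracut happens to see at build time.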

You can configure the cmdline for your system in /etc/kernel/cmdline, but I would recommend setting the cmdline variable in the iPXE config on your web server at /boot/compute.ipxe, like so (change the options to suit your needs):

set cmdline bridge=bridge-san:eno1np0 ip=bridge-san:dhcp
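If you also want the initramfs to log into the iSCSI target straight from the cmdline, dracut accepts netroot and initiator arguments as well. An expanded example might look like this, where the address and IQNs are hypothetical placeholders:

set cmdline bridge=bridge-san:eno1np0 ip=bridge-san:dhcp netroot=iscsi:192.168.1.10::::iqn.2024-01.lan.example:compute rd.iscsi.initiator=iqn.2024-01.lan.example:node1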

Conclusion

I would like to restate that this is not a how-to. That will come later. The main goal of this post was to give you an idea of how a system like this could be built, and how all the pieces fit together. Once I finish setting up my system and feel I have a good enough understanding of it, I will follow up with a YouTube video and a subsequent blog post explaining each step in detail.

I hope you learned something from this article, and feel better equipped to figure out the rest of the details for your own system.

If you have any questions or areas you’d like to see expanded upon, send me an email at mariobuddy@gmail.com.