A kernel panic is a fatal system error from which the Linux kernel cannot safely recover. When this happens, the operating system stops all execution to prevent data corruption. The only way to find the real cause of a kernel panic is to read the core dump generated at the exact moment of the crash. Guessing based on system logs or monitoring graphs is a waste of time because standard logging daemons stop functioning the moment the kernel halts.

In our experience managing production servers, administrators often assume a random reboot was caused by a hardware failure. They replace RAM modules, migrate virtual machines, or rebuild entire stacks without proof. A core dump provides the exact instruction pointer, the process running on the CPU, and the contents of system memory at the millisecond the failure occurred. Reading this dump turns an unpredictable outage into a precise software or hardware bug that you can fix.

The Mechanics of a Kernel Panic and Kdump

When the Linux kernel detects an unrecoverable internal error, it calls the panic() function. This function prints an error message to the console and halts the CPUs. It deliberately avoids flushing dirty disk buffers, which is why panic messages begin with "Kernel panic - not syncing": once its internal state is corrupt, the kernel cannot trust its own filesystem drivers. For the same reason, the system logs in /var/log/syslog or /var/log/messages will not contain the panic details, because the syslog daemon relies on a functioning kernel to write to the filesystem.

To capture the state of the machine, Linux uses a mechanism called kexec. This allows a crashed kernel to boot a secondary capture kernel directly from memory without passing through the hardware BIOS or UEFI sequence. This secondary kernel runs in a reserved, isolated block of RAM. Its only job is to mount the filesystem, copy the memory state of the crashed kernel into a file called vmcore, and then trigger a normal reboot.

This entire process relies on the kdump service. If kdump is not configured and running before the panic occurs, the memory state is lost forever upon reboot. On our KVM dedicated virtual servers, such as the vm.Ready plan, you have complete control over the kernel and can allocate the necessary memory for kdump. On shared containerized environments like LXC, you cannot trigger or debug kernel panics because the container shares the host kernel. The host protects itself from tenant-induced kernel crashes.

Configuring Kdump on Production Servers

Before you can analyze a crash, you must configure the server to capture it. The first step is reserving memory for the crash kernel in the bootloader configuration. You must edit the GRUB configuration file and append the crashkernel parameter to the command line arguments.

Open /etc/default/grub in your preferred text editor. Locate the line starting with GRUB_CMDLINE_LINUX_DEFAULT. You need to add crashkernel=auto for RHEL-based systems or specify an exact memory allocation for Debian and Ubuntu systems. A safe default for modern servers with over 4GB of RAM is allocating 256MB.

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash crashkernel=256M"

After modifying the file, update the bootloader. On Debian or Ubuntu, execute update-grub. On RHEL, AlmaLinux, or CentOS, run grub2-mkconfig -o /boot/grub2/grub.cfg. You must reboot the server for the kernel to reserve this memory block.

Once the server is back online, install the kdump tools. On Debian-based distributions, install the kdump-tools package; on RHEL-based systems, install kexec-tools. Enable and start the service with systemctl enable --now kdump (the unit is named kdump-tools on Debian-based systems). You can verify the memory reservation by reading the kernel command line from /proc/cmdline or by checking the service status.
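
As a quick sanity check, you can grep the kernel command line for the reservation. The snippet below simulates the contents of /proc/cmdline so the pattern is visible; on a live server you would grep the file directly:

```shell
# Simulated /proc/cmdline contents; on a real server run:
#   grep -o 'crashkernel=[^ ]*' /proc/cmdline
cmdline='BOOT_IMAGE=/vmlinuz-5.15.0-100-generic root=/dev/vda1 ro quiet splash crashkernel=256M'
echo "$cmdline" | grep -o 'crashkernel=[^ ]*'
# -> crashkernel=256M
```

If the grep returns nothing on a real server, the parameter never made it into the bootloader configuration and kdump will fail silently at panic time.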

To ensure the system actually captures a dump during a panic, you should test it. You can trigger a manual kernel panic using the magic SysRq key sequence. Ensure you do not do this on a server actively serving production traffic. Executing the following command as root will immediately crash the server and force a dump.

echo c > /proc/sysrq-trigger

The server will drop its SSH connections, reboot, and boot back into the primary kernel. Once online, navigate to /var/crash/. You will find a directory timestamped with the exact date and time of the crash containing the vmcore file and the dmesg output.

Preparing the Crash Analysis Environment

A vmcore file is a raw binary copy of the system RAM. You cannot read it with a text editor. You need the crash utility and the exact debugging symbols for the kernel version that panicked. The crash utility translates raw memory addresses into human-readable function names, variables, and stack traces.

First, verify the exact kernel version that generated the dump. This is often different from the currently running kernel if an automatic update occurred during the reboot. Read the dmesg file generated alongside the vmcore (typically dmesg.<timestamp> from kdump-tools, or vmcore-dmesg.txt on RHEL-based systems) to find the kernel release string. Next, install the crash package using your distribution package manager.
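
Assuming the standard boot banner format, you can pull the release string out of the saved dmesg with grep and awk. The sample line below stands in for the real file; on a server you would grep the dmesg file in your crash directory instead:

```shell
# Sample boot banner line; on a real server, grep the dmesg file saved in /var/crash/
line='[    0.000000] Linux version 5.15.0-100-generic (buildd@lcy02-amd64-034) (gcc version 11.4.0)'
echo "$line" | grep -o 'Linux version [^ ]*' | awk '{print $3}'
# -> 5.15.0-100-generic
```

This release string, not the output of uname -r, is what the debug symbol package must match.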

You must download the debuginfo packages corresponding to the crashed kernel. On Ubuntu, add the debug symbol repository (ddebs) to your APT sources and install linux-image-$(uname -r)-dbgsym, substituting the crashed kernel's release string for $(uname -r) if the two differ. On AlmaLinux or RHEL, enable the debug repository and install kernel-debuginfo. These packages are large, often exceeding several gigabytes, because they contain the unstripped vmlinux binary mapping every line of kernel source code to its exact memory offset.
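
As a sketch of the Ubuntu side, assuming the 22.04 (jammy) release, the ddebs entries added to an APT sources file look like the following; the ubuntu-dbgsym-keyring package supplies the repository signing key:

```
deb http://ddebs.ubuntu.com jammy main restricted universe multiverse
deb http://ddebs.ubuntu.com jammy-updates main restricted universe multiverse
```

Run apt-get update afterwards, then install the dbgsym package matching the crashed kernel release.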

If you are managing infrastructure on an unmanaged provider, maintaining these debugging packages consumes valuable disk space. For administrators who prefer to focus on application deployment rather than kernel debugging, a managed infrastructure approach shifts this diagnostic burden. In our datacenters, sysadmins handle the KVM hypervisor maintenance and low-level diagnostic work directly.

Executing the Crash Utility

With the debug symbols installed and the vmcore file ready, you launch the crash utility. You must provide the unstripped vmlinux binary and the vmcore file as arguments. The path to the vmlinux file varies by distribution, but it is typically located in /usr/lib/debug/lib/modules/ or /usr/lib/debug/boot/.

crash /usr/lib/debug/lib/modules/5.15.0-100-generic/vmlinux /var/crash/202602251050/vmcore

The utility loads the symbols, maps the memory space, and drops you into an interactive prompt. The initial output displays critical system context. You will see the exact kernel panic message, the uptime of the server before the crash, the system load average, the total number of tasks running, and the specific process ID that was executing on the CPU when the halt occurred.

Pay close attention to the PANIC line in this initial output. Typical messages include "Oops: 0000 [#1] SMP" for unhandled memory faults or "Kernel panic - not syncing: out of memory" for fatal resource exhaustion. The COMMAND line tells you the name of the executable running at the time of the crash. While the process listed in COMMAND is not always the root cause, it indicates what the CPU was doing when the kernel state became invalid.

Reading the Stack Trace

The most important command inside the crash utility is bt, which stands for backtrace. This command prints the call stack of the active task. The stack trace shows the exact sequence of C functions the kernel was executing right up to the panic. You read a stack trace from the bottom to the top. The bottom entries show the initial system call from user space, and the top entries show the fatal kernel function.

crash> bt
PID: 14592  TASK: ffff888123456000  CPU: 2   COMMAND: "php-fpm"
 #0 [ffffc90001234b58] machine_kexec at ffffffff8105c3b1
 #1 [ffffc90001234bb0] __crash_kexec at ffffffff8110b252
 #2 [ffffc90001234c78] panic at ffffffff8166d123
 #3 [ffffc90001234cf8] oops_end at ffffffff81031f0d
 #4 [ffffc90001234d18] no_context at ffffffff8106f2d4
 #5 [ffffc90001234d70] __bad_area_nosemaphore at ffffffff8106f521
 #6 [ffffc90001234db8] bad_area_nosemaphore at ffffffff8106f574
 #7 [ffffc90001234dc8] __do_page_fault at ffffffff8106fbe8
 #8 [ffffc90001234e28] do_page_fault at ffffffff8106fdf2
 #9 [ffffc90001234e58] page_fault at ffffffff818018d5
    [exception RIP: ixgbe_clean_rx_irq+142]
#10 [ffffc90001234f08] ixgbe_poll at ffffffffa015b678 [ixgbe]
#11 [ffffc90001234f40] net_rx_action at ffffffff815617a1
#12 [ffffc90001234fb8] __do_softirq at ffffffff810a08e3

In this output, you can follow the execution path. At frame 12, the kernel was handling a software interrupt for network traffic. At frame 10, it called ixgbe_poll, part of the driver for an Intel 10 Gigabit network card. The exception RIP line between frames 10 and 9 shows that a page fault occurred inside ixgbe_clean_rx_irq: the network driver attempted to access a memory address that did not exist or was protected. The kernel caught the invalid access via __do_page_fault, determined it was unrecoverable, and called panic() at frame 2. Frames 1 and 0 are kdump itself handing control to the capture kernel via kexec.

The line marked exception RIP is the Instruction Pointer. It pinpoints the exact function where the violation occurred. In our datacenters in Salt Lake City and Beauharnois, we use enterprise hardware to avoid obscure driver bugs. However, when deploying custom network modules on unmanaged servers, encountering null pointer dereferences in network drivers is a frequent operational issue. Upgrading the specific driver or the kernel usually resolves this category of failure.

Identifying Hardware vs Software Faults

A core dump can differentiate between a software bug and failing hardware. Inside the crash utility, run the log command. This outputs the kernel ring buffer leading up to the crash. You are looking for Machine Check Exceptions (MCE). An MCE is a hardware error reported directly by the CPU to the operating system.
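
The crash prompt supports piping command output to external shell utilities, so you can filter the ring buffer rather than scrolling through it. A hypothetical session might look like this; the exact message text varies by CPU and platform:

```
crash> log | grep -i "machine check"
[4512345.678901] mce: [Hardware Error]: Machine check events logged
```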

If the log command reveals messages containing "Hardware Error" or "Machine check events logged," the kernel panic was a secondary symptom of a physical hardware failure. You might see errors relating to correctable or uncorrectable ECC memory faults. We provision our infrastructure with Xeon processors and ECC RAM specifically to detect and correct single-bit errors. If a multi-bit memory error occurs, the hardware forces a kernel panic intentionally to stop the operating system from writing corrupted data to disk.

When the analysis points to hardware, no amount of software configuration will fix the issue. The physical node requires component replacement. If you are reading this dump from a bare-metal server in your own rack, you must schedule maintenance and swap the RAM or CPU. If you are operating on a managed virtual machine, the host provider should have already caught the MCE via out-of-band management interfaces and live-migrated your instance.

Investigating Resource Exhaustion

The Out of Memory (OOM) killer is designed to terminate processes to free up RAM. Normally, the OOM killer does not cause a kernel panic. It sacrifices a user-space process like PHP-FPM or MySQL to keep the system running. However, administrators sometimes modify sysctl settings incorrectly, forcing the kernel to panic instead of invoking the OOM killer.

You can verify this inside the crash utility by examining the kernel variables. Run the command p sysctl_panic_on_oom. If the output returns a value of 1, the system is explicitly configured to crash upon running out of memory. You will also see an "Out of memory" message immediately preceding the panic call in the log buffer.
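
Inside the crash session, checking the variable looks like the following sketch; crash hands the expression to an embedded gdb, so the output uses gdb's value-history format:

```
crash> p sysctl_panic_on_oom
sysctl_panic_on_oom = $1 = 1
```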

We see this scenario frequently when clients migrate heavy WordPress sites to entry-level virtual servers. On a 2GB VPS running WooCommerce, you will exhaust PHP workers and RAM quickly under load. Instead of crashing the server entirely, you should configure the system to manage its memory correctly. Proper sysctl kernel parameter tuning ensures the system handles resource exhaustion predictably. To fix this specific issue, edit /etc/sysctl.conf, set vm.panic_on_oom = 0, and run sysctl -p.
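
The persistent fix is a single line in /etc/sysctl.conf, applied immediately with sysctl -p:

```
# Let the OOM killer terminate a process instead of panicking the whole kernel
vm.panic_on_oom = 0
```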

Checking Tainted Modules

The Linux kernel tracks whether proprietary or unsupported third-party modules are loaded into memory. This is called a "tainted" kernel. A panic in a tainted kernel is significantly harder to debug because open-source developers cannot review the proprietary code. Inside the crash utility, use the sys command to check the taint status.

crash> sys
      KERNEL: /usr/lib/debug/lib/modules/5.15.0-100-generic/vmlinux
    DUMPFILE: /var/crash/202602251050/vmcore
        CPUS: 4
        DATE: Wed Feb 25 10:50:00 2026
      UPTIME: 45 days, 04:12:30
LOAD AVERAGE: 2.15, 1.95, 1.88
       TASKS: 215
    NODENAME: web01.internal
     RELEASE: 5.15.0-100-generic
     VERSION: #100-Ubuntu SMP Wed Feb 10 14:15:20 UTC 2026
     MACHINE: x86_64
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] SMP NOPTI"
       TAINT: P (PROPRIETARY_MODULE)

The TAINT line indicates the kernel is operating with restrictions. A 'P' indicates a proprietary module is loaded. Common examples include closed-source GPU drivers, proprietary RAID controller modules, or third-party security software. To identify which modules set the flag, run mod -t in the crash console; it lists each tainted module alongside its taint flags.
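
An illustrative (hypothetical) session, assuming a closed-source GPU driver is the offender:

```
crash> mod -t
NAME                    TAINTS
nvidia                  POE
```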

If the stack trace leads directly into a tainted module, the kernel is likely fine, but the third-party driver has a critical flaw. You must remove the module, update it from the vendor, or replace the hardware requiring the proprietary driver. This is a common failure point for legacy hardware arrays and one reason modern hypervisor deployments standardize on open source drivers and NVMe storage configurations.

Evaluating Memory State with Kmem

If the panic message relates to memory corruption or exhaustion, you need to understand how RAM was allocated at the time of the crash. The kmem -i command provides a snapshot of memory usage matching the exact instant the CPU halted. It displays total memory, free memory, swap usage, and slab allocations.

Review the Slab column. The slab allocator manages kernel objects like inodes, directory entries, and network buffers. If the slab cache is consuming the majority of your RAM, you likely have a memory leak inside a kernel driver or subsystem. If user-space memory (indicated by page cache and anonymous pages) is fully exhausted but the slab cache is normal, an application-layer process caused the memory pressure. Finding a slab leak requires running kmem -S to inspect the specific kernel structures consuming the memory, which points directly to the failing subsystem.
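
An abridged, illustrative kmem -i layout for a slab-leak scenario (the figures below are invented for a hypothetical 8 GB machine, not taken from a real dump):

```
crash> kmem -i
                 PAGES        TOTAL      PERCENTAGE
    TOTAL MEM  2019428       7.7 GB         ----
         FREE    25667      100.3 MB    1% of TOTAL MEM
         USED  1993761       7.6 GB   98% of TOTAL MEM
      BUFFERS     3211       12.5 MB    0% of TOTAL MEM
       CACHED    41020      160.2 MB    2% of TOTAL MEM
         SLAB  1684301       6.4 GB   83% of TOTAL MEM
```

A SLAB figure dominating total memory like this points at a kernel-side leak rather than a runaway user-space process.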

Operational Realities and Decision Points

Reading a core dump requires a deep understanding of C programming, hardware architecture, and kernel internals. When a production server crashes, the priority is restoring service. Kdump handles the automated preservation of the evidence so you can reboot immediately and investigate the vmcore file offline without extending the downtime window.

If your daily workflow involves compiling debug symbols and tracing hexadecimal memory addresses across stack frames, you are doing the job of a low-level systems engineer. For technical teams managing applications, dealing with kernel faults shifts focus away from writing code and deploying services. By running applications on managed infrastructure, the hosting provider maintains the hypervisor stability, tests the drivers, and mitigates hardware faults automatically. You rely on stable virtualization rather than debugging bare-metal kernel crashes.

The difference between guessing at random reboots and proving the root cause lies in your diagnostic process. Configure kdump. Ensure the memory is allocated. When the server halts, use the crash utility to trace the instruction pointer. Read the stack trace to locate the failing function, check the hardware logs for MCEs, and review the slab allocations. A kernel panic is a logical failure, and the core dump provides the exact logic that failed.

Source & Attribution

This article is based on the original post from the serverspan.com blog. The canonical source, which should be cited, is: Kernel Panic Analysis in 2026: Reading the Core Dump to Find the Real Cause.