For reproducible scaling experiments, it is good practice to explicitly set
OMP_DYNAMIC=false, especially when you want to guarantee the number of
threads spawned for each parallel region. In this short article, I explain why
this is important and how one can sift through the GCC codebase (140k+ files
and more than 5 million LOC) plus the OpenMP standard to find this information.
Controlling the exact thread count is important for reproducibility. If the runtime is allowed to dynamically adjust the number of threads, the same program may exhibit different performance characteristics across compilers, systems, or execution environments. Explicitly disabling dynamic thread adjustment helps ensure that scaling measurements and performance comparisons are consistent and interpretable.
Interestingly, this is not strictly necessary when using GCC, because GCC’s
OpenMP implementation (gcc/libgomp) already defaults to OMP_DYNAMIC=false
(well, sort of… more on that later). However, relying on the default behavior
is not portable. Per section 6.3 (OMP_DYNAMIC) of the OpenMP standard:
The behavior of the program is implementation defined if the value of OMP_DYNAMIC is neither true nor false.
Moreover, per section 2.5 (Internal Control Variables) and section 2.5.2
(Internal Control Variable Initialization),
the variable dyn_var determines whether dynamic adjustment of the number of
threads is enabled, and its value is implementation defined. This means
compiler developers are free to choose the default behavior of their OpenMP
implementation. In principle, this means that a compiler implementer may choose
to enable dynamic adjustment of the number of threads by default.
Note that the internal control variable dyn_var simply reads from
OMP_DYNAMIC if it is defined. However, explicitly disabling dynamic adjustment
ensures more predictable and reproducible behavior across systems and compilers.
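Concretely, a scaling run might pin down every thread-related knob before launching the benchmark. The thread count below is an illustrative placeholder:

```shell
# Explicitly fix all thread-related settings before a scaling run.
# OMP_DYNAMIC=false guarantees the runtime will not shrink the team size.
export OMP_DYNAMIC=false
export OMP_NUM_THREADS=8
echo "OMP_DYNAMIC=$OMP_DYNAMIC OMP_NUM_THREADS=$OMP_NUM_THREADS"
# ./my_scaling_benchmark   # hypothetical OpenMP executable
```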
How do we know, though, that GCC defaults to OMP_DYNAMIC=false? Well, GCC
does not actually “set” OMP_DYNAMIC, so that’s why I wrote “sort of” earlier.
Really, GCC defines dyn_var in a struct of configuration variables that can
be updated by environment variables. If the environment variable doesn’t
exist, then the default value of the configuration variable is used. If
you want to verify GCC’s behavior yourself, you can trace it directly through
the gcc/libgomp source code:
$ git clone --depth 1 https://github.com/gcc-mirror/gcc.git
$ cd gcc
$ cd libgomp
$ rg dyn_var
parallel.c
73: if (icv->dyn_var)
env.c
78: .dyn_var = false
From ripgrep’s output, we can see where dyn_var gets used in
parallel.c#L73.
We can also see in
env.c#L78
the default value of dyn_var, and thus know its value for certain without
having to guess or treat GCC as a black box.
// @file env.c
const struct gomp_default_icv gomp_default_icv_values = {
.nthreads_var = 1,
.thread_limit_var = UINT_MAX,
.run_sched_var = GFS_DYNAMIC,
.run_sched_chunk_size = 1,
.default_device_var = INT_MIN,
.max_active_levels_var = 1,
.bind_var = omp_proc_bind_false,
.nteams_var = 0,
.teams_thread_limit_var = 0,
.dyn_var = false // <----------------- DEFAULT VALUE!!!!
};
Overall, OMP_DYNAMIC is a small but important example of a broader issue in
HPC: default behavior is not guaranteed to be portable, even when it is
consistent within a specific compiler like GCC. For reproducible performance
experiments, especially scaling studies, explicitly setting OpenMP
configuration variables removes ambiguity and ensures that results are
comparable across compilers, systems, and environments. As shown in GCC’s
libgomp implementation, even seemingly “fixed” defaults are ultimately
implementation details and relying on them implicitly can introduce hidden
variability into performance measurements. When in doubt: always set OpenMP
environment variables explicitly for reproducible performance work.
I recently ran into a frustrating problem: I wanted to print a PDF directly from a USB stick, but the printer kept showing “memory device is not working properly.” The drive had previously been flashed with Debian, so I assumed deleting the old files and copying my PDF would be enough. This was, of course, not the solution. In this short article, I cover what I did to fix the problem.
When I plugged the drive into my Linux laptop and called fdisk -l, the output
showed a Linux partition still lingering on the drive:
Disk /dev/sda: 7.46 GiB
Device     Boot   Start      End  Sectors Size Id Type
/dev/sda2        1064960 15646719 14581760   7G 83 Linux
Even after copying my PDF, the printer refused to read it. It turns out
printers are picky and cannot read Linux filesystems like ext4. They expect a
FAT32 (or sometimes exFAT) partition starting at the beginning of the drive.
This makes sense, since FAT32 is one of the oldest and most widely supported
file systems (see Kingston Tech: Understanding file systems and drive
formatting).
Changing permissions (chmod 777) on Linux didn’t help either. In retrospect,
it makes sense that this didn’t work, since the printer cannot understand Linux
file permissions.
However, the fix was straightforward on Linux using fdisk and mkfs.vfat:
sudo umount /dev/sda*
sudo fdisk /dev/sda
# in fdisk use: o > n > p > 1 > Enter > Enter > t > b (W95 FAT32) > w
sudo mkfs.vfat -F 32 /dev/sda1
After that, I copied my PDF to the root of the USB drive, plugged it into the printer, and it worked immediately.
This little episode was a great reminder: USB drives that were once used for Linux installations often leave behind partitions that devices like printers cannot read. It taught me to always make sure your filesystem matches the device’s expectations.
Recently my Ubuntu installation ended up in a broken state after a reboot:
external displays stopped working, nvidia-smi failed inside the desktop
environment (DE), and the system would hang while stopping gdm during
shutdown. Interestingly, the NVIDIA driver still worked in recovery mode, which
suggested the kernel modules themselves were not completely broken. I decided
to reset both the NVIDIA stack and the DE entirely since I
have had issues in the past with GPU weirdness.
One thing that helped diagnose root issues was disabling the graphical splash screen during boot. By default Ubuntu hides most boot messages behind the splash screen, which makes it hard to see where the system is hanging.
To see the real boot output, while in the Ubuntu DE I edited the GRUB configuration:
sudo nano /etc/default/grub
In the line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
I removed quiet splash so it became:
GRUB_CMDLINE_LINUX_DEFAULT=""
Then I updated GRUB:
sudo update-grub
Now Ubuntu displayed the full boot log instead of the splash screen. This made
it obvious that the system was hanging while rebooting from the DE due to
gdm, which pointed toward a joint problem with the DE and GPU stack rather
than the kernel itself.
At this point, I performed a hard reset and powered back on, pressing ESC to
reveal the GRUB boot menu. From there, I entered recovery mode and fixed the
issue as follows:
First, I updated the system to ensure all packages and kernel components were consistent:
apt update && apt upgrade
Next, I completely removed all NVIDIA and CUDA packages:
apt --purge remove "*nvidia*" "*cuda*"
To make sure no stale desktop configuration remained, I also removed the GNOME desktop and the display manager:
apt --purge remove ubuntu-desktop gdm3
Then I cleaned up any remaining dependencies:
apt autoremove
After that, I reinstalled the desktop environment:
apt install ubuntu-desktop gdm3
Finally, I let Ubuntu reinstall the recommended GPU drivers automatically:
ubuntu-drivers install
After rebooting, the system came up normally again. NVIDIA worked inside the
desktop environment, external displays were detected, and the gdm shutdown
hang was gone. While perhaps not the most elegant solution, completely
resetting both the GPU drivers and the desktop environment turned out to be the
fastest path to a clean state.
Update (2026-03-10)
Interestingly, this solution worked only temporarily. I did not dig into
system logs; however, it turns out that by default ubuntu-drivers will try
to install the most recent stable driver available. In this case, on my Ubuntu
24.04 system with a v6.17 kernel, the automatically installed driver was
nvidia-driver-590. Oddly, this is not the driver version that is
recommended when you list the devices for which drivers are available
with
ubuntu-drivers devices
The output was
driver : nvidia-driver-580-open - distro non-free recommended
So, in order to properly fix my problem, I did the following:
# 1. Purge all NVIDIA leftovers
sudo apt purge '*nvidia*' '*cuda*'
sudo apt autoremove
# 2. Update package lists
sudo apt update
# 3. Install NVIDIA driver 580
sudo apt install nvidia-driver-580 nvidia-settings nvidia-prime
# 4. Reboot
sudo reboot
I ended up doing this for the v6.14 kernel, which I set as the default kernel
loaded during boot by modifying /etc/default/grub with
GRUB_DEFAULT="Advanced options for Ubuntu>Ubuntu, with Linux 6.14.0-37-generic"
and then sudo update-grub.
I have since had no issues with my drivers or display manager :))
Often I have to perform substitution of variables or code blocks with something
new as part of refactoring efforts. Since I am using nvim, I wanted a way to
do this without having to close nvim, but also taking advantage of highly
optimized regex search utilities like ripgrep. In this short how-to, I
provide the commands one can use to do this without ever having to close nvim.
Recently, I ran into a case where a target string I wanted to replace across a
codebase contained lots of special characters. Special characters required
painful manual escaping if I used sed.
Instead, I wanted to:
- use ripgrep to intelligently find matching files, and
- use %s with \V (“very nomagic”) to avoid regex escaping headaches.

The target workflow looked like this:
:argdo %s,\V<C-r>0,newstring,g
Where:
- \V disables magic pattern interpretation.
- argdo applies the substitution to all files in the argument list.

The question was: how do I cleanly pipe ripgrep results into :args?
Neovim allows things like:
:args **/*.py
:args `ls src/*.py`
But I wanted to use ripgrep because it’s tailored specifically for fast
pattern matching across files in a git repository.
As an example, I was trying to replace the below string across many files:
"{environ['HOME']}/.cartopy_backgrounds/BlueMarble_3600x1800.png"
After some experimentation, I landed on a minimal solution that requires no
nvim configuration changes and works entirely inside nvim.

Step 1: Start Neovim
1
$ nvim
Step 2: Use ripgrep to find matching files
Write matching filenames to a temporary file:
:!rg -l -F "\"{environ['HOME']}/.cartopy_backgrounds/BlueMarble_3600x1800.png\"" . --glob "*.py" > /tmp/out.txt
Explanation:
- -l -> list matching files only
- -F -> fixed string search (no regex interpretation)
- --glob "*.py" -> restrict to Python files
- \" -> the double quotes inside the pattern are escaped for the shell

Step 3: Populate the argument list
:args `cat /tmp/out.txt`
Now all matching files are in the argument list.
Step 4: Perform the replacement
Assuming the search string is in the yank register 0 (e.g., because you yanked
it with y):
:argdo %s,\V<C-r>0,"new_text",g | update
This:
- uses \V to avoid regex escaping, and
- saves each modified file (update).

This solution works with ripgrep, find, grep, or any other tool that can
list files. It’s simple, explicit, and flexible.
There are likely more elegant solutions, but this approach is portable, minimal, and easy to remember.
Sometimes the most boring solution is the best one.
If you choose to host a self-managed (i.e. private) GitLab instance on on-premises infrastructure, you do not get access to GitLab’s shared (public) runners for CI/CD. Unlike GitLab.com or GitLab Dedicated, a private GitLab instance puts you fully in control, but that control comes with responsibility: you have to bring your own runners. In this article, I document my experience setting up a GitLab runner on a Raspberry Pi 3B+, why I chose this route, and how I did it. Spoiler: while the setup itself is fairly straightforward, there are some Raspberry Pi–specific hardware quirks that are worth handling correctly if you care about long-term stability and recoverability.
Before diving into hardware, it’s worth briefly setting the stage.
Continuous Integration (CI) and Continuous Deployment/Delivery (CD) are practices that automate the process of building, testing, and deploying code changes.
The goal is to catch issues early, reduce manual steps, and make software delivery repeatable and reliable.
Static analysis is of particular interest since it is essentially the automated scanning of code in order to find and report vulnerabilities and errors. For example, consider the following C code snippet:
// @file main.c
// @brief Illustrates usefulness of static code analysis.
#include <stdlib.h>
void foo(int x) {
int buf[10];
buf[x] = 0;
}
int main() {
int *p;
free(p);
foo(10);
return 0;
}
Take a moment to scan the above program and see what might be problematic about
it. Once you’re ready, you could go ahead and compile it with your favorite
compiler and have the compiler emit warnings. For example, we can use gcc
with strict warnings set (see man gcc):
$ gcc -Wpedantic -Wextra -Wall main.c
main.c: In function ‘foo’:
main.c:3:7: warning: variable ‘buf’ set but not used [-Wunused-but-set-variable]
3 | int buf[10];
| ^~~
main.c: In function ‘main’:
main.c:9:3: warning: ‘p’ is used uninitialized [-Wuninitialized]
9 | free(p);
| ^~~~~~~
Great! With just a few flags (note, only the -Wall flag is actually relevant
here, but to be very strict I included -Wpedantic and -Wextra as well), we
get some quite clear and very useful warnings! The first warning tells us we
have an unused variable named buf, while the second tells us we have an
uninitialized variable p that we are calling free on. Calling free on an uninitialized
pointer is undefined behavior (see man 3 free), so we can already infer that
this very simple program is likely to crash. Note that the unused variable is not a
particularly harmful aspect of the program since it will get optimized out by
the compiler anyways. See section Verifying Compiler Optimizations in
Assembly of this article for
details. However, as a good practice you should not declare variables you
don’t use, and gcc warnings can alert you to such artifacts in your code.
While we can easily use gcc to emit warnings, this requires that we compile
our program. We can halt gcc after the compilation proper stage so that the
assembler doesn’t convert the assembly to object code; however, as you scale
the number of files up, you end up wasting computational resources to emit
assembly code when that is not actually a necessary stage to detect many
potential errors in the code.
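In fact, gcc can be stopped even earlier than the assembly-emission stage: the -fsyntax-only flag runs only the frontend (preprocessing, parsing, semantic checks) and produces no output file at all. A minimal sketch, recreating a trimmed copy of the example program:

```shell
# Write a trimmed copy of the example program, then run only gcc's
# frontend checks: -fsyntax-only produces no assembly or object file.
cat > main.c <<'EOF'
#include <stdlib.h>
void foo(int x) { int buf[10]; buf[x] = 0; }
int main(void) { int *p; free(p); return 0; }
EOF
gcc -fsyntax-only -Wall -Wextra main.c  # emits warnings only; exit status 0
```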
Instead, we can use a tool like cppcheck that, per its design
docs,
performs only a modified subset of the phases a compiler would normally
perform, such as preprocessing, tokenizing, and abstract syntax tree
construction. A major difference is that cppcheck will neither assemble nor link code.
Moreover, cppcheck can catch things that a compiler would not. In general,
this is the approach of static code analysis: don’t compile or run your code,
but use parsing techniques to catch errors or produce warnings instead. For
example, if we call cppcheck on the same program,
$ cppcheck main.c
Checking main.c ...
main.c:6:6: error: Array 'buf[10]' accessed at index 10, which is out of bounds. [arrayIndexOutOfBounds]
buf[x] = 0;
^
main.c:12:7: note: Calling function 'foo', 1st argument '10' value is 10
foo(10);
^
main.c:6:6: note: Array index out of bounds
buf[x] = 0;
^
main.c:11:8: error: Uninitialized variable: p [uninitvar]
free(p);
^
we get the crucial error that our array access is out of bounds! We did not see
this error when using gcc. You can also see that the uninitialized variable
is caught as well, so in fact there is overlap in what a compiler can catch and
what a static code analyzer can catch.
You can hopefully understand now why static code analysis in particular is so valuable and why you would want to include it in a CI/CD pipeline.
GitLab is a web-based DevOps platform that combines (non-exhaustively) source
code management, CI/CD, issue tracking, and package/container registries
into a single application. One of GitLab’s strengths is how tightly integrated
CI/CD is with the project (aka repository) itself. That is, pipelines are
defined declaratively via a .gitlab-ci.yml file in a project’s root
directory. The feature of GitLab that executes CI/CD pipelines is known as
GitLab runners (analogous to GitHub Actions).
GitLab runners are daemons (i.e., background processes, see man daemon) that
run on either public (aka shared) servers or are configured to run on on-prem
infrastructure. The runner execution
flow is highly
intuitive, and involves essentially two steps: (1) registering the runner with
a GitLab instance, and (2) the runner polling that instance for jobs and
executing them.
Only step (1) requires active administration since step (2) is handled by the runner itself.
Per the GitLab runner documentation, shared runners are not available to self-managed instances. The following table summarizes the different GitLab tiers:
| Tier | Shared Runner Available | On-prem | Cloud |
|---|---|---|---|
| GitLab.com | yes | no | yes |
| Self-managed | no | yes | no |
| GitLab Dedicated | yes | no | yes |
What does this mean? It means that for a self-managed GitLab instance, the administrators maintain full control of their data and infrastructure. So, if you want to use CI/CD, you need to set up any GitLab runners yourself.
GitLab runners can be set up on virtually any hardware with networking capabilities; however, since I was primarily interested in a lightweight runner, I chose to deploy a Raspberry Pi 3B+.
The Raspberry Pi 3B+ checked several boxes for me:
The last point, full control, is particularly important because I am not an administrator of any of the servers we have at work. By getting my own hardware, I can experiment with the GitLab runner setup without bothering IT.
For CI workloads like static analysis, linting, and lightweight builds, the Pi is perfectly sufficient. Where it could struggle is with multiple developers submitting CI jobs concurrently, since I have disabled concurrent jobs by default. However, it can certainly tolerate use by fewer than five developers, which is the current number of users of the self-managed GitLab instance. More than this would require a runner on more powerful hardware, but that is a next step for later.
I knew that I would be heavily resource constrained, so I opted for a Linux operating system tailored for minimal resource consumption on Raspberry Pis. DietPi fits exactly this requirement and was a breeze to install. Since DietPi is Debian-based, it was also extremely comfortable to administer. But most importantly, DietPi consumed about half of the RAM that the headless Raspberry Pi OS does. Obviously, I want to dedicate as much RAM as possible to the GitLab runner and its operations, so minimizing extraneous RAM utilization was a high priority.
To initially set up the Pi, I flashed DietPi OS onto a microSD, plugged that into the Pi, connected an HDMI cable to a monitor, connected a keyboard, and plugged in the power supply. I then went through the setup stage to get language settings and networking working. Straightforward. Boom, done.
However, as an aside, I will voluntarily and shamefully admit that while selecting my regulatory domain for WiFi (see Wiki: WLAN Channels and Linux wireless regulatory domains), I accidentally selected GE. I live in Germany. This was a silly oversight since GE corresponds to Georgia, and when I was scanning for WiFi connections on my local network nothing was coming up. After switching to DE, of course, my problem was solved. In short: know basic geography and nomenclature to avoid spending half an hour troubleshooting network connectivity issues.
With the Pi initially set up, I can now focus on bolstering recoverability. SD cards fail. Filesystems get corrupted. Eventually, you will brick something, so might as well put measures in place to facilitate easy recovery.
The Raspberry Pi supports USB boot via a one-time programmable (OTP) bit that enables booting from USB mass storage devices. According to recent documentation, this should already be enabled on newer Pi models.
However, when I checked:
$ vcgencmd get_config program_usb_boot_mode
on the Pi, I got
program_usb_boot_mode=0
This means USB boot was, in fact, not enabled.
To fix this, I had to explicitly update /boot/firmware/config.txt to enable USB
boot, then reboot once to permanently set the OTP bit. This update is very
simple: just append the following line to the end of /boot/firmware/config.txt:
program_usb_boot_mode=1
This is a one-way operation, but one that significantly improves recovery options if the system becomes unbootable. Why? Because now I can drop any Linux distribution on a USB stick, then plug it into the Pi to triage any issues without having to pull out the Pi microSD to make changes to it with another computer.
MicroSD cards have limited write endurance (see here). While it is perhaps a bit of overengineering, I wanted to move the root filesystem, where most write operations would occur, to a more durable SSD and leave the largely read-only bootloader on the microSD.
Moving the root filesystem was a multistep process. Ultimately, the goal was two-fold:
1. Update /boot/firmware/cmdline.txt to point the root filesystem to the SSD.
2. Update /etc/fstab on the SSD to reflect the new filesystem information.

First, I plugged the microSD and SSD into a Linux machine. I then needed to create mount points for the filesystems on those storage devices. By listing block devices,
$ lsblk
NAME          MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda             8:0    0 476.9G  0 disk
|_sda1          8:1    0   512M  0 part
|_sda2          8:2    0   2.3G  0 part
mmcblk0       179:0    0  58.3G  0 disk
|_mmcblk0p1   179:1    0   128M  0 part
|_mmcblk0p2   179:2    0  58.2G  0 part
nvme0n1       259:0    0 476.9G  0 disk
|_nvme0n1p1   259:1    0   260M  0 part /boot/efi
|_nvme0n1p2   259:2    0 358.4G  0 part /
|_...
I could look to see the disk and partition sizes. The nvme0n1 disk is the
SSD on my laptop, the sda disk is the SSD I will use for the Pi, and
mmcblk0 is the microSD. I then created mount points for the SSD and microSD:
$ sudo mkdir -p /media/pi/ssd
$ sudo mkdir -p /media/pi/msd/boot
$ sudo mkdir -p /media/pi/msd/root
$ sudo mount /dev/sda2 /media/pi/ssd/
$ sudo mount /dev/mmcblk0p1 /media/pi/msd/boot
$ sudo mount /dev/mmcblk0p2 /media/pi/msd/root
I could then copy the contents of the microSD root directory to its new location on the SSD:
$ sudo cp -a /media/pi/msd/root/* /media/pi/ssd/
Now I need to set the kernel boot parameter that tells the kernel where the root filesystem is. I need the partition unique ID (PARTUUID) of the root partition on the SSD, so I listed this with
$ lsblk -o +PARTUUID
and then set the parameter in /boot/firmware/cmdline.txt on the microSD boot
partition I mounted earlier:
root=PARTUUID=<PARTUUID of the SSD root partition>
I also updated /etc/fstab on the SSD so that the root filesystem mounts from
the SSD and not the microSD. Lastly, I booted the Pi to verify everything
works.
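For concreteness, after these edits the two files looked roughly like the sketch below. The PARTUUID values and the other cmdline.txt parameters are placeholders; use what lsblk -o +PARTUUID actually reports:

```
# /boot/firmware/cmdline.txt on the microSD boot partition (single line):
console=tty1 root=PARTUUID=aaaaaaaa-02 rootfstype=ext4 rootwait

# /etc/fstab on the SSD -- root now mounts from the SSD, boot from the microSD:
PARTUUID=aaaaaaaa-02  /               ext4  defaults,noatime  0  1
PARTUUID=bbbbbbbb-01  /boot/firmware  vfat  defaults          0  2
```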
If you looked closely when I listed the block devices, you may have noticed
that while the disk size of the SSD is 476.9G, the size of the root partition
is only 2.3G. The reason there are two partitions in the first place is because
at one point I tried just booting the Pi directly with the SSD, so this required
that I flash DietPi onto the SSD. Flashing the SSD created a boot and root
partition that were obviously significantly smaller than the total disk space
available. I could have used a tool like gparted; however, DietPi already
comes with a simple tool to expand the partition. So after booting up the Pi,
I simply called
$ dietpi-drive_manager
and voila, expanding the root partition to maximize usage of the SSD is done.
At this point, the system boots normally and the heavy I/O happens on the SSD. Exactly what we want for CI workloads.
With the system stable, the next step was validating whether the Pi is actually useful as a CI runner.
Installing the GitLab runner was surprisingly easy once the Pi was stable. You
can follow the GitLab runner installation
docs, but in short all I needed to do
was curl the setup script and then execute it. I chose to use the Docker
executor for my GitLab
runner, meaning CI jobs are run in Docker containers managed by the runner. To
that end, I built a custom Docker image that comes preloaded with several
static analysis tools:
This image is used directly by GitLab CI jobs running on the Pi. If you’re interested, the image is available at Docker Hub: jfdev001/iap-gitlab-ci. Naturally, the target architecture is ARM since that’s the Pi’s architecture; however, one could easily adapt the Dockerfile for x86-64 architecture. See section Static Code Analysis Dockerfile for the full Dockerfile at the time of this writing.
It’s also worth noting that during the GitLab runner setup on the Pi, I provided some advanced configuration to enforce only a subset of valid Docker images that could be used, a RAM limit for Docker jobs, and a limit on the total memory of jobs when SWAP space is also used. These restrictions were necessary to keep the intention of the runner for lightweight jobs codified and not just socially contracted.
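These restrictions live in the runner's config.toml under the Docker executor section; a sketch with illustrative (not my exact) values:

```toml
# /etc/gitlab-runner/config.toml (sketch; limits are illustrative)
concurrent = 1                  # one CI job at a time on the Pi

[[runners]]
  name     = "raspberry-pi-3bplus"
  executor = "docker"
  [runners.docker]
    image          = "jfdev001/iap-gitlab-ci:static-analysis-base"
    allowed_images = ["jfdev001/iap-gitlab-ci:*"]  # whitelist of usable images
    memory         = "512m"                        # RAM limit per job
    memory_swap    = "768m"                        # RAM + swap limit per job
```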
Naturally, I wanted to test how my new GitLab runner could handle static
analysis of a representative codebase. To that end, I submitted a job to
perform static analysis of the ICON weather/climate
model codebase using fortitude. For
context, in the src directory of this codebase, there are roughly 500k lines
of code spread over 900 files. I forked the ICON codebase to our self-managed
GitLab instance, and provided a job file like the one below:
# @file .gitlab-ci.yml
# @brief Static analysis of ICON's 500k+ line, 900 file Fortran codebase
icon-src-static-analysis:
  stage: test
  image: jfdev001/iap-gitlab-ci:static-analysis-base
  tags:
    - lightweight
  script: |
    # Only emit warnings for syntax errors detected, omitting specific details
    # NOTE: || true included to prevent the job from automatically failing
    find codebases/icon/ -iname "*.f90" -type f \
      -exec fortitude check {} + > /dev/null || true
Fortitude completes a full analysis of the codebase in about 10 seconds. For a low-power, fanless ARM board, that’s more than acceptable, and perfectly adequate for pre-merge CI checks.
Running a GitLab runner on a Raspberry Pi 3B+ turned out to be both practical and educational. In summary, I:
The result is a small, quiet, low-power CI runner that does exactly what it needs to do, and does it reliably. If you’re running a self-managed GitLab instance and want full control over your CI infrastructure without breaking the bank, a Raspberry Pi is a surprisingly solid option, as long as you respect its hardware quirks.
Extra details for anything that I felt didn’t belong in the main article are included below.
The following section is a deliberate detour into compiler behavior. While tangential to the CI setup itself, it helps clarify what truly matters when interpreting static analysis and compiler warnings.
In the example program in section The Importance of Static Code
Analysis, you can verify pretty
easily that the unused variable buf is harmless by comparing the assembly
generated for main.c when optimization flags are turned off and on. In the
following paragraphs, I walk you through the assembly code for main.c,
the lowest-level code that is still semantically meaningful.
With gcc, by default the optimization flags are turned off, and we can
explicitly halt the compiler after the compilation proper stage to emit
assembly code, thus preventing the assembler stage (which converts the assembly
code to object files that are not readable with text editors). If we call
$ gcc -O0 -S -fverbose-asm -o main_debug.s main.c
and look at the foo section of main_debug.s, we see
; @file main_debug.s
; @brief Snippet of foo section of debug mode gcc assembly output.
; @note For highlighting reasons, I use ';' to denote comments. However,
; this is GNU assembly, so '#' would be the syntactically correct choice.
foo:
.LFB6:
.cfi_startproc
endbr64
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
subq $64, %rsp
movl %edi, -52(%rbp) ; x, x
; main.c:4: void foo(int x) {
movq %fs:40, %rax ; MEM[(<address-space-1> long unsigned int *)40B], tmp84
movq %rax, -8(%rbp) ; tmp84, D.2541
xorl %eax, %eax ; tmp84
; main.c:6: buf[x] = 0;
movl -52(%rbp), %eax ; x, tmp83
cltq
movl $0, -48(%rbp,%rax,4) ;, buf[x_2(D)]
; main.c:7: }
nop
movq -8(%rbp), %rax ; D.2541, tmp85
subq %fs:40, %rax ; MEM[(<address-space-1> long unsigned int *)40B], tmp85
je .L2
call __stack_chk_fail@PLT
.L2:
leave
.cfi_def_cfa 7, 8
ret
.cfi_endproc
If you have never seen assembly before, the syntax above can be pretty
overwhelming. Fortunately, with the -fverbose-asm flag, gcc annotated the
important parts so that we can more easily decipher what’s going on. Even if
you were unable to read assembly, you could infer from the comment
`; main.c:6: buf[x] = 0;` that somewhere before it the memory for the buf
array gets allocated. Why? Because that comment indicates that the assembly
following it corresponds to buf[x] = 0. Therefore, at some earlier point in
the assembly code, there must be some allocation of memory for the array.
Okay, but if we don’t want to play an inference game, how do we know how much
memory for buf actually gets allocated? From main.c we see that buf is
an array where each element is of type int and the array can store 10
elements. While the exact number of bytes for C data types can vary from one
hardware architecture to the next, on x86-64 the System V ABI fixes int at 4
bytes. Now we know that buf requires 40 bytes of memory.
If we know how much memory is needed for buf, where in the assembly can
we look to confirm that memory is allocated? To understand that, you must
understand that when a program executes, every function call requires memory
to be reserved for the function’s local variables, arguments, and return
value. The region of memory that gets reserved for a particular function call
is called a stack frame. The stack frame is part of the program (aka process)
stack itself, which has a total size that is much smaller than the total
amount of memory that can be allocated on the heap (see RLIMIT_STACK and
RLIMIT_DATA in man getrlimit). The pushq %rbp and movq %rsp, %rbp
instructions tell us that we are setting up a stack frame for the foo
function, but the key giveaway that we are allocating memory for buf on the
stack is subq $64, %rsp. The $64 says that we need 64 bytes of memory for this
stack frame!
So there you have it: we allocate 64 bytes, which is more than the array
actually needs. Why are we allocating more memory than we need? Well, we know
that memory must also be allocated for the local variable x, which is
another 4 bytes. Great, so we need at least 44 bytes of memory for our stack
frame, and yet we allocate 64 bytes. It turns out that by default we also
allocate 8 bytes for a stack
canary
as a way of securing our code (see -fstack-protector in man gcc, and note
that since Ubuntu 14.10, -fstack-protector-strong is enabled by default). Now
we need at least 52 bytes of memory for our stack frame! Yet we are still
allocating 64 bytes…
It turns out that this is the result of the System V Application Binary
Interface (ABI), which mandates that for
x86-64 architectures the stack is 16-byte aligned, so therefore additional
bytes are used to pad the stack frame until the memory allocated for the stack
frame is a multiple of 16. 64 is a multiple of 16, therefore, while we only
need 52 bytes for the foo function call, we must allocate 64 bytes to
comply with the ABI. The ABI ensures many things, but with respect to
alignment, it guarantees that a program can run without modification on
multiple operating systems and that function calls occur in a standardized way.
That was a huge digression in order to say “hey, we know that when
optimizations are turned off, memory for buf in the stack frame is
allocated.” But what about when we turn on optimizations? Oh no… we’re
going to have to look at more scary assembly to see how the compiler treats
unused variables, which is the warning we saw that gcc emitted way back in
section The Importance of Static Code
Analysis. If we call,
$ gcc -O3 -S -fverbose-asm -o main_opt_verbose.s main.c
and look at the foo section of main_opt_verbose.s, we see
; @file main_opt_verbose.s
; @brief Snippet of foo section of optimized gcc assembly output.
foo:
.LFB16:
.cfi_startproc
endbr64
# main.c:7: }
ret
.cfi_endproc
This assembly is much easier to interpret than the unoptimized assembly.
You can clearly see that no pushq %rbp operation occurs and no allocation
of memory for the stack frame with a subq $LITERAL, %rsp operation occurs.
The compiler intelligently optimized the stack frame allocation out because
no work occurs in the function call.
All of this just to say that the warning emitted in section The Importance of
Static Code Analysis regarding the
unused variable buf can actually be safely ignored since it has no effect on
an optimized object file anyway.
Here is the Dockerfile used to build the static analysis oriented image for the Docker executor on the GitLab runner.
# @file Dockerfile
# @brief Static analysis base image for AARCH64 (e.g., Raspberry Pi 3B+)
FROM python:3.12-slim
# Install system dependencies and static analysis tools
ARG SHELLCHECK_ARCH=aarch64
# https://github.com/koalaman/shellcheck/releases/tag/v0.11.0
ARG SHELLCHECK_VERSION=0.11.0
ARG CHECKMAKE_ARCH=arm64
# https://github.com/checkmake/checkmake/releases/tag/v0.3.2
ARG CHECKMAKE_VERSION=0.3.2
# https://github.com/astral-sh/ruff/releases/tag/0.14.14
ARG RUFF_VERSION=0.14.14
# https://github.com/PlasmaFAIR/fortitude/releases/tag/v0.7.5
ARG FORTITUDE_VERSION=0.7.5
# https://github.com/cheshirekow/cmake_format/releases/tag/v0.6.13
ARG CMAKELANG_VERSION=0.6.13
RUN \
# install system dependencies
apt-get update && apt-get install -y --no-install-recommends \
ca-certificates \
curl \
tar \
unzip && \
apt-get purge -y --auto-remove -o APT::AutoRemove::RecommendsImportant=false && \
rm -rf /var/lib/apt/lists/* && \
# install shellcheck
curl -L https://github.com/koalaman/shellcheck/releases/download/v${SHELLCHECK_VERSION}/shellcheck-v${SHELLCHECK_VERSION}.linux.${SHELLCHECK_ARCH}.tar.gz -o shellcheck.tar.gz && \
tar -xf shellcheck.tar.gz && \
rm shellcheck.tar.gz && \
mv shellcheck-v${SHELLCHECK_VERSION}/shellcheck /usr/bin/ && \
rm -rf shellcheck-v${SHELLCHECK_VERSION} && \
# install checkmake
curl -L https://github.com/checkmake/checkmake/releases/download/v${CHECKMAKE_VERSION}/checkmake-v${CHECKMAKE_VERSION}.linux.${CHECKMAKE_ARCH} -o /usr/local/bin/checkmake && \
chmod +x /usr/local/bin/checkmake && \
# install python based tools
pip install --no-cache-dir \
ruff==${RUFF_VERSION} \
fortitude-lint==${FORTITUDE_VERSION} \
cmakelang==${CMAKELANG_VERSION}
CMD ["bash"]
I am constantly using ssh to remotely login to different compute clusters
(e.g., DKRZ’s Levante, ECMWF’s Atos, local clusters at work, etc.). I find it
much more convenient to use a terminal-based text editor like
neovim coupled with the standard terminal
multiplexer tmux in order to switch easily
between several programs in one terminal. However, I quickly encountered an
issue where my neovim color scheme was not displayed when using it within tmux.
Fortunately, there’s an easy fix to this issue that I cover in this article. I
first show the problem and the expected behavior, diagnose the error, and then
correct it accordingly. I document the solution here, including its use with
legacy systems.
When launching neovim with tokyo-moon colorscheme in a regular terminal (I’m
using Lubuntu 22.04 LTS with default terminal qterminal), you can clearly
see the expected color scheme:
After launching tmux and inspecting the same file, however, I was only seeing the default neovim color scheme:
Luckily, neovim makes diagnosing problems very easy. Launching neovim and
calling :checkhealth revealed the following:
tmux ~
- ✅ OK escape-time: 0
- ✅ OK focus-events: on
- $TERM: screen-256color
- ⚠️ WARNING True color support could not be detected. |'termguicolors'| won't work properly.
- ADVICE:
- Add the following to your tmux configuration file, replacing XXX by the value of $TERM outside of tmux:
set-option -a terminal-features 'XXX:RGB'
- For older tmux versions use this instead:
set-option -a terminal-overrides 'XXX:Tc'
Look at that! We are given an exact warning regarding colors in the terminal.
More importantly, we are given advice on how to fix it. I then updated
~/.tmux.conf accordingly:
echo "set-option -a terminal-features 'xterm-256color:RGB'" >> ~/.tmux.conf
The above solution worked locally, but when I ssh’ed into Levante, I found I
had the same issue still. When working with compute clusters, it’s not uncommon
to encounter very old versions of software. This is precisely the case for the
version of tmux on DKRZ’s Levante, which is the dated version 2.7.
Needless to say, the latest features of tmux aren’t supported there.
Fortunately, :checkhealth also provides the solution in such cases! On
Levante, the following did the trick:
echo "set-option -a terminal-overrides 'xterm-256color:Tc'" >> ~/.tmux.conf
You may also be thinking: “Why didn’t you just install a more recent version of tmux on Levante and use that?” Well, I did, in fact, try this. For whatever reason, however, the tmux server would periodically crash. Rather than trying to figure out the cause, I lazily downgraded back to the original tmux version that I knew was stable. Sometimes it’s better to just use what works: fundamentally, I am using tmux as a productivity tool, and I did not want to invest time figuring out the root of the periodic crashes of the latest version.
Maybe in a future article I will document the debugging journey, but for now, I will conclude :)
I often download and read scientific journal articles on my Kobo eReader.
Unfortunately, the metadata in the PDFs for such journal articles may lack title
information. This means that the searchable title that appears for the article
once loaded onto the eReader is something very ugly like an uninterpretable
string of digits. An easy way to fix this is using the open-source tool pdftk.
In this short article, I show how to use pdftk for this purpose.
A look at the man pages for pdftk via man pdftk tells us everything we
need to know:
A deeper look at the man pages reveals that we need to dump the PDF metadata, modify it in a text editor, and then update the original PDF with the new metadata.
As an example, take Tsunami Propagation from a Finite Source (Carrier 2005). Once you’ve downloaded it, you can inspect the metadata contents with:
pdftk ~/Downloads/cmes.2005.010.113-2.pdf dump_data_utf8 output ~/Downloads/cmes.2005.utf
Opening ~/Downloads/cmes.2005.utf, you’ll see a number of fields, one of which
looks like the following:
InfoBegin
InfoKey: Title
InfoValue: main.dvi
If you change the InfoValue here in ~/Downloads/cmes.2005.utf to your
desired name, e.g.,
InfoValue: Carrier 2005: Tsunami Propagation from a Finite Source
and then call
pdftk ~/Downloads/cmes.2005.010.113-2.pdf update_info_utf8 ~/Downloads/cmes.2005.utf output ~/Downloads/cmes_updated.pdf
then your PDF now has the correct metadata and is ready for reading on an eReader!
Here is a permalink to a script that automates this update process for you. Here is a link to the same script on the main branch (in case I ever update it). The code for the script is pasted below in case you don’t want to check out my GitHub :)
#!/usr/bin/bash
trap 'echo "Error at ${LINENO}: ${BASH_COMMAND}" >&2' ERR
usage=$(cat << EOF
usage: $0 [-h] [-i] PDF NEW_TITLE
Update a PDF's metadata with a NEW_TITLE and then output the updated PDF to a
default path 'updated_\${PDF}'. Pass -i to update the PDF inplace.
positional arguments:
PDF Path to PDF to update.
NEW_TITLE The new metadata title for the PDF.
optional arguments:
-h Print a help message and exit.
-i Update the PDF inplace, thus removing any intermediate outputs.
EOF
)
inplace=0
while getopts "hi" flag
do
case "${flag}" in
h) echo "${usage}"
exit 0
;;
i) inplace=1;;
*) echo "ERROR: unrecognized flags"
echo "try $0 -h"
exit 1
;;
esac
done
# shift past options to get positionals
shift $(($OPTIND - 1))
n_positional_args=2
PDF=$1
NEW_TITLE=$2
# Script exits immediately on error and errors on uninitialized variables
# NOTE: when using grep, grep exits 1 on failure to pattern match and therefore
# '|| true' is needed when saving grep output to use in test expressions
set -eu
# Check if all positional arguments were provided
if [ $# -lt $n_positional_args ]
then
echo "ERROR: Missing positional arguments!"
echo "Try '$0 -h' for more information."
exit 1
fi
# Verify positionals
if [[ ! -f "${PDF}" ]]
then
echo "ERROR: PDF does not exist"
echo "Got ${PDF}"
exit 1
fi
# update title metadata
cur_metadata=$(pdftk "${PDF}" dump_data_utf8)
has_title_metadata=$(grep -n "InfoKey: Title" <<< "${cur_metadata}" || true)
if [[ -n "${has_title_metadata}" ]]
then
# remove existing metadata
lineno_infokey_title=$(grep --only-matching --perl-regexp \
"\d*(?=:)" <<< "${has_title_metadata}")
lineno_infobegin=$((lineno_infokey_title - 1))
lineno_infovalue=$((lineno_infokey_title + 1))
cur_metadata="$(printf '%s\n' "${cur_metadata}" | \
sed "${lineno_infobegin}d;${lineno_infokey_title}d;${lineno_infovalue}d")"
fi
# Define the new metadata
title_metadata=$(cat << EOF
InfoBegin
InfoKey: Title
InfoValue: ${NEW_TITLE}
EOF
)
title_metadata="${title_metadata}"$'\n'
new_metadata="${title_metadata}${cur_metadata}"
output_pdf_path="$(dirname "$(realpath "${PDF}")")/updated_$(basename "${PDF}")"
tmp_new_metadata="$(mktemp)"
echo "${new_metadata}" > "${tmp_new_metadata}"
# update the pdf
pdftk "${PDF}" update_info_utf8 "${tmp_new_metadata}" output "${output_pdf_path}"
if [[ ${inplace} -eq 1 ]]
then
rm "${PDF}"
mv "${output_pdf_path}" "${PDF}"
output_pdf_path="${PDF}"
fi
echo "output written to ${output_pdf_path}"
rm "${tmp_new_metadata}"
In this article, some extra mathematical details related to the solution of the steady-state reaction-diffusion equation using PETSc are discussed. First, the simple nonlinear governing equation of interest is shown. Then, Newton’s method is presented at the partial differential equation (PDE) level for generality rather than being presented at the algebraic level. Subsequently, the spatial discretization via the finite difference method is shown for completeness. Finally, a commented PETSc implementation of the discretized reaction-diffusion equation is shown to concretely illustrate how the mathematical notation maps to code.
The one dimensional, time evolving heat equation can be generalized with reaction \(R(u)\) and source \(f(\cdot)\) processes in addition to diffusion (i.e., \(\frac{\partial^2 u}{\partial x^2}\) in 1D or \(\nabla^2 u\) in N dimensions) for a temperature (or substance more generally) with concentration \(u(x, t)\), and is given by
\[\frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + R(u) + f. \\ \tag{1}\]In this post, we focus on the time independent (aka steady-state) form of equation (1)
\[0 = \frac{\partial^2 u}{\partial x^2} + R(u) + f. \\ \tag{2}\]We define also \(R(u) = -\rho \sqrt u\) and \(f(x) = 0\) as well as the Dirichlet boundary conditions \(u(0) = \alpha\) and \(u(1) = \beta\) for our domain \(\Omega = [0, 1] \).
Clearly, the square root function in \(R(u)\) introduces a nonlinear term into equation (2). Since identifying whether an equation has nonlinear terms determines whether we can use a linear or a nonlinear solver, it is useful to recall that an operator \(L\) is said to be linear if, for every pair of functions \(f\) and \(g\) and any scalar \(\theta\),
\[L(f + g) = L(f) + L(g),\]and
\[L(\theta f) = \theta L(f).\]In this case, the square root term is obviously nonlinear. Given this nonlinearity, a nonlinear solver must be used to solve equation (2).
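A one-line counterexample makes the nonlinearity of the reaction term explicit: taking \(f = g = 1\),

\[\sqrt{f + g} = \sqrt{2} \neq 2 = \sqrt{f} + \sqrt{g},\]

so the first condition for linearity fails.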
We present here Newton’s method for solving nonlinear equations at the PDE level to make it as generally applicable as possible. Newton’s method is often presented at the algebraic level since its presentation is paired with the discretization method of the PDEs (e.g., finite difference methods, finite element methods, etc.). Since there are a large number of techniques to discretize PDEs, it is more general to formulate Newton’s method in the context of the strong (“original”) form of the PDEs. You can then select and apply the appropriate steps for your desired discretization (e.g., for the finite element method, this involves deriving a weak formulation and substituting the finite element solution using specified basis functions) followed by an application of your solver to the discretized PDEs.
Now, Newton’s method is for solving general nonlinear equations of the form
\[F(u) = 0. \\ \tag{3}\]Substituting equation (2) into equation (3) and then expanding terms
\[F(u) = \frac{\partial^2 u}{\partial x^2} - \rho \sqrt u = 0, \\ \tag{4}\]we now have our governing equation in the desired general form. A cornerstone of computational science is converting problems into forms that we know how to solve effectively. We know and have numerous methods to solve linear systems of equations. Newton’s method iteratively approximates solutions to \(u\) in equation (4) by linearizing (i.e., converting a nonlinear problem to a linear one) around an iterate, solving the new linear system for a step in the direction of the solution, and then adding that step to the current iterate. That step in the direction of the solution is called a “Newton step” or perturbation. Mathematically, this solution process can be written as
\[\begin{aligned} F(u^k + \delta u) &= F(u^k) + F'(u^k)(\delta u), && \text{(5.1)}\\ F'(u^k)(\delta u) &= -F(u^k), && \text{(5.2)} \\ u^{k+1} &= u^{k} + \delta u. && \text{(5.3)} \end{aligned}\]The “linearization” is equation (5.1): it is a truncated Taylor series that is a linear function of \(\delta u\) that approximates \(F\) near \(u^k\) at iteration \(k\), therefore we have replaced our nonlinear function with a linear one. We seek the zero of this function, that is \(F(u^k + \delta u) = 0\) and rearrange to get equation (5.2). The term \(F’(u)(\delta u)\) may look a little suspicious, but consider for a moment what \(F\) actually is. If \(F\) were a scalar function, naturally \(F’\) is the ordinary derivative. Similarly, if \(F\) were a vector function, \(F’\) would be the Jacobian written as \(J_F(u^k)\). However, \(F\) is neither a scalar nor a vector function, but rather it is an operator: a map of one function space to another function space. That is, \(F\) maps the function space of \(u\) to another function space. Thus, we must define the derivative of an operator. This is given by the Gateaux derivative
\[dF(u; \delta u) = \lim_{\epsilon \rightarrow 0} \frac{F(u + \epsilon \delta u) - F(u)}{\epsilon} = \left . \frac{d}{d\epsilon} F(u + \epsilon \delta u) \right\vert_{\epsilon = 0}.\]It turns out that by definition, \(F’\) in \(dF(u; \delta u) = F’(u)(\delta u)\) is just the continuous linear operator represented by the Jacobian matrix (see Rall 1971). Note, there are technically some subtleties here for the mathematically rigorous reader—I am no formal mathematician myself, of course. The Gateaux derivative, by definition, is not necessarily linear or continuous, but if we put the linear and continuous constraints on the Gateaux derivative, then we define the Frechet derivative. In the context of PDEs for which one is likely to perform numerical simulation, it is reasonable to assume the linearity and continuity requirement. Experts will still call the derivative for operators the Gateaux derivative (see Bangerth: Lecture 31.55) even when the term Frechet derivative might be more appropriate. I am only mentioning this distinction since you might encounter the Gateaux or Frechet derivative in your own study of numerical PDEs. Thus, to solve nonlinear PDE problems via Newton’s method, we must do one of the following:
(a) Provide the Jacobian explicitly after deriving and discretizing the Gateaux derivative.
(b) Allow a library (or write it yourself if you so desire) to compute the Jacobian by relying on automatic differentiation (e.g., NonlinearSolver.jl: Solvers) and/or sparsity detection (e.g., NonlinearSolve.jl: Declaring a Sparse Jacobian with Automatic Sparsity Detection).
(c) Allow a library to symbolically compute the Jacobian (e.g., Mathematica: Unconstrained Optimization – Methods for Solving Nonlinear Equations, FENICS: Solving the Nonlinear Variational Problem Directly, etc.).
It is worth noting that there are Jacobian-free methods (see Knoll 2004), but these come with their own trade-offs.
To be as explicit as possible, we now compute the Gateaux derivative of \(F\)
\[\begin{aligned} \left . \frac{d}{d\epsilon} F(u + \epsilon \delta u) \right\vert_{\epsilon = 0} &= \left . \frac{d}{d\epsilon}\left[ \frac{\partial^2}{\partial x^2}(u + \epsilon \delta u) - \rho \sqrt{(u + \epsilon \delta u)} \right] \right\vert_{\epsilon = 0} \\ &= \left . \frac{d}{d \epsilon}\left[\frac{\partial^2}{\partial x^2}u\right] + \frac{d}{d \epsilon}\left[\frac{\partial^2}{\partial x^2}\epsilon \delta u \right] - \frac{d}{d \epsilon} \left[ \rho \sqrt{(u + \epsilon \delta u)} \right] \right\vert_{\epsilon = 0} \\ &= \left . \frac{\partial^2}{\partial x^2} \delta u - \frac{d}{d \epsilon} \left[ \rho \sqrt{(u + \epsilon \delta u)} \right] \right\vert_{\epsilon = 0} \\ &= \left . \frac{\partial^2}{\partial x^2} \delta u - \frac{\rho}{2}(u + \epsilon \delta u)^{-\frac{1}{2}} \delta u \right\vert_{\epsilon = 0} \\ &= \frac{\partial^2}{\partial x^2} \delta u - \frac{\rho}{2 \sqrt u} \delta u. \end{aligned} \\ \tag{6}\]Once we have the Gateaux derivative, equation (5.3) is simply an update of the solution space in the direction “pointing” toward the zero of our nonlinear function.
We now have the continuous forms of the reaction-diffusion equation suitable for solution via Newton’s method. To provide implementations, these continuous forms have to be discretized as shown in the next sections. We therefore define a structured, 1D grid with spacing \(h\) between grid points and indices \(i \in \{0, \ldots, 5\}\) where the full domain is \(x \in [0, 1]\). Recall also that \(u(0) = \alpha\) and \(u(1) = \beta\). Note that the grid spacing can be calculated by \(h = \frac{1}{m_x - 1}\) where \(m_x\) is the number of grid points. Naturally, we subtract 1 as there are \(m_x - 1\) “spacings” between the \(1^{st}\) and \(m_x^{th}\) grid point. The grid is depicted and annotated below for reference.
h
-------
| |
V V
u: α ? ? ? ? β
▆-----▆-----▆-----▆-----▆-----▆
x: 0 0.2 0.4 0.6 0.8 1
i: 0 1 2 3 4 5
Here, we discretize equation (4). The focus of this article is not on the derivation of the finite difference method, so, if you’re unfamiliar with it, a clear explanation and derivation can be found at CFD University’s: The Finite Difference Method. Ignore the bits in that article related to the Navier-Stokes equations, of course. In any case, the discrete form of the second derivative operator using a centered finite difference is given by
\[\begin{aligned} F(u) \approx F(u_i) = F_i &= \frac{u_{i-1} - 2 u_i + u_{i+1}}{h^2} + R(u_i) \\ &= \frac{u_{i-1} - 2 u_i + u_{i+1}}{h^2} - \rho \sqrt{u_i}, \end{aligned} \tag{7}\]where \(u_i\) is the solution at grid point \(i\). That’s it!
Since equation (6) defines a linear operator on \(\delta u\), we only care about discretizing the coefficients of \(\delta u\). That is, we want the discrete form of the continuous linear operator
\[F'(u) = \frac{\partial^2}{\partial x^2} + \frac{dR}{du} = \frac{\partial^2}{\partial x^2} - \frac{\rho}{2 \sqrt u}, \tag{8}\]since \(F'(u)\) acts on \(\delta u\) as represented by \(F'(u)(\delta u)\) in equation (5.2). If we apply a centered finite difference scheme to discretize the second derivative operator on \(\delta u\), we have
\[\frac{\partial^2}{\partial x^2} \delta u \approx \frac{\delta u_{i-1} - 2 \delta u_i + \delta u_{i+1}}{h^2}. \\ \tag{9}\]For the derivative of the reaction function given by \(\frac{dR}{du}\), \(u\) in \(\frac{\rho}{2 \sqrt u}\) corresponds simply to \(u_i^{k}\) since this is the discrete solution at the grid point \(i\) at Newton step \(k\).
If you’re paying attention, you’ll notice that the discrete form of the second derivative operator—a linear operator—in equation (9) acting on \(\delta u\) is the same as the discrete form of the second derivative operator when it acts on \(u\) in equation (7). This may seem like a rather silly or obvious thing to note; however, it’s surprisingly important for the efficient solution of nonlinear equations since the discrete form of the linear operator is completely independent of the values of \(u_i^k\)—which change at every iteration as defined by equation (5.3). This means that a performance-conscious programmer can not only initialize the sparsity pattern of the Jacobian but can also cache those values of the discrete linear operator in the Jacobian—meaning you do not have to assign the coefficients of the discrete linear operator to the Jacobian more than once. This doesn’t apply to the PETSc implementation of the present article, but it’s information that’s certainly worth knowing. You can find an example of this sort of caching in a Ferrite.jl example that solves the Navier-Stokes equations, if you like.
With the discrete operations described above, we now frame the action of the Jacobian on the perturbation \(\delta u\) in the context of a Newton iteration \(k\). The Jacobian can therefore be written—emphasizing the coefficients of the Jacobian by wrapping them in brackets—in the form
\[\begin{aligned} F'(u^k)(\delta u) &= J_F(u^k) \delta u \\ &= \left[ \frac{1}{h^2} \right] \delta u_{i-1} + \left[ \frac{-2}{h^2} \right] \delta u_i + \left[ \frac{1}{h^2} \right] \delta u_{i+1} + \left[\frac{-\rho}{2 \sqrt{u^k_i}}\right]\delta u_i \\ &= \left[ \frac{1}{h^2} \right] \delta u_{i-1} + \left[ \frac{-2}{h^2} - \frac{\rho}{2 \sqrt{u^k_i}} \right] \delta u_i + \left[ \frac{1}{h^2} \right] \delta u_{i+1}. \end{aligned} \\ \tag{10}\]With the components of the Jacobian explicitly specified, we can now propose functions for implementing the solution of the reaction-diffusion equation using PETSc.
PETSc provides and supports a suite of (non)linear solvers, preconditioners,
data types for linear algebra, massive scalability through automatic support for
distributed and shared memory parallelism, and much more.
The user need only provide a comparatively simple—at least for the
problem we consider in this article—set of functions that specify their
particular problem. Per figure (1), the user needs to implement
FormFunctionLocal—which is simply \(F(u)\)—as well as
FormJacobianLocal—which is simply \(J_F(u^k)\). First the implementation
will be shown, then relevant parts of the code will be mapped to their
mathematical equivalent.
In this section we implement the necessary PETSc functions for the solution of the reaction-diffusion equation. These implementations correspond directly to the maths introduced in the present article.
The PETSc function below for \(F(u)\) is adapted from p4pdes/reaction.c:92-115.
// Compute F(u) for reaction-diffusion equation
// Reference: Equation (7)
PetscErrorCode FormFunctionLocal(DMDALocalInfo *info, PetscReal *u,
PetscReal *FF, AppCtx *user) {
PetscInt i;
PetscReal h = 1.0 / (info->mx-1), x, R;
// iterate through grid indices
for (i=info->xs; i<info->xs+info->xm; i++) {
// point on left boundary
if (i == 0) {
FF[i] = u[i] - user->alpha;
// point on right boundary
} else if (i == info->mx-1) {
FF[i] = u[i] - user->beta;
// interior
} else {
// stencil includes left boundary
if (i == 1) {
FF[i] = user->alpha - 2.0 * u[i] + u[i+1];
// stencil includes right boundary
} else if (i == info->mx-2) {
FF[i] = u[i-1] - 2.0 * u[i] + user->beta;
// stencil is purely in interior
} else {
FF[i] = u[i-1] - 2.0 * u[i] + u[i+1];
}
R = -user->rho * PetscSqrtReal(u[i]);
FF[i] = FF[i] / (h*h) + R;
}
}
return 0;
}
In line 6, the grid spacing h is computed as expected. In line 8, we iterate
through the locally owned part of a distributed vector: the loop starts at the
local start index info->xs and runs up to (but excluding) info->xs plus the
number of locally owned points info->xm.
There are, of course, two special cases that occur while iterating over the grid points: the left boundary (i.e., \(u(0) \equiv u_0 = \alpha\)) and the right boundary (i.e., \(u(1) \equiv u_{m_x - 1} = \beta\)) conditions. From equation (4) and equation (7), we know that \(F_i = 0\), and if we recognize the fact that the boundary conditions demand that \(u_0 = \alpha \), then we can infer that \(F_0 = u_0 - \alpha = 0\), which is exactly the residual form we need. Line 11 follows from this reasoning. The same logic applies to line 14 but for the right boundary condition.
Lines 16 to 29 handle the interior points. Line 19 is the discrete second
derivative for the second—index i=1—grid point in the 1D grid
where on the left boundary \(u_{i-1} = u_{0} = \alpha\). Line 22 is analogous
but for the right boundary where \(u_{i+1} = u_{m_x - 1} = \beta \). Line
27 computes the reaction function using its definition. Finally,
line 28 divides the numerator of equation (7), computed in one of the
branches of lines 18 to 26, by the square of the grid spacing to complete
the discrete second derivative of \(u\); the reaction
function evaluated in line 27 is then added, also per equation (7).
Next, the PETSc function for the Jacobian is adapted from p4pdes/reaction.c:117-114.
// Compute J_F(u^k) for reaction-diffusion equation
// Reference: Equation (10)
PetscErrorCode FormJacobianLocal(DMDALocalInfo *info, PetscReal *u,
Mat J, Mat P, AppCtx *user) {
PetscInt i, col[3];
PetscReal h = 1.0 / (info->mx-1), dRdu, v[3];
for (i=info->xs; i<info->xs+info->xm; i++) {
// boundary conditions
if ((i == 0) || (i == info->mx-1)) {
v[0] = 1.0;
PetscCall(MatSetValues(P,1,&i,1,&i,v,INSERT_VALUES));
// interior
} else {
col[0] = i;
dRdu = -user->rho / (2.0 * PetscSqrtReal(u[i]));
v[0] = -2.0 / (h*h) + dRdu;
col[1] = i-1;
v[1] = (i > 1) ? 1.0 / (h*h) : 0.0;
col[2] = i+1;
v[2] = (i < info->mx-2) ? 1.0 / (h*h) : 0.0;
PetscCall(MatSetValues(P,1,&i,3,col,v,INSERT_VALUES));
}
}
PetscCall(MatAssemblyBegin(P,MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(P,MAT_FINAL_ASSEMBLY));
if (J != P) {
PetscCall(MatAssemblyBegin(J,MAT_FINAL_ASSEMBLY));
PetscCall(MatAssemblyEnd(J,MAT_FINAL_ASSEMBLY));
}
return 0;
}
Lines 6 and 7 are essentially the same as in FormFunctionLocal.
Once again, we handle the boundary conditions in lines 9 to 13. But why assign a value of
1.0 to the first element of a 3-element vector v and then use v to set only
a single element of the matrix P?
As in FormFunctionLocal, at the boundaries we have
\[F_0 = u_0 - \alpha,\]and
\[F_{m_x - 1} = u_{m_x - 1} - \beta,\]in the discrete form. Since we’re concerned with the continuous notation, as this is the form you are most likely to encounter, recall that the boundary condition demands \(u(0) = \alpha\). Given that \(F(u) = 0\) from equation (4), it follows that
\[F(u(0)) = u(0) - \alpha = 0.\]In contrast to equation (4), there is obviously no second derivative operator or reaction operator acting on \(u\). What this means is that for \(u(0)\) and \(u(1)\), we can simply compute the derivative with respect to the unknown at the given boundary points such that
\[F'(u(0)) = \frac{\partial F(u(0))}{\partial u(0)} = \frac{\partial}{\partial u(0)} u(0) - \frac{\partial}{\partial u(0)} \alpha = 1,\]and similarly
\[F'(u(1)) = \frac{\partial F(u(1))}{\partial u(1)} = \frac{\partial }{\partial u(1)} u(1) - \frac{\partial }{\partial u(1)} \beta = 1.\]Returning to the discrete form, if \(F(u) \approx F(u_i) \) from equation (7), it follows
that \(F'(u) \approx F'(u_i)\). Therefore, when i = 0, we have
\(F'(u_0) = 1\) and when i = mx-1, we also have \(F'(u_{m_x - 1}) = 1\).
This means we set only a single element of P corresponding to index i,
that is P[i][i] = 1.0, since no further elements need to be set to satisfy the
discretization of the boundary conditions. Note that &i in line 11 acts as a
1-element array; MatSetValues expects arrays because one usually assigns
values corresponding to arrays of row and column indices. Since we are only
assigning one value, a 1-element array suffices. See PETSc: MatSetValues
and note that the array parameters idxm[] and idxn[] decay to pointers per
the C language.
This concludes the discussion of enforcing boundary conditions in the Jacobian.
In lines 13 to 25 we handle the interior points. Lines 14 to 16 define the
coefficient of \(\delta u_i\) in equation (10). Note that in the
FormJacobianLocal function, u corresponds to the vector of values of the
solution u at Newton iteration k. Therefore, u[i] is equivalent to
\(u_i^k\). Lines 18 and 19 define the coefficient of \(\delta u_{i-1}\)
using equation (10) only if the neighboring point i-1 does not lie on the
boundary; that is, they enforce the PDE on the interior points and eliminate
any coupling with boundary points. Lines 21 and 22 apply the same principle
to the coefficient of \(\delta u_{i+1}\).
Lastly, lines 29 to 32 assemble J when it is not the same matrix as P
(see SNES: Jacobian Evaluation); J is the Jacobian and P is the matrix from
which the preconditioner \(M\) is built. While the Jacobian is usually the
same as the matrix from which the preconditioner is built, in principle you
could set a different matrix P that has more desirable properties (e.g.,
better conditioning) than J. In practice, P is used in conjunction with the
specified preconditioner (e.g., Jacobi, ILU, etc.) \(M\) to solve a left
preconditioned (the default for Krylov solvers in PETSc) system of equations
arising from equations (5.2) and (10) such that
\[M^{-1} J_F(u^k) \delta u = -M^{-1} F(u^k).\]
To make the preconditioner discussion concrete, if we were to tell PETSc to use a Jacobi preconditioner, then \(M = \text{diag}(J) = \text{diag}(P)\).
With this, we conclude the discussion of the relationship between the mathematics in the first part of the article and the adapted implementation for \(F(u)\) and \(J_F(u^k)\).
The adapted implementation in the previous section is based on maths that we derived in the present article with the intention of making the derivation clearer as well as generally applicable to nonlinear PDE problems. However, the basis of the implementation is Bueler 2021, and for completeness we briefly explain how the original implementation maps to slightly different rearrangements of the maths covered throughout the rest of the article.
In Bueler 2021, the governing equation is formulated as
\[F(u) = -\frac{\partial^2 u}{\partial x^2} + \rho \sqrt u,\]and the corresponding implemented discretization by finite differences is
\[F(u) \approx F(u_i) = F_i = -u_{i-1} + 2 u_i - u_{i+1} + h^2 (\rho \sqrt{u_i} - f_i),\]though \(f = 0\) in this problem, so that term may be ignored. In the original code, there is also a flag for including the derivative of the reaction function in the Jacobian. If the derivative of the reaction function is excluded from the Jacobian, this simplifies the diagonal of the Jacobian, and the resulting matrix \(K \approx J\) is an approximation of the Jacobian with similar spectral characteristics. For simplicity, we omit this flag in the adapted implementation.
Other than these small changes, the adapted implementation differs very little from the original implementation.
In this article, we bridge the gap between continuous nonlinear PDE theory and practical implementation. We cover Newton’s method, Gateaux derivatives, and an application of the finite difference discretization, while providing concrete PETSc code to solve the nonlinear steady-state reaction-diffusion problem. With mathematical and implementation details explicitly treated, hopefully the reader can now more confidently reason about low-level numerical codes that they read and write. Future articles will show similar step-by-step mathematical and implementation details for systems of PDEs as well as time-dependent PDEs.
[1] : Bueler, E. Chapter 4: Nonlinear equations by Newton’s Method in PETSc for Partial Differential Equations: Numerical Solutions in C and Python. SIAM 2021.
[2] : Logg, A. et al. Chapter 1.2: Nonlinear problems in Automated Solution of Differential Equations by the Finite Element Method – The FEniCS Book. Springer 2012.
[3] : Heath, M. Chapter 5: Nonlinear equations in Scientific Computing: An Introductory Survey, 2ed. SIAM 2018.
[4] : Bangerth, W. MATH 676 Lecture 31.55 from Colorado State University: Nonlinear problems, part 2 – Newton’s method for PDEs.
[5] : Dokken, J. The FEniCS Tutorial: Custom Newton Solver.