Losing Our Minds Over Memory
Well hello there! Today we have yet another interesting Linux bug investigation story. This time, however, we’re going outside the kernel: Oh yes, this was an issue with the distribution itself.
Why’s the RAM Gone?
We produce a 32-bit ARM board running Linux, and we had some clients who wanted to run apps on it or something (who does that?!). The Linux distribution we ship with the board (or “board support package”, or “BSP”, or strrev("PSB")) was recently updated, and a client noticed that they could not allocate as much memory as they used to in a previous version:
Billy: We need to allocate up to 1.8G of memory for our application, and now we can only allocate ~1.1G before getting an -ENOMEM!
Me: That sucks, let me try it:
// memory_test.c
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>

int main(int argc, char* argv[]) {
    const size_t sz = 1024 * 1024 * atoi(argv[1]);
    void* mem = malloc(sz);
    if (!mem) {
        printf("Error: %d\n", errno);
    } else {
        printf("Ok!\n");
    }
    while (1)
        usleep(1000000); // Suspend so we can investigate
    return 0;
}
# ./memory_test 1100
Ok!
# ./memory_test 1400
Error: 12
Me: Yup, we got a problem
Let’s Begin
So, let’s start by spitting some facts about our Linux distribution:
- It’s a 32-bit kernel, so every process should see 4 GB of address space;
- It’s a 2G/2G split, so every process has 2 GB of usable space; the rest is for the kernel;
- We use Yocto to build our distribution.
If you’re new to the world of embedded Linux, Yocto is like the Build-a-Bear of Linux distributions, if the bear were a… Linux distribution? I dunno, we don’t have Build-a-Bear in Québec. The idea is that you can define and build a completely custom Linux distribution using a hierarchy of “recipes”. It’s a nifty project that works well out of the box and offers a ton of flexibility, provided you’re willing to commit a bit of time to learning how to use it.
So, where did it all (OK, just the one thing) go wrong?
Tracer Round
Let’s see what’s actually happening when we call malloc() - we can use strace to log all system calls:
# strace ./memory_test 1500
We get quite a bit of output, but at the end we see something like this:
mmap2(NULL, 1572868096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
What’s interesting is that it’s a call to mmap2 that’s failing - a system call generally used to map files into the address space of a program. In this case it’s not a file, but rather glibc bypassing the heap and allocating the memory directly with an anonymous mapping (mmap with MAP_ANONYMOUS), which it does for any request larger than a certain threshold (M_MMAP_THRESHOLD, 128 KiB by default). So, glibc is asking the Linux kernel to give us a big chunk of 1572868096 bytes from anywhere in virtual memory, and the kernel is replying “nein”.
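As an aside (a quick sketch for the curious, not something from the original investigation), that threshold is tunable with mallopt(). Run something like this under strace and the big allocation shows up as an anonymous mmap2, while the small one stays on the heap:

// mmap_threshold.c - sketch: glibc malloc switches to mmap() for requests
// above M_MMAP_THRESHOLD (128 KiB by default; glibc also adapts it at runtime).
#include <malloc.h>
#include <stdlib.h>
#include <stdio.h>

int main(void) {
    // Pin the threshold to 1 MiB instead of relying on the default.
    mallopt(M_MMAP_THRESHOLD, 1024 * 1024);

    void* small = malloc(64 * 1024);      // below threshold: served from the heap (brk)
    void* big = malloc(8 * 1024 * 1024);  // above threshold: anonymous mmap, like our big request

    printf("small=%p big=%p\n", small, big);
    free(big);    // mmap'd chunks are handed back to the kernel (munmap) immediately
    free(small);
    return 0;
}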
Now let’s run the same strace with the old version:
mmap2(NULL, 1572868096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x19b000
Ok wow, worked just like that. I decided to try the old glibc on the new BSP version by statically linking the test program (-static passed to gcc), and that worked. Then I tried passing -static to link with the new glibc, and it seemed to work just as well! Wack.
I decided to comb through the strace output for any other differences, and came across this at the beginning:
Before:
openat(AT_FDCWD, "/home/root/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0(\0\1\0\0\0\34l\1\0004\0\0\0"..., 512) = 512
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76f06000
mmap2(NULL, 1291652, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x76d9c000
mprotect(0x76ec2000, 65536, PROT_NONE) = 0
mmap2(0x76ed2000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x126000) = 0x76ed2000
mmap2(0x76ed5000, 9604, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x76ed5000
After:
openat(AT_FDCWD, "/lib/libc.so.6", O_RDONLY|O_LARGEFILE|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0(\0\1\0\0\0\231\277\264G4\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=932112, ...}) = 0
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x76fbe000
mmap2(0x47b30000, 998260, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x47b30000
mprotect(0x47c0f000, 65536, PROT_NONE) = 0
mmap2(0x47c1f000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xdf000) = 0x47c1f000
Do you see it? Before, there were two mmap2 calls mapping to fixed addresses at 0x76ed2000 and 0x76ed5000, but now those two calls seem to be mapping to 0x47b30000 and 0x47c1f000 instead. And sure enough, when I looked at the test application’s virtual memory map using pmap $(pgrep memory_test), there were some allocations at and around address 0x47b30000 in the new BSP version.
This is an issue because we only have 2 GB of available userspace memory, which corresponds to addresses [0x0; 0x80000000[. If we use an address in the middle of this range for something else, there’s no way we can fit a 1.8 GB+ memory allocation in a single contiguous block. So why did it suddenly decide to change the load address between Linux distributions?
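To put rough numbers on it (a back-of-the-envelope sketch using the addresses from the strace output above; it ignores the executable, heap, stack and the other libraries, so the real holes are slightly smaller still):

// gap_math.c - why ~1.1 GB is the ceiling once glibc lands at 0x47b30000.
// The addresses are taken from the "After" strace output above.
#include <stdio.h>

int main(void) {
    const unsigned long userspace_top = 0x80000000UL; // top of the 2G/2G split
    const unsigned long libc_start    = 0x47b30000UL; // first libc mapping
    const unsigned long libc_end      = 0x47c22000UL; // end of the last libc mapping

    printf("gap below libc: ~%lu MiB\n", libc_start / (1024 * 1024));                 // ~1147 MiB
    printf("gap above libc: ~%lu MiB\n", (userspace_top - libc_end) / (1024 * 1024)); // ~899 MiB
    // Neither hole fits the ~1.5 GiB request from the strace, let alone the
    // 1.8 GiB the application wants - and ~1147 MiB lines up with the ~1.1G
    // ceiling Billy was hitting.
    return 0;
}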
Load
I decided to follow the problem and see where it takes me: First of all, who is responsible for issuing that mmap2 call to load glibc in the first place? This is the responsibility of the dynamic loader ld-linux, which is built as part of glibc. As luck would have it, there aren’t actually that many mmap calls in the loader, and by comparing the flags I was able to track down the first mmap2 call to this bit of code in elf/dl-map-segments.h (v2.31):
  ElfW(Addr) mappref
    = (ELF_PREFERRED_ADDRESS (loader, maplength,
                              c->mapstart & GLRO(dl_use_load_bias))
       - MAP_BASE_ADDR (l));

  /* Remember which part of the address space this object uses. */
  l->l_map_start = (ElfW(Addr)) __mmap ((void *) mappref, maplength,
                                        c->prot,
                                        MAP_COPY|MAP_FILE,
                                        fd, c->mapoff);
(By the way, MAP_COPY here is a glibc-internal alias that, on Linux, amounts to MAP_PRIVATE|MAP_DENYWRITE - exactly the flag combination we saw in the strace output.) First, let’s see if this code has changed since the previous distribution’s glibc version, which was 2.27: Nope. Alright, so it looks like it loads from the fixed address mappref, which is deduced using the ELF_PREFERRED_ADDRESS macro.
ELF_PREFERRED_ADDRESS is defined by default in dl-load.h to just take the value of the third argument (c->mapstart), and is only overridden on PowerPC platforms. Also, in our case at least, MAP_BASE_ADDR is 0 and we’re not using load biasing. All that to say, the load address is coming directly from the value in c->mapstart.
As far as I can tell, c->mapstart is set in dl-load.c by page-aligning ph->p_vaddr, where ph is the program header being read out of the library file (in ELF format) and p_vaddr is the virtual address the header says the segment should be loaded at. Let’s take a look at those program headers:
# readelf -l /lib/libc.so.6
Elf file type is DYN (Shared object file)
Entry point 0x4105c22c
There are 10 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
EXIDX 0x12c508 0x4116c508 0x4116c508 0x015c0 0x015c0 R 0x4
PHDR 0x000034 0x41040034 0x41040034 0x00140 0x00140 R 0x4
INTERP 0x12bd18 0x4116bd18 0x4116bd18 0x00019 0x00019 R 0x4
[Requesting program interpreter: /lib/ld-linux-armhf.so.3]
LOAD 0x000000 0x41040000 0x41040000 0x12dacc 0x12dacc R E 0x10000
LOAD 0x12e010 0x4117e010 0x4117e010 0x026d8 0x04b64 RW 0x10000
DYNAMIC 0x12f1c4 0x4117f1c4 0x4117f1c4 0x000f8 0x000f8 RW 0x4
NOTE 0x000174 0x41040174 0x41040174 0x00044 0x00044 R 0x4
TLS 0x12e010 0x4117e010 0x4117e010 0x00008 0x00054 R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x10
GNU_RELRO 0x12e010 0x4117e010 0x4117e010 0x01470 0x01470 R 0x1
Well there we go: The VirtAddr for all the loaded libraries is hard-coded to that weird address in the ELF file. Now, how about the old version?
# readelf -l /lib/libc.so.6
Elf file type is DYN (Shared object file)
Entry point 0x1ab19
There are 10 program headers, starting at offset 52
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
EXIDX 0x0dc540 0x000dc540 0x000dc540 0x01628 0x01628 R 0x4
PHDR 0x000034 0x00000034 0x00000034 0x00140 0x00140 R 0x4
INTERP 0x0dbcb8 0x000dbcb8 0x000dbcb8 0x00019 0x00019 R 0x4
[Requesting program interpreter: /lib/ld-linux-armhf.so.3]
LOAD 0x000000 0x00000000 0x00000000 0xddb6c 0xddb6c R E 0x10000
LOAD 0x0ddb90 0x000edb90 0x000edb90 0x026d8 0x04b64 RW 0x10000
DYNAMIC 0x0ded44 0x000eed44 0x000eed44 0x000f8 0x000f8 RW 0x4
NOTE 0x000174 0x00000174 0x00000174 0x00044 0x00044 R 0x4
TLS 0x0ddb90 0x000edb90 0x000edb90 0x00008 0x00054 R 0x4
GNU_STACK 0x000000 0x00000000 0x00000000 0x00000 0x00000 RW 0x10
GNU_RELRO 0x0ddb90 0x000edb90 0x000edb90 0x01470 0x01470 R 0x1
Here, not only are the addresses not at some weird offset, but one of them is even 0x00000000 - which would mean it’s left up to the loader at runtime to determine the address. That would explain why it ends up at a sane address that doesn’t bifurcate the virtual memory in twain.
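Incidentally, if you’d rather not eyeball readelf output, the same check is easy to script. Here’s a minimal sketch (a hypothetical helper, not part of the original investigation; it assumes a well-formed little-endian 32-bit ELF file and skips most error handling) that prints the VirtAddr of every LOAD segment:

// phdr_dump.c - print the load addresses baked into a 32-bit ELF file.
// Usage: ./phdr_dump /lib/libc.so.6
#include <elf.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <elf-file>\n", argv[0]);
        return 1;
    }
    FILE* f = fopen(argv[1], "rb");
    if (!f) { perror("fopen"); return 1; }

    Elf32_Ehdr ehdr;
    if (fread(&ehdr, sizeof ehdr, 1, f) != 1) { perror("fread"); return 1; }

    for (int i = 0; i < ehdr.e_phnum; i++) {
        Elf32_Phdr phdr;
        fseek(f, ehdr.e_phoff + (long)i * ehdr.e_phentsize, SEEK_SET);
        if (fread(&phdr, sizeof phdr, 1, f) != 1) { perror("fread"); return 1; }

        // VirtAddr == 0 means "wherever the loader likes"; anything else is a
        // baked-in preference, like the 0x41xxxxxx addresses seen above.
        if (phdr.p_type == PT_LOAD)
            printf("LOAD at VirtAddr 0x%08x (MemSiz 0x%x)\n",
                   (unsigned)phdr.p_vaddr, (unsigned)phdr.p_memsz);
    }
    fclose(f);
    return 0;
}

Run it against both versions of libc.so.6 and the difference jumps right out.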
Reload
I decided to extract the libc package generated by Yocto, and sure enough the VirtAddr column was similar to the latter case above - but in the generated root filesystem (rootfs), it had that weird offset. So the Yocto build process must be doing some shenanigans that modify the system libraries between when they’re packaged and when the rootfs is generated. I figured there can’t be very many things capable of modifying ELF files like that, so I dug around Google and the Yocto sources a bit and came across this:
This is a prelink program. Prelinking is the process of pre-computing the load addresses and link tables generated by the dynamic linker as compared to doing this at runtime. Doing this ahead of time results in performance improvements when the application is launched.[…]
Basically, Yocto runs the prelink utility (https://linux.die.net/man/8/prelink) to precalculate where to load the system libraries ahead of time instead of doing it dynamically at runtime. In theory, this should reduce application startup time, since ld-linux won’t have to figure out where to put everything on the fly. It is controlled by adding image-prelink to USER_CLASSES in local.conf. But two questions still remained:
- We had image-prelink in the previous version, too. Why was it OK then?
- Why is it specifically loading to an address near 0x40000000?
The fun thing about open source is that we can just take a gander at Prelink’s code!
It has different implementations for each architecture. In our case we’re building for ARM, so it’s arch-arm.c. We see near the bottom that the base address is hard-coded: .mmap_base = 0x41000000. Voilà, one part of the mystery solved. So why did the previous version not modify the libraries? Well, can we see the output of the prelink command when Yocto runs it? Yes (rhetorical question, ha!): It runs in the do_image task of the recipe that defines the image. And sure enough, log.do_image has a whole bunch of lines similar to this:
/home/osboxes/bsp/build/tmp/work/oct3032-octasic-linux-gnueabi/octasic-image/1.0-r0/recipe-sysroot-native/usr/sbin/prelink: /sbin/ip.iproute2: Using /lib/ld-linux-armhf.so.3, not /lib/ld-linux.so.3 as dynamic linker
I dug through the sources of prelink a bit, and this message appears when the value passed as the --dynamic-linker argument doesn’t match what’s specified in the library:
  dl = dynamic_linker ?: dso->arch->dynamic_linker;
  if (strcmp (dl, data->d_buf) != 0
      && (dynamic_linker != NULL || dso->arch->dynamic_linker_alt == NULL
          || strcmp (dso->arch->dynamic_linker_alt, data->d_buf) != 0))
    {
      error (0, 0, "%s: Using %s, not %s as dynamic linker", dso->filename,
             (char *) data->d_buf, dl);
      goto error_out;
    }
It seems like in this case we’re passing ld-linux.so.3 to it, which it doesn’t care for since it’s expecting ld-linux-armhf.so.3. Then, by following the recipes, we see that Yocto gets the --dynamic-linker argument from linuxloader.bbclass.
Specifically, we can see that the loader for ARM changed from the previous version (Rocko):
    arm )
        dynamic_loader="${base_libdir}/ld-linux.so.3"
        ;;
To the new version (Dunfell):
    elif targetarch == "arm":
        dynamic_loader = "${base_libdir}/ld-linux${@['-armhf', ''][d.getVar('TARGET_FPU') == 'soft']}.so.3"
And there you have it: In the old version it was selecting the wrong loader and failing silently (if a program screams into the error log but no one reads it, does it make a sound?), whereas in the new version the error was “fixed” - the inline expression now picks ld-linux-armhf.so.3 whenever TARGET_FPU isn’t “soft” - but that caused the load address to be pre-assigned a funky value.
Explode
Luckily the fix for this is very simple: Don’t prelink. Seriously, even some of the Yocto maintainers don’t seem to be very enthused about it. I did my own tests and confirmed: Prelinking makes no tangible difference in boot-up time, at least on our board.
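Concretely, that means dropping image-prelink from USER_CLASSES in local.conf. Something like this, assuming your configuration started from the stock poky sample (the other entries in the variable will vary from BSP to BSP):

# conf/local.conf
# Before: USER_CLASSES ?= "buildstats image-mklibs image-prelink"
USER_CLASSES ?= "buildstats image-mklibs"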
Adieu, prelink! Now we can allocate massive gobs of memory with impunity again, hurrah!