
The Network Is Slow: An Adventure in the Linux Kernel

Welcome! Today I will tell a story of how I spent almost a week and a half discovering I had to change one (fairly obvious, in retrospect) value in the Linux Kernel KConfig to fix a weird problem. This was truly a case of the journey being more interesting than the destination.

We Have a Situation Here

I was tasked with maintaining a Linux distribution for a custom development board, which is based around a quad-core ARMv7 CPU. One fine day, our best customer, Billy, came to us with a problem:

Billy: The UDP performance on this board sucks! I send it packets at 1Gbps, and half of them are lost!

Me: Well that’s no good, Billy. But maybe your application simply can’t handle that much bandwidth? UDP makes no guarantees about packets arriving, after all.

Billy: But here’s the weird part: TCP works just fine! Well over 900Mbps. We’re using iperf, by the way.

Me: Well that is strange indeed my good sir, I shall investigate!

Go Deeper

The best way to solve a problem is to at least experience it yourself, so I decided to try to reproduce the issue quickly with iperf. However, I got roughly the same performance with TCP and UDP, both well over 900Mbps. Alright, maybe the board needs to be poked at a little before the issue shows up: I tried running some programs, playing with some peripherals, etc. I reran the iperf test, and sure enough the network performance dropped to roughly 500Mbps for both TCP and UDP (with UDP reporting ~50% packet loss). What’s the opposite of “Eureka!”…“Shiteka!”?
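
For reference, the reproduction setup was just stock iperf between a host PC and the board; the exact flags below are an illustration from memory rather than Billy’s exact invocation:

  • On the board: iperf -s for the TCP test, or iperf -s -u for the UDP test
  • On the host: iperf -c <board-ip> -t 60 for TCP, and iperf -c <board-ip> -u -b 1000M -t 60 to blast UDP at ~1Gbps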

Now I tried to figure out what set off the great bandwidth apocalypse, and this is where it got really bizarre: If I let iperf run for a while after a fresh boot, bandwidth dropped by ~50% at one point and never recovered. Ditto if I just let the board sit there for a while and then ran iperf later. It seemed as if, for whatever reason, after about 3 minutes of uptime the board could only handle about 50% of the incoming bandwidth it handled initially. This gave me an idea, though:

Me: Hey William, how did you run your tests?

Billy: What do you mean? We ran the TCP test for a while, then ran the UDP test for a few hours. Also, my name is Billy.

Me: “Billy” is short for “William”

Billy: My parents named me “Billium”, but please don’t call me that.

So the TCP performance was likely fine due to the fact that they ran it right after a fresh boot. But that doesn’t even come close to answering the question of why the bandwidth drops about 3 minutes in.

I figured maybe I needed to see what was going on at a deeper level. Using the perf tool, we can get an idea of what the system is spending its time doing. Simply do:

  • perf record -a -g to start a recording of all cores with call-graph support
  • ctrl+c to end the recording after a couple of seconds

If you’re a true hacker, you can run perf report --tui -g to display an in-terminal browser of the gathered data. But I’m a sucker for nice graphical interfaces, and the Firefox Profiler is really nice for this.
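
(If you want to go the graphical route, the workflow as I remember it is to dump the samples as text with perf script, e.g. perf script > profile.perf, and then import that file into profiler.firefox.com; treat the exact invocation as a sketch.)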

I gathered two reports: one while network performance was fine, and one later while it was suffering from the ~50% reduction. I expected the two reports to show different distributions of where CPU time was spent, highlighting the culprit.

So I looked.

And looked.

And looked.

And guess what? They were almost identical. I re-gathered the data a few times and saw pretty much the same general distribution of CPU time in both cases. You gotta be kidding me.

And Deeper

I decided to painstakingly measure the actual time of some of the functions being called during network reception, figuring maybe there was an issue with memory or something. For this, I used ftrace in function-graph mode to measure the reported time for a function call. I also tried simply using the function trace mode and comparing the timestamps. No matter how much I probed, pretty much every function experienced the same behavior: After a certain point in time they all just got…slower.
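
In case you’ve never used it, the function-graph tracer is driven entirely by writes into tracefs. A minimal sketch (the mount point is /sys/kernel/tracing on newer kernels, /sys/kernel/debug/tracing on older ones, and netif_receive_skb is just an example of a function worth timing):

  cd /sys/kernel/tracing
  echo netif_receive_skb > set_graph_function
  echo function_graph > current_tracer
  cat trace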

On a lark, I tried running a simple test of CPU performance before and after, and it was roughly the same. So the system is suddenly getting slower after a certain point, but only for the network. Am I going crazy?

I decided it was time to ask for help, and a coworker had a suggestion:

Coworker: Sometimes we adjust the CPU affinity of the network stack for realtime applications, maybe try that?

Me: Seems a bit overkill for iperf, but why not.

With this suggestion, the performance became more balanced: The packet loss was higher initially, but it didn’t suffer as much after being left for a while. So much like a single Advil fighting against the effects of a massive head cold, we’ve barely alleviated the symptoms without treating the underlying illness. But maybe we’re onto something: I used htop to view the threads spawned by iperf, and noticed one thread in particular was consuming the lion’s share of CPU time. I tried forcing that thread’s affinity to a single core, and something magical happened.

The problem went away.

Like, gone…perfect network performance! Eureka! Well…Almosteka? Why do we need to pin one thread to a specific core in order for the performance to improve? Why does it drop after a certain time period in the first place?
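
(In case you’re wondering, the pinning itself was nothing fancy: with the thread ID from htop, something like taskset -p -c 2 <tid> does the job, where the core number is just a placeholder.)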

Well, it does give us a great lead: Obviously the issue has to do with tasks being migrated between cores in the Linux scheduler. We can once again use ftrace here, except this time instead of tracing function calls, we can use trace events. These are little data structures that represent interesting events within the system, giving us the time they occurred along with some other debugging information. You can list the available events using the trace-cmd list utility. sched:sched_migrate_task looks like it could be helpful.
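
Concretely, the whole flow is only a couple of commands (consider them a sketch):

  • trace-cmd list -e sched to list the scheduler-related trace events
  • trace-cmd record -e sched:sched_migrate_task to record them system-wide (ctrl+c to stop)
  • trace-cmd report to dump the recorded events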

And Deeper Still

I ran trace-cmd record around when the drop in performance seemed to occur. Sure enough, the number of sched_migrate_task events spiked rapidly right after performance began to degrade. These surplus events all looked something like this, too:

ksoftirqd-2 [002] 5520480.045750: sched_migrate_task: comm=iperf pid=351528 prio=120 orig_cpu=1 dest_cpu=0
ksoftirqd-0 [000] 5520480.045704: sched_migrate_task: comm=iperf pid=351528 prio=120 orig_cpu=0 dest_cpu=1

To understand what’s going on, we need to understand when tasks are migrated between cores in Linux’s Completely Fair Scheduler. Basically, there are three different ways to trigger a migration:

  1. Periodically, when the Kernel runs trigger_load_balance();
  2. When a CPU goes idle;
  3. When a task “wakes up”.

Using the -T argument to trace-cmd, I was able to get a stack trace for each migration event. It seemed like the issue was a combination of 1. and 3.: the periodic load balance was migrating the iperf task, which then almost immediately migrated itself back to its original CPU when it woke up. The [000] and [002] above tell us which core performed each migration: one came from the task being woken up (core 0), and the other from an unrelated core (2) doing the periodic re-balancing.

This explains our performance issue fairly well: obviously bouncing a task between cores takes time by itself, but it also means the task’s hot data is no longer in the new core’s L1 cache, which slows things down even further. It also explains why pinning the busy iperf task solved the issue, as it effectively forbade the system from performing these migrations. But I still had a few burning questions. First of all, why was it only happening after a few minutes? And more importantly: this seems like a pretty glaring issue; there are lots of multi-core ARM devices running Linux all over the world, and no one has noticed a simple network performance issue before? No, it’s most likely our fault, and I feel like we’re close now.

I thought maybe, for whatever reason, the periodic re-balance was getting into a fight with the re-balance on wake-up. I found a patch on the internet where someone had added a “cool down” parameter for tasks that were recently rebalanced, though it was never merged into the mainline Kernel. I excitedly gave it a try and it actually solved the issue…until I tried without the realtime patch set we had added. Now I got the same “migration bounce”, but this time between two “wake-ups” of the iperf task.

Like, Mariana Trench Deep

Time to go a bit deeper into the select_task_rq_fair function: Why exactly is it deciding to migrate the task and then migrate it again? Normally I’d use KGDB to single-step through a function, but that doesn’t work too well with the scheduler. So I added my own trace events using the TRACE_EVENT macro to show the flow of the function’s logic, as well as some local variables (a sketch of one such event follows the list below). Here’s what I learned:

  • The bogus migrations happen when want_affine is set to 1;
  • want_affine gets set to 1 when the jiffies counter rolls over from a very high number (like 4294967295) to 0;
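
For reference, here’s roughly what one of those hand-rolled trace events looked like. The event and field names are made up for illustration, and I’m omitting the TRACE_SYSTEM / define_trace.h header boilerplate that a real trace event needs:

TRACE_EVENT(sched_select_rq_dbg,

	TP_PROTO(int prev_cpu, int new_cpu, int want_affine, unsigned int wakee_flips),

	TP_ARGS(prev_cpu, new_cpu, want_affine, wakee_flips),

	TP_STRUCT__entry(
		__field(int, prev_cpu)
		__field(int, new_cpu)
		__field(int, want_affine)
		__field(unsigned int, wakee_flips)
	),

	TP_fast_assign(
		__entry->prev_cpu = prev_cpu;
		__entry->new_cpu = new_cpu;
		__entry->want_affine = want_affine;
		__entry->wakee_flips = wakee_flips;
	),

	TP_printk("prev_cpu=%d new_cpu=%d want_affine=%d wakee_flips=%u",
		  __entry->prev_cpu, __entry->new_cpu,
		  __entry->want_affine, __entry->wakee_flips)
);

/* Called from select_task_rq_fair() as something like:
 *	trace_sched_select_rq_dbg(prev_cpu, new_cpu, want_affine, p->wakee_flips);
 */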

Wait, what’s a “jiffy”?

Jiffies are simply a basic unit of time used all around the Kernel. The length of a jiffy is configurable, but in our case it’s 10ms (HZ=100). The jiffies counter is also 32 bits wide on this platform, which means it tops out at 4294967295 and then wraps around to 0. There are several helper macros in the Linux Kernel, such as time_after, to cleanly handle this wrap-around, so it isn’t usually a problem as long as developers remember to use them. But here’s the really interesting thing in include/linux/jiffies.h:

/*
 * Have the 32 bit jiffies value wrap 5 minutes after boot
 * so jiffies wrap bugs show up earlier.
 */
#define INITIAL_JIFFIES ((unsigned long)(unsigned int) (-300*HZ))
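
With HZ=100, that works out to (unsigned long)(-30000), i.e. 4294937296: a mere 30000 ticks (300 seconds) before the 32-bit counter wraps to 0.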

Wait…is that…? Yup. There’s our mystery of why the problem starts after a few minutes solved: It’s when the jiffies wrap around to 0. But why would this be having an effect? Let’s look at that want_affine variable again…

want_affine is controlled by the wake_wide() function (and a CPU mask). What does that function do?

static int wake_wide(struct task_struct *p)
{
	unsigned int master = current->wakee_flips;
	unsigned int slave = p->wakee_flips;
	int factor = __this_cpu_read(sd_llc_size);

	if (master < slave)
		swap(master, slave);
	if (slave < factor || master < slave * factor)
		return 0;
	return 1;
}

Pretty short! Apparently it’s a heuristic created by someone at Facebook, who discovered that on massive systems with lots of processors and memory modules, it’s faster if a task that orchestrates many “worker” tasks schedules them on any idle CPU rather than on a busy one that shares a cache with the “director” task. The wakee_flips counter tracks how often a task has recently switched between waking up different tasks, in order to detect this director/worker relationship, and it is decayed in the record_wakee function:

	if (time_after(jiffies, current->wakee_flip_decay_ts + HZ)) {
		current->wakee_flips >>= 1;
		current->wakee_flip_decay_ts = jiffies;
	}

Now wakee_flip_decay_ts is initialized to 0, but remember that our jiffies variable is initialized to a value very close to the wrap-around point. How does time_after work?

#define time_after(a,b)		\
	(typecheck(unsigned long, a) && \
	 typecheck(unsigned long, b) && \
	 ((long)((b) - (a)) < 0))

This is designed to handle wrap-around such that a very small number (e.g. 100) is considered “after” a very large one (e.g. 4294967295). Therefore, the decay only starts working correctly once the jiffies wrap around to 0 and pass current->wakee_flip_decay_ts + HZ (100). This explains why wake_wide’s behavior changed right at the wrap-around point, and I’m convinced that this is indeed a bona fide bug (current->wakee_flip_decay_ts should be initialized to INITIAL_JIFFIES, not 0). But it almost seems like the bug should be helping us at this point: we wanted the pre-jiffy-wrap behavior, yet the “correct” behavior is what happens post-jiffy-wrap. Oh wake_wide(), you vex us so.
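
To sanity-check that with concrete numbers (HZ=100 and a 32-bit unsigned long, as on this board): right after boot jiffies is roughly 4294937296, so time_after(jiffies, 0 + HZ) evaluates (long)(100 - 4294937296) = 30100, which is positive, and the decay branch is skipped. Once the jiffies have wrapped and crept past 100, say to 200, (long)(100 - 200) = -100 is negative and the decay finally starts firing once a second, as intended.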

As usual, I did a lot of research (read: Googling) hoping to get some insight into this issue, and one thing I saw mentioned was the CONFIG_SCHED_MC KConfig option. It has the following description:

Multi-core scheduler support improves the CPU scheduler’s decision making when dealing with multi-core CPU chips at a cost of slightly increased overhead in some places. If unsure say N here.

And sure enough, our Kernel config had it set to ‘N’…someone was unsure :) I tried enabling it, and glory be, the issue vanished: the spurious task migrations were gone.
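
In defconfig terms, the entire fix boiled down to one line:

  CONFIG_SCHED_MC=y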

I Ain’t Afraid of no Magma

Now we could stop there, tell Billy all about our solution and happily move the JIRA ticket to the “Helsinki” column (I wish it was called that). But inquiring minds need to know: What does this mysterious KConfig option do, exactly?

CONFIG_SCHED_MC does most of its magic in topology.c/.h. Amongst other things, it adds a multi-core scheduling domain whose flags include SD_SHARE_PKG_RESOURCES. This adjusts some parameters of the scheduler, but most importantly it tells the scheduler that all of our CPUs share a last-level cache (see sd_llc_size above). So before sd_llc_size was 1, and now it’s 4. If you look at the wake_wide function above, factor will now be 4. Interestingly enough, that’s still not enough to stop the spurious task migrations on its own (the number of flips is in the thousands), but it helps us later: Let’s look at what happens when the scheduler uses this information further down, in select_idle_sibling().

/*
 * If the previous CPU is cache affine and idle, don't be stupid:
 */
if (prev != target && cpus_share_cache(prev, target) &&
    (available_idle_cpu(prev) || sched_idle_cpu(prev)))
	return prev;

In our case, the previous CPU does go idle when iperf is migrated, so now cpus_share_cache will return 1, and it will decide to just leave the task where it is. Voilà! No more network performance issues, no modifying the Linux scheduler, and no manually setting the task affinity of iperf.

Ok, Let’s Come Up for Air

To recap, there are some interesting lessons I learned here:

  • Any “happens after 3-ish minutes” issues should immediately raise a mental red flag regarding the jiffies rollover;
  • The effects of mis-configuring CONFIG_SCHED_MC are fairly non-obvious;
  • Seriously, triple-check the KConfig;
  • ftrace is really great for figuring out what’s going on in Kernel code when you can’t afford to pause the whole kernel;
  • Facebook’s shenanigans know no bounds.

I eagerly told Billy our solution. However, this was an urgent customer issue, which means he’ll probably get back to me by Christmas.