
How new-lines affect Linux performance

The Linux kernel strives to be fast and efficient. As it is written mostly in C, it has great control over how the generated machine code looks. Nevertheless, as the kernel code is compiled into machine code, the compiler optimizes the generated code to improve its performance. The kernel code, however, employs uncommon coding techniques, which can defeat these code optimizations. In this blog-post, I will share my experience in analyzing poor inlining of the kernel code.


New lines in inline assembly

As I was working on the kernel code, I encountered a strange phenomenon: minor changes I made to the Linux source code caused small but noticeable performance degradation. As I expected my change to actually improve performance, I decided to disassemble the functions I had changed. To my surprise, I realized that my change caused functions that were previously inlined to no longer be inlined.

I decided to probe this issue further and to check whether it affects other parts of the kernel. Arguably, it is rather hard to say whether a given function should be inlined, so I was looking for some sort of indication of bad inlining decisions. In C, functions that are declared with the inline keyword are not guaranteed to be inlined; the compiler considers the keyword merely a hint. One way to find indications of questionable inlining decisions is to search for functions that the programmer thought should be inlined, and declared using the inline keyword, but which were eventually not inlined. Decisions not to inline these functions, however, are not necessarily wrong.

As I was slightly too lazy to write a script that does so, I decided to look for inline-hinted functions that are used in multiple translation units (different objects) yet are not inlined. In such cases, the Linux kernel, which is currently not compiled using link-time optimizations (LTO), includes multiple instances of the same (uninlined) function. This is usually wasteful, and in such cases it is usually better not to mark the function as inline and instead to export it, so only a single instance of the function is included in the binary.

nm --print-size ./vmlinux | grep ' t ' | cut -d' ' -f2- | sort | uniq -c | grep -v '^      1' | sort -n -r | head -n 5

Instances Size           Function Name
     36     0000000000000019 t copy_overflow
      8     000000000000012f t jhash
      8     000000000000000d t arch_local_save_flags
      7     0000000000000017 t dst_output
      6     000000000000004e t put_page

jhash() is a big function (303 bytes), so it is reasonable for it not to be inlined. dst_output()'s address is used as a function pointer, which causes it not to be inlined. Yet the other functions seem to be great candidates for inlining, and it is not clear why they are not inlined. Let's look at the source code of copy_overflow(), which has many instances in the binary:

static inline void copy_overflow(int size, unsigned long count)
{
        WARN(1, "Buffer overflow detected (%d < %lu)!\n", size, count);
}

Will the disassembly tell us anything?

0xffffffff819315e0 <+0>: push   %rbp
0xffffffff819315e1 <+1>: mov    %rsi,%rdx
0xffffffff819315e4 <+4>: mov    %edi,%esi
0xffffffff819315e6 <+6>: mov    $0xffffffff820bc4b8,%rdi
0xffffffff819315ed <+13>: mov    %rsp,%rbp
0xffffffff819315f0 <+16>: callq  0xffffffff81089b70 <__warn_printk>
0xffffffff819315f5 <+21>: ud2    
0xffffffff819315f7 <+23>: pop    %rbp
0xffffffff819315f8 <+24>: retq   

Apparently not. Notice that out of the 9 assembly instructions shown above, 6 only deal with the function entry, exit and argument shuffling that a standalone call requires - for example, updating the frame pointer - and only 3 of them - loading the message address, calling __warn_printk() and the ud2 trap - are really needed.

To understand the problem, we must dig deeper and look at the warning mechanism in Linux. On x86, this mechanism shares its infrastructure with the bug reporting mechanism. When a bug or a warning is triggered, the kernel prints the filename and the line number in the source code that triggered it, which can then be used to analyze the root cause. A naive implementation, however, would pollute the code cache with this information, as well as with the call to the function that prints the error message, consequently causing performance degradation.

Linux therefore uses a different scheme: it emits an exception-triggering instruction (ud2) and saves the warning information in a bug table that resides in a different section of the executable. Once a warning is triggered using the WARN() macro, an exception occurs and the exception handler looks up the warning information - the source-code filename and line - in the table.

Inline assembly is used to save this information in _BUG_FLAGS(). Here is its code after some simplifications to ease readability:

asm volatile("1: ud2\n"
             ".pushsection __bug_table,\"aw\"\n"
             "2: .long 1b - 2b\n"  /* bug_entry::bug_addr */
             "   .long %c0 - 2b\n" /* bug_entry::file */
             "   .word %c1\n"      /* bug_entry::line */
             "   .word %c2\n"      /* bug_entry::flags */
             "   .org 2b+%c3\n"
             ".popsection"
                 : : "i" (__FILE__), "i" (__LINE__),
                     "i" (flags),
                     "i" (sizeof(struct bug_entry)));

Ignoring the assembly shenanigans that this code uses, we can see that in practice it generates a single ud2 instruction. However, the compiler considers this code to be "big" and consequently oftentimes does not inline functions that use WARN() or similar functions.

The reason turns out to be the newline characters (marked as '\n' above). The compiler, GCC, is unaware of the code size that will be generated by the inline assembly, so it tries to estimate that size instead. In GCC, we can see the code that performs this estimation in the estimate_num_insns() function:
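/* Estimate number of instructions that will be created by expanding STMT.
   WEIGHTS contains weights attributed to various constructs.  */

int
estimate_num_insns (gimple *stmt, eni_weights *weights)
{
...
    case GIMPLE_ASM:
      {
        int count = asm_str_count (gimple_asm_string (as_a <gasm *> (stmt)));
        /* 1000 means infinity. This avoids overflows later
           with very long asm statements.  */
        if (count > 1000)
          count = 1000;
        return count;
      }
...
}

And indeed, asm_str_count(), as its name indicates, estimates the number of generated assembly instructions by counting new-lines and logical line separators (semicolons in x86 assembly); here it is, slightly abridged:

/* Return the number of machine instructions likely to be generated for the
   inline-asm template. */
int
asm_str_count (const char *templ)
{
  int count = 1;

  if (!*templ)
    return 0;

  for (; *templ; templ++)
    if (IS_ASM_LOGICAL_LINE_SEPARATOR (*templ, templ)
        || *templ == '\n')
      count++;

  return count;
}

As a result, new-lines that are not part of instructions "confuse" the compiler into considering the assembly block to be much bigger than it really is.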

Note that this pattern of saving data using inline assembly is not limited to bugs and warnings. The kernel uses it for many additional purposes: exception tables, which gracefully handle exceptions that are triggered inside the kernel; alternative-instruction tables, which tailor the kernel at boot-time to the specific CPU architecture extensions that are supported; annotations that are used for stack metadata validation; and so on.
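For instance, a hypothetical load-with-fixup macro built on the kernel's exception tables could look roughly like this (a simplified sketch; the macro name is made up and the real _ASM_EXTABLE macros record additional information):

#define LOAD_WITH_FIXUP(val, ptr)                                   \
        asm volatile("1: mov (%[addr]), %[out]\n"                   \
                     "2:\n"                                         \
                     ".pushsection __ex_table, \"a\"\n"             \
                     " .balign 4\n"                                 \
                     " .long 1b - .\n" /* faulting instruction */   \
                     " .long 2b - .\n" /* fixup address */          \
                     ".popsection"                                  \
                     : [out] "=r" (val)                             \
                     : [addr] "r" (ptr))

If the load at label 1 faults, the exception handler finds the entry in __ex_table and resumes execution at label 2 instead of crashing the kernel. As with the bug table, the inline assembly spans several lines even though only a single load instruction ends up in the instruction stream.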

Before we get to solving this problem, a question needs to be raised: is the current behavior flawed at all? Eventually, the size of the kernel will increase if functions that use WARN(), for example, are inlined. This increase can cause the kernel image to be bigger, and since the Linux kernel cannot be paged out, it will also increase memory consumption. However, the main reason that the compiler strives to avoid inflating the code size is to avoid pressure on the instruction cache, whose impact may offset the benefits of inlining. Moreover, the heuristics of other compiler optimizations (e.g., loop optimizations) depend on the size of the code.

Solving the problem is not trivial. Ideally, GCC would use an integrated assembler, similarly to LLVM, which would give a better estimation of the generated code size of inline assembly. Experimentally, LLVM seems to make the right inlining decisions and is not affected by new-lines or by data that is placed in other sections of the executable. GCC, however, runs the GNU assembler after the code is compiled, which prevents it from getting a correct estimation of the code size.

Alternatively, the problem could have been solved by overriding GCC's code size estimation through a directive or a built-in function. However, looking at GCC code does not reveal a direct or indirect way to achieve this goal.

One may think that using the always_inline function attribute to force functions to be inlined would solve the problem. There are several drawbacks to such a solution. First, it is hard to add and maintain these annotations. Second, this solution does not address other code optimizations that rely on code-size estimation. Third, the kernel uses various configurations and supports multiple CPU architectures, which may require a certain function to be inlined in some setups and not in others. Finally, and most importantly, using always_inline can just push the problem upwards, as we will see later.

Therefore, a more systematic solution is needed. The solution comes in the form of assembly macros that hold the long assembly code, so that the inline assembly consists of a single line that invokes the macro. This solution does not only improve the generated machine code, but also makes the assembly code more readable, as it avoids various quirks that inline assembly requires. Moreover, in certain cases this change allows consolidating the currently separate implementations used in C and assembly code, which eases maintenance.
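For illustration, here is a minimal sketch of the idea applied to _BUG_FLAGS(). The assembler macro name and its arguments are made up, and in practice the macro definitions would live in a dedicated assembly file that is visible to every translation unit rather than in a file-scope asm statement:

asm(".macro warn_entry file:req line:req flags:req entry_size:req\n"
    "1: ud2\n"
    ".pushsection __bug_table, \"aw\"\n"
    "2: .long 1b - 2b\n"          /* bug_entry::bug_addr */
    "   .long \\file - 2b\n"      /* bug_entry::file */
    "   .word \\line\n"           /* bug_entry::line */
    "   .word \\flags\n"          /* bug_entry::flags */
    "   .org 2b + \\entry_size\n"
    ".popsection\n"
    ".endm");

#define _BUG_FLAGS(flags)                                            \
        asm volatile("warn_entry %c0, %c1, %c2, %c3"                 \
                     : : "i" (__FILE__), "i" (__LINE__),             \
                         "i" (flags),                                \
                         "i" (sizeof(struct bug_entry)))

From GCC's point of view the inline assembly now consists of a single line, so its size estimate matches the single ud2 instruction that is actually emitted, and functions that use WARN() become reasonable inlining candidates again.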

After addressing these issues, copy_overflow() and other functions disappear from the list of commonly uninlined inline functions.

 Instances Size           Function Name
      9     000000000000012f t jhash
      8     0000000000000011 t kzalloc
      7     0000000000000017 t dst_output
      5     000000000000002f t acpi_os_allocate_zeroed
      5     0000000000000029 t acpi_os_allocate

However, we got some new ones.  

Constant computations and inlining

As shown, kzalloc() is not always inlined, although its code is very simple. 

static inline void *kzalloc(size_t size, gfp_t flags)
{
        return kmalloc(size, flags | __GFP_ZERO);
}

The assembly, again, does not provide any answers as to why it is not inlined.

0xffffffff817929e0 <+0>: push   %rbp
0xffffffff817929e1 <+1>: mov    $0x14080c0,%esi
0xffffffff817929e6 <+6>: mov    %rsp,%rbp
0xffffffff817929e9 <+9>: callq  0xffffffff8125d590 <__kmalloc>
0xffffffff817929ee <+14>: pop    %rbp
0xffffffff817929ef <+15>: retq   

This inlining decision seems wrong in two respects: out of the 6 assembly instructions of kzalloc(), 4 only deal with the function entry and return flows, so keeping it out of line both bloats the code size and increases execution time. The answer to our question lies in kmalloc(), which is called by kzalloc() and is considered by GCC's heuristics to consist of many instructions. kmalloc() is inlined into kzalloc(), since it is marked with the always_inline attribute, but its estimated instruction count is then attributed to the calling function, kzalloc() in this case. This result exemplifies why the use of the always_inline attribute is not a sufficient solution to the code inlining problem.

Still, it is not clear why GCC estimates that kmalloc() would be compiled into many instructions. As shown, it is compiled into a single call to __kmalloc(). To answer this question, we need to follow kmalloc()'s code, which eventually uses the ilog2() macro to compute the base-2 logarithm of an integer, in order to compute the page allocation order.

Here is a shortened version of ilog2():

#define ilog2(n)                                \
(                                               \
        __builtin_constant_p(n) ? (             \
        /* Optimized version for constants */ \
                (n) < 2 ? 0 :                   \
                (n) & (1ULL << 63) ? 63 :       \
                (n) & (1ULL << 62) ? 62 :       \
                ...
                (n) & (1ULL <<  3) ?  3 :       \
                (n) & (1ULL <<  2) ?  2 :       \
                1 ) :                           \
        /* Another version for non-constants */ \
        (sizeof(n) <= 4) ?                      \
        __ilog2_u32(n) :                        \
        __ilog2_u64(n)                          \
)

As shown, the macro first uses the built-in function __builtin_constant_p() to determine whether n is known to be a constant at compilation time. If n is known to be constant, a long series of conditions is evaluated to compute the result at compilation time, which allows further optimizations. Otherwise, if n is not known to be constant, a short piece of code is emitted to compute the result at runtime. Yet, regardless of whether n is constant or not, all of the conditions in the ilog2() macro are evaluated at compilation time and do not translate into any machine-code instructions.
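For example, here is a small sketch that assumes the kernel's ilog2() definition above:

static int const_order(void)
{
        return ilog2(4096);     /* folded at compile time to 12 */
}

static int runtime_order(unsigned long n)
{
        return ilog2(n);        /* n is not constant: the runtime __ilog2_u64() path is used */
}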

However, although the generated code is efficient, it causes GCC, again, to incorrectly estimate the number of instructions that ilog2() generates. Apparently, the number of instructions is estimated before inlining decisions take place, and at this stage the compiler usually does not yet know whether n is constant. Later, after inlining decisions are performed, GCC cannot update the instruction count estimation accordingly.

This inlining problem is not as common as the previous one, yet it is not rare. Bit operations (e.g., test_bit()) and bitmaps commonly use __builtin_constant_p() in the described manner. As a result, functions that use these facilities, for example cpumask_weight(), are not inlined.
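For reference, the x86 test_bit() macro follows roughly this shape (simplified; constant_test_bit() and variable_test_bit() are the kernel's helpers for the two cases):

#define test_bit(nr, addr)                              \
        (__builtin_constant_p((nr))                     \
         ? constant_test_bit((nr), (addr))              \
         : variable_test_bit((nr), (addr)))

Only one of the two helpers survives in the generated code, but GCC's size estimate, taken before inlining, accounts for both.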

A possible solution to this problem is to use the built-in __builtin_choose_expr() to test __builtin_constant_p() instead of using C conditional operators (?:):

#define ilog2(n) \
(                                               \
        __builtin_choose_expr(__builtin_constant_p(n), \
                ((n) < 2 ? 0 :                  \
                (n) & (1ULL << 63) ? 63 :       \
                (n) & (1ULL << 62) ? 62 :       \
                ...
                (n) & (1ULL <<  3) ?  3 :       \
                (n) & (1ULL <<  2) ?  2 :       \
                1),                             \
        (sizeof(n) <= 4) ?                      \
        __ilog2_u32(n) :                        \
        __ilog2_u64(n))                         \
)


This built-in is evaluated earlier in the compilation process, before inlining decisions are made. Yet, there is a catch: as this built-in is evaluated earlier, GCC is only able to determine that an argument is constant for constant expressions, which can cause less efficient code to be generated. For instance, if a constant is passed through a function argument, GCC will not be able to determine that it is constant. In the following case, for example, the non-constant version will be used:

int bar(int n)
{
        return ilog2(n);
}

int foo(int n)
{
        return bar(n);
}

v = foo(bar(5)); /* will use the non-constant version */


Despite this limitation, this approach can be suitable in certain cases, for example for test_bit(), in which the bit-number argument is usually either a literal constant or a truly variable value, and is rarely a constant that is passed through a calling function. Compiling with LLVM reveals, again, that LLVM's inlining decisions are not negatively affected by the use of __builtin_constant_p().


Function attributes 

Finally, there are certain function attributes that affect inlining decisions. Using annotations to set an optimization level for a certain function can prevent the compiler from inlining the function, or the functions that it calls. The Linux kernel rarely uses such attributes, but one of their uses is in the KVM function vmx_vcpu_run(), which is a hot function that launches or resumes the virtual machine. The optimization attribute on this function is actually there only to prevent cloning of the function. Its side effect, however, is that all the functions it uses are not inlined into it, including, for example, to_vmx():

   0x0000000000000150 <+0>: push   %rbp
   0x0000000000000151 <+1>: mov    %rdi,%rax
   0x0000000000000154 <+4>: mov    %rsp,%rbp
   0x0000000000000157 <+7>: pop    %rbp
   0x0000000000000158 <+8>: retq   

Yes: this function just returns the same argument it got as input. Not inlining to_vmx() and the other helpers that vmx_vcpu_run() calls induces significant overhead, which can be as high as 10% for a VM-exit.
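As a purely hypothetical illustration (none of these names come from the kernel), a per-function optimization attribute can have this side effect even on a trivial callee:

static inline int twice(int x)
{
        return 2 * x;
}

/* The optimize attribute gives this function its own optimization
   options (here, disabling function cloning), and as a side effect
   GCC may keep twice() out of line instead of inlining it. */
static int __attribute__((optimize("no-ipa-cp-clone"))) hot_path(int x)
{
        return twice(x) + 1;
}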

The cold attribute informs the compiler that a function is unlikely to be executed, and the compiler, among other things, optimizes such functions for size rather than speed, which can result in very non-aggressive inlining decisions. All the __init and __exit functions, which are used during the (de)initialization of the kernel and its modules, are marked as cold.
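Roughly speaking, the __init marker combines a section placement with the cold attribute (a simplified sketch; the kernel's actual definition in include/linux/init.h carries additional attributes, and my_driver_init() is a made-up example):

#define __init __attribute__((__section__(".init.text"), __cold__))

static int __init my_driver_init(void)
{
        return 0;
}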






Thanks to Linus Torvalds, Hans Peter Anvin, Masahiro Yamada, Peter Zijlstra and others for their assistance in the analysis and in solving this problem.
