My blog has moved!
Read it in the new location here.
It turns out that the overhead of dynamic linking on Linux amd64 is 2 CPU
cycles per cross-module call. I usually take forever to get to the point in my
writing, so I thought I would change this for once :-)
In MySQL, there has been a historical tendency to favour static linking, in
part because to avoid the overhead (in execution efficiency) associated with
dynamic linking. However, on modern systems there are also very serious
drawbacks when using static linking.
The particular issue that inspired this article is that I was working
on
MWL#74,
building a proper shared libmysqld.so library for the MariaDB
embedded server. The lack of a proper libmysqld.so in MySQL and
MariaDB has caused no end of grief for
packaging
Amarok for the various Linux
distributions. My patch increases the amount of dynamic linking (in a default
build), so I did a quick test to get an idea of the overhead of this.
ELF dynamic linking
The overhead comes from the way dynamic linking works in ELF-based systems
like Linux (and many other POSIX-like operating systems). Code in shared
libraries must be compiled to be position-independent, achieved with
the -fPIC compiler switch. This allows the loader to
simply mmap() the image of a shared library into the process
address space at whatever free memory space is available, and the code can run
without any need for the loader to do any kind of relocations of the code. For
a much more detailed explanation see for
example
this
article.
When generating position-independent code for a function call into another
shared object, the compiler cannot generate a simple
absolute call instruction, as the destination address is not
known until run-time. Instead, the call goes via an indirect jump is
generated, fetching the destination address from a table called the PLT, short
for Procedure Linkage Table. For example:
callq 0x400680 )
...
: jmpq *0x200582(%rip)
The indirect jump resolves at runtime into the address of the real function to
be called, so that is the overhead of the call when using dynamic linking: one
indirect jump instruction.
Micro-benchmarking
To measure this one-instruction overhead in terms of execution time, I used
the following code:
for (i= 0; i < count; i++)
v= mylib_myfunc(v);
The function mylib_myfunc() is placed in a library, with the
following code:
int mylib_myfunc(int v) {return v+1;}
I tested this with both static and dynamic linking on a Core 2 Duo 2.4 GHz
machine running Linux amd64. Here are the results from running the loop for
1,000,000,000 (one billion) operations:
total time (sec.)CPU cycles/iteration
Static linking2.546
Dynamic linking3.388
So that is the two CPU cycles of overhead per call that I referred to at the
start of this post.
Incidentally, if you try stepping through the call with a debugger, you will
see a much larger overhead for the very first call. Do not be fooled by this,
this is just because the loader fills in the PLT lazily, computing the correct
address of the destination only on the first time the call is made (so
addresses of functions that are never called by a process need never be
calculated). See above-referenced article for more details.
(Note that this is for 64-bit amd64. For 32-bit x86, the mechanism is similar,
but the actual overhead may be somewhat larger, since that architecture lacks
program-counter-relative addressing and so must reserve one
register %ebx (out of its already quite limited register bank)
for this purpose. I did not measure the 32-bit case, I think it is of little
interest nowadays for high-performance MySQL or MariaDB deployments (and the
overhead of function calls on x86 32-bit is significantly higher anyway,
dynamic linking or not, due to the need to push and pop all arguments to/from
the stack)).
Conclusion
Two cycles per call is, in my opinion, a very modest overhead. It is hard to
imagine high-performance code where this will have a real-life noticeable
effect. Modern systems rely heavily on dynamic linking, and static linking is
nowadays causing much more problems that it solves. And I think it is also
time to put the efficiency argument for static linking to rest.