Monday, November 4, 2013

Improving build times of large Qt apps

My colleagues and I spent time recently improving build times of a largish Qt app (Mendeley) and its associated test suite. I'm sharing some notes here in case anyone else finds them useful. Most of the steps here fall under one of a few basic ideas:
  • Measure first
  • Do more in parallel
  • Work around the inefficiencies of C++ compilation
  • Use faster tools
  • Do less disk I/O
All of these steps can improve build times on all platforms, but those that reduced the amount of I/O during builds were especially effective on Windows.

Measure first

When we started out, I expected that running the tests would be consuming most of our CI system's cycle time. In the end it turned out that the largest bottleneck was actually just building the code on Windows, which was taking 3x as long as on Linux (30 mins for a fresh build vs 10 on Linux). The unit tests also took longer to run on Windows, by a factor of 2 (20 mins total vs 10 on Linux).

Use those cores!

One of the simplest things to address is usually taking advantage of multiple cores on your system. The '-j' argument to make sets the number of parallel jobs. The optimal number will depend on a number of factors. Setting the value to the number of cores is a reasonable starting point, but check what happens with different values.

When running unit tests, use the option in the driver to run multiple tests in parallel. ctest supports a '-j' argument for this as well. An important thing to remember before enabling this is that your tests need to be set up so that they can't interfere with one another. This means not trying to use the same resources (files, settings keys, I/O ports, web service accounts etc.) at the same time. Some tests might be easier to isolate than others, in which case you can split your test suite into subsets and only run some of the subsets in parallel. ctest has a facility for assigning labels to tests using the LABELS test property.


CTest then has a set of command-line arguments ('-L' and '-LE', each taking a regular expression) that can be used to run only tests with labels matching a certain pattern, or to exclude tests with labels matching a certain pattern. This can then be used to run concurrently only a subset of tests which are known not to interfere with one another.

Working around C++ compilation inefficiency

When the compiler encounters an #include directive, it effectively copies and pastes the content of the header into the current source file. The resulting output, which the compiler then has to lex, parse and analyze, ends up being tens of thousands of lines long in the case of a typical source file in a Qt app. The more you use code-heavy headers such as the C++ standard library or Boost, the worse this gets. This is incredibly inefficient and means that much of your build time can be spent re-parsing the same source code over and over. This is compounded by the complexity of parsing C++ in the first place.

Consider this very simple list view app.  There are only 15 lines of actual code in the example but the preprocessed output, which can be produced by passing the -E flag to gcc, is just under 43,000 lines of actual code (as determined by sloccount) or just under 60,000 lines when C++11 mode is enabled (using the '-std=c++0x' flag).

In a language with a proper package/module system (eg. C#, Go or many other languages), processing an import only involves reading some metadata from the already-compiled module rather than re-parsing everything. A proper module system for C++ is in the works but is still some way off. In the meantime, there are workarounds available which can help considerably.

Precompiled headers

MSVC, GCC and Clang all have good support for precompiled headers. The use of precompiled headers is even more important now since the preprocessed output of many of the #includes from the C++ standard library grows considerably in size when C++11 is enabled. Note that under MSVC on Windows, C++11 mode is always enabled.

With the small example above, creating a precompiled header which includes just the QStringList header reduces compile times for the main .cpp file on my system from ~1.1s to ~0.7s (about 35%). This sounds modest but adds up by the time you have a project with hundreds of source files. Even in a small project with just a few dozen source files I think it is worthwhile.

The steps to enable precompiled headers will depend on the build system you are using. With qmake, this is relatively simple. CMake lacks a simple built-in command for this but there are samples online that we used as a basis.
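As a rough illustration, a precompiled header is just an ordinary header that lists large, rarely changing includes. The sketch below uses only standard library headers so that it stands alone. The file name stable.h is an assumption, and in a Qt project the commonly used Qt headers (such as QStringList from the example above) would be listed as well.

```cpp
// stable.h (hypothetical name): candidate contents for a precompiled header.
// Only list headers that are large and rarely change, because editing this
// file forces a rebuild of every source file that uses it.
#ifndef STABLE_H
#define STABLE_H

#if defined(__cplusplus)
// Standard library headers used across most of the project.
#include <map>
#include <string>
#include <vector>

// In a Qt project the commonly used Qt headers would go here too, e.g.
// #include <QStringList>
#endif

#endif // STABLE_H
```

With qmake, a header like this is enabled by naming it in the PRECOMPILED_HEADER variable and adding precompile_header to CONFIG.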

A downside of precompiled headers is that you are effectively #including an extra header in every file that you build. A file that is missing necessary #includes may therefore compile in a build with precompiled headers, because the precompiled header supplies them, but fail to build in one without. If you're running a CI system, it is therefore useful to have at least one regular build that is not using precompiled headers.

Unity builds

A unity build involves creating a single source file which #includes all the source files from a particular module or the whole project and compiling that at once. The main caveat with this approach is that variables and functions declared within an implementation .cpp file may now clash with declarations from other source files - since they are now being compiled together as a single source file instead of separately.
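The clash described above, and one way around it, can be sketched as follows. The file names and helper functions are made up for illustration. The block shows roughly what the compiler sees once two .cpp files have been pasted into one unity source file.

```cpp
// Sketch of what the compiler sees in a unity build: the bodies of two
// hypothetical files, parser.cpp and writer.cpp, in one translation unit.
// If both had declared `static int bufferSize()` (or placed it in an
// anonymous namespace), the second definition would be a redefinition error,
// because file-local names are only private to a translation unit.
// Giving each file's private helpers a distinctly named namespace avoids that.

// ---- contents of parser.cpp ----
namespace parser_detail {
int bufferSize() { return 4096; }
}
int parserBufferSize() { return parser_detail::bufferSize(); }

// ---- contents of writer.cpp ----
namespace writer_detail {
int bufferSize() { return 512; }
}
int writerBufferSize() { return writer_detail::bufferSize(); }
```

The trade-off is that helpers in named namespaces have external linkage when the files are compiled separately, so the namespace names need to be unique across the project.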

More efficient build tools

Part of the reason for a gradual creep in build times as a project grows is scaling issues with build tools. The amount of time taken for a do-nothing build (ie. running 'make' when everything is up to date) grows noticeably with cmake + make as the total number of targets to build increases. Fortunately for us, engineers on Google Chrome ran into this problem harder and long before we did so they have produced some helpful replacements for the standard tools:
  • The Ninja build system is designed to be faster, especially for incremental builds where little changed. Recent versions of CMake have built-in support for generating Ninja build files (use 'cmake -G Ninja' to generate Ninja build files). The difference in build speed for incremental builds where little changed is decent on Mac and Linux but very noticeable on Windows compared to nmake. Prior to Ninja, Qt developers also created jom as a faster alternative to make.
  • On Linux, the Gold linker is faster than the traditional ld linker and can often be used as a drop-in replacement.

Reducing total disk I/O

Disk I/O is very slow, so reducing the total amount of I/O (especially random I/O) required during a build can improve overall build times substantially. Anecdotally, this is especially true on Windows, where reducing the total amount of I/O performed during a clean build had the largest impact in terms of achieving parity between build + test times on Windows and those on Linux and Mac.

Use faster hardware

It always feels a little dirty to solve software inefficiency by throwing faster hardware at the problem but if you can afford it, it can be a quick win.
  • Adding more memory will reduce the likelihood of the build system swapping.
  • A good SSD will speed up disk I/O, especially for operations which do a lot of random I/O.
  • If you have a lot of memory spare you can create a RAMDisk and do the build on that.
I haven't compared the impact of an SSD vs. a standard IDE drive myself. This advice comes mostly from the Chromium developers' build notes.

Reducing debug info size

In debug builds, a large proportion of the total size of data read/written from disk is typically debug information. When doing local development, this information is usually useful. When generating builds on a continuous integration system that will purely be used for automated tests, this is less so.
  • All compilers (MSVC, gcc, clang) have switches to control the amount of debug info that is generated. With gcc/clang these are controlled by the -g family of switches (for example -g1 for minimal debug info, or -g0 for none).

Generating fewer binaries for tests

For every binary that is generated as part of a project, there are a number of overheads:
  • Each binary will add a number of additional targets to the build system
  • Each binary requires a linking step - which can be memory and I/O intensive.
  • Each binary generated requires reading/writing additional data to disk. The cost of this depends on how large the generated binary is and how many files need to be processed to assemble the final binary.
In our case, we are using the QTestLib framework for unit tests, which by default encourages the creation of one test class per original class. Each test class is then compiled into a separate binary with a QTEST_MAIN($TEST_CLASS_NAME) macro providing the entry point for the test app. This works fine for smaller apps. When a project grows larger however and you have hundreds of test classes, the overhead of linking all of those binaries can add noticeably to the total build time.

We changed the test builds to produce one test binary per source directory instead of one per test class. This was done by replacing the QTEST_MAIN() macro with a substitute which instead declares a '$TESTCLASS_main()' function and registers it in a global map from test class to init function on startup. All of the test classes are then compiled and linked together with a small stub library which declares the 'int main()' function. This function reads the name of the test to run from the command-line and calls the corresponding '$TESTCLASS_main()' function, forwarding the other command-line arguments to it. This allows multiple Qt test cases to be linked into a single binary which improves build times in several ways:
  • The number of linking operations during builds was considerably reduced.
  • The total amount of binary data generated on disk was reduced as code that was previously statically linked into the test binary for each test class is now only linked into a single test binary for each group of tests.
  • The total number of make steps and targets for the whole project was reduced.
On Windows this change shaved 30% off our total build time and the impact on build times of adding a new test case is now greatly reduced.
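The registration scheme described above can be sketched roughly as follows. This is not our actual code. The names REGISTER_TEST_MAIN, TestRegistrar, testRegistry and runNamedTest are invented for illustration, and the toy test classes avoid Qt so that the sketch stands alone.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Signature shared by every per-class entry point.
typedef int (*TestMainFn)(int argc, char** argv);

// Global map from test class name to its entry point. A function-local
// static avoids depending on the initialization order of globals.
std::map<std::string, TestMainFn>& testRegistry()
{
    static std::map<std::string, TestMainFn> registry;
    return registry;
}

// Registers an entry point from a static object's constructor, i.e. on startup.
struct TestRegistrar
{
    TestRegistrar(const char* name, TestMainFn fn) { testRegistry()[name] = fn; }
};

// Hypothetical stand-in for QTEST_MAIN(): declares TestClass_main() and
// registers it. With QTestLib the body would create a TestClass instance and
// pass it to QTest::qExec(); here it calls a static run() to stay Qt-free.
#define REGISTER_TEST_MAIN(TestClass)                                       \
    int TestClass##_main(int argc, char** argv)                             \
    {                                                                       \
        return TestClass::run(argc, argv);                                  \
    }                                                                       \
    static TestRegistrar TestClass##_registrar(#TestClass, TestClass##_main);

// Two toy "test classes". WriterTest returns the number of forwarded
// options so that the argument forwarding is visible in the result.
struct ParserTest { static int run(int, char**) { return 0; } };
struct WriterTest { static int run(int argc, char**) { return argc - 1; } };

REGISTER_TEST_MAIN(ParserTest)
REGISTER_TEST_MAIN(WriterTest)

// What the stub's main() would do: argv[1] names the test class, and the
// remaining arguments are forwarded to that class's entry point.
int runNamedTest(int argc, char** argv)
{
    if (argc < 2) {
        std::fprintf(stderr, "usage: %s <TestClass> [args...]\n", argv[0]);
        return 2;
    }
    std::map<std::string, TestMainFn>::const_iterator it =
        testRegistry().find(argv[1]);
    if (it == testRegistry().end()) {
        std::fprintf(stderr, "unknown test class: %s\n", argv[1]);
        return 2;
    }
    // Shift the arguments so the test sees its own name as argv[0].
    return it->second(argc - 1, argv + 1);
}
```

In the stub library, main() would simply return runNamedTest(argc, argv), so each test class can still be run on its own by passing its name as the first argument.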

Generating smaller binaries

Another way to reduce the size of compiled binaries is to build each module of the app into a shared rather than static library. This is sometimes referred to as a 'component build'. When there are many executables being generated from the same source code, this reduces the amount of work for the linker and the amount of I/O, because the shared code and associated debug info are generated only once when building the shared library/DLL, instead of being linked separately into each binary.

Note that by doing this you are deferring some of the linking work from build time to runtime and consequently startup will slow down as the number of dynamically loaded libraries increases.

Further reading

I hope these notes are useful - please let me know in the comments if you have other recommendations. In the meantime, here are a few notes from existing projects which I found useful background reading:

  • Notes on accelerating Chromium builds on Windows, Linux and Mac - this doesn't involve Qt but the advice is still quite relevant.
  • Notes on improving Firefox's build system.
  • An explanation of how a language designed with build performance in mind differs from C++
