Thursday, November 21, 2013

Understanding the QWidget layout flow

When layouts in a UI are not behaving as expected or performance is poor, it helps to have a mental model of the layout process in order to know where to start debugging. For web browsers there are some good resources which describe the process at different levels of detail. The layout documentation for Qt describes the various layout facilities that are available, but I haven't found a detailed description of the flow, so this is my attempt to explain what happens when a layout is triggered, ending with the widgets being resized and repositioned appropriately.

  1. A widget's contents are modified in some way that requires a layout update; a short code sketch after this list shows a widget triggering this flow. Such changes can include:
    • Changes to the content of the widget (e.g. the text in a label changing, or the content margins being altered)
    • Changes to the sizePolicy() of the widget
    • Changes to the layout() of the widget, such as new child widgets being added or removed
  2. The widget calls QWidget::updateGeometry() which then performs several steps to trigger a layout:
    1. It invalidates any cached size information for the QWidgetItem associated with the widget in the parent layout.
    2. It recursively climbs up the widget tree (first to the parent widget, then the grandparent and so on), invalidating each widget's layout. The process stops when it reaches a widget that is a top-level window or doesn't have its own layout - we'll call this widget the top-level widget, though it might not actually be a window.
    3. If the top-level widget is not yet visible, then the process stops and layout is deferred until the widget is due to be shown.
    4. If the top-level widget is shown, a LayoutRequest event is posted asynchronously to the top-level widget, so a layout will be performed on the next pass through the event loop.
    5. If multiple layout requests are posted to the same top-level widget during a pass through the event loop, they will get compressed into a single layout request. This is similar to the way that multiple QWidget::update() requests are compressed into a single paint event.
  3. The top-level widget receives the LayoutRequest event on the next pass through the event loop. This can then be handled in one of two ways:
    1. If the widget has a layout, the layout will intercept the LayoutRequest event using an event filter and handle it by calling QLayout::activate().
    2. If the widget does not have a layout, it may handle the LayoutRequest event itself and manually set the geometry of its children.
  4. When the layout is activated, it first sets the fixed, minimum and/or maximum size constraints of the widget, depending on QLayout::sizeConstraint(), using the values calculated by QLayout::minimumSize(), maximumSize() and sizeHint(). These functions recurse down the layout tree to determine the constraints for each item and produce a final size constraint for the whole layout. This may or may not alter the current size of the widget.
  5. The layout is then asked to resize its contents to fit the current size of the widget using QLayout::setGeometry(widget->size()). The specific implementation of the layout - whether it is a box layout, grid layout or something else - then lays out its child items to fit this new size.
  6. For each item in the layout, the QLayout::setGeometry() implementation will typically ask the item for its various size parameters (minimum size, maximum size, size hint, height for width) and then decide upon a final size and position for the item. It then invokes QLayoutItem::setGeometry() to apply that position and size to the item.
  7. If the layout item is itself a layout or a widget, steps 5-6 proceed recursively down the tree, updating all of the items whose constraints have been modified.
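
As a concrete illustration, here is a minimal sketch of a widget that triggers this flow when its content changes (the class and its details are illustrative, not code from Qt):

// Sketch: a custom widget whose preferred size depends on its content.
// Changing the content calls updateGeometry(), which starts the layout
// flow described above.
#include <QPainter>
#include <QString>
#include <QWidget>

class TagWidget : public QWidget
{
public:
    void setText(const QString& text) {
        m_text = text;
        updateGeometry(); // invalidate cached size info, trigger a LayoutRequest
        update();         // schedule a repaint as well
    }

    // Queried (and cached in the parent layout's QWidgetItem) during layout
    QSize sizeHint() const {
        return fontMetrics().size(0, m_text) + QSize(12, 6);
    }

protected:
    void paintEvent(QPaintEvent*) {
        QPainter painter(this);
        painter.drawText(rect(), Qt::AlignCenter, m_text);
    }

private:
    QString m_text;
};
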
A layout update is an expensive operation, so there are a number of steps taken to avoid unnecessary re-layouts:
  • Multiple layout update requests submitted in a single pass through the event loop are coalesced into a single update
  • Layout updates for widgets that are not visible and layouts that are not enabled are deferred until the widget is shown or the layout is re-enabled
  • The QLayoutItem::setGeometry() implementations will typically check whether the new geometry differs from the current one, or whether the item has been invalidated, before performing an update. This prunes the parts of the widget tree which have not been altered out of the layout process.
  • The QWidgetItem associated with a widget in a layout caches information which is expensive to calculate, such as sizeHint(). This cached data is then returned until the widget invalidates it using QWidget::updateGeometry().

Given this flow, there are a few things to bear in mind to avoid unexpected behaviour:
  • Qt provides multiple ways to set constraints such as fixed and minimum sizes:
    • Using QWidget::setFixedSize(), setMinimumSize() or setMaximumSize(). This is simple and available whether you control the widget or not.
    • Implementing the sizeHint() and minimumSizeHint() functions and using QWidget::setSizePolicy() to determine how these hints are handled by the layouts. If you control the widget, it is almost always preferable to use sizePolicy() together with the layout hints; see the sketch after this list.
  • The layout management documentation suggests that handling LayoutRequest events in QWidget::event() is an alternative to implementing a custom layout. A potential problem with this is that LayoutRequest events are delivered asynchronously, on the next pass through the event loop. If your widget updates its own geometry in response to the LayoutRequest event, several passes through the event loop may occur before the layout process fully settles. Since the event loop may also process a paint event on each pass, each intermediate stage can flicker briefly on screen, which looks poor. So if you need a custom layout, subclassing QLayout/QLayoutItem is the recommended approach unless you're sure that your widget will always be used as a top-level widget.
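
For example, a widget you control can express its constraints through hints and a size policy rather than hard-coded sizes (a sketch with an illustrative class, not code from Qt):

// Sketch: expressing size constraints from inside the widget. The parent
// layout queries sizeHint()/minimumSizeHint() and interprets them
// according to the widget's QSizePolicy.
#include <QSizePolicy>
#include <QWidget>

class ThumbnailStrip : public QWidget
{
public:
    explicit ThumbnailStrip(QWidget* parent = 0)
        : QWidget(parent)
    {
        // Use any extra horizontal space, but keep the height fixed at the hint
        setSizePolicy(QSizePolicy::Expanding, QSizePolicy::Fixed);
    }

    QSize sizeHint() const { return QSize(400, 64); }
    QSize minimumSizeHint() const { return QSize(100, 64); }
};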

Monday, November 4, 2013

Improving build times of large Qt apps

My colleagues and I recently spent some time improving the build times of a largish Qt app (Mendeley) and its associated test suite. I'm sharing some notes here in case anyone else finds them useful. Most of the steps fall under one of a few basic ideas:
  • Measure first
  • Do more in parallel
  • Work around the inefficiencies of C++ compilation
  • Use faster tools
  • Do less disk I/O
All of these steps can improve build times on all platforms, but those that reduced the amount of I/O during builds were especially effective on Windows.

Measure first


When we started out, I expected that running the tests would be consuming most of our CI system's cycle time. In the end it turned out that the largest bottleneck was actually just building the code on Windows, which took 3x as long as on Linux (30 minutes for a fresh build vs 10 on Linux). The unit tests also took longer to run on Windows, by a factor of 2 (20 minutes total vs 10 on Linux).

Use those cores!


One of the simplest things to address is taking advantage of the multiple cores on your system. The '-j' argument to make sets the number of parallel jobs. The optimal number will depend on a number of factors; setting the value to the number of cores is a reasonable starting point, but check what happens with different values.

When running unit tests, use your test driver's option for running multiple tests in parallel; ctest supports a '-j' argument for this as well. An important thing to remember before enabling this is that your tests need to be set up so that they can't interfere with one another. This means not trying to use the same resources (files, settings keys, I/O ports, web service accounts etc.) at the same time. Some tests might be easier to isolate than others, in which case you can split your test suite into subsets and only run some of the subsets in parallel. ctest has a facility for assigning labels to tests:

set_tests_properties(${TEST_TARGET} PROPERTIES LABELS "${LABELS}")

ctest's '-L' and '-LE' arguments can then be used to run only tests whose labels match a given pattern, or to exclude tests whose labels match a pattern. This makes it possible to run concurrently only those subsets of tests which are known not to interfere with one another.

Working around C++ compilation inefficiency


When the compiler encounters an #include statement, it effectively copies and pastes the included content into the current source file. For a typical source file in a Qt app, the resulting output that the compiler has to lex, parse and analyze ends up being tens of thousands of lines long. The more you use code-heavy headers such as the C++ standard library or Boost, the worse this gets. This is incredibly inefficient and means that much of your build time can be spent re-parsing the same source code over and over, which is compounded by the complexity of parsing C++ in the first place.

Consider a very simple list view app like the one sketched below. There are only 15 lines of actual code in the example, but the preprocessed output, which can be produced by passing the -E flag to gcc, is just under 43,000 lines of code (as determined by sloccount), or just under 60,000 lines when C++11 mode is enabled (using the '-std=c++0x' flag).
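
A reconstruction of that kind of app (the exact example accompanied the original post; this is equivalent in spirit):

// A minimal Qt list view app - roughly 15 lines of actual code, yet the
// preprocessed output runs to tens of thousands of lines.
#include <QApplication>
#include <QListView>
#include <QStringList>
#include <QStringListModel>

int main(int argc, char** argv)
{
    QApplication app(argc, argv);
    QStringList items;
    items << "One" << "Two" << "Three";
    QStringListModel model(items);
    QListView view;
    view.setModel(&model);
    view.show();
    return app.exec();
}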

In a language with a proper package/module system (e.g. C#, Go and many other languages), processing an import only involves reading some metadata from the already-compiled module rather than re-parsing everything. A proper module system for C++ is in the works but is still some way off. In the meantime, there are workarounds available which can help considerably.

Precompiled headers


MSVC, GCC and Clang all have good support for precompiled headers. The use of precompiled headers is even more important now since the preprocessed output of many of the #includes from the C++ standard library grows considerably in size when C++11 is enabled. Note that under MSVC on Windows, C++11 mode is always enabled.

With the small example above, creating a precompiled header which includes just the QStringList header reduces compile times for the main .cpp file on my system from ~1.1s to ~0.7s (about 35%). This sounds modest but adds up by the time you have a project with hundreds of source files. Even in a small project with just a few dozen source files I think it is worthwhile.

The steps to enable precompiled headers will depend on the build system you are using. With qmake, this is relatively simple. CMake lacks a simple built-in command for this but there are samples online that we used as a basis.
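
A typical precompiled header pulls in the stable, widely-used headers (a sketch; the file name is illustrative, and the qmake variables follow the pattern from the qmake documentation):

// stable.h - precompiled header. With qmake, enable it via
// CONFIG += precompile_header and PRECOMPILED_HEADER = stable.h
#if defined __cplusplus
// Qt headers used throughout the project
#include <QString>
#include <QStringList>
#include <QVector>
// Standard library headers
#include <string>
#include <vector>
#endif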

A downside of precompiled headers is that you are effectively automatically #including an extra header with every file that you build, so a file may compile in a build with precompiled headers but fail to build in one without them if the file is missing necessary #includes that the precompiled header supplies. If you're running a CI system, it is therefore useful to have at least one regular build that does not use precompiled headers.

Unity builds


A unity build involves creating a single source file which #includes all the source files from a particular module (or the whole project) and compiling that at once. The main caveat with this approach is that variables and functions declared within an implementation (.cpp) file may now clash with declarations from other source files, since they are all being compiled together as a single translation unit instead of separately.
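
The mechanism is just this (the file names are illustrative):

// module_unity.cpp - a unity build translation unit. The module's sources
// are compiled together as one file, so headers shared between them are
// parsed only once.
#include "documentlist.cpp"
#include "documentmodel.cpp"
#include "library.cpp"

One consequence worth noting: anything that would normally be private to one .cpp file (file-static helpers, anonymous namespaces) now shares a translation unit with everything else in the module, which is where the clashes mentioned above come from.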

More efficient build tools


Part of the reason for the gradual creep in build times as a project grows is scaling issues with the build tools themselves. The amount of time taken for a do-nothing build (i.e. running 'make' when everything is up to date) grows noticeably with cmake + make as the total number of targets increases. Fortunately for us, engineers working on Google Chrome ran into this problem harder and long before we did, so they have produced some helpful replacements for the standard tools:
  • The Ninja build system is designed to be fast, especially for incremental builds where little has changed. Recent versions of CMake have built-in support for it (use 'cmake -G Ninja' to generate Ninja build files). The speedup for incremental builds is decent on Mac and Linux, and very noticeable on Windows compared to nmake. Prior to Ninja, Qt developers also created jom as a faster, parallel alternative to nmake.
  • On Linux, the Gold linker is faster than the traditional ld linker and can often be used as a drop-in replacement.

Reducing total disk I/O


Disk I/O is very slow, so reducing the total amount of I/O (especially random I/O) required during a build can improve overall build times substantially. Anecdotally, this is especially true on Windows, where reducing the total amount of I/O performed during a clean build had the largest impact in terms of achieving parity between build + test times on Windows and build times on Linux and Mac.

Use faster hardware


It always feels a little dirty to solve software inefficiency by throwing faster hardware at the problem but if you can afford it, it can be a quick win.
  • Adding more memory will reduce the likelihood of the build system swapping.
  • A good SSD drive will speed up disk I/O, especially for operations which do a lot of random I/O.
  • If you have a lot of memory spare you can create a RAMDisk and do the build on that.
I haven't compared the impact of an SSD vs. a standard IDE drive myself; this advice comes mostly from the Chromium developers' build notes.

Reducing debug info size


In debug builds, a large proportion of the total data read from and written to disk is typically debug information. When doing local development this information is usually useful; for builds generated on a continuous integration system purely to run automated tests, it is much less so.
  • All of the major compilers (MSVC, gcc, clang) have switches to control the amount of debug info that is generated. With gcc/clang this is controlled by the -g0 to -g3 switches, in increasing order of detail (-g0 disables debug info entirely).

Generating fewer binaries for tests


For every binary that is generated as part of a project, there are a number of overheads:
  • Each binary will add a number of additional targets to the build system
  • Each binary requires a linking step - which can be memory and I/O intensive.
  • Each binary generated requires reading/writing additional data to disk. The cost of this depends on how large the generated binary is and how many files need to be processed to assemble the final binary.
In our case, we are using the QTestLib framework for unit tests, which by default encourages the creation of one test class per original class. Each test class is then compiled into a separate binary, with a QTEST_MAIN($TEST_CLASS_NAME) macro providing the entry point for the test app. This works fine for smaller apps. When a project grows larger, however, and you have hundreds of test classes, the overhead of linking all of those binaries can add noticeably to the total build time.

We changed the test builds to produce one test binary per source directory instead of one per test class. This was done by replacing the QTEST_MAIN() macro with a substitute which declares a '$TESTCLASS_main()' function and registers it in a global map from test class name to entry function at startup. All of the test classes are then compiled and linked together with a small stub library whose 'int main()' function reads the name of the test to run from the command line and calls the corresponding '$TESTCLASS_main()' function, forwarding the remaining command-line arguments to it (a sketch of this approach follows below). This allows multiple Qt test cases to be linked into a single binary, which improves build times in several ways:
  • The number of linking operations during builds was considerably reduced.
  • The total amount of binary data generated on disk was reduced as code that was previously statically linked into the test binary for each test class is now only linked into a single test binary for each group of tests.
  • The total number of make steps and targets for the whole project was reduced.
On Windows this change shaved 30% off our total build time and the impact on build times of adding a new test case is now greatly reduced.
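
Here is a sketch of the approach (the macro and registry names are my own; QTestLib itself only provides QTEST_MAIN):

// Sketch of a QTEST_MAIN() replacement that registers each test class in a
// global map instead of defining main() itself.
#include <QtTest/QtTest>
#include <map>
#include <string>

typedef int (*TestEntryFunc)(int, char**);

inline std::map<std::string, TestEntryFunc>& testRegistry()
{
    static std::map<std::string, TestEntryFunc> registry;
    return registry;
}

struct TestRegistrar {
    TestRegistrar(const char* name, TestEntryFunc func) { testRegistry()[name] = func; }
};

// Used in place of QTEST_MAIN(): declares <TestClass>_main() and registers it
#define MULTI_TEST_MAIN(TestClass) \
    static int TestClass##_main(int argc, char** argv) { \
        TestClass tc; \
        return QTest::qExec(&tc, argc, argv); \
    } \
    static TestRegistrar TestClass##_registrar(#TestClass, TestClass##_main);

// The stub library's entry point: argv[1] selects the test class and the
// remaining arguments are forwarded to QTestLib.
int main(int argc, char** argv)
{
    if (argc < 2)
        return 1;
    std::map<std::string, TestEntryFunc>::const_iterator it = testRegistry().find(argv[1]);
    if (it == testRegistry().end())
        return 1;
    return it->second(argc - 1, argv + 1);
}

One practical caveat: if the test objects are linked into the binary via a static library, the linker may discard the static registrar objects, so the test object files are best linked into the binary directly.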

Generating smaller binaries


Another way to reduce the size of compiled binaries is to build each module of the app as a shared library rather than a static library. This is sometimes referred to as a 'component build'. When many executables are generated from the same source code, this reduces the amount of work for the linker and the amount of I/O, since the shared code and its associated debug info are generated only once, when building the shared library/DLL, instead of being linked separately into each binary.

Note that by doing this you are deferring some of the linking work from build time to runtime and consequently startup will slow down as the number of dynamically loaded libraries increases.

Further reading

I hope these notes are useful - please let me know if you have other recommendations in the comments. In the meantime, here are a few resources which I found useful background reading:

  • Notes on accelerating Chromium builds on Windows, Linux and Mac - this doesn't involve Qt but the advice is still quite relevant.
  • Notes on improving Firefox's build system.
  • An explanation of how a language designed with build performance in mind differs from C++