This is about #16960 and related tickets.


In 2016 we gave our main server some more RAM, as a temporary solution to cope with our workload, and as a way to learn about how to scale it. See hardware for automated tests take2 for our reasoning, lots of benchmark results, and conclusions.

It's working relatively well so far, but we may need to upgrade again soonish, and improvements are always welcome in the contributor UX area:

  • Our release process suffers from a number of delays that are in the critical path and could be shortened if our Jenkins workers were faster: running the test suite after importing Tor Browser, building the final images to check for reproducibility, building IUKs.

  • When Jenkins has built an ISO from one of our main branches or from a branch that Needs Validation, since October 2017 we rebuild it in a slightly different build environment to ensure it can be rebuilt reproducibly. This substantially increased the number of ISO images we build, which sometimes creates congestion in our CI pipeline (see below for details).

  • On our current setup, a large number of automated test cases are brittle and had to be disabled on Jenkins (@fragile), which has quite a few problematic consequences: it decreases the value our CI system brings to our development process, it is demotivating for test suite developers, it decreases the confidence developers have in our test suite, and it forces developers to run the full test suite elsewhere when they really want to validate a branch. Interestingly, we don't see this much brittleness anywhere else, not even on a replica of our Jenkins setup that also uses nested virtualization.

  • As we add more automated tests, and re-enable tests previously flagged as fragile, a full test run takes longer and longer. We're now up to 300-350 minutes per run (depending on how many concurrent jobs are running) without fragile tests. We can no longer make it faster by adding RAM or by adding CPUs to ISO testers, but faster CPU cores would help. With a fast (Intel E-2134) CPU and a poor Internet connection, the same test suite (without fragile tests) only takes about half as long.

  • Building our website takes a long while (11-15 minutes on our ISO builders on lizard, i.e. 20% of the entire ISO build time, which is 56-70 minutes on lizard depending on how many concurrent jobs are running), which makes ISO builds take longer than they could. This will get worse as new languages are added to our website. This is a single-threaded task, so adding more CPU cores or RAM would not help: only faster CPU cores would fix that. For example, with a fast (Intel E-2134) CPU and a poor Internet connection, the ISO build only takes:

    • 25 minutes (including 5 minutes for building the website) on bare metal
    • 25 minutes for 1 concurrent build in a VM, i.e. using nested virtualization
    • 33 minutes (including ~5 minutes for building the website) for 2 concurrent builds in VMs, in the worst case situation (builds started exactly at the same time ⇒ they both need all their vCPUs at the same time)
  • Waiting time in queue for ISO build and test jobs is acceptable most of the time, but too high during peak load periods:

    • between 2017-06-17 and 2017-12-17:

      • 4% of the test jobs had to wait for more than 1 hour.
      • 1% of the test jobs had to wait for more than 2 hours.
      • 2% of the ISO build jobs had to wait more than 1 hour.
    • between 2018-05-01 and 2018-11-30:

      • We've run 3342 ISO test jobs; median duration: 195 minutes.
      • 7% of the test jobs had to wait for more than 1 hour.
      • 3% of the test jobs had to wait for more than 2 hours.
      • We've run 3355 successful ISO build jobs; median duration: 60 minutes.
      • 7.2% of the ISO build jobs had to wait more than 15 minutes.
      • 2% of the ISO build jobs had to wait more than 1 hour.
      • We've run 3355 reproducibly_build_* jobs; median duration: 70 minutes.
      • 10% of the reproducibly_build_* jobs had to wait more than 15 minutes.
      • 3.2% of the reproducibly_build_* jobs had to wait more than 1 hour.

    Not that many jobs have to wait a long time, but this congestion happens precisely when we need results from our CI infra ASAP, be it because there's intense ongoing development or because we're reviewing and merging lots of branches close to a code freeze, so these delays hurt our development and release process.

  • Our current main server was purchased at the end of 2014. The hardware will hopefully last quite a few more years, but since it's now 5 years old, we've budgeted its replacement. If our CI ran on another machine, replacing this main server would cost much less and its hardware would be chosen very differently.

  • The Tails community keeps needing new services; some of them need to be hosted on hardware we control for security/privacy reasons (which is not the case for our CI system):

    • Added already:
      • self-host our website
      • Schleuder
      • Weblate is a serious CPU, memory, and I/O consumer; besides, part of its job is to rebuild the website, which is affected by the single-threaded task performance limitations described above.
      • survey platform
    • Under consideration:
      • Matomo will require huge amounts of resources and put quite some load on the system where we run it; it needs to be hosted on hardware we control.
      • Request tracker for help desk: unknown resource requirements; needs to be hosted on hardware we control.
    • WIP and will need more resources once they reach production status or are used more often:
  • We're allowing more and more people to use our CI infrastructure. This may eventually increase its resource needs. From a development perspective, ideally every merge request would go through our CI, which probably requires a stronger separation between the CI workload and the more security-sensitive tasks that we run.


Bare metal server dedicated to CI

It's hard to tell whether this would fix our test suite fragility problems, and hard to specify what hardware we need. If we get it wrong, we'd likely have to wait another 5 years before trying again ⇒ we need to rent essentially the exact hardware we're looking at so we can benchmark it before buying.


Pros:

  • No initial development nor skills to learn: we can run our test suite in exactly the same way as we currently do.
  • Can provide hardware redundancy in case lizard suddenly dies.
  • We control the hardware and have a good relationship with a friendly colocation provider.


Cons:

  • High initial money investment… unless we get it sponsored or get a big discount from a vendor.
  • On-going cost for hosting a second server.

Extra options:

  • If we want to drop nested virtualization to get more performance, then we have non-negligible development costs and hard sysadmin problems to solve (#9486):
    • We currently reboot isotesters between test suite runs ⇒ if we go this way we need to learn how to clean up after various kinds of test suite failure.
    • Our test suite currently assumes only one instance is running on a given system ⇒ if we go this way we have to remove this limitation.


  • For simplicity's sake, the following assumes we have Jenkins workers that are each able to run all the kinds of jobs we have, which would be a good idea anyway.
  • CPU: assuming 2 cores (4 hyperthreads) per Jenkins worker, for 8 workers, we need 16 cores at ≥ 3.5 GHz base frequency, which is roughly equivalent to 4 × the Intel NUC mentioned below. Our options are:
    • 4 × quad-core CPU (On 2020-02-04, 1 result with a not totally crazy price: Xeon Gold 5222 → 4×105 W = 420 W)
    • 2 × octo-core CPUs (On 2020-02-04, 1 result: Xeon Gold 6244 → 2 × 150 W = 300 W)
    • 1 × 16-core CPU (On 2020-02-04, no result.)
    • Higher-density systems, with 2+ servers in a chassis e.g. Supermicro Twin solutions, might allow using cheaper CPUs that don't support multi-processor setups.
  • RAM:
    • 232 GB = 29 GB × 8 Jenkins workers
    • plus the Jenkins VM + host system + a few accessory VMs
      ⇒ round to 256 GB; 192 GB should work with super fast storage (at least 4 × NVMe × 2 for RAID-1) if that's cheaper
  • Storage:
    • 480 GB = 60 GB × 8 Jenkins workers
    • 600 GB for the Jenkins artifacts store
    • 70 GB for the APT cacher (make it cache ISO history too)
    • plus the Jenkins VM + host system + a few accessory VMs
      ⇒ round to 1.5 TB × 2 (RAID-1)
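
For reference, a tiny Python sketch that just encodes the sizing arithmetic above (the per-worker figures are the assumptions from this list, not new data):

    # Rough capacity-planning sketch for a dedicated CI server.
    # Per-worker figures are the assumptions listed above; the host system,
    # the Jenkins VM and accessory VMs are not included in these sums.
    WORKERS = 8

    CORES_PER_WORKER = 2        # i.e. 4 hyperthreads
    RAM_PER_WORKER_GB = 29
    DISK_PER_WORKER_GB = 60

    ARTIFACTS_STORE_GB = 600    # Jenkins artifacts store
    APT_CACHER_GB = 70          # APT cacher, including ISO history

    cores = WORKERS * CORES_PER_WORKER                 # 16 cores at ≥ 3.5 GHz
    ram_gb = WORKERS * RAM_PER_WORKER_GB               # 232 GB ⇒ round to 256 GB
    disk_gb = (WORKERS * DISK_PER_WORKER_GB
               + ARTIFACTS_STORE_GB
               + APT_CACHER_GB)                        # 1150 GB ⇒ ~1.5 TB per RAID-1 leg

    print(f"{cores} cores, {ram_gb} GB RAM, {disk_gb} GB storage before overhead")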

Cost estimate: $14-16k, depending on the chosen memory and storage setup.

Bare metal gaming setup in a server housing dedicated to CI

CPUs targeted at the gaming sector are likely to provide the processing power we need, at a much lower price than regular server hardware. In particular, the new generation of AMD Threadripper CPUs seems to be a good match. This would involve placing a gaming motherboard in a server casing.


Pros:

  • No initial development nor skills to learn: we can run our test suite in exactly the same way as we currently do.
  • Can provide hardware redundancy in case lizard suddenly dies.
  • We control the hardware and have a good relationship with a friendly colocation provider.
  • Relatively cheap


Cons:

  • On-going cost for hosting a second server.
  • No IPMI
  • Will require some research into cooling

Extra options:

  • We could get a KVM over IP device to compensate for the lack of IPMI
  • Or we could wait for a server board with TRX40 socket and IPMI support to enter the market


  • The Threadripper 3960X has 24 cores at a 3.8 GHz base frequency. This easily meets our required 16 cores at ≥ 3.5 GHz base frequency.
  • A board like the Gigabyte TRX40 Aorus Xtreme can provide 256 GB of DDR4 ECC memory and dual PCI-e 4.0 x8 1.6 TB storage.

Cost estimate: roughly €1k for the motherboard, €1.5k for the CPU, €2k for the memory, €1.5k for the storage, and €0.5k for housing and extras ≈ €6.5k for the fanciest version with super fast storage. Cutting down on the storage and motherboard could bring this down to roughly €5k.

Custom-built cluster of consumer-grade hardware dedicated to CI, aka. the hacker option

For example, we could stuff 4 × Intel NUC or similar together in a custom case, with whatever cooling, PoE and network boot system this high-density cluster would need. Each of these nodes should be able to run 2 Jenkins workers.

Pros and cons


Pros:

  • Potentially scalable: if there's room left we can add more nodes in the future.
  • As fast as, or even faster than, server-grade hardware.
  • We have one such node to play with and benchmark results already.


Cons:

  • Lots of initial research and development: casing, cooling, hosting, power over Ethernet, network boot, remote administration.
  • High initial money investment
  • Hosting this is a hard sell for colocation providers. For example, the folks who'll do hardware debugging and parts replacement on-site may frown upon the extremely non-standard setup.
  • On-going cost for hosting this cluster.
  • Needs some development work to fully benefit from the performance improvements.


  • Intel: in the ninth generation NUC9i7 (availability: 2020Q1), NUC9V7QNX is the one that supports vPro
  • Other vendors have started selling UCFF boards/kits with fast CPUs.
  • Similar smallish form factors would be worth investigating, e.g. there are plenty of Mini-ITX options on the market that could give us the high density we need.
    • Supermicro X11SCL-IF, $200
      • 300€
      • 2 × 16GB RAM Unbuffered ECC UDIMM, DDR4-2666MHz (e.g. CT16G4WFD8266) $400
      • CPU heatsink SNK-P0049A4 $50
      • mini-ITX case with power supply (e.g. Antec ISK310-150) 95€
      • SSD M.2 PCI-E 3.0 x4 250GB 90€
    • ASRock mini-ITX motherboards with IPMI that support suitable CPUs: E3C226D2I, E3C236D4I-44E85, EPC612D4I
    • Supermicro X10SDV series boards are affordable but only support CPUs that are too slow for us
    • nicer Supermicro options, e.g. X11SSV-M4F, are too expensive (1000€ for the motherboard only)
  • Supermicro SuperServer E300-9D is very tempting (IPMI) but more expensive than a NUC.

Cost estimate: 7k€ for hardware (including shipping to the USA and import taxes) and dozens of hours of custom work.

Remote management

  • AMT (vPro) can be a pain as it shares the Ethernet interface with the OS. IPMI would be a big plus as it integrates well into the colo's existing setup.

Benchmarking results


  • Twice as fast as lizard with 1 Jenkins executor on the node, able to run one build or test job at a time, without nested virtualization.

  • With 2 Jenkins executors on the node, each in its own VM that's able to run one build or test job at a time, with nested virtualization, and qcow2 disks (probably slower than LVs as used on lizard):

    • light load, i.e. only one concurrent build/test: each build + test takes 43% less time than on lizard
    • heavy load, i.e. two concurrent builds/tests: each build + test takes 44% less time than on lizard

For detailed numbers, see above on this page: look for "E-2134".

So if we had, say, 4 such boxes in a case, each with 2 Jenkins workers, in 24h they would build, reproduce, and test ~55 branches, while during the same period the 9 VMs on lizard build, reproduce, and test ~40 branches. Even adding only 2 such boxes would increase the maximum throughput of our CI by 69% and immensely lower latency during heavy load times.
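
These numbers come out of a simple back-of-the-envelope throughput model; here is a Python sketch of it. The 210-minutes-per-branch figure for the faster boxes is an illustrative value roughly consistent with the benchmarks above, not a fresh measurement:

    # Back-of-the-envelope CI throughput model.
    # A "branch" here means one full pipeline: build + reproducibility
    # rebuild + test. lizard's per-branch time uses the median durations
    # quoted earlier on this page (60 + 70 + 195 minutes).

    MINUTES_PER_DAY = 24 * 60

    def branches_per_day(workers: int, minutes_per_branch: float) -> float:
        return workers * MINUTES_PER_DAY / minutes_per_branch

    lizard = branches_per_day(workers=9, minutes_per_branch=60 + 70 + 195)  # ≈ 40
    two_boxes = branches_per_day(workers=2 * 2, minutes_per_branch=210)     # 2 boxes × 2 workers
    four_boxes = branches_per_day(workers=4 * 2, minutes_per_branch=210)    # ≈ 55

    print(f"lizard alone: {lizard:.0f} branches/day")
    print(f"4 boxes alone: {four_boxes:.0f} branches/day")
    print(f"lizard + 2 boxes: {lizard + two_boxes:.0f} branches/day "
          f"(+{100 * two_boxes / lizard:.0f}% maximum throughput)")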

Required development work

The above benchmarks assume perfect load distribution, which requires Jenkins slaves that can run both build and test jobs. This works fine in limited local testing, but it is not what we have set up on lizard at the moment: there's a risk that a failed test job leaves the system in bad shape and breaks the following build job, as we don't reboot after build jobs. We might need to either make test jobs more robust on failure (#17216), or start rebooting the Jenkins slaves after build jobs as well.

To utilize such nodes more fully, we may need to raise the number of vCPUs allocated to TailsToaster. For example, assuming 8 hyperthreads per box and 2 VMs per box, with our current settings (2 vCPUs allocated to TailsToaster, and Cucumber eats another vCPU):

  • when running 1 single test suite job, up to 3 hyperthreads out of 8 are used;

  • when running 2 concurrent test suite jobs, up to 2×3=6 hyperthreads out of 8 are used.
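
A tiny sketch of this arithmetic, handy for playing with other allocations; the 3-vCPU case below is a hypothetical bump, not a decision:

    # How many of a box's hyperthreads our test suite jobs can use.
    # Assumes 8 hyperthreads per box and that each running test suite job
    # uses (TailsToaster vCPUs + 1 vCPU for Cucumber).

    HYPERTHREADS_PER_BOX = 8

    def hyperthreads_used(concurrent_jobs: int, toaster_vcpus: int) -> int:
        per_job = toaster_vcpus + 1   # TailsToaster + Cucumber
        return min(concurrent_jobs * per_job, HYPERTHREADS_PER_BOX)

    for jobs in (1, 2):
        for vcpus in (2, 3):          # current setting vs. a possible bump
            print(f"{jobs} concurrent job(s), {vcpus} vCPUs for TailsToaster: "
                  f"{hyperthreads_used(jobs, vcpus)}/{HYPERTHREADS_PER_BOX} hyperthreads")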

Raising this number may make the test suite run faster at least when there's only 1 test suite job running. It might even make it faster, in most scenarios, when 2 concurrent test suite jobs are running: they generally won't be using that much CPU power at the same time. OTOH it could introduce some test fragility.

We should first measure if it's worth the effort.

Run builds and/or tests in the cloud

Some of the following numbers are outdated in the sense that they don't take into account that we'll soon build and test USB images as well, which mostly impacts the amount of data that needs to be 1. transferred between the Jenkins master and workers; 2. stored on the Jenkins master.

EC2 does not support running KVM inside its VMs yet. Both Azure and Google Cloud support it. OpenStack supports it too, as long as the cloud is run on KVM (e.g. OVH Public Cloud, OSU Open Source Lab). ProfitBricks would likely work too, as their cloud is based on KVM.

There's a Jenkins plugin for every major cloud provider that allows starting instances on demand when needed, up to pre-defined limits, and shuts them down after a configurable idle time.

After building an ISO, we copy artifacts from the ISO builder to the Jenkins master (and thus to nightly.t.b.o), and then from the Jenkins master to another ISO builder (if the branch Needs Validation or is one of our main branches) and to one ISO tester (in any case), which run downstream jobs. These copies are blocking operations in our feedback loop. So:

  • If the network connection between pieces of our CI system were too slow, the performance benefits of building and testing faster might vanish.

    Assuming a 1.2 GB ISO, 3.5 minutes should be enough for a copy (based on benchmarking a download of a Debian ISO image from lizard) ⇒ 2 or 3 × 3.5 = 7 or 10.5 minutes for an ISO build; compared to 1.5 minutes to the Jenkins master + 20 seconds to the 2nd ISO builder + 15 seconds to the ISO tester ≈ 2 minutes on lizard currently. In the worst case (Jenkins master on our infrastructure, Jenkins workers in the cloud) it adds 5 or 8.5 minutes to the feedback loop, which is certainly not negligible but is not a deal breaker either (see the sketch after this list).

  • If data transfers between pieces of our CI system cost money, we would need to estimate how much these copies would cost. On OVH Public Cloud, data transfers to/from the Internet are included in the price of the instance, so let's ignore this.
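
Here is a small sketch of the copy-time arithmetic above, so the overhead is easy to recompute if the image size or per-copy time changes:

    # Extra feedback-loop latency from copying build artifacts when the
    # Jenkins workers live in the cloud. 3.5 minutes per copy is the
    # benchmarked figure for a 1.2 GB ISO; 2 copies = build + test,
    # 3 copies = build + reproducibility rebuild + test.

    MINUTES_PER_COPY_CLOUD = 3.5
    MINUTES_ON_LIZARD = 2          # ≈ 1.5 min to master + 20 s + 15 s today

    for copies in (2, 3):
        cloud_minutes = copies * MINUTES_PER_COPY_CLOUD
        print(f"{copies} copies: {cloud_minutes:.1f} min in the cloud, "
              f"i.e. +{cloud_minutes - MINUTES_ON_LIZARD:.1f} min vs. lizard")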

One way to avoid this problem entirely is to run our Jenkins master and nightly.t.b.o in the cloud as well.


Pros:

  • Scalable as much as we can afford, both to react to varying workloads in the short term (some days we build and test tons of ISO images, some days a lot fewer), and to adjust to changing needs in the long term.
  • No initial money investment.
  • No hardware failures we have to deal with.
  • We can try various instance types until we find the right one, as opposed to bare metal, which requires careful planning and somewhat-informed guesses (mistakes in this area can only be fixed years later: for example, choosing low-voltage CPUs, which are suboptimal for our workload).
  • Frees lots of resources on our current virtualization host, which can be reused for other purposes. And if we don't need these resources, then our next bare metal server can be much cheaper, both in terms of initial investment and on-going costs (it will suck less power).


Cons:

  • We need to learn how to manage systems in the cloud, how to deal with billing, and how to control these systems from Jenkins.

  • On-going cost: renting resources costs money every month.

    Very rough estimate, assuming we run all ISO builds and tests on dynamic OVH C2-15 instances (4 vCores at 3.1 GHz, 15 GB RAM, 100 GB SSD), assuming they perform exactly like my local Jenkins (4 i7-6770HQ vCores at 2.60 GHz, 15 GB RAM), and assuming that no VAT applies (a small cost-model sketch follows after this list of cons):

    • builds & tests: (30 minutes/build * 450 builds/month + 105 minutes/test * 350 tests/month) / 60 * 0.173€ = 145€/month
    • second build for reproducibility (#13436): 30 minutes / 60 * 250 builds/month * 0.173€ = 22€/month
    • total = 167€/month

    Now, to be more accurate:

    • Likely these instances will be faster than my local Jenkins, thanks to higher CPU clock rate, which should lower the actual costs; but only actual testing will give us more precise numbers.

    • Running a well chosen number of static instances would probably lower these costs thanks to the discount when paying per month. Also, booting a dynamic instance and configuring it takes some additional time, which costs money and decreases performance.

      We need to evaluate how many static instances (kept running at all times and paid per-month) we run and how many dynamic instances (spawned on demand and paid per-hour) we allow. E.g. on OVH public cloud, a dynamic C2-15 instance costs more than a static one once it runs more than 50% of the time. Thanks to the Cluster Statistics Jenkins plugin, once we run this in the cloud we'll have the data we need to optimize this; it should be easy to script it so we can update these settings from time to time.

    • We need to add the cost of hosting our Jenkins master and nightly.t.b.o in the same cloud, or the cost of transferring build artifacts between that cloud and lizard.

      Our Jenkins master is currently allocated 2.6 GB of RAM and 2 vCPUs. An OVH S1-8 (13€/month) or B2-7 (22€/month) static instance should be enough. We estimated that 300 GB of storage would be enough at least until the end of 2018. Our metrics say that the storage volume that hosts Jenkins artifacts often makes good use of 1000-2000 IOPS, so an OVH "High Speed Volume" (0.08€/month/GB) would be better suited, even though in practice only a small part of these 300 GB needs to be that fast, and possibly a slower "Classic Volume" might perform well enough. So:

      • worst case: 22 + 300×0.08 = 46€/month
      • best case: 13 + 300×0.04 = 25€/month
    • In theory we could keep running some of our builds on our own infra instead of in the cloud: one option is that the cloud would only be used during peak load times for builds (but always used for tests in order to fix our test suite brittleness problems, hopefully). But if we do that, we don't improve ISO build performance in most cases, and the build artifacts copy problems surface, which costs performance and some development time to optimize things a bit:

      • If we run the Jenkins master on our own infra: only artifacts of ISO builds run in the cloud during peak load times need to be downloaded to lizard; we could force the 2nd ISO build to run locally so we avoid having to upload these artifacts to the cloud, or we could optimize the 2nd ISO build job to retrieve the 1st ISO only when the 2 ISOs differ and we need to run diffoscope on them (in which case we also need to download the 2nd ISO to archive it on the Jenkins master). But all ISO build artifacts must be uploaded to the nodes that run the test suite in the cloud.

      • If we run the Jenkins master in the cloud: most of the time we need to upload there the ISOs built on lizard; then we need to download them again for the 2nd build on lizard as well (unless we do something clever to keep them around for the 2nd build, or force the 2nd build to run in the cloud too); but at this point they're already available out there in the cloud for the test suite downstream job. During peak times, the difference is that ISOs built in the cloud are already there for everything else that follows (assuming we force the 2nd build to run in the cloud too).

  • We need to trust a third party somewhat.

  • To make the whole thing more flexible and easier to manage, it would be good to have the same nodes able to run both builds and tests. Not sure what it would take and what the consequences would be.
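
The cost-model sketch referenced above, encoding the per-month arithmetic for the dynamic-instance estimate:

    # Very rough monthly cost model for dynamic OVH C2-15 instances
    # (0.173 EUR/hour, job durations assumed identical to the benchmark
    # machine, no VAT). The per-month job counts are the assumptions
    # stated in the estimate above, not measurements.

    HOURLY_RATE_EUR = 0.173

    def monthly_cost_eur(minutes_per_job: float, jobs_per_month: int) -> float:
        return minutes_per_job / 60 * jobs_per_month * HOURLY_RATE_EUR

    builds = monthly_cost_eur(30, 450)
    tests = monthly_cost_eur(105, 350)
    repro = monthly_cost_eur(30, 250)   # second build for reproducibility (#13436)

    print(f"builds + tests: {builds + tests:.0f} EUR/month")
    print(f"reproducibility rebuilds: {repro:.0f} EUR/month")
    print(f"total: {builds + tests + repro:.0f} EUR/month")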

We could request a grant from the cloud provider to experiment with this approach. See Arturo's report about how OONI took advantage of the AWS grant program.

Dismissed options

Replace lizard

Dismissed: our CI workload has needs that are too specific and better served by dedicated hardware; trying to host everything on a single box leads to crazy hardware specs that are hard to match.


Pros:

  • No initial development nor skills to learn: we can run our test suite in exactly the same way as we currently do.
  • On-going cost increases only slightly (we probably won't get low-voltage CPUs this time).
  • We can sell the current hardware while it's still current, and get some of our bucks back.


Cons:

  • High initial money investment.
  • Hard to tell whether this would fix our test suite fragility problems, and we'll only know after we've spent lots of money.
  • Hard to specify what hardware we need. If we get it wrong, likely we have to wait another 5 years before we try again.

The plan

  1. Check how the sysadmins team feels about the cloud option: are there any blockers, for example wrt. ethics, security, privacy, anything else? → we're no big fans of using other people's computers but if that's the best option we can do it

  2. Keep gathering data about our needs while going through the next steps:

    • upcoming services [intrigeri] (last updated: 2020-02-04)
  3. Describe our needs for each option:

    • 2nd bare metal server-grade machine:
      • hardware specs [intrigeri] [DONE]
      • hosting needs (high power consumption) [DONE]
    • the gamer option:
      • hardware specs [groente] [DONE]
      • hosting needs (high power consumption) [DONE]
      • KVM over IP?
    • the hacker option: hosting [CANCELED: we won't do the hacker option]
  4. Ask potential hardware/VM donors (e.g. ProfitBricks) if they would happily satisfy our needs for free or with a big discount. If they can do that, then let's do it. Otherwise, keep reading. [WIP: intrigeri]

  5. Benchmark how our workload would be handled by the options we have no data about yet:

    • Rent a bare metal server for a short time, run CI jobs in a realistic way, measure. [CANCELED: we have data about Intel E-2134 which is good enough.]
    • Find a friend who's really into gaming and has a Threadripper setup to run some CI jobs in a realistic way, measure.
    • Rent cloud VMs for a short time, run CI jobs in a realistic way, measure. [CANCELED: we won't do cloud.]
  6. Decide what we do and look into the details e.g.:

    • bare metal options: where to host it?
    • cloud: refresh list of suitable providers, e.g. check if more providers offer nested KVM, if the OSU Open Source Lab offering is ready, and ask friendly potential cloud providers such as universities, HPC cluster admins, etc. [CANCELED: we won't do cloud]