On our front page we make the claim that “huddersfield.click provides low-cost high-performance” and in this blog post we’ll go into detail about how we actually achieve that.
The following benchmarking was carried out using the ApacheBench tool running against one of our hosted WordPress sites in early March 2021. The benchmark was run multiple times during the course of a day to simulate 1,000 or 5,000 home page requests at a variety of different concurrency loads (mimicking simultaneous users), and the average time per request was calculated by dividing the ApacheBench “Time taken for tests” result by the number of requests. The benchmarking was run from a separate server in the same data centre, so it effectively represents the maximum achievable performance of our current WordPress configuration.[1]
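As a rough sketch of how a sweep like this could be scripted, the snippet below builds one ApacheBench command per concurrency level. The `-n` (number of requests), `-c` (concurrency) and `-k` (keepalive) flags are standard ApacheBench options; the URL and the concurrency levels are placeholders chosen for illustration, not the exact values we used.

```python
def ab_command(url: str, requests: int, concurrency: int) -> list[str]:
    """Build the argument list for a single ApacheBench run."""
    return ["ab", "-k", "-n", str(requests), "-c", str(concurrency), url]


# Placeholder target and load levels (assumptions for this example only).
URL = "https://example.com/"
LEVELS = (10, 20, 40, 80)

# One command per concurrency level, each requesting the page 1,000 times.
sweep = [ab_command(URL, 1000, level) for level in LEVELS]

for command in sweep:
    print(" ".join(command))
```

Each list can be handed to `subprocess.run` on a machine that has ApacheBench installed.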
Before we delve into the huddersfield.click setup, let’s get a baseline of what can be achieved with a bog-standard default installation of WordPress…
WordPress out of the box
Graph 1 shows the benchmarks for a standard LAMP (Linux, Apache, MySQL & PHP) setup with a fresh WordPress installation on a 1GB Linode server running Ubuntu. No extra WordPress plugins were installed and Apache was configured to use mod_php.
This is the same data as above, but shown with a logarithmic Y scale:
With a light load of 10 concurrent requests, it takes ApacheBench 48.25 seconds to request the home page 1,000 times, which means the server can deliver the page 20.72 times per second. That level of performance was maintained as the number of concurrent requests was increased up to 80.
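The arithmetic behind those figures can be sketched in a few lines. The function below pulls the “Time taken for tests” value out of ApacheBench’s report and converts it into an average time per request and a pages-per-second rate. The function name and the single-line sample report are ours, made up for the example.

```python
import re


def summarise(ab_output: str, requests: int) -> tuple[float, float]:
    """Return (average ms per request, pages per second) from an ab report."""
    match = re.search(r"Time taken for tests:\s+([\d.]+) seconds", ab_output)
    if match is None:
        raise ValueError("no 'Time taken for tests' line found")
    total_seconds = float(match.group(1))
    average_ms = total_seconds / requests * 1000
    pages_per_second = requests / total_seconds
    return average_ms, pages_per_second


# Sample line using the 48.25 second result quoted above.
sample = "Time taken for tests:   48.250 seconds"
average_ms, pages_per_second = summarise(sample, requests=1000)
print(f"{average_ms:.2f} ms per request")
print(f"{pages_per_second:.1f} pages per second")
```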
However, beyond 80, the server begins to struggle: as Apache spawns extra processes to handle the connections, the server’s memory runs out and it begins to rely increasingly on swap space (temporary memory storage on the server’s hard drive). We’re wandering into OOM territory and things go rapidly downhill.
By 100 concurrent requests, the 512MB swap space is completely full and the server is really struggling. Anyone trying to access the website is going to be waiting between 10 and 20 seconds for the page to load. Pushing on to 110 concurrent requests, page loading time is anything up to 30 seconds. Beyond that point, the server is no longer able to function and the benchmark test fails (around a quarter of the requests time out or display an error message from the server) and pages are taking up to 90 seconds to load. With no spare memory and a maxed-out CPU, there’s a risk of data loss and processes crashing.
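To see why a process-per-connection setup hits a wall, here is a back-of-the-envelope model. The 1GB of RAM and 512MB of swap come from the server described above, but the base memory usage and the per-process footprint are purely hypothetical figures picked to make the example concrete; we did not measure them.

```python
def max_workers(capacity_mb: int, base_mb: int, per_worker_mb: int) -> int:
    """Largest number of worker processes that fit within capacity_mb."""
    return (capacity_mb - base_mb) // per_worker_mb


RAM_MB = 1024        # from the post: 1GB Linode
SWAP_MB = 512        # from the post: 512MB swap space
BASE_MB = 200        # HYPOTHETICAL: operating system plus MySQL
PER_WORKER_MB = 10   # HYPOTHETICAL: one Apache process running mod_php

fits_in_ram = max_workers(RAM_MB, BASE_MB, PER_WORKER_MB)
fits_with_swap = max_workers(RAM_MB + SWAP_MB, BASE_MB, PER_WORKER_MB)

print(f"Workers before swapping starts: {fits_in_ram}")
print(f"Workers before swap is also full: {fits_with_swap}")
```

The point is the shape of the problem rather than the exact numbers: memory use grows in step with the number of connections, so there is a hard ceiling once RAM and then swap are used up.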
Let’s start to look at how things are set up on huddersfield.click…
Rather than running everything on a single server, we split services out onto separate servers, which allows us to optimise each for maximum performance. The front-end web server handles incoming requests and either fulfils them itself or passes them back to the WordPress server.
For this first graph, we’ve disabled all of the WordPress optimisations that we usually have in place and every incoming request for the site’s home page is passed back to the WordPress server.
At first glance, this doesn’t look good and the response times are slower than the “out of the box” WordPress installation — each page request takes about 10ms longer, so we can only manage to serve the home page around 16.7 times a second. Why is that?
- All our websites use HTTPS (SSL) by default whereas the “out of the box” setup used HTTP — the SSL/TLS connection and handshake sequence adds extra time to each request.
- Requests are passing through two separate firewalls — a web application firewall (ModSecurity) and an endpoint firewall (Wordfence) — and this adds a small delay as each request is analysed.
However, as we’ll see when we take a look at graph 4, there’s a flip side. Unlike the “out of the box” setup, we’re able to easily scale well beyond 80 concurrent requests. We’ll explain why that is shortly.
Let’s start to re-enable those optimisations that we switched off for the previous benchmark. First up, the WordPress Cache is switched back on: this stores a copy of every requested page and uses it to satisfy subsequent requests.[2]
We can see an immediate improvement with the request time plateauing at around 16.9ms for 50 concurrent requests, which equates to being able to deliver the home page around 59 times a second.
Whilst that’s good, it’s not amazing. Each request is still being passed back to the WordPress server which then has to check if the page is cached, retrieve it, and pass it back to the user’s web browser.
Join the queue
As mentioned above, we start to see a benefit from separating out the services onto separate servers as we increase the number of concurrent requests. Graph 4 shows response times for no optimisations and also for the WordPress Cache enabled. As graphs go, two nearly horizontal lines is not particularly exciting to look at! However, note that we’re now pushing the benchmarking all the way up to 500 concurrent requests.
What’s happening here is that the front-end web server is queuing the incoming requests. Unlike the “out of the box” setup which starts to fail above 80 concurrent requests, the front-end just queues them up and leaves the WordPress server (which is also a 1GB Linode) free to respond as quickly as possible without getting overloaded and OOM-ing. The number of concurrent connections doesn’t matter as much now, as we can manage a consistent level of performance.
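The queuing idea can be illustrated with a toy model. This is not how our front-end web server is implemented; it is a minimal sketch using Python’s `asyncio`, where a semaphore lets only a fixed number of requests reach a pretend back end at once and everything else simply waits its turn. The limit of 8 and the 1ms render time are made-up values.

```python
import asyncio


async def handle_all(n_requests: int, max_backend: int) -> int:
    """Serve n_requests, letting at most max_backend hit the back end at once.

    Returns the peak number of simultaneous back-end requests observed.
    """
    gate = asyncio.Semaphore(max_backend)
    in_flight = 0
    peak = 0

    async def backend_render() -> None:
        await asyncio.sleep(0.001)  # stand-in for WordPress building a page

    async def handle_request() -> None:
        nonlocal in_flight, peak
        async with gate:  # requests queue here while the back end is busy
            in_flight += 1
            peak = max(peak, in_flight)
            await backend_render()
            in_flight -= 1

    await asyncio.gather(*(handle_request() for _ in range(n_requests)))
    return peak


# 500 simultaneous visitors, but the back end never sees more than 8 at a time.
peak = asyncio.run(handle_all(n_requests=500, max_backend=8))
print(f"Peak simultaneous back-end requests: {peak}")
```

However many requests arrive together, the back end’s workload stays capped, which is why the lines on graph 4 stay flat.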
“I feel the need… the need for speed!”
We’ve got one more trick up our sleeve, and that’s to hook the front-end web server directly into the WordPress Cache. This means that we’re cutting out everything that adds latency to the request (i.e. the two firewalls and WordPress) and serving up static content straight from the cache. If the requested page is in the cache, then there’s no need to pass the request back to the WordPress server.[3]
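In outline, the decision the front end makes looks something like the sketch below. It is a simplified illustration rather than our actual configuration: the cache is modelled as a plain dictionary, and `make_front_end` and `wordpress` are hypothetical names invented for the example.

```python
from typing import Callable


def make_front_end(
    cache: dict[str, str], backend: Callable[[str], str]
) -> Callable[[str], str]:
    """Return a request handler that prefers the cache over the back end."""

    def serve(path: str) -> str:
        cached = cache.get(path)
        if cached is not None:
            return cached        # cache hit: the back end never sees the request
        return backend(path)     # cache miss: pass the request through

    return serve


backend_hits: list[str] = []


def wordpress(path: str) -> str:
    """Pretend back end that records every request it has to handle."""
    backend_hits.append(path)
    return f"<html>rendered {path}</html>"


serve = make_front_end({"/": "<html>cached home page</html>"}, wordpress)
home = serve("/")               # served straight from the cache
search = serve("/?s=hosting")   # dynamic request, handled by the back end
print(backend_hits)
```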
The first graph shows the average request time (ms) and the second shows what that equates to in terms of web pages delivered per second. Note how we’ve now jumped from delivering the home page 59 times per second with the WordPress Cache enabled (at 50 concurrent requests) to over 5,000 times per second — that’s a performance increase of over 8,600%.
Compared to the “out of the box” setup, we’ve now achieved a performance increase of over 24,650%. Obviously there’s an increase in cost as we need more servers, but that increase is just 800% for the hardware used for the benchmark.
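For anyone wanting to check the percentages, the calculation is sketched below. The cached rate is derived from the daily total for 50 concurrent requests quoted later in this post (443,577,369 pages a day), and the two starting points are the 59 and 20.72 pages per second figures from the earlier benchmarks.

```python
def percent_increase(before: float, after: float) -> float:
    """Percentage increase going from `before` to `after`."""
    return (after - before) / before * 100


SECONDS_PER_DAY = 24 * 60 * 60

# Pages per second with the front end serving directly from the cache.
cached_rate = 443_577_369 / SECONDS_PER_DAY

vs_wordpress_cache = percent_increase(59, cached_rate)
vs_out_of_the_box = percent_increase(20.72, cached_rate)

print(f"Cached rate: {cached_rate:,.0f} pages per second")
print(f"Increase over WordPress Cache alone: {vs_wordpress_cache:,.0f}%")
print(f"Increase over out of the box: {vs_out_of_the_box:,.0f}%")
```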
Unlike graph 4, which quickly plateaued, we do start to see a decrease in performance as the number of concurrent requests increases. Why is that? Well, all the heavy lifting is now being done by the front-end web server: more concurrent requests equals more work to do. However, even at 1,000 concurrent requests, we’re still able to provide a sub-millisecond average response time of just under 0.6ms.[4]
So, with a constant load of 50 concurrent requests we could deliver the home page 443,577,369 times a day. With a constant load of 1,000 concurrent requests, it’d be 144,322,319 times a day. To put that into real-world context, the front-end web server typically handles around 150,000 requests per day, which is between 0.03% and 0.1% of its potential capacity (depending on the number of concurrent requests). In other words, we’ve got plenty of room to grow and to handle unexpected surges in traffic.
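The headroom figures follow from some simple arithmetic, sketched here using only the numbers quoted above (the helper names are ours).

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_ms(pages_per_day: float) -> float:
    """Average time per request implied by a sustained daily total."""
    return SECONDS_PER_DAY * 1000 / pages_per_day


def utilisation_percent(actual_per_day: float, capacity_per_day: float) -> float:
    """Share of the theoretical daily capacity that is actually used."""
    return actual_per_day / capacity_per_day * 100


CAPACITY_AT_50 = 443_577_369     # pages per day at 50 concurrent requests
CAPACITY_AT_1000 = 144_322_319   # pages per day at 1,000 concurrent requests
TYPICAL_LOAD = 150_000           # requests per day on the front end

lowest = utilisation_percent(TYPICAL_LOAD, CAPACITY_AT_50)
highest = utilisation_percent(TYPICAL_LOAD, CAPACITY_AT_1000)

print(f"Average at 1,000 concurrent: {average_ms(CAPACITY_AT_1000):.2f} ms")
print(f"Utilisation: {lowest:.2f}% to {highest:.2f}%")
```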
Meanwhile, with the back-end WordPress server sat idling (since static content is being pulled straight from the cache) it’s free to handle requests for dynamic content such as site searches, e-commerce functionality and page editing. In fact, we try to ensure the daily average CPU load on the WordPress servers remains under 2% so that they too can cope with any sudden surges in traffic.
1. For all benchmarks, the “-k” flag was used to enable keepalive mode.
2. This is a “warm cache” that is automatically refreshed on a daily basis. The cache is also refreshed when pages are edited, comments are posted, etc.
3. Most WordPress site pages are suitable for caching but specific pages can be easily excluded. Whenever a page is edited, the cached version is refreshed so the latest content is always being delivered to site visitors.
4. Although the benchmarks suggest the servers can easily handle over 1,000 concurrent requests, the Linux server we used for running ApacheBench had a ulimit setting of 1,024, which meant that it couldn’t open more than that number of concurrent connections.
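The open-file limit mentioned in the last note can be checked from Python on Linux and other Unix-like systems using the standard `resource` module. The sketch below is a rough pre-flight check only: the helper name is ours, and it ignores the handful of file descriptors a process already has open.

```python
import resource


def can_open(concurrency: int) -> bool:
    """Rough check that `concurrency` sockets fit under the open-file limit."""
    soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return concurrency < soft_limit


soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"Soft limit: {soft_limit}, hard limit: {hard_limit}")
print(f"Room for 1,000 concurrent connections: {can_open(1000)}")
```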