Tuning web server for more connections
Recently, I was assigned a task: figure out how to tune a web server so it can handle more connections at once. I know how to install and set up a web server, but I had never really thought about how to tune one for performance. This is a great chance to get to know your web server better, so let’s dig in!
Disclaimer:
Before applying any parameter to your service/application, it’s important to know what that parameter means. Any tuning is useless unless you know how that parameter changed your service/application. Do this at your own risk, and only change one parameter at a time. Check whether a change makes your service/application unstable; if it did, roll back, and proceed to the next change if everything works fine.
The task I was assigned was to improve performance for an Amazon EC2 instance running Amazon Linux 2, but any Linux distro should work fine, too.
The results in this post came from Ubuntu Server 20.04.
Kernel Options
Unless specified, these options are going to be written in /etc/sysctl.conf
Virtual Memory
Every program in the system has its own virtual memory, and it’s the operating system’s job to manage that memory.
We have four parameters to study here.
```
vm.swappiness
vm.dirty_ratio
vm.dirty_background_ratio
vm.vfs_cache_pressure
```
Let’s take a look at each of these parameters
vm.swappiness
This controls how aggressively the kernel swaps memory pages out to swap space. The lower the value, the less swapping is used and the more memory pages are kept in physical memory.
The default value of vm.swappiness is usually 60. Using swap can make your program less responsive, so if you need your program to stay responsive at all times, it’s better to lower the value of this parameter.
For MariaDB, it’s recommended to set vm.swappiness to 1. For a web server, setting this value to 10 should work.
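For example, to persist the web-server value suggested above (the value 10 comes from the text), a minimal sketch:

```
# /etc/sysctl.conf
vm.swappiness = 10
```

After editing the file, run `sysctl -p` as root to apply it, or `sysctl vm.swappiness=10` to set it immediately without persisting.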
vm.dirty_ratio
For vm.dirty_ratio, we need to talk about what dirty means first.
Dirty memory is data that sits in physical memory but still needs to be written to disk. Because a hard drive’s write speed is usually much slower than RAM’s, and there are lots of write tasks in the system, it’s understandable that the system would first write data into RAM temporarily, then, when there’s too much data in memory or the system is idle, write that data out to the hard drive and release those resources.
When your program creates too much dirty memory, it may prevent other programs from getting memory, resulting in lower system performance, so it’s a good idea to write dirty memory to the drive before it consumes too much RAM.
vm.dirty_background_ratio
The difference between vm.dirty_ratio and vm.dirty_background_ratio is that vm.dirty_background_ratio is the percentage of memory at which the kernel starts writing dirty pages out in the background, while vm.dirty_ratio is the percentage at which a process generating writes is forced to perform the writeback itself, in the foreground.
So the value of vm.dirty_background_ratio should always be lower than vm.dirty_ratio.
An example of these values is
```
vm.dirty_ratio = 75
```
You need to consider your machine’s spec for these values. Whether you have lots of RAM or a RAID card, it will make a big difference in how these values behave.
vm.vfs_cache_pressure
This percentage value controls the tendency of the kernel to reclaim the memory being used for caching of directory and inode objects.
If your program is going to open lots of files or access lots of directories, it’s a good idea to set this value lower; otherwise, set it higher.
The default value of this option is 100. A higher value makes the kernel reclaim these caches more aggressively; a lower value makes it keep them longer.
Network
BBR
BBR is a TCP congestion control algorithm developed by Google. It won’t increase your link’s raw bandwidth, but it can make much better use of it.
For more detailed information about BBR, please refer to TCP BBR congestion control comes to GCP - your Internet just got faster.
In short, BBR can bring higher throughput and lower latency to your system. It uses recent measurements of the network’s delivery rate and round-trip time to build a model, which is then used to control how fast it sends data and the maximum amount of data it’s willing to allow in the network at any time.
BBR’s code has been merged into the Linux kernel since version 4.9, so any recent Linux distro should have the tcp_bbr kernel module.
Use command below to enable BBR in your system.
```
# modprobe tcp_bbr
```
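Loading the module alone isn’t enough; BBR must also be selected as the congestion control algorithm. A minimal sketch (the fq qdisc line follows Google’s common recommendation for BBR and may not be required on newer kernels):

```
# /etc/sysctl.conf
net.core.default_qdisc = fq
net.ipv4.tcp_congestion_control = bbr
```

After `sysctl -p`, you can verify with `sysctl net.ipv4.tcp_congestion_control`.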
Connection, socket buffer and other tunables
We are going to talk about these parameters:

```
net.core.somaxconn
net.core.rmem
net.core.wmem
net.core.optmem_max
net.ipv4.tcp_fastopen
net.ipv4.tcp_tw_reuse
net.ipv4.tcp_max_syn_backlog
net.ipv4.tcp_window_scaling
```
net.core.somaxconn
This option sets the maximum number of connections that can be queued on a listening socket (the backlog limit for listen()). It affects your web server’s performance a lot.
The default value of this option is 128; it was raised to 4096 in kernel 5.4.
net.core.rmem
If you type sysctl net.ipv4.tcp_rmem, you should see three numbers, like this:

```
tsundere@ubuntu-server:~$ sysctl net.ipv4.tcp_rmem
```
The first number is the minimum TCP buffer size, the second is the default TCP buffer size, and the last is the maximum TCP buffer size.
Why do you need a bigger buffer? There are two kinds of scenarios: one where you have a really fast network environment, and one where you’re communicating over a high-latency WAN. Both scenarios can benefit from a larger TCP buffer size.
First, let’s refresh our memory on how TCP acknowledgements work.
To ensure the client received the packets it sent, the server can only keep a limited amount of unacknowledged data in flight; once that window is full, it can’t send any new packets, making the network less efficient. By increasing the buffer size, you allow more data to be in flight at once, increasing efficiency in your system.
For this option, you need to calculate the value for your system yourself: the maximum should be roughly your network speed in bytes per second times your round-trip time in seconds (the bandwidth-delay product). For example, for a 1 Gbps network with a 4 ms round-trip time, the maximum value of net.ipv4.tcp_rmem should be about 500 kbytes, or 512000. Setting the maximum buffer too big will just result in network congestion.
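The arithmetic above can be checked in the shell (the numbers are the ones from the example; 1 Gbps is 125,000,000 bytes per second):

```shell
# Bandwidth-delay product: link speed in bytes/s times round-trip time in seconds.
BANDWIDTH_BYTES=125000000   # 1 Gbps expressed in bytes per second
RTT_MS=4                    # round-trip time in milliseconds
BDP=$((BANDWIDTH_BYTES * RTT_MS / 1000))
echo "$BDP"                 # 500000 bytes, roughly the 500 kbytes from the text
```

The text rounds this to 512000, which is 500 × 1024.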
net.core.wmem
Refer to net.core.rmem above; the same three values apply to the send buffer.
net.core.optmem_max
This option affects the memory allocated for the cmsg list maintained by the kernel, which contains “extra” packet information like SCM_RIGHTS or IP_TTL.
Increasing this option allows the kernel to allocate more memory, as needed, for control messages sent on each connected socket (including IPC sockets/pipes).
The content above is copied from In Linux, how do I determine optimal value of optmem_max?, nice answer.
net.ipv4.tcp_fastopen
Yeah…we need to talk about TCP three-way handshake, again.
In a normal TCP three-way handshake, data can only be sent after the client sends its “ACK” to the server; only then can the server send data back to the client. This is not efficient. Is there any way we can send data earlier?
TCP Fast Open solves this issue. When the client connects to the server for the first time, the server issues a cookie and delivers it in the “SYN-ACK” it sends back, so now the client holds the cookie.
When the client later creates another connection to the server, it sends the cookie along as well. The server identifies the client by that cookie, knows this client has connected before, and can start sending data right away: instead of a plain “SYN-ACK”, the server sends “SYN-ACK + Data”, without waiting for the client’s final “ACK”.
The options for net.ipv4.tcp_fastopen are:
0, disabled
1, enabled on outgoing connections (client)
2, enabled on listening sockets (server)
3, enabled on both outgoing connections and listening sockets
net.ipv4.tcp_tw_reuse
This will enable reuse of TIME-WAIT sockets for new connections when it is safe from a protocol viewpoint.
Basically, this means the server can reuse a socket that was already created, has been used, and is now idle. By reusing the connection, the system doesn’t need to create a new socket, resulting in faster connection setup.
The available values of this option are:
0, disable
1, global enable
2, enable for loopback traffic only
net.ipv4.tcp_max_syn_backlog
This option tells the kernel how many half-open connections can be kept in the connection queue. If you have lots of clients connecting to the server and the server can’t handle the connections fast enough, putting those connections in a queue instead of refusing them is a good idea.
net.ipv4.tcp_window_scaling
This option allows the kernel to scale the TCP window size beyond 64 KB. If it is enabled, programs can increase the size of their socket buffers and the window scaling option will be employed.
This option has two available values:
0, disabled
1, enabled
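Putting the network tunables above together, a hypothetical /etc/sysctl.conf fragment might look like this. Every value here is an illustrative starting point to validate against your own workload, not a recommendation:

```
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_fastopen = 3
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_window_scaling = 1
# min, default, max buffer sizes in bytes; max taken from the 1 Gbps / 4 ms example
net.ipv4.tcp_rmem = 4096 131072 512000
net.ipv4.tcp_wmem = 4096 65536 512000
```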
File Descriptor
This will affect the maximum number of files your system/user/program can open at once.
Kernel Option
fs.file-max
This option controls the maximum number of files that your system can open at the same time.
Bigger is usually better.
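Before raising this, it’s worth checking the current numbers. These read-only commands assume a Linux system:

```shell
# System-wide maximum number of open file handles
cat /proc/sys/fs/file-max
# Currently allocated, unused, and maximum file handles
cat /proc/sys/fs/file-nr
# Per-process soft limit for the current shell
ulimit -n
```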
/etc/security/limits.conf
This file is used by pam_limits, and it can limit how many resources a user can get.
There are two kinds of limits, soft and hard. You can think of soft as the default value and hard as the upper limit.
If I want to change how many files the user nginx can open at the same time, I need to write lines like these into /etc/security/limits.conf:
```
nginx soft nofile 1048576
```
Systemd Limit
Systemd also limits how many files a service can open at the same time.
To change the maximum number of files a service can open, add a LimitNOFILE=<AMOUNT> option to the [Service] block of the unit file.
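Rather than editing the unit file directly, a drop-in override keeps the change separate from the packaged unit. A sketch for nginx (the path and value here are illustrative):

```
# /etc/systemd/system/nginx.service.d/limits.conf
[Service]
LimitNOFILE=1048576
```

Run `systemctl daemon-reload` and restart the service for the new limit to take effect.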
RAM & swap
If you have enough RAM but your system is still using swap, besides changing the value of vm.swappiness, you can also use zram to create swap.
Zram
Zram is basically swap in RAM. It creates a swap device backed by RAM, and it also supports compression, making for a much faster and more efficient swap.
The commands below are an example of creating a zram device.
```
# modprobe zram
```
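modprobe only creates the device; it still needs a size and a swap signature before it can be used. A sketch of the remaining steps, run as root (the 2G size and lz4 algorithm are illustrative choices; the algorithm must be set before the size):

```
# modprobe zram
# echo lz4 > /sys/block/zram0/comp_algorithm
# echo 2G > /sys/block/zram0/disksize
# mkswap /dev/zram0
# swapon -p 100 /dev/zram0
```

The -p 100 gives the zram device a higher priority than disk-backed swap, so it is used first.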
Web Server
For my task, I was asked to tune Apache Web Server and Nginx.
These are the options I found helpful when tuning for performance.
Apache
options
ServerLimit
This option sets the maximum number of child processes that Apache can have.
MaxRequestWorkers
This option sets the maximum number of requests that Apache will process simultaneously, across all child processes.
mpm_event_module
The event MPM (mpm_event_module) was introduced as an experimental module in Apache 2.2 and became the default MPM in 2.4. It is designed to allow more requests to be served simultaneously by passing off some processing work to the listening threads, freeing up the worker threads to serve new requests.
There are some options for this module.
StartServers
Number of child processes created at startup.
MinSpareThreads
Minimum number of idle threads available to handle request spikes.
MaxSpareThreads
Maximum number of idle threads available to handle request spikes.
MaxRequestWorkers
Maximum number of connections that will be processed simultaneously.
MaxConnectionsPerChild
Limit on the number of connections that an individual child server will handle during its life.
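As a sketch, these options might be combined like this in an Apache config. All numbers are hypothetical starting points, not recommendations; note that MaxRequestWorkers should not exceed ServerLimit × ThreadsPerChild (here 16 × 25 = 400):

```
<IfModule mpm_event_module>
    StartServers            4
    MinSpareThreads         25
    MaxSpareThreads         75
    ThreadsPerChild         25
    ServerLimit             16
    MaxRequestWorkers       400
    MaxConnectionsPerChild  10000
</IfModule>
```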
Nginx
http block
worker_rlimit_nofile
This option sets the maximum number of files an nginx worker can open. Don’t set this any larger than the system limit.
worker_connections
The maximum amount of connections a worker can handle at once.
gzip
This option enables gzip compression for HTTP responses.
open_file_cache
This option can enable caching for file descriptors, very helpful when serving static files.
server block
listen
fastopen
This option tells nginx to use TCP Fast Open for connections; remember to enable kernel support for TCP Fast Open first.
backlog
This option limits the maximum length of the queue of pending connections. You can set this option to the same value as net.core.somaxconn.
reuseport
This option tells nginx to create an individual listening socket for each worker process, allowing the kernel to distribute incoming connections between the worker processes.
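A hypothetical nginx configuration pulling the options above together might look like this. Every number is an illustrative starting point, and the root path is made up:

```
worker_processes auto;
worker_rlimit_nofile 65535;

events {
    worker_connections 16384;
}

http {
    gzip on;
    open_file_cache max=10000 inactive=30s;

    server {
        listen 80 fastopen=256 backlog=4096 reuseport;
        root /var/www/html;
    }
}
```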
Results
We have seen lots of tunables, and now it’s time to see the numbers!
Testing VMs are using Ubuntu Server 20.04
Both server and client are equipped with 2 vCPU, 2GB RAM
Test was done using hey
Before
Let’s start by testing our web servers with 1500 clients, and creating 15,000 requests.
Nginx
```
$ hey -c 1500 -n 15000 http://192.168.133.142:8080
```
We got some errors with Nginx; hopefully there won’t be any errors after tuning.
Apache
Using the same testing params as Nginx.
```
tsundere@stress-test:~$ hey -c 1500 -n 15000 http://192.168.133.142:8081
```
Although Apache takes more time, it completes all the requests. Good job!
After
I changed the params to 9000 clients and 90,000 total requests, six times the load from before. Let’s see the results.
Nginx
```
$ hey -c 9000 -n 90000 http://192.168.133.142:8080
```
Apache
While trying 9000 clients and 90,000 requests, Apache failed lots of requests, so I decreased the load to 6000 clients and 60,000 requests.
```
$ hey -c 6000 -n 60000 http://192.168.133.142:8081
```
Conclusion
I think there are still more options and parameters to tune, and the web servers’ logs should be helpful when we want to push the number of connections further. But after some simple tweaks, the web server can handle 4x to 6x the load it did before. I think it’s worth the tuning and studying.