I have been working as a professional developer since 2007, and I have solved many issues over the years. I firmly believe that every issue is different from the others, so you need to use your brain to figure out what is happening. Still, some problems recur across multiple companies: some are easy to spot and solve, while other, more complicated, nasty, cursed ones become part of a person's experience.
When I arrived in the UK I worked on a component that received a lot of messages and generated bespoke dashboards for reporting. That was the first time I ran into this problem: we had set up a small server that pinged the different services and displayed their response times. Under high traffic, this monitoring tool showed that some services had stopped replying, and others had very high response times.
We started investigating, and in the logs of the failing services there was no trace of any error. But the clients of those services were almost always failing. To understand that particular error we had to take a step back from our code and think about the full stack. The services were running in a server based on Netty. Netty is an NIO client-server framework. It is known for its high performance, its extensibility and its resource management. If you need more information you can visit the website of the product.
Netty handles requests in a fairly simple way: there can be several NIO channels, all of them handling requests with dedicated threads. But there are also dispatcher threads that simply accept the incoming connections and pass them to a handling thread. These dispatcher threads, the event loop group threads, are limited in number: the default is twice the number of processors (to cope with Hyper-Threading Technology, which shows 2 logical threads for each physical processor). A nice discussion on this topic can be found here.
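To make the "twice the number of processors" default concrete, here is a minimal sketch in plain Java. It does not call Netty: `defaultEventLoopThreads` is a hypothetical helper, and the `io.netty.eventLoopThreads` system property is, to the best of my knowledge, the override Netty reads, so treat both as assumptions and check them against the Netty version you run.

```java
final class EventLoopThreadsSketch {

    // Hypothetical helper: applies the "twice the number of processors" rule
    // unless an explicit override is supplied.
    static int defaultEventLoopThreads(int availableProcessors, Integer override) {
        int configured = (override != null) ? override : availableProcessors * 2;
        return Math.max(1, configured); // never fewer than one thread
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // Assumption: this is the system property Netty uses as an override.
        Integer override = Integer.getInteger("io.netty.eventLoopThreads");
        System.out.println("cores=" + cores
                + " eventLoopThreads=" + defaultEventLoopThreads(cores, override));
    }
}
```

On an 8-core machine with no override this gives 16 threads: a small, fixed number compared with the thousands of connections a traffic peak can bring.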
Now we know that accepting a connection, although fast, is done by a limited number of threads. Play/Akka use Netty, as mentioned, while Spring Boot uses Tomcat or Jetty. The concept is the same. The servers have different thread models, as you can see from the Spring configuration: Jetty has acceptors and selectors (acceptors are the threads that accept connections), while Tomcat has workers, the threads doing the work (200 at most by default), plus a maximum queue length for incoming connection requests when all request processing threads are in use. But they all accept connections and then delegate the work to dedicated threads.
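The accept-then-delegate idea can be sketched with nothing but `java.net` and an executor. This is not how Netty, Jetty or Tomcat are implemented; `AcceptorWorkerSketch`, its echo handler and the `request` helper are names made up for the illustration. One acceptor thread takes connections off the OS queue and hands them to a fixed pool of workers.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

final class AcceptorWorkerSketch implements AutoCloseable {
    private final ServerSocket serverSocket;
    private final ExecutorService workers;

    AcceptorWorkerSketch(int workerThreads, int backlog) {
        try {
            // Port 0 = any free port; backlog = size hint for the OS queue of pending connections.
            serverSocket = new ServerSocket(0, backlog, InetAddress.getLoopbackAddress());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        workers = Executors.newFixedThreadPool(workerThreads);
        Thread acceptor = new Thread(this::acceptLoop, "acceptor");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    int port() {
        return serverSocket.getLocalPort();
    }

    private void acceptLoop() {
        while (!serverSocket.isClosed()) {
            try {
                Socket connection = serverSocket.accept();  // acceptor: only accepts
                workers.submit(() -> handle(connection));   // worker: does the actual work
            } catch (IOException | RejectedExecutionException e) {
                return; // server socket closed or pool shut down
            }
        }
    }

    // Toy request handling: read one line, echo it back.
    private static void handle(Socket connection) {
        try (Socket c = connection;
             BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
             PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
            out.println("echo: " + in.readLine());
        } catch (IOException e) {
            // The client went away; nothing to do in this sketch.
        }
    }

    // Tiny client used to exercise the server.
    static String request(int port, String line) {
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port);
             PrintWriter out = new PrintWriter(s.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
            out.println(line);
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            serverSocket.close(); // unblocks accept() and ends the acceptor loop
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            workers.shutdown();
        }
    }

    public static void main(String[] args) {
        try (AcceptorWorkerSketch server = new AcceptorWorkerSketch(4, 50)) {
            System.out.println(request(server.port(), "ping")); // prints "echo: ping"
        }
    }
}
```

The two numbers passed to the constructor play the roles of the worker limit and the pending-connection queue length described above; everything beyond them has to wait or is turned away.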
What happens when the traffic grows too much? The acceptor or event loop group threads (or whatever your server calls them) start to create threads, but the queue of pending connection requests may keep growing. When the server cannot cope with the requests, no worker is created for the ones still waiting. So from the side of your application the execution may look normal (the requests that are accepted are processed normally), but from the side of the client it may look like a disaster: long connection times, or even no connection at all.
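A rough in-process model of that saturation, assuming a fixed worker limit and a bounded queue: `SaturationSketch` is a hypothetical class built on `ThreadPoolExecutor`, not on any server's internals. A real server queues connections in the OS rather than in an executor, and an overflowing client sees timeouts or refused connections rather than an exception, but the arithmetic is the same.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class SaturationSketch {

    /** Fires a burst of slow requests and returns {accepted, rejected}. */
    static int[] burst(int workerThreads, int queueLength, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workerThreads, workerThreads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueLength));
        CountDownLatch release = new CountDownLatch(1); // keeps every request "slow"
        int accepted = 0;
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // simulated long-running work
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                    accepted++;
                } catch (RejectedExecutionException e) {
                    rejected++; // all workers busy and the queue is full
                }
            }
        } finally {
            release.countDown(); // let the accepted requests finish
            pool.shutdown();
        }
        return new int[] {accepted, rejected};
    }

    public static void main(String[] args) {
        int[] result = burst(2, 3, 10);
        System.out.println("accepted=" + result[0] + " rejected=" + result[1]);
        // prints "accepted=5 rejected=5"
    }
}
```

The five accepted requests complete without any error on the server side, which is why the service logs look clean while half of the clients fail.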
In a modern microservice system this can have another consequence: a server that is unable to accept any requests is also unable to accept health checks, so it may be restarted unexpectedly while it is doing normal work.
And what is the broken pipe error? That is another possible consequence. A broken pipe happens when the server tries to write to a connection that has already been closed. It is easy to simulate: write a piece of code that connects to a server, sends a request and closes the connection without waiting for a response. Now let’s imagine that the client connects to a server that is extremely slow: after several seconds the connection is still pending and the client decides to time out the request (and maybe to retry, because it is set up to use Hystrix). When the server is finally able to start a worker for that closed connection and tries to write the response, it raises a broken pipe exception.
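The simulation described above can be sketched in a single Java class. `BrokenPipeSketch` is a made-up name, and the exact exception message depends on the operating system (typically "Broken pipe" on Linux, sometimes a connection reset elsewhere), so the code only reports that a write failed. The first write usually still succeeds; it is a later one that fails, once the OS has learned that the peer is gone.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

final class BrokenPipeSketch {

    /** Returns the exception the slow server hits while replying, or null if every write succeeded. */
    static IOException slowServerReply() {
        try (ServerSocket server = new ServerSocket(0, 50, InetAddress.getLoopbackAddress())) {
            // Impatient client: connects, sends a request and closes without reading the reply.
            try (Socket client = new Socket(server.getInetAddress(), server.getLocalPort())) {
                client.getOutputStream().write("GET /report\n".getBytes(StandardCharsets.US_ASCII));
            }
            // Slow server: only now takes the connection out of the pending queue.
            try (Socket connection = server.accept()) {
                OutputStream out = connection.getOutputStream();
                byte[] chunk = new byte[1024];
                for (int i = 0; i < 100; i++) {
                    try {
                        out.write(chunk); // writing the response to a client that is gone
                        out.flush();
                    } catch (IOException e) {
                        return e;
                    }
                    Thread.sleep(20); // give the OS time to notice the closed peer
                }
                return null;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("server-side write failed with: " + slowServerReply());
    }
}
```

Note that the client here gives up before the server has even accepted the connection, which is exactly the situation of a request that sat too long in the queue.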
On the client side, instead, if the client is using Hystrix and opens many connections to a blocked service, it may run into a SEMAPHORE error, because Hystrix has no permit left for that outgoing request (the error is something like “could not acquire a semaphore for execution and no fallback available“).
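The idea behind that error can be sketched with a plain `java.util.concurrent.Semaphore`. This is not Hystrix's API: `SemaphoreGuardSketch` and its `call` method are hypothetical, and they only illustrate a cap on concurrent outgoing calls that rejects immediately instead of waiting. The nested call in `demo` stands in for a second request arriving while the first is still stuck on the blocked service.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class SemaphoreGuardSketch {
    private final Semaphore permits;

    SemaphoreGuardSketch(int maxConcurrentRequests) {
        this.permits = new Semaphore(maxConcurrentRequests);
    }

    /** Runs the call if a permit is free, otherwise rejects immediately. */
    <T> T call(Supplier<T> remoteCall) {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("could not acquire a semaphore for execution");
        }
        try {
            return remoteCall.get();
        } finally {
            permits.release(); // the permit is held for as long as the remote call takes
        }
    }

    static String demo() {
        SemaphoreGuardSketch guard = new SemaphoreGuardSketch(1);
        // The outer call holds the only permit while the inner one is attempted.
        return guard.<String>call(() -> {
            try {
                return guard.call(() -> "second call went through");
            } catch (IllegalStateException e) {
                return "second call rejected: " + e.getMessage();
            }
        });
    }

    public static void main(String[] args) {
        System.out.println(demo());
        // prints "second call rejected: could not acquire a semaphore for execution"
    }
}
```

The slower the downstream service, the longer each permit is held, so a blocked service exhausts the permits quickly even at moderate traffic.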
That’s all for the problem of unexpected traffic peaks. The only solution is to have more instances of the service available, so that there are more threads accepting and handling requests.