Exploring a weird HTTP issue
Posted on 2022-05-04 in Programming
Today I'd like to explain how I tried to solve a weird HTTP issue I found at work. I hope you will find the method and the steps I took useful or interesting if you ever encounter something similar.
The issue was this: I needed to dynamically generate a ZIP file from a list of files that couldn't be known before generation. For small ZIP files, everything went fine. For bigger ones, the download would stop before the end and I would only get a corrupted ZIP file. This was of course only happening in production: locally, the download could be long but it would always complete. The download would always stop when the request took at least 30s to complete. Since the project is configured with 30s timeouts at multiple levels to avoid overly long requests, that alone wasn't specific enough to pinpoint the culprit, so some digging was required.
The website is written in Python using the Django web framework. In production, it's deployed on Kubernetes and runs under gunicorn, a well-known and widely used Python application server. In front of gunicorn, there is an nginx reverse proxy, mostly to buffer file upload requests (see this article for more information on that).
Now that you have this context, let's dive in.
gunicorn?
I didn't start my exploration methodically. I had an intuition it could be linked to gunicorn: the Django application would take too much time to generate the ZIP and gunicorn would time out. By default [1], a request times out after 30s. This couldn't happen locally since I use Django's built-in server, which doesn't have a timeout. And in production, I could see in the logs that gunicorn was restarting its worker process.
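For reference, the kind of invocation we're talking about looks roughly like this; the module name and worker count are made up, and --timeout 30 only makes the default explicit:

    # Hypothetical invocation: 30s is already gunicorn's default request timeout.
    gunicorn --workers 4 --timeout 30 myproject.wsgi:application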
So, I timed the ZIP creation process. In some cases, it could get close to 30s and even exceed it. This was mostly because we were trying to compress the files. Since they were PDFs and images, we didn't really need to compress them (and it was causing issues anyway). So, I changed how we create the ZIP file to skip the compression entirely and produce an uncompressed ZIP file. In the worst cases, the ZIP file creation now only took around 5s.
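The change boils down to something like this. It's a minimal sketch, not our actual view: the view name, file paths and response handling are placeholders; the important part is ZIP_STORED, which disables compression.

    import io
    import zipfile

    from django.http import HttpResponse


    def download_all(request):
        # Hypothetical view: in reality the list of files is computed at request time.
        file_paths = ["report-1.pdf", "scan-1.jpg"]

        buffer = io.BytesIO()
        # ZIP_STORED skips compression entirely: PDFs and images barely
        # compress anyway, and compressing them is what made creation slow.
        with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as archive:
            for path in file_paths:
                archive.write(path, arcname=path)

        response = HttpResponse(buffer.getvalue(), content_type="application/zip")
        response["Content-Disposition"] = 'attachment; filename="download.zip"'
        return response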
At this stage, I thought I had solved the problem. However, after some more testing in pre-production, I found out that sometimes the request would still abort and result in a corrupted ZIP file. I had to dig some more!
More gunicorn issues?
Since gunicorn is serving the file, I was thinking that maybe it was blocked until the file was completely downloaded. It's an issue I had already encountered with file uploads and had solved with an nginx reverse proxy. Maybe something similar was happening, although I expected nginx to buffer in both directions.
To test this, I configured an nginx instance locally and added limit_rate 1M; [2] to the server block to slow down downloads so I could witness the problem. I then launched the site with gunicorn with only one worker and triggered a download. While it was in progress, I made requests to gunicorn both directly and through nginx: gunicorn responded as expected while the download continued. I then switched gunicorn off, just to be sure. The download continued as expected. So, I could rule out gunicorn: nginx was correctly buffering the response.
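The local test setup looked roughly like this; the ports are made up, and limit_rate is the only directive that actually matters here:

    # Minimal local reverse proxy used to reproduce the issue.
    server {
        listen 8080;

        # Throttle responses to 1 MB/s so the download takes long
        # enough to observe what happens around the 30s mark.
        limit_rate 1M;

        location / {
            # gunicorn running locally with a single worker.
            proxy_pass http://127.0.0.1:8000;
        }
    }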
Let's go beyond 30s
So far, so good. However, locally, the download took less than 30s. So, I tweaked the limit_rate parameter again to make sure I would hit the 30s mark. The download failed and stopped at 30s.
For reasons that are beyond the scope of this article, when the user triggers the download through the interface, as I was doing, we actually download the file in JS and then rely on something like URL.createObjectURL(new Blob([response.body])) to save it. It's not something I'd generally recommend, since it adds complexity and the browser can handle downloads just fine on its own. But that's how it is.
So, I tried to query the API directly without relying on any JS layer. It worked, so it looked like a timeout issue in the frontend code.
Once again, I had run into similar issues with file uploads a while back: the frontend also had timeouts configured. As it turned out, those timeouts were also an issue here. The project uses SuperAgent for its HTTP requests, so I just had to configure a longer timeout for this particular request.
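The fix looked something like this. The URL and values are illustrative, and the real code also handles errors and the Blob-saving part mentioned above:

    import request from 'superagent';

    // Illustrative sketch of the per-request timeout override.
    async function downloadArchive() {
      // response: how long the server may take before it starts answering;
      // deadline: how long the whole download may take once started.
      const response = await request
        .get('/api/download-url/')
        .responseType('blob')
        .timeout({ response: 30 * 1000, deadline: 10 * 60 * 1000 });
      return response;
    }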
I tested again, and it worked! I thought I had solved it. It turned out there was another issue I couldn't spot locally.
To the server!
At this stage, I once again thought the problem was solved. So I deployed my solution in pre-production and tested. And it failed. Again.
Our pre-production environment is a bit peculiar: we have two nginx reverse proxies in front of the app (we have only one in production). I thought it could be an interaction between the two.
So I tweaked the configuration of both nginx instances. I increased all the relevant timeout values: proxy_connect_timeout 600;, proxy_send_timeout 600;, proxy_read_timeout 600; and send_timeout 600;. I even disabled proxy buffering, just in case, with proxy_buffering off;.
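In the proxy configuration, those directives sat roughly like this (the upstream name is a placeholder):

    location / {
        proxy_pass http://app_upstream;

        # All bumped well above the defaults to rule the proxies out.
        proxy_connect_timeout 600;
        proxy_send_timeout    600;
        proxy_read_timeout    600;
        send_timeout          600;

        # Just in case buffering was part of the problem.
        proxy_buffering off;
    }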
It had no effect. At this stage, I resorted to testing each step in the test environment separately. To do this, I disabled all HTTPS redirections and disabled permissions on my backend view so I could access the URL directly with curl [3].
Our test environment is on Kubernetes, but the same method applies to any environment. I connected to the pod where gunicorn and its nginx sidecar were running and tried the following:
- Talking directly to gunicorn: curl -v --output gunicorn.zip localhost:8000/api/download-url [4]. It worked perfectly and the ZIP file was valid (I unzipped it).
- Talking to the nginx sidecar: curl -v --output proxy.zip localhost:80/api/download-url/. Again, it worked.
I then connected to the pod of the second reverse proxy and ran:
curl -v --output proxy.zip https://test.example.com/api/download-url/
And it failed.
What am I missing?
To dig deeper, still thinking the issue was in the second proxy, I increased its log level with error_log stderr debug;. I got this error:
epoll_wait() reported that client prematurely closed connection, so upstream connection is closed too while reading upstream, client: <IP>, server: <SERVER_NAME>, request: "GET /api/download-url/ HTTP/1.1", upstream: "http://<IP>:80/api/download-all/", host: "<HOST>"
That's a weird error: curl (which I thought was the client) didn't close the connection. What could be happening?
I googled a bit but didn't find anything useful. I tried again, thinking at each step about what the request actually did and what it traversed. And then I got it: by using https://test.example.com, I was not talking directly to the second proxy. I was going through the load balancer first. It's obvious once you know it, but I had to think about it.
So I tried again with curl -v --output proxy.zip -H 'Host: test.example.com' localhost/api/download-url/ to skip the load balancer and really talk to the proxy directly. And it worked!
So the problem came from the load balancer after all! And the message reported by nginx is consistent with this: the load balancer is its client, and it is the one closing the connection. If it hadn't been, I'd have been out of ideas and would have needed to find an alternative solution to my problem.
I searched the documentation and found out that there is a timeout for these requests (it's not hit for file uploads for some reason, or I'd have known about it sooner) and that we can configure it. The documentation for GKE is here. I applied the suggested configuration, roughly sketched below, and… TADA! Problem solved for good. At least as long as the download doesn't take longer than the configured timeout, which is unlikely for our use cases.
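On GKE, this boils down to attaching a BackendConfig with a larger timeoutSec to the Service behind the Ingress. A minimal sketch with placeholder names and ports; check the linked documentation for the authoritative version:

    apiVersion: cloud.google.com/v1
    kind: BackendConfig
    metadata:
      name: long-downloads
    spec:
      # Backend service timeout of the load balancer, in seconds.
      timeoutSec: 600
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: app
      annotations:
        # Attach the BackendConfig above to this Service.
        cloud.google.com/backend-config: '{"default": "long-downloads"}'
    spec:
      selector:
        app: app
      ports:
        - port: 80
          targetPort: 8000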
Conclusions
It took more time than I originally anticipated. The fact that several problems were tied together made the resolution harder:
- The ZIP creation was too slow, so gunicorn could time out on some occasions [5].
- The frontend had a timeout, so the request also couldn't complete because of that.
- The request took so long that we also hit a timeout at the infrastructure level. This one could only be found in a test environment close enough to production. To make things worse, I forgot about the load balancer and thought I was talking to a component directly when I actually wasn't.
Luckily for me, I had already encountered a similar problem with file uploads. That time, a quick Google search pointed me to using a reverse proxy to solve the issue. So, as always, the more experience you have, the more likely you are to have an idea or a lead on where the problem is.
In the end, I had to fall back on the good old method of testing each component one after another to find out what was actually causing the issue. It's long and a bit tedious, but it works.
While I was doing this, I couldn't help but think of other solutions. It's a good thing I didn't stop digging, because most of them wouldn't have worked and all of them would have needlessly increased the complexity of the project. The solutions I thought of were:
- Creating the ZIP file with gunicorn and using the nginx X-Accel-Redirect feature to quickly hand the response over to nginx. It would have been useless since nginx was already buffering the response correctly.
- Uploading the file to bucket storage and redirecting the user to it. It would have worked, but I would also have had to clean up the ZIP files after a while.
- Moving the ZIP creation to a dedicated service altogether if it had to stay slow. It could have helped, but I would still have had the long download issue.
I hope you will find this useful. If you have a question or a remark, please leave a comment below.
[1] And we use this default value.
[2] You may wonder why I used this instead of the browser tools. I started with the throttle tool of the browser, but by default it was way too slow. I managed to configure it, but on some occasions the limit wasn't enforced correctly. And later, I needed to test with curl anyway. So in the end, it saved me quite a lot of time.
[3] It goes without saying that this environment doesn't contain any sensitive information. I used random public test files to test my ZIP download.
[4] It's a placeholder URL, as you probably guessed.
[5] It could also seem unresponsive to Kubernetes probes, so Kubernetes would restart it. I didn't mention this sooner to make reading easier. Since our requests are now fast enough, tweaking the probes isn't necessary, so I didn't do it.