Writing this here, since it caused some issues at the Day Job, and took a little bit of work to debug. Hopefully this will save some of you some time, and will jog my memory should something like this happen again.
Anyway, last week, one of my team wanted to deploy a new release of our Access Management System (ARIA), which involved deploying a bunch of containers. The release procedure worked fine, however after the deploy, the main web app container was entirely unable to talk to our API layer.
My team did a little bit of debug, but it was at this point that it got escalated to me. I rolled the live environment back, and began debugging the problem.
On the face of it, it seemed that networking, or at least name resolution, was no longer working from within the container. A curl call from the command line produced:
curl: (6) getaddrinfo() thread failed to start
However, a connection to an IP address would work. So, I began looking at networking / name resolution. The next step was to see what the name servers were doing… however, nslookup gave me:
isc_thread_create(): fatal error: pthread_create(): Operation not permitted
Interesting… so something was blocking creating new threads within the container. Likely the security model that docker was running… not sure why this would change, but I confirmed this by redeploying with SECCOMP turned off:
security_opts:
- "apparmor:unconfined"
- "seccomp:unconfined"
Confirmed, networking was working.
Not sure what’s changed, but it would appear that somewhere down the line the base Apache Linux image has updated, and is now using a different system call for starting threads. Likely a new version of GLIBC has been rolled into the container somewhere.
The final fix was to update the various containers to make sure they were all running on the newer base image, and then to redeploy docker on our estate so that it was running the latest version.
Boom.
Everything back to normal, hope this saves you some time!