I have a large Docker swarm (the old-style Docker Swarm API running in a container). There is plenty of capacity (multiple TB of RAM, etc.).
When jobs (a multibranch pipeline job in this case) allocate a Docker node (by labels), one of these things happens:
- Node is allocated immediately
- Node is not allocated, and the Jenkins logs indicate why (e.g. the swarm is full according to the maximums I set in the Jenkins configuration)
- Node is allocated with a significant delay (minutes). The logs do not indicate why; there is no Docker Plugin log activity until the node is allocated.
- Node is allocated with a ridiculous delay (I just had one take 77 minutes). The logs do not show any activity from the Docker Plugin until the node is allocated. Other jobs have had containers allocated in the meantime (and those events are in the logs). One interesting thing I noticed: the job sometimes gets its container only once a later build of the same job requests one (they run in parallel), and then the later build waits (forever?).
How can I troubleshoot this behavior, especially the fourth case (the very long delay)?
Because it is intermittent, I can't be sure, but it seems to have gotten worse after upgrading the Docker Plugin from 1.0.x to 1.1.x (and possibly also after upgrading Jenkins from 2.92 to 2.93).
In fact, I have two Jenkins instances, one upgraded to plugin 1.1.1 and the other on 1.1, and the one running 1.1 is currently not exhibiting these issues (but it is also under less load).