DiskSpaceMonitorDescriptor (used to check free space on the temp and workspace partitions) inherits (with a few intermediate classes) from AbstractNodeMonitorDescriptor, whose default scheduling interval is one hour. That means if an agent runs out of space it could take an hour before Jenkins detects the problem and takes the node offline. (There are some other code paths – such as onConnect – that can trigger an update, but I believe one hour remains the worst case.)
I've run into this multiple times: a job fills up an agent, a subsequent job fails on that agent, yet the agent is still marked as online.
I believe one hour is too long an interval for such a cheap check, but I am unsure how to proceed:
- Change AbstractNodeMonitorDescriptor to a "more reasonable" default?
- Make this a configurable value?
- Make it a configurable value per check?
- A fancy dynamic scheduler with backoff?
My own inclination is that one minute would be a reasonable and unsurprising default.
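To make the "configurable value per check" option concrete, here is a rough sketch of how an interval could be resolved: a per-check setting wins, then a global setting, then a one-minute default. Everything in it is hypothetical and not existing Jenkins API: the class name MonitorInterval, the example.nodeMonitor.* property keys, and the 10-second lower bound are assumptions made purely for illustration.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; not part of Jenkins. */
public final class MonitorInterval {
    // Proposed default from the discussion above: one minute.
    static final long DEFAULT_MILLIS = TimeUnit.MINUTES.toMillis(1);
    // Assumed floor so a typo cannot make the check run continuously.
    static final long MIN_MILLIS = TimeUnit.SECONDS.toMillis(10);
    // Made-up property names, in seconds.
    static final String PREFIX = "example.nodeMonitor.";
    static final String SUFFIX = "intervalSeconds";

    private MonitorInterval() {}

    /** Per-check value, else global value, else the default. */
    static long resolveMillis(String monitorName, Properties props) {
        String perCheck = props.getProperty(PREFIX + monitorName + "." + SUFFIX);
        String global = props.getProperty(PREFIX + SUFFIX);
        for (String raw : new String[] {perCheck, global}) {
            Long seconds = parseSeconds(raw);
            if (seconds != null) {
                return Math.max(MIN_MILLIS, TimeUnit.SECONDS.toMillis(seconds));
            }
        }
        return DEFAULT_MILLIS;
    }

    /** Returns null for missing, non-numeric, or non-positive values. */
    private static Long parseSeconds(String raw) {
        if (raw == null) {
            return null;
        }
        try {
            long value = Long.parseLong(raw.trim());
            return value > 0 ? value : null;
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

In a real change the resolved value would presumably be what each descriptor passes up as its scheduling interval, and the settings would more likely live in the node monitor configuration UI than in properties. An invalid per-check value falls through to the global one rather than failing, so a misconfiguration never leaves a check on a worse interval than the default.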