We are currently in reactive mode for most infra issues: we discover them late, or too late.
Examples of things we need to monitor:
- The trigger for this JIRA: right now the Evergreen job on trusted.ci.jenkins.io seems to be broken again. The last merge is from yesterday, but the last published Docker image on https://hub.docker.com/r/jenkins/evergreen/tags/ or https://hub.docker.com/r/jenkinsciinfra/evergreen-backend/ is 3 days old...
- There should be an alert when the time difference between the last commit on https://github.com/jenkins-infra/evergreen and the last published image is more than, say, 6 hours
- https://evergreen.jenkins.io/ should be monitored to check that it is running (and actually working as expected)
- Incrementals should not be broken. It has regularly been broken, throwing HTTP 500, and those failures are ignored by design. That means that when it starts failing for whatever reason, we miss it until we really need an incremental version :)
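
To make the image freshness alert above concrete, here is a minimal sketch of the comparison in Python. It only covers the decision logic. How the two timestamps are fetched (from GitHub and Docker Hub) is left out, and the names `parse_ts`, `image_is_stale` and `MAX_LAG` are hypothetical, not taken from any existing tooling.

```python
from datetime import datetime, timedelta

# Threshold suggested in this ticket ("say, 6 hours"); an assumption to tune.
MAX_LAG = timedelta(hours=6)


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as '2018-01-01T10:00:00Z'."""
    # Replace the trailing 'Z' so fromisoformat() accepts it on older Pythons.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def image_is_stale(last_commit: datetime,
                   last_image: datetime,
                   max_lag: timedelta = MAX_LAG) -> bool:
    """Return True when the newest commit is more than max_lag newer
    than the newest published image."""
    return last_commit - last_image > max_lag


if __name__ == "__main__":
    # Hypothetical timestamps, for illustration only.
    commit = parse_ts("2018-01-04T10:00:00Z")
    image = parse_ts("2018-01-01T10:00:00Z")
    if image_is_stale(commit, image):
        print("ALERT: published image lags behind the last commit")
```

Note that this follows the rule literally, so it would also fire right after the first commit following a long quiet period. A grace period based on the current time may be worth adding.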
When a critical component or behavior is broken, we should be notified as soon as possible so that we have more time to handle it calmly, instead of discovering it later when we really need something to work, like during a demo.
Note: I suppose part or all of this is to be done using https://github.com/jenkins-infra/jenkins-infra-monitoring