I did some investigation into this, and here's what's going on:
- The Reload Configuration operation from the UI attempts to load up all the jobs and runs it can find.
- It loads runs from disk based on the presence of a build.xml in the directory for a particular build number (cf. RunMap.load()).
- Runs that are in progress do not have a build.xml written to the directory in the course of normal operations (problem #1).
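To make the loading rule above concrete, here is a minimal sketch in plain Java. It is not the actual RunMap.load() code: the class name BuildDirScanner and the method loadableBuilds are hypothetical, and the layout (one numerically named directory per build) is an assumption for illustration only.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical, simplified model of discovering runs on disk; not Jenkins code. */
final class BuildDirScanner {

    /** Returns the build numbers under buildsDir whose directory contains a build.xml. */
    static SortedSet<Integer> loadableBuilds(Path buildsDir) throws IOException {
        SortedSet<Integer> found = new TreeSet<>();
        try (DirectoryStream<Path> dirs = Files.newDirectoryStream(buildsDir)) {
            for (Path dir : dirs) {
                String name = dir.getFileName().toString();
                // Assumption: only short numeric directory names count as build numbers.
                if (!name.matches("\\d{1,9}") || !Files.isDirectory(dir)) {
                    continue;
                }
                // A directory without build.xml (e.g. a run still in progress) is skipped.
                if (Files.isRegularFile(dir.resolve("build.xml"))) {
                    found.add(Integer.parseInt(name));
                }
            }
        }
        return found;
    }
}
```

With build directories 1 (containing a build.xml) and 2 (still in progress, no build.xml), only build 1 is returned, which is how a running build disappears after a reload.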
Attempt 1: Write code that, at the start of the reload operation (Jenkins.reload()), looks at all running Executors and forces them to marshal their state to disk (creating a build.xml). This would then allow Jenkins to notice those running builds when it reloads.
Then I discovered that the unmarshal procedure for build.xml -> Run assumes that any build.xml it sees is for a run that is in State.COMPLETED. The State itself is never persisted, so XStream has nothing to restore it from.
Attempt 2: Make the State of a run persist appropriately so that it can be recovered when Jenkins reloads the configuration.
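As a rough illustration of what Attempt 2 amounts to, here is a minimal sketch. It uses java.util.Properties as a stand-in for XStream and build.xml so that it runs on its own. RunRecord is a hypothetical class, not Jenkins' Run, and the State constants other than COMPLETED are assumed for the example. A file written without a state falls back to COMPLETED, which matches what the current unmarshal code assumes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

/** Hypothetical, simplified stand-in for a run record; not Jenkins' Run class. */
final class RunRecord {
    // Assumption: state names chosen for illustration; only COMPLETED is from the text.
    enum State { NOT_STARTED, BUILDING, POST_PRODUCTION, COMPLETED }

    final int number;
    final State state;

    RunRecord(int number, State state) {
        this.number = number;
        this.state = state;
    }

    /** Writes the record, including its state, to the given file. */
    void save(Path file) throws IOException {
        Properties p = new Properties();
        p.setProperty("number", Integer.toString(number));
        p.setProperty("state", state.name());
        try (OutputStream out = Files.newOutputStream(file)) {
            p.store(out, null);
        }
    }

    /** Reads a record; a file written without a state is treated as COMPLETED. */
    static RunRecord load(Path file) throws IOException {
        Properties p = new Properties();
        try (InputStream in = Files.newInputStream(file)) {
            p.load(in);
        }
        int number = Integer.parseInt(p.getProperty("number"));
        State state = State.valueOf(p.getProperty("state", State.COMPLETED.name()));
        return new RunRecord(number, state);
    }
}
```

Defaulting to COMPLETED keeps files written by older code loadable, while a run saved mid-build comes back as still building.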
This seems to work OK, at least in limited tests, and I intend to put up a pull request so people can see the changes I'm proposing. I do wonder, though, what the justification was for doing it this way in the first place. It seems likely that you would not always want Jenkins to totally "trust" the state on disk when starting up, for example if there were a large time or configuration delta between the stop and the start.
I do, however, think that the specific case of clicking the "Reload Configuration" link should be able to assume that what was running before is still running. The main thing that could go wrong now is a job that Jenkins thinks is running but that actually isn't anymore. In a reload case, you can probably expect that to happen rarely, if at all.
Another idea I had is based on the fact that the executors clearly do report that they are running a particular job, even if Jenkins doesn't believe that run exists because of this bug. Instead of persisting everything to disk on the Jenkins master side, it might be possible (or better) to have the master query the slaves for running jobs when it comes back up and use that information to reconstruct its own view of what is currently running. I don't know how people feel about the potential for abuse there, given that it would require the master to "trust" the slaves to tell it what they were working on when it restarted. A combination of the two approaches might be best (trust, but verify).
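A minimal sketch of what "trust, but verify" could look like follows: compare the runs recovered from disk as in progress against what the executors report, and flag the mismatches. Everything here is hypothetical. RunReconciler, Verdict and the string run IDs are made up for illustration and do not correspond to any existing Jenkins API.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch of reconciling disk state with executor reports; not Jenkins code. */
final class RunReconciler {
    enum Verdict {
        RUNNING,           // on disk as in progress, and an executor confirms it
        STALE_ON_DISK,     // on disk as in progress, but no executor reports it
        UNKNOWN_TO_MASTER  // an executor reports it, but the master has no record
    }

    /**
     * @param runningOnDisk   run IDs the master recovered from disk as in progress
     * @param reportedByNodes run IDs the executors claim to be working on
     * @return one verdict per run ID seen on either side, sorted by ID
     */
    static Map<String, Verdict> reconcile(Set<String> runningOnDisk, Set<String> reportedByNodes) {
        Map<String, Verdict> result = new TreeMap<>();
        for (String id : runningOnDisk) {
            result.put(id, reportedByNodes.contains(id) ? Verdict.RUNNING : Verdict.STALE_ON_DISK);
        }
        for (String id : reportedByNodes) {
            // Only IDs not already classified from the disk side land here.
            result.putIfAbsent(id, Verdict.UNKNOWN_TO_MASTER);
        }
        return result;
    }
}
```

What to do with each verdict is a policy question the sketch leaves open: a STALE_ON_DISK run could be marked as aborted, and an UNKNOWN_TO_MASTER run is exactly the case where the master would have to decide how far to trust the node.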