When my pods are OOM-killed, the corresponding nodes aren't removed. This clutters the interface and leaves the job running as a zombie.
If I click to abort the job, the prompt reads "Are you sure you want to abort null?"
This message comes from executors.jelly when executor.currentExecutable.fullDisplayName is null.
If I proceed, it deletes the node, as expected.
In the logs I found these entries:
I think it's related to the Reaper class: when a DELETED event is received, it calls Node#removeNode. There I found this comment: "If the node instance is not in the list of nodes, then this will be a no-op, even if there is another instance with the same".
I think that, for some reason, the instance passed by Reaper is different from the registered Node, which causes the removal to be ignored.
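
To illustrate the failure mode I suspect, here is a minimal, self-contained sketch. It is not the actual Jenkins or plugin code: FakeNode, FakeRegistry and the node name "pod-agent-1" are hypothetical stand-ins, and the identity check is my assumption about how the removal behaves, based on the comment quoted above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class RemoveNodeSketch {

    /** Hypothetical stand-in for an agent node; only the name matters here. */
    static final class FakeNode {
        final String name;

        FakeNode(String name) {
            this.name = name;
        }
    }

    /** Hypothetical registry keyed by node name (an assumption, not the real Jenkins class). */
    static final class FakeRegistry {
        private final Map<String, FakeNode> nodes = new ConcurrentHashMap<>();

        void add(FakeNode node) {
            nodes.put(node.name, node);
        }

        /** Removes the node only if this exact instance is the registered one. */
        boolean remove(FakeNode node) {
            if (nodes.get(node.name) != node) {
                return false; // same name but a different instance: no-op
            }
            nodes.remove(node.name);
            return true;
        }

        boolean contains(String name) {
            return nodes.containsKey(name);
        }
    }

    public static void main(String[] args) {
        FakeRegistry registry = new FakeRegistry();
        FakeNode registered = new FakeNode("pod-agent-1");
        registry.add(registered);

        // A second instance with the same name, e.g. one held by a watcher.
        FakeNode stale = new FakeNode("pod-agent-1");

        System.out.println("stale removed: " + registry.remove(stale));
        System.out.println("still registered: " + registry.contains("pod-agent-1"));
        System.out.println("registered removed: " + registry.remove(registered));
    }
}
```

If this is what happens, looking the node up by name at removal time and passing that instance, rather than a cached one, would avoid the no-op.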
The OfflineCause for the node is "Node is being removed".