Aha, after running through incrementally adding all various options and found that this behavior kicks in when I restrict the job to a certain node type.
We have two node types , I'll call them NodeA and NodeB. The specs are the exact same, they're AWS EC2 nodes (c4.8xlarge) on EBS.
If I put in "Restrict where this job runs" on NodeA it works. If I put NodeB it fails. In our test environment, we only have one of each so I launched another node (NodeC) to see if it was something specific to our QA NodeB.
Guess what? It's only NodeB, NodeC is totally fine. I'm sad.
Should I leave this ticket open to further diagnose the problem with some guidance?