It Works in the Demo. Will It Work in Production? Evaluating and Debugging AI Agents