Treat agents like systems: measure success, errors, and review cost.
A useful metric is whether the system completes a defined task with clear acceptance criteria — not whether a response sounds confident.
You can measure invalid arguments, tool failures, retries, and recovery behavior. These show up in logs and traces.
If outputs require heavy human rewriting, “autonomy” shifts work instead of reducing it.