For three weeks, my morning briefing was perfect. A small AI agent I had built scanned everything that moved across the company overnight and handed me a tidy summary at 8:30 sharp. It never crashed. It never missed a day. And in week three, I realized I had stopped reading it. It had never once told me something I did not already know.

That is a stranger kind of failure than the ones we are trained to catch. The agent did not break. It just never helped. And because the output was always fluent and always plausible, it took weeks to notice the difference between working and useful.

By now we have built more than a dozen of these AI agents for our own work: briefings, meeting prep, inbox triage, status trackers. The pattern that emerged is sharper than “AI is good at some things and bad at others.” Homemade agents are excellent at errands and quietly weak at judgment. The line between those two kinds of work turned out to be the distinction that actually matters.

The retrieval was never the problem

Most of the advice on building agents frames the question as a capability question: what can the model do? That framing sent me down the wrong path for months. The better question is not what the agent can do. It is what kind of work you are handing it.

Take a concrete case. We wanted better 1:1s. The idea was simple. Before each weekly meeting, an agent searches every connected tool for anything about that person: their recent work, their messages, their tickets, their documents. Then the same material feeds the quarterly feedback conversation later on. On paper this should raise coaching quality a lot, because you walk in with the full picture instead of a vague memory.

The retrieval part worked. Pulling scattered inputs about one person into one place is exactly the kind of well-scoped task these systems handle well, which is also where Anthropic’s own “Building Effective Agents” guidance says they are most reliable. So we pushed the idea further, into the coaching itself.

Where strong AI agents still did not convince us

We ran the same idea on two strong setups. One was Codex with GPT-5.5. The other was Claude in Cowork mode with Opus 4.8. Both gathered the context well. Neither produced coaching input that convinced us.

The gathering itself was impressive. Both AI agents surfaced messages and documents we had forgotten, and laid them out in minutes. So the disappointment sat one layer up, in the conclusions they drew from all that context.

The output read fine. It was structured, confident, and generic. It said what any competent observer could say about almost anyone: communicates well, could prioritize more, engages the team. Yet it rarely named the one thing that actually mattered for that person in that quarter.

We had seen the same split before. On a fixed schedule, with no specific question, the agent drifted toward the generic mean. But fed a concrete situation, with the real tension spelled out, it got noticeably sharper. So the problem statement was doing the work, not the model. An agent with nothing specific to react to fills the gap with the average of everything it has seen, and then writes it up with confidence.

Coaching is judgment, and judgment is bounded

There is a reason the data alone did not help. Coaching is not a retrieval task. We can only coach inside three boundaries: what we actually see of a person, where that person wants to go, and where the company needs them. A model has at most a partial view of the first and almost nothing of the other two. It can read the messages, but it does not know the career conversation from six months ago, or the strategic need forming for next year. Two of those three boundaries are barely written down anywhere. They live in past conversations, in context we carry around, and in plans we have not even shared with the team yet.

So “the model is not good enough yet” is probably the wrong reading. Two other explanations fit better, and both point back at us rather than at the tool.

The first is that we never transferred our own judgment into the prompt. The agent did not lack intelligence. It lacked context that still lives in our heads and was never written down. If we cannot put into words what good coaching looks like for this specific person, we cannot expect an agent to find it for us.

The second is harder to admit. We may not be able to judge the output well. For an errand, the result is right or it is not, and you see it instantly. For coaching input, you would need to already know the answer to tell whether the agent found it. That is the trap behind automation bias, which is strongest exactly when a system looks competent and you are short on time. A polished briefing is a very comfortable place for it to hide.

What we kept, and what we stopped expecting

We kept the gathering. The agent that pulls context about a person or a topic before a meeting is genuinely useful, and we still run it every week. It saves real preparation time. It also raises the floor, because nothing important quietly gets forgotten.

But we stopped expecting the agent to do the coaching. The read on what matters most for this person right now stays with the coach. In practice that means we use the agent to prepare, not to conclude. So we put the effort into writing down the situation and the question, rather than into automating the answer.
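One way to hold ourselves to "prepare, not conclude" is to make the written situation and question a hard requirement before the agent runs. The sketch below is a minimal illustration, not the setup described above: `CoachingBrief`, `build_prompt`, and the prompt wording are all hypothetical names and text invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoachingBrief:
    """Hypothetical container for what the coach writes down by hand."""
    person: str
    situation: str  # the real tension, in the coach's own words
    question: str   # the specific thing the coach wants to prepare for


def build_prompt(brief: CoachingBrief, gathered_context: str) -> str:
    """Combine gathered context with the coach's situation and question.

    Refuses to build a prompt when either is missing, so the agent is
    never run with nothing specific to react to.
    """
    if not brief.situation.strip() or not brief.question.strip():
        raise ValueError("Write down the situation and the question first.")
    return "\n\n".join([
        f"Context gathered about {brief.person}:\n{gathered_context}",
        f"Situation, as the coach sees it:\n{brief.situation}",
        f"Question to prepare for:\n{brief.question}",
        # Assumption: the agent is asked for evidence only, not a verdict.
        "List the evidence in the context that bears on the question. "
        "Do not draw conclusions about the person.",
    ])
```

The point of the guard clause is the design choice, not the wording: an empty situation or question stops the run instead of letting the agent fill the gap on its own.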

If you want to test the same split this week, draw a line through your own wishlist of tasks you want AI agents to take over. For each one, ask two questions. Can you state what done looks like? And can you check the output in a few seconds? Two yes answers mean it is an errand, so automate it, and schedule it if you like. Any no means judgment is involved, so keep it as something you run by hand, with a real question in front of it.
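The two-question test is simple enough to write down as a rule. Here is a minimal sketch, assuming you answer both questions yourself for each task; the `Task` type, the function names, and the answers given for the sample wishlist are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    name: str
    done_is_statable: bool      # Can you state what done looks like?
    checkable_in_seconds: bool  # Can you check the output in a few seconds?


def classify(task: Task) -> str:
    """Two yes answers mean an errand; any no means judgment."""
    if task.done_is_statable and task.checkable_in_seconds:
        return "errand"
    return "judgment"


def triage(tasks: list[Task]) -> dict[str, list[str]]:
    """Split a wishlist into errands (automate) and judgment (run by hand)."""
    result: dict[str, list[str]] = {"errand": [], "judgment": []}
    for task in tasks:
        result[classify(task)].append(task.name)
    return result


# Hypothetical answers for a sample wishlist; yours will differ.
wishlist = [
    Task("gather context before a 1:1", True, True),
    Task("draft coaching input", False, False),
    Task("summarize overnight activity", True, False),
]

if __name__ == "__main__":
    for kind, names in triage(wishlist).items():
        print(f"{kind}: {', '.join(names)}")
```

Note that a single no is enough to land a task on the judgment side, which is why the third sample task is not automated even though its goal can be stated.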

The idea is still worth pursuing. Better context before a hard conversation is a real gain, and the gathering already delivers it. AI agents will keep getting better at that part, and that alone is worth the setup. Whether an agent can ever carry the judgment underneath coaching is a more open question. For now that part stays subjective, and it stays with us. We are still working out how much of it can be written down at all.

