You're Measuring Your AI Wrong

CSAT scores and sentiment analysis tell you how the conversation felt. They don’t tell you whether anything happened. Those are different questions, and the second is the one your system should be judged on first.

If you deploy an AI voice or chat system and you measure it primarily by customer satisfaction scores, you are measuring the wrong thing. I want to be direct about this because the implications are significant and the mistake is nearly universal.

Here is what CSAT and sentiment analysis measure: the quality of the conversation experience. How natural the exchange felt. Whether the customer felt heard. Whether the interaction matched what the customer expected a helpful interaction to feel like. These are real signals. They tell you something true.

But they do not tell you whether the customer’s underlying need was resolved. They do not tell you whether the artifact the customer needed — the complaint record, the refund request, the lost-item report, the support ticket, the booking confirmation — actually exists in your system after the call ended.

A customer can be highly satisfied by a conversation that produced nothing. This is not a theoretical failure mode. It happens constantly, in every AI-assisted customer operation that measures success by conversation quality alone.

The customer says “thank you, that was very helpful.” The CSAT score registers a win. The conversation is logged as resolved. And the complaint is not in the system. The refund was not submitted. The item was not logged. The thing the customer needed to happen did not happen.

You will not see this in your dashboard. You will see a resolution rate and a satisfaction score. The gap between what those metrics show and what actually occurred is where your AI system is failing you.

Let me describe what you should be measuring instead.

For every category of interaction your AI system handles, there is a corresponding artifact — a discrete, concrete output that should exist in your systems after the interaction ends. Here are some examples:

A lost-item inquiry should produce a lost-item report: caller name, contact information, item description, approximate date and location of loss, follow-up preference. If that record does not exist in your system at the end of every lost-item call, your AI is not doing its job regardless of how the call was rated.

A billing concern should produce one of: a charge-status clarification document, a refund request routed to your payment system, or a verified resolution record. If none of those artifacts exist after a billing call, the call was not resolved — it was managed.

A complaint should produce a structured complaint record with the nature of the concern, the relevant details, the customer’s stated expectation, a priority flag, and an assigned follow-up path. If that record does not exist, the complaint was heard but not handled.

A support request should produce a support ticket with the problem statement, the troubleshooting steps taken during the call, the resolution or escalation status, and the next action. If the ticket is not there, the support was conversation, not service.

In each case, the artifact is specifiable in advance. You know what it should look like. You know what fields it must contain. You can check whether it exists. You can check whether it’s complete.

This check — does the artifact exist, and is it complete — is the measurement your AI system needs.
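A minimal sketch of that check, in Python, might look like the following. The field names and the `REQUIRED_FIELDS` table are illustrative assumptions loosely based on the examples above, not a schema from any real system, and the sample report is made up.

```python
from typing import Dict, List, Optional

# Hypothetical required fields per interaction category (assumed names).
REQUIRED_FIELDS: Dict[str, List[str]] = {
    "lost_item": ["caller_name", "contact_info", "item_description",
                  "loss_date", "loss_location", "follow_up_preference"],
    "complaint": ["nature_of_concern", "details", "stated_expectation",
                  "priority", "follow_up_path"],
    "support": ["problem_statement", "troubleshooting_steps",
                "status", "next_action"],
}


def missing_fields(category: str, artifact: Optional[dict]) -> List[str]:
    """Return the required fields that are absent or empty.

    If no artifact exists at all (None), every required field is missing.
    """
    required = REQUIRED_FIELDS[category]
    if artifact is None:
        return list(required)
    return [f for f in required if artifact.get(f) in (None, "", [])]


def is_complete(category: str, artifact: Optional[dict]) -> bool:
    """True only if the artifact exists and no required field is missing."""
    return artifact is not None and not missing_fields(category, artifact)


# Made-up example: a lost-item report where the AI never captured contact info.
report = {
    "caller_name": "A. Caller",
    "contact_info": "",
    "item_description": "black umbrella",
    "loss_date": "last Tuesday",
    "loss_location": "lobby",
    "follow_up_preference": "phone",
}
print(missing_fields("lost_item", report))  # ['contact_info']
print(is_complete("lost_item", report))     # False
```

The point of the design is that the check returns a yes or no per interaction, plus the list of what is missing, independent of how the call was rated.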

The reason organizations default to CSAT instead of artifact completion is that CSAT is easier to collect and harder to argue with. Customers tell you whether they were satisfied. The score is there. It is a number.

Artifact completion requires you to know what artifacts your system should be producing, to build the systems to receive them, and to check them after every interaction. This is more work. It also requires admitting, before you start, that the conversation is not the point — the artifact is — and that your AI system will be evaluated on output, not experience.

Most organizations are not ready to say that out loud, because it implies accountability for something specific. Did the record get created? Yes or no. Did the ticket get filed? Yes or no. Did the refund request get submitted? Yes or no.

This accountability is uncomfortable because it cannot be finessed. A 4.2 out of 5 on customer satisfaction is an abstraction. A complaint record that does not exist is a fact.

Let me tell you what happens when you switch to artifact-first measurement.

First, you discover failures you didn’t know you had. In any AI system that has been measured primarily by conversation quality, a significant fraction of interactions that registered as successful produced no artifact. When you start checking for the artifact, those failures become visible. This is uncomfortable. It is also the only way to fix them.

Second, you get real design feedback. When you know that your AI system is producing incomplete lost-item reports (missing the caller’s contact information in 23% of cases, say), you can fix that. You can look at the conversations where the contact information was missing and understand what happened. Maybe the caller didn’t provide it. Maybe the AI didn’t ask. Maybe the form field wasn’t required. Each of these is a fixable design problem. CSAT scores do not give you this kind of feedback. They tell you the conversation felt good or bad. Artifact analysis tells you specifically what is missing from the artifact and why.

Third, you align your AI system with the actual purpose of the interaction. The purpose of a lost-item call is not a positive customer experience. The purpose is to produce a usable lost-item record or verify a found-item status. The positive customer experience is a means to that end — it helps gather the necessary information, it builds goodwill, it reflects well on the organization. But it is not the end. The artifact is the end. When your measurement system reflects this, your AI system design will too.
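The design feedback described in the second point can be approximated by tallying, across a batch of artifacts, how often each required field is missing. This is a rough sketch under assumptions: artifacts are plain dictionaries, the field names are hypothetical, and the four sample records are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def missing_field_rates(artifacts: List[dict], required: List[str]) -> Dict[str, float]:
    """Fraction of artifacts in which each required field is absent or empty."""
    if not artifacts:
        return {}
    counts = Counter()
    for artifact in artifacts:
        for field in required:
            if not artifact.get(field):  # missing key, None, or empty value
                counts[field] += 1
    return {field: counts[field] / len(artifacts) for field in required}


# Invented sample: four lost-item reports, one without contact information.
required = ["caller_name", "contact_info", "item_description"]
reports = [
    {"caller_name": "A", "contact_info": "555-0100", "item_description": "keys"},
    {"caller_name": "B", "contact_info": "", "item_description": "wallet"},
    {"caller_name": "C", "contact_info": "555-0101", "item_description": "scarf"},
    {"caller_name": "D", "contact_info": "555-0102", "item_description": "phone"},
]
print(missing_field_rates(reports, required))
# {'caller_name': 0.0, 'contact_info': 0.25, 'item_description': 0.0}
```

A field with a high missing rate points you at the conversations to review; the tally does not say why the field is missing, only where to look.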

There are two legitimate objections to artifact-first measurement, and both are worth addressing.

The first: some interactions don’t produce a discrete artifact. A customer who calls to ask what time the restaurant closes does not need a complaint record. The answer is the artifact. This is true. Not every call category has a structured artifact type. But most do, and for those that do, the artifact is the correct measure.

The second: measuring artifact completion doesn’t tell you whether the customer felt good about the interaction. This is also true. An AI system that produces perfect artifacts in a way that makes customers angry or confused is not a success. Both things matter. The argument is not that conversation quality is irrelevant — it is that artifact completion should be primary, and conversation quality should be secondary. Right now, most organizations have these reversed.

The practical step is simple to describe, if not always simple to execute.

Before you deploy or evaluate any AI customer-facing system, sit down and answer this question for every interaction category the system handles: what artifact should this interaction produce, what does that artifact look like when complete, and how will we check whether it exists?

Then build the checking mechanism. Audit samples. Look at the artifact completion rate alongside the CSAT rate. When these diverge — when satisfaction is high and artifact completion is low — you have identified the place where your AI system is producing the feeling of service without its substance.

Fix that place. Not by making the conversation feel better. By making the artifact appear.
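One way to put the two rates side by side is sketched below. Everything here is an assumption for illustration: the `Interaction` record, the 1-to-5 CSAT scale, the threshold of 4 for "satisfied," and the sample data. The useful number is the last one, the share of interactions where the customer was satisfied and no complete artifact exists.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Interaction:
    category: str            # e.g. "lost_item" (hypothetical label)
    csat: int                # survey score, assumed to be on a 1-5 scale
    artifact_complete: bool  # result of the exists-and-complete check


def summarize(interactions: List[Interaction], satisfied_at: int = 4) -> Dict[str, dict]:
    """Per category: CSAT rate, artifact completion rate, and their gap."""
    tallies: Dict[str, dict] = {}
    for it in interactions:
        t = tallies.setdefault(
            it.category, {"n": 0, "satisfied": 0, "complete": 0, "gap": 0}
        )
        satisfied = it.csat >= satisfied_at
        t["n"] += 1
        t["satisfied"] += satisfied
        t["complete"] += it.artifact_complete
        # Satisfied customer, but nothing usable landed in the system.
        t["gap"] += satisfied and not it.artifact_complete
    return {
        cat: {
            "csat_rate": t["satisfied"] / t["n"],
            "completion_rate": t["complete"] / t["n"],
            "satisfied_without_artifact": t["gap"] / t["n"],
        }
        for cat, t in tallies.items()
    }


# Invented audit sample.
sample = [
    Interaction("lost_item", 5, True),
    Interaction("lost_item", 5, False),
    Interaction("lost_item", 4, False),
    Interaction("lost_item", 2, True),
]
print(summarize(sample)["lost_item"])
# {'csat_rate': 0.75, 'completion_rate': 0.5, 'satisfied_without_artifact': 0.5}
```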

The artifact is the proof. Everything else is how the conversation felt on the way there.

Discover more from John Rector