San Francisco, California
Imagine a customer typing a question to your AI agent in all caps, repeating it, and then asking for a human representative. Most click-tracking software would count this as three interactions and mark the session as ‘active.’ But with the Summer ’26 update, Salesforce sees it differently: as a sign that automation has let someone down.
This difference between tracking activity and measuring quality lies at the heart of Salesforce Agent Analytics. It marks a real shift in how companies judge AI performance.
Why Click-Counting Never Told the Full Story
For years, the metrics used to judge AI customer service agents were basic. Did the user click? Did the chat stay open? Did they avoid filing a ticket? These signs were seen as proof of success. If a session ended without a support case, it was called a deflection and counted as a win. No one checked if the customer was actually satisfied or just worn out.
As AI agents became more independent, the problem with this logic became clear. Salesforce’s research team found that in over 2,500 conversations studied for its ICLR 2026 submission, 93% were labeled as successful by standard metrics, even when agents had stopped helping and just repeated what users said without solving anything. This failure mode is called ‘echoing.’ The metrics that missed it are simply inadequate.
This is the problem Salesforce Agent Analytics now directly addresses.
Summer ’26: The Architecture of Honest Measurement
The Summer ’26 release, which goes live between June 13 and June 15, 2026, brings in Refined Agent Analytics. This is a unified dashboard that integrates Service Agent and Employee Agent data into one view, with over 40 metrics covering Quality, Health, Effectiveness, and Usage. While this is a solid upgrade, the bigger change is the introduction of Custom Scorers, now in Beta.
Custom Scorers don’t just count clicks. They actually read the conversations.
With LLM session evaluation, these scorers look at the entire conversation and grade it based on what matters to a business: Sentiment, Tone of Voice, Product Interest, Escalation Trigger, and Courtesy. For example, a company selling enterprise software might define ‘Product Interest’ as the moment a user starts comparing features with a competitor’s product. A healthcare portal might define ‘Escalation Trigger’ as the exact words that tend to precede an angry callback. Both of these can now be set as scoring criteria.
The practical impact is clear. While legacy analytics might mark a closed chat window as a resolved case, a Custom Scorer can detect when a conversation shifts from neutral to hostile over several messages and ends with the user leaving. That’s not a deflection; it’s a failure. Now, Salesforce Agent Analytics can call it what it is.
The Deflection Metric Gets a Conscience
For a long time, deflection metrics have been a vanity stat in AI customer service. High deflection rates made executive dashboards look good, even if customers were just giving up instead of getting answers. OpenTable’s use of Agentforce showed a better way. Their team created a live deflection score that updates throughout each conversation, starting at neutral and rising in response to real signals of frustration. For example, typing in all caps raises the score, and asking for a human raises it more. The agent uses this live score to decide in real time whether to keep trying, open a case, or escalate.
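A live score of this kind can be sketched in a few lines of Python. This is a minimal illustration, not OpenTable's or Salesforce's implementation: the weights, thresholds, trigger phrases, and the names `LiveScore` and `next_action` are all assumptions made up for the example.

```python
from dataclasses import dataclass

# Every weight, threshold, and phrase below is hypothetical.
NEUTRAL = 0.0
CAPS_WEIGHT = 1.0           # all-caps message raises the score
HUMAN_REQUEST_WEIGHT = 2.0  # asking for a human raises it more
OPEN_CASE_AT = 2.0
ESCALATE_AT = 3.0
HUMAN_PHRASES = ("human", "representative", "real person")


def is_all_caps(text: str) -> bool:
    """True when the message has at least four letters and all are uppercase."""
    letters = [c for c in text if c.isalpha()]
    return len(letters) >= 4 and all(c.isupper() for c in letters)


@dataclass
class LiveScore:
    """Running frustration score for one conversation."""
    value: float = NEUTRAL

    def update(self, message: str) -> float:
        """Fold one customer message into the score and return the new value."""
        if is_all_caps(message):
            self.value += CAPS_WEIGHT
        # Naive substring match; a real system would need something sturdier.
        if any(phrase in message.lower() for phrase in HUMAN_PHRASES):
            self.value += HUMAN_REQUEST_WEIGHT
        return self.value

    def next_action(self) -> str:
        """Map the current score to one of the three actions."""
        if self.value >= ESCALATE_AT:
            return "escalate"
        if self.value >= OPEN_CASE_AT:
            return "open_case"
        return "keep_trying"
```

Calling `update` after every customer message and then `next_action` gives the agent a decision at each turn. Because the score only rises, a session that has already turned sour does not drift back to "keep trying" just because the customer stops shouting.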
This approach is fundamentally different from just counting closed windows. It treats frustration as real data, not just something missing. With Summer ’26, this idea is now built into the platform itself, so any company can use it, not just those with custom solutions.
Qualitative Scoring at Machine Speed
The deeper shift here is methodological. Qualitative automated customer service agent scoring, the practice of using one language model to judge the performance of another, was considered a scholarly exercise as recently as 2024. The concern was obvious: what keeps the evaluating model from having the same blind spots as the model being evaluated? The answer Salesforce has landed on is human-defined rubrics. An enterprise writes the scoring criteria. The LLM applies those criteria at scale to every session, not just a sampled subset.
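The division of labor, with humans writing the rubric and a model applying it to every session, can be sketched as follows. This is an assumption-laden illustration and not the Custom Scorers API: `Criterion`, `score_sessions`, and `placeholder_judge` are hypothetical names, and the judge here is a trivial keyword check standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Criterion:
    """One human-written line of the rubric."""
    name: str
    instruction: str  # guidance a real judge model would receive


# A judge takes (instruction, transcript) and returns a score between 0 and 1.
Judge = Callable[[str, str], float]


def score_sessions(
    sessions: Dict[str, str],
    rubric: List[Criterion],
    judge: Judge,
) -> Dict[str, Dict[str, float]]:
    """Apply every criterion to every session, with no sampling."""
    return {
        session_id: {c.name: judge(c.instruction, transcript) for c in rubric}
        for session_id, transcript in sessions.items()
    }


def placeholder_judge(instruction: str, transcript: str) -> float:
    """Stand-in for an LLM call: flags any transcript that mentions a human."""
    return 1.0 if "human" in transcript.lower() else 0.0


rubric = [
    Criterion(
        name="Escalation Trigger",
        instruction="Score 1 if the customer asks to leave the automated channel.",
    )
]
sessions = {
    "s1": "User: I need a human now",
    "s2": "User: thanks, that fixed it",
}
results = score_sessions(sessions, rubric, placeholder_judge)
```

Passing the judge in as a function keeps the rubric separate from the model that applies it, so the criteria can be reviewed and versioned on their own.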
This is what makes qualitative automated customer service agent scoring practically viable for businesses with thousands of daily agent interactions. A human QA team might only review about 2% of the sessions. A Custom Scorer checks all of them, catching the same escalation triggers and tone signals that a trained reviewer would notice, and does so before the customer can complain.
Developers set up these scorers using the Metadata API and store their definitions in source control under the aiAgentScorerDefinitions folder. They turn them on from the Scorer Hub. The whole process is designed to make LLM session evaluation a repeatable, auditable engineering practice, not just an occasional review.
What Executives Should Actually Be Watching
With Custom Scorers now part of Salesforce Agent Analytics, CX leaders need to shift the conversation with their teams. Instead of asking, ‘What is our deflection rate?’ the better question is, ‘Of the sessions we deflected, how many ended with a sentiment score that shows real resolution?’
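That second question reduces to a simple ratio. The sketch below is illustrative only: the `Session` fields, the 0-to-1 sentiment scale, and the `RESOLVED_SENTIMENT` cutoff are assumptions, not values taken from Salesforce Agent Analytics.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

RESOLVED_SENTIMENT = 0.6  # hypothetical cutoff for "real resolution"


@dataclass(frozen=True)
class Session:
    deflected: bool         # ended without a support case
    final_sentiment: float  # assumed scale: 0 = hostile, 1 = positive


def deflection_rates(sessions: Iterable[Session]) -> Tuple[float, float]:
    """Return (raw deflection rate, share of deflected sessions that look resolved)."""
    all_sessions = list(sessions)
    if not all_sessions:
        return 0.0, 0.0
    deflected = [s for s in all_sessions if s.deflected]
    raw_rate = len(deflected) / len(all_sessions)
    if not deflected:
        return raw_rate, 0.0
    resolved = sum(1 for s in deflected if s.final_sentiment >= RESOLVED_SENTIMENT)
    return raw_rate, resolved / len(deflected)
```

With four sessions, three deflected and one of those ending on hostile sentiment, the raw rate is 75% while only two thirds of the deflections look resolved. The gap between the two numbers is the share of customers who gave up rather than got an answer.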
Over time, this difference will separate companies that build trust in their AI agents from those that simply reduce ticket volumes while harming customer relationships. Deflection metrics without context have always shown problems only after the fact. LLM session evaluation lets you see problems before they get worse.
Salesforce has now built a quality-control function directly into its platform’s measurement tools. Companies that use it well will not only know when their agents fail, but also how they failed, and they’ll have the tools to fix problems before the next customer interaction.













