- Client: European telecom operator
- Service: AI Quality & Evaluation at Scale
- Category: AI Quality & Evaluation, Enterprise AI, Contact Center, LLM Calibration
- Date: December 10, 2024
From Under 1% to 100% QA Coverage: AI Quality Assurance at Scale
A European telecom operator was manually scoring less than 1% of its contact-center interactions. Our team replaced that fragile sample with a calibration framework that now evaluates 100% of interactions automatically. Within four months of launch, it reached 82% AI accuracy against a human benchmark, a 72% automation score, and a 50% improvement in process efficiency across 51 accounts and 9,400 users.
Challenge & Solution
Manual QA could only ever touch a sliver of conversations. Under 1% sampling left blind spots, slow agent feedback, and quality standards that drifted from one account to the next. The mandate was unambiguous: evaluate every interaction, without scaling QA headcount in proportion, and do it in a way QA managers trust enough to act on. The hard question underneath was whether AI-generated evaluations could reliably match human judgment at volume.
Our team built a multi-modal AI pipeline (audio, speech-to-text, and LLMs) wrapped in a calibration framework designed for the people who own quality, not the people who write code. The cycle is deliberate. First, analysts manually evaluate a distributed set of interactions to establish a statistical benchmark. Next, the AI is scored against that benchmark question by question. Keywords and descriptions are then refined over a few iterations to sharpen the AI's interpretation. Finally, the optimized configuration scores every interaction in production. The decisive design choice was letting non-technical QA managers tune AI accuracy themselves, with no engineering support. Question design became the real lever: separating automatable from non-automatable questions upfront, and normalizing scores to the required, automatable questions only, removed false penalties and produced numbers that managers believed.
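As an illustration only, the two calculations at the heart of that cycle can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the production implementation: the `Question` fields, the function names, the exact-match definition of agreement, and the weighted pass/fail scoring are all hypothetical choices made for the example. The case study does not specify how accuracy or scores are actually computed.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Question:
    """One scorecard question (hypothetical schema for this sketch)."""
    qid: str
    required: bool
    automatable: bool
    weight: float = 1.0


def per_question_agreement(
    human: List[Dict[str, bool]],
    ai: List[Dict[str, bool]],
) -> Dict[str, float]:
    """Share of benchmark interactions where the AI answer equals the human answer.

    Both lists hold one {question_id: answer} dict per interaction, aligned by
    index. Exact-match agreement is an assumption made for this sketch.
    """
    totals: Dict[str, int] = {}
    matches: Dict[str, int] = {}
    for human_answers, ai_answers in zip(human, ai):
        for qid, human_answer in human_answers.items():
            if qid not in ai_answers:
                continue  # the AI did not answer this question; skip it
            totals[qid] = totals.get(qid, 0) + 1
            if ai_answers[qid] == human_answer:
                matches[qid] = matches.get(qid, 0) + 1
    return {qid: matches.get(qid, 0) / totals[qid] for qid in totals}


def normalized_score(
    questions: List[Question],
    answers: Dict[str, bool],
) -> Optional[float]:
    """Weighted pass rate over required, automatable questions only.

    Questions the AI cannot judge are left out of the denominator instead of
    being counted as failures. Returns None when nothing is scoreable.
    """
    scored = [
        q for q in questions
        if q.required and q.automatable and q.qid in answers
    ]
    total_weight = sum(q.weight for q in scored)
    if total_weight == 0:
        return None
    earned = sum(q.weight for q in scored if answers[q.qid])
    return earned / total_weight
```

In this sketch, a low agreement value for a single question is the signal a QA manager would use to revisit that question's keywords or description before the next iteration. Leaving non-automatable questions out of the denominator is what keeps an interaction from being penalized for something the AI was never able to judge.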
Final Result
The system moved QA from a near-blind manual process to full, calibrated coverage:
- Coverage scaled from under 1% to 100% of interactions
- 82% AI accuracy against a human benchmark
- 72% automation score
- 50% improvement in process efficiency
- Deployed across 51 accounts and 9,400 users
The platform is architected to scale to 9M+ daily interactions and 20K+ users, a deliberate design target beyond the current deployment of 51 accounts and 9,400 users. This is the work our AI Quality & Evaluation pillar leads: turning subjective, low-coverage review into trustworthy, fully automated evaluation that the business can act on.