How to Scale Expert Data Labeling Without a Sweatshop
Ahmed Rashad, CEO at Perle.ai
The standard playbook for scaling data annotation is a sweatshop. Hire as many low-cost workers as possible. Push throughput. Compete on per-unit price. That playbook has produced decades of bad training data and given the entire category a reputation as commodity work.
Ahmed Rashad, CEO of Perle.ai, builds expert-driven labeling pipelines for medical, legal, dental, and embodied AI, and he refuses to run that playbook. Not for ethical reasons alone, though those exist. For operational reasons. The sweatshop model produces worse output, a higher total cost of ownership (TCO) for customers, and a flywheel that compounds in the wrong direction. Treating annotators as experts produces better data, better margins, and a flywheel that compounds in the right direction.
Here’s how he runs it.
The Wrong Flywheel
The cheap-labor model looks profitable on the surface. Lower wages mean lower costs, which means lower quoted prices, which means more deals. But Ahmed has watched the math play out: “I’ve seen people do it over and over again. They think they’re saving money or they’re reducing costs. You’re not actually. Even that, you’re actually increasing your cost quite significantly by doing that.”
The hidden costs: high turnover, lower output quality, more QA burden, more touch-up work, more customer churn when the data fails in production. The flywheel runs backwards — cheap labor produces bad data, bad data produces customer dissatisfaction, customer dissatisfaction produces price pressure, price pressure produces cheaper labor.
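The hidden costs above can be folded into a rough per-label cost comparison. The sketch below is illustrative only: the cost categories come from the list above, but the `VendorCosts` class, its field names, and every number are hypothetical placeholders, not figures from Perle.ai or any real vendor. Turnover, churn, and production failures are left out to keep the arithmetic short.

```python
from dataclasses import dataclass


@dataclass
class VendorCosts:
    """Per-label costs for one vendor. All values are hypothetical placeholders."""
    quoted_price: float           # what the vendor charges per label
    error_rate: float             # fraction of labels that need rework
    qa_cost_per_label: float      # customer-side QA spend per label
    rework_cost_per_error: float  # cost to fix one bad label

    def total_cost_per_label(self) -> float:
        # Quoted price plus the customer-side costs the quote does not show.
        rework = self.error_rate * self.rework_cost_per_error
        return self.quoted_price + self.qa_cost_per_label + rework


# Made-up numbers chosen only to show how a lower quote can cost more overall.
cheap = VendorCosts(quoted_price=0.10, error_rate=0.20,
                    qa_cost_per_label=0.08, rework_cost_per_error=0.50)
expert = VendorCosts(quoted_price=0.20, error_rate=0.02,
                     qa_cost_per_label=0.02, rework_cost_per_error=0.50)

print(f"cheap:  {cheap.total_cost_per_label():.2f}")   # 0.28
print(f"expert: {expert.total_cost_per_label():.2f}")  # 0.23
```

With these placeholder inputs, the vendor with half the quoted price ends up more expensive per usable label. Whether that holds for a real vendor depends entirely on the measured error rate and QA load.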
Once you’re on that flywheel, getting off is hard. The customers expect the cheap price. The team expects the cheap wages. The reputation expects the cheap quality. The whole organization is calibrated to the wrong number.
The Four Pillars
Ahmed’s framework for the alternative is straightforward. There’s nothing exotic about it — the rarity is in the operational discipline to actually run it.
1. Find the right people. Domain experts (clinicians, lawyers, linguists, engineers) for the verticals where context and judgment matter. Generalists for tasks where they don’t. The matching is non-trivial: which annotator handles edge cases well, which prefers high-volume routine work, which has the dialect or the specialty knowledge for a specific dataset.
2. Give them the right tooling. Pipelines that combine AI automation with human validation. Software that tracks who labeled what, how confident they were, and where disagreements happened. The tooling does the work AI is good at (transcription, translation, entity extraction) so the humans can focus on the judgment work AI isn’t good at.
3. Give them the right incentives. Pay them enough that good people stay. Reward quality over throughput. Give annotators a stake in the work — many of Ahmed’s customers want the variability and judgment humans bring, and the annotators producing that variability deserve to be valued for it, not penalized for not being machines.
4. Give them the right training. Domain experts arrive with judgment but not with the specific labeling system. Continuous training on the customer’s specific use case, the edge cases that have come up, the patterns the team is learning. Training isn’t a one-time onboarding — it’s an ongoing investment.
“You do those four things right, and basically it works out,” Ahmed says. “Now, it’s easier said than done.”
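The tracking described under the second pillar (who labeled what, how confident they were, and where disagreements happened) can be sketched as a small record type plus a disagreement check. This is a minimal sketch, not Perle.ai's tooling: `LabelRecord`, `find_disagreements`, and the sample data are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelRecord:
    item_id: str
    annotator_id: str
    label: str
    confidence: float  # annotator's self-reported confidence, 0.0 to 1.0


def find_disagreements(records):
    """Return {item_id: {label: [annotator_id, ...]}} for items with more than one distinct label."""
    by_item = defaultdict(lambda: defaultdict(list))
    for r in records:
        by_item[r.item_id][r.label].append(r.annotator_id)
    return {item: dict(labels) for item, labels in by_item.items() if len(labels) > 1}


# Hypothetical sample data.
records = [
    LabelRecord("utt-1", "ann-a", "current_medication", 0.9),
    LabelRecord("utt-1", "ann-b", "allergy", 0.6),
    LabelRecord("utt-2", "ann-a", "allergy", 0.95),
    LabelRecord("utt-2", "ann-b", "allergy", 0.9),
]

print(find_disagreements(records))
# {'utt-1': {'current_medication': ['ann-a'], 'allergy': ['ann-b']}}
```

Keeping the annotator IDs attached to each competing label, rather than collapsing to a vote count, is what lets a reviewer see who dissented and follow up.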
Why the Cheap Model Compounds Wrong
The specific failure mode of the sweatshop approach: it can’t handle edge cases.
Most labeling errors don’t come from the routine 90% of the data. They come from the 10% that’s ambiguous, contextual, or unusual. A medical conversation where penicillin could be a current medication, an allergy, or a past prescription. A legal document where a specific clause could be standard or unusual. An image where the object boundaries are unclear because of lighting or occlusion.
These cases require judgment. Workers in the cheap-labor model don’t have the time, training, or incentive to apply judgment: they’re paid by throughput, so the optimal move is to label fast and move on. The result: the routine 90% gets labeled correctly and the critical 10% gets labeled effectively at random. The model trains on those unreliable labels for the cases it most needs to handle correctly, and fails in production exactly where it shouldn’t fail.
Ahmed’s operating insight on this: encourage dissent. “It’s actually good to encourage some dissent, right? Because a lot of times, specifically on the edge cases, it’s usually the odd person out who’s right.” His system intentionally identifies which annotators perform best on which kinds of edge cases and routes hard problems to those people. Consensus on edge cases isn’t the goal — accuracy is, and accuracy on hard cases often comes from the person who disagrees with everyone else.
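One way to picture the routing idea: keep a per-annotator accuracy score for each kind of edge case, measured against adjudicated answers, and send each hard item to whoever scores best on that kind. The sketch below assumes such scores exist; `route`, the `min_reviews` threshold, the category names, and the numbers are hypothetical, and the source does not describe how Perle.ai implements routing.

```python
def route(category, accuracy_by_category, min_reviews=20):
    """Pick the annotator with the best track record on this edge-case category.

    accuracy_by_category maps annotator_id -> {category: (correct, total)}.
    Annotators with fewer than min_reviews adjudicated items in the category
    are skipped, since their accuracy estimate is too noisy to trust.
    Returns None when nobody qualifies.
    """
    best_id, best_acc = None, -1.0
    for annotator_id, stats in accuracy_by_category.items():
        correct, total = stats.get(category, (0, 0))
        if total < min_reviews:
            continue
        acc = correct / total
        if acc > best_acc:
            best_id, best_acc = annotator_id, acc
    return best_id


# Hypothetical track records: (correct, total) per category.
history = {
    "ann-a": {"medication_status": (45, 50), "occluded_object": (10, 40)},
    "ann-b": {"medication_status": (30, 50), "occluded_object": (36, 40)},
    "ann-c": {"medication_status": (5, 5)},  # perfect so far, but too few items
}

print(route("medication_status", history))  # ann-a
print(route("occluded_object", history))    # ann-b
print(route("contract_clause", history))    # None
```

Because the scores are measured against adjudicated answers rather than agreement with the majority, an annotator who often dissents and turns out to be right rises to the top for that category instead of being penalized.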
The Standardization Effect
The unexpected outcome of running the four-pillar model: costs drop dramatically over time as the work standardizes.
Ahmed’s pattern: when Perle takes on a new project, the early phase is high-touch and expensive, with rapid iteration to get the quality right. As the workflow matures and the team learns the customer’s edge cases, the work standardizes, and the cost falls with it. “It’s not uncommon for us to see the cost drop after we’ve run a project for six months to a tenth of the initial cost.”
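As a back-of-the-envelope check on that figure: if the cost fell by the same percentage every month, reaching a tenth of the initial cost in six months would imply a drop of roughly 32% per month. The constant monthly rate is an assumption made here for illustration; the source gives only the six-month endpoint, and a real project may step down unevenly.

```python
# Assumption: cost shrinks by a constant factor each month (geometric decline).
# Only the endpoint (one tenth after six months) comes from the quote above.
months = 6
final_fraction = 0.10

monthly_factor = final_fraction ** (1 / months)  # cost multiplier per month
monthly_drop = 1 - monthly_factor                # fractional decrease per month

for m in range(months + 1):
    print(f"month {m}: {monthly_factor ** m:.0%} of initial cost")

print(f"implied monthly drop: {monthly_drop:.1%}")
```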
This is the right flywheel: high quality enables standardization, standardization enables self-serve operation, self-serve operation enables lower costs, lower costs enable more customers, more customers fund better training and tooling, which enables higher quality. The whole system compounds in the customer’s favor over time.
The sweatshop model can’t reach this state. The work never standardizes because the quality is never high enough to trust. The QA burden scales with volume rather than dropping with maturity. Cost stays flat or grows, and the relationship eventually collapses when the customer realizes the TCO didn’t make sense.
What This Looks Like to Customers
For customers evaluating data labeling vendors, the operational signals to look for:
- Annotator continuity: Are the same people working on your project for months, or is there constant turnover?
- Edge case routing: Does the vendor have a system for identifying and escalating ambiguous cases, or do they push everything through the same workflow?
- Iteration cadence: Does the vendor revisit and revise labels through multiple passes, or are labels treated as final after the first pass?
- Self-serve trajectory: Is the vendor working toward standardizing your pipeline so you can run it without their intervention, or are they keeping you dependent?
Vendors running the four-pillar model invest in continuity, build edge case routing into their systems, run iterative loops, and progressively transfer self-serve capability. Vendors running the sweatshop model don’t. The difference shows up in the data and, eventually, in the model’s production performance.
FAQ
Why is paying data labelers more sometimes cheaper for customers?
Paying labelers more produces higher quality output, which reduces customer QA burden, touch-up work, and production failures. Ahmed Rashad’s pattern: cheap-labor vendors require customers to hire QA staff, manage rework, and absorb production failures, which drives the total cost above what the higher-priced expert vendor would have charged. Better-paid annotators produce data that can go directly into the model.
What is the four-pillar framework for managing data annotators?
Right people (domain experts matched to the work), right tooling (AI-human pipelines), right incentives (compensation that rewards quality over throughput), right training (continuous, not one-time). Ahmed Rashad describes this as the framework for treating annotation as craft rather than commodity. The framework produces higher quality, lower turnover, and a workflow that standardizes over time into lower-cost operations.
Why does the sweatshop model produce bad training data?
Sweatshop models pay by throughput, so the optimal worker behavior is fast labeling without judgment. Edge cases — the 10% of ambiguous, contextual data that matters most for production performance — get labeled randomly. The model trains on noise for the cases it needs to handle correctly, then fails in production. Cheap labor compounds into expensive model failures.
How does expert annotation handle edge cases differently?
Expert-driven labeling systems intentionally identify which annotators perform best on which kinds of edge cases and route hard problems to those people. Ahmed Rashad’s principle: encourage dissent, because the goal on edge cases is accuracy, not consensus, and the odd person out is often right. Edge case routing is an explicit operational layer, not an afterthought.
Why does data labeling cost drop after standardization?
When quality is high enough early in a project, the workflow can be standardized into self-serve operations the customer can run with minimal vendor intervention. Standardization eliminates per-document overhead. Ahmed Rashad reports costs commonly dropping to a tenth of the initial cost after a project has run for six months. The sweatshop model can’t reach this state because the work never standardizes: the quality is never high enough to trust.
How do you find the right data labelers for specialized domains?
Domain matching matters most for high-stakes verticals — clinicians for medical, lawyers for legal, native speakers for multilingual work. Within domain, individual matching matters too: some annotators handle routine work well, others excel at edge cases. The operational discipline is tracking who performs best on which kinds of problems and routing accordingly. Most vendors don’t do this.
What’s the role of AI automation in expert-driven labeling?
AI handles the work it’s good at: transcription, translation, PII extraction, initial entity tagging, surface-level classification. Human experts validate the AI output, correct errors, handle edge cases, and contribute domain judgment AI can’t replicate. The pipeline isn’t AI versus human — it’s AI doing scale work and humans doing judgment work, integrated through tooling that tracks both.
How do you measure data labeling quality beyond accuracy?
Accuracy matters, but so does coverage of edge cases, consistency across annotators, judgment quality on ambiguous examples, and whether the labels capture the variability domain experts bring. Ahmed Rashad notes that customers in subjective domains specifically want annotator variability — different valid interpretations of the same example — because the model needs to learn that the answer depends on context.
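One common way to preserve annotator variability rather than erase it is to store the distribution of labels per item (often called soft labels) alongside, or instead of, a single majority-vote label. The sketch below illustrates that idea; `label_distribution` and the sample votes are hypothetical, and the source does not say which representation Perle.ai or its customers use.

```python
from collections import Counter


def label_distribution(labels):
    """Turn one item's annotator labels into a {label: share_of_votes} mapping."""
    if not labels:
        raise ValueError("need at least one label")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Hypothetical votes from four annotators on one contract clause.
votes = ["standard", "standard", "unusual", "standard"]
dist = label_distribution(votes)

print(dist)                     # {'standard': 0.75, 'unusual': 0.25}
print(max(dist, key=dist.get))  # standard
```

A majority vote would record only `standard`; the distribution also keeps the fact that one in four annotators read the clause differently.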
What kinds of work require expert annotation rather than crowdsourced annotation?
High-stakes verticals where context and judgment dominate: medical (clinical interpretation), dental (lesion detection), legal (contract analysis, document processing), multilingual (dialect and code-switching), embodied AI (physical interaction modeling), structural reasoning. Crowdsourced annotation works for routine tasks like image classification or simple transcription where context is minimal.
How long does it take to onboard a domain expert annotator?
Initial onboarding takes days to weeks depending on the domain. But ongoing training is continuous — domain experts arrive with judgment but need to learn the customer’s specific edge cases, terminology, and labeling system. Ahmed Rashad’s pattern: training is not one-time, it’s an ongoing investment that keeps the team calibrated as the customer’s data evolves.