CRM Data Cleanup: How AI Finds the Duplicates You Missed

In February last year, two of our reps called the same prospect on the same day. Anya reached out at 10:15 AM. Tomás called at 2:30 PM. Same person, same company, same phone number. The prospect — a VP of Product at a mid-market fintech — was polite about it the first time. By the second call, she was annoyed. "I literally spoke to someone from your team four hours ago. Do you people not talk to each other?"

We did talk to each other. We just didn't know we were working the same account. The prospect existed as two separate records in Attio: one created by Anya from a webinar attendee list ("Katherine Walsh, VP Product, Meridian Financial"), and one created by Tomás from a LinkedIn import ("Kate Walsh, VP of Product, Meridian"). Different first names. Slightly different titles. Same person.

That duplicate cost us the deal. Katherine — Kate — told Anya the next day that she was going with another vendor. "The double-call thing made me question how organized your team is. If your sales process is this messy, what's the product going to be like?" A $41K deal lost to a data quality problem.

That was the week I stopped treating CRM cleanup as a quarterly nice-to-have and started treating it as a revenue-critical operation.

The Scope of the Problem

I asked Marcus, our sales manager, to run an audit. He spent two days going through our Attio instance. The findings were sobering.

Duplicate contacts: 347 probable duplicates out of 4,200 total contact records. That's an 8.3% duplication rate. Some were obvious (same email, different name spellings). Most were subtle — same person, different email addresses; same company, different entity names; contacts who'd changed jobs and existed once at their old company and once at their new one.

Stale records: 890 contacts hadn't been updated in over six months. Of those, Marcus manually spot-checked 50 and found that 34 (68%) had at least one outdated field: wrong title, wrong company, disconnected phone number, or bounced email address. Extrapolating that rate across all 890, roughly 600 of our contact records had meaningfully inaccurate data.

Orphaned records: 215 contacts not linked to any company record. Floating in the CRM with a name and maybe an email, no organizational context, no deal association. Ghosts in the machine.

Missing data: Average field completion across all records was 41%. Most records had a name, an email, and a company. Maybe a title. Phone number populated on 38% of records. Industry on 29%. Company size on 22%. Source attribution on 44%.

Marcus summarized it: "About a quarter of our CRM is either wrong, duplicated, or so incomplete it's useless." He spent two full days reaching this conclusion. Two days of a sales manager's time — roughly $1,400 in loaded cost — and he'd only done a sampling-based audit. Fixing the problems would take weeks.
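
The headline numbers from the audit are easy to reproduce. This is a quick check of the arithmetic using only the figures above; the one assumption is that the 50-record spot check is representative of all 890 stale candidates.

```python
# Figures reported in Marcus's audit.
total_contacts = 4200
probable_duplicates = 347
stale_candidates = 890   # not updated in over six months
sample_size = 50
sample_outdated = 34

duplication_rate = probable_duplicates / total_contacts
sample_outdated_rate = sample_outdated / sample_size

# Assumption: the spot-check sample is representative of all 890 records.
estimated_inaccurate = stale_candidates * sample_outdated_rate

print(f"Duplication rate: {duplication_rate:.1%}")                   # 8.3%
print(f"Outdated share of sample: {sample_outdated_rate:.0%}")       # 68%
print(f"Estimated inaccurate records: {estimated_inaccurate:.0f}")   # 605
```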

Why Manual Cleanup Fails

We tried the manual approach first. I assigned Priya, our SDR, to spend Friday afternoons on CRM cleanup. She was efficient. She could review and merge about 30 duplicate records per hour, or verify and update about 20 stale records per hour. At that pace, addressing the 347 duplicates would take about 12 hours. The 600 stale records, another 30 hours. The orphans and missing data, more still.

Priya did this for three weeks. She fixed 180 records. Then she got pulled into a prospecting push, and the cleanup stopped. Meanwhile, the team was creating new records every day — about 15-20 new contacts per week — and some percentage of those were introducing new duplicates, new incomplete records, new orphans.

CRM decay isn't a problem you solve once. It's a problem that regenerates faster than manual effort can address it. Elena compared it to weeding a garden: "You can spend all Saturday pulling weeds, and by Wednesday they're back." She's right. The decay rate exceeds the manual cleanup rate. You can never catch up.
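
To put rough numbers on that, here is a back-of-the-envelope sketch using the figures from this section. It ignores the orphans, the missing data, and every new bad record created along the way, so it is a best case rather than a forecast.

```python
# Known backlog from the audit.
duplicates = 347
stale = 600

# Priya's hourly pace when she could focus on cleanup.
merge_rate = 30    # duplicate records per hour
verify_rate = 20   # stale records per hour

ideal_hours = duplicates / merge_rate + stale / verify_rate

# What actually happened: 180 records fixed over three weeks.
observed_per_week = 180 / 3
weeks_to_clear = (duplicates + stale) / observed_per_week

print(f"Focused effort needed: {ideal_hours:.1f} hours")             # 41.6 hours
print(f"Weeks at the observed pace: {weeks_to_clear:.1f}")           # 15.8
```

Nearly four months of Fridays to clear only the known duplicates and stale records, and that assumes nothing new breaks in the meantime.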

What AI Catches That Humans Don't

We deployed a CRM data cleanup agent that runs continuous scans on our Attio instance. The difference between this and manual cleanup isn't just speed. It's pattern recognition.

Manual cleanup catches obvious duplicates: same email address, same phone number, exact name matches. These are easy. Any deduplication tool can find them.

The AI catches fuzzy duplicates — the ones that are actually dangerous because they're hard for humans to spot. Here are real examples from our first scan:

"Katherine Walsh" and "Kate Walsh" at the same company. Different first names, same last name, same company. A simple string match wouldn't flag these. The AI recognized that "Kate" is a common diminutive of "Katherine" and that the records shared a company match. Flagged as probable duplicate. Correct.

"James Chen, Director of Engineering, Pinnacle Tech" and "James Chen, VP of Engineering, Vertex Solutions." Same name, different companies, different titles. Not a duplicate? Actually, it was — James had changed jobs. His LinkedIn confirmed it. The AI flagged it because the older record's email address was bouncing and the newer record was created recently. Probable job change.

"Acme Corp" and "Acme Corporation" and "ACME Corp." Three company records for the same entity, each with different contacts attached. Between them, 14 contacts split across three records instead of unified under one. None of the deals on these records had complete context because the history was fragmented.

The AI also caught something no human would have found manually: two contacts with different names, different emails, and different companies, but the same phone number. It turned out one record came from a trade show badge scan that captured the prospect's personal cell phone, and the other came from a web form where the same person used their work email. Same human, zero overlapping identifiers except the phone number. A reviewer comparing names, emails, and companies would never have connected them.
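
I can't show the agent's internals, but the three signals in these examples (nicknames, company name variants, and shared phone numbers) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the tiny nickname table, the list of legal suffixes, the 10-digit North American phone rule, and the record fields. A real matcher would use far larger lookup tables and weigh the evidence rather than returning simple reasons.

```python
import re

# Hypothetical, deliberately tiny nickname table.
NICKNAMES = {"kate": "katherine", "katie": "katherine", "bill": "william", "jim": "james"}

# Illustrative list of legal suffixes to strip from the end of company names.
LEGAL_SUFFIXES = {"corp", "corporation", "inc", "incorporated", "llc", "ltd", "co"}


def canonical_first_name(name: str) -> str:
    """Map a known diminutive to its full form; otherwise just lowercase it."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def normalize_company(name: str) -> str:
    """Lowercase, drop punctuation, and strip trailing legal suffixes."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def normalize_phone(raw: str, default_country_code: str = "1") -> str:
    """Keep digits only; assume 10-digit numbers are missing the country code."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) == 10:
        digits = default_country_code + digits
    return digits


def match_reasons(a: dict, b: dict) -> list[str]:
    """Return the reasons two contact records look like the same person."""
    reasons = []
    same_first = canonical_first_name(a["first"]) == canonical_first_name(b["first"])
    same_last = a["last"].strip().lower() == b["last"].strip().lower()
    same_company = normalize_company(a["company"]) == normalize_company(b["company"])
    if same_first and same_last and same_company:
        reasons.append("name_and_company")
    phone_a, phone_b = a.get("phone"), b.get("phone")
    if phone_a and phone_b and normalize_phone(phone_a) == normalize_phone(phone_b):
        reasons.append("phone")
    return reasons


# Made-up records in the spirit of the examples above.
kate = {"first": "Kate", "last": "Walsh", "company": "Acme Corp."}
katherine = {"first": "Katherine", "last": "Walsh", "company": "ACME Corporation"}
print(match_reasons(kate, katherine))  # ['name_and_company']
```

Note that stripping suffixes is exactly the kind of rule that can over-match, which is why the lower-confidence results need a person to look at them.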

First scan: the agent identified 412 probable duplicates (versus Marcus's 347 from his two-day audit), 734 records with stale data, 289 orphaned records, and 1,100+ records with significant data gaps. It completed the scan in about 40 minutes. Marcus's audit, covering less ground, took two days.

The Merge Problem

Finding duplicates is the easy part. Merging them correctly is where things get complicated.

When you have two records for the same person, which one is the "master"? Which fields do you keep? If Record A has the person's old title and Record B has their current title, you obviously keep B's title. But what if Record A has notes from three calls and Record B has notes from one call? You need all the notes. What if Record A is linked to Deal #1 and Record B is linked to Deal #2, and those deals are actually the same deal, also duplicated?

Duplicate records metastasize. A duplicate contact creates duplicate deal records which create duplicate activity logs which create incorrect pipeline metrics. Cleaning contacts without cleaning the downstream data makes things worse — you end up with a "merged" contact record that's linked to a confusing tangle of partially-duplicate deal records.

The cleanup agent handles merge logic with rules we configured: prefer the most recently updated record as master, preserve all notes and activities from both records, prefer the email address that isn't bouncing, prefer the title from the more recent record, and consolidate all deals (flagging potential deal-level duplicates for human review).
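
Those rules translate fairly directly into code. The sketch below is my own illustration of the rules as described, not the agent's implementation; the `Contact` fields are hypothetical, and deal-level duplicate detection is left out because it goes to human review anyway.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Contact:
    # Hypothetical record shape, for illustration only.
    name: str
    title: str
    email: str
    email_bouncing: bool
    updated: date
    notes: list[str] = field(default_factory=list)
    deal_ids: list[str] = field(default_factory=list)


def merge_contacts(a: Contact, b: Contact) -> Contact:
    """Merge two records for the same person using the rules described above."""
    # Rule 1: the most recently updated record is the master.
    master, other = (a, b) if a.updated >= b.updated else (b, a)

    # Rule 3: prefer an email address that isn't bouncing.
    if master.email_bouncing and not other.email_bouncing:
        email, bouncing = other.email, False
    else:
        email, bouncing = master.email, master.email_bouncing

    return Contact(
        name=master.name,
        title=master.title,                 # Rule 4: title from the more recent record
        email=email,
        email_bouncing=bouncing,
        updated=master.updated,
        notes=other.notes + master.notes,   # Rule 2: keep every note, older record first
        # Rule 5: consolidate deals; deal-level duplicates still need human review.
        deal_ids=sorted(set(a.deal_ids) | set(b.deal_ids)),
    )
```

Picking a master by recency is simple, but it is a design choice. The non-bouncing email rule matters precisely because the newest record isn't always the most correct one field by field.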

Kenji's experience illustrates why automated merge logic matters. He had a prospect — let's call him David Park — who existed in three records. One from a webinar list, one from Kenji's manual entry, one from an inbound form submission. Between the three records, David had two different email addresses (personal and work), two phone numbers (mobile and office), and three different titles (the webinar list was outdated). The agent merged them into a single record with both email addresses, both phone numbers, the most current title, and the combined interaction history from all three sources.

Before the merge, Kenji had a partial picture — he only knew about the record he'd created. After the merge, he could see that David had attended a webinar three months before Kenji reached out, had filled out an inbound form two weeks ago (which went to the SDR team, not to Kenji), and had a richer engagement history than anyone realized. Kenji used the inbound form submission as a conversation opener in his next call: "I saw you checked out our webinar back in October and recently requested more info — what prompted you to revisit?" David was impressed. The deal progressed.

Stale Data: The Silent Revenue Killer

Duplicates are visible problems. Stale data is invisible. And invisible problems are worse because nobody knows they exist until something breaks.

Here's what stale CRM data actually costs. Anya emailed a proposal to a prospect's work email. The email bounced — the prospect had left the company two months ago. The CRM still showed them as the active contact. Anya lost four days discovering this, finding the new contact, rebuilding the relationship, and resending the proposal. The deal closed, but it closed two weeks late, pushing it from Q3 to Q4 and messing up the forecast.

Tomás called a phone number listed on a contact record. It rang through to someone who said, "Mark hasn't worked here in a year." Tomás spent 20 minutes tracking down Mark's new number. Multiply by hundreds of records and thousands of calls.

The cleanup agent addresses staleness through continuous monitoring. It flags records where: email addresses show bounce indicators, phone numbers show disconnection patterns, titles haven't been verified in over 90 days, company data conflicts with recent public information (funding rounds, acquisitions, layoffs), and contacts haven't had any activity — no emails, no calls, no meetings — in over six months.
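
As a rough illustration, the five flags could be expressed like this. The field names are hypothetical, the bounce, disconnection, and company-conflict signals are assumed to arrive as already-computed booleans, and "six months" is approximated as 182 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ContactStatus:
    # Hypothetical fields; upstream checks are assumed to fill in the booleans.
    email_bounced: bool
    phone_disconnected: bool
    title_verified_on: Optional[date]
    company_data_conflict: bool
    last_activity_on: Optional[date]


TITLE_MAX_AGE = timedelta(days=90)
INACTIVITY_LIMIT = timedelta(days=182)  # assumption: roughly six months


def staleness_flags(c: ContactStatus, today: date) -> list[str]:
    """Return the staleness flags that apply to one contact record."""
    flags = []
    if c.email_bounced:
        flags.append("email_bounce")
    if c.phone_disconnected:
        flags.append("phone_disconnected")
    # A title that was never verified counts as unverified.
    if c.title_verified_on is None or today - c.title_verified_on > TITLE_MAX_AGE:
        flags.append("title_unverified_90d")
    if c.company_data_conflict:
        flags.append("company_data_conflict")
    # A contact with no recorded activity at all counts as inactive.
    if c.last_activity_on is None or today - c.last_activity_on > INACTIVITY_LIMIT:
        flags.append("inactive_6mo")
    return flags
```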

The six-month inactivity flag caught 340 records in our first scan. Marcus reviewed a sample and decided to mark 60% of them as "dormant" — not deleted, but moved to a separate view so they don't clutter active prospecting. The remaining 40% were contacts on active deals that had simply been neglected. Several of those led to re-engagement efforts that revived stalled deals. One — a $28K opportunity that Priya had let go quiet — came back to life and closed within six weeks of re-engagement.

The Ongoing Discipline

CRM cleanup isn't a project. It's a program. The agent runs weekly scans. Every Monday morning, Marcus gets a data quality report: new duplicates detected, records flagged for staleness, field completion trends, and a "data health score" for the overall CRM.

The first month, the health score was 52 out of 100. After six months of continuous AI-powered cleanup, supplemented by human review of flagged issues, the score is 81. Field completion went from 41% to 67%. Duplicate rate went from 8.3% to 1.4%. Stale data rate dropped from roughly 15% to 4%.

The impact on daily operations is tangible. Email bounce rates dropped from 6.2% to 1.8%. Call connection rates improved by 14% because phone numbers are more current. Forecast accuracy improved because deal data is more complete and more likely to reflect reality.

Elena said something that captures the change well: "I used to dread opening a CRM record because I knew half the information would be wrong. Now I actually trust what I see. That sounds small, but it changes how you work. You make decisions based on the CRM instead of in spite of it."

The Human Review Layer

The agent doesn't auto-merge everything. High-confidence duplicates — same email, matching names — get merged automatically. Lower-confidence matches — fuzzy name matches, same-company-but-different-email situations — get queued for human review. We assign about 30 minutes per week to reviewing the queue. Most weeks there are 10-15 items to review. A rep looks at the evidence, confirms or rejects the match, and moves on.

This hybrid approach is intentional. Fully automated cleanup is risky because edge cases exist. Two people at the same company with similar names who are actually different people. Parent and subsidiary companies that look like duplicates but are separate buying entities. Former employees who still show up in databases under their old company.

Tomás rejected a merge suggestion last month: "Jordan Kim, Marketing Director, Apex Group" and "Jordan Kim, Marketing Manager, Apex Group." Different titles, same company. The agent guessed it was the same person with an outdated title on one record. Tomás knew it was actually two different people — one in their US office, one in their London office. Both named Jordan Kim. Both in marketing. Same company. Different humans. Without the human review step, the agent would have merged them and we'd have lost a contact.
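
The routing itself is simple to sketch. In this illustration, an exact email and name match merges automatically, and anything fuzzier goes to the review queue. The 0.6 similarity threshold, the record fields, and the use of Python's `difflib` for name similarity are all my assumptions, not the agent's actual scoring.

```python
from difflib import SequenceMatcher


def route_match(a: dict, b: dict, review_threshold: float = 0.6) -> str:
    """Decide whether a candidate pair is auto-merged, reviewed, or ignored."""
    same_email = a["email"].lower() == b["email"].lower()
    name_a, name_b = a["name"].lower(), b["name"].lower()

    # High confidence: same email and matching names.
    if same_email and name_a == name_b:
        return "auto_merge"

    # Lower confidence: fuzzy name match at the same company, or a shared
    # email with differing names. A human decides.
    similarity = SequenceMatcher(None, name_a, name_b).ratio()
    same_company = a["company"].lower() == b["company"].lower()
    if same_email or (same_company and similarity >= review_threshold):
        return "review"

    return "no_match"


# The two Jordan Kims, with made-up email addresses.
us = {"name": "Jordan Kim", "email": "jkim@example.com", "company": "Apex Group"}
london = {"name": "Jordan Kim", "email": "jordan.kim@example.com", "company": "Apex Group"}
print(route_match(us, london))  # review
```

Under these rules the two Jordan Kims land in the review queue rather than being merged, because their email addresses differ.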

What Clean Data Makes Possible

Clean CRM data is a prerequisite for everything else we want to do with AI. Enrichment agents work better when they're not enriching duplicate records. Meeting prep agents produce better briefs when the underlying data is accurate. Pipeline analytics are meaningful only when the data feeding them reflects reality.

The cleanup agent was the least glamorous thing we've implemented. Nobody gets excited about deduplication. Nobody tweets about field completion rates. But it's the foundation that makes the exciting stuff work.

Marcus told me recently that the $41K deal we lost to Katherine Walsh's duplicate record bothered him for months. "We lost a deal because our data was messy. Not because of our product, not because of our pricing, not because of the competition. Because two records existed for the same person and we looked incompetent." He paused. "That won't happen again."

It hasn't.

Try These Agents

CRM Data Cleanup -- Continuous duplicate detection, stale record flagging, and data quality monitoring
Account Review Prep -- Pre-meeting briefs that depend on clean, accurate CRM data
Call Intelligence Analyzer -- Structured call extraction that feeds clean data into enriched CRM records
Contact Enrichment -- Enrich Attio contacts with company data, social profiles, and buying signals
