Coincidencia aproximada explicada: cuando 'casi igual' es la respuesta correcta
Two strings are an exact match when they are byte-identical. Two strings are a fuzzy match when they are similar enough that you would treat them as the same record — but not equal. 'Acme Corp' and 'Acme Corporation' are fuzzy matches. 'Johnatan Smith' and 'Jonathan Smith' are fuzzy matches. 'jose@acme.com' and 'JOSE@acme.com' should not be — that is just normalisation.
The whole point of fuzzy matching is to bridge the gap between 'exactly equal' and 'definitely different' for messy human-typed data: names, companies, addresses, free-text fields. Used correctly it saves hours of manual reconciliation. Used badly it creates a pile of false positives that takes longer to clean up than the original problem.
Always do the cheap normalisation first
The single most common mistake is to reach for fuzzy matching before cleaning the data. Most 'fuzzy' matches in real CSV files are not fuzzy at all — they are exact matches hidden by trivial formatting noise. Trailing spaces, mixed casing, accents, smart quotes copied from Word, zero-width characters pasted from a webpage. All of these are removed by deterministic, predictable normalisation rules.
Run those first. They are free, they are auditable, they always produce the same result and they handle the majority of near-matches in real data. After that pass, what is left is the actually-fuzzy cases — and there will be far fewer of them, which makes the fuzzy step much faster and the result much more useful.
What Jaro-Winkler actually measures
MessyMatch uses Jaro-Winkler similarity as its fuzzy metric. It scores any pair of strings from 0.0 (nothing in common) to 1.0 (identical). The base Jaro similarity counts how many characters match within a sliding window the length of the strings, and how many of those matches are out of order (transpositions). Winkler then adds a bonus for matching prefixes — the first few characters being identical bumps the score up.
The reason Jaro-Winkler is the default for name matching is that human typing errors cluster: people transpose adjacent characters, miss a letter in the middle of a word, or substitute a similar-looking character. They rarely start typing a name with the wrong letter. Jaro-Winkler captures exactly that behaviour: it tolerates middle-of-string noise and rewards prefix agreement.
Picking a threshold
A threshold of 0.92 is a sensible default. It catches single-letter typos, transpositions and minor word reorders while skipping most genuinely-different strings. Drop it to 0.85 if you want looser matches (more results, more false positives). Raise it to 0.96 if you want to only see very close matches.
There is no universally right threshold. It depends on what your data looks like and how expensive a false positive is in your workflow. The right way to pick is: run with the default, look at the result, adjust if the noise floor or recall is wrong. Most users land somewhere between 0.88 and 0.94 after one or two iterations.
Why naive fuzzy matching is too slow
Comparing every row in file A against every row in file B is an O(n × m) operation. Two files of 10,000 rows each is 100 million comparisons. At a few microseconds per Jaro-Winkler score, that is a minute or two of pure CPU. At 50,000 rows each, it is several billion comparisons — too slow to be useful.
The standard fix is blocking. Instead of scoring all pairs, you only score pairs that share some cheap characteristic — a common prefix, a common phonetic key, a length within ±N characters. Pairs outside the block cannot be similar enough to clear the threshold by construction, so skipping them is safe. Blocking is what makes fuzzy matching tractable on real files. MessyMatch uses a prefix block combined with a length pre-filter, which keeps comparisons usable up to roughly 50,000 rows per side in the browser.
When fuzzy matching is the wrong tool
Fuzzy matching is not magic and there are cases where you should not reach for it:
- Your join key is a stable identifier (UUID, order number, SKU). Exact compare is faster and accurate; fuzzy will invent false positives where two unrelated codes happen to look similar.
- The differences are semantic, not lexical. 'Big Apple' vs 'New York City' are zero by Jaro-Winkler; you need a domain-specific aliases table.
- You are matching long free-text descriptions. Jaro-Winkler is built for short strings; for paragraphs you want token-based metrics or vector similarity.
- The data is structured (addresses with reordered components, phone numbers in different formats). A parser plus exact compare on the normalised form is more reliable than fuzzy on the raw text.
Always show the reason
A fuzzy match without a reason is a guess. The result panel should tell you which characters matched, which differ, and what the score is — so you can decide whether to accept it as a real match or reject it as coincidence. That is what the Almost matches tab in MessyMatch surfaces: both original values, the similarity score and a human-readable reason (accent difference, casing, single-letter typo, word reorder, extra token).
Without that, fuzzy matching is opaque and you have to trust the threshold. With it, you can audit every result in two seconds and act on it confidently.
Try it on real data
The Fuzzy match CSV page is the fastest way to see what this looks like on your own data. Drop two CSVs, enable cleaning rules first, then turn on fuzzy and pick a threshold. The result panel will show you exact matches, almost-matches with scores, and everything that did not match at all.