Fuzzy match CSV files online.

Find approximate matches between two CSV files — typos, casing, accents, word order, abbreviations. Jaro-Winkler similarity with blocking, runs in your browser, never uploads your data.


When exact matching is not enough

Most real-world reconciliation breaks because the data is messy in ways that simple cleaning rules cannot fix. 'Acme Corporation' on the vendor list and 'Acme Corp.' in the invoice file. 'Johnatan Smith' in a typed field and 'Jonathan Smith' in another. 'Calle Mayor 12' and 'C/ Mayor, 12'. None of these are exact matches and none of them are caught by trimming spaces or ignoring casing.

Fuzzy matching scores how similar two strings are and flags pairs above a threshold as approximate matches. It is the right tool when your join key is a human-typed field (name, company, address) and the wrong tool when your join key is a stable identifier (UUID, order number, SKU) — for those, exact compare is faster and more accurate.

How the fuzzy matcher works

MessyMatch uses Jaro-Winkler, a string similarity metric tuned for short strings like names and codes. It gives extra weight to characters that match at the start of the string, which maps well to how humans actually mistype names. A score of 1.0 means identical; 0.0 means nothing in common. The default threshold is 0.92.
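To make the scoring concrete, here is a minimal Python sketch of the textbook Jaro-Winkler formula. It is an illustration only, not MessyMatch's actual engine code: the function names are made up for this example, and it assumes the standard parameters (a prefix bonus of 0.1 per matching leading character, capped at 4 characters).

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Two characters only count as a match if they sit within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters in order and count those that disagree;
    # half of that count is the number of transpositions.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro score plus a bonus for a shared prefix of up to 4 characters."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)


print(round(jaro_winkler("Johnatan", "Jonathan"), 4))
```

Under these assumptions 'Johnatan' and 'Jonathan' score roughly 0.933, just above the 0.92 default, because all eight characters match and only their order differs.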

Comparing every row in A against every row in B is O(n × m) and becomes unusable past a few thousand rows on each side. The engine avoids that by blocking: only candidate pairs that share a prefix (or fall inside a length band) are scored. Pairs outside the block cannot be near-matches by construction, so skipping them is safe. In practice this keeps fuzzy comparisons usable up to about 50,000 rows per side.
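The blocking idea can be sketched in a few lines of Python. This is a hypothetical illustration, not the engine's real implementation: the function name, the 2-character prefix key and the 0.6 length ratio are all assumptions. The ratio is not arbitrary, though. With the standard Jaro-Winkler parameters the score can never exceed 0.8 + 0.2 × (shorter length / longer length), so at a threshold of 0.92 any pair with a length ratio below 0.6 cannot qualify.

```python
from collections import defaultdict
from typing import Iterator


def candidate_pairs(
    rows_a: list[str],
    rows_b: list[str],
    prefix_len: int = 2,         # assumed block key: first 2 characters
    min_len_ratio: float = 0.6,  # assumed length band, see the bound above
) -> Iterator[tuple[int, int]]:
    """Yield (index_in_a, index_in_b) pairs worth scoring."""
    # Group the rows of B by their case-insensitive prefix.
    blocks: dict[str, list[int]] = defaultdict(list)
    for j, b in enumerate(rows_b):
        blocks[b[:prefix_len].casefold()].append(j)

    for i, a in enumerate(rows_a):
        # Only look at rows of B that landed in the same block as this row.
        for j in blocks.get(a[:prefix_len].casefold(), ()):
            shorter, longer = sorted((len(a), len(rows_b[j])))
            # Length pre-filter: skip pairs whose lengths are too far apart.
            if longer == 0 or shorter / longer >= min_len_ratio:
                yield i, j


rows_a = ["Acme Corp.", "Jonathan Smith"]
rows_b = ["Acme Corporation", "Johnatan Smith", "Zenith Ltd", "Jo"]
print(list(candidate_pairs(rows_a, rows_b)))
```

Only two of the eight possible pairs survive here and would go on to be scored. Note that the prefix key in this sketch never pairs values whose first two characters differ, so the choice of key is a trade-off between speed and recall.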

Each almost-match in the result panel shows both original values, the similarity score and a human-readable reason — accent difference, casing, single-letter typo, word reorder, extra token. The reason is what makes the result actionable: you do not have to squint at two strings to figure out why they were flagged.
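The text does not describe how MessyMatch derives these reasons, so the following Python sketch is only a hypothetical heuristic showing how such labels could be assigned. The function names and the order of the checks are assumptions, and inputs are assumed to be already trimmed.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks, so that 'é' becomes 'e'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def one_edit_apart(a: str, b: str) -> bool:
    """True for one substitution, insertion, deletion or adjacent swap."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a
    i = 0
    while i < len(a) and a[i] == b[i]:  # skip the common prefix
        i += 1
    if len(a) == len(b):
        substituted = a[i + 1:] == b[i + 1:]
        swapped = a[i:i + 2] == b[i:i + 2][::-1] and a[i + 2:] == b[i + 2:]
        return substituted or swapped
    return a[i:] == b[i + 1:]  # one extra character in the longer string


def explain(a: str, b: str) -> str:
    """Return the first reason that accounts for the difference."""
    if a == b:
        return "identical"
    if a.casefold() == b.casefold():
        return "casing"
    fa, fb = strip_accents(a).casefold(), strip_accents(b).casefold()
    if fa == fb:
        return "accent difference"
    tokens_a, tokens_b = fa.split(), fb.split()
    if sorted(tokens_a) == sorted(tokens_b):
        return "word reorder"
    if set(tokens_a) < set(tokens_b) or set(tokens_b) < set(tokens_a):
        return "extra token"
    if one_edit_apart(fa, fb):
        return "single-letter typo"
    return "other"


print(explain("José García", "Jose Garcia"))
print(explain("Smith John", "John Smith"))
```

A pair such as 'Johnatan' and 'Jonathan' falls through to "other" in this sketch, because it is more than one edit apart even though its similarity score is high.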

Always do the cheap normalisation first

The most common mistake with fuzzy matching is jumping straight to it. Most 'approximate' matches in messy CSV data are actually exact matches after trimming spaces, ignoring casing and stripping accents. Enable those cleaning rules first: they are free, deterministic and predictable. Save the fuzzy pass for the records that still do not match after the cheap rules have run. The result is fewer false positives and a much faster comparison.
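As a rough Python sketch of that order of operations, the cheap rules can be applied as a normalisation function, followed by an exact lookup that leaves only the unmatched rows for the fuzzy pass. The function names are hypothetical, and collapsing repeated inner spaces is an extra assumption beyond the three rules named above.

```python
import unicodedata


def normalise(value: str) -> str:
    """Trim and collapse spaces, ignore casing, strip accents."""
    decomposed = unicodedata.normalize("NFKD", value)
    no_accents = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return " ".join(no_accents.casefold().split())


def split_exact_and_leftover(rows_a: list[str], rows_b: list[str]):
    """Match on normalised values; return what is left for fuzzy scoring."""
    # Index B by normalised value so each lookup is a dictionary hit.
    index_b: dict[str, list[int]] = {}
    for j, b in enumerate(rows_b):
        index_b.setdefault(normalise(b), []).append(j)

    exact: list[tuple[int, int]] = []
    leftover_a: list[int] = []
    matched_b: set[int] = set()
    for i, a in enumerate(rows_a):
        hits = index_b.get(normalise(a))
        if hits:
            exact.extend((i, j) for j in hits)
            matched_b.update(hits)
        else:
            leftover_a.append(i)
    leftover_b = [j for j in range(len(rows_b)) if j not in matched_b]
    return exact, leftover_a, leftover_b


rows_a = ["  José García", "Acme Corp."]
rows_b = ["JOSE GARCIA", "Acme Corporation"]
exact, leftover_a, leftover_b = split_exact_and_leftover(rows_a, rows_b)
print(exact, leftover_a, leftover_b)
```

In this example the name pair is settled by the cheap rules alone, and only the two company-name rows remain as candidates for fuzzy scoring.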

Common use cases for fuzzy CSV matching

  • Reconciling a vendor list against an invoice file (company name variants)
  • Merging two contact databases where names were typed by different agents
  • Matching customer records across systems with no shared ID
  • Detecting near-duplicate products in a catalog before publishing
  • Cleaning a survey export where free-text answers vary slightly
  • Cross-checking a list of academic citations with inconsistent formatting

Honest limits of fuzzy matching

Jaro-Winkler is a character-level metric. It is excellent for names, company codes and short text fields with typos. It is weaker on long free-text descriptions, addresses with reordered components and anything that needs semantic understanding (e.g. 'Big Apple' vs 'New York City'). For those cases the right answer is usually domain-specific normalisation — an address parser, a known-aliases table — before passing the cleaned strings into the fuzzy matcher.
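A known-aliases table can be as simple as a dictionary lookup applied before scoring. The Python sketch below is purely illustrative: the table contents and the function name are made up for this example, and a real table would be curated for the domain at hand.

```python
# Hypothetical whole-value aliases: no character-level metric can bridge these.
ALIASES = {
    "big apple": "new york city",
    "nyc": "new york city",
}

# Hypothetical token-level abbreviations.
ABBREVIATIONS = {
    "c/": "calle",
    "corp": "corporation",
    "corp.": "corporation",
}


def canonicalise(value: str) -> str:
    """Rewrite known aliases and abbreviations into one canonical form."""
    # Lowercase, treat commas as spaces and collapse repeated whitespace.
    cleaned = " ".join(value.casefold().replace(",", " ").split())
    if cleaned in ALIASES:
        return ALIASES[cleaned]
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in cleaned.split())


print(canonicalise("Big Apple"))
print(canonicalise("C/ Mayor, 12"))
print(canonicalise("Acme Corp."))
```

After this step 'C/ Mayor, 12' and 'Calle Mayor 12' become the same string, so they no longer need fuzzy scoring at all, while values that are still different afterwards can be passed on to the matcher.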

Browser-first by design

All scoring runs in your browser via a Web Worker. Our servers do not have an endpoint that ingests file content — the worker reads each file from disk, runs the comparison locally and hands the result back to the page. We only record metadata about the operation (row count, file size, format, elapsed time) for abuse limits. See the privacy policy for the full list.


Frequently asked questions

What is a fuzzy match in CSV comparison?

A fuzzy match — also called an approximate match — is two rows that should refer to the same record but are not byte-identical even after normalising spaces, casing and accents. For example, 'Acme Corp' vs 'Acme Corporation', or 'Johnatan' vs 'Jonathan'. MessyMatch flags these in the Almost matches tab with the similarity score and the reason.

What algorithm does the fuzzy matcher use?

Jaro-Winkler similarity with a configurable threshold (default 0.92). To stay fast on large files the engine also uses blocking — only candidates that share a prefix or fall inside a length band are scored — so it does not compare every row in A against every row in B.

Can I tune how strict the fuzzy match is?

Yes. Lower the similarity threshold to catch looser matches (more results, more false positives). Raise it to only see very close matches. The threshold lives in the compare settings panel next to the cleaning rules.

Will fuzzy matching find typos?

Yes, that is the primary use case. Single-letter typos, transposed characters, common misspellings of names and companies all surface as almost-matches. Each result shows both original values so you can decide whether to merge.

Is fuzzy matching slower than exact compare?

Yes. It scores candidate pairs instead of hashing into a map. Blocking and a length pre-filter keep it tractable up to about 50,000 rows per side on a normal laptop. Past that, narrow the candidate set with a key-column compare first and apply fuzzy only on the remaining mismatches.

Are my files uploaded for fuzzy matching?

No. The Jaro-Winkler scoring runs inside your browser in a Web Worker, and the file contents are not transmitted to our servers.

How is this different from VLOOKUP with TRUE?

VLOOKUP with TRUE assumes the lookup column is sorted ascending and returns the closest lower value, which is wrong for almost-equal text matching. MessyMatch uses string similarity scoring designed for messy human data — names, companies, addresses — not for sorted numeric ranges.