Removing Duplicates in a Systematic Review: The Records-Removed Box

Removing duplicates is the step between identifying your records and screening them, and it populates the records removed before screening box of your PRISMA flow diagram. When you search several databases, the same article appears multiple times, and each appearance counts as a separate record at identification until you merge them. Deduplication takes the total records identified, removes the repeats, and leaves the unique records that enter title and abstract screening. You can record the exact counts in our free PRISMA 2020 flow diagram generator, and this guide explains how to deduplicate reliably and document the number defensibly.

The reason deduplication deserves attention is that it is both the first place a flow diagram can go wrong and the easiest count for a reviewer to question. Remove too few and the screening volume looks inflated; remove too many and you risk discarding genuinely distinct records.

Why Duplicates Exist in the First Place

A systematic review searches multiple databases on purpose, because no single database indexes everything. The cost of that comprehensiveness is overlap: an article indexed in PubMed, Embase, and Scopus is returned three times, so three records enter the identification box for one underlying article. This is correct and expected. PRISMA counts every record at identification precisely so the search is transparent, and deduplication is the documented step that resolves the overlap afterward. Our explainer on what goes in each PRISMA box sets out why the identification count is deliberately inflated before deduplication trims it.

Where Deduplication Sits in the Diagram

In the PRISMA 2020 layout, deduplication is reported in the records removed before screening box, alongside any records removed by automation tools or for other pre-screening reasons. The arithmetic is simple but strict: total records identified minus everything removed in this box equals the records that move into screening. That single subtraction is the first link in the chain a reviewer checks, so the duplicate count must be exact. A common error is to deduplicate silently and report only the post-deduplication total, which hides the step; the figure should show both the identified total and the number removed.
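
The subtraction described above can be checked mechanically before the numbers go into the diagram. The following is a minimal sketch, not part of any PRISMA tooling: the function name and the example counts are hypothetical, chosen only to illustrate the arithmetic.

```python
def records_entering_screening(identified, duplicates, automation=0, other=0):
    """Return identified minus everything removed before screening."""
    counts = (identified, duplicates, automation, other)
    if min(counts) < 0:
        raise ValueError("counts cannot be negative")
    removed = duplicates + automation + other
    if removed > identified:
        raise ValueError("cannot remove more records than were identified")
    return identified - removed


# Hypothetical counts, for illustration only.
to_screen = records_entering_screening(
    identified=1200, duplicates=310, automation=25, other=5
)
print(to_screen)
```

If the value printed here does not match the number your screening platform shows at the start of title and abstract screening, the discrepancy is worth resolving before the diagram is drawn.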

How to Find and Remove Duplicates Reliably

Deduplication is part automated, part manual, because no algorithm catches every variant. A defensible workflow is:

Export each database's results with full bibliographic fields, since matching depends on titles, authors, years, and identifiers being present.
Run automated deduplication in a reference manager or screening platform, which catches exact and near-exact matches.
Manually review near matches the tool is unsure about, because differing page numbers, accents, or abbreviated journal names can split one article into two records, while sparse or near-identical metadata can merge two distinct ones.
Keep a record of the final number removed so the count is reproducible.
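
The automated first pass in the steps above can be sketched in a few lines. This is an illustrative approach under stated assumptions, not the algorithm any particular reference manager or platform uses: records are assumed to be dictionaries with hypothetical "title", "year", and "doi" fields, and two records are treated as duplicates when they share a DOI or a normalised title plus year. Matching on title and year alone can over-merge (for example, a conference abstract and the later full paper), which is why the manual pass still matters.

```python
import re
import unicodedata


def normalise_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", " ", no_accents.lower()).strip()


def match_keys(record):
    """Build the keys a record can match on: title plus year, and DOI if present."""
    keys = {("title_year", normalise_title(record["title"]), record.get("year"))}
    doi = (record.get("doi") or "").strip().lower()
    if doi:
        keys.add(("doi", doi))
    return keys


def deduplicate(records):
    """Keep the first record seen per key; return the kept records and the number removed."""
    seen = set()
    unique = []
    for record in records:
        keys = match_keys(record)
        is_duplicate = bool(keys & seen)
        seen |= keys
        if not is_duplicate:
            unique.append(record)
    return unique, len(records) - len(unique)


# Hypothetical export rows, for illustration only.
records = [
    {"source": "PubMed", "title": "Café interventions in older adults: a trial.",
     "year": 2020, "doi": "10.1000/example.1"},
    {"source": "Embase", "title": "Cafe Interventions in Older Adults: A Trial",
     "year": 2020, "doi": "10.1000/EXAMPLE.1"},
    {"source": "Scopus", "title": "Cafe interventions in older adults - a trial",
     "year": 2020, "doi": ""},
    {"source": "Scopus", "title": "A different study", "year": 2019, "doi": None},
]
unique, removed = deduplicate(records)
print(len(unique), removed)
```

The returned count is the figure to write down in the final step, since it comes directly from the same run that produced the screening set.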

Figure 1. A reliable deduplication workflow combines automated matching for clear duplicates with manual review of borderline records before the count is finalised.

The tool you reach for shapes how many duplicates it catches and how much manual cleanup remains. The table below compares the common approaches.

Approach	How it matches	Strengths	Watch-outs
Reference manager (automated)	Title, author, year fields	Fast, built into existing libraries	Misses variant titles; can over-merge on sparse metadata
Specialised deduplication method	Multi-field weighted rules	Higher precision, designed for review-sized sets	Requires following the published steps carefully
Systematic review platform	Algorithmic match plus a review queue	Surfaces uncertain pairs for a human decision	Match thresholds differ between platforms
Manual review	Human judgement on each pair	Catches subtle variants and false matches	Time-consuming; impractical alone at scale

No single method is sufficient on its own. The defensible pattern is an automated first pass for the obvious repeats, then a manual second pass over the records the tool flags as uncertain.

The screening platforms researchers already use handle the first pass well, and the reliability of the whole process depends on clean screening discipline, which our guide to systematic review screening best practices covers in full.

If your screening is happening in Covidence or Rayyan, the duplicate count is already calculated for you, and the practical task is reading it out of the right place. Our guide to pulling deduplication figures straight from Covidence or Rayyan shows where each platform reports the number so it lands in the records-removed box without a recount.

Documenting the Count Defensibly

Two numbers must be reproducible: the total records identified and the number of duplicates removed. State the deduplication method in your methods text, naming the tool and whether a manual check followed, so a reader can see the removal was principled rather than arbitrary. If you also removed records by an automation tool or another rule before screening, report those separately within the same box rather than folding them into the duplicate count, because they are distinct removal reasons. Conflating them is one of the common PRISMA diagram mistakes that draws reviewer queries.
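
One way to keep the removal reasons from being conflated is to store each count under its own name from the start. The sketch below is a hypothetical record-keeping structure, not a feature of any screening platform; the class name, field names, and example counts are assumptions. The box labels follow the wording of the PRISMA 2020 flow diagram template.

```python
from dataclasses import dataclass


@dataclass
class RemovalLog:
    identified_by_source: dict  # database name -> records exported
    duplicates: int = 0
    automation: int = 0
    other: int = 0
    method: str = ""  # free-text description for the methods section

    @property
    def identified(self):
        """Total records identified across all sources."""
        return sum(self.identified_by_source.values())

    def box_lines(self):
        """Render the box with one line per removal reason."""
        return [
            "Records removed before screening:",
            f"  Duplicate records removed (n = {self.duplicates})",
            f"  Records marked as ineligible by automation tools (n = {self.automation})",
            f"  Records removed for other reasons (n = {self.other})",
        ]


# Hypothetical counts and method text, for illustration only.
log = RemovalLog(
    identified_by_source={"PubMed": 500, "Embase": 450, "Scopus": 250},
    duplicates=310,
    automation=25,
    other=5,
    method="Automated deduplication in a reference manager, then manual check",
)
print(log.identified)
print("\n".join(log.box_lines()))
```

Keeping the per-database counts alongside the removal counts means both reproducible numbers, the identified total and the duplicates removed, can be regenerated from one saved record.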

Avoiding Over- and Under-Deduplication

The two failure modes pull in opposite directions. Under-deduplication leaves repeats in the screening set, inflating the volume and risking the same article being screened inconsistently by different reviewers. Over-deduplication merges records that only look alike, silently dropping a distinct study before anyone reads it. The safeguard against both is the manual review of uncertain matches: trust the tool for clear duplicates, but adjudicate the borderline cases by hand. A defensible duplicate count, anchored to a stated method, keeps the very first subtraction in your flow diagram clean and the rest of the chain credible.
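
The safeguard described above amounts to using two thresholds rather than one: pairs above the upper threshold are merged automatically, pairs between the two go to a human, and everything else is left alone. The sketch below illustrates that idea with Python's standard-library difflib. Both threshold values are assumptions chosen for illustration, not recommended settings, and real tools typically compare more fields than the title.

```python
from difflib import SequenceMatcher

AUTO_MERGE = 0.97    # assumed threshold: treat as a clear duplicate
NEEDS_REVIEW = 0.85  # assumed threshold: send to manual review


def triage_pair(title_a, title_b):
    """Classify a pair of titles as duplicate, manual review, or distinct."""
    score = SequenceMatcher(None, title_a.lower(), title_b.lower()).ratio()
    if score >= AUTO_MERGE:
        return "duplicate"
    if score >= NEEDS_REVIEW:
        return "manual review"
    return "distinct"


# Hypothetical titles, for illustration only.
same = triage_pair("Exercise therapy for knee osteoarthritis",
                   "Exercise Therapy for Knee Osteoarthritis")
different = triage_pair("Exercise therapy for knee osteoarthritis",
                        "Statins and stroke risk")
print(same, different)
```

Raising the upper threshold pushes more pairs into manual review and reduces the risk of over-deduplication; lowering the lower threshold widens the review queue and reduces the risk of under-deduplication, at the cost of more reviewer time.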

Frequently Asked Questions

Where do duplicates go in a PRISMA 2020 flow diagram?

Duplicates are reported in the records removed before screening box, on their own line, separate from records removed by automation tools or for other pre-screening reasons. Everything removed in that box is subtracted from the total records identified, and the result is the count that enters title and abstract screening.

How do I remove duplicates in a systematic review?

Export each database's results with full bibliographic fields, run automated deduplication in a reference manager or screening platform, then manually review the near matches the tool is unsure about. Record the final number removed so the count is reproducible.

Should I count the same article from three databases as three records?

Yes, at the identification stage. Every database hit counts as a separate record so the search is transparent. Deduplication then merges the repeats, the number removed is reported in the records removed before screening box, and only the unique records proceed to screening.

What is the difference between under- and over-deduplication?

Under-deduplication leaves repeated records in the screening set, inflating the volume and risking inconsistent decisions. Over-deduplication merges records that only appear similar, dropping a distinct study before it is read. Manually reviewing uncertain matches guards against both.

Do I need to report how I removed duplicates?

Yes. State the deduplication method in your methods text, naming the tool and whether a manual check followed. Report duplicates separately from any records removed by automation tools, since those are distinct removal reasons within the same box.