Quality assessment (also called risk of bias assessment or critical appraisal) is a mandatory component of systematic reviews that evaluates the methodological rigor of included studies to determine how much confidence readers can place in each study's results. The quality of individual studies directly affects the reliability of the review's conclusions. A systematic review that pools results from high-quality and critically flawed studies without distinguishing between them can produce misleading evidence.
Choosing the correct quality assessment tool depends on the study designs included in your review. Using an inappropriate tool (for example, applying a randomized trial tool to an observational study) is a methodological error that undermines the assessment's validity. This guide compares the most widely used tools and helps you select the right one for your review.
Why Quality Assessment Is Essential
Quality assessment serves several critical functions in a systematic review:
- Contextualize results: Understanding each study's methodological strengths and weaknesses helps interpret its findings
- Inform synthesis decisions: Quality assessment results may determine whether studies are combined in meta-analysis or analyzed separately
- Support sensitivity analysis: Restricting analysis to low risk-of-bias studies tests whether conclusions are robust
- Assess certainty of evidence: Frameworks like GRADE use risk of bias as a key domain for rating overall evidence certainty
- Meet reporting standards: The PRISMA 2020 checklist requires reporting of quality assessment methods and results
Quality Assessment Tools by Study Design
Cochrane Risk of Bias 2 (RoB 2): For Randomized Controlled Trials
RoB 2, developed by the Cochrane Bias Methods Group, is the current standard for assessing risk of bias in randomized controlled trials. It replaced the original Cochrane Risk of Bias tool in 2019.
Structure: RoB 2 evaluates bias through five domains:
- Bias arising from the randomization process: Was the allocation sequence random? Was allocation concealed?
- Bias due to deviations from intended interventions: Were participants and personnel blinded? Were there deviations from the assigned intervention?
- Bias due to missing outcome data: Were outcome data available for all or nearly all participants? Could missingness depend on the true value of the outcome?
- Bias in measurement of the outcome: Was the outcome assessor blinded? Could measurement be influenced by knowledge of intervention?
- Bias in selection of the reported result: Were multiple outcome measurements available? Was there selective reporting?
Judgment for each domain: Low risk, Some concerns, High risk
Overall judgment: Low risk of bias (all domains low), Some concerns (at least one domain with some concerns but no domain at high risk), High risk of bias (at least one domain at high risk, or some concerns in multiple domains in a way that substantially lowers confidence in the result)
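These aggregation rules are mechanical enough to express in a few lines of code. The sketch below mirrors the published RoB 2 algorithm; whether multiple "some concerns" judgments substantially lower confidence is a reviewer judgment, represented here as an explicit flag, and RoB 2 allows reviewers to override the algorithmic suggestion.

```python
# Minimal sketch of the RoB 2 domain-to-overall aggregation rule.
# Domain judgments are "low", "some concerns", or "high".

def rob2_overall(domains: list[str], concerns_lower_confidence: bool = False) -> str:
    """Derive an overall RoB 2 judgment from the five domain judgments."""
    if "high" in domains:
        return "high"
    if domains.count("some concerns") > 1 and concerns_lower_confidence:
        # Multiple "some concerns" escalate to high only if the reviewer
        # judges that together they substantially lower confidence.
        return "high"
    if "some concerns" in domains:
        return "some concerns"
    return "low"

print(rob2_overall(["low", "low", "some concerns", "low", "low"]))  # some concerns
```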
When to use: Any systematic review that includes randomized controlled trials as the primary study design.
ROBINS-I: For Non-Randomized Studies of Interventions
ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions) assesses non-randomized studies that compare outcomes between groups that received different interventions.
Structure: Seven bias domains:
- Bias due to confounding
- Bias in selection of participants into the study
- Bias in classification of interventions
- Bias due to deviations from intended interventions
- Bias due to missing data
- Bias in measurement of outcomes
- Bias in selection of the reported result
Judgment levels: Low risk, Moderate risk, Serious risk, Critical risk, No information
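For ROBINS-I, the overall judgment is generally at least as severe as the most severe domain judgment. A minimal sketch of that heuristic follows; "no information" domains are left out because they require case-by-case judgment.

```python
# Sketch of the ROBINS-I aggregation heuristic: the overall judgment is
# at least as severe as the most severe domain judgment. Multiple serious
# domains may justify escalating to critical in the reviewer's judgment.
SEVERITY = ["low", "moderate", "serious", "critical"]

def robins_i_overall(domains: list[str]) -> str:
    return max(domains, key=SEVERITY.index)

print(robins_i_overall(["low", "moderate", "low", "serious", "low", "low", "low"]))  # serious
```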
When to use: Systematic reviews including cohort studies, case-control studies, or other non-randomized comparative studies evaluating interventions.
Newcastle-Ottawa Scale (NOS): For Observational Studies
The Newcastle-Ottawa Scale is a widely used tool for assessing the quality of non-randomized studies (cohort studies and case-control studies) in systematic reviews.
Structure: Three domains assessed using a star system (maximum 9 stars):
- Selection (up to 4 stars): Representativeness of the cohort or cases, selection of the comparison group, and ascertainment of exposure (the exact items differ between the cohort and case-control versions)
- Comparability (up to 2 stars): Comparability of groups, achieved through design or analysis
- Outcome/Exposure (up to 3 stars): Assessment of outcome and adequacy of follow-up (cohort version), or ascertainment of exposure (case-control version)
Scoring: Studies are often categorized as high quality (7-9 stars), moderate quality (4-6 stars), or low quality (0-3 stars), though these thresholds vary across reviews.
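If you record star counts in a script or spreadsheet export, those thresholds translate directly into a categorization function. A sketch assuming the 7-9 / 4-6 / 0-3 cut-offs; substitute whatever thresholds your protocol pre-specifies.

```python
# Sketch: map Newcastle-Ottawa star counts to a quality category using the
# common 7-9 / 4-6 / 0-3 thresholds. These cut-offs are a convention, not
# part of the NOS itself; pre-specify your choice in the protocol.

def nos_category(selection: int, comparability: int, outcome: int) -> str:
    assert 0 <= selection <= 4 and 0 <= comparability <= 2 and 0 <= outcome <= 3
    total = selection + comparability + outcome
    if total >= 7:
        return "high quality"
    if total >= 4:
        return "moderate quality"
    return "low quality"

print(nos_category(selection=3, comparability=2, outcome=2))  # high quality (7 stars)
```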
When to use: Systematic reviews including cohort or case-control studies. The NOS is simpler than ROBINS-I and may be preferred when a less detailed assessment is sufficient.
Limitations: The NOS has been criticized for poor inter-rater reliability and for a star-based scoring system that can oversimplify complex methodological judgments.
QUADAS-2: For Diagnostic Accuracy Studies
QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) is designed specifically for systematic reviews of diagnostic test accuracy.
Structure: Four domains:
- Patient selection: Was patient sampling consecutive or random? Were inappropriate exclusions avoided?
- Index test: Was the index test interpreted without knowledge of the reference standard?
- Reference standard: Was the reference standard likely to correctly classify the condition?
- Flow and timing: Did all patients receive the reference standard? Were all patients included in the analysis?
The first three domains are assessed for both risk of bias and applicability concerns; flow and timing is assessed for risk of bias only.
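Because three of the four domains carry two parallel judgments, an extraction form for QUADAS-2 needs both fields per domain. One way to structure such a record is sketched below; the field names are illustrative, not part of the tool.

```python
# Sketch: one way to record QUADAS-2 judgments. Three domains pair a
# risk-of-bias judgment with an applicability judgment; flow and timing
# is assessed for risk of bias only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Domain:
    risk_of_bias: str                    # "low", "high", or "unclear"
    applicability: Optional[str] = None  # None for flow and timing

@dataclass
class Quadas2Assessment:
    study_id: str
    patient_selection: Domain
    index_test: Domain
    reference_standard: Domain
    flow_and_timing: Domain

record = Quadas2Assessment(
    study_id="Smith 2021",  # illustrative study
    patient_selection=Domain("low", "low"),
    index_test=Domain("unclear", "low"),
    reference_standard=Domain("low", "high"),
    flow_and_timing=Domain("high"),
)
print(record.index_test.risk_of_bias)  # unclear
```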
When to use: Systematic reviews evaluating the accuracy of diagnostic tests (sensitivity, specificity, predictive values).
JBI Critical Appraisal Tools: For Multiple Study Designs
The Joanna Briggs Institute (JBI) provides a suite of standardized critical appraisal checklists covering 13 different study designs:
- Randomized controlled trials (13 items)
- Quasi-experimental studies (9 items)
- Cohort studies (11 items)
- Case-control studies (10 items)
- Cross-sectional studies (8 items)
- Case reports (8 items)
- Case series (10 items)
- Qualitative research (10 items)
- Systematic reviews (11 items)
- Text and opinion papers (6 items)
- Prevalence studies (9 items)
- Economic evaluations (11 items)
- Diagnostic test accuracy studies (10 items)
When to use: JBI tools are particularly useful for mixed-methods reviews or reviews that include study designs not covered by Cochrane tools (e.g., qualitative studies, case series, cross-sectional studies).
Choosing the Right Tool
| Study Design | Recommended Tool | Alternative |
|---|---|---|
| Randomized controlled trials | Cochrane RoB 2 | JBI RCT checklist |
| Non-randomized intervention studies | ROBINS-I | Newcastle-Ottawa Scale |
| Cohort studies (not intervention) | Newcastle-Ottawa Scale | JBI Cohort checklist |
| Case-control studies | Newcastle-Ottawa Scale | JBI Case-Control checklist |
| Cross-sectional studies | JBI Cross-Sectional checklist | AXIS tool |
| Diagnostic accuracy studies | QUADAS-2 | JBI Diagnostic checklist |
| Qualitative studies | JBI Qualitative checklist | CASP Qualitative checklist |
| Case reports/series | JBI Case Report/Series checklists | None |
| Mixed study designs | JBI suite (multiple checklists) | Design-specific tools |
When your systematic review includes multiple study designs, use the appropriate tool for each design. Do not apply a single tool across different designs.
Conducting Quality Assessment
Step 1: Select Your Tool(s)
Choose tools during protocol development and document them in your systematic review protocol. Justify your selection based on the study designs you expect to include.
Step 2: Pilot the Assessment
Before assessing all studies, pilot the tool on 2-3 studies with both reviewers. Discuss judgments, calibrate interpretation of domains, and develop decision rules for common scenarios.
Step 3: Independent Dual Assessment
Two reviewers should independently assess each study using the selected tool. Record domain-level and overall judgments separately.
Step 4: Resolve Disagreements
Compare assessments and resolve discrepancies through discussion or a third reviewer. Document the resolution process and the level of initial agreement.
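Raw percent agreement overstates consistency because some agreement occurs by chance; Cohen's kappa corrects for this and is the statistic most often reported. A minimal sketch of the calculation for two reviewers' overall judgments (scikit-learn's cohen_kappa_score computes the same statistic):

```python
# Sketch: Cohen's kappa for two reviewers' overall judgments.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from each rater's marginals.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative judgments for six studies:
a = ["low", "high", "some concerns", "low", "low", "high"]
b = ["low", "high", "low", "low", "some concerns", "high"]
print(round(cohens_kappa(a, b), 2))  # 0.45
```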
Step 5: Present Results
Present quality assessment results in:
- A summary table showing domain-level and overall judgments for each study
- A risk of bias summary figure, such as a traffic light plot for RoB 2 (see the plotting sketch after this list)
- Narrative description in the results section
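In R, the robvis package produces these figures directly. If you work in Python, a serviceable traffic light plot can be assembled with matplotlib; the studies and judgments below are illustrative placeholders.

```python
# Sketch: a basic RoB 2 traffic light plot with matplotlib.
# Rows are studies; columns are the five domains plus the overall judgment.
import matplotlib.pyplot as plt

domains = ["D1", "D2", "D3", "D4", "D5", "Overall"]
judgments = {  # illustrative data
    "Smith 2021": ["low", "low", "some concerns", "low", "low", "some concerns"],
    "Lee 2022":   ["low", "high", "low", "low", "low", "high"],
    "Park 2023":  ["low", "low", "low", "low", "low", "low"],
}
colors = {"low": "#2e7d32", "some concerns": "#f9a825", "high": "#c62828"}

fig, ax = plt.subplots(figsize=(6, 2 + 0.5 * len(judgments)))
for y, (study, row) in enumerate(judgments.items()):
    for x, judgment in enumerate(row):
        ax.scatter(x, y, s=400, color=colors[judgment])
ax.set_xticks(range(len(domains)), labels=domains)
ax.set_yticks(range(len(judgments)), labels=list(judgments))
ax.invert_yaxis()  # first study at the top
ax.set_title("Risk of bias (RoB 2)")
plt.tight_layout()
plt.savefig("rob2_traffic_light.png", dpi=150)
```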
Incorporating Quality Assessment into Synthesis
Quality assessment results should inform your evidence synthesis, not merely be reported as a standalone exercise:
In Meta-Analysis
- Sensitivity analysis: Run the meta-analysis restricted to low risk-of-bias studies and compare with the full analysis (see the sketch after this list)
- Subgroup analysis: Stratify by overall quality rating (low vs high risk of bias)
- Meta-regression: Use quality scores or domain-level judgments as moderators
- Weighting: Some approaches weight studies by quality, though this is controversial
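To make the sensitivity analysis concrete, the sketch below pools illustrative effect estimates with simple inverse-variance (fixed-effect) weighting, once for all studies and once restricted to low risk-of-bias studies. In practice you would typically use a dedicated package (such as metafor in R) and a random-effects model.

```python
# Sketch: fixed-effect inverse-variance pooling, full set vs. a
# sensitivity analysis restricted to low risk-of-bias studies.
import math

studies = [
    # (name, log odds ratio, standard error, overall RoB 2 judgment) - illustrative
    ("Smith 2021", 0.42, 0.15, "some concerns"),
    ("Lee 2022",   0.80, 0.20, "high"),
    ("Park 2023",  0.35, 0.12, "low"),
    ("Diaz 2024",  0.30, 0.18, "low"),
]

def pooled(rows):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    weights = [1 / se**2 for _, _, se, _ in rows]
    estimate = sum(w * eff for w, (_, eff, _, _) in zip(weights, rows)) / sum(weights)
    return estimate, math.sqrt(1 / sum(weights))

full_est, full_se = pooled(studies)
low_est, low_se = pooled([s for s in studies if s[3] == "low"])
print(f"All studies:  {full_est:.2f} (SE {full_se:.2f})")
print(f"Low RoB only: {low_est:.2f} (SE {low_se:.2f})")
```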
Risk of bias can also be a source of heterogeneity: high-quality and low-quality studies may show systematically different effects, and understanding this relationship is important when interpreting pooled results.
In Narrative Synthesis
- Emphasize findings from high-quality studies
- Acknowledge limitations of low-quality studies
- Use quality as a factor when resolving conflicting findings
In GRADE Assessment
Risk of bias is one of five GRADE domains for rating certainty of evidence. Serious or very serious risk of bias across the body of evidence leads to downgrading the certainty rating. For a broader discussion of how quality assessment fits into the systematic review process, see our comprehensive guide to conducting a systematic review.
Common Quality Assessment Mistakes
- Using the wrong tool: Applying RoB 2 to observational studies or NOS to randomized trials. Match the tool to the study design.
- Assessing at the study level only: RoB 2 is designed to be applied at the outcome level (the same study may have low risk of bias for one outcome and high risk for another). Assess per outcome where possible.
- Equating quality scores with risk of bias: A high NOS score does not mean a study has no bias; it means the assessed domains scored well. Unmeasured confounding and other biases may still be present.
- Not piloting the assessment: Without calibration, inter-rater reliability suffers. Pilot on 2-3 studies before full assessment.
- Ignoring quality assessment in synthesis: Conducting quality assessment but not using it to inform the analysis (e.g., through sensitivity analysis) wastes the effort.
- Single-reviewer assessment: Quality assessment, like screening, should be performed by at least two independent reviewers.
Reporting Quality Assessment in PRISMA
The PRISMA 2020 checklist requires:
- Item 11: Describe the methods used to assess risk of bias of included studies (tools, processes, how results were used in synthesis)
- Item 18: Present the results of risk of bias assessment for each included study
- Item 20d: Present the results of all sensitivity analyses conducted to assess the robustness of synthesized results, including those based on risk of bias
Document the flow of studies through identification, screening, and inclusion in your PRISMA flow diagram. Create yours using our free PRISMA 2020 compliant diagram tool.
The GRADE Framework
GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) goes beyond individual study quality to rate the certainty of evidence across an entire body of evidence for each outcome. GRADE considers five domains:
- Risk of bias: Limitations in study design and execution
- Inconsistency: Unexplained variability in results
- Indirectness: How directly the evidence applies to the review question
- Imprecision: Wide confidence intervals or small sample sizes
- Publication bias: Systematic underreporting of unfavorable results
Evidence starts at "high" certainty for RCTs and "low" for observational studies, then is downgraded for serious limitations in any domain. Large effect sizes, dose-response gradients, and absence of plausible confounding can upgrade observational evidence.
GRADE certainty levels:
- High: Very confident that the true effect lies close to the estimate
- Moderate: Moderately confident; the true effect is likely close but may be substantially different
- Low: Limited confidence; the true effect may be substantially different
- Very low: Very little confidence; the true effect is likely substantially different
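The start-then-adjust logic described above can be sketched in code. In this illustration each domain contributes 0 (no serious concern), -1 (serious), or -2 (very serious), and upgrades add levels; it captures the bookkeeping only, not the judgment GRADE requires.

```python
# Sketch of GRADE's start-then-adjust bookkeeping. Downgrade scores per
# domain: 0 (no serious concern), -1 (serious), -2 (very serious).
# Upgrades (large effect, dose-response gradient, plausible confounding
# working against the effect) mainly apply to observational evidence.
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(randomized: bool, downgrades: dict[str, int], upgrades: int = 0) -> str:
    start = LEVELS.index("high") if randomized else LEVELS.index("low")
    score = start + sum(downgrades.values()) + upgrades
    return LEVELS[max(0, min(score, len(LEVELS) - 1))]

print(grade_certainty(randomized=True, downgrades={
    "risk_of_bias": -1, "inconsistency": 0, "indirectness": 0,
    "imprecision": -1, "publication_bias": 0,
}))  # low
```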
Frequently Asked Questions
Which quality assessment tool is best?
There is no single best tool; the correct tool depends on the study designs included in your review. Cochrane RoB 2 is the gold standard for RCTs, ROBINS-I for non-randomized intervention studies, and QUADAS-2 for diagnostic accuracy studies. JBI checklists provide the broadest coverage across study designs.
Do I need to assess quality in a scoping review?
Quality assessment is optional in scoping reviews, unlike systematic reviews where it is mandatory. If your scoping review includes quality assessment, document the methods and results. If not, acknowledge this as a limitation. For more on scoping reviews, see our guide on PRISMA-ScR for scoping reviews.
Can I create my own quality assessment tool?
Creating custom tools is generally discouraged because they lack validation and may not comprehensively cover relevant bias domains. Use established, validated tools wherever possible. If no validated tool exists for your specific study design, adapt the closest available tool and document all modifications.
How do I handle studies with mixed methods?
For mixed-methods studies, assess each component using the appropriate tool: quantitative components with quantitative tools and qualitative components with qualitative tools. Alternatively, the Mixed Methods Appraisal Tool (MMAT) appraises qualitative, quantitative, and mixed-methods components within a single framework.
Should I exclude studies based on quality assessment?
Quality assessment results should inform synthesis but generally should not be used to exclude studies that met your eligibility criteria. Instead, use sensitivity analysis to examine whether results change when restricted to high-quality studies. Excluding studies based on quality can introduce selection bias and is controversial in systematic review methodology.