Quality assessment (also called risk of bias assessment or critical appraisal) is a mandatory component of systematic reviews that evaluates the methodological rigor of included studies to determine how much confidence readers can place in each study's results. The quality of individual studies directly affects the reliability of the review's conclusions. A systematic review that pools results from high-quality and critically flawed studies without distinguishing between them can produce misleading evidence.
Choosing the correct quality assessment tool depends on the study designs included in your review. Using an inappropriate tool (for example, applying a randomized trial tool to an observational study) is a methodological error that undermines the assessment's validity. This guide compares the most widely used tools and helps you select the right one for your review.
Why Quality Assessment Is Essential
Quality assessment serves several critical functions in a systematic review:
- Contextualize results: Understanding each study's methodological strengths and weaknesses helps interpret its findings
- Inform synthesis decisions: Quality assessment results may determine whether studies are combined in meta-analysis or analyzed separately
- Support sensitivity analysis: Restricting analysis to low risk-of-bias studies tests whether conclusions are robust
- Assess certainty of evidence: Frameworks like GRADE use risk of bias as a key domain for rating overall evidence certainty
- Meet reporting standards: The PRISMA 2020 checklist requires reporting of quality assessment methods and results
Quality Assessment Tools by Study Design
Cochrane Risk of Bias 2 (RoB 2): For Randomized Controlled Trials
RoB 2, developed by the Cochrane Bias Methods Group, is the current standard for assessing risk of bias in randomized controlled trials. It replaced the original Cochrane Risk of Bias tool in 2019.
Structure: RoB 2 evaluates bias through five domains:
- Bias arising from the randomization process: Was the allocation sequence random? Was allocation concealed?
- Bias due to deviations from intended interventions: Were participants and personnel blinded? Were there deviations from the assigned intervention?
- Bias due to missing outcome data: Were outcome data available for all or nearly all participants? Could missingness depend on the true value of the outcome?
- Bias in measurement of the outcome: Was the outcome assessor blinded? Could measurement be influenced by knowledge of intervention?
- Bias in selection of the reported result: Were multiple outcome measurements available? Was there selective reporting?
Judgment for each domain: Low risk, Some concerns, High risk
Overall judgment: Low risk of bias (all domains low), Some concerns (at least one domain with some concerns but no domain at high risk), High risk of bias (at least one domain at high risk, or some concerns in multiple domains in a way that substantially lowers confidence in the result)
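These aggregation rules are mechanical enough to express in a few lines of code. The sketch below mirrors the published RoB 2 algorithm; whether multiple "some concerns" judgments substantially lower confidence is a reviewer judgment, represented here as an explicit flag, and RoB 2 allows reviewers to override the algorithmic suggestion.

```python
# Minimal sketch of the RoB 2 domain-to-overall aggregation rule.
# Domain judgments are "low", "some concerns", or "high".

def rob2_overall(domains: list[str], concerns_lower_confidence: bool = False) -> str:
    """Derive an overall RoB 2 judgment from the five domain judgments."""
    if "high" in domains:
        return "high"
    if domains.count("some concerns") > 1 and concerns_lower_confidence:
        # Multiple "some concerns" escalate to high only if the reviewer
        # judges that together they substantially lower confidence.
        return "high"
    if "some concerns" in domains:
        return "some concerns"
    return "low"

print(rob2_overall(["low", "low", "some concerns", "low", "low"]))  # some concerns
```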
When to use: Any systematic review that includes randomized controlled trials as the primary study design.
ROBINS-I: For Non-Randomized Studies of Interventions
ROBINS-I (Risk Of Bias In Non-randomized Studies of Interventions) assesses non-randomized studies that compare outcomes between groups that received different interventions.
Structure: Seven bias domains:
- Bias due to confounding
- Bias in selection of participants into the study
- Bias in classification of interventions
- Bias due to deviations from intended interventions
- Bias due to missing data
- Bias in measurement of outcomes
- Bias in selection of the reported result
Judgment levels: Low risk, Moderate risk, Serious risk, Critical risk, No information
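For ROBINS-I, the overall judgment is generally at least as severe as the most severe domain judgment. A minimal sketch of that heuristic follows; "no information" domains are left out because they require case-by-case judgment.

```python
# Sketch of the ROBINS-I aggregation heuristic: the overall judgment is
# at least as severe as the most severe domain judgment. Multiple serious
# domains may justify escalating to critical in the reviewer's judgment.
SEVERITY = ["low", "moderate", "serious", "critical"]

def robins_i_overall(domains: list[str]) -> str:
    return max(domains, key=SEVERITY.index)

print(robins_i_overall(["low", "moderate", "low", "serious", "low", "low", "low"]))  # serious
```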
When to use: Systematic reviews including cohort studies, case-control studies, or other non-randomized comparative studies evaluating interventions.
Newcastle-Ottawa Scale (NOS): For Observational Studies
The Newcastle-Ottawa Scale is a widely used tool for assessing the quality of non-randomized studies (cohort studies and case-control studies) in systematic reviews.
Structure: Three domains assessed using a star system (maximum 9 stars):
- Selection (up to 4 stars): Representativeness of the cohort or cases, selection of the comparison group, and ascertainment of exposure (the exact items differ between the cohort and case-control versions)
- Comparability (up to 2 stars): Comparability of groups, achieved through design or analysis
- Outcome/Exposure (up to 3 stars): Assessment of outcome and adequacy of follow-up (cohort version), or ascertainment of exposure (case-control version)
Scoring: Studies are often categorized as high quality (7-9 stars), moderate quality (4-6 stars), or low quality (0-3 stars), though these thresholds vary across reviews.
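If you record star counts in a script or spreadsheet export, those thresholds translate directly into a categorization function. A sketch assuming the 7-9 / 4-6 / 0-3 cut-offs; substitute whatever thresholds your protocol pre-specifies.

```python
# Sketch: map Newcastle-Ottawa star counts to a quality category using the
# common 7-9 / 4-6 / 0-3 thresholds. These cut-offs are a convention, not
# part of the NOS itself; pre-specify your choice in the protocol.

def nos_category(selection: int, comparability: int, outcome: int) -> str:
    assert 0 <= selection <= 4 and 0 <= comparability <= 2 and 0 <= outcome <= 3
    total = selection + comparability + outcome
    if total >= 7:
        return "high quality"
    if total >= 4:
        return "moderate quality"
    return "low quality"

print(nos_category(selection=3, comparability=2, outcome=2))  # high quality (7 stars)
```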
When to use: Systematic reviews including cohort or case-control studies. The NOS is simpler than ROBINS-I and may be preferred when a less detailed assessment is sufficient.
Limitations: The NOS has been criticized for poor inter-rater reliability and for a star-based scoring system that can oversimplify complex methodological judgments.
QUADAS-2: For Diagnostic Accuracy Studies
QUADAS-2 (Quality Assessment of Diagnostic Accuracy Studies) is designed specifically for systematic reviews of diagnostic test accuracy.
Structure: Four domains:
- Patient selection: Was patient sampling consecutive or random? Were inappropriate exclusions avoided?
- Index test: Was the index test interpreted without knowledge of the reference standard?
- Reference standard: Was the reference standard likely to correctly classify the condition?
- Flow and timing: Did all patients receive the reference standard? Were all patients included in the analysis?
The first three domains are assessed for both risk of bias and applicability concerns; flow and timing is assessed for risk of bias only.
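Because three of the four domains carry two parallel judgments, an extraction form for QUADAS-2 needs both fields per domain. One way to structure such a record is sketched below; the field names are illustrative, not part of the tool.

```python
# Sketch: one way to record QUADAS-2 judgments. Three domains pair a
# risk-of-bias judgment with an applicability judgment; flow and timing
# is assessed for risk of bias only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Domain:
    risk_of_bias: str                    # "low", "high", or "unclear"
    applicability: Optional[str] = None  # None for flow and timing

@dataclass
class Quadas2Assessment:
    study_id: str
    patient_selection: Domain
    index_test: Domain
    reference_standard: Domain
    flow_and_timing: Domain

record = Quadas2Assessment(
    study_id="Smith 2021",  # illustrative study
    patient_selection=Domain("low", "low"),
    index_test=Domain("unclear", "low"),
    reference_standard=Domain("low", "high"),
    flow_and_timing=Domain("high"),
)
print(record.index_test.risk_of_bias)  # unclear
```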
When to use: Systematic reviews evaluating the accuracy of diagnostic tests (sensitivity, specificity, predictive values).
JBI Critical Appraisal Tools: For Multiple Study Designs
The Joanna Briggs Institute (JBI) provides a suite of standardized critical appraisal checklists covering 13 different study designs:
- Randomized controlled trials (13 items)
- Quasi-experimental studies (9 items)
- Cohort studies (11 items)
- Case-control studies (10 items)
- Cross-sectional studies (8 items)
- Case reports (8 items)
- Case series (10 items)
- Qualitative research (10 items)
- Systematic reviews (11 items)
- Text and opinion papers (6 items)
- Prevalence studies (9 items)
- Economic evaluations (11 items)
- Diagnostic test accuracy studies (10 items)
When to use: JBI tools are particularly useful for mixed-methods reviews or reviews that include study designs not covered by Cochrane tools (e.g., qualitative studies, case series, cross-sectional studies).
Choosing the Right Tool
| Study Design | Recommended Tool | Alternative |
|---|---|---|
| Randomized controlled trials | Cochrane RoB 2 | JBI RCT checklist |
| Non-randomized intervention studies | ROBINS-I | Newcastle-Ottawa Scale |
| Cohort studies (not intervention) | Newcastle-Ottawa Scale | JBI Cohort checklist |
| Case-control studies | Newcastle-Ottawa Scale | JBI Case-Control checklist |
| Cross-sectional studies | JBI Cross-Sectional checklist | AXIS tool |
| Diagnostic accuracy studies | QUADAS-2 | JBI Diagnostic checklist |
| Qualitative studies | JBI Qualitative checklist | CASP Qualitative checklist |
| Case reports/series | JBI Case Report/Series checklists | None |
| Mixed study designs | JBI suite (multiple checklists) | Design-specific tools |
When your systematic review includes multiple study designs, use the appropriate tool for each design. Do not apply a single tool across different designs.
Conducting Quality Assessment
Step 1: Select Your Tool(s)
Choose tools during protocol development and document them in your systematic review protocol. Justify your selection based on the study designs you expect to include.
Step 2: Pilot the Assessment
Before assessing all studies, pilot the tool on 2-3 studies with both reviewers. Discuss judgments, calibrate interpretation of domains, and develop decision rules for common scenarios.
Step 3: Independent Dual Assessment
Two reviewers should independently assess each study using the selected tool. Record domain-level and overall judgments separately.
Step 4: Resolve Disagreements
Compare assessments and resolve discrepancies through discussion or a third reviewer. Document the resolution process and the level of initial agreement.
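Raw percent agreement overstates consistency because some agreement occurs by chance; Cohen's kappa corrects for this and is the statistic most often reported. A minimal sketch of the calculation for two reviewers' overall judgments (scikit-learn's cohen_kappa_score computes the same statistic):

```python
# Sketch: Cohen's kappa for two reviewers' overall judgments.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from each rater's marginals.
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Illustrative judgments for six studies:
a = ["low", "high", "some concerns", "low", "low", "high"]
b = ["low", "high", "low", "low", "some concerns", "high"]
print(round(cohens_kappa(a, b), 2))  # 0.45
```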
Step 5: Present Results
Present quality assessment results in:
- A summary table showing domain-level and overall judgments for each study
- A risk of bias summary figure, such as a traffic light plot for RoB 2 (see the plotting sketch after this list)
- Narrative description in the results section
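In R, the robvis package produces these figures directly. If you work in Python, a serviceable traffic light plot can be assembled with matplotlib; the studies and judgments below are illustrative placeholders.

```python
# Sketch: a basic RoB 2 traffic light plot with matplotlib.
# Rows are studies; columns are the five domains plus the overall judgment.
import matplotlib.pyplot as plt

domains = ["D1", "D2", "D3", "D4", "D5", "Overall"]
judgments = {  # illustrative data
    "Smith 2021": ["low", "low", "some concerns", "low", "low", "some concerns"],
    "Lee 2022":   ["low", "high", "low", "low", "low", "high"],
    "Park 2023":  ["low", "low", "low", "low", "low", "low"],
}
colors = {"low": "#2e7d32", "some concerns": "#f9a825", "high": "#c62828"}

fig, ax = plt.subplots(figsize=(6, 2 + 0.5 * len(judgments)))
for y, (study, row) in enumerate(judgments.items()):
    for x, judgment in enumerate(row):
        ax.scatter(x, y, s=400, color=colors[judgment])
ax.set_xticks(range(len(domains)), labels=domains)
ax.set_yticks(range(len(judgments)), labels=list(judgments))
ax.invert_yaxis()  # first study at the top
ax.set_title("Risk of bias (RoB 2)")
plt.tight_layout()
plt.savefig("rob2_traffic_light.png", dpi=150)
```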
Incorporating Quality Assessment into Synthesis
Quality assessment results should inform your evidence synthesis, not merely be reported as a standalone exercise:
In Meta-Analysis
- Sensitivity analysis: Run the meta-analysis restricted to low risk-of-bias studies and compare with the full analysis (see the sketch after this list)
- Subgroup analysis: Stratify by overall quality rating (low vs high risk of bias)
- Meta-regression: Use quality scores or domain-level judgments as moderators
- Weighting: Some approaches weight studies by quality, though this is controversial
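To make the sensitivity analysis concrete, the sketch below pools illustrative effect estimates with simple inverse-variance (fixed-effect) weighting, once for all studies and once restricted to low risk-of-bias studies. In practice you would typically use a dedicated package (such as metafor in R) and a random-effects model.

```python
# Sketch: fixed-effect inverse-variance pooling, full set vs. a
# sensitivity analysis restricted to low risk-of-bias studies.
import math

studies = [
    # (name, log odds ratio, standard error, overall RoB 2 judgment) - illustrative
    ("Smith 2021", 0.42, 0.15, "some concerns"),
    ("Lee 2022",   0.80, 0.20, "high"),
    ("Park 2023",  0.35, 0.12, "low"),
    ("Diaz 2024",  0.30, 0.18, "low"),
]

def pooled(rows):
    """Fixed-effect inverse-variance pooled estimate and its standard error."""
    weights = [1 / se**2 for _, _, se, _ in rows]
    estimate = sum(w * eff for w, (_, eff, _, _) in zip(weights, rows)) / sum(weights)
    return estimate, math.sqrt(1 / sum(weights))

full_est, full_se = pooled(studies)
low_est, low_se = pooled([s for s in studies if s[3] == "low"])
print(f"All studies:  {full_est:.2f} (SE {full_se:.2f})")
print(f"Low RoB only: {low_est:.2f} (SE {low_se:.2f})")
```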
Risk of bias can also be a source of heterogeneity: high-quality and low-quality studies may show systematically different effects, and understanding this relationship is important when interpreting pooled results.
In Narrative Synthesis
- Emphasize findings from high-quality studies
- Acknowledge limitations of low-quality studies
- Use quality as a factor when resolving conflicting findings
In GRADE Assessment
Risk of bias is one of five GRADE domains for rating certainty of evidence. Serious or very serious risk of bias across the body of evidence leads to downgrading the certainty rating. For a broader discussion of how quality assessment fits into the systematic review process, see our comprehensive guide to conducting a systematic review.
Common Quality Assessment Mistakes
- Using the wrong tool: Applying RoB 2 to observational studies or NOS to randomized trials. Match the tool to the study design.
- Assessing at the study level only: RoB 2 is designed to be applied at the outcome level (the same study may have low risk of bias for one outcome and high risk for another). Assess per outcome where possible.
- Equating quality scores with risk of bias: A high NOS score does not mean a study has no bias; it means the assessed domains scored well. Unmeasured confounding and other biases may still be present.
- Not piloting the assessment: Without calibration, inter-rater reliability suffers. Pilot on 2-3 studies before full assessment.
- Ignoring quality assessment in synthesis: Conducting quality assessment but not using it to inform the analysis (e.g., through sensitivity analysis) wastes the effort.
- Single-reviewer assessment: Quality assessment, like screening, should be performed by at least two independent reviewers.
Reporting Quality Assessment in PRISMA
The PRISMA 2020 checklist requires:
- Item 11: Describe the methods used to assess risk of bias of included studies (tools, processes, how results were used in synthesis)
- Item 18: Present the results of risk of bias assessment for each included study
- Item 20d: Present the results of all sensitivity analyses conducted to assess the robustness of synthesized results, including those based on risk of bias
Document the flow of studies through identification, screening, and inclusion in your PRISMA flow diagram. Create yours using our free PRISMA 2020 compliant diagram tool.
The GRADE Framework
GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) goes beyond individual study quality to rate the certainty of evidence across an entire body of evidence for each outcome. GRADE considers five domains:
- Risk of bias: Limitations in study design and execution
- Inconsistency: Unexplained variability in results
- Indirectness: How directly the evidence applies to the review question
- Imprecision: Wide confidence intervals or small sample sizes
- Publication bias: Systematic underreporting of unfavorable results
Evidence starts at "high" certainty for RCTs and "low" for observational studies, then is downgraded for serious limitations in any domain. Large effect sizes, dose-response gradients, and absence of plausible confounding can upgrade observational evidence.
GRADE certainty levels:
- High: Very confident that the true effect lies close to the estimate
- Moderate: Moderately confident; the true effect is likely close but may be substantially different
- Low: Limited confidence; the true effect may be substantially different
- Very low: Very little confidence; the true effect is likely substantially different
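The start-then-adjust logic described above can be sketched in code. In this illustration each domain contributes 0 (no serious concern), -1 (serious), or -2 (very serious), and upgrades add levels; it captures the bookkeeping only, not the judgment GRADE requires.

```python
# Sketch of GRADE's start-then-adjust bookkeeping. Downgrade scores per
# domain: 0 (no serious concern), -1 (serious), -2 (very serious).
# Upgrades (large effect, dose-response gradient, plausible confounding
# working against the effect) mainly apply to observational evidence.
LEVELS = ["very low", "low", "moderate", "high"]

def grade_certainty(randomized: bool, downgrades: dict[str, int], upgrades: int = 0) -> str:
    start = LEVELS.index("high") if randomized else LEVELS.index("low")
    score = start + sum(downgrades.values()) + upgrades
    return LEVELS[max(0, min(score, len(LEVELS) - 1))]

print(grade_certainty(randomized=True, downgrades={
    "risk_of_bias": -1, "inconsistency": 0, "indirectness": 0,
    "imprecision": -1, "publication_bias": 0,
}))  # low
```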
Frequently Asked Questions
Which quality assessment tool is best?
There is no single best tool; the correct tool depends on the study designs included in your review. Cochrane RoB 2 is the gold standard for RCTs, ROBINS-I for non-randomized intervention studies, and QUADAS-2 for diagnostic accuracy studies. JBI checklists provide the broadest coverage across study designs.
Do I need to assess quality in a scoping review?
Quality assessment is optional in scoping reviews, unlike systematic reviews where it is mandatory. If your scoping review includes quality assessment, document the methods and results. If not, acknowledge this as a limitation. For more on scoping reviews, see our guide on PRISMA-ScR for scoping reviews.
Can I create my own quality assessment tool?
Creating custom tools is generally discouraged because they lack validation and may not comprehensively cover relevant bias domains. Use established, validated tools wherever possible. If no validated tool exists for your specific study design, adapt the closest available tool and document all modifications.
How do I handle studies with mixed methods?
For mixed-methods studies, assess each component using the appropriate tool: quantitative components with quantitative tools and qualitative components with qualitative tools. Alternatively, the Mixed Methods Appraisal Tool (MMAT) appraises qualitative, quantitative, and mixed-methods components within a single framework.
Should I exclude studies based on quality assessment?
Quality assessment results should inform synthesis but generally should not be used to exclude studies that met your eligibility criteria. Instead, use sensitivity analysis to examine whether results change when restricted to high-quality studies. Excluding studies based on quality can introduce selection bias and is controversial in systematic review methodology.