Clinical Evidence | January 15, 2026 | 8 min read

    How Accurate Is AI Radiology Reporting?
    Evidence from Published Clinical Studies

    What does the peer-reviewed evidence say about AI CT reporting accuracy? We analyzed two independent clinical studies and compared the numbers to traditional radiology benchmarks.

    92.2% AI sensitivity in chest CT (vs 58.3% unaided)
    95.6% AI specificity (vs 80.6% unaided)
    94.9% clinician approval rate (multi-center study)
    0.86 macro F1 score (foundation model)

    The central question in AI radiology

    "How accurate is AI radiology reporting?" is one of the most searched questions in the field — and one of the least honestly answered in vendor marketing. Most claims cite cherry-picked detection studies on single pathologies. The better question is: how does AI CT reporting perform across the full clinical workflow, including structured report generation, on a realistic mix of emergency and routine studies?

    This article summarizes two independent peer-reviewed studies that evaluated AI CT reporting in real clinical settings, explains what the numbers mean in practice, and describes what "accuracy" should actually mean when selecting an AI radiology partner.

    Study 1: Emergency chest CT in a clinical setting (Polish Radiology, 2025)

    The most rigorous evaluation of xAID's AI CT reporting to date was published in Polish Radiology (Pol Radiol, 2025). The study was a retrospective single-center evaluation conducted at an emergency radiology department, using 90 consecutive unenhanced chest CT scans.

    Study design: The same 90 cases were read by radiologists with AI assistance (AI-assisted) and without (unaided). Pathologies were evaluated across 9 categories:

    • Lung nodules
    • Pulmonary opacifications
    • Pneumothorax
    • Pleural and pericardial effusion
    • Pulmonary artery dilatation
    • Coronary artery calcifications
    • Aortic diameter
    • Vertebral fractures
    • Rib fractures

    With AI assistance: 92.2% pooled sensitivity, 95.6% pooled specificity
    Without AI (unaided): 58.3% pooled sensitivity, 80.6% pooled specificity

    Source: Bonatti et al., Pol Radiol, 2025. Single-center retrospective study, emergency radiology setting.
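
    To make the pooled figures concrete, here is a minimal sketch of one common way to pool sensitivity and specificity: sum the confusion-matrix counts across pathology categories, then take the ratios. This article does not state how the study pooled its results, so the method, the `Counts` type, and every number in `example` are assumptions for illustration only, not data from the study.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Confusion-matrix counts for one pathology category (hypothetical)."""
    tp: int  # finding present and reported
    fn: int  # finding present but missed
    tn: int  # finding absent and not reported
    fp: int  # finding absent but reported


def pooled_sensitivity(rows):
    """Sum TP and FN over all categories, then compute TP / (TP + FN)."""
    rows = list(rows)
    tp = sum(r.tp for r in rows)
    fn = sum(r.fn for r in rows)
    return tp / (tp + fn)


def pooled_specificity(rows):
    """Sum TN and FP over all categories, then compute TN / (TN + FP)."""
    rows = list(rows)
    tn = sum(r.tn for r in rows)
    fp = sum(r.fp for r in rows)
    return tn / (tn + fp)


# Made-up counts for two categories over 90 scans each; NOT study data.
example = {
    "lung_nodules": Counts(tp=18, fn=2, tn=66, fp=4),
    "rib_fractures": Counts(tp=9, fn=1, tn=78, fp=2),
}

sens = pooled_sensitivity(example.values())
spec = pooled_specificity(example.values())
print(f"pooled sensitivity: {sens:.3f}, pooled specificity: {spec:.3f}")
```

    Pooling by summed counts weights common findings more heavily than rare ones. Averaging per-category rates would treat every category equally and can give a different number from the same reads.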

    Three categories showed the largest gains over unaided reads: coronary artery calcifications, pulmonary artery dilatation, and vertebral fractures. These findings are commonly missed on routine emergency reads because they require precise quantitative measurement rather than pattern recognition.

    What this means for practice: A 33.9-percentage-point sensitivity advantage is clinically significant. In a center reading 200 chest CTs per month, and assuming roughly one true finding per scan, this translates to approximately 68 additional correctly identified pathologies monthly (200 × 0.339 ≈ 68). Many of these would otherwise lead to patient recall, repeat imaging, or a missed diagnosis.
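
    The monthly estimate is simple arithmetic: the number of true findings multiplied by the sensitivity gap. The sketch below reproduces it. The `findings_per_scan` parameter is an assumption introduced here, not a figure from the study; the estimate of 68 corresponds to a value of about 1.0, and the result scales linearly with it.

```python
def additional_detections(scans_per_month, findings_per_scan,
                          sens_with_ai, sens_unaided):
    """Estimate extra true findings detected per month.

    findings_per_scan is a hypothetical average number of true
    findings per scan; the study does not report it in this article.
    """
    true_findings = scans_per_month * findings_per_scan
    return true_findings * (sens_with_ai - sens_unaided)


# Sensitivities from the study; 200 scans per month as in the text.
gain = additional_detections(200, 1.0, 0.922, 0.583)
print(f"about {round(gain)} additional findings per month")

# With fewer true findings per scan, the gain shrinks proportionally.
gain_half = additional_detections(200, 0.5, 0.922, 0.583)
print(f"about {round(gain_half)} at 0.5 findings per scan")
```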

    Study 2: Multi-center European clinical utility assessment (ResearchGate, 2025)

    A second independent study evaluated the clinical utility of AI CT reporting across four European radiology centers, located in France, Greece, Slovakia, and the United Kingdom. The study assessed 81 non-contrast chest CT cases with four board-certified radiologists.

    Unlike the Polish Radiology study (which measured detection accuracy), this study focused on clinical integration: would radiologists actually trust and use AI-generated report elements in practice?

    94.9% clinician approval rate (for clinical integration)
    89.7% image layout approval (across 4 centers)
    81.5% diagnostic contribution (of AI segmentation data)

    Source: Polushkin et al., ResearchGate, 2025. Multi-center study across France, Greece, Slovakia, UK.

    94.9% approval does not mean radiologists agreed with everything — it means they found AI-generated structured report elements clinically usable with minor modifications. 81.5% reported that AI segmentation data contributed positively to their diagnostic process.

    The foundation model: technical architecture

    The accuracy results above are driven by xAID's foundation model, which is based on the Swin Transformer (a hierarchical vision transformer that uses shifted-window attention) and adapted for 3D volumetric medical imaging. Key technical specifications:

    • Input resolution: Up to 256³ voxels — capturing fine-grained anatomical detail that lower-resolution models miss
    • Coverage: Head, chest, and abdomen CT
    • Findings analyzed: 100+ per scan
    • Macro F1 score: 0.86 across clinically relevant pathologies
    • Secondary verification: A second AI layer independently reviews findings; divergences are flagged for radiologist attention
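
    For readers unfamiliar with the metric, macro F1 is the unweighted mean of per-pathology F1 scores, so a rare finding counts as much as a common one. The sketch below shows the computation. The pathology names and counts are hypothetical, and the article does not say which pathologies or test set produced the 0.86 figure.

```python
def f1(tp, fp, fn):
    """F1 for one class: 2*TP / (2*TP + FP + FN); 0.0 if undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class):
    """Unweighted mean of per-class F1 scores."""
    scores = [f1(tp, fp, fn) for tp, fp, fn in per_class.values()]
    return sum(scores) / len(scores)


def micro_f1(per_class):
    """F1 computed on counts summed over all classes, for comparison."""
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn)


# Hypothetical (tp, fp, fn) counts: one rare class, one common class.
counts = {
    "pneumothorax": (8, 2, 2),
    "pleural_effusion": (45, 5, 5),
}

print(f"macro F1: {macro_f1(counts):.3f}")
print(f"micro F1: {micro_f1(counts):.3f}")
```

    In this made-up example the rare class scores lower, so macro F1 comes out below micro F1. That is why macro averaging is the stricter summary when performance on uncommon pathologies matters.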

    What "accuracy" should mean when evaluating AI radiology providers

    Most AI radiology vendors cite performance numbers from internal evaluations or narrow single-pathology studies. When evaluating accuracy claims, ask:

    1. Is the study independent? Was it conducted by the vendor or by an independent research team? The Polish Radiology study and the European multi-center study were both independent.
    2. What was the study design? Real emergency cases on a consecutive series are harder to cherry-pick than curated test sets.
    3. What pathologies were evaluated? Single-pathology performance (e.g., nodule detection) does not predict full-scan reporting performance.
    4. Is there published peer-reviewed evidence? Published performance numbers from independent studies — not vendor self-assessments — are required for a meaningful accuracy claim.
    5. Is a radiologist reviewing every report? Autonomous AI reporting without human review introduces liability and accuracy risks not present in AI-assisted reporting.

    92.2% sensitivity and 95.6% specificity, verified by peer-reviewed studies

    xAID's accuracy is backed by two independent peer-reviewed studies: 92.2% sensitivity and 95.6% specificity (Bonatti et al., Polish Radiology, 2025) and 94.9% radiologist approval rate across four European centers (Polushkin et al., ResearchGate, 2025). Every report is reviewed by a European radiologist before delivery. No other AI radiology vendor has published this level of independent clinical evidence.

    Frequently asked questions

    How accurate is AI CT reporting compared to a radiologist?

    In a peer-reviewed study published in Polish Radiology (2025), AI-assisted CT reporting achieved 92.2% pooled sensitivity and 95.6% pooled specificity across 9 pathology categories in emergency chest CT, compared to 58.3% sensitivity and 80.6% specificity for radiologists reading the same scans without AI assistance. The largest advantages were in structured quantitative findings: coronary artery calcifications, pulmonary artery dilatation, and vertebral fractures.

    Is AI radiology reporting as good as a human radiologist?

    Published evidence suggests AI-assisted radiology outperforms unaided radiologists in specific detection tasks — especially structured quantitative findings. AI underperforms humans on complex contextual reasoning and rare presentations. The best outcomes come from AI-assisted workflows where a radiologist reviews every AI-generated draft before delivery, combining AI's consistent quantitative analysis with human clinical judgment.

    What studies have evaluated AI CT reporting accuracy?

    Two key independent studies evaluated xAID AI CT reporting: (1) Bonatti et al., Pol Radiol, 2025 — a retrospective single-center study of 90 emergency chest CTs, finding 92.2% AI sensitivity vs 58.3% unaided. (2) Polushkin et al., ResearchGate, 2025 — a multi-center study across four European centers, finding 94.9% clinician approval of AI report elements for clinical integration.

    Does AI radiology have a higher error rate than traditional radiology?

    The published evidence suggests AI-assisted radiology reduces certain error types (missed structured findings, quantitative measurements) rather than increasing them. AI does introduce its own failure modes — it may be less reliable on atypical presentations or complex multi-pathology cases. A dual-layer system (AI analysis + radiologist review) is specifically designed to catch both AI and human errors before report delivery.