Accurate but Unreliable?
While Large Language Models (LLMs) are increasingly used in healthcare, most benchmarks focus solely on average accuracy. This research tackles a more critical, often ignored dimension: consistency.
If a clinical question is posed repeatedly in Hindi, does the model provide the same answer? For Hindi speakers—a population of over 600 million—this “self-agreement” gap is a major barrier to safe telehealth adoption.
🏗️ The Methodology: “10 Runs, 3 Models”
To measure reliability, I evaluated 100 USMLE-style multiple-choice questions, prompted in both English and Hindi, across 10 repeated runs each, using three specialized medical LLMs: MedAlpaca, Med-PaLM 2, and BioGPT.
1. Statistical Framework
Following the framework of Kim et al. (2025), I treated each of the 10 runs as an independent rater. This allowed for the calculation of:
- Fleiss’ Kappa: Measuring chance-corrected agreement among the model’s own answers across runs.
- Krippendorff’s Alpha: A complementary reliability measure that remains robust to missing responses and generalizes across data types.
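The runs-as-raters idea can be sketched with a minimal, from-scratch Fleiss’ kappa (Krippendorff’s alpha is usually computed with a dedicated package such as `krippendorff`). The answer data below is invented purely for illustration:

```python
from collections import Counter

def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of units, each a list of categorical ratings.

    Here each unit is one question and each 'rater' is one of the
    repeated runs of the same model on that question."""
    n = len(ratings[0])  # raters (runs) per question
    categories = sorted({r for unit in ratings for r in unit})
    totals = Counter()
    p_bar = 0.0
    for unit in ratings:
        counts = Counter(unit)
        totals.update(counts)
        # Per-question agreement: P_i = (sum_j n_ij^2 - n) / (n * (n - 1))
        p_bar += (sum(c * c for c in counts.values()) - n) / (n * (n - 1))
    p_bar /= len(ratings)
    # Chance agreement: P_e = sum_j p_j^2
    total = len(ratings) * n
    p_e = sum((totals[c] / total) ** 2 for c in categories)
    return (p_bar - p_e) / (1 - p_e)

# Toy data: 3 questions x 10 runs; perfect self-agreement on the
# first two questions, a 5/5 split on the third.
runs = [
    ["A"] * 10,
    ["B"] * 10,
    ["A"] * 5 + ["C"] * 5,
]
print(round(fleiss_kappa(runs), 3))  # → 0.697
```

Perfect self-agreement across all runs yields kappa = 1; the split on the third question pulls the value down sharply, which is exactly the sensitivity the study relies on.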
2. The Translation Pipeline
Consistency is only as good as the prompt. I developed a three-stage validation pipeline to ensure high-fidelity Hindi medical queries:
- Machine Translation: Initial rendering via Google Translate.
- Back-Translation: Re-translating to English to identify semantic drift.
- Human Review: Final validation by native Hindi speakers to ensure medical terminology accuracy.
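The back-translation stage can be sketched as a similarity gate that flags prompts for the human-review stage. `flag_semantic_drift`, the 0.85 threshold, and the sample prompts are illustrative assumptions, not the study’s actual tooling; a string-similarity ratio is only a coarse proxy for semantic drift, which is why human review makes the final call:

```python
from difflib import SequenceMatcher

def flag_semantic_drift(original_en, back_translated_en, threshold=0.85):
    """Flag a prompt for human review when the English back-translation
    diverges too far from the original. Threshold is illustrative."""
    similarity = SequenceMatcher(
        None, original_en.lower(), back_translated_en.lower()
    ).ratio()
    return similarity < threshold

# Hypothetical round trip: original English -> Hindi (MT) -> back to English.
original = "What is the first-line treatment for essential hypertension?"
back = "What is the first-line therapy for essential hypertension?"
print(flag_semantic_drift(original, back))  # → False (minor lexical change)
```

A prompt that comes back with only minor lexical variation passes through; a prompt whose meaning shifted in translation drops below the threshold and is routed to the native-speaker reviewers.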
🛠️ The Engineering: Reliability Analysis
All experiments were run on Google Colab at a fixed temperature of 0.7 to reflect realistic, non-deterministic deployment conditions.
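A minimal sketch of the repeated-run harness, with a stubbed `query_model` standing in for the real inference endpoints (the stub just samples a random option; only the loop structure reflects the setup described above):

```python
import random

def query_model(prompt, temperature=0.7, seed=None):
    """Stand-in for a real model call (MedAlpaca, Med-PaLM 2, BioGPT);
    a deployed version would hit the actual inference endpoint."""
    rng = random.Random(seed)
    return rng.choice(["A", "B", "C", "D"])

def repeated_runs(prompt, n_runs=10, temperature=0.7):
    # One answer per run. At temperature 0.7 a real model is stochastic,
    # so runs can disagree -- that run-to-run variation is the quantity
    # the kappa/alpha statistics are computed over.
    return [query_model(prompt, temperature, seed=i) for i in range(n_runs)]

answers = repeated_runs("Hindi USMLE-style question text ...")
print(len(answers))  # 10 answers for one question
```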
By analyzing 6,000 total responses, the research quantifies the “reliability gap”—where English prompts may yield high self-agreement (Kappa > 0.7), but Hindi prompts exhibit substantially lower consistency, even when average accuracy might seem acceptable.
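One simple way to express such a gap, alongside the chance-corrected statistics, is the average fraction of runs that match each question’s modal answer, compared across languages. The numbers below are toy values for illustration, not the study’s results:

```python
from collections import Counter

def self_agreement(run_answers):
    """Fraction of runs that match the modal answer for one question."""
    counts = Counter(run_answers)
    return counts.most_common(1)[0][1] / len(run_answers)

def mean_agreement(per_question_runs):
    """Average self-agreement over all questions for one language."""
    return sum(self_agreement(r) for r in per_question_runs) / len(per_question_runs)

# Toy data: the same two questions, 10 runs each, in both languages.
english = [["A"] * 10, ["B"] * 9 + ["C"]]
hindi = [["A"] * 6 + ["D"] * 4, ["B"] * 7 + ["C"] * 3]

gap = mean_agreement(english) - mean_agreement(hindi)
print(round(gap, 2))  # → 0.3
```

Note that both languages could score the same *accuracy* on these toy questions while showing very different stability, which is the distinction the study is drawing.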
🚀 Impact: Making AI Stable in Practice
This study provides the first statistical quantification of the reliability gap for specialized medical LLMs in an Indic language. It moves the conversation from “is the AI smart?” to “is the AI stable?”—an essential shift for ensuring that LLM-powered health tools are demonstrably reliable for non-English speaking populations.