PeruMedQA

Repository for PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams - Dataset Construction and Evaluation. Here you will find the datasets, the code we used to generate them, the code we used to obtain answers from the LLMs, and the LLMs' outputs.


Figures

Figure 1. Percent (%) of correct answers by test (specialty or subspecialty) and LLM, across years.


Figure 2. Percent (%) of correct answers by test (specialty or subspecialty), LLM, and year (updated on January 10, 2026).


Stress Evaluation

In this expanded work, we conducted a stress evaluation (https://arxiv.org/pdf/2509.18234v1): we randomized the order of the multiple-choice options and repeated the evaluation with the LLMs. If an LLM possessed genuine knowledge of the correct answer, it would select it irrespective of its position (i.e., whether the correct answer appeared as option A or as option C).
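The shuffling step can be illustrated with a few lines of Python. This is a minimal sketch, not the repository's actual code; it assumes each question is stored as a dict with an `options` mapping of letters to option texts and an `answer` letter, and it remaps the answer key after shuffling so responses can still be scored:

```python
import random

def shuffle_options(question, seed=None):
    """Return a copy of `question` with its multiple-choice options
    randomly reordered and the answer key remapped accordingly."""
    rng = random.Random(seed)
    letters = sorted(question["options"])              # e.g. ["A", "B", "C", "D"]
    texts = [question["options"][l] for l in letters]
    correct_text = question["options"][question["answer"]]

    rng.shuffle(texts)                                 # randomize option order
    shuffled = dict(zip(letters, texts))
    # The correct answer keeps its text but (usually) gets a new letter.
    new_answer = next(l for l, t in shuffled.items() if t == correct_text)
    return {**question, "options": shuffled, "answer": new_answer}

# Hypothetical example: the correct answer moves from "A" to a new position.
q = {"stem": "First-line treatment for condition X?",
     "options": {"A": "Drug X", "B": "Drug Y", "C": "Drug Z"},
     "answer": "A"}
print(shuffle_options(q, seed=42))
```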

MedGemma 27B, OctoMed 7B, and Meditron 7B were the only models for which there was no statistically significant difference between the original answers and those from the stress evaluation; whether the multiple-choice options were presented in the original order or randomly shuffled, these three LLMs performed similarly in all cases. For every other LLM, there was at least one case with a statistically significant difference (p < 0.05 on a paired t-test or Wilcoxon signed-rank test).
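The significance check can be sketched with `scipy.stats`. The accuracy values below are made up for illustration only (they are not results from the paper); each pair is one model's per-exam accuracy under the original versus shuffled option order:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-exam accuracies for one LLM (original vs. shuffled order).
original = np.array([0.71, 0.64, 0.80, 0.58, 0.75])
shuffled = np.array([0.69, 0.60, 0.78, 0.55, 0.74])

t_stat, p_t = ttest_rel(original, shuffled)   # paired t-test
w_stat, p_w = wilcoxon(original, shuffled)    # Wilcoxon signed-rank test

print(f"paired t-test: p = {p_t:.3f}")
print(f"Wilcoxon signed-rank: p = {p_w:.3f}")
# p < 0.05 on either test would flag a significant original-vs-shuffled gap.
```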

Note: Text in red highlights the statistically significant p-values and the cases with the most substantial disparity between the original results and the stress-test results (regardless of whether the disparity was statistically significant).

[Table: stress evaluation results]

Preprint

Carrillo-Larco RM, Melgarejo JL, Castillo-Cara M, Bravo-Rocca G. PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams - Dataset Construction and Evaluation. arXiv preprint arXiv:2509.11517. 2025 Sep 15. https://arxiv.org/abs/2509.11517
