PeruMedQA

Repository for PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams - Dataset Construction and Evaluation. Here you will find the datasets, the code we used to generate them, the code we used to obtain answers from the LLMs, and the LLMs' outputs.


Figures

Figure 1. Percent (%) of correct answers by test (specialty or subspecialty) and LLM, across years.


Figure 2. Percent (%) of correct answers by test (specialty or subspecialty), LLM, and year (updated on January 10, 2026).


Stress Evaluation

In this expanded work, we conducted a stress evaluation (https://arxiv.org/pdf/2509.18234v1): we randomized the order of the multiple-choice options and repeated the evaluation with the LLMs. If an LLM possessed genuine knowledge of the correct answer, it would select it irrespective of its position (i.e., whether the correct answer appeared as option A or as option C).
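The shuffling step can be illustrated with a few lines of Python. This is a minimal sketch, not the repository's actual code; it assumes each question is stored as a dict with an `options` mapping of letters to option texts and an `answer` letter, and it remaps the answer key after shuffling so responses can still be scored:

```python
import random

def shuffle_options(question, seed=None):
    """Return a copy of `question` with its multiple-choice options
    randomly reordered and the answer key remapped accordingly."""
    rng = random.Random(seed)
    letters = sorted(question["options"])              # e.g. ["A", "B", "C", "D"]
    texts = [question["options"][l] for l in letters]
    correct_text = question["options"][question["answer"]]

    rng.shuffle(texts)                                 # randomize option order
    shuffled = dict(zip(letters, texts))
    # The correct answer keeps its text but (usually) gets a new letter.
    new_answer = next(l for l, t in shuffled.items() if t == correct_text)
    return {**question, "options": shuffled, "answer": new_answer}

# Hypothetical example: the correct answer moves from "A" to a new position.
q = {"stem": "First-line treatment for condition X?",
     "options": {"A": "Drug X", "B": "Drug Y", "C": "Drug Z"},
     "answer": "A"}
print(shuffle_options(q, seed=42))
```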

MedGemma 27B, OctoMed 7B, and Meditron 7B were the only models for which there was no statistically significant difference between the original answers and those from the stress evaluation; whether the multiple-choice options were presented in the original order or randomly shuffled, these three LLMs performed similarly in all cases. For every other LLM, there was at least one case with a statistically significant difference (p < 0.05 on a paired t-test or Wilcoxon signed-rank test).
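The significance check can be sketched with `scipy.stats`. The accuracy values below are made up for illustration only (they are not results from the paper); each pair is one model's per-exam accuracy under the original versus shuffled option order:

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon

# Hypothetical per-exam accuracies for one LLM (original vs. shuffled order).
original = np.array([0.71, 0.64, 0.80, 0.58, 0.75])
shuffled = np.array([0.69, 0.60, 0.78, 0.55, 0.74])

t_stat, p_t = ttest_rel(original, shuffled)   # paired t-test
w_stat, p_w = wilcoxon(original, shuffled)    # Wilcoxon signed-rank test

print(f"paired t-test: p = {p_t:.3f}")
print(f"Wilcoxon signed-rank: p = {p_w:.3f}")
# p < 0.05 on either test would flag a significant original-vs-shuffled gap.
```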

Note: Text in red highlights the statistically significant p-values and the cases with the most substantial disparity between the original results and the stress-test results (regardless of whether the disparity was statistically significant).

[Table: stress evaluation results]

Preprint

Carrillo-Larco RM, Melgarejo JL, Castillo-Cara M, Bravo-Rocca G. PeruMedQA: Benchmarking Large Language Models (LLMs) on Peruvian Medical Exams - Dataset Construction and Evaluation. arXiv preprint arXiv:2509.11517. 2025 Sep 15. https://arxiv.org/abs/2509.11517
