Quantifying the Invisible: Women Doctors in Rosenwald Guides

Researcher · EPFL Laboratory for History of Science and Technology

Aug 2025 – Feb 2026

Collaborated with historians to design the Double Triangle annotation framework, which combines LLMs with human labeling, then extracted 577,000+ records from the Rosenwald Guides and identified 3,700+ female doctors.

Background

The Guide Rosenwald was an annual medical directory published in France from 1887 to 1949, listing physicians by name, specialty, address, and consultation hours across France and its colonies. Spanning 47 volumes and hundreds of thousands of entries, it is one of the most comprehensive records of the French medical profession over more than half a century.

Annotated close-up of a Rosenwald Guide entry showing Name, Year of diploma, Gender indicator, Address, Opening hours, and Title/Specialization fields
Each physician entry encodes six structured fields. The red-highlighted "Mlle" is a gender indicator.

This project is part of the SNSF MEDIF interdisciplinary research initiative, a collaboration between EPFL and historians at the Institut des humanités en médecine in Lausanne. We piloted the pipeline on 20 editions (1887–1906, 4,116 pages) and later expanded the scope to 1887–1922.

Goals

  • Extract all physician records from the guides, then identify female doctors by their gender indicator (Mme/Mlle)
  • Evaluate the extraction quality rigorously. No annotated benchmark existed for this data, so we first had to create one, which led to the Double Triangle annotation framework.
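As a minimal sketch of the second half of the first goal, assuming each extracted record exposes its name field as a plain string, the gender indicator could be detected with a regular expression. The function name gender_indicator and the sample strings in the usage note are hypothetical, and real entries would likely need more tolerant matching to cope with OCR noise.

```python
import re
from typing import Optional

# Matches "Mme" or "Mlle" as whole words, in any letter case.
# Assumption: the indicator appears verbatim in the name field of a record.
_INDICATOR = re.compile(r"\b(Mme|Mlle)\b", re.IGNORECASE)


def gender_indicator(name_field: str) -> Optional[str]:
    """Return 'Mme' or 'Mlle' if the name field carries one, else None."""
    match = _INDICATOR.search(name_field)
    if match is None:
        return None
    # Normalise casing so that variants such as "MLLE" are counted together.
    return match.group(1).capitalize()
```

For example, gender_indicator("Dupont (Mlle)") returns "Mlle", while a name field without either title returns None. The word-boundary anchors keep the pattern from firing inside longer words.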

Annotation Framework

Double Triangle annotation framework diagram showing System A (Claude + Qwen), System B (Llama + Grok), Human Juries, and Final Expert
The Double Triangle framework: two independent LLM systems each backed by a human jury, with a Final Expert for cross-system mismatches.

We designed the Double Triangle annotation framework to combine LLMs and human labelers efficiently. Two independent annotation systems (S1: Claude + Qwen; S2: Llama + Grok, shown as System A and System B in the diagram) each label the same entries. A human jury reviews disagreements within each system. Only when S1 and S2 still disagree does a Final Reviewer (the Final Expert in the diagram) step in.
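The routing logic described above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: labels are compared as exact strings, each human role is modelled as a callback that picks between two candidates, and all names (system_label, double_triangle, Resolver) are hypothetical.

```python
from typing import Callable, Dict, Tuple

# A human decision between two candidate labels (jury or Final Reviewer).
Resolver = Callable[[str, str], str]


def system_label(model_a: str, model_b: str, jury: Resolver) -> Tuple[str, bool]:
    """Label from one system; the bool records whether the jury was consulted."""
    if model_a == model_b:
        return model_a, False
    return jury(model_a, model_b), True


def double_triangle(s1: Tuple[str, str], s2: Tuple[str, str],
                    jury_1: Resolver, jury_2: Resolver,
                    final_reviewer: Resolver) -> Dict[str, object]:
    """Resolve one field: within-system juries first, Final Reviewer last."""
    label_1, used_jury_1 = system_label(s1[0], s1[1], jury_1)
    label_2, used_jury_2 = system_label(s2[0], s2[1], jury_2)
    # The Final Reviewer is consulted only if the two systems still disagree.
    needs_final = label_1 != label_2
    label = final_reviewer(label_1, label_2) if needs_final else label_1
    return {"label": label, "jury_1": used_jury_1,
            "jury_2": used_jury_2, "final_reviewer": needs_final}
```

When all four model outputs agree, no human is consulted at all, which is where the savings in human effort would come from in a design like this.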

In practice, only 18.4–26.3% of entries require human attention at the jury level, and just 4.2% reach the Final Reviewer. The result is a final word error rate (WER) of 0.0034 and a character error rate (CER) of 0.0007, at less than half the human effort of full manual annotation.

Evaluation table of the Double Triangle annotation framework showing WER, CER, fields to correct, and human effort ratio for Ha, Hb, and R
Experimental results on page 32 of the 1887 Rosenwald Guide. Each round of human review substantially reduces error, with the Final Reviewer (R) reaching WER 0.0034 and CER 0.0007 while handling only 4.2% of entries.
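For readers unfamiliar with the metrics, WER and CER are conventionally defined as the Levenshtein edit distance between a reference and a hypothesis, divided by the reference length in words or characters. The sketch below follows that standard definition. The text does not state which tokenisation or normalisation was used for the reported figures, so whitespace tokenisation and the absence of any normalisation are assumptions here.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from the reference
                            curr[j - 1] + 1,      # insert h into the reference
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate, using whitespace tokenisation (an assumption)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over the raw strings, with no normalisation."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Under this definition, one wrong word in a four-word reference gives a WER of 0.25, so a WER of 0.0034 corresponds to roughly one word error per 300 reference words.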

Extraction

We benchmarked four extraction strategies (Image + Original OCR, Image only, Original OCR only, and Tesseract OCR only), each combined with multiple LLMs (Gemini 3 Pro Preview, Gemini 3 Flash Preview, GPT-5.2, GPT-5-mini) as post-OCR correction models. Image + Original OCR with Gemini 3 Pro Preview achieved the best overall performance (Avg WER 0.0360, Avg CER 0.0139) and was used for the full extraction.

OCR and post-OCR correction performance table across sources and models on 30 documents
OCR and post-OCR correction performance across sources and models (30 documents). Best result per source in bold, second best underlined.

WER by document for Gemini 3 Pro and Flash Preview across image, image+text, and original OCR sources
WER per document for the Gemini 3 family. Flash Preview tracks Pro closely across nearly all documents, suggesting that accurate extraction is achievable with a much faster and cheaper model.

Conclusions

  • >99% annotation accuracy
  • >50% annotation effort saved
  • 2,600+ benchmark entries
  • 577,000+ records extracted, Rosenwald Guides (1887–1922)
  • 3,700+ female doctors found, Rosenwald Guides (1887–1922)

We extracted 577,000+ structured records from 4,166 pages using the Image + Original OCR pipeline with Gemini 3 Pro Preview. More than 3,700 of these entries belong to female doctors, physicians who had been effectively invisible to historical analysis because no structured dataset existed.

This dataset now gives historians a quantitative foundation for studying female doctors in France.