On-device Medical Chatbot for Nurse-Midwives

Researcher & Engineer · D-tree International, Zanzibar, Tanzania

Feb 2026 – Present

Built an on-device medical chatbot running fully offline on edge devices to provide real-time, evidence-based guidance to nurse-midwives in low-connectivity settings in Zanzibar.

Motivation

  • In Zanzibar, maternal and newborn mortality remain a significant challenge, disproportionately affecting rural households.
  • Nurse-midwives frequently face complex cases, but accessible guidance at the point of care is severely limited.
  • High data costs and poor network connectivity make online searches unreliable.

That's why we developed MAM-AI, an on-device AI assistant built for nurse-midwives in Zanzibar, providing real-time, evidence-based, locally relevant guidance.

No internet needed. Always available.

Demo

A short demo of the on-device chatbot running offline on an edge device.

System Design

RAG pipeline running fully offline on an Android edge device.

The app uses Retrieval-Augmented Generation (RAG): the nurse-midwife's query is embedded and used to retrieve the most relevant passages from indexed training materials, which are then injected into the prompt alongside the original question. Gemma 4 E4B (int4 quantization) runs the final generation entirely on-device — no internet required.
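The retrieval step above can be sketched as follows. This is a minimal illustration, not the app's actual implementation: the embedding model is stubbed with random vectors, and the passage index is a plain NumPy matrix searched by cosine similarity.

```python
import numpy as np

def cosine_top_k(query_vec: np.ndarray, index: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k passages most similar to the query."""
    # Normalize both sides so dot products equal cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    idx = index / np.linalg.norm(index, axis=1, keepdims=True)
    sims = idx @ q
    return np.argsort(-sims)[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """Inject retrieved guideline passages ahead of the original question."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the guideline excerpts below.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )

# Toy example: random vectors stand in for a real sentence embedder.
rng = np.random.default_rng(0)
passage_index = rng.normal(size=(100, 64))  # 100 indexed passages
query_vec = rng.normal(size=64)
top = cosine_top_k(query_vec, passage_index, k=3)
```

The constructed prompt is then handed to the on-device model for generation, so no network call ever occurs.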

Dataset

  • 20,534 total QA pairs (OBGYN questions)
  • 5 source benchmarks (public medical datasets)

To evaluate the on-device model, we curated an open OBGYN QA dataset by aggregating five public medical benchmarks spanning Pan-African, Indian, Kenyan, and US sources. The dataset is released open-source at obgyn-qa-collection.

Dataset                    Items    Format               Geographic Focus
AfriMed-QA                 697      MCQ + Short Answer   Pan-African (Ghana, Nigeria, Kenya, Malawi, South Africa)
MedMCQA                    18,508   Multiple Choice      India (entrance exams)
Kenya Clinical Vignettes   284      Clinical scenarios   Kenya
MedQA-USMLE                1,025    Board-style MCQ      USA
Women's Health Benchmark   20       Expert prompts       -
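Aggregating benchmarks with different formats (MCQ vs. open-ended) requires a shared record schema. A minimal sketch of that normalization, with field names chosen for illustration (the released dataset's actual schema may differ):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QAItem:
    source: str                     # which benchmark the item came from
    question: str
    options: Optional[list[str]]    # None for open-ended items
    answer: str                     # gold option text, or reference answer

def from_mcq(source: str, stem: str, options: list[str], answer_idx: int) -> QAItem:
    """Normalize a multiple-choice record into the shared schema."""
    return QAItem(source, stem, options, options[answer_idx])

def from_open(source: str, prompt: str, reference: str) -> QAItem:
    """Normalize an open-ended vignette into the shared schema."""
    return QAItem(source, prompt, None, reference)

mcq_item = from_mcq("MedMCQA", "First-line uterotonic for PPH?", ["Oxytocin", "Aspirin"], 0)
open_item = from_open("Kenya Clinical Vignettes", "A mother presents with...", "Assess bleeding; give oxytocin.")
```

Keeping `options` as `None` for open-ended items lets the same evaluation loop dispatch to exact-match scoring for MCQs and LLM-judge scoring for free-text answers.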

Model

We deploy Gemma 4 E4B (released April 2, 2025), Google's latest on-device model optimized for edge hardware, running with int4 quantization.
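Int4 quantization stores each weight as a 4-bit integer plus a shared floating-point scale, cutting memory roughly 4x versus fp16. A minimal symmetric-quantization sketch; the per-tensor scheme and range here are illustrative, not Gemma's actual packed format:

```python
import numpy as np

def quantize_int4(weights: np.ndarray):
    """Symmetric quantization: map floats into the signed int4 range [-7, 7]."""
    scale = float(np.abs(weights).max()) / 7.0
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale

w = np.array([0.8, -0.31, 0.05, -0.7], dtype=np.float32)
q, s = quantize_int4(w)
w_hat = dequantize(q, s)  # approximate reconstruction of w
```

The reconstruction error per weight is bounded by half the scale, which is the accuracy/memory trade-off that makes 4-bit models feasible on phone-class hardware.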

Evaluation

Full evaluation results coming soon.

Answering Accuracy

We evaluate model accuracy across multiple medical QA benchmarks, including MCQ datasets (AfriMed-QA, MedQA-USMLE, MedMCQA) and open-ended clinical vignettes (Kenya Vignettes, AfriMed-QA SAQ, WHB Stumps). Open-ended responses are scored by an LLM judge on accuracy, safety, completeness, helpfulness, and clarity.
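The LLM-judge scoring can be sketched as a rubric prompt plus per-dimension aggregation. The judge model call itself is stubbed here, and the 1-5 scale and prompt wording are assumptions for illustration:

```python
RUBRIC = ["accuracy", "safety", "completeness", "helpfulness", "clarity"]

def judge_prompt(question: str, answer: str) -> str:
    """Build the rubric prompt that would be sent to the judge model."""
    dims = ", ".join(RUBRIC)
    return (
        f"Score the answer on {dims}, each from 1 to 5.\n"
        f"Question: {question}\nAnswer: {answer}\n"
        'Reply as JSON, e.g. {"accuracy": 4, ...}.'
    )

def aggregate(scores: list[dict[str, int]]) -> dict[str, float]:
    """Mean per-dimension score across all judged responses."""
    return {d: sum(s[d] for s in scores) / len(scores) for d in RUBRIC}

means = aggregate([
    {"accuracy": 4, "safety": 5, "completeness": 3, "helpfulness": 4, "clarity": 5},
    {"accuracy": 2, "safety": 5, "completeness": 4, "helpfulness": 3, "clarity": 4},
])
```

Reporting each dimension separately (rather than one blended score) makes it visible when a model is, say, clear and helpful but unsafe.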

Results coming soon.

Latency

We benchmark on-device latency on real Android hardware, measuring time-to-first-token (TTFT), decode throughput (tokens/sec), and end-to-end query time across short, medium, and long clinical queries.
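The three latency metrics follow directly from per-token timestamps. A minimal sketch of the computation (the timestamping hooks on the actual Android app are not shown here):

```python
def latency_metrics(request_t: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT, decode throughput, and end-to-end time from timestamps.

    request_t: time the query was submitted (seconds)
    token_times: timestamp of each generated token, in order
    """
    ttft = token_times[0] - request_t
    decode_window = token_times[-1] - token_times[0]
    # Throughput over the decode phase: tokens generated after the first one.
    tok_per_sec = (len(token_times) - 1) / decode_window if decode_window > 0 else 0.0
    return {
        "ttft_s": ttft,
        "decode_tok_per_s": tok_per_sec,
        "e2e_s": token_times[-1] - request_t,
    }

m = latency_metrics(0.0, [1.2, 1.25, 1.30, 1.35, 1.40])
```

Separating TTFT from decode throughput matters on-device: prefill (prompt processing) and decode stress the hardware differently, so a long retrieved context can inflate TTFT even when tokens/sec is healthy.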

Results coming soon.

Stability

We evaluate response consistency under repeated identical queries and across varying conversation history lengths, assessing whether the model produces reliable outputs under the constraints of on-device inference.
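One simple way to quantify consistency under repeated identical queries is the fraction of runs that agree with the modal answer. This is a sketch of one plausible metric, not necessarily the one used in the final evaluation:

```python
from collections import Counter

def consistency_rate(responses: list[str]) -> float:
    """Fraction of repeated runs matching the most common (modal) answer."""
    counts = Counter(r.strip().lower() for r in responses)
    return counts.most_common(1)[0][1] / len(responses)

rate = consistency_rate(["Give oxytocin", "give oxytocin", "Give misoprostol"])
```

A rate of 1.0 means every repeat produced the same normalized answer; lower values flag queries where sampling noise or context length changes the model's output.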

Results coming soon.

Dangerous Scenario Recognition

A dedicated evaluation of how the app handles high-stakes clinical emergencies — including postpartum hemorrhage, eclampsia, neonatal respiratory distress, and sepsis. We assess whether the model correctly identifies emergency escalation triggers, avoids underreacting to critical presentations, and produces safe, actionable guidance aligned with official protocols.
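Checking for emergency escalation triggers can be framed as recall against a per-scenario checklist of protocol-mandated actions. A minimal keyword-matching sketch; the actual evaluation may use an LLM judge instead, and the example checklist below is illustrative rather than an official protocol:

```python
def escalation_recall(response: str, required_actions: list[str]) -> float:
    """Fraction of protocol-mandated emergency actions mentioned in the response."""
    text = response.lower()
    hits = sum(1 for action in required_actions if action.lower() in text)
    return hits / len(required_actions)

# Hypothetical postpartum hemorrhage checklist for illustration.
score = escalation_recall(
    "Call for help, start uterine massage, and give oxytocin immediately.",
    ["call for help", "uterine massage", "oxytocin", "iv fluids"],
)
```

Framing this as recall (rather than overall answer quality) directly measures underreaction: a fluent answer that omits a mandated action still scores low.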

Results coming soon.