On-device Medical Chatbot for Nurse-Midwives

Researcher & Engineer · D-tree International, Zanzibar, Tanzania

Feb 2026 – Present

Built an on-device medical assistant running fully offline to provide real-time, evidence-based guidance to nurse-midwives in low-connectivity settings in Zanzibar.

Motivation

  • In Zanzibar, maternal and newborn mortality remain a significant challenge, disproportionately affecting rural households.
  • Nurse-midwives frequently face complex cases, but accessible guidance at the point of care is severely limited.
  • High data costs and poor network connectivity make online searches unreliable.

That's why I developed MAM-AI, an on-device AI assistant for nurse-midwives in Zanzibar, providing real-time, evidence-based, locally relevant guidance.

No internet needed. Always available.

Demo

A short demo of the on-device assistant running fully offline.

You can also try it yourself in the live web demo — a faithful browser version of the on-device app, running the same Gemma 4 generator and guideline retrieval. Ask a clinical question and you'll get an evidence-based answer with clickable citations back to the source guidelines.

Note: the demo is hosted on a free Hugging Face server just for illustration, so generation may be slow. Real on-device latency is reported further below.

System Design

[Diagram: the RAG pipeline, running 100% on-device and fully offline]
Query (clinical question) → 1. Embed (EmbeddingGemma 300M) → 2. Retrieve (top-3 chunks) → 3. Augment (prompt + context) → 4. Generate (Gemma 4 E4B, int4) → Answer (with citations [1][2][3])
On-device knowledge base, queried at step 2: 87 guideline documents · 63,650 chunks · SQLite vector store
The full RAG pipeline runs on the device: embedding, retrieval, and generation all happen offline.

The app uses Retrieval-Augmented Generation (RAG): the nurse-midwife's question is embedded with EmbeddingGemma-300M, then matched against an on-device vector store to pull the three most relevant guideline passages. Those are injected into the prompt alongside the original question, and Gemma 4 E4B (int4) generates the final answer, based on the official guidelines, with inline citations back to the source.

Every stage — embedding, vector search, and generation — runs locally on the Android device. No internet is required at any point, which is the whole point: in Zanzibar, reliable connectivity can't be assumed at the point of care.

Both building blocks are off-the-shelf open models from Google — Gemma 4 E4B (int4) for generation and EmbeddingGemma-300M for retrieval — run on-device through Google's LiteRT-LM runtime. What's mine is the system, the knowledge base, and the evaluation built around them.
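The retrieve-then-augment flow described above can be sketched in a few lines. This is a minimal illustration, not the app's code: the `Chunk` type, the brute-force cosine search, the prompt wording, and the three-dimensional toy vectors are all assumptions made for the example. The real system embeds with EmbeddingGemma-300M, searches an on-device SQLite vector store, and generates with Gemma 4 E4B through LiteRT-LM, none of which is called here.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str          # guideline document name (hypothetical values below)
    text: str            # passage text
    vector: list[float]  # precomputed embedding (toy 3-d values, not real embeddings)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float], store: list[Chunk], k: int = 3) -> list[Chunk]:
    """Return the k chunks most similar to the query (brute-force search)."""
    return sorted(store, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:k]


def build_prompt(question: str, chunks: list[Chunk]) -> str:
    """Number the passages so the answer can cite them as [1], [2], [3]."""
    context = "\n".join(f"[{i}] ({c.source}) {c.text}" for i, c in enumerate(chunks, 1))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer with citations."


store = [
    Chunk("doc_a", "passage about postpartum haemorrhage", [0.9, 0.1, 0.0]),
    Chunk("doc_b", "passage about newborn resuscitation", [0.1, 0.9, 0.0]),
    Chunk("doc_c", "passage about eclampsia", [0.0, 0.2, 0.9]),
    Chunk("doc_d", "passage about antenatal visits", [0.5, 0.5, 0.1]),
]
query_vec = [1.0, 0.2, 0.0]  # stand-in for the embedded question
top = retrieve(query_vec, store)
prompt = build_prompt("How do I manage heavy bleeding after birth?", top)
```

The resulting `prompt` is what would be handed to the generator; numbering the passages in the prompt is what makes inline citations back to the source possible.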

Evaluation

A clinical assistant needs comprehensive evaluation. I evaluated MAM-AI in layers: the end-to-end system first, then the retriever and the generator on their own, and finally how fast it actually runs on the device. The open-ended answers are scored by an LLM judge that I first validated against 6,853 physician-labelled judgments, choosing a conservative grader over higher-scoring but lenient ones.

End-to-end: the deployed system

The biggest gain came from fixing a failure mode. The base model was evasive — on a third of questions it deflected, telling the nurse to “see a doctor” instead of answering. A prompt redesign cut that to ~3% and doubled how much correct guidance the answers actually carry.

  • Deflection rate: 3.2% (down from 32.7%)
  • Key-fact recall: 0.28 (≈2× the base prompt)
  • Dangerous answers: 0 (1 flag, adjudicated as a judge error)

To see the deployed choice in context, I ran a full 3×3 matrix — the two deployable on-device models (Gemma 4 E4B and Gemma 3n E4B) plus a frontier model (Qwen3.5-397B-A17B-FP8) as an unconstrained ceiling, each under three prompts: baseline, +G1 (fixes deflection), and +G1+G2 (adds consultation structure).

Each cell is scored on two open-ended benchmarks: the Kenya Clinical Vignettes — 312 nurse-written primary-care cases, close to the Zanzibar setting — and HealthBench-oss, a split of OpenAI’s physician-rubric health benchmark. I introduce both in more detail further down; here they anchor the end-to-end scores.

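Kenya vignettes (n=312): Recall, Deflection, Potentially harmful, Dangerous. HealthBench-oss (n=1,209): weighted_met, Positive ↑, Penalty ↓.

| Generator · prompt | Recall | Deflection | Potentially harmful | Dangerous | weighted_met | Positive ↑ | Penalty ↓ |
|---|---|---|---|---|---|---|---|
| Gemma 4 · baseline | 0.139 | 32.7% | 12.8% | 0 | 0.0001 | 0.182 | 0.377 |
| Gemma 4 · +G1 (deployed) | 0.279 | 3.2% | 15.7% | 1 | 0.038 | 0.221 | 0.373 |
| Gemma 4 · +G1+G2 | 0.338 | 1.6% | 17.0% | 0 | 0.052 | 0.233 | 0.385 |
| Gemma 3n · baseline | 0.297 | 1.9% | 24.4% | 4 | 0.083 | 0.262 | 0.373 |
| Gemma 3n · +G1 | 0.351 | 0.6% | 31.4% | 11 | 0.110 | 0.289 | 0.367 |
| Gemma 3n · +G1+G2 | 0.389 | 0.0% | 28.8% | 15 | 0.119 | 0.293 | 0.365 |
| Qwen3.5-397B · baseline | 0.295 | 0.3% | 8.3% | 1 | 0.142 | 0.309 | 0.354 |
| Qwen3.5-397B · +G1 | 0.420 | 0.0% | 12.5% | 0 | 0.161 | 0.334 | 0.355 |
| Qwen3.5-397B · +G1+G2 | 0.553 | 0.0% | 4.2% | 0 | 0.245 | 0.406 | 0.336 |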

Prompt variants (links are the actual system prompts) — baseline (plain system prompt), +G1 (seven deflection/scope-fix levers), and +G1+G2 (G1 plus a consultation-workflow structure). Kenya columns: 312 nurse-written clinical vignettes, safety-judged on a four-level scale. Potentially harmful = a small, often-catchable harm path (e.g. imprecise dosing) — a raw judge label, not human-verified; Dangerous = would harm if followed at face value, with the deployed config’s single flag manually adjudicated to 0 genuine. HealthBench-oss: 1,209 items, where weighted_met nets completeness (positive) against active harm (penalty).
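As a rough illustration of how the two headline Kenya metrics can be aggregated from per-question judge output, here is a minimal sketch. The `JudgedAnswer` record, the per-question (macro) averaging, and the toy values are assumptions made for the example; the actual judge output format and aggregation are not specified here.

```python
from dataclasses import dataclass


@dataclass
class JudgedAnswer:
    key_facts_total: int    # key facts in the clinician reference answer
    key_facts_covered: int  # how many of them the judge found in the model's answer
    deflected: bool         # judge says the answer deflected instead of answering


def key_fact_recall(items: list[JudgedAnswer]) -> float:
    """Mean per-question share of reference key facts covered (macro-average, assumed)."""
    per_item = [a.key_facts_covered / a.key_facts_total for a in items if a.key_facts_total]
    return sum(per_item) / len(per_item)


def deflection_rate(items: list[JudgedAnswer]) -> float:
    """Share of questions on which the answer deflected."""
    return sum(a.deflected for a in items) / len(items)


# Toy judgments (hypothetical, not taken from the benchmark).
judged = [
    JudgedAnswer(4, 1, False),
    JudgedAnswer(5, 0, True),
    JudgedAnswer(2, 1, False),
    JudgedAnswer(4, 2, False),
]
recall = key_fact_recall(judged)
deflection = deflection_rate(judged)
```

A deflected answer typically covers no key facts, which is one reason lowering deflection and raising recall tend to move together.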

The two on-device models trade off along a single axis: usefulness versus safety. Gemma 3n is the more useful one: it answers more (higher recall) and scores at least twice as high on HealthBench weighted_met under every prompt. But it is the less safe one, with 4 to 15 dangerous answers across the prompts and a markedly higher rate of potentially harmful ones. Gemma 4 is the mirror image: less complete, but with near-zero dangerous answers (0–1).

I prioritized safety and deployed Gemma 4; the +G1 prompt then wins back most of the usefulness it gives up — without adding dangerous answers.

Even with the right prompt, the deployed config still trails the unconstrained Qwen ceiling on most metrics — recall, completeness, and safety alike. That on-device gap is the real limit, and the next two layers dig into where it comes from: the retriever and the generator.

Retrieval

Next I isolate the retriever. I built MamaRetrieval — a benchmark of 3,185 clinical queries with graded relevance labels — and ran seven retrievers through it, scored at the top-3 depth the app actually uses. (The benchmark itself is described further down.)

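| Retriever | HR(≥3) | P(≥3) | HR(≥5) | P(≥5) | wHR | wP |
|---|---|---|---|---|---|---|
| voyage (cloud) | 0.996 | 0.867 | 0.753 | 0.452 | 0.860 | 0.682 |
| octen (cloud) | 0.991 | 0.804 | 0.716 | 0.403 | 0.847 | 0.637 |
| EmbeddingGemma (deployed) | 0.985 | 0.784 | 0.704 | 0.388 | 0.838 | 0.619 |
| lateon (cloud) | 0.971 | 0.738 | 0.664 | 0.350 | 0.815 | 0.581 |
| Gecko (former on-device) | 0.814 | 0.477 | 0.439 | 0.193 | 0.662 | 0.393 |
| bm25 | 0.754 | 0.417 | 0.371 | 0.163 | 0.602 | 0.338 |
| medcpt | 0.644 | 0.334 | 0.272 | 0.112 | 0.517 | 0.277 |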

MamaRetrieval: 3,185 clinical queries; each retriever’s top-3 chunks judged for relevance on a 0–6 scale by a strong LLM judge (Qwen3.5-397B). HR = hit rate (at least one relevant chunk in the top-3); P = precision (share of the top-3 that are relevant); ≥3 = lenient, ≥5 = strict relevance; wHR/wP weight by the graded score. EmbeddingGemma was scored in a matched follow-up run with the same judge and pooled labels — the six original retrievers reproduce exactly, confirming no drift.
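To make the HR and P definitions concrete, the sketch below computes both from graded top-3 scores at the lenient (≥3) and strict (≥5) thresholds. The function names and the toy scores are illustrative assumptions. The weighted variants (wHR, wP) are left out on purpose, because their exact weighting formula is not given here.

```python
def hit_rate(judged: list[list[int]], threshold: int) -> float:
    """Share of queries with at least one top-3 chunk scoring >= threshold."""
    return sum(any(s >= threshold for s in scores) for scores in judged) / len(judged)


def precision(judged: list[list[int]], threshold: int) -> float:
    """Mean share of each query's top-3 chunks scoring >= threshold."""
    per_query = [sum(s >= threshold for s in scores) / len(scores) for scores in judged]
    return sum(per_query) / len(per_query)


# One inner list per query: the 0-6 relevance grades of its top-3 chunks (toy data).
judged = [[6, 3, 0], [4, 4, 2], [2, 1, 0], [5, 5, 3]]

hr_lenient = hit_rate(judged, threshold=3)
p_lenient = precision(judged, threshold=3)
hr_strict = hit_rate(judged, threshold=5)
p_strict = precision(judged, threshold=5)
```

Hit rate only asks whether anything useful was retrieved, while precision penalises every irrelevant chunk, which is why P separates the retrievers far more than HR does in the table.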

The deployed on-device retriever, EmbeddingGemma-300M, lands in the top tier — third of seven, between two cloud models and within ~8 points of the best (voyage). A 300M model running on the device competes with cloud retrieval, and it outruns the earlier on-device option, Gecko, by about 30 points of precision (0.78 vs 0.48). So on-device retrieval is largely solved; the real question is whether that quality reaches the answers — and on a matched end-to-end comparison, it doesn't:

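Kenya vignettes: Retrieval P@3, Kenya recall. HealthBench-oss: weighted_met, Positive ↑, Penalty ↓.

| Setup | Retrieval P@3 | Kenya recall | weighted_met | Positive ↑ | Penalty ↓ |
|---|---|---|---|---|---|
| No retrieval (no-RAG) | n/a | 0.178 | +0.003 | 0.184 | 0.379 |
| Gecko (former on-device) | 0.270 | 0.125 | −0.004 | 0.175 | 0.372 |
| EmbeddingGemma (deployed) | 0.396 | 0.126 | +0.012 | 0.189 | 0.368 |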

Matched bake-off on the Kenya set (retrieval n=312, HealthBench n=1,209). Retrieval P@3 = lenient precision at top-3; Kenya recall = end-to-end key-fact recall. For HealthBench-oss, weighted_met nets completeness (Positive ↑) against harm (Penalty ↓), so ≈0 means the generator’s gains are cancelled by its error rate. Different query set and judge from the benchmark above, so Gecko’s P@3 here (0.270) is not comparable to its benchmark P@3 (0.477). The no-RAG HealthBench arm is from a parallel run (same generator and judge, retrieval off).

On the matched Kenya comparison, the same upgrade lifts retrieval precision (Gecko 0.270 → EmbeddingGemma 0.396, +12.6 pp) but answer quality doesn't follow: end-to-end key-fact recall is flat (0.125 → 0.126) and HealthBench stays essentially zero. On this small on-device generator, RAG is even net-negative on Kenya — answering with no retrieved context scores higher (0.178). Better retrieval does not convert into better answers; the binding constraint is the generator’s ability to use the context, which the next layer examines directly.

Generator faithfulness

The end-to-end layer showed Gemma 4 is the safer choice; this layer isolates why. Faithfulness strips retrieval out of the picture: I hand each model the gold guideline passages for 2,989 questions (oracle context) and ask whether its answer stays within what those passages support. A failure is a contradiction — a wrong dose or threshold — or an unsupported addition: a clinical claim that isn’t in the source. Incompleteness and refusals don’t count.

Gemma 4 is about 2× more faithful than Gemma 3n, and the gap holds at every prompt: genuine hallucination runs 2.6–3.6% for Gemma 4 versus 6.3–6.7% for Gemma 3n. The telling part is the frontier comparison: under the baseline and +G1 prompts, Gemma 4 grounds about as faithfully as the unconstrained Qwen (2.5–3.7%), and only under +G1+G2 does Qwen pull clearly ahead (0.97%). So on faithfulness the small on-device Gemma 4 is already strong; Gemma 3n is the outlier, not the small size.

| Generator | baseline | +G1 | +G1+G2 |
|---|---|---|---|
| Gemma 4 E4B | 2.64% | 3.31% | 3.55% |
| Gemma 3n E4B | 6.26% | 6.46% | 6.72% |
| Qwen3.5-397B | 2.51% | 3.71% | 0.97% |

Categorized true-hallucination, computed in two passes: a detector (Patronus Lynx-70B) flags every candidate faithfulness failure, then a stronger judge (GPT-5) re-reads each flag and keeps only genuine contradictions or unsupported additions — counted over all 2,989 answers. Measured on oracle (gold) context — the model is given the correct passages — so it reflects the generator’s grounding alone; in deployment the retriever can also return wrong or missing passages, a separate source of error.
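The two-pass counting described in the caption can be sketched as follows. This is an illustration under stated assumptions: the detector and judge are replaced by hypothetical callables returning a boolean, and the toy verdicts are invented for the example. Neither Patronus Lynx-70B nor GPT-5 is actually called.

```python
from typing import Callable, Sequence


def two_pass_hallucination_rate(
    answer_ids: Sequence[str],
    detector_flags: Callable[[str], bool],
    judge_confirms: Callable[[str], bool],
) -> float:
    """Share of ALL answers flagged by the detector and then confirmed by the judge."""
    flagged = [a for a in answer_ids if detector_flags(a)]  # pass 1: candidates
    confirmed = [a for a in flagged if judge_confirms(a)]   # pass 2: genuine failures
    return len(confirmed) / len(answer_ids)


# Toy stand-ins for the two models (hypothetical verdicts, not real data).
answers = [f"a{i}" for i in range(10)]
detector_verdicts = {"a1", "a4", "a7"}
judge_verdicts = {"a4", "a7"}
judge_calls: list[str] = []


def judge(answer_id: str) -> bool:
    judge_calls.append(answer_id)  # record which answers reach the judge
    return answer_id in judge_verdicts


rate = two_pass_hallucination_rate(answers, lambda a: a in detector_verdicts, judge)
```

Two details matter in this design: the expensive judge only reads what the detector flagged, and the denominator stays the full answer set, so the rate is comparable across models regardless of how many flags each one triggers.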

These results both explain the deployment choice and point to future work. Gemma 3n’s weaker grounding (it contradicts or goes beyond the provided context roughly twice as often as Gemma 4) is a property of the model, not the prompt, so improving faithfulness depends on the generator itself rather than on further prompt engineering. A possible next step is to fine-tune the generator to use retrieved context more faithfully, which could let a more capable, more helpful model be deployed without compromising safety.

On-device latency

Finally, how fast is it in practice? I benchmarked on the project's actual device — a OnePlus OPD2413 tablet (Snapdragon 8 Elite, 16 GB RAM, Android 15) — with 54 timed runs per configuration, measuring time-to-first-token, generation speed, and the full end-to-end answer.

The app uses the device's GPU where available and falls back to CPU otherwise, so both are part of the deployment — and CPU-only devices are roughly 2–3× slower. The numbers below are at the deployed depth (top-3 retrieval).

| Configuration | First token (TTFT) | Generation time | Generation speed | Full answer (median) |
|---|---|---|---|---|
| Gemma 4 E4B · GPU (deployed) | ~1.0 s | ~16 s | ~13 tok/s | ~19 s |
| Gemma 4 E4B · CPU (deployed, no-GPU devices) | ~18 s | ~19 s | ~12 tok/s | ~43 s |
| Gemma 4 E2B · GPU (cheaper-device tier) | ~0.4 s | ~10 s | ~21 tok/s | ~14 s |
[Figure: On-device latency vs retrieval depth, Gemma 4 E4B on a OnePlus OPD2413 (Snapdragon 8 Elite). x-axis: retrieved chunks (k), 0 to 15; y-axis: seconds, 0 to 60; series: GPU and CPU, generation and first token; the deployed setting (k=3) is marked.]
Generation time (solid) and time-to-first-token (dashed), GPU (teal) vs CPU (amber). Generation stays roughly flat; on CPU, first-token time rises sharply with input length, while on GPU it stays low.

This drove the retrieval-depth choice. On GPU, latency is essentially flat in k — both first-token and generation time barely move — so depth is nearly free. On CPU it is not: generation stays flat, but first-token time climbs steeply with the number of retrieved chunks, and total latency crosses the project's 60-second budget by k=5 (≈61 s). Because not every target device has reliable GPU support, the CPU path has to stay within budget too — so I capped retrieval at k=3, which holds the CPU case to ≈43 s.
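The depth decision reduces to picking the largest k whose worst-case (CPU) latency still fits the budget. A minimal sketch, assuming a hypothetical helper name and using only the two CPU totals quoted above (the other measured depths are not reproduced here):

```python
from typing import Optional


def max_depth_within_budget(latency_by_k: dict[int, float], budget_s: float) -> Optional[int]:
    """Largest retrieval depth whose measured total latency fits the budget, if any."""
    within = [k for k, seconds in latency_by_k.items() if seconds <= budget_s]
    return max(within) if within else None


# CPU end-to-end latency in seconds, keyed by retrieval depth k.
# Only the two figures stated in the text are included.
cpu_total_latency_s = {3: 43.0, 5: 61.0}

chosen_k = max_depth_within_budget(cpu_total_latency_s, budget_s=60.0)
```

Sizing against the CPU path rather than the GPU path is the conservative choice: a device that falls back to CPU still answers within the budget.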

A deployment detail worth flagging: on the FP16-GPU path, decoding silently degrades past ~5,000 tokens of context — which is why the context window is capped at 4,096, a safety margin below the cliff. (FP32-GPU removes it, at ~25% higher time-to-first-token.)

These are flagship-device numbers. Real Zanzibar hardware — lower-to-mid-range MediaTek phones — will be slower and is not yet measured; battery drain and sustained-load thermal throttling also remain to be profiled.

Knowledge Base

The system's answers are only as good as what it can retrieve, so the knowledge base is the foundation. It is a curated corpus of 87 authoritative clinical guideline documents — covering maternal, newborn, OBGYN, and reproductive health — processed into 63,650 passages and embedded for fully-offline on-device search.

  • 87 source documents: WHO, NICE, ICM, Tanzania national …
  • 63,650 indexed passages: structure-first chunks
  • ~260 MB on-device store: EmbeddingGemma vectors, fully offline

The corpus draws on authoritative international guidelines, field references designed for low-resource settings, and Zanzibar/Tanzania national materials:

| Source family | Coverage |
|---|---|
| WHO | Antenatal, intrapartum, postnatal & newborn care; PPH, eclampsia, obstructed-labour and sepsis modules; abortion, contraception, STI, GBV |
| NICE (UK) | Antenatal / intrapartum / postnatal care, perinatal mental health, contraception |
| ICM · RCM · RCOG · ACOG | International and royal-college midwifery competencies and clinical guidance |
| Hesperian · MSF | Field references written for low-resource settings: community midwifery, essential obstetric & newborn care |
| Zanzibar / Tanzania national | Scope of practice, Ministry-of-Health competency, NTA midwifery curricula, and service standards |
| Assessment & reference | UK NMC competence-test materials, newborn-resuscitation protocols, and midwifery reference texts |

On the pipeline: PDFs are converted with an ML layout model (marker-pdf, which recovers tables and headings), then chunked structure-first: headings, not page breaks, define passage boundaries, and every chunk carries a parent-section breadcrumb so it stays self-contained when retrieved. Each chunk also gets a content-hash ID (a SHA-256 of its text) that stays the same across rebuilds as long as the chunk's text is unchanged, and doubles as a citation key. The corpus ships as a versioned bundle with a SHA-256 checksum per document, so every retrieved passage is traceable to an exact source file. The full corpus-construction pipeline is open at mamai-medical-guidelines.

Here's one real passage from the corpus. The [SOURCE | PAGE | CID] header is what lets the app cite each answer back to an exact guideline page:

[SOURCE: WHO_Complications_2017 | PAGE: 204 | CID: c5c7acd564ff3c8b]
> Hypertensive disorders of pregnancy > Magnesium sulfate maintenance dose

Withhold or delay the drug if:
- respiratory rate falls below 16 breaths per minute;
- patellar reflexes are absent;
- urinary output falls below 30 mL per hour over the preceding four hours.
A real chunk from WHO’s guideline on managing complications in pregnancy and childbirth (2017).
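A content-hash ID and the citation header shown above could be produced along these lines. This is a sketch, not the pipeline's code: truncating the SHA-256 digest to 16 hex characters is an assumption inferred from the length of the CID in the example, and any text normalisation the real pipeline applies before hashing is unknown, so this function is not expected to reproduce that exact CID.

```python
import hashlib


def chunk_id(text: str, length: int = 16) -> str:
    """SHA-256 of the chunk text, truncated to `length` hex characters (assumed length)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:length]


def format_chunk(source: str, page: int, breadcrumb: list[str], text: str) -> str:
    """Render a chunk with its citation header and parent-section breadcrumb."""
    header = f"[SOURCE: {source} | PAGE: {page} | CID: {chunk_id(text)}]"
    trail = "> " + " > ".join(breadcrumb)
    return f"{header}\n{trail}\n\n{text}"


body = "Withhold or delay the drug if:\n- respiratory rate falls below 16 breaths per minute;"
rendered = format_chunk(
    source="WHO_Complications_2017",
    page=204,
    breadcrumb=["Hypertensive disorders of pregnancy", "Magnesium sulfate maintenance dose"],
    text=body,
)
```

Because the ID depends only on the text, the same passage gets the same ID on every rebuild, while any change to the passage yields a new ID, so a citation can never silently point at altered content.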

MamaBench

To measure whether the system actually helps, I needed a benchmark fit for the task — so I built MamaBench, a normalized question-answering benchmark for maternal, newborn, OBGYN, and reproductive health, released openly on Hugging Face. It pulls 25,949 questions from seven medical datasets spanning Africa, India, and the USA; the Kenya and HealthBench tracks that anchor the evaluation above come from here.

  • 25,949 questions, normalized into one schema
  • 3 evaluation tracks: MCQ · open-ended · rubric
  • 7 source datasets: Africa, India, USA, HealthBench

| Track | Sources | Rows | Scored by |
|---|---|---|---|
| Multiple-choice | MedMCQA, MedQA-USMLE, AfriMed-QA | 23,241 | exact answer |
| Open-ended + reference | Kenya Vignettes, AfriMed-QA SAQ, WHB | 369 | key-fact recall vs an expert reference |
| Open-ended + rubric | HealthBench (oss / consensus / hard) | 2,339 | physician-written weighted rubrics |

Its two open-ended tracks are the ones that anchor the end-to-end evaluation. The Kenya Clinical Vignettes are nurse-written Kenyan primary-care scenarios — maternal, neonatal, child, and reproductive health — paired with clinician reference answers, close to the Zanzibar setting. HealthBench-oss is the OBGYN slice of OpenAI's HealthBench, where each question carries weighted, physician-written rubric criteria that a judge scores the model's answer against.

Most of the work is curation. Every source is normalized into one schema with content-hash IDs and a full provenance manifest, and the open-ended sources are scope-filtered by an LLM classifier (Qwen3.6-27B) that sorts each question into one of five categories — maternal, neonatal, child health, reproductive health, or out-of-scope — and drops the off-topic ones. I validated that classifier two ways: against a much larger model (Qwen3.5-397B, 98% agreement on the HealthBench subset), and against Kenya's existing Gemini-labelled categories (87%). Its full reasoning is saved for every question, and the seven prompts it couldn't converge on are documented rather than quietly dropped.

The benchmark also bundles an OBGYN slice of HealthBench's physician-labelled grader meta-eval (6,853 judgments) — the calibration data used to validate the answer-quality judge in the evaluation above. The full construction code is open at mamabench.

MamaRetrieval

The retrieval scoreboard earlier runs on MamaRetrieval, a retrieval benchmark I built for medical RAG and released openly on Hugging Face. It pairs 3,185 clinical questions with 230,964 graded relevance labels across the top-20 results of six retrievers — enough to rank retrievers at the top-3 depth the app actually uses.

  • 3,185 clinical queries
  • 6 retrievers compared
  • 230,964 relevance labels: graded (query, chunk) pairs

No human-labelled retrieval set exists for this domain, so I built one. For each clinically-useful chunk in the guideline corpus, an LLM wrote a short question that the chunk could answer; each question was run through all six retrievers, their top-20 results pooled (~72 candidate chunks per question), and every (question, chunk) pair scored for relevance by a strong LLM judge.
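The pooling step can be sketched as a de-duplicated union of each retriever's top results. The function name, retriever names, and chunk IDs below are hypothetical. With six retrievers at depth 20 the pool can hold at most 6 × 20 = 120 chunks per question, and overlap between retrievers is what brings the average down to the ~72 reported above.

```python
def pool_candidates(rankings: dict[str, list[str]], depth: int = 20) -> list[str]:
    """Union of each retriever's top-`depth` chunk IDs, de-duplicated in first-seen order."""
    seen: dict[str, None] = {}  # dict keeps insertion order, so it works as an ordered set
    for ranked in rankings.values():
        for chunk_id in ranked[:depth]:
            seen.setdefault(chunk_id, None)
    return list(seen)


# Toy rankings for one question (hypothetical retriever names and chunk IDs).
rankings = {
    "dense": ["a", "b", "c"],
    "bm25": ["b", "d", "a"],
}
pool = pool_candidates(rankings, depth=2)
```

Judging the pooled set once, rather than each retriever's list separately, means every retriever is scored against the same labels.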

Relevance isn't treated as yes/no. The judge grades each pair on four dimensions, combined as score = D1 × (D2 + D3 + D4), from 0 to 6:

| Dimension | What it asks | Range |
|---|---|---|
| D1 · Topic | Does the chunk address the same clinical problem as the question? A gate: if not, the rest score 0. | yes / no |
| D2 · Clinical content | How rich is the chunk’s clinical content? | 0–2 |
| D3 · Actionable guidance | How specific: vague advice → exact doses, thresholds, steps | 0–2 |
| D4 · Density | How much of the chunk is useful for this specific question | 0–2 |
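The scoring rule is small enough to write out directly. In the sketch below the function names are illustrative assumptions; the formula and the lenient (≥3) and strict (≥5) relevance bars are the ones stated in this document.

```python
def relevance_score(d1: bool, d2: int, d3: int, d4: int) -> int:
    """score = D1 x (D2 + D3 + D4): the topic gate zeroes everything when it fails."""
    for name, value in (("d2", d2), ("d3", d3), ("d4", d4)):
        if not 0 <= value <= 2:
            raise ValueError(f"{name} must be in 0..2, got {value}")
    return int(d1) * (d2 + d3 + d4)


def is_relevant(score: int, strict: bool = False) -> bool:
    """Lenient bar is score >= 3, strict bar is score >= 5."""
    return score >= (5 if strict else 3)


on_topic = relevance_score(True, 1, 2, 0)    # on-topic chunk with specific guidance
off_topic = relevance_score(False, 2, 2, 2)  # rich content, but about the wrong problem
best = relevance_score(True, 2, 2, 2)
```

Multiplying by the gate rather than adding it means a detailed chunk about the wrong clinical problem can never outrank a thin chunk about the right one.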

The judge (Qwen3.5-397B) was calibrated against Claude Opus 4.7 reference labels — 95% agreement on whether a chunk clears the relevance bar. Two honest caveats, though: it is a single LLM judge, not a human gold standard; and because the questions are themselves LLM-written from corpus chunks, they tend to flatter dense retrievers. Both mean the retriever-vs-retriever ranking is the trustworthy signal, not the absolute scores. The judge's full reasoning on every pair is shipped with the dataset for auditing.

Limitations

I've been honest about the gaps throughout — here are the ones that matter most. MAM-AI is a thoroughly-evaluated research prototype, not a deployed product.

  • No user testing yet. Despite five weeks on site, a lag in project funding meant the planned field test with nurse-midwives couldn't go ahead this cycle — so there's no real-world data yet on how the system performs at the point of care.
  • Not fully verified by clinicians. The system was tested thoroughly and shaped by clinician input — the fix for over-deflection came directly from that feedback — but its clinical correctness has not yet been fully verified by clinicians, and the evaluation relies on LLM judges rather than a human gold standard.
  • English only. The deployment context — Zanzibar and mainland Tanzania — is Swahili-speaking, but the system, knowledge base, and evaluation are entirely in English. Swahili support and evaluation remain a deployment gap.
  • RAG doesn't yet pay off on the small generator. Better retrieval doesn't reach the answers, and adding retrieved context is net-neutral-to-negative: the generator can't fully use what it's given. Teaching it to do so (generator-side RAG grounding) is the central open problem, not a solved one.
  • Latency measured on a single flagship tablet. The cheaper MediaTek phones that make up most of the target market are untested and will be slower; battery drain and sustained thermal throttling in a warm ward are not yet profiled.