Machine versus human: Can Koda replace expert drug coders?


By Dr. Julia Simon and Franziska Puosi

WHODrug Koda (https://who-umc.org/whodrug/whodrug-global/applications-and-services/koda/), developed by the Uppsala Monitoring Centre (UMC) in 2019, is one of the first systems for automated drug coding driven by artificial intelligence (AI). The system is trained on extensive historical drug coding data combined with UMC’s accumulated expertise and is retrained with each WHODrug release to ensure alignment with current dictionary content and regulatory requirements.

In our organisation, we use a validated coding tool that automates WHODrug coding for direct hits and supports manual coding for all remaining verbatim terms. We aim for high quality and consistency in coding. Medical coders and reviewers work closely together, using unique‑term review lists and targeted sampling to verify correctness, document comments, and resolve discrepancies. All coding decisions are captured in an audit‑trailed environment, ensuring full traceability and enabling reliable integration of coding data into downstream clinical datasets. This combined approach of automation, expert assessment, and systematic quality control ensures robust, compliant, and reproducible drug coding across all studies.

Koda promises faster and more consistent coding, especially for large datasets from clinical trials or pharmacovigilance. These claims raise important questions for organisations like ours that rely on high-quality drug coding: How well does Koda perform? Does it genuinely reduce workload? And how closely do its results match manual coding by experts?

In this article, we critically assess Koda’s observed performance and its practical utility in real-world settings.

Setup

To assess this, our organisation tested Koda using nearly 10,000 anonymised drug entries from a general test dataset compiled specifically for evaluation purposes.

The Koda test in our organisation was conducted as a joint initiative between medical coders and data managers, ensuring both clinical accuracy and technical robustness.

All relevant fields were prepared in a controlled Excel environment and uploaded into the Koda web app, where automated coding algorithms generated preliminary WHODrug assignments. In parallel, the same drug entries were coded with our standard coding tool, in which direct hits are autocoded and all remaining entries undergo structured manual coding, followed by medical review, including audit trail validation and consistency checks.

The data were compared to evaluate concordance, identify discrepancies, and assess the potential of Koda to enhance efficiency and coding quality.
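The comparison step can be pictured as a simple pairwise classification. The following Python is a minimal sketch under stated assumptions, not the procedure actually used in the test: the record fields (drug_name, drug_code, atc_code), the category labels, and the example values are hypothetical placeholders chosen to mirror the categories reported in the Results.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Tuple


class Coding(NamedTuple):
    """One coding result for a verbatim (field names are hypothetical)."""
    drug_name: str
    drug_code: str
    atc_code: str


def classify(koda: Coding, standard: Coding) -> str:
    """Assign a Koda/standard coding pair to one concordance category."""
    drug_match = (koda.drug_name, koda.drug_code) == (
        standard.drug_name,
        standard.drug_code,
    )
    atc_match = koda.atc_code == standard.atc_code
    if drug_match and atc_match:
        return "identical"
    if atc_match:
        return "drug differs, ATC matches"
    if drug_match:
        return "drug matches, ATC differs"
    return "drug and ATC differ"


def tally(pairs: Iterable[Tuple[Coding, Coding]]) -> Counter:
    """Count how many pairs fall into each concordance category."""
    return Counter(classify(koda, standard) for koda, standard in pairs)


# Invented placeholder records, not real WHODrug or ATC codes.
pairs = [
    (Coding("DRUG A", "D001", "X01"), Coding("DRUG A", "D001", "X01")),
    (Coding("BRAND B", "D002", "X02"), Coding("GENERIC B", "D003", "X02")),
    (Coding("DRUG C", "D004", "X03"), Coding("DRUG C", "D004", "X04")),
]
counts = tally(pairs)
for category, n in counts.items():
    print(f"{category}: {n}")
```

Every discordant pair would still need expert review to decide which coding is correct; the sketch only sorts pairs into categories.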

Results

Koda’s performance was evaluated by applying the system to a dataset of 9,827 verbatim drug names. Of all submitted verbatims, Koda automatically coded 7,040 entries, corresponding to 71.6% of the dataset (Figure 1A). The remaining 28.4% required additional manual coding. Among the automatically coded records, 101 cases (1.43% of Koda-coded verbatims) were judged to be incorrect after review. For example, Epi‑pen for a bee‑sting was coded as EPI [CLINDAMYCIN PHOSPHATE], and Metex for psoriatic arthritis was coded as METEX [METFORMIN HYDROCHLORIDE] instead of METEX [METHOTREXATE SODIUM]. This corresponds to an overall accuracy of approximately 98.6% for Koda‑generated drug names, drug codes, and ATC codes.
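For readers who want to retrace the headline rates, they follow directly from the three counts reported above. This is a small illustrative Python calculation; the variable and function names are our own choice.

```python
TOTAL_VERBATIMS = 9_827   # verbatims submitted to Koda
KODA_CODED = 7_040        # verbatims coded automatically by Koda
KODA_INCORRECT = 101      # Koda codings judged incorrect after review


def pct(part: int, whole: int, digits: int = 1) -> float:
    """Return part/whole as a percentage, rounded to the given digits."""
    return round(100 * part / whole, digits)


autocoding_rate = pct(KODA_CODED, TOTAL_VERBATIMS)                  # 71.6
manual_share = pct(TOTAL_VERBATIMS - KODA_CODED, TOTAL_VERBATIMS)   # 28.4
error_rate = pct(KODA_INCORRECT, KODA_CODED, digits=2)              # 1.43
accuracy = pct(KODA_CODED - KODA_INCORRECT, KODA_CODED)             # 98.6

print(f"Autocoded: {autocoding_rate}%  Manual: {manual_share}%")
print(f"Error rate: {error_rate}%  Accuracy: {accuracy}%")
```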

Comparison with our standard coding demonstrated substantial concordance. Of the 7,040 verbatims coded automatically by Koda, 6,317 (89.7%) received identical coding – drug name, drug code, and ATC code – when compared with the standard coding in our organisation (Figure 1B). Discrepancies were observed in 723 cases (10.3%). Within this subset, 536 cases (7.6%) involved differing drug names or codes while matching on ATC code. This divergence is attributed largely to Koda’s tendency to code closely to the original verbatim, whereas manual coding more often employed generic drug codes. Additionally, 145 cases (2.1%) showed identical drug codes but discordant ATC codes; 81 of these (55.9%) reflected an incorrect ATC assignment by Koda, while 43 cases (29.7%) represented situations in which Koda selected a more appropriate ATC classification than the manual coder. A smaller subset of 42 cases (0.6%) showed discrepancies in both drug code and ATC code; these comprised 20 cases in which Koda misclassified entries and four cases in which manual coding was assessed as incorrect while Koda provided the correct classification.
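The concordance breakdown can be checked for internal consistency: the three discordant categories should add up to the 723 discrepancies, and together with the identical codings they should account for all 7,040 Koda-coded verbatims. The sketch below does this in Python using only the counts reported above; the category labels are shorthand of our own.

```python
KODA_CODED = 7_040
IDENTICAL = 6_317

# Discordant categories as reported in the comparison.
DISCORDANT = {
    "drug differs, ATC matches": 536,
    "drug matches, ATC differs": 145,
    "drug and ATC differ": 42,
}

total_discordant = sum(DISCORDANT.values())

# The categories must partition the Koda-coded verbatims completely.
assert total_discordant == 723
assert IDENTICAL + total_discordant == KODA_CODED

shares = {
    label: round(100 * n / KODA_CODED, 1) for label, n in DISCORDANT.items()
}
identical_share = round(100 * IDENTICAL / KODA_CODED, 1)

print(f"identical: {identical_share}%")
for label, share in shares.items():
    print(f"{label}: {share}%")
```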

Figure 1: A) The stacked bar shows the distribution of coding outcomes for all 9,827 verbatims processed by Koda. Fully coded entries, including both drug code and ATC code, account for 7,040 verbatims. Partial coding, where a drug code is present but the ATC code is missing, occurs in 1,191 cases. A total of 1,596 verbatims could not be coded automatically. B) The pie chart summarizes concordant and discordant coding outcomes for the subset of 7,040 verbatims that were fully coded by Koda (drug code and ATC). Concordant coding with respect to both drug code and ATC code represents the majority of cases (89.7%). Among the discordant results, differences in drug code with concordant ATC classification account for 7.6% of cases. Discordant ATC classifications with concordant drug code are observed in 2.1% of cases, while discordant drug codes and ATC classifications occur in 0.6% of cases.

Patterns in Koda’s coding-path logic were examined to identify common sources of error. Most verbatims were processed through the “direct hit” (n=3,409; 1.1% error rate) or “non‑unique solved by unsalted drug name” (n=2,204; 1.5% error rate) pathways. Higher error rates were observed for less frequently used coding paths such as “non‑unique solved by indication”, which, despite only 126 occurrences, had an error rate of 5.6%. For several paths, including “spelling suggestion”, “identify ingredients”, “non‑unique solved by country”, “non‑unique solved by route”, and “non‑unique solved by preferred base”, the underlying data contained no explicit error-rate values.

Data coded using our standard coding tool were available for comparison. Of the 9,827 verbatims, 6,013 (61.2%) were resolved by WHODrug autocoding in the standard coding workflow, and 3,814 (38.8%) required manual WHODrug coding. Taking into account the number of single ATC assignments, 5,272 verbatims (53.6%) were both autocoded in WHODrug and had a single ATC classification, so that no manual step was needed (Figure 2). By contrast, Koda’s automatic coding covered 7,040 verbatims: 1,027 more than the standard workflow autocoded at drug-code level, and 1,768 more than it coded fully automatically.
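Because the two workflows automate different steps, it helps to make the comparison explicit. The Python sketch below derives the standard-workflow breakdown and Koda’s gain from the counts reported in this section; the variable names are our own, and the split of the manually coded verbatims by ATC handling is left out because only percentages are available for it.

```python
TOTAL = 9_827
KODA_FULL = 7_040            # Koda: drug code and ATC code assigned
STD_AUTOCODED = 6_013        # standard workflow: drug code autocoded
STD_AUTO_SINGLE_ATC = 5_272  # standard workflow: autocoded, single ATC option


def share(n: int) -> float:
    """Percentage of all verbatims, rounded to one decimal place."""
    return round(100 * n / TOTAL, 1)


# Autocoded drug code, but the ATC code had to be selected manually.
std_auto_manual_atc = STD_AUTOCODED - STD_AUTO_SINGLE_ATC
# Drug code entered manually in the standard workflow.
std_manual = TOTAL - STD_AUTOCODED

# Koda's gain against the two possible baselines.
gain_vs_autocoded = KODA_FULL - STD_AUTOCODED
gain_vs_fully_automated = KODA_FULL - STD_AUTO_SINGLE_ATC

print(f"Standard, fully automated: {share(STD_AUTO_SINGLE_ATC)}%")
print(f"Standard, autocoded drug / manual ATC: {share(std_auto_manual_atc)}%")
print(f"Standard, manual drug code: {share(std_manual)}%")
print(f"Koda gain vs autocoded drug codes: {gain_vs_autocoded}")
print(f"Koda gain vs fully automated: {gain_vs_fully_automated}")
```

Which baseline is appropriate depends on the question: the 6,013 figure compares drug-code autocoding alone, whereas the 5,272 figure is the like-for-like comparison with Koda’s fully coded output.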

Figure 2: Summary of coding outcomes obtained using the Koda and GKM standard coding approaches. With Koda, 71.6% of all records were fully coded automatically, including assignment of both a drug code and an ATC code. Partial coding, defined as the presence of a drug code without an assigned ATC code, accounted for 12.1% of verbatims, while 16.2% of records could not be coded automatically. Within the GKM coding approach, 53.6% of verbatims were fully coded through automated processing, consisting of automatically coded WHODrug entries with a single ATC assignment. In 7.5% of cases, the drug code was assigned automatically, but manual selection of the ATC code was required. In 24.3% of verbatims, the drug code was entered manually, while the ATC code could be assigned automatically due to the availability of a single ATC option. For the remaining 14.5% of cases, both the drug code and the ATC code required manual selection.

Evaluation

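Across the dataset of 9,827 verbatims, Koda automatically coded 7,040 entries – an autocoding rate of 71.6%, higher than the 61.2% of drug codes autocoded in our standard workflow and well above the 53.6% (5,272 verbatims) that our workflow coded fully automatically, including the ATC code. Measured against that fully automated share, Koda processed almost 1,800 additional records automatically, which illustrates its potential to reduce manual workload, even if users must still validate results.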

Accuracy within the autocoded subset was high. Only 101 entries (1.43%) were judged incorrect, corresponding to an accuracy of roughly 98.6% for drug names, drug codes, and ATC codes. Errors occurred mainly in complex scenarios, such as entries requiring interpretation of indication or country information. Simple cases – direct matches or clearly recognisable drug names – showed very low error rates.

The comparison with our reference standard revealed strong overall consistency. Nearly 90% of Koda-coded entries were identical to our standard coding across drug name, drug code, and ATC classification. Most remaining discrepancies reflected differences in coding conventions rather than clear mistakes. Koda tends to stay close to the verbatim wording, whereas human coders often select generic drug codes. This explains the majority of cases where drug names differed but the ATC code still matched.

ATC-related discrepancies offered further insight. In the 145 cases with identical drug codes but differing ATC codes, Koda was incorrect in 81 cases. However, in 43 cases Koda proposed the more appropriate ATC classification. This suggests that differences between automated and manual assignment are not always errors but may reflect variation in judgement, especially where multiple ATC options are plausible.

Koda’s internal coding path analysis confirmed that accuracy depended strongly on case complexity. Straightforward paths like “direct hit” or “non‑unique solved by unsalted drug name” dominated and showed error rates around 1–1.5%. Less common paths that required more nuanced interpretation – such as those relying on indication – showed higher error rates.

Overall, Koda delivered robust automated drug coding performance, matching standard coding results in most cases while increasing the share of autocoded entries. The findings indicate that Koda can reliably support large-scale coding tasks and help streamline workflows, particularly for high-volume studies.

However, testing also highlighted practical limitations of the current web application. Review and quality control functions are still limited: filtering and sorting options are restricted, metadata fields cannot be edited, and synonym management is not supported. As a result, users cannot yet fully replicate the structured review workflows that are standard in many organisations. These limitations affect usability more than accuracy, but the missing functions are essential for routine operational work.

Conclusion

Koda demonstrated high accuracy, strong agreement with expert standard coding, and a clear increase in automated coverage. While it cannot fully replace human expertise – especially in complex or ambiguous cases – it performs reliably across most verbatims and shows meaningful potential to reduce manual workload. The biggest obstacle to implementation is not the quality of the coding, but rather the currently limited functionality of the web application in terms of review, traceability, and quality control.

As electronic data capture systems evolve and integrate coding tools more directly, the value of solutions like Koda will continue to grow. Its implementation in our organisation will be re-evaluated once the application offers more flexible review features and a smoother, user-centric workflow.