EMR-Based Nursing Surveillance for Automatic ICD Coding

A clinical AI study showing that core EMR data available during nursing work can support practical diagnosis-related classification.

Type: Clinical AI Research

Year: 2025

Primary Role: Graduate Researcher

Roles: Graduate Researcher, AI Engineer, Data Scientist

Applied NLP and LLM Research Engineer

Published work, summary available

Medical AI, Nursing Surveillance, EMR, Automatic ICD Coding, KM-BERT, XGBoost, Ensemble

Combines heterogeneous structured EMR and Korean clinical text into an evaluable NLP pipeline.

structured, text, hybrid

data, training, evaluation

Reviewed overall behavior and rare-class recall together

Core EMR classification without post-hoc documents

Strong rare-class recall

A clinical AI pipeline that processes structured EMR and nursing text in parallel before stacking them for ICD prediction.

Structured EMR

Laboratory results, intake/output (IO), blood sugar tests (BST), vital signs, and patient information

Nursing Text

Nursing notes and PACU records

Dual KM-BERT

Korean clinical text representation

PCA + XGBoost

Dimensionality reduction and final ICD prediction

Rare-class Evaluation

Overall behavior, class balance, and rare-class recall

Context

This project focused on supporting nursing surveillance for abdominal surgery patients through automatic ICD code prediction. Instead of relying on physician narratives or discharge summaries that become available later, the work centered on core EMR data that nurses can access during routine care.

Problem

Nurses continuously monitor patients and identify risks, but the signals needed for diagnosis-related classification are scattered across laboratory results, IO, BST, vital signs, patient information, nursing notes, and PACU records. Existing automatic ICD coding approaches often depend on physician-centered documents or extra resources, which makes them less suitable for direct nursing surveillance support.

Implementation

I worked on integrating heterogeneous EMR sources for 8,587 abdominal surgery patients into a single modeling pipeline. The approach trained two KM-BERT models independently, averaged their raw logits to form an ensemble, reduced the resulting representation with PCA, and used XGBoost as a stacking meta-classifier for the final ICD prediction. Class imbalance was handled through stratified splitting and weighted sampling.
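The logit averaging and class-imbalance steps can be illustrated with a minimal standard-library sketch. This is not the project's code: the function names, the inverse-frequency weighting scheme, and the labels "common" and "rare" are assumptions made for illustration. The PCA and XGBoost stages are not shown here.

```python
import random
from collections import Counter, defaultdict


def average_logits(logits_a, logits_b):
    """Element-wise mean of two models' raw logits for one sample."""
    return [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split sample indices so each class keeps a similar share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Hold out at least one sample per class when the class has 2+ members.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def inverse_frequency_weights(labels):
    """Per-sample weight of 1 / class count (an assumed weighting scheme)."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def weighted_sample(indices, weights, k, seed=0):
    """Draw k indices with replacement, favoring higher-weight samples."""
    rng = random.Random(seed)
    return rng.choices(indices, weights=weights, k=k)
```

With inverse-frequency weights, each class contributes the same total weight, so a class with few patients is drawn about as often as a large one during training. Splitting per class keeps rare classes represented in the held-out set, which is what makes rare-class recall measurable at all.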

Outcome

The final Double KM-BERT + XGBoost + PCA model showed more stable classification behavior than the single-model and simple-ensemble baselines, while retaining meaningful recall on rare classes. This suggests that diagnosis classification oriented toward nursing surveillance is practically feasible using only the core EMR data available during care.
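Reviewing overall behavior and rare-class recall together can be sketched as follows. This is an illustrative standard-library example, not the project's evaluation code; the function names and the toy labels are assumptions.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions divided by true instances."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def macro_recall(y_true, y_pred):
    """Unweighted mean of per-class recall, so rare classes count equally."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def accuracy(y_true, y_pred):
    """Share of all samples predicted correctly, dominated by frequent classes."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Accuracy alone can look strong while a rare class is mostly missed, because frequent classes dominate the count. Reporting per-class and macro recall next to it exposes that gap.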