
EMR-Based Nursing Surveillance for Automatic ICD Coding

A clinical AI study showing that core EMR data available during nursing work can support practical diagnosis-related classification.

Type Clinical AI Research
Year 2025
Primary Role Graduate Researcher
Roles Graduate Researcher, AI Engineer, Data Scientist
Applied NLP and LLM Research Engineer
Published work
Summary available
Medical AI, Nursing Surveillance, EMR, Automatic ICD Coding, KM-BERT, XGBoost, Ensemble

Combines heterogeneous structured EMR and Korean clinical text into an evaluable NLP pipeline.

structured, text, hybrid
data, training, evaluation
Reviewed overall behavior and rare-class recall together
Core EMR classification without post-hoc documents
Strong Rare-class Recall
Problem

Nursing surveillance required diagnosis-related classification, but key clinical signals were fragmented across heterogeneous EMR sources.

Approach

Trained two KM-BERT models independently and averaged raw logits to stabilize text representation.
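
To make the logit-averaging step concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the project's code: the function names (`average_logits`, `softmax`) and the three-class logit values are hypothetical, and real model outputs would come from the two trained KM-BERT classifiers.

```python
import math


def average_logits(logits_a, logits_b):
    """Element-wise mean of two models' raw logits for one sample."""
    if len(logits_a) != len(logits_b):
        raise ValueError("both models must score the same set of classes")
    return [(a + b) / 2.0 for a, b in zip(logits_a, logits_b)]


def softmax(logits):
    """Numerically stable softmax, applied only after the logits are averaged."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw logits from two independently trained models for one note.
model_a = [2.0, 0.5, -1.0]
model_b = [1.0, 1.5, -1.0]

ensemble = average_logits(model_a, model_b)   # averaged raw logits
probs = softmax(ensemble)                     # probabilities, for inspection only
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Averaging before the softmax keeps the combined scores on the raw logit scale, which is the form this write-up describes passing on to the later stages.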

Outcome

The final model was reviewed for practical classification behavior rather than a single standalone score, including rare-class recall and available-data constraints.

Architecture

A clinical AI pipeline that processes structured EMR and nursing text in parallel before stacking them for ICD prediction.

Structured EMR

Laboratory results, intake/output (IO), blood sugar test (BST), vital signs, and patient information

Nursing Text

Nursing notes and PACU records

Dual KM-BERT

Korean clinical text representation

PCA + XGBoost

Dimensionality reduction and final ICD prediction

Rare-class Evaluation

Overall behavior, class balance, and rare-class recall
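
The rare-class evaluation idea can be sketched as follows. This is a toy example under assumptions: the class names (`common`, `less_common`, `rare`) and the labels are made up for illustration and are not data from the study, and the helper names are hypothetical. It shows why overall accuracy and per-class recall are worth reading side by side.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall for each class: correct predictions divided by true occurrences."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / n for label, n in support.items()}


def macro_recall(recalls):
    """Unweighted mean of per-class recall, so rare classes count equally."""
    return sum(recalls.values()) / len(recalls)


# Made-up labels: a classifier that mostly predicts the majority class.
y_true = ["common"] * 4 + ["less_common"] * 2 + ["rare"]
y_pred = ["common"] * 4 + ["less_common", "common", "common"]

recalls = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
balanced = macro_recall(recalls)
```

In this toy case accuracy is 5/7 while macro recall is 0.5 and recall on the `rare` class is 0, which is the kind of gap a single headline score would hide.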

Outcome Metrics
Balanced Reviewed overall behavior together with rare-class recall
Core EMR Focused on data available during nursing work rather than post-hoc documents
High Practical recall on sparse classes
Research Support
Deep Learning based Automatic ICD Coding for Nursing Surveillance of Abdominal Surgery Patients
Journal of The Korea Society of Computer and Information | 2025

Supports the portfolio claim that NLP/LLM systems should be judged through domain data structure and error distribution, not only headline accuracy.


Context

This project focused on supporting nursing surveillance for abdominal surgery patients through automatic ICD code prediction. Instead of relying on physician narratives or discharge summaries that become available later, the work centered on core EMR data that nurses can access during routine care.

Problem

Nurses continuously monitor patients and identify risks, but the signals needed for diagnosis-related classification are scattered across laboratory results, IO, BST, vital signs, patient information, nursing notes, and PACU records. Existing automatic ICD coding approaches often depend on physician-centered documents or extra resources, which makes them less suitable for direct nursing surveillance support.

Implementation

I worked on integrating heterogeneous EMR sources for 8,587 abdominal surgery patients and structuring them into a usable modeling pipeline. The approach combined two independently trained KM-BERT models, averaged their raw logits for an ensemble effect, reduced the representation with PCA, and used XGBoost as a stacking meta-classifier for the final ICD prediction task. The workflow also addressed class imbalance through stratified splitting and weighted sampling.
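
The class-imbalance handling mentioned above can be illustrated with a small standard-library sketch. Everything here is an assumption for illustration: the helper names, the 20% test fraction, the inverse-frequency weighting formula, and the synthetic label counts are hypothetical and are not taken from the study's code or data.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with each class split at the same rate."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one test sample for any class with two or more members.
        n_test = max(1, round(len(indices) * test_fraction)) if len(indices) > 1 else 0
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def inverse_frequency_weights(labels):
    """Per-sample weight n / (k * count[label]); every class gets equal total weight."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[label]) for label in labels]


# Synthetic, imbalanced labels (placeholders, not study data).
labels = ["common"] * 80 + ["less_common"] * 15 + ["rare"] * 5

train_idx, test_idx = stratified_split(labels)
train_labels = [labels[i] for i in train_idx]
weights = inverse_frequency_weights(train_labels)

# Weighted sampling: rare-class samples are drawn more often per sample.
sampler = random.Random(1)
batch = sampler.choices(train_idx, weights=weights, k=32)
```

With these weights each class contributes the same total sampling mass, so a batch drawn this way sees rare classes far more often than their raw frequency would allow.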

Outcome

The final Double KM-BERT + XGBoost + PCA model showed more stable classification behavior than the single-model and simple-ensemble baselines, while retaining meaningful recall on rare classes. This suggests that diagnosis classification for nursing surveillance can be evaluated in practice using only the core EMR data available during care.