Nursing surveillance required diagnosis-related classification, but key clinical signals were fragmented across heterogeneous EMR sources.
EMR-Based Nursing Surveillance for Automatic ICD Coding
A clinical AI study showing that core EMR data available during nursing work can support practical diagnosis-related classification.
Combines heterogeneous structured EMR and Korean clinical text into an evaluable NLP pipeline.
Trains two KM-BERT models independently and averages their raw logits to stabilize the text representation.
The final model was evaluated on practical classification behavior, including rare-class recall and available-data constraints, rather than on a single standalone score.
A clinical AI pipeline that processes structured EMR and nursing text in parallel before stacking them for ICD prediction.
Laboratory results, IO, BST, vital signs, and patient information
Nursing notes and PACU records
Korean clinical text representation
Dimensionality reduction and final ICD prediction
Overall behavior, class balance, and rare-class recall
Supports the portfolio claim that NLP/LLM systems should be judged through domain data structure and error distribution, not only headline accuracy.
Context
This project focused on supporting nursing surveillance for abdominal surgery patients through automatic ICD code prediction. Instead of relying on physician narratives or discharge summaries that become available later, the work centered on core EMR data that nurses can access during routine care.
Problem
Nurses continuously monitor patients and identify risks, but the signals needed for diagnosis-related classification are scattered across laboratory results, IO (intake and output), BST (blood sugar test), vital signs, patient information, nursing notes, and PACU (post-anesthesia care unit) records. Existing automatic ICD coding approaches often depend on physician-centered documents or extra resources, which makes them less suitable for direct nursing surveillance support.
Implementation
I worked on integrating heterogeneous EMR sources for 8,587 abdominal surgery patients and structuring them into a usable modeling pipeline. The approach combined two independently trained KM-BERT models, averaged their raw logits for an ensemble effect, reduced the representation with PCA, and used XGBoost as a stacking meta-classifier for the final ICD prediction task. The workflow also addressed class imbalance through stratified splitting and weighted sampling.
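The logit-averaging step and the class-weighted sampling can be sketched as follows. This is a minimal illustration, not the project's code: the function names, toy logits, and class labels are hypothetical, and inverse-frequency weighting is one common scheme assumed here because the exact weighting is not specified. The PCA and XGBoost stages are left out so the sketch runs on the standard library alone.

```python
from collections import Counter


def average_logits(logits_a, logits_b):
    """Element-wise mean of two models' raw (pre-softmax) logits.

    Each argument is a list of per-sample rows with one logit per class.
    """
    if len(logits_a) != len(logits_b):
        raise ValueError("both models must score the same samples")
    averaged = []
    for row_a, row_b in zip(logits_a, logits_b):
        if len(row_a) != len(row_b):
            raise ValueError("both models must score the same classes")
        averaged.append([(a + b) / 2.0 for a, b in zip(row_a, row_b)])
    return averaged


def inverse_frequency_weights(labels):
    """Per-sample sampling weights proportional to 1 / class frequency.

    Weights are normalised to sum to 1, so every class receives the same
    total sampling probability regardless of how many samples it has.
    (Assumed scheme, for illustration only.)
    """
    counts = Counter(labels)
    raw = [1.0 / counts[label] for label in labels]
    total = sum(raw)
    return [w / total for w in raw]


# Hypothetical toy data: 3 samples, 2 classes, scored by two models.
model_a = [[2.0, 0.0], [0.5, 1.5], [1.0, 1.0]]
model_b = [[1.0, 1.0], [0.5, 0.5], [3.0, -1.0]]
ensemble_logits = average_logits(model_a, model_b)

# Two samples of a frequent class and one of a rare class.
weights = inverse_frequency_weights(["common", "common", "rare"])
```

In a pipeline like the one described, the averaged logits would then be the features passed on to dimensionality reduction and the meta-classifier. Averaging before any softmax keeps the two models' scores on the same raw scale.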
Outcome
The final Double KM-BERT + XGBoost + PCA model showed more stable classification behavior than the single-model and simple-ensemble baselines, while retaining meaningful recall on rare classes. This suggests that diagnosis classification oriented toward nursing surveillance is practical using only the core EMR data available during care.
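To make the point about rare-class recall concrete, here is a small sketch of per-class and macro-averaged recall next to plain accuracy. The labels and predictions are invented toy values, not results from this project, and the helper names are hypothetical.

```python
from collections import Counter


def per_class_recall(y_true, y_pred):
    """Recall per class: correct predictions / true samples of that class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / support[label] for label in support}


def macro_recall(recalls):
    """Unweighted mean of per-class recall, so rare classes count equally."""
    return sum(recalls.values()) / len(recalls)


# Toy example: 8 samples of a frequent class, 2 of a rare class.
y_true = ["common"] * 8 + ["rare"] * 2
# The classifier misses one of the two rare samples.
y_pred = ["common"] * 8 + ["common", "rare"]

recalls = per_class_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro = macro_recall(recalls)
```

In this toy case accuracy is 0.9 while recall on the rare class is only 0.5, which is why a single headline score can hide weak behavior on infrequent diagnoses.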