Collecting Evidence

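This document explains which observation records you should leave inside a run through Contexta.

Contexta records should contain information produced by training, evaluation, or inference operations that were actually performed.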

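What do you collect?

Evidence	Why it is needed
Dataset or input reference	Identifies which input a measurement came from.
Stages and batches	Record where in the run the training, evaluation, or inference work executed.
Computed metrics	Can be used to compare runs and to judge pass criteria.
Events and usage	Leave calls, selections, exceptions, and abnormal states in a structured form.
Artifacts	Link observations to outputs such as models, checkpoints, evaluation files, and reports.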

Runnable examples

The examples below show how Contexta can collect evidence in an experiment environment.

Three examples follow in this order: Machine Learning, Deep Learning, and LLM.

"""Train a real regression model and capture its measured evidence."""

import pickle
from pathlib import Path

from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

from contexta import Contexta
from contexta.capture import LocalJsonlSink


features, targets = load_diabetes(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    features, targets, test_size=0.2, random_state=42
)

workspace = Path(".contexta")
ctx = Contexta(workspace=str(workspace), config={"project_name": "diabetes-regression"})
local_sink = next(sink for sink in ctx.sinks if isinstance(sink, LocalJsonlSink))
model = LinearRegression()

with ctx.run("linear-regression", dataset_ref="dataset:sklearn.diabetes") as run:
    run.event(
        "dataset.loaded",
        message="Loaded the scikit-learn diabetes dataset",
        attributes={"rows": len(features), "features": features.shape[1]},
    )
    with run.stage("train"):
        model.fit(train_x, train_y)

    with run.stage("evaluate") as stage:
        predictions = model.predict(test_x)
        r2 = r2_score(test_y, predictions)
        mae = mean_absolute_error(test_y, predictions)
        stage.metric("r2", r2, unit="ratio")
        stage.metric("mae", mae)

    model_path = workspace / "models" / "linear-regression.pkl"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    model_path.write_bytes(pickle.dumps(model))
    run.register_artifact("model", str(model_path), attributes={"format": "pickle"})

records_path = local_sink.file_path_for("RECORD").relative_to(Path.cwd())

print(f"Captured run: {run.ref}")
print(f"Measured r2: {r2:.3f}; mae: {mae:.3f}")
print(f"Records: {records_path.as_posix()}")
print(f"Model artifact: {model_path.as_posix()}")

"""Train a tiny CNN and capture epoch, evaluation, and checkpoint evidence."""

from pathlib import Path

import torch
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from contexta import Contexta
from contexta.capture import LocalJsonlSink


class TinyCNN(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 10),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.layers(features)


torch.manual_seed(7)
digits = load_digits()
train_x, test_x, train_y, test_y = train_test_split(
    digits.images, digits.target, test_size=0.2, stratify=digits.target, random_state=7
)
train_data = TensorDataset(
    torch.tensor(train_x[:, None] / 16.0, dtype=torch.float32),
    torch.tensor(train_y, dtype=torch.long),
)
loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_features = torch.tensor(test_x[:, None] / 16.0, dtype=torch.float32)
test_targets = torch.tensor(test_y, dtype=torch.long)

workspace = Path(".contexta")
ctx = Contexta(workspace=str(workspace), config={"project_name": "digits-cnn"})
local_sink = next(sink for sink in ctx.sinks if isinstance(sink, LocalJsonlSink))
model = TinyCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

with ctx.run("tiny-cnn", dataset_ref="dataset:sklearn.digits") as run:
    with run.stage("train") as stage:
        for epoch in range(1, 3):
            total_loss = 0.0
            for features, targets in loader:
                optimizer.zero_grad()
                loss = loss_fn(model(features), targets)
                loss.backward()
                optimizer.step()
                total_loss += loss.item() * len(targets)
            with stage.batch(f"epoch-{epoch}") as batch:
                batch.metric("loss", total_loss / len(train_data))

    with run.stage("evaluate") as stage:
        with torch.no_grad():
            logits = model(test_features)
            accuracy = (logits.argmax(dim=1) == test_targets).float().mean().item()
        stage.metric("accuracy", accuracy, unit="ratio")
        with stage.sample("first-validation-image") as sample:
            sample.metric(
                "prediction.correct",
                float(logits[0].argmax().item() == test_targets[0].item()),
                unit="ratio",
            )

    checkpoint = workspace / "models" / "tiny-cnn.pt"
    checkpoint.parent.mkdir(parents=True, exist_ok=True)
    torch.save(model.state_dict(), checkpoint)
    run.register_artifact("checkpoint", str(checkpoint), attributes={"epochs": 2})

with ctx.deployment("tiny-cnn-candidate", run_ref=run.ref) as deployment:
    deployment.event("checkpoint.selected", message="Selected trained checkpoint for review")

records_path = local_sink.file_path_for("RECORD").relative_to(Path.cwd())

print(f"Captured run: {run.ref}")
print(f"Measured validation accuracy: {accuracy:.3f}")
print(f"Records: {records_path.as_posix()}")
print(f"Checkpoint artifact: {checkpoint.as_posix()}")

"""Evaluate an OpenAI-shaped local mock API and capture response evidence."""

from pathlib import Path
from time import perf_counter
from types import SimpleNamespace

from contexta import Contexta
from contexta.capture import LocalJsonlSink


class MockCompletions:
    def create(self, *, model: str, messages: list[dict[str, str]]) -> SimpleNamespace:
        question = messages[-1]["content"]
        if "workspace" in question.lower():
            answer = "Contexta stores local evidence in a .contexta workspace."
        else:
            answer = "I cannot answer from the provided context."
        return SimpleNamespace(
            id=f"chatgpt-mock-{model}",
            choices=[SimpleNamespace(message=SimpleNamespace(content=answer))],
            usage=SimpleNamespace(completion_tokens=len(answer.split())),
        )


class MockOpenAI:
    def __init__(self) -> None:
        self.chat = type("Chat", (), {"completions": MockCompletions()})()


cases = [
    ("workspace-question", "Where is the workspace?", ".contexta"),
    ("unsupported-question", "Which GPU was used?", "cannot answer"),
]
workspace = Path(".contexta")
ctx = Contexta(workspace=str(workspace), config={"project_name": "mock-openai-eval"})
local_sink = next(sink for sink in ctx.sinks if isinstance(sink, LocalJsonlSink))
client = MockOpenAI()
passed = 0

with ctx.run("mock-chat-evaluation", dataset_ref="dataset:local.prompt-cases") as run:
    with run.stage("evaluate") as stage:
        for name, question, expected in cases:
            started = perf_counter()
            response = client.chat.completions.create(
                model="gpt-4.1-mini-mock",
                messages=[{"role": "user", "content": question}],
            )
            answer = response.choices[0].message.content
            correct = expected in answer
            passed += int(correct)
            with stage.sample(name) as sample:
                sample.metric("correct", float(correct), unit="ratio")
                sample.metric("latency.ms", (perf_counter() - started) * 1000, unit="ms")
                sample.metric("completion.tokens", response.usage.completion_tokens)
                sample.event("response.received", message=answer)
        pass_rate = passed / len(cases)
        stage.metric("pass.rate", pass_rate, unit="ratio")

with ctx.deployment("mock-chat-prompt", run_ref=run.ref) as deployment:
    deployment.event("prompt.selected", message="Selected observed prompt flow for staging")

records_path = local_sink.file_path_for("RECORD").relative_to(Path.cwd())

print(f"Captured run: {run.ref}")
print(f"Measured prompt-case pass rate: {pass_rate:.2f}")
print(f"Records: {records_path.as_posix()}")

Save the example you want to run as capture_evidence.py, then run it in an environment where Contexta is installed.

uv run capture_evidence.py

The example prints the captured run and its measurements, and writes the recorded evidence to the following location.

.contexta/
  cache/capture/record.jsonl

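The machine learning and deep learning examples also save the files they actually produce under .contexta/models/ and register them as artifacts.

The LLM example evaluates mock API responses and leaves per-sample records and an overall pass rate, so it does not invent an output file just to register it as an artifact.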

Checking the run results

The machine learning example prints the following output to the terminal.

Captured run: run:diabetes-regression.linear-regression
Measured r2: 0.453; mae: 42.794
Records: .contexta/cache/capture/record.jsonl
Model artifact: .contexta/models/linear-regression.pkl

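This output tells us the following.

Output	Meaning
`Captured run`	This training and evaluation were stored together as the `linear-regression` run of the `diabetes-regression` project.
`Measured r2`	The coefficient of determination obtained from this run is about `0.453`.
`mae`	Predictions are off from the true values by about `42.794` on average.
`Model artifact`	After training finished, the fitted model was saved to a file, and Contexta registered that file as an output of this run.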
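To make the two numbers in the table concrete, here is a minimal sketch of what R² and MAE measure, written in plain Python. The function names `r2_by_hand` and `mae_by_hand` and the toy values are illustrative assumptions. The examples on this page use scikit-learn's `r2_score` and `mean_absolute_error` instead.

```python
from statistics import fmean


def mae_by_hand(actual, predicted):
    """Mean absolute error: the average of |actual - predicted|."""
    return fmean(abs(a - p) for a, p in zip(actual, predicted))


def r2_by_hand(actual, predicted):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_actual = fmean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy values, not taken from the diabetes dataset.
actual = [100.0, 150.0, 200.0, 250.0]
predicted = [110.0, 140.0, 190.0, 260.0]

mae = mae_by_hand(actual, predicted)
r2 = r2_by_hand(actual, predicted)
print(f"mae: {mae:.3f}")  # every prediction is off by 10, so mae is 10.000
print(f"r2: {r2:.3f}")    # 1 - 400 / 12500 = 0.968
```

An R² near 1 means the predictions explain most of the variance in the targets. MAE is expressed in the same unit as the target itself.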

The terminal output is a summary for quickly checking the result right after a run.

To inspect the observation evidence that Contexta left behind, look at the files under .contexta/.

.contexta/
  cache/capture/
    record.jsonl
    artifact.jsonl
  models/
    linear-regression.pkl

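Each of these files plays a different role.

File	Contents	What to check in this example
`record.jsonl`	Events that occurred and metrics that were measured	The fact that the dataset was loaded, and `r2` and `mae` from the evaluation stage
`artifact.jsonl`	Registration information for files that were created or used	Which run produced the saved model file, plus its path and verification information
`models/linear-regression.pkl`	The actual output file	The trained `LinearRegression` model itself

Now that we know which files are created, let's look at each one in more detail.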

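What to check in `record.jsonl`

record.jsonl is a JSON Lines file, so each line holds one observation record. The machine learning example produces the following three records.

Record	When was it created?	What does it tell you?
`dataset.loaded` event	At the start of the run, after the diabetes dataset was loaded	This result was computed from an input with 442 rows and 10 features.
`r2` metric	When predictions were evaluated in the `evaluate` stage	The model's explanatory power was measured at about `0.453`.
`mae` metric	In the same `evaluate` stage	The mean absolute error was measured at about `42.794`.

For example, the important parts of the r2 record look like this.

Timestamps and record identifiers in the actual file can differ on every run, but the way the run and the stage are linked stays the same.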

{
  "payload_type": "MetricRecord",
  "payload": {
    "envelope": {
      "record_type": "metric",
      "run_ref": "run:diabetes-regression.linear-regression",
      "stage_execution_ref": "stage:diabetes-regression.linear-regression.evaluate",
      "completeness_marker": "complete",
      "degradation_marker": "none"
    },
    "payload": {
      "metric_key": "r2",
      "unit": "ratio",
      "value": 0.4526027629719198
    }
  },
  "sink_name": "local-jsonl"
}

This record stores more than just r2 = 0.453.

payload_type: "MetricRecord" : The record is a measurement taken as a (numeric) metric.
metric_key: "r2" / value: 0.4526027629719198 : The r2 metric was measured at about 0.453.
run_ref: "run:diabetes-regression.linear-regression" : The value was produced in the run:diabetes-regression.linear-regression run.
stage_execution_ref : The value was computed in the evaluate stage of the linear-regression run in the diabetes-regression project.
completeness_marker: "complete" : This is a completed observation record.
degradation_marker: "none" : Contexta did not flag this record as missing data or degraded.
sink_name: "local-jsonl" : The evidence was written to a local JSON Lines file.

The dataset.loaded event and the two metrics share the same run_ref.

As a result, you can later check which run, and which dataset, the r2 result came from.
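Because record.jsonl is plain JSON Lines, the metric records can be inspected with the Python standard library alone. The sketch below is illustrative and does not use any Contexta reader API. It assumes each metric line has the field layout shown in the r2 excerpt, and the helper names `load_records` and `metrics_for_run` are hypothetical. It writes two stand-in records to a temporary file so it runs without a real workspace.

```python
import json
import tempfile
from pathlib import Path

RUN_REF = "run:diabetes-regression.linear-regression"


def load_records(path):
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def metrics_for_run(records, run_ref):
    """Return {metric_key: (value, stage_execution_ref)} for one run."""
    found = {}
    for record in records:
        if record.get("payload_type") != "MetricRecord":
            continue
        envelope = record["payload"]["envelope"]
        if envelope.get("run_ref") != run_ref:
            continue
        body = record["payload"]["payload"]
        found[body["metric_key"]] = (body["value"], envelope.get("stage_execution_ref"))
    return found


# Stand-in record that mirrors the r2 excerpt above.
r2_record = {
    "payload_type": "MetricRecord",
    "payload": {
        "envelope": {
            "record_type": "metric",
            "run_ref": RUN_REF,
            "stage_execution_ref": "stage:diabetes-regression.linear-regression.evaluate",
        },
        "payload": {"metric_key": "r2", "unit": "ratio", "value": 0.4526027629719198},
    },
    "sink_name": "local-jsonl",
}
# Hypothetical metric from a different run, included only to show the filter.
other_record = {
    "payload_type": "MetricRecord",
    "payload": {
        "envelope": {"record_type": "metric", "run_ref": "run:other-project.other-run"},
        "payload": {"metric_key": "r2", "unit": "ratio", "value": 0.1},
    },
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "record.jsonl"
    lines = [json.dumps(record) for record in (r2_record, other_record)]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    metrics = metrics_for_run(load_records(path), RUN_REF)

for key, (value, stage_ref) in metrics.items():
    print(f"{key} = {value:.3f} ({stage_ref})")
```

To try it against a real workspace, pass .contexta/cache/capture/record.jsonl to `load_records`. Real records may carry more fields than the excerpt shows.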

What to check in `artifact.jsonl`

The machine learning example saves the trained model to .contexta/models/linear-regression.pkl and registers that file as a model artifact.

The result of that registration appears in artifact.jsonl.

{
  "family": "ARTIFACT",
  "payload": {
    "binding_status": "BOUND",
    "manifest": {
      "artifact_kind": "model",
      "run_ref": "run:diabetes-regression.linear-regression",
      "location_ref": "path:.../.contexta/models/linear-regression.pkl",
      "hash_value": "...",
      "size_bytes": 576,
      "attributes": {
        "format": "pickle"
      }
    }
  }
}

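The key point is that Contexta records more than the fact that a model file exists somewhere.

Field	Meaning
`artifact_kind: "model"`	This output is a trained model file.
`run_ref`	This model was produced in the same run as the evaluation metrics examined earlier.
`location_ref`	The location of the actual model file.
`binding_status: "BOUND"`	At registration time, Contexta linked this file to the artifact record.
`hash_value` / `size_bytes`	Can be used later to verify whether the file has changed or to confirm that two outputs are identical.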
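The kind of check that `hash_value` and `size_bytes` make possible can be sketched with the standard library. This document does not state which hash algorithm or encoding Contexta uses, so the SHA-256 hex digest below is an assumption. The helper names `fingerprint` and `matches_manifest` are hypothetical, and the manifest is built inside the sketch rather than read from artifact.jsonl.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(path, algorithm="sha256"):
    """Return (hex digest, size in bytes) for a file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    size = 0
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


def matches_manifest(path, manifest, algorithm="sha256"):
    """True when the file's current digest and size equal the recorded ones."""
    digest, size = fingerprint(path, algorithm)
    return digest == manifest["hash_value"] and size == manifest["size_bytes"]


with tempfile.TemporaryDirectory() as tmp:
    model_file = Path(tmp) / "model.pkl"
    model_file.write_bytes(b"stand-in model bytes")

    # Hypothetical manifest built here; a real one would come from artifact.jsonl,
    # and its hash algorithm may differ from the SHA-256 assumed in this sketch.
    digest, size = fingerprint(model_file)
    manifest = {"hash_value": digest, "size_bytes": size}

    unchanged = matches_manifest(model_file, manifest)
    model_file.write_bytes(b"stand-in model bytes, modified")
    changed = not matches_manifest(model_file, manifest)

print(f"unchanged before edit: {unchanged}")
print(f"change detected after edit: {changed}")
```

Comparing the size as well as the digest is a cheap first filter: a size mismatch already proves that the file differs from the one that was registered.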

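Taken together, this example goes beyond reporting that 'the model score is 0.453': it ties the following into a single observable run record.

Which input was the run executed on?
In which evaluation stage was the score computed?
Which model file did the run actually produce?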
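The linkage described above amounts to a join on `run_ref`. The sketch below is a minimal illustration in plain Python, assuming the field layouts shown in the two excerpts on this page. The helper name `summarize_by_run` is hypothetical, and the `dataset.loaded` event is left out because its record layout is not shown here.

```python
from collections import defaultdict

RUN_REF = "run:diabetes-regression.linear-regression"


def summarize_by_run(records, artifacts):
    """Group metric values and artifact locations under their shared run_ref."""
    runs = defaultdict(lambda: {"metrics": {}, "artifacts": []})
    for record in records:
        if record.get("payload_type") == "MetricRecord":
            envelope = record["payload"]["envelope"]
            body = record["payload"]["payload"]
            runs[envelope["run_ref"]]["metrics"][body["metric_key"]] = body["value"]
    for entry in artifacts:
        if entry.get("family") == "ARTIFACT":
            manifest = entry["payload"]["manifest"]
            runs[manifest["run_ref"]]["artifacts"].append(
                (manifest["artifact_kind"], manifest["location_ref"])
            )
    return dict(runs)


# Stand-in entries that mirror the excerpts above; the elided path is kept as shown.
records = [
    {
        "payload_type": "MetricRecord",
        "payload": {
            "envelope": {
                "run_ref": RUN_REF,
                "stage_execution_ref": "stage:diabetes-regression.linear-regression.evaluate",
            },
            "payload": {"metric_key": "r2", "unit": "ratio", "value": 0.4526027629719198},
        },
    }
]
artifacts = [
    {
        "family": "ARTIFACT",
        "payload": {
            "binding_status": "BOUND",
            "manifest": {
                "artifact_kind": "model",
                "run_ref": RUN_REF,
                "location_ref": "path:.../.contexta/models/linear-regression.pkl",
            },
        },
    }
]

summary = summarize_by_run(records, artifacts)
for run_ref, details in summary.items():
    print(run_ref)
    for key, value in details["metrics"].items():
        print(f"  metric {key} = {value:.3f}")
    for kind, location in details["artifacts"]:
        print(f"  artifact {kind} at {location}")
```

Both files carry the same `run_ref`, so a score and the model file that produced it can be matched again later.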
