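Comparing runs

When you have to decide which of several candidate runs to pick, compare the runs by their measured values.

This document walks through running two candidates, recording values measured under the same evaluation criteria in Contexta, and selecting the better candidate based on those values.

Runnable example

The examples below show how Contexta can be used to compare two candidates and select one.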

The example comes in three variants, shown in this order: Machine Learning, Deep Learning, and LLM.

"""Train two real SVM candidates and compare their captured evaluation results."""

import pickle
from pathlib import Path

from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

from contexta import Contexta
from contexta.capture import LocalJsonlSink


features, targets = load_iris(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    features, targets, test_size=0.3, stratify=targets, random_state=7
)
candidates = {
    "linear-svm": SVC(kernel="linear"),
    "rbf-svm": SVC(kernel="rbf", gamma="scale"),
}
workspace = Path(".contexta")
ctx = Contexta(workspace=str(workspace), config={"project_name": "iris-svm"})
local_sink = next(sink for sink in ctx.sinks if isinstance(sink, LocalJsonlSink))
scores = {}
run_refs = {}

for name, estimator in candidates.items():
    with ctx.run(name, dataset_ref="dataset:sklearn.iris") as run:
        with run.stage("train"):
            model = make_pipeline(StandardScaler(), estimator)
            model.fit(train_x, train_y)

        with run.stage("evaluate") as stage:
            predictions = model.predict(test_x)
            accuracy = accuracy_score(test_y, predictions)
            macro_f1 = f1_score(test_y, predictions, average="macro")
            with stage.batch("holdout-split") as batch:
                batch.metric("accuracy", accuracy, unit="ratio")
                batch.metric("macro.f1", macro_f1, unit="ratio")
                with batch.sample("first-prediction") as sample:
                    sample.metric("correct", float(predictions[0] == test_y[0]), unit="ratio")

        model_path = workspace / "models" / f"{name}.pkl"
        model_path.parent.mkdir(parents=True, exist_ok=True)
        model_path.write_bytes(pickle.dumps(model))
        run.register_artifact("model", str(model_path), attributes={"candidate": name})
    scores[name] = accuracy
    run_refs[name] = run.ref

best_name = max(scores, key=scores.get)
delta = scores["rbf-svm"] - scores["linear-svm"]
records_path = local_sink.file_path_for("RECORD").relative_to(Path.cwd())
artifacts_path = local_sink.file_path_for("ARTIFACT").relative_to(Path.cwd())

print(f"Compared runs: {run_refs['linear-svm']} vs {run_refs['rbf-svm']}")
print(f"Accuracy: {scores['linear-svm']:.3f} -> {scores['rbf-svm']:.3f}")
print(f"Delta: {delta:+.3f}")
print(f"Selected run: {run_refs[best_name]}")
print(f"Records: {records_path.as_posix()}")
print(f"Artifacts: {artifacts_path.as_posix()}")

"""Train two tiny CNN configurations and compare their measured accuracy."""

from pathlib import Path

import torch
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

from contexta import Contexta
from contexta.capture import LocalJsonlSink


class TinyCNN(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, 10),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.layers(features)


torch.manual_seed(7)
digits = load_digits()
train_x, test_x, train_y, test_y = train_test_split(
    digits.images, digits.target, test_size=0.2, stratify=digits.target, random_state=7
)
train_data = TensorDataset(
    torch.tensor(train_x[:, None] / 16.0, dtype=torch.float32),
    torch.tensor(train_y, dtype=torch.long),
)
test_features = torch.tensor(test_x[:, None] / 16.0, dtype=torch.float32)
test_targets = torch.tensor(test_y, dtype=torch.long)
loss_fn = nn.CrossEntropyLoss()
workspace = Path(".contexta")
ctx = Contexta(workspace=str(workspace), config={"project_name": "digits-cnn-compare"})
local_sink = next(sink for sink in ctx.sinks if isinstance(sink, LocalJsonlSink))
scores = {}
run_refs = {}

for name, learning_rate in {"cnn-fast": 0.01, "cnn-steady": 0.003}.items():
    model = TinyCNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loader = DataLoader(train_data, batch_size=64, shuffle=True)
    with ctx.run(name, dataset_ref="dataset:sklearn.digits") as run:
        with run.stage("train") as stage:
            for epoch in range(1, 3):
                total_loss = 0.0
                for features, targets in loader:
                    optimizer.zero_grad()
                    loss = loss_fn(model(features), targets)
                    loss.backward()
                    optimizer.step()
                    total_loss += loss.item() * len(targets)
                with stage.batch(f"epoch-{epoch}") as batch:
                    batch.metric("loss", total_loss / len(train_data))

        with run.stage("evaluate") as stage:
            with torch.no_grad():
                predictions = model(test_features).argmax(dim=1)
            accuracy = (predictions == test_targets).float().mean().item()
            stage.metric("accuracy", accuracy, unit="ratio")
            scores[name] = accuracy

        checkpoint = workspace / "models" / f"{name}.pt"
        checkpoint.parent.mkdir(parents=True, exist_ok=True)
        torch.save(model.state_dict(), checkpoint)
        run.register_artifact("checkpoint", str(checkpoint), attributes={"candidate": name})
    run_refs[name] = run.ref

best_name = max(scores, key=scores.get)
delta = scores["cnn-steady"] - scores["cnn-fast"]
records_path = local_sink.file_path_for("RECORD").relative_to(Path.cwd())
artifacts_path = local_sink.file_path_for("ARTIFACT").relative_to(Path.cwd())

print(f"Compared runs: {run_refs['cnn-fast']} vs {run_refs['cnn-steady']}")
print(f"Validation accuracy: {scores['cnn-fast']:.3f} -> {scores['cnn-steady']:.3f}")
print(f"Delta: {delta:+.3f}")
print(f"Selected run: {run_refs[best_name]}")
print(f"Records: {records_path.as_posix()}")
print(f"Artifacts: {artifacts_path.as_posix()}")

"""Compare two prompt strategies through an OpenAI-shaped local mock API."""

from pathlib import Path
from time import perf_counter
from types import SimpleNamespace

from contexta import Contexta
from contexta.capture import LocalJsonlSink


class MockCompletions:
    def create(self, *, model: str, messages: list[dict[str, str]]) -> SimpleNamespace:
        instruction = messages[0]["content"]
        question = messages[-1]["content"]
        if "workspace" in question.lower():
            answer = "Contexta stores local evidence in a .contexta workspace."
        elif "refuse unsupported" in instruction.lower():
            answer = "I cannot answer from the provided context."
        else:
            answer = "A GPU was probably used."
        return SimpleNamespace(
            choices=[SimpleNamespace(message=SimpleNamespace(content=answer))],
            usage=SimpleNamespace(completion_tokens=len(answer.split())),
        )


class MockOpenAI:
    def __init__(self) -> None:
        self.chat = type("Chat", (), {"completions": MockCompletions()})()


cases = [
    ("workspace-question", "Where is the workspace?", ".contexta"),
    ("unsupported-question", "Which GPU was used?", "cannot answer"),
]
prompts = {
    "helpful-only": "Answer the user's question.",
    "grounded": "Answer from known context and refuse unsupported questions.",
}
workspace = Path(".contexta")
ctx = Contexta(workspace=str(workspace), config={"project_name": "mock-openai-compare"})
local_sink = next(sink for sink in ctx.sinks if isinstance(sink, LocalJsonlSink))
client = MockOpenAI()
scores = {}
run_refs = {}

for name, instruction in prompts.items():
    passed = 0
    with ctx.run(name, dataset_ref="dataset:local.prompt-cases") as run:
        with run.stage("evaluate") as stage:
            for case_name, question, expected in cases:
                started = perf_counter()
                response = client.chat.completions.create(
                    model="gpt-4.1-mini-mock",
                    messages=[
                        {"role": "system", "content": instruction},
                        {"role": "user", "content": question},
                    ],
                )
                answer = response.choices[0].message.content
                correct = expected in answer
                passed += int(correct)
                with stage.sample(case_name) as sample:
                    sample.metric("correct", float(correct), unit="ratio")
                    sample.metric("latency.ms", (perf_counter() - started) * 1000, unit="ms")
                    sample.metric("completion.tokens", response.usage.completion_tokens)
            pass_rate = passed / len(cases)
            stage.metric("pass.rate", pass_rate, unit="ratio")
            scores[name] = pass_rate
    run_refs[name] = run.ref

best_name = max(scores, key=scores.get)
delta = scores["grounded"] - scores["helpful-only"]
records_path = local_sink.file_path_for("RECORD").relative_to(Path.cwd())

print(f"Compared runs: {run_refs['helpful-only']} vs {run_refs['grounded']}")
print(f"Pass rate: {scores['helpful-only']:.2f} -> {scores['grounded']:.2f}")
print(f"Delta: {delta:+.2f}")
print(f"Selected run: {run_refs[best_name]}")
print(f"Records: {records_path.as_posix()}")

Save one of the examples as compare_runs.py, then run it in an environment where Contexta is installed.

uv run compare_runs.py

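All three examples run each candidate, compute its evaluation values, and record them in .contexta/cache/capture/record.jsonl together with the candidate run that produced them.

Checking the output

The machine learning example prints output like the following to the terminal.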

Compared runs: run:iris-svm.linear-svm vs run:iris-svm.rbf-svm
Accuracy: 0.978 -> 0.956
Delta: -0.022
Selected run: run:iris-svm.linear-svm
Records: .contexta/cache/capture/record.jsonl
Artifacts: .contexta/cache/capture/artifact.jsonl

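This example trains and evaluates the two models on the same Iris data split.

The difference shown in the output is therefore the difference between the two models' results on the same evaluation inputs in this run.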

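| Output | Meaning |
| --- | --- |
| `Compared runs` | The two runs being compared are `linear-svm` and `rbf-svm`. |
| `Accuracy: 0.978 -> 0.956` | The left-hand candidate got about `97.8%` of the test samples right, and the right-hand candidate about `95.6%`. |
| `Delta: -0.022` | The accuracy of the right-hand candidate, `rbf-svm`, is about `0.022` lower than that of the left-hand candidate, `linear-svm`. As a ratio this is roughly 2.2 percentage points. |
| `Selected run` | The `linear-svm` run was selected because it scores higher on `accuracy`, the selection criterion of this example. |
| `Records` | The JSON Lines file holding the measured values that justify the selection. |
| `Artifacts` | The file holding the registration entries for the fitted model files produced by the two compared runs. |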
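The arithmetic behind the `Delta` line can be checked by hand. The sketch below assumes that the 30% holdout of the 150-sample Iris dataset yields 45 test samples and that the two candidates got 44 and 43 of them right, which is what the recorded values 0.9777777777777777 and 0.9555555555555556 correspond to. The variable names are illustrative and not part of Contexta.

```python
# Assumed counts: 45 held-out samples, 44 and 43 correct predictions.
test_size = 45
linear_correct, rbf_correct = 44, 43

linear_acc = linear_correct / test_size
rbf_acc = rbf_correct / test_size

# Delta is right-hand candidate minus left-hand candidate, as a ratio.
delta = rbf_acc - linear_acc
# The same difference expressed in percentage points.
delta_points = delta * 100

print(f"Accuracy: {linear_acc:.3f} -> {rbf_acc:.3f}")
print(f"Delta: {delta:+.3f}")
print(f"Delta in percentage points: {delta_points:+.1f}")
```

Under these assumptions the gap between the two candidates is a single test sample, which is worth keeping in mind before treating the difference as decisive.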

After the run, the workspace contains the following.

.contexta/
  cache/capture/
    record.jsonl
    artifact.jsonl
  models/
    linear-svm.pkl
    rbf-svm.pkl

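Checking the metrics of the two runs

record.jsonl stores the following values computed in each candidate's evaluate stage.

accuracy
macro.f1
a sample-level correct metric for the first prediction

Keeping only the accuracy records used in the comparison, the structure is as follows.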

{
  "payload_type": "MetricRecord",
  "payload": {
    "envelope": {
      "run_ref": "run:iris-svm.linear-svm",
      "stage_execution_ref": "stage:iris-svm.linear-svm.evaluate",
      "batch_execution_ref": "batch:iris-svm.linear-svm.evaluate.holdout-split"
    },
    "payload": {
      "metric_key": "accuracy",
      "unit": "ratio",
      "value": 0.9777777777777777
    }
  }
}

{
  "payload_type": "MetricRecord",
  "payload": {
    "envelope": {
      "run_ref": "run:iris-svm.rbf-svm",
      "stage_execution_ref": "stage:iris-svm.rbf-svm.evaluate",
      "batch_execution_ref": "batch:iris-svm.rbf-svm.evaluate.holdout-split"
    },
    "payload": {
      "metric_key": "accuracy",
      "unit": "ratio",
      "value": 0.9555555555555556
    }
  }
}

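The two records can be compared as follows.

| Field to check | Why it matters |
| --- | --- |
| `metric_key: "accuracy"` | The metric_key is the same, so the comparison uses the same criterion. |
| `run_ref` | Identifies which candidate's run produced each value. |
| `stage_execution_ref` | Both values were computed in the `evaluate` stage. |
| `batch_execution_ref` | Both values were measured in the `holdout-split` evaluation batch, which plays the same role in each run. |
| `value` | The measured value actually used in the selection. |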
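The checks listed above can be sketched as a small function over two parsed records. This is only an illustration: `role`, `comparable`, and `metric_record` are hypothetical helpers, not Contexta APIs, and the code assumes that the last dot-separated segment of a ref is the stage or batch name, as in the excerpts above.

```python
def role(ref: str) -> str:
    """Return the last dot-separated segment of a ref, e.g. 'evaluate'."""
    return ref.rsplit(".", 1)[-1]


def comparable(first: dict, second: dict) -> bool:
    """True when two metric records measure the same thing for different runs."""
    env_a, env_b = first["payload"]["envelope"], second["payload"]["envelope"]
    met_a, met_b = first["payload"]["payload"], second["payload"]["payload"]
    return (
        env_a["run_ref"] != env_b["run_ref"]
        and met_a["metric_key"] == met_b["metric_key"]
        and met_a["unit"] == met_b["unit"]
        and role(env_a["stage_execution_ref"]) == role(env_b["stage_execution_ref"])
        and role(env_a["batch_execution_ref"]) == role(env_b["batch_execution_ref"])
    )


def metric_record(candidate: str, key: str, value: float) -> dict:
    """Build a record shaped like the excerpts above (hypothetical helper)."""
    prefix = f"iris-svm.{candidate}"
    return {
        "payload_type": "MetricRecord",
        "payload": {
            "envelope": {
                "run_ref": f"run:{prefix}",
                "stage_execution_ref": f"stage:{prefix}.evaluate",
                "batch_execution_ref": f"batch:{prefix}.evaluate.holdout-split",
            },
            "payload": {"metric_key": key, "unit": "ratio", "value": value},
        },
    }


linear = metric_record("linear-svm", "accuracy", 0.9777777777777777)
rbf = metric_record("rbf-svm", "accuracy", 0.9555555555555556)
# Placeholder value: only the differing metric_key matters for this check.
mismatched = metric_record("rbf-svm", "macro.f1", 0.0)

print(comparable(linear, rbf))
print(comparable(linear, mismatched))
```

Stage or batch names that themselves contain a dot would break the `role` helper, so treat it as a convenience for names like the ones in this example.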

The Selected run output alone tells you which candidate was chosen.

Checking record.jsonl as well lets you verify that the decision was based on values measured in the same evaluation stage and the same evaluation batch.
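One way to do that review is to read the JSON Lines file and collect the compared metric per run. The sketch below assumes each line of the file is one JSON object with the fields shown in the excerpts above. `metric_by_run` and `line_for` are hypothetical helpers, and the demo writes a small stand-in file to a temporary directory rather than reading a real workspace.

```python
import json
import tempfile
from pathlib import Path


def metric_by_run(path: Path, metric_key: str = "accuracy") -> dict[str, float]:
    """Map run_ref to the value recorded for metric_key in a JSON Lines file."""
    values: dict[str, float] = {}
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if record.get("payload_type") != "MetricRecord":
                continue  # ignore other record types
            body = record["payload"]
            if body["payload"]["metric_key"] == metric_key:
                values[body["envelope"]["run_ref"]] = body["payload"]["value"]
    return values


def line_for(run_ref: str, key: str, value: float) -> str:
    """Serialize a trimmed record to one line (hypothetical helper)."""
    return json.dumps({
        "payload_type": "MetricRecord",
        "payload": {
            "envelope": {"run_ref": run_ref},
            "payload": {"metric_key": key, "unit": "ratio", "value": value},
        },
    })


with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "record.jsonl"
    lines = [
        line_for("run:iris-svm.linear-svm", "accuracy", 0.9777777777777777),
        # Placeholder macro.f1 value, present only to show the filter.
        line_for("run:iris-svm.linear-svm", "macro.f1", 0.0),
        line_for("run:iris-svm.rbf-svm", "accuracy", 0.9555555555555556),
    ]
    sample_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    accuracies = metric_by_run(sample_path)

best_run = max(accuracies, key=accuracies.get)
print(accuracies)
print(best_run)
```

To inspect a real run, the same function could be pointed at .contexta/cache/capture/record.jsonl, provided the file follows the structure assumed here.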

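Checking the two model artifacts

After evaluation finishes, each candidate saves its fitted model to a file and registers it as an artifact.

artifact.jsonl holds one registration entry like the following per candidate.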

{
  "family": "ARTIFACT",
  "payload": {
    "binding_status": "BOUND",
    "manifest": {
      "artifact_kind": "model",
      "run_ref": "run:iris-svm.linear-svm",
      "location_ref": "path:.../.contexta/models/linear-svm.pkl",
      "hash_value": "...",
      "attributes": {
        "candidate": "linear-svm"
      }
    }
  }
}

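run_ref appears in both the metric records and the artifact records.

So you can trace not only the selected score but also the actual model file that produced it, using the same run as the key.

For example, if linear-svm is selected, .contexta/models/linear-svm.pkl is the output tied to that selection.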
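The link between a selected score and its model file can be sketched as a lookup keyed on run_ref. The artifact entries below copy the fields shown in the excerpt above, including its elided path. The `rbf-svm` entry is assumed by analogy, and `artifact_location` is a hypothetical helper rather than a Contexta API.

```python
from typing import Optional


def artifact_location(records: list[dict], run_ref: str, kind: str = "model") -> Optional[str]:
    """Return the location_ref of the first artifact of `kind` bound to run_ref."""
    for record in records:
        if record.get("family") != "ARTIFACT":
            continue
        manifest = record["payload"]["manifest"]
        if manifest["run_ref"] == run_ref and manifest["artifact_kind"] == kind:
            return manifest["location_ref"]
    return None


def artifact_entry(candidate: str) -> dict:
    """Build an entry shaped like the excerpt above (path left elided as shown)."""
    return {
        "family": "ARTIFACT",
        "payload": {
            "binding_status": "BOUND",
            "manifest": {
                "artifact_kind": "model",
                "run_ref": f"run:iris-svm.{candidate}",
                "location_ref": f"path:.../.contexta/models/{candidate}.pkl",
                "attributes": {"candidate": candidate},
            },
        },
    }


# Accuracy per run, taken from the metric records shown earlier.
scores = {
    "run:iris-svm.linear-svm": 0.9777777777777777,
    "run:iris-svm.rbf-svm": 0.9555555555555556,
}
artifacts = [artifact_entry("linear-svm"), artifact_entry("rbf-svm")]

selected_run = max(scores, key=scores.get)
selected_model = artifact_location(artifacts, selected_run)
print(selected_run)
print(selected_model)
```

Because both record families carry run_ref, no separate mapping between scores and files has to be maintained.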

Comparison criteria

| Domain | Useful measurement criteria |
| --- | --- |
| Machine Learning | validation accuracy, F1, MAE, calibration |
| Deep Learning | validation loss, validation accuracy, checkpoint size, latency |
| LLM | evaluation pass rate, faithfulness, latency, token usage |
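Some of these criteria are better when higher (accuracy, pass rate) and others when lower (loss, MAE, latency), so the `max` call used in the examples only fits the first group. The sketch below shows one possible way to make the direction explicit. The `HIGHER_IS_BETTER` table and `select_best` function are hypothetical, and the latency numbers are made up for illustration.

```python
# Assumed direction per metric key; extend it for the criteria you record.
HIGHER_IS_BETTER = {
    "accuracy": True,
    "pass.rate": True,
    "loss": False,
    "latency.ms": False,
}


def select_best(scores: dict[str, float], metric_key: str) -> str:
    """Pick the candidate with the best value for metric_key."""
    pick = max if HIGHER_IS_BETTER[metric_key] else min
    return pick(scores, key=scores.get)


# Accuracy values from the machine learning example output.
accuracy = {"linear-svm": 0.9777777777777777, "rbf-svm": 0.9555555555555556}
# Made-up latency values, used only to show the lower-is-better case.
latency_ms = {"linear-svm": 1.8, "rbf-svm": 1.2}

print(select_best(accuracy, "accuracy"))
print(select_best(latency_ms, "latency.ms"))
```

Keeping the direction next to the metric key makes it harder to pick the worst candidate by accident when the criterion changes.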
