The Contexta View of ML Observability
Contexta treats ML Observability as a discipline of structured evidence.
This means that Observability is not limited to collecting traces, metrics, and logs from a running system. Those signals still matter, and ML systems need them as much as any other production system does. But they are not enough to explain how an ML result came into existence, whether it can be trusted, or how it should be compared with another result later.
An ML run leaves behind more than runtime behavior. It produces intermediate states, evaluation outcomes, artifacts, environment conditions, configuration choices, sample-level failures, reports, and deployment decisions. Some of these objects are emitted during execution. Others are created around execution. Still others become meaningful only when someone later tries to review, reproduce, compare, or deploy the result.
The central claim behind Contexta is that all of these pieces should be treated as part of the Observability story. ML Observability is the practice of preserving enough structured evidence about ML execution and outcomes that a run remains understandable after the fact.
This document explains that point of view. It is not an API reference and it is not meant to be a neutral survey of every possible Observability tool. It describes how Contexta thinks about ML Observability, why that view is broader than traditional application Observability, and why the project organizes its concepts around runs, stages, records, artifacts, lineage, environment, and reports.
Who This Document Is For
This document is intended for readers who want to understand the worldview behind Contexta before looking at individual APIs or implementation details.
It should be useful for engineers who are familiar with traditional Observability and want to understand what changes in ML systems. It should also be useful for ML practitioners who already track experiments but want a clearer model for connecting runs, artifacts, evaluation evidence, and operational outcomes. Operators and platform teams may find it useful because it explains why offline evidence and production behavior need to belong to the same story.
The document is intentionally conceptual. Its purpose is to make the project's mental model legible, not to enumerate every feature surface. A reader should come away with a practical sense of what Contexta means when it says that ML Observability is evidence-oriented, lifecycle-aware, reproducibility-aware, and review-oriented.
Thesis
The shortest version of the Contexta thesis is simple: traditional Observability asks whether a running system can be understood from its signals, while ML Observability asks whether an ML lifecycle can be understood from its evidence.
Signals are part of that evidence, but they are not the whole model. A trace may show where time was spent. A metric may show how a score changed. A log line may show that a fallback path was used. But an ML investigation usually needs more context than that. It needs to know which run produced the result, which stage generated the metric, which artifact was created, which environment shaped the output, which samples failed, and whether the evidence is complete enough to support a decision.
This is the conceptual move Contexta makes. It treats Observability data not only as operational telemetry, but as durable evidence for future investigation, comparison, reproducibility, review, and operational decision-making.
A telemetry system can tell us that something happened. An evidence system should help us understand what happened, where it happened, what it produced, how it connects to other objects, and how much confidence we should place in the record.
Why Traditional Observability Is Not Enough for ML
Traditional Observability is excellent at answering operational questions. It helps teams understand where a request spent time, which dependency failed, whether error volume increased, or whether latency and throughput are changing. ML systems still need those answers. Training jobs, evaluation pipelines, feature builders, batch scoring jobs, and inference services all benefit from traces, metrics, and logs.
The problem is not that the traditional three pillars are wrong. The problem is that they are too small to explain the lifecycle of an ML result.
Consider a model evaluation that produces a better aggregate score than the previous candidate. A conventional Observability system might tell us that the evaluation job completed, how long it took, and whether any errors appeared in the logs. That information is useful, but it does not answer the most important ML questions. It does not tell us which dataset snapshot was used, which exact run produced the candidate, which samples improved or regressed, whether the environment changed, which report summarized the evidence, or whether the resulting artifact was actually promoted.
The same pattern appears in training. A trace can show that the training process spent time in data loading, forward passes, checkpoint writing, or evaluation. Metrics can show training loss and validation loss over time. Logs can record warnings. But if the system cannot connect those signals to a named run, stage, environment snapshot, produced checkpoint, dataset version, and later report, then the evidence remains fragmented. It may help with immediate debugging, but it is weaker for later explanation.
ML workflows also create questions that are not naturally request-centered. A team may ask which run produced the model being discussed, which stage introduced a regression, which samples failed despite a healthy aggregate score, which artifact was promoted, which package set generated the output, or what evidence is missing from the record. These are not secondary questions. They are the questions people ask when they build, evaluate, review, and operate ML systems.
This is why Contexta does not stop at telemetry. It treats ML Observability as a structured evidence system. The goal is not only to know what happened during execution, but to preserve enough context that the execution can be understood later.
From Signals to Evidence
The word "evidence" is important because it shifts the focus from collection to explanation.
A signal is something the system emits. Evidence is something a future reader can use to understand, justify, or challenge a conclusion. The same data point can play both roles. A validation accuracy metric is a signal when it is emitted during evaluation. It becomes evidence when someone later asks whether a model should be promoted, why one run outperformed another, or whether a report can be trusted.
A metric value by itself is often too thin to support those questions. For example, validation_accuracy = 0.91 is meaningful only if the surrounding context is preserved. We need to know which run it belongs to, which stage produced it, which dataset or slice it describes, what environment generated it, which artifact bundle documents it, and whether any part of the capture was degraded. Without that structure, the metric is still a number, but it is weaker as evidence.
This distinction explains why Contexta prefers a schema-first, evidence-oriented model. Logs, metrics, and spans remain important, but they are stored as parts of a larger evidence graph. Their value increases when they are attached to projects, runs, stages, samples, artifacts, environments, and downstream decisions.
Evidence also has a temporal quality. A signal is often consumed near the moment it appears. Evidence must remain useful after the moment has passed. A team may need to revisit a run days or months later because a deployment failed, an audit question appeared, a model regressed, or a report needs to be reproduced. At that point, free-form memory and scattered dashboards are poor substitutes for structured context.
In this view, Observability is not only about watching a system while it runs. It is also about preserving the meaning of what happened so that the result can be inspected, compared, and explained after the fact.
The Core Evidence Model
The core question for Contexta is not only "What happened at runtime?" but "What would someone need to understand this ML result later?"
That later reader might be an engineer debugging a regression, an ML practitioner comparing candidate runs, an operator investigating a deployment, a reviewer checking whether evidence is sufficient, or an automated agent trying to evaluate a change. In all of those cases, the reader needs more than a stream of emissions. They need a structured account of execution.
A useful ML Observability model therefore needs to preserve what was executed, how the execution was structured, what evidence was produced, what artifacts were created or consumed, what environment shaped the result, how the resulting objects are related, what moved forward into reports or deployment, and how complete the evidence is.
This leads to a different mental model from conventional telemetry-centric Observability. Instead of starting from logs and metrics and hoping later consumers can infer the rest, Contexta starts from canonical objects such as projects, runs, stages, records, artifacts, lineage, and reports. Telemetry remains part of the model, but it is not asked to carry all meaning by itself.
This difference is subtle but important. In a telemetry-first system, a later investigation often reconstructs meaning from scattered facts. In an evidence-first system, the meaning model is preserved as part of the record. A metric already knows its scope. An artifact already knows its producing run. A report already knows what evidence it summarizes. A degraded marker already tells the reader that part of the record should be treated with caution.
The goal is not to collect more data for its own sake. The goal is to make future questions answerable without rebuilding the semantic structure every time.
The Eight Layers of ML Observability
The Contexta view of ML Observability can be understood through eight layers. These layers are not independent silos. They describe different kinds of evidence that become more useful when they are connected.
| Layer | What it captures | Questions it answers |
|---|---|---|
| Execution context | projects, runs, stages, operations | Where does this evidence belong? |
| Record families | events, metrics, spans, degraded markers | What happened during execution? |
| Granularity units | batches, samples | Which specific units failed or changed? |
| Artifact evidence | checkpoints, reports, bundles, snapshots | What did the run produce or consume? |
| Reproducibility context | platform, packages, environment, configuration | Under what conditions did this happen? |
| Relationship tracing | lineage, provenance | How are runs, artifacts, reports, and deployments connected? |
| Operational outcome | promotion, deployment, downstream state | What moved forward as a result of the run? |
| Investigation and interpretation | query, compare, diagnostics, trends, reports | How can users understand and communicate the evidence? |
The early layers define what happened and where it happened. The middle layers attach evidence, context, and produced objects. The later layers explain how those pieces connect to decisions, production outcomes, and human interpretation.
If any layer is missing, the overall story becomes weaker in a specific way. Without execution context, evidence loses semantic placement. Without granularity, aggregate metrics can hide localized failures. Without artifacts, outputs are difficult to anchor. Without reproducibility context, comparisons become fragile. Without lineage, related objects do not compose into an explanation. Without investigation surfaces, the evidence exists but remains hard to use.
These layers are not a checklist for perfection. Real systems often capture some layers more deeply than others. The point is that each layer describes a kind of question ML teams eventually ask. The more layers the system can connect, the less work users must do later to reconstruct what happened.
1. Execution Context
Execution context defines the shape of what is being observed.
In a conventional service, the most natural unit of Observability is often the request, process, or service boundary. In ML systems, the most natural unit is often the run. A run may represent a training execution, an evaluation pass, a prompt assessment session, an export workflow, or a batch inference job. It is the object people return to when they ask what happened, whether the result was good, and how it compares to another attempt.
Contexta organizes execution context around projects, runs, stages, and operations.
A project gives related work a stable home. It may correspond to a model family, a product workflow, an experiment domain, or a logical application boundary. This matters because ML evidence accumulates over time. A single run is meaningful on its own, but it becomes more useful when it belongs to a project where trends, comparisons, reports, and deployment history can be interpreted together.
A run is the primary unit of investigation. It is the thing users usually want to inspect later. When someone asks how a model candidate performed, what artifacts it produced, whether the evidence was complete, or whether it was healthy enough to promote, they are usually asking a run-centered question. Centering the run makes the rest of the Observability model easier to reason about because metrics, events, artifacts, environment snapshots, and reports can all be attached to a shared object.
Stages give the run semantic internal structure. A metric from train does not have the same meaning as a metric from evaluate, even if both are numeric. A warning during prepare does not mean the same thing as a warning during export. By preserving stages such as prepare, train, evaluate, retrieve, generate, package, and export, the system allows later readers to understand where evidence belongs in the lifecycle.
Operations provide finer-grained structure inside a stage. They are useful when stage-level visibility is too coarse. Tokenization, feature normalization, retrieval, reranking, checkpoint serialization, and metric aggregation may all be operations worth observing because they can explain local failures, performance regressions, or unexpected artifacts.
This layer is easy to underestimate because it can look like naming. In practice, it is much more than that. Execution context decides whether later evidence has a meaningful address. If a metric is attached only to a timestamp, a reader may have to infer where it belongs. If it is attached to a project, run, stage, and operation, the interpretation is much more direct.
When execution context is weak, teams often compensate with naming conventions, folder structures, or informal memory. A file name may include a run identifier. A log message may mention a stage. A dashboard may group values by convention. These approaches can work for small teams, but they become fragile when evidence needs to be compared, audited, exported, or revisited after the original context has faded.
The point of this layer is not merely organization. Execution context is the backbone of interpretation. Without it, metrics become disconnected numbers, logs become isolated remarks, and artifacts depend on naming conventions or human memory to explain where they came from. With it, evidence can be attached to meaningful scopes.
Example: execution context evidence
A minimal execution context record might look like this:
{
"project": "rag-support-bot",
"run_id": "run_2026_05_03_eval_014",
"run_type": "evaluation",
"started_at": "2026-05-03T09:12:44Z",
"stages": [
{
"name": "prepare",
"status": "completed",
"started_at": "2026-05-03T09:12:44Z",
"ended_at": "2026-05-03T09:13:10Z"
},
{
"name": "retrieve",
"status": "completed",
"started_at": "2026-05-03T09:13:10Z",
"ended_at": "2026-05-03T09:18:21Z"
},
{
"name": "evaluate",
"status": "completed",
"started_at": "2026-05-03T09:18:21Z",
"ended_at": "2026-05-03T09:24:03Z"
}
]
}
This does not yet describe quality, artifacts, or lineage. Its purpose is simpler and more foundational: it gives later evidence an address. If a metric says that faithfulness dropped, the reader can ask whether that metric came from evaluate, whether retrieval finished first, and whether the run followed the expected lifecycle.
In practice, this kind of evidence helps users compare stage duration across runs, detect missing stages, attach warnings to the right part of the workflow, and avoid relying on file names or dashboard filters to reconstruct the shape of execution.
2. Record Families
Record families are the append-style facts collected during execution. This is the layer that most closely resembles traditional Observability, but Contexta broadens the framing by treating these records as evidence attached to semantic execution context.
Events describe discrete facts in the narrative of a run. A dataset was loaded. Validation started. A checkpoint was saved. A fallback path was used. A schema validation failed. In a less structured system, these facts might appear only as log lines. In Contexta, they are more useful because they can be attached to a run, stage, operation, batch, or sample and then queried later.
Events are especially valuable when they mark lifecycle transitions or unusual paths. A run that completed successfully but used a fallback loader may not be equivalent to a run that followed the expected path. A validation stage that started but never emitted a completion event may deserve diagnostic attention. A checkpoint saved after a warning may need to be interpreted differently from one saved during a clean execution. These are small narrative facts, but they often matter during review.
Metrics describe measured values. They include familiar ML values such as training loss, validation loss, accuracy, F1, relevance, faithfulness, latency, throughput, and artifact size. The important point is that ML metrics need scope. A run-level average, a stage-level aggregate, a per-step training curve, a slice-specific score, and a sample-level evaluation value should not be collapsed into the same conceptual bucket. They answer different questions.
This scoping problem is one reason ML metrics are more complex than many service metrics. In a service dashboard, an average latency or error rate may be enough to alert a team that something is wrong. In ML, an average quality metric can hide the very pattern that matters. A model may improve globally while regressing on a critical slice. A retrieval system may maintain average relevance while failing on prompts with a specific structure. A training curve may look stable at a summary level while a later epoch introduces instability.
Spans describe timed execution segments. They help explain where time went, which operation became slower, whether a stage regressed, and where a timeout or failure occurred. Spans are especially useful when ML pipelines include expensive sub-steps such as retrieval, feature extraction, inference calls, evaluation loops, or export operations.
In Contexta, spans are important but not dominant. They are one record family among several. This matters because a trace can explain duration and sequencing, but it usually does not explain artifact meaning, sample-level quality, environment conditions, or downstream deployment decisions by itself.
Degraded markers make incompleteness explicit. This is one of the clearest differences between a basic telemetry model and the Contexta evidence model. Observability data is not always complete. Capture can be partial. Imports can lose detail. Replays can have gaps. Verification can produce warnings. If the system cannot represent those degraded states, later readers may mistake silence for health or completeness.
A degraded marker is not merely an error. It is evidence about evidence quality. It tells the reader that a record, stage, artifact, import, or replay should be interpreted with caution. For ML systems, that distinction matters because decisions are often made later from the stored record. A run with missing sample capture may still be useful, but it should not be treated as equivalent to a run with complete evidence.
This layer matters because it gives the run lived detail. Execution context tells us that a train stage existed. Record families tell us what happened inside it, how values changed, which milestones occurred, which warnings appeared, and where execution slowed down or degraded.
When record capture is weak, the run becomes a skeleton without enough history. Users may know that stages existed and artifacts were produced, but they may not know which path execution took, when warnings occurred, how metrics evolved, or whether the evidence was complete. The broader model does not replace records. It gives them stronger context.
Example: record evidence
A run might emit several record families during the same evaluation stage:
{
"records": [
{
"type": "event",
"run_id": "run_2026_05_03_eval_014",
"stage": "evaluate",
"name": "evaluation_started",
"timestamp": "2026-05-03T09:18:21Z"
},
{
"type": "metric",
"run_id": "run_2026_05_03_eval_014",
"stage": "evaluate",
"name": "faithfulness",
"scope": "slice",
"slice": "refund_questions",
"value": 0.72
},
{
"type": "span",
"run_id": "run_2026_05_03_eval_014",
"stage": "retrieve",
"operation": "rerank_documents",
"duration_ms": 1842
},
{
"type": "degraded_marker",
"run_id": "run_2026_05_03_eval_014",
"stage": "evaluate",
"reason": "sample_outputs_partially_redacted",
"severity": "warning"
}
]
}
This example shows why records are more useful when they are structured. The faithfulness metric is not just a number; it belongs to a run, a stage, and a slice. The span does not merely report elapsed time; it identifies the operation that consumed it. The degraded marker tells the reader that some evaluation evidence should be interpreted with caution.
A user can use this record set to ask whether a quality drop is slice-specific, whether retrieval latency changed, whether an unusual path occurred during execution, and whether the evidence is complete enough to support a promotion decision.
3. Granularity Units
ML systems often fail unevenly.
A model can look healthy at the run level while failing badly on a narrow slice of data. A retrieval system can perform well on average while consistently failing on a particular prompt type. An evaluation pipeline can show a stable aggregate score while a small number of samples produce severe regressions.
This is why Contexta treats granularity as part of Observability. Batch- and sample-level evidence make it possible to see failures that aggregate metrics hide.
A batch is a discrete unit of data processing within a stage. Depending on the workflow, it may represent an epoch, mini-batch group, cross-validation fold, file in a batch import, or stream chunk. Batch-level evidence helps answer where instability appeared inside a stage. It can show that one fold produced unusual metrics, one import chunk failed, or one epoch triggered degradation.
Batch visibility is especially useful when a stage is too large to be interpreted as one unit. A training stage may run for many epochs. A batch scoring stage may process many files. A data import may consume many chunks. If the system only records the stage-level result, localized problems inside the stage disappear into a summary. Batch-level evidence provides an intermediate resolution between the whole stage and the individual sample.
A sample is an individual unit encountered during execution. It may be a training example, validation row, image, prompt, generated answer, retrieved document, or other unit of model behavior. Sample-level evidence is often the difference between knowing that quality changed and understanding why it changed.
This layer is especially important for modern ML and LLM systems, where aggregate scores can be misleading. A model may pass an overall benchmark while failing on legally sensitive cases, minority slices, long-tail prompts, or edge cases that matter operationally. A RAG pipeline may look healthy on average while repeatedly retrieving irrelevant context for one class of questions. An LLM judge may produce acceptable aggregate scores while a few examples expose severe hallucination or policy failure.
Sample-level evidence does not mean every system must store every input forever. Privacy, storage cost, and operational constraints may limit what can be preserved. But the conceptual point remains: ML Observability should have a place for evidence below the aggregate level. If sample capture is partial, that partiality should be explicit rather than hidden.
Granularity turns Observability from a dashboard of averages into an investigation surface for model behavior. It allows users to ask not only whether performance changed, but where it changed, for whom it changed, and which concrete examples explain the change.
When this layer is missing, aggregate values can become deceptively comforting. Teams may know the average outcome but not which prompts failed badly, whether one slice regressed, whether failures were concentrated in a small subset, or whether warnings appeared only during later batches. Many ML problems are local before they become global. Granularity helps surface them earlier and explain them better later.
Example: sample-level evidence
A sample-level record might preserve enough detail to explain a failure without requiring the user to inspect the whole run manually:
{
"run_id": "run_2026_05_03_eval_014",
"stage": "evaluate",
"batch_id": "fold_03",
"sample_id": "prompt_042",
"slice": "refund_questions",
"input_summary": "User asks whether a partially used annual plan can be refunded.",
"retrieval": {
"top_document_ids": ["doc_policy_general", "doc_pricing_faq"],
"expected_document_id": "doc_refund_policy_2026"
},
"scores": {
"relevance": 0.41,
"faithfulness": 0.33
},
"failure_tags": ["missing_required_policy_context", "unsupported_claim"]
}
At the run level, this example might only appear as a small decrease in average faithfulness. At the sample level, the reason becomes more concrete: the retrieval stage missed the refund policy document, and the generated answer made an unsupported claim.
This kind of evidence helps users debug localized failures, identify weak slices, build regression sets, inspect prompt classes, and decide whether an aggregate metric is hiding a serious operational problem.
4. Artifact Evidence
ML workflows produce durable objects that are central to understanding the result.
A run may produce a checkpoint, model bundle, dataset snapshot, feature set, configuration snapshot, report bundle, export package, evidence bundle, or debug archive. These objects are not incidental side effects. They are often the things that later get reviewed, compared, promoted, deployed, archived, or audited.
For that reason, Contexta treats artifacts as first-class Observability evidence.
The important question is not merely where a file was stored. The important question is what produced it, what it represents, which run and stage it belongs to, which metrics or reports refer to it, and whether it became part of a downstream decision.
Consider a model checkpoint. If the system only stores a path to a file, much of the meaning remains external. Someone still needs to know which run produced it, which configuration shaped it, which evaluation report judged it, and whether it was later promoted. When the artifact is part of the evidence model, those relationships can be preserved directly.
The same is true for dataset snapshots. A metric is easier to interpret when the data behind it is known. If two evaluation runs disagree, a dataset snapshot can help determine whether the difference reflects a model change, a data change, or a configuration change. Without that artifact evidence, teams may have to reconstruct the data context from memory or external storage conventions.
Report bundles are also artifacts. They matter because ML decisions are often reviewed in human-readable form. A report may summarize evaluation results, compare candidate runs, describe degraded evidence, or support a deployment decision. If the report can be generated from canonical evidence, then the Observability model is not merely storing data; it is supporting explanation.
Debug bundles and evidence bundles play a similar role. They preserve material that may not fit cleanly into a single metric or event but may be crucial during investigation. For example, a debug bundle may contain sampled failures, intermediate outputs, configuration snapshots, and diagnostic notes. Treating such bundles as artifacts keeps them connected to the run rather than leaving them as detached files.
Without artifact evidence, teams often fall back to informal answers: the checkpoint should be in object storage, the report was probably generated around that time, or the model bundle likely came from a certain run. Those answers may work in casual collaboration, but they are weak foundations for comparison, audit, recovery, or deployment review.
In Contexta, artifacts are not only outputs. They are durable evidence objects that help anchor the meaning of a run.
Example: artifact evidence
An artifact manifest might connect produced files to the run and stage that created them:
{
"run_id": "run_2026_05_03_train_021",
"artifacts": [
{
"artifact_id": "artifact_model_bundle_v17",
"kind": "model_bundle",
"produced_by_stage": "package",
"path": "artifacts/model_bundle_v17.tar.gz",
"sha256": "9b91c7e4...",
"size_bytes": 184221903
},
{
"artifact_id": "artifact_eval_report_v17",
"kind": "report_bundle",
"produced_by_stage": "evaluate",
"path": "reports/evaluation_v17.md",
"summarizes_run": "run_2026_05_03_eval_014"
},
{
"artifact_id": "artifact_config_snapshot_v17",
"kind": "config_snapshot",
"produced_by_stage": "prepare",
"path": "snapshots/config_v17.json"
}
]
}
This manifest lets a reader treat artifacts as part of the evidence model rather than as loose files. The model bundle can be tied back to the packaging stage. The report can be tied to the evaluation run it summarizes. The configuration snapshot can be inspected when two runs produce different results.
In practice, this supports audit, reproducibility, deployment review, and recovery. It also helps prevent a common failure mode in ML projects: knowing that an important file exists somewhere, but not knowing with confidence which run produced it or which decision it supported.
5. Reproducibility Context
A result is difficult to trust if the conditions that produced it are unknown.
For ML systems, reproducibility context is not optional metadata. It is part of the explanation. Two runs can differ because the code changed, the data changed, the configuration changed, a package was upgraded, a tokenizer behaved differently, a CUDA version changed, or an environment variable altered execution. If those conditions are not captured, later comparison becomes guesswork.
Contexta therefore treats environment and configuration evidence as part of Observability. This can include the Python version, platform, package versions, relevant environment variables, configuration values, and capture time.
The practical value becomes clear during investigation. Suppose two evaluation runs produce different results. Without reproducibility context, the team may spend time debating whether the difference came from data, model code, dependency versions, or runtime platform changes. With environment and configuration evidence attached to the run, those possibilities can be narrowed more quickly.
Reproducibility context also matters when a result needs to be trusted by someone who did not run the workflow. A reviewer may want to know whether the environment was stable. An operator may need to confirm that the artifact being deployed was produced under expected conditions. A future engineer may need to recreate the run after dependencies have moved on. The environment snapshot does not guarantee perfect reproducibility, but it provides a starting point for honest analysis.
This layer is especially important in ML because small environmental differences can have large interpretive consequences. A package upgrade may change preprocessing. A tokenizer revision may change prompt behavior. A hardware or platform difference may affect performance or determinism. A configuration value may silently alter evaluation. If those conditions are not captured, the evidence can look more stable than it really is.
Traditional service Observability often treats environment as deployment metadata managed elsewhere. In ML workflows, environment is frequently part of the scientific and operational meaning of the result itself. A metric is more trustworthy when the conditions that produced it are known.
When this layer is missing, disagreement between runs becomes harder to explain. Teams may know that outputs differ, but not whether the cause was code, data, configuration, package versions, or platform. That uncertainty can consume far more time than the original capture would have cost.
Example: reproducibility context evidence
A run may capture environment and configuration context alongside its records:
{
"run_id": "run_2026_05_03_eval_014",
"environment": {
"python": "3.11.8",
"platform": "linux-x86_64",
"cuda": "12.4",
"packages": {
"torch": "2.4.1",
"transformers": "4.45.2",
"tokenizers": "0.20.1"
}
},
"configuration": {
"retriever": "hybrid_bm25_dense",
"top_k": 8,
"reranker": "cross_encoder_v3",
"temperature": 0.0,
"evaluation_set": "support_eval_2026_04"
},
"captured_at": "2026-05-03T09:12:44Z"
}
This evidence does not guarantee that a run can be perfectly reproduced, but it makes later investigation more honest. If another run differs, the team can check whether the package set changed, whether the evaluation set changed, whether top_k changed, or whether a different reranker was used.
The practical use is straightforward: reproducibility context narrows the search space. Instead of debating every possible cause of a metric delta, users can compare captured conditions and decide whether the difference is likely to come from data, configuration, dependencies, platform, or model behavior.
6. Relationship Tracing
ML systems produce connected networks of evidence.
Runs consume datasets. Stages produce metrics. Artifacts come from runs. Reports summarize evaluations. Deployments promote artifacts. Production behavior reflects earlier training and evaluation decisions. The resulting structure is not only a timeline; it is a graph.
This is why Contexta treats relationship tracing as a core layer of ML Observability.
Lineage describes how objects are connected. It answers questions such as where an artifact came from, which run produced a model bundle, what a report summarizes, which downstream object depends on a result, and what sits upstream of a deployment. These questions are common in ML because outputs are transformed, packaged, promoted, and reused across workflows.
Lineage is not only useful after something goes wrong. It also supports everyday review. When comparing two candidate runs, users may want to know whether they used the same dataset snapshot, whether their reports summarize equivalent evidence, whether their artifacts were produced by the same pipeline stages, or whether one candidate inherited a degraded input. These are relationship questions.
Provenance adds trust context to those relationships. It asks why we believe a connection is valid, whether the link was directly captured or inferred later, what evidence supports the assertion, and under what conditions the relationship was formed. Lineage tells us that things are connected. Provenance helps explain how confidently we should treat that connection.
This distinction matters because not every relationship has the same evidential strength. Some links are explicitly captured during execution. Others may be reconstructed during import, inferred from filenames, or derived from metadata. A mature Observability system should not present all of those relationships as equally certain.
A metrics dashboard may show that a model underperformed. Relationship tracing helps explain which upstream dataset, run, artifact, report, or deployment is involved. Without those links, a system can contain many useful pieces of evidence that still fail to compose into a coherent story.
When relationship tracing is missing, teams often have the parts but not the explanation. They may have a run, a checkpoint, a report, and a deployment record, but still lack a modeled path showing how one led to the next. That gap becomes costly during audit, incident response, or long-term maintenance.
Example: lineage and provenance evidence
Relationship evidence might describe how a deployed model is connected to upstream objects:
{
"relationships": [
{
"type": "produced",
"from": "run_2026_05_03_train_021",
"to": "artifact_model_bundle_v17",
"captured_by": "package_stage",
"confidence": "direct"
},
{
"type": "summarized_by",
"from": "run_2026_05_03_eval_014",
"to": "artifact_eval_report_v17",
"captured_by": "report_generator",
"confidence": "direct"
},
{
"type": "approved_for",
"from": "artifact_eval_report_v17",
"to": "deployment_2026_05_04_prod",
"captured_by": "promotion_record",
"confidence": "direct"
},
{
"type": "derived_from",
"from": "artifact_model_bundle_v17",
"to": "deployment_2026_05_04_prod",
"captured_by": "deployment_import",
"confidence": "inferred"
}
]
}
The useful detail here is not only that objects are connected. The record also says how each connection was established and how confident the system is in that relationship. A directly captured report relationship is stronger than an inferred deployment import relationship.
This helps users trace a production issue backward from deployment to artifact, report, evaluation run, and training run. It also helps reviewers distinguish between evidence that was captured during the workflow and evidence reconstructed later.
7. Operational Outcome
ML Observability should not stop at experimentation.
A run may produce an artifact. That artifact may be reviewed. A report may justify promotion. A deployment may move the artifact into production. Later production behavior may raise new questions about the original evidence. These steps should not be treated as separate worlds.
In the Contexta view, deployment and promotion are part of the evidence story because they describe what happened to the output of ML work.
A deployment is not only a release event. It is also an observable consequence of earlier runs, artifacts, evaluations, and decisions. A useful system should be able to answer which run was deployed, which artifact was promoted, which evidence supported the decision, whether deployment succeeded or failed, and which deployed result is currently active.
This layer matters because offline evaluation and online use are connected but not identical. A model can pass evaluation and still fail in production because the input distribution changes, the serving environment differs, or the operational context introduces new constraints. Conversely, a production issue may need to be investigated by tracing backward into the run, artifact, report, and environment that produced the deployed model.
Operational outcome also helps close the loop between experimentation and consequence. Many ML systems accumulate large numbers of experiments, but only a small number of artifacts move forward. Knowing which results were merely exploratory and which ones affected production is essential for interpretation. A run that produced a deployed artifact has a different operational significance from a run that was discarded.
If the Observability model stops before deployment, the bridge between experimentation and real-world consequence becomes weak. Including operational outcome allows teams to investigate production behavior in terms of the evidence that led to it.
The question is not only what happened during training or evaluation. It is also what moved forward because of that work.
Example: operational outcome evidence
A deployment record can connect operational state to the evidence that justified it:
{
"deployment_id": "deployment_2026_05_04_prod",
"environment": "production",
"status": "succeeded",
"started_at": "2026-05-04T08:30:00Z",
"completed_at": "2026-05-04T08:37:12Z",
"promoted_artifact": "artifact_model_bundle_v17",
"source_run": "run_2026_05_03_train_021",
"supporting_report": "artifact_eval_report_v17",
"checks": {
"schema_check": "passed",
"smoke_test": "passed",
"rollback_plan": "available"
}
}
This record makes deployment part of the evidence lifecycle rather than a detached release note. A production issue can be traced back to the promoted artifact, source run, and supporting report. A review can check whether the expected validation steps passed before the artifact moved forward.
Operational outcome evidence is useful because it separates experiments from consequences. Many runs may produce artifacts, but only some artifacts are promoted. Knowing what moved forward helps teams focus investigation on the results that actually affected users or downstream systems.
8. Investigation and Interpretation
The previous layers describe what evidence exists. This layer describes what users can do with it.
Contexta takes the position that Observability is incomplete if data is only emitted and stored. The evidence must be readable. Users need ways to query it, compare it, diagnose it, trend it, and turn it into reports.
Query is the basic read surface. It lets users list runs, inspect run snapshots, follow linked artifacts, and gather evidence around a subject. If data exists but cannot be assembled into an intelligible run view, the investigation experience remains weak. A user should not need to manually stitch together logs, metrics, file paths, and environment notes just to understand one run.
Comparison is central to ML work. Teams frequently need to know how one run differs from another, which stage changed, which metric regressed, which artifact differs, or which candidate should be preferred. This is not an occasional debugging need; it is part of the normal development and review loop. A useful Observability system should make comparison a first-class activity rather than a custom notebook rebuilt for every question.
Diagnostics help locate suspicious, incomplete, or degraded states. They can surface incomplete stages, missing expected terminal stages, completed stages with no metric evidence, degraded records, failed batches, or failed deployments. Diagnostics do not replace human judgment, but they help users decide where to look first.
Trends show movement over time or across runs. They help users see whether metrics are drifting, stage durations are growing, artifact sizes are changing, or step-level behavior is becoming unstable. Many ML problems appear gradually rather than as a single obvious failure. Trend surfaces make those gradual changes easier to notice and investigate.
Reports turn evidence into human-readable explanation. They support review, sharing, archiving, governance, and decision-making. A report generated from canonical evidence is more than a presentation layer; it is proof that the evidence model can support interpretation.
This report-oriented view is important because many ML decisions are not made by one automated component in real time. They are made through review. People compare candidate runs, decide whether evidence is sufficient, judge whether degradation is acceptable, and communicate findings to collaborators. Observability should support that human process, not only machine ingestion.
Without these investigation surfaces, teams often export raw data into notebooks, scripts, or one-off dashboards each time a question appears. That may work temporarily, but it leaves the burden of reconstruction on the user. A mature Observability system should make investigation a first-class workflow.
Example: investigation output
An investigation surface might assemble records, artifacts, and diagnostics into a run comparison result:
{
"comparison": {
"baseline_run": "run_2026_04_27_eval_009",
"candidate_run": "run_2026_05_03_eval_014",
"summary": {
"accuracy_delta": 0.018,
"faithfulness_delta": -0.042,
"latency_p95_delta_ms": 231
},
"notable_findings": [
{
"type": "slice_regression",
"slice": "refund_questions",
"metric": "faithfulness",
"delta": -0.11
},
{
"type": "artifact_growth",
"artifact": "artifact_model_bundle_v17",
"size_delta_percent": 18.4
},
{
"type": "degraded_evidence",
"stage": "evaluate",
"reason": "sample_outputs_partially_redacted"
}
],
"recommended_next_steps": [
"Inspect failed refund_questions samples.",
"Compare retrieval outputs for prompts with faithfulness regression.",
"Review whether partial redaction limits promotion confidence."
]
}
}
This is not a new kind of raw evidence. It is an interpretation built from evidence. Its value comes from connecting metrics, slices, artifacts, and degraded markers into a form that a human can act on.
A user can use this output to decide where to investigate first, whether the candidate run is still worth promoting, which samples should become regression tests, and whether missing evidence weakens the decision. This is why interpretation belongs inside the Observability story rather than outside it.
Where Production Monitoring Fits
This document emphasizes structured evidence, but production monitoring remains part of ML Observability.
Once a model or ML workflow is deployed, new questions appear. Input data may drift. Feature distributions may change. Prediction distributions may shift. Online outcomes may diverge from offline evaluation. Performance may degrade for a particular slice of users or examples. Feedback data may be incomplete, delayed, or biased.
In the Contexta model, these production signals are not separate from evidence. They extend the evidence graph forward. Production metrics and drift indicators become additional records connected to the run, artifact, configuration, evaluation report, and deployment that produced the active behavior.
This connection matters because production monitoring without lineage can identify symptoms without explaining their origin. A drift alert may show that input distributions changed, but the team still needs to know which model is active, which dataset it was evaluated against, which assumptions were made at promotion time, and which evidence supported deployment. Conversely, offline evidence without production linkage can explain how a model was built but not how it behaved once it mattered.
Production monitoring also changes the meaning of earlier evidence. An evaluation report may look strong when written, but later production data may reveal that the evaluation set missed an important slice. A deployment may initially succeed, but feedback quality may prove weaker than expected. By connecting production outcomes back to offline evidence, the system can support a fuller lifecycle of learning rather than treating deployment as the end of the story.
A stronger ML Observability model connects both sides. It lets teams move from production symptoms back to the evidence that produced the deployed behavior, and from offline evidence forward to real-world outcomes.
Why Explicit Incompleteness Matters
One theme cuts across all eight layers: the system should be honest about evidence quality.
In many engineering systems, missing Observability data is inconvenient. In ML systems, it can be misleading. Decisions are often made from evidence: whether a model should be promoted, whether a regression is real, whether a comparison is fair, whether a report supports review, or whether a deployed artifact matches the evaluated one.
If missing evidence is not represented, users may treat absence as health. A report with no warnings may look trustworthy even if sample capture was incomplete. A comparison may look fair even if one run lacks environment evidence. A deployment may look well-supported even if artifact lineage was partially inferred.
Contexta therefore treats degraded and partial states as data. Capture gaps, missing inputs, replay limitations, import loss, verification warnings, and inferred relationships should be visible to later readers. A partial answer should remain visibly partial instead of being silently upgraded into a complete-looking record.
This principle applies across the whole model. A stage can be complete while its sample evidence is partial. An artifact can exist while its lineage is inferred. A report can be generated while some diagnostics warn that evidence is incomplete. A deployment can succeed while the link to its supporting report is missing. These distinctions matter because they help readers decide how much confidence to place in what they see.
Explicit incompleteness is also useful for recovery. If a system knows what is missing, it can guide users toward repair, re-import, re-run, or cautious interpretation. If missingness is invisible, the system can only present a false sense of completeness.
This is not pessimism. It is a condition for trust. Users can make better decisions when the system distinguishes complete evidence from partial evidence, direct capture from inferred capture, healthy state from degraded state, and strong confidence from weak confidence.
How the Model Applies in Practice
The eight layers are easiest to understand through concrete workflows.
In a training workflow, execution context defines the project, run, and stages such as prepare, train, and evaluate. Record families capture losses, accuracies, warnings, degraded markers, and span timings. Granularity units preserve epoch-, batch-, or sample-level variation. Artifacts anchor checkpoints, model bundles, and configuration snapshots. Reproducibility context records the environment. Relationship tracing links outputs back to the run. Operational outcome may later connect a selected artifact to deployment. Investigation surfaces allow the run to be compared, diagnosed, trended, and reported.
A telemetry-only view may show that training ran. A structured evidence view can explain how it ran, what it produced, why its result differs from another run, and whether the evidence is strong enough to support promotion.
For example, suppose a new training run improves validation accuracy but produces a larger artifact and takes longer to export. A basic metrics dashboard can show the accuracy and duration changes. A stronger evidence model can show that the improvement came from a specific evaluation slice, that the export stage produced a larger bundle, that the package set changed, and that the report includes a degraded marker for incomplete sample capture. That richer view supports a better decision than the aggregate score alone.
In an evaluation workflow, the need for structured evidence is even more obvious. A run may include loading, scoring, aggregation, and reporting. Aggregate metrics may describe overall quality, but sample-level evidence reveals which cases failed. Artifacts preserve generated reports or exports. Provenance explains how conclusions were formed. This matters because model quality often depends on a small number of severe failures, not only on average behavior.
Evaluation also shows why reports matter. A reviewer usually does not want to inspect raw records one by one. They want an explanation: what was evaluated, under what conditions, which metrics changed, which examples failed, which artifacts were produced, and whether any evidence was incomplete. If the report is generated from canonical evidence, it can serve as a trustworthy bridge between raw capture and human review.
In an LLM or RAG workflow, the evidence model becomes broader still. Useful evidence may include prompt-level samples, retrieval-stage metrics, generation-stage metrics, faithfulness and relevance scores, fallback markers, retrieved context, and lineage connecting prompts, retrieved evidence, generated outputs, and reports. The most interesting questions are often about per-prompt behavior and evidence paths, not just latency or error count.
A RAG failure, for instance, may not be explained by one metric. The answer may be wrong because retrieval missed the relevant document, because reranking selected the wrong passage, because the generator ignored retrieved evidence, or because the evaluator scored the output inconsistently. To investigate that chain, the Observability model needs stage structure, sample-level records, artifacts containing evidence bundles, and lineage linking prompts, retrieval results, generated answers, and evaluation outcomes.
In a deployment workflow, the focus shifts from what the run did to what moved forward because of it. The system needs to connect artifacts, reports, environment, provenance, promotion, and deployment state. Observability becomes the story of consequence, not only execution.
A failed deployment should be traceable backward to the artifact that was deployed, the run that produced it, the report that supported it, and the environment in which it was created. A successful deployment should still remain connected to later production monitoring so that real-world behavior can be interpreted in light of the original evidence.
Implication: Observability for AI Agents
Structured evidence is useful for humans, but it is also useful for AI agents.
As AI coding agents increasingly generate, evaluate, debug, and maintain software, they need machine-readable context about prior work. They cannot rely on informal human memory or scattered dashboards. They need records, metrics, artifacts, lineage paths, diagnostics, and reports that can be inspected programmatically.
This creates a natural connection between ML Observability and agent-oriented workflows. A generator agent can produce a change. An evaluator agent can run tests, inspect evidence, compare outcomes, and return concrete feedback. A later session can resume from artifacts and structured context rather than from a fragile chat transcript.
This is especially important for long-running or multi-step work. Human teams often rely on shared memory, discussion, or dashboards to understand what happened before. Agents need a more explicit handoff. They benefit from structured run records, deterministic metrics, evidence bundles, and diagnostics that describe the state of the work without requiring access to the entire conversation history.
This does not change the core definition of ML Observability. It reinforces it. Evidence that is structured enough for human review is also more usable by agents that need to reason across runs, compare outcomes, and enforce quality over time.
What a Weak ML Observability Story Looks Like
A weak ML Observability story is usually not a complete absence of data. Most teams have some metrics, logs, artifacts, dashboards, scripts, or reports. The weakness appears when those pieces do not connect.
Metrics may exist, but their scope or stage may be ambiguous. Logs may contain important lifecycle events, but the boundaries between stages may be implicit. Artifacts may exist in storage, but their origin may be tracked by convention or memory. Environment differences may be reconstructed after the fact. Sample-level failures may be hidden behind averages. Comparisons may require custom scripts every time. Missing evidence may not be represented, making silence look like completeness. Deployment records may be detached from the evidence that justified them.
This kind of fragmentation is common. It does not mean the team has no Observability. It means the team has Observability signals without a coherent ML evidence model.
The cost of that fragmentation often appears later. A team may be able to debug a run immediately after it happens because the people involved still remember the context. But weeks later, when someone asks why a model was promoted or why a production behavior changed, the missing links become painful. The data exists, but the explanation has to be reconstructed.
A weak story therefore often feels adequate during execution and inadequate during review. Contexta is designed around the later moment, when evidence needs to remain interpretable after memory has faded.
What a Strong ML Observability Story Looks Like
A stronger ML Observability story connects the pieces.
The run structure is explicit. Records are attached to meaningful scopes. Sample- and batch-level variation can be inspected. Artifacts are preserved as evidence objects. Environment and configuration context are captured alongside results. Lineage and provenance explain how objects are related. Degraded or partial states are visible. Production outcomes can be connected back to offline evidence. Query, comparison, diagnostics, trends, and reporting are normal workflows rather than emergency reconstruction tools.
This does not require perfect capture. Perfect evidence is rarely realistic. What matters is that the system preserves enough structure for later questions to be answered without rebuilding the meaning model from scratch.
A strong story also does not require every user to inspect every layer all the time. Many users may begin with a report, a comparison, or a diagnostic summary. The value of the model is that those surfaces can be backed by structured evidence. When a question becomes deeper, the system can let the reader move from summary to records, from records to artifacts, from artifacts to lineage, and from lineage to operational outcomes.
In that sense, a strong ML Observability story is not only about completeness. It is about navigability. The evidence should be structured enough that users can move through it without losing meaning.
The Main Design Intuition Behind Contexta
The main design intuition behind Contexta can be summarized as follows.
Telemetry tells you that something happened. Structure tells you where it happened. Granularity tells you which units were affected. Artifacts tell you what was produced. Environment tells you under what conditions it happened. Lineage tells you how it connects. Degradation tells you how much to trust the record. Reports and diagnostics tell you how to interpret it.
This intuition connects the public concepts in the project. Even when implementation surfaces evolve, the underlying point of view remains consistent: ML Observability should preserve the evidence needed to understand ML work over time.
It also explains why Contexta does not frame ML Observability as application Observability with a few ML-specific tags. Tags can add context, but they do not by themselves create a coherent evidence model. The project instead starts from the objects that ML teams need to reason about: runs, stages, records, samples, artifacts, environments, relationships, operational outcomes, and reports.
What This Means for Contexta
This document describes the direction encoded in Contexta. It does not claim that every possible ML Observability workflow is complete or that every surface in the project is equally mature.
The direction is clear: local-first storage of canonical evidence, structured run-oriented investigation, explicit modeling of records and artifacts, environment and lineage as first-class context, support for comparison and reporting, and honest treatment of incomplete or degraded states. It also means connecting offline evidence to operational outcomes rather than treating experimentation and deployment as separate histories.
Local-first storage matters because evidence should be inspectable and owned, not only hidden behind a remote backend. Canonical evidence matters because reports, exports, diagnostics, and comparisons should derive from the same underlying record rather than from separate ad hoc reconstructions. Run-oriented investigation matters because most ML questions eventually return to what a specific execution did, produced, and justified.
The broader point is that Contexta sees ML Observability not as "logs, metrics, and traces applied to ML," but as a discipline of preserving structured evidence about ML execution and outcomes so that they remain understandable, reviewable, reproducible, and trustworthy over time.
Summary
ML Observability, in the Contexta sense, is the practice of making ML execution legible after it happens.
It includes telemetry, but it also includes semantic execution structure, records attached to meaningful scopes, sample- and batch-level granularity, artifact evidence, reproducibility context, lineage and provenance, operational outcomes, explicit degraded-state modeling, and interpretation surfaces for query, comparison, diagnostics, trends, and reports.
That is why Contexta organizes its public concepts around runs, stages, records, artifacts, lineage, and reports rather than around telemetry alone.
A practical working definition is:
ML Observability is the discipline of preserving enough structured evidence about runs, stages, records, samples, artifacts, environment, relationships, and evidence quality to support future investigation, comparison, reproducibility, review, and operational decision-making.