adding online evaluation for custom code based evaluators and CLI examples (#1412)

2026-05-22 14:43:35 +00:00 · 2026-04-28 11:48:28 -07:00
parent 2c0fdfc523
commit 4eafe85bc7
3 changed files with 805 additions and 55 deletions
@@ -4,7 +4,136 @@

 This tutorial shows how to build and run **custom code-based evaluators** with Amazon Bedrock AgentCore Evaluations. Instead of relying on an LLM as the judge, code-based evaluators delegate scoring to an AWS Lambda function you write. This gives you deterministic, low-cost, fully customizable evaluation logic that can encode exact business rules, format constraints, or data validation requirements that an LLM might interpret loosely.

-The tutorial pairs code-based evaluators with the built-in LLM evaluators from the [groundtruth tutorial](../05-groundtruth-based-evalautions/) to show how both types work side-by-side in a mixed evaluation run.
+The tutorial demonstrates code-based evaluators in **both on-demand and online evaluation** modes, and pairs them with built-in LLM evaluators to show how both types work side-by-side in a mixed evaluation run.
+
+---
+
+## Setup with AgentCore CLI
+
+The fastest way to bootstrap and deploy the agent is with the [AgentCore CLI](https://github.com/aws/agentcore-cli) (`0.11.0`).
+
+### Prerequisites
+
+- **Node.js** 20.x or later
+- **uv** 0.4+ (Python package manager)
+- **AWS CLI** 2.x with credentials configured
+- **Docker** running locally (for agent container build)
+- **Git** 2.x
+
+### Install the CLI
+
+```bash
+npm install -g @aws/agentcore@0.11.0
+agentcore --version   # should print 0.11.0
+```
+
+### Configure AWS credentials
+
+```bash
+aws configure
+aws sts get-caller-identity   # verify credentials
+```
+
+Your IAM user/role needs permissions for: AgentCore Runtime, AgentCore Evaluations, Lambda,
+CloudWatch Logs, ECR, IAM, and Bedrock.
+
+### Create and deploy the agent
+
+```bash
+# Scaffold a new AgentCore project
+agentcore create --name HRAssistant --framework Strands --model-provider Bedrock --defaults
+
+# Copy the HR assistant implementation
+cp hr_assistant_agent.py app/HRAssistant/main.py
+
+# Test locally
+agentcore dev
+
+# Deploy to AWS (builds container, pushes to ECR, creates AgentCore Runtime)
+agentcore deploy
+```
+
+After `agentcore deploy` completes, note the **Runtime ID** and **ARN** from the output.
+
+### Register a code-based evaluator via CLI
+
+`agentcore add evaluator` registers the evaluator in your project's `agentcore.json`. The evaluator
+is created in AWS when you run `agentcore deploy`.
+
+```bash
+# Register a TRACE-level code-based evaluator
+agentcore add evaluator \
+  --name HRResponseLength \
+  --level TRACE \
+  --type code-based \
+  --lambda-arn arn:aws:lambda:<region>:<account-id>:function:hr-response-length \
+  --timeout 30
+
+# Register a SESSION-level code-based evaluator
+agentcore add evaluator \
+  --name HRFactChecker \
+  --level SESSION \
+  --type code-based \
+  --lambda-arn arn:aws:lambda:<region>:<account-id>:function:hr-fact-checker \
+  --timeout 60
+```
+
+### Run on-demand evaluation via CLI
+
+**Standalone mode** (no project needed) — use `--runtime-arn` and `--evaluator-arn` with the
+full ARNs of already-deployed resources. This works from any directory:
+
+```bash
+agentcore run eval \
+  --runtime-arn <agent-runtime-arn> \
+  --evaluator-arn <hr-response-length-evaluator-arn> \
+  --evaluator-arn <hr-fact-checker-evaluator-arn> \
+  --session-id <session-id> \
+  --region <aws-region>
+```
+
+Mix code-based evaluators (`--evaluator-arn`) with built-in evaluators (`--evaluator`) in one command:
+
+```bash
+agentcore run eval \
+  --runtime-arn <agent-runtime-arn> \
+  --evaluator-arn <hr-response-length-evaluator-arn> \
+  --evaluator-arn <hr-fact-checker-evaluator-arn> \
+  --evaluator Builtin.Correctness \
+  --evaluator Builtin.Helpfulness \
+  --session-id <session-id> \
+  --region <aws-region>
+```
+
+**Project mode** (inside a deployed project directory) — use evaluator names from `agentcore.json`.
+Requires `agentcore deploy` to have been run first:
+
+```bash
+agentcore run eval \
+  --runtime HRAssistant \
+  --evaluator HRResponseLength \
+  --evaluator HRFactChecker \
+  --session-id <session-id>
+```
+
+### Add online evaluation via CLI
+
+`agentcore add online-eval` adds the config to `agentcore.json`; it is created in AWS on
+`agentcore deploy`. Run from inside your project directory:
+
+```bash
+# sampling-rate is a percentage (0.01–100)
+agentcore add online-eval \
+  --name hr_online_eval \
+  --runtime HRAssistant \
+  --evaluator HRResponseLength \
+  --evaluator HRFactChecker \
+  --sampling-rate 100 \
+  --enable-on-create
+```
+
+> You can also use the notebook (Step 10) to create the online eval config programmatically
+> using the boto3 SDK, without needing a project directory.

 ---
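The `--sampling-rate` percentage passed to `agentcore add online-eval` above can be pictured with a small sketch. This is an illustration only: the tutorial does not describe how AgentCore selects sessions internally, so the hash-bucket approach and the `is_sampled` helper below are assumptions, not the service's implementation.

```python
import hashlib


def is_sampled(session_id: str, sampling_rate: float) -> bool:
    """Return True if this session falls inside the sampled percentage.

    Hypothetical helper: AgentCore's real sampling logic is not documented
    in this tutorial; this only illustrates percentage-based sampling.
    """
    if not 0.01 <= sampling_rate <= 100:
        raise ValueError("sampling_rate must be a percentage between 0.01 and 100")
    # Hash the session ID into one of 10,000 buckets (0.01% granularity),
    # so the same session always gets the same decision.
    digest = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000
    return bucket < round(sampling_rate * 100)
```

With `sampling_rate=100` every session is selected; with `sampling_rate=50` roughly half are. Hashing the session ID rather than drawing a random number keeps the decision stable if the same session is seen more than once.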

@@ -56,6 +185,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
 │  2. Register evaluators via bedrock-agentcore-control                        │
 │  3a. On-demand: EvaluationClient.run(session_id, evaluator_ids)             │
 │  3b. Dataset: OnDemandEvaluationDatasetRunner.run(dataset, agent_invoker)   │
+│  3c. Online: create_online_evaluation_config (auto-evaluates all sessions)  │
 └────────────────┬────────────────────────────────────────────────────────────┘
                 │
     ┌───────────▼────────────┐        ┌──────────────────────────────┐
@@ -81,7 +211,8 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
 1. Agent is invoked; OTel spans are written to CloudWatch
 2. `EvaluationClient` or `OnDemandEvaluationDatasetRunner` collects spans from CloudWatch
 3. The service calls each evaluator — builtin evaluators run LLM inference; code-based evaluators invoke your Lambda with the span payload
-4. All results are aggregated and returned
+4. For **online evaluation**, AgentCore continuously watches the log group and automatically evaluates new sessions without any explicit trigger
+5. All results are aggregated and returned (on-demand) or written to the online evaluation results log group

 ---

@@ -91,11 +222,11 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
 - **Docker** running locally (for agent container image build)
 - **AWS credentials** with permissions for:
  - `bedrock-agentcore:*` — runtime and evaluations
-  - `bedrock-agentcore-control:*` — evaluator registration
+  - `bedrock-agentcore-control:*` — evaluator registration and online eval config management
  - `lambda:CreateFunction`, `lambda:UpdateFunctionCode`, `lambda:AddPermission`, `lambda:GetFunction`
  - `logs:FilterLogEvents`, `logs:DescribeLogGroups` — CloudWatch span collection
  - `ecr:*` — container image for the agent
-  - `iam:*` — auto-creating the agent execution role
+  - `iam:*` — creating execution roles for the agent and online evaluation
 - **IAM role** named `AgentCoreLambdaExecutionRole` with `AWSLambdaBasicExecutionRole` attached
 - **bedrock-agentcore >= 1.6.0** installed in the notebook kernel

@@ -109,6 +240,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
 |---|---|
 | `programmatic_evaluators.ipynb` | Main tutorial notebook (standalone, end-to-end) |
 | `hr_assistant_agent.py` | HR Assistant Strands agent (same as groundtruth tutorial) |
+| `Dockerfile` | Container definition for the agent (used by Step 3 fresh deploy and `agentcore deploy`) |
 | `requirements.txt` | Python dependencies (`bedrock-agentcore>=1.6.0`) |
 | `lambdas/hr_response_length/lambda_function.py` | Response length evaluator Lambda |
 | `lambdas/hr_fact_checker/lambda_function.py` | HR fact-checking evaluator Lambda |
@@ -124,6 +256,7 @@ Checks that each agent response is between 50 and 600 characters. Responses shor
 - **Level:** TRACE — evaluated once per agent response
 - **Lambda:** `hr-response-length`
 - **Returns:** `1.0` (PASS) if within range, `0.0` (FAIL) otherwise
+- **Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)

 ### HRFactChecker (SESSION level)

@@ -137,6 +270,7 @@ Deterministically validates that the HR assistant's responses contain accurate f
  - PTO request ID format `PTO-2026-NNN`
  - Policy facts: 15-day PTO accrual, 2-day advance notice, 401k 4% match, 90% health coverage
 - **Returns:** fraction of applicable checks passed (0.0–1.0), labeled `PASS`, `PARTIAL`, `FAIL`, or `SKIP`
+- **Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)

 ---
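The scoring rules of the two evaluators described above can be sketched as plain Python functions, independent of Lambda and the AgentCore SDK. This is a minimal sketch, not the code in `lambdas/`: the function names are hypothetical, the 50 and 600 character bounds are assumed to be inclusive, and returning no score for `SKIP` is an assumption.

```python
from typing import Optional, Sequence, Tuple

MIN_CHARS = 50
MAX_CHARS = 600


def score_response_length(response: str) -> Tuple[float, str]:
    """HRResponseLength-style check: 1.0/PASS inside the range, else 0.0/FAIL."""
    # Assumption: both bounds are inclusive.
    if MIN_CHARS <= len(response) <= MAX_CHARS:
        return 1.0, "PASS"
    return 0.0, "FAIL"


def score_fact_checks(results: Sequence[Optional[bool]]) -> Tuple[Optional[float], str]:
    """HRFactChecker-style aggregation over individual checks.

    Each entry is True (check passed), False (check failed), or None
    (check not applicable to this session).
    """
    applicable = [r for r in results if r is not None]
    if not applicable:
        # Assumption: with no applicable checks the session is skipped
        # and carries no score.
        return None, "SKIP"
    # Fraction of applicable checks that passed (True counts as 1).
    score = sum(applicable) / len(applicable)
    if score == 1.0:
        return score, "PASS"
    if score == 0.0:
        return score, "FAIL"
    return score, "PARTIAL"
```

Keeping the scoring logic in pure functions like these makes it easy to unit test the rules before wrapping them in a Lambda handler.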

@@ -156,6 +290,60 @@ Results from all five evaluators are collected per scenario, letting you compare

 ---

+## Online Evaluation with Code-Based Evaluators
+
+Step 10 of the notebook demonstrates **online evaluation**, a continuous evaluation mode where
+AgentCore automatically evaluates live agent sessions (all of them, or a sampled percentage) without an explicit API call per session.
+
+### How it works
+
+1. Register code-based evaluators (Steps 4–6, same as for on-demand)
+2. Create an online evaluation config via `create_online_evaluation_config`:
+   - Point it at the agent's CloudWatch log group
+   - Set a sampling rate (a percentage from 0.01 to 100)
+   - List the evaluator IDs (code-based and/or builtin)
+   - Provide an IAM execution role the service can assume
+3. Enable the config — AgentCore starts watching the log group
+4. New agent sessions are evaluated automatically at the configured sampling rate
+5. Results appear in the online evaluation results CloudWatch log group
+
+### Evaluator locking
+
+When a code-based evaluator is referenced by an **enabled** online evaluation config, AgentCore
+**locks** it automatically. You cannot modify or delete a locked evaluator. To update it:
+
+```
+disable/delete online eval config
+         ↓
+update evaluator Lambda or re-register
+         ↓
+re-enable or re-create online eval config
+```
+
+### On-demand vs. online comparison
+
+| Dimension | On-demand | Online |
+|---|---|---|
+| Trigger | Explicit per session | Automatic on every invocation |
+| Setup | `EvaluationClient.run()` or `OnDemandEvaluationDatasetRunner` | `create_online_evaluation_config` once |
+| Code-based evaluators | ✅ Supported | ✅ Supported |
+| Evaluator locking | No | Yes — while config is enabled |
+| Best for | CI/CD, ad-hoc debugging | Continuous production monitoring |
+
+### AgentCore CLI shortcut
+
+```bash
+# sampling-rate is a percentage (0.01–100); 50 = evaluate 50% of sessions
+agentcore add online-eval \
+  --name my_online_eval \
+  --runtime MyAgent \
+  --evaluator MyCodeEvaluator \
+  --sampling-rate 50 \
+  --enable-on-create
+```
+
+---
+
 ## Sample Prompts

 The dataset includes five scenarios that exercise facts the `HRFactChecker` validates:
@@ -178,14 +366,15 @@ You can extend the dataset with additional scenarios to test more HR topics (rem
 |---|---|
 | 1 | Install dependencies (`bedrock-agentcore>=1.6.0`) |
 | 2 | Configure AWS session, region, and Lambda role ARN |
-| 3 | Agent setup — reload from `%store` (groundtruth notebook) or deploy fresh |
+| 3 | Agent setup — reload from `%store` (groundtruth notebook) or deploy fresh with boto3 |
 | 4 | Define Lambda evaluator functions using the `@custom_code_based_evaluator()` decorator |
 | 5 | Deploy Lambda functions (bundled with bedrock-agentcore SDK + pydantic) |
 | 6 | Register evaluators via `bedrock-agentcore-control` boto3 service |
 | 7 | On-demand evaluation with `EvaluationClient` (code-based + builtin evaluators) |
 | 8 | Dataset evaluation with `OnDemandEvaluationDatasetRunner` (mixed evaluator set) |
 | 9 | Inspect and compare results (per-scenario tables + aggregate score comparison) |
-| 10 | Cleanup — delete Lambda functions, evaluator records, and agent runtime |
+| **10** | **Online evaluation with `create_online_evaluation_config` (code-based evaluators, auto-triggered)** |
+| 11 | Cleanup — delete Lambda functions, evaluator records, online eval config, and agent runtime |

 ---

@@ -213,8 +402,10 @@ span.span_events[*]
 - **Business rule enforcement** — encode domain-specific rules that LLMs might interpret loosely
 - **High-volume evaluation** — reduce cost for evaluations that run on every production session
 - **Regulatory requirements** — verify that required disclosures or disclaimers are always present
+- **Continuous monitoring** — combine with online evaluation for zero-touch production quality gates

-> **Note:** Code-based evaluators are supported for **on-demand evaluation** (`EvaluationClient`, `OnDemandEvaluationDatasetRunner`) only. Online evaluation configs support built-in LLM evaluators only.
+Code-based evaluators are supported for **both on-demand** (`EvaluationClient`,
+`OnDemandEvaluationDatasetRunner`) and **online** (`create_online_evaluation_config`) evaluation.

 ---

@@ -223,20 +414,27 @@ span.span_events[*]
 To remove created AWS resources:

 ```python
-# Delete Lambda functions
+# 1. Disable online evaluation config first (unlocks evaluators)
+cp_client.update_online_evaluation_config(
+    onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID,
+    executionStatus="DISABLED",
+)
+cp_client.delete_online_evaluation_config(onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID)
+
+# 2. Delete Lambda functions
 for fn in ["hr-response-length", "hr-fact-checker"]:
    lambda_client.delete_function(FunctionName=fn)

-# Delete evaluator registrations
+# 3. Delete evaluator registrations (now unlocked)
 for name, eid in CODE_EVAL_IDS.items():
    cp_client.delete_evaluator(evaluatorId=eid)

-# Delete agent runtime (only if deployed in this notebook)
+# 4. Delete agent runtime (only if deployed in this notebook)
 if not _agent_loaded:
-    agent_runtime.delete()
+    agentcore_control.delete_agent_runtime(agentRuntimeId=AGENT_ID)
 ```

-Alternatively, run the cleanup cell (Step 10) in the notebook — it is commented out by default to prevent accidental deletion.
+Alternatively, run the cleanup cell (Step 11) in the notebook — it is commented out by default to prevent accidental deletion.

 ---

@@ -245,4 +443,5 @@ Alternatively, run the cleanup cell (Step 10) in the notebook — it is commente
 - Extend `HRFactChecker` with additional business rules as your agent and data model evolve
 - Combine code-based evaluators with `EvaluationClient` to validate specific production sessions
 - Add code-based evaluators to your CI/CD pipeline for zero-cost regression testing on every deployment
+- Use online evaluation with a lower sampling rate (e.g. 10%) to cost-effectively monitor high-traffic agents
 - Explore the [groundtruth tutorial](../05-groundtruth-based-evalautions/) for `EvaluationClient` and ground-truth-based evaluations with built-in evaluators
@@ -30,7 +30,7 @@
    "| **HRResponseLength** | TRACE | `hr-response-length` | Response length is 50–600 chars |\n",
    "| **HRFactChecker** | SESSION | `hr-fact-checker` | PTO balances, pay stubs, and policy facts are accurate |\n",
    "\n",
-    "Then we'll run `OnDemandEvaluationDatasetRunner` with a **mixed evaluator set** combining these code-based evaluators with built-in LLM-as-as-Judge evaluators.\n",
+    "Then we'll run `OnDemandEvaluationDatasetRunner` with a **mixed evaluator set** combining these code-based evaluators with built-in LLM-as-a-Judge evaluators. We will also set up online evaluation using these evaluators for live monitoring.\n",
    "\n",
    "### Tutorial Details\n",
    "\n",
@@ -111,7 +111,27 @@
    "from botocore.config import Config\n",
    "from IPython.display import display, Markdown\n",
    "\n",
-    "REGION = \"aws_region\" # Add AWS region here \n",
+    "# ── Region configuration ──────────────────────────────────────────────────────\n",
+    "# REGION: the AWS region where the AgentCore Runtime (agent) is deployed.\n",
+    "#   Auto-detected from the boto3 session (reads AWS_DEFAULT_REGION env var or\n",
+    "#   the default region in ~/.aws/config). Set explicitly if needed, e.g.:\n",
+    "#   REGION = \"us-east-1\"\n",
+    "#\n",
+    "#   If you ran groundtruth_evaluations.ipynb first, REGION is also restored\n",
+    "#   from %store in the agent-load cell below, overriding this value.\n",
+    "REGION = Session().region_name\n",
+    "assert REGION, (\n",
+    "    \"No AWS region detected. Set AWS_DEFAULT_REGION or configure a default \"\n",
+    "    \"region in ~/.aws/config, or set REGION explicitly above.\"\n",
+    ")\n",
+    "\n",
+    "# EVAL_REGION: region for Lambda evaluators and evaluator registrations.\n",
+    "#   For online evaluation, this MUST match REGION (the agent's CloudWatch log\n",
+    "#   group and the evaluation config must be in the same region). The\n",
+    "#   agent-clients cell below aligns EVAL_REGION to REGION automatically\n",
+    "#   after %store restores the agent's actual region.\n",
+    "EVAL_REGION = REGION\n",
+    "\n",
    "boto_session = Session(region_name=REGION)\n",
    "ACCOUNT_ID = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n",
    "\n",
@@ -119,13 +139,10 @@
    "# Update this if your role has a different name\n",
    "LAMBDA_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/AgentCoreLambdaExecutionRole\"\n",
    "\n",
-    "# Evaluation region — Lambda evaluator functions and evaluator registrations must be here.\n",
-    "EVAL_REGION = \"aws_region\" # Set AWS Region here\n",
-    "\n",
    "print(f\"Region          : {REGION}\")\n",
+    "print(f\"Eval Region     : {EVAL_REGION}\")\n",
    "print(f\"Account         : {ACCOUNT_ID}\")\n",
-    "print(f\"Lambda Role ARN : {LAMBDA_ROLE_ARN}\")\n",
-    "print(f\"Eval Region     : {EVAL_REGION}\")"
+    "print(f\"Lambda Role ARN : {LAMBDA_ROLE_ARN}\")"
   ]
  },
  {
@@ -191,34 +208,137 @@
   },
   "outputs": [],
   "source": [
+    "\n",
    "# Deploy agent if not already loaded\n",
    "if not _agent_loaded:\n",
-    "    from bedrock_agentcore_starter_toolkit import Runtime\n",
+    "    # -------------------------------------------------------------------------\n",
+    "    # Fresh deployment using boto3 (bedrock-agentcore-control) + Docker/ECR.\n",
+    "    # This path runs only when the groundtruth notebook has NOT been executed\n",
+    "    # first. If you prefer the CLI, run `agentcore deploy` from the project\n",
+    "    # root instead and set AGENT_ID / AGENT_ARN / CW_LOG_GROUP manually below.\n",
+    "    # -------------------------------------------------------------------------\n",
    "\n",
-    "    agent_runtime = Runtime()\n",
-    "    agent_runtime.configure(\n",
-    "        entrypoint=\"hr_assistant_agent.py\",\n",
-    "        requirements_file=\"requirements.txt\",\n",
-    "        auto_create_execution_role=True,\n",
-    "        auto_create_ecr=True,\n",
-    "        region=REGION,\n",
-    "        agent_name=\"hr_assistant_codeeval_tutorial\",\n",
-    "        idle_timeout=120,\n",
+    "    ecr_client = boto3.client(\"ecr\", region_name=REGION)\n",
+    "    agentcore_control_deploy = boto3.client(\"bedrock-agentcore-control\", region_name=REGION)\n",
+    "    iam_client = boto3.client(\"iam\")\n",
+    "\n",
+    "    AGENT_NAME = \"hr_assistant_codeeval_tutorial\"\n",
+    "    ECR_REPO_NAME = f\"agentcore-{AGENT_NAME}\"\n",
+    "\n",
+    "    # ------------------------------------------------------------------\n",
+    "    # 1. Ensure IAM execution role exists\n",
+    "    # ------------------------------------------------------------------\n",
+    "    EXECUTION_ROLE_NAME = \"AgentCoreRuntimeExecutionRole\"\n",
+    "    EXECUTION_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/{EXECUTION_ROLE_NAME}\"\n",
+    "\n",
+    "    try:\n",
+    "        iam_client.get_role(RoleName=EXECUTION_ROLE_NAME)\n",
+    "        print(f\"Using existing IAM role: {EXECUTION_ROLE_ARN}\")\n",
+    "    except iam_client.exceptions.NoSuchEntityException:\n",
+    "        print(f\"Creating IAM role: {EXECUTION_ROLE_NAME}...\")\n",
+    "        trust_policy = json.dumps({\n",
+    "            \"Version\": \"2012-10-17\",\n",
+    "            \"Statement\": [{\n",
+    "                \"Effect\": \"Allow\",\n",
+    "                \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
+    "                \"Action\": \"sts:AssumeRole\",\n",
+    "            }],\n",
+    "        })\n",
+    "        iam_client.create_role(\n",
+    "            RoleName=EXECUTION_ROLE_NAME,\n",
+    "            AssumeRolePolicyDocument=trust_policy,\n",
+    "            Description=\"Execution role for AgentCore Runtime tutorial agents\",\n",
+    "        )\n",
+    "        for policy_arn in [\n",
+    "            \"arn:aws:iam::aws:policy/AmazonBedrockFullAccess\",\n",
+    "            \"arn:aws:iam::aws:policy/CloudWatchLogsFullAccess\",\n",
+    "        ]:\n",
+    "            iam_client.attach_role_policy(RoleName=EXECUTION_ROLE_NAME, PolicyArn=policy_arn)\n",
+    "        print(f\"Created: {EXECUTION_ROLE_ARN}\")\n",
+    "        time.sleep(10)  # allow IAM propagation\n",
+    "\n",
+    "    # ------------------------------------------------------------------\n",
+    "    # 2. Create ECR repository (or reuse existing)\n",
+    "    # ------------------------------------------------------------------\n",
+    "    try:\n",
+    "        ecr_resp = ecr_client.create_repository(repositoryName=ECR_REPO_NAME)\n",
+    "        ECR_REPO_URI = ecr_resp[\"repository\"][\"repositoryUri\"]\n",
+    "        print(f\"Created ECR repo: {ECR_REPO_URI}\")\n",
+    "    except ecr_client.exceptions.RepositoryAlreadyExistsException:\n",
+    "        ECR_REPO_URI = ecr_client.describe_repositories(\n",
+    "            repositoryNames=[ECR_REPO_NAME]\n",
+    "        )[\"repositories\"][0][\"repositoryUri\"]\n",
+    "        print(f\"Using existing ECR repo: {ECR_REPO_URI}\")\n",
+    "\n",
+    "    IMAGE_URI = f\"{ECR_REPO_URI}:latest\"\n",
+    "\n",
+    "    # ------------------------------------------------------------------\n",
+    "    # 3. Build Docker image and push to ECR\n",
+    "    # ------------------------------------------------------------------\n",
+    "    ecr_registry = f\"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com\"\n",
+    "    print(\"Docker login to ECR...\")\n",
+    "    subprocess.run(\n",
+    "        f\"aws ecr get-login-password --region {REGION} | docker login --username AWS --password-stdin {ecr_registry}\",\n",
+    "        shell=True, check=True,\n",
    "    )\n",
-    "    launch_result = agent_runtime.launch()\n",
+    "    print(\"Building Docker image (this may take a few minutes)...\")\n",
+    "    subprocess.run([\"docker\", \"build\", \"--platform\", \"linux/amd64\", \"-t\", IMAGE_URI, \".\"], check=True)\n",
+    "    print(\"Pushing image to ECR...\")\n",
+    "    subprocess.run([\"docker\", \"push\", IMAGE_URI], check=True)\n",
+    "    print(f\"Image pushed: {IMAGE_URI}\")\n",
    "\n",
-    "    terminal = {\"READY\", \"CREATE_FAILED\", \"DELETE_FAILED\", \"UPDATE_FAILED\"}\n",
+    "    # Allow ECR pull from AgentCore\n",
+    "    ecr_client.set_repository_policy(\n",
+    "        repositoryName=ECR_REPO_NAME,\n",
+    "        policyText=json.dumps({\n",
+    "            \"Version\": \"2012-10-17\",\n",
+    "            \"Statement\": [{\n",
+    "                \"Sid\": \"AllowAgentCorePull\",\n",
+    "                \"Effect\": \"Allow\",\n",
+    "                \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
+    "                \"Action\": [\"ecr:GetDownloadUrlForLayer\", \"ecr:BatchGetImage\", \"ecr:BatchCheckLayerAvailability\"],\n",
+    "            }],\n",
+    "        }),\n",
+    "    )\n",
+    "\n",
+    "    # ------------------------------------------------------------------\n",
+    "    # 4. Create (or update) AgentCore Runtime\n",
+    "    # ------------------------------------------------------------------\n",
+    "    artifact = {\"containerConfiguration\": {\"containerUri\": IMAGE_URI}}\n",
+    "    try:\n",
+    "        resp = agentcore_control_deploy.create_agent_runtime(\n",
+    "            agentRuntimeName=AGENT_NAME,\n",
+    "            agentRuntimeArtifact=artifact,\n",
+    "            executionRoleArn=EXECUTION_ROLE_ARN,\n",
+    "            networkConfiguration={\"networkMode\": \"PUBLIC\"},\n",
+    "        )\n",
+    "        AGENT_ID = resp[\"agentRuntimeId\"]\n",
+    "        AGENT_ARN = resp[\"agentRuntimeArn\"]\n",
+    "        print(f\"Created AgentCore Runtime: {AGENT_ID}\")\n",
+    "    except agentcore_control_deploy.exceptions.ConflictException:\n",
+    "        runtimes = agentcore_control_deploy.list_agent_runtimes()[\"agentRuntimes\"]\n",
+    "        existing = next((r for r in runtimes if r[\"agentRuntimeName\"] == AGENT_NAME), None)\n",
+    "        assert existing, f\"Runtime {AGENT_NAME} not found after conflict\"\n",
+    "        AGENT_ID = existing[\"agentRuntimeId\"]\n",
+    "        AGENT_ARN = existing[\"agentRuntimeArn\"]\n",
+    "        agentcore_control_deploy.update_agent_runtime(\n",
+    "            agentRuntimeId=AGENT_ID,\n",
+    "            agentRuntimeArtifact=artifact,\n",
+    "        )\n",
+    "        print(f\"Updated existing runtime: {AGENT_ID}\")\n",
+    "\n",
+    "    # Wait until READY\n",
+    "    terminal = {\"READY\", \"CREATE_FAILED\", \"UPDATE_FAILED\"}\n",
    "    while True:\n",
-    "        status = agent_runtime.status().endpoint[\"status\"]\n",
+    "        status = agentcore_control_deploy.get_agent_runtime(\n",
+    "            agentRuntimeId=AGENT_ID\n",
+    "        )[\"status\"]\n",
    "        print(f\"  Status: {status}\")\n",
    "        if status in terminal:\n",
    "            break\n",
    "        time.sleep(15)\n",
    "\n",
    "    assert status == \"READY\", f\"Deployment failed: {status}\"\n",
-    "\n",
-    "    AGENT_ID = launch_result.agent_id\n",
-    "    AGENT_ARN = launch_result.agent_arn\n",
    "    CW_LOG_GROUP = f\"/aws/bedrock-agentcore/runtimes/{AGENT_ID}-DEFAULT\"\n",
    "\n",
    "    print(\"\\nAgent deployed:\")\n",
@@ -226,7 +346,7 @@
    "    print(f\"  AGENT_ARN    : {AGENT_ARN}\")\n",
    "    print(f\"  CW_LOG_GROUP : {CW_LOG_GROUP}\")\n",
    "else:\n",
-    "    print(\"Using existing agent — skipping deployment.\")"
+    "    print(\"Using existing agent — skipping deployment.\")\n"
   ]
  },
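The "Wait until READY" loop in the cell above polls indefinitely if the runtime never reaches a terminal state. A bounded variant might look like the sketch below. The helper name `wait_for_terminal_status` and the 15-minute default timeout are assumptions for illustration; the terminal states and the 15-second poll interval are taken from the notebook cell.

```python
import time
from typing import Callable, Iterable

# Terminal states used by the notebook's deployment loop.
TERMINAL_STATES = frozenset({"READY", "CREATE_FAILED", "UPDATE_FAILED"})


def wait_for_terminal_status(
    get_status: Callable[[], str],
    terminal: Iterable[str] = TERMINAL_STATES,
    poll_seconds: float = 15.0,
    timeout_seconds: float = 900.0,
) -> str:
    """Poll get_status() until it returns a terminal state or the timeout expires."""
    terminal_set = set(terminal)
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status in terminal_set:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Still {status!r} after {timeout_seconds} seconds")
        time.sleep(poll_seconds)
```

In the notebook, `get_status` could be a lambda wrapping the existing call, `agentcore_control_deploy.get_agent_runtime(agentRuntimeId=AGENT_ID)["status"]`, with the existing `assert status == "READY"` applied to the returned value.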
  {
@@ -250,9 +370,18 @@
    "    print(f\"Note: agent is in {_arn_region}, overriding REGION={REGION}\")\n",
    "    REGION = _arn_region\n",
    "\n",
+    "# Align EVAL_REGION with the agent's region so that Lambda evaluators, evaluator\n",
+    "# registrations, and the online evaluation config all live in the same region as\n",
+    "# the agent's CloudWatch log group. Online evaluation requires the log group and\n",
+    "# the evaluators to be in the same region as the control-plane config.\n",
+    "if EVAL_REGION != REGION:\n",
+    "    print(f\"Aligning EVAL_REGION: {EVAL_REGION} → {REGION} (must match agent region for online eval)\")\n",
+    "    EVAL_REGION = REGION\n",
+    "\n",
    "# boto3 client for agent invocation — must be in the same region as the agent\n",
    "agentcore_client = boto3.client(\"bedrock-agentcore\", region_name=REGION)\n",
-    "print(f\"agentcore_client region: {REGION}\")"
+    "print(f\"REGION      : {REGION}\")\n",
+    "print(f\"EVAL_REGION : {EVAL_REGION}\")"
   ]
  },
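The region alignment in the cell above depends on reading the agent's region out of its ARN. As a minimal sketch, assuming the standard `arn:partition:service:region:account-id:resource` layout, the two derived values could be computed as follows. The helper names are hypothetical; the log group pattern is the one the notebook uses for `CW_LOG_GROUP`.

```python
def region_from_arn(arn: str) -> str:
    """Return the region field of an ARN (arn:partition:service:region:account:resource)."""
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"Not a valid ARN: {arn!r}")
    region = parts[3]
    if not region:
        # Global services such as IAM leave the region field empty.
        raise ValueError(f"ARN has no region field: {arn!r}")
    return region


def runtime_log_group(agent_id: str, endpoint: str = "DEFAULT") -> str:
    """Build the runtime log group name in the same form as CW_LOG_GROUP."""
    return f"/aws/bedrock-agentcore/runtimes/{agent_id}-{endpoint}"
```

Deriving both values from the agent's own identifiers, rather than from the notebook's default session, keeps the evaluators and the online evaluation config in the same region as the agent's log group.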
  {
@@ -1385,12 +1514,404 @@
    "print(f\"Results saved to: {results_path}\")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "axt5qz5c7lg",
+   "metadata": {},
+   "source": [
+    "## Step 10: Online Evaluation with Code-Based Evaluators\n",
+    "\n",
+    "**Online evaluation** continuously monitors your live agent traffic and automatically scores sessions\n",
+    "as they happen — no manual triggering required. You configure it once, and AgentCore watches your\n",
+    "agent's CloudWatch log stream, evaluating new sessions at a configurable sampling rate.\n",
+    "\n",
+    "In this step we reuse the **same code-based evaluators** (`HRResponseLength` and `HRFactChecker`)\n",
+    "we registered in Step 6. This demonstrates that a single evaluator registration can serve both\n",
+    "on-demand and online evaluation use cases.\n",
+    "\n",
+    "### How online evaluation works\n",
+    "\n",
+    "```\n",
+    "Agent invocation\n",
+    "      │\n",
+    "      ▼  (OTel spans → CloudWatch)\n",
+    "AgentCore Runtime log group\n",
+    "      │\n",
+    "      ▼  (online eval config watches the log group)\n",
+    "AgentCore Evaluations\n",
+    "      ├── Builtin LLM evaluators  → LLM inference\n",
+    "      └── Code-based evaluators   → your Lambda function\n",
+    "             │\n",
+    "             ▼\n",
+    "       Results in CloudWatch Logs\n",
+    "       /aws/bedrock-agentcore/evaluations/online-evaluations/...\n",
+    "```\n",
+    "\n",
+    "### Key differences from on-demand evaluation\n",
+    "\n",
+    "| | On-demand | Online |\n",
+    "|---|---|---|\n",
+    "| **Trigger** | Explicit API call per session | Automatic, event-driven |\n",
+    "| **Scope** | Specific session(s) you choose | All sessions (or a sampled %) |\n",
+    "| **Setup** | Call `EvaluationClient.run()` per session | Configure once with `create_online_evaluation_config` |\n",
+    "| **Evaluator locking** | No | Code-based evaluators become **locked** while the config is enabled |\n",
+    "| **Best for** | Ad-hoc checks, CI/CD pipelines | Continuous production monitoring |\n",
+    "\n",
+    "### IAM execution role\n",
+    "\n",
+    "Online evaluation requires an **evaluation execution role** — an IAM role that AgentCore Evaluations\n",
+    "assumes to invoke your Lambda evaluators and read CloudWatch spans. It must trust\n",
+    "`bedrock-agentcore.amazonaws.com` and have `lambda:InvokeFunction` + `logs:FilterLogEvents`\n",
+    "permissions.\n",
+    "\n",
+    "> **Evaluator locking:** When a code-based evaluator is referenced by an enabled online evaluation\n",
+    "> config, AgentCore automatically locks it to prevent accidental modification. To update the\n",
+    "> evaluator, first disable the online evaluation config (or delete it), then update the evaluator,\n",
+    "> then re-enable the config."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6v2u7hcmazi",
+   "metadata": {},
+   "source": [
+    "### Step 10a: Create IAM Evaluation Execution Role"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "m3n6no96fck",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "iam_client = boto3.client(\"iam\")\n",
+    "ONLINE_EVAL_ROLE_NAME = \"AgentCoreOnlineEvaluationRole\"\n",
+    "ONLINE_EVAL_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/{ONLINE_EVAL_ROLE_NAME}\"\n",
+    "\n",
+    "# Trust policy: allow AgentCore Evaluations service to assume this role\n",
+    "trust_policy = json.dumps({\n",
+    "    \"Version\": \"2012-10-17\",\n",
+    "    \"Statement\": [{\n",
+    "        \"Effect\": \"Allow\",\n",
+    "        \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
+    "        \"Action\": \"sts:AssumeRole\",\n",
+    "    }],\n",
+    "})\n",
+    "\n",
+    "# Inline permission policy: invoke Lambda evaluators, read spans, and write results in CloudWatch Logs.\n",
+    "# The online evaluation service requires:\n",
+    "#   READ  — FilterLogEvents, GetLogEvents, StartQuery, GetQueryResults on:\n",
+    "#             - agent runtime log group (/aws/bedrock-agentcore/runtimes/...)\n",
+    "#             - OTel spans log group (aws/spans — no leading slash)\n",
+    "#   WRITE — CreateLogGroup, CreateLogStream, PutLogEvents for writing evaluation results to:\n",
+    "#             - /aws/bedrock-agentcore/evaluations/results/<config-name>\n",
+    "eval_policy = json.dumps({\n",
+    "    \"Version\": \"2012-10-17\",\n",
+    "    \"Statement\": [\n",
+    "        {\n",
+    "            \"Sid\": \"InvokeLambdaEvaluators\",\n",
+    "            \"Effect\": \"Allow\",\n",
+    "            \"Action\": [\"lambda:InvokeFunction\", \"lambda:GetFunction\"],\n",
+    "            \"Resource\": [\n",
+    "                lambda_arn_response_length,\n",
+    "                lambda_arn_fact_checker,\n",
+    "            ],\n",
+    "        },\n",
+    "        {\n",
+    "            \"Sid\": \"CloudWatchLogsReadSpans\",\n",
+    "            \"Effect\": \"Allow\",\n",
+    "            \"Action\": [\n",
+    "                \"logs:FilterLogEvents\",\n",
+    "                \"logs:DescribeLogGroups\",\n",
+    "                \"logs:DescribeLogStreams\",\n",
+    "                \"logs:GetLogEvents\",\n",
+    "                \"logs:StartQuery\",\n",
+    "                \"logs:GetQueryResults\",\n",
+    "                \"logs:StopQuery\",\n",
+    "            ],\n",
+    "            \"Resource\": \"*\",\n",
+    "        },\n",
+    "        {\n",
+    "            \"Sid\": \"CloudWatchLogsWriteResults\",\n",
+    "            \"Effect\": \"Allow\",\n",
+    "            \"Action\": [\n",
+    "                \"logs:CreateLogGroup\",\n",
+    "                \"logs:CreateLogStream\",\n",
+    "                \"logs:PutLogEvents\",\n",
+    "            ],\n",
+    "            \"Resource\": f\"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:/aws/bedrock-agentcore/evaluations/*\",\n",
+    "        },\n",
+    "    ],\n",
+    "})\n",
+    "\n",
+    "try:\n",
+    "    iam_client.get_role(RoleName=ONLINE_EVAL_ROLE_NAME)\n",
+    "    print(f\"Using existing role: {ONLINE_EVAL_ROLE_ARN}\")\n",
+    "    iam_client.put_role_policy(\n",
+    "        RoleName=ONLINE_EVAL_ROLE_NAME,\n",
+    "        PolicyName=\"AgentCoreOnlineEvalPermissions\",\n",
+    "        PolicyDocument=eval_policy,\n",
+    "    )\n",
+    "    print(\"  Inline policy updated.\")\n",
+    "except iam_client.exceptions.NoSuchEntityException:\n",
+    "    print(f\"Creating IAM role: {ONLINE_EVAL_ROLE_NAME}...\")\n",
+    "    iam_client.create_role(\n",
+    "        RoleName=ONLINE_EVAL_ROLE_NAME,\n",
+    "        AssumeRolePolicyDocument=trust_policy,\n",
+    "        Description=\"Execution role for AgentCore online evaluation with code-based evaluators\",\n",
+    "    )\n",
+    "    iam_client.put_role_policy(\n",
+    "        RoleName=ONLINE_EVAL_ROLE_NAME,\n",
+    "        PolicyName=\"AgentCoreOnlineEvalPermissions\",\n",
+    "        PolicyDocument=eval_policy,\n",
+    "    )\n",
+    "    print(f\"Created: {ONLINE_EVAL_ROLE_ARN}\")\n",
+    "\n",
+    "print(f\"\\nOnline eval execution role: {ONLINE_EVAL_ROLE_ARN}\")\n",
+    "print(\"Waiting 10s for IAM propagation...\")\n",
+    "time.sleep(10)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "k2easqzclef",
+   "metadata": {},
+   "source": [
+    "### Step 10b: Create Online Evaluation Configuration\n",
+    "\n",
+    "We create an online evaluation config that monitors the HR assistant's live CloudWatch log group\n",
+    "and evaluates every session (100% sampling rate) with our two code-based evaluators.\n",
+    "\n",
+    "The config references:\n",
+    "- **`HRResponseLength`** (TRACE level) — evaluated per agent response turn\n",
+    "- **`HRFactChecker`** (SESSION level) — evaluated once per completed session\n",
+    "\n",
+    "> Once this config is **enabled**, both code-based evaluators are automatically **locked**\n",
+    "> and cannot be modified until the config is disabled or deleted."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e49zvxa0r0f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# Unique name for the online eval config (no hyphens — service regex: [a-zA-Z][a-zA-Z0-9_]{0,99})\n",
+    "ONLINE_EVAL_CONFIG_NAME = f\"hr_online_eval_{RUN_SUFFIX}\"\n",
+    "\n",
+    "# The OTel service name is <agentRuntimeName>.DEFAULT\n",
+    "# AGENT_ARN format: arn:aws:bedrock-agentcore:{region}:{account}:runtime/{id}\n",
+    "_runtime_id = AGENT_ARN.split(\"/\")[-1]  # e.g. hr_assistant_codeeval_tutorial-AbCdEfGhIj\n",
+    "_agent_runtime_name = _runtime_id.rsplit(\"-\", 1)[0]  # strip auto-generated suffix\n",
+    "OTEL_SERVICE_NAME = f\"{_agent_runtime_name}.DEFAULT\"\n",
+    "\n",
+    "print(f\"Online eval config name : {ONLINE_EVAL_CONFIG_NAME}\")\n",
+    "print(f\"Monitoring log group    : {CW_LOG_GROUP}\")\n",
+    "print(f\"OTel service name       : {OTEL_SERVICE_NAME}\")\n",
+    "print(f\"Evaluators              : {list(CODE_EVAL_IDS.keys())}\")\n",
+    "print()\n",
+    "\n",
+    "online_eval_resp = cp_client.create_online_evaluation_config(\n",
+    "    onlineEvaluationConfigName=ONLINE_EVAL_CONFIG_NAME,\n",
+    "    # Evaluate 100% of sessions; lower this in production to control cost\n",
+    "    rule={\"samplingConfig\": {\"samplingPercentage\": 100.0}},\n",
+    "    # Watch the agent's runtime CloudWatch log group for new OTel spans\n",
+    "    dataSourceConfig={\n",
+    "        \"cloudWatchLogs\": {\n",
+    "            \"logGroupNames\": [CW_LOG_GROUP],\n",
+    "            \"serviceNames\": [OTEL_SERVICE_NAME],\n",
+    "        }\n",
+    "    },\n",
+    "    # Code-based and builtin evaluators can be mixed freely; this config uses only the two code-based ones\n",
+    "    evaluators=[\n",
+    "        {\"evaluatorId\": CODE_EVAL_IDS[\"HRResponseLength\"]},\n",
+    "        {\"evaluatorId\": CODE_EVAL_IDS[\"HRFactChecker\"]},\n",
+    "    ],\n",
+    "    evaluationExecutionRoleArn=ONLINE_EVAL_ROLE_ARN,\n",
+    "    # enableOnCreate=True activates the config immediately on creation\n",
+    "    enableOnCreate=True,\n",
+    ")\n",
+    "\n",
+    "ONLINE_EVAL_CONFIG_ID = online_eval_resp[\"onlineEvaluationConfigId\"]\n",
+    "ONLINE_EVAL_CONFIG_ARN = online_eval_resp.get(\"onlineEvaluationConfigArn\", \"\")\n",
+    "\n",
+    "print(\"Online eval config created:\")\n",
+    "print(f\"  ID  : {ONLINE_EVAL_CONFIG_ID}\")\n",
+    "print(f\"  ARN : {ONLINE_EVAL_CONFIG_ARN}\")\n",
+    "print()\n",
+    "print(\"Evaluators are now LOCKED — they cannot be modified while this config is enabled.\")\n",
+    "print(\"To update an evaluator: disable this config → update evaluator → re-enable.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "1g6zs5hef32",
+   "metadata": {},
+   "source": [
+    "### Step 10c: Invoke Agent to Trigger Online Evaluation\n",
+    "\n",
+    "Now we invoke the HR assistant with a few turns. Because the online evaluation config is active\n",
+    "and watching the runtime log group, AgentCore will automatically evaluate each session as OTel\n",
+    "spans arrive in CloudWatch — no explicit evaluation API call needed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "u2fnzeuy4rd",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# Invoke the agent to generate fresh sessions that will be auto-evaluated online.\n",
+    "# We use two separate sessions to demonstrate per-session evaluation.\n",
+    "ONLINE_SESSION_IDS = [\n",
+    "    f\"online-eval-{uuid.uuid4()}\",\n",
+    "    f\"online-eval-{uuid.uuid4()}\",\n",
+    "]\n",
+    "\n",
+    "ONLINE_SESSION_TURNS = [\n",
+    "    # Session 1: PTO balance + policy lookup\n",
+    "    [\n",
+    "        \"What is the PTO balance for employee EMP-001?\",\n",
+    "        \"What is the company PTO policy?\",\n",
+    "    ],\n",
+    "    # Session 2: pay stub + benefits\n",
+    "    [\n",
+    "        \"Can you pull up the January 2026 pay stub for EMP-001?\",\n",
+    "        \"What health insurance options does the company offer?\",\n",
+    "    ],\n",
+    "]\n",
+    "\n",
+    "print(\"Invoking agent sessions (these will be auto-evaluated online)...\")\n",
+    "for session_id, turns in zip(ONLINE_SESSION_IDS, ONLINE_SESSION_TURNS):\n",
+    "    print(f\"\\n  Session: {session_id}\")\n",
+    "    for prompt in turns:\n",
+    "        print(f\"    > {prompt}\")\n",
+    "        reply = invoke_agent_simple(prompt, session_id)\n",
+    "        print(f\"    < {reply[:100]}...\")\n",
+    "\n",
+    "print(\"\\nBoth sessions invoked.\")\n",
+    "print(\"AgentCore will automatically evaluate them as spans arrive in CloudWatch.\")\n",
+    "print(\"Waiting 120s for CloudWatch ingestion + evaluation processing...\")\n",
+    "time.sleep(120)\n",
+    "print(\"Ready to check online evaluation results.\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "q5k1zfi9xmf",
+   "metadata": {},
+   "source": [
+    "### Step 10d: Retrieve and Display Online Evaluation Results\n",
+    "\n",
+    "Online evaluation results are written to CloudWatch Logs in the evaluations results log group.\n",
+    "We query the log group for evaluation events for our sessions and display the scores."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "qjcf1kvocj",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# Online evaluation results log group is in the same region as the evaluation config\n",
+    "# (which is REGION = EVAL_REGION after alignment in the agent-clients cell).\n",
+    "logs_client = boto3.client(\"logs\", region_name=REGION)\n",
+    "\n",
+    "ONLINE_EVAL_RESULTS_LOG_GROUP = \"/aws/bedrock-agentcore/evaluations/online-evaluations/results/default\"\n",
+    "\n",
+    "look_back_ms = int(time.time() * 1000) - (30 * 60 * 1000)  # last 30 minutes\n",
+    "\n",
+    "print(f\"Querying online eval results from: {ONLINE_EVAL_RESULTS_LOG_GROUP}\")\n",
+    "print(f\"Filtering for session IDs: {[s[:20] + '...' for s in ONLINE_SESSION_IDS]}\\n\")\n",
+    "\n",
+    "online_results = []\n",
+    "try:\n",
+    "    paginator = logs_client.get_paginator(\"filter_log_events\")\n",
+    "    for page in paginator.paginate(\n",
+    "        logGroupName=ONLINE_EVAL_RESULTS_LOG_GROUP,\n",
+    "        startTime=look_back_ms,\n",
+    "    ):\n",
+    "        for event in page.get(\"events\", []):\n",
+    "            try:\n",
+    "                log_entry = json.loads(event[\"message\"])\n",
+    "            except (json.JSONDecodeError, TypeError):\n",
+    "                continue\n",
+    "\n",
+    "            attrs = log_entry.get(\"attributes\", log_entry)\n",
+    "            session_id = attrs.get(\"session.id\", \"\")\n",
+    "\n",
+    "            if session_id not in ONLINE_SESSION_IDS:\n",
+    "                continue\n",
+    "\n",
+    "            online_results.append({\n",
+    "                \"session_id\": session_id,\n",
+    "                \"evaluator_name\": attrs.get(\"gen_ai.evaluation.name\", \"\"),\n",
+    "                \"score\": attrs.get(\"gen_ai.evaluation.score.value\"),\n",
+    "                \"label\": attrs.get(\"gen_ai.evaluation.score.label\", \"\"),\n",
+    "                \"explanation\": (attrs.get(\"gen_ai.evaluation.explanation\") or \"\")[:120],\n",
+    "            })\n",
+    "\n",
+    "except logs_client.exceptions.ResourceNotFoundException:\n",
+    "    print(f\"Note: Log group '{ONLINE_EVAL_RESULTS_LOG_GROUP}' not found yet.\")\n",
+    "    print(\"This is normal if no sessions have been evaluated yet.\")\n",
+    "    print(\"Results will appear here after AgentCore processes the first session.\")\n",
+    "\n",
+    "print(f\"Found {len(online_results)} online evaluation result event(s).\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b5z9kr4ic6r",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "# Display online evaluation results as a markdown table\n",
+    "name_by_id = {v: k for k, v in CODE_EVAL_IDS.items()}\n",
+    "\n",
+    "if online_results:\n",
+    "    rows = [\n",
+    "        \"| Session (truncated) | Evaluator | Score | Label | Explanation |\",\n",
+    "        \"|---|---|---|---|---|\",\n",
+    "    ]\n",
+    "    for r in online_results:\n",
+    "        short_session = r[\"session_id\"][:30] + \"...\"\n",
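+    "        # Map an evaluator ID to its friendly name when possible; otherwise keep the logged value\n",
+    "        evaluator = name_by_id.get(r[\"evaluator_name\"], r[\"evaluator_name\"]) or \"(unknown)\"\n",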
+    "        score = str(r[\"score\"]) if r[\"score\"] is not None else \"N/A\"\n",
+    "        label = r[\"label\"] or \"\"\n",
+    "        explanation = r[\"explanation\"].replace(\"\\n\", \" \")\n",
+    "        rows.append(f\"| `{short_session}` | **{evaluator}** | {score} | {label} | {explanation} |\")\n",
+    "    display(Markdown(\"### Online Evaluation Results\\n\\n\" + \"\\n\".join(rows)))\n",
+    "else:\n",
+    "    display(Markdown(\"\"\"### Online Evaluation Results\n",
+    "\n",
+    "> **No results yet.** Online evaluation is asynchronous — AgentCore may still be processing the\n",
+    "> sessions. Try re-running this cell after another 60–120 seconds.\n",
+    ">\n",
+    "> You can also check the AgentCore console or inspect the results log group printed below\n",
+    "> directly once events arrive.\n",
+    "\"\"\"))\n",
+    "    print(f\"Log group to monitor: {ONLINE_EVAL_RESULTS_LOG_GROUP}\")\n",
+    "    print(f\"Session IDs invoked  : {ONLINE_SESSION_IDS}\")\n"
+   ]
+  },
  {
   "cell_type": "markdown",
   "id": "cleanup-md",
   "metadata": {},
   "source": [
-    "## Step 10: Cleanup\n",
+    "## Step 11: Cleanup\n",
    "\n",
    "Delete created resources to avoid ongoing charges."
   ]
@@ -1409,8 +1930,25 @@
   },
   "outputs": [],
   "source": [
+    "\n",
    "# Uncomment to clean up resources\n",
    "\n",
+    "# # Disable and delete online evaluation config (must disable before deleting locked evaluators)\n",
+    "# try:\n",
+    "#     cp_client.update_online_evaluation_config(\n",
+    "#         onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID,\n",
+    "#         executionStatus=\"DISABLED\",\n",
+    "#     )\n",
+    "#     print(f\"Disabled online eval config: {ONLINE_EVAL_CONFIG_ID}\")\n",
+    "# except Exception as e:\n",
+    "#     print(f\"Could not disable online eval config: {e}\")\n",
+    "#\n",
+    "# try:\n",
+    "#     cp_client.delete_online_evaluation_config(onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID)\n",
+    "#     print(f\"Deleted online eval config: {ONLINE_EVAL_CONFIG_ID}\")\n",
+    "# except Exception as e:\n",
+    "#     print(f\"Could not delete online eval config: {e}\")\n",
+    "\n",
    "# # Delete Lambda functions\n",
    "# for fn in [\"hr-response-length\", \"hr-fact-checker\"]:\n",
    "#     try:\n",
@@ -1419,7 +1957,7 @@
    "#     except Exception as e:\n",
    "#         print(f\"Could not delete {fn}: {e}\")\n",
    "\n",
-    "# # Delete evaluator records\n",
+    "# # Delete evaluator records (only possible after online eval config is deleted/disabled)\n",
    "# for name, eid in CODE_EVAL_IDS.items():\n",
    "#     try:\n",
    "#         cp_client.delete_evaluator(evaluatorId=eid)\n",
@@ -1429,10 +1967,11 @@
    "\n",
    "# # Delete agent runtime (only if deployed in this notebook)\n",
    "# if not _agent_loaded:\n",
-    "#     agent_runtime.delete()\n",
-    "#     print(\"Agent runtime deleted.\")\n",
+    "#     agentcore_control_deploy = boto3.client(\"bedrock-agentcore-control\", region_name=REGION)\n",
+    "#     agentcore_control_deploy.delete_agent_runtime(agentRuntimeId=AGENT_ID)\n",
+    "#     print(f\"Deleted agent runtime: {AGENT_ID}\")\n",
    "\n",
-    "print(\"Cleanup skipped. Uncomment the cells above to delete resources.\")"
+    "print(\"Cleanup skipped. Uncomment the code above to delete resources.\")\n"
   ]
  },
  {
@@ -1442,24 +1981,25 @@
   "source": [
    "## Summary\n",
    "\n",
-    "You've created two Lambda-backed code-based evaluators and run them in two ways:\n",
+    "You've created two Lambda-backed code-based evaluators and run them in three ways:\n",
    "\n",
    "**Step 7 — On-Demand Evaluation (`EvaluationClient`)**: evaluated a specific production session\n",
    "with a mix of builtin LLM evaluators and code-based evaluators.\n",
    "\n",
-    "**Step 8 — `OnDemandEvaluationDatasetRunner`**: automatically invoked the agent across a dataset and scored\n",
-    "each scenario with the full mixed evaluator set.\n",
+    "**Step 8 — `OnDemandEvaluationDatasetRunner`**: automatically invoked the agent across a dataset\n",
+    "and scored each scenario with the full mixed evaluator set.\n",
    "\n",
-    "| Evaluator | Type | Level | What it measured |\n",
+    "**Step 10 — Online Evaluation (`create_online_evaluation_config`)**: deployed a continuous\n",
+    "evaluation config that automatically scores every live session as OTel spans arrive in CloudWatch.\n",
+    "No per-session API calls needed.\n",
+    "\n",
+    "| Evaluator | Type | Level | Used in |\n",
    "|---|---|---|---|\n",
-    "| `Builtin.Correctness` | LLM | TRACE | Semantic similarity to expected response |\n",
-    "| `Builtin.Helpfulness` | LLM | TRACE | Response helpfulness |\n",
-    "| `Builtin.ResponseRelevance` | LLM | TRACE | Relevance to the user's question |\n",
-    "| `HRResponseLength` | Code | TRACE | Response length within 50–600 chars |\n",
-    "| `HRFactChecker` | Code | SESSION | Factual accuracy of PTO, pay stub, policy data |\n",
-    "\n",
-    "> **Note:** Code-based evaluators are supported for **on-demand evaluation only**.\n",
-    "> Online evaluation configs (`create_online_config`) support builtin LLM evaluators only.\n",
+    "| `Builtin.Correctness` | LLM | TRACE | On-demand (Steps 7 & 8) |\n",
+    "| `Builtin.Helpfulness` | LLM | TRACE | On-demand (Step 8) |\n",
+    "| `Builtin.ResponseRelevance` | LLM | TRACE | On-demand (Step 8) |\n",
+    "| `HRResponseLength` | Code | TRACE | On-demand **and** Online (Steps 7, 8, 10) |\n",
+    "| `HRFactChecker` | Code | SESSION | On-demand **and** Online (Steps 7, 8, 10) |\n",
    "\n",
    "### When to use code-based evaluators\n",
    "\n",
@@ -1468,11 +2008,23 @@
    "- **Business rule enforcement**: Encode domain-specific rules that LLMs might misinterpret\n",
    "- **High-volume evaluation**: Reduce cost for evaluations that run on every production session\n",
    "- **Regulatory requirements**: Ensure certain disclosures or disclaimers are always present\n",
+    "- **Continuous monitoring**: Combine with online evaluation for zero-touch production quality gates\n",
+    "\n",
+    "### On-demand vs. online evaluation summary\n",
+    "\n",
+    "| Dimension | On-demand | Online |\n",
+    "|---|---|---|\n",
+    "| Trigger | Explicit per session | Automatic for each sampled session |\n",
+    "| Setup | `EvaluationClient.run()` or `OnDemandEvaluationDatasetRunner` | `create_online_evaluation_config` once |\n",
+    "| Code-based evaluators | ✅ Supported | ✅ Supported |\n",
+    "| Evaluator locking | No | Yes — while config is enabled |\n",
+    "| Best for | CI/CD, ad-hoc debugging | Continuous production monitoring |\n",
    "\n",
    "### Next steps\n",
    "\n",
    "- Combine code-based evaluators with `EvaluationClient` to evaluate specific production sessions\n",
    "- Add code-based evaluators to your CI/CD pipeline for automated regression testing\n",
+    "- Use online evaluation with a lower sampling rate (e.g. 10%) to cost-effectively monitor high-traffic agents\n",
    "- Extend `HRFactChecker` with additional business rules as your agent evolves\n"
   ]
  },
@@ -1,5 +1,4 @@
 bedrock-agentcore>=1.6.0
-bedrock-agentcore-starter-toolkit>=0.3.0
 boto3>=1.42.0
 strands-agents
 strands-agents-tools