mirror of synced 2026-05-22 14:43:35 +00:00

adding online evaluation for custom code based evaluators and CLI examples (#1412)

This commit is contained in:
Bharathi Srinivasan
2026-04-28 11:48:28 -07:00
committed by GitHub
parent 2c0fdfc523
commit 4eafe85bc7
3 changed files with 805 additions and 55 deletions
@@ -4,7 +4,136 @@
This tutorial shows how to build and run **custom code-based evaluators** with Amazon Bedrock AgentCore Evaluations. Instead of relying on an LLM as the judge, code-based evaluators delegate scoring to an AWS Lambda function you write. This gives you deterministic, low-cost, fully customizable evaluation logic that can encode exact business rules, format constraints, or data validation requirements that an LLM might interpret loosely.
The tutorial pairs code-based evaluators with the built-in LLM evaluators from the [groundtruth tutorial](../05-groundtruth-based-evalautions/) to show how both types work side-by-side in a mixed evaluation run.
The tutorial demonstrates code-based evaluators in **both on-demand and online evaluation** modes, and pairs them with built-in LLM evaluators to show how both types work side-by-side in a mixed evaluation run.
---
## Setup with AgentCore CLI
The fastest way to bootstrap and deploy the agent is with the [AgentCore CLI](https://github.com/aws/agentcore-cli) (`0.11.0`).
### Prerequisites
- **Node.js** 20.x or later
- **uv** 0.4+ (Python package manager)
- **AWS CLI** 2.x with credentials configured
- **Docker** running locally (for agent container build)
- **Git** 2.x
### Install the CLI
```bash
npm install -g @aws/agentcore@0.11.0
agentcore --version # should print 0.11.0
```
### Configure AWS credentials
```bash
aws configure
aws sts get-caller-identity # verify credentials
```
Your IAM user/role needs permissions for: AgentCore Runtime, AgentCore Evaluations, Lambda,
CloudWatch Logs, ECR, IAM, and Bedrock.
### Create and deploy the agent
```bash
# Scaffold a new AgentCore project
agentcore create --name HRAssistant --framework Strands --model-provider Bedrock --defaults
# Copy the HR assistant implementation
cp hr_assistant_agent.py app/HRAssistant/main.py
# Test locally
agentcore dev
# Deploy to AWS (builds container, pushes to ECR, creates AgentCore Runtime)
agentcore deploy
```
After `agentcore deploy` completes, note the **Runtime ID** and **ARN** from the output.
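If you need these values again in Python (for example in the notebook), they can be derived from the ARN alone. The sketch below assumes the ARN layout `arn:aws:bedrock-agentcore:{region}:{account}:runtime/{id}` and the `-DEFAULT` log group suffix that the notebook uses. `RuntimeInfo` and `parse_runtime_arn` are illustrative names, not part of the CLI or the SDK.

```python
from typing import NamedTuple


class RuntimeInfo(NamedTuple):
    region: str
    account_id: str
    runtime_id: str
    log_group: str


def parse_runtime_arn(arn: str) -> RuntimeInfo:
    """Split an AgentCore Runtime ARN into the pieces used later in this tutorial."""
    # Assumed layout: arn:aws:bedrock-agentcore:{region}:{account}:runtime/{id}
    parts = arn.split(":", 5)
    if (
        len(parts) != 6
        or parts[2] != "bedrock-agentcore"
        or not parts[5].startswith("runtime/")
    ):
        raise ValueError(f"Not an AgentCore Runtime ARN: {arn!r}")
    runtime_id = parts[5].split("/", 1)[1]
    # Log group naming follows the notebook's convention for the DEFAULT endpoint.
    log_group = f"/aws/bedrock-agentcore/runtimes/{runtime_id}-DEFAULT"
    return RuntimeInfo(parts[3], parts[4], runtime_id, log_group)
```

Deriving the region from the ARN also avoids a mismatch between the agent's region and the region used for evaluators, which matters for online evaluation.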
### Register a code-based evaluator via CLI
`agentcore add evaluator` registers the evaluator in your project's `agentcore.json`. The evaluator
is created in AWS when you run `agentcore deploy`.
```bash
# Register a TRACE-level code-based evaluator
agentcore add evaluator \
--name HRResponseLength \
--level TRACE \
--type code-based \
--lambda-arn arn:aws:lambda:<region>:<account-id>:function:hr-response-length \
--timeout 30
# Register a SESSION-level code-based evaluator
agentcore add evaluator \
--name HRFactChecker \
--level SESSION \
--type code-based \
--lambda-arn arn:aws:lambda:<region>:<account-id>:function:hr-fact-checker \
--timeout 60
```
### Run on-demand evaluation via CLI
**Standalone mode** (no project needed) — use `--runtime-arn` and `--evaluator-arn` with the
full ARNs of already-deployed resources. This works from any directory:
```bash
agentcore run eval \
--runtime-arn <agent-runtime-arn> \
--evaluator-arn <hr-response-length-evaluator-arn> \
--evaluator-arn <hr-fact-checker-evaluator-arn> \
--session-id <session-id> \
--region <aws-region>
```
Mix code-based (`--evaluator-arn`) and built-in (`--evaluator`) evaluators in one command:
```bash
agentcore run eval \
--runtime-arn <agent-runtime-arn> \
--evaluator-arn <hr-response-length-evaluator-arn> \
--evaluator-arn <hr-fact-checker-evaluator-arn> \
--evaluator Builtin.Correctness \
--evaluator Builtin.Helpfulness \
--session-id <session-id> \
--region <aws-region>
```
**Project mode** (inside a deployed project directory) — use evaluator names from `agentcore.json`.
Requires `agentcore deploy` to have been run first:
```bash
agentcore run eval \
--runtime HRAssistant \
--evaluator HRResponseLength \
--evaluator HRFactChecker \
--session-id <session-id>
```
### Add online evaluation via CLI
`agentcore add online-eval` adds the config to `agentcore.json`; it is created in AWS on
`agentcore deploy`. Run from inside your project directory:
```bash
# sampling-rate is a percentage (0.01-100)
agentcore add online-eval \
--name hr_online_eval \
--runtime HRAssistant \
--evaluator HRResponseLength \
--evaluator HRFactChecker \
--sampling-rate 100 \
--enable-on-create
```
> You can also use the notebook (Step 10) to create the online eval config programmatically
> using the boto3 SDK, without needing a project directory.
---
@@ -56,6 +185,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
│ 2. Register evaluators via bedrock-agentcore-control │
│ 3a. On-demand: EvaluationClient.run(session_id, evaluator_ids) │
│ 3b. Dataset: OnDemandEvaluationDatasetRunner.run(dataset, agent_invoker) │
│ 3c. Online: create_online_evaluation_config (auto-evaluates all sessions) │
└────────────────┬────────────────────────────────────────────────────────────┘
┌───────────▼────────────┐ ┌──────────────────────────────┐
@@ -81,7 +211,8 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
1. Agent is invoked; OTel spans are written to CloudWatch
2. `EvaluationClient` or `OnDemandEvaluationDatasetRunner` collects spans from CloudWatch
3. The service calls each evaluator — builtin evaluators run LLM inference; code-based evaluators invoke your Lambda with the span payload
4. All results are aggregated and returned
4. For **online evaluation**, AgentCore continuously watches the log group and automatically evaluates new sessions without any explicit trigger
5. All results are aggregated and returned (on-demand) or written to the online evaluation results log group
---
@@ -91,11 +222,11 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
- **Docker** running locally (for agent container image build)
- **AWS credentials** with permissions for:
- `bedrock-agentcore:*` — runtime and evaluations
- `bedrock-agentcore-control:*` — evaluator registration
- `bedrock-agentcore-control:*` — evaluator registration and online eval config management
- `lambda:CreateFunction`, `lambda:UpdateFunctionCode`, `lambda:AddPermission`, `lambda:GetFunction`
- `logs:FilterLogEvents`, `logs:DescribeLogGroups` — CloudWatch span collection
- `ecr:*` — container image for the agent
- `iam:*` - auto-creating the agent execution role
- `iam:*` — creating execution roles for the agent and online evaluation
- **IAM role** named `AgentCoreLambdaExecutionRole` with `AWSLambdaBasicExecutionRole` attached
- **bedrock-agentcore >= 1.6.0** installed in the notebook kernel
@@ -109,6 +240,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
|---|---|
| `programmatic_evaluators.ipynb` | Main tutorial notebook (standalone, end-to-end) |
| `hr_assistant_agent.py` | HR Assistant Strands agent (same as groundtruth tutorial) |
| `Dockerfile` | Container definition for the agent (used by Step 3 fresh deploy and `agentcore deploy`) |
| `requirements.txt` | Python dependencies (`bedrock-agentcore>=1.6.0`) |
| `lambdas/hr_response_length/lambda_function.py` | Response length evaluator Lambda |
| `lambdas/hr_fact_checker/lambda_function.py` | HR fact-checking evaluator Lambda |
@@ -124,6 +256,7 @@ Checks that each agent response is between 50 and 600 characters. Responses shor
- **Level:** TRACE — evaluated once per agent response
- **Lambda:** `hr-response-length`
- **Returns:** `1.0` (PASS) if within range, `0.0` (FAIL) otherwise
- **Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)
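
The core of this check can be sketched in plain Python. This is not the Lambda shipped in `lambdas/hr_response_length/lambda_function.py`: it leaves out the SDK decorator and the `EvaluatorInput`/`EvaluatorOutput` types, the dictionary keys are hypothetical, and it assumes both bounds are inclusive.

```python
MIN_CHARS = 50
MAX_CHARS = 600


def score_response_length(response_text: str) -> dict:
    """Return 1.0/PASS when the response length is within range, else 0.0/FAIL."""
    length = len(response_text)
    # Assumption: both bounds are inclusive.
    passed = MIN_CHARS <= length <= MAX_CHARS
    return {
        "value": 1.0 if passed else 0.0,
        "label": "PASS" if passed else "FAIL",
        "explanation": (
            f"Response is {length} characters "
            f"(allowed: {MIN_CHARS}-{MAX_CHARS})."
        ),
    }
```

Because the check is a pure function of the response text, the same input always produces the same score, which is the main property that distinguishes it from an LLM judge.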
### HRFactChecker (SESSION level)
@@ -137,6 +270,7 @@ Deterministically validates that the HR assistant's responses contain accurate f
- PTO request ID format `PTO-2026-NNN`
- Policy facts: 15-day PTO accrual, 2-day advance notice, 401k 4% match, 90% health coverage
- **Returns:** fraction of applicable checks passed (0.0-1.0), labeled `PASS`, `PARTIAL`, `FAIL`, or `SKIP`
- **Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)
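
A minimal sketch of how such a score could be aggregated is shown below. It is not the code in `lambdas/hr_fact_checker/lambda_function.py`. The helper names are hypothetical, the label thresholds (all passed, none passed, anything in between) are an assumption, and returning no score for `SKIP` is also an assumption.

```python
import re
from typing import Optional, Sequence, Tuple

# Matches the PTO request ID format PTO-2026-NNN (exactly three digits).
PTO_REQUEST_ID = re.compile(r"\bPTO-2026-\d{3}\b")


def has_pto_request_id(text: str) -> bool:
    """True if the text contains a well-formed PTO request ID."""
    return PTO_REQUEST_ID.search(text) is not None


def aggregate_checks(
    results: Sequence[Optional[bool]],
) -> Tuple[Optional[float], str]:
    """Combine check outcomes; None marks a check that did not apply."""
    applicable = [r for r in results if r is not None]
    if not applicable:
        # Nothing in the session touched a fact this evaluator knows about.
        return None, "SKIP"
    score = sum(applicable) / len(applicable)
    # Assumed thresholds: all passed, none passed, anything in between.
    if score == 1.0:
        return score, "PASS"
    if score == 0.0:
        return score, "FAIL"
    return score, "PARTIAL"
```

Excluding non-applicable checks from the denominator keeps a session about pay stubs from being penalized for never mentioning PTO policy.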
---
@@ -156,6 +290,60 @@ Results from all five evaluators are collected per scenario, letting you compare
---
## Online Evaluation with Code-Based Evaluators
Step 10 of the notebook demonstrates **online evaluation** — a continuous evaluation mode where
AgentCore automatically evaluates every live agent session without explicit API calls per session.
### How it works
1. Register code-based evaluators (Steps 4-6, same as for on-demand)
2. Create an online evaluation config via `create_online_evaluation_config`:
- Point it at the agent's CloudWatch log group
- Set a sampling rate (0-100%)
- List the evaluator IDs (code-based and/or builtin)
- Provide an IAM execution role the service can assume
3. Enable the config — AgentCore starts watching the log group
4. Every new agent session is automatically evaluated
5. Results appear in the online evaluation results CloudWatch log group
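
Step 2 can be sketched as a small helper that assembles the request. The parameter names in the returned dictionary mirror the `create_online_evaluation_config` call in the notebook; the helper itself (`build_online_eval_request`) and its local validation are illustrative assumptions, and the service may enforce a stricter sampling range than the check here.

```python
from typing import Any, Dict, List


def build_online_eval_request(
    config_name: str,
    log_group: str,
    service_name: str,
    evaluator_ids: List[str],
    execution_role_arn: str,
    sampling_percentage: float = 100.0,
    enable_on_create: bool = True,
) -> Dict[str, Any]:
    """Assemble keyword arguments for create_online_evaluation_config."""
    # Local sanity checks only; the service performs its own validation.
    if not 0 <= sampling_percentage <= 100:
        raise ValueError("sampling_percentage must be between 0 and 100")
    if not evaluator_ids:
        raise ValueError("at least one evaluator ID is required")
    return {
        "onlineEvaluationConfigName": config_name,
        "rule": {
            "samplingConfig": {"samplingPercentage": float(sampling_percentage)}
        },
        "dataSourceConfig": {
            "cloudWatchLogs": {
                "logGroupNames": [log_group],
                "serviceNames": [service_name],
            }
        },
        "evaluators": [{"evaluatorId": eid} for eid in evaluator_ids],
        "evaluationExecutionRoleArn": execution_role_arn,
        "enableOnCreate": enable_on_create,
    }
```

Assuming `cp_client` is the boto3 `bedrock-agentcore-control` client from the notebook, the result would be passed as `cp_client.create_online_evaluation_config(**request)`. Building the request separately makes it easy to inspect or unit-test before anything is created in AWS.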
### Evaluator locking
When a code-based evaluator is referenced by an **enabled** online evaluation config, AgentCore
**locks** it automatically. You cannot modify or delete a locked evaluator. To update it:
```
disable/delete online eval config
update evaluator Lambda or re-register
re-create online eval config
```
### On-demand vs. online comparison
| Dimension | On-demand | Online |
|---|---|---|
| Trigger | Explicit per session | Automatic on every invocation |
| Setup | `EvaluationClient.run()` or `OnDemandEvaluationDatasetRunner` | `create_online_evaluation_config` once |
| Code-based evaluators | ✅ Supported | ✅ Supported |
| Evaluator locking | No | Yes — while config is enabled |
| Best for | CI/CD, ad-hoc debugging | Continuous production monitoring |
### AgentCore CLI shortcut
```bash
# sampling-rate is a percentage (0.01-100); 50 = evaluate 50% of sessions
agentcore add online-eval \
--name my_online_eval \
--runtime MyAgent \
--evaluator MyCodeEvaluator \
--sampling-rate 50 \
--enable-on-create
```
---
## Sample Prompts
The dataset includes five scenarios that exercise facts the `HRFactChecker` validates:
@@ -178,14 +366,15 @@ You can extend the dataset with additional scenarios to test more HR topics (rem
|---|---|
| 1 | Install dependencies (`bedrock-agentcore>=1.6.0`) |
| 2 | Configure AWS session, region, and Lambda role ARN |
| 3 | Agent setup — reload from `%store` (groundtruth notebook) or deploy fresh |
| 3 | Agent setup — reload from `%store` (groundtruth notebook) or deploy fresh with boto3 |
| 4 | Define Lambda evaluator functions using the `@custom_code_based_evaluator()` decorator |
| 5 | Deploy Lambda functions (bundled with bedrock-agentcore SDK + pydantic) |
| 6 | Register evaluators via `bedrock-agentcore-control` boto3 service |
| 7 | On-demand evaluation with `EvaluationClient` (code-based + builtin evaluators) |
| 8 | Dataset evaluation with `OnDemandEvaluationDatasetRunner` (mixed evaluator set) |
| 9 | Inspect and compare results (per-scenario tables + aggregate score comparison) |
| 10 | Cleanup — delete Lambda functions, evaluator records, and agent runtime |
| **10** | **Online evaluation with `create_online_evaluation_config` (code-based evaluators, auto-triggered)** |
| 11 | Cleanup — delete Lambda functions, evaluator records, online eval config, and agent runtime |
---
@@ -213,8 +402,10 @@ span.span_events[*]
- **Business rule enforcement** — encode domain-specific rules that LLMs might interpret loosely
- **High-volume evaluation** — reduce cost for evaluations that run on every production session
- **Regulatory requirements** — verify that required disclosures or disclaimers are always present
- **Continuous monitoring** — combine with online evaluation for zero-touch production quality gates
> **Note:** Code-based evaluators are supported for **on-demand evaluation** (`EvaluationClient`, `OnDemandEvaluationDatasetRunner`) only. Online evaluation configs support built-in LLM evaluators only.
Code-based evaluators are supported for **both on-demand** (`EvaluationClient`,
`OnDemandEvaluationDatasetRunner`) and **online** (`create_online_evaluation_config`) evaluation.
---
@@ -223,20 +414,27 @@ span.span_events[*]
To remove created AWS resources:
```python
# Delete Lambda functions
# 1. Disable online evaluation config first (unlocks evaluators)
cp_client.update_online_evaluation_config(
onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID,
enableOnCreate=False,
)
cp_client.delete_online_evaluation_config(onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID)
# 2. Delete Lambda functions
for fn in ["hr-response-length", "hr-fact-checker"]:
lambda_client.delete_function(FunctionName=fn)
# Delete evaluator registrations
# 3. Delete evaluator registrations (now unlocked)
for name, eid in CODE_EVAL_IDS.items():
cp_client.delete_evaluator(evaluatorId=eid)
# Delete agent runtime (only if deployed in this notebook)
# 4. Delete agent runtime (only if deployed in this notebook)
if not _agent_loaded:
agent_runtime.delete()
agentcore_control.delete_agent_runtime(agentRuntimeId=AGENT_ID)
```
Alternatively, run the cleanup cell (Step 10) in the notebook — it is commented out by default to prevent accidental deletion.
Alternatively, run the cleanup cell (Step 11) in the notebook — it is commented out by default to prevent accidental deletion.
---
@@ -245,4 +443,5 @@ Alternatively, run the cleanup cell (Step 10) in the notebook — it is commente
- Extend `HRFactChecker` with additional business rules as your agent and data model evolve
- Combine code-based evaluators with `EvaluationClient` to validate specific production sessions
- Add code-based evaluators to your CI/CD pipeline for zero-cost regression testing on every deployment
- Use online evaluation with a lower sampling rate (e.g. 10%) to cost-effectively monitor high-traffic agents
- Explore the [groundtruth tutorial](../05-groundtruth-based-evalautions/) for `EvaluationClient` and ground-truth-based evaluations with built-in evaluators
@@ -30,7 +30,7 @@
"| **HRResponseLength** | TRACE | `hr-response-length` | Response length is 50-600 chars |\n",
"| **HRFactChecker** | SESSION | `hr-fact-checker` | PTO balances, pay stubs, and policy facts are accurate |\n",
"\n",
"Then we'll run `OnDemandEvaluationDatasetRunner` with a **mixed evaluator set** combining these code-based evaluators with built-in LLM-as-a-Judge evaluators.\n",
"Then we'll run `OnDemandEvaluationDatasetRunner` with a **mixed evaluator set** combining these code-based evaluators with built-in LLM-as-a-Judge evaluators. We will also set up online evaluation using these evaluators for live monitoring.\n",
"\n",
"### Tutorial Details\n",
"\n",
@@ -111,7 +111,27 @@
"from botocore.config import Config\n",
"from IPython.display import display, Markdown\n",
"\n",
"REGION = \"aws_region\" # Add AWS region here \n",
"# ── Region configuration ──────────────────────────────────────────────────────\n",
"# REGION: the AWS region where the AgentCore Runtime (agent) is deployed.\n",
"# Auto-detected from the boto3 session (reads AWS_DEFAULT_REGION env var or\n",
"# the default region in ~/.aws/config). Set explicitly if needed, e.g.:\n",
"# REGION = \"us-east-1\"\n",
"#\n",
"# If you ran groundtruth_evaluations.ipynb first, REGION is also restored\n",
"# from %store in the agent-load cell below, overriding this value.\n",
"REGION = Session().region_name\n",
"assert REGION, (\n",
" \"No AWS region detected. Set AWS_DEFAULT_REGION or configure a default \"\n",
" \"region in ~/.aws/config, or set REGION explicitly above.\"\n",
")\n",
"\n",
"# EVAL_REGION: region for Lambda evaluators and evaluator registrations.\n",
"# For online evaluation, this MUST match REGION (the agent's CloudWatch log\n",
"# group and the evaluation config must be in the same region). The\n",
"# agent-clients cell below aligns EVAL_REGION to REGION automatically\n",
"# after %store restores the agent's actual region.\n",
"EVAL_REGION = REGION\n",
"\n",
"boto_session = Session(region_name=REGION)\n",
"ACCOUNT_ID = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n",
"\n",
@@ -119,13 +139,10 @@
"# Update this if your role has a different name\n",
"LAMBDA_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/AgentCoreLambdaExecutionRole\"\n",
"\n",
"# Evaluation region — Lambda evaluator functions and evaluator registrations must be here.\n",
"EVAL_REGION = \"aws_region\" # Set AWS Region here\n",
"\n",
"print(f\"Region : {REGION}\")\n",
"print(f\"Eval Region : {EVAL_REGION}\")\n",
"print(f\"Account : {ACCOUNT_ID}\")\n",
"print(f\"Lambda Role ARN : {LAMBDA_ROLE_ARN}\")\n",
"print(f\"Eval Region : {EVAL_REGION}\")"
"print(f\"Lambda Role ARN : {LAMBDA_ROLE_ARN}\")"
]
},
{
@@ -191,34 +208,137 @@
},
"outputs": [],
"source": [
"\n",
"# Deploy agent if not already loaded\n",
"if not _agent_loaded:\n",
" from bedrock_agentcore_starter_toolkit import Runtime\n",
" # -------------------------------------------------------------------------\n",
" # Fresh deployment using boto3 (bedrock-agentcore-control) + Docker/ECR.\n",
" # This path runs only when the groundtruth notebook has NOT been executed\n",
" # first. If you prefer the CLI, run `agentcore deploy` from the project\n",
" # root instead and set AGENT_ID / AGENT_ARN / CW_LOG_GROUP manually below.\n",
" # -------------------------------------------------------------------------\n",
"\n",
" agent_runtime = Runtime()\n",
" agent_runtime.configure(\n",
" entrypoint=\"hr_assistant_agent.py\",\n",
" requirements_file=\"requirements.txt\",\n",
" auto_create_execution_role=True,\n",
" auto_create_ecr=True,\n",
" region=REGION,\n",
" agent_name=\"hr_assistant_codeeval_tutorial\",\n",
" idle_timeout=120,\n",
" ecr_client = boto3.client(\"ecr\", region_name=REGION)\n",
" agentcore_control_deploy = boto3.client(\"bedrock-agentcore-control\", region_name=REGION)\n",
" iam_client = boto3.client(\"iam\")\n",
"\n",
" AGENT_NAME = \"hr_assistant_codeeval_tutorial\"\n",
" ECR_REPO_NAME = f\"agentcore-{AGENT_NAME}\"\n",
"\n",
" # ------------------------------------------------------------------\n",
" # 1. Ensure IAM execution role exists\n",
" # ------------------------------------------------------------------\n",
" EXECUTION_ROLE_NAME = \"AgentCoreRuntimeExecutionRole\"\n",
" EXECUTION_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/{EXECUTION_ROLE_NAME}\"\n",
"\n",
" try:\n",
" iam_client.get_role(RoleName=EXECUTION_ROLE_NAME)\n",
" print(f\"Using existing IAM role: {EXECUTION_ROLE_ARN}\")\n",
" except iam_client.exceptions.NoSuchEntityException:\n",
" print(f\"Creating IAM role: {EXECUTION_ROLE_NAME}...\")\n",
" trust_policy = json.dumps({\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [{\n",
" \"Effect\": \"Allow\",\n",
" \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
" \"Action\": \"sts:AssumeRole\",\n",
" }],\n",
" })\n",
" iam_client.create_role(\n",
" RoleName=EXECUTION_ROLE_NAME,\n",
" AssumeRolePolicyDocument=trust_policy,\n",
" Description=\"Execution role for AgentCore Runtime tutorial agents\",\n",
" )\n",
" for policy_arn in [\n",
" \"arn:aws:iam::aws:policy/AmazonBedrockFullAccess\",\n",
" \"arn:aws:iam::aws:policy/CloudWatchLogsFullAccess\",\n",
" ]:\n",
" iam_client.attach_role_policy(RoleName=EXECUTION_ROLE_NAME, PolicyArn=policy_arn)\n",
" print(f\"Created: {EXECUTION_ROLE_ARN}\")\n",
" time.sleep(10) # allow IAM propagation\n",
"\n",
" # ------------------------------------------------------------------\n",
" # 2. Create ECR repository (or reuse existing)\n",
" # ------------------------------------------------------------------\n",
" try:\n",
" ecr_resp = ecr_client.create_repository(repositoryName=ECR_REPO_NAME)\n",
" ECR_REPO_URI = ecr_resp[\"repository\"][\"repositoryUri\"]\n",
" print(f\"Created ECR repo: {ECR_REPO_URI}\")\n",
" except ecr_client.exceptions.RepositoryAlreadyExistsException:\n",
" ECR_REPO_URI = ecr_client.describe_repositories(\n",
" repositoryNames=[ECR_REPO_NAME]\n",
" )[\"repositories\"][0][\"repositoryUri\"]\n",
" print(f\"Using existing ECR repo: {ECR_REPO_URI}\")\n",
"\n",
" IMAGE_URI = f\"{ECR_REPO_URI}:latest\"\n",
"\n",
" # ------------------------------------------------------------------\n",
" # 3. Build Docker image and push to ECR\n",
" # ------------------------------------------------------------------\n",
" ecr_registry = f\"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com\"\n",
" print(\"Docker login to ECR...\")\n",
" subprocess.run(\n",
" f\"aws ecr get-login-password --region {REGION} | docker login --username AWS --password-stdin {ecr_registry}\",\n",
" shell=True, check=True,\n",
" )\n",
" launch_result = agent_runtime.launch()\n",
" print(\"Building Docker image (this may take a few minutes)...\")\n",
" subprocess.run([\"docker\", \"build\", \"--platform\", \"linux/amd64\", \"-t\", IMAGE_URI, \".\"], check=True)\n",
" print(\"Pushing image to ECR...\")\n",
" subprocess.run([\"docker\", \"push\", IMAGE_URI], check=True)\n",
" print(f\"Image pushed: {IMAGE_URI}\")\n",
"\n",
" terminal = {\"READY\", \"CREATE_FAILED\", \"DELETE_FAILED\", \"UPDATE_FAILED\"}\n",
" # Allow ECR pull from AgentCore\n",
" ecr_client.set_repository_policy(\n",
" repositoryName=ECR_REPO_NAME,\n",
" policyText=json.dumps({\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [{\n",
" \"Sid\": \"AllowAgentCorePull\",\n",
" \"Effect\": \"Allow\",\n",
" \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
" \"Action\": [\"ecr:GetDownloadUrlForLayer\", \"ecr:BatchGetImage\", \"ecr:BatchCheckLayerAvailability\"],\n",
" }],\n",
" }),\n",
" )\n",
"\n",
" # ------------------------------------------------------------------\n",
" # 4. Create (or update) AgentCore Runtime\n",
" # ------------------------------------------------------------------\n",
" artifact = {\"containerConfiguration\": {\"containerUri\": IMAGE_URI}}\n",
" try:\n",
" resp = agentcore_control_deploy.create_agent_runtime(\n",
" agentRuntimeName=AGENT_NAME,\n",
" agentRuntimeArtifact=artifact,\n",
" executionRoleArn=EXECUTION_ROLE_ARN,\n",
" networkConfiguration={\"networkMode\": \"PUBLIC\"},\n",
" )\n",
" AGENT_ID = resp[\"agentRuntimeId\"]\n",
" AGENT_ARN = resp[\"agentRuntimeArn\"]\n",
" print(f\"Created AgentCore Runtime: {AGENT_ID}\")\n",
" except agentcore_control_deploy.exceptions.ConflictException:\n",
" runtimes = agentcore_control_deploy.list_agent_runtimes()[\"agentRuntimes\"]\n",
" existing = next((r for r in runtimes if r[\"agentRuntimeName\"] == AGENT_NAME), None)\n",
" assert existing, f\"Runtime {AGENT_NAME} not found after conflict\"\n",
" AGENT_ID = existing[\"agentRuntimeId\"]\n",
" AGENT_ARN = existing[\"agentRuntimeArn\"]\n",
" agentcore_control_deploy.update_agent_runtime(\n",
" agentRuntimeId=AGENT_ID,\n",
" agentRuntimeArtifact=artifact,\n",
" )\n",
" print(f\"Updated existing runtime: {AGENT_ID}\")\n",
"\n",
" # Wait until READY\n",
" terminal = {\"READY\", \"CREATE_FAILED\", \"UPDATE_FAILED\"}\n",
" while True:\n",
" status = agent_runtime.status().endpoint[\"status\"]\n",
" status = agentcore_control_deploy.get_agent_runtime(\n",
" agentRuntimeId=AGENT_ID\n",
" )[\"status\"]\n",
" print(f\" Status: {status}\")\n",
" if status in terminal:\n",
" break\n",
" time.sleep(15)\n",
"\n",
" assert status == \"READY\", f\"Deployment failed: {status}\"\n",
"\n",
" AGENT_ID = launch_result.agent_id\n",
" AGENT_ARN = launch_result.agent_arn\n",
" CW_LOG_GROUP = f\"/aws/bedrock-agentcore/runtimes/{AGENT_ID}-DEFAULT\"\n",
"\n",
" print(\"\\nAgent deployed:\")\n",
@@ -226,7 +346,7 @@
" print(f\" AGENT_ARN : {AGENT_ARN}\")\n",
" print(f\" CW_LOG_GROUP : {CW_LOG_GROUP}\")\n",
"else:\n",
" print(\"Using existing agent — skipping deployment.\")"
" print(\"Using existing agent — skipping deployment.\")\n"
]
},
{
@@ -250,9 +370,18 @@
" print(f\"Note: agent is in {_arn_region}, overriding REGION={REGION}\")\n",
" REGION = _arn_region\n",
"\n",
"# Align EVAL_REGION with the agent's region so that Lambda evaluators, evaluator\n",
"# registrations, and the online evaluation config all live in the same region as\n",
"# the agent's CloudWatch log group. Online evaluation requires the log group and\n",
"# the evaluators to be in the same region as the control-plane config.\n",
"if EVAL_REGION != REGION:\n",
" print(f\"Aligning EVAL_REGION: {EVAL_REGION} → {REGION} (must match agent region for online eval)\")\n",
" EVAL_REGION = REGION\n",
"\n",
"# boto3 client for agent invocation — must be in the same region as the agent\n",
"agentcore_client = boto3.client(\"bedrock-agentcore\", region_name=REGION)\n",
"print(f\"agentcore_client region: {REGION}\")"
"print(f\"REGION : {REGION}\")\n",
"print(f\"EVAL_REGION : {EVAL_REGION}\")"
]
},
{
@@ -1385,12 +1514,404 @@
"print(f\"Results saved to: {results_path}\")"
]
},
{
"cell_type": "markdown",
"id": "axt5qz5c7lg",
"metadata": {},
"source": [
"## Step 10: Online Evaluation with Code-Based Evaluators\n",
"\n",
"**Online evaluation** continuously monitors your live agent traffic and automatically scores sessions\n",
"as they happen — no manual triggering required. You configure it once, and AgentCore watches your\n",
"agent's CloudWatch log stream, evaluating new sessions at a configurable sampling rate.\n",
"\n",
"In this step we reuse the **same code-based evaluators** (`HRResponseLength` and `HRFactChecker`)\n",
"we registered in Step 6. This demonstrates that a single evaluator registration can serve both\n",
"on-demand and online evaluation use cases.\n",
"\n",
"### How online evaluation works\n",
"\n",
"```\n",
"Agent invocation\n",
" │\n",
" ▼ (OTel spans → CloudWatch)\n",
"AgentCore Runtime log group\n",
" │\n",
" ▼ (online eval config watches the log group)\n",
"AgentCore Evaluations\n",
" ├── Builtin LLM evaluators → LLM inference\n",
" └── Code-based evaluators → your Lambda function\n",
" │\n",
" ▼\n",
" Results in CloudWatch Logs\n",
" /aws/bedrock-agentcore/evaluations/online-evaluations/...\n",
"```\n",
"\n",
"### Key differences from on-demand evaluation\n",
"\n",
"| | On-demand | Online |\n",
"|---|---|---|\n",
"| **Trigger** | Explicit API call per session | Automatic, event-driven |\n",
"| **Scope** | Specific session(s) you choose | All sessions (or a sampled %) |\n",
"| **Setup** | Call `EvaluationClient.run()` per session | Configure once with `create_online_evaluation_config` |\n",
"| **Evaluator locking** | No | Code-based evaluators become **locked** while the config is enabled |\n",
"| **Best for** | Ad-hoc checks, CI/CD pipelines | Continuous production monitoring |\n",
"\n",
"### IAM execution role\n",
"\n",
"Online evaluation requires an **evaluation execution role** — an IAM role that AgentCore Evaluations\n",
"assumes to invoke your Lambda evaluators and read CloudWatch spans. It must trust\n",
"`bedrock-agentcore.amazonaws.com` and have `lambda:InvokeFunction` + `logs:FilterLogEvents`\n",
"permissions.\n",
"\n",
"> **Evaluator locking:** When a code-based evaluator is referenced by an enabled online evaluation\n",
"> config, AgentCore automatically locks it to prevent accidental modification. To update the\n",
"> evaluator, first disable the online evaluation config (or delete it), then update the evaluator,\n",
"> then re-enable the config."
]
},
{
"cell_type": "markdown",
"id": "6v2u7hcmazi",
"metadata": {},
"source": [
"### Step 10a: Create IAM Evaluation Execution Role"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "m3n6no96fck",
"metadata": {},
"outputs": [],
"source": [
"\n",
"iam_client = boto3.client(\"iam\")\n",
"ONLINE_EVAL_ROLE_NAME = \"AgentCoreOnlineEvaluationRole\"\n",
"ONLINE_EVAL_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/{ONLINE_EVAL_ROLE_NAME}\"\n",
"\n",
"# Trust policy: allow AgentCore Evaluations service to assume this role\n",
"trust_policy = json.dumps({\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [{\n",
" \"Effect\": \"Allow\",\n",
" \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
" \"Action\": \"sts:AssumeRole\",\n",
" }],\n",
"})\n",
"\n",
"# Inline permission policy: invoke Lambda evaluators + full CloudWatch Logs access.\n",
"# The online evaluation service requires:\n",
"# READ — FilterLogEvents, GetLogEvents, StartQuery, GetQueryResults on:\n",
"# - agent runtime log group (/aws/bedrock-agentcore/runtimes/...)\n",
"# - OTel spans log group (aws/spans — no leading slash)\n",
"# WRITE — CreateLogGroup, CreateLogStream, PutLogEvents for writing evaluation results to:\n",
"# - /aws/bedrock-agentcore/evaluations/results/<config-name>\n",
"eval_policy = json.dumps({\n",
" \"Version\": \"2012-10-17\",\n",
" \"Statement\": [\n",
" {\n",
" \"Sid\": \"InvokeLambdaEvaluators\",\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\"lambda:InvokeFunction\", \"lambda:GetFunction\"],\n",
" \"Resource\": [\n",
" lambda_arn_response_length,\n",
" lambda_arn_fact_checker,\n",
" ],\n",
" },\n",
" {\n",
" \"Sid\": \"CloudWatchLogsReadSpans\",\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\n",
" \"logs:FilterLogEvents\",\n",
" \"logs:DescribeLogGroups\",\n",
" \"logs:DescribeLogStreams\",\n",
" \"logs:GetLogEvents\",\n",
" \"logs:StartQuery\",\n",
" \"logs:GetQueryResults\",\n",
" \"logs:StopQuery\",\n",
" ],\n",
" \"Resource\": \"*\",\n",
" },\n",
" {\n",
" \"Sid\": \"CloudWatchLogsWriteResults\",\n",
" \"Effect\": \"Allow\",\n",
" \"Action\": [\n",
" \"logs:CreateLogGroup\",\n",
" \"logs:CreateLogStream\",\n",
" \"logs:PutLogEvents\",\n",
" ],\n",
" \"Resource\": f\"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:/aws/bedrock-agentcore/evaluations/*\",\n",
" },\n",
" ],\n",
"})\n",
"\n",
"try:\n",
" iam_client.get_role(RoleName=ONLINE_EVAL_ROLE_NAME)\n",
" print(f\"Using existing role: {ONLINE_EVAL_ROLE_ARN}\")\n",
" iam_client.put_role_policy(\n",
" RoleName=ONLINE_EVAL_ROLE_NAME,\n",
" PolicyName=\"AgentCoreOnlineEvalPermissions\",\n",
" PolicyDocument=eval_policy,\n",
" )\n",
" print(\" Inline policy updated.\")\n",
"except iam_client.exceptions.NoSuchEntityException:\n",
" print(f\"Creating IAM role: {ONLINE_EVAL_ROLE_NAME}...\")\n",
" iam_client.create_role(\n",
" RoleName=ONLINE_EVAL_ROLE_NAME,\n",
" AssumeRolePolicyDocument=trust_policy,\n",
" Description=\"Execution role for AgentCore online evaluation with code-based evaluators\",\n",
" )\n",
" iam_client.put_role_policy(\n",
" RoleName=ONLINE_EVAL_ROLE_NAME,\n",
" PolicyName=\"AgentCoreOnlineEvalPermissions\",\n",
" PolicyDocument=eval_policy,\n",
" )\n",
" print(f\"Created: {ONLINE_EVAL_ROLE_ARN}\")\n",
"\n",
"print(f\"\\nOnline eval execution role: {ONLINE_EVAL_ROLE_ARN}\")\n",
"print(\"Waiting 10s for IAM propagation...\")\n",
"time.sleep(10)\n"
]
},
{
"cell_type": "markdown",
"id": "k2easqzclef",
"metadata": {},
"source": [
"### Step 10b: Create Online Evaluation Configuration\n",
"\n",
"We create an online evaluation config that monitors the HR assistant's live CloudWatch log group\n",
"and evaluates every session (100% sampling rate) with our two code-based evaluators.\n",
"\n",
"The config references:\n",
"- **`HRResponseLength`** (TRACE level) — evaluated per agent response turn\n",
"- **`HRFactChecker`** (SESSION level) — evaluated once per completed session\n",
"\n",
"> Once this config is **enabled**, both code-based evaluators are automatically **locked**\n",
"> and cannot be modified until the config is disabled or deleted."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e49zvxa0r0f",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Unique name for the online eval config (no hyphens — service regex: [a-zA-Z][a-zA-Z0-9_]{0,99})\n",
"ONLINE_EVAL_CONFIG_NAME = f\"hr_online_eval_{RUN_SUFFIX}\"\n",
"\n",
"# The OTel service name is <agentRuntimeName>.DEFAULT\n",
"# AGENT_ARN format: arn:aws:bedrock-agentcore:{region}:{account}:runtime/{id}\n",
"_runtime_id = AGENT_ARN.split(\"/\")[-1] # e.g. hr_assistant_codeeval_tutorial-AbCdEfGhIj\n",
"_agent_runtime_name = _runtime_id.rsplit(\"-\", 1)[0] # strip auto-generated suffix\n",
"OTEL_SERVICE_NAME = f\"{_agent_runtime_name}.DEFAULT\"\n",
"\n",
"print(f\"Online eval config name : {ONLINE_EVAL_CONFIG_NAME}\")\n",
"print(f\"Monitoring log group : {CW_LOG_GROUP}\")\n",
"print(f\"OTel service name : {OTEL_SERVICE_NAME}\")\n",
"print(f\"Evaluators : {list(CODE_EVAL_IDS.keys())}\")\n",
"print()\n",
"\n",
"online_eval_resp = cp_client.create_online_evaluation_config(\n",
" onlineEvaluationConfigName=ONLINE_EVAL_CONFIG_NAME,\n",
" # Evaluate 100% of sessions; lower this in production to control cost\n",
" rule={\"samplingConfig\": {\"samplingPercentage\": 100.0}},\n",
" # Watch the agent's runtime CloudWatch log group for new OTel spans\n",
" dataSourceConfig={\n",
" \"cloudWatchLogs\": {\n",
" \"logGroupNames\": [CW_LOG_GROUP],\n",
" \"serviceNames\": [OTEL_SERVICE_NAME],\n",
" }\n",
" },\n",
" # Code-based + builtin evaluators can be mixed freely\n",
" evaluators=[\n",
" {\"evaluatorId\": CODE_EVAL_IDS[\"HRResponseLength\"]},\n",
" {\"evaluatorId\": CODE_EVAL_IDS[\"HRFactChecker\"]},\n",
" ],\n",
" evaluationExecutionRoleArn=ONLINE_EVAL_ROLE_ARN,\n",
" # enableOnCreate=True activates the config immediately on creation\n",
" enableOnCreate=True,\n",
")\n",
"\n",
"ONLINE_EVAL_CONFIG_ID = online_eval_resp[\"onlineEvaluationConfigId\"]\n",
"ONLINE_EVAL_CONFIG_ARN = online_eval_resp.get(\"onlineEvaluationConfigArn\", \"\")\n",
"\n",
"print(f\"Online eval config created:\")\n",
"print(f\" ID : {ONLINE_EVAL_CONFIG_ID}\")\n",
"print(f\" ARN : {ONLINE_EVAL_CONFIG_ARN}\")\n",
"print()\n",
"print(\"Evaluators are now LOCKED — they cannot be modified while this config is enabled.\")\n",
"print(\"To update an evaluator: disable this config → update evaluator → re-enable.\")\n"
]
},
{
"cell_type": "markdown",
"id": "1g6zs5hef32",
"metadata": {},
"source": [
"### Step 10c: Invoke Agent to Trigger Online Evaluation\n",
"\n",
"Now we invoke the HR assistant with a few turns. Because the online evaluation config is active\n",
"and watching the runtime log group, AgentCore will automatically evaluate each session as OTel\n",
"spans arrive in CloudWatch — no explicit evaluation API call needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "u2fnzeuy4rd",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Invoke the agent to generate two fresh sessions that will be auto-evaluated online.\n",
"# Using separate sessions demonstrates per-session evaluation.\n",
"ONLINE_SESSION_IDS = [\n",
"    f\"online-eval-{uuid.uuid4()}\",\n",
"    f\"online-eval-{uuid.uuid4()}\",\n",
"]\n",
"\n",
"ONLINE_SESSION_TURNS = [\n",
"    # Session 1: PTO balance + policy lookup\n",
"    [\n",
"        \"What is the PTO balance for employee EMP-001?\",\n",
"        \"What is the company PTO policy?\",\n",
"    ],\n",
"    # Session 2: pay stub + benefits\n",
"    [\n",
"        \"Can you pull up the January 2026 pay stub for EMP-001?\",\n",
"        \"What health insurance options does the company offer?\",\n",
"    ],\n",
"]\n",
"\n",
"print(\"Invoking agent sessions (these will be auto-evaluated online)...\")\n",
"for session_id, turns in zip(ONLINE_SESSION_IDS, ONLINE_SESSION_TURNS):\n",
"    print(f\"\\n Session: {session_id}\")\n",
"    for prompt in turns:\n",
"        print(f\" > {prompt}\")\n",
"        reply = invoke_agent_simple(prompt, session_id)\n",
"        print(f\" < {reply[:100]}...\")\n",
"\n",
"print(\"\\nBoth sessions invoked.\")\n",
"print(\"AgentCore will automatically evaluate them as spans arrive in CloudWatch.\")\n",
"print(\"Waiting 120s for CloudWatch ingestion + evaluation processing...\")\n",
"time.sleep(120)\n",
"print(\"Ready to check online evaluation results.\")\n"
]
},
{
"cell_type": "markdown",
"id": "q5k1zfi9xmf",
"metadata": {},
"source": [
"### Step 10d: Retrieve and Display Online Evaluation Results\n",
"\n",
"Online evaluation results are written to CloudWatch Logs in the evaluations results log group.\n",
"We query the log group for evaluation events for our sessions and display the scores."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "qjcf1kvocj",
"metadata": {},
"outputs": [],
"source": [
"\n",
"import re as _re\n",
"\n",
"# Online evaluation results log group is in the same region as the evaluation config\n",
"# (which is REGION = EVAL_REGION after alignment in the agent-clients cell).\n",
"logs_client = boto3.client(\"logs\", region_name=REGION)\n",
"\n",
"ONLINE_EVAL_RESULTS_LOG_GROUP = \"/aws/bedrock-agentcore/evaluations/online-evaluations/results/default\"\n",
"\n",
"look_back_ms = int(time.time() * 1000) - (30 * 60 * 1000) # last 30 minutes\n",
"\n",
"print(f\"Querying online eval results from: {ONLINE_EVAL_RESULTS_LOG_GROUP}\")\n",
"print(f\"Filtering for session IDs: {[s[:20] + '...' for s in ONLINE_SESSION_IDS]}\\n\")\n",
"\n",
"online_results = []\n",
"try:\n",
" paginator = logs_client.get_paginator(\"filter_log_events\")\n",
" for page in paginator.paginate(\n",
" logGroupName=ONLINE_EVAL_RESULTS_LOG_GROUP,\n",
" startTime=look_back_ms,\n",
" ):\n",
" for event in page.get(\"events\", []):\n",
" try:\n",
" log_entry = json.loads(event[\"message\"])\n",
" except (json.JSONDecodeError, TypeError):\n",
" continue\n",
"\n",
" attrs = log_entry.get(\"attributes\", log_entry)\n",
" session_id = attrs.get(\"session.id\", \"\")\n",
"\n",
" if not any(sid == session_id for sid in ONLINE_SESSION_IDS):\n",
" continue\n",
"\n",
" online_results.append({\n",
" \"session_id\": session_id,\n",
" \"evaluator_name\": attrs.get(\"gen_ai.evaluation.name\", \"\"),\n",
" \"score\": attrs.get(\"gen_ai.evaluation.score.value\"),\n",
" \"label\": attrs.get(\"gen_ai.evaluation.score.label\", \"\"),\n",
" \"explanation\": (attrs.get(\"gen_ai.evaluation.explanation\") or \"\")[:120],\n",
" })\n",
"\n",
"except logs_client.exceptions.ResourceNotFoundException:\n",
" print(f\"Note: Log group '{ONLINE_EVAL_RESULTS_LOG_GROUP}' not found yet.\")\n",
" print(\"This is normal if no sessions have been evaluated yet.\")\n",
" print(\"Results will appear here after AgentCore processes the first session.\")\n",
"\n",
"print(f\"Found {len(online_results)} online evaluation result event(s).\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5z9kr4ic6r",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# Display online evaluation results as a markdown table\n",
"name_by_id = {v: k for k, v in CODE_EVAL_IDS.items()}\n",
"\n",
"if online_results:\n",
" rows = [\n",
" \"| Session (truncated) | Evaluator | Score | Label | Explanation |\",\n",
" \"|---|---|---|---|---|\",\n",
" ]\n",
" for r in online_results:\n",
" short_session = r[\"session_id\"][:30] + \"...\"\n",
" evaluator = r[\"evaluator_name\"] or \"(unknown)\"\n",
" score = str(r[\"score\"]) if r[\"score\"] is not None else \"N/A\"\n",
" label = r[\"label\"] or \"\"\n",
" explanation = r[\"explanation\"].replace(\"\\n\", \" \")\n",
" rows.append(f\"| `{short_session}` | **{evaluator}** | {score} | {label} | {explanation} |\")\n",
" display(Markdown(\"### Online Evaluation Results\\n\\n\" + \"\\n\".join(rows)))\n",
"else:\n",
" display(Markdown(\"\"\"### Online Evaluation Results\n",
"\n",
"> **No results yet.** Online evaluation is asynchronous — AgentCore may still be processing the\n",
"> sessions. Try re-running this cell after another 60120 seconds.\n",
">\n",
"> You can also check the AgentCore console or run the query below to inspect the results log group\n",
"> directly once events arrive.\n",
"\"\"\"))\n",
" print(f\"Log group to monitor: {ONLINE_EVAL_RESULTS_LOG_GROUP}\")\n",
" print(f\"Session IDs invoked : {ONLINE_SESSION_IDS}\")\n"
]
},
{
"cell_type": "markdown",
"id": "cleanup-md",
"metadata": {},
"source": [
"## Step 10: Cleanup\n",
"## Step 11: Cleanup\n",
"\n",
"Delete created resources to avoid ongoing charges."
]
@@ -1409,8 +1930,25 @@
},
"outputs": [],
"source": [
"\n",
"# Uncomment to clean up resources\n",
"\n",
"# # Disable and delete online evaluation config (must disable before deleting locked evaluators)\n",
"# try:\n",
"#     cp_client.update_online_evaluation_config(\n",
"#         onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID,\n",
"#         executionStatus=\"DISABLED\",  # enableOnCreate applies only at creation time\n",
"#     )\n",
"#     print(f\"Disabled online eval config: {ONLINE_EVAL_CONFIG_ID}\")\n",
"# except Exception as e:\n",
"#     print(f\"Could not disable online eval config: {e}\")\n",
"#\n",
"# try:\n",
"#     cp_client.delete_online_evaluation_config(onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID)\n",
"#     print(f\"Deleted online eval config: {ONLINE_EVAL_CONFIG_ID}\")\n",
"# except Exception as e:\n",
"#     print(f\"Could not delete online eval config: {e}\")\n",
"\n",
"# # Delete Lambda functions\n",
"# for fn in [\"hr-response-length\", \"hr-fact-checker\"]:\n",
"#     try:\n",
@@ -1419,7 +1957,7 @@
"#     except Exception as e:\n",
"#         print(f\"Could not delete {fn}: {e}\")\n",
"\n",
"# # Delete evaluator records\n",
"# # Delete evaluator records (only possible after online eval config is deleted/disabled)\n",
"# for name, eid in CODE_EVAL_IDS.items():\n",
"#     try:\n",
"#         cp_client.delete_evaluator(evaluatorId=eid)\n",
@@ -1429,10 +1967,11 @@
"\n",
"# # Delete agent runtime (only if deployed in this notebook)\n",
"# if not _agent_loaded:\n",
"#     agent_runtime.delete()\n",
"#     print(\"Agent runtime deleted.\")\n",
"#     agentcore_control_deploy = boto3.client(\"bedrock-agentcore-control\", region_name=REGION)\n",
"#     agentcore_control_deploy.delete_agent_runtime(agentRuntimeId=AGENT_ID)\n",
"#     print(f\"Deleted agent runtime: {AGENT_ID}\")\n",
"\n",
"print(\"Cleanup skipped. Uncomment the cells above to delete resources.\")"
"print(\"Cleanup skipped. Uncomment the cells above to delete resources.\")\n"
]
},
{
@@ -1442,24 +1981,25 @@
"source": [
"## Summary\n",
"\n",
"You've created two Lambda-backed code-based evaluators and run them in two ways:\n",
"You've created two Lambda-backed code-based evaluators and run them in three ways:\n",
"\n",
"**Step 7 — On-Demand Evaluation (`EvaluationClient`)**: evaluated a specific production session\n",
"with a mix of builtin LLM evaluators and code-based evaluators.\n",
"\n",
"**Step 8 — `OnDemandEvaluationDatasetRunner`**: automatically invoked the agent across a dataset and scored\n",
"each scenario with the full mixed evaluator set.\n",
"**Step 8 — `OnDemandEvaluationDatasetRunner`**: automatically invoked the agent across a dataset\n",
"and scored each scenario with the full mixed evaluator set.\n",
"\n",
"| Evaluator | Type | Level | What it measured |\n",
"**Step 10 — Online Evaluation (`create_online_evaluation_config`)**: deployed a continuous\n",
"evaluation config that automatically scores every live session as OTel spans arrive in CloudWatch.\n",
"No per-session API calls needed.\n",
"\n",
"| Evaluator | Type | Level | Used in |\n",
"|---|---|---|---|\n",
"| `Builtin.Correctness` | LLM | TRACE | Semantic similarity to expected response |\n",
"| `Builtin.Helpfulness` | LLM | TRACE | Response helpfulness |\n",
"| `Builtin.ResponseRelevance` | LLM | TRACE | Relevance to the user's question |\n",
"| `HRResponseLength` | Code | TRACE | Response length within 50-600 chars |\n",
"| `HRFactChecker` | Code | SESSION | Factual accuracy of PTO, pay stub, policy data |\n",
"\n",
"> **Note:** Code-based evaluators are supported for **on-demand evaluation only**.\n",
"> Online evaluation configs (`create_online_config`) support builtin LLM evaluators only.\n",
"| `Builtin.Correctness` | LLM | TRACE | On-demand (Steps 7 & 8) |\n",
"| `Builtin.Helpfulness` | LLM | TRACE | On-demand (Step 8) |\n",
"| `Builtin.ResponseRelevance` | LLM | TRACE | On-demand (Step 8) |\n",
"| `HRResponseLength` | Code | TRACE | On-demand **and** Online (Steps 7, 8, 10) |\n",
"| `HRFactChecker` | Code | SESSION | On-demand **and** Online (Steps 7, 8, 10) |\n",
"\n",
"### When to use code-based evaluators\n",
"\n",
@@ -1468,11 +2008,23 @@
"- **Business rule enforcement**: Encode domain-specific rules that LLMs might misinterpret\n",
"- **High-volume evaluation**: Reduce cost for evaluations that run on every production session\n",
"- **Regulatory requirements**: Ensure certain disclosures or disclaimers are always present\n",
"- **Continuous monitoring**: Combine with online evaluation for zero-touch production quality gates\n",
"\n",
"### On-demand vs. online evaluation summary\n",
"\n",
"| Dimension | On-demand | Online |\n",
"|---|---|---|\n",
"| Trigger | Explicit per session | Automatic for sessions selected by the sampling rate |\n",
"| Setup | `EvaluationClient.run()` or `OnDemandEvaluationDatasetRunner` | `create_online_evaluation_config` once |\n",
"| Code-based evaluators | ✅ Supported | ✅ Supported |\n",
"| Evaluator locking | No | Yes — while config is enabled |\n",
"| Best for | CI/CD, ad-hoc debugging | Continuous production monitoring |\n",
"\n",
"### Next steps\n",
"\n",
"- Combine code-based evaluators with `EvaluationClient` to evaluate specific production sessions\n",
"- Add code-based evaluators to your CI/CD pipeline for automated regression testing\n",
"- Use online evaluation with a lower sampling rate (e.g. 10%) to cost-effectively monitor high-traffic agents\n",
"- Extend `HRFactChecker` with additional business rules as your agent evolves\n"
]
},
@@ -1,5 +1,4 @@
bedrock-agentcore>=1.6.0
bedrock-agentcore-starter-toolkit>=0.3.0
boto3>=1.42.0
strands-agents
strands-agents-tools