adding online evaluation for custom code based evaluators and CLI examples (#1412)
committed by
GitHub
parent
2c0fdfc523
commit
4eafe85bc7
@@ -4,7 +4,136 @@
|
||||
|
||||
This tutorial shows how to build and run **custom code-based evaluators** with Amazon Bedrock AgentCore Evaluations. Instead of relying on an LLM as the judge, code-based evaluators delegate scoring to an AWS Lambda function you write. This gives you deterministic, low-cost, fully customizable evaluation logic that can encode exact business rules, format constraints, or data validation requirements that an LLM might interpret loosely.
|
||||
|
||||
The tutorial pairs code-based evaluators with the built-in LLM evaluators from the [groundtruth tutorial](../05-groundtruth-based-evalautions/) to show how both types work side-by-side in a mixed evaluation run.
|
||||
The tutorial demonstrates code-based evaluators in **both on-demand and online evaluation** modes, and pairs them with built-in LLM evaluators to show how both types work side-by-side in a mixed evaluation run.
|
||||
|
||||
---
|
||||
|
||||
## Setup with AgentCore CLI
|
||||
|
||||
The fastest way to bootstrap and deploy the agent is with the [AgentCore CLI](https://github.com/aws/agentcore-cli) (`0.11.0`).
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- **Node.js** 20.x or later
- **uv** 0.4+ (Python package manager)
- **AWS CLI** 2.x with credentials configured
- **Docker** running locally (for agent container build)
- **Git** 2.x
|
||||
|
||||
### Install the CLI
|
||||
|
||||
```bash
npm install -g @aws/agentcore@0.11.0
agentcore --version  # should print 0.11.0
```
|
||||
|
||||
### Configure AWS credentials
|
||||
|
||||
```bash
aws configure
aws sts get-caller-identity  # verify credentials
```
|
||||
|
||||
Your IAM user/role needs permissions for: AgentCore Runtime, AgentCore Evaluations, Lambda,
CloudWatch Logs, ECR, IAM, and Bedrock.
|
||||
|
||||
### Create and deploy the agent
|
||||
|
||||
```bash
# Scaffold a new AgentCore project
agentcore create --name HRAssistant --framework Strands --model-provider Bedrock --defaults

# Copy the HR assistant implementation
cp hr_assistant_agent.py app/HRAssistant/main.py

# Test locally
agentcore dev

# Deploy to AWS (builds container, pushes to ECR, creates AgentCore Runtime)
agentcore deploy
```
|
||||
|
||||
After `agentcore deploy` completes, note the **Runtime ID** and **ARN** from the output.
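
The region and the default CloudWatch log group can be derived from these two values. The following is a minimal sketch, assuming the standard `arn:partition:service:region:account:resource` ARN layout and the `<runtime-id>-DEFAULT` log group naming that the tutorial notebook uses. `describe_runtime` is a hypothetical helper name, and the ARN in the usage lines is a made-up example.

```python
def describe_runtime(runtime_arn: str, runtime_id: str) -> dict:
    """Split a runtime ARN and derive the default runtime log group name."""
    parts = runtime_arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not an ARN: {runtime_arn!r}")
    _, _partition, _service, region, account_id, _resource = parts
    return {
        "region": region,
        "account_id": account_id,
        # Naming pattern taken from the notebook's CW_LOG_GROUP assignment.
        "log_group": f"/aws/bedrock-agentcore/runtimes/{runtime_id}-DEFAULT",
    }


# Example with a made-up ARN and runtime ID.
example = describe_runtime(
    "arn:aws:bedrock-agentcore:us-east-1:123456789012:runtime/example-abc123",
    "example-abc123",
)
print(example)
```

The region matters later: the notebook requires the evaluators and the online evaluation config to live in the same region as the agent's log group.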
|
||||
|
||||
### Register a code-based evaluator via CLI
|
||||
|
||||
`agentcore add evaluator` registers the evaluator in your project's `agentcore.json`. The evaluator
is created in AWS when you run `agentcore deploy`.
|
||||
|
||||
```bash
# Register a TRACE-level code-based evaluator
agentcore add evaluator \
  --name HRResponseLength \
  --level TRACE \
  --type code-based \
  --lambda-arn arn:aws:lambda:<region>:<account-id>:function:hr-response-length \
  --timeout 30

# Register a SESSION-level code-based evaluator
agentcore add evaluator \
  --name HRFactChecker \
  --level SESSION \
  --type code-based \
  --lambda-arn arn:aws:lambda:<region>:<account-id>:function:hr-fact-checker \
  --timeout 60
```
|
||||
|
||||
### Run on-demand evaluation via CLI
|
||||
|
||||
**Standalone mode** (no project needed): use `--runtime-arn` and `--evaluator-arn` with the
full ARNs of already-deployed resources. This works from any directory:
|
||||
|
||||
```bash
agentcore run eval \
  --runtime-arn <agent-runtime-arn> \
  --evaluator-arn <hr-response-length-evaluator-arn> \
  --evaluator-arn <hr-fact-checker-evaluator-arn> \
  --session-id <session-id> \
  --region <aws-region>
```
|
||||
|
||||
Mix code-based evaluators (`--evaluator-arn`) with built-in evaluators (`--evaluator`) in one command:
|
||||
|
||||
```bash
agentcore run eval \
  --runtime-arn <agent-runtime-arn> \
  --evaluator-arn <hr-response-length-evaluator-arn> \
  --evaluator-arn <hr-fact-checker-evaluator-arn> \
  --evaluator Builtin.Correctness \
  --evaluator Builtin.Helpfulness \
  --session-id <session-id> \
  --region <aws-region>
```
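
When the same command has to run from a script (for example in CI), the argument list can be assembled programmatically. This is a small sketch that uses only the flags shown in the commands above. `build_run_eval_command` is a hypothetical helper, not part of the AgentCore CLI or SDK, and the placeholder values are the same ones used in the shell examples.

```python
import shlex


def build_run_eval_command(runtime_arn, session_id, region,
                           evaluator_arns=(), builtin_evaluators=()):
    """Assemble the standalone-mode `agentcore run eval` argument list."""
    cmd = ["agentcore", "run", "eval", "--runtime-arn", runtime_arn]
    for arn in evaluator_arns:        # code-based evaluators: one flag per ARN
        cmd += ["--evaluator-arn", arn]
    for name in builtin_evaluators:   # built-in evaluators, e.g. Builtin.Correctness
        cmd += ["--evaluator", name]
    cmd += ["--session-id", session_id, "--region", region]
    return cmd


cmd = build_run_eval_command(
    runtime_arn="<agent-runtime-arn>",
    session_id="<session-id>",
    region="<aws-region>",
    evaluator_arns=["<hr-response-length-evaluator-arn>",
                    "<hr-fact-checker-evaluator-arn>"],
    builtin_evaluators=["Builtin.Correctness", "Builtin.Helpfulness"],
)
# Print a shell-quoted form for inspection.
print(shlex.join(cmd))
```

Passing `cmd` as a list to `subprocess.run(cmd, check=True)` avoids shell quoting issues once the placeholders are replaced with real values.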
|
||||
|
||||
**Project mode** (inside a deployed project directory): use evaluator names from `agentcore.json`.
Requires `agentcore deploy` to have been run first:
|
||||
|
||||
```bash
agentcore run eval \
  --runtime HRAssistant \
  --evaluator HRResponseLength \
  --evaluator HRFactChecker \
  --session-id <session-id>
```
|
||||
|
||||
### Add online evaluation via CLI
|
||||
|
||||
`agentcore add online-eval` adds the config to `agentcore.json`; it is created in AWS on
`agentcore deploy`. Run from inside your project directory:
|
||||
|
||||
```bash
# sampling-rate is a percentage (0.01–100)
agentcore add online-eval \
  --name hr_online_eval \
  --runtime HRAssistant \
  --evaluator HRResponseLength \
  --evaluator HRFactChecker \
  --sampling-rate 100 \
  --enable-on-create
```
|
||||
|
||||
> You can also use the notebook (Step 10) to create the online eval config programmatically
> using the boto3 SDK, without needing a project directory.
|
||||
|
||||
---
|
||||
|
||||
@@ -56,6 +185,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
|
||||
│ 2. Register evaluators via bedrock-agentcore-control │
|
||||
│ 3a. On-demand: EvaluationClient.run(session_id, evaluator_ids) │
|
||||
│ 3b. Dataset: OnDemandEvaluationDatasetRunner.run(dataset, agent_invoker) │
|
||||
│ 3c. Online: create_online_evaluation_config (auto-evaluates all sessions) │
|
||||
└────────────────┬────────────────────────────────────────────────────────────┘
|
||||
│
|
||||
┌───────────▼────────────┐ ┌──────────────────────────────┐
|
||||
@@ -81,7 +211,8 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
|
||||
1. Agent is invoked; OTel spans are written to CloudWatch
|
||||
2. `EvaluationClient` or `OnDemandEvaluationDatasetRunner` collects spans from CloudWatch
|
||||
3. The service calls each evaluator — builtin evaluators run LLM inference; code-based evaluators invoke your Lambda with the span payload
|
||||
4. All results are aggregated and returned
|
||||
4. For **online evaluation**, AgentCore continuously watches the log group and automatically evaluates new sessions without any explicit trigger
|
||||
5. All results are aggregated and returned (on-demand) or written to the online evaluation results log group
|
||||
|
||||
---
|
||||
|
||||
@@ -91,11 +222,11 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
|
||||
- **Docker** running locally (for agent container image build)
|
||||
- **AWS credentials** with permissions for:
|
||||
- `bedrock-agentcore:*` — runtime and evaluations
|
||||
- `bedrock-agentcore-control:*` — evaluator registration
|
||||
- `bedrock-agentcore-control:*` — evaluator registration and online eval config management
|
||||
- `lambda:CreateFunction`, `lambda:UpdateFunctionCode`, `lambda:AddPermission`, `lambda:GetFunction`
|
||||
- `logs:FilterLogEvents`, `logs:DescribeLogGroups` — CloudWatch span collection
|
||||
- `ecr:*` — container image for the agent
|
||||
- `iam:*` — auto-creating the agent execution role
|
||||
- `iam:*` — creating execution roles for the agent and online evaluation
|
||||
- **IAM role** named `AgentCoreLambdaExecutionRole` with `AWSLambdaBasicExecutionRole` attached
|
||||
- **bedrock-agentcore >= 1.6.0** installed in the notebook kernel
|
||||
|
||||
@@ -109,6 +240,7 @@ def lambda_handler(input: EvaluatorInput, context) -> EvaluatorOutput:
|
||||
|---|---|
|
||||
| `programmatic_evaluators.ipynb` | Main tutorial notebook (standalone, end-to-end) |
|
||||
| `hr_assistant_agent.py` | HR Assistant Strands agent (same as groundtruth tutorial) |
|
||||
| `Dockerfile` | Container definition for the agent (used by Step 3 fresh deploy and `agentcore deploy`) |
|
||||
| `requirements.txt` | Python dependencies (`bedrock-agentcore>=1.6.0`) |
|
||||
| `lambdas/hr_response_length/lambda_function.py` | Response length evaluator Lambda |
|
||||
| `lambdas/hr_fact_checker/lambda_function.py` | HR fact-checking evaluator Lambda |
|
||||
@@ -124,6 +256,7 @@ Checks that each agent response is between 50 and 600 characters. Responses shor
|
||||
- **Level:** TRACE — evaluated once per agent response
|
||||
- **Lambda:** `hr-response-length`
|
||||
- **Returns:** `1.0` (PASS) if within range, `0.0` (FAIL) otherwise
|
||||
- **Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)
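
The scoring rule above is simple enough to sketch in isolation. The following is an illustration of the rule only, not the tutorial's Lambda: it assumes both bounds are inclusive and that length is measured on the raw response string, and the result keys (`value`, `label`, `explanation`) are hypothetical names rather than the SDK's output schema.

```python
MIN_CHARS = 50    # lower bound from the evaluator description
MAX_CHARS = 600   # upper bound from the evaluator description


def score_response_length(response_text: str) -> dict:
    """Return 1.0/PASS when the response length is within bounds, else 0.0/FAIL."""
    length = len(response_text)
    passed = MIN_CHARS <= length <= MAX_CHARS  # inclusive bounds (assumption)
    return {
        "value": 1.0 if passed else 0.0,
        "label": "PASS" if passed else "FAIL",
        "explanation": f"length={length}, allowed={MIN_CHARS}-{MAX_CHARS}",
    }


print(score_response_length("You have 12 PTO days remaining this year. " * 2))
```

Because the check is deterministic, the same response always yields the same score, which is the main contrast with the LLM-judged evaluators.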
|
||||
|
||||
### HRFactChecker (SESSION level)
|
||||
|
||||
@@ -137,6 +270,7 @@ Deterministically validates that the HR assistant's responses contain accurate f
|
||||
- PTO request ID format `PTO-2026-NNN`
|
||||
- Policy facts: 15-day PTO accrual, 2-day advance notice, 401k 4% match, 90% health coverage
|
||||
- **Returns:** fraction of applicable checks passed (0.0–1.0), labeled `PASS`, `PARTIAL`, `FAIL`, or `SKIP`
|
||||
- **Used in:** On-demand evaluation (Steps 7 & 8) and Online evaluation (Step 10)
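
The aggregation described above (fraction of applicable checks passed, with a label) can be sketched as follows. This is not the tutorial's `hr-fact-checker` code. It assumes that `NNN` means exactly three digits, that a check is "not applicable" when the response never mentions the topic, that `PASS` and `FAIL` mean all or none of the applicable checks passed, and that a `SKIP` result carries no numeric score. All function names are hypothetical.

```python
import re

# "PTO-2026-NNN" read as three digits (assumption).
PTO_ID_PATTERN = re.compile(r"\bPTO-2026-\d{3}\b")


def check_pto_id(text: str):
    """True/False if a PTO ID is mentioned, None if the check does not apply."""
    if "PTO-" not in text:
        return None
    return bool(PTO_ID_PATTERN.search(text))


def aggregate_checks(results) -> dict:
    """Combine per-check results (True, False, or None) into a score and label."""
    applicable = [r for r in results if r is not None]
    if not applicable:
        return {"value": None, "label": "SKIP"}
    fraction = sum(applicable) / len(applicable)
    if fraction == 1.0:
        label = "PASS"
    elif fraction == 0.0:
        label = "FAIL"
    else:
        label = "PARTIAL"
    return {"value": fraction, "label": label}


response = "Your request PTO-2026-014 was submitted."
print(aggregate_checks([check_pto_id(response)]))
```

Each policy fact listed above would become one more check function that returns `True`, `False`, or `None`, and its result would be appended to the list passed to `aggregate_checks`.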
|
||||
|
||||
---
|
||||
|
||||
@@ -156,6 +290,60 @@ Results from all five evaluators are collected per scenario, letting you compare
|
||||
|
||||
---
|
||||
|
||||
## Online Evaluation with Code-Based Evaluators
|
||||
|
||||
Step 10 of the notebook demonstrates **online evaluation**, a continuous evaluation mode in which
AgentCore automatically evaluates live agent sessions at the configured sampling rate, with no explicit API call per session.
|
||||
|
||||
### How it works
|
||||
|
||||
1. Register code-based evaluators (Steps 4–6, same as for on-demand)
2. Create an online evaluation config via `create_online_evaluation_config`:
   - Point it at the agent's CloudWatch log group
   - Set a sampling rate (0.01–100%)
   - List the evaluator IDs (code-based and/or builtin)
   - Provide an IAM execution role the service can assume
3. Enable the config so that AgentCore starts watching the log group
4. New agent sessions are evaluated automatically, subject to the sampling rate
5. Results appear in the online evaluation results CloudWatch log group
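
The execution role mentioned in step 2 needs a trust policy and a permissions policy. The sketch below builds both as JSON strings, based on what the notebook states: the role trusts `bedrock-agentcore.amazonaws.com` and has `lambda:InvokeFunction` plus `logs:FilterLogEvents`. It is a minimal illustration; the notebook's own policy grants further CloudWatch Logs read and write actions, which are omitted here. `build_online_eval_role_policies` is a hypothetical helper and the Lambda ARNs in the usage lines are made-up examples.

```python
import json

SERVICE_PRINCIPAL = "bedrock-agentcore.amazonaws.com"


def build_online_eval_role_policies(lambda_arns):
    """Return (trust_policy_json, permissions_policy_json) for the eval role."""
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": SERVICE_PRINCIPAL},
            "Action": "sts:AssumeRole",
        }],
    }
    permissions_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {   # allow the service to call the evaluator Lambdas
                "Sid": "InvokeLambdaEvaluators",
                "Effect": "Allow",
                "Action": ["lambda:InvokeFunction"],
                "Resource": list(lambda_arns),
            },
            {   # allow the service to read agent spans from CloudWatch Logs
                "Sid": "CloudWatchLogsReadSpans",
                "Effect": "Allow",
                "Action": ["logs:FilterLogEvents"],
                "Resource": "*",
            },
        ],
    }
    return json.dumps(trust_policy), json.dumps(permissions_policy)


trust_json, permissions_json = build_online_eval_role_policies([
    "arn:aws:lambda:us-east-1:123456789012:function:hr-response-length",
    "arn:aws:lambda:us-east-1:123456789012:function:hr-fact-checker",
])
print(trust_json)
print(permissions_json)
```

The two strings can be passed to the IAM `create_role` (`AssumeRolePolicyDocument`) and `put_role_policy` (`PolicyDocument`) calls in boto3.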
|
||||
|
||||
### Evaluator locking
|
||||
|
||||
When a code-based evaluator is referenced by an **enabled** online evaluation config, AgentCore
**locks** it automatically. You cannot modify or delete a locked evaluator. To update it:
|
||||
|
||||
```
disable/delete online eval config
        ↓
update evaluator Lambda or re-register
        ↓
re-create online eval config
```
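
The ordering above can be captured in a small wrapper. In this sketch, `cp_client` is assumed to be a `bedrock-agentcore-control` boto3 client, and the `delete_online_evaluation_config(onlineEvaluationConfigId=...)` call mirrors the one in this README's cleanup snippet. `update_locked_evaluator` and its two callbacks are hypothetical: you supply the code that updates the Lambda (or re-registers the evaluator) and the code that re-creates the config.

```python
def update_locked_evaluator(cp_client, config_id, update_evaluator, recreate_config):
    """Run the unlock -> update -> re-create sequence in the required order."""
    # 1. Remove the online eval config so the evaluator is no longer locked.
    cp_client.delete_online_evaluation_config(onlineEvaluationConfigId=config_id)
    # 2. Apply the evaluator change (caller-supplied, e.g. update the Lambda code).
    update_evaluator()
    # 3. Re-create the online eval config (caller-supplied) and return its result.
    return recreate_config()
```

The sketch does not wait between steps. Whether the lock is released immediately after deletion is not stated in this tutorial, so a retry or short wait around step 2 may be needed in practice.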
|
||||
|
||||
### On-demand vs. online comparison
|
||||
|
||||
| Dimension | On-demand | Online |
|---|---|---|
| Trigger | Explicit API call per session | Automatic for each sampled session |
| Setup | `EvaluationClient.run()` or `OnDemandEvaluationDatasetRunner` | `create_online_evaluation_config` once |
| Code-based evaluators | ✅ Supported | ✅ Supported |
| Evaluator locking | No | Yes, while the config is enabled |
| Best for | CI/CD, ad-hoc debugging | Continuous production monitoring |
|
||||
|
||||
### AgentCore CLI shortcut
|
||||
|
||||
```bash
# sampling-rate is a percentage (0.01–100); 50 = evaluate 50% of sessions
agentcore add online-eval \
  --name my_online_eval \
  --runtime MyAgent \
  --evaluator MyCodeEvaluator \
  --sampling-rate 50 \
  --enable-on-create
```
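
To choose a sampling rate, it can help to estimate how many evaluator Lambda invocations a given rate produces. The sketch below is back-of-the-envelope arithmetic, not a description of how the service samples. It assumes a TRACE-level evaluator runs once per agent response and a SESSION-level evaluator runs once per sampled session; the function name and the traffic figures are hypothetical.

```python
def estimate_lambda_invocations(sessions, sampling_rate_pct, responses_per_session,
                                trace_evaluators, session_evaluators):
    """Expected evaluator Lambda invocations for a given sampling rate."""
    if not 0.01 <= sampling_rate_pct <= 100:
        raise ValueError("sampling rate must be a percentage between 0.01 and 100")
    sampled_sessions = sessions * sampling_rate_pct / 100
    # TRACE evaluators fire per response; SESSION evaluators fire once per session.
    per_session = responses_per_session * trace_evaluators + session_evaluators
    return sampled_sessions * per_session


# Hypothetical traffic: 10,000 sessions with 4 agent responses each,
# one TRACE-level and one SESSION-level code-based evaluator.
for rate in (100, 50, 10):
    print(rate, estimate_lambda_invocations(10_000, rate, 4, 1, 1))
```

Under these assumptions the invocation count scales linearly with the sampling rate, so lowering the rate on a high-traffic agent reduces evaluation cost proportionally.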
|
||||
|
||||
---
|
||||
|
||||
## Sample Prompts
|
||||
|
||||
The dataset includes five scenarios that exercise facts the `HRFactChecker` validates:
|
||||
@@ -178,14 +366,15 @@ You can extend the dataset with additional scenarios to test more HR topics (rem
|
||||
|---|---|
|
||||
| 1 | Install dependencies (`bedrock-agentcore>=1.6.0`) |
|
||||
| 2 | Configure AWS session, region, and Lambda role ARN |
|
||||
| 3 | Agent setup — reload from `%store` (groundtruth notebook) or deploy fresh |
|
||||
| 3 | Agent setup — reload from `%store` (groundtruth notebook) or deploy fresh with boto3 |
|
||||
| 4 | Define Lambda evaluator functions using the `@custom_code_based_evaluator()` decorator |
|
||||
| 5 | Deploy Lambda functions (bundled with bedrock-agentcore SDK + pydantic) |
|
||||
| 6 | Register evaluators via `bedrock-agentcore-control` boto3 service |
|
||||
| 7 | On-demand evaluation with `EvaluationClient` (code-based + builtin evaluators) |
|
||||
| 8 | Dataset evaluation with `OnDemandEvaluationDatasetRunner` (mixed evaluator set) |
|
||||
| 9 | Inspect and compare results (per-scenario tables + aggregate score comparison) |
|
||||
| 10 | Cleanup — delete Lambda functions, evaluator records, and agent runtime |
|
||||
| **10** | **Online evaluation with `create_online_evaluation_config` (code-based evaluators, auto-triggered)** |
|
||||
| 11 | Cleanup — delete Lambda functions, evaluator records, online eval config, and agent runtime |
|
||||
|
||||
---
|
||||
|
||||
@@ -213,8 +402,10 @@ span.span_events[*]
|
||||
- **Business rule enforcement** — encode domain-specific rules that LLMs might interpret loosely
|
||||
- **High-volume evaluation** — reduce cost for evaluations that run on every production session
|
||||
- **Regulatory requirements** — verify that required disclosures or disclaimers are always present
|
||||
- **Continuous monitoring** — combine with online evaluation for zero-touch production quality gates
|
||||
|
||||
> **Note:** Code-based evaluators are supported for **on-demand evaluation** (`EvaluationClient`, `OnDemandEvaluationDatasetRunner`) only. Online evaluation configs support built-in LLM evaluators only.
|
||||
Code-based evaluators are supported for **both on-demand** (`EvaluationClient`,
`OnDemandEvaluationDatasetRunner`) and **online** (`create_online_evaluation_config`) evaluation.
|
||||
|
||||
---
|
||||
|
||||
@@ -223,20 +414,27 @@ span.span_events[*]
|
||||
To remove created AWS resources:
|
||||
|
||||
```python
|
||||
# Delete Lambda functions
|
||||
# 1. Disable online evaluation config first (unlocks evaluators)
|
||||
cp_client.update_online_evaluation_config(
|
||||
onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID,
|
||||
enableOnCreate=False,
|
||||
)
|
||||
cp_client.delete_online_evaluation_config(onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID)
|
||||
|
||||
# 2. Delete Lambda functions
|
||||
for fn in ["hr-response-length", "hr-fact-checker"]:
|
||||
lambda_client.delete_function(FunctionName=fn)
|
||||
|
||||
# Delete evaluator registrations
|
||||
# 3. Delete evaluator registrations (now unlocked)
|
||||
for name, eid in CODE_EVAL_IDS.items():
|
||||
cp_client.delete_evaluator(evaluatorId=eid)
|
||||
|
||||
# Delete agent runtime (only if deployed in this notebook)
|
||||
# 4. Delete agent runtime (only if deployed in this notebook)
|
||||
if not _agent_loaded:
|
||||
agent_runtime.delete()
|
||||
agentcore_control.delete_agent_runtime(agentRuntimeId=AGENT_ID)
|
||||
```
|
||||
|
||||
Alternatively, run the cleanup cell (Step 10) in the notebook — it is commented out by default to prevent accidental deletion.
|
||||
Alternatively, run the cleanup cell (Step 11) in the notebook — it is commented out by default to prevent accidental deletion.
|
||||
|
||||
---
|
||||
|
||||
@@ -245,4 +443,5 @@ Alternatively, run the cleanup cell (Step 10) in the notebook — it is commente
|
||||
- Extend `HRFactChecker` with additional business rules as your agent and data model evolve
|
||||
- Combine code-based evaluators with `EvaluationClient` to validate specific production sessions
|
||||
- Add code-based evaluators to your CI/CD pipeline for zero-cost regression testing on every deployment
|
||||
- Use online evaluation with a lower sampling rate (e.g. 10%) to cost-effectively monitor high-traffic agents
|
||||
- Explore the [groundtruth tutorial](../05-groundtruth-based-evalautions/) for `EvaluationClient` and ground-truth-based evaluations with built-in evaluators
|
||||
|
||||
+594
-42
@@ -30,7 +30,7 @@
|
||||
"| **HRResponseLength** | TRACE | `hr-response-length` | Response length is 50–600 chars |\n",
|
||||
"| **HRFactChecker** | SESSION | `hr-fact-checker` | PTO balances, pay stubs, and policy facts are accurate |\n",
|
||||
"\n",
|
||||
"Then we'll run `OnDemandEvaluationDatasetRunner` with a **mixed evaluator set** combining these code-based evaluators with built-in LLM-as-as-Judge evaluators.\n",
|
||||
    "Then we'll run `OnDemandEvaluationDatasetRunner` with a **mixed evaluator set** combining these code-based evaluators with built-in LLM-as-a-Judge evaluators. We will also set up online evaluation using these evaluators for live monitoring.\n",
|
||||
"\n",
|
||||
"### Tutorial Details\n",
|
||||
"\n",
|
||||
@@ -111,7 +111,27 @@
|
||||
"from botocore.config import Config\n",
|
||||
"from IPython.display import display, Markdown\n",
|
||||
"\n",
|
||||
"REGION = \"aws_region\" # Add AWS region here \n",
|
||||
"# ── Region configuration ──────────────────────────────────────────────────────\n",
|
||||
"# REGION: the AWS region where the AgentCore Runtime (agent) is deployed.\n",
|
||||
"# Auto-detected from the boto3 session (reads AWS_DEFAULT_REGION env var or\n",
|
||||
"# the default region in ~/.aws/config). Set explicitly if needed, e.g.:\n",
|
||||
"# REGION = \"us-east-1\"\n",
|
||||
"#\n",
|
||||
"# If you ran groundtruth_evaluations.ipynb first, REGION is also restored\n",
|
||||
"# from %store in the agent-load cell below, overriding this value.\n",
|
||||
"REGION = Session().region_name\n",
|
||||
"assert REGION, (\n",
|
||||
" \"No AWS region detected. Set AWS_DEFAULT_REGION or configure a default \"\n",
|
||||
" \"region in ~/.aws/config, or set REGION explicitly above.\"\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"# EVAL_REGION: region for Lambda evaluators and evaluator registrations.\n",
|
||||
"# For online evaluation, this MUST match REGION (the agent's CloudWatch log\n",
|
||||
"# group and the evaluation config must be in the same region). The\n",
|
||||
"# agent-clients cell below aligns EVAL_REGION to REGION automatically\n",
|
||||
"# after %store restores the agent's actual region.\n",
|
||||
"EVAL_REGION = REGION\n",
|
||||
"\n",
|
||||
"boto_session = Session(region_name=REGION)\n",
|
||||
"ACCOUNT_ID = boto3.client(\"sts\").get_caller_identity()[\"Account\"]\n",
|
||||
"\n",
|
||||
@@ -119,13 +139,10 @@
|
||||
"# Update this if your role has a different name\n",
|
||||
"LAMBDA_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/AgentCoreLambdaExecutionRole\"\n",
|
||||
"\n",
|
||||
"# Evaluation region — Lambda evaluator functions and evaluator registrations must be here.\n",
|
||||
"EVAL_REGION = \"aws_region\" # Set AWS Region here\n",
|
||||
"\n",
|
||||
"print(f\"Region : {REGION}\")\n",
|
||||
"print(f\"Eval Region : {EVAL_REGION}\")\n",
|
||||
"print(f\"Account : {ACCOUNT_ID}\")\n",
|
||||
"print(f\"Lambda Role ARN : {LAMBDA_ROLE_ARN}\")\n",
|
||||
"print(f\"Eval Region : {EVAL_REGION}\")"
|
||||
"print(f\"Lambda Role ARN : {LAMBDA_ROLE_ARN}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -191,34 +208,137 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# Deploy agent if not already loaded\n",
|
||||
"if not _agent_loaded:\n",
|
||||
" from bedrock_agentcore_starter_toolkit import Runtime\n",
|
||||
" # -------------------------------------------------------------------------\n",
|
||||
" # Fresh deployment using boto3 (bedrock-agentcore-control) + Docker/ECR.\n",
|
||||
" # This path runs only when the groundtruth notebook has NOT been executed\n",
|
||||
" # first. If you prefer the CLI, run `agentcore deploy` from the project\n",
|
||||
" # root instead and set AGENT_ID / AGENT_ARN / CW_LOG_GROUP manually below.\n",
|
||||
" # -------------------------------------------------------------------------\n",
|
||||
"\n",
|
||||
" agent_runtime = Runtime()\n",
|
||||
" agent_runtime.configure(\n",
|
||||
" entrypoint=\"hr_assistant_agent.py\",\n",
|
||||
" requirements_file=\"requirements.txt\",\n",
|
||||
" auto_create_execution_role=True,\n",
|
||||
" auto_create_ecr=True,\n",
|
||||
" region=REGION,\n",
|
||||
" agent_name=\"hr_assistant_codeeval_tutorial\",\n",
|
||||
" idle_timeout=120,\n",
|
||||
" ecr_client = boto3.client(\"ecr\", region_name=REGION)\n",
|
||||
" agentcore_control_deploy = boto3.client(\"bedrock-agentcore-control\", region_name=REGION)\n",
|
||||
" iam_client = boto3.client(\"iam\")\n",
|
||||
"\n",
|
||||
" AGENT_NAME = \"hr_assistant_codeeval_tutorial\"\n",
|
||||
" ECR_REPO_NAME = f\"agentcore-{AGENT_NAME}\"\n",
|
||||
"\n",
|
||||
" # ------------------------------------------------------------------\n",
|
||||
" # 1. Ensure IAM execution role exists\n",
|
||||
" # ------------------------------------------------------------------\n",
|
||||
" EXECUTION_ROLE_NAME = \"AgentCoreRuntimeExecutionRole\"\n",
|
||||
" EXECUTION_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/{EXECUTION_ROLE_NAME}\"\n",
|
||||
"\n",
|
||||
" try:\n",
|
||||
" iam_client.get_role(RoleName=EXECUTION_ROLE_NAME)\n",
|
||||
" print(f\"Using existing IAM role: {EXECUTION_ROLE_ARN}\")\n",
|
||||
" except iam_client.exceptions.NoSuchEntityException:\n",
|
||||
" print(f\"Creating IAM role: {EXECUTION_ROLE_NAME}...\")\n",
|
||||
" trust_policy = json.dumps({\n",
|
||||
" \"Version\": \"2012-10-17\",\n",
|
||||
" \"Statement\": [{\n",
|
||||
" \"Effect\": \"Allow\",\n",
|
||||
" \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
|
||||
" \"Action\": \"sts:AssumeRole\",\n",
|
||||
" }],\n",
|
||||
" })\n",
|
||||
" iam_client.create_role(\n",
|
||||
" RoleName=EXECUTION_ROLE_NAME,\n",
|
||||
" AssumeRolePolicyDocument=trust_policy,\n",
|
||||
" Description=\"Execution role for AgentCore Runtime tutorial agents\",\n",
|
||||
" )\n",
|
||||
" for policy_arn in [\n",
|
||||
" \"arn:aws:iam::aws:policy/AmazonBedrockFullAccess\",\n",
|
||||
" \"arn:aws:iam::aws:policy/CloudWatchLogsFullAccess\",\n",
|
||||
" ]:\n",
|
||||
" iam_client.attach_role_policy(RoleName=EXECUTION_ROLE_NAME, PolicyArn=policy_arn)\n",
|
||||
" print(f\"Created: {EXECUTION_ROLE_ARN}\")\n",
|
||||
" time.sleep(10) # allow IAM propagation\n",
|
||||
"\n",
|
||||
" # ------------------------------------------------------------------\n",
|
||||
" # 2. Create ECR repository (or reuse existing)\n",
|
||||
" # ------------------------------------------------------------------\n",
|
||||
" try:\n",
|
||||
" ecr_resp = ecr_client.create_repository(repositoryName=ECR_REPO_NAME)\n",
|
||||
" ECR_REPO_URI = ecr_resp[\"repository\"][\"repositoryUri\"]\n",
|
||||
" print(f\"Created ECR repo: {ECR_REPO_URI}\")\n",
|
||||
" except ecr_client.exceptions.RepositoryAlreadyExistsException:\n",
|
||||
" ECR_REPO_URI = ecr_client.describe_repositories(\n",
|
||||
" repositoryNames=[ECR_REPO_NAME]\n",
|
||||
" )[\"repositories\"][0][\"repositoryUri\"]\n",
|
||||
" print(f\"Using existing ECR repo: {ECR_REPO_URI}\")\n",
|
||||
"\n",
|
||||
" IMAGE_URI = f\"{ECR_REPO_URI}:latest\"\n",
|
||||
"\n",
|
||||
" # ------------------------------------------------------------------\n",
|
||||
" # 3. Build Docker image and push to ECR\n",
|
||||
" # ------------------------------------------------------------------\n",
|
||||
" ecr_registry = f\"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com\"\n",
|
||||
" print(\"Docker login to ECR...\")\n",
|
||||
" subprocess.run(\n",
|
||||
" f\"aws ecr get-login-password --region {REGION} | docker login --username AWS --password-stdin {ecr_registry}\",\n",
|
||||
" shell=True, check=True,\n",
|
||||
" )\n",
|
||||
" launch_result = agent_runtime.launch()\n",
|
||||
" print(\"Building Docker image (this may take a few minutes)...\")\n",
|
||||
" subprocess.run([\"docker\", \"build\", \"--platform\", \"linux/amd64\", \"-t\", IMAGE_URI, \".\"], check=True)\n",
|
||||
" print(\"Pushing image to ECR...\")\n",
|
||||
" subprocess.run([\"docker\", \"push\", IMAGE_URI], check=True)\n",
|
||||
" print(f\"Image pushed: {IMAGE_URI}\")\n",
|
||||
"\n",
|
||||
" terminal = {\"READY\", \"CREATE_FAILED\", \"DELETE_FAILED\", \"UPDATE_FAILED\"}\n",
|
||||
" # Allow ECR pull from AgentCore\n",
|
||||
" ecr_client.set_repository_policy(\n",
|
||||
" repositoryName=ECR_REPO_NAME,\n",
|
||||
" policyText=json.dumps({\n",
|
||||
" \"Version\": \"2012-10-17\",\n",
|
||||
" \"Statement\": [{\n",
|
||||
" \"Sid\": \"AllowAgentCorePull\",\n",
|
||||
" \"Effect\": \"Allow\",\n",
|
||||
" \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
|
||||
" \"Action\": [\"ecr:GetDownloadUrlForLayer\", \"ecr:BatchGetImage\", \"ecr:BatchCheckLayerAvailability\"],\n",
|
||||
" }],\n",
|
||||
" }),\n",
|
||||
" )\n",
|
||||
"\n",
|
||||
" # ------------------------------------------------------------------\n",
|
||||
" # 4. Create (or update) AgentCore Runtime\n",
|
||||
" # ------------------------------------------------------------------\n",
|
||||
" artifact = {\"containerConfiguration\": {\"containerUri\": IMAGE_URI}}\n",
|
||||
" try:\n",
|
||||
" resp = agentcore_control_deploy.create_agent_runtime(\n",
|
||||
" agentRuntimeName=AGENT_NAME,\n",
|
||||
" agentRuntimeArtifact=artifact,\n",
|
||||
" executionRoleArn=EXECUTION_ROLE_ARN,\n",
|
||||
" networkConfiguration={\"networkMode\": \"PUBLIC\"},\n",
|
||||
" )\n",
|
||||
" AGENT_ID = resp[\"agentRuntimeId\"]\n",
|
||||
" AGENT_ARN = resp[\"agentRuntimeArn\"]\n",
|
||||
" print(f\"Created AgentCore Runtime: {AGENT_ID}\")\n",
|
||||
" except agentcore_control_deploy.exceptions.ConflictException:\n",
|
||||
" runtimes = agentcore_control_deploy.list_agent_runtimes()[\"agentRuntimes\"]\n",
|
||||
" existing = next((r for r in runtimes if r[\"agentRuntimeName\"] == AGENT_NAME), None)\n",
|
||||
" assert existing, f\"Runtime {AGENT_NAME} not found after conflict\"\n",
|
||||
" AGENT_ID = existing[\"agentRuntimeId\"]\n",
|
||||
" AGENT_ARN = existing[\"agentRuntimeArn\"]\n",
|
||||
" agentcore_control_deploy.update_agent_runtime(\n",
|
||||
" agentRuntimeId=AGENT_ID,\n",
|
||||
" agentRuntimeArtifact=artifact,\n",
|
||||
" )\n",
|
||||
" print(f\"Updated existing runtime: {AGENT_ID}\")\n",
|
||||
"\n",
|
||||
" # Wait until READY\n",
|
||||
" terminal = {\"READY\", \"CREATE_FAILED\", \"UPDATE_FAILED\"}\n",
|
||||
" while True:\n",
|
||||
" status = agent_runtime.status().endpoint[\"status\"]\n",
|
||||
" status = agentcore_control_deploy.get_agent_runtime(\n",
|
||||
" agentRuntimeId=AGENT_ID\n",
|
||||
" )[\"status\"]\n",
|
||||
" print(f\" Status: {status}\")\n",
|
||||
" if status in terminal:\n",
|
||||
" break\n",
|
||||
" time.sleep(15)\n",
|
||||
"\n",
|
||||
" assert status == \"READY\", f\"Deployment failed: {status}\"\n",
|
||||
"\n",
|
||||
" AGENT_ID = launch_result.agent_id\n",
|
||||
" AGENT_ARN = launch_result.agent_arn\n",
|
||||
" CW_LOG_GROUP = f\"/aws/bedrock-agentcore/runtimes/{AGENT_ID}-DEFAULT\"\n",
|
||||
"\n",
|
||||
" print(\"\\nAgent deployed:\")\n",
|
||||
@@ -226,7 +346,7 @@
|
||||
" print(f\" AGENT_ARN : {AGENT_ARN}\")\n",
|
||||
" print(f\" CW_LOG_GROUP : {CW_LOG_GROUP}\")\n",
|
||||
"else:\n",
|
||||
" print(\"Using existing agent — skipping deployment.\")"
|
||||
" print(\"Using existing agent — skipping deployment.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -250,9 +370,18 @@
|
||||
" print(f\"Note: agent is in {_arn_region}, overriding REGION={REGION}\")\n",
|
||||
" REGION = _arn_region\n",
|
||||
"\n",
|
||||
"# Align EVAL_REGION with the agent's region so that Lambda evaluators, evaluator\n",
|
||||
"# registrations, and the online evaluation config all live in the same region as\n",
|
||||
"# the agent's CloudWatch log group. Online evaluation requires the log group and\n",
|
||||
"# the evaluators to be in the same region as the control-plane config.\n",
|
||||
"if EVAL_REGION != REGION:\n",
|
||||
" print(f\"Aligning EVAL_REGION: {EVAL_REGION} → {REGION} (must match agent region for online eval)\")\n",
|
||||
" EVAL_REGION = REGION\n",
|
||||
"\n",
|
||||
"# boto3 client for agent invocation — must be in the same region as the agent\n",
|
||||
"agentcore_client = boto3.client(\"bedrock-agentcore\", region_name=REGION)\n",
|
||||
"print(f\"agentcore_client region: {REGION}\")"
|
||||
"print(f\"REGION : {REGION}\")\n",
|
||||
"print(f\"EVAL_REGION : {EVAL_REGION}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1385,12 +1514,404 @@
|
||||
"print(f\"Results saved to: {results_path}\")"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "axt5qz5c7lg",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Step 10: Online Evaluation with Code-Based Evaluators\n",
|
||||
"\n",
|
||||
"**Online evaluation** continuously monitors your live agent traffic and automatically scores sessions\n",
|
||||
"as they happen — no manual triggering required. You configure it once, and AgentCore watches your\n",
|
||||
"agent's CloudWatch log stream, evaluating new sessions at a configurable sampling rate.\n",
|
||||
"\n",
|
||||
"In this step we reuse the **same code-based evaluators** (`HRResponseLength` and `HRFactChecker`)\n",
|
||||
"we registered in Step 6. This demonstrates that a single evaluator registration can serve both\n",
|
||||
"on-demand and online evaluation use cases.\n",
|
||||
"\n",
|
||||
"### How online evaluation works\n",
|
||||
"\n",
|
||||
"```\n",
|
||||
"Agent invocation\n",
|
||||
" │\n",
|
||||
" ▼ (OTel spans → CloudWatch)\n",
|
||||
"AgentCore Runtime log group\n",
|
||||
" │\n",
|
||||
" ▼ (online eval config watches the log group)\n",
|
||||
"AgentCore Evaluations\n",
|
||||
" ├── Builtin LLM evaluators → LLM inference\n",
|
||||
" └── Code-based evaluators → your Lambda function\n",
|
||||
" │\n",
|
||||
" ▼\n",
|
||||
" Results in CloudWatch Logs\n",
|
||||
" /aws/bedrock-agentcore/evaluations/online-evaluations/...\n",
|
||||
"```\n",
|
||||
"\n",
|
||||
"### Key differences from on-demand evaluation\n",
|
||||
"\n",
|
||||
"| | On-demand | Online |\n",
|
||||
"|---|---|---|\n",
|
||||
"| **Trigger** | Explicit API call per session | Automatic, event-driven |\n",
|
||||
"| **Scope** | Specific session(s) you choose | All sessions (or a sampled %) |\n",
|
||||
"| **Setup** | Call `EvaluationClient.run()` per session | Configure once with `create_online_evaluation_config` |\n",
|
||||
"| **Evaluator locking** | No | Code-based evaluators become **locked** while the config is enabled |\n",
|
||||
"| **Best for** | Ad-hoc checks, CI/CD pipelines | Continuous production monitoring |\n",
|
||||
"\n",
|
||||
"### IAM execution role\n",
|
||||
"\n",
|
||||
"Online evaluation requires an **evaluation execution role** — an IAM role that AgentCore Evaluations\n",
|
||||
"assumes to invoke your Lambda evaluators and read CloudWatch spans. It must trust\n",
|
||||
"`bedrock-agentcore.amazonaws.com` and have `lambda:InvokeFunction` + `logs:FilterLogEvents`\n",
|
||||
"permissions.\n",
|
||||
"\n",
|
||||
"> **Evaluator locking:** When a code-based evaluator is referenced by an enabled online evaluation\n",
|
||||
"> config, AgentCore automatically locks it to prevent accidental modification. To update the\n",
|
||||
"> evaluator, first disable the online evaluation config (or delete it), then update the evaluator,\n",
|
||||
"> then re-enable the config."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "6v2u7hcmazi",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Step 10a: Create IAM Evaluation Execution Role"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "m3n6no96fck",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"iam_client = boto3.client(\"iam\")\n",
|
||||
"ONLINE_EVAL_ROLE_NAME = \"AgentCoreOnlineEvaluationRole\"\n",
|
||||
"ONLINE_EVAL_ROLE_ARN = f\"arn:aws:iam::{ACCOUNT_ID}:role/{ONLINE_EVAL_ROLE_NAME}\"\n",
|
||||
"\n",
|
||||
"# Trust policy: allow AgentCore Evaluations service to assume this role\n",
|
||||
"trust_policy = json.dumps({\n",
|
||||
" \"Version\": \"2012-10-17\",\n",
|
||||
" \"Statement\": [{\n",
|
||||
" \"Effect\": \"Allow\",\n",
|
||||
" \"Principal\": {\"Service\": \"bedrock-agentcore.amazonaws.com\"},\n",
|
||||
" \"Action\": \"sts:AssumeRole\",\n",
|
||||
" }],\n",
|
||||
"})\n",
|
||||
"\n",
|
||||
"# Inline permission policy: invoke Lambda evaluators + full CloudWatch Logs access.\n",
|
||||
"# The online evaluation service requires:\n",
|
||||
"# READ — FilterLogEvents, GetLogEvents, StartQuery, GetQueryResults on:\n",
|
||||
"# - agent runtime log group (/aws/bedrock-agentcore/runtimes/...)\n",
|
||||
"# - OTel spans log group (aws/spans — no leading slash)\n",
|
||||
"# WRITE — CreateLogGroup, CreateLogStream, PutLogEvents for writing evaluation results to:\n",
|
||||
"# - /aws/bedrock-agentcore/evaluations/results/<config-name>\n",
|
||||
"eval_policy = json.dumps({\n",
|
||||
" \"Version\": \"2012-10-17\",\n",
|
||||
" \"Statement\": [\n",
|
||||
" {\n",
|
||||
" \"Sid\": \"InvokeLambdaEvaluators\",\n",
|
||||
" \"Effect\": \"Allow\",\n",
|
||||
" \"Action\": [\"lambda:InvokeFunction\", \"lambda:GetFunction\"],\n",
|
||||
" \"Resource\": [\n",
|
||||
" lambda_arn_response_length,\n",
|
||||
" lambda_arn_fact_checker,\n",
|
||||
" ],\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"Sid\": \"CloudWatchLogsReadSpans\",\n",
|
||||
" \"Effect\": \"Allow\",\n",
|
||||
" \"Action\": [\n",
|
||||
" \"logs:FilterLogEvents\",\n",
|
||||
" \"logs:DescribeLogGroups\",\n",
|
||||
" \"logs:DescribeLogStreams\",\n",
|
||||
" \"logs:GetLogEvents\",\n",
|
||||
" \"logs:StartQuery\",\n",
|
||||
" \"logs:GetQueryResults\",\n",
|
||||
" \"logs:StopQuery\",\n",
|
||||
" ],\n",
|
||||
" \"Resource\": \"*\",\n",
|
||||
" },\n",
|
||||
" {\n",
|
||||
" \"Sid\": \"CloudWatchLogsWriteResults\",\n",
|
||||
" \"Effect\": \"Allow\",\n",
|
||||
" \"Action\": [\n",
|
||||
" \"logs:CreateLogGroup\",\n",
|
||||
" \"logs:CreateLogStream\",\n",
|
||||
" \"logs:PutLogEvents\",\n",
|
||||
" ],\n",
|
||||
" \"Resource\": f\"arn:aws:logs:{REGION}:{ACCOUNT_ID}:log-group:/aws/bedrock-agentcore/evaluations/*\",\n",
|
||||
" },\n",
|
||||
" ],\n",
|
||||
"})\n",
|
||||
"\n",
|
||||
"try:\n",
|
||||
" iam_client.get_role(RoleName=ONLINE_EVAL_ROLE_NAME)\n",
|
||||
" print(f\"Using existing role: {ONLINE_EVAL_ROLE_ARN}\")\n",
|
||||
" iam_client.put_role_policy(\n",
|
||||
" RoleName=ONLINE_EVAL_ROLE_NAME,\n",
|
||||
" PolicyName=\"AgentCoreOnlineEvalPermissions\",\n",
|
||||
" PolicyDocument=eval_policy,\n",
|
||||
" )\n",
|
||||
" print(\" Inline policy updated.\")\n",
|
||||
"except iam_client.exceptions.NoSuchEntityException:\n",
|
||||
" print(f\"Creating IAM role: {ONLINE_EVAL_ROLE_NAME}...\")\n",
|
||||
" iam_client.create_role(\n",
|
||||
" RoleName=ONLINE_EVAL_ROLE_NAME,\n",
|
||||
" AssumeRolePolicyDocument=trust_policy,\n",
|
||||
" Description=\"Execution role for AgentCore online evaluation with code-based evaluators\",\n",
|
||||
" )\n",
|
||||
" iam_client.put_role_policy(\n",
|
||||
" RoleName=ONLINE_EVAL_ROLE_NAME,\n",
|
||||
" PolicyName=\"AgentCoreOnlineEvalPermissions\",\n",
|
||||
" PolicyDocument=eval_policy,\n",
|
||||
" )\n",
|
||||
" print(f\"Created: {ONLINE_EVAL_ROLE_ARN}\")\n",
|
||||
"\n",
|
||||
"print(f\"\\nOnline eval execution role: {ONLINE_EVAL_ROLE_ARN}\")\n",
|
||||
"print(\"Waiting 10s for IAM propagation...\")\n",
|
||||
"time.sleep(10)\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "k2easqzclef",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Step 10b: Create Online Evaluation Configuration\n",
|
||||
"\n",
|
||||
"We create an online evaluation config that monitors the HR assistant's live CloudWatch log group\n",
|
||||
"and evaluates every session (100% sampling rate) with our two code-based evaluators.\n",
|
||||
"\n",
|
||||
"The config references:\n",
|
||||
"- **`HRResponseLength`** (TRACE level) — evaluated per agent response turn\n",
|
||||
"- **`HRFactChecker`** (SESSION level) — evaluated once per completed session\n",
|
||||
"\n",
|
||||
"> Once this config is **enabled**, both code-based evaluators are automatically **locked**\n",
|
||||
"> and cannot be modified until the config is disabled or deleted."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "e49zvxa0r0f",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# Unique name for the online eval config (no hyphens — service regex: [a-zA-Z][a-zA-Z0-9_]{0,99})\n",
|
||||
"ONLINE_EVAL_CONFIG_NAME = f\"hr_online_eval_{RUN_SUFFIX}\"\n",
|
||||
"\n",
|
||||
"# The OTel service name is <agentRuntimeName>.DEFAULT\n",
|
||||
"# AGENT_ARN format: arn:aws:bedrock-agentcore:{region}:{account}:runtime/{id}\n",
|
||||
"_runtime_id = AGENT_ARN.split(\"/\")[-1] # e.g. hr_assistant_codeeval_tutorial-AbCdEfGhIj\n",
|
||||
"_agent_runtime_name = _runtime_id.rsplit(\"-\", 1)[0] # strip auto-generated suffix\n",
|
||||
"OTEL_SERVICE_NAME = f\"{_agent_runtime_name}.DEFAULT\"\n",
|
||||
"\n",
|
||||
"print(f\"Online eval config name : {ONLINE_EVAL_CONFIG_NAME}\")\n",
|
||||
"print(f\"Monitoring log group : {CW_LOG_GROUP}\")\n",
|
||||
"print(f\"OTel service name : {OTEL_SERVICE_NAME}\")\n",
|
||||
"print(f\"Evaluators : {list(CODE_EVAL_IDS.keys())}\")\n",
|
||||
"print()\n",
|
||||
"\n",
|
||||
"online_eval_resp = cp_client.create_online_evaluation_config(\n",
|
||||
" onlineEvaluationConfigName=ONLINE_EVAL_CONFIG_NAME,\n",
|
||||
" # Evaluate 100% of sessions; lower this in production to control cost\n",
|
||||
" rule={\"samplingConfig\": {\"samplingPercentage\": 100.0}},\n",
|
||||
" # Watch the agent's runtime CloudWatch log group for new OTel spans\n",
|
||||
" dataSourceConfig={\n",
|
||||
" \"cloudWatchLogs\": {\n",
|
||||
" \"logGroupNames\": [CW_LOG_GROUP],\n",
|
||||
" \"serviceNames\": [OTEL_SERVICE_NAME],\n",
|
||||
" }\n",
|
||||
" },\n",
|
||||
" # Code-based + builtin evaluators can be mixed freely\n",
|
||||
" evaluators=[\n",
|
||||
" {\"evaluatorId\": CODE_EVAL_IDS[\"HRResponseLength\"]},\n",
|
||||
" {\"evaluatorId\": CODE_EVAL_IDS[\"HRFactChecker\"]},\n",
|
||||
" ],\n",
|
||||
" evaluationExecutionRoleArn=ONLINE_EVAL_ROLE_ARN,\n",
|
||||
" # enableOnCreate=True activates the config immediately on creation\n",
|
||||
" enableOnCreate=True,\n",
|
||||
")\n",
|
||||
"\n",
|
||||
"ONLINE_EVAL_CONFIG_ID = online_eval_resp[\"onlineEvaluationConfigId\"]\n",
|
||||
"ONLINE_EVAL_CONFIG_ARN = online_eval_resp.get(\"onlineEvaluationConfigArn\", \"\")\n",
|
||||
"\n",
|
||||
"print(f\"Online eval config created:\")\n",
|
||||
"print(f\" ID : {ONLINE_EVAL_CONFIG_ID}\")\n",
|
||||
"print(f\" ARN : {ONLINE_EVAL_CONFIG_ARN}\")\n",
|
||||
"print()\n",
|
||||
"print(\"Evaluators are now LOCKED — they cannot be modified while this config is enabled.\")\n",
|
||||
"print(\"To update an evaluator: disable this config → update evaluator → re-enable.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "1g6zs5hef32",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Step 10c: Invoke Agent to Trigger Online Evaluation\n",
|
||||
"\n",
|
||||
"Now we invoke the HR assistant with a few turns. Because the online evaluation config is active\n",
|
||||
"and watching the runtime log group, AgentCore will automatically evaluate each session as OTel\n",
|
||||
"spans arrive in CloudWatch — no explicit evaluation API call needed."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "u2fnzeuy4rd",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# Invoke the agent to generate a fresh session that will be auto-evaluated online.\n",
|
||||
"# We use two separate sessions to demonstrate per-session evaluation.\n",
|
||||
"ONLINE_SESSION_IDS = [\n",
|
||||
" f\"online-eval-{uuid.uuid4()}\",\n",
|
||||
" f\"online-eval-{uuid.uuid4()}\",\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"ONLINE_SESSION_TURNS = [\n",
|
||||
" # Session 1: PTO balance + policy lookup\n",
|
||||
" [\n",
|
||||
" \"What is the PTO balance for employee EMP-001?\",\n",
|
||||
" \"What is the company PTO policy?\",\n",
|
||||
" ],\n",
|
||||
" # Session 2: pay stub + benefits\n",
|
||||
" [\n",
|
||||
" \"Can you pull up the January 2026 pay stub for EMP-001?\",\n",
|
||||
" \"What health insurance options does the company offer?\",\n",
|
||||
" ],\n",
|
||||
"]\n",
|
||||
"\n",
|
||||
"print(\"Invoking agent sessions (these will be auto-evaluated online)...\")\n",
|
||||
"for session_id, turns in zip(ONLINE_SESSION_IDS, ONLINE_SESSION_TURNS):\n",
|
||||
" print(f\"\\n Session: {session_id}\")\n",
|
||||
" for prompt in turns:\n",
|
||||
" print(f\" > {prompt}\")\n",
|
||||
" reply = invoke_agent_simple(prompt, session_id)\n",
|
||||
" print(f\" < {reply[:100]}...\")\n",
|
||||
"\n",
|
||||
"print(f\"\\nBoth sessions invoked.\")\n",
|
||||
"print(\"AgentCore will automatically evaluate them as spans arrive in CloudWatch.\")\n",
|
||||
"print(\"Waiting 120s for CloudWatch ingestion + evaluation processing...\")\n",
|
||||
"time.sleep(120)\n",
|
||||
"print(\"Ready to check online evaluation results.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "q5k1zfi9xmf",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"### Step 10d: Retrieve and Display Online Evaluation Results\n",
|
||||
"\n",
|
||||
"Online evaluation results are written to CloudWatch Logs in the evaluations results log group.\n",
|
||||
"We query the log group for evaluation events for our sessions and display the scores."
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "qjcf1kvocj",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"import re as _re\n",
|
||||
"\n",
|
||||
"# Online evaluation results log group is in the same region as the evaluation config\n",
|
||||
"# (which is REGION = EVAL_REGION after alignment in the agent-clients cell).\n",
|
||||
"logs_client = boto3.client(\"logs\", region_name=REGION)\n",
|
||||
"\n",
|
||||
"ONLINE_EVAL_RESULTS_LOG_GROUP = \"/aws/bedrock-agentcore/evaluations/online-evaluations/results/default\"\n",
|
||||
"\n",
|
||||
"look_back_ms = int(time.time() * 1000) - (30 * 60 * 1000) # last 30 minutes\n",
|
||||
"\n",
|
||||
"print(f\"Querying online eval results from: {ONLINE_EVAL_RESULTS_LOG_GROUP}\")\n",
|
||||
"print(f\"Filtering for session IDs: {[s[:20] + '...' for s in ONLINE_SESSION_IDS]}\\n\")\n",
|
||||
"\n",
|
||||
"online_results = []\n",
|
||||
"try:\n",
|
||||
" paginator = logs_client.get_paginator(\"filter_log_events\")\n",
|
||||
" for page in paginator.paginate(\n",
|
||||
" logGroupName=ONLINE_EVAL_RESULTS_LOG_GROUP,\n",
|
||||
" startTime=look_back_ms,\n",
|
||||
" ):\n",
|
||||
" for event in page.get(\"events\", []):\n",
|
||||
" try:\n",
|
||||
" log_entry = json.loads(event[\"message\"])\n",
|
||||
" except (json.JSONDecodeError, TypeError):\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" attrs = log_entry.get(\"attributes\", log_entry)\n",
|
||||
" session_id = attrs.get(\"session.id\", \"\")\n",
|
||||
"\n",
|
||||
" if not any(sid == session_id for sid in ONLINE_SESSION_IDS):\n",
|
||||
" continue\n",
|
||||
"\n",
|
||||
" online_results.append({\n",
|
||||
" \"session_id\": session_id,\n",
|
||||
" \"evaluator_name\": attrs.get(\"gen_ai.evaluation.name\", \"\"),\n",
|
||||
" \"score\": attrs.get(\"gen_ai.evaluation.score.value\"),\n",
|
||||
" \"label\": attrs.get(\"gen_ai.evaluation.score.label\", \"\"),\n",
|
||||
" \"explanation\": (attrs.get(\"gen_ai.evaluation.explanation\") or \"\")[:120],\n",
|
||||
" })\n",
|
||||
"\n",
|
||||
"except logs_client.exceptions.ResourceNotFoundException:\n",
|
||||
" print(f\"Note: Log group '{ONLINE_EVAL_RESULTS_LOG_GROUP}' not found yet.\")\n",
|
||||
" print(\"This is normal if no sessions have been evaluated yet.\")\n",
|
||||
" print(\"Results will appear here after AgentCore processes the first session.\")\n",
|
||||
"\n",
|
||||
"print(f\"Found {len(online_results)} online evaluation result event(s).\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "code",
|
||||
"execution_count": null,
|
||||
"id": "b5z9kr4ic6r",
|
||||
"metadata": {},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# Display online evaluation results as a markdown table\n",
|
||||
"name_by_id = {v: k for k, v in CODE_EVAL_IDS.items()}\n",
|
||||
"\n",
|
||||
"if online_results:\n",
|
||||
" rows = [\n",
|
||||
" \"| Session (truncated) | Evaluator | Score | Label | Explanation |\",\n",
|
||||
" \"|---|---|---|---|---|\",\n",
|
||||
" ]\n",
|
||||
" for r in online_results:\n",
|
||||
" short_session = r[\"session_id\"][:30] + \"...\"\n",
|
||||
" evaluator = r[\"evaluator_name\"] or \"(unknown)\"\n",
|
||||
" score = str(r[\"score\"]) if r[\"score\"] is not None else \"N/A\"\n",
|
||||
" label = r[\"label\"] or \"\"\n",
|
||||
" explanation = r[\"explanation\"].replace(\"\\n\", \" \")\n",
|
||||
" rows.append(f\"| `{short_session}` | **{evaluator}** | {score} | {label} | {explanation} |\")\n",
|
||||
" display(Markdown(\"### Online Evaluation Results\\n\\n\" + \"\\n\".join(rows)))\n",
|
||||
"else:\n",
|
||||
" display(Markdown(\"\"\"### Online Evaluation Results\n",
|
||||
"\n",
|
||||
"> **No results yet.** Online evaluation is asynchronous — AgentCore may still be processing the\n",
|
||||
"> sessions. Try re-running this cell after another 60–120 seconds.\n",
|
||||
">\n",
|
||||
 You can also check">
"> You can also check the AgentCore console or inspect the results log group printed below\n",
|
||||
"> directly once events arrive.\n",
|
||||
"\"\"\"))\n",
|
||||
" print(f\"Log group to monitor: {ONLINE_EVAL_RESULTS_LOG_GROUP}\")\n",
|
||||
" print(f\"Session IDs invoked : {ONLINE_SESSION_IDS}\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
"cell_type": "markdown",
|
||||
"id": "cleanup-md",
|
||||
"metadata": {},
|
||||
"source": [
|
||||
"## Step 10: Cleanup\n",
|
||||
"## Step 11: Cleanup\n",
|
||||
"\n",
|
||||
"Delete created resources to avoid ongoing charges."
|
||||
]
|
||||
@@ -1409,8 +1930,25 @@
|
||||
},
|
||||
"outputs": [],
|
||||
"source": [
|
||||
"\n",
|
||||
"# Uncomment to clean up resources\n",
|
||||
"\n",
|
||||
"# # Disable and delete online evaluation config (must disable before deleting locked evaluators)\n",
|
||||
"# try:\n",
|
||||
"# cp_client.update_online_evaluation_config(\n",
|
||||
"# onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID,\n",
|
||||
"# enableOnCreate=False,\n",
|
||||
"# )\n",
|
||||
"# print(f\"Disabled online eval config: {ONLINE_EVAL_CONFIG_ID}\")\n",
|
||||
"# except Exception as e:\n",
|
||||
"# print(f\"Could not disable online eval config: {e}\")\n",
|
||||
"#\n",
|
||||
"# try:\n",
|
||||
"# cp_client.delete_online_evaluation_config(onlineEvaluationConfigId=ONLINE_EVAL_CONFIG_ID)\n",
|
||||
"# print(f\"Deleted online eval config: {ONLINE_EVAL_CONFIG_ID}\")\n",
|
||||
"# except Exception as e:\n",
|
||||
"# print(f\"Could not delete online eval config: {e}\")\n",
|
||||
"\n",
|
||||
"# # Delete Lambda functions\n",
|
||||
"# for fn in [\"hr-response-length\", \"hr-fact-checker\"]:\n",
|
||||
"# try:\n",
|
||||
@@ -1419,7 +1957,7 @@
|
||||
"# except Exception as e:\n",
|
||||
"# print(f\"Could not delete {fn}: {e}\")\n",
|
||||
"\n",
|
||||
"# # Delete evaluator records\n",
|
||||
"# # Delete evaluator records (only possible after online eval config is deleted/disabled)\n",
|
||||
"# for name, eid in CODE_EVAL_IDS.items():\n",
|
||||
"# try:\n",
|
||||
"# cp_client.delete_evaluator(evaluatorId=eid)\n",
|
||||
@@ -1429,10 +1967,11 @@
|
||||
"\n",
|
||||
"# # Delete agent runtime (only if deployed in this notebook)\n",
|
||||
"# if not _agent_loaded:\n",
|
||||
"# agent_runtime.delete()\n",
|
||||
"# print(\"Agent runtime deleted.\")\n",
|
||||
"# agentcore_control_deploy = boto3.client(\"bedrock-agentcore-control\", region_name=REGION)\n",
|
||||
"# agentcore_control_deploy.delete_agent_runtime(agentRuntimeId=AGENT_ID)\n",
|
||||
"# print(f\"Deleted agent runtime: {AGENT_ID}\")\n",
|
||||
"\n",
|
||||
"print(\"Cleanup skipped. Uncomment the cells above to delete resources.\")"
|
||||
"print(\"Cleanup skipped. Uncomment the cells above to delete resources.\")\n"
|
||||
]
|
||||
},
|
||||
{
|
||||
@@ -1442,24 +1981,25 @@
|
||||
"source": [
|
||||
"## Summary\n",
|
||||
"\n",
|
||||
"You've created two Lambda-backed code-based evaluators and run them in two ways:\n",
|
||||
"You've created two Lambda-backed code-based evaluators and run them in three ways:\n",
|
||||
"\n",
|
||||
"**Step 7 — On-Demand Evaluation (`EvaluationClient`)**: evaluated a specific production session\n",
|
||||
"with a mix of builtin LLM evaluators and code-based evaluators.\n",
|
||||
"\n",
|
||||
"**Step 8 — `OnDemandEvaluationDatasetRunner`**: automatically invoked the agent across a dataset and scored\n",
|
||||
"each scenario with the full mixed evaluator set.\n",
|
||||
"**Step 8 — `OnDemandEvaluationDatasetRunner`**: automatically invoked the agent across a dataset\n",
|
||||
"and scored each scenario with the full mixed evaluator set.\n",
|
||||
"\n",
|
||||
"| Evaluator | Type | Level | What it measured |\n",
|
||||
"**Step 10 — Online Evaluation (`create_online_evaluation_config`)**: deployed a continuous\n",
|
||||
"evaluation config that automatically scores every live session as OTel spans arrive in CloudWatch.\n",
|
||||
"No per-session API calls needed.\n",
|
||||
"\n",
|
||||
"| Evaluator | Type | Level | Used in |\n",
|
||||
"|---|---|---|---|\n",
|
||||
"| `Builtin.Correctness` | LLM | TRACE | Semantic similarity to expected response |\n",
|
||||
"| `Builtin.Helpfulness` | LLM | TRACE | Response helpfulness |\n",
|
||||
"| `Builtin.ResponseRelevance` | LLM | TRACE | Relevance to the user's question |\n",
|
||||
"| `HRResponseLength` | Code | TRACE | Response length within 50–600 chars |\n",
|
||||
"| `HRFactChecker` | Code | SESSION | Factual accuracy of PTO, pay stub, policy data |\n",
|
||||
"\n",
|
||||
"> **Note:** Code-based evaluators are supported for **on-demand evaluation only**.\n",
|
||||
"> Online evaluation configs (`create_online_config`) support builtin LLM evaluators only.\n",
|
||||
"| `Builtin.Correctness` | LLM | TRACE | On-demand (Steps 7 & 8) |\n",
|
||||
"| `Builtin.Helpfulness` | LLM | TRACE | On-demand (Step 8) |\n",
|
||||
"| `Builtin.ResponseRelevance` | LLM | TRACE | On-demand (Step 8) |\n",
|
||||
"| `HRResponseLength` | Code | TRACE | On-demand **and** Online (Steps 7, 8, 10) |\n",
|
||||
"| `HRFactChecker` | Code | SESSION | On-demand **and** Online (Steps 7, 8, 10) |\n",
|
||||
"\n",
|
||||
"### When to use code-based evaluators\n",
|
||||
"\n",
|
||||
@@ -1468,11 +2008,23 @@
|
||||
"- **Business rule enforcement**: Encode domain-specific rules that LLMs might misinterpret\n",
|
||||
"- **High-volume evaluation**: Reduce cost for evaluations that run on every production session\n",
|
||||
"- **Regulatory requirements**: Ensure certain disclosures or disclaimers are always present\n",
|
||||
"- **Continuous monitoring**: Combine with online evaluation for zero-touch production quality gates\n",
|
||||
"\n",
|
||||
"### On-demand vs. online evaluation summary\n",
|
||||
"\n",
|
||||
"| Dimension | On-demand | Online |\n",
|
||||
"|---|---|---|\n",
|
||||
"| Trigger | Explicit per session | Automatic on every invocation |\n",
|
||||
"| Setup | `EvaluationClient.run()` or `OnDemandEvaluationDatasetRunner` | `create_online_evaluation_config` once |\n",
|
||||
"| Code-based evaluators | ✅ Supported | ✅ Supported |\n",
|
||||
"| Evaluator locking | No | Yes — while config is enabled |\n",
|
||||
"| Best for | CI/CD, ad-hoc debugging | Continuous production monitoring |\n",
|
||||
"\n",
|
||||
"### Next steps\n",
|
||||
"\n",
|
||||
"- Combine code-based evaluators with `EvaluationClient` to evaluate specific production sessions\n",
|
||||
"- Add code-based evaluators to your CI/CD pipeline for automated regression testing\n",
|
||||
"- Use online evaluation with a lower sampling rate (e.g. 10%) to cost-effectively monitor high-traffic agents\n",
|
||||
"- Extend `HRFactChecker` with additional business rules as your agent evolves\n"
|
||||
]
|
||||
},
|
||||
|
||||
@@ -1,5 +1,4 @@
|
||||
bedrock-agentcore>=1.6.0
|
||||
bedrock-agentcore-starter-toolkit>=0.3.0
|
||||
boto3>=1.42.0
|
||||
strands-agents
|
||||
strands-agents-tools
|
||||
|
||||