Running Evals

Here's how to run benchmark evaluations with wags. We currently support:

- BFCL: Berkeley Function Calling Leaderboard multi-turn tests
- AppWorld: realistic task evaluation across 9 day-to-day apps

Setup

First-Time Setup

Bash
# 1. Initialize the BFCL data submodule
git submodule update --init --recursive

# 2. Install evaluation dependencies
UV_GIT_LFS=1 uv pip install -e ".[dev,evals]"
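
To confirm the submodule came down correctly, `git submodule status` is a quick check: an initialized submodule shows a plain commit hash, while a leading `-` means it has not been checked out yet.

Bash
# Verify the data submodule is checked out (a leading "-" means it isn't)
git submodule status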

Updating Data

If you already have the submodule initialized:

Bash
# Update to latest test data
git submodule update --remote
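
Pulling new data moves the submodule to a newer commit, which the parent repository sees as a pending change. To pin the updated test data, commit the new pointer (the submodule path below is a placeholder; `git status` shows the real one):

Bash
# Review the submodule bump, then commit it
git status
git diff --submodule
git add <path-to-data-submodule>   # placeholder: use the path git status reports
git commit -m "Update BFCL test data"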

AppWorld Setup

For AppWorld benchmark evaluation:

Bash
# Install evaluation dependencies
UV_GIT_LFS=1 uv pip install -e ".[dev,evals]"

# Initialize AppWorld environment
appworld install

# Download benchmark data
appworld download data

Running Tests

Basic Usage

Bash
# Run all BFCL multi-turn tests
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py

# Run specific test
.venv/bin/pytest 'tests/benchmarks/bfcl/test_bfcl.py::test_bfcl[multi_turn_base_121]'

# Run a test category (multi_turn_base, multi_turn_miss_func, multi_turn_miss_param, multi_turn_long_context)
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -k "multi_turn_miss_func"
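
The `-k` flag accepts standard pytest boolean expressions, so multiple categories can be selected in a single run:

Bash
# Run two categories in one invocation
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -k "multi_turn_base or multi_turn_miss_param"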

With Different Models

Bash
# Use GPT-4o (default)
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model gpt-4o

# Use Claude
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model claude-3-5-sonnet-20241022

# Use GPT-4o-mini
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --model gpt-4o-mini
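
These runs call hosted models, so the matching provider credentials presumably need to be in the environment first. Assuming the standard OpenAI and Anthropic client conventions, that means the usual variables:

Bash
# Assumed: standard provider environment variables
export OPENAI_API_KEY=...      # for gpt-4o / gpt-4o-mini
export ANTHROPIC_API_KEY=...   # for claude-3-5-sonnet-20241022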

Custom Output Directory

Bash
# Save results to specific directory
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --output-dir outputs/experiment1

Validation Mode

Validate existing logs without running new tests:

BFCL:

Bash
# Validate logs (auto-detects from outputs/raw/)
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --validate-only

# Or specify custom output directory
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py --validate-only --output-dir outputs/experiment1

AppWorld:

Bash
# Validate logs (auto-detects from results/{model}/{dataset}/)
.venv/bin/pytest tests/benchmarks/appworld/test_appworld.py --validate-only --model gpt-4o --dataset train

AppWorld Results Organization

AppWorld tests automatically organize results during execution:

Bash
# Run tests - results automatically organized
.venv/bin/pytest tests/benchmarks/appworld/test_appworld.py --dataset train --model gpt-4o

# Results automatically written to:
# - results/gpt-4o/train/outputs/raw/ (conversation logs)
# - results/gpt-4o/train/failure_reports/ (auto-generated for failed tests)
# - experiments/outputs/gpt-4o/train/ (AppWorld evaluation data)

# Clean up large experiment directories after tests
rm -rf experiments/outputs/gpt-4o/  # Frees ~15GB
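
To see how much space a run actually consumed before deleting anything, a quick `du` over the same directories works:

Bash
# Check sizes before cleanup
du -sh experiments/outputs/gpt-4o/ results/gpt-4o/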

AppWorld-specific options:

Bash
--dataset DATASET         # Dataset: train, dev, test_normal, test_challenge (default: train)
--limit N                 # Run only first N tasks from dataset
--start-from TASK_ID      # Resume from specific task ID
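
These compose with the flags shown earlier. For example, a short smoke run on the dev split, or resuming an interrupted run (TASK_ID stays a placeholder for a real AppWorld task ID):

Bash
# Smoke test: first 10 dev tasks only
.venv/bin/pytest tests/benchmarks/appworld/test_appworld.py --dataset dev --limit 10

# Resume an interrupted run from a given task
.venv/bin/pytest tests/benchmarks/appworld/test_appworld.py --dataset train --start-from TASK_ID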

Parallel Execution

Run tests in parallel using multiple workers (the `-n` flag comes from the pytest-xdist plugin):

Bash
# Run with 4 workers
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -n 4

# Run with 8 workers
.venv/bin/pytest tests/benchmarks/appworld/test_appworld.py --dataset train -n 8

# Auto-detect number of CPUs
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -n auto
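
The worker count composes with the model and output-directory flags above, so a full parallel sweep can be pinned to its own results directory:

Bash
# Parallel sweep with an isolated output directory
.venv/bin/pytest tests/benchmarks/bfcl/test_bfcl.py -n auto --model gpt-4o-mini --output-dir outputs/parallel-run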

Further Reading

- BFCL
- AppWorld