Community · v1.0.0

Agent Evaluation

Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. Even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

4.5k downloads · 7 stars · 54 active installs · rustyorb

Skill Details

Slug
agent-evaluation
Latest Version
1.0.0
Author
rustyorb
Published
Feb 9, 2026
Updated
Feb 25, 2026
Total Versions
1

How to Install

  1. Create an account on OpenClawdBots (takes under 60 seconds).
  2. Open your bot dashboard and go to the Skills tab.
  3. Switch to the ClawHub tab and search for Agent Evaluation.
  4. Click Install and the skill is deployed to your bot automatically.

Changelog — v1.0.0

- Initial release of the agent-evaluation skill for testing and benchmarking LLM agents.
- Supports behavioral testing, capability assessment, reliability metrics, and production monitoring.
- Includes practical testing patterns: statistical test evaluation, behavioral contract testing, and adversarial testing.
- Highlights common anti-patterns and sharp edges in LLM agent evaluation.
- Designed for use alongside related skills such as multi-agent orchestration and autonomous agents.
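To illustrate the kind of pattern the changelog refers to, here is a minimal sketch of statistical test evaluation combined with a behavioral contract check. The `run_agent` function is a hypothetical stand-in for a real agent call (this skill's actual API is not documented here), and the 80% success rate and 0.6 threshold are illustrative assumptions only.

```python
import random


def run_agent(task: str) -> str:
    # Hypothetical stand-in for a real LLM agent invocation.
    # Succeeds ~80% of the time to simulate non-deterministic behavior;
    # replace with your agent's actual call.
    return "PASS" if random.random() < 0.8 else "FAIL"


def pass_rate(task: str, trials: int = 20) -> float:
    """Statistical test evaluation: run the same task repeatedly and
    report the fraction of successful runs, since a single run tells
    you little about a non-deterministic agent."""
    successes = sum(run_agent(task) == "PASS" for _ in range(trials))
    return successes / trials


if __name__ == "__main__":
    random.seed(0)  # fixed seed so the evaluation is reproducible
    rate = pass_rate("summarize this document", trials=50)
    print(f"pass rate over 50 trials: {rate:.2f}")
    # Behavioral contract: fail the suite if reliability drops below
    # an agreed threshold (0.6 here is an arbitrary example value).
    assert rate >= 0.6, "agent reliability below contract threshold"
```

The same loop extends naturally to adversarial testing: swap the fixed task for a set of adversarial prompts and require the contract to hold on each one.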