How to Test smolagents from HuggingFace
smolagents is HuggingFace's lightweight Python library for building agents. "Small" is in the name — the library is minimal by design. What that means for testing: you get less scaffolding to hide behind and more direct access to the agent's internals. That's actually a testing advantage.
But "lightweight framework" doesn't mean "lightweight failure mode." An agent that generates and executes Python code (CodeAgent) can break in ways that are dramatically different from a conventional tool-calling agent. And a multi-agent pipeline where agents delegate to each other has coordination failure modes you won't find without tests.
Here's how to test smolagents applications properly.
What You're Testing in a smolagents Application
smolagents has two primary agent types:
- CodeAgent — generates Python code snippets and executes them. The agent's "actions" are code.
- ToolCallingAgent — calls tools via structured JSON, similar to OpenAI function calling.
Each has its own testing surface:
- Tools — the functions agents call; these are your safest starting point
- Agent configuration — correct model, tools list, system prompt
- Code generation quality (CodeAgent) — does the generated code actually solve the task?
- Tool selection (ToolCallingAgent) — does the agent pick the right tool for the right query?
- Multi-agent delegation — does the manager agent correctly route tasks to sub-agents?
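For the code-generation item in the list above, one inexpensive check needs no model at all: parse a generated snippet and inspect which functions it calls and which modules it imports. The sketch below uses only Python's standard ast module. The helper name summarize_snippet and the sample snippet are illustrative assumptions, not part of smolagents.

```python
import ast


def summarize_snippet(source: str) -> dict:
    """Return the function names called and modules imported by a code snippet.

    Hypothetical test helper: it only parses the code, it never runs it.
    """
    tree = ast.parse(source)
    calls, imports = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            calls.add(node.func.id)  # plain calls such as calculate(...)
        elif isinstance(node, ast.Import):
            imports.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            imports.add(node.module.split(".")[0])
    return {"calls": calls, "imports": imports}


# Example snippet of the kind a CodeAgent might emit (made up for illustration).
generated = '''
import math
result = calculate("10 * 5")
final_answer(result)
'''

summary = summarize_snippet(generated)
assert "calculate" in summary["calls"]
assert summary["imports"] <= {"math"}  # nothing outside the allowed imports
```

A check like this says nothing about whether the code is correct, only whether it stays within the tools and imports you expect, which is often the first thing to go wrong.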
Unit Testing smolagents Tools
Tools in smolagents are decorated Python functions. Test them directly — no agent, no LLM.
# tools/calculator.py
from smolagents import tool


@tool
def calculate(expression: str) -> str:
    """
    Evaluates a mathematical expression and returns the result.

    Args:
        expression: A mathematical expression like '2 + 2' or '(10 * 5) / 2'

    Returns:
        The result as a string
    """
    try:
        # Note: emptying __builtins__ blocks the obvious attacks, but eval()
        # is still not a real sandbox. Do not rely on it for untrusted input.
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

# tests/test_tools.py
from tools.calculator import calculate


def test_calculator_basic_arithmetic():
    assert calculate("2 + 2") == "4"
    assert calculate("10 * 5") == "50"
    assert calculate("100 / 4") == "25.0"


def test_calculator_complex_expression():
    assert calculate("(10 + 5) * 2") == "30"


def test_calculator_returns_error_for_invalid_expression():
    result = calculate("import os; os.system('rm -rf /')")
    # A statement is a syntax error for eval(), so nothing is executed
    # and the tool returns its error string
    assert "Error" in result


def test_calculator_handles_division_by_zero():
    result = calculate("1 / 0")
    assert "Error" in result or "ZeroDivision" in result


def test_tool_has_correct_metadata():
    # smolagents exposes tool metadata; verify it's correct
    assert calculate.name == "calculate"
    assert "mathematical expression" in calculate.description.lower()
    assert "expression" in calculate.inputs

The last test, which checks tool metadata, matters a lot in smolagents. The agent uses tool.name, tool.description, and tool.inputs to decide when and how to call the tool. If the description is wrong, the agent will misuse the tool. Tests that assert on tool metadata catch these configuration bugs.
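The metadata check can be written once and reused for every tool. The following is a minimal sketch under stated assumptions: the helper name metadata_problems and the 20-character threshold are invented for illustration, and the stand-in objects only mimic the name, description, and inputs attributes (with inputs mapping each argument name to a dict that has "type" and "description" keys).

```python
from types import SimpleNamespace


def metadata_problems(tool) -> list[str]:
    """List problems with a tool's name, description, and inputs.

    Hypothetical helper. It assumes each tool exposes .name, .description,
    and .inputs as described in the text above.
    """
    problems = []
    if not tool.name.isidentifier():
        problems.append(f"name {tool.name!r} is not a valid identifier")
    if len(tool.description.strip()) < 20:  # arbitrary threshold, tune to taste
        problems.append("description is too short to guide the model")
    for arg, spec in tool.inputs.items():
        for key in ("type", "description"):
            if not spec.get(key):
                problems.append(f"input {arg!r} is missing {key!r}")
    return problems


# Stand-in objects so the sketch runs without smolagents installed.
good = SimpleNamespace(
    name="calculate",
    description="Evaluates a mathematical expression and returns the result.",
    inputs={"expression": {"type": "string", "description": "Expression to evaluate"}},
)
bad = SimpleNamespace(
    name="calc tool",
    description="Math.",
    inputs={"x": {"type": "string"}},
)

assert metadata_problems(good) == []
print(metadata_problems(bad))  # one entry per problem found
```

In a real suite the same helper could be driven by pytest.mark.parametrize over the actual tool list, so that adding a tool automatically adds its metadata test.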
Testing ToolCallingAgent
ToolCallingAgent uses structured tool calls. Test it by mocking the model and asserting on which tool gets called.
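The pattern itself (script the model's responses, run the loop, assert on which tool was invoked) can be seen in isolation before wiring it to the library. The sketch below is library-independent: ScriptedModel and run_tool_loop are hypothetical stand-ins written for this example, and the loop is a toy, not the smolagents implementation.

```python
from dataclasses import dataclass, field


@dataclass
class ScriptedModel:
    """Stand-in model that replays canned responses in order (hypothetical)."""
    responses: list
    calls: list = field(default_factory=list)

    def __call__(self, messages):
        self.calls.append(list(messages))  # record a copy of what the model saw
        return self.responses[len(self.calls) - 1]


def run_tool_loop(model, tools, task, max_steps=3):
    """Toy agent loop, NOT the smolagents implementation.

    Each response is a dict: {"tool": name, "args": {...}} or {"final": text}.
    Returns the final answer and the ordered list of tool names invoked.
    """
    messages, invoked = [task], []
    for _ in range(max_steps):
        response = model(messages)
        if "final" in response:
            return response["final"], invoked
        invoked.append(response["tool"])
        observation = tools[response["tool"]](**response["args"])
        messages.append(f"observation: {observation}")
    raise RuntimeError("max_steps reached without a final answer")


def add(a: int, b: int) -> int:
    return a + b


model = ScriptedModel([
    {"tool": "add", "args": {"a": 2, "b": 2}},
    {"final": "The answer is 4."},
])
answer, invoked = run_tool_loop(model, {"add": add}, "What is 2 + 2?")

assert invoked == ["add"]                  # the expected tool was selected
assert answer == "The answer is 4."
assert "observation: 4" in model.calls[1]  # the tool result reached the model
```

The three assertions at the end are the ones worth porting to a real agent test: which tool ran, what the final answer was, and whether the observation was fed back to the model.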
# tests/test_tool_calling_agent.py
from types import SimpleNamespace
from unittest.mock import MagicMock

from smolagents import ToolCallingAgent, HfApiModel
from tools.calculator import calculate
from tools.search_tool import search_web


def create_agent():
    # Note: HfApiModel is named InferenceClientModel in newer smolagents releases
    model = HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
    return ToolCallingAgent(
        tools=[calculate, search_web],
        model=model,
        max_steps=5,
    )


def test_agent_has_correct_tools():
    agent = create_agent()
    tool_names = [t.name for t in agent.tools.values()]
    assert "calculate" in tool_names
    assert "search_web" in tool_names


def test_agent_respects_max_steps():
    agent = create_agent()
    assert agent.max_steps == 5


def test_agent_tool_calling_with_mock_model():
    mock_model = MagicMock()
    # First call: model decides to use calculate tool
    # Second call: model produces final answer
    # SimpleNamespace is used for the function because MagicMock(name=...)
    # names the mock itself and does not set a .name attribute.
    mock_model.side_effect = [
        MagicMock(
            content=None,
            tool_calls=[MagicMock(
                function=SimpleNamespace(
                    name="calculate",
                    arguments={"expression": "2 + 2"},
                )
            )],
        ),
        MagicMock(
            content="The answer is 4.",
            tool_calls=[],
        ),
    ]
    agent = ToolCallingAgent(
        tools=[calculate],
        model=mock_model,
        max_steps=3,
    )
    # This test stops short of agent.run(): it only checks that the agent
    # accepts the mock model and registers the calculate tool.
    assert "calculate" in agent.tools

Testing CodeAgent
CodeAgent is more complex — it generates executable Python code. The safety surface is wider (generated code can do unexpected things), and the test surface is broader.
# tests/test_code_agent.py
from unittest.mock import MagicMock

from smolagents import CodeAgent, HfApiModel
from tools.calculator import calculate


def create_code_agent():
    model = HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
    return CodeAgent(
        tools=[calculate],
        model=model,
        max_steps=5,
        additional_authorized_imports=["math"],
    )


def test_code_agent_authorized_imports():
    agent = create_code_agent()
    # Verify the agent only allows expected imports
    assert "math" in agent.additional_authorized_imports


def test_code_agent_uses_local_python_executor_by_default():
    agent = create_code_agent()
    # CodeAgent uses a local Python executor by default.
    # Verify it's not accidentally configured to run remote code.
    # getattr keeps the check from raising on releases that lack an attribute.
    executor_type = getattr(agent, "executor_type", None)
    executor = getattr(agent, "python_executor", None)
    assert executor_type in ("local", None) or \
        "local" in str(type(executor)).lower()


def test_code_agent_tool_available_in_sandbox():
    agent = create_code_agent()
    # The calculate tool should be accessible inside the code sandbox
    assert "calculate" in agent.tools


def test_code_agent_with_mock_model():
    mock_model = MagicMock()
    # Model generates code that calls the calculate tool
    mock_model.return_value = MagicMock(
        content="""
```py
result = calculate("10 * 5")
final_answer(result)
```""",
        tool_calls=[]
    )
    agent = CodeAgent(
        tools=[calculate],
        model=mock_model,
        max_steps=2,
    )
    # The agent should be constructable with these settings;
    # this test does not run the generated code
    assert agent.max_steps == 2

Testing Multi-Agent Systems
smolagents supports multi-agent pipelines where a manager agent delegates to specialized sub-agents. This is where the most surprising failures occur.
# tests/test_multi_agent.py
from smolagents import ToolCallingAgent, CodeAgent, HfApiModel, ManagedAgent
from tools.calculator import calculate


def create_multi_agent_system():
    model = HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
    # Sub-agent: handles math
    math_agent = ToolCallingAgent(
        tools=[calculate],
        model=model,
        max_steps=3,
    )
    # Wrap in ManagedAgent for delegation.
    # Note: ManagedAgent is version-dependent; newer smolagents releases set
    # name= and description= on the agent itself instead.
    managed_math = ManagedAgent(
        agent=math_agent,
        name="math_specialist",
        description="Handles mathematical calculations and numeric analysis.",
    )
    # Manager agent: sub-agents are passed via managed_agents, not tools
    manager = CodeAgent(
        tools=[],
        model=model,
        managed_agents=[managed_math],
        max_steps=10,
    )
    return manager, managed_math


def test_managed_agent_has_correct_name():
    _, managed_math = create_multi_agent_system()
    assert managed_math.name == "math_specialist"


def test_managed_agent_description_is_informative():
    _, managed_math = create_multi_agent_system()
    # Description needs to be specific enough for the manager to route correctly
    assert len(managed_math.description) > 20
    assert "math" in managed_math.description.lower() or \
        "calculat" in managed_math.description.lower()


def test_manager_registers_sub_agent():
    manager, managed_math = create_multi_agent_system()
    assert "math_specialist" in manager.managed_agents


def test_system_max_steps_are_sensible():
    manager, managed_math = create_multi_agent_system()
    # Manager should have more steps than sub-agents to allow delegation
    assert manager.max_steps > managed_math.agent.max_steps

The test_managed_agent_description_is_informative test is the most important one here. The manager agent reads the sub-agent's description to decide when to delegate. A vague or wrong description means the manager routes tasks to the wrong agent. Testing the description content catches this class of bug.
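With more than one sub-agent, the descriptions also need to be distinguishable from each other, not just individually informative. One rough way to test that is keyword overlap. The sketch below is an assumption-laden heuristic: the function names, the four-letter word filter, the thresholds, and the second and third descriptions are all invented for illustration.

```python
import re


def keywords(text: str) -> set[str]:
    """Lowercased words of four or more letters, a crude proxy for topic."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def description_overlap(a: str, b: str) -> float:
    """Jaccard similarity of two keyword sets (0.0 = disjoint, 1.0 = identical)."""
    ka, kb = keywords(a), keywords(b)
    if not ka or not kb:
        return 1.0  # an empty description cannot be told apart from anything
    return len(ka & kb) / len(ka | kb)


# Hypothetical sub-agent descriptions; only the first comes from the article.
descriptions = {
    "math_specialist": "Handles mathematical calculations and numeric analysis.",
    "web_researcher": "Searches the web and summarizes pages for factual questions.",
    "helper": "Handles tasks.",
}

# Two well-scoped agents should share few keywords.
assert description_overlap(
    descriptions["math_specialist"], descriptions["web_researcher"]
) < 0.3
# A vague description has almost nothing distinctive in it.
assert len(keywords(descriptions["helper"])) < 3
```

A heuristic like this cannot tell whether the manager will actually route correctly; it only flags descriptions that are obviously too thin or too similar before a real-model test is needed.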
Integration Tests Against the Full Pipeline
Once unit tests pass, run a small set of integration tests that exercise the full agent loop with a real (or stubbed) model:
# tests/test_integration.py
import os

import pytest

# Skip every test in this module unless a Hugging Face token is set
pytestmark = pytest.mark.skipif(
    not os.getenv("HF_TOKEN"),
    reason="Requires HF_TOKEN for integration tests"
)


def test_tool_calling_agent_solves_math_problem():
    from smolagents import ToolCallingAgent, HfApiModel
    from tools.calculator import calculate
    agent = ToolCallingAgent(
        tools=[calculate],
        model=HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
        max_steps=3,
    )
    result = agent.run("What is 25 multiplied by 4?")
    assert "100" in str(result)


def test_agent_completes_within_timeout():
    import time
    from smolagents import ToolCallingAgent, HfApiModel
    from tools.calculator import calculate
    agent = ToolCallingAgent(
        tools=[calculate],
        model=HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
        max_steps=3,
    )
    # monotonic() is unaffected by system clock adjustments
    start = time.monotonic()
    result = agent.run("Calculate 100 / 4.")
    elapsed = time.monotonic() - start
    assert elapsed < 30, f"Agent took {elapsed:.1f}s, too slow"
    assert result is not None

How HelpMeTest Helps
If your smolagents application has a browser UI — a chat interface, a dashboard where users submit tasks — HelpMeTest tests what users actually experience:
Go to https://your-smolagents-app.com
Click the task input
Type "Calculate 15% of 200"
Submit the task
Wait for the agent to respond
Verify the response contains "30"

HelpMeTest runs these on a schedule and alerts you when the agent stops responding correctly in production: a HuggingFace API outage, a model that was updated and now produces different output, or a timeout that leaves the UI spinner stuck indefinitely.
Set up helpmetest health smolagents-api 5m for a continuous availability check that fires every 5 minutes. The free tier covers 10 tests — enough for full coverage of your critical agent paths.
What to Ship
Minimum test coverage for a production smolagents application:
- Tool metadata tests — name, description, and input schema are correct for every tool
- Tool unit tests — happy path and error path for every tool function
- Agent configuration tests — correct tools list, max_steps, and authorized imports
- Multi-agent structure tests — sub-agent names and descriptions are specific and correct
- Integration tests — at least one real-model test per major agent capability (run on deploy)
- Browser tests — UI smoke test running on a schedule in production
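One caveat for the integration-test item: the timing test in the integration section measures elapsed time only after agent.run returns, so a call that never returns would hang the test run instead of failing it. A minimal sketch of a hard bound, using only the standard library, is shown below. The helper run_with_timeout and the stand-in fake_agent_run are hypothetical names introduced for this example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_with_timeout(fn, timeout_s: float, *args, **kwargs):
    """Run fn in a worker thread and raise TimeoutError if it takes too long.

    Hypothetical helper. The worker thread is not killed on timeout, only
    abandoned, so this bounds the test rather than the underlying call.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        raise TimeoutError(f"call exceeded {timeout_s}s") from None
    finally:
        pool.shutdown(wait=False)  # do not block on a stuck worker


def fake_agent_run(task: str, delay: float = 0.0) -> str:
    """Stand-in for agent.run so the sketch runs without a model."""
    time.sleep(delay)
    return f"done: {task}"


# Fast call: the result comes back unchanged.
assert run_with_timeout(fake_agent_run, 1.0, "Calculate 100 / 4.") == "done: Calculate 100 / 4."

# Slow call: the helper gives up after the deadline.
timed_out = False
try:
    run_with_timeout(fake_agent_run, 0.05, "slow task", delay=0.5)
except TimeoutError:
    timed_out = True
assert timed_out
```

In a real test, fake_agent_run would be replaced by the agent's run method. If the project already uses a pytest timeout plugin, that is likely the simpler choice.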
smolagents is intentionally minimal, which means there is little framework code to blame: most bugs will live in your own tools, descriptions, and configuration. Tests catch them before users do.
Start with HelpMeTest's free tier — 10 tests, no credit card. Add browser-level monitoring to your smolagents application in under ten minutes at helpmetest.com.