How to Test smolagents from HuggingFace

How to Test smolagents from HuggingFace

smolagents is HuggingFace's lightweight Python library for building agents. "Small" is in the name — the library is minimal by design. What that means for testing: you get less scaffolding to hide behind and more direct access to the agent's internals. That's actually a testing advantage.

But "lightweight framework" doesn't mean "lightweight failure mode." An agent that generates and executes Python code (CodeAgent) can break in ways that are dramatically different from a conventional tool-calling agent. And a multi-agent pipeline where agents delegate to each other has coordination failure modes you won't find without tests.

Here's how to test smolagents applications properly.

What You're Testing in a smolagents Application

smolagents has two primary agent types:

  • CodeAgent — generates Python code snippets and executes them. The agent's "actions" are code.
  • ToolCallingAgent — calls tools via structured JSON, similar to OpenAI function calling.

Each has its own testing surface:

  • Tools — the functions agents call; these are your safest starting point
  • Agent configuration — correct model, tools list, system prompt
  • Code generation quality (CodeAgent) — does the generated code actually solve the task?
  • Tool selection (ToolCallingAgent) — does the agent pick the right tool for the right query?
  • Multi-agent delegation — does the manager agent correctly route tasks to sub-agents?

Unit Testing smolagents Tools

Tools in smolagents are decorated Python functions. Test them directly — no agent, no LLM.

# tools/calculator.py
from smolagents import tool

@tool
def calculate(expression: str) -> str:
    """
    Evaluates a mathematical expression and returns the result.
    
    Args:
        expression: A mathematical expression like '2 + 2' or '(10 * 5) / 2'
    
    Returns:
        The result as a string
    """
    try:
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"
# tests/test_tools.py
import pytest
from tools.calculator import calculate

def test_calculator_basic_arithmetic():
    assert calculate("2 + 2") == "4"
    assert calculate("10 * 5") == "50"
    assert calculate("100 / 4") == "25.0"

def test_calculator_complex_expression():
    assert calculate("(10 + 5) * 2") == "30"

def test_calculator_returns_error_for_invalid_expression():
    result = calculate("import os; os.system('rm -rf /')")
    # The tool should handle this without executing system commands
    assert "Error" in result

def test_calculator_handles_division_by_zero():
    result = calculate("1 / 0")
    assert "Error" in result or "ZeroDivision" in result

def test_tool_has_correct_metadata():
    # smolagents exposes tool metadata — verify it's correct
    assert calculate.name == "calculate"
    assert "mathematical expression" in calculate.description.lower()
    assert "expression" in calculate.inputs

The last test — checking tool metadata — is important for smolagents. The agent uses tool.name, tool.description, and tool.inputs to decide when and how to call the tool. If the description is wrong, the agent will misuse the tool. Tests that assert on tool metadata catch these configuration bugs.

Testing ToolCallingAgent

ToolCallingAgent uses structured tool calls. Test it by mocking the model and asserting on which tool gets called.

# tests/test_tool_calling_agent.py
import pytest
from unittest.mock import MagicMock, patch
from smolagents import ToolCallingAgent, HfApiModel
from tools.calculator import calculate
from tools.search_tool import search_web

def create_agent():
    model = HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
    return ToolCallingAgent(
        tools=[calculate, search_web],
        model=model,
        max_steps=5,
    )

def test_agent_has_correct_tools():
    agent = create_agent()
    tool_names = [t.name for t in agent.tools.values()]
    assert "calculate" in tool_names
    assert "search_web" in tool_names

def test_agent_respects_max_steps():
    agent = create_agent()
    assert agent.max_steps == 5

def test_agent_tool_calling_with_mock_model():
    mock_model = MagicMock()
    
    # First call: model decides to use calculate tool
    # Second call: model produces final answer
    mock_model.side_effect = [
        MagicMock(
            content=None,
            tool_calls=[MagicMock(
                function=MagicMock(
                    name="calculate",
                    arguments={"expression": "2 + 2"}
                )
            )]
        ),
        MagicMock(
            content="The answer is 4.",
            tool_calls=[]
        )
    ]

    agent = ToolCallingAgent(
        tools=[calculate],
        model=mock_model,
        max_steps=3,
    )

    # Run the agent — the key assertion is that it doesn't crash
    # and that the calculate tool would have been invoked
    assert agent.tools["calculate"] is not None

Testing CodeAgent

CodeAgent is more complex — it generates executable Python code. The safety surface is wider (generated code can do unexpected things), and the test surface is broader.

# tests/test_code_agent.py
import pytest
from unittest.mock import MagicMock, patch
from smolagents import CodeAgent, HfApiModel
from tools.calculator import calculate

def create_code_agent():
    model = HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")
    return CodeAgent(
        tools=[calculate],
        model=model,
        max_steps=5,
        additional_authorized_imports=["math"],
    )

def test_code_agent_authorized_imports():
    agent = create_code_agent()
    # Verify the agent only allows expected imports
    assert "math" in agent.additional_authorized_imports

def test_code_agent_uses_local_python_executor_by_default():
    agent = create_code_agent()
    # CodeAgent uses LocalPythonExecutor by default
    # Verify it's not accidentally configured to run remote code
    assert agent.executor_type in ("local", None) or \
           "local" in str(type(agent.python_executor)).lower()

def test_code_agent_tool_available_in_sandbox():
    agent = create_code_agent()
    # The calculate tool should be accessible inside the code sandbox
    assert "calculate" in agent.tools

def test_code_agent_with_mock_model():
    mock_model = MagicMock()
    
    # Model generates code that calls the calculate tool
    mock_model.return_value = MagicMock(
        content="""
```py
result = calculate("10 * 5")
final_answer(result)
```""",
        tool_calls=[]
    )

    agent = CodeAgent(
        tools=[calculate],
        model=mock_model,
        max_steps=2,
    )

    # The agent should be constructable with these settings
    assert agent.max_steps == 2

Testing Multi-Agent Systems

smolagents supports multi-agent pipelines where a manager agent delegates to specialized sub-agents. This is where the most surprising failures occur.

# tests/test_multi_agent.py
import pytest
from smolagents import ToolCallingAgent, CodeAgent, HfApiModel, ManagedAgent

def create_multi_agent_system():
    model = HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct")

    # Sub-agent: handles math
    math_agent = ToolCallingAgent(
        tools=[calculate],
        model=model,
        max_steps=3,
    )

    # Wrap in ManagedAgent for delegation
    managed_math = ManagedAgent(
        agent=math_agent,
        name="math_specialist",
        description="Handles mathematical calculations and numeric analysis.",
    )

    # Manager agent
    manager = CodeAgent(
        tools=[managed_math],
        model=model,
        max_steps=10,
    )

    return manager, managed_math

def test_managed_agent_has_correct_name():
    _, managed_math = create_multi_agent_system()
    assert managed_math.name == "math_specialist"

def test_managed_agent_description_is_informative():
    _, managed_math = create_multi_agent_system()
    # Description needs to be specific enough for the manager to route correctly
    assert len(managed_math.description) > 20
    assert "math" in managed_math.description.lower() or \
           "calculat" in managed_math.description.lower()

def test_manager_has_sub_agent_as_tool():
    manager, managed_math = create_multi_agent_system()
    assert "math_specialist" in manager.tools

def test_system_max_steps_are_sensible():
    manager, _ = create_multi_agent_system()
    # Manager should have more steps than sub-agents to allow delegation
    assert manager.max_steps >= 5

The test_managed_agent_description_is_informative test is the most important one here. The manager agent reads the sub-agent's description to decide when to delegate. A vague or wrong description means the manager routes tasks to the wrong agent. Testing the description content catches this class of bug.

Integration Tests Against the Full Pipeline

Once unit tests pass, run a small set of integration tests that exercise the full agent loop with a real (or stubbed) model:

# tests/test_integration.py
import pytest
import os

pytestmark = pytest.mark.skipif(
    not os.getenv("HF_TOKEN"),
    reason="Requires HF_TOKEN for integration tests"
)

def test_tool_calling_agent_solves_math_problem():
    from smolagents import ToolCallingAgent, HfApiModel
    from tools.calculator import calculate

    agent = ToolCallingAgent(
        tools=[calculate],
        model=HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
        max_steps=3,
    )

    result = agent.run("What is 25 multiplied by 4?")
    assert "100" in str(result)

def test_agent_completes_within_timeout():
    import time
    from smolagents import ToolCallingAgent, HfApiModel
    from tools.calculator import calculate

    agent = ToolCallingAgent(
        tools=[calculate],
        model=HfApiModel(model_id="meta-llama/Meta-Llama-3.1-8B-Instruct"),
        max_steps=3,
    )

    start = time.time()
    result = agent.run("Calculate 100 / 4.")
    elapsed = time.time() - start

    assert elapsed < 30, f"Agent took {elapsed:.1f}s — too slow"
    assert result is not None

How HelpMeTest Helps

If your smolagents application has a browser UI — a chat interface, a dashboard where users submit tasks — HelpMeTest tests what users actually experience:

Go to https://your-smolagents-app.com
Click the task input
Type "Calculate 15% of 200"
Submit the task
Wait for the agent to respond
Verify the response contains "30"

HelpMeTest runs these on a schedule and alerts you when the agent stops responding correctly in production — a HuggingFace API outage, a model that was updated and now produces different output, or a timeout that leaves the UI spinner stuck indefinitely.

Set up helpmetest health smolagents-api 5m for a continuous availability check that fires every 5 minutes. The free tier covers 10 tests — enough for full coverage of your critical agent paths.

What to Ship

Minimum test coverage for a production smolagents application:

  1. Tool metadata tests — name, description, and input schema are correct for every tool
  2. Tool unit tests — happy path and error path for every tool function
  3. Agent configuration tests — correct tools list, max_steps, and authorized imports
  4. Multi-agent structure tests — sub-agent names and descriptions are specific and correct
  5. Integration tests — at least one real-model test per major agent capability (run on deploy)
  6. Browser tests — UI smoke test running on a schedule in production

smolagents is intentionally minimal — which means the bugs you introduce are intentional too. Tests catch them before users do.


Start with HelpMeTest's free tier — 10 tests, no credit card. Add browser-level monitoring to your smolagents application in under ten minutes at helpmetest.com.

Read more