Testing AI Memory Systems: Mem0, Zep, and Long-Term Agent Memory

Testing AI Memory Systems: Mem0, Zep, and Long-Term Agent Memory

AI memory systems (Mem0, Zep, Pinecone-based, or custom vector stores) let agents remember information across conversations. Testing them requires more than verifying that data is stored and retrieved — you need to test retrieval relevance, memory conflicts, context injection into prompts, and what happens when memories degrade or contradict.

Key Takeaways

Memory retrieval is semantic, not exact — test for relevance, not equality. Unlike a key-value store, memory systems retrieve by semantic similarity. A memory stored as "user likes Python" should be retrieved when the user asks about programming, not just when they mention "Python."

Memory conflicts are the hardest failure mode to test. When an agent stores "user prefers dark mode" and later "user switched to light mode," which wins? Test conflict resolution explicitly.

Context injection is a separate layer to test. Storing and retrieving memories is different from injecting them into prompts correctly. A memory retrieved but not injected properly is invisible to the model.

Memory systems have latency that affects UX. Retrieval adds 50-500ms per turn. Test that your memory system meets your latency SLO before going to production.

What AI Memory Systems Do

An AI agent without memory treats every conversation as the first. Memory systems give agents persistence:

  • Short-term memory: Within a single conversation (context window)
  • Long-term memory: Across conversations (vector store, database)
  • Episodic memory: Specific past events ("last Tuesday you asked about X")
  • Semantic memory: Learned facts ("user is a Python developer")
  • Procedural memory: Learned behaviors ("user prefers concise answers")

The two most widely used libraries are:

  • Mem0: Managed memory layer, extracts and stores facts automatically from conversations
  • Zep: Open-source, stores and retrieves conversation history with temporal awareness

Setting Up Mem0 for Testing

from mem0 import MemoryClient
import pytest

# For testing, use a separate Mem0 project or mock
memory_client = MemoryClient(api_key="your-mem0-api-key")

TEST_USER_ID = "test-user-pytest"

@pytest.fixture(autouse=True)
def clean_test_memories():
    """Clear test user memories before each test"""
    yield
    # Cleanup after test
    try:
        memories = memory_client.search("", user_id=TEST_USER_ID, limit=100)
        for mem in memories:
            memory_client.delete(mem["id"])
    except Exception:
        pass  # Best effort cleanup

def test_memory_stores_user_preference():
    """Basic storage: memory is retrievable after adding"""
    memory_client.add(
        "I prefer Python over JavaScript for backend development",
        user_id=TEST_USER_ID
    )
    
    results = memory_client.search(
        "What programming language does the user prefer?",
        user_id=TEST_USER_ID
    )
    
    assert len(results) > 0, "Memory should be retrievable"
    
    # Check that at least one result mentions Python
    combined_text = " ".join(r["memory"] for r in results).lower()
    assert "python" in combined_text, f"Expected Python in memory results: {results}"

def test_memory_retrieval_is_semantically_relevant():
    """Semantic search: related query retrieves related memory"""
    memory_client.add(
        "The user works as a machine learning engineer at a fintech startup",
        user_id=TEST_USER_ID
    )
    
    # Query with different wording but same semantic meaning
    results = memory_client.search(
        "What is the user's job?",
        user_id=TEST_USER_ID
    )
    
    assert len(results) > 0
    top_result = results[0]["memory"].lower()
    assert any(term in top_result for term in ["machine learning", "engineer", "fintech"])

def test_irrelevant_query_returns_low_relevance():
    """Unrelated queries should not retrieve unrelated memories"""
    memory_client.add(
        "User's favorite color is blue",
        user_id=TEST_USER_ID
    )
    
    results = memory_client.search(
        "What database does the user use?",
        user_id=TEST_USER_ID,
        limit=5
    )
    
    # If results are returned, they should have low relevance scores
    for result in results:
        if "color" in result["memory"].lower():
            # Color memory retrieved for a database query — check relevance score
            assert result.get("score", 1.0) < 0.5, (
                "Irrelevant memory should have low relevance score"
            )

Testing Memory Extraction from Conversations

Mem0 automatically extracts facts from conversation text. Test that extraction works correctly:

def test_mem0_extracts_facts_from_conversation():
    """Mem0 should extract discrete facts from conversation"""
    conversation = [
        {"role": "user", "content": "I'm building a SaaS app in Django and React"},
        {"role": "assistant", "content": "Great! Django is excellent for backend APIs."},
        {"role": "user", "content": "Yes, and I'm deploying to AWS ECS"},
    ]
    
    memory_client.add(conversation, user_id=TEST_USER_ID)
    
    # Check that key facts were extracted
    django_results = memory_client.search("backend framework", user_id=TEST_USER_ID)
    aws_results = memory_client.search("deployment platform", user_id=TEST_USER_ID)
    
    django_text = " ".join(r["memory"] for r in django_results).lower()
    aws_text = " ".join(r["memory"] for r in aws_results).lower()
    
    assert "django" in django_text, "Django preference should be extracted"
    assert "aws" in aws_text or "ecs" in aws_text, "AWS deployment should be extracted"

def test_mem0_does_not_hallucinate_facts():
    """Mem0 should not extract facts not present in conversation"""
    conversation = [
        {"role": "user", "content": "What's the weather like today?"},
        {"role": "assistant", "content": "I don't have real-time weather data."},
    ]
    
    memory_client.add(conversation, user_id=TEST_USER_ID)
    
    # Should not extract a programming language preference
    results = memory_client.search(
        "user's programming language preference",
        user_id=TEST_USER_ID
    )
    
    # Either no results or results shouldn't claim a language preference
    for result in results:
        memory_text = result["memory"].lower()
        assert not any(
            lang in memory_text
            for lang in ["python", "javascript", "java", "rust", "go"]
        ), f"Hallucinated language preference: {result['memory']}"

Testing Zep Memory

Zep stores conversation sessions and provides temporal memory retrieval:

from zep_python import ZepClient, Message, Memory
import pytest

zep_client = ZepClient(api_url="http://localhost:8000")

TEST_SESSION_ID = "pytest-session-001"

@pytest.fixture
def fresh_session():
    """Create a clean Zep session for each test"""
    zep_client.memory.delete_session(TEST_SESSION_ID)
    yield TEST_SESSION_ID
    zep_client.memory.delete_session(TEST_SESSION_ID)

def test_zep_stores_conversation_turns(fresh_session):
    """Messages added to session are retrievable"""
    messages = [
        Message(role="user", content="My name is Alice"),
        Message(role="assistant", content="Hello Alice, nice to meet you!"),
        Message(role="user", content="I'm working on a Python project"),
    ]
    
    zep_client.memory.add_memory(
        fresh_session,
        Memory(messages=messages)
    )
    
    # Retrieve the memory
    memory = zep_client.memory.get_memory(fresh_session)
    
    assert memory is not None
    assert len(memory.messages) == 3
    
    # Verify content
    all_content = " ".join(m.content for m in memory.messages)
    assert "Alice" in all_content
    assert "Python" in all_content

def test_zep_search_returns_relevant_context(fresh_session):
    """Zep semantic search returns relevant past conversation"""
    # Add a multi-turn conversation
    messages = [
        Message(role="user", content="I need help with async Python code"),
        Message(role="assistant", content="Sure, let's use asyncio for concurrent operations"),
        Message(role="user", content="I'm having issues with event loops"),
        Message(role="assistant", content="Event loop issues often stem from blocking calls"),
    ]
    zep_client.memory.add_memory(fresh_session, Memory(messages=messages))
    
    # Search for related content
    results = zep_client.memory.search_memory(
        fresh_session,
        "asynchronous programming problems"
    )
    
    assert len(results) > 0
    # Top result should be about async/event loops
    top_text = results[0].message.content.lower()
    assert any(term in top_text for term in ["async", "asyncio", "event loop", "concurrent"])

def test_zep_memory_summary_is_generated(fresh_session):
    """Zep should generate summaries for longer conversations"""
    # Add enough messages to trigger summarization
    messages = []
    for i in range(20):
        messages.append(Message(role="user", content=f"Question {i} about Python testing"))
        messages.append(Message(role="assistant", content=f"Answer {i}: Use pytest for testing"))
    
    zep_client.memory.add_memory(fresh_session, Memory(messages=messages))
    
    # Wait for summary generation (Zep processes async)
    import time
    time.sleep(2)
    
    memory = zep_client.memory.get_memory(fresh_session)
    
    # Summary should exist for longer conversations
    assert memory.summary is not None
    assert len(memory.summary.content) > 20, "Summary should be non-trivial"

Testing Memory Conflicts

What happens when a user updates their preferences?

def test_updated_preference_overrides_old():
    """New preferences should supersede old ones"""
    # Add initial preference
    memory_client.add(
        "User prefers TypeScript over JavaScript",
        user_id=TEST_USER_ID
    )
    
    # Update preference
    memory_client.add(
        "User has switched to Go for backend development, moved away from TypeScript",
        user_id=TEST_USER_ID
    )
    
    results = memory_client.search(
        "What backend language does the user prefer?",
        user_id=TEST_USER_ID,
        limit=3
    )
    
    # Top result should reflect the newer preference
    top_memory = results[0]["memory"].lower()
    assert "go" in top_memory, (
        f"Top memory should reflect updated Go preference, got: {results[0]['memory']}"
    )

def test_contradictory_memories_surface_most_recent():
    """Most recently added memory should rank higher than old contradictory ones"""
    import time
    
    memory_client.add(
        "User is based in San Francisco",
        user_id=TEST_USER_ID
    )
    time.sleep(0.5)
    
    memory_client.add(
        "User relocated to New York City for a new job",
        user_id=TEST_USER_ID
    )
    
    results = memory_client.search(
        "Where does the user live?",
        user_id=TEST_USER_ID,
        limit=5
    )
    
    # First result should mention New York
    assert len(results) > 0
    top_memory = results[0]["memory"].lower()
    assert "new york" in top_memory or "nyc" in top_memory, (
        f"Most recent location (NYC) should rank first: {results[0]['memory']}"
    )

Testing Context Injection Into Prompts

Retrieving memories doesn't automatically help the model — you need to inject them into the prompt correctly:

from openai import OpenAI

openai_client = OpenAI()

def answer_with_memory(question: str, user_id: str) -> str:
    """Answer a question using retrieved memories for context"""
    # Retrieve relevant memories
    memories = memory_client.search(question, user_id=user_id, limit=5)
    
    # Inject into system prompt
    if memories:
        memory_context = "\n".join(
            f"- {m['memory']}" for m in memories
        )
        system_prompt = f"""You are a helpful assistant with memory of previous conversations.

Known facts about this user:
{memory_context}

Use this context to personalize your responses."""
    else:
        system_prompt = "You are a helpful assistant."
    
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question}
        ]
    )
    
    return response.choices[0].message.content

@pytest.mark.llm
def test_memory_context_influences_response():
    """Model should use retrieved memories in its response"""
    # Store a clear preference
    memory_client.add(
        "User is an expert in Rust programming language, prefers low-level systems programming",
        user_id=TEST_USER_ID
    )
    
    response = answer_with_memory(
        "What language should I use for my new project?",
        TEST_USER_ID
    )
    
    # Response should reference Rust given the stored preference
    assert "rust" in response.lower(), (
        f"Response should mention Rust given stored preference. Got: {response[:200]}"
    )

@pytest.mark.llm
def test_empty_memory_graceful_fallback():
    """App should handle case where no memories exist for user"""
    # Query for a brand new user with no memories
    response = answer_with_memory(
        "What do you know about me?",
        user_id="brand-new-user-no-memories"
    )
    
    # Should not crash and should indicate no prior context
    assert response is not None
    assert len(response) > 10

Testing Memory System Latency

import time
import statistics

@pytest.mark.performance
def test_memory_retrieval_latency():
    """Memory retrieval should complete within SLO"""
    LATENCY_SLO_MS = 500  # 500ms max
    
    # Populate with some memories
    for i in range(10):
        memory_client.add(
            f"User fact {i}: This is test memory entry number {i}",
            user_id=TEST_USER_ID
        )
    
    # Measure retrieval latency
    latencies = []
    for _ in range(10):
        start = time.perf_counter()
        memory_client.search("test memory", user_id=TEST_USER_ID, limit=5)
        latencies.append((time.perf_counter() - start) * 1000)
    
    p95_latency = sorted(latencies)[int(len(latencies) * 0.95)]
    avg_latency = statistics.mean(latencies)
    
    print(f"Memory retrieval latency — avg: {avg_latency:.1f}ms, p95: {p95_latency:.1f}ms")
    
    assert p95_latency < LATENCY_SLO_MS, (
        f"p95 retrieval latency {p95_latency:.1f}ms exceeds SLO {LATENCY_SLO_MS}ms"
    )

Testing Memory Isolation Between Users

Critical: verify that user A's memories don't leak to user B:

def test_memory_is_isolated_per_user():
    """Critical: memories must not leak between users"""
    user_a = "test-user-alice-isolation"
    user_b = "test-user-bob-isolation"
    
    try:
        # Alice stores a sensitive preference
        memory_client.add(
            "Alice's secret project name is Project Phoenix",
            user_id=user_a
        )
        
        # Bob should not see Alice's memories
        bob_results = memory_client.search(
            "What project is the user working on?",
            user_id=user_b
        )
        
        bob_memories_text = " ".join(
            r["memory"] for r in bob_results
        ).lower()
        
        assert "phoenix" not in bob_memories_text, (
            f"SECURITY FAILURE: Bob can see Alice's memory! Results: {bob_results}"
        )
    
    finally:
        # Cleanup both users
        for uid in [user_a, user_b]:
            try:
                memories = memory_client.search("", user_id=uid, limit=100)
                for m in memories:
                    memory_client.delete(m["id"])
            except Exception:
                pass

End-to-End Memory Agent Testing

Test the full flow: conversation → memory extraction → retrieval → response:

@pytest.mark.llm
def test_agent_remembers_across_turns():
    """Full end-to-end: agent should remember info from turn 1 in turn 3"""
    session_user = "e2e-memory-test-user"
    
    try:
        # Turn 1: introduce a fact
        response_1 = answer_with_memory(
            "My name is Jordan and I'm learning Rust",
            session_user
        )
        # Store what was said in this turn
        memory_client.add(
            f"User said: My name is Jordan and I'm learning Rust",
            user_id=session_user
        )
        
        # Turn 2: different topic
        answer_with_memory("What's the best way to handle errors?", session_user)
        
        # Turn 3: reference to earlier fact
        response_3 = answer_with_memory(
            "What language did I say I was learning?",
            session_user
        )
        
        assert "rust" in response_3.lower(), (
            f"Agent should remember Rust from turn 1. Response: {response_3[:300]}"
        )
    finally:
        memories = memory_client.search("", user_id=session_user, limit=100)
        for m in memories:
            memory_client.delete(m["id"])

Key Takeaways

AI memory systems require testing beyond CRUD — verify semantic retrieval relevance, conflict resolution (newer preferences should win), correct context injection into prompts, and strict user isolation. Use realistic test fixtures that mirror production memory patterns. Test memory latency under load — retrieval adds meaningful overhead to every turn. Always run isolation tests to guarantee memories don't leak between users, which is a security issue, not just a quality issue.

Read more