Testing AI Memory Systems: Mem0, Zep, and Long-Term Agent Memory
AI memory systems (Mem0, Zep, Pinecone-based, or custom vector stores) let agents remember information across conversations. Testing them requires more than verifying that data is stored and retrieved: you also need to test retrieval relevance, conflict handling, context injection into prompts, and what happens when memories degrade or contradict each other.
Key Takeaways
Memory retrieval is semantic, not exact — test for relevance, not equality. Unlike a key-value store, memory systems retrieve by semantic similarity. A memory stored as "user likes Python" should be retrieved when the user asks about programming, not just when they mention "Python."
Memory conflicts are the hardest failure mode to test. When an agent stores "user prefers dark mode" and later "user switched to light mode," which wins? Test conflict resolution explicitly.
Context injection is a separate layer to test. Storing and retrieving memories is different from injecting them into prompts correctly. A memory retrieved but not injected properly is invisible to the model.
Memory systems have latency that affects UX. Retrieval typically adds somewhere in the range of 50-500ms per turn, depending on the backend and the network. Test that your memory system meets your latency SLO before going to production.
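The context-injection takeaway can be checked without calling a model at all. The sketch below assumes a hypothetical helper, `build_system_prompt`, that renders retrieved memories (dicts with a `memory` key, the shape used in this article's Mem0 examples) into a system prompt. The helper name, the `max_memories` parameter, and the prompt wording are illustrative assumptions, not part of any library.

```python
from typing import Dict, List

BASE_PROMPT = "You are a helpful assistant."


def build_system_prompt(memories: List[Dict[str, str]], max_memories: int = 5) -> str:
    """Hypothetical helper: render retrieved memories into a system prompt."""
    # Drop empty entries so blank bullets never reach the model.
    facts = [m["memory"].strip() for m in memories if m.get("memory", "").strip()]
    if not facts:
        return BASE_PROMPT
    bullets = "\n".join(f"- {fact}" for fact in facts[:max_memories])
    return f"{BASE_PROMPT}\nKnown facts about this user:\n{bullets}"


def test_retrieved_memories_appear_in_prompt():
    prompt = build_system_prompt([{"memory": "User prefers Rust"}, {"memory": "  "}])
    assert "- User prefers Rust" in prompt
    assert prompt.count("- ") == 1  # the blank memory was skipped


def test_no_memories_falls_back_to_base_prompt():
    assert build_system_prompt([]) == BASE_PROMPT
```

Keeping prompt assembly in a pure function like this means an injection bug shows up in a fast, deterministic unit test rather than as a flaky LLM assertion.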
What AI Memory Systems Do
An AI agent without memory treats every conversation as the first. Memory systems give agents persistence:
- Short-term memory: Within a single conversation (context window)
- Long-term memory: Across conversations (vector store, database)
- Episodic memory: Specific past events ("last Tuesday you asked about X")
- Semantic memory: Learned facts ("user is a Python developer")
- Procedural memory: Learned behaviors ("user prefers concise answers")
The two most widely used libraries are:
- Mem0: Managed memory layer, extracts and stores facts automatically from conversations
- Zep: Open-source, stores and retrieves conversation history with temporal awareness
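For tests that should not depend on a live service, a small in-memory stand-in can help. The sketch below is a hypothetical `FakeMemoryClient` that mirrors the `add`, `search`, and `delete` calls used in this article's Mem0 examples, plus a `delete_all` convenience. It accepts plain text only and ranks by token overlap, which is a crude proxy and an assumption of this sketch, so it is suitable for testing your own wiring (isolation, prompt assembly, cleanup) but not retrieval quality.

```python
import itertools
import re
from typing import Dict, List, Set


class FakeMemoryClient:
    """Hypothetical in-memory stand-in exposing add/search/delete/delete_all."""

    def __init__(self) -> None:
        self._store: Dict[str, List[dict]] = {}
        self._ids = itertools.count(1)

    @staticmethod
    def _tokens(text: str) -> Set[str]:
        return set(re.findall(r"[a-z0-9]+", text.lower()))

    def add(self, text: str, user_id: str) -> dict:
        record = {"id": str(next(self._ids)), "memory": text}
        self._store.setdefault(user_id, []).append(record)
        return record

    def search(self, query: str, user_id: str, limit: int = 10) -> List[dict]:
        # Score by Jaccard token overlap; real systems use embeddings instead.
        query_tokens = self._tokens(query)
        scored = []
        for record in self._store.get(user_id, []):
            memory_tokens = self._tokens(record["memory"])
            union = query_tokens | memory_tokens
            score = len(query_tokens & memory_tokens) / len(union) if union else 0.0
            if score > 0:
                scored.append({**record, "score": score})
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:limit]

    def delete(self, memory_id: str) -> None:
        for records in self._store.values():
            records[:] = [r for r in records if r["id"] != memory_id]

    def delete_all(self, user_id: str) -> None:
        self._store.pop(user_id, None)


# Usage: memories are scoped to the user_id they were added under.
fake = FakeMemoryClient()
fake.add("User prefers Python for backend work", user_id="alice")
assert fake.search("python backend", user_id="alice")[0]["memory"].startswith("User prefers")
assert fake.search("python backend", user_id="bob") == []
```

Because the fake cannot reproduce semantic matching, relevance and conflict tests still need to run against the real service.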
Setting Up Mem0 for Testing
from mem0 import MemoryClient
import pytest

# For testing, use a separate Mem0 project or mock
memory_client = MemoryClient(api_key="your-mem0-api-key")

TEST_USER_ID = "test-user-pytest"


@pytest.fixture(autouse=True)
def clean_test_memories():
    """Clear the test user's memories after each test"""
    yield
    # Cleanup after test: delete_all removes everything for this user,
    # which is more reliable than searching with an empty query
    try:
        memory_client.delete_all(user_id=TEST_USER_ID)
    except Exception:
        pass  # Best effort cleanup


def test_memory_stores_user_preference():
    """Basic storage: memory is retrievable after adding"""
    memory_client.add(
        "I prefer Python over JavaScript for backend development",
        user_id=TEST_USER_ID
    )

    results = memory_client.search(
        "What programming language does the user prefer?",
        user_id=TEST_USER_ID
    )

    assert len(results) > 0, "Memory should be retrievable"
    # Check that at least one result mentions Python
    combined_text = " ".join(r["memory"] for r in results).lower()
    assert "python" in combined_text, f"Expected Python in memory results: {results}"


def test_memory_retrieval_is_semantically_relevant():
    """Semantic search: related query retrieves related memory"""
    memory_client.add(
        "The user works as a machine learning engineer at a fintech startup",
        user_id=TEST_USER_ID
    )

    # Query with different wording but same semantic meaning
    results = memory_client.search(
        "What is the user's job?",
        user_id=TEST_USER_ID
    )

    assert len(results) > 0
    top_result = results[0]["memory"].lower()
    assert any(term in top_result for term in ["machine learning", "engineer", "fintech"])


def test_irrelevant_query_returns_low_relevance():
    """Unrelated queries should not retrieve unrelated memories"""
    memory_client.add(
        "User's favorite color is blue",
        user_id=TEST_USER_ID
    )

    results = memory_client.search(
        "What database does the user use?",
        user_id=TEST_USER_ID,
        limit=5
    )

    # If results are returned, they should have low relevance scores
    for result in results:
        if "color" in result["memory"].lower():
            # Color memory retrieved for a database query: check relevance score
            assert result.get("score", 1.0) < 0.5, (
                "Irrelevant memory should have low relevance score"
            )

Testing Memory Extraction from Conversations
Mem0 automatically extracts facts from conversation text. Test that extraction works correctly:
import re


def test_mem0_extracts_facts_from_conversation():
    """Mem0 should extract discrete facts from conversation"""
    conversation = [
        {"role": "user", "content": "I'm building a SaaS app in Django and React"},
        {"role": "assistant", "content": "Great! Django is excellent for backend APIs."},
        {"role": "user", "content": "Yes, and I'm deploying to AWS ECS"},
    ]

    memory_client.add(conversation, user_id=TEST_USER_ID)

    # Check that key facts were extracted
    django_results = memory_client.search("backend framework", user_id=TEST_USER_ID)
    aws_results = memory_client.search("deployment platform", user_id=TEST_USER_ID)

    django_text = " ".join(r["memory"] for r in django_results).lower()
    aws_text = " ".join(r["memory"] for r in aws_results).lower()

    assert "django" in django_text, "Django preference should be extracted"
    assert "aws" in aws_text or "ecs" in aws_text, "AWS deployment should be extracted"


def test_mem0_does_not_hallucinate_facts():
    """Mem0 should not extract facts not present in conversation"""
    conversation = [
        {"role": "user", "content": "What's the weather like today?"},
        {"role": "assistant", "content": "I don't have real-time weather data."},
    ]

    memory_client.add(conversation, user_id=TEST_USER_ID)

    # Should not extract a programming language preference
    results = memory_client.search(
        "user's programming language preference",
        user_id=TEST_USER_ID
    )

    # Match whole words only: a plain substring check for "go" would also
    # fire on unrelated words such as "going" or "ago"
    language_pattern = re.compile(r"\b(python|javascript|java|rust|go)\b")

    # Either no results or results shouldn't claim a language preference
    for result in results:
        memory_text = result["memory"].lower()
        assert not language_pattern.search(memory_text), (
            f"Hallucinated language preference: {result['memory']}"
        )

Testing Zep Memory
Zep stores conversation sessions and provides temporal memory retrieval:
import time

import pytest
from zep_python import ZepClient, Message, Memory, MemorySearchPayload

zep_client = ZepClient(api_url="http://localhost:8000")

TEST_SESSION_ID = "pytest-session-001"


def _delete_session_if_present(session_id):
    """Delete a session, ignoring errors such as the session not existing yet"""
    try:
        zep_client.memory.delete_session(session_id)
    except Exception:
        pass  # Best effort cleanup


@pytest.fixture
def fresh_session():
    """Create a clean Zep session for each test"""
    _delete_session_if_present(TEST_SESSION_ID)
    yield TEST_SESSION_ID
    _delete_session_if_present(TEST_SESSION_ID)


def test_zep_stores_conversation_turns(fresh_session):
    """Messages added to session are retrievable"""
    messages = [
        Message(role="user", content="My name is Alice"),
        Message(role="assistant", content="Hello Alice, nice to meet you!"),
        Message(role="user", content="I'm working on a Python project"),
    ]

    zep_client.memory.add_memory(
        fresh_session,
        Memory(messages=messages)
    )

    # Retrieve the memory
    memory = zep_client.memory.get_memory(fresh_session)

    assert memory is not None
    assert len(memory.messages) == 3

    # Verify content
    all_content = " ".join(m.content for m in memory.messages)
    assert "Alice" in all_content
    assert "Python" in all_content


def test_zep_search_returns_relevant_context(fresh_session):
    """Zep semantic search returns relevant past conversation"""
    # Add a multi-turn conversation
    messages = [
        Message(role="user", content="I need help with async Python code"),
        Message(role="assistant", content="Sure, let's use asyncio for concurrent operations"),
        Message(role="user", content="I'm having issues with event loops"),
        Message(role="assistant", content="Event loop issues often stem from blocking calls"),
    ]
    zep_client.memory.add_memory(fresh_session, Memory(messages=messages))

    # Search for related content. This assumes a zep-python release whose
    # search_memory takes a MemorySearchPayload; adjust for your client version
    results = zep_client.memory.search_memory(
        fresh_session,
        MemorySearchPayload(text="asynchronous programming problems")
    )

    assert len(results) > 0
    # Top result should be about async/event loops. Depending on the client
    # version, the matched message may be a dict or an object
    top_message = results[0].message
    top_text = (
        top_message["content"] if isinstance(top_message, dict) else top_message.content
    ).lower()
    assert any(term in top_text for term in ["async", "asyncio", "event loop", "concurrent"])


def test_zep_memory_summary_is_generated(fresh_session):
    """Zep should generate summaries for longer conversations"""
    # Add enough messages to trigger summarization
    messages = []
    for i in range(20):
        messages.append(Message(role="user", content=f"Question {i} about Python testing"))
        messages.append(Message(role="assistant", content=f"Answer {i}: Use pytest for testing"))

    zep_client.memory.add_memory(fresh_session, Memory(messages=messages))

    # Zep summarizes asynchronously, so poll instead of relying on a fixed sleep
    memory = None
    deadline = time.monotonic() + 30
    while time.monotonic() < deadline:
        memory = zep_client.memory.get_memory(fresh_session)
        if memory is not None and memory.summary is not None:
            break
        time.sleep(1)

    # Summary should exist for longer conversations
    assert memory is not None and memory.summary is not None
    assert len(memory.summary.content) > 20, "Summary should be non-trivial"

Testing Memory Conflicts
What happens when a user updates their preferences?
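One common expectation is that the most recent statement wins. The sketch below illustrates that policy in plain Python so the expected outcome is explicit. It is an assumption about desired behavior, not a description of how Mem0 or Zep resolve conflicts internally; `MemoryRecord`, `resolve_latest`, and the dates are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Iterable


@dataclass(frozen=True)
class MemoryRecord:
    key: str  # what the fact is about, e.g. "location"
    value: str
    created_at: datetime


def resolve_latest(records: Iterable[MemoryRecord]) -> Dict[str, MemoryRecord]:
    """Keep only the newest record per key (a recency-wins policy)."""
    latest: Dict[str, MemoryRecord] = {}
    for record in records:
        current = latest.get(record.key)
        if current is None or record.created_at > current.created_at:
            latest[record.key] = record
    return latest


# Illustrative data: two conflicting locations and one unrelated preference.
records = [
    MemoryRecord("location", "San Francisco", datetime(2024, 1, 5, tzinfo=timezone.utc)),
    MemoryRecord("location", "New York City", datetime(2024, 6, 1, tzinfo=timezone.utc)),
    MemoryRecord("theme", "dark mode", datetime(2024, 3, 1, tzinfo=timezone.utc)),
]
resolved = resolve_latest(records)
assert resolved["location"].value == "New York City"
assert resolved["theme"].value == "dark mode"
```

A real memory system has no explicit `key` to group on, which is why the same expectation has to be checked through search ranking instead.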
import re
import time


def test_updated_preference_overrides_old():
    """New preferences should supersede old ones"""
    # Add initial preference
    memory_client.add(
        "User prefers TypeScript over JavaScript",
        user_id=TEST_USER_ID
    )

    # Update preference
    memory_client.add(
        "User has switched to Go for backend development, moved away from TypeScript",
        user_id=TEST_USER_ID
    )

    results = memory_client.search(
        "What backend language does the user prefer?",
        user_id=TEST_USER_ID,
        limit=3
    )

    # Top result should reflect the newer preference
    assert len(results) > 0, "Expected at least one memory about language preference"
    top_memory = results[0]["memory"].lower()
    # Match "go" as a whole word; a substring check would also match "going"
    assert re.search(r"\bgo\b", top_memory), (
        f"Top memory should reflect updated Go preference, got: {results[0]['memory']}"
    )


def test_contradictory_memories_surface_most_recent():
    """Most recently added memory should rank higher than old contradictory ones"""
    memory_client.add(
        "User is based in San Francisco",
        user_id=TEST_USER_ID
    )

    time.sleep(0.5)

    memory_client.add(
        "User relocated to New York City for a new job",
        user_id=TEST_USER_ID
    )

    results = memory_client.search(
        "Where does the user live?",
        user_id=TEST_USER_ID,
        limit=5
    )

    # First result should mention New York
    assert len(results) > 0
    top_memory = results[0]["memory"].lower()
    assert "new york" in top_memory or "nyc" in top_memory, (
        f"Most recent location (NYC) should rank first: {results[0]['memory']}"
    )

Testing Context Injection Into Prompts
Retrieving memories doesn't automatically help the model — you need to inject them into the prompt correctly:
from openai import OpenAI

openai_client = OpenAI()


def answer_with_memory(question: str, user_id: str) -> str:
    """Answer a question using retrieved memories for context"""
    # Retrieve relevant memories
    memories = memory_client.search(question, user_id=user_id, limit=5)

    # Inject into system prompt
    if memories:
        memory_context = "\n".join(
            f"- {m['memory']}" for m in memories
        )
        system_prompt = f"""You are a helpful assistant with memory of previous conversations.
Known facts about this user:
{memory_context}
Use this context to personalize your responses."""
    else:
        system_prompt = "You are a helpful assistant."

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question}
        ]
    )
    return response.choices[0].message.content


@pytest.mark.llm
def test_memory_context_influences_response():
    """Model should use retrieved memories in its response"""
    # Store a clear preference
    memory_client.add(
        "User is an expert in Rust programming language, prefers low-level systems programming",
        user_id=TEST_USER_ID
    )

    response = answer_with_memory(
        "What language should I use for my new project?",
        TEST_USER_ID
    )

    # Response should reference Rust given the stored preference
    assert "rust" in response.lower(), (
        f"Response should mention Rust given stored preference. Got: {response[:200]}"
    )


@pytest.mark.llm
def test_empty_memory_graceful_fallback():
    """App should handle case where no memories exist for user"""
    # Query for a brand new user with no memories
    response = answer_with_memory(
        "What do you know about me?",
        user_id="brand-new-user-no-memories"
    )

    # Should not crash and should indicate no prior context
    assert response is not None
    assert len(response) > 10

Testing Memory System Latency
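Latency SLOs are usually stated as percentiles such as p95, and with only a handful of samples the exact indexing matters. A minimal sketch of the nearest-rank method follows; `percentile_nearest_rank` is a hypothetical helper name, not a library function.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Return the pct-th percentile (0 < pct <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest sample with at least pct% of samples at or below it
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

With 10 samples, nearest-rank p95 is simply the slowest sample, so a single outlier decides the result; 20 or more samples give a slightly more meaningful figure.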
import math
import statistics
import time


@pytest.mark.performance
def test_memory_retrieval_latency():
    """Memory retrieval should complete within SLO"""
    LATENCY_SLO_MS = 500  # 500ms max
    SAMPLES = 20  # with only 10 samples, p95 is just the slowest call

    # Populate with some memories
    for i in range(10):
        memory_client.add(
            f"User fact {i}: This is test memory entry number {i}",
            user_id=TEST_USER_ID
        )

    # Measure retrieval latency
    latencies = []
    for _ in range(SAMPLES):
        start = time.perf_counter()
        memory_client.search("test memory", user_id=TEST_USER_ID, limit=5)
        latencies.append((time.perf_counter() - start) * 1000)

    # Nearest-rank p95: smallest sample with at least 95% of samples at or below it
    ordered = sorted(latencies)
    p95_latency = ordered[math.ceil(95 * len(ordered) / 100) - 1]
    avg_latency = statistics.mean(latencies)

    print(f"Memory retrieval latency: avg {avg_latency:.1f}ms, p95 {p95_latency:.1f}ms")

    assert p95_latency < LATENCY_SLO_MS, (
        f"p95 retrieval latency {p95_latency:.1f}ms exceeds SLO {LATENCY_SLO_MS}ms"
    )

Testing Memory Isolation Between Users
Critical: verify that user A's memories don't leak to user B:
def test_memory_is_isolated_per_user():
    """Critical: memories must not leak between users"""
    user_a = "test-user-alice-isolation"
    user_b = "test-user-bob-isolation"

    try:
        # Alice stores a sensitive preference
        memory_client.add(
            "Alice's secret project name is Project Phoenix",
            user_id=user_a
        )

        # Positive control: Alice can retrieve her own memory, so the
        # negative check below cannot pass just because nothing was stored
        alice_results = memory_client.search(
            "What project is the user working on?",
            user_id=user_a
        )
        alice_text = " ".join(r["memory"] for r in alice_results).lower()
        assert "phoenix" in alice_text, "Alice should see her own memory"

        # Bob should not see Alice's memories
        bob_results = memory_client.search(
            "What project is the user working on?",
            user_id=user_b
        )

        bob_memories_text = " ".join(
            r["memory"] for r in bob_results
        ).lower()

        assert "phoenix" not in bob_memories_text, (
            f"SECURITY FAILURE: Bob can see Alice's memory! Results: {bob_results}"
        )
    finally:
        # Cleanup both users
        for uid in [user_a, user_b]:
            try:
                memory_client.delete_all(user_id=uid)
            except Exception:
                pass  # Best effort cleanup

End-to-End Memory Agent Testing
Test the full flow: conversation → memory extraction → retrieval → response:
@pytest.mark.llm
def test_agent_remembers_across_turns():
    """Full end-to-end: agent should remember info from turn 1 in turn 3"""
    session_user = "e2e-memory-test-user"

    try:
        # Turn 1: introduce a fact (the reply itself is not needed here)
        answer_with_memory(
            "My name is Jordan and I'm learning Rust",
            session_user
        )

        # Store what was said in this turn
        memory_client.add(
            "User said: My name is Jordan and I'm learning Rust",
            user_id=session_user
        )

        # Turn 2: different topic
        answer_with_memory("What's the best way to handle errors?", session_user)

        # Turn 3: reference to earlier fact
        response_3 = answer_with_memory(
            "What language did I say I was learning?",
            session_user
        )

        assert "rust" in response_3.lower(), (
            f"Agent should remember Rust from turn 1. Response: {response_3[:300]}"
        )
    finally:
        # Remove everything stored for the test user
        memory_client.delete_all(user_id=session_user)

Key Takeaways
AI memory systems require testing beyond CRUD — verify semantic retrieval relevance, conflict resolution (newer preferences should win), correct context injection into prompts, and strict user isolation. Use realistic test fixtures that mirror production memory patterns. Test memory latency under load — retrieval adds meaningful overhead to every turn. Always run isolation tests to guarantee memories don't leak between users, which is a security issue, not just a quality issue.
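The examples in this article tag tests with `@pytest.mark.llm` and `@pytest.mark.performance`. Recent pytest versions warn about unregistered markers, so a `conftest.py` along these lines can register them. The marker descriptions are this sketch's own wording.

```python
# conftest.py
def pytest_configure(config):
    """Register the custom markers used by the memory tests."""
    config.addinivalue_line(
        "markers", "llm: test calls a live LLM API (slow, may incur cost)"
    )
    config.addinivalue_line(
        "markers", "performance: latency check against a live memory backend"
    )
```

With the markers registered, `pytest -m "not llm and not performance"` runs only the faster tests, leaving the marked ones for a scheduled or pre-release job.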