Meet Harness-1: A 20B Retrieval Subagent Trained With Reinforcement Learning Inside a Stateful Search Harness on gpt-oss-20b

Learn to build a stateful search harness system inspired by Harness-1, a 20B parameter retrieval subagent trained with reinforcement learning. This tutorial teaches you to implement candidate pooling, evidence graph maintenance, and reinforcement learning-based decision making.

Introduction

In this tutorial, we'll implement a stateful search harness modeled on the Harness-1 system described in the recent MarkTechPost article. Harness-1 is a 20B parameter retrieval subagent trained using reinforcement learning within a stateful search framework. The full system is complex, so we'll build a simplified version that demonstrates its core concepts: candidate pooling, evidence graph maintenance, and reinforcement learning-based decision making for search optimization.

This tutorial will help you understand how to build a retrieval system that maintains state information, learns from feedback, and makes intelligent decisions about when to stop searching. We'll focus on the practical implementation aspects using Python and common AI libraries.

Prerequisites

Python 3.8+
Basic understanding of machine learning concepts
Knowledge of reinforcement learning fundamentals
Experience with vector databases (e.g., Chroma, FAISS)
Installed packages: numpy, scikit-learn, torch, chromadb, transformers

Step-by-Step Instructions

1. Setting Up the Environment

First, we'll create a virtual environment and install the required dependencies. This ensures we have a clean working space with all necessary libraries.

python -m venv harness_env
source harness_env/bin/activate  # On Windows: harness_env\Scripts\activate
pip install numpy scikit-learn torch chromadb transformers

Why: Setting up a virtual environment isolates our project dependencies from the system-wide Python installation, preventing conflicts with other projects.

2. Creating the Search Harness Class

We'll begin by implementing the core harness class that will manage our search state, candidate pool, and evidence graph.

import numpy as np
import torch
from typing import List, Dict, Any
from chromadb import Client


class SearchHarness:
    def __init__(self, chroma_client: Client):
        self.chroma_client = chroma_client
        self.candidate_pool = []
        self.evidence_graph = {}
        self.verification_records = []
        self.search_history = []
        
    def add_candidates(self, documents: List[Dict[str, Any]]):
        """Add documents to the candidate pool"""
        for doc in documents:
            self.candidate_pool.append(doc)
            
    def get_candidate_pool(self) -> List[Dict[str, Any]]:
        """Return current candidate pool"""
        return self.candidate_pool
        
    def update_evidence_graph(self, query: str, results: List[Dict[str, Any]]):
        """Update evidence graph with search results"""
        if query not in self.evidence_graph:
            self.evidence_graph[query] = []
        self.evidence_graph[query].extend(results)
        
    def get_evidence(self, query: str) -> List[Dict[str, Any]]:
        """Get evidence for a query"""
        return self.evidence_graph.get(query, [])

Why: This class structure mimics the bookkeeping functions described for the Harness-1 system. The candidate pool holds potential search results, while the evidence graph, here a plain dictionary keyed by query, records what was found for each query.
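The dictionary above stores a full result list per query, so repeating a search can append the same document twice. As a minimal sketch of a slightly richer structure, the hypothetical `EvidenceGraph` class below (not part of Harness-1 or of the class above) stores query-to-document links as sets of ids and keeps a reverse index. It assumes every document carries a unique `id` key, as in this tutorial's sample data.

```python
from collections import defaultdict
from typing import Any, Dict, List, Set


class EvidenceGraph:
    """Hypothetical helper: a bipartite query <-> document graph keyed by doc id."""

    def __init__(self) -> None:
        self.docs: Dict[str, Dict[str, Any]] = {}  # doc id -> document
        self.query_to_docs: Dict[str, Set[str]] = defaultdict(set)
        self.doc_to_queries: Dict[str, Set[str]] = defaultdict(set)

    def add(self, query: str, results: List[Dict[str, Any]]) -> int:
        """Link a query to its results; return how many links were new."""
        new_links = 0
        for doc in results:
            doc_id = doc["id"]  # assumption: each document has a unique 'id'
            self.docs[doc_id] = doc
            if doc_id not in self.query_to_docs[query]:
                self.query_to_docs[query].add(doc_id)
                self.doc_to_queries[doc_id].add(query)
                new_links += 1
        return new_links

    def evidence_for(self, query: str) -> List[Dict[str, Any]]:
        """Documents linked to a query, sorted by id for stable output."""
        return [self.docs[d] for d in sorted(self.query_to_docs.get(query, set()))]

    def queries_for(self, doc_id: str) -> List[str]:
        """Queries that surfaced a given document."""
        return sorted(self.doc_to_queries.get(doc_id, set()))


graph = EvidenceGraph()
docs = [{"id": "1", "content": "Machine learning algorithms"},
        {"id": "2", "content": "Deep learning neural networks"}]
first = graph.add("AI research", docs)
second = graph.add("AI research", docs)     # same results again: no new links
third = graph.add("neural nets", docs[1:])  # a second query reaches doc '2'
print(first, second, third)   # 2 0 1
print(graph.queries_for("2"))  # ['AI research', 'neural nets']
```

The count returned by `add` could also serve as a reward signal, since a search that surfaces nothing new returns 0.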

3. Implementing the Policy Agent

The policy agent decides what to search, curate, verify, and when to stop. We'll create a simple reinforcement learning-based policy.

class PolicyAgent:
    def __init__(self, num_actions: int):
        self.num_actions = num_actions
        self.q_table = np.zeros((100, num_actions))  # Simplified Q-table
        
    def get_action(self, state: int, epsilon: float = 0.1) -> int:
        """Choose action using epsilon-greedy policy"""
        if np.random.random() < epsilon:
            # Explore: pick a random action
            return int(np.random.randint(self.num_actions))
        # Exploit: pick the action with the highest Q-value (ties go to the lowest index)
        return int(np.argmax(self.q_table[state]))

    def update_q_table(self, state: int, action: int, reward: float, next_state: int,
                       alpha: float = 0.1, gamma: float = 0.9):
        """Update Q-table using Q-learning update rule"""
        current_q = self.q_table[state, action]
        max_next_q = np.max(self.q_table[next_state])
        # alpha is the learning rate, gamma the discount factor
        new_q = current_q + alpha * (reward + gamma * max_next_q - current_q)
        self.q_table[state, action] = new_q

Why: This policy agent learns from feedback about search effectiveness. Tabular Q-learning lets it improve its decisions over time. It is a deliberately small stand-in: Harness-1 was also trained with reinforcement learning, but as a 20B parameter model rather than a lookup table.
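To make the update rule concrete, here is a small worked example in plain Python. It restates the formula used in `update_q_table` above, with the learning rate of 0.1 and the discount factor of 0.9 taken from that method. The standalone `q_update` function and the reward values are illustrative only.

```python
def q_update(current_q: float, reward: float, max_next_q: float,
             alpha: float = 0.1, gamma: float = 0.9) -> float:
    """One Q-learning step: move current_q toward reward + gamma * max_next_q."""
    td_target = reward + gamma * max_next_q
    return current_q + alpha * (td_target - current_q)


# Start from an all-zero table entry and receive a reward of 3
# (for example, three candidates found by a search action).
q1 = q_update(current_q=0.0, reward=3.0, max_next_q=0.0)

# Repeat the same experience: the estimate keeps moving toward the target of 3.
q2 = q_update(current_q=q1, reward=3.0, max_next_q=0.0)

print(round(q1, 3), round(q2, 3))  # 0.3 0.57
```

Each update closes 10% of the gap between the current estimate and the target, which is why a single reward of 3 only raises the Q-value to about 0.3.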

4. Creating the Search Loop

Now we'll implement the main search loop that integrates our harness and policy agent.

def search_with_harness(harness: SearchHarness, policy: PolicyAgent, query: str, max_iterations: int = 5):
    """Main search loop implementing the search harness logic"""
    print(f"Starting search for: {query}")
    
    # Initialize search state
    state = 0
    iteration = 0
    search_results = []
    
    while iteration < max_iterations:
        # Get action from policy
        action = policy.get_action(state)
        
        # Execute action
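        if action == 0:  # Search
            print("Action: Searching for new candidates")
            # Simulate search (in practice, this would call a search API).
            # Skip documents already retrieved so results and evidence hold no duplicates.
            seen_ids = {doc['id'] for doc in search_results}
            candidates = [doc for doc in harness.get_candidate_pool()
                          if doc['id'] not in seen_ids][:3]  # Up to 3 new candidates
            harness.update_evidence_graph(query, candidates)
            search_results.extend(candidates)
            reward = len(candidates)  # Reward based on number of new candidates found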
            
        elif action == 1:  # Curate
            print("Action: Curating existing candidates")
            # Simulate curation
            reward = 0.5  # Small reward for curation
            
        elif action == 2:  # Verify
            print("Action: Verifying results")
            # Simulate verification
            reward = 1.0  # Reward for verification
            
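        elif action == 3:  # Stop
            print("Action: Stopping search")
            # Stop ends the episode: record it in the history, then exit.
            # No Q-update happens for this action, so its Q-value never changes in this sketch.
            harness.search_history.append({'query': query, 'action': action, 'iteration': iteration + 1})
            break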
        
        # Update policy with reward
        next_state = min(state + 1, 99)  # Keep state bounded
        policy.update_q_table(state, action, reward, next_state)
        state = next_state
        iteration += 1
        
        # Update harness state
        harness.search_history.append({
            'query': query,
            'action': action,
            'iteration': iteration
        })
        
    return search_results

Why: This loop is a simplified version of the core decision-making process in the Harness-1 system. At each step the policy agent picks an action from its learned Q-values, and the harness records the resulting state and evidence.
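The loop above uses the step counter as its state, so the policy cannot tell a productive search from an empty one. One possible alternative, sketched below on the assumption that pool size and evidence count are useful signals, is to bucket those two numbers into a single index that still fits the 100-row Q-table. The `encode_state` function is hypothetical and not part of Harness-1.

```python
def encode_state(num_candidates: int, num_evidence: int, buckets: int = 10) -> int:
    """Map two harness counts to one index in [0, buckets * buckets - 1].

    Each count is clipped to [0, buckets - 1], so the default of 10 buckets
    yields indices 0..99, matching a Q-table with 100 rows.
    """
    c = max(0, min(num_candidates, buckets - 1))
    e = max(0, min(num_evidence, buckets - 1))
    return c * buckets + e


print(encode_state(0, 0))    # 0
print(encode_state(4, 3))    # 43
print(encode_state(50, 50))  # 99
```

Inside the loop, `state` could then be computed as `encode_state(len(harness.get_candidate_pool()), len(harness.get_evidence(query)))` before each call to `policy.get_action`.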

5. Testing the System

Finally, we'll test our implementation with sample data to see how it performs.

def main():
    # Initialize Chroma client
    client = Client()
    
    # Create harness and policy
    harness = SearchHarness(client)
    policy = PolicyAgent(num_actions=4)
    
    # Add sample candidates
    sample_docs = [
        {'id': '1', 'content': 'Machine learning algorithms', 'importance': 0.8},
        {'id': '2', 'content': 'Deep learning neural networks', 'importance': 0.9},
        {'id': '3', 'content': 'Natural language processing', 'importance': 0.7},
        {'id': '4', 'content': 'Computer vision applications', 'importance': 0.6}
    ]
    harness.add_candidates(sample_docs)
    
    # Run search
    results = search_with_harness(harness, policy, "AI research", max_iterations=3)
    
    print("\nFinal search results:")
    for result in results:
        print(f"ID: {result['id']}, Content: {result['content']}")
    
    print("\nEvidence graph:")
    evidence = harness.get_evidence("AI research")
    for item in evidence:
        print(f"Found: {item['content']}")


if __name__ == "__main__":
    main()

Why: This test demonstrates the full workflow of our search harness, showing how it maintains state, makes decisions, and builds up evidence over multiple search iterations.

Summary

In this tutorial, we've built a simplified but functional search harness system inspired by Harness-1. We've implemented key components including:

A stateful search harness that maintains candidate pools and evidence graphs
A reinforcement learning-based policy agent that makes decisions about search actions
A complete search loop that integrates these components

While this is a simplified version of the full Harness-1 system, it demonstrates the core principles: maintaining state information, learning from feedback, and making intelligent decisions about search behavior. It shows, at toy scale, how reinforcement learning can be used to optimize a search strategy, the same idea that underlies Harness-1's reported recall scores.

This foundation can be extended with more sophisticated components like actual search APIs, better Q-learning implementations, or integration with large language models for more complex decision-making.