AI Sparks

Code Execution for Parsing, Analyzing, Visualizing, and Debugging Agent Reasoning Traces using the lambda/hermes-agent-reasoning-traces dataset

In this lesson, we examine the lambda/hermes-agent-reasoning-traces dataset understanding how agent-based models think, use tools, and generate responses across multi-curve conversations. We start by loading and examining the dataset, examining its structure, categories, and dialog format to get a clear view of the available information. We then developed simple parsers to extract important components such as logic traces, tool calls, and tool responses, allowing us to separate internal logic from external actions. Also, we analyze patterns such as tool usage frequency, conversation length, and error rates to better understand agent behavior. We also create visualizations to highlight these trends and simplify analysis. Finally, we prepare the training dataset by converting it to a model-friendly format, making it suitable for tasks such as supervised tuning.

Copy the CodeCopiedUse a different browser
!pip -q install -U datasets pandas matplotlib seaborn transformers accelerate trl


import json, re, random, textwrap
from collections import Counter, defaultdict
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset, concatenate_datasets


random.seed(0)


CONFIG = "kimi"
ds = load_dataset("lambda/hermes-agent-reasoning-traces", CONFIG, split="train")
print(ds)
print("Config:", CONFIG, "| Fields:", ds.column_names)
print("Categories:", sorted(set(ds["category"])))


COMPARE_BOTH = False
if COMPARE_BOTH:
   ds_kimi = load_dataset("lambda/hermes-agent-reasoning-traces", "kimi", split="train")
   ds_glm  = load_dataset("lambda/hermes-agent-reasoning-traces", "glm-5.1", split="train")
   ds_kimi = ds_kimi.add_column("source", ["kimi"] * len(ds_kimi))
   ds_glm  = ds_glm.add_column("source", ["glm-5.1"] * len(ds_glm))
   ds = concatenate_datasets([ds_kimi, ds_glm]).shuffle(seed=0)
   print("Combined:", ds, "→ counts:", Counter(ds["source"]))


sample = ds[0]
print("n=== Sample 0 ===")
print("id        :", sample["id"])
print("category  :", sample["category"], "/", sample["subcategory"])
print("task      :", sample["task"])
print("turns     :", len(sample["conversations"]))
print("system[0] :", sample["conversations"][0]["value"][:220], "...n")

We install all the necessary libraries and import the necessary modules to set up our site. We then load the lambda/hermes-agent-reasoning-traces dataset and examine its structure, fields, and categories. We also compile multiple dataset configurations and test a sample to understand the format of the discussion.

Copy the CodeCopiedUse a different browser
THINK_RE     = re.compile(r"<think>(.*?)</think>", re.DOTALL)
TOOL_CALL_RE = re.compile(r"<tool_call>s*({.*?})s*</tool_call>", re.DOTALL)
TOOL_RESP_RE = re.compile(r"<tool_response>s*(.*?)s*</tool_response>", re.DOTALL)


def parse_assistant(value: str) -> dict:
   thoughts = [t.strip() for t in THINK_RE.findall(value)]
   calls = []
   for raw in TOOL_CALL_RE.findall(value):
       try:
           calls.append(json.loads(raw))
       except json.JSONDecodeError:
           calls.append({"name": "<malformed>", "arguments": {}})
   final = TOOL_CALL_RE.sub("", THINK_RE.sub("", value)).strip()
   return {"thoughts": thoughts, "tool_calls": calls, "final": final}


def parse_tool(value: str):
   raw = TOOL_RESP_RE.search(value)
   if not raw: return {"raw": value}
   body = raw.group(1)
   try:    return json.loads(body)
   except: return {"raw": body}


first_gpt = next(t for t in sample["conversations"] if t["from"] == "gpt")
p = parse_assistant(first_gpt["value"])
print("Thought preview :", (p["thoughts"][0][:160] + "...") if p["thoughts"] else "(none)")
print("Tool calls       :", [(c.get("name"), list(c.get("arguments", {}).keys())) for c in p["tool_calls"]])

We describe regex-based parsers to extract logic traces, tool calls, and tool responses from a dataset. We process the assistant’s messages to categorize thoughts, actions, and end results in a systematic way. We then test the attacker in a sample dialog to make sure the extraction is working properly.

Copy the CodeCopiedUse a different browser
N = 3000
sub = ds.select(range(min(N, len(ds))))


tool_calls         = Counter()
parallel_widths    = Counter()
thoughts_per_turn  = []
calls_per_traj     = []
errors_per_traj    = []
turns_per_traj     = []
cat_counts         = Counter()


for ex in sub:
   cat_counts[ex["category"]] += 1
   n_calls = n_err = 0
   turns_per_traj.append(len(ex["conversations"]))
   for t in ex["conversations"]:
       if t["from"] == "gpt":
           p = parse_assistant(t["value"])
           thoughts_per_turn.append(len(p["thoughts"]))
           if p["tool_calls"]:
               parallel_widths[len(p["tool_calls"])] += 1
               for c in p["tool_calls"]:
                   tool_calls[c.get("name", "<unknown>")] += 1
               n_calls += len(p["tool_calls"])
       elif t["from"] == "tool":
           r = parse_tool(t["value"])
           blob = json.dumps(r).lower()
           if "error" in blob or '"exit_code": 1' in blob or "traceback" in blob:
               n_err += 1
   calls_per_traj.append(n_calls)
   errors_per_traj.append(n_err)


print(f"nScanned {len(sub)} trajectories")
print(f"Avg turns/traj      : {np.mean(turns_per_traj):.1f}")
print(f"Avg tool calls/traj : {np.mean(calls_per_traj):.1f}")
print(f"% with >=1 error    : {100*np.mean([e>0 for e in errors_per_traj]):.1f}%")
print(f"% parallel turns    : {100*sum(v for k,v in parallel_widths.items() if k>1)/max(1,sum(parallel_widths.values())):.1f}%")
print("Top 10 tools        :", tool_calls.most_common(10))


fig, axes = plt.subplots(2, 2, figsize=(13, 9))


top = tool_calls.most_common(15)
axes[0,0].barh([t for t,_ in top][::-1], [c for _,c in top][::-1], color="teal")
axes[0,0].set_title("Top 15 tools by call volume")
axes[0,0].set_xlabel("calls")


ks = sorted(parallel_widths)
axes[0,1].bar([str(k) for k in ks], [parallel_widths[k] for k in ks], color="coral")
axes[0,1].set_title("Tool-calls per assistant turn (parallel width)")
axes[0,1].set_xlabel("# tool calls in one turn"); axes[0,1].set_ylabel("count")
axes[0,1].set_yscale("log")


axes[1,0].hist(turns_per_traj, bins=40, color="steelblue")
axes[1,0].set_title("Conversation length"); axes[1,0].set_xlabel("turns")


cats, vals = zip(*cat_counts.most_common())
axes[1,1].pie(vals, labels=cats, autopct="%1.0f%%", startangle=90)
axes[1,1].set_title("Category distribution")


plt.tight_layout(); plt.show()

We performed extensive analysis of the dataset to measure tool usage, conversation length, and error patterns. We aggregate statistics across multiple samples to understand the agent’s overall behavior. We also create visualizations to highlight trends such as instrument frequency, parallel calls, and phase distribution.

Copy the CodeCopiedUse a different browser
def render_trace(ex, max_chars=350):
   print(f"n{'='*72}nTASK [{ex['category']} / {ex['subcategory']}]: {ex['task']}n{'='*72}")
   for t in ex["conversations"]:
       role = t["from"]
       if role == "system":
           continue
       if role == "human":
           print(f"n[USER]n{textwrap.shorten(t['value'], 600)}")
       elif role == "gpt":
           p = parse_assistant(t["value"])
           for th in p["thoughts"]:
               print(f"n[THINK]n{textwrap.shorten(th, max_chars)}")
           for c in p["tool_calls"]:
               args = json.dumps(c.get("arguments", {}))[:200]
               print(f"[CALL] {c.get('name')}({args})")
           if p["final"]:
               print(f"n[ANSWER]n{textwrap.shorten(p['final'], max_chars)}")
       elif role == "tool":
           print(f"[TOOL_RESPONSE] {textwrap.shorten(t['value'], 220)}")
   print("="*72)


idx = int(np.argmin(np.abs(np.array(turns_per_traj) - 10)))
render_trace(sub[idx])


def get_tool_schemas(ex):
   try:    return json.loads(ex["tools"])
   except: return []


schemas = get_tool_schemas(sample)
print(f"nSample 0 has {len(schemas)} tools available")
for s in schemas[:3]:
   fn = s.get("function", {})
   print(" -", fn.get("name"), "—", (fn.get("description") or "")[:80])


ROLE_MAP = {"system": "system", "human": "user", "gpt": "assistant", "tool": "tool"}


def to_openai_messages(conv):
   return [{"role": ROLE_MAP[t["from"]], "content": t["value"]} for t in conv]


example_msgs = to_openai_messages(sample["conversations"])
print("nFirst 2 OpenAI messages:")
for m in example_msgs[:2]:
   print(" ", m["role"], "→", m["content"][:120].replace("n", " "), "...")

We build resources to provide a full conversation trace in a readable format for in-depth analysis. We also extract tool schemas and convert the dataset into an OpenAI-style message format to be compatible with training pipelines. This helps us to better understand both the structure of the tools and how the discussions can be structured.

Copy the CodeCopiedUse a different browser
from transformers import AutoTokenizer
TOK_ID = "Qwen/Qwen2.5-0.5B-Instruct"
tok = AutoTokenizer.from_pretrained(TOK_ID)


def build_masked(conv, tokenizer, max_len=2048):
   msgs = to_openai_messages(conv)
   for m in msgs:
       if m["role"] == "tool":
           m["role"] = "user"
           m["content"] = "[TOOL OUTPUT]n" + m["content"]
   input_ids, labels = [], []
   for m in msgs:
       text = tokenizer.apply_chat_template([m], tokenize=False, add_generation_prompt=False)
       ids = tokenizer.encode(text, add_special_tokens=False)
       input_ids.extend(ids)
       labels.extend(ids if m["role"] == "assistant" else [-100] * len(ids))
   return input_ids[:max_len], labels[:max_len]


ids, lbls = build_masked(sample["conversations"], tok)
trainable = sum(1 for x in lbls if x != -100)
print(f"nTokenized example: {len(ids)} tokens, {trainable} trainable ({100*trainable/len(ids):.1f}%)")


think_lens, call_lens, ans_lens = [], [], []
for ex in sub.select(range(min(500, len(sub)))):
   for t in ex["conversations"]:
       if t["from"] != "gpt": continue
       p = parse_assistant(t["value"])
       for th in p["thoughts"]: think_lens.append(len(th))
       for c in p["tool_calls"]: call_lens.append(len(json.dumps(c)))
       if p["final"]: ans_lens.append(len(p["final"]))


plt.figure(figsize=(10,4))
plt.hist([think_lens, call_lens, ans_lens], bins=40, log=True,
        label=["<think>", "<tool_call>", "final answer"], stacked=False)
plt.legend(); plt.xlabel("characters"); plt.title("Length distributions (log y)")
plt.tight_layout(); plt.show()


class TraceReplayer:
   def __init__(self, ex):
       self.ex = ex
       self.steps = []
       pending = None
       for t in ex["conversations"]:
           if t["from"] == "gpt":
               if pending: self.steps.append(pending)
               pending = {"think": parse_assistant(t["value"]), "responses": []}
           elif t["from"] == "tool" and pending:
               pending["responses"].append(parse_tool(t["value"]))
       if pending: self.steps.append(pending)
   def __len__(self): return len(self.steps)
   def play(self, i):
       s = self.steps[i]
       print(f"n── Step {i+1}/{len(self)} ──")
       for th in s["think"]["thoughts"]:
           print(f" {textwrap.shorten(th, 280)}")
       for c in s["think"]["tool_calls"]:
           print(f"  {c.get('igama')}({json.dumps(c.get('arguments', {}))[:140]})") kwe-r ku-s["responses"]: phrinta (f"📥 {textwrap.shorten(json.dumps(r), 200)}") if sec["think"]["final"]: print (f"💬 {text.shorten['think']['final']200)}") rp = TraceReplayer(sample) for i in range(min(3, len(rp))): rp.play(i) TRAIN = False if TRAIN: import torch from transformers import AutoModelForCausalLM from trl import SFTTrainer, SFTConfig train_select0(range train_subset)(range train_subset) to_text(bundle): msgs = to_openai_messages(bundle["conversations"]) in m messages: if m["role"] == "tool": m["role"] = "user"; m["content"] = "[TOOL]n" + m["content"]
       collection["text"] = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=False) return bundle train_subset = train_subset.map(to_text) model = AutoModelForCausalLM.from_pretrained(TOK_ID, torch_dtype=torch.float16.itorch.il2) device_map="default" if torch.cuda.is_available() else ) cfg = None SFTConfig( output_dir="hermes-sft-demo", per_device_train_batch_size=1, gradient_accumulation_steps=4, max_steps=20, learning_rate,2_step-2e max_seq_length=1024, dataset_text_field="text", report_to="none", fp16=torch.cuda.is_available(), ) SFTTrainer(model=model, args=cfg, train_dataset=train_subset, processing_class=tok).train() print("completed) print ok."Pyright’s Type Coding Implementation for Testing Covering Generics, Protocols, Strict Mode, Type Reduction, and Modern Python Typing The tutorial is complete. You now have analyzers, statistics, plots, a player and, " "SFT examples with the token + hidden label, and an optional training hook.")

We tokenize conversations and use label masking so only assistant responses contribute to training. We analyze the length of the thought distribution, tool calls, and responses to get more information. We also use a trace replayer to step through the behavior of the agent and use a small fine-tuning loop.

In conclusion, we developed a structured workflow to analyze, analyze, and efficiently process the agent’s thought tracks. We were able to break down conversations into meaningful parts, examine how agents think step by step, and measure how they interact with tools during problem solving. Using visualization and statistics, we found insights into similar patterns and behaviors across datasets. In addition, we converted the data into a format suitable for training language models, including handling tokens and hiding the labels of the assistant’s responses. Also, this process provides a solid foundation for learning, testing, and developing AI systems that use tools in a realistic, scalable way.


Check it out Full Codes with notebook. Also, feel free to follow us Twitter and don’t forget to join our 130k+ ML SubReddit and Subscribe to Our newspaper. Wait! are you on telegram? now you can join us on telegram too.

Need to work with us on developing your GitHub Repo OR Hug Face Page OR Product Release OR Webinar etc.?Connect with us

The post Implementation of Code in Analyzing, Analyzing, Visualizing, and Reasoning Traces of a Reasoning Agent Using the lambda/hermes-agent-reasoning-traces Dataset appeared first on MarkTechPost.

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button