starving for wisdom
a short story about drowning in information
ninety-nine problems
Well, 117 problems to be exact.
At Aiwyn, I recently used Claude Code to pull evidence-backed problem statements out of ten hours of calls and sync them up with both Notion and Vistaly1.
While Claude Code cooked in the background with minimal steering from me, I caught up with an acquaintance over coffee, traveled to an out-of-state off-site, jammed on mission and vision with coworkers, and just generally went about business as usual.
“But,” you may be asking, “why on earth did you even need to review all those calls to begin with?”
Well, it all started with an acquisition…
new form, who dis?
Column Tax recently announced they’re joining Aiwyn.
The Column Tax team has built an amazing consumer app that we’re working on repurposing for professionals, but it happens to address tax situations just adjacent to the areas we’ve been focusing on for the last six months. The acquisition opens up a massive opportunity that we’re now uniquely positioned to address, which means we need to understand the problems accounting firms run into with these tax situations as quickly and as comprehensively as we can.
Of course we’ve lined up as many conversations as we can with our existing customers and prospects to learn more about the pain they experience with incumbent tax solutions, but there’s also a treasure trove of prior calls we can draw insights from without forcing our customers to rehash the exact same situations with a slightly different cast of characters from Aiwyn.
Unfortunately, Aiwyn was not kind to its future self, and all of those call recordings lie dormant, unstructured and unsynthesized, just waiting for someone—or some thing—to take an interest…
a little learning is a dangerous thing
In just one list there were enough calls that tackling them with nothing but human listening and notetaking could easily take a full 24 hours, not to mention the time and effort necessary to reflect, collect, and organize.
The astute reader may recall that I’ve written before about using AI to synthesize call recordings, so what’s the big deal? Why not just run the calls through the workflow I’d already built? Why reinvent the wheel?
Well, as it turns out, Claude Code hallucinates just as aptly as 3.5-era ChatGPT.
As I was tweaking the workflow to extract problems specifically, without necessarily producing broader insights, I did what any good AI tinkerer would do and used an eval2 to make sure the new workflow was producing good output from a known-good source: in particular, a recording I’d transcribed once already.
And I just couldn’t get the new workflow to jibe with the old one. The old workflow produced what seemed to be a deeper and richer set of problems that the new workflow was somehow glossing over. Claude and I both banged our heads against the problem for longer than it would have taken to listen to the entire call again, running and re-running the transcript through the process and failing each time.
In an act of desperation, I switched Claude’s model from Sonnet to Opus, and Opus figured it out immediately.
Alongside a vast majority of accurate insights, the old workflow had just made some shit up.
A known-good source, indeed!
We’re building a business, and product development can tank on the basis of bad research, so there had to be guardrails. Speed could not come at the expense of quality.
Fortunately, society is incentivized to solve for verifiable quality in a lot of different contexts. The broader problem is neither new nor uncommon, even if the patterns for overcoming it aren’t often applied to qualitative research.
subagents are all you need
The quality unlock came in the form of a subagent I’d created at first as a simple reviewer but that later became the core of a feedback loop that produces, as far as I can tell, highly accurate output for every call I’ve thrown at it so far3.
While collaborating on the workflow to get it humming, I suggested to Claude that we pull the problem-extraction instructions out into a reusable skill to save tokens. When I suggested that we also pull the problem-verification instructions out into a separate reusable skill, Claude clapped back: just give the verification subagent the exact same skill, precisely the same context as the extraction subagent, but have it apply that skill differently.
The approach is elegant and intuitive, and it really just works.
Think about the human analog, say code review in a software engineering context: one engineer employs their skill to write code, and another engineer with that same skill can effectively apply it to review the code the first engineer wrote. The reviewer may have some additional context, and is certainly primed to apply the skill in a different way, e.g. to find mistakes, inconsistencies, or incoherence, but the same underlying skill applies.
Same principle, different task.
The emergent interaction between the extractor subagent and the verifier subagent makes me giddy. The extractor presents its analysis for review; the verifier points out its flaws; the extractor drafts another data set to correct its mistakes; and then the verifier reviews it all again (incidentally, it almost always gives the second pass the all-clear).
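To make the loop concrete, here’s a minimal Python sketch. Every name and data point in it is hypothetical: the extractor is reduced to a stand-in function where a real setup would invoke an LLM subagent with the shared skill, and only the verifier’s mechanical core, checking that every claimed quote actually appears in the transcript, is implemented for real.

```python
# Hypothetical toy transcript and extractor output for illustration only.
TRANSCRIPT = (
    "We spend two full days every season re-keying client data "
    "because the tax tool can't import from our practice system."
)

def extract_problems(transcript: str) -> list[dict]:
    # Stand-in for the extractor subagent. Includes one grounded problem
    # and one hallucinated quote so the loop has something to catch.
    return [
        {"problem": "Manual re-keying of client data",
         "quote": "re-keying client data"},
        {"problem": "Slow e-file acknowledgments",       # hallucinated
         "quote": "acknowledgments take a week"},        # not in transcript
    ]

def verify(problems: list[dict], transcript: str) -> tuple[bool, list[dict]]:
    # The verifier applies the same skill differently: it doesn't extract,
    # it checks grounding. Returns (approved, flagged_problems).
    flagged = [p for p in problems if p["quote"] not in transcript]
    return (len(flagged) == 0, flagged)

def feedback_loop(transcript: str, max_rounds: int = 3) -> list[dict]:
    problems = extract_problems(transcript)
    for _ in range(max_rounds):
        approved, flagged = verify(problems, transcript)
        if approved:
            return problems
        # The extractor "corrects its mistakes": here we simply drop
        # ungrounded problems; a real subagent would re-extract using
        # the verifier's notes as added context.
        problems = [p for p in problems if p not in flagged]
    return problems

print(feedback_loop(TRANSCRIPT))  # only the grounded problem survives
```

The point of the sketch is the shape of the interaction, not the implementation: both roles could be driven by the same skill prompt, with only the task framing differing between extraction and verification passes.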
The rest of the workflow—syncing to different tools and self-reflection—is just basic orchestration with subagents set up to keep the main thread’s context clear of MCP server tokens. Exporting the problems to any arbitrary destination is trivial, so the pattern can fit into almost any research pipeline with any set of tools one can imagine.
After running through the first couple of calls as a proof of concept, I wound up with an /extract-problems custom slash command that runs an end-to-end workflow like this:
1. Store the call transcript in a standard format locally
2. Create a separate file that summarizes problems explicitly stated in the call
3. Validate that every problem is backed by an actual quote and isn’t an inference or hallucination
4. Sync the problems to a flat table in Notion
5. Sync each problem into the most appropriate place in the tree in Vistaly
6. Review the Vistaly tree holistically for opportunities to group themes
7. Reflect on the workflow to address hiccups or inefficiencies
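The steps above can be sketched as a single orchestration function. Everything here is hypothetical: the function and file names are mine, the extraction step is a stand-in for a subagent call, and the Notion/Vistaly syncs are stubbed out, since in the real workflow those are handled by subagents talking to MCP servers so the main thread’s context stays clear of MCP tokens.

```python
import json
from pathlib import Path

def run_workflow(call_id: str, transcript_text: str, workdir) -> list[dict]:
    workdir = Path(workdir)

    # 1. Store the call transcript locally in a standard format.
    (workdir / f"{call_id}.transcript.md").write_text(transcript_text)

    # 2. Summarize explicitly stated problems into a separate file.
    #    (Stand-in for the extractor subagent; hypothetical sample data.)
    problems = [
        {"problem": "Manual re-keying of client data",
         "quote": "re-keying client data"},
    ]

    # 3. Validate: keep only problems backed by a quote that actually
    #    appears in the transcript.
    grounded = [p for p in problems if p["quote"] in transcript_text]
    (workdir / f"{call_id}.problems.json").write_text(json.dumps(grounded, indent=2))

    # 4-5. Sync to a flat Notion table and into the Vistaly tree (stubbed).
    # 6-7. Holistic tree review and workflow self-reflection would be
    #      further subagent passes over the synced data.
    return grounded
```

The export steps are deliberately decoupled from extraction and validation, which is what makes the pattern portable to almost any destination tool.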
This workflow helped me get up to speed quickly on patterns that emerged from ten different discovery/sales calls in the course of a week, without having to carve out the time to review each call manually, saving me hours and hours of screen time and I don’t know how many clicks.
so the robots are researchers now?
The big caveat here is that having an automated workflow isn’t a license to outsource thinking or the rigorous work of actual synthesis. At its core, this was a data extraction and categorization problem that just so happens to have data with characteristics so inconsistent that using anything other than modern AI for automated extraction is impractical.
If you try to pretend AI is doing the research… well, there be dragons.
Even supported by direct, verified quotes from call transcripts, the problems as they’ve been articulated in isolation lack some amount of broader context from the conversation that surrounded them, and they’re certainly devoid of any emotional cues infused by the human who produced the words.
This workflow is a valuable tool to identify themes and to systematically collect evidence that may form the basis of hypotheses that will require, for now, human effort and the kind of empathy and intuition that can only come from authentic curiosity and connection.
The hours I didn’t spend watching call recordings freed me up to do higher-leverage human work: being present in those conversations, synthesizing an evidence-backed set of problems, and applying product sense to make quality decisions.
Opportunity-solution tree SaaS: https://www.vistaly.com/
Granted, an n of 10 is not exactly scientifically rigorous, but 10/10 is a strong signal!


