Launch HN: Roark (YC W25) – Taking the pain out of voice AI testing
Hey HN, we’re James and Daniel, co-founders of Roark (https://roark.ai). We built a tool that lets developers replay real production calls against their latest Voice AI changes, so they can catch failures, test updates, and iterate with confidence.
Here’s a demo video: https://www.youtube.com/watch?v=eu8mo28LsTc.
We ran into this problem while building a voice AI agent for a dental clinic. Patients kept getting stuck in loops, failing to confirm insurance, or misunderstanding responses. The only way to test fixes was to manually call the agent or read through hundreds of transcripts, hoping to catch issues. It was slow, frustrating, and unreliable.
Talking to other teams, we found this wasn’t just a niche issue - every team building Voice AI struggled to validate performance efficiently. Debugging meant calling the agent over and over. Updates shipped with unknown regressions. Sentiment analysis relied only on text, missing key audio cues like hesitation or frustration, which often signal deeper issues.
That’s why we built Roark. Instead of relying on scripted test cases, Roark captures real production calls from VAPI, Retell, or a custom-built agent via API and replays them against your latest agent changes. We don’t just feed back text; we preserve what the user said, how they said it, and when they said it, mimicking pauses, sentiment, and tone up until the conversation flow changes. This ensures your agent is tested under real-world conditions, not just synthetic scripts.
For each replay we run, Roark checks whether the agent follows key flows (e.g. verifying identity before sharing account details). Our speech-based evaluators also detect sentiments such as frustration and confusion, long pauses, and interruptions - things that regular transcripts miss.
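To make this concrete, here’s a rough sketch of what kicking off a replay with a couple of evaluators attached could look like. This is illustrative only: the endpoint, payload fields, and evaluator names are placeholders made up for this example, not our actual API.

    import requests

    ROARK_API = "https://api.roark.example/v1"  # placeholder base URL, not a real endpoint
    API_KEY = "sk-your-key"                     # placeholder API key

    # Replay one captured production call against the latest agent version and
    # attach evaluators: a flow check plus audio-level signals.
    payload = {
        "call_id": "call_12345",   # a call previously captured from VAPI, Retell, or your own agent
        "agent_version": "latest",
        "evaluators": [
            {"type": "flow", "rule": "verify_identity_before_account_details"},
            {"type": "audio", "signals": ["frustration", "long_pause", "interruption"]},
        ],
    }

    resp = requests.post(
        f"{ROARK_API}/replays",
        json=payload,
        headers={"Authorization": f"Bearer {API_KEY}"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())  # pass/fail per evaluator, plus where the conversation diverged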
After testing, Roark provides Mixpanel-style analytics to track failures, conversation flows, and key performance metrics, helping teams debug faster and ship with confidence. Instead of hoping changes work, teams get immediate pass/fail results, side-by-side transcript comparisons, and real-world insights.
We’re already working with teams in healthcare, legal, and customer service who rely on Voice AI for critical interactions. They use Roark to debug AI failures faster, test updates before they go live, and improve customer experiences - without manually calling their bots dozens of times.
Our product isn’t quite ready yet for self-service, so you’ll still see the dreaded “book a demo” on our home page. If you’re reading this, though, we’d love to fast-track you, so we made a special page for HN signups here: https://roark.ai/hn-access. If you’re working on Voice AI and want to try us out, please do!
Would love any feedback, thoughts, or questions from the HN community!
It looks great! Although the demo shows horrible security practices...
Clearly authentication shouldn't rely on prompt engineering.
Particularly when at the end of the demo it says "we have tested it again and now it shows that the security issue is fixed" - No it's not fixed! It's hidden! Still a gaping security hole. Clearly just a very bad example, particularly considering the context is banking.
Appreciate the feedback! Completely agree - authentication should be handled at the system level, not just in prompts. This demo is meant to showcase how teams can build test cases from real failures and ensure fixes work before deployment. We’ll consider using a better example.
Your post suggests authorization as a feature:
> For each replay that we run, Roark checks if the agent follows key flows (e.g. verifying identity before sharing account details)
I don't know if AI will be more susceptible or less susceptible to phishing than humans, but this feels like a bad practice.
Appreciate the feedback! To clarify, Roark isn’t handling authentication itself - it’s a testing and observability tool to help teams catch when their AI fails to follow expected security protocols (like verifying identity before sharing sensitive info).
That said, totally fair point that this example could be clearer—we’ll keep that in mind for future demos. Thanks for calling it out!
As someone who's building a personal work assistant for voice - I see the merit in automating test case generation and validation.
All products in this space by YC teams are targeted at scaled voice agent startups or teams.
- Roark (https://roark.ai/)
- Hamming (https://hamming.ai/)
- Coval (https://www.coval.dev/)
- Vocera (https://www.vocera.ai/)
How do you differentiate - who is this for? Voice agent devs paying $500/mo. for early stage software?
Great question! There’s been a lot of movement in this space, but most existing solutions focus on simulation-based testing—generating synthetic test cases or scripted evaluations.
Roark takes a different approach: we replay real production calls against updated AI logic, preserving actual user inputs, tone, and timing. This helps teams catch failures that scripted tests miss—especially in high-stakes industries like healthcare, legal, and finance, where accuracy and compliance matter.
Beyond replays, we provide rich analytics, sentiment & vocal cue detection, and automated evaluations, all based on audio—not just transcripts. This lets teams track frustration, long pauses, and interruptions that often signal deeper issues.
Would love to hear more about your assistant - how are you thinking about testing and iteration?
We are presently evaluating at least two of the options mentioned above, and the pricing for 1,000 minutes at both comes out to well under 10% of your currently listed rate. I know you've probably been told not to compete on price, but this is a space where I think it's hard to compete on quality yet. As for other analysis features, I think you're going to find yourself locked into a commoditised feature race that you're currently six months behind on.
What nobody is doing well at the moment is effective prompt versioning, comparison, and deployment via git or similar. This would be a killer feature for us but nobody is close to having shipped it from what I can see.
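To be concrete about the kind of thing I mean (a rough sketch only, not something anyone has shipped): prompts live as plain files in a repo, the agent pins a commit, and comparison and deployment fall out of ordinary git operations.

    import subprocess
    from pathlib import Path

    PROMPT_FILE = Path("prompts/scheduling_agent.md")  # illustrative path
    PINNED_SHA = "a1b2c3d"                             # the prompt version currently deployed

    def load_prompt_at(sha: str, path: Path) -> str:
        # "git show <sha>:<path>" returns the file's content at that commit, so a
        # deploy is just bumping the pinned SHA and a rollback is the reverse.
        return subprocess.run(
            ["git", "show", f"{sha}:{path.as_posix()}"],
            capture_output=True, text=True, check=True,
        ).stdout

    def diff_prompt_versions(old_sha: str, new_sha: str, path: Path) -> str:
        # Comparing two prompt versions is a plain git diff scoped to the prompt file.
        return subprocess.run(
            ["git", "diff", old_sha, new_sha, "--", path.as_posix()],
            capture_output=True, text=True, check=True,
        ).stdout

    print(load_prompt_at(PINNED_SHA, PROMPT_FILE))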
> Roark takes a different approach: we replay real production calls against updated AI logic
How does that work? As soon as the AI responds with something different, the rest of the customer call is mismatched.
Wow, YC has so much conviction in this space that they invested in four of the same company.
If anyone ever doubts if YC will invest in direct competitors, this should be your answer.
Roark: Best of luck. Talk to customers, talk to customers, talk to customers!
And often when YC invests in multiple companies in a space there are multiple winners:
- Gusto, Rippling, and Deel
- Mixpanel, Amplitude, Posthog
“Spaces” are much less important for early stage companies in newish spaces.
There was a time when Facebook, Twitter, GitHub and LinkedIn were all considered social media companies, competing to become the place I talked to dev friends.
In a way they all succeeded, and in another way they all failed.
Looks somewhat useful for voice AI QA.
But I wonder: if a company is deploying voice AI, wouldn't they have their own testing and quality assurance flows?
Is this targeted at companies without an engineering department or something? In which case I find it surprising they're able to slot in some voice AI assistant in the first place.
Great question! Most teams deploying voice AI do have testing and QA flows, but they’re often manual, brittle, or incomplete. Unlike traditional software, voice agents don’t have structured inputs and outputs — users phrase things unpredictably, talk over the bot, or express frustration in subtle ways.
Some engineering teams try to build internal testing frameworks, but it’s a massive effort - they have to log and store call data, build a replay system, define evaluation criteria, and continuously update it as the AI evolves. Most don’t want to spend engineering time reinventing the wheel when they could be improving their AI instead.
The teams that benefit most from Roark are the ones with strong QA processes — they already know how critical testing is, but they’re stuck with brittle, time-consuming, or incomplete workflows.
I noticed this on your website regarding transcription --
"More accurate than Deepgram, supporting 50+ languages with a word error rate of just 8.6%."
Can you explain how this helps me? At the end of the day you are not my transcriber, wouldn't I want to test using transcriptions produced by the transcriber that I'm actually using in production?
We capture discrepancies between your production transcripts and a more accurate reference transcript in order to calculate a Word Error Rate (WER) as part of your evaluation process. Post-call transcription tends to be more accurate than real-time transcription, and we’ve seen teams do this manually by hiring humans to label a dataset and then testing against it for WER calculations.
By providing a more accurate baseline, Roark helps teams quantify how well their production transcriptions match reality and flag cases where the model is introducing errors that could impact downstream agent performance. That way, you’re not just testing if your agent responds correctly, but whether it’s getting the right inputs in the first place.
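For anyone unfamiliar, WER itself is just word-level edit distance (substitutions + insertions + deletions) divided by the number of words in the reference transcript. A minimal sketch of the standard calculation, not our implementation:

    # Word error rate: word-level Levenshtein distance between a reference
    # transcript and a hypothesis, divided by the reference length.
    def wer(reference: str, hypothesis: str) -> float:
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(
                    dp[i - 1][j] + 1,         # deletion
                    dp[i][j - 1] + 1,         # insertion
                    dp[i - 1][j - 1] + cost,  # substitution or match
                )
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    # One dropped word against a 6-word reference -> ~0.17
    print(wer("please verify my insurance before booking",
              "please verify my insurance booking"))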
This seems useful for issues early in the convo but what if the AI responses diverge from the recorded convo prior to the issue being hit?
That’s a great point! While we do our best to simulate an identical case, if the agent responds differently, our focus is on whether the key evaluator or goal for that replay set passes or fails. We use that as the source of truth and flag the exact moment where the conversation diverges from the expected flow.
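Conceptually it’s a turn-by-turn comparison between the original call and the replay. A simplified sketch of the idea, with exact string matching standing in for whatever fuzzier comparison you’d actually want:

    from typing import Optional

    def first_divergence(original_turns: list[str], replay_turns: list[str]) -> Optional[int]:
        # Return the index of the first agent turn where the replay no longer
        # matches the original call, or None if the flows stay aligned.
        for i, (orig, new) in enumerate(zip(original_turns, replay_turns)):
            if orig.strip().lower() != new.strip().lower():
                return i
        if len(original_turns) != len(replay_turns):
            return min(len(original_turns), len(replay_turns))
        return None

    original = ["Hi, how can I help?", "Can you confirm your date of birth?"]
    replay   = ["Hi, how can I help?", "Sure, your balance is $230."]
    print(first_divergence(original, replay))  # -> 1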
Curious, who do you guys consider is the leader in the space with real-time voice interactions, interruptions, etc?
Why is it called Roark?
Great question! Roark is named after Ted Roark from Chuck — both of us (Daniel and I) watched Chuck a lot growing up, and the name just stuck with us.
No deep meaning, just something we liked and thought sounded cool!
I had assumed it was a reference to the highly polarizing and political author Ayn Rand—might be worth giving some thought to!
My first thought when I saw the name was it would be an AI architecture firm related to Howard Roark from Atlas Shrugged.
*the fountainhead
Right!
Congrats on the launch!
Thank you!
Super cool to see this just now. We're building in the space of computer screen analysis and have started running into something similar, hence we want to build the equivalent for pixels instead of voice.
Would love to chat to you, jan@kontext21.com
Thanks, Jan! Would love to hear more about what you’re working on! Just sent you an email.