Launch HN: Cyberdesk (YC S25) – Automate Windows legacy desktop apps
Hi HN, We’re Mahmoud and Alan, building Cyberdesk (https://www.cyberdesk.io/), a deterministic computer use agent for automating Windows desktop applications. Developers use us to automate repetitive tasks in legacy software in healthcare, accounting, construction, and more, by executing clicks and keystrokes directly into the desktop.
Here are a couple of demos of Cyberdesk’s computer use agent:
A fast file import automation into a legacy desktop app: https://youtu.be/H_lRzrCCN0E
Working on a monster of a Windows monolith called OpenDental (showcases the agent’s learning process as well): https://youtu.be/nXiJDebOJD0
Filing a W-2 tax form: https://youtu.be/6VNEzHdc8mc
Many industries are stuck with legacy Windows desktop applications, with staff plagued by repetitive tasks that are incredibly time consuming. Vendors offering automations for these end up writing brittle Robotic Process Automation (RPA) scripts or hiring off-shore teams for manual task execution. RPA often breaks due to inevitable UI changes or unexpected popups like a Windows update or a random in-app notification. Off-shore teams are often unreliable and costlier than software, plus they’re not always an option for regulated industries.
I previously built RPA scripts impacting 20K+ employees at a Fortune 100 company, where I experienced firsthand RPA’s brittleness and inflexibility. It was obvious to me that this was a bandaid solution to an unsolved problem. Alan was building a computer use agent for his previous startup and realized its huge potential to automate a ton of manual computer tasks across many industries, so we started working on Cyberdesk.
Computer use models can struggle with abstract, long-horizon tasks, but they excel at making context-aware decisions on a screen-by-screen basis, so they’re a good fit for automating these desktop apps.
The key to reliability is crafting prompts that are highly specific and well thought out. Much like with ChatGPT, vague or ambiguous prompts won’t get you the results you want. This is especially true in computer use because the model is processing nearly an entire desktop screen’s worth of extra visual information; without precise instructions, it doesn’t know which details to focus on or how to act.
Unlike RPA, Cyberdesk’s agents don’t blindly replay clicks. They read the screen state before every action and self-correct when flows drift (pop-ups, latency, UI changes). Unlike off-the-shelf computer use AIs, Cyberdesk runs deterministically in production: the agent primarily follows the steps it has learned and only falls back to reasoning when anomalies occur. Cyberdesk learns workflows from natural-language instructions, capturing nuance and handling dynamic tasks - far beyond what a simple screen recording of a few runs can encode.
This approach is good for both reliability and cost: reliability, because we fall back to a computer use model in unexpected situations; and cost, because computer use models are expensive and we only use them when we need to. Otherwise we leverage faster, more affordable visual LLMs for checking the screen state step by step during deterministic runs. Our agents are also equipped with tools like failsafes, data extraction, and screen evaluation to handle dynamic and sensitive situations.
How it works: you install our open source driver on any Windows machine (https://github.com/cyberdesk-hq/cyberdriver). It communicates with our backend to receive commands (click, type, scroll, screenshot) and sends back data (screenshots, API responses, etc). You give our computer use agent a detailed natural language description of the process for a given task, just like an SOP for an employee learning a new task for the first time. The agent then leverages computer use AI models to learn the steps and memorizes them by saving each screenshot alongside its action (click on these coordinates, type XYZ, wait for page to load, etc).
The agent deterministically runs through these steps to run fast and predictably. In order to account for popups and UI changes, our agent checks the live screen state against the memorized state to determine whether it’s safe to proceed with the memorized step. If no major changes prevent safe execution of the memorized step, it proceeds; otherwise, it falls back to a computer use model with context on past actions and the remaining task.
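To make that concrete, here's a minimal sketch of the check-then-replay loop described above (all names and the similarity check are simplified placeholders, not our actual engine):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    screenshot_hash: str  # fingerprint of the screen as it looked when learned
    action: str           # e.g. "click 412,305" or "type XYZ"

def similar(live_hash: str, memo_hash: str) -> bool:
    # Placeholder: a real system would compare screenshots with a
    # perceptual hash or a cheap vision model, not exact equality.
    return live_hash == memo_hash

def run_workflow(steps: List[Step],
                 capture: Callable[[], str],
                 execute: Callable[[str], None],
                 fallback: Callable[[Step, List[Step]], None]) -> None:
    """Replay memorized steps; hand off to a computer use model on drift."""
    for i, step in enumerate(steps):
        live = capture()
        if similar(live, step.screenshot_hash):
            execute(step.action)       # cheap, fast, deterministic path
        else:
            fallback(step, steps[i:])  # expensive reasoning path with
                                       # context on the remaining task
```

The point of the split is that `execute` costs nothing per step, while `fallback` (the computer use model) only runs when the live screen no longer matches what was memorized.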
Customers are currently using us for manual tasks like importing and exporting files from legacy desktop applications, booking appointments for patients in a desktop PMS, and data entry, like filling out patient profiles in an EMR.
We don't have a self-serve option yet but we'd love to onboard you manually. Book a demo here to learn more! (https://www.cyberdesk.io/) If you’d rather wait for the self-serve option a little later down the line, please do submit your email here (https://forms.gle/HfQLxMXKcv9Eh8Gs8) so you can be notified as soon as that’s ready. You can also check out our docs here: https://docs.cyberdesk.io/.
We’d absolutely love to hear your thoughts on our approach and on desktop automation for legacy industries!
Congrats! I think the space is very interesting; I was a founder of a similar Windows CUA infra / RPA agents startup but pivoted. My thoughts:
1) The funny thing about determinism is how deterministic you should be about when to break out of it; it's kind of a recursive problem. Agents are inherently very tough to guardrail in an action space as big as CUA's. The guys from Browser Use realized it as well and built workflow-use. Or you could try RL or finetuning per task, but that's not viable (economically or technically) currently.
2) As you know, it's a very client-facing/customized solution space. You might find this interesting; it reflects my thoughts in the space as well. Tough to scale as a fresh startup unless you really niche down on some specific workflows. https://x.com/erikdunteman/status/1923140514549043413 (he is also building in the deterministic agent space now, funnily enough)

3) It actually is annoyingly expensive with Claude if you break caching, which you have to at some point if you feed in every screenshot etc. You mentioned you use multiple models (I guess UI-TARS/OmniParser?), but in the comments you said Claude?
4) Ultimately the big bet in the RPA space, as again you know, is that the TAM won't shrink a lot as more and more SAPs, ERPs, etc. implement APIs. Of course the big money will always be in ancient apps that won't, but then again in that space, UiPath and the others have a chokehold (and their agentic tech is actually surprisingly good when I had a look 3 months ago).
Good luck in any case! I feel like it's one of those spaces where we're definitely still a touch too early, but it's such a big market that there's plenty of space for a lot of people.
Looks great for automating workloads in Windows desktop applications. I'd love to understand more deeply how your application works. The set of commands your backend sends is click, scroll, screenshot; does it send a command to, say, type characters into an input field? How is it able to pinpoint a text field from a screenshot? Is an LLM reliable enough to pinpoint the x and y to click on a field?
Also, to run this at large scale, does it become prohibitively expensive to run on a daily basis across thousands of custom workflows? I assume this runs in the cloud.
Thanks! And yes, so our pathfinder agents utilize Sonnet 4's precise coordinate generation capabilities. You give it a screenshot, give it a task, and it can output exact coordinates of where to click on an input field, for example.
And yes we've found the computer use models are quite reliable.
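For illustration only (the prompt and the parsing are assumptions on our part, not Cyberdesk's actual implementation), the basic pattern of asking a vision model for click coordinates and reading them back looks roughly like:

```python
import json
import re

# Hypothetical instruction sent alongside a screenshot to a vision model.
PROMPT = ('Locate the "Patient Name" input field in this screenshot. '
          'Reply with JSON only: {"x": <int>, "y": <int>}')

def parse_coordinates(model_reply: str) -> tuple[int, int]:
    """Pull an {x, y} JSON object out of a model reply, tolerating extra prose."""
    match = re.search(r'\{[^{}]*\}', model_reply)
    if not match:
        raise ValueError("no coordinates in reply")
    coords = json.loads(match.group(0))
    return int(coords["x"]), int(coords["y"])

# A reply like 'Sure! {"x": 412, "y": 305}' parses to the point (412, 305),
# which the driver can then click.
```

The model call itself is the easy part; the reliability question above is mostly about whether the returned coordinates actually land on the intended control.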
Great questions on scale: the whole way we designed our engine is that on the happy path, we actually use LLMs very little. The agent runs deterministically, only checking at critical spots whether anomalies occurred (if one does, we fall back to computer use to take it home). If not, our system can complete an entire task end to end for on the order of $0.0001.
So it's a hybrid system at the end of the day. This results in really low costs at scale, as well as speed and reliability improvements (since in the happy path, we run exactly what has worked before).
Can it do assertions? This could be useful for testing old software.
Autoit must be a good 20 years old: https://www.autoitscript.com/site/
Have you looked at using accessibility APIs, such as UI Automation on Windows, to augment screenshots and simulated mouse clicks?
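One way to combine the two (a sketch; the element data shape here is our assumption, fed from whatever UIA enumeration you use): cross-check that the model-chosen coordinates actually land on the control the accessibility tree says is there, before clicking.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class UiaElement:
    name: str                        # e.g. from UIA's Name property
    rect: tuple[int, int, int, int]  # bounding box: left, top, right, bottom

def element_at(elements: List[UiaElement], x: int, y: int) -> Optional[UiaElement]:
    """Return the first accessibility element whose bounding rect contains (x, y)."""
    for el in elements:
        left, top, right, bottom = el.rect
        if left <= x < right and top <= y < bottom:
            return el
    return None

# Before executing a model-chosen click at (412, 305):
#   el = element_at(tree, 412, 305)
#   if el is None or "Patient Name" not in el.name:
#       treat it as an anomaly and fall back / re-plan
```

This doesn't replace screenshots (many legacy apps expose a sparse or empty accessibility tree), but where UIA data exists it's a cheap sanity check on simulated clicks.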
Isn’t this an optional feature for developers? They can disable it / remove the names of the buttons, etc., to make RPA harder?
Looks great. For the EMR use cases, do you sign BAAs? Which CUA models are being used? No data retention?
We sign BAAs with all our healthcare customers + all our vendors. Currently using Claude computer use. We have zero-data retention signed with both Anthropic and OpenAI, so none of the information getting sent to their LLMs ever gets retained.
>none of the information getting sent to their LLMs ever get retained
Is it possible to verify that?
Yup! We have signed certificates that explicitly state this, with all LLM providers we use.
That's not "verification" by any definition of the word.
Good point. In a way, we can verify to a customer that we have that policy in place by showing them the signed certificate. But you're correct that we haven't gone as far as asking Anthropic or OpenAI for proof that they aren't retaining any of our data. What we did do is get their SOC 2 Type II reports, which showed no significant security vulnerabilities that would impact our usage of their services. So we've been operating under the assumption that they're honoring our signed agreements within the context of those reports, and our customers have been okay with that. But we're definitely open to pursuing that kind of proof at some point.
Honestly, I'm surprised your lawyers let you post that here.
+1 for honesty and transparency
Typically with this sort of thing, the way it really works is that you, the startup, use a service provider (like OpenAI) who publishes their own external audit reports (like a SOC 2 Type 2). The SOC 2 auditors will see that the service provider has a policy covering how it handles customer data for customers covered by Agreement XYZ, and will require evidence that the provider is following its policies about not using that data for undeclared purposes or whatever else.
Audit rights are all about who has the most power in a given situation. Just like very few customers are big enough to go to AWS and say "let us audit you", you're not going to get that right with a vendor like Anthropic or OpenAI unless you're certifiably huge, and even then it will come with lots of caveats. Instead, you trust the audit results they publish and implicitly are trusting the auditors they hire.
Whether that is sufficient level of trust is really up to the customer buying the service. There's a reason many companies sell on-prem hosted solutions or even support airgapped deployments, because no level of external trust is quite enough. But for many other companies and industries, some level of trust in a reputable auditor is acceptable.
Thanks for the breakdown Seth! We did indeed get their SOC 2 Type II reports and made sure they showed no significant security vulnerabilities that will impact our usage of their service.
Is it a 3rd party that is verifying?
We haven't looked into this kind of approach yet, but definitely worthwhile to do at some point!
So you’re taking the largest copyright infringers at their word for it?
Right now we are taking the policies we signed with our LLM vendors as a verification of a zero data retention policy. We did also get their SOC 2 Type II reports and they showed no significant security vulnerabilities that will impact our usage of their service. We're doing our best to deliver value while taking as many security precautions as possible: our own data retention policy, encrypting data at rest and in transit, row-level security, SOC 2 Type I and HIPAA compliance (in observation for Type II), secret managers. We have other measures we plan to take like de-identifying screenshots before sending them up. Would love to get your thoughts on any other security measures you would recommend!
I’m guessing OP is asking if it’s possible to verify they’re honoring the contract and deleting the data?
Nope.
Personally I think this approach is flawed because it runs in the cloud. If it were an agent I could run locally I'd be much more interested.
Are you referring to the LLM being used, or to where the actions (click, type, etc.) are being executed? The actual actions can be executed on any Windows machine, so execution can take place locally on your device. The LLMs we're using right now are cloud LLMs; we haven't done an LLM self-hosting option yet. Can I ask what reservations you have about running in the cloud? We have zero-data retention signed with our LLM vendors, so none of the data getting sent to them ever gets retained.
I'm talking about the LLM (and any other infrastructure involved). Reasons are:
- Pricing. If I grow to do this at scale, I don't want to be paying per-action, per-month, per-token, etc.
- Privacy. I don't want my data, screenshots, whatever being sent to you or the cloud AI providers.
- Control. I don't want to be vulnerable to you or other third parties going bankrupt, arbitrarily deciding to kill the product or its dependencies, or restructuring plans/pricing/etc. I also want to be able to keep my day-to-day operations running even if there's a major cloud outage (that's one reason we're still using this "old fashioned", non-cloud software in the first place).
I think I'm simply not your target market.
I advise several companies who could be (they run "legacy" software with vast teams of human operators whose daily tasks include some portion of work that would be a good candidate for increased automation), but most of them are in a space where one or more of the above factors would be potential deal breakers.
The retention agreements between you and your vendors are great (I mean that sincerely), but I'm not party to them so they don't do anything for me. If you offered a contractual agreement with some teeth in it (eg. underwritten or bond-backed to the tune of several digits, committing to specific security-related measures that are audited, with a tacit acknowledgement any proven breach of contract in and of itself constitutes damages) it could go a long way to address the privacy issues.
In terms of pricing it feels like the core of your product is an outside vendor's computer-operating AI model, and you've written a prompt wrapper and plumbing around it that ferries screenshots and directives back and forth. This could be totally awesome for a small scale customer that wants to dip their toes into AI automation and try it out as a turnkey solution. But the moat doesn't seem very big, and I'd need to be convinced it's a really slick solution in order to favour that route instead of rolling my own wrapper.
Please don't take this the wrong way, it's just one datapoint of feedback and I do wish you luck with your venture.
If this can't run full-local, isn't that basically a botnet? You're talking about installing a kernel-level driver that receives instructions on what to do from a cloud service.
Great point! Yes you are correct in that the actual "agent" lives in the cloud and its actions are executed by a proxy running on the desktop. Hopefully at some point we can set up a straightforward installation procedure to have the AI models running entirely on the desktop, but that's constrained by desktop specs for now. VMs and desktops with the specs to handle that would be prohibitively expensive for a lot of teams trying to build these automations.
Out of curiosity, what would the minimum specs need to be in order to run this locally?
My PC is just good enough to run a DeepSeek distill. Is that on par with the requirements for your model?
There isn't a viable computer use model that can be run locally yet, unfortunately. I'm extremely excited for the day that happens though. Essentially, the key capability that makes a model a computer use model is precise coordinate generation.
So if you come across a local model that can do that well, let us know! We're also keeping a close watch.
Haven’t looked into them much but I thought the Chinese labs had released some for this kind of thing
What would it take to train your own?
Frankly quite insulting to call any Windows app legacy
sorry it came off that way! could you elaborate on that thought?
windows itself is legacy.