Launch HN: Skyvern (YC S23) – open-source AI agent for browser automations

321 points by suchintan 3 days ago

Hey HN, we’re Suchintan and Shu from Skyvern (https://www.skyvern.com). We’re building an open source tool to help companies automate browser-based workflows using LLMs.

Our open source repo is at https://github.com/Skyvern-AI/Skyvern, and we're excited to share our cloud version with you (https://app.skyvern.com) :)

Skyvern allows you to define a single (or a series of) goal-based prompts to instruct an agent to complete complex tasks on websites. Here’s a quick demo of Skyvern: https://www.loom.com/share/76b231309df74a528061fcf102e1967f

We built this to solve a specific problem: building browser automations often requires companies to either hire people and scale out operations teams to do tedious manual work, or hire developers to use products like UI-Path or Selenium to build automations.

Code-based solutions always run into the same problem: they’re brittle (wow this website added a new pop-up dialog and my script broke), and fail to achieve the same objective across multiple websites (how can I fill out a contact-us form on hundreds of different websites?)

We did a Show HN a few months ago (https://news.ycombinator.com/item?id=39706004), and since then, we’ve onboarded customers for a wide variety of use cases: generating insurance quotes on websites like Geico.com; applying to jobs on websites like lever.co; automating filing permits in local government portals; registering new corporations for employment identification; fetching invoices from hundreds of different portals such as hydroone.com; automating purchasing on a handful of e-commerce websites like zooplus.com; and filling out contact us forms on a bunch of random smb websites (such as HVAC websites).

To be able to service all of these, we’ve built and open-sourced quite a few interesting features:

(1) a fully-featured React application allowing you to see every action Skyvern is taking in real-time;

(2) livestreaming browser instances to allow our users to see what Skyvern is doing when running inside of a docker container;

(3) authenticated sessions, integrating with Bitwarden and allowing users to specify Email + Phone + QR-code based 2FAs;

(4) “workflows” allowing users to chain multiple goal-based prompts together, which can handle tasks like invoice downloading, or automating purchasing pipelines;

(5) processing HTML Elements (ex. identifying + summarizing SVGs) and performing website interactions (ex. Iterating over dynamic autocompletes to fill in address information correctly)

(6) “cached workflows”, allowing Skyvern to memorize previous interactions (ie text inputs) and re-use them in future runs.

We’ve also been blessed with a few model advancements to solve some of the cost concerns the community brought up. Skyvern’s token costs went down 80% from $15 / 1M tokens (GPT-4V) to $2.50 / 1M tokens (GPT-4O)

Despite the model costs going down 80%, Skyvern is still quite expensive to run, so we give every new user $5 of credits to try it out and see if it can be useful for you.

We would be honored if you could give it a try at https://app.skyvern.com and share some feedback with us, and we look forward to any and all of your comments!

glorpsicle 3 days ago

Congrats on the launch! I've been keeping up with you folks since you last posted (a few months ago, I believe). How does Anthropic's recent announcement of Claude's "computer use" abilities grab you? What key differentiators does Skyvern have, at this point in time ("computer use" with Claude being relatively new)?

philipbjorge 3 days ago

I work in this space and Claude's ability to count pixels and interact with a screen using precise coordinates seems like a genuinely useful innovation that I expect will improve upon existing approaches.
Existing approaches tend to involve drawing marked bounding boxes around interactive elements and then asking the LLM to provide a tool call like `click('A12')` where A12 remaps to the underlying HTML element and we perform some sort of Selenium/JS action. Using heuristics to draw those bounding boxes is tricky. Even performing the correct action can be tricky as it might be that click handlers are attached to a different DOM element.
Avoiding this remapping between a visual to an HTML element and instead working with high level operations like `click(x, y)` or `type("foo")` directly on the screen will probably be more effective at automating usecases.
That being said, providing HTML to the LLM as context does tend to improve performance on top of just visual inference right now.
So I dunno... I'm more optimistic about Claude's approach and am very excited about it... especially if visual inference continues to improve.
- suchintan 3 days ago
  
  Agreed. In the short term (X months) I expect the HTML Distillation + giving text to LLMs to win out.. but the long term (Y years) screenshot only + pixels will definitely be the more "scalable" approach
  One very subtle advantage of doing HTML analysis is that you can cut out a decent number of LLM calls by doing static analysis of the page
  For example, you don't need to click on a dropdown to understand the options behind it, or scroll down on a page to find a button to click.
  Certainly, as LLMs get cheaper the extra LLM calls will matter less (similar to what we're seeing happen with Solar panels where cost of panel < cost of labour now, but was reversed the preceding decade)
- drothlis 2 days ago
  
  > Claude's ability to count pixels and interact with a screen using precise coordinate
  I guess you mean its "Computer use" API that can (if I understand correctly) send mouse click at specific coordinates?
  I got excited thinking Claude can finally do accurate object detection, but alas no. Here's its output:
  > Looking at the image directly, the SPACE key appears near the bottom left of the keyboard interface, but I cannot determine its exact pixel coordinates just by looking at the image. I can see it's positioned below the letter grid and appears wider than the regular letter keys, but I apologize - I cannot reliably extract specific pixel coordinates from just viewing the screenshot.
  This is 3.5 Sonnet (their most current model).
  And they explicitly call out spatial reasoning as a limitation:
  > Claude’s spatial reasoning abilities are limited. It may struggle with tasks requiring precise localization or layouts, like reading an analog clock face or describing exact positions of chess pieces.
  --https://docs.anthropic.com/en/docs/build-with-claude/vision#...
  Since 2022 I occasionally dip in and test this use-case with the latest models but haven't seen much progress on the spatial reasoning. The multi-modality has been a neat addition though.
  - wintonzheng 2 days ago
    
    Curious: what use cases do you use to test the spacial reasoning ability of these models?
- makestuff 2 days ago
  
  I don't use LLMs that often, but I recently used Claude Sonnet and was more impressed than I was with Chat GPT for similar AWS CDK questions.
  In your opinion is Claude in the lead now? Or is it still really just dependent on what use case/question you are trying to solve?
suchintan 3 days ago

Great question -- I was waiting for someone to ask this!
Their product and launch is super cool. It's incredible how much it's able to do by just relying on tool use + micro agents + screen shots + coordinates to interact with websites.
There are a couple of thoughts here:
(1) Will their competitors wait around and not build something similar? Will xAI / Gemini / OpenAI / Mistral / MetaAI teams wait around? Probably not. This is likely a huge part of the future, and one company will not "take it all"
(2) How is value actually derived from these systems? Is a demo + cool usable product enough? Likely not. Most people actually want their workflow automated. For personal use-cases, this might be enough.. but enterprises likely want something more complex
(3) Will this be optimized for Claude only? What if you want to run this with your own open source LLMs? Or you want to point this at the best model on the market all the time? Will you get that flexibility through a solution provided by a big player? Likely not -- Anthropic has incentive to get you to use Claude under the hood
The last point is the one that gives me hope. Our open source users are able to pick their favourite model to run on. You're not locked into Cluade. You can run it on Gemini / GPT-4O or open source ones such as Llama 3.2.
- helloericsf 3 days ago
  
  Congrats on the launch! Curious to know, which OSS models you see works best at the moment?
  - suchintan 3 days ago
    
    We've had a decent amount of luck with InternVL 2.0 w/ Llama, and are pretty excited about Llama 3.2
    It's still super early in the open source x vision model space. The limiter actually seems to be the vision encoder -- advancements here will pay off huge dividends
    https://huggingface.co/spaces/opencompass/open_vlm_leaderboa...
    
    helloericsf 3 days ago
    
    Thank you! Great insight.

sahmeepee 2 days ago

Probably not the first AI wrapper around Playwright this week, and certainly not the first this month.

I think this use case of automation in a BPA sense is more compelling than using it for test automation, because the latter is much more concerned with the precision and repeatability of the process. For the BPA task, arguably you care only about the outcome and it often doesn't matter if it gets there via some crazy route.

Part of the problem for me is that your example video shows a big wodge of prompt that had to be written to make this work and then a few kb of payload data (parameters) in a plaintext, non-csv format. If the expectation is that this replaces someone just using Playwright with codegen due to that being too technical, I'm not convinced there is a huge group of people who can manage one task but not the other.

Furthermore, you are expecting them to pass over their website login credentials and apparently their credit card details too, in plain text. You had better have a very solid idea of how to handle that sensitive data to avoid serious consequences if your users' skyvern accounts are compromised.

I think the frequency of website redesigns is oversold by people producing these LLM-driven Playwright wrappers, especially when targeting old-fashioned or government sites. As an example, we have had a suite of lengthy Playwright browser automations to interact with a government site for a few years and have had to maintain them only once, when the agency's business process changed. The prompt would also have needed to change had we used Skyvern, as would the payload, because the process was different. The difference with the Playwright automation, though, is that we could use assertions to verify steps had succeeded/failed and data had been recorded correctly, so we would know the process needed updating. I can't see that option in Skyvern which would have me worrying that process changes would be overlooked and we would unknowingly start entering the wrong data or missing steps.

suchintan 2 days ago

You're making some really good points here
1/ the current prompt + payload structure is definitely on the complicated end of the spectrum, but we've found that we can use an LLM to help generate this payload for our users
The technical users want to learn more and generate their own payloads, and the non technical users prompt LLMs to help them generate the ultimate skyvern prompt to get going
This was very unexpected -- but a surprisingly logical chain of events.
Phase 1: build the thing the complex way (playwright) Phase 2: build the playwright thing with complex prompts (we are here right now) Phase 3: build the thing that builds the playwright thing with simpler prompts
Each phase lowers the technical bar to build your automations
2/ re: frequency of website changes
This IMO is a smaller value prop of LLM based automations. The biggest one is being able to handle highly dynamic situations. Consider the case where you're automating an e-commerce website where the popup offer changes every week. skyvern doesn't even notice those, but playwright scripts would break
Similarly, I love using the Geico example because it highlights something that was very difficult to automate before: The form changes every time you run it
Skyvern breezes through it.. but another case that was hard to automate before.
3/ data correctness
We're actually rolling out a workflows feature that allows you to chain multiple tasks together. The cool thing about this feature is that you can add steps in to have Skyvern self-validate it's own unless before continuing.
For example, you can add n products to cart, then navigate to the cart and validate the cart state
... As you can guess, this creates the foundation to have another agent go and use these tools to self-build workflows with simpler prompts
TL;DR -- we're on a pretty long journey to use LLMs to make BPA easier and easier, and this is just the first step

Workaccount2 3 days ago

Anyone building a start-up on 3rd party LLMs at this point has to have some big cajones. Or you need a smash-and-grab business model. Serious risk if your horizon is measured in years instead of months.

Anthropic threw their hat in this ring yesterday, and it will very likely be followed by OpenAI and Google soon. Godspeed.

timabdulla 2 days ago

Many companies (like Vercel, Supabase, and so on) have built big businesses "wrapping" AWS. They literally compete with AWS and use AWS to deliver their offering.
This is a big market. There are room for lots of approaches.
I'm sure OpenAI, Anthropic, and Google will make a big business of this, but I don't see how you can rule out anyone else having good ideas and relying on big infrastructure providers to make them a reality.
_HMCB_ 3 days ago

What do you mean they threw in their hat? I am not aware as to what happened.
- paladin314159 3 days ago
  
  https://www.anthropic.com/news/3-5-models-and-computer-use
  - _HMCB_ 3 days ago
    
    Thank you.

mmaunder 3 days ago

Congrats!!! And super cool that you've open sourced it under the AGPL. Sorry if this is answered in the docs but I did a brief search on the source and noticed you're not using LangChain but do plan to integrate it so it can be offered to that community. I'm curious if you wouldn't mind talking about what you did use to create the chain of thought/actions logic in Skyvern and if you had to start work today if you'd consider going the LangChain/Graph route? Thanks.

suchintan 3 days ago

We actually started off using the AutoGPT framework. There are a ton of remnants of that (tasks, steps) but we found the framework extremely limiting as we wanted to expand and do more complex things
For example, we're currently using a multi agent architecture where we have micro agents run to analyze SVGs, fill out dynamic autocompletes. This would have been really hard.
Frameworks like langchain are good for early prototyping, but it's too restricting when you want to push the limits

dboreham 3 days ago

In case anyone else is confused as to what "browser automations" is : this is about making a program that drives a target web site (owned by someone else typically), in the manner of selenium or the like --- inserting key press events and mouse move/click events, to make that target web site do something. Once you know that the rest of the description makes sense.

sirmarksalot 3 days ago

As with any of these LLM workflow automation tools, it raises a few questions about each potential use case, and the likely long-term outcomes.

1. Is this working around friction due to a lack of interoperability between tools? For example, is this something that would be more efficient if the owner of the website exposed a REST service? Will the existence of this tool disincentivize companies from exposing services when it makes sense?

2. If there is a good reason for the lack of a service endpoint, perhaps for security reasons, will your automation workflow be used to bypass those security measures? Could your tool be used by malicious actors to disable major services? Are you that malicious actor yourself? Will your tool be used by scalpers to prevent consumers from buying high-demand products?

3. If this is being used to work around deferred maintenance with internal tools and processes, will the existence of these kind of tools be used by management to justify further deferral of that maintenance? Will your tool become a critical piece of the support staff's workflow?

4. If your tool is being used in good faith to work around anti-patterns in website design, will the owner of the website be incentivized to break your workflow? Is your use case just a step in an arms race?

These are the thoughts that go through my head whenever I hear about software being laid on top of complicated processes, where instead of simplifying the underlying processes, we add another layer of complexity to sweep it under the rug. I'm sure that people will find your project useful, but I wonder what the longer-term effects will be.

suchintan 3 days ago

1. Yes absolutely. But the issue is a little bit more nuanced than that. Websites without APIs don't have them for one of two reasons: (1) They want to protect their data (LinkedIn) or (2) can't be bothered to make an API (boutique websites, government portals). This solves that problem, but also makes it so these websites never have to build an API (after LLM costs go down).
2. We don't want Skyvern to be used on websites that prohibit this kind of behaviour (LinkedIn is the obvious example). Specifically, we didn't open source any of our anti-bot or captcha related code because we get requests to make "Reddit upvote rings" and such. We don't want to support bad actors like that
(3) I think this is a net net good thing. AI browser automations= less need for APIs = no need to maintain both an API and UI = streamlined experience + less code = simpler systems
(4) I'm not 100% sure about this one. We usually just assume companies don't build APIs because they don't have budget for it. Ie for non malicious reasons. Companies like LinkedIn will likely thwart any attempts at automation, but we're not interested in participating in this cat mouse game
- rmbyrro 2 days ago
  
  > after LLM costs go down
  I think 100 Gb of GPU memory will always cost multiples of CPU + regular memory.
  Using LLMs and computer vision for these kinds of tasks only make sense in small scales. If the task is extensive and repeated frequently, you're better off using an LLM to generate a script using Selenium or whatever, then running that script almost for free (compared to LLM). O1 is very good at it, by the way. For the $0.10 of 1 page interaction charged by Skyvern, I can create several scripts using O1.

thedays 2 days ago

Is Skyvern able to scrape data from multiple websites with different structures and combine this data into structured data in one CSV or JSON file? Example: scrape interest rates offered on savings accounts from multiple bank websites and extract the name of the bank, bank logo, product name and interest rate for each account and run this saved query on a regular schedule (daily, weekly etc)?

suchintan 2 days ago

Yes -- in theory. You'd need to use our workflows feature to get that set up and chain a few tasks together to collect that information!

DennisSFO 3 days ago

Congrats on the launch. I'm curious if you had any experience running skyvern on airline websites (for example to extract award availability for miles tickets from point A to B)? It seems like airlines always change things around and have robust anti scraping measures.

suchintan 3 days ago

Great question. We haven't helped anyone with that exact use case yet, but we're in the middle of integrating with a company to help them automate purchasing flights with Alaska and Southwest (on the behalf of real people)
It's going to be our way of beta testing CC transaction and testing them for reliabilty

msp26 3 days ago

Awesome, I've been working on a similar thing at a smaller scale and I think this area is very promising.

I've limited my problem scope to single page interactions / scraping which has been very reliable and useful for my company. But agentic automation does sound fun.

suchintan 3 days ago

Yeah! We've seen this especially useful if you want to work in highly dynamic situations
Ex: filling out contact forms on hundreds of websites? It's really tough for normal code to be able to handle that cardinality. No problem for an AI agent
- msp26 3 days ago
  
  Just out of curiosity, what sort of challenges did you run into when scaling this up?
  I don't see a need for my current solution to go past a handful of browser instances but I'd imagine it might get crazy.
  - suchintan 3 days ago
    
    I made a LinkedIn post about it yesterday, but the funniest has been our customer DoSing our service by accident (sending 10K tasks per hour for 24h straight)
    Toughest was Skyvern accidentally talking to a support agent when the website said "your request failed, please contact support"
    https://www.linkedin.com/posts/suchintansingh_we-received-20...

sergeyk 3 days ago

Congrats! Do you have numbers on WebArena (https://webarena.dev) or VisualWebArena (https://jykoh.com/vwa)?

suchintan 3 days ago

Not yet! We haven't shared them publicly yet because our internal dataset is super biased. Keep your eyes peeled though! They'll be coming out in the next few weeks :)

modo_ 3 days ago

Congrats on the launch! This is really cool - one of the applications of LLM I find most compelling. I've seen so many back office processes that have hundreds of steps, are incredibly error prone, and traditionally couldn't be automated due to API limitations. Solutions like Skyvern are going to supercharge businesses that have had historically low margins due to the number of humans required. (Not as a replacement for a human, but as a force multiplier)

suchintan 3 days ago

The most fascinating part is how tough that work really is. Everyone we've talked to loathes the manual stuff, but until a better solution comes out, you have to allocate X% of your time to it

drewsonian 3 days ago

This is great, and I can think of several business uses and some personal.

Like this: Could I use this to pull screenshots or PDFs of my grocery receipts from a major grocery chain?

suchintan 3 days ago

Yes! We're helping a few companies with this right now. This use-case actually surprised me.
I never realized how important it is to track invoices in Europe (where VAT needs to be closely tracked), and a large % of vendors require you to log into their portal to fetch them

delusional 3 days ago

The plaintext version of your signup email replaces the ampersand in the url with an & XML entity. You probably don't want that.

suchintan 3 days ago

Interesting. We will fix it

jackb4040 3 days ago

> You won't be able to run Skyvern unless you enable at least one provider.

Any plans on bundling a local LLM / supporting local LLMs?

suchintan 3 days ago

We have an open issue for this right now -- we would LOVE some contributions here. The biggest problem until Llama 3.2 came out was that most (good) open source llms were text-only, and Skyvern needs vision to perform well
This isn't true anymore -- we just need to build and launch support for it
- socksy 2 days ago
  
  In theory to support ollama all you should need to do is be able to change the URL that would otherwise go to OpenAI, and select the model. The only gotcha is that the llama3.2 builds for ollama are currently text only — however they've just added support for arbitrary hugging face models so you're not limited by the officially supported models.

imp0cat 2 days ago

> how can I fill out a contact-us form on hundreds of different websites?

What's the use case here exactly? Sorry for being a bit pessimistic, but this sounds like an easy way to automatically send a lot of spam.

bluerooibos 2 days ago

Looks super interesting!

Unfortunately the mobile experience is pretty bad - practically unusable. I'd expect any web application made in the last decade to be mobile-first.

suchintan 2 days ago

Yep. This is totally fair feedback -- we're still a super early product and haven't had a chance to optimize the phone experience.. largely because it's tough to see the magic from the phone
We'll improve it soon!

ganeshkrishnan 3 days ago

awesome work. I had the github starred from the day I saw on Show HN but never got around to using it.

I want to use this to automate approving/declining group members for our facebook group which is approaching half million members and fb admin tools are pretty lacking

suchintan 3 days ago

Thank you for the star! We had someone talk about us the other day on r/localllama (https://www.reddit.com/r/LocalLLaMA/comments/1g9zhbd/if_your...) and I still couldn't believe that we ever got past 50 stars

BrandiATMuhkuh 3 days ago

Congratulations on the launch. This is really cool. I was recently tinkering with the same idea. But based on a browser extension.

There are many back office tasks where people copy data from page 1 into a form of page 2.

suchintan 3 days ago

Yeah we've been surprised by how many interesting things companies do in the background to keep them running
The craziest one we heard about was this government portal in India that was hard to automate because halfway through the portal you had to refresh the page a bunch of times to get a button to show up
- selimthegrim 3 days ago
  
  The railway ticket site?
  - suchintan 3 days ago
    
    It was a state level permit website I think. Very interesting!

ProofHouse 2 days ago

Cool but pricing is utterly insane

TZubiri 2 days ago

Sounds good.

Question, if it's computer vision based, does that mean that it can be trivially ported to support desktop automations?

andychert 3 days ago

Do I understand correctly that this is an open source of the GUI only, you don't show the model itself?

andychert 3 days ago

Or you don't have your own model, you use GPT-4V to determine the coordinates of where to click the bot?

shaburn 2 days ago

Would be great to have a fixed blockchain based event log, ideally encrypted.

infocollector 3 days ago

Quick question: What does DataDog's ddtrace do in the opensource version?

suchintan 3 days ago

Nothing -- we use DataDog for our cloud telemetry and haven't built a great way to separate dependencies between cloud and open source

rokhayakebe 3 days ago

Can I use this to make changes to a Wordpress website if given login?

suchintan 3 days ago

Depends on the scope of the changes. What did you have in mind?
- rokhayakebe 3 days ago
  
  Maybe add a new page or update a link.
  - biosboiii 2 days ago
    
    you can use the official API for that, right? without having to pay ChatGPT and click pixels.

drippingfist 3 days ago

This is very cool. Do you think I could use to do UX/UI testing?

suchintan 3 days ago

Give it a try! It's very capable of doing simple tasks like logging in and clicking around. You'll need to prompt assertions like "Complete if..." and "Terminate if..."

tdsone3 2 days ago

Has someone run this on modal.com yet?

PeterStuer 2 days ago

But will Cloudflare brick it?

Cheesman123 3 days ago

Congrats on the launch - love the tool

ji_zai 2 days ago

Congrats!! This is super neat. I've been looking for good ways to have AI browse the internet on my behalf - the way I normally do, and give me a presentation / summary of the highlights, so that I don't have to open myself up as much to social media and the chance for doomscrolling, etc.

I'm going to be playing with this.

bbor 2 days ago

Wait until the media gets wind of what the industries been doing this fall… a whole repo on using AI to autonomously use other people’s websites, and not a single paragraph on safety — for the websites or for us. Technically incredible ofc, and it’s a beautiful repo. I wish it didn’t make so anxious.

wintonzheng 2 days ago

Autonomous vehicles went through the same phases. The reliability part of autonomous agent has to become really really reliable first. The iterations in software is much faster than hardware though

mkrishnan 3 days ago

[dead]

mkrishnan 3 days ago

[dead]

bev-erage 3 days ago

[dead]