Show HN: PG-Capture – a better way to sync Postgres with Algolia (or Elastic)
pg-capture.onrender.com
Hello HN! Keeping Elasticsearch or Algolia in sync with Postgres is a pain when your data spans multiple tables. Traditional CDC tools like Debezium capture table-level events but don’t help when you need to index a full entity composed of related records. That's why I built PG-Capture.
PG-Capture lets you define a schema as complex as you need, aggregate table-level events, and reconstruct structured objects spanning multiple tables. It then publishes them as meaningful domain-level events to any queue (RabbitMQ, SQS, etc.).
PG-Capture does NOT replace Debezium; it works on top of it (or any other event source). This keeps it flexible and tool-agnostic.
I'd love to hear your thoughts! Does this solve a problem you've run into? Would you use something like this in your stack? Feedback is welcome!
Please don't forget to add a license to your GH repo as I doubt very seriously the "license" field in package.json is authoritative, and doubly so since I think that's one of the "yeah, yeah, just let me work" defaults https://github.com/Tggl/pg-capture/blob/main/package.json#L3...
Good catch, I just added an MIT license.
The package.json still says ISC.
Thanks for sharing! I don't use Algolia or Elastic, but I love the idea of tracking Postgres triggers, sending the events to a queue, and routing them wherever you need. Brilliant!!
Thanks! Syncing PG to another data store like Algolia or Elastic is just a very common use case that I use to pitch the idea. But Change-Data-Capture can be used for much more: emitting events when data changes, transforming data, caching data...
All of those use cases are really painful with raw table-level events.
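To make that concrete, here is a rough TypeScript sketch, purely illustrative and not PG-Capture's actual types or event format, of the gap between a raw table-level CDC event and the domain-level object you actually want downstream:

    // Hypothetical shapes for illustration only, not PG-Capture's real API.

    // What a table-level CDC tool (e.g. Debezium) emits: one event per row change.
    type TableLevelEvent = {
      table: string;                      // e.g. "comments"
      op: "insert" | "update" | "delete";
      row: Record<string, unknown>;       // just the changed row, no related records
    };

    // What you want downstream: the whole entity, reassembled across tables.
    type ArticleDomainEvent = {
      articleId: number;
      title: string;
      author: { id: number; name: string };       // joined from a "users" table
      comments: { id: number; body: string }[];   // joined from a "comments" table
    };

    // A change to a single "comments" row should surface as "article 42 changed",
    // carrying the full object, not as an isolated row-level event.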
What do you use for search then? Out of curiosity.
Today I use PG-Sync to index things in MeiliSearch and to populate cache in production.
Looks pretty promising, would be great to see some more documentation around pushing data out to elastic, etc.
Small nit, the GitHub link at the bottom of the page leads to docusaurus' GitHub, not the project. It took me more than one attempt to realise the project's gh was in the sidebar.
Documentation is indeed very early; the main focus for now was to pitch the idea correctly and help people understand what it does. I will then work on integrating multiple sources (Debezium, WAL-Listener...) and multiple destinations (Elastic, Redis...).
Good catch for the footer, thx!
Hi, congrats on the Show HN. Is it possible for you to provide some comparison of this tool against other existing tools that can sync Postgres to Elastic or other third-party data providers?
I tried to give a detailed comparison with PGSync in another comment. But in a nutshell, PG to Elastic is just one use-case for PG-Capture. The goal for PG-Capture is to be a schema-based Change-Data-Capture utility for Postgres that lets you integrate into your existing stack.
It has no opinion on what you should use to capture low-level events (Debezium, WAL-Listener...) and what you should do with the resulting high-level events (indexation, caching, event bus..).
I am pitching it as a PG to Elastic tool simply because it is a widespread use-case that everyone understands.
How does this compare to PGSync? (https://pgsync.com/)
I believe it is very comparable to PGSync in the way it works (schema-based CDC), the main differences are:
- PGSync is a full-stack solution while PG-Capture is "just" a library. PGSync will work out of the box while PG-Capture will require more setup, but you'll get more flexibility
- PGSync does not let you choose where you get your data from; it handles everything for you. PG-Capture lets you source events from Debezium, WAL-Listener, PG directly...
- PGSync is only meant to move data from PG to Elastic or OpenSearch. While this use-case is perfectly feasible with PG-Capture, you can use it for many more things: populating a cache, indexing into Algolia, sending events to an event bus...
All in all, the main difference is that PG-Capture is agnostic of the stack you want as input and output, allowing you to do pretty much anything while PGSync is focused on indexing data from PG to Elastic. I hope that clears things up!
This looks awesome; I was hoping for an easier way to use Debezium for event capture and will try it
Happy if it helps! Feel free to share your feedback here or on GitHub once you do!
There are many ways to do ETL (extract, transform, load). Probably if your transform function is a no-op, you are doing it wrong though (see below for why). Extract and load are the easy parts, generally.
Your database has cursors. You can use those to dump data efficiently. Dump it to a file. Put it somewhere. Use Kafka or something similarly fancy if you need to juice up your resume. But otherwise, files actually go a long way. Maybe put them in S3. Anyway, that's extract covered. If this code is in any way complicated, you are doing it wrong.
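As a minimal sketch of that extract step in TypeScript, assuming node-postgres with the pg-cursor package (query, batch size, and file name are made up): stream rows through a server-side cursor and append them to an NDJSON file, which you can then drop in S3 or wherever.

    import { createWriteStream } from "node:fs";
    import { Pool } from "pg";
    import Cursor from "pg-cursor";

    // Dump a query to newline-delimited JSON through a server-side cursor,
    // so the whole result set never sits in memory.
    async function extractToFile(sql: string, outPath: string) {
      const pool = new Pool(); // connection settings come from PG* env vars
      const client = await pool.connect();
      const out = createWriteStream(outPath);
      try {
        const cursor = client.query(new Cursor(sql));
        while (true) {
          const rows = await cursor.read(1000); // fetch in chunks of 1000 rows
          if (rows.length === 0) break;
          for (const row of rows) out.write(JSON.stringify(row) + "\n");
        }
        await cursor.close();
      } finally {
        out.end();
        client.release();
        await pool.end();
      }
    }

    // Extract the raw rows only; all the clever stuff happens in the transform step.
    extractToFile("SELECT * FROM articles", "articles.ndjson").catch(console.error);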
Now process that stuff item by item, line by line. Chunk it up. Use a nice framework to make this concurrent and fast if you must. There are loads of options for this. That's where your transform logic lives. This is where you do all the clever stuff you need to do to make sure querying is fast. This bit can be expensive and complex. This is why you don't want to run this on your database server while it is serving traffic, typically. Bad idea.
The output of transform can be another file.
Then load it into wherever it needs to go. This code too should be dead simple. Use batch/bulk inserts. Whatever works fast. I've seen systems load data by the GB per second. That only works if it does absolutely nothing smart whatsoever.
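A sketch of a deliberately dumb load step in TypeScript, assuming the transformed file is NDJSON and the target is Elasticsearch's _bulk HTTP endpoint (URL, index name, and batch size are placeholders): read lines, ship batches, nothing smart.

    import { createReadStream } from "node:fs";
    import { createInterface } from "node:readline";

    const ES_URL = "http://localhost:9200"; // placeholder
    const INDEX = "articles";               // placeholder
    const BATCH_SIZE = 1000;

    // Ship one batch to Elasticsearch's _bulk endpoint as NDJSON.
    async function sendBatch(docs: string[]) {
      const body =
        docs.map((doc) => `${JSON.stringify({ index: { _index: INDEX } })}\n${doc}`).join("\n") + "\n";
      const res = await fetch(`${ES_URL}/_bulk`, {
        method: "POST",
        headers: { "Content-Type": "application/x-ndjson" },
        body,
      });
      if (!res.ok) throw new Error(`bulk load failed: ${res.status}`);
    }

    // Stream the transformed NDJSON file and load it in batches.
    async function load(path: string) {
      const lines = createInterface({ input: createReadStream(path) });
      let batch: string[] = [];
      for await (const line of lines) {
        if (!line.trim()) continue;
        batch.push(line);
        if (batch.length >= BATCH_SIZE) {
          await sendBatch(batch);
          batch = [];
        }
      }
      if (batch.length > 0) await sendBatch(batch);
    }

    load("articles.transformed.ndjson").catch(console.error);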
Here's the key advice: don't do your ETL in one function. Separate those concerns from day 1. Long term they are probably not going to run on the same hardware.
Data transformations are what will make the difference. Indexing exactly what you store in your database is rarely optimal.
Things you can do during data transformation (see the sketch after this list):
- merge in other data from elsewhere to make it easier to search on that data.
- calculate expensive things that are hard to calculate at query time, things like page rank, quality scores, embeddings, etc.
- denormalize things from various database tables into your search index (most search engines don't really do joins, storage is cheap)
- filter out stuff you don't actually search for
- etc.
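Here is the promised sketch (field names invented for illustration): the transform is just a pure function from extracted rows to index documents, which is exactly what makes it cheap to parallelize, test, and change.

    // Hypothetical input/output shapes, for illustration only.
    type ExtractedArticle = {
      id: number;
      title: string;
      body: string;
      author_name: string;        // merged in from elsewhere
      deleted_at: string | null;
    };

    type IndexDoc = {
      id: number;
      title: string;
      author: string;             // denormalized into the search index
      word_count: number;         // derived field computed once, not at query time
    };

    // Pure function: no database, no network, trivially chunkable and concurrent.
    function transform(rows: ExtractedArticle[]): IndexDoc[] {
      return rows
        .filter((r) => r.deleted_at === null)       // filter out stuff you don't search for
        .map((r) => ({
          id: r.id,
          title: r.title,
          author: r.author_name,
          word_count: r.body.split(/\s+/).length,   // "expensive" work done at index time
        }));
    }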
The reason ETL is important is that things change. Your data model might change (new fields, tables, whatever). Your transformation logic might change (new features, bug fixes, etc.). Your business logic might change. Etc. When stuff changes, you probably should reindex all of your data. And your ETL pipeline needs to be ready for that. Recreating indices from scratch needs to be a completely routine thing. Quick and easy. If it's not, you are never going to do it and it's always going to be inconvenient. It will block all progress in your team.
Your ETL needs to have two modes: incremental and full reindex.
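A sketch of what that can look like at the top level (the interface and function names are made up): both modes share the same transform and load code and differ only in where the rows come from and which index they write to.

    // Hypothetical pipeline interface: E, T and L stay separate pieces.
    interface Pipeline {
      extractByIds(ids: number[]): Promise<unknown[]>;
      extractAll(): AsyncIterable<unknown[]>;
      transform(rows: unknown[]): unknown[];
      load(docs: unknown[], index: string): Promise<void>;
    }

    // Incremental: driven by change events, touches only what changed.
    async function incrementalSync(p: Pipeline, index: string, changedIds: number[]) {
      const rows = await p.extractByIds(changedIds);
      await p.load(p.transform(rows), index);
    }

    // Full reindex: rebuild a fresh index from scratch, then switch over (e.g. via an alias).
    async function fullReindex(p: Pipeline, freshIndex: string) {
      for await (const rows of p.extractAll()) {
        await p.load(p.transform(rows), freshIndex);
      }
      // ...point the live alias at freshIndex once it is complete
    }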
I consult clients on this stuff; well over 90% of my clients do this wrong and then get blocked on not being able to evolve their indexing strategy, do rapid iteration, or experiment with how their search works.
I had a client recently that was complaining it took 24 hours to rebuild their index. And it wasn't because they had a lot of data, but because they were doing a lot of work against their production database: their ETL strategy was tightly coupled with it. We're talking stored procedures here. Probably seemed like a good idea at the time. Their ETL speed was limited by their database speed, while it was serving traffic.
A few potential issues with the approach suggested:
- it tackles incremental but not full reindex, that's a mistake
- it sounds like it combines Extract and Transform into one step; I wouldn't do that.
- Probably doing a lot of joins is going to complicate things; do you even want to be doing that on your production cluster? Especially when doing a full reindex. Do all those tables change all the time? This blurs the line between Extract and Transform and lets the database do a lot of the work.
Not saying that it's all wrong but it probably isn't optimal and might become a problem if you scale.
Thanks for the extremely detailed feedback. I'll try to address your (very valid) concerns:
- I don't know if you had a look at the "How does it work?" page; there I try to explain, using sequence diagrams, how the process is split in two: first aggregating events into root IDs, then building the final objects from those root IDs.
- Each of those two steps hits the DB, but (i) it should not be the production database, rather a read-only replica, and (ii) the two queries are independent and can be run separately. So instead of rebuilding extraction from scratch, I decided to rely on already existing replication strategies, which in essence do exactly what you suggest.
- This library is not at all concerned about transformation, this step should indeed be separated. In our production environment, we transform the high-level events that PG-Capture sends with an async worker that does not hit the DB at all, it just transforms the data it receives.
- I agree that you should not index directly what is in the DB, which is why you should transform the data as I suggest in my previous point. But that data has to come from somewhere, and PG-Capture aims at making that part of the process smooth and robust.
- Regarding full reindexation, it is actually pretty straightforward: push all the IDs of your root table into the store (this can be streamed), and your consumer will build the objects and publish high-level events as usual. The good part is that the consumer does not run one query per object; it can build a lot of objects at once with a single query.
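To illustrate that last point in TypeScript (this is not PG-Capture's API, just the idea): a full reindex boils down to streaming every root ID into whatever store or queue the existing consumer already reads from. The table name and the pushIds callback are placeholders.

    import { Pool } from "pg";
    import Cursor from "pg-cursor";

    // Stream every root ID into the store the consumer already polls.
    // `pushIds` stands in for the actual store (Redis set, SQS queue, table...).
    async function fullReindex(
      rootTable: string,                              // trusted config value, not user input
      pushIds: (ids: number[]) => Promise<void>,
    ) {
      const pool = new Pool();
      const client = await pool.connect();
      try {
        const cursor = client.query(new Cursor(`SELECT id FROM ${rootTable}`));
        while (true) {
          const rows: { id: number }[] = await cursor.read(5000);
          if (rows.length === 0) break;
          await pushIds(rows.map((r) => r.id));       // consumer rebuilds objects as usual
        }
        await cursor.close();
      } finally {
        client.release();
        await pool.end();
      }
    }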
We have been using PG-Capture in production for half a year so far, but we are not yet at the scale of a few GB per second.
Eager to hear your feedback on those points.
Does this work out of the box with Prisma?
It does if you are using PG (other SQL databases will be added later). Under the hood, PG-Capture listens to raw Postgres events, so it does not matter whether the data was updated via an ORM (like Prisma), raw SQL, or even a developer's IDE...