edelbitter 4 days ago

Why does the title say "Zero Trust", when the article explains that this only works as long as every involved component of the Cloudflare MitM keylogger and its CA can be trusted? If host keys are worthless because you do not know in advance what key the proxy will have... then this scheme is back to trusting servers merely because they are in Cloudflare address space, no?

  • hedora 4 days ago

    Every zero trust architecture ends up trusting an unbounded set of machines. Like most marketing terms, it’s probably easier to assume it does the inverse of what it claims.

    My mental model:

    With 1-trust (the default), any trusted machine with credentials is provided access and therefore gets one unit of access. With 2-trust, we'd need at least two units of trust, so two machines. Equivalently, each credential-bearing machine is half trusted (think ssh bastion hosts or 2FA / mobikeys for 2-trust).

    This generalizes to 1/N, so for zero trust, we place 1/0 = infinite units of trust in every machine that has a credential. In other words, if we provision any one machine for access, we necessarily provision an unbounded number of other machines for the same level of access.

    As snarky as this math is, I’ve yet to see a more accurate formulation of what zero trust architectures actually provide.

    YMMV.

    • choeger 3 days ago

      I think your model is absolutely right. But there's a catch: Zero Trust (TM) is about not giving any machine any particular kind of access. So it's an infinite number of machines with zero access.

      The point of Zero Trust (TM) is to authenticate and authorize the human being behind the machine, not the machine itself.

      (Clearly, that doesn't work for all kinds of automated access, and it comes with a lot of questions in terms of implementation details (e.g., do we trust the 2FA device?), but that's the gist.)

    • glitchc 3 days ago

      That's not the intention of zero-trust. As others have said, it's about authenticating the user and associated privilege, not the machine itself. Simply put, zero trust means machines on the intranet must undergo a user-centric authentication and authorization step prior to accessing any resource. Additionally, once authenticated, a distinct secure channel can be established between the specific endpoint and the resource that cannot be observed or manipulated by others on the same network.

    • EthanHeilman 3 days ago

      In my view the eventual goal of security is to reduce all excess trust to zero. Excess trust is all trust which is not fundamental to the thing you are trying to do. If you want a feature that lets Alice update policy, you need to trust Alice to update policy. I believe that a system without any excess trust is worth building; that's why I founded BastionZero and why I joined Cloudflare to work on this.

      Getting there is a long walk through the woods on a moonless night.

      > With 2-trust, we'd need at least two units of trust, so two machines. Equivalently, each credential-bearing machine is half trusted (think ssh bastion hosts or 2FA / mobikeys for 2-trust).

      You might be interested in OpenPubkey [0, 1], which was developed at BastionZero. It has 1/2 trust for OpenID Connect and can be used for SSH.

      > As snarky as this math is, I’ve yet to see a more accurate formulation of what zero trust architectures actually provide.

      I prefer the term epsilon-trust to reflect the nature of security and trust reduction as an iterative process. The trust in a system approaches but never fully reaches zero.

      [0]: OpenPubkey: Augmenting OpenID Connect with User held Signing Keys https://eprint.iacr.org/2023/296

      [1]: https://github.com/openpubkey/openpubkey/

  • varenc 4 days ago

    https://www.cloudflare.com/learning/security/glossary/what-i...

    Zero Trust just means you stop inherently trusting your private network and verify every user/device/request regardless. If you opt in to using Cloudflare to do this then it requires running Cloudflare software.

    • PLG88 4 days ago

      That's one interpretation... ZT also posits assuming the network is compromised and hostile, and that also applies to CF and their cloud/network. It blows my mind that so many solutions claim ZT while mandating TLS to their infra/cloud, trusting them to decrypt your data, and, worst IMHO, letting them MITM your OIDC/SAML key so the endpoint can authenticate and access services... that is a hell of a lot of implicit trust in them, not least the possibility of them being served a court order to decrypt your data.

      Zero trust done correctly does not have those same drawbacks.

      • sshine 4 days ago

        One element is buzzword inflation, and another is raising the bar.

        On the one hand, entirely trusting Cloudflare isn't really zero trust.

        On the other hand, not trusting any network is one narrow definition.

        I'll give you SSH keys when you pry them from my cold, dead FDE SSDs.

    • michaelt 3 days ago

      Zero Trust means you stop trusting your private network, and start trusting Cloudflare, and installing their special root certificate so they can MITM all your web traffic. To keep you safe.

      • gobip 3 days ago

        Same thing with their "serverless" servers where you host everything there.

    • bdd8f1df777b 4 days ago

      But with public key auth I'm already distrusting everyone on my private network.

      • resoluteteeth 4 days ago

        Technically I guess that's "zero trust" in the sense of meeting the requirement of not trusting internal connections more than external ones, but in practice I guess "zero trust" also typically entails making every connection go through the same user-based authentication system, which uploading specific keys to specific servers manually definitely doesn't achieve.

  • fs111 3 days ago

    Zero Trust is a marketing label that executives can seek out and buy a thing for, because it is a super-hot thing to have these days. That's mostly it.

  • ozim 3 days ago

    “Zero Trust” means not assuming a user has access or is somehow trusted just because they are in a trusted context. So you always check the user's access rights.

    TLS having a trusted CA cert publisher is not what “Zero Trust” is about.

    • quectophoton 3 days ago

      Question. Not specifically for you, but related to this comment.

      Would this mean that a PostgreSQL instance listening on localhost and always asking for a user and password is considered Zero Trust, but peer authentication is not?

      This part is always a bit confusing for me because there's already been authentication (OS login) creating a session for a specific user (OS user) accessing a specific service (through a unix domain socket) with the specific connection being validated (the unix domain socket permissions).

      And from my limited knowledge, the OS login looks like an IdP (Identity Provider), the OS session looks like a JWT already validated by a middleware (the OS vs some API Gateway), connecting to a service using this "token" (OS session vs JWT), and only allowing access to this specific connection (the connection to the socket) if the token is valid (OS session has permissions vs JWT has good signature) and has permissions to the application itself (PostgreSQL checking the connecting user has access to this resource vs the application checking the connecting user has access to this resource).

      So I can see this as Zero Trust because the pattern is kinda matching ("the letter"), but also as Not Zero Trust because I feel like this would still be considered a "trusted context" by what the term tries to convey ("the spirit").
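
      For concreteness, the two setups I'm contrasting would be roughly these pg_hba.conf entries (a sketch; the "all all" wildcards are just illustrative):

        # unix-domain socket: trust the already-authenticated OS user (peer auth)
        local   all   all                 peer
        # localhost TCP: always ask for a password
        host    all   all   127.0.0.1/32  scram-sha-256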

  • znpy 3 days ago

    > Why does the title say "Zero Trust", when the article explains that this only works as long as every involved component of the Cloudflare MitM keylogger and its CA can be trusted?

    The truthiness of "zero trust" really depends on who's trusting who.

  • pjc50 3 days ago

    > Cloudflare MitM keylogger

    Would you like to explain what you mean by this?

    • jdbernard 3 days ago

      Different responder, but I imagine they are referring to CloudFlare's stated ability to:

      Provide command logs and session recordings to allow administrators to audit and replay their developers’ interactions with the organization’s infrastructure.

      The only way they can do this is if they record and store the session text, effectively a keylogger between you and the machine you are SSH'ing into.

      • acdha 3 days ago

        Keylogger has a specific meaning which doesn’t refer to audit logging. Trying to scare people by misusing loaded terms has the opposite effect from what you intend.

        • be_erik 3 days ago

          Keyloggers are absolutely used for audit logging. I've implemented these MiTM patterns specifically so we could log all keystrokes. The addition of a keylogger is only an issue if you don't trust Cloudflare, but usually a checklist item for these kinds of bastion hosts in certain compliance environments.

          • acdha 3 days ago

            Yes, but it's not a man-in-the-middle attack when it's monitoring your own servers, any more than it's a privacy breach when HR looks at your file. My intent was simply that trying to make things sound scary by using language normally used in adversarial contexts really isn't helpful when talking about things companies need to do. There isn't an expectation of privacy for what you do on company servers.

        • michaelt 3 days ago

          "mitm keylogger" has a specific meaning and refers to a party in the middle of a connection, logging the keystrokes.

          • acdha 3 days ago

            Both terms are used to refer to attacks, not oversight of your own systems.

            • hiatus 3 days ago
              • acdha 3 days ago

                I’ll concede that keylogger is sometimes used in a corporate workstation monitoring context but it isn’t really the same as session monitoring on servers. The main thrust of my comment was simply that using loaded language to make common needs sound scary is distracting from rather than helping matters.

                • jdbernard 3 days ago

                  I think the original poster's intention was to be somewhat inflammatory as a way to draw attention to the very high level of trust you are granting to CloudFlare in this model. You are effectively giving them whatever privileges you yourself have on those boxes.

                  Of course, CloudFlare is making it their business to be, and to convince others that they are, that trusted third party.

    • debarshri 3 days ago

      In privileged access management platforms (including ours [1]), every operation that a user does is multiplexed via stdout/stdin and captured for auditing. This is a compliance requirement for SOX, PCI, etc.

      [1] https://adaptive.dev

tptacek 4 days ago

I'm a fan of SSH certificates and cannot understand why anyone would set up certificate authentication with an external third-party CA. When I'm selling people on SSH CA's, the first thing I usually have to convince them of is that I'm not saying they should trust some third party. You know where all your servers are. External CAs exist to solve the counterparty introduction problem, which is a problem SSH servers do not have.
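
For anyone who hasn't tried it, the self-hosted version is roughly this (a sketch; names, principals, and validity are illustrative):

    # one-time: create the CA keypair (keep the private half off the fleet)
    ssh-keygen -t ed25519 -f user_ca -C "internal user CA"

    # per user: sign their existing public key with the allowed login names
    # ("principals") and an expiry; this emits id_ed25519-cert.pub
    ssh-keygen -s user_ca -I alice@example.com -n alice,deploy -V +8h ~/.ssh/id_ed25519.pub

    # on each server: trust certs from the CA instead of per-user authorized_keys
    echo "TrustedUserCAKeys /etc/ssh/user_ca.pub" >> /etc/ssh/sshd_config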

  • michaelt 3 days ago

    > I'm a fan of SSH certificates and cannot understand why anyone would set up certificate authentication with an external third-party CA.

    I think the sales pitch for these sorts of service is: "Get an SSH-like experience, but it integrates with your corporate single-sign-on system, has activity logs that can't be deleted even if you're root on the target, sorts out your every-ephemeral-cloud-instance-has-a-different-fingerprint issues, and we'll sort out all the reverse-tunnelling-through-NAT and bastion-host-for-virtual-private-cloud stuff too"

    Big businesses pursuing SOC2 compliance love this sort of thing.

  • kevin_nisbet 3 days ago

    I'm with you. I imagine it's mostly people just drawing parallels: they can figure out how to get a web certificate, so they think SSH is the same thing.

    The second-order problem I've found is that when you dig in, there are plenty of people who ask for certs but, when push comes to shove, really want functionality where, when a user's access is cancelled, all active sessions get torn down immediately as well.

  • xyst 4 days ago

    Same reasons companies are still buying “CrowdStrike” and installing that crapware. It's all for regulatory checkboxes (i.e., FedRAMP certification).

    • tptacek 4 days ago

      I do not believe you in fact need any kind of SSH CA, let alone one run by a third party, to be FedRAMP-compliant.

mdaniel 4 days ago

I really enjoyed my time with Vault's ssh-ca (back when it had a sane license) but have now grown up and believe that any ssh access is an antipattern. For context, I'm also one of those "immutable OS or GTFO" chaps, because in my experience the next thing that happens after some rando ssh-es into a machine is they launch vi or apt-get or whatever, and now it's a snowflake with zero auditing of the actions taken on it

I don't mean to detract from this, because short-lived creds are always better, but for my money I hope I never have sshd running on any machine again

  • akira2501 3 days ago

    > any ssh access is an antipattern.

    Not generally. In one particular class of deployments, allowing ssh access to root-enabled accounts without auditing may be an antipattern... but this is an exceptionally narrow definition.

    > I hope I never have sshd running on any machine again

    Sounds great for production and ridiculous for development and testing.

    • mdaniel 3 days ago

      > Sounds great for production and ridiculous for development and testing.

      I believe that "practice how you're going to play" to get devs into the habit of not using a crutch to treat deployments like they are their local machine. The time to anticipate failures is in the thinking time, not "throw it over the wall and we'll think later"

      • akira2501 3 days ago

        Your devs should not be managing deployments. Making deployable software and actually worrying about the deployment environment don't need particularly tight coordination. I'd also worry that you're overfitting your software to whatever third party deployment target you've selected.

  • advael 4 days ago

    Principle of least privilege trivially prevents updating system packages. Like if you don't want people using apt, don't give people root on your servers?

  • blueflow 3 days ago

    Even for immutable OSes, SSH is a great protocol for bidirectionally authenticated data / file transfer.

  • ashconnor 4 days ago

    You can audit if you put something like hoop.dev, Tailscale, Teleport or Boundary in between the client and server.

    Disclaimer: I work at Hashicorp.

    • LtWorf 3 days ago

      But I avoid Hashicorp stuff whenever I can!

  • ozim 4 days ago

    How do you handle the DB?

    Stuff I work on is write heavy so spawning dozens of app copies doesn’t make sense if I just hog the db with Erie locks.

    • mdaniel 4 days ago

      I must resist the urge to write "users can access the DB via the APIs in front of it" :-D

      But, seriously, Teleport (back before they did a licensing rug-pull) is great at that and no SSH required. I'm super positive there are a bazillion other "don't use ssh as a poor person's VPN" solutions

      • zavec 4 days ago

        This led me to google "teleport license," which sounds like a search from a much more interesting world.

        • aspenmayer 4 days ago

          You might be interested in Peter F. Hamilton's Commonwealth Saga sci-fi series, then.

          Among other tech, it involves the founding of a megacorp that exploits the discovery and monopolization of wormhole technology for profit, causing a rift between the two founders, who each remind me of Steve Jobs and Steve Wozniak in their cooperation and divergence.

          https://en.wikipedia.org/wiki/Commonwealth_Saga

          • Hikikomori 3 days ago

            Yo, dudes, how’s it hanging?

            • aspenmayer 3 days ago

              Is this a reference to the books? It's been a while since I read them.

              • Hikikomori 3 days ago

                Its what Ozzie or Nigel say over the radio after they landed.

                • aspenmayer 3 days ago

                  Ah yeah, that's a great scene! The bravado and hubris of gatecrashing an interplanetary livestream to launch your startup out of stealth is just chef's kiss.

  • namxam 4 days ago

    But what is the alternative?

    • mdaniel 4 days ago

      There's not one answer to your question, but here's mine: kubelet and AWS SSM (which, to the best of my knowledge, will work on non-AWS infra; it just needs to be provided creds). Bottlerocket <https://github.com/bottlerocket-os/bottlerocket#setup> comes batteries included with both of those things, and is cheaply provisioned with (ahem) TOML user-data <https://github.com/bottlerocket-os/bottlerocket#description-...>

      In that specific case, one can also have "systemd for normal people" via its support for static Pod definitions, so one can run containerized toys on boot even without being a formal member of a kubernetes cluster

      AWS SSM provides auditing of what a person might normally type via ssh, and kubelet similarly, just at a different abstraction level. For clarity, I am aware that it's possible via some sshd trickery one could get similar audit and log egress, but I haven't seen one of those in practice whereas kubelet and AWS SSM provide it out of the box
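
      For reference, the SSM path to an interactive shell is just the one command below (the instance ID is a placeholder); session activity can be shipped to CloudWatch Logs or S3 for the audit trail:

        aws ssm start-session --target i-0123456789abcdef0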

    • ndndjdueej 4 days ago

      IaC, send out logs to Splunk, health checks, slow rollouts, feature flags etc?

      Allow SSH in non prod environments and reproduce issue there?

      In prod you are aiming for "not broken" rather than "do whatever I want as admin".

    • candiddevmike 4 days ago

      I built a config management tool, Etcha, that uses short lived JWTs. I extended it to offer a full shell over HTTP using JWTs:

      https://etcha.dev/docs/guides/shell-access/

      It works well and I can "expose" servers using reverse proxies since the entire shell session is over HTTP using SSE.

      • artificialLimbs 4 days ago

        I don’t understand why this is more secure than limiting SSH to local network only and doing ‘normal’ ssh hardening.

        • candiddevmike 3 days ago

          None of that is required here? Etcha can be exposed on the Internet with a smaller risk profile than SSH:

          - Sane, secure defaults

          - HTTP-based--no fingerprinting, requires the correct path (which can be another secret), plays nicely with reverse proxies and forwarders (no need for jump boxes)

          - Rate limited by default

          - Only works with PKI auth

          - Clients verify/validate HTTPS certificates, no need for SSHFP records.

      • g-b-r 4 days ago

        “All JWTs are sent with low expirations (5 seconds) to limit replayability”

        Do you know how many times a few packets can be replayed in 5 seconds?

        • candiddevmike 4 days ago

          Sure, but this is all happening over HTTPS (Etcha only listens on HTTPS), it's just an added form of protection/expiration.

  • riddley 4 days ago

    How do you troubleshoot?

    • bigiain 4 days ago

      I think ssh-ing into production is a sign of not fully mature devops practices.

      We are still stuck there, but we're striving to get to the place where we can turn off sshd on Prod and rely on the CI/CD pipeline to blow away and reprovision instances, and be 100% confident we can test and troubleshoot in dev and stage and by looking at off-instance logs from Prod.

      How important it is to get there is something I ponder about my motivations for - it's clearly not worthwhile if your project is one or 2 prod servers perhaps running something like HA WordPress, but it's obvious that at Netflix type scale nobody is sshing into individual instances to troubleshoot. We are a long way (a long long long long way) from Netflix scale, and are unlikely to ever get there. But somewhere between dozens and hundreds of instances is about where I reckon the work required to get close to there starts paying off.

      • xorcist 3 days ago

        > at Netflix type scale that nobody is sshing into individual instances to troubleshoot

        Have you worked at Netflix?

        I haven't, but I have worked with large scale operations, and I wouldn't hesitate to say that the ability to ssh (or other ways to run commands remotely, which are all either built on ssh or likely not as secure and well tested) is absolutely crucial to running at scale.

        The more complex and heterogeneous environments you have, the more likely you are to encounter strange flukes. Handshakes that only fail a fraction of a percent of the time, and so on. Multiple products and providers interacting. Tools like tcpdump and eBPF become essential.

        Why would you want to deploy on a mature operating system such as Linux and not use tools such as eBPF? I know the modern way is just to yolo it and restart stuff that crashes, but as a startup or small scale you have other things to worry about. When you are at scale you really want to understand your performance profile and iron out all the kinks.

        • Hikikomori 3 days ago

          Can also use stuff like Datadog NPM/APM, which uses eBPF to pick up most of what you need. It's been a long time since I've needed anything else.

      • imiric 4 days ago

        Right. The answer is having systems that are resilient to failure and, if they do fail, being able to quickly replace any node, hopefully automatically, along with solid observability to give you insight into what failed and how to fix it. The process of logging into a machine to troubleshoot it in real time while the system is on fire is so antiquated, not to mention stressful. On-call shouldn't really be a major part of our industry. Systems should be self-healing, and troubleshooting done during working hours.

        Achieving this is difficult, but we have the tools to do it. The hurdles are often organizational rather than technical.

        • bigiain 4 days ago

          > The hurdles are often organizational rather than technical.

          Yeah. And in my opinion "organizational" reasons can (and should) include "we are just not at the scale where achieving that makes sense".

          If you have single digit numbers of machines, the whole solid observability/ automated node replacement/self-healing setup overhead is unlikely to pay off. Especially if the SLAs don't require 2am weekend hair-on-fire platform recovery. For a _lot_ things, you can almost completely avoid on-call incidents with straightforward redundant (over provisioned) HA architectures, no single points of failure, and sensible office hours only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).

          Scrappy startups, and web/mobile platforms for anything where a few hours of downtime is not going to be an existential threat to the money flow or a big story in the tech press - probably have more important things to be doing than setting up log aggregation and request tracing. Work towards that, sure, but probably prioritise the dev productivity parts first. Get your CI/CD pipeline rock solid. Get some decent monitoring of the redundant components of your HA setup (as well as the Prod load balancer monitoring) so you know when you're degraded but not down (giving you some breathing space to troubleshoot).

          And aspire to fully resilient systems and have a plan for what they might look like in the future to avoid painting yourself into a corner that makes it harder then necessary to get there one day.

          But if you've got a guy spending 6 months setting up chaos monkey and chaos doctor for your WordPress site that's only getting a few thousand visits a day, you're definitely doing it wrong. Five nines are expensive. If your users are gonna be "happy enough" with three nines or even two nines, you've probably got way better things to do with that budget.

          • Aeolun 4 days ago

            > For a _lot_ things, you can almost completely avoid on-call incidents with straightforward redundant (over provisioned) HA architectures, no single points of failure, and sensible office hours only deployment rules (and never _ever_ deploy to Prod on a Friday afternoon).

            For a lot of things the lack of complexity inherent in a single VPS server will mean you have better availability than any of those bizarrely complex autoscaling/recovery setups

          • imiric 3 days ago

            I'm not so sure about all of that.

            The thing is that all companies regardless of their scale would benefit from these good practices. Scrappy startups definitely have more important things to do than maintaining their infra, whether that involves setting up observability and automation or manually troubleshooting and deploying. Both involve resources and trade-offs, but one of them eventually leads to a reduction of required resources and stability/reliability improvements, while the other leads to a hole of technical debt that is difficult to get out of if you ever want to improve stability/reliability.

            What I find more harmful is the prevailing notion that "complexity" must be avoided at smaller scales, and that somehow copying a binary to a single VPS is the correct way to deploy at this stage. You see this in the sibling comment from Aeolun here.

            The reality is that doing all of this right is an inherently complex problem. There's no getting around that. It's true that at smaller scales some of these practices can be ignored, and determining which is a skill on its own. But what usually happens is that companies build their own hodgepodge solutions to these problems as they run into them, which accumulate over time, and they end up having to maintain their Rube Goldberg machines in perpetuity because of sunk costs. This means that they never achieve the benefits they would have had they just adopted good practices and tooling from the start.

            I'm not saying that starting with k8s and such is always a good idea, especially if the company is not well established yet, but we have tools and services nowadays that handle these problems for us. Shunning cloud providers, containers, k8s, or any other technology out of an irrational fear of complexity is more harmful than beneficial.

        • LtWorf 3 days ago

          If you don't know why they failed, replacing them is pointless.

      • naikrovek 3 days ago

        > I think ssh-ing into production is a sign of not fully mature devops practices.

        that's great and completely correct when you are one of the very few places in the universe where everything is fully mature and stable. the rest of us work on software. :)

      • otabdeveloper4 3 days ago

        A whole lot of words to say "we don't troubleshoot and just live with bugs, #yolo".

      • sleepydog 3 days ago

        It's a good mindset to have, but I think ssh access should still be available as a last resort on prod systems, and perhaps trigger some sort of postmortem process, with steps to detect the problem without ssh in the future. There is always going to be a bug, that you cannot reproduce outside of prod, that you cannot diagnose with just a core dump, and that is a show stopper. It's one thing to ignore a minor performance degradation, but if the problem corrupts your state you cannot ignore it.

        Moreover, if you are in the cloud, part of your infrastructure is not under your control, making it even harder to reproduce a problem.

        I've worked with companies at Netflix's scale and they still have last-resort ssh access to their systems.

    • mdaniel 4 days ago

      In my world, if a developer needs access to the Node upon which their app is deployed to troubleshoot, that's 100% a bug in their application. I am cognizant that being whole-hog on 12 Factor apps is a journey, but for my money get on the train, because "let me just ssh in and edit this one config file" is the road to ruin when no one knows who edited what to set it to what new value. Running $(kubectl edit) allows $(kubectl rollout undo) to put it back, and also shows what was changed from what to what.
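
      A rough sketch of that flow, with a hypothetical deployment name:

        # the change goes through the API server (and its audit log) rather than a shell on the node
        kubectl edit deployment/myapp

        # every revision is recorded and trivially revertable
        kubectl rollout history deployment/myapp
        kubectl rollout undo deployment/myapp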

      • megous 3 days ago

        Your world is very narrow and limited. Some devs also have to deal with customer-provisioned HW infrastructure; with buggy interactions between HW/virtualization solutions that every 5 minutes duplicate all packets for a few seconds; with applications that interact with customer-only onsite HW that you only have remote access to via the production deployment; with quirky virtualization like vmware stopping the vCPU on you for hundreds of ms if you load it too much, which you'll not replicate locally; with things you can't predict you'll need to observe ahead of time, etc. And it does not involve editing any configs. It's just troubleshooting.

      • yjftsjthsd-h 4 days ago

        How do you debug the worker itself?

        • mdaniel 4 days ago

          Separate from my sibling comment about AWS SSM, I also believe that if one cannot know that a Node is sick by the metrics or log egress from it, that's a deployment bug. I'm firmly in the "Cattle" camp, and am getting closer and closer to the "Reverse Uptime" camp - made easier by ASG's newfound "Instance Lifespan" setting to make it basically one-click to get onboard that train

          Even as I type all these answers out, I'm super cognizant that there's not one hammer for all nails, and I am for sure guilty of yanking Nodes out of the ASG in order to figure out what the hell has gone wrong with them, but I try very very hard not to place my Nodes in a precarious situation to begin with so that such extreme troubleshooting becomes a minor severity incident and not Situation Normal

          • __turbobrew__ 4 days ago

            If accidentally nuking a single node while debugging causes issues you have bigger problems. Especially if you are running kubernetes any node should be able to fall off the earth at any time without issues.

            I agree that you should set a maximum lifetime for a node on the order of a few weeks.

            I also agree that you shouldn’t be giving randos access to production infra, but and the end of the day there needs to be some people at the company who have the keys to the kingdom because you don’t know what you don’t know and you need to be able to deal with unexpected faults or outages of the telemetry and logging systems.

            I once bootstrapped an entire datacenter with tens of thousands of nodes from an SSH terminal after an abrupt power failure. It turns out infrastructure has lots of circular dependencies and we had to manually break that dependency.

            • ramzyo 3 days ago

              Exactly this. Have heard it referred to as "break glass access". Some form of remote access, be it SSH or otherwise, in case of serious emergency.

          • viraptor 4 days ago

            Passive metrics/logs won't let you debug all the issues. At some point you either need a system for automatic memory dumps and submitting bpf scripts to live nodes... or you need SSH access to do that.

            • otabdeveloper4 3 days ago

              This "system for automatic dumps" 100 percent uses ssh under the hood. Probably with some eternal sudo administrator key.

              Personal ssh access is always better (from a security standpoint) than bot tokens and keys.

              • viraptor 3 days ago

                There's a thousand ways to do it without SSH. It can be built into the app itself. It can be a special authenticated route to a suid script. It can be built into the current orchestration system. It can be pull-based, using a queue for system monitoring commands. It can be part of the existing monitoring agent. It can be run through AWS SSM. There's really no reason it has to be SSH.

                And even for SSH, you can have special keys authorized for only specific commands, so a service account would be better than a personal one in that case.

          • acdha 3 days ago

            > Separate from my sibling comment about AWS SSM,

            This seems like it’s conceding the point since SSM also allows you to run commands on nodes - I use it interchangeably with SSH to have Ansible manage legacy servers. Maybe what you’re trying to say is that it shouldn’t be routine and that there should be more of a review process so it’s not just a random unrestricted shell session? I think that’s less controversial, and especially when combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.

            • mdaniel 3 days ago

              Yes, you nailed it with "it shouldn't be routine", and there for sure should be a review process. My primary concern with the audit logs actually isn't security; it's reducing the amount of cowboy behavior in the software lifecycle.

              > combined with some kind of “taint” mode where your access to a server triggers a rebuild after the dust has settled.

              Oh, I love that idea: thanks for bringing it to my attention. I'll for sure incorporate that into my process going forward

              • acdha 3 days ago

                The first time I heard it was a very simple idea: they had a wrapper for the command which installed SSH keys on an EC2 instance which also set a delete-after tag which CloudCustodian queried.

        • from-nibly 4 days ago

          You don't. You shoot it in the head and get a new one. If you need logging / telemetry, bake it into the image.

          • otabdeveloper4 3 days ago

            Are you from tech support?

            Actually not every problem is solved with the "have you tried turning it off and back on again" trick.

            • mdaniel 3 days ago

              No, what we're talking about is (to extend your very condescending tech support analogy) shipping the customer a new PC from the factory, and telling them to throw the old one away because it doesn't matter. It will only start to matter if they have 3 bad PCs in a row, at which time it becomes (a) a demonstrable failure and not just stray neutron rays (b) an incident which will carry a postmortem of how the organization could have prevented that failure for next time

              I did start the whole thread by saying "and then I grew up," and not everyone is at the same place in their organizational maturity model. So, if you're happy with the process you have now, keep using it. I was unhappy, so I studied hard, incorporated supporting technology, and lobbied my heart out for change. Without maturity levels we'd all still be using telnet and tar based version control

              • otabdeveloper4 2 days ago

                Once you reach the next maturity level you realize that some problems and bugs are from bad design, and cannot be fixed by restarting the server.

                Fixing bad design is an art, not an organizational discipline. (Sadly.)

    • LtWorf 3 days ago

      He asks the senior developer to do it.

TechnicalVault 3 days ago

The whole MITM thing just makes me deeply uncomfortable; it's introducing a single point of trust with the keys to the kingdom. If I want to log what someone is doing, I do it server side, e.g. some kind of rsyslog. That way I can leverage existing log anomaly detection systems to pick up and isolate the server if we detect any bad behaviour.

  • naikrovek 3 days ago

    yeah the MITM thing is ... concerning.

    this just moves the trusted component from the SSH key to Cloudflare, and you still must trust something implicitly. except now it's a company that has agency and a will of its own instead of just some files on a filesystem.

    I'll stick to forced key rotation, thanks.

    • EthanHeilman 3 days ago

      > you still must trust something implicitly. except now it's a company that has agency and a will of its own instead of just some files on a filesystem.

      Some keys on a file system on a large number of user endhosts is a security nightmare. At big companies user endhosts are compromised hourly.

      When you say forced key rotation, how do you accomplish that and how often do you rotate? What if you want to disallow access to a user on a faster tempo than your rotation period? How do you ensure that you are giving out the new keys to only authorized people?

      My experience has been, when you really invest in building a highly secure key rotation system, you end up building something similar to our system.

      1. You want SSO integration with policy to ensure only the right people get the right keys to ensure the right keys end up on the right hosts. This is a hard problem.

      2. You end up using an SSH CA with short-lived certificates, because "key expires after 3 minutes" is far more secure than "key rotated every 90 days" (see the sketch after this list).

      3. Compliance requirements typically require session recording and logging, so you end up creating a MITM SSH proxy to do this.
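
      To make point 2 concrete, issuing a short-lived certificate against your own CA looks roughly like this (a sketch; identity, principal, and validity are illustrative):

        # certificate that expires three minutes from now
        ssh-keygen -s user_ca -I alice@example.com -n alice -V +3m ~/.ssh/id_ed25519.pub

        # the principals and the Valid: window are baked into the cert itself
        ssh-keygen -L -f ~/.ssh/id_ed25519-cert.pub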

      Building all this stuff is expensive and it needs to be kept up to date. Instead of building it in-house and hoping you build it right, buy a zero trust SSH product.

      For many companies the alternative isn't key rotation; it's just an endlessly growing set of keys that never expire. To quote Tatu Ylonen, the inventor of SSH:

      > "In analyzing SSH keys for dozens of large enterprises, it has turned out that in many environments 90% of all authorized keys are no longer used. They represent access that was provisioned, but never terminated when the person left or the need for access ceased to exist. Some of the authorized keys are 10-20 years old, and typically about 10% of them grant root access or other privileged access. The vast majority of private user keys found in most enviroments do not have passphrases."

      Challenges in Managing SSH Keys – and a Call for Solutions https://ylonen.org/papers/ssh-key-challenges.pdf

antoniomika 4 days ago

I wrote a system that did this >5 years ago (luckily was able to open source it before the startup went under[0]). The bastion would record ssh sessions in asciicast v2 format and store those for later playback directly from a control panel. The main issue that still isn't solved by a solution like this is user management on the remote (ssh server) side. In a more recent implementation, integration with LDAP made the most sense and allows for separation of user and login credentials. A single integrated solution is likely the holy grail in this space.

[0] https://github.com/notion/bastion

  • mdaniel 4 days ago

    Out of curiosity, why ignore this PR? https://github.com/notion/bastion/pull/13

    I would think even a simple "sorry, this change does not align with the project's goals" -> closed would help the submitter (and others) have some clarity versus the PR limbo it's currently in

    That aside, thanks so much for pointing this out: it looks like good fun, especially the Asciicast support!

    • antoniomika 4 days ago

      Honestly, I never had a chance to merge/review it. Once the company wound down, I had to move on to other things (find a new job, work on other priorities, etc.) and lost access to be able to do anything with it after. I thought about forking it and modernizing it, but it never came to fruition.

shermantanktop 4 days ago

I didn’t understand the marketing term “zero trust” and I still don’t.

In practice, I get it - a network zone shouldn’t require a lower authn/z bar on the implicit assumption that admission to that zone must have required a higher bar.

But all these systems are built on trust, and if it isn’t based on network zoning, it’s based on something else. Maybe that other thing is better, maybe not. But it exists and it needs to be understood.

An actual zero trust system is the proverbial unpowered computer in a bunker.

  • athorax 4 days ago

    It means there is zero trust of a device/service/user on your network until they have been fully authenticated. It is about having zero trust in something just because it is inside your network perimeter.

    • shermantanktop 3 days ago

      Maybe it should be called "zero trust of a device/service/user on your network until they have been fully authenticated." But that wouldn't sell high-dollar consulting services.

  • wmf 4 days ago

    The something else is specifically user/service identity. Not machine identity, not IP address. It is somewhat silly to have a buzzword that means "no, actually authenticate users" but here we are.

  • ngneer 4 days ago

    With you there. The marketing term makes Zero Sense to me.

  • acdha 3 days ago

    Yeah, it’s not a great name. Twenty years ago we called it “end to end authentication” and I think that’s better because it focuses on the most important aspect, but it probably doesn’t sound as cool for marketing purposes.

    I also like how that makes it easier to understand how variation is normal: for example, authentication comes in various flavors and that's okay, whereas some of the zero trust vendors will try to claim that something is or isn't ZT based on feature gaps in their competitors' products, and it's just so tedious to play that game.

blueflow 3 days ago

Instead of stealing your password/keypair, the baddies will now have to spoof your authentication with Cloudflare. If that's just a password, you gained nothing. If you have 2FA set up for that, you could equally use it for SSH directly, using an ssh key on a physical FIDO stick. OpenSSH already has native support for that (ecdsa-sk and ed25519-sk key formats).
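
For reference, generating such a token-backed key is a one-liner (the file path is illustrative); the private half never leaves the FIDO stick and each authentication requires a touch:

    ssh-keygen -t ed25519-sk -f ~/.ssh/id_ed25519_sk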

The gain here is minimal.

keepamovin 3 days ago

Does this give CloudFlare a backdoor to all your servers? That would not strictly be ZT, as some identify in the comments here.

  • udev4096 3 days ago

    For cloudflare, all their fancy ZT excludes themselves. It's just like the well known MiTM they perform while using their CA

    • megous 3 days ago

      Sounds like their modus operandi for most of their products, incl. the original one.

      • keepamovin 3 days ago

        And if China hacks CloudFlare? I guess we're all fucked.

  • knallfrosch 3 days ago

    Everything rests on CloudFlare's key.

johnklos 4 days ago

So... don't trust long lived ssh keys, but trust Cloudflare's CA. Why? What has Cloudflare done to earn trust?

If that alone weren't reason enough to dismiss this, the article has marketing BS throughout. For instance, "SSH access to a server often comes with elevated privileges". Ummm... Every authentication system ever has whatever privileges that come with that authentication system. This is the kind of bull you say / write when you want to snow someone who doesn't know any better. To those of us who do understand this, this is almost AI level bullshit.

The same is true of their supposed selling points:

> Author fine-grained policy to govern who can SSH to your servers and through which SSH user(s) they can log in as.

That's exactly what ssh does. You set up precisely which authentication methods you accept, you set up keys for exactly that purpose, and you set up individual accounts. Do Cloudflare really think we're setting up a single user account and giving access to lots of different people, and we need them to save us? (now that I think about it, I bet some people do this, but this is still a ridiculous selling point)

> Monitor infrastructure access with Access and SSH command logs

So they're MITM all of our connections? We're supposed to trust them, even though they have a long history of not only working with scammers and malicious actors, but protecting them?

I suppose there's a sucker born every minute, so Cloudflare will undoubtedly sell some people on this silliness, but to me it just looks like yet another way that Cloudflare wants to recentralize the Internet around them. If they had their way, then in a few years, were they to go down, a majority of the Internet would literally stop working. That should scare everyone.

EthanHeilman 4 days ago

I'm a member of the team that worked on this; happy to answer any questions.

We (BastionZero) recently got bought by Cloudflare and it is exciting bringing our SSH ideas to Cloudflare.

  • lenova 4 days ago

    I'd love to hear about the acquisition story with Cloudflare.

    • EthanHeilman 3 days ago

      Are there particular questions?

      So far my experience with joining and working at Cloudflare has been fantastic. Coming from a background of startups and academia, the size and scope of what Cloudflare is building and currently runs is overwhelming.

      In academia I've seen lots of excellent academic computer science papers that never benefit anyone because they never get turned into a tool that someone can just pick up and use. Ideas have inherent value, even useless ideas, but it feels good to see great ideas have impact. What appealed to me the most about getting acquired by Cloudflare is seeing research applied directly to products and used by people. Cloudflare does an excellent job both inventing innovative ideas and then actually making them real. There used to be a lot of companies that did this 10 years ago, but Cloudflare now seems rare in that respect.

  • mdaniel 4 days ago

    I just wanted to offer my congratulations on the acquisition. I don't know any details about your specific one, but I have been around enough to know that it's still worth celebrating o/

nonameiguess 3 days ago

Basic summary seems to be:

* This has nothing to do with zero trust. If you already require pubkey auth for every connection made to a server regardless of origin, that already meets the definition of zero trust.

* What this actually gives you is a solution to the problem of centrally revoking long-lived keys by not having any and instead using certificate auth. Now the CA is the only long-lived key.

* This is a reasonable thing large orgs should probably do. There is no reason the CA should be an external third-party like Cloudflare, however.

* This also integrates with existing SSO providers so human users can be granted short-lived session certs based on whatever you use to authenticate them to the SSO provider. Also reasonable, also no reason this should be offered as a service from Cloudflare as opposed to something you can self-host like Kerberos.

* This also provides ssh command logging by proxying the session and capturing all commands as they get relayed. Arguably not a bad idea in principle, but a log collector like rsyslogd sending to an aggregator accomplishes the same thing in practice, and again, I would think you'd want to self-host a proxy if you choose to go that route, not rent it from Cloudflare.

All in all, good things a lot of orgs should do, but they should probably actually do them themselves. I get the "well, it's hard" angle, but you're usually looking at large, well-funded orgs when you're talking about things like SOC and FedRAMP compliance. If you want to be a bank or whatever, yeah, that's hard. It's supposed to be. As I understand it, at least part of the spirit of SOC and FedRAMP and the like is that your organization has processes, plans, procedures, and personnel in place with the expertise and care to take security seriously, not "we have no idea what any of this means, why it matters, and don't have the time, but we pay a subscription fee to Cloudflare and they say they take care of it."

andriosr 3 days ago

hoopdev here. Zero trust for SSH is just table stakes these days. Real challenge is getting devs to actually adopt better practices without the tooling getting in their way.

Found in practice that certs > keys but you need to think beyond just SSH. Most teams have a mix of SSH, K8s, DBs etc. Using separate tools for each just creates more headache.

Haven't tried Boundary but Teleport/hoop/Tailscale all handle the mixed protocol issue decently. Main difference is hoop focuses more on protocol-level DLP and automated reviews vs pure network access. Horses for courses though, they're all valid approaches.

Key is picking something devs will actually use vs work around. Nothing worse than a "secure" solution that drives people to create workarounds.

singhrac 3 days ago

I get that HN does not like Cloudflare and does not like the term “Zero Trust”, but geez these comments are repetitive. Can anyone compare to Tailscale SSH? Are they basically offering an (even more) enterprise version of Tailscale’s product line?

dmuth 3 days ago

Using CAs and signed certificates in SSH is definitely the way.

If anyone wants to play around with that, without the risk of locking themselves out of a server, I built a little "playground" a while back which is a series of Docker containers that can SSH to each other. Give it a try at https://github.com/dmuth/ssh-principal-and-ca-playground

(I haven't touched the project in awhile, so if there are any issues, please open an Issue and I'll gladly look at it!)

cyberax 4 days ago

Hah. I did pretty much all the same stuff in my previous company.

One thing that we did a bit better: we used AWS SSM to provision our SSH-CA certificates onto the running AWS EC2 instances during the first connection.

It would be even better if AWS allowed using SSH CA certs as keys, but alas...

  • pugz 4 days ago

    FYI I love your work with Gimlet, etc.

    I too would love "native" support for SSH CAs in EC2. What I ended up doing is adding a line to every EC2 userdata script that would rewrite the /home/ec2-user/.ssh/authorized_keys file to treat the provided EC2 keypair as a CA instead of a regular pubkey.
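
    Roughly, the idea is to prepend the cert-authority option to the injected key (a simplified sketch of the mechanics, not the exact script):

      # user-data: treat the EC2-provided public key as an SSH CA
      # rather than a directly authorized key
      sed -i 's/^ssh-/cert-authority ssh-/' /home/ec2-user/.ssh/authorized_keys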

arianvanp 4 days ago

Zero trust. But they don't solve the more interesting problem: host key authentication.

Would be nice if they could replace TOFU access with an SSH CA as well. Ideally based on device posture of the server (e.g. TPM2 attestation).

amar0c 3 days ago

Is there anything similar ("central point of SSH access/keys management") that is not Cloudflare? I know about Tailscale and its SSH, but recently it introduced so much latency (even though they say it's P2P between A and B) that it is unusable.

Ideally something self hosted but not hard requirement

  • udev4096 3 days ago

    Like a ssh key management cli?

nanis 3 days ago

> the SSH certificates issued by the Cloudflare CA include a field called ValidPrinciples

Having implemented similar systems before, I was interested to read this post. Then I see this. Now I have to find out if that really is the field, if this was ChatGPT spellcheck, or something else entirely.

  • blueflow 3 days ago

    For the others: The correct naming is "principals".

    • jgrahamc 3 days ago

      Sigh. I'll get that fixed and figure out how that happened.

udev4096 3 days ago

> You no longer need to manage long-lived SSH keys

Well, now you are managing CAs. Sure, it's short-lived, but it's no different from having a policy for rotating your SSH keys

  • acdha 3 days ago

    It’s really important to understand why those are different. CAs are organizational and tightly restricted: I don’t use or have access to my CA’s private key but my SSH key is on every client I use. If I leave the company, you have to check every authorized key file on every server to ensure my keys are no longer present. In contrast, the CA doesn’t need to rotate since I never had access to it and since the CA will set an expiration time on each of the keys I do get it’s probably unusable shortly after my departure even if you missed something.

INTPenis 3 days ago

Properly set up IaC that treats Linux as an appliance instead could get rid of SSH altogether.

I'm only saying this because after 20+ years as a sysadmin I feel like there have been no decent solutions presented. On the other hand, to protect my IaC and Gitops I have seen very decent and mature solutions.

  • otabdeveloper4 3 days ago

    I don't know what exactly you mean by "IaC" here, but the ones I know use SSH under the hood somewhere. (Except with some sort of "bot admin" key now, which is strictly worse.)

    • INTPenis 3 days ago

      I mean that you treat Linux servers as appliances, you do everything in IaC at provisioning and you never login over SSH.

      • otabdeveloper4 a day ago

        "IaC at provisioning" means (in practice) a webapp and an eternal root access token that does login over SSH for you behind the scenes.

        That's strictly worse from a security point of view.

        In an ideal world we would have private CAs and short-lived certificates that get bubbled through all the layers of the software stack. Going back to webapps and tokens is a step backwards.

anilakar 3 days ago

Every now and then a new SSH key management solution emerges and every time it is yet another connection-terminating proxy and not a real PKI solution.

koutsie 3 days ago

How is trusting Cloudflare "zero-trust" ?

advael 4 days ago

You know you can just do this with keyauth and a cron job, right?

  • wmf 4 days ago

    And Dropbox is a wrapper around rsync.

    • advael 4 days ago

      Generally speaking a lot of "essential tools" in "cloud computing" are available as free, boring operating system utilities.

      • kkielhofner 3 days ago

        It’s a joke from a famous moment in HN history:

        https://news.ycombinator.com/item?id=9224

        • advael 3 days ago

          That is pretty funny, and the whole idea that you can't make money packaging open-source software in a way that's more appealing to people is definitely funny given that this is the business model of a lot of successful companies

          I do however think this leads to a lot of problems when those companies try to protect their business models, as we are seeing a lot of today

xyst 4 days ago

Underlying tech is “Openpubkey”.

https://github.com/openpubkey/openpubkey

BastionZero just builds on top of that to provide a “seamless” UX for ssh sessions and some auditing/fedramp certification.

Personally, not a fan of relying on CF. Need less centralization/consolidation into a few companies. It’s bad enough with MS dominating the OS (consumer) space. AWS dominating cloud computing. And CF filling the gaps between the stack.

  • ranger_danger 4 days ago

    Completely agree. I also don't want to trust certificate authorities for my SSH connections let alone CF. Would not be surprised if it/they were compromised.

    • yjftsjthsd-h 3 days ago

      > I also don't want to trust certificate authorities for my SSH connections let alone CF. Would not be surprised if it/they were compromised.

      OpenPubkey, or in general? Normal SSH CAs don't do PKI like browsers use, you make and trust your own CA(s). And if an attacker can compromise your CA private key, why can't they compromise your SSH private key directly?

      • looofooo0 3 days ago

        https://www.usenix.org/system/files/login/articles/105484-Gu... + People just don't check ssh keys normally.

        • yjftsjthsd-h 3 days ago

          That's about host keys, not user keys. And... I'm struggling to think of a threat model where that problem manifests in a compromise? Like, what's your threat model?

          That said, CAs actually really help with that problem, because if a server has its host keys signed with a CA and then the user trusts that CA then they don't have to TOFU the host keys.
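
          A minimal sketch of that setup (hostnames and validity are illustrative):

            # sign the server's host key with your own host CA, and tell sshd to present the cert
            ssh-keygen -s host_ca -h -I web01 -n web01.example.com -V +52w /etc/ssh/ssh_host_ed25519_key.pub
            echo "HostCertificate /etc/ssh/ssh_host_ed25519_key-cert.pub" >> /etc/ssh/sshd_config

            # clients trust the CA once instead of TOFU-ing every host
            echo "@cert-authority *.example.com $(cat host_ca.pub)" >> ~/.ssh/known_hosts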

  • EthanHeilman 3 days ago

    Author of OpenPubkey here (now at Cloudflare). Happy to answer any OpenPubkey questions.

  • datadeft 3 days ago

    > BastionZero just builds on top of that to provide a “seamless” UX

    Isn't this what many of the companies do?

  • debarshri 3 days ago

    I think teleport operates in similar style.

rdtsc 3 days ago

By “ValidPrinciples” did they mean “ValidPrincipals”?

And by ZeroTrust they really mean OneTrust: trust CF. A classic off-by-one error :-)