I'm considering how to deploy a service that needs SSH access to many important boxes in my infrastructure. Rather than store a long-lived SSH private key in a key store that the service could request, I'm considering using short-lived SSH certificates to allow SSH access for the service. So the two architectures I'm comparing are as follows (and I'm not mentioning the technologies at play, because I'm more interested in the theory and reasoning):
The tradeoff I see is that with certificate-based auth, compromised certificates expire quickly and are thus less risky. If a service using SSH is compromised, I can revoke its ability to request new certificates without touching any config on the servers and without taking away other services' ability to authenticate. However, this architecture is more complex, and in the end the SSHing service still has to authenticate somehow to the CA server to authorize the signing, whether via a provider role permission, a shared secret (hard-coded or held in a secrets store), an IP address, or some sort of PKI (the service presenting a cert signed by its provisioner).
But whatever the mechanism, does this provide a benefit above and beyond just giving the services access to the private key? If a service is ever compromised, an attacker can request a valid cert just as easily, and use it just as well as a stolen private key.
Is there a method for securely authenticating to the CA server for signing requests that doesn't require human intervention and is resistant to the service being compromised? Or is there some other benefit to this architecture that would justify the extra complexity?
I don't want to confuse the discussion by bringing specific technologies into it, but to keep this from being too abstract: this would operate on Kubernetes, EC2, or a similar cloud platform where I can grant a set of API permissions to a service from the platform itself using RBAC. The SSHing services might be short-lived push-style tasks or long-lived services like Ansible Tower.
Let's break your problem down into three parts:
1) Identity provisioning: Ideally, you should strive to provide an identity to each of your services at deployment time. This is generally done using leaf X.509 certificates signed by a CA that is trusted by all entities in your infrastructure. The private key of the CA should not be accessible to anyone. On Kubernetes and most cloud platforms, this is achievable via some kind of certificate management service.
2) Authorization: This is an open area for creativity. Even in large DC operations like Google's, the state of the art is to maintain an authorization file containing data such as ACLs (which service can talk to which service). All services periodically pull this file from a central, well-known location and use it to grant or deny access.
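As a minimal sketch (the schema, service names, and fleet names here are all hypothetical), such an authorization file could be a JSON map of target service to allowed callers, consulted on every incoming connection:

```python
import json

# Hypothetical ACL document, periodically pulled from a central location
# (an object store, config service, etc.). Assumed schema:
#   {"<target-service>": ["<allowed-caller>", ...]}
ACL_DOCUMENT = json.loads("""
{
  "db-fleet":  ["ansible-tower", "deploy-runner"],
  "web-fleet": ["deploy-runner"]
}
""")

def is_authorized(acl: dict, target: str, caller: str) -> bool:
    """Return True if `caller` may open a connection to `target`.

    Unknown targets deny by default.
    """
    return caller in acl.get(target, [])

print(is_authorized(ACL_DOCUMENT, "db-fleet", "ansible-tower"))   # True
print(is_authorized(ACL_DOCUMENT, "web-fleet", "ansible-tower"))  # False
```

Deny-by-default for unknown targets is the important design choice: a service missing from the file gets no access rather than accidental access.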
3) Certificate revocation list: The CRL is stored in a central, well-known location and lists all certificates that have been compromised. Every service keeps a local copy of the CRL and refreshes it periodically. The shorter the refresh interval, the smaller the window an attacker has.
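The refresh-interval tradeoff can be sketched like this; the `fetch` callable and the serial-number format are stand-ins for however you actually retrieve the central CRL:

```python
import time

class CRLCache:
    """Local copy of the CRL (a set of revoked certificate serials),
    refreshed at most every `max_age` seconds.

    `fetch` is a stand-in for pulling the list from the central
    location (e.g. an HTTP GET); `max_age` bounds the attacker's
    window after a revocation.
    """

    def __init__(self, fetch, max_age: float = 300.0):
        self._fetch = fetch
        self._max_age = max_age
        self._revoked: set = set()
        self._fetched_at = float("-inf")  # force a refresh on first use

    def is_revoked(self, serial: str) -> bool:
        now = time.monotonic()
        if now - self._fetched_at > self._max_age:
            self._revoked = set(self._fetch())
            self._fetched_at = now
        return serial in self._revoked

# Stand-in fetcher with two revoked serials; refresh at most every 5 min.
crl = CRLCache(fetch=lambda: ["0x1f3a", "0x2b40"], max_age=300.0)
print(crl.is_revoked("0x1f3a"))  # True
print(crl.is_revoked("0xbeef"))  # False
```

Lowering `max_age` shrinks the revocation window at the cost of more load on the central CRL endpoint, which is exactly the knob described above.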
Now, during an SSH call, your server needs to check the following:
1) Does the incoming client present an identity signed by the CA?
2) Is the incoming client authorized to access my service?
3) Has the identity of this client been revoked?
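For SSH specifically, OpenSSH's certificate support can enforce all three checks natively on the server side. A hedged `sshd_config` sketch (the file paths are assumptions; the option names are real):

```
# /etc/ssh/sshd_config (paths are illustrative)

# Check 1: accept only certificates signed by this user CA.
TrustedUserCAKeys /etc/ssh/user_ca.pub

# Check 2: which certificate principals may log in as which local user.
AuthorizedPrincipalsFile /etc/ssh/principals/%u

# Check 3: a key revocation list (KRL) of revoked keys/certificates,
# refreshed from your central location.
RevokedKeys /etc/ssh/ssh_revoked_keys
```

Note that with short-lived certificates (your original idea), check 3 matters less: a certificate that expires in minutes often ages out faster than a KRL would propagate.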
If you follow all three guidelines, you will have a robust mechanism for securing your infrastructure that scales easily to several thousand services.