
Certificate-as-Code

Why This Matters

For executives: Certificate-as-Code reduces operational risk by eliminating manual certificate processes that cause 94% of preventable outages. It enables infrastructure automation that scales without linear cost increases.

For security leaders: Treating certificates as code provides complete audit trails (Git history), consistent policy enforcement, and prevents the "SSH into production server to fix certificate" pattern that bypasses security controls. It's foundational for DevSecOps and compliance automation.

For engineers: You need Certificate-as-Code when deploying to Kubernetes, using infrastructure-as-code (Terraform, CloudFormation), or implementing GitOps workflows. It's how you avoid certificate management becoming a deployment bottleneck.

Common scenario: Your team is deploying microservices to Kubernetes. Developers need certificates for new services but the current process requires submitting tickets to InfoSec and waiting 2-4 weeks. Certificate provisioning is blocking deployment velocity. You need self-service certificate management with automated policy enforcement.


Overview

Certificate-as-Code treats certificate definitions, policies, and lifecycle management as code—versioned, reviewed, tested, and automatically deployed. This approach brings infrastructure-as-code principles to PKI, enabling consistent, auditable, and scalable certificate management.

Core principle: Certificate requests, configurations, and policies should be declared in code, reviewed like code, tested like code, and deployed automatically. Manual certificate operations don't scale.

Why Certificate-as-Code

Traditional manual certificate management fails at scale:

  • Error-prone manual processes
  • Inconsistent configurations
  • Poor auditability
  • Slow provisioning
  • Difficult disaster recovery

Certificate-as-Code provides:

  • Version-controlled certificate definitions
  • Automated provisioning and renewal
  • Consistent enforcement of policies
  • Complete audit trail (Git history)
  • Infrastructure-as-code integration

Decision Framework

Use Certificate-as-Code when:

  • Managing 100+ certificates across infrastructure
  • Using infrastructure-as-code tools (Terraform, CloudFormation, Kubernetes)
  • Implementing DevOps/GitOps workflows
  • Need automated compliance audit trails
  • Frequent certificate provisioning (daily/weekly deployments)

Don't use Certificate-as-Code when:

  • Small scale (<20 certificates) with infrequent changes
  • Manual processes are working fine and don't need to scale
  • Team lacks Git/IaC expertise and can't invest in training
  • Legacy systems that can't integrate with automation

Hybrid approach when:

  • Mixed environment (some modern, some legacy)
  • Gradual migration from manual to automated processes
  • Different certificate types with different management needs (long-lived certs manually, short-lived certs automated)

Red flags:

  • Implementing Certificate-as-Code without an automated certificate management platform (you will only script the manual work, not eliminate it)
  • No code review process (defeats audit trail benefit)
  • Storing private keys in code repositories (never do this)
  • Treating Certificate-as-Code as "set and forget" without ongoing maintenance

Terraform for Certificates

Define certificates in Terraform:

# Certificate resource
resource "aws_acm_certificate" "api" {
  domain_name               = "api.example.com"
  subject_alternative_names = ["*.api.example.com"]
  validation_method         = "DNS"

  lifecycle {
    create_before_destroy = true
  }

  tags = {
    Name        = "api-certificate"
    Environment = "production"
    Team        = "platform"
    AutoRenew   = "true"
  }
}

# Validation records
resource "aws_route53_record" "cert_validation" {
  for_each = {
    for dvo in aws_acm_certificate.api.domain_validation_options : dvo.domain_name => {
      name   = dvo.resource_record_name
      record = dvo.resource_record_value
      type   = dvo.resource_record_type
    }
  }

  allow_overwrite = true  # The base and wildcard names share one validation record
  zone_id         = aws_route53_zone.main.zone_id
  name            = each.value.name
  type            = each.value.type
  records         = [each.value.record]
  ttl             = 60
}

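# Wait for DNS validation to complete before the certificate is used
resource "aws_acm_certificate_validation" "api" {
  certificate_arn         = aws_acm_certificate.api.arn
  validation_record_fqdns = [for record in aws_route53_record.cert_validation : record.fqdn]
}

# Load balancer using the validated certificate
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.main.arn
  port              = 443
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-TLS13-1-2-2021-06"
  certificate_arn   = aws_acm_certificate_validation.api.certificate_arn

  default_action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.api.arn
  }
}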

Kubernetes Certificate Resources

Cert-manager provides Kubernetes-native certificate management:

# Certificate resource
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: api-tls
  namespace: production
spec:
  secretName: api-tls-secret
  duration: 2160h  # 90 days
  renewBefore: 720h  # Renew 30 days before expiry

  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer

  dnsNames:
    - api.example.com
    - "*.api.example.com"

  privateKey:
    algorithm: ECDSA
    size: 256
    rotationPolicy: Always

---
# ClusterIssuer for Let's Encrypt
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: pki-admin@example.com  # Placeholder; the original address was obfuscated on the source page
    privateKeySecretRef:
      name: letsencrypt-prod
    solvers:
      - dns01:
          route53:
            region: us-east-1

---
# Ingress using certificate
apiVersion: networking.k8s.io/v1
kind: Ingress
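metadata:
  name: api-ingress
  namespace: production  # Must match the namespace of api-tls-secret
  # No cert-manager.io/cluster-issuer annotation here: the Certificate above
  # already manages api-tls-secret, and the annotation would create a second one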
spec:
  tls:
    - hosts:
        - api.example.com
      secretName: api-tls-secret
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: api-service
                port:
                  number: 80
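
The renewal timing implied by duration and renewBefore in the Certificate above is plain arithmetic: renewal is due at the certificate's expiry minus the renewBefore window. A minimal sketch of that calculation follows; the function name and the issue date are hypothetical, and an ACME issuer such as Let's Encrypt sets its own validity period, so the actual expiry may differ from the requested duration.

```python
from datetime import datetime, timedelta, timezone


def renewal_time(not_before: datetime, duration_h: int, renew_before_h: int) -> datetime:
    """Return when renewal is due: expiry minus the renewBefore window."""
    if renew_before_h >= duration_h:
        raise ValueError("renewBefore must be shorter than duration")
    not_after = not_before + timedelta(hours=duration_h)
    return not_after - timedelta(hours=renew_before_h)


# Hypothetical issue date; 2160h = 90 days, 720h = 30 days
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
due = renewal_time(issued, duration_h=2160, renew_before_h=720)
print(due.date())  # 2024-03-01, i.e. 60 days after issuance
```

With these values, two thirds of the lifetime has elapsed before renewal starts, which leaves 30 days to notice and fix a failed renewal.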

GitOps Workflow

Manage certificates through Git:

Developer                  Git Repo                  Cluster
    │                         │                         │
    │─── Create cert.yaml ───>│                         │
    │                         │                         │
    │─── Pull Request ───────>│                         │
    │                         │                         │
    │     Review/Approve      │                         │
    │                         │                         │
    │─── Merge ──────────────>│                         │
    │                         │                         │
    │                         │─── Argo CD sync ───────>│
    │                         │                         │
    │                         │        cert-manager     │
    │                         │        issues cert      │
    │                         │                         │
    │<──── Notification ──────│<──── Deployed ──────────│

Policy as Code

Define certificate policies in code:

# conftest.rego (OPA policy)
package certificate_policy

# rego.v1 syntax (OPA 0.59+); required by default from OPA 1.0
import rego.v1

# Deny certificates with validity > 90 days
deny contains msg if {
    input.kind == "Certificate"
    duration_hours := time.parse_duration_ns(input.spec.duration) / 3600000000000
    duration_hours > 2160  # 90 days
    msg := sprintf("Certificate validity of %v days exceeds the 90-day maximum", [duration_hours / 24])
}

# Require ECDSA for new certificates.
# A missing algorithm is treated as RSA, which is cert-manager's default.
deny contains msg if {
    input.kind == "Certificate"
    object.get(input.spec, ["privateKey", "algorithm"], "RSA") != "ECDSA"
    msg := "Certificates must use ECDSA algorithm"
}

# Require rotation policy
deny contains msg if {
    input.kind == "Certificate"
    not input.spec.privateKey.rotationPolicy
    msg := "Certificate must specify key rotation policy"
}

Apply policy in CI/CD:

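# Validate certificate definition against policy.
# conftest only evaluates package "main" unless a namespace is given.
conftest test --namespace certificate_policy certificate.yaml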
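
For quick local experiments without OPA installed, the three rules above can be mirrored in a few lines of Python. This is only an illustrative sketch, not a replacement for conftest: the function names are hypothetical, the duration parser handles only the h/m/s units, and a missing privateKey.algorithm is treated as RSA (cert-manager's default).

```python
import re

_UNIT_HOURS = {"h": 1.0, "m": 1 / 60, "s": 1 / 3600}
MAX_HOURS = 2160  # 90 days


def duration_hours(text: str) -> float:
    """Parse a Go-style duration such as "2160h" or "1h30m" into hours."""
    parts = re.findall(r"(\d+(?:\.\d+)?)([hms])", text)
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unsupported duration: {text!r}")
    return sum(float(n) * _UNIT_HOURS[u] for n, u in parts)


def violations(manifest: dict) -> list[str]:
    """Return policy violations for one parsed manifest (empty list if compliant)."""
    if manifest.get("kind") != "Certificate":
        return []
    spec = manifest.get("spec", {})
    key = spec.get("privateKey", {})
    found = []
    if "duration" in spec and duration_hours(spec["duration"]) > MAX_HOURS:
        found.append("validity exceeds maximum 90 days")
    if key.get("algorithm", "RSA") != "ECDSA":
        found.append("certificates must use ECDSA")
    if not key.get("rotationPolicy"):
        found.append("certificate must specify key rotation policy")
    return found


compliant = {
    "kind": "Certificate",
    "spec": {
        "duration": "2160h",
        "privateKey": {"algorithm": "ECDSA", "size": 256, "rotationPolicy": "Always"},
    },
}
print(violations(compliant))  # []
print(violations({"kind": "Certificate", "spec": {"duration": "2400h"}}))
```

The second call reports all three violations, because the manifest exceeds 90 days and has no privateKey block at all.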

Ansible for Certificate Deployment

Automate certificate deployment:

---
- name: Deploy TLS Certificate
  hosts: web_servers
  become: true
  tasks:
    - name: Generate private key
      community.crypto.openssl_privatekey:
        path: /etc/ssl/private/{{ cert_name }}.key
        size: 2048
        mode: '0600'

    # Assumes /etc/ssl/csr already exists on the managed host
    - name: Generate CSR
      community.crypto.openssl_csr:
        path: /etc/ssl/csr/{{ cert_name }}.csr
        privatekey_path: /etc/ssl/private/{{ cert_name }}.key
        common_name: "{{ cert_common_name }}"
        subject_alt_name: "{{ cert_san }}"

    # The CSR lives on the managed host; lookup('file', ...) would read
    # the control node instead, so fetch it with slurp
    - name: Read CSR from the managed host
      ansible.builtin.slurp:
        src: /etc/ssl/csr/{{ cert_name }}.csr
      register: csr_file

    - name: Submit CSR to CA
      ansible.builtin.uri:
        url: "{{ ca_api_url }}/issue"
        method: POST
        body: "{{ csr_file.content | b64decode }}"
        headers:
          Authorization: "Bearer {{ ca_api_token }}"
      register: cert_response

    - name: Install certificate
      ansible.builtin.copy:
        content: "{{ cert_response.json.certificate }}"
        dest: /etc/ssl/certs/{{ cert_name }}.crt
        mode: '0644'
      notify: Reload nginx

  handlers:
    - name: Reload nginx
      ansible.builtin.service:
        name: nginx
        state: reloaded

CI/CD Integration

Integrate certificate validation into pipelines:

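# GitHub Actions
name: Certificate Validation
on: [pull_request]

permissions:
  contents: read
  pull-requests: write  # Needed to comment the plan on the PR

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3

      - name: Validate certificate definitions
        run: |
          # Assumes yamllint, conftest, and gitleaks are installed on the runner

          # Check certificate YAML syntax
          yamllint certificates/

          # Validate against policy (package certificate_policy)
          conftest test --namespace certificate_policy certificates/

          # Check for secrets in code
          gitleaks detect

      - name: Preview changes
        run: |
          # Cloud provider credentials must be configured separately
          terraform init -input=false
          terraform plan -input=false -out=plan.tfplan

      - name: Comment plan on PR
        uses: actions/github-script@v7
        with:
          script: |
            const output = await exec.getExecOutput('terraform', ['show', '-no-color', 'plan.tfplan']);
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output.stdout
            });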

Lessons from Production

What We Learned in a Project (Kubernetes + cert-manager)

A client implemented Certificate-as-Code using cert-manager in Kubernetes for 15,000+ certificates. Initial implementation had challenges:

Problem 1: "Everything automated" created blind spots

We assumed that once cert-manager was configured, certificates would "just work." In production:

  • Certificate validation failures were silent (pods just failed to start)
  • No visibility into certificate issuance attempts or failures
  • When Let's Encrypt rate limits hit, we had no warning system
  • Debugging required diving into cert-manager logs across multiple clusters

What we did: Built a comprehensive observability layer:

  • Prometheus metrics for certificate issuance success/failure rates
  • Alerts for certificates not issuing within 5 minutes of request
  • Dashboard showing certificate status, expiry, and renewal attempts
  • Automated Slack notifications for failed issuance with actionable error messages
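
The "not issued within 5 minutes" alert above boils down to a simple age check on pending requests. The sketch below shows the idea on in-memory records; the CertificateRequest type and its field names are hypothetical stand-ins for whatever the metrics or the Kubernetes API actually expose.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class CertificateRequest:
    name: str
    requested_at: datetime
    ready: bool  # True once the certificate has been issued


def stuck_requests(requests, now, limit=timedelta(minutes=5)):
    """Return names of requests that are still not ready after `limit`."""
    return [r.name for r in requests if not r.ready and now - r.requested_at > limit]


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
requests = [
    CertificateRequest("api-tls", now - timedelta(minutes=12), ready=False),
    CertificateRequest("web-tls", now - timedelta(minutes=2), ready=False),
    CertificateRequest("db-tls", now - timedelta(minutes=30), ready=True),
]
print(stuck_requests(requests, now))  # ['api-tls']
```

Only api-tls is reported: web-tls is still inside the 5-minute window and db-tls has already been issued.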

Problem 2: Policy-as-code was too restrictive at first

We implemented strict OPA policies requiring:

  • All certificates ECDSA (not RSA)
  • All certificates 90 days or less
  • All certificates use DNS-01 validation

This broke legitimate use cases:

  • Some legacy applications only supported RSA
  • External partners required longer-lived certificates
  • Some domains couldn't use DNS-01 (no API access to DNS provider)

What we did: Implemented policy exceptions with approval workflow:

  • Default policies apply to 95% of certificates
  • Exception process for legitimate edge cases
  • Exceptions documented in code with justification
  • Quarterly review of exceptions to reduce over time
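
One way to keep exceptions "documented in code with justification" is a small, reviewable registry in which each entry names the waived rule, the approver, and a review date. The following sketch illustrates that shape; every field name and the example entry are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class PolicyException:
    certificate: str    # namespace/name of the certificate
    rule: str           # policy rule being waived
    justification: str
    approved_by: str
    review_by: date     # next scheduled review


def is_waived(exceptions, certificate, rule, today):
    """True if a justified, unexpired exception covers this certificate and rule."""
    return any(
        e.certificate == certificate
        and e.rule == rule
        and bool(e.justification.strip())
        and e.review_by >= today
        for e in exceptions
    )


def due_for_review(exceptions, today):
    """Exceptions whose review date has passed."""
    return [e for e in exceptions if e.review_by < today]


registry = [
    PolicyException(
        certificate="production/legacy-billing-tls",  # hypothetical
        rule="require-ecdsa",
        justification="Legacy application only supports RSA",
        approved_by="platform-team",
        review_by=date(2024, 3, 31),
    ),
]
print(is_waived(registry, "production/legacy-billing-tls", "require-ecdsa", date(2024, 2, 1)))  # True
print(len(due_for_review(registry, date(2024, 4, 15))))  # 1
```

Because an exception stops applying once its review date passes, the quarterly review becomes a forcing function rather than a calendar reminder.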

Problem 3: Git became an operational bottleneck

With 50+ developers deploying services, certificate PRs piled up:

  • Platform team reviewing hundreds of certificate PRs per week
  • Developers waited hours/days for certificate approval
  • "Just copy/paste from another certificate" led to inconsistent configurations

What we did: Implemented self-service with automated policy enforcement:

  • Developers create certificate definitions in their service repos
  • CI/CD automatically validates against policies
  • Auto-approve if policy compliant
  • Only manual review for policy exceptions
  • Reduced platform team review burden by 90%
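
The routing rule described above (auto-approve when compliant, manual review otherwise) is small enough to state directly. This sketch assumes the CI policy step has already produced a list of violation messages per pull request; the PR identifiers are made up.

```python
from collections import Counter


def route(violations: list[str]) -> str:
    """Policy-compliant changes are auto-approved; anything else goes to a human."""
    return "manual-review" if violations else "auto-approve"


def review_load(prs: dict[str, list[str]]) -> Counter:
    """Count how many pull requests land in each review path."""
    return Counter(route(v) for v in prs.values())


# Hypothetical CI results: PR id -> policy violations found
prs = {
    "pr-101": [],
    "pr-102": ["certificates must use ECDSA"],
    "pr-103": [],
}
load = review_load(prs)
print(load["auto-approve"], load["manual-review"])  # 2 1
```

Tracking the two counts over time is a direct way to see whether the platform team's review burden is actually falling.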

Warning signs you're heading for the same mistakes:

  • Implementing Certificate-as-Code without observability into certificate operations
  • Setting policies without understanding existing legitimate use cases
  • Centralizing certificate definitions when scale demands distributed ownership
  • Assuming "automated" means "zero operational overhead"

What We Learned (Terraform + Multi-Cloud)

A banking client implemented Certificate-as-Code with Terraform managing certificates across AWS, Azure, and on-premises. Challenges:

Problem 1: State management became complex

Certificate state in Terraform included sensitive data:

  • Private keys (should never be in state)
  • Certificate serial numbers and expiry dates
  • Deployment locations

With 25,000+ certificates, Terraform state files grew to hundreds of MB. State management became an operational burden:

  • Long terraform plan/apply times
  • Merge conflicts in state
  • Difficulty troubleshooting state drift

What we did: Hybrid approach with state separation:

  • Terraform manages certificate definitions and policies
  • cert-manager/Venafi manages actual certificate issuance and renewal
  • Terraform references certificates by identifier, doesn't manage full lifecycle
  • Reduced state size by 90%, eliminated sensitive data in state

Problem 2: Certificate rotation caused Terraform drift

Certificates auto-renewed by cert-manager or Venafi would have different serial numbers than Terraform expected. Terraform plan would show "drift" even though everything was working correctly.

What we did: Configured Terraform to ignore the attributes that change on every renewal (certificate body and expiry date):

lifecycle {
  ignore_changes = [
    certificate_body,  # Changes on renewal
    not_after,         # Changes on renewal
  ]
}
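
The same idea applies to any drift check outside Terraform: compare the expected and observed certificate metadata, but skip the fields that legitimately change on renewal. A minimal sketch, with hypothetical field names:

```python
# Fields that change on every renewal and should not count as drift
VOLATILE = frozenset({"serial_number", "not_before", "not_after", "certificate_body"})


def meaningful_drift(expected: dict, observed: dict, ignore=VOLATILE) -> dict:
    """Return {field: (expected, observed)} for differences outside `ignore`."""
    keys = (expected.keys() | observed.keys()) - ignore
    return {
        k: (expected.get(k), observed.get(k))
        for k in sorted(keys)
        if expected.get(k) != observed.get(k)
    }


expected = {"common_name": "api.example.com", "key_algorithm": "ECDSA",
            "serial_number": "0A1B", "not_after": "2024-03-31"}
observed = {"common_name": "api.example.com", "key_algorithm": "RSA",
            "serial_number": "7F3C", "not_after": "2024-06-29"}
print(meaningful_drift(expected, observed))  # {'key_algorithm': ('ECDSA', 'RSA')}
```

The renewed serial number and expiry are ignored, while the changed key algorithm is still reported as real drift.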

Problem 3: Multi-cloud complexity

Different cloud providers had different certificate management capabilities:

  • AWS ACM: Automatic renewal, limited export
  • Azure Key Vault: Automatic renewal only with integrated CAs (manual otherwise), full export capability
  • On-premises: Full manual management

Trying to abstract this into a single Terraform module created more complexity than it solved.

What we did: Platform-specific implementations with shared policy layer:

  • Separate Terraform modules for AWS, Azure, on-prem
  • Shared OPA policies enforced across all platforms
  • Accept that certificate management will look different per platform
  • Focus on consistent outcomes (all certificates monitored, all auto-renewed) not consistent implementation

Warning signs you're heading for the same mistakes:

  • Putting sensitive data in Terraform state
  • Ignoring state drift from certificate renewal
  • Trying to abstract multi-cloud differences into single implementation
  • Managing certificate lifecycle entirely in Terraform instead of delegating to specialized tools

Best Practices

Version control:

  • All certificate definitions in Git
  • Meaningful commit messages explaining certificate purpose
  • Required code reviews for certificate changes
  • Separate repos for production vs non-production environments

Automation:

  • Automatic certificate issuance on merge
  • Automatic renewal without human intervention
  • Automatic deployment to target systems
  • Zero manual SSH into servers for certificate operations

Testing:

  • Validate syntax in CI (yamllint, terraform validate)
  • Test against policies before merge (conftest, OPA)
  • Preview changes before apply (terraform plan)
  • Smoke tests after deployment (curl with certificate validation)
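
A post-deployment smoke test can go one step beyond curl and also assert on remaining validity. The sketch below uses only the Python standard library; the function names are hypothetical, and the default SSL context validates the chain and hostname, so a bad certificate raises an error before any expiry check runs.

```python
import socket
import ssl
import time
from typing import Optional


def days_remaining(not_after: str, now: Optional[float] = None) -> float:
    """Days until expiry, given a notAfter string such as 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - (time.time() if now is None else now)) / 86400


def check_endpoint(host: str, port: int = 443, timeout: float = 5.0) -> float:
    """Connect with full certificate validation and return days until expiry."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return days_remaining(tls.getpeercert()["notAfter"])


# Offline example: ten days before a fixed expiry timestamp
expiry = "Jan  5 09:34:43 2018 GMT"
ten_days_earlier = ssl.cert_time_to_seconds(expiry) - 10 * 86400
print(days_remaining(expiry, now=ten_days_earlier))  # 10.0
```

In a pipeline, a call such as check_endpoint("api.example.com") could be compared against a threshold (for example, fail if fewer than 20 days remain) right after the deployment step.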

Security:

  • NEVER commit private keys to Git
  • Use secrets management (Vault, AWS Secrets Manager, Sealed Secrets)
  • Least-privilege service accounts for certificate operations
  • Audit all certificate changes through Git history

Observability:

  • Metrics for certificate issuance success/failure
  • Alerts for failed certificate operations
  • Dashboard showing certificate inventory and expiry
  • Automated notifications for upcoming renewals

Common Anti-Patterns

Anti-pattern 1: Storing private keys in Git

# NEVER DO THIS
resource "aws_acm_certificate" "bad" {
  private_key      = file("private-key.pem")  # NEVER in Git!
  certificate_body = file("certificate.pem")
}

Correct approach:

# Reference an ACM-managed certificate by identifier; the private key never leaves ACM
data "aws_acm_certificate" "api" {
  domain      = "api.example.com"
  most_recent = true
}

resource "aws_lb_listener_certificate" "api" {
  listener_arn    = aws_lb_listener.https.arn
  certificate_arn = data.aws_acm_certificate.api.arn  # Reference only
}
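
Anti-pattern 1 is cheap to check for mechanically. A dedicated scanner such as gitleaks covers far more cases, but the core of the check is just a search for PEM private-key headers. The sketch below is a minimal illustration with hypothetical function names; it would not catch keys in other encodings.

```python
import re
from pathlib import Path

# Matches PKCS#8, RSA, EC, ENCRYPTED, and OPENSSH private-key PEM headers
PEM_KEY = re.compile(r"-----BEGIN (?:[A-Z0-9]+ )*PRIVATE KEY-----")


def contains_private_key(text: str) -> bool:
    """True if the text contains a PEM private-key header."""
    return PEM_KEY.search(text) is not None


def scan(root: str) -> list[str]:
    """Return paths under `root` (excluding .git) that contain a private-key header."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and ".git" not in path.parts:
            if contains_private_key(path.read_text(errors="ignore")):
                hits.append(str(path))
    return hits


print(contains_private_key("-----BEGIN EC PRIVATE KEY-----"))   # True
print(contains_private_key("-----BEGIN CERTIFICATE-----"))      # False
```

Running scan(".") as a pre-commit hook or CI step and failing on a non-empty result gives a first line of defense before a full secret scanner runs.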

Anti-pattern 2: Manual certificate operations mixed with automation

Half the certificates are automated and half are manual, with no clear boundary between them. This creates confusion about the source of truth and leads to drift.

Correct approach: Gradual migration - automate progressively, but maintain clear separation between automated and manual certificates until migration complete.

Anti-pattern 3: No policy enforcement

Allowing any certificate configuration in code without validation defeats the consistency benefit.

Correct approach: Policy-as-code with CI/CD validation. Automatically reject non-compliant certificates, provide clear error messages.

Business Impact

Cost of getting this wrong: Manual certificate management at scale costs $120K-$240K annually in labor alone (for 1,000 certificates). Without Certificate-as-Code, organizations experience 3-4 certificate-related outages per year, each costing $300K-$1M. Certificate provisioning becomes a deployment bottleneck, slowing feature velocity and time-to-market.

Value of getting this right: Certificate-as-Code reduces operational overhead by 90%, eliminates manual certificate-related outages, and enables rapid deployment velocity. Git-based audit trails simplify compliance (SOC 2, PCI-DSS), reducing audit preparation from weeks to hours. Infrastructure automation scales without linear cost increases.

Executive summary: See ROI of Automation for business case framework.


When to Bring in Expertise

You can probably handle this yourself if:

  • You have <500 certificates and single cloud environment
  • Team has strong IaC and GitOps experience
  • You're using mature tooling (cert-manager, Terraform cloud providers)
  • Simple use cases without complex policy requirements

Consider getting help if:

  • You have 1,000+ certificates or multi-cloud complexity
  • Need to implement policy-as-code with exception handling
  • Migrating from manual to automated certificate management
  • Team lacks Certificate-as-Code experience and needs training

Definitely call us if:

  • You have 5,000+ certificates across complex infrastructure
  • Need to integrate Certificate-as-Code with existing enterprise PKI
  • Implementing in regulated environment (financial services, healthcare)
  • Previous automation attempts failed and need troubleshooting

We've implemented Certificate-as-Code at an internet company (15,000+ certificates, Kubernetes/cert-manager), Deutsche Bank (multi-cloud, home-brew PKI service), and Barclays (enterprise PKI integration). We know where the complexity hides and what actually works at scale.

