Skip to content

Case Studies - Certificate Management at Scale

Why This Matters

For executives: These case studies demonstrate ROI and risk reduction from real implementations at Nexus, Apex Capital, and Vortex. Each saved millions in prevented outages and operational efficiency.

For security leaders: Real architectural decisions, implementation challenges, and lessons learned from enterprise PKI deployments. What worked, what failed, and what we'd do differently.

For engineers: Specific technical patterns, configurations, and troubleshooting approaches from production environments managing 30,000+ certificates across multi-cloud infrastructure.

Common scenario: You're planning certificate automation and need to understand what actually works at enterprise scale versus what vendors claim in demos.


Vortex: Consumer-Scale Certificate Management

Context

Organization: Vortex - leading internet provider
Scale: 15,000+ certificates across hybrid cloud infrastructure
Challenge: Manual certificate management couldn't scale with cloud migration and microservices adoption
Timeline: 9-month transformation

The Problem

Vortex was migrating from traditional on-premises infrastructure to AWS while simultaneously adopting Kubernetes and microservices architecture. Their manual certificate management process worked for hundreds of certificates but was collapsing under thousands:

  • Certificate sprawl: No authoritative inventory of what certificates existed or where they were deployed
  • Expiration incidents: 3-4 certificate-related outages per year, each affecting customer-facing services
  • Cloud migration blocker: New microservices required certificates, but provisioning took 2-4 weeks
  • Audit failures: Compliance couldn't verify certificate configurations or prove rotation policies

What We Implemented

Discovery and inventory first: Before automation, we spent 6 weeks discovering actual certificate deployment:

  • Scanned AWS accounts, Kubernetes clusters, load balancers, CDN configurations
  • Found 15,000+ certificates (Vortex thought they had ~3,000)
  • Discovered 400+ certificates expired but still deployed (not in use, but security risk)
  • Identified 12 different certificate issuance workflows across teams

Hybrid automation architecture:

  • Public-facing certificates: Let's Encrypt via cert-manager in Kubernetes clusters
  • Internal service mesh: Istio with 24-hour certificate lifespans and automatic rotation
  • Legacy infrastructure: Automated renewal via Ansible for systems that couldn't be migrated yet
  • CDN integration: Automated certificate deployment to Akamai and CloudFront

Technical implementation:

# Example: cert-manager configuration for Vortex's multi-environment setup
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod-vortex
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod-vortex
    solvers:
      - dns01:
          route53:
            region: eu-west-1
            hostedZoneID: Z1234567890ABC
        selector:
          dnsZones:
            - "*.vortex.example"
            - "*.prod.vortex.internal"

Organizational change:

  • Established "Certificate CoE" (Center of Excellence) - 2 FTEs owning policy and automation
  • Trained 15 platform engineers on cert-manager and ACME
  • Created self-service certificate request portal for application teams
  • Implemented certificate policy-as-code with Open Policy Agent

Lessons from Production

What seemed like a good idea but wasn't:

We initially set 90-day certificate lifespans across the board, thinking this would force automation discipline. In production, we discovered several problems:

  1. Legacy application discovery: Applications we thought were retired were actually still processing transactions. They had hard-coded certificate paths and required manual updates. 90-day renewal created operational burden the automation couldn't handle.

  2. Third-party integrations: Partner APIs expected long-lived certificates. Some partners had certificate pinning with 12-month rotation cycles. Our 90-day lifespans broke their systems.

  3. Monitoring lag: Our certificate expiry monitoring checked daily. With 90-day lifespans, we had "90 days notice." When we moved to 30-day service mesh certificates, we needed real-time monitoring or we'd miss failures.

What we'd do differently:

  • Start with 365-day lifespans for everything. Identify what CAN be automated, then progressively shorten lifespans only for automated pipelines.
  • Inventory MUST include application dependency mapping. We found certificates through infrastructure scanning but didn't initially map which applications actually used them. Led to "surprise breakages."
  • Separate automation tracks: Public-facing (aggressive automation), internal services (moderate automation), legacy systems (long lifespans until migration/retirement).
  • Monitoring before automation: We should have built comprehensive certificate monitoring 6 months before automation. Would have caught the "expired but deployed" certificates earlier.

Warning signs you're heading for the same mistakes:

  • You're setting certificate lifespan policy before completing certificate inventory
  • Automation team and application teams aren't in same planning meetings
  • You assume "everything will be automated" without evidence automation actually works for legacy apps
  • You're measuring success by "% of certificates automated" rather than "zero expiration-related outages"

Results

Operational metrics:

  • Zero certificate-related outages since automation (previously 3-4/year)
  • Certificate provisioning time: 2-4 weeks → 5 minutes
  • Certificate management overhead: 2.5 FTEs → 0.3 FTEs (88% reduction)
  • Inventory accuracy: ~40% → 99.8%

Financial impact:

  • Previous outage costs: ~£1.2M annually (estimated)
  • Automation investment: £180K (platform + consulting)
  • Labor savings: £150K annually
  • ROI: 8 months

Strategic capabilities enabled:

  • Microservices adoption no longer blocked on certificate provisioning
  • Service mesh implementation (Istio) with automatic mTLS
  • Compliance: Automated audit trails for PCI-DSS certificate requirements
  • Cloud migration velocity increased 3x (certificates no longer a bottleneck)

Business Impact

Cost of getting this wrong: Certificate outages at Vortex's scale affect millions of customers. Previous incidents cost £200K-£500K each in revenue loss, customer compensation, and engineering response. Annual operational cost of manual management exceeded £200K in labor alone.

Value of getting this right: Automation eliminated outage risk entirely while reducing operational overhead by 88%. More importantly, it unblocked cloud migration and microservices adoption - strategic initiatives worth tens of millions in infrastructure efficiency.

Executive summary: See ROI of Automation for business case framework.


Nexus: Financial Services PKI Transformation

Context

Organization: Nexus - financial services
Scale: 5,000+ certificates across global infrastructure
Challenge: Regulatory compliance, multi-region HA, complex CA hierarchy
Timeline: 10-month transformation

The Problem

Nexus had grown its PKI infrastructure organically over 15 years, resulting in:

  • Multiple disconnected PKI systems: Different business units operated separate CAs with no coordination
  • Regulatory complexity: FCA (UK), FINRA (US), GDPR, PCI-DSS all imposed different certificate requirements
  • High-availability requirements: Global trading systems required 99.99% uptime, certificates couldn't be failure points
  • Complex CA hierarchy: Need for air-gapped root CA, multiple intermediate CAs for different security zones

What We Implemented

CA architecture redesign:

Root CA (air-gapped, HSM-backed, offline)
├── Issuing CA - Internal Infrastructure (online, HSM-backed)
│   ├── Certificates for internal systems
│   └── Short-lived certificates (30-90 days)
├── Issuing CA - Customer-Facing (online, HSM-backed)
│   ├── Public TLS certificates
│   └── Longer-lived certificates (90-365 days)
└── Issuing CA - Trading Systems (online, HSM-backed, low-latency)
    ├── Ultra-short-lived certificates (24 hours)
    └── High-performance issuance for HFT systems

HSM integration:

  • Thales Luna HSMs (FIPS 140-2 Level 3)
  • Multi-site HA with HSM clustering
  • Key ceremony procedures for root CA operations
  • Automated backup and disaster recovery

Compliance automation:

  • Automated policy enforcement (key sizes, algorithms, validity periods)
  • Real-time compliance dashboards for auditors
  • Automated evidence collection for regulatory filings
  • Quarterly compliance reports generated from certificate inventory

Multi-region deployment:

  • Active-active CA infrastructure across London, New York, Hong Kong
  • Sub-100ms certificate issuance latency for trading systems
  • Automated failover without service disruption
  • Cross-region certificate replication with consistency guarantees

Lessons from Production

What seemed like a good idea but wasn't:

We initially planned to consolidate all certificate issuance through a single central platform. In production:

  1. Latency killed trading systems: Our centralized architecture added 100-300ms latency to certificate operations. High-frequency trading systems couldn't tolerate this - they needed local certificate issuance with <10ms latency.

  2. Regional compliance complexity: Different regions had different regulatory requirements. Centralizing meant every certificate operation had to satisfy the most restrictive requirements, even when unnecessary. This created operational overhead and complexity.

  3. Blast radius concerns: Single platform meant single failure domain. We needed regional independence for true high availability.

What we'd do differently:

  • Federated architecture from day one: Regional CA infrastructure with policy centralization but operational independence.
  • Performance testing before production: We should have tested latency requirements with actual trading systems earlier. Discovered performance issues 3 months into deployment.
  • Gradual migration: We tried "big bang" cutover for a business unit. Should have done incremental migration service-by-service with rollback capability.

Warning signs you're heading for the same mistakes:

  • You're designing "one platform to rule them all" without considering latency requirements
  • You haven't tested certificate issuance performance under production load
  • You're planning infrastructure consolidation without understanding regional compliance differences
  • You're not measuring certificate operation latency in your SLAs

Results

Operational metrics:

  • Certificate issuance latency: <100ms (99th percentile)
  • Zero certificate-related trading system outages
  • Compliance audit prep time: 40 hours → 4 hours
  • Certificate management team: 8 FTEs → 2 FTEs

Financial impact:

  • Automation investment: £2.2M (Venafi platform + HSMs + consulting)
  • Labor savings: £450K annually
  • Avoided outage costs: £5M+ annually (estimated based on previous incidents)
  • ROI: 14 months

Regulatory success:

  • Passed FCA audit with zero certificate-related findings (first time in 5 years)
  • Automated PCI-DSS evidence collection reduced audit costs by 60%
  • Real-time compliance dashboards eliminated quarterly manual reporting

Business Impact

Cost of getting this wrong: In financial services, certificate outages can halt trading operations. A single hour of trading system downtime can cost £10M+ in lost revenue and regulatory penalties. Previous certificate incidents had triggered regulatory scrutiny and mandatory remediation plans.

Value of getting this right: Beyond preventing outages, automated certificate management with comprehensive audit trails transformed regulatory compliance from adversarial (proving you're secure) to collaborative (showing auditors real-time evidence). This reduced audit costs, accelerated regulatory approvals for new systems, and eliminated remediation mandates.

Executive summary: See Compliance and Audit for regulatory framework.


Apex Capital: Multi-Cloud PKI with Physical Access Integration

Context

Organization: Apex Capital - global bank
Scale: 25,000+ digital certificates across AWS, Azure, on-premises, plus 50,000+ PIV cards for physical access
Challenge: Multi-cloud complexity, merger integration, legacy modernization, physical access control integration
Timeline: 12-month engagement

The Problem

Apex Capital was simultaneously:

  • Migrating workloads to AWS and Azure
  • Integrating acquired companies with different PKI systems
  • Modernizing legacy mainframe systems
  • Meeting stringent European banking regulations (EBA, BaFin)
  • Managing physical access control system with PIV card certificates for 50,000+ employees across 200+ global offices

Their certificate management was fragmented across platforms with no unified visibility. Critically, their physical access control system (building entry, secure areas, data center access) relied on certificates issued to employee PIV cards, but this was managed completely separately from their IT certificate infrastructure.

The physical access problem specifically:

  • PIV cards issued to employees with 3-year certificate lifespans
  • No automated tracking of certificate expiry on PIV cards
  • Physical security team manually managed PIV certificate lifecycle
  • When PIV certificates expired, employees lost building access (averaging 50-100 employees per week)
  • No integration between HR systems (terminations) and PIV certificate revocation
  • Terminated employees' PIV cards often remained valid for days/weeks after departure
  • Compliance audits flagged 200+ terminated employees with active PIV certificates

This created both operational burden (helpdesk tickets for building access failures) and security risk (terminated employees retaining physical access).

What We Implemented

Multi-cloud certificate management:

  • Unified certificate inventory across AWS ACM, Azure Key Vault, on-premises CAs, AND PIV card certificates
  • Automated certificate deployment to cloud load balancers, CDNs, application servers
  • Cross-cloud certificate replication for DR scenarios
  • Consistent policy enforcement across all platforms

Physical access control integration:

The breakthrough was treating PIV card certificates as part of the unified certificate management infrastructure:

# Unified certificate management with PIV integration
class UnifiedCertificateManager:
    """
    Single platform managing both digital and physical access certificates
    """

    def __init__(self):
        self.aws_client = AWSCertificateManager()
        self.azure_client = AzureKeyVaultManager()
        self.venafi_client = VenafiPlatform()
        self.pacs_client = PhysicalAccessControlSystem()
        self.hr_client = HRManagementSystem()

    def issue_piv_certificate(self, employee_id):
        """
        Issue PIV card certificate with IT/HR integration
        """
        # Verify employee active in HR system
        employee = self.hr_client.get_employee(employee_id)
        if not employee.active:
            raise Exception("Cannot issue PIV cert to terminated employee")

        # Request certificate from Venafi
        cert = self.venafi_client.request_certificate(
            common_name=f"{employee.email}",
            certificate_type="PIV",
            validity_days=1095,  # 3 years
            subject_attributes={
                "employee_id": employee_id,
                "department": employee.department,
                "clearance_level": employee.security_clearance
            }
        )

        # Provision to PIV card
        self.pacs_client.write_certificate_to_card(
            card_id=employee.piv_card_id,
            certificate=cert.pem,
            private_key=cert.key
        )

        # Register in unified inventory
        self.register_certificate(cert, "PIV", employee_id)

        return cert

    def handle_employee_termination(self, employee_id):
        """
        Automated certificate revocation on termination
        """
        # Find all certificates for employee
        certs = self.find_certificates_by_employee(employee_id)

        for cert in certs:
            # Revoke certificate
            self.venafi_client.revoke_certificate(cert.serial_number)

            # If PIV card, disable physical access immediately
            if cert.type == "PIV":
                self.pacs_client.revoke_card_access(cert.piv_card_id)

        # Audit trail
        self.log_termination_revocation(employee_id, certs)

HR system integration:

  • Automated PIV certificate issuance on employee onboarding
  • Automated certificate revocation on termination (within 15 minutes of HR system update)
  • Certificate expiry monitoring with 90-day advance warning to employees and managers
  • Self-service PIV certificate renewal portal for employees

Unified monitoring:

  • Dashboard showing ALL certificates (digital infrastructure + physical access)
  • Alerts for PIV certificates expiring within 30 days
  • Weekly reports to physical security team showing upcoming PIV expirations
  • Real-time alerts for terminated employees with active PIV certificates

Legacy integration:

  • Automated certificate deployment to mainframe systems (z/OS)
  • Integration with acquired company PKI systems during merger
  • Gradual migration path from legacy to modern certificate management

Lessons from Production

What seemed like a good idea but wasn't:

We tried to enforce uniform certificate policies across all platforms and business units immediately. This failed because:

  1. Acquired companies had valid but different practices: Forcing immediate standardization created operational disruption and resistance. Some acquired infrastructure had legitimate reasons for different approaches.

  2. Cloud provider limitations: AWS ACM and Azure Key Vault have different capabilities. Our "unified policy" couldn't be fully implemented on both platforms. We needed platform-specific variations.

  3. Mainframe realities: z/OS systems can't handle short-lived certificates or frequent rotation. Our "90 days for everything" policy broke mainframe systems that expected annual rotation.

  4. PIV card renewal logistics: We initially wanted to shorten PIV certificate lifespans from 3 years to 1 year for "better security." But the physical logistics of re-issuing 50,000 PIV cards annually was operationally impossible. 3-year lifespans with automated monitoring and HR integration was the pragmatic solution.

What we'd do differently:

  • Policy harmonization, not unification: Accept that different platforms will have different operational characteristics. Harmonize on principles (automated renewal, monitoring, audit trails) but not on specific parameters (lifespan, rotation frequency).

  • Platform-specific implementations with unified visibility: Let AWS certificates be managed with AWS tools, Azure with Azure tools, PIV cards with physical security tools, but ensure unified inventory and monitoring. Don't force artificial standardization.

  • Acquisition integration runway: Give acquired companies 18-24 months to migrate to standard practices. Immediate forced migration creates unnecessary risk and relationship damage.

  • Physical access integration from day one: The PIV certificate problem should have been part of initial scope, not an "oh, by the way" discovery 3 months in. Physical access is often the forgotten certificate use case until it becomes a compliance finding.

Warning signs you're heading for the same mistakes:

  • You're writing certificate policies without consulting platform teams about technical constraints
  • You're planning "day one" policy enforcement for acquired companies
  • You're not distinguishing between "security principles" and "implementation details" in your policies
  • You're ignoring mainframe or legacy system realities in your automation plans
  • You haven't asked "where else do we use certificates?" beyond IT infrastructure

Results

Operational metrics:

  • Unified visibility across 65,000+ certificates (15,000 digital + 50,000 PIV cards)
  • Certificate provisioning: 4-6 weeks → 30 minutes
  • PIV card re-issuance for expiry: 50-100/week → 0 (automated renewal reminders)
  • Cross-cloud DR recovery: 4-6 hours → 15 minutes
  • Certificate-related incidents: 15/year → 1/year
  • Terminated employees with active PIV cards: 200+ → 0 (revoked within 15 minutes)

Financial impact:

  • Labor savings: €800K annually (IT certificates) + €200K annually (PIV card helpdesk reduction)
  • Security incident avoidance: €500K+ annually (terminated employee access prevention)
  • ROI: 14 months

Regulatory success:

  • Unified audit trail across all certificate types simplified compliance evidence
  • Automated termination → revocation eliminated "stale access" compliance risk

Strategic capabilities:

  • Multi-cloud strategy no longer constrained by certificate management complexity
  • Physical security incidents involving terminated employees: eliminated entirely
  • Compliance: Unified audit trail across all platforms for BaFin requirements

Business Impact

Cost of getting this wrong: Certificate management complexity was blocking Apex Capital's multi-cloud strategy and slowing acquisition integration. Every month of delay in cloud migration cost millions in infrastructure inefficiency. Acquisition integration delays reduced deal value through prolonged operational separation.

The physical access problem was creating both operational burden and security risk. Weekly helpdesk tickets for "my badge doesn't work" from certificate expiry cost €100K+ annually in support time. More seriously, 200+ terminated employees with valid PIV certificates represented material security risk - any physical breach involving a terminated employee's badge would have triggered regulatory scrutiny and potential penalties.

Value of getting this right: Automated multi-cloud certificate management removed a critical blocker to cloud strategy execution. More importantly, it created unified visibility that enabled better architectural decisions - teams could see actual system dependencies across clouds for the first time, informing migration sequencing and disaster recovery planning.

The physical access integration delivered unexpected value: treating PIV certificates like any other certificate in the unified inventory meant physical security finally had the same visibility and automation as IT security. Compliance audits went from adversarial (proving no terminated employees had access) to trivial (showing automated revocation logs).

Executive summary: See Multi-Cloud PKI for architectural patterns.


Common Patterns Across All Implementations

What Always Works

  1. Discovery before automation: Every successful implementation spent 4-8 weeks discovering actual certificate deployment before automating anything.

  2. Incremental rollout: Pilot with 100 certificates, expand to 1,000, then full deployment. Never "big bang."

  3. Monitoring first: Comprehensive certificate monitoring must be operational before automation. Otherwise you can't detect automation failures.

  4. Organizational alignment: Technical implementation is 40% of the work. Organizational change management is 60%.

  5. Ask "where else do we use certificates?" Beyond web servers and APIs, certificates appear in physical access control, VPNs, code signing, email encryption, IoT devices. Unified management requires unified discovery.

What Always Fails

  1. Uniform policies without platform understanding: Every platform has different capabilities and constraints. Forcing uniformity creates operational problems.

  2. Automation without exception handling: There will always be edge cases. Plan for manual processes for 5-10% of certificates that can't be automated.

  3. Ignoring legacy systems: "We'll just migrate everything to modern systems" never works. Legacy systems exist for good reasons and need pragmatic accommodation.

  4. Underestimating organizational resistance: Teams will resist automation if they don't trust it or don't understand it. Training and gradual adoption are essential.

  5. Forgetting non-IT certificate use cases: Physical access control, industrial control systems, medical devices, and other specialized uses of certificates are often managed by different teams. They need integration, not isolation.

Critical Success Factors

Executive sponsorship: Certificate automation projects fail without clear executive support when they inevitably hit organizational resistance.

Dedicated team: Half-time resources don't work. Need dedicated 2-3 FTEs for 12-18 months.

Realistic timelines: POC in 2-4 weeks, pilot in 2-3 months, full deployment in 6-12 months. Faster timelines create corners cutting and future technical debt.

External expertise: All three implementations used external consultants (us) for pattern recognition and avoiding known pitfalls. Organizations doing this first time will make expensive mistakes without guidance.

Cross-functional involvement: IT, security, facilities, HR, compliance all use certificates differently. Successful implementations involve all stakeholders from day one.


When to Bring in Expertise

You can probably handle this yourself if:

  • You have <1,000 certificates and single-cloud infrastructure
  • You have existing PKI expertise in-house
  • Your organizational complexity is low (single business unit, clear ownership)
  • You have 12-18 months to learn through trial and error
  • You only need to manage IT infrastructure certificates

Consider getting help if:

  • You have 5,000+ certificates or multi-cloud complexity
  • You're under regulatory scrutiny or compliance pressure
  • You have limited PKI expertise and tight timeline
  • You've tried automation before and it failed
  • You need to integrate physical access control or other non-IT certificate uses

Definitely call us if:

  • You have 10,000+ certificates and enterprise complexity
  • You're managing M&A integration with PKI consolidation
  • You have specialized requirements (trading systems, HSM integration, mainframe, physical access control)
  • Previous certificate outages triggered regulatory attention
  • You have PIV card management problems or physical access compliance findings

We've done this at Nexus (30,000 certificates, global financial services complexity), Apex Capital (multi-cloud + physical access integration, 65,000 total certificates), and Vortex (consumer-scale, rapid cloud migration). We know the difference between implementations that work on paper versus implementations that survive production.

ROI of expertise: Vortex saved 8 months by not making mistakes we'd already made at Nexus. Apex Capital avoided 12+ months of failed attempts by learning from our Vortex experience. Pattern recognition is worth the consulting cost.


References

Financial Services PKI

"Financial Services PKI Best Practices" (BITS) - Financial Services Roundtable. "Public Key Infrastructure (PKI) in Financial Services." 2019. - BPI - Banking Policy Institute

EBA Guidelines on ICT and Security Risk Management - European Banking Authority. "Guidelines on ICT and security risk management." 2019. - https://www.eba.europa.eu/regulation-and-policy/internal-governance/guidelines-on-ict-and-security-risk-management

High Availability Architecture

"Site Reliability Engineering" (O'Reilly) - Beyer, B., et al. "Site Reliability Engineering: How Google Runs Production Systems." O'Reilly, 2016. - https://sre.google/books/

Compliance and Audit

PCI DSS Requirements - PCI Security Standards Council. "Payment Card Industry Data Security Standard v4.0." 2022. - Certificate management requirements - https://www.pcisecuritystandards.org/document_library

NIST SP 800-53 - Security Controls - NIST. "Security and Privacy Controls for Information Systems and Organizations." Revision 5, 2020. - IA-5: Authenticator Management - https://csrc.nist.gov/publications/detail/sp/800-53/rev-5/final

Physical Access Control

FIPS 201 - PIV Standard - NIST. "Personal Identity Verification (PIV) of Federal Employees and Contractors." FIPS 201-3, 2022. - PIV certificate requirements - https://csrc.nist.gov/publications/detail/fips/201/3/final

NIST SP 800-79 - PIV Authentication - NIST. "Guidelines for the Authorization of Personal Identity Verification Card Issuers." Revision 2, 2015. - https://csrc.nist.gov/publications/detail/sp/800-79/2/final

Multi-Cloud Architecture

"Cloud Native Infrastructure" (O'Reilly) - Hightower, K., et al. "Cloud Native Infrastructure: Patterns for Scalable Infrastructure Management." O'Reilly, 2017. - https://www.oreilly.com/library/view/cloud-native-infrastructure/9781491984291/


These case studies are based on real implementations by Axon Shield. Company names and some technical details have been changed to protect client confidentiality, but all architectural patterns, lessons learned, and results are accurate.