About me

Technology Engineering Leader with 15+ years guiding global DevOps and Site Reliability Engineering organizations through large-scale transformation. Known for building high-impact engineering teams, setting technical direction, and driving architectural decisions across complex distributed systems. Combines deep expertise in infrastructure, reliability engineering, automation, and observability with a leadership style centered on clarity, accountability, and continuous improvement. Skilled at defining SLIs/SLOs, shaping resilience strategies, and implementing frameworks that reduce detection and resolution times for mission-critical platforms. Aligns engineering vision with business priorities to strengthen operational excellence, accelerate delivery, and scale systems sustainably.

Flexible regarding Remote, Hybrid, or On‑Site roles across the U.S. (relocation assistance welcomed), with focused interest in opportunities in Portland, Denver, and Seattle.

LinkedIn Profile: https://www.linkedin.com/in/dchristilaw

What I'm doing

  • People Manager

    IT Management

    Providing leadership across on‑site, hybrid, remote, and offshore technology teams, ensuring cohesive execution, high‑quality delivery, and strong alignment with organizational priorities.

  • Operations

    Operations/Incident Management

    Providing strategic leadership across Operations, Incident Management, and Release Management, delivering faster engagement, mitigation, and resolution cycles and improving operational performance across the organization.

  • Release Management

    Release Management

    Providing strategic leadership for release management across monolithic and Agile development cycles, implementing process, documentation, and communication standards that ensure reliable delivery and meet compliance and security expectations.

  • Observability

    Observability/SRE

    Broad SRE and observability leadership across diverse system scales, applying context‑driven strategies that balance reliability, cost, and operational maturity.

Testimonials

  • Jordan Lee

    Dennis was a recent manager of mine and I wouldn't have asked for a better one. Never shot down ideas, always stood up for our team and greatly assisted us in moving projects forward. Sprint after Sprint. Dennis is very professional and knows what it takes to manage a team effectively. Coming from his many backgrounds in the Cloud space he has a lot of perspective to bring to the table!

    Testimonials on LinkedIn

  • Tessa Kottke Nagel

    I highly recommend Dennis Christilaw for any future opportunities. During our time working together at Icario, he demonstrated exceptional leadership, collaboration, and technical expertise. As an attentive leader, he fostered a positive, productive environment while consistently innovating and focusing on process improvements. His in-depth IT knowledge, combined with his eagerness to help and collaborate, made him an invaluable asset to the team. Dedicated, reliable, and always enjoyable to work with, Dennis was a standout contributor on every project and any organization is extremely lucky to have them on their team.


  • Nicki Roper, CTFL

    Dennis is very knowledgeable as a Jira administrator. He shares his Jira knowledge and best practices with others. He is also very knowledgeable about Change Request processes, the implementation of those processes, and the Change Advisory Board. He has the ability to wear multiple hats even if something is out of his defined role (i.e. Jira Admin and Governance board, CAB governance, SRE Manager, etc.). He would be a great asset to any IT organization.


  • Jeff Troha

    It was a pleasure to work under Dennis. He was always supportive of the team's needs and very laid back and easy to get along with. His years of experience and technical knowledge make him a great fit for management roles and his future co-workers will be lucky to have him.


Tools/Process

Skills

  • Cloud Platforms

    AWS, Microsoft Azure

  • Monitoring & Observability

    Datadog, Grafana, CloudWatch, ELK Stack, Splunk, Sumo Logic, KPIs/SLIs/SLOs/SLAs Ownership

  • CI/CD & Automation

    GitLab CI, GitHub Actions, Jenkins, AWS CodePipeline, Terraform

  • Leadership & Management

    Engineering Team Leadership, Incident Management, Release Management, Agile/Scrum, Vendor Management, Cross-functional Collaboration

  • Compliance

    HIPAA, PCI-DSS, NIST

  • Documentation and Reporting

    Confluence, Jira, KPI Dashboards, Knowledge Sharing, Executive Reporting

Resume

Experience

  1. Sr. Site Reliability Engineer

    May 2025 - Oct 2025


    • Defined and implemented core SRE standards for observability, performance, and reliability.
    • Led Datadog and CloudWatch operations for 700,000+ IoT devices, improving visibility and system reliability.
    • Built scalable monitoring for hundreds of Lambda functions and maintained Monitoring as Code via Terraform.
    • Delivered an observability roadmap that reduced downtime by 30% and accelerated issue resolution.
    • Engineered log analytics pipelines and executive dashboards providing unified, real-time system insights.
    • Architected and led the enterprise Incident Management program, including documentation, training, and audits.
    • Owned incident response, root cause analysis, and postmortems, driving long-term reliability improvements.
    • Reduced alert fatigue by 35% and improved triage accuracy through streamlined monitoring and alerting.
    • Directed a distributed SRE team, achieving 30% fewer outages and 40% faster root cause identification.
    • Automated high-volume operational tasks and remediation workflows, eliminating manual toil and improving resilience.
    • Mentored junior engineers, guided architectural decisions, and fostered a culture of continuous improvement.
    • Drove cross-team alignment to ensure reliability goals supported product and business priorities.

    In this Senior SRE role, I served as a technical leader responsible for establishing the foundational standards and systems that elevated reliability across a large‑scale, cloud‑native environment. I defined and implemented core SRE practices for observability, performance, and reliability, building the frameworks that engineering teams relied on to operate mission‑critical services with consistency and confidence.

    I led Datadog and CloudWatch operations for more than 700,000 IoT devices, architecting scalable monitoring for hundreds of Lambda functions and maintaining Monitoring‑as‑Code through Terraform. These efforts culminated in an observability roadmap that reduced downtime by 30% and significantly accelerated issue detection and resolution. I also engineered log analytics pipelines and executive dashboards that provided unified, real‑time visibility into system health, enabling faster decision‑making across engineering and leadership.

    Beyond observability, I architected and led the enterprise Incident Management program, establishing documentation, training, and audit processes that strengthened operational readiness. I owned incident response, root‑cause analysis, and postmortems, driving long‑term reliability improvements and reducing alert fatigue by 35% through streamlined monitoring and alerting strategies.

    I also directed a distributed SRE team, achieving 30% fewer outages and a 40% improvement in root‑cause identification speed. Through automation of high‑volume operational tasks and remediation workflows, I eliminated manual toil and improved system resilience. I mentored junior engineers, influenced architectural decisions, and fostered a culture of continuous improvement, ensuring that reliability goals remained aligned with product and business priorities.

  2. SRE Manager (Including: Incident/Release Manager)

    Aug 2022 - Jan 2025


    • Led high-performing, distributed SRE teams, driving engineering discipline, operational maturity, and measurable improvements in reliability across multiple product lines.
    • Promoted and enforced SRE best practices, including SLIs/SLOs, error budgets, and operational readiness, across product and platform engineering teams.
    • Architected and managed cloud-native infrastructure using Infrastructure as Code and DevOps principles, ensuring scalable, secure-by-default environments.
    • Designed and maintained security-focused system architectures, continuously improving security posture through automated guardrails and compliance controls.
    • Defined and enforced SLA/SLI/SLO standards for production systems, ensuring reliability commitments were met for both internal teams and external customers.
    • Built and maintained automated frameworks for provisioning, deployment, scaling, and monitoring, reducing manual toil and improving system resilience.
    • Led deep-dive troubleshooting efforts across application, infrastructure, and network layers, resolving complex production issues and preventing recurrence.
    • Directed proof-of-concept initiatives to evaluate emerging technologies and guide strategic adoption across engineering teams.
    • Implemented policy and compliance checks within CI/CD pipelines, strengthening audit readiness and ensuring HIPAA/HITRUST alignment.
    • Delivered observability and monitoring programs that reduced downtime by 28% and improved audit outcomes.
    • Refactored monitoring systems to eliminate redundancy, reduce alert noise, and cut tooling costs by 45%.
    • Developed data-driven dashboards that improved operational visibility and informed engineering and leadership decision-making.
    • Owned enterprise-wide Incident Response and Incident Management, including standardized troubleshooting practices, root cause analysis, and postmortems.

    As SRE Manager, I inherited a team struggling with foundational SRE and observability practices and transformed it into a high‑performing, metrics‑driven organization. By establishing clear standards, rebuilding the observability platform, and coaching the team on operational discipline, we reduced incident engagement and mitigation times from hours to minutes. As the platform matured, incident volume dropped by 40% in the first year and 65% in the second, reflecting both improved detection and stronger engineering quality across the stack.

    During a period when incident KPIs were eroding client trust, I designed and implemented a formal Incident Management program that aligned engineering, product, and customer‑facing teams. This framework exposed systemic pain points, enabled rapid corrective action, and brought engagement and mitigation KPIs down from several hours to minutes. The shift from reactive firefighting to proactive reliability engineering significantly improved customer confidence and strengthened cross‑team accountability.

    Recognizing the absence of a structured Release Management function, I introduced a unified release process that integrated SRE principles, deployment governance, and compliance requirements. This brought visibility to code and infrastructure changes, improved traceability, and elevated both code quality and system stability. The addition of documentation, audit trails, and standardized workflows ensured compliance readiness and reduced deployment‑related incidents.

    To support operational consistency, I overhauled the organization’s Jira implementation, transforming it from an underutilized tool into a standardized platform used across all engineering teams. By aligning workflows, improving documentation practices, and facilitating cross‑team training, I helped Scrum teams increase sprint throughput and deliver more work with higher predictability.

  3. DevOps Infrastructure Manager

    Jun 2020 - Aug 2022


    • Led a fully remote DevOps team across the US and offshore, increasing deployment velocity and improving operational coverage.
    • Designed and implemented infrastructure automation, reducing manual intervention and deployment times.
    • Built infrastructure-as-code pipelines for AWS observability, enabling consistent, repeatable, and scalable deployments.
    • Optimized engineering operations by standardizing processes and tooling, resulting in improved efficiency and system reliability.
    • Managed the Jira platform, streamlining workflows and improving productivity across engineering teams.
    • Monitored vulnerabilities and threats, implementing proactive alerting that improved security response times.
    • Developed Disaster Recovery and Business Continuity plans, strengthening resilience and regulatory compliance.
    • Owned release management processes, automating deployments and reducing release errors.
    • Implemented monitoring and alerting across environments, ensuring rapid incident detection and reduced downtime.
    • Created incident management and on-call processes, improving escalation workflows and reducing mean time to resolution (MTTR).
    • Built monitoring alerts, dashboards, and reports, enhancing observability and proactive issue resolution.
    • Documented automation templates and scripts in Confluence, improving knowledge sharing and team onboarding.
    • Managed AWS root and primary accounts, strengthening access controls and cloud infrastructure security.

  4. DevOps Infrastructure Architect (contract)

    Dec 2019 - Apr 2022


    • Enhanced site reliability to improve customer experience by increasing system scalability, performance, and uptime.
    • Collaborated with DevOps to automate infrastructure, accelerating deployment cycles and reducing manual errors.
    • Developed infrastructure-as-code for AWS using Terraform, enabling repeatable and consistent deployments.
    • Deployed infrastructure automation with Terraform, Ansible, Docker, and Kubernetes, improving environment scalability.
    • Led both on-site and remote DevOps teams, ensuring cross-location alignment and consistent delivery.
    • Managed release processes and implemented deployment automation, reducing errors and improving release confidence.
    • Designed monitoring and alerting systems across environments, improving issue detection and response time.
    • Built incident response and on-call processes, reducing MTTR and improving operational readiness.
    • Created alerts, dashboards, and reports for full-stack observability across all environments.
    • Documented automation scripts and infrastructure templates in Confluence to support onboarding and knowledge retention.
    • Role concluded due to company shutdown during the COVID-19 lockdown.

  5. DevOps Infrastructure Architect (contract)

    Sep 2019 - Dec 2019 - Short-Term Contract to fill a team skills gap


    • Developed reusable CloudFormation templates to automate infrastructure provisioning, accelerating deployment and minimizing configuration errors.
    • Documented automation templates and scripts in Confluence, enhancing team knowledge sharing and reducing onboarding duration.
    • Managed AWS Systems Manager to automate patching, strengthening server security and decreasing manual maintenance efforts.
    • Administered AWS Elasticsearch Service for centralized log aggregation, facilitating streamlined issue analysis.
    • Designed standardized deployment workflows across AWS accounts, reducing manual intervention and increasing efficiency.
    • Established monitoring alerts, dashboards, and reports with Dynatrace, enhancing visibility and response times across environments.
    • Oversaw scheduled maintenance across AWS environments, ensuring uptime and adherence to change management policies.

  6. DevOps Manager

    Aug 2018 - Sep 2019


    • Led DevOps team to automate processes, enhancing system availability and reliability.
    • Implemented Site Reliability Engineering practices, increasing application performance and uptime.
    • Developed security policies aligned with ISPME, PCI-DSS, and the US Cyber Security Framework, strengthening compliance.
    • Oversaw KYC operations utilizing in-house and third-party tools, enhancing identity verification processes.
    • Mentored junior DevOps engineers, improving productivity and technical expertise across the team.
    • Trained field service technicians on ATM troubleshooting, boosting first-time fix rates significantly.
    • Established documentation and ticketing procedures, enhancing issue tracking and operational consistency.
    • Managed remote hardware repairs for ATMs, minimizing technician dispatches and reducing downtime.

  7. Resume Details

    These are my most recent positions. For a complete resume covering my full technology experience, please request an updated copy using the link below or by emailing me.

  8. Resume Request

    For my FULL resume, please email a request to: dchristilaw@pm.me
    Thank you for taking the time to review my page!

Blog

Contact

Contact Form