yisusvii Blog

Agent Skills for SRE/DevOps: How Claude's Skills System Is Reshaping Infrastructure Engineering in 2026


The way SREs and DevOps engineers interact with AI tools changed fundamentally in early 2026 with the arrival of Agent Skills—a portable, declarative standard that gives Claude (and any agent implementing the spec) persistent, reusable domain knowledge. Instead of re-prompting Claude every time you need it to understand your org’s naming conventions, your IaC module patterns, or your cloud security baseline, you package that knowledge once as a SKILL.md file and load it on demand.

This post covers what Agent Skills are, how the ecosystem has grown around them, the most impactful functional skills for SRE and DevOps workflows today, and finishes with a complete worked example: a Terraform skill targeting Azure with built-in Microsoft Defender for Cloud compliance.



1. What Are Agent Skills?

An Agent Skill is a folder that contains a SKILL.md file (and optionally supporting scripts, examples, and reference files). When Claude loads a skill, it reads the frontmatter metadata and the markdown instructions, then applies them for the duration of the session or until the skill is unloaded.

The minimal SKILL.md structure looks like this:

---
name: my-skill-name
description: A clear description of what this skill does and when to use it
---

# My Skill Name

[Instructions that Claude will follow when this skill is active]

## Examples
- Example usage 1
- Example usage 2

## Guidelines
- Guideline 1
- Guideline 2

The frontmatter has only two required fields:

| Field | Purpose |
|---|---|
| `name` | A unique identifier (lowercase, hyphens) |
| `description` | Complete description of what the skill does and when Claude should activate it |

Everything in the markdown body becomes part of Claude’s active context and behavior for the session. You can include code snippets, decision trees, reference tables, and links to external scripts.
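As a rough illustration of the two required fields, here is a minimal Python sketch that parses the frontmatter of a SKILL.md and checks `name` and `description`. It is not the reference validator from the spec: the function names, the naming regex, and the hand-rolled parser (which only understands simple `key: value` lines and folded `>` values) are assumptions made for this example.

```python
import re

# Assumed rule for this sketch: lowercase letters/digits separated by single hyphens
NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def parse_frontmatter(text: str) -> dict[str, str]:
    """Parse the simple `key: value` frontmatter at the top of a SKILL.md."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' frontmatter fence")
    fields: dict[str, str] = {}
    current = None
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        if line[:1] in (" ", "\t") and current:
            # Indented line: continuation of a folded (`>`) value
            fields[current] = (fields[current] + " " + line.strip()).strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            current = key.strip()
            value = value.strip()
            fields[current] = "" if value in (">", "|") else value
    raise ValueError("frontmatter is missing its closing '---' fence")


def validate_skill(text: str) -> list[str]:
    """Return a list of problems; an empty list means both required fields look sane."""
    fields = parse_frontmatter(text)
    problems = []
    if not NAME_RE.match(fields.get("name", "")):
        problems.append("name must be lowercase letters/digits separated by hyphens")
    if not fields.get("description"):
        problems.append("description is required")
    return problems
```

A check like this could run as a pre-commit hook so that a malformed skill never reaches a shared repository.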

Where Skills Run

Skills work across all Claude surfaces in 2026:


2. The Skills Ecosystem in 2026

The Open Specification: agentskills.io

The Agent Skills format is now an open standard, published at agentskills.io/specification. Any LLM agent or coding assistant that implements the spec can consume SKILL.md files, which means skills you write today work across Claude and any other agent that adopts the standard.

Anthropic’s Official Skills Library

Anthropic maintains a reference implementation at github.com/anthropics/skills. The repository is organized into three areas:

Notable technical skills in the official library (as of Q1 2026):

| Skill | What It Does |
|---|---|
| `webapp-testing` | Playwright-based web app testing; manages server lifecycle automatically |
| `mcp-builder` | Scaffolds Model Context Protocol (MCP) servers from a description |
| `claude-api` | Teaches Claude best practices for calling its own API |
| `skill-creator` | Meta-skill: helps you create new skills from descriptions |

Installing Skills in Claude Code

# Add Anthropic's official marketplace as a plugin source
/plugin marketplace add anthropics/skills

# Browse and install the example skills bundle
/plugin install example-skills@anthropic-agent-skills

# Or install the document skills bundle instead
/plugin install document-skills@anthropic-agent-skills

# Use the skill — just mention it naturally
# "Use the webapp-testing skill to verify the login flow at localhost:3000"

Trend 1 — Platform Engineering and Internal Developer Platforms (IDPs)

Platform teams are building skills that encode their org’s golden-path patterns. A skill might know your team’s Helm chart structure, your approved image registries, your naming conventions, and your rollout policies. Every engineer who loads the skill gets the same context, eliminating the “platform documentation nobody reads” problem.

Usage pattern: Skills are checked into the platform monorepo and auto-installed in Claude Code for every engineer via a shared .claude-plugin manifest.

Trend 2 — Policy-as-Prompt: Security and Compliance Skills

Security teams are encoding compliance requirements (CIS Benchmarks, SOC 2 controls, NIST CSF, cloud provider security baselines) into skills. Claude reviews and generates code through the lens of the active security skill, surfacing issues inline rather than in a separate scan step.

Usage pattern: The security skill is loaded automatically in CI/CD pipelines via the Claude API before any IaC plan is approved.

Trend 3 — GitOps-Native Skills

Skills are versioned in Git alongside the infrastructure they describe. When a new module is added to the Terraform monorepo, the corresponding skill documentation is updated in the same PR. Skills become living documentation that Claude can act on.
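One way to keep skills and modules in step is a small PR check. The sketch below assumes a hypothetical repo layout (`modules/<name>/...` for Terraform modules, `skills/...` for skills) and a list of changed paths obtained elsewhere, for example from `git diff --name-only`. It only flags module changes that arrive without any skill change; it cannot tell whether the skill edit is the right one.

```python
from pathlib import PurePosixPath


def modules_missing_skill_update(
    changed_paths: list[str],
    modules_root: str = "modules",  # hypothetical layout
    skills_root: str = "skills",    # hypothetical layout
) -> list[str]:
    """Return the names of changed modules when the PR touches nothing under skills_root."""
    touched_modules = set()
    skills_touched = False
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if not parts:
            continue
        if parts[0] == skills_root:
            skills_touched = True
        elif parts[0] == modules_root and len(parts) > 2:
            # modules/<name>/<file...> -> remember <name>
            touched_modules.add(parts[1])
    return [] if skills_touched else sorted(touched_modules)
```

A CI job could fail, or just post a reminder comment, whenever the returned list is non-empty.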

Trend 4 — Multi-Cloud Skills Libraries

Enterprises are building skill libraries per cloud provider (one for Azure, one for AWS, one for GCP), each encoding that provider’s security defaults, resource naming rules, cost tagging standards, and approved services list. Engineers switch context by swapping skills, not rewriting prompts.
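Swapping skills per provider can be automated in a simple way. This sketch guesses which provider skills apply by looking at the resource type prefixes in a Terraform file. The mapping and the `skills/<cloud>` paths are invented for illustration, and a regex scan is only a heuristic, not a real HCL parser.

```python
import re

# Hypothetical mapping from Terraform resource prefix to a skill directory
SKILL_BY_PREFIX = {
    "azurerm": "skills/azure",
    "aws": "skills/aws",
    "google": "skills/gcp",
}

# Captures the provider prefix of `resource "aws_s3_bucket" ...` or `data "azurerm_..." ...`
RESOURCE_RE = re.compile(r'^\s*(?:resource|data)\s+"([a-z0-9]+)_', re.MULTILINE)


def skills_for_terraform(source: str) -> list[str]:
    """Return the skill directories that match the providers used in `source`."""
    prefixes = set(RESOURCE_RE.findall(source))
    return sorted(SKILL_BY_PREFIX[p] for p in prefixes if p in SKILL_BY_PREFIX)
```

For a file that mixes `azurerm_*` and `aws_*` resources, the function returns both skill directories, so a wrapper script could load both before starting a session.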

Trend 5 — Agentic SRE Workflows

In 2026, SRE teams are running Claude Code in headless mode (the -p / --print flag, or the API) on a schedule or triggered by alert webhooks. A skill provides the runbook knowledge; Claude handles the investigation and remediation autonomously, filing a GitHub issue with findings.

# Agentic incident investigation triggered by PagerDuty webhook
# Skills are discovered from the project's .claude/skills/ directory
# (here: .claude/skills/sre-runbooks and .claude/skills/kubernetes-ops)
claude -p \
  "Investigate the high-error-rate alert for service payments-api in namespace prod.
   Check pod logs, events, and recent deployments. File a GitHub issue with your findings."
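When the trigger is a webhook, the alert fields come from outside your control, so it helps to validate them before they reach a prompt. The sketch below builds the argument list for Claude Code's non-interactive print mode (`claude -p`). The function name, the label regex, and the idea that your alert payload carries `service`, `namespace`, and `alert` fields are assumptions for this example.

```python
import re

# Kubernetes-style label: lowercase alphanumerics and hyphens only
LABEL_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def build_investigation_command(service: str, namespace: str, alert: str) -> list[str]:
    """Build an argv list for `claude -p` from validated alert fields."""
    for field in (service, namespace, alert):
        if not LABEL_RE.match(field):
            raise ValueError(f"unexpected characters in alert field: {field!r}")
    prompt = (
        f"Investigate the {alert} alert for service {service} in namespace {namespace}. "
        "Check pod logs, events, and recent deployments. "
        "File a GitHub issue with your findings."
    )
    # An argv list (not a shell string) avoids shell interpolation of the prompt
    return ["claude", "-p", prompt]


# A webhook handler could then run:
#   subprocess.run(build_investigation_command(...), check=True)
```

Rejecting anything that is not a plain label is deliberately strict: a malformed alert should page a human rather than steer an autonomous agent.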

4. Functional Skills Every SRE/DevOps Engineer Should Know

Skill Category 1 — Kubernetes Operations

A Kubernetes ops skill teaches Claude your cluster topology, approved add-ons, naming conventions, and runbook patterns. When loaded, Claude can:

Why a skill beats a one-off prompt: The skill carries the cluster context permanently. You don’t re-explain that your prod cluster has 32-CPU nodes and uses Karpenter instead of Cluster Autoscaler on every query.

Skill Category 2 — Incident Response and Runbooks

SRE skills encode the runbook library. A skill for the payments-api service knows:

When an alert fires, the engineer loads the skill and asks Claude to triage — it returns a structured investigation plan with specific commands, not generic advice.

Skill Category 3 — Infrastructure as Code (IaC) Review

An IaC review skill carries your module standards, approved resource configurations, and forbidden patterns. Examples of what it enforces:

Claude applies these rules when generating or reviewing Terraform, giving you inline policy checks without a separate tool.

Skill Category 4 — CI/CD Pipeline Authoring

A CI/CD skill knows your approved Docker registries, your OIDC trust relationships, your environment promotion flow (dev → staging → prod), and your required pipeline stages (SAST, container scan, DAST, smoke test). It generates pipelines that match your org’s standards out of the box.

Skill Category 5 — Observability and Alerting

An observability skill encodes your Prometheus metrics naming conventions, your Grafana dashboard layout standards, your alerting severity taxonomy, and the PromQL patterns approved for SLO recording rules. Claude generates dashboards and alerts that slot directly into your existing stack.


5. How Engineers Are Creating Skills Today

The Skill Creation Workflow

The real-world pattern adopted by most platform teams:

  1. Start from the template — clone the template/SKILL.md from anthropics/skills
  2. Extract existing documentation — feed Claude your runbooks, wiki pages, and module READMEs; ask it to distill them into a SKILL.md
  3. Iterate interactively — load the skill in a Claude Code session, run realistic tasks, and refine the instructions where Claude’s output diverges from your standards
  4. Version in Git — store skills in a skills/ directory at the root of the relevant repo
  5. Share via the plugin marketplace — for org-wide skills, register them in your internal Claude plugin marketplace

Using the skill-creator Meta-Skill

Anthropic ships a skill-creator skill in the official library that automates step 2:

# Install the skill-creator meta-skill
/plugin install example-skills@anthropic-agent-skills

# Use it to generate a new skill from your documentation
"Use the skill-creator skill to create a new skill for our Kubernetes platform.
 Here are our platform docs: [paste or attach docs]
 The skill should cover: cluster topology, approved add-ons, naming conventions,
 and common runbook patterns."

Claude will produce a complete SKILL.md with structured frontmatter and instructions, ready to load or refine.

Anatomy of a Well-Structured SRE Skill

A good SRE or DevOps skill includes:

---
name: skill-name
description: [Precise description — this is used by Claude to decide when to activate the skill]
---

# Skill Title

## Context
[What environment, service, or system this skill applies to]

## Decision Tree
[A flowchart or numbered checklist Claude should follow for common tasks]

## Standards and Defaults
[Tables of approved values, naming conventions, required tags, etc.]

## Prohibited Patterns
[Explicit list of things Claude must never generate]

## Examples
[Two to three representative input/output examples]

## Reference Commands
[Frequently needed CLI commands with explanations]

The description frontmatter field is the most important part of any skill. It determines when Claude activates the skill automatically. Write it as a complete sentence that specifies the domain and use cases: “Use this skill when working with Terraform code targeting Azure infrastructure, specifically to enforce Microsoft Defender for Cloud compliance and CIS Azure Benchmark controls.”
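A quick structural lint can catch skills that drift from the anatomy above. This is only a sketch: the list of section titles is copied from the outline in this section, and treating them as required is a team convention, not something the Agent Skills spec mandates.

```python
import re

# Section titles taken from the outline above; adjust to your own template
RECOMMENDED_SECTIONS = [
    "Context",
    "Decision Tree",
    "Standards and Defaults",
    "Prohibited Patterns",
    "Examples",
    "Reference Commands",
]


def missing_sections(skill_md: str) -> list[str]:
    """Return the recommended level-2 headings that do not appear in the skill."""
    found = {title.strip() for title in re.findall(r"^##\s+(.+)$", skill_md, re.MULTILINE)}
    return [section for section in RECOMMENDED_SECTIONS if section not in found]
```

Running it against a draft gives a short to-do list before the skill goes out for review.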


6. Building a Custom Skill — Terraform Azure Best Practices with Cloud Defender Compliance

This section walks through building a production-ready skill for a Terraform monorepo targeting Azure, with explicit rules to avoid Microsoft Defender for Cloud (MDC) alerts.

Why Defender for Cloud Matters

Microsoft Defender for Cloud (formerly Azure Security Center) continuously scans Azure resources against the Microsoft Cloud Security Benchmark (MCSB) and applicable regulatory standards (CIS, PCI-DSS, ISO 27001, NIST). Every misconfiguration generates a security recommendation that degrades your secure score. Common Terraform patterns that trigger MDC alerts include:

| Resource Type | Common Alert | Terraform Fix |
|---|---|---|
| `azurerm_storage_account` | Secure transfer not enabled | `https_traffic_only_enabled = true` |
| `azurerm_storage_account` | Public blob access allowed | `allow_nested_items_to_be_public = false` |
| `azurerm_key_vault` | Soft delete disabled | `soft_delete_retention_days = 90` |
| `azurerm_key_vault` | Purge protection disabled | `purge_protection_enabled = true` |
| `azurerm_mssql_server` | TDE not using customer key | Set `transparent_data_encryption_key_vault_key_id` |
| `azurerm_mssql_server` | Auditing not enabled | Add an `azurerm_mssql_server_extended_auditing_policy` resource |
| `azurerm_network_security_group` | Inbound SSH/RDP open to 0.0.0.0/0 | Restrict source addresses |
| `azurerm_kubernetes_cluster` | RBAC disabled | `role_based_access_control_enabled = true` |
| `azurerm_kubernetes_cluster` | Azure AD integration missing | Add an `azure_active_directory_role_based_access_control` block |
| `azurerm_monitor_diagnostic_setting` | Missing for critical resources | Add diagnostic settings to all PaaS resources |
| `azurerm_managed_disk` | Not encrypted with CMK | Use `disk_encryption_set_id` |
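Some rows in this table are simple enough to check deterministically, which makes a useful cross-check on whatever Claude reports. The sketch below reads the `resource_changes` array from `terraform show -json` output and compares a few boolean attributes against the values in the table. The `RULES` dictionary and function name are made up for this example, and it covers only a handful of rows.

```python
# attribute, expected value, alert text (taken from the table above)
RULES = {
    "azurerm_storage_account": [
        ("https_traffic_only_enabled", True, "Secure transfer not enabled"),
        ("allow_nested_items_to_be_public", False, "Public blob access allowed"),
    ],
    "azurerm_key_vault": [
        ("purge_protection_enabled", True, "Purge protection disabled"),
    ],
    "azurerm_kubernetes_cluster": [
        ("role_based_access_control_enabled", True, "RBAC disabled"),
    ],
}


def find_violations(plan: dict) -> list[str]:
    """Scan a parsed `terraform show -json` plan for settings that differ from RULES."""
    findings = []
    for rc in plan.get("resource_changes", []):
        after = rc.get("change", {}).get("after")
        if after is None:  # resource is being destroyed; nothing to check
            continue
        for attr, expected, alert in RULES.get(rc.get("type"), []):
            # Values only known after apply are absent from `after` and get flagged too
            if after.get(attr) != expected:
                findings.append(
                    f'{rc["address"]}: {alert} (expected {attr} = {str(expected).lower()})'
                )
    return findings
```

Checks like this are cheap and repeatable; the skill adds value on the rows that need judgment, such as whether an NSG source range is appropriately scoped.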

The Complete SKILL.md for Azure Terraform

Below is a production-ready SKILL.md you can drop into your Terraform repo as skills/azure-terraform-defender/SKILL.md. The published version of this skill is available at github.com/YISUSVII/azure-terraform-defender:

---
name: azure-terraform-defender
description: >
  Use this skill when writing, reviewing, or refactoring Terraform code that
  provisions Azure resources. The skill enforces Microsoft Cloud Security
  Benchmark (MCSB) controls, CIS Azure Foundations Benchmark v2.0 rules, and
  Azure best practices that prevent Microsoft Defender for Cloud security
  recommendations from being raised. Activate whenever the user mentions
  Azure, azurerm provider, ARM, or Defender for Cloud in a Terraform context.
---

# Azure Terraform — Defender for Cloud Compliance Skill

## Scope

This skill applies to all Terraform code using the `azurerm` provider. It
encodes the controls required to maintain a clean Microsoft Defender for Cloud
secure score and avoid security recommendations across compute, storage,
networking, identity, and PaaS services.

## Mandatory Defaults

Apply ALL of the following defaults when generating or reviewing any resource
unless the user explicitly overrides them with a documented justification.

### Storage Accounts (`azurerm_storage_account`)

```hcl
resource "azurerm_storage_account" "example" {
  # ... required fields ...

  # MDC: "Secure transfer to storage accounts should be enabled"
  https_traffic_only_enabled = true

  # MDC: "Minimum TLS version should be TLS 1.2"
  min_tls_version = "TLS1_2"

  # MDC: "Storage account public access should be disallowed"
  allow_nested_items_to_be_public = false

  # MDC: "Storage accounts should use customer-managed key for encryption"
  # (configure via azurerm_storage_account_customer_managed_key if CMK required)

  # MDC: "Storage accounts should restrict network access"
  network_rules {
    default_action             = "Deny"
    bypass                     = ["AzureServices"]
    # ip_rules and virtual_network_subnet_ids added per use case
  }

  blob_properties {
    # MDC: "Soft delete for blobs should be enabled"
    delete_retention_policy {
      days = 30
    }
    # MDC: "Soft delete for containers should be enabled"
    container_delete_retention_policy {
      days = 30
    }
    versioning_enabled = true
  }
}
```

### Key Vaults (`azurerm_key_vault`)

```hcl
resource "azurerm_key_vault" "example" {
  # ... required fields ...

  # MDC: "Key vaults should have soft delete enabled"
  soft_delete_retention_days = 90

  # MDC: "Key vaults should have purge protection enabled"
  purge_protection_enabled = true

  # MDC: "Key vault firewall should be enabled"
  network_acls {
    default_action = "Deny"
    bypass         = ["AzureServices"]
  }

  # MDC: "Diagnostic logs in Key Vault should be enabled"
  # (add azurerm_monitor_diagnostic_setting separately)

  # Never set enable_rbac_authorization = false in prod
  enable_rbac_authorization = true
}
```

### SQL Servers (`azurerm_mssql_server`; the legacy `azurerm_sql_server` is not available in azurerm 4.x)

```hcl
resource "azurerm_mssql_server" "example" {
  # ... required fields ...

  # MDC: "An Azure Active Directory administrator should be provisioned"
  azuread_administrator {
    login_username = var.sql_admin_login
    object_id      = var.sql_admin_object_id
  }

  # MDC: "SQL servers should have auditing enabled"
  # Configure via azurerm_mssql_server_extended_auditing_policy

  # Never allow public network access unless explicitly required
  public_network_access_enabled = false

  minimum_tls_version = "1.2"
}

# MDC: "Auditing on SQL server should be enabled"
resource "azurerm_mssql_server_extended_auditing_policy" "example" {
  server_id                               = azurerm_mssql_server.example.id
  storage_endpoint                        = var.audit_storage_endpoint
  storage_account_access_key              = var.audit_storage_key
  storage_account_access_key_is_secondary = false
  retention_in_days                       = 90
}
```

### AKS Clusters (`azurerm_kubernetes_cluster`)

```hcl
resource "azurerm_kubernetes_cluster" "example" {
  # ... required fields ...

  # MDC: "Role-Based Access Control should be used on Kubernetes Services"
  role_based_access_control_enabled = true

  # MDC: "Azure Kubernetes Service clusters should have Defender profile enabled"
  microsoft_defender {
    log_analytics_workspace_id = var.log_analytics_workspace_id
  }

  # MDC: "Azure Active Directory integration should be enabled for AKS"
  azure_active_directory_role_based_access_control {
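    # `managed` applies to azurerm 3.x only; omit it on 4.x, where managed
    # integration is the only supported mode
    managed            = true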
    azure_rbac_enabled = true
  }

  # MDC: "AKS clusters should not allow container privilege escalation"
  # Enforce via Azure Policy add-on
  azure_policy_enabled = true

  # MDC: "AKS should use managed identities"
  identity {
    type = "SystemAssigned"
  }

  network_profile {
    network_plugin = "azure"
    network_policy = "azure"
    # MDC: "Authorized IP ranges should be defined on Kubernetes Services"
    # In production, set authorized_ip_ranges in a top-level
    # api_server_access_profile block (it does not belong in network_profile)
  }

  # MDC: "Kubernetes Services should be upgraded to a non-vulnerable version"
  # Always pin to a supported minor version and keep current
  kubernetes_version = var.kubernetes_version
}
```

### Network Security Groups (`azurerm_network_security_group`)

```hcl
# MDC: "Management ports should be closed on your virtual machines"
# MDC: "SSH access from the internet should be blocked"
# MDC: "RDP access from the internet should be blocked"
# NEVER generate rules with:
#   source_address_prefix = "*" or "Internet" for port 22 or 3389
#
# Always scope inbound management traffic to known CIDR ranges or
# use Azure Bastion instead.

# Prohibited pattern — NEVER generate this:
# resource "azurerm_network_security_rule" "bad" {
#   ...
#   access                     = "Allow"
#   direction                  = "Inbound"
#   protocol                   = "Tcp"
#   destination_port_range     = "22"
#   source_address_prefix      = "*"  # <-- PROHIBITED
# }
```

### Diagnostic Settings (All PaaS Resources)

```hcl
# MDC raises recommendations when diagnostic logs are missing on:
# Key Vault, Storage Account, SQL, AKS, App Service, Service Bus, etc.
# Always generate an azurerm_monitor_diagnostic_setting for every PaaS resource.

resource "azurerm_monitor_diagnostic_setting" "example" {
  name                       = "${var.resource_name}-diag"
  target_resource_id         = azurerm_key_vault.example.id
  log_analytics_workspace_id = var.log_analytics_workspace_id

  enabled_log {
    category = "AuditEvent"
  }

  metric {
    category = "AllMetrics"
    enabled  = true
  }
}
```

### Required Resource Tags

Every resource MUST include these tags unless it is an `azurerm_resource_group`
that sets them for the group:

```hcl
tags = {
  environment  = var.environment           # dev | staging | prod
  cost-center  = var.cost_center
  owner        = var.owner_email
  managed-by   = "terraform"
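  # timestamp() is re-evaluated on every run and produces a perpetual diff;
  # pair it with lifecycle { ignore_changes = [tags["created-date"]] } or use
  # time_static from the hashicorp/time provider instead.
  created-date = formatdate("YYYY-MM-DD", timestamp())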
}
```

## Prohibited Patterns

**Reject and flag any of the following when reviewing Terraform code:**

1. `https_traffic_only_enabled = false` on any storage account
2. `allow_nested_items_to_be_public = true` on any storage account
3. `purge_protection_enabled = false` on any key vault
4. `soft_delete_retention_days < 7` on any key vault
5. `public_network_access_enabled = true` on SQL servers without explicit CIDR allow-listing
6. NSG rules with `source_address_prefix = "*"` or `"Internet"` on ports 22, 3389, 5985, 5986
7. `role_based_access_control_enabled = false` on any AKS cluster
8. Service principals with client secret credentials instead of managed identities where Azure supports MSI
9. Any `azurerm_role_assignment` with `role_definition_name = "Owner"` — prefer least-privilege built-in roles
10. Resources deployed without the required tag set

## Review Checklist

When reviewing a Terraform plan or module, work through this checklist:

- [ ] All storage accounts: HTTPS-only, min TLS 1.2, no public blob, network deny-all default, soft delete enabled
- [ ] All key vaults: soft delete, purge protection, firewall deny-all default, RBAC auth
- [ ] All SQL/PostgreSQL servers: AAD admin, auditing policy, TLS 1.2 minimum, no public access
- [ ] All AKS clusters: RBAC, Azure AD integration, Defender profile, Azure Policy add-on, managed identity
- [ ] NSG rules: no wildcard source for management ports
- [ ] Diagnostic settings: present for every PaaS resource
- [ ] Tags: all required tags present on every resource
- [ ] No Owner/Contributor role assignments without documented justification

## Remediations for Common MDC Alerts

| MDC Recommendation | Terraform Resource to Add or Update |
|---|---|
| "Secure transfer to storage accounts should be enabled" | Set `https_traffic_only_enabled = true` |
| "Storage account public access should be disallowed" | Set `allow_nested_items_to_be_public = false` |
| "Key vaults should have purge protection enabled" | Set `purge_protection_enabled = true` |
| "SQL servers should have auditing enabled" | Add `azurerm_mssql_server_extended_auditing_policy` |
| "Diagnostic logs should be enabled" | Add `azurerm_monitor_diagnostic_setting` |
| "Management ports should be closed" | Remove or restrict NSG rules for ports 22/3389 |
| "AKS should use Azure AD integration" | Add `azure_active_directory_role_based_access_control` block |
| "Vulnerabilities in container images should be remediated" | Use `azurerm_container_registry` with `quarantine_policy_enabled = true` |

Using the Skill in Claude Code

Once the skill file is in place:

# Make the skill available to Claude Code as a project skill
# (Claude Code discovers project skills under .claude/skills/)
cd ~/repos/my-terraform-azure-repo
mkdir -p .claude/skills
cp -r ./skills/azure-terraform-defender .claude/skills/

# Now Claude Code applies Defender compliance rules automatically
# Example session:
> Write a Terraform module for a storage account used by our data pipeline.

# Claude will generate a fully compliant storage account with HTTPS enforcement,
# network deny rules, soft delete, versioning, and a diagnostic setting — without
# you having to specify any of this.

> Review ./modules/aks/main.tf and flag any Defender for Cloud issues.

# Claude loads the module, checks it against all rules in the skill,
# and returns a structured finding list with specific line references and fix suggestions.

Using the Skill via the Claude API

For CI/CD integration — running compliance checks on every PR:

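import os

import anthropic
from anthropic.lib import files_from_dir

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Skills are a beta API feature. These beta identifiers match the documentation
# at the time of writing and may change; check the current API docs.
SKILLS_BETAS = ["code-execution-2025-08-25", "skills-2025-10-02"]


def upload_skill() -> str:
    """Upload the skill folder once and return its ID (store it as a CI secret)."""
    skill = client.beta.skills.create(
        display_title="Azure Terraform Defender",
        files=files_from_dir("skills/azure-terraform-defender"),
        betas=["skills-2025-10-02"],
    )
    return skill.id


# Use the skill in a CI compliance check
def check_terraform_pr(changed_files: str) -> str:
    response = client.beta.messages.create(
        model="claude-opus-4-5",
        max_tokens=4096,
        betas=SKILLS_BETAS,
        # Custom skills are attached through the code execution container
        container={
            "skills": [
                {"type": "custom", "skill_id": os.environ["SKILL_ID"], "version": "latest"}
            ]
        },
        tools=[{"type": "code_execution_20250825", "name": "code_execution"}],
        messages=[
            {
                "role": "user",
                "content": (
                    "Review the following Terraform changes for Microsoft Defender "
                    "for Cloud compliance issues. For each finding, include: "
                    "file name, resource, the specific MDC recommendation it would "
                    "trigger, and the exact fix.\n\n"
                    f"```hcl\n{changed_files}\n```"
                ),
            }
        ],
    )
    # The response can interleave tool blocks with text; keep only the text
    return "\n".join(block.text for block in response.content if block.type == "text")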

# In your GitHub Actions workflow:
# - Run `terraform show -json` on the plan
# - Pass changed resources to check_terraform_pr()
# - Fail the PR if Claude returns any HIGH severity findings

GitHub Actions Integration

# .github/workflows/terraform-defender-check.yml
name: Terraform — Defender for Cloud Compliance

on:
  pull_request:
    paths:
      - '**.tf'
      - '**.tfvars'

jobs:
  defender-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

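      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~1.9"
          # Disable the wrapper so `terraform show -json` writes clean JSON to stdout
          terraform_wrapper: false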

      - name: Terraform Init
        run: terraform init -backend=false
        working-directory: ${{ github.workspace }}

      - name: Terraform Plan (no apply)
        run: |
          terraform plan -out=tfplan.binary
          terraform show -json tfplan.binary > tfplan.json
        working-directory: ${{ github.workspace }}

      - name: Run Defender Compliance Check via Claude
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          SKILL_ID: ${{ secrets.AZURE_TERRAFORM_SKILL_ID }}
        run: |
          pip install anthropic
          python .github/scripts/defender_check.py tfplan.json
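The workflow calls a `defender_check.py` script that is not shown here. A minimal sketch of its non-API parts might look like the following: it trims the plan JSON down to resources being created or updated, hands that to a reviewer callable (for example the `check_terraform_pr` function above), and returns a failing exit code on HIGH findings. The helper names and the `Severity: HIGH` output convention are assumptions; the convention only holds if your prompt asks Claude to use it.

```python
import json
import re
from typing import Callable


def changed_resources(plan: dict) -> str:
    """Serialize only the resources the plan creates or updates, to keep the prompt small."""
    keep = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        if set(change.get("actions", [])) & {"create", "update"}:
            keep.append(
                {"address": rc["address"], "type": rc["type"], "after": change.get("after")}
            )
    return json.dumps(keep, indent=2)


def has_high_severity(review: str) -> bool:
    """Assumes the prompt asked for a 'Severity: HIGH|MEDIUM|LOW' line per finding."""
    return re.search(r"severity\s*:\s*high\b", review, re.IGNORECASE) is not None


def main(plan_path: str, reviewer: Callable[[str], str]) -> int:
    """Return a process exit code: 1 if the review contains a HIGH finding, else 0."""
    with open(plan_path, encoding="utf-8") as f:
        plan = json.load(f)
    review = reviewer(changed_resources(plan))
    print(review)
    return 1 if has_high_severity(review) else 0


# In the real script: sys.exit(main(sys.argv[1], check_terraform_pr))
```

Note that `terraform show -json` prints sensitive values in plain text, so consider dropping or masking sensitive attributes in `changed_resources` before the data leaves your CI runner.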

7. Key Takeaways

Agent Skills represent a shift in how infrastructure engineers work with Claude. Rather than writing elaborate system prompts from scratch for every session, the community is building and sharing a reusable library of skills that encode deep domain knowledge — from Kubernetes cluster topology to cloud security baselines.

For SRE and DevOps engineers the most impactful applications in 2026 are:

The Azure Terraform Defender skill shown in this article is a working starting point — the published version is available at github.com/YISUSVII/azure-terraform-defender. Fork it, extend it with your org’s specific naming conventions and approved configurations, version it in your Terraform repo, and install it in every engineer’s Claude Code instance. Over time it becomes an always-current, always-available expert on your infrastructure standards.


Published March 2026. See also: Top Tech Publications to Follow in 2026.

