guest@ctrl-alt-secure: ~$

While learning Terraform, I remembered policy-as-code from a book (GRC Engineering for AWS), and decided to give it a shot!

What is Policy-as-Code? It's the practice of writing automated rules (policies) as code to enforce security, compliance, and best practices in your infrastructure. Instead of manually reviewing configurations, you write scripts that automatically check for issues before deployment.
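To make that concrete, here's a deliberately minimal, hypothetical example of a policy expressed as code (the function name and rule are illustrative, not from any real tool):

```python
import re

def check_ssh_open_to_world(terraform_source: str) -> list[str]:
    """Hypothetical minimal policy: flag SSH exposed to the whole internet."""
    findings = []
    open_to_world = '"0.0.0.0/0"' in terraform_source
    ssh_port = re.search(r'from_port\s*=\s*22\b', terraform_source)
    if open_to_world and ssh_port:
        findings.append("SSH (port 22) is reachable from the entire internet")
    return findings
```

A real policy engine is far more sophisticated, but the core idea is the same: a rule, encoded as a function, applied automatically to every configuration.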

Step 1: Creating My Test Cases

To learn how to write policy-as-code, I first needed something to test against. So, I created a few Terraform configs to simulate common security misconfigurations.

Test Case 1: The Exposed EC2 Instance

I started with a classic: an EC2 instance with a security group that allowed SSH access from anywhere on the internet (0.0.0.0/0). This is one of the most frequent and dangerous misconfigurations.

Why is this dangerous? When you allow SSH from 0.0.0.0/0, you're essentially leaving your front door unlocked to the entire internet. Automated bots constantly scan for open SSH ports, attempting brute-force attacks and exploiting weak credentials. This is a prime target for attackers.

ec2_insecure.tf:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# Insecure EC2 instance
resource "aws_instance" "unsecure_instance" {
  ami           = "ami-12345678"
  instance_type = "t2.micro"
  
  tags = {
    Name = "unsecure-instance-2024-11-08"
  }
}

# Security group allowing SSH from anywhere
resource "aws_security_group" "unsecure_sg" {
  name = "unsecure-sg-2024-11-08"
  
  # ISSUE: SSH open to the entire internet
  ingress {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # ← Anyone can try to SSH in!
  }
  
  egress {
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
  
  tags = {
    Name = "unsecure-sg-2024-11-08"
  }
}

Test Case 2: The Unencrypted S3 Bucket

Next, I created an S3 bucket with no server-side encryption. This is another common finding in security audits.

s3_no_encryption.tf:

# S3 bucket with no encryption
resource "aws_s3_bucket" "unsecure_bucket" {
  bucket = "no-encryption-bucket-2024-11-08"
  
  tags = {
    Name = "Bucket Without Encryption"
  }
}

# The problem: No encryption configuration at all.

With these known-bad configurations, I had my test subjects. Now I just needed a way to automatically detect them.

Step 2: The Challenge

Now that I had my test cases, a few questions arose:

  • How do I automatically detect these security issues?
  • What patterns should I look for in the Terraform code?
  • Can I build something that can explain WHY something is insecure?

Step 3: Building My Own Tool

I'd heard of tools like Checkov and OPA that make this process easier, but I wanted to do it myself using Python and my AI companion within Windsurf (my IDE of choice). Also, OPA looked overly complicated for an intro to policy-as-code; I'd probably need longer than a weekend to get used to it.

So I decided to write my own tool in Python to automatically scan Terraform files and flag security issues. Building it myself would help me understand how policy-as-code works under the hood.

Here is the directory structure of my tool:

project/
├── scripts/
│   ├── validate_terraform.py          # Main tool
│   └── policies/
│       ├── s3_policy.py               # S3 security checks
│       └── ec2_policy.py              # EC2 security checks
└── examples/
    ├── non-compliant/
    │   ├── ec2_insecure.tf            # Test case: Open SSH
    │   └── s3_no_encryption.tf        # Test case: No encryption
    └── compliant/
        ├── ec2_secure.tf              # Fixed: Restricted SSH
        └── s3_encrypted.tf            # Fixed: With encryption

How It Works (High Level)

Here's the workflow my tool follows:

┌─────────────────┐
│ Terraform Files │
│  (.tf files)    │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  File Scanner   │
│ (Find all .tf)  │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Regex Parser   │
│ (Extract AWS    │
│  resources)     │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Policy Checks   │
│ (S3, EC2 rules) │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│ Severity Check  │
│ (HIGH/MEDIUM)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Final Report   │
│ (Violations +   │
│  Warnings)      │
└─────────────────┘

1. Find Terraform files - Recursively scan directories for .tf files
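Python's pathlib makes this step almost trivial; here's a sketch of the scanner (the function name is mine for illustration, the real tool's helper may differ):

```python
from pathlib import Path

def find_terraform_files(root: str) -> list[Path]:
    """Recursively collect every .tf file under a directory."""
    return sorted(Path(root).rglob("*.tf"))
```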

2. Parse with regex - Use regular expressions to find AWS resources:

# Find S3 buckets
s3_buckets = re.finditer(r'resource\s+"aws_s3_bucket"\s+"([^"]+)"', content)

# Find security groups with 0.0.0.0/0
sg_rules = re.finditer(r'ingress\s*{[^}]*cidr_blocks\s*=\s*\["0\.0\.0\.0/0"\]', content)

3. Check security rules - Each policy module checks for specific issues:

  • S3: encryption, public access, versioning, logging
  • EC2: open ports, IAM profiles, hardcoded secrets
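As an illustration, an EC2 open-port check along these lines can be sketched like this (simplified; the real ec2_policy.py has more checks and different helper names):

```python
import re

# Ports that should rarely, if ever, be open to the internet
RISKY_PORTS = {22: "SSH", 3389: "RDP"}

def check_open_ports(content: str) -> list[str]:
    """Flag ingress rules that expose risky ports to 0.0.0.0/0."""
    findings = []
    # Grab each ingress block, then inspect its port and CIDR settings
    for block in re.finditer(r'ingress\s*{([^}]*)}', content):
        body = block.group(1)
        if '"0.0.0.0/0"' not in body:
            continue
        port_match = re.search(r'from_port\s*=\s*(\d+)', body)
        if port_match and int(port_match.group(1)) in RISKY_PORTS:
            service = RISKY_PORTS[int(port_match.group(1))]
            findings.append(f"{service} (port {port_match.group(1)}) open to the world")
    return findings
```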

4. Classify severity:

  • HIGH (Violations) - Critical issues that must be fixed
  • MEDIUM (Warnings) - Best practices that should be implemented
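Under the hood, severity classification can be as simple as routing findings into two lists, roughly like this (a sketch, not my exact class):

```python
from dataclasses import dataclass, field

@dataclass
class Finding:
    filename: str
    rule_id: str
    message: str
    severity: str  # "HIGH" or "MEDIUM"

@dataclass
class Report:
    violations: list = field(default_factory=list)  # HIGH findings
    warnings: list = field(default_factory=list)    # MEDIUM findings

    def add(self, finding: Finding) -> None:
        # HIGH findings become violations; everything else is a warning
        target = self.violations if finding.severity == "HIGH" else self.warnings
        target.append(finding)
```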

Important Note on Regex Limitations: While regex works for this learning project, it's like reading a book by only looking at individual words instead of understanding sentences. Here's what it misses:

  • Variables - If someone writes bucket = var.bucket_name, my regex can't tell what the actual bucket name is
  • Spread-out code - If the Terraform is formatted weirdly across multiple lines, my simple pattern matching breaks
  • Reusable modules - Terraform lets you reuse code blocks (like templates), but my tool can't follow those references
  • Conditional stuff - Sometimes resources only get created if certain conditions are met, and my tool can't understand that logic
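The second limitation is easy to demonstrate: the security-group regex from earlier matches nicely formatted code but silently misses the same rule written across multiple lines:

```python
import re

sg_pattern = r'ingress\s*{[^}]*cidr_blocks\s*=\s*\["0\.0\.0\.0/0"\]'

normal = 'ingress { from_port = 22 cidr_blocks = ["0.0.0.0/0"] }'
# The same rule, with the CIDR list split across lines and a trailing comma
weird = 'ingress {\n  from_port = 22\n  cidr_blocks = [\n    "0.0.0.0/0",\n  ]\n}'

print(bool(re.search(sg_pattern, normal)))  # True
print(bool(re.search(sg_pattern, weird)))   # False - formatting broke the match
```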

Why production tools are different: Tools like Checkov actually parse Terraform's language (HCL), kind of like how a grammar checker understands whole sentences rather than just individual words. For learning how policy-as-code works, my simple approach is fine. But for real projects, you need tools that truly understand the code.

Example: How I Check for S3 Encryption

Here's a snippet from my s3_policy.py showing how I detect unencrypted S3 buckets:

def check_s3_security(self, filename: str, content: str):
    """Check S3 bucket encryption configuration"""
    # Find all S3 bucket resources
    s3_buckets = re.finditer(r'resource\s+"aws_s3_bucket"\s+"([^"]+)"', content)
    
    for match in s3_buckets:
        bucket_name = match.group(1)
        
        # Check for an encryption resource that references THIS bucket,
        # not just any encryption resource somewhere in the file
        enc_pattern = (
            r'resource\s+"aws_s3_bucket_server_side_encryption_configuration"'
            r'[^{]*{[^}]*aws_s3_bucket\.' + re.escape(bucket_name) + r'\b'
        )
        if not re.search(enc_pattern, content):
            self.add_violation(filename, "S3_NO_ENCRYPTION",
                               f"S3 bucket '{bucket_name}' has no encryption configuration")

The key insight: In modern Terraform, S3 encryption requires a separate aws_s3_bucket_server_side_encryption_configuration resource. My policy looks for the bucket, then checks if the corresponding encryption resource exists.
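For reference, here's what a compliant pair looks like (similar in spirit to my s3_encrypted.tf; the names here are illustrative):

```hcl
resource "aws_s3_bucket" "secure_bucket" {
  bucket = "encrypted-bucket-2024-11-08"
}

# Linked to the bucket above via the bucket argument
resource "aws_s3_bucket_server_side_encryption_configuration" "secure_bucket" {
  bucket = aws_s3_bucket.secure_bucket.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}
```

The `bucket = aws_s3_bucket.secure_bucket.id` line is the link between the two resources.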

Step 4: Testing Against My Vulnerable Files

Time to see if my tool could catch the security issues I intentionally created:

python validate_terraform.py examples/non-compliant/

Output:

SECURITY VIOLATIONS (3):
1. [EC2_SSH_OPEN_TO_WORLD] HIGH
   File: ec2_insecure.tf
   Issue: Security group allows SSH (port 22) from 0.0.0.0/0

2. [S3_NO_ENCRYPTION] HIGH
   File: s3_no_encryption.tf
   Issue: S3 bucket 'unsecure_bucket' has no encryption configuration

3. [S3_NO_PUBLIC_ACCESS_BLOCK] HIGH
   File: s3_no_encryption.tf
   Issue: S3 bucket should have public access block configured

WARNINGS (3):
1. [EC2_NO_IAM_PROFILE] MEDIUM
   File: ec2_insecure.tf
   Issue: EC2 instance has no IAM instance profile attached

2. [S3_NO_VERSIONING] MEDIUM
   File: s3_no_encryption.tf
   Issue: S3 bucket has versioning disabled

3. [S3_NO_LOGGING] MEDIUM
   File: s3_no_encryption.tf
   Issue: S3 bucket should have access logging enabled

Success! My tool caught all the security issues I intentionally planted. The policy-as-code concept worked exactly as I hoped.

Edge Cases and Limitations

My simple tool doesn't handle some real-world scenarios:

Potential False Positives:

  • Bastion hosts might legitimately need SSH from 0.0.0.0/0 (with other controls like MFA)
  • Public websites might intentionally use unencrypted S3 buckets
  • Development environments might have relaxed security for testing

What I'm Missing:

  • Context about what the resource is used for
  • Other compensating controls (WAF, VPN, etc.)
  • Exception handling for approved violations

Project Stats

  • Time invested: ~6 focused hours over a weekend
    • Learning Terraform syntax and resource structures
    • Researching common AWS security misconfigurations (OWASP, CIS benchmarks)
    • Exploring existing policy-as-code tools (Checkov, OPA) to understand approaches
    • Collaborating with AI to understand the code - This was key! I didn't just generate code; I asked "why" questions about code I didn't understand, requested explanations of regex patterns, and had the AI walk me through design decisions. That turned code generation into a learning experience.
    • Writing and testing the Python tool
    • Debugging regex patterns and edge cases
  • Policies implemented: 8 security checks (4 for S3, 4 for EC2)
  • Test cases created: 4 Terraform files (2 vulnerable, 2 compliant)

What I Learned

Technical Skills:

  • Regex pattern matching for parsing Terraform
  • Python automation and CLI tool development
  • AWS security best practices (encryption, network security, IAM)
  • Why so many people avoid Rego

Key Insights:

  • S3 encryption architecture: In Terraform, you need TWO separate resources to encrypt an S3 bucket:

    1. aws_s3_bucket - Creates the bucket itself
    2. aws_s3_bucket_server_side_encryption_configuration - Enables encryption on that bucket

    This means my policy can't just look for one resource; it has to verify both exist and are linked.

  • 0.0.0.0/0 = the entire internet - This CIDR block means any IP address can attempt to connect (::/0 is the IPv6 equivalent)

  • Severity matters - Not all security issues are equal (HIGH violations vs MEDIUM warnings)

  • Policy-as-code scales - I see why these tools are so popular: write a policy once, and it can check thousands of configurations automatically.

Production Reality Check

While my learning project covers the basics, real-world policy-as-code must handle:

  • Terraform modules and remote state - Policies need to understand module calls and state references
  • Multiple cloud providers - AWS, Azure, GCP all have different resource structures
  • Performance at scale - Scanning 1000s of configs efficiently
  • Integration points - CI/CD pipelines, PR comments, Slack notifications

And more... I'm still learning, so no doubt I'm missing a few (or a lot) of things.

What's Next

This is just the beginning. When I return to this project, I plan to:

  1. Add more AWS services - RDS, Lambda, IAM policies
  2. Integrate with CI/CD - Block deployments with violations
  3. Explore OPA (Open Policy Agent) - Compare Rego vs Python for policies
  4. Deploy to real AWS - Test on actual infrastructure
  5. Add compliance frameworks - CIS, NIST, SOC2 controls

My Biggest Takeaway

Building this from scratch gave me something reading alone couldn't: I now understand how policy-as-code tools work under the hood.

Policy-as-code isn't mysterious - it's just applied programming to solve security problems at scale.


From "What's policy-as-code?" to "I built my own!" - one Terraform file, one Python script, one security rule at a time.