GitOps Overview

GitOps is a core component of the PloyD AI Gateway platform, providing declarative infrastructure management, automated deployments, and configuration drift detection for AI model serving at scale.

🎯 What is GitOps?

GitOps is an operational framework that takes DevOps best practices used for application development, such as version control, collaboration, compliance, and CI/CD, and applies them to infrastructure automation.

Key Benefits

🔄 Declarative Deployments

Git as single source of truth for all configurations

🚀 Automated CI/CD

Multi-environment promotion with automated validation

🔒 Security & Compliance

Policy enforcement and audit trails built-in

📊 Full Observability

Comprehensive monitoring and alerting

GitOps Architecture

GitOps Layer

YAML Configs
ArgoCD/Flux
CI/CD Pipeline
↓

PloyD SDK (Core Engine)

ModelServing
AIGateway
ModelRegistry
Security
Monitoring
↓

Infrastructure Layer

Kubernetes
Cloud APIs
Databases

GitOps Flow

1. Developer Commits: YAML configuration changes
2. CI/CD Validation: automated testing & security scanning
3. ArgoCD Sync: detects changes & deploys
4. Model Serving: AI models deployed & monitored

SDK Integration

✅ GitOps Uses PloyD SDK Internally

The GitOps implementation is built on top of the PloyD SDK as its core engine. GitOps acts as a declarative layer that translates YAML configurations into SDK API calls.

Translation Examples

GitOps YAML Configuration

apiVersion: ployd.ai/v1
kind: ModelDeployment
metadata:
  name: llama-7b-chat-v2-1
spec:
  model:
    framework: "vllm"
    path: "s3://models/llama-7b-chat-v2.1"
  resources:
    gpu:
      count: 2
      type: "nvidia-tesla-v100"
  scaling:
    minReplicas: 1
    maxReplicas: 5
Translates to →

PloyD SDK API Call

# Internal GitOps script uses PloyD SDK
import asyncio

from ployd import ModelServing


async def main():
    model_serving = ModelServing()

    # deploy_model() is awaited, so the call has to run inside an async function
    deployment = await model_serving.deploy_model(
        name="llama-7b-chat-v2-1",
        model_path="s3://models/llama-7b-chat-v2.1",
        framework="vllm",
        gpu_count=2,
        min_replicas=1,
        max_replicas=5,
        framework_config={
            "vllm": {
                "tensor_parallel_size": 2,
                "gpu_memory_utilization": 0.9
            }
        }
    )
    return deployment


asyncio.run(main())
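The translation step can be pictured as a plain mapping from manifest fields to keyword arguments. The following is a minimal sketch, not the actual PloyD implementation: the function name `manifest_to_deploy_kwargs` is hypothetical, the manifest is assumed to be already parsed from YAML into a dictionary, and the output keys simply mirror the `deploy_model()` arguments shown above.

```python
from typing import Any


def manifest_to_deploy_kwargs(manifest: dict[str, Any]) -> dict[str, Any]:
    """Map a parsed ModelDeployment manifest to deploy_model()-style kwargs."""
    if manifest.get("kind") != "ModelDeployment":
        raise ValueError(f"unsupported kind: {manifest.get('kind')!r}")

    spec = manifest["spec"]
    model = spec["model"]
    framework = model["framework"]

    kwargs: dict[str, Any] = {
        "name": manifest["metadata"]["name"],
        "model_path": model["path"],
        "framework": framework,
        "gpu_count": spec["resources"]["gpu"]["count"],
        "min_replicas": spec["scaling"]["minReplicas"],
        "max_replicas": spec["scaling"]["maxReplicas"],
    }
    # Framework-specific settings (for example a `vllm:` block under `model`)
    # are passed through only when the manifest provides them.
    if framework in model:
        kwargs["framework_config"] = {framework: dict(model[framework])}
    return kwargs


# The example manifest from this section, written as a Python dictionary.
example_manifest = {
    "apiVersion": "ployd.ai/v1",
    "kind": "ModelDeployment",
    "metadata": {"name": "llama-7b-chat-v2-1"},
    "spec": {
        "model": {"framework": "vllm", "path": "s3://models/llama-7b-chat-v2.1"},
        "resources": {"gpu": {"count": 2, "type": "nvidia-tesla-v100"}},
        "scaling": {"minReplicas": 1, "maxReplicas": 5},
    },
}

deploy_kwargs = manifest_to_deploy_kwargs(example_manifest)
print(deploy_kwargs)
```

The resulting dictionary could then be unpacked into the SDK call, for example `await model_serving.deploy_model(**deploy_kwargs)`, assuming the SDK accepts exactly these parameter names.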

SDK Components Used by GitOps

SDK Component | GitOps Usage                         | Purpose
ModelServing  | deploy_model(), scale_model()        | Deploy and manage model instances
AIGateway     | create_route(), update_route()       | Configure intelligent routing
ModelRegistry | register_model(), track_deployment() | Model lifecycle management
Security      | create_policy(), apply_rbac()        | Security and access control
Monitoring    | setup_alerts(), track_metrics()      | Observability and monitoring

Setup & Configuration

Prerequisites

✓ Kubernetes cluster (v1.24+)
✓ kubectl configured and connected
✓ Helm 3.x installed
✓ Git repository for configurations
✓ PloyD AI Gateway SDK installed

One-Command Setup

# Setup GitOps infrastructure
./scripts/gitops-setup.sh \
  --tool argocd \
  --environment production \
  --repo https://github.com/company/ployd-config

# This will:
# ✅ Install ArgoCD or Flux
# ✅ Configure repositories
# ✅ Set up monitoring
# ✅ Apply security policies
# ✅ Create initial applications

Repository Structure

ployd-platform-config/
├── environments/
│   ├── dev/
│   │   ├── applications/
│   │   ├── infrastructure/
│   │   └── policies/
│   ├── staging/
│   └── production/
├── clusters/
│   ├── on-premise/
│   ├── aws-us-west-2/
│   ├── azure-eastus/
│   └── gcp-us-central1/
└── shared/
    ├── monitoring/
    ├── security/
    └── networking/

Model Deployment with GitOps

Step 1: Register Model

# Register model in PloyD registry
import asyncio

from ployd import ModelRegistry


async def main():
    registry = ModelRegistry()

    # register_model() is awaited, so the call has to run inside an async function
    model_id = await registry.register_model(
        name="llama-7b-chat",
        version="v2.1",
        model_path="s3://company-models/llama-7b-chat-v2.1",
        framework="vllm",
        metadata={
            "accuracy": 0.92,
            "gpu_memory": "14GB",
            "tags": ["production-ready"]
        }
    )
    print(model_id)


asyncio.run(main())

Step 2: Create GitOps Configuration

# gitops/models/llama-7b-chat-deployment.yaml
apiVersion: ployd.ai/v1
kind: ModelDeployment
metadata:
  name: llama-7b-chat-v2-1
  namespace: ployd-ai-gateway
  labels:
    app.kubernetes.io/name: llama-7b-chat
    app.kubernetes.io/version: v2.1
    model.ployd.ai/framework: vllm
spec:
  # Model Registry Integration
  modelRegistry:
    modelId: "model_123456"
    version: "v2.1"
    
  # Model Configuration
  model:
    name: "llama-7b-chat"
    version: "v2.1"
    framework: "vllm"
    path: "s3://company-models/llama-7b-chat-v2.1"
    parameters:
      max_tokens: 2048
      temperature: 0.7
    vllm:
      tensor_parallel_size: 2
      max_model_len: 4096
      gpu_memory_utilization: 0.9
      
  # Resource Requirements
  resources:
    gpu:
      count: 2
      type: "nvidia-tesla-v100"
    cpu:
      requests: "4000m"
      limits: "8000m"
    memory:
      requests: "16Gi"
      limits: "32Gi"
      
  # Scaling Configuration
  scaling:
    minReplicas: 1
    maxReplicas: 5
    targetGPUUtilization: 70
    
  # Health Checks
  health:
    livenessProbe:
      httpGet:
        path: /health
        port: 8080
      initialDelaySeconds: 60
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      initialDelaySeconds: 30

Step 3: Commit and Deploy

# Commit configuration to Git
git add gitops/models/llama-7b-chat-deployment.yaml
git commit -m "Deploy llama-7b-chat v2.1"
git push origin main

# ArgoCD automatically:
# 1. Detects Git changes
# 2. Validates configuration
# 3. Calls PloyD SDK to deploy model
# 4. Monitors deployment health
# 5. Reports status

Gateway Routing Configuration

Route Configuration

# gitops/gateway/routes.yaml
apiVersion: ployd.ai/v1
kind: GatewayRoute
metadata:
  name: chat-api-v2
  namespace: ployd-ai-gateway
spec:
  # Route Configuration
  path: "/v2/chat"
  methods: ["POST", "OPTIONS"]
  
  # Model Backends with Traffic Splitting
  models:
    - name: "llama-7b-chat-v2-1"
      service: "llama-7b-chat-v2-1.ployd-ai-gateway.svc.cluster.local"
      port: 80
      weight: 90  # 90% traffic to new version
      priority: 1
      
    - name: "llama-7b-chat-v2-0"
      service: "llama-7b-chat-v2-0.ployd-ai-gateway.svc.cluster.local"
      port: 80
      weight: 10  # 10% traffic to old version (fallback)
      priority: 2
      fallback: true
  
  # Routing Strategy
  routing:
    strategy: "weighted_round_robin"
    timeout: "30s"
    retries: 3
    
  # Rate Limiting
  rateLimit:
    enabled: true
    global:
      rpm: 10000
      burst: 1000
    perClient:
      rpm: 1000
      burst: 100
      
  # Authentication
  authentication:
    required: true
    methods: ["api_key", "jwt"]
    
  # Circuit Breaker
  circuitBreaker:
    enabled: true
    failureThreshold: 5
    recoveryTimeout: "30s"
    
  # Monitoring
  monitoring:
    enabled: true
    metrics:
      - "request_latency"
      - "request_throughput"
      - "error_rate"
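How the gateway implements the `weighted_round_robin` strategy is not specified here. As one possible illustration, the sketch below uses the "smooth" weighted round-robin variant known from nginx, which spreads picks of the low-weight backend evenly instead of sending them in a burst. The class name and its interface are hypothetical.

```python
from collections import Counter


class SmoothWeightedRoundRobin:
    """Pick backends in proportion to their weights, interleaving the picks."""

    def __init__(self, weights: dict[str, int]) -> None:
        if not weights or any(w <= 0 for w in weights.values()):
            raise ValueError("weights must be a non-empty mapping of positive ints")
        self._weights = dict(weights)
        self._total = sum(weights.values())
        self._current = {name: 0 for name in weights}

    def pick(self) -> str:
        # Every backend gains its weight, the leader is chosen and then
        # pays back the total weight so the others can catch up.
        for name, weight in self._weights.items():
            self._current[name] += weight
        chosen = max(self._current, key=self._current.__getitem__)
        self._current[chosen] -= self._total
        return chosen


# Weights taken from the route configuration above (90% new, 10% old).
balancer = SmoothWeightedRoundRobin(
    {"llama-7b-chat-v2-1": 90, "llama-7b-chat-v2-0": 10}
)
counts = Counter(balancer.pick() for _ in range(100))
print(counts)
```

Over any 100 consecutive picks the split matches the configured 90/10 weights exactly, which makes the behavior easy to reason about during a gradual rollout.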

Advanced Routing Features

🎯 Intelligent Routing

  • Latency-based routing
  • Cost-optimized routing
  • Load balancing strategies

🔄 Traffic Management

  • Weighted traffic splitting
  • Canary deployments
  • Blue-green deployments

🛡️ Resilience

  • Circuit breakers
  • Automatic failover
  • Health checks
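The circuit breaker settings in the route configuration (`failureThreshold: 5`, `recoveryTimeout: "30s"`) can be read as a small state machine. The sketch below is a generic circuit breaker under stated assumptions, not the gateway's implementation: failures are counted consecutively, and after the recovery timeout a single trial request decides whether the circuit closes again. The class and method names are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock  # injectable so the timing can be tested
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "half_open"
        return "open"

    def allow_request(self) -> bool:
        # Requests are blocked only while the circuit is fully open.
        return self.state != "open"

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        if self.state == "half_open":
            # The trial request failed: reopen and restart the recovery timer.
            self._opened_at = self._clock()
            return
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30.0)
for _ in range(5):
    breaker.record_failure()
print(breaker.state, breaker.allow_request())
```

While the circuit is open, a gateway would typically send traffic to the backend marked as fallback instead of the failing one.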

📊 Observability

  • Real-time metrics
  • Distributed tracing
  • Custom dashboards

CI/CD Workflows

GitHub Actions Pipeline

# .github/workflows/gitops.yml
name: GitOps - PloyD Platform

on:
  push:
    branches: [main, develop]
    paths:
      - 'gitops/**'
      - 'infrastructure/**'

jobs:
  validate-gitops:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
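      - name: Validate Kubernetes manifests
        # Assumes kubeval is already installed on the runner.
        # --ignore-missing-schemas skips custom resources such as the ployd.ai/v1
        # kinds, for which kubeval has no bundled schema.
        run: |
          find gitops/ -name "*.yaml" -print0 | xargs -0 kubeval --ignore-missing-schemas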
      
      - name: Security scanning
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: 'fs'
          scan-ref: 'gitops/'

  deploy-production:
    needs: validate-gitops
    if: github.ref == 'refs/heads/main'
    environment: production
    runs-on: ubuntu-latest
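    steps:
      # Check out the repository so the deploy script and gitops/ configs exist on the runner
      - uses: actions/checkout@v4

      - name: Deploy via PloyD SDK
        run: |
          python scripts/deploy-models-gitops.py \
            --environment production \
            --config-path gitops/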

Environment Promotion

Promotion order: Development → Staging → Production

Environment | Branch  | Auto Deploy | Validation    | Resources
Development | develop | ✅ Yes      | Basic         | Minimal
Staging     | main    | ✅ Yes      | Full          | Production-like
Production  | main    | ❌ Manual   | Comprehensive | Full
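The promotion rules above amount to a lookup from branch to target environments. A minimal sketch follows, assuming a push-triggered pipeline. The names `Environment`, `ENVIRONMENTS` and `plan_for_branch` are hypothetical, as is the "await-approval" label used for the manual production step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    name: str
    branch: str
    auto_deploy: bool
    validation: str


# Values taken from the environment descriptions in this section.
ENVIRONMENTS = (
    Environment("development", "develop", True, "basic"),
    Environment("staging", "main", True, "full"),
    Environment("production", "main", False, "comprehensive"),
)


def plan_for_branch(branch: str) -> list[tuple[str, str]]:
    """Return (environment, action) pairs for a push to `branch`."""
    return [
        (env.name, "deploy" if env.auto_deploy else "await-approval")
        for env in ENVIRONMENTS
        if env.branch == branch
    ]


print(plan_for_branch("main"))
print(plan_for_branch("develop"))
```

Keeping the rules in one table like this makes it harder for the pipeline definition and the documented promotion policy to diverge.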

Monitoring & Observability

GitOps-Specific Alerts

# Prometheus alerting rules
groups:
  - name: gitops
    rules:
      - alert: GitOpsAppOutOfSync
        expr: argocd_app_info{sync_status!="Synced"} == 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "GitOps application {{ $labels.name }} is out of sync"
      
      - alert: GitOpsAppUnhealthy
        expr: argocd_app_info{health_status!="Healthy"} == 1
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "GitOps application {{ $labels.name }} is unhealthy"

Model Performance Monitoring

📈 Model Performance

  • Request latency and throughput
  • Model accuracy and drift detection
  • GPU utilization and memory usage
  • Error rates and failure patterns

🏗️ Infrastructure Health

  • Kubernetes pod status and restarts
  • Node resource utilization
  • Network connectivity and latency
  • Storage performance and capacity

🔄 GitOps Operations

  • Deployment frequency and success rate
  • Configuration drift detection
  • Sync status and health checks
  • Rollback frequency and causes
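Configuration drift detection compares the state declared in Git with the state actually running. The sketch below shows the core idea on plain dictionaries. It is a simplified illustration, not how ArgoCD or Flux compute diffs, and `find_drift` is a hypothetical name. Only fields declared in the desired state are compared, so fields the cluster adds on its own are not reported.

```python
from typing import Any


def find_drift(desired: Any, live: Any, path: str = "") -> list[str]:
    """List the paths where `live` differs from `desired`."""
    if isinstance(desired, dict):
        if not isinstance(live, dict):
            return [f"{path or '<root>'}: expected a mapping, found {live!r}"]
        drift = []
        for key, value in desired.items():
            child = f"{path}.{key}" if path else key
            if key not in live:
                drift.append(f"{child}: missing in live state")
            else:
                drift.extend(find_drift(value, live[key], child))
        return drift
    if desired != live:
        return [f"{path or '<root>'}: desired {desired!r}, live {live!r}"]
    return []


# Desired state from Git versus a live state that was changed by hand.
desired_state = {"scaling": {"minReplicas": 1, "maxReplicas": 5}}
live_state = {
    "scaling": {"minReplicas": 1, "maxReplicas": 8},
    "status": {"ready": True},  # added by the cluster, not declared in Git
}

drift_report = find_drift(desired_state, live_state)
print(drift_report)
```

A non-empty report corresponds to an out-of-sync application, which is the condition the `GitOpsAppOutOfSync` alert is meant to surface.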

Security & Compliance

Policy as Code

# policies/security/pod-security.rego
package kubernetes.admission

# Deny any container that does not explicitly set runAsNonRoot: true.
deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  not container.securityContext.runAsNonRoot
  msg := "Containers must not run as root"
}

# Deny any container that does not use a read-only root filesystem.
deny[msg] {
  input.request.kind.kind == "Pod"
  container := input.request.object.spec.containers[_]
  not container.securityContext.readOnlyRootFilesystem
  msg := "Containers must use read-only root filesystem"
}
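To try the intent of these two rules on a manifest without running an admission controller, the checks can be mirrored in a few lines of Python. This is a sketch for local experimentation and not a replacement for the policy engine. It uses the standard Kubernetes container `securityContext` fields `runAsNonRoot` and `readOnlyRootFilesystem`, ignores any pod-level `securityContext`, and `deny_reasons` is a hypothetical name.

```python
def deny_reasons(pod: dict) -> list[str]:
    """Return one message per violated rule and container."""
    reasons = []
    for container in pod.get("spec", {}).get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        # A missing field counts as a violation, the same as an explicit false.
        if ctx.get("runAsNonRoot") is not True:
            reasons.append(f"{name}: containers must not run as root")
        if ctx.get("readOnlyRootFilesystem") is not True:
            reasons.append(f"{name}: containers must use read-only root filesystem")
    return reasons


example_pod = {
    "kind": "Pod",
    "spec": {
        "containers": [
            {
                "name": "model-server",
                "securityContext": {
                    "runAsNonRoot": True,
                    "readOnlyRootFilesystem": True,
                },
            },
            {"name": "sidecar"},  # no securityContext at all
        ]
    },
}

for reason in deny_reasons(example_pod):
    print(reason)
```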

Security Features

🔐 Access Control

  • Git-based access control
  • RBAC for deployment permissions
  • Audit trail for all changes
  • Signed commits verification

🛡️ Model Security

  • Model artifact scanning
  • Encrypted storage and transmission
  • Runtime security monitoring
  • Compliance policy enforcement

🏗️ Infrastructure Security

  • Network policies for isolation
  • Secret management for API keys
  • Container image security scanning
  • Runtime security monitoring

Canary Deployments

Canary Configuration

# gitops/canary/llama-7b-chat-canary.yaml
apiVersion: ployd.ai/v1
kind: CanaryDeployment
metadata:
  name: llama-7b-chat-canary
spec:
  stable:
    model: "llama-7b-chat-v2-0"
    replicas: 3
  canary:
    model: "llama-7b-chat-v2-1"
    replicas: 1
  traffic:
    canaryWeight: 10  # Start with 10% traffic
    maxWeight: 100
    stepWeight: 10    # Increase by 10% each step
    interval: "5m"    # Wait 5 minutes between steps
  analysis:
    metrics:
    - name: "success_rate"
      threshold: 99.5
    - name: "latency_p95"
      threshold: 200
    - name: "error_rate"
      threshold: 0.1
    failureThreshold: 3
    successThreshold: 5
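The traffic and analysis settings above can be traced with a short simulation. The sketch below rests on several assumptions that the configuration does not state: `success_rate` is treated as a minimum, `latency_p95` and `error_rate` as maximums (units as configured), and a rollback is triggered once `failureThreshold` checks have failed in total. `successThreshold` is left out because its exact meaning is not described here. All function names are hypothetical.

```python
def weight_schedule(start: int, step: int, maximum: int) -> list[int]:
    """Canary traffic weights from `start` up to `maximum` in `step` increments."""
    if start <= 0 or step <= 0 or maximum < start:
        raise ValueError("expected 0 < start <= maximum and step > 0")
    weights = list(range(start, maximum, step))
    weights.append(maximum)  # always finish exactly at the maximum
    return weights


# Assumed direction per metric: "min" means the value must be at least the
# threshold, "max" means it must not exceed the threshold.
THRESHOLDS = {
    "success_rate": ("min", 99.5),
    "latency_p95": ("max", 200),
    "error_rate": ("max", 0.1),
}


def metrics_pass(sample: dict[str, float]) -> bool:
    for name, (direction, threshold) in THRESHOLDS.items():
        value = sample[name]
        if direction == "min" and value < threshold:
            return False
        if direction == "max" and value > threshold:
            return False
    return True


def decide(samples: list[dict[str, float]], failure_threshold: int = 3) -> str:
    """Return "rollback" once enough samples fail, otherwise "promote"."""
    failures = 0
    for sample in samples:
        if not metrics_pass(sample):
            failures += 1
            if failures >= failure_threshold:
                return "rollback"
    return "promote"


schedule = weight_schedule(start=10, step=10, maximum=100)
print(schedule)

healthy = {"success_rate": 99.9, "latency_p95": 150, "error_rate": 0.05}
degraded = {"success_rate": 98.0, "latency_p95": 350, "error_rate": 2.0}
print(decide([healthy] * 5))
print(decide([degraded, healthy, degraded, degraded]))
```

With `canaryWeight: 10` and `stepWeight: 10`, the schedule has ten steps, so at `interval: "5m"` the last step to 100% is reached 45 minutes after the first, provided no rollback occurs.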

Canary Process

1. Deploy with 0% Traffic: new model version deployed but receives no traffic
2. Health Validation: run health checks and validation tests
3. Gradual Traffic Increase: stepwise traffic progression, for example 5% → 25% → 50% → 100% (the sample configuration above starts at 10% and adds 10% every 5 minutes)
4. Metrics Monitoring: monitor performance and error rates
5. Automatic Decision: promote or roll back based on metrics

Disaster Recovery

Backup Strategy

📊 Database Backups

Automated backups with point-in-time recovery

Frequency: Hourly
Retention: 30 days

🤖 Model Artifacts

Cross-region replication of model files

Frequency: Daily
Retention: 90 days

⚙️ Configuration

GitOps repository with version control

Method: Git
Retention: Unlimited

Recovery Procedures

Scenario       | RTO      | RPO      | Procedure
Pod Failure    | < 1 min  | 0        | Kubernetes auto-restart
Node Failure   | < 5 min  | 0        | Auto-scaling + pod rescheduling
AZ Failure     | < 15 min | < 1 min  | Multi-AZ deployment
Region Failure | < 1 hour | < 15 min | Cross-region failover

Best Practices

Repository Management

📁 Separate Repositories

Use separate repos for infrastructure, platform config, and applications

🔒 Branch Protection

Enforce branch protection rules for main/production branches

👥 Required Reviews

Mandate code reviews for all configuration changes

🧪 Automated Testing

Run validation tests before merging changes
