Mistakes I made taking a disaster recovery plan off paper

Every company has a disaster recovery (DR) plan somewhere: a 50-page PDF, a runbook, step-by-step instructions. But when it is put to the test in a real incident, a large part of it (around 70%, in my experience) does not work.

I have designed DR plans on several projects, and on two of them the plan was tested by a real incident. In this post I go through the mistakes I made and what I learned.

Mistake 1: The plan was never tested

The most common one. The DR plan gets written and stays on paper. No real test is ever run.

In a real incident:
– The command written in the runbook does not work (the syntax has changed)
– The backup restore does not finish in the expected time
– The recovery server’s credentials have expired
– The DNS TTL is too high, so propagation takes hours
– Team members have rotated, and the people on call are seeing the runbook for the first time

Fix: Run a DR drill every quarter, in a staging environment that is close to production, with a realistic simulation.
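
One way to keep the quarterly cadence honest is to record when each runbook was last exercised and flag the stale ones. A minimal sketch, assuming a hypothetical record of last drill dates per runbook; the runbook names and the 90-day threshold are illustrative and not taken from any real system:

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
MAX_AGE = timedelta(days=90)


def stale_runbooks(last_drilled: dict[str, date], today: date) -> list[str]:
    """Return the names of runbooks whose last drill is older than MAX_AGE."""
    return sorted(
        name
        for name, drilled in last_drilled.items()
        if today - drilled > MAX_AGE
    )


# Illustrative data only: hypothetical runbooks and dates.
last_drilled = {
    "database-failover": date(2024, 1, 10),
    "dns-cutover": date(2024, 5, 2),
}
print(stale_runbooks(last_drilled, today=date(2024, 6, 1)))
```

A check like this could run in CI or as a scheduled job, so a skipped drill shows up as a failing check rather than as a surprise during an incident.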

Mistake 2: RTO and RPO are undefined

RTO (Recovery Time Objective): how much downtime is tolerable? (“4 hours”)

RPO (Recovery Point Objective): how much data loss is tolerable? (“15 minutes”)

These two are the foundation of the whole plan. While they are undefined:
– Backup frequency is undefined
– Replication strategy is undefined
– The level of investment is undefined

Fix: Talk to the business. Is “8 hours of downtime” OK, or does it have to be “back up within 2 hours”? An RPO of “24 hours” and an RPO of “15 minutes” lead to completely different architectures.
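
Once the targets are agreed, they can be turned into simple checks against the current setup. The sketch below is a simplification under two stated assumptions: worst-case data loss with periodic backups is about one full backup interval, and recovery time is the sum of detection, restore, and verification. The function names and the sample durations are hypothetical.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Assumption: worst-case data loss is roughly one full backup interval."""
    return backup_interval <= rpo


def meets_rto(
    detect: timedelta, restore: timedelta, verify: timedelta, rto: timedelta
) -> bool:
    """Assumption: recovery time = detection + restore + verification."""
    return detect + restore + verify <= rto


# Nightly backups cannot satisfy a 15-minute RPO.
print(meets_rpo(timedelta(hours=24), rpo=timedelta(minutes=15)))

# 20 min to detect, 3 h to restore, 30 min to verify, against a 4-hour RTO.
print(
    meets_rto(
        detect=timedelta(minutes=20),
        restore=timedelta(hours=3),
        verify=timedelta(minutes=30),
        rto=timedelta(hours=4),
    )
)
```

The first check fails and the second passes, which illustrates why a 15-minute RPO usually pushes the design toward continuous replication rather than scheduled backups.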

Mistake 3: Backups were never tested

You keep backups, but you have never tried to restore from them.

What a real test turns up:
– The backup is corrupt (a silent failure)
– The restore script does not work in the prod environment
– Nobody remembers the backup decryption key
– Nobody has permission to access the backup storage
– The backups are incremental, and a full restore is not possible

Fix: Run a backup restore test once a month. Restore to a staging environment, verify data integrity, and automate the whole thing.
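
The integrity step of such a restore test can be sketched as a checksum comparison. This assumes a manifest of SHA-256 checksums was recorded when the backup was taken; the manifest format, the file names, and the `demo` helper are all hypothetical, and a real database restore would need its own verification on top of this.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large dumps do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(restore_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Return relative paths that are missing or whose checksum differs."""
    problems = []
    for rel_path, expected in manifest.items():
        restored = restore_dir / rel_path
        if not restored.is_file() or sha256_of(restored) != expected:
            problems.append(rel_path)
    return sorted(problems)


def demo() -> list[str]:
    """Simulate a restore with one corrupted file and one missing file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "users.sql").write_bytes(b"good data")
        (root / "orders.sql").write_bytes(b"corrupted")
        manifest = {
            "users.sql": hashlib.sha256(b"good data").hexdigest(),
            "orders.sql": hashlib.sha256(b"original orders").hexdigest(),
            "audit.sql": hashlib.sha256(b"never restored").hexdigest(),
        }
        return verify_restore(root, manifest)


print(demo())
```

The demo reports `audit.sql` (missing) and `orders.sql` (checksum mismatch), which is the kind of silent failure that never shows up until someone actually restores.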

Mistake 4: Hidden single points of failure (SPOFs)

While you are writing the plan, the obvious SPOFs look handled. In a real incident the hidden ones surface.

Example SPOFs:
– A single CI/CD runner (required for prod deploys but not covered by DR)
– A single DNS provider (what do you do if Cloudflare goes down?)
– A single certificate authority
– A single secrets manager instance
– A single monitoring provider

Fix: Dependency mapping. Ask “what exactly does it take to keep prod up?” and write a failover plan for every dependency.
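
Dependency mapping can start as a plain dictionary and a graph walk. The sketch below assumes a hypothetical, hand-maintained map of services to their dependencies and a set of dependencies that already have a documented failover; none of the names come from a real system.

```python
# Hypothetical dependency map: each service lists what it needs to stay up.
DEPENDS_ON = {
    "prod-api": ["dns", "database", "deploy-pipeline"],
    "deploy-pipeline": ["ci-runner", "secrets-manager"],
    "database": ["secrets-manager"],
}

# Hypothetical: dependencies that already have a documented failover plan.
HAS_FAILOVER = {"database", "dns"}


def transitive_dependencies(service: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Collect direct and indirect dependencies with an iterative graph walk."""
    seen: set[str] = set()
    stack = list(depends_on.get(service, []))
    while stack:
        dep = stack.pop()
        if dep not in seen:  # also protects against dependency cycles
            seen.add(dep)
            stack.extend(depends_on.get(dep, []))
    return seen


def hidden_spofs(
    service: str, depends_on: dict[str, list[str]], has_failover: set[str]
) -> list[str]:
    """Dependencies (direct or indirect) that have no failover plan."""
    return sorted(transitive_dependencies(service, depends_on) - has_failover)


print(hidden_spofs("prod-api", DEPENDS_ON, HAS_FAILOVER))
```

The point of walking the graph transitively is that the CI runner and the secrets manager appear in the result even though `prod-api` never lists them directly.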

Mistake 5: Procedural knowledge gap in the team

The DR plan exists as a runbook, but only one or two people on the team know it in detail. When the incident hits, those one or two people are on vacation or unreachable.

Fix:
– Rotation (everyone should have run the runbook at least once)
– Step-by-step, assumption-free documentation
– Quarterly game days
– New joiners paired with an experienced engineer on the on-call rotation

Mistake 6: No communication plan

The incident has started. Who do you call?

Typical gaps:
– No customer communication plan
– No stakeholder notification template
– No incident channel (Slack) prepared
– The status page is not updated automatically
– It is unclear who handles post-incident communication

Fix:

Incident communication ladder: who tells whom, and when.
Email templates: written in advance, customized to the situation.
Status page: automated plus manual updates, in real time.
Incident Slack channel: created in advance.
Customer communication: follow the “We’re aware, investigating” → “ETA” → “Resolved” pattern.
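
The three-stage customer pattern above lends itself to pre-written templates. A minimal sketch using Python's standard `string.Template`; the wording of the messages, the stage keys, and the `render_update` helper are hypothetical:

```python
from string import Template

# Hypothetical pre-written messages, one per stage of the pattern.
STAGES = {
    "investigating": Template(
        "We are aware of an issue affecting $service and are investigating."
    ),
    "eta": Template(
        "We have identified the issue affecting $service. Estimated recovery: $eta."
    ),
    "resolved": Template("The issue affecting $service has been resolved."),
}


def render_update(stage: str, **fields: str) -> str:
    """Fill in a stage template; a missing field raises KeyError."""
    return STAGES[stage].substitute(**fields)


print(render_update("investigating", service="checkout"))
print(render_update("eta", service="checkout", eta="14:30 UTC"))
print(render_update("resolved", service="checkout"))
```

Using `substitute` rather than `safe_substitute` is deliberate here: a message with an unfilled `$eta` fails loudly instead of being sent to customers.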

Mistake 7: No handling for partial failures

The DR plan is written for a total outage. But most of the time (roughly 80%, in my experience) what you get is a partial failure: one service is down, the others are fine.

Applying a full failover to a partial failure is overkill, and it creates an additional outage.

Fix:
– Graduated response. Limited action for a partial failure, the DR plan for a full failure.
– Service-level failover. Bring the one service back from its standby while the rest of the cluster keeps running.
– Decision framework. “Unless the full-DR trigger conditions X, Y, Z are met, use the graduated response.”
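
A decision framework like this can be written down as a small function so the trigger conditions are explicit before the incident. The conditions below are placeholders for the “X, Y, Z” above and are entirely hypothetical: an unreachable primary region, an unhealthy data store, or at least half of the services down.

```python
from dataclasses import dataclass


@dataclass
class IncidentState:
    services_down: int
    services_total: int
    primary_region_reachable: bool
    data_store_healthy: bool


def choose_response(state: IncidentState, full_dr_fraction: float = 0.5) -> str:
    """Pick a response level; the trigger conditions are illustrative assumptions."""
    # Hypothetical triggers X and Y: region or data store is gone.
    if not state.primary_region_reachable or not state.data_store_healthy:
        return "full-dr"
    # Hypothetical trigger Z: too large a share of services is down.
    if (
        state.services_total > 0
        and state.services_down / state.services_total >= full_dr_fraction
    ):
        return "full-dr"
    if state.services_down > 0:
        return "service-failover"
    return "no-action"


print(choose_response(IncidentState(1, 10, True, True)))
print(choose_response(IncidentState(6, 10, True, True)))
```

With one of ten services down the function picks a service-level failover; with six down it escalates to the full DR plan.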

Mistake 8: Skipping the post-incident review

The incident is resolved. The team is relieved: “it’s over, move on”. No blameless postmortem is held.

This step is critical:
– Root cause analysis
– Contributing factors
– What went well
– What could improve
– Action items (tracked to completion)

Without a postmortem, the risk of the incident repeating is high.

Fix: Make a postmortem mandatory for every incident, or at least for every major one. Template:

Timeline (what happened when)
Impact (customer, business, technical)
Detection (how discovered)
Resolution (what fixed it)
Root cause (5 whys)
Action items (prevent + detect + mitigate)

Mistake 9: Treating recovery as the same thing as restore

Recovery is not just a data restore. Full service recovery includes:

Data integrity check
Restart ordering for dependent services
Cache warming (a cold cache means slow performance)
User session handling (lost vs. persistent sessions)
Re-enabling monitoring
Resetting alert silences
Customer communication update

Fix: Checklist-based recovery. “Restore completed” is only the first item.
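
Two pieces of that checklist are easy to make mechanical: the restart order and the list of steps still open. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to derive a restart order; the service names, their dependencies, and the checklist wording are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical services: each maps to the services that must be up first.
STARTS_AFTER = {
    "database": set(),
    "cache": {"database"},
    "api": {"cache"},
    "frontend": {"api"},
}

# Hypothetical checklist wording, following the items listed above.
RECOVERY_CHECKLIST = [
    "restore completed",
    "data integrity check",
    "services restarted in order",
    "cache warmed",
    "sessions handled",
    "monitoring re-enabled",
    "alert silences reset",
    "customer communication updated",
]


def restart_order(starts_after: dict[str, set[str]]) -> list[str]:
    """Order services so every service starts after its prerequisites."""
    return list(TopologicalSorter(starts_after).static_order())


def remaining_steps(done: set[str]) -> list[str]:
    """Checklist items not yet ticked off, in checklist order."""
    return [step for step in RECOVERY_CHECKLIST if step not in done]


print(restart_order(STARTS_AFTER))
print(remaining_steps({"restore completed"}))
```

`TopologicalSorter` also raises `graphlib.CycleError` on circular dependencies, which is worth discovering in a drill rather than mid-recovery.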

Mistake 10: No infrastructure-as-code

In a DR scenario you spin up in a secondary region. Creating infrastructure by hand takes a long time.

IaC (Infrastructure-as-Code):
– Terraform, CloudFormation, Pulumi
– Templates for the primary and secondary regions
– A fresh environment with “terraform apply”

On my last project, spinning up the secondary region went from 4 hours to 30 minutes thanks to IaC.

Fix: Keep critical infrastructure in IaC and discourage manual setup.

Drill planning

A good setup for a quarterly DR drill:

Pre-drill:
– Pick a scenario (a specific failure)
– Participant list (on-call team + support)
– Time window (a few hours)
– Success criteria (RTO met? RPO met? Did communication run? Does the documentation work?)

During drill:
– Real-time observation. What is the team doing, and why?
– Stopwatch. Measure the RTO.
– Note gaps. Steps that are missing from the runbook.

Post-drill:
– Debrief session
– Findings document
– Action items tracked
– Runbook updates

One scenario per quarter. Over 12 months you test four different failure modes.
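
The stopwatch part of the success criteria can be captured in a few lines. A minimal sketch, assuming the observer records three timestamps during the drill; the `DrillResult` fields and the sample times are hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    failure_injected: datetime   # when the simulated failure started
    service_restored: datetime   # when the service was confirmed healthy again
    last_good_data: datetime     # timestamp of the newest data present after restore


def evaluate(result: DrillResult, rto: timedelta, rpo: timedelta) -> dict[str, bool]:
    """Compare measured recovery time and data loss against the targets."""
    recovery_time = result.service_restored - result.failure_injected
    data_loss = result.failure_injected - result.last_good_data
    return {"rto_met": recovery_time <= rto, "rpo_met": data_loss <= rpo}


# Illustrative drill: failure at 10:00, restored at 13:30, data up to 09:50.
drill = DrillResult(
    failure_injected=datetime(2024, 6, 1, 10, 0),
    service_restored=datetime(2024, 6, 1, 13, 30),
    last_good_data=datetime(2024, 6, 1, 9, 50),
)
print(evaluate(drill, rto=timedelta(hours=4), rpo=timedelta(minutes=15)))
```

In this example both targets are met (3.5 hours against a 4-hour RTO, 10 minutes against a 15-minute RPO). The communication and documentation criteria still need a human judgment in the debrief.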

Budget for DR

DR is expensive, but far cheaper than the cost of a real incident.

DR budget items:
– Secondary region infrastructure (idle, roughly 30-50% of the primary’s cost)
– Backup storage (a monthly cost)
– Monitoring tools (paid tier)
– DR testing time (team hours)
– Training (workshops, simulations)

Incident cost estimate:
– Downtime × revenue/hour
– Customer trust loss
– Team overtime + recovery
– Regulatory penalties (compliance)

Balance: making everything redundant is not practical. Set the budget within the “tolerable downtime” framework.
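
The two quantifiable lines in those lists can be put side by side with simple arithmetic. All numbers below are made up for illustration, and the comparison leaves out everything that is hard to price (trust loss, overtime, penalties, backup storage, team hours):

```python
def downtime_cost(hours_down: float, revenue_per_hour: float) -> float:
    """Direct revenue lost during an outage: downtime x revenue/hour."""
    return hours_down * revenue_per_hour


def standby_cost_per_year(primary_monthly_cost: float, idle_fraction: float) -> float:
    """Annual cost of an idle secondary region as a fraction of the primary."""
    return primary_monthly_cost * idle_fraction * 12


# Hypothetical figures: a 6-hour outage at 15,000 per hour of revenue,
# against a standby region costing 50% of a 20,000-per-month primary.
outage = downtime_cost(hours_down=6, revenue_per_hour=15_000)
standby = standby_cost_per_year(primary_monthly_cost=20_000, idle_fraction=0.5)
print(f"one outage: {outage:,.0f}  standby per year: {standby:,.0f}")
```

With these made-up figures a single outage costs 90,000 in direct revenue and the standby region costs 120,000 a year, so the standby only pays off once the unpriced costs or the expected number of outages are factored in. That is exactly the conversation to have with the business when setting the tolerable downtime.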

Cloud-specific considerations

AWS, GCP, and Azure all have outages. Cloud reliability is not infinite.

Multi-region strategy:
– Same cloud, different region (AWS eu-west-1 + eu-west-2)
– Multi-cloud (AWS + GCP)
– Hybrid (cloud + on-prem backup)

Complexity grows sharply with each step up this list. For most teams, single-region multi-AZ is enough.

Conclusion

A DR plan looks good on paper. The problems show up in a real test. The 10 common mistakes: an untested plan, undefined RTO/RPO, untested restores, hidden SPOFs, a knowledge gap, a missing communication plan, no partial failure handling, skipped postmortems, treating recovery as restore, and no IaC.

Quarterly drills + blameless postmortems + continuous improvement. With that discipline the DR plan actually works.

The first drill will be painful. Problems will surface. The good news: they surface in a controlled environment, not in a production incident.