Log aggregation architecture: from zero to an Elasticsearch setup

Reading logs from a single server is easy: SSH in, run tail -f, find the problem. But with 10 servers, 20 microservices, and 100 containers, that approach falls apart. What caused an error, and which service did it start in? SSHing into every server one by one is not practical.

This is the problem log aggregation solves. You collect all logs in one central place and make them searchable and correlatable. The ELK Stack (Elasticsearch + Logstash + Kibana) is one of the most widely used solutions. In this post I'll walk through designing a production-ready ELK setup.

Stack components

Elasticsearch: Search and analytics engine. It stores and indexes the logs, and provides full-text search and aggregations.

Logstash: Log pipeline. It takes data from its inputs, parses and transforms it, and writes it to Elasticsearch.

Kibana: Web UI. You use it to search and visualize the data in Elasticsearch and to build dashboards.

Filebeat / Fluent Bit: Log collector agents. They run on every server, read the log files, and ship them to Logstash.

Together these four components make up the ELK (+Beats) Stack. Alternatives: OpenSearch (the AWS-led fork) in place of Elasticsearch, Splunk (commercial), Loki (Grafana ecosystem).

Architecture diagram

[Server 1]        [Server 2]        [Server 3]
Filebeat          Filebeat          Filebeat
    |                |                |
    +--------+       |        +-------+
             |       |        |
             v       v        v
          [Logstash cluster]
                   |
                   v
          [Elasticsearch cluster]
                   |
                   v
              [Kibana]
                   |
              (web UI)

Filebeat runs on every server and forwards to a central Logstash cluster; Logstash parses, enriches, and writes to Elasticsearch; Kibana searches and visualizes.

Filebeat setup (server-side)

Filebeat should be installed on every application server. It watches the log files and forwards new lines incrementally.

Filebeat config (filebeat.yml):

filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.log
    fields:
      app: my-service
      environment: production

output.logstash:
  hosts: ["logstash.internal:5044"]

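paths lists the log files to watch. fields adds metadata to every log entry. output.logstash sets where the events are sent. (Recent Filebeat versions deprecate the log input type in favor of filestream, so check which one your version expects.)

Filebeat runs as a systemd service: systemctl start filebeat. It is resilient to restarts because it remembers its offset in each file and resumes where it left off.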
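To make the "remembers its offset" idea concrete, here is a minimal Python sketch of a reader that persists a byte offset and resumes from it. It is not how Filebeat is implemented internally; the function name read_new_lines and the JSON state file are assumptions made for illustration.

```python
import json
import os


def read_new_lines(log_path: str, state_path: str) -> list:
    """Return complete lines appended to log_path since the last call.

    The byte offset is persisted in state_path, so a restarted process
    continues where the previous one stopped.
    """
    offset = 0
    if os.path.exists(state_path):
        with open(state_path, "r", encoding="utf-8") as f:
            offset = json.load(f).get("offset", 0)

    # If the file is now shorter than the saved offset it was truncated
    # or rotated, so start again from the beginning.
    if os.path.getsize(log_path) < offset:
        offset = 0

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()

    # Consume only complete lines; a partial trailing line waits for next time.
    last_newline = data.rfind(b"\n")
    if last_newline == -1:
        return []
    complete = data[: last_newline + 1]

    with open(state_path, "w", encoding="utf-8") as f:
        json.dump({"offset": offset + len(complete)}, f)

    return complete.decode("utf-8").splitlines()
```

Calling read_new_lines repeatedly (or after a restart) only returns lines that have not been returned before, which is the property that keeps a collector from re-sending the whole file.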

Logstash pipeline

A Logstash pipeline has three sections:

Input: where logs come from
Filter: how they are parsed and transformed
Output: where they are written

Example pipeline (logstash.conf):

input {
  beats {
    port => 5044
  }
}

filter {
  # Parse JSON logs
  if [message] =~ /^\{/ {
    json {
      source => "message"
    }
  }
  
  # Parse the timestamp
  date {
    match => ["timestamp", "ISO8601"]
    target => "@timestamp"
  }
  
  # GeoIP lookup (location from the user's IP)
  geoip {
    source => "client_ip"
    target => "geoip"
  }
  
  # Convert the request duration to a numeric type
  mutate {
    convert => { "request_duration_ms" => "integer" }
  }
}

output {
  elasticsearch {
    hosts => ["es.internal:9200"]
    index => "logs-%{+YYYY.MM.dd}"
    user => "logstash_user"
    password => "${LOGSTASH_PASSWORD}"
  }
}

Logstash runs every log entry through this pipeline: parse JSON, normalize the timestamp, enrich with GeoIP, write to Elasticsearch.
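If it helps to see what the filter stage does to one event, the following Python sketch mimics the json, date, and mutate steps on a plain dict. It is only an illustration of the transformations, not Logstash code; apply_filters is a made-up name and the GeoIP step is left out because it needs an external database.

```python
import json
from datetime import datetime, timezone


def apply_filters(event: dict) -> dict:
    """Mimic the json, date, and mutate filters on a single event."""
    out = dict(event)

    # json filter: only attempt a parse when the message looks like a JSON object.
    message = out.get("message", "")
    if message.startswith("{"):
        try:
            out.update(json.loads(message))
        except json.JSONDecodeError:
            # Logstash tags events it could not parse in a similar way.
            out.setdefault("tags", []).append("_jsonparsefailure")

    # date filter: promote the application's ISO8601 timestamp to @timestamp (UTC).
    if "timestamp" in out:
        ts = out["timestamp"].replace("Z", "+00:00")
        parsed = datetime.fromisoformat(ts).astimezone(timezone.utc)
        out["@timestamp"] = parsed.isoformat()

    # mutate/convert: make the duration numeric so range queries and percentiles work.
    if "request_duration_ms" in out:
        out["request_duration_ms"] = int(out["request_duration_ms"])

    return out
```

The point of the sketch is the order: parse first, then normalize types, so that everything written to Elasticsearch already has the field types the dashboards expect.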

Structured logging: JSON format

The single biggest decision that simplifies the Logstash pipeline: having applications write their logs as JSON.

Unstructured log:

2024-11-15 10:30:45 INFO Request received: /api/users duration=125ms user=abc123

Hard to parse. You have to extract the “duration” and “user” fields with a regex, which is error-prone.
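As a rough illustration of why this is fragile, here is a Python regex written for exactly the sample line above. The pattern and the parse_line helper are assumptions for this example; any change in field order or units makes the match fail silently.

```python
import re
from typing import Optional

# Pattern tailored to the one sample format shown above.
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"Request received: (?P<path>\S+) "
    r"duration=(?P<duration_ms>\d+)ms "
    r"user=(?P<user_id>\S+)$"
)


def parse_line(line: str) -> Optional[dict]:
    """Extract fields from one unstructured line, or None if it does not match."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["duration_ms"] = int(fields["duration_ms"])
    return fields
```

A developer who swaps the order of duration and user in the log statement breaks this parser without any error on the application side, which is the kind of failure structured logging avoids.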

Structured log (JSON):

{"timestamp":"2024-11-15T10:30:45Z","level":"info","event":"request","path":"/api/users","duration_ms":125,"user_id":"abc123"}

Logstash parses it directly and every field is searchable. This is the best practice.

Use a structured logging library in your application: zerolog in Go, structlog in Python, pino in Node.js, the Monolog JSON formatter in PHP.
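For a sense of what such a library does under the hood, here is a minimal sketch using only Python's standard logging module. JsonFormatter, build_logger, and the convention of passing extra={"fields": {...}} are assumptions for this example; a real project would more likely use structlog or a similar library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        payload = {
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Fields passed via extra={"fields": {...}} are merged into the top level.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Usage would look like build_logger().info("request", extra={"fields": {"path": "/api/users", "duration_ms": 125, "user_id": "abc123"}}), which produces a line shaped like the structured example above.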

Elasticsearch index strategy

Log data grows fast. Even 1 GB/day adds up to 365 GB in a year. The Elasticsearch index strategy is critical.

Time-based indices:

logs-2024.11.15
logs-2024.11.16
logs-2024.11.17

A new index every day. Advantages:
– Deleting old indices is easy (retention)
– Searches are scoped by time range (a query for the last 7 days never touches older indices)
– Different time periods can use different settings
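The naming scheme makes both points easy to compute. The Python sketch below derives the daily index name, lists the indices a "last N days" query has to touch, and picks out indices older than a retention window. The helper names are made up for this example, and the logs- prefix follows the index names shown above.

```python
from datetime import date, datetime, timedelta


def index_name(day: date, prefix: str = "logs") -> str:
    """Daily index name, e.g. logs-2024.11.15."""
    return f"{prefix}-{day:%Y.%m.%d}"


def indices_for_range(end: date, days: int) -> list:
    """Indices a query for the last `days` days (ending at `end`) has to touch."""
    return [index_name(end - timedelta(days=i)) for i in range(days)]


def expired_indices(existing: list, today: date, retention_days: int,
                    prefix: str = "logs") -> list:
    """Names of daily indices older than the retention window."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for name in existing:
        try:
            day = datetime.strptime(name, f"{prefix}-%Y.%m.%d").date()
        except ValueError:
            continue  # not one of our daily indices, leave it alone
        if day < cutoff:
            expired.append(name)
    return expired
```

Because retention works on whole indices, deleting old data is a cheap index drop rather than a delete-by-query over individual documents.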

Index lifecycle management (ILM):
– Hot phase: Last 7 days, active indices. Writes continue, searches are fast.
– Warm phase: 7-30 days. Writes have stopped, slower searches are acceptable.
– Cold phase: 30-90 days. Cheaper storage, rare searches.
– Delete: Older than 90 days. The index is deleted.

These phase transitions are automatic. Define an Elasticsearch ILM policy and have it applied to every index.
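As a sketch, the phases above could be expressed as an ILM policy body like the one below, built here as a Python dict and printed as JSON. The specific actions (set_priority, forcemerge) are one reasonable choice rather than the only one, and the policy would be sent with a PUT to _ilm/policy/<name>, where the name is up to you. Without a rollover action, min_age is measured from index creation, which fits the daily indices described earlier.

```python
import json

# ILM policy body mirroring the hot / warm / cold / delete phases above.
ILM_POLICY = {
    "policy": {
        "phases": {
            "hot": {
                "min_age": "0ms",
                "actions": {"set_priority": {"priority": 100}},
            },
            "warm": {
                "min_age": "7d",
                "actions": {
                    # Indices are no longer written to, so merge segments down.
                    "forcemerge": {"max_num_segments": 1},
                    "set_priority": {"priority": 50},
                },
            },
            "cold": {
                "min_age": "30d",
                "actions": {"set_priority": {"priority": 0}},
            },
            "delete": {
                "min_age": "90d",
                "actions": {"delete": {}},
            },
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(ILM_POLICY, indent=2))
```

The policy is typically attached through an index template, so that every new daily index picks it up without manual steps.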

Retention policy

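How many days of logs do you keep?

30 days: Typical minimum for debugging. Most issues surface within 30 days.

90 days: A common choice for audit trails and compliance. Note that a retention period alone does not make you GDPR-compliant: GDPR asks you to keep personal data no longer than necessary, so what your logs contain matters as much as how long you keep them.

1 year+: Required by compliance in some sectors (financial, healthcare). Expensive but necessary.

Watch the data volume. 1 year of retention at 10 GB/day is about 3.65 TB of storage, before replicas. That is expensive in Elasticsearch. A cold tier (S3 via Searchable Snapshots) is cheaper.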
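The arithmetic is simple enough to keep in a small helper. This Python sketch computes raw log volume at steady state and splits it across tiers, using phase boundaries of 7, 30, and 90 days as an example; it ignores index overhead and compression, and the function names are made up for this example.

```python
def storage_gb(daily_gb: float, retention_days: int, replicas: int = 0) -> float:
    """Raw log volume held at steady state, including replica copies."""
    return daily_gb * retention_days * (1 + replicas)


def tier_breakdown(daily_gb: float, hot_until: int = 7, warm_until: int = 30,
                   cold_until: int = 90) -> dict:
    """Raw volume sitting in each tier, given the day each phase ends."""
    return {
        "hot": daily_gb * hot_until,
        "warm": daily_gb * (warm_until - hot_until),
        "cold": daily_gb * (cold_until - warm_until),
    }
```

For example, storage_gb(10, 365) gives 3650 GB, the roughly 3.65 TB figure above, and one replica doubles it. With 90-day retention, tier_breakdown(10) shows that only 70 GB of the 900 GB needs to sit on the fastest storage.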

Kibana dashboards

In Kibana I build a separate dashboard for each service:

API service dashboard:
– Request rate (per second)
– Error rate (4xx, 5xx)
– p95 latency
– Top endpoints
– Recent errors (last 100)

Payment service dashboard:
– Transaction count
– Failed transaction rate
– Processing time distribution
– Daily revenue

System health dashboard:
– All services error rate
– Infrastructure metrics
– Cron job success/failure
– Security events (failed login, suspicious IP)

Each developer watches their own service's dashboard. The on-call engineer watches system health.

Correlation IDs

In a microservice system a single user request passes through 3-4 services. An error occurred; which path did it come through?

Correlation ID: Generate a unique ID (a UUID) at the start of each request. Carry that ID along in the logs of every service.

// API gateway
requestId = uuid.v4()
log.info({event: "request_start", requestId, path: req.path})
headers.set("X-Request-ID", requestId)

// Internal service call
fetch(url, {headers: {"X-Request-ID": requestId}})

// Log line in the other service (ID read from the incoming header)
requestId = req.headers["x-request-id"]
log.info({event: "received", requestId})

Search for the requestId in Kibana and you can stitch that user's journey together across all services. A 10-hour debugging session becomes a 10-minute one.
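The same idea in a slightly fuller form: a Python sketch that reuses an incoming ID when there is one, mints a new one at the edge otherwise, and makes it available to both outgoing calls and log lines. The function names are assumptions for this example, and a plain dict stands in for real HTTP headers, which frameworks normally treat as case-insensitive.

```python
import contextvars
import json
import uuid

HEADER = "X-Request-ID"

# Holds the ID for the request currently being handled.
_request_id = contextvars.ContextVar("request_id", default=None)


def start_request(incoming_headers: dict) -> str:
    """Reuse the caller's ID if present, otherwise create one at the edge."""
    request_id = incoming_headers.get(HEADER) or str(uuid.uuid4())
    _request_id.set(request_id)
    return request_id


def outgoing_headers() -> dict:
    """Headers to attach to calls made to downstream services."""
    return {HEADER: _request_id.get()}


def log_line(event: str, **fields) -> str:
    """Build a JSON log line that always carries the current request ID."""
    return json.dumps({"event": event, "requestId": _request_id.get(), **fields})
```

Using a context variable means handlers do not have to pass the ID through every function call; anything that logs or calls another service during the request picks it up automatically.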

Alerting

Use Kibana alerts (or ElastAlert, OpenSearch alerting) to get notified on error patterns:

Alert 1: 50+ errors in 5 minutes → email + Slack
Alert 2: p95 latency above 2 seconds → PagerDuty
Alert 3: Payment failure rate above 5% → Slack + PagerDuty
Alert 4: Security event pattern → Security team email

Avoid alert fatigue: only actionable alerts. An alert on “every warning log” does not work.
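A rule like Alert 1 boils down to a sliding-window count. The Python sketch below shows that logic in isolation; it is not how Kibana or ElastAlert evaluate rules, the class name is made up, and it assumes timestamps arrive in non-decreasing order. It reports only the transition into the firing state, which is one simple way to avoid re-notifying on every additional error.

```python
from collections import deque


class ErrorRateAlert:
    """Fire when at least `threshold` errors fall inside a sliding time window."""

    def __init__(self, threshold: int = 50, window_seconds: int = 300):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._timestamps = deque()
        self._firing = False

    def record_error(self, ts: float) -> bool:
        """Record one error at time `ts` (seconds).

        Returns True only when the alert newly starts firing.
        """
        self._timestamps.append(ts)
        # Drop errors that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= ts - self.window_seconds:
            self._timestamps.popleft()
        over = len(self._timestamps) >= self.threshold
        newly_firing = over and not self._firing
        self._firing = over
        return newly_firing
```

With the defaults, the 50th error inside any 5-minute window returns True once; further errors stay quiet until the count drops below the threshold and crosses it again.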

Cost control

If log volume is not kept under control, an ELK cluster gets expensive.

Strategies:

1. Sampling: Ship 10% of debug-level logs. Ship info and above in full.

2. Filter noise: Do not log health check endpoints or static asset requests.

3. Field filtering: Drop fields you do not need. A user-agent string can run to around 500 bytes and is usually not needed.

4. Cold tier: Move old data to S3 and query it from Elasticsearch through searchable snapshots.

5. Dedup: Deduplicate repeated error messages. Instead of 1000 identical errors, store 1 entry plus a count.
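Strategies 1, 3, and 5 are small enough to sketch in Python. The helpers below are illustrative only; in practice this logic would live in the application's logger or in the shipping pipeline, and the names should_ship, drop_fields, and dedupe are made up for this example.

```python
import random
from collections import Counter


def should_ship(level: str, rng: random.Random, debug_sample_rate: float = 0.10) -> bool:
    """Ship everything except debug in full; ship a random sample of debug logs."""
    if level.lower() != "debug":
        return True
    return rng.random() < debug_sample_rate


def drop_fields(event: dict, unwanted=("user_agent",)) -> dict:
    """Return a copy of the event without fields that are rarely needed."""
    return {key: value for key, value in event.items() if key not in unwanted}


def dedupe(messages: list) -> list:
    """Collapse identical messages into one entry with a count, in first-seen order."""
    counts = Counter(messages)
    return [{"message": message, "count": count} for message, count in counts.items()]
```

Passing the random generator in explicitly keeps the sampling decision testable; a real implementation might instead sample per request ID so that all debug lines of a sampled request are kept together.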

As a rough guide, a typical startup produces 5-10 GB/day of logs, a mid-size company 100 GB-1 TB, and a large one 10 TB+.

Alternative: Loki + Grafana

An alternative to ELK: Loki + Grafana, from the Grafana ecosystem.

Loki takes a “like Prometheus, but for logs” approach: it indexes only labels, not the full log text. Storage is cheaper. Visualization and alerting happen in Grafana.

Advantages: cost, simpler operations. Disadvantages: limited full-text search, and a feature set that does not quite match ELK.

For small to mid scale, Loki is worth considering. In the enterprise, ELK is still dominant.

Managed options

If self-hosting is too hard:

AWS OpenSearch Service: Managed. Setup takes about 30 minutes. Pricing is roughly $50-500/month at typical scale, depending on instance size and storage.

Elastic Cloud: Elastic's own managed service. Full ELK. Premium features.

Datadog Logs: SaaS log management. Expensive but no-ops. Mature alerting.

Grafana Cloud Logs: Loki-based managed. Cheaper.

For a startup, a managed solution reduces the operational burden, saving 10-20 hours/month of ops time.

Conclusion

Log aggregation is a foundational piece of a production system. Without it, debugging a microservice architecture is close to impossible.

The ELK Stack is mature and battle-tested, and it is available self-hosted or managed. Structured logging, correlation IDs, proper retention, and alerting are the four pillars of this foundation.

The initial setup is 1-2 weeks of work. After that, operational costs vary with scale. A system whose logs are not observable is not ready for production.
