
Technology Stack Explained

Overview: How OTEL + Loki + Grafana Work Together

This is a complete observability system. Here's how each component works and how they interact.

The 3 Pillars of Observability

1. Traces (Distributed Tracing)

Purpose: Understand the complete journey of a request through your system

In this project:
- Tool: OpenTelemetry + Jaeger
- What it tracks: API requests, database queries, function calls
- What you see: Timeline of what happened and how long each step took
- Use case: "Why did request X take 5 seconds?"

How it works:

Request arrives
    ↓
[Span: http_request] ----[23 ms total]----
    ├─ [Span: auth_check] ------[2 ms]
    ├─ [Span: db_query] -------[15 ms]
    └─ [Span: serialize] ------[6 ms]
    ↓
Response sent

Each span records:
- Start/end time
- Operation name
- Attributes (user_id, endpoint, etc.)
- Events (structured log messages)
- Status (success/error)

In Rust: The #[tracing::instrument] macro creates spans:

use tracing::info;

#[tracing::instrument]
async fn get_user(id: String) -> Result<User> {
    info!("Fetching user");  // Creates an event in the span
    // ... your code ...
}
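
The macro also accepts options that control what the span records. A short sketch of those options; the function and field names below are illustrative, not from the project:

use tracing::{info, instrument};

// skip() keeps sensitive or noisy arguments out of the span's attributes,
// fields() adds extra attributes, and err records an Err return value as an
// error event on the span.
#[instrument(skip(password), fields(request_id = %request_id), err)]
async fn login(request_id: String, username: String, password: String) -> Result<(), std::io::Error> {
    info!(user = %username, "Attempting login");  // event attached to this span
    // ... your code ...
    Ok(())
}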


2. Logs (Structured Logging)

Purpose: Detailed record of what happened in the system

In this project:
- Tool: tracing (Rust) → Promtail → Loki
- What it captures: Application events, errors, state changes
- What you see: Searchable, indexed logs across services
- Use case: "What error message did service X log at 3:45 PM?"

How it works:

Application
    ↓
info!(user_id = "123", "User created")  [structured]
    ↓
JSON log file (logs/app.log)
    [one JSON object per line]
    ↓
Promtail (collector)
    [tails the log file mounted from the host]
    ↓
Loki (indexer)
    [stores entries with labels: job, env]
    ↓
Grafana
    [query and visualize]

Key difference from traditional logs:
- Traditional: "User 123 created at 2024-10-27 10:30:45"
- Structured: {"timestamp": "...", "level": "INFO", "user_id": "123", "action": "created"}

Structured logs let you search by any field!
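
In the tracing macros, fields come before the message. A minimal sketch (field names illustrative) showing how a structured event is emitted and roughly how the JSON formatter renders it:

use tracing::info;

fn record_user_created(user_id: &str) {
    // % records the value with Display; ? would record it with Debug.
    info!(user_id = %user_id, action = "created", "User created");
}

// With tracing_subscriber's JSON formatter this is written roughly as:
// {"timestamp":"...","level":"INFO","fields":{"message":"User created","user_id":"123","action":"created"},...}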


3. Metrics (Performance Data)

Purpose: Quantitative measurements of system health

In this project:
- Tool: Prometheus + Grafana
- What it tracks: Request count, error rate, response time, resource usage
- What you see: Graphs and trends over time
- Use case: "What's our p95 latency? How many requests failed today?"

How it works:

Application (exports metrics endpoint)
    ↓
Prometheus (scraper)
    [every 15 seconds: GET http://app:8080/metrics]
    ↓
Time-series database
    [stores: metric_name{labels} = value]
    ↓
Grafana
    [graphs, aggregations, alerts]

Example metric:

http_requests_total{endpoint="/api/users", method="POST", status="201"} = 42
request_duration_seconds{endpoint="/api/users", quantile="0.95"} = 0.234
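
How these metrics get defined in code isn't shown in this document. As one possible sketch, using the prometheus crate (an assumption, not necessarily what the project uses), the two metrics above could be registered and updated like this:

use prometheus::{HistogramOpts, HistogramVec, IntCounterVec, Opts, Registry};

fn register_metrics(registry: &Registry) -> prometheus::Result<(IntCounterVec, HistogramVec)> {
    // Counter with the same label set as the first example line above.
    let requests = IntCounterVec::new(
        Opts::new("http_requests_total", "Total HTTP requests"),
        &["endpoint", "method", "status"],
    )?;
    // Histogram of request latency; quantiles are computed at query time.
    let duration = HistogramVec::new(
        HistogramOpts::new("http_request_duration_seconds", "Request latency"),
        &["endpoint"],
    )?;
    registry.register(Box::new(requests.clone()))?;
    registry.register(Box::new(duration.clone()))?;
    Ok((requests, duration))
}

fn record_request(requests: &IntCounterVec, duration: &HistogramVec, elapsed_secs: f64) {
    requests.with_label_values(&["/api/users", "POST", "201"]).inc();
    duration.with_label_values(&["/api/users"]).observe(elapsed_secs);
}

One design note: a histogram exports buckets and leaves quantiles to PromQL's histogram_quantile (see the PromQL section below), while a summary exports quantile labels directly, which is what the second example line above shows.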


Component Details

OpenTelemetry (OTEL)

What it is: A standard for instrumenting code to produce observability data

Why use it:
- Vendor-agnostic (can export to Jaeger, Zipkin, Prometheus, etc.)
- Community standard (multiple languages)
- Pre-built integrations with popular libraries

In this project:
- Automatically creates spans for instrumented functions
- Sends spans to Jaeger (UDP port 6831)
- Integrates with the tracing crate

// OTEL is configured once:
init_opentelemetry().await;

// Then every #[tracing::instrument] creates spans that go to Jaeger
#[tracing::instrument]
async fn my_function() { ... }
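
The body of init_opentelemetry() isn't shown in this document. One common shape, assuming the opentelemetry-jaeger and tracing-opentelemetry crates together with tracing-subscriber, looks roughly like this:

use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

// async only to match the call site above; nothing here needs to await.
async fn init_opentelemetry() {
    // Export spans to the Jaeger agent over UDP (port 6831 by default).
    let tracer = opentelemetry_jaeger::new_agent_pipeline()
        .with_service_name("rust-app")
        .install_simple()
        .expect("failed to install the Jaeger tracer");

    // Bridge tracing spans into OpenTelemetry; the JSON log layer from the
    // Loki section below would be added to the same registry.
    tracing_subscriber::registry()
        .with(tracing_opentelemetry::layer().with_tracer(tracer))
        .init();
}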

Jaeger (Distributed Tracing Backend)

What it is: Storage and UI for viewing traces

Port mapping:
- 6831/UDP: Jaeger agent (receives spans from apps)
- 14268/HTTP: Jaeger collector (alternative)
- 16686/HTTP: Web UI for viewing traces

What you see in Jaeger:
1. Service list: All services sending traces
2. Operation list: All functions being traced
3. Trace timeline: Visual timeline with spans
4. Span details: Attributes, logs, child spans

Example Jaeger workflow:

1. Request → Rust app → creates span "get_user"
2. Span sent to Jaeger (UDP, port 6831)
3. Jaeger stores it with trace_id, span_id, timestamps
4. You open the Jaeger UI → find the trace by service, operation, or trace_id and see the span in the timeline (see the sketch after this list)
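
To jump from a log line to the matching trace, you can read the current trace_id inside an instrumented function and search for it in the Jaeger UI. A sketch, assuming the tracing-opentelemetry bridge shown earlier is in place:

use opentelemetry::trace::TraceContextExt;
use tracing_opentelemetry::OpenTelemetrySpanExt;

// Log the trace_id of the span we are currently inside, so it can be
// pasted into the Jaeger UI search box.
fn log_current_trace_id() {
    let context = tracing::Span::current().context();
    let trace_id = context.span().span_context().trace_id();
    tracing::info!(trace_id = %trace_id, "current trace");
}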


Loki (Log Aggregation)

What it is: Time-series database optimized for logs

Why Loki instead of other log systems:
- More resource-efficient than Elasticsearch/Splunk
- Works great with Grafana (both made by Grafana Labs)
- Uses labels instead of indexing every field
- Good for containerized environments

Port mapping: - 3100: HTTP API for pushing/querying logs

How logs flow to Loki:

1. Rust app writes JSON logs to logs/app.log (see the sketch after this list)
2. Host log folder is mounted read-only into the Promtail container
3. Promtail tails the file and attaches labels (job="rust-app")
4. Promtail pushes batches to Loki over HTTP (port 3100)
5. Loki indexes the entries for Grafana queries
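
For step 1, one way to write JSON logs to that file is the tracing-appender crate together with tracing-subscriber's JSON formatter (both assumed here; the project may wire this differently):

use tracing_subscriber::layer::SubscriberExt;
use tracing_subscriber::util::SubscriberInitExt;

fn init_file_logging() -> tracing_appender::non_blocking::WorkerGuard {
    // logs/app.log on the host is what gets mounted into the Promtail
    // container as /var/log/otel-tutorial/app.log.
    let file_appender = tracing_appender::rolling::never("logs", "app.log");
    let (writer, guard) = tracing_appender::non_blocking(file_appender);

    tracing_subscriber::registry()
        .with(tracing_subscriber::fmt::layer().json().with_writer(writer))
        .init();

    // Keep the guard alive for the program's lifetime so buffered lines flush.
    guard
}

In a real setup this JSON layer would be added to the same registry as the OpenTelemetry layer from the earlier sketch rather than initialized separately.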

Loki's Query Language (LogQL):

{job="rust-app"} | json
{job="rust-app"} | json | level="ERROR"
sum(rate({job="rust-app"} | json | level="ERROR" [5m]))  # error rate


Promtail (Log Collector)

What it is: Agent that collects logs and sends to Loki

In this project:
- Tails the application log file mounted from the host (/var/log/otel-tutorial/app.log)
- Parses JSON entries and applies labels (job="rust-app", env="dev")
- Sends batches to Loki every few seconds

Configuration:

clients:
  - url: http://loki:3100/loki/api/v1/push  # Where to send logs

scrape_configs:
  - job_name: rust-app
    static_configs:
      - targets: [localhost]
        labels:
          job: rust-app      # label used in the LogQL queries above
          env: dev
          __path__: /var/log/otel-tutorial/app.log

What it does:
1. Tail the mounted application log file
2. Add labels for environment and job
3. Push to Loki periodically


Prometheus (Metrics Storage)

What it is: Time-series database for metrics

Port mapping: - 9090: Prometheus UI and API

How metrics are collected:

1. Prometheus wakes up every 15 seconds (scrape_interval)
2. Makes an HTTP GET request to each target: http://service:port/metrics (a sketch of serving this endpoint follows this list)
3. Parses response (Prometheus text format)
4. Stores in time-series database
5. Keeps ~15 days of data (retention)
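
The /metrics endpoint itself is just plain text over HTTP. A sketch of step 2's target, assuming the prometheus crate from the earlier example; a real service would serve this route from its existing web framework, the raw TcpListener here only shows the wire format:

use std::io::{Read, Write};
use std::net::TcpListener;

use prometheus::{Encoder, Registry, TextEncoder};

// Answer every request with the registry encoded in the Prometheus text format.
fn serve_metrics(registry: Registry) -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:8080")?;
    for stream in listener.incoming() {
        let mut stream = stream?;
        let mut request = [0u8; 1024];
        let _ = stream.read(&mut request); // ignore the request details

        let mut body = Vec::new();
        TextEncoder::new()
            .encode(&registry.gather(), &mut body)
            .expect("failed to encode metrics");

        let header = format!(
            "HTTP/1.1 200 OK\r\nContent-Type: text/plain; version=0.0.4\r\nContent-Length: {}\r\n\r\n",
            body.len()
        );
        stream.write_all(header.as_bytes())?;
        stream.write_all(&body)?;
    }
    Ok(())
}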

Prometheus query language (PromQL):

up  # is each service up? (1 = yes, 0 = no)
sum(rate(http_requests_total[5m]))  # requests per second
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)  # p95 latency


Grafana (Visualization)

What it is: Dashboarding and visualization platform

Port mapping: - 3000: Web UI (default: admin/admin)

Capabilities:
- Query any datasource (Prometheus, Loki, Jaeger)
- Create custom dashboards
- Set up alerts
- Build multi-panel displays

Grafana integrations in this project:

Grafana Dashboard
├── Panel 1: Loki logs
│   └── Shows recent application logs
├── Panel 2: Prometheus metrics
│   └── Shows request rate graph
├── Panel 3: Jaeger trace explorer
│   └── Link to individual traces
└── Panel 4: Stats
    └── Error count, latency, etc.


Data Flow Diagram

┌──────────────────────────────────────────────────────────────────┐
│                     Your Rust Application                         │
│  • Uses #[tracing::instrument] macros                             │
│  • Calls info!(), warn!(), error!() for logging                   │
│  • All happens in-process, very low overhead                      │
└────────────┬──────────────────────┬──────────────────────────────┘
             │                      │
    ┌────────▼────────┐    ┌────────▼────────┐
    │ JSON log file   │    │ UDP Port 6831   │
    │ logs/app.log    │    │ (Jaeger agent)  │
    │                 │    │ spans           │
    └────────┬────────┘    └────────┬────────┘
             │                      │
    ┌────────▼────────┐    ┌────────▼────────┐
    │   Promtail      │    │     Jaeger      │
    │  (collector)    │    │  (aggregator)   │
    │  tails mounted  │    │  Port: 6831     │
    │  log file       │    │                 │
    └────────┬────────┘    └────────┬────────┘
             │                      │
    ┌────────▼────────┐    ┌────────▼────────┐
    │      Loki       │    │     Jaeger      │
    │  (log indexer)  │    │  (storage & UI) │
    │  Port: 3100     │    │  Port: 16686    │
    └────────┬────────┘    └────────┬────────┘
             │                      │
             └──────────┬───────────┘
                        │
                    ┌───▼──────┐
                    │ Grafana  │
                    │ Port:3000│
                    │          │
                    │Dashboard │
                    │with Loki,│
                    │Prometheus│
                    │& Jaeger  │
                    └──────────┘

The OpenTelemetry Instrumentation Pattern

1. Automatic Spans (with macro)

#[tracing::instrument]
async fn process_request(id: String) {
    // Span created automatically
    // Function name = span name
    // Arguments recorded as fields
    // Return value and errors recorded

    info!("Processing");
    let result = do_something().await;
}

2. Manual Spans (for complex flows)

use tracing::Instrument;  // brings .instrument() into scope for the async block

let span = tracing::info_span!("operation_name", user_id = %id);

let result = async {
    info!("Starting");
    // ... work ...
    info!("Done");
}.instrument(span).await;

3. Log Events

info!("message");              // INFO level
warn!("something odd");        // WARN level
error!("something failed");    // ERROR level
debug!("diagnostic info");     // DEBUG level

Performance Characteristics

Component      Memory       CPU     Disk                  Notes
App with OTEL  +5-10 MB     <1%     Minimal               Very low overhead
Loki           100-500 MB   5-15%   Depends on retention  Efficient compression
Prometheus     100 MB-1 GB  5-20%   Depends on retention  In-memory cache
Jaeger         100-500 MB   5-10%   Depends on retention  Can use external storage
Grafana        100-300 MB   3-8%    Minimal               Just visualization

Common Questions

Q: Do these services talk to each other? A: Only through their APIs. Prometheus scrapes the app's /metrics endpoint, Promtail pushes logs to Loki, and the app itself only knows the Jaeger agent address it exports spans to.

Q: Can I use this with microservices? A: Yes! Each service sends traces to Jaeger. Traces automatically link across services via trace_id.

Q: What about security? A: In production, use authentication/encryption. This tutorial assumes a private network.

Q: Can I replace components? A: Yes! OpenTelemetry is designed for this. Replace Jaeger with Zipkin, Loki with ELK, etc.

Q: How much data volume can this handle? A: This setup handles ~1000 requests/second easily. Bigger volumes need distributed storage.


Next Level Learning

  1. Read: OpenTelemetry specification (opentelemetry.io)
  2. Explore: Jaeger documentation (jaegertracing.io)
  3. Practice: Add custom metrics to your application
  4. Deploy: Move to production with persistent storage
  5. Monitor: Set up alerting based on your metrics

You now understand how the entire observability stack works! 🚀