Monitoring

This guide covers monitoring a Riptides deployment, including the built-in Prometheus sidecar, daemon health endpoints, and recommended metrics to track.

Architecture

The Riptides control plane Helm chart deploys a Prometheus sidecar alongside the control plane container. This sidecar automatically discovers and scrapes metrics from both the control plane itself and registered daemons using file-based service discovery.

+------------------------------------------+
|  Control Plane Pod                       |
|                                          |
|  +--------------------+   +----------+  |
|  | controlplane       |-->| prom SD  |  |
|  | (writes SD config) |   | config   |  |
|  +--------------------+   | file     |  |
|                            +----------+  |
|  +--------------------+        |         |
|  | prometheus sidecar |--------+         |
|  | (reads SD config)  |                  |
|  +--------------------+                  |
+------------------------------------------+

The control plane generates a Prometheus file-based service discovery configuration that the sidecar watches. As daemons register and report their endpoints, Prometheus automatically begins scraping them.

Control Plane Metrics

Prometheus Sidecar Configuration

The Prometheus sidecar is configured through the Helm chart values:

prometheus:
  image:
    repository: docker.io/prom/prometheus
    tag: "v2.53.5"
  pvc:
    accessMode: ReadWriteOnce
    storageClassName: gp3
    size: 20Gi
  retention: 6h
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi
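
These defaults can be overridden with a custom values file. As a minimal sketch, the override below extends retention and enlarges the volume. The file name `values-monitoring.yaml` and the chosen numbers are illustrative assumptions, not recommendations from the chart.

```yaml
# values-monitoring.yaml (hypothetical override file)
# Only the keys being changed need to appear; the rest keep chart defaults.
prometheus:
  retention: 24h   # assumed value: keep a full day of samples instead of 6h
  pvc:
    size: 50Gi     # assumed value: more room for the longer retention window
```

A file like this is passed to Helm with the `-f` flag when installing or upgrading the release.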

The sidecar listens on port 9090 inside the pod. To reach it for ad-hoc queries, forward the port to your workstation:

kubectl port-forward -n riptides-cp deploy/controlplane 9090:9090

Then open http://localhost:9090 in your browser.

File-Based Service Discovery

The control plane writes service discovery files that tell Prometheus where to scrape. The relevant configuration:

config:
  server:
    metrics:
      prometheusFileSDConfigDir: /promsdconfig
      prometheusFileSDConfigFile: riptides_sd_config.yaml

The control plane updates this file as daemons register and deregister. Prometheus reloads the file automatically when changes are detected.
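
For orientation, a Prometheus file-based service discovery file is a list of target groups, each with a `targets` list and optional `labels`. The sketch below shows that general shape only. The addresses and the `cluster` label are hypothetical placeholders, and the file the control plane actually writes may carry different targets and labels.

```yaml
# Hypothetical contents of /promsdconfig/riptides_sd_config.yaml.
# Addresses and labels are illustrative placeholders.
- targets:
    - "10.0.1.15:10100"
    - "10.0.1.16:10100"
  labels:
    cluster: example-cluster
```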

Health Endpoint

The control plane exposes a health endpoint on the HTTP proxy:

Endpoint	Port	Description
`/healthz`	8080	Readiness and liveness probe

curl http://controlplane:8080/healthz
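
Because the endpoint serves as both readiness and liveness check, it maps directly onto Kubernetes HTTP probes. The fragment below is a sketch of what such probes look like on a container spec. The timing values are assumptions, and the Helm chart may already define its own probes.

```yaml
# Illustrative container probe settings; timings are assumed values.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
```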

Daemon Metrics

Daemon Health Endpoint

Each daemon exposes a health endpoint:

Endpoint	Port	Description
`/healthz`	10100	Daemon health status

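On Kubernetes, forward the port in one terminal and query it from another. Port-forwarding to a DaemonSet connects to a single pod chosen by kubectl, so this checks one daemon at a time: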

kubectl port-forward -n riptides-system daemonset/riptides-daemon 10100:10100
curl http://localhost:10100/healthz

On bare metal:

curl http://localhost:10100/healthz

Kernel Module Health

On nodes with the kernel module loaded, check driver health directly:

cat /proc/riptides/health

External Prometheus Integration

If you run your own Prometheus instance (outside the control plane sidecar), configure it to scrape Riptides endpoints.

Scraping the Control Plane

Add a scrape config to your Prometheus configuration:

scrape_configs:
  - job_name: "riptides-controlplane"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - riptides-cp
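    relabel_configs:
      # Keep only pods labeled as the control plane.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: controlplane
      # Of those, keep only the target for container port 9090.
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "9090"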

Scraping Daemons via File-Based Service Discovery

To have an external Prometheus reuse the sidecar's file-based service discovery, mount the SD config directory into the external Prometheus and reference the file:

scrape_configs:
  - job_name: "riptides-daemons"
    file_sd_configs:
      - files:
          - /promsdconfig/riptides_sd_config.yaml
    relabel_configs:
      # Use the scrape address as the instance label.
      - source_labels: [__address__]
        target_label: instance
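
How the SD directory reaches an external Prometheus depends on the installation. One possible arrangement, sketched below, assumes the directory lives on a volume that both pods can mount. The claim name `riptides-sd-config` and the volume name are hypothetical, and the volume must support being mounted by more than one pod.

```yaml
# Fragment of an external Prometheus pod spec (illustrative).
# The claim name "riptides-sd-config" is hypothetical.
containers:
  - name: prometheus
    image: docker.io/prom/prometheus:v2.53.5
    volumeMounts:
      - name: riptides-sd
        mountPath: /promsdconfig   # must match the path used in file_sd_configs
        readOnly: true
volumes:
  - name: riptides-sd
    persistentVolumeClaim:
      claimName: riptides-sd-config
      readOnly: true
```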

Scraping Daemons Directly

Alternatively, scrape daemons directly using Kubernetes service discovery:

scrape_configs:
  - job_name: "riptides-daemons"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - riptides-system
    relabel_configs:
      # Keep only daemon pods.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: daemon
      # Rewrite the scrape address to the pod IP on daemon port 10100.
      - source_labels: [__meta_kubernetes_pod_ip]
        replacement: "$1:10100"
        target_label: __address__

Recording Rules

The control plane generates Prometheus recording rules for pre-computed metrics. Include the rules directory in your Prometheus configuration:

rule_files:
  - /promsdconfig/rules/*.yaml
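
For readers unfamiliar with the format, a recording rules file has the shape shown below. The rule is a hypothetical example built only on the standard `up` series. It is not one of the rules the control plane generates, and the group and record names are assumptions.

```yaml
groups:
  - name: riptides-example-recording   # hypothetical group name
    rules:
      # Number of daemon targets currently answering scrapes.
      - record: job:up:sum
        expr: sum by (job) (up{job="riptides-daemons"})
```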

Key Metrics to Monitor

Control Plane

Metric Category	What to Watch	Why
Daemon registrations	Count of registered daemons	Detect daemons dropping off
Certificate issuance	Issuance rate and errors	Ensure certificates are being renewed
API server latency	Request duration percentiles	Detect control plane performance issues
Storage utilization	Kine/SQLite database size	Plan storage capacity
gRPC connections	Active gRPC streams	Monitor daemon connectivity
Tunnel connections	Active tunnel count	Detect connectivity issues for NAT’d daemons

Daemon

Metric Category	What to Watch	Why
Health status	Daemon health endpoint response	Detect unhealthy daemons
Certificate expiry	Time until next certificate renewal	Catch renewal failures before expiry
Kernel module health	`/proc/riptides/health` status	Detect driver issues
Connection count	Active mTLS connections handled	Understand workload volume
Metadata collection	Collector errors	Detect issues with cloud metadata APIs

Alerting Recommendations

Critical Alerts

groups:
  - name: riptides-critical
    rules:
      - alert: RiptidesDaemonDown
        expr: up{job="riptides-daemons"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Riptides daemon is down on {{ $labels.instance }}"

      - alert: RiptidesControlPlaneDown
        expr: up{job="riptides-controlplane"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Riptides control plane is unreachable"
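
Note that `up == 0` only fires for targets that are still present in service discovery. If every daemon target disappears, the `up` series vanishes and the rule stays silent. A complementary rule using `absent()` can cover that case. The sketch below is a suggestion rather than part of the shipped rules, and the alert name, group name, and `for` duration are assumptions.

```yaml
groups:
  - name: riptides-critical-absent   # hypothetical group name
    rules:
      # Fires when no up series exists at all for the daemon job.
      - alert: RiptidesNoDaemonTargets
        expr: absent(up{job="riptides-daemons"})
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No Riptides daemon targets are being scraped"
```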

Warning Alerts

groups:
  - name: riptides-warning
    rules:
      - alert: RiptidesDaemonHighRestartRate
        expr: rate(kube_pod_container_status_restarts_total{container="daemon"}[15m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Riptides daemon pod {{ $labels.pod }} is restarting frequently"

      - alert: RiptidesPrometheusStorageHigh
        expr: (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*prometheus.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*prometheus.*"}) > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus storage usage above 80%"
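
Alert rules can be exercised offline before deployment with `promtool test rules`. The test file below is a minimal sketch for the `RiptidesDaemonDown` alert. It assumes the critical rule group has been saved next to it as `riptides-alerts.yaml` (a hypothetical file name) and uses a made-up instance address.

```yaml
# riptides-alerts-test.yaml (hypothetical file name)
rule_files:
  - riptides-alerts.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # The daemon target reports down for the whole test window.
      - series: 'up{job="riptides-daemons", instance="10.0.1.15:10100"}'
        values: '0+0x10'
    alert_rule_test:
      # After the 5m "for" period the alert should be firing.
      - eval_time: 6m
        alertname: RiptidesDaemonDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: riptides-daemons
              instance: "10.0.1.15:10100"
            exp_annotations:
              summary: "Riptides daemon is down on 10.0.1.15:10100"
```

Run it with `promtool test rules riptides-alerts-test.yaml`.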

Grafana Dashboards

If you use Grafana, create dashboards for:

Cluster Overview: daemon count per cluster, overall health status, certificate issuance rate.
Daemon Detail: per-node view of connection counts, certificate expiry timelines, metadata collector status.
Control Plane: API latency, gRPC stream counts, storage usage, tunnel connection counts.

Point the Grafana Prometheus data source at either the sidecar (port 9090 on the control plane pod) or your external Prometheus instance.
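
If Grafana data sources are managed through provisioning files, the data source could be declared as in the sketch below. The data source name and the URL are assumptions. In particular, the URL presumes a Service named `controlplane` in the `riptides-cp` namespace that exposes port 9090, which may not match your installation.

```yaml
# Grafana data source provisioning file (illustrative).
apiVersion: 1
datasources:
  - name: Riptides Prometheus   # hypothetical display name
    type: prometheus
    access: proxy
    # Assumed in-cluster address; replace with your sidecar or external Prometheus URL.
    url: http://controlplane.riptides-cp.svc:9090
    isDefault: false
```

Grafana reads files like this from its `provisioning/datasources` directory at startup.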