Monitoring

This guide covers monitoring a Riptides deployment, including the built-in Prometheus sidecar, daemon health endpoints, and recommended metrics to track.

The Riptides control plane Helm chart deploys a Prometheus sidecar alongside the control plane container. This sidecar automatically discovers and scrapes metrics from both the control plane itself and registered daemons using file-based service discovery.

+------------------------------------------+
|            Control Plane Pod             |
|                                          |
|  +--------------------+   +----------+   |
|  | controlplane       |-->| prom SD  |   |
|  | (writes SD config) |   | config   |   |
|  +--------------------+   | file     |   |
|                           +----------+   |
|  +--------------------+        |         |
|  | prometheus sidecar |--------+         |
|  | (reads SD config)  |                  |
|  +--------------------+                  |
+------------------------------------------+

The control plane generates a Prometheus file-based service discovery configuration that the sidecar watches. As daemons register and report their endpoints, Prometheus automatically begins scraping them.

The Prometheus sidecar is configured through the Helm chart values:

prometheus:
  image:
    repository: docker.io/prom/prometheus
    tag: "v2.53.5"
  pvc:
    accessMode: ReadWriteOnce
    storageClassName: gp3
    size: 20Gi
  retention: 6h
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: 500m
      memory: 512Mi

The sidecar is accessible on port 9090 within the pod. To access it for ad-hoc queries:

kubectl port-forward -n riptides-cp deploy/controlplane 9090:9090

Then open http://localhost:9090 in your browser.
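
For scripted checks instead of the browser UI, the sidecar can be queried through the standard Prometheus HTTP API. The following is a minimal sketch, assuming the port-forward above is active; `PROMETHEUS_URL`, `query_up`, and `down_targets` are illustrative names, not part of Riptides.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the sidecar is reachable here through the port-forward shown above.
PROMETHEUS_URL = "http://localhost:9090"


def down_targets(query_response):
    """Return sorted (job, instance) pairs whose `up` sample is 0."""
    if query_response.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % query_response.get("error"))
    down = []
    for sample in query_response["data"]["result"]:
        # An instant-vector sample carries value = [timestamp, "<value as string>"].
        if float(sample["value"][1]) == 0:
            labels = sample["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return sorted(down)


def query_up(base_url=PROMETHEUS_URL, timeout=5.0):
    """Run the instant query `up` against the Prometheus HTTP API."""
    url = base_url + "/api/v1/query?" + urllib.parse.urlencode({"query": "up"})
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

Calling `down_targets(query_up())` returns the scrape targets that Prometheus currently reports as down. An empty list means every discovered target was scraped successfully.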

The control plane writes service discovery files that tell Prometheus where to scrape. The relevant configuration:

config:
  server:
    metrics:
      prometheusFileSDConfigDir: /promsdconfig
      prometheusFileSDConfigFile: riptides_sd_config.yaml

The control plane updates this file as daemons register and deregister. Prometheus reloads the file automatically when changes are detected.

The control plane exposes a health endpoint on the HTTP proxy:

| Endpoint | Port | Description |
|----------|------|-------------|
| /healthz | 8080 | Readiness and liveness probe |

curl http://controlplane:8080/healthz

Each daemon exposes a health endpoint:

| Endpoint | Port  | Description |
|----------|-------|-------------|
| /healthz | 10100 | Daemon health status |

On Kubernetes:

kubectl port-forward -n riptides-system daemonset/riptides-daemon 10100:10100
curl http://localhost:10100/healthz

On bare metal:

curl http://localhost:10100/healthz

On nodes with the kernel module loaded, check driver health directly:

cat /proc/riptides/health
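
The HTTP health endpoints above can also be polled from a script, for example in a cron job or a smoke test. This is a minimal sketch, assuming that any 2xx response means healthy (the response body is not inspected). The `ENDPOINTS` hostnames and the helper names are illustrative assumptions.

```python
import urllib.error
import urllib.request

# Ports and paths come from the tables above. The hostnames are assumptions:
# an in-cluster name for the control plane and a local port-forward for the daemon.
ENDPOINTS = {
    "controlplane": "http://controlplane:8080/healthz",
    "daemon": "http://localhost:10100/healthz",
}


def is_healthy(url, timeout=3.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with a 2xx status, False otherwise."""
    try:
        with opener(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError (raised for 4xx/5xx) are OSError subclasses,
        # as are socket timeouts and refused connections.
        return False


def check_all(endpoints=ENDPOINTS, opener=urllib.request.urlopen):
    """Map each endpoint name to its health result."""
    return {name: is_healthy(url, opener=opener) for name, url in endpoints.items()}
```

The `opener` parameter exists only so the check can be exercised without a live deployment. In normal use, `check_all()` is called with no arguments.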

If you run your own Prometheus instance (outside the control plane sidecar), configure it to scrape Riptides endpoints.

Add a scrape config to your Prometheus configuration:

scrape_configs:
  - job_name: "riptides-controlplane"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - riptides-cp
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: controlplane
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        action: keep
        regex: "9090"

Scraping Daemons via File-Based Service Discovery

If you want your external Prometheus to use the same file-based service discovery that the sidecar uses, mount the SD config directory and reference it:

scrape_configs:
  - job_name: "riptides-daemons"
    file_sd_configs:
      - files:
          - /promsdconfig/riptides_sd_config.yaml
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

Alternatively, scrape daemons directly using Kubernetes service discovery:

scrape_configs:
  - job_name: "riptides-daemons"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - riptides-system
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
        action: keep
        regex: daemon
      - source_labels: [__meta_kubernetes_pod_ip]
        replacement: "$1:10100"
        target_label: __address__
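
The last relabel rule above relies on Prometheus defaults: with no `action` or `regex` given, the action is `replace` and the regex is `(.*)`, so `$1` is the whole pod IP. The sketch below mimics that behaviour in Python to show the resulting scrape address. `relabel_replace` is an illustrative helper, not a Prometheus API, and it uses Python's regex engine rather than Prometheus's RE2.

```python
import re


def relabel_replace(source_value, replacement, regex="(.*)"):
    """Mimic the `replace` relabel action for a single source label.

    The regex must match the whole source value. On a match, $1, $2, ... in
    the replacement are substituted; otherwise the target label is left
    unchanged, which is signalled here by returning None.
    """
    match = re.fullmatch(regex, source_value)
    if match is None:
        return None
    # Translate Prometheus-style $N references into Python's \g<N> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return match.expand(template)
```

For a pod IP of `10.0.3.7`, `relabel_replace("10.0.3.7", "$1:10100")` yields `10.0.3.7:10100`, which is the address Prometheus then scrapes.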

The control plane generates Prometheus recording rules for pre-computed metrics. Include the rules directory in your Prometheus configuration:

rule_files:
  - /promsdconfig/rules/*.yaml

Recommended control plane metrics:

| Metric Category | What to Watch | Why |
|-----------------|---------------|-----|
| Daemon registrations | Count of registered daemons | Detect daemons dropping off |
| Certificate issuance | Issuance rate and errors | Ensure certificates are being renewed |
| API server latency | Request duration percentiles | Detect control plane performance issues |
| Storage utilization | Kine/SQLite database size | Plan storage capacity |
| gRPC connections | Active gRPC streams | Monitor daemon connectivity |
| Tunnel connections | Active tunnel count | Detect connectivity issues for NAT’d daemons |

Recommended daemon metrics:

| Metric Category | What to Watch | Why |
|-----------------|---------------|-----|
| Health status | Daemon health endpoint response | Detect unhealthy daemons |
| Certificate expiry | Time until next certificate renewal | Catch renewal failures before expiry |
| Kernel module health | /proc/riptides/health status | Detect driver issues |
| Connection count | Active mTLS connections handled | Understand workload volume |
| Metadata collection | Collector errors | Detect issues with cloud metadata APIs |

Example alerting rules:

groups:
  - name: riptides-critical
    rules:
      - alert: RiptidesDaemonDown
        expr: up{job="riptides-daemons"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Riptides daemon is down on {{ $labels.instance }}"
      - alert: RiptidesControlPlaneDown
        expr: up{job="riptides-controlplane"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Riptides control plane is unreachable"
      - alert: RiptidesDaemonHighRestartRate
        expr: rate(kube_pod_container_status_restarts_total{container="daemon"}[15m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Riptides daemon is restarting frequently on {{ $labels.node }}"
      - alert: RiptidesPrometheusStorageHigh
        expr: (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*prometheus.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*prometheus.*"}) > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus storage usage above 80%"
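
The `for:` durations in the rules above delay firing until the expression has held continuously. The sketch below is a simplified model of that behaviour, assuming evenly spaced rule evaluations and ignoring details such as staleness handling. `alert_states` is an illustrative helper, not a Prometheus API.

```python
def alert_states(samples, hold_seconds):
    """Classify each evaluation as 'inactive', 'pending', or 'firing'.

    `samples` is a list of (timestamp_seconds, condition_is_true) pairs in
    evaluation order. An alert turns pending when its expression first holds
    and fires once it has held continuously for at least `hold_seconds`.
    """
    states = []
    active_since = None
    for ts, is_true in samples:
        if not is_true:
            # Any evaluation where the expression is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        states.append("firing" if ts - active_since >= hold_seconds else "pending")
    return states
```

With `hold_seconds=300` (the `for: 5m` of `RiptidesDaemonDown`), a daemon that recovers within five minutes never reaches `firing`, which keeps short restarts from paging anyone.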

If you use Grafana, create dashboards for:

  1. Cluster Overview — Daemon count per cluster, overall health status, certificate issuance rate.
  2. Daemon Detail — Per-node view of connection counts, certificate expiry timelines, metadata collector status.
  3. Control Plane — API latency, gRPC stream counts, storage usage, tunnel connection counts.

Use the Prometheus data source pointing to either the sidecar (port 9090 on the control plane pod) or your external Prometheus instance.