Monitoring
This guide covers monitoring a Riptides deployment, including the built-in Prometheus sidecar, daemon health endpoints, and recommended metrics to track.
Architecture
The Riptides control plane Helm chart deploys a Prometheus sidecar alongside the control plane container. This sidecar automatically discovers and scrapes metrics from both the control plane itself and registered daemons using file-based service discovery.

    +------------------------------------------+
    |  Control Plane Pod                       |
    |                                          |
    |  +--------------------+   +----------+   |
    |  | controlplane       |-->| prom SD  |   |
    |  | (writes SD config) |   | config   |   |
    |  +--------------------+   | file     |   |
    |                           +----------+   |
    |  +--------------------+        |         |
    |  | prometheus sidecar |--------+         |
    |  | (reads SD config)  |                  |
    |  +--------------------+                  |
    +------------------------------------------+

The control plane generates a Prometheus file-based service discovery configuration that the sidecar watches. As daemons register and report their endpoints, Prometheus automatically begins scraping them.
Control Plane Metrics
Prometheus Sidecar Configuration
The Prometheus sidecar is configured through the Helm chart values:

    prometheus:
      image:
        repository: docker.io/prom/prometheus
        tag: "v2.53.5"
      pvc:
        accessMode: ReadWriteOnce
        storageClassName: gp3
        size: 20Gi
      retention: 6h
      resources:
        requests:
          cpu: 100m
          memory: 256Mi
        limits:
          cpu: 500m
          memory: 512Mi

The sidecar is accessible on port 9090 within the pod. To access it for ad-hoc queries:

    kubectl port-forward -n riptides-cp deploy/controlplane 9090:9090

Then open http://localhost:9090 in your browser.
File-Based Service Discovery
The control plane writes service discovery files that tell Prometheus where to scrape. The relevant configuration:

    config:
      server:
        metrics:
          prometheusFileSDConfigDir: /promsdconfig
          prometheusFileSDConfigFile: riptides_sd_config.yaml

The control plane updates this file as daemons register and deregister. Prometheus reloads the file automatically when changes are detected.
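To make the mechanism concrete, the sketch below builds a target list in the standard Prometheus file-based service discovery shape: a list of groups, each with `targets` and optional `labels`. It is an illustration only. The exact contents the control plane writes are not documented here, and the `build_file_sd` helper, the `node` label, and the sample addresses are hypothetical.

```python
import json


def build_file_sd(daemons):
    """Return a Prometheus file_sd target list for the given daemons.

    `daemons` is a hypothetical list of dicts with "address" and "node" keys;
    the fields the control plane actually records may differ.
    """
    groups = []
    for daemon in daemons:
        groups.append({
            "targets": [daemon["address"]],      # host:port that Prometheus scrapes
            "labels": {"node": daemon["node"]},  # attached to every scraped series
        })
    return groups


# Sample data (made up for illustration).
daemons = [
    {"address": "10.0.1.12:10100", "node": "node-a"},
    {"address": "10.0.2.7:10100", "node": "node-b"},
]

# JSON is also valid YAML, so this text could be stored in a .yaml SD file.
sd_text = json.dumps(build_file_sd(daemons), indent=2)
print(sd_text)
```

When a daemon deregisters, its group would simply disappear from the list, and Prometheus stops scraping that target on the next reload of the file.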
Health Endpoint
The control plane exposes a health endpoint on the HTTP proxy:
| Endpoint | Port | Description |
|---|---|---|
| /healthz | 8080 | Readiness and liveness probe |

    curl http://controlplane:8080/healthz

Daemon Metrics
Daemon Health Endpoint
Each daemon exposes a health endpoint:
| Endpoint | Port | Description |
|---|---|---|
| /healthz | 10100 | Daemon health status |
On Kubernetes:

    kubectl port-forward -n riptides-system daemonset/riptides-daemon 10100:10100
    curl http://localhost:10100/healthz

On bare metal:

    curl http://localhost:10100/healthz

Kernel Module Health
On nodes with the kernel module loaded, check driver health directly:

    cat /proc/riptides/health

External Prometheus Integration
If you run your own Prometheus instance (outside the control plane sidecar), configure it to scrape Riptides endpoints.
Scraping the Control Plane
Add a scrape config to your Prometheus configuration:

    scrape_configs:
      - job_name: "riptides-controlplane"
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - riptides-cp
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
            action: keep
            regex: controlplane
          - source_labels: [__meta_kubernetes_pod_container_port_number]
            action: keep
            regex: "9090"

Scraping Daemons via File-Based Service Discovery
If you want your external Prometheus to use the same file-based service discovery that the sidecar uses, mount the SD config directory and reference it:

    scrape_configs:
      - job_name: "riptides-daemons"
        file_sd_configs:
          - files:
              - /promsdconfig/riptides_sd_config.yaml
        relabel_configs:
          - source_labels: [__address__]
            target_label: instance

Scraping Daemons Directly
Alternatively, scrape daemons directly using Kubernetes service discovery:

    scrape_configs:
      - job_name: "riptides-daemons"
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - riptides-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_name]
            action: keep
            regex: daemon
          - source_labels: [__meta_kubernetes_pod_ip]
            replacement: "$1:10100"
            target_label: __address__

Recording Rules
The control plane generates Prometheus recording rules for pre-computed metrics. Include the rules directory in your Prometheus configuration:

    rule_files:
      - /promsdconfig/rules/*.yaml

Key Metrics to Monitor
Control Plane
| Metric Category | What to Watch | Why |
|---|---|---|
| Daemon registrations | Count of registered daemons | Detect daemons dropping off |
| Certificate issuance | Issuance rate and errors | Ensure certificates are being renewed |
| API server latency | Request duration percentiles | Detect control plane performance issues |
| Storage utilization | Kine/SQLite database size | Plan storage capacity |
| gRPC connections | Active gRPC streams | Monitor daemon connectivity |
| Tunnel connections | Active tunnel count | Detect connectivity issues for NAT’d daemons |
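As a rough way to watch the daemon registration count outside a dashboard, the sketch below reads Prometheus's built-in `up` series through the standard HTTP API (`/api/v1/query`) and splits targets into up and down. It assumes the `riptides-daemons` job name from the external scrape configs in this guide; the job names used by the sidecar may differ. The helper names and sample payload are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def fetch(base_url, promql, timeout=5.0):
    """Run an instant query and return the decoded JSON response."""
    with urllib.request.urlopen(query_url(base_url, promql), timeout=timeout) as resp:
        return json.load(resp)


def targets_by_state(payload):
    """Split an instant-vector response for `up` into (up, down) instance lists."""
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %s" % payload.get("error"))
    up, down = [], []
    for sample in payload["data"]["result"]:
        instance = sample["metric"].get("instance", "<unknown>")
        # Each sample value is [unix_timestamp, "string_value"].
        (up if float(sample["value"][1]) == 1 else down).append(instance)
    return sorted(up), sorted(down)


# Sample response in the shape the Prometheus HTTP API returns (made-up data).
sample = {
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"instance": "10.0.1.12:10100"}, "value": [1700000000, "1"]},
        {"metric": {"instance": "10.0.2.7:10100"}, "value": [1700000000, "0"]},
    ]},
}
up, down = targets_by_state(sample)
print(f"{len(up)} up, {len(down)} down: {down}")
```

With the sidecar port-forwarded to localhost, `targets_by_state(fetch("http://localhost:9090", 'up{job="riptides-daemons"}'))` would return the live lists instead of the sample data.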
Daemon
| Metric Category | What to Watch | Why |
|---|---|---|
| Health status | Daemon health endpoint response | Detect unhealthy daemons |
| Certificate expiry | Time until next certificate renewal | Catch renewal failures before expiry |
| Kernel module health | /proc/riptides/health status | Detect driver issues |
| Connection count | Active mTLS connections handled | Understand workload volume |
| Metadata collection | Collector errors | Detect issues with cloud metadata APIs |
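The health status row can be checked with a small probe script. The sketch below is one possible approach, not part of Riptides: it treats any 2xx response from `/healthz` as healthy, which is an assumption because the response body and status codes are not documented here. The `probe` and `check_all` helpers are hypothetical, and the URLs are taken from the curl examples in this guide.

```python
import urllib.error
import urllib.request

# Assumed URLs, copied from the curl examples in this guide; adjust per environment.
ENDPOINTS = {
    "controlplane": "http://controlplane:8080/healthz",
    "daemon": "http://localhost:10100/healthz",
}


def probe(url, timeout=2.0):
    """Return (healthy, detail) for a health endpoint; network errors map to False."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # Covers DNS failures, refused connections, and timeouts.
        return False, f"unreachable: {exc}"


def check_all(endpoints, timeout=2.0):
    """Probe every endpoint and return {name: (healthy, detail)}."""
    return {name: probe(url, timeout) for name, url in endpoints.items()}
```

Calling `check_all(ENDPOINTS)` from a cron job or a CI step gives a quick pass/fail view per endpoint. For continuous monitoring, the Prometheus alerts in the next section are the better fit.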
Alerting Recommendations
Critical Alerts

    groups:
      - name: riptides-critical
        rules:
          - alert: RiptidesDaemonDown
            expr: up{job="riptides-daemons"} == 0
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "Riptides daemon is down on {{ $labels.instance }}"

          - alert: RiptidesControlPlaneDown
            expr: up{job="riptides-controlplane"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Riptides control plane is unreachable"

Warning Alerts

          - alert: RiptidesDaemonHighRestartRate
            expr: rate(kube_pod_container_status_restarts_total{container="daemon"}[15m]) > 0
            for: 10m
            labels:
              severity: warning
            annotations:
              # kube-state-metrics labels this series by pod, not by node
              summary: "Riptides daemon is restarting frequently in pod {{ $labels.pod }}"

          - alert: RiptidesPrometheusStorageHigh
            expr: (kubelet_volume_stats_used_bytes{persistentvolumeclaim=~".*prometheus.*"} / kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*prometheus.*"}) > 0.8
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Prometheus storage usage above 80%"

Grafana Dashboards
If you use Grafana, create dashboards for:
- Cluster Overview — Daemon count per cluster, overall health status, certificate issuance rate.
- Daemon Detail — Per-node view of connection counts, certificate expiry timelines, metadata collector status.
- Control Plane — API latency, gRPC stream counts, storage usage, tunnel connection counts.
Use the Prometheus data source pointing to either the sidecar (port 9090 on the control plane pod) or your external Prometheus instance.