
Context

I am trying to chart the network bandwidth usage of a node in two different ways:

  1. By looking at global metrics for that node
  2. By summing up the corresponding metric for each Pod

To achieve this, I am issuing the following Prometheus queries (example for the receive bandwidth):

  • For the entire node (metric from node-exporter)

    sum(irate(node_network_receive_bytes_total{instance="10.142.0.54:9100"}[$__rate_interval])) by (device)
    
  • Per Pod (metric from kubelet)

    sum(irate(container_network_receive_bytes_total{node="$node",container!=""}[$__rate_interval])) by (pod,interface)
    

The results are displayed in the following Grafana dashboard, after generating some load on an HTTP service called thrpt-receiver:

Receive bandwidth

Here's what I see if I look at the raw metrics, without sum() and irate() applied:

Received bytes
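
For reference, the raw series in the panel above are simply the two bare selectors, i.e. the same label matchers as in the queries with sum() and irate() stripped off:

    node_network_receive_bytes_total{instance="10.142.0.54:9100"}

    container_network_receive_bytes_total{node="$node",container!=""}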

Problem

As you can see, results are vastly different, to the point I'm almost certain I am doing something wrong, but what?

What makes me especially suspicious about the Pod metrics is the supposedly increasing receive bandwidth of kube-proxy (which, AFAIK, should not be receiving any traffic in iptables mode) and of agents such as the Prometheus node-exporter.

1 Answer

I found out what was happening in my graphs. All the Pods mentioned above have one thing in common: they run in the host's network namespace, so their network metrics are all identical and equal to the host's global metric (just reported with slightly different precision).

$ kubectl -n monitoring get pod -o jsonpath='{.spec.hostNetwork}' \
    prometheus-stack-prometheus-node-exporter-jnhw7

true

$ kubectl -n kube-system get pod -o jsonpath='{.items[*].spec.hostNetwork}' \
    kube-proxy-gke-triggermesh-product-control-plane-7fc0ad24-z586 \
    gke-metrics-agent-5cv4m \
    prometheus-to-sd-tk8jv \
    fluentbit-gke-xh879

true true true true 
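
If you want to enumerate every Pod that shares the host's network namespace, rather than checking them one by one as above, something along these lines should do it (the custom-columns output is just one convenient way to filter):

    $ kubectl get pods --all-namespaces \
        -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,HOSTNETWORK:.spec.hostNetwork' \
        | grep true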

One way to see this is to compare the host's metric with the same metric as reported for one of the Pods above:

Comparison of eth0 and a Pod with hostNetwork
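
In PromQL terms, that comparison is just two plain irate() expressions, one against the node-exporter metric and one against the kubelet metric for a hostNetwork Pod. As a sketch (the instance, node and pod values are the ones from this example and need to be adapted):

    irate(node_network_receive_bytes_total{instance="10.142.0.54:9100",device="eth0"}[$__rate_interval])

    irate(container_network_receive_bytes_total{node="$node",pod="prometheus-stack-prometheus-node-exporter-jnhw7",interface="eth0"}[$__rate_interval])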

  • Thanks for the hint. We researched a similar problem today. Our node-exporter Pod's bandwidth had increased massively (to more than 400mb/s), which we could not explain. In our case, a Redis Pod had very high network transmit, which was routed through Calico/kube-proxy in the host network namespace.

    The host machine itself showed only 150mb/s of network transmit, because Calico only routed the external traffic to other nodes; a lot of the traffic stayed inside the virtual network of the host itself.

    – Martin Lantzsch Nov 11 '22 at 13:50