
Active time series suddenly drops to 0 while data points ingestion rate keeps working

Describe the bug I don't think it's actually a bug, so this is more of a question. I have a Prometheus that uses remote_write to send data into a VM cluster. Everything worked fine, then suddenly the active time series dropped to 0 and all queries return empty data. The vmstorage pods barely use any of their allocated capacity, and no pods have stopped working (vminsert and vmstorage all show status Running in EKS). I can query the data just fine in the Prometheus that does the remote write, so the data itself is OK.
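
One way to narrow down where the empty query results come from is to hit vmselect's Prometheus-compatible API directly and see whether it returns any series at all. A minimal sketch, assuming vmselect listens on its default port 8481, the tenant is accountID 3979676418 (the one visible in the vmselect logs below), and the service hostname used here is hypothetical:

```python
# Minimal check against vmselect's Prometheus-compatible query API.
# Assumptions: the hostname is hypothetical, port 8481 is the vmselect default,
# accountID 3979676418 is taken from the vmselect logs below.
import requests

VMSELECT_URL = "http://vm-victoria-metrics-cluster-vmselect.vm.svc.cluster.local:8481"
ACCOUNT_ID = 3979676418

resp = requests.get(
    f"{VMSELECT_URL}/select/{ACCOUNT_ID}/prometheus/api/v1/query",
    params={"query": "count(up)"},  # any metric Prometheus is known to remote_write
    timeout=10,
)
resp.raise_for_status()
result = resp.json()["data"]["result"]
print("series returned:", result)  # an empty list means vmselect sees no data for this tenant
```

If this returns data while the dashboards do not, the problem is likely in the datasource or query; if it comes back empty, the gap is somewhere on the vminsert/vmstorage/vmselect path.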

To Reproduce I sent a lot of data from Prometheus (around 2M active time series) to a cluster with 2 vmstorage machines (16GB of RAM and 2 CPU cores each) and 7 vminsert machines (1GB of RAM and 0.5 CPU cores each).
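
For context on that sizing, here is a back-of-the-envelope sketch of how those 2M active series spread across the two vmstorage nodes. The per-series RAM figure is an illustrative assumption, not a number taken from the VictoriaMetrics docs:

```python
# Rough sizing sketch for the setup described above.
# ASSUMED_BYTES_PER_ACTIVE_SERIES is an illustrative assumption only.
ACTIVE_SERIES = 2_000_000        # ~2M active series coming from Prometheus
VMSTORAGE_NODES = 2
VMSTORAGE_RAM_GB = 16
ASSUMED_BYTES_PER_ACTIVE_SERIES = 1_000

series_per_node = ACTIVE_SERIES / VMSTORAGE_NODES
est_ram_gib = series_per_node * ASSUMED_BYTES_PER_ACTIVE_SERIES / 1024**3

print(f"~{series_per_node:,.0f} active series per vmstorage node")
print(f"~{est_ram_gib:.2f} GiB estimated per-node RAM at the assumed rate "
      f"(out of {VMSTORAGE_RAM_GB} GiB allocated)")
```

Under that assumption the vmstorage nodes are not obviously undersized, which matches the observation that they barely use their allocated resources.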

Expected behavior I would expect VM to keep working as well as it did before, or at least to surface errors somewhere so I can understand what is going on.

Screenshots Here is the sudden drop of active time series to 0: [screenshot, 2020-12-01 20:27]

Nothing special with VM resource usage: [screenshot, 2020-12-01 20:28]

Version v1.46.0

Any idea what could cause this behaviour or what should I do to resolve this? Thanks!

Alon-Katz replied:

Sure, here are the most recent logs from each VM component (vmstorage, vminsert, vmselect). They are from right now, and the Grafana dashboard still shows around 1k log messages per second:

VMSTORAGE (it looks like all of the ~1k messages per second come from vmstorage):

{"ts":"2020-12-06T18:37:44.771Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"accepted vminsert conn from 10.0.169.211:42750"} {"ts":"2020-12-06T18:37:44.771Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:200","msg":"accepted vmselect conn from 10.0.169.211:41022"} {"ts":"2020-12-06T18:37:44.772Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"accepted vminsert conn from 10.0.169.211:42754"} {"ts":"2020-12-06T18:37:44.773Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:200","msg":"accepted vmselect conn from 10.0.169.211:41026"} {"ts":"2020-12-06T18:37:44.773Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"accepted vminsert conn from 10.0.169.211:42758"} {"ts":"2020-12-06T18:37:44.774Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:200","msg":"accepted vmselect conn from 10.0.169.211:41030"} {"ts":"2020-12-06T18:37:44.774Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"accepted vminsert conn from 10.0.169.211:42764"} {"ts":"2020-12-06T18:37:44.775Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:200","msg":"accepted vmselect conn from 10.0.169.211:41036"} {"ts":"2020-12-06T18:37:44.775Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"accepted vminsert conn from 10.0.169.211:42768"} {"ts":"2020-12-06T18:37:44.775Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:200","msg":"accepted vmselect conn from 10.0.169.211:41040"} {"ts":"2020-12-06T18:37:44.776Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"accepted vminsert conn from 10.0.169.211:42772"} {"ts":"2020-12-06T18:37:44.776Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:200","msg":"accepted vmselect conn from 10.0.169.211:41044"} {"ts":"2020-12-06T18:37:44.776Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"accepted vminsert conn from 10.0.169.211:42776"} {"ts":"2020-12-06T18:37:44.777Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:200","msg":"accepted vmselect conn from 10.0.169.211:41048"} {"ts":"2020-12-06T18:37:44.777Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:133","msg":"accepted vminsert conn from 10.0.169.211:42780"} {"ts":"2020-12-06T18:37:44.777Z","level":"info","caller":"VictoriaMetrics/app/vmstorage/transport/server.go:200","msg":"accepted vmselect conn from 10.0.169.211:41052"}

VMINSERT (the last log entry is from 15/11, so I think there was no issue here):

{"ts":"2020-11-15T13:13:51.459Z","level":"warn","caller":"VictoriaMetrics/app/vminsert/netstorage/netstorage.go:205","msg":"cannot dial storageNode "vm-victoria-metrics-cluster-vmstorage-0.vm-victoria-metrics-cluster-vmstorage.vm.svc.cluster.local:8400": dial tcp4: i/o timeout"} {"ts":"2020-11-15T13:13:51.459Z","level":"warn","caller":"VictoriaMetrics/app/vminsert/netstorage/netstorage.go:205","msg":"cannot dial storageNode "vm-victoria-metrics-cluster-vmstorage-1.vm-victoria-metrics-cluster-vmstorage.vm.svc.cluster.local:8400": dial tcp4: i/o timeout"} {"ts":"2020-11-15T13:14:14.555Z","level":"info","caller":"VictoriaMetrics/app/vminsert/netstorage/netstorage.go:209","msg":"successfully dialed -storageNode="vm-victoria-metrics-cluster-vmstorage-0.vm-victoria-metrics-cluster-vmstorage.vm.svc.cluster.local:8400""} {"ts":"2020-11-15T13:14:15.151Z","level":"info","caller":"VictoriaMetrics/app/vminsert/netstorage/netstorage.go:209","msg":"successfully dialed -storageNode="vm-victoria-metrics-cluster-vmstorage-1.vm-victoria-metrics-cluster-vmstorage.vm.svc.cluster.local:8400""}

VMSELECT:

{"ts":"2020-12-02T10:57:50.125Z","level":"warn","caller":"VictoriaMetrics/app/vmselect/promql/exec.go:29","msg":"slow query according to -search.logSlowQueryDuration=5s: duration=19.568 seconds, start=1606816800, end=1606903200, step=3600, accountID=3979676418, projectID=0, query="topk(5, container_cpu_usage_seconds_total)""} {"ts":"2020-12-02T10:57:54.791Z","level":"warn","caller":"VictoriaMetrics/app/vmselect/promql/exec.go:29","msg":"slow query according to -search.logSlowQueryDuration=5s: duration=11.466 seconds, start=1606820220, end=1606906620, step=60, accountID=3979676418, projectID=0, query="topk(5, container_cpu_usage_seconds_total)""} {"ts":"2020-12-02T10:57:54.791Z","level":"warn","caller":"VictoriaMetrics/app/vmselect/main.go:379","msg":"error in "/select/3979676418/prometheus/api/v1/query_range?query=topk(5%2C%20container_cpu_usage_seconds_total)&start=1606820220&end=1606906620&step=60": error when executing query="topk(5, container_cpu_usage_seconds_total)" on the time range (start=1606820220000, end=1606906620000, step=60000): cannot execute query: cannot evaluate "container_cpu_usage_seconds_total": not enough memory for processing 96832318 data points across 67198 time series with 1441 points in each time series; total available memory for concurrent requests: 161061273 bytes; possible solutions are: reducing the number of matching time series; switching to node with more RAM; increasing -memory.allowedPercent; increasing step query arg (60s)"} {"ts":"2020-12-02T10:58:10.334Z","level":"warn","caller":"VictoriaMetrics/app/vmselect/promql/exec.go:29","msg":"slow query according to -search.logSlowQueryDuration=5s: duration=19.519 seconds, start=1606816800, end=1606903200, step=3600, accountID=3979676418, projectID=0, query="topk(5, container_cpu_usage_seconds_total)""}

Let me know if I can provide more details.
