Chris Marchbanks (csmarchbanks), @Splunk, Boulder, CO

cortexproject/cortex 3522

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

csmarchbanks/go-schema-registry 2

A repository to interact with Avro schemas stored in a Confluent Schema Registry

csmarchbanks/gmail-scraper 1

An application that will index your gmail account into elasticsearch. Used to get to know Prometheus

csmarchbanks/remote-write-dedupe 1

A Prometheus remote write deduplicating proxy

csmarchbanks/advent-2017 0

Solutions in Go for http://adventofcode.com/2017

csmarchbanks/cadvisor 0

Analyzes resource usage and performance characteristics of running containers.

csmarchbanks/client_golang 0

Prometheus instrumentation library for Go applications

csmarchbanks/cloudflare-ddns 0

Setup dynamic DNS with Cloudflare

csmarchbanks/cortex 0

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

csmarchbanks/docker-siege 0

Docker image for centos & siege. Used for load testing

issue comment prometheus/prometheus

Proposal: Completely remove series after deleting and cleaning

@roidelapluie We're seeing the same behaviour. I deleted the series but it still pops up in the suggestion box.

trallnag

comment created time in an hour

pull request comment prometheus/prometheus

CNAME responses can occur with "Type: A" dns_sd_config requests

Looks like all tests passed, and the proper text was updated, @brian-brazil. Please let me know if there's anything further needed on this PR.

mattberther

comment created time in 2 hours

issue opened prometheus/prometheus

Prometheus raises an out-of-bounds error for all targets after resuming the system from a suspend

What did you do?

After suspending the system and resuming it again, Prometheus reports the following error and cannot scrape any new metrics unless the Prometheus service is restarted.

What did you expect to see?

Prometheus should continue to scrape new metrics.

What did you see instead? Under which circumstances?

Check the logs in the Environment section below. In the web UI, I got the following:

[image: screenshot of the error shown in the web UI]

Environment

  • System information:
Linux t470p 5.4.79-1-lts #1 SMP Sun, 22 Nov 2020 14:22:21 +0000 x86_64 GNU/Linux
  • Prometheus version:
prometheus, version 2.22.2 (branch: tarball, revision: 2.22.2)
  build user:       someone@builder
  build date:       20201117-18:44:08
  go version:       go1.15.5
  platform:         linux/amd64
# prometheus.service

# /usr/lib/systemd/system/prometheus.service
[Unit]
Description=Prometheus service
Requires=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Restart=on-failure
WorkingDirectory=/usr/share/prometheus
EnvironmentFile=-/etc/conf.d/prometheus
ExecStart=/usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus/dat>
ExecReload=/bin/kill -HUP $MAINPID
LimitNOFILE=65535
NoNewPrivileges=true
ProtectHome=true
ProtectSystem=full
ProtectHostname=true
ProtectControlGroups=true
ProtectKernelModules=true
ProtectKernelTunables=true
LockPersonality=true
RestrictRealtime=yes
RestrictNamespaces=yes
MemoryDenyWriteExecute=yes
PrivateDevices=yes
CapabilityBoundingSet=

[Install]
WantedBy=multi-user.target

# /etc/systemd/system/prometheus.service.d/prometheus.conf
[Service]
ExecStart=
ExecStart=/usr/bin/prometheus --config.file=/etc/prometheus/prometheus.yml --storage.tsdb.path=/home/prometheus $PROME>
ProtectHome=False
  • Prometheus configuration file:
---
global:
  scrape_interval: 300s
  evaluation_interval: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - localhost:9100
        labels:
          uid: t470
  • Logs:
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.714Z caller=head.go:889 component=tsdb msg="WAL checkpoint complete" first=1668 last=1669 duration=32.511167ms
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.763Z caller=head.go:809 component=tsdb msg="Head GC completed" duration=739.34µs
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.812Z caller=head.go:809 component=tsdb msg="Head GC completed" duration=802.06µs
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.812Z caller=checkpoint.go:96 component=tsdb msg="Creating checkpoint" from_segment=1670 to_segment=1671 mint=1606780800000
Dec 01 09:46:29 t470p prometheus[1629652]: level=info ts=2020-12-01T01:46:29.822Z caller=head.go:889 component=tsdb msg="WAL checkpoint complete" first=1670 last=1671 duration=10.152588ms
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1378 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=172.20.149.141%3A22" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=6
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1145 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=xxxx%3A22" msg="Append failed" err="out of bounds"
Dec 01 09:46:47 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:47.293Z caller=scrape.go:1094 component="scrape manager" scrape_pool=ssh target="http://127.0.0.1:9115/probe?module=ssh_banner&target=xxx%3A22" msg="Appending scrape report failed" err="out of bounds"
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1378 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fxxx" msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=17
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1145 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fwww.baidu.com" msg="Append failed" err="out of bounds"
Dec 01 09:46:56 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:46:56.162Z caller=scrape.go:1094 component="scrape manager" scrape_pool=blackbox target="http://127.0.0.1:9115/probe?module=http_2xx&target=http%3A%2F%2Fwww.baidu.com" msg="Appending scrape report failed" err="out of bounds"
Dec 01 09:47:01 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:47:01.836Z caller=scrape.go:1378 component="scrape manager" scrape_pool=gitea target=http://localhost:10080/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=67
Dec 01 09:47:01 t470p prometheus[1629652]: level=warn ts=2020-12-01T01:47:01.836Z caller=scrape.go:1145 component="scrape manager" scrape_pool=gitea target=http://localhost:10080/metrics msg="Append failed" err="out of bounds"

created time in 2 hours

Pull request review comment prometheus/prometheus

CNAME responses can occur with "Type: A" dns_sd_config requests

func (d *Discovery) refreshOne(ctx context.Context, name string, ch chan<- *targ
			target = hostPort(addr.A.String(), d.port)
		case *dns.AAAA:
			target = hostPort(addr.AAAA.String(), d.port)
+		case *dns.CNAME:
+			// Ignore to prevent warning message from default case.

I misunderstood what you meant by the comment; I thought you were referring to the title of the PR. My mistake, I'll get it adjusted.

mattberther

comment created time in 3 hours

Pull request review comment prometheus/prometheus

CNAME responses can occur with "Type: A" dns_sd_config requests

func (d *Discovery) refreshOne(ctx context.Context, name string, ch chan<- *targ
			target = hostPort(addr.A.String(), d.port)
		case *dns.AAAA:
			target = hostPort(addr.AAAA.String(), d.port)
+		case *dns.CNAME:
+			// Ignore to prevent warning message from default case.

This comment is still not useful. I can already tell this from the code alone; instead, explain why this is the right thing to do.

mattberther

comment created time in 3 hours

pull request comment prometheus/prometheus

CNAME responses can occur with "Type: A" dns_sd_config requests

I've kicked it off.

mattberther

comment created time in 4 hours

pull request comment prometheus/prometheus

CNAME responses can occur with "Type: A" dns_sd_config requests

@brian-brazil I've made the proposed changes. There seems to be a CircleCI test that is failing. I don't expect that my change to an error message caused the described failure (since the pipeline passed on the initial PR):

Failed
=== RUN   TestHandleMultipleQuitRequests
    web_test.go:492: 
        	Error Trace:	web_test.go:492
        	            				asm_amd64.s:1374
        	Error:      	Received unexpected error:
        	            	Post "http://localhost:9090/-/quit": EOF
        	Test:       	TestHandleMultipleQuitRequests
--- FAIL: TestHandleMultipleQuitRequests (5.01s)

However, I see no way to re-run the workflow (presumably because I do not have write access). Is this something that you can kick off, or is there another way for me to re-run the workflow?

mattberther

comment created time in 4 hours

issue comment cortexproject/cortex

Docs: Update the Architecture Section about the Ring

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

gotjosh

comment created time in 6 hours

issue comment cortexproject/cortex

Fix deprecated gRPC naming.Watcher

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

zendern

comment created time in 6 hours

issue comment cortexproject/cortex

Support multiple regions with disjoint ingesters and DB replication

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

ThePants999

comment created time in 6 hours

issue comment cortexproject/cortex

Make each ingester use its own key in the ring

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

gouthamve

comment created time in 6 hours

issue comment cortexproject/cortex

query-frontend: Response caching should work for subqueries and part queries.

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

bwplotka

comment created time in 6 hours

issue comment cortexproject/cortex

DynamoDB: Make min usage for scaledown configurable

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

dcherman

comment created time in 6 hours

issue comment cortexproject/cortex

TSDB unloading

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

pstibrany

comment created time in 6 hours

issue comment cortexproject/cortex

Reduce the risk of having unregistered metrics

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

pracucci

comment created time in 6 hours

PR closed cortexproject/cortex

Selectively disable Indexing of Labels (size/XL, stale)

What this PR does: This PR introduces a new config under the chunk_store section called exclude labels. These labels are skipped in the lookup_series_from_matchers function inside series_store, which reduces index lookups.
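
A rough sketch of the mechanism (hypothetical package, type, and function names; not the actual Cortex code, which wires this into the series-store lookup mentioned above):

package chunkstore

// Matcher is a pared-down stand-in for a PromQL label matcher;
// only the label name matters for this sketch.
type Matcher struct {
	Name  string
	Value string
}

// filterExcludedMatchers drops matchers whose label name is in the
// configured exclude set, so those labels never trigger an index lookup.
func filterExcludedMatchers(matchers []*Matcher, exclude map[string]struct{}) []*Matcher {
	kept := matchers[:0]
	for _, m := range matchers {
		if _, skip := exclude[m.Name]; skip {
			continue
		}
		kept = append(kept, m)
	}
	return kept
}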

Signed-off-by: Jay Batra jaybatra73@gmail.com

Which issue(s) this PR fixes: Fixes #2068

Checklist

  • [x] Tests updated
  • [ ] Documentation added
  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
+492 -36

3 comments

10 changed files

jaybatra26

pr closed time in 6 hours

issue closed prometheus/prometheus

Prometheus hangs without log message

What did you do?

$ while curl 10.10.2.4:9090/-/healthy; do date; done
Sun May 31 18:18:11 UTC 2020
Prometheus is Healthy.
Sun May 31 18:18:11 UTC 2020
Prometheus is Healthy.
Sun May 31 18:18:11 UTC 2020
Prometheus is Healthy.
Sun May 31 18:18:29 UTC 2020
Prometheus is Healthy.
Sun May 31 18:18:29 UTC 2020
Prometheus is Healthy.
Sun May 31 18:18:29 UTC 2020

What did you expect to see?

Either a healthy message at least once every 2 seconds or a warning message in the Prometheus logs.

What did you see instead? Under which circumstances?

Prometheus was unresponsive, here for 18 seconds and later for over 30 seconds (long enough to get it killed), with nothing in the Prometheus log between "Server is ready to receive web requests" and "Received SIGTERM, exiting gracefully".

Prometheus is idle, with compaction disabled because it is running a Thanos sidecar to upload data to S3.

Prometheus has 1 full CPU and 4Gi of memory allocated, and there is no indication that it is using more than 1.5Gi or being killed because the node is OOM. This is a quiet cluster whose total allocated memory limits are lower than the total available memory.

Environment

Prometheus 2.18.1 on Kubernetes 1.15.10 EKS. Running 2 replicas. Both replicas (on separate nodes) are exhibiting the same behavior.

  • System information:
$ uname -srm
Linux 4.14.165-133.209.amzn2.x86_64 x86_64
  • Prometheus version:
$ prometheus --version
prometheus, version 2.18.1 (branch: HEAD, revision: ecee9c8abfd118f139014cb1b174b08db3f342cf)
  build user:       root@2117a9e64a7e
  build date:       20200507-16:51:47
  go version:       go1.14.2
  • Prometheus configuration file:
global:
  evaluation_interval: 30s
  scrape_interval: 30s
  external_labels:
    cluster: test
    prometheus: monitoring/po-prometheus
    prometheus_replica: prometheus-po-prometheus-1

plus jobs and rules from CoreOS Prometheus Operator

  • Prometheus command line args:
      args:
        - '--web.console.templates=/etc/prometheus/consoles'
        - '--web.console.libraries=/etc/prometheus/console_libraries'
        - '--config.file=/etc/prometheus/config_out/prometheus.env.yaml'
        - '--storage.tsdb.path=/prometheus'
        - '--storage.tsdb.retention.time=7h'
        - '--web.enable-lifecycle'
        - '--storage.tsdb.no-lockfile'
        - '--web.enable-admin-api'
        - '--web.external-url=https://prometheus.redacted.com'
        - '--web.route-prefix=/'
        - '--log.format=json'
        - '--storage.tsdb.max-block-duration=2h'

  • Logs: Log extract (omissions noted with "snip")

{"caller":"main.go:337","level":"info","msg":"Starting Prometheus","ts":"2020-05-31T18:45:54.980Z","version":"(version=2.18.1, branch=HEAD, revision=ecee9c8abfd118f139014cb1b174b08db3f342cf)"}
{"build_context":"(go=go1.14.2, user=root@2117a9e64a7e, date=20200507-16:51:47)","caller":"main.go:338","level":"info","ts":"2020-05-31T18:45:54.980Z"}
{"caller":"main.go:339","host_details":"(Linux 4.14.165-133.209.amzn2.x86_64 #1 SMP Sun Feb 9 00:21:30 UTC 2020 x86_64 prometheus-po-prometheus-0 (none))","level":"info","ts":"2020-05-31T18:45:54.980Z"}
{"caller":"main.go:340","fd_limits":"(soft=65536, hard=65536)","level":"info","ts":"2020-05-31T18:45:54.981Z"}
{"caller":"main.go:341","level":"info","ts":"2020-05-31T18:45:54.981Z","vm_limits":"(soft=unlimited, hard=unlimited)"}
{"caller":"query_logger.go:79","component":"activeQueryTracker","level":"info","msg":"These queries didn't finish in prometheus' last run:","queries":"[{\"query\":\"sum(rate(container_network_transmit_bytes_total{pod=~\\\"ingress-nginx-ingress-controller-hffqg\\\",namespace=\\\"kube-system\\\"}[1m])) by (container, namespace)\",\"timestamp_sec\":1590950448},{\"query\":\"sum(kube_pod_container_resource_requests{pod=~\\\"prometheus-po-prometheus-0\\\",resource=\\\"memory\\\",namespace=\\\"monitoring\\\"}) by (container, namespace)\",\"timestamp_sec\":1590950448},{\"query\":\"sum(rate(container_cpu_usage_seconds_total{container!=\\\"POD\\\",container!=\\\"\\\",pod=~\\\"prometheus-po-prometheus-0\\\",namespace=\\\"monitoring\\\"}[1m])) by (container, namespace)\",\"timestamp_sec\":1590950448}]","ts":"2020-05-31T18:45:54.985Z"}
{"caller":"main.go:678","level":"info","msg":"Starting TSDB ...","ts":"2020-05-31T18:45:55.015Z"}

snip

{"caller":"head.go:627","component":"tsdb","duration":"31.459049573s","level":"info","msg":"WAL replay completed","ts":"2020-05-31T18:46:26.704Z"}
{"caller":"main.go:694","fs_type":"NFS_SUPER_MAGIC","level":"info","ts":"2020-05-31T18:46:26.964Z"}
{"caller":"main.go:695","level":"info","msg":"TSDB started","ts":"2020-05-31T18:46:26.964Z"}
{"caller":"main.go:799","filename":"/etc/prometheus/config_out/prometheus.env.yaml","level":"info","msg":"Loading configuration file","ts":"2020-05-31T18:46:26.964Z"}
{"caller":"kubernetes.go:253","component":"discovery manager scrape","discovery":"k8s","level":"info","msg":"Using pod service account via in-cluster config","ts":"2020-05-31T18:46:26.968Z"}
{"caller":"kubernetes.go:253","component":"discovery manager scrape","discovery":"k8s","level":"info","msg":"Using pod service account via in-cluster config","ts":"2020-05-31T18:46:26.970Z"}
{"caller":"kubernetes.go:253","component":"discovery manager scrape","discovery":"k8s","level":"info","msg":"Using pod service account via in-cluster config","ts":"2020-05-31T18:46:26.970Z"}
{"caller":"kubernetes.go:253","component":"discovery manager notify","discovery":"k8s","level":"info","msg":"Using pod service account via in-cluster config","ts":"2020-05-31T18:46:26.971Z"}
{"caller":"main.go:827","filename":"/etc/prometheus/config_out/prometheus.env.yaml","level":"info","msg":"Completed loading of configuration file","ts":"2020-05-31T18:46:27.184Z"}
{"caller":"main.go:646","level":"info","msg":"Server is ready to receive web requests.","ts":"2020-05-31T18:46:27.184Z"}
{"caller":"main.go:524","level":"warn","msg":"Received SIGTERM, exiting gracefully...","ts":"2020-05-31T18:48:26.766Z"}

Turning on debugging, I can see this (the healthy endpoint was unresponsive from 2020-05-31T20:35:25 to 2020-05-31T20:35:57, the SIGTERM coming presumably because the health probe failureThreshold was exceeded):

{"caller":"klog.go:53","component":"k8s_client_runtime","func":"Verbose.Infof","level":"debug","msg":"caches populated","ts":"2020-05-31T20:34:49.570Z"}
{"caller":"scrape.go:962","component":"scrape manager","err":"Get \"http://10.10.15.65:9090/metrics\": context deadline exceeded","level":"debug","msg":"Scrape failed","scrape_pool":"monitoring/po-prometheus/0","target":"http://10.10.15.65:9090/metrics","ts":"2020-05-31T20:35:25.783Z"}
{"alertname":"KubeSchedulerDown","caller":"manager.go:783","component":"rule manager","group":"kubernetes-system-scheduler","labels":"{alertname=\"KubeSchedulerDown\", severity=\"critical\"}","level":"debug","msg":"'for' state restored","restored_time":"Saturday, 30-May-20 11:20:31 UTC","ts":"2020-05-31T20:35:57.170Z"}
{"alertname":"KubeControllerManagerDown","caller":"manager.go:783","component":"rule manager","group":"kubernetes-system-controller-manager","labels":"{alertname=\"KubeControllerManagerDown\", severity=\"critical\"}","level":"debug","msg":"'for' state restored","restored_time":"Saturday, 30-May-20 11:20:05 UTC","ts":"2020-05-31T20:35:57.170Z"}
{"alertname":"KubeVersionMismatch","caller":"manager.go:783","component":"rule manager","group":"kubernetes-system","labels":"{alertname=\"KubeVersionMismatch\", severity=\"warning\"}","level":"debug","msg":"'for' state restored","restored_time":"Saturday, 30-May-20 11:20:14 UTC","ts":"2020-05-31T20:35:57.175Z"}
{"alertname":"TargetDown","caller":"manager.go:783","component":"rule manager","group":"general.rules","labels":"{alertname=\"TargetDown\", job=\"po-prometheus\", namespace=\"monitoring\", service=\"po-prometheus\", severity=\"warning\"}","level":"debug","msg":"'for' state restored","restored_time":"Sunday, 31-May-20 20:35:57 UTC","ts":"2020-05-31T20:35:57.182Z"}
{"alertname":"TargetDown","caller":"manager.go:783","component":"rule manager","group":"general.rules","labels":"{alertname=\"TargetDown\", job=\"kubelet\", namespace=\"kube-system\", service=\"po-kubelet\", severity=\"warning\"}","level":"debug","msg":"'for' state restored","restored_time":"Sunday, 31-May-20 20:35:57 UTC","ts":"2020-05-31T20:35:57.182Z"}
{"alertname":"KubePodCrashLooping","caller":"manager.go:783","component":"rule manager","group":"kubernetes-apps","labels":"{alertname=\"KubePodCrashLooping\", container=\"prometheus\", endpoint=\"http\", instance=\"10.10.17.202:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"prometheus-po-prometheus-1\", service=\"stable-po-kube-state-metrics\", severity=\"critical\"}","level":"debug","msg":"'for' state restored","restored_time":"Sunday, 31-May-20 20:30:57 UTC","ts":"2020-05-31T20:35:57.247Z"}
{"alertname":"KubePodCrashLooping","caller":"manager.go:783","component":"rule manager","group":"kubernetes-apps","labels":"{alertname=\"KubePodCrashLooping\", container=\"prometheus\", endpoint=\"http\", instance=\"10.10.17.202:8080\", job=\"kube-state-metrics\", namespace=\"monitoring\", pod=\"prometheus-po-prometheus-0\", service=\"stable-po-kube-state-metrics\", severity=\"critical\"}","level":"debug","msg":"'for' state restored","restored_time":"Sunday, 31-May-20 20:30:57 UTC","ts":"2020-05-31T20:35:57.247Z"}
{"caller":"main.go:524","level":"warn","msg":"Received SIGTERM, exiting gracefully...","ts":"2020-05-31T20:37:52.433Z"}

closed time in 6 hours

Nuru

issue comment prometheus/prometheus

Prometheus hangs without log message

I am closing this bug. Please reopen it if you can still reproduce it after upgrading to 2.21.

Nuru

comment created time in 6 hours

issue closed kubernetes-monitoring/kubernetes-mixin

k8s rules not fetching correct pod/namespace for kube_pod_info

This is about rules that use kube_pod_info in rules/app.libsonnet

When kube_pod_info is pulled from kube-state-metrics it reports

          "metric": {
            "__name__": "kube_pod_info",
            "container": "kube-state-metrics",
            "created_by_kind": "DaemonSet",
            "created_by_name": "cilium",
            "endpoint": "http",
            "exported_namespace": "kube-system",
            "exported_pod": "cilium-999hs",
            "host_ip": "10.206.248.211",
            "instance": "100.96.4.154:8080",
            "job": "kube-state-metrics",
            "namespace": "addons",
            "node": "ip-10-206-248-211.ec2.internal",
            "pod": "kube-state-metrics-855b4fbdb5-crgtg",
            "pod_ip": "10.206.248.211",
            "priority_class": "system-node-critical",
            "service": "kube-state-metrics",
            "uid": "2a8c09ad-c921-46ef-abe4-b6e234255e8b"

Currently, the rule pulls pod and namespace. This becomes an issue in node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate, as only the kube-state-metrics pods will be reported instead of the exported_pod.

To fix this, I replaced kube_pod_info{node!=""} with label_replace(label_replace(kube_pod_info{node!=""}, "pod", "$1", "exported_pod", "(.*)"), "namespace", "$1", "exported_namespace", "(.*)") to use the correct exported labels.

This is the PR with the fix https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/522

closed time in 6 hours

allenmunC1

issue comment kubernetes-monitoring/kubernetes-mixin

k8s rules not fetching correct pod/namespace for kube_pod_info

Fixed by using ServiceMonitor relabeling.

allenmunC1

comment created time in 6 hours

issue opened kubernetes-monitoring/kubernetes-mixin

k8s rules not fetching correct pod/namespace for kube_pod_info

This is about rules that use kube_pod_info in rules/app.libsonnet

When kube_pod_info is pulled from kube-state-metrics it reports

          "metric": {
            "__name__": "kube_pod_info",
            "container": "kube-state-metrics",
            "created_by_kind": "DaemonSet",
            "created_by_name": "cilium",
            "endpoint": "http",
            "exported_namespace": "kube-system",
            "exported_pod": "cilium-999hs",
            "host_ip": "10.206.248.211",
            "instance": "100.96.4.154:8080",
            "job": "kube-state-metrics",
            "namespace": "addons",
            "node": "ip-10-206-248-211.ec2.internal",
            "pod": "kube-state-metrics-855b4fbdb5-crgtg",
            "pod_ip": "10.206.248.211",
            "priority_class": "system-node-critical",
            "service": "kube-state-metrics",
            "uid": "2a8c09ad-c921-46ef-abe4-b6e234255e8b"

Currently, the rule pulls pod and namespace. This becomes an issue in node_namespace_pod_container:container_cpu_usage_seconds_total:sum_rate, as only the kube-state-metrics pods will be reported instead of the exported_pod.

To fix this, I replaced kube_pod_info{node!=""} with label_replace(label_replace(kube_pod_info{node!=""}, "pod", "$1", "exported_pod", "(.*)"), "namespace", "$1", "exported_namespace", "(.*)") to use the correct exported labels.

This is the PR with the fix https://github.com/kubernetes-monitoring/kubernetes-mixin/pull/522

created time in 8 hours

Pull request review comment prometheus/prometheus

Guard closing quitCh with sync.Once to prevent double close

func (h *Handler) version(w http.ResponseWriter, r *http.Request) {
}

func (h *Handler) quit(w http.ResponseWriter, r *http.Request) {
-	select {
-	case <-h.quitCh:
+	var stopped bool
+	h.quitOnce.Do(func() {
+		stopped = true
+		close(h.quitCh)
+	})
+	if stopped {

This should be the opposite

johejo

comment created time in 9 hours

pull request comment prometheus/prometheus

Consider status code 429 as recoverable errors to avoid resharding

Ah, I feel like rate limiting is something that comes from the remote storage. That means the remote storage should get more control over how it wants a particular request to be handled, which gives it the ability to manage the situation and recover from it. So the response header should be a good way out here.
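
A minimal sketch of what honouring such a header could look like on the sender side (illustrative only; this is not Prometheus's actual remote-write code, and it only handles the delay-in-seconds form of Retry-After, not the HTTP-date form):

package remote

import (
	"net/http"
	"strconv"
	"time"
)

// retryAfter reports the backoff the server asked for on a 429 response.
// The second return value is false when no usable hint was present.
func retryAfter(resp *http.Response) (time.Duration, bool) {
	if resp.StatusCode != http.StatusTooManyRequests {
		return 0, false
	}
	secs, err := strconv.Atoi(resp.Header.Get("Retry-After"))
	if err != nil || secs < 0 {
		return 0, false
	}
	return time.Duration(secs) * time.Second, true
}

The caller would then treat the 429 as recoverable and wait for the returned duration (capped by its own maximum backoff) instead of resharding.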

Harkishen-Singh

comment created time in 10 hours

pull request comment prometheus/prometheus

Guard closing quitCh with sync.Once to prevent double close

Thanks, good suggestion. I think it would be better if only Handler.Quit (e.g. from the main module) could read quitCh.

johejo

comment created time in 12 hours

pull request comment prometheus/prometheus

Consider status code 429 as recoverable errors to avoid resharding

I didn't get the Retry-After thing. Is it something that the remote storage will send in its response to the remote write component, i.e. a time after which, and only after which, it should retry?

Harkishen-Singh

comment created time in 12 hours

pull request comment prometheus/prometheus

Guard closing quitCh with sync.Once to prevent double close

You could instead revert 8166 and apply this to the full function?

Pseudocode:

var stopped bool
quitOnce.Do {
  stopped = true
  print(Quitting)
  close(quit)
}
if !stopped {
  print(Exit in progress)
}
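
For reference, a compilable sketch of that suggestion (the quitOnce/quitCh fields follow the diff above; the messages and surrounding type are illustrative, not the final Prometheus code):

package web

import (
	"fmt"
	"net/http"
	"sync"
)

// Handler is a pared-down stand-in carrying only the fields the quit
// endpoint needs.
type Handler struct {
	quitOnce sync.Once
	quitCh   chan struct{}
}

// quit closes quitCh exactly once; any later request takes the
// "already in progress" branch instead of panicking on a double close.
func (h *Handler) quit(w http.ResponseWriter, r *http.Request) {
	var stopped bool
	h.quitOnce.Do(func() {
		stopped = true
		fmt.Fprintf(w, "Requesting termination... Goodbye!")
		close(h.quitCh)
	})
	if !stopped {
		fmt.Fprintf(w, "Termination already in progress.")
	}
}
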
johejo

comment created time in 13 hours

PR opened prometheus/prometheus

Guard closing quitCh with sync.Once to prevent double close

related #8144

See https://github.com/prometheus/prometheus/issues/8144#issuecomment-735814282

Signed-off-by: Mitsuo Heijo mitsuo.heijo@gmail.com

+4 -1

0 comments

1 changed file

pr created time in 13 hours

issue comment prometheus/prometheus

Error: http: superfluous response.WriteHeader call

I get the same error:

Client: Docker Engine - Community
 Cloud integration: 1.0.2
 Version:           19.03.13
 API version:       1.40
 Go version:        go1.13.15
 Git commit:        4484c46d9d
 Built:             Wed Sep 16 16:58:31 2020
 OS/Arch:           darwin/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.13
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.13.15
  Git commit:       4484c46d9d
  Built:            Wed Sep 16 17:07:04 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          v1.3.7
  GitCommit:        8fba4e9a7d01810a393d5d25a3621dc101981175
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

I'm using Skaffold

bandesz

comment created time in 13 hours
