Ganesh Vernekar codesome @grafana https://ganeshvernekar.com @prometheus maintainer | Software Engineer @grafana | IIT Hyderabad CSE Alumnus

issue closedprometheus/prometheus

Could we export the target status in detail to the /metrics

What did you do?

Currently, when a target is not accessible, we have only three ways to find this out:

  • on the webui /targets page, to check which one is unhealthy
  • through prometheus api /api/v1/targets to filter the down targets
  • through the current net_conntrack_dialer_conn_failed_total metric from prometheus

But net_conntrack_dialer_conn_failed_total only reports failures at the job-group level; it does not include the scrape status of every target within the job group.

What did you expect to see?

Prometheus should export the scrape status of every target, which would make it easy to write an alert rule that tells which target is unhealthy. Maybe something like the below:

net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter", target="192.168.1.1",reason="refused"} 23
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter", target="192.168.1.1",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter", target="192.168.1.1",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter", target="192.168.1.1",reason="unknown"} 23

net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter", target="192.168.1.2",reason="refused"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter", target="192.168.1.2",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter", target="192.168.1.2",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter", target="192.168.1.2",reason="unknown"} 0

What did you see instead? Under which circumstances?

net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter",reason="refused"} 23
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter",reason="resolution"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter",reason="timeout"} 0
net_conntrack_dialer_conn_failed_total{dialer_name="node_exporter",reason="unknown"} 23

Environment

  • System information:
Linux t470p 5.4.79-1-lts #1 SMP Sun, 22 Nov 2020 14:22:21 +0000 x86_64 GNU/Linux
  • Prometheus version:
prometheus --version
prometheus, version 2.23.0 (branch: tarball, revision: 2.23.0)
  build user:       someone@builder
  build date:       20201126-18:32:01
  go version:       go1.15.5
  platform:         linux/amd64
  • Prometheus configuration file:
---
global:
  scrape_interval: 300s
  evaluation_interval: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093
scrape_configs:
  - job_name: 'node_exporter'
    static_configs:
      - targets:
          - 192.168.1.1:9100
          - 192.168.1.2:9100
  • Logs:
no log

closed time in 21 minutes

jeffrey4l

issue commentprometheus/prometheus

Could we export the target status in detail to the /metrics

Sorry, I found that the `up` metric is what I want. Just closing the issue.

jeffrey4l

comment created time in 21 minutes
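For anyone finding this issue later: the per-target health the report asks for is already exposed as the built-in `up` metric (1 if the last scrape of that target succeeded, 0 otherwise), so a per-target alert needs no new metric. A minimal rule-file sketch (the group and alert names here are illustrative, not anything the reporter wrote):

```yaml
groups:
  - name: target-health
    rules:
      - alert: TargetDown
        # `up` carries job and instance labels, so the alert
        # identifies exactly which target is failing its scrapes.
        expr: up == 0
        for: 5m
        annotations:
          summary: "Target {{ $labels.instance }} of job {{ $labels.job }} is down"
```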

issue openedprometheus/prometheus

Could we export the target status in detail to the /metrics


created time in 27 minutes

Pull request review commentgrafana/cortex-tools

update ruler api path

 import (
 	"github.com/grafana/cortex-tools/pkg/rules/rwrulefmt"
 )
+const (
+	rulerAPIPath  = "/api/v1/rules"
+	legacyAPIPath = "/api/prom/rules"
+)

The snippet has moved to pkg/client/client.go

AllenzhLi

comment created time in 4 hours

pull request commentgrafana/cortex-tools

update ruler api path

Overall this LGTM. However, I have a few more small nits. Once these things are fixed we can probably merge this.

The main block is you included the cortextool binary in your PR. Please remove the binary

It has been removed.

AllenzhLi

comment created time in 4 hours

Pull request review commentgrafana/cortex-tools

update ruler api path

 ## Unreleased
 
 * [BUGFIX] Fix inaccuracy in `e2ealerting` caused by invalid purging condition on timestamps. #117
-
+* [CHANGE] When using `rules` commands, cortex ruler API requests will now default to using the `/api/v1/` prefix.
+The `--use-legacy-routes` flag has been added to allow users to use the original `/api/prom/` routes. #99

I have updated, now this is a single line.

AllenzhLi

comment created time in 4 hours

Pull request review commentgrafana/cortex-tools

update ruler api path

 import (
 	"github.com/grafana/cortex-tools/pkg/rules/rwrulefmt"
 )
+const (
+	rulerAPIPath  = "/api/v1/rules"
+	legacyAPIPath = "/api/prom/rules"
+)

Move this snippet to the same file where they are used.

AllenzhLi

comment created time in 10 hours

Pull request review commentgrafana/cortex-tools

update ruler api path

 ## Unreleased
 
 * [BUGFIX] Fix inaccuracy in `e2ealerting` caused by invalid purging condition on timestamps. #117
-
+* [CHANGE] When using `rules` commands, cortex ruler API requests will now default to using the `/api/v1/` prefix.
+The `--use-legacy-routes` flag has been added to allow users to use the original `/api/prom/` routes. #99

This should be a single line

* [CHANGE] When using `rules` commands, cortex ruler API requests will now default to using the `/api/v1/` prefix. The `--use-legacy-routes` flag has been added to allow users to use the original `/api/prom/` routes. #99
AllenzhLi

comment created time in 11 hours

issue commentprometheus/prometheus

Error: http: superfluous response.WriteHeader call

Did you change from 1 to 5 directly, or were you changing it one at a time until it worked?

bandesz

comment created time in 11 hours

Pull request review commentprometheus/prometheus

Guard closing quitCh with sync.Once to prevent double close

 func (h *Handler) version(w http.ResponseWriter, r *http.Request) {
 }
 
 func (h *Handler) quit(w http.ResponseWriter, r *http.Request) {
-	select {
-	case <-h.quitCh:
+	var stopped bool
+	h.quitOnce.Do(func() {
+		stopped = true
+		close(h.quitCh)
+	})
+	if stopped {

Should we rename stopped to closed, and move the print inside the Do()? Why do we have an if/else here where a simple if would be enough? Also, is it safe to write after closing?

johejo

comment created time in 11 hours
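For context on the pattern under review, here is a minimal, self-contained sketch. The type and field names are invented for illustration (the real code lives in Prometheus's web Handler); the point is that sync.Once guarantees the channel is closed at most once, even when several goroutines race on Quit.

```go
package main

import (
	"fmt"
	"sync"
)

// quitGuard sketches the guarded-close pattern: quitCh is closed at most
// once, no matter how many goroutines call Quit concurrently.
type quitGuard struct {
	quitCh   chan struct{}
	quitOnce sync.Once
}

// Quit reports true only for the caller whose Do actually closed the channel.
func (q *quitGuard) Quit() (closed bool) {
	q.quitOnce.Do(func() {
		closed = true
		close(q.quitCh)
	})
	return closed
}

func main() {
	q := &quitGuard{quitCh: make(chan struct{})}
	fmt.Println(q.Quit()) // true: this call closed the channel
	fmt.Println(q.Quit()) // false: already closed, Do is a no-op
}
```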

pull request commentprometheus/prometheus

'@ <timestamp>' modifier with start() end() and range()

And my good query is not returning data ;) I suspect we do not pass data correctly to remote read?

irate(label_join({__name__=~".*_bytes_.*"},"__name","","__name__")[5m:]) and (topk(5,label_join( {__name__=~".*_bytes_.*"} @1606838617 ,"__name","","__name__")   ))

(this does not return data:)

{__name__=~".*_bytes_.*"} @ 1606838617
codesome

comment created time in 14 hours

pull request commentprometheus/prometheus

'@ <timestamp>' modifier with start() end() and range()

Oh indeed.

However, this (bad) query created a panic:

irate(label_join({__name__=~".*_bytes_.*"},"__name","","__name__")[5m:]) and (max_over_time(label_join({__name__=~".*_bytes_.*"},"__name","","__name__")[1h]  @1606838617  ))

Because we do not set *timestampp

parser panic: runtime error: invalid memory address or nil pointer dereference                                                                     
goroutine 159 [running]:                                                                                                                           
github.com/prometheus/prometheus/promql/parser.(*parser).recover(0xc0013a0000, 0xc001302f78)                                           
        /home/roidelapluie/dev/prometheus/promql/parser/parse.go:274 +0x125                           
panic(0x2586b80, 0x42cdb50)                                                                                                                        
        /home/roidelapluie/godist/go/src/runtime/panic.go:969 +0x175                                                                               
github.com/prometheus/prometheus/promql/parser.(*parser).setInstant(0xc0013a0000, 0x31bb780, 0xc0011f1a00, 0x41d7f19a56400000)        
        /home/roidelapluie/dev/prometheus/promql/parser/parse.go:731 +0x5d                            
github.com/prometheus/prometheus/promql/parser.(*yyParserImpl).Parse(0xc0013a0078, 0x31bb980, 0xc0013a0000, 0x0)                                   
        generated_parser.y:391 +0x4bc5                                                                                                             
github.com/prometheus/prometheus/promql/parser.(*parser).parseGenerated(0xc0013a0000, 0xe046, 0xc0013a0000, 0xc0001c06f8)              
        /home/roidelapluie/dev/prometheus/promql/parser/parse.go:644 +0x6d                           
github.com/prometheus/prometheus/promql/parser.ParseExpr(0xc0009de8f0, 0xad, 0x0, 0x0, 0x0, 0x0)
        /home/roidelapluie/dev/prometheus/promql/parser/parse.go:110 +0xdb
github.com/prometheus/prometheus/promql.(*Engine).NewInstantQuery(0xc0007843f0, 0x7f4b2e840358, 0xc000473800, 0xc0009de8f0, 0xad, 0x174e2fc0, 0xed7
586410, 0x0, 0x0, 0x0, ...)
        /home/roidelapluie/dev/prometheus/promql/engine.go:333 +0x3f
github.com/prometheus/prometheus/web/api/v1.(*API).query(0xc000165900, 0xc00121dd00, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/roidelapluie/dev/prometheus/web/api/v1/api.go:355 +0x26b
github.com/prometheus/prometheus/web/api/v1.(*API).Register.func1.1(0x31dea80, 0xc0011f15c0, 0xc00121dd00)
        /home/roidelapluie/dev/prometheus/web/api/v1/api.go:265 +0xa2
net/http.HandlerFunc.ServeHTTP(0xc000970460, 0x31dea80, 0xc0011f15c0, 0xc00121dd00)
        /home/roidelapluie/godist/go/src/net/http/server.go:2042 +0x44
github.com/prometheus/prometheus/util/httputil.CompressionHandler.ServeHTTP(0x31a7b60, 0xc000970460, 0x7f4b2ea20478, 0xc000a954a0, 0xc00121dd00)
        /home/roidelapluie/dev/prometheus/util/httputil/compression.go:90 +0x7e
codesome

comment created time in 14 hours

pull request commentprometheus/prometheus

'@ <timestamp>' modifier with start() end() and range()

That's all as expected.

codesome

comment created time in 14 hours

pull request commentprometheus/prometheus

'@ <timestamp>' modifier with start() end() and range()

We also do not allow ParenExpr:

Error executing query: invalid parameter 'query': 1:78: parse error: @ modifier must be preceded by an instant or range selector or a subquery, but follows a *parser.ParenExpr instead
codesome

comment created time in 14 hours

pull request commentprometheus/prometheus

'@ <timestamp>' modifier with start() end() and range()

Don't try this at home, but I tried:

irate(label_join({__name__=~".*_bytes_.*"},"__name","","__name__")[5m:]) and (topk(5, max_over_time(label_join({__name__=~".*_bytes_.*"},"__name","","__name__")[5m:])) @1606838617 )

and got:

Error executing query: invalid parameter 'query': 1:79: parse error: @ modifier must be preceded by an instant or range selector or a subquery, but follows a *parser.AggregateExpr instead
codesome

comment created time in 14 hours
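Taken together, the two parse errors above pin down the proposed grammar: @ may only follow an instant selector, a range selector, or a subquery. A few illustrative placements (the metric name is made up):

```promql
# Accepted: @ directly follows a selector or a subquery
some_metric @ 1606838617
rate(some_metric[5m] @ 1606838617)
max_over_time(some_metric[5m:1m] @ 1606838617)

# Rejected: @ follows a ParenExpr or an AggregateExpr
(some_metric) @ 1606838617
sum(some_metric) @ 1606838617
```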

pull request commentprometheus/prometheus

Update fsnotify to v1.4.9

v1.4.9 doesn't exist in pkg.go.dev, so it seems that I made a mistake. https://pkg.go.dev/gopkg.in/fsnotify/fsnotify.v1?tab=versions

johejo

comment created time in 14 hours

PR closed prometheus/prometheus

Update fsnotify to v1.4.9

Migrate from gopkg.in to github.com

Signed-off-by: Mitsuo Heijo mitsuo.heijo@gmail.com


+130 -58

4 comments

25 changed files

johejo

pr closed time in 14 hours

pull request commentprometheus/prometheus

Update fsnotify to v1.4.9

OK, I understand. Closing.

johejo

comment created time in 14 hours

issue commentprometheus/prometheus

[Feature] Support for nesting split config of alert.rules

If you have something that will also work for inotify, I'd be open to it.

hsolberg

comment created time in 15 hours

pull request commentprometheus/prometheus

Update fsnotify to v1.4.9

Modules newer than v1.4.8 do not exist in gopkg.in.

https://gopkg.in/fsnotify/fsnotify.v1 has 1.4.9.

fsnotify/fsnotify#219 fsnotify/fsnotify#273

We don't fork, so these aren't relevant to Prometheus.

johejo

comment created time in 15 hours

issue commentprometheus/prometheus

[Feature] Support for nesting split config of alert.rules

Seems like it's an issue for Thanos as well: https://github.com/thanos-io/thanos/issues/3401. In the Go issue mentioned earlier they suggest using the filepathx library. Is that a viable option?

hsolberg

comment created time in 15 hours

Pull request review commentprometheus/prometheus

Guard closing quitCh with sync.Once to prevent double close

 func (h *Handler) version(w http.ResponseWriter, r *http.Request) { }  func (h *Handler) quit(w http.ResponseWriter, r *http.Request) {-	select {-	case <-h.quitCh:+	var stopped bool+	h.quitOnce.Do(func() {+		stopped = true+		close(h.quitCh)+	})+	if stopped {

I squashed my fix, is it OK?

A

if cond {
} else {
}

B

if !cond {
} else {
}

If B is preferred over A in this project, I will fix further.

johejo

comment created time in 15 hours

pull request commentprometheus/prometheus

Update fsnotify to v1.4.9

fsnotify migrated from gopkg.in to github.com for go modules. https://github.com/fsnotify/fsnotify/pull/309

Modules newer than v1.4.8 do not exist in gopkg.in.

Some fixes:

  • https://github.com/fsnotify/fsnotify/pull/219
  • https://github.com/fsnotify/fsnotify/pull/273
johejo

comment created time in 15 hours

issue commentprometheus/prometheus

[Feature] Support for nesting split config of alert.rules

A problem with that is that we also have globs over in file_sd, which uses inotify, and that library wouldn't support it either. We should be consistent in how fields like this work across the entire config.

I also note that your suggested syntax is a valid file name.

hsolberg

comment created time in 15 hours

pull request commentprometheus/prometheus

Update fsnotify to v1.4.9

Why are you changing where we're taking the package from? The gopkg.in version appears to be official. Also is there some particular bug or other reason you're looking to bump this?

johejo

comment created time in 16 hours

issue commentprometheus/prometheus

[Feature] Support for nesting split config of alert.rules

Recursive glob could do the trick, but it's not supported by Go yet. https://github.com/golang/go/issues/11862

hsolberg

comment created time in 16 hours

issue commentprometheus/prometheus

[Feature] Support for nesting split config of alert.rules

I'm very confused here, how is INCLUDE_DIR_MERGE_LIST different from the existing globs? This seems like a fairly complex way of doing what already is trivially possible.

hsolberg

comment created time in 16 hours

issue openedprometheus/prometheus

[Feature] Support for nesting split config of alert.rules


Proposal

Use case / Why is this important? When splitting up alert.rules into several files and placing them in sub-folders, the ability to use an include instead of handling each folder level separately would be ideal.

Here's an example from HomeAssistant (source)

Advanced Usage

We offer four advanced options to include whole directories at once. Please note that your files must have the .yaml file extension; .yml is not supported.

  • !include_dir_list will return the content of a directory as a list with each file content being an entry in the list. The list entries are ordered based on the alphanumeric ordering of the names of the files.
  • !include_dir_named will return the content of a directory as a dictionary which maps filename => content of file.
  • !include_dir_merge_list will return the content of a directory as a list by merging all files (which should contain a list) into 1 big list.
  • !include_dir_merge_named will return the content of a directory as a dictionary by loading each file and merging it into 1 big dictionary.

These work recursively. As an example, using !include_dir_* automation will include all 6 files shown below:

.
└── .homeassistant
    ├── automation
    │   ├── lights
    │   │   ├── turn_light_off_bedroom.yaml
    │   │   ├── turn_light_off_lounge.yaml
    │   │   ├── turn_light_on_bedroom.yaml
    │   │   └── turn_light_on_lounge.yaml
    │   ├── say_hello.yaml
    │   └── sensors
    │       └── react.yaml
    └── configuration.yaml (not included)

The same idea could be used for alert.rules to structure them by type of alert

# So instead of using
rule_files:
- "*/*.rules"
- "*/*/*.rules"
- "*/*/*/*.rules"

# OR
rule_files:
- "your_folder/*.rules"
- "your_folder/your_sub_folder/*.rules"
- "your_folder/your_sub_folder/your_sub_sub_folder/*.rules"

# You could just do this:
rule_files: !INCLUDE_DIR_MERGE_LIST rules
 
# Or this if you're using ansible-templates
rule_files: !INCLUDE_DIR_MERGE_LIST {{ env }}

created time in 16 hours

PR opened prometheus/prometheus

Update fsnotify to v1.4.9

Migrate from gopkg.in to github.com

Signed-off-by: Mitsuo Heijo mitsuo.heijo@gmail.com


+130 -58

0 comments

25 changed files

pr created time in 16 hours
