Peter Štibraný pstibrany @grafana Software Engineer at Grafana Labs

pstibrany/OWASP-CSRFGuard 1

OWASP CSRFGuard is a library that implements a variant of the synchronizer token pattern to mitigate the risk of Cross-Site Request Forgery (CSRF) attacks.

pstibrany/chunks-inspect 0

Tool for inspecting Loki and Cortex chunks

pstibrany/common 0

Libraries used in multiple Weave projects

pstibrany/cortex 0

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

pstibrany/cortex-tools 0

A set of powerful command line tools for interacting with cortex and friends.

pstibrany/foglyn-update 0

Eclipse update site for Foglyn

pstibrany/gitdm 0

📜Fork for tracking CNCF projects

Pull request review comment cortexproject/cortex

WIP: Multi tenant query federation

 store_gateway_client:
 # (ingesters shuffle sharding on read path is disabled).
 # CLI flag: -querier.shuffle-sharding-ingesters-lookback-period
 [shuffle_sharding_ingesters_lookback_period: <duration> | default = 0s]
+
+tenant_federation:
+  # If enabled, multi tenant query federation can be used by supplying multiple
+  # tenant IDs in the read path (experimental).
+  # CLI flag: -querier.tenant-federation.enabled
+  [enabled: <boolean> | default = false]

If a cluster enables multi-tenant querying, the flag should be set on all instances and not only on the query frontend/scheduler, so that e.g. the distributor rejects multi-tenant ingestion. I feel the prefix -querier. is misleading.

simonswine

comment created time in 6 hours

issue comment cortexproject/cortex

HA tracker shows incorrect userid<>cluster associations after some time

The new configuration has been running for ~20 hours now without any issues. It looks like the default ha_cluster_label value of cluster was causing all of the problems for me.

zdykstra

comment created time in 7 hours

issue comment cortexproject/cortex

Add test for tablemanager when no scaling is enabled

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

gouthamve

comment created time in 8 hours

issue comment cortexproject/cortex

Improve limits documentation

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

pracucci

comment created time in 8 hours

issue comment cortexproject/cortex

Store-gateway blocks resharding during rollout

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

pracucci

comment created time in 8 hours

fork klauspost/gzip

:floppy_disk: Golang gzip middleware for Gin and net/http | Golang gzip middleware supporting Gin and net/http, usable out of the box and customizable

fork in 9 hours

PR opened cortexproject/cortex

typo

What this PR does:

Which issue(s) this PR fixes: Fixes #<issue number>

Checklist

  • [ ] Tests updated
  • [ ] Documentation added
  • [ ] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
+1 -1

0 comment

1 changed file

pr created time in 11 hours

Pull request review comment cortexproject/cortex

fix panic in inverted index delete operation when expected fp is not present

 func (shard *indexShard) delete(labels labels.Labels, fp model.Fingerprint) {
 		j := sort.Search(len(fingerprints.fps), func(i int) bool {
 			return fingerprints.fps[i] >= fp
 		})
+
+		// see if search didn't find fp which matches the condition which means we don't have to do anything.
+		if j == len(fingerprints.fps) {

What if the desired fingerprint is not in the slice, but there are other fingerprints with a greater value? The search will return the position in the slice where the fp should have been.
E.g. if fingerprints.fps=[0,1,3] and you search for the fingerprint with a value of "2", j will be set to 2. This will then result in the fingerprint with a value of "3" being deleted.

It looks like the checks used in the docs (https://golang.org/pkg/sort/#Search) are probably what is needed here.
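To make the case above concrete, here is a minimal sketch of the exact-match check that the sort.Search docs describe. The helper name indexOf and the use of plain uint64 values instead of model.Fingerprint are assumptions for illustration, not the code in the PR.

```go
package main

import (
	"fmt"
	"sort"
)

// indexOf (hypothetical helper) returns the position of fp in the sorted
// slice fps and true, or the insertion position and false when fp is absent.
func indexOf(fps []uint64, fp uint64) (int, bool) {
	j := sort.Search(len(fps), func(i int) bool { return fps[i] >= fp })
	// sort.Search only reports the first index satisfying the predicate,
	// so an exact match still has to be verified explicitly.
	return j, j < len(fps) && fps[j] == fp
}

func main() {
	fps := []uint64{0, 1, 3}
	// 1 is present, 2 is absent but smaller than 3, 7 is past the end.
	for _, fp := range []uint64{1, 2, 7} {
		j, found := indexOf(fps, fp)
		fmt.Printf("fp=%d j=%d found=%t\n", fp, j, found)
	}
}
```

Checking only j == len(fps) covers the third case but not the second one, where j points at a different, larger fingerprint.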

sandeepsukhani

comment created time in 12 hours

issue opened cortexproject/cortex

Two ingester updates in quick succession can fail

We do CI of Cortex builds into our staging area; about once a month I'm alerted to an ingester rollout which has stuck.

I think what happens is that Kubernetes kills one newly-arrived ingester before it asserts its place in the ring, and the replacement doesn't manage to take over in its place.

I guess the killed ingester should take care to reset state before exiting, or the leaving ingester should detect that it died and go back to looking for someone to hand over to.

I will attempt to find the right logs from a recent occurrence, or post them the next time it happens.

created time in 12 hours

push event cortexproject/cortex

ci

commit sha c650201bcecb311c55e4f1dcebfd063f2db6372d

Deploy to GitHub pages

push time in 12 hours

issue comment grafana/cortex-jsonnet

Mixtool generates empty alerts and rules

Good catch!

muecs

comment created time in 12 hours

issue comment cortexproject/cortex

[bug] cortex_ingester_memory_chunks become negative after ingester OOM

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

Wing924

comment created time in 15 hours

PR closed cortexproject/cortex

Improve allocations while doing index queries size/XXL

This contains 3 optimizations.

1 - Reuse the iterator when iterating through batches.
2 - Avoid the string copy when parsing the data from the index.
3 - Use an intermediary IndexQueryEntry that does not contain the tableName and hashValue strings for lookupEntriesByQueries, which doesn't require them.

It is hard to get a non-noisy benchmark, but this is how it looks before and after:

benchmark                                   old ns/op     new ns/op     delta
Benchmark_GetRefsChunkWithManyChunks-16     73513148      72606738      -1.23%

benchmark                                   old allocs     new allocs     delta
Benchmark_GetRefsChunkWithManyChunks-16     741487         641464         -13.49%

benchmark                                   old bytes     new bytes     delta
Benchmark_GetRefsChunkWithManyChunks-16     78208523      64817304      -17.12%

Performance is unchanged but allocation is reduced.

I'm doing this because I found out that we allocate a lot, both in size and in number of objects.

+890 -200

1 comment

40 changed files

cyriltovena

pr closed time in 15 hours

pull request comment cortexproject/cortex

Improve allocations while doing index queries

This contains 3 optimizations.

I would have preferred to see three separate PRs with their own benchmarks. The one referenced in this PR isn't too convincing.

1 - Reuse the iterator when iterating through batches

This has introduced a bug everywhere QueryPages is used with multiple queries and the implementation runs them concurrently (which is all implementations, except the testing ones). In the end, this optimization may work in only very few places.
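The concern is that one iterator variable shared by concurrently invoked callbacks is only safe if every use is serialized. As a rough sketch of an alternative, the example below keeps reuse confined to a single goroutine. The batchIterator type and sumBatches function are hypothetical stand-ins and not the chunk.ReadBatchIterator API.

```go
package main

import (
	"fmt"
	"sync"
)

// batchIterator is a hypothetical reusable iterator over a batch of values.
type batchIterator struct {
	vals []int
	pos  int
}

// reset points the iterator at a new batch so its allocation can be reused.
func (it *batchIterator) reset(vals []int) { it.vals, it.pos = vals, -1 }

func (it *batchIterator) next() bool { it.pos++; return it.pos < len(it.vals) }

func (it *batchIterator) value() int { return it.vals[it.pos] }

// sumBatches processes batches concurrently. Each worker owns one iterator,
// so reuse happens within a goroutine and never across goroutines.
func sumBatches(batches [][]int, workers int) int {
	var (
		mu    sync.Mutex
		total int
		wg    sync.WaitGroup
	)
	work := make(chan []int)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			var it batchIterator // reused for every batch this worker handles
			for b := range work {
				local := 0
				for it.reset(b); it.next(); {
					local += it.value()
				}
				// Only the shared total needs the lock, not the iterator.
				mu.Lock()
				total += local
				mu.Unlock()
			}
		}()
	}
	for _, b := range batches {
		work <- b
	}
	close(work)
	wg.Wait()
	return total
}

func main() {
	fmt.Println(sumBatches([][]int{{1, 2}, {3}, {4, 5, 6}}, 2))
}
```

Running a sketch like this with the race detector (go run -race) is one way to check that no iterator state crosses goroutines.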

2 - Avoid the string copy when parsing the data from the index.

I would prefer to keep using standard library for conversion in chunk.go. Stdlib is typically optimized faster and hand-in-hand with runtime optimizations. By switching to 3rd party library for such basic operations, we're risking not getting those anymore.

As to the change, personally I'm not a fan of replacing strings with []byte, as it brings complexity to the code (security and concurrency aspects are completely different). That said, I can see how this specific optimization may be useful.

3 - Use an intermediary IndexQueryEntry that does not contain the tableName and hashValue strings for lookupEntriesByQueries, which doesn't require them.

This saves 32 bytes per instance. For 1M entries, that's 32 MB. At first sight it looks like a very small optimization to have a major impact. Can you provide some context where this is useful?
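The 32-byte figure matches two string headers on a 64-bit platform (16 bytes each). A small sketch to check that arithmetic follows; fullEntry and slimEntry are hypothetical layouts used for illustration and are not copied from the PR.

```go
package main

import (
	"fmt"
	"unsafe"
)

// fullEntry is a hypothetical entry that carries two string fields.
type fullEntry struct {
	TableName  string
	HashValue  string
	RangeValue []byte
	Value      []byte
}

// slimEntry is the same hypothetical entry without the two strings.
type slimEntry struct {
	RangeValue []byte
	Value      []byte
}

// savedBytes returns the per-instance size difference between the layouts.
// Only the headers are counted; the string contents live elsewhere.
func savedBytes() uintptr {
	return unsafe.Sizeof(fullEntry{}) - unsafe.Sizeof(slimEntry{})
}

func main() {
	perEntry := savedBytes()
	fmt.Printf("saved per entry: %d bytes\n", perEntry)
	fmt.Printf("saved for 1M entries: %d bytes\n", perEntry*1_000_000)
}
```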

The problem we have in Loki is that, compared to Cortex, we have many chunks in a small time frame, so index queries are very expensive because they return a lot of hits. Now I think I'm going to park this PR because it gave me another idea. I think the problem is more how GetRefChunk works. Most of the time, I need only the first few refs. I'll see if I can refactor it this way instead.

I agree that some of those optimizations are not yielding enough improvement.

cyriltovena

comment created time in 15 hours

Pull request review comment cortexproject/cortex

Improve allocations while doing index queries

 func (s *cachingIndexClient) QueryPages(ctx context.Context, queries []chunk.Ind

 		results[key] = rb
 	}
-
+	var iter chunk.ReadBatchIterator
 	err = s.IndexClient.QueryPages(ctx, cacheableMissed, func(cacheableQuery chunk.IndexQuery, r chunk.ReadBatch) bool {
 		resultsMtx.Lock()
 		defer resultsMtx.Unlock()
 		key := queryKey(cacheableQuery)
 		existing := results[key]
-		for iter := r.Iterator(); iter.Next(); {
+		for iter = r.Iterator(iter); iter.Next(); {

I don't think that's true, but I agree that it's not super clear and may be dangerous to maintain. There's a lock.

cyriltovena

comment created time in 16 hours

Pull request review comment cortexproject/cortex

Improve allocations while doing index queries

 func (ds *DeleteStore) GetPendingDeleteRequestsForUser(ctx context.Context, user

 func (ds *DeleteStore) queryDeleteRequests(ctx context.Context, deleteQuery []chunk.IndexQuery) ([]DeleteRequest, error) {
 	deleteRequests := []DeleteRequest{}
+	var itr chunk.ReadBatchIterator
 	err := ds.indexClient.QueryPages(ctx, deleteQuery, func(query chunk.IndexQuery, batch chunk.ReadBatch) (shouldContinue bool) {
-		itr := batch.Iterator()
+		itr = batch.Iterator(itr)

I sent a PR to rework this function to accept only one request.

cyriltovena

comment created time in 16 hours

issue comment grafana/cortex-jsonnet

Mixtool generates empty alerts and rules

I've had to replace "prometheusAlerts+::" with "prometheusAlerts+:" in alerts.libsonnet. I think mixtool's generate doesn't see hidden fields when looking for "prometheusAlerts": https://github.com/monitoring-mixins/mixtool/blob/bd0efc3ad2090d7f92cc53a0a8d453584a5645b0/pkg/mixer/eval.go#L27

muecs

comment created time in a day

issue comment cortexproject/cortex

HA tracker shows incorrect userid<>cluster associations after some time

This is tentatively resolved by the following limits block:

limits:
  max_queriers_per_tenant: 4
  accept_ha_samples: true
  ha_cluster_label: "__cluster__"
  ha_replica_label: "__replica__"
  drop_labels:
    - "__cluster__"
    - "__replica__"

Each of my HA Prometheus instances now attaches __cluster__ and __replica__ as external labels, which are subsequently dropped by the distributors. Based on per-user ingest rate graphs, HA is still being respected. I'll check /distributor/ha_tracker tomorrow to confirm that the mappings are still accurate, but they've been correct for the last two hours of runtime.
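As an illustration of the flow described above (read the cluster and replica values, then drop those labels before storing the series), here is a minimal sketch. The label type and the splitHALabels function are hypothetical and do not reflect how the Cortex distributor is implemented.

```go
package main

import "fmt"

// label is a hypothetical name/value pair attached to a series.
type label struct{ Name, Value string }

// splitHALabels returns the cluster and replica values found in ls, plus the
// remaining labels with every name listed in drop removed.
func splitHALabels(ls []label, clusterLabel, replicaLabel string, drop []string) (cluster, replica string, kept []label) {
	dropSet := make(map[string]bool, len(drop))
	for _, name := range drop {
		dropSet[name] = true
	}
	for _, l := range ls {
		// Remember the HA identity before deciding whether to keep the label.
		switch l.Name {
		case clusterLabel:
			cluster = l.Value
		case replicaLabel:
			replica = l.Value
		}
		if !dropSet[l.Name] {
			kept = append(kept, l)
		}
	}
	return cluster, replica, kept
}

func main() {
	series := []label{
		{"__name__", "up"},
		{"__cluster__", "prod"},
		{"__replica__", "prometheus-0"},
		{"job", "node"},
	}
	cluster, replica, kept := splitHALabels(series, "__cluster__", "__replica__",
		[]string{"__cluster__", "__replica__"})
	fmt.Println(cluster, replica, kept)
}
```

The label names mirror the limits block above; the sample values are made up.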

zdykstra

comment created time in a day

issue comment cortexproject/cortex

Ruler exposed high cardinality summary metrics per-user

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

jtlisi

comment created time in a day

push event cortexproject/cortex

Tom Wilkie

commit sha 06fbe27d7274b468e2fb697902d422de88d86d90

Propagate the stats via gRPC

Only partially done, still need to merge & record the results in the query frontend.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

push time in a day

push event cortexproject/cortex

Marco Pracucci

commit sha 07b09def9ff5cf54b2466a8da72e07bcb605c4f7

Track number of samples too

Signed-off-by: Marco Pracucci <marco@pracucci.com>

push time in a day

push event cortexproject/cortex

Marco Pracucci

commit sha 9570b0bf7f58ef3c920f1a557ca54c083145f626

Fixed series tracker

Signed-off-by: Marco Pracucci <marco@pracucci.com>

push time in a day

push event cortexproject/cortex

ci

commit sha f6336a4f660fc452508bc389c4cec920f5c52ace

Deploy to GitHub pages

push time in a day

issue comment cortexproject/cortex

Query fails past certain point in time: expanding series: not found

And I'm pretty sure it's not lack of resources because the hosts are under-utilized if we look at CPU/RAM for Cortex nodes:

cortex_low_cpu_usage

cortex_mem_cpu_usage

Under 15% CPU utilization and ~11GB memory free most of the time. So I'm pretty sure it's not the resources. Same can be said for Cassandra, which is even lower:

cassandra_low_cpu_usage

So it's clearly something about my configuration that is under-utilizing the hardware available and causing these longer queries to fail.

jakubgs

comment created time in a day

Pull request review comment cortexproject/cortex

[Querier] Deprecate `-querier.compress-http-responses-deprecated`

 func (t *Cortex) initQueryFrontend() (serv services.Service, err error) {

 	// Wrap roundtripper into Tripperware.
 	roundTripper = t.QueryFrontendTripperware(roundTripper)
-
-	handler := transport.NewHandler(t.Cfg.Frontend.Handler, roundTripper, util.Logger)

If we deprecate the flag and remove the code as well, we introduce a breaking change (which we can't do, I think). We should keep the logic as is, deprecate it in the CLI flag description and the CHANGELOG, and remove it in 2 minor versions. @gouthamve, could you confirm it, please?
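A rough sketch of that approach, using only the standard flag package: the deprecated flag stays registered and functional, and a warning is produced when it is explicitly set. The config struct and the parse function are hypothetical; only the two flag names come from this PR, and Cortex's real registration code may look different.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// config is a hypothetical holder for the old and the new setting.
type config struct {
	deprecatedCompress bool
	compress           bool
}

// parse registers both flags, keeps the deprecated one working, and returns
// a warning for each deprecated flag that was explicitly set.
func parse(args []string) (config, []string, error) {
	const oldName = "querier.compress-http-responses-deprecated"
	var cfg config
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	fs.BoolVar(&cfg.deprecatedCompress, oldName, false,
		"Deprecated: use -api.response-compression-enabled instead.")
	fs.BoolVar(&cfg.compress, "api.response-compression-enabled", false,
		"Compress HTTP responses.")
	if err := fs.Parse(args); err != nil {
		return cfg, nil, err
	}
	var warnings []string
	// Visit only walks flags that were explicitly set by the caller.
	fs.Visit(func(f *flag.Flag) {
		if f.Name == oldName {
			warnings = append(warnings, "flag -"+f.Name+" is deprecated")
		}
	})
	// Keep the old behaviour working until the flag is actually removed.
	cfg.compress = cfg.compress || cfg.deprecatedCompress
	return cfg, warnings, nil
}

func main() {
	cfg, warnings, err := parse([]string{"-querier.compress-http-responses-deprecated"})
	fmt.Println(cfg.compress, warnings, err)
}
```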

gotjosh

comment created time in a day

push event cortexproject/cortex

Marco Pracucci

commit sha 19d0f860491dbfe2064c895038dd3769ab805274

Exported process metrics to monitor the number of memory map areas allocated (#3537)

* Exported process metrics to monitor the number of memory map areas allocated
  Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Fix linter and tests
  Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Addressed review comments
  Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Added a check to see if /proc is supported
  Signed-off-by: Marco Pracucci <marco@pracucci.com>
* Addressed more review comments
  Signed-off-by: Marco Pracucci <marco@pracucci.com>

push time in a day

PR merged cortexproject/cortex

Exported process metrics to monitor the number of memory map areas allocated size/L

What this PR does: The Cortex blocks storage (in particular TSDB) makes extensive use of mmap-ed files. The Linux kernel has a limit on the maximum number of memory map areas a process can allocate (defaults to 65K) and I would like to alert on it. Unfortunately, neither cAdvisor nor the default process metrics export it, so I'm proposing to export them directly from Cortex.

I looked at procfs and it supports reading both the maps and the limit, but it does way more than we need (e.g. it parses every single map entry), so I decided to just open the file and count/parse it myself, so that we don't waste CPU and memory parsing something we don't need.
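A minimal sketch of the counting idea, assuming the standard Linux locations /proc/self/maps (one line per memory map area) and /proc/sys/vm/max_map_count (the limit). The function names are hypothetical and this is not the code added by the PR, which also exports the values as metrics.

```go
package main

import (
	"bytes"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// countMaps counts newline-terminated entries without parsing their fields.
// Each line of /proc/<pid>/maps describes one memory map area.
func countMaps(content []byte) int {
	return bytes.Count(content, []byte{'\n'})
}

// parseLimit parses the content of the max_map_count file.
func parseLimit(content string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(content))
}

func main() {
	maps, err := os.ReadFile("/proc/self/maps")
	if err != nil {
		// /proc is not available on every platform.
		fmt.Println("procfs not available:", err)
		return
	}
	raw, err := os.ReadFile("/proc/sys/vm/max_map_count")
	if err != nil {
		fmt.Println("cannot read limit:", err)
		return
	}
	limit, err := parseLimit(string(raw))
	if err != nil {
		fmt.Println("cannot parse limit:", err)
		return
	}
	fmt.Printf("memory map areas: %d of %d\n", countMaps(maps), limit)
}
```

For brevity the sketch reads the whole file into memory; streaming the file and counting newlines would avoid that.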

Tested in a Kubernetes cluster running on Linux: Screenshot 2020-11-24 at 15 48 14

Which issue(s) this PR fixes: N/A

Checklist

  • [x] Tests updated
  • [ ] Documentation added
  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
+199 -0

0 comment

4 changed files

pracucci

pr closed time in a day

PR opened cortexproject/cortex

[Querier] Deprecate `-querier.compress-http-responses-deprecated`

What this PR does: Deprecate -querier.compress-http-responses-deprecated in favour of -api.response-compression-enabled, which applies to every other API endpoint we register.

Which issue(s) this PR fixes: A follow up from #3536

Checklist

  • [ ] Tests updated
  • [ ] Documentation added
  • [ ] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
+8 -15

0 comment

4 changed files

pr created time in a day

PR opened cortexproject/cortex

fix panic in inverted index delete operation when expected fp is not present

What this PR does: We noticed a panic in one of the Loki ingesters when flushing chunks. The stack trace was

panic: runtime error: slice bounds out of range [3:2]
goroutine 114 [running]:
github.com/cortexproject/cortex/pkg/ingester/index.(*indexShard).delete(0xc004bee870, 0xc005c82340, 0xd, 0xd, 0xdd5778b83ef6a735)
	/src/loki/vendor/github.com/cortexproject/cortex/pkg/ingester/index/index.go:247 +0x4f6
github.com/cortexproject/cortex/pkg/ingester/index.(*InvertedIndex).Delete(0xc00090f540, 0xc005c82340, 0xd, 0xd, 0xdd5778b83ef6a735)
	/src/loki/vendor/github.com/cortexproject/cortex/pkg/ingester/index/index.go:85 +0x85
github.com/grafana/loki/pkg/ingester.(*Ingester).removeFlushedChunks(0xc000701400, 0xc000477040, 0xc01d7fc3c0)
	/src/loki/pkg/ingester/flush.go:305 +0x245
github.com/grafana/loki/pkg/ingester.(*Ingester).sweepInstance(0xc000701400, 0xc000477040, 0x0)
	/src/loki/pkg/ingester/flush.go:159 +0x138
github.com/grafana/loki/pkg/ingester.(*Ingester).sweepUsers(0xc000701400, 0xc00bc9ff00)
	/src/loki/pkg/ingester/flush.go:149 +0x6c
github.com/grafana/loki/pkg/ingester.(*Ingester).loop(0xc000701400)
	/src/loki/pkg/ingester/ingester.go:247 +0xc5
created by github.com/grafana/loki/pkg/ingester.(*Ingester).starting
	/src/loki/pkg/ingester/ingester.go:196 +0x1cf

This happened because sort.Search did not find the expected fp and returned the length of the input slice, while the code here did not check whether the search had found the element. So when the element was not found, the code tried to access an element beyond the actual length of the slice. This PR fixes it by checking the result of sort.Search to see whether it found the element.
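For illustration, here is a sketch of removing a value from a sorted slice so that a missing element is a no-op. The removeFP helper and the plain uint64 slice are assumptions, not the index code itself. Without the guard, an expression like fps[j+1:] with j equal to len(fps) goes out of range, which would be consistent with the [3:2] bounds in the trace above.

```go
package main

import (
	"fmt"
	"sort"
)

// removeFP (hypothetical helper) deletes fp from the sorted slice fps and
// returns fps unchanged when fp is absent.
func removeFP(fps []uint64, fp uint64) []uint64 {
	j := sort.Search(len(fps), func(i int) bool { return fps[i] >= fp })
	// Guard both "past the end" and "landed on a different value".
	if j == len(fps) || fps[j] != fp {
		return fps
	}
	// Shifts the tail left in place; the input's backing array is modified.
	return append(fps[:j], fps[j+1:]...)
}

func main() {
	fmt.Println(removeFP([]uint64{0, 1, 3}, 1)) // present: removed
	fmt.Println(removeFP([]uint64{0, 1, 3}, 2)) // absent, smaller than 3: unchanged
	fmt.Println(removeFP([]uint64{0, 1}, 9))    // absent, past the end: unchanged
}
```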

Checklist

  • [x] CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
+5 -0

0 comment

1 changed file

pr created time in a day

push event cortexproject/cortex

Tom Wilkie

commit sha d4939270d453cb532137ed803774f44f2a8ed7cf

Make Stats a proto so we can propagate it over gRPC.

Signed-off-by: Tom Wilkie <tom.wilkie@gmail.com>

push time in a day
