Tom Wilkie (tomwilkie) · @grafana · London · https://grafana.com · VP Product at @grafana; @prometheus & @cortexproject maintainer. Previously @kausalco, @weaveworks, @google, @acunu.

grafana/loki 11728

Like Prometheus, but for logs.

cortexproject/cortex 3528

A horizontally scalable, highly available, multi-tenant, long term Prometheus.

grafana/tanka 1162

Flexible, reusable and concise configuration for Kubernetes

pavoni/pywemo 102

Lightweight Python module to discover and control WeMo devices.

tomwilkie/awesomation 25

Home Awesomation; python home automation system.

grafana/cortex 19

A multitenant, horizontally scalable Prometheus as a Service

tomwilkie/cubienetes 11

Cubienetes: A Kubernetes Cluster on Cubieboard2s

tomwilkie/aws-self-serve 4

A self service portal for ec2 instances

tomwilkie/boto 1

Python interface to Amazon Web Services

tomwilkie/dnsmasq_exporter 1

dnsmasq exporter for Prometheus

issue comment cortexproject/cortex

Ingester target not showing the ingester ring on the http server

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

roidelapluie

comment created time in 9 hours

issue comment cortexproject/cortex

Moving to Github Actions from Circle CI

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

gouthamve

comment created time in 9 hours

issue comment cortexproject/cortex

S3 endpoint in single-process-config-blocks.yaml needs to be updated

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

vikram-yerneni

comment created time in 9 hours

push event weaveworks-experiments/consul-sidekick

Rick Richardson

commit sha 147ed87b5a2ccf40460571fe24c907d5f472c28e

updating to latest k8s go-client API, added basic background context

view details

Bryan Boreham

commit sha f634bb8caf8a3414312dfb39608c29d663e2e8a1

Merge pull request #9 from rrichardson/master Updating to k8s 1.18 API. Other minor cleanups

view details

push time in 16 hours

PR merged weaveworks-experiments/consul-sidekick

Updating to k8s 1.18 API. Other minor cleanups

This updates the source to the latest k8s go-client API. It introduces a Context, as that is now a required part of the API. It also offers some styling fixes.

+12 -8

3 comments

3 changed files

rrichardson

pr closed time in 16 hours

issue comment cortexproject/cortex

Separate Integration Test Suites

Less of a need since we moved to GitHub Actions (CI looks faster), but still a nice-to-have.

jtlisi

comment created time in 19 hours

issue comment cortexproject/cortex

Document how flusher works with blocks storage

Still valid, help wanted

pracucci

comment created time in 19 hours

issue comment cortexproject/cortex

Document how flusher works with blocks storage

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

pracucci

comment created time in a day

issue comment cortexproject/cortex

Separate Integration Test Suites

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

jtlisi

comment created time in a day

issue comment cortexproject/cortex

Re-add Querier logging in Single Binary

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

mattmendick

comment created time in a day

issue comment cortexproject/cortex

Migrate object storage clients to use thanos objstore.Bucket implementations

This issue has been automatically marked as stale because it has not had any activity in the past 60 days. It will be closed in 15 days if no further activity occurs. Thank you for your contributions.

jtlisi

comment created time in a day

PR opened grafana/cortex-jsonnet

s/interval/rate_interval/g

Followup to https://github.com/grafana/cortex-jsonnet/pull/224

+88 -88

0 comments

11 changed files

pr created time in a day

issue comment cortexproject/cortex

Crash and restart of store-gateway puts extra pressure on other store-gateways

Step 4 is problematic because it puts extra pressure on the store-gateways that haven't crashed -- they will have a disproportionately bigger number of blocks loaded.

Agree, and we need to fix it. The reason I did it this way was to cover the case where you scale up by a number of store-gateways >= the replication factor. Let's say you start 10 new store-gateways at the same time. They will all be JOINING while syncing blocks: if the other ACTIVE store-gateways start to offload blocks while the JOINING store-gateways are still loading them, there will be a period of time during which some blocks are not loaded by any store-gateway and queries will fail.

Thoughts?

pstibrany

comment created time in a day

PR opened grafana/cortex-jsonnet

Uses $__rate_interval in ruler dashboard queries

Using $__interval with rate/increase queries can leave dashboards undesirably blank in some cases. As per https://grafana.com/docs/grafana/latest/datasources/prometheus/#using-__rate_interval-variable, we should use $__rate_interval instead.
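The change can be illustrated with a hypothetical dashboard query (the metric name here is an example, not necessarily one from the ruler dashboard):

```promql
# Before: when the dashboard's $__interval (e.g. 30s) is smaller than two
# scrape intervals, rate() sees fewer than two samples and the panel goes blank.
sum(rate(cortex_prometheus_rule_evaluations_total[$__interval]))

# After: $__rate_interval is guaranteed to cover at least four scrape
# intervals, so rate() always has enough samples to produce a value.
sum(rate(cortex_prometheus_rule_evaluations_total[$__rate_interval]))
```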

Before/After: (dashboard screenshots)

+16 -17

0 comments

1 changed file

pr created time in a day

push event cortexproject/cortex

ci

commit sha 7bb25d9b6f9a9ff22dd2a8afabdd087d31053343

Deploy to GitHub pages

view details

push time in a day

pull request comment cortexproject/cortex

Add basic query stats collection & logging.

Thanks @pstibrany for your review! I should have addressed your comments.

tomwilkie

comment created time in a day

push event cortexproject/cortex

Marco Pracucci

commit sha cad539aeade583b66b70cbea6f412ef447cdbadb

Improved log message Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

push time in a day

push event cortexproject/cortex

Goutham Veeramachaneni

commit sha 60bc27069b262f308af5d27665a6ee35c315275d

Ignore series API lookups to chunks store (#3559) Signed-off-by: Goutham Veeramachaneni <gouthamve@gmail.com>

view details

push time in a day

Pull request review comment cortexproject/cortex

Add multi tenant query federation

 func getQuerierID(server frontendv1pb.Frontend_ProcessServer) (string, error) {
 }

 func (f *Frontend) queueRequest(ctx context.Context, req *request) error {
-	userID, err := tenant.TenantID(ctx)
+	tenantIDs, err := tenant.TenantIDs(ctx)
 	if err != nil {
 		return err
 	}

 	req.enqueueTime = time.Now()
 	req.queueSpan, _ = opentracing.StartSpanFromContext(ctx, "queued")

-	maxQueriers := f.limits.MaxQueriersPerUser(userID)
+	// figure out the highest max querier per user
+	var maxQueriers int
+	for _, tenantID := range tenantIDs {
+		v := f.limits.MaxQueriersPerUser(tenantID)
+		if v > maxQueriers {

Should maxQueriers be the minimum value from all of the tenants being queried? Otherwise wouldn't it be possible to exceed the limit for a tenant with a low max queriers setting?
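To illustrate the point: a minimal sketch of picking the most restrictive limit instead. The helper name is hypothetical, not the PR's code, and it assumes Cortex's convention that a limit of 0 means "unlimited":

```go
package main

import "fmt"

// smallestMaxQueriers returns the most restrictive (smallest non-zero)
// max-queriers limit across all tenants in a federated query.
// A limit of 0 is treated as "unlimited", so it never wins over a real limit.
// Hypothetical helper for illustration only.
func smallestMaxQueriers(limits []int) int {
	min := 0
	for _, v := range limits {
		if v != 0 && (min == 0 || v < min) {
			min = v
		}
	}
	return min
}

func main() {
	fmt.Println(smallestMaxQueriers([]int{4, 0, 2})) // tightest tenant limit wins: 2
	fmt.Println(smallestMaxQueriers([]int{0, 0}))    // all tenants unlimited: 0
}
```

Taking the minimum guarantees no tenant's data is ever spread over more queriers than its own limit allows, at the cost of constraining the other tenants in the same federated query.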

simonswine

comment created time in a day

Pull request review comment cortexproject/cortex

Add multi tenant query federation

 import (

 func getHTTPCacheGenNumberHeaderSetterMiddleware(cacheGenNumbersLoader *purger.TombstonesLoader) middleware.Interface {
 	return middleware.Func(func(next http.Handler) http.Handler {
 		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
-			userID, err := tenant.TenantID(r.Context())
+			tenantIDs, err := tenant.TenantIDs(r.Context())
 			if err != nil {
 				http.Error(w, err.Error(), http.StatusUnauthorized)
 				return
 			}

-			cacheGenNumber := cacheGenNumbersLoader.GetResultsCacheGenNumber(userID)
+			var cacheGenNumber string
+			// join the strings for multiple tenants
+			// TODO: Figure out how sensible that is
+			for _, tenantID := range tenantIDs {
+				cacheGenNumber += cacheGenNumbersLoader.GetResultsCacheGenNumber(tenantID)
+			}

I don't think it's likely it will lead to collisions. However, it will increase the number of characters that prefix the Memcached item key string. This means that for multi-tenant queries across enough tenants caching could break due to the item key being too long. The existing limit is not very high at 150 characters. Multi-tenant queries will already have larger cache item keys since it will contain the concatenation of multiple tenant IDs. The cachegen number will exacerbate this issue further.

There are two options for dealing with this:

  1. Ignore the cache generation for multi-tenant queries in this PR. It's not a commonly used feature and IMO it would be fine to leave it out for now.

  2. Consolidate the cachegen number into a small, consistent, unique string. One option off the top of my head is to convert the strings to integers, add them together, and then use a string of the resulting value.

My preference would be option 1 since it would reduce the scope of the PR a bit more and make it a bit easier to review. Cache generation handling can be added in a future PR.
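A sketch of option 2, using a hash rather than integer addition to keep the value short while avoiding collisions. The helper is hypothetical, not part of the PR under review:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// consolidateCacheGen collapses per-tenant cache generation numbers into a
// short, fixed-length string so the Memcached item key stays within the
// ~150-character limit regardless of how many tenants are queried.
// A hash avoids the collision risk of simply summing integers
// (e.g. "1"+"3" and "2"+"2" would sum to the same value).
func consolidateCacheGen(gens []string) string {
	// The "|" separator keeps tenant boundaries unambiguous before hashing.
	sum := sha256.Sum256([]byte(strings.Join(gens, "|")))
	return hex.EncodeToString(sum[:])[:8] // 8 hex chars is plenty for a key prefix
}

func main() {
	fmt.Println(consolidateCacheGen([]string{"1607", "1612", "1599"})) // always 8 chars
}
```

Any change to any tenant's cachegen number changes the consolidated string, so cache invalidation still works; the trade-off is that one tenant's invalidation evicts cached results for every federated query it participates in.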

simonswine

comment created time in a day

Pull request review comment cortexproject/cortex

Add multi tenant query federation

 func New(cfg Config) (*Cortex, error) {
 		os.Exit(0)
 	}

+	// Swap out the default resolver to support multiple tenant IDs separated by a '|'
+	if cfg.TenantFederation.Enabled {
+		util.WarnExperimentalUse("multi-tenant-query-federation")

suggestion: Rename this to match the config field

		util.WarnExperimentalUse("tenant-federation")
simonswine

comment created time in a day

Pull request review comment cortexproject/cortex

Add multi tenant query federation

+package tenantfederation
+
+import (
+	"flag"
+)
+
+type Config struct {
+	// Enabled switches on support for multi tenant query federation
+	Enabled bool `yaml:"enabled"`
+}
+
+func (cfg *Config) RegisterFlags(f *flag.FlagSet) {
+	f.BoolVar(&cfg.Enabled, "tenant-federation.enabled", false, "If enabled, multi tenant query federation can be used by supplying multiple tenant IDs in the read path (experimental).")

suggestion(optional): I think this help text could be reworked to explain how to issue a multi-tenant query, i.e. by joining tenant IDs with the | character in the X-Scope-OrgID header.

nit(optional,if-minor): It might make sense to reword the phrase "multi tenant query federation can be used" into something that describes querying data from multiple tenants, rather than echoing the config field name. Ideally, the help text should explain what the feature does to someone who can't infer the meaning from the field name alone.
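For example, one possible rewording along those lines (illustrative only; not necessarily the wording the PR adopted):

```go
f.BoolVar(&cfg.Enabled, "tenant-federation.enabled", false,
	"If enabled, queries can be federated across multiple tenants. The tenant IDs "+
		"involved need to be specified separated by a '|' character in the "+
		"'X-Scope-OrgID' request header (experimental).")
```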

simonswine

comment created time in a day

Pull request review comment cortexproject/cortex

Add multi tenant query federation

+// +build requires_docker
+
+package integration
+
+import (
+	"fmt"
+	"strings"
+	"testing"
+	"time"
+
+	"github.com/prometheus/common/model"
+	"github.com/prometheus/prometheus/pkg/labels"
+	"github.com/prometheus/prometheus/prompb"
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
+
+	"github.com/cortexproject/cortex/integration/e2e"
+	e2ecache "github.com/cortexproject/cortex/integration/e2e/cache"
+	e2edb "github.com/cortexproject/cortex/integration/e2e/db"
+	"github.com/cortexproject/cortex/integration/e2ecortex"
+)
+
+type querierTenantFederationConfig struct {
+	querySchedulerEnabled  bool
+	shuffleShardingEnabled bool
+}
+
+func TestQuerierTenantFederation(t *testing.T) {
+	runQuerierTenantFederationTest(t, querierTenantFederationConfig{})
+}
+
+func TestQuerierTenantFederationWithQueryScheduler(t *testing.T) {
+	runQuerierTenantFederationTest(t, querierTenantFederationConfig{
+		querySchedulerEnabled: true,
+	})
+}
+
+func TestQuerierTenantFederationWithShuffleSharding(t *testing.T) {
+	runQuerierTenantFederationTest(t, querierTenantFederationConfig{
+		shuffleShardingEnabled: true,
+	})
+}
+
+func TestQuerierTenantFederationWithQuerySchedulerAndShuffleSharding(t *testing.T) {
+	runQuerierTenantFederationTest(t, querierTenantFederationConfig{
+		querySchedulerEnabled:  true,
+		shuffleShardingEnabled: true,
+	})
+}
+
+func runQuerierTenantFederationTest(t *testing.T, cfg querierTenantFederationConfig) {
+	const numUsers = 10
+
+	s, err := e2e.NewScenario(networkName)
+	require.NoError(t, err)
+	defer s.Close()
+
+	memcached := e2ecache.NewMemcached()
+	consul := e2edb.NewConsul()
+	require.NoError(t, s.StartAndWaitReady(consul, memcached))
+
+	flags := mergeFlags(BlocksStorageFlags(), map[string]string{
+		"-querier.cache-results":             "true",
+		"-querier.split-queries-by-interval": "24h",
+		"-querier.query-ingesters-within":    "12h", // Required by the test on query /series out of ingesters time range
+		"-frontend.memcached.addresses":      "dns+" + memcached.NetworkEndpoint(e2ecache.MemcachedPort),
+		"-tenant-federation.enabled":         "true",

thought: We should be able to enable this feature for every e2e test and not have any failures. Doing so could provide a good smoke test for the feature.

simonswine

comment created time in a day

push event cortexproject/cortex

Marco Pracucci

commit sha 40a448d71002e6c83209c519f730e3c8737de61b

Renamed query stats log fields Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

push time in a day

push event cortexproject/cortex

Marco Pracucci

commit sha ce0fc819eb21d28461ee966371aae617b414a11f

Moved query stats reporter to frontend transport.Handler Signed-off-by: Marco Pracucci <marco@pracucci.com>

view details

push time in a day

Pull request review comment cortexproject/cortex

Add basic query stats collection & logging.

+package stats
+
+import (
+	"net/http"
+	"time"
+
+	"github.com/go-kit/kit/log"
+	"github.com/go-kit/kit/log/level"
+	"github.com/prometheus/client_golang/prometheus"
+	"github.com/prometheus/client_golang/prometheus/promauto"
+
+	"github.com/cortexproject/cortex/pkg/tenant"
+)
+
+// ReportMiddleware logs and track metrics with the query statistics.
+type ReportMiddleware struct {
+	logger log.Logger
+
+	querySeconds *prometheus.CounterVec
+}
+
+// NewReportMiddleware makes a new ReportMiddleware.
+func NewReportMiddleware(logger log.Logger, reg prometheus.Registerer) ReportMiddleware {
+	return ReportMiddleware{
+		logger: logger,
+		querySeconds: promauto.With(reg).NewCounterVec(prometheus.CounterOpts{
+			Name: "cortex_query_seconds_total",
+			Help: "Total amount of wall clock time spend processing queries.",
+		}, []string{"user"}),
+	}
+}
+
+// Wrap implements middleware.Interface.
+func (m ReportMiddleware) Wrap(next http.Handler) http.Handler {
+	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		userID, err := tenant.TenantID(r.Context())
+		if err != nil {
+			http.Error(w, err.Error(), http.StatusBadRequest)
+			return
+		}
+
+		// Initialise the stats in the context and make sure it's propagated
+		// down the request chain.
+		stats, ctx := ContextWithEmptyStats(r.Context())
+		r = r.WithContext(ctx)
+
+		startTime := time.Now()
+		next.ServeHTTP(w, r)
+
+		// Track statistics.
+		m.querySeconds.WithLabelValues(userID).Add(float64(stats.LoadWallTime()))
+
+		level.Info(m.logger).Log(
+			"msg", "query stats",
+			"user", userID,
+			"method", r.Method,
+			"path", r.URL.Path,

Makes sense. Done.

tomwilkie

comment created time in a day

Pull request review comment cortexproject/cortex

Add multi tenant query federation

+// +build requires_docker
+
+package integration
+
+import (
+	"fmt"
+	"strings"
+	"testing"
+	"time"
+
+	"github.com/prometheus/common/model"
+	"github.com/prometheus/prometheus/pkg/labels"
+	"github.com/prometheus/prometheus/prompb"
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
+
+	"github.com/cortexproject/cortex/integration/e2e"
+	e2ecache "github.com/cortexproject/cortex/integration/e2e/cache"
+	e2edb "github.com/cortexproject/cortex/integration/e2e/db"
+	"github.com/cortexproject/cortex/integration/e2ecortex"
+)
+
+type querierTenantFederationConfig struct {
+	querySchedulerEnabled  bool
+	shuffleShardingEnabled bool
+}
+
+func TestQuerierTenantFederation(t *testing.T) {
+	runQuerierTenantFederationTest(t, querierTenantFederationConfig{})
+}
+
+func TestQuerierTenantFederationWithQueryScheduler(t *testing.T) {
+	runQuerierTenantFederationTest(t, querierTenantFederationConfig{
+		querySchedulerEnabled: true,
+	})
+}
+
+func TestQuerierTenantFederationWithShuffleSharding(t *testing.T) {
+	runQuerierTenantFederationTest(t, querierTenantFederationConfig{
+		shuffleShardingEnabled: true,
+	})
+}
+
+func TestQuerierTenantFederationWithQuerySchedulerAndShuffleSharding(t *testing.T) {
+	runQuerierTenantFederationTest(t, querierTenantFederationConfig{
+		querySchedulerEnabled:  true,
+		shuffleShardingEnabled: true,
+	})
+}
+
+func runQuerierTenantFederationTest(t *testing.T, cfg querierTenantFederationConfig) {
+	const numUsers = 10
+
+	s, err := e2e.NewScenario(networkName)
+	require.NoError(t, err)
+	defer s.Close()
+
+	memcached := e2ecache.NewMemcached()
+	consul := e2edb.NewConsul()
+	require.NoError(t, s.StartAndWaitReady(consul, memcached))
+
+	flags := mergeFlags(BlocksStorageFlags(), map[string]string{
+		"-querier.cache-results":             "true",
+		"-querier.split-queries-by-interval": "24h",
+		"-querier.query-ingesters-within":    "12h", // Required by the test on query /series out of ingesters time range
+		"-frontend.memcached.addresses":      "dns+" + memcached.NetworkEndpoint(e2ecache.MemcachedPort),
+		"-querier.tenant-federation.enabled": "true",
+	})
+
+	// Start the query-scheduler if enabled.
+	var queryScheduler *e2ecortex.CortexService
+	if cfg.querySchedulerEnabled {
+		queryScheduler = e2ecortex.NewQueryScheduler("query-scheduler", flags, "")
+		require.NoError(t, s.StartAndWaitReady(queryScheduler))
+		flags["-frontend.scheduler-address"] = queryScheduler.NetworkGRPCEndpoint()
+		flags["-querier.scheduler-address"] = queryScheduler.NetworkGRPCEndpoint()
+	}
+
+	if cfg.shuffleShardingEnabled {
+		// Use only single querier for each user.
+		flags["-frontend.max-queriers-per-tenant"] = "1"
+	}
+
+	minio := e2edb.NewMinio(9000, flags["-blocks-storage.s3.bucket-name"])
+	require.NoError(t, s.StartAndWaitReady(minio))
+
+	// Start the query-frontend.
+	queryFrontend := e2ecortex.NewQueryFrontend("query-frontend", flags, "")
+	require.NoError(t, s.Start(queryFrontend))
+
+	if !cfg.querySchedulerEnabled {
+		flags["-querier.frontend-address"] = queryFrontend.NetworkGRPCEndpoint()
+	}
+
+	// Start all other services.
+	ingester := e2ecortex.NewIngester("ingester", consul.NetworkHTTPEndpoint(), flags, "")
+	distributor := e2ecortex.NewDistributor("distributor", consul.NetworkHTTPEndpoint(), flags, "")
+	querier := e2ecortex.NewQuerier("querier", consul.NetworkHTTPEndpoint(), flags, "")
+
+	var querier2 *e2ecortex.CortexService
+	if cfg.shuffleShardingEnabled {
+		querier2 = e2ecortex.NewQuerier("querier-2", consul.NetworkHTTPEndpoint(), flags, "")
+	}
+
+	require.NoError(t, s.StartAndWaitReady(querier, ingester, distributor))
+	require.NoError(t, s.WaitReady(queryFrontend))
+	if cfg.shuffleShardingEnabled {
+		require.NoError(t, s.StartAndWaitReady(querier2))
+	}
+
+	// Wait until distributor and queriers have updated the ring.
+	require.NoError(t, distributor.WaitSumMetrics(e2e.Equals(512), "cortex_ring_tokens_total"))
+	require.NoError(t, querier.WaitSumMetrics(e2e.Equals(512), "cortex_ring_tokens_total"))
+	if cfg.shuffleShardingEnabled {
+		require.NoError(t, querier2.WaitSumMetrics(e2e.Equals(512), "cortex_ring_tokens_total"))
+	}
+
+	// Push a series for each user to Cortex.
+	now := time.Now()
+	expectedVectors := make([]model.Vector, numUsers)
+	tenantIDs := make([]string, numUsers)
+
+	for u := 0; u < numUsers; u++ {
+		tenantIDs[u] = fmt.Sprintf("user-%d", u)
+		c, err := e2ecortex.NewClient(distributor.HTTPEndpoint(), "", "", "", tenantIDs[u])
+		require.NoError(t, err)
+
+		var series []prompb.TimeSeries
+		series, expectedVectors[u] = generateSeries("series_1", now)
+
+		res, err := c.Push(series)
+		require.NoError(t, err)
+		require.Equal(t, 200, res.StatusCode)
+	}
+
+	// query all tenants
+	c, err := e2ecortex.NewClient(distributor.HTTPEndpoint(), queryFrontend.HTTPEndpoint(), "", "", strings.Join(tenantIDs, "|"))
+	require.NoError(t, err)
+
+	result, err := c.Query("series_1", now)
+	require.NoError(t, err)
+
+	assert.Equal(t, mergeResults(tenantIDs, expectedVectors), result.(model.Vector))
+
+	// ensure a push to multiple tenants is failing
+	series, _ := generateSeries("series_1", now)
+	res, err := c.Push(series)
+	require.NoError(t, err)
+	require.Equal(t, 500, res.StatusCode)
+
+	// check metric label values for total queries in the query frontend
+	require.NoError(t, queryFrontend.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_query_frontend_queries_total"}, e2e.WithLabelMatchers(
+		labels.MustNewMatcher(labels.MatchEqual, "user", strings.Join(tenantIDs, "|")),
+		labels.MustNewMatcher(labels.MatchEqual, "op", "query"))))
+
+	// check metric label values for query queue length in either query frontend or query scheduler
+	queueComponent := queryFrontend
+	queueMetricName := "cortex_query_frontend_queue_length"
+	if cfg.querySchedulerEnabled {
+		queueComponent = queryScheduler
+		queueMetricName = "cortex_query_scheduler_queue_length"
+	}
+	require.NoError(t, queueComponent.WaitSumMetricsWithOptions(e2e.Equals(0), []string{queueMetricName}, e2e.WithLabelMatchers(
+		labels.MustNewMatcher(labels.MatchEqual, "user", strings.Join(tenantIDs, "|")))))
+
+	// TODO: check cache invalidation on tombstone cache gen increase
+	// TODO: check fairness in queryfrontend

It should be possible. However, I wouldn't make it a priority for this PR.

simonswine

comment created time in a day

Pull request review comment cortexproject/cortex

Add multi tenant query federation

 store_gateway_client:

 # (ingesters shuffle sharding on read path is disabled).
 # CLI flag: -querier.shuffle-sharding-ingesters-lookback-period
 [shuffle_sharding_ingesters_lookback_period: <duration> | default = 0s]
+
+tenant_federation:
+  # If enabled, multi tenant query federation can be used by supplying multiple
+  # tenant IDs in the read path (experimental).
+  # CLI flag: -querier.tenant-federation.enabled
+  [enabled: <boolean> | default = false]

👍

simonswine

comment created time in a day
