Robert Fratto (rfratto) — dev @grafana, secret grafana/agent man and Loki maintainer

grafana/agent 97

A lightweight subset of Prometheus and more, optimized for Grafana Cloud

rfratto/FreeImage 2

CMake-based FreeImage fork

rfratto/localenv 2

My personal local dev environment for work

rfratto/annoybot 1

annoy coworkers on slack in one easy script

rfratto/accidental-noise-library 0

Automatically exported from code.google.com/p/accidental-noise-library

rfratto/agent 0

A lightweight subset of Prometheus for Grafana Cloud.

rfratto/awaybot 0

A Slack bot to track away statuses

rfratto/bbparse 0

JavaScript library to parse BBCode-like expressions.

rfratto/beats 0

:tropical_fish: Beats - Lightweight shippers for Elasticsearch & Logstash

rfratto/bootkube 0

bootkube - Launch a self-hosted Kubernetes cluster

delete branch rfratto/agent

delete branch : persist-components

delete time in 5 hours

push event grafana/agent

Robert Fratto

commit sha 210b64648c16172a63e1ccce92db7259ccc478a6

Store prometheus components in Instance (#144)

* store prometheus components in Instance

This doesn't do anything right now, but it sets up the groundwork for calling ApplyConfig on components from outside of the Run function.

Related to #140.

* address review feedback

view details

push time in 5 hours

PR merged grafana/agent

Store prometheus components in Instance

This doesn't do anything right now, but it sets up the groundwork for calling ApplyConfig against components from outside of the Run method.

Related to #140.

+99 -66

0 comment

1 changed file

rfratto

pr closed time in 5 hours

push event rfratto/agent

Robert Fratto

commit sha 47befe55b758420e263f93951464a8e4b50b93e1

address review feedback

view details

push time in 5 hours

Pull request review comment grafana/agent

Store prometheus components in Instance

 func (i *Instance) Run(ctx context.Context) error {
 		)
 	}
 	{
+		sm, err := i.readyScrapeManager.Get()
+		if err != nil {
+			level.Error(i.logger).Log("msg", "failed to get scrape manager")
+			return err
+		}
+
 		// Scrape manager
 		rg.Add(
 			func() error {
-				err := scrapeManager.Run(discovery.SyncCh())
+				err := sm.Run(i.discovery.SyncCh())
 				level.Info(i.logger).Log("msg", "scrape manager stopped")
 				return err
 			},
 			func(err error) {
 				// The scrape manager is closed first to allow us to write staleness
 				// markers without receiving new samples from scraping in the meantime.
 				level.Info(i.logger).Log("msg", "stopping scrape manager...")
-				scrapeManager.Stop()
+				sm.Stop()

 				// On a graceful shutdown, write staleness markers. If something went
 				// wrong, then the instance will be relaunched.
 				if err == nil && i.cfg.WriteStaleOnShutdown {
 					level.Info(i.logger).Log("msg", "writing staleness markers...")
-					err := wstore.WriteStalenessMarkers(i.getRemoteWriteTimestamp)
+					err := i.wal.WriteStalenessMarkers(i.getRemoteWriteTimestamp)
 					if err != nil {
 						level.Error(i.logger).Log("msg", "error writing staleness markers", "err", err)
 					}
 				}

 				level.Info(i.logger).Log("msg", "closing storage...")
-				if err := storage.Close(); err != nil {
+				if err := i.storage.Close(); err != nil {
 					level.Error(i.logger).Log("msg", "error stopping storage", "err", err)
 				}
 			},
 		)
 	}

-	err = rg.Run()
+	err := rg.Run()
 	if err != nil {
 		level.Error(i.logger).Log("msg", "agent instance stopped with error", "err", err)
 	}
 	return err
 }

+// initialize sets up the various Prometheus components with their initial
+// settings. initialize will be called each time the Instance is run. Prometheus
+// components cannot be reused after they are stopped so we need to recreate them
+// each run.
+//
+// Components are stored as struct members to allow other methods to access
+// components from different goroutines (as opposed to keeping them completely
+// defined within the Run method where nothing else can access them).
+func (i *Instance) initialize(ctx context.Context, reg prometheus.Registerer) error {

We need a lock because another goroutine might request the active targets at any time (i.e., the user calls the new /agent/api/v1/targets API).

It doesn't matter that the components aren't running yet; we just need to prevent concurrent reads/writes to the pointers.

rfratto

comment created time in 8 hours

Pull request review comment grafana/agent

Store prometheus components in Instance

 func (i *Instance) Run(ctx context.Context) error {
 		)
 	}
 	{
+		sm, err := i.readyScrapeManager.Get()
+		if err != nil {
+			level.Error(i.logger).Log("msg", "failed to get scrape manager")
+			return err
+		}
+
 		// Scrape manager
 		rg.Add(
 			func() error {
-				err := scrapeManager.Run(discovery.SyncCh())
+				err := sm.Run(i.discovery.SyncCh())
 				level.Info(i.logger).Log("msg", "scrape manager stopped")
 				return err
 			},
 			func(err error) {
 				// The scrape manager is closed first to allow us to write staleness
 				// markers without receiving new samples from scraping in the meantime.
 				level.Info(i.logger).Log("msg", "stopping scrape manager...")
-				scrapeManager.Stop()
+				sm.Stop()

 				// On a graceful shutdown, write staleness markers. If something went
 				// wrong, then the instance will be relaunched.
 				if err == nil && i.cfg.WriteStaleOnShutdown {
 					level.Info(i.logger).Log("msg", "writing staleness markers...")
-					err := wstore.WriteStalenessMarkers(i.getRemoteWriteTimestamp)
+					err := i.wal.WriteStalenessMarkers(i.getRemoteWriteTimestamp)
 					if err != nil {
 						level.Error(i.logger).Log("msg", "error writing staleness markers", "err", err)
 					}
 				}

 				level.Info(i.logger).Log("msg", "closing storage...")
-				if err := storage.Close(); err != nil {
+				if err := i.storage.Close(); err != nil {
 					level.Error(i.logger).Log("msg", "error stopping storage", "err", err)
 				}
 			},
 		)
 	}

-	err = rg.Run()
+	err := rg.Run()
 	if err != nil {
 		level.Error(i.logger).Log("msg", "agent instance stopped with error", "err", err)
 	}
 	return err
 }

+// initialize sets up the various Prometheus components with their initial
+// settings. initialize will be called each time the Instance is run. Prometheus
+// components cannot be reused after they are stopped so we need to recreate them
+// each run.
+//
+// Components are stored as struct members to allow other methods to access
+// components from different goroutines (as opposed to keeping them completely
+// defined within the Run method where nothing else can access them).
+func (i *Instance) initialize(ctx context.Context, reg prometheus.Registerer) error {

Right, the components are stopped at the end of Run. The initialize method was pulled out to make it easier to grab the mutex. Would more comments or better naming be ideal here?

rfratto

comment created time in 8 hours

PR opened grafana/agent

Store prometheus components in Instance

This doesn't do anything right now, but it sets up the groundwork for calling ApplyConfig against components from outside of the Run method.

Related to #140.

+99 -66

0 comment

1 changed file

pr created time in a day

create branch rfratto/agent

branch : persist-components

created branch time in a day

push event grafana/agent

Jeroen Op 't Eynde

commit sha bb3465fffd7dce1f96fa346b7f4e7361fb7c6ba7

feat: use k8s-alpha (#143)

view details

push time in a day

delete branch grafana/agent

delete branch : duologic/move_to_k8s

delete time in a day

PR merged grafana/agent

feat: use k8s-alpha

Move on from the deprecated ksonnet library.

+61581 -61507

4 comments

1090 changed files

Duologic

pr closed time in a day

pull request comment grafana/agent

feat: use k8s-alpha

It is what it is; it's not much of a draft anymore. 99% of the change is vendoring, so it's not that massive.

I understand; there's just a lot of changed files, and it's hard to review through the GitHub UI.

We've been running k8s-alpha in production for a while now (deployment_tools), where grafana-agent also runs. Does that count as tested?

If this specific change has been tested, then yes! If not, I quickly ran the k3d example to validate this and everything LGTM. Once it's out of draft, I'm happy to approve and merge.

Duologic

comment created time in a day

pull request comment grafana/agent

feat: use k8s-alpha

Thanks for working on this! This is a pretty massive thing to review, and I'm leaning towards trust with a rubber stamp here. Has this been tested to make sure it still works? If not, I can spend some time testing it myself before approving and merging.

Duologic

comment created time in a day

push event rfratto/agent

Robert Fratto

commit sha 5ba6575a06d0361a8445d22c5821f3010b3f2918

pkg/prom: refactor InstanceManager (#142)

* pkg/prom: refactor InstanceManager

This commit refactors InstanceManager and makes it responsible for the actual running of instances directly. This allows InstanceManager to be aware of the running instance interfaces, and will open the door for implementing #140.

* add support for an instance failing to be created

* don't leak context when returning error in spawnProcess

view details

push time in 4 days

delete branch rfratto/agent

delete branch : refactor-instance-manager

delete time in 4 days

push event grafana/agent

Robert Fratto

commit sha 5ba6575a06d0361a8445d22c5821f3010b3f2918

pkg/prom: refactor InstanceManager (#142)

* pkg/prom: refactor InstanceManager

This commit refactors InstanceManager and makes it responsible for the actual running of instances directly. This allows InstanceManager to be aware of the running instance interfaces, and will open the door for implementing #140.

* add support for an instance failing to be created

* don't leak context when returning error in spawnProcess

view details

push time in 4 days

PR merged grafana/agent

pkg/prom: refactor InstanceManager

This PR refactors InstanceManager to make it responsible for running instances, rather than deferring to a function which creates and runs them. This now makes InstanceManager aware of the running instance interfaces, opening the door for implementing #140.

Note that with this PR, pkg/prom.Agent does very little outside of exposing an HTTP API and gluing various things together.

+186 -147

2 comments

8 changed files

rfratto

pr closed time in 4 days

pull request comment grafana/agent

pkg/prom: refactor InstanceManager

I've done some manual testing and this doesn't appear to have broken anything. I'll do more intensive manual testing pre-release as usual.

rfratto

comment created time in 4 days

pull request comment grafana/agent

pkg/prom: refactor InstanceManager

I want to do some manual testing of this but this should be ready for review now.

rfratto

comment created time in 4 days

push event rfratto/agent

Robert Fratto

commit sha 8dec9674817e42f97fc502908efe8ba9ad29b16a

don't leak context when returning error in spawnProcess

view details

push time in 4 days

push event rfratto/agent

Robert Fratto

commit sha 77d2da71f0f63719af0ffa7d297cdade8f367d97

don't leak context when returning error in spawnProcess

view details

push time in 4 days

push event rfratto/agent

Robert Fratto

commit sha 67daf47b3cfbdf404c918165b972347c0f5d0454

add support for an instance failing to be created

view details

push time in 4 days

PR opened grafana/agent

pkg/prom: refactor InstanceManager

This PR refactors InstanceManager to make it responsible for running instances, rather than deferring to a function which creates and runs them. This now makes InstanceManager aware of the running instance interfaces, opening the door for implementing #140.

Note that with this PR, pkg/prom.Agent does very little outside of exposing an HTTP API and gluing various things together.

+179 -141

0 comment

8 changed files

pr created time in 4 days

create branch rfratto/agent

branch : refactor-instance-manager

created branch time in 4 days

create branch rfratto/agent

branch : dynamic-update-instance

created branch time in 4 days

push event rfratto/ecobee_exporter

Robert Fratto

commit sha cbe896e3df500d5a0159c0de809c9c7b18518392

add metrics for cooling/fan/heating status

view details

push time in 6 days

issue comment grafana/agent

Combine instance configs to share WALs

This change will also affect users that are using integrations, since those currently spawn a dedicated WAL per running integration too.

rfratto

comment created time in 6 days

issue opened grafana/agent

Combine instance configs to share WALs

When running the Agent in Scraping Service Mode, a significant number of instance configs incurs a significant performance penalty, since a full set of Prometheus SDs, scrapers, remote storage, and WALs is spawned per config.

This overhead can be avoided by combining instance configs that write to the same set of remote_write configs. This change involves modifying the ConfigManager to combine instance configs into a "shared config." The full design doc for instance sharing is available here.

Depends on #140.

created time in 6 days

issue opened grafana/agent

Allow dynamic updating of instances

Instances are currently immutable and can only be updated by stopping the running instance and starting a new instance. Prometheus components support dynamically being given new configs, so we should utilize this for the Agent instances.

For now, dynamic updating should only apply to changing scrape_configs and remote_write. If any of the other settings, like host_filter, wal_truncate_frequency, remote_flush_deadline, or write_stale_on_shutdown are changed, the instance can be deleted/recreated as normal. Future work may change that behavior.

created time in 6 days

push event rfratto/ecobee_exporter

Robert Fratto

commit sha 17455fd2858318b7d6a2ebde1837abd8a0f3c336

cache thermostat results and only update on revision change

also change project name to ecobee_exporter

view details

push time in 7 days

Pull request review comment grafana/loki

pkg/promtail: propagate a logger rather than using util.Logger globally

 import (
 	"sync"

 	"github.com/cortexproject/cortex/pkg/util"
+	"github.com/go-kit/kit/log"

 	"github.com/grafana/loki/pkg/promtail/client"
 	"github.com/grafana/loki/pkg/promtail/config"
 	"github.com/grafana/loki/pkg/promtail/server"
 	"github.com/grafana/loki/pkg/promtail/targets"
 )

+// Option is a function that can be passed to the New method of Promtail and
+// customize the Promtail that is created.
+type Option func(p *Promtail)
+
+// WithLogger overrides the default logger for Promtail.
+func WithLogger(log log.Logger) Option {

It'd have to be a public field in the Config struct for this to work since the logger is already passed around when New is called. I'm open to this, but I think it'd be a little strange to do that.

I do agree that the Option pattern is a bit heavy-handed for this specific PR, but my justification is that my final PR for #2405 will add a second With<X> function (for passing a custom mux router), where having two option functions feels more justified.

I don't feel that strongly about this either though, I'm happy to make it a public field of Config if you think that'd be a better approach here.

rfratto

comment created time in 7 days

push event rfratto/loki

Robert Fratto

commit sha 2db3605fccc4bff45e0dd6f6fdae99b56f3fa3d0

remove applyOptions function

view details

push time in 7 days

PR opened grafana/loki

pkg/promtail: propagate a logger rather than using util.Logger globally

This PR allows for creating Promtail with a custom log.Logger instance which will be propagated and used consistently throughout the Promtail package. This allows for clients to provide a Promtail-specific logger.

The options pattern is now used for instantiating a Promtail instance; the only option available as of this PR is WithLogger. If unspecified, the default behavior is unchanged from how Promtail behaves today (i.e., using the global util.Logger).

Addresses a portion of #2405.

+60 -40

0 comment

7 changed files

pr created time in 7 days

create branch rfratto/loki

branch : promtail-custom-logger

created branch time in 7 days

delete branch rfratto/agent

delete branch : wal-bugs

delete time in 8 days

push event grafana/agent

Robert Fratto

commit sha df54b6c15957d298bf455dba1c6c55b7916ba08c

improve memory stability and performance of WAL (#137)

* improve memory stability and performance of WAL

Overall, this commit should result in less spiky memory usage and may demonstrate a measurable decrease in memory usage overall. Active series will now be reported more accurately.

Several small issues were fixed in particular:

1. The upstream TSDB usage of the WAL prevents series from getting deleted if they were pending a commit.

2. The upstream TSDB usage of the WAL recently changed to only delete the lower 2/3rds of segments.

3. The deleted series metric was being incorrectly reported.

4. We revert the behavior of not tracking labels for active series. This caused many issues and only marginally improved memory usage. Not tracking labels caused series to get double-counted when they went stale. Reverting this change does make memory usage worse, but the combined set of changes for this commit offsets the hit here.

* delete unused delLabels function

* also register metric to track appended samples

view details

push time in 8 days

PR merged grafana/agent

improve memory stability and performance of WAL

Overall, this PR should result in less spiky memory usage and may demonstrate a measurable decrease in memory usage overall. Active series will now be reported more accurately.

Several small issues were fixed in particular:

  1. Match the upstream TSDB usage of the WAL where series are prevented from getting deleted if they were pending a commit.

  2. Match the upstream TSDB usage of the WAL to only delete the lower 2/3rds of segments when truncating the WAL.

  3. The deleted series metric was being incorrectly reported.

  4. We revert the behavior of not tracking labels for active series. This caused many issues and only marginally improved memory usage. Not tracking labels caused series to get double-counted when they went stale. Reverting this change does make memory usage marginally worse, but the combined set of changes for this commit offsets the hit here.

/cc @annanay25

+28 -49

0 comment

2 changed files

rfratto

pr closed time in 8 days

Pull request review comment grafana/agent

improve memory stability and performance of WAL

 func (s *stripeSeries) saveLabels(hash uint64, series *memSeries, lbls labels.La
 	s.locks[i].Unlock()
 }

-func (s *stripeSeries) delLabels(hash uint64, series *memSeries) {

This function is unused now with this PR, the linter nagged me to remove it.

rfratto

comment created time in 8 days

push event rfratto/rtmp_exporter

Robert Fratto

commit sha 05d908fd53a451dcfc330ec5274831e909085ab6

add client uptime stats

view details

push time in 11 days

push event rfratto/rtmp_exporter

Robert Fratto

commit sha 8180d839ec7ad5e1b14b8fd520ee61ed5185c260

downgrade dependencies

view details

push time in 11 days

push event rfratto/rtmp_exporter

Robert Fratto

commit sha 3edaf6ee630d36df807606de6d17fcc030b95dfa

add more stats

view details

push time in 11 days

push event rfratto/rtmp_exporter

Robert Fratto

commit sha da1a04afae492f995d94b8803b3b2b0aebc8aa3d

get simple end-to-end example working

view details

push time in 11 days

push event rfratto/rtmp_exporter

Robert Fratto

commit sha 7eb45c3228f54eede3f19951dceb4beb0ed2083a

parse rtmp stats

view details

push time in 11 days

create branch rfratto/rtmp_exporter

branch : master

created branch time in 11 days

created repository rfratto/rtmp_exporter

Prometheus exporter for nginx_rtmp_module

created time in 11 days

delete branch rfratto/agent

delete branch : targets-api

delete time in 11 days

push event grafana/agent

Robert Fratto

commit sha ab56429580b8d185be802e4d065e135062a8ce06

Create scrape targets API (#139)

* create scrape targets API

/agent/api/v1/targets is a new API endpoint that will return info on all scrape targets currently being scraped by the running Agent. Targets from other Agents (i.e., in scraping service mode) will not be returned.

This API exposes the same information present on the Prometheus UI page for targets.

* change targets to always be an array, never null

* fix flaking test

* address review feedback

view details

push time in 11 days

PR merged grafana/agent

Create scrape targets API

/agent/api/v1/targets is a new API endpoint that will return info on all scrape targets currently being scraped by the running Agent. Targets from other Agents (i.e., in scraping service mode) will not be returned.

This API exposes the same information present on the Prometheus UI page for targets.

Provides some of #6.

+255 -13

1 comment

6 changed files

rfratto

pr closed time in 11 days

pull request comment grafana/agent

Create scrape targets API

@hoenn PTAL, all of your feedback should be addressed now

rfratto

comment created time in 11 days

push event rfratto/agent

Robert Fratto

commit sha cb030d8b3e417c8318e4c9a6b2a0b8e8d811fdb4

address review feedback

view details

push time in 11 days

Pull request review comment grafana/agent

Create scrape targets API

 func TestAgent_ListInstancesHandler(t *testing.T) {
 		})
 	})
 }
+
+func TestAgent_ListTargetsHandler(t *testing.T) {
+	fact := newMockInstanceFactory()
+	a, err := newAgent(Config{
+		WALDir: "/tmp/agent",
+	}, log.NewNopLogger(), fact.factory)
+	require.NoError(t, err)
+
+	r := httptest.NewRequest("GET", "/agent/api/v1/targets", nil)
+
+	t.Run("scrape manager not ready", func(t *testing.T) {
+		a.instances = map[string]inst{
+			"test_instance": &mockInstanceScrape{},
+		}
+
+		rr := httptest.NewRecorder()
+		a.ListTargetsHandler(rr, r)
+		expect := `{"status":"success","data":[]}`
+		require.Equal(t, expect, rr.Body.String())
+	})
+
+	t.Run("scrape manager targets", func(t *testing.T) {
+		tgt := scrape.NewTarget(labels.FromMap(map[string]string{
+			model.JobLabel:         "job",
+			model.InstanceLabel:    "instance",
+			model.SchemeLabel:      "http",
+			model.AddressLabel:     "localhost:12345",
+			model.MetricsPathLabel: "/metrics",
+			"foo":                  "bar",
+		}), nil, nil)
+
+		startTime := time.Date(1994, time.January, 12, 0, 0, 0, 0, time.UTC)
+		tgt.Report(startTime, time.Minute, fmt.Errorf("something went wrong"))
+
+		a.instances = map[string]inst{
+			"test_instance": &mockInstanceScrape{
+				tgts: map[string][]*scrape.Target{
+					"group_a": {tgt},
+				},
+			},
+		}
+
+		rr := httptest.NewRecorder()
+		a.ListTargetsHandler(rr, r)
+		expect := `{"status":"success","data":[{"instance":"test_instance","target_group":"group_a","endpoint":"http://localhost:12345/metrics","state":"down","labels":{"foo":"bar","instance":"instance","job":"job"},"last_scrape":"1994-01-12T00:00:00Z","scrape_duration_ms":60000,"scrape_error":"something went wrong"}]}`

Oh wow, I didn't know about JSONEq and that fixes a lot of my problems here. I'll change the tests to use that instead and break up the giant string into a more readable string.

rfratto

comment created time in 11 days

PR opened grafana/loki

Add RegisterFlagsWithPrefix to config structs

All config structs now have a RegisterFlagsWithPrefix function that allows specifying a string to prefix onto all flag names. This may be used by clients to namespace Promtail support in a larger application (e.g., grafana/agent).

One exception to this is the Weaveworks server, which doesn't currently have a RegisterFlagsWithPrefix function. This might be a problem down the line, but it's not a problem for my particular use case (I expose a subset of the Promtail config which doesn't include the Weaveworks server).

I noticed a few config structs were incorrectly registering to the global FlagSet rather than the FlagSet passed to RegisterFlags. This PR also includes a fix for that.

Related to #2405.

+54 -22

0 comment

5 changed files

pr created time in 11 days

create branch rfratto/loki

branch : promtail-flags-prefix

created branch time in 11 days

issue opened grafana/loki

Make Promtail more friendly for using as a library

I'm working on embedding Promtail into grafana/agent, and while my prototype works, there are a few things I'd like to change in Promtail to embed it more cleanly:

  1. Allow passing a custom log.Logger to Promtail rather than using util.Logger
  2. Allow providing a custom mux router and cleanup function to pkg/promtail/targets/lokipush
  3. Add RegisterFlagsWithPrefix functions to all config structs so I can register flags to a flagset with a namespaced prefix that makes it clear they're logging-related.

I'd like to make separate PRs to handle all of these, but wanted to open an issue first to collect feedback. I think passing a custom logger and router is best implemented through the functional options pattern. My usage of pkg/promtail would look something like:

promtail.New(cfg, false, promtail.WithLogger(logger), promtail.WithRouter(router))

While cmd/promtail would continue to use it like it already does:

promtail.New(config, *dryRun)

/cc @slim-bean @owen-d @cyriltovena

created time in 12 days

push event rfratto/agent

gotjosh

commit sha 7be2e5e570de56231925c5b1136dcd9a5fa885ea

Scrape default/kubernetes in host_filter mode

Fixes #124

Signed-off-by: gotjosh <josue@grafana.com>

view details

gotjosh

commit sha 27d67ec83be3144df1f57361340f38035001c249

Address review comments

Signed-off-by: gotjosh <josue@grafana.com>

view details

gotjosh

commit sha 4d210b6e7efbc33ffceeed654c109d934713c1d0

Use the correct scrape config

Signed-off-by: gotjosh <josue@grafana.com>

view details

Robert Fratto

commit sha 6646cbd5886264a21751ff8f239ec9cc7ee092dd

make absolute symbolic links relative

view details

Robert Fratto

commit sha 594ae2eea470bb658fb844c1b65c6a8e93f6515b

fix errors in tanka configs, make deployment name distinct

view details

Robert Fratto

commit sha ec0a8034edaad8dae0528edfb85e3540cfdbfbc6

make example-kubernetes

view details

Robert Fratto

commit sha 5b30306744a357504693e06facff30a80d6fbc66

use release docker image in kubernetes manifest

view details

gotjosh

commit sha d804389e9d6efbed36c78f4ff803f239a2cf1eb2

Merge pull request #126 from grafana/deployment-agent

Scrape default/kubernetes in host_filter mode

view details

Robert Fratto

commit sha 55e9697c948f7d68fc7c1a2a85a7becced08eba3

fix default settings for filesystem collector on linux (#128)

* fix default settings for filesystem collector on linux

* remove unneeded default, add docs

view details

Robert Fratto

commit sha fbc611dc701f53ff962a4fda0d9ae6af4b2dba0f

wip: promtail support

view details

Robert Fratto

commit sha ad3d903d3442fa151b45f31d2a251f368460afc8

bump to latest loki

view details

push time in 12 days

Pull request review comment grafana/agent

Create scrape targets API

 func TestAgent_ListInstancesHandler(t *testing.T) {
 		})
 	})
 }
+
+func TestAgent_ListTargetsHandler(t *testing.T) {
+	fact := newMockInstanceFactory()
+	a, err := newAgent(Config{
+		WALDir: "/tmp/agent",
+	}, log.NewNopLogger(), fact.factory)
+	require.NoError(t, err)
+
+	r := httptest.NewRequest("GET", "/agent/api/v1/targets", nil)
+
+	t.Run("scrape manager not ready", func(t *testing.T) {
+		a.instances = map[string]inst{
+			"test_instance": &mockInstanceScrape{},
+		}
+
+		rr := httptest.NewRecorder()
+		a.ListTargetsHandler(rr, r)
+		expect := `{"status":"success","data":[]}`
+		require.Equal(t, expect, rr.Body.String())
+	})
+
+	t.Run("scrape manager targets", func(t *testing.T) {
+		tgt := scrape.NewTarget(labels.FromMap(map[string]string{
+			model.JobLabel:         "job",
+			model.InstanceLabel:    "instance",
+			model.SchemeLabel:      "http",
+			model.AddressLabel:     "localhost:12345",
+			model.MetricsPathLabel: "/metrics",
+			"foo":                  "bar",
+		}), nil, nil)
+
+		startTime := time.Date(1994, time.January, 12, 0, 0, 0, 0, time.UTC)
+		tgt.Report(startTime, time.Minute, fmt.Errorf("something went wrong"))
+
+		a.instances = map[string]inst{
+			"test_instance": &mockInstanceScrape{
+				tgts: map[string][]*scrape.Target{
+					"group_a": {tgt},
+				},
+			},
+		}
+
+		rr := httptest.NewRecorder()
+		a.ListTargetsHandler(rr, r)
+		expect := `{"status":"success","data":[{"instance":"test_instance","target_group":"group_a","endpoint":"http://localhost:12345/metrics","state":"down","labels":{"foo":"bar","instance":"instance","job":"job"},"last_scrape":"1994-01-12T00:00:00Z","scrape_duration_ms":60000,"scrape_error":"something went wrong"}]}`

it's a good question. @gotjosh and I had talked about this a while ago, where he said he generally prefers testing for actual string content. WDYT @gotjosh, any more detail on one vs the other?

I'm happy to change this either way, and I acknowledge the expectation string is pretty unwieldy.

rfratto

comment created time in 13 days

push event rfratto/agent

Robert Fratto

commit sha 6fbe4fbe45c797b97adbef0a7111c36d994c15d6

fix flaking test

view details

push time in 13 days

push event rfratto/agent

Robert Fratto

commit sha be5fd9d6247d5ce345e0fe4a941421189cb99b67

fix flaking test

view details

push time in 13 days

push event rfratto/agent

Robert Fratto

commit sha d3442a495a75f251b99e225913dad0485a846ad3

change targets to always be an array, never null

view details

push time in 13 days

PR opened grafana/agent

Create scrape targets API

/agent/api/v1/targets is a new API endpoint that will return info on all scrape targets currently being scraped by the running Agent. Targets from other Agents (i.e., in scraping service mode) will not be returned.

This API exposes the same information present on the Prometheus UI page for targets.

Provides some of #6.

+233 -5

0 comment

6 changed files

pr created time in 13 days

push event rfratto/agent

Robert Fratto

commit sha 2c80b9cba8a405c32579fc2bf57f04c1ec8c1e61

create scrape targets API

/agent/api/v1/targets is a new API endpoint that will return info on all scrape targets currently being scraped by the running Agent. Targets from other Agents (i.e., in scraping service mode) will not be returned.

This API exposes the same information present on the Prometheus UI page for targets.

view details

push time in 13 days

create branch rfratto/agent

branch : targets-api

created branch time in 13 days

issue comment grafana/agent

WAL: Minimum series staleness time before deletion

Maybe I'm just not up to date on how the agent works but it's not clear what "received a write" means here.

Every 1 minute we'll poll the value of queue_highest_sent_timestamp_seconds and use that as the truncation timestamp. Any series whose timestamp is before that value will be scheduled for deletion. If the last received (i.e., scraped) timestamp for the series is still lower than queue_highest_sent_timestamp_seconds on the next iteration, it'll be deleted. Otherwise, if we got a sample that's newer than queue_highest_sent_timestamp_seconds, we'll unschedule that series for deletion and keep it in memory/in the WAL.

rfratto

comment created time in 14 days

pull request comment prometheus/prometheus

Allow changing log level at runtime

That's what they said, but their claim seems to be that the debug logs are so voluminous that their logging solution can't handle them, making debug logging unviable. I'm dubious of this claim, but even if it's true, changing how debug logs are enabled doesn't help them: they still can't enable debug logs.

This PR does not help with #7384 as stated.

To be fully transparent, the request is originating from a user whose logging system struggles to handle a high volume and they have a significant number of scrape targets. Debug level logging at their scrape load to the centralized logging system would cause a lot of issues for them. I didn't challenge the claim, but I find it to be reasonable (but this also isn't the forum to discuss that).

As Brian mentioned, while I think this is nice, this doesn't alleviate the problems from #7384; between dynamically changing log level and including the last received error in the UI, I think the latter is preferable.

roidelapluie

comment created time in 15 days

push eventrfratto/agent

Robert Fratto

commit sha 3150dc836b00c85e1788805772c6746a12917d66

also register metric to track appended samples

view details

push time in 18 days

Pull request review commentgrafana/agent

added files for RPM packaging

+# Sample config for Grafana Agent +server:    #configuration options for the embeded server it self. +  http_listen_address: '127.0.0.1'  # local host only. +  http_listen_port: 9090  +  +prometheus:+  global:+    scrape_interval:     15s  # By default, scrape targets every 15 seconds.+    external_labels:          # Static labels to add to all metrics+      scanned_by: 'Grafana Agent' +      scanner_on: 'local.snc.me.uk'

Hmm, maybe we want to comment these out and just leave them as examples?

simonc6372

comment created time in 18 days

Pull request review commentgrafana/agent

added files for RPM packaging

+#!/bin/sh++set -e++[ -f /etc/sysconfig/grafana-agent ] && . /etc/sysconfig/grafana-agent++startAgent() {+  if [ -x /bin/systemctl ] ; then+    /bin/systemctl daemon-reload

Nit: indentation

simonc6372

comment created time in 18 days

Pull request review commentgrafana/agent

added files for RPM packaging

+#!/bin/sh++set -e++[ -f /etc/sysconfig/grafana-agent ] && . /etc/sysconfig/grafana-agent++startAgent() {+  if [ -x /bin/systemctl ] ; then+    /bin/systemctl daemon-reload+		/bin/systemctl start grafana-agent.service

Really small nit: can you fix the indentation here?

simonc6372

comment created time in 18 days

Pull request review commentgrafana/agent

added files for RPM packaging

+#!/bin/sh++set -e++[ -f /etc/sysconfig/grafana-agent ] && . /etc/sysconfig/grafana-agent++startAgent() {+  if [ -x /bin/systemctl ] ; then+    /bin/systemctl daemon-reload+		/bin/systemctl start grafana-agent.service+	elif [ -x /etc/init.d/grafana-agent ] ; then+		/etc/init.d/grafana-agent start+	elif [ -x /etc/rc.d/init.d/grafana-agent ] ; then+		/etc/rc.d/init.d/grafana-agent start+	fi+}++stopAgent() {+	if [ -x /bin/systemctl ] ; then+		/bin/systemctl stop grafana-agent.service > /dev/null 2>&1 || :+	elif [ -x /etc/init.d/grafana-agent ] ; then+		/etc/init.d/grafana-agent stop+	elif [ -x /etc/rc.d/init.d/grafana-agent ] ; then+		/etc/rc.d/init.d/grafana-agent stop+	fi+}+++# Initial installation: $1 == 1+# Upgrade: $1 == 2, and configured to restart on upgrade+if [ $1 -eq 1 ] ; then+	[ -z "$AGENT_USER" ] && AGENT_USER="grafana-agent"+	[ -z "$AGENT_GROUP" ] && AGENT_GROUP="grafana-agent"+	if ! getent group "$AGENT_GROUP" > /dev/null 2>&1 ; then+    groupadd -r "$AGENT_GROUP"+	fi+	if ! getent passwd "$AGENT_USER" > /dev/null 2>&1 ; then+    useradd -r -g grafana-agent -d /var/lib/grafana-agent  -s /sbin/nologin -c "grafana-agent user" grafana-agent

Nit for indentation again :)

simonc6372

comment created time in 18 days

issue commentgrafana/agent

WAL: Minimum series staleness time before deletion

I could implement this now but I'd like to wait until #137 is merged at least.

rfratto

comment created time in 18 days

issue openedgrafana/agent

WAL: Minimum series staleness time before deletion

Using queue_highest_sent_timestamp_seconds doesn't guarantee that all samples prior to that time have been successfully sent and as such, it shouldn't be used as the only indicator for deleting stale series.

A new config option should be added to delete stale series only if they haven't received a write since queue_highest_sent_timestamp_seconds and their last write was over X seconds ago. This will help guarantee that samples get sent even if remote_write is sending them in a different order.

One thing to note: it's not clear how significant a problem this is. Since we only consider samples for deletion after ~2 GC cycles, and the GC cycle interval has to be greater than or equal to the scrape interval, it might be exceedingly difficult for things to align in a way where we GC unsent samples.

/cc @cstyan
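The proposed condition boils down to an AND of the two checks. A minimal sketch, assuming illustrative names (`shouldDelete` and its parameters are not real Agent identifiers; the config option described above did not exist at the time of writing):

```go
package main

import "fmt"

// shouldDelete reports whether a series may be garbage-collected: it
// must both be older than the highest sent timestamp AND have gone
// unwritten for at least minAge seconds. All timestamps are in seconds.
func shouldDelete(lastWriteTs, highestSentTs, nowTs, minAge int64) bool {
	return lastWriteTs < highestSentTs && nowTs-lastWriteTs >= minAge
}

func main() {
	// Stale relative to the send watermark, but last written only 30s
	// ago with a 60s minimum age: kept.
	fmt.Println(shouldDelete(970, 1000, 1000, 60))
	// Same series 60s later: now eligible for deletion.
	fmt.Println(shouldDelete(970, 1000, 1060, 60))
}
```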

created time in 18 days

push eventrfratto/agent

Robert Fratto

commit sha fbbed7d9cf4f3969265857f48a5207fdc50ecdbe

delete unused delLabels function

view details

push time in 18 days

PR opened grafana/agent

improve memory stability and performance of WAL

Overall, this PR should result in less spiky memory usage and may yield a measurable decrease in memory usage overall. Active series will now be reported more accurately.

In particular, several small issues were fixed:

  1. Match the upstream TSDB usage of the WAL where series are prevented from getting deleted if they were pending a commit.

  2. Match the upstream TSDB usage of the WAL to only delete the lower 2/3rds of segments when truncating the WAL.

  3. The deleted series metric was being incorrectly reported.

  4. We revert the behavior of not tracking labels for active series. This caused many issues and only marginally improved memory usage. Not tracking labels caused series to get double-counted when they went stale. Reverting this change does make memory usage marginally worse, but the combined set of changes for this commit offset the hit here.

/cc @annanay25

+24 -41

0 comment

2 changed files

pr created time in 18 days

push eventrfratto/agent

Robert Fratto

commit sha f4eea11f33801bfeff59b0aaaf60532bd128d75f

improve memory stability and performance of WAL Overall, this commit should result in less spikey memory usage and may demonstrate a measurable decrease in memory usage overall. Active series will now be reported more accurately. Several small issues were fixed in particular: 1. The upstream TSDB usage of the WAL prevent series from getting deleted if they were pending a commit. 2. The upstream TSDB usage of the WAL recently changed to only delete the lower 2/3rds of segments. 3. The deleted series metric was being incorrectly reported. 4. We revert the behavior of not tracking labels for active series. This caused many issues and only marginally improved memory usage. Not tracking labels caused series to get double-counted when they went stale. Reverting this change does make memory usage worse, but the combined set of changes for this commit offset the hit here.

view details

push time in 18 days

create branch rfratto/agent

branch : wal-bugs

created branch time in 20 days

push eventrfratto/agent

Robert Fratto

commit sha d83c9a8a76ad4a186acc4e9d2bb4902a1faa2ac7

fix tests on windows

view details

push time in 20 days

push eventrfratto/agent

Robert Fratto

commit sha 9a74bab7495a55cd426dd3fcaec3c40467434a8b

fix tests on windows?

view details

push time in 20 days

push eventrfratto/agent

Robert Fratto

commit sha 5d8d25f8b4c78034fabc87c56ed09224844614a0

agentctl: add wal-stats and target-stats wal-stats and target-stats are new debugging tools to help developers investigate issues with the WAL and help users discover which targets have an unexpected amount of active series. Users can use wal-stats and target-stats to discover what metric_relabel_rules they wish to define to filter targets and not send them to the remote_write endpoint. These commands supplant the old cmd/walreader.

view details

push time in 20 days

PR opened grafana/agent

agentctl: add wal-stats and target-stats

wal-stats and target-stats are new debugging tools to help developers investigate issues with the WAL and help users discover which targets have an unexpected amount of active series.

Users can use wal-stats and target-stats to discover which metric_relabel_configs they wish to define to filter metrics out before they are sent to the remote_write endpoint.

These commands supplant the old cmd/walreader.

(I also deleted an empty docker-compose file that was sitting around)

+3198 -124

0 comment

32 changed files

pr created time in 20 days

push eventrfratto/agent

Robert Fratto

commit sha 3a40bbb537f977a1de42c81709f3434586c21ab1

flesh out details for walstats

view details

push time in 20 days

create branch rfratto/agent

branch : wal-tools

created branch time in 20 days

issue openedgrafana/agent

Integrations: replace `agent_hostname` label with `instance` label

#134 pointed out issues with the agent_hostname label. Integrations would work more nicely with existing Grafana Dashboards if integrations instead replaced the value of the instance label with the machine hostname.

This is harder than it sounds; we don't want to do this manually with metric_relabel_configs, because we'd either have to deal with an added exported_instance label or force all integrations to not respect labels on behalf of users. The proper way to implement this is to intercept the label at the WAL layer.

Since this will be a feature added at the WAL layer, it makes sense to provide an optional replace_instance_label flag (which replaces the label value with the machine hostname) to the non-integration Prometheus configs as well.
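The WAL-layer interception point could look something like the sketch below. The function name, label representation, and hook placement are all hypothetical; this only shows the rewrite that would happen before a sample is appended:

```go
package main

import (
	"fmt"
	"os"
)

// replaceInstanceLabel rewrites the value of the "instance" label to
// the machine hostname before a sample reaches the WAL. It returns a
// copy so the caller's label set is not mutated.
func replaceInstanceLabel(lbls map[string]string, hostname string) map[string]string {
	out := make(map[string]string, len(lbls))
	for k, v := range lbls {
		out[k] = v
	}
	if _, ok := out["instance"]; ok {
		out["instance"] = hostname
	}
	return out
}

func main() {
	host, err := os.Hostname()
	if err != nil {
		host = "unknown"
	}
	lbls := map[string]string{
		"__name__": "node_cpu_seconds_total",
		"instance": "localhost:9100",
	}
	fmt.Println(replaceInstanceLabel(lbls, host)["instance"] == host)
}
```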

created time in 21 days

issue closedgrafana/agent

Grafana agent instance label and agent_hostname

I have Cortex set up and was configuring the Grafana Cloud Agent to scrape and forward metrics according to this. So, I have multiple machines and the Cloud Agent is running in each of them. I noticed that, for the node_exporter integration, a new label agent_hostname is added, which has the actual hostname, apart from the instance label, which has the value localhost:xxxx. I was also trying to scrape Envoy metrics and added a scrape config for the same, providing a static config to localhost. In the Envoy scraped metrics I do not see any agent_hostname label added, and the instance value is localhost for metrics from all the different machines. Since the instance value is the same, the series are not differentiated, and this causes issues.

The solution i understand is to use metric_relabel_configs and change the instance value to the actual hostname. My question is why was this approach not taken for the node_exporter integration and why add another label agent_hostname. What's the right way to accomplish this?

closed time in 21 days

ssesha

issue commentgrafana/agent

Grafana Agent Binary - Configuration question

Hey there! Like @robx mentioned, you need a Grafana Cloud account with Hosted Prometheus to follow along with the instructions in that blog post. If you don't want to do that, your best bet is to run Cortex yourself; their site has a set of good instructions for how to do that. When that guide talks about setting remote_write, you'd pass that URL to the Agent's config rather than putting it in Prometheus.

darkwizard242

comment created time in 22 days

issue commentgrafana/agent

Grafana agent instance label and agent_hostname

Hi! Having the integrations write the hostname to the instance label instead of localhost makes a lot more sense than the separate agent_hostname label, but it's harder than it sounds to implement, so it'd take some amount of care to have it be done transparently to the users.

The separate agent_hostname integration label breaks a lot of existing dashboards that depend on instance being the unique label, so I'm interested in changing the behavior here relatively soon (cc @eamonryan). For your specific use case, it'd be useful if you could tell the "normal" scrape configs to replace the instance label with the machine hostname as well.

For now, metric_relabel_configs is the way to do it until we have something better in place.
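Until something better exists, a relabel rule along these lines does the replacement manually. This is a hedged example: the hostname value is a placeholder you'd substitute per machine (e.g. via config management), since Prometheus relabeling cannot read the OS hostname at runtime.

```yaml
# Hypothetical snippet for one scrape config; 'my-hostname' must be
# templated in per machine, e.g. by Ansible/Chef/Puppet.
metric_relabel_configs:
  - source_labels: [instance]
    target_label: instance
    replacement: 'my-hostname'
```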

ssesha

comment created time in 22 days

PR closed grafana/agent

use fpm for generating system packages

This is a WIP and only does RPM, but there should be deb support before this is merged. I'm also planning on having the config file it provides be more in-depth, with a ton of comments explaining everything available to users.

I'm new to building system packages, so I'm opening this up early in a draft to collect feedback on:

  1. What init systems should we try to support? Is just systemd enough in 2020?
  2. What other things might we want to set up in the postinstall scripts?
  3. Anything else I'm missing (e.g., structuring this in the repo, etc)
+58175 -49135

1 comment

643 changed files

rfratto

pr closed time in a month

pull request commentgrafana/agent

use fpm for generating system packages

This is superseded by #129, closing this.

rfratto

comment created time in a month

issue openedgrafana/agent

Scraping service: WAL management

If a Prometheus instance is deleted, there's no reason to keep the WAL around any more. It should be removed from disk after the instance is shut down. The implementation of this should be careful to make sure this only happens if the instance is explicitly deleted; restarts of the instance or the Agent process should keep the WAL.

It's possible that the Agent could restart and not be reassigned any of the instances it had before. This would leave the old WALs sitting around on disk; there should be a really slow poll interval to check for old WALs and delete them.

The WAL should also be given a max age. The truncation loop will use the higher of the last successful send timestamp and the max-age cutoff when truncating old data. This ensures that the WAL does not grow forever.
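Picking the truncation point under that rule is a one-liner. A minimal sketch, with illustrative names only (this is the proposal, not the shipped implementation):

```go
package main

import "fmt"

// truncationTs picks the WAL truncation point: the later of the last
// successful remote_write send and the max-age cutoff, so a stalled
// remote endpoint cannot make the WAL grow without bound. All values
// are Unix timestamps in seconds.
func truncationTs(lastSendTs, nowTs, maxAgeSecs int64) int64 {
	cutoff := nowTs - maxAgeSecs
	if lastSendTs > cutoff {
		return lastSendTs
	}
	return cutoff
}

func main() {
	// Sends are keeping up: truncate at the send watermark.
	fmt.Println(truncationTs(990, 1000, 600)) // 990
	// Sends stalled long ago: truncate at now - maxAge instead.
	fmt.Println(truncationTs(100, 1000, 600)) // 400
}
```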

created time in a month

issue openedgrafana/agent

Global Prometheus Remote Write

Today, users using integrations alongside normal scrape configs have to configure remote_write twice: once for the Prometheus instance config and once for the integrations config.

It'd be nice to have a global remote write section within the prometheus block that applies to all instance configs and integration configs, removing the need to specify the same remote_write config twice.

created time in a month

issue openedgrafana/agent

Scraping service: Explore allowing failed Agents in hash ring

If an Agent goes down and a config hashes to it, that config will not be loaded until the node is forgotten from the ring. Can we remove the Agent's dependency on quorum checking? It's not clear whether we benefit from the same guarantees that Cortex needs when looking up an entry in the ring.

created time in a month

push eventgrafana/loki

Owen Diehl

commit sha 48b8504c30e6680bfd1c009fc62ef1d41285f4b3

lock fix for flaky test (#2268)

view details

Ed Welch

commit sha aee064d0a8ff3f4692f5f95efee04740595a5595

Loki: Series API will return all series with no match or empty matcher (#2254) * Allow series queries with no match or an empty {} matcher which will return all series for the provided timeframe. * fix test * fix lint

view details

push time in a month

push eventgrafana/loki

Cyril Tovena

commit sha 1e19aa4715c83dc1b58475779b3d9941e91cee91

Add performance profile flags for logcli. (#2248) * Add performance profile flags for logcli. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com> * Update the logcli documentation. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com> * Remove comment. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com> * remove unused import. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

view details

Lukas Grimm

commit sha 17390fd45d1f35342f79cba4b33f48a07f9726e9

Canary: make stream configurable (#2259) * tried to fix canary in non kubernetes env * update doc

view details

Owen Diehl

commit sha 7682b32cbae8c1146d44601ebf4b13cc9c718b98

Revert "Revert "Update go from 1.13 to 1.14. (#2013)"" (#2029) * Revert "Revert "Update go from 1.13 to 1.14. (#2013)"" This reverts commit e8fece69e316c7c58ca2033349e43b9bbac2f0b0. * Update cortex to latest. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com> * Update go.sum and doc. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com> Co-authored-by: Cyril Tovena <cyril.tovena@gmail.com>

view details

Cyril Tovena

commit sha 8161b929060659069a7c860e9c5ea8708bc378f0

Update to latest cortex. (#2266) * Update to latest cortex. Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com> * update go.sum Signed-off-by: Cyril Tovena <cyril.tovena@gmail.com>

view details

push time in a month

push eventgrafana/agent

Robert Fratto

commit sha 55e9697c948f7d68fc7c1a2a85a7becced08eba3

fix default settings for filesystem collector on linux (#128) * fix default settings for filesystem collector on linux * remove unneded default, add docs

view details

push time in a month

issue closedgrafana/agent

node_exporter: filesystem collector doesn't collect anything by default on Linux

Looks like the filesystem config regexes aren't working for Linux

closed time in a month

rfratto

push eventrfratto/agent

Robert Fratto

commit sha ba9146de1066b11adfe50e762f4073d0271c4d32

remove unneded default, add docs

view details

push time in a month

PR opened grafana/agent

fix default settings for filesystem collector on linux

Fixes #127

+3 -4

0 comment

1 changed file

pr created time in a month

create branch rfratto/agent

branch : fix-node-exporter-filesystem-defaults

created branch time in a month

issue openedgrafana/agent

node_exporter: filesystem collector doesn't collect anything by default on Linux

Looks like the filesystem config regexes aren't working for Linux

created time in a month

pull request commentgrafana/agent

Scrape default/kubernetes in host_filter mode

Yep, feel free to merge whenever 🙂

gotjosh

comment created time in a month
