Gonzalo Peci pecigonzalo Palma de Mallorca, Spain linkedin.com/in/pecig

pecigonzalo/cadvisor 1

Analyzes resource usage and performance characteristics of running containers.

pecigonzalo/agent 0

The Buildkite Agent is an open-source toolkit written in Golang for securely running build jobs on any device or network

pecigonzalo/amazon-ecr-credential-helper 0

Automatically gets credentials for Amazon ECR on docker push/docker pull

pecigonzalo/amazon-ecs-agent 0

Amazon EC2 Container Service Agent

pecigonzalo/amazon-ecs-cli 0

The Amazon ECS CLI enables users to run their applications on ECS/Fargate using the Docker Compose file format, quickly provision resources, push/pull images in ECR, and monitor running applications on ECS/Fargate.

pecigonzalo/amplify-cli 0

A CLI toolchain for simplifying serverless web and mobile development.

pecigonzalo/application_python 0

A Chef cookbook to deploy Python applications.

pecigonzalo/atom-project-shell-env 0

Atom package to load shell env variables from project directory

started liljencrantz/crush

started time in 6 hours

issue opened sourcegraph/sourcegraph

WIP: Distribution 3.20 Tracking issue

Plan

<!-- Summarize what the team wants to achieve this iteration.

  • What are the problems we want to solve or what information do we want to gather?
  • Why is solving those problems or gathering that information important?
  • How do we plan to solve those problems or gather that information? -->

Availability

If you have planned unavailability this iteration (e.g., vacation), you can note that here.

Tracked issues

<!-- BEGIN WORK --> <!-- END WORK -->

Legend

  • 👩 Customer issue
  • 🐛 Bug
  • 🧶 Technical debt
  • 🛠️ Roadmap
  • 🕵️ Spike
  • 🔒 Security issue
  • :shipit: Pull Request

created time in 8 hours

Pull request review comment sourcegraph/about

Update customer issues process

+# Filing customer issues++Read [the support overview](index.md) before filing an issue.++## Create a customer issue++Customer support tickets should be translated to GitHub issues. In such cases, [create a+new issue for the request](https://github.com/sourcegraph/customer/issues/new).

@slimsag updated, please review

pecigonzalo

comment created time in 9 hours

Pull request review comment sourcegraph/about

Update customer issues process

+# Filing customer issues++Read [the support overview](index.md) before filing an issue.++## Create a customer issue++Customer support tickets should be translated to GitHub issues. In such cases, [create a+new issue for the request](https://github.com/sourcegraph/customer/issues/new).++Provide the appropriate context and add a label with the affected customer as `customer/$name`. Once its created, sharing it with the required [team](routing_questions.md).+If necessary, link to the appropriate JIRA Service Desk ticket or [HubSpot](#find-the-unique-company-url) notes.++### General issues++General issues are those that affect more users than those of a particular deployment. In such cases, create a [new issue for the request](https://github.com/sourcegraph/sourcegraph/issues/new/choose) describing it. If there was a previous [customer issue](##create-a-customer-issue), please link the issue in its description.++Remove any potentially private information (e.g. individual people's names, company names, self-hosted Sourcegraph URLs, repo names, screenshots, etc.)

@uwedeportivo updated, please review

pecigonzalo

comment created time in 9 hours

push event sourcegraph/about

Gonzalo Peci

commit sha e67792a509745a3b2fe94cccfd6e75df009dc7e2

Clarify where to remove private information from


Gonzalo Peci

commit sha f35a86076f74de8ea77f2ba1125243fc3b3eafbd

Clarify when to create GitHub issues


push time in 9 hours

Pull request review comment sourcegraph/about

Update customer issues process

+# Filing customer issues++Read [the support overview](index.md) before filing an issue.++## Create a customer issue++Customer support tickets should be translated to GitHub issues. In such cases, [create a+new issue for the request](https://github.com/sourcegraph/customer/issues/new).

I'll rephrase it to indicate when it requires other engineers. While it's not the intention for all the issues they manage to get an associated GitHub issue, I think it would be a useful metric for CE to be able to analyze what we should document more or what the most frequent types of questions we get are.

pecigonzalo

comment created time in 10 hours

Pull request review comment sourcegraph/about

Update customer issues process

+# Filing customer issues++Read [the support overview](index.md) before filing an issue.++## Create a customer issue++Customer support tickets should be translated to GitHub issues. In such cases, [create a+new issue for the request](https://github.com/sourcegraph/customer/issues/new).++Provide the appropriate context and add a label with the affected customer as `customer/$name`. Once its created, sharing it with the required [team](routing_questions.md).+If necessary, link to the appropriate JIRA Service Desk ticket or [HubSpot](#find-the-unique-company-url) notes.++### General issues++General issues are those that affect more users than those of a particular deployment. In such cases, create a [new issue for the request](https://github.com/sourcegraph/sourcegraph/issues/new/choose) describing it. If there was a previous [customer issue](##create-a-customer-issue), please link the issue in its description.++Remove any potentially private information (e.g. individual people's names, company names, self-hosted Sourcegraph URLs, repo names, screenshots, etc.)

I'll rephrase it; I meant it to ensure no private information ends up on the general issue filed on sourcegraph/sourcegraph.

pecigonzalo

comment created time in a day

PR opened sourcegraph/about

Update customer issues process

I would like to ensure all customer issues are tracked and created in GitHub. Right now, some issues are created in GitHub while others live in other tracking systems, which makes it difficult to analyze the types of issues we have, how many there are, etc.

As most of our workflow is already in GitHub, I think it is a good idea to continue to do it here, as we can benefit from adding issues to labels, projects, and milestones, as well as cross-linking them to PRs or other issues. We could, for example, use labels to categorize issues.

The downside of using GitHub is that there is no out-of-the-box tool to analyze the data for MTTR or other types of reports. Given those are not required at the moment, I don't think it will be a problem.

+43 -48

0 comment

7 changed files

pr created time in a day

create branch sourcegraph/about

branch : gp/issues

created branch time in a day

pull request comment sourcegraph/sourcegraph

Document when to introduce new services or not

I think many of the items listed in the "additional complexity" section could be misleading, as most apply to any new feature/service. Metrics, alerts, docs, deployment updates, how it scales, etc. need to be thought about, updated, and/or created regardless of whether it is a new service or developed as part of an existing service.

slimsag

comment created time in a day

pull request comment sourcegraph/about

distribution roadmap

@sqs It is, but we have to update it to match our current goals.

slimsag

comment created time in 2 days

issue comment sourcegraph/sourcegraph

Disable low resource utilization alerts

We have some changes already in 3.19 that might mitigate this, and we will re-review this issue after that release.

pecigonzalo

comment created time in 2 days

pull request comment sourcegraph/sourcegraph

monitoring: encourage silencing, render entry for alerts w/o solutions

@bobheadxi I don't share that concern; if they have a link or a clear relationship to how to silence alerts, I think it's actually more likely that someone will search for "silence alerts sourcegraph" than wait for an alert to pop up. If we want to point them in the right direction, we could link to the "how to silence" page/section from the alert. If it's hard to find the sections and understand our documentation on how to monitor and manage alerts, we should fix that instead.

In general, I would actually not encourage silencing without an expiry, as it's likely the silence will remain in place after the issue is fixed.

bobheadxi

comment created time in 2 days

pull request comment sourcegraph/sourcegraph

monitoring: encourage silencing, render entry for alerts w/o solutions

I think it would be simpler to have a header that says "silencing alerts" and tells you how to do it in a generic fashion and we can reference/link that instead.

Silence an alert

If you are aware of an alert and want to silence notifications for it, add the following to your site configuration:

{
  "observability.silenceAlerts": [
    "ALERT_NAME"
  ]
}

You can find the ALERT_NAME on lorem ipsum

bobheadxi

comment created time in 2 days

pull request comment sourcegraph/about

cloud: document manual migrations we're performing

I think there are several reasons to avoid supporting that in the service. Security-wise, if the service needs to create the user, it means the service has admin permissions to create users and perform other actions, as it requires them for migration, which in most cases is undesirable.

Aside from that, while you can perform any action in a migration, since it can execute any SQL command, I don't think this is a migration, as the scope of the change is outside of its own database. Migrations traditionally manage the schema of a database, but not the database itself or any other part of the system. I would say this is provisioning and configuration which, in my opinion, is not in the scope of the service. Allowing admins to provision and manage their system using their own tools favors composability and reduces the number of permutations we have to account for when validating that we can provision a database because we don't allow customers to do it on their own. We already have customers with restricted environments which will most likely hit this issue.

Just to clarify: we already require admins to provision a database for Sourcegraph; performing these migrations will require them to have a dedicated database server, not just a database for Sourcegraph.

slimsag

comment created time in 3 days

started samber/awesome-prometheus-alerts

started time in 3 days

push event pecigonzalo/mothership

push time in 4 days

push event pecigonzalo/mothership

Gonzalo Peci

commit sha 829fd8a9ce82429ced4ab684a3713ff781cc6f29

--wip-- [skip ci]


Gonzalo Peci

commit sha 7c76159efc71727d3ef247124f780b76d40f2dc4

--wip-- [skip ci]


Gonzalo Peci

commit sha a2dad90b6a71eea14386e1b314156d4939616856

--wip-- [skip ci]


Gonzalo Peci

commit sha 3f8d4786efdb1ac3d78a9b2e3a0a954bdc097623

--wip-- [skip ci]


Gonzalo Peci

commit sha 2a0eecc6843c2b6186ffb61a9afa6bbf4f28ffd3

Fix network compat for ansible


push time in 4 days

issue opened sourcegraph/sourcegraph

Disable low resource utilization alerts

We currently have multiple [WARNING] symbols: less than X alerts that notify about services or resources that are over-provisioned. As we are not periodically reviewing and actioning these alerts, we want to remove their notifications and re-assess how to implement these alerts in the future.

Task

  • [ ] Disable low resource utilization notifications via Slack
  • [ ] Disable low resource utilization notifications via site-admin

created time in 4 days

issue opened sourcegraph/src-cli

Disable low resource utilization alerts

We currently have multiple [WARNING] symbols: less than X alerts, which fire at different thresholds and levels, that notify about services that are over-provisioned. As we are not periodically reviewing and actioning the alerts, we want to remove their notifications and re-assess how to implement these alerts in the future.

Task

  • [ ] Disable low resource utilization notifications via Slack
  • [ ] Disable low resource utilization notifications via site-admin

created time in 4 days

pull request comment sourcegraph/about

cloud: document manual migrations we're performing

@tsenart I don't think there should be for indexes, unless we are testing something, and even then, they could be done in code and feature-flagged or reverted.

There are other cases being referenced here, which are non-application-related tasks, such as the RO user creation. Those types of changes or settings should not be part of the application migration, as they are not part of the application; they are more on the side of provisioning and deployment and are not relevant to all environments and deployments.

Let's say we would like to require one or multiple read-only users; we should not impose how those are created, as different environments will deploy and configure their databases differently, and some might not even grant the Sourcegraph migration script the permissions to create users. As an administrator, I should be able to choose to use my own database and administer it following my internal requirements and guidelines, as long as it meets the requirements of the Sourcegraph service.
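For example, an administrator could provision such a read-only user themselves with standard Postgres statements, outside any application migration. A minimal sketch, assuming a Postgres database named `sourcegraph`; the role name and password are illustrative:

```sql
-- Run by the DBA with their own tooling, not by the app's migration step.
CREATE ROLE sourcegraph_ro LOGIN PASSWORD 'changeme';  -- illustrative credentials
GRANT CONNECT ON DATABASE sourcegraph TO sourcegraph_ro;
GRANT USAGE ON SCHEMA public TO sourcegraph_ro;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO sourcegraph_ro;
```

The service then only needs to document the permissions it requires, not how to grant them.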

slimsag

comment created time in 4 days

issue comment sourcegraph/sourcegraph

Distribution: 3.19 Tracking issue

Last week

We finished our initial team goals, and I also finalized the review of RFC-199. We will test using microVMs with Ignite for a v0 and will have to review the outcome of that testing before we can move to v1 and define how we deploy/support/HA/etc.

This week

We will kick off our 360 review cycle and I will focus on that. I'll be working on the roadmap and a product readiness document with Stephen and will pair with Geoffrey to get more familiar with our Dhall implementation. I have not been able to make progress on RFC-202, and if time allows I would like to finish that up.

Team update

The high priority items from last week seem to be resolved, and we will return to our tracking issue priorities.

I'll update this again after I confirm those issues are resolved.

pecigonzalo

comment created time in 4 days

delete branch sourcegraph/about

delete branch : gp/commitments

delete time in 7 days

push event sourcegraph/about

Gonzalo Peci

commit sha 539795f5b3b8b05f85d88ab44eb63fc2f822e9fd

distribution: Creating GCP commitments (#1312)


push time in 7 days

PR merged sourcegraph/about

distribution: Creating GCP commitments

Document current commitments and the commitment creation process.

+47 -0

0 comment

2 changed files

pecigonzalo

pr closed time in 7 days

PR opened sourcegraph/about

distribution: Creating GCP commitments

Document current commitments and the commitment creation process.

+47 -0

0 comment

2 changed files

pr created time in 7 days

create branch sourcegraph/about

branch : gp/commitments

created branch time in 7 days

delete branch sourcegraph/about

delete branch : gp/distribution-goals

delete time in 9 days

push event sourcegraph/about

Gonzalo Peci

commit sha a601f73e85e97ac3fd1c0b1ec32e8b467e344e78

Add distribution team goals (#1294) * Add distribution team goals * fixup! Add distribution team goals * Guide goals update * Update planning processs * fixup! Guide goals update * Update handbook/engineering/distribution/goals.md Co-authored-by: uwedeportivo <534011+uwedeportivo@users.noreply.github.com> * fixup! Guide goals update Co-authored-by: uwedeportivo <534011+uwedeportivo@users.noreply.github.com>


push time in 9 days

PR merged sourcegraph/about

Add distribution team goals

Adds our initial team goals

+49 -3

0 comment

3 changed files

pecigonzalo

pr closed time in 9 days

push event sourcegraph/about

Gonzalo Peci

commit sha f76df3a324fd67d4e8aef355cc592ed72cd77354

fixup! Guide goals update


push time in 9 days

Pull request review comment sourcegraph/about

Add distribution team goals

+# Goals++Goals are continuously updated and reviewed. If you find these goals do not reflect our current priorities or are out of date, please update them as soon as possible or add it as a topic to our [weekly sync](recurring_processes.md#weekly-distribution-team-sync).++## Medium-term goals++### Any engineer at Sourcegraph can create a release for all of our supported deployment types by running a single command++Creating a new release for our deployments is currently a semi-automated process, which requires several manual steps and synchronizing our versioned artifacts (Sourcegraph, Kubernetes manifests, docker-compose manifests, etc). We want to enable any engineer to perform a release as often as needed, to enable this we want to make releasing Sourcegraph a simple, automated process.++- **Owner**: Distribution Team+- **Status**: In Progress+- **Outcomes**:+  - Releases can be triggered by a single manual step+  - All supported deployment types are released at the same time with the same command+  - Support documentation enables any engineer to perform a release with confidence++### Upgrades between releases are easy to perform++Performing upgrades to deployments is currently a complicated process that requires keeping a fork of our configuration and resolving diff conflicts when performing upgrades which are often complicated as the configuration might contain environment-specific customization. 
This process creates a bad experience for our customers because of the unknown amount of effort of the upgrade process.+We will start by looking at our Kubernetes deployment and working on an easier update process.++- **Owner**: Distribution Team+- **Status**: In Progress+- **Outcomes**:+  - Upgrades to deployments do not require resolving diff conflicts from upstream+  - Upgrading a deployment configuration requires less than 2 hours of work++### Improve the debugging and troubleshooting process+As we deploy Sourcegraph to multiple dissimilar environments, we need to provide a consistent and straight forward process. We will initially focus on reducing the time it takes to collect troubleshooting information.

As we deploy Sourcegraph to multiple dissimilar environments, we need to provide a consistent and straightforward process to debug issues. We are currently lacking tools to collect debugging information (configuration, type, size, diff from upstream, etc.) consistently and a process to capture the output of debugging sessions to feed back into our priorities and documentation. We will initially focus on reducing the time it takes to collect troubleshooting information.

I think this might reflect it better.

pecigonzalo

comment created time in 9 days

delete branch sourcegraph/about

delete branch : gp/terraform-state-guide

delete time in 9 days

push event sourcegraph/about

Gonzalo Peci

commit sha fcef7b5a2ad197b3f9e0448f5e5acf5caa024103

Document Terraform state styleguide (#1297) * Document Terraform state styleguide * Set @sourcegraph/distribution as CODEOWNERS for terraform style


push time in 9 days

PR merged sourcegraph/about

Document Terraform state styleguide

This will add information about our standard for terraform state configuration.

+31 -0

1 comment

2 changed files

pecigonzalo

pr closed time in 9 days

push event sourcegraph/about

Gonzalo Peci

commit sha 2cd48c9576861d3ade8a99e4e5d335a2cf3557a5

Update handbook/engineering/distribution/goals.md Co-authored-by: uwedeportivo <534011+uwedeportivo@users.noreply.github.com>


push time in 9 days

push event sourcegraph/about

Gonzalo Peci

commit sha b2e3f6f4cb7fb6121a5a8c946ef0d6edd59cd579

Set @sourcegraph/distribution as CODEOWNERS for terraform style


push time in 9 days

Pull request review comment sourcegraph/about

Document Terraform state styleguide

  - General Terraform [styleguide](https://www.terraform.io/docs/configuration/style.html) +## State++State must be stored using a [GCS Terraform state backend](https://www.terraform.io/docs/backends/types/gcs.html).++Example configuration+```+terraform {+  required_version = "0.12.26"++  backend "gcs" {+    bucket = "sourcegraph-tfstate"+    prefix = "infrastructure/dns"+  }+}+```++### State for state buckets++Because we need to create state buckets as code, we also need to store the state of the code that creates the state bucket. Given this code rarely changes and that moving it to be stored in a remote location creates a chicken and egg situation, we will store state bucket creation's state in Git.

I have that on a shirt :D

pecigonzalo

comment created time in 9 days

PR opened sourcegraph/about

Document Terraform state styleguide

This will add information about our standard for terraform state configuration.

+30 -0

0 comment

1 changed file

pr created time in 9 days

create branch sourcegraph/about

branch : gp/terraform-state-guide

created branch time in 9 days

issue closed sourcegraph/sourcegraph

Migrate terraform state to GCP

Currently the following Terraform deployments rely on local state, with developers running a terraform apply and then checking their code + a state file into the repo. In doing so we assume the risk that developers could corrupt a state file or forget to check it in.

The following terraform deployments should be migrated to use remote state in GCP:

  • [ ] https://github.com/sourcegraph/infrastructure/tree/master/cloud
  • [ ] https://github.com/sourcegraph/infrastructure/tree/master/dns
  • [ ] https://github.com/sourcegraph/infrastructure/tree/master/site24x7

TODO

Determine naming scheme for each deployment
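The migration for each deployment amounts to adding a backend block pointing at the shared state bucket; a minimal sketch based on the styleguide's GCS example (the `infrastructure/cloud` prefix is a hypothetical naming choice):

```hcl
terraform {
  # Remote state in the shared GCS bucket (see the Terraform state styleguide).
  backend "gcs" {
    bucket = "sourcegraph-tfstate"
    prefix = "infrastructure/cloud" # hypothetical; one prefix per deployment
  }
}
```

After adding the block, `terraform init` offers to copy the existing local state into the bucket.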

closed time in 9 days

davejrt

push event sourcegraph/about

Gonzalo Peci

commit sha 7556590c718852da480ed2cf79bffc151c99dd83

fixup! Guide goals update


push time in 9 days

push event sourcegraph/about

Gonzalo Peci

commit sha 454a2ae757d1e8d290c194cf9c488065457804c6

Guide goals update


Gonzalo Peci

commit sha e58d6b0d53f21c70438410f1aae98bdbe40b064f

Update planning processs


push time in 9 days

push event sourcegraph/about

Gonzalo Peci

commit sha 155a608b272bb9d9c063dc288cb4db668aa9d6c2

fixup! Add distribution team goals


push time in 9 days

PR opened sourcegraph/about

Add distribution team goals
+43 -0

0 comment

2 changed files

pr created time in 9 days

create branch sourcegraph/about

branch : gp/distribution-goals

created branch time in 9 days

pull request comment sourcegraph/sourcegraph

search: trace and observe each zoekt host

I would be careful with this sort of metric, as they can create a cardinality explosion for the metrics DB. In most cases, I believe the downstream service should actually provide the metrics if possible, and this service should only provide its own health (general latency to the upstream, etc.).

keegancsmith

comment created time in 10 days

started KartikChugh/Otto

started time in 10 days

pull request comment sourcegraph/sourcegraph

search: trace and observe each zoekt host

In theory, each service should already have an identifier, as Prometheus adds one from service discovery. I'll verify.
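Prometheus does attach an `instance` label to every scraped series by default, derived from the target address, so per-host series are distinguishable without extra code. A sketch of what that looks like in a scrape config; the job and target names here are hypothetical:

```yaml
scrape_configs:
  - job_name: "zoekt"
    static_configs:
      # Each target's series automatically carry instance="<host:port>",
      # e.g. instance="indexed-search-0:6070".
      - targets: ["indexed-search-0:6070", "indexed-search-1:6070"]
```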

keegancsmith

comment created time in 10 days

fork pecigonzalo/beancount

Official Beancount repository.

fork in 11 days

started beancount/beancount

started time in 11 days

fork pecigonzalo/fava

Fava - web interface for Beancount

https://beancount.github.io/fava/

fork in 11 days

started beancount/fava

started time in 11 days

issue comment sourcegraph/sourcegraph

Distribution: 3.19 Tracking issue

Week July 20

Last week my focus was working with the team to set our team goals. The test of using GitHub projects for tracking progress seems to be working, and I'll continue with this during the rest of the iteration. I have also been talking with Chayim about the secrets loading implementation.

Week July 27

I'll continue to focus on setting our team goals; we have settled on them but are still working out the details. I will also try to finalize RFC-202 and the review of RFC-199.

Team update

Issue https://github.com/sourcegraph/customer/issues/65 has been resolved, but our focus remains on the sub-issues created by it https://github.com/sourcegraph/customer/issues/69 and https://github.com/sourcegraph/customer/issues/70.

pecigonzalo

comment created time in 11 days

Pull request review comment sourcegraph/about

update values

 # Sourcegraph values -Our values are:+These values are some of the beliefs and principles that help us achieve our [goals](goals/index.md) and [vision](strategy.md#vision). -## People+This list isn't intended to cover everything we care about; instead, it lists the values that we frequently find useful and refer to. We'll keep this list up to date with the frequently used beliefs and principles (adding, editing, and removing entries as needed). Our hope is that this makes this list more accurate and useful than if it were a list of stale, vague, aspirational, or obvious values. -Together we are advancing technology for the good of people all around the world. We will attract, hire and retain the best teammates in the world and treat everyone in a first-class manner.+## High quality -## Journey+Every person on our team is individually responsible for knowing what high-quality work looks like and producing high-quality work.

I like this point, but its description seems to be aimed more towards self-management than delivering high quality.

sqs

comment created time in 11 days

pull request comment sourcegraph/sourcegraph

monitoring: implement owner routing for alerts

I'll add something to check on the rendered routes

Exactly, we should not test Alertmanager itself or that it routes properly, only that we render the expected config given X input.

bobheadxi

comment created time in 11 days

pull request comment sourcegraph/sourcegraph

monitoring: implement owner routing for alerts

I would use owners similarly to how we use level, and not onLevel. Could we add some tests to ensure it behaves as expected without doing a full deployment?

bobheadxi

comment created time in 11 days

started hwayne/awesome-cold-showers

started time in 12 days

issue comment sourcegraph/sourcegraph

Bare-metal Buildkite agents capable of running Docker and VMs

Maybe we could use https://github.com/firecracker-microvm/firecracker in a similar way as RFC-199

slimsag

comment created time in 14 days

issue comment sourcegraph/sourcegraph

Proposal: Move monitoring configuration closer to the service code, use TOML

It would be great to define or link the original problem we are trying to fix. I believe there was some talk about it in relation to https://github.com/sourcegraph/about/pull/1221 but I can't find the content to link it directly; maybe it was discussed in a meeting.

The main benefit I see with the Go-based generator is around generating dashboards for Grafana from the same code we use to define rules, plus all the wrapping we had to do because of alert_count. I think ideally services would define their dashboards and alerts automagically as part of their service definition/code, like they define their metrics, although this could end up being even more complex.

The main downsides for me are around readability and how easy it is to grok the output from the code, because of the abstraction; this was made more complicated for me because part of its config is scattered between places (siteconfig, generator, static files, configmap). Maybe this is not a problem of the generator itself, and it's just about making it simpler to onboard to by cleaning old configs, reducing the number of places this is spread across, and documenting what goes where and how.

As an example, without taking Grafana into account, an alert rule without all the wrapping is quite simple to understand and even write:

groups:
- name: replacer
  rules:
  - alert: replacer_frontend_internal_api_error_responses
    expr: sum by(category) (increase(src_frontend_internal_request_duration_seconds_count{job="replacer",code!~"2.."}[5m])) > X
    labels:
      level: warning
      service_name: replacer
    annotations:
      summary: SOME summary
      description: SOME description

Or even "simpler" if we use generic alerts instead, as we will not have multiple copies of the same alert:

groups:
- name: api_error_responses
  rules:
  - alert: frontend_internal_api_error_responses
    expr: sum by(category) (increase(src_frontend_internal_request_duration_seconds_count{code!~"2.."}[5m])) > X
    labels:
      level: warning
      service_name: "{{ $labels.job }}"
    annotations:
      summary: "SOME {{ $labels.foo }} summary"
      description: |
        SOME {{ $labels.bar }} description

Than the following Go definition, where I also need to go to each type and function to understand what it will do:

package main

func Replacer() *Container {
	return &Container{
		Name:        "replacer",
		Title:       "Replacer",
		Description: "Backend for find-and-replace operations.",
		Groups: []Group{
			{
				Title: "General",
				Rows: []Row{
					{
						sharedFrontendInternalAPIErrorResponses("replacer"),
					},
				},
			},
		},
	}
}

A rule with all the wrapping would be really ugly, because we have to repeat the wrapper everywhere:

clamp_max(clamp_min(floor(
      max((((( 
THE_REAL_QUERY
) OR on() vector(0)) >= 0) OR on() vector(1))
      ), 0), 1) OR on() vector(1)

Therefore, I assume the generator is mainly there to help us resolve 3 things:

  • Generating Grafana dashboards because they are cumbersome to write and read
  • Generating wrapping due to alert_count and other current requirements (this could be changed/fixed)
  • Ensuring alerts/rules conform to certain requirements

I strongly feel Dhall would not be a good choice here, that feels like raising the barrier even further for most on the team.

I think this is a valid concern for Dhall right now, but if we use it for Kubernetes then it will be required anyway; otherwise the same concern applies there.

slimsag

comment created time in 14 days

started dflook/terraform-github-actions

started time in 15 days

fork pecigonzalo/semgrep

Lightweight static analysis for many languages. Find bug variants with patterns that look like source code.

https://semgrep.live

fork in 15 days

started returntocorp/semgrep

started time in 15 days

pull request comment sourcegraph/deploy-sourcegraph

search: generous timeout for search pods

@keegancsmith Is it correct to assume then that HTTP and /healthz should start working before it finishes loading/indexing?

keegancsmith

comment created time in 15 days

issue comment sourcegraph/sourcegraph

Distribution: 3.19 Tracking issue

Priorities update

As discussed with CE, https://github.com/sourcegraph/customer/issues/65 is now our top priority.

pecigonzalo

comment created time in 16 days

pull request comment sourcegraph/sourcegraph

monitoring: migrate out-of-band alerts to the generator

I think requiring a dashboard per alert makes sense as well

I agree that we should link to a relevant dashboard, graph, and/or document; my comment is about what the monitoring pillars require as the reason why we are splitting into an alert rule per service.

This is definitely an improvement that could be made! However, I think this is an implementation detail that is not within the scope of this PR to fix (since such an improvement would apply to many alerts we have) - I'm focusing on simply moving the alerts here for now

I understood from "previously multi-service alerts are now defined per-service as per monitoring pillars, since a multi-service alert would require multi-service dashboards" that we were implementing that split into multiple per-service alerts here.

bobheadxi

comment created time in 16 days

pull request comment sourcegraph/sourcegraph

monitoring: migrate out-of-band alerts to the generator

to have alerts span multiple services, we'd need a multi-service dashboard.

Is this a rule set by our current generator?


As I interpreted the pillar, it defines that a graph should have an associated alert, but not that an alert necessarily needs a graph. If we want to create a dashboard per service that shows service alerts, we should not need to split alerts into multiple alert rules; we should be able to use one of the labels to filter each graph.
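To illustrate the label-based alternative (the rule, metric, and label names here are hypothetical, not Sourcegraph's actual generated rules), a single Prometheus alert rule can carry a per-service label that dashboards then filter on:

```yaml
# One rule covering all services: each firing alert inherits the
# service_name label from the underlying metric, so a per-service
# Grafana dashboard can filter with {service_name="$service"}
# instead of requiring one rule per service.
groups:
  - name: example
    rules:
      - alert: HighErrorRate
        expr: rate(http_errors_total[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on {{ $labels.service_name }}"
```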

bobheadxi

comment created time in 16 days

pull request commentsourcegraph/sourcegraph

monitoring: migrate out-of-band alerts to the generator

previously multi-service alerts are now defined per-service as per monitoring pillars

Is this the 2nd pillar? I don't think it applies here; I believe this was also closely related to defining alerts in Grafana and the mapping between alerts and dashboards (cc/ @slimsag). Unless we need different thresholds for each, I don't think we need to do this.

Related: https://www.robustperception.io/undoing-the-benefits-of-labels

bobheadxi

comment created time in 16 days

issue commentsourcegraph/sourcegraph

Distribution: 3.19 Tracking issue

Week July 13

I worked on planning 3.19, which included starting this experiment for tracking project progress. I have also started RFC-202 for standardizing configuration across our services.

Week July 20

My focus this week will be working with the team to set our goals and planning the retrospective for 3.18

pecigonzalo

comment created time in 18 days

push eventsourcegraph/deploy-sourcegraph

Robert Lin

commit sha fd8ae81269ab6a14507e5740c6622adbb0e9206d

prometheus: add consistent alert labels (#786)

view details

Robert Lin

commit sha b101bcb79613e16dc11df729637d84bdad032e77

remove configure alertmanager (#787)

view details

renovate[bot]

commit sha f39dcdda310e1deb4910d226f8bb33a3da53b7f6

Update Sourcegraph Docker images Docker tags to v3.18.0-rc.1 (#788) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha 310ba24683416e969ce460c4259daf8c9de9fe8a

Update index.docker.io/sourcegraph/prometheus Docker tag to v3.18.0-rc.1 (#789) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha 91c8311288d596c31d179f575e3a056d4830b98e

Update Sourcegraph Docker images Docker tags to v3.18.0-rc.2 (#791) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha dca0782800cfd1891e3c20a8dd6df3ea1e3c21de

Update index.docker.io/sourcegraph/prometheus Docker tag to v3.18.0-rc.2 (#792) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha 6f98eefd59638b55812a12dbbc807f8fc046b894

Update Sourcegraph Docker images Docker tags to v3.18.0-rc.3 (#793) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha 8373c2d705bfbb91355100fab53b38e384a4978b

Update Sourcegraph Docker images Docker tags to v3.18.0-rc.6 (#795) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha d4cd3a6ce1aa23ea4ee0df50d1a4e9399cfd3ad2

Update index.docker.io/sourcegraph/prometheus Docker tag to v3.18.0-rc.6 (#794) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha 2d9cc08fe7b05a5c664c588d444671a0f6f9fcd6

Update Sourcegraph Docker images Docker tags to v3.18.0-rc.7 (#796) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha 9b89bad68fadd7b4f793261f35f4445c4a7c16dc

Update index.docker.io/sourcegraph/prometheus Docker tag to v3.18.0-rc.7 (#797) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha 53fcd336c5ebab6be7d20074219674f21eb9f5da

Update Sourcegraph Docker images Docker tags to v3.18.0 (master) (#798) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

renovate[bot]

commit sha f18a52aef66f6c039e26cbef8bd2b1c9eb81d32e

Update index.docker.io/sourcegraph/prometheus Docker tag to v3.18.0 (#799) Co-authored-by: Renovate Bot <bot@renovateapp.com>

view details

Dax McDonald

commit sha 1400bdc11c34a5a28391c5fe40f046d5ecf06e24

Update grafanan to 3.18.0 (#800)

view details

Gonzalo Peci

commit sha 112c1a3bcb5eed1122c5f827c861bb446b9e6178

--wip-- [skip ci]

view details

Gonzalo Peci

commit sha 70950f1ac94055a6f07d4ab6d8a785c0aa517767

Merge branch 'master' into feature/mixin

view details

push time in 18 days

startedk14s/ytt

started time in 18 days

issue commentsourcegraph/sourcegraph

progress indicator of state of indexing

I believe this is related to https://github.com/sourcegraph/sourcegraph/pull/12322

uwedeportivo

comment created time in 18 days

issue closedsourcegraph/sourcegraph

Distribution: 3.18 Tracking issue

Plan

Support new and existing deployments

This is an ongoing expense, and based on the current state of customers we anticipate this taking no more than 10d of work spread across the entire team.

Reduce upgrade overhead

Upgrading Kubernetes deployments requires customers to spend a lot of engineering time converging our released Kubernetes manifests with their fork, as documented in RFC-141.

We have evaluated and eliminated Bash and Cue as solutions to this problem, and have partially evaluated Dhall. We will continue to evaluate Dhall more extensively this iteration by dogfooding it on k8s.sgdev.org.

Reduce support burdens

Investigating and troubleshooting problems with deployments currently requires a lot of overhead to retrieve all necessary information. We will improve documentation around tooling and services to simplify data collection in customer issue reports and streamline observability tooling.

Onboard @pecigonzalo

As the team grows, its management overhead makes it harder for engineers to focus on our objectives and OKRs. We will finalize the initial onboarding and handover of processes to @pecigonzalo.

Improve reliability of sourcegraph.com alerts

Sourcegraph.com alerting does not use the new monitoring stack we ship to customers, and has flaky alerts as a result of that and of site24x7. We will begin using the new monitoring stack on Sourcegraph.com and explore more reliable replacements for site24x7, such as the blackbox exporter. This supports RFC 189
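As a sketch of the blackbox exporter approach mentioned above (the exporter address is hypothetical; the relabeling follows the exporter's standard multi-target pattern):

```yaml
# prometheus.yml fragment: probe sourcegraph.com via the blackbox exporter.
# Relabeling passes each target URL to the exporter's /probe endpoint.
scrape_configs:
  - job_name: blackbox
    metrics_path: /probe
    params:
      module: [http_2xx]   # module defined in the exporter's blackbox.yml
    static_configs:
      - targets:
          - https://sourcegraph.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115  # hypothetical exporter address
```

An alert on the resulting `probe_success` metric would then replace the site24x7 ping check.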

Availability

Period is from June 22nd to July 17th. Please write the days you won't be working and the number of working days for the period.

  • @daxmc99: 1d
  • @ggilmore: 1d
  • @slimsag: 1d
  • @bobheadxi: 1d

Workload

<!-- BEGIN WORK --> <!-- BEGIN ASSIGNEE: bobheadxi --> @bobheadxi: 14.00d

  • [x] monitoring: provisioning alerts not consistently applied #11571 2d 🐛
  • [x] monitoring: update sourcegraph/server to use prom-wrapper #11473 1d
  • [x] monitoring: automatically configure Alertmanager SMTP from site config #11454 3d
  • [x] monitoring: granular alerts notifications with Alertmanager #11452 2d
  • [x] siteAdmin: report-a-bug sometimes doesn't render alerts #11806 0.5d 🐛
  • [x] license check failing in CI: mdi-react #11866 0.5d
  • [x] explore solutions for "impossible to silence monitoring alerts" #11210 2d
  • [ ] ~Dogfood the monitoring we ship with Sourcegraph~ #5370 2d
  • [x] Direct admins to Grafana when critical alerts have fired recently #11517 1d
  • [x] monitoring: reassess flakey/unactionable critical alerts #12011
  • [x] monitoring: alerting followups #12026 2d <!-- END ASSIGNEE -->

<!-- BEGIN ASSIGNEE: davejrt --> @davejrt: 10.00d

  • [x] Improve reliability of Sourcegraph.com site24x7 ping alerts #10742 8d
  • [x] Set up $BIGCUSTOMER replica with TLS and DNS #11875 2d <!-- END ASSIGNEE -->

<!-- BEGIN ASSIGNEE: daxmc99 --> @daxmc99: 3.00d

  • [x] add terraform configuration for sourcegraph.com's infrastructure #10455 2d
  • [ ] Prevent releasing server without latest grafana/prometheus changes #9983 1d
  • [ ] Sourcegraph.com - add redis-store & precise-code-intel-bundle-manager snapshotting #10450 <!-- END ASSIGNEE -->

<!-- BEGIN ASSIGNEE: ggilmore --> @ggilmore: 3.00d

  • [ ] reduce k8s upgrade merge conflicts: RFC 141: Dhall investigation #10936 3d 🕵️
  • [ ] ~Prevent services from running out of ephemeral storage space and being evicted~ #9604 1d 👩
  • [ ] deploy-sourcegraph-dhall: extend configuration layer for subset of services on k8s.sgdev.org #11830
  • [x] deploy-sourcegraph-dhall: bring resources up-to-date with latest deploy-sourcegraph commits #11829
  • [ ] RFC 141: prepare k8s.sgdev.org infrastructure for dhall-based pipeline #12104
  • [ ] ~deploy-sourcegraph-dhall: frontend: implement new configuration logic for k8s.sgdev.org~ #12105
  • [ ] ~deploy-sourcegraph-dhall: grafana: implement new configuration logic for k8s.sgdev.org~ #12109 <!-- END ASSIGNEE -->

<!-- BEGIN ASSIGNEE: pecigonzalo --> @pecigonzalo

  • [x] Run Distribution team syncs from July 6 onward #11817
  • [x] Plan 3.19 work ahead of time #11818
  • [ ] ~Reduce the impact of unplanned work~ #11904
  • [x] sourcegraph/customer #63 👩 <!-- END ASSIGNEE -->

<!-- BEGIN ASSIGNEE: slimsag --> @slimsag: 4.50d

  • [ ] License report for syntect_server & its dependencies #11269 1d 👩
  • [ ] Deploy and release: Replace Renovate with GitHub Actions #10011 0.5d
  • [ ] verify "syntax highlighting sometimes doesn't work" is fixed on customer instance #9557 0.5d 🐛
  • [ ] observability: improve/restructure admin documentation #9773 0.5d
  • [ ] Docker Compose should be released on the 20th #10486 0.5d
  • [x] not possible to upgrade from 3.17.0 ("0.0.0+dev") -> next patch release #11666
  • [x] 3.17.2 patch release #11724
  • [x] 3.17.1 patch release #11642
  • [ ] sourcegraph/customer #53 0.5d 👩
  • [ ] sourcegraph/customer #49 0.5d 🐛👩
  • [ ] sourcegraph/customer #62 0.5d 👩 <!-- END ASSIGNEE -->

<!-- BEGIN ASSIGNEE: uwedeportivo --> @uwedeportivo: 4.50d

  • [x] keep e2e tests in functional state during iteration #11886 2d
  • [x] Release patch v3.17.3 #11881 0.5d
  • [x] ~deploy-sourcegraph-dhall: cadvisor: implement generate reading from config~ #12065 0.5d
  • [x] deploy-sourcegraph-dhall: gitserver: implement generate reading from config #12067 0.5d
  • [x] ~deploy-sourcegraph-dhall: github-proxy: implement generate reading from config~ #12066 0.5d
  • [x] deploy-sourcegraph-dhall: indexed-search: implement generate reading from config #12068 0.5d
  • [ ] ~deploy-sourcegraph-dhall: postgres: implement generate reading from config~ #12070 0.5d
  • [x] ~deploy-sourcegraph-dhall: jaeger: implement generate reading from config~ #12069 0.5d
  • [x] ~deploy-sourcegraph-dhall: query-runner: implement generate reading from config~ #12072 0.5d
  • [ ] ~deploy-sourcegraph-dhall: precise-code-intel: implement generate reading from config~ #12071 0.5d
  • [x] ~deploy-sourcegraph-dhall: searcher: implement generate reading from config~ #12075 0.5d
  • [x] ~deploy-sourcegraph-dhall: symbols: implement generate reading from config~ #12076 0.5d
  • [x] ~deploy-sourcegraph-dhall: replacer: implement generate reading from config~ #12074 0.5d
  • [x] ~deploy-sourcegraph-dhall: repo-updater: implement generate reading from config~ #12073 0.5d
  • [x] deploy-sourcegraph-dhall: prometheus: implement generate reading from config #12063 1d
  • [ ] admin docs: describes experimental validation command in the src-cli #12001 :shipit:
  • [ ] test validates instance #687 :shipit:
  • [ ] validate command #200 :shipit: <!-- END ASSIGNEE --> <!-- END WORK -->

Legend

  • 👩 Customer issue
  • 🐛 Bug
  • 🧶 Technical debt
  • 🛠️ Roadmap
  • 🕵️ Spike
  • 🔒 Security issue
  • :shipit: Pull Request

closed time in 18 days

slimsag

issue commentsourcegraph/sourcegraph

Mark unplanned issues in our Tracking issues

Created a thread to discuss this idea.

pecigonzalo

comment created time in 18 days


issue closedsourcegraph/sourcegraph

Plan 3.19 work ahead of time

Depends on @slimsag completing Distribution product roadmap and giving Gonza enough context/info to do this

closed time in 18 days

slimsag

issue commentsourcegraph/sourcegraph

Plan 3.19 work ahead of time

Done in https://github.com/sourcegraph/sourcegraph/issues/11954

slimsag

comment created time in 18 days

Pull request review commentsourcegraph/about

distribution: add monitoring architecture page

+# Sourcegraph monitoring architecture++**Note:** Looking for _how to monitor Sourcegraph?_ See the [observability documentation](https://docs.sourcegraph.com/admin/observability).++**Note:** Looking for _how to develop Sourcegraph monitoring?_ See the [monitoring developer guide](monitoring.md).++This document describes the architecture of Sourcegraph's monitoring stack, and the technical decisions we have made to date and why.++<!-- generated from monitoring_architecture.excalidraw -->+![architecture diagram](https://storage.googleapis.com/sourcegraph-assets/monitoring-architecture.png)++## Long-term vision++To better understand our goals with Sourcegraph's monitoring stack, please read [monitoring pillars: long-term vision](monitoring_pillars.md#long-term-vision).++## Monitoring generator++We use a custom [declarative Go generator syntax](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/tree/monitoring) for:++- Defining the services we monitor.+- Describing _what those services do_ to site admins.+- Laying out dashboards in a uniform, consistent, and simple way.+- Generating the [Prometheus alerting rules](#alerting) and Grafana dashboards.+- Generating documentation in the form of ["possible solutions"](https://docs.sourcegraph.com/admin/observability/alert_solutions) for site admins to follow when alerts are firing.++This allows us to assert constraints and principles that we want to hold ourselves to, as described in [our monitoring pillars](monitoring_pillars.md).++To learn more about adding monitoring using the generator, see ["How easy is it to add monitoring?"](monitoring.md#how-easy-is-it-to-add-monitoring)++## Sourcegraph deployment++### Sourcegraph Grafana++We use [Grafana](https://grafana.com) for:++- Displaying generated dashboards for our Prometheus metrics and alerts.+- Providing an interface to query Prometheus metrics and Jaeger traces.++The 
[`sourcegraph/grafana`](https://github.com/sourcegraph/sourcegraph/tree/master/docker-images/grafana) image handles shipping Grafana and Sourcegraph monitoring dashboards. It bundles:++* Preconfigured [Grafana](https://grafana.com), which displays data from Prometheus and Jaeger+* Dashboards generated by the [monitoring generator](#monitoring-generator).++#### Admin reverse-proxy++For convenience, Grafana is served on `/-/debug/grafana` on all Sourcegraph deployments via a reverse-proxy restricted to admins.++Services served via reverse-proxy in this manner could be vulnerable to [cross-site request forgery](https://owasp.org/www-community/attacks/csrf), which is complicated to resolve ([#6075](https://github.com/sourcegraph/sourcegraph/issues/6075)). This means that at the moment, making changes to Grafana using the Grafana UI is not possible without setting up a port-forward, something [we want to avoid asking customers to do](monitoring_pillars.md#long-term-vision). In addition, provisioned dashboards generated by the [monitoring generator](#monitoring-generator) cannot be edited at all.

Ok, I got this the other way around. This is just describing that Grafana does not allow provisioned dashboards (the ones you can auto-load by adding them to provisioning/dashboards) to be edited. Is that correct?

slimsag

comment created time in 18 days

startedViRb3/SylphyHornEx

started time in 19 days

push eventsourcegraph/deploy-sourcegraph

Gonzalo Peci

commit sha e1f0845f89a56efc85dd5c9f83912a0d87a09508

--wip-- [skip ci]

view details

push time in 21 days

push eventsourcegraph/deploy-sourcegraph

Gonzalo Peci

commit sha cb8a95a0261d0d8f8685df9df56fe8331ebc6400

--wip-- [skip ci]

view details

push time in 21 days

create barnchsourcegraph/deploy-sourcegraph

branch : feature/mixin

created branch time in 21 days

Pull request review commentsourcegraph/about

distribution: add monitoring architecture page

+# Sourcegraph monitoring architecture++**Note:** Looking for _how to monitor Sourcegraph?_ See the [observability documentation](https://docs.sourcegraph.com/admin/observability).++**Note:** Looking for _how to develop Sourcegraph monitoring?_ See the [monitoring developer guide](monitoring.md).++This document describes the architecture of Sourcegraph's monitoring stack, and the technical decisions we have made to date and why.++<!-- generated from monitoring_architecture.excalidraw -->+![architecture diagram](https://storage.googleapis.com/sourcegraph-assets/monitoring-architecture.png)++## Long-term vision++To better understand our goals with Sourcegraph's monitoring stack, please read [monitoring pillars: long-term vision](monitoring_pillars.md#long-term-vision).++## Monitoring generator++We use a custom [declarative Go generator syntax](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/tree/monitoring) for:++- Defining the services we monitor.+- Describing _what those services do_ to site admins.+- Laying out dashboards in a uniform, consistent, and simple way.+- Generating the [Prometheus alerting rules](#alerting) and Grafana dashboards.+- Generating documentation in the form of ["possible solutions"](https://docs.sourcegraph.com/admin/observability/alert_solutions) for site admins to follow when alerts are firing.++This allows us to assert constraints and principles that we want to hold ourselves to, as described in [our monitoring pillars](monitoring_pillars.md).++To learn more about adding monitoring using the generator, see ["How easy is it to add monitoring?"](monitoring.md#how-easy-is-it-to-add-monitoring)++## Sourcegraph deployment++### Sourcegraph Grafana++We use [Grafana](https://grafana.com) for:++- Displaying generated dashboards for our Prometheus metrics and alerts.+- Providing an interface to query Prometheus metrics and Jaeger traces.++The 
[`sourcegraph/grafana`](https://github.com/sourcegraph/sourcegraph/tree/master/docker-images/grafana) image handles shipping Grafana and Sourcegraph monitoring dashboards. It bundles:++* Preconfigured [Grafana](https://grafana.com), which displays data from Prometheus and Jaeger+* Dashboards generated by the [monitoring generator](#monitoring-generator).++#### Admin reverse-proxy++For convenience, Grafana is served on `/-/debug/grafana` on all Sourcegraph deployments via a reverse-proxy restricted to admins.++Services served via reverse-proxy in this manner could be vulnerable to [cross-site request forgery](https://owasp.org/www-community/attacks/csrf), which is complicated to resolve ([#6075](https://github.com/sourcegraph/sourcegraph/issues/6075)). This means that at the moment, making changes to Grafana using the Grafana UI is not possible without setting up a port-forward, something [we want to avoid asking customers to do](monitoring_pillars.md#long-term-vision). In addition, provisioned dashboards generated by the [monitoring generator](#monitoring-generator) cannot be edited at all.++### Sourcegraph Prometheus++We use [Prometheus](https://prometheus.io) for:++- Collecting high-level, and low-cardinality, metrics from our services.+- Defining Sourcegraph alerts as both:+  - Prometheus recording rules, [`alert_count`](#alert-count-metrics).+  - Prometheus alert rules (which trigger [notifications](#alert-notifications)) based on `alert_count` metrics.++The [`sourcegraph/prometheus`](https://github.com/sourcegraph/sourcegraph/tree/master/docker-images/prometheus) image handles shipping Sourcegraph metrics and alerting. 
It bundles:++* Preconfigured [Prometheus](https://prometheus.io), which consumes metrics from Sourcegraph services.+* Alert and recording rules generated by the [monitoring generator](#monitoring-generator).+* [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/), which handles alerts from Prometheus.+* [prom-wrapper](https://github.com/sourcegraph/sourcegraph/tree/master/docker-images/prometheus/cmd/prom-wrapper), which subscribes to updates in [site configuration](https://docs.sourcegraph.com/admin/config/site_config) and propagates relevant settings to Alertmanager configuration.++#### Alert count metrics++`alert_count` metrics are special Prometheus recording rules that evaluate a single upper or lower bound, as defined in an [Observable](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+file:generator.go+type+Observable+struct+%7B:%5Bdef%5D%7D&patternType=structural) and generated by the [monitoring generator](#monitoring-generator). This metric is always either 0 if the threshold is not exceeded (or data does not exist, if configured), or 1 if the threshold is exceeded. This allows historical alert data to easily be [consumed programmatically](https://docs.sourcegraph.com/admin/observability/alerting_custom_consumption).++Learn more about the `alert_count` metrics in the [metrics guide](https://docs.sourcegraph.com/admin/observability/metrics_guide#alert-count).++*Rationale for `alert_count`*: TODO(@slimsag)++#### Alert notifications++We use [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) for:++- Providing data about currently active Sourcegraph alerts.+- Routing alerts to appropriate receivers and silencing them when desired, [configured using site configuration](#alert-notifications).++Alertmanager is bundled in `sourcegraph/prometheus`, and notifications are configured for Sourcegraph alerts [using site configuration](https://docs.sourcegraph.com/admin/observability/alerting). 
This functionality is provided by the [prom-wrapper](#prom-wrapper).++*Rationale for notifiers in site configuration*: Due to the limitations of [admin reverse-proxies](#admin-reverse-proxy), alerts cannot be configured without port-forwarding or custom ConfigMaps, something we [want to avoid](monitoring_pillars.md#long-term-vision).

As I recall from our conversation, we could have shipped this via ConfigMaps; this is a limitation of the siteconfig <- prom-wrapper implementation we want to do. I think we should document somewhere that we want to ask admins, when onboarding a site through the frontend, to configure a default destination for notifications, since that is what is driving this implementation; otherwise I think this could also be a required variable of our deployments.

slimsag

comment created time in 21 days

Pull request review commentsourcegraph/about

distribution: add monitoring architecture page

# Sourcegraph monitoring architecture

**Note:** Looking for _how to monitor Sourcegraph?_ See the [observability documentation](https://docs.sourcegraph.com/admin/observability).

**Note:** Looking for _how to develop Sourcegraph monitoring?_ See the [monitoring developer guide](monitoring.md).

This document describes the architecture of Sourcegraph's monitoring stack, and the technical decisions we have made to date and why.

<!-- generated from monitoring_architecture.excalidraw -->
![architecture diagram](https://storage.googleapis.com/sourcegraph-assets/monitoring-architecture.png)

## Long-term vision

To better understand our goals with Sourcegraph's monitoring stack, please read [monitoring pillars: long-term vision](monitoring_pillars.md#long-term-vision).

## Monitoring generator

We use a custom [declarative Go generator syntax](https://sourcegraph.com/github.com/sourcegraph/sourcegraph/-/tree/monitoring) for:

- Defining the services we monitor.
- Describing _what those services do_ to site admins.
- Laying out dashboards in a uniform, consistent, and simple way.
- Generating the [Prometheus alerting rules](#alerting) and Grafana dashboards.
- Generating documentation in the form of ["possible solutions"](https://docs.sourcegraph.com/admin/observability/alert_solutions) for site admins to follow when alerts are firing.

This allows us to assert constraints and principles that we want to hold ourselves to, as described in [our monitoring pillars](monitoring_pillars.md).

To learn more about adding monitoring using the generator, see ["How easy is it to add monitoring?"](monitoring.md#how-easy-is-it-to-add-monitoring)

## Sourcegraph deployment

### Sourcegraph Grafana

We use [Grafana](https://grafana.com) for:

- Displaying generated dashboards for our Prometheus metrics and alerts.
- Providing an interface to query Prometheus metrics and Jaeger traces.

The [`sourcegraph/grafana`](https://github.com/sourcegraph/sourcegraph/tree/master/docker-images/grafana) image handles shipping Grafana and Sourcegraph monitoring dashboards. It bundles:

* Preconfigured [Grafana](https://grafana.com), which displays data from Prometheus and Jaeger
* Dashboards generated by the [monitoring generator](#monitoring-generator).

#### Admin reverse-proxy

For convenience, Grafana is served on `/-/debug/grafana` on all Sourcegraph deployments via a reverse-proxy restricted to admins.

Services served via reverse-proxy in this manner could be vulnerable to [cross-site request forgery](https://owasp.org/www-community/attacks/csrf), which is complicated to resolve ([#6075](https://github.com/sourcegraph/sourcegraph/issues/6075)). This means that at the moment, making changes to Grafana using the Grafana UI is not possible without setting up a port-forward, something [we want to avoid asking customers to do](monitoring_pillars.md#long-term-vision). In addition, provisioned dashboards generated by the [monitoring generator](#monitoring-generator) cannot be edited at all.

### Sourcegraph Prometheus

We use [Prometheus](https://prometheus.io) for:

- Collecting high-level, and low-cardinality, metrics from our services.
- Defining Sourcegraph alerts as both:
  - Prometheus recording rules, [`alert_count`](#alert-count-metrics).
  - Prometheus alert rules (which trigger [notifications](#alert-notifications)) based on `alert_count` metrics.

The [`sourcegraph/prometheus`](https://github.com/sourcegraph/sourcegraph/tree/master/docker-images/prometheus) image handles shipping Sourcegraph metrics and alerting. It bundles:

* Preconfigured [Prometheus](https://prometheus.io), which consumes metrics from Sourcegraph services.
* Alert and recording rules generated by the [monitoring generator](#monitoring-generator).
* [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/), which handles alerts from Prometheus.
* [prom-wrapper](https://github.com/sourcegraph/sourcegraph/tree/master/docker-images/prometheus/cmd/prom-wrapper), which subscribes to updates in [site configuration](https://docs.sourcegraph.com/admin/config/site_config) and propagates relevant settings to Alertmanager configuration.

#### Alert count metrics

`alert_count` metrics are special Prometheus recording rules that evaluate a single upper or lower bound, as defined in an [Observable](https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/sourcegraph/sourcegraph%24+file:generator.go+type+Observable+struct+%7B:%5Bdef%5D%7D&patternType=structural) and generated by the [monitoring generator](#monitoring-generator). This metric is always either 0 if the threshold is not exceeded (or data does not exist, if configured), or 1 if the threshold is exceeded. This allows historical alert data to easily be [consumed programmatically](https://docs.sourcegraph.com/admin/observability/alerting_custom_consumption).

Learn more about the `alert_count` metrics in the [metrics guide](https://docs.sourcegraph.com/admin/observability/metrics_guide#alert-count).

*Rationale for `alert_count`*: TODO(@slimsag)

#### Alert notifications

We use [Alertmanager](https://prometheus.io/docs/alerting/latest/alertmanager/) for:

- Providing data about currently active Sourcegraph alerts.
- Routing alerts to appropriate receivers and silencing them when desired, [configured using site configuration](#alert-notifications).

Alertmanager is bundled in `sourcegraph/prometheus`, and notifications are configured for Sourcegraph alerts [using site configuration](https://docs.sourcegraph.com/admin/observability/alerting). This functionality is provided by the [prom-wrapper](#prom-wrapper).

*Rationale for notifiers in site configuration*: Due to the limitations of [admin reverse-proxies](#admin-reverse-proxy), alerts cannot be configured without port-forwarding or custom ConfigMaps, something we [want to avoid](monitoring_pillars.md#long-term-vision).

*Rationale for Alertmanager*: An approach for notifiers using Grafana was considered, but had some issues outlined in [#11832](https://github.com/sourcegraph/sourcegraph/pull/11832), so Alertmanager was selected as our notification provider.

*Rationale for silencing in site configuration*: Similar to the [Grafana admin reverse-proxy](#admin-reverse-proxy), silencing using the Alertmanager UI would require port-forwarding, something we [want to avoid](monitoring_pillars.md#long-term-vision).

#### prom-wrapper

The [prom-wrapper](https://github.com/sourcegraph/sourcegraph/tree/master/docker-images/prometheus/cmd/prom-wrapper) is the entrypoint program for `sourcegraph/prometheus` and it:

* Handles starting up Prometheus and Alertmanager

I think we should add a note on why we chose to use a wrapper in the same container vs. multiple containers. As I recall from the conversation with @slimsag, this was to avoid increasing the container count, but there is no hard requirement, as we could share the file via a read-only (`ro:`) mount and use the reload API of Prometheus and Alertmanager.

slimsag

comment created time in 21 days
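The alternative described in the comment above could look roughly like the following sketch. This is hypothetical, not the shipped configuration: service names, image tags, and paths are illustrative, and Prometheus only exposes its reload endpoint when started with `--web.enable-lifecycle`.

```yaml
# Hypothetical compose sketch: share generated rules via a read-only mount
# instead of bundling everything in one container.
services:
  prometheus:
    image: prom/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--web.enable-lifecycle"   # enables POST /-/reload
    volumes:
      - ./monitoring/generated:/etc/prometheus/rules:ro
  alertmanager:
    image: prom/alertmanager
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro

# After regenerating the rules or config, trigger reloads over HTTP:
#   curl -X POST http://prometheus:9090/-/reload
#   curl -X POST http://alertmanager:9093/-/reload
```

The trade-off is one extra moving part (something must call the reload endpoints) in exchange for keeping each container single-purpose.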

Pull request review comment on sourcegraph/about

distribution: add monitoring architecture page

#### Admin reverse-proxy

For convenience, Grafana is served on `/-/debug/grafana` on all Sourcegraph deployments via a reverse-proxy restricted to admins.

Services served via reverse-proxy in this manner could be vulnerable to [cross-site request forgery](https://owasp.org/www-community/attacks/csrf), which is complicated to resolve ([#6075](https://github.com/sourcegraph/sourcegraph/issues/6075)). This means that at the moment, making changes to Grafana using the Grafana UI is not possible without setting up a port-forward, something [we want to avoid asking customers to do](monitoring_pillars.md#long-term-vision). In addition, provisioned dashboards generated by the [monitoring generator](#monitoring-generator) cannot be edited at all.

> we want to avoid asking customers to do

I don't think that section describes why we want to prevent that. It's clear that we want to ship good defaults and dashboards that are similar to the ones we use internally. While I think it's a good idea to provide the dashboards as a static asset of our product, I don't think preventing customization is necessarily a requirement because of dogfooding.

slimsag

comment created time in 21 days
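The `alert_count` recording rules mentioned in the excerpts above could, as a hedged sketch, look something like the following. The metric, label, and expression here are illustrative, not the exact generated output of the monitoring generator.

```yaml
groups:
  - name: frontend_alerts
    rules:
      # Hypothetical generated rule: 1 when the 90th-percentile search
      # request duration exceeds the 20s threshold, otherwise 0; the
      # trailing "OR on() vector(0)" yields 0 when no data exists.
      - record: alert_count
        labels:
          service_name: frontend
          name: 90th_percentile_search_request_duration
          level: critical
        expr: |
          clamp_max(clamp_min(floor(
            histogram_quantile(0.9, rate(search_duration_seconds_bucket[5m])) / 20
          ), 0), 1) OR on() vector(0)
```

Because the result is always exactly 0 or 1, summing or averaging `alert_count` over time gives a simple "how often was this firing" signal that external consumers can query.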

Pull request comment on sourcegraph/deploy-sourcegraph-docker

WIP: Split docker-compose code into multiple files

If we need a one-liner, we can `curl` the script and pipe it to `sh`.

pecigonzalo

comment created time in 22 days
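The one-liner pattern being proposed boils down to piping a fetched script into `sh`, optionally with arguments. The URL below is a placeholder, not a real install script location:

```shell
# Hypothetical one-liner (example.com is a placeholder URL):
#   curl -fsSL https://example.com/deploy.sh | sh -s -- --version 3.19
#
# The same pattern demonstrated locally: `sh -s` reads the script from
# stdin and passes the remaining arguments as positional parameters.
printf 'echo "deploying version: $1"\n' | sh -s -- 3.19
```

One caveat of this pattern is that a partially downloaded script can execute; `curl -f` plus wrapping the script body in a function mitigates that.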

Issue comment on sourcegraph/sourcegraph

Distribution: 3.19 Tracking issue

cc/ @christinaforney @dadlerj

pecigonzalo

comment created time in 22 days

Pull request review comment on sourcegraph/about

Goals (not OKRs): continuous, fewer in number, less strict format

Planning is a continuous process of negotiation between product and engineering. Planning requires several artifacts for communicating. This section clarifies how each artifact is used in the planning process.

```diff
-### OKRs
+### OKGoals
```

Suggested change: `### Goals`
sqs

comment created time in 22 days

Started pobrn/ite8291r3-ctl

started time in 23 days

Issue comment on sourcegraph/sourcegraph

RFC-189: Support per-team alerts and on-call rotations

Robert and I had a chat about this topic and what to add to our interface and site-config:

TLDR:

  • We want to bundle alerts and metrics relevant to application performance, to ensure admins can monitor and be alerted to problems
  • We want to prompt for (or require) notifications to be set up during site onboarding
  • We want to consume the same alerts and metrics across all of our deployments
  • We will not add the full Alertmanager / Prometheus functionality to our config, only the subset required for simple per-team notifications (no advanced routing, etc.)
  • As site config already provides this configuration, it is not possible, or at least more complex, to set up routing separately while still allowing site admins to set up notifications

Additionally, @bobheadxi will create an RFC to capture the information for this topic.

cc/ @bobheadxi feel free to modify this comment

bobheadxi

comment created time in 23 days
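For reference, the minimal per-level notification setup discussed above might look like this in site configuration. This is a sketch based on the `observability.alerts` setting; the exact fields can differ by Sourcegraph version, and the Slack URL and email address are placeholders:

```json
{
  "observability.alerts": [
    {
      "level": "critical",
      "notifier": {
        "type": "slack",
        "url": "https://hooks.slack.com/services/..."
      }
    },
    {
      "level": "warning",
      "notifier": {
        "type": "email",
        "address": "admins@example.com"
      }
    }
  ]
}
```

This keeps the surface area small: a level and a destination per entry, with no routing trees or templating exposed in site config.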

Push event on sourcegraph/sourcegraph

Gonzalo Peci

commit sha e3fd5bcf76aad326fb44649f4c17caa1189ab098

--wip-- [skip ci]


push time in 23 days

Issue comment on sourcegraph/sourcegraph

Mark unplanned issues in our Tracking issues

Maybe as another idea, we can use GH Projects: I can create a project at the start of each iteration with the default automation (so issues are automatically marked as done) and we can add them there. I can use that for reporting, and it does not interfere with the current tracking issue.

Similarly, it could be a good idea to do the same for projects, which we can link at the top of a tracking issue. That way it's also easy to see the overall project status. I created this one as an example: https://github.com/orgs/sourcegraph/projects/71

pecigonzalo

comment created time in 23 days

Issue comment on sourcegraph/sourcegraph

Mark unplanned issues in our Tracking issues

I think "unplanned" is clearer as an opposite to our planned work, but I don't mind the term.

I tried handling it in the same way, but then a task remains marked in all future iterations as well.

This might be OK if we consider an unplanned issue as unplanned even if we "plan" it later on; otherwise we could close the unplanned issue and plan a new one.

pecigonzalo

comment created time in 23 days

Issue comment on sourcegraph/sourcegraph

RFC-189: Support per-team alerts and on-call rotations

I don't think it gets around it. I don't think that task, or dogfooding in general, requires this configuration to be part of the Sourcegraph frontend interface.

  • I'm not suggesting we have another thing write to it, but rather that we limit how many configs we take from the frontend
  • I don't think it should deviate. The task says "use the exact same ConfigMap", and I agree: we should use the same config

My concern is that we will be building an admin panel for monitoring and alerting as part of our main Sourcegraph product, and creating our own abstractions over well-known config formats. Something like phpMyAdmin, but for monitoring, plus an Opsgenie, all in one.

Dogfooding is more about us consuming our own alerts/config/metrics/tools as a customer would. It does not require that everything be configured through the Sourcegraph interface. If we deploy using Kustomize and https://github.com/sourcegraph/deploy-sourcegraph, which might include our alerting and monitoring config, using our own docs, etc., we are dogfooding.

bobheadxi

comment created time in 23 days

Issue comment on sourcegraph/sourcegraph

RFC-189: Support per-team alerts and on-call rotations

I think having a simple "notification email", "notification endpoint", or even "message template" customization in the interface, and notifying that by default, is great, but I'm not so sure about supporting any advanced config/routing. That can be mounted directly into Prometheus / Alertmanager, and they already have a config structure for alerts/routing. I think we could be creating complexity by abstracting that into our own JSON format configured through the Sourcegraph frontend interface.

bobheadxi

comment created time in 23 days
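For contrast, the advanced routing that Alertmanager already supports natively looks like the following. This is a standard Alertmanager config sketch, with illustrative team names, labels, and placeholder destinations:

```yaml
route:
  receiver: default-email          # fallback for anything unmatched
  routes:
    - match:
        service_name: gitserver    # route one team's alerts to their channel
      receiver: repo-team-slack

receivers:
  - name: default-email
    email_configs:
      - to: admins@example.com     # requires global SMTP settings, omitted here
  - name: repo-team-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/...
        channel: '#repo-alerts'
```

Mounting a file like this directly avoids re-modeling the `route`/`receivers` tree in a Sourcegraph-specific JSON schema.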

Issue comment on sourcegraph/sourcegraph

RFC-189: Support per-team alerts and on-call rotations

I don't think the problem is that this is cloud-only, or that it contains a frontend for the config file (although I would prefer to avoid creating a config abstraction), but rather how much of the implementation logic and routing functionality becomes part of the frontend service and extends its domain/scope. As long as we keep this minimal, it should not be a problem.

bobheadxi

comment created time in 24 days
