bobheadxi/gobenchdata 60

📉 Run Go benchmarks, publish results to an interactive web app, and check for performance regressions in your pull requests

bobheadxi/deployments 55

❗️GitHub Action for working painlessly with deployment statuses

bobheadxi/facebook-spotify-chatbot 11

:notes: a Facebook Messenger bot for managing your party playlist with guests

bobheadxi/calories 5

:poultry_leg: a Facebook Messenger bot in Golang for all your calorie-tracking needs

bobheadxi/go 4

cute vanity imports for my Go stuff - https://go.bobheadxi.dev

bobheadxi/btt 3

📸 bob's BetterTouchTool configurations

bobheadxi/ctl 3

🐒 Package ctl enables drop-in gRPC client integration for your service into command-line applications

bobheadxi/labelist 2

😶 simple serverless function to attach labels to Todoist items when I can't afford premium

bobheadxi/res 2

📫 Ergonomic primitives for working with JSON in RESTful Go servers and clients

bobheadxi/borrow-me 1

:department_store: a goodwill-based marketplace for small, everyday items (NWhacks 2018)

issue comment sourcegraph/sourcegraph

Sourcegraph.com update checks alert fires too frequently

updated title because we didn't fix update checks being slow (or if they were slow at all) 🙃

slimsag

comment created time in a day

Pull request review comment sourcegraph/sourcegraph

Create a updatecheck_client_total time metric

 func Frontend() *Container {
 							PossibleSolutions: "none",
 						},
 						{
-							Name:              "90th_percentile_updatecheck_requests",
-							Description:       "90th percentile successful update-check requests (sourcegraph.com only)",
-							Query:             `histogram_quantile(0.9, sum by (method,le) (rate(src_updatecheck_client_duration_seconds_bucket[5m])))`,
+							Name:              "total_time_to_perform_update_checks",
+							Description:       "amount of time to perform update check",
+							Query:             `sum(src_updatecheck_client_duration_seconds_sum)`,

hm, we might need to specify the quantile still:

sum (rate(src_updatecheck_client_duration_seconds_bucket{le="1"}[5m]))

image

daxmc99

comment created time in 2 days

Pull request review comment sourcegraph/sourcegraph

Create a updatecheck_client_total time metric

 func Frontend() *Container {
 							PossibleSolutions: "none",
 						},
 						{
-							Name:              "90th_percentile_updatecheck_requests",
-							Description:       "90th percentile successful update-check requests (sourcegraph.com only)",
-							Query:             `histogram_quantile(0.9, sum by (method,le) (rate(src_updatecheck_client_duration_seconds_bucket[5m])))`,
+							Name:              "total_time_to_perform_update_checks",
+							Description:       "amount of time to perform update check",
+							Query:             `sum(src_updatecheck_client_duration_seconds_sum)`,

If the old query was correct, I think we can just use the previous one, but remove the quantile and by:

sum(rate(src_updatecheck_client_duration_seconds_bucket[5m]))

https://sourcegraph.com/-/debug/grafana/explore?orgId=1&left=%5B%22now-3h%22,%22now%22,%22Prometheus%22,%7B%22expr%22:%22sum(src_updatecheck_client_duration_seconds_bucket)sum%20by%20(method,le)%20(rate(src_updatecheck_client_duration_seconds_bucket%5B5m%5D)))%22%7D,%7B%22ui%22:%5Btrue,true,true,%22none%22%5D%7D%5D

daxmc99

comment created time in 2 days

Pull request review comment sourcegraph/sourcegraph

Create a updatecheck_client_total time metric

 func Frontend() *Container {
 							PossibleSolutions: "none",
 						},
 						{
-							Name:              "90th_percentile_updatecheck_requests",
-							Description:       "90th percentile successful update-check requests (sourcegraph.com only)",
-							Query:             `histogram_quantile(0.9, sum by (method,le) (rate(src_updatecheck_client_duration_seconds_bucket[5m])))`,
+							Name:              "total_time_to_perform_update_checks",
+							Description:       "amount of time to perform update check",
+							Query:             `sum(src_updatecheck_client_duration_seconds_sum)`,

hm 🤔 this looks like an accumulative total:

image

maybe we need a rate here?

sum(rate(src_updatecheck_client_duration_seconds_sum[1m]))
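
To illustrate why the raw `_sum` series only ever climbs while a rate stays readable: a histogram's `_sum` is a counter, so the useful signal is how fast it grows. A minimal Go sketch, assuming made-up scrape values; `simpleRate` is a hypothetical helper that, unlike PromQL's `rate()`, does not handle counter resets or extrapolate to the window edges.

```go
package main

import "fmt"

// sample is one scrape of a cumulative counter.
type sample struct {
	timestamp float64 // seconds since some reference point
	value     float64 // cumulative counter value at that time
}

// simpleRate returns the per-second increase between the first and last
// sample in the window. The boolean is false when no rate can be computed.
func simpleRate(window []sample) (float64, bool) {
	if len(window) < 2 {
		return 0, false
	}
	first, last := window[0], window[len(window)-1]
	elapsed := last.timestamp - first.timestamp
	if elapsed <= 0 {
		return 0, false
	}
	return (last.value - first.value) / elapsed, true
}

func main() {
	// Made-up scrapes: the counter keeps growing, so plotting it directly
	// gives an ever-increasing line.
	window := []sample{{0, 120}, {15, 123}, {30, 126}, {45, 129}, {60, 132}}
	if r, ok := simpleRate(window); ok {
		fmt.Println("seconds spent on update checks per second:", r)
	}
}
```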
daxmc99

comment created time in 2 days

Pull request review comment sourcegraph/sourcegraph

Create a updatecheck_client_total time metric

 func Frontend() *Container {
 							PossibleSolutions: "none",
 						},
 						{
-							Name:              "90th_percentile_updatecheck_requests",
-							Description:       "90th percentile successful update-check requests (sourcegraph.com only)",
-							Query:             `histogram_quantile(0.9, sum by (method,le) (rate(src_updatecheck_client_duration_seconds_bucket[5m])))`,
+							Name:              "total_time_to_perform_update_checks",
+							Description:       "amount of time to perform update check",
							Name:              "update_check_duration",
							Description:       "update check duration",

(a bit more in line with our other wording)

daxmc99

comment created time in 2 days

Pull request review comment sourcegraph/about

handbook eng: clarify use of docker-images-patch-notest

 Snapshots of all Kubernetes resources are taken periodically and pushed to https If you need to build Docker images on Buildkite for testing purposes, e.g. you have a PR with a fix and want to deploy that fix to a test instance, you can push the branch to the special `docker-images-patch` and-`docker-images-patch-notest` branches.+`docker-images-patch-notest` branches. You shouldn't need to resolve merge conflicts, instead you can simply force-push.  Example: you want to build a new Docker image for `frontend` and `gitserver` based on the branch `my_fix`.  ```-git push origin my_fix:docker-images-patch-notest/frontend-git push origin my_fix:docker-images-patch-notest/gitserver-git push origin my_fix:docker-images-patch-notest/$(Docker_image_to_build)+git push -f origin my_fix:docker-images-patch-notest/frontend+git push - origin my_fix:docker-images-patch-notest/gitserver

is this supposed to be a -f?

uwedeportivo

comment created time in 2 days

pull request comment sourcegraph/sourcegraph

docker images: copy redis instead of apk add

is the Comby change supposed to be added here?

uwedeportivo

comment created time in 2 days

delete branch sourcegraph/sourcegraph

delete branch : monitoring/gitserver-errors

delete time in 2 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha b973f000839da45c4d6cb0a381e846b26b688bbf

monitoring: sync gitserver_error_responses for code-intel with frontend (#12981) Frontend's gitserver_error_responses was converted to a ratio, we apply the same change to the precise-code-intel versions of this panel. Also assigns Search as the owner of gitserver-related alerts.

push time in 2 days

PR merged sourcegraph/sourcegraph

monitoring: sync gitserver_error_responses for code-intel with frontend

Frontend's gitserver_error_responses was converted to a ratio, this PR applies the same change to the precise-code-intel versions of this panel

<!-- Reminder: Have you updated the changelog and relevant docs (user docs, architecture diagram, etc) ? -->

+16 -10

1 comment

3 changed files

bobheadxi

pr closed time in 2 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha a077973672e527c18bcf7b0ecfc187e460300fc3

let search own gitserver alerts

push time in 2 days

PR opened sourcegraph/sourcegraph

monitoring: sync gitserver_error_responses for code-intel with frontend

Frontend's gitserver_error_responses was converted to a ratio, this PR applies the same change to the precise-code-intel versions of this panel

<!-- Reminder: Have you updated the changelog and relevant docs (user docs, architecture diagram, etc) ? -->

+15 -9

0 comment

3 changed files

pr created time in 2 days

create branch sourcegraph/sourcegraph

branch : monitoring/gitserver-errors

created branch time in 2 days

started plankanban/planka

started time in 2 days

pull request comment ubclaunchpad/docs

Updated Design recruitment goals

forgot to merge this before the newsletter QQ :'(

sandyklc

comment created time in 2 days

push event ubclaunchpad/docs

Sandy Co

commit sha 795b754f4d06e755d6289ab8637efd642f3cf3a0

Updated Design recruitment goals (#147) Co-authored-by: Robert Lin <robert@bobheadxi.dev>

push time in 2 days

delete branch ubclaunchpad/docs

delete branch : sandyklc-patch-1

delete time in 2 days

PR merged ubclaunchpad/docs

Updated Design recruitment goals

Let me know what you guys think, still giving this a little bit of thought so might make a few changes.

Related Tickets

<!-- List relevant tickets, e.g. 'Closes #<some_ticket_number>', or just write 'n/a' -->

Closes #

Changes

<!-- Briefly describe the changes you made -->

Checklist

+14 -0

1 comment

1 changed file

sandyklc

pr closed time in 2 days

push event ubclaunchpad/docs

Robert Lin

commit sha 83df5c727863d02bf2deb6d160fc410e98b45af6

adjust layout of links to applications

push time in 3 days

push event ubclaunchpad/docs

Robert Lin

commit sha 8569c93e03f406a3a2dcac19500c138d651d5977

wording

push time in 3 days

push event ubclaunchpad/ubclaunchpad.com

Robert Lin

commit sha e987504a9f269bec244f851efaa9b45ba4a5d1fe

join: add links to roles, enable applications (#195)

push time in 3 days

delete branch ubclaunchpad/ubclaunchpad.com

delete branch : applications-update

delete time in 3 days

PR merged ubclaunchpad/ubclaunchpad.com

join: add links to roles

Closes #193 - currently still disabled, but when we open applications it'll look like this:

image

We won't be recruiting for Strategy this fall

+13 -35

3 comments

4 changed files

bobheadxi

pr closed time in 3 days

issue closed ubclaunchpad/ubclaunchpad.com

set up recruitment section

The recruitment section currently makes assumptions about the way we will direct people to apply that are no longer valid - we're currently going towards using handbook pages:

  • https://github.com/ubclaunchpad/docs/pull/152
  • https://github.com/ubclaunchpad/docs/pull/151

Should update our recruitment section to leverage these pages when they land. Should not enable the section publicly yet.

closed time in 3 days

bobheadxi

pull request comment ubclaunchpad/ubclaunchpad.com

join: add links to roles

good catch @andrewzulaybar ! enabled in c1009f2dfc4ae4a6f6f31d4ac424f4416538d1fd

bobheadxi

comment created time in 3 days

push event ubclaunchpad/ubclaunchpad.com

Robert Lin

commit sha c1009f2dfc4ae4a6f6f31d4ac424f4416538d1fd

enable applications

push time in 3 days

pull request comment ubclaunchpad/docs

add tips linking to google recruitment forms

merged because this is probably time sensitive to update haha

andrewzulaybar

comment created time in 3 days

push event ubclaunchpad/docs

Andrew

commit sha 7256233e2472884b3a08fb5131eadf380256b958

add tips linking to google recruitment forms (#158)

push time in 3 days

delete branch ubclaunchpad/docs

delete branch : fix/recruitment-status

delete time in 3 days

PR merged ubclaunchpad/docs

add tips linking to google recruitment forms

Related Tickets

n/a

Changes

Previously, these pages had a warning saying that we are not accepting applications. However, with the upcoming recruitment campaign, we want to say that we are now accepting applications and we want to link users directly to the application form.

Checklist

+8 -4

1 comment

2 changed files

andrewzulaybar

pr closed time in 3 days

issue comment sourcegraph/sourcegraph

frontend: blob load latency

the old alert for reference: https://github.com/sourcegraph/deploy-sourcegraph-dot-com/blob/release/base/prometheus/prometheus.ConfigMap.yaml#L478-L486

bobheadxi

comment created time in 3 days

issue opened sourcegraph/sourcegraph

frontend: blob load latency

blob_load_latency seems to spike frequently, which sets off a critical-level alert.

panel

image

The big one in the middle is some downtime we had recently. This alert was migrated from our custom dot-com alerts - the variant there no longer works due to referencing a nonexistent record, so this alert is pretty much new. Some questions:

  • Is this expected behaviour?
  • Is this actionable? Should it be a warning instead? Does this panel even need to exist?

The threshold for this was already recently relaxed in https://github.com/sourcegraph/sourcegraph/pull/12936

Assigning to @sourcegraph/search since the alert is currently tagged as owned by search (let me know if this isn't right!)

created time in 3 days

issue comment sourcegraph/sourcegraph

cadvisor: investigate collecting IO metrics

From sourcegraph.com perspective we can make it a bit easier to fire since we have more context to debug it.

🤔 this is a separate topic, but since we don't have support for custom thresholds (for example, generating different thresholds for dot-com) we should either:

  • scale metrics off some measure of deployment type (for example, ratios based on total requests, or in this case my earlier idea about having a ratio based on gitserver repos)
  • have two observables, one tailored for dot-com/really large instances and one tailored for "normal instances", and mute the "normal instances" alert
bobheadxi

comment created time in 3 days

issue comment sourcegraph/sourcegraph

cadvisor: investigate collecting IO metrics

What's recently?

looks like it started ~8/11:

image

An alert for this is tricky. I would suspect sustained high IO is a sign of under provisioning / runaway process. Sustained IO would also potentially be useful for an admin to know about.

how does sustained minimum per gitserver > x for y time sound? query:

min by(name)(sum by(name) (rate(container_fs_reads_total{name=~".*gitserver.*",name!~".*(_POD_|_jaeger-agent_).*"}[5m]) + rate(container_fs_writes_total{name=~".*gitserver.*",name!~".*(_POD_|_jaeger-agent_).*"}[5m]))) > 2000
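
The "> x for y time" idea can be sketched in Go as a check that every one of the most recent evaluations exceeded the threshold, so a single spike does not fire the alert. This is only an illustration: `sustainedAbove`, the instance names, and the IO numbers are hypothetical, and it stands in for the alerting "for" duration rather than reproducing how any alerting system implements it.

```go
package main

import "fmt"

// sustainedAbove reports whether each of the last `need` evaluations
// exceeded the threshold. A single dip below the threshold resets it.
func sustainedAbove(series []float64, threshold float64, need int) bool {
	if need <= 0 || len(series) < need {
		return false
	}
	for _, v := range series[len(series)-need:] {
		if v <= threshold {
			return false
		}
	}
	return true
}

func main() {
	// Made-up reads+writes per second for each gitserver, one value per
	// evaluation interval.
	ioRates := map[string][]float64{
		"gitserver-0": {2500, 2600, 2700, 2800},
		"gitserver-1": {2500, 1900, 2700, 2800},
	}
	for _, name := range []string{"gitserver-0", "gitserver-1"} {
		fmt.Println(name, "sustained high IO:", sustainedAbove(ioRates[name], 2000, 4))
	}
}
```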
bobheadxi

comment created time in 3 days

Pull request review comment sourcegraph/sourcegraph

license: upgrade license_finder, regenerate report

 - - :restrict   - unknown   - &1-    :who:+    :who: 

this is generated by the license_finder CLI :(((((

bobheadxi

comment created time in 3 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha 9f71b052d364eae582e87f9f10c9da85747235e7

license: upgrade license_finder, regenerate report (#12935)

push time in 3 days

delete branch sourcegraph/sourcegraph

delete branch : chore/license_finder-upgrade

delete time in 3 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha 7161fc88317a359851f20f54276ee444eb789b92

monitoring: relax blob_load_latency, github_core_rate_limit_remaining (#12936) * frontend: increase blob_load_latency threshold to 5s * github_proxy: decrease github_core_rate_limit_remaining threshold to 500 for 5m

push time in 3 days

delete branch sourcegraph/sourcegraph

delete branch : critical-alerts-tweaks

delete time in 3 days

PR merged sourcegraph/sourcegraph

monitoring: relax blob_load_latency, github_core_rate_limit_remaining
  • increase blob_load_latency threshold - the previous threshold has been causing it to fire very frequently lately. i think it makes sense that this threshold be more relaxed than page_load_latency. that said, this might indicate a real problem, since it seems to have a tendency to spike occasionally while being pretty low the rest of the time. Dashboard
  • decrease github_core_rate_limit_remaining threshold and duration - the last time this fired was when it dipped to 999 for an instant. the new limit seems like a generous limit still, and if rate limit does not recover for a period of time, then it indicates a real problem. Dashboard

<!-- Reminder: Have you updated the changelog and relevant docs (user docs, architecture diagram, etc) ? -->

+6 -4

1 comment

3 changed files

bobheadxi

pr closed time in 3 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha 1feffdaaf754e9c95aa04f6f79302a56d87351f0

set github.com/sourcegraph/gosaml2 license

push time in 3 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha 493c4a1f59f57e40ffb4fca59293af2c7f70a10d

regenerate

push time in 3 days

PR opened sourcegraph/sourcegraph

monitoring: relax blob_load_latency, github_core_rate_limit_remaining
  • increase blob_load_latency threshold - the previous threshold has been causing it to fire very frequently lately. i think it makes sense that this threshold be more relaxed than page_load_latency. that said, this might indicate a real problem, since it seems to have a tendency to spike occasionally while being pretty low the rest of the time. Dashboard
  • decrease github_core_rate_limit_remaining threshold and duration - the last time this fired was when it dipped to 999 for an instant. the new limit seems like a generous limit still, and if rate limit does not recover for a period of time, then it indicates a real problem. Dashboard

<!-- Reminder: Have you updated the changelog and relevant docs (user docs, architecture diagram, etc) ? -->

+5 -3

0 comment

3 changed files

pr created time in 3 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha c73b0e554b620e30caff5803421697817b0628c6

monitoring(github_proxy): decrease github_core_rate_limit_remaining threshold to 500 1000 seems like a wide berth - 500 should be sufficient, given the threshold for the similar github_search_rate_limit_remaining is so low

push time in 3 days

create branch sourcegraph/sourcegraph

branch : critical-alerts-tweaks

created branch time in 3 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha 48c6d46f20f0ca9a6e302e62b336326c80fba837

bump pinned versions

push time in 3 days

delete branch sourcegraph/sourcegraph

delete branch : chore/licenses-update

delete time in 3 days

PR closed sourcegraph/sourcegraph

chore: update third-party licenses

This is an automated pull request generated by this run.

+10 -14

2 comments

1 changed file

github-actions[bot]

pr closed time in 3 days

pull request comment sourcegraph/sourcegraph

chore: update third-party licenses

superseded by https://github.com/sourcegraph/sourcegraph/pull/12935

github-actions[bot]

comment created time in 3 days

PR opened sourcegraph/sourcegraph

license: upgrade license_finder, regenerate report

Upgrades to a new version of license_finder. This removes a lot of dependencies from the licenses report, due to these changes:

  • Change Go modules to only report imported packages (as with other Go package managers)
  • Detect Go modules based on go.mod (instead of go.sum)

<!-- Reminder: Have you updated the changelog and relevant docs (user docs, architecture diagram, etc) ? -->

+21 -130

0 comment

3 changed files

pr created time in 3 days

create branch sourcegraph/sourcegraph

branch : chore/license_finder-upgrade

created branch time in 3 days

Pull request review comment sourcegraph/sourcegraph

admin docs: describes experimental validation command in the src-cli

+# Sourcegraph Instance Validation
+
+>NOTE: **Sourcegraph Instance Validation is currently experimental.** We're exploring this feature set.
+>Let us know what you think! [File an issue](https://github.com/sourcegraph/sourcegraph)
+>with feedback/problems/questions, or [contact us directly](https://about.sourcegraph.com/contact).
+
+Instance validation provides a quick way to check that a Sourcegraph instance functions properly after a fresh install
+ or an update.
+
+The [`src` CLI](https://github.com/sourcegraph/src-cli) has an experimental command `validate` which drives the
+ validation from a user-provided configuration file with a validation specification (in JSON or YAML format).
+
+### Validation specification
+
+The best way to describe this initial, simple and experimental validation specification is with the example below
+(in YAML format to allow for comments):
+
+```yaml
+# creates the first admin user on a fresh install (skips creation if user exists)
+firstAdmin:
+    email: foo@example.com
+    username: foo
+    password: "{{ .admin_password }}"
+
+# adds the specified code host
+externalService:
+  config:
+    url: https://github.com
+    token: "{{ .github_token }}"
+    orgs: []
+    repos:
+      - sourcegraph-testing/zap
+  kind: GITHUB
+  displayName: footest
+  # set to true if this code host config should be deleted at the end of validation
+  deleteWhenDone: true
+
+# checks maxTries if specified repo is cloned and waits sleepBetweenTriesSeconds between checks
+waitRepoCloned:
+  repo: github.com/footest/foo
+  maxTries: 5
+  sleepBetweenTriesSeconds: 2
+
+# performs the specified search and checks that at least one result is returned
+searchQuery: repo:^github.com/footest/foo$ uniquelyFoo
+```
+
+The validation command executes the following steps:
+
+* create the first admin user
+* add an external service
+* wait for a repository to be cloned
+* perform a search
+
+Every step is optional (if the corresponding top-level key is not present then the step is skipped).
+
+### Passing in secrets
+
+It is often the case that the config file with the validation specification needs to declare passwords, tokens or other
+secrets and these secrets should not be exposed or committed to a git repo.
+
+The validation specification can refer to string values that come from a context specified outside the config file
+(see the `Usage` section below). References to string values from this outside context are specified like so:
+`{{ .some_key }}`. The context will have a string value defined under the key `some_key` and the validation execution will
+use that.
+
+### Usage
+
+Use the [`src` CLI](https://github.com/sourcegraph/src-cli) to validate:
+
+```shell script
+src validate -context github_token=$GITHUB_TOKEN validate.yaml
+```
+
+The `src` binary finds the Sourcegraph instance to validate from the environment variables
+[SRC_ENDPOINT and SRC_ACCESS_TOKEN](https://github.com/sourcegraph/src-cli#setup-with-your-sourcegraph-instance).
[`SRC_ENDPOINT` and `SRC_ACCESS_TOKEN`](https://github.com/sourcegraph/src-cli#setup-with-your-sourcegraph-instance). 
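
The `{{ .some_key }}` references described in the docs under review look like Go `text/template` syntax. The sketch below shows how such a specification could be rendered against an outside context. Whether `src validate` actually uses `text/template` is an assumption, and `renderSpec` and the token value are hypothetical names for illustration only.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// renderSpec substitutes {{ .key }} references in spec with values from
// context. A reference to a key missing from context is reported as an
// error instead of silently rendering an empty string.
func renderSpec(spec string, context map[string]string) (string, error) {
	tmpl, err := template.New("spec").Option("missingkey=error").Parse(spec)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, context); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	// Fragment of the example specification from the docs.
	spec := `externalService:
  config:
    url: https://github.com
    token: "{{ .github_token }}"`

	// In practice the value would come from something like
	// -context github_token=$GITHUB_TOKEN; here it is a placeholder.
	rendered, err := renderSpec(spec, map[string]string{"github_token": "example-token"})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(rendered)
}
```

Keeping the secret in the context rather than in the file means the specification itself can be committed without exposing tokens.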
uwedeportivo

comment created time in 3 days

Pull request review comment sourcegraph/sourcegraph

admin docs: describes experimental validation command in the src-cli

+# Sourcegraph Instance Validation
+
+>NOTE: **Sourcegraph Instance Validation is currently experimental.** We're exploring this feature set.
+>Let us know what you think! [File an issue](https://github.com/sourcegraph/sourcegraph)
+>with feedback/problems/questions, or [contact us directly](https://about.sourcegraph.com/contact).
+
+Instance validation provides a quick way to check that a Sourcegraph instance functions properly after a fresh install
+ or an update.
+
+The [`src` CLI](https://github.com/sourcegraph/src-cli) has an experimental command `validate` which drives the
+ validation from a user-provided configuration file with a validation specification (in JSON or YAML format).
+
+### Validation specification
+
+The best way to describe this initial, simple and experimental validation specification is with the example below
+(in YAML format to allow for comments):
The best way to describe this initial, simple and experimental validation specification is with the example below:

(the choice of yaml for comments seems like an implementation detail?)

uwedeportivo

comment created time in 3 days

Pull request review comment sourcegraph/sourcegraph

admin docs: describes experimental validation command in the src-cli

+# Sourcegraph Instance Validation
+
+>NOTE: **Sourcegraph Instance Validation is currently experimental.** We're exploring this feature set.
+>Let us know what you think! [File an issue](https://github.com/sourcegraph/sourcegraph)
+>with feedback/problems/questions, or [contact us directly](https://about.sourcegraph.com/contact).
+
+Instance validation provides a quick way to check that a Sourcegraph instance functions properly after a fresh install
+ or an update.
+
+The [`src` CLI](https://github.com/sourcegraph/src-cli) has an experimental command `validate` which drives the
+ validation from a user-provided configuration file with a validation specification (in JSON or YAML format).
+
+### Validation specification
+
+The best way to describe this initial, simple and experimental validation specification is with the example below
+(in YAML format to allow for comments):
+
+```yaml
+# creates the first admin user on a fresh install (skips creation if user exists)
+firstAdmin:
+    email: foo@example.com
+    username: foo
+    password: "{{ .admin_password }}"
+
+# adds the specified code host
+externalService:
+  config:
+    url: https://github.com
+    token: "{{ .github_token }}"
+    orgs: []
+    repos:
+      - sourcegraph-testing/zap
+  kind: GITHUB
+  displayName: footest
+  # set to true if this code host config should be deleted at the end of validation
+  deleteWhenDone: true
+
+# checks maxTries if specified repo is cloned and waits sleepBetweenTriesSeconds between checks
+waitRepoCloned:
+  repo: github.com/footest/foo
+  maxTries: 5
+  sleepBetweenTriesSeconds: 2
+
+# performs the specified search and checks that at least one result is returned
+searchQuery: repo:^github.com/footest/foo$ uniquelyFoo
+```
+
+The validation command executes the following steps:
With this configuration, the validation command executes the following steps: 
uwedeportivo

comment created time in 3 days

Pull request review comment sourcegraph/sourcegraph

admin docs: describes experimental validation command in the src-cli

+# Sourcegraph Instance Validation
+
+>NOTE: **Sourcegraph Instance Validation is currently experimental.** We're exploring this feature set.
+>Let us know what you think! [File an issue](https://github.com/sourcegraph/sourcegraph)
>Let us know what you think! [File an issue](https://github.com/sourcegraph/sourcegraph/issues/new/choose)
uwedeportivo

comment created time in 3 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha dbdd965125f5e394b86943070d80083da3efd2a5

monitoring: add DataMayBeNaN to ratio alerts, fix wording (#12869) Add DataMayBeNaN to ratio alerts, since the denominator of these queries (typically some form of "total requests") can be 0. For example, index_queue_growth_rate has been firing frequently due to `sum(increase(src_index_queue_processor_total[30m])) == 0`. This issue happens most frequently with *_queue_growth_rate alerts it seems. Also remove some redundant "for ..." from metric descriptions, since this is appended by the generator already.

push time in 4 days

delete branch sourcegraph/sourcegraph

delete branch : monitoring/wording

delete time in 4 days

PR merged sourcegraph/sourcegraph

monitoring: add DataMayBeNaN to ratio alerts, fix wording

Add DataMayBeNaN to ratio alerts, since the denominator of these queries (typically some form of "total requests") can be 0. For example, index_queue_growth_rate has been firing frequently due to sum(increase(src_index_queue_processor_total[30m])) == 0. This issue happens most frequently with *_queue_growth_rate alerts it seems

frequency comparison for index_queue_growth_rate:

image

Also removes some redundant "for ..." from metric descriptions, since this is appended by the generator already.

closes https://github.com/sourcegraph/sourcegraph/issues/12868

follows up https://github.com/sourcegraph/sourcegraph/pull/12756

<!-- Reminder: Have you updated the changelog and relevant docs (user docs, architecture diagram, etc) ? -->

+50 -27

2 comments

6 changed files

bobheadxi

pr closed time in 4 days

issue closed sourcegraph/sourcegraph

monitoring: index_queue_growth_rate firing without hitting threshold

query

image

Orange line indicates the alert firing when data does not exist, despite DataMayNotExist being enabled on this alert

closed time in 4 days

bobheadxi

Pull request review comment sourcegraph/sourcegraph

Create a updatecheck_client_total time metric

 func updateURL(ctx context.Context) string {
 	return baseURL.String()
 }
 
-func updateBody(ctx context.Context) (io.Reader, error) {
+func updateBody(ctx context.Context) (_ io.Reader, err error) {
+	defer recordOperation("total_request_time")(&err)

I think the dashboard in monitoring will need to be updated
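
The `defer recordOperation(...)(&err)` line in the diff relies on a common Go idiom: the outer call runs immediately and starts a timer, while the function it returns is deferred and sees the final value of the named `err` result. The sketch below is a hypothetical stand-in, not Sourcegraph's implementation; it appends to an in-memory slice where the real code presumably records a Prometheus metric.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// observation is what this sketch records for each finished operation.
type observation struct {
	name    string
	failed  bool
	elapsed time.Duration
}

// observations stands in for a metrics backend.
var observations []observation

// recordOperation starts timing when called and returns a function that
// records the duration and outcome once the surrounding function returns.
func recordOperation(name string) func(*error) {
	start := time.Now()
	return func(err *error) {
		observations = append(observations, observation{
			name:    name,
			failed:  err != nil && *err != nil,
			elapsed: time.Since(start),
		})
	}
}

// updateBody uses a named error result so the deferred function can
// inspect the error that is actually returned.
func updateBody(fail bool) (_ string, err error) {
	defer recordOperation("total_request_time")(&err)
	if fail {
		return "", errors.New("simulated failure")
	}
	return "request body", nil
}

func main() {
	updateBody(false)
	updateBody(true)
	for _, o := range observations {
		fmt.Printf("%s failed=%t elapsed=%s\n", o.name, o.failed, o.elapsed)
	}
}
```

This also explains the signature change in the diff: without the named `err` result there would be nothing for the deferred function to point at.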

daxmc99

comment created time in 4 days

Pull request review comment sourcegraph/sourcegraph

Create a updatecheck_client_total time metric

 func updateURL(ctx context.Context) string {
 	return baseURL.String()
 }
 
-func updateBody(ctx context.Context) (io.Reader, error) {
+func updateBody(ctx context.Context) (_ io.Reader, err error) {
+	defer recordOperation("total_request_time")(&err)

I think sum can be used here if we're looking for total!

daxmc99

comment created time in 4 days

issue opened sourcegraph/sourcegraph

monitoring for host machines

We want to add a single dashboard, something like "host machines", which displays all host-node metrics. In k8s, this would be all nodes in the cluster. In Docker Compose, this would just describe the single host machine. This model matches up with the idea that dashboards describe where to look for problems ("if there are issues on the host machine dashboard, then look at the host machines")

We currently only use cAdvisor for exporting some machine metrics at a per-service level, though it does provide some per-node metrics as well. node_exporter is an alternative/augmenting service that we can consider, in case this data is insufficient. Some context from @slimsag :

We did in fact used to make use of node_exporter before. We removed it for two reasons: (1) it was not deployed or consistent with what we had in other non-k8s deployments (no node exporter) and we were trying to unify metrics across deployments as much as possible at the time. (2) at the time we did not have a good separate deployment model for customers demanding the "no privileged cluster access" mode. These two points combined meant it was simpler to remove at the time with plans to re-add it in the future. IIRC Uber did complain/question why we were removing it and suggested they would like to have it back in the future. Now seems like a good time to consider doing that.

created time in 4 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha ac8be9118321fc14677172ee333e50fb5942f572

monitoring: ratios for {un}indexed_search_request_errors, frontend_internal_api_error_responses (#12882) Convert more alerts to ratios, in an effort to provide more meaningful feedback on larger instances (most notably Sourcegraph Cloud). See #12756 for related rationale.

push time in 4 days

delete branch sourcegraph/sourcegraph

delete branch : monitoring/ratios

delete time in 4 days

PR merged sourcegraph/sourcegraph

monitoring: ratios for {un}indexed_search_request_errors, frontend_internal_api_error_responses

Convert more alerts to ratios, in an effort to provide more meaningful feedback on larger instances (most notably Sourcegraph Cloud). See https://github.com/sourcegraph/sourcegraph/pull/12756 for related rationale, but the tl;dr is to make these metrics easier to interpret in terms of "how much of total traffic is being affected"

This closes #12865 - there are some other noisy alerts that might be caused by a misconfiguration, addressed in https://github.com/sourcegraph/sourcegraph/pull/12869

Breakdown of current frequencies of the following alerts - these alerts are all pretty frequently firing

  • indexed_search_request_errors - fired ~271 times in the past 7 days. comparison
  • unindexed_search_request_errors - fired ~70 times in the past 7 days comparison
  • frontend_internal_api_error_responses - fired ~194 times in the past 7 days, though this frequency is related to an incident earlier this week where the frontend internal API was accidentally made inaccessible. comparison

+25 -19

0 comment

4 changed files

bobheadxi

pr closed time in 4 days

issue closed sourcegraph/sourcegraph

monitoring: change relevant hard threshold alerts to ratio-based alerts

https://github.com/sourcegraph/sourcegraph/issues/12158 (PR: https://github.com/sourcegraph/sourcegraph/pull/12756) changes many of sourcegraph-frontend's hard-threshold alerts to ratio-based alerts. The rationale is roughly:

In general, some of the noisiest alerts in #alerts-cloud are those with hard thresholds, i.e. "Y+ errors in X minutes" - on larger instances like Sourcegraph Cloud, this could mean we fire alerts on issues that only affect a very small number of users.

We should take a look at the remaining noisy alerts on #alerts-cloud and see which ones we can improve for large Sourcegraph deployments by converting them to ratio-based alerts.
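The difference between the two alert styles described above can be illustrated with a minimal Go sketch. The function names, thresholds, and traffic numbers are hypothetical and chosen only for illustration; they are not taken from the Sourcegraph monitoring definitions.

```go
package main

import "fmt"

// errorRatio returns the fraction of requests that failed.
// It returns 0 when there was no traffic, so an idle instance never alerts
// (a choice made for this sketch only).
func errorRatio(errors, total float64) float64 {
	if total <= 0 {
		return 0
	}
	return errors / total
}

// hardThresholdFires mimics a "Y+ errors in X minutes" alert.
func hardThresholdFires(errors, maxErrors float64) bool {
	return errors >= maxErrors
}

// ratioFires mimics a ratio-based alert: it only fires when the share of
// failing requests reaches maxRatio.
func ratioFires(errors, total, maxRatio float64) bool {
	return errorRatio(errors, total) >= maxRatio
}

func main() {
	// Hypothetical traffic: the same 20 errors on a small and a large instance.
	const errs = 20.0
	for _, total := range []float64{100, 100000} {
		fmt.Printf("total=%.0f hard=%v ratio=%v\n",
			total,
			hardThresholdFires(errs, 20),
			ratioFires(errs, total, 0.05))
	}
}
```

With these made-up numbers, the hard threshold fires on both instances, while the ratio-based check fires only on the small one, where 20 errors represent a meaningful share of traffic.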

closed time in 4 days

bobheadxi

issue comment bobheadxi/deployments

Should my description input display in the deployments view?

Hm - this parameter is provided to GitHub API calls correctly; how it gets displayed seems to be up to GitHub: https://sourcegraph.com/search?q=repo:%5Egithub%5C.com/bobheadxi/deployments%24+file:%5Esrc/main%5C.ts+description&patternType=literal

johndesp

comment created time in 4 days

push event bobheadxi/deployments

Robert Lin

commit sha 35e08f1e9bd67d3735c5c037a97e44e3ef7e7838

add more examples

push time in 4 days

issue comment sourcegraph/sourcegraph

Globbing: don't search fuzzy repos when @commit is present

thought: what if I want to search across all v2 branches in a set of repositories that use this convention?

rvantonder

comment created time in 4 days

issue comment sourcegraph/sourcegraph

cadvisor: investigate collecting IO metrics

question: how do we set an alert on this? one idea is to normalize the total reads/writes by the number of repos cloned, so that the value scales with deployment size, a la this query:

(sum by(name)(rate(container_fs_reads_total{name=~".*gitserver.*"}[5m])) + sum by(name)(rate(container_fs_writes_total{name=~".*gitserver.*"}[5m]))) / ignoring(name) group_left sum(src_gitserver_repo_cloned)

i don't really know what this number means yet, or if it is a useful measure
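To get a feel for what the query above computes, here is a small Go sketch of the same arithmetic: read rate plus write rate, divided by the number of cloned repos. The function name and all numbers are hypothetical, not measurements from any real instance.

```go
package main

import "fmt"

// ioPerClonedRepo mirrors the shape of the query: (read rate + write rate)
// divided by the number of cloned repositories.
func ioPerClonedRepo(readsPerSec, writesPerSec, clonedRepos float64) float64 {
	if clonedRepos <= 0 {
		// Sketch-only choice: report 0 instead of dividing by zero.
		return 0
	}
	return (readsPerSec + writesPerSec) / clonedRepos
}

func main() {
	// Hypothetical deployments of very different sizes.
	small := ioPerClonedRepo(40, 10, 100)       // 50 ops/s over 100 repos
	large := ioPerClonedRepo(4000, 1000, 10000) // 5000 ops/s over 10,000 repos
	fmt.Printf("small=%.2f large=%.2f\n", small, large)
}
```

Under these assumed numbers both deployments produce the same per-repo value, which is the property that would let one threshold apply across deployment sizes. Whether IO really grows in proportion to cloned repos is exactly the open question in this comment.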

cc @keegancsmith and @slimsag

bobheadxi

comment created time in 4 days

issue comment sourcegraph/sourcegraph

cadvisor: investigate collecting IO metrics

post-sync with @daxmc99: container_fs_io_* is per-node, which is (seemingly) why it is only available for cadvisor pods. This is not super useful for us, since we want per-service metrics

a useful approximation might be reads_total and writes_total

image

we could set up something like this for:

  • [ ] gitserver
  • [ ] zoekt
bobheadxi

comment created time in 4 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha 95c4ed226606a41fa204b8222f2442d9ea6cbe6f

doc: use main in UI elements (#12883)

push time in 5 days

delete branch sourcegraph/sourcegraph

delete branch : docsite/main

delete time in 5 days

PR merged sourcegraph/sourcegraph

doc: use 'main' branch in UI elements

It seems docsite is now correctly serving content from the main branch: https://docs.sourcegraph.com/admin/observability/alert_solutions

However, UI elements still show master - this PR updates them

+4 -4

0 comment

1 changed file

bobheadxi

pr closed time in 5 days

issue comment sourcegraph/sourcegraph

RFC-189: follow up with distribution, cloud, code-intel, search to set up opsgenie rotations

@pecigonzalo adding this to 3.20 because I don't think it will be resolved in 3.19

bobheadxi

comment created time in 5 days

issue comment sourcegraph/sourcegraph

remove custom alertmanager from cloud

@pecigonzalo moving this to 3.20 because I don't think https://github.com/sourcegraph/sourcegraph/issues/12899 will be resolved in this iteration

bobheadxi

comment created time in 5 days

issue opened sourcegraph/sourcegraph

RFC-189: follow up with distribution, cloud, code-intel, search to set up opsgenie rotations

This is a tracking issue for following up with relevant teams to ensure that each has rotations configured before we move forward with removing the old alerting stack entirely (https://github.com/sourcegraph/sourcegraph/issues/12160). For more context, see RFC 189: on-call rotation changes.

created time in 5 days

PR opened sourcegraph/sourcegraph

doc: use main in UI elements

It seems docsite is now correctly serving content from the main branch: https://docs.sourcegraph.com/admin/observability/alert_solutions

However, UI elements still show master - this PR updates them

+4 -4

0 comment

1 changed file

pr created time in 5 days

create branch sourcegraph/sourcegraph

branch : docsite/main

created branch time in 5 days

PR opened sourcegraph/sourcegraph

monitoring: ratios for {un}indexed_search_request_errors, frontend_internal_api_error_responses

Convert more alerts to ratios, in an effort to provide more meaningful feedback on larger instances (most notably Sourcegraph Cloud). See https://github.com/sourcegraph/sourcegraph/pull/12756 for related rationale, but the tl;dr is to make these metrics easier to interpret in terms of "how much of total traffic is being affected"

This closes #12865 - there are some other noisy alerts that might be caused by a misconfiguration, addressed in https://github.com/sourcegraph/sourcegraph/pull/12869

Breakdown of how often the following alerts currently fire - all of them fire quite frequently:

  • indexed_search_request_errors - fired ~271 times in the past 7 days (comparison)
  • unindexed_search_request_errors - fired ~70 times in the past 7 days (comparison)
  • frontend_internal_api_error_responses - fired ~194 times in the past 7 days, though this frequency is related to an incident earlier this week where the frontend internal API was accidentally made inaccessible (comparison)

+25 -19

0 comment

4 changed files

pr created time in 5 days

push event sourcegraph/sourcegraph

Thorsten Ball

commit sha 268e68fa5a3a174ae30e2e2cdf5027e9aed54b3f

Include namespace in Campaign/CampaignSpec URLs (#12873) * Include namespace in Campaign/CampaignSpec URLs * Remove redundant return statement

Eric Fritz

commit sha 489110b359e08bbfb53adc6bfe04bf56829c203f

codeintel: Additional worker memory improvements (#12108)

Eric Fritz

commit sha 910fab14fe85c5d0f66dcb0081c0488c8c686236

codeintel: Collapse worker handler and processor (#12795)

Eric Fritz

commit sha 01a0b2afdafa933ec058d7621d4b6d23f2daf17e

codeintel: Group code intel data for serialization on-demand (#12125)

Robert Lin

commit sha 784e96eff68b7de48737d93b739abadab87f34ed

frontend: disallow non-admins from using TriggerObservabilityTestAlert (#12876)

Thorsten Ball

commit sha f62d6b24d8c991369cd226e59b0f7225d5913ca7

Implement CreateCampaign mutation and add CampaignSpec.AppliesToCampaign field (#12872) * Implement CampaignSpec.AppliesToCampaign property * Implement CreateCampaign mutation

Erik Seliger

commit sha c5b034b63da2ae91b3c0020cfd75c6854ceb59ff

Use stable clock for nextSyncAt to fix flaky test (#12877)

Robert Lin

commit sha a27600222f656f68439aafdf1f3334848f6deb21

dev: upgrade to docsite@v1.5.0, update docsite dev guide (#12874)

renovate[bot]

commit sha 945017cadccc315e88f03cc48ed26a520ee80da1

Update prom/prometheus Docker tag to v2.20.1 (#12854) Co-authored-by: Renovate Bot <bot@renovateapp.com>

Robert Lin

commit sha a8018a2b8b3bcb6f274ec609cbfc41e78964c244

Merge branch 'main' of github.com:sourcegraph/sourcegraph into monitoring/ratios

push time in 5 days

delete branch sourcegraph/sourcegraph

delete branch : renovate/docker-prom-prometheus-2.x

delete time in 5 days

push event sourcegraph/sourcegraph

renovate[bot]

commit sha 945017cadccc315e88f03cc48ed26a520ee80da1

Update prom/prometheus Docker tag to v2.20.1 (#12854) Co-authored-by: Renovate Bot <bot@renovateapp.com>

push time in 5 days

PR merged sourcegraph/sourcegraph

Update prom/prometheus Docker tag to v2.20.1 bot

This PR contains the following updates:

Package          Type   Update  New value  Sourcegraph
prom/prometheus  stage  minor   v2.20.1    code search for "prom/prometheus"

Renovate configuration

:date: Schedule: "on the 1st through 7th day of the month" in timezone America/Los_Angeles.

:vertical_traffic_light: Automerge: Disabled by config. Please merge this manually once you are satisfied.

:recycle: Rebasing: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

:no_bell: Ignore: Close this PR and you won't be reminded about this update again.


  • [ ] If you want to rebase/retry this PR, check this box

This PR has been generated by WhiteSource Renovate. View repository job log here.

+1 -1

2 comments

1 changed file

renovate[bot]

pr closed time in 5 days

pull request comment sourcegraph/sourcegraph

Update prom/prometheus Docker tag to v2.20.1

took a look at the changelog and it seems like a lot of fixes and improvements: https://github.com/prometheus/prometheus/blob/master/CHANGELOG.md

renovate[bot]

comment created time in 5 days

Pull request review comment sourcegraph/sourcegraph

repo-updater: Add /healthz endpoint

 func (s *Server) Handler() http.Handler {
 	return mux
 }
+
+func (s *Server) handleHealthz(w http.ResponseWriter, r *http.Request) {
+	w.WriteHeader(200)
+	_, err := w.Write([]byte("ok"))
+	if err != nil {
+		log15.Info("Error checking /healthz: " + err.Error())

I would advocate for switching it over to structured logging, but maybe keep it as a warning

pecigonzalo

comment created time in 5 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha a27600222f656f68439aafdf1f3334848f6deb21

dev: upgrade to docsite@v1.5.0, update docsite dev guide (#12874)

push time in 5 days

delete branch sourcegraph/sourcegraph

delete branch : bump-docsite

delete time in 5 days

PR merged sourcegraph/sourcegraph

dev: upgrade to docsite@v1.5.0, update docsite dev guide

see https://github.com/sourcegraph/docsite/pull/52

+11 -12

1 comment

2 changed files

bobheadxi

pr closed time in 5 days

delete branch sourcegraph/sourcegraph

delete branch : monitoring/disallow-non-admins-from-alerts

delete time in 5 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha 784e96eff68b7de48737d93b739abadab87f34ed

frontend: disallow non-admins from using TriggerObservabilityTestAlert (#12876)

push time in 5 days

PR merged sourcegraph/sourcegraph

frontend: disallow non-admins from using TriggerObservabilityTestAlert

I suspect that somebody set off https://sourcegraph.app.opsgenie.com/alert/detail/7921f57e-fae0-4f21-99bc-fa3060409339-1597066049210/details, since I just realized I did not set a guard on this endpoint (https://github.com/sourcegraph/sourcegraph/pull/12532)

Looking into logs now to see if I can confirm if the last person to use this endpoint is in our org

+6 -0

2 comments

1 changed file

bobheadxi

pr closed time in 5 days

pull request comment sourcegraph/sourcegraph

frontend: disallow non-admins from using TriggerObservabilityTestAlert

Just for vis:

Looking into logs now to see if I can confirm if the last person to use this endpoint is in our org

it doesn't seem like our logs provide this information

bobheadxi

comment created time in 5 days

PR opened sourcegraph/sourcegraph

frontend: disallow non-admins from using TriggerObservabilityTestAlert

I suspect that somebody set off https://sourcegraph.app.opsgenie.com/alert/detail/7921f57e-fae0-4f21-99bc-fa3060409339-1597066049210/details, since I just realized I did not set a guard on this endpoint (https://github.com/sourcegraph/sourcegraph/pull/12532)

Looking into logs now to see if I can confirm if the last person to use this endpoint is in our org

+6 -0

0 comment

1 changed file

pr created time in 5 days

create branch sourcegraph/sourcegraph

branch : monitoring/ratios

created branch time in 5 days

push event sourcegraph/sourcegraph

Robert Lin

commit sha dd0224309b758cb3a9bfe990990e9fde2706f227

add DataMayBeNaN for upload_queue_growth_rate

push time in 5 days

issue comment sourcegraph/sourcegraph

Sourcegraph.com update checks slow

@daxmc99 seems like you added this one, but I'm not entirely sure how it works - do you mind taking a look and seeing if these really are too slow, or if the threshold is too low?

slimsag

comment created time in 5 days

issue comment sourcegraph/sourcegraph

Sourcegraph.com update checks slow

This is one of our most frequent alerts, according to this query:

image

slimsag

comment created time in 5 days

pull request comment sourcegraph/docsite

add support for a custom default branch

Followups:

  • https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/3163
  • https://github.com/sourcegraph/sourcegraph/pull/12874
bobheadxi

comment created time in 5 days
