Newest 'sre' Questions

0 votes

0 answers

29 views

Using Metric Labels as Dynamic Limits in Grafana Expressions

I have an application where I want to export metrics with expected limits. The goal is to automate the creation of alerts and set triggered limits. Ideally, I’d like to add an additional tag with the ...

Sergei Kurenchuk

1

asked Nov 9 at 14:51

-1 votes

1 answer

29 views

finding errors in docker-compose file task? [closed]

I have a task of finding the errors in this docker-compose.yaml file. I have included my answers; could someone tell me if they're correct and whether I've found everything? The first is the yaml file and the ...

Krist Manor

19

asked Aug 24 at 19:49

0 votes

0 answers

21 views

Calculating Availability % in prometheus, ignoring downtime of X mins

We are polling our services with an exporter to check whether they are available. The output of those probes is 1/0. Now I have a requirement to calculate the Availability % which ignores any ...

Prajul Nambiar

1

asked Jul 26 at 19:16

0 votes

0 answers

28 views

Deploy a custom prometheus exporter on a application lost host

I'm trying to deploy a custom app exporter on our Linux server as an endpoint and then register it in Prometheus, but I'm unable to bring up a service on localhost: https://github.com/devon-mar/tacacs-exporter can ...

tarasi_214

11

asked Jul 11 at 21:37

0 votes

0 answers

49 views

Unable to rename page name in NeoLoad while recording

I have recorded a script in NeoLoad. By default, NeoLoad names the page after the first request name. The same requests appear in other transactions too, so this creates confusion while ...

aron

1

asked Jun 3 at 9:01

0 votes

1 answer

59 views

I am using Active Directory as Security Realm and Role based Authorization for my jenkins

When I run the script to fetch all users on Jenkins, it gives me both active and inactive users. But it also gives me users who aren't in my Jenkins Manage Jenkins -> Manage Roles -> Assign ...

qwerty

1

asked May 26 at 10:44

-2 votes

1 answer

116 views

Understand the thinking behind "slow error is even worse than a fast error" [closed]

While reading about the four SRE golden signals (under the Latency section) at https://sre.google/sre-book/monitoring-distributed-systems/ I am specifically unable to understand the line below: On the other hand, ...

symbaa

11

asked Oct 5, 2023 at 12:21

0 votes

1 answer

181 views

Azure Alerts for an Application Gateway Availability SLI

I am attempting to implement an Azure Alert that triggers when our Availability SLI drops below a threshold, say 99.9%. For context, our Availability SLI is calculated as 100 - (the number of requests ...

devguydavid

4,119

asked Oct 2, 2023 at 20:56

0 votes

2 answers

163 views

Azure Chaos Studio with Chaos Mesh VNET Injection in Private Clusters Unsuccessful

We are beginning to evaluate Azure Chaos Studio usage with Chaos Mesh k8s experiments (AKS Chaos Mesh Pod Chaos for example). Our clusters are private and we've enabled VNET injection when setting ...

tbarkley29

1

asked Sep 7, 2023 at 19:45

0 votes

1 answer

142 views

docker unable to delete default network

When I start the docker-compose file all containers are working fine. Docker File: services: db: container_name: postgresql environment: POSTGRES_DB: sonar POSTGRES_PASSWORD: ...

Mayur Dagdi

31

asked Jul 4, 2023 at 6:28

0 votes

1 answer

303 views

how we set name of docker network in docker-compose

I wrote a docker-compose file that creates multiple containers. When I use the docker-compose up command it works fine, but when I bring it down again it gives **error while removing network** Stopping sonarqube ... error ...

Mayur Dagdi

31

asked Jun 28, 2023 at 10:05

0 votes

1 answer

389 views

PromQL queries for SLI (Service Level Indicator) indicators using Prometheus/Grafana and blackbox exporter

I want to achieve the specified SLIs (Service Level Indicators) for our HTTP endpoints, using blackbox exporter for probing, such as the following indicators: 80% availability; latency less than 1s. For latency ...

sal

33

asked Jun 6, 2023 at 18:20

-1 votes

1 answer

541 views

Harbor registry proxy cache vs replication [closed]

I'm new to Harbor registry. I was asked to propose an architecture for harbor in my company. I proposed at first to use an architecture based on proxy cache. But the CISO refused to use proxy cache ...

mastertopg

1

asked Jun 1, 2023 at 4:32

0 votes

1 answer

101 views

Application Monitoring using sql and shell script

We are using shell scripts and SQL queries to monitor our application. We are planning to migrate to the cloud and use Prometheus and OpenSearch for monitoring. Is there a way to execute Oracle SQL queries (...

user3069309

39

asked Feb 28, 2023 at 15:25

0 votes

1 answer

226 views

Should a not-found or empty response always be 404?

I have an endpoint for a REST API that checks for the existence of a (or a list of) requests. It can return 200 OK if there is an order in progress or 404 NOT FOUND if there are no current orders ...

Plinio Fabrycio

106

asked Sep 9, 2022 at 4:06

1 vote

1 answer

643 views

Prometheus: How do I write a PromQL query to detect a drastic increase or decrease of some X% in my graph that persists for 10m, so I can raise an alert

I am trying to use rate() query like comparing last 10 min with the previous 50 min like: (sum by() rate(cmd_get{}[10m]) / (sum by() rate(cmd_get{}[50m] offset 10m)) If I want to check the percentage ...

samantha

11

asked Jul 14, 2022 at 19:40

0 votes

1 answer

6k views

Alertmanager: how to send alerts only in weekdays?

I tried to add this to my alertmanager.yml in root level, but I got this error: yaml: unmarshall errors: field time_intervals not found in type config.plain time_intervals: - times: weekdays: ['...

TestAutomator

289

asked Jun 2, 2022 at 14:29

1 vote

1 answer

66 views

RBAC for Infrastructure Engineer

I feel this is a rather basic question, but somehow I'm unable to find a good answer. Recently auditors are complaining about the Role Based Access Control for our cloud set-up. My team is responsible ...

Herman

857

asked Jun 1, 2022 at 13:09

0 votes

1 answer

138 views

How to set SLO for operations that are dependent on file size?

I have an endpoint POST /upload that uploads file into my storage. The response time is dependent on the file size (the bigger file, the longer it takes to respond with 200). How should I set a ...

NyamNyam

320

asked May 25, 2022 at 7:25

0 votes

1 answer

1k views

How can I OOM kill a pod manually in Kubernetes

I'm trying to manually OOM Kill pods for testing purposes, does anyone know how I can achieve this?

GreatBear

39

asked May 23, 2022 at 14:23

2 votes

1 answer

141 views

conditions to check if Aerospike cluster is being idle

Assuming Aerospike is running, I need some conditions with which to check whether the Aerospike cluster is idle and not being used at all. I tried checking log files, but they also log the heartbeat, so ...

Sujay_ks

328

asked Apr 29, 2022 at 5:15

0 votes

1 answer

110 views

Puppet3 | read values from different yaml file

So I'm using puppet3 and I have X.yaml and Y.yaml. X.yaml has profiles::resolv_conf::nameservers: [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] in it. I want to add that [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] as a ...

Codemypath

3

asked Apr 4, 2022 at 16:32

1 vote

1 answer

609 views

Can Services in GCP's Monitoring monitor endpoints?

I installed managed Anthos on a GKE cluster. Anthos Service Mesh is working and is displaying my API. Thanks to that Services that are in Monitoring automatically detect my API. This is great as it ...

Marcin Kulik

963

asked Mar 21, 2022 at 18:13

0 votes

1 answer

1k views

Flink 1.14.3 - [issue] failed to bind to /0.0.0.0:6123

We are using version 1.14.3 of Flink, and when we try to run the Job Manager we get the exception below. I tried entering akka.remote.netty.tcp.hostname = "127.0.0.1" in flink-conf.yml file ...

Vinayraj007

1

asked Feb 2, 2022 at 10:49

1 vote

1 answer

70 views

Can TTFB be affected after page load?

In case of server side rendering, we know that TTFB is the time it takes between the start of the request and the start of the response. My question is can the TTFB be affected if the page visually ...

user14199036

asked Jan 13, 2022 at 19:08

0 votes

0 answers

62 views

What and where is this class 'UniversalScalabilityLawForecast' in Micrometer library?

I'm reading 'SRE with Java Microservices' (O'Reilly): "USL forecasting is a form of “derived” Meter in Micrometer and can be enabled as shown in Example 4-39." Example 4-39. Universal ...

BY-J

9

asked Jan 9, 2022 at 14:46

2 votes

1 answer

900 views

how do I measure error budget consumption for rolling windows?

I have an SLO for one application where 95% of service response times must be less than 450ms over a rolling 24 hour window. I sample once every 60 seconds. Typically my "current service level" ...

Miked

21

asked Dec 9, 2021 at 13:31
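
The numbers quoted in the excerpt above (95% target, 450 ms threshold, one sample every 60 seconds, rolling 24-hour window) are enough for a rough sketch of how budget consumption could be computed. The function name `budget_consumed` and the choice to treat a sample of exactly 450 ms as a miss are assumptions made for illustration, not something the question specifies.

```python
SLO_TARGET = 0.95         # from the excerpt: 95% of responses must be fast
THRESHOLD_MS = 450        # from the excerpt: "less than 450ms"
WINDOW_SAMPLES = 24 * 60  # one sample per 60 s over 24 h = 1440 samples


def budget_consumed(samples_ms, target=SLO_TARGET,
                    threshold_ms=THRESHOLD_MS, window=WINDOW_SAMPLES):
    """Fraction of the error budget used by the most recent `window` samples.

    Hypothetical helper: the budget is the number of samples allowed to miss
    the threshold in a full window (about 72 with the defaults).
    """
    recent = list(samples_ms)[-window:]       # keep only the rolling window
    allowed_bad = (1 - target) * window       # size of the error budget
    bad = sum(1 for s in recent if s >= threshold_ms)
    return bad / allowed_bad


# 36 slow samples out of 1440 use roughly half of the 72-sample budget.
history = [100] * 1404 + [500] * 36
print(round(budget_consumed(history), 3))
```

Because the window slides, a slow sample stops counting against the budget once it is more than 1440 samples old, so the consumed fraction can go down as well as up.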

0 votes

2 answers

1k views

What happens if container exceeds cpu requested but under limit on kubernetes?

In kubernetes we can set limits and requests for cpu. If the container exceeds the limit, from my understanding it will be throttled. However if the container exceeds the requested but is still ...

user2962698

145

asked Oct 2, 2021 at 11:23

1 vote

1 answer

5k views

give access to service principal which is in another azure tenant

We deploy resources in our Azure tenant through Jenkins, which uses Terraform to provision infra resources, and we use a service principal for authentication and infra provisioning which are in same ...

chitender kumar

454

asked Sep 24, 2021 at 11:49

0 votes

1 answer

719 views

How to add multiple AWS ClientVPN Routes using Terraform

I have an AWS ClientVPN which was created manually from the AWS console, and it has more than 20 route table entries. Now I want to manage this with Terraform so we can add any new route using Terraform. I have imported ...

Techhack New

13

asked Sep 14, 2021 at 8:48

0 votes

1 answer

1k views

Prometheus inhibit alert selectively

I need to create an alerting system that has to notify when a particular condition (e.g. Tomcat goes down) is met. Multiple remote servers deployed in different locations (with different time zones) ...

gpprimo

1

asked Sep 9, 2021 at 13:46

-3 votes

1 answer

130 views

How google runs production systems - what's really the "50% time for project work" for SRE?

Quote: "SREs at 50% of their time. Their remaining time should be spent using their coding skills on project work." (page 7) I'm reading this book, and really can't understand. What is "...

user227685

1

asked Jul 16, 2021 at 20:22

6 votes

1 answer

7k views

PromQL query to calculate service uptime & downtime from a fixed date

I'm trying to build a basic SRE dashboard in order to learn Prometheus/Grafana. I want to calculate the number of hours the service has been running & the number of hours it's been down since the ...

user9492428

633

asked Jun 12, 2021 at 18:31

0 votes

1 answer

463 views

Anchore Container scanning in Jenkins CI Pipeline

I need help with my Jenkinsfile CI file. Code in Jenkinsfile looks like this: pipeline { environment { registry = "user/demo1" registryCredential = 'dockerhub' dockerImage = '' ...

Rukender

57

asked May 5, 2021 at 10:44

0 votes

2 answers

3k views

Prometheus rules - check file count inside a directory of an app container

I'm looking to write a prometheus rule to constantly check for message queue length(exim mail relay) which is the total number of files in a directory in an app's container and alert a slack channel ...

Avi

1,623

asked Apr 9, 2021 at 12:17

0 votes

1 answer

294 views

prometheus alert expression for 99% availability of rest API

I would like to create an alert in Prometheus for a REST API that fires if the API is not available 99% of the time. I am new to Prometheus expressions. Could you please help me to create an expression to ...

user3777385

31

asked Mar 22, 2021 at 4:23

0 votes

2 answers

802 views

Eliminate specific value from Jmx exporter through config Yaml

Here is the current Jmx exporter pattern: pattern: 'metrics<name=resilience4jCircuitbreakerState.name.(.*).state.(.*), type=gauges><>Value' name: 'x.y.z.resilience4j.circuitbreaker.state' ...

Md. Hasan Basri

157

asked Jan 8, 2021 at 3:30

0 votes

0 answers

138 views

Is the error budget in GCP UI supposed to rise above 100%?

I have just started using SLO's in GCP and my first SLI seems to be working, but, the "error budget" field is way above 100%. All the examples I have seen online sit at 100%, whereas mine ...

Cameron

44

asked Dec 29, 2020 at 21:11

1 vote

1 answer

117 views

How to avoid "Positive Feedback Cycle Overload Problem"?

Sometimes while designing reliable systems, we try to make the system more reliable by adding retries in the event of failure (with feedback mechanisms). This creates the potential for an overload ...

Stalin Rijal

21

asked Dec 15, 2020 at 17:02

3 votes

1 answer

2k views

manage dataproc cluster access using service account and IAM roles

I am a beginner in cloud and would like to limit my Dataproc cluster's access to given GCS buckets in my project. Let's say I have created a service account named 'data-proc-service-account@my-...

vikrant rana

4,654

asked Jul 29, 2020 at 1:33

0 votes

1 answer

138 views

Is the maintenance window burning error budget

Is the maintenance window burning error budget? Example: Let's say I have a 1h error budget left. I stop the service for planned maintenance for 30 minutes. Is the error budget still 1h or is it 30 ...

danielinclouds

467

asked May 26, 2020 at 20:50
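
The arithmetic in the excerpt above (1 h of budget left, 30 minutes of planned maintenance) can be written out as a small sketch. Whether planned maintenance counts against the SLO is a policy choice made when the SLO is defined, so it is modelled here as an explicit flag; the function name `remaining_budget_minutes` and the flag `maintenance_counts` are hypothetical.

```python
def remaining_budget_minutes(budget_left_min, maintenance_min, maintenance_counts):
    """Error budget left after a planned maintenance window.

    `maintenance_counts` is an assumed policy flag: True if the SLO treats
    planned downtime like any other downtime, False if it is excluded.
    """
    if not maintenance_counts:
        return budget_left_min                       # budget untouched
    return max(budget_left_min - maintenance_min, 0)  # never below zero


# Example from the excerpt: 60 minutes of budget, 30 minutes of maintenance.
print(remaining_budget_minutes(60, 30, maintenance_counts=True))   # 30
print(remaining_budget_minutes(60, 30, maintenance_counts=False))  # 60
```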

0 votes

1 answer

216 views

what are best practices for deploying new features for spring boot application?

I have a Spring Boot application with a very large number of users, and there are many incoming requests to my application. What should I do to deploy a new feature to the application without losing incoming ...

Moya

21

asked Apr 14, 2020 at 6:04
