0 votes
0 answers
29 views

Using Metric Labels as Dynamic Limits in Grafana Expressions

I have an application where I want to export metrics with expected limits. The goal is to automate the creation of alerts and set triggered limits. Ideally, I’d like to add an additional tag with the ...
Sergei Kurenchuk
-1 votes
1 answer
29 views

Finding errors in a docker-compose file task? [closed]

I have a task of finding the errors in this docker-compose.yaml file. I have included my answers; could someone tell me if they are correct and whether I've found everything? The first is the yaml file and the ...
Krist Manor
0 votes
0 answers
21 views

Calculating Availability % in Prometheus, ignoring downtime of X mins

We are polling our services to check if they are available or not using an exporter. The output of those probes is 1/0. Now, I have a requirement to calculate the Availability % which ignores any ...
Prajul Nambiar
0 votes
0 answers
28 views

Deploy a custom Prometheus exporter on an application lost host

I'm trying to deploy a custom app exporter on our Linux server as an endpoint and then register it in Prometheus, but I'm unable to bring up a service on localhost. https://github.com/devon-mar/tacacs-exporter can ...
tarasi_214
0 votes
0 answers
49 views

Unable to rename page name in NeoLoad while recording

I have recorded a script in NeoLoad. By default, NeoLoad names the page after the first request name. The same requests appear in other transactions too. So this creates confusion while ...
aron
  • 1
0 votes
1 answer
59 views

I am using Active Directory as Security Realm and Role based Authorization for my Jenkins

When I run the script to fetch all users on Jenkins, it gives me both active and inactive users. But it also gives me users who aren't on my Jenkins manage jenkins->manage roles->assign ...
qwerty
  • 1
-2 votes
1 answer
116 views

Understand the thinking behind "slow error is even worse than a fast error" [closed]

While reading about the SRE four golden signals (under the Latency section) at https://sre.google/sre-book/monitoring-distributed-systems/ I am specifically unable to understand the line below: On the other hand, ...
symbaa
  • 11
0 votes
1 answer
181 views

Azure Alerts for an Application Gateway Availability SLI

I am attempting to implement an Azure Alert that triggers when our Availability SLI drops below a threshold, say 99.9%. For context, our Availability SLI is calculated as 100 - (the number of requests ...
devguydavid
  • 4,119
0 votes
2 answers
163 views

Azure Chaos Studio with Chaos Mesh VNET Injection in Private Clusters Unsuccessful

We are beginning to evaluate Azure Chaos Studio usage with Chaos Mesh k8s experiments (AKS Chaos Mesh Pod Chaos for example). Our clusters are private and we've enabled VNET injection when setting ...
tbarkley29
0 votes
1 answer
142 views

docker unable to delete default network

When I start the docker-compose file all containers are working fine. Docker File: services: db: container_name: postgresql environment: POSTGRES_DB: sonar POSTGRES_PASSWORD: ...
Mayur Dagdi
0 votes
1 answer
303 views

How do we set the name of a Docker network in docker-compose

I wrote a docker-compose file that creates multiple containers. When I use the docker-compose up command it works fine, but when I bring Docker down it gives **error while removing network** Stopping sonarqube ... error ...
Mayur Dagdi
0 votes
1 answer
389 views

PromQL queries for SLIs (Service Level Indicators) using Prometheus/Grafana and blackbox exporter

I want to achieve the specified SLIs (Service Level Indicators) for our HTTP endpoints, using blackbox exporter for probing, such as the following indicators: 80% availability; latency less than 1s. For latency ...
sal
  • 33
-1 votes
1 answer
541 views

Harbor registry proxy cache vs replication [closed]

I'm new to Harbor registry. I was asked to propose an architecture for Harbor in my company. At first I proposed an architecture based on proxy cache. But the CISO refused to use proxy cache ...
mastertopg
0 votes
1 answer
101 views

Application Monitoring using SQL and shell script

We are using shell scripts and SQL queries to monitor our application. We are planning to migrate to the cloud and use Prometheus and OpenSearch for monitoring. Is there a way to execute Oracle SQL queries (...
user3069309
0 votes
1 answer
226 views

Should a not found or empty response always be 404?

I have an endpoint for a REST API that checks for the existence of a (or a list of) requests. It can return 200 OK if there is an order in progress or 404 NOT FOUND if there are no current orders ...
Plinio Fabrycio
1 vote
1 answer
643 views

Prometheus: How do I write a PromQL query to find a drastic increase or decrease by some X% in my graph that persists for 10m, in order to raise an alert

I am trying to use a rate() query that compares the last 10 min with the previous 50 min, like: sum(rate(cmd_get[10m])) / sum(rate(cmd_get[50m] offset 10m)) If I want to check the percentage ...
samantha
0 votes
1 answer
6k views

Alertmanager: how to send alerts only on weekdays?

I tried to add this to my alertmanager.yml at root level, but I got this error: yaml: unmarshall errors: field time_intervals not found in type config.plain time_intervals: - times: weekdays: ['...
TestAutomator
1 vote
1 answer
66 views

RBAC for Infrastructure Engineer

I feel this is a rather basic question, but somehow I'm unable to find a good answer. Recently auditors are complaining about the Role Based Access Control for our cloud set-up. My team is responsible ...
Herman
  • 857
0 votes
1 answer
138 views

How to set SLO for operations that are dependent on file size?

I have an endpoint POST /upload that uploads a file into my storage. The response time is dependent on the file size (the bigger the file, the longer it takes to respond with 200). How should I set a ...
NyamNyam
  • 320
0 votes
1 answer
1k views

How can I OOM kill a pod manually in Kubernetes

I'm trying to manually OOM Kill pods for testing purposes, does anyone know how I can achieve this?
GreatBear
2 votes
1 answer
141 views

Conditions to check if an Aerospike cluster is idle

Assuming Aerospike is running, I need some conditions to check whether the Aerospike cluster is idle and not being used at all. I tried checking log files, but they also log the heartbeat, so ...
Sujay_ks
  • 328
0 votes
1 answer
110 views

Puppet3 | read values from different yaml file

So I'm using puppet3 and I have X.yaml and Y.yaml. X.yaml has profiles::resolv_conf::nameservers: [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] in it. I want to add that [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] as a ...
Codemypath
1 vote
1 answer
609 views

Can Services in GCP's Monitoring monitor endpoints?

I installed managed Anthos on a GKE cluster. Anthos Service Mesh is working and is displaying my API. Thanks to that Services that are in Monitoring automatically detect my API. This is great as it ...
Marcin Kulik
0 votes
1 answer
1k views

Flink 1.14.3 - [issue] failed to bind to /0.0.0.0:6123

We are using version 1.14.3 of Flink, and when we try to run the JobManager we get the exception below. I tried entering akka.remote.netty.tcp.hostname = "127.0.0.1" in flink-conf.yml file ...
Vinayraj007
1 vote
1 answer
70 views

Can TTFB be affected after page load?

In case of server side rendering, we know that TTFB is the time it takes between the start of the request and the start of the response. My question is can the TTFB be affected if the page visually ...
0 votes
0 answers
62 views

What and where is this class 'UniversalScalabilityLawForecast' in Micrometer library?

I'm reading 'SRE with Java Microservices' (O'Reilly): "USL forecasting is a form of “derived” Meter in Micrometer and can be enabled as shown in Example 4-39." Example 4-39. Universal ...
BY-J
  • 9
2 votes
1 answer
900 views

How do I measure error budget consumption for rolling windows?

I have an SLO for one application where 95% of service response times must be less than 450ms over a rolling 24 hour window. I sample once every 60 seconds. Typically my "current service level" ...
Miked
  • 21
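
The arithmetic in this question can be sketched as follows. This is only one possible reading of the truncated excerpt, not the asker's or any answerer's method: it assumes each 60-second sample is a single response time in milliseconds, that a sample at or above 450 ms counts as "bad", and that the budget is the number of bad samples the 95% target allows in a full 24-hour window. The function and constant names are hypothetical.

```python
from collections import deque

# Values taken from the question; the names are illustrative assumptions.
TARGET_PERCENT = 95        # 95% of responses must be under the threshold
THRESHOLD_MS = 450         # response-time threshold in milliseconds
WINDOW_SAMPLES = 24 * 60   # one sample per 60 s over 24 h = 1440 samples


def budget_consumed(samples_ms, window=WINDOW_SAMPLES):
    """Return the fraction of the error budget used by the latest `window` samples."""
    # Keep only the most recent `window` samples (the rolling window).
    recent = deque(samples_ms, maxlen=window)
    # Bad samples the SLO tolerates in a full window: 5% of 1440 = 72.
    allowed_bad = window * (100 - TARGET_PERCENT) / 100
    # A sample violates the SLO when it is not strictly under the threshold.
    bad = sum(1 for s in recent if s >= THRESHOLD_MS)
    return bad / allowed_bad


if __name__ == "__main__":
    day = [100] * 1404 + [500] * 36   # 36 slow samples in one day
    print(budget_consumed(day))
```

Under these assumptions, a value of 1.0 means the budget for the current window is exhausted, and slow samples stop counting once they are older than 1440 samples.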
0 votes
2 answers
1k views

What happens if a container exceeds the CPU requested but stays under the limit on Kubernetes?

In Kubernetes we can set limits and requests for CPU. If the container exceeds the limit, from my understanding it will be throttled. However if the container exceeds the requested but is still ...
user2962698
1 vote
1 answer
5k views

Give access to a service principal which is in another Azure tenant

We deploy resources in our Azure tenant through Jenkins, which uses Terraform to provision infra resources, and we use a service principal for authentication and infra provisioning which are in same ...
chitender kumar
0 votes
1 answer
719 views

How to add multiple AWS ClientVPN Routes using Terraform

I have an AWS ClientVPN which was created manually from the AWS console and it has 20-plus route table entries. Now, I want to terraform this so we can add any new route using Terraform. I have imported ...
Techhack New
0 votes
1 answer
1k views

Prometheus inhibit alert selectively

I need to create an alerting system that has to notify when a particular condition (e.g. Tomcat goes down) is met. Multiple remote servers deployed in different locations (with different time zones) ...
gpprimo
-3 votes
1 answer
130 views

How Google runs production systems - what's really the "50% time for project work" for SRE?

Quote: "SREs at 50% of their time. Their remaining time should be spent using their coding skills on project work." (page 7) I'm reading this book and really can't understand. What is "...
user227685
6 votes
1 answer
7k views

PromQL query to calculate service uptime & downtime from a fixed date

I'm trying to build a basic SRE dashboard in order to learn Prometheus/Grafana. I want to calculate the number of hours the service has been running & the number of hours it's been down since the ...
user9492428
0 votes
1 answer
463 views

Anchore Container scanning in Jenkins CI Pipeline

I need help with my Jenkinsfile CI file. The code in the Jenkinsfile looks like this: pipeline { environment { registry = "user/demo1" registryCredential = 'dockerhub' dockerImage = '' ...
Rukender
0 votes
2 answers
3k views

Prometheus rules - check file count inside a directory of an app container

I'm looking to write a prometheus rule to constantly check for message queue length(exim mail relay) which is the total number of files in a directory in an app's container and alert a slack channel ...
Avi
  • 1,623
0 votes
1 answer
294 views

Prometheus alert expression for 99% availability of REST API

I would like to create an alert in Prometheus for a REST API, if the API is not available 99% of the time. I am new to Prometheus expressions. Could you please help me to create an expression to ...
user3777385
0 votes
2 answers
802 views

Eliminate specific value from Jmx exporter through config Yaml

Here is the current Jmx exporter pattern: pattern: 'metrics<name=resilience4jCircuitbreakerState.name.(.*).state.(.*), type=gauges><>Value' name: 'x.y.z.resilience4j.circuitbreaker.state' ...
Md. Hasan Basri
0 votes
0 answers
138 views

Is the error budget in GCP UI supposed to rise above 100%?

I have just started using SLOs in GCP and my first SLI seems to be working, but the "error budget" field is way above 100%. All the examples I have seen online sit at 100%, whereas mine ...
Cameron
  • 44
1 vote
1 answer
117 views

How to avoid "Positive Feedback Cycle Overload Problem"?

Sometimes while designing reliable systems, we try to make the system more reliable by adding retries in the event of failure (with feedback mechanisms). This creates the potential for an overload ...
Stalin Rijal
3 votes
1 answer
2k views

manage dataproc cluster access using service account and IAM roles

I am a beginner in cloud and would like to limit my Dataproc cluster's access to given GCS buckets in my project. Let's say I have created a service account named 'data-proc-service-account@my-...
vikrant rana
  • 4,654
0 votes
1 answer
138 views

Is the maintenance window burning error budget

Is the maintenance window burning error budget? Example: Let's say I have a 1h error budget left. I stop the service for planned maintenance for 30 minutes. Is the error budget still 1h or is it 30 ...
danielinclouds
0 votes
1 answer
216 views

What are best practices for deploying new features for a Spring Boot application?

I have a Spring Boot application with a large number of users and many incoming requests. What should I do to deploy a new feature to the application without losing incoming ...
Moya
  • 21