42 questions
0
votes
0
answers
29
views
Using Metric Labels as Dynamic Limits in Grafana Expressions
I have an application where I want to export metrics with expected limits. The goal is to automate the creation of alerts and set triggered limits.
Ideally, I’d like to add an additional tag with the ...
-1
votes
1
answer
29
views
finding errors in docker-compose file task? [closed]
I have a task of finding the errors in this docker-compose.yaml file. I have included my answers; could someone tell me if they are correct and if I've found everything? The first is the yaml file and the ...
0
votes
0
answers
21
views
Calculating Availability % in Prometheus, ignoring downtime of X mins
We are polling our services with an exporter to check whether they are available. The output of those probes is 1/0. Now, I have a requirement to calculate the Availability % which ignores any ...
0
votes
0
answers
28
views
Deploy a custom Prometheus exporter on an application lost host
I'm trying to deploy a custom app exporter on our Linux server as an endpoint and then register it in Prometheus, but I'm unable to bring up a service on localhost
https://github.com/devon-mar/tacacs-exporter
can ...
0
votes
0
answers
49
views
Unable to rename page name in NeoLoad while recording
I have recorded a script in NeoLoad. By default, NeoLoad names the page after the first request name. The same requests appear in other transactions too, so this creates confusion while ...
0
votes
1
answer
59
views
I am using Active Directory as Security Realm and Role-based Authorization for my Jenkins
When I run the script to fetch all users on Jenkins, it gives me both active and inactive users. But it also gives me users who aren't on my Jenkins under Manage Jenkins -> Manage Roles -> Assign ...
-2
votes
1
answer
116
views
Understand the thinking behind "slow error is even worse than a fast error" [closed]
While reading about the SRE four golden signals (under the Latency section) in https://sre.google/sre-book/monitoring-distributed-systems/
I am specifically unable to understand the line below:
On the other hand, ...
0
votes
1
answer
181
views
Azure Alerts for an Application Gateway Availability SLI
I am attempting to implement an Azure Alert that triggers when our Availability SLI drops below a threshold, say 99.9%. For context, our Availability SLI is calculated as 100 - (the number of requests ...
0
votes
2
answers
163
views
Azure Chaos Studio with Chaos Mesh VNET Injection in Private Clusters Unsuccessful
We are beginning to evaluate Azure Chaos Studio usage with Chaos Mesh k8s experiments (AKS Chaos Mesh Pod Chaos for example).
Our clusters are private and we've enabled VNET injection when setting ...
0
votes
1
answer
142
views
Docker unable to delete default network
When I start the docker-compose file all containers are working fine.
Docker File:
services:
  db:
    container_name: postgresql
    environment:
      POSTGRES_DB: sonar
      POSTGRES_PASSWORD: ...
0
votes
1
answer
303
views
How do we set the name of a Docker network in docker-compose
I wrote a docker-compose file that creates multiple containers.
When I use the docker-compose up command it works fine, but when I bring it down again it gives **error while removing network**
Stopping sonarqube ... error
...
0
votes
1
answer
389
views
PromQL queries for SLIs (Service Level Indicators) using Prometheus/Grafana and blackbox exporter
I want to achieve the specified SLIs (Service Level Indicators) for our HTTP endpoints, using the blackbox exporter for probing, such as the following indicators:
80% availability
Latency less than 1s
For latency ...
-1
votes
1
answer
541
views
Harbor registry proxy cache vs replication [closed]
I'm new to Harbor registry. I was asked to propose an architecture for Harbor in my company. At first I proposed an architecture based on proxy cache. But the CISO refused to use proxy cache ...
0
votes
1
answer
101
views
Application Monitoring using sql and shell script
We are using shell scripts and SQL queries to monitor our application.
We are planning to migrate to the cloud and use Prometheus and OpenSearch for monitoring.
Is there a way to execute Oracle SQL queries (...
0
votes
1
answer
226
views
Should a not found or empty response always be 404?
I have an endpoint for a REST API that checks for the existence of a (or a list of) requests.
It can return 200 OK if there is an order in progress
or 404 NOT FOUND if there are no current orders
...
1
vote
1
answer
643
views
Prometheus: How do I write a PromQL query to find a drastic increase or decrease by some X% in my graph that stays for 10m, and raise an alert
I am trying to use a rate() query that compares the last 10 min with the previous 50 min, like:
sum by () (rate(cmd_get{}[10m])) / sum by () (rate(cmd_get{}[50m] offset 10m))
If I want to check the percentage ...
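For readers following the ratio query above, here is a minimal Python sketch of the percentage arithmetic only. The function names and the threshold parameter are hypothetical and not from the question. The "stays for 10m" part would normally be handled by the `for:` clause of a Prometheus alerting rule, not by this calculation.

```python
def percent_change(recent_rate: float, baseline_rate: float) -> float:
    """Percentage change of recent_rate relative to baseline_rate.

    Both arguments stand in for the two rate() results in the question
    (last 10 min vs. the previous 50 min); the names are illustrative.
    """
    if baseline_rate == 0:
        raise ValueError("baseline rate is zero; percentage change is undefined")
    return (recent_rate / baseline_rate - 1.0) * 100.0


def is_drastic(recent_rate: float, baseline_rate: float, threshold_pct: float) -> bool:
    """True when the rate moved up or down by at least threshold_pct percent."""
    return abs(percent_change(recent_rate, baseline_rate)) >= threshold_pct
```

In PromQL terms, a ratio of 1.5 corresponds to a 50% increase and a ratio of 0.5 to a 50% decrease, so an alert expression could compare the ratio against `1 + X/100` and `1 - X/100`.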
0
votes
1
answer
6k
views
Alertmanager: how to send alerts only in weekdays?
I tried to add this to my alertmanager.yml at the root level, but I got this error:
yaml: unmarshall errors: field time_intervals not found in type config.plain
time_intervals:
- times:
weekdays: ['...
1
vote
1
answer
66
views
RBAC for Infrastructure Engineer
I feel this is a rather basic question, but somehow I'm unable to find a good answer.
Recently auditors are complaining about the Role Based Access Control for our cloud set-up. My team is responsible ...
0
votes
1
answer
138
views
How to set SLO for operations that are dependent on file size?
I have an endpoint POST /upload that uploads a file into my storage.
The response time depends on the file size (the bigger the file, the longer it takes to respond with 200).
How should I set a ...
0
votes
1
answer
1k
views
How can I OOM kill a pod manually in Kubernetes
I'm trying to manually OOM-kill pods for testing purposes. Does anyone know how I can achieve this?
2
votes
1
answer
141
views
Conditions to check if an Aerospike cluster is idle
Assuming Aerospike is running, I need some conditions to check whether the Aerospike cluster is idle and not being used at all.
I tried checking the log files, but they also log the heartbeat, so ...
0
votes
1
answer
110
views
Puppet3 | read values from different yaml file
So I'm using puppet3 and I have X.yaml and Y.yaml. X.yaml has profiles::resolv_conf::nameservers: [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] in it. I want to add that [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] as a ...
1
vote
1
answer
609
views
Can Services in GCP's Monitoring monitor endpoints?
I installed managed Anthos on a GKE cluster. Anthos Service Mesh is working and is displaying my API. Thanks to that, the Services in Monitoring automatically detect my API. This is great as it ...
0
votes
1
answer
1k
views
Flink 1.14.3 - [issue] failed to bind to /0.0.0.0:6123
We are using version 1.14.3 of Flink, and when we try to run the Job manager, we get the exception below.
I tried entering
akka.remote.netty.tcp.hostname = "127.0.0.1" in the flink-conf.yml file ...
1
vote
1
answer
70
views
Can TTFB be affected after page load?
In case of server side rendering, we know that TTFB is the time it takes between the start of the request and the start of the response. My question is can the TTFB be affected if the page visually ...
0
votes
0
answers
62
views
What and where is this class 'UniversalScalabilityLawForecast' in Micrometer library?
I'm reading 'SRE with Java Microservices' (O'Reilly)
"USL forecasting is a form of “derived” Meter in Micrometer and can be enabled as shown in Example 4-39."
Example 4-39. Universal ...
2
votes
1
answer
900
views
How do I measure error budget consumption for rolling windows?
I have an SLO for one application where 95% of service response times must be less than 450ms over a rolling 24 hour window. I sample once every 60 seconds. Typically my "current service level" ...
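As a rough illustration of the numbers in this excerpt: one sample every 60 seconds over 24 hours gives 1440 samples per window, and a 95% target leaves a budget of 72 slow samples. The sketch below is one possible way to track consumption, assuming each sample is a single latency measurement. The class and method names are hypothetical and not taken from the question.

```python
from collections import deque


class RollingErrorBudget:
    """Tracks error budget consumption over a fixed-size rolling window of samples."""

    def __init__(self, window_samples=24 * 60, slo_target=0.95, threshold_ms=450.0):
        # Defaults mirror the question: 1440 samples (one per 60 s over 24 h),
        # 95% target, 450 ms threshold.
        self.threshold_ms = threshold_ms
        # Number of slow samples the SLO tolerates per full window (72 by default).
        self.budget = window_samples * (1.0 - slo_target)
        if self.budget <= 0:
            raise ValueError("slo_target must leave a non-zero error budget")
        # deque(maxlen=...) drops the oldest sample once the window is full,
        # which is what makes the window "rolling".
        self.samples = deque(maxlen=window_samples)

    def record(self, latency_ms):
        """Store True when the sample violates the latency threshold."""
        self.samples.append(latency_ms >= self.threshold_ms)

    def budget_consumed(self):
        """Fraction of the window's error budget used (1.0 means fully spent)."""
        return sum(self.samples) / self.budget
```

Because old samples fall out of the window, consumed budget is returned automatically as slow samples age past 24 hours.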
0
votes
2
answers
1k
views
What happens if a container exceeds the CPU requested but stays under the limit on Kubernetes?
In Kubernetes we can set limits and requests for CPU. If the container exceeds the limit, from my understanding it will be throttled. However if the container exceeds the requested but is still ...
1
vote
1
answer
5k
views
Give access to a service principal which is in another Azure tenant
We deploy resources in our Azure tenant through Jenkins, which uses Terraform to provision infra resources.
We use a service principal for authentication and infra provisioning, which are in same ...
0
votes
1
answer
719
views
How to add multiple AWS ClientVPN Routes using Terraform
I have an AWS Client VPN which was created manually from the AWS console, and it has more than 20 route table entries.
Now, I want to manage this with Terraform so we can add any new route using Terraform.
I have imported ...
0
votes
1
answer
1k
views
Prometheus inhibit alert selectively
I need to create an alerting system that has to notify when a particular condition (e.g. Tomcat goes down) is met.
Multiple remote servers deployed in different locations (with different time zones) ...
-3
votes
1
answer
130
views
How Google runs production systems - what's really the "50% time for project work" for SRE?
Quote: "SREs at 50% of their time. Their remaining time should be spent using their coding skills on project work." (page 7)
I'm reading this book, and really can't understand.
What is "...
6
votes
1
answer
7k
views
PromQL query to calculate service uptime & downtime from a fixed date
I'm trying to build a basic SRE dashboard in order to learn Prometheus/Grafana.
I want to calculate the number of hours the service has been running & the number of hours it's been down since the ...
0
votes
1
answer
463
views
Anchore Container scanning in Jenkins CI Pipeline
I need help with my Jenkinsfile CI file.
Code in Jenkinsfile looks like this:
pipeline {
  environment {
    registry = "user/demo1"
    registryCredential = 'dockerhub'
    dockerImage = ''
    ...
0
votes
2
answers
3k
views
Prometheus rules - check file count inside a directory of an app container
I'm looking to write a Prometheus rule to constantly check the message queue length (Exim mail relay), which is the total number of files in a directory in an app's container, and alert a Slack channel ...
0
votes
1
answer
294
views
Prometheus alert expression for 99% availability of REST API
I would like to create an alert in Prometheus for a REST API, if the API is not available 99% of the time. I am new to Prometheus expressions. Could you please help me to create an expression to ...
0
votes
2
answers
802
views
Eliminate specific value from JMX exporter through config YAML
Here is the current JMX exporter pattern:
pattern: 'metrics<name=resilience4jCircuitbreakerState.name.(.*).state.(.*), type=gauges><>Value'
name: 'x.y.z.resilience4j.circuitbreaker.state'
...
0
votes
0
answers
138
views
Is the error budget in GCP UI supposed to rise above 100%?
I have just started using SLOs in GCP and my first SLI seems to be working, but the "error budget" field is way above 100%. All the examples I have seen online sit at 100%, whereas mine ...
1
vote
1
answer
117
views
How to avoid "Positive Feedback Cycle Overload Problem"?
Sometimes while designing reliable systems, we try to make the system more reliable by adding retries in the event of failure (with feedback mechanisms). This results in the potential for an overload ...
3
votes
1
answer
2k
views
Manage Dataproc cluster access using service account and IAM roles
I am a beginner in cloud and would like to limit my Dataproc cluster's access to given GCS buckets in my project.
Let's say I have created a service account named 'data-proc-service-account@my-...
0
votes
1
answer
138
views
Is the maintenance window burning error budget
Is the maintenance window burning error budget?
Example:
Let's say I have a 1h error budget left. I stop the service for planned maintenance for 30 minutes. Is the error budget still 1h or is it 30 ...
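The arithmetic in this example can be sketched for both readings. Whether planned maintenance counts against the budget is a policy choice that the excerpt does not settle, so the sketch below takes it as an explicit flag. The function and parameter names are hypothetical.

```python
def remaining_budget_minutes(budget_left_min: float,
                             maintenance_min: float,
                             maintenance_counts: bool) -> float:
    """Error budget left after a maintenance window, in minutes.

    maintenance_counts is an assumed policy switch: True if planned downtime
    is charged to the error budget, False if the SLO excludes it.
    """
    if maintenance_counts:
        # Planned downtime is treated like any other unavailability.
        return max(budget_left_min - maintenance_min, 0.0)
    # Planned downtime is excluded from the SLO, so the budget is untouched.
    return budget_left_min
```

With the figures from the question, a 60 minute budget and a 30 minute stop leave 30 minutes under the first policy and 60 minutes under the second.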
0
votes
1
answer
216
views
What are best practices for deploying new features for a Spring Boot application?
I have a Spring Boot application with a great many users and many incoming requests. What should I do to deploy a new feature to the application without losing incoming ...