DevOps Ultimate Monitoring Project
Introduction
In this project, we implemented a comprehensive monitoring solution using
Prometheus and various exporters to ensure the reliability and performance of
a web application hosted on AWS EC2 instances. This setup includes Node
Exporter for hardware and OS metrics, Blackbox Exporter for probing
endpoints, and Alertmanager for handling alerts. Gmail integration was also
configured to receive notifications for critical alerts.
Architecture
Shubham Mukherjee
Monitoring Project
Best Practices
1. Define Clear Objectives and Metrics
• Identify Key Metrics: Determine which metrics are critical for the health
and performance of your application and infrastructure (e.g., CPU usage,
memory usage, response times, error rates).
• Set Baselines and Thresholds: Establish baseline performance levels and
set thresholds for alerts to distinguish between normal and abnormal
behavior.
2. Use Multiple Data Sources
• Combine Metrics and Logs: Use both metrics and logs to get a
comprehensive view of the system's health.
• Integrate Various Exporters: Use relevant exporters (Node Exporter,
Blackbox Exporter, etc.) to collect metrics from different parts of your
infrastructure.
3. Implement Robust Alerting
• Define Relevant Alerting Rules: Create alerting rules that cover various
failure scenarios and performance degradation.
• Avoid Alert Fatigue: Ensure alerts are actionable and avoid too many
false positives. Group related alerts to reduce noise.
• Use Multiple Notification Channels: Configure alerts to be sent via
multiple channels (email, SMS, chat tools) to ensure they are noticed.
4. Ensure High Availability and Redundancy
• Deploy Across Multiple Regions: Set up monitoring components in
multiple regions to avoid single points of failure.
• Backup and Replicate Data: Regularly back up Prometheus data and
configuration files. Use replication to ensure data availability.
5. Optimize Performance and Resource Usage
• Tune Scrape Intervals and Retention Policies: Set scrape intervals and
data retention policies that balance data granularity against resource
usage.
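To make the trade-off in item 5 concrete, the following Python sketch estimates on-disk size from the scrape interval and the retention period. It uses the rule of thumb from the Prometheus storage documentation (retention time x samples ingested per second x bytes per sample, at roughly 1 to 2 bytes per sample). The series count, the 15-day retention, and the 2-bytes-per-sample figure are illustrative assumptions, not measurements from this project.

```python
def estimate_disk_bytes(active_series: int,
                        scrape_interval_s: float,
                        retention_days: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough TSDB size: retention_seconds * samples_per_second * bytes_per_sample."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    # Each active series yields one sample per scrape.
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Hypothetical workload: 10,000 active series kept for 15 days.
for interval in (15, 60):
    size_gib = estimate_disk_bytes(10_000, interval, 15) / 2**30
    print(f"scrape_interval={interval}s -> about {size_gib:.2f} GiB")
```

Under this model, moving from a 15s to a 60s scrape interval cuts the estimated storage to a quarter, at the cost of coarser data.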
Configuration Files
Prometheus Configuration (prometheus.yml)
Open the prometheus.yml file and add the following configuration.
• Global Configuration:
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
• Alertmanager Configuration:
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['localhost:9093']
• Scrape Configuration:
o Prometheus:
    scrape_configs:
      - job_name: "prometheus"
        static_configs:
          - targets: ["localhost:9090"]
o Node Exporter:
      - job_name: "node_exporter"
        static_configs:
          - targets: ["<instance_ip>:9100"]
o Blackbox Exporter:
      - job_name: 'blackbox'
        metrics_path: /probe
        params:
          module: [http_2xx]
        static_configs:
          - targets:
              - http://prometheus.io
              - https://prometheus.io
              - http://<instance_ip>:8080/
        relabel_configs:
          - source_labels: [__address__]
            target_label: __param_target
          - source_labels: [__param_target]
            target_label: instance
          - target_label: __address__
            replacement: <instance_ip>:9115
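The relabel rules in the Blackbox Exporter job can be hard to read at first. They copy each listed target into the target URL parameter, keep it as the instance label, and then point the scrape address at the Blackbox Exporter itself. The Python sketch below only illustrates the request URL that results. The function name is made up for this example, and 10.0.0.5 is a hypothetical stand-in for <instance_ip>.

```python
from urllib.parse import urlencode


def blackbox_probe_url(exporter_address: str, target: str,
                       module: str = "http_2xx",
                       metrics_path: str = "/probe") -> str:
    """Build the URL Prometheus requests after the relabel rules are applied.

    The listed target (__address__) is copied into __param_target, and
    __address__ is then replaced with the Blackbox Exporter's address.
    """
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter_address}{metrics_path}?{query}"


# 10.0.0.5 is a hypothetical stand-in for <instance_ip>.
print(blackbox_probe_url("10.0.0.5:9115", "https://prometheus.io"))
```

Opening such a URL in a browser or with curl is a quick way to check a probe by hand before relying on the scrape job.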
After completing this configuration, restart Prometheus so that it loads the
changes. First find its process ID:
    pgrep prometheus
This prints the process ID, for example 3445. Stop the process with:
    kill 3445
Then start Prometheus again the same way it was originally launched.
  labels:
    severity: warning
  annotations:
    summary: "Host out of memory (instance {{ $labels.instance }})"
    description: "Node memory is filling up (< 25% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: HostOutOfDiskSpace
  expr: (node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"} < 50
  for: 1s
  labels:
    severity: warning
  annotations:
    summary: "Host out of disk space (instance {{ $labels.instance }})"
    description: "Disk is almost full (< 50% left)\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: HostHighCpuLoad
  expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{job="node_exporter",mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Host high CPU load (instance {{ $labels.instance }})"
    description: "CPU load is > 80%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: ServiceUnavailable
  expr: up{job="node_exporter"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Service Unavailable (instance {{ $labels.instance }})"
    description: "The service {{ $labels.job }} is not available\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: HighMemoryUsage
  expr: (node_memory_Active_bytes / node_memory_MemTotal_bytes) * 100 > 90
  for: 10m
  labels:
    severity: critical
  annotations:
    summary: "High Memory Usage (instance {{ $labels.instance }})"
    description: "Memory usage is > 90%\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
- alert: FileSystemFull
  expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "File System Almost Full (instance {{ $labels.instance }})"
    description: "File system has < 10% free space\n VALUE = {{ $value }}\n LABELS: {{ $labels }}"
Reference the rules file from prometheus.yml: uncomment the entry under
rule_files and set it to alert_rules.yml.
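Each rule's for: value controls how long its expression must stay true before the alert fires. The Python sketch below is a simplified model of that behaviour, not the actual Prometheus rule evaluator: it tracks a single alert, ignores per-label-set bookkeeping, and uses a function name and sample data invented for this illustration.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(evaluations: Iterable[Tuple[float, bool]],
                 for_seconds: float) -> List[str]:
    """Return 'inactive', 'pending', or 'firing' for each evaluation.

    evaluations: (timestamp_in_seconds, expression_is_true) pairs in time order.
    """
    states: List[str] = []
    active_since: Optional[float] = None
    for timestamp, is_true in evaluations:
        if not is_true:
            active_since = None          # condition cleared: reset the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp     # first evaluation where it holds
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Evaluations every 15 s; the condition holds from t=15 on, apart from t=75.
samples = [(0, False), (15, True), (30, True), (45, True), (60, True),
           (75, False), (90, True), (105, True)]
print(alert_states(samples, for_seconds=30))
```

In this model a single false evaluation resets the timer, which is why a short for: value such as 1s fires almost immediately while 10m tolerates brief spikes.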
receivers:
  - name: 'email-notifications'
    email_configs:
      - to: <recipient_email>
        from: <sender_email>
        smarthost: smtp.gmail.com:587
        auth_username: your_email
        auth_identity: your_email
        auth_password: "<gmail_app_password>"
        send_resolved: true
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
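The inhibit rule mutes a warning alert while a critical alert with the same alertname, dev, and instance labels is firing. The Python sketch below models that matching logic in a simplified way. It is not Alertmanager's implementation, and the function names and sample alerts are hypothetical.

```python
from typing import Dict, Iterable, List

Labels = Dict[str, str]


def matches(labels: Labels, matcher: Labels) -> bool:
    """True when every matcher label has the same value on the alert."""
    return all(labels.get(name, "") == value for name, value in matcher.items())


def is_inhibited(target: Labels, firing: Iterable[Labels],
                 source_match: Labels, target_match: Labels,
                 equal: List[str]) -> bool:
    """True when some firing source alert mutes the target alert."""
    if not matches(target, target_match):
        return False
    for source in firing:
        # Skip the target itself and anything that is not a source alert.
        if source is target or not matches(source, source_match):
            continue
        # A missing label is treated like an empty one, so both sides agree
        # on 'dev' when neither alert carries it.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


# Hypothetical alerts for one host.
critical = {"alertname": "HostDown", "severity": "critical", "instance": "10.0.0.5:9100"}
warning = {"alertname": "HostDown", "severity": "warning", "instance": "10.0.0.5:9100"}
other = {"alertname": "HostOutOfDiskSpace", "severity": "warning", "instance": "10.0.0.5:9100"}

rule = dict(source_match={"severity": "critical"},
            target_match={"severity": "warning"},
            equal=["alertname", "dev", "instance"])
for alert in (warning, other):
    print(alert["alertname"], is_inhibited(alert, [critical, warning, other], **rule))
```

Because alertname is in the equal list, a critical alert only mutes warnings that share its name. A critical alert such as ServiceUnavailable would therefore not silence a differently named warning on the same instance.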
Results
Node Exporter
Prometheus
Alertmanager
Stop your web application and Node Exporter to trigger an alert in Alertmanager.
Gmail Notification
Acknowledgement
I would like to extend my heartfelt gratitude to Aditya Jaiswal from the
YouTube channel "DevOps Shack" for his invaluable guidance and insights
throughout this project. His tutorials and resources provided a solid foundation
for implementing a comprehensive monitoring solution using Prometheus and
its ecosystem. The knowledge and best practices shared on his channel greatly
contributed to the successful completion of this project. Thank you for your
support and dedication to the DevOps community.
Conclusion
In this project, we successfully implemented a robust monitoring solution using
Prometheus and its ecosystem to ensure the reliability and performance of a
web application hosted on AWS EC2 instances. By utilizing Node Exporter,
Blackbox Exporter, and Alertmanager, we achieved a setup capable of
collecting detailed metrics, monitoring endpoint availability, and managing
alerts effectively.
Key achievements include:
• Multi-Instance Setup:
o Instance 1: Hosts the web application, Node Exporter, and Nginx.
o Instance 2: Hosts Prometheus, Blackbox Exporter, and
Alertmanager.
• Gmail Integration: Configured Alertmanager to send email notifications
via Gmail for timely alerts on critical issues.
This project highlights the importance of a well-planned monitoring strategy in
maintaining the operational excellence of web-based services. Regular updates
and adherence to best practices will ensure the monitoring solution remains
effective and responsive to new challenges.