Day: March 13, 2022

Learning – Prometheus Monitor

Monitoring Tool for

  • Highly dynamic container environments

  • Container & Microservices Infrastructure

  • Traditional, bare-metal servers

  • Constantly monitoring all the services

  • Alerting when a service crashes

  • Identifying problems before they happen

  • Checking memory usage

  • Notifying the administrator

  • Triggering an alert when a threshold is reached, e.g. at 50% memory usage

  • Monitoring network loads

Prometheus Server

Does the actual monitoring work

  • Time Series Database
    Storage - stores metrics data (e.g. CPU usage, number of exceptions)

  • Data Retrieval Worker
    Retrieval - pulls metrics data from targets (applications, servers, ...)

  • HTTP Server
    Accepts PromQL queries from clients such as:

    • Prometheus Web UI
    • Grafana
    • etc.

Targets and Metrics

Targets

  • What does Prometheus monitor?

    • Linux/Windows Server
    • Single Application
    • Apache Server
    • Service, like Database
  • Which units of those targets are monitored?

    • CPU Status
    • Memory/Disk Space Usage
    • Requests Count
    • Exceptions Count
    • Request Duration

Metrics

  • Format: human-readable, text-based
  • Metric entries have TYPE and HELP attributes
    HELP - description of what the metric is
    TYPE - the metric type, for example:

    • Counter - how many times did x happen?
    • Gauge - what is the current value of x?
    • Histogram - how long or how big was x?
    • Summary - like a histogram, but quantiles are calculated on the client side
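
To make the HELP and TYPE attributes concrete, here is a minimal Python sketch that renders a counter, a gauge and a histogram in the text format Prometheus scrapes. The metric names, help texts and values are made up for illustration; real applications would normally use a client library instead of formatting the text by hand.

```python
def render_simple(name, metric_type, help_text, value):
    """Render a single counter or gauge sample with its HELP and TYPE lines."""
    return "\n".join([
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
        f"{name} {value}",
    ])


def render_histogram(name, help_text, bounds, observations):
    """Render a histogram: cumulative buckets plus _sum and _count series."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} histogram"]
    for le in bounds:
        # Buckets are cumulative: each counts every observation <= its upper bound.
        count = sum(1 for v in observations if v <= le)
        lines.append(f'{name}_bucket{{le="{le}"}} {count}')
    # The +Inf bucket always equals the total number of observations.
    lines.append(f'{name}_bucket{{le="+Inf"}} {len(observations)}')
    lines.append(f"{name}_sum {sum(observations)}")
    lines.append(f"{name}_count {len(observations)}")
    return "\n".join(lines)


counter_text = render_simple("http_requests_total", "counter",
                             "Total number of HTTP requests.", 1027)
gauge_text = render_simple("memory_usage_ratio", "gauge",
                           "Current share of memory in use.", 0.42)
histogram_text = render_histogram("http_request_duration_seconds",
                                  "HTTP request latency.",
                                  [0.1, 0.5, 1.0],
                                  [0.0625, 0.25, 0.25, 0.75, 2.0])
print(counter_text, gauge_text, histogram_text, sep="\n")
```

The histogram answers "how long?" by counting requests per latency bucket; Prometheus can later estimate quantiles from these buckets.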

Collecting Metrics Data from Targets

Data Retrieval Worker => pull over HTTP => Target (Linux Server, External Service)

  • Pulls from HTTP endpoints
  • by default at hostaddress/metrics
  • the response must be in the format Prometheus understands
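
A rough idea of what the Data Retrieval Worker does with a target: fetch hostaddress/metrics over HTTP and parse the text into samples. This Python sketch is a simplified parser written for illustration, not Prometheus's own; it skips comment lines, ignores optional timestamps and does not handle escaped label values.

```python
import re
from urllib.request import urlopen

# A sample line: metric name, optional {label="value",...} block, then the value.
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse exposition text into (name, labels, value) tuples.

    Simplified: skips comment lines (# HELP / # TYPE), ignores optional
    timestamps, does not unescape label values, and assumes no '}' inside them.
    """
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            raise ValueError(f"line is not in the expected format: {line!r}")
        name, label_block, value = match.groups()
        labels = dict(LABEL_RE.findall(label_block or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url, timeout=10.0):
    """Pull one target over HTTP and parse its metrics."""
    with urlopen(url, timeout=timeout) as response:
        return parse_exposition(response.read().decode("utf-8"))
```

For example, scrape("http://localhost:9100/metrics") would pull a Node Exporter running locally on its usual port (an assumption about your setup).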

Target Endpoints and Exporters

  • Some services expose a /metrics endpoint by default
  • Many services don't, and need an extra component for this

Exporter

  • fetches metrics from the target (some service)
  • converts them to the format Prometheus understands
  • exposes them at its own /metrics endpoint
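
The three exporter steps can be sketched in Python with only the standard library. Everything service-specific here is hypothetical: fetch_service_stats stands in for querying a real service, and the "myservice" prefix, the stat names and port 9999 are invented for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical mapping from the service's own stat names to Prometheus types.
METRIC_TYPES = {"active_connections": "gauge", "requests_served": "counter"}


def fetch_service_stats():
    """Hypothetical stand-in: a real exporter would query the service here
    (status page, admin API, log file, ...)."""
    return {"active_connections": 12, "requests_served": 3400}


def to_exposition(stats, prefix="myservice"):
    """Convert the service's native stats into Prometheus text format."""
    lines = []
    for key, value in sorted(stats.items()):
        metric_type = METRIC_TYPES.get(key, "gauge")
        # Naming convention: counter names end in _total.
        name = f"{prefix}_{key}" + ("_total" if metric_type == "counter" else "")
        lines += [f"# HELP {name} {key} as reported by the service.",
                  f"# TYPE {name} {metric_type}",
                  f"{name} {value}"]
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Fetches and converts fresh stats on every GET /metrics (every scrape)."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = to_exposition(fetch_service_stats()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9999):
    """Block forever, exposing http://<host>:<port>/metrics (port is arbitrary)."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling serve() and adding localhost:9999 as a target in prometheus.yml would let Prometheus scrape the converted metrics. Converting on each request keeps the exporter stateless: Prometheus decides when data is pulled.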

List of official exporters ...

Monitor a Linux Server?

  • download the Node Exporter
  • untar and execute it
  • it converts the server's metrics
  • and exposes them at a /metrics endpoint
  • configure Prometheus to scrape this endpoint

Monitoring your own applications

  • How many requests?
  • How many exceptions?
  • How many server resources are used?

Using Prometheus client libraries, you can expose a /metrics endpoint from your own application
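
What a client library does inside your application can be sketched with the standard library alone. SimpleCounter below is a hypothetical stand-in, not the API of the official Python client (prometheus_client); it only shows the idea of incrementing labelled counters in request-handling code and rendering them for /metrics.

```python
import threading


class SimpleCounter:
    """Hypothetical stand-in for a client-library counter (not the real
    prometheus_client API): one monotonically increasing value per label set."""

    def __init__(self, name, help_text, label_names):
        self.name = name
        self.help_text = help_text
        self.label_names = tuple(label_names)
        self._values = {}
        self._lock = threading.Lock()  # request handlers may run concurrently

    def inc(self, amount=1.0, **labels):
        if amount < 0:
            raise ValueError("a counter can only go up")
        key = tuple(labels[n] for n in self.label_names)
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def expose(self):
        """Render this counter in the text format served at /metrics."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            for key, value in sorted(self._values.items()):
                pairs = ",".join(f'{n}="{v}"' for n, v in zip(self.label_names, key))
                labels = f"{{{pairs}}}" if pairs else ""
                lines.append(f"{self.name}{labels} {value}")
        return "\n".join(lines) + "\n"


REQUESTS = SimpleCounter("app_requests_total", "Requests handled.", ["status"])
EXCEPTIONS = SimpleCounter("app_exceptions_total", "Exceptions raised.", ["type"])


def handle_request(divisor):
    """Hypothetical request handler, instrumented to answer
    'how many requests?' and 'how many exceptions?'."""
    try:
        result = 100 / divisor
    except ZeroDivisionError as exc:
        EXCEPTIONS.inc(type=type(exc).__name__)
        REQUESTS.inc(status="500")
        return None
    REQUESTS.inc(status="200")
    return result
```

Serving REQUESTS.expose() + EXCEPTIONS.expose() at /metrics would make these counts scrapeable; a real client library also handles the HTTP endpoint and other metric types for you.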

Pull Mechanism

The Data Retrieval Worker pulls each target's /metrics endpoint

Push system

Amazon CloudWatch, New Relic: applications/servers push their metrics to a centralized collection platform

  • high network traffic load
  • monitoring can become your bottleneck
  • additional software or tools must be installed to push metrics

Pull system - more advantages

  • multiple Prometheus instances can pull the same metrics data
  • better detection/insight into whether a service is up and running
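
With pulling, a failed scrape is itself information: Prometheus records a synthetic up series for every target, 1 when the scrape succeeded and 0 when it failed. This Python sketch imitates that idea; scrape_up and pull_rounds are hypothetical helpers, and several independent pullers (like multiple Prometheus instances) could run the same loop against the same targets.

```python
import time
from urllib.request import urlopen


def scrape_up(url, timeout=5.0):
    """Pull a target once. Returns 1.0 if the scrape succeeded and 0.0 if not,
    mirroring the idea behind Prometheus's per-target `up` series."""
    try:
        with urlopen(url, timeout=timeout) as response:
            response.read()
        return 1.0
    except (OSError, ValueError):
        # Connection refused, timeout, HTTP error (URLError is an OSError),
        # or a malformed URL (ValueError): the target counts as down.
        return 0.0


def pull_rounds(targets, interval_s, rounds):
    """Scrape every target once per round and record up/down per target."""
    history = {target: [] for target in targets}
    for i in range(rounds):
        for target in targets:
            history[target].append(scrape_up(target))
        if i < rounds - 1:
            time.sleep(interval_s)
    return history
```

In a push system, silence from an application is ambiguous (no data yet, or crashed?); here the puller notices the failure directly on the next round.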

Pushgateway

What if a target only runs for a short time?

"short-lived job" => push metrics at exit => Pushgateway

Pushgateway <= pull <= Prometheus Server
Prometheus targets <= pull <= Prometheus Server
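
A short-lived job can push its metrics right before exiting. The Pushgateway accepts the text format over HTTP at /metrics/job/<job name>; a PUT replaces everything previously pushed for that job. This Python sketch assumes a Pushgateway on its default port 9091, and the job name and metric are hypothetical.

```python
from urllib.request import Request, urlopen

PUSHGATEWAY_URL = "http://localhost:9091"  # assumption: Pushgateway on its default port


def build_push_request(gateway, job, metrics_text):
    """Build the HTTP request that pushes a job's metrics to the Pushgateway.
    PUT to /metrics/job/<job> replaces everything previously pushed for that job."""
    if not metrics_text.endswith("\n"):
        metrics_text += "\n"  # the text format expects a trailing newline
    return Request(f"{gateway}/metrics/job/{job}",
                   data=metrics_text.encode("utf-8"),
                   method="PUT",
                   headers={"Content-Type": "text/plain; version=0.0.4"})


def run_short_lived_job():
    """Hypothetical batch job: do the work, then push metrics before exiting."""
    processed = sum(1 for _ in range(500))  # stand-in for real work
    metrics = ("# TYPE batch_items_processed gauge\n"
               f"batch_items_processed {processed}\n")
    request = build_push_request(PUSHGATEWAY_URL, "nightly_batch", metrics)
    with urlopen(request, timeout=5.0):
        pass
```

Prometheus then scrapes the Pushgateway like any other target, so the job's last metrics remain available after the job itself is gone.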

Configuring Prometheus

How does Prometheus know what to scrape and when?

  • prometheus.yml

    • which targets?
    • at what interval?
  • service discovery
    service discovery <= discover targets <= Prometheus Server

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node_exporter
    scrape_interval: 1m
    scrape_timeout: 1m
    static_configs:
      - targets: ['localhost:9100']

  • global: how often Prometheus scrapes its targets and evaluates rules; individual jobs can override the scrape settings
  • rule_files: rules for aggregating metric values or creating alerts when a condition is met
  • scrape_configs: which resources Prometheus monitors
    • the prometheus job scrapes Prometheus itself, since it has its own /metrics endpoint
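
Because scrape settings can be set globally and per job, it can help to see how they combine. This Python sketch mirrors the prometheus.yml above as a dict (Python's standard library has no YAML parser) and resolves each job's effective interval, timeout and targets. The defaults (1m interval, 10s timeout) are Prometheus's documented ones; capping an inherited timeout at the job's interval is my understanding of Prometheus's behaviour, not something stated in these notes.

```python
import re

# Prometheus's documented defaults when the global section omits them.
DEFAULTS = {"scrape_interval": "1m", "scrape_timeout": "10s", "evaluation_interval": "1m"}
UNITS = {"ms": 0.001, "s": 1, "m": 60, "h": 3600, "d": 86400}

# The prometheus.yml from these notes, mirrored as a Python dict.
CONFIG = {
    "global": {"scrape_interval": "15s", "evaluation_interval": "15s"},
    "scrape_configs": [
        {"job_name": "prometheus",
         "static_configs": [{"targets": ["localhost:9090"]}]},
        {"job_name": "node_exporter", "scrape_interval": "1m", "scrape_timeout": "1m",
         "static_configs": [{"targets": ["localhost:9100"]}]},
    ],
}


def parse_duration(text):
    """Parse simple single-unit durations such as '15s' or '1m' into seconds."""
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d)", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return float(int(match.group(1)) * UNITS[match.group(2)])


def effective_jobs(config):
    """Resolve each job's scrape interval/timeout: job settings override global ones."""
    glob = {**DEFAULTS, **config.get("global", {})}
    global_interval = parse_duration(glob["scrape_interval"])
    global_timeout = parse_duration(glob["scrape_timeout"])
    jobs = {}
    for job in config.get("scrape_configs", []):
        interval = (parse_duration(job["scrape_interval"])
                    if "scrape_interval" in job else global_interval)
        if "scrape_timeout" in job:
            timeout = parse_duration(job["scrape_timeout"])
            if timeout > interval:
                raise ValueError(f"{job['job_name']}: scrape_timeout exceeds scrape_interval")
        else:
            # An inherited timeout is capped at the job's interval.
            timeout = min(global_timeout, interval)
        targets = [t for sc in job.get("static_configs", []) for t in sc["targets"]]
        jobs[job["job_name"]] = (interval, timeout, targets)
    return jobs
```

With this config, the prometheus job is scraped every 15s (10s timeout) and node_exporter every 60s (60s timeout).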

Alert Manager

  • How does Prometheus trigger the alerts?
  • Who receives the alerts?

Prometheus Server => push alerts => Alertmanager => Email, Slack, etc.
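
In a real setup Prometheus evaluates the alerting rules listed in rule_files and pushes firing alerts to Alertmanager by itself. The Python sketch below only imitates that data flow for the "alert at 50% memory usage" example: evaluate a threshold, then build the JSON request for Alertmanager's v2 alerts endpoint. The alert name, labels and the default port 9093 are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from urllib.request import Request

ALERTMANAGER_URL = "http://localhost:9093"  # assumption: Alertmanager's default port
MEMORY_THRESHOLD = 0.5  # the "trigger alert at 50%" example from these notes


def evaluate_memory_rule(instance, used_bytes, total_bytes):
    """Return an alert if memory usage is above the threshold, else None."""
    usage = used_bytes / total_bytes
    if usage <= MEMORY_THRESHOLD:
        return None
    return {
        # Labels identify the alert; Alertmanager groups and routes on them.
        "labels": {"alertname": "HighMemoryUsage", "instance": instance,
                   "severity": "warning"},
        "annotations": {"summary": f"Memory usage on {instance} is {usage:.0%}"},
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }


def build_alert_request(alerts, alertmanager=ALERTMANAGER_URL):
    """POST a JSON list of alerts to Alertmanager's v2 API."""
    return Request(f"{alertmanager}/api/v2/alerts",
                   data=json.dumps(alerts).encode("utf-8"),
                   method="POST",
                   headers={"Content-Type": "application/json"})
```

Which receiver (Email, Slack, etc.) gets the alert is decided by Alertmanager's own routing configuration, not by the sender.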

Prometheus Data Storage

Where does Prometheus store the data?

  • Local - on disk (HDD/SSD)

  • Remote storage systems

  • Custom time series format

    • Prometheus data can't be written directly into a relational database
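
Why a custom format rather than relational tables? Prometheus identifies each time series by its metric name plus its full label set, and appends (timestamp, value) samples to it. This Python sketch shows only that data model in memory; Prometheus's actual storage engine persists compressed blocks to disk. TinyTSDB and series_key are hypothetical names.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def series_key(name, labels):
    """A series is identified by metric name plus its full, sorted label set."""
    return (name, tuple(sorted(labels.items())))


class TinyTSDB:
    """In-memory illustration: series key -> parallel lists of times and values."""

    def __init__(self):
        self._series = defaultdict(lambda: ([], []))

    def append(self, name, labels, ts_ms, value):
        times, values = self._series[series_key(name, labels)]
        if times and ts_ms <= times[-1]:
            raise ValueError("samples must be appended in timestamp order")
        times.append(ts_ms)
        values.append(value)

    def range(self, name, labels, start_ms, end_ms):
        """Return (timestamp, value) samples with start_ms <= t <= end_ms."""
        times, values = self._series.get(series_key(name, labels), ([], []))
        lo, hi = bisect_left(times, start_ms), bisect_right(times, end_ms)
        return list(zip(times[lo:hi], values[lo:hi]))
```

Appends are always in time order and reads are time ranges, which is the access pattern a time series store is optimised for.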

PromQL Query Language

Prometheus Web UI => PromQL => Prometheus Server
Data Visualization Tools => PromQL => Prometheus Server

  • Query the Prometheus server directly (e.g. in its Web UI)
  • Or use more powerful visualization tools - e.g. Grafana

PromQL Query

Query all HTTP requests except those with a 4xx status code

http_requests_total{status!~"4.."}
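
PromQL regex label matchers are fully anchored, so "4.." matches exactly three-character values starting with 4, and a missing label is treated as an empty string. This Python sketch emulates the selector over a hypothetical list of series, using re.fullmatch for the anchoring (Python's re stands in for the RE2 engine Prometheus uses; they agree for a simple pattern like this one).

```python
import re


def select(series, name, label, pattern, negate=False):
    """Emulate a PromQL selector like name{label!~"pattern"} over (name, labels)
    pairs. negate=False behaves like =~ and negate=True like !~."""
    regex = re.compile(pattern)
    selected = []
    for metric, labels in series:
        if metric != name:
            continue
        # Full match, like PromQL's anchored regexes; missing label -> "".
        matched = regex.fullmatch(labels.get(label, "")) is not None
        if matched != negate:
            selected.append((metric, labels))
    return selected


SERIES = [
    ("http_requests_total", {"status": "200"}),
    ("http_requests_total", {"status": "404"}),
    ("http_requests_total", {"status": "500"}),
    ("http_requests_total", {"status": "4040"}),
    ("up", {"status": "200"}),
]
not_4xx = select(SERIES, "http_requests_total", "status", "4..", negate=True)
```

Note that the invented "4040" value is kept: because of the anchoring, "4.." does not match it.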

Returns the 5-minute rate of the http_requests_total metric for the past 30 minutes, at the default resolution (the global evaluation interval)

rate(http_requests_total[5m])[30m:]
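
The arithmetic behind this query can be sketched in Python. simple_rate is a simplified stand-in for PromQL's rate(): per-second increase of a counter over a window, treating a drop in value as a counter reset; unlike rate(), it does not extrapolate to the window edges. rate_subquery then evaluates it at regular steps over the last 30 minutes. The 60 s step (i.e. [30m:1m]) and the sample data are assumptions for the example.

```python
def simple_rate(samples):
    """Per-second increase of a counter across (timestamp_s, value) samples.

    A drop in value is treated as a counter reset (the process restarted and
    the counter began again from 0). Unlike PromQL's rate(), this sketch does
    not extrapolate to the edges of the window.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    return increase / (samples[-1][0] - samples[0][0])


def rate_subquery(samples, window_s, range_s, step_s, now_s):
    """Simplified subquery: evaluate simple_rate over the trailing window_s
    seconds at every step_s seconds within the last range_s seconds, similar
    to rate(x[5m])[30m:1m] with window_s=300, range_s=1800, step_s=60."""
    points = []
    t = now_s - range_s + step_s
    while t <= now_s:
        window = [(ts, v) for ts, v in samples if t - window_s < ts <= t]
        points.append((t, simple_rate(window)))
        t += step_s
    return points


# A counter scraped every 60 s that grows by 30 per scrape (0.5 requests/s).
SAMPLES = [(i * 60, 30.0 * i) for i in range(41)]
points = rate_subquery(SAMPLES, window_s=300, range_s=1800, step_s=60, now_s=2400)
```

The result is a list of 30 (time, rate) points, which is the kind of range a graphing tool like Grafana would plot.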

Prometheus Characteristics

Pros

  • reliable
  • stand-alone and self-contained
  • keeps working even if other parts of the infrastructure are broken
  • no extensive set-up needed
  • less complex

Cons

  • difficult to scale
  • a single Prometheus server limits how much you can monitor

Workarounds

  • increase Prometheus server capacity
  • limit the number of metrics

Prometheus with Docker and Kubernetes

  • fully compatible
  • Prometheus components available as Docker images
  • can easily be deployed in Container Environments like Kubernetes
  • Monitoring of K8s Cluster Node Resources out of the box!

References

How Prometheus Monitoring works | Prometheus Architecture explained