Nowadays, understanding what is happening inside a system is essential, and this is usually done by instrumenting it to collect metrics, logs, or traces. You need a proactive approach to infrastructure and application issues so that you catch them before they lead to downtime, and this is where observability comes into the picture.
In the cloud, observability can be hard to achieve due to sheer system complexity. Whether in data centers or in the cloud, to achieve operational excellence and meet business objectives, you need to understand how your systems are performing. Observability solutions enable you to collect and analyze data from applications and infrastructure so that you can understand their internal states and be alerted to, troubleshoot, and resolve issues with application availability and performance to improve the end-user experience.
To achieve this observability on AWS, you can use Amazon CloudWatch, a monitoring and management service that provides data and actionable insights for AWS, hybrid, and on-premises applications and infrastructure resources. With CloudWatch, you can collect and access all your performance and operational data in the form of logs and metrics from a single platform.
You can correlate your metrics and logs to better understand the health and performance of your resources. You can also create alarms based on metric value thresholds you specify, or alarms that watch for anomalous metric behavior based on machine learning algorithms. To take action quickly, you can set up automated actions that notify you if an alarm is triggered and that automatically start auto scaling, for example, to help reduce mean-time-to-resolution. You can also dive deep and analyze your metrics, logs, and traces to better understand how to improve application performance, as shown below.
CloudWatch In Action
In layman's terms, Amazon CloudWatch is basically a metrics repository: a place where metrics are deposited and stored. The AWS services that you want to monitor put their metrics into this repository, and you retrieve metric data from it to derive statistics. CloudWatch has its own built-in metrics, but it does not limit you to those; it also allows you to define custom metrics and use them in your statistics.
You can use metrics to calculate statistics and then present the data graphically in the CloudWatch console. CloudWatch also has a very nice feature for creating custom dashboards.
In the diagram, you can see that different AWS resources send their metrics to CloudWatch, where they are categorized under namespaces. Based on the available metrics, you can derive statistics, or you can capture an event and act on it, either by sending an SNS notification or by triggering a Lambda function.
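To make the idea of pushing your own metrics into the repository concrete, here is a minimal sketch in Python. It assumes boto3 and AWS credentials are available when `publish` is called; the metric name `OrdersProcessed`, the dimension values, and the helper names are hypothetical and only for illustration.

```python
from datetime import datetime, timezone


def build_metric_datum(name, value, unit="Count", dimensions=None, timestamp=None):
    """Build one entry for the MetricData list accepted by PutMetricData."""
    datum = {
        "MetricName": name,
        "Value": float(value),
        "Unit": unit,
        "Timestamp": timestamp or datetime.now(timezone.utc),
    }
    if dimensions:
        # Dimensions are name/value pairs that become part of the metric identity.
        datum["Dimensions"] = [
            {"Name": key, "Value": val} for key, val in sorted(dimensions.items())
        ]
    return datum


def publish(namespace, data, client=None):
    """Send data points to CloudWatch (needs boto3 and AWS credentials)."""
    if client is None:
        import boto3  # imported lazily so the builder above works without it
        client = boto3.client("cloudwatch")
    return client.put_metric_data(Namespace=namespace, MetricData=data)


# Hypothetical custom metric; nothing is sent until publish() is called.
datum = build_metric_datum(
    "OrdersProcessed", 42, dimensions={"Service": "checkout", "Stage": "prod"}
)
print(datum["MetricName"], datum["Value"], datum["Dimensions"])
```

Calling `publish("MyApp/Orders", [datum])` would store the data point under the (hypothetical) `MyApp/Orders` namespace, from where it can be graphed or used in alarms like any built-in metric.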
Unified CloudWatch Agent
The unified CloudWatch agent enables you to do the following: Collect internal system-level metrics from Amazon EC2 instances across operating systems. ... Collect logs from Amazon EC2 instances and on-premises servers, running either Linux or Windows Server.
The CloudWatch agent is supported on x86-64 architecture on the following operating systems:
- Amazon Linux version 2014.03.02 or later
- Amazon Linux 2
- Ubuntu Server versions 20.04, 18.04, 16.04, and 14.04
- CentOS versions 8.0, 7.6, 7.2, and 7.0
- Red Hat Enterprise Linux (RHEL) versions 8, 7.7, 7.6, 7.5, 7.4, 7.2, and 7.0
- Debian version 10 and version 8.0
- SUSE Linux Enterprise Server (SLES) version 15 and version 12
- Oracle Linux versions 7.8, 7.6, and 7.5
- macOS, including EC2 Mac1 instances
- 64-bit versions of Windows Server 2019, Windows Server 2016, and Windows Server 2012
The agent is supported on ARM64 architecture on the following operating systems:
- Amazon Linux 2
- Ubuntu Server versions 20.04 and 18.04
- Red Hat Enterprise Linux (RHEL) version 7.6
- SUSE Linux Enterprise Server 15
You can download and install the CloudWatch agent manually from the command line, or you can install it through AWS Systems Manager (SSM). When you install the CloudWatch agent on your machine, you also get a utility known as the CloudWatch agent wizard, which helps you create the agent configuration.
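As a rough illustration of what the agent configuration looks like, the sketch below builds a small configuration dictionary in Python and prints it as JSON. The log file path, the log group name `demo-system-logs`, and the helper name are assumptions for the example; a real configuration usually contains more sections than shown here.

```python
import json


def build_agent_config(log_files, namespace="CWAgent", interval=60):
    """Return a CloudWatch agent configuration as a plain dict.

    log_files maps a file path on the instance to the log group
    that the file should be streamed to.
    """
    return {
        "agent": {"metrics_collection_interval": interval},
        "metrics": {
            "namespace": namespace,
            "metrics_collected": {
                # Memory and disk usage are not reported by EC2 itself.
                "mem": {"measurement": ["mem_used_percent"]},
                "disk": {"measurement": ["used_percent"], "resources": ["/"]},
            },
        },
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": path,
                            "log_group_name": group,
                            "log_stream_name": "{instance_id}",
                        }
                        for path, group in log_files.items()
                    ]
                }
            }
        },
    }


config = build_agent_config({"/var/log/messages": "demo-system-logs"})
print(json.dumps(config, indent=2))
```

The printed JSON has the same top-level shape (`agent`, `metrics`, `logs`) as a file produced by the wizard, so it can serve as a starting point to compare against your own generated file.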
Stream EC2 logs to CloudWatch using the CloudWatch agent wizard
To know more about the CloudWatch agent wizard, click here.
Amazon CloudWatch Pricing
AWS CloudWatch pricing comes in two tiers: free and paid. The paid tier of CloudWatch has no upfront fees or commitments and is billed at the end of the month based on usage. Keep in mind that CloudWatch prices vary by region and are subject to change. Below are the current prices listed by Amazon:
Metrics (per metric per month):
- First 10,000 metrics - $0.30
- Next 240,000 metrics - $0.10
- Next 750,000 metrics - $0.05
- Over 1,000,000 metrics - $0.02
API requests:
- GetMetricData, GetInsightRuleReport - $0.01 per 1,000 metrics requested
- GetMetricWidgetImage - $0.02 per 1,000 metrics requested
- GetMetricStatistics, ListMetrics, PutMetricData, GetDashboard, ListDashboards, PutDashboard and DeleteDashboards requests - $0.01 per 1,000 requests
Dashboards:
- $3.00 per dashboard
Alarms:
- Standard resolution - $0.10 per alarm metric
- High resolution - $0.30 per alarm metric
- Standard resolution anomaly detection - $0.30 per alarm
- High resolution anomaly detection - $0.90 per alarm
- Composite - $0.50 per alarm
Logs:
- Collect (data ingestion) - $0.67 per GB
- Store (archival) - $0.033 per GB
- Analyze (logs insights queries) - $0.0067 per GB of data scanned
Events:
- Custom Events - $1.00 per million events
- Cross-Account Events - $1.00 per million events
Contributor Insights:
- Contributor Insights Rule - $0.50 per rule per month
- Matched Log Events - $0.027 per one million log events that match the rule per month
Canaries:
- $0.0017 per canary run
Note: These prices may change when AWS updates them, so it is always recommended to check the latest pricing on the official AWS website here.
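The tiered metric prices listed above can be turned into a quick estimate. The sketch below assumes the four metric tiers quoted in this post; the function name and the tier table are illustrative, and the figures should be replaced with the current prices for your region. For example, 15,000 custom metrics would cost 10,000 × $0.30 + 5,000 × $0.10 = $3,500 per month.

```python
# (tier size in metrics, price per metric per month); None means "everything above".
METRIC_TIERS = [
    (10_000, 0.30),
    (240_000, 0.10),
    (750_000, 0.05),
    (None, 0.02),
]


def monthly_metric_cost(metric_count, tiers=METRIC_TIERS):
    """Estimate the monthly charge for a number of custom metrics."""
    remaining = metric_count
    total = 0.0
    for size, price in tiers:
        # Fill the current tier first, then move on to the cheaper one.
        in_tier = remaining if size is None else min(remaining, size)
        total += in_tier * price
        remaining -= in_tier
        if remaining <= 0:
            break
    return round(total, 2)


for count in (500, 15_000, 1_200_000):
    print(count, monthly_metric_cost(count))
```

This only covers the per-metric charge; API requests, alarms, logs, and dashboards are billed separately and would need their own terms in a fuller estimate.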
Benefits of CloudWatch
- Observability on a single platform across applications and infrastructure :- Modern applications such as those running on microservices architectures generate large volumes of data in the form of metrics, logs, and events. Amazon CloudWatch enables you to collect, access, and correlate this data on a single platform from across all your AWS resources, applications, and services that run on AWS and on-premises servers, helping you break down data silos so you can easily gain system-wide visibility and quickly resolve issues.
- Easiest way to collect metrics in AWS and on-premises :- Monitoring your AWS resources and applications is easy with CloudWatch. It natively integrates with more than 70 AWS services such as Amazon EC2, Amazon DynamoDB, Amazon S3, Amazon ECS, Amazon EKS, and AWS Lambda, and automatically publishes detailed 1-minute metrics and custom metrics with up to 1-second granularity so you can dive deep into your logs for additional context. You can also use CloudWatch in hybrid cloud architectures by using the CloudWatch Agent or API to monitor your on-premises resources.
- Improve operational performance and resource optimization :- Amazon CloudWatch enables you to set alarms and automate actions based on either predefined thresholds, or on machine learning algorithms that identify anomalous behavior in your metrics. For example, it can start Amazon EC2 Auto Scaling automatically, or stop an instance to reduce billing overages. You can also use CloudWatch Events for serverless to trigger workflows with services like AWS Lambda, Amazon SNS, and AWS CloudFormation.
- Get operational visibility and insight :- To optimize performance and resource utilization, you need a unified operational view, real-time granular data, and historical reference. CloudWatch provides automatic dashboards, data with 1-second granularity, and up to 15 months of metrics storage and retention. You can also perform metric math on your data to derive operational and utilization insights; for example, you can aggregate usage across an entire fleet of EC2 instances.
- Derive actionable insights from logs :- CloudWatch enables you to explore, analyze, and visualize your logs so you can troubleshoot operational problems with ease. With CloudWatch Logs Insights, you only pay for the queries you run. It scales with your log volume and query complexity giving you answers in seconds. In addition, you can publish log-based metrics, create alarms, and correlate logs and metrics together in CloudWatch Dashboards for complete operational visibility.
Use Cases of CloudWatch
- Infrastructure monitoring and troubleshooting :- Monitor key metrics and logs, visualize your application and infrastructure stack, create alarms, and correlate metrics and logs to understand and resolve root cause of performance issues in your AWS resources. This includes monitoring your container ecosystem across Amazon ECS, AWS Fargate, Amazon EKS, and Kubernetes.
- Mean-time-to-resolution improvement :- CloudWatch helps you correlate, visualize, and analyze metrics and logs, so you can act quickly to resolve issues, and combine them with trace data from AWS X-Ray for end-to-end observability. You can also analyze user requests to help speed up troubleshooting and debugging, and reduce overall mean-time-to-resolution (MTTR).
- Proactive resource optimization :- CloudWatch alarms watch your metric values against thresholds that either you specify, or that CloudWatch creates for you using machine learning models to detect anomalous behavior. If an alarm is triggered, CloudWatch can take action automatically to enable Amazon EC2 Auto Scaling or stop an instance, for example, so you can automate capacity and resource planning.
- Application monitoring :- Monitor your applications that run on AWS (on Amazon EC2, containers, and serverless) or on-premises. CloudWatch collects data at every layer of the performance stack, including metrics and logs on automatic dashboards.
- Log analytics:- Explore, analyze, and visualize your logs to address operational issues and improve applications performance. You can perform queries to help you quickly and effectively respond to operational issues. If an issue occurs, you can start querying immediately using a purpose-built query language to rapidly identify potential causes.
Features Of CloudWatch
- Easily collect and store logs :- The Amazon CloudWatch Logs service allows you to collect and store logs from your resources, applications, and services in near real-time.
- Built-in metrics :- Collecting metrics from distributed applications (such as those built using microservices architectures) is time consuming. Amazon CloudWatch allows you to collect default metrics from more than 70 AWS services, such as Amazon EC2, Amazon DynamoDB, Amazon S3, Amazon ECS, AWS Lambda, and Amazon API Gateway, without any action on your part.
- Custom Metrics :- Amazon CloudWatch allows you to collect custom metrics from your own applications to monitor operational performance, troubleshoot issues, and spot trends.
- Collect and aggregate container metrics and logs :- Container Insights simplifies the collection and aggregation of curated metrics and container ecosystem logs. It collects compute performance metrics such as CPU, memory, network, and disk information from each container as performance events and automatically generates custom metrics used for monitoring and alarming.
- Collect and aggregate Lambda metrics and logs :- CloudWatch Lambda Insights simplifies the collection and aggregation of curated metrics and logs from AWS Lambda functions. It collects compute performance metrics such as CPU, memory, and network from each Lambda function as performance events, while automatically generating custom metrics used for monitoring and alarming. The performance events are ingested as CloudWatch logs to simplify monitoring and troubleshooting. CloudWatch custom metrics are automatically extracted from these ingested logs and can be further analyzed using CloudWatch Logs Insights’ advanced query language.
- Unified operational view with dashboards:- Amazon CloudWatch dashboards enable you to create re-usable graphs and visualize your cloud resources and applications in a unified view. You can graph metrics and logs data side by side in a single dashboard to quickly get the context and go from diagnosing the problem to understanding the root cause.
- Composite alarms :- Amazon CloudWatch composite alarms allow you to combine multiple alarms and reduce alarm noise. If an application issue affects several resources in an application, you will receive a single alarm notification for the entire application instead of one for each affected service component or resource.
- High resolution alarms :- Amazon CloudWatch alarms allow you to set a threshold on metrics and trigger an action. You can create high-resolution alarms, set a percentile as the statistic, and either specify an action or ignore as appropriate.
- Logs and metrics correlation :- Applications and infrastructure resources generate lots of operational and monitoring data in the form of logs and metrics. In addition to providing the ability to access and visualize these data sets in a single platform, Amazon CloudWatch also makes it easy to correlate metrics and logs.
- Application Insights :- Amazon CloudWatch Application Insights provides automated setup of observability for your enterprise applications, so you can get visibility into the health of such applications. It helps identify and set up key metrics and logs across your application resources and technology stack i.e. database, web (IIS) and application servers, Operating System, load balancers, queues, etc. It constantly monitors these telemetry data to detect and correlate anomalies and errors, to notify you of any problems in your application. To aid in troubleshooting, it creates automated dashboards for the detected problems with correlated metric anomalies and log errors, along with additional insights to point you to their potential root-cause. This enables you to take quick remedial actions to ensure that your applications are healthy and end-users are not impacted.
- Container monitoring insights :- Container Insights provides automatic dashboards in the CloudWatch console. These dashboards summarize the compute performance, errors, and alarms by cluster, pod/task, and service. For Amazon EKS and k8s, dashboards are also available for nodes/EC2 instances and namespaces. Each dashboard summarizes the list of running pods/tasks or containers by CPU and memory for the selected time window, and allows you to contextually - based on time window and selected pod/task or container - dive deeper into application logs, AWS X-Ray traces, and performance events.
- Lambda monitoring insights :-Lambda Insights provides automatic dashboards in the CloudWatch console. These dashboards summarize the compute performance and errors. Each dashboard includes the list of metrics for the selected time window and allows you to contextually dive deeper — based on time window and selected function — into application logs, AWS X-Ray traces, and performance events.
- Anomaly Detection :-Amazon CloudWatch Anomaly Detection applies machine-learning algorithms to continuously analyze data of a metric and identify anomalous behavior. It allows you to create alarms that auto-adjust thresholds based on natural metric patterns, such as time of day, day of week seasonality, or changing trends. You can also visualize metrics with anomaly detection bands on dashboards. This enables you to monitor, isolate, and troubleshoot unexpected changes in your metrics.
- ServiceLens :- You can use Amazon CloudWatch ServiceLens to visualize and analyze the health, performance, and availability of your applications in a single place. CloudWatch ServiceLens ties together CloudWatch metrics and logs as well as traces from AWS X-Ray to give you a complete view of your applications and their dependencies. This enables you to quickly pinpoint performance bottlenecks, isolate root causes of application issues, and determine users impacted. CloudWatch ServiceLens enables you to gain visibility into your applications in three main areas: Infrastructure monitoring (using metrics and logs to understand the resources supporting your applications), transaction monitoring (using traces to understand dependencies between your resources), and end user monitoring (using canaries to monitor your endpoints and notify you when your end user experience has degraded). CloudWatch ServiceLens provides a Service Map that visualizes the contextual linking of all your resources, along with an intuitive interface so you can dive deep into correlated monitoring data.
- Synthetics :- Amazon CloudWatch Synthetics allows you to monitor application endpoints more easily. It runs tests on your endpoints every minute, 24x7, and alerts you as soon as your application endpoints don’t behave as expected. These tests can be customized to check for availability, latency, transactions, broken or dead links, step by step task completions, page load errors, load latencies for UI assets, complex wizard flows, or checkout flows in your applications. You can also use CloudWatch Synthetics to isolate alarming application endpoints and map them back to underlying infrastructure issues to reduce mean time to resolution. With this new feature, CloudWatch now collects canary traffic, which can continually verify your customer experience even when you don’t have any customer traffic on your applications, enabling you to discover issues before your customers do. CloudWatch Synthetics supports monitoring of your REST APIs, URLs, and website content, checking for unauthorized changes from phishing, code injection and cross-site scripting.
- Stream metrics :- Amazon CloudWatch Metric Streams enables you to create continuous, near real-time streams of metrics to a destination of your choice. Metrics Streams makes it easier to send CloudWatch metrics to popular third-party service providers using an Amazon Kinesis Data Firehose HTTP endpoint. You can create a continuous, scalable stream including the most up-to-date CloudWatch metrics data to power dashboards, alarms, and other tools that rely on accurate and timely metric data. You can also easily direct your metrics to your data lake on AWS such as on Amazon Simple Storage Service (S3), ready to start analyzing usage or performance with tools such as Amazon Athena.
Core Concepts Of CloudWatch
- Namespaces:- A namespace is a container for CloudWatch metrics. Metrics in different namespaces are isolated from each other, so that metrics from different applications are not mistakenly aggregated into the same statistics.
- Metrics :- Metrics are the fundamental concept in CloudWatch. A metric represents a time-ordered set of data points that are published to CloudWatch. Think of a metric as a variable to monitor, and the data points as representing the values of that variable over time. For example, the CPU usage of a particular EC2 instance is one metric provided by Amazon EC2. The data points themselves can come from any application or business activity from which you collect data.
- Time stamps :- Each metric data point must be associated with a time stamp. The time stamp can be up to two weeks in the past and up to two hours into the future. If you do not provide a time stamp, CloudWatch creates a time stamp for you based on the time the data point was received.
- Metrics retention :- CloudWatch retains metric data as follows:
- Data points with a period of less than 60 seconds are available for 3 hours. These data points are high-resolution custom metrics.
- Data points with a period of 60 seconds (1 minute) are available for 15 days
- Data points with a period of 300 seconds (5 minute) are available for 63 days
- Data points with a period of 3600 seconds (1 hour) are available for 455 days (15 months)
Data points that are initially published with a shorter period are aggregated together for long-term storage. For example, if you collect data using a period of 1 minute, the data remains available for 15 days with 1-minute resolution. After 15 days this data is still available, but is aggregated and is retrievable only with a resolution of 5 minutes. After 63 days, the data is further aggregated and is available with a resolution of 1 hour.
- Dimensions :- A dimension is a name/value pair that is part of the identity of a metric. You can assign up to 10 dimensions to a metric.
- Resolution :- Each metric is one of the following:
- Standard resolution, with data having a one-minute granularity
- High resolution, with data at a granularity of one second
- Statistics:- Statistics are metric data aggregations over specified periods of time. CloudWatch provides statistics based on the metric data points provided by your custom data or provided by other AWS services to CloudWatch. Aggregations are made using the namespace, metric name, dimensions, and the data point unit of measure, within the time period you specify.
- Units :- Each statistic has a unit of measure. Example units include Bytes, Seconds, Count, and Percent. For the complete list of the units that CloudWatch supports, see the MetricDatum data type in the Amazon CloudWatch API Reference.
- Periods :- A period is the length of time associated with a specific Amazon CloudWatch statistic. Each statistic represents an aggregation of the metrics data collected for a specified period of time. Periods are defined in numbers of seconds, and valid values for period are 1, 5, 10, 30, or any multiple of 60. For example, to specify a period of six minutes, use 360 as the period value. You can adjust how the data is aggregated by varying the length of the period. A period can be as short as one second or as long as one day (86,400 seconds). The default value is 60 seconds.
- Aggregation :- Amazon CloudWatch aggregates statistics according to the period length that you specify when retrieving statistics. You can publish as many data points as you want with the same or similar time stamps. CloudWatch aggregates them according to the specified period length. CloudWatch does not automatically aggregate data across Regions, but you can use metric math to aggregate metrics from different Regions.
- Alarms :- You can use an alarm to automatically initiate actions on your behalf. An alarm watches a single metric over a specified time period, and performs one or more specified actions, based on the value of the metric relative to a threshold over time. The action is a notification sent to an Amazon SNS topic or an Auto Scaling policy. You can also add alarms to dashboards.
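The period and retention rules described above can be expressed as two small helper functions. This is only a sketch of the rules as stated in this post; the function names are hypothetical and the numbers are taken directly from the lists above.

```python
def is_valid_period(seconds):
    """Valid periods are 1, 5, 10, 30, or any multiple of 60, up to one day."""
    if seconds in (1, 5, 10, 30):
        return True
    return 0 < seconds <= 86_400 and seconds % 60 == 0


def finest_period(age_days, high_resolution=False):
    """Return the smallest period (in seconds) still retrievable for data
    of the given age, or None once the data has aged out completely."""
    # Sub-minute data points exist only for high-resolution custom metrics.
    if high_resolution and age_days <= 3 / 24:
        return 1
    for max_age_days, period in ((15, 60), (63, 300), (455, 3600)):
        if age_days <= max_age_days:
            return period
    return None


print(is_valid_period(360), is_valid_period(45))
print(finest_period(1), finest_period(20), finest_period(100), finest_period(500))
```

A helper like `finest_period` is handy when building queries over older data: asking for a 60-second period on data that is 20 days old would need to fall back to 300 seconds.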
a) CloudWatch Metrics for EC2
AWS Provided metrics:
- Basic Monitoring (default): metrics are collected at a 5-minute interval
- Detailed Monitoring (paid): metrics are collected at a 1-minute interval
- Includes CPU, Network, Disk, and Status Check metrics
Custom metrics (yours to push):
- Standard resolution: 1-minute granularity
- High resolution: down to 1-second granularity
- Include RAM and application-level metrics
- Make sure the IAM permissions on the EC2 instance role are correct
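To show how the monitoring interval affects the way you read EC2 metrics, here is a small sketch that builds the arguments for a `get_metric_statistics` call on `CPUUtilization`. The instance ID and the helper name are placeholders; the period is chosen to match basic (300 seconds) or detailed (60 seconds) monitoring.

```python
from datetime import datetime, timedelta, timezone


def cpu_stats_request(instance_id, hours=1, detailed_monitoring=False, now=None):
    """Build keyword arguments for cloudwatch.get_metric_statistics()."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        # Asking for a finer period than the data was collected at gains nothing.
        "Period": 60 if detailed_monitoring else 300,
        "Statistics": ["Average", "Maximum"],
    }


# Placeholder instance ID, for illustration only.
request = cpu_stats_request("i-0123456789abcdef0")
print(request["Period"], request["Statistics"])
```

With boto3 installed and credentials configured, the dictionary can be passed as `boto3.client("cloudwatch").get_metric_statistics(**request)`.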
b) CloudWatch Metrics for Load Balancers
- HealthyHostCount / UnHealthyHostCount
- HTTPCode_Backend_2XX: Successful requests.
- HTTPCode_Backend_3XX: Redirected requests.
- HTTPCode_ELB_4XX: Client error codes.
- HTTPCode_ELB_5XX: Server error codes generated by the load balancer.
- SurgeQueueLength: The total number of requests (HTTP listener) or connections (TCP listener) that are pending routing to a healthy instance. It is a useful signal for scaling out the ASG. The maximum value is 1024.
- SpilloverCount: The total number of requests that were rejected because the surge queue is full.
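Since a growing surge queue is an early warning that the load balancer's backends are saturated, it is a natural metric to alarm on. The sketch below builds the arguments for a `put_metric_alarm` call; the threshold of 100, the alarm naming scheme, and the SNS topic ARN are assumptions for illustration, not recommended values.

```python
def surge_queue_alarm(load_balancer_name, sns_topic_arn, threshold=100):
    """Build keyword arguments for cloudwatch.put_metric_alarm()."""
    return {
        "AlarmName": f"{load_balancer_name}-surge-queue",
        "Namespace": "AWS/ELB",
        "MetricName": "SurgeQueueLength",
        "Dimensions": [{"Name": "LoadBalancerName", "Value": load_balancer_name}],
        # The peak queue length matters more than the average.
        "Statistic": "Maximum",
        "Period": 60,
        "EvaluationPeriods": 3,
        "Threshold": float(threshold),
        "ComparisonOperator": "GreaterThanThreshold",
        # No data usually means no queued requests, so do not alarm on gaps.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


# Placeholder names, for illustration only.
alarm = surge_queue_alarm(
    "demo-elb", "arn:aws:sns:us-east-1:123456789012:demo-alerts"
)
print(alarm["AlarmName"], alarm["Threshold"])
```

The same dictionary could point `AlarmActions` at an Auto Scaling policy ARN instead of an SNS topic if the goal is to scale out rather than notify.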
c) CloudWatch Metrics for ASG :- Metrics are collected every 1 minute
- GroupMinSize, GroupMaxSize, GroupDesiredCapacity
- GroupInServiceInstances, GroupPendingInstances, GroupStandbyInstances
- GroupTerminatingInstances, GroupTotalInstances
d) CloudWatch Metrics For EFS
- PercentIOLimit :- How close the file system is to reaching its I/O limit (General Purpose mode). If it is at 100%, consider migrating to Max I/O.
- BurstCreditBalance :- The number of burst credits the file system can use to achieve higher throughput levels.
- StorageBytes :- The file system's size in bytes (15-minute interval). Dimensions: Standard, IA, Total (Standard + IA).
e) CloudWatch metrics associated with RDS
- ReadIOPS / WriteIOPS
- ReadLatency / WriteLatency
- ReadThroughput / WriteThroughput
Analyze, Query, And Visualize AWS CloudWatch Logs Using Logs Insights
CloudWatch Logs Insights enables you to interactively search and analyze your log data in Amazon CloudWatch Logs. You can perform queries to help you more efficiently and effectively respond to operational issues. If an issue occurs, you can use CloudWatch Logs Insights to identify potential causes and validate deployed fixes. For a deep dive into CloudWatch Logs Insights, click here.
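As a small illustration, the sketch below prepares a Logs Insights query that lists the 20 most recent log lines containing "ERROR" and builds the arguments for the `start_query` call of the boto3 `logs` client. The log group name and the helper name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Logs Insights query: newest 20 events whose message contains "ERROR".
ERROR_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /ERROR/ "
    "| sort @timestamp desc "
    "| limit 20"
)


def start_query_params(log_group, minutes=60, query=ERROR_QUERY, now=None):
    """Build keyword arguments for logs.start_query() over the last N minutes."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    return {
        "logGroupName": log_group,
        # start_query expects epoch seconds, not datetime objects.
        "startTime": int(start.timestamp()),
        "endTime": int(end.timestamp()),
        "queryString": query,
    }


# Hypothetical log group name, for illustration only.
params = start_query_params("/demo/app")
print(params["queryString"])
```

With boto3, `start_query(**params)` returns a query ID, and the results are then fetched by polling `get_query_results` with that ID until the query status is complete.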
Automate AWS CloudWatch Log Group Retention Using Lambda In Python
CloudWatch organizes logs into log groups, and when a new log group is created, its retention period is set to Never expire, which means the logs are retained forever. When we stream logs through the CloudWatch agent using its JSON configuration file, we also do not get an option to define the log group retention in that file.
What Is Accomplished By This Automation?
When a new CloudWatch log group is created, either directly in the console or by the CloudWatch agent, a CloudWatch Events rule triggers a Lambda function. The Lambda function then sets the desired retention time on the log group. Log data older than that retention time is deleted automatically.
For a deep dive into this automation, click here.
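A minimal sketch of such a Lambda function is shown below. It assumes the rule forwards the CloudTrail `CreateLogGroup` event, so that the log group name is found under `detail.requestParameters.logGroupName`; the `RETENTION_DAYS` environment variable and the function names are hypothetical choices for this example, not the exact code from the linked post.

```python
import os

# Hypothetical environment variable; 30 days is used when it is not set.
RETENTION_DAYS = int(os.environ.get("RETENTION_DAYS", "30"))


def apply_retention(logs_client, event, days=RETENTION_DAYS):
    """Set the retention policy on the log group named in the event."""
    # Assumed event shape: a CloudTrail CreateLogGroup event from the rule.
    log_group = event["detail"]["requestParameters"]["logGroupName"]
    logs_client.put_retention_policy(logGroupName=log_group, retentionInDays=days)
    return log_group


def lambda_handler(event, context):
    import boto3  # available by default in the Lambda Python runtime
    log_group = apply_retention(boto3.client("logs"), event)
    return {"logGroup": log_group, "retentionInDays": RETENTION_DAYS}
```

Keeping the logic in `apply_retention` and passing the client in makes the function easy to exercise locally with a stub client. The Lambda execution role needs permission for `logs:PutRetentionPolicy`.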
Automate Disabling And Re-enabling AWS CloudWatch Alarms During Maintenance Window
In this blog we create a Python script that disables all active CloudWatch alarms before a maintenance/deployment window and enables them again after the window. For a deep dive into this CloudWatch automation, click here.
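The core of such a script might look like the sketch below. It treats "active" as "alarm actions currently enabled", which is an assumption, and it only handles metric alarms; the function names are hypothetical. The client is expected to be a boto3 `cloudwatch` client.

```python
def alarms_with_actions_enabled(cw_client):
    """Return the names of metric alarms whose actions are currently enabled."""
    names = []
    paginator = cw_client.get_paginator("describe_alarms")
    for page in paginator.paginate():
        for alarm in page.get("MetricAlarms", []):
            if alarm.get("ActionsEnabled"):
                names.append(alarm["AlarmName"])
    return names


def chunks(items, size=100):
    """Yield slices of at most `size` items (the API accepts up to 100 names)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def set_alarm_actions(cw_client, names, enabled):
    """Enable or disable alarm actions for the given alarm names."""
    call = cw_client.enable_alarm_actions if enabled else cw_client.disable_alarm_actions
    for batch in chunks(names):
        call(AlarmNames=batch)
```

Before the window, collect the names with `alarms_with_actions_enabled`, save them (for example to a file), and call `set_alarm_actions(client, names, enabled=False)`. After the window, re-enable exactly that saved list, so alarms that were deliberately disabled beforehand stay disabled.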
Set AWS CloudWatch Log Groups Retention Policy For All Log Groups Using Python boto3 Script
In this blog we write a Python script using boto3 that sets the retention policy, in one go, for all existing log groups that have already been created in the account. For a deep dive into this CloudWatch automation, click here.
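A compact sketch of that idea follows. The function name, the 30-day default, and the `overwrite` switch are assumptions for the example; by default it only touches log groups that have no retention set, and `overwrite=True` applies the policy to every log group.

```python
def set_retention_everywhere(logs_client, days=30, overwrite=False):
    """Apply a retention policy to existing log groups and return their names."""
    updated = []
    paginator = logs_client.get_paginator("describe_log_groups")
    for page in paginator.paginate():
        for group in page["logGroups"]:
            # Groups set to "Never expire" have no retentionInDays key.
            if "retentionInDays" in group and not overwrite:
                continue
            logs_client.put_retention_policy(
                logGroupName=group["logGroupName"], retentionInDays=days
            )
            updated.append(group["logGroupName"])
    return updated
```

With boto3 installed and credentials configured, it can be run as `set_retention_everywhere(boto3.client("logs"), days=30)`; the returned list is useful for logging which groups were changed.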
As we have seen, CloudWatch is a powerful, cloud-native AWS monitoring service that provides complete visibility into your AWS resources and applications, and it has tremendous features to analyze, visualize, and query your logs, including Logs Insights, Lambda Insights, Container Insights, and more.
Stay tuned for my next blog.....