Detect operational anomalies in Serverless Applications with Amazon DevOps Guru - AWS User Group Berlin 2024

Vadym Kazulkin | @VKazulkin | ip.labs GmbH
Vadym Kazulkin, ip.labs, AWS User Group Berlin, 22 October 2024

Contact
Vadym Kazulkin
ip.labs GmbH Bonn, Germany
Co-Organizer of the Java User Group Bonn
v.kazulkin@gmail.com
@VKazulkin
https://dev.to/vkazulkin
https://github.com/Vadym79/
https://de.slideshare.net/VadymKazulkin/
https://www.linkedin.com/in/vadymkazulkin
https://www.iplabs.de/

About ip.labs
Amazon DevOps Guru for the Serverless Applications

DevOps Lifecycle

Amazon DevOps Guru

AIOps
Artificial Intelligence for IT Operations (AIOps) is the process of using
machine learning techniques to solve operational problems. The goal of
AIOps is to reduce human intervention in IT operations processes.
By using advanced machine learning techniques, you can reduce
operational incidents and increase service quality. AIOps can help you:
• Increase service quality, for example by grouping related incidents based
on time and language
• Predict incidents before they happen
https://aws.amazon.com/devops-guru

What is Amazon DevOps Guru
Amazon DevOps Guru offers a fully managed AIOps platform powered
by machine learning (ML) that is designed to make it easy to improve an
application’s operational performance and availability.
DevOps Guru helps detect behaviors that deviate from normal operating
patterns so you can identify operational issues long before they impact
your customers, for example:
• increased latency
• error rates (timeouts, throttles, CPU, memory, and disk utilization)
• resource constraints (exceeding AWS account limits)

Benefits of DevOps Guru

How DevOps Guru works

DevOps Guru is powered by pre-trained ML models
• AWS built domain-specific, single-purpose models that identify known failure
modes instead of modeling normal metric behavior.
• DevOps Guru relies on a large ensemble of detectors—statistical models
tuned to detect common adverse scenarios in a variety of operational
metrics.
• DevOps Guru detectors don’t need to be trained or configured. They
work instantly as long as enough history is available.
• Individual detectors work in preconfigured ensembles to generate
anomalies on some of the most important metrics: error rates,
availability, latency, incoming request rates, CPU, memory, and disk
utilization, among others.
https://aws.amazon.com/blogs/machine-learning/amazon-devops-guru-is-powered-by-pre-trained-ml-models-that-encode-operational-excellence/

DevOps Guru pre-trained ML detectors for metrics with periodic behavior
• Many metrics, such as the number of incoming requests in customer-facing APIs, exhibit periodic behavior.
• The purpose of the causal convolution detector is to analyze temporal data with such patterns and to determine expected periodic behavior.
• When the detector infers that a metric is periodic, it adapts normal metric behavior thresholds to the seasonal pattern.
https://aws.amazon.com/blogs/machine-learning/amazon-devops-guru-is-powered-by-pre-trained-ml-models-that-encode-operational-excellence/
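
The idea of adapting thresholds to a seasonal pattern can be illustrated with a small Python sketch. This is not the causal convolution detector mentioned on the slide; it swaps in a much simpler baseline (mean and standard deviation per position in the cycle), and the names `seasonal_bounds`, `find_anomalies`, `period` and `k` as well as the sample data are hypothetical.

```python
from statistics import mean, pstdev


def seasonal_bounds(history, period, k=3.0):
    """Learn a (low, high) band for every position in the cycle."""
    bounds = []
    for phase in range(period):
        samples = history[phase::period]  # all points at the same cycle position
        mu, sigma = mean(samples), pstdev(samples)
        bounds.append((mu - k * sigma, mu + k * sigma))
    return bounds


def find_anomalies(values, bounds, start_phase=0):
    """Return indexes of values that fall outside the band of their cycle position."""
    period = len(bounds)
    anomalies = []
    for i, value in enumerate(values):
        low, high = bounds[(start_phase + i) % period]
        if not low <= value <= high:
            anomalies.append(i)
    return anomalies


# Hypothetical request counts with a 4-step cycle: quiet, ramp up, peak, ramp down.
history = [10, 50, 100, 50, 12, 48, 104, 52, 11, 52, 98, 49]
bounds = seasonal_bounds(history, period=4)

# 100 requests is normal at the peak (index 2) but unusual in the quiet phase (index 4).
new_values = [11, 50, 100, 51, 100, 50, 101, 50]
print(find_anomalies(new_values, bounds))  # [4]
```

A single static threshold would either miss the spike in the quiet phase or fire on every regular peak; the per-phase band avoids both under these assumptions.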

What the future of software developers may look like

Monitoring & Alerting of the Serverless Applications


DevOps Guru Example Application
https://github.com/Vadym79/DevOpsGuruWorkshopDemo, inspired by https://github.com/aws-samples/serverless-java-frameworks-samples

DevOps Guru Set Up

DevOps Guru Set Up with AWS Organizations
https://aws.amazon.com/blogs/mt/how-to-easily-configure-devops-guru-across-your-organization-with-systems-manager-quick-setup/

DevOps Guru Dashboard

DevOps Guru Reactive Insights

DevOps Guru Examples
• Warm up the application (takes between 1 and 24 hours) to create a baseline
• Design a test experiment to provoke errors and a latency increase
• Reduce the service quota of the AWS service (API Gateway, Lambda, DynamoDB)
• Set very low service quotas for the sake of reducing AWS costs
• Add latency artificially
• Stress test with the hey tool to run into the operational issues
• See whether DevOps Guru recognized the operational issues
• Remediate the operational issues by increasing the service quota, removing the artificial latency or stopping the stress test
• See whether DevOps Guru closes the incident when it’s resolved
https://github.com/rakyll/hey
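
The two "see whether" steps can also be checked programmatically instead of in the console. The following Python sketch is an assumption about how that might look with the boto3 `devops-guru` client and its `list_insights` call; the helper name `ongoing_reactive_insights` is hypothetical, and the client is passed in so the function can be tried with a stub before it touches a real account.

```python
def ongoing_reactive_insights(client):
    """Return (name, severity) pairs for reactive insights that are still ongoing.

    `client` is assumed to behave like boto3.client("devops-guru").
    """
    insights = []
    token = None
    while True:
        kwargs = {"StatusFilter": {"Ongoing": {"Type": "REACTIVE"}}}
        if token:
            kwargs["NextToken"] = token  # follow pagination if there are more results
        page = client.list_insights(**kwargs)
        for insight in page.get("ReactiveInsights", []):
            insights.append((insight["Name"], insight["Severity"]))
        token = page.get("NextToken")
        if not token:
            return insights


# Usage against a real account (requires boto3 and AWS credentials):
#   import boto3
#   print(ongoing_reactive_insights(boto3.client("devops-guru")))
```

An empty result after remediation would suggest that the insight is no longer ongoing.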

DevOps Guru: Recognize Operational Issues in DynamoDB

DevOps Guru Examples: DynamoDB Throttling
hey -q 20 -z 15m -c 20 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/1
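
To get a feeling for the load this command can generate, here is a small Python calculation. It relies on hey's documented flags (`-q` is a rate limit in requests per second per worker, `-c` is the number of concurrent workers, `-z` is the duration); the function name `hey_max_requests` is hypothetical, and the result is only an upper bound because slow responses reduce the real throughput.

```python
def hey_max_requests(qps_per_worker, workers, duration_seconds):
    """Upper bound on request rate and total requests for one hey run."""
    max_rps = qps_per_worker * workers  # -q applies to every worker separately
    return max_rps, max_rps * duration_seconds


# Flags from the DynamoDB throttling example: -q 20 -c 20 -z 15m
max_rps, max_total = hey_max_requests(20, 20, 15 * 60)
print(max_rps, max_total)  # 400 360000
```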

DevOps Guru Examples: API Gateway HTTP 429 "Too Many Requests" Error
Query to exhaust the quota:
hey -q 10 -z 1m -c 10 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/1
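
Why this run ends in HTTP 429 responses can be shown with a toy model in Python. It is only a sketch of a per-key request quota, not API Gateway's implementation; the class `QuotaCounter` and the limit of 500 requests are hypothetical, since the talk does not state the configured quota.

```python
class QuotaCounter:
    """Toy per-key quota: requests succeed until the limit is used up."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def handle(self):
        if self.used >= self.limit:
            return 429  # quota exhausted
        self.used += 1
        return 200


quota = QuotaCounter(limit=500)  # hypothetical quota value

# Upper bound for `hey -q 10 -c 10 -z 1m`: 10 * 10 * 60 = 6000 requests.
statuses = [quota.handle() for _ in range(10 * 10 * 60)]
print(statuses.count(200), statuses.count(429))  # 500 5500
```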

DevOps Guru Examples: API Gateway HTTP 404 "Not Found" Error
Query for a non-existent product id, e.g. 200:
hey -q 1 -z 15m -c 1 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/200

DevOps Guru Examples: Lambda Throttling 1
hey -q 5 -z 15m -c 5 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/1

DevOps Guru Examples: Lambda Timeout Error
Add 31 seconds of latency in the code of the Lambda function

DevOps Guru Examples: Lambda Error

DevOps Guru Examples: Lambda Increased Latency
Temporarily add 28 seconds of latency in the code of the Lambda function
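
One way to add such latency temporarily, without redeploying code for each experiment, is to read the delay from configuration. The sketch below is in Python for brevity (the demo application itself is based on Java samples); the handler, the `ARTIFICIAL_DELAY_SECONDS` environment variable and the injectable `sleep` parameter are all hypothetical and not taken from the demo repository.

```python
import os
import time


def handler(event, context, sleep=time.sleep):
    """Hypothetical Lambda handler that waits before doing its real work."""
    delay = float(os.environ.get("ARTIFICIAL_DELAY_SECONDS", "0"))
    if delay > 0:
        # The talk uses 28 s for the increased-latency example
        # and 31 s for the timeout example.
        sleep(delay)
    return {"statusCode": 200, "body": "ok"}


# Local dry run with a fake sleep so nothing actually blocks.
os.environ["ARTIFICIAL_DELAY_SECONDS"] = "28"
recorded = []
print(handler({}, None, sleep=recorded.append), recorded)
```

Setting the variable back to 0 would correspond to the remediation step of removing the artificial latency.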

DevOps Guru: Recognize Operational Issues in SQS

DevOps Guru: Operational Issues in SQS
Temporarily add 26 seconds of latency in the code of the Lambda function

DevOps Guru: Recognize Operational Issues in Amazon Kinesis

DevOps Guru Examples: Operational Issues in Amazon Kinesis Data Streams -> Lambda -> (S3)

DevOps Guru: Recognize Operational Issues in AWS Step Functions

DevOps Guru Examples: Operational Issues in AWS Step Functions -> Lambda

DevOps Guru: Recognize Operational Issues in Aurora Serverless v2 PostgreSQL

DevOps Guru Examples: Enabling Performance Insights for Aurora Serverless v2

DevOps Guru Examples: Operational Issues Lambda -> Aurora Serverless v2 w/o RDS Proxy
api.eu-central-1.amazonaws.com/prod/productsWithoutDataApi/2

DevOps Guru: Recognize Operational Issues in Aurora Serverless v2 PostgreSQL using Data API

DevOps Guru Examples: Operational Issues Lambda -> Aurora Serverless v2 using Data API
api.eu-central-1.amazonaws.com/prod/productsWithDataApi/2
No Aurora Serverless DB anomalous metrics detected

DevOps Guru Examples: Operational Issues Lambda -> Aurora Serverless v2 using Data API
api.eu-central-1.amazonaws.com/prod/productsWithDataApi/1
[Figure: metrics labeled "Data API" and "Non Data API"]

DevOps Guru Proactive Insights

DevOps Guru Proactive Examples: DynamoDB table reads/writes are underutilized

DevOps Guru Proactive Examples: DynamoDB table point-in-time recovery not enabled

DevOps Guru Proactive Examples: Lambda Timeout Exceeds Recommended SQS Visibility
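
The relationship behind this insight can be checked with a few lines of Python. The sketch follows the AWS Lambda documentation's recommendation to set the source queue's visibility timeout to at least six times the function timeout; whether DevOps Guru applies exactly this rule is an assumption, and the function names are hypothetical.

```python
def recommended_min_visibility_timeout(function_timeout_seconds, factor=6):
    """Minimum SQS visibility timeout for a given Lambda timeout (factor per AWS docs)."""
    return function_timeout_seconds * factor


def visibility_timeout_is_sufficient(function_timeout_seconds, visibility_timeout_seconds):
    """True if the queue's visibility timeout meets the recommendation."""
    minimum = recommended_min_visibility_timeout(function_timeout_seconds)
    return visibility_timeout_seconds >= minimum


# Hypothetical configuration: 30 s function timeout.
print(visibility_timeout_is_sufficient(30, 30))   # False
print(visibility_timeout_is_sufficient(30, 180))  # True
```

With a too-short visibility timeout, a message can become visible again while the function is still processing it, which leads to duplicate processing.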

DevOps Guru Proactive Examples: SQS Triggered Lambda Does Not Have a DLQ

DevOps Guru Proactive Examples: Lambda Function Consuming DynamoDB/Kinesis Stream Without Failure Destination

DevOps Guru Proactive Examples: Lambda Function Has Concurrency Spillover
hey -q 1 -z 30m -c 9 -m DELETE -H "X-API-Key: XXXa6XXXX" -H "Content-Type: application/json;charset=utf-8" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products/11

DevOps Guru Proactive Examples: Lambda Function Does Not Have Enough Subnets

DevOps Guru integration in Incident Management Tools
• AWS OpsCenter (via AWS Systems Manager)
• PagerDuty
• Atlassian Opsgenie

DevOps Guru Integration Settings

DevOps Guru Integration with PagerDuty
https://www.pagerduty.com/docs/guides/amazon-devops-guru-integration-guide/

Enter the "Integration URL" generated by PagerDuty

DevOps Guru PagerDuty Incidents

DevOps Guru Supported Services and Pricing
https://aws.amazon.com/de/devops-guru/pricing/

$3.024 per resource per month
$2.016 per resource per month
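
A rough monthly estimate follows directly from the per-resource prices. The Python sketch below reads the slide's figures as decimal values (2.016 and 3.024 US dollars per resource per month, taken from the German pricing page where a comma is the decimal separator); the tier names `lower` and `higher`, the function name and the resource counts are hypothetical, and which AWS resource types fall into which tier is not covered here.

```python
from decimal import Decimal

# Monthly price per analyzed resource, as shown on the pricing slide.
MONTHLY_PRICE_PER_RESOURCE = {
    "lower": Decimal("2.016"),
    "higher": Decimal("3.024"),
}


def estimate_monthly_cost(resource_counts):
    """Sum price * count over the tiers; resource_counts maps tier name to count."""
    return sum(
        MONTHLY_PRICE_PER_RESOURCE[tier] * count
        for tier, count in resource_counts.items()
    )


# Hypothetical application: 10 resources in the lower tier, 5 in the higher tier.
print(estimate_monthly_cost({"lower": 10, "higher": 5}))  # 35.280
```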

DevOps Guru Cost Estimator

DevOps Guru Conclusions, Observations, Suggestions
• Most operational issues have been correctly recognized so far
• It took several (at least 7) minutes to create an incident after the anomaly appeared
• Correctly, no insights were created for temporary incidents
• Short-lived Lambda, DynamoDB and API Gateway throttling
• Lambda duration anomalous insights (Duration p90)
• It took time to create such an insight (sometimes more than 30 minutes), maybe because of the medium severity

• Recommendations for the insight reason could be more precise (these are limitations of CloudWatch, though)
• No precise HTTP response code for API Gateway responses, only 4XX and 5XX
• No differentiation between Lambda throttling caused by reaching the individual function concurrency limit and by reaching the total AWS account concurrency limit
• No differentiation between Lambda Timeout and Init Error
• DevOps Guru Proactive Insights
• Missed some important ones, such as Lambda Provisioned Concurrency left unused for a long period of time

• #AWS #Wishlist for DevOps Guru
• Support for EventBridge (and EventBridge Pipes)
• Support for AppSync
• Support for Aurora (Serverless v2) over Data API
• Better support for tracing, e.g. AWS X-Ray and CloudWatch ServiceLens, and integrations with third-party observability tools, e.g. Lumigo and Datadog

FAQ Ask me Anything

Thank you
