AWS Cloud Operations Blog
Best practices for analyzing AWS Config recording frequencies
AWS Config tracks configuration changes across your AWS resources and AWS Organizations. AWS Config uses the configuration recorder to detect changes and records them as configuration items (CIs). As your infrastructure grows and becomes more complex, choosing the appropriate recording frequency becomes critical for maintaining operational visibility, meeting compliance requirements, and supporting your security posture. Since the launch of the periodic recording […]
Centralized Multi-Account Application Resilience Assessment Using AWS Resilience Hub
As organizations scale their cloud environments across multiple AWS accounts and regions, managing and assessing resilience becomes increasingly complex. Traditional approaches of evaluating resilience separately for each workload, account, or region can lead to inefficiencies, inconsistencies, and coverage gaps. This challenge is particularly pronounced in distributed architectures utilizing various Infrastructure as Code (IaC) tools […]
Optimize querying AWS CloudTrail logs with partitioning in Amazon Athena
Organizations leveraging AWS CloudTrail to audit API access encounter a common challenge: CloudTrail data volume grows proportionally with AWS infrastructure expansion. A multi-account AWS organization generating millions of API calls daily can quickly amass terabytes of CloudTrail logs. When security teams conduct incident investigations or account activity audits, querying these logs in Amazon Athena becomes […]
Learn from AWS Fault Injection Service team’s approach to Game Days
In today’s digital world, availability and reliability are crucial competitive advantages. For DevOps and SRE teams, the ability to respond quickly and effectively to incidents can mean the difference between a minor issue and a major disruption of service that impacts millions of customers. Teams must have clear-cut runbooks and appropriate observability to be ready […]
Alarming on SLOs in Amazon Search with CloudWatch Application Signals – Part 2
In practice: SLO monitoring with CloudWatch Application Signals. In the previous post, we shared the basic concepts and benefits of burn rate monitoring. In this post, we, the Amazon Product Search team, share anecdotes from our migration from an in-house solution to CloudWatch Application Signals and show how we implement monitoring and dashboards. […]
Alarming on SLOs in Amazon Search with CloudWatch Application Signals – Part 1
In theory: SLO concepts applied to Amazon Product Search. In this series of posts, we will show you how we, the Amazon Product Search team, monitor key systems using Service Level Objectives (SLOs) and share our migration journey from an in-house solution to Amazon CloudWatch Application Signals. Amazon Product Search is a large distributed system […]
Using Amazon Bedrock and Amazon Nova for AI-Powered Incident Response
In today’s cloud-native world, incident response teams face overwhelming challenges. When critical applications fail, engineers must sift through mountains of observability data across multiple services, all while under intense pressure to restore service quickly. This manual correlation process is time-consuming, error-prone, and often delays resolution, resulting in extended outages and frustrated customers. Traditional monitoring tools […]
Launching Amazon CloudWatch generative AI observability (Preview)
As organizations rapidly deploy large language models (LLMs) and generative AI agents to power increasingly intelligent workloads, they struggle to monitor and troubleshoot the complex interactions within their AI applications. Traditional monitoring tools fall short in providing visibility across components, leading developers and AI/ML engineers to manually correlate interaction logs or build custom […]
SAP on AWS – Streamlined Operations and Monitoring
SAP ERP (Enterprise Resource Planning) systems are at the core of many enterprises, supporting a wide range of mission-critical processes, including Procure to Pay, Order to Cash, Production Planning, Financial Accounting, Supply Chain Management (SCM), and Human Capital Management. Given the critical role of SAP ERP, maintaining the stability, security, and efficiency of these ERP […]
Automate installing AWS Systems Manager agent on unmanaged Amazon EC2 nodes
Managing a fleet of AWS resources at scale can be challenging. Organizations rely on multiple solutions to automate tasks, collect inventory, patch instances, and maintain security compliance. Organizations need to access instances without opening inbound ports or managing SSH keys. AWS Systems Manager (SSM) simplifies this by serving as a centralized management solution that supports […]