Cloud Operations and Management
Expert-defined terms from the Certificate in Cloud Transformation Management course at LearnUNI. Free to read, free to share, paired with a professional course.
Auto Scaling – related terms #
elasticity, load balancer, capacity planning. Auto Scaling automatically adjusts compute resources based on real‑time demand, adding or removing instances to maintain performance while minimizing cost. Example: scaling web servers during a flash sale. Challenge: configuring policies that avoid oscillation and ensure rapid response to spikes.
Availability Zone – related terms #
region, fault domain, redundancy. An Availability Zone (AZ) is an isolated location within a cloud region that provides independent power, cooling, and networking. Deploying across multiple AZs improves resilience. Example: replicating a database in two AZs for high availability. Challenge: managing data consistency and latency between zones.
Backup as a Service (BaaS) – related terms #
snapshot, recovery point objective (RPO), data protection. BaaS provides automated, cloud‑based backup of on‑premises or cloud workloads, often with encryption and retention policies. Example: nightly backups of virtual machines stored in object storage. Challenge: balancing backup frequency with storage cost and ensuring rapid restoration.
Capacity Planning – related terms #
forecasting, utilization, scaling strategy. Capacity planning forecasts resource needs based on growth trends, seasonal patterns, and performance metrics. Example: using historical CPU usage to predict future instance sizes. Challenge: inaccurate forecasts can lead to over‑provisioning or resource shortages.
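As a rough illustration of forecasting from historical usage, the sketch below fits a straight line to equally spaced CPU samples and extrapolates it. The function name `linear_forecast` and the sample values are hypothetical; real capacity planning would also need to account for the seasonal patterns mentioned above, which a straight-line fit ignores.

```python
def linear_forecast(history, periods_ahead):
    """Fit y = a + b*x by least squares over equally spaced samples and extrapolate."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    # The last observed sample sits at x = n - 1.
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical monthly average CPU utilization (percent).
monthly_cpu = [40, 45, 50, 55]
print(linear_forecast(monthly_cpu, periods_ahead=2))
```

A forecast that approaches the instance's practical ceiling is a prompt to plan a larger size or more instances before the shortage occurs.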
CloudFormation – related terms #
Infrastructure as Code (IaC), template, stack. AWS CloudFormation enables declarative provisioning of resources via JSON/YAML templates, allowing repeatable environments. Example: defining a VPC, subnets, and security groups in a single stack. Challenge: managing template drift and handling complex dependencies.
Cloud Governance – related terms #
policy enforcement, compliance, audit. Cloud governance establishes rules, roles, and processes to control resource usage, cost, and security. Example: enforcing tagging policies to track expenses. Challenge: achieving compliance without hindering developer agility.
CloudWatch – related terms #
metrics, alarm, log aggregation. Amazon CloudWatch collects performance data, triggers alarms, and visualizes trends for resources and applications. Example: setting an alarm for CPU usage > 80 %. Challenge: avoiding alarm fatigue and correlating metrics across services.
Container Orchestration – related terms #
Kubernetes, Docker Swarm, scheduler. Orchestration automates deployment, scaling, and management of containerized workloads. Example: using Kubernetes Deployments to roll out new versions. Challenge: mastering networking, storage, and security in a dynamic cluster.
Cost Optimization – related terms #
rightsizing, reserved instances, waste reduction. Cost optimization identifies under‑utilized resources, applies discounts, and eliminates idle services. Example: converting on‑demand instances to reserved contracts after a year of stable usage. Challenge: maintaining performance while cutting expenses and tracking savings across many accounts.
Data Residency – related terms #
sovereignty, compliance, geographic location. Data residency refers to storing data within specific legal jurisdictions to meet regulatory requirements. Example: retaining EU customer data in an EU region. Challenge: navigating conflicting regulations and ensuring low‑latency access.
Disaster Recovery (DR) – related terms #
RTO, backup, failover. DR defines processes to restore services after catastrophic failure, specifying recovery time objective (RTO) and recovery point objective (RPO). Example: replicating a primary database to a secondary region for rapid failover. Challenge: balancing DR costs with stringent RTO/RPO goals.
Elastic Load Balancer (ELB) – related terms #
traffic distribution, health check, layer 4/7. ELB automatically distributes incoming traffic across multiple targets to improve availability and fault tolerance. Example: directing web requests to a pool of EC2 instances. Challenge: configuring session persistence and handling sudden traffic spikes.
Federated Identity – related terms #
SSO, SAML, OAuth. Federated identity allows users to authenticate using a trusted external provider, enabling single sign‑on across cloud services. Example: logging into a cloud console with corporate Azure AD (now Microsoft Entra ID) credentials. Challenge: mapping external roles to cloud permissions securely.
Fault Tolerance – related terms #
redundancy, graceful degradation, resilience. Fault‑tolerant designs continue operating despite component failures, often via replication and automatic failover. Example: a multi‑AZ deployment that reroutes traffic when one zone loses connectivity. Challenge: designing stateful services that can recover without data loss.
Hybrid Cloud – related terms #
on‑premises, edge, connectivity. Hybrid cloud integrates private infrastructure with public cloud resources, enabling workloads to span environments. Example: extending a private data center with burst capacity in the public cloud. Challenge: ensuring consistent security policies and managing data movement latency.
Infrastructure as Code (IaC) – related terms #
declarative, provisioning, version control. IaC treats infrastructure definitions as software code, enabling automated, repeatable deployments. Example: using Terraform to provision networking, compute, and storage in a single run. Challenge: keeping code synchronized with manual changes and handling state drift.
Incident Management – related terms #
ticketing, root cause analysis, escalation. Incident management coordinates response to service disruptions, from detection through resolution and post‑mortem. Example: a PagerDuty alert triggers a run‑book to restart a failed service. Challenge: reducing mean time to resolution (MTTR) while preserving thorough documentation.
Kubernetes – related terms #
pod, service, operator. Kubernetes is an open‑source platform for automating deployment, scaling, and operation of containerized applications. Example: using a Horizontal Pod Autoscaler to adjust replica count based on CPU. Challenge: mastering networking, storage classes, and security contexts in production.
Latency – related terms #
round‑trip time, jitter, edge computing. Latency measures the delay between request and response, impacting user experience. Example: measuring API response times from different geographic regions. Challenge: reducing latency for globally distributed users while controlling cost.
Logging – related terms #
log aggregation, audit trail, retention. Logging captures events from applications and infrastructure for troubleshooting and compliance. Example: forwarding syslog entries to a centralized log analytics service. Challenge: managing log volume, ensuring searchable indexing, and protecting sensitive data.
Managed Service – related terms #
SaaS, PaaS, vendor responsibility. A managed service offloads operational tasks to the provider, allowing teams to focus on core business logic. Example: using a fully managed relational database instead of self‑hosting. Challenge: understanding shared responsibility and avoiding vendor lock‑in.
Multi‑Cloud – related terms #
cloud‑agnostic, portability, federation. Multi‑cloud strategy uses two or more public cloud providers to avoid reliance on a single vendor. Example: deploying workloads in both AWS and Azure for redundancy. Challenge: harmonizing APIs, monitoring, and security across disparate platforms.
Network Security Group (NSG) – related terms #
firewall, rule set, inbound/outbound. An NSG is a virtual firewall that controls traffic to and from resources based on defined rules. Example: permitting SSH from a specific subnet while blocking all other inbound traffic. Challenge: maintaining rule consistency as environments scale.
Orchestration – related terms #
workflow, automation, orchestration engine. Orchestration coordinates multiple automated tasks into a cohesive process. Example: a CI/CD pipeline that builds, tests, and deploys containers automatically. Challenge: handling error propagation and ensuring idempotent steps.
Platform as a Service (PaaS) – related terms #
runtime, managed environment, abstraction. PaaS provides a complete development and deployment platform, abstracting underlying infrastructure. Example: deploying a web app to Azure App Service without managing servers. Challenge: limited control over low‑level configurations and potential vendor lock‑in.
Quality of Service (QoS) – related terms #
bandwidth, priority, SLA. QoS defines performance guarantees for network traffic, ensuring critical workloads receive sufficient resources. Example: assigning higher priority to database replication traffic. Challenge: configuring QoS policies that align with business priorities and avoid congestion.
Resource Tagging – related terms #
metadata, cost allocation, governance. Tagging adds key‑value metadata to cloud resources to facilitate organization, cost tracking, and policy enforcement. Example: tagging all production instances with “env=prod”. Challenge: enforcing consistent tagging across teams and preventing tag sprawl.
Service Level Agreement (SLA) – related terms #
uptime, penalties, commitment. An SLA is a formal contract that defines expected service performance and remedies for non‑compliance. Example: a 99.9 % uptime guarantee with service credits for downtime. Challenge: aligning provider SLAs with internal business requirements and monitoring compliance.
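To make an uptime percentage concrete, the following sketch converts it into the downtime it permits. The function name and the 30-day billing period are assumptions for illustration; actual SLAs define their own measurement periods and exclusions.

```python
def allowed_downtime_minutes(uptime_percent: float, period_days: float = 30) -> float:
    """Return the downtime (in minutes) permitted by an uptime target over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - uptime_percent / 100)


for target in (99.0, 99.9, 99.99):
    print(f"{target} % uptime allows {allowed_downtime_minutes(target):.1f} minutes of downtime per 30 days")
```

Comparing the result with internal recovery targets is one quick way to check whether a provider SLA matches business requirements.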
Service Mesh – related terms #
sidecar, traffic management, observability. A service mesh provides a dedicated infrastructure layer for handling service‑to‑service communication, security, and monitoring. Example: using Istio to enforce mutual TLS between microservices. Challenge: added complexity and resource overhead for control plane components.
Service Catalog – related terms #
self‑service, governance, portfolio. A service catalog offers pre‑approved cloud resources and configurations for users to provision on demand. Example: a catalog entry for a standard three‑tier web application stack. Challenge: keeping catalog items up‑to‑date and preventing shadow IT.
Spot Instances – related terms #
preemptible, cost savings, interruption. Spot instances are unused cloud capacity sold at steep discounts, but can be reclaimed on short notice. Example: running batch processing jobs on spot VMs to reduce cost. Challenge: designing workloads that checkpoint their progress and tolerate sudden termination.
Terraform – related terms #
state file, provider, modules. Terraform is an open‑source IaC tool that uses a declarative language to manage resources across multiple cloud providers. Example: a Terraform module that creates a VPC, subnets, and security groups. Challenge: managing state files securely and handling provider API changes.
Traffic Shaping – related terms #
throttling, rate limiting, QoS. Traffic shaping controls the flow of network packets to ensure fair usage and prevent overload. Example: limiting API calls to 100 requests per second per client. Challenge: configuring policies that protect services without degrading legitimate traffic.
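One common way to implement a limit such as 100 requests per second is a token bucket. The sketch below is a minimal single-process version; the class name, the injectable `clock` parameter, and the choice of a token bucket (rather than another algorithm) are assumptions for illustration.

```python
import time


class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the limiter can be tested without sleeping
        self.tokens = capacity      # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# One bucket per client gives a per-client limit.
buckets = {}

def allow_request(client_id: str) -> bool:
    bucket = buckets.setdefault(client_id, TokenBucket(rate=100, capacity=100))
    return bucket.allow()
```

The capacity controls how large a burst is tolerated, which is the knob that helps avoid degrading legitimate but bursty traffic.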
Unified Monitoring – related terms #
observability, dashboards, correlation. Unified monitoring aggregates metrics, logs, and traces from diverse sources into a single pane of glass. Example: correlating CPU spikes with increased error rates across services. Challenge: normalizing data formats and avoiding alert duplication.
Virtual Private Cloud (VPC) – related terms #
subnet, routing table, isolation. A VPC is an isolated virtual network that emulates a traditional data center within the public cloud. Example: creating public and private subnets for web and database tiers. Challenge: configuring proper network segmentation and NAT gateways.
Workload Migration – related terms #
lift‑and‑shift, re‑architect, data transfer. Workload migration moves applications and data from on‑premises or another cloud to a target environment. Example: using a migration service to replicate a SQL Server database to a managed cloud instance. Challenge: minimizing downtime, handling compatibility issues, and validating performance post‑migration.
Zero‑Trust Security – related terms #
microsegmentation, identity verification, least privilege. Zero‑trust assumes no implicit trust, requiring continuous verification for every request. Example: enforcing strict network policies that only allow necessary service communication. Challenge: implementing comprehensive policies without excessive complexity.
API Gateway – related terms #
request routing, throttling, authentication. An API gateway acts as a front‑door for APIs, handling traffic management, security, and monitoring. Example: exposing microservice endpoints through a centralized gateway with JWT validation. Challenge: balancing performance overhead with security features.
Backup Retention Policy – related terms #
compliance, archival, lifecycle. A backup retention policy defines how long backup copies are kept before they are deleted or moved to cheaper storage. Example: retaining daily backups for 30 days and weekly backups for 12 months. Challenge: aligning retention with regulatory requirements while controlling storage costs.
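The daily and weekly rule in the example can be sketched as a small predicate. Two assumptions are made here that the glossary does not state: the Sunday backup is treated as the weekly copy, and 12 months is approximated as 365 days.

```python
from datetime import date, timedelta

DAILY_RETENTION = timedelta(days=30)
WEEKLY_RETENTION = timedelta(days=365)  # assumption: 12 months treated as 365 days
WEEKLY_WEEKDAY = 6                      # assumption: Sunday backups are the weekly copies


def should_keep(backup_date: date, today: date) -> bool:
    """Return True if a backup taken on backup_date is still within the retention policy."""
    age = today - backup_date
    if age <= DAILY_RETENTION:
        return True
    if backup_date.weekday() == WEEKLY_WEEKDAY and age <= WEEKLY_RETENTION:
        return True
    return False


today = date(2024, 6, 30)
backups = [today - timedelta(days=n) for n in range(400)]
kept = [b for b in backups if should_keep(b, today)]
print(f"{len(kept)} of {len(backups)} backups retained")
```

Running the predicate against the full backup list before deleting anything makes it easy to review what a policy change would remove.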
Change Management – related terms #
version control, approval workflow, impact analysis. Change management governs modifications to cloud environments to reduce risk. Example: submitting a change request to increase instance size, followed by peer review. Challenge: maintaining speed for DevOps while ensuring proper oversight.
Compliance Framework – related terms #
GDPR, HIPAA, audit. A compliance framework outlines required controls and procedures to meet regulatory standards. Example: implementing encryption at rest to satisfy PCI‑DSS requirements. Challenge: continuously monitoring for compliance drift in dynamic environments.
Cost Allocation Tag – related terms #
chargeback, budgeting, reporting. A cost allocation tag is a label used to assign cloud spend to specific departments or projects. Example: tagging resources with “project=Alpha” for internal chargeback. Challenge: enforcing tag usage and reconciling tags with financial systems.
Data Encryption at Rest – related terms #
KMS, key management, compliance. Encrypting data stored in disks, databases, or object storage protects it from unauthorized access. Example: enabling server‑side encryption for S3 buckets using a customer‑managed key. Challenge: managing key rotation and ensuring performance impact is minimal.
Disaster Recovery as a Service (DRaaS) – related terms #
replication, failover, service level. DRaaS provides cloud‑based recovery capabilities, allowing rapid restoration of entire environments. Example: replicating virtual machines to a secondary region for instant failover. Challenge: testing failover processes without impacting production workloads.
Edge Computing – related terms #
latency reduction, fog, local processing. Edge computing processes data near its source, reducing latency and bandwidth usage. Example: deploying a containerized analytics function on edge nodes for IoT sensor data. Challenge: managing distributed security and consistent updates across edge devices.
Feature Flag – related terms #
canary release, toggle, rollout. Feature flags enable dynamic activation of new functionality without redeploying code. Example: turning on a new recommendation engine for 10 % of users. Challenge: preventing flags from exposing unfinished features to unintended users and ensuring proper cleanup after release.
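A percentage rollout is often implemented by hashing the user into a stable bucket, so that the same user always sees the same variant. The sketch below assumes a hypothetical `is_enabled` helper and flag name; it is not the API of any particular feature-flag product.

```python
import hashlib


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in one of 100 buckets and enable the first N buckets."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent


# Hypothetical flag name; roughly 10 % of users should fall into the enabled buckets.
enabled = sum(is_enabled("new-recommendations", f"user-{i}", 10) for i in range(10_000))
print(f"{enabled} of 10000 users see the new engine")
```

Including the flag name in the hash keeps different flags from always selecting the same group of users.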
Global Load Balancer – related terms #
DNS routing, latency‑based routing, geo‑distribution. A global load balancer distributes traffic across regions based on health, proximity, or performance. Example: directing European users to an EU region while Asian users go to an Asia‑Pacific region. Challenge: synchronizing stateful sessions and handling cross‑region latency.
Health Check – related terms #
readiness probe, liveness probe, monitoring. Health checks verify that services are operational before routing traffic to them. Example: an HTTP endpoint returning 200 OK indicates a web server is healthy. Challenge: designing checks that are both accurate and low‑impact.
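One way to keep checks accurate without reacting to a single blip is to change a target's state only after several consecutive results disagree with it. The sketch below illustrates that idea; the class name and the default thresholds are assumptions, not values from any specific load balancer.

```python
class HealthTracker:
    """Flip the health state only after several consecutive results contradict it."""

    def __init__(self, healthy_threshold: int = 2, unhealthy_threshold: int = 3):
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = True
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the (possibly updated) health state."""
        if check_passed == self.healthy:
            self._streak = 0
            return self.healthy
        self._streak += 1
        needed = self.unhealthy_threshold if self.healthy else self.healthy_threshold
        if self._streak >= needed:
            self.healthy = not self.healthy
            self._streak = 0
        return self.healthy


tracker = HealthTracker()
for passed in [False, False, True, False, False, False, True, True]:
    print(passed, "->", "healthy" if tracker.record(passed) else "unhealthy")
```

Higher thresholds reduce flapping at the cost of reacting more slowly to a real failure.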
Identity and Access Management (IAM) – related terms #
roles, policies, least privilege. IAM controls who can do what in a cloud environment through users, groups, and permission policies. Example: granting a developer read‑only access to production logs. Challenge: preventing permission creep and regularly auditing access.
Infrastructure Monitoring – related terms #
metrics, dashboards, alerting. Infrastructure monitoring tracks resource utilization, performance, and health of servers, networks, and storage. Example: setting alerts for disk usage exceeding 80 %. Challenge: correlating metrics across layers to pinpoint root causes.
Job Scheduler – related terms #
cron, batch processing, dependency. A job scheduler automates the execution of recurring tasks or batch jobs. Example: scheduling a nightly ETL pipeline to run at 02:00 UTC. Challenge: handling job failures and ensuring idempotent execution.
KMS (Key Management Service) – related terms #
encryption keys, rotation, access control. KMS provides centralized creation, storage, and management of cryptographic keys. Example: using KMS to encrypt EBS volumes automatically. Challenge: integrating KMS with multiple services and auditing key usage.
Latency Monitoring – related terms #
synthetic testing, real‑user monitoring, SLO. Latency monitoring measures response times from the end‑user perspective to verify service level objectives. Example: synthetic probes from five global locations checking API latency. Challenge: distinguishing network latency from application processing delays.
Log Retention Policy – related terms #
compliance, storage tier, deletion. A log retention policy defines how long logs are kept before archival or deletion. Example: retaining security logs for 365 days to satisfy audit requirements. Challenge: balancing regulatory mandates with the cost of long‑term storage.
Managed Kubernetes Service – related terms #
control plane, node pool, auto‑upgrade. A managed Kubernetes service provides a fully operated Kubernetes control plane, reducing operational overhead. Example: using Amazon EKS to run clusters without patching the control plane nodes. Challenge: understanding provider limits and customizing networking.
Network Latency – related terms #
RTT, packet loss, jitter. Network latency is the time for a packet to travel from source to destination, influencing application responsiveness. Example: measuring ping times between data center and cloud region. Challenge: mitigating latency for latency‑sensitive applications like gaming.
Observability – related terms #
metrics, traces, logs. Observability is the ability to infer internal system states from external outputs, enabling debugging and performance tuning. Example: using distributed tracing to visualize request flow across microservices. Challenge: collecting high‑volume telemetry without impacting performance.
Performance Baseline – related terms #
benchmark, SLA, trend analysis. Establishing a performance baseline defines normal operating metrics against which anomalies are detected. Example: recording average CPU usage during peak business hours. Challenge: accounting for seasonal variations and scaling changes.
Provisioning Automation – related terms #
scripts, IaC, orchestration. Automation scripts create resources without manual intervention, ensuring consistency and speed. Example: a PowerShell script that provisions a storage account and assigns RBAC roles. Challenge: handling errors gracefully and maintaining idempotency.
Quality Gate – related terms #
CI/CD, static analysis, compliance. A quality gate enforces code quality thresholds before allowing progression in a pipeline. Example: requiring zero critical vulnerabilities before merging a pull request. Challenge: avoiding false positives that block legitimate changes.
Quota Management – related terms #
limits, enforcement, request throttling. Quotas restrict the amount of resources a tenant can consume, preventing accidental overspend. Example: setting a maximum of 50 VM instances per project. Challenge: monitoring quota usage and requesting increases proactively.
Resource Scaling Policy – related terms #
threshold, metric, step scaling. A resource scaling policy defines the criteria that trigger scaling actions, such as CPU > 70 % for 5 minutes. Example: a step policy that adds two instances per scaling event. Challenge: tuning thresholds to avoid thrashing and ensuring scaling actions complete in time to absorb the load.
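The example policy can be sketched as a pure function over recent CPU samples. The sketch covers scale-out only and assumes one sample per minute; the function name, the `max_instances` cap, and the `cooling_down` flag are hypothetical additions used to show how thrashing is commonly limited.

```python
def desired_capacity(cpu_samples, current, *, threshold=70.0, sustained=5,
                     step=2, max_instances=10, cooling_down=False):
    """Return the new instance count given one CPU sample per minute (most recent last)."""
    # Do nothing during a cooldown or before enough samples have been collected.
    if cooling_down or len(cpu_samples) < sustained:
        return current
    recent = cpu_samples[-sustained:]
    # Scale out only if every sample in the window breached the threshold.
    if all(sample > threshold for sample in recent):
        return min(current + step, max_instances)
    return current


print(desired_capacity([82, 85, 90, 88, 91], current=4))   # sustained breach
print(desired_capacity([82, 85, 60, 88, 91], current=4))   # breach interrupted
```

Requiring a sustained breach filters out short spikes, while the cooldown gives newly launched instances time to take load before the policy fires again.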
Service Discovery – related terms #
DNS, etcd, Consul. Service discovery enables services to locate each other dynamically without hard‑coded endpoints. Example: using Kubernetes DNS to resolve “orders‑svc” to its pod IPs. Challenge: handling service churn and ensuring consistent resolution.
Service Level Objective (SLO) – related terms #
SLA, error budget, reliability. An SLO defines a target for a specific metric, such as 99.9 % request success rate. Example: an SLO that 99.9 % of API calls respond within 200 ms. Challenge: tracking error budgets and making trade‑offs when limits are breached.
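The error budget is the number of failures the SLO tolerates over a window. A minimal sketch of tracking it follows; the function name and the request counts are hypothetical.

```python
def error_budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget left (1.0 = untouched, 0.0 or less = exhausted)."""
    allowed_failures = total_requests * (1 - slo_target)
    if allowed_failures == 0:
        return 0.0 if failed_requests else 1.0
    return 1 - failed_requests / allowed_failures


# With a 99.9 % target, 1,000,000 requests allow about 1,000 failures.
remaining = error_budget_remaining(0.999, total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%} of the error budget remains")
```

A negative result means the budget is overspent, which is typically the signal to prioritize reliability work over new releases.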
Session Persistence – related terms #
sticky sessions, load balancer, stateful. Session persistence forces a client’s requests to be routed to the same backend instance, preserving session state. Example: enabling sticky sessions for a web app that stores session data in memory. Challenge: sticky sessions make it harder to spread load evenly when scaling out, and in-memory session state is lost when a node fails.
Tag Enforcement Policy – related terms #
governance, automation, compliance. A tag enforcement policy validates required tags at resource creation and rejects non‑compliant deployments. Example: requiring “owner” and “environment” tags on all new VMs. Challenge: integrating enforcement into CI pipelines without causing friction.
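A check like the one in the example can run in a CI step before deployment. The sketch below is provider-neutral; the function name and the dictionary shape used for a VM request are assumptions, not a cloud provider's policy format.

```python
REQUIRED_TAGS = {"owner", "environment"}


def missing_tags(tags: dict) -> set:
    """Return the required tag keys that are absent or have a blank value."""
    missing = set()
    for key in REQUIRED_TAGS:
        value = tags.get(key)
        if not isinstance(value, str) or not value.strip():
            missing.add(key)
    return missing


# Hypothetical VM request as it might appear in a deployment manifest.
vm_request = {"name": "web-01", "tags": {"owner": "alice"}}
problems = missing_tags(vm_request["tags"])
if problems:
    print("rejected, missing tags:", sorted(problems))
```

Reporting exactly which tags are missing, rather than only failing the build, is one way to reduce the friction mentioned above.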
Traffic Encryption – related terms #
TLS, SSL, in‑transit security. Encrypting data as it moves between client and server protects confidentiality and integrity. Example: enforcing HTTPS on all API endpoints with TLS 1.2 or later. Challenge: managing certificates, supporting cipher suite updates, and handling performance overhead.
Unified Billing – related terms #
cost aggregation, cross‑account, reporting. Unified billing consolidates charges from multiple accounts or subscriptions into a single invoice for easier management. Example: aggregating spend from development, testing, and production accounts under a master payer. Challenge: attributing costs accurately to individual teams.
Version Control – related terms #
Git, branching, commit. Version control tracks changes to code and IaC templates, enabling collaboration and rollback. Example: storing Terraform modules in a Git repository with pull‑request reviews. Challenge: coordinating merges in fast‑moving environments and preventing drift between code and deployed resources.
Virtual Machine (VM) Image – related terms #
snapshot, golden image, immutability. A VM image is a pre‑configured template used to launch instances with consistent software stacks. Example: creating a hardened Linux image with security patches applied. Challenge: keeping images up‑to‑date and avoiding image sprawl.
Workflow Automation – related terms #
BPM, orchestrator, triggers. Workflow automation chains tasks and integrates services so that business processes run without manual steps. Example: an automated workflow that provisions a database, configures network rules, and sends a notification upon completion. Challenge: handling exception paths and ensuring idempotent steps.
Zero‑Downtime Deployment – related terms #
blue‑green, canary, rolling update. A zero‑downtime deployment releases a new application version without interrupting user traffic. Example: using a rolling update strategy that updates 10 % of pods at a time while monitoring health. Challenge: managing stateful services and ensuring backward compatibility.
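The rolling strategy in the example can be sketched as a loop over batches that stops as soon as an updated pod fails its health check. The function names and the `update` and `is_healthy` callables are hypothetical stand-ins; an orchestrator such as Kubernetes performs this logic itself.

```python
import math


def rolling_batches(pods, batch_percent=10):
    """Split pods into consecutive batches of roughly batch_percent each (at least one pod)."""
    size = max(1, math.ceil(len(pods) * batch_percent / 100))
    return [pods[i:i + size] for i in range(0, len(pods), size)]


def rolling_update(pods, update, is_healthy, batch_percent=10):
    """Update batch by batch; stop and report failure if any updated pod is unhealthy."""
    for batch in rolling_batches(pods, batch_percent):
        for pod in batch:
            update(pod)
        if not all(is_healthy(pod) for pod in batch):
            return False  # halt the rollout so the remaining pods keep serving the old version
    return True


pods = [f"pod-{i}" for i in range(20)]
ok = rolling_update(pods, update=lambda pod: None, is_healthy=lambda pod: True)
print("rollout complete" if ok else "rollout halted")
```

Because old and new versions serve traffic side by side during the rollout, the backward compatibility noted above is a precondition for this strategy.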