CloudPro #98

One of the few GenAI tools that actually feels built for engineers


Most GenAI tools just dress up autocomplete. Shield’s AmplifAI is different. It uses agentic AI, systems that reason and act across steps, to take real work off your plate.

Think: auto-surfacing hidden compliance risks, navigating tangled comms threads, explaining every decision clearly. No magic, just well-architected automation with human-in-the-loop guardrails.

If you're curious what useful AI looks like in practice, start here.

Learn More

> Attack graphs are redefining IAM risk modeling from the ground up

> Airbnb’s load testing framework bakes chaos into CI/CD

> Kubernetes is still awkward with GPU failures, and no one’s fixed it yet

Plus: SRE agents with $21M backing, mirrord’s new team debugging trick, and visual Kubernetes troubleshooting that finally makes sense.

Cheers,

Shreyans Singh

Editor-in-Chief

Network security that just works: no apps, no friction


Security shouldn’t depend on whether your users remember to install something. That’s why I found Whalebone so interesting: it protects millions of devices from phishing, malware, and scams at the DNS level, no downloads required.

It’s cleanly integrated, telco-ready, and surprisingly quick to deploy (2 months). Telcos like O2 and A1 are already using it to boost ARPU while quietly shielding users in the background.

For teams building secure, seamless infra:

Learn More

🔐 Cloud Security

Why Default Pod Communication in Kubernetes is a Security Risk

By default, all pods in a Kubernetes cluster can talk to each other, which simplifies app deployment but opens up security risks. Network policies are the main way to restrict this traffic, using labels and namespaces to control ingress and egress. Whether policies are enforced depends on your CNI plugin: plugins like Calico support advanced rules, while others, like Flannel, do not enforce network policies at all.
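To make the default-allow behavior concrete, here is a minimal sketch of how ingress decisions work once policies select pods by label. It is a simplified model, not the Kubernetes API: the `IngressPolicy` class and the function names are hypothetical, and namespaces, ports, and egress are left out.

```python
from dataclasses import dataclass, field


@dataclass
class IngressPolicy:
    """Simplified stand-in for a NetworkPolicy (hypothetical type, not the real API)."""
    pod_selector: dict                               # labels a pod must carry to be selected
    allow_from: list = field(default_factory=list)   # label selectors for permitted sources


def matches(selector: dict, labels: dict) -> bool:
    # An empty selector matches every pod.
    return all(labels.get(k) == v for k, v in selector.items())


def ingress_allowed(src_labels: dict, dst_labels: dict, policies: list) -> bool:
    applicable = [p for p in policies if matches(p.pod_selector, dst_labels)]
    if not applicable:
        return True  # no policy selects the destination pod: default allow
    # Policies are additive: traffic passes if any applicable policy admits the source.
    return any(matches(sel, src_labels) for p in applicable for sel in p.allow_from)


# Example: only pods labelled app=api may reach pods labelled app=db.
policies = [IngressPolicy({"app": "db"}, [{"app": "api"}])]
web_to_db = ingress_allowed({"app": "web"}, {"app": "db"}, policies)
api_to_db = ingress_allowed({"app": "api"}, {"app": "db"}, policies)
web_to_api = ingress_allowed({"app": "web"}, {"app": "api"}, policies)
```

In this model, `web_to_db` is blocked, `api_to_db` is allowed, and `web_to_api` stays open because no policy selects the `api` pods. That last case is why an unselected pod remains reachable from everywhere.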

Why IAM demands an Attack Graph first approach

Most IAM programs start with static access lists, but attackers exploit paths, not lists. An Attack Graph shows how identities and permissions can be chained for lateral movement and takeover. By modeling these paths first, security teams can prioritize real, exploitable risks and fix what matters. This shift helps align identity security with how attacks actually happen, not just how access is managed.
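As a rough illustration of "paths, not lists," the sketch below models identities and permissions as a directed graph and searches for the shortest chain to a sensitive role. The node names and edges are invented for the example, and breadth-first search is just one simple way to walk such a graph.

```python
from collections import deque

# Hypothetical edges: "A -> B" means an attacker who controls A can reach B.
EDGES = {
    "dev-user": ["ci-role"],
    "ci-role": ["deploy-role", "artifact-bucket"],
    "deploy-role": ["admin-role"],
    "support-user": ["ticket-db"],
}


def shortest_attack_path(edges, start, target):
    """Breadth-first search; returns the shortest chain of hops, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            return path
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


dev_path = shortest_attack_path(EDGES, "dev-user", "admin-role")
support_path = shortest_attack_path(EDGES, "support-user", "admin-role")
```

A static access list would show `dev-user` with no admin rights at all. The graph shows a three-hop chain to `admin-role`, while `support-user` has no path and can be deprioritized.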

12-Month Cloud Security Challenge Just Dropped – Practice, Compete, and Get Certified

Wiz has launched Cloud Champions, a monthly CTF challenge series focused on real-world cloud security scenarios. Each challenge is crafted by Wiz researchers and designed to help practitioners sharpen their skills through hands-on problem-solving. The first challenge, “Perimeter Leak,” went live in June, with more slated through May 2026. A leaderboard tracks participant progress and highlights top performers.

Building AI agents that hunt like cloud adversaries

Security researchers are building AI agents that think and act like advanced cloud attackers: chaining permissions, pivoting across services, and executing real-world privilege escalation paths in AWS. These agents outperform traditional tools by reasoning contextually and automating multi-step attack logic.

Simplify Kubernetes Security With Kyverno and OPA Gatekeeper

Kyverno and OPA Gatekeeper help secure Kubernetes by blocking risky configurations before they’re deployed. Kyverno is easier to use, with YAML policies and native Kubernetes integration, while OPA Gatekeeper offers deeper flexibility using Rego for complex rules. Both tools can enforce critical security practices, like banning :latest image tags, to improve cluster safety and compliance.
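The :latest rule is easy to get subtly wrong, so here is a small sketch of the check itself. It shows the logic an admission policy would need to express, not either tool's implementation, and the function names are hypothetical. It assumes that an image reference with no tag resolves to :latest and that a digest-pinned reference does not.

```python
def uses_mutable_latest(image: str) -> bool:
    """True when an image reference resolves to the :latest tag."""
    if "@" in image:
        return False  # pinned by digest, e.g. repo/app@sha256:...
    last_segment = image.rsplit("/", 1)[-1]  # ignore any registry host:port prefix
    if ":" not in last_segment:
        return True  # no tag given, so the runtime defaults to :latest
    return last_segment.rsplit(":", 1)[1] == "latest"


def violations(pod_spec: dict) -> list:
    """Names of containers in a pod spec (as a dict) that rely on :latest."""
    return [
        c["name"]
        for c in pod_spec.get("containers", [])
        if uses_mutable_latest(c["image"])
    ]


pod = {
    "containers": [
        {"name": "web", "image": "nginx"},
        {"name": "api", "image": "localhost:5000/api:v2"},
        {"name": "job", "image": "busybox:latest"},
    ]
}
flagged = violations(pod)
```

Here `flagged` contains `web` and `job`. The registry port in `localhost:5000/api:v2` is not mistaken for a tag because only the last path segment is inspected.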

⚙️ Infrastructure & DevOps

Uber Cuts CI Costs by 53% Using Smarter Build Prioritization

Uber enhanced its SubmitQueue CI system to reduce CPU usage by 53% and cut wait times by 37% across its massive monorepos. The update uses a new probabilistic model to prioritize builds that are more likely to succeed or unblock smaller changes. This lets faster commits bypass larger ones.

Figma spends $300,000 on AWS daily

Figma disclosed in its IPO filing that it now spends nearly $300,000 daily on AWS, committing to $545 million over five years. The design platform is fully dependent on AWS infrastructure and policies, highlighting vendor lock-in risks.

Top 10 DevOps Tools in 2025: Based on 300 LinkedIn job posts

GitHub Actions, Terraform, Kubernetes, and ArgoCD top the list, praised for integration and power, but not without their quirks. The takeaway: there's no perfect stack, just the right mix for your team’s context and scale.

mirrord Adds Queue Splitting to Enable Shared Debugging in the Cloud

mirrord for Teams now supports queue splitting, letting developers work on the same service in a shared cloud environment without stepping on each other’s toes. With support for AWS SQS (Kafka and RabbitMQ coming soon), devs can apply filters so only their local app receives relevant messages. This enables real-time debugging with zero disruption to live services or teammates.
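The idea behind queue splitting can be sketched as a routing decision per message. This is a conceptual model only: the filter format, consumer names, and `route` function are assumptions made for illustration and are not mirrord's configuration or API.

```python
import re

# Hypothetical per-developer filters: attribute name -> regex the value must match.
FILTERS = {
    "alice-local": {"tenant": r"^alice-"},
    "bob-local": {"tenant": r"^bob-"},
}


def route(message_attributes: dict, filters: dict, default: str = "deployed-service") -> str:
    """Return the consumer that should receive a message."""
    for consumer, rules in filters.items():
        if all(
            re.search(pattern, message_attributes.get(attr, ""))
            for attr, pattern in rules.items()
        ):
            return consumer
    # Messages that match nobody's filter go to the deployed service as usual.
    return default


to_alice = route({"tenant": "alice-test"}, FILTERS)
to_prod = route({"tenant": "acme-corp"}, FILTERS)
```

In this sketch, messages tagged for Alice reach only her local app, and everything else continues to the deployed service, so live traffic and teammates are not affected.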

📦 Kubernetes & Cloud Native

Kubernetes Faces Gaps in Handling Device Failures for AI/ML Pods

As AI/ML workloads relying on GPUs become more common, Kubernetes struggles with device failure modes like partial GPU outages, degraded performance, and scheduling fragility. DIY fixes exist, but lack standardization, and core systems don’t correlate device health with pod behavior.

Simplifying platform engineering at John Lewis - part one

John Lewis replaced its monolithic commerce system with a multi-tenant, microservice-based architecture on Google Kubernetes Engine. A central “paved road” platform now automates provisioning, observability, and security, letting product teams deploy independently while maintaining guardrails. This approach boosts developer velocity, minimizes cognitive load, and balances consistency with flexibility as new services emerge.

A visual guide on troubleshooting Kubernetes deployments


Azure Boosts PostgreSQL Performance on AKS With Local NVMe & CloudNativePG

Microsoft now supports high-performance PostgreSQL on Azure Kubernetes Service using local NVMe via Azure Container Storage and the CloudNativePG operator. Benchmarks show up to 26,000 TPS with sub-5ms latency. For price-sensitive workloads, Premium SSD v2 offers flexible scaling and solid performance.

🔍 Observability & SRE

Airbnb Scales Load Testing with Impulse Framework

Airbnb developed Impulse, a decentralized load-testing framework integrated with CI/CD, to help teams test service reliability at scale. It includes a context-aware load generator, dependency mocker, traffic replay collector, and synthetic API generator for async flows.

How we're building an agentic system to drive Grafana

Grafana is moving beyond simple AI chat responses by building agentic systems that can reason and take action, like creating dashboards or debugging metrics, based on real-time context. Powered by the open source MCP Server, these agents interact with Grafana APIs to perform complex, multi-step workflows.

Ciroos Launches AI SRE Teammate with $21M in Funding

Ciroos has raised $21 million to launch its AI-powered “SRE Teammate,” a multi-agent system that autonomously detects, diagnoses, and resolves incidents across cloud, Kubernetes, and networking environments. Unlike traditional observability tools, it acts like an expert partner, correlating signals and automating root-cause analysis without runbooks.

Benchmarking OpenTelemetry Overhead in Go Applications

A recent benchmark measured the performance impact of enabling OpenTelemetry tracing in a Go app under a load of 10,000 req/s. CPU usage rose ~35% and memory grew from 10 MB to 15–18 MB, mostly due to span processing. p99 latency increased by ~5 ms, and outbound telemetry added 4 MB/s of network traffic.
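To put those figures in per-request terms, here is a quick back-of-the-envelope calculation using only the numbers quoted above. It assumes decimal megabytes (1 MB = 1,000,000 bytes) and a constant load for the daily projection.

```python
REQ_PER_SEC = 10_000
TELEMETRY_BYTES_PER_SEC = 4 * 1_000_000   # 4 MB/s, assuming decimal megabytes
BASE_MEM_MB = 10
TRACED_MEM_MB = (15, 18)

# Telemetry emitted per request, in bytes.
bytes_per_request = TELEMETRY_BYTES_PER_SEC / REQ_PER_SEC

# Relative memory growth at the low and high ends of the reported range.
mem_increase_pct = tuple((m - BASE_MEM_MB) / BASE_MEM_MB * 100 for m in TRACED_MEM_MB)

# Outbound telemetry per day if the load were sustained around the clock.
telemetry_gb_per_day = TELEMETRY_BYTES_PER_SEC * 86_400 / 1e9
```

That works out to about 400 bytes of telemetry per request, a 50–80% rise in memory, and roughly 345.6 GB of telemetry per day at sustained load. The daily figure is the one most likely to matter for egress and backend ingestion costs.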


📢 If your company is interested in reaching an audience of developers, technical professionals, and decision makers, you may want to advertise with us.

If you have any comments or feedback, just reply to this email.

Thanks for reading and have a great day!