





















































Hi
Welcome to the fifth issue of Deep Engineering.
With AI workloads reshaping infrastructure demands and distributed systems becoming the default, engineers are facing new failure modes, stricter trade-offs, and rising expectations in both practice and hiring.
To explore what today’s engineers need to know, we spoke with Dhirendra Sinha (Software Engineering Manager at Google, and long-time distributed systems educator) and Tejas Chopra (Senior Engineer at Netflix and Adjunct Professor at UAT). Their recent book, System Design Guide for Software Professionals (Packt, 2024), distills decades of practical experience into a structured approach to design thinking.
In this issue, we unpack their hard-won lessons on observability, fault tolerance, automation, and interview performance—plus what it really means to design for scale in a world where even one-in-a-million edge cases are everyday events.
You can watch the full interview and read the transcript here—or keep reading for our distilled take on the design mindset that will define the next decade of systems engineering.
“Foundational system design principles—like scalability, reliability, and efficiency—are remarkably timeless,” notes Chopra, adding that “the rise of AI only reinforces the importance of these principles.” In other words, new AI systems can’t compensate for poor architecture; they reveal its weaknesses. Sinha concurs: “If the foundation isn’t strong, the system will be brittle—no matter how much AI you throw at it.” AI and system design aren’t at odds – “they complement each other,” says Chopra, with AI introducing new opportunities and stress-tests for our designs.
One area where AI is elevating system design is in AI-driven operations (AIOps). Companies are increasingly using intelligent automation for tasks like predictive autoscaling, anomaly detection, and self-healing.
“There’s a growing demand for observability systems that can predict service outages, capacity issues, and performance degradation before they occur,” notes Sam Suthar, founding director of Middleware. AI-powered monitoring can catch patterns and bottlenecks ahead of failures, allowing teams to fix issues before users notice. At the same time, designing the systems to support AI workloads is a fresh challenge. The recent rollout of a Ghibli-style image generator saw explosive demand – so much that OpenAI’s CEO had to ask users to pause as GPU servers were overwhelmed. That architecture didn’t fully account for the parallelization and scale such AI models required. AI can optimize and automate a lot, but it will expose any gap in your system design fundamentals. As Sinha puts it, “AI is powerful, but it makes mastering the fundamentals of system design even more critical.”
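As a rough illustration of the anomaly-detection idea, the sketch below flags latency samples that sit far from a rolling baseline. It is a plain z-score check, not the method used by any product or team mentioned here; the function name, window size, and threshold are assumptions chosen for illustration.

```python
from collections import deque
from statistics import mean, pstdev


def find_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the recent baseline.

    Hypothetical helper: a rolling z-score over the previous `window` samples.
    """
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        if len(recent) == window:
            baseline = mean(recent)
            spread = pstdev(recent)
            # Guard against a flat baseline, where the spread is zero.
            if spread > 0 and abs(value - baseline) / spread > threshold:
                anomalies.append(i)
                continue  # keep the outlier out of the baseline
        recent.append(value)
    return anomalies


# Made-up latency samples in milliseconds, with one obvious spike.
latencies_ms = [100, 102, 98, 101, 99, 100, 450, 101, 97]
print(find_anomalies(latencies_ms))
```

A real AIOps pipeline would likely account for seasonality and correlate several signals; the point here is only that a baseline plus a deviation rule can surface a spike before it becomes an outage.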
So, what does it take to operate at web scale in 2025? Sinha highlights four key challenges facing large-scale systems today:
Chopra offers an example from Netflix: “We once had a live-streaming event where we expected a certain number of users – but ended up with more than three times that number.” The system struggled not because it was fundamentally mis-designed, but due to hidden dependency assumptions. In a microservices world, “you don’t own all the parts—you depend on external systems. And if one of those breaks under load, the whole thing can fall apart,” Chopra warns. A minor supporting service that wasn’t scaled for 3× traffic can become the linchpin that brings down your application. This is why observability is paramount. At Netflix’s scale (hundreds of microservices handling asynchronous calls), tracing a user request through the maze is non-trivial. Teams invest heavily in telemetry to know “which service called what, when, and with what parameters” when things go wrong. Even so, “stitching together a timeline can still be very difficult” in a massive distributed system, especially with asynchronous workflows. Modern observability tools (distributed tracing, centralized logging, etc.) are essential, and even these are evolving with AI assistance to pinpoint issues faster.
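To make the "which service called what, when, and with what parameters" idea concrete, here is a minimal sketch of stitching recorded spans into a timeline for one request. The `Span` fields, service names, and trace IDs are hypothetical and do not describe Netflix's telemetry; a production system would normally rely on a tracing framework rather than hand-rolled records.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One recorded call: hypothetical fields for illustration."""
    trace_id: str
    service: str
    operation: str
    start_ms: int
    params: dict = field(default_factory=dict)


def build_timeline(spans, trace_id):
    """Return the spans that belong to one request, ordered by start time."""
    matching = [s for s in spans if s.trace_id == trace_id]
    return sorted(matching, key=lambda s: s.start_ms)


# Spans arrive out of order and interleaved with other requests.
spans = [
    Span("t-1", "playback", "get_manifest", 12, {"title": "live-event"}),
    Span("t-2", "gateway", "route", 3),
    Span("t-1", "gateway", "route", 0),
    Span("t-1", "entitlement", "check", 5, {"user": "u1"}),
]

for span in build_timeline(spans, "t-1"):
    print(f"{span.start_ms:>4} ms  {span.service}.{span.operation}  {span.params}")
```

Sorting by timestamp assumes reasonably synchronized clocks across services. With asynchronous workflows that assumption weakens, which is one reason stitching a timeline remains hard in practice.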
So how do Big Tech companies approach scalability and robustness by design? One mantra is to design for failure. Assume everything will eventually fail and plan accordingly. “We operate with the mindset that everything will fail,” says Chopra. That philosophy birthed tools like Netflix’s Chaos Monkey, which randomly kills live instances to ensure the overall system can survive outages. If a service or an entire region goes down, your architecture should gracefully degrade or auto-heal without waking up an engineer at 2 AM. Sinha recalls an incident from his days at Yahoo:
“I remember someone saying, ‘This case is so rare, it’s not a big deal,’ and the chief architect replied, ‘One in a million happens every hour here.’ That’s what scale does: it invalidates your assumptions.”
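The arithmetic behind that remark is easy to check. The traffic figure below is an assumed round number for illustration, not one reported by Sinha or Yahoo.

```python
def rare_events_per_hour(requests_per_second, probability):
    """Expected number of events per hour for a given per-request probability."""
    seconds_per_hour = 3600
    return requests_per_second * seconds_per_hour * probability


# Assumed load: 300 requests per second (illustrative only).
print(rare_events_per_hour(300, 1e-6))
```

At that assumed rate, a one-in-a-million case is expected about once an hour, and ten times the traffic makes it roughly a once-every-six-minutes event.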
In high-scale systems, even million-to-one chances occur regularly, so no corner case is truly negligible. In Big Tech, achieving resilience at scale has resulted in three best practices:
These same principles—resilience, clarity, and structured thinking—also underpin how engineers should approach system design interviews.
Cracking the system design interview is a priority for many mid-level engineers aiming for senior roles, and for good reason. Sinha points out that system design skill isn’t just a hiring gate – it often determines your level/title once you’re in a company. Unlike coding interviews where problems have a neat optimal solution, “system design is messy. You can take it in many directions, and that’s what makes it interesting,” Sinha says. Interviewers want to see how you navigate an open-ended problem, not whether you can memorize a textbook solution. Both Sinha and Chopra emphasize structured thinking and communication. Hiring managers deliberately ask ambiguous or underspecified questions to see if the candidate will impose structure: Do they ask clarifying questions? Do they break the problem into parts (data storage, workload patterns, failure scenarios, etc.)? Do they discuss trade-offs out loud? Sinha and Chopra offer two guidelines:
Through years of interviews, Sinha and Chopra have noticed three common pitfalls:
Tech interviews in general have gotten more demanding in 2025. The format of system design interviews hasn’t drastically changed, but the bar is higher. Companies are more selective, sometimes even “downleveling” strong candidates if they don’t perfectly meet the senior criteria. Evan King and Stefan Mai, cofounders of an interview preparation startup, observe in an article in The Pragmatic Engineer that “performance that would have secured an offer in 2021 might not even clear the screening stage today.” This reflects a market where competition is fierce and expectations for system design prowess are rising. But as Chopra and Sinha illustrate, the goal is not to memorize solutions – it’s to master the art of trade-offs and critical thinking.
System design isn’t just an interview checkbox – it’s a fundamental skill for career growth in engineering. “A lot of people revisit system design only when they're preparing for interviews,” Sinha says. “But having a strong grasp of system design concepts pays off in many areas of your career.” It becomes evident when you’re vying for a promotion, writing an architecture document, or debating a new feature in a design review.
Engineers with solid design fundamentals tend to ask the sharp questions that others miss (e.g. What happens if this service goes down? or Can our database handle 10x writes?). They can evaluate new technologies or frameworks in the context of system impact, not just code syntax. Technical leadership roles especially demand this big-picture thinking. In fact, many companies now expect even engineering managers to stay hands-on with architecture – “system design skills are becoming non-negotiable” for leadership.
Mastering system design also improves your technical communication. As you grow more senior, your success depends on how well you can simplify complexity for others – whether in documentation or in meetings. “It’s not just about coding—it’s about presenting your ideas clearly and convincingly. That’s a huge part of leadership in engineering,” Sinha notes. Chopra agrees, framing system design knowledge as almost a mindset: “System design is almost a way of life for senior engineers. It’s how you continue to provide value to your team and organization.” He compares it to learning math: you might not explicitly use the quadratic formula daily, but learning it trains your brain in problem-solving.
Perhaps the most exciting aspect is that the future is wide open. “Many of the systems we’ll be working on in the next 10–20 years haven’t even been built yet,” Chopra points out. We’re at an inflection point with technologies like AI agents and real-time data streaming pushing boundaries; those with a solid foundation in distributed systems will be the “go-to” people to harness these advances. And as Chopra notes,
“seniority isn’t about writing complex code. It’s about simplifying complex systems and communicating them clearly. That’s what separates great engineers from the rest.”
System design proficiency is a big part of developing that ability to cut through complexity.
While core principles remain steady, the ecosystem around system design is evolving rapidly. We can identify three significant trends:
Despite faster cycles, sharper constraints, and more automation, system design remains grounded in principles. As Chopra and Sinha make clear, the ability to reason about failure, scale, and trade-offs isn’t just how systems stay up; it’s also how engineers move up in their careers.
If you found Sinha and Chopra’s perspective on designing for scale and failure compelling, their book System Design Guide for Software Professionals unpacks the core attributes that shape resilient distributed systems. The following excerpt from the book breaks down how consistency, availability, partition tolerance, and other critical properties interact in real-world architectures. You’ll see how design choices around reads, writes, and replication influence system behavior—and why understanding these trade-offs is essential for building scalable, fault-tolerant infrastructure.
…
Before we jump into the different attributes of a distributed system, let’s set some context in terms of how reads and writes happen.
Let’s consider an example of a hotel room booking application (Figure 2.1). A high-level design diagram helps us understand how writes and reads happen:
Figure 2.1 – Hotel room booking request flow
As shown in Figure 2.1, a user (u1) is booking a room (r1) in a hotel and another user is trying to see the availability of the same room (r1) in that hotel. Let’s say we have three replicas of the reservations database (db1, db2, and db3). There can be two ways the writes get replicated to the other replicas: The app server itself writes to all replicas or the database has replication support and the writes get replicated without explicit writes by the app server.
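The two replication options can be sketched in a few lines. This is an illustrative in-memory model, not code from the book: the `Replica` class, the helper functions, and the dictionary-backed storage are all assumptions made for the example.

```python
class Replica:
    """Hypothetical in-memory stand-in for one reservations database replica."""

    def __init__(self, name):
        self.name = name
        self.bookings = {}  # room id -> user id

    def write(self, room, user):
        self.bookings[room] = user

    def is_available(self, room):
        return room not in self.bookings


def write_all(replicas, room, user):
    """Option 1: the app server writes to every replica itself."""
    for replica in replicas:
        replica.write(room, user)


def write_primary(primary, room, user):
    """Option 2, step 1: the app server writes to a single replica only."""
    primary.write(room, user)


def replicate(primary, followers):
    """Option 2, step 2: the database copies the writes to the other replicas."""
    for follower in followers:
        follower.bookings.update(primary.bookings)


db1, db2, db3 = Replica("db1"), Replica("db2"), Replica("db3")

write_primary(db1, "r1", "u1")      # u1 books room r1 via db1
print(db3.is_available("r1"))       # True: replication has not run yet
replicate(db1, [db2, db3])
print(db3.is_available("r1"))       # False: db3 now sees the booking
```

In this model, a read served by db3 before replication completes still reports r1 as available, which is the kind of behavior that makes the choice of replication approach matter.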
Let’s look at the write and the read flows:
System Design Guide for Software Professionals by Dhirendra Sinha and Tejas Chopra (Packt, August 2024) is a comprehensive, interview-ready manual for designing scalable systems in real-world settings. Drawing on their experience at Google, Netflix, and Yahoo, the authors combine foundational theory with production-tested practices—from distributed systems principles to high-stakes system design interviews.
For a limited time, get the eBook for $9.99 at packtpub.com — no code required.
Diagrams 0.24.4 — Architecture Diagrams as Code, for System Designers
Diagrams is an open source Python toolkit that lets developers define cloud architecture diagrams using code. Designed for rapid prototyping and documentation, it supports major cloud providers (AWS, GCP, Azure), Kubernetes, on-prem infrastructure, SaaS services, and common programming frameworks—making it ideal for reasoning about modern system design.
The latest release (v0.24.4, March 2025) adds stability improvements and ensures compatibility with recent Python versions. Diagrams has been adopted in production projects like Apache Airflow and Cloudiscovery, where infrastructure visuals need to be accurate, automatable, and version controlled.
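As a brief sketch of what a diagram defined in code can look like, the snippet below draws the hotel booking layout from the excerpt above using Diagrams. It assumes the `diagrams` package and Graphviz are installed; the function name, node labels, output filename, and the choice of the PostgreSQL icon for the replicas are illustrative, not taken from the project's documentation or from the book.

```python
def render_booking_diagram():
    """Render an app server writing to three database replicas as an image."""
    # Imported here so this module still loads when `diagrams` is not installed.
    from diagrams import Diagram
    from diagrams.onprem.client import User
    from diagrams.onprem.compute import Server
    from diagrams.onprem.database import PostgreSQL

    # show=False writes the image file without opening a viewer.
    with Diagram("Hotel room booking", filename="hotel_booking", show=False):
        app = Server("app server")
        replicas = [PostgreSQL("db1"), PostgreSQL("db2"), PostgreSQL("db3")]
        # Edges: user -> app server -> each replica.
        User("u1") >> app >> replicas
```

Calling `render_booking_diagram()` should write the image to the working directory. Because the diagram is ordinary Python, it can be reviewed and versioned alongside the system it describes.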
Highlights:
That’s all for today. Thank you for reading the fifth issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.
Take a moment to fill out this short survey we now run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.
We’ll be back next week with more expert-led content.
Stay awesome,
Divya Anne Selvaraj
Editor in Chief, Deep Engineering
If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.