
Deep Engineering


Deep Engineering #7: Managing Software Teams in the Post-AI Era with Fabrizio Romano

Divya Anne Selvaraj
03 Jul 2025
From lean organizations and AI tools to Gen Z teams, the software team manager's job has changed. Here's how to lead without losing touch, and how to decide whether you want to move into management.

Workshop: Unpack OWASP Top 10 LLMs with Snyk

Join Snyk and OWASP Leader Vandana Verma Sehgal on Tuesday, July 15 at 11:00AM ET for a live session covering:
✓ The top LLM vulnerabilities
✓ Proven best practices for securing AI-generated code
✓ How Snyk's AI-powered tools automate and scale secure development
See live demos, plus earn 1 CPE credit!
Register today

Hi, welcome to the seventh issue of Deep Engineering.

The software manager's role is being remade, less by choice than by necessity. The old playbook, where managers translated product priorities into sprints and stayed one layer removed from the code, no longer holds. In 2025, development managers are navigating leaner organizations, AI-assisted teams, hybrid work models, and a workforce increasingly shaped by Gen Z expectations.

To understand this shift and glean best practices, we spoke with Fabrizio Romano, author of Learn Python Programming and development manager at Sohonet. We also examine what the transition from senior engineer to manager really entails, and how to know if that's the right move for you. Throughout, we draw on Romano's experience, alongside insights from other engineering leaders like Gergely Orosz, Mirek Stanek, Nick Centino, and Vladimir Klepov, to unpack the changing traits, tensions, and tradeoffs of modern development management.

You can watch Romano's complete interview, which covers both his experiences with Python and as an engineering manager, and read the transcript here, or read on for a deep dive focused on engineering management.

Sign Up | Advertise

Leading Software Teams in Changing Times with Fabrizio Romano

While a desire to nurture growth in others is crucial to success in management, the evolving landscape of software development brings a set of external challenges that shape how development managers must lead. As Romano suggests, becoming a development manager isn't just about mastering technical skills; it is about understanding and adapting to the broader trends reshaping the industry, particularly in a post-AI world. The role has become more complex and dynamic than ever, influenced by leaner organizations and teams, more Millennials and Gen Zs in the workforce, remote-first work, AI-powered development tools, and an increasing focus on efficiency over expansion. These shifts have created new expectations for managers, testing their ability to balance people development with technical leadership.

The Current State of Development Management

The post-COVID world is seeing significant changes in how development teams are structured, with many organizations flattening their hierarchies to reduce layers of management. This shift to leaner teams, combined with the increasing use of AI tools like GitHub Copilot and Cursor, has created new challenges for development managers.

Leaner Organizations

As Mirek Stanek, PL Engineering Site Lead at Papaya Global, points out, one of the most profound changes in development management is the trend towards fewer managers and a greater emphasis on individual contributors (ICs).
In organizations where budget cuts and performance metrics dominate, managers are now expected to maximize the productivity of their teams with fewer resources. This is in line with Amazon's directive, shared in a letter from CEO Andy Jassy to employees in September 2024, to increase the ratio of ICs to managers by 15% by Q1 2025. It reflects a broader trend in which leadership roles are scrutinized more heavily and managers must justify their position by demonstrating tangible value to the organization.

The hands-on expectations of development managers have therefore increased. In previous decades, a manager could expect to focus on strategy, vision, and team alignment, while ICs handled the bulk of coding tasks. Today, many engineering managers (EMs) are expected to stay deeply involved in the technical aspects of development. As Vladimir Klepov, EM at Ozon Bank, discusses in his reflections, a manager who is disconnected from the technical work risks losing touch with the challenges their team faces on the ground. Hands-on leadership, being embedded in the development process, is now a critical competency for effective development managers.

Managing Gen Z, Millennials, and the New Workforce Expectations

Another change reshaping development management is the increasing presence of Gen Z and Millennials in the workforce. According to Elizabeth Faber, Deloitte Global Chief People & Purpose Officer:

"Projected to make up roughly two-thirds of the labor force within the next few years, Gen Zs and millennials are likely to be a defining force in the future of work—one that looks less like a ladder and more like an interconnected web of growth, values, and reinvention."

Stanek also points out that Gen Z values work-life balance, professional growth opportunities, and authentic leadership. Concluding from the 14th Deloitte Global Gen Z and Millennial Survey, Faber writes that, for Gen Z and Millennial workers to feel truly supported and fulfilled, managers must be empowered to support employee well-being by:

- Addressing team stressors
- Promoting work/life balance
- Recognizing contributions
- Enabling growth
- Facilitating access to mental health resources

For development managers, this means adapting leadership styles to align with these expectations. Managers must be more emotionally intelligent, open to feedback, and flexible in how they structure their teams. It also reflects the broader trend of remote and hybrid work models. While some companies, like Amazon, are pushing for a return to the office, many development managers will need to navigate the challenges of managing a distributed, remote-first workforce while ensuring cohesion and a sense of purpose within their teams.

Working with Distributed and Diverse Teams

Managing teams split across cities or continents adds its own set of challenges, and opportunities. Stanek writes:

"The pandemic showed us how teams can function effectively remotely, but it also highlighted the limitations of remote work, such as the lack of nonverbal communication cues and the blurring of work-life boundaries."

Nataliia Peterheria, Operations Manager at Django Stars, recommends the following practices to overcome dissonance in remote and hybrid team setups:

- Choose one primary communication channel (e.g., Slack, Google Hangouts) and stick to it to reduce information loss. Supplement with one or two backups only when necessary.
- Every team member should maintain a complete profile, with a real photo, job title, contact number, and bio, so others can quickly understand roles and reach out when needed.
- Set up a single "source of truth" for documentation and decisions, like Confluence or a shared wiki, structured simply (no more than three nested levels). Keep specs, requirements, and changes in one place, and annotate directly on the relevant topic pages to avoid fragmentation.
- Create a structured work schedule with overlapping hours for live collaboration. Use this shared window for time-sensitive interactions like team calls or joint problem-solving. Schedule overlapping meetings in advance, prioritize ruthlessly, and stay consistent to avoid drifting into 24/7 work mode.
- Use daily checklists to track questions, progress, and blockers. Organize them by project and link them to your source of truth or project repo. Checklists help ensure timely answers and keep asynchronous work from stalling.
- Standardize request communication to prevent missed inputs. Assign a single person (often the PM) to collect product owner requests, or reserve regular meeting slots to introduce new requirements to the full team.
- Require approval for all logic changes or scope updates, no matter how minor. Even well-intentioned "improvements" by developers must be signed off by business stakeholders to prevent misalignment or scope creep.
- Define escalation paths clearly. Publish a diagram showing who is responsible for what and who to contact when something goes wrong. Team members should know exactly how to escalate unresolved issues, internally or with the client.
- Align on a common task tracking and documentation toolset before kickoff. Avoid fragmented tracking (e.g., team members using their own spreadsheets). Centralize around one system, even if it means switching from a personal favorite.
- Codify remote technical workflows. Set clear guidelines for pull request handling, commit hygiene, and review expectations. Include code style guides to prevent inconsistency and ensure maintainability when multiple people contribute to the same codebase.
- Assess technical readiness before the project starts. Identify gaps in tooling knowledge, run onboarding sessions where needed, and provide up-to-date guides for any systems that require self-service support.

In addition to these, there is the human side of management. Romano describes watching body language and the tone of Slack messages for signs of stress in his team. If a developer seems off or tensions are brewing, he takes time to talk one-on-one and understand the issue. In some cases, he even teaches simple meditation or mindfulness techniques to help his engineers re-center under pressure. "When you're upset, frustrated, or angry… it triggers a fight-or-flight response… If you keep stimulating that state… it becomes a health risk," he explains, drawing from his experience in martial arts that a "relaxed mind is a creative mind." By coaching his team in emotional intelligence and stress management, he not only cares for their well-being but also ensures they stay productive and collaborative. This kind of empathetic leadership, once rare in engineering circles, is increasingly recognized as key to maintaining high-performing teams.

AI Tools: A Double-Edged Sword for Development Managers

In addition to managing shifting workforce dynamics, AI is becoming an integral tool for development teams.
AI-driven tools like GitHub Copilot are no longer just productivity boosters; they are changing how software is developed at a fundamental level. For example, Gergely Orosz, author of The Software Engineer's Guidebook, reports in The Pragmatic Engineer that "90% of the code for Claude Code is written by Claude Code(!)."

The rise of AI coding assistants and automation is one of the defining trends reshaping development management. Tools like GitHub Copilot, ChatGPT, and other AI pair programmers are rapidly becoming part of daily software engineering workflows. GitLab's 2024 Global DevSecOps Report found that 39% of software professionals are already using AI in development, up 16 percentage points from the year prior. Moreover, 60% say implementing AI is now essential to avoid falling behind competitively. Development managers now face the challenge of integrating AI effectively into their team's workflow while ensuring that these tools don't hinder creativity or lead to over-reliance.

"We have to use AI. I think a developer who refuses to embrace AI today is probably going to be obsolete very soon," says Romano, underscoring the urgency of adaptation. He adds: "At Sohonet, in my role, I got everyone on my team set up with GitHub Copilot. I wanted them to start using it, get familiar with it, and understand how to leverage what it can offer."

By equipping his engineers with Copilot, he aimed to help them embrace AI-assisted development rather than fear it. Romano notes, "Copilot is especially helpful for menial or repetitive tasks—like hardcoding different test cases. It's really good at predicting what the next test case might be." "Even when it's just acting like a better IntelliSense, it's still useful… instead of rewriting a line yourself, you just hit Tab and it's done," Romano says.
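To make the "repetitive test cases" point concrete, the sketch below shows the kind of table-like test scaffolding that completion tools tend to extend well. It is purely illustrative: the shipping_cost function and its thresholds are hypothetical, not from Romano's codebase or any real project.

```python
# Illustrative sketch: repetitive, table-like tests are the kind of code
# Copilot-style tools are good at extending. The shipping_cost function and
# its thresholds are hypothetical examples, not from any real codebase.
import pytest


def shipping_cost(order_total: float) -> float:
    """Flat fee below 50, reduced fee below 100, free above that."""
    if order_total < 50:
        return 5.99
    if order_total < 100:
        return 2.99
    return 0.0


@pytest.mark.parametrize(
    "order_total, expected",
    [
        (0.00, 5.99),    # empty order still pays the flat fee
        (49.99, 5.99),   # just below the first threshold
        (50.00, 2.99),   # boundary: reduced fee starts here
        (99.99, 2.99),   # just below the free-shipping threshold
        (100.00, 0.00),  # boundary: free shipping starts here
        # ...after a few rows like these, an assistant will usually
        # predict the next boundary case on its own
    ],
)
def test_shipping_cost(order_total, expected):
    assert shipping_cost(order_total) == expected
```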
For development managers, the benefit of such tools is twofold: they boost team productivity and free up human developers for more complex, creative work. According to Infragistics' Reveal 2024 survey report, the top reasons developers use generative AI are to increase productivity (49%), eliminate repetitive tasks (38%), and speed up development cycles (36%). Managers who proactively introduce approved AI tools can thus accelerate output and improve developer satisfaction. Romano mentions that his team continually experiments with new AI aides (from code editors like Cursor to AI pair-programming prototypes) to stay on the cutting edge. This reflects a broader best practice: staying up to date with emerging tools and evaluating their potential.

However, Romano also points out that over-relying on AI tools can stunt problem-solving skills, as developers might bypass critical thinking or creative solutions in favor of quick, AI-generated responses. 55% of GitLab's survey respondents also felt that introducing AI into the software development lifecycle is risky. Effective development management in the AI era means finding a balance between leveraging AI and honing human skill. Romano emphasizes that developers shouldn't offload all problem-solving to machines: "Part of the job… was to smash my brain against a problem now and then. That's really beneficial for your thinking… It keeps your mental muscles in shape." "Relying too much on AI to… figure out the next step… that's risky. I still want to 'go to the gym' up here," he quips, referring to exercising one's own mental faculties. Romano encourages each developer to "find the right balance—using AI as a tool, but still keeping their minds fit and challenged."

This balanced approach ensures that while AI accelerates routine coding, it doesn't "dumb down" the team's critical thinking. "If you stop challenging the [AI's] recommendations, they run the risk of dumbing down the reasoning. The true risk is in placing naive faith in quick fixes," cautions Sammi Li, co-founder and CEO of JuCoin, noting that AI can expedite work but must not replace understanding. It falls to the EM to ensure this balance is maintained, for both the team's and the business's benefit.

What the Shift to Engineering Management Really Looks Like

The move from senior engineer to EM is often misunderstood, frequently treated as a natural promotion rather than a deliberate change in function. But this is not a bigger version of the same job. It's a transition into a fundamentally different role, with a new definition of success and a new center of gravity. Here is what EMs say the shift from development to management felt like.

You stop being measured by what you ship. Engineers derive a tangible sense of accomplishment from writing code and seeing it run in production. That feedback loop is fast and direct. Management breaks that loop. "As an EM, you're not the one building the things," says Nick Centino, Principal Engineering Manager at Microsoft. "You're helping empower others to build the things more effectively." This shift, away from hands-on output and toward enabling others, can take years to internalize. Centino himself spent nearly eight years in a dual role before realizing his highest leverage was no longer in the code.

You have to redefine what "impact" means. Orosz writes: "As an engineering manager, you'll need to put company first, team second, and your team members third. And I would also add: yourself as fourth." That's a reversal of the individual contributor mindset, where engineers focus on executing their own tasks and helping teammates as needed. The EM role requires strategic alignment across teams, not just personal productivity.

You stop optimizing for technical challenges. Engineers advance by solving complex problems. Managers progress by preventing them. As Klepov writes, "Of all the possible career moves a seasoned engineer can make, switching to management gives you the most new challenges... without hitting your salary." But these challenges are rarely technical. They involve process alignment, team dynamics, emotional management, and cross-functional friction. As Romano puts it: "Most of what we do is fairly routine... The real challenges lie in everything around the code."

Your working memory breaks down. Many new managers underestimate the cognitive overhead of managing a team. Orosz notes that while ICs can often track all their tasks in their head, managers can't: "As a manager, I have far more things to pay attention to… Keeping all of this in my head doesn't work well for me—so I've started to write things down." Time and task management become not just useful, but essential.

You spend less time writing code, and often none at all. The drop is not optional; it's structural. According to Centino, once you manage five or six people, meaningful individual contribution becomes unsustainable without either cutting corners or burning out. Even if you retain technical context, your job is no longer to build; it's to coach, unblock, coordinate, and align.
"If you feel like you have time to code," Centino warns, "you're either working long hours or not spending enough time with your team."

You enter the domain of slow, uncertain feedback. ICs can validate ideas quickly: deploy a fix, measure a metric, refactor a function. Managers don't get that immediacy. Feedback loops are long and ambiguous. "Very few of your actions produce a visible result in under a month," Klepov notes. "Even the right changes can make things get worse before they get better."

You have to manage people, not just lead them. This distinction matters. Leadership is about vision and influence. Management is about one-on-ones, reviews, process hygiene, and psychological safety. "There's a lot of peopling involved," Centino says. "You need to be listening to people, understand them, spend time with them." For many introverted engineers, that's emotionally exhausting, but non-negotiable. Skipping the people work results in burnout, distrust, and attrition.

You give up control, but remain accountable. Orosz captures the paradox: as a tech lead, you can write code and drive decisions. As a manager, you may do neither, but you're still responsible for outcomes. That means learning to influence without coding, to steer without micromanaging, and to delegate without detaching.

None of this means the shift is a demotion of technical skill. If anything, it requires expanding your judgment from systems to humans. As Romano puts it, "The skills we learn as developers aren't confined to software. They transfer to life." But it is a shift. And for those unprepared, it can be jarring. As Centino warns, "Engineering management and individual contribution are completely different roles."

Is Moving into Management Right for You?

A move into management is often seen as the natural career progression after senior developer or tech lead. However, not everyone is suited to be a development manager, and that's okay. "Managing people is a completely different skill set," Romano candidly remarks. "If you're someone who's drawn to logic, machines, and technical problems—and you're not interested in helping people grow—then you probably shouldn't go down the management path."

Strong coding ability alone does not guarantee success in leadership. The core of the development manager role, Romano says, comes down to a genuine desire to care for people: "That's what this job is really about: doing your best to help the people you manage become healthier, happier, more skilled professionals – and hopefully better human beings too." If that mission excites you more than writing code yourself, it's a sign you might find the management path rewarding.

Despite the persistent narrative that "eventually you're going to become an engineering manager," Centino points out that engineering management and individual contribution are "completely different roles" with different success criteria, daily rhythms, and reward systems. The most common trap is assuming that strong technical performance qualifies someone to lead people. As Romano puts it, "In our industry, we often promote people into management roles just because they're technically strong. But managing people is a completely different skill set." For those drawn to logic, systems, and clean abstractions, people management may feel frustrating and opaque. "People aren't logical like machines," Romano warns. "Managing them requires effort, empathy, and patience."

The core question isn't whether you can manage; it's whether you want to.
"I do think it's important to have a solid foundation in software development before stepping into this role," Romano says. But that's table stakes. What distinguishes successful managers is not technical depth, but a "genuine desire to care for people."

Centino echoes this point: "As an engineering manager, I like to focus most of my attention and effort into growing individuals on the team… If I can align that with the direction the business is heading, then I think we have a great recipe." But if that alignment never comes, if writing code is still your deepest source of satisfaction, management may not be the right move.

Self-awareness, not seniority, should drive the decision. "This type of thing will change over time," Centino notes. "I found myself in a dual role for eight years and didn't really know until the end… what I really felt would motivate me most." Regular reflection, honest conversations with your manager, and exposure to the demands of the role are more reliable indicators than promotion ladders or external expectations.

As Romano says, "If you're only doing it because it's your next step, or because someone handed you the role, it can be tough." But if helping others grow feels like a worthwhile use of your time, and you're willing to trade code for conversations and systems for people, you may be ready to step into the role.

Making the Move: Traits of Successful EMs

If you feel you fit the bill and are ready to take on the challenges that come with managing software teams today, start by building a foundation of both technical and leadership experience:

- Learn to manage time and context switching deliberately: Orosz emphasizes that time management shifts from "maker schedule" to "manager schedule." Future EMs should practice structuring recurring meetings, protect deep work time, and use lightweight systems, such as Getting Things Done (GTD), to track tasks across people and priorities, not just their own.
- Get fluent in setting and supporting growth goals, for others and yourself: As a manager, you won't just pursue your own goals; you'll guide others in theirs. Orosz suggests practicing this by helping peers articulate growth goals, using role frameworks where available. Future EMs must also apply the same discipline to their own goal-setting, or risk drift.
- Seek and learn from mentors before the transition: Orosz didn't wait until he was fully in the role; he proactively asked his management chain to connect him with internal mentors who understood the company's management expectations. Engineers eyeing management should do the same, asking for guidance and observation opportunities ahead of time.
- Develop the habit of reflection, not just execution: Romano and Orosz both stress the importance of stepping back. Engineers often optimize for output; future managers must learn to observe team dynamics, reflect on what's working, and adapt. Orosz models this by reading, writing, attending conferences, and running lightweight experiments with how he works.
- Strengthen emotional awareness and communication range: Romano explicitly notes that successful managers listen closely, pick up non-verbal cues, and adjust their communication style to fit each team member. Aspiring EMs should build this muscle early by observing tone, response patterns, and interpersonal signals on their teams.
- Practice coaching and teaching, not just explaining: Romano compares great management to good teaching: if one explanation fails, try another. Aspiring EMs should practice helping others understand by adapting to their learning style, not defaulting to their own.
- Clarify your own motivation: Denis D., Software Engineering Manager at PaySaaS Technology, and Romano both warn that without an intrinsic interest in people and leadership, the transition becomes painful. Future EMs should reflect early: do they enjoy unblocking others? Does seeing someone else grow feel like progress? If not, they should reconsider the path.
- Build a low-friction system to stay tech-adjacent: Denis maintains a Notion glossary, logs unknown terms, and watches short tutorials to stay grounded in the tech domain even after moving into management. Aspiring EMs can adopt this habit early to prevent drift and preserve confidence in technical discussions.

On the technical side, credibility matters: working several years as an engineer, shipping projects, and understanding the software development lifecycle from firsthand experience will make you a more empathetic and effective leader. As Romano notes, having been "under deadline pressure" or stuck on a stubborn bug helps you relate to the struggles your team faces; "that empathy makes you more effective as a manager."

Former software development manager and author of Coding in Delphi Nick Hodges sums up the job of a software development team manager nicely:

"Sometimes being a manager is hard—even impossible. Sometimes you have to give up being right and put the needs of the entire organization over yourself. Sometimes you have to balance protecting your people with being a loyal member of the management team. Sometimes you have to manage up as well as you manage down. Being right isn't enough—being effective matters more."

If Romano's reflections on team dynamics and career growth sparked your interest, his book Learn Python Programming offers a different kind of guidance, focused on building solid, modern Python skills. Now in its fourth edition, the book covers everything from core syntax and idioms to web APIs, CLI tools, and competitive programming techniques.

Get the Book

🛠️ Tool of the Week ⚒️

Backstage: Open-Source Developer Portal

Backstage provides a central Software Catalog, project templates, and "docs-as-code" infrastructure (TechDocs) so teams can standardize their architecture, onboarding, and documentation.
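To give a feel for how the Software Catalog works, here is a minimal sketch of the catalog-info.yaml file Backstage reads to register a component. The service name, owning team, and repo annotation below are hypothetical placeholders, not values from any real deployment.

```yaml
# Minimal sketch of a Backstage catalog file. The component name, owner team,
# and repo annotation are hypothetical placeholders for illustration.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-service        # hypothetical service name
  description: Handles payment processing for the storefront.
  annotations:
    github.com/project-slug: example-org/payments-service  # placeholder repo
    backstage.io/techdocs-ref: dir:.   # serve TechDocs from this repo
spec:
  type: service
  lifecycle: production
  owner: team-payments          # hypothetical owning team
```

Committing a file like this to each repository is what keeps ownership and architecture information current in the catalog without manual bookkeeping.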
For engineering managers, this means you can enforce coding standards and best practices (via templates and catalogs), keep architecture and ownership information up to date, and give developers self-service access to resources.

Learn more about Backstage

📰 Tech Briefs

- Building Strategic Influence as a Staff Engineer or Engineering Manager by Mark Allen, Engineering Leader & Technical Co-Founder @ Isometric: Outlines how staff engineers and engineering managers can build strategic influence by identifying business priorities, acting with curiosity beyond their role, cultivating cross-functional relationships, shaping their internal brand, and selectively saying yes to high-impact opportunities to grow their organizational visibility and impact.
- How Staff+ Engineers Can Develop Strategic Thinking by Shweta Saraf, Director of Network and Infra Management @ Netflix: Explains how to develop strategic thinking by diagnosing organizational needs, aligning technical decisions with business goals, influencing cross-functional stakeholders, and balancing innovation with risk, emphasizing that strategic impact stems as much from mindset and relationship-building as from technical expertise.
- The AI productivity paradox in software engineering: Balancing efficiency and human skill retention: AI adoption in software engineering is creating a productivity paradox, delivering short-term task efficiency while eroding system performance, cognitive skills, and governance, unless teams integrate AI responsibly with oversight, skill development, and systemic alignment.

That's all for today. Thank you for reading this issue of Deep Engineering. We're just getting started, and your feedback will help shape what comes next. Take a moment to fill out this short survey we run monthly; as a thank-you, we'll add one Packt credit to your account, redeemable for any book of your choice. We'll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Deep Engineering #2: Dave Westerveld on Scalable API Testing and Orchestration

Divya Anne Selvaraj
29 May 2025
Shift-left strategies, parallel test design, and the realities of testing GraphQL, gRPC, and AI-driven APIs at scale.

Hi, welcome to the second issue of Deep Engineering.

Postman's 2024 State of the API report reveals that 74% of teams now follow an API-first approach, signaling a major shift in software development away from the code-first approach. As APIs grow more complex, and as AI agents, gRPC, and GraphQL reshape how services communicate, the question is no longer whether to test early, but how to test well at scale.

In this issue, we speak with Dave Westerveld: developer, author of API Testing and Development with Postman, and testing specialist with years of experience across both mature systems and early-stage teams. Drawing from his work on automation strategy, API integration, and scaling quality practices, Dave offers a grounded take on CI pipelines, parallel execution, and the tradeoffs of modern API protocols.

You can watch the full interview and read the full transcript here, or keep reading for our distilled take on what makes modern test design both reliable and fast.

Sign Up | Advertise

Sponsored: Webinar: Make Your App a Moving Target and Leave Attackers Guessing

Learn how your app could evolve automatically, leaving reverse engineers behind with every release. Hosted by Guardsquare, featuring Anton Baranenko, Product Manager. Date/time: Tuesday, June 10th at 4 PM CET (10 AM EDT).
Register Now

From REST to Agents: Why Systems Thinking Still Anchors Good API Testing with Dave Westerveld

Some testing principles are foundational enough to survive revolutions in tooling. That's the starting point for Dave Westerveld's approach to API testing in the post-AI tech landscape. "There are testing principles that were valid in the '80s, before the consumer internet was even a thing. They were valid in the world of desktop and internet computing, and they're still valid today in the world of AI and APIs."

And while the landscape has shifted dramatically in the last two years (Postman now ships an AI assistant, supports gRPC and GraphQL, and offers orchestration features for agentic architectures), Westerveld believes the best way to scale quality is to combine these new capabilities with timeless habits of mind: systems thinking, structured test design, and a bias for clarity over cleverness.

Systems Thinking as the Foundation

Westerveld argues that API testers need to operate with a systems-level understanding of the software they're validating. He calls this "(the) ability to zoom out and see the entire forest first, and then come back in and see the tree, and realize how it fits into the larger picture and how to approach thinking about and testing it."

In practice, that means asking not just whether an endpoint returns the expected result, but how it fits into the larger architecture and user experience. It means understanding when to run exploratory tests, when to assert workflows, and when to defer to contract validation. These instincts, he says, haven't changed even as APIs have diversified: "Things like how to approach and structure your testing are … timeless when it comes to REST APIs.
They haven't fundamentally changed in the last 20 years—neither should the way you think about testing them." What matters more than syntax is structure: how testers reason about coverage, maintainability, and feedback cycles.

AI as Accelerant, Not Oracle

Postman's Postbot is the most visible new capability in the platform's AI strategy. Built atop LLM infrastructure, it can suggest test cases, generate assertions, and translate prompts into working scripts. Internally, it draws on your Postman data (collections, environments, history) to provide context-aware assistance.

Westerveld sees the benefit, but draws a hard line between skilled and unskilled use: "For a skilled tester, someone with a lot of experience, these AI tools can help you move more quickly through tasks you already know how to do. Often, when you reach that level, you've done a lot of testing—you can look at something and say, 'OK, this is what I need to do here.' But it can get repetitive to implement some scripts or write things out again and again."

He frames AI as an accelerant: helpful when you understand the underlying logic, risky when you don't. "For more junior people, there's a temptation to use AI to auto-generate scripts without fully understanding what those scripts are doing. I think that's the wrong approach early in your career, because once the AI gets stuck, you won't know how to move forward."

This caution aligns with Postman's architectural choices. Postbot uses a deterministic intent classifier to map prompts to supported capabilities, orchestrates tool usage through a controlled execution layer, and codifies outputs as structured in-app actions, such as generating test scripts, visualizing responses, or updating request metadata. Its latest iteration adds a memory-aware agent model that supports multi-turn conversations and multi-action workflows, but with strict boundaries around tool access and state transitions. In this, Westerveld agrees: AI-generated tests are often brittle and opaque. Use them, he advises, "more as a learning tool than an autocomplete tool."

Scaling Through Independence and Restraint

One of Westerveld's strongest positions concerns test design: automated tests should be independent of each other. This is both a correctness and a scalability concern. When teams overuse shared setup code or rely on common state, it breaks test parallelism and increases the chance of cascading failures.

In Postman, reusable scripts are managed via the Package Library, which allows teams to store JavaScript test logic in named packages and import them into requests, collections, monitors, and Flows. While this enables consistency and reuse, Westerveld notes that it also introduces new failure points if not applied judiciously: "If something in the shared code breaks—or if a dependency the shared code relies on fails—you can end up with all your tests failing. …So, you have to be careful that a single point of failure doesn't take everything down."

His solution: only abstract what truly reduces duplication, and mock where necessary. "In cases like that, it's worth asking: 'Do we really need this to be a shared script, or can we mock this instead?' For example, if you're repeatedly calling an authentication endpoint that you're not explicitly testing, maybe you could insert credentials directly instead. That might be a cleaner and faster solution."
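Postman test scripts themselves are JavaScript, but the principle is language-agnostic. As a minimal illustration of what independence looks like, here is a pytest sketch against a hypothetical /orders endpoint: each test builds its own state, a static test credential is injected directly rather than fetched through a shared auth script, and the base URL and token are placeholders, not a real API.

```python
# Minimal, language-agnostic illustration of test independence, sketched with
# pytest and requests. BASE_URL, the /orders endpoint, and TEST_TOKEN are
# hypothetical placeholders, not a real Postman or production API.
import uuid

import requests

BASE_URL = "https://api.example.com"              # placeholder host
HEADERS = {"Authorization": "Bearer TEST_TOKEN"}  # credential inlined; no shared auth call


def test_create_order_returns_201():
    # Builds its own order and shares no state with any other test.
    payload = {"sku": "ABC-123", "quantity": 1, "ref": str(uuid.uuid4())}
    resp = requests.post(f"{BASE_URL}/orders", json=payload, headers=HEADERS)
    assert resp.status_code == 201


def test_get_missing_order_returns_404():
    # Uses a random ID so it never depends on data another test created.
    resp = requests.get(f"{BASE_URL}/orders/{uuid.uuid4()}", headers=HEADERS)
    assert resp.status_code == 404
```

Because neither test reads or writes shared state, a runner can execute them in any order or in parallel (e.g., with pytest-xdist), which is exactly the property Westerveld calls essential for scalable orchestration.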
He also advocates for test readability. Tests, he says, should act as documentation. Pulling too much logic into shared libraries makes them harder to understand. "A well-written test tells you what the system is supposed to do. It shows real examples of expected behavior, even if it's not a production scenario. You can read it and understand the intent. But when you extract too much into shared libraries, that clarity goes away. Now, instead of reading one script, you're bouncing between multiple files trying to figure out how things work. That hurts readability and reduces the test's value as living documentation."

Contracts, Specs, and CI Integration

With Postman's new Spec Hub, teams can now author, govern, and publish API specifications across supported formats, helping standardize collaboration around internal and external APIs. As Westerveld puts it: "The whole point of having a specification is that it defines the contract we're all agreeing to—whether that's between frontend and backend teams, or with external consumers."

He recommends integrating schema checks as early as possible: "If you're violating that contract, the right response is to stop. … So yes, in that sense, we want development to 'slow down' when there's a spec violation. But in the long run, this actually speeds things up by improving quality. You're building on a solid foundation." He advocates running validation as part of the developer CI pipeline, using lightweight checks at merge gates or as part of pull requests.

This pattern aligns with what Postman now enables. Spec Hub introduces governance features such as built-in linting to enforce organizational standards by default. For CI integration, Postman's contract validation tooling can be executed using the Postman CLI or Newman, both of which support running test collections, including those that validate OpenAPI contracts, within continuous integration pipelines. Together, these tools allow teams to maintain a single, trusted specification that anchors both collaboration and automated enforcement across environments.

From REST to gRPC and GraphQL

Protocol diversity is a reality for modern testers. Westerveld emphasizes that while core principles carry over across styles, testing strategies must adapt to the nuances of each protocol. gRPC, for example, provides low-level access through strongly typed RPC calls defined in .proto files. This increases both the power and the surface area of test logic.

"One area where you really see a difference with modern APIs is in how you think about test coverage. The way you structure and approach that will be different from how you'd handle a REST API. That said, there are still similar challenges. For instance, if you're using gRPC and you've got a protobuf or some kind of contract, it's easier to test—just like with REST, if you have an OpenAPI specification. So, advocating for contracts stays the same regardless of API type. But with GraphQL or gRPC, you need more understanding of the underlying code to test them adequately. With REST, you can usually just look at what the API provides and get a good sense of how to test it."

GraphQL, he notes, introduces different complexities, because it's introspective and highly composable: "With GraphQL, there are a lot of possible query combinations… A REST API usually has simple, straightforward docs—'here are the endpoints, here's what they do'—maybe a page or two. With GraphQL, the documentation is often dynamically generated and feels more like autocomplete. You almost have to explore the graph to understand what's available. It's harder to get comprehensive documentation."
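"Exploring the graph" usually means introspection. As a rough sketch (the endpoint URL is a hypothetical placeholder), this is the kind of standard introspection query a tester can send to any GraphQL server that has introspection enabled, to discover what the schema exposes:

```python
# Rough sketch: discovering a GraphQL schema via the standard introspection
# query. The endpoint URL is a hypothetical placeholder.
import requests

INTROSPECTION_QUERY = """
{
  __schema {
    queryType { name }
    types { name kind }
  }
}
"""

resp = requests.post(
    "https://api.example.com/graphql",   # placeholder endpoint
    json={"query": INTROSPECTION_QUERY},
    timeout=10,
)
resp.raise_for_status()

# Print the named types the schema exposes, skipping GraphQL built-ins.
for t in resp.json()["data"]["__schema"]["types"]:
    if not t["name"].startswith("__"):
        print(t["kind"], t["name"])
```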
Postman supports both gRPC and GraphQL natively, enabling users to inspect schemas, craft requests, and run tests, all without writing code. But effective testing still depends on schema discipline and clarity. Westerveld points out that with GraphQL, where documentation can feel implicit or opaque, mock servers and contract-first workflows are critical. Postman helps here too, offering design features that can generate mocks and example responses directly from imported specs.

Orchestration and Shift-Left Strategies

Postman's recent support for the Model Context Protocol (MCP) and the launch of its AI Tool Builder mark a shift toward integrating agent workflows into the API lifecycle. Developers can now build and test MCP-compliant servers and requests using Postman's familiar interface, lowering the barrier to designing autonomous agent interactions atop public or internal APIs. But as Westerveld points out, these advances don't replace fundamentals. His focus remains on feedback speed, execution reliability, and test independence.

"Shift-left and orchestration have been trending for quite a while. As an industry, we've been investing in these ideas for years—and we're still seeing those trends grow. We're pushing testing closer to where the code is written, which is great. At the same time, we're seeing more thorough and complete API testing, which is another great development."

He notes a natural tension between shift-left principles and orchestration complexity: "Shift-left means running tests as early as possible, close to the code. The goal is quick feedback. But orchestration often involves more complexity—more setup, broader coverage—and that takes longer to run. So those two trends can pull in different directions: speed versus depth."

The path forward, he argues, lies in test design and execution architecture: "We're pushing testing left and improving the speed of execution. That's happening through more efficient test design, better hardware, and—importantly—parallelization. Parallelization is key. If we want fast feedback loops and shift-left execution, we need to run tests in parallel. For that to work, tests must be independent. That ties back to an earlier point I made—test independence isn't just a nice-to-have. It's essential for scalable orchestration." "So I think test orchestration is evolving in a healthy direction. We're getting both faster and broader at the same time. And that's making CI/CD pipelines more scalable and effective overall."

💡 What This Means for You

- Prioritize test independence for parallelization: To scale reliably in CI/CD, design tests that don't share state. This is a prerequisite for fast, parallel execution and essential for shift-left strategies to succeed at scale.
- Use AI tools to accelerate, not replace, expertise: Tools like Postbot can speed up repetitive tasks, but they're most effective in the hands of experienced testers. Treat AI as a companion to structured thinking, not a shortcut for understanding.
- Be cautious with reusable scripts: Shared logic can improve maintainability, but overuse increases fragility. Mock where appropriate, and abstract only what truly reduces duplication without harming readability.
- Enforce contracts early through CI: Combine schema-first design with early validation in pull requests (see the sketch after this list).
Postman's Spec Hub and CLI support this model, helping teams catch errors before they spread downstream.
- Adapt your strategy to protocol complexity: REST, gRPC, and GraphQL each demand different approaches to coverage and validation. Understand the shape of your APIs, and tailor your tooling, mocks, and tests accordingly.
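As a minimal sketch of what "validation at the merge gate" can look like outside Postman's own tooling, the test below checks a response body against a schema fragment using the jsonschema library. The endpoint and schema are hypothetical; in practice the schema would be extracted from your OpenAPI spec rather than inlined.

```python
# Minimal sketch of a contract check a CI pipeline could run on every pull
# request. The endpoint and schema fragment are hypothetical placeholders.
import requests
from jsonschema import validate  # pip install jsonschema

ORDER_SCHEMA = {
    "type": "object",
    "required": ["id", "status", "total"],
    "properties": {
        "id": {"type": "string"},
        "status": {"type": "string", "enum": ["pending", "paid", "shipped"]},
        "total": {"type": "number", "minimum": 0},
    },
}


def test_order_response_matches_contract():
    resp = requests.get(
        "https://api.example.com/orders/123",   # placeholder endpoint
        headers={"Authorization": "Bearer TEST_TOKEN"},  # placeholder token
    )
    assert resp.status_code == 200
    # Raises jsonschema.ValidationError (failing the test) on any drift
    # between the response body and the agreed contract.
    validate(instance=resp.json(), schema=ORDER_SCHEMA)
```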
If you are looking to implement the principles discussed in our editorial, from contract-first design to CI integration, Westerveld's book, API Testing and Development with Postman, offers a clear, hands-on walkthrough. Here is an excerpt from the book which explains how contract testing verifies that APIs meet agreed expectations and walks you through setting up and validating these tests in Postman using OpenAPI specs, mock servers, and automated tooling.

Expert Insight: Using Contract Testing to Verify an API

An excerpt from "Chapter 13: Using Contract Testing to Verify an API" in the book API Testing and Development with Postman, Second Edition by Dave Westerveld (Packt, June 2024)

In this chapter, we will learn how to set up and use contract tests in Postman, but before we do that, it's important to make sure that you understand what they are and why you would use them. So, in this section, we will learn what contract testing is. We will also learn how to use contract testing and then discuss approaches to contract testing – that is, both consumer-driven and provider-driven contracts. To kick all this off, we are going to need to know what contract testing is. So, let's dive into that.

What is contract testing?

…Contract testing is a way to make sure that two different software services can communicate with each other. Often, contracts are made between a client and a server. This is the typical place where an API sits, and in many ways, an API is a contract. It specifies the rules that the client must follow in order to use the underlying service. As I've mentioned already, contracts help make things run more smoothly. It's one of the reasons we use APIs. We can expose data in a consistent way that we have contractually bound ourselves to. By doing this, we don't need to deal with each user of our API on an individual basis and everyone gets a consistent experience.

However, one of the issues with an API being a contract is that we must change things. APIs will usually change and evolve over time, but if the API is the contract, you need to make sure that you are holding up your end of the contract. Users of your API will come to rely on it working in the way that you say it will, so you need to check that it continues to do so.

When I bought my home, I took the contract to a lawyer to have them check it over and make sure that everything was OK and that there would be no surprises. In a somewhat similar way, an API should have some checks to ensure that there are no surprises. We call these kinds of checks contract testing. An API is a contract, and contract testing is how we ensure that the contract is valid, but how exactly do you do that?

Read the Complete Excerpt

API Testing and Development with Postman, Second Edition by Dave Westerveld (Packt, June 2024) covers everything from workflow and contract testing to security and performance validation; the book combines foundational theory with real-world projects to help developers and testers automate and improve their API workflows. Use code POSTMAN20 for 20% off at packtpub.com.

Get the Book

🛠️ Tool of the Week ⚒️

Bruno 2.3.0: A Git-Native API Client for Lightweight, Auditable Workflows

Bruno is an open source, offline-first API client built for developers who want fast, version-controlled request management. The latest release, version 2.3.0 (May 2025), adds capabilities that push it further into production-ready territory:

- OAuth2 CLI Flows: Streamlined authentication for secure endpoints.
- Secrets Integration: Native support for AWS Secrets Manager and Azure Key Vault.
- OpenAPI Sync: Improved support for importing and validating OpenAPI specs.
- Dev-Centric Design: Files are stored in plain text, organized by folder, and easy to diff in Git.

It's a strong fit for small teams, CI/CD testing, or cases where you want to keep everything under version control, without a heavyweight UI.

Westerveld on Bruno: "I recently tried Bruno. I liked it—I thought their approach to change management was really well designed. But it didn't support some of the features I rely on. I experimented with it on a small project, but in the end, I decided I still needed Postman for my main workflows." "That said, I still open Bruno now and then. It's useful, simple, and interesting—but we're not ready to adopt it team-wide."

Westerveld's advice: evaluate new tools with clear use cases in mind.
Bruno may not replace your primary API platform overnight, but it's a valuable addition to your workflow toolkit, especially for Git-native or OpenAPI-first teams.

Read more about Bruno

📰 Tech Briefs

- 2024 State of the API Report: Postman's 2024 State of the API report reveals that 74% of teams now follow an API-first approach, linking it to faster API delivery, improved failure recovery, rising monetization, and growing reliance on tools like Postman Workspaces, Spec Hub, and Postbot to navigate collaboration, governance, and security challenges.
- The MCP Catalog: Postman's MCP Catalog offers a live, collaborative workspace to discover and test Model Context Protocol (MCP) servers from verified publishers like Stripe, Notion, and Perplexity, enabling developers to prototype LLM-integrated tools quickly using ready-to-run Postman Collections and JSON-RPC 2.0 examples.
- If an AI agent can't figure out how your API works, neither can your users: This article argues that improving developer experience (DX) for LLM-powered agents (AX) is now table stakes, advocating for consistent design, clear docs, actionable errors, and golden-path smoke tests as shared foundations for both human and machine usability.
- 15 Best API Testing Tools in 2025: Free and Open-source: Reviews 15 tools covering both established options like Postman, SoapUI, and JMeter, as well as emerging platforms such as Apidog, which offers an all-in-one solution for API design, testing, and mocking, positioning itself as a powerful alternative to fragmented toolchains.
- The new frontier of API governance: Ensuring alignment, security, and efficiency through decentralization: Decentralized API governance replaces rigid control with shared responsibility, combining design-time standards and runtime enforcement, augmented by AI, to enable secure, scalable, and autonomous API development across distributed teams.

That's all for today. Thank you for reading this issue of Deep Engineering. We're just getting started, and your feedback will help shape what comes next. Take a moment to fill out this short survey; as a thank-you, we'll add one Packt credit to your account, redeemable for any book of your choice. We'll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Deep Engineering #6: Imran Ahmad on Algorithmic Thinking, Scalable Systems, and the Rise of AI Agents

Divya Anne Selvaraj
26 Jun 2025
How classical algorithms, system constraints, and real-world trade-offs will shape the next generation of intelligent software.

Workshop: Unpack OWASP Top 10 LLMs with Snyk

Join Snyk and OWASP Leader Vandana Verma Sehgal on Tuesday, July 15 at 11:00AM ET for a live session covering:
✓ The top LLM vulnerabilities
✓ Proven best practices for securing AI-generated code
✓ How Snyk's AI-powered tools automate and scale secure development
See live demos, plus earn 1 CPE credit!
Register today

Hi, welcome to the sixth issue of Deep Engineering.

A recent IBM and Morning Consult survey found that 99% of enterprise developers are now exploring or developing AI agents. Some have even christened 2025 "the year of the AI agent". We are experiencing a shift from standalone models to agentic systems.

To understand what this shift means for developers, we spoke with Imran Ahmad, data scientist at the Canadian Federal Government's Advanced Analytics Solution Center (A2SC) and visiting professor at Carleton University. Ahmad is the author of 50 Algorithms Every Programmer Should Know (Packt, 2023) and is currently working on his next book with us, 30 Agents Every AI Engineer Should Know, due out later this year. He has deep experience working on real-time analytics frameworks, multimedia data processing, and resource allocation algorithms in cloud computing.

You can watch the full interview and read the transcript here, or keep reading for our take on the algorithmic mindset that will define the next generation of agentic software.

Sign Up | Advertise

From Models to Agents with Imran Ahmad

According to Gartner, by 2028, 90% of enterprise software engineers will use AI code assistants (up from under 14% in early 2024). But we are already moving beyond code assistants to agents: software entities that don't just respond to prompts, but plan, reason, and act by orchestrating tools, models, and infrastructure independently.

"We have a lot of hope around AI – that it can eventually replace a human," Ahmad says. "But if you think about how a person in a company solves a problem, they rely on a set of tools… After gathering information, they create a solution. An 'agent' is meant to replace that kind of human reasoning. It should be able to discover the tools in the environment around it, and have the wisdom to orchestrate a solution tailored to the problem. We're not there yet, but that's what we're striving for."

This vision aligns with where industry leaders are headed. Maryam Ashoori, Director of Product Management for IBM watsonx.ai, concurs that 2025 is "the year of the AI agent", and a recent IBM and Morning Consult survey found 99% of enterprise developers are now exploring or developing AI agents. Major platforms are rushing to support this paradigm: at Build 2025, for instance, Microsoft announced an Azure AI Agent Service to orchestrate multiple specialized agents as modular microservices. Such developments underscore the momentum behind agent-based architectures, which Igor Fedulov, CEO of Intersog, in an article for Forbes Technology Council, predicts will be a defining software trend by the end of 2025. Ahmad predicts this to be "the next generation of the algorithmic world we live in."

What is an agent?

An AI agent is more than just a single model answering questions; it's a software entity that can plan, call on various tools (search engines, databases, calculators, other models, etc.), and execute multi-step workflows to achieve a goal. "An agent is an entity that has the wisdom to work independently and autonomously," Ahmad explains. "It can explore its environment, discover available tools, select the right ones, and create a workflow to solve a specific problem. That's the dream agent." Today's implementations only scratch the surface of that ideal. For example, many so-called agents are basically LLMs augmented with function-calling abilities (tool APIs): useful, but still limited in reasoning. Ahmad emphasizes that "a large language model is not the only tool. It's perhaps the most important one right now, but real wisdom lies outside the LLM – in the agent." In other words, true intelligence emerges from how an agent chooses and uses an ecosystem of tools, not just from one model's output.
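The "discover tools, select one, execute" loop can be made concrete without any LLM at all. Here is a deliberately simplified sketch of the dispatch pattern behind agents; the tools and the hard-coded plan are hypothetical stand-ins for the model-driven reasoning a real agent would do:

```python
# Deliberately simplified sketch of the tool-dispatch pattern behind agents.
# The tools and the hard-coded plan are hypothetical; in a real agent, an LLM
# would choose the tools and arguments at each step.
from typing import Callable


def search_documents(query: str) -> str:
    return f"top results for {query!r}"  # stand-in for a real search tool


def calculate(expression: str) -> str:
    return str(eval(expression, {"__builtins__": {}}))  # toy calculator only


# The "environment": tools the agent can discover by name.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": search_documents,
    "calculate": calculate,
}

# The kind of plan an LLM-driven planner might produce for a user goal.
plan = [
    ("search", "fraud detection features"),
    ("calculate", "0.15 * 2000"),
]

# The dispatch loop: look up each tool, run it, collect observations.
observations = []
for tool_name, argument in plan:
    tool = TOOLS[tool_name]  # tool discovery by name
    observations.append(tool(argument))

print(observations)
```

The "wisdom" Ahmad describes lives in how the plan gets produced and revised; the dispatch loop itself is the easy part.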
The Practitioner's Lens: Driving vs. Building the Engine

Even as new techniques emerge, software professionals must decide how deep to go into theory. Ahmad draws a line between researchers and practitioners when it comes to algorithms. The researcher may delve into proofs of optimality, complexity theory, or inventing new algorithms. The practitioner, however, cares about applying algorithms effectively to solve real problems. Ahmad uses an analogy to explain this: "Do you want to build a car and understand every component of the engine? Or do you just want to drive it? If you want to drive it, you need to know the essentials – how to maintain it – but not necessarily every internal detail. That's the practitioner role."

A senior engineer doesn't always need to derive equations from scratch, but they do need to know the key parameters, limitations, and maintenance needs of the algorithmic "engines" they use. Ahmad isn't advocating ignorance of theory. In fact, he stresses that having some insight under the hood improves decision-making. "If you know a bit more about how the engine works, you can choose the right car for your needs," he explains. Similarly, knowing an algorithm's fundamentals (even at a high level) helps an engineer pick the right tool for a given job. For example: is your search problem better served by a Breadth-First Search (BFS) or Depth-First Search (DFS) approach? Would a decision tree suffice, or do you need the boost in accuracy from an ensemble method? Experienced engineers approach such questions by combining intuition with algorithmic knowledge, a very practical kind of expertise. Ahmad's advice is to focus on the level of understanding that informs real-world choices, rather than getting lost in academic detail irrelevant to your use case.
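The BFS-versus-DFS question is a good example of a choice made on properties rather than proofs: BFS finds shortest paths in unweighted graphs but holds a whole frontier in memory, while DFS uses less memory and suits reachability or exhaustive exploration. A minimal sketch of both on a toy adjacency-list graph:

```python
# Minimal BFS and DFS sketches over a toy adjacency-list graph, showing the
# practical difference: BFS visits nodes in order of distance from the start
# (shortest paths in unweighted graphs); DFS dives down one branch at a time.
from collections import deque

graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": ["E"],
    "E": [],
}


def bfs(start: str) -> list[str]:
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()  # FIFO: explore level by level
        order.append(node)
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


def dfs(start: str) -> list[str]:
    seen, order, stack = set(), [], [start]
    while stack:
        node = stack.pop()  # LIFO: follow one branch as deep as it goes
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        stack.extend(reversed(graph[node]))
    return order


print(bfs("A"))  # ['A', 'B', 'C', 'D', 'E']
print(dfs("A"))  # ['A', 'B', 'D', 'E', 'C']
```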
The Practitioner’s Lens: Driving vs. Building the Engine

Even as new techniques emerge, software professionals must decide how deep to go into theory. Ahmad draws a line between researchers and practitioners when it comes to algorithms. The researcher may delve into proofs of optimality, complexity theory, or inventing new algorithms. The practitioner, however, cares about applying algorithms effectively to solve real problems. Ahmad uses an analogy to explain this:

“Do you want to build a car and understand every component of the engine? Or do you just want to drive it? If you want to drive it, you need to know the essentials – how to maintain it – but not necessarily every internal detail. That’s the practitioner role.”

A senior engineer doesn’t always need to derive equations from scratch, but they do need to know the key parameters, limitations, and maintenance needs of the algorithmic “engines” they use.

Ahmad isn’t advocating ignorance of theory. In fact, he stresses that having some insight under the hood improves decision-making. “If you know a bit more about how the engine works, you can choose the right car for your needs,” he explains. Similarly, knowing an algorithm’s fundamentals (even at a high level) helps an engineer pick the right tool for a given job. For example: Is your search problem better served by a Breadth-First Search (BFS) or Depth-First Search (DFS) approach? Would a decision tree suffice, or do you need the boost in accuracy from an ensemble method? Experienced engineers approach such questions by combining intuition with algorithmic knowledge – a very practical kind of expertise. Ahmad’s advice is to focus on the level of understanding that informs real-world choices, rather than getting lost in academic detail irrelevant to your use case.
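The BFS-versus-DFS question is a good example of knowing an engine’s key parameters: BFS explores level by level and finds shortest paths in unweighted graphs but holds a whole frontier in memory, while DFS is memory-light but makes no shortest-path guarantee. A minimal illustration on a toy graph (our example, not from the interview):

```python
from collections import deque

graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

def bfs(start, goal):
    """Level-by-level search: the first path found is a shortest one."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])

def dfs(start, goal, path=None):
    """Depth-first search: cheap on memory, but the path may not be shortest."""
    path = path or [start]
    if path[-1] == goal:
        return path
    for nxt in graph[path[-1]]:
        if nxt not in path:
            if (found := dfs(start, goal, path + [nxt])):
                return found

print(bfs("A", "D"))  # ['A', 'B', 'D'] - guaranteed shortest
print(dfs("A", "D"))  # ['A', 'B', 'D'] here too, but only by luck of ordering
```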
Algorithm Choices and Real-World Scalability

In the wild, data is messy and scale is enormous – revealing which algorithms truly perform. “When algorithms are taught in universities… they’re usually applied to small, curated datasets. I call this ‘manicured pedicure data.’ But that’s not real data,” Ahmad quips. In his career as a public-sector data scientist, he routinely deals with millions of records and offers three key insights that shape how engineers should approach algorithm selection in production environments:

Performance at scale requires different choices than in theory: Ahmad uses an example from his experience when he applied the Apriori algorithm (a well-known method for association rule mining). “When I used Apriori in practice, I found it doesn’t scale,” he admits. “It generates thousands of rules and then filters them after the fact. There’s a newer, better algorithm called FP-Growth (Frequent Pattern Growth) that does the filtering at the source. It only generates the rules you actually need, making it far more scalable.” A theoretically correct algorithm can become unusable when faced with big data volumes or strict latency requirements.

Non-functional requirements often determine success: Beyond just picking the right algorithm, non-functional requirements like performance, scalability, and reliability must guide engineering decisions. “In academia, we focus on functional requirements… ‘this algorithm should detect fraud.’ And yes, the algorithm might technically work. But in practice, you also have to consider how it performs, how scalable it is, whether it can run as a cloud service, and so on.” Robust software needs algorithms that meet functional goals and the operational demands of deployment (throughput, memory, cost, etc.).

Start simple, escalate only as needed: Simpler algorithms are easier to implement, explain, and maintain – valuable qualities especially in domains like finance or healthcare where interpretability matters. While discussing predictive models, Ahmad describes an iterative approach – perhaps begin with intuitive rules, upgrade to a decision tree for more structure, then if needed move to a more powerful model like XGBoost or an SVM (a sketch of this escalation follows the list). Jumping straight to a deep neural net can be overkill for a simple classification. “It’s usually a mistake to begin with something too complex – it can be overkill, like using a forklift to lift a sheet of paper,” he says.
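Here is what that escalation can look like in code – a minimal sketch, not Ahmad’s recipe. scikit-learn’s GradientBoostingClassifier stands in for XGBoost to keep the example dependency-light, and the accuracy bar and synthetic data are assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Step 1: start with an interpretable baseline.
tree_score = cross_val_score(
    DecisionTreeClassifier(max_depth=4, random_state=0), X, y, cv=5
).mean()
print(f"decision tree: {tree_score:.3f}")

# Step 2: escalate only if the simple model misses the bar you set.
ACCURACY_BAR = 0.90  # hypothetical requirement
if tree_score < ACCURACY_BAR:
    boosted_score = cross_val_score(
        GradientBoostingClassifier(random_state=0), X, y, cv=5
    ).mean()
    print(f"gradient boosting: {boosted_score:.3f}")
```

If the tree clears the bar, the escalation never happens – which is the whole point: complexity is added only when a measured gap justifies it.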
However, algorithmic choices don’t occur in a vacuum – they influence and are influenced by software architecture. Modern systems, especially AI systems, have distinct phases (training, testing, inference) and often run in distributed cloud environments. Engineers therefore must integrate algorithmic thinking into high-level design and infrastructure decisions.

Bridging Algorithms and Architecture in Practice

Take the example of training a machine learning model versus deploying it. “During training, you need a lot of data... a lot of processing power – GPUs, ideally. It’s expensive and time-consuming,” Ahmad notes. This is where cloud architecture shines. “The cloud gives you elastic architectures – you can spin up 2,000 nodes for 2 or 10 hours, train your model, and then shut it down. The cost is manageable…and you’re done.” Cloud platforms allow an elastic burst of resources: massive parallelism for a short duration, which can turn a week-long training job into a few hours for a few hundred dollars. Ahmad highlights that this elasticity was simply not available decades ago in on-prem computing. Today, any team can rent essentially unlimited compute for a day, which removes a huge barrier in building complex models. “If you want to optimize for cost and performance, you need elastic systems. Cloud computing… offers exactly that” for AI workloads, he says.

Once trained, the model often compresses down to a relatively small artifact (Ahmad jokes that the final model file is “like the tail of an elephant – tiny compared to the effort to build it”). Serving predictions might only require a lightweight runtime that can even live on a smartphone. Thus, the hardware needs vary drastically between phases: heavy GPU clusters for training; maybe a simple CPU or even embedded device for inference. Good system design accommodates these differences – e.g., by separating training pipelines from inference services, or using cloud for training but edge devices for deployment when appropriate.

So, how does algorithm choice drive architecture? Ahmad recommends evaluating any big design decision on three axes:

- Cost
- Performance
- Time-to-deliver

If adopting a more sophisticated algorithm (or distributed processing framework, etc.) will greatly improve accuracy or speed and the extra cost is justified, it may be worth it. “First, ask yourself: does this problem justify the additional complexity…? Then evaluate that decision along three axes: cost, performance, and time,” he advises. “If an algorithm is more accurate, more time-efficient, and the cost increase is justified, then it’s probably the right choice.” On the flip side, if a fancy algorithm barely improves accuracy or would bust your budget/latency requirements, you might stick with a simpler approach that you can deploy more quickly. This trade-off analysis – weighing accuracy vs. expense vs. speed – is a core skill for architects in the age of AI. It prevents architecture astronautics (over-engineering) by ensuring complexity serves a real purpose.
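The elasticity argument is easy to verify with back-of-envelope arithmetic. The hourly rate below is an illustrative assumption, not any provider’s price list:

```python
# Same total compute either way; elasticity trades wall-clock time for node count.
NODE_HOURLY_RATE = 0.10   # assumed $/node-hour; real prices vary by instance type
TOTAL_NODE_HOURS = 2000   # the training job needs this much compute in total

for nodes in (1, 100, 2000):
    hours = TOTAL_NODE_HOURS / nodes
    cost = TOTAL_NODE_HOURS * NODE_HOURLY_RATE
    print(f"{nodes:>5} nodes -> {hours:>7.1f} h wall-clock, ${cost:.0f}")
```

Ignoring coordination overhead, the bill stays flat at a few hundred dollars while the wall-clock time collapses from months to an hour – though real jobs parallelize imperfectly, a limit Amdahl’s law quantifies (see the excerpt later in this issue).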
Classical Techniques: The Unsung Heroes in AI Systems

Ahmad views classical computer science algorithms and modern AI methods as complementary components of a solution.

“Take search algorithms, for instance,” Ahmad elaborates. “When you're preparing datasets for AI… you often have massive data lakes – structured and unstructured data all in one place. Now, say you're training a model for fraud detection. You need to figure out which data is relevant from that massive repository. Search algorithms can help you locate the relevant features and datasets. They support the AI workflow by enabling smarter data preparation.” Before the fancy model ever sees the data, classical algorithms may be at work filtering and finding the right inputs. Similarly, Ahmad points out, classic graph algorithms might be used to do link analysis or community detection that informs feature engineering. Even some “old-school” NLP (like tokenization or regex parsing) can serve as preprocessing for LLM pipelines. These building blocks ensure that the complex AI has quality material to work with.

Ahmad offers an apt metaphor: “Maybe AI is your ‘main muscle,’ but to build a strong body – or a performant system – you need to train the supporting muscles too. Classical algorithms are part of that foundation.”

Robust systems use the best of both worlds. For example, he describes a hybrid approach in real-world data labeling. In production, you often don’t have neat labeled datasets; you have to derive labels or important features from raw data. Association rule mining algorithms like Apriori or FP-Growth (from classical data mining) can uncover patterns. These patterns might suggest how to label data or which combined features could predict an outcome. “If you feed transaction data into FP-Growth, it will find relationships – like if someone buys milk, they’re likely to buy cheese too… These are the kinds of patterns the algorithm surfaces,” Ahmad explains. Here, a classical unsupervised algorithm helps define the inputs to a modern supervised learning task – a symbiosis that improves the overall system.
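A minimal sketch of that pattern-surfacing step, using the open source mlxtend library (our choice of tooling, not necessarily Ahmad’s) on a toy basket of transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth  # swap in `apriori` for the same output, slower at scale

# Toy transactions; in production these would be millions of rows.
transactions = [
    ["milk", "cheese", "bread"],
    ["milk", "cheese"],
    ["milk", "bread"],
    ["cheese", "bread"],
    ["milk", "cheese", "butter"],
]

encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# FP-Growth builds a compact prefix tree and only emits itemsets that clear
# min_support - no Apriori-style candidate explosion to filter after the fact.
itemsets = fpgrowth(onehot, min_support=0.4, use_colnames=True)
print(itemsets.sort_values("support", ascending=False))
# {milk, cheese} surfaces with support 0.6 - Ahmad's milk-then-cheese pattern.
```

Itemsets like {milk, cheese} are exactly the kind of derived signal that can seed labels or combined features for a downstream supervised model.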
Foundational skills like devising efficient search strategies, using dynamic programming for optimal substructure problems, or leveraging sorting and hashing for data organization are still extremely relevant. They might operate behind the scenes of an AI pipeline or bolster the infrastructure (e.g., database indexing, cache eviction policies, etc.) that keeps your application fast and reliable. Ahmad even notes that Google’s hyperparameter tuning service, Vizier, is “based on classical heuristic algorithms” rather than any neural network magic – yet it significantly accelerates model optimization.

Optimization: The (Absolute) Necessity of Efficiency

“Math can be cruel,” Ahmad warns. “If you’re not careful, your problem might never converge… If you accidentally introduce an exponential factor in the wrong place, it might take years – or even centuries – for the solution to converge. The sun might die before your algorithm finishes!” This colorful exaggeration underscores a serious point: computational complexity can explode quickly, and engineers need to be vigilant. It’s not acceptable to shrug off inefficiencies with “just let it run longer” if the algorithmic complexity is super-polynomial. “Things can spiral out of control very quickly. That’s why optimization isn't a luxury – it’s a necessity,” Ahmad says.

Ahmad talks about three levels at which we optimize AI systems:

Hardware: Choosing the right compute resources can yield massive speedups. For example, training a deep learning model on a GPU or TPU vs. a CPU can be orders of magnitude faster. “For deep learning especially, using a GPU can speed up training by a factor of 1,000,” Ahmad notes, based on his experience. So, part of an engineer’s algorithmic thinking is knowing when to offload work to specialized hardware, or how to parallelize tasks across a cluster.

Hyperparameter tuning and algorithmic settings: Many algorithms (especially in machine learning) have knobs to turn – learning rate, tree depth, number of clusters, etc. The wrong settings can make a huge difference in both model quality and compute time. Traditionally, tuning was an art of trial and error. But now, tools like Google’s Vizier (and open-source libraries for Bayesian optimization) can automate this search efficiently.

Ensuring the problem is set up correctly: A common mistake is diving into training without examining the data’s signal-to-noise ratio. Ahmad recommends the CRISP-DM approach – spend ample time on data understanding and preparation. “Let’s say your dataset has a lot of randomness and noise. If there's no clear signal, then even a Nobel Prize–winning scientist won’t be able to build a good model,” he says. “So, you need to assess your data before you commit to AI.” This might involve using statistical analysis or simple algorithms to verify that patterns exist. “Use classical methods to ensure that your data even has a learnable pattern. Otherwise, you’re wasting time and resources,” Ahmad advises. (A minimal version of this check is sketched after the list.)
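One cheap way to act on that third level is to compare a simple model against a label-agnostic dummy baseline before committing to anything heavier; if the two are indistinguishable, there may be no learnable signal. A minimal sketch with scikit-learn (our stand-in for the statistical checks Ahmad describes):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = rng.integers(0, 2, size=1000)  # pure noise: labels unrelated to features

baseline = cross_val_score(DummyClassifier(strategy="most_frequent"), X, y, cv=5).mean()
model = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5).mean()
print(f"dummy: {baseline:.2f}  tree: {model:.2f}")
# On noise like this the tree cannot beat the dummy - a red flag to fix
# the data before reaching for deeper models.
```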
The cost of compute – and the opportunity cost of engineers’ time – is too high to ignore optimization. Or as Ahmad bluntly puts it, “It’s not OK to say, ‘I’m not in a hurry, I’ll just let it run.’” Competitive teams optimize both to push performance and to control time/cost, achieving results that are fast, scalable, and economically sensible.

Learning by Doing: Making Algorithms Stick

Many developers first encounter algorithms as leetcode-style puzzles or theoretical exercises for interviews. But how can they move beyond rote knowledge to true mastery? Ahmad’s answer: practice on real problems. “Learning algorithms for interviews is a good start… it shows initiative,” he acknowledges. “But in interview prep, you're not solving real-world problems… To truly make algorithmic knowledge stick, you need to use algorithms to solve actual problems.”

In the artificial setting of an interview question, you might code a graph traversal or a sorting function in isolation. The scope is narrow and hints are often provided by the problem constraints. Real projects are messier and more holistic. When you set out to build something end-to-end, you quickly uncover gaps in your knowledge and gain a deeper intuition. “That’s when you'll face real challenges, discover edge cases, and realize that you may need to know other algorithms just to get your main one working,” Ahmad says. Perhaps you’re implementing a network flow algorithm but discover you need a good data structure for priority queues to make it efficient, forcing you to learn or recall heap algorithms. Or you’re training a machine learning model and hit a wall until you implement a caching strategy to handle streaming data. Solving real problems forces you to integrate multiple techniques, and shows how classical and modern methods complement each other in context. Ahmad puts it succinctly: “There’s an entire ecosystem – an algorithmic community – that supports every solution. Classical and modern algorithms aren’t separate worlds. They complement each other, and a solid understanding of both is essential.”

So, what’s the best way to gain this hands-on experience? Ahmad recommends use-case-driven projects, especially in domains that matter to you. He suggests tapping into the wealth of public datasets now available. “Governments around the world are legal custodians of citizen data… If used responsibly, this data can change lives,” he notes. Portals like data.gov host hundreds of thousands of datasets spanning healthcare, transportation, economics, climate, and more. Similar open data repositories exist for other countries and regions. These aren’t sanitized toy datasets – they are real, messy, and meaningful. “Choose a vertical you care about, download a dataset, pick an algorithm, and try to solve a problem. That’s the best way to solidify your learning,” Ahmad advises. The key is to immerse yourself in a project where you must apply algorithms end-to-end: from data cleaning and exploratory analysis, to choosing the right model or algorithmic approach, through optimization and presenting results. This process will teach more than any isolated coding puzzle, and the lessons will stick because they’re tied to real outcomes.

Yes, 2025 is “the year of the AI agent”, but as the industry shifts from standalone models to agentic systems, engineers must learn to pair classical algorithmic foundations with real-world pragmatism, because in this era of AI agents, true intelligence lies not only in models, but in how wisely we orchestrate them.

If Ahmad’s perspective on real-world scalability and algorithmic pragmatism resonated with you, his book 50 Algorithms Every Programmer Should Know goes deeper into the practical foundations behind today’s AI systems. The following excerpt explores how to design and optimize large-scale algorithms for production environments—covering parallelism, cloud infrastructure, and the trade-offs that shape performant systems.

🧠Expert Insight: Large-Scale Algorithms by Imran Ahmad

The complete “Chapter 15: Large‑Scale Algorithms” from the book 50 Algorithms Every Programmer Should Know by Imran Ahmad (Packt, September 2023).

Large-scale algorithms are specifically designed to tackle sizable and intricate problems. They distinguish themselves by their demand for multiple execution engines due to the sheer volume of data and processing requirements. Examples of such algorithms include Large Language Models (LLMs) like ChatGPT, which require distributed model training to manage the extensive computational demands inherent to deep learning. The resource-intensive nature of such complex algorithms highlights the requirement for robust, parallel processing techniques critical for training the model.

In this chapter, we will start by introducing the concept of large-scale algorithms and then proceed to discuss the efficient infrastructure required to support them. Additionally, we will explore various strategies for managing multi-resource processing. Within this chapter, we will examine the limitations of parallel processing, as outlined by Amdahl’s law, and investigate the use of Graphics Processing Units (GPUs).
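A quick reference note from us, not part of the excerpt: Amdahl’s law states that if a fraction P of a workload can be parallelized and the remaining 1 − P cannot, the maximum speedup on N processors is

Speedup(N) = 1 / ((1 − P) + P / N)

A job that is 95% parallelizable therefore tops out below 20× no matter how many nodes you rent: with N = 2,000, the formula gives roughly 1 / (0.05 + 0.95/2000) ≈ 19.8×, because the serial 5% dominates.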
Read the Complete Chapter

50 Algorithms Every Programmer Should Know by Imran Ahmad (Packt, September 2023) is a practical guide to algorithmic problem-solving in real-world software. Now in its second edition, the book covers everything from classical data structures and graph algorithms to machine learning, deep learning, NLP, and large-scale systems.

For a limited time, get the eBook for $9.99 at packtpub.com — no code required.

Get the Book

🛠️Tool of the Week⚒️

OSS Vizier — Production-Grade Black-Box Optimization from Google

OSS Vizier is a Python-based, open source optimization service built on top of Google Vizier—the system that powers hyperparameter tuning and experiment optimization across products like Search, Ads, and YouTube. Now available to the broader research and engineering community, OSS Vizier brings the same fault-tolerant, scalable architecture to a wide range of use cases—from ML pipelines to physical experiments.

Highlights:

- Flexible, Distributed Architecture: Supports RPC-based optimization via gRPC, allowing Python, C++, Rust, or custom clients to evaluate black-box objectives in parallel or sequentially.
- Rich Integration Ecosystem: Includes native support for PyGlove, TensorFlow Probability, and Vertex Vizier—enabling seamless connection to evolutionary search, Bayesian optimization, and cloud workflows.
- Research-Ready: Comes with standardized benchmarking APIs, a modular algorithm interface, and compatibility with AutoML tooling—ideal for evaluating and extending new optimization strategies.
- Resilient and Extensible: Fault-tolerant by design, with evaluations stored in SQL-backed datastores and support for retry logic, partial failure, and real-world constraints (e.g., human-evaluated objectives or lab settings).

Learn more about OSS Vizier

📰 Tech Briefs

- AI agents in 2025: Expectations vs. reality by Ivan Belcic and Cole Stryker, IBM Think: In 2025, AI agents are widely touted as transformative tools for work and productivity, but experts caution that while experimentation is accelerating, current capabilities remain limited, true autonomy is rare, and success depends on governance, strategy, and realistic expectations.
- Agent Mode for Gemini added to Android Studio: Google has introduced Agent Mode for Gemini in Android Studio, enabling developers to describe high-level goals that the agent can plan and execute—such as fixing build errors, adding dark mode, or generating UI from a screenshot—while allowing user oversight, feedback, and iteration, with expanded context support via Gemini API and MCP integration.
- Google’s Agent2Agent protocol finds new home at the Linux Foundation: Google has donated its Agent2Agent (A2A) protocol—a standard for enabling interoperability between AI agents—to the Linux Foundation, aiming to foster vendor-neutral, open development of multi-agent systems, with over 100 tech partners now contributing to its extensible, secure, and scalable design.
- Azure AI Foundry Agent Service GA Introduces Multi-Agent Orchestration and Open Interoperability: Microsoft has launched the Azure AI Foundry Agent Service into general availability, offering a modular, multi-agent orchestration platform that supports open interoperability, seamless integration with Logic Apps and external tools, and robust capabilities for monitoring, governance, and cross-cloud agent collaboration—all aimed at enabling scalable, intelligent agent ecosystems across diverse enterprise use cases.
- How AI Is Redefining The Way Software Is Built In 2025 by Igor Fedulov, CEO of Intersog: AI is transforming software development by automating tasks, accelerating workflows, and enabling more intelligent, adaptive systems—driving a shift toward agent-based architectures, cloud-native applications, and advanced technologies like voice and image recognition, while requiring developers to upskill in AI, data analysis, and security to remain competitive.

That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.

Take a moment to fill out this short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.

We’ll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.
Deep Engineering #5: Dhirendra Sinha (Google) and Tejas Chopra (Netflix) on Scaling, AI Ops, and System Design Interviews
Divya Anne Selvaraj | 19 Jun 2025
From designing fault-tolerant systems at Big Tech and hiring for system design roles, Chopra and Sinha share lessons on designing for failure and the importance of trade-off thinking.

Hi, welcome to the fifth issue of Deep Engineering.

With AI workloads reshaping infrastructure demands and distributed systems becoming the default, engineers are facing new failure modes, stricter trade-offs, and rising expectations in both practice and hiring.

To explore what today’s engineers need to know, we spoke with Dhirendra Sinha (Software Engineering Manager at Google, and long-time distributed systems educator) and Tejas Chopra (Senior Engineer at Netflix and Adjunct Professor at UAT). Their recent book, System Design Guide for Software Professionals (Packt, 2024), distills decades of practical experience into a structured approach to design thinking.

In this issue, we unpack their hard-won lessons on observability, fault tolerance, automation, and interview performance—plus what it really means to design for scale in a world where even one-in-a-million edge cases are everyday events.

You can watch the full interview and read the transcript here—or keep reading for our distilled take on the design mindset that will define the next decade of systems engineering.

Sign Up | Advertise

Join us on July 19 for a 150-minute interactive MCP Workshop. Go beyond theory and learn how to build and ship real-world MCP solutions. Limited spots available! Reserve your seat today. Use Code EARLY35 for 35% off!

Designing for Scale, Failure, and the Future — With Dhirendra Sinha and Tejas Chopra

“Foundational system design principles—like scalability, reliability, and efficiency—are remarkably timeless,” notes Chopra, adding that “the rise of AI only reinforces the importance of these principles.” In other words, new AI systems can’t compensate for poor architecture; they reveal its weaknesses. Sinha concurs: “If the foundation isn’t strong, the system will be brittle—no matter how much AI you throw at it.” AI and system design aren’t at odds – “they complement each other,” says Chopra, with AI introducing new opportunities and stress-tests for our designs.

One area where AI is elevating system design is in AI-driven operations (AIOps). Companies are increasingly using intelligent automation for tasks like predictive autoscaling, anomaly detection, and self-healing. “There’s a growing demand for observability systems that can predict service outages, capacity issues, and performance degradation before they occur,” notes Sam Suthar, founding director of Middleware. AI-powered monitoring can catch patterns and bottlenecks ahead of failures, allowing teams to fix issues before users notice.

At the same time, designing the systems to support AI workloads is a fresh challenge. The recent rollout of a Ghibli-style image generator saw explosive demand – so much that OpenAI’s CEO had to ask users to pause as GPU servers were overwhelmed. That architecture didn’t fully account for the parallelization and scale such AI models required. AI can optimize and automate a lot, but it will expose any gap in your system design fundamentals. As Sinha puts it, “AI is powerful, but it makes mastering the fundamentals of system design even more critical.”

Scaling Challenges and Resilience in Practice

So, what does it take to operate at web scale in 2025?
Sinha highlights four key challenges facing large-scale systems today:

Scalability under unpredictable load: Global services must handle sudden traffic spikes without falling over or grossly over-provisioning. Even the best capacity models can be off, and “unexpected traffic can still overwhelm systems,” Sinha says.

Balancing the classic trade-offs between consistency, performance, and availability: This remains as relevant as ever. In practice, engineers constantly juggle these – and must decide where strong consistency is a must versus where eventual consistency will do.

Security and privacy at scale have grown harder: Designing secure systems for millions of users, with evolving privacy regulations and threat landscapes, is an ongoing battle.

The rise of AI introduces “new uncertainties”: We’re still learning how to integrate AI agents and AI-driven features safely into large architectures.

Chopra offers an example from Netflix: “We once had a live-streaming event where we expected a certain number of users – but ended up with more than three times that number.” The system struggled not because it was fundamentally mis-designed, but due to hidden dependency assumptions. In a microservices world, “you don’t own all the parts—you depend on external systems. And if one of those breaks under load, the whole thing can fall apart,” Chopra warns. A minor supporting service that wasn’t scaled for 3× traffic can become the linchpin that brings down your application.

This is why observability is paramount. At Netflix’s scale (hundreds of microservices handling asynchronous calls), tracing a user request through the maze is non-trivial. Teams invest heavily in telemetry to know “which service called what, when, and with what parameters” when things go wrong. Even so, “stitching together a timeline can still be very difficult” in a massive distributed system, especially with asynchronous workflows. Modern observability tools (distributed tracing, centralized logging, etc.) are essential, and even these are evolving with AI assistance to pinpoint issues faster.

So how do Big Tech companies approach scalability and robustness by design? One mantra is to design for failure. Assume everything will eventually fail and plan accordingly. “We operate with the mindset that everything will fail,” says Chopra. That philosophy birthed tools like Netflix’s Chaos Monkey, which randomly kills live instances to ensure the overall system can survive outages. If a service or an entire region goes down, your architecture should gracefully degrade or auto-heal without waking up an engineer at 2 AM. Sinha recalls an incident from his days at Yahoo:

“I remember someone saying, ‘This case is so rare, it’s not a big deal,’ and the chief architect replied, ‘One in a million happens every hour here.’ That’s what scale does—it invalidates your assumptions.”

The arithmetic backs him up: a service handling around 300 requests per second sees a one-in-a-million case roughly once an hour. In high-scale systems, even million-to-one chances occur regularly, so no corner case is truly negligible.
In Big Tech, achieving resilience at scale has resulted in three best practices:

Fault-tolerant, horizontally scalable architectures: At Netflix and other companies, such architectures ensure that if one node or service dies, the load redistributes and the system heals itself quickly. Teams focus not just on launching features but “landing” them safely – meaning they consider how each new deployment behaves under real-world loads, failure modes, and even disaster scenarios. Automation is key: from continuous deployments to automated rollback and failover scripts. “We also focus on automating everything we can—not just deployments, but also alerts. And those alerts need to be actionable,” Sinha says.

Explicit capacity planning and graceful degradation: Engineers define clear limits for how much load a system can handle and build in back-pressure or shedding mechanisms beyond that. Systems often fail when someone makes unrealistic assumptions about unlimited capacity. Caching, rate limiting, and circuit breakers become your safety net (a minimal circuit-breaker sketch follows this list). Gradual rollouts further boost robustness. “When we deploy something new, we don’t release it to the entire user base in one go,” Chopra explains. Whether it’s a new recommendation algorithm or a core infrastructure change, Netflix will enable it for a small percentage of users or in one region first, observe the impact, then incrementally expand if all looks good. This staged rollout limits the blast radius of unforeseen issues. Feature flags, canary releases, and region-by-region deployments should be standard operating procedure.

Infrastructure as Code (IaC): Modern infrastructure tooling also contributes to resiliency. Many organizations now treat infrastructure as code, defining their deployments and configurations in declarative scripts. As Sinha notes, “we rely heavily on infrastructure as code—using tools like Terraform and Kubernetes—where you define the desired state, and the system self-heals or evolves toward that.” By encoding the target state of the system, companies enable automated recovery; if something drifts or breaks, the platform will attempt to revert to the last good state without manual intervention. This codified approach also makes scaling and replication more predictable, since environments can be spun up from the same templates.
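The circuit-breaker pattern mentioned above is simple to sketch: after a run of failures the breaker opens and calls fail fast, and a cool-down later lets a trial call through. A minimal, framework-free illustration; the thresholds and the wrapped call are assumptions, not any company’s production values:

```python
import time

class CircuitBreaker:
    """Fail fast after repeated errors; allow a retry after a cool-down."""
    def __init__(self, max_failures=5, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds the breaker stays open
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None        # half-open: let one trial call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0                # success closes the circuit again
        return result

# Usage (hypothetical dependency): breaker.call(fetch_profile, user_id)
breaker = CircuitBreaker()
```

The pay-off is graceful degradation: when a dependency is down, callers get an immediate, handleable error instead of piling up timed-out requests that drag the whole system under.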
These same principles—resilience, clarity, and structured thinking—also underpin how engineers should approach system design interviews.

Mastering the System Design Interview

Cracking the system design interview is a priority for many mid-level engineers aiming for senior roles, and for good reason. Sinha points out that system design skill isn’t just a hiring gate – it often determines your level/title once you’re in a company. Unlike coding interviews where problems have a neat optimal solution, “system design is messy. You can take it in many directions, and that’s what makes it interesting,” Sinha says. Interviewers want to see how you navigate an open-ended problem, not whether you can memorize a textbook solution. Both Sinha and Chopra emphasize structured thinking and communication. Hiring managers deliberately ask ambiguous or underspecified questions to see if the candidate will impose structure: Do they ask clarifying questions? Do they break the problem into parts (data storage, workload patterns, failure scenarios, etc.)? Do they discuss trade-offs out loud? Sinha and Chopra offer two guidelines:

There’s rarely a single “correct” answer: What matters is reasoning and demonstrating that you can make sensible trade-offs under real-world constraints. “It’s easy to choose between good and bad solutions,” Sinha notes, “but senior engineers often have to choose between two good options. I want to hear their reasoning: Why did you choose this approach? What trade-offs did you consider?” A strong candidate will articulate why, say, they picked SQL over NoSQL for a given scenario – and acknowledge the downsides or conditions that might change that decision. In fact, Chopra may often follow up with “What if you had 10× more users? Would your choice change?” to test the adaptability of a candidate’s design. He also likes to probe on topics like consistency models: strong vs eventual consistency and the implications of the CAP theorem. Many engineers “don’t fully grasp how consistency, availability, and partition tolerance interact in real-world systems,” Chopra observes, so he presents scenarios to gauge depth of understanding.

Demonstrate a collaborative, inquisitive approach: A system design interview shouldn’t be a monologue; it’s a dialogue. Chopra says, “I try to keep the interview conversational. I want the candidate to challenge some of my assumptions.” For example, a candidate might ask: What are the core requirements? Are we optimizing for latency or throughput? or How many users are we targeting initially? — “that kind of questioning is exactly what happens in real projects,” Chopra explains. It shows the candidate isn’t just regurgitating a pre-learned architecture, but is actively scoping the problem like they would on the job. Strong candidates also prioritize requirements on the fly – distinguishing must-haves (e.g. high availability, security) from nice-to-haves (like an optional feature that can be deferred).

Through years of interviews, Sinha and Chopra have noticed three common pitfalls:

Jumping into solution-mode too fast: “Candidates don’t spend enough time right-sizing the problem,” says Chopra. “The first 5–10 minutes should be spent asking clarifying questions—what exactly are we designing, what are the constraints, what assumptions can we make?” Diving straight into drawing boxes and lines can lead you down the wrong path. Sinha agrees: “They hear something familiar, get excited, and dive into design mode—often without even confirming what they’re supposed to be designing. In both interviews and real life, that’s dangerous. You could end up solving the wrong problem.”

Lack of structure – jumping randomly between components without a clear plan: This scattered approach makes it hard to know if you’ve covered the key areas. Interviewers prefer a candidate who outlines a high-level approach (e.g. client > service > data layer) before zooming in, and who checks back on requirements periodically.

Poor time management: It’s common for candidates to get bogged down in details early (like debating the perfect database indexing scheme) and then run out of time to address other important parts of the system. Sinha and Chopra recommend pacing yourself and being willing to defer some details. It’s better to have a complete, if imperfect, design than a perfect cache layer with no time to discuss security or analytics requirements. If an interviewer hints to move on or asks about an area you haven’t covered, take the cue. “Listen to the interviewer’s cues,” Sinha advises. “We want to help you succeed, but if you miss the hints, we can’t evaluate you properly.”

Tech interviews in general have gotten more demanding in 2025. The format of system design interviews hasn’t drastically changed, but the bar is higher. Companies are more selective, sometimes even “downleveling” strong candidates if they don’t perfectly meet the senior criteria. Evan King and Stefan Mai, cofounders of an interview preparation startup, observe in an article in The Pragmatic Engineer that “performance that would have secured an offer in 2021 might not even clear the screening stage today”. This reflects a market where competition is fierce and expectations for system design prowess are rising.
But as Chopra and Sinha illustrate, the goal is not to memorize solutions – it’s to master the art of trade-offs and critical thinking.

Beyond Interviews: System Design as a Career Catalyst

System design isn’t just an interview checkbox – it’s a fundamental skill for career growth in engineering. “A lot of people revisit system design only when they're preparing for interviews,” Sinha says. “But having a strong grasp of system design concepts pays off in many areas of your career.” It becomes evident when you’re vying for a promotion, writing an architecture document, or debating a new feature in a design review.

Engineers with solid design fundamentals tend to ask the sharp questions that others miss (e.g. What happens if this service goes down? or Can our database handle 10x writes?). They can evaluate new technologies or frameworks in the context of system impact, not just code syntax. Technical leadership roles especially demand this big-picture thinking. In fact, many companies now expect even engineering managers to stay hands-on with architecture – “system design skills are becoming non-negotiable” for leadership.

Mastering system design also improves your technical communication. As you grow more senior, your success depends on how well you can simplify complexity for others – whether in documentation or in meetings. “It’s not just about coding—it’s about presenting your ideas clearly and convincingly. That’s a huge part of leadership in engineering,” Sinha notes. Chopra agrees, framing system design knowledge as almost a mindset: “System design is almost a way of life for senior engineers. It’s how you continue to provide value to your team and organization.” He compares it to learning math: you might not explicitly use the quadratic formula daily, but learning it trains your brain in problem-solving.

Perhaps the most exciting aspect is that the future is wide open. “Many of the systems we’ll be working on in the next 10–20 years haven’t even been built yet,” Chopra points out. We’re at an inflection point with technologies like AI agents and real-time data streaming pushing boundaries; those with a solid foundation in distributed systems will be the “go-to” people to harness these advances. And as Chopra notes, “seniority isn’t about writing complex code. It’s about simplifying complex systems and communicating them clearly. That’s what separates great engineers from the rest.” System design proficiency is a big part of developing that ability to cut through complexity.

Emerging Trends and Next Frontiers in System Design

While core principles remain steady, the ecosystem around system design is evolving rapidly. We can identify three significant trends:

Integration of AI agents with software systems: As Gavin Bintz writes in Agent One, an emerging trend is the integration of AI agents with everyday software systems. New standards like Anthropic’s Model Context Protocol (MCP) are making it easier for AI models to securely interface with external tools and services. You can think of MCP as a sort of “universal adapter” that lets a large language model safely query your database, call an API like Stripe, or post a message to Slack – all through a standardized interface. This development opens doors to more powerful, context-aware AI assistants, but it also raises architectural challenges. Designing a system that grants an AI agent limited, controlled access to critical services requires careful thought around authorization, sandboxing, and observability (e.g., tracking what the AI is doing).
Chopra sees MCP as fertile ground for new system design patterns and best practices in the coming years.

Deepening of observability and automation in system management: Imagine systems that not only detect an anomaly but also pinpoint the likely root cause across your microservices and possibly initiate a fix. As Sam Suthar, Founding Director at Middleware, observes, early steps in this direction are already in play – for example, tools that correlate logs, metrics, and traces across a distributed stack and use machine learning to identify the culprit when something goes wrong. The ultimate goal is to dramatically cut Mean Time to Recovery (MTTR) when incidents occur, using AI to assist engineers in troubleshooting. As one case study showed, a company using AI-based observability was able to resolve infrastructure issues 75% faster while cutting monitoring costs by 75%. The complexity of modern cloud environments is pushing us toward this new normal of predictive, adaptive systems.

Sustainable software architecture: There is growing dialogue now about designing systems that are not only robust and scalable, but also efficient in their use of energy and resources. The surge in generative AI has shone a spotlight on the massive power consumption of large-scale services. According to Kemene et al., in an article published by the World Economic Forum (WEF), data centers powering AI workloads can consume as much electricity as a small city; the International Energy Agency projects data center energy use will more than double by 2030, with AI being “the most important driver” of that growth. Green software engineering principles urge us to consider the carbon footprint of our design choices. Sinha suggests this as an area to pay attention to.

Despite faster cycles, sharper constraints, and more automation, system design remains grounded in principles. As Chopra and Sinha make clear, the ability to reason about failure, scale, and trade-offs isn’t just how systems stay up; it’s also how engineers move up in their careers.

If you found Sinha and Chopra’s perspective on designing for scale and failure compelling, their book System Design Guide for Software Professionals unpacks the core attributes that shape resilient distributed systems. The following excerpt from the book breaks down how consistency, availability, partition tolerance, and other critical properties interact in real-world architectures. You’ll see how design choices around reads, writes, and replication influence system behavior—and why understanding these trade-offs is essential for building scalable, fault-tolerant infrastructure.

Expert Insight: Distributed System Attributes by Dhirendra Sinha and Tejas Chopra

The complete “Chapter 2: Distributed System Attributes” from the book System Design Guide for Software Professionals by Dhirendra Sinha and Tejas Chopra (Packt, August 2024)…

Before we jump into the different attributes of a distributed system, let’s set some context in terms of how reads and writes happen.

Let’s consider an example of a hotel room booking application (Figure 2.1). A high-level design diagram helps us understand how writes and reads happen:

Figure 2.1 – Hotel room booking request flow

As shown in Figure 2.1, a user (u1) is booking a room (r1) in a hotel and another user is trying to see the availability of the same room (r1) in that hotel. Let’s say we have three replicas of the reservations database (db1, db2, and db3).
There can be two ways the writes get replicated to the other replicas: the app server itself writes to all replicas, or the database has replication support and the writes get replicated without explicit writes by the app server.

Let’s look at the write and the read flows:

Read the Complete Chapter

System Design Guide for Software Professionals by Dhirendra Sinha and Tejas Chopra (Packt, August 2024) is a comprehensive, interview-ready manual for designing scalable systems in real-world settings. Drawing on their experience at Google, Netflix, and Yahoo, the authors combine foundational theory with production-tested practices—from distributed systems principles to high-stakes system design interviews.

For a limited time, get the eBook for $9.99 at packtpub.com — no code required.

Get the Book

🛠️Tool of the Week⚒️

Diagrams 0.24.4 — Architecture Diagrams as Code, for System Designers

Diagrams is an open source Python toolkit that lets developers define cloud architecture diagrams using code. Designed for rapid prototyping and documentation, it supports major cloud providers (AWS, GCP, Azure), Kubernetes, on-prem infrastructure, SaaS services, and common programming frameworks—making it ideal for reasoning about modern system design.

The latest release (v0.24.4, March 2025) adds stability improvements and ensures compatibility with recent Python versions. Diagrams has been adopted in production projects like Apache Airflow and Cloudiscovery, where infrastructure visuals need to be accurate, automatable, and version controlled.

Highlights:

- Diagram-as-Code: Define architecture models using simple Python scripts—ideal for automation, reproducibility, and tracking in Git.
- Broad Provider Support: Over a dozen categories including cloud platforms, databases, messaging systems, DevOps tools, and generic components.
- Built on Graphviz: Integrates with Graphviz to render high-quality, publishable diagrams.
- Extensible and Scriptable: Easily integrate with build pipelines or architecture reviews without relying on external design tools.

Visit Diagrams' GitHub Repo
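As a taste of the diagram-as-code workflow, this short script follows the style of the project’s own examples (the component names are ours) and renders a small load-balanced service; it requires Graphviz installed locally:

```python
from diagrams import Cluster, Diagram
from diagrams.aws.compute import EC2
from diagrams.aws.database import RDS
from diagrams.aws.network import ELB

# Writes web_service.png to the working directory; show=False skips the viewer.
with Diagram("Web Service", show=False):
    lb = ELB("ingress")
    with Cluster("app tier"):
        workers = [EC2("app1"), EC2("app2"), EC2("app3")]
    db = RDS("primary db")
    lb >> workers >> db
```

Because the diagram is just Python, it can live next to the service code, be reviewed in pull requests, and be regenerated in CI whenever the architecture changes.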
📰 Tech Briefs

- Analyzing Metastable Failures in Distributed Systems: A new HotOS'25 paper builds on prior work to introduce a simulation-based pipeline—spanning Markov models, discrete event simulation, and emulation—to help engineers proactively identify and mitigate metastable failure modes in distributed systems before they escalate.
- A Senior Engineer's Guide to the System Design Interview: A comprehensive, senior-level guide to system design interviews that demystifies core concepts, breaks down real-world examples, and equips engineers with a flexible, conversational framework for tackling open-ended design problems with confidence.
- Using Traffic Mirroring to Debug and Test Microservices in Production-Like Environments: Explores how production traffic mirroring—using tools like Istio, AWS VPC Traffic Mirroring, and eBPF—can help engineers safely debug, test, and profile microservices under real-world conditions without impacting users.
- Designing Instagram: This comprehensive system design breakdown of Instagram outlines the architecture, APIs, storage, and scalability strategies required to support core features like media uploads, feed generation, social interactions, and search—emphasizing reliability, availability, and performance at massive scale.
- Chiplets and the Future of System Design: A forward-looking piece on how chiplets are reshaping the assumptions behind system architecture—covering yield, performance, reuse, and the growing need for interconnect standards and packaging-aware system design.

That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.

Take a moment to fill out this short survey we now run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.

We’ll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.