
Deep Engineering

Deep Engineering #14: Mihalis Tsoukalos on Go’s Concurrency Discipline

Divya Anne Selvaraj
21 Aug 2025
Contexts, cancellations, and bounded work—plus Chapter 8 from Mastering Go

Mastering Memory in C++ with Patrice Roy

More than 70% of severe vulnerabilities come from memory safety errors. This masterclass will show you how to write C++ that isn't part of that problem. Join Patrice Roy — ISO C++ Standards Committee member and author of C++ Memory Management — for a 2-day live masterclass on writing safe, efficient, and robust C++ code.

What you'll learn (hands-on):
✔ Smart pointers and RAII for predictable ownership
✔ Exception-safe, high-performance techniques
✔ Debugging leaks, alignment, and ownership issues
✔ Building memory-safe code that performs under pressure

Patrice has taught C++ since 1998, trained professional programmers for over 20 years, and speaks regularly at CppCon. This masterclass distills his experience into practical skills you can apply immediately in production. Use code DEEPENG30 for 30% off.

Register Now

Hi, welcome to the fourteenth issue of Deep Engineering.

Go 1.25 has arrived with container-aware GOMAXPROCS defaults—automatically sizing parallelism to a container's CPU limit and adjusting as limits change—so services avoid kernel throttling and the tail-latency spikes that follow. This issue applies the same premise at the code level—structure concurrency to real capacity with request-scoped contexts, explicit deadlines, and bounded worker pools—so behavior under load is predictable and observable.

For today's issue we spoke with Mihalis Tsoukalos, a UNIX systems engineer and prolific author of Go Systems Programming and Mastering Go (4th ed.). He holds a BSc (University of Patras) and an MSc (UCL), has written for Linux Journal, USENIX ;login:, and C/C++ Users Journal, and brings deep systems, time-series, and database expertise.

We open with a feature on request-scoped concurrency, cancellations, and explicit limits—then move straight into the complete Chapter 8: Go Concurrency from Mastering Go. You can watch the interview and read the complete transcript here, or scroll down for today's feature.

📢 Important: Deep Engineering is Moving to Substack

In two weeks, we'll be shifting Deep Engineering fully to Substack. From that point forward, all issues will come from [email protected]. To ensure uninterrupted delivery, please whitelist this address in your mail client. No other action is required. You'll continue receiving the newsletter on the same weekly cadence, and on Substack you'll also gain more granular control over preferences if you wish to adjust them later. We'll send a reminder in next week's issue as the cutover approaches.

Sign Up | Advertise

Structured Concurrency in Go for Real-World Reliability with Mihalis Tsoukalos

Go's structured concurrency model represents a set of disciplined practices for building robust systems. By tying goroutines to request scopes with context, deadlines, and limits, engineers can prevent leaks and overload, achieving more predictable, observable behavior under production load.

Why Structured Concurrency Matters in Go (and What It Prevents)

In production Go services, concurrency must be deliberate. Structured concurrency means organizing goroutines with clear lifecycles—so no worker is left running once its purpose is served.
This prevents common failure modes like memory leaks, blocked routines, and resource exhaustion from runaway goroutines. As Mihalis Tsoukalos emphasizes, concurrency in Go "is not just a feature—it's a design principle. It influences how your software scales, how efficiently it uses resources, and how it behaves under pressure".

Unstructured use of goroutines (e.g. spawning on every request without coordination) can lead to unpredictable latencies and crashes. In contrast, a structured approach ensures that when a client drops a request or a deadline passes, all related goroutines cancel promptly. The result is a system that degrades gracefully instead of accumulating ghosts and locked resources.

Request-Scoped Concurrency with Context and Cancellation

Go's context.Context is the cornerstone of request-scoped concurrency. Every inbound request or task should carry a Context that child goroutines inherit, allowing coordinated cancellation and timeouts. By convention, functions accept a ctx parameter and propagate it downward. As Tsoukalos advises, "always be explicit about goroutine ownership and lifecycle" by using contexts for cancellation—this way, goroutines "don't hang around longer than they should, avoiding memory leaks and unpredictable behavior".

A common pattern is to spawn multiple sub-tasks and cancel all of them if one fails or the client disconnects. The golang.org/x/sync/errgroup package provides a convenient way to manage such groups of goroutines with a shared context. Using errgroup.WithContext, each goroutine returns an error, and the first failure cancels the group's context, immediately signaling siblings to stop. Even without this package, you can achieve similar structure with sync.WaitGroup and manual cancellation signals, but errgroup streamlines error propagation.

The following is a snippet from Mastering Go, 4th Ed. demonstrating context cancellation in action. A goroutine is launched to simulate some work and then cancel the context, while the main logic uses a select to either handle normal results or react to cancellation:

c1, cancel := context.WithCancel(context.Background())
defer cancel()

go func() {
	time.Sleep(4 * time.Second)
	cancel()
}()

select {
case <-c1.Done():
	fmt.Println("Done:", c1.Err())
	return
case r := <-time.After(3 * time.Second):
	fmt.Println("result:", r)
}

Listing: Using context.WithCancel to tie a goroutine's work to a cancelable context.

In this example, if the work doesn't finish before the context is canceled (or a 3-second timeout elapses), the Done channel is closed and the function prints the error (e.g. "context canceled"). In real services, you would derive the context from an incoming request (HTTP, RPC, etc.), use context.WithTimeout or WithDeadline to bound its lifetime, and pass it into every database call or external API request. All goroutines spawned to handle that request listen for ctx.Done() and exit when cancellation or deadline occurs. This structured approach prevents goroutine leaks – every launched goroutine is tied to a request context that will be canceled on completion or error. It also centralizes error handling: the context's error (such as context.DeadlineExceeded) signals a timeout, which can be logged or reported upstream in a consistent way.
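To make the errgroup pattern described above concrete, here is a minimal sketch (not from the chapter; the fetch function and its task names are illustrative stand-ins) of errgroup.WithContext cancelling sibling goroutines when one task fails:

// Minimal sketch (illustrative, not from Mastering Go): errgroup.WithContext
// ties a group of sub-tasks to one request-scoped context; the first error
// cancels the shared context so the surviving goroutines exit instead of leaking.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"golang.org/x/sync/errgroup"
)

func fetch(ctx context.Context, name string, d time.Duration) error {
	select {
	case <-time.After(d): // simulated work
		fmt.Println(name, "done")
		return nil
	case <-ctx.Done(): // a sibling failed or the caller cancelled
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error { return fetch(ctx, "users", 100*time.Millisecond) })
	g.Go(func() error { return fetch(ctx, "orders", 300*time.Millisecond) })
	g.Go(func() error { return errors.New("payments backend unavailable") })

	// Wait returns the first non-nil error after all goroutines have finished.
	if err := g.Wait(); err != nil {
		fmt.Println("request failed:", err)
	}
}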
Bounding Concurrency and Backpressure with Semaphores and Channels

Another key to structured concurrency is bounded work. Go's goroutines are cheap, but they aren't free – unchecked concurrency can exhaust memory or overwhelm databases. Tsoukalos warns that just because goroutines are lightweight, you shouldn't "spin up thousands of them without thinking. If you're processing a large number of tasks or I/O operations, use worker pools, semaphores, or bounded channels to keep things under control". In practice, this means limiting the number of concurrent goroutines doing work for a given subsystem. By applying backpressure (through limited buffer channels or tokens), you avoid queueing infinite work and crashing under load.

One simple pattern is a worker pool: maintain a fixed pool of goroutines that pull tasks from a channel. This provides controlled concurrency — "you're not overloading the system with thousands of goroutines, and you stay within limits like memory, file descriptors, or database connections," as Tsoukalos notes. The system's behavior under load becomes predictable because you've put an upper bound on parallel work.

Another powerful primitive is a weighted semaphore. The Go team provides golang.org/x/sync/semaphore for this purpose. You can create a semaphore with weight equal to the maximum number of workers, then acquire a weight of 1 for each job. If all weights are in use, further acquisitions block – naturally throttling the input. The following code (from the Mastering Go chapter) illustrates a semaphore guarding a section of code that launches goroutines:

Workers := 4
sem := semaphore.NewWeighted(int64(Workers))
results := make([]int, nJobs)
ctx := context.TODO()

for i := range results {
	if err := sem.Acquire(ctx, 1); err != nil {
		fmt.Println("Cannot acquire semaphore:", err)
		break
	}
	go func(i int) {
		defer sem.Release(1)
		results[i] = worker(i) // do work and store result
	}(i)
}

// Block until all workers have released their permits:
_ = sem.Acquire(ctx, int64(Workers))

Listing: Bounded parallelism with a semaphore limits workers to Workers at a time.

In this pattern, no more than 4 goroutines will be active at once because any additional Acquire(1) calls must wait until a permit is released. The final Acquire of all permits is a clever way to wait for all workers to finish (it blocks until it can acquire Workers permits, i.e. until all have been released). Bounded channels can achieve a similar effect: for example, a buffered channel of size N can act as a throttle by blocking sends when N tasks are in flight. Pipelines, a series of stages connected by channels, also inherently provide backpressure – if a downstream stage is slow or a channel is full, upstream goroutines will pause on send, preventing unlimited buildup. The goal in all cases is the same: limit concurrency to what your system resources can handle. Recent runtime changes in Go 1.25 even adjust GOMAXPROCS automatically to the container's CPU quota, preventing the scheduler from running too many threads on limited CPU (go.dev). By design, structured concurrency forces us to think in terms of these limits, so that a surge of traffic translates to graceful degradation (e.g. queued requests or slower processing) rather than a self-inflicted denial of service.
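As a companion to the semaphore listing, here is a minimal sketch (not from the chapter; the process function and its job count are made up) of the buffered-channel throttle mentioned above:

// Minimal sketch: a buffered channel of size N acts as a token bucket; a send
// blocks once N tasks are in flight, applying backpressure to the producer.
package main

import (
	"fmt"
	"sync"
	"time"
)

func process(job int) {
	time.Sleep(100 * time.Millisecond) // simulated work
	fmt.Println("processed", job)
}

func main() {
	const limit = 4
	tokens := make(chan struct{}, limit) // capacity = maximum in-flight work
	var wg sync.WaitGroup

	for job := 0; job < 20; job++ {
		tokens <- struct{}{} // blocks when `limit` workers are already running
		wg.Add(1)
		go func(job int) {
			defer wg.Done()
			defer func() { <-tokens }() // release the slot when done
			process(job)
		}(job)
	}
	wg.Wait()
}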
Observability and Graceful Shutdown in Practice

Structured concurrency not only makes systems more reliable during normal operation, but also improves their observability and shutdown behavior. With context-based cancellation, timeouts and cancellations surface explicitly as errors that can be logged and counted, rather than lurking silently. For instance, if a database call times out, Go returns a context.DeadlineExceeded error that you can handle – perhaps logging a warning with the operation name and duration. These error signals let you differentiate between a real failure (bug or unavailable service) and an expected timeout. In metrics, you might track the rate of context cancellations or deadlines exceeded to detect slowness in dependencies. Similarly, because every goroutine is tied to a context, you can instrument how many goroutines are active per request or service. Go's pprof and runtime metrics make it easy to measure goroutine count; if it keeps rising over time, that's a red flag for leaks or blocked goroutines. By structuring concurrency, any unexpected goroutine buildup is easier to trace to a particular code path, since goroutines aren't spawned ad-hoc without accountability.

Shutdown sequences also benefit. In a well-structured Go program, a SIGINT (Ctrl+C) or termination signal can trigger a cancellation of a root context, which cascades to cancel all in-flight work. Each goroutine will observe ctx.Done() and exit, typically logging a final message. Using deadlines on background work ensures that even stuck operations won't delay shutdown indefinitely – they'll timeout and return. The result is a clean teardown: no hanging goroutines or resource leaks after the program exits.

As Tsoukalos puts it, "goroutine supervision is critical. You need to track what your goroutines are doing, make sure they shut down cleanly, and prevent them from sitting idle in the background". This discipline means actively monitoring and controlling goroutines' lifecycle in code and via observability tools. Production Go teams often implement heartbeat logs or metrics for long-lived goroutines to confirm they are healthy, and use context to ensure any that get stuck can be cancelled. In distributed tracing systems, contexts carry trace IDs and cancellation signals across service boundaries, so a canceled request's trace clearly shows which operations were aborted. All of this contributes to a system where concurrency is not a source of mystery bugs – instead, cancellations, timeouts, and errors become first-class, visible events that operators can understand and act upon.
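A minimal sketch of the shutdown cascade described above (not from the chapter; the worker loop is a hypothetical stand-in for request handling) using the standard library's signal.NotifyContext:

// Minimal sketch: a root context derived from OS signals cascades cancellation
// to all in-flight goroutines on SIGINT/SIGTERM, giving a clean teardown.
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func worker(ctx context.Context, id int, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done(): // root context cancelled: shut down cleanly
			fmt.Printf("worker %d: shutting down (%v)\n", id, ctx.Err())
			return
		case <-time.After(500 * time.Millisecond):
			fmt.Printf("worker %d: heartbeat\n", id) // supervision signal
		}
	}
}

func main() {
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	var wg sync.WaitGroup
	for i := 1; i <= 3; i++ {
		wg.Add(1)
		go worker(ctx, i, &wg)
	}

	<-ctx.Done() // block until a termination signal arrives
	fmt.Println("signal received, waiting for workers...")
	wg.Wait() // no goroutines left behind after exit
}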
7-Point Structured Concurrency Checklist for Production

1. Context Everywhere: Pass a context.Context to every goroutine and function handling a request. Derive timeouts or deadlines to avoid infinite waits.
2. Always Cancel (Cleanup): Use defer cancel() after context.WithTimeout/Cancel so resources are freed promptly. Never leave a context dangling.
3. Bound the Goroutines: Limit concurrency with worker pools, semaphores, or bounded channels – don't spawn unbounded goroutines on unbounded work.
4. Propagate Failures: Use errgroup or sync.WaitGroup + channels to wait for goroutines and propagate errors. If one task fails, cancel the rest to fail fast.
5. Graceful Shutdown Hooks: On service shutdown, signal cancellation (e.g. cancel a root context or close a quit channel) and wait for goroutines to finish or timeout.
6. Avoid Blocking Pitfalls: Use buffered channels for high-volume pipelines and select with a default or timeout case in critical loops to prevent global stalls.
7. Instrument & Observe: Monitor goroutine counts, queue lengths, and context errors in logs/traces. A spike in "context canceled" or steadily rising goroutines means your concurrency is getting out of control.

In Go, by consciously scoping and bounding every goroutine – and embracing cancellation as a normal outcome – engineers can build services that stay robust and transparent under stress. The effort to impose this structure pays off with systems that fail gracefully instead of unpredictably, proving that well-managed concurrency is a prerequisite for reliable production Go.

🧠Expert Insight

The complete "Chapter 8: Go Concurrency" from Mastering Go, 4th ed. by Mihalis Tsoukalos

In this comprehensive chapter, Tsoukalos walks you through the production primitives you'll actually use: goroutines owned by a Context, channels when appropriate (and when to prefer mutex/atomics), pipelines and fan-in/out, WaitGroup discipline, and a semaphore-backed pool that keeps concurrency explicitly bounded.

The key component of the Go concurrency model is the goroutine, which is the minimum executable entity in Go. To create a new goroutine, we must use the go keyword followed by a function call or an anonymous function—the two methods are equivalent. For a goroutine or a function to terminate the entire Go application, it should call os.Exit() instead of return. However, most of the time, we exit a goroutine or a function using return because...

Read the Complete Chapter

🛠️ Tool of the Week

Ray – Open-Source, High-Performance Distributed Computing Framework

Ray is an open-source distributed execution engine that enables developers to scale applications from a single machine to a cluster with minimal code changes.

Highlights:
Easy Parallelization: Ray offers a simple API (e.g. the @ray.remote decorator) to turn ordinary functions into distributed tasks, running across cores or nodes with minimal code modifications and hiding the complexity of threads or networking behind the scenes.
Scalable & Heterogeneous: It supports fine-grained and coarse-grained parallelism, efficiently executing many concurrent tasks on a cluster.
Resilient Execution: Built-in fault tolerance means Ray automatically retries failed tasks and can persist state (checkpointing), so even long-running jobs recover from node failures without manual intervention.
Battle-Tested at Scale: It's been deployed on clusters with thousands of nodes (over 1 million CPU cores) for demanding applications – demonstrating robust operation at extreme scale.

Learn more about Ray

📎Tech Briefs

Go 1.25 is released: The version update brings improvements across tools, runtime, compiler, linker, and the standard library, along with opt-in experimental features like a new garbage collector and an updated encoding/json/v2 package.

Container-aware GOMAXPROCS: Go 1.25 introduces container-aware defaults for GOMAXPROCS, automatically aligning parallelism with container CPU limits to reduce throttling, improve tail latency, and make Go more production-ready out of the box.

Combine Or-Channel Patterns Like a Go Expert: Advanced Go Concurrency by Archit Agarwal: Explains the "or-channel" concurrency pattern in Go, showing how to combine multiple done channels into one so that execution continues as soon as any goroutine finishes, and demonstrates a recursive implementation that scales elegantly to handle any number of channels (a minimal sketch follows this list).

Concurrency | Learn Go with tests by Chris James: Shows you how to speed up a slow URL-checking function in Go by introducing concurrency: using goroutines to check multiple websites in parallel, and channels to safely coordinate results without race conditions, making the function around 100× faster while preserving correctness through tests and benchmarks.

Singleflight in Go: A Clean Solution to Cache Stampede by Dilan Dashintha: Explains how Go's singleflight package addresses the cache stampede problem by ensuring that only one request for a given key is in-flight at any time, while other concurrent requests wait and reuse the result.
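For readers who want the gist of the or-channel brief above without clicking through, here is a minimal sketch (not Agarwal's code; the timing helpers are made up) of the recursive pattern:

// Minimal sketch of the or-channel pattern: or combines any number of done
// channels into one that closes as soon as any input closes, recursing to
// handle arbitrary fan-in.
package main

import (
	"fmt"
	"time"
)

func or(channels ...<-chan struct{}) <-chan struct{} {
	switch len(channels) {
	case 0:
		return nil
	case 1:
		return channels[0]
	}
	orDone := make(chan struct{})
	go func() {
		defer close(orDone)
		switch len(channels) {
		case 2:
			select {
			case <-channels[0]:
			case <-channels[1]:
			}
		default:
			select {
			case <-channels[0]:
			case <-channels[1]:
			case <-channels[2]:
			case <-or(append(channels[3:], orDone)...): // recurse on the rest
			}
		}
	}()
	return orDone
}

// after closes its channel once the given duration has elapsed.
func after(d time.Duration) <-chan struct{} {
	c := make(chan struct{})
	go func() { defer close(c); time.Sleep(d) }()
	return c
}

func main() {
	start := time.Now()
	<-or(after(5*time.Second), after(1*time.Second), after(3*time.Second))
	fmt.Println("done after", time.Since(start)) // ~1s: the earliest channel wins
}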
That's all for today. Thank you for reading this issue of Deep Engineering. We're just getting started, and your feedback will help shape what comes next. Do take a moment to fill out this short survey we run monthly—as a thank-you, we'll add one Packt credit to your account, redeemable for any book of your choice.

We'll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Deep Engineering #13: Designing Staleness SLOs for Dynamo-Style KV Stores with Archit Agarwal

Divya Anne Selvaraj
14 Aug 2025
Make "eventual" measurable: N/R/W tuning, staleness SLIs, instrumentation, and repair budgets.

Staying sharp in .NET takes more than just keeping up with release notes. You need practical tips, battle-tested patterns, and scalable solutions from experts who've been there. That's exactly what you'll find in .NETPro, Packt's new newsletter, with a free eBook waiting for you as a welcome bonus when you sign up.

Join .NETPro — It's Free

Hi, welcome to the thirteenth issue of Deep Engineering.

Eventual consistency is a fact of life in distributed key-value stores. The operational task is to bound staleness and make it observable.

This issue features a guest article by Archit Agarwal that builds a Dynamo-style store in Go from first principles—consistent hashing, replication, quorums, vector clocks, gossip, and Merkle trees—without hiding the details. Building on it, our feature turns those primitives into a staleness SLO. We cover selecting N/R/W, defining SLIs (stale-read rate, staleness age, convergence time), sizing anti-entropy and hinted-handoff budgets, and placing instrumentation on the read and write paths.

Agarwal is a Principal Member of Technical Staff at Oracle, where he engineers ultra-low-latency authorization services in Go. He also writes The Weekly Golang Journal, focused on turning advanced system design into practical tools, with a consistent emphasis on performance and operational efficiency.

You can start with Agarwal's walkthrough for the mechanics, then read today's feature for SLIs/SLOs, alert thresholds, and more.

Become a C++ Memory Expert and Learn Live with Patrice Roy

40% off for a Limited Time. Use code PRELAUNCH40 at checkout to get a 40% discount - our lowest-ever, available only until August 18th, when we officially launch.

Register Now

Sign Up | Advertise

Designing Staleness SLOs for Dynamo-Style KV Stores with Archit Agarwal

In an eventually consistent, Dynamo-style key-value store, not all reads immediately reflect the latest writes – some reads may return stale data until replication catches up. Staleness is the window during which a read sees an older value than the freshest replica. Defining a Service Level Objective (SLO) for staleness makes this behavior explicit and measurable, so teams can control how "eventual" the consistency is in operational terms.

Control surfaces for staleness

In Dynamo-style systems, three parameters shape staleness behavior: N, R, and W. N is the replication factor (number of replicas per key). R and W are the read and write quorum counts – the minimum replicas that must respond to consider a read or write successful. These define the overlap between readers and writers. If you choose quorums such that R + W > N, every read set intersects every write set by at least one replica, guaranteeing that a read will include at least one up-to-date copy (no stale values) under normal conditions.

Tuning R and W affects latency and availability. A larger R means each read waits for more replicas, reducing the chance of stale data but increasing read latency (and failing if fewer than R nodes are available). A larger W similarly slows writes (and risks write unavailability if W nodes aren't up) but ensures more replicas carry the latest data on write acknowledge.
The replication factor N provides fault tolerance and influences quorum choices: a higher N lets the system survive more failures, but if R and W aren't adjusted, it can also increase propagation delay (more replicas to update) and the quorum sizes needed for consistency. Under network partitions, a Dynamo-style store can choose to continue with a partial quorum (favoring availability at the cost of serving stale data) or pause some operations to preserve consistency – R, W, N settings determine these trade-offs on the CAP spectrum (for example, a low R/W will serve data in a partition but possibly outdated, whereas high R/W might block reads/writes during a partition to avoid inconsistency).

Read path vs. write path: On writes, a coordinating node sends the update to all N replicas but considers the write successful once W replicas have acknowledged it. Only those W (or more) nodes are guaranteed to have the new version when the client gets a "success". The remaining replicas will receive the update asynchronously (hinted handoff or background sync).

Here is a simplified Go snippet enforcing a write quorum:

// Write quorum acknowledgement check
if ackCount >= W {
	fmt.Println("Write successful")
} else {
	fmt.Println("Write failed: insufficient replicas")
}

This check ensures the write isn't confirmed to the client until at least W replicas have persisted it. Operational impact: we can instrument this point to count how often writes succeed versus fail quorum. A high failure rate (ackCount < W) would hurt availability, whereas a success with only W acknowledgments means N - W replicas are still lagging – a window where stale reads are possible.

On reads, the coordinator contacts R replicas (often via a digest query). It waits for R responses and, typically, returns the latest version among those responses to the client (often using timestamps or vector clocks to identify freshness). If R < N, the coordinator might not see some newer replica that wasn't queried, so it's possible the client got a slightly stale value. That's why ensuring quorum overlap (R+W > N) or using R = N mitigates staleness. Still, even with quorums, if a write just succeeded with W acks, there may be N−W replicas not updated yet; a subsequent read that happens at a lower consistency level or before repair could encounter an older copy. In summary, R and W are the dials: crank them up for fresher reads (at the cost of latency/availability), or dial them down for speed and uptime (accepting a higher stale-read window).
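As a companion to the write-path check above, here is a minimal sketch of the read path's freshness logic (illustrative types and metric hooks, not code from Agarwal's store): the coordinator answers with the newest of the R responses it received and records a stale read whenever a newer version is known to exist on a replica outside the read set.

// Minimal sketch (illustrative): pick the freshest of R responses by timestamp,
// and record a stale read plus its staleness age if a newer version exists.
package main

import (
	"fmt"
	"time"
)

type Response struct {
	Replica   string
	Value     string
	Timestamp time.Time // last-write timestamp reported by the replica
}

// freshest returns the newest of the R responses used to answer the client.
func freshest(responses []Response) Response {
	best := responses[0]
	for _, r := range responses[1:] {
		if r.Timestamp.After(best.Timestamp) {
			best = r
		}
	}
	return best
}

func main() {
	now := time.Now()
	readSet := []Response{ // R = 2 replicas answered the coordinator
		{"node-a", "v1", now.Add(-2 * time.Second)},
		{"node-b", "v2", now.Add(-500 * time.Millisecond)},
	}
	answer := freshest(readSet)

	// Hypothetical: the newest timestamp known anywhere in the replica set,
	// e.g. learned from the third replica's digest or a later read repair.
	newestKnown := now
	if newestKnown.After(answer.Timestamp) {
		staleAge := newestKnown.Sub(answer.Timestamp)
		fmt.Printf("stale read: served %s, staleness age %v\n", answer.Value, staleAge)
		// staleReadsTotal.Inc(); stalenessAgeHist.Observe(staleAge.Seconds()) // metric hooks
	}
}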
What to Measure: Staleness SLIs and SLO

To manage staleness, we define Service Level Indicators (SLIs) that capture how stale the data is, and set SLO targets for them. Key metrics include:

Stale-read rate: the fraction of reads that return data older than the newest replica's value at the moment of read. In practice, a "stale read" can be flagged if a read request did not fetch the most up-to-date version that exists in the system. (Detecting this may require the coordinator to compare all R responses or consult a freshness timestamp from a designated primary.) This rate should ideally trend toward 0% once the system has quiesced after writes. It directly indicates how often users see outdated data.

Staleness age: the time difference between the value's timestamp (or version) that a read returned and the latest write timestamp for that item at read time. This measures how old the data is.

Convergence time: how long it takes for a write to propagate to all N replicas. Even after a write is acknowledged (at W nodes), the remaining replicas might get the update later (through gossip or anti-entropy). Convergence time can be measured by tracking the time from write commit to the time when the last replica has applied it. We should aim to keep convergence time low (and predictable) so that the window for stale reads (N−W replicas catching up) is bounded.

Repair backlog: the amount of data needing anti-entropy repair. This can be measured in number of keys or bytes that are out-of-sync across replicas. For example, if using Merkle trees for anti-entropy, we might track how many tree partitions differ between replicas, or how many hints are queued waiting to be delivered. In Cassandra, metrics like Hints_created_per_node reflect the number of pending hinted handoff messages per target node. A growing repair backlog indicates the system is accumulating inconsistency (replicas lagging behind) – which threatens the staleness SLO if not addressed. Operators should budget how much lag is acceptable and tune repair processes to keep this backlog small.

Hinted-handoff queue depth: if the system uses hinted handoff (buffering writes destined for a temporarily down node), this is a specific backlog metric tracking how many hints are stored and waiting. A large queue of hints means one or more replicas have been down or slow for a while and have many writes to catch up on. This directly correlates with staleness: those downed replicas might serve significantly stale data if read (or will cause consistency repair load when they recover). Monitoring the hints queue (count and age of oldest hint) helps ensure a down node doesn't silently violate staleness objectives by falling too far behind.

Vector clock conflict rate: the rate at which concurrent updates are detected, leading to divergent versions (siblings) that need reconciliation. Dynamo-style systems often use vector clocks to detect when two writes happened without knowledge of each other (e.g. during a network partition or offline write merges). Each unique conflict means a client might read two or more versions for the same key – an extreme form of staleness where causal order is unclear. We measure the proportion of operations (or writes) that result in conflict reconciliation. A higher conflict rate suggests the system is frequently writing in partitions or without coordination, requiring merges and possibly exposing clients to multi-version data. Lowering conflict rate (via stronger quorums or a "last write wins" policy) usually reduces stale anomalies at the cost of losing some update history. In Agarwal's Dynamo-Go implementation, vector clocks are represented as:

// Vector clock representation
type VectorClock map[string]int

Each node's counter in this map increments on local updates. When a write is replicated, the vector clocks are merged. If a read finds two concurrent VectorClock states that neither dominates (i.e., different nodes each advanced their own counter), it indicates a conflict. We could emit a metric at that point (e.g. conflict_versions_total++). Tracking this helps quantify how often clients might see non-linear history that needs merging. A rising conflict rate might trigger an alert to consider increasing W or improving network reliability.
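The dominance check described in that last bullet can be sketched as follows (a minimal illustration built on the VectorClock type above; the conflict counter is hypothetical and the example values are made up):

// Minimal sketch: if neither vector clock dominates the other, the updates were
// concurrent and a conflict metric should be incremented.
package main

import "fmt"

type VectorClock map[string]int

// dominates reports whether a is at least as new as b for every node counter.
func dominates(a, b VectorClock) bool {
	for node, bn := range b {
		if a[node] < bn {
			return false
		}
	}
	return true
}

// conflicting is true when neither clock dominates the other (concurrent writes).
func conflicting(a, b VectorClock) bool {
	return !dominates(a, b) && !dominates(b, a)
}

func main() {
	vc1 := VectorClock{"nodeA": 5, "nodeB": 3}
	vc2 := VectorClock{"nodeA": 5, "nodeB": 4}
	fmt.Println(conflicting(vc1, vc2)) // false: vc2 dominates vc1, no conflict

	vc3 := VectorClock{"nodeA": 6, "nodeB": 3}
	fmt.Println(conflicting(vc2, vc3)) // true: concurrent updates, siblings to reconcile
	// if conflicting(vc2, vc3) { conflictVersionsTotal++ } // hypothetical metric hook
}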
With these SLIs defined, we can now set an SLO for staleness. Typically, an SLO will specify a threshold for staleness that should be met a certain percentage of the time. For example, an organization might decide: "95% of reads should have a staleness age below 500 milliseconds, and stale-read occurrences should stay under 0.1% of all reads." Such an SLO sets clear expectations that nearly all reads are fresh (within 0.5s of the latest data) and very few return old data. It's important to pair these objectives with alerting thresholds and operational responses:

Example SLO (Staleness) – Target: P95 staleness age ≤ 500 ms, and stale-read rate ≤ 0.1% (per 1 hour window).

Alerts: If 95th percentile staleness exceeds 500 ms for more than 10 minutes (primary alert), on-call should investigate lagging replicas or network issues (possible causes: replication failing, anti-entropy backlog). If it exceeds 500 ms intermittently (e.g. 5 minutes in an hour – secondary warning), schedule a closer look at load or repair processes. Likewise, if stale-read rate rises above 0.1%, a primary alert signals potential consistency problems – operators might check for nodes down or heavy write load overwhelming W acknowledgments. A secondary alert at 0.05% could warn of a trend toward SLO violation, prompting checks of the hinted-handoff queue or Merkle tree diffs. We also set an absolute convergence time cap: e.g. maximum convergence time 5 s at P99.9. If any write takes more than 5 s to reach all replicas, that's a primary alert (perhaps a replica is stuck or a stream is failing – check the repair service or consider removing the node from rotation). A softer alert at 3–4 s convergence can help catch issues early.

Runbook notes: on stale-read alerts, first identify if a particular replica or region is lagging (e.g. check the repair backlog metrics and hint queues). On convergence-time alerts, verify the anti-entropy jobs aren't backlogged or throttled, and look for network partitions. The SLO is met when these metrics stay within targets.

Anti-Entropy and Repair Budgets

Achieving a staleness SLO requires active repair mechanisms to limit how long inconsistencies persist. Dynamo-style systems use two complementary approaches: read repair and background anti-entropy. Read repair triggers during a read operation when the system discovers that the replicas contacted have mismatched versions. In Cassandra, for example, if a quorum read finds one replica out-of-date, it will update that replica on the spot before returning to the client. The client gets the up-to-date value, and the involved replicas are made consistent. Read repair thus opportunistically burns down staleness for frequently-read data – the more a piece of data is read, the more chances to fix any replica that missed a write. However, read repair alone isn't enough for rarely-read items (which might remain inconsistent indefinitely if never read). That's where background anti-entropy comes in.

Background anti-entropy tasks (often using Merkle trees or similar data digests) run periodically to compare replicas and repair differences in bulk. Each replica maintains a Merkle tree of its key-range; by comparing trees between replicas, the system can find which segments differ without comparing every item. A simple representation of a Merkle tree node in Go might look like:

type MerkleNode struct {
	hash  []byte
	left  *MerkleNode
	right *MerkleNode
}

Using such trees, a background job can efficiently identify out-of-sync keys and synchronize them.
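As an illustration of how such a comparison might proceed (a minimal sketch, not the store's actual repair job; the leaf layout, hashing scheme, and sample keys are assumptions):

// Minimal sketch: build a Merkle tree over an ordered set of key/value hashes
// and walk two trees together, descending only into subtrees whose root hashes
// disagree, to find the leaf ranges that need repair.
package main

import (
	"crypto/sha256"
	"fmt"
)

type MerkleNode struct {
	hash  []byte
	left  *MerkleNode
	right *MerkleNode
}

// build constructs a balanced tree over pre-hashed leaves (one per key range).
func build(leaves [][]byte) *MerkleNode {
	if len(leaves) == 1 {
		return &MerkleNode{hash: leaves[0]}
	}
	mid := len(leaves) / 2
	l, r := build(leaves[:mid]), build(leaves[mid:])
	sum := sha256.Sum256(append(append([]byte{}, l.hash...), r.hash...))
	return &MerkleNode{hash: sum[:], left: l, right: r}
}

// diff records the indices of leaves whose hashes differ between two replicas.
func diff(a, b *MerkleNode, lo, hi int, out *[]int) {
	if string(a.hash) == string(b.hash) {
		return // subtree in sync: skip it entirely
	}
	if a.left == nil { // leaf: this key range needs repair
		*out = append(*out, lo)
		return
	}
	mid := lo + (hi-lo)/2
	diff(a.left, b.left, lo, mid, out)
	diff(a.right, b.right, mid, hi, out)
}

func hashVal(s string) []byte { h := sha256.Sum256([]byte(s)); return h[:] }

func main() {
	replicaA := build([][]byte{hashVal("k0=v1"), hashVal("k1=v1"), hashVal("k2=v1"), hashVal("k3=v1")})
	replicaB := build([][]byte{hashVal("k0=v1"), hashVal("k1=v2"), hashVal("k2=v1"), hashVal("k3=v1")})

	var stale []int
	diff(replicaA, replicaB, 0, 4, &stale)
	fmt.Println("leaf ranges needing repair:", stale) // [1]
}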
The cadence and rate of this repair job act as a budget for staleness: if you run anti-entropy more frequently (or allow it to use more bandwidth), inconsistencies are corrected sooner, reducing worst-case staleness. For example, if repairs run every hour, a replica that missed an update will be stale at most an hour (ignoring hints) before the Merkle tree comparison catches it. If that's too long for your SLO, you might increase repair frequency or switch to continuous incremental repair.

It's important to configure repair rate limits so that anti-entropy doesn't overwhelm the cluster. Repair can be I/O-intensive; throttling it (e.g. limiting streaming bandwidth or number of partitions fixed per second) prevents impact to front-end latency but prolongs how long replicas remain inconsistent. The SLO provides a guideline here: if our SLO is "staleness age P95 < 500ms", and we notice background repairs are taking minutes to hours to cover the dataset, that's a mismatch – we'd need either a faster repair cycle or rely on stronger quorums to mask that delay.

Membership churn (nodes leaving or joining) can rapidly inflate the repair backlog. For instance, when a node goes down, any writes it misses will generate hints and differences. If it's down for 30 minutes, that's 30 minutes of writes to reconcile when it comes back. If nodes frequently fail or if we add new nodes (which require streaming data to them), the system could constantly be in "catch-up" mode. Operators should track how quickly repair debt accrues vs. how fast it's paid off.

Parameter Playbook: N, R, W Trade-offs

To concretely guide tuning, here's a playbook of quorum settings and their qualitative effects. Each row shows a representative (N, R, W) configuration, the quorum overlap (R + W – N), tolerance to failures, and the read/write latency-consistency trade-off:

In practice, many deployments choose a middle ground like (N=3, R=2, W=1) or (N=3, R=1, W=2) for eventually consistent behavior, or (R=2, W=2) for firm consistency. The overlap formula R + W – N indicates how many replicas' data a read is guaranteed to share with the last write; positive overlap means at least one replica in common (so a read will catch that write), zero or negative means it's possible for a read to entirely miss the latest writes. As shown above, larger quorums improve consistency at the expense of latency and fault tolerance. Smaller quorums improve performance and fault tolerance (you can lose more nodes and still operate) but increase the chance of stale responses. When setting an SLO, you can use this table to pick a configuration that meets your freshness targets.

(Note: The table uses N=3 for illustration; higher N follow similar patterns. For instance, (5, 3, 1) has overlap -1 (fast writes, slow-ish reads, likely stale), whereas (5, 3, 3) has overlap +1 (quorum consistency), and (5, 4, 4) would have overlap +3 but little failure tolerance.)

Implementation Hooks and Metrics

Finally, let's tie these concepts to the actual implementation (as in Agarwal's Dynamo-style Go store) and discuss where to instrument. We've already seen how write quorum enforcement is coded and where we could count successes/failures. Another crucial piece is replica selection – knowing which nodes are responsible for a key. Agarwal's store uses consistent hashing to map keys to nodes.
For a given key, the system finds the N replicas in the ring responsible for it:

// Replica selection for a key (basis for R/W placement and convergence measurement)
func (ring *HashRing) GetNodesForKey(key string) ([]ICacheNode, error) {
	h, err := ring.generateHash(key)
	if err != nil {
		return nil, err
	}
	start := ring.search(h)
	seen := map[string]struct{}{}
	nodes := []ICacheNode{}
	for i := start; len(nodes) < ring.config.ReplicationFactor && i < start+len(ring.sortedKeys); i++ {
		vHash := ring.sortedKeys[i%len(ring.sortedKeys)]
		node, _ := ring.vNodeMap.Load(vHash)
		n := node.(ICacheNode)
		if _, ok := seen[n.GetIdentifier()]; !ok {
			nodes = append(nodes, n)
			seen[n.GetIdentifier()] = struct{}{}
		}
	}
	return nodes, nil
}

This function returns the list of nodes that should hold a given key (up to N distinct nodes). It's the backbone of both the write and read paths – writes go to these N nodes, reads query a subset (of size R) of them. From an SLO perspective, GetNodesForKey provides the scope of where we must monitor consistency for each item. We could instrument right after a write is accepted to track convergence. Also, if a read at consistency level < ALL is performed, using this function we could compare the version it got to other replicas' versions – if one of the other replicas has a higher version, that read was stale. This check could increment the stale-read counter. Essentially, GetNodesForKey lets us pinpoint which replicas to compare; it's where we "measure" consistency across the replica set.
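One way to hook convergence-time tracking onto that replica set might look like the following (a sketch only, with stand-in types; the polling approach and the Replica interface are assumptions, not part of Agarwal's store):

// Sketch only: after a write is acknowledged by W replicas, poll the full
// replica set for the key until every replica reports the new version, and
// record the elapsed time as the convergence-time SLI.
package main

import (
	"fmt"
	"time"
)

// Replica is a stand-in for the store's node interface.
type Replica interface {
	Has(key string, version int64) bool
}

type slowReplica struct{ readyAt time.Time }

// Has ignores key/version here; it simply becomes true once readyAt has passed.
func (s slowReplica) Has(key string, version int64) bool { return time.Now().After(s.readyAt) }

// measureConvergence blocks until all replicas hold the version (or the deadline
// passes), returning how long full propagation took after the acknowledgement.
func measureConvergence(key string, version int64, replicas []Replica, deadline time.Duration) (time.Duration, bool) {
	start := time.Now()
	for time.Since(start) < deadline {
		converged := true
		for _, r := range replicas {
			if !r.Has(key, version) {
				converged = false
				break
			}
		}
		if converged {
			return time.Since(start), true
		}
		time.Sleep(50 * time.Millisecond) // polling interval; a real store would watch hints/streams
	}
	return deadline, false // convergence-time cap exceeded: candidate for a primary alert
}

func main() {
	now := time.Now()
	replicas := []Replica{
		slowReplica{readyAt: now},                             // acknowledged immediately (part of W)
		slowReplica{readyAt: now.Add(200 * time.Millisecond)}, // lagging replica (one of N - W)
	}
	d, ok := measureConvergence("user:42", 7, replicas, 5*time.Second)
	fmt.Printf("converged=%v in %v\n", ok, d)
}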
For conflict detection, we already discussed vector clocks. Instrumentation-wise, whenever the system merges vector clocks (after a write or read repair), it can check if the merge resulted in multiple surviving branches. If yes, increment the conflict metric. The VectorClock type above is simple, but in usage, e.g., vc1 := VectorClock{"nodeA": 5, "nodeB": 3} and vc2 := VectorClock{"nodeA": 5, "nodeB": 4} would be compared – if neither dominates, you have a conflict. By observing how often that happens (and perhaps how many versions result), we quantify the "consistency anomalies" experienced.

Throughout the code, there are many places to emit metrics: when writes succeed or fail the quorum check, when read repair runs (count how many rows repaired), size of hinted-handoff queues, etc. The key is to map them to our SLO. For instance, after the "Write successful" log above, we might record the lagging replicas count (N - ackCount) for that write – if >0, that write contributes to potential staleness until those catch up. Summing such lag over time or tracking the max lag can inform convergence times. Similarly, each read could log the staleness age (now - last_write_timestamp seen) for that item. These instrumentations ensure that the theoretical SLI definitions (stale-read rate, staleness age, etc.) have concrete counters and timers in the running system.

With careful tuning (quorum sizes, repair cadence) and diligent monitoring, teams can reap the benefits of high availability while keeping staleness within acceptable bounds. Archit Agarwal's guest article provides the implementation details of these mechanisms in Go:

🧠Expert Insight

Building a Distributed Key-Value Store in Go: From Single Node to Planet Scale by Archit Agarwal

A build-to-learn exercise that walks through the architectural primitives behind Dynamo-style systems.

Read the Complete Article

🛠️ Tool of the Week

FoundationDB – Open-Source, Strongly Consistent Distributed Database

FoundationDB is a distributed key-value store that delivers strict serializable ACID transactions at scale, letting teams build multi-model services (documents, graphs, SQL-ish layers) on a single fault-tolerant core.

Highlights:
End-to-End Transactions: Global, multi-key ACID transactions with strict serializability simplify correctness versus eventually consistent or ad-hoc sharded systems.
Layered Multi-Model: Build higher-level data models (queues, doc/graph, catalog/metadata) as "layers" on top of the core KV engine—one reliable substrate for many services.
Resilience by Design: Automatic sharding, replication, and fast failover; continuous backup/restore and encryption options for enterprise reliability.
Deterministic Simulation Testing: Each release is hammered by large-scale fault-injection simulation, yielding exceptional robustness under node and network failures.

Learn more about FoundationDB

📎Tech Briefs

Skybridge: Bounded Staleness for Distributed Caches by Lyerly et al. | Meta Platforms Inc. and OpenAI: This conference paper describes Skybridge, a lightweight system developed at Meta that provides fine-grained, per-item staleness metadata for distributed caches, enabling best-effort or fail-closed bounded staleness (e.g., two seconds) at global scale by indexing recent writes across all shards, detecting replication gaps, and allowing cache hosts to prove most reads are fresh without re-fills—achieving up to 99.99998% 2-second consistency with minimal CPU, memory, and bandwidth overhead.

DAG-based Consensus with Asymmetric Trust (Extended Version) by Amores-Sesar et al.: This paper proves that naïvely swapping threshold quorums for asymmetric ones breaks DAG common-core ("gather") primitives, then introduces a constant-round asymmetric gather and, from it, the first randomized asynchronous DAG-based consensus for asymmetric trust that decides in expected constant rounds.

Rethinking Distributed Computing for the AI Era by Akshay Mittal | Staff Software Engineer at PayPal: This article calls for rethinking distributed computing for AI, highlighting how current architectures clash with transformer workloads and advocates for AI-native designs such as asynchronous updates, hierarchical communication, and adaptive resource use, drawing on DeepSeek's sparse Mixture-of-Experts model.

Repairing Sequential Consistency in C/C++11 by Lahav et al.: This paper identifies that the C/C++11 memory model's semantics for sequentially consistent (SC) atomics are flawed, and proposes a corrected model called RC11 that restores soundness of compilation, preserves DRF-SC, strengthens SC fences, and prevents out-of-thin-air behaviors.

Amazon SQS Fair Queues: a New Approach to Multi-Tenant Resiliency: Introduced in July 2025, this is a new feature that automatically mitigates noisy neighbor effects in multi-tenant message queues by prioritizing messages from quieter tenants to maintain low dwell times, combining the performance of standard queues with group-level fairness without requiring changes to existing consumer logic.
That's all for today. Thank you for reading this issue of Deep Engineering. We're just getting started, and your feedback will help shape what comes next. Do take a moment to fill out this short survey we run monthly—as a thank-you, we'll add one Packt credit to your account, redeemable for any book of your choice.

We'll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Deep Engineering #12: Tony Dunsworth on AI for Public Safety and Critical Systems

Divya Anne Selvaraj
07 Aug 2025
From quantization to synthetic data, how to build AI that's fast, private, and resilient under pressure.

Live Virtual Workshop: Securing Vibe Coding

Join Snyk's Staff Developer Advocate Sonya Moisset on August 28th at 11:00AM ET to learn:
✓ How Vibe Coding is reshaping development and the risks that come with it
✓ How Snyk secures your AI-powered SDLC from code to deployment
✓ Strategies to secure AI-generated code at scale

Earn 1 CPE Credit! Register today!

Hi, welcome to the twelfth issue of Deep Engineering.

"The challenge isn't how to train the biggest model—it's how to make a small one reliable."

That's how Tony Dunsworth sums up his work building AI infrastructure for 911 emergency systems. In public safety, failure can have devastating effects with lives at stake. You're also working with limited compute, strict privacy mandates, and call centers staffed by only two to five people at a time. There's no budget for a proprietary AI stack. And there's no tolerance for downtime.

Dunsworth holds a Ph.D. in data science, with a dissertation focused on forecasting models for public safety answering points. For over 15 years, he's worked across the full data lifecycle—from backend engineering to analytics and deployment—in some of the most sensitive domains in government. Today, he leads AI and data efforts for the City of Alexandria, where he's building secure, on-prem AI systems that help triage calls, reduce response time, and improve operational resilience.

To understand what it takes to design AI systems that are cost-effective, maintainable, and safe to use in critical systems, we spoke with Dunsworth about his use of synthetic data, model quantization, open-weight LLMs, and risk validation under operational load. You can watch the complete interview and read the transcript here or scroll down for our synthesis of what it takes to build mission-ready AI with small teams, tight constraints, and hardly any margin for error.

Ending on August 25, 11:00 AM PT. Learn from top-rated books such as C++ Memory Management, C++ in Embedded Systems, Asynchronous Programming with C++, and more. Elevate your C++ skills and help support The Global FoodBanking Network with your purchase!

Get the Bundle

Sign Up | Advertise

Building Emergency-Ready AI: Scaling Down to Meet Constraints — with Tony Dunsworth

How engineers in critical systems can design reliable, resource-efficient AI to meet hard limits on privacy, compute, and risk.

AI adoption in the public sector is accelerating, but slowly. A June 2025 EY survey of government executives found 64% see AI's cost-saving potential and 63% expect improved services, yet only 26% have integrated AI across their organizations. The appetite is there, but so are steep barriers. 62% cited data privacy and security concerns as a major hurdle – the top issue – along with lack of a clear data strategy, inadequate infrastructure and skills, unclear ROI, and funding shortfalls. Public agencies face tight budgets, limited tech staff, legacy hardware, and strict privacy mandates, all under an expectation of near-100% uptime for critical services.

Public safety systems epitomize these constraints. Emergency dispatch centers can't ship voice transcripts or medical data off to a cloud API that might violate privacy or go down mid-call.
They also can't afford fleets of cutting-edge GPUs; many 9-1-1 centers run on commodity servers or even ruggedized edge devices. AI solutions here must fit into existing, resource-constrained environments. For engineers building AI systems in production, scale isn't always the hard part—constraints are.

By treating public safety as a high-constraint exemplar, we can derive patterns applicable to other domains like healthcare (with HIPAA privacy and limited hospital IT), fintech (with heavy regulation and risk controls), logistics (where AI might run on distributed edge devices), embedded systems (tiny hardware, real-time needs), and regulated enterprises (compliance and uptime demands). In all such cases, "bigger" AI is not necessarily better – adaptability, efficiency, and trustworthiness determine adoption.

Leaner Models for Mission-Critical Systems

Open models come with transparent weights and permissive licenses that allow self-hosting and fine-tuning, which is crucial when data cannot leave your premises. In 2025, several open large language models (LLMs) have emerged that combine strong capabilities with manageable size:

Meta LLaMA 3: Released in 2024, with 8B and 70B parameter versions. LLaMA 3 offers state-of-the-art performance on many tasks and improved reasoning, and Meta touts it as "the best open source models of their class". However, its license restricts certain commercial uses and the training data is not fully disclosed. In practice, the 70B model is powerful but demanding to run, while the 8B version is much more lightweight.

Mistral 7B / Mixtral: The French startup Mistral AI has focused on efficiency. Mistral 7B (a 7-billion-parameter model) punches above its weight, often outperforming larger 13B models, especially on English and code tasks. They also introduced Mixtral 8×7B, a sparse Mixture-of-Experts model with 46.7B total parameters where only ~13B are active per token. This clever design means "Mixtral outperforms Llama 2 70B on most benchmarks with 6× faster inference" while maintaining a permissive Apache 2.0 open license. It matches or beats GPT-3.5-level performance at a fraction of the runtime cost. Mixtral's trick of not using all parameters at once lets a smaller server handle a model that behaves like a much larger one.

Swiss "open-weight" LLM: This is a new 70B-parameter model developed by a coalition of academic institutions (EPFL/ETH Zurich) on the public Alps supercomputer. The Swiss LLM is fully open: weights, code, and training dataset are released for transparency. Its focus is on multilingual support (trained on data in 1,500+ languages) and sovereignty – no dependency on Big Tech or hidden data. Licensed under Apache 2.0, it represents the "full trifecta: openness, multilingualism, and sovereign infrastructure," designed explicitly for high-trust public sector applications. Importantly, the Swiss model was developed to comply with EU AI Act and Swiss privacy laws from the ground up.

Other open models like Falcon 180B (UAE's giant model) or BLOOM 176B (the BigScience community model) exist, but their sheer size makes them less practical in constrained settings. The models above strike a better balance. Table 1 compares these representative options by size, hardware needs, and privacy posture:

Table 1: Open-source LLMs suited for constrained deployments, compared by size, infrastructure needs, and privacy considerations.

Choosing an open model allows agencies to avoid vendor lock-in and meet governance requirements.
By fine-tuning these models in-house on domain-specific data, teams can achieve high accuracy without sending any data to third-party services. However, open models do come with trade-offs. The biggest of these, Dunsworth says, "is understanding that the speed is going to be a lot slower. Even with my lab having 24 gigs of RAM, or my office lab having 32 gigs of RAM, they are still noticeably slower than if I'm using an off-site LLM to do similar tasks. So, you have to model your trade-off, because I have to also look at what kind of data I'm using—so that I'm not putting protected health information or criminal justice information out into an area where it doesn't belong and where it could be used for other purposes. So, the on-premises local models are more appealing for me because I can do more with them—I don't have the same concern about the data going out of the networks."

That's where techniques like quantization and altering the model architecture come in – effectively scaling down the model to meet your hardware where it is:

Quantization: Dunsworth defines quantization as "a way to optimize an LLM by making it work more efficiently with fewer resources—less memory consumption. It doesn't leverage all of the parameters at once, so it's able to package things a little bit better. It packages your requests and the tokens a little more efficiently so that the model can work a little faster and return your responses—or return your data—a little quicker to you, so that you can be more interactive with it." By reducing model weight precision (e.g. from 16-bit to 4-bit), quantization can shrink memory footprint dramatically and speed up inference with minimal impact on accuracy. For example, a 70B model quantized to 4-bit effectively behaves like a ~17B model in memory terms, often retaining ~95% of its performance. Combined with efficient runtimes (like GGML for CPU inference and GPU kernels optimized for int4/int8 arithmetic), quantization lets even a single GPU PC host models that previously needed a whole cluster.

Altering the model architecture for efficiency: The Mixture-of-Experts (MoE) design in Mixtral increases parameter count (for capacity) but only activates a subset of experts per token, so you don't pay the full compute cost every time. This architecture is a natural fit when you need bursts of capability without constant heavy throughput – much like emergency systems that must handle occasional complex queries quickly, but don't see GPT-scale volumes continuously. The result: big-model performance on small-model infrastructure.
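To make the quantization idea above concrete, here is a minimal sketch in Go (matching the other listings in this newsletter; real quantization toolchains such as GGML work per-block and down to 4-bit precision, and the weight values here are made up):

// Minimal sketch: symmetric int8 quantization maps float32 weights into
// [-127, 127] with one scale factor, cutting memory per weight from 4 bytes
// to 1 at a small reconstruction cost.
package main

import (
	"fmt"
	"math"
)

func quantize(weights []float32) (q []int8, scale float32) {
	var maxAbs float32
	for _, w := range weights {
		if a := float32(math.Abs(float64(w))); a > maxAbs {
			maxAbs = a
		}
	}
	if maxAbs == 0 {
		maxAbs = 1 // avoid a zero scale for an all-zero tensor
	}
	scale = maxAbs / 127 // one scale for the whole tensor (per-block in real runtimes)
	q = make([]int8, len(weights))
	for i, w := range weights {
		q[i] = int8(math.Round(float64(w / scale)))
	}
	return q, scale
}

func dequantize(q []int8, scale float32) []float32 {
	out := make([]float32, len(q))
	for i, v := range q {
		out[i] = float32(v) * scale
	}
	return out
}

func main() {
	weights := []float32{0.12, -0.98, 0.33, 0.0051, -0.44}
	q, scale := quantize(weights)
	fmt.Println("int8 weights:", q, "scale:", scale)
	fmt.Println("reconstructed:", dequantize(q, scale)) // close to, not identical to, the originals
}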
Dunsworth's field lab architecture offers a practical view into how these techniques are actually used. "I've been doing more of the work with lightweight or smaller LLM models because they're easier to get ramped up with," he says—emphasizing that local deployments reduce risk of data exposure while enabling fast iteration. But even with decent hardware (24–32 GB RAM), resource contention remains a bottleneck: "The biggest challenge is resource base… I'm pushing the model hard for something, and at the same time I'm pushing the resources… very hard—it gets frustrating."

That frustration led him to explore quantization hands-on, particularly for inference responsiveness. "I've got to make my work more responsive to my users—or it's not worth it." Quantization, local hosting, and iterative fine-tuning become less about efficiency for its own sake, and more about achieving practical performance under constraints—especially when "inexpensive" also has to mean maintainable.

In practice, deploying a lean model in a mission-critical setting also demands robust inference software. Projects like vLLM have emerged to maximize throughput on a given GPU by intelligently batching and streaming requests. vLLM's architecture can yield 24× higher throughput than naive implementations by scheduling token generation across multiple requests in parallel.

Synthetic Data Pipelines: Fidelity with Privacy

Data is the fuel for AI models, but in public safety and healthcare, real data is often sensitive or scarce. This is where synthetic data pipelines have become game-changers, allowing teams to generate realistic, statistically faithful data that mimics real-world patterns without exposing real personal information. By using generative models or simulations to create synthetic call logs, incident reports, sensor readings, etc., engineers can vastly expand their training and testing datasets while staying privacy-compliant.

Dunsworth, who builds AI infrastructure for emergency services, describes this approach. Rather than rely on real 911 call logs, Dunsworth reconstructs patterns from operational data to generate synthetic equivalents. "I take it apart and find the things I need to see in it… so when I make that dataset, it reflects those ratios properly," he explains. This includes recreating distributions across service types—e.g. police, fire, medical—and reproducing key statistical features like call arrival intervals, elapsed event times, or geospatial distribution. "For me, it's a lot of statistical recreation… I can feed that into an AI model and say, 'OK, I need you to examine this.'"

Dunsworth's pipeline is entirely Python-based and open source. He uses local LLMs to iteratively refine the generated datasets: "I build a lot of it, and then I pass it off to my local models to refine what I'm working on." That includes teaching the model to correct for misleading assumptions—such as when synthetic time intervals defaulted to normal distributions, even though real data followed Poisson or gamma curves. He writes scripts to analyze and feed the correct distributions back into generation: "Then it tells me, 'Here's the distribution, here are its details.' And I feed that back into the model and say, 'OK, make sure you're using this distribution with these parameters.'"
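Dunsworth's own pipeline is Python-based; purely as an illustration of the distribution-fitting idea (in Go, to match the rest of this newsletter's listings, with a made-up call rate), generating synthetic call timestamps from a fitted Poisson arrival process can be sketched like this:

// Minimal sketch: once a Poisson arrival process has been fitted to real call
// logs, synthetic call timestamps follow from sampling exponential
// inter-arrival gaps with the fitted rate.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

func main() {
	const callsPerHour = 42.0 // hypothetical fitted rate parameter (lambda)
	rng := rand.New(rand.NewSource(1))

	t := time.Date(2025, 8, 1, 0, 0, 0, 0, time.UTC)
	for i := 0; i < 5; i++ {
		// Exponential inter-arrival gap with mean 1/lambda hours.
		gapHours := rng.ExpFloat64() / callsPerHour
		t = t.Add(time.Duration(gapHours * float64(time.Hour)))
		fmt.Println("synthetic call at", t.Format(time.RFC3339))
	}
}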
Adoption Grows with Education and Precision

Early adoption wasn’t smooth, Dunsworth says. “The biggest hurdle was pushback from peers at first,” he noted. But that changed as datasets improved in realism, and the utility of using synthetic data for demos, teaching, or sandbox testing became obvious. “Now people are more interested… I keep it under an open-source license. Just give me the improvements back—that’s the last rule.”

A crucial distinction is that synthetic ≠ anonymized. Rather than redact real identities, Dunsworth starts from a clean slate, using only statistical patterns from real data as seed material. He avoids copying event narratives and even manually inspects Faker-generated names to ensure no accidental leakage: “I don’t reproduce narratives… I go through my own list of people I know to make sure that name doesn’t show up.”

He also aligns his work with formal ethical frameworks. “I was very fortunate throughout my education—through my software engineering courses, my analytics and data science courses at university—that ethics was stressed as one of the most important things we needed to focus on alongside practice. So, I have very solid ethical programming training.” Dunsworth also reviews frameworks like the NIST AI RMF to maintain development guardrails.

These practices map directly onto any domain where real data is hard to access—medical records, financial logs, customer transcripts, or operational telemetry. The principles are universal:

Reconstruct statistical structure from clean seeds
Validate outputs against known metrics
Stress test systematically, not opportunistically
Never copy real content—synthesize structure, not substance
Build ethical discipline into your generation workflow

For teams building AI tools without access to real production data, this is a practical playbook. You don’t need a GPU farm or proprietary toolchain. You need controlled pipelines, structured validation, and a robust sense of responsibility. As Dunsworth says: “I feel confident that the people I work with… are all operating from the same place: protecting as much information as we can… making sure we're not exposing anything that we can’t.”
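The “never copy real content” principle can be partly enforced in code as well as by manual review. Below is a minimal sketch in the spirit of Dunsworth’s name checks; it assumes the Faker library and a hypothetical blocklist file of real names that must never appear in generated records.

```python
from faker import Faker

# Hypothetical blocklist: real names that must never appear in synthetic data.
with open("known_real_names.txt") as f:
    blocklist = {line.strip().lower() for line in f if line.strip()}

fake = Faker()

def safe_name() -> str:
    """Draw Faker names until one clears the blocklist."""
    while True:
        candidate = fake.name()
        if candidate.lower() not in blocklist:
            return candidate

# Illustrative synthetic records built only from generated fields.
records = [{"caller": safe_name(), "service": "fire"} for _ in range(100)]
```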
Stress Before Success: Risk Management and Resilience Engineering

Building AI systems for constrained environments isn’t only about latency, memory, or cost. It’s also about failure and how to survive it. Dunsworth’s work in emergency response illustrates the stakes clearly, but his framework for risk mitigation is widely transferable: define the use case tightly, control where the data flows, and validate under load—not just in ideal cases. “One of the biggest risk mitigations is starting out from the beginning—knowing what you want to use AI for and how you define how it’s working well.”

Instead of treating vendor-provided models as turnkey solutions, Dunsworth interrogates the entire data path—from ingestion through inference to retention. That includes third-party dependencies: “What data am I feeding, and how do I work with that vendor to make sure the data is being used the way I intend?” In sensitive environments, he keeps training in-house: “That way… it doesn't leave my organization.”

Success is measured operationally: “If you're using it (AI) just to say, ‘Well, we're using AI,’ I'm going to be the first one to raise my hand and say, ‘Stop.’” Instead, AI is validated through concrete outcomes: “It’s enabled our QA manager to process more calls… improving our ability to service our community.”

Stress It Twice, Then Ship

For AI systems that might break under pressure, Dunsworth prescribes a straightforward and brutal regimen: “Get synthetic data together to test it (the model)—and then just, in the middle of your testing lab, hit it all at once. Hit it with everything you've got, all at the same time.” Only if the system remains responsive under full overload does it move forward. “If it continues to perform well, then you have some confidence… it’s still going to be reliable enough for you to continue to operate.” Failure is expected, but it must be observable and recoverable: “Even if it breaks… we know it can still recover and come back to service quickly.”

One real-world implementation of this mindset is LogiDebrief, a QA automation system deployed in the Metro Nashville Department of Emergency Communications. Developed to audit 9-1-1 calls in real time, LogiDebrief formalizes emergency protocol as logical rules and then uses an LLM to interpret unstructured audio transcripts, match them against those rules, and flag any deviations. As Chen et al. explain: “The framework formalizes call-taking requirements as logical specifications, enabling systematic assessment of 9-1-1 calls against procedural guidelines”.

In practice, it executes a three-stage pipeline:

Context extraction (incident type, responder actions),
Formal rule evaluation using Signal Temporal Logic,
Deviation reporting for any missed steps.

This enables automated QA for both AI and human decisions—a form of embedded auditing that surfaces failure as it happens. In deployment, LogiDebrief reviewed 1,701 calls and saved over 311 hours of manual evaluator time. More importantly, when something procedural is missed—like a mandatory question for a specific incident type—it gets flagged and can be corrected in downstream training, improving both model performance and human compliance.

From Monoliths to Micro-Solutions

When one early AI analytics platform failed under edge-case data—“it just said, ‘I got nothing’”—Dunsworth scrapped the codebase entirely. Why? The workflow made sense to him, but not to his users. “I assumed I could develop an analytics flow that would work for everybody… it worked well for me, but it didn’t work well for my target audience.”

This led to a major design shift. Instead of building one global solution, he pivoted to “micro-solutions that will do different things inside the same framework.” This insight should be familiar to any engineer who’s seen a service fail not because it was wrong, but because no one could use it. “If they’re not going to use it—it doesn’t work.”

Anticipating the Next Frontiers

Looking forward, Dunsworth is focused on redirecting complexity, not increasing it. One focus area is offloading non-emergency calls to AI assistants: “It really is a community win-win, because now we can get those services out faster.” Another is multilingual responsiveness.
In cities where services span four or more languages, Dunsworth sees multilingual AI as a matter of equity and latency: “If we can improve the quality and speed of translation… (that can save a life).”

Takeaways for Engineers Weighing AI Adoption in Critical Systems

To wrap up, here are some key risk mitigation strategies – from technical safeguards to policy measures – that can enable engineers and organizations to confidently adopt AI in sensitive environments:

Model Compression (Quantization & Pruning): We’ve discussed quantization as a way to make models smaller and faster. This not only enables using cheaper hardware, but also reduces power consumption (important for mobile or field deployments, for example) and even attack surface (smaller models are slightly easier to analyze for vulnerabilities). Pruning (removing redundant weights) is another technique to shrink models. The overall effect is a lean model less likely to overload your systems.

Encryption and Secure Execution: In high-trust domains, data encryption is mandatory not just at rest but in transit – and increasingly during computation. Self-hosting an LLM doesn’t automatically guarantee security; teams must ensure all connections are encrypted (HTTPS/TLS) so that input/output data can’t be intercepted. Tools like Caddy (a web server with automatic TLS) are often used as front-ends to internal AI APIs to enforce this. Moreover, techniques like homomorphic encryption and secure enclaves (Intel SGX, etc.) are emerging so that even if someone got hold of the model runtime, they couldn’t extract sensitive data. While these techniques can be computationally expensive, they’re improving.

Robust Vendor Governance: If using any third-party models or services, public sector teams impose strict governance – similar to vetting a physical supplier. Open-source models don’t come from a vendor per se, but they still warrant a security review (has the model or its code been audited? is there a risk of embedded trojans?). It is also important to focus on what vendors bring: requiring transparency about model training data (to avoid hidden biases or privacy violations), demanding uptime SLAs if it’s a cloud API, and ensuring models meet regulatory standards.

In-House Fine-Tuning & Monitoring: Rather than rely on a vendor’s generic model, high-constraint deployments should favor owning the last-mile training of the model. By fine-tuning open models on local data, organizations not only boost performance for their specific tasks, they also retain full control of the model’s behavior. This makes it easier to mitigate bias or inappropriate behavior – if the model says something it shouldn’t, you can adjust the training data or add safety filters and retrain. Continuous monitoring is part of this loop: logs of the AI’s outputs should be reviewed (often with tools like LogiDebrief or simple dashboards) to catch any drift or errors. Essentially, the AI should be treated as a critical piece of infrastructure that gets constant telemetry and maintenance, not “set and forget” software. This reduces the risk of unseen failure modes accumulating over time.

Fallback and Redundancy: Finally, a practical strategy – always have a Plan B. In emergency systems, if the AI fails or is uncertain, it should gracefully hand off to a human or a simpler rule-based system. While this isn’t unique to AI (classic high-availability design), it’s worth noting that large AI models can fail in novel ways (e.g. getting stuck in a hallucination loop). Having a watchdog process that can kill and restart an AI service if it behaves oddly is a form of automated risk mitigation too; a minimal sketch follows this list.
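As a rough illustration of that last point, here is a minimal watchdog loop. Everything in it is an assumption for the sketch: the service is launched as a local subprocess, and “healthy” simply means a hypothetical HTTP health endpoint answers in time.

```python
import subprocess
import time
import urllib.request

SERVICE_CMD = ["python", "inference_server.py"]   # hypothetical service command
HEALTH_URL = "http://127.0.0.1:8000/health"       # hypothetical health endpoint

def healthy(timeout: float = 5.0) -> bool:
    """The service counts as healthy if the endpoint answers 200 in time."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

proc = subprocess.Popen(SERVICE_CMD)
while True:
    time.sleep(10)
    if proc.poll() is not None or not healthy():
        # Kill a crashed or hung process and bring the service back quickly.
        proc.kill()
        proc.wait()
        proc = subprocess.Popen(SERVICE_CMD)
```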
Each of these strategies – from squeezing models to encrypting everything, from vetting vendors to fine-tuning internally – contributes to an overall posture of trust through transparency and control. They turn the unpredictable black box into something that engineers and auditors can reason about and rely on. Dunsworth repeatedly comes back to the theme of discipline in engineering choices. Public safety and other critical systems can’t afford guesswork. By enforcing these risk mitigations, engineers can build systems that move fast without breaking things beyond rapid recovery.

🛠️ Tool of the Week

BlueSky Statistics – A GUI-Driven Analytics Platform for R Users

BlueSky Statistics is a desktop-based, open-source analytics platform designed to make R more accessible to non-programmers—offering point-and-click simplicity without sacrificing statistical power. It supports data management, traditional and modern machine learning, advanced statistics, and quality engineering workflows, all through a rich graphical interface.

Highlights:

Drag-and-Drop Data Science for R: BlueSky lets users load, browse, edit, and analyze datasets through interactive data grids—no scripting required.
Modeling, Machine Learning & Deep Learning: BlueSky supports over 50 modeling algorithms, including decision trees, SVMs, KNN, logistic regression, and ANN/CNN/RNN.
Full Statistical Suite + DoE + Survival Analysis: The platform includes descriptive and inferential statistics, survival models (Kaplan-Meier, Cox), and advanced modules for longitudinal analysis and power studies.
Quality, Process, and Six Sigma Tools: Tailored for manufacturing and process improvement, BlueSky integrates tools aligned with the DMAIC cycle: Pareto and fishbone diagrams, SPC control charts, Gage R&R, process capability analysis, and equivalence testing.
Integrated R IDE for Programmers: For technical users, BlueSky offers a built-in R IDE to write, import, execute, and debug R scripts—bridging GUI simplicity with code-based extensibility.

Learn more about BlueSky Statistics

📎 Tech Briefs

A New Perspective On AI Safety Through Control Theory Methodologies | Ullrich et al. | IEEE: Proposes a novel approach to AI safety using principles from control theory—specifically “data control”—to provide a top-down, system-theoretic framework for analyzing and assuring the safety of AI systems in real-world, safety-critical environments.

Can We Make Machine Learning Safe for Safety-Critical Systems? | Dr. Thomas G. Dietterich | Distinguished Professor Emeritus, Oregon State University: Outlines a comprehensive framework for integrating machine learning into safety-critical systems by combining risk-driven data collection, formal verification, and continuous anomaly and near-miss detection.
AI Safety vs. AI Security: Demystifying the Distinction and Boundaries | Lin et al. | The Ohio State University: Establishes clear conceptual and technical boundaries between AI Safety (unintentional harm prevention) and AI Security (defense against intentional threats), arguing that precise definitions are essential for effective research, governance, and trustworthy system design—especially as misuse increasingly straddles both domains.

Making certifiable AI a reality for critical systems: SAFEXPLAIN core demo | Barcelona Supercomputing Center (BSC): Introduces the SafeExplain platform, which offers a structured safety lifecycle and modular architecture for AI-based cyber-physical systems, integrating explainable AI, functional safety patterns, and runtime monitoring.

That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next. Do take a moment to fill out the short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.

We’ll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.
Deep Engineering Specials: Vibe Coding—Promise, Pressure, and Practical Limits

Divya Anne Selvaraj
04 Aug 2025
Coding by prompt is still a desired dream—appealing, but not yet reliableSpecialsVibe Coding—Promise, Pressure, and Practical LimitsWhat recent research tells us about vibe coding: where it accelerates, where it breaks, and how to adopt it without undermining engineering disciplineLive Virtual Workshop: Securing Vibe CodingJoin Snyk's Staff Developer Advocate Sonya Moisset on August 28th at 11:00AM ET to learn:✓ How Vibe Coding is reshaping development and the risks that come with it✓ How Snyk secures your AI-powered SDLC from code to deployment✓ Strategies to secure AI-generated code at scaleEarn 1 CPE Credit!Register today!Hi Welcome to this special issue of Deep Engineering.With this issue we go beyond the hype of vibe coding. Drawing on first-party research from Microsoft, Google, IFS, and independent academics, we examine where this paradigm helps, where it breaks, and what it asks of software teams if it scales. For architects, leads, and developers navigating a shifting toolchain, this piece aims to provide some coordinates: empirical findings, adoption thresholds, and governance strategies.Coming soon...Launching today (Monday, August 4) at 11:00 AM PT and ending on August 25 11:00 AM PT.Master the ultimate high-performance, general-purpose programming language with our C++ lessons bundle from the experts at Packt. Learn from top-rated books such as C++ Memory Management,C++ in Embedded Systems,Asynchronous Programming with C++, and more. Elevate your C++ skills and help support The Global FoodBanking Network with your purchase!Save the Link (Goes live at 11:00 a.m. PT today)Sign Up |AdvertiseTo Vibe or Not to Vibe: That is the QuestionA research-based examination of vibe coding’s promises, pitfalls, and what it means for the future of software teams.According to Stack Overflow’s 2025 Developer Survey, nearly 72% of developers said *“vibe coding” – defined as generating entire applications from prompts – is not part of their workflow, with an additional 5% emphatically rejecting it as ever becoming a part of their workflow.Empirical research and position papers published this year provide some more context.Sarkar, A., (University of Cambridge and University College London) and Drosos, I., (Microsoft) conducted an observational study (June, 2025) with 12 professional developers from Microsoft, all experienced in programming and familiar with tools like GitHub Copilot. Participants used a conversational LLM-based coding interface to complete programming tasks, with the researchers analyzing session transcripts, interaction logs, and follow-up interviews to identify usage patterns and cognitive strategies. They found that while participants reported efficiency gains for familiar or boilerplate tasks, particularly when generating or modifying standard patterns, these benefits diminished for more complex assignments.Debugging AI-generated code remained a major friction point, often requiring developers to mentally reverse-engineer the logic or manually rewrite portions of the output. Importantly, users expressed consistent uncertainty about the correctness and reliability of generated code, underscoring that trust in the AI remained limited.Gadde, A., (May, 2025), in their literature review based paper, positions vibe coding as the next evolution in AI-assisted software development, arguing that it significantly lowers barriers to entry by enabling users to generate working software from natural language prompts. 
Gadde characterizes vibe coding as a practical middle ground between low-code platforms and agentic AI systems, combining human intent expression with generative code synthesis. Unlike traditional development workflows, Gadde claims vibe coding empowers users—even those without formal programming experience—to act as high-level specifiers, while generative models handle much of the underlying implementation.

Sapkota, R., et al. (2025) conducted a structured literature review and conceptual comparison of two emerging AI-assisted programming paradigms: vibe coding and agentic coding. The paper defines vibe coding as an intent-driven, prompt-based programming style in which humans interact with an LLM through conversational instructions, iteratively refining output. By contrast, agentic coding involves AI agents that autonomously plan, code, execute, and adapt with minimal human input. The authors argue that these paradigms represent distinct axes in AI-assisted development—one human-guided and interactive, the other goal-oriented and autonomous.

They propose a comparative taxonomy based on ten dimensions, including autonomy, interactivity, task granularity, execution environment, and user expertise required. They claim that vibe coding excels in creative, exploratory, and early-stage prototyping contexts, while agentic coding shows promise in automating repetitive, well-scoped engineering tasks. However, both approaches face common challenges, including error handling, debugging, quality assurance, and system integration. The authors conclude that hybrid systems combining the strengths of vibe coding and agentic coding—conversational guidance with agentic automation—may be the most practical path forward.

Stephane H. Maes, CTO and CPO at IFS & ESSEM Research, in their literature review and enterprise experience-based position paper (April 2025), states that code written through vibe coding often lacks documentation, architectural coherence, and design rationale. Without rigorous standards and tooling for verification, maintainability, and lifecycle control, the adoption of AI-generated code introduces operational risks. Maes proposes that successful adoption of vibe coding in production environments requires not just technical integration but structured governance—workflows, tooling, and cultural norms that enforce accountability, traceability, and testability. The core thesis is that “real coding is support and maintenance,” and vibe coding, in its current form, largely sidesteps these responsibilities.

And yet, despite these limitations and a negative developer experience, vibe coding remains very much a part of the conversation. Why? Not because it works at scale today, but because it gestures toward a future where programming feels more like intent-driven design than manual construction. It flatters a seductive idea: that software can be summoned by describing it, rather than engineered line by line.
Engineers don’t just build systems for today, they chart trajectories. And so, with today’s special feature, we aim to:

Identify where vibe coding works today (early-stage prototypes, educational contexts, speculative design),
Understand why it falls short elsewhere (debugging, integration, maintainability),
Anticipate the organizational and skill implications, so you can lead with context when the tooling matures.

Where and How Vibe Coding Helps

Vibe coding works best when the goal is to explore, not to ship; to experiment, not to scale. In these scenarios, its limitations are tolerable, and its productivity gains are real. Contexts where vibe coding is most effective:

Rapid prototyping and ideation: The AI-assisted conversational workflow drastically accelerates early development. What once took weeks can often be scaffolded in hours. Solo developers, according to Ardor Labs, report building functional prototypes—from simple web apps to plugin systems—by iteratively prompting an LLM, adjusting results, and redeploying within a single day.

Startups and hackathons: Early-stage teams exploit vibe coding to punch above their weight. Y Combinator managing partner Jared Friedman has said that “a quarter of the W25 startup batch have 95% of their codebases generated by AI.” In this context, code maintainability is a secondary concern; speed to demo or MVP is paramount.

Exploratory use by professionals: Developers may use vibe coding for spinning up proof-of-concepts or exploring unfamiliar frameworks, even if they ultimately rewrite the code manually. AI researcher Andrej Karpathy (the originator of the term vibe coding) has himself described this as ideal for “weekend projects” or “rapid ideation” scenarios.

One-click deployment pipelines: Google’s guide notes that coupling vibe coding with integrated cloud deployment creates “the fastest path from concept to a live, shareable application,” especially when platforms like Replit or Google Cloud streamline backend provisioning.

Lowering the barrier to entry: Because it uses natural language, vibe coding attracts those with minimal programming background. Google highlights that it makes “app building more accessible,” while Gadde frames it as the next phase in no-code evolution—enabling domain experts to act as high-level specifiers without writing syntax-bound code.

Educational and learning contexts: Sapkota et al. note that vibe coding performs well in educational and exploratory settings, particularly when the emphasis is on learning through experimentation rather than delivering production-ready systems. Students can engage in prompt-driven debugging or request scaffolded solutions to better understand programming constructs.

For all its speed and surface-level convenience, vibe coding introduces architectural liabilities that make experienced developers cautious—if not outright resistant—to using it beyond disposable or exploratory projects.

Limitations: Maintainability, Debugging, and Technical Debt

The issues with vibe coding are not limited to code that fails to run; they extend to code that fails to last. Vibe coding shortcuts implementation, but often bypasses the rigor, clarity, and accountability that production-grade systems require. Why vibe-coded software tends to erode under pressure:
Poor structural hygiene: AI-generated code often lacks internal consistency and coherent design. As Ardor Labs reports, repetitive prompting typically results in a patchwork of quick fixes, duplicated logic, and workarounds that accumulate into technical debt.

Invisible complexity: Maes notes that repeated AI-driven edits can produce systems even their authors no longer understand. Without documentation or rationale, the code becomes opaque—even to its original creator.

Debugging burdens: Because developers often see AI-generated code only after an error appears, root cause analysis becomes guesswork. IBM’s overview highlights the lack of clear architectural structure, making it harder to trace failures through unfamiliar logic paths.

Prompting is not a substitute for engineering judgment: While it's tempting to patch issues by prompting another fix, this iterative loop can obscure responsibility and create brittle dependencies. As some developers now observe, “using one AI to debug another” may sound clever but is often insufficient without human involvement.

Production pitfalls: performance, scale, and security

Scalability bottlenecks: Sarkar & Drosos observed that developers often had to switch from vibe coding to manual optimization as application complexity increased. AI-generated prototypes may appear functional but suffer from poor resource usage and brittle error handling when scaled.

Security vulnerabilities: A 2021 NYU cybersecurity study found that around 40% of GitHub Copilot’s generated code contained exploitable flaws, from SQL injection risks to use of deprecated libraries. These same vulnerabilities can silently propagate in vibe-coded applications, especially when users copy output without review.

False confidence: Vibe coding’s conversational interface can lull developers—particularly those with limited experience—into accepting functional output as production-ready. As Ardor Labs warns, this “move fast” approach may ship apps that run but cannot be maintained, audited, or secured.

Neglected lifecycle thinking: Maes (2025) captures this gap directly: “coding can be done with ‘no code’” via AI, “but such code is not maintainable”—a critical failure if the system is expected to evolve beyond a demo.

For all its promise, vibe coding comes with serious “gotchas” that make seasoned engineers hesitant to use it in production. Yet, judging by the attention the paradigm continues to attract, it is still very much something developers and enterprises are not ready to give up on.

Workforce and Organizational Implications

The rise of vibe coding raises important questions about software engineering roles, required skills, and how organizations should adapt. Who stands to benefit the most, and whose work might be displaced or transformed?

Democratization vs. De-skilling

Vibe coding lowers the barriers to entry. Non-developers and junior developers can now build software that once required full-stack expertise. A solo entrepreneur, equipped only with a vision and the right AI tools, can ship a working prototype. In this framing, the AI serves as a kind of expert consultant, accelerating iteration and enabling domain specialists to turn ideas into software without hiring a team. This democratization of software creation is one of vibe coding’s most widely advertised benefits.

But this accessibility comes with a paradox. Heavy reliance on AI for everyday coding tasks can cause skills to atrophy. Ray, P.
(May 2025) identifies this as a core concern: if developers grow accustomed to prompting and accepting output without deep understanding, they risk losing the foundational skills required to validate, debug, and maintain that software.The illusion of productivity can further obscure the issue. A 2025 METR study, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” found that “When developers are allowed to use AI tools, they take 19% longer to complete issues—a significant slowdown that goes against developer beliefs and expert forecasts. This gap between perception and reality is striking: developers expected AI to speed them up by 24%, and even after experiencing the slowdown, they still believed AI had sped them up by 20%.”Without strong engineering judgment, the value of AI assistance can quickly become negative.Who Benefits—and Who Might Be Left Behind?In its current form, vibe coding offers the greatest leverage to small, agile teams and individuals operating under time constraints. For early-stage startups, the appeal is obvious: speed to prototype, speed to market. For these teams, robustness is a secondary concern—shipping something that works, even partially, is often enough to secure feedback, funding, or traction. Similarly, larger organizations may use vibe coding to prototype features quickly without committing senior developer time, particularly in product discovery phases.By contrast, engineers at companies with established production systems remain cautious. The architectural demands of long-lived systems, along with maintainability and security concerns, make “pure” vibe coding untenable. Google’s guidance distinguishes between two modes: an “experimental” vibe coding mode suited to rapid ideation, and a “disciplined” mode in which the AI acts as a subordinate pair-programmer, with the human remaining accountable for quality.This bifurcation in usage reflects a broader split in how developers perceive AI's impact on the profession. According to the 2025 Stack Overflow Developer Survey, 64% of respondents said they do not view AI tools—including coding assistants—as a threat to their employment. Instead, many see these tools as a way to offload repetitive work and focus on higher-order engineering problems. However, that figure has dropped from 68% the previous year, indicating a subtle but real shift: developers increasingly recognize that roles are evolving, and that staying competitive will require new skills.The differentiator is not whether one uses AI, but how. Engineers who add prompt engineering, AI supervision, and LLM-aware debugging to their toolset will likely outperform those who default to traditional workflows for all tasks. Conversely, those who resist this shift entirely may find themselves outpaced—not by the AI, but by peers who know how to manage it effectively.Leadership Response: Strategic Adoption with GuardrailsFor CTOs, software architects, and engineering leads, the responsible response to vibe coding is neither rejection nor blind adoption, but strategic containment. Its introduction should be scoped to workflows where quality risk is minimal and speed adds clear value—such as internal prototypes, automated test generation, or scaffolding of non-critical features that engineers can later refactor. Governance is essential. Maes proposes structured frameworks like VIBE4M, which emphasize verification, maintainability, and monitoring as prerequisites for accepting AI-generated code into supported systems. 
Even in the absence of formal frameworks, the principle holds: all AI contributions must undergo human review. Review checklists may need to explicitly flag AI-authored code for scrutiny, and CI pipelines should incorporate tools like Snyk or ESLint with AI-focused rules to catch common faults. These checks inevitably introduce friction—but they are precisely what distinguish engineering from experimentation. As Maes notes, rigorous validation “goes against the trend [of] AI makes developers more productive” in the short term, but is non-negotiable for sustainable practice.Equally critical is the cultural framing of vibe coding within teams. Leaders should position it not as a shortcut, but as a collaboration—one that still demands comprehension, accountability, and domain judgment. Encouraging developers to re-express or review AI-generated solutions—whether to a colleague or back to the model—can ensure they understand the logic they are deploying. This guards against blind acceptance and reinforces human agency. Forward-looking leaders will also recognize and reward the kinds of work AI cannot yet replicate: deep architectural reasoning, creative problem decomposition, and user empathy. These capabilities will define developer impact in a world where code generation is easy but understanding remains hard.When it comes to delivering reliable, maintainable systems at scale, the fundamentals of software engineering still apply. The organizations that will benefit most are those that blend the “vibes” with vigilance: embracing AI-driven development to speed up outcomes, while doubling down on human expertise in architecture, validation, and security to ensure those outcomes stand the test of time. In doing so, we can harness the promise of vibe coding – conversational and intuitive development – without losing the hard-won lessons of decades of engineering practice.🛠Vibe Coding in Practice: The Tooling LandscapeIn the pre-publication paper, "A Review on Vibe Coding: Fundamentals, State-of-the-art, Challenges and Future Directions," Ray, P., presents a qualitative, exploratory analysis of non-peer-reviewed sources such as product blogs, documentation, and public demos. The paper surveys a wide range of vibe coding tools—natural language-driven development environments—and maps them across an interaction spectrum (delegation to pairing) and a layered stack architecture extending from prompt interfaces to deployment infrastructure. It highlights the growing sophistication of both browser-native platforms and IDE-integrated agents. Here is a summary.Browser-native platforms feature prominently. Tools such as v0 by Vercel, Bolt.new, Create, and Lazy AI allow users to scaffold, preview, and deploy full-stack applications from prompt-based workflows. These platforms commonly embed frontend frameworks like Next.js and Tailwind, along with real-time CI/CD, auth, and database orchestration. Others—Trickle AI and Napkins.dev—generate UIs from screenshots or sketches, while HeyBoss, Softgen, and Rork focus on zero-config application builds with export to GitHub or direct deployment.IDE-integrated tools like Cursor, Cody, and Zed offer agent-assisted development with context-aware completions, semantic diffs, and local vector search. More advanced platforms such as Windsurf and Zencoder AI incorporate retrieval-augmented generation, multi-agent workflows, and enterprise readiness features. 
Some, including Cline and Trae AI, extend into terminal and plugin-based workflows, supporting Git integration, shell execution, and modular agent control. Finally, autonomous coding agents—notably Devin AI and All Hands AI—aim to handle entire software lifecycles: building, testing, debugging, and deploying with minimal human intervention.

Ray’s survey suggests that these tools do not converge on a single model or interface. Instead, they reflect a broader shift: from programming as manual construction to software as orchestrated dialogue between developer intent and agentic execution. Read the complete paper.

That’s all for today. Thank you for reading this special issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next. Just reply to this email to tell us what you think.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.
Deep Engineering #11: Quentin Colombet on Modular Codegen and the Future of LLVM’s Backend

Divya Anne Selvaraj
31 Jul 2025
LLVM’s backend is going modular#11Quentin Colombet on Modular Codegen and the Future of LLVM’s BackendHow LLVM’s modular backend improves code generation across targets—by breaking down instruction selection into testable, reusable passesGoLab 2025: Celebrating a Decade of Go InnovationGoLab 2025 promises a rich and diverse program crafted to elevate the skills and insights of every attendee, from aspiring Gophers to seasoned experts. The agenda features a comprehensive array of:>In-depth Workshops: Hands-on learning experiences for practical skill development.>Technical Talks: Presentations on the latest advancements, best practices, and challenges in Go development.>Lightning Talks: Quick, insightful interventions that spark new ideas and discussions.Use code PKT15SP for a 15% discount on all ticket types.Register nowHi Welcome to the eleventh issue of Deep EngineeringLLVM has long been celebrated for its modular frontend and optimizer. But for years, its backend—the part responsible for turning IR into machine code—remained monolithic, with instruction selectors like SelectionDAG and FastISel combining multiple responsibilities in a single, opaque pass. That’s now changing, as modular pipelines begin to reshape how LLVM handles instruction selection.This issue’s delves into GlobalISel, the instruction selection framework designed to replace SelectionDAG and FastISel with a more modular, testable, and maintainable architecture. Built around a pipeline of distinct passes—IR translation, legalization, register bank selection, and instruction selection—GlobalISel improves backend portability, supports new Instruction Set Architectures (ISAs) like RISC-V, and makes it easier to debug and extend LLVM across targets.To understand the design decisions behind GlobalISel—and the broader implications for backend engineering—we spoke with its architect, Quentin Colombet. A veteran LLVM contributor who joined Apple in 2012, Colombet has worked across CPU, GPU, and DSP backends and is also the code owner of LLVM’s register allocators. His perspective anchors our analysis of the trade-offs, debugging strategies, and real-world impact of modular code generation.We also include an excerpt from LLVM Code Generation (Packt, 2025), Colombet’s new book. The selected chapter introduces TableGen, LLVM’s domain-specific language for modeling instructions and backend logic—a central tool in GlobalISel's extensibility, despite its sharp edges.You can watch the complete interview and read the transcript here or scroll down to read the feature and book excerpt.Sign Up |AdvertiseFor Friday news roundups, trend insights, and quick takes between issues, follow Deep Engineering on LinkedIn.Deconstructing Codegen with Quentin ColombetHow LLVM’s Modular Backends Enable Portable, Maintainable OptimizationLLVM’s instruction selection was long dominated by SelectionDAG and FastISel, both monolithic frameworks that performed legalization, scheduling, and selection in a single pass per basic block. This design limited code reuse and optimization scope. GlobalISel was created to improve performance, granularity, and modularity. It operates on whole functions and uses Machine IR (MIR) directly, avoiding the need for a separate IR like SelectionDAG. This reduces overhead and improves compile times. 
While AArch64’s GlobalISel was initially slower than x86’s DAG selector at -O0, ongoing work has closed the gap; by LLVM 18, GlobalISel’s fast path was within 1.5× of FastISel.Perhaps more importantly, GlobalISel breaks down instruction selection into independent passes. Rather than one big conversion, it has a pipeline: IR translation, legalization of unsupported types, register bank selection, and actual instruction selection. Quentin Colombet, LLVM’s GlobalISel architect, explains that in SelectionDAG“all those steps happen in one monolithic pass…It’s a black box. But with GlobalISel, it’s a set of distinct optimization passes. Between those passes, you can insert your own target-specific or generic optimizations. That modularity gives you better flexibility, more opportunities for code reuse, and makes debugging and testing easier.”GlobalISel is designed as a toolkit of reusable components. Targets can share the common Core Pipeline and customize only what they need. Even the fast-O0 and optimized-O2 selectors now use the same pipeline structure, just configured differently. This is a big change from the past, where ports often had to duplicate logic across FastISel and SelectionDAG. The modular design not only avoids code duplication, it establishes clear debug boundaries between stages. If a bug or suboptimal codegen is observed after instruction selection, a backend engineer can pinpoint whether it originated in the legalization phase, the register banking phase, or elsewhere, by inspecting the MIR after each pass. LLVM’s infrastructure supports dumping the MIR at these boundaries, making it far easier to diagnose issues than untangling a single mega-pass. As Colombet quips,“Instruction selection actually involves multiple steps…From the start, [GlobalISel] has a much more modular design.”The benefit is that each phase (e.g. illegal operation handling) can be tested and understood in isolation.Portability for New Targets and ISAsA clear motivation for this overhaul is target portability. LLVM today must cater to a wide variety of architectures – not just x86 and ARM, but RISC-V (with its ever-expanding extensions), GPUs, DSPs, FPGAs, and more. A monolithic selector makes it hard to support radically different ISAs without accumulating lots of target-specific complexity. GlobalISel’s design, by contrast, forces a clean separation of concerns that parallels how one thinks about a new target. There are four major target hooks in GlobalISel, corresponding to the key decisions a backend must make:CallLowering – how to lower abstract calls and returns into the concrete calling convention (registers, stack slots) of the target.LegalizerInfo – what operations and types are natively supported by the target, and how to expand or break down those that aren’t. For example, if the target lacks a 64-bit multiply, the legalizer might specify to chop it into smaller multiplies or call a runtime helper.RegisterBankInfo – the register file characteristics, such as separate banks (e.g. general-purpose vs. floating-point registers) and the cost of moving data between banks.InstructionSelector – the final pattern matching that turns “generic” machine ops into actual target opcodes.Each of these components is relatively self-contained. When bringing LLVM to a new architecture, developers can implement and test them one by one. 
Colombet advises keeping the big picture in mind:“There’s no single right way to do instruction selection…because GlobalISel is modular, it’s easy to look at just one piece at a time. But if you’re not careful, those pieces may not fit together properly, or you may end up implementing functionality that doesn’t even make sense in the broader pipeline.”In practice, the recommended approach is to first ensure you can lower a simple function end-to-end (even if using slow or naive methods), then refine each stage knowing it fits into the whole. This incremental path is much more feasible with a pipelined design than it was with SelectionDAG’s all-or-nothing pattern matching.Real-world experience shows the value of this approach. RISC-V, for instance, has been rapidly adding standard and vendor-specific extensions. LLVM 20 and 21 have seen numerous RISC-V backend updates – from new bit-manipulation and crypto instructions to the ambitious V-vector extension. With GlobalISel, adding support for a new instruction set extension often means writing TableGen patterns or legality rules without touching the core algorithm. In early 2025, LLVM’s RISC-V backends even implemented vendor extensions like Xmipscmove and Xmipslsp for custom silicon.This kind of targeted enhancement – adding a handful of operations in one part of the pipeline – is exactly what the modular design enables. It’s telling that as soon as the core GlobalISel framework matured, targets like ARM64 and AMDGPU quickly adopted it for their O0 paths, and efforts are underway to make it the default at higher optimizations.New CPU architectures (for example, a prospective future CPU with unusual 128-bit scalar types) can be accommodated by plugging in a custom legalizer and reusing the rest of the pipeline. And non-traditional targets stand to gain as well. Apple’s own GPU architecture, which Colombet has worked on, was one early beneficiary of a GlobalISel-style approach – its unusual register and instruction structure could be cleanly modeled through custom RegisterBank and Legalizer logic, rather than fighting a general-purpose DAG matcher.The result is that LLVM’s backend is better positioned to embrace emerging ISAs. As Colombet noted,“The spec [for RISC-V] is still evolving, and people keep adding new extensions. As those extensions mature, they get added to the LLVM backend…If your processor supports a new, more efficient instruction, LLVM can now use it.”Another aspect of portability is code reuse across targets. GlobalISel makes it possible to write generic legalization rules – for example, how to lower a 24-bit integer multiply using 32-bit operations – once in a target-independent manner. Targets can then opt into those rules or override them with a more optimal target-specific sequence. In SelectionDAG, some of that was possible, but GlobalISel is designed with such flexibility in mind from the start. This pays off when supporting families of architectures (say, many ARM variants or entirely new ones) – one can leverage the existing passes instead of reinventing the wheel. Even the register allocator and instruction scheduling phases (which come after instruction selection) can benefit from more uniform input thanks to GlobalISel producing consistent results across targets.Easier Debugging and MaintenanceThe switch to a modular backend isn’t just about adding features – it also improves the day-to-day experience of compiler engineers maintaining and debugging the code generator. 
With the old monolithic pipeline, a failure in codegen (like an incorrect assembly sequence or a compiler crash) often required reverse-engineering the entire selection process. By contrast, GlobalISel’s structured passes and the use of MIR make it far more tractable. Engineers can inspect the MIR after each stage (translation, legalize, register assignment, etc.) using LLVM’s debugging flags, to see where things start to diverge from expectations. For instance, if an out-of-range immediate wasn’t properly handled, the issue will be visible right after the Legalizer pass – before it ever propagates to final assembly. This clear separation of concerns reduces the cognitive load in debugging.Colombet emphasizes testing and debugging as first-class considerations. He advocates using tools like llvm-extract and llvm-reduce to isolate the function or instruction that triggers a bug.“Instead of debugging hundreds or thousands of lines, you end up with 10 lines that still reproduce the problem. That’s a huge productivity win,” Colombet says of minimizing test cases.With GlobalISel, this strategy can be taken even further. Each pass in the pipeline can often be run on its own, enabling unit-test-like isolation. LLVM’s verifier checks invariants between passes, so errors tend to surface closer to their source.This modular design yields tangible benefits:Clearer failure boundaries: MIR can be inspected after each phase (translation, legalization, register assignment).Faster diagnosis: bugs can be isolated and reproduced at the level of a single pass.Built-in correctness checks: verifier routines catch many issues early.Reuse over reinvention: less hand-written C++, more declarative TableGen logic.TableGen, for its part, remains a double-edged sword. GlobalISel backends rely heavily on it to define matching rules, allowing reuse across targets. But the tooling is infamously brittle. As Colombet puts it:“TableGen is kind of the hated child in LLVM… The syntax alone doesn't tell you the semantics… what your code means depends on how it’s used in the backend generator. And the error messages are often vague or inconsistent… everyone in the LLVM community kind of dislikes TableGen.”Despite its flaws, TableGen is central to GlobalISel’s maintainability. It helps abstract instruction complexity into compact, reusable rules — a major win for modern ISAs.Backend stability is also reinforced by fuzzing. Tools like llvm-isel-fuzzer generate random IR to stress-test instruction selectors, uncovering obscure failures that user test cases might miss. Colombet highlights their importance, especially in contexts like GPU drivers:“In contexts like GPU drivers, a compiler crash could potentially be exploited, so hardening the backend against unexpected input is vital.”While fuzzing doesn’t improve performance, it ensures each GlobalISel pass handles unexpected inputs robustly. Over time, this approach, combining modularity, reproducibility, automation, and stress-testing, has made LLVM’s backend infrastructure more resilient and easier to evolve.Supporting New Hardware ParadigmsLLVM’s move toward a modular backend reflects two major broader architectural shifts in computing: the rise of heterogeneous computing, which LLVM addresses through MLIR; and the growing use of machine learning to guide compiler decisions, exemplified by projects like MLGO. 
Both reflect a broader trend toward modularity, data-driven optimization, and architectural flexibility in modern compilers.The Rise of Heterogeneous ComputingAs heterogeneous systems become standard, combining CPUs, GPUs, and specialized accelerators, compilers must generate efficient code across dissimilar targets, and optimize across their boundaries. LLVM’s response is Multi-Level Intermediate Representation (MLIR) which we covered in Deep Engineering #9, a flexible, extensible IR framework that sits above traditional LLVM IR and enables high-level, domain-specific optimizations before lowering to machine code.Colombet explains:“With MLIR, you can model both your CPU and GPU modules within the same IR. That opens up optimization opportunities across different targets… you could move computations between devices more easily or apply cost models to decide what should run where.”This enables compilers to consider cross-device trade-offs early in the pipeline — for example, determining whether a tensor operation should run on a GPU or CPU based on context or cost. MLIR achieves this via a layered, dialect-based design: each dialect captures a different level of abstraction (e.g., tensor algebra, affine loops, GPU kernels), which can be progressively lowered. Once it reaches LLVM IR, the standard code generation path, including GlobalISel, takes over.MLIR’s integration with GlobalISel brings key advantages:Targets like GPUs or DSPs can be supported by implementing GlobalISel hooks for custom codegen.MLIR transformations can assume the backend will honor those hooks, enabling consistent lowering.LLVM 20 improved backend metadata and attribute precision, allowing frontends like Swift and Rust to better express semantic constraints to the optimizer — particularly important in multi-language, multi-device builds.Although GlobalISel doesn’t directly manage CPU–GPU splitting, its modular design makes it easier to support unconventional targets cleanly, whether an Apple GPU or a DSP with custom arithmetic units. The combination of MLIR’s flexible front-end IR and GlobalISel’s extensible backend forms a coherent pipeline for future hardware.Growing Use of Machine Learning to Guide Compiler DecisionsA second major shift, still largely experimental — is the integration of machine learning inside the compiler itself. Research tools like Machine Learning Guided Optimization (MLGO) have shown promising results in replacing fixed heuristics with learned policies. In 2021, Trofin et al. used reinforcement learning to drive LLVM’s inliner, achieving ~5% code size reductions at -Oz with only ~1% additional compile time. The same framework was applied to register allocation, learning spill strategies that occasionally outperformed the default greedy allocator.Colombet sees real potential here:“Compilers are full of heuristics, and machine learning is great at discovering heuristics we never would’ve thought of.”But he’s also clear about the practical challenges. First is the problem of feature extraction — the task of encoding program state into meaningful inputs for a model:“To use an analogy: could you price a house just by counting the number of windows? There’s probably some correlation, but it’s not enough. Similarly, in something like register allocation, the features you use to train your model may not carry enough information.”Even with good features, integration into the backend is nontrivial. 
LLVM’s register allocator and GlobalISel weren’t built with explicit “decision points” for ML models to hook into.“If all you can do is tweak some knobs from the outside, you may not be able to make meaningful improvements… do we need to write our own instruction selector or register allocator to take full advantage of machine learning? I think the answer is yes – but we’ll see.”The implication is that further modularization may be needed — isolating backend subproblems (like spill code insertion or instruction choice) into well-defined, pluggable interfaces. This would allow learned components to replace or guide specific decisions without requiring wholesale rewrites. Such a hybrid model — rule-based infrastructure augmented by ML at critical junctures — aligns with the trajectory GlobalISel already began: decoupling backend logic into testable, replaceable units.Whether through MLIR’s IR layering or MLGO’s data-driven policies, the common trend is clear: LLVM’s backend is evolving toward composability, configurability, and adaptability by refactoring it into pieces that are easier to understand, reuse, and eventually learn. By decomposing code generation into well-defined passes, LLVM has made it easier to support new ISAs such as RISC-V, extend to targets like GPUs and DSPs, and integrate with tools like MLIR. The transition is still ongoing, and trade-offs remain—compile-time costs, tooling gaps, and the complexity of mixing TableGen with C++—but the payoff is clear: a backend that is more debuggable, more maintainable, and better prepared for architectural change. As machine learning and domain-specific IRs reshape the frontend, GlobalISel ensures that the backend can evolve in parallel. It is not just a rewrite; it is infrastructure for the next era of compilers.If the architectural case for modular code generation in LLVM caught your attention, Quentin Colombet’s book, LLVM Code Generation offers the definitive deep dive. Colombet, the architect behind GlobalISel, takes readers inside the backend machinery of LLVM—from instruction selection and register allocation to debugging infrastructure and TableGen. The following excerpt—Chapter 6: TableGen – LLVM’s Swiss Army Knife for Modeling—introduces the declarative DSL that powers much of LLVM’s backend logic. It explains how TableGen structures instruction sets, eliminates boilerplate, and underpins the extensibility that modular backends depend on.TableGen – LLVM Swiss Army Knife for Modeling by Quentin ColombetThe complete “Chapter 6: TableGen – LLVM Swiss Army Knife for Modeling” from the book LLVM Code Generation by Quentin Colombet (Packt, May 2025).For every target, there are a lot of things to model in a compiler infrastructure to be able to do the following:Represent all the available resourcesExtract all the possible performanceManipulate the actual instructionsThis list is not exhaustive, but the point IS that you need to model a lot of details of a target in a compiler infrastructure.While it is possible to implement everything with your regular programming language, such as C++, you can find more productive ways to do so. In the LLVM infrastructure, this takes the form of a domain-specific language (DSL) called TableGen.In this chapter, you will learn the TableGen syntax and how to work your way through the errors reported by the TableGen tooling. 
These skills will help you be more productive when working with this part of the LLVM ecosystem.This chapter focuses on TableGen itself, not the uses of its output through the LLVM infrastructure. How the TableGen output is used is, as you will discover, TableGen-backend-specific and will be covered in the relevant chapters. Here, we will use one TableGen backend to get you accustomed to the structure of the TableGen output, starting you off on the right foot for the upcoming chapters.Read the Complete ChapterLLVM Code Generation is for both beginners to LLVM and experienced LLVM developers. If you’re new to LLVM, it offers a clear, approachable guide to compiler backends, starting with foundational concepts. For seasoned LLVM developers, it dives into less-documented areas such as TableGen, MachineIR, and MC, enabling you to solve complex problems and expand your expertise.Use codeLLVM20 for 20% off at packtpub.com.Get the Book🛠️Tool of the Week⚒️DirectX Shader Compiler (DXC) – HLSL Compiler Based on LLVM/ClangDXC is Microsoft’s official open-source compiler for High-Level Shader Language (HLSL), built on LLVM and Clang. It supports modern shader development for Direct3D 12 and Vulkan via SPIR-V, and is widely used in production graphics engines across the gaming and visual computing industries.Highlights:LLVM-Based Shader Compilation: Leverages the LLVM infrastructure to provide robust parsing, optimization, and code generation for HLSL, targeting both DXIL (DirectX Intermediate Language) and SPIR-V.Cross-Platform Targeting: Supports SPIR-V output for Vulkan through the -fspv-target-env flag, making it viable for multi-platform engines needing portability between Direct3D and Vulkan.Modern Shader Features: Enables developers to use Shader Model 6.x features, including wave operations, ray tracing, and mesh shaders, with forward compatibility for future models.Active Development and Tooling Improvements: The June 2025 release (v1.8.2406.1) added new diagnostics, SPIR-V fixes, -ftime-trace support for compilation profiling, and improvements to the dxcompiler API surface.Learn more on GitHub📰 Tech Briefs2025 AsiaLLVM - Understanding Tablegen generated files in LLVM Backend | Prerona Chaudhuri: This beginner-focused talk covers how TableGen generates key C++ backend files in LLVM—such as CodeEmitter, DisassemblerTables, and RegisterInfo—using AArch64 examples to explain how MIR instructions are encoded, decoded, and mapped to target-specific definitions.Type-Alias Analysis: Enabling LLVM IR with Accurate Types | Zhou et al.: Introduces TypeCopilot, a type-alias analysis framework for LLVM IR that overcomes the limitations of opaque pointers by inferring multiple concrete pointee types per variable, enabling accurate, type-aware static analyses with up to 98.57% accuracy and 94.98% coverage.LLVM 22 Compiler Enters Development With LLVM 21 Now Branched: LLVM 21 has been officially branched for release—introducing support for AMD GFX1250 (RDNA 4.5?), NVIDIA GB10, and expanded RISC-V features—while LLVM 22 development begins with continued backend enhancements, Clang 21 updates for C++2c and AVX10 changes, and LLVM 22.1 expected around March 2026.The Architecture of Open Source Applications (Volume 1): LLVM | Chris Lattner: This book chapter presents LLVM as a modular, retargetable compiler infrastructure built around a typed intermediate representation (LLVM IR), designed from the outset as a set of reusable libraries rather than a monolithic toolchain.2024 LLVM Dev Mtg - State of Clang as a C 
and C++ Compiler | Aaron Ballman: In this talk, Clang's lead maintainer outlines ongoing progress across C and C++ standards support, tooling, diagnostics, and community growth—highlighting Clang’s expanding role within LLVM, its near-complete C++20 and C23 conformance, and persistent challenges like compile-time overhead and documentation.

That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.

Take a moment to fill out this short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.

We’ll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Divya Anne Selvaraj
24 Jul 2025

Deep Engineering #10: Prof. Elías F. Combarro on Programming Quantum Systems in Flux

Writing code for quantum computers that don’t fully exist yet—and why it matters now.#10Prof. Elías F. Combarro on Programming Quantum Systems in FluxWhat it takes to design, debug, and reason about quantum programs when the hardware, abstractions, and rules are all still evolvingHi Welcome to the tenth issue of Deep EngineeringLast week, analysts at Bank of America (BofA) released a note on quantum computing saying, “This could be the biggest revolution for humanity since discovering fire.” It may seem like an audacious comparison at first for a field known till now to be abstract with hardware that is not there yet. But IBM has already laid out a comprehensive roadmap to build a large-scale, fault-tolerant quantum computer by 2029 and expects to achieve practical quantum advantage by 2026.If quantum computing is to deliver on its promise, it won’t be physicists alone who get us there—it will be software teams building the abstractions, compilers, and algorithms that bridge theory and hardware. Engineers now face a peculiar challenge: to write software for machines that don’t fully exist, on hardware that changes year to year, using abstractions that must bridge mathematical theory, noisy processors, and unpredictable outcomes.To understand how industry professionals can prepare to face this challenge, we spoke to Prof. Elías F. Combarro, co-author of A Practical Guide to Quantum Computing (Packt, 2025). Combarro is a full professor in the Department of Computer Science at the University of Oviedo in Spain, with degrees in both Mathematics and Computer Science and national academic honors in each. He completed his PhD in Mathematics in 2001, with research spanning computability theory and logic, and has since authored over 50 papers across quantum computing, algebra, machine learning, and fuzzy systems. His recent work focuses on applying quantum methods to problems in optimization and algebraic structures. He has held research appointments at CERN and Harvard and served on the Advisory Board of CERN’s Quantum Technology Initiative from 2021 to 2024.You can watch the full interview and read the transcript here—or read on for our synthesis of what it means to design and debug quantum code in the context of real-world constraints and developments.Sign Up |AdvertiseTwilio Segment: Data you can depend on, Built your wayTwilio Segment was purpose-built so that you don’t have to worry about your data. Forget the data chaos, dissolve the silos between teams and tools, and bring your data together with ease. So that you can spend more time innovating and less time integrating.Learn moreExecutable Abstractions in an Unfinished Machine with Prof. Elías F. CombarroWhat it means to build quantum software before the hardware — or the rules — are fully written.Analysts at BofA earlier this month made quite a riveting statement: quantum computing “could be the biggest revolution for humanity since discovering fire.” For a field known for being very abstract, this claim underscores how concretely disruptive its proponents now expect it to be—reshaping computation, shifting global power, and pressuring industries well ahead of full-scale machines. In fact, in an interview with CNBC International Live, Haim Israel, Head of Global Thematic Research at BofA, stated that quantum computing is no longer “20 years away.” He credits recent breakthroughs—largely enabled by AI—with accelerating progress to a point where early commercial applications are already emerging. 
Israel projects that quantum advantage will be achieved by 2030, with quantum supremacy arriving five to six years later.Yet, realizing that potential requires software developers and researchers to think very differently about programming. Quantum programs don’t run on stable, deterministic digital processors; they run on fragile qubits governed by probabilistic physics. As Prof. Elías F. Combarro puts it,“Quantum programs are fundamentally different. You don’t have loops. You don’t have persistent memory or data structures in the way you do in classical programming. What you have is a quantum circuit—a finite sequence of operations that runs once, from start to finish. You can't stop, inspect, or loop within the circuit. You run it, you measure, and then you’re done.”This new paradigm forces a reimagining of everything from algorithm design and debugging to testing and maintenance.From Qubits to Entanglement: New Mental ModelsClassical developers are used to variables holding definite values and code flowing through deterministic steps. By contrast, a single qubit can exist in a superposition of basis states, represented by a two-dimensional complex state vector. A single qubit can be represented geometrically using the Bloch sphere where every point on the surface corresponds to a possible quantum state and operations appear as rotations. As Combarro explains,“Every point on the surface of the sphere represents a possible state of your qubit, and quantum gates—operations—can be visualized as rotations of this sphere.”But as soon as we move beyond one qubit, our everyday intuition falters. Two qubits live in a 4-dimensional state space, ten qubits in a $2^{10}=1024$-dimensional space, and so on – exponential growth that quickly outpaces human imagination.A defining feature of multi-qubit systems is entanglement, a phenomenon with no classical equivalent.“Entangled systems can’t be described by just looking at the states of their individual parts… You need the full global state,” Combarro notes.An entangled pair of qubits shares a joint state that cannot be factored into two independent single-qubit states. Change or measure one part, and the other seems to instantly reflect that change – a mystery so striking that Einstein dubbed it “spooky action at a distance.” This “spookiness” is not just a quirk of physics; it’s a resource for computation.“Entanglement… only exists in quantum systems. It doesn’t happen in classical physics… you can use it to implement protocols and algorithms that are simply impossible with classical resources,” Combarro says.Indeed, algorithms like superdense coding (sending two classical bits by transmitting a single entangled qubit) or quantum teleportation of states require entanglement to work. In quantum computing, entanglement is the magic that enables a kind of collaborative computation across qubits – and it’s central to any future quantum advantage.When Measurement Changes the AnswerAnother fundamental difference between classical and quantum computation lies in how information is retrieved from a system. In classical software, reading a variable doesn’t disturb its value. In quantum software, measurement fundamentally changes the system. A qubit’s rich state is collapsed to a definite outcome (like |0⟩ or |1⟩) when measured, and all the other information encoded in its amplitudes is lost.“In quantum computing, when you perform a measurement, you can't access all that information. 
You only get a small part of it,” Combarro explains.Measuring a single qubit yields just one classical bit (0 or 1) of information, no matter how complex the prior state.And after measurement, “you’ve lost everything about the prior superposition. The system collapses, and that collapse is irreversible.”This means a quantum program can’t freely check intermediate results or branch on qubit values without destroying the very quantum state it’s computing with.The consequence is that quantum algorithms are often designed to minimize measurements until the end, or to cleverly avoid needing to know too much about the state. Even then, the outcome of a quantum circuit is usually probabilistic. Running the same circuit twice can give different answers, a shock to those accustomed to deterministic code.“For people used to classical programming, that's very strange—how can the same inputs give different outputs? But it’s intrinsic to quantum mechanics,” Combarro says.To manage this randomness, quantum algorithms rely on repetition and statistical analysis. Developers run circuits many times (often thousands of shots) and aggregate the results. For example, a quantum classifier might be run 100 times, yielding say 70 votes for “cat” and 30 for “dog,” which indicates a high probability the input was a cat. Many algorithms, like phase estimation, improve their accuracy by repeated runs:“In quantum phase estimation… you repeat the procedure to get better and better approximations. The more you repeat it, the more accurate the estimate.”In other words, you rarely trust a single run of a quantum program – you gather evidence from many runs to reach a reliable answer.Developers must also separate intrinsic quantum uncertainty from extrinsic hardware noise. The randomness of quantum measurement is unavoidable, but today’s quantum processors add extra uncertainty via errors (decoherence, gate faults, crosstalk). Mitigating these is an active area of research. Techniques like error mitigation calibrate and correct for known error rates in the readouts. More ambitiously, quantum error correction (QEC) encodes a “logical” qubit into multiple physical qubits to detect and fix errors on the fly. This too flips classical assumptions: in quantum, you can’t simply copy bits for redundancy (the no-cloning theorem forbids cloning an unknown quantum state). Instead, QEC uses entanglement and syndrome measurements to indirectly monitor errors.Researchers at QuEra achieved a milestone in this regard through magic state distillation on logical qubits – a technique proposed 20 years ago as essential for universal, fault-tolerant computing. As Sergio Cantu, vice president of quantum systems at QuEra even said, “Quantum computers would not be able to fulfill their promise without this process of magic state distillation. It’s a required milestone.”Even as such advances bring fully error-corrected quantum computers closer, they underline that today’s hardware is still very much unfinished.Circuits, Qubits, and the Tools of the TradeHow do you write software for machines that operate under these strange rules? The answer is to raise the level of abstraction—while keeping physics in mind. Modern quantum programming frameworks like Qiskit, Cirq, PennyLane, and others allow developers to describe quantum programs as circuits: sequences of quantum gates and operations applied to qubits. This is a low-level, assembly-like model of computation, but it’s the lingua franca of quantum algorithms. 
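As a concrete illustration of that circuit model (superposition, entanglement, a final measurement, and repeated shots), here is a minimal sketch using Qiskit with its Aer simulator; both packages are assumed to be installed, and the code is illustrative rather than drawn from the book or the interview:

```python
# A minimal sketch: a fixed sequence of gates, a measurement, and repeated
# "shots" whose counts approximate the outcome probabilities.
from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

qc = QuantumCircuit(2, 2)
qc.h(0)                      # put qubit 0 into superposition
qc.cx(0, 1)                  # entangle qubit 1 with qubit 0 (Bell state)
qc.measure([0, 1], [0, 1])   # collapse both qubits into classical bits

# Run the same circuit many times; each run is random, but the
# distribution is stable (roughly 50% '00' and 50% '11').
result = AerSimulator().run(qc, shots=1000).result()
print(result.get_counts())   # e.g. {'00': 503, '11': 497}
```

Each individual shot yields a random '00' or '11', but over a thousand runs the counts settle into a stable distribution, which is the repetition-and-aggregation workflow described above.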
High-level constructs familiar from classical languages (loops, if-else branches, function recursion) are largely absent inside a quantum circuit. Instead, any classical logic (like looping until a condition is met) has to run outside the quantum computer, orchestrating multiple circuit executions. As Combarro recounts, the shift can be jarring:“I remember the first student who asked, ‘How do you implement a loop in a quantum computer?’ And I had to say, ‘Come in and sit down—I have bad news.’”In practice, a quantum program might consist of a Python script that calls a quantum circuit many times, adjusting parameters or processing results on a classical computer between calls.Despite these challenges, certain abstractions and libraries have emerged to help manage complexity. IBM’s Qiskit has become a popular choice, especially in education, for its extensive features and cloud access to real quantum processors.“Qiskit has the largest number of features, and it’s the easiest one for accessing quantum computers online,” Combarro notes.In fact, one can prototype an algorithm on a local simulator and then, with only a few lines changed, run it on a real back-end.“You only need to change three or four lines of code to make that switch, but it’s very satisfying to say, ‘I’m running this on an actual quantum computer.’”This ease of swapping targets is a boon in an environment where hardware is evolving – it lets developers test their abstractions against today’s best machines and see the effects of real noise and connectivity constraints.Quantum compilers (transpilers) play a crucial role here. They take the high-level circuit and map it to the specific gates and qubits of a given device. Unlike a classical compiler, a quantum transpiler must contend with hardware quirks like limited qubit connectivity.“Not all qubits in a quantum computer are connected to each other. So, if you want to apply a gate to two distant qubits, the transpiler has to insert extra operations to move data around — introducing noise and increasing circuit depth,” Combarro explains.The transpiler may also optimize the circuit, combining gates or reordering operations to shorten the runtime (important before qubits decohere). Understanding what the transpiler is doing – and sometimes guiding it – has become part of the quantum developer’s skill set. For example, a programmer might constrain their circuit to use only certain qubits that have higher fidelity or explicitly insert swap gates to relocate qubits logically. It’s a delicate dance between abstract algorithm design and the very concrete limitations of hardware. Every additional gate is a risk when devices have error rates around 0.1–1% per operation.Debugging an Algorithm You Can’t Fully SeeWorking with quantum software can feel like coding with one eye closed. Because measuring qubits destroys their state, developers can’t step through a quantum program in the same way as a classical one. You can’t pause midway and inspect all qubit values – that would collapse the superpositions and entanglements you painstakingly created. Instead, quantum developers lean heavily on simulation and mathematical reasoning to debug.“To untangle issues, you start by running your code on a classical simulator. These simulators are deterministic and noise-free – they give you the exact mathematical result of the circuit, assuming perfect qubits. 
This lets you validate whether your logic is correct before moving to actual quantum hardware,” Combarro says.Simulators can output the full statevector of 20 or 30 qubits, allowing a developer to verify that, say, an entangled state or an amplitude amplification step is correct. Visualization tools can display probability distributions or Bloch sphere orientations for small circuits, providing insights that no current hardware can directly reveal.However, simulation has its limits. The memory required grows exponentially with qubit count, so beyond roughly 30 qubits (needing 16 GB of RAM or more), it becomes intractable to simulate general states. This is why today’s quantum algorithms for larger qubit numbers either rely on theoretical reasoning or are tested on actual quantum chips. When running on hardware, developers adopt statistical approaches to debugging: varying parameters, collecting lots of runs, and comparing aggregate results against expectations. They also must account for the possibility that an unexpected result is due to a device error rather than a flaw in the algorithm. As a safeguard, many will run the same circuit on multiple back-end devices (or noise models) to see if a result persists. This is quantum computing’s version of cross-platform testing. Even then, true reproducibility in the classical sense is unattainable on a quantum device – you can’t demand the same random outcome twice. Instead, reproducibility is about getting the same probability distribution of outcomes when conditions are repeated.As Combarro succinctly puts it, “Quantum computations are inherently probabilistic, so you can’t reproduce the exact same measurement result every time. What you can do is ensure a high probability of success.”The Hardware Frontier: Evolving and UncertainPerhaps the biggest challenge in writing quantum software today is that the machine itself is a moving target. Every year brings new devices with more qubits, different noise characteristics, and even new fundamental approaches to quantum bits. Superconducting qubits (used by IBM, Google, and others) dominate the current landscape with devices at 127 qubits and beyond, but they require cryogenic cooling and still have very short coherence times (microseconds). Trapped-ion qubits offer longer-lived states and all-to-all connectivity, but operations are slower and scaling to hundreds of qubits is difficult in practice. Photonic quantum computers, neutral atoms in optical tweezers, silicon spin qubits – each technology comes with trade-offs in coherence, gate fidelity, connectivity, and scalability. No one knows which approach (or fusion of approaches) will ultimately deliver a large-scale, fault-tolerant quantum computer. In a moderated virtual panel titled ‘Future of Quantum Computing’ at the 8thInternational Conference on Quantum Techniques in Machine Learning hosted by the University of Melbourne, Scott Aaronson said,“We do not have a clear winner between architectures such as trapped ion, neutral atoms, superconducting qubits, photonic qubits. Very much still a live race.”This uncertainty means quantum software must be somewhat hardware-agnostic yet ready to embrace new capabilities as they come. A few years ago, for instance, most cloud quantum computers did not support mid-circuit measurement or dynamic circuit logic; now some do, allowing new hybrid algorithms where measurement outcomes can influence subsequent operations. 
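To ground the point about mid-circuit measurement and dynamic circuit logic, here is a minimal hedged sketch in Qiskit (assuming a recent release with if_test support and the Aer simulator installed; real backends differ in which control-flow constructs they accept), where a measurement partway through the circuit conditions a later gate:

```python
# Illustrative dynamic-circuit sketch: a mid-circuit measurement whose
# outcome feeds forward into a classically conditioned gate.
from qiskit import QuantumCircuit
from qiskit_aer import AerSimulator

qc = QuantumCircuit(2, 2)
qc.h(0)
qc.measure(0, 0)                        # mid-circuit measurement of qubit 0
with qc.if_test((qc.clbits[0], 1)):     # classical feed-forward on that outcome
    qc.x(1)                             # flip qubit 1 only if qubit 0 read as 1
qc.measure(1, 1)

counts = AerSimulator().run(qc, shots=1000).result().get_counts()
print(counts)                           # outcomes cluster on '00' and '11'
```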
The “rules” of what a quantum program can do in one run are still being rewritten by hardware advances. Developers also contend with frequent library updates and deprecations. “Quantum software libraries evolve very quickly,” Combarro notes, reflecting on how code from his first book had to be updated as Qiskit advanced. This pace has started to stabilize – Qiskit’s major 2.0 release, for example, made relatively few breaking changes – but keeping code working may require more vigilance than in mature fields. Documentation sometimes lags behind new features, requiring quantum coders to read research papers or even source code to understand the cutting edge.Amid the rapid progress, it’s important to recognize that quantum computing is still largely in a pre-advantage era. While researchers have begun to demonstrate quantum advantage on carefully structured tasks, one recent milestone stands out: in July 2025, a team from USC and Johns Hopkins used IBM’s 127-qubit Eagle processors to show an unconditional exponential speedup on a modified version of Simon’s algorithm—a first in the field that doesn’t rely on unproven assumptions about classical limits. But even this breakthrough, as the lead researcher noted, has no immediate practical application beyond demonstrating capability. In fact, the 2025 MIT Quantum Index Report found that large-scale commercial applications of quantum computing remain “far off” despite the surge in patents and investments. Practical quantum advantage is an ongoing race: early claims can evaporate if improved classical algorithms catch up.Google’s much-publicized 2019 quantum supremacy experiment, for example, was soon matched by classical methods, nullifying that particular “advantage.” So, we are in a stage where the promise is undeniable and enormous (quantum computing could “change everything” from drug discovery to encryption), but the delivery is incremental and challenging.Navigating the Coming Quantum AgeIBM has laid out a comprehensive roadmap to build a large-scale, fault-tolerant quantum computer by 2029, called Quantum Starling, capable of running 100 million gates on 200 logical qubits. The plan integrates modular architecture, bivariate bicycle codes for quantum error correction, efficient logical processing units, universal adapters for inter-module communication, and magic state distillation to enable universal computation. IBM’s confidence rests on meeting successive milestones with custom hardware (like the upcoming Nighthawk processor), improved connectivity, and a newly introduced real-time decoder architecture. The company expects to achieve practical quantum advantage by 2026, with Starling serving as the scalable platform for fault tolerance.Lanes et al., researchers at IBM Quantum and PASQAL SAS, in their July 2025 paper have proposed a formal framework for quantum advantage that is platform-agnostic and empirically testable. They argue that advantage should mean outperforming classical systems on specific tasks with rigorously validated results—not theoretical superiority or isolated hardware feats, but measurable, reproducible performance gains in fields like chemistry, materials science, or optimization.In this environment, how should software professionals and technology leaders prepare? The consensus is to start small and start now. 
Even without large-scale quantum computers at hand, there is much to learn about quantum algorithms, error mitigation techniques, and integration with classical systems.“My advice is simple: start now,” urges Combarro. “If you think quantum computing might be relevant to your domain, begin exploring it as early as possible. The learning curve is steep… If you wait until quantum computing is mainstream, it may be too late to catch up.”This means building up quantum programming skills (in linear algebra, complex probability, and Quantum Processing Unit (QPU)-specific idioms), experimenting with simulators and cloud QPUs, and following the rapid research developments in both hardware and algorithms. Companies are already establishing small quantum teams or partnerships to identify long-term use cases – not because a quantum solution can be deployed today, but to be ready when the hardware crosses key thresholds in the next few years.There is a palpable excitement in the field, tempered by an understanding that quantum computing’s unfinished machine is being completed step by step. Writing quantum software today requires building abstractions for hardware that is still evolving, with each new qubit, error-correction scheme, and algorithm incrementally advancing the field toward practical, fault-tolerant systems. Until then, the work is foundational: preparing tools, methods, and mental models that future machines will depend on.If you found the insights in our feature on quantum software illuminating, A Practical Guide to Quantum Computing by Elías F. Combarro and Samuel González-Castillo (Packt, July 2025) offers a comprehensive and hands-on introduction to the field.Using Qiskit 2.1 throughout, the book walks readers through foundational quantum concepts, key algorithms like Grover’s and Shor’s, and practical techniques for writing and running real quantum programs. It’s ideal for professionals and self-learners looking to build solid, executable intuition—from single qubits to full-stack algorithm design.Use code QUANTUM20 for 20% off at packtpub.com.Get the Book🛠️Tool of the Week⚒️Qiskit – Python‑based Quantum SDK & Compiler StackQiskit is an open-source, Python-first SDK and compiler stack for quantum computing, developed by IBM and widely adopted across industry and academia. It enables developers to design, simulate, transpile, and deploy quantum circuits—whether running on local simulators or real quantum hardware.Highlights:Complete Quantum Software Workflow: Create quantum circuits using a flexible Python API, simulate them with Aer backends (statevector or noisy models), optimize and map circuits to hardware via transpilation, then run them on supported quantum devices like IBM’s QPUs—without changing your code structure.Optimizing Compiler & Hardware-Agnostic Deployment: Qiskit’s advanced transpiler performs qubit mapping, gate fusion, and noise-aware optimizations tailored to target hardware. 
It supports multiple backends (not just IBM), provides OpenQASM export, and has emerged as a performance leader in gate-depth reduction.Rich Application & Tooling Ecosystem: Includes domain-specific libraries (chemistry, finance, machine learning), visualizers for circuits and Bloch spheres, and profiling tools—empowering debugging and performance analysis across the entire quantum software stack.Actively Maintained & Rapidly Evolving: Since its major v2.0 release in March 2025, Qiskit has continued to advance with v2.1 (June–July 2025), adding a C API for high-throughput workflows, new synthesis and Clifford+T optimizations, multiqubit-gate support, and enhanced dynamic circuit constructs—showing vibrant and ongoing development.Learn more about Qiskit📰 Tech BriefsQuantum Computing Architecture and Hardware for Engineers -- Step by Step -- Volume II by H. Y. Wong (July, 2025): Extends Wong’s earlier work by providing a step-by-step, engineering-focused introduction to trapped-ion quantum computers, covering their physics, mathematics, laser control, and electronics in relation to DiVincenzo's criteria.Scientists make 'magic state' breakthrough after 20 years — without it, quantum computers can never be truly useful: Scientists at QuEra have, for the first time, demonstrated fault-tolerant magic state distillation using logical qubits—an essential breakthrough for running non-Clifford gates and enabling scalable, error-corrected quantum computation.Quantum Computers Just Reached the Holy Grail – No Assumptions, No Limits: Researchers from USC and Johns Hopkins have demonstrated, for the first time, an unconditional exponential speedup on a real quantum computer—solving a variation of Simon’s problem using IBM’s Eagle processors, marking a major milestone in proving quantum advantage without relying on unproven assumptions.The dawn of quantum advantage: A new white paper from IBM and Pasqal outlines a rigorous, empirically testable framework for quantum advantage—defining it as a validated performance edge over classical systems in real-world tasks—and argues that such advantage will emerge from hybrid quantum-classical workflows, likely beginning with variational algorithms and error-mitigated circuits by 2026.2025 MIT Quantum Index Report: The report finds that while investment, research, and job growth in quantum computing are accelerating, large-scale commercial applications remain “far off” due to current limitations in quantum processor performance and scalability.That’s all for today. Thank you for reading this issue ofDeep Engineering. 
We’re just getting started, and your feedback will help shape what comes next.

Take a moment to fill out this short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.

We’ll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Divya Anne Selvaraj
17 Jul 2025

Deep Engineering #9: Unpacking MLIR and Mojo with Ivo Balbaert

MLIR’s impact on compilers and Mojo’s promise for AI scale development#9Unpacking MLIR and Mojo with Ivo BalbaertHow MLIR is reshaping compilers for heterogeneous hardware despite adoption challenges—and how Mojo builds on it to unify Pythonic ease with AI‑scale performanceTwilio Segment: Data you can depend on, Built your wayTwilio Segment was purpose-built so that you don’t have to worry about your data. Forget the data chaos, dissolve the silos between teams and tools, and bring your data together with ease. So that you can spend more time innovating and less time integrating.Learn moreHi Welcome to the ninth issue of Deep Engineering.As CPUs, GPUs, TPUs, and custom accelerators proliferate, compilers have become the thin yet critical layer that enables both abstraction and performance.Our feature this week looks at Multi-Level Intermediate Representation (MLIR)—a compiler infrastructure that promises to unify optimization across wildly different domains. Born at Google and now adopted in projects like OpenXLA, LLVM Flang, NVIDIA’s CUDA Quantum, and even hardware DSLs like Chisel, MLIR offers a powerful foundation—but one that comes with real‑world friction: steep learning curves, ecosystem fragmentation, and legacy integration challenges. We unpack where MLIR delivers, where developers struggle with it, and what its future might mean for software architects.Building on this theme, we’re also kicking off a new series on Mojo🔥, a programming language built entirely on MLIR. Written by Ivo Balbaert, Lector at CVO Antwerpen and author of The Way to Go and Packt introductions to Dart, Julia, Rust, and Red, Building with Mojo (Part 1): A Language Born for AI and Systems explores Mojo’s origins, its design goals, and its promise to unify Pythonic ergonomics with AI‑scale performance. Future parts will go deeper—covering Mojo’s tooling, metaprogramming, hardware abstraction, and its role in simplifying development pipelines that currently span Python, CUDA, and systems languages.Read on for our take on MLIR’s trajectory—and then take your first step into Mojo, a language built for the next wave of AI and systems programming.Sign Up |AdvertiseMLIR’s Promise, Pain Points, and the Path ForwardTo use a cliched statement: hardware and software are becoming increasingly diverse and complex. And because modern workloads must run efficiently across this diversity and complexity in form of CPUs, GPUs, TPUs, and custom accelerators, compilers are now critical for both abstraction and performance. MLIRemerged to tame this complexity by enabling multiple layers of abstraction in one framework. MLIR has rapidly grown from a Google research project into an industry-wide technology. After being open-sourced and contributed to LLVM in 2019, MLIR’s modular design attracted a broad community.Today MLIR underpins projects beyond Google’s TensorFlow. For example, it is the foundation of OpenXLA, an open compiler ecosystem co-developed by industry leaders (AMD, Apple, NVIDIA, etc.) to unify ML model deployment on diverse hardware. It’s also inside OpenAI’s Triton (for GPU kernel optimization) and even quantum computing compilers like NVIDIA’s CUDA Quantum (which defines a “Quake” IR on MLIR). In hardware design, the LLVM-affiliated experimental CIRCT project applies MLIR to circuit design and digital logic – so much so that a modern hardware DSL like Chisel moved its back-end to MLIR for richer analysis than standard RTL provides. 
MLIR’s multi-dialect flexibility has proven useful well beyond machine learning.MLIR has also made inroads into a traditional compiled language. The new LLVM Fortran compiler (Flang) adopted MLIR to represent high-level Fortran IR (FIR), allowing more powerful optimizations than the old approach of jumping straight to LLVM IR. This MLIR-based Flang already achieves performance on par with classic Fortran compilers in many benchmarks (within a few percent of GCC’s Fortran). In fact, in 2024, AMD announced its next-gen Fortran compiler will be based on Flang/MLIR to target AMD GPUs and CPUs in a unified way.However, MLIR’s adoption remains uneven across domains. For example, the LLVM C/C++ frontend (Clang) still uses its traditional monolithic pipeline. There is work in progress on a Clang IR dialect (“CIR”) to eventually bring C/C++ into MLIR, but Clang’s large legacy and stability requirements mean it won’t rewrite itself overnight.MLIR is proving itself in new or specialized compilers (AI, HPC, DSLs) faster than it can retrofit into long-established general-purpose compilers. It is technically capable of being a general compiler framework, but the industry is still in transition.The Hard Gaps – Adoption Challenges in PracticeEngineers may be enthusiastic about MLIR’s potential but also hit real pain points when evaluating it for production. Some key challenges include:Steep learning curve and tooling maturity: The MLIR ecosystem is complex and still maturing, which can intimidate new developers. Ramalho et al., in a 2024 conference paper note that “the MLIR ecosystem has a steep learning curve, which hinders adoption by new developers.” Building a new dialect or pass often means delving into MLIR’s internals (C++ templates, TableGen definitions, etc.) with sparse documentation. In fact, MLIR’s flexibility can be a double-edged sword – there are many moving parts to learn (dialects, ops, attributes, patterns, builders), and patterns are still emerging. Google’s engineers originally writing machine-learning kernels directly in MLIR found it “a productivity challenge”, which led them to create the Mojo language to get a higher-level syntax on top of MLIR. The lack of out-of-the-box IDE support or debugging tools for MLIR IR further adds friction. Adopting MLIR often requires hiring or developing compiler expertise, and that investment can be hard to justify for every team.Integration with Legacy Compiler Stacks: For organizations with existing compilers, taking advantage of MLIR might mean significant refactoring or a total rewrite of the front-end or middle-end. The LLVM community has been careful with Clang for this reason: “Clang also has a legacy to protect, so it is unlikely to fully adopt MLIR quickly.” Instead, they are introducing MLIR gradually via a new CIR dialect for C/C++. Retrofitting MLIR into a mature compiler is expensive and risky because you must maintain feature-parity during the transition. Unless starting a compiler from scratch or facing a dead-end with current tools, it can be hard to justify MLIR’s long-term benefits over short-term upheaval.Dialect Fragmentation and Ecosystem Maturity: One strength of MLIR is its dialect system – you can create domain-specific IR “dialects” and compose them. However, in practice this has led to an explosion of dialects, especially in the AI domain, not all of which are stable or even compatible. 
As Chris Lattner (MLIR’s co-creator) observed:“Unfortunately, this explosion happened very early in MLIR’s design, and many design decisions in these dialects weren’t ideal for the evolving requirements of GenAI. For example, much of this early work was directed towards improving TensorFlow and building OpenXLA, so these dialects weren’t designed with first-class PyTorch and GenAI support.”The result was that by the time generative AI and PyTorch use cases rose, the upstream MLIR dialects (like linalg or tensor) were not a perfect fit for new workloads. Companies ended up forking or inventing their own dialects (e.g., Google’s StableHLO vs. others), leading to ecosystem fracture. Lattner describes it as an “identity crisis.” Architecturally, it is difficult to determine which dialects to build on or standardize around. On the bright side, the MLIR project recently established a new governance structure and an MLIR area team to improve consistency, but it will take time to harmonize the dialect zoo.Unpredictable Performance in Niche Scenarios: MLIR adds its own layer of transformations and scheduling – if the compiler pipeline isn’t expertly constructed, you might not hit peak performance for a given target. Until more of these optimizations are shared in the community, teams adopting MLIR in new domains might face a period of performance tuning and even uncertainty. (On the flip side, MLIR’s structure can enable new performance tools. For example, Lücke et al. in their CGO 2025 Main Conference paper demonstrate through five case studies that the transform dialect enables precise, safe composition of compiler transformations and allows for straightforward integration with state-of-the-art search methods.)But probably the most practical pain point is day-to-day developer experience. Debugging an MLIR-based compiler can be challenging – error messages often come from deep in the MLIR/LLVM machinery, and stepping through multi-dialect lowering is hard. So, there are challenges and tradeoffs in MLIR adoption at both the organizational and individual levels. But how have these trade-offs played out in the real world: who is successfully using MLIR today, and what did they learn from it?MLIR in the Real WorldDespite the hurdles, some teams have embraced MLIR and demonstrated tangible benefits. Let’s explore four use cases:Fortran & HPC Applications: The LLVM Flang project’s adoption of MLIR is a showcase for using MLIR in a non-ML domain. By inserting MLIR into the compilation flow (via FIR dialects), Flang keeps more high-level semantics available for optimization than the old approach that dropped straight to LLVM IR. This enabled powerful transformations for array operations, loop optimizations, and OpenMP parallelism, all within the MLIR framework. Notably, an MLIR dialect for OpenMP was created so Flang could represent parallel loops in a higher form than just runtime calls. Software engineers at Linaro showed that the new Flang compared favorably with Classic Flang and was not far behind GFortran on benchmarks. Researchers at national labs have run full applications through Flang and confirmed its output is efficient, while also praising the new compiler’s extensibility for future needs. This hints that MLIR can deliver HPC performance while providing a more modern, maintainable codebase. It’s not all rosy – Flang is still catching up on full Fortran 2018 feature support – but it’s a concrete proof that MLIR can anchor a production compiler for a decades-old language. 
It also drove industry involvement: Fujitsu and ARM are contributing to Flang’s MLIR optimizations, and AMD is aligning its own Fortran compiler with Flang’s MLIR pipeline. For HPC architects, MLIR’s holds potential to unify CPU/GPU optimization (Flang will emit GPU offload code to AMD and NVIDIA through LLVM) and to lower maintenance in the long run by leveraging common infrastructure.SiFive RISC-V Intelligence Products: Hardware startups and AI accelerator teams can adopt MLIR as their compiler toolkit rather than writing everything from scratch. For example, SiFive RISC-V Intelligence Products use Google’s open-source MLIR-based compiler IREE as the core of their ML software stack. SiFive added their own custom dialect (VCIX) to MLIR so that IREE could target SiFive’s vector extensions and custom AI accelerators. This allowed them to lower deep learning models (like LLaMA LLMs) onto RISC-V hardware with relative ease, reusing IREE’s many optimization passes and then adding just the pieces needed for SiFive’s architecture. The result was the ability to run LLMs on RISC-V and get real-time performance – something that would have been immensely difficult without a framework like MLIR.NVIDIA’s CUDA Quantum platform: MLIR can be leveraged to build compilers for quantum computing and other novel processors. NVIDIA’s CUDA Quantum platform uses MLIR under the hood, mapping quantum IR into MLIR’s SSA form (the Quake dialect) and allowing compiler optimizations on quantum circuits. The same infrastructure enabling tensor optimizations can also optimize quantum gate pipelines. For software architects at companies making custom chips (AI or otherwise), MLIR provides a common compiler backbone where you plug in hardware-specific pieces (dialects, cost models) rather than reinventing entire compilers.OpenXLA: On the enterprise side, MLIR is creeping into data centers. OpenXLA, which as noted uses MLIR in components like StableHLO and IREE, has been used in production at companies like DeepMind, Waymo, and Amazon. A Google blog noted that OpenXLA (with MLIR inside) has been used for training AlphaFold, serving large Transformer models in self-driving car systems, and even accelerating Stable Diffusion inference on AMD GPUs. These are real workloads where the MLIR-based compiler achieved better throughput or latency than default frameworks, often by performing advanced optimizations (fusions, layout optimizations, multi-host parallelization) that framework runtimes alone couldn’t.Torch-MLIR: This is an open project to compile PyTorch models via MLIR. While not yet mainstream in PyTorch deployments, it’s gaining traction among researchers trying to optimize PyTorch beyond what TorchScript or Inductor can do. The mere existence of Torch-MLIR underscores the interest in MLIR’s ability to serve as a common IR bridge – here, between the dynamic PyTorch ecosystem and lower-level backends like LLVM, SPIR-V, or custom accelerators.CIRCT in hardware design: companies designing FPGAs and ASICs (e.g., in the FPGA EDA industry) are experimenting with MLIR to replace or augment HDLs. Chisel, a high-level hardware language, now emits MLIR (via CIRCT) instead of a custom IR, allowing use of MLIR’s analysis to optimize hardware generators. This could streamline chip design workflows by enabling cross-optimization of hardware and software models. 
While still experimental, it’s a real adoption in a traditionally conservative domain (EDA).MLIR’s value multiplies in “greenfield” projects or where incumbents are hitting limits. New hardware with no legacy compiler, new languages (like Mojo, which we will talk about shortly) or AI serving stacks that need every ounce of performance – these are where MLIR has shined. The most effective MLIR deployments often abstract MLIR behind a higher-level interface. Flang hides MLIR behind normal Fortran semantics for end-users; SiFive’s users see an AI runtime API, not MLIR directly; even OpenXLA exposes a compiler API and uses MLIR internally. This suggests a potential best practice to ease adoption: shield developers from MLIR’s complexity via good APIs or DSLs, so they benefit from it without needing to write MLIR from scratch.Mojo & MLIRNo discussion of MLIR in 2025 is complete without Mojo – a new programming language from Modular (a company founded by Chris Lattner and others) that has been making waves. Mojo is essentially a distilled essence of what MLIR can enable in software design. It’s billed as a superset of Python, combining Python’s ease with C++/Rust-like performance. Under the hood, Mojo is built entirely on MLIR – in fact, Mojo’s compiler is an MLIR pipeline specialized for the language. This design choice sheds light on what MLIR brings that classic LLVM IR could not:Multi-level abstraction and optimization: Mojo uses MLIR to represent Python-like high-level features (e.g., list comprehensions, dynamic dispatch) in rich intermediate forms, then progressively lowers them to efficient native code via LLVM dialects—something impractical with LLVM IR alone.Hardware abstraction with performance: By leveraging MLIR dialects for CPUs, GPUs, and TPUs, Mojo can specialize code for diverse hardware while keeping a single high-level language surface, preserving type and shape information longer for deeper optimizations.Seamless Python interoperability: MLIR enables Mojo to handle Python’s dynamic typing and runtime behaviors, compiling only what benefits from optimization while falling back to the Python runtime, allowing a smooth transition from interpreted to compiled execution.Mojo’s success so far validates MLIR’s promised benefits. Within a few months of Mojo’s preview release, the Modular team itself used Mojo to write all the high-performance kernels in their AI engine. Like we mentioned earlier, Mojo was born because writing those kernels in pure MLIR was too slow – by creating a high-level language that compiles via MLIR, the Modular team combined productivity with performance.Figure 1.1: “Mojo is built on top of MLIR, which makes it uniquely powerful when writing systems-level code for AI workloads.” (Source: Modular Blog)Mojo’s compile-time cost is mitigated by MLIR’s design as well – parallelizing and caching in the compiler are easier with MLIR’s explicit pass pipeline, so Mojo can afford to do more heavy analysis without long build times. The language is still young, but it shines a promising light on what’s possible.(As an aside for readers, Mojo’s use of MLIR is a deep topic on its own. In Building with Mojo (Part 1): A Language Born for AI and Systems, Ivo introduces Mojo’s origins, design goals, and its promise to unify Pythonic ergonomics with AI-scale performance—but only at a high level. 
Later parts of the series will go deeper into Mojo’s internals, including how MLIR enables compile-time metaprogramming, hardware-specific optimizations, and seamless Python interoperability. To receive these articles in your inbox as soon as they are published, subscribe here)Wrapping UpMLIR’s trajectory over the past year shows cautious but real momentum toward broader adoption. The community has addressed key pain points like dialect fragmentation with new governance and curated core dialects, while new tooling—such as the Transform dialect presented at CGO 2025—lowers the barrier for tuning compiler optimizations. Proposed additions like a WebAssembly dialect and Clang CIR integration suggest MLIR is expanding beyond its “ML-only” roots into systems compilers and web domains. Industry trends reinforce its relevance: heterogeneous compute continues to grow, and MLIR already underpins projects like OpenXLA with backing from NVIDIA, AMD, Intel, Apple, and AWS. Still, its success depends on balancing generality with usability and proving its value beyond Google and Modular; competing approaches like SPIR‑V and TVM remain viable alternatives. Yet with advocates like Chris Lattner, ongoing research from firms like Meta and DeepMind, and AMD and Fujitsu adopting MLIR for HPC compilers, it’s likely to become a cornerstone of future compiler infrastructure if it maintains this pace.Read the Article🛠️Tool of the Week⚒️IREE – MLIR-Based Compiler & RuntimeIntermediate Representation Execution Environment (IREE) is an open-source end-to-end compiler and runtime for machine learning models, built on MLIR. In the OpenXLA ecosystem, IREE serves as a modular MLIR-based compiler toolchain that can lower models from all major frameworks (TensorFlow, PyTorch, JAX, ONNX, etc.) into highly optimized executables for a wide variety of hardware targets.Highlights:Broad Framework & Hardware Support: IREE can import models from multiple frontends (TensorFlow, PyTorch, JAX, ONNX, TFLite, etc.) and target nearly any platform – from x86 or Arm servers to mobile GPUs, DSPs, and custom NPUs.Intuitive Tooling & Integration: IREE provides a command-line compiler tool (iree-compile) and libraries that are straightforward to use. Models are compiled ahead-of-time into an efficient binary format, and runtime APIs are available in C and multiple languages (with language bindings) to easily load and execute the compiled models in your application. The tool comes with clear documentation and examples on its official site.Debugging & Profiling Support: Unlike many experimental compilers, IREE doesn’t treat the compiled model as a black box – it includes developer-friendly features like IR inspection, logging flags, and integration with MLIR’s debugging tools. There are guides for debugging model issues and profiling performance (e.g., integration with CPU/GPU profilers and the Tracy profiler).Active Community & Extensibility: Because IREE is built on MLIR, it is highly extensible – you can author custom MLIR dialects or passes and plug them into IREE’s pipeline if you have domain-specific optimizations or new hardware. The project’s community (spanning industry and academia) is very active, offering support and continuously adding features.Learn more about IREE📰 Tech BriefsWAMI: Compilation to WebAssembly through MLIR without Losing Abstraction by Kang et al. 
from Carnegie Mellon University and Yale University: Introduces a new MLIR-based compilation pipeline that preserves high-level abstractions by adding Wasm-specific MLIR dialects, enabling direct, modular generation of WebAssembly code with better support for evolving Wasm features and comparable performance to LLVM-based compilers.

2025 AsiaLLVM - Sanitizing MLIR Programs with Runtime Operation Verification by Matthias Springer: Introduces MLIR's new runtime operation verification interface, which enables dynamic checks for undefined behavior—complementing static verification, improving debugging, and supporting tools like memory leak sanitizers, though with trade-offs in runtime overhead and adoption maturity.

Leveraging the MLIR infrastructure for the computing continuum by Bi et al. presented at the CPSW’24: CPS Workshop: This WIP paper presents a node-level compiler and deployment framework built on MLIR for the MYRTUS project, targeting heterogeneous computing across the cloud-edge continuum by extending dataflow dialects, optimizing for CGRAs and FPGAs, and enabling adaptive execution with tools like Mocasin and CIRCT.

Precise control of compilers: a practical approach to principled optimization | Doctoral thesis by Martin Paul Lücke, The University of Edinburgh: Demonstrates how integrating principled program representations like Rise and flexible transformation control mechanisms such as the Transform dialect and Elevate into MLIR enables production compilers to achieve systematic, verifiable optimizations while giving developers fine-grained control over complex optimization strategies.

2025 AsiaLLVM - Data-Tiling in IREE: Achieving High Performance Through Compiler Design by Han-Chung Wang: Explains how the IREE MLIR-based compiler uses tensor encodings and progressive lowering to optimize data layout, memory access, and instruction scheduling across CPUs and GPUs, enabling efficient, retargetable compilation for heterogeneous hardware.

That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.

Take a moment to fill out this short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.

We’ll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Deep Engineering #8: Gabriel Baptista and Francesco Abbruzzese on Architecting Resilience with DevOps

Divya Anne Selvaraj
10 Jul 2025
How culture, CI/CD, and cloud-native thinking enable speed without sacrificing stability#8Gabriel Baptista and Francesco Abbruzzese on Architecting Resilience with DevOpsHow culture, CI/CD, and cloud-native thinking enable speed without sacrificing stabilityHi Welcome to the eighth issue of Deep Engineering.The 2024 DORA Accelerate State of DevOps report, published last October, showed a sharp divide in engineering performance: just 19% of teams qualify as elite, while 25% lag far behind. The difference lies in how teams approach DevOps—not as tooling, but as a foundation for resilient architecture and continuous delivery.To understand what this means for software architects today, we spoke with Gabriel Baptista and Francesco Abbruzzese, authors of Software Architecture with C# 12 and .NET 8. Baptista is an Azure Platform-as-a-Service (PaaS) specialist, university instructor, and advisor to early-stage startups. Abbruzzese is the creator of the MVC and Blazor Controls Toolkits and has worked across domains—from early AI decision support systems in finance to top 10 video game titles.Together, they emphasize that DevOps is foundational to architectural resilience. “Applications can no longer afford downtime,” Baptista tells us. Abbruzzese adds: “DevOps is designed specifically to align technical outcomes with business goals.”You can watch the full interview and read the transcript here—or scroll down for our take on how resilient delivery is being redefined at the intersection of DevOps, AI, and cloud.Sign Up |AdvertiseThe conference to learn, apply, and improve your craftdev2next is the premier conference designed for software developers, architects, technology leaders, development managers, and directors. Explore cutting-edge strategies, tools, and essential practices for building powerful applications using the latest trends and good practices.When: September 29 - October 2, 2025Where: Colorado Springs, COUse the discount code PACKT-DEV2NEXT to get a $50 discountBuy Conference and Workshop TicketsWhy DevOps Is Key to Architecting Resilience in the Age of AI and the Shift to Cloud-Native with Gabriel Baptista and Francesco AbbruzzeseThe 2024 DORA report showed that only 19% of teams qualify as elite performers and they:•Deploy multiple times per day• Have a lead time of under 1 day• Have a change failure rate of around 5%• Have a recovery time of under 1 hourIn contrast, 25% of teams sit in the lowest performance tier, deploying as infrequently as once every six months, with failure rates nearing 40% and recovery times stretching to a month.Teams that perform well on DORA’s Four Keys metrics also exhibit greater adaptability and lower burnout. These findings cut across industries and geographies.The architecture landscape, meanwhile, has shifted. AI-assisted development is now widespread (75.9% of developers use AI tools for at least one task), but DORA found that higher AI usage correlates with a 1.5% drop in throughput and a 7.2% decline in release stability. Similarly, platform engineering is nearly ubiquitous (89% adoption) yet often introduces latency and fragility if developer autonomy is not preserved.But as Baptista states:“Applications can no longer afford downtime, especially enterprise applications that need to run 24/7.To achieve that, you need to write good code—code that provides visibility into what’s happening, that integrates with retries, that enables better performance. 
A software architect has to consider these things from the very beginning—right when they start analyzing the application’s requirements.”But how can architects ensure this level of stability and resilience that meets business needs? According to Abbruzzese:“The best tool for this is DevOps. DevOps is designed specifically to align technical outcomes with business goals.”Microsoft defines DevOps as:“The integration of development, quality assurance, and IT operations into a unified culture and set of processes for delivering software.”In their book, Baptista and Abbruzzese concur:“Although many people define DevOps as a process, the more you work with it, the better you understand it as a philosophy.”DevOps as Philosophy and Culture: Not just a ToolConsidering DevOps as a philosophy, the authors say, helps architects focus on service design thinking which means:“Keeping in mind that the software you design is a service offered to an organization or part of an organization. …The highest priority (for you as an architect) is the value your software gives to the target organization. … you are not just offering working code and an agreement to fix bugs but also a solution for all the needs that your software was conceived for. In other words, your job includes everything …to satisfy those needs, such as monitoring users’ satisfaction and quickly adapting the software when the user needs change, due to issues or new requirements.”To explain what the role of the software architect is in this context, they add:“DevOps is a term that is derived from the combination of the wordsDevelopmentandOperations, and the DevOps process simply unifies actions in these two areas. However, when you start to study a little bit more about it, you will realize that just connecting these two areas is not enough to achieve the true goals of this philosophy.We can also say that DevOps is the process that answers the current needs of people regarding software delivery.Donovan Brown has a spectacular definition of what DevOps is:DevOps is the union of people, process, and products to enable continuous delivery of value to our end users.A way to deliver value continuously to our end users, using processes, people, and products: this is the best description of the DevOps philosophy. We need to develop and deliver customer-oriented software. …your task as a software architect is to present the technology that will facilitate the process of delivery.”DORA’s recommendation to “Be relentlessly user-centric” also supports this framing of DevOps:“teams that have a deep desire to understand and align to their users’ needs and the mechanisms to collect, track, and respond to user feedback have the highest levels of organizational performance. In fact, organizations can be successful even without high levels of software velocity and stability, as long as they are user focused.… Teams that focus on the user make better products. Not only do products improve, but employees are more satisfied with their jobs and less likely to experience burnout.Fast, stable software delivery enables organizations more frequent opportunities to experiment and learn. Ideally, these experiments and iterations are based on user feedback. 
Fast and stable software delivery allows you to experiment, better understand user needs, and quickly respond if those needs are not being met.”Experienced Amazon technologist and Principal Resilience Architect at Arpio, Seth Eliot, enriches the framing of DevOps as a culture saying that DevOps must not be seen as a toolchain or process stack, but as a shift in mindset—one rooted in ownership, autonomy, and tight integration between historically siloed roles.The canonical DevOps problem, he says, is:“The wall that traditionally has existed between development and operations,” a wall that “prevented these roles... from having shared goals.”He urges architects to remember that:“DevOps is all about culture and the tools are based on that. The tools come after that.”If you are wondering how such a culture can be fostered, author of The Phoenix Project and The DevOps Handbook, Gene Kim’s "Three Ways" framework offers foundational principles:The First Way-Flow/Systems Thinking: This principle focuses on optimizing the entire system rather than individual silos. It stresses the importance of ensuring that value flows smoothly from development to IT operations, with the goal of preventing defects from passing downstream. The emphasis is on improving flow across all business value streams and gaining a deep understanding of the system to avoid local optimizations that could cause global degradation.The Second Way-Amplify Feedback Loops: The Second Way focuses on shortening and amplifying feedback loops throughout the process. Continuous, real-time feedback is essential to make quick corrections, improve customer satisfaction, and embed knowledge where it's most needed. The principle encourages responding to feedback, both from customers and internal stakeholders, to enhance the development and operations process.The Third Way-Culture of Continual Experimentation and Learning: This principle advocates for creating a culture that encourages experimentation, risk-taking, and learning from failure. It emphasizes the importance of mastering skills through repetition and practice, as well as introducing faults into the system deliberately to enhance resilience. Continuous improvement is fostered by allocating time for work improvements and creating rituals that reward taking risks.These principles also emphasize a cultural shift toward continuous improvement which naturally supports the adoption of engineering practices that enable resilience and stability in software delivery.How DevOps Practices Enable Resilience by DesignGoing back to the 2024 DORA report, elite teams achieve both speed and stability through five disciplined engineering practices:Small batch developmentAutomated testingTrunk-based workflowsContinuous integration (CI)Real-time monitoringBaptista and Abbruzzese demonstrate these using a case study (the WWTravelClub platform) showing how to implement multi-stage pipelines, enforce quality gates, and build visibility into the delivery process. Here is a breakdown:Small Batch Changes and Change Isolation: DORA states that “small batch sizes and robust testing mechanisms” are essential for high performance. Baptista and Abbruzzese, echo this by warning about the risks of incomplete features and unstable merges. 
They advise using feature flags and pull requests to ensure that “only entire features will appear to your end users.” Pull requests enable peer review, while flags control feature exposure at runtime, both critical for keeping systems stable in a CD environment.Trunk-Based Workflows and Pull Request Gates: Trunk-based development refers to a workflow where all developers commit to a shared mainline branch. But gatekeeping is essential to ensure the safety of this workflow. Baptista and Abbruzzese show how to integrate pull request reviews and automated build validations to ensure code quality. They recommend:Static analysisPre-merge testsConsistent peer review.They also note that many teams believe they’ve implemented CI/CD simply by enabling a build pipeline, “but this is far from saying you have CI/CD available in your solution.”Baptista adds that adopting DevSecOps practices can further strengthen the review process.“Instead of just implementing DevOps, why not implement DevSecOps? With DevSecOps, you can include static analysis tools that identify security issues early. These tools help architects and senior developers review the code produced by the team and ensure security practices are being followed.”Multi-Stage Pipelines and Controlled Deployments: DORA’s report cautions that indiscriminate deployment automation, especially with large change sets, often leads to stability regressions. Baptista and Abbruzzese demonstrate a multi-stage pipeline structure to reduce deployment risk:Development/TestingStagingProductionThey note that “you need to create a process and a pipeline that guarantees that only good and approved software (reaches) the production stage.Automated Testing as Early Fault Detection: DORA finds that elite teams use automation not for speed alone, but for stability. Baptista and Abbruzzese emphasize the importance of unit and functional tests integrated into CI pipelines. Failing tests prevent bad commits from reaching staging, and the presence of automated feedback loops improves developer confidence.Continuous Feedback and Observability: DORA claims that resilience is enhanced not only by automation, but by visibility and rapid iteration. Baptista and Abbruzzese recommend integrating Application Insights and Microsoft’s Test and Feedback browser extension to close the feedback loop, capture live production behavior, and turn user feedback into structured work items.When these practices are implemented with discipline, architecture becomes both adaptable and durable. As Abbruzzese puts it, “You don’t have to rewrite large portions of the application.” You just need to be able to change the right parts, safely, at speed.Balancing AI-Driven Velocity and Resilience in DevOpsThe 2024 DORA report confirms that roughly 75% of teams now use AI tools daily – with over one-third of respondents reporting “moderate” to “extreme” productivity gains from AI-assisted coding. Higher AI adoption correlates with modest improvements (e.g. ~7.5% better documentation, 3.4% better code quality) and faster code review times. However, these gains come with trade-offs: AI use was accompanied by an estimated –1.5% delivery throughput and –7.2% change stability.“Improving the development process does not automatically improve software delivery” To balance the trade-offs between AI’s benefits and challenges, DORA makes three recommendations:Focus AI on empowering developers (e.g. 
automating boilerplate and documentation) rather than blindly pushing codeEstablish clear guidelines and feedback loops for AI use, fostering open discussion about when AI help is appropriateAllocate dedicated exploration time so teams can build trust in AI tools (rather than hastily deploying suggestions)Baptista talking about how AI will impact architects says:“As architects, we’ll be impacted by AI—positively or negatively—depending on how we work with it. Let me give two examples. Today, it's possible to upload an architecture diagram into an AI tool like ChatGPT and discuss with it whether you’re creating a good or bad design. That’s already possible. In some cases, I’ve used AI to give me feedback or suggest changes to my designs. It can do that.But as a software architect, you still need to be a good analyst. You need to evaluate whether the output from the AI is actually correct. Especially in enterprise systems, that’s not always easy to do. So, yes, AI will change the world, but we—as individuals—need to use our intelligence to critically analyze whether the AI output is good or not, in any context.”Regarding how software architects can prepare for this impact he says:“We, as architects, need to understand that a good AI solution first requires a good software architecture—because AI only works with good data. Without good data, you cannot have a good AI.As software architects, we need to understand that we have to build architectures that will support good AI—because if you don’t provide quality data, you won’t get quality AI.”Abbruzzese adds:“I think AI is a valuable tool, but at least for now, it can’t completely replace the experience of a professional.It helps save time. It can suggest choices—but sometimes those suggestions are wrong. Other times, those suggestions can be useful as a starting point for further investigation. AI can write some code, some diagrams, some designs that you might use as a base. That saves time.Sometimes it suggests something you hadn’t thought of, or reminds you of a possibility you forgot to consider. That doesn’t mean it’s the best solution—but it’s a helpful suggestion. It’s a tool to avoid missing things and to save time.At the moment, AI can’t replace the experience of a real professional—whether it’s an architect, a programmer, or someone else. For instance, I’ve never seen AI come up with a completely new algorithm. If you have to invent a new one, it’s not capable of doing that.…And I think this won’t change much over time—at least not until we reach actual artificial general intelligence, something human-like.”But AI is not the only shift architects need to prepare for. Baptista states:“In the near future, I believe most applications will be cloud-native… This is something that everyone working in software development today needs to think about.”The Shift to Cloud-NativeMaking the case for the shift to cloud-native architecture, Baptista says:“We’re discovering new ways to build solutions every single day. We can’t always keep up that same pace on the architecture side, which is why we need to think carefully about how to design a software architecture that can be both adaptable and resilient.”DORA’s report also identifies a link between team success and leveraging flexible architecture:“We see that successful teams are more likely to take advantage of flexible infrastructure than less successful teams.”Abbruzzese adds:“In my opinion, it’s quite simple—cloud computing basically means distributed computing, with added adaptability. 
It allows you to change your hardware dynamically.Cloud computing is really about reliability and adaptability. But you have to use the right architecture—that means applying the theory behind modern microservices and cloud-native systems.I’m talking about reliable communication, orchestrators like Kubernetes, and automatic scaling—these are all provided by cloud platforms and also by Kubernetes itself. You also have tools for collecting metrics and adjusting the software’s behavior automatically based on those metrics. This is the essence of the theory we’re dealing with.For example, in microservices architectures, reliable communication is essential. These applications are often structured like assembly lines—processing and transferring data step by step. That means it’s unacceptable to lose data. Communication must at least eventually succeed. It can be delayed, but it has to succeed.”However, it is important to ascertain first whether you and your team are willing to invest fully in the shift. Else this can lead to decreased organizational performance. DORA warns:“Cloud enables infrastructure flexibility. Flexible infrastructure can increase organizational performance. However, moving to the cloud without adopting the flexibility that cloud has to offer may be more harmful than remaining in the data center. Transforming approaches, processes, and technologies is required for a successful migration.”This is because, very much in line with adopting DevOps as a culture, accomplishing the shift to cloud-native does not simply require:"Tools or technologies, but often an entire new paradigm in designing, building, deploying, and running applications.”DORA recommends:“Making large-scale changes (because they are) easier when starting with a small number of services, (and)… an iterative approach that helps teams and organizations to learn and improve as they move forward.”The role of the architect when it comes to navigating these shifts is more crucial than ever. And the adoption of DevOps as a culture along with DevOps engineering best practices can serve both businesses and architects well so they can better serve business needs and remain relevant.If Baptista and Abbruzzese’s perspective on DevOps as a foundation for resilient architecture resonated with you, their book Software Architecture with C# 12 and .NET 8 goes further—connecting high-level design principles with hands-on implementation across the .NET stack.The following excerpt—Chapter 8: Understanding DevOps Principles and CI/CD—breaks down the architectural mindset behind effective delivery pipelines. It covers core DevOps concepts, the role of CI/CD in aligning code with business needs, and how to embed quality, automation, and visibility throughout the deployment process.🧠Expert Insight: Understanding DevOps Principles and CI/CD by Gabriel Baptista and Francesco AbbruzzeseThe complete “Chapter 8: Understanding DevOps Principles and CI/CD” from the book Software Architecture with C# 12 and .NET 8 by Gabriel Baptista and Francesco Abbruzzese (Packt, February 2024).Although many people define DevOps as a process, the more you work with it, the better you understand it as a philosophy. 
This chapter will cover the main concepts, principles, and tools you need to develop and deliver your software with DevOps....The following topics will be covered in this chapter:Understanding DevOps principles: CI, CD, and continuous feedbackUnderstanding how to implement DevOps using Azure DevOps and GitHubUnderstanding the risks and challenges when using CI/CDRead the Complete ChapterNow in its fourth edition, Software Architecture with C# 12 and .NET 8, combines design fundamentals with hands-on .NET practices, covering everything from EF Core and DevOps pipelines to Blazor, OpenTelemetry, and a complete case study centered on a real-world travel agency system.Use code DEEPENGINEER for an exclusive subscriber discount—20% off print and 30% off eBook, valid until 17th July, 2025.Get the Book🛠️Tool of the Week⚒️OpenTofu — Terraform-Compatible Cloud Infrastructure-as-CodeOpenTofu is an open-source, community-driven IaC tool, designed as a fork of Terraform. It enables teams to define, manage, and deploy cloud infrastructure declaratively, ensuring reproducibility and resilience.Highlights:Multi-Cloud IaC: Works across AWS, Azure, GCP, and Kubernetes, ensuring consistent, reproducible environments.Modular & Scalable: Reusable modules simplify complex architectures like multi-region deployments.Collaborative: Infrastructure changes are versioned and peer-reviewed, ensuring alignment with business goals.Fault-Tolerant: Optimized for error handling, ensuring stable infrastructure changes.Active Development: Recently updated (v1.10.0 in June 2025) and hosted under the Linux Foundation, ensuring long-term stability.Learn more about OpenTofuSponsored:Agentic AI You Don’t Have to BabysitIf you’ve been burned by clunky GenAI tools that need constant handholding, this will feel different.Shield’s AmplifAI is using agentic AI to go beyond simple prompts. Think: intelligent agents that can reason, plan, and actually do things, like reviewing communication threads or spotting compliance risks, without you manually clicking through 30 tabs.Whether you’re building fintech, regtech, or just tired of reactive workflows, AmplifAI gives you a head start. It’s already making life easier for devs.👉 See how AmplifAI is redefining what AI can actually do📰 Tech Briefs🎥DevOps in the Cloud: Case Studies of Amazon.com teams and their resilient architectures - Seth Eliot: Introduces the concept of DevOps in the cloud, using examples from Eliot’s experience at Amazon, where teams—organized into "two pizza teams"—take ownership of services from design to deployment, emphasizing the importance of culture over tools, and how frameworks enable resilience through continuous experimentation, chaos engineering, and disaster recovery solutions.Why Is My Docker Image So Big? 
A Deep Dive with ‘dive’ to Find the Bloat by Chirag Agrawal, Senior Software Engineer | Alexa+ AI Agent Engineering: Explores how to diagnose and reduce Docker image bloat using tools like docker history and dive to pinpoint inefficiencies in image layers, with a special emphasis on AI Docker images that often become oversized due to large AI libraries and base OS components.Azure AI Foundry Agent Service Gains Model Context Protocol (MCP) Support in Preview: Microsoft has announced the preview release of MCP support in its Azure AI Foundry Agent Service, enhancing interoperability for AI agents by simplifying integration with internal services and external APIs, and enabling seamless access to tools and resources from any compliant MCP server.The DevOps engineer’s handbook: Covers key DevOps practices, including automation, continuous integration, continuous delivery, and the importance of evolving team culture to improve software delivery and collaboration across the development lifecycle.That’s all for today. Thank you for reading this issue ofDeep Engineering. We’re just getting started, and your feedback will help shape what comes next.Take a moment tofill out this short surveywe run monthly—as a thank-you, we’ll addone Packt creditto your account, redeemable for any book of your choice.We’ll be back next week with more expert-led content.Stay awesome,Divya Anne SelvarajEditor-in-Chief, Deep EngineeringTake the Survey, Get a Packt Credit!If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want toadvertise with us.*{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{font-size:75%;line-height:0}#converted-body .list_block ol,#converted-body .list_block ul,.body [class~=x_list_block] ol,.body [class~=x_list_block] ul,u+.body .list_block ol,u+.body .list_block ul{padding-left:20px} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}}

Deep Engineering #7: Managing Software Teams in the Post-AI Era with Fabrizio Romano

Divya Anne Selvaraj
03 Jul 2025
How to decide whether you want to move into management and lead without losing touch#7Managing Software Teams in the Post-AI Era with Fabrizio RomanoFrom lean organizations and AI tools to Gen Z teams, the software team manager’s job has changed. Here’s how to lead without losing touch and decide whether you want to move into management.Workshop: Unpack OWASP Top 10 LLMs with SnykJoin Snyk and OWASP Leader Vandana Verma Sehgal on Tuesday, July 15 at 11:00AM ET for a live session covering:✓ The top LLM vulnerabilities✓ Proven best practices for securing AI-generated code✓ Snyk’s AI-powered tools automate and scale secure dev.See live demos plus earn 1 CPE credit!Register todayHi Welcome to the seventh issue of Deep Engineering.The software manager’s role is being remade—less by choice than by necessity. The old playbook, where managers translated product priorities into sprints and stayed one layer removed from the code, no longer holds. In 2025, development managers are navigating leaner organizations, AI-assisted teams, hybrid work models, and a workforce increasingly shaped by Gen Z expectations.To understand this shift and glean best practices, we spoke with Fabrizio Romano, author of Learn Python Programming and development manager at Sohonet. We also examine what the transition from senior engineer to manager really entails—and how to know if that’s the right move for you. Throughout, we draw on Romano’s experience, alongside insights from other engineering leaders like Gergely Orosz, Mirek Stanek, Nick Centino, and Vladimir Klepov, to unpack the changing traits, tensions, and tradeoffs of modern development management.You can watch Romano’s complete interview which covers both his experiences with Python and as an engineering manager and read the transcript here, or read on for an engineering management focussed deep dive.Sign Up |AdvertiseLeading Software Teams in Changing Times with Fabrizio RomanoWhile a desire to nurture growth in others is crucial to success in management, the evolving landscape of software development today brings a set of external challenges that shape how development managers must lead. As Romano suggests, becoming a development manager isn’t just about mastering technical skills, but about understanding and adapting to the broader trends reshaping the industry—particularly in a post-AI world. The role has become more complex and dynamic than ever, influenced by forces like leaner organizations and teams, more millennials and Gen Zs in the workforce, remote-first work, and AI-powered development tools, and an increasing focus on efficiency over expansion. These shifts have led to new expectations for managers, testing their ability to balance people development with technical leadership.The Current State of Development ManagementThe post-COVID world is seeing significant changes in how development teams are structured, with many organizations flattening their hierarchies to reduce layers of management. This shift to leaner teams, combined with the increasing use of AI tools like GitHub Copilot, Cursor, and others, has led to new challenges for development managers.Leaner OrganizationsAs Mirek Stanek, PL Engineering Site Lead at Papaya Global points out, one of the most profound changes in development management is the trend towards fewer managers and a greater emphasis on individual contributors (ICs). 
In organizations where budget cuts and performance metrics dominate, managers are now expected to maximize the productivity of their teams with fewer resources. This is in line with Amazon's directive shared in a letter from CEO, Andy Jassy, to employees in September 2024, to increase the ratio of ICs to managers by 15% by Q1 2025. This shift reflects a broader trend where leadership roles are being scrutinized more heavily, and managers must justify their position by demonstrating tangible value to the organization.The hands-on expectations of development managers have therefore increased. In previous decades, a manager could expect to focus on strategy, vision, and team alignment, while ICs handled the bulk of coding tasks. Today, however, many engineering managers (Ems) are expected to stay deeply involved in the technical aspects of development. As Vladimir Klepov, EM at Ozon Bank, discusses in his reflections, a manager who is disconnected from the technical work risks losing touch with the challenges their team faces on the ground. Therefore, hands-on leadership—being embedded in the development process—is now a critical competency for effective development managers.Managing Gen-Z, Millennials, and the New Workforce ExpectationsAnother change reshaping development management is the increasing presence of Gen-Z and Millenials in the workforce. According to Elizabeth Faber, Deloitte Global Chief People & Purpose Officer,“Projected to make up roughly two-thirds of the labor force within the next few years, Gen Zs and millennials are likely to be a defining force in the future of work—one that looks less like a ladder and more like an interconnected web of growth, values, and reinvention.”Stanek also points out how Gen-Z values work-life balance, professional growth opportunities, and authentic leadership.Concluding from the 14th Deloitte Global Gen Z and Millennial Survey, Faber writes that, for Gen Z and millennial workers to feel truly supported and fulfilled, managers must be empowered to support employee well-being by:Addressing team stressorsPromoting work/life balanceRecognizing contributionsEnabling growthFacilitating access to mental health resourcesFor development managers, this means adapting leadership styles to align with these expectations. Managers must be more emotionally intelligent, open to feedback, and flexible in how they structure their teams.This also reflects the broader trend of remote and hybrid work models. While some companies, like Amazon, are pushing for a return to the office, many development managers will need to navigate the challenges of managing a distributed, remote-first workforce while ensuring cohesion and a sense of purpose within their teams.Working with Distributed and Diverse TeamsManaging teams split across cities or continents adds its own set of challenges – and opportunities. Stanek writes,“The pandemic showed us how teams can function effectively remotely, but it also highlighted the limitations of remote work, such as the lack of nonverbal communication cues and the blurring of work-life boundaries.”Nataliia Peterheria, Operations Manager, Django Stars, recommends the following practices to overcome dissonance in remote and hybrid team setups:Choose one primary communication channel (e.g., Slack, Google Hangouts) and stick to it to reduce information loss. Supplement with one or two backups only when necessary. 
Every team member should maintain a complete profile—with a real photo, job title, contact number, and bio—so others can quickly understand roles and reach out when needed.Set up a single “source of truth” for documentation and decisions—like Confluence or a shared wiki—structured simply (no more than three nested levels). Keep specs, requirements, and changes in one place, and annotate directly on the relevant topic pages to avoid fragmentation.Create a structured work schedule with overlapping hours for live collaboration. Use this shared window for time-sensitive interactions like team calls or joint problem-solving. Schedule overlapping meetings in advance, prioritize ruthlessly, and stay consistent to avoid drifting into 24/7 work mode.Use daily checklists to track questions, progress, and blockers. Organize them by project and link them to your source of truth or project repo. Checklists help ensure timely answers and keep asynchronous work from stalling.Standardize request communication to prevent missed inputs. Assign a single person (often the PM) to collect product owner requests, or reserve regular meeting slots to introduce new requirements to the full team.Require approval for all logic changes or scope updates, no matter how minor. Even well-intentioned “improvements” by developers must be signed off by business stakeholders to prevent misalignment or scope creep.Define escalation paths clearly. Publish a diagram showing who is responsible for what and who to contact when something goes wrong. Team members should know exactly how to escalate unresolved issues—internally or with the client.Align on a common task tracking and documentation toolset before kickoff. Avoid fragmented tracking (e.g., team members using their own spreadsheets). Centralize around one system, even if it means switching from a personal favorite.Codify remote technical workflows. Set clear guidelines for pull request handling, commit hygiene, and review expectations. Include code style guides to prevent inconsistency and ensure maintainability when multiple people contribute to the same codebase.Assess technical readiness before the project starts. Identify gaps in tooling knowledge, run onboarding sessions where needed, and provide up-to-date guides for any systems that require self-service support.In addition to these, there is the human side to management. Romano describes watching body language and Slack message tones for signs of stress in his team. If a developer seems off or tensions are brewing, he takes time to talk one-on-one and understand the issue. In some cases, he even teaches simple meditation or mindfulness techniques to help his engineers re-center under pressure. “When you’re upset, frustrated, or angry… it triggers a fight-or-flight response… If you keep stimulating that state… it becomes a health risk,” he explains, drawing from his experience in martial arts that a “relaxed mind is a creative mind.” By coaching his team in emotional intelligence and stress management, he not only cares for their well-being but also ensures they stay productive and collaborative. This kind of empathetic leadership – once rare in engineering circles – is increasingly recognized as key to maintaining high-performing teams.AI Tools: A Double-Edged Sword for Development ManagersIn addition to managing shifting workforce dynamics, AI is becoming an integral tool for development teams. 
AI-driven tools like GitHub Copilot are no longer just productivity boosters but are changing how software is developed at a fundamental level. For example, Gergely Orosz, author of The Software Engineer’s Guidebook, in The Pragmatic Engineer reports that,“90% of the code for Claude Code is written by Claude Code(!).”The rise of AI coding assistants and automation is one of the defining trends reshaping development management. Tools like GitHub Copilot, ChatGPT, and other AI pair programmers are rapidly becoming part of daily software engineering workflows.Gitlab’s 2024 Global DevSecOps Report found that 39% of software professionals are already using AI in development, up 16 percentage points from the year prior. Moreover, 60% say implementing AI is now essential to avoid falling behind competitively.Development managers now face the challenge of integrating AI effectively into their team's workflow while also ensuring that these tools don’t hinder creativity or lead to over-reliance.“We have to use AI. I think a developer who refuses to embrace AI today is probably going to be obsolete very soon,” says Romano, underscoring the urgency of adaptation. He adds: “At Sohonet, in my role, I got everyone on my team set up with GitHub Copilot. I wanted them to start using it, get familiar with it, and understand how to leverage what it can offer.”By equipping his engineers with Copilot, he aimed to help them embrace AI-assisted development rather than fear it. Romano notes,“Copilot is especially helpful for menial or repetitive tasks—like hardcoding different test cases. It’s really good at predicting what the next test case might be.” “Even when it's just acting like a better IntelliSense, it’s still useful… instead of rewriting a line yourself, you just hit Tab and it’s done,” Romano saysFor development managers, the benefit of such tools is twofold: they boost team productivity and free up human developers for more complex, creative work.According to Infragistics’ Reveal 2024 survey report, the top reasons developers leverage generative AI are to increase productivity (49%), eliminate repetitive tasks (38%), and speed up development cycles (36%).Managers who proactively introduce approved AI tools can thus accelerate output and improve developer satisfaction. Romano mentions that his team continually experiments with new AI aides (from code editors like Cursor to AI pair-programming prototypes) to stay on the cutting edge. This reflects a broader best practice: staying up to date with emerging tools and evaluating their potential.However, Romano also points out that over-relying on AI tools can stunt problem-solving skills, as developers might bypass critical thinking or creative solutions in favor of quick, AI-generated responses. 55% of Gitlab’s survey respondents also felt that introducing AI into the software development lifecycle is risky.Effective development management in the AI era means finding a balance between leveraging AI and honing human skill. Romano emphasizes that developers shouldn’t offload all problem-solving to machines:“Part of the job… was to smash my brain against a problem now and then. That’s really beneficial for your thinking… It keeps your mental muscles in shape.” “Relying too much on AI to… figure out the next step… that’s risky. I still want to ‘go to the gym’ up here,” he quips, referring to exercising one’s own mental faculties. 
Romano encourages each developer to “find the right balance—using AI as a tool, but still keeping their minds fit and challenged.”This balanced approach ensures that while AI accelerates routine coding, it doesn’t “dumb down” the team’s critical thinking. “If you stop challenging the [AI’s] recommendations, they run the risk of dumbing down the reasoning. The true risk is in placing naive faith in quick fixes,” cautions Sammi Li, co-founder and CEO of JuCoin, noting that AI can expedite work but must not replace understanding. It falls on the EM to ensure this balance is maintained both for the team’s and the business’ benefit.What the Shift to Engineering Management Really Looks LikeThe move from senior engineer to EM is often misunderstood—frequently treated as a natural promotion rather than a deliberate change in function. But this is not a bigger version of the same job. It’s a transition into a fundamentally different role, with a new definition of success and a new center of gravity. Here is what development and EMs say about their shift from development to management felt like.You stop being measured by what you ship. Engineers derive a tangible sense of accomplishment from writing code and seeing it run in production. That feedback loop is fast and direct. Management breaks that loop. “As an EM, you’re not the one building the things,” says Nick Centino, Principal Engineering Manager at Microsoft. “You’re helping empower others to build the things more effectively”. This shift—away from hands-on output and toward enabling others—can take years to internalize. Centino himself spent nearly eight years in a dual role before realizing his highest leverage was no longer in the code.You have to redefine what ‘impact’ means. Orosz writes: “As an engineering manager, you’ll need to put company first, team second, and your team members third. And I would also add: yourself as fourth”. That’s a reversal from the individual contributor mindset, where engineers focus on executing their own tasks and helping teammates as needed. The EM role requires strategic alignment across teams—not just personal productivity.You stop optimizing for technical challenges. Engineers advance by solving complex problems. Managers progress by preventing them. As Klepov writes, “Of all the possible career moves a seasoned engineer can make, switching to management gives you the most new challenges...without hitting your salary”. But these challenges are rarely technical. They involve process alignment, team dynamics, emotional management, and cross-functional friction. As Romano puts it: “Most of what we do is fairly routine...The real challenges lie in everything around the code”.Your working memory breaks down. Many new managers underestimate the cognitive overhead of managing a team. Orosz notes that while ICs can often track all their tasks in their head, managers can't: “As a manager, I have far more things to pay attention to…Keeping all of this in my head doesn’t work well for me—so I’ve started to write things down”. Time and task management become not just useful, but essential.You spend less time writing code—and often none at all. The drop is not optional; it’s structural. According to Centino, once you manage five or six people, meaningful individual contribution becomes unsustainable without either cutting corners or burning out. Even if you retain technical context, your job is no longer to build—it’s to coach, unblock, coordinate, and align. 
“If you feel like you have time to code,” Centino warns, “you’re either working long hours or not spending enough time with your team”.You enter the domain of slow, uncertain feedback. ICs can validate ideas quickly: deploy a fix, measure a metric, refactor a function. Managers don’t get that immediacy. Feedback loops are long and ambiguous. “Very few of your actions produce a visible result in under a month,” Klepov notes. “Even the right changes can make things get worse before they get better”.You have to manage people, not just lead them. This distinction matters. Leadership is about vision and influence. Management is about one-on-ones, reviews, process hygiene, and psychological safety. “There’s a lot of peopling involved,” Centino says. “You need to be listening to people, understand them, spend time with them”. For many introverted engineers, that’s emotionally exhausting—but non-negotiable. Skipping the people work results in burnout, distrust, and attrition.You give up control, but remain accountable. Orosz captures the paradox: as a tech lead, you can write code and drive decisions. As a manager, you may do neither—but you’re still responsible for outcomes. That means learning to influence without coding, to steer without micromanaging, and to delegate without detaching.None of this means the shift is a demotion of technical skill. If anything, it requires expanding your judgment from systems to humans. As Romano puts it, “The skills we learn as developers aren’t confined to software. They transfer to life”. But it is a shift. And for those unprepared, it can be jarring. As Centino warns, “Engineering management and individual contribution are completely different roles”.Is Moving into Management Right for you?A move into management is often seen as the natural career progression after senior developer or tech lead. However, not everyone is suited to be a development manager – and that’s okay.“Managing people is a completely different skill set,” Romano candidly remarks. “If you’re someone who’s drawn to logic, machines, and technical problems—and you’re not interested in helping people grow—then you probably shouldn’t go down the management path.”Strong coding ability alone does not guarantee success in leadership. The core of the development manager role, Romano says, comes down to a genuine desire to care for people:“That’s what this job is really about: doing your best to help the people you manage become healthier, happier, more skilled professionals – and hopefully better human beings too.”If that mission excites you more than writing code yourself, it’s a sign you might find the management path rewarding.Despite the persistent narrative that “eventually you’re going to become an engineering manager,” Centino points out: engineering management and individual contribution are “completely different roles” with different success criteria, daily rhythms, and reward systems. The most common trap is assuming that strong technical performance qualifies someone to lead people. As Romano puts it,“In our industry, we often promote people into management roles just because they’re technically strong. But managing people is a completely different skill set”. For those drawn to logic, systems, and clean abstractions, people management may feel frustrating and opaque. “People aren’t logical like machines,” Romano warns. “Managing them requires effort, empathy, and patience”.The core question isn’t whether you can manage—it’s whether you want to. 
“I do think it’s important to have a solid foundation in software development before stepping into this role,” Romano says. But that’s table stakes. What distinguishes successful managers is not technical depth, but a “genuine desire to care for people”.Centino echoes this point:“As an engineering manager, I like to focus most of my attention and effort into growing individuals on the team… If I can align that with the direction the business is heading, then I think we have a great recipe”.But if that alignment never comes—if writing code is still your deepest source of satisfaction—management may not be the right move.Self-awareness, not seniority, should drive the decision.“This type of thing will change over time,” Centino notes. “I found myself in a dual role for eight years and didn’t really know until the end… what I really felt would motivate me most”.Regular reflection, honest conversations with your manager, and exposure to the demands of the role are more reliable indicators than promotion ladders or external expectations.As Romano says, “If you’re only doing it because it’s your next step, or because someone handed you the role, it can be tough”. But if helping others grow feels like a worthwhile use of your time—and you’re willing to trade code for conversations, and systems for people, you may be ready to step into the role.Making the Move: Traits of Successful EMsIf you feel you fit the bill and are ready to take on the challenges that come with managing software teams today, start by building a foundation of both technical and leadership experience:Learn to manage time and context switching deliberately: Orosz emphasizes that time management shifts from “maker schedule” to “manager schedule.” Future EMs should practice structuring recurring meetings, protect deep work time, and use lightweight systems, for e.g., Getting Things Done (GTD), to track tasks across people and priorities—not just their own.Get fluent in setting and supporting growth goals—for others and yourself: As a manager, you won’t just pursue your own goals—you’ll guide others in theirs. Orosz suggests practicing this by helping peers articulate growth goals, using role frameworks where available. Future EMs must also apply the same discipline to their own goal-setting, or risk drift.Seek and learn from mentors before the transition: Orosz didn’t wait until he was fully in the role—he proactively asked his management chain to connect him with internal mentors who understood the company’s management expectations. Engineers eyeing management should do the same, asking for guidance and observation opportunities ahead of time.Develop the habit of reflection, not just execution: Romano and Orosz both stress the importance of stepping back. Engineers often optimize for output; future managers must learn to observe team dynamics, reflect on what’s working, and adapt. Orosz models this by reading, writing, attending conferences, and running lightweight experiments with how he works.Strengthen emotional awareness and communication range: Romano explicitly notes that successful managers listen closely, pick up non-verbal cues, and adjust their communication style to fit each team member. Aspiring EMs should build this muscle early by observing tone, response patterns, and interpersonal signals on their teams.Practice coaching and teaching—not just explaining: Romano compares great management to good teaching: if one explanation fails, try another. 
Aspiring EMs should practice helping others understand by adapting to their learning style—not defaulting to their own.Clarify your own motivation: Denis D., Software Engineering Manager at PaySaaS Technology and Romano both warn that without intrinsic interest in people and leadership, the transition becomes painful. Future EMs should reflect early: do they enjoy unblocking others? Does seeing someone else grow feel like progress? If not, they should reconsider the path.Build a low-friction system to stay tech-adjacent: Denis maintains a Notion glossary, logs unknown terms, and watches short tutorials to stay grounded in the tech domain even after moving into management. Aspiring EMs can adopt this habit early to prevent drift and preserve confidence in technical discussions.On the technical side, credibility matters: working several years as an engineer, shipping projects, and understanding the software development lifecycle from firsthand experience will make you a more empathetic and effective leader. As Romano notes, having been “under deadline pressure” or stuck on a stubborn bug helps you relate to the struggles your team faces – “that empathy makes you more effective as a manager.”Ex software development manager and author of Coding in Delphi, Nick Hodges’ words sum up the job of a software development team manager nicely,“Sometimes being a manager is hard—even impossible. Sometimes you have to give up being right and put the needs of the entire organization over yourself. Sometimes you have to balance protecting your people with being a loyal member of the management team. Sometimes you have to manage up as well as you manage down. Being right isn’t enough—being effective matters more.”If Romano’s reflections on team dynamics and career growth sparked your interest, his book Learn Python Programming offers a different kind of guidance, focused on building solid, modern Python skills. Now in its fourth edition, the book covers everything from core syntax and idioms to web APIs, CLI tools, and competitive programming techniques.Get the Book🛠️Tool of the Week⚒️Backstage: Open-Source Developer PortalBackstage provides a central Software Catalog, project templates, and “docs-as-code” infrastructure (TechDocs) so teams can standardize their architecture, onboarding and documentation. 
For engineering managers, this means you can enforce coding standards and best practices (via templates and catalogs), keep architecture and ownership information up-to-date, and give developers self-service access to resources.Learn more about Backstage📰 Tech BriefsBuilding Strategic Influence as a Staff Engineer or Engineering Manager by Mark Allen, Engineering Leader & Technical Co-Founder @ Isometric: Outlines how staff engineers and engineering managers can build strategic influence by identifying business priorities, acting with curiosity beyond their role, cultivating cross-functional relationships, shaping their internal brand, and selectively saying yes to high-impact opportunities to grow their organizational visibility and impact.How Staff+ Engineers Can Develop Strategic Thinking by Shweta Saraf, Director of Network and Infra Management @Netflix: Explains how to odevelop strategic thinking by diagnosing organizational needs, aligning technical decisions with business goals, influencing cross-functional stakeholders, and balancing innovation with risk—emphasizing that strategic impact stems as much from mindset and relationship-building as from technical expertise.The AI productivity paradox in software engineering: Balancing efficiency and human skill retention: AI adoption in software engineering is creating a productivity paradox—delivering short-term task efficiency while eroding system performance, cognitive skills, and governance, unless teams integrate AI responsibly with oversight, skill development, and systemic alignment.That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.Take a moment to fill out this short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.We’ll be back next week with more expert-led content.Stay awesome,Divya Anne SelvarajEditor-in-Chief, Deep EngineeringTake the Survey, Get a Packt Credit!If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want toadvertise with us.*{box-sizing:border-box}body{margin:0;padding:0}a[x-apple-data-detectors]{color:inherit!important;text-decoration:inherit!important}#MessageViewBody a{color:inherit;text-decoration:none}p{line-height:inherit}.desktop_hide,.desktop_hide table{mso-hide:all;display:none;max-height:0;overflow:hidden}.image_block img+div{display:none}sub,sup{font-size:75%;line-height:0}#converted-body .list_block ol,#converted-body .list_block ul,.body [class~=x_list_block] ol,.body [class~=x_list_block] ul,u+.body .list_block ol,u+.body .list_block ul{padding-left:20px} @media (max-width: 100%;display:block}.mobile_hide{min-height:0;max-height:0;max-width: 100%;overflow:hidden;font-size:0}.desktop_hide,.desktop_hide table{display:table!important;max-height:none!important}}

Deep Engineering #6: Imran Ahmad on Algorithmic Thinking, Scalable Systems, and the Rise of AI Agents

Divya Anne Selvaraj
26 Jun 2025
How classical algorithms and real-world trade-offs will shape the next generation of software#6Imran Ahmad on Algorithmic Thinking, Scalable Systems, and the Rise of AI AgentsHow classical algorithms, system constraints, and real-world trade-offs will shape the next generation of intelligent softwareWorkshop: Unpack OWASP Top 10 LLMs with SnykJoin Snyk and OWASP Leader Vandana Verma Sehgal on Tuesday, July 15 at 11:00AM ET for a live session covering:✓ The top LLM vulnerabilities✓ Proven best practices for securing AI-generated code✓ Snyk’s AI-powered tools automate and scale secure dev.See live demos plus earn 1 CPE credit!Register todayHi Welcome to the sixth issue of Deep Engineering.A recent IBM and Morning Consult survey found that 99% of enterprise developers are now exploring or developing AI agents. Some have even christened 2025 as “the year of the AI agent”. We are experiencing a shift from standalone models to agentic systems.To understand what this shift means for developers we spoke with Imran Ahmad, data scientist at the Canadian Federal Government’s Advanced Analytics Solution Center (A2SC) and visiting professor at Carleton University. Ahmad is also the author of 50 Algorithms Every Programmer Should Know (Packt, 2023) and is currently working on his highly anticipated next book with us, 30 Agents Every AI Engineer Should Know, due out later this year. He has deep experience working on real-time analytics frameworks, multimedia data processing, and resource allocation algorithms in cloud computing.You can watch the full interview and read the transcript here—or keep reading for our take on the algorithmic mindset that will define the next generation of agentic software.Sign Up |AdvertiseFrom Models to Agents with Imran AhmadAccording to Gartner by 2028, 90% of enterprise software engineers will use AI code assistants (up from under 14% in early 2024). But we are already moving beyond code assistants to agents: software entities that don’t just respond to prompts, but plan, reason, and act by orchestrating tools, models, and infrastructure independently.“We have a lot of hope around AI – that it can eventually replace a human,” Ahmad says. “But if you think about how a person in a company solves a problem, they rely on a set of tools… After gathering information, they create a solution. An ‘agent’ is meant to replace that kind of human reasoning. It should be able to discover the tools in the environment around it, and have the wisdom to orchestrate a solution tailored to the problem. We're not there yet, but that's what we're striving for.”This vision aligns with where industry leaders are headed. Maryam Ashoori, Director of Product Management, IBM watsonx.ai concurs that 2025 is “the year of the AI agent”, and a recent IBM and Morning Consult survey found 99% of enterprise developers are now exploring or developing AI agents. Major platforms are rushing to support this paradigm: for instance, at Build 2025 Microsoft announced an Azure AI Agent Service to orchestrate multiple specialized agents as modular microservices. Such developments underscore the momentum behind agent-based architectures – which Igor Fedulov, CEO of Intersog, in an article for Forbes Technology Council, predicts will be a defining software trend by the end of 2025. 
Ahmad predicts this to be “the next generation of the algorithmic world we live in.”What is an agent?An AI agent is more than just a single model answering questions – it’s a software entity that can plan, call on various tools (search engines, databases, calculators, other models, etc.), and execute multi-step workflows to achieve a goal. “An agent is an entity that has the wisdom to work independently and autonomously,” Ahmad explains. “It can explore its environment, discover available tools, select the right ones, and create a workflow to solve a specific problem. That’s the dream agent.” Today’s implementations only scratch the surface of that ideal. For example, many so-called agents are basically LLMs augmented with function-calling abilities (tool APIs) – useful, but still limited in reasoning. Ahmad emphasizes that “a large language model is not the only tool. It’s perhaps the most important one right now, but real wisdom lies outside the LLM – in the agent.” In other words, true intelligence emerges from how an agent chooses and uses an ecosystem of tools, not just from one model’s output.The Practitioner’s Lens: Driving vs. Building the EngineEven as new techniques emerge, software professionals must decide how deep to go into theory. Ahmad draws a line between researchers and practitioners when it comes to algorithms. The researcher may delve into proofs of optimality, complexity theory, or inventing new algorithms. The practitioner, however, cares about applying algorithms effectively to solve real problems. Ahmad uses an analogy to explain this:“Do you want to build a car and understand every component of the engine? Or do you just want to drive it? If you want to drive it, you need to know the essentials – how to maintain it – but not necessarily every internal detail. That’s the practitioner role.”A senior engineer doesn’t always need to derive equations from scratch, but they do need to know the key parameters, limitations, and maintenance needs of the algorithmic “engines” they use.Ahmad isn’t advocating ignorance of theory. In fact, he stresses that having some insight under the hood improves decision-making. “If you know a bit more about how the engine works, you can choose the right car for your needs,” he explains. Similarly, knowing an algorithm’s fundamentals (even at a high level) helps an engineer pick the right tool for a given job. For example: Is your search problem better served by a Breadth-First Search (BFS) or Depth-First Search (DFS) approach? Would a decision tree suffice, or do you need the boost in accuracy from an ensemble method? Experienced engineers approach such questions by combining intuition with algorithmic knowledge – a very practical kind of expertise. Ahmad’s advice is to focus on the level of understanding that informs real-world choices, rather than getting lost in academic detail irrelevant to your use case.Algorithm Choices and Real-World ScalabilityIn the wild, data is messy and scale is enormous – revealing which algorithms truly perform. “When algorithms are taught in universities… they’re usually applied to small, curated datasets. I call this ‘manicured pedicure data.’ But that’s not real data,” Ahmad quips. 
In his career as a public-sector data scientist, he routinely deals with millions of records and offers three key insights that shape how engineers should approach algorithm selection in production environments:Performance at scale requires different choices than in theory: Ahmad uses an example from his experience when he applied the Apriori algorithm (a well-known method for association rule mining). “When I used Apriori in practice, I found it doesn’t scale,” he admits. “It generates thousands of rules and then filters them after the fact. There’s a newer, better algorithm called (Frequent Pattern) FP-Growth that does the filtering at the source. It only generates the rules you actually need, making it far more scalable.” A theoretically correct algorithm can become unusable when faced with big data volumes or strict latency requirements.Non-functional requirements often determine success: Beyond just picking the right algorithm, non-functional requirements like performance, scalability, and reliability must guide engineering decisions. “In academia, we focus on functional requirements… ‘this algorithm should detect fraud.’ And yes, the algorithm might technically work. But in practice, you also have to consider how it performs, how scalable it is, whether it can run as a cloud service, and so on.” Robust software needs algorithms that meet functional goals and the operational demands of deployment (throughput, memory, cost, etc.).Start simple, escalate only as needed: Simpler algorithms are easier to implement, explain, and maintain – valuable qualities especially in domains like finance or healthcare where interpretability matters. While discussing predictive models, Ahmad describes an iterative approach – perhaps begin with intuitive rules, upgrade to a decision tree for more structure, then if needed move to a more powerful model like XGBoost or an SVM. Jumping straight to a deep neural net can be overkill for a simple classification. “It’s usually a mistake to begin with something too complex – it can be overkill, like using a forklift to lift a sheet of paper,” he says.However, algorithmic choices don’t occur in a vacuum – they influence and are influenced by software architecture. Modern systems, especially AI systems, have distinct phases (training, testing, inference) and often run in distributed cloud environments. Engineers therefore must integrate algorithmic thinking into high-level design and infrastructure decisions.Bridging Algorithms and Architecture in PracticeTake the example of training a machine learning model versus deploying it. “During training, you need a lot of data... a lot of processing power – GPUs, ideally. It’s expensive and time-consuming,” Ahmad notes. This is where cloud architecture shines. “The cloud gives you elastic architectures – you can spin up 2,000 nodes for 2 or 10 hours, train your model, and then shut it down. The cost is manageable…and you’re done.” Cloud platforms allow an elastic burst of resources: massive parallelism for a short duration, which can turn a week-long training job into a few hours for a few hundred dollars. Ahmad highlights that this elasticity was simply not available decades ago in on-prem computing. Today, any team can rent essentially unlimited compute for a day, which removes a huge barrier in building complex models. “If you want to optimize for cost and performance, you need elastic systems. 
Cloud computing… offers exactly that” for AI workloads, he says.Once trained, the model often compresses down to a relatively small artifact (Ahmad jokes that the final model file is “like the tail of an elephant – tiny compared to the effort to build it”). Serving predictions might only require a lightweight runtime that can even live on a smartphone. Thus, the hardware needs vary drastically between phases: heavy GPU clusters for training; maybe a simple CPU or even embedded device for inference. Good system design accommodates these differences – e.g., by separating training pipelines from inference services, or using cloud for training but edge devices for deployment when appropriate.So, how does algorithm choice drive architecture? Ahmad recommends evaluating any big design decision on three axes: cost, performance, and time-to-deliver. If adopting a more sophisticated algorithm (or distributed processing framework, etc.) will greatly improve accuracy or speed and the extra cost is justified, it may be worth it. “First, ask yourself: does this problem justify the additional complexity…? Then evaluate that decision along three axes: cost, performance, and time,” he advises. “If an algorithm is more accurate, more time-efficient, and the cost increase is justified, then it’s probably the right choice.” On the flip side, if a fancy algorithm barely improves accuracy or would bust your budget/latency requirements, you might stick with a simpler approach that you can deploy more quickly. This trade-off analysis – weighing accuracy vs. expense vs. speed – is a core skill for architects in the age of AI. It prevents architecture astronautics (over-engineering) by ensuring complexity serves a real purpose.Classical Techniques: The Unsung Heroes in AI SystemsAhmad views classical computer science algorithms and modern AI methods as complementary components of a solution.“Take search algorithms, for instance,” Ahmad elaborates. “When you're preparing datasets for AI… you often have massive data lakes – structured and unstructured data all in one place. Now, say you're training a model for fraud detection. You need to figure out which data is relevant from that massive repository. Search algorithms can help you locate the relevant features and datasets. They support the AI workflow by enabling smarter data preparation.” Before the fancy model ever sees the data, classical algorithms may be at work filtering and finding the right inputs. Similarly, Ahmad points out, classic graph algorithms might be used to do link analysis or community detection that informs feature engineering. Even some “old-school” NLP (like tokenization or regex parsing) can serve as preprocessing for LLM pipelines. These building blocks ensure that the complex AI has quality material to work with.Ahmad offers an apt metaphor:“Maybe AI is your ‘main muscle,’ but to build a strong body – or a performant system – you need to train the supporting muscles too. Classical algorithms are part of that foundation.”Robust systems use the best of both worlds. For example, he describes a hybrid approach in real-world data labeling. In production, you often don’t have neat labeled datasets; you have to derive labels or important features from raw data. Association rule mining algorithms like Apriori or FP-Growth (from classical data mining) can uncover patterns. These patterns might suggest how to label data or which combined features could predict an outcome. 
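To make that pattern-mining step concrete, here is a minimal sketch of association rule mining with FP-Growth. It uses the open-source mlxtend library purely for illustration: Ahmad does not prescribe a specific tool, the toy transaction list and thresholds are invented, and exact function signatures can vary across mlxtend versions.

```python
# Minimal illustrative sketch: association rule mining with FP-Growth via the
# open-source mlxtend library (pip install mlxtend pandas). The transactions
# below are toy data; a real pipeline would pull them from a data lake or DB.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth, association_rules

transactions = [
    ["milk", "cheese", "bread"],
    ["milk", "cheese"],
    ["milk", "bread"],
    ["cheese", "bread"],
    ["milk", "cheese", "eggs"],
]

# One-hot encode the transactions into a boolean DataFrame.
encoder = TransactionEncoder()
onehot = encoder.fit(transactions).transform(transactions)
df = pd.DataFrame(onehot, columns=encoder.columns_)

# FP-Growth only generates itemsets that clear the support threshold,
# instead of generating every candidate and filtering afterwards (Apriori).
frequent_itemsets = fpgrowth(df, min_support=0.4, use_colnames=True)

# Convert frequent itemsets into rules such as {milk} -> {cheese}.
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```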
“If you feed transaction data into FP-Growth, it will find relationships – like if someone buys milk, they’re likely to buy cheese too… These are the kinds of patterns the algorithm surfaces,” Ahmad explains. Here, a classical unsupervised algorithm helps define the inputs to a modern supervised learning task – a symbiosis that improves the overall system.Foundational skills like devising efficient search strategies, using dynamic programming for optimal substructure problems, or leveraging sorting and hashing for data organization are still extremely relevant. They might operate behind the scenes of an AI pipeline or bolster the infrastructure (e.g., database indexing, cache eviction policies, etc.) that keeps your application fast and reliable. Ahmad even notes that Google’s hyperparameter tuning service, Vizier, is “based on classical heuristic algorithms” rather than any neural network magic – yet it significantly accelerates model optimization.Optimization: The (Absolute) Necessity of Efficiency“Math can be cruel,” Ahmad warns. “If you’re not careful, your problem might never converge… If you accidentally introduce an exponential factor in the wrong place, it might take years – or even centuries – for the solution to converge. The sun might die before your algorithm finishes!” This colorful exaggeration underscores a serious point: computational complexity can explode quickly, and engineers need to be vigilant. It’s not acceptable to shrug off inefficiencies with “just let it run longer” if the algorithmic complexity is super-polynomial. “Things can spiral out of control very quickly. That’s why optimization isn't a luxury – it’s a necessity,” Ahmad says.Ahmad talks about three levels at which we optimize AI systems:Hardware: Choosing the right compute resources can yield massive speedups. For example, training a deep learning model on a GPU or TPU vs. a CPU can be orders of magnitude faster. “For deep learning especially, using a GPU can speed up training by a factor of 1,000,” Ahmad notes, based on his experience. So, part of an engineer’s algorithmic thinking is knowing when to offload work to specialized hardware, or how to parallelize tasks across a cluster.Hyperparameter tuning and algorithmic settings: Many algorithms (especially in machine learning) have knobs to turn – learning rate, tree depth, number of clusters, etc. The wrong settings can make a huge difference in both model quality and compute time. Traditionally, tuning was an art of trial and error. But now, tools like Google’s Vizier (and open-source libraries for Bayesian optimization) can automate this search efficiently.Ensuring the problem is set up correctly: A common mistake is diving into training without examining the data’s signal-to-noise ratio. Ahmad recommends the CRISP-DM approach – spend ample time on data understanding and preparation. “Let’s say your dataset has a lot of randomness and noise. If there's no clear signal, then even a Nobel Prize–winning scientist won’t be able to build a good model,” he says. “So, you need to assess your data before you commit to AI.” This might involve using statistical analysis or simple algorithms to verify that patterns exist. “Use classical methods to ensure that your data even has a learnable pattern. Otherwise, you’re wasting time and resources,” Ahmad advises.The cost of compute – and the opportunity cost of engineers’ time – is too high to ignore optimization. 
Or as Ahmad bluntly puts it, “It’s not OK to say, ‘I’m not in a hurry, I’ll just let it run.’” Competitive teams optimize both to push performance and to control time/cost, achieving results that are fast, scalable, and economically sensible.Learning by Doing: Making Algorithms StickMany developers first encounter algorithms as leetcode-style puzzles or theoretical exercises for interviews. But how can they move beyond rote knowledge to true mastery? Ahmad’s answer: practice on real problems. “Learning algorithms for interviews is a good start… it shows initiative,” he acknowledges. “But in interview prep, you're not solving real-world problems… To truly make algorithmic knowledge stick, you need to use algorithms to solve actual problems.”In the artificial setting of an interview question, you might code a graph traversal or a sorting function in isolation. The scope is narrow and hints are often provided by the problem constraints. Real projects are messier and more holistic. When you set out to build something end-to-end, you quickly uncover gaps in your knowledge and gain a deeper intuition. “That’s when you'll face real challenges, discover edge cases, and realize that you may need to know other algorithms just to get your main one working,” Ahmad says. Perhaps you’re implementing a network flow algorithm but discover you need a good data structure for priority queues to make it efficient, forcing you to learn or recall heap algorithms. Or you’re training a machine learning model and hit a wall until you implement a caching strategy to handle streaming data. Solving real problems forces you to integrate multiple techniques, and shows how classical and modern methods complement each other in context. Ahmad puts it succinctly: “There’s an entire ecosystem – an algorithmic community – that supports every solution. Classical and modern algorithms aren’t separate worlds. They complement each other, and a solid understanding of both is essential.”So, what’s the best way to gain this hands-on experience? Ahmad recommends use-case-driven projects, especially in domains that matter to you. He suggests tapping into the wealth of public datasets now available. “Governments around the world are legal custodians of citizen data… If used responsibly, this data can change lives,” he notes. Portals like data.gov host hundreds of thousands of datasets spanning healthcare, transportation, economics, climate, and more. Similar open data repositories exist for other countries and regions. These aren’t sanitized toy datasets – they are real, messy, and meaningful. “Choose a vertical you care about, download a dataset, pick an algorithm, and try to solve a problem. That’s the best way to solidify your learning,” Ahmad advises. The key is to immerse yourself in a project where you must apply algorithms end-to-end: from data cleaning and exploratory analysis, to choosing the right model or algorithmic approach, through optimization and presenting results. 
This process will teach more than any isolated coding puzzle, and the lessons will stick because they’re tied to real outcomes.Yes, 2025 is “the year of the AI agent”, but as the industry shifts from standalone models to agentic systems, engineers must learn to pair classical algorithmic foundations with real-world pragmatism, because in this era of AI agents, true intelligence lies not only in models, but in how wisely we orchestrate them.If Ahmad’s perspective on real-world scalability and algorithmic pragmatism resonated with you, his book 50 Algorithms Every Programmer Should Know goes deeper into the practical foundations behind today’s AI systems. The following excerpt explores how to design and optimize large-scale algorithms for production environments—covering parallelism, cloud infrastructure, and the trade-offs that shape performant systems.🧠Expert Insight: Large-Scale Algorithms by Imran AhmadThe complete “Chapter 15: Large‑Scale Algorithms” from the book 50 Algorithms Every Programmer Should Know by Imran Ahmad (Packt, September 2023).Large-scale algorithms are specifically designed to tackle sizable and intricate problems. They distinguish themselves by their demand for multiple execution engines due to the sheer volume of data and processing requirements. Examples of such algorithms include Large Language Models (LLMs) like ChatGPT, which require distributed model training to manage the extensive computational demands inherent to deep learning. The resource-intensive nature of such complex algorithms highlights the requirement for robust, parallel processing techniques critical for training the model.In this chapter, we will start by introducing the concept of large-scale algorithms and then proceed to discuss the efficient infrastructure required to support them. Additionally, we will explore various strategies for managing multi-resource processing. Within this chapter, we will examine the limitations of parallel processing, as outlined by Amdahl’s law, and investigate the use of Graphics Processing Units (GPUs).Read the Complete Chapter50 Algorithms Every Programmer Should Know by Imran Ahmad (Packt, September 2023) is a practical guide to algorithmic problem-solving in real-world software. Now in its second edition, the book covers everything from classical data structures and graph algorithms to machine learning, deep learning, NLP, and large-scale systems.For a limited time, get the eBook for $9.99 at packtpub.com — no code required.Get the Book🛠️Tool of the Week⚒️OSS Vizier — Production-Grade Black-Box Optimization from GoogleOSS Vizier is a Python-based, open source optimization service built on top of Google Vizier—the system that powers hyperparameter tuning and experiment optimization across products like Search, Ads, and YouTube. 
Now available to the broader research and engineering community, OSS Vizier brings the same fault-tolerant, scalable architecture to a wide range of use cases—from ML pipelines to physical experiments.Highlights:Flexible, Distributed Architecture: Supports RPC-based optimization via gRPC, allowing Python, C++, Rust, or custom clients to evaluate black-box objectives in parallel or sequentially.Rich Integration Ecosystem: Includes native support for PyGlove, TensorFlow Probability, and Vertex Vizier—enabling seamless connection to evolutionary search, Bayesian optimization, and cloud workflows.Research-Ready: Comes with standardized benchmarking APIs, a modular algorithm interface, and compatibility with AutoML tooling—ideal for evaluating and extending new optimization strategies.Resilient and Extensible: Fault-tolerant by design, with evaluations stored in SQL-backed datastores and support for retry logic, partial failure, and real-world constraints (e.g., human-evaluated objectives or lab settings).Learn more about OSS Vizier📰 Tech BriefsAI agents in 2025: Expectations vs. reality by Ivan Belcic and Cole Stryker, IBM Think: In 2025, AI agents are widely touted as transformative tools for work and productivity, but experts caution that while experimentation is accelerating, current capabilities remain limited, true autonomy is rare, and success depends on governance, strategy, and realistic expectations.Agent Mode for Gemini added to Android Studio: Google has introduced Agent Mode for Gemini in Android Studio, enabling developers to describe high-level goals that the agent can plan and execute—such as fixing build errors, adding dark mode, or generating UI from a screenshot—while allowing user oversight, feedback, and iteration, with expanded context support via Gemini API and MCP integration.Google’s Agent2Agent protocol finds new home at the Linux Foundation: Google has donated its Agent2Agent (A2A) protocol—a standard for enabling interoperability between AI agents—to the Linux Foundation, aiming to foster vendor-neutral, open development of multi-agent systems, with over 100 tech partners now contributing to its extensible, secure, and scalable design.Azure AI Foundry Agent Service GA Introduces Multi-Agent Orchestration and Open Interoperability: Microsoft has launched the Azure AI Foundry Agent Service into general availability, offering a modular, multi-agent orchestration platform that supports open interoperability, seamless integration with Logic Apps and external tools, and robust capabilities for monitoring, governance, and cross-cloud agent collaboration—all aimed at enabling scalable, intelligent agent ecosystems across diverse enterprise use cases.How AI Is Redefining The Way Software Is Built In 2025 by Igor Fedulov, CEO of Intersog: AI is transforming software development by automating tasks, accelerating workflows, and enabling more intelligent, adaptive systems—driving a shift toward agent-based architectures, cloud-native applications, and advanced technologies like voice and image recognition, while requiring developers to upskill in AI, data analysis, and security to remain competitive.That’s all for today. Thank you for reading this issue of Deep Engineering. 
We’re just getting started, and your feedback will help shape what comes next.Take a moment to fill out this short survey we run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.We’ll be back next week with more expert-led content.Stay awesome,Divya Anne SelvarajEditor-in-Chief, Deep EngineeringTake the Survey, Get a Packt Credit!If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Deep Engineering #5: Dhirendra Sinha (Google) and Tejas Chopra (Netflix) on Scaling, AI Ops, and System Design Interviews

Divya Anne Selvaraj
19 Jun 2025
Lessons on designing for failure and the importance of trade-off thinking#5Dhirendra Sinha (Google) and Tejas Chopra (Netflix) on Scaling, AI Ops, and System Design InterviewsFrom designing fault-tolerant systems at Big Tech and hiring for system design roles, Chopra and Sinha share lessons on designing for failure and the importance of trade-off thinkingHi Welcome to the fifth issue of Deep Engineering.With AI workloads reshaping infrastructure demands and distributed systems becoming the default, engineers are facing new failure modes, stricter trade-offs, and rising expectations in both practice and hiring.To explore what today’s engineers need to know, we spoke with Dhirendra Sinha (Software Engineering Manager at Google, and long-time distributed systems educator) and Tejas Chopra (Senior Engineer at Netflix and Adjunct Professor at UAT). Their recent book, System Design Guide for Software Professionals (Packt, 2024), distills decades of practical experience into a structured approach to design thinking.In this issue, we unpack their hard-won lessons on observability, fault tolerance, automation, and interview performance—plus what it really means to design for scale in a world where even one-in-a-million edge cases are everyday events.You can watch the full interview and read the transcript here—or keep reading for our distilled take on the design mindset that will define the next decade of systems engineering.Sign Up |AdvertiseJoin us on July 19 for a 150-minute interactive MCP Workshop. Go beyond theory and learn how to build and ship real-world MCP solutions. Limited spots available! Reserve your seat today.Use Code EARLY35 for 35% off!Designing for Scale, Failure, and the Future — With Dhirendra Sinha and Tejas Chopra“Foundational system design principles—like scalability, reliability, and efficiency—are remarkably timeless,” notes Chopra, adding that “the rise of AI only reinforces the importance of these principles.” In other words, new AI systems can’t compensate for poor architecture; they reveal its weaknesses. Sinha concurs: “If the foundation isn’t strong, the system will be brittle—no matter how much AI you throw at it.” AI and system design aren’t at odds – “they complement each other,” says Chopra, with AI introducing new opportunities and stress-tests for our designs.One area where AI is elevating system design is in AI-driven operations (AIOps). Companies are increasingly using intelligent automation for tasks like predictive autoscaling, anomaly detection, and self-healing.“There’s a growing demand for observability systems that can predict service outages, capacity issues, and performance degradation before they occur,” notes Sam Suthar, founding director of Middleware. AI-powered monitoring can catch patterns and bottlenecks ahead of failures, allowing teams to fix issues before users notice. At the same time, designing the systems to support AI workloads is a fresh challenge. The recent rollout of a Ghibli-style image generator saw explosive demand – so much that OpenAI’s CEO had to ask users to pause as GPU servers were overwhelmed. That architecture didn’t fully account for the parallelization and scale such AI models required. AI can optimize and automate a lot, but it will expose any gap in your system design fundamentals. As Sinha puts it, “AI is powerful, but it makes mastering the fundamentals of system design even more critical.”Scaling Challenges and Resilience in PracticeSo, what does it take to operate at web scale in 2025? 
Sinha highlights four key challenges facing large-scale systems today:Scalability under unpredictable load: global services must handle sudden traffic spikes without falling over or grossly over-provisioning. Even the best capacity models can be off, and “unexpected traffic can still overwhelm systems,” Sinha says.Balancing the classic trade-offs between consistency, performance, and availability: This remains as relevant as ever. In practice, engineers constantly juggle these – and must decide where strong consistency is a must versus where eventual consistency will do.Security and privacy at scale have grown harder: Designing secure systems for millions of users, with evolving privacy regulations and threat landscapes, is an ongoing battle.The rise of AI introduces “new uncertainties”: we’re still learning how to integrate AI agents and AI-driven features safely into large architectures.Chopra offers an example from Netflix: “We once had a live-streaming event where we expected a certain number of users – but ended up with more than three times that number.” The system struggled not because it was fundamentally mis-designed, but due to hidden dependency assumptions. In a microservices world, “you don’t own all the parts—you depend on external systems. And if one of those breaks under load, the whole thing can fall apart,” Chopra warns. A minor supporting service that wasn’t scaled for 3× traffic can become the linchpin that brings down your application. This is why observability is paramount. At Netflix’s scale (hundreds of microservices handling asynchronous calls), tracing a user request through the maze is non-trivial. Teams invest heavily in telemetry to know “which service called what, when, and with what parameters” when things go wrong. Even so, “stitching together a timeline can still be very difficult” in a massive distributed system, especially with asynchronous workflows. Modern observability tools (distributed tracing, centralized logging, etc.) are essential, and even these are evolving with AI assistance to pinpoint issues faster.So how do Big Tech companies approach scalability and robustness by design? One mantra is to design for failure. Assume everything will eventually fail and plan accordingly. “We operate with the mindset that everything will fail,” says Chopra. That philosophy birthed tools like Netflix’s Chaos Monkey, which randomly kills live instances to ensure the overall system can survive outages. If a service or an entire region goes down, your architecture should gracefully degrade or auto-heal without waking up an engineer at 2 AM. Sinha recalls an incident from his days at Yahoo:“I remember someone saying, “This case is so rare, it’s not a big deal,” and the chief architect replied, “One in a million happens every hour here.” That’s what scale does—it invalidates your assumptions.”In high-scale systems, even million-to-one chances occur regularly, so no corner case is truly negligible. In Big Tech, achieving resilience at scale has resulted in three best practices:Fault-tolerant, horizontally scalable architectures: At Netflix and other companies, such architectures ensure that if one node or service dies, the load redistributes and the system heals itself quickly. Teams focus not just on launching features but “landing” them safely – meaning they consider how each new deployment behaves under real-world loads, failure modes, and even disaster scenarios. Automation is key: from continuous deployments to automated rollback and failover scripts. 
“We also focus on automating everything we can—not just deployments, but also alerts. And those alerts need to be actionable,” Sinha says.Explicit capacity planning and graceful degradation: Engineers define clear limits for how much load a system can handle and build in back-pressure or shedding mechanisms beyond that. Systems often fail when someone makes unrealistic assumptions about unlimited capacity. Caching, rate limiting, and circuit breakers become your safety net. Gradual rollouts further boost robustness. “When we deploy something new, we don’t release it to the entire user base in one go,” Chopra explains. Whether it’s a new recommendation algorithm or a core infrastructure change, Netflix will enable it for a small percentage of users or in one region first, observe the impact, then incrementally expand if all looks good. This staged rollout limits the blast radius of unforeseen issues. Feature flags, canary releases, and region-by-region deployments should be standard operating procedure.Infrastructure as Code (IaC): Modern infrastructure tooling also contributes to resiliency. Many organizations now treat infrastructure as code, defining their deployments and configurations in declarative scripts. As Sinha notes, “we rely heavily on infrastructure as code—using tools like Terraform and Kubernetes—where you define the desired state, and the system self-heals or evolves toward that.” By encoding the target state of the system, companies enable automated recovery; if something drifts or breaks, the platform will attempt to revert to the last good state without manual intervention. This codified approach also makes scaling and replication more predictable, since environments can be spun up from the same templates.These same principles—resilience, clarity, and structured thinking—also underpin how engineers should approach system design interviews.Mastering the System Design InterviewCracking the system design interview is a priority for many mid-level engineers aiming for senior roles, and for good reason. Sinha points out that system design skill isn’t just a hiring gate – it often determines your level/title once you’re in a company. Unlike coding interviews where problems have a neat optimal solution, “system design is messy. You can take it in many directions, and that’s what makes it interesting,” Sinha says. Interviewers want to see how you navigate an open-ended problem, not whether you can memorize a textbook solution. Both Sinha and Chopra emphasize structured thinking and communication. Hiring managers deliberately ask ambiguous or underspecified questions to see if the candidate will impose structure: Do they ask clarifying questions? Do they break the problem into parts (data storage, workload patterns, failure scenarios, etc.)? Do they discuss trade-offs out loud? Sinha and Chopra offer two guidelines:There’s rarely a single “correct” answer: What matters is reasoning and demonstrating that you can make sensible trade-offs under real-world constraints. “It’s easy to choose between good and bad solutions,” Sinha notes, “but senior engineers often have to choose between two good options. I want to hear their reasoning: Why did you choose this approach? What trade-offs did you consider?” A strong candidate will articulate why, say, they picked SQL over NoSQL for a given scenario – and acknowledge the downsides or conditions that might change that decision. In fact, Chopra may often follow up with “What if you had 10× more users? 
Would your choice change?” to test the adaptability of a candidate’s design. He also likes to probe on topics like consistency models: strong vs eventual consistency and the implications of the CAP theorem. Many engineers “don’t fully grasp how consistency, availability, and partition tolerance interact in real-world systems,” Chopra observes, so he presents scenarios to gauge depth of understanding.Demonstrate a collaborative, inquisitive approach: A system design interview shouldn’t be a monologue; it’s a dialogue. Chopra says, “I try to keep the interview conversational. I want the candidate to challenge some of my assumptions.” For example, a candidate might ask: What are the core requirements? Are we optimizing for latency or throughput? or How many users are we targeting initially? — “that kind of questioning is exactly what happens in real projects,” Chopra explains. It shows the candidate isn’t just regurgitating a pre-learned architecture, but is actively scoping the problem like they would on the job. Strong candidates also prioritize requirements on the fly – distinguishing must-haves (e.g. high availability, security) from nice-to-haves (like an optional feature that can be deferred).Through years of interviews, Sinha and Chopra have noticed three common pitfalls:Jumping into solution-mode too fast: “Candidates don’t spend enough time right-sizing the problem,” says Chopra. “The first 5–10 minutes should be spent asking clarifying questions—what exactly are we designing, what are the constraints, what assumptions can we make?” Diving straight into drawing boxes and lines can lead you down the wrong path. Sinha agrees: “They hear something familiar, get excited, and dive into design mode—often without even confirming what they’re supposed to be designing. In both interviews and real life, that’s dangerous. You could end up solving the wrong problem.”Lack of structure – jumping randomly between components without a clear plan: This scattered approach makes it hard to know if you’ve covered the key areas. Interviewers prefer a candidate who outlines a high-level approach (e.g. client > service > data layer) before zooming in, and who checks back on requirements periodically.Poor time management: It’s common for candidates to get bogged down in details early (like debating the perfect database indexing scheme) and then run out of time to address other important parts of the system. Sinha and Chopra recommend practicing pacing yourself and being willing to defer some details. It’s better to have a complete, if imperfect, design than a perfect cache layer with no time to discuss security or analytics requirements. If an interviewer hints to move on or asks about an area you haven’t covered, take the cue. “Listen to the interviewer’s cues,” Sinha advises. “We want to help you succeed, but if you miss the hints, we can’t evaluate you properly.”Tech interviews in general have gotten more demanding in 2025. The format of system design interviews hasn’t drastically changed, but the bar is higher. Companies are more selective, sometimes even “downleveling” strong candidates if they don’t perfectly meet the senior criteria. Evan King and Stefan Mai, cofounders of an interview preparation startup, in an article in The Pragmatic Engineer observe, “performance that would have secured an offer in 2021 might not even clear the screening stage today”. This reflects a market where competition is fierce and expectations for system design prowess are rising. 
But as Chopra and Sinha illustrate, the goal is not to memorize solutions – it’s to master the art of trade-offs and critical thinking.Beyond Interviews: System Design as a Career CatalystSystem design isn’t just an interview checkbox – it’s a fundamental skill for career growth in engineering. “A lot of people revisit system design only when they're preparing for interviews,” Sinha says. “But having a strong grasp of system design concepts pays off in many areas of your career.” It becomes evident when you’re vying for a promotion, writing an architecture document, or debating a new feature in a design review.Engineers with solid design fundamentals tend to ask the sharp questions that others miss (e.g. What happens if this service goes down? or Can our database handle 10x writes?). They can evaluate new technologies or frameworks in the context of system impact, not just code syntax. Technical leadership roles especially demand this big-picture thinking. In fact, many companies now expect even engineering managers to stay hands-on with architecture – “system design skills are becoming non-negotiable” for leadership.Mastering system design also improves your technical communication. As you grow more senior, your success depends on how well you can simplify complexity for others – whether in documentation or in meetings. “It’s not just about coding—it’s about presenting your ideas clearly and convincingly. That’s a huge part of leadership in engineering,” Sinha notes. Chopra agrees, framing system design knowledge as almost a mindset: “System design is almost a way of life for senior engineers. It’s how you continue to provide value to your team and organization.” He compares it to learning math: you might not explicitly use the quadratic formula daily, but learning it trains your brain in problem-solving.Perhaps the most exciting aspect is that the future is wide open. “Many of the systems we’ll be working on in the next 10–20 years haven’t even been built yet,” Chopra points out. We’re at an inflection point with technologies like AI agents and real-time data streaming pushing boundaries; those with a solid foundation in distributed systems will be the “go-to” people to harness these advances. And as Chopra notes,“seniority isn’t about writing complex code. It’s about simplifying complex systems and communicating them clearly. That’s what separates great engineers from the rest.”System design proficiency is a big part of developing that ability to cut through complexity.Emerging Trends and Next Frontiers in System DesignWhile core principles remain steady, the ecosystem around system design is evolving rapidly. We can identify three significant trends:Integration of AI Agents with Software Systems: As Gavin Bintz writes in Agent One, an emerging trend is the integration of AI agents with everyday software systems. New standards like Anthropic’s Model Context Protocol (MCP), are making it easier for AI models to securely interface with external tools and services. You can think of MCP as a sort of “universal adapter” that lets a large language model safely query your database, call an API like Stripe, or post a message to Slack – all through a standardized interface. This development opens doors to more powerful, context-aware AI assistants, but it also raises architectural challenges. Designing a system that grants an AI agent limited, controlled access to critical services requires careful thought around authorization, sandboxing, and observability (e.g., tracking what the AI is doing). 
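One way to picture that kind of limited, auditable access is a thin gateway between the agent and internal services that enforces an allowlist and logs every call. The sketch below is plain Python and purely illustrative: it is not MCP or any real agent framework, and every name in it is hypothetical.

```python
# Illustrative sketch only: a minimal "tool gateway" that limits what an
# agent may call and records every invocation for later auditing.
# This is not MCP or any specific agent framework; all names are hypothetical.
import logging
from typing import Any, Callable, Dict

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool_gateway")

class ToolGateway:
    def __init__(self) -> None:
        self._allowed: Dict[str, Callable[..., Any]] = {}

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        """Explicitly allowlist a tool the agent is permitted to use."""
        self._allowed[name] = fn

    def call(self, name: str, **kwargs: Any) -> Any:
        """Invoke a tool on the agent's behalf, with auditing."""
        if name not in self._allowed:
            log.warning("blocked call to unregistered tool: %s", name)
            raise PermissionError(f"tool '{name}' is not allowlisted")
        log.info("agent called %s with %s", name, kwargs)
        return self._allowed[name](**kwargs)

# Hypothetical usage: only a read-only lookup is exposed to the agent.
def lookup_order(order_id: str) -> dict:
    return {"order_id": order_id, "status": "shipped"}

gateway = ToolGateway()
gateway.register("lookup_order", lookup_order)
print(gateway.call("lookup_order", order_id="A-123"))
```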
Chopra sees MCP as fertile ground for new system design patterns and best practices in the coming years.Deepening of observability and automation in system management: Imagine systems that not only detect an anomaly but also pinpoint the likely root cause across your microservices and possibly initiate a fix. As Sam Suthar, Founding Director at Middleware, observes, early steps in this direction are already in play – for example, tools that correlate logs, metrics, and traces across a distributed stack and use machine learning to identify the culprit when something goes wrong. The ultimate goal is to dramatically cut Mean Time to Recovery (MTTR) when incidents occur, using AI to assist engineers in troubleshooting. As one case study showed, a company using AI-based observability was able to resolve infrastructure issues 75% faster while cutting monitoring costs by 75%. The complexity of modern cloud environments is pushing us toward this new normal of predictive, adaptive systems.Sustainable software architecture: There is growing dialogue now about designing systems that are not only robust and scalable, but also efficient in their use of energy and resources. The surge in generative AI has shone a spotlight on the massive power consumption of large-scale services. According to Kemene et al., in an article published by the World Economic Forum (WEF), Data centers powering AI workloads can consume as much electricity as a small city; the International Energy Agency projects data center energy use will more than double by 2030, with AI being “the most important driver” of that growth. Green software engineering principles urge us to consider the carbon footprint of our design choices. Sinha suggests this as an area to pay attention to.Despite faster cycles, sharper constraints and more automation system design remains grounded in principles. As Chopra and Sinha make clear, the ability to reason about failure, scale, and trade-offs isn’t just how systems stay up; it’s also how engineers move up in their career.If you found Sinha and Chopra’s perspective on designing for scale and failure compelling, their book System Design Guide for Software Professionals unpacks the core attributes that shape resilient distributed systems. The following excerpt from the book breaks down how consistency, availability, partition tolerance, and other critical properties interact in real-world architectures. You’ll see how design choices around reads, writes, and replication influence system behavior—and why understanding these trade-offs is essential for building scalable, fault-tolerant infrastructure.Expert Insight: Distributed System Attributes by Dhirendra Sinha and Tejas ChopraThe complete “Chapter 2: Distributed System Attributes” from the book System Design Guide for Software Professionals by Dhirendra Sinha and Tejas Chopra (Packt, August 2024)…Before we jump into the different attributes of a distributed system, let’s set some context in terms of how reads and writes happen.Let’s consider an example of a hotel room booking application (Figure 2.1). A high-level design diagram helps us understand how writes and reads happen:Figure 2.1 – Hotel room booking request flowAs shown in Figure 2.1, a user (u1) is booking a room (r1) in a hotel and another user is trying to see the availability of the same room (r1) in that hotel. Let’s say we have three replicas of the reservations database (db1, db2, and db3). 
There can be two ways the writes get replicated to the other replicas: The app server itself writes to all replicas or the database has replication support and the writes get replicated without explicit writes by the app server.Let’s look at the write and the read flows:Read the Complete ChapterSystem Design Guide for Software Professionals by Dhirendra Sinha and Tejas Chopra (Packt, August 2024) is a comprehensive, interview-ready manual for designing scalable systems in real-world settings. Drawing on their experience at Google, Netflix, and Yahoo, the authors combine foundational theory with production-tested practices—from distributed systems principles to high-stakes system design interviews.For a limited time, get the eBook for $9.99 at packtpub.com — no code required.Get the Book🛠️Tool of the Week⚒️Diagrams 0.24.4 — Architecture Diagrams as Code, for System DesignersDiagrams is an open source Python toolkit that lets developers define cloud architecture diagrams using code. Designed for rapid prototyping and documentation, it supports major cloud providers (AWS, GCP, Azure), Kubernetes, on-prem infrastructure, SaaS services, and common programming frameworks—making it ideal for reasoning about modern system design.The latest release (v0.24.4, March 2025) adds stability improvements and ensures compatibility with recent Python versions. Diagrams has been adopted in production projects like Apache Airflow and Cloudiscovery, where infrastructure visuals need to be accurate, automatable, and version controlled.Highlights:Diagram-as-Code: Define architecture models using simple Python scripts—ideal for automation, reproducibility, and tracking in Git.Broad Provider Support: Over a dozen categories including cloud platforms, databases, messaging systems, DevOps tools, and generic components.Built on Graphviz: Integrates with Graphviz to render high-quality, publishable diagrams.Extensible and Scriptable: Easily integrate with build pipelines or architecture reviews without relying on external design tools.Visit Diagrams' GitHub Repo📰 Tech BriefsAnalyzing Metastable Failures in Distributed Systems: A new HotOS'25 paper builds on prior work to introduce a simulation-based pipeline—spanning Markov models, discrete event simulation, and emulation—to help engineers proactively identify and mitigate metastable failure modes in distributed systems before they escalate.A Senior Engineer's Guide to the System Design Interview: A comprehensive, senior-level guide to system design interviews that demystifies core concepts, breaks down real-world examples, and equips engineers with a flexible, conversational framework for tackling open-ended design problems with confidence.Using Traffic Mirroring to Debug and Test Microservices in Production-Like Environments: Explores how production traffic mirroring—using tools like Istio, AWS VPC Traffic Mirroring, and eBPF—can help engineers safely debug, test, and profile microservices under real-world conditions without impacting users.Designing Instagram: This comprehensive system design breakdown of Instagram outlines the architecture, APIs, storage, and scalability strategies required to support core features like media uploads, feed generation, social interactions, and search—emphasizing reliability, availability, and performance at massive scale.Chiplets and the Future of System Design: A forward-looking piece on how chiplets are reshaping the assumptions behind system architecture—covering yield, performance, reuse, and the growing need for interconnect 
standards and packaging-aware system design.That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.Take a moment to fill out this short survey we now run monthly—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.We’ll be back next week with more expert-led content.Stay awesome,Divya Anne SelvarajEditor-in-Chief, Deep EngineeringTake the Survey, Get a Packt Credit!If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.

Deep Engineering #4: Alessandro Colla and Alberto Acerbis on Domain-Driven Refactoring at Scale

Divya Anne Selvaraj
12 Jun 2025
Why understanding the domain beats applying patterns—and how to refactor without starting over#4Alessandro Colla and Alberto Acerbis on Domain-Driven Refactoring at ScaleWhy understanding the domain beats applying patterns—and how to refactor without starting overWelcome to the fourth issue of Deep Engineering.In enterprise software systems, few challenges loom larger than refactoring legacy systems to meet modern needs. These efforts can feel like open-heart surgery on critical applications that are still running in production. Systems requiring refactoring are often business-critical, poorly modularized, and resistant to change by design.To understand how Domain-Driven Design (DDD) can guide this process, we spoke with Alessandro Colla and Alberto Acerbis—authors of Domain-Driven Refactoring (Packt, 2025) and co-founders of the "DDD Open" and "Polenta and Deploy" communities.Colla brings over three decades of experience in eCommerce systems, C# development, and strategic software design. Acerbis is a Microsoft MVP and backend engineer focused on building maintainable systems that deliver business value. Together, they offer a grounded, pattern-skeptical view of what DDD really looks like in legacy environments—and how teams can use it to make meaningful change without rewriting from scratch.You can watch the full interview and read the full transcript here—or keep reading for our distilled take on the principles, pitfalls, and practical steps that shape successful DDD refactoring.Sign Up |AdvertiseThe conference to learn, apply, and improve your craftdev2next is the premier conference designed for software developers, architects, technology leaders, development managers, and directors. Explore cutting-edge strategies, tools, and essential practices for building powerful applications using the latest trends and good practices.When: September 29 - October 2, 2025Where: Colorado Springs, COBuy Conference and Workshop TicketsPrinciples over Patterns: Applying DDD to Legacy Systems with Alessandro Colla and Alberto AcerbisLegacy systems are rarely anyone’s favorite engineering challenge. Often labeled “big balls of mud,” these aging codebases resist change by design—lacking tests, mixing concerns, and coupling business logic to infrastructure in ways that defy modular thinking. Yet they remain critical. “It’s more common to work on what we call legacy code than to start fresh,” Acerbis notes from experience. Their new book, Domain-Driven Refactoring, was born from repeatedly facing large, aging codebases that needed new features. “The idea behind the book is to bring together, in a sensible and incremental way, how we approach the evolution of complex legacy systems,” explains Colla. Rather than treat DDD as something only for new projects, Colla and Acerbis show how DDD’s concepts can guide the incremental modernization of existing systems.They begin by reinforcing core DDD concepts—what Colla calls their “foundation”—before demonstrating how to apply patterns gradually. This approach acknowledges a hard truth: when a client asks for “a small refactor” of a legacy system, “it’s never small. It always becomes a bigger refactor,” Acerbis says with a laugh. The key is to take baby steps. 
“Touching a complex system is always difficult, on many levels,” Colla cautions, so the team must break down the work into manageable changes rather than trying an all-at-once overhaul.Modular Monoliths Before MicroservicesOne of the first decisions in a legacy overhaul is whether to break a monolithic application into microservices. But Colla and Acerbis urge caution here—hype should not dictate architecture.“Normally, a customer comes to us asking to transform their legacy application into a microservices system because — you know — ‘my cousin told me microservices solve all the problems,’” Acerbis jokes. The reality is that blindly carving up a legacy system into microservices can introduce as much complexity as it removes. “Once you split your system into microservices, your architecture needs to support that split,” he explains, from infrastructure and deployment to data consistency issues.Instead, the duo advocates an interim step: first evolve the messy monolith into a well-structured modular monolith. “Using DDD terms, you should move your messy monolith into a good modular monolith,” says Acerbis. In a modular monolith, clear boundaries are drawn around business subdomains (often aligning with DDD bounded contexts), but the system still runs as a single deployable unit. This simplification and ordering within the monolith can often deliver the needed agility and clarity. “We love monoliths, OK? But modular ones,” Colla admits. With a modular monolith in place, teams can implement new features more easily and see if further decomposition is truly warranted. Only if needed—due to scale or independent deployment demands—should you “split it into microservices. But that’s a business and technical decision the whole team needs to make together,” Acerbis emphasizes.By following this journey, teams often find full microservices unnecessary. Colla notes that many times they’ve been able to meet all business requirements just by going modular, without ever needing microservices. The lesson: choose the simplest architecture that solves the problem and avoid microservices sprawl unless your system’s scale and complexity absolutely demand it.First Principles: DDD as a Mindset, Not a ChecklistA central theme from Colla and Acerbis is that DDD is fundamentally about understanding the problem domain, not checking off a list of patterns. “Probably the most important principle is that DDD is not just technical — it’s about principles,” says Acerbis. Both engineers stress the importance of exploration and ubiquitous language before diving into code. “Start with the strategic patterns — particularly the ubiquitous language — to understand the business and what you’re dealing with,” Colla advises. In practice, that means spending time with domain experts, clarifying terminology, and mapping out the business processes and subdomains. Only once the team shares a clear mental model of “what actually needs to be built” should they consider tactical design patterns or write any code.Colla candidly shares that he learned this the hard way.“When I started working with DDD, CQRS, and event sourcing, I made the mistake of jumping straight into technical modeling — creating aggregates, entities, value objects — because I’m a developer, and that’s what felt natural. But I skipped the step of understanding why I was building those classes.I ended up with a mess.”Now he advocates for understanding the why, then the how. “We spent the first chapters of the book laying out the principles. 
We wanted readers to understand the why — so that once you get to the code, it comes naturally,” Colla says.This principle-centric mindset guards against a common trap: applying DDD patterns by rote or “cloning” a solution from another project.“I’ve seen situations where someone says, ‘I’ve already solved a similar problem using DDD — I’ll just reuse that design.’ But no, that’s not how it works,” Acerbis warns.Every domain is different, and DDD is “about exploration. Every situation is different.” By treating DDD as a flexible approach to learning and modeling the domain—rather than a strict formula—teams can avoid over-engineering and build models that truly fit their business.From Strategic to Tactical: Applying Patterns IncrementallyOnce the team has a solid grasp of the domain, they can start to apply DDD’s tactical patterns (entities, value objects, aggregates, domain events, etc.) to reshape the code. But which pattern comes first? Colla doesn’t prescribe a one-size-fits-all sequence. “I don’t think there’s a specific pattern to apply before others,” he says. The priority is dictated by the needs of the domain and the pain points in the legacy code. However, the strategic understanding guides the tactical moves: by using the ubiquitous language and bounded contexts identified earlier, the team can decide where an aggregate boundary should be, where to introduce a value object for a concept, and so on.Acerbis emphasizes that their book isn’t a compendium of all DDD patterns—classic texts already cover those. Instead, it shows how to practically apply a selection of patterns in a legacy refactoring context. The aim is to go from “a bad situation — a big ball of mud — to a more structured system,” he says. A big win of this structure is that new features become easier to add “without being afraid of introducing bugs or regressions,” because the code has clear separation of concerns and meaningful abstractions.Exploring the domain comes first. Only then should the team “bring in the tactical patterns when you begin touching the code,” says Colla. In other words, let the problem guide the solution. By iteratively applying patterns in the areas that need them most, the system gradually transforms—all while continuing to run and deliver value. This incremental refactoring is core to their approach; it avoids the risky big-bang rewrite and instead evolves the architecture piece by piece, in sync with growing domain knowledge.Balancing Refactoring with Rapid DeliveryIn theory, it sounds ideal to methodically refactor a system. In reality, business stakeholders are rarely patient—they need new features yesterday. Colla acknowledges this tension:“This is the million-dollar question. As in life, the answer is balance. You can't have everything at once — you need to balance features and refactoring.”The solution is to weave refactoring into feature development, rather than treating it as a separate project that halts new work.“Stakeholders want new features fast because the system has to keep generating value,” Colla notes. Completely pausing feature development for months of cleanup is usually a non-starter (“We’ve had customers say, ‘You need to fix bugs and add new features — with the same time and budget.’”). 
Instead, Colla’s team refactors in context: “if a new feature touches a certain area of the system, we refactor that area at the same time.” This approach may slightly slow down that feature’s delivery, but it pays off in the long run by preventing the codebase from deteriorating further. Little by little (“always baby steps,” as Colla says), they improve the design while still delivering business value.Acerbis adds that having a solid safety net of tests is what makes this sustainable. Often, clients approach them saying it’s too risky or slow to add features because “the monolith has become a mess.” The first order of business, then, is to shore up test coverage.“We usually start with end-to-end tests to make sure that the system behaves the same way after changes,” he explains.Writing tests for a legacy system can be time-consuming initially, but it instills confidence.“In the beginning, it takes time. You have to build that infrastructure and coverage. But as you move forward, you’ll see the benefits — every time you deploy a new feature, you’ll know it was worth it.”With robust tests in place, the team can refactor aggressively within each iteration, knowing they will catch any unintended side effects before they reach users.Aligning Architecture with OrganizationEven the best technical refactoring will falter if organizational structure is at odds with the design. This is where Conway’s Law comes into play—the notion that software systems end up reflecting the communication structures of the organizations that build them.“When introducing DDD, it’s not just about technical teams. You need involvement from domain experts, developers, stakeholders — everyone,” says Acerbis.In practice, this means that establishing clean bounded contexts in code may eventually require realigning team responsibilities or communication paths in the company.Of course, changing an organization chart is harder than changing code. Colla and Acerbis therefore approach it in phases. “Context mapping is where we usually begin — understanding what each team owns and how they interact,” Colla explains. They first try to fix the code boundaries while not breaking any essential communication between people or teams. For instance, if two modules should only talk via a well-defined interface, they might introduce an anti-corruption layer in code, even if the same two teams still coordinate closely as they always have. Once the code’s boundaries stabilize and prove beneficial, the case can be made to align the teams or management structure accordingly.“The hardest part is convincing the business side that this is the right path,” Acerbis admits. Business stakeholders control budgets and priorities, so without their buy-in, deep refactoring stalls. The key is to demonstrate value early and keep them involved. Ultimately, “it only works if the business side is on board — they’re the ones funding the effort,” he says. Colla concurs: “everyone — developers, architects, business — needs to share the same understanding. Without that alignment, it doesn’t work.” DDD, done right, becomes a cross-discipline effort, bridging tech and business under a common language and vision.Building a Safety Net: Tools and Testing TechniquesGiven the complexity of legacy transformation, what tools or frameworks can help? Colla’s answer may surprise some: there is no magic DDD framework that will do it for you. “There aren’t any true ‘DDD-compliant’ frameworks,” he says. 
DDD isn’t something you can buy off-the-shelf; it’s an approach you must weave into how you design and code. However, there are useful libraries and techniques to smooth the journey, especially around testing and architecture fitness.“What’s more important to me is testing — especially during refactoring. You need a strong safety net,” Colla emphasizes. His team’s rule of thumb: start by writing end-to-end tests for current behavior. “We always start with end-to-end tests. That way, we make sure the expected behavior stays the same,” Colla shares. These broad tests cover critical user flows so that if a refactoring accidentally changes something it shouldn’t, the team finds out immediately. Next, they add architectural tests (often called fitness functions) to enforce the intended module boundaries. “Sometimes, dependencies break boundaries. Architectural tests help us catch that,” he notes. For instance, a test might ensure that code in module A never calls code in module B directly, enforcing decoupling. And of course, everyday unit tests are essential for the new code being written: “unit tests, unit tests, unit tests,” Colla repeats for emphasis. “They prove your code does what it should.”Acerbis agrees that no all-in-one DDD framework exists (and maybe that’s for the best). “DDD is like a tailor-made suit. Every time, you have to adjust how you apply the patterns depending on the problem,” he says. Instead of relying on a framework to enforce DDD, teams should rely on discipline and tooling – especially the kind of automated tests Colla describes – to keep their refactoring on track. Acerbis also offers a tip on using AI assistance carefully: tools like GitHub Copilot can be helpful for generating code, but “you don’t know how it came up with that solution.” He prefers to have developers write the code with understanding, then use AI to review or suggest improvements. This ensures that the team maintains control over design decisions rather than blindly trusting a tool.Event-Driven Architecture: Avoiding the "Distributed Monolith"DDD often goes hand-in-hand with event-driven architecture for decoupling. Used well, domain events can keep bounded contexts loosely coupled. But Colla and Acerbis caution that it’s easy to misuse events and end up with a distributed mess. Acerbis distinguishes two kinds of events with very different roles: domain events and integration events. “Domain events should stay within a bounded context. Don’t share them across services,” he warns. If you publish your internal domain events for other microservices to consume, you create tight coupling: “when you change the domain event — and you will — you’ll need to notify every team that relies on it. That’s tight coupling, not decoupling.”The safer pattern is to keep domain events private to a service or bounded context, and publish separate integration events for anything that truly needs to be shared externally. That way, each service can evolve its internal model (and its domain event definitions) independently. Colla admits he’s learned this by making the mistakes himself. The temptation is to save effort by reusing an event “because it feels efficient,” but six months later, when one team changes that event’s schema, everything breaks. “We have to resist that instinct and think long-term,” he says. 
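To make the distinction concrete, here is a minimal sketch in Python (the book’s own examples are in C#; the event names and fields here are invented) that keeps a rich domain event private to its bounded context and publishes a deliberately smaller integration event at the boundary:

from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

# Domain event: rich, internal to the Orders bounded context, free to change shape.
@dataclass(frozen=True)
class OrderPlaced:
    order_id: str
    customer_id: str
    line_items: list          # internal structure other contexts should not depend on
    placed_at: datetime

# Integration event: a deliberately small, stable contract shared with other services.
@dataclass(frozen=True)
class OrderPlacedIntegrationEvent:
    order_id: str
    placed_at: str            # ISO-8601 string, versioned as part of the public contract
    version: int = 1

def to_integration_event(event: OrderPlaced) -> OrderPlacedIntegrationEvent:
    # Translation happens at the context boundary, so internal refactorings
    # (renaming fields, reshaping line_items) never leak to consumers.
    return OrderPlacedIntegrationEvent(
        order_id=event.order_id,
        placed_at=event.placed_at.astimezone(timezone.utc).isoformat(),
    )

def publish(event: OrderPlacedIntegrationEvent) -> None:
    # Stand-in for a real message broker call (e.g. publishing to a topic).
    print(json.dumps(asdict(event)))

The translation function is the seam that keeps the two contracts independent: consumers only ever see the versioned integration event, never the internal domain event.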
Even if it requires a bit more work upfront to define distinct integration events, it prevents creating what he calls a “distributed monolith that’s impossible to evolve” – a system where services are theoretically separate but so tightly coupled by data contracts that they might as well be a single unit.Another often overlooked aspect of event-driven systems is the user experience in an eventually consistent world. Because events introduce asynchrony, UIs must be designed to handle the delay. Acerbis mentions using task-based UIs, where screens are organized around high-level business tasks rather than low-level CRUD forms, to better set user expectations and capture intent that aligns with back-end processes. The bottom line is that events are powerful, but they come with their own complexities – teams must design and version them thoughtfully, and always keep the end-to-end system behavior in mind.💡What This Means for YouDon’t jump in without understanding the domain: Slow down and truly grasp the domain events and logic before coding them. Focus on principles before patterns. DDD isn’t a bag of technical tricks, but a way to deeply understand the business domain before coding solutions. Align with the business. Technical architecture must reflect the domain and may require organizational buy-in and alignment (think Conway’s Law).Beware the golden hammer: “Use DDD where it makes sense. You don’t need to apply it to your entire system,” Acerbis advises. Focus DDD efforts on the core domain (where the competitive advantage lies), and keep supporting domains simple. Modular monolith first. Instead of rushing into microservices, first untangle your “big ball of mud” into a well-structured modular monolith—often that's enough.No “Franken-events”: If you see an “and” in an event name, that’s a red flag – it likely violates the single responsibility principle for events and will cause trouble when one part of that event changes and the other doesn’t. Refactor in baby steps. Integrate refactoring tasks into regular feature work, supported by a strong safety net of tests, to balance improvement with delivery.Never allow invalid data by design: A subtle but dangerous practice is allowing objects or aggregates in an invalid state (for example, by using flags like isValid). “Your aggregates should always be in a valid state,” Acerbis emphasizes, meaning your constructors or factories should enforce invariants so you don’t have to constantly check validity later.Don’t split the system before it’s ready: Microservices introduce complexity too early. “Once you split your system into microservices, your architecture needs to support that split,” Acerbis warns. Work on converting to a modular monolith first—often that's enough.“Simple” versus “easy” code: “Simple code is not the same as easy code. Simple code takes effort. Easy code is quick, but it’s hard to maintain,” says Acerbis. What feels “easy” in the moment (quick-and-dirty hacks, copy-paste coding, skipping tests) leads to a tangled mess. Writing simple, clear code often requires more thought and discipline—but it pays off with maintainability. Evolve, don’t rewrite. Aim to evolve the system through continuous small changes rather than costly complete rewrites.If you found Colla and Acerbis’ insights useful, their book, Domain-Driven Refactoring offers a deeper, hands-on perspective—showing how to incrementally apply DDD principles in real systems under active development with substantial code examples. 
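Before moving on to the excerpt, it is worth making one of the takeaways above concrete: “Never allow invalid data by design” is easy to show in code. Here is a minimal sketch in Python, with an invented Order aggregate (the book’s own examples are in C#), that enforces invariants at construction time instead of relying on an isValid flag:

from dataclasses import dataclass

@dataclass(frozen=True)
class OrderLine:
    sku: str
    quantity: int

class Order:
    """An aggregate that can only be constructed and mutated into valid states."""

    def __init__(self, order_id: str, lines: list[OrderLine]):
        if not order_id:
            raise ValueError("order_id is required")
        if not lines:
            raise ValueError("an order must contain at least one line")
        if any(line.quantity <= 0 for line in lines):
            raise ValueError("line quantities must be positive")
        self._order_id = order_id
        self._lines = list(lines)

    def add_line(self, line: OrderLine) -> None:
        # Invariants are re-checked on every state change, so callers never
        # need to ask the aggregate whether it is currently valid.
        if line.quantity <= 0:
            raise ValueError("line quantities must be positive")
        self._lines.append(line)

Because every constructor and method guards its own invariants, downstream code can treat an Order instance as valid by definition.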
Here is an excerpt which covers how to integrate events within a CQRS architecture.Expert Insight: Integrating Events with CQRS by Alessandro Colla and Alberto AcerbisAn Excerpt from “Chapter 7: Integrating Events with CQRS” in the book Domain-Driven Refactoring by Alessandro Colla and Alberto Acerbis (Packt, May 2025)In this chapter, we will explore how to effectively integrate events into your system using the Command Query Responsibility Segregation (CQRS) pattern. As software architectures shift from monolithic designs to more modular, distributed systems, adopting event-driven communication becomes essential. This approach offers scalability, decoupling, and resilience, but also brings complexity and challenges such as eventual consistency, fault tolerance, and infrastructure management.The primary goal of this chapter is to guide you through the implementation of event-driven mechanisms within the context of a CQRS architecture. By the end of this chapter, you will have a clear understanding of how events and commands operate in tandem to manage state changes, communicate between services, and optimize both the reading and writing of data.(In this excerpt) you will learn about the following:The benefits and trade-offs of transitioning from synchronous to asynchronous communicationHow event-driven architectures improve system scalability and decouplingThe difference between commands (which trigger state changes) and events (which signal that something has happened)How to apply proper message-handling patterns for bothThe principles of CQRS and understanding why separating read and write models enhances performance and scalabilityHow to implement the separation of command and query responsibilities with a focus on read and write optimizationHow to introduce a message broker for handling asynchronous communicationHow to capture and replay the history of state changes with event sourcingRead the Complete ExcerptDomain-Driven Refactoring by Alessandro Colla and Alberto Acerbis is a practical guide to modernizing legacy systems using DDD. Through real-world C# examples, the authors show how to break down monoliths into modular architectures—whether evolving toward microservices or improving maintainability within a single deployable unit. The book covers both strategic and tactical patterns, including bounded contexts, aggregates, and event-driven integration.Use code DOMAIN20 for 20% off at packtpub.com — valid through June 16, 2025.Get the Book🛠️Tool of the Week⚒️Context Mapper 6.12.0 — Strategic DDD Refactoring, VisualizedContext Mapper is an open source modeling toolkit for strategic DDD, purpose-built to define and evolve bounded contexts, map interrelationships, and drive architectural refactorings. 
It offers a concise DSL for creating context maps and includes built-in transformations for modularizing monoliths, extracting services, and analyzing cohesion/coupling trade-offs. The latest version continues its focus on reverse-engineering context maps from Spring Boot and Docker Compose projects, along with support for automated architectural refactorings—making it ideal for teams modernizing legacy systems or planning microservice transitions.

Highlights:

Iterative Refactoring: Apply “Architectural Refactorings” to improve modularity without rewriting everything.
Reverse Engineering: Extract bounded context candidates from existing codebases using the Context Map Discovery library.
Multi-Format Output: Export maps to Graphviz, PlantUML, MDSL, or Freemarker-based text formats.
IDE Integrations: Available as plugins for Eclipse and VS Code, or use it directly in Gitpod without local setup.

Visit the Project Website

📰 Tech Briefs

Architecture Refactoring Towards Service Reusability in the Context of Microservices by Daniel et al.: This paper proposes a catalog of architectural refactorings—Join API Operations with Heterogeneous Data, Introduce Metadata, and Extract Pluggable Processors—to improve service reusability in microservice architectures by reducing code duplication, decoupling data from processing logic, and supporting heterogeneous inputs, and validates these patterns through impact analysis on three real-world case studies.

DDD & LLMs - Eric Evans - DDD Europe 2024: In this keynote, Evans reflects on the transformative potential of large language models in software development, urging the community to embrace experimentation, learn through hands-on projects, and explore how DDD might evolve—or be challenged—in an era increasingly shaped by AI-assisted systems.

Domain Re-discovery Patterns for Legacy Code - Richard Groß - DDD Europe 2024: In this talk, Groß introduces domain rediscovery patterns for legacy systems—ranging from passive analysis techniques like mining repositories and activity logging to active refactoring patterns and visualization tools—all aimed at incrementally surfacing domain intent, guiding safe modernization without full rewrites, and avoiding hidden technical and organizational costs of starting from scratch.

Legacy Modernization meets GenAI by Ferri et al., Thoughtworks: This article discusses how GenAI can address core challenges in legacy system modernization—such as reverse engineering, capability mapping, and high-level system comprehension—arguing for a human-guided, evolutionary approach while showcasing Thoughtworks’ internal accelerator, CodeConcise, as one practical application of these ideas.

Refactor a monolith into microservices by the Google Cloud Architecture Center: This guide outlines how to incrementally refactor a monolithic application into microservices using DDD, bounded contexts, and asynchronous communication—emphasizing the Strangler Fig pattern, data and service decoupling strategies, and operational considerations like distributed transactions, service boundaries, and database splitting.

That’s all for today. Thank you for reading this issue of Deep Engineering.
We’re just getting started, and your feedback will help shape what comes next. Take a moment to fill out this short survey—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice. We’ll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor in Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.
Deep Engineering #3: Designing for AI and Humans with MoonBit Core Contributor Zihang YE
Divya Anne Selvaraj
05 Jun 2025
From CLI design to AI ergonomics—MoonBit offers patterns worth borrowing

Hi ,

Welcome to the third issue of Deep Engineering.

AI agents are no longer just code generators; they’re becoming active users of codebases, APIs, and developer tools. From semantic documentation protocols to agent-readable APIs, the systems we design must increasingly expose structure, context, and intent. Software now needs to serve two audiences—humans and machines.

This issue explores what that means in practice, through the lens of MoonBit—a new language built from the ground up for WebAssembly (Wasm)-native performance and AI-first tooling.

Our feature article examines how MoonBit responds to this dual-audience challenge: not with flashy syntax, but with a tightly integrated toolchain and a runtime model designed to be both fast and machine-consumable. And in a companion tutorial, MoonBit core contributor Zihang YE walks us through building a diff algorithm as a Wasm-ready CLI—an instructive example of the language’s design philosophy in action.

Sign Up | Advertise

Sponsored: Web Devs: Turn Your Knowledge Into Income
Build the knowledge base that will enable you to collaborate with AI for years to come
💰 Competitive Pay Structure
⏰ Ultimate Flexibility
🚀 Technical Requirements (No AI Experience Needed)
Weekly payouts + remote work: The developer opportunity you’ve been waiting for! The flexible tech side hustle paying up to $50/hour
Apply Now

Beyond Syntax: MoonBit and the Future of Language, Tooling, and AI Workflows

The mainstream dominance of Python, JavaScript, and Rust might suggest the age of new programming languages is over. A new breed of languages including MoonBit proves otherwise—not by reinventing syntax, but by responding to two tectonic shifts in software development: AI-assisted workflows, and the rise of Wasm-native deployment in cloud and edge environments.

In edge computing and micro-runtime environments, developers need tools that start instantly, consume minimal memory, and run predictably across platforms. MoonBit’s design responds directly to this: it produces compact Wasm binaries optimized for streaming data, making it suitable for CLI tools, embedded components, and other low-overhead tasks.

At the same time, AI workloads are exposing the limitations of dynamic languages like Python in large-scale systems. MoonBit’s founders note that Python’s “easy to learn” nature can become a double-edged sword for complex tasks. Even with optional annotations, its dynamic type system can hinder static analysis, complicating maintainability and scalability as codebases grow. In response, MoonBit introduces a statically typed, AI-aware language model with built-in tooling—formatter, package manager, VSCode integration—designed to support both human and machine agents.

Rather than replacing Python, MoonBit takes a pragmatic approach. It explicitly embraces an “ecosystem reuse” model: it uses AI-powered encapsulation to lower the barrier for cross-language calls, avoiding reinvention of existing Python tools, and it aims to “democratize” static typing by coupling a strict type system with AI code generation.

A Language is Not Enough

MoonBit is a toolchain-native language, designed from the start to work smoothly with modern build, editing, and AI workflows.
Unlike older languages that were retrofitted with new tools, MoonBit bundles its compiler, package manager, IDE, language server, and even an AI assistant as a cohesive whole. As the MoonBit team puts it, they “integrate a comprehensive toolchain from the start” to provide a streamlined coding experience.This stands in contrast to older systems languages like C/C++ and even to modern ones like Rust, which, despite its safety guarantees, still requires extra configuration to target Wasm. MoonBit by design treats Wasm as its primary compilation target – it is “Wasm-first”, built “as easy as Golang” but generating very compact Wasm output.Similarly, MoonBit was conceived to work hand-in-hand with AI tools. It offers built-in hooks for AI code assistance (more on this below) and even considers AI protocols like Anthropic’s Model Context Protocol (MCP) as first-class integration points. In MoonBit, the language + toolchain combo is now a single product, not an afterthought.MoonBit is not alone. Other new languages like Grain, Roc, and Hylo (formerly Val) each explore different priorities—from functional programming for the web to safe systems-level design and simplified developer experience.Grain prioritizes JS interop and functional ergonomics; Roc favors simplicity and speed, though it’s still pre-release; and Hylo experiments with value semantics and low-level control. MoonBit and these other languages make it clear that language design is soon going to become inseparable from its runtime, developer experience, and AI integration.Architecture and Developer ExperienceMoonBit’s architecture reflects a deliberate focus on toolchain integration and cross-platform performance. It is a statically typed, multi-paradigm language influenced by Go and Rust, supporting generics, structural interfaces, and static memory management. The compiler is designed for whole-program optimization, producing Wasm or native binaries with minimal overhead. According to benchmarks cited by the team, MoonBit compiled 626 packages in 1.06 seconds—approximately 9x faster than Rust in the same test set. Its default Wasm output is compact: a basic HTTP service compiles to ~27 KB, which compares favorably to similar Rust (~100 KB) and JavaScript (~8.7 MB) implementations. This is partly due to MoonBit’s support for Wasm GC, allowing it to omit runtime components that Rust must include.The syntax and structure are also optimized for machine parsing. All top-level definitions require explicit types, and interface methods are defined at the top level rather than nested. This flatter structure reportedly improves LLM performance by reducing key–value cache misses during code generation. The language includes built-in support for JSON, streaming data processing via iterators, and compile-time error tracking through control-flow analysis.Tooling is tightly coupled with the language. The moon CLI handles compilation, formatting, testing, and dependency management via the Mooncakes registry. The build system, written in Rust, supports parallel, incremental builds. A dedicated LSP server (distributed via npm) integrates MoonBit with IDEs, enabling features like real-time code analysis and completions. Debugging is supported via the CLI with commands like moon run --target js --debug, which link into source-level tools.A browser-based IDE preview is also available. 
It avoids containers in favor of a parallelized backend and includes an embedded AI assistant capable of generating documentation, suggesting tests, and offering inline explanations. According to the team, this setup is designed to support both developer productivity and AI agent interaction.MoonBit’s performance profile extends beyond Wasm. A recent release introduced an LLVM backend for native compilation. In one example published by the team, MoonBit outperformed Java by up to 15x in a numeric loop benchmark. The language also supports JavaScript as a compilation target, expanding deployment options across web and server contexts.AI Systems as Language ConsumersLLMs are no longer just helping developers write code—they’re starting to read, run, and interact with it. This shift requires rethinking what it means for a language to be “usable.”MoonBit anticipates this by treating AI systems as first-class consumers of code and tooling. Its team has adopted the MCP, an emerging open standard developed by Anthropic to enable LLMs to interface with external tools and data sources. MCP defines a JSON-RPC server architecture, allowing programs to expose structured endpoints that LLMs can query or invoke. MoonBit’s ecosystem includes a work-in-progress MCP server SDK written in MoonBit and compiled to Wasm, enabling MoonBit components to act as MCP-capable endpoints callable by models such as Claude.This integration reflects a broader shift in tooling. Modern documentation tools like Mintlify now expose semantically indexed content explicitly for AI retrieval. UIs and APIs are being annotated with machine-readable metadata. Even version control is evolving: newer workflows track units of change like (prompt + schema + tests), not just line diffs, enabling intent-aware versioning usable by humans and machines alike.MoonBit’s example agent on GitHub demonstrates this in practice, combining Wasm components (e.g. via Fermyon Spin), LLMs (such as DeepSeek), and MoonBit logic to automate development tasks. Under this model, protocols like MCP enable developers to publish AI-accessible functions directly from their codebases. MoonBit’s support for this workflow—via Wasm and first-party libraries—illustrates a growing view in language design: that AI systems are not just tools for writing code, but active consumers of it.Wasm’s Impact on Performance and PortabilityThree years ago, William Overton, a Senior Serverless Solutions Architect, said, Wasm "starts incredibly quickly and is incredibly light to run," making it well-suited to execute code across CDNs, edge nodes, and lightweight VMs with low startup latency and near-native speed. Today, the growing adoption of Wasm is reshaping expectations for both performance and cross-platform deployment.For MoonBit, Wasm is the default compilation target—not an optional backend. Its tooling is built around producing compact, portable Wasm modules. A simple web server in MoonBit compiles to a 27 KB Wasm binary—significantly smaller than equivalent builds in Rust or JavaScript. This reduction in size translates directly to faster load times and reduced memory usage, making MoonBit viable for constrained environments like embedded systems, CLI tools, and edge deployments.Standardized but still-emerging features like Wasm GC—and experimental ones like the Component Model—further reinforce this model. MoonBit has adopted both: its use of interface types and Wasm GC helps minimize runtime footprint. 
In a published comparison, MoonBit’s Wasm output was roughly an order of magnitude smaller than that of Rust, largely due to differences in memory management.Taken together, these developments suggest that Wasm is becoming a practical universal format for lightweight applications. For teams building portable utilities or latency-sensitive services, languages with Wasm-native support—such as MoonBit—offer tangible advantages over traditional container- or VM-based approaches.💡What This Means for YouMoonBit offers concrete lessons even if you never write MoonBit code. Key takeaways include:Ecosystem Continuity: Instead of building isolated ecosystems, consider bridging existing ones. MoonBit demonstrates that Python libraries can be reused as external modules—wrapped, if needed, by AI-generated shims. This reduces rewrites and enables gradual migration to safer or more performant languages.Integrated Tooling: Treat your language platform as a cohesive whole. MoonBit’s CLI (moon) unifies compilation, testing, debugging, and package management, minimizing context switches. Its build system exposes project metadata to IDEs via LSP integration. In your own tooling, aim for end-to-end flows powered by a single interface that integrates with the editor.Wasm and Runtime Strategy: For cross-platform deployment, prioritize Wasm as a primary target. MoonBit emits Wasm, JavaScript, or native binaries from a single compiler, and leverages Wasm GC for smaller outputs. Adopt language/toolchain combinations that support compact binaries and multiple backends without sacrificing performance.Data-Oriented Design: MoonBit’s JSON type, Iter abstraction, and pattern matching illustrate a clean model for streaming data. Architect utilities and pipelines to minimize allocations and intermediate state—use iterators, stream transforms, and statically analyzable data access patterns where possible.AI-Friendliness: MoonBit enforces top-level type annotations and flattens scope structures to support linear token generation. If you expect AI tooling to generate, refactor, or analyze your code, avoid deep nesting and implicit state—prefer clarity and structure that LLMs can parse efficiently.Static Checking + AI: MoonBit combines a strict type and error system with AI assistance to ease onboarding and boilerplate generation. This model lets developers write in a safe language without sacrificing velocity. For your own teams, consider pairing statically typed languages (or gradually typed ones like Python with type hints) with copilots that bridge ergonomics and enforcement.CLI Extensibility: The moon CLI supports modular growth—commands like moon new, moon run, and moon add are extensible by design. It can even serve as an LSP or MCP server. Treat your own CLIs as platform interfaces: design for plugin support, programmatic inspection, and long-term integration with AI and editor tooling.To see these ideas in practice—especially MoonBit’s type system, performance model, and Wasm-native tooling—Zihang YE, one of MoonBit’s core contributors, offers a hands-on walkthrough. His article walks us through the implementation of a diff algorithm using MoonBit, building a CLI tool that’s usable both by developers and AI systems via the MCP.Expert Insight: Implementing a Diff Algorithm in MoonBit by Zihang YEA hands-on introduction to MoonBit through the implementation of a version-control-grade diff tool.MoonBit is an emerging programming language that has a robust toolchain and relatively low learning curve. 
As a modern language, MoonBit includes a formatter, a plugin supporting VSCode, an LSP server, a central package registry, and more. It offers the friendly features of functional programming languages with manageable complexity.

To demonstrate MoonBit’s capabilities, we’ll implement a core software development tool—a diff algorithm. Diff algorithms are essential in software development, helping identify changes between different versions of text or code. They power critical tools in version control systems, collaborative editing platforms, and code review workflows, allowing developers to track modifications efficiently. If you have ever used git diff then you are already familiar with such algorithms.

The most widely used approach is Eugene W. Myers’ diff algorithm, proposed in the paper “An O(ND) Difference Algorithm and Its Variations”. It is favored for its optimal time complexity. Its space-efficient implementation and ability to find the shortest edit script make it superior to alternatives like patience diff or histogram diff and make it the standard in version control systems like Git and many text comparison tools such as Meld.

In this tutorial, we’ll implement a version of the Myers diff algorithm in MoonBit. This hands-on project is ideal for beginners exploring MoonBit, offering insight into version control fundamentals while building a tool usable by both humans and AI through a standard API.

We will start by developing the algorithm itself, then build a command line application that integrates the Component Model and the MCP, leveraging MoonBit’s WebAssembly (Wasm) backend. Wasm is a fast-growing technology that provides privacy, portability, and near-native performance by running assembly-like code in virtual machines across platforms—qualities that MoonBit supports natively, making the language well-suited for building efficient cross-platform tools.

By the end of this tutorial, you’ll have a functional diff tool that demonstrates these capabilities in action.

Project Setup

Let’s first create a new MoonBit project by running:

moon new --lib diff

The following will be the project structure of the code. The moon.mod.json contains the configuration for the project, while the moon.pkg.json contains the configuration for each package. top.mbt is the file we'll be editing throughout this post.

├── LICENSE
├── moon.mod.json
├── README.md
└── src
    ├── lib
    │   ├── hello.mbt
    │   ├── hello_test.mbt
    │   └── moon.pkg.json
    ├── moon.pkg.json
    └── top.mbt

We will be comparing two pieces of text, each divided into lines. Each line will include its content and a line number. The line number helps track the exact position of changes, providing important context about the location of changes when displaying the differences between the original and modified files.

Read the Complete Tutorial

🛠️Tool of the Week⚒️

MCP Python SDK 1.9.2 — Structured Interfaces for AI-Native Applications

The MCP is a standard for exposing structured data, tools, and prompts to language models. The MCP Python SDK brings this to production-ready Python environments, with a lightweight, FastAPI-compatible server model and first-class support for LLM interaction patterns.
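To make that model concrete, here is a minimal sketch of a FastMCP-style server exposing a single tool. It follows the patterns in the SDK’s public documentation, but treat the exact import path, decorator usage, and the count_changed_lines tool itself as illustrative assumptions to verify against the version you install:

# Minimal MCP server sketch using the MCP Python SDK's FastMCP interface.
# Assumes `pip install mcp`; check import paths against the SDK version you use.
from mcp.server.fastmcp import FastMCP

# Create a named server that an MCP-capable client (e.g. Claude Desktop) can connect to.
mcp = FastMCP("diff-tools")

@mcp.tool()
def count_changed_lines(original: str, revised: str) -> int:
    """Return a rough count of lines that differ between two texts."""
    old, new = original.splitlines(), revised.splitlines()
    # Naive line-set comparison; a real implementation would use a diff algorithm.
    return len(set(old).symmetric_difference(new))

if __name__ == "__main__":
    # Runs the server over the default stdio transport so a client can discover
    # and call the tool.
    mcp.run()

Running the script starts a local server an MCP-capable client can attach to; the release notes below cover what v1.9.2 adds on top of this basic shape.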
The latest release, v1.9.2 (May 2025), introduces:

Streamable HTTP Support: Improved transport layer for scalable, resumable agent communication.
Lifespan Contexts: Type-safe initialization for managing resources like databases or auth providers.
Authentication: Built-in OAuth2 flows for securing agent-accessible endpoints.
Claude Desktop Integration: Direct install into Anthropic’s desktop agent environment via mcp install.
Async Tooling: Tools, resources, and prompts can now be async functions with full lifecycle hooks.

Ideal for teams designing LLM-facing APIs, building AI-autonomous agents, or integrating prompt-based tools directly into Python services. It’s the protocol MoonBit already supports—and the interface LLMs increasingly expect.

Read the Project Description

📰 Tech Briefs

Architectural Patterns for AI Software Engineering Agents by Nati Shalom, Fellow at Dell NativeEdge: Examines how modern coding agents are being structured like real-world dev teams—using patterns such as code search, AST analysis, and version-controlled prompt templates to enable disciplined, multi-agent collaboration.

A survey of agent interoperability protocols: Model Context Protocol (MCP), Agent Communication Protocol (ACP), Agent-to-Agent Protocol (A2A), and Agent Network Protocol (ANP) by Ehtesham et al.: Offers an in-depth analysis of four emerging protocols designed to enhance interoperability among AI agents, examining their architectures, communication patterns, and security models.

When the Agents Go Marching In: Five Design Paradigms Reshaping Our Digital Future by Adrian Levy, Senior UX Expert at CyberArk: Discusses how agentic UX is reshaping everything from collaboration to trust. If MoonBit is what languages might look like in this new world, Levy’s article shows how interfaces and systems are evolving to meet the same challenge, articulating the Agent Experience (AX) paradigm.

Beyond augmentation: Agentic AI for software development by Khare et al., Infosys Knowledge Institute: A practice-oriented report on how autonomous agents are moving from coding assistants to pipeline-integrated actors—handling complex dev tasks end-to-end and delivering measurable productivity gains in database and API generation.

Emerging Developer Patterns for the AI Era by Yoko Li, Engineer, a16z: Explores how core concepts like version control, documentation, dashboards, and scaffolding are being reimagined to support AI agents as first-class participants in the software loop—not just code generators, but consumers, collaborators, and operators.

That’s all for today. Thank you for reading this issue of Deep Engineering.
We’re just getting started, and your feedback will help shape what comes next. Take a moment to fill out this short survey—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice. We’ll be back next week with more expert-led content.

Stay awesome,
Divya Anne Selvaraj
Editor-in-Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.
Deep Engineering #2: Dave Westerveld on Scalable API Testing and Orchestration
Divya Anne Selvaraj
29 May 2025
Shift-left strategies, parallel test design, and the realities of testing AI-driven APIs at scale#2Dave Westerveld on Scalable API Testing and OrchestrationShift-left strategies, parallel test design, and the realities of testing GraphQL, gRPC, and AI-driven APIs at scaleHi ,Welcome to the second issue of Deep Engineering.Postman’s 2024 State of the API report reveals that 74% of teams now follow an API-first approach, signaling a major shift in software development from the code-first approach. As APIs grow more complex—and as AI agents, gRPC, and GraphQL reshape how services communicate—the question is no longer whether to test early, but how to test well at scale.In this issue, we speak with Dave Westerveld—developer, author of API Testing and Development with Postman, and testing specialist with years of experience across both mature systems and early-stage teams. Drawing from his work on automation strategy, API integration, and scaling quality practices, Dave offers a grounded take on CI pipelines, parallel execution, and the tradeoffs of modern API protocols.You can watch the full interview and read the full transcript here— or keep reading for our distilled take on what makes modern test design both reliable and fastSign Up |AdvertiseSponsored:Webinar: Make Your App a Moving Targetand Leave Attackers GuessingLearn how your app could evolve automatically, leaving reverse engineers behind with every release.Hosted by Guardsquare featuring:Anton Baranenko-Product ManagerDate/time: Tuesday, June 10th at 4 PM CET (10 AM EDT)Register NowFrom REST to Agents: Why Systems Thinking Still Anchors Good API Testing with Dave WesterveldSome testing principles are foundational enough to survive revolutions in tooling. That’s the starting point for Dave Westerveld’s approach to API testing in the post-AI tech landscape.“There are testing principles that were valid in the '80s, before the consumer internet was even a thing. They were valid in the world of desktop and internet computing, and they’re still valid today in the world of AI and APIs.”And while the landscape has shifted dramatically in the last two years—Postman now ships an AI assistant, supports gRPC and GraphQL, and offers orchestration features for agentic architectures—Westerveld believes the best way to scale quality is to combine these new capabilities with timeless habits of mind: systems thinking, structured test design, and a bias for clarity over cleverness.Systems Thinking as the FoundationWesterveld argues that API testers need to operate with a systems-level understanding of the software they’re validating. He calls this:“(The) ability to zoom out and see the entire forest first, and then come back in and see the tree, and realize how it fits into the larger picture and how to approach thinking about and testing it.”In practice, that means asking not just whether an endpoint returns the expected result, but how it fits into the larger architecture and user experience. It means understanding when to run exploratory tests, when to assert workflows, and when to defer to contract validation.These instincts, he says, haven’t changed even as APIs have diversified:“Things like how to approach and structure your testing are … timeless when it comes to REST APIs. 
They haven’t fundamentally changed in the last 20 years—neither should the way you think about testing them.”What matters more than syntax is structure—how testers reason about coverage, maintainability, and feedback cycles.AI as Accelerant, Not OraclePostman’s Postbot is the most visible new capability in the platform’s AI strategy. Built atop LLM infrastructure, it can suggest test cases, generate assertions, and translate prompts into working scripts. Internally, it draws on your Postman data—collections, environments, history—to provide context-aware assistance.Westerveld sees the benefit, but draws a hard line between skilled and unskilled use:“For a skilled tester, someone with a lot of experience, these AI tools can help you move more quickly through tasks you already know how to do. Often, when you reach that level, you’ve done a lot of testing—you can look at something and say, ‘OK, this is what I need to do here.’ But it can get repetitive to implement some scripts or write things out again and again.”He frames AI as an accelerant: helpful when you understand the underlying logic, risky when you don’t.“For more junior people, there’s a temptation to use AI to auto-generate scripts without fully understanding what those scripts are doing. I think that’s the wrong approach early in your career, because once the AI gets stuck, you won’t know how to move forward.”This caution aligns with Postman’s architectural choices. Postbot uses a deterministic intent classifier to map prompts to supported capabilities, orchestrates tool usage through a controlled execution layer, and codifies outputs as structured in-app actions—such as generating test scripts, visualizing responses, or updating request metadata. Its latest iteration adds a memory-aware agent model that supports multi-turn conversations and multi-action workflows, but with strict boundaries around tool access and state transitions.In this, Westerveld agrees: AI-generated tests are often brittle and opaque. Use them, he advises,“more as a learning tool than an autocomplete tool.”Scaling Through Independence and RestraintOne of Westerveld’s strongest positions concerns test design: automated tests should be independent of each other. This is both a correctness and scalability concern. When teams overuse shared setup code or rely on common state, it breaks test parallelism and increases the chance of cascading failures.In Postman, reusable scripts are managed via the Package Library, which allows teams to store JavaScript test logic in named packages and import them into requests, collections, monitors, and Flows. While this enables consistency and reuse, Westerveld notes that it also introduces new failure points if not applied judiciously.“If something in the shared code breaks—or if a dependency the shared code relies on fails—you can end up with all your tests failing. …So, you have to be careful that a single point of failure doesn’t take everything down.”His solution: only abstract what truly reduces duplication, and mock where necessary.“In cases like that, it’s worth asking: ‘Do we really need this to be a shared script, or can we mock this instead?’ For example, if you're repeatedly calling an authentication endpoint that you're not explicitly testing, maybe you could insert credentials directly instead. That might be a cleaner and faster solution.”He also advocates for test readability. Tests, he says, should act as documentation. 
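As a rough illustration of what that looks like outside any particular tool, here is a minimal sketch in Python, using pytest and requests, with a hypothetical /orders endpoint and token, of a test that stays independent and readable by injecting credentials directly rather than calling a shared login helper:

import os
import requests

BASE_URL = os.environ.get("API_BASE_URL", "https://api.example.test")
# Credentials are injected directly (e.g. from CI secrets) instead of being
# fetched through a shared authentication script the test does not exercise.
AUTH_HEADER = {"Authorization": f"Bearer {os.environ.get('API_TOKEN', 'test-token')}"}

def test_create_order_returns_201_and_echoes_the_sku():
    payload = {"sku": "ABC-123", "quantity": 2}

    response = requests.post(
        f"{BASE_URL}/orders", json=payload, headers=AUTH_HEADER, timeout=10
    )

    # The test reads like documentation: one request, explicit expectations,
    # and no hidden setup shared with other tests.
    assert response.status_code == 201
    assert response.json()["sku"] == "ABC-123"

Because the test owns its own inputs and makes no assumptions about what ran before it, it can be executed in parallel with the rest of the suite, which is the property Westerveld keeps coming back to.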
Pulling too much logic into shared libraries makes them harder to understand.“A well-written test tells you what the system is supposed to do. It shows real examples of expected behavior, even if it's not a production scenario. You can read it and understand the intent.But when you extract too much into shared libraries, that clarity goes away. Now, instead of reading one script, you’re bouncing between multiple files trying to figure out how things work. That hurts readability and reduces the test's value as living documentation.”Contracts, Specs, and CI IntegrationWith Postman’s new Spec Hub, teams can now author, govern, and publish API specifications across supported formats, helping standardize collaboration around internal and external APIs. As Westerveld puts it:“The whole point of having a specification is that it defines the contract we’re all agreeing to—whether that’s between frontend and backend teams, or with external consumers.”He recommends integrating schema checks as early as possible:“If you're violating that contract, the right response is to stop. … So yes, in that sense, we want development to ‘slow down’ when there’s a spec violation. But in the long run, this actually speeds things up by improving quality. You’re building on a solid foundation.”He advocates running validation as part of the developer CI pipeline—using lightweight checks at merge gates or as part of pull requests.This pattern aligns with what Postman now enables. Spec Hub introduces governance features such as built-in linting to enforce organizational standards by default. For CI integration, Postman’s contract validation tooling can be executed using the Postman CLI or Newman, both of which support running test collections—including those that validate OpenAPI contracts—within continuous integration pipelines. Together, these tools allow teams to maintain a single, trusted specification that anchors both collaboration and automated enforcement across environments.From REST to gRPC and GraphQLProtocol diversity is a reality for modern testers. Westerveld emphasizes that while core principles carry over across styles, testing strategies must adapt to the nuances of each protocol.gRPC, for example, provides low-level access through strongly typed RPC calls defined in .proto files. This increases both the power and the surface area of test logic.“One area where you really see a difference with modern APIs is in how you think about test coverage. The way you structure and approach that will be different from how you’d handle a REST API.That said, there are still similar challenges. For instance, if you’re using gRPC and you’ve got a protobuf or some kind of contract, it’s easier to test—just like with REST, if you have an OpenAPI specification.So, advocating for contracts stays the same regardless of API type. But with GraphQL or gRPC, you need more understanding of the underlying code to test them adequately. With REST, you can usually just look at what the API provides and get a good sense of how to test it.”GraphQL, he notes, introduces different complexities. Because it’s introspective and highly composable:“With GraphQL, there are a lot of possible query combinations… A REST API usually has simple, straightforward docs—‘here are the endpoints, here’s what they do’—maybe a page or two.With GraphQL, the documentation is often dynamically generated and feels more like autocomplete. You almost have to explore the graph to understand what’s available. 
It’s harder to get comprehensive documentation.”Postman supports both gRPC and GraphQL natively, enabling users to inspect schemas, craft requests, and run tests—all without writing code. But effective testing still depends on schema discipline and clarity. Westerveld points out that with GraphQL, where documentation can feel implicit or opaque, mock servers and contract-first workflows are critical. Postman helps here too, offering design features that can generate mocks and example responses directly from imported specs.Orchestration and Shift-Left StrategiesPostman’s recent support for the Model Context Protocol (MCP) and the launch of its AI Tool Builder mark a shift toward integrating agent workflows into the API lifecycle. Developers can now build and test MCP-compliant servers and requests using Postman’s familiar interface—lowering the barrier to designing autonomous agent interactions atop public or internal APIs.But as Westerveld points out, these advances don’t replace fundamentals. His focus remains on feedback speed, execution reliability, and test independence.“Shift-left and orchestration have been trending for quite a while. As an industry, we’ve been investing in these ideas for years—and we’re still seeing those trends grow. We’re pushing testing closer to where the code is written, which is great. At the same time, we’re seeing more thorough and complete API testing, which is another great development.”He notes a natural tension between shift-left principles and orchestration complexity:“Shift-left means running tests as early as possible, close to the code. The goal is quick feedback. But orchestration often involves more complexity—more setup, broader coverage—and that takes longer to run.So those two trends can pull in different directions: speed versus depth.”The path forward, he argues, lies in test design and execution architecture:“We’re pushing testing left and improving the speed of execution. That’s happening through more efficient test design, better hardware, and—importantly—parallelization.Parallelization is key. If we want fast feedback loops and shift-left execution, we need to run tests in parallel. For that to work, tests must be independent. That ties back to an earlier point I made—test independence isn’t just a nice-to-have. It’s essential for scalable orchestration.”“So I think test orchestration is evolving in a healthy direction. We’re getting both faster and broader at the same time. And that’s making CI/CD pipelines more scalable and effective overall.”💡What This Means for YouPrioritize test independence for parallelization: To scale reliably in CI/CD, design tests that don’t share state. This is a prerequisite for fast, parallel execution and essential for shift-left strategies to succeed at scale.Use AI tools to accelerate, not replace, expertise: Tools like Postbot can speed up repetitive tasks, but they’re most effective in the hands of experienced testers. Treat AI as a companion to structured thinking—not a shortcut for understanding.Be cautious with reusable scripts: Shared logic can improve maintainability, but overuse increases fragility. Mock where appropriate, and abstract only what truly reduces duplication without harming readability.Enforce contracts early through CI: Combine schema-first design with early validation in pull requests. 
Postman’s Spec Hub and CLI support this model, helping teams catch errors before they spread downstream.Adapt your strategy to protocol complexity: REST, gRPC, and GraphQL each demand different approaches to coverage and validation. Understand the shape of your APIs—and tailor your tooling, mocks, and tests accordingly.If you are looking to implement the principles discussed in our editorial—from contract-first design to CI integration, Westerveld’s book, API Testing and Development with Postman, offers a clear, hands-on walkthrough. Here is an excerpt from the book which explains how contract testing verifies that APIs meet agreed expectations and walks you through setting up and validating these tests in Postman using OpenAPI specs, mock servers, and automated tooling.Expert Insight: Using Contract Testing to Verify an APIAn Excerpt from "Chapter 13: Using Contract Testing to Verify an API" in the book API Testing and Development with Postman, Second Edition by Dave Westerveld (Packt, June 2024)In this chapter, we will learn how to set up and use contract tests in Postman, but before we do that, it’s important to make sure that you understand what they are and why you would use them. So, in this section, we will learn what contract testing is. We will also learn how to use contract testing and then discuss approaches to contract testing – that is, both consumer-driven and provider-driven contracts. To kick all this off, we are going to need to know what contract testing is. So, let’s dive into that.What is contract testing?…Contract testing is a way to make sure that two different software services can communicate with each other. Often, contracts are made between a client and a server. This is the typical place where an API sits, and in many ways, an API is a contract. It specifies the rules that the client must follow in order to use the underlying service. As I’ve mentioned already, contracts help make things run more smoothly. It’s one of the reasons we use APIs. We can expose data in a consistent way that we have contractually bound ourselves to. By doing this, we don’t need to deal with each user of our API on an individual basis and everyone gets a consistent experience.However, one of the issues with an API being a contract is that we must change things. APIs will usually change and evolve over time, but if the API is the contract, you need to make sure that you are holding up your end of the contract. Users of your API will come to rely on it working in the way that you say it will, so you need to check that it continues to do so.When I bought my home, I took the contract to a lawyer to have them check it over and make sure that everything was OK and that there would be no surprises. In a somewhat similar way, an API should have some checks to ensure that there are no surprises. We call these kinds of checks contract testing. 
An API is a contract, and contract testing is how we ensure that the contract is valid, but how exactly do you do that?

Read the Complete Excerpt

API Testing and Development with Postman, Second Edition by Dave Westerveld (Packt, June 2024) covers everything from workflow and contract testing to security and performance validation. The book combines foundational theory with real-world projects to help developers and testers automate and improve their API workflows.

Use code POSTMAN20 for 20% off at packtpub.com.

Get the Book

🛠️Tool of the Week⚒️

Bruno 2.3.0 — A Git-Native API Client for Lightweight, Auditable Workflows

Bruno is an open source, offline-first API client built for developers who want fast, version-controlled request management. The latest release, version 2.3.0 (May 2025), adds capabilities that push it further into production-ready territory:

OAuth2 CLI Flows: Streamlined authentication for secure endpoints.

Secrets Integration: Native support for AWS Secrets Manager and Azure Key Vault.

OpenAPI Sync: Improved support for importing and validating OpenAPI specs.

Dev-Centric Design: Files are stored in plain text, organized by folder, and easy to diff in Git.

It’s a strong fit for small teams, CI/CD testing, or cases where you want to keep everything under version control—without a heavyweight UI.

Westerveld on Bruno

“I recently tried Bruno. I liked it—I thought their approach to change management was really well designed. But it didn’t support some of the features I rely on. I experimented with it on a small project, but in the end, I decided I still needed Postman for my main workflows.”

“That said, I still open Bruno now and then. It’s useful, simple, and interesting—but we’re not ready to adopt it team-wide.”

Westerveld's advice: evaluate new tools with clear use cases in mind. Bruno may not replace your primary API platform overnight, but it’s a valuable addition to your workflow toolkit—especially for Git-native or OpenAPI-first teams.

Read more about Bruno

📰 Tech Briefs

2024 State of the API Report: Postman’s 2024 State of the API report reveals that 74% of teams now follow an API-first approach, linking it to faster API delivery, improved failure recovery, rising monetization, and growing reliance on tools like Postman Workspaces, Spec Hub, and Postbot to navigate collaboration, governance, and security challenges.

The MCP Catalog: Postman’s MCP Catalog offers a live, collaborative workspace to discover and test Model Context Protocol (MCP) servers from verified publishers like Stripe, Notion, and Perplexity—enabling developers to prototype LLM-integrated tools quickly using ready-to-run Postman Collections and JSON-RPC 2.0 examples.

If an AI agent can’t figure out how your API works, neither can your users: This article argues that improving developer experience (DX) for LLM-powered agents (AX) is now table stakes, advocating for consistent design, clear docs, actionable errors, and golden-path smoke tests as shared foundations for both human and machine usability.

15 Best API Testing Tools in 2025: Free and Open-source: Reviews 15 tools covering both established options like Postman, SoapUI, and JMeter, as well as emerging platforms such as Apidog, which offers an all-in-one solution for API design, testing, and mocking—positioning itself as a powerful alternative to fragmented toolchains.

The new frontier of API governance: Ensuring alignment, security, and efficiency through decentralization: Decentralized API governance replaces rigid control with shared responsibility, combining design-time standards and runtime enforcement—augmented by AI—to enable secure, scalable, and autonomous API development across distributed teams.

That’s all for today. Thank you for reading this issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.

Take a moment to fill out this short survey—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.

We’ll be back next week with more expert-led content.

Stay awesome,

Divya Anne Selvaraj
Editor in Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.
Divya Anne Selvaraj
22 May 2025

Deep Engineering #1: Patrice Roy on Modern Memory Management in C++

What RAII, lifetime profiles, and memory-safe languages mean for your codebase

Hi, Welcome to the very first issue of Deep Engineering.

With memory safety behind more than 70% of all known security vulnerabilities (CVEs), the push toward safer programming has become a matter of urgency. Should we rewrite in Rust and Go, or modernize how we write C++?

To answer this question, we turned to Patrice Roy—author of C++ Memory Management, long-time member of the ISO C++ Standards Committee, and veteran educator with nearly three decades of experience training systems programmers.

You can watch the full interview and read the full transcript here—or keep reading for our distilled take on what modern memory management should look like in practice.

Sign Up | Advertise

RAII and Ownership: Type-Driven Memory Management in Modern C++ with Patrice Roy

One of the most important lessons in modern C++ is clear: "avoid manual memory handling if you can." As Patrice Roy explains, C++’s automatic storage and Resource Acquisition Is Initialization (RAII) mechanisms “work really well” and should be the first tools developers reach for.

Modern C++ favors type-driven ownership over raw pointers and new/delete. Smart pointers and standard containers make ownership explicit and self-documenting. For example, std::unique_ptr signals sole ownership in the code itself—eliminating ambiguity about responsibility. As Roy puts it:

“You don’t have to ask who will free the memory—it’s that guy. He’s responsible. It’s his job.”

Shared ownership is handled by std::shared_ptr, with reference-counted lifetime management. The key idea, Roy stresses, is visibility: ownership should be encoded in the code, not left to comments or convention. This design clarity eliminates entire classes of memory bugs.

The same principle applies to Standard Library containers. Types like std::vector manage memory internally—allocation, deallocation, resizing—so developers can focus on program logic, not logistics. RAII and the type system eliminate leaks, double frees, and dangling pointers, and improve exception safety by guaranteeing cleanup during stack unwinding.

As C++ veteran Roger Orr quipped, “The most beautiful line of C++ code is the closing brace,” because it signals the automatic cleanup of all resources in scope.

The takeaway is simple: default to smart pointers and containers. Use raw memory only when absolutely necessary—and almost never in high-level code.

Knowing When to Go Manual (and How to Do It Safely)

Manual memory management still has its place in C++, especially in domains where performance, latency, or control over allocation patterns is critical. But as Roy emphasizes, developers should measure before reaching for low-level strategies:

“The first thing you should do is measure. Make sure the allocator or memory pool you already have doesn’t already do the job. If you're spending time on something, it has to pay off.”

He cites high-frequency trading as an example where even small delays can be unacceptable:

“Say you’re working in the finance domain, and you have nanosecond-level constraints because you need to buy and sell very fast—then yes, sometimes you’ll want more control over what’s going on.”

In such cases, allocation must be avoided during critical execution windows. One option is to pre-allocate memory buffers on the stack.

Modern C++ offers fine-grained control through allocator models.
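To make that concrete, here is a minimal sketch (ours, not Roy's) of stack pre-allocation using the C++17 std::pmr facilities discussed next; the 4 KB buffer size and the process_ticks function are illustrative assumptions, not code from the interview:

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Sketch: a fixed stack buffer backs a pmr vector, so pushes that fit in the
// buffer never touch the global heap during the hot path.
void process_ticks() {
    std::array<std::byte, 4096> buffer{};                        // pre-allocated on the stack
    std::pmr::monotonic_buffer_resource arena{buffer.data(), buffer.size()};

    std::pmr::vector<int> ticks(&arena);   // allocator held as a member, not baked into the type
    ticks.reserve(512);                    // 512 ints fit comfortably in the 4 KB buffer

    for (int i = 0; i < 512; ++i) {
        ticks.push_back(i);                // no heap allocation in this loop
    }
}   // the arena is torn down in one step when it goes out of scope
```

If the buffer is ever exhausted, monotonic_buffer_resource falls back to its upstream resource (the default heap), so the sketch degrades gracefully rather than failing outright.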
Roy contrasts the traditional type-based model with the polymorphic memory resources (PMR) model introduced in C++17:

“Since C++17, we’ve had the PMR (Polymorphic Memory Resource) model... a PMR vector has a member—a pointer to its allocator—instead of having it baked into the type.”

While PMR introduces a layer of indirection via virtual function calls, Roy notes that the overhead is usually negligible:

“Allocation is a costly operation anyway. So the indirection of a virtual function call isn’t much of a cost—it’s already there in the background.”

But when even that cost is too high, the traditional model may be more appropriate:

“If you're in a domain where nanoseconds matter, even that indirection might be too much. In that case, the traditional model... may be a better choice, even if you have to write more code.”

Roy’s guidance is clear: measure first, optimize only when necessary, and understand the trade-offs each model presents.

Trends and Tools for Memory Safety in C++

Despite decades of hard-won expertise, C++ developers still face memory safety risks—from dangling references and buffer overruns to subtle use-after-free bugs. The good news: the C++ ecosystem is evolving to tackle these risks more directly, through improved diagnostics, optional safety models, and support from both compilers and hardware.

Lifetime Safety, Profiles, and Contracts

Roy identifies dangling references as one of the most persistent and subtle sources of undefined behavior in C++:

“The main problem we still have is probably what we call dangling references... Lifetime is at the core of object and resource management in C++.”

Even modern constructs like string_view can trigger lifetime errors, particularly when developers return references to local variables or temporaries. To address this, the ISO C++ committee has launched several initiatives focused on improving lifetime safety.

Roy highlights ongoing work by Herb Sutter and Gašper Ažman (P3656 R1) to introduce lifetime annotations and static analysis to make these bugs less likely:

“They’re trying to reduce undefined behavior and make lifetime bugs less likely.”

The C++ Core Guidelines already define an optional Lifetime Safety Profile, which flags unsafe lifetime usage patterns in supporting analyzers. This fits into a broader trend toward compiler-enforced profiles—opt-in language subsets proposed by Bjarne Stroustrup that would strengthen guarantees around type safety, bounds checking, and lifetimes.

Roy also mentions a proposal of his for C++29, allowing developers to mark ownership transfer explicitly in function signatures—reinforcing ownership visibility and lifetime clarity.

Alongside profiles, contracts are expected in C++26. These language features will allow developers to specify preconditions and postconditions directly in code:

“Let you mark preconditions and postconditions in your functions... written into the code—not just as prose.”

While not limited to memory management, contracts contribute to overall safety by formalizing intent and reducing the likelihood of incorrect usage.

Tools for Safer C++

Alongside language improvements, developers today have access to a mature suite of static and runtime tools for detecting memory errors.

Sanitizers: First-Line Defenses

Sanitizers have become essential for modern C++ development. Tools like AddressSanitizer (ASan), MemorySanitizer (MSan), and ThreadSanitizer (TSan) instrument the compiled code to detect memory bugs during testing.
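As a quick illustration (a minimal sketch of ours, not an example from the interview; the file name is just a placeholder), this is the kind of dangling access AddressSanitizer reports as a heap-use-after-free when the program is built with -fsanitize=address:

```cpp
#include <iostream>
#include <vector>

// Build and run with the sanitizer enabled, e.g.:
//   g++ -std=c++17 -g -fsanitize=address dangling.cpp && ./a.out
// (clang++ accepts the same flags)
int main() {
    std::vector<int> values{1, 2, 3};
    int& first = values[0];      // reference into the vector's current heap buffer
    values.resize(10'000);       // growth reallocates and frees the old buffer
    std::cout << first << '\n';  // ASan reports this read as heap-use-after-free
}
```

Without instrumentation the read may appear to work, which is exactly why bugs of this kind are easy to miss in ordinary testing.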
Roy endorses their use—even if he doesn’t run them constantly:

“They’re awesome… I don’t use them much... I think everyone should use them once in a while... They should be part of everyone’s test process.”

He encourages developers to experiment and weigh the costs.

Compiler Warnings and Static Analysis

Roy also recommends increasing compiler warning levels to catch memory misuse early:

“If you're using Visual Studio, try /W4... Maybe not /Wall with GCC, because it's too noisy, or with Clang—but raise the warning levels a bit.”

Static analysis tools like the Clang Static Analyzer and Coverity inspect code paths without execution and flag issues such as memory leaks, double frees, and buffer overruns.

Hardware Support: MTE and Beyond

On the hardware front, ARM’s Memory Tagging Extension (MTE) offers runtime memory validation through tagged pointers. Available on ARMv9 (e.g. recent Android devices), MTE can catch use-after-free and buffer overflow bugs with minimal runtime impact.

Where MTE isn't available, lightweight runtime tools help fill the gap. Google’s GWP-ASan offers probabilistic detection of heap corruption in production, while Facebook’s CheckPointer (in Folly) builds bounds-checking into smart pointer types.

Memory-Safe Languages: The New Paradigm

No discussion of memory management today is complete without addressing the elephant in the room: memory-safe languages. Two prominent examples are Rust and Go, which take almost opposite approaches to solve the same problem. “The genius of Go is that it has a garbage collector. The genius of Rust is that it doesn’t need one,” as John Arundel of Bitfield Consulting cleverly puts it.

Rust: Memory Safety by Design

Rust is designed from the ground up to eliminate classes of memory errors common in languages like C and C++. However, Rust’s safety comes with a learning curve, especially for developers accustomed to manually managing lifetimes. Despite this, according to JetBrains' 2025 Developer Ecosystem Survey, Rust has seen significant growth, with over 2.2 million developers using it in the past year and 709,000 considering it their primary language. While Rust's syntax can be initially challenging, many developers find that its multi-paradigm nature and strong safety guarantees make it a robust choice for complex systems development.

Researchers have also proposed refinement layers atop C2Rust that automatically reduce the use of unsafe code and improve idiomatic style. One such technique, described in a 2022 IEEE paper, uses TXL-based program transformation rules to refactor translated Rust code—achieving significantly higher safe-code ratios than raw C2Rust output.

As one developer quoted by JetBrains put it, Rust is no longer just a safer C++; it's “a general-purpose programming language” powering everything from WebAssembly to command-line tools and backend APIs. And for those coming from legacy C or C++ environments, Rust doesn't demand a full rewrite—interoperability, through FFI and modular integration, allows new Rust code to safely coexist with existing infrastructure.

Go: Simplicity Through Runtime Safety

Go adopts a runtime approach to memory safety, deliberately removing the need for developers to manage memory manually. The Go team’s recent cryptography audit—conducted by Trail of Bits and covering core packages like crypto/ecdh, crypto/ecdsa, and crypto/ed25519—underscored this design strength. The auditors found no exploitable memory safety issues in the default packages.
Only one low-severity issue was found in the legacy Go+BoringCrypto integration, which required manual memory management via cgo and has since been deprecated. As the Go authors noted, “we naturally rely on the Go language properties to avoid memory management issues.”

By sidestepping manual allocation and pointer arithmetic, Go reduces the attack surface for critical bugs like buffer overflows and dangling pointers. While garbage collection does introduce latency trade-offs that make Go less suitable for hard real-time systems, its safety-by-default model and well-tested cryptographic APIs make it ideal for server-side development, cloud infrastructure, and security-sensitive applications where predictable correctness matters more than raw latency.

Go’s simplicity also extends to API design. The audit highlighted the team’s emphasis on clarity, safety, and minimalism: prioritizing security over performance, avoiding complex assembly where possible, and keeping code highly readable to support effective review and auditing.

The Status of C and C++

The rise of memory-safe languages like Rust and Go has put C and C++ under scrutiny—especially in safety-critical domains. The U.S. White House Office of the National Cyber Director now recommends using memory-safe languages for new projects, citing their ability to prevent classes of vulnerabilities inherent in manual memory management.

But replacing C and C++ wholesale is rarely feasible. Most real-world systems will continue to mix languages, gradually modernizing existing C++ code with safer idioms and tooling.

Modern C++ is adapting. While the language remains low-level, initiatives like the Core Guidelines, contracts, and lifetime safety proposals are making it easier to write safer code.

💡What This Means for You

Default to RAII and Smart Pointers in C++: Use unique_ptr, shared_ptr, and standard containers to make ownership explicit. Avoid raw new/delete unless absolutely necessary—and never in high-level code.

Measure Before You Optimize: Before adopting custom allocators or manual strategies, profile your code. Built-in allocators and containers are often sufficient and safer.

Use the Right Allocator Model for the Job: Favor PMR for flexibility. Use the traditional model only if profiling shows the indirection cost matters.

Start with Safety-First Defaults: Structure your C++ projects around safe idioms. Apply the Core Guidelines, and integrate sanitizers into your CI pipeline to catch memory errors early.

Raise Compiler Warnings: Turn on high warning levels (/W4 for MSVC, -Wall -Wextra for Clang/GCC) and treat warnings as errors to surface issues before they reach production.

Experiment with Safety Profiles and Contracts: Stay ahead by adopting upcoming C++ features like lifetime annotations and design-by-contract support (C++26 and beyond).

Don’t Rely on Comments—Express Ownership in Code: As Roy stresses, ownership must be visible in the code itself. Let types, not prose, determine who frees memory.

If you found the insights in our editorial useful, Roy’s book, C++ Memory Management (Packt, March 2025), offers a much deeper exploration, including tips on avoiding common pitfalls and embracing C++17/20/23 features for better memory handling.
Here is an excerpt from the book which explains arena-based memory management in C++, using a custom allocator for a game scenario to demonstrate how preallocating and sequentially allocating memory can reduce fragmentation and improve performance.

Expert Insight: Arena-based memory management

An Excerpt from "Chapter 10: Arena-Based Memory Management and Other Optimizations" in the book C++ Memory Management by Patrice Roy (Packt, March 2025)

The idea behind arena-based memory management is to allocate a chunk of memory at a known moment in the program and manage it as a “small, personalized heap” based on a strategy that benefits from knowledge of the situation or of the problem domain.

There are many variants on this general theme, including the following:

In a game, allocate and manage the memory by scene or by level, deallocating it as a single chunk at the end of said scene or level. This can help reduce memory fragmentation in the program.

When the conditions in which allocations and deallocations are known to follow a given pattern or have bounded memory requirements, specialize allocation functions to benefit from this information.

Express a form of ownership for a group of similar objects in such a way as to destroy them all at a later point in the program instead of doing so one object at a time.

The best way to explain how arena-based allocation works is probably to write an example program that uses it and shows both what it does and what benefits this provides. We will write code in such a way as to use the same test code with either the standard library-provided allocation functions or our own specialized implementation, depending on the presence of a macro, and, of course, we will measure the allocation and deallocation code to see whether there is a benefit to our efforts.

Read the Complete Excerpt

In this hands-on guide to mastering memory in modern C++, Roy covers techniques to write leaner and safer C++ code, from smart pointers and standard containers to custom allocators and debugging tools. He also dives into examples across real-time systems, games, and more, illustrating how to balance performance with safety.

Use code MEMORY20 for 20% off at packtpub.com

Get the Book

🛠️Tool of the Week⚒️

Valgrind 3.25.0 — Classic Memory Debugging, Now with Broader Platform Support

Valgrind has long been a staple for memory debugging in C and C++ applications.
The latest release, version 3.25.0, brings significant enhancements:

Expanded Platform Support: Now includes RISCV64 Linux, ARM/Android, and preliminary support for macOS 10.13.

Performance Improvements: Introduces GDB “x” packet support for faster memory reads and zstd-compressed debug sections.

Enhanced Tooling: Continues to offer tools like Memcheck for detecting memory leaks and invalid accesses, and Massif for heap profiling.

Read the Valgrind User Manual

📰 Tech Briefs

2025 EuroLLVM - Recipe for Eliminating Entire Classes of Memory Safety Vulnerabilities in C and C++: Apple is addressing memory safety in C-based languages by combining compiler-enforced programming models, developer annotations, and runtime checks to eliminate entire classes of vulnerabilities without requiring a full rewrite in memory-safe languages.

Secure by Design Alert: Eliminating Buffer Overflow Vulnerabilities: CISA and the FBI issued a Secure by Design alert urging software manufacturers to eliminate buffer overflow vulnerabilities—designating them as "unforgivable" defects—and recommending memory-safe languages, runtime checks, and secure development practices to prevent exploitation and reduce systemic memory safety risks.

Rustls Server-Side Performance: Rustls 0.23.17, an open source TLS library written in Rust that provides a memory-safe alternative to C-based libraries like OpenSSL, has now improved server-side TLS performance by scaling efficiently across cores, reducing handshake latency, and minimizing contention in ticket resumption.

Tagged Pointers for Memory Safety: Explains how to implement memory-safe tagged pointers in C++—using runtime-generated tags to detect use-after-free errors—with minimal performance overhead and compatibility with standard allocators.

Taking a Look at Database Disk, Memory, and Concurrency Management: Offers a comprehensive, hands-on walkthrough of how modern databases manage disk I/O, memory, transactions, and concurrency—covering buffer pools, write-ahead logging, and locking mechanisms—through a simplified database implementation in Go.

That’s all for today. Thank you for reading the first issue of Deep Engineering. We’re just getting started, and your feedback will help shape what comes next.

Take a moment to fill out this short survey—as a thank-you, we’ll add one Packt credit to your account, redeemable for any book of your choice.

We’ll be back next week with more expert-led content.

Stay awesome,

Divya Anne Selvaraj
Editor in Chief, Deep Engineering

Take the Survey, Get a Packt Credit!

If your company is interested in reaching an audience of developers, software engineers, and tech decision makers, you may want to advertise with us.