
AI and the Resurrection of Systems Engineering
Engineering
Apr 30, 2026
Therac-25, 1985. A radiation therapy machine massively overdosed six patients; at least three died. The cause: a shared memory flag that a programmer assumed would be set instantaneously. A timing assumption, undocumented and unchecked. Working exactly as programmed. That was the problem [Leveson & Turner, 1993].
In 1994, US law enforcement struck a deal with telecom carriers: build wiretap infrastructure into the network. Thirty years later, that architecture assumed a perimeter that no longer existed. October 2024: thousands of Cisco switches remained unpatched against a known critical vulnerability. China's Ministry of State Security walked through the open door, harvested credentials, pivoted into management networks — and found the FBI's own surveillance servers. They didn't crack encryption. They read what federal agents were reading, in real time. Requirements-execution failure: no binding patch policy, no defence-in-depth, no link between threat modelling and operations [Salt Typhoon, 2024].
Cruise, October 2023. A driverless robotaxi dragged a pedestrian twenty feet. The motion-prediction model correctly detected her – and estimated she would stop sliding. No requirement had been written for a pedestrian mid-slide. $1.5M NHTSA penalty. Entire fleet recalled [NHTSA, 2024].
Three incidents. Forty years apart. Structurally identical: a requirement nobody wrote, an assumption nobody traced.
After soldering a couple of hundred wires and hand-pressing four dozen hydraulic joints, I powered up a brake-by-wire HIL rig at Jaguar Land Rover for the first time. The hydraulic pump threw enough EM noise to crash the control laptop on the first cycle. The only thing standing between me and 160 bar of brake fluid finding a new home was a kill switch my Functional Safety team had written into the design. That requirement is perhaps the reason I'm able to write this today.
AI doesn't eliminate these failure modes. It adds new ones – and, for the first time, gives us tools powerful enough to actually stay ahead of them.
[Chart: Safety Standards vs. AI Deployment Speed – A 12-Year Chase]

The Paradigm That Broke the Model
Andrej Karpathy mapped three eras of software [Karpathy, 2017; 2023]:
Software 1.0 – deterministic code. You write the rules; the machine obeys. The V-Model was built for this and works brilliantly in it.
Software 2.0 – learned behaviour. Neural networks produce probabilistic outputs whose failure modes are emergent. You cannot enumerate what a model will do at the edge of its training distribution.
Software 3.0 – language models as substrate. "Pick up the red object near the door" is a program. Behaviour is plastic, compositional, context-dependent.
An AI-native system collapses all three into a single product. A humanoid robot's joint controller is 1.0 (deterministic C code), its perception stack is 2.0 (a trained net), its task planner is 3.0 (an LLM reasoning over context). All running in parallel, in real time, inside a body with inertia, thermal envelopes, and actuator limits. Rodney Brooks put it plainly in 2002: "The body is not a peripheral. It is the system." [Brooks, 2002]. In 2026, the system includes the model weights.
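A minimal sketch of that collapse in Python, with invented class names standing in for real inference and control code:

```python
# A minimal sketch: three software paradigms in one robot stack.
# Every class and function name here is illustrative, not a real API.

class JointController:
    """Software 1.0: deterministic control law, hard real-time (~1 kHz)."""
    KP, KD = 40.0, 2.5

    def torque(self, target_pos, pos, vel):
        # Same inputs always produce the same output: fully enumerable behaviour.
        return self.KP * (target_pos - pos) - self.KD * vel

class PerceptionNet:
    """Software 2.0: a trained network (~30 Hz). Output is probabilistic, and
    behaviour at the edge of the training distribution is not enumerable."""
    def detect(self, frame):
        return {"object": "red_cube", "confidence": 0.93}  # stand-in inference

class TaskPlanner:
    """Software 3.0: the natural-language instruction *is* the program (~1 Hz)."""
    def plan(self, instruction, scene):
        return [f"move_to({scene['object']})", "grasp()"]  # stand-in LLM call

# Three loops, three time scales, one body with inertia and actuator limits.
scene = PerceptionNet().detect(frame=None)
steps = TaskPlanner().plan("Pick up the red object near the door", scene)
tau = JointController().torque(target_pos=0.50, pos=0.48, vel=0.10)
print(steps, round(tau, 3))  # ['move_to(red_cube)', 'grasp()'] 0.55
```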
When Rhoda AI emerged from stealth with a $450M raise, the technical summary read: "a closed-loop system that continuously updates actions as conditions change." That is a feedback controller. The plant model learned from 100 million internet videos instead of a dyno test – same architecture, new vocabulary, thirty-year-old failure modes [Shah, 2025a].
The industry's most technically sophisticated companies have noticed. When John Ternus became Apple's CEO in April 2026, the $4 trillion company placed a mechanical engineer — the person who oversaw M1 through M5 — at the top [Apple, 2026]. Jensen Huang calls his approach "Extreme Co-Design": every stack layer designed simultaneously, from silicon to networking to software. The Rubin GPU platform's 10x inference cost reduction came from this method [NVIDIA, 2025]. Stanford's CS153, "Frontier Systems" — dubbed "AI Coachella" — assembled Huang, Altman, Karpathy, and Satya Nadella to make one argument: you cannot optimise any layer of an AI stack without understanding how it constrains every other [CS153, 2026]. That argument has a forty-year-old name. Systems thinking.
Three Paradigms, Three Requirements Problems

The Moose Test Problem
Tesla's Data Engine is Requirements 2.0: requirements discovered through fleet behaviour rather than authored upfront [Karpathy, 2017]. At Tesla's scale — OTA regressions surfaced within 48 hours — it is remarkable. It is also not a process advantage. It is a capital advantage that took fifteen years to build, and it only works for requirements that data can discover.
Here is where it fractures. A Tesla Model 3 encounters a 1,000-kilogram moose on a twisty Scandinavian road at night. If the Data Engine has never seen this scenario, the first car that does becomes the data point. The moose test was invented by humans: Swedish motoring journalists have run it since the 1970s, Volvo and Saab quickly turned it into an internal sign-off requirement, and today it is codified as the ISO 3888-2 evasive-manoeuvre standard used across the industry.
Two categories of requirements exist. Those discoverable from fleet data — sensor degradation curves, braking distance vs tyre temperature. And those that must be written before hardware commits — how a medical robot behaves when a patient grabs the arm, what happens when GPS and IMU disagree. Neither covers the other. Mixing them up is expensive.
The V-Model Is Dead and Here's What Replaced It
The V-Model is not merely obsolete — it is invalid for AI-native systems. Obsolete tools still work on their original problem; invalid ones fail on first principles. Four forces drove the invalidation.
Non-deterministic behaviour – a neural network's failure surface at the edge of the training distribution is not enumerable. The limit is mathematical, not a tooling gap [arXiv:2308.05381].
Iterative development – in ML, training data shapes requirements, model behaviour rewrites the spec, and edge cases update the distribution. The loop is circular; the V-Model demands a line [Ullrich et al., 2024].
Simulation-to-real divergence – every deployment surfaces behaviours the sim didn't cover; the V-Model has no formalism for this.
OTA updates – each patch reopens the "terminal" verification event. Every update is a new product; the snapshot is stale by morning.
A fifth suspect often gets accused: toolchain fragmentation. Acquit it. Automotive and aerospace ran the V-Model competently on fragmented tools for four decades. What fragmentation does is make the transition to anything better painful — the new model requires live traceability across a coherent data thread. This is now solvable. LLM-powered requirements tools traverse fragmented stacks, maintain bi-directional traceability, and surface the full change-impact chain in seconds [Siemens, 2025].
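What that traversal amounts to underneath is a graph walk over the trace data. A minimal sketch, with an invented artefact graph (real tools assemble this from live engineering data and add the natural-language layer on top):

```python
# A minimal sketch of change-impact traversal over a requirements trace graph.
# Artefact IDs and link structure are invented for illustration.

from collections import deque

# Directed edges: artefact -> artefacts that depend on it.
TRACE = {
    "REQ-BRK-041":  ["ARCH-HYD-007", "TEST-HIL-112"],
    "ARCH-HYD-007": ["SW-ABS-MOD-3", "HW-VALVE-V2"],
    "SW-ABS-MOD-3": ["TEST-HIL-112", "TEST-SIL-058"],
    "HW-VALVE-V2":  [],
    "TEST-HIL-112": [],
    "TEST-SIL-058": [],
}

def impact_of(change_root: str) -> list[str]:
    """Breadth-first walk: everything downstream of a changed artefact."""
    seen, queue = set(), deque([change_root])
    while queue:
        node = queue.popleft()
        for dep in TRACE.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(seen)

# Changing one brake requirement surfaces the full downstream chain:
print(impact_of("REQ-BRK-041"))
# ['ARCH-HYD-007', 'HW-VALVE-V2', 'SW-ABS-MOD-3', 'TEST-HIL-112', 'TEST-SIL-058']
```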
[Chart: The Traceability Tax – 15–20% of Every Complex Programme Budget]

What replaces the V-Model is the Reuleaux Triangle — three interlocked cycles. The convex outer boundary encodes the irreversibility of physical architecture within a generation. You can reshape a neural network. You cannot unbolt a chassis.

Loop B (hardware generation, months-to-years) — SoC selection, actuator specs, monocoque architecture: once committed, not revisable.
Loop A (SW/AI sprint, weeks) — absorb a wrong turn, retrain, redeploy.
Live Requirements — fleet data flowing continuously from deployment back into development.
The Data Flywheel at the top is AI's genuinely new contribution: requirements discovered from fleet behaviour at scale, entirely absent from the classical V.
Probabilistic threshold gauges replace binary pass/fail with provable coverage bounds. The World Model / AI Assist ellipse at the centre runs shadow-mode simulations against every proposed change — hardware-software conflicts surfaced before they cost real money or lives.
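A minimal sketch of such a shadow-mode gate, with invented policies, scenario format, and tolerance:

```python
# A minimal sketch of a shadow-mode gate. Policies, scenario format, and the
# divergence tolerance are invented for illustration.

def shadow_gate(current_policy, candidate_policy, scenarios, max_divergence=0.02):
    """Replay logged scenarios through both policies; block the candidate
    if its decisions diverge from the current release too often."""
    divergent = sum(
        1 for s in scenarios if candidate_policy(s) != current_policy(s)
    )
    rate = divergent / len(scenarios)
    return {"divergence": round(rate, 3), "approved": rate <= max_divergence}

# Toy policies: brake decision as a function of obstacle distance.
current   = lambda s: "BRAKE" if s["distance_m"] < 30 else "CRUISE"
candidate = lambda s: "BRAKE" if s["distance_m"] < 28 else "CRUISE"  # proposed change

scenarios = [{"distance_m": d} for d in range(20, 40)]
print(shadow_gate(current, candidate, scenarios))
# {'divergence': 0.1, 'approved': False} -- blocked before it reaches hardware;
# the two divergent scenarios go to engineering review.
```

The gate compares decisions, not internals, so it applies equally to 1.0, 2.0, and 3.0 components.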
What Failure Actually Looks Like – and Why It's Now Traceable
Interface failures. A robot arm fastening a bolt encounters 1 mm of part variation. Trained on nominal geometry, the policy misses it. Torque reading: within spec. The car leaves the line cross-threaded. No fault code [Shah, 2025b]. FMEA can't catch a semantic failure — it requires specifying what failure looks like in advance. Tesla FSD still fails when camera housings fog at specific dew points [NHTSA, 2022]: a thermal-optical interface requirement that was never written.
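For the cross-threading case specifically, fastening lines have an established countermeasure: torque-angle signature monitoring. A cross-threaded bolt tends to reach target torque at an abnormal angle, so an angle window catches what a torque window misses. A minimal sketch, with invented thresholds:

```python
# A minimal sketch of torque-angle signature checking. Windows and readings
# are invented; real lines calibrate them per joint from golden runs.

def fastening_ok(final_torque_nm, final_angle_deg,
                 torque_window=(18.0, 22.0), angle_window=(540.0, 620.0)):
    """Torque alone can pass while the joint is cross-threaded; a
    cross-threaded bolt typically hits target torque at an abnormal angle."""
    torque_ok = torque_window[0] <= final_torque_nm <= torque_window[1]
    angle_ok = angle_window[0] <= final_angle_deg <= angle_window[1]
    return torque_ok and angle_ok

print(fastening_ok(20.1, 585.0))  # True  -- nominal joint
print(fastening_ok(20.3, 310.0))  # False -- torque in spec, angle is not:
                                  # the failure the torque-only check missed
```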
Tacit knowledge failures. In 2024, Waymo recalled 1,212 robotaxis after crashes into poles, gates, and chains. Root cause: an undocumented labelling call from years earlier — pole-like objects between non-road-edge surfaces could be deprioritised. Never written down, it propagated into 1,212 deployed vehicles [Waymo, 2024; NHTSA, 2025]. OTA fixed the software. Missing requirements cannot be patched retroactively.
[Chart: Root Cause Anatomy – What Actually Kills Physical AI Programs]

The liability cliff. ISO 26262, DO-178C, and ARP 4754A were built for Software 1.0. None cover AI-generated behaviour. The EU AI Act (effective August 2026) mandates auditability for high-risk AI [EU, 2024]. Modern AI-assisted traceability tools maintain exactly this audit chain automatically — requirement origin, implementation linkage, test evidence — queryable in natural language. Compliance becomes explainability.
Dark requirements. AI systems can now generate their own requirements from field telemetry — powerful, and dangerous. A requirement written by a model, implemented by a model, verified by a model, with no human able to explain the chain, is an audit failure waiting to happen [ISO/IEC 42001, 2023]. Treat model-generated requirements as hypotheses. Human sign-off before commitment. Traceability proving it occurred.
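A minimal sketch of that discipline as code, with an invented schema: the requirement stays a hypothesis until a named human signs it off, and the decision lands in an audit trail:

```python
# A minimal sketch of the "hypothesis until signed off" rule. States and
# fields are illustrative, not any particular tool's schema.

from dataclasses import dataclass, field
from enum import Enum, auto
from datetime import datetime, timezone

class Status(Enum):
    HYPOTHESIS = auto()   # model-generated, not yet binding
    APPROVED = auto()     # human sign-off recorded
    REJECTED = auto()

@dataclass
class Requirement:
    req_id: str
    text: str
    origin: str                      # e.g. "model:telemetry-miner-v3"
    status: Status = Status.HYPOTHESIS
    audit: list = field(default_factory=list)

    def sign_off(self, engineer: str, approved: bool):
        """Only a named human moves a requirement out of HYPOTHESIS,
        and the decision is appended to the audit trail."""
        self.status = Status.APPROVED if approved else Status.REJECTED
        self.audit.append((engineer, self.status.name,
                           datetime.now(timezone.utc).isoformat()))

req = Requirement("REQ-GEN-0042",
                  "Degrade to 0.2 m/s when wrist force exceeds 40 N.",
                  origin="model:telemetry-miner-v3")
assert req.status is Status.HYPOTHESIS   # not binding until a human signs
req.sign_off(engineer="j.doe", approved=True)
print(req.status.name, req.audit)
```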
The sign-off problem — and why it's suddenly solvable. ISO 26262 ASIL-D is elegant for deterministic systems. For a neural network controlling a robot arm, the failure surface is the product of all component distributions. You cannot enumerate it. Recent formalisation in probabilistic safety [arXiv:2506.05171] resolves this: world model + safety spec + formal verifier → P(unhandled failure) ≤ ε (e.g. 10⁻⁶/hr). You do not enumerate the moose. You prove the probability of any unhandled moose-class event is below 10⁻⁶.
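The sampling floor beneath any such bound is worth seeing once. Assuming i.i.d. trials, N failure-free runs support a claim of P(failure) ≤ ln(1/δ)/N at confidence 1 − δ (the rule of three, generalised):

```python
# A sanity check on the sampling floor beneath the epsilon-bound, assuming
# i.i.d. trials. This is the statistics alone; the cited work adds a world
# model and formal verifier on top.

import math

def trials_needed(epsilon: float, delta: float = 0.05) -> int:
    """Failure-free i.i.d. trials needed to claim P(failure) <= epsilon
    at confidence 1 - delta."""
    return math.ceil(math.log(1.0 / delta) / epsilon)

print(f"{trials_needed(1e-6):,}")        # 2,995,733 trials at 95% confidence
print(f"{trials_needed(1e-6, 1e-3):,}")  # 6,907,756 trials at 99.9% confidence
```

Three million failure-free trials per 10⁻⁶ claim is why the cited approach leans on world models and formal verification rather than brute-force fleet exposure.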
NVIDIA's GR00T N2 raises the stakes: a World Action Model that simulates outcomes before acting. When a World Action Model hallucinates, the failure lives in latent space — invisible to any requirement written against observable outputs. The practical response: instrument the gap between what the world model predicts and what sensors report, and trigger a safe state when that delta breaches a threshold. Ensemble disagreement between parallel models adds a second layer. Not a complete solution — but a measurable, certifiable step. SOTIF's logic applied to a new class of system.
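A minimal sketch of those two monitor layers, with an invented state representation and invented thresholds:

```python
# A minimal sketch of the two runtime monitors described above. State
# representation, thresholds, and the safe-state hook are all assumptions.

import statistics

def monitor_step(predicted_state, observed_state, ensemble_predictions,
                 delta_limit=0.15, disagreement_limit=0.10):
    """Layer 1: gap between world-model prediction and sensed reality.
       Layer 2: spread across an ensemble of parallel world models.
       Either breach triggers the safe state."""
    delta = abs(predicted_state - observed_state)
    disagreement = statistics.pstdev(ensemble_predictions)
    if delta > delta_limit or disagreement > disagreement_limit:
        return "SAFE_STATE"   # e.g. controlled stop, torque limit, handover
    return "NOMINAL"

# Nominal: prediction tracks sensors, ensemble agrees.
print(monitor_step(0.50, 0.52, [0.50, 0.51, 0.49]))  # NOMINAL
# Hallucination signature: model is confident, reality disagrees.
print(monitor_step(0.50, 0.90, [0.50, 0.51, 0.49]))  # SAFE_STATE
# Latent-space trouble: models disagree before anything is observable.
print(monitor_step(0.50, 0.52, [0.20, 0.75, 0.55]))  # SAFE_STATE
```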
[Chart: The Sign-Off Problem – Four Safety Paradigms, One Winner]

The Cost of Getting It Wrong — and the Speed of Getting It Right
INCOSE and NASA are precise [INCOSE, 2017; NASA, 2007]: a defect caught at requirements costs 1×. At integration: 100×. In operations: 1,500×. The Boeing 737 MAX made this concrete — MCAS behaviour was never documented to certification authorities or flight crews. 346 deaths. $20 billion in direct costs [NTSB, 2019].
The Traceability Tax: Accuris benchmarks 13,000 requirements at 1,083 hours of manual work — six engineers, one month, not writing code. Across 40,000+ projects, the Jama Software benchmark puts manual traceability at 15–20% of project budget; AI-assisted tooling cuts that by 70–90% [Accuris, 2025]. At 1,500× for defects reaching production, the ROI of eliminating that tax is not subtle.
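The arithmetic is worth running once. A minimal check (the 160-hour engineer-month is an assumption; the counts are the benchmark's):

```python
# The benchmark arithmetic, worked once. The 160-hour engineer-month is an
# assumption; requirement count and hours are from the Accuris benchmark.

REQUIREMENTS    = 13_000
MANUAL_HOURS    = 1_083
ENGINEERS       = 6
HOURS_PER_MONTH = 160

print(f"{MANUAL_HOURS / REQUIREMENTS * 60:.1f} min per requirement")  # 5.0
print(f"{MANUAL_HOURS / (ENGINEERS * HOURS_PER_MONTH):.2f} months "
      f"for a team of {ENGINEERS}")                                   # 1.13
for cut in (0.70, 0.90):
    print(f"{cut:.0%} AI reduction -> {MANUAL_HOURS * (1 - cut):,.0f} h remain")
# 70% -> 325 h; 90% -> 108 h
```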
[Chart: The 1,500× Rule – When You Find It Changes Everything]

I built a steer-by-wire driving simulator at JLR. The first time we powered the HIL+SIL+DIL rig, the latency was so unnatural the simulator was unusable. Several weeks and a bespoke networking protocol later, we had solved it. Had we written latency requirements before building the rig: days, not weeks.
The AI is not the risk. The missing requirement is. Physical AI that ships without systems engineering isn't bold — it's just Therac-25 with better marketing. The good news: for the first time, the tools are fast enough that "we move too fast for process" is no longer an excuse. It's a confession.
References
- Leveson, N. & Turner, C. (1993). "An investigation of the Therac-25 accidents." IEEE Computer, 26(7). https://ieeexplore.ieee.org/document/274940
- Salt Typhoon. (2024). US Senate Commerce Committee Hearing; FBI Director Wray statements; Wired, "China's Salt Typhoon Spied on US Wiretap Systems." https://www.wired.com/story/salt-typhoon-fbi-wiretap/
- NHTSA. (2024). Consent Order: Cruise LLC. https://www.nhtsa.gov/press-releases/consent-order-cruise-crash-reporting
- Karpathy, A. (2017). "Software 2.0." Medium. https://karpathy.medium.com/software-2-0-a64152b37c35
- Karpathy, A. (2023). YC Startup School on Software 3.0. https://www.youtube.com/watch?v=zjkBMFhNj_g
- Brooks, R. (2002). Flesh and Machines. Pantheon Books.
- Apple. (2026). "Tim Cook to become Apple Executive Chairman; John Ternus to become Apple CEO." https://www.apple.com/newsroom/2026/04/tim-cook-to-become-apple-executive-chairman-john-ternus-to-become-apple-ceo/
- NVIDIA. (2025). Rubin Platform announcement. https://nvidianews.nvidia.com/news/rubin-platform-ai-supercomputer
- CS153. (2026). Frontier Systems, Stanford University. https://cs153.stanford.edu/
- Euro NCAP. Moose Test / Elk Test methodology. https://www.euroncap.com
- NHTSA. (2022). Tesla FSD investigation reports. https://www.nhtsa.gov/vehicle-safety/automated-vehicles
- Waymo. (2024). "Voluntary recall of our previous software." https://waymo.com/blog/2024/02/voluntary-recall-of-our-previous-software
- NHTSA. (2025). Part 573 Safety Recall Report 25E-034. https://static.nhtsa.gov/odi/rcl/2025/RCLRPT-25E034-2471.PDF
- EU. (2024). EU AI Act — Regulation (EU) 2024/1689. https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:32024R1689
- INCOSE. (2017). Systems Engineering Handbook (4th ed.). https://www.incose.org
- NASA. (2007). NASA Systems Engineering Handbook. https://www.nasa.gov/seh
- NTSB. (2019). Boeing 737 MAX accident investigation. https://www.ntsb.gov/investigations/Pages/DCA19MA060.aspx
- DARPA. (2017). Explainable AI (XAI) programme. https://www.darpa.mil/program/explainable-artificial-intelligence
- ISO/IEC 42001. (2023). AI Management System. https://www.iso.org/standard/81230.html
- ISO/PAS 8800. (2024). AI safety in road vehicles. https://www.iso.org/standard/83303.html
- Siemens. (2025). Polarion LiveDocs / Teamcenter Copilot. https://polarion.plm.automation.siemens.com/products/polarion-requirements
- arXiv:2506.05171. (2025). "Towards provable probabilistic safety for scalable embodied AI systems." https://arxiv.org/abs/2506.05171
- arXiv:2509.11446. (2025). "LLMs for Requirements Engineering: A Systematic Literature Review." https://arxiv.org/abs/2509.11446
- arXiv:2308.05381. (2023). "An Exploratory Study of V-Model in Building ML-Enabled Software." https://arxiv.org/abs/2308.05381
- Ullrich, L. et al. (2024). "Expanding the Classical V-Model for the Development of Complex Systems Incorporating AI." IEEE Transactions on Intelligent Vehicles. https://arxiv.org/abs/2502.13184
- Gartner. (2026). "Gartner Predicts By 2028, Explainable AI Will Drive LLM Observability Investments to 50%." https://www.gartner.com/en/newsroom/press-releases/2026-03-30-gartner-predicts-by-2028-explainable-ai-will-drive-llm-observability-investments-to-50-percent-for-secure-genai-deployment
- Accuris. (2025). Requirements traceability benchmarks. https://www.accuristech.com
- Shah, N. (2025a). LinkedIn. https://www.linkedin.com/posts/neelshah29_physicalai-robotics-systemsengineering-activity-7437918792472666112-952G
- Shah, N. (2025b). LinkedIn. https://www.linkedin.com/posts/neelshah29_physicalai-robotics-systemsengineering-activity-7450651502903296000-aob1
- Shah, N. (2025c). LinkedIn. https://www.linkedin.com/posts/neelshah29_physicalai-gtc2026-robotics-activity-7439743543549755392-btiE
- Shah, N. (2025d). LinkedIn. https://www.linkedin.com/posts/neelshah29_functionalsafety-asil-iso26262-activity-7442659571216752640-yNhr