fix: netplan: upgrade 1.1.2 -> 1.2.1 (Fedora f44 import) to fix PID1 generator deadlock #17815
bfjelds wants to merge 1 commit into
Pull request overview
This PR upgrades netplan in Azure Linux 4.0 from 1.1.2 to 1.2.1 by repointing the Fedora dist-git import from the global f43 snapshot to a pinned Fedora f44 commit (which ships netplan 1.2.1-2). The motivation is to pull in netplan's upstream "split generate/configure" refactor (canonical/netplan PR #552), which makes the boot-time systemd generator validation-only and defers virtual-device creation to a new netplan-configure.service. This eliminates a PID 1 self-deadlock (generator-phase getgrnam("systemd-network") NSS lookup blocking on PID 1's own userdb varlink) that can freeze grubazl4 A/B rollback boots. The change is a faithful upstream import — no vendored backport — consistent with the repo's "minimal divergence from upstream" principle.
Changes:
- Adds a dedicated `base/comps/netplan/netplan.comp.toml` pinning the Fedora f44 import (`upstream-distro = fedora 44`, `upstream-commit = 66c31bcd…`) with a thorough rationale comment, and removes the inline `[components.netplan]` entry from `components.toml`.
- Regenerates `locks/netplan.lock`, the rendered `specs/n/netplan/netplan.spec` (now 1.2.1, adds `configure` binary + `netplan-configure.service`), and the `sources` SHA512; the f43-only `status_fail_cleanly.patch` is dropped while `netplan-fallback-renderer.patch` is retained.
I verified the comp pin syntax matches existing f44 pins (e.g., bash, libseccomp, stringtemplate4), the alphabetical ordering in components.toml is preserved, there are no dangling references to the deleted status_fail_cleanly.patch, and no duplicate [components.netplan] definitions remain. No concrete code-level issues were found.
One operational consideration (already documented by the author): the refactor defers dummy/bridge/bond/vlan/SR-IOV creation into netplan-configure.service, which Fedora ships preset-disabled — images relying on netplan applying such config at boot must enable that service or config silently never applies. This is inherent to the upstream change and not fixable within this component diff, but it warrants downstream image awareness.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| base/comps/netplan/netplan.comp.toml | New dedicated component pinning the Fedora f44 import for the 1.2.1 upgrade, with rationale comment. |
| base/comps/components.toml | Removes the inline [components.netplan] entry (now defined in its dedicated file); ordering preserved. |
| locks/netplan.lock | Regenerated lock pinning the new f44 upstream-commit and updated fingerprints. |
| specs/n/netplan/netplan.spec | Rendered spec for 1.2.1: drops Patch1002, adds configure binary, netplan-configure.service, and python3-setuptools BR. |
| specs/n/netplan/sources | Updates SHA512 to the 1.2.1 source tarball. |
| specs/n/netplan/status_fail_cleanly.patch | Deleted: the f43-only patch is absent in f44 and no longer referenced. |
ddstreetmicrosoft
left a comment
> PR #552 itself is 44 commits, 40 files, +2268 / -1335

I understand this is just a typo (and the AI meant canonical/netplan#552), but please review PR text before asking others to review it...since it's pretty obvious that a sudo update has nothing to do with this PR.

> The full 1.1.2 -> main delta is 83 files / +3141 -1655 (~7,625 patch lines).

This also doesn't make sense. What 'main' are you referring to? Not the 'main' branch for azl4, since that's '4.0'. Do you mean the 'main' branch from upstream netplan? That also doesn't make sense since there is no need (nor suggestion, hopefully) that we should update to the upstream netplan 'main' branch code.

> A smaller custom patch (resolve the group from /etc/group via fgetgrent instead of getgrnam, ~15-30 lines) is possible

I don't think that mentioning this alternate approach without actually including it is helpful (meaning, you should include this patch as a comment or attachment or link or something).

> but diverges from upstream and we would own it indefinitely.

This is quite obviously completely false - we would only own it for the lifetime of this stable release. Our next stable release (whenever that is) will almost certainly include netplan at or above 1.2.1, in which case we would not keep a workaround patch.

> What this PR changes

Why is this section included? Isn't the content of this explicitly obvious by just looking at the PR code? Is this just an AI slop section?

> netplan-configure.service, which Fedora ships preset: disabled

> enable netplan-configure.service (e.g. imagecustomizer services.enable)

uh...hard no. Nobody should be required to use imagecustomizer or anything else to customize their image just to get networking configuration working.
This PR also needs to fully list all changes (other than canonical/netplan#552) that will come with the full version update from 1.1.2 -> 1.2.1.
If we were post-GA there is no way I would ever ack this, but as we're pre-GA if you can update this PR documentation as outlined above and provide the needed info about what else we're getting with the netplan version update, I'll re-review it.
Also, please don't add in another AI-generated novella worth of text; keep in mind that people reviewing PRs are supposed to actually read the entire PR and review all the PR code. Please be concise and direct, instead of engaging in the fundamentally superfluous, prolix, and utterly labyrinthine practice of deploying an exponentially inflated multitude of polysyllabic lexemes, redundant subordinate clauses, and gratuitous rhetorical flourishes to articulate a concept that could otherwise be communicated with optimal alacrity, crystalline lucidity, and profound economy of language.
Summary
Upgrade `netplan` in Azure Linux 4.0 from 1.1.2 to 1.2.1 by moving the package's Fedora dist-git import pointer to the f44 branch head (which ships `netplan 1.2.1-2`). netplan 1.2.1 contains the upstream "split generate/configure" refactor that makes the boot-time systemd generator validation-only, eliminating a PID 1 self-deadlock that can freeze azurelinux boots with netplan configured.
Background: the boot freeze this fixes
azurelinux images with netplan configured freeze during boot with:
Deadlock analysis
The freeze is a PID 1 self-deadlock between the netplan systemd generator and systemd's userdb service:
- [...] `netplan generate` is invoked ... `/usr/lib/systemd/system-generators/netplan`).
- [...] `systemd-network` group.
- [...] `nsswitch.conf` routes the group lookup through nss-systemd, which issues a varlink request (`io.systemd.UserDatabase.GetGroupRecord` / `GetMemberships`) to `/run/systemd/userdb/io.systemd.DynamicUser`.
- [...] `manager_run_generators()` waiting for this very generator batch to finish. The varlink round-trips stall.
- [...] `Protocol error` and then `Freezing execution`.
It is host-speed-sensitive (a race against the 90s budget): slow/nested-virt hosts freeze, faster CI hosts service the varlink in time on byte-identical images. The trigger is always netplan, the only generator doing a `systemd-network` group lookup.
Three conditions must coincide: (1) netplan's generator-phase `getgrnam("systemd-network")` (present since netplan 1.0.1), (2) authselect's `files [SUCCESS=merge] systemd` group line (new in AZL4), (3) PID 1 unable to answer its own userdb varlink during generators.
Why not a systemd-side fix
A minimal systemd-side mitigation exists (bypass nss-systemd in the generator environment, e.g. `SYSTEMD_NSS_DYNAMIC_BYPASS=1` in `build_generator_environment()`, ~20 lines). However, systemd's own documented contract places the fault on the generator, not on systemd.
systemd.generator(7), "Notes about writing generators" (https://www.freedesktop.org/software/systemd/man/latest/systemd.generator.html):
An NSS lookup that dispatches to nss-systemd is IPC to another process (PID 1's userdb). netplan calling `getgrnam` in a generator therefore violates the generator contract, so the upstream-systemd position is that this belongs in netplan, not systemd. The deadlock-prone design is also structurally unchanged in current systemd (verified `v258` vs `main`/`262~devel`), and upstream systemd carries no targeted fix. A systemd workaround would be a non-upstreamable local divergence, whereas netplan already fixes it upstream.
The upstream netplan fix
netplan upstream split the generate and configure stages so the boot generator no longer writes networkd files (and makes no NSS call); the file writing + `chown` (the `getgrnam`) moved into a new `netplan-configure.service` ordered after boot, when PID 1's event loop is free.
Key change: PR #552 "Split generate/configure stages for sd-generator compliance" (canonical/netplan), merged 2025-12-16, first released in v1.2, present in 1.2.1.
- `42db0158`: "configure: Add new binary to produce network service configs" (anchor)
- `8233cf9d`: adds the `netplan-configure.service` unit
- `6ad42dec`: generator becomes validation-only
- `8622557d`: related
The generator / configure / util sources are byte-identical between netplan 1.2.1 and current `main`, so 1.2.1 carries the complete deadlock-avoiding refactor.
Why upgrade rather than patch 1.1.2
Backporting the refactor onto 1.1.2 is impractical:
- PR #552 itself is 44 commits, 40 files, +2268 / -1335 [...] `configure.c`, `gen-networkd.c`, `gen-openvswitch.c`, `gen-sriov.c`), heavy rewrites of `generate.c` / `networkd.c` / `openvswitch.c` / `sriov.c`, a new systemd unit, plus meson / spec / CLI changes. Not a clean cherry-pick onto 1.1.2.
- The full `1.1.2 -> main` delta is 83 files / +3141 -1655 (~7,625 patch lines).
- A smaller custom patch (resolve the group from `/etc/group` via `fgetgrent` instead of `getgrnam`, ~15-30 lines) is possible but diverges from upstream and we would own it indefinitely.
Moving the Fedora import pointer to f44 gets the released, upstream-maintained 1.2.1 with no vendored divergence.
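The smaller custom patch mentioned above would be C code built around `fgetgrent`, and it is not part of this PR. Purely to illustrate the idea of resolving a group from an `/etc/group`-format file without going through NSS, here is a minimal Python sketch. The function name `gid_from_group_file` and the sample data are hypothetical, and this is not the netplan patch itself.

```python
from typing import Optional


def gid_from_group_file(group_text: str, name: str) -> Optional[int]:
    """Return the GID for `name` from /etc/group-format text, or None.

    Only the supplied text is parsed, so no NSS module (and therefore no
    nss-systemd varlink request) is involved in the lookup.
    """
    for line in group_text.splitlines():
        line = line.strip()
        # Skip blank lines and lines starting with '#'.
        if not line or line.startswith("#"):
            continue
        # Entry format: group_name:password:GID:user_list
        fields = line.split(":")
        if len(fields) < 3 or fields[0] != name:
            continue
        try:
            return int(fields[2])
        except ValueError:
            return None  # malformed GID field
    return None


# Hypothetical sample content; real GIDs differ per image.
SAMPLE = "root:x:0:\nsystemd-network:x:192:\n"
print(gid_from_group_file(SAMPLE, "systemd-network"))
```

On a real system the text would come from reading `/etc/group`. The trade-off is that a group provided only by another NSS source would not be found this way.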
What this PR changes
- Adds `base/comps/netplan/netplan.comp.toml` pinning the Fedora f44 import (`upstream-distro = fedora 44`, `upstream-commit = 66c31bcd3e9aeb8d15a5b4184009e57d799b0158`).
- Removes the inline `[components.netplan]` entry from `base/comps/components.toml`.
- Regenerates `locks/netplan.lock` and rendered `specs/n/netplan/` (now 1.2.1). The f43-era Fedora `status_fail_cleanly.patch` drops out (not present in f44); `netplan-fallback-renderer.patch` (Fedora `Patch1001`) is retained.
Adoption note (important for image consumers)
The refactor defers virtual-device creation (dummy / bridge / bond / vlan / SR-IOV) into `netplan-configure.service`, which Fedora ships preset: disabled. Images that rely on netplan applying such config at boot must enable `netplan-configure.service` (e.g. imagecustomizer `services.enable`), otherwise netplan config silently never applies at boot (empty `/run/systemd/network`).
Validation
netplan 1.2.1 RPMs were built locally with `azldev` from this branch (`netplan-1.2.1-4.azl4` plus subpackages). Verified the package ships `/usr/libexec/netplan/configure`, `/usr/lib/systemd/system/netplan-configure.service`, and a validation-only `/usr/libexec/netplan/generate`.
Local: Injected the locally built 1.2.1 RPMs into trident image build on stock systemd 258.4-4 (no systemd change), with `netplan-configure.service` enabled, and ran the Trident rollback update tests. Serial logs confirm netplan ran: the 1.2.1-only `netplan-configure.service` started and was successfully tested.
Pipeline: The same locally built 1.2.1 RPMs were validated through the work-in-progress AZL4 Trident pipelines.
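
The file checks above, and the adoption-note symptom of an empty `/run/systemd/network`, lend themselves to a small script. The following is a sketch only: the helper names and the `root` parameter (a booted system's `/`, or a mounted image directory) are assumptions for illustration, not tooling used by this PR.

```python
from pathlib import Path
from typing import List

# Paths the Validation section says the 1.2.1 package ships (relative to root).
EXPECTED_PATHS = [
    "usr/libexec/netplan/configure",
    "usr/lib/systemd/system/netplan-configure.service",
    "usr/libexec/netplan/generate",
]


def missing_netplan_files(root: str) -> List[str]:
    """Return the expected paths that are absent under `root`."""
    base = Path(root)
    return [p for p in EXPECTED_PATHS if not (base / p).exists()]


def networkd_config_present(root: str) -> bool:
    """True if run/systemd/network under `root` contains at least one entry.

    A missing or empty directory matches the symptom described in the
    adoption note (netplan config never applied at boot).
    """
    net_dir = Path(root) / "run/systemd/network"
    return net_dir.is_dir() and any(net_dir.iterdir())
```

Under these assumptions, `missing_netplan_files("/")` returning an empty list and `networkd_config_present("/")` returning `True` after boot would be consistent with the package being installed and the configure step having run.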