255 migrate the so3 build system to infrabase and move to the new so3 logo#256

Open
daniel-rossier wants to merge 110 commits into main from 255-migrate-the-so3-build-system-to-infrabase-and-move-to-the-new-so3-logo
@daniel-rossier

Contributor

No description provided.

Daniel Rossier added 30 commits June 12, 2026 12:14
Re-sync the build system from the edgemtech Infrabase tree (without
torizon and e1c), nest the SO3 sources under so3/ to match the Infrabase
per-OS layout, and build so3/usr-so3/rootfs-so3/avz in-tree.

- build/: Infrabase meta-layers re-synced from edgemtech; torizon, e1c
  and verdin removed; new meta-toolchain layer (musl-cross-make recipe
  building the aarch64/arm musl user-space toolchain into build/tmp)
- SO3 sources nested under so3/{so3,usr,rootfs,target}; recipe paths and
  .gitignore updated for the new layout (artifacts re-ignored)
- in-tree recipes: so3 (6.2.0), usr-so3, rootfs-so3, avz (no github
  fetch); u-boot fetched+patched (2022.04, aligned with edgemtech)
- deploy via unprivileged bitbake + sudo -n (meta-filesystem)
- bsp-so3 builds, deploys and boots to so3% standalone (virt64) and as
  an AVZ guest (virt64_avz_so3 ITS, EL2)
…system-to-infrabase-and-move-to-the-new-so3-logo
The old manual qemu/ mechanism (fetch.sh + qemu.patch) is superseded by
the meta-qemu recipe: it fetches the same QEMU 8.2.2 and applies the same
hw/arm/virt.{c,h} patches (CLCD/KMI/PS2). Verified that build.sh -x qemu
rebuilds an equivalent qemu-system-aarch64. qemu/ stays gitignored and is
regenerated on demand.
Revert the avz recipe to fetching SO3 from upstream at a pinned SRCREV
and building the hypervisor (EL2) from it, instead of the in-tree so3/
sources. AVZ is decoupled from the in-tree SO3, which is the guest/
capsule (EL1) under development. Verified: bitbake avz fetches, attaches
into avz/, configures virt64_avz_defconfig and builds avz/so3.bin.
The do_build make invocation relied on a CROSS_COMPILE inherited from the
caller's shell, which broke virt32 (arm) builds when the shell had an
aarch64 CROSS_COMPILE set (cc1: unknown value 'generic-armv7-a' for
-mtune). Pass CROSS_COMPILE=${IB_TOOLCHAIN}- explicitly so virt32 uses
arm-linux-gnueabihf- and virt64 uses aarch64-none-linux-gnu-, matching
atf.bbclass.
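
The fix above can be sketched in Python, as an illustration rather than the recipe's actual code. The helper name `make_argv` is hypothetical; the two toolchain prefixes are the ones named in the commit message. It relies on the fact that variables given on the make command line take precedence over those inherited from the environment.

```python
def make_argv(ib_toolchain, extra=()):
    """Return a make command line with CROSS_COMPILE set explicitly.

    Variables passed on the make command line override environment
    variables, so a stale CROSS_COMPILE exported by the caller's shell
    no longer influences the build.
    """
    return ["make", f"CROSS_COMPILE={ib_toolchain}-", *extra]


# Hypothetical usage for the two platforms named above.
virt32_cmd = make_argv("arm-linux-gnueabihf")
virt64_cmd = make_argv("aarch64-none-linux-gnu")
```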
Drop the usr/lib/lvgl git submodule (.gitmodules removed) and go back to
the original meta-usr strategy: lvgl is fetched at build time by the
meta-usr lvgl bbappend, gated on the :lvgl OVERRIDE. usr-so3 re-enables
do_fetch/unpack/attach so the bbappend pulls lvgl into usr/lib/lvgl
(do_patch stays noexec — the slv/lvgl integration patches are already
baked into the in-tree usr/). The lvgl bbappend now mkdir's lib/lvgl
(no longer pre-created by the submodule). usr/lib/lvgl is gitignored;
meta-usr otherwise realigned with edgemtech.
The bbclass selects the current platform's target (QEMU_TARGET: arm-softmmu
for virt32, aarch64-softmmu for virt64) and, when reconfiguring, appends any
other arch already built under qemu/build so meson does not drop it. Thus
building arm-softmmu then aarch64-softmmu (or vice-versa, e.g. switching
IB_PLATFORM between so3 standalone/avz/capsule) keeps both qemu-system-*
binaries instead of wiping the previous one. do_configure is nostamp so the
accumulation re-evaluates each build.
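
A minimal sketch of the accumulation logic, assuming the already-built arches are detected from the `qemu-system-*` binaries present under qemu/build (the commit does not say how the bbclass detects them, so that lookup and the function name are assumptions):

```python
# Map an existing qemu-system-* binary name to its softmmu target.
BINARY_TO_TARGET = {
    "qemu-system-arm": "arm-softmmu",
    "qemu-system-aarch64": "aarch64-softmmu",
}


def accumulate_targets(current_target, built_binaries):
    """Return a comma-separated target list: the current platform's
    target first, then any other target whose binary already exists."""
    targets = [current_target]
    for name in sorted(built_binaries):
        target = BINARY_TO_TARGET.get(name)
        if target and target not in targets:
            targets.append(target)
    return ",".join(targets)
```

Reconfiguring for aarch64 when an arm binary is already present would then request both targets, so the earlier binary is rebuilt rather than dropped.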
The SO3 kernel is built in place, so switching IB_PLATFORM between
virt64 and virt32 (aarch64<->arm) leaves a stale .config and object
files behind, producing a wrong-arch kernel. Track the last built
arch in a .ib_last_arch marker and run 'make distclean' only when it
changes, keeping same-arch rebuilds incremental.
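
The marker check can be sketched as follows. This is an illustration in Python, not the recipe's shell code; the function name is hypothetical, and treating a missing marker as "no change" on the first build is an assumption.

```python
from pathlib import Path

MARKER_NAME = ".ib_last_arch"


def arch_changed(tree, arch):
    """Record `arch` in the marker file and report whether it differs
    from the previously recorded one.

    Assumption: a missing marker counts as "no change", so the very
    first build does not trigger a distclean.
    """
    marker = Path(tree) / MARKER_NAME
    previous = marker.read_text().strip() if marker.exists() else None
    marker.write_text(arch + "\n")
    return previous is not None and previous != arch
```

A caller would run `make distclean` only when `arch_changed()` returns True, which keeps same-arch rebuilds incremental.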
Two arch-switch bugs surfaced when building SO3 for virt32 (arm) after
virt64 (aarch64):

1. 'OVERRIDES += ":so3"' inserts a leading space, so OVERRIDES became
   "...:arm :so3" and the CPU token parsed as "arm " (trailing space).
   :<cpu> overrides such as IB_MUSL_TARGET:arm then never collapsed, so
   the user-space cmake build got a literal ${IB_MUSL_TARGET} on PATH and
   could not find arm-linux-musleabihf-gcc. Switch to OVERRIDES:append in
   all five SO3 recipes (no inserted space).

2. The usr-so3 cmake build dir caches the toolchain in CMakeCache.txt, so
   switching arch kept emitting aarch64 binaries (an aarch64 init.elf on a
   32-bit kernel -> prefetch abort at boot). Wipe so3/usr/build when the
   arch changes, tracked via a .ib_last_arch marker at the usr/ root.
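
The leading-space problem in item 1 can be reproduced with a small string model. This is plain Python that mimics the two operators, not BitBake code, and the base value "virt32:arm" is illustrative.

```python
def plus_equals(value, addition):
    """Model of BitBake '+=': append with a single space inserted."""
    return f"{value} {addition}"


def colon_append(value, addition):
    """Model of BitBake ':append': append with no space inserted."""
    return value + addition


def override_tokens(overrides):
    """Split an OVERRIDES string into its colon-separated tokens."""
    return overrides.split(":")


base = "virt32:arm"  # illustrative OVERRIDES value
broken = override_tokens(plus_equals(base, ":so3"))
fixed = override_tokens(colon_append(base, ":so3"))
```

With `+=` the CPU token carries a trailing space and no longer equals "arm", so an `:arm` override has nothing to match; with `:append` the token stays intact.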
Both QEMU launch scripts only handled virt64, so with IB_PLATFORM=virt32
they printed the MAC/GDB lines and exited without starting QEMU. Select
QEMU_BIN per platform (qemu-system-arm for virt32) and add a virt32 branch
booting U-Boot directly (-M virt -cpu cortex-a15 -kernel u-boot/u-boot,
sdcard.img.virt32). stg.sh keeps the virtio GPU/keyboard/mouse + SDL
window; the virt64-only guard is widened to accept virt32.
u-boot is built from the meta-uboot recipe (github 2022.04 @ pinned
SRCREV + the SO3 patch set), which fetches and attaches it, backing any
prior copy up to u-boot.back. The committed in-tree u-boot/ was therefore
obsolete and was clobbered on every build, producing a huge spurious
diff. Remove all 18k files from tracking and gitignore /u-boot/, matching
how qemu/ and avz/ are already handled.
The patch set was inherited wholesale from the edgemtech recipe and had
never been regenerated by do_updiff in this repo. It carried two classes
of cruft:

  * duplicate chains — the same source file patched twice (e.g. board.c
    in 0004 and 0077, setexpr.c in 0008/0081, the tools/boot/*.c and the
    defconfigs each appearing in two generations with ./ vs b/ labels),
    the residue of repeated append-only updiff runs across a label-format
    change;
  * build artifacts frozen as patches — hello_world.srec, autoconf.mk,
    autoconf.mk.dep, include/config/uboot.release, include/generated/*
    (dt.h, *_autogenerated.h), lib/efi_selftest/efi_miniapp_*.h.

Regenerated from scratch: diff the pristine fetch against the working
tree (do_diffcompose), drop the old numbered set, promote the staged
one-patch-per-file result (do_updiff). 64 messy patches -> 54 clean,
consolidated, git-labelled patches. e1c_boot.c is kept (compiled but
unused) per decision. Verified: a clean fetch+unpack+patch+build applies
all 54 and produces a working u-boot.

Also completed the do_diffcompose artifact exclude-list in patch.bbclass
(autoconf.mk, autoconf.mk.dep, *.srec, efi_miniapp_*.h) so future updiff
runs stay clean.
ls sets CLOEXEC via fcntl(). arm64 musl issues this as fcntl (NR 25),
which SO3 handles; arm32 (virt32) musl issues the same call as fcntl64
(NR 221), which syscall.tbl never registered -> 'unhandled syscall: 221'
warning and a silently-failing -ENOSYS. Map fcntl64 to the existing
__sys_fcntl handler so virt32 behaves like virt64.
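
A toy model of the table lookup, written in Python for illustration only: SO3's real syscall.tbl format and handler signatures are not shown here, so the dictionary layout and the `sys_fcntl` stand-in are assumptions. The syscall numbers are the ones quoted above.

```python
ENOSYS = 38  # Linux errno value for "function not implemented"


def sys_fcntl(fd, cmd, arg):
    """Stand-in for the kernel's fcntl handler; returns 0 on success."""
    return 0


# Per-architecture syscall tables: number -> handler.
SYSCALL_TABLE = {
    "arm64": {25: sys_fcntl},   # fcntl
    "arm32": {221: sys_fcntl},  # fcntl64, routed to the same handler
}


def dispatch(arch, nr, *args):
    """Look up the handler; an unregistered number yields -ENOSYS."""
    handler = SYSCALL_TABLE[arch].get(nr)
    if handler is None:
        return -ENOSYS
    return handler(*args)
```

Without the 221 entry, the arm32 lookup would fall through to `-ENOSYS`, which is the silent failure described above.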
Killing a process whose spawned thread was blocked in the kernel hit
'BUG in kernel/thread.c:105' (discard_tcb_in_pcb: WAITING 'not handled
yet'). A sleeping thread sits in __sleep() with a struct timer on its
own kernel stack, so it cannot just be freed — the pending timer would
dangle and later fire on freed memory.

Handle it cooperatively: add a tcb->killed flag; discard_tcb_in_pcb()
flags+wakes WAITING threads (instead of BUG()) and waits for them via
the existing threads_active completion, reaping them afterwards.
A woken thread resumes in __sleep(), stops its own timer, sees killed
and self-terminates with thread_exit() — entirely in kernel, never
returning to the (already-released) user pages. READY threads are still
force-freed (they must not resume into freed user space).

Verified: Ctrl-C of lvgl_demo stress (whose slv tick thread loops in
usleep) no longer panics.

Limitation: only the __sleep() wait is instrumented. A thread killed
while blocked on a futex/mutex would not yet self-terminate; that needs
the same killed-check added to those wait paths.
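
The cooperative teardown can be modelled in user space with Python threads. This is only an analogy under stated assumptions: `threading.Event` stands in for the kernel wait/timer mechanism, `join()` for the threads_active completion, and all names below are hypothetical.

```python
import threading


class Tcb:
    """Toy thread control block with the cooperative-kill flag."""

    def __init__(self):
        self.killed = False
        self.wakeup = threading.Event()  # stands in for the wait queue
        self.self_terminated = False


def kernel_sleep(tcb, timeout_s):
    """Block until the timeout elapses or the thread is woken.
    Returns True on a normal wake-up, False if the thread was killed."""
    tcb.wakeup.wait(timeout_s)
    # The timer belongs to this frame, so the sleeper retires it itself
    # before checking whether it has been asked to die.
    if tcb.killed:
        tcb.self_terminated = True  # thread_exit() equivalent
        return False
    return True


def discard_waiting(tcb, worker):
    """Flag and wake a WAITING thread, then wait for it to finish."""
    tcb.killed = True
    tcb.wakeup.set()
    worker.join()


tcb = Tcb()
worker = threading.Thread(target=kernel_sleep, args=(tcb, 60.0))
worker.start()
discard_waiting(tcb, worker)
```

The killer never frees the sleeper's state directly; it only sets the flag, wakes the thread, and waits, which is why the pending timer cannot dangle.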
The 128 KB lvgl heap is too small to build lv_demo_widgets (lv_conf.h's
own note flags this), so the widget tree failed to allocate, nothing
rendered, and the main thread spun in lv_timer_handler() without
reaching a syscall boundary — making Ctrl-C undeliverable. 4 MB fits the
demo comfortably; it is BSS (zero-init) so the .elf on disk is unchanged.
A diagnostic that bypasses LVGL: opens /dev/fb, queries geometry via the
same ioctls slv uses, mmap()s the VRAM and draws colour bars + an
animated square straight into it. Lets us tell apart a broken display
pipeline (PL111 CLCD -> QEMU SDL) from an LVGL-side problem. Ctrl-C to
quit.
fb_mmap() mapped the CLCD VRAM cacheable, which is wrong for a
framebuffer: on real hardware the CPU writes linger in the data cache
and never reach the scanout buffer. Map it non-cacheable (nocache=true).
(Under QEMU/TCG it is cosmetic since the cache is not modelled, but it is
required on real targets.)
SO3 drives the PL111 CLCD + PL050 keyboard/mouse that the so3 QEMU patch
wires unconditionally into '-M virt'; it has no virtio-gpu driver, so the
virtio-gpu/keyboard/mouse devices only added a competing blank console.
More importantly the SDL backend did not present the PL111 console's
surface at all (verified: pl110 renders the framebuffer into the surface
- monitor 'screendump' shows it - yet the SDL window stayed black).
Switching to '-display gtk' shows the panel correctly (and its View menu
lists every console). Drop the virtio-gpu/keyboard/mouse devices.
Paint the colour-bar background once, then per frame only restore the
square's previous rows and redraw it, instead of memcpy-ing the whole
3 MB framebuffer every frame.
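
A sketch of the partial redraw on an in-memory buffer. The tiny geometry, the gradient background, and the saved background copy used for restoring rows are assumptions made for the example; the real test draws into the mmap()ed VRAM.

```python
WIDTH, HEIGHT, BPP = 64, 48, 4  # small illustrative geometry, 32-bit pixels
STRIDE = WIDTH * BPP
SQUARE = 8


def paint_background():
    """Build the static background once (a simple per-column gradient)."""
    row = bytearray()
    for x in range(WIDTH):
        row += bytes((x * 4 % 256, 0, 0, 255))
    return bytearray(bytes(row) * HEIGHT)


def draw_square(fb, x, y):
    """Fill a SQUARE x SQUARE white block with top-left corner (x, y)."""
    for r in range(y, y + SQUARE):
        start = r * STRIDE + x * BPP
        fb[start:start + SQUARE * BPP] = b"\xff" * (SQUARE * BPP)


def move_square(fb, background, old_y, new_x, new_y):
    """Restore only the rows the square covered, then draw it again."""
    lo, hi = old_y * STRIDE, (old_y + SQUARE) * STRIDE
    fb[lo:hi] = background[lo:hi]
    draw_square(fb, new_x, new_y)


background = paint_background()
fb = bytearray(background)
draw_square(fb, 0, 0)
move_square(fb, background, old_y=0, new_x=10, new_y=20)
```

Each frame touches `SQUARE` rows for the restore plus the square itself, instead of copying the whole buffer.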
The serial IRQ delivered SIGINT to current() - whatever thread happened
to be running when the Ctrl-C key arrived. A foreground app asleep in a
syscall (e.g. usleep) is not the running thread (the idle thread is, with
pcb==NULL), so Ctrl-C was silently dropped; it only worked for CPU-busy
apps. And at the shell prompt the prompt was never reprinted.

Two parts:

1. Track the foreground console process. Add a global fg_pcb, set by
   sys_do_wait4() to the child a process blocks waiting on (the shell's
   foreground job) and restored to the waiter when it exits. The serial
   IRQ now targets fg_pcb (fallback: current()), so SIGINT reaches the
   foreground app even while it sleeps.

2. Cancel the line at the prompt instead of signalling the shell. When a
   console read is in progress (read_lock held), the IRQ sets serial_intr;
   pl011_get_byte returns ETX and console_getc discards the typed line and
   returns an empty line, so the shell's fgets returns and it reprints the
   prompt once. This avoids musl's sticky-EOF on a 0-byte read and a
   siglongjmp-through-fgets file-lock leak. Matches the driver's existing
   read_lock design comment.

Relies on the cooperative WAITING-thread teardown for the kill path.
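
Part 1 reduces to a small piece of state, modelled here in Python. The class and method names are hypothetical; only `fg_pcb`, the wait4 hook, and the fallback to the running thread come from the description above.

```python
class Console:
    """Toy model of the foreground-process tracking."""

    def __init__(self):
        self.fg_pcb = None

    def wait4_begin(self, child):
        # The child a process blocks on becomes the foreground job.
        self.fg_pcb = child

    def child_exited(self, waiter):
        # Foreground returns to the process that was waiting.
        self.fg_pcb = waiter

    def sigint_target(self, current):
        # Prefer the foreground process; fall back to the running one.
        return self.fg_pcb if self.fg_pcb is not None else current
```

While the shell waits on a sleeping application, the idle thread is the running one, yet `sigint_target()` still returns the application.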
Mirror the virt32 graphical fix onto the virt64 branch: SO3 drives the
same PL111 CLCD + PL050 (virt64.dts has clcd@08800000 / pl050 nodes), has
no virtio-gpu driver, and the SDL backend does not present the PL111
console. Switch to '-display gtk' and drop the virtio-gpu/keyboard/mouse
devices. The flash0.img AVZ-vs-U-Boot boot heuristic is unchanged.

Untested (no virt64 graphical run this session) but the framebuffer path
is identical to virt32; the kernel-side fixes (non-cacheable fb, Ctrl-C)
are arch-shared.
An interrupted task (e.g. Ctrl-C during a clean) can leave a recipe
WORKDIR that exists but lacks its temp/ subdir. bitbake then cannot
create that task's fifo and fails with
  do_clean: [Errno 2] No such file or directory: .../temp/fifo.NNNN
(hit on 'build.sh -ca bsp-so3'). Before clean/build, scan tmp/work and
remove any workdir missing temp/ (it holds nothing useful) so bitbake
recreates it cleanly.
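
The scan can be sketched as below. It is a simplified illustration: it assumes every directory directly under the given root is a recipe workdir, whereas the real tmp/work tree may nest workdirs deeper, and the function name is hypothetical.

```python
import shutil
from pathlib import Path


def prune_broken_workdirs(work_root):
    """Remove every directory directly under `work_root` that has no
    temp/ subdirectory; return the names removed, sorted.

    Assumption: workdirs sit one level below `work_root`.
    """
    removed = []
    root = Path(work_root)
    if not root.is_dir():
        return removed
    for workdir in sorted(p for p in root.iterdir() if p.is_dir()):
        if not (workdir / "temp").is_dir():
            shutil.rmtree(workdir)
            removed.append(workdir.name)
    return removed
```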
'build.sh -ca bsp-so3' failed with
  usr-so3 do_clean: [Errno 2] No such file or directory: .../temp/fifo.NNNN

Root cause: the lvgl bbappend's shell do_clean:append ran
'rm -rf ${WORKDIR}/*', which deleted the running clean task's own temp/
(holding its fifo + run script) mid-execution, leaving an empty workdir.
The next clean then could not create its fifo there and failed.

Fix: make do_clean a Python task (usr.bbclass) plus Python do_clean:append
in the usr-so3 recipe and the lvgl bbappend. Python tasks create their
temp dir themselves and use no fifo, so they are robust when the workdir
is fresh/empty. The lvgl append no longer touches WORKDIR (bitbake owns
it); it only purges the fetched lvgl tree (in-tree usr/lib/lvgl, src/lib,
${S}/lib/lvgl). Verified: fresh clean, repeat clean, full 'bsp-so3 -c
clean', and clean->rebuild (aarch64) all succeed.
…2 entries

Remove the IB_PLATFORM:so3 override: SO3 now always builds for the main
IB_PLATFORM. That override was referenced only here and resolved via
OVERRIDES, so a value diverging from IB_PLATFORM silently built SO3 for
the wrong arch. The standalone / AVZ-guest / capsule contexts are not
distinct platforms - they differ only by IB_CONFIG:so3 / IB_TARGET_ITS:so3
(e.g. capsule = virt64_capsule_defconfig + virt64_capsule), which are
independent of the platform variable.

Also add the virt32 counterparts that were missing
(PREFERRED_VERSION_so3, IB_CONFIG:so3, IB_TARGET_ITS:so3, IB_STORAGE_MODE)
and default IB_PLATFORM to virt64.
AVZ is an EL2 hypervisor. The virt64 launcher only enabled EL2
(virtualization=on) when filesystem/flash0.img was present (ATF chain);
booting AVZ via the ITS without ATF used plain -M virt,gic-version=2
(EL1), so AVZ faulted on its first EL2 system-register write
(Synchronous Abort -> reset). Detect the selected so3 ITS from local.conf
and, when it is an avz ITS, add virtualization=on with -kernel u-boot
(virt64_defconfig is EL2-aware). Verified: AVZ now boots.
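
A sketch of the detection step, in Python rather than the launcher's shell. It assumes local.conf carries a line of the form `IB_TARGET_ITS:so3 = "..."` and that an avz ITS can be recognised by the substring "avz" in its name; both the parsing and the function names are assumptions.

```python
import re

ITS_RE = re.compile(r'^\s*IB_TARGET_ITS:so3\s*=\s*"([^"]*)"', re.MULTILINE)


def selected_its(local_conf_text):
    """Return the last IB_TARGET_ITS:so3 value assigned, or None."""
    matches = ITS_RE.findall(local_conf_text)
    return matches[-1] if matches else None


def machine_option(local_conf_text):
    """Build the -M value: enable EL2 only when an avz ITS is selected."""
    its = selected_its(local_conf_text) or ""
    machine = "virt,gic-version=2"
    if "avz" in its:
        machine += ",virtualization=on"
    return machine
```

Commented-out assignments are ignored because the pattern only matches lines that start with the variable name.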
Daniel Rossier added 9 commits June 17, 2026 14:18
The **/build .gitignore rule (.gitignore:17) silently keeps recipe source
patches out of the index; they must be force-added like the 187 already
tracked. 25 patches referenced by committed recipe metadata (atf, linux,
buildroot, lvgl) were never added, so the clean CI checkout failed parsing
with "Unable to get checksum for ... SRC_URI entry". Force-add them.
**/build silently ignored the whole /build metadata tree, so recipe patches
had to be force-added; a forgotten one broke CI parsing (fixed in a1624e7).
Un-ignore /build, re-ignore output one level down, re-include conf + meta-*
source dirs. Anchor the bare 'atf' rule to /atf so it stops matching the
meta-atf recipe dir. Ignore generated build/conf/auto.conf and *.orig/*.rej.
do_build compiles gcc 12.4.0 with in-tree mpfr/gmp/mpc. When the source
tree has inconsistent timestamps (configure.ac newer than configure, as on
a fresh copy that doesn't preserve mtimes), make tries to regenerate the
autotools files and invokes automake-1.17/autoconf. The so3-env CI image
ships no autotools (and Ubuntu 24.04 has automake 1.16, not the 1.17 mpfr
wants), so do_build died with 'automake-1.17: command not found'.

--disable-maintainer-mode turns the regen rules into no-ops, making the
toolchain build environment- and timestamp-independent. Verified by
reproducing the exact failure in the so3-env container and confirming the
flag builds gcc past the mpfr stage.
Temporary diagnostic: the toolchain build fails only on GitHub-hosted
runners (passes locally and on a self-hosted box with the same image and
commit), and the inner log.do_build is never shown in the CI console.
Print nproc/df/free and tail the failing toolchain log so we can see the
actual error. To be reverted once diagnosed.
The CI failure on 32852c3 was transient: the same build logic passed on
re-run (and passes locally + on a self-hosted box). The runner had ample
disk (85G) and RAM (16G), so the cause was a flaky mirror download — musl-
cross-make fetches tarballs from ftpmirror.gnu.org during do_build with a
no-retry 'wget -c -O'. Override DL_CMD with --tries/--waitretry/--timeout so
a single bad mirror recovers instead of failing the toolchain build.

Also revert the temporary build.yml diagnostic (cedbc42) now that the
root cause is understood; the workflow is back to its clean form.
Reproduces .github/workflows/build.yml without pushing: exports the
git-tracked tree into a throwaway dir under $HOME (snap/rootless Docker
cannot bind-mount /tmp) and runs the exact 'build.sh -k so3' + 'build.sh -x
usr-so3' in the so3-env image, per platform. Mounting only tracked files
means untracked-but-referenced sources fail locally exactly as in CI, and
build/tmp is excluded so the toolchain builds from scratch. Use -r <ref>
for an exact committed state.
The toolchain build failed intermittently in CI (FAIL/FAIL/PASS/PASS/FAIL
across runs), always at musl-toolchain do_build, very early and with no
build output — i.e. a download failure. musl-cross-make's default GNU_SITE
is ftpmirror.gnu.org, which 302-redirects to a random mirror; incomplete
mirrors 404 and wget --tries just re-hits the same redirect. Pin GNU_SITE to
the canonical https://ftp.gnu.org/gnu (complete, no random mirror); keep the
wget retries as a safety net.

Also keep a minimal on-failure dump of the toolchain do_build log in CI so
any residual download flake is diagnosable without a separate commit.
The Check Code Style workflow was red (pre-existing): after the Infrabase
migration the SO3 sources moved under so3/, so check-path 'so3' swept in
vendored code (micropython, libxml2) and check-path 'usr/src' (no longer a
real dir) silently fell back to scanning the whole repo. clang-format also
flagged genuinely-misformatted first-party files.

- Point check-paths at the real nested dirs: so3/so3 (kernel) and so3/usr
  (user space); exclude vendored trees (micropython, libxml2, usr/lib/linux,
  lvgl).
- Reformat the 13 tracked first-party files that violated the repo's own
  .clang-format (5 kernel, 7 usr/lib/slv, fb_test.c).

Verified by replicating the action's exact logic (find + exclude regex,
clang-format 19, --style=file) over the tracked tree: both jobs report 0
failures.
Make the generic build/ files byte-identical to edgem1 where they should
be (meld-minimal), while keeping the torizon/e1c separation intact:

- restore the EDGEMTech copyright headers on the generic layer files
  (meta-so3/meta-qemu/meta-rootfs/meta-filesystem/meta-uboot layer.conf,
  avz/so3 bbclass, bsp-so3, rootfs-so3, so3_6.2.0)
- drop the dead utils_restore_user_ownership() call in usr-so3 (undefined,
  error-path only)
- drop a stray whitespace line in rootfs-linux
daniel-rossier force-pushed the 255-migrate-the-so3-build-system-to-infrabase-and-move-to-the-new-so3-logo branch from 4934926 to 97585d5 on June 17, 2026 17:48
@daniel-rossier

Contributor Author

@AndreCostaaa @clemdiep You can proceed with the review :-) Thanks.

Daniel Rossier added 18 commits June 17, 2026 21:09
Rewrite the landing README around the three build modes (standalone / AVZ /
SO3 capsule), supported targets, and a clear pointer to the published
documentation as the source of truth. Remove the no-longer-current
discourse.heig-vd.ch discussion-forum link and the obsolete in-tree CI-patch
and ./st/./stv/./stg run notes (all covered by doc/ now).
The discourse.heig-vd.ch forum no longer exists. Remove the 'Discussion
forum' section from the index (keeping the sponsor acknowledgement and the
HEIG-VD/REDS logo) and the forum link from the LVGL page; questions now go
through GitHub issues / the maintainer (see the README).
Add a proper 'Welcome to SO3' opening and a dedicated section explaining
SO3's defining trait — polymorphism: one source tree built into a standalone
OS (EL1), the AVZ hypervisor (EL2), or an SO3 capsule (S3C) on top of AVZ
beside a Linux agency.
The source IB_TARGET/fs is the rootfs image loop-mounted as root, and the
ext4 rootfs partition needs ownership/perms/symlinks preserved. Replace the
unprivileged non-preserving `cp -rv` (which aborts on root-owned files) with
`sudo cp -av`. Keeps this recipe identical to the edgem1 tree.
-k so3 -> -x so3, -f -> -x filesystem (the -k/-b/-r/-f options were
removed when build.sh/deploy.sh were reduced to -a/-x).
Development

Successfully merging this pull request may close these issues.

Migrate the so3 build system to InfraBase and move to the new SO3 logo

1 participant