Stable release testing - notes from the autobuilder perspective


Richard Purdie
 

I wanted to write down my findings on trying to get and keep stable
branch builds working on the autobuilder. I also have a proposal in
mind for moving this forward.

Jeremy did good work in getting thud nearly building, building upon
work I'd done in getting buildtools-extended-tarball working for older
releases. It's not as simple a problem as it would first appear.

We have two versions of the buildtools tarball. In simple terms, one
has the basic utilities needed to run builds without gcc and the other
includes gcc.
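
For anyone unfamiliar with them, both variants are normal bitbake
targets built from a poky checkout. A minimal sketch, assuming a
branch where both recipes exist (buildtools-extended-tarball needs
backporting on older branches, as noted below):

# basic utilities only, no compiler
bitbake buildtools-tarball
# additionally bundles gcc and friends
bitbake buildtools-extended-tarball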

Our current policy was to install a buildtools tarball on certain
problematic autobuilders, but this doesn't work since a given release
usually has a set of tools it's known to work with and won't work with
tools outside that set. We therefore suffer "bitrot" as new workers
are added and older ones are replaced with new distro installs.

In particular:
* gcc 10 doesn't work with older releases
* gcc 4.8 and 4.9 don't work with newer releases
* we no longer install makeinfo onto new autobuilder workers
* we no longer install python2 onto new autobuilder workers
* some older autobuilder workers have old versions of python3
* newer autobuilder workers need newer uninative versions
* some things have changed, such as crypt() being moved out of glibc

This means that for a given release we want to use the standard
buildtools tarball on "old" systems and the extended buildtools tarball
on "new" systems that didn't exist at the time the release was made.
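
As a rough illustration of the kind of host check this implies (a
hypothetical sketch, not the actual autobuilder-helper logic):

# pick a tarball variant based on the host compiler version
GCCMAJOR=$(gcc -dumpversion | cut -d. -f1)
if [ "${GCCMAJOR:-0}" -ge 10 ]; then
    # host gcc too new for an older release: bring our own compiler
    VARIANT=buildtools-extended-tarball
else
    VARIANT=buildtools-tarball
fi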

My thoughts are that we should:

a) Remove all the current buildtools installs from the autobuilder

b) Teach autobuilder-helper to install buildtools tarballs on all the
older release branches (a sketch of the install step follows this
list)

c) Backport most of the autobuilder-helper changes to older releases
so it's easier to maintain things

d) Backport buildtools-extended-tarball to older releases

e) Backport the necessary fixes to older releases to allow them to
build on the current infrastructure with buildtools.
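
For b), the install step itself is mechanical. A hypothetical sketch
(installer name, release path and install location here are all
illustrative, not the actual helper code):

# fetch a published buildtools installer and install it
TARBALL=x86_64-buildtools-extended-nativesdk-standalone-2.6.4.sh
URLBASE=https://downloads.yoctoproject.org/releases/yocto/yocto-2.6.4/buildtools
wget $URLBASE/$TARBALL
sh $TARBALL -y -d $HOME/buildtools
# source the environment so builds pick up the bundled tools
. $HOME/buildtools/environment-setup-x86_64-pokysdk-linux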

Dunfell is in a good state and is OK.

Zeus needs poky:zeus-next and
yocto-autobuilder-helper:contrib/rpurdie/zeus.

Thud has branches available that need updating against the zeus
changes I've figured out, which should get that working too.

Pyro has example code at poky-contrib:rpurdie/pyro to allow a
buildtools tarball that old to be built.

As things stand, the branches are all just going to bitrot. If we can
get them building cleanly, it would seem to make sense to me to merge
this approximate set of changes, in the hope that stable maintenance
in the event of a major security fix (for example) becomes much more
feasible.

Any thoughts from anyone on this?

Cheers,

Richard


Otavio Salvador
 

Hello all,

On Mon, Sep 7, 2020 at 13:14, Richard Purdie
<richard.purdie@...> wrote:
> ...
> Any thoughts from anyone on this?

I second this. At O.S. Systems, at least, we've been using Docker
containers to make maintenance of old releases easier. It'd be great
if we could alleviate this and reduce their use as much as possible.

CI builder maintenance is indeed a time-consuming task, and the easier
it gets, the easier it is to convince people to set up builders for
their own uses. In the end, this helps to improve the quality of
submitted patches and reduces the maintenance effort as well.

--
Otavio Salvador O.S. Systems
http://www.ossystems.com.br http://code.ossystems.com.br
Mobile: +55 (53) 9 9981-7854 Mobile: +1 (347) 903-9750


Tom Rini
 

On Mon, Sep 07, 2020 at 02:59:41PM -0300, Otavio Salvador wrote:
> Hello all,
>
> On Mon, Sep 7, 2020 at 13:14, Richard Purdie
> <richard.purdie@...> wrote:
> > ...
> > Any thoughts from anyone on this?
>
> I second this. At O.S. Systems, at least, we've been using Docker
> containers to make maintenance of old releases easier. It'd be great
> if we could alleviate this and reduce their use as much as possible.
>
> CI builder maintenance is indeed a time-consuming task, and the
> easier it gets, the easier it is to convince people to set up
> builders for their own uses. In the end, this helps to improve the
> quality of submitted patches and reduces the maintenance effort as
> well.

Excuse what may be a dumb question, but why are we not just building
pyro, for example, in an Ubuntu 16.04 or centos7 container (or
anything else with official containers available)? Is the performance
hit too much, even with good volume management? And extend that to
other branches, of course. But as we look at why people care about
such old releases (or supporting a current release into the future),
it seems like "our build environment is a container / VM so we can
support this on modern HW" pops up.

--
Tom


Richard Purdie
 

On Mon, 2020-09-07 at 16:55 -0400, Tom Rini wrote:
> On Mon, Sep 07, 2020 at 02:59:41PM -0300, Otavio Salvador wrote:
> > Hello all,
> >
> > On Mon, Sep 7, 2020 at 13:14, Richard Purdie
> > <richard.purdie@...> wrote:
> > > ...
> > > Any thoughts from anyone on this?
> >
> > I second this. At O.S. Systems, at least, we've been using Docker
> > containers to make maintenance of old releases easier. It'd be
> > great if we could alleviate this and reduce their use as much as
> > possible.
> >
> > CI builder maintenance is indeed a time-consuming task, and the
> > easier it gets, the easier it is to convince people to set up
> > builders for their own uses. In the end, this helps to improve the
> > quality of submitted patches and reduces the maintenance effort as
> > well.
>
> Excuse what may be a dumb question, but why are we not just building
> pyro, for example, in an Ubuntu 16.04 or centos7 container (or
> anything else with official containers available)? Is the
> performance hit too much, even with good volume management? And
> extend that to other branches, of course. But as we look at why
> people care about such old releases (or supporting a current release
> into the future), it seems like "our build environment is a
> container / VM so we can support this on modern HW" pops up.

The autobuilder is set up for speed so there aren't VMs involved; it's
'bare metal'. Containers would be possible, but at that point the
kernel isn't the distro kernel and you have permission issues with the
qemu networking, for example.

Speed is extremely important as we have about a 6 hour build test time
but a *massive* test range (e.g. all the gcc/glibc test suites on each
arch, build+boot testing all the arches under qemu for
sysvinit+systemd, oe-selftest on each distro). I am already tearing my
hair out trying to maintain what we have and deal with the races;
adding containers into the mix simply isn't something I can face.

We do have older distros in the cluster for a time, e.g. centos7 is
still there, although we've replaced the OS on some of the original
centos7 workers as the hardware had disk failures, so there aren't as
many of them as there were. Centos7 gives us problems when trying to
build master.

So this plan is the best practical approach we can come up with to
allow us to build older releases without changing the autobuilders too
much and causing new sets of problems. I should have mentioned this; I
just assumed people kind of know it, sorry.

Cheers,

Richard


Tom Rini
 

On Mon, Sep 07, 2020 at 10:03:36PM +0100, Richard Purdie wrote:
> On Mon, 2020-09-07 at 16:55 -0400, Tom Rini wrote:
> > On Mon, Sep 07, 2020 at 02:59:41PM -0300, Otavio Salvador wrote:
> > > Hello all,
> > >
> > > On Mon, Sep 7, 2020 at 13:14, Richard Purdie
> > > <richard.purdie@...> wrote:
> > > > ...
> > > > Any thoughts from anyone on this?
> > >
> > > I second this. At O.S. Systems, at least, we've been using
> > > Docker containers to make maintenance of old releases easier.
> > > It'd be great if we could alleviate this and reduce their use
> > > as much as possible.
> > >
> > > CI builder maintenance is indeed a time-consuming task, and the
> > > easier it gets, the easier it is to convince people to set up
> > > builders for their own uses. In the end, this helps to improve
> > > the quality of submitted patches and reduces the maintenance
> > > effort as well.
> >
> > Excuse what may be a dumb question, but why are we not just
> > building pyro, for example, in an Ubuntu 16.04 or centos7
> > container (or anything else with official containers available)?
> > Is the performance hit too much, even with good volume management?
> > And extend that to other branches, of course. But as we look at
> > why people care about such old releases (or supporting a current
> > release into the future), it seems like "our build environment is
> > a container / VM so we can support this on modern HW" pops up.
>
> The autobuilder is set up for speed so there aren't VMs involved;
> it's 'bare metal'. Containers would be possible, but at that point
> the kernel isn't the distro kernel and you have permission issues
> with the qemu networking, for example.

Which issues do you run into with qemu networking? I honestly don't
know if the U-Boot networking tests we run via qemu under Docker are
more or less complex than what you're running into.

> Speed is extremely important as we have about a 6 hour build test
> time but a *massive* test range (e.g. all the gcc/glibc test suites
> on each arch, build+boot testing all the arches under qemu for
> sysvinit+systemd, oe-selftest on each distro). I am already tearing
> my hair out trying to maintain what we have and deal with the races;
> adding containers into the mix simply isn't something I can face.
>
> We do have older distros in the cluster for a time, e.g. centos7 is
> still there, although we've replaced the OS on some of the original
> centos7 workers as the hardware had disk failures, so there aren't
> as many of them as there were. Centos7 gives us problems when trying
> to build master.

The reason I was thinking about containers is that it should remove
some of what you have to face. Paul may or may not want to chime in on
how workable it ended up being for a particular customer, but
leveraging CROPS to set up a build environment of a supported host and
then running it on whatever build hardware is available worked well.
It sounds like part of the autobuilder problem is that it has to be a
specific set of hand-crafted machines, and that in turn feels like
we've lost the thread, so to speak, about having a reproducible build
system. 6 hours even beats my U-Boot world before/after times, so I do
get the dread of "now it might take 5% longer", which is very real
extra wallclock time. But if it means more builders could be available
because they're easy to spin up, that could bring the overall time
down.

> So this plan is the best practical approach we can come up with to
> allow us to build older releases without changing the autobuilders
> too much and causing new sets of problems. I should have mentioned
> this; I just assumed people kind of know it, sorry.

Since I don't want to put even more on your plate, what is the
reasonable test to try here? Or is it hard to say, since it's not just
"MACHINE=qemux86-64 bitbake world" but also "run this and that and
something else"?

--
Tom


Richard Purdie
 

On Mon, 2020-09-07 at 17:19 -0400, Tom Rini wrote:
> On Mon, Sep 07, 2020 at 10:03:36PM +0100, Richard Purdie wrote:
> > On Mon, 2020-09-07 at 16:55 -0400, Tom Rini wrote:
> > The autobuilder is set up for speed so there aren't VMs involved;
> > it's 'bare metal'. Containers would be possible, but at that point
> > the kernel isn't the distro kernel and you have permission issues
> > with the qemu networking, for example.
>
> Which issues do you run into with qemu networking? I honestly don't
> know if the U-Boot networking tests we run via qemu under Docker are
> more or less complex than what you're running into.

It's the tun/tap device requirement that tends to be the pain point.
Being able to ssh from the host OS into the qemu target image is a
central requirement of oeqa. Everyone tells me it should use port
mapping and slirp instead to avoid the privilege problems and the
container issues, which is great, but that isn't implemented.
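
To illustrate the difference (a hand-rolled sketch of the two qemu
setups, not the actual runqemu invocation):

# tap networking, as the oeqa ssh-into-the-image tests expect: setting
# up the tap device normally needs root or a setuid helper, which is
# the sticking point inside containers
qemu-system-x86_64 -netdev tap,id=net0,ifname=tap0,script=no,downscript=no \
    -device virtio-net-pci,netdev=net0 ...

# slirp/user-mode networking with a port mapping instead: runs
# unprivileged, guest ssh reachable on localhost:2222
qemu-system-x86_64 -netdev user,id=net0,hostfwd=tcp:127.0.0.1:2222-:22 \
    -device virtio-net-pci,netdev=net0 ...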

> > Speed is extremely important as we have about a 6 hour build test
> > time but a *massive* test range (e.g. all the gcc/glibc test
> > suites on each arch, build+boot testing all the arches under qemu
> > for sysvinit+systemd, oe-selftest on each distro). I am already
> > tearing my hair out trying to maintain what we have and deal with
> > the races; adding containers into the mix simply isn't something
> > I can face.
> >
> > We do have older distros in the cluster for a time, e.g. centos7
> > is still there, although we've replaced the OS on some of the
> > original centos7 workers as the hardware had disk failures, so
> > there aren't as many of them as there were. Centos7 gives us
> > problems when trying to build master.
>
> The reason I was thinking about containers is that it should remove
> some of what you have to face.

Removes some, yes, but creates a whole set of other issues.

> Paul may or may not want to chime in on how workable it ended up
> being for a particular customer, but leveraging CROPS to set up a
> build environment of a supported host and then running it on
> whatever build hardware is available worked well. It sounds like
> part of the autobuilder problem is that it has to be a specific set
> of hand-crafted machines, and that in turn feels like we've lost
> the thread, so to speak,

The machines are in fact pretty much off-the-shelf distro installs, so
not hand-crafted.

> about having a reproducible build system. 6 hours even beats my
> U-Boot world before/after times, so I do get the dread of "now it
> might take 5% longer", which is very real extra wallclock time. But
> if it means more builders could be available because they're easy
> to spin up, that could bring the overall time down.

Here we get onto infrastructure, as we're now talking not about
containers on our workers but about general cloud systems, which is a
different proposition.

We *heavily* rely on the fast network fabric between the workers and
our NAS for sstate (NFS mounted). This is where we get a big chunk of
our speed, so "easy to spin up" isn't actually the case, for different
reasons.
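
For context, the shared sstate is just a directory the builds point
at. A minimal sketch, with a hypothetical mount point (PATH is a
literal token that bitbake substitutes, not a variable):

# reuse sstate objects from an NFS-mounted NAS
cat >> conf/local.conf <<'EOF'
SSTATE_MIRRORS = "file://.* file:///nas/sstate/PATH"
EOF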

> > So this plan is the best practical approach we can come up with
> > to allow us to build older releases without changing the
> > autobuilders too much and causing new sets of problems. I should
> > have mentioned this; I just assumed people kind of know it, sorry.
>
> Since I don't want to put even more on your plate, what is the
> reasonable test to try here? Or is it hard to say, since it's not
> just "MACHINE=qemux86-64 bitbake world" but also "run this and that
> and something else"?

It's quite simple:

MACHINE=qemux86-64 bitbake core-image-sato-sdk -c testimage

and

MACHINE=qemux86-64 bitbake core-image-sato-sdk -c testsdkext

are the two to start with. If those work, the other "nasty" ones are
oe-selftest and the toolchain test suites. We also need to check that
kvm is working.
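
For reference, those extra pieces look roughly like this (the test
names are examples, and the oe-selftest flag spelling varies between
releases):

# run a subset of oe-selftest rather than the full suite
oe-selftest -r wic devtool

# quick sanity check that kvm acceleration is available
test -w /dev/kvm && echo "kvm available"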

We have gone around in circles on this several times, as you're not
the first to suggest it :/.

Cheers,

Richard


Tom Rini
 

On Mon, Sep 07, 2020 at 10:30:20PM +0100, Richard Purdie wrote:
> On Mon, 2020-09-07 at 17:19 -0400, Tom Rini wrote:
> > On Mon, Sep 07, 2020 at 10:03:36PM +0100, Richard Purdie wrote:
> > > On Mon, 2020-09-07 at 16:55 -0400, Tom Rini wrote:
> > > The autobuilder is set up for speed so there aren't VMs
> > > involved; it's 'bare metal'. Containers would be possible, but
> > > at that point the kernel isn't the distro kernel and you have
> > > permission issues with the qemu networking, for example.
> >
> > Which issues do you run into with qemu networking? I honestly
> > don't know if the U-Boot networking tests we run via qemu under
> > Docker are more or less complex than what you're running into.
>
> It's the tun/tap device requirement that tends to be the pain
> point. Being able to ssh from the host OS into the qemu target
> image is a central requirement of oeqa. Everyone tells me it should
> use port mapping and slirp instead to avoid the privilege problems
> and the container issues, which is great, but that isn't
> implemented.

Ah, OK. Yes, we're using "user" networking, not tap.

> > > Speed is extremely important as we have about a 6 hour build
> > > test time but a *massive* test range (e.g. all the gcc/glibc
> > > test suites on each arch, build+boot testing all the arches
> > > under qemu for sysvinit+systemd, oe-selftest on each distro). I
> > > am already tearing my hair out trying to maintain what we have
> > > and deal with the races; adding containers into the mix simply
> > > isn't something I can face.
> > >
> > > We do have older distros in the cluster for a time, e.g.
> > > centos7 is still there, although we've replaced the OS on some
> > > of the original centos7 workers as the hardware had disk
> > > failures, so there aren't as many of them as there were.
> > > Centos7 gives us problems when trying to build master.
> >
> > The reason I was thinking about containers is that it should
> > remove some of what you have to face.
>
> Removes some, yes, but creates a whole set of other issues.

> > Paul may or may not want to chime in on how workable it ended up
> > being for a particular customer, but leveraging CROPS to set up a
> > build environment of a supported host and then running it on
> > whatever build hardware is available worked well. It sounds like
> > part of the autobuilder problem is that it has to be a specific
> > set of hand-crafted machines, and that in turn feels like we've
> > lost the thread, so to speak,
>
> The machines are in fact pretty much off-the-shelf distro installs,
> so not hand-crafted.

Sorry, what I meant by hand-crafted is that for it to work for older
installs you have to do this particular dance to provide various host
tools that weren't required at the time.

> > about having a reproducible build system. 6 hours even beats my
> > U-Boot world before/after times, so I do get the dread of "now it
> > might take 5% longer", which is very real extra wallclock time.
> > But if it means more builders could be available because they're
> > easy to spin up, that could bring the overall time down.
>
> Here we get onto infrastructure, as we're now talking not about
> containers on our workers but about general cloud systems, which is
> a different proposition.
>
> We *heavily* rely on the fast network fabric between the workers
> and our NAS for sstate (NFS mounted). This is where we get a big
> chunk of our speed, so "easy to spin up" isn't actually the case,
> for different reasons.

> > > So this plan is the best practical approach we can come up with
> > > to allow us to build older releases without changing the
> > > autobuilders too much and causing new sets of problems. I
> > > should have mentioned this; I just assumed people kind of know
> > > it, sorry.
> >
> > Since I don't want to put even more on your plate, what is the
> > reasonable test to try here? Or is it hard to say, since it's not
> > just "MACHINE=qemux86-64 bitbake world" but also "run this and
> > that and something else"?
>
> It's quite simple:
>
> MACHINE=qemux86-64 bitbake core-image-sato-sdk -c testimage
>
> and
>
> MACHINE=qemux86-64 bitbake core-image-sato-sdk -c testsdkext
>
> are the two to start with. If those work, the other "nasty" ones
> are oe-selftest and the toolchain test suites. We also need to
> check that kvm is working.
>
> We have gone around in circles on this several times, as you're not
> the first to suggest it :/.

Thanks for explaining it again. I'll go off and do some tests.

--
Tom