Intermittent failure issue summary


Richard Purdie
 

I'm guessing a lot of people don't follow the intermittent issues. I therefore
thought I'd share a summary of some of them along with some random thoughts on
them. There is a mix of different things here, each needing different skills.

Systemd daemon-reload unit restart failures:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14787
AlexK has got part way in figuring out the circumstances of this, any systemd
experts able to spot what I think is a service file dependency issue?

EFI Boot Failure:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14018
"oe-selftest - efibootpartition.GenericEFITest.test_boot_efi selftest"
Does anyone know the EFI boot process and know what logging we might add to the
system so we gain more insight when this happens?

Bitbake parsing error:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14665
"Parsing recipes...ERROR: ParseError in None: Not all recipes parsed, parser
thread killed/died? Exiting" - I just can't spot the logic bug causing this
error (and some similar variants), maybe someone else can?

sstate files not found:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14775
For this one I think we need to write a standalone replica of the tests against
an sstate mirror that sstate.bbclass runs to check if sstate objects exist. That
way we could try different load levels against the project server and see
whether it is the sstate/fetcher code (which does weird things with threads and
concurrent connections) or if it is the server side of things that has some
limit we can't spot.
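
As a rough illustration of what such a standalone checker might look like, here is a minimal sketch. It is not the sstate.bbclass code and does not use the bitbake fetcher: it only issues plain HTTP HEAD requests at several concurrency levels and records which objects appear missing at each level. The function names, the mirror URL, the object paths and the load levels are all made up for the example.

```python
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def head_exists(url, timeout=10.0):
    """Return True if an HTTP HEAD request for url gets a 2xx response."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # HTTP errors, connection failures and socket timeouts all count
        # as "not found" here; a real tool would record them separately.
        return False


def check_objects(mirror, paths, workers, probe=head_exists):
    """Probe every path on the mirror using `workers` concurrent threads."""
    urls = [mirror.rstrip("/") + "/" + p.lstrip("/") for p in paths]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(probe, urls))
    return dict(zip(urls, results))


def missing_by_load(mirror, paths, levels=(1, 4, 16), probe=head_exists):
    """Repeat the same check at each concurrency level (hypothetical levels)."""
    report = {}
    for workers in levels:
        found = check_objects(mirror, paths, workers, probe)
        report[workers] = sorted(u for u, ok in found.items() if not ok)
    return report
```

If objects only go missing at the higher levels, that would point at load-related behaviour rather than the objects really being absent. It would not by itself say whether the client or the server is at fault.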

pseudo do_flush_pseudodb task error:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14654
Not sure why this sometimes happens, we likely need to spot the race in the
pseudo shutdown code.

Memory resident bitbake PR Serv issue:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14786
This is one of the blocking issues on moving to memory resident bitbake by
default.

x86 boot log serio/CD drive timeout in qemu:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14743
We've talked about disabling some of the peripherals we don't need/care about
such as psmouse and the CD drive. Anyone fancy digging into this with upstream
qemu? I suspect there are other people who'd like this too.

Bitbake Server timeout:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14201
This one really needs a rework of bitbake's main loop with a new thread so that
the UI and server can talk even when whatever it is doing (parsing, event
handlers) is blocked. No takers?! Just thought I'd add to the list! :)
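
To make the idea concrete, here is a toy sketch of the general pattern, not bitbake's actual server code: a side thread keeps answering "ping" requests from the UI while the main work is blocked. The `Server` class and its methods are hypothetical names for illustration only.

```python
import queue
import threading


class Server:
    """Toy server: a side thread keeps answering pings while work blocks."""

    def __init__(self):
        self.requests = queue.Queue()
        self.state = "idle"
        self._stop = threading.Event()
        self._responder = threading.Thread(target=self._respond, daemon=True)
        self._responder.start()

    def _respond(self):
        # Runs independently of run_blocking(), so a UI still gets an answer.
        while not self._stop.is_set():
            try:
                reply = self.requests.get(timeout=0.05)
            except queue.Empty:
                continue
            reply.put(("pong", self.state))

    def ping(self, timeout=1.0):
        """What a UI would call; raises queue.Empty if nothing answers."""
        reply = queue.Queue(maxsize=1)
        self.requests.put(reply)
        return reply.get(timeout=timeout)

    def run_blocking(self, work):
        """Stand-in for parsing or an event handler that blocks the main loop."""
        self.state = "busy"
        try:
            return work()
        finally:
            self.state = "idle"

    def shutdown(self):
        self._stop.set()
        self._responder.join()
```

With this shape the UI can tell "busy" apart from "dead". One caveat: in CPython a CPU-bound task still competes with the responder thread for the interpreter lock, so replies may be delayed even though they are no longer blocked outright.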


These are 8 of the issues and probably the most frequent/annoying or ones where
there is a clearish path forward. The full list of 57:

https://bugzilla.yoctoproject.org/buglist.cgi?quicksearch=AB-INT

(it was over 70 at one point, we've beaten it down a bit)

Cheers,

Richard


Markus Volk
 

Could the systemd issue be this?

https://github.com/systemd/systemd/pull/22552/commits/de90700f36f2126528f7ce92df0b5b5d5e277558

On 16.04.22 at 12:26, Richard Purdie wrote:

Systemd daemon-reload unit restart failures:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14787
AlexK has got part way in figuring out the circumstances of this, any systemd
experts able to spot what I think is a service file dependency issue?


Richard Purdie
 

On Sat, 2022-04-16 at 15:31 +0200, Markus Volk wrote:
Could the systemd issue be this?

https://github.com/systemd/systemd/pull/22552/commits/de90700f36f2126528f7ce92df0b5b5d5e277558

Yes, that could well be it :)

Particularly when you read:

https://github.com/systemd/systemd/issues/15316

Alex: Any thoughts?

Cheers,

Richard


Alexander Kanavin
 

On Sat, 16 Apr 2022 at 15:40, Richard Purdie
<richard.purdie@...> wrote:
Yes, that could well be it :)

Particularly when you read:

https://github.com/systemd/systemd/issues/15316

Alex: Any thoughts?

These commits have been backported to 250-stable, released in 250.4,
and we already carry that version :-(
https://github.com/systemd/systemd-stable/commit/367041af816d48d4852140f98fd0ba78ed83f9e4

Alex


Jose Quaresma
 



Richard Purdie <richard.purdie@...> wrote on Saturday, 16/04/2022 at 11:26:
sstate files not found:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14775
For this one I think we need to write a standalone replica of the tests against
an sstate mirror that sstate.bbclass runs to check if sstate objects exist. That
way we could try different load levels against the project server and see
whether it is the sstate/fetcher code (which does weird things with threads and
concurrent connections) or if it is the server side of things that has some
limit we can't spot.

Would it be a good idea to raise a warning and retry in such cases?

A socket timeout looks server related to me, and since the last server
infrastructure migration this timeout issue has improved a lot.
Before that migration I could work around the timeout by setting
BB_NUMBER_THREADS=1, which makes one connection at a time.
That BB_NUMBER_THREADS=1 helps makes me think this could be
a race condition in oe.utils.ThreadedPool, which afaik
is only used in sstate.bbclass.
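
For illustration, a warn-and-retry wrapper could look roughly like the sketch below. It is generic Python, not the sstate.bbclass or fetcher code. The `check` callable, the attempt count and the delay are assumptions made up for the example.

```python
import logging
import time

logger = logging.getLogger("sstate-retry-sketch")


def check_with_retry(check, url, attempts=3, delay=1.0, sleep=time.sleep):
    """Call check(url); on OSError (which includes socket timeouts) warn and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return check(url)
        except OSError as exc:
            if attempt == attempts:
                # Out of retries: let the caller see the original error.
                raise
            logger.warning("check of %s failed (%s), retry %d of %d",
                           url, exc, attempt, attempts - 1)
            # Back off a little longer after each failure.
            sleep(delay * attempt)
```

The warning keeps the failure visible in the logs, so a retry that succeeds does not hide how often the timeout actually happens.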

Jose


Ross Burton <ross@...>
 

On Sat, 16 Apr 2022 at 11:26, Richard Purdie
<richard.purdie@...> wrote:
x86 boot log serio/CD drive timeout in qemu:
https://bugzilla.yoctoproject.org/show_bug.cgi?id=14743
We've talked about disabling some of the peripherals we don't need/care about
such as psmouse and the CD drive. Anyone fancy digging into this with upstream
qemu? I suspect there are other people who'd like this too.

Patches sent for the keyboard/mouse part. The CD drive is trickier...

Ross


Richard Purdie
 

On Tue, 2022-04-19 at 17:50 +0100, Ross Burton wrote:
Patches sent for the keyboard/mouse part. The CD drive is trickier...

Knocking those two out alone is great and much appreciated, thanks!

Cheers,

Richard