Adding more information to the SBOM


Marta Rybczynska
 

Dear all,
(cross-posting to oe-core and *-architecture)
Over the last few months, we have been working in Oniro on using the
create-spdx class for both IP compliance and security.

During this work, Alberto Pianon found that some information is
missing from the SBOM, so it does not contain enough for Software
Composition Analysis. The main missing piece is the relation between
the actual upstream sources and the final binaries (create-spdx uses
composite sources).

Alberto has worked on how to obtain the missing data and now has a
POC. This POC provides full source-to-binary tracking of Yocto builds
through a couple of scripts (intended to be transformed into a new
bbclass at a later stage). The goal is to add the missing pieces of
information in order to get a "real" SBOM from Yocto, which should, at
a minimum:

- carefully describe what is found in a final image (i.e. binary files
and their dependencies), since that is what is actually distributed
and goes into the final product;
- describe how such binary files have been generated and where they
come from (i.e. upstream sources, including patches and other stuff
added from meta-layers); provenance is important for a number of
reasons related to IP Compliance and security.

The aim is to be able to:

- map binaries to their corresponding upstream source packages (and
not to the "internal" source packages created by recipes by combining
multiple upstream sources and patches)
- map binaries to the source files that have been actually used to
build them - which usually are a small subset of the whole source
package

With respect to IP compliance, this would make it possible, among other things, to:

- get the real license text for each binary file, by getting the
license of the specific source files it has been generated from (as
provided by Fossology, for instance), rather than the main license
stated in the corresponding recipe (which may be as confusing as
GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
even worse)
- automatically check license incompatibilities at the binary file level.

Other interesting things could also be done on the security side.

This work intends to add a way to provide additional data that can be
used by create-spdx, not to replace create-spdx in any way.
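
To illustrate the kind of data we would like to add (a rough sketch in
SPDX 2.x JSON terms; all identifiers and paths below are invented, this
is not the actual POC output format):

    # Illustrative only: identifiers and paths are invented.
    extra_relationships = [
        {   # a binary file shipped in the image...
            "spdxElementId": "SPDXRef-binary-libfoo.so.1",
            "relationshipType": "GENERATED_FROM",
            # ...generated from a specific file of a specific upstream
            # tarball, not from the recipe's composite WORKDIR sources
            "relatedSpdxElement": "SPDXRef-foo-1.2.3-src-bar.c",
        },
        {
            "spdxElementId": "SPDXRef-foo-1.2.3-src-bar.c",
            "relationshipType": "CONTAINED_BY",
            "relatedSpdxElement": "SPDXRef-upstream-foo-1.2.3",  # the upstream source package
        },
    ]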

The sources with a long README are available at
https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker

What do you think of this work? Would it be of interest to integrate
into YP at some point? Shall we discuss this?

Marta and Alberto


Joshua Watt
 

On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@...> wrote:

Dear all,
(cross-posting to oe-core and *-architecture)
In the last months, we have worked in Oniro on using the create-spdx
class for both IP compliance and security.

During this work, Alberto Pianon has found that some information is
missing from the SBOM and it does not contain enough for Software
Composition Analysis. The main missing point is the relation between
the actual upstream sources and the final binaries (create-spdx uses
composite sources).
I believe we map the binaries to the source code from the -dbg
packages; is the premise that this is insufficient? Can you elaborate
on why that is? I don't quite understand. The debug sources are
(basically) what we actually compiled (e.g. post-do_patch) to produce
the binary, and you can in turn follow these back to the upstream
sources with the downloadLocation property.
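
Conceptually, the chain I mean is something like this (just a sketch,
not create-spdx code; the dicts stand in for information that pkgdata
and the generated SPDX documents already record):

    # Sketch only: the data structures are placeholders.
    def binary_to_upstream(binary, debugsrc_by_binary, download_location_by_recipe, recipe):
        """binary -> post-do_patch debug sources -> the recipe's downloadLocation."""
        sources = debugsrc_by_binary.get(binary, [])          # from tmp/pkgdata extended data
        upstream = download_location_by_recipe.get(recipe)    # from the recipe's SPDX package
        return {"binary": binary, "compiled_sources": sources, "downloadLocation": upstream}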


Alberto has worked on how to obtain the missing data and now has a
POC. This POC provides full source-to-binary tracking of Yocto builds
through a couple of scripts (intended to be transformed into a new
bbclass at a later stage). The goal is to add the missing pieces of
information in order to get a "real" SBOM from Yocto, which should, at
a minimum:
Please be a little careful with the wording; SBoMs have a lot of uses,
and many of them can be satisfied with what we currently generate. It
may not cover the exact use case you are looking for, but that doesn't
mean it's not a "real" SBoM :)


- carefully describe what is found in a final image (i.e. binary files
and their dependencies), since that is what is actually distributed
and goes into the final product;
- describe how such binary files have been generated and where they
come from (i.e. upstream sources, including patches and other stuff
added from meta-layers); provenance is important for a number of
reasons related to IP Compliance and security.

The aim is to become able to:

- map binaries to their corresponding upstream source packages (and
not to the "internal" source packages created by recipes by combining
multiple upstream sources and patches)
- map binaries to the source files that have been actually used to
build them - which usually are a small subset of the whole source
package

With respect to IP compliance, this would make it possible, among other things, to:

- get the real license text for each binary file, by getting the
license of the specific source files it has been generated from
(provided by Fossology, for instance), - and not the main license
stated in the corresponding recipe (which may be as confusing as
GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
even worse)
IIUC this is the difference between the "Declared" license and the
"Concluded" license. You can report both, and I think
create-spdx.bbclass can currently do this with its rudimentary source
license scanning. You really do want both and it's a great way to make
sure that the "Declared" license (that is the license in the recipe)
reflects the reality of the source code.
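
For reference, on a package the two fields look like this (the values
here are invented for the example):

    # Example values only.
    package_entry = {
        "name": "foo",
        # Declared: what the recipe's LICENSE variable claims.
        "licenseDeclared": "GPL-2.0-or-later & BSD-3-Clause",
        # Concluded: what a scan/review of the actual sources supports.
        "licenseConcluded": "GPL-2.0-or-later AND BSD-3-Clause AND BSD-4-Clause",
    }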

- automatically check license incompatibilities at the binary file level.

Other possible interesting things could be done also on the security side.

This work intends to add a way to provide additional data that can be
used by create-spdx, not to replace create-spdx in any way.

The sources with a long README are available at
https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker

What do you think of this work? Would it be of interest to integrate
into YP at some point? Shall we discuss this?
This seems promising as something that could potentially move into
core. I have a few points:
- The extraction of the sources to a dedicated directory is something
that Richard has been toying around with for quite a while, and I
think it would greatly simplify that part of your process. I would
very much encourage you to look at the work he's done and help get it
pushed across the finish line, as it's a really good improvement that
would benefit more than just your source scanning.
- I would encourage you not to wait to turn this into a bbclass
and/or library functions. You should be able to do this in a new
layer, and that would make it much clearer what the path to inclusion
in OE-core would look like. It would also (IMHO) be nicer to the
users :)


Marta and Alberto


Mark Hatle
 

On 9/14/22 9:56 AM, Joshua Watt wrote:
On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@...> wrote:

Dear all,
(cross-posting to oe-core and *-architecture)
In the last months, we have worked in Oniro on using the create-spdx
class for both IP compliance and security.

During this work, Alberto Pianon has found that some information is
missing from the SBOM and it does not contain enough for Software
Composition Analysis. The main missing point is the relation between
the actual upstream sources and the final binaries (create-spdx uses
composite sources).
I believe we map the binaries to the source code from the -dbg
packages; is the premise that this is insufficient? Can you elaborate
more on why that is, I don't quite understand. The debug sources are
(basically) what we actually compiled (e.g. post-do_patch) to produce
the binary, and you can in turn follow these back to the upstream
sources with the downloadLocation property.
When I last looked at this, it was critical that the analysis be:

binary -> patched & configured source (dbg package) -> how the sources were constructed.

As Joshua said above, I believe all of the information is present for this: you can tie the binary (through debug symbols) back to the debug package, and the source of the debug package back to the sources that constructed it via heuristics. (If you enable the git patch mechanism, it should even be possible to use git blame to find exactly which upstream pieces constructed the patched sources.)

For generated content it's more difficult, but those items usually contain a header indicating what generated them, so other heuristics can be used.
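
As a sketch of that git-based heuristic (this assumes PATCHTOOL = "git",
so that each applied patch becomes a commit on top of the upstream
import in the work directory; it is an illustration, not existing code):

    # Sketch only; assumes the patched source tree is a git repository
    # with one commit per applied patch.
    import subprocess

    def commits_touching(src_dir, source_file):
        """Return the commits (upstream import plus applied patches) that touched a file."""
        out = subprocess.run(
            ["git", "-C", src_dir, "log", "--format=%H %s", "--", source_file],
            check=True, capture_output=True, text=True,
        )
        return out.stdout.splitlines()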


Alberto has worked on how to obtain the missing data and now has a
POC. This POC provides full source-to-binary tracking of Yocto builds
through a couple of scripts (intended to be transformed into a new
bbclass at a later stage). The goal is to add the missing pieces of
information in order to get a "real" SBOM from Yocto, which should, at
a minimum:
Please be a little careful with the wording; SBoMs have a lot of uses,
and many of them we can satisfy with what we currently generate; it
may not do the exact use case you are looking for, but that doesn't
mean it's not a "real" SBoM :)


- carefully describe what is found in a final image (i.e. binary files
and their dependencies), since that is what is actually distributed
and goes into the final product;
- describe how such binary files have been generated and where they
come from (i.e. upstream sources, including patches and other stuff
added from meta-layers); provenance is important for a number of
reasons related to IP Compliance and security.
Full compliance will require mapping binaries to the patched source to the upstream sources _AND_ the instructions (layer/recipe/configuration) used to build them. But it's up to the local legal determination to figure out how far you really need to go, vs. just "here are the layers I used to build my project".

The aim is to become able to:

- map binaries to their corresponding upstream source packages (and
not to the "internal" source packages created by recipes by combining
multiple upstream sources and patches)
- map binaries to the source files that have been actually used to
build them - which usually are a small subset of the whole source
package

With respect to IP compliance, this would make it possible, among other things, to:

- get the real license text for each binary file, by getting the
license of the specific source files it has been generated from
(provided by Fossology, for instance), - and not the main license
stated in the corresponding recipe (which may be as confusing as
GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or
even worse)
IIUC this is the difference between the "Declared" license and the
"Concluded" license. You can report both, and I think
create-spdx.bbclass can currently do this with its rudimentary source
license scanning. You really do want both and it's a great way to make
sure that the "Declared" license (that is the license in the recipe)
reflects the reality of the source code.
And the thing to keep in mind is that in a given package the "Declared" is usually what a LICENSE file or header says. But the "Concluded" has levels of quality behind it. The first level of quality is "Declared", the next level is automation (something like Fossology), the next is human review, and the highest is lawyer review.

So being able to inject SPDX information with Concluded values for evaluation, and to track the 'quality level', has always been something I wanted to do but never had time for.

At the time, my idea was a database (and/or bbappend) for each component that would include pre-processed SPDX data for each recipe. This data would run through a validation step to show that it actually matches the patched sources. (If any file checksums do NOT match, they would be flagged for follow-up.)
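
The validation step could be as simple as this sketch (the data layout
here is hypothetical):

    # Sketch only: 'reviewed' maps relative paths to the sha256 recorded
    # alongside the pre-processed SPDX data.
    import hashlib
    import os

    def flag_mismatches(reviewed, src_dir):
        """Return the files whose current contents no longer match the reviewed data."""
        flagged = []
        for relpath, expected in reviewed.items():
            h = hashlib.sha256()
            with open(os.path.join(src_dir, relpath), "rb") as f:
                for chunk in iter(lambda: f.read(65536), b""):
                    h.update(chunk)
            if h.hexdigest() != expected:
                flagged.append(relpath)  # needs human follow-up
        return flagged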

- automatically check license incompatibilities at the binary file level.

Other possible interesting things could be done also on the security side.

This work intends to add a way to provide additional data that can be
used by create-spdx, not to replace create-spdx in any way.

The sources with a long README are available at
https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker

What do you think of this work? Would it be of interest to integrate
into YP at some point? Shall we discuss this?
This seems promising as something that could potentially move into
core. I have a few points:
- The extraction of the sources to a dedicated directory is something
that Richard has been toying around with for quite a while, and I
think it would greatly simplify that part of your process. I would
very much encourage you to look at the work he's done, and work on
that to get it pushed across the finish line as it's a really good
improvement that would benefit not just your source scanning.
- I would encourage you to not wait to turn this into a bbclass
and/or library functions. You should be able to do this in a new
layer, and that would make it much clearer as to what the path to
being included in OE-core would look like. It also would (IMHO) be
nicer to the users :)
Agreed, this looks useful. The key is to start turning it into one or more bbclasses now, things that work with the Yocto Project process. Don't try to "post-process" and reconstruct sources; instead, inject steps that run your file checksums and build up your database as the sources are constructed (i.e. in do_unpack, do_patch, etc.).

The key is, all of the information IS available. It just may not be in the format you want.

--Mark


Marta and Alberto



Richard Purdie
 

On Wed, 2022-09-14 at 16:16 +0200, Marta Rybczynska wrote:
The sources with a long README are available at
https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker

What do you think of this work? Would it be of interest to integrate
into YP at some point? Shall we discuss this?
I had a look at this and was a bit puzzled by some of it.

I can see the issues you'd have if you want to separate the unpatched
source from the patches and know which files had patches applied, as
that is hard to track. There would be significant overhead in trying
to process and store that information in the unpack/patch steps; the
archiver class does some of that already and it is messy, hard and
doesn't perform well. I'm reluctant to force everyone to do it as a
result, but making it optional also means multiple code paths, and when
you have those, one of them inevitably breaks :(.

I also can see the issue with multiple sources in SRC_URI, although you
should be able to map those back if you assume subtrees are "owned" by
given SRC_URI entries. I suspect there may be a SPDX format limit in
documenting that piece?
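
As a rough sketch of what I mean by subtree ownership (the
subdirectory-to-entry mapping would have to be recorded at unpack time;
here it is just an input dict, purely for illustration):

    # Sketch only: the longest matching unpack subdirectory "owns" the file.
    def owning_src_uri(relpath, subdir_to_uri):
        """subdir_to_uri maps an unpack subdir (relative to WORKDIR) to a SRC_URI entry."""
        best = None
        for subdir, uri in subdir_to_uri.items():
            if relpath == subdir or relpath.startswith(subdir.rstrip("/") + "/"):
                if best is None or len(subdir) > len(best[0]):
                    best = (subdir, uri)
        return best[1] if best else None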

Where I became puzzled is where you say "Information about debug
sources for each actual binary file is then taken from
tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
and use for the spdx class so you shouldn't need to reinvent that
piece. It should be the exact same data the spdx class uses.

I was also puzzled about the difference between rpm and the other
package backends. The exact same files are packaged by all the package
backends so the checksums from do_package should be fine.


For the source issues above, it basically comes down to how much
"pain" we want to push onto all users for the sake of adding in this
data. Unfortunately it is data which many won't need or use, and
different legal departments do have different requirements. Experience
with archiver.bbclass shows that multiple codepaths doing these things
are a nightmare to keep working, particularly for corner cases which do
interesting things with the code (externalsrc, gcc shared workdir, the
kernel and more).

Cheers,

Richard


Alberto Pianon
 

Hi Richard,

thank you for your reply; you gave me some very interesting points to
think about. I'll reply in reverse order of importance.

On 2022-09-15 14:16, Richard Purdie wrote:
For the source issues above, it basically comes down to how much
"pain" we want to push onto all users for the sake of adding in this
data. Unfortunately it is data which many won't need or use, and
different legal departments do have different requirements.
We didn't paint the overall picture sufficiently well, therefore our
requirements may come across as coming from a particularly pedantic
legal department; my fault :)

Oniro is not "yet another commercial Yocto project", and we are not a
legal department (even if we are experienced FLOSS lawyers and
auditors, the most prominent of whom is Carlo Piana -- cc'ed -- former
general counsel of the FSFE and a member of the OSI Board).

Our rather ambitious goal is not limited to Oniro: it is to do
compliance the open source way, both setting an example and providing
guidance and material so that others can benefit from our effort. Our
work will therefore be shared (and possibly improved by others) not
only with Oniro-based projects but with any Yocto project. Among other
things, the most relevant bit of work that we want to share is
**fully reviewed license information** and other legal metadata about a
whole bunch of open source components commonly used in Yocto projects.

To do that in a **scalable and fully automated way**, we need Yocto to
collect some information that is currently discarded (or simply not
collected) at build time.

The Oniro Project Leader, Davide Ricci (cc'ed), strongly encouraged us
to seek feedback from you in order to find the best way to do it.

Maybe organizing a call would be more convenient than discussing
background and requirements here, if you (and others) are available.


Experience
with archiver.bbclass shows that multiple codepaths doing these things
is a nightmare to keep working, particularly for corner cases which do
interesting things with the code (externalsrc, gcc shared workdir, the
kernel and more).
I had a look at this and was a bit puzzled by some of it.
I can see the issues you'd have if you want to separate the unpatched
source from the patches and know which files had patches applied, as
that is hard to track. There would be significant overhead in trying
to process and store that information in the unpack/patch steps; the
archiver class does some of that already and it is messy, hard and
doesn't perform well. I'm reluctant to force everyone to do it as a
result, but making it optional also means multiple code paths, and when
you have those, one of them inevitably breaks :(.
I also can see the issue with multiple sources in SRC_URI, although you
should be able to map those back if you assume subtrees are "owned" by
given SRC_URI entries. I suspect there may be a SPDX format limit in
documenting that piece?
I'm replying in reverse order:

- there is an SPDX format limit, but it is by design: an SPDX package
entity is a single software distribution unit, so it may have only one
downloadLocation; if you have more than one downloadLocation, you must
have more than one SPDX package, according to the SPDX specs;

- I understand that my solution is a bit hacky; but IMHO any other
*post-mortem* solution would be far more hacky; the real solution
would be collecting required information directly in do_fetch and
do_unpack

- I also understand that we should reduce pain, otherwise nobody would
use our solution; the simplest and cleanest way I can think of is
collecting just the packages' (in the SPDX sense) file relative paths
and checksums at every stage (fetch, unpack, patch, package), and
leaving the data processing (i.e. mapping upstream source packages ->
recipe's WORKDIR package -> debug source package -> binary packages ->
binary image) to a separate tool, which might use (just a thought) a
graph database to process things more efficiently.
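
Just to make the idea concrete, the per-stage data collection could be
as simple as this (a sketch; the output layout is only a proposal, not
existing create-spdx behaviour):

    # Sketch only: walk a stage's tree and record relative paths and checksums.
    import hashlib
    import json
    import os

    def stage_manifest(root, stage):
        files = {}
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue
                with open(path, "rb") as f:
                    files[os.path.relpath(path, root)] = hashlib.sha256(f.read()).hexdigest()
        return {"stage": stage, "files": files}

    # e.g. after do_patch:
    # json.dump(stage_manifest(src_dir, "patched"), open("patched-manifest.json", "w"))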


Where I became puzzled is where you say "Information about debug
sources for each actual binary file is then taken from
tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
and use for the spdx class so you shouldn't need to reinvent that
piece. It should be the exact same data the spdx class uses.
you're right, but in the context of a POC it was easier to extract them
directly from the json files than from SPDX data :) It's just a POC to
show that the required information can be retrieved in some way; the
implementation details do not matter

I was also puzzled about the difference between rpm and the other
package backends. The exact same files are packaged by all the package
backends so the checksums from do_package should be fine.
Here I may be missing some piece of information. I looked at the files
in tmp/pkgdata but I couldn't find package file checksums anywhere; that
is why I parsed rpm packages. But if such checksums were already
available somewhere in tmp/pkgdata, it wouldn't be necessary to parse
rpm packages at all... Could you point me to what I'm (maybe) missing
here? Thanks!

In any case, thank you so much for all your insights, they were
super-useful!

Cheers,

Alberto


Mark Hatle
 

On 9/16/22 10:18 AM, Alberto Pianon wrote:

... trimmed ...

I also can see the issue with multiple sources in SRC_URI, although you
should be able to map those back if you assume subtrees are "owned" by
given SRC_URI entries. I suspect there may be a SPDX format limit in
documenting that piece?
I'm replying in reverse order:
- there is a SPDX format limit, but it is by design: a SPDX package
entity is a single sw distribution unit, so it may have only one
downloadLocation; if you have more than one downloadLocation, you must
have more than one SPDX package, according to SPDX specs;
I think my interpretation of this is different. I've got a view of 'sourcing materials', and then verifying they are what we think they are and can be used the way we want. The "upstream sources" (and patches) are really just 'raw materials' that we use the Yocto Project to combine to create "the source".

So for the purpose of the SPDX, each upstream source _may_ have a corresponding SPDX, but for the binaries their source is the combined unit, not multiple SPDXes. Think of it something like this:

upstream source1 - SPDX
upstream source2 - SPDX
upstream patch
recipe patch1
recipe patch2

In the above, each of those items would be combined by the recipe system to construct the source used to build an individual recipe (and collection of packages). Automation _IS_ used to combine the components [unpack/fetch] and _MAY_ be used to generate a combined SPDX.

So your "upstream" location for this recipe is the local machine's source archive. The SPDX for the local recipe files can merge the SPDX information they know (and if it's at a file level) can use checksums to identify the items not captured/modified by the patches for further review (either manual or automation like fossology). In the case where an upstream has SPDX data, you should be able to inherit MOST files this way... but the output is specific to your configuration and patches.

1 - SPDX |
2 - SPDX |
patch    |---> recipe specific SPDX
patch    |
patch    |

In some cases someone may want to generate SPDX data for the 3 patches, but that may or may not be useful in this context.
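
The checksum matching itself is trivial, something like this sketch
(the data shapes are illustrative):

    # Sketch only: files whose checksum still matches the upstream SPDX keep
    # its per-file conclusions; everything else gets flagged for review.
    def split_inherited(local_files, upstream_files):
        """Both arguments map sha256 -> relative path."""
        inherited = {h: p for h, p in local_files.items() if h in upstream_files}
        to_review = {h: p for h, p in local_files.items() if h not in upstream_files}
        return inherited, to_review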

- I understand that my solution is a bit hacky; but IMHO any other
*post-mortem* solution would be far more hacky; the real solution
would be collecting required information directly in do_fetch and
do_unpack
I've not looked at the current SPDX spec, but past versions had a notes section. Assuming this is still present, you can use it to reference back to how this component was constructed and the upstream source URIs (and SPDX files) you used for processing.

This way nothing really changes in do_fetch or do_unpack. (You may want to find a way to capture file checksums and what the source was for a particular file, but it may not really be necessary!)

- I also understand that we should reduce pain, otherwise nobody would
use our solution; the simplest and cleanest way I can think about is
collecting just package (in the SPDX sense) files' relative paths and
checksums at every stage (fetch, unpack, patch, package), and leave
data processing (i.e. mapping upstream source packages -> recipe's
WORKDIR package -> debug source package -> binary packages -> binary
image) to a separate tool, that may use (just a thought) a graph
database to process things more efficiently.
Even in do_patch nothing really changes, other than, again, you may want to capture checksums to identify things that need further processing.


This approach greatly simplifies things, and gives people doing code reviews insight into what source was used when shipping the binaries (which is really an important aspect of this), as well as which recipe and "build" (really fetch/unpack/patch) were used to construct the sources. If they want to investigate the sources further back to their provider, then the notes would have the information for that, and you could transition back to the "raw materials" providers.


Where I became puzzled is where you say "Information about debug
sources for each actual binary file is then taken from
tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
and use for the spdx class so you shouldn't need to reinvent that
piece. It should be the exact same data the spdx class uses.
you're right, but in the context of a POC it was easier to extract them
directly from json files than from SPDX data :) It's just a POC to show
that required information may be retrieved in some way, implementation
details do not matter

I was also puzzled about the difference between rpm and the other
package backends. The exact same files are packaged by all the package
backends so the checksums from do_package should be fine.
Here I may miss some piece of information. I looked at files in
tmp/pkgdata but I couldn't find package file checksums anywhere: that is
why I parsed rpm packages. But if such checksums were already available
somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
at all... Could you point me to what I'm (maybe) missing here? Thanks!
File checksumming is expensive. There are checksums available to individual packaging engines, as well as aggregate checksums for "hash equivalency", but I'm not aware of any per-file checksum that is stored.

You definitely shouldn't be parsing packages of any type (rpm or otherwise), as packages are truly optional. It's the binaries that matter here.

--Mark

In any case, thank you so much for all your insights, they were
super-useful!
Cheers,
Alberto


Richard Purdie
 

On Fri, 2022-09-16 at 17:18 +0200, Alberto Pianon wrote:
On 2022-09-15 14:16, Richard Purdie wrote:

For the source issues above, it basically comes down to how much
"pain" we want to push onto all users for the sake of adding in this
data. Unfortunately it is data which many won't need or use, and
different legal departments do have different requirements.
We didn't paint the overall picture sufficiently well, therefore our
requirements may come across as coming from a particularly pedantic
legal department; my fault :)

Oniro is not "yet another commercial Yocto project", we are not a legal
department (even if we are experienced FLOSS lawyers and auditors, the
most prominent of whom is Carlo Piana -- cc'ed -- former general counsel
of FSFE and member of OSI Board).

Our rather ambitious goal is not limited to Oniro, and consists in doing
compliance in the open source way and both setting an example and
providing guidance and material for others to benefit from our effort.
Our work will therefore be shared (and possibly improved by others) not
only with Oniro-based projects but also with any Yocto project. Among
other things, the most relevant bit of work that we want to share is
**fully reviewed license information** and other legal metadata about a
whole bunch of open source components commonly used in Yocto projects.
I certainly love the goal. I presume you're going to share your review
criteria somehow? There must be some further set of steps,
documentation and results beyond what we're discussing here?

I think the challenge will be whether you can publish that review with
sufficient "proof" that other legal departments can leverage it. I
wouldn't underestimate how different the requirements and process can
be between different people/teams/companies.

To do that in a **scalable and fully automated way**, we need Yocto to
collect some information that is currently discarded (or simply not
collected) at build time.

The Oniro Project Leader, Davide Ricci (cc'ed), strongly encouraged us
to seek feedback from you in order to find the best way to do it.

Maybe organizing a call would be more convenient than discussing
background and requirements here, if you (and others) are available.
I don't mind having a call but the discussion in this current form may
have an important element we shouldn't overlook, which is that it isn't
just me you need to convince on some of this.

If, for example, we should radically change the unpack/patch process,
we need to have a good explanation for why people need to take that
build time/space/resource hit. If we conclude that on a call, the case
to the wider community would still have to be made.

Experience
with archiver.bbclass shows that multiple codepaths doing these things
is a nightmare to keep working, particularly for corner cases which do
interesting things with the code (externalsrc, gcc shared workdir, the
kernel and more).

I had a look at this and was a bit puzzled by some of it.

I can see the issues you'd have if you want to separate the unpatched
source from the patches and know which files had patches applied, as
that is hard to track. There would be significant overhead in trying
to process and store that information in the unpack/patch steps; the
archiver class does some of that already and it is messy, hard and
doesn't perform well. I'm reluctant to force everyone to do it as a
result, but making it optional also means multiple code paths, and when
you have those, one of them inevitably breaks :(.

I also can see the issue with multiple sources in SRC_URI, although you
should be able to map those back if you assume subtrees are "owned" by
given SRC_URI entries. I suspect there may be a SPDX format limit in
documenting that piece?
I'm replying in reverse order:

- there is a SPDX format limit, but it is by design: a SPDX package
entity is a single sw distribution unit, so it may have only one
downloadLocation; if you have more than one downloadLocation, you must
have more than one SPDX package, according to SPDX specs;
I think we may need to talk to the SPDX people about that as I'm not
convinced it always holds that you can divide software into such units.
Certainly you can construct a situation where there are two
repositories, each containing a source file, and the two files are only
ever linked together into one binary.

- I understand that my solution is a bit hacky; but IMHO any other
*post-mortem* solution would be far more hacky; the real solution
would be collecting required information directly in do_fetch and
do_unpack
Agreed, this needs to be done at unpack/patch time. Don't underestimate
the impact of this on general users though as many won't appreciate
slowing down their builds generating this information :/.

There is also a pile of information some legal departments want which
you've not mentioned here, such as build scripts and configuration
information. Some previous discussions with other parts of the wider
open source community rejected Yocto Projects efforts as insufficient
since we didn't mandate and capture all of this too (the archiver could
optionally do some of it iirc). Is this just the first step and we're
going to continue dumping more data? Or is this sufficient and all any
legal department should need?

- I also understand that we should reduce pain, otherwise nobody would
use our solution; the simplest and cleanest way I can think about is
collecting just package (in the SPDX sense) files' relative paths and
checksums at every stage (fetch, unpack, patch, package), and leave
data processing (i.e. mapping upstream source packages -> recipe's
WORKDIR package -> debug source package -> binary packages -> binary
image) to a separate tool, that may use (just a thought) a graph
database to process things more efficiently.
I'd suggest stepping back and working out whether the SPDX requirement
of a "single download location", which some of this stems from, really
makes sense.

Where I became puzzled is where you say "Information about debug
sources for each actual binary file is then taken from
tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
and use for the spdx class so you shouldn't need to reinvent that
piece. It should be the exact same data the spdx class uses.
you're right, but in the context of a POC it was easier to extract them
directly from json files than from SPDX data :) It's just a POC to show
that required information may be retrieved in some way, implementation
details do not matter
Fair enough, I just want to be clear we don't want to duplicate this.


I was also puzzled about the difference between rpm and the other
package backends. The exact same files are packaged by all the package
backends so the checksums from do_package should be fine.
Here I may miss some piece of information. I looked at files in
tmp/pkgdata but I couldn't find package file checksums anywhere: that is
why I parsed rpm packages. But if such checksums were already available
somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
at all... Could you point me to what I'm (maybe) missing here? Thanks!
In some ways this is quite simple: it is because at do_package time
the output packages don't exist, only their content. The final output
packages are generated in do_package_write_{ipk|deb|rpm}.

You'd probably have to add a stage to the package_write tasks which
wrote out more checksum data, since the checksums are only known at the
end of those tasks. I would question whether adding this additional
checksum into the SPDX output actually helps much in the real world
though. I guess it means you could look an RPM up against its checksum,
but is that something people need to do?

Cheers,

Richard


Alberto Pianon
 

On 2022-09-16 17:49, Mark Hatle wrote:
On 9/16/22 10:18 AM, Alberto Pianon wrote:
... trimmed ...

I also can see the issue with multiple sources in SRC_URI, although you
should be able to map those back if you assume subtrees are "owned" by
given SRC_URI entries. I suspect there may be a SPDX format limit in
documenting that piece?
I'm replying in reverse order:
- there is a SPDX format limit, but it is by design: a SPDX package
entity is a single sw distribution unit, so it may have only one
downloadLocation; if you have more than one downloadLocation, you must
have more than one SPDX package, according to SPDX specs;
I think my interpretation of this is different. I've got a view of
'sourcing materials', and then verifying they are what we think they
are and can be used the way we want. The "upstream sources" (and
patches) are really just 'raw materials' that we use the Yocto Project
to combine to create "the source".
So for the purpose of the SPDX, each upstream source _may_ have a
corresponding SPDX, but for the binaries their source is the combined
unit, not multiple SPDXes. Think of it something like this:
upstream source1 - SPDX
upstream source2 - SPDX
upstream patch
recipe patch1
recipe patch2
In the above, each of those items would be combined by the recipe
system to construct the source used to build an individual recipe (and
collection of packages). Automation _IS_ used to combine the
components [unpack/fetch] and _MAY_ be used to generate a combined
SPDX.
So your "upstream" location for this recipe is the local machine's
source archive. The SPDX for the local recipe files can merge the
SPDX information they know (and if it's at a file level) can use
checksums to identify the items not captured/modified by the patches
for further review (either manual or automation like fossology). In
the case where an upstream has SPDX data, you should be able to
inherit MOST files this way... but the output is specific to your
configuration and patches.
1 - SPDX |
2 - SPDX |
patch    |---> recipe specific SPDX
patch    |
patch    |
In some cases someone may want to generate SPDX data for the 3
patches, but that may or may not be useful in this context.
IMHO it's a matter of different ways of framing Yocto recipes into SPDX
format.

Upstream sources are all SPDX packages. Yocto layers are SPDX packages
too, containing files that are PATCH_FOR the upstream packages.

Upstream sources and Yocto layers are the "final" upstream sources, and
each of them has its own downloadLocation.

"The source" created by a recipe is another SPDX package, GENERATED_FROM
upstream source packages + recipe and patches from Yocto layer
package(s). "The source" may need to be distributed by downstream users
(eg. to comply with *GPL-* obligations or when providing SDKs), so
downstream users may made it available from their own infrastructure,
"giving" it a downloadLocation.

(in SPDX, GENERATED_FROM and PATCH_FOR relationships may be between
files, so one may map files found in "the source" package to individual
files found in upstream source packages)

Binary packages GENERATED_FROM "the source" are local SPDX packages,
too. And firmware images are SPDX packages, too, GENERATED_FROM all the
above. Firmware images are distributed by downstream users, who will
provide their own downloadLocation.
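
Compressed into a sketch (the SPDX identifiers are invented), the
relationship graph would look roughly like:

    # Sketch only: identifiers are invented.
    relationships = [
        ("SPDXRef-layer-patch-0001", "PATCH_FOR",      "SPDXRef-upstream-foo-1.2.3"),
        ("SPDXRef-recipe-foo-src",   "GENERATED_FROM", "SPDXRef-upstream-foo-1.2.3"),
        ("SPDXRef-recipe-foo-src",   "GENERATED_FROM", "SPDXRef-layer-patch-0001"),
        ("SPDXRef-libfoo.so.1",      "GENERATED_FROM", "SPDXRef-recipe-foo-src"),
        ("SPDXRef-firmware-image",   "GENERATED_FROM", "SPDXRef-libfoo.so.1"),
    ]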


- I understand that my solution is a bit hacky; but IMHO any other
*post-mortem* solution would be far more hacky; the real solution
would be collecting required information directly in do_fetch and
do_unpack
I've not looked at the current SPDX spec, but past versions has a
notes section. Assuming this is still present you can use it to
reference back to how this component was constructed and the upstream
source URIs (and SPDX files) you used for processing.
This way nothing really changes in do_fetch or do_unpack. (You may
want to find a way to capture file checksums and what the source was
for a particular file.. but it may not really be necessary!)
If you want to automatically map all files to their corresponding
upstream sources, it actually is... see my next point


- I also understand that we should reduce pain, otherwise nobody would
use our solution; the simplest and cleanest way I can think about is
collecting just package (in the SPDX sense) files' relative paths and
checksums at every stage (fetch, unpack, patch, package), and leave
data processing (i.e. mapping upstream source packages -> recipe's
WORKDIR package -> debug source package -> binary packages -> binary
image) to a separate tool, that may use (just a thought) a graph
database to process things more efficiently.
Even in do_patch nothing really changes, other than, again, you may want
to capture checksums to identify things that need further processing.
This approach greatly simplifies things, and gives people doing code
reviews the insight into what is the source used when shipping the
binaries (which is really an important aspect of this), as well as
which recipe and "build" (really fetch/unpack/patch) were used to
construct the sources. If they want to investigate the sources
further back to their provider, then the notes would have the
information for that, and you could transition back to the "raw
materials" providers.
The point is precisely that we would like to help people avoid doing
this job, because if you scale up to n different Yocto projects it would
be a time-consuming, error-prone and hardly maintainable process. Since
SPDX allows representing relationships between any kind of entities
(files, packages), we would like to use that feature to map local source
files to upstream source files, so machines may do the job instead of
people -- and people (auditors) may concentrate on reviewing upstream
sources -- i.e. the atomic ingredients used across different projects or
across different versions of the same project.



Where I became puzzled is where you say "Information about debug
sources for each actual binary file is then taken from
tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added
and use for the spdx class so you shouldn't need to reinvent that
piece. It should be the exact same data the spdx class uses.
you're right, but in the context of a POC it was easier to extract them
directly from json files than from SPDX data :) It's just a POC to show
that required information may be retrieved in some way, implementation
details do not matter

I was also puzzled about the difference between rpm and the other
package backends. The exact same files are packaged by all the package
backends so the checksums from do_package should be fine.
Here I may miss some piece of information. I looked at files in
tmp/pkgdata but I couldn't find package file checksums anywhere: that is
why I parsed rpm packages. But if such checksums were already available
somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages
at all... Could you point me to what I'm (maybe) missing here? Thanks!
file checksumming is expensive. There are checksums available to
individual packaging engines, as well as aggregate checksums for "hash
equivalency".. but I'm not aware of any per-file checksum that is
stored.
You definitely shouldn't be parsing packages of any type (rpm or
otherwise), as packages are truly optional. It's the binaries that
matter here.
You are definitely right. I guess it should be done (optionally) in
do_package.


Richard Purdie
 

On Mon, 2022-09-19 at 16:20 +0200, Carlo Piana wrote:
thank you for a well-detailed and sensible answer. I certainly cannot
speak to technical issues, although I can understand there are
activities which could seriously impact the overall process and need
to be minimized.


On Fri, 2022-09-16 at 17:18 +0200, Alberto Pianon wrote:
On 2022-09-15 14:16, Richard Purdie wrote:

For the source issues above, it basically comes down to how much
"pain" we want to push onto all users for the sake of adding in this
data. Unfortunately it is data which many won't need or use, and
different legal departments do have different requirements.
We didn't paint the overall picture sufficiently well, therefore our
requirements may come across as coming from a particularly pedantic
legal department; my fault :)

Oniro is not "yet another commercial Yocto project", we are not a legal
department (even if we are experienced FLOSS lawyers and auditors, the
most prominent of whom is Carlo Piana -- cc'ed -- former general counsel
of FSFE and member of OSI Board).

Our rather ambitious goal is not limited to Oniro, and consists in doing
compliance in the open source way and both setting an example and
providing guidance and material for others to benefit from our effort.
Our work will therefore be shared (and possibly improved by others) not
only with Oniro-based projects but also with any Yocto project. Among
other things, the most relevant bit of work that we want to share is
**fully reviewed license information** and other legal metadata about a
whole bunch of open source components commonly used in Yocto projects.
I certainly love the goal. I presume you're going to share your review
criteria somehow? There must be some further set of steps,
documentation and results beyond what we're discussing here?
Our mandate (and our own attitude) is precisely to make everything as
public as possible.

We have already published about it:
https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/docs/-/tree/main/audit_workflow

The entire review process is done using GitLab issues and will be
made public.
I need to read into the details but that looks like a great start and
I'm happy to see the process being documented!

Thanks for the link, I'll try and have a read.

We have only one reservation concerning sensitive material,
in case we find something legally problematic (to comply with
attorney/client privilege) or security-critical (in which case we
adopt a responsible disclosure principle and embargo some details).
That makes sense, it is a tricky balancing act at times.

I think the challenge will be whether you can publish that review with
sufficient "proof" that other legal departments can leverage it. I
wouldn't underestimate how different the requirements and process can
be between different people/teams/companies.
Speaking from a legal perspective, this is precisely the point. It is
true that we want to create a curated database of decisions, which as
any human enterprise is prone to errors and corrections, and therefore
we cannot have the last word. However, IF we can at least point to a
unique artifact and give its exact hash, there will be no need to
trust us: it would be open to inspection, because everybody else
could look at the same source we have identified and make sure we
have extracted all the information.
I do love the idea and I think it is quite possible. I do think this
does lead to one of the key details we need to think about though.

From a legal perspective I'd imagine you like dealing with a set of
files that make up the source of some piece of software. I'm not going
to use the word "package" since I think the term is overloaded and
confusing. That set of files can all be identified by checksums. This
pushes us towards wanting checksums of every file.

Stepping over to the build world, we have bitbake's fetcher and it
actually requires something similar - any given "input" must be
uniquely identifiable from the SRC_URI and possibly a set of SRCREVs.

Why? We firstly need to embed this information into the task signature.
If it changes, we know we need to rerun the fetch and re-obtain the
data. We work on inputs to generate this hash, not outputs, and we
require all fetcher modules to be able to identify sources like this.

In the case of a git repo, the hash of a git commit is good enough. For
a tarball, it would be a checksum of the tarball. Where there are
patches or other local files, we include the hashes of those files.

The bottom line is that we already have a hash which represents the
task inputs. Bugs happen, sure. There are also poor fetchers; npm and
go present challenges in particular, but we've tried to work around
those issues.

What you're saying is that you don't trust what bitbake does, so you
want all the next level of information about the individual files.

In theory we could put the SRC_URI and SRCREVs into the SPDX as the
source (which could be summarised into a task hash) rather than the
upstream url. It all depends which level you want to break things down
to.

I do see a case for needing the lower level info, since in review you
are going to want to know the delta against the last review decisions.
You would also prefer having a different "upstream" url form for some
kinds of checks, like CVEs. It does feel a lot like we're trying to
duplicate information and cause significant growth of the SPDX files
without an actual definitive need.

You could equally put in a mapping between a fetch task checksum and
the checksums of all the files that fetch task would expand to if run
(it should always do so deterministically).
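
Purely as an illustration (the values and layout are invented), such a
mapping could look like:

    # Illustration only: the hash values are invented.
    fetch_unit_manifest = {
        "src_uri": "git://example.org/foo.git;branch=main;protocol=https",
        "srcrev": "0123456789abcdef0123456789abcdef01234567",
        "fetch_task_signature": "aa11bb22",      # bitbake's input hash for the fetch
        "expanded_files": {
            "foo/src/bar.c": "sha256-of-bar.c",
            "foo/src/baz.h": "sha256-of-baz.h",
        },
    }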

To be clearer, we are not discussing here the obligation to provide
the entire corresponding source code as with *GPLv3, but rather we
are seeking to establish the *provenance* of the software, of all
bits (also in order to see what patch has been applied by who and to
close which vulnerability, in case).
My worry is that by not considering the obligation, we don't cater for
a portion of the userbase and by doing so, we limit the possible
adoption.

Provenance also has a great impact on "reproducibility" of legal work
on sources. If we are not able to tell what has gone into our package
from where (and this may prove hard and require a lot of manual, and
therefore error-prone, work, especially in the case of complex Yocto
recipes using e.g. crate/cargo or npm(sw) fetchers), we (lawyers and
compliance specialists) are at a great disadvantage proving we have
covered all our bases.
I understand this more than you realise as we have the same problem in
the bitbake fetcher and have spent a lot of time trying to solve it. I
won't claim we're there for some of the modern runtimes and I'd love
help in both explaining to the upstream projects why we need this and
help to technically fix the fetchers so these modern runtimes work
better.

This is a very good point, and I can vouch that this is really
important, but maybe you are reading too much into this: at this stage,
our goal is not to convince anyone to radically change Yocto tasks to
meet our requirements, but to share such requirements and their
rationale, collect your feedback and possibly adjust them, and also
to figure out the least impactful solution to meet them (possibly
without radical changes but just by adding optional functions to
existing tasks).
"optional functions" fill me with dread, this is the archiver problem I
mentioned.

One of the things I try really hard to do is to have one good way of
doing things rather than multiple options with different levels of
functionality. If you give people choices, they use them. When
someone's build fails, I don't want to have to ask "which fetcher were
you using? Did you configure X or Y or Z?". If we can all use the same
code and codepaths, it means we see bugs, we see regressions and we
have a common experience without the need for complex test matrices.

Worst case you can add optional functions but I kind of see that as a
failure. If we can find something with low overhead which we can all
use, that would be much better. Whether it is possible, I don't know
but it is why we're having the discussion. This is why I have a
preference for trying to keep common code paths for the core though.

- I understand that my solution is a bit hacky; but IMHO any other
*post-mortem* solution would be far more hacky; the real solution
would be collecting required information directly in do_fetch and
do_unpack
Agreed, this needs to be done at unpack/patch time. Don't underestimate
the impact of this on general users though as many won't appreciate
slowing down their builds generating this information :/.
Can't this be made optional, so one could just go for the "old" way
without much impact? Sorry, I'm stepping in where I'm naive.
See above :).



There is also a pile of information some legal departments want which
you've not mentioned here, such as build scripts and configuration
information. Some previous discussions with other parts of the wider
open source community rejected Yocto Projects efforts as insufficient
since we didn't mandate and capture all of this too (the archiver could
optionally do some of it iirc). Is this just the first step and we're
going to continue dumping more data? Or is this sufficient and all any
legal department should need?
I think that trying to give all legal departments what they want
would prove impossible. I think the idea here is more to start
building a collectively managed database of provenance and licensing
data, with a curated set of decisions for as many packages as
possible. This way everybody can have a good clue -- and
increasingly a better one -- as to which license(s) apply to which
package, removing much of the guesswork that is required today.
It makes sense and is a worthy goal. I just wish we could key this off
bitbake's fetch task checksum rather than having to dump reams of file
checksums!

We ourselves reuse a lot of Debian's machine-readable licensing
information, sometimes finding mistakes and opening issues
upstream. That has helped us cut down the license harvesting and
review work by a great deal.
This does explain why the bitbake fetch mechanism would be a struggle
for you though as you don't want to use our fetch units as your base
component (which is why we end up struggling with some of the issues).

In the interests of moving towards a conclusion, I think what we'll end
up needing to do is generate more information from the fetch and patch
tasks, perhaps with a json file summary of what they do (filenames and
checksums?). That would give your tools the data they need, even if I'm
not convinced we should be dumping more and more data into the final
SPDX files.

Cheers,

Richard