Adding more information to the SBOM
Marta Rybczynska
Dear all,
(cross-posting to oe-core and *-architecture)

In the last months, we have worked in Oniro on using the create-spdx class for both IP compliance and security. During this work, Alberto Pianon found that some information is missing from the SBOM and that it does not contain enough for Software Composition Analysis. The main missing point is the relation between the actual upstream sources and the final binaries (create-spdx uses composite sources).

Alberto has worked on how to obtain the missing data and now has a POC. This POC provides full source-to-binary tracking of Yocto builds through a couple of scripts (intended to be transformed into a new bbclass at a later stage).

The goal is to add the missing pieces of information in order to get a "real" SBOM from Yocto, which should, at a minimum:

- carefully describe what is found in a final image (i.e. binary files and their dependencies), since that is what is actually distributed and goes into the final product;
- describe how such binary files have been generated and where they come from (i.e. upstream sources, including patches and other stuff added from meta-layers); provenance is important for a number of reasons related to IP compliance and security.

The aim is to become able to:

- map binaries to their corresponding upstream source packages (and not to the "internal" source packages created by recipes by combining multiple upstream sources and patches);
- map binaries to the source files that have actually been used to build them, which usually are a small subset of the whole source package.

With respect to IP compliance, this would allow us to, among other things:

- get the real license text for each binary file, by getting the license of the specific source files it has been generated from (provided by Fossology, for instance), and not the main license stated in the corresponding recipe (which may be as confusing as GPL-2.0-or-later & LGPL-2.1-or-later & BSD-3-Clause & BSD-4-Clause, or even worse);
- automatically check license incompatibilities at the binary file level.

Other interesting things could also be done on the security side.

This work intends to add a way to provide additional data that can be used by create-spdx, not to replace create-spdx in any way.

The sources, with a long README, are available at https://gitlab.eclipse.org/eclipse/oniro-compliancetoolchain/toolchain/tinfoilhat/-/tree/srctracker/srctracker

What do you think of this work? Would it be of interest to integrate into YP at some point? Shall we discuss this?

Marta and Alberto
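A minimal sketch (not part of the original POC; file names and data layout are invented for illustration) of the kind of source-to-binary mapping described above: given the list of source files a binary was compiled from, and a manifest recording which upstream archive provided each unpacked file, resolve the binary back to its upstream packages.

    import json

    def map_binary_to_upstream(debug_sources, upstream_manifest):
        """debug_sources: source file paths used to build one binary.
        upstream_manifest: dict mapping source file path -> upstream package id
        (e.g. built up while unpacking each SRC_URI entry).
        Returns the subset of upstream packages the binary was really built from."""
        used = {}
        for src in debug_sources:
            pkg = upstream_manifest.get(src)
            if pkg:
                used.setdefault(pkg, []).append(src)
        return used

    if __name__ == "__main__":
        # Illustrative data only
        manifest = {
            "busybox-1.35.0/networking/wget.c": "busybox-1.35.0.tar.bz2",
            "busybox-1.35.0/libbb/xfuncs.c": "busybox-1.35.0.tar.bz2",
        }
        dbg = ["busybox-1.35.0/networking/wget.c"]
        print(json.dumps(map_binary_to_upstream(dbg, manifest), indent=2))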
Joshua Watt
On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@...> wrote:
I believe we map the binaries to the source code from the -dbg packages; is the premise that this is insufficient? Can you elaborate more on why that is? I don't quite understand. The debug sources are (basically) what we actually compiled (e.g. post-do_patch) to produce the binary, and you can in turn follow these back to the upstream sources with the downloadLocation property.

Please be a little careful with the wording; SBoMs have a lot of uses, and many of them we can satisfy with what we currently generate; it may not do the exact use case you are looking for, but that doesn't mean it's not a "real" SBoM :)

IIUC this is the difference between the "Declared" license and the "Concluded" license. You can report both, and I think create-spdx.bbclass can currently do this with its rudimentary source license scanning. You really do want both, and it's a great way to make sure that the "Declared" license (that is, the license in the recipe) reflects the reality of the source code.

> - automatically check license incompatibilities at the binary file level.

This seems promising as something that could potentially move into core.

I have a few points:

- The extraction of the sources to a dedicated directory is something that Richard has been toying around with for quite a while, and I think it would greatly simplify that part of your process. I would very much encourage you to look at the work he's done, and work on that to get it pushed across the finish line as it's a really good improvement that would benefit not just your source scanning.
- I would encourage you to not wait to turn this into a bbclass and/or library functions. You should be able to do this in a new layer, and that would make it much clearer as to what the path to being included in OE-core would look like. It also would (IMHO) be nicer to the users :)
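A sketch of following the chain Joshua describes through the SPDX documents that create-spdx emits: index elements by SPDXID, walk GENERATED_FROM relationships from a binary file towards its debug sources, and read the downloadLocation of the packages in the document. The field names follow the SPDX 2.x JSON schema; the document path and the SPDXIDs are made-up examples.

    import json

    def index_spdx(doc):
        """Index packages and files of an SPDX JSON document by SPDXID."""
        return {e["SPDXID"]: e
                for e in doc.get("packages", []) + doc.get("files", [])}

    def generated_from(doc, spdxid):
        """Return the SPDXIDs this element is declared GENERATED_FROM."""
        return [r["relatedSpdxElement"]
                for r in doc.get("relationships", [])
                if r["spdxElementId"] == spdxid
                and r["relationshipType"] == "GENERATED_FROM"]

    if __name__ == "__main__":
        # Example path; create-spdx writes per-recipe documents into the deploy area.
        with open("busybox.spdx.json") as f:
            doc = json.load(f)
        elems = index_spdx(doc)
        binary_id = "SPDXRef-PackagedFile-busybox-0"  # made-up SPDXID
        for src_id in generated_from(doc, binary_id):
            src = elems.get(src_id, {})
            print(src.get("fileName", src_id))
        for pkg in doc.get("packages", []):
            print(pkg["name"], pkg.get("downloadLocation"))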
Mark Hatle
On 9/14/22 9:56 AM, Joshua Watt wrote:
> On Wed, Sep 14, 2022 at 9:16 AM Marta Rybczynska <rybczynska@...> wrote:
> I believe we map the binaries to the source code from the -dbg

When I last looked at this, it was critical that the analysis be:

  binary -> patched & configured source (dbg package) -> how the sources were constructed

As Joshua said above. I believe all of the information is present for this, as you can tie the binary (through debug symbols) back to the debug package, and the source of the debug package back to the sources that constructed it via heuristics. (If you enable the git patch mechanism, it should even be possible to use git blame to find exactly what upstreams constructed the patched sources. For generated content it's more difficult -- but for those items there is usually a header which indicates what generated the content, so other heuristics can be used.

Full compliance will require binaries mapped to patched source to upstream sources _AND_ the instructions (layer/recipe/configuration) used to build them. But it's up to the local legal determination to figure out 'how far you really need to go', vs just "here are the layers I used to build my project".)

> Please be a little careful with the wording; SBoMs have a lot of uses,

And the thing to keep in mind is that in a given package the "Declared" license is usually what a LICENSE file or header says. But the "Concluded" license has levels of quality behind it. The first level of quality is "Declared". The next level is automation (something like fossology), the next level is human reviewed, and the highest level is "lawyer reviewed".

>> The aim is to become able to:
> IIUC this is the difference between the "Declared" license and the

So being able to inject SPDX information with Concluded values for evaluation, and track the 'quality level', has always been something I wanted to do but never had time for. At the time, my idea was a database (and/or bbappend) for each component that would include pre-processed SPDX data for each recipe. This data would run through a validation step to show it actually matches the patched sources. (If any file checksums do NOT match, then they would be flagged for follow up.)

Agreed, this looks useful. The key is to start turning it into one or more bbclasses now -- things that work with the Yocto Project process. Don't try to "post-process" and reconstruct sources. Instead, inject steps that will run your file checksums and build up your database as the sources are constructed (i.e. do_unpack, do_patch.. etc.)

>> - automatically check license incompatibilities at the binary file level.
> This seems promising as something that could potentially move into

The key is, all of the information IS available. It just may not be in the format you want.

--Mark
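A sketch of the git-based heuristic Mark mentions, assuming the recipe was patched with PATCHTOOL = "git" so the unpacked source tree is a git repository whose first commit is the upstream import and whose later commits are the applied patches; it lists, per file, which patch commits touched it. The repository path is a placeholder.

    import subprocess

    def patch_commits_per_file(repo):
        """Map each file in a git-managed source tree to the patch commits
        (everything after the initial import commit) that modified it."""
        commits = subprocess.run(
            ["git", "-C", repo, "rev-list", "--reverse", "HEAD"],
            capture_output=True, text=True, check=True).stdout.split()
        import_commit, patch_commits = commits[0], commits[1:]
        touched = {}
        for commit in patch_commits:
            out = subprocess.run(
                ["git", "-C", repo, "show", "--name-only", "--format=", commit],
                capture_output=True, text=True, check=True).stdout
            for path in filter(None, out.splitlines()):
                touched.setdefault(path, []).append(commit)
        return import_commit, touched

    if __name__ == "__main__":
        # Placeholder path to a WORKDIR source tree patched with PATCHTOOL = "git"
        imported, touched = patch_commits_per_file("/path/to/patched/source/tree")
        for path, commits in sorted(touched.items()):
            print(path, "modified by", len(commits), "patch commit(s)")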
Richard Purdie
On Wed, 2022-09-14 at 16:16 +0200, Marta Rybczynska wrote:
> The sources with a long README are available at

I had a look at this and was a bit puzzled by some of it.

I can see the issues you'd have if you want to separate the unpatched source from the patches and know which files had patches applied, as that is hard to track. There would be significant overhead in trying to process and store that information in the unpack/patch steps, and the archiver class does some of that already. It is messy, hard and doesn't perform well. I'm reluctant to force everyone to do it as a result, but that can also result in multiple code paths, and when you have that, the result is that one breaks :(.

I also can see the issue with multiple sources in SRC_URI, although you should be able to map those back if you assume subtrees are "owned" by given SRC_URI entries. I suspect there may be a SPDX format limit in documenting that piece?

Where I became puzzled is where you say "Information about debug sources for each actual binary file is then taken from tmp/pkgdata/<machine>/extended/*.json.zstd". This is the data we added and use for the spdx class, so you shouldn't need to reinvent that piece. It should be the exact same data the spdx class uses.

I was also puzzled about the difference between rpm and the other package backends. The exact same files are packaged by all the package backends, so the checksums from do_package should be fine.

For the source issues above, it basically comes down to how much "pain" we want to push onto all users for the sake of adding in this data. Unfortunately it is data which many won't need or use, and different legal departments do have different requirements. Experience with archiver.bbclass shows that multiple codepaths doing these things are a nightmare to keep working, particularly for corner cases which do interesting things with the code (externalsrc, gcc shared workdir, the kernel and more).

Cheers,

Richard
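A sketch of the subtree-ownership idea Richard mentions: assign each file in the unpacked tree to the SRC_URI entry whose unpack destination is its longest matching path prefix. The SRC_URI entries and destination subdirectories below are illustrative; real ones would come from the recipe's metadata.

    def owner_by_subtree(src_uri_destinations, file_path):
        """src_uri_destinations: dict mapping a SRC_URI entry to the subdirectory
        it unpacks into (relative to the source tree). Returns the entry whose
        destination is the longest prefix of file_path, or None."""
        best, best_len = None, -1
        for uri, dest in src_uri_destinations.items():
            dest = dest.rstrip("/")
            if dest == "" or file_path == dest or file_path.startswith(dest + "/"):
                if len(dest) > best_len:
                    best, best_len = uri, len(dest)
        return best

    if __name__ == "__main__":
        # Illustrative mapping only
        destinations = {
            "https://example.org/foo-1.0.tar.gz": "",
            "git://example.org/extra-module.git": "extra-module",
            "file://0001-fix-build.patch": "patches",
        }
        for f in ["src/main.c", "extra-module/mod.c"]:
            print(f, "->", owner_by_subtree(destinations, f))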
Alberto Pianon
Hi Richard,
Thank you for your reply, you gave me very interesting cues to think about. I'll reply in reverse/importance order.

On 2022-09-15 14:16, Richard Purdie wrote:
> For the source issues above it basically it comes down to how much

We didn't paint the overall picture sufficiently well, therefore our requirements may come across as coming from a particularly pedantic legal department; my fault :)

Oniro is not "yet another commercial Yocto project", and we are not a legal department (even if we are experienced FLOSS lawyers and auditors, the most prominent of whom is Carlo Piana -- cc'ed -- former general counsel of FSFE and member of the OSI Board). Our rather ambitious goal is not limited to Oniro: it consists in doing compliance in the open source way, both setting an example and providing guidance and material for others to benefit from our effort. Our work will therefore be shared (and possibly improved by others) not only with Oniro-based projects but also with any Yocto project.

Among other things, the most relevant bit of work that we want to share is **fully reviewed license information** and other legal metadata about a whole bunch of open source components commonly used in Yocto projects. To do that in a **scalable and fully automated way**, we need Yocto to collect some information that is currently disposed of (or simply not collected) at build time. The Oniro Project Leader, Davide Ricci -- cc'ed -- strongly encouraged us to seek feedback from you in order to find out the best way to do it. Maybe organizing a call would be more convenient than discussing background and requirements here, if you (and others) are available.

> Experience

I'm replying in reverse order:

- there is a SPDX format limit, but it is by design: a SPDX package entity is a single software distribution unit, so it may have only one downloadLocation; if you have more than one downloadLocation, you must have more than one SPDX package, according to the SPDX specs;
- I understand that my solution is a bit hacky; but IMHO any other *post-mortem* solution would be far more hacky; the real solution would be collecting the required information directly in do_fetch and do_unpack;
- I also understand that we should reduce pain, otherwise nobody would use our solution; the simplest and cleanest way I can think of is collecting just package (in the SPDX sense) files' relative paths and checksums at every stage (fetch, unpack, patch, package), and leaving data processing (i.e. mapping upstream source packages -> recipe's WORKDIR package -> debug source package -> binary packages -> binary image) to a separate tool, which may use (just a thought) a graph database to process things more efficiently.

> Where I became puzzled is where you say "Information about debug

You're right, but in the context of a POC it was easier to extract them directly from json files than from SPDX data :) It's just a POC to show that the required information may be retrieved in some way; implementation details do not matter.

> I was also puzzled about the difference between rpm and the other

Here I may be missing some piece of information. I looked at the files in tmp/pkgdata but I couldn't find package file checksums anywhere: that is why I parsed rpm packages. But if such checksums were already available somewhere in tmp/pkgdata, it wouldn't be necessary to parse rpm packages at all... Could you point me to what I'm (maybe) missing here? Thanks!

In any case, thank you so much for all your insights, they were super-useful!

Cheers,

Alberto
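A sketch of the per-stage collection Alberto proposes (relative paths plus checksums at fetch/unpack/patch/package time, with all cross-referencing left to a later tool). The function itself is generic; wiring it up as, say, a postfunc of do_unpack and do_patch in a bbclass is an assumption about how it could be integrated, not existing create-spdx behaviour.

    import hashlib
    import json
    import os

    def collect_manifest(root, output_json, stage):
        """Walk 'root' and record {relative path: sha256} plus the stage name,
        so a separate tool can later diff the stages and map files across them."""
        entries = {}
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue
                h = hashlib.sha256()
                with open(path, "rb") as f:
                    for chunk in iter(lambda: f.read(65536), b""):
                        h.update(chunk)
                entries[os.path.relpath(path, root)] = h.hexdigest()
        with open(output_json, "w") as f:
            json.dump({"stage": stage, "files": entries}, f, indent=2, sort_keys=True)

    if __name__ == "__main__":
        # Illustrative invocation; in a bbclass this might run once after do_unpack
        # and again after do_patch, writing into a recipe-specific directory.
        collect_manifest("build/tmp/work/example/src", "manifest-unpack.json", "unpack")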
Mark Hatle
On 9/16/22 10:18 AM, Alberto Pianon wrote:
... trimmed ...

I think my interpretation of this is different. I've got a view of 'sourcing materials', and then verifying they are what we think they are and can be used the way we want. The "upstream sources" (and patches) are really just 'raw materials' that we use the Yocto Project to combine to create "the source".

>> I also can see the issue with multiple sources in SRC_URI, although you
> I'm replying in reverse order:

So for the purpose of the SPDX, each upstream source _may_ have a corresponding SPDX, but for the binaries their source is the combined unit.. not multiple SPDXes. Think of it something like:

  upstream source1 - SPDX
  upstream source2 - SPDX
  upstream patch
  recipe patch1
  recipe patch2

In the above, each of those items would be combined by the recipe system to construct the source used to build an individual recipe (and collection of packages). Automation _IS_ used to combine the components [unpack/fetch] and _MAY_ be used to generate a combined SPDX.

So your "upstream" location for this recipe is the local machine's source archive. The SPDX for the local recipe files can merge the SPDX information they know and (if it's at a file level) can use checksums to identify the items not captured/modified by the patches for further review (either manual or automation like fossology). In the case where an upstream has SPDX data, you should be able to inherit MOST files this way... but the output is specific to your configuration and patches.

  1 - SPDX |
  2 - SPDX |
  patch    |---> recipe specific SPDX
  patch    |
  patch    |

In some cases someone may want to generate SPDX data for the 3 patches, but that may or may not be useful in this context.

> - I understand that my solution is a bit hacky; but IMHO any other

I've not looked at the current SPDX spec, but past versions had a notes section. Assuming this is still present, you can use it to reference back to how this component was constructed and the upstream source URIs (and SPDX files) you used for processing. This way nothing really changes in do_fetch or do_unpack. (You may want to find a way to capture file checksums and what the source was for a particular file.. but it may not really be necessary!)

> - I also understand that we should reduce pain, otherwise nobody would

Even in do_patch nothing really changes, other than again you may want to capture checksums to identify things that need further processing.

This approach greatly simplifies things, and gives people doing code reviews insight into what is the source used when shipping the binaries (which is really an important aspect of this), as well as which recipe and "build" (really fetch/unpack/patch) were used to construct the sources. If they want to investigate the sources further back to their provider, then the notes would have the information for that, and you could transition back to the "raw materials" providers.

File checksumming is expensive. There are checksums available to individual packaging engines, as well as aggregate checksums for "hash equivalency".. but I'm not aware of any per-file checksum that is stored.

> you're right, but in the context of a POC it was easier to extract them

You definitely shouldn't be parsing packages of any type (rpm or otherwise), as packages are truly optional. It's the binaries that matter here.

--Mark
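A sketch of the checksum comparison Mark suggests: diff a pre-patch manifest against a post-patch manifest (such as the ones written by the collection step sketched earlier), so unmodified files can inherit upstream SPDX/license conclusions and modified or added files are flagged for further review. The manifest file names are illustrative.

    import json

    def classify_files(pre_manifest, post_manifest):
        """Compare {path: sha256} maps from before and after do_patch.
        Unchanged files can inherit upstream license conclusions; changed or
        added files need review (manual, or via a scanner such as Fossology)."""
        with open(pre_manifest) as f:
            pre = json.load(f)["files"]
        with open(post_manifest) as f:
            post = json.load(f)["files"]
        unchanged = {p for p, c in post.items() if pre.get(p) == c}
        modified = {p for p, c in post.items() if p in pre and pre[p] != c}
        added = set(post) - set(pre)
        removed = set(pre) - set(post)
        return unchanged, modified, added, removed

    if __name__ == "__main__":
        unchanged, modified, added, removed = classify_files(
            "manifest-unpack.json", "manifest-patch.json")
        print(len(unchanged), "files can inherit upstream conclusions")
        print(sorted(modified | added), "need review")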
Richard Purdie
On Fri, 2022-09-16 at 17:18 +0200, Alberto Pianon wrote:
>> On 2022-09-15 14:16, Richard Purdie wrote:
> We didn't paint the overall picture sufficiently well, therefore our

I certainly love the goal. I presume you're going to share your review criteria somehow? There must be some further set of steps, documentation and results beyond what we're discussing here?

I think the challenge will be whether you can publish that review with sufficient "proof" that other legal departments can leverage it. I wouldn't underestimate how different the requirements and process can be between different people/teams/companies.

> To do that in a **scalable and fully automated way**, we need that Yocto

I don't mind having a call, but the discussion in this current form may have an important element we shouldn't overlook, which is that it isn't just me you need to convince on some of this. If, for example, we should radically change the unpack/patch process, we need to have a good explanation for why people need to take that build time/space/resource hit. If we conclude that on a call, the case to the wider community would still have to be made.

>> Experience
> I'm replying in reverse order:

I think we may need to talk to the SPDX people about that as I'm not convinced it always holds that you can divide software into such units. Certainly you can construct a situation where there are two repositories, each containing a source file, which are only ever linked together as one binary.

> - I understand that my solution is a bit hacky; but IMHO any other

Agreed, this needs to be done at unpack/patch time. Don't underestimate the impact of this on general users though, as many won't appreciate slowing down their builds generating this information :/.

There is also a pile of information some legal departments want which you've not mentioned here, such as build scripts and configuration information. Some previous discussions with other parts of the wider open source community rejected the Yocto Project's efforts as insufficient since we didn't mandate and capture all of this too (the archiver could optionally do some of it iirc). Is this just the first step and we're going to continue dumping more data? Or is this sufficient and all any legal department should need?

> - I also understand that we should reduce pain, otherwise nobody would

I'd suggest stepping back and working out whether the SPDX requirement of a "single download location", which some of this stems from, really makes sense.

>> Where I became puzzled is where you say "Information about debug
> you're right, but in the context of a POC it was easier to extract them

Fair enough, I just want to be clear we don't want to duplicate this.

>> I was also puzzled about the difference between rpm and the other
> Here I may miss some piece of information. I looked at files in

In some ways this is quite simple: it is because at do_package time, the output packages don't exist, only their content. The final output packages are generated in do_package_write_{ipk|deb|rpm}. You'd probably have to add a stage to the package_write tasks which wrote out more checksum data since the checksums are only known at the end of those tasks.

I would question whether adding this additional checksum into the SPDX output actually helps much in the real world though. I guess it means you could look an RPM up against its checksum, but is that something people need to do?

Cheers,

Richard
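A sketch of the extra stage Richard describes for the package_write tasks, assuming we simply checksum whatever .rpm/.ipk/.deb files end up in the deploy directory once those tasks have finished; the deploy path and output file name are illustrative.

    import glob
    import hashlib
    import json
    import os

    def record_package_checksums(deploy_dir, output_json):
        """Checksum the final binary packages (rpm/ipk/deb) after the
        do_package_write_* tasks have produced them."""
        result = {}
        for pattern in ("**/*.rpm", "**/*.ipk", "**/*.deb"):
            for path in glob.glob(os.path.join(deploy_dir, pattern), recursive=True):
                with open(path, "rb") as f:
                    result[os.path.basename(path)] = hashlib.sha256(f.read()).hexdigest()
        with open(output_json, "w") as f:
            json.dump(result, f, indent=2, sort_keys=True)

    if __name__ == "__main__":
        # Illustrative paths; in a bbclass this could run as a postfunc of
        # do_package_write_rpm/ipk/deb for each recipe.
        record_package_checksums("tmp/deploy/rpm", "package-checksums.json")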
Alberto Pianon
On 2022-09-16 17:49, Mark Hatle wrote:
> On 9/16/22 10:18 AM, Alberto Pianon wrote:

IMHO it's a matter of different ways of framing Yocto recipes into the SPDX format.

Upstream sources are all SPDX packages. Yocto layers are SPDX packages, too, containing some PATCH_FOR upstream packages. Upstream sources and Yocto layers are the "final" upstream sources, and each of them has its downloadLocation.

"The source" created by a recipe is another SPDX package, GENERATED_FROM upstream source packages + recipe and patches from Yocto layer package(s). "The source" may need to be distributed by downstream users (e.g. to comply with *GPL-* obligations or when providing SDKs), so downstream users may make it available from their own infrastructure, "giving" it a downloadLocation. (In SPDX, GENERATED_FROM and PATCH_FOR relationships may be between files, so one may map files found in "the source" package to individual files found in upstream source packages.)

Binary packages GENERATED_FROM "the source" are local SPDX packages, too. And firmware images are SPDX packages, too, GENERATED_FROM all the above. Firmware images are distributed by downstream users, who will provide their own downloadLocation.

>> - I understand that my solution is a bit hacky; but IMHO any other
> I've not looked at the current SPDX spec, but past versions has a

If you want to automatically map all files to their corresponding upstream sources, it actually is... see my next point.

>> - I also understand that we should reduce pain, otherwise nobody would
> Even in do_patch nothing really changes, other than again you may want

The point is precisely that we would like to help people avoid doing this job, because if you scale up to n different Yocto projects it would be a time-consuming, error-prone and hardly maintainable process. Since SPDX allows representing relationships between any kind of entities (files, packages), we would like to use that feature to map local source files to upstream source files, so machines may do the job instead of people -- and people (auditors) may concentrate on reviewing upstream sources -- i.e. the atomic ingredients used across different projects or across different versions of the same project.

>>> Where I became puzzled is where you say "Information about debug
>> you're right, but in the context of a POC it was easier to extract them
> File checksumming is expensive. There are checksums available to

You are definitely right. I guess that it should be done (optionally) in do_package.
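A sketch of the relationship structure Alberto outlines, expressed as a minimal SPDX-2.x-style JSON fragment built in Python. The package names, SPDXIDs and download locations are invented for illustration, and a real document would also need files, checksums and the other mandatory SPDX fields.

    import json

    # Invented identifiers, just to show the shape of the relationships
    doc = {
        "packages": [
            {"SPDXID": "SPDXRef-upstream-busybox", "name": "busybox-1.35.0",
             "downloadLocation": "https://busybox.net/downloads/busybox-1.35.0.tar.bz2"},
            {"SPDXID": "SPDXRef-layer-meta-example", "name": "meta-example",
             "downloadLocation": "git://git.example.org/meta-example"},
            {"SPDXID": "SPDXRef-recipe-source", "name": "busybox-workdir-source",
             "downloadLocation": "NOASSERTION"},
            {"SPDXID": "SPDXRef-binary-package", "name": "busybox-ipk",
             "downloadLocation": "NOASSERTION"},
            {"SPDXID": "SPDXRef-image", "name": "example-image",
             "downloadLocation": "NOASSERTION"},
        ],
        "relationships": [
            # "the source" is generated from the upstream tarball plus layer content
            {"spdxElementId": "SPDXRef-recipe-source",
             "relationshipType": "GENERATED_FROM",
             "relatedSpdxElement": "SPDXRef-upstream-busybox"},
            # simplified: strictly, the individual patch files inside the layer
            # would each be PATCH_FOR the upstream package
            {"spdxElementId": "SPDXRef-layer-meta-example",
             "relationshipType": "PATCH_FOR",
             "relatedSpdxElement": "SPDXRef-upstream-busybox"},
            {"spdxElementId": "SPDXRef-binary-package",
             "relationshipType": "GENERATED_FROM",
             "relatedSpdxElement": "SPDXRef-recipe-source"},
            {"spdxElementId": "SPDXRef-image",
             "relationshipType": "GENERATED_FROM",
             "relatedSpdxElement": "SPDXRef-binary-package"},
        ],
    }

    print(json.dumps(doc, indent=2))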
Richard Purdie
On Mon, 2022-09-19 at 16:20 +0200, Carlo Piana wrote:
> thank you for a well detailed and sensible answer. I certainly cannot

I need to read into the details but that looks like a great start and I'm happy to see the process being documented! Thanks for the link, I'll try and have a read.

> We have only one reservation concerning sensitive material

That makes sense, it is a tricky balancing act at times.

>> I think the challenge will be whether you can publish that review with
> Speaking from a legal perspective, this is precisely the point. It is

I do love the idea and I think it is quite possible. I do think this does lead to one of the key details we need to think about though.

From a legal perspective I'd imagine you like dealing with a set of files that make up the source of some piece of software. I'm not going to use the word "package" since I think the term is overloaded and confusing. That set of files can all be identified by checksums. This pushes us towards wanting checksums of every file.

Stepping over to the build world, we have bitbake's fetcher and it actually requires something similar - any given "input" must be uniquely identifiable from the SRC_URI and possibly a set of SRCREVs. Why? We firstly need to embed this information into the task signature. If it changes, we know we need to rerun the fetch and re-obtain the data. We work on inputs to generate this hash, not outputs, and we require all fetcher modules to be able to identify sources like this. In the case of a git repo, the hash of a git commit is good enough. For a tarball, it would be a checksum of the tarball. Where there are local patch files, we include the hashes of those files. The bottom line is that we already have a hash which represents the task inputs. Bugs happen, sure. There are also poor fetchers; npm and go present challenges in particular, but we've tried to work around those issues.

What you're saying is that you don't trust what bitbake does, so you want all the next level of information about the individual files. In theory we could put the SRC_URI and SRCREVs into the SPDX as the source (which could be summarised into a task hash) rather than the upstream url. It all depends which level you want to break things down to. I do see a case for needing the lower level info as, in review, you are going to want to know the delta to the last review decisions. You also prefer having a different "upstream" url form for some kinds of checks like CVEs.

It does feel a lot like we're trying to duplicate information and cause significant growth of the SPDX files without an actual definitive need. You could equally put in a mapping between a fetch task checksum and the checksums of all the files that fetch task would expand to if run (it should always do it deterministically).

> To be clearer, we are not discussing here the obligation to provide

My worry is that by not considering the obligation, we don't cater for a portion of the userbase and by doing so, we limit the possible adoption.

> Provenance also has a great impact on "reproducibility" of legal work

I understand this more than you realise as we have the same problem in the bitbake fetcher and have spent a lot of time trying to solve it. I won't claim we're there for some of the modern runtimes, and I'd love help both in explaining to the upstream projects why we need this and in technically fixing the fetchers so these modern runtimes work better.

> This is a very good point, and I can vouch that this is really

"optional functions" fill me with dread, this is the archiver problem I mentioned.
One of the things I try really hard to do is to have one good way of doing things rather than multiple options with different levels of functionality. If you give people choices, they use them. When someone's build fails, I don't want to have to ask "which fetcher were you using? Did you configure X or Y or Z?". If we can all use the same code and codepaths, it means we see bugs, we see regressions and we have a common experience without the need for complex test matrices.

Worst case you can add optional functions, but I kind of see that as a failure. If we can find something with low overhead which we can all use, that would be much better. Whether it is possible, I don't know, but it is why we're having the discussion. This is why I have a preference for trying to keep common code paths for the core though.

>>> - I understand that my solution is a bit hacky; but IMHO any other
>> Agreed, this needs to be done at unpack/patch time. Don't underestimate
> Can't this be made optional, so one could just go for the "old" way

See above :).

> I think that trying to give all legal departments what they want

It makes sense and is a worthy goal. I just wish we could key this off bitbake's fetch task checksum rather than having to dump reams of file checksums!

> We ourselves reuse a lot of information coming from Debian's machine-

This does explain why the bitbake fetch mechanism would be a struggle for you though, as you don't want to use our fetch units as your base component (which is why we end up struggling with some of the issues).

In the interests of moving towards a conclusion, I think what we'll end up needing to do is generate more information from the fetch and patch tasks, perhaps with a json file summary of what they do (filenames and checksums?). That would give your tools data to feed on, even if I'm not convinced we should be dumping more and more data into the final SPDX files.

Cheers,

Richard
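A sketch of the direction Richard suggests, under the assumption that a per-recipe fetch "input key" can be derived from the SRC_URI entries and SRCREVs (this mimics the idea of bitbake's task-input hashing, not its actual algorithm) and mapped to the per-file checksums the corresponding unpack deterministically produces, stored as a small JSON summary.

    import hashlib
    import json

    def fetch_input_key(src_uris, srcrevs):
        """Derive a stable key from the fetch inputs (SRC_URI entries and
        SRCREVs). Illustrative only; bitbake's real task signatures are
        computed differently."""
        h = hashlib.sha256()
        for uri in sorted(src_uris):
            h.update(uri.encode())
        for name, rev in sorted(srcrevs.items()):
            h.update(f"{name}={rev}".encode())
        return h.hexdigest()

    def write_fetch_summary(src_uris, srcrevs, file_checksums, output_json):
        """Map the fetch input key to the per-file checksums that the
        corresponding unpack produces, so SPDX tooling can reference the key
        instead of embedding every file checksum."""
        summary = {
            "fetch-input-key": fetch_input_key(src_uris, srcrevs),
            "src-uri": sorted(src_uris),
            "srcrev": srcrevs,
            "files": file_checksums,
        }
        with open(output_json, "w") as f:
            json.dump(summary, f, indent=2, sort_keys=True)

    if __name__ == "__main__":
        # Invented example data
        write_fetch_summary(
            ["https://example.org/foo-1.0.tar.gz", "file://0001-fix.patch"],
            {"default": "INVALID"},
            {"foo-1.0/main.c": "9b2af5e3d4f1c0aa0d7e6b1c2a3f4e5d"},
            "fetch-summary.json")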