Commit ba53d70

Add new benchmark_purls pipeline #1804 (#1832)
* Add new ``benchmark_purls`` pipeline #1804
  Signed-off-by: tdruez <tdruez@nexb.com>
* Add unit test for the ``benchmark_purls`` pipeline #1804
  Signed-off-by: tdruez <tdruez@nexb.com>
* Add documentation about the ``benchmark_purls`` pipeline #1804
  Signed-off-by: tdruez <tdruez@nexb.com>
---------
Signed-off-by: tdruez <tdruez@nexb.com>
1 parent c7ecb48 commit ba53d70

11 files changed: +802 -2 lines changed

CHANGELOG.rst

Lines changed: 3 additions & 0 deletions
@@ -17,6 +17,9 @@ v35.4.0 (unreleased)
 - Display the optional steps in the Pipelines autodoc.
   https://github.com/aboutcode-org/scancode.io/issues/1822

+- Add new ``benchmark_purls`` pipeline.
+  https://github.com/aboutcode-org/scancode.io/issues/1804
+
 v35.3.0 (2025-08-20)
 --------------------

docs/built-in-pipelines.rst

Lines changed: 41 additions & 0 deletions
@@ -46,6 +46,47 @@ Analyse Docker Windows Image
     :members:
     :member-order: bysource

+.. _pipeline_benchmark_purls:
+
+Benchmark PURLs (addon)
+-----------------------
+
+To check an **SBOM against a list of expected Package URLs (PURLs)**:
+
+1. **Create a new project** and provide two inputs:
+
+   * The SBOM file you want to check.
+   * A list of expected PURLs in a ``*-purls.txt`` file with one PURL per line.
+
+   .. tip:: You may also flag any filename using the ``purls`` input tag.
+
+2. **Run the pipelines**:
+
+   * Select and run the ``load_sbom`` pipeline to load the SBOM.
+   * Run the ``benchmark_purls`` pipeline to validate against the expected PURLs.
+
+3. **Download the results** from the "output" section of the project.
+
+The output file contains only the differences between the discovered PURLs and
+the expected PURLs:
+
+* Lines starting with ``-`` are unexpected in the project.
+* Lines starting with ``+`` are missing from the project.
+
+.. note::
+    The ``load_sbom`` pipeline is provided as an example to benchmark external
+    tools using SBOMs as inputs. You can also run ``benchmark_purls`` directly
+    after any ScanCode.io pipeline to validate the discovered PURLs.
+
+.. tip::
+    You can provide multiple expected PURLs files.
+
+
+.. autoclass:: scanpipe.pipelines.benchmark_purls.BenchmarkPurls()
+    :members:
+    :member-order: bysource
+
+
 .. _pipeline_collect_strings_gettext:

 Collect string with Xgettext (addon)

docs/scanpipe-pipes.rst

Lines changed: 5 additions & 0 deletions
@@ -8,6 +8,11 @@ Generic
 .. automodule:: scanpipe.pipes
    :members:

+Benchmark
+---------
+.. automodule:: scanpipe.pipes.benchmark
+   :members:
+
 ClamAV
 ------
 .. automodule:: scanpipe.pipes.clamav

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -135,6 +135,7 @@ run = "scancodeio:combined_run"
 analyze_docker_image = "scanpipe.pipelines.analyze_docker:Docker"
 analyze_root_filesystem_or_vm_image = "scanpipe.pipelines.analyze_root_filesystem:RootFS"
 analyze_windows_docker_image = "scanpipe.pipelines.analyze_docker_windows:DockerWindows"
+benchmark_purls = "scanpipe.pipelines.benchmark_purls:BenchmarkPurls"
 collect_strings_gettext = "scanpipe.pipelines.collect_strings_gettext:CollectStringsGettext"
 collect_symbols_ctags = "scanpipe.pipelines.collect_symbols_ctags:CollectSymbolsCtags"
 collect_symbols_pygments = "scanpipe.pipelines.collect_symbols_pygments:CollectSymbolsPygments"

scanpipe/models.py

Lines changed: 2 additions & 2 deletions
@@ -1147,12 +1147,12 @@ def get_output_file_path(self, name, extension):
         filename = f"{name}-{filename_now()}.{extension}"
         return self.output_path / filename

-    def get_latest_output(self, filename):
+    def get_latest_output(self, filename, extension="json"):
         """
         Return the latest output file with the "filename" prefix, for example
         "scancode-<timestamp>.json".
         """
-        output_files = sorted(self.output_path.glob(f"*{filename}*.json"))
+        output_files = sorted(self.output_path.glob(f"*{filename}*.{extension}"))
         if output_files:
             return output_files[-1]
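The effect of the new ``extension`` argument can be illustrated outside of Django with a minimal sketch. The standalone ``get_latest_output`` function and the file names below are hypothetical stand-ins for the ``Project`` method and its ``output_path``. The sketch assumes only that output file names embed a timestamp that sorts lexicographically.

```python
import tempfile
from pathlib import Path


def get_latest_output(output_path, filename, extension="json"):
    """Return the last "*{filename}*.{extension}" file in sorted order, or None."""
    output_files = sorted(Path(output_path).glob(f"*{filename}*.{extension}"))
    if output_files:
        return output_files[-1]
    return None


with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # Hypothetical file names; the embedded timestamps sort lexicographically.
    for name in (
        "scancode-2025-08-20-10-00-00.json",
        "benchmark_purls-2025-08-20-10-00-00.txt",
        "benchmark_purls-2025-08-21-09-30-00.txt",
    ):
        (tmp_path / name).touch()

    latest_json = get_latest_output(tmp_path, "scancode")
    latest_txt = get_latest_output(tmp_path, "benchmark_purls", extension="txt")
    # The default extension stays "json", so no ".txt" file is matched here.
    missing = get_latest_output(tmp_path, "benchmark_purls")

print(latest_json.name, latest_txt.name, missing)
```

Keeping ``"json"`` as the default leaves existing callers unchanged, while a caller interested in the ``benchmark_purls`` text output can pass ``extension="txt"``.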

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# http://nexb.com and https://github.com/aboutcode-org/scancode.io
+# The ScanCode.io software is licensed under the Apache License version 2.0.
+# Data generated with ScanCode.io is provided as-is without warranties.
+# ScanCode is a trademark of nexB Inc.
+#
+# You may not use this software except in compliance with the License.
+# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed
+# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+# CONDITIONS OF ANY KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations under the License.
+#
+# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
+# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
+# ScanCode.io should be considered or used as legal advice. Consult an Attorney
+# for any legal advice.
+#
+# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
+# Visit https://github.com/aboutcode-org/scancode.io for support and download.
+
+from scanpipe.pipelines import Pipeline
+from scanpipe.pipes import benchmark
+
+
+class BenchmarkPurls(Pipeline):
+    """
+    Validate discovered project packages against a reference list of expected PURLs.
+
+    The expected PURLs must be provided as a .txt file with one PURL per line.
+    Input files are recognized if:
+
+    - They are tagged with "purls", or
+    - Their filename ends with "purls.txt" (e.g., "expected_purls.txt").
+
+    """
+
+    download_inputs = False
+    is_addon = True
+
+    @classmethod
+    def steps(cls):
+        return (
+            cls.get_expected_purls,
+            cls.compare_purls,
+        )
+
+    def get_expected_purls(self):
+        """Load the expected PURLs defined in the project inputs."""
+        self.expected_purls = benchmark.get_expected_purls(self.project)
+
+    def compare_purls(self):
+        """Run the PURLs diff and write the results to a project output file."""
+        diff_results = benchmark.compare_purls(self.project, self.expected_purls)
+        output_file = self.project.get_output_file_path("benchmark_purls", "txt")
+        output_file.write_text("\n".join(diff_results))
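The pipeline above declares its steps as a tuple of methods and passes state from one step to the next through an instance attribute (``self.expected_purls``). The following minimal sketch shows that pattern in isolation. ``MiniPipeline``, its ``execute`` method, and ``BenchmarkSketch`` are hypothetical names invented for illustration, not the ScanCode.io ``Pipeline`` API, and the comparison uses plain set differences rather than the real diff.

```python
class MiniPipeline:
    """Hypothetical stand-in for a pipeline base class (not the ScanCode.io API)."""

    @classmethod
    def steps(cls):
        raise NotImplementedError

    def execute(self):
        # Run each step in declared order; steps share state via attributes.
        for step in self.steps():
            step(self)


class BenchmarkSketch(MiniPipeline):
    def __init__(self, discovered, expected_input):
        self.discovered = discovered
        self.expected_input = expected_input
        self.report = None

    @classmethod
    def steps(cls):
        return (
            cls.get_expected_purls,
            cls.compare_purls,
        )

    def get_expected_purls(self):
        # First step: normalize the expected list and store it on the instance.
        self.expected_purls = sorted(set(self.expected_input))

    def compare_purls(self):
        # Second step: relies on the attribute set by the previous step.
        discovered = set(self.discovered)
        expected = set(self.expected_purls)
        self.report = {
            "only_in_project": sorted(discovered - expected),
            "only_in_expected": sorted(expected - discovered),
        }


pipeline = BenchmarkSketch(
    discovered=["pkg:pypi/a@1.0", "pkg:pypi/b@2.0"],
    expected_input=["pkg:pypi/b@2.0", "pkg:pypi/c@3.0", "pkg:pypi/c@3.0"],
)
pipeline.execute()
print(pipeline.report)
```

Because ``compare_purls`` depends on an attribute set by ``get_expected_purls``, the order of the tuple returned by ``steps`` matters.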

scanpipe/pipes/benchmark.py

Lines changed: 68 additions & 0 deletions
@@ -0,0 +1,68 @@
+# SPDX-License-Identifier: Apache-2.0
+#
+# http://nexb.com and https://github.com/aboutcode-org/scancode.io
+# The ScanCode.io software is licensed under the Apache License version 2.0.
+# Data generated with ScanCode.io is provided as-is without warranties.
+# ScanCode is a trademark of nexB Inc.
+#
+# You may not use this software except in compliance with the License.
+# You may obtain a copy of the License at: http://apache.org/licenses/LICENSE-2.0
+# Unless required by applicable law or agreed to in writing, software distributed
+# under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
+# CONDITIONS OF ANY KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations under the License.
+#
+# Data Generated with ScanCode.io is provided on an "AS IS" BASIS, WITHOUT WARRANTIES
+# OR CONDITIONS OF ANY KIND, either express or implied. No content created from
+# ScanCode.io should be considered or used as legal advice. Consult an Attorney
+# for any legal advice.
+#
+# ScanCode.io is a free software code scanning tool from nexB Inc. and others.
+# Visit https://github.com/aboutcode-org/scancode.io for support and download.
+
+import difflib
+
+
+def get_expected_purls(project):
+    """
+    Load the expected Package URLs (PURLs) from the project's input files.
+
+    A file is considered an expected PURLs source if:
+    - Its filename ends with ``*purls.txt``, or
+    - Its download URL includes the "#purls" tag.
+
+    Each line in the file should contain one PURL. Returns a sorted,
+    deduplicated list of PURLs. Raises an exception if no input is found.
+    """
+    purls_files = list(project.inputs("*purls.txt"))
+    purls_files.extend(
+        [input.path for input in project.inputsources.filter(tag="purls")]
+    )
+
+    expected_purls = []
+    for file_path in purls_files:
+        expected_purls.extend(file_path.read_text().splitlines())
+
+    if not expected_purls:
+        raise Exception("Expected PURLs not provided.")
+
+    return sorted(set(expected_purls))
+
+
+def compare_purls(project, expected_purls):
+    """
+    Compare discovered project PURLs against the expected PURLs.
+
+    Returns only the differences:
+    - Lines starting with '-' are unexpected in the project.
+    - Lines starting with '+' are missing from the project.
+    """
+    project_packages = project.discoveredpackages.only_package_url_fields()
+    sorted_unique_purls = sorted({package.purl for package in project_packages})
+
+    diff_result = difflib.ndiff(sorted_unique_purls, expected_purls)
+
+    # Keep only lines that are diffs (- or +)
+    filtered_diff = [line for line in diff_result if line.startswith(("-", "+"))]
+
+    return filtered_diff
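The two helpers can be exercised without a ScanCode.io project. The sketch below is a standalone approximation under stated assumptions: ``load_expected_purls`` and ``diff_purls`` are hypothetical names that take plain file paths and lists instead of a ``project``, the sample PURLs are made up, and ``ValueError`` replaces the generic ``Exception`` purely as an illustrative choice.

```python
import difflib
import tempfile
from pathlib import Path


def load_expected_purls(file_paths):
    """Read one PURL per line from each file; return a sorted, deduplicated list."""
    purls = []
    for file_path in file_paths:
        purls.extend(Path(file_path).read_text().splitlines())
    if not purls:
        raise ValueError("Expected PURLs not provided.")
    return sorted(set(purls))


def diff_purls(discovered_purls, expected_purls):
    """Return only the "- " and "+ " lines of an ndiff between the two lists."""
    diff = difflib.ndiff(sorted(set(discovered_purls)), expected_purls)
    # ndiff also yields unchanged ("  ") and hint ("? ") lines; drop those.
    return [line for line in diff if line.startswith(("-", "+"))]


with tempfile.TemporaryDirectory() as tmp:
    purls_file = Path(tmp) / "expected-purls.txt"
    # Made-up sample data with one duplicated line.
    purls_file.write_text("pkg:pypi/b@2.0\npkg:pypi/c@3.0\npkg:pypi/b@2.0\n")
    expected = load_expected_purls([purls_file])

discovered = ["pkg:pypi/a@1.0", "pkg:pypi/b@2.0"]
result = diff_purls(discovered, expected)
print(result)  # ['- pkg:pypi/a@1.0', '+ pkg:pypi/c@3.0']
```

With ``difflib.ndiff(a, b)``, a ``- `` prefix marks a line present only in the first sequence and a ``+ `` prefix marks a line present only in the second. Since the discovered PURLs are passed first, ``-`` entries exist only in the project and ``+`` entries exist only in the expected list.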
Lines changed: 3 additions & 0 deletions
@@ -0,0 +1,3 @@
+- pkg:alpine/alpine-keys@2.5-r0?arch=x86_64
++ pkg:alpine/zlib@1.3.2-r2?arch=x86_64
++ pkg:deb/debian/alpine-keys@2.5-r0?arch=x86_64
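A result file like the one above can be post-processed with a few lines of Python. This is a minimal sketch: ``split_diff`` is a hypothetical helper, not part of ScanCode.io, and it assumes each line keeps the two-character ``- `` or ``+ `` prefix produced by ``difflib.ndiff``.

```python
def split_diff(lines):
    """Group filtered ndiff lines by marker, stripping the two-character prefix."""
    only_in_project, only_in_expected = [], []
    for line in lines:
        if line.startswith("- "):
            # Present in the first ndiff sequence (discovered PURLs) only.
            only_in_project.append(line[2:])
        elif line.startswith("+ "):
            # Present in the second ndiff sequence (expected PURLs) only.
            only_in_expected.append(line[2:])
    return only_in_project, only_in_expected


output = """\
- pkg:alpine/alpine-keys@2.5-r0?arch=x86_64
+ pkg:alpine/zlib@1.3.2-r2?arch=x86_64
+ pkg:deb/debian/alpine-keys@2.5-r0?arch=x86_64"""

only_in_project, only_in_expected = split_diff(output.splitlines())
print(only_in_project)
print(only_in_expected)
```

Both ``+`` entries appear in the 17-line PURLs fixture included in this commit, which is consistent with ``+`` marking PURLs that are expected but not discovered in the project.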
Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+pkg:alpine/alpine-baselayout@3.7.0-r0?arch=x86_64
+pkg:alpine/alpine-baselayout-data@3.7.0-r0?arch=x86_64
+pkg:deb/debian/alpine-keys@2.5-r0?arch=x86_64
+pkg:alpine/alpine-release@3.22.1-r0?arch=x86_64
+pkg:alpine/apk-tools@2.14.9-r2?arch=x86_64
+pkg:alpine/busybox@1.37.0-r18?arch=x86_64
+pkg:alpine/busybox-binsh@1.37.0-r18?arch=x86_64
+pkg:alpine/ca-certificates-bundle@20250619-r0?arch=x86_64
+pkg:alpine/libapk2@2.14.9-r2?arch=x86_64
+pkg:alpine/libcrypto3@3.5.1-r0?arch=x86_64
+pkg:alpine/libssl3@3.5.1-r0?arch=x86_64
+pkg:alpine/musl@1.2.5-r10?arch=x86_64
+pkg:alpine/musl-utils@1.2.5-r10?arch=x86_64
+pkg:alpine/scanelf@1.3.8-r1?arch=x86_64
+pkg:alpine/ssl_client@1.37.0-r18?arch=x86_64
+pkg:alpine/zlib@1.3.1-r2?arch=x86_64
+pkg:alpine/zlib@1.3.2-r2?arch=x86_64
