This is a great piece about reproducibility in source and binary packaging by the @GuixHpc folks. Let’s talk about what this means, though, because reproducibility can mean a lot of things, and people seem to get religious about certain aspects. 🧵 🪡
Ever wondered what happens when you run "pip install torch"? What's in a package? 📦 👉 hpc.guix.info/blog/2021/09/w… This post is about 📦 verifiability and its impact on #security and #ReproducibleScience. Cc: @ReproBuilds @pypi @anacondainc @spackpm @nixos_org @PyTorch

5:26 AM · Sep 22, 2021

First off, one claim here is that Guix is *more reproducible* than other systems, mainly because it builds everything from libc up, and doesn’t rely on system libraries. That’s probably true, but what can Guix reproduce? Well, Guix can reproduce… Guix.
What I mean by that is that Guix reproduces *its own* recipes from the ground up. That’s it. If you think about the way people use #hpc machines, there is a *lot* of software from “the system” that we rely on. Vendor MPI, compilers, math libs, drivers, and the host Linux distro.
In some cases we rely on this stuff b/c it’s vetted with the drivers we need; in other cases (e.g., RHEL), it may have bugfixes for our hardware. Or maybe it’s just expedient for a developer to get things working this way.
I’m all for open vendor stacks, esp. since there’s evidence that the pain of making them work may not pay off performance-wise (dl.acm.org/doi/abs/10.5555/3…). But, we still need to be able to reproduce proprietary things, as best we can, b/c that’s how people run on the systems.
So that’s what @spackpm is doing when it allows “externals” in the build. We can reproduce builds with Red Hat’s gcc, NVHPC, oneapi tarballs, Cray compilers, libsci, etc. You have to have the proprietary binary. Yes, it’s opaque, but we enable reproducing it as much as possible.
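The “externals” mechanism can be sketched in a Spack `packages.yaml` fragment like this (the package names, versions, and paths are illustrative, not from the thread):

```yaml
packages:
  gcc:
    externals:
    - spec: gcc@8.5.0
      prefix: /usr              # e.g., Red Hat's system gcc
    buildable: false            # never rebuild; always use the vendor binary
  cray-mpich:
    externals:
    - spec: cray-mpich@8.1.4
      prefix: /opt/cray/pe/mpich/8.1.4
    buildable: false
```

With `buildable: false`, Spack records the external in every downstream package’s provenance instead of trying to rebuild it, which is what lets builds against opaque vendor stacks still be replayed on the same machine.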
You can’t do that with Guix because you have to use their libc, their compilers, their recipes, etc. You *could* package it all in Guix, but as the article notes — is it worth it? In many cases the goal on an #hpc system is just to reproduce something on the same machine.
Eventually, we’d like Spack to be completely self-contained; it will one day be *able* to build down to libc if I have any say in it. But we also care about reproducing *everything*, even proprietary stuff, and we’ll keep supporting that as well as can be expected.
There are some other assumptions worth mentioning, as well. @condaforge hosts builds on public resources, and it makes the whole build available to folks who want to see it. It’s not as easy to run locally, but it’s not “opaque” or “doomed” as claimed in the article.
Local reproducibility is nice, but you *could* reproduce a Conda build in a local container, and the packages used in the container (if they’re Debian) may be reproducible in the same sense Guix is. It’s not perfect or easy, but it’s also not as evil as the article claims.
On verifiability and security: security is ultimately about trust, and while source helps, *just* verifying that particular source produced a particular binary doesn’t make me trust it. I would need to read all the source. Or trust where it came from, as with a binary.
Reading all the source is impractical and intractable. So in practice, I *have* to trust something. Do I trust my host OS? Processor? BIOS? Do I trust who I got the source from? There are good arguments why repro builds *aren’t* the whole story: blog.cmpxchg8b.com/2020/07/y…
I personally think that reproducible builds are a good practice, and they certainly make it easier to vet things if you trust the source. But they’re not a solution to all my security problems. I don’t fully trust source code just because it’s source code, and neither should you.
Finally, bitwise reproducibility is not the only *kind* of reproducibility. What we care about in science are reproducible results, not *necessarily* reproducible bits.
Nobody in the scientific community is arguing for atom-wise reproducible chemistry. There are tolerances. In #hpc, we care about reproducing results with optimization and across machines, so bits likely do change from build to build.
This is why @spackpm lets you swap things easily. In #hpc, people do a lot of porting and parameter tweaking. Our metadata model tries to preserve that in a way you (or a solver) can reason about. We want to explore performance space, trying different builds w/the same source.
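The “swap things easily” part is Spack’s spec syntax; a hedged sketch of what that exploration looks like on the command line (package names, versions, and flags are illustrative):

```shell
# Same source, different builds: swap the MPI implementation...
spack install hdf5 ^openmpi
spack install hdf5 ^mvapich2

# ...or the compiler and flags, to explore performance space
spack install hdf5 %gcc@12.2.0
spack install hdf5 %oneapi cflags="-O3"

# Each install records its fully concretized spec for later comparison
spack spec -l hdf5
```

Because every choice (compiler, MPI, flags) lands in the concretized spec, two builds that differ in one parameter can be diffed rather than guessed at.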
If I redo an experiment *on a different machine*, I want to know that the results were the same as the original, within some epsilon, and if they’re *not* the same, I want the build tools to point me at possible causes. I think we are a long way from guarantees in this space.
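The “same within some epsilon” check is the kind of thing build tools could automate; a minimal sketch in Python (the tolerances and numbers are invented for illustration):

```python
import math

def results_match(original, rerun, rel_tol=1e-9, abs_tol=1e-12):
    """Compare two lists of floating-point results within a tolerance.

    Bitwise equality is too strict across machines, compilers, and
    optimization levels; elementwise closeness is usually what we mean
    by "the same results".
    """
    if len(original) != len(rerun):
        return False
    return all(math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
               for a, b in zip(original, rerun))

# Results from the original machine vs. a rerun of an optimized build
# on a different machine (numbers are made up).
baseline = [1.0000000000, 2.5000000000, 3.1415926535]
rerun    = [1.0000000001, 2.4999999999, 3.1415926536]

print(results_match(baseline, rerun))            # within epsilon
print(results_match(baseline, [1.0, 2.5, 4.0]))  # genuinely different
```

The hard part the thread points at is not this comparison but the diagnosis: when the check fails, mapping the difference back to a compiler, flag, or library change.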
So, great work by @GuixHpc, and it’s great they now have a PyTorch package! I think it’s important, though, to consider *why* other projects make different decisions. There’s more than one way to reproduce a build, and more than one dimension to reproducibility and security.