Re: [Discussion]: Metadata to consolidate and rebuild base-apt from distributed CI builds

From: Henning Schild <henning.schild@siemens.com>
To: vijai kumar <vijaikumar.kanagarajan@gmail.com>
Cc: isar-users <isar-users@googlegroups.com>,
	Baurzhan Ismagulov <ibr@radix50.net>,
	Jan Kiszka <jan.kiszka@siemens.com>
Subject: Re: [Discussion]: Metadata to consolidate and rebuild base-apt from distributed CI builds
Date: Thu, 24 Feb 2022 16:42:44 +0100
Message-ID: <20220224164244.6e4bb002@md1za8fc.ad001.siemens.net>
In-Reply-To: <CALLGG_+sg5o_kTQo6hE=p4AhgmhZyfROZt8SkuDY1vDtiBdhkg@mail.gmail.com>

On Thu, 24 Feb 2022 18:50:50 +0530,
vijai kumar <vijaikumar.kanagarajan@gmail.com> wrote:

> Hi Henning,
> 
> On Tue, Feb 22, 2022 at 8:01 PM Henning Schild
> <henning.schild@siemens.com> wrote:
> >
> > Hey Vijai,
> >
> > On Tue, 22 Feb 2022 16:04:36 +0530,
> > vijai kumar <vijaikumar.kanagarajan@gmail.com> wrote:
> >  
> > > Problem:
> > > --------
> > > We could have several CI jobs that are running in parallel in
> > > different nodes. One might want to consolidate and build a
> > > base-apt from the debs/deb-srcs of all these builds.  
> >
> > Can you go into more detail. I do not yet get the problem.  
> 
> runner 1(Germany) -> Building de0 nano
> runner 2(India) -> Building qemuarm
> runner 3(US) -> Building qemuamd64
> 
> 
> All these builds are running in different servers.
> If we wanted to create a single base-apt from all these servers, then
> we need to copy over their deb/debsrcs/base-apt to a common server and
> then create a consolidated repo.

But why would you want to do that? I mean, I get why you would want to
store everything in the same location, but not why it should be one
repo. Maybe to save some space on sources and arch "all" packages ...
but hey, there are ways of deduplicating at the filesystem or block
level. You are just risking a weird local "all" package not being so
"all" after all ... false sharing.
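
To make the false-sharing risk concrete, here is a minimal sketch that
checks whether several per-runner package folders contain a .deb with
the same file name but different content. The folder layout and the
function name are assumptions for illustration, not something isar
provides.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def find_conflicting_debs(repo_dirs):
    """Return {filename: {sha256: [paths]}} for .deb names whose content differs.

    repo_dirs is an iterable of directories, e.g. one per CI runner
    (hypothetical layout).
    """
    seen = defaultdict(lambda: defaultdict(list))
    for repo in repo_dirs:
        for deb in Path(repo).rglob("*.deb"):
            digest = hashlib.sha256(deb.read_bytes()).hexdigest()
            seen[deb.name][digest].append(str(deb))
    # Keep only names that were found with more than one distinct checksum.
    return {name: dict(by_hash) for name, by_hash in seen.items() if len(by_hash) > 1}
```

Any name reported here would be ambiguous in a consolidated repo, while
separate repos in separate folders would keep both variants apart.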

> This involves moving around this data.

Yes, if it is one central storage place, no matter whether it is one
"repo" or many "repos" in, e.g., folders.

> The problem can be avoided if we have a single metadata produced by
> all these builds which would have details of all the packages the
> build used.
> Basically a manifest of the build. This manifest can be later used to
> recreate the repo which can be hosted later on for these jobs.

We have a manifest for "image content" which is already fed into
clearing. It is a bill of materials and nothing else; it cannot be
used to rebuild.
Even if you had all the metadata, you would still need to store sources
and binaries somewhere reliable; whether that is central or distributed
is another story.
Pointers to anything on the internet (including all Debian repos) will
at some point stop working. So if "exact rebuilding" in a "far away
future" is what you want, mirroring is what you will need.
Partial mirroring based on base-apt, even with sources, will be shaky
and you will find yourself digging in snapshots again. But it will work.
In the worst case you will not want an "exact rebuild" but a "fix
backported rebuild", which means you will need all build-deps mirrored
... recursively. In fact any "package relationship", maybe even a
Conflicts, might become rebuild relevant.
A partial mirror will not cut it; rather take a full one, so you do not
need to care about which bits to ignore and do not risk forgetting
anything.
The ideal way would be to eventually liberate snapshots of its
throttling; the short-term way is to spend some bucks on some buckets
(S3).
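
To illustrate why "recursively" makes a partial mirror grow quickly,
here is a rough sketch of a transitive closure over a dependency map.
The map is hypothetical input; real package relationships also carry
alternatives, version constraints and architecture qualifiers, which
this sketch ignores.

```python
from collections import deque


def build_dep_closure(roots, depends):
    """Return every package reachable from roots.

    depends maps a package name to the names it needs (build-deps or
    runtime deps, flattened; an assumption made for brevity).
    """
    needed = set()
    queue = deque(roots)
    while queue:
        pkg = queue.popleft()
        if pkg in needed:
            continue  # already visited, also protects against cycles
        needed.add(pkg)
        queue.extend(depends.get(pkg, ()))
    return needed
```

Even a single root package tends to pull in a large share of the
archive this way, which is the argument for taking a full mirror.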

> Having metadata and recreating repo is one way. There might be other
> ways as well.

I am afraid you likely cannot recreate it if you do not keep everything
yourself or in a place you trust (snapshots?).

There have been several threads on that topic already, including how
one could help make snapshot work for debootstrap and co. Coming from
reproducible builds and qubes-os [1] [2].
If you dig deeper you will find many people offering help and funding,
but for some reason things still seem "stuck".

On top of that, we could maybe see if we can establish something like
snapshots within Siemens. But I guess outside and open to anyone will
be much better.

[1] https://groups.google.com/g/isar-users/c/X9B5chyEWpc/m/nVXwZuIRBAAJ
[2] https://www.qubes-os.org/news/2021/10/08/reproducible-builds-for-debian-a-big-step-forward/

> That is where we thought about the --print-uris option of apt. It
> basically gives you the complete URL to the package which we can
> download using wget.
> A manifest containing all the packages ever used by the build with its
> complete url. It could easily be used for several purposes, like as
> clearing input, repo regeneration etc.
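
A minimal sketch of turning "apt-get --print-uris" output into manifest
entries, assuming each relevant line has the form
'URI' filename size checksum (the URI in single quotes). The sample
line in the usage note is made up, and the dictionary keys are just
illustrative names.

```python
import shlex


def parse_print_uris(text):
    """Parse lines like: 'URI' filename size checksum."""
    entries = []
    for line in text.splitlines():
        if not line.startswith("'"):
            continue  # skip informational lines that are not URI records
        fields = shlex.split(line)  # strips the quotes around the URI
        if len(fields) < 3:
            continue
        entries.append({
            "uri": fields[0],
            "filename": fields[1],
            "size": int(fields[2]),
            "checksum": fields[3] if len(fields) > 3 else "",
        })
    return entries
```

For example, feeding it the (made-up) line
'http://deb.debian.org/debian/pool/main/h/hello/hello_2.10-2_amd64.deb' hello_2.10-2_amd64.deb 56132 SHA256:abc
yields one entry whose "uri" could be passed to wget.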

Maybe we can find valid reasons to extend the manifests. But URLs to
packages seem almost redundant: knowing the package names and versions
and all sources.list entries, one can generate these URLs for any
mirror. Picking just one of many mirrors would be limiting.
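
A sketch of that idea: given the text of a Packages index (fetched via
a sources.list entry) plus a package name and version, the "Filename"
field gives the path relative to the mirror root, so the URL can be
built for any mirror. The function names and the second mirror URL in
the usage note are assumptions for illustration.

```python
def parse_stanzas(packages_text):
    """Split a Packages index into dicts of single-line fields."""
    stanzas = []
    for block in packages_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if line[:1] in (" ", "\t") or ":" not in line:
                continue  # continuation lines (long descriptions) are skipped
            key, _, value = line.partition(":")
            fields[key] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas


def urls_for(package, version, stanzas, mirrors):
    """Return one URL per mirror for the matching package/version, else []."""
    for stanza in stanzas:
        if stanza.get("Package") == package and stanza.get("Version") == version:
            return [m.rstrip("/") + "/" + stanza["Filename"] for m in mirrors]
    return []
```

Calling urls_for("hello", "2.10-2", stanzas,
["http://deb.debian.org/debian", "http://mirror.example.org/debian/"])
would give the same pool path under both mirror roots, which is why
storing one fixed URL per package adds little.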

And maybe there are valid reasons for having manifests even for
buildchroots. But the problem here is that they change all the time
while we still use one buildchroot. We see packages being added as
build deps all the time, but also removed when build deps conflict.
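
To show what that churn looks like, here is a small sketch that
compares two snapshots of a buildchroot's package list, assuming each
snapshot is text with one "name<TAB>version" pair per line (the default
output shape of "dpkg-query -W"). When and how the snapshots are taken
is left open; the function names are made up.

```python
def parse_package_list(text):
    """Parse 'name<TAB>version' lines into a dict."""
    packages = {}
    for line in text.splitlines():
        name, sep, version = line.partition("\t")
        if sep and name:
            packages[name.strip()] = version.strip()
    return packages


def diff_package_lists(before, after):
    """Return (added, removed, changed) between two parsed snapshots."""
    added = {n: v for n, v in after.items() if n not in before}
    removed = {n: v for n, v in before.items() if n not in after}
    changed = {
        n: (before[n], after[n])
        for n in before.keys() & after.keys()
        if before[n] != after[n]
    }
    return added, removed, changed
```

A manifest for the buildchroot would have to record every such delta,
or be the union of all snapshots, to cover packages that were installed
and later removed.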

> I don't think sstate can help here. I might be wrong though.

I guess sstate will not help. It means even more storage and more
storage sync between runners if you want to share.

Henning

> Thanks,
> Vijai Kumar K
> 
> >
> > It seems like you want to save compute time by sharing pre-built
> > artifacts via some common storage. The sstate can do that very
> > well, we are using shared folders for on-prem runners, s3 for AWS
> > and sstate mirrors for population of "new empty runners" and
> > "partial result delivery" of failed jobs and to sync on-prem with
> > s3.
> >
> > isar is a tool to build images, not distros or repos or packages.
> > While it can do all of that, using it for such things can get tricky
> > and isar was not designed for such cases. Meaning "base-apt" is not
> > meant to be your cache to build many images from ... it is meant to
> > be the cache for exactly one ... and sharing can cause problems.
> >
> > sstate would detect false sharing, say a package recipe for some
> > reason uses a machine-conf variable. multiconfig or base-apt
> > sharing would make you run into that bug, while sstate would likely
> > not.
> >
> > So if it is about build time, I suggest you have a look at sstate
> > and the not-yet-upstreamed Python helper scripts for
> > sharing/eviction. I can point you to them in case you do not find
> > them yourself.
> >
> > Henning
> >  
> > > What's possible:
> > > ---------------
> > > With the current state of ISAR, the below is possible.
> > >
> > > 1. Run all the jobs in parallel in separate CI runners
> > > 2. Collect all the debs and deb-srcs from those builds and push
> > > to a common file server.
> > > 3. Download the debs and deb-srcs and create a repo out of it in
> > > the final CI step,
> > > 4. Upload the base-apt to the server.
> > >
> > > This has some disadvantages, we need to move all those
> > > data(deb/debsrcs), this increases time and cost.
> > >
> > > What's needed:
> > > --------------
> > > The idea is to have a simple meta-data that can be used by repo
> > > generation tools to recreate the repo.
> > >
> > > Why manifest cannot be used:
> > > ----------------------------
> > > Manifest does not serve this particular need. Below are the
> > > shortcomings of image manifest,
> > > 1. Does not have details about removed packages(eg localepurge)
> > > 2. Manifest of buildchroot would not have details about the
> > > package dependencies/imager installs at the time of
> > > generation(i.e. postprocess)
> > >
> > > Some ideas:
> > > -----------
> > > There were a couple of ideas,
> > > 1. To use an external script to create a manifest of the
> > > downloads/{deb, debsrc} folder and try to download the packages
> > > using that manifest and appropriate sourceslist in the final
> > > runner.
> > > 2. To use "apt --print-uris" + "debootstrap
> > > --keep-debootstrap-dir" to create a metadata with complete url to
> > > the package. Later wget can be used to download those from the
> > > web.
> > >
> > > We are wondering if we could discuss and derive a solution for
> > > this here in ISAR itself instead of opting for some local scripts
> > > in downstream layers.
> > >
> > > Thanks,
> > > Vijai Kumar K  
> >
