It’s fairly obvious why stopping a service while backing it up makes sense. Imagine backing up Immich while it’s running. You start the backup, db is backed up, now image assets are being copied. That could take an hour. While the assets are being backed up, a new image is uploaded. The live database knows about it but the one you’ve backed up doesn’t. Then your backup process reaches the new image asset and it copies it. If you restore this backup, Immich will contain an asset that isn’t known by the database. In order to avoid scenarios like this, you’d stop Immich while the backup is running.
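The race described above can be reproduced with a toy model. This is only a sketch: the sets stand in for the database and the asset directory, and the file names are made up for illustration.

```python
def backup_while_running():
    """Simulate dumping the database first, then copying assets, while uploads continue."""
    live_db = {"img1.jpg", "img2.jpg"}      # asset names the database knows about
    live_assets = {"img1.jpg", "img2.jpg"}  # files actually on disk

    # Step 1: the database is backed up.
    backup_db = set(live_db)

    # Step 2: while the slow asset copy is in progress, a new image is uploaded.
    live_db.add("img3.jpg")
    live_assets.add("img3.jpg")

    # Step 3: the asset copy reaches the new file and copies it too.
    backup_assets = set(live_assets)

    return backup_db, backup_assets


def orphaned_assets(db_entries, asset_files):
    """Return files in the backup that the backed-up database does not know about."""
    return sorted(asset_files - db_entries)


backup_db, backup_assets = backup_while_running()
print(orphaned_assets(backup_db, backup_assets))  # ['img3.jpg']
```

Restoring this backup would give exactly the mismatch described: an asset on disk with no matching database entry.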

Now consider a system that can do instant snapshots like ZFS or LVM. Immich is running, you stop it, take a snapshot, then restart it. Then you back up Immich from the snapshot while Immich is running. This should reduce the downtime needed to the time it takes to do the snapshot. The state of Immich data in the snapshot should be equivalent to backing up a stopped Immich instance.
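A minimal sketch of that stop, snapshot, restart sequence, assuming the service is a systemd unit and its data lives on one LVM logical volume. The unit name `immich.service`, the volume group `vg0`, the volume `immich`, the snapshot name and the 5G snapshot size are all hypothetical; a Docker-based install would use different stop and start commands.

```python
import subprocess
from contextlib import contextmanager


def run(cmd):
    """Run a command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


@contextmanager
def stopped(unit, runner=run):
    """Stop a systemd unit and always try to start it again afterwards."""
    runner(["systemctl", "stop", unit])
    try:
        yield
    finally:
        runner(["systemctl", "start", unit])


def snapshot_with_brief_downtime(unit, vg, lv, snap, size="5G", runner=run):
    """Stop the service, take an LVM snapshot, restart the service.

    Returns the device path of the snapshot, which can then be mounted
    read-only and backed up while the service is running again.
    """
    with stopped(unit, runner):
        runner(["lvcreate", "--snapshot", "--size", size, "--name", snap, f"{vg}/{lv}"])
    return f"/dev/{vg}/{snap}"


# Dry run: record the commands instead of executing them (hypothetical names).
recorded = []
device = snapshot_with_brief_downtime(
    "immich.service", "vg0", "immich", "immich-backup", runner=recorded.append
)
for cmd in recorded:
    print(" ".join(cmd))
```

The `finally` clause is the important design choice: the service is restarted even if the snapshot command fails, so a failed backup does not turn into an outage.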

Now consider a case like the one above, but without stopping Immich while taking the snapshot. In theory the data you’re backing up should represent the complete state of Immich at a point in time, eliminating the possibility of divergent data between databases and assets. It would, however, represent the state of a live Immich instance, e.g. with lock files still present. Wouldn’t restoring from such a backup be equivalent to kill -9 or pulling the cable and restarting the service? If a service can recover from a cable pull, is it reasonable to expect it to recover from a restore of a snapshot taken while live? If so, is there much point to stopping services during snapshots?

  • butitsnotme@lemmy.world
    5 months ago

    I don’t bother stopping services during backup, each service is contained to a single LVM volume, so snapshotting is exactly the same as yanking the plug. I haven’t had any issues yet, either with actual power failures or data restores.

    • Avid Amoeba@lemmy.caOP
      5 months ago

      And this implies you have tested such backups, right?

      Side Q, how long do those LVM snapshots take? How long does it take to merge them afterwards?

      • butitsnotme@lemmy.world
        5 months ago

        Yes, I have. I should probably test them again though, as it’s been a while, and Immich at least has had many potentially significant changes.

        LVM snapshots are virtually instant, and there is no merge operation, so deleting the snapshot is also virtually instant. The way it works is by creating a new space where the differences from the main volume are written, so each time the application writes to the main volume the old block is copied to the snapshot first. This does mean that disk performance will be somewhat lower than without snapshots; however, I’ve not really noticed any practical implications. (I believe LVM typically creates my snapshots on a different physical disk from where the main volume lives though.)

        You can see my backup script here.
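The copy-on-write behaviour described in this comment can be sketched with a toy model. This is an illustration of the idea only, not how LVM is implemented; the `Volume` class and its method names are invented for the example.

```python
class Volume:
    """Toy block device with a single copy-on-write snapshot."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshot = None  # maps block index -> contents at snapshot time

    def create_snapshot(self):
        # Instant: no data is copied when the snapshot is created.
        self.snapshot = {}

    def write(self, index, data):
        # On the first write to a block after the snapshot, preserve the
        # old contents in the snapshot area before overwriting.
        if self.snapshot is not None and index not in self.snapshot:
            self.snapshot[index] = self.blocks[index]
        self.blocks[index] = data

    def read_snapshot(self, index):
        # Preserved blocks come from the snapshot area; untouched blocks
        # are still identical on the main volume.
        if self.snapshot is None:
            raise RuntimeError("no snapshot exists")
        return self.snapshot.get(index, self.blocks[index])

    def delete_snapshot(self):
        # Instant: the main volume already holds the current data, so the
        # preserved blocks are simply discarded. Nothing is merged back.
        self.snapshot = None


vol = Volume(["a", "b", "c"])
vol.create_snapshot()
vol.write(1, "B")
print(vol.blocks)                                 # ['a', 'B', 'c']
print([vol.read_snapshot(i) for i in range(3)])   # ['a', 'b', 'c']
```

The extra copy on the first write to each block is where the performance cost mentioned above comes from.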

        • Avid Amoeba@lemmy.caOP
          5 months ago

          Oh interesting. I was under the impression that deletion in LVM was actually merging which took some time but I guess not. Thanks for the info!

  • Admiral Patrick@dubvee.org
    5 months ago

    Wouldn’t restoring from such a backup be equivalent to kill -9 or pulling the cable and restarting the service?

    Disclaimer: Not familiar with Immich, but this is what I’ve experienced generally.

    AFAIK, effectively yes. The only thing you might lose is anything in memory that hasn’t been written to disk at the time the snapshot was taken (which is still effectively equivalent to kill -9).

    At work, we use Veeam which is snapshot based, and database server restores (or spinning up a test DB based off of production) work just fine. That said, we still take scheduled dumps/backups of the database servers just to have known-good states to roll back to if ever the need arises.

    • Avid Amoeba@lemmy.caOP
      5 months ago

      Thanks for validating my reasoning. And yeah, this isn’t Immich-specific, it would be valid for any process and its data.

      • BCsven@lemmy.ca
        5 months ago

        What I have seen on corporate servers is that when a backup starts, the database goes into a different mode: a temporary writable partition is used while the read-only database is backed up, and at the end of the backup the blob that was created is also stored.

        • Avid Amoeba@lemmy.caOP
          5 months ago

          Yeah, if you’re making a backup using the database system itself, then it would make sense for it to do something like that if it stays live while backing up. If you think about it, it’s kinda similar to taking a snapshot of the volume where an app’s data files are while it still runs. It keeps writing as normal while you copy the data from the snapshot, which is read-only. Of course there’s no built-in way to get the newly written data without stopping the process. But you could get the downtime to a small number. 😄

          • gedhrel@lemmy.world
            5 months ago

            The other thing to watch out for is if you’re splitting state between volumes, but I think you’ve already ruled that out.

    • gedhrel@lemmy.world
      5 months ago

      I’d be cautious about the “kill -9” reasoning. It isn’t necessarily equivalent to yanking power.

      Contents of application memory lost, yes. Contents of unflushed OS buffers, no. Your db will be fsyncing (or moral equivalent thereof) if it’s worth the name.

      This is an aside; backing up from a volume snapshot is half a reasonable idea. (The other half is ensuring that you can restore from the backup, regularly, automatically, and the third half is ensuring that your automated validation can be relied on.)
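The difference between application memory, OS buffers and the disk drawn in this comment shows up in how a careful program writes a file. A minimal sketch, assuming a POSIX system; `durable_write` is a hypothetical helper, and real durability also depends on the storage hardware honouring flush requests.

```python
import os
import tempfile


def durable_write(path, data: bytes):
    """Write data so that it should survive a power loss, not just a killed process."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)         # in the process's own buffer: lost on kill -9
            f.flush()             # handed to the OS: survives kill -9, not a power cut
            os.fsync(f.fileno())  # OS asked to push the data to the storage device
        os.replace(tmp_path, path)  # atomic rename over the destination
    except BaseException:
        os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is persisted (POSIX assumption).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "state.json")
    durable_write(target, b'{"ok": true}')
    with open(target, "rb") as f:
        print(f.read())
```

A volume snapshot captures what has reached the block device at that moment, so state that a program has written but not yet flushed or fsynced may be missing from it, as it would be after a power cut.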

      • Avid Amoeba@lemmy.caOP
        5 months ago

        Contents of application memory lost, yes. Contents of unflushed OS buffers, no. Your db will be fsyncing (or moral equivalent thereof) if it’s worth the name.

        Good point. I guess kill -9 is somewhat less catastrophic than a power-yank. If a service is written well enough to handle the latter it should be able to handle the former. Should, subject to very interesting bugs that can hide in the difference.

        This is an aside; backing up from a volume snapshot is half a reasonable idea. (The other half is ensuring that you can restore from the backup, regularly, automatically, and the third half is ensuring that your automated validation can be relied on.)

        I’m currently thinking of setting up automatic restore of these backups on the off-site backup machine. That is, the backups are transferred to the off-site machine, restored to the dirs of the services, then the services are started. This should cover the second half, I think. Of course those services can’t be used to store new data because they’ll be overwritten with every backup. In the event of a hard snafu where the main machine disappears, I could stop the auto restore on the off-site machine and start using the services from it, effectively making it the main machine. If this turns out to be reasonable and working, I might trash all of the file-based backup-and-transfer mechanisms and switch to ZFS send/recv. That should allow shrinking the data delta between main and off-site to minutes instead of hours or days. Does this make any sense?
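One way to automate part of the restore check discussed here is to compare checksums of the restored files against the snapshot they were backed up from. This is a sketch under that assumption; `compare_trees` and the directory layout are hypothetical, and it only proves the files match, not that the service starts correctly from them.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large assets do not have to fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digest(root):
    """Map each file's path relative to root to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_trees(source, restored):
    """Return (missing, extra, changed) relative paths of restored versus source."""
    src, dst = tree_digest(source), tree_digest(restored)
    missing = sorted(src.keys() - dst.keys())
    extra = sorted(dst.keys() - src.keys())
    changed = sorted(p for p in src.keys() & dst.keys() if src[p] != dst[p])
    return missing, extra, changed


# Small demonstration with temporary directories standing in for the
# mounted snapshot and the restored copy.
with tempfile.TemporaryDirectory() as snap, tempfile.TemporaryDirectory() as rest:
    Path(snap, "a.jpg").write_bytes(b"image-a")
    Path(rest, "a.jpg").write_bytes(b"image-a")
    Path(snap, "b.jpg").write_bytes(b"image-b")
    print(compare_trees(snap, rest))  # (['b.jpg'], [], [])
```

The comparison should run against the read-only snapshot rather than the live data, since the live data keeps changing after the snapshot is taken.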

  • Evotech@lemmy.world
    5 months ago

    Modern image snapshot backups stop the service for an instant and create a local snapshot to back up, while the service runs on a delta; then you apply the delta to the running image.

    • Avid Amoeba@lemmy.caOP
      5 months ago

      When you say stopping the service for an instant you must mean pausing its execution or at least its IO. Actually stopping the service can’t be guaranteed to take an instant. It can’t be guaranteed to start in an instant. Worst of all, it can’t even be guaranteed that it’ll be able to start again. When I say stopping I mean systemctl stop or docker stop or pkill etc. In other words, delivering an orderly, graceful kill signal and waiting for the process/es to stop execution.

  • adr1an@programming.dev
    5 months ago

    Check out the “blue-green” deployment strategy. This is done by many businesses, where an interrupted service might mean losing a sale, or a client forever… I tried it once with Nginx but it was more pain than gain (for my personal use).

    • Avid Amoeba@lemmy.caOP
      5 months ago

      Good suggestion. I’ve done blue-green professionally with services that are built to have high availability and in cloud environments. If I were to actually set up some form of that, I’d probably use ZFS send/recv to keep a backup server always 15 minutes behind and ready to go. I wouldn’t deal with file-based backups that take an hour just to walk the dataset and figure out what’s new. 😅 Probably not happening for now.
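A rough sketch of what one incremental ZFS replication step could look like, written as a command builder so nothing is executed. The dataset names, snapshot names and the host `backup-host` are hypothetical, and a real setup would also need an initial full send, snapshot pruning and error handling.

```python
import shlex


def zfs_replication_commands(dataset, prev_snap, new_snap, remote_host, remote_dataset):
    """Build the shell commands for one incremental replication step.

    Returns (snapshot_command, send_receive_pipeline) as strings.
    """
    take = ["zfs", "snapshot", f"{dataset}@{new_snap}"]
    # -i sends only the blocks that changed between the two snapshots.
    send = ["zfs", "send", "-i", f"{dataset}@{prev_snap}", f"{dataset}@{new_snap}"]
    # -F rolls the destination back to its latest snapshot before receiving,
    # discarding any changes made on the backup side in the meantime.
    recv = ["ssh", remote_host, "zfs", "recv", "-F", remote_dataset]
    return shlex.join(take), f"{shlex.join(send)} | {shlex.join(recv)}"


snapshot_cmd, pipeline = zfs_replication_commands(
    "tank/immich", "t0", "t1", "backup-host", "backup/immich"
)
print(snapshot_cmd)
print(pipeline)
```

Because only changed blocks are sent, there is no need to walk the whole dataset to find what is new, which is what makes a short replication interval plausible.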

  • MaximilianKohler@lemmy.world
    5 months ago

    I ran into a similar problem with snapshots of a forum and email server – if there are scheduled emails when you take the snapshot they get sent out again if you create a new test server from the snapshot. And similarly for the forum.

    I’m not sure what the solution is either. The emails are sent via an SMTP service, so it’s not as simple as disabling email (ports, firewall, etc.) on the new test server.

    • Avid Amoeba@lemmy.caOP
      5 months ago

      Not a VM. Consider the service just a program running on the host OS where either the whole OS or just the service data are sitting on ZFS or LVM.