I saw this post and I was curious what was out there.

https://neuromatch.social/@jonny/113444325077647843

I'd like to put my lab servers to work archiving US federal data that's likely to get pulled - climate and biomed data seem most likely. The most obvious strategy to me seems to be setting up mirror torrents on academictorrents. Is anyone compiling a list of at-risk data yet?

  • tomtomtom@lemmy.world
    2 months ago

    I am using ArchiveBox; it is pretty straightforward to self-host and use.

    However, it is very difficult to archive most news sites with it, and many other sites as well. Cookie-consent and similar pop-ups will render the archived page unusable, and often archiving won’t work at all because some bot protection (Cloudflare etc.) kicks in when ArchiveBox tries to access a site.

    If anyone else has more success using it, please let me know if I am doing something wrong…

    • Daniel Quinn@lemmy.ca
      2 months ago

      Monolith has the same problem here. I think the best resolution might be some sort of browser-plugin-based solution where you could say “archive this” and have it push the result somewhere.

      I wonder if I could combine a dumb plugin with Monolith to do that… A weekend project perhaps.
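
For anyone wanting to try the plugin-plus-Monolith idea, here is a rough sketch of what the receiving side might look like. It assumes the plugin hands over the already-rendered HTML along with the page URL, and that Monolith's stdin mode (`-`) and its `-b` (base URL) and `-o` (output file) options work as described in its README. The function names and the file-naming scheme are made up for illustration; no such plugin exists in this thread.

```python
import re
import shutil
import subprocess
from pathlib import Path
from urllib.parse import urlparse


def output_name(url: str) -> str:
    """Turn a URL into a filesystem-safe .html file name (hypothetical scheme)."""
    parsed = urlparse(url)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", parsed.netloc + parsed.path).strip("-")
    return (slug or "page") + ".html"


def build_command(url: str, out_dir: Path) -> list[str]:
    """Build the Monolith invocation; flags are assumed from Monolith's README."""
    # "-" reads the HTML from stdin; "-b" gives the base URL for relative links.
    return ["monolith", "-", "-b", url, "-o", str(out_dir / output_name(url))]


def archive_rendered_page(url: str, html: str, out_dir: Path) -> Path:
    """Bundle HTML that the browser already rendered into a single file."""
    if shutil.which("monolith") is None:
        raise RuntimeError("monolith was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    cmd = build_command(url, out_dir)
    # Feed the plugin-supplied HTML to Monolith on stdin.
    subprocess.run(cmd, input=html.encode("utf-8"), check=True)
    return Path(cmd[-1])
```

Because the browser has already dealt with the cookie pop-up and any bot check, Monolith would only need to fetch the page's assets. Those asset requests could still be blocked on some sites, so this is a starting point rather than a fix.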