Is there an open source package that the Internet Archive runs, and if so, what is it? I assume sites like archive.is run the same software. I’d like to know whether I can run it myself for self-hosted archiving.
I believe they used Heritrix at one point. The important bit is that they use a standardized archive format (WARC, if I recall correctly). Several tools support it, both for capturing to it and for viewing it. It lets you capture a website in a ‘working’ condition, along with its history. I’m a bit fuzzy on the details since it’s been some time since I looked into it.
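For anyone curious what that archive format looks like on disk: it is most likely WARC (ISO 28500), which stores each capture as a block of plain-text headers followed by the raw payload. Below is a rough, standard-library-only sketch of writing and reading back one uncompressed record. The helper names `build_warc_record` and `parse_warc_record` are made up for illustration, and the choice of a `resource` record with `text/html` content is an assumption. Real crawlers typically gzip each record and add more header fields, so for actual use a dedicated library such as warcio is the safer route.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri, payload):
    """Build one uncompressed WARC/1.0 'resource' record (illustrative helper)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    header_lines = [
        "WARC/1.0",                                    # version line
        "WARC-Type: resource",                         # assumed record type
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",  # unique ID per record
        f"WARC-Date: {now}",                           # capture time in UTC
        f"WARC-Target-URI: {target_uri}",
        "Content-Type: text/html",                     # assumed payload type
        f"Content-Length: {len(payload)}",             # payload size in bytes
    ]
    head = "\r\n".join(header_lines).encode("utf-8") + b"\r\n\r\n"
    # A record is terminated by two CRLF pairs after the payload.
    return head + payload + b"\r\n\r\n"


def parse_warc_record(record):
    """Split a single record back into (header fields, payload bytes)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    fields = {}
    for line in lines[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(":")  # split on the first colon only
        fields[name.strip()] = value.strip()
    payload = rest[: int(fields["Content-Length"])]
    return fields, payload


if __name__ == "__main__":
    rec = build_warc_record("https://example.com/", b"<html>hi</html>")
    fields, body = parse_warc_record(rec)
    print(fields["WARC-Type"], fields["WARC-Target-URI"], len(body))
```

Because every record carries its own capture date and target URI, one file can hold many snapshots of the same page, which is where the "history" part comes from.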
It looks like all of their software lives in the same GitHub organization as Heritrix: https://github.com/orgs/internetarchive/repositories?type=all
https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-Community lists a wide range of options.
Does Linkwarden fit your intended use?
Kind of. Linkwarden seems to save pages as PDFs. That’s better than nothing, but preserving a functional copy of the pages would be better. ArchiveBox seems to do this.
I don’t know for certain, but I’m sure they run lots of different software. They have petabytes of data.