iipc/awesome-web-archiving

grab-site by ArchiveTeam

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

updated at May 19, 2024, 9:13 p.m.

Python

40 +0

1,273 +1

125 +0

GitHub

gogetcrawl by karust

Extract web archive data using Wayback Machine and Common Crawl

updated at May 20, 2024, 4:10 p.m.

Go

5 +0

132 +1

15 +0

GitHub

obelisk by go-shiori

Go package and CLI tool for saving web page as single HTML file

updated at May 21, 2024, 6 a.m.

Go

11 +0

240 -1

15 +0

GitHub

MemGator by oduwsdl

A Memento Aggregator CLI and Server in Go

updated at May 21, 2024, 3:07 p.m.

Go

14 +0

54 +0

11 +0

GitHub

jwat-tools by netarchivesuite

JWAT Tools

updated at May 22, 2024, 5:55 a.m.

Java

7 +0

5 +1

2 +0

GitHub

warc-safe by natliblux

A tool for detecting viruses and NSFW material in WARC files

updated at May 22, 2024, 10:07 a.m.

Python

4 +0

7 +2

0 +0

GitHub

chatnoir-resiliparse by chatnoir-eu

A robust web archive analytics toolkit

updated at May 22, 2024, 12:19 p.m.

Cython

9 +0

45 +1

8 +0

GitHub

warcprox by internetarchive

WARC writing MITM HTTP/S proxy

updated at May 23, 2024, 7:34 a.m.

Python

33 +0

366 +1

54 -1

GitHub

zotero-memento by leonkt

Zotero extension that combats link rot by archiving webpages and journal articles.

updated at May 23, 2024, 9:26 a.m.

JavaScript

7 +0

275 +1

14 +0

GitHub

gowarcserver by nlnwa

No description provided.

updated at May 23, 2024, 1:51 p.m.

Go

7 +0

12 +0

1 +0

GitHub

wpull by ArchiveTeam

Wget-compatible web downloader and crawler.

updated at May 24, 2024, 4:29 a.m.

HTML

23 +0

538 +2

77 +0

GitHub

browsertrix by webrecorder

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

updated at May 24, 2024, 5:41 p.m.

TypeScript

10 +0

131 +1

27 +0

GitHub

DownloadNet by dosyago

💾 DownloadNet - All content you browse online available offline. Search through the full-text of all pages in your browser history. ⭐️ Star to support our work!

updated at May 24, 2024, 8:39 p.m.

JavaScript

42 +0

3,662 +5

137 +0

GitHub

waybackpy by akamhy

Wayback Machine API interface & a command-line tool

updated at May 24, 2024, 11:41 p.m.

Python

10 +0

441 +3

33 +0

GitHub

internetarchive by jjjake

A Python and Command-Line Interface to Archive.org

updated at May 24, 2024, 11:52 p.m.

Python

51 +0

1,539 +6

211 +1

GitHub

wayback by wabarc

An archiving tool with an IM-style interface that prioritizes privacy and accessibility, integrated with various archival services including Internet Archive, archive.today, IPFS, Telegraph, and file systems.

updated at May 25, 2024, 7:22 a.m.

Go

9 +0

1,667 +5

61 +0

GitHub

xdotool by jordansissel

fake keyboard/mouse input, window management, and more

updated at May 25, 2024, 11:19 a.m.

C

56 +0

3,057 +6

311 +1

GitHub

auto-archiver by bellingcat

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

updated at May 25, 2024, 11:36 a.m.

Python

19 +0

478 +1

53 +0

GitHub

ipwb by oduwsdl

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

updated at May 25, 2024, 1:58 p.m.

Python

23 +0

591 +1

39 +0

GitHub

chrome-remote-interface by cyrus-and

Chrome Debugging Protocol interface for Node.js

updated at May 25, 2024, 2:40 p.m.

JavaScript

81 +0

4,195 +3

300 +0

GitHub
