iipc/awesome-web-archiving

ArchiveBox by ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

created at May 5, 2017, 8:50 a.m.

Python

174 +0

22,334 +88

1,184 +6

GitHub

internetarchive by jjjake

A Python and Command-Line Interface to Archive.org

created at Aug. 15, 2012, 7:18 p.m.

Python

56 +0

1,625 +9

219 +1

GitHub

grab-site by ArchiveTeam

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

created at Feb. 5, 2015, 5:01 a.m.

Python

40 +0

1,398 +4

135 +0

GitHub

twarc by DocNow

A command-line tool (and Python library) for archiving Twitter JSON

created at Jan. 14, 2013, 2:35 p.m.

Python

35 +0

1,370 +0

255 +0

GitHub

wikiteam by WikiTeam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.

created at June 25, 2014, 10:18 a.m.

Python

40 +0

729 +0

149 +0

GitHub

brozzler by internetarchive

Distributed browser-based web crawler

created at July 13, 2015, 11:48 p.m.

Python

40 +0

671 +2

97 +0

GitHub

ipwb by oduwsdl

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

created at March 4, 2016, 3:01 p.m.

Python

23 +0

617 +1

39 +0

GitHub

auto-archiver by bellingcat

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

created at Jan. 15, 2021, 10:30 a.m.

Python

22 +1

570 +4

60 +1

GitHub

waybackpy by akamhy

Wayback Machine API interface & a command-line tool

created at May 2, 2020, 9:19 a.m.

Python

10 +0

480 +2

33 +0

GitHub

archivenow by oduwsdl

A Tool To Push Web Resources Into Web Archives

created at Feb. 9, 2017, 12:29 p.m.

Python

22 +1

410 +2

40 +0

GitHub

WarcDB by Florents-Tselai

WarcDB: Web crawl data as SQLite databases.

created at May 29, 2022, 11:09 a.m.

Python

10 +0

394 +1

11 +0

GitHub

warcio by webrecorder

Streaming WARC/ARC library for fast web archive IO

created at March 6, 2017, 6:17 p.m.

Python

22 +0

385 +2

58 +0

GitHub

warcprox by internetarchive

WARC-writing MITM HTTP/S proxy

created at Oct. 25, 2013, 11:27 p.m.

Python

39 +0

381 +1

54 +0

GitHub

warctools by internetarchive

Command-line tools and libraries for handling and manipulating WARC files (and HTTP contents)

created at March 22, 2013, 8:52 p.m.

Python

44 +0

152 +0

27 +0

GitHub

warcat by chfoo

Tool and library for handling Web ARChive (WARC) files.

created at April 9, 2013, 4:23 p.m.

Python

11 +0

150 +0

21 +0

GitHub

fbarc by justinlittman

A command-line tool and Python library for archiving data from Facebook using the Graph API.

created at Feb. 14, 2017, 11:45 p.m.

Python

16 +0

77 +0

11 +0

GitHub

ArchiveTools by recrm

A collection of tools for archiving and analysing the internet.

created at Jan. 14, 2015, 6:53 p.m.

Python

6 +0

69 +0

15 +0

GitHub

crau by turicas

Easy-to-use Web archiver

created at Oct. 26, 2019, 7:21 p.m.

Python

4 +0

57 +0

10 +0

GitHub

warcworker by peterk

A dockerized, queued, high-fidelity web archiver based on Squidwarc

created at July 21, 2018, 8:31 a.m.

Python

6 +0

55 +0

9 +0

GitHub

har2warc by webrecorder

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

created at March 16, 2017, 12:14 a.m.

Python

7 +0

46 +1

4 +0

GitHub