iipc/awesome-web-archiving

waybackpy by akamhy

Wayback Machine API interface & a command-line tool

created at May 2, 2020, 9:19 a.m.

Python

10 +0

480 +2

33 +0

archivenow by oduwsdl

A Tool To Push Web Resources Into Web Archives

created at Feb. 9, 2017, 12:29 p.m.

Python

22 +1

410 +2

40 +0

WarcDB by Florents-Tselai

WarcDB: Web crawl data as SQLite databases.

created at May 29, 2022, 11:09 a.m.

Python

10 +0

394 +1

11 +0

warcio by webrecorder

Streaming WARC/ARC library for fast web archive IO

created at March 6, 2017, 6:17 p.m.

Python

22 +0

385 +2

58 +0

warcprox by internetarchive

WARC writing MITM HTTP/S proxy

created at Oct. 25, 2013, 11:27 p.m.

Python

39 +0

381 +1

54 +0

wail by machawk1

Web Archiving Integration Layer: One-Click User-Instigated Preservation

created at March 20, 2013, 2:42 p.m.

Roff

14 +0

350 +0

35 +0

zotero-memento by leonkt

Zotero extension that combats link rot by archiving webpages and journal articles.

created at Aug. 29, 2019, 5:51 p.m.

JavaScript

7 +0

296 +2

14 +0

freeze-dry by WebMemex

Snapshots a web page to get it as a static, self-contained HTML document.

created at July 13, 2017, 11:31 p.m.

TypeScript

11 +0

271 +0

18 +0

obelisk by go-shiori

Go package and CLI tool for saving a web page as a single HTML file

created at March 29, 2020, 12:53 a.m.

Go

10 +0

263 +2

20 +0

browsertrix by webrecorder

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

created at June 28, 2021, 10:46 p.m.

TypeScript

12 +0

201 +1

35 +2

Squidwarc by N0taN3rd

Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, with or without a head.

created at July 20, 2017, 6:57 a.m.

JavaScript

10 +0

169 +0

26 +0

warctools by internetarchive

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

created at March 22, 2013, 8:52 p.m.

Python

44 +0

152 +0

27 +0

warcat by chfoo

Tool and library for handling Web ARChive (WARC) files.

created at April 9, 2013, 4:23 p.m.

Python

11 +0

150 +0

21 +0

gogetcrawl by karust

Extract web archive data using Wayback Machine and Common Crawl

created at June 14, 2019, 7:02 p.m.

Go

5 +0

147 +2

16 +0

ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

created at Aug. 6, 2015, 7:42 p.m.

Scala

15 +1

145 +1

19 +0

aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

created at July 6, 2017, 10:13 a.m.

Scala

15 +0

137 +0

33 +0

wail by N0taN3rd

One-Click User-Instigated Preservation

created at May 26, 2016, 4:52 a.m.

JavaScript

13 +0

122 +0

9 +0

scoop by harvard-lil

🍨 High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web.

created at Sept. 20, 2022, 6:50 p.m.

JavaScript

7 +0

117 +0

8 +0

webarchive-discovery by ukwa

WARC and ARC indexing and discovery tools.

created at Dec. 20, 2012, 12:17 p.m.

Java

24 +0

116 +0

25 +0

solrwayback by netarchivesuite

A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.

created at Feb. 8, 2017, 9:33 a.m.

Java

24 +0

102 +0

21 +0
