iipc/awesome-web-archiving

wget-lua by alard

Wget with Lua extension

created at Aug. 21, 2012, 8:39 p.m.

C

4 +0

22 +0

9 +0


warcworker by peterk

A dockerized, queued, high-fidelity web archiver based on Squidwarc

created at July 21, 2018, 8:31 a.m.

Python

6 +0

53 +0

9 +0


webarchive-indexing by ikreymer

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR, or a local file system.

created at March 9, 2015, 8:32 p.m.

Python

9 +0

41 +0

9 +1


wail by N0taN3rd

One-Click User Instigated Preservation

created at May 26, 2016, 4:52 a.m.

JavaScript

13 +0

120 +0

9 +0


warclight by archivesunleashed

A Rails engine supporting the discovery of web archives.

created at Aug. 3, 2017, 5:45 p.m.

Ruby

5 +0

48 +0

10 +0


WarcDB by Florents-Tselai

WarcDB: Web crawl data as SQLite databases.

created at May 29, 2022, 11:09 a.m.

Python

10 +0

384 +0

11 +0


MemGator by oduwsdl

A Memento Aggregator CLI and Server in Go

created at Sept. 8, 2015, 1:43 a.m.

Go

14 +0

54 +0

11 +0


fbarc by justinlittman

A command-line tool and Python library for archiving data from Facebook using the Graph API.

created at Feb. 14, 2017, 11:45 p.m.

Python

16 +0

77 +0

11 +0


zotero-memento by leonkt

Zotero extension that combats link rot by archiving webpages and journal articles.

created at Aug. 29, 2019, 5:51 p.m.

JavaScript

7 +0

275 +1

14 +0


obelisk by go-shiori

Go package and CLI tool for saving a web page as a single HTML file

created at March 29, 2020, 12:53 a.m.

Go

11 +0

240 -1

15 +0


gogetcrawl by karust

Extract web archive data using Wayback Machine and Common Crawl

created at June 14, 2019, 7:02 p.m.

Go

5 +0

132 +1

15 +0


ArchiveTools by recrm

A collection of tools for archiving and analysing the internet.

created at Jan. 14, 2015, 6:53 p.m.

Python

6 +0

67 +0

15 +0


freeze-dry by WebMemex

Snapshots a web page to get it as a static, self-contained HTML document.

created at July 13, 2017, 11:31 p.m.

TypeScript

11 +0

268 +0

18 +0


solrwayback by netarchivesuite

A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.

created at Feb. 8, 2017, 9:33 a.m.

Java

24 +0

95 +0

18 +0


ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

created at Aug. 6, 2015, 7:42 p.m.

Scala

14 +0

141 +0

19 +0


node-warc by N0taN3rd

Parse and create Web ARChive (WARC) files with Node.js

created at May 21, 2017, 6 a.m.

JavaScript

9 +0

92 +0

20 +0


outbackcdx by nla

Web archive index server based on RocksDB

created at Jan. 15, 2015, 11:53 p.m.

Java

23 +0

29 +0

20 +0


warcat by chfoo

Tool and library for handling Web ARChive (WARC) files.

created at April 9, 2013, 4:23 p.m.

Python

11 +0

136 +0

21 +0


webarchive-discovery by ukwa

WARC and ARC indexing and discovery tools.

created at Dec. 20, 2012, 12:17 p.m.

Java

24 +0

113 +0

24 +0


warctools by internetarchive

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

created at March 22, 2013, 8:52 p.m.

Python

36 +0

141 +0

25 +0
