iipc/awesome-web-archiving

WarcDB by Florents-Tselai

WarcDB: Web crawl data as SQLite databases.

created at May 29, 2022, 11:09 a.m.

Python

10 +0

384 +0

11 +0

GitHub

MemGator by oduwsdl

A Memento Aggregator CLI and Server in Go

created at Sept. 8, 2015, 1:43 a.m.

Go

14 +0

54 +0

11 +0

GitHub

fbarc by justinlittman

A command-line tool and Python library for archiving data from Facebook using the Graph API.

created at Feb. 14, 2017, 11:45 p.m.

Python

16 +0

77 +0

11 +0

GitHub

warclight by archivesunleashed

A Rails engine supporting the discovery of web archives.

created at Aug. 3, 2017, 5:45 p.m.

Ruby

5 +0

48 +0

10 +0

GitHub

crau by turicas

Easy-to-use Web archiver

created at Oct. 26, 2019, 7:21 p.m.

Python

4 +0

53 +0

9 +0

GitHub

wget-lua by alard

Wget with Lua extension

created at Aug. 21, 2012, 8:39 p.m.

C

4 +0

22 +0

9 +0

GitHub

wail by N0taN3rd

One-Click User Instigated Preservation

created at May 26, 2016, 4:52 a.m.

JavaScript

13 +0

120 +0

9 +0

GitHub

webarchive-indexing by ikreymer

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

created at March 9, 2015, 8:32 p.m.

Python

9 +0

41 +0

9 +1

GitHub

warcworker by peterk

A dockerized, queued, high-fidelity web archiver based on Squidwarc

created at July 21, 2018, 8:31 a.m.

Python

6 +0

53 +0

9 +0

GitHub

chatnoir-resiliparse by chatnoir-eu

A robust web archive analytics toolkit

created at June 22, 2021, 9:03 a.m.

Cython

9 +0

45 +1

8 +0

GitHub

jwarc by iipc

Java library for reading and writing WARC files with a typed API

created at Sept. 21, 2015, 3:07 a.m.

Java

5 +0

42 +0

8 +0

GitHub

awesome-memento by machawk1

A list of things related to software, literature, and other content for 🕣 Memento

created at Sept. 16, 2016, 1:33 a.m.

Unknown languages

8 +0

77 +0

8 +0

GitHub

cc-notebooks by commoncrawl

Various Jupyter notebooks about Common Crawl data

created at July 19, 2019, 11:38 a.m.

Jupyter Notebook

16 +0

40 +0

8 +0

GitHub

crocoite by PromyLOPh

Web archiving using Google Chrome

created at Nov. 17, 2017, 6:56 p.m.

Python

8 +0

42 +0

7 +0

GitHub

shine by ukwa

Prototype Solr-powered web archive exploration UI.

created at July 3, 2013, 8:18 p.m.

JavaScript

17 +0

42 +0

7 +0

GitHub

httrack2warc by nla

Converts HTTrack crawls to WARC files

created at Oct. 23, 2017, 5:52 a.m.

Java

20 +0

27 +0

6 +0

GitHub

web-archiving-course by vphill

Web Archiving Course

created at Feb. 22, 2022, 2:33 a.m.

Unknown languages

1 +0

19 +0

6 +0

GitHub

chronicler by CGamesPlay

Offline-first web browser

created at Dec. 27, 2018, 4:01 a.m.

JavaScript

6 +0

83 +0

5 +0

GitHub

scoop by harvard-lil

🍨 High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web.

created at Sept. 20, 2022, 6:50 p.m.

JavaScript

7 +0

101 +0

5 +0

GitHub

Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

created at Jan. 29, 2016, 10:43 a.m.

Scala

3 +0

24 +0

4 +0

GitHub