iipc/awesome-web-archiving

awesome-memento by machawk1

A list of things related to software, literature, and other content for 🕣 Memento

created at Sept. 16, 2016, 1:33 a.m.

Unknown languages

8 +0

77 +0

8 +0

GitHub

monolith by Y2Z

⬛️ CLI tool for saving complete web pages as a single HTML file

created at Feb. 20, 2017, 7:47 a.m.

Rust

62 +0

10,140 +68

287 +2

GitHub

warcat by chfoo

Tool and library for handling Web ARChive (WARC) files.

created at April 9, 2013, 4:23 p.m.

Python

11 +0

136 +0

21 +0

GitHub

webarchive-indexing by ikreymer

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

created at March 9, 2015, 8:32 p.m.

Python

9 +0

41 +0

9 +1

GitHub

wasapi-downloader by sul-dlss

Java application to download WARCs from WASAPI

created at April 28, 2017, 9:15 p.m.

Java

22 +0

6 +0

4 +0

GitHub

heritrix-walkthrough by web-archive-group

None

created at June 1, 2016, 10:35 p.m.

Shell

6 +0

9 +0

1 +0

GitHub

Squidwarc by N0taN3rd

Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium with or without a head

created at July 20, 2017, 6:57 a.m.

JavaScript

10 +0

164 +0

26 +0

GitHub

ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

created at Aug. 6, 2015, 7:42 p.m.

Scala

14 +0

141 +0

19 +0

GitHub

twarc by DocNow

A command-line tool (and Python library) for archiving Twitter JSON

created at Jan. 14, 2013, 2:35 p.m.

Python

35 +0

1,355 +0

254 +1

GitHub

node-cdxj by N0taN3rd

Parse CDXJ (https://github.com/oduwsdl/ORS/wiki/CDXJ) files with Node.js

created at May 18, 2017, 4:45 a.m.

JavaScript

3 +0

0 +0

1 +0

GitHub

wget-lua by alard

Wget with Lua extension

created at Aug. 21, 2012, 8:39 p.m.

C

4 +0

22 +0

9 +0

GitHub

outbackcdx by nla

Web archive index server based on RocksDB

created at Jan. 15, 2015, 11:53 p.m.

Java

23 +0

29 +0

20 +0

GitHub

ArchiveTools by recrm

A collection of tools for archiving and analysing the internet.

created at Jan. 14, 2015, 6:53 p.m.

Python

6 +0

67 +0

15 +0

GitHub

warclight by archivesunleashed

A Rails engine supporting the discovery of web archives.

created at Aug. 3, 2017, 5:45 p.m.

Ruby

5 +0

48 +0

10 +0

GitHub

SingleFile by gildas-lormeau

Web Extension for saving a faithful copy of a complete web page in a single HTML file

created at Sept. 12, 2010, 11:50 p.m.

JavaScript

114 +0

14,008 +55

924 +2

GitHub

solrwayback by netarchivesuite

A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.

created at Feb. 8, 2017, 9:33 a.m.

Java

24 +0

95 +0

18 +0

GitHub

warctools by internetarchive

Command-line tools and libraries for handling and manipulating WARC files (and HTTP contents)

created at March 22, 2013, 8:52 p.m.

Python

36 +0

141 +0

25 +0

GitHub

fbarc by justinlittman

A command-line tool and Python library for archiving data from Facebook using the Graph API.

created at Feb. 14, 2017, 11:45 p.m.

Python

16 +0

77 +0

11 +0

GitHub

warcworker by peterk

A dockerized, queued, high-fidelity web archiver based on Squidwarc

created at July 21, 2018, 8:31 a.m.

Python

6 +0

53 +0

9 +0

GitHub

ipwb by oduwsdl

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

created at March 4, 2016, 3:01 p.m.

Python

23 +0

591 +1

39 +0

GitHub