HadoopConcatGz by helgeho

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

created at Aug. 8, 2016, 1:36 p.m.

Java

2 +0

9 +0

3 +0

GitHub
Mink by machawk1

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

created at Jan. 17, 2014, 6:25 p.m.

JavaScript

6 +0

45 +0

3 +0

GitHub
har2warc by webrecorder

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

created at March 16, 2017, 12:14 a.m.

Python

7 +0

42 +0

3 +0

GitHub
notebooks by archivesunleashed

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

created at Nov. 6, 2019, 3:09 a.m.

Jupyter Notebook

6 +0

21 +0

4 +0

GitHub
wasapi-downloader by sul-dlss

Java application to download WARCs from WASAPI

created at April 28, 2017, 9:15 p.m.

Java

22 +0

6 +0

4 +0

GitHub
Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

created at Jan. 29, 2016, 10:43 a.m.

Scala

3 +0

24 +0

4 +0

GitHub
py-wasapi-client by unt-libraries

A client for the Archive-It and Webrecorder WASAPI Data Transfer API

created at Aug. 10, 2017, 5:25 p.m.

Python

5 +0

14 +0

4 +0

GitHub
wasp by webis-de

None

created at March 25, 2018, 6:58 p.m.

Java

13 +0

25 +0

4 +0

GitHub
arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

created at April 28, 2022, 3:18 p.m.

Scala

19 +0

13 +0

4 +0

GitHub
chronicler by CGamesPlay

Offline-first web browser

created at Dec. 27, 2018, 4:01 a.m.

JavaScript

6 +0

83 +0

5 +0

GitHub
scoop by harvard-lil

🍨 High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web.

created at Sept. 20, 2022, 6:50 p.m.

JavaScript

7 +0

101 +0

5 +0

GitHub
web-archiving-course by vphill

Web Archiving Course

created at Feb. 22, 2022, 2:33 a.m.

Unknown languages

1 +0

19 +0

6 +0

GitHub
httrack2warc by nla

Converts HTTrack crawls to WARC files

created at Oct. 23, 2017, 5:52 a.m.

Java

20 +0

27 +0

6 +0

GitHub
shine by ukwa

Prototype SOLR-powered web archive exploration UI.

created at July 3, 2013, 8:18 p.m.

JavaScript

17 +0

42 +0

7 +0

GitHub
crocoite by PromyLOPh

Web archiving using Google Chrome

created at Nov. 17, 2017, 6:56 p.m.

Python

8 +0

42 +0

7 +0

GitHub
awesome-memento by machawk1

A list of things related to software, literature, and other content for 🕣 Memento

created at Sept. 16, 2016, 1:33 a.m.

Unknown languages

8 +0

77 +0

8 +0

GitHub
jwarc by iipc

Java library for reading and writing WARC files with a typed API

created at Sept. 21, 2015, 3:07 a.m.

Java

5 +0

42 +0

8 +0

GitHub
chatnoir-resiliparse by chatnoir-eu

A robust web archive analytics toolkit

created at June 22, 2021, 9:03 a.m.

Cython

9 +0

45 +1

8 +0

GitHub
cc-notebooks by commoncrawl

Various Jupyter notebooks about Common Crawl data

created at July 19, 2019, 11:38 a.m.

Jupyter Notebook

16 +0

40 +0

8 +0

GitHub
wail by N0taN3rd

🐋 One-Click User Instigated Preservation

created at May 26, 2016, 4:52 a.m.

JavaScript

13 +0

120 +0

9 +0

GitHub