crau by turicas

Easy-to-use Web archiver

updated at March 11, 2024, 6:49 p.m.

Python

4 +0

53 +0

9 +0

GitHub
har2warc by webrecorder

Convert HTTP Archive (HAR) -> Web Archive (WARC) format

updated at March 12, 2024, 12:41 p.m.

Python

7 +0

42 +0

3 +0

GitHub
playback by wabarc

Playback webpages from Wayback Machine

updated at March 21, 2024, 1:56 p.m.

Go

4 +0

6 +0

1 +0

GitHub
webarchive-indexing by ikreymer

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

updated at March 26, 2024, 10:50 p.m.

Python

9 +0

41 +0

9 +1

GitHub
wasp by webis-de

No description provided.

updated at March 30, 2024, 10:57 a.m.

Java

13 +0

25 +0

4 +0

GitHub
webarchive-discovery by ukwa

WARC and ARC indexing and discovery tools.

updated at March 31, 2024, 2:13 p.m.

Java

24 +0

113 +0

24 +0

GitHub
Sparkling by internetarchive

Internet Archive's Sparkling Data Processing Library

updated at April 4, 2024, 12:42 a.m.

Scala

17 +0

10 +0

2 +0

GitHub
unwarcit by emmadickson

No description provided.

updated at April 9, 2024, 9:06 p.m.

Python

5 +0

6 +0

0 +0

GitHub
Mink by machawk1

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

updated at April 10, 2024, 6:35 p.m.

JavaScript

6 +0

45 +0

3 +0

GitHub
warctools by internetarchive

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

updated at April 11, 2024, 9:06 a.m.

Python

36 +0

141 +0

25 +0

GitHub
warcworker by peterk

A dockerized, queued high fidelity web archiver based on Squidwarc

updated at April 23, 2024, 1:39 a.m.

Python

6 +0

53 +0

9 +0

GitHub
Squidwarc by N0taN3rd

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head

updated at April 23, 2024, 1:39 a.m.

JavaScript

10 +0

164 +0

26 +0

GitHub
arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

updated at April 24, 2024, 8:10 p.m.

Scala

19 +0

13 +0

4 +0

GitHub
warc2html by iipc

Converts WARC files to static HTML

updated at April 26, 2024, 4:02 p.m.

Java

10 +0

38 +0

3 +0

GitHub
awesome-memento by machawk1

A list of things related to software, literature, and other content for 🕣 Memento

updated at April 27, 2024, 8:55 a.m.

Unknown language

8 +0

77 +0

8 +0

GitHub
WarcDB by Florents-Tselai

WarcDB: Web crawl data as SQLite databases.

updated at May 1, 2024, 4:03 p.m.

Python

10 +0

384 +0

11 +0

GitHub
node-warc by N0taN3rd

Parse And Create Web ARChive (WARC) files with node.js

updated at May 1, 2024, 4:04 p.m.

JavaScript

9 +0

92 +0

20 +0

GitHub
cc-notebooks by commoncrawl

Various Jupyter notebooks about Common Crawl data

updated at May 1, 2024, 4:06 p.m.

Jupyter Notebook

16 +0

40 +0

8 +0

GitHub
aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

updated at May 1, 2024, 4:39 p.m.

Scala

15 +0

133 +0

33 +0

GitHub
cairn by wabarc

NPM package and CLI tool for saving a web page as a single HTML file

updated at May 1, 2024, 4:40 p.m.

TypeScript

4 +0

37 +0

2 +0

GitHub