wail by N0taN3rd

🐋 One-Click User Instigated Preservation

updated at May 10, 2024, 3:37 a.m.

JavaScript

13 +0

120 +0

9 +0

GitHub
html2warc by steffenfritz

simple script to convert web resources to a single warc file

updated at May 8, 2024, 5:21 a.m.

Python

4 +0

15 +0

2 +0

GitHub
solrwayback by netarchivesuite

A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.

updated at May 7, 2024, 6:08 a.m.

Java

24 +0

95 +0

18 +0

GitHub
py-wasapi-client by unt-libraries

A client for the Archive-It and Webrecorder WASAPI Data Transfer API

updated at May 7, 2024, 3:08 a.m.

Python

5 +0

14 +0

4 +0

GitHub
ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

updated at May 5, 2024, 4:14 a.m.

Scala

14 +0

141 +0

19 +0

GitHub
outbackcdx by nla

Web archive index server based on RocksDB

updated at May 4, 2024, 5:05 a.m.

Java

23 +0

29 +0

20 +0

GitHub
brozzler by internetarchive

brozzler - distributed browser-based web crawler

updated at May 4, 2024, 4:59 a.m.

Python

36 +0

630 +0

93 +0

GitHub
chronicler by CGamesPlay

Offline-first web browser

updated at May 1, 2024, 4:40 p.m.

JavaScript

6 +0

83 +0

5 +0

GitHub
cairn by wabarc

NPM package and CLI tool for saving a web page as a single HTML file

updated at May 1, 2024, 4:40 p.m.

TypeScript

4 +0

37 +0

2 +0

GitHub
aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

updated at May 1, 2024, 4:39 p.m.

Scala

15 +0

133 +0

33 +0

GitHub
cc-notebooks by commoncrawl

Various Jupyter notebooks about Common Crawl data

updated at May 1, 2024, 4:06 p.m.

Jupyter Notebook

16 +0

40 +0

8 +0

GitHub
node-warc by N0taN3rd

Parse And Create Web ARChive (WARC) files with node.js

updated at May 1, 2024, 4:04 p.m.

JavaScript

9 +0

92 +0

20 +0

GitHub
WarcDB by Florents-Tselai

WarcDB: Web crawl data as SQLite databases.

updated at May 1, 2024, 4:03 p.m.

Python

10 +0

384 +0

11 +0

GitHub
awesome-memento by machawk1

A list of things related to software, literature, and other content for 🕣 Memento

updated at April 27, 2024, 8:55 a.m.

Unknown languages

8 +0

77 +0

8 +0

GitHub
warc2html by iipc

Converts WARC files to static HTML

updated at April 26, 2024, 4:02 p.m.

Java

10 +0

38 +0

3 +0

GitHub
arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

updated at April 24, 2024, 8:10 p.m.

Scala

19 +0

13 +0

4 +0

GitHub
Squidwarc by N0taN3rd

Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium with or without a head

updated at April 23, 2024, 1:39 a.m.

JavaScript

10 +0

164 +0

26 +0

GitHub
warcworker by peterk

A dockerized, queued, high-fidelity web archiver based on Squidwarc

updated at April 23, 2024, 1:39 a.m.

Python

6 +0

53 +0

9 +0

GitHub
warctools by internetarchive

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

updated at April 11, 2024, 9:06 a.m.

Python

36 +0

141 +0

25 +0

GitHub
Mink by machawk1

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

updated at April 10, 2024, 6:35 p.m.

JavaScript

6 +0

45 +0

3 +0

GitHub