iipc/awesome-web-archiving

wail by N0taN3rd

One-Click User Instigated Preservation

updated at May 10, 2024, 3:37 a.m.

JavaScript

13 +0

120 +0

9 +0

html2warc by steffenfritz

Simple script to convert web resources to a single WARC file

updated at May 8, 2024, 5:21 a.m.

Python

4 +0

15 +0

2 +0

solrwayback by netarchivesuite

A search interface and Wayback Machine for the UKWA Solr-based warc-indexer framework.

updated at May 7, 2024, 6:08 a.m.

Java

24 +0

95 +0

18 +0

py-wasapi-client by unt-libraries

A client for the Archive-It and Webrecorder WASAPI Data Transfer API

updated at May 7, 2024, 3:08 a.m.

Python

5 +0

14 +0

4 +0

ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction, and derivation for web archives and archival collections, developed at the Internet Archive.

updated at May 5, 2024, 4:14 a.m.

Scala

14 +0

141 +0

19 +0

outbackcdx by nla

Web archive index server based on RocksDB

updated at May 4, 2024, 5:05 a.m.

Java

23 +0

29 +0

20 +0

brozzler by internetarchive

Distributed browser-based web crawler

updated at May 4, 2024, 4:59 a.m.

Python

36 +0

630 +0

93 +0

chronicler by CGamesPlay

Offline-first web browser

updated at May 1, 2024, 4:40 p.m.

JavaScript

6 +0

83 +0

5 +0

cairn by wabarc

NPM package and CLI tool for saving a web page as a single HTML file

updated at May 1, 2024, 4:40 p.m.

TypeScript

4 +0

37 +0

2 +0

aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

updated at May 1, 2024, 4:39 p.m.

Scala

15 +0

133 +0

33 +0

cc-notebooks by commoncrawl

Various Jupyter notebooks about Common Crawl data

updated at May 1, 2024, 4:06 p.m.

Jupyter Notebook

16 +0

40 +0

8 +0

node-warc by N0taN3rd

Parse and create Web ARChive (WARC) files with Node.js

updated at May 1, 2024, 4:04 p.m.

JavaScript

9 +0

92 +0

20 +0

WarcDB by Florents-Tselai

WarcDB: Web crawl data as SQLite databases.

updated at May 1, 2024, 4:03 p.m.

Python

10 +0

384 +0

11 +0

awesome-memento by machawk1

A list of software, literature, and other content related to 🕣 Memento

updated at April 27, 2024, 8:55 a.m.

Unknown languages

8 +0

77 +0

8 +0

warc2html by iipc

Converts WARC files to static HTML

updated at April 26, 2024, 4:02 p.m.

Java

10 +0

38 +0

3 +0

arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

updated at April 24, 2024, 8:10 p.m.

Scala

19 +0

13 +0

4 +0

Squidwarc by N0taN3rd

Squidwarc is a high-fidelity, user-scriptable archival crawler that uses Chrome or Chromium, with or without a head

updated at April 23, 2024, 1:39 a.m.

JavaScript

10 +0

164 +0

26 +0

warcworker by peterk

A Dockerized, queued, high-fidelity web archiver based on Squidwarc

updated at April 23, 2024, 1:39 a.m.

Python

6 +0

53 +0

9 +0

warctools by internetarchive

Command-line tools and libraries for handling and manipulating WARC files (and HTTP contents)

updated at April 11, 2024, 9:06 a.m.

Python

36 +0

141 +0

25 +0

Mink by machawk1

Chrome extension that uses Memento to indicate when the live web page a user is viewing has an archived copy, and to give the user access to that copy

updated at April 10, 2024, 6:35 p.m.

JavaScript

6 +0

45 +0

3 +0
