iipc/awesome-web-archiving

warc-safe by natliblux

A tool for detecting viruses and NSFW material in WARC files

created at May 3, 2024, 6:24 a.m.

Python

4 +0

5 +0

0 +0

GitHub

scoop by harvard-lil

🍨 High-fidelity, browser-based, single-page web archiving library and CLI for witnessing the web.

created at Sept. 20, 2022, 6:50 p.m.

JavaScript

7 -1

100 +1

5 +1

GitHub

WarcDB by Florents-Tselai

WarcDB: Web crawl data as SQLite databases.

created at May 29, 2022, 11:09 a.m.

Python

10 -1

384 +0

11 +0

GitHub

arch by internetarchive

Web application for distributed compute analysis of Archive-It web archive collections.

created at April 28, 2022, 3:18 p.m.

Scala

19 +0

13 +0

4 +0

GitHub

Sparkling by internetarchive

Internet Archive's Sparkling Data Processing Library

created at April 28, 2022, 2:28 p.m.

Scala

17 +0

10 +0

2 +0

GitHub

web-archiving-course by vphill

Web Archiving Course

created at Feb. 22, 2022, 2:33 a.m.

Unknown languages

1 +0

19 +0

6 +0

GitHub

unwarcit by emmadickson

No description provided.

created at Dec. 11, 2021, 7:19 p.m.

Python

5 +0

6 +0

0 +0

GitHub

warc2html by iipc

Converts WARC files to static HTML

created at Nov. 8, 2021, 4:09 a.m.

Java

10 +0

38 +0

3 +0

GitHub

browsertrix by webrecorder

Browsertrix is the hosted, high-fidelity, browser-based crawling service from Webrecorder designed to make web archiving easier and more accessible for all!

created at June 28, 2021, 10:46 p.m.

TypeScript

10 +0

127 +3

26 +0

GitHub

Zotero-Robust-Links-Extension by lanl

Create Robust Links from within Zotero

created at June 28, 2021, 9:38 p.m.

JavaScript

3 +0

17 +0

2 +0

GitHub

chatnoir-resiliparse by chatnoir-eu

A robust web archive analytics toolkit

created at June 22, 2021, 9:03 a.m.

Cython

9 +0

44 +2

8 +0

GitHub

playback by wabarc

Play back webpages from the Wayback Machine

created at April 8, 2021, 2:21 p.m.

Go

4 +0

6 +0

1 +0

GitHub

gowarcserver by nlnwa

No description provided.

created at Jan. 15, 2021, 10:42 a.m.

Go

7 +0

12 +0

1 +0

GitHub

auto-archiver by bellingcat

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

created at Jan. 15, 2021, 10:30 a.m.

Python

19 +0

474 +4

53 +0

GitHub

browsertrix-crawler by webrecorder

Run a high-fidelity browser-based crawler in a single Docker container

created at Nov. 2, 2020, 4:37 a.m.

TypeScript

23 +0

551 +4

69 +1

GitHub

cairn by wabarc

NPM package and CLI tool for saving a web page as a single HTML file

created at Oct. 8, 2020, 7:18 a.m.

TypeScript

4 +0

37 +0

2 +0

GitHub

wayback by wabarc

An archiving tool with an IM-style interface that prioritizes privacy and accessibility, integrated with various archival services including Internet Archive, archive.today, IPFS, Telegraph, and file systems.

created at June 13, 2020, 10:08 a.m.

Go

9 -2

1,658 +8

59 -1

GitHub

waybackpy by akamhy

Wayback Machine API interface & a command-line tool

created at May 2, 2020, 9:19 a.m.

Python

10 +0

435 +2

33 +0

GitHub

obelisk by go-shiori

Go package and CLI tool for saving a web page as a single HTML file

created at March 29, 2020, 12:53 a.m.

Go

11 +0

241 +3

15 +0

GitHub

DownloadNet by dosyago

💾 DownloadNet - All content you browse online available offline. Search through the full-text of all pages in your browser history. ⭐️ Star to support our work!

created at Dec. 20, 2019, 9:47 a.m.

JavaScript

42 +0

3,654 +4

137 +0

GitHub
