auto-archiver by bellingcat

Automatically archive links to videos, images, and social media content from Google Sheets (and more).

created at Jan. 15, 2021, 10:30 a.m.

Python

19 +0

478 +1

53 +0

httrack2warc by nla

Converts HTTrack crawls to WARC files

created at Oct. 23, 2017, 5:52 a.m.

Java

20 +0

27 +0

6 +0

archivenow by oduwsdl

A Tool To Push Web Resources Into Web Archives

created at Feb. 9, 2017, 12:29 p.m.

Python

21 +0

392 +0

41 +0

warcio by webrecorder

Streaming WARC/ARC library for fast web archive IO

created at March 6, 2017, 6:17 p.m.

Python

22 +0

349 +0

55 +1

wasapi-downloader by sul-dlss

Java application to download WARCs from WASAPI

created at April 28, 2017, 9:15 p.m.

Java

22 +0

6 +0

4 +0

outbackcdx by nla

Web archive index server based on RocksDB

created at Jan. 15, 2015, 11:53 p.m.

Java

23 +0

29 +0

20 +0

wpull by ArchiveTeam

Wget-compatible web downloader and crawler.

created at Dec. 7, 2013, 1:03 p.m.

HTML

23 +0

538 +2

77 +0

ipwb by oduwsdl

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

created at March 4, 2016, 3:01 p.m.

Python

23 +0

591 +1

39 +0

webarchive-discovery by ukwa

WARC and ARC indexing and discovery tools.

created at Dec. 20, 2012, 12:17 p.m.

Java

24 +0

113 +0

24 +0

browsertrix-crawler by webrecorder

Run a high-fidelity browser-based crawler in a single Docker container

created at Nov. 2, 2020, 4:37 a.m.

TypeScript

24 +0

557 +5

72 +0

solrwayback by netarchivesuite

A search interface and wayback machine for the UKWA Solr based warc-indexer framework.

created at Feb. 8, 2017, 9:33 a.m.

Java

24 +0

95 +0

18 +0

awesome-website-change-monitoring by edgi-govdata-archiving

A curated list of awesome tools for website diffing and change monitoring.

created at May 24, 2017, 5:33 a.m.

Unknown languages

31 +0

482 +0

31 +0

warcprox by internetarchive

WARC writing MITM HTTP/S proxy

created at Oct. 25, 2013, 11:27 p.m.

Python

33 +0

366 +1

54 -1

twarc by DocNow

A command line tool (and Python library) for archiving Twitter JSON

created at Jan. 14, 2013, 2:35 p.m.

Python

35 +0

1,355 +0

254 +1

brozzler by internetarchive

brozzler - distributed browser-based web crawler

created at July 13, 2015, 11:48 p.m.

Python

36 +0

630 +0

93 +0

warctools by internetarchive

Command line tools and libraries for handling and manipulating WARC files (and HTTP contents)

created at March 22, 2013, 8:52 p.m.

Python

36 +0

141 +0

25 +0

grab-site by ArchiveTeam

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

created at Feb. 5, 2015, 5:01 a.m.

Python

40 +0

1,273 +1

125 +0

wikiteam by WikiTeam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2023, WikiTeam has preserved more than 350,000 wikis.

created at June 25, 2014, 10:18 a.m.

Python

40 +0

696 +3

145 +1

DownloadNet by dosyago

💾 DownloadNet - All content you browse online available offline. Search through the full-text of all pages in your browser history. ⭐️ Star to support our work!

created at Dec. 20, 2019, 9:47 a.m.

JavaScript

42 +0

3,662 +5

137 +0

internetarchive by jjjake

A Python and Command-Line Interface to Archive.org

created at Aug. 15, 2012, 7:18 p.m.

Python

51 +0

1,539 +6

211 +1
