iipc/awesome-web-archiving

html2warc by steffenfritz

Simple script to convert web resources to a single WARC file

created at Dec. 30, 2015, 2:29 p.m.

Python

4 +0

15 +0

2 +0

GitHub

jwarc by iipc

Java library for reading and writing WARC files with a typed API

created at Sept. 21, 2015, 3:07 a.m.

Java

5 +0

42 +0

8 +0

GitHub

shine by ukwa

Prototype Solr-powered web archive exploration UI.

created at July 3, 2013, 8:18 p.m.

JavaScript

17 +0

42 +0

7 +0

GitHub

Mink by machawk1

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

created at Jan. 17, 2014, 6:25 p.m.

JavaScript

6 +0

45 +0

3 +0

GitHub

node-warc by N0taN3rd

Parse and create Web ARChive (WARC) files with Node.js

created at May 21, 2017, 6 a.m.

JavaScript

9 +0

92 +0

20 +0

GitHub

twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

created at Nov. 29, 2019, 2:52 p.m.

Scala

4 +0

9 +0

2 +0

GitHub

notebooks by archivesunleashed

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

created at Nov. 6, 2019, 3:09 a.m.

Jupyter Notebook

6 +0

21 +0

4 +0

GitHub

pywb by webrecorder

Core Python Web Archiving Toolkit for replay and recording of web archives

created at Dec. 9, 2013, 3:30 a.m.

JavaScript

61 +0

1,317 +4

206 -1

GitHub

py-wasapi-client by unt-libraries

A client for the Archive-It and Webrecorder WASAPI Data Transfer API

created at Aug. 10, 2017, 5:25 p.m.

Python

5 +0

14 +0

4 +0

GitHub

Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

created at Jan. 29, 2016, 10:43 a.m.

Scala

3 +0

24 +0

4 +0

GitHub

MemGator by oduwsdl

A Memento Aggregator CLI and Server in Go

created at Sept. 8, 2015, 1:43 a.m.

Go

14 +0

54 +0

11 +0

GitHub

brozzler by internetarchive

Distributed browser-based web crawler

created at July 13, 2015, 11:48 p.m.

Python

36 +0

630 +0

93 +0

GitHub

webarchive by richardlehane

Go readers for the ARC and WARC web archive formats

created at Sept. 21, 2015, 6:38 a.m.

Go

7 +0

17 +0

2 +0

GitHub

tikalinkextract by httpreserve

Tika-based link (URL) extractor for httpreserve

created at April 3, 2017, 2:35 a.m.

HTML

4 +0

8 +0

1 +0

GitHub

flameshot by flameshot-org

Powerful yet simple to use screenshot software

created at May 10, 2017, 7:44 p.m.

C++

205 +0

23,358 +42

1,507 +3

GitHub

wail by machawk1

Web Archiving Integration Layer: One-Click User Instigated Preservation

created at March 20, 2013, 2:42 p.m.

Roff

14 +0

345 +0

32 +0

GitHub

obelisk by go-shiori

Go package and CLI tool for saving a web page as a single HTML file

created at March 29, 2020, 12:53 a.m.

Go

11 +0

240 -1

15 +0

GitHub

cairn by wabarc

NPM package and CLI tool for saving a web page as a single HTML file

created at Oct. 8, 2020, 7:18 a.m.

TypeScript

4 +0

37 +0

2 +0

GitHub

ArchiveBox by ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

created at May 5, 2017, 8:50 a.m.

Python

171 +1

20,012 +65

1,089 +3

GitHub

httrack2warc by nla

Converts HTTrack crawls to WARC files

created at Oct. 23, 2017, 5:52 a.m.

Java

20 +0

27 +0

6 +0

GitHub