html2warc by steffenfritz

Simple script to convert web resources to a single WARC file

created at Dec. 30, 2015, 2:29 p.m.

Python

4 +0

15 +0

2 +0

GitHub
jwarc by iipc

Java library for reading and writing WARC files with a typed API

created at Sept. 21, 2015, 3:07 a.m.

Java

5 +0

42 +0

8 +0

GitHub
shine by ukwa

Prototype SOLR-powered web archive exploration UI.

created at July 3, 2013, 8:18 p.m.

JavaScript

17 +0

42 +0

7 +0

GitHub
Mink by machawk1

Chrome extension that uses Memento to indicate that a page a user is viewing on the live web has an archived copy and to give the user access to the copy

created at Jan. 17, 2014, 6:25 p.m.

JavaScript

6 +0

45 +0

3 +0

GitHub
node-warc by N0taN3rd

Parse and create Web ARChive (WARC) files with Node.js

created at May 21, 2017, 6 a.m.

JavaScript

9 +0

92 +0

20 +0

GitHub
twut by archivesunleashed

An open-source toolkit for analyzing line-oriented JSON Twitter archives with Apache Spark.

created at Nov. 29, 2019, 2:52 p.m.

Scala

4 +0

9 +0

2 +0

GitHub
notebooks by archivesunleashed

Various examples of notebooks for working with web archives with the Archives Unleashed Toolkit, and derivatives generated by the Archives Unleashed Toolkit.

created at Nov. 6, 2019, 3:09 a.m.

Jupyter Notebook

6 +0

21 +0

4 +0

GitHub
pywb by webrecorder

Core Python Web Archiving Toolkit for replay and recording of web archives

created at Dec. 9, 2013, 3:30 a.m.

JavaScript

61 +0

1,317 +4

206 -1

GitHub
py-wasapi-client by unt-libraries

A client for the Archive-It and Webrecorder WASAPI Data Transfer API

created at Aug. 10, 2017, 5:25 p.m.

Python

5 +0

14 +0

4 +0

GitHub
Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

created at Jan. 29, 2016, 10:43 a.m.

Scala

3 +0

24 +0

4 +0

GitHub
MemGator by oduwsdl

A Memento Aggregator CLI and Server in Go

created at Sept. 8, 2015, 1:43 a.m.

Go

14 +0

54 +0

11 +0

GitHub
brozzler by internetarchive

brozzler - distributed browser-based web crawler

created at July 13, 2015, 11:48 p.m.

Python

36 +0

630 +0

93 +0

GitHub
webarchive by richardlehane

Go readers for the ARC and WARC web archive formats

created at Sept. 21, 2015, 6:38 a.m.

Go

7 +0

17 +0

2 +0

GitHub
tikalinkextract by httpreserve

Tika-based link (URL) extractor for httpreserve

created at April 3, 2017, 2:35 a.m.

HTML

4 +0

8 +0

1 +0

GitHub
flameshot by flameshot-org

Powerful yet simple to use screenshot software 🖥️ 📸

created at May 10, 2017, 7:44 p.m.

C++

205 +0

23,358 +42

1,507 +3

GitHub
wail by machawk1

🐋 Web Archiving Integration Layer: One-Click User Instigated Preservation

created at March 20, 2013, 2:42 p.m.

Roff

14 +0

345 +0

32 +0

GitHub
obelisk by go-shiori

Go package and CLI tool for saving a web page as a single HTML file

created at March 29, 2020, 12:53 a.m.

Go

11 +0

240 -1

15 +0

GitHub
cairn by wabarc

NPM package and CLI tool for saving a web page as a single HTML file

created at Oct. 8, 2020, 7:18 a.m.

TypeScript

4 +0

37 +0

2 +0

GitHub
ArchiveBox by ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...

created at May 5, 2017, 8:50 a.m.

Python

171 +1

20,012 +65

1,089 +3

GitHub
httrack2warc by nla

Converts HTTrack crawls to WARC files

created at Oct. 23, 2017, 5:52 a.m.

Java

20 +0

27 +0

6 +0

GitHub