webarchive-indexing by ikreymer

Tools for bulk indexing of WARC/ARC files on Hadoop, EMR or local file system.

created at March 9, 2015, 8:32 p.m.

Python

9 +0

41 +0

9 +1

GitHub
brozzler by internetarchive

brozzler - distributed browser-based web crawler

created at July 13, 2015, 11:48 p.m.

Python

36 +0

630 +0

93 +0

GitHub
ArchiveSpark by helgeho

An Apache Spark framework for easy data processing, extraction as well as derivation for web archives and archival collections, developed at Internet Archive.

created at Aug. 6, 2015, 7:42 p.m.

Scala

14 +0

141 +0

19 +0

GitHub
MemGator by oduwsdl

A Memento Aggregator CLI and Server in Go

created at Sept. 8, 2015, 1:43 a.m.

Go

14 +0

54 +0

11 +0

GitHub
jwarc by iipc

Java library for reading and writing WARC files with a typed API

created at Sept. 21, 2015, 3:07 a.m.

Java

5 +0

42 +0

8 +0

GitHub
webarchive by richardlehane

golang readers for ARC and WARC webarchive formats

created at Sept. 21, 2015, 6:38 a.m.

Go

7 +0

17 +0

2 +0

GitHub
html2warc by steffenfritz

simple script to convert web resources to a single warc file

created at Dec. 30, 2015, 2:29 p.m.

Python

4 +0

15 +0

2 +0

GitHub
Web2Warc by helgeho

An easy-to-use and highly customizable crawler that enables you to create your own little Web archives (WARC/CDX)

created at Jan. 29, 2016, 10:43 a.m.

Scala

3 +0

24 +0

4 +0

GitHub
ipwb by oduwsdl

InterPlanetary Wayback: A distributed and persistent archive replay system using IPFS

created at March 4, 2016, 3:01 p.m.

Python

23 +0

591 +1

39 +0

GitHub
wail by N0taN3rd

🐋 One-Click User Instigated Preservation

created at May 26, 2016, 4:52 a.m.

JavaScript

13 +0

120 +0

9 +0

GitHub
heritrix-walkthrough by web-archive-group

(no description)

created at June 1, 2016, 10:35 p.m.

Shell

6 +0

9 +0

1 +0

GitHub
HadoopConcatGz by helgeho

A Splittable Hadoop InputFormat for Concatenated GZIP Files and *.(w)arc.gz

created at Aug. 8, 2016, 1:36 p.m.

Java

2 +0

9 +0

3 +0

GitHub
awesome-memento by machawk1

A list of things related to software, literature, and other content for 🕣 Memento

created at Sept. 16, 2016, 1:33 a.m.

Unknown languages

8 +0

77 +0

8 +0

GitHub
badger by dgraph-io

Fast key-value DB in Go.

created at Jan. 26, 2017, 5:09 a.m.

Go

239 +0

13,457 +12

1,151 +2

GitHub
solrwayback by netarchivesuite

A search interface and wayback machine for the UKWA Solr-based warc-indexer framework.

created at Feb. 8, 2017, 9:33 a.m.

Java

24 +0

95 +0

18 +0

GitHub
archivenow by oduwsdl

A Tool To Push Web Resources Into Web Archives

created at Feb. 9, 2017, 12:29 p.m.

Python

21 +0

392 +0

41 +0

GitHub
WarcPartitioner by helgeho

Partition (W)ARC Files by MIME Type and Year

created at Feb. 13, 2017, 3:45 p.m.

Java

2 +0

1 +0

1 +0

GitHub
fbarc by justinlittman

A commandline tool and Python library for archiving data from Facebook using the Graph API.

created at Feb. 14, 2017, 11:45 p.m.

Python

16 +0

77 +0

11 +0

GitHub
monolith by Y2Z

⬛️ CLI tool for saving complete web pages as a single HTML file

created at Feb. 20, 2017, 7:47 a.m.

Rust

62 +0

10,140 +68

287 +2

GitHub
warcio by webrecorder

Streaming WARC/ARC library for fast web archive IO

created at March 6, 2017, 6:17 p.m.

Python

22 +0

349 +0

55 +1

GitHub