adam by bigdatagenomics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

created at Nov. 19, 2013, 11:47 p.m.

Scala

100 +0

1,003 +0

308 +0

GitHub
cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments.

created at April 17, 2015, 7:39 p.m.

Scala

110 +0

997 +1

361 +1

GitHub
crossdata by Stratio

DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities

created at Feb. 6, 2014, 9:41 a.m.

Scala

101 +0

169 +0

51 +0

GitHub
first-edition by spark-in-action

The book's repo

created at March 25, 2015, 2:54 a.m.

Scala

42 +0

273 +0

188 +0

GitHub
aas by sryza

Code to accompany Advanced Analytics with Spark from O'Reilly Media

created at Nov. 8, 2014, 10:18 p.m.

Scala

146 +0

1,520 +0

1,031 +0

GitHub
spark-testing-base by holdenk

Base classes to use when writing tests with Spark

created at Jan. 30, 2015, 10:23 p.m.

Scala

77 +0

1,523 +1

358 +0

GitHub
incubator-toree by apache

Mirror of Apache Toree (Incubating)

created at Jan. 7, 2016, 8 a.m.

Scala

48 +0

740 +0

225 +0

GitHub
spark-jobserver by spark-jobserver

REST job server for Apache Spark

created at Aug. 21, 2014, 11:07 p.m.

Scala

221 +0

2,839 -1

998 +0

GitHub
livy by cloudera

Livy is an open source REST interface for interacting with Apache Spark from anywhere

created at Nov. 17, 2015, 6:55 a.m.

Scala

91 +0

1,009 +0

314 +0

GitHub
sparkling-water by h2oai

Sparkling Water provides H2O functionality inside a Spark cluster

created at Oct. 13, 2014, 11:06 p.m.

Scala

180 +0

968 +1

360 +0

GitHub
graphframes by graphframes

created at Jan. 20, 2016, 11:17 p.m.

Scala

59 +0

1,001 +2

237 +0

GitHub
spark-cassandra-connector by datastax

DataStax Connector for Apache Spark to Apache Cassandra

created at June 27, 2014, 3:45 p.m.

Scala

163 +0

1,943 +0

918 -1

GitHub
spark-xml by databricks

XML data source for Spark SQL and DataFrames

created at Nov. 26, 2015, 2:46 a.m.

Scala

39 +0

505 +0

226 -1

GitHub
mleap by combust

MLeap: Deploy ML Pipelines to Production

created at Aug. 23, 2016, 3:51 a.m.

Scala

66 +0

1,504 +0

313 +1

GitHub
spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

created at Sept. 24, 2017, 7:36 p.m.

Scala

100 +0

3,871 +6

712 +2

GitHub
aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

created at July 6, 2017, 10:13 a.m.

Scala

15 +0

137 +0

33 +0

GitHub
deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

created at Aug. 7, 2018, 8:55 p.m.

Scala

81 +0

3,308 +1

539 +1

GitHub
incubator-livy by apache

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

created at June 25, 2017, 7 a.m.

Scala

60 +0

888 +2

602 +0

GitHub
delight by datamechanics

A Spark UI and Spark History Server alternative with CPU and Memory metrics! Delight is free, cross-platform, and open-source.

created at Oct. 26, 2020, 1:56 p.m.

Scala

16 +0

342 +0

53 +0

GitHub
delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

created at April 22, 2019, 6:56 p.m.

Scala

217 +0

7,599 +18

1,707 +6

GitHub