awesome-spark/awesome-spark

cromwell by broadinstitute

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.

created at April 17, 2015, 7:39 p.m.

Scala

110 +0

1,000 +1

360 -1

GitHub

docker-spark by sequenceiq

No description provided.

created at July 11, 2014, 3:45 p.m.

Shell

65 +0

765 +0

282 +0

GitHub

photon-ml by linkedin

A scalable machine learning library on Apache Spark

created at Feb. 3, 2016, 1:12 a.m.

Terra

82 +0

792 +0

177 +0

GitHub

crossdata by Stratio

DISCONTINUED - Easy access to big things. Library for Apache Spark extending and improving its capabilities

created at Feb. 6, 2014, 9:41 a.m.

Scala

101 +0

169 +0

51 +0

GitHub

oryx by OryxProject

Oryx 2: Lambda architecture on Apache Spark and Apache Kafka for real-time, large-scale machine learning

created at July 25, 2014, 8:08 p.m.

Java

208 +0

1,787 +0

405 +0

GitHub

first-edition by spark-in-action

The book's repo

created at March 25, 2015, 2:54 a.m.

Scala

42 +0

273 +0

188 +0

GitHub

aas by sryza

Code to accompany Advanced Analytics with Spark from O'Reilly Media

created at Nov. 8, 2014, 10:18 p.m.

Scala

146 +0

1,521 +1

1,029 +0

GitHub

spark-testing-base by holdenk

Base classes to use when writing tests with Spark

created at Jan. 30, 2015, 10:23 p.m.

Scala

77 +0

1,525 +1

357 +0

GitHub

incubator-toree by apache

Mirror of Apache Toree (Incubating)

created at Jan. 7, 2016, 8 a.m.

Scala

48 +0

740 +0

225 +0

GitHub

spark-jobserver by spark-jobserver

REST job server for Apache Spark

created at Aug. 21, 2014, 11:07 p.m.

Scala

221 +0

2,839 -1

998 -1

GitHub

livy by cloudera

Livy is an open source REST interface for interacting with Apache Spark from anywhere

created at Nov. 17, 2015, 6:55 a.m.

Scala

91 +0

1,008 -2

314 +0

GitHub

sparkling-water by h2oai

Sparkling Water provides H2O functionality inside a Spark cluster

created at Oct. 13, 2014, 11:06 p.m.

Scala

179 -1

967 -1

359 -1

GitHub

scikit-learn by scikit-learn

scikit-learn: machine learning in Python

created at Aug. 17, 2010, 9:43 a.m.

Python

2,140 +0

60,290 +65

25,431 +2

GitHub

graphframes by graphframes

No description provided.

created at Jan. 20, 2016, 11:17 p.m.

Scala

59 +0

1,002 +0

238 +0

GitHub

hail by hail-is

Cloud-native genomic dataframes and batch computing

created at Oct. 27, 2015, 8:55 p.m.

Python

55 +0

984 +0

247 +1

GitHub

mongo-spark by mongodb

The MongoDB Spark Connector

created at May 20, 2015, 5:59 p.m.

Java

79 +0

713 +0

311 +1

GitHub

jpmml-evaluator-spark by jpmml

PMML evaluator library for the Apache Spark cluster computing system (http://spark.apache.org/)

created at Nov. 29, 2015, 10:03 a.m.

Java

14 +0

94 +0

43 +0

GitHub

spark-cassandra-connector by datastax

DataStax Connector for Apache Spark to Apache Cassandra

created at June 27, 2014, 3:45 p.m.

Scala

163 +0

1,943 +1

922 +1

GitHub

spark-xml by databricks

XML data source for Spark SQL and DataFrames

created at Nov. 26, 2015, 2:46 a.m.

Scala

39 +0

504 -2

226 +0

GitHub

adam by bigdatagenomics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

created at Nov. 19, 2013, 11:47 p.m.

Scala

99 +0

1,003 +0

309 +0

GitHub