graphframes by graphframes

updated at Nov. 17, 2024, 8:23 p.m.

Scala

59 +0

1,001 +2

237 +0

GitHub
delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at Nov. 17, 2024, 6:58 p.m.

Scala

217 +0

7,599 +18

1,707 +6

GitHub
spark-jobserver by spark-jobserver

REST job server for Apache Spark

updated at Nov. 17, 2024, 2:15 p.m.

Scala

221 +0

2,839 -1

998 +0

GitHub
deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

updated at Nov. 17, 2024, 2:14 p.m.

Scala

81 +0

3,308 +1

539 +1

GitHub
incubator-livy by apache

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

updated at Nov. 16, 2024, 9:40 a.m.

Scala

60 +0

888 +2

602 +0

GitHub
sparkling-water by h2oai

Sparkling Water provides H2O functionality inside a Spark cluster

updated at Nov. 15, 2024, 8:11 p.m.

Scala

180 +0

968 +1

360 +0

GitHub
spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

updated at Nov. 15, 2024, 2:29 p.m.

Scala

100 +0

3,871 +6

712 +2

GitHub
cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments

updated at Nov. 15, 2024, 9:25 a.m.

Scala

110 +0

997 +1

361 +1

GitHub
spark-testing-base by holdenk

Base classes to use when writing tests with Spark

updated at Nov. 15, 2024, 9:20 a.m.

Scala

77 +0

1,523 +1

358 +0

GitHub
kyuubi by apache

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

updated at Nov. 15, 2024, 8:22 a.m.

Scala

62 +0

2,105 +7

914 -2

GitHub
SynapseML by Microsoft

Simple and Distributed Machine Learning

updated at Nov. 14, 2024, 10:15 a.m.

Scala

146 +0

5,068 +3

831 -1

GitHub
neo4j-spark-connector by neo4j

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

updated at Nov. 14, 2024, 9:10 a.m.

Scala

34 +0

313 +0

112 +0

GitHub
mleap by combust

MLeap: Deploy ML Pipelines to Production

updated at Nov. 12, 2024, 3:16 p.m.

Scala

66 +0

1,504 +0

313 +1

GitHub
incubator-toree by apache

Mirror of Apache Toree (Incubating)

updated at Nov. 8, 2024, 5:15 p.m.

Scala

48 +0

740 +0

225 +0

GitHub
spark-fast-tests by mrpowers-io

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

updated at Nov. 8, 2024, 12:32 p.m.

Scala

16 +0

436 +0

77 +0

GitHub
spark-daria by mrpowers-io

Essential Spark extensions and helper methods ✨😲

updated at Nov. 8, 2024, 2:27 a.m.

Scala

34 +0

754 +0

152 +0

GitHub
livy by cloudera

Livy is an open source REST interface for interacting with Apache Spark from anywhere

updated at Nov. 7, 2024, 8:17 a.m.

Scala

91 +0

1,009 +0

314 +0

GitHub
spark-cassandra-connector by datastax

DataStax Connector for Apache Spark to Apache Cassandra

updated at Nov. 6, 2024, 1:04 a.m.

Scala

163 +0

1,943 +0

918 -1

GitHub
aas by sryza

Code to accompany Advanced Analytics with Spark from O'Reilly Media

updated at Nov. 5, 2024, 9:15 a.m.

Scala

146 +0

1,520 +0

1,031 +0

GitHub
adam by bigdatagenomics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

updated at Nov. 4, 2024, 1:06 a.m.

Scala

100 +0

1,003 +0

308 +0

GitHub