awesome-spark/awesome-spark

spark-xml by databricks

XML data source for Spark SQL and DataFrames

updated at May 23, 2024, 1:15 a.m.

Scala

40 +0

487 -1

224 +0

GitHub

blaze by blaze

NumPy and Pandas interface to Big Data

updated at May 23, 2024, 2:20 a.m.

Python

195 +1

3,180 +1

388 -5

GitHub

spark by dotnet

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

updated at May 23, 2024, 6:11 a.m.

C#

91 +0

2,002 +1

309 +0

GitHub

spark-avro by databricks

Avro Data Source for Apache Spark

updated at May 23, 2024, 12:39 p.m.

Scala

70 -1

539 -1

310 +0

GitHub

Mobius by Microsoft

C# and F# language binding and extensions to Apache Spark

updated at May 23, 2024, 7:21 p.m.

C#

145 +0

940 +1

212 +0

GitHub

incubator-toree by apache

Mirror of Apache Toree (Incubating)

updated at May 24, 2024, 8:39 a.m.

Scala

48 +0

733 +1

224 +0

GitHub

graphframes by graphframes

No description provided.

updated at May 24, 2024, 9:11 a.m.

Scala

58 +0

972 +1

232 +0

GitHub

spark-cassandra-connector by datastax

DataStax Connector for Apache Spark to Apache Cassandra

updated at May 24, 2024, 12:26 p.m.

Scala

162 +0

1,932 +1

913 +0

GitHub

sparkling-water by h2oai

Sparkling Water provides H2O functionality inside a Spark cluster

updated at May 24, 2024, 2:50 p.m.

Scala

179 +0

952 +0

363 +0

GitHub

deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

updated at May 24, 2024, 4:08 p.m.

Scala

80 +0

3,145 +1

513 -1

GitHub

hail by hail-is

Cloud-native genomic dataframes and batch computing

updated at May 24, 2024, 7:38 p.m.

Python

55 +0

943 +2

238 +2

GitHub

sparklyr by sparklyr

R interface for Apache Spark

updated at May 25, 2024, 6:00 a.m.

R

73 +0

929 +3

302 +0

GitHub

sedona by apache

A cluster computing framework for processing large-scale geospatial data

updated at May 25, 2024, 7:33 a.m.

Java

96 +1

1,791 +5

648 +2

GitHub

cromwell by broadinstitute

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.

updated at May 25, 2024, 9:18 a.m.

Scala

112 +0

962 +1

351 +1

GitHub

joblib by joblib

Computing with Python functions.

updated at May 25, 2024, 1:34 p.m.

Python

63 +1

3,694 +9

408 +1

GitHub

joblib-spark by joblib

Joblib Apache Spark Backend

updated at May 25, 2024, 1:34 p.m.

Python

9 +0

239 +1

26 +0

GitHub

sparkmagic by jupyter-incubator

Jupyter magics and kernels for working with remote Spark clusters

updated at May 25, 2024, 2:45 p.m.

Python

49 +0

1,288 +2

438 +0

GitHub

delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at May 25, 2024, 3:05 p.m.

Scala

216 +1

6,972 +11

1,591 +7

GitHub

dplyr by tidyverse

dplyr: A grammar of data manipulation

updated at May 25, 2024, 3:41 p.m.

R

245 +0

4,672 +1

2,117 +0

GitHub

spark-testing-base by holdenk

Base classes to use when writing tests with Spark

updated at May 25, 2024, 7:11 p.m.

Scala

78 +0

1,499 +1

358 +0

GitHub