awesome-spark/awesome-spark

iceberg by apache

Apache Iceberg

updated at Nov. 17, 2024, 8:29 p.m.

Java

160 +0

6,464 +20

2,235 +10

GitHub

graphframes by graphframes

updated at Nov. 17, 2024, 8:23 p.m.

Scala

59 +0

1,001 +2

237 +0

GitHub

delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at Nov. 17, 2024, 6:58 p.m.

Scala

217 +0

7,599 +18

1,707 +6

GitHub

sedona by apache

A cluster computing framework for processing large-scale geospatial data

updated at Nov. 17, 2024, 5:14 p.m.

Java

95 +0

1,956 +2

693 -2

GitHub

hudi by apache

Upserts, Deletes And Incremental Processing on Big Data.

updated at Nov. 17, 2024, 5:10 p.m.

Java

1,164 +1

5,436 +21

2,424 -1

GitHub

spark-jobserver by spark-jobserver

REST job server for Apache Spark

updated at Nov. 17, 2024, 2:15 p.m.

Scala

221 +0

2,839 -1

998 +0

GitHub

deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

updated at Nov. 17, 2024, 2:14 p.m.

Scala

81 +0

3,308 +1

539 +1

GitHub

ipex-llm by intel-analytics

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc.

updated at Nov. 17, 2024, 11:33 a.m.

Python

251 +0

6,718 +29

1,264 +3

GitHub

scikit-learn by scikit-learn

scikit-learn: machine learning in Python

updated at Nov. 17, 2024, 1:36 a.m.

Python

2,138 -1

60,149 +80

25,410 +17

GitHub

spark-connect-go by apache

Apache Spark Connect Client for Golang

updated at Nov. 17, 2024, 12:26 a.m.

Go

25 +0

161 +2

32 +0

GitHub

spark by dotnet

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

updated at Nov. 16, 2024, 1:01 p.m.

C#

93 +0

2,024 +1

315 +0

GitHub

koalas by databricks

Koalas: pandas API on Apache Spark

updated at Nov. 16, 2024, 10:21 a.m.

Python

326 +0

3,338 +3

358 +0

GitHub

incubator-livy by apache

Apache Livy is an open source REST interface for interacting with Apache Spark from anywhere.

updated at Nov. 16, 2024, 9:40 a.m.

Scala

60 +0

888 +2

602 +0

GitHub

joblib by joblib

Computing with Python functions.

updated at Nov. 15, 2024, 11:01 p.m.

Python

64 +1

3,876 +11

416 +2

GitHub

sparkling-water by h2oai

Sparkling Water provides H2O functionality inside a Spark cluster

updated at Nov. 15, 2024, 8:11 p.m.

Scala

180 +0

968 +1

360 +0

GitHub

python-deequ by awslabs

Python API for Deequ

updated at Nov. 15, 2024, 5:51 p.m.

Jupyter Notebook

17 +0

730 +3

136 +1

GitHub

chispa by MrPowers

PySpark test helper methods with beautiful error messages

updated at Nov. 15, 2024, 5:38 p.m.

Python

5 +0

620 +3

68 +0

GitHub

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

updated at Nov. 15, 2024, 2:29 p.m.

Scala

100 +0

3,871 +6

712 +2

GitHub

hail by hail-is

Cloud-native genomic dataframes and batch computing

updated at Nov. 15, 2024, 10:30 a.m.

Python

55 +0

984 +2

246 +0

GitHub

cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition from one-off use cases to massive-scale production environments

updated at Nov. 15, 2024, 9:25 a.m.

Scala

110 +0

997 +1

361 +1

GitHub
