awesome-spark/awesome-spark

mongo-spark by mongodb

The MongoDB Spark Connector

updated at May 12, 2024, 6:15 a.m.

Java

79 +0

702 -1

307 +0

GitHub

ipex-llm by intel-analytics

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, etc.

updated at May 12, 2024, 3:48 a.m.

Python

242 +0

6,049 +48

1,204 +2

GitHub

scikit-learn by scikit-learn

scikit-learn: machine learning in Python

updated at May 12, 2024, 2:16 a.m.

Python

2,141 +0

58,265 +63

25,004 +18

GitHub

mleap by combust

MLeap: Deploy ML Pipelines to Production

updated at May 12, 2024, 1 a.m.

Scala

69 +0

1,496 +2

313 +0

GitHub

deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

updated at May 11, 2024, 11:29 p.m.

Scala

80 +0

3,140 +6

514 +0

GitHub

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

updated at May 11, 2024, 9:33 p.m.

Scala

100 +0

3,708 +9

702 +1

GitHub

hail by hail-is

Cloud-native genomic dataframes and batch computing

updated at May 11, 2024, 1:20 p.m.

Python

55 +0

938 +0

235 +0

GitHub

joblib by joblib

Computing with Python functions.

updated at May 11, 2024, 9 a.m.

Python

61 +0

3,679 +9

405 +3

GitHub

kyuubi by apache

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

updated at May 11, 2024, 7:40 a.m.

Scala

62 -1

1,947 +6

860 +1

GitHub

sedona by apache

A cluster computing framework for processing large-scale geospatial data

updated at May 11, 2024, 6:20 a.m.

Java

96 +0

1,784 +4

646 +1

GitHub

spark-testing-base by holdenk

Base classes to use when writing tests with Spark

updated at May 11, 2024, 6 a.m.

Scala

78 +0

1,497 +4

358 +0

GitHub

koalas by databricks

Koalas: pandas API on Apache Spark

updated at May 11, 2024, 3:34 a.m.

Python

316 +0

3,321 +0

355 +0

GitHub

delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at May 10, 2024, 11 p.m.

Scala

215 +0

6,935 +13

1,583 +3

GitHub

cromwell by broadinstitute

Scientific workflow engine designed for simplicity & scalability. Trivially transition between one off use cases to massive scale production environments

updated at May 10, 2024, 10:06 p.m.

Scala

112 +0

959 -1

350 +0

GitHub

Mobius by Microsoft

C# and F# language binding and extensions to Apache Spark

updated at May 10, 2024, 9:08 p.m.

C#

145 +0

939 +2

212 +0

GitHub

adam by bigdatagenomics

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

updated at May 10, 2024, 3:25 p.m.

Scala

100 +0

968 +1

304 -1

GitHub

dplyr by tidyverse

dplyr: A grammar of data manipulation

updated at May 10, 2024, 1:59 p.m.

R

247 +1

4,665 +6

2,119 +1

GitHub

neo4j-spark-connector by neo4j-contrib

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

updated at May 10, 2024, 1:50 p.m.

Scala

35 +0

304 +1

114 +0

GitHub

graphframes by graphframes

updated at May 10, 2024, 11:48 a.m.

Scala

58 +0

972 +1

232 +0

GitHub

SynapseML by Microsoft

Simple and Distributed Machine Learning

updated at May 10, 2024, 10:34 a.m.

Scala

146 +0

4,975 +3

815 +0

GitHub
