delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

updated at June 9, 2024, 3:33 a.m.

Scala

Watchers: 217 (+1)

Stars: 7,110 (+127)

Forks: 1,608 (+12)

ipex-llm by intel-analytics

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, Phi, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, Axolotl, etc.

updated at June 9, 2024, 1:48 a.m.

Python

Watchers: 245 (+1)

Stars: 6,135 (+20)

Forks: 1,212 (+4)

joblib by joblib

Computing with Python functions.

updated at June 9, 2024, 12:29 a.m.

Python

Watchers: 63 (+0)

Stars: 3,706 (+7)

Forks: 409 (+0)

SynapseML by Microsoft

Simple and Distributed Machine Learning

updated at June 8, 2024, 10:20 p.m.

Scala

Watchers: 145 (-1)

Stars: 4,994 (+3)

Forks: 820 (+1)

scikit-learn by scikit-learn

scikit-learn: machine learning in Python

updated at June 8, 2024, 8:56 p.m.

Python

Watchers: 2,140 (-1)

Stars: 58,571 (+82)

Forks: 25,067 (+33)

quinn by MrPowers

PySpark methods to enhance developer productivity 📣 👯 🎉

updated at June 8, 2024, 3:53 p.m.

Python

Watchers: 19 (+0)

Stars: 586 (+2)

Forks: 92 (+0)

hail by hail-is

Cloud-native genomic dataframes and batch computing

updated at June 7, 2024, 11:09 p.m.

Python

Watchers: 55 (+0)

Stars: 946 (+2)

Forks: 238 (+0)

sedona by apache

A cluster computing framework for processing large-scale geospatial data

updated at June 7, 2024, 10:19 p.m.

Java

Watchers: 96 (+0)

Stars: 1,800 (+6)

Forks: 653 (+3)

deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

updated at June 7, 2024, 7:06 p.m.

Scala

Watchers: 80 (+0)

Stars: 3,158 (+5)

Forks: 513 (+0)

cromwell by broadinstitute

Scientific workflow engine designed for simplicity and scalability. Trivially transition from one-off use cases to massive-scale production environments.

updated at June 7, 2024, 4:58 p.m.

Scala

Watchers: 112 (+0)

Stars: 965 (+2)

Forks: 351 (+1)

dplyr by tidyverse

dplyr: A grammar of data manipulation

updated at June 7, 2024, 12:51 p.m.

R

Watchers: 244 (+0)

Stars: 4,677 (+1)

Forks: 2,118 (+0)

kyuubi by apache

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

updated at June 7, 2024, 9:17 a.m.

Scala

Watchers: 64 (+0)

Stars: 1,980 (+12)

Forks: 867 (+2)

spark by dotnet

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

updated at June 7, 2024, 5:56 a.m.

C#

Watchers: 92 (+0)

Stars: 2,004 (+1)

Forks: 309 (-1)

aut by archivesunleashed

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.

updated at June 6, 2024, 8:46 p.m.

Scala

Watchers: 15 (+0)

Stars: 134 (+1)

Forks: 33 (+0)

spark-fast-tests by MrPowers

Apache Spark testing helpers (dependency free & works with Scalatest, uTest, and MUnit)

updated at June 6, 2024, 6:17 p.m.

Scala

Watchers: 15 (+0)

Stars: 422 (+1)

Forks: 74 (+1)

sparkmagic by jupyter-incubator

Jupyter magics and kernels for working with remote Spark clusters

updated at June 6, 2024, 8:59 a.m.

Python

Watchers: 49 (+0)

Stars: 1,294 (+3)

Forks: 439 (+1)

magellan by harsha2010

Geo Spatial Data Analytics on Spark

updated at June 6, 2024, 8:35 a.m.

Scala

Watchers: 65 (+0)

Stars: 534 (+0)

Forks: 149 (+0)

mongo-spark by mongodb

The MongoDB Spark Connector

updated at June 6, 2024, 7:07 a.m.

Java

Watchers: 79 (+0)

Stars: 703 (-1)

Forks: 307 (+0)

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

updated at June 6, 2024, 5:30 a.m.

Scala

Watchers: 99 (-1)

Stars: 3,730 (+6)

Forks: 704 (+0)

neo4j-spark-connector by neo4j-contrib

Neo4j Connector for Apache Spark, which provides bi-directional read/write access to Neo4j from Spark, using the Spark DataSource APIs

updated at June 5, 2024, 3:49 p.m.

Scala

Watchers: 35 (+0)

Stars: 305 (+1)

Forks: 114 (+0)
