scikit-learn by scikit-learn

scikit-learn: machine learning in Python

created at Aug. 17, 2010, 9:43 a.m.

Python

2,138 -1

60,149 +80

25,410 +17

delta by delta-io

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs

created at April 22, 2019, 6:56 p.m.

Scala

217 +0

7,599 +18

1,707 +6

ipex-llm by intel-analytics

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc.

created at Aug. 29, 2016, 7:59 a.m.

Python

251 +0

6,718 +29

1,264 +3

iceberg by apache

Apache Iceberg

created at Nov. 19, 2018, 4:26 p.m.

Java

160 +0

6,464 +20

2,235 +10

hudi by apache

Upserts, Deletes And Incremental Processing on Big Data.

created at Dec. 14, 2016, 3:53 p.m.

Java

1,164 +1

5,436 +21

2,424 -1

SynapseML by Microsoft

Simple and Distributed Machine Learning

created at June 5, 2017, 8:23 a.m.

Scala

146 +0

5,068 +3

831 -1

dplyr by tidyverse

dplyr: A grammar of data manipulation

created at Oct. 28, 2012, 1:39 p.m.

R

244 +0

4,781 +6

2,122 +0

joblib by joblib

Computing with Python functions.

created at May 7, 2010, 6:48 a.m.

Python

64 +1

3,876 +11

416 +2

spark-nlp by JohnSnowLabs

State of the Art Natural Language Processing

created at Sept. 24, 2017, 7:36 p.m.

Scala

100 +0

3,871 +6

712 +2

koalas by databricks

Koalas: pandas API on Apache Spark

created at Jan. 3, 2019, 9:46 p.m.

Python

326 +0

3,338 +3

358 +0

deequ by awslabs

Deequ is a library built on top of Apache Spark for defining "unit tests for data", which measure data quality in large datasets.

created at Aug. 7, 2018, 8:55 p.m.

Scala

81 +0

3,308 +1

539 +1

spark-jobserver by spark-jobserver

REST job server for Apache Spark

created at Aug. 21, 2014, 11:07 p.m.

Scala

221 +0

2,839 -1

998 +0

kyuubi by apache

Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.

created at Dec. 18, 2017, 9:05 a.m.

Scala

62 +0

2,105 +7

914 -2

spark by dotnet

.NET for Apache® Spark™ makes Apache Spark™ easily accessible to .NET developers.

created at April 22, 2019, 6:55 p.m.

C#

93 +0

2,024 +1

315 +0

sedona by apache

A cluster computing framework for processing large-scale geospatial data

created at April 24, 2015, 6:01 p.m.

Java

95 +0

1,956 +2

693 -2

spark-cassandra-connector by datastax

DataStax Connector for Apache Spark to Apache Cassandra

created at June 27, 2014, 3:45 p.m.

Scala

163 +0

1,943 +0

918 -1

oryx by OryxProject

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning

created at July 25, 2014, 8:08 p.m.

Java

208 +0

1,788 +0

405 +0

spark-testing-base by holdenk

Base classes to use when writing tests with Spark

created at Jan. 30, 2015, 10:23 p.m.

Scala

77 +0

1,523 +1

358 +0

aas by sryza

Code to accompany Advanced Analytics with Spark from O'Reilly Media

created at Nov. 8, 2014, 10:18 p.m.

Scala

146 +0

1,520 +0

1,031 +0

mleap by combust

MLeap: Deploy ML Pipelines to Production

created at Aug. 23, 2016, 3:51 a.m.

Scala

66 +0

1,504 +0

313 +1
