A minimal benchmark for scalability, speed, and accuracy of commonly used open-source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib, etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks, etc.).
updated at May 1, 2024, 3:49 p.m.
Materials for GWU DNSC 6279 and DNSC 6290.
updated at April 28, 2024, 10:45 a.m.
Repository of my thesis "Understanding Random Forests"
updated at April 26, 2024, 12:39 p.m.
Materials for STATS 418 - Tools in Data Science course taught in the Master of Applied Statistics at UCLA
updated at April 11, 2024, 7:32 a.m.
Forecast the US demand for electricity
updated at March 25, 2024, 10:38 p.m.
Estimating the number of clusters in a data set via the gap statistic. Implemented in H2O-3
updated at May 18, 2023, 5:19 p.m.
A repository for deploying an AWS EMR cluster and submitting Spark jobs to it. Bootstrapping includes pysparkling by default, so one can easily use H2O with Python and Spark.
updated at March 2, 2022, 9:37 p.m.