A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).
created at March 28, 2015, 12:34 a.m.
Repository of my thesis "Understanding Random Forests"
created at Jan. 6, 2014, 6:57 a.m.
Materials for GWU DNSC 6279 and DNSC 6290.
created at Jan. 12, 2016, 2:40 a.m.
Materials for STATS 418 - Tools in Data Science course taught in the Master of Applied Statistics at UCLA
created at Feb. 10, 2017, 10:12 a.m.
Forecast the US demand for electricity
created at Jan. 14, 2021, 1:03 p.m.
A repository for deploying an AWS EMR cluster and submitting Spark jobs to it. Bootstrapping includes pysparkling by default, so one can easily use H2O with Python and Spark.
created at July 1, 2017, 12:06 a.m.
Estimating the number of clusters in a data set via the gap statistic. Implemented in H2O-3
created at Aug. 5, 2020, 12:05 a.m.