XGBoost on Spark

XGBoost is a widely used library for parallelized gradient tree boosting. It has had R, Python and Julia packages for a while, and the team announced XGBoost4J, a Java/Scala package, just a few days ago (blog post).

The release includes a few examples written in Java and Scala, among them one demonstrating its use with Apache Spark.

Similar APIs

I have been using XGBoost’s R package for the past 6 - 7 months, and I was happy to see that the APIs look similar (ignore the verbosity of the snippets for the time being).

R
xgb_params <- list(
  objective           = "binary:logistic",
  booster             = "gbtree",
  eval_metric         = "logloss",
  eta                 = config$eta,
  max_depth           = config$max_depth,
  subsample           = config$subsample,
  colsample_bytree    = config$colsample_bytree,
  min_child_weight    = config$min_child_weight
)

xgb_model <- xgb.train(
  params              = xgb_params,
  data                = dtrain,
  nrounds             = config$num_rounds,
  verbose             = config$verbosity,
  watchlist           = xgb_watchlist,
  print.every.n       = config$print_every_n,
  early.stop.round    = config$early_stop_round,
  nthread             = config$nthread,
  maximize            = FALSE
)

Scala
val params = new mutable.HashMap[String, Any]()
params += "eta" -> config("eta")
params += "max_depth" -> config("max_depth")
params += "silent" -> config("silent")
params += "objective" -> "binary:logistic"

val watchers = new mutable.HashMap[String, DMatrix]
watchers += "train" -> trainMat
watchers += "test" -> testMat

val model = XGBoost.train(trainRDD,
                          params.toMap,
                          round = config("num_rounds"),
                          watches = watchers.toMap)
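
Once trained, the model can score new data. Here is a minimal end-to-end sketch using the single-machine xgboost4j-scala API (XGBoost.train on a DMatrix plus Booster.predict), which mirrors the Spark call above; the file paths and parameter values are illustrative assumptions, not from the release:

```scala
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix, XGBoost}

// Assumption: train/test sets are available as local LibSVM files;
// these paths are placeholders.
val trainMat = new DMatrix("train.libsvm")
val testMat  = new DMatrix("test.libsvm")

val params = Map(
  "eta" -> 0.1,
  "max_depth" -> 6,
  "objective" -> "binary:logistic"
)

// Single-machine training; watches prints eval metrics each round,
// just like the watchlist argument in the R package.
val booster: Booster = XGBoost.train(
  trainMat, params, round = 50,
  watches = Map("train" -> trainMat, "test" -> testMat))

// predict returns one Array[Float] per row; with binary:logistic,
// each entry is the predicted probability of the positive class.
val preds: Array[Array[Float]] = booster.predict(testMat)
```

The Spark variant returns a model wrapping the same Booster, so prediction on a test DMatrix works the same way once training completes.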

Reference: XGBoost.scala