In this tutorial, you will learn how to create a CatBoost regression model in R. The goal is to build a regression model with the catboost R package in a few simple steps. I assume you already know the basics of gradient boosting, so this tutorial focuses on how to write R code using the catboost package.

If you have not installed CatBoost on your machine, follow the link below to install the CatBoost R package.

https://ampersandacademy.com/tutorials/r-programming/install-catboost-r-package-on-mac-linux-and-windows.

Here we use the Boston Housing dataset to create the CatBoost model. The dataset is already clean, so no data pre-processing is needed.
The target variable is medv (median value of owner-occupied homes, in $1000s).

Attribute Information (the mlbench data frame uses the lowercase form of these names, for example crim and medv):

    1. CRIM      per capita crime rate by town
    2. ZN        proportion of residential land zoned for lots over 
                    25,000 sq.ft.
    3. INDUS     proportion of non-retail business acres per town
    4. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
    5. NOX       nitric oxides concentration (parts per 10 million)
    6. RM        average number of rooms per dwelling
    7. AGE       proportion of owner-occupied units built prior to 1940
    8. DIS       weighted distances to five Boston employment centres
    9. RAD       index of accessibility to radial highways
    10. TAX      full-value property-tax rate per $10,000
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    14. MEDV     Median value of owner-occupied homes in $1000's (the target)

Let's start creating the CatBoost regression model using the catboost R package.

Step1.

Load the catboost R package.

library(catboost)

Step2.

Load the Boston Housing dataset using the mlbench package.

# load libraries
library(mlbench)

# attach the BostonHousing dataset
data(BostonHousing)
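
Before modelling, it can help to take a quick look at what was loaded. The snippet below is a small sketch using only base R functions on the mlbench data frame; it is not required for the rest of the tutorial.

```r
library(mlbench)
data(BostonHousing)

# Number of rows and columns
dim(BostonHousing)

# Column names and types; chas is stored as a factor, the other columns are numeric
str(BostonHousing)

# Distribution of the target variable
summary(BostonHousing$medv)
```

The column names printed by str() are the ones used in the code that follows, which is why the target is written as medv.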

Step3.

Split the dataset into train and test sets using the caret package.

#caret library
library(caret)

# Split the data into a training set and a test set
# createDataPartition returns the row indices of 80% of the data, used for training
set.seed(7)
validation_index <- createDataPartition(BostonHousing$medv, p=0.80, list=FALSE)
# hold out the remaining 20% of the rows as the test set
validation <- BostonHousing[-validation_index,]
# use the selected 80% of the rows to train the model
dataset <- BostonHousing[validation_index,]

Here we use 80% of the data for training and 20% for testing.
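
To confirm the split behaves as described, the sizes of the two parts can be checked directly. This is a minimal sketch that repeats the split from this step so that it runs on its own; the exact row counts depend on the seed and the caret version, so none are quoted here.

```r
library(mlbench)
library(caret)

data(BostonHousing)
set.seed(7)
validation_index <- createDataPartition(BostonHousing$medv, p = 0.80, list = FALSE)
dataset    <- BostonHousing[validation_index, ]
validation <- BostonHousing[-validation_index, ]

# Row counts of each part
c(train = nrow(dataset), test = nrow(validation))

# Share of rows in the training part; close to 0.80, not necessarily exact
nrow(dataset) / nrow(BostonHousing)

# createDataPartition samples within quantile groups of medv,
# so the target should be distributed similarly in both parts
summary(dataset$medv)
summary(validation$medv)
```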

medv is the target variable here.

Step4.

Separate the features (x) and the target (y) of the train and test datasets. This makes it easy to pass them to the catboost package in the next step.

library(dplyr)
y_train <- unlist(dataset[c('medv')])
X_train <- dataset %>% select(-medv)

y_valid <- unlist(validation[c('medv')])
X_valid <- validation %>% select(-medv)
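
If you prefer not to depend on dplyr, the same separation can be done in base R. The sketch below uses a tiny made-up data frame (toy, with hypothetical values) instead of the Boston data so that it runs on its own; the column names rm, lstat and medv are borrowed from the dataset.

```r
# Toy data frame standing in for `dataset` (values are made up)
toy <- data.frame(rm    = c(6.5, 5.9, 7.1),
                  lstat = c(4.9, 12.4, 3.2),
                  medv  = c(24.0, 19.5, 34.7))

# Target as a plain numeric vector
y_toy <- toy$medv

# All columns except the target; drop = FALSE keeps a data frame
X_toy <- toy[, setdiff(names(toy), "medv"), drop = FALSE]

# unlist() on a one-column data frame gives the same values, but with names attached
all.equal(unname(unlist(toy["medv"])), y_toy)
```

Either form should work as the label; the only visible difference is that unlist() returns a named vector.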

Step5.

Convert the train and test datasets to the CatBoost-specific format using the catboost.load_pool function, passing the x and y of each dataset.

train_pool <- catboost.load_pool(data = X_train, label = y_train)
test_pool <- catboost.load_pool(data = X_valid, label = y_valid)

Step6.

Create the input parameters for CatBoost regression.

params <- list(iterations=500,
       learning_rate=0.01,
       depth=10,
       loss_function='RMSE',
       eval_metric='RMSE',
       random_seed = 55,
       od_type='Iter',
       metric_period = 50,
       od_wait=20,
       use_best_model=TRUE)

You can get detailed information about the parameters at the end of this post.

This is a regression example, which is why we use RMSE as both the loss_function and the eval_metric. You can use another regression metric here as well.

iterations: the maximum number of trees that can be built when solving machine learning problems. You can change this value and compare the results.

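Step7.

Build a model using the catboost.train function. Pass the train pool and the parameters to it. Note that, according to the CatBoost documentation, the overfitting detector (od_type, od_wait) and use_best_model need a validation dataset, which this call does not provide.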

model <- catboost.train(learn_pool = train_pool,params = params)
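
As a variation, the test pool can be passed to catboost.train through its test_pool argument, which gives the overfitting detector and use_best_model a metric to monitor. The sketch below repeats the earlier steps so that it runs on its own; the names idx, feature_cols, model_es and y_pred_es are introduced here for illustration. Its results will not necessarily match the numbers reported in this tutorial, and because the same rows are used both for stopping and for scoring, the score is likely to be somewhat optimistic. A separate validation split would be cleaner.

```r
library(catboost)
library(mlbench)
library(caret)

data(BostonHousing)
set.seed(7)
idx        <- createDataPartition(BostonHousing$medv, p = 0.80, list = FALSE)
dataset    <- BostonHousing[idx, ]
validation <- BostonHousing[-idx, ]

# Feature columns are everything except the target
feature_cols <- setdiff(names(BostonHousing), "medv")
train_pool <- catboost.load_pool(data = dataset[, feature_cols],
                                 label = dataset$medv)
test_pool  <- catboost.load_pool(data = validation[, feature_cols],
                                 label = validation$medv)

params <- list(iterations = 500, learning_rate = 0.01, depth = 10,
               loss_function = 'RMSE', eval_metric = 'RMSE',
               random_seed = 55, od_type = 'Iter', od_wait = 20,
               metric_period = 50, use_best_model = TRUE)

# test_pool lets CatBoost track RMSE on held-out rows during training
model_es <- catboost.train(learn_pool = train_pool,
                           test_pool = test_pool,
                           params = params)

y_pred_es <- catboost.predict(model_es, test_pool)
postResample(y_pred_es, validation$medv)
```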

Step8. 

Predict the output using the catboost predict function.

# predict on the test pool
y_pred <- catboost.predict(model, test_pool)

Step9.

Calculate error metrics using the postResample function from the caret package.

#calculate error metrics
postResample(y_pred,validation$medv)

#output
     RMSE  Rsquared       MAE
3.1027671 0.8670278 2.2757869
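
To make the three numbers less of a black box, here is a small sketch that computes them by hand in base R. The obs and pred vectors are made-up values, not output from the model above. To the best of my knowledge, postResample reports Rsquared as the squared correlation between predictions and observations, which is what the sketch assumes.

```r
# Hypothetical observed and predicted values, for illustration only
obs  <- c(24.0, 21.6, 34.7, 33.4, 36.2)
pred <- c(25.1, 20.9, 33.0, 35.0, 34.8)

rmse <- sqrt(mean((pred - obs)^2))  # root mean squared error
mae  <- mean(abs(pred - obs))       # mean absolute error
rsq  <- cor(pred, obs)^2            # squared Pearson correlation

c(RMSE = rmse, Rsquared = rsq, MAE = mae)
```

RMSE and MAE are in the same units as medv, so an RMSE of about 3 corresponds to a typical error of roughly $3,000 in the units of this dataset.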

Conclusion.
Here we get an RMSE of 3.10.

For comparison, visit the link below.

https://ampersandacademy.com/tutorials/python-data-science/random-forest-regression-using-python-sklearn-from-scratch

That tutorial reports a random forest regression RMSE of 3.88 on the same dataset.

Without fine-tuning any parameter other than the number of iterations in both Random Forest and CatBoost, CatBoost gives a lower RMSE here (3.10 versus 3.88).

Parameter Info.

1. iterations

Aliases:
num_boost_round
n_estimators
num_trees

Data type is int.
   
The maximum number of trees that can be built when solving machine learning problems.

When using other parameters that limit the number of iterations, the final number of trees may be less than the number specified in this parameter.

2. learning_rate

Alias: eta

Data type is float

The learning rate. Used for reducing the gradient step.
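
To see why a small learning rate usually goes together with a large number of iterations, consider the toy loop below. It is a deliberately simplified sketch, not CatBoost's algorithm: the "weak learner" is just the mean of the residuals instead of a tree, the starting prediction is assumed to be zero, and the targets in y are made up.

```r
y <- c(10, 20, 30)       # toy targets; their mean is 20
learning_rate <- 0.01
iterations <- 500

prediction <- rep(0, length(y))   # assumed starting point
for (i in seq_len(iterations)) {
  residual <- y - prediction      # direction that reduces squared error
  step <- mean(residual)          # constant "weak learner" fitted to the residuals
  prediction <- prediction + learning_rate * step   # shrunken update
}

# Prediction after 500 small steps, next to the closed-form value
prediction[1]
20 * (1 - (1 - learning_rate)^iterations)
```

Each step closes only 1% of the remaining gap, so even after 500 iterations the prediction is close to, but still below, the best constant of 20. Raising learning_rate gets there in fewer iterations, at the cost of coarser steps.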

3. depth

Alias: max_depth

Data type is int.

Depth of the tree.

The range of supported values depends on the processing unit type and the type of the selected loss function:
A). CPU: any integer up to 16.
B). GPU: any integer up to 8 for pairwise modes (YetiRank, PairLogitPairwise and QueryCrossEntropy) and up to 16 for all other loss functions.

4. loss_function

Alias: objective

Data type is string or object.

The metric to use in training. The specified value also determines the machine learning problem to solve.

5. eval_metric

Data type is string or object.

The metric used for overfitting detection (if enabled) and best model selection (if enabled).

6. random_seed

Alias: random_state

Data type is int    

The random seed used for training.

7. od_type    

Data type is string    

The type of the overfitting detector to use.

Possible values:
IncToDec
Iter

8. metric_period    

Data type is int    

The frequency of iterations to calculate the values of objectives and metrics. The value should be a positive integer.

The usage of this parameter speeds up the training.

9. od_wait

Data type is int

The number of iterations to continue the training after the iteration with the optimal metric value.
The purpose of this parameter differs depending on the selected overfitting detector type:
IncToDec — Ignore the overfitting detector when the threshold is reached and continue learning for the specified number of iterations after the iteration with the optimal metric value.
Iter — Consider the model overfitted and stop training after the specified number of iterations since the iteration with the optimal metric value.
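
The Iter rule can be illustrated with a few lines of base R. The function stop_iteration below is a hypothetical helper written for this explanation, not part of the catboost package, and it only mimics the logic described above: it assumes a metric where lower is better (such as RMSE), counts iterations from 1, and works on a made-up sequence of validation values.

```r
# Hypothetical helper: returns where training would stop under the Iter rule
stop_iteration <- function(metric_values, od_wait) {
  best_value <- Inf
  best_iter <- 0
  for (i in seq_along(metric_values)) {
    if (metric_values[i] < best_value) {
      # New optimum: remember it and reset the waiting period
      best_value <- metric_values[i]
      best_iter <- i
    } else if (i - best_iter >= od_wait) {
      # od_wait iterations have passed without improvement: stop
      return(list(stopped_at = i, best_iter = best_iter))
    }
  }
  # Never triggered: training runs through all iterations
  list(stopped_at = length(metric_values), best_iter = best_iter)
}

# Made-up validation RMSE: improves until iteration 4, then gets worse
rmse_path <- c(5.0, 4.2, 3.9, 3.7, 3.8, 3.9, 4.0, 4.1)
res <- stop_iteration(rmse_path, od_wait = 3)
res
```

In this example the best value is at iteration 4 and training stops at iteration 7, three iterations later. With use_best_model enabled, the kept model would correspond to the best iteration rather than the last one.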

For more information on these parameters, visit the link below.

https://tech.yandex.com/catboost/doc/dg/concepts/python-reference_parameters-list-docpage/