This tutorial explains how to implement the Random Forest Regression algorithm using Python's scikit-learn (sklearn) library.

We are going to use the Boston housing data. You can get the data from the links below.

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

https://github.com/bharathirajatut/python-data-science/tree/master/Random%20Forest%20Regression%20-%20Boston%20Dataset

Attribute Information:

    1. CRIM      per capita crime rate by town
    2. ZN        proportion of residential land zoned for lots over 
                 25,000 sq.ft.
    3. INDUS     proportion of non-retail business acres per town
    4. CHAS      Charles River dummy variable (= 1 if tract bounds 
                 river; 0 otherwise)
    5. NOX       nitric oxides concentration (parts per 10 million)
    6. RM        average number of rooms per dwelling
    7. AGE       proportion of owner-occupied units built prior to 1940
    8. DIS       weighted distances to five Boston employment centres
    9. RAD       index of accessibility to radial highways
    10. TAX      full-value property-tax rate per $10,000
    11. PTRATIO  pupil-teacher ratio by town
    12. B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks 
                 by town
    13. LSTAT    % lower status of the population
    14. PRICE     Median value of owner-occupied homes in $1000's

Using this dataset, we are going to create a machine learning model to predict PRICE, the median value of owner-occupied homes in $1000's.

Error Metrics for Regression
1. Mean Absolute Error
2. Mean Squared Error
3. Root Mean Squared Error

You can use any of the above error metrics to evaluate the random forest regression model. A lower error value indicates a more accurate model, so the lower the error, the better the model.
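For reference, here is a minimal sketch of how these three metrics can be computed by hand with NumPy; the array names y_true and y_pred are placeholders for the actual and predicted target values, not variables from the tutorial's code.

import numpy as np

# Hand-rolled versions of the three error metrics, assuming y_true and
# y_pred are NumPy arrays of actual and predicted values.
def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

Later in the tutorial, the same metrics are computed with sklearn's metrics module instead.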

This tutorial is not meant to cover much theory; instead, it shows how to build a random forest regression model in a few simple steps.

Step1.

Import the pandas library and read the housing CSV file.

You can get the housing.csv file from the link below.
https://github.com/bharathirajatut/python-data-science/tree/master/Random%20Forest%20Regression%20-%20Boston%20Dataset

import pandas as pd
dataset=pd.read_csv("housing.csv")
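If you want to sanity-check the load, a quick optional look at the shape and the first few rows of the DataFrame is enough; this is not part of the original steps.

# Optional: inspect the loaded data
print(dataset.shape)    # (rows, columns) of the housing data
print(dataset.head())   # first five rows, including the PRICE column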

Step2.

The dataset is already preprocessed, so there is no need to preprocess the data. Split the data into x (input variables) and y (target variable).

#split x and y
x = dataset.drop('PRICE', axis = 1)
y = dataset['PRICE']

Step3.

Splitting the dataset into the Training set and Test set.

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split   # sklearn.cross_validation was removed in newer sklearn versions
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
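As an optional check, you can confirm that test_size = 0.25 reserved roughly a quarter of the rows for testing.

# Optional: confirm the sizes of the train/test split
print(X_train.shape, X_test.shape)   # about 75% / 25% of the rows
print(y_train.shape, y_test.shape)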

Step4.
Import the RandomForestRegressor class and create an instance of it.

# Fitting Random Forest Regression to the Training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 50, random_state = 0)

The n_estimators parameter defines the number of trees in the random forest. You can pass any positive integer to n_estimators; however, starting with a low value is recommended.

The random_state parameter seeds the random number generator used when building the trees. Passing a fixed integer (for example, 0) makes the outcome consistent across runs: every time you run the code with random_state=0 (or any other fixed value), you will get the same trained model and the same predictions. You can also pass a NumPy RandomState instance instead of an integer.
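As a quick illustration of that reproducibility (a small sketch, not part of the original tutorial), two regressors built with the same random_state on the same training data produce identical predictions; X_train, y_train, and X_test come from the earlier steps.

from sklearn.ensemble import RandomForestRegressor

# Two models trained with the same seed and the same data behave identically
reg_a = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
reg_b = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
print((reg_a.predict(X_test) == reg_b.predict(X_test)).all())   # True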

Step5.

Pass your training data to train the random forest regressor model.

regressor.fit(X_train, y_train)
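Once fitted, the model exposes a feature_importances_ attribute; printing it alongside the column names is an optional way to see which attributes drive the predictions (pd is the pandas import from Step1).

# Optional: inspect which features the forest relies on most
importances = pd.Series(regressor.feature_importances_, index=x.columns)
print(importances.sort_values(ascending=False))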

Step6.
Predicting the test set results using the random forest regressor model.

# Predicting the Test set results
y_pred = regressor.predict(X_test)
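Before computing any metrics, you can optionally put the actual and predicted prices side by side to eyeball how close they are.

# Optional: compare actual and predicted prices for a few test rows
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison.head())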

Step7.
Evaluating the random forest regression model using the error metrics.

# Evaluating the Algorithm
import numpy as np
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Output

Result for n_estimators=50
Mean Absolute Error: 2.55118110236
Mean Squared Error: 15.7084229921
Root Mean Squared Error: 3.96338529443

That’s all. You have now created a machine learning regression model using Python's sklearn. This is a very simple model; I have not done any fine-tuning of it.

To get a better model, you can try different numbers of trees via the n_estimators parameter and compute the error metrics for each. Take the model with the lowest RMSE value.

Here, I tried various tree sizes, and 40 trees gave me the best result.

Result for n_estimators=40
Mean Absolute Error: 2.52090551181
Mean Squared Error: 15.0942913386
Root Mean Squared Error: 3.88513723549
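One simple way to run this comparison is a small loop over candidate n_estimators values, reusing the training and test split from Step3; a sketch along these lines generates the kind of per-size results summarized at the end of the post.

# Try several tree counts and report the RMSE for each
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
for n in [30, 40, 50, 60, 100]:
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    rmse = np.sqrt(metrics.mean_squared_error(y_test, model.predict(X_test)))
    print(n, 'trees -> RMSE:', rmse)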

Get the full source code from the link below.

https://github.com/bharathirajatut/python-data-science/tree/master/Random%20Forest%20Regression%20-%20Boston%20Dataset

The GitHub repository contains two random forest model files.

The first file is built from the housing.csv file.

The second file uses the Boston dataset that ships with sklearn.
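If you want to follow that second variant, a minimal sketch looks like the following; note that load_boston was deprecated and has been removed from recent scikit-learn releases (1.2 and later), so it only works with older versions.

# Load the Boston dataset bundled with older versions of scikit-learn
from sklearn.datasets import load_boston   # removed in scikit-learn 1.2+
import pandas as pd

boston = load_boston()
dataset = pd.DataFrame(boston.data, columns=boston.feature_names)
dataset['PRICE'] = boston.target
# From here, the steps above (split, fit, predict, evaluate) are the same.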

Results for different tree sizes (n_estimators):

n_estimators    Mean Absolute Error    Mean Squared Error    Root Mean Squared Error
30              2.54162729659          15.5711529309         3.94603002154
40              2.52090551181          15.0942913386         3.88513723549
50              2.55118110236          15.7084229921         3.96338529443
60              2.55049868766          15.9157054243         3.98944926328
100             2.55906299213          16.7221060866         4.0892671821