This tutorial explains how to implement the Random Forest Regression algorithm using Python's scikit-learn (sklearn) library.

We are going to use the Boston housing dataset. You can get the data from the link below.

https://archive.ics.uci.edu/ml/machine-learning-databases/housing/

**Attribute Information:**

1. CRIM: per capita crime rate by town

2. ZN: proportion of residential land zoned for lots over 25,000 sq. ft.

3. INDUS: proportion of non-retail business acres per town

4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)

5. NOX: nitric oxides concentration (parts per 10 million)

6. RM: average number of rooms per dwelling

7. AGE: proportion of owner-occupied units built prior to 1940

8. DIS: weighted distances to five Boston employment centres

9. RAD: index of accessibility to radial highways

10. TAX: full-value property-tax rate per $10,000

11. PTRATIO: pupil-teacher ratio by town

12. B: 1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town

13. LSTAT: % lower status of the population

14. PRICE: median value of owner-occupied homes in $1000's

Using this dataset, we are going to create a machine learning model to predict the price (PRICE) of owner-occupied homes in $1000's.

**Error Metrics for Regression**

1. Mean Absolute Error

2. Mean Squared Error

3. Root Mean Squared Error

You can use any of the above error metrics to evaluate the random forest regression model. A lower error value means a more accurate model, so the lower the error, the better the model.
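To make the three metrics concrete, here is a minimal sketch that computes them by hand on a toy set of true vs. predicted prices (the numbers are made up for illustration):

```python
import numpy as np

# Toy example: three true prices vs. model predictions (in $1000's)
y_true = np.array([24.0, 21.6, 34.7])
y_pred = np.array([22.0, 23.6, 33.7])

errors = y_true - y_pred           # [2.0, -2.0, 1.0]
mae = np.mean(np.abs(errors))      # mean of |errors|: (2 + 2 + 1) / 3
mse = np.mean(errors ** 2)         # mean of squared errors: (4 + 4 + 1) / 3
rmse = np.sqrt(mse)                # square root of MSE

print(mae)   # ≈ 1.667
print(mse)   # 3.0
print(rmse)  # ≈ 1.732
```

Note that MSE (and therefore RMSE) penalizes large individual errors more heavily than MAE does, because the errors are squared before averaging.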

This tutorial is not meant to cover the theory in depth. You can build the random forest regression model in just a few steps.

**Step1.**

Import the pandas library and read the housing CSV file.

You can get the housing file from the link below.

https://github.com/bharathirajatut/python-data-science/tree/master/Random%20Forest%20Regression%20-%20Boston%20Dataset

```
import pandas as pd
dataset=pd.read_csv("housing.csv")
```

**Step2.**

The dataset is already preprocessed, so there is no need to preprocess the data. Split the data into x (input variables) and y (target variable).

```
#split x and y
x = dataset.drop('PRICE', axis = 1)
y = dataset['PRICE']
```

**Step3.**

Splitting the dataset into the Training set and Test set.

```
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
```
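As a sanity check, `test_size=0.25` reserves a quarter of the rows for testing. This sketch verifies the split proportions on synthetic stand-in data (generated with `make_regression`, not the actual housing file):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic data with 100 rows and 13 features, standing in for the housing DataFrame
X, y = make_regression(n_samples=100, n_features=13, random_state=0)

# 75% of rows go to training, 25% to testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

print(len(X_train), len(X_test))  # 75 25
```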

**Step4.**

Import the RandomForestRegressor class and create an object of the class.

```
# Fitting Random Forest Regression to the Training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 50, random_state = 0)
```

The n_estimators parameter defines the number of trees in the random forest. You can pass any positive integer to n_estimators; however, starting with a low value is recommended.

The random_state parameter is the seed used by the random number generator. The only rationale for passing in a fixed int value (0 or otherwise) is to make the outcome consistent across calls: if you call this with random_state=0 (or any other fixed value), you will get the same result every time. If you omit it, the model may differ slightly between runs.
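The effect of random_state can be demonstrated with a small sketch on synthetic data (the toy dataset and tree counts here are illustrative, not part of the housing example):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data, just to demonstrate the effect of random_state
X, y = make_regression(n_samples=100, n_features=4, noise=0.1, random_state=0)

# Two regressors trained with the same seed on the same data
a = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)
b = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Their predictions are identical, element for element
print((a.predict(X) == b.predict(X)).all())  # True
```

Without a fixed random_state, each `fit` call would bootstrap-sample the rows and pick candidate features differently, so repeated runs could produce slightly different forests.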

**Step5.**

Pass your training data to train the random forest regressor model.

`regressor.fit(X_train, y_train)`

**Step6.**

Predicting the test set results using the random forest regressor model.

```
# Predicting the Test set results
y_pred = regressor.predict(X_test)
```

**Step7.**

Evaluating the random forest regression model using the error metrics.

```
# Evaluating the Algorithm
import numpy as np
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```

**Output**

```
Result for n_estimators=50
Mean Absolute Error: 2.55118110236
Mean Squared Error: 15.7084229921
Root Mean Squared Error: 3.96338529443
```

That’s all. You have now created a machine learning regression model using Python and sklearn. This is a very simple model; I have not done any fine-tuning of it.

To get a better model, you can try different numbers of trees using the n_estimators parameter and compute the error metrics for each. Pick the model with the lowest RMSE value.
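The sweep over tree counts can be sketched as a simple loop. This example uses synthetic stand-in data (via `make_regression`) so it runs on its own; with the housing data you would reuse the `X_train`/`X_test` split from Step 3 instead:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data, for illustration only
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# Train one forest per candidate tree count and record its test RMSE
results = {}
for n in [30, 40, 50, 60, 100]:
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    results[n] = rmse
    print('n_estimators=%d  RMSE=%.4f' % (n, rmse))

# The best candidate is the one with the lowest RMSE
best = min(results, key=results.get)
print('Best tree count:', best)
```

Keeping random_state fixed across candidates ensures the comparison reflects the tree count, not random variation between runs.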

In this case, I tried various tree counts, and 40 trees gave me the best result.

```
Result for n_estimators=40
Mean Absolute Error: 2.52090551181
Mean Squared Error: 15.0942913386
Root Mean Squared Error: 3.88513723549
```

**Get the full source code.**

The GitHub repository (linked in Step 1) contains two random forest model files.

The first file is built with the housing CSV file.

The second file is built using sklearn's built-in Boston dataset.

**Results for other tree sizes.**

```
Result for n_estimators=50
Mean Absolute Error: 2.55118110236
Mean Squared Error: 15.7084229921
Root Mean Squared Error: 3.96338529443
Result for n_estimators=40
Mean Absolute Error: 2.52090551181
Mean Squared Error: 15.0942913386
Root Mean Squared Error: 3.88513723549
Result for n_estimators=30
Mean Absolute Error: 2.54162729659
Mean Squared Error: 15.5711529309
Root Mean Squared Error: 3.94603002154
Result for n_estimators=60
Mean Absolute Error: 2.55049868766
Mean Squared Error: 15.9157054243
Root Mean Squared Error: 3.98944926328
Result for n_estimators=100
Mean Absolute Error: 2.55906299213
Mean Squared Error: 16.7221060866
Root Mean Squared Error: 4.0892671821
```