This tutorial explains how to implement the Random Forest Regression algorithm using Python's scikit-learn (sklearn) library.
We are going to use the Boston housing data. You can get the data using the link below. The dataset contains the following columns:
1. CRIM per capita crime rate by town
2. ZN proportion of residential land zoned for lots over
3. INDUS proportion of non-retail business acres per town
4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX nitric oxides concentration (parts per 10 million)
6. RM average number of rooms per dwelling
7. AGE proportion of owner-occupied units built prior to 1940
8. DIS weighted distances to five Boston employment centres
9. RAD index of accessibility to radial highways
10. TAX full-value property-tax rate per $10,000
11. PTRATIO pupil-teacher ratio by town
12. B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks
13. LSTAT % lower status of the population
14. PRICE Median value of owner-occupied homes in $1000's
Using this dataset, we are going to create a machine learning model that predicts the price of owner-occupied homes in $1000's (the PRICE column).
Error Metrics for Regression
1. Mean Absolute Error
2. Mean Squared Error
3. Root Mean Squared Error
You can use any of the above error metrics to evaluate the random forest regression model. A lower error value means a more accurate model, so the goal is to make these metrics as small as possible.
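As a quick illustration of how these three metrics relate to each other, here is a small sketch using made-up numbers (not the housing data):

```python
import numpy as np

# Toy actual and predicted values, for illustration only
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

mae = np.mean(np.abs(y_true - y_pred))   # Mean Absolute Error
mse = np.mean((y_true - y_pred) ** 2)    # Mean Squared Error
rmse = np.sqrt(mse)                      # Root Mean Squared Error

print(mae, mse, rmse)
```

RMSE is simply the square root of MSE, which puts the error back into the same units as the target variable.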
This tutorial is not meant to be heavy on theory; you can build the random forest regression model in just a few steps.
Import the pandas library and read the housing CSV file.
Get the housing file using the below link.
import pandas as pd
dataset = pd.read_csv("housing.csv")
The dataset is already preprocessed, so there is no need to preprocess the data. Split the data into x (input variables) and y (target variable).
# Split into x and y
x = dataset.drop('PRICE', axis = 1)
y = dataset['PRICE']
Splitting the dataset into the Training set and Test set.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.25, random_state = 0)
Import the RandomForestRegressor class and create an object reference for it.
# Fitting Random Forest Regression to the Training set
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor(n_estimators = 50, random_state = 0)
The n_estimators parameter defines the number of trees in the random forest. You can pass any positive integer to n_estimators; however, starting with a low value is recommended.
The random_state parameter is the seed used by the random number generator. The rationale for passing in a fixed int value (0 or otherwise) is to make the outcome consistent across calls: if you call this with random_state=0 (or any other fixed value), you will get the same result every time.
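A quick way to see what a fixed seed buys you is a sketch using NumPy's RandomState directly (this is the same idea sklearn applies internally):

```python
import numpy as np

# Same seed -> identical random draws, which is why a fixed
# random_state makes the forest reproducible across runs
a = np.random.RandomState(0).rand(3)
b = np.random.RandomState(0).rand(3)
c = np.random.RandomState(1).rand(3)

print(np.array_equal(a, b))  # same seed, same draws
print(np.array_equal(a, c))  # different seed, different draws
```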
Pass your training data to the fit method to train the random forest regressor model.
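The training step itself is a single call to fit. A minimal sketch, with tiny synthetic data standing in for the X_train and y_train produced by the split above:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic stand-in for the housing training data
X_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

regressor = RandomForestRegressor(n_estimators=50, random_state=0)
regressor.fit(X_train, y_train)  # train the forest on the training data
```

After fitting, the regressor holds the 50 trained trees in its estimators_ attribute and is ready to make predictions.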
Predicting the test set results using the random forest regressor model.
# Predicting the Test set results
y_pred = regressor.predict(X_test)
Evaluate the random forest regressor model using the error metrics.
# Evaluating the Algorithm
import numpy as np
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
Result for n_estimators=50
Mean Absolute Error: 2.55118110236
Mean Squared Error: 15.7084229921
Root Mean Squared Error: 3.96338529443
That's all. You have now created a machine learning regression model using Python's sklearn. This is a very simple model; I have not done any fine-tuning of it.
To get a better model, you can try different numbers of trees via the n_estimators parameter and compute the error metrics each time. Keep the model with the lowest RMSE value.
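One way to sketch that search, assuming synthetic data in place of the housing set (the feature values and tree counts below are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the housing features/target
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([3.0, -2.0, 1.0]) + 0.1 * rng.rand(200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit one forest per candidate tree count and record the test RMSE
rmse_by_trees = {}
for n in [30, 40, 50, 60, 100]:
    model = RandomForestRegressor(n_estimators=n, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse_by_trees[n] = np.sqrt(mean_squared_error(y_test, preds))

best_n = min(rmse_by_trees, key=rmse_by_trees.get)  # lowest RMSE wins
print(best_n, rmse_by_trees[best_n])
```

For a more thorough search, sklearn's GridSearchCV automates this kind of loop with cross-validation instead of a single train/test split.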
In this case, I tried various tree counts, and n_estimators=40 gave me the best result.
Result for n_estimators=40
Mean Absolute Error: 2.52090551181
Mean Squared Error: 15.0942913386
Root Mean Squared Error: 3.88513723549
Get the full source code from the link below.
The GitHub repository contains two random forest model files.
The first file is developed with the housing CSV file.
The second file is developed using the built-in Boston dataset.
Results for other tree sizes:
Result for n_estimators=50
Mean Absolute Error: 2.55118110236
Mean Squared Error: 15.7084229921
Root Mean Squared Error: 3.96338529443

Result for n_estimators=40
Mean Absolute Error: 2.52090551181
Mean Squared Error: 15.0942913386
Root Mean Squared Error: 3.88513723549

Result for n_estimators=30
Mean Absolute Error: 2.54162729659
Mean Squared Error: 15.5711529309
Root Mean Squared Error: 3.94603002154

Result for n_estimators=60
Mean Absolute Error: 2.55049868766
Mean Squared Error: 15.9157054243
Root Mean Squared Error: 3.98944926328

Result for n_estimators=100
Mean Absolute Error: 2.55906299213
Mean Squared Error: 16.7221060866
Root Mean Squared Error: 4.0892671821