Modeling - Ensemble Learning


Overview of Ensemble Learning

Ensemble learning is a technique that attempts to improve model performance by combining multiple models, reducing bias and variance in the process. Bias is the source of error associated with underfitting and poor performance, while variance is the source of error associated with overfitting and models that are overly sensitive to the training data. Some of the more common ensemble methods are Bagging, Stacking, and Boosting. Essentially, these methods train base learners, which can be the same type of machine learning model or different models, on different samples of the data and then combine their predictions through some form of voting.

Bagging was originally introduced as training similar machine learning base models on random samples drawn with replacement. However, the "with replacement" rule is no longer a hard rule.
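
As a minimal sketch of bagging with scikit-learn (the toy data, parameter values, and random_state here are illustrative assumptions, not choices from the original analysis):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data; the real analysis uses the weather features shown later
X, y = make_classification(n_samples=1000, n_features=7, n_informative=5, random_state=42)

# Bagging: each tree is trained on a bootstrap sample (rows drawn with replacement)
bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `estimator` assumes scikit-learn >= 1.2
    n_estimators=10,
    bootstrap=True,  # sample rows with replacement
    random_state=42,
).fit(X, y)
```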

Stacking is a "strong" ensemble method. "Strong" doesn't refer to "better" but to the strength of the types of base learners. "Stacking allows to use the strength of each individual estimator by using their output as input of a final estimator" [sklearn documentation].
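
A sketch of stacking in scikit-learn, with an assumed (illustrative) mix of base learners feeding a logistic regression meta-estimator:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Heterogeneous "strong" base learners whose outputs feed a final estimator
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(),
)
```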

Boosting is a "weak" ensemble method. Here, "weak" refers to the process of emphasizing misclassified samples with higher weights in an iterative sampling and chaining process.
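
A sketch of boosting via AdaBoost in scikit-learn (the stump depth and number of estimators are assumptions for illustration):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# "Weak" learners (decision stumps) trained in sequence; samples the current
# ensemble misclassifies receive higher weights in the next round
booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # `estimator` assumes scikit-learn >= 1.2
    n_estimators=50,
)
```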

In particular, this section focuses on the ensemble method of Random Forest Classification. A Random Forest consists of multiple Decision Trees ensembled together as an extension of bagging. The process is considered an extension of bagging because it also uses a method known as feature randomness, which samples the columns (features) as well as the rows. Therefore, the strength of a Random Forest lies in its ability to combine samples of both rows and columns to create an uncorrelated forest of decision trees!
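
The row and column sampling just described maps directly onto two scikit-learn parameters, as this sketch shows (the parameter values are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

# Random forest = bagging over rows plus feature randomness over columns:
# each split considers only a random subset of features
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # the column sampling that decorrelates the trees
    bootstrap=True,       # the row sampling inherited from bagging
)
```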


Data Preparation

For a last try at predicting a weather label icon (i.e., the category of the weather for a day), Random Forest Classification will be used to try and predict between Clear Day, Rain, Snow, or Other. The final dataset will include the variables day_of_year, temp, dew, humidity, pressure, latitude, and longitude, along with the icon label, as shown in the prepared data below.


To accomplish this, the weather dataset and resort dataset were used. The resort dataset contains the coordinates, which were merged in. Additionally, the Other icon label collects subcategories such as Fog, Wind, and Overcast. After replacing those values with Other, the class proportions had to be accounted for: with 4 categories to predict, the data was downsampled so that every class matched the size of the smallest one. This still left over 500,000 datapoints to train a model on.
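
A sketch of this relabeling and downsampling in pandas, assuming the merged DataFrame is named `df` and noting that the raw subcategory strings below are hypothetical placeholders:

```python
import pandas as pd

# `df` is assumed to be the merged weather/resort DataFrame with an `icon` column;
# the raw subcategory label strings here are assumptions, not the actual values
df["icon"] = df["icon"].replace(["fog", "wind", "overcast"], "other")

# Downsample every class to the size of the smallest one
min_size = df["icon"].value_counts().min()
balanced = (
    df.groupby("icon", group_keys=False)
      .apply(lambda g: g.sample(n=min_size, random_state=42))
)
```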

After the merging and balancing, the data was split into a training and a testing set. Ensemble learning, as used here, is a supervised machine learning method, so the data trained on must be labeled. The training and testing sets are disjoint, and must be disjoint: using non-disjoint data between testing and training won't give an accurate representation of the performance of the model. First, overlap could hide an overfit model, one that ends up describing noise rather than the underlying distribution. Second, a disjoint testing set stands in for real-world (i.e., unseen) data.
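
A sketch of the disjoint split, reusing the `balanced` DataFrame from the sketch above; the feature names come from the prepared table shown below, while the test fraction, stratification, and random_state are assumptions:

```python
from sklearn.model_selection import train_test_split

features = ["day_of_year", "temp", "dew", "humidity", "pressure", "latitude", "longitude"]
X = balanced[features]
y = balanced["icon"]

# Disjoint train/test split; stratifying keeps the four labels balanced in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```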

The initial resort and weather datasets are illustrated below, along with the merged data and the training and testing datasets.


Weather Data
datetime tempmax tempmin temp feelslikemax feelslikemin feelslike dew humidity precip precipprob precipcover snow snowdepth windgust windspeed winddir pressure cloudcover visibility solarradiation solarenergy uvindex sunrise sunset moonphase icon stations resort tzoffset severerisk type_freezingrain type_ice type_none type_rain type_snow
2019-01-01 16.4 2.0 7.0 10.2 -13.3 -0.6 -1.2 69.1 0.008 100.0 20.83 0.0 20.7 18.30000 10.6 4.9 1014.5 59.0 8.6 116.8 9.9 5.0 07:26:20 16:51:51 0.85 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-02 24.3 -0.9 11.4 21.9 -11.9 5.4 -10.5 39.9 0.004 100.0 4.17 0.0 20.8 29.77377 8.7 353.1 1021.4 0.0 9.9 121.6 10.7 5.0 07:26:27 16:52:41 0.89 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-03 29.0 5.3 17.6 21.9 -4.0 8.6 4.1 56.1 0.004 100.0 4.17 0.2 20.8 32.20000 9.8 328.9 1024.7 0.0 9.8 123.3 10.6 5.0 07:26:31 16:53:33 0.92 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-04 34.0 11.9 23.4 28.7 3.4 17.1 7.0 50.4 0.001 100.0 4.17 0.1 20.8 20.80000 9.0 311.0 1025.5 0.0 9.9 123.7 10.7 5.0 07:26:34 16:54:26 0.96 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-05 34.1 14.3 27.1 29.4 4.3 20.1 1.9 33.9 0.001 100.0 4.17 0.0 20.4 20.80000 10.1 243.5 1022.2 19.4 9.7 110.3 9.6 5.0 07:26:34 16:55:20 0.00 rain ['72467523063', '72206103038', 'CACMC', 'DYGC2', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-06 29.9 18.5 25.9 22.4 5.1 16.1 18.1 72.5 0.035 100.0 58.33 0.6 20.6 33.30000 16.9 266.7 1009.3 78.7 6.3 47.3 4.1 2.0 07:26:32 16:56:16 0.02 snow ['72467523063', '72206103038', 'CACMC', '72038500419', 'DYGC2', 'KCCU', 'KEGE', 'A0000594076', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-07 24.8 14.7 20.3 12.8 2.5 6.5 13.7 75.2 0.004 100.0 8.33 0.4 21.3 45.70000 27.9 271.2 1015.6 83.7 4.8 35.8 3.0 2.0 07:26:27 16:57:13 0.06 snow ['72467523063', '72206103038', 'CACMC', 'DYGC2', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-08 34.6 17.2 25.1 34.6 5.0 17.8 12.0 59.5 0.013 100.0 8.33 0.0 21.3 27.70000 15.2 312.1 1029.4 34.5 9.5 122.9 10.5 5.0 07:26:21 16:58:11 0.09 rain ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-09 38.3 23.0 28.6 38.3 13.6 22.6 9.9 45.4 0.000 0.0 0.00 0.0 21.2 23.00000 13.0 142.9 1029.6 1.0 9.9 114.0 9.8 5.0 07:26:12 16:59:11 0.12 clear-day ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 1 0 0
2019-01-10 33.7 17.0 26.4 33.7 9.8 22.6 14.3 60.6 0.026 100.0 12.50 0.8 21.4 17.20000 8.8 323.7 1023.3 39.9 8.3 75.9 6.6 4.0 07:26:01 17:00:11 0.16 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1

Resort Data
Resort state_province_territory Country City Overall Rating Elevation Difference Elevation Low Elevation High Trails Total Trails Easy Trails Intermediate Trails Difficult Lifts Price Resort Size Run Variety Lifts Quality Latitude Longitude Pass Region
49 Degrees North Mountain Resort Washington United States Chewelah 3.4 564 1196 1760 68.0 20.0 27.0 21.0 7 82.0 3.5 4.0 3.3 48.277375 -117.701815 Other West
Crystal Mountain (WA) Washington United States Sunrise 3.3 796 1341 2137 50.0 8.0 27.0 15.0 11 199.0 3.2 3.6 3.7 46.928167 -121.504535 Ikon West
Mt. Baker Washington United States White Salmon 3.4 455 1070 1525 100.0 24.0 45.0 31.0 10 91.0 3.9 4.3 3.0 45.727775 -121.486699 Other West
Mt. Spokane Washington United States Mead 3.0 610 1185 1795 26.0 6.5 16.0 3.5 7 75.0 2.7 3.1 3.0 47.919072 -117.092505 Other West
Sitzmark Washington United States Tonasket 2.6 155 1330 1485 7.5 2.0 3.0 2.5 2 50.0 1.9 2.4 2.9 48.863907 -119.165077 Other West
Stevens Pass Washington United States Baring 3.3 580 1170 1750 39.0 6.0 18.0 15.0 10 119.0 3.1 3.5 3.6 47.764031 -121.474822 Epic West
The Summit at Snoqualmie Washington United States Snoqualmie Pass 3.0 380 800 1180 27.9 5.2 13.7 9.0 22 135.0 2.6 3.0 3.2 47.405235 -121.412783 Ikon West
Wenatchee Mission Ridge Washington United States Wenatchee 3.2 686 1392 2078 36.0 4.0 21.0 11.0 4 119.0 2.9 3.3 3.6 47.292466 -120.399871 Other West
Abenaki New Hampshire United States Wolfeboro 2.1 70 180 250 2.0 1.2 0.5 0.3 1 24.0 1.4 1.8 1.4 43.609528 -71.229692 Other Northeast
Attitash Mountain Resort New Hampshire United States Bartlett 3.2 533 183 716 37.0 7.4 17.4 12.2 8 129.0 2.9 3.3 3.7 44.084603 -71.221525 Epic Northeast

Data Prepared for Random Forest
day_of_year temp dew humidity pressure latitude longitude icon
174 74.5 44.4 36.5 1015.7 40.637580 -111.478971 clear-day
307 34.2 15.9 46.7 1040.5 39.875755 -105.762776 clear-day
234 62.5 49.3 64.2 1017.5 44.520279 -85.943972 clear-day
121 51.8 15.0 28.0 1014.0 39.336911 -119.872522 clear-day
188 61.5 30.5 35.7 1014.2 39.317521 -120.330546 clear-day
138 42.7 22.8 46.6 1020.5 47.027702 -71.383543 clear-day
108 49.2 16.1 31.0 1025.1 38.534719 -105.998902 clear-day
168 64.1 51.9 66.0 1014.9 43.057705 -86.239159 clear-day
261 64.6 56.2 76.2 1016.7 41.501664 -72.736446 clear-day
112 46.1 13.1 30.1 1013.8 37.772320 -119.092603 clear-day

Training Data for Random Forest
day_of_year temp dew humidity pressure latitude longitude icon
97 36.2 18.3 49.0 1024.1 45.888411 -74.140173 other
215 75.4 40.0 34.1 1024.4 45.265990 -111.253120 clear-day
33 22.7 18.5 84.2 995.9 45.679871 -65.381268 snow
229 67.5 60.3 79.2 1014.4 50.213297 -66.375792 rain
103 30.1 24.2 79.5 1010.3 48.525998 -89.127943 snow
88 45.7 13.0 27.5 1017.6 35.941036 -106.275751 clear-day
31 40.1 35.4 83.3 1015.1 49.844217 -119.682713 snow
164 57.0 44.9 65.2 1010.8 44.353591 -73.861412 rain
59 31.2 6.6 38.6 1029.6 37.937746 -107.820799 clear-day
217 64.3 54.4 73.0 1018.9 44.535401 -80.378871 other

Testing Data for Random Forest
day_of_year temp dew humidity pressure latitude longitude icon
240 50.0 40.0 69.7 1006.9 46.031602 -71.193532 clear-day
135 59.6 56.1 88.3 1012.0 42.500124 -88.190056 rain
325 25.6 22.6 88.4 1015.9 48.892154 -72.238314 other
18 8.2 -1.1 65.5 1002.3 46.449381 -70.537122 snow
200 66.1 36.2 38.1 1024.4 45.261781 -111.308024 clear-day
248 50.6 46.4 86.0 1019.2 48.437691 -77.637239 other
67 23.9 12.8 64.0 1014.0 45.175126 -109.317871 clear-day
123 50.8 48.5 92.0 1017.2 43.678495 -73.991520 rain
23 15.9 6.6 66.8 1013.5 44.743490 -85.514576 snow
28 33.1 29.0 84.7 1018.8 39.153678 -84.888465 other

Coding Ensemble (Random Forest Classification)

The code for the data preparation and performing ensemble random forest classification can be found [here].
This code includes an ensemble of shallow trees to illustrate the process and an ensemble of deeper trees in an attempt to produce better results.


Results - The Process

This ensemble method is known as a random forest because it contains multiple decision trees. To illustrate the process, a random forest of 10 trees (estimators) with a max depth of 3 was trained. The accuracy and confusion matrix are reported, and the root nodes of the first three trees are examined.
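
A sketch of how this shallow forest might be trained and scored, reusing the hypothetical split from the data-preparation sketch (the 10 estimators and max depth of 3 come from the text; random_state is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Shallow illustrative forest: 10 trees, max depth 3
rf_shallow = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
rf_shallow.fit(X_train, y_train)

pred = rf_shallow.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```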

The trees have different root nodes, which illustrates the different subsets created through row sampling and feature randomness. This is difficult to achieve with a single decision tree without dropping entire features. However, note that this illustrative example comes nowhere near pure leaf nodes at the ends of the trees; for visualization purposes, the depth was purposefully kept very shallow for a dataset of this size. This suggests that training a random forest classifier with a greater max depth could increase the accuracy of the model. However, caution must be used to prevent overfitting.
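
One possible way to inspect the tops of the first three trees, assuming the `rf_shallow` model from the sketch above (the figure layout is an assumption; the fitted trees live in the forest's `estimators_` attribute):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Each fitted tree lives in estimators_; draw the root and first split of three trees
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, tree in zip(axes, rf_shallow.estimators_[:3]):
    plot_tree(tree, feature_names=features, max_depth=1, filled=True, ax=ax)
plt.show()
```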


Accuracy and Confusion Matrix for the Shallow Tree Example.



Results - Optimal Model

In an attempt to create a better model, 100 trees (estimators) were used in the random forest classifier ensemble with a max depth of 15. This did increase the accuracy of the model. To check that the ensemble was not overfit, the purity of the final leaves was examined. A majority of the final leaves were 100% pure, but some were not, which suggests the trees stopped short of memorizing every training point and the model avoided a severe overfit.
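
A sketch of the deeper forest, again reusing the hypothetical split from earlier (the 100 estimators and max depth of 15 come from the text; everything else is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Deeper forest: 100 trees with max depth 15, same split as before
rf_deep = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42)
rf_deep.fit(X_train, y_train)

pred = rf_deep.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```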

The purity of the final leaves of the first three trees was examined, resulting in almost identical distributions.
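
A sketch of how leaf purity could be computed from the fitted trees' internals, assuming the `rf_deep` model above (`tree_.children_left` and `tree_.value` are scikit-learn tree attributes, and a value of -1 in `children_left` marks a leaf, per sklearn's convention):

```python
def leaf_purities(estimator):
    """Fraction of the majority class in each leaf of one fitted tree."""
    t = estimator.tree_
    leaves = t.children_left == -1            # sklearn marks leaves with -1
    counts = t.value[leaves].squeeze(axis=1)  # per-leaf class distribution
    return counts.max(axis=1) / counts.sum(axis=1)

# Compare the purity distributions of the first three trees
for i, tree in enumerate(rf_deep.estimators_[:3]):
    p = leaf_purities(tree)
    print(f"tree {i}: {(p == 1.0).mean():.1%} of leaves fully pure")
```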


Accuracy and Confusion Matrix for the Deeper Tree.

Final Leaves Purity of the First Three Trees.

Conclusion

Weather is a difficult phenomenon to predict with relatively basic methods at short time scales. An analysis was performed using commonly available weather metrics of the kind featured in a forecast. A technique that combines multiple prediction methods to improve accuracy was applied to these features in an attempt to predict the type of weather a day will bring: Clear, Rain, Snow, or Other, where Other contains subcategories such as Fog, Wind, and Overcast. Predicting whether a day will bring Clear, Rain, or Snow resulted in decent performance, but the Other category was misclassified the most, and it was also the most common wrong prediction when the other categories were not classified correctly. Overall, this model has potential, but perhaps better indicators within Other could improve the results.