Ensemble learning is a technique that attempts to improve model performance by combining multiple models, with the aim of reducing bias, variance, or both.
Bias is a source of error often associated with underfitting and poor performance, while variance is a source of error often associated with overfitting, that is, with models that are overly sensitive to the training data.
Some of the more common methods used in ensemble learning are Bagging, Stacking, and Boosting.
Essentially, these methods train base learners, which can be the same type of machine learning model or different types, often on different samples of the data, and then combine the base learners' predictions, for example through a voting approach, to produce the overall prediction.
Bagging was originally introduced as training similar machine learning base models on random samples with replacement.
However, the “with replacement” rule is not a hard rule anymore.
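
As a rough illustration of the sampling step, the sketch below draws row indices for one base learner, either with replacement (classic bagging) or without. It uses only the Python standard library; the function name `draw_sample` and the sample sizes are assumptions made for this example and do not come from any library.

```python
import random

def draw_sample(n_rows, sample_size, with_replacement=True, seed=None):
    """Return row indices for one base learner's training sample."""
    rng = random.Random(seed)
    if with_replacement:
        # Classic bagging: the same row may be drawn more than once.
        return [rng.randrange(n_rows) for _ in range(sample_size)]
    # Relaxed variant: each row appears at most once in the sample.
    return rng.sample(range(n_rows), sample_size)

bootstrap_rows = draw_sample(n_rows=10, sample_size=10, with_replacement=True, seed=0)
subsample_rows = draw_sample(n_rows=10, sample_size=6, with_replacement=False, seed=0)
print("With replacement:   ", bootstrap_rows)
print("Without replacement:", subsample_rows)
```

Each base learner would then be trained on the rows selected by one such draw.
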
Stacking is a “strong” ensemble method.
Here, “strong” does not mean “better”; it refers to the strength of the base learners being combined.
“Stacking allows to use the strength of each individual estimator by using their output as input of a final estimator” [sklearn documentation].
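
A minimal sketch of stacking with scikit-learn's `StackingClassifier` is shown below. The synthetic data, the choice of base learners, and the logistic regression final estimator are all assumptions made for illustration; they are not the models used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two different base learners whose outputs become inputs of a final estimator.
base_learners = [
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=5)),
]
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression())
stack.fit(X_train, y_train)

stack_accuracy = stack.score(X_test, y_test)
print(f"Stacking accuracy: {stack_accuracy:.3f}")
```
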
Boosting is a “weak” ensemble method.
“Weak” refers to the base learners, which are chained in an iterative sampling process that emphasizes misclassified samples by giving them higher weights.
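
To make the reweighting idea concrete, the sketch below performs a single AdaBoost-style weight update in plain Python. AdaBoost is used here only as one well-known example of boosting; the function name `reweight` and the toy inputs are assumptions for illustration.

```python
import math

def reweight(weights, misclassified):
    """One AdaBoost-style update: raise the weight of misclassified samples."""
    # Weighted error of the current weak learner (assumed to be between 0 and 1).
    error = sum(w for w, miss in zip(weights, misclassified) if miss)
    # Learner importance: larger when the weighted error is smaller.
    alpha = 0.5 * math.log((1.0 - error) / error)
    updated = [
        w * math.exp(alpha if miss else -alpha)
        for w, miss in zip(weights, misclassified)
    ]
    total = sum(updated)
    return [w / total for w in updated]  # normalize so the weights sum to 1

weights = [0.2] * 5                                  # five samples, equal start
misclassified = [False, False, True, False, False]   # hypothetical learner output
new_weights = reweight(weights, misclassified)
print(new_weights)
```

The misclassified sample ends up with a larger share of the total weight, so the next learner in the chain pays more attention to it.
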
In particular, this section will focus on the ensemble method of Random Forest Classification.
Random Forests consist of multiple Decision Trees ensembled together as an extension of bagging.
Random Forests are considered an extension of bagging because, in addition to sampling rows, they use a method known as feature randomness, which samples the columns, or features, as well.
Therefore, the strength of a Random Forest lies in its ability to combine samples of both rows and columns to create a forest of largely uncorrelated decision trees!
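
The sketch below illustrates the two kinds of sampling in plain Python: rows drawn with replacement and a random subset of columns. The function name `draw_tree_inputs` and the subset size of 3 are assumptions for illustration. It is also a simplification: libraries such as scikit-learn draw the random feature subset at each split rather than once per tree.

```python
import random

def draw_tree_inputs(n_rows, feature_names, n_features, seed=None):
    """Pick the rows and columns that one tree in the forest would see."""
    rng = random.Random(seed)
    # Row sampling with replacement, as in bagging.
    rows = [rng.randrange(n_rows) for _ in range(n_rows)]
    # Feature randomness: a random subset of the columns.
    columns = rng.sample(feature_names, n_features)
    return rows, columns

features = ["day_of_year", "temp", "dew", "humidity",
            "pressure", "latitude", "longitude"]
for tree_id in range(3):
    rows, columns = draw_tree_inputs(n_rows=8, feature_names=features,
                                     n_features=3, seed=tree_id)
    print(tree_id, rows, columns)
```
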
As a final attempt at predicting the weather icon label (i.e. the category of the weather for a day), Random Forest Classification will be used to classify each day as
Clear Day, Rain, Snow, or Other. The final dataset will include the variables:
datetime | tempmax | tempmin | temp | feelslikemax | feelslikemin | feelslike | dew | humidity | precip | precipprob | precipcover | snow | snowdepth | windgust | windspeed | winddir | pressure | cloudcover | visibility | solarradiation | solarenergy | uvindex | sunrise | sunset | moonphase | icon | stations | resort | tzoffset | severerisk | type_freezingrain | type_ice | type_none | type_rain | type_snow |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2019-01-01 | 16.4 | 2.0 | 7.0 | 10.2 | -13.3 | -0.6 | -1.2 | 69.1 | 0.008 | 100.0 | 20.83 | 0.0 | 20.7 | 18.30000 | 10.6 | 4.9 | 1014.5 | 59.0 | 8.6 | 116.8 | 9.9 | 5.0 | 07:26:20 | 16:51:51 | 0.85 | snow | ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 |
2019-01-02 | 24.3 | -0.9 | 11.4 | 21.9 | -11.9 | 5.4 | -10.5 | 39.9 | 0.004 | 100.0 | 4.17 | 0.0 | 20.8 | 29.77377 | 8.7 | 353.1 | 1021.4 | 0.0 | 9.9 | 121.6 | 10.7 | 5.0 | 07:26:27 | 16:52:41 | 0.89 | snow | ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 |
2019-01-03 | 29.0 | 5.3 | 17.6 | 21.9 | -4.0 | 8.6 | 4.1 | 56.1 | 0.004 | 100.0 | 4.17 | 0.2 | 20.8 | 32.20000 | 9.8 | 328.9 | 1024.7 | 0.0 | 9.8 | 123.3 | 10.6 | 5.0 | 07:26:31 | 16:53:33 | 0.92 | snow | ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 1 |
2019-01-04 | 34.0 | 11.9 | 23.4 | 28.7 | 3.4 | 17.1 | 7.0 | 50.4 | 0.001 | 100.0 | 4.17 | 0.1 | 20.8 | 20.80000 | 9.0 | 311.0 | 1025.5 | 0.0 | 9.9 | 123.7 | 10.7 | 5.0 | 07:26:34 | 16:54:26 | 0.96 | snow | ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 1 |
2019-01-05 | 34.1 | 14.3 | 27.1 | 29.4 | 4.3 | 20.1 | 1.9 | 33.9 | 0.001 | 100.0 | 4.17 | 0.0 | 20.4 | 20.80000 | 10.1 | 243.5 | 1022.2 | 19.4 | 9.7 | 110.3 | 9.6 | 5.0 | 07:26:34 | 16:55:20 | 0.00 | rain | ['72467523063', '72206103038', 'CACMC', 'DYGC2', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 1 |
2019-01-06 | 29.9 | 18.5 | 25.9 | 22.4 | 5.1 | 16.1 | 18.1 | 72.5 | 0.035 | 100.0 | 58.33 | 0.6 | 20.6 | 33.30000 | 16.9 | 266.7 | 1009.3 | 78.7 | 6.3 | 47.3 | 4.1 | 2.0 | 07:26:32 | 16:56:16 | 0.02 | snow | ['72467523063', '72206103038', 'CACMC', '72038500419', 'DYGC2', 'KCCU', 'KEGE', 'A0000594076', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 1 |
2019-01-07 | 24.8 | 14.7 | 20.3 | 12.8 | 2.5 | 6.5 | 13.7 | 75.2 | 0.004 | 100.0 | 8.33 | 0.4 | 21.3 | 45.70000 | 27.9 | 271.2 | 1015.6 | 83.7 | 4.8 | 35.8 | 3.0 | 2.0 | 07:26:27 | 16:57:13 | 0.06 | snow | ['72467523063', '72206103038', 'CACMC', 'DYGC2', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 |
2019-01-08 | 34.6 | 17.2 | 25.1 | 34.6 | 5.0 | 17.8 | 12.0 | 59.5 | 0.013 | 100.0 | 8.33 | 0.0 | 21.3 | 27.70000 | 15.2 | 312.1 | 1029.4 | 34.5 | 9.5 | 122.9 | 10.5 | 5.0 | 07:26:21 | 16:58:11 | 0.09 | rain | ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 1 |
2019-01-09 | 38.3 | 23.0 | 28.6 | 38.3 | 13.6 | 22.6 | 9.9 | 45.4 | 0.000 | 0.0 | 0.00 | 0.0 | 21.2 | 23.00000 | 13.0 | 142.9 | 1029.6 | 1.0 | 9.9 | 114.0 | 9.8 | 5.0 | 07:26:12 | 16:59:11 | 0.12 | clear-day | ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 1 | 0 | 0 |
2019-01-10 | 33.7 | 17.0 | 26.4 | 33.7 | 9.8 | 22.6 | 14.3 | 60.6 | 0.026 | 100.0 | 12.50 | 0.8 | 21.4 | 17.20000 | 8.8 | 323.7 | 1023.3 | 39.9 | 8.3 | 75.9 | 6.6 | 4.0 | 07:26:01 | 17:00:11 | 0.16 | snow | ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] | Vail | 0.0 | 0.0 | 0 | 0 | 0 | 1 | 1 |

Resort | state_province_territory | Country | City | Overall Rating | Elevation Difference | Elevation Low | Elevation High | Trails Total | Trails Easy | Trails Intermediate | Trails Difficult | Lifts | Price | Resort Size | Run Variety | Lifts Quality | Latitude | Longitude | Pass | Region |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
49 Degrees North Mountain Resort | Washington | United States | Chewelah | 3.4 | 564 | 1196 | 1760 | 68.0 | 20.0 | 27.0 | 21.0 | 7 | 82.0 | 3.5 | 4.0 | 3.3 | 48.277375 | -117.701815 | Other | West |
Crystal Mountain (WA) | Washington | United States | Sunrise | 3.3 | 796 | 1341 | 2137 | 50.0 | 8.0 | 27.0 | 15.0 | 11 | 199.0 | 3.2 | 3.6 | 3.7 | 46.928167 | -121.504535 | Ikon | West |
Mt. Baker | Washington | United States | White Salmon | 3.4 | 455 | 1070 | 1525 | 100.0 | 24.0 | 45.0 | 31.0 | 10 | 91.0 | 3.9 | 4.3 | 3.0 | 45.727775 | -121.486699 | Other | West |
Mt. Spokane | Washington | United States | Mead | 3.0 | 610 | 1185 | 1795 | 26.0 | 6.5 | 16.0 | 3.5 | 7 | 75.0 | 2.7 | 3.1 | 3.0 | 47.919072 | -117.092505 | Other | West |
Sitzmark | Washington | United States | Tonasket | 2.6 | 155 | 1330 | 1485 | 7.5 | 2.0 | 3.0 | 2.5 | 2 | 50.0 | 1.9 | 2.4 | 2.9 | 48.863907 | -119.165077 | Other | West |
Stevens Pass | Washington | United States | Baring | 3.3 | 580 | 1170 | 1750 | 39.0 | 6.0 | 18.0 | 15.0 | 10 | 119.0 | 3.1 | 3.5 | 3.6 | 47.764031 | -121.474822 | Epic | West |
The Summit at Snoqualmie | Washington | United States | Snoqualmie Pass | 3.0 | 380 | 800 | 1180 | 27.9 | 5.2 | 13.7 | 9.0 | 22 | 135.0 | 2.6 | 3.0 | 3.2 | 47.405235 | -121.412783 | Ikon | West |
Wenatchee Mission Ridge | Washington | United States | Wenatchee | 3.2 | 686 | 1392 | 2078 | 36.0 | 4.0 | 21.0 | 11.0 | 4 | 119.0 | 2.9 | 3.3 | 3.6 | 47.292466 | -120.399871 | Other | West |
Abenaki | New Hampshire | United States | Wolfeboro | 2.1 | 70 | 180 | 250 | 2.0 | 1.2 | 0.5 | 0.3 | 1 | 24.0 | 1.4 | 1.8 | 1.4 | 43.609528 | -71.229692 | Other | Northeast |
Attitash Mountain Resort | New Hampshire | United States | Bartlett | 3.2 | 533 | 183 | 716 | 37.0 | 7.4 | 17.4 | 12.2 | 8 | 129.0 | 2.9 | 3.3 | 3.7 | 44.084603 | -71.221525 | Epic | Northeast |

day_of_year | temp | dew | humidity | pressure | latitude | longitude | icon |
---|---|---|---|---|---|---|---|
174 | 74.5 | 44.4 | 36.5 | 1015.7 | 40.637580 | -111.478971 | clear-day |
307 | 34.2 | 15.9 | 46.7 | 1040.5 | 39.875755 | -105.762776 | clear-day |
234 | 62.5 | 49.3 | 64.2 | 1017.5 | 44.520279 | -85.943972 | clear-day |
121 | 51.8 | 15.0 | 28.0 | 1014.0 | 39.336911 | -119.872522 | clear-day |
188 | 61.5 | 30.5 | 35.7 | 1014.2 | 39.317521 | -120.330546 | clear-day |
138 | 42.7 | 22.8 | 46.6 | 1020.5 | 47.027702 | -71.383543 | clear-day |
108 | 49.2 | 16.1 | 31.0 | 1025.1 | 38.534719 | -105.998902 | clear-day |
168 | 64.1 | 51.9 | 66.0 | 1014.9 | 43.057705 | -86.239159 | clear-day |
261 | 64.6 | 56.2 | 76.2 | 1016.7 | 41.501664 | -72.736446 | clear-day |
112 | 46.1 | 13.1 | 30.1 | 1013.8 | 37.772320 | -119.092603 | clear-day |

day_of_year | temp | dew | humidity | pressure | latitude | longitude | icon |
---|---|---|---|---|---|---|---|
97 | 36.2 | 18.3 | 49.0 | 1024.1 | 45.888411 | -74.140173 | other |
215 | 75.4 | 40.0 | 34.1 | 1024.4 | 45.265990 | -111.253120 | clear-day |
33 | 22.7 | 18.5 | 84.2 | 995.9 | 45.679871 | -65.381268 | snow |
229 | 67.5 | 60.3 | 79.2 | 1014.4 | 50.213297 | -66.375792 | rain |
103 | 30.1 | 24.2 | 79.5 | 1010.3 | 48.525998 | -89.127943 | snow |
88 | 45.7 | 13.0 | 27.5 | 1017.6 | 35.941036 | -106.275751 | clear-day |
31 | 40.1 | 35.4 | 83.3 | 1015.1 | 49.844217 | -119.682713 | snow |
164 | 57.0 | 44.9 | 65.2 | 1010.8 | 44.353591 | -73.861412 | rain |
59 | 31.2 | 6.6 | 38.6 | 1029.6 | 37.937746 | -107.820799 | clear-day |
217 | 64.3 | 54.4 | 73.0 | 1018.9 | 44.535401 | -80.378871 | other |

day_of_year | temp | dew | humidity | pressure | latitude | longitude | icon |
---|---|---|---|---|---|---|---|
240 | 50.0 | 40.0 | 69.7 | 1006.9 | 46.031602 | -71.193532 | clear-day |
135 | 59.6 | 56.1 | 88.3 | 1012.0 | 42.500124 | -88.190056 | rain |
325 | 25.6 | 22.6 | 88.4 | 1015.9 | 48.892154 | -72.238314 | other |
18 | 8.2 | -1.1 | 65.5 | 1002.3 | 46.449381 | -70.537122 | snow |
200 | 66.1 | 36.2 | 38.1 | 1024.4 | 45.261781 | -111.308024 | clear-day |
248 | 50.6 | 46.4 | 86.0 | 1019.2 | 48.437691 | -77.637239 | other |
67 | 23.9 | 12.8 | 64.0 | 1014.0 | 45.175126 | -109.317871 | clear-day |
123 | 50.8 | 48.5 | 92.0 | 1017.2 | 43.678495 | -73.991520 | rain |
23 | 15.9 | 6.6 | 66.8 | 1013.5 | 44.743490 | -85.514576 | snow |
28 | 33.1 | 29.0 | 84.7 | 1018.8 | 39.153678 | -84.888465 | other |
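
The final tables above keep only `day_of_year`, `temp`, `dew`, `humidity`, `pressure`, `latitude`, `longitude`, and `icon`. The pandas sketch below shows one way such a table could be derived from the weather and resort tables. It is not the project's actual preparation code: the join on the resort name, the rule that collapses every other icon into `other`, and the placeholder coordinates for Vail are all assumptions made for illustration.

```python
import pandas as pd

# Two rows taken from the weather sample above (only the columns needed here).
weather = pd.DataFrame({
    "datetime": ["2019-01-01", "2019-01-09"],
    "temp": [7.0, 28.6],
    "dew": [-1.2, 9.9],
    "humidity": [69.1, 45.4],
    "pressure": [1014.5, 1029.6],
    "icon": ["snow", "clear-day"],
    "resort": ["Vail", "Vail"],
})
# Placeholder coordinates: Vail does not appear in the resort sample shown above.
resorts = pd.DataFrame({"Resort": ["Vail"], "Latitude": [39.6], "Longitude": [-106.4]})

KEPT_ICONS = {"clear-day", "rain", "snow"}

def collapse_icon(icon):
    """Map any icon outside the three kept labels to 'other' (assumed rule)."""
    return icon if icon in KEPT_ICONS else "other"

# Attach each resort's coordinates to its daily weather rows.
prepared = weather.merge(resorts, left_on="resort", right_on="Resort", how="left")
prepared["day_of_year"] = pd.to_datetime(prepared["datetime"]).dt.dayofyear
prepared["icon"] = prepared["icon"].map(collapse_icon)
prepared = prepared.rename(columns={"Latitude": "latitude", "Longitude": "longitude"})
prepared = prepared[["day_of_year", "temp", "dew", "humidity",
                     "pressure", "latitude", "longitude", "icon"]]
print(prepared)
```
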

The code for the data preparation and for performing the ensemble random forest classification can be found [here].
This code includes an ensemble of shallow trees to illustrate the process and an ensemble of deeper trees in an attempt to produce better results.
This ensemble method is known as random forest because it contains multiple decision trees. To illustrate the process, a random forest of
10 trees (estimators) with a max depth of 3 was trained. The accuracy and confusion matrix are reported, and the base (root) nodes of the first three trees are examined.
The trees have different base nodes, which illustrates that each tree was built from a different subset of the data, produced either through row sampling or feature randomness.
This is difficult to achieve with single decision trees without dropping features entirely.
However, note that the leaf nodes at the end of the trees in this illustrative example are nowhere near pure.
For visualization purposes, the depth was purposefully kept very shallow for a dataset of this size.
This suggests that training a random forest classifier with a greater max depth could increase the accuracy of the model.
However, caution must be used to prevent overfitting.
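
The shallow forest described above can be sketched with scikit-learn as follows. The synthetic four-class data with seven features is an assumption that stands in for the prepared weather dataset, so the printed numbers will not match the project's results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic four-class data with seven features stands in for the weather data.
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# 10 trees with a max depth of 3, as in the illustrative forest.
shallow = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0)
shallow.fit(X_train, y_train)

predictions = shallow.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
matrix = confusion_matrix(y_test, predictions, labels=shallow.classes_)
print(f"Accuracy: {accuracy:.3f}")
print(matrix)

# Index of the feature used at the root (node 0) of the first three trees.
root_features = [int(tree.tree_.feature[0]) for tree in shallow.estimators_[:3]]
print("Root split features:", root_features)
```

Comparing the root split features across trees is a quick way to see the effect of row sampling and feature randomness without plotting every tree.
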
In an attempt to create a better model, 100 trees (estimators) were used in the random forest classifier ensemble with a max depth of
15. This did increase the accuracy of the model. To check that the ensemble was not overfit, the purity of the final leaves was examined.
The majority of final leaves were 100% pure; however, some were not. Because not every leaf is pure, the trees did not simply memorize every training sample, which should be acceptable in preventing an overfit model.
The purity of the first three trees was tested, resulting in almost identical distributions.
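
One possible way to run the leaf purity check is sketched below with scikit-learn and NumPy. The helper name `leaf_purities` and the synthetic data are assumptions for illustration, and purity is taken here to mean the share of the majority class among the training samples that reach a leaf.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic four-class data with seven features stands in for the weather data.
X, y = make_classification(n_samples=600, n_features=7, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

# 100 trees with a max depth of 15, as in the deeper forest.
deep = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=0)
deep.fit(X, y)

def leaf_purities(fitted_tree):
    """Share of the majority class in each leaf of one fitted decision tree."""
    tree = fitted_tree.tree_
    is_leaf = tree.children_left == -1            # leaves have no children
    class_weights = tree.value[is_leaf][:, 0, :]  # one row per leaf
    return class_weights.max(axis=1) / class_weights.sum(axis=1)

pure_fractions = []
for fitted_tree in deep.estimators_[:3]:
    purity = leaf_purities(fitted_tree)
    pure_fractions.append(float(np.mean(np.isclose(purity, 1.0))))
print("Fraction of 100% pure leaves in the first three trees:", pure_fractions)
```

Comparing accuracy on the training data with accuracy on held-out data is another common way to check for overfitting.
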
Weather is a difficult phenomenon to predict when using relatively basic methods on a smaller scale of time. An analysis was performed using commonly available weather metrics that would be featured in a forecast. A technique that combines prediction methods to improve accuracy was applied to these features in an attempt to predict the type of weather a day will bring. The type of weather could be Clear, Rain, Snow, or Other, where Other contains subcategories such as Fog, Wind, and Overcast. Predicting whether a day will bring Clear, Rain, or Snow resulted in decent performance, but the Other category was misclassified the most and was also the most common incorrect prediction when days from the remaining categories were misclassified. Overall, this model has potential, but perhaps better indicators within Other could improve the results.