Modeling - Ensemble Learning


Overview of Ensemble Learning

Ensemble learning is a technique that attempts to improve model performance by combining multiple models, reducing bias and variance in the process. Bias is the source of error associated with underfitting and poor performance, while variance is the source of error associated with overfitting and models that are overly sensitive to the training data. Some of the more common ensemble methods are Bagging, Stacking, and Boosting. Essentially, these methods train base learners, which can be the same type of machine learning model or different models, on different samples of the data and then combine their predictions through some form of voting.

Bagging was originally introduced as training similar machine learning base models on random samples drawn with replacement. However, the "with replacement" rule is no longer a hard rule.
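
As a minimal sketch of bagging with scikit-learn (the toy data, parameter values, and random_state here are illustrative assumptions, not choices from the original analysis):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in data; the real analysis uses the weather features shown later
X, y = make_classification(n_samples=1000, n_features=7, n_informative=5, random_state=42)

# Bagging: each tree is trained on a bootstrap sample (rows drawn with replacement)
bagger = BaggingClassifier(
    estimator=DecisionTreeClassifier(),  # `estimator` assumes scikit-learn >= 1.2
    n_estimators=10,
    bootstrap=True,  # sample rows with replacement
    random_state=42,
).fit(X, y)
```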

Stacking is a "strong" ensemble method. "Strong" doesn't refer to "better" but to the strength of the types of base learners. "Stacking allows to use the strength of each individual estimator by using their output as input of a final estimator" [sklearn documentation].
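
A sketch of stacking in scikit-learn, with an assumed (illustrative) mix of base learners feeding a logistic regression meta-estimator:

```python
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Heterogeneous "strong" base learners whose outputs feed a final estimator
stack = StackingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(max_depth=5)),
        ("svm", SVC()),
    ],
    final_estimator=LogisticRegression(),
)
```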

Boosting is a "weak" ensemble method. Here, "weak" refers to the process of emphasizing misclassified samples with higher weights in an iterative sampling and chaining process.
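
A sketch of boosting via AdaBoost in scikit-learn (the stump depth and number of estimators are assumptions for illustration):

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# "Weak" learners (decision stumps) trained in sequence; samples the current
# ensemble misclassifies receive higher weights in the next round
booster = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # `estimator` assumes scikit-learn >= 1.2
    n_estimators=50,
)
```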

In particular, this section focuses on the ensemble method of Random Forest Classification. A Random Forest consists of multiple Decision Trees ensembled together as an extension of bagging. The process is considered an extension of bagging because it also uses a method known as feature randomness, which samples the columns (features) as well as the rows. Therefore, the strength of a Random Forest lies in its ability to combine samples of both rows and columns to create an uncorrelated forest of decision trees!
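
The row and column sampling just described maps directly onto two scikit-learn parameters, as this sketch shows (the parameter values are illustrative assumptions):

```python
from sklearn.ensemble import RandomForestClassifier

# Random forest = bagging over rows plus feature randomness over columns:
# each split considers only a random subset of features
forest = RandomForestClassifier(
    n_estimators=100,
    max_features="sqrt",  # the column sampling that decorrelates the trees
    bootstrap=True,       # the row sampling inherited from bagging
)
```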


Data Preparation

For a last try at predicting a weather label icon (i.e., the category of the weather for a day), Random Forest Classification will be used to try and predict between Clear Day, Rain, Snow, or Other. The final dataset will include the variables day_of_year, temp, dew, humidity, pressure, latitude, and longitude, along with the icon label, as shown in the prepared data below.


To accomplish this, the weather dataset and resort dataset were used. The resort dataset contains the coordinates, which were merged in. Additionally, the Other icon label collects subcategories such as Fog, Wind, and Overcast. After replacing those values with Other, the class proportions had to be accounted for: with 4 categories to predict, the data was downsampled so that every class matched the size of the smallest one. This still left over 500,000 datapoints to train a model on.
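
A sketch of this relabeling and downsampling in pandas, assuming the merged DataFrame is named `df` and noting that the raw subcategory strings below are hypothetical placeholders:

```python
import pandas as pd

# `df` is assumed to be the merged weather/resort DataFrame with an `icon` column;
# the raw subcategory label strings here are assumptions, not the actual values
df["icon"] = df["icon"].replace(["fog", "wind", "overcast"], "other")

# Downsample every class to the size of the smallest one
min_size = df["icon"].value_counts().min()
balanced = (
    df.groupby("icon", group_keys=False)
      .apply(lambda g: g.sample(n=min_size, random_state=42))
)
```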

After the merging and balancing, the data was split into a training and a testing set. Ensemble learning, as used here, is a supervised machine learning method, so the data trained on must be labeled. The training and testing sets are disjoint, and must be disjoint: using non-disjoint data between testing and training won't give an accurate representation of the performance of the model. First, overlap could hide an overfit model, one that ends up describing noise rather than the underlying distribution. Second, a disjoint testing set stands in for real-world (i.e., unseen) data.
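
A sketch of the disjoint split, reusing the `balanced` DataFrame from the sketch above; the feature names come from the prepared table shown below, while the test fraction, stratification, and random_state are assumptions:

```python
from sklearn.model_selection import train_test_split

features = ["day_of_year", "temp", "dew", "humidity", "pressure", "latitude", "longitude"]
X = balanced[features]
y = balanced["icon"]

# Disjoint train/test split; stratifying keeps the four labels balanced in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```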

The initial resort and weather datasets are illustrated below, along with the merged data and the training and testing datasets.


Weather Data
datetime tempmax tempmin temp feelslikemax feelslikemin feelslike dew humidity precip precipprob precipcover snow snowdepth windgust windspeed winddir pressure cloudcover visibility solarradiation solarenergy uvindex sunrise sunset moonphase icon stations resort tzoffset severerisk type_freezingrain type_ice type_none type_rain type_snow
2019-01-01 16.4 2.0 7.0 10.2 -13.3 -0.6 -1.2 69.1 0.008 100.0 20.83 0.0 20.7 18.30000 10.6 4.9 1014.5 59.0 8.6 116.8 9.9 5.0 07:26:20 16:51:51 0.85 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-02 24.3 -0.9 11.4 21.9 -11.9 5.4 -10.5 39.9 0.004 100.0 4.17 0.0 20.8 29.77377 8.7 353.1 1021.4 0.0 9.9 121.6 10.7 5.0 07:26:27 16:52:41 0.89 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-03 29.0 5.3 17.6 21.9 -4.0 8.6 4.1 56.1 0.004 100.0 4.17 0.2 20.8 32.20000 9.8 328.9 1024.7 0.0 9.8 123.3 10.6 5.0 07:26:31 16:53:33 0.92 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-04 34.0 11.9 23.4 28.7 3.4 17.1 7.0 50.4 0.001 100.0 4.17 0.1 20.8 20.80000 9.0 311.0 1025.5 0.0 9.9 123.7 10.7 5.0 07:26:34 16:54:26 0.96 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-05 34.1 14.3 27.1 29.4 4.3 20.1 1.9 33.9 0.001 100.0 4.17 0.0 20.4 20.80000 10.1 243.5 1022.2 19.4 9.7 110.3 9.6 5.0 07:26:34 16:55:20 0.00 rain ['72467523063', '72206103038', 'CACMC', 'DYGC2', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-06 29.9 18.5 25.9 22.4 5.1 16.1 18.1 72.5 0.035 100.0 58.33 0.6 20.6 33.30000 16.9 266.7 1009.3 78.7 6.3 47.3 4.1 2.0 07:26:32 16:56:16 0.02 snow ['72467523063', '72206103038', 'CACMC', '72038500419', 'DYGC2', 'KCCU', 'KEGE', 'A0000594076', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-07 24.8 14.7 20.3 12.8 2.5 6.5 13.7 75.2 0.004 100.0 8.33 0.4 21.3 45.70000 27.9 271.2 1015.6 83.7 4.8 35.8 3.0 2.0 07:26:27 16:57:13 0.06 snow ['72467523063', '72206103038', 'CACMC', 'DYGC2', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-08 34.6 17.2 25.1 34.6 5.0 17.8 12.0 59.5 0.013 100.0 8.33 0.0 21.3 27.70000 15.2 312.1 1029.4 34.5 9.5 122.9 10.5 5.0 07:26:21 16:58:11 0.09 rain ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-09 38.3 23.0 28.6 38.3 13.6 22.6 9.9 45.4 0.000 0.0 0.00 0.0 21.2 23.00000 13.0 142.9 1029.6 1.0 9.9 114.0 9.8 5.0 07:26:12 16:59:11 0.12 clear-day ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 1 0 0
2019-01-10 33.7 17.0 26.4 33.7 9.8 22.6 14.3 60.6 0.026 100.0 12.50 0.8 21.4 17.20000 8.8 323.7 1023.3 39.9 8.3 75.9 6.6 4.0 07:26:01 17:00:11 0.16 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1

Resort Data
Resort state_province_territory Country City Overall Rating Elevation Difference Elevation Low Elevation High Trails Total Trails Easy Trails Intermediate Trails Difficult Lifts Price Resort Size Run Variety Lifts Quality Latitude Longitude Pass Region
49 Degrees North Mountain Resort Washington United States Chewelah 3.4 564 1196 1760 68.0 20.0 27.0 21.0 7 82.0 3.5 4.0 3.3 48.277375 -117.701815 Other West
Crystal Mountain (WA) Washington United States Sunrise 3.3 796 1341 2137 50.0 8.0 27.0 15.0 11 199.0 3.2 3.6 3.7 46.928167 -121.504535 Ikon West
Mt. Baker Washington United States White Salmon 3.4 455 1070 1525 100.0 24.0 45.0 31.0 10 91.0 3.9 4.3 3.0 45.727775 -121.486699 Other West
Mt. Spokane Washington United States Mead 3.0 610 1185 1795 26.0 6.5 16.0 3.5 7 75.0 2.7 3.1 3.0 47.919072 -117.092505 Other West
Sitzmark Washington United States Tonasket 2.6 155 1330 1485 7.5 2.0 3.0 2.5 2 50.0 1.9 2.4 2.9 48.863907 -119.165077 Other West
Stevens Pass Washington United States Baring 3.3 580 1170 1750 39.0 6.0 18.0 15.0 10 119.0 3.1 3.5 3.6 47.764031 -121.474822 Epic West
The Summit at Snoqualmie Washington United States Snoqualmie Pass 3.0 380 800 1180 27.9 5.2 13.7 9.0 22 135.0 2.6 3.0 3.2 47.405235 -121.412783 Ikon West
Wenatchee Mission Ridge Washington United States Wenatchee 3.2 686 1392 2078 36.0 4.0 21.0 11.0 4 119.0 2.9 3.3 3.6 47.292466 -120.399871 Other West
Abenaki New Hampshire United States Wolfeboro 2.1 70 180 250 2.0 1.2 0.5 0.3 1 24.0 1.4 1.8 1.4 43.609528 -71.229692 Other Northeast
Attitash Mountain Resort New Hampshire United States Bartlett 3.2 533 183 716 37.0 7.4 17.4 12.2 8 129.0 2.9 3.3 3.7 44.084603 -71.221525 Epic Northeast

Data Prepared for Random Forest
day_of_year temp dew humidity pressure latitude longitude icon
174 74.5 44.4 36.5 1015.7 40.637580 -111.478971 clear-day
307 34.2 15.9 46.7 1040.5 39.875755 -105.762776 clear-day
234 62.5 49.3 64.2 1017.5 44.520279 -85.943972 clear-day
121 51.8 15.0 28.0 1014.0 39.336911 -119.872522 clear-day
188 61.5 30.5 35.7 1014.2 39.317521 -120.330546 clear-day
138 42.7 22.8 46.6 1020.5 47.027702 -71.383543 clear-day
108 49.2 16.1 31.0 1025.1 38.534719 -105.998902 clear-day
168 64.1 51.9 66.0 1014.9 43.057705 -86.239159 clear-day
261 64.6 56.2 76.2 1016.7 41.501664 -72.736446 clear-day
112 46.1 13.1 30.1 1013.8 37.772320 -119.092603 clear-day

Training Data for Random Forest
day_of_year temp dew humidity pressure latitude longitude icon
97 36.2 18.3 49.0 1024.1 45.888411 -74.140173 other
215 75.4 40.0 34.1 1024.4 45.265990 -111.253120 clear-day
33 22.7 18.5 84.2 995.9 45.679871 -65.381268 snow
229 67.5 60.3 79.2 1014.4 50.213297 -66.375792 rain
103 30.1 24.2 79.5 1010.3 48.525998 -89.127943 snow
88 45.7 13.0 27.5 1017.6 35.941036 -106.275751 clear-day
31 40.1 35.4 83.3 1015.1 49.844217 -119.682713 snow
164 57.0 44.9 65.2 1010.8 44.353591 -73.861412 rain
59 31.2 6.6 38.6 1029.6 37.937746 -107.820799 clear-day
217 64.3 54.4 73.0 1018.9 44.535401 -80.378871 other

Testing Data for Random Forest
day_of_year temp dew humidity pressure latitude longitude icon
240 50.0 40.0 69.7 1006.9 46.031602 -71.193532 clear-day
135 59.6 56.1 88.3 1012.0 42.500124 -88.190056 rain
325 25.6 22.6 88.4 1015.9 48.892154 -72.238314 other
18 8.2 -1.1 65.5 1002.3 46.449381 -70.537122 snow
200 66.1 36.2 38.1 1024.4 45.261781 -111.308024 clear-day
248 50.6 46.4 86.0 1019.2 48.437691 -77.637239 other
67 23.9 12.8 64.0 1014.0 45.175126 -109.317871 clear-day
123 50.8 48.5 92.0 1017.2 43.678495 -73.991520 rain
23 15.9 6.6 66.8 1013.5 44.743490 -85.514576 snow
28 33.1 29.0 84.7 1018.8 39.153678 -84.888465 other

Coding Ensemble (Random Forest Classification)

The code for the data preparation and performing ensemble random forest classification can be found [here].
This code includes an ensemble of shallow trees to illustrate the process and an ensemble of deeper trees in an attempt to produce better results.


Results - The Process

This ensemble method is known as a random forest because it contains multiple decision trees. To illustrate the process, a random forest of 10 trees (estimators) with a max depth of 3 was trained. The accuracy and confusion matrix are reported, and the root nodes of the first three trees are examined.
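
A sketch of how this shallow forest might be trained and scored, reusing the hypothetical split from the data-preparation sketch (the 10 estimators and max depth of 3 come from the text; random_state is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Shallow illustrative forest: 10 trees, max depth 3
rf_shallow = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=42)
rf_shallow.fit(X_train, y_train)

pred = rf_shallow.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```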

The trees have different root nodes, which illustrates the different subsets created through row sampling and feature randomness. This is difficult to achieve with a single decision tree without dropping entire features. However, note that this illustrative example comes nowhere near pure leaf nodes at the ends of the trees; for visualization purposes, the depth was purposefully kept very shallow for a dataset of this size. This suggests that training a random forest classifier with a greater max depth could increase the accuracy of the model. However, caution must be used to prevent overfitting.
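
One possible way to inspect the tops of the first three trees, assuming the `rf_shallow` model from the sketch above (the figure layout is an assumption; the fitted trees live in the forest's `estimators_` attribute):

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Each fitted tree lives in estimators_; draw the root and first split of three trees
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, tree in zip(axes, rf_shallow.estimators_[:3]):
    plot_tree(tree, feature_names=features, max_depth=1, filled=True, ax=ax)
plt.show()
```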


Accuracy and Confusion Matrix for the Shallow Tree Example.



Results - Optimal Model

In an attempt to create a better model, 100 trees (estimators) were used in the random forest classifier ensemble with a max depth of 15. This did increase the accuracy of the model. To check that the ensemble was not overfit, the purity of the final leaves was examined. A majority of the final leaves were 100% pure, but some were not, which suggests the trees stopped short of memorizing every training point and the model avoided a severe overfit.
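
A sketch of the deeper forest, again reusing the hypothetical split from earlier (the 100 estimators and max depth of 15 come from the text; everything else is an assumption):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Deeper forest: 100 trees with max depth 15, same split as before
rf_deep = RandomForestClassifier(n_estimators=100, max_depth=15, random_state=42)
rf_deep.fit(X_train, y_train)

pred = rf_deep.predict(X_test)
print(accuracy_score(y_test, pred))
print(confusion_matrix(y_test, pred))
```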

The purity of the final leaves of the first three trees was examined, resulting in almost identical distributions.
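
A sketch of how leaf purity could be computed from the fitted trees' internals, assuming the `rf_deep` model above (`tree_.children_left` and `tree_.value` are scikit-learn tree attributes, and a value of -1 in `children_left` marks a leaf, per sklearn's convention):

```python
def leaf_purities(estimator):
    """Fraction of the majority class in each leaf of one fitted tree."""
    t = estimator.tree_
    leaves = t.children_left == -1            # sklearn marks leaves with -1
    counts = t.value[leaves].squeeze(axis=1)  # per-leaf class distribution
    return counts.max(axis=1) / counts.sum(axis=1)

# Compare the purity distributions of the first three trees
for i, tree in enumerate(rf_deep.estimators_[:3]):
    p = leaf_purities(tree)
    print(f"tree {i}: {(p == 1.0).mean():.1%} of leaves fully pure")
```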


Accuracy and Confusion Matrix for the Deeper Tree.

Final Leaves Purity of the First Three Trees.

Conclusion

Weather is a difficult phenomenon to predict with relatively basic methods at short time scales. An analysis was performed using commonly available weather metrics of the kind featured in a forecast. A technique that combines multiple prediction methods to improve accuracy was applied to these features in an attempt to predict the type of weather a day will bring: Clear, Rain, Snow, or Other, where Other contains subcategories such as Fog, Wind, and Overcast. Predicting whether a day will bring Clear, Rain, or Snow resulted in decent performance, but the Other category was misclassified the most, and it was also the most common wrong prediction when the other categories were not classified correctly. Overall, this model has potential, but perhaps better indicators within Other could improve the results.