Modeling - Clustering


Clustering

Clustering is an unsupervised method that groups the "closest" values together. This page will analyze results and applications from different clustering methods. Namely, the following will be explored:

  • KMeans Clustering
  • Hierarchical Clustering
  • Density Clustering (DBSCAN)


What Data Can Be Used?

The data for cluster modeling will first be reduced with PCA. Therefore, the data preparation is identical to the PCA section. This analysis focuses on the quantitative data of the main datasets used throughout this project:


Ski Resorts Data - Snippet

Resort state_province_territory Country City Overall Rating Elevation Difference Elevation Low Elevation High Trails Total Trails Easy Trails Intermediate Trails Difficult Lifts Price Resort Size Run Variety Lifts Quality Latitude Longitude Pass Region
49 Degrees North Mountain Resort Washington United States Chewelah 3.4 564 1196 1760 68.0 20.0 27.0 21.0 7 82.0 3.5 4.0 3.3 48.277375 -117.701815 Other West
Crystal Mountain (WA) Washington United States Sunrise 3.3 796 1341 2137 50.0 8.0 27.0 15.0 11 199.0 3.2 3.6 3.7 46.928167 -121.504535 Ikon West
Mt. Baker Washington United States White Salmon 3.4 455 1070 1525 100.0 24.0 45.0 31.0 10 91.0 3.9 4.3 3.0 45.727775 -121.486699 Other West
Mt. Spokane Washington United States Mead 3.0 610 1185 1795 26.0 6.5 16.0 3.5 7 75.0 2.7 3.1 3.0 47.919072 -117.092505 Other West
Sitzmark Washington United States Tonasket 2.6 155 1330 1485 7.5 2.0 3.0 2.5 2 50.0 1.9 2.4 2.9 48.863907 -119.165077 Other West
Stevens Pass Washington United States Baring 3.3 580 1170 1750 39.0 6.0 18.0 15.0 10 119.0 3.1 3.5 3.6 47.764031 -121.474822 Epic West
The Summit at Snoqualmie Washington United States Snoqualmie Pass 3.0 380 800 1180 27.9 5.2 13.7 9.0 22 135.0 2.6 3.0 3.2 47.405235 -121.412783 Ikon West
Wenatchee Mission Ridge Washington United States Wenatchee 3.2 686 1392 2078 36.0 4.0 21.0 11.0 4 119.0 2.9 3.3 3.6 47.292466 -120.399871 Other West
Abenaki New Hampshire United States Wolfeboro 2.1 70 180 250 2.0 1.2 0.5 0.3 1 24.0 1.4 1.8 1.4 43.609528 -71.229692 Other Northeast
Attitash Mountain Resort New Hampshire United States Bartlett 3.2 533 183 716 37.0 7.4 17.4 12.2 8 129.0 2.9 3.3 3.7 44.084603 -71.221525 Epic Northeast

Weather Data - Snippet

datetime tempmax tempmin temp feelslikemax feelslikemin feelslike dew humidity precip precipprob precipcover snow snowdepth windgust windspeed winddir pressure cloudcover visibility solarradiation solarenergy uvindex sunrise sunset moonphase icon stations resort tzoffset severerisk type_freezingrain type_ice type_none type_rain type_snow
2019-01-01 16.4 2.0 7.0 10.2 -13.3 -0.6 -1.2 69.1 0.008 100.0 20.83 0.0 20.7 18.30000 10.6 4.9 1014.5 59.0 8.6 116.8 9.9 5.0 07:26:20 16:51:51 0.85 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-02 24.3 -0.9 11.4 21.9 -11.9 5.4 -10.5 39.9 0.004 100.0 4.17 0.0 20.8 29.77377 8.7 353.1 1021.4 0.0 9.9 121.6 10.7 5.0 07:26:27 16:52:41 0.89 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-03 29.0 5.3 17.6 21.9 -4.0 8.6 4.1 56.1 0.004 100.0 4.17 0.2 20.8 32.20000 9.8 328.9 1024.7 0.0 9.8 123.3 10.6 5.0 07:26:31 16:53:33 0.92 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-04 34.0 11.9 23.4 28.7 3.4 17.1 7.0 50.4 0.001 100.0 4.17 0.1 20.8 20.80000 9.0 311.0 1025.5 0.0 9.9 123.7 10.7 5.0 07:26:34 16:54:26 0.96 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-05 34.1 14.3 27.1 29.4 4.3 20.1 1.9 33.9 0.001 100.0 4.17 0.0 20.4 20.80000 10.1 243.5 1022.2 19.4 9.7 110.3 9.6 5.0 07:26:34 16:55:20 0.00 rain ['72467523063', '72206103038', 'CACMC', 'DYGC2', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-06 29.9 18.5 25.9 22.4 5.1 16.1 18.1 72.5 0.035 100.0 58.33 0.6 20.6 33.30000 16.9 266.7 1009.3 78.7 6.3 47.3 4.1 2.0 07:26:32 16:56:16 0.02 snow ['72467523063', '72206103038', 'CACMC', '72038500419', 'DYGC2', 'KCCU', 'KEGE', 'A0000594076', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-07 24.8 14.7 20.3 12.8 2.5 6.5 13.7 75.2 0.004 100.0 8.33 0.4 21.3 45.70000 27.9 271.2 1015.6 83.7 4.8 35.8 3.0 2.0 07:26:27 16:57:13 0.06 snow ['72467523063', '72206103038', 'CACMC', 'DYGC2', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 0 1
2019-01-08 34.6 17.2 25.1 34.6 5.0 17.8 12.0 59.5 0.013 100.0 8.33 0.0 21.3 27.70000 15.2 312.1 1029.4 34.5 9.5 122.9 10.5 5.0 07:26:21 16:58:11 0.09 rain ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1
2019-01-09 38.3 23.0 28.6 38.3 13.6 22.6 9.9 45.4 0.000 0.0 0.00 0.0 21.2 23.00000 13.0 142.9 1029.6 1.0 9.9 114.0 9.8 5.0 07:26:12 16:59:11 0.12 clear-day ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 1 0 0
2019-01-10 33.7 17.0 26.4 33.7 9.8 22.6 14.3 60.6 0.026 100.0 12.50 0.8 21.4 17.20000 8.8 323.7 1023.3 39.9 8.3 75.9 6.6 4.0 07:26:01 17:00:11 0.16 snow ['72467523063', '72206103038', 'CACMC', 'KCCU', 'KEGE', 'KLXV', 'DJTC2', 'K20V', '72467393009'] Vail 0.0 0.0 0 0 0 1 1

Google Places Data - Snippet

Latitude Longitude Name rating total_ratings Resort Call Category Initial Category Secondary Category Tertiary Category
39.639411 -106.367836 Manor Vail Lodge 4.7 370.0 Vail Restaurants bar lodging restaurant
39.641578 -106.371678 Gravity Haus Vail 4.4 256.0 Vail Restaurants gym spa lodging
39.642639 -106.377803 Leonora 4.3 167.0 Vail Restaurants restaurant food point_of_interest
39.638962 -106.369379 Larkspur Events & Dining 4.5 198.0 Vail Restaurants restaurant food point_of_interest
39.630370 -106.418694 Subway 2.7 105.0 Vail Restaurants meal_takeaway restaurant food
39.640861 -106.374665 Sweet Basil 4.4 838.0 Vail Restaurants bar restaurant food
39.640228 -106.374381 Elway's 4.3 385.0 Vail Restaurants bar restaurant food
39.643914 -106.390088 The Little Diner 4.7 1390.0 Vail Restaurants restaurant food point_of_interest
39.640248 -106.373333 Red Lion 3.9 740.0 Vail Restaurants bar restaurant food
39.641490 -106.397471 Chicago Pizza 3.9 216.0 Vail Restaurants meal_delivery meal_takeaway restaurant

Data Preparation

Each dataset required some alteration in preparation for PCA. Namely, this included subsetting the data to the quantitative values and separating out the labels. The labels were saved for later comparison with the results. Some of the datasets had multiple categorical features which could be used as labels depending on the purpose of the analysis; other columns were simply dropped. Thus, a concise script was created to perform this cleaning, apply the PCA algorithm, and analyze the results. This script can be found here, and contains detailed documentation on these functions.
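
As a rough illustration of this preparation step, the sketch below subsets a dataset to its quantitative columns, sets the label columns aside, standardizes the features, and projects them into three principal components. The file name is a placeholder assumption; the linked script contains the actual implementation for all three datasets.

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder file name -- the linked script handles all three datasets.
resorts = pd.read_csv("resorts.csv")

# Set the potential label columns aside for later comparison with the clusters.
label_cols = ["Resort", "state_province_territory", "Country", "City", "Pass", "Region"]
labels = resorts[label_cols]
quantitative = resorts.drop(columns=label_cols)

# Standardize so each feature contributes equally, then project to three components.
scaled = StandardScaler().fit_transform(quantitative)
pca = PCA(n_components=3)
projection = pd.DataFrame(
    pca.fit_transform(scaled),
    columns=[f"principal_component_{i + 1}" for i in range(3)],
)

# Proportion of variance ("information") retained by the three components.
print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
```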


Ski Resort Data - Preparation

  • Quantitative Data Retained:
    • Overall Rating
    • Elevation Difference
    • Elevation Low
    • Elevation High
    • Trails Total
    • Trails Easy
    • Trails Intermediate
    • Trails Difficult
    • Lifts
    • Price
    • Resort Size
    • Run Variety
    • Lifts Quality
    • Latitude
    • Longitude
  • Potential Label Columns Set Aside:
    • Resort
    • state_province_territory
    • Country
    • City
    • Pass
    • Region

Weather Data - Preparation

  • Quantitative Data Retained:
    • tempmax
    • tempmin
    • temp
    • feelslikemax
    • feelslikemin
    • feelslike
    • dew
    • humidity
    • precip
    • snow
    • snowdepth
    • windgust
    • windspeed
    • winddir
    • pressure
    • cloudcover
    • visibility
    • solarradiation
    • solarenergy
    • uvindex
    • moonphase
    • severerisk
  • Potential Label Columns Set Aside:
    • datetime
    • icon
    • resort
    • type_snow
    • type_rain
    • type_ice
    • type_freezingrain
    • type_none

Google Places Data - Preparation

  • Quantitative Data Retained:
    • Latitude
    • Longitude
    • rating
    • total_ratings
  • Potential Label Columns Set Aside:
    • Name
    • Resort
    • Call Category
    • Initial Category
    • Secondary Category
    • Tertiary Category

Ski Resorts Data - PCA Projection Snippet

principal_component_1 principal_component_2 principal_component_3
2.263004 -0.438256 1.305198
2.675585 -0.690901 0.544741
3.200620 -0.063937 1.225309
0.279424 -1.197545 0.793791
-1.721875 -1.844334 0.638262
1.532984 -0.698907 0.789206
0.792826 -0.224605 0.456074
1.243168 -1.223002 0.656630
-4.115028 -0.166513 -0.391799
0.504947 1.693590 -0.261678

The three-dimensional PCA for the Resorts data retains 85.63% of the information from the original features.


Weather Data - PCA Projection Snippet

principal_component_1 principal_component_2 principal_component_3
-4.657852 -1.403255 -1.031176
-3.895448 -3.615710 -0.216463
-3.383677 -2.874224 -0.205030
-2.621581 -2.679694 -1.003783
-2.555386 -2.624466 -0.628127
-3.662800 0.343598 1.519572
-4.547421 -0.178571 3.063166
-2.624131 -2.111392 -0.006581
-1.954785 -2.406680 -0.846676
-2.738489 -1.102249 -0.887917

The three-dimensional PCA for the Weather data retains 59.95% of the information from the original features.


Google Places Data - PCA Projection Snippet

principal_component_1 principal_component_2 principal_component_3
-0.302800 0.579023 0.216165
-0.272657 0.344079 0.186517
-0.253036 0.220012 0.222411
-0.265369 0.336181 0.282614
-0.195169 -0.542970 -0.426522
-0.379546 0.866591 -0.336260
-0.293520 0.415678 0.026604
-0.488774 1.494936 -0.700102
-0.347043 0.557516 -0.468297
-0.249838 0.087253 0.002198

The three-dimensional PCA for the Google Places data retains 83.12% of the information from the original features.


KMeans Clustering

KMeans is one of the more common methods of clustering. The goal is to partition a dataset into a specified number of clusters, where each point belongs to the cluster with the nearest mean. It is an unsupervised method that accomplishes this by minimizing the sum of squared distances between each point and the centroid of its assigned cluster. Distance and "closeness" are evaluated via Euclidean distance throughout this method and the other clustering methods. When choosing centroids, k randomly selected points are used as the initial centroids. Data points are assigned to the nearest centroid, forming k clusters, and each centroid is then updated to the mean of the points in its cluster. This is repeated until the centroids no longer change significantly or a maximum number of iterations is reached.
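
As a minimal sketch of this iterative procedure (the project itself relies on scikit-learn's implementation), the function below assigns each point to its nearest centroid by Euclidean distance, recomputes each centroid as the mean of its assigned points, and stops once the centroids stop moving:

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-4, seed=0):
    """Minimal KMeans on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # k randomly selected points serve as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Assign each point to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)

        # Update each centroid to the mean of the points assigned to it.
        new_centroids = np.array([
            X[assignments == j].mean(axis=0) if np.any(assignments == j) else centroids[j]
            for j in range(k)
        ])

        # Stop when the centroids no longer change significantly.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids

    return assignments, centroids
```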

There are several methods for choosing an optimal number of clusters. Using both the Silhouette Method and the Elbow Method can help confirm this optimal amount. This script can be found here, and contains detailed documentation on these functions.
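
A rough sketch of how both diagnostics can be computed with scikit-learn, assuming `projection` is the three-dimensional PCA output from earlier (inertia drives the Elbow Method; the mean silhouette score drives the Silhouette Method):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumes `projection` is the 3-D PCA output produced during data preparation.
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(projection)
    print(
        f"k={k:2d}  inertia={km.inertia_:10.2f}  "
        f"silhouette={silhouette_score(projection, km.labels_):.3f}"
    )
```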


KMeans - Resorts

2, 3, and 10 clusters are decent choices for the optimal number of clusters for the Resorts data, with 3 showing the strongest potential to be optimal.


KMeans - Weather

2, 3, and 5 clusters are decent choices for the optimal number of clusters for the Weather data, with 3 and 5 showing the strongest potential to be optimal.


KMeans - Google

4, 5, and 6 clusters are decent choices for the optimal number of clusters for the Google data, with 5 and 6 showing the strongest potential to be optimal.


KMeans Visualization - Resorts

For the Resorts data, clustering was performed for the following combinations of clusters and labels:

For each combination, the legend features are toggleable.
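
As a rough sketch of how such an interactive figure with a toggleable legend might be produced (Plotly is assumed here, since the plotting library is not named in this excerpt; `projection` and `labels` come from the earlier sketches):

```python
import plotly.express as px
from sklearn.cluster import KMeans

# Refit with a chosen number of clusters (3 was suggested for the Resorts data above).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(projection)

# 3-D scatter of the PCA projection, colored by cluster and symbolized by a label column.
fig = px.scatter_3d(
    projection,
    x="principal_component_1", y="principal_component_2", z="principal_component_3",
    color=km.labels_.astype(str),   # cluster assignment
    symbol=labels["Pass"],          # example label column set aside during preparation
    title="Resorts data: KMeans clusters vs. Pass label",
)
fig.show()  # clicking legend entries toggles them in the rendered figure
```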



KMeans Visualization - Weather

For the Weather data, clustering was performed for the following combinations of clusters and labels:

Note that a subset of the Weather data was used for illustrative purposes.
For each combination, the legend features are toggleable.



KMeans Visualization - Google

For the Google data, clustering was performed for the following combinations of clusters and labels:

Note that a subset of the Google data was used for illustrative purposes.
For each combination, the legend features are toggleable.



KMeans Clustering Results and Discussion


Hierarchical Clustering

Hierarchical Clustering determines cluster assignments by building a hierarchy, either bottom-up (agglomerative) or top-down (divisive), and produces a tree-based hierarchy of points called a dendrogram. There are two main types of hierarchical clustering:

  • Agglomerative (bottom-up): each observation starts in its own cluster, and the closest pairs of clusters are merged moving up the hierarchy.
  • Divisive (top-down): all observations start in a single cluster, which is split recursively moving down the hierarchy.

Although some methods can take a specified number of clusters as input, the strength of Hierarchical Clustering lies in its ability to produce a visual tree known as a dendrogram, which is largely agnostic of that criterion and allows similarities to be compared visually via distances.

The common machine learning libraries that offer hierarchical clustering mostly implement agglomerative (bottom-up) clustering. Scikit-Learn has a module for this, sklearn.cluster.AgglomerativeClustering; however, it mainly relies on a specified number of clusters as input. A more common approach for analysis of this sort is to use scipy.cluster.hierarchy with dendrogram and linkage. This particular analysis will feature SciPy's libraries and modules.

Functions were created and then applied to create and analyze agglomerative dendrograms.

The functions script can be found here, and contains detailed documentation.

The applications of the functions can be found here.

Note that the same three-dimensional PCA-projected data for the three main datasets were used again for hierarchical clustering; however, the Resorts data provided the best results.


Hierarchical Clustering - Resorts

The following process of analysis was applied to the Resorts data (a code sketch follows the list):

  1. Run three-dimensional PCA on the data.
  2. Create a linkage matrix from the reduced dataset using Ward's method.
  3. Create a dendrogram with parameters that show the complete bottom-up structure.
  4. Tune parameters to investigate different tree clusters.
  5. Expand leaves.
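
A rough sketch of steps 2 and 3 with SciPy, assuming `projection` is the three-dimensional PCA output from earlier; the truncation level and color threshold shown here mirror the parameters explored in the sections below:

```python
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Step 2: linkage matrix from the PCA-reduced data using Ward's method.
Z = linkage(projection, method="ward")

# Steps 3-4: truncated dendrogram; lowering color_threshold colors more, smaller clusters.
plt.figure(figsize=(12, 5))
dendrogram(Z, truncate_mode="level", p=3, color_threshold=30, show_leaf_counts=True)
plt.title("Resorts data: dendrogram truncated to level 3, color threshold 30")
plt.xlabel("Leaf (number of collapsed observations in parentheses)")
plt.ylabel("Distance (Ward)")
plt.show()
```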


Complete Hierarchical Tree

The complete bottom-up visualization for the Resorts data:


Full dendrogram illustrating a complete bottom-up approach.

There are a few things to note about this image:

One key takeaway is that, with the current parameters, this dendrogram offers little diagnostic value. By tuning the parameters and expanding the nodes, actionable results can be achieved.


Truncated to Level 3 - Default Threshold

Truncating the dendrogram to level 3 will yield something more reasonable:

Truncated to Level 3.

There are a few things to note about this image:

With these parameters, there are two clear clusters, but what's actually underneath the leaves (i.e. original observations)? If labels are applied, is there a clear split between labels? How can statistical tests help to ensure a clear differentiation between labels?

By examining different labels and applying a chi-squared hypothesis test, it can be determined whether or not there is a significant association between the clusters and a given label column.
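
A rough sketch of that test, assuming Z is the linkage matrix from the sketch above and labels holds the label columns set aside during preparation (the "Region" column is used here purely as an example):

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster
from scipy.stats import chi2_contingency

# Cut the tree at a chosen distance to obtain flat cluster assignments.
clusters = fcluster(Z, t=30, criterion="distance")

# Contingency table of cluster membership vs. one of the label columns.
contingency = pd.crosstab(clusters, labels["Region"])
chi2, p_value, dof, _ = chi2_contingency(contingency)

# A small p-value suggests a significant association between clusters and the label.
print(f"chi2 = {chi2:.2f}, dof = {dof}, p-value = {p_value:.4f}")
```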


Although there isn't a distinct label separation between the clusters, different distributions of the labels are present in each cluster.

Truncated to Level 3 - Color Threshold set to Distance 30

With a truncation to level 3 yielding a more interpretable dendrogram, further patterns emerged. By lowering the distance threshold used to identify clusters, additional clusters within the data, and the different label distributions inside them, can be revealed as well.

Truncated to Level 3 with Color Threshold set to a distance of 30.

With these parameters, there are three clear clusters. What is underneath the leaves?


Truncated to Level 4 - Color Threshold set to Distance 20

Truncating to level 3 worked well, but further clusters and patterns can be revealed with different parameters. How does a deeper truncation level with a smaller distance threshold fare for this data?

Truncated to Level 4 with Color Threshold set to a distance of 20.

With these parameters, there are five clear clusters. What is underneath the leaves?


Hierarchical Clustering Results and Discussion


Density Clustering (DBSCAN)

Density Clustering is particularly useful for datasets with irregular (arbitrary) shapes and varying densities. Like hierarchical clustering, and unlike KMeans, this method does not require a specified number of clusters.

The main properties of DBSCAN clustering, comparatively, are:

  • No specified number of clusters is required.
  • Clusters of irregular (arbitrary) shape and varying density can be detected.
  • Points that do not fit well into any cluster are flagged as noise (outliers).

Using Scikit-Learn's module sklearn.cluster.DBSCAN, DBSCAN can be implemented on the datasets within this scope. The inputs necessary to utilize this module are namely eps (epsilon, the neighborhood radius) and min_samples (the minimum number of points required to form a dense region). Tuning these input parameters can be done via domain knowledge or in a dataset-specific manner. This analysis will use a set of heuristics for choosing the input parameters. The process is detailed in this script.

Note that the same three-dimensional PCA-projected data for the three main datasets were used again for density-based clustering.

Parameter Heuristics

If not utilizing domain knowledge for parameter selection, the following heuristics are generally accepted for density clustering:

  • Minimum Samples: use at least the dimensionality of the data plus one; a common rule of thumb is roughly twice the number of dimensions.
  • Epsilon: choose the distance at the elbow of a sorted k-distance graph.

Choosing Minimum Samples is relatively straightforward, so the question of epsilon remains. In the k-distance graph approach, once a Minimum Samples value is chosen, the average distance between each point in the dataset and its Minimum Samples nearest neighbors is calculated. This can be aided with sklearn.neighbors.NearestNeighbors. A decent choice for epsilon is then the elbow of the distances plotted in ascending sorted order.

Finding the elbow, or point of maximum curvature, can be accomplished by "eyeballing" it, through a manual implementation, or even through a module. This analysis takes both a manual approach, which uses KMeans clustering to cluster the slopes between points on the curve, and an automated approach using kneed.KneeLocator.
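
A rough sketch of the automated side of this heuristic (the manual slope-clustering variant lives in the linked script), assuming `projection` is the three-dimensional PCA output from earlier; the min_samples value is a placeholder choice:

```python
import numpy as np
from kneed import KneeLocator
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

min_samples = 6  # placeholder: roughly twice the number of PCA dimensions

# Average distance from each point to its min_samples nearest neighbors,
# excluding the point itself (first column, distance 0), sorted ascending.
neighbors = NearestNeighbors(n_neighbors=min_samples + 1).fit(projection)
distances, _ = neighbors.kneighbors(projection)
k_distances = np.sort(distances[:, 1:].mean(axis=1))

# Automated elbow detection on the sorted k-distance curve.
knee = KneeLocator(
    range(len(k_distances)), k_distances, curve="convex", direction="increasing"
)
epsilon = k_distances[knee.knee]
print(f"Suggested epsilon: {epsilon:.2f}")

# Fit DBSCAN with the chosen parameters; a label of -1 marks noise (outliers).
db = DBSCAN(eps=epsilon, min_samples=min_samples).fit(projection)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(f"Clusters found: {n_clusters}, noise points: {(db.labels_ == -1).sum()}")
```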

For each dataset, the following parameters will be implemented:

DBSCAN Clustering - Resorts


The automatic detection technique gives an epsilon value of 1.22, while the slope clusters suggest a value of around 0.1. Another manual choice for epsilon could be right where the values start to change on the k-distance graph, which looks to be around 0.8.

Running the DBSCAN algorithm and visually evaluating how each set of parameters clusters the data can provide insight into the best choice.



DBSCAN Clustering - Weather


The automatic detection technique gives an epsilon value of 0.86, while the slope clusters suggest a value of around 0.1. Another manual choice for epsilon could be right where the values start to change on the k-distance graph, which looks to fall somewhere between these two.

Running the DBSCAN algorithm and visually evaluating how each set of parameters clusters the data can provide insight into the best choice.



DBSCAN Clustering - Google


The automatic detection technique gives an epsilon value of 0.99, while the slope clusters suggest values ranging from just above 0 to about 1. A custom set of parameters was also used here, with an epsilon of 0.65 and a minimum samples value of 10.

Running the DBSCAN algorithm and visually evaluating how each set of parameters clusters the data can provide insight into the best choice.



Density Clustering (DBSCAN) Results and Discussion

Given the arbitrary shape detection used in density clustering, this method seemed to work best for the Resorts and Google data. A further property of this clustering method is its outlier detection. The Resorts and Google projections have rather obscure shapes, especially Google, and it was mentioned previously that outliers could be causing this. Although these aren't outliers in the typical sense, since they simply don't belong to any single cluster, it is still a decent analysis technique.

This method wasn't great for the PCA projection of the Weather data due to its spherical nature. Since the shape is not irregular, the algorithm grouped nearly the entire dataset into a single cluster. A small change of parameters did result in some of the points on the outside of the data being grouped together, and more outliers (points not assigned to any cluster) were recognized as well.


An Overview of Clustering Techniques (Strengths and Weaknesses)


Conclusion

Distinct groupings can be identified for staple sources associated with the ski and snowbound community. Ski Resorts themselves, Weather Phenomena surrounding Ski Resorts, and Places of Business all contain characteristics which can be consolidated and from which patterns can be extracted. Ski Resorts can be organized into a hierarchical structure that reflects different distributions such as location (either country or region) or Pass Type (Ikon, Epic, Other). Weather characteristics tend to form patterns around the type of weather (snow, rain, clear, etc.) and especially around the time of year (i.e., months). Places of Business tend to form different groups based on their category, location (either country or region), and even the Pass Type (Ikon, Epic, Other) of their associated ski resort!