
Danielle Navarro - Stalk and Feather
NO = Learning statistics
YES = Getting accustomed to R and grasping its use
NO = Learning statistics






janitor, dplyr)%>%)ggplot
Let’s load, clean up names and select columns all in one go with the pipe (%>%)
Let’s load, clean up names and select columns all in one go with the pipe (%>%)
Let’s load, clean up names and select columns all in one go with the pipe (%>%)
Let’s load, clean up names and select columns all in one go with the pipe (%>%)
space_data <- space_data %>%
group_by(movie) %>%
summarise(frequency_space = sum(normalised_frequency)) %>%
ungroup()
west_data <- west_data %>%
group_by(movie) %>%
summarise(frequency_west = sum(normalised_frequency)) %>
ungroup()
winter_data <- winter_data %>%
group_by(movie) %>%
summarise(frequency_winter = sum(normalised_frequency)) %>%
ungroup()In R we usually want to work with data stored in one dataset
Start with a blank canvas
Add a primer - the x and y axis
Add the subject
Make it stylish
Layers stack on top of each other!
We can look at an example (if we have time)
| id | Q1 | Q2 | Q3 |
|---|---|---|---|
| 1 | 43.4947976 | -4.1284871 | 139.7134 |
| 2 | 0.5145435 | 5.1488069 | 134.8856 |
| 3 | 42.3330507 | -0.3975525 | 146.8220 |
| id | question | score |
|---|---|---|
| 1 | Q1 | 43.4947976 |
| 1 | Q2 | -4.1284871 |
| 1 | Q3 | 139.7134024 |
| 2 | Q1 | 0.5145435 |
| 2 | Q2 | 5.1488069 |
| 2 | Q3 | 134.8855685 |
| 3 | Q1 | 42.3330507 |
| 3 | Q2 | -0.3975525 |
| 3 | Q3 | 146.8220343 |
# Boxplots with scatterplot layer on top
box_with_scatter <- long_data %>%
# Remove winter cause it's distribution ruins the visualisation here
filter(category != 'frequency_winter') %>%
# Set up canvas
ggplot(aes(x = category, y = frequency)) +
# First layer: boxplot
geom_violin(aes(fill=category)) +
# Second layer: scatterplot
geom_point(position = position_jitter(width=0.1), size=4, alpha=.60 ) +
# Title
labs(title = 'Boxplot + Scatterplot') +
# Change x labels
scale_x_discrete(labels=c('space', 'west', 'winter')) +
# Remove legend
theme_minimal() +
theme(legend.position = 'none')# Boxplots with scatterplot layer on top
scatter_with_box <- long_data %>%
# Remove winter cause it's distribution ruins the visualisation here
filter(category != 'frequency_winter') %>%
# Set up canvas
ggplot(aes(x = category, y = frequency)) +
# First layer: scatterplot
geom_point(position = position_jitter(width=0.1), size=4, alpha=.60 ) +
# Second layer: boxplot
geom_boxplot(aes(fill=category)) +
# Title
labs(title = 'Scatterplot + Boxplot') +
# Change x labels
scale_x_discrete(labels=c('space', 'west', 'winter')) +
# Remove legend
theme_minimal() +
theme(legend.position = 'none')




k. (2023, April 13). In Wikipedia. https://en.wikipedia.org/wiki/K-means_clustering
EEG data segmentation (microstates)
Michel, C. M., & Koenig, T. (2018). EEG microstates as a tool for studying the temporal dynamics of whole-brain neuronal networks: A review. NeuroImage, 180(Pt B), 577–593. https://doi.org/10.1016/j.neuroimage.2017.11.062
We start by considering only 2 dimensions of our data (Space & Winter)
# kmeans model on two dimensions
km_model_space_winter <- kmeans(category_data[, c('frequency_space', 'frequency_winter')], centers = 2)
# Add cluster info to data
cluster_results <- cbind(category_data, "cluster"= km_model$cluster)
# Plot the result
cluster_results %>%
ggplot(aes(x = frequency_space, y = frequency_winter, colour = as.factor(cluster))) +
geom_point(size=3) +
labs(x = 'Space',
y = 'Winter',
colour = 'Cluster') +
theme_minimal()We start by considering only 2 dimensions of our data (Space & Winter)
# kmeans model on two dimensions
km_model_space_winter <- kmeans(category_data[, c('frequency_space', 'frequency_winter')], centers = 2)
# Add cluster info to data
cluster_results <- cbind(category_data, "cluster"= km_model$cluster)
# Plot the result
cluster_results %>%
ggplot(aes(x = frequency_space, y = frequency_winter, colour = as.factor(cluster))) +
geom_point(size=3) +
labs(x = 'Space',
y = 'Winter',
colour = 'Cluster') +
theme_minimal()We start by considering only 2 dimensions of our data (Space & Winter)
# kmeans model on two dimensions
km_model_space_winter <- kmeans(category_data[, c('frequency_space', 'frequency_winter')], centers = 2)
# Add cluster info to data
cluster_results <- cbind(category_data, "cluster"= km_model$cluster)
# Plot the result
cluster_results %>%
ggplot(aes(x = frequency_space, y = frequency_winter, colour = as.factor(cluster))) +
geom_point(size=3) +
labs(x = 'Space',
y = 'Winter',
colour = 'Cluster') +
theme_minimal()You should see something like this (possibly different though!)
In this data set we have three features to feed our kmeans (space, west and winter). Let’s use them all!
This is not essential (3D plotting in R is annoying/tricky)
This is not essential (3D plotting in R is tricky)
?
A simple way is to try to maximise the “homogeneity” within each group. We can operationalise “homogeneity” through the Sum of Squares.
But… we don’t want to overfit the data
iterative_kmeans <- function(in_data, k_max) {
# Create a vector to store the within-cluster sum of squares for each iteration
within_sum_squares <- c()
# Repeat the clustering changing systematically the number of clusters
for (iCluster in 2:k_max) {
# Compute k-means
current_km_model <- kmeans(in_data, centers = iCluster)
# Extract within-cluster sum of squares
within_sum_squares <- append(within_sum_squares, current_km_model$tot.withinss)
}
# Create a data frame containing the relevant information
clustering_data <- data.frame(cbind('sum_of_squares' = within_sum_squares, "clusters" = 2:k_max))
# Return the data frame
return(clustering_data)
}iterative_kmeans <- function(in_data, k_max) {
# Create a vector to store the within-cluster sum of squares for each iteration
within_sum_squares <- c()
# Repeat the clustering changing systematically the number of clusters
for (iCluster in 2:k_max) {
# Compute k-means
current_km_model <- kmeans(in_data, centers = iCluster)
# Extract within-cluster sum of squares
within_sum_squares <- append(within_sum_squares, current_km_model$tot.withinss)
}
# Create a data frame containing the relevant information
clustering_data <- data.frame(cbind('sum_of_squares' = within_sum_squares, "clusters" = 2:k_max))
# Return the data frame
return(clustering_data)
}iterative_kmeans <- function(in_data, k_max) {
# Create a vector to store the within-cluster sum of squares for each iteration
within_sum_squares <- c()
# Repeat the clustering changing systematically the number of clusters
for (iCluster in 2:k_max) {
# Compute k-means
current_km_model <- kmeans(in_data, centers = iCluster)
# Extract within-cluster sum of squares
within_sum_squares <- append(within_sum_squares, current_km_model$tot.withinss)
}
# Create a data frame containing the relevant information
clustering_data <- data.frame(cbind('sum_of_squares' = within_sum_squares, "clusters" = 2:k_max))
# Return the data frame
return(clustering_data)
}iterative_kmeans <- function(in_data, k_max) {
# Create a vector to store the within-cluster sum of squares for each iteration
within_sum_squares <- c()
# Repeat the clustering changing systematically the number of clusters
for (iCluster in 2:k_max) {
# Compute k-means
current_km_model <- kmeans(in_data, centers = iCluster)
# Extract within-cluster sum of squares
within_sum_squares <- append(within_sum_squares, current_km_model$tot.withinss)
}
# Create a data frame containing the relevant information
clustering_data <- data.frame(cbind('sum_of_squares' = within_sum_squares, "clusters" = 2:k_max))
# Return the data frame
return(clustering_data)
}iterative_kmeans <- function(in_data, k_max) {
# Create a vector to store the within-cluster sum of squares for each iteration
within_sum_squares <- c()
# Repeat the clustering changing systematically the number of clusters
for (iCluster in 2:k_max) {
# Compute k-means
current_km_model <- kmeans(in_data, centers = iCluster)
# Extract within-cluster sum of squares
within_sum_squares <- append(within_sum_squares, current_km_model$tot.withinss)
}
# Create a data frame containing the relevant information
clustering_data <- data.frame(cbind('sum_of_squares' = within_sum_squares, "clusters" = 2:k_max))
# Return the data frame
return(clustering_data)
}iterative_kmeans <- function(in_data, k_max) {
# Create a vector to store the within-cluster sum of squares for each iteration
within_sum_squares <- c()
# Repeat the clustering changing systematically the number of clusters
for (iCluster in 2:k_max) {
# Compute k-means
current_km_model <- kmeans(in_data, centers = iCluster)
# Extract within-cluster sum of squares
within_sum_squares <- append(within_sum_squares, current_km_model$tot.withinss)
}
# Create a data frame containing the relevant information
clustering_data <- data.frame(cbind('sum_of_squares' = within_sum_squares, "clusters" = 2:k_max))
# Return the data frame
return(clustering_data)
}set.seed(27)
# Collect the results and plot them as a scree plot
data_scree <- iterative_kmeans(category_data[, 2:4], 10)
# Plot data
data_scree %>%
ggplot(aes(x = clusters, y = sum_of_squares)) +
geom_point() +
geom_path() +
labs(x = 'Number of Clusters',
y = 'Sum of Squares (within groups)',
title = 'Scree Plot') +
theme_minimal()Try to plot the results in 2D (space vs winter/space vs west/winter vs west). Feel like a challenge? Try using plotly to plot in 3D
Points should be coloured by group