Cleaning Mental Rotation Data

As we learned during the first part of the bootcamp, we need to start our data analysis journey by loading and cleaning the data.

Thankfully, PsychoPy produces well-organised output files, although we will see that they contain far more information than we need.

I won’t go into every detail of the code in this tutorial, as the steps are very similar to the ones we went through during the bootcamp session.

Before we start, let’s load the required packages. Please install them if you haven’t done so yet.

# Load required packages
library(tidyverse)
library(janitor)

Now, we load the CSV file. Make sure to provide the correct path. Here we use a relative path because the file is stored inside our project directory.

# Load mental rotation data
mental_rotation_raw <- read_csv("datasets/multiple_subjects.csv")
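
If the file fails to load, the path is the usual suspect. The following is a minimal sketch of how you could check it before reading the file; the names data_path and path_ok are just illustrative, and the path assumes the same project layout used above.

```r
# Path to the data file, relative to the current working directory
# (assumes the same "datasets" folder used in this tutorial)
data_path <- "datasets/multiple_subjects.csv"

# file.exists() returns TRUE if R can find the file at that path
path_ok <- file.exists(data_path)

if (!path_ok) {
    # Relative paths are resolved from the working directory,
    # so printing it helps to spot a wrong starting point
    message("File not found. Current working directory: ", getwd())
}
```

If the message appears, compare the printed working directory with the location of your project folder.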

You should see the variable mental_rotation_raw appear in your Environment pane. If we look into it, we see that PsychoPy has a slightly annoying way of naming the variables. For instance, it uses full stops and a mix of upper and lowercase letters. Furthermore, there are lots of variables.

# Show names of the columns
knitr::kable(colnames(mental_rotation_raw), format = "pipe")

So, now we will clean the column names and select only the ones we care about. Note that we define a vector that contains all the column names we want to keep. This is just to make the code cleaner and simpler to modify. Compare this to what we did during the bootcamp to get a better understanding.

# Vector of columns we want to keep
keep_columns <- c("condition", "target_angle", "letter_angle", "expected_key", 
                  "part_response_keys", "part_response_corr", "part_response_rt", 
                  "participant")

# Use clean_names and select to achieve what we need.
# all_of() tells select that keep_columns is an external character vector
mental_rotation_clean <- mental_rotation_raw %>% 
    clean_names() %>% 
    select(all_of(keep_columns))
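
To see what these two steps do, here is a small sketch on a toy data frame. The column names in toy_raw are made up to imitate the PsychoPy style (full stops, mixed case) and are not taken from the real file. The all_of() helper makes it explicit that the columns come from an external character vector.

```r
library(dplyr)
library(janitor)

# Toy data frame with made-up PsychoPy-style column names
toy_raw <- tibble(
    targetAngle = c(0, 90),
    `part_response.rt` = c(0.81, 1.02),
    frameRate = c(60, 60)
)

# Hypothetical vector of (cleaned) column names we want to keep
toy_keep <- c("target_angle", "part_response_rt")

toy_clean <- toy_raw %>%
    # Convert all names to lowercase snake_case
    clean_names() %>%
    # Keep only the columns listed in the external vector
    select(all_of(toy_keep))

print(colnames(toy_clean))
```

A useful side effect of all_of() is that it raises an error if one of the requested columns does not exist, so a typo in the vector is caught immediately.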

We now have a nice, tidy dataset that we can work with easily. If you look at it, you can see that every column is easy to interpret, the names are meaningful, and the data are well organised. For this dataset, we don’t have to do anything else. However, remember that this is partly because PsychoPy produces reasonably clean output. In your own research, you might encounter other issues, and how to solve them will depend on your specific situation.

Let’s now look at the data to get a sense of what we are working with. We start by focusing on the effect of the binary variable condition. As we are working with a categorical independent variable and a continuous dependent variable, a nice way to visualise the data is through violin plots and box plots.

mental_rotation_clean %>%
    ggplot(aes(x = condition, y = part_response_rt, fill = condition)) +
    geom_violin() +
    geom_boxplot(width = .2) +
    labs( 
        x = "Condition", 
        y = "RT (s)",
        title = "Participants' reaction times") +
    theme_minimal()

We can already notice a couple of things. Firstly, there are some NAs (missing values) in the condition column. We know this because they appear as a separate group. Secondly, there are some very low reaction times. We can get a better idea by summarising the data.

In the following code, note that I’m doing something different from what we did during the bootcamp. As there might be some missing values even in the reaction time column, rather than removing them and then summarising the data, I ask each function to discard any NA before running. This way, we can preserve the whole data while obtaining the statistics we need.

mental_rotation_clean %>% 
    # Group results into flip and mirrored conditions. We use the dplyr function
    # group_by
    group_by(condition) %>% 
    # Get summary stats
    summarise(min_rt = min(part_response_rt, na.rm = TRUE),
              max_rt = max(part_response_rt, na.rm = TRUE),
              avg_rt = mean(part_response_rt, na.rm = TRUE), 
              IQR_rt = IQR(part_response_rt, na.rm = TRUE)) %>% 
    knitr::kable(format = "pipe")

The NA condition remains in the output because we have grouped the data by condition. This NA is in the condition column, not in the reaction time column, so it is not removed by the functions called inside summarise.
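
The toy example below sketches both points: na.rm = TRUE ignores missing reaction times inside each group, while a missing condition still forms its own group. The data and the condition labels "a" and "b" are made up for illustration.

```r
library(dplyr)

# Toy data: one missing RT in condition "a" and one row with no condition
toy_rt <- tibble(
    condition = c("a", "a", "b", "b", NA),
    rt = c(0.5, NA, 0.7, 0.9, NA)
)

toy_summary <- toy_rt %>%
    group_by(condition) %>%
    summarise(
        # The mean is computed on the non-missing values only
        avg_rt = mean(rt, na.rm = TRUE),
        # Count how many RTs were missing in each group
        n_missing = sum(is.na(rt))
    )

print(toy_summary)
```

The summary has three rows: one for each condition and one for the NA group. No rows were deleted from toy_rt itself.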

The reaction times are expressed in seconds. We can see that the minimum reaction time is 0.4223 ms, an implausible value. The maximum reaction time is fine here, and the trials were limited to 5 seconds anyway. So no problems there.

As a next step, let’s remove values that are too low. We define a boundary at 100 ms, as we expect the quickest genuine reaction times to be around that value. All trials with reaction times below this threshold will be discarded. We also remove the rows containing an NA. If you look inside the dataset, you’ll see that these are the rows referring to the instruction and end routines of our experiment. Thus, they are irrelevant to the study.
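
Because the data are stored in seconds while we tend to think about reaction times in milliseconds, it is worth making the conversion explicit. This is a small sketch: min_rt_s is the minimum quoted above written in seconds, and the variable names are illustrative.

```r
# Minimum reaction time from the summary, in seconds
# (0.4223 ms written as seconds)
min_rt_s <- 0.0004223

# Convert seconds to milliseconds
min_rt_ms <- min_rt_s * 1000

# The 100 ms boundary expressed in seconds, the unit used in the data
threshold_s <- 100 / 1000

# TRUE means this trial falls below the boundary and would be discarded
too_fast <- min_rt_s < threshold_s
print(too_fast)
```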

ATTENTION: as highlighted during the bootcamp, you need to think carefully before removing data. There must be a good, well-justified reason to do so.

# Use filter to select all the rows where the reaction time is higher than 100 ms (0.1 seconds)
# Note that in R you can skip the leading 0 in front of the decimal point
mental_rotation_trimmed <- mental_rotation_clean %>% 
    filter(part_response_rt > .1) %>% 
    # Remove rows with NA in the condition or RT columns. We also check the angle columns
    # as we will use them later so we don't want any missing value there. 
    drop_na(c("condition", "part_response_rt", "target_angle", "letter_angle"))
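
After removing data, it is good practice to check how many rows were dropped and why. The sketch below does this on a toy dataset with made-up values and shortened column names; the same counts can be computed on the real data by swapping in the actual objects and columns.

```r
library(dplyr)
library(tidyr)

# Toy data: one implausibly fast trial and one row with missing values
toy_clean <- tibble(
    condition = c("a", "b", "a", NA, "b"),
    target_angle = c(0, 90, 180, NA, 270),
    rt = c(0.0004, 0.85, 1.20, NA, 0.65)
)

toy_trimmed <- toy_clean %>%
    # filter keeps only rows where the test is TRUE,
    # so rows with a missing RT are dropped here as well
    filter(rt > .1) %>%
    drop_na(condition, rt, target_angle)

# Total number of rows removed
n_removed <- nrow(toy_clean) - nrow(toy_trimmed)

# Break the removals down by reason
n_too_fast <- sum(toy_clean$rt <= .1, na.rm = TRUE)
n_missing_rt <- sum(is.na(toy_clean$rt))

print(c(removed = n_removed, too_fast = n_too_fast, missing_rt = n_missing_rt))
```

Reporting these counts alongside your results makes the exclusion criteria transparent.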

And that’s it. We have a clean dataset ready to go!