Correlation

Introduction

One of the first things we want to do with data is explore the relationships between variables, and one of the simplest ways to do this is to measure the correlation between them. Together with a scatterplot of the data, correlation can provide rapid insight into how variables relate, without any particularly complex analysis. A correlation coefficient is a numerical representation of the relationship between two variables, ranging from -1 to 1. A negative coefficient indicates that as one variable increases, the other tends to decrease (and vice versa), while a positive coefficient indicates that as one variable increases, the other tends to increase as well. The closer the coefficient is to -1 or 1, the stronger (tighter) the relationship. A value of 0 indicates no linear relationship between the two variables, and values close to 0 indicate only a very weak or loose one. These are illustrated in the figure below, which shows five scatterplots and their correlation coefficients (denoted as “r”):

[Figure: five scatterplots with their correlation coefficients (r)]
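To make the -1 to 1 scale concrete, here is a minimal Python sketch of how a correlation coefficient can be calculated. It assumes the Pearson (product-moment) coefficient, which this page does not state explicitly, and the function name pearson_r and the small data lists are made up purely for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of each point's deviations from the two means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]                      # illustrative values only
print(pearson_r(x, [2, 4, 6, 8, 10]))    # y rises in step with x: r = 1
print(pearson_r(x, [10, 8, 6, 4, 2]))    # y falls in step with x: r = -1
print(pearson_r(x, [2, 1, 4, 3, 5]))     # y mostly rises with x: r = 0.8
print(pearson_r(x, [2, 1, 0, 1, 2]))     # U-shaped pattern: r is (approximately) 0
```

The last line is worth a second look: the values clearly follow a pattern, but because the pattern is not a straight line the coefficient still comes out at about 0, which is one reason to check the scatterplot as well.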

If you have multiple variables and would like to run correlations on all of them (pairwise correlations), you can do this in the AURIN portal. It is important to understand that while a correlation matrix can show how much one variable changes with another, this does not necessarily mean that a change in one is causing the change in the other. This is the basis of the well-worn phrase “correlation doesn’t imply causation”. Many things correlate with each other because they are both caused by another variable. In other instances things correlate for no reason other than chance (check out Spurious Correlations for some really interesting ones!), so be wary of drawing conclusions based on correlations alone. Still, correlations are often statistically significant associations that can warrant further investigation and hypothesising.
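The sketch below illustrates both ideas in the paragraph above: it builds a small pairwise correlation matrix, and it shows two variables correlating strongly only because both depend on a third. Everything in it is an assumption made for illustration: the numbers are invented, the variable names z, x and y are hypothetical, and the calculation uses the Pearson coefficient rather than anything specific to the AURIN tool.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented numbers, for illustration only.
z = [1, 2, 3, 4, 5, 6]        # the common cause
x = [3, 3, 6, 8, 9, 13]       # roughly 2 * z, plus some noise
y = [4, 7, 7, 12, 15, 18]     # roughly 3 * z, plus some noise

variables = {"z": z, "x": x, "y": y}
names = list(variables)

# Pairwise correlations: one coefficient for every pair of variables.
matrix = {a: {b: pearson_r(variables[a], variables[b]) for b in names}
          for a in names}

# Print the matrix as tab-separated rows with a header line.
print("\t" + "\t".join(names))
for a in names:
    print(a + "\t" + "\t".join(f"{matrix[a][b]:.2f}" for b in names))
```

In this toy data x and y are strongly correlated even though neither was calculated from the other; both simply follow z.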

Inputs

Open the Correlation inputs box

[Analyse your Data > Tools > Statistical Analysis > Correlation]

[Figure: the Correlation inputs box]

The key parameter inputs are explained below:

Name: Enter the name of your correlation

Correlation Dataset Input: Select the dataset you’d like to run the correlation on

Correlation Column Input Variable Name: This is where you enter the variables that you’d like to include in the correlation matrix. Include only variables that are meaningful (e.g. don’t include unit area identification codes!)

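Correlation Use: This sets which rows of the dataset are used, and in particular how rows with missing values are treated. The options are:

  1. everything – every row is used
  2. all.obs – all observations are used
  3. complete.obs – only rows that have a value for all of the selected attributes are included
  4. na.or.complete – rows are included as for complete.obs where complete rows exist
  5. pairwise.complete.obs – for each pairwise comparison, rows are included where they have a value for both of the attributes being compared

These option names match the "use" argument of the cor function in R. Assuming the tool follows that function, "everything" gives a missing result for any pair of variables affected by a missing value, "all.obs" stops with an error if any value is missing, and "na.or.complete" behaves like "complete.obs" except that it gives missing results, rather than an error, when no row is complete.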
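The difference between complete.obs and pairwise.complete.obs is easiest to see on a small example. The Python sketch below is not the AURIN implementation; it simply applies the two row-selection rules as described above to invented data, with None standing in for a missing value and hypothetical helper names (complete_obs, pairwise_complete).

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Invented data; None marks a missing value.
data = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 5, 4, 5, None],
    "c": [None, 1, 3, 2, 5, 4],
}

def complete_obs(columns):
    """Keep only the rows that have a value in every column."""
    names = list(columns)
    rows = zip(*(columns[name] for name in names))
    kept = [row for row in rows if all(v is not None for v in row)]
    return {name: [row[i] for row in kept] for i, name in enumerate(names)}

def pairwise_complete(xs, ys):
    """Keep the rows that have a value in both of these two columns."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    return [x for x, _ in pairs], [y for _, y in pairs]

# complete.obs: rows 1 and 6 are dropped for every comparison.
complete = complete_obs(data)
r_complete = pearson_r(complete["a"], complete["b"])

# pairwise.complete.obs: for a and b, only row 6 (missing b) is dropped.
a_pair, b_pair = pairwise_complete(data["a"], data["b"])
r_pairwise = pearson_r(a_pair, b_pair)

print(f"complete.obs: {len(complete['a'])} rows, r = {r_complete:.3f}")
print(f"pairwise.complete.obs: {len(a_pair)} rows, r = {r_pairwise:.3f}")
```

Here the correlation between a and b is based on four rows under complete.obs but five rows under pairwise.complete.obs, so the two coefficients differ. With pairwise.complete.obs, different cells of the same matrix can therefore be based on different sets of rows.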

Once you have entered the parameters, add the tool, open it under Analyse your data and execute it

[Add Tools > Analyse your data > Show/Hide > Execute]

Outputs

Once you have run the correlation, open the outputs. The output window shows a tab-delimited text file, which can be quite difficult to read. The easiest thing to do is to copy and paste the output into a spreadsheet such as Excel, where the columns should align automatically. It should look something like this:

[Figure: example correlation output pasted into a spreadsheet]

It’s important to realise that the correlation matrix is symmetric: every correlation appears twice, so the values in the first column (top to bottom) are identical to the values in the first row (left to right). You only need to look at either the bottom-left or the top-right half of your matrix.
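If you would rather work with the output in code than in a spreadsheet, the following Python sketch reads tab-delimited text and lists each pair of variables once, using only the top-right half of the matrix. The exact layout of the AURIN output is an assumption here (a header row of variable names, then one labelled row per variable), and the variable names and values are hypothetical.

```python
import csv
import io

# Hypothetical output text; the real layout and values may differ.
output_text = (
    "\tincome\trent\tage\n"
    "income\t1.0\t0.62\t-0.15\n"
    "rent\t0.62\t1.0\t0.08\n"
    "age\t-0.15\t0.08\t1.0\n"
)

reader = csv.reader(io.StringIO(output_text), delimiter="\t")
header = next(reader)[1:]                      # variable names, skipping the corner cell
matrix = {row[0]: [float(v) for v in row[1:]] for row in reader}

# Check the symmetry described above: the cell (a, b) equals the cell (b, a).
for i, a in enumerate(header):
    for j, b in enumerate(header):
        assert matrix[a][j] == matrix[b][i]

# Keep only the top-right half, so each pair of variables appears once.
unique_pairs = [
    (header[i], header[j], matrix[header[i]][j])
    for i in range(len(header))
    for j in range(i + 1, len(header))
]

for a, b, r in unique_pairs:
    print(f"{a} vs {b}: r = {r}")
```

The diagonal is skipped as well, since every variable correlates perfectly with itself.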