# Correlation

## Introduction

One of the first things we want to do with data is explore the relationships between variables, and one of the simplest ways to do this is to measure the correlation between them. Together with a scatterplot of the data, correlation can provide rapid insight into how two variables relate, without any particularly complex analysis.

A correlation coefficient is a numerical representation of the relationship between two variables, ranging from -1 to 1. A negative coefficient indicates that as one value increases, the other decreases; a positive coefficient indicates that as one value increases, the other increases too. The closer the coefficient is to -1 or 1, the stronger (tighter) the relationship. A value of 0 indicates no linear relationship between the two variables, and values close to 0 indicate only a very weak or loose relationship. These are illustrated in the figure below, which shows five scatterplots and correlation coefficients (denoted as “r”):
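For readers who want to see the numbers behind such plots, here is a minimal Python sketch of how a correlation coefficient can be computed. It assumes the Pearson coefficient (the one usually denoted “r”); this page does not state which coefficient the portal tool uses. The `pearson_r` helper and the sample values are hypothetical and only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustrative values (not from any AURIN dataset)
hours = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]    # increases exactly in step with hours
falling = [10, 8, 6, 4, 2]   # decreases exactly in step with hours
mixed = [2, 1, 4, 3, 5]      # tends to increase, but not perfectly

print(pearson_r(hours, rising))   # 1.0
print(pearson_r(hours, falling))  # -1.0
print(pearson_r(hours, mixed))    # 0.8
```

The first two pairs lie exactly on a straight line, so they reach the extremes of 1 and -1; the third pair follows the trend only loosely and gives a smaller positive value.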

If you have multiple variables and would like to run correlations on all of them (pairwise correlations), you can do this in the AURIN portal. It is important to understand that while a correlation matrix can show how much one variable changes with another, this does not necessarily mean that the change in one is **causing** the change in the other, which is the basis of the well-worn phrase “correlation doesn’t imply causation”. Many things correlate with each other because they are both caused by another variable. In other instances, things correlate for no reason other than chance (check out Spurious Correlations for some really interesting ones!), so be wary of drawing conclusions based on correlations alone. Still, correlations are often statistically significant associations that can warrant further investigation and hypothesising.

## Inputs

Open the tool from **Analyse your Data > Tools > Statistical Analysis > Correlation**. The key parameter inputs are explained below:

**Name:** Enter the name of your correlation.

**Correlation Dataset Input:** Enter the dataset you’d like to run the correlation on.

**Correlation Column Input Variable Name:** This is where you enter the variables that you’d like to include in the correlation matrix. Include only variables that are meaningful (e.g. don’t include unit area identification codes!).

**Correlation Use:** This controls which rows of the dataset are used when some values are missing. The option names match those of the `use` argument of R’s `cor` function:

- everything – every row is used, including rows with missing values
- all.obs – all observations are used; the data are expected to have no missing values
- complete.obs – only rows that have a value for all of the selected attributes are included
- na.or.complete – as complete.obs, except that a null result is returned if no row is complete
- pairwise.complete.obs – for each pairwise comparison, rows are included where they have a value for both of those attributes
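The practical difference between complete.obs and pairwise.complete.obs can be sketched in plain Python. This is an illustration only, not the portal’s implementation: the `correlate` helper, the sample rows, and the use of `None` to mark a missing value are all hypothetical, and the Pearson coefficient is assumed.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlate(rows, i, j, use="complete.obs"):
    """Correlate columns i and j; return (rows kept, coefficient)."""
    if use == "complete.obs":
        # Keep only rows with a value in every column
        kept = [row for row in rows if all(v is not None for v in row)]
    elif use == "pairwise.complete.obs":
        # Keep rows with a value in the two columns being compared
        kept = [row for row in rows
                if row[i] is not None and row[j] is not None]
    else:
        raise ValueError(f"option not covered by this sketch: {use}")
    xs = [row[i] for row in kept]
    ys = [row[j] for row in kept]
    return len(kept), pearson_r(xs, ys)

# Hypothetical dataset with three attributes; None marks a missing value
rows = [
    (1, 2, 9),
    (2, 4, None),
    (3, 6, 7),
    (4, 8, 8),
    (5, None, 5),
]

n_complete, r_complete = correlate(rows, 0, 2, "complete.obs")
n_pairwise, r_pairwise = correlate(rows, 0, 2, "pairwise.complete.obs")
print(n_complete, round(r_complete, 3))
print(n_pairwise, round(r_pairwise, 3))
```

In this sketch the last row is missing only its middle attribute, so it is dropped under complete.obs but still contributes to the comparison of the first and third attributes under pairwise.complete.obs. The two options can therefore give different coefficients for the same pair of variables.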

Once you have entered the parameters, add the tool, open it under Analyse your data, and execute it (**Add Tools > Analyse your data > Show/Hide > Execute**).

## Outputs

It’s important to realise that the correlation matrix is symmetric: the values in the first column, top to bottom, are identical to the values in the first row, left to right, and likewise for every other row and column. You therefore only need to look at either the bottom-left or the top-right half of your matrix.
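The symmetry can be checked with a small Python sketch that builds a full pairwise matrix and then reads only the top-right half. The column names and values are hypothetical, the Pearson coefficient is assumed, and the nested-dictionary layout is just one convenient representation, not the portal’s output format.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical attributes for five areas
columns = {
    "income":   [40, 55, 60, 72, 90],
    "rent":     [300, 340, 390, 410, 520],
    "distance": [30, 22, 25, 12, 5],
}
names = list(columns)

# Full matrix: every variable against every variable
matrix = {a: {b: pearson_r(columns[a], columns[b]) for b in names}
          for a in names}

# Top-right half only: each pair appears once, diagonal skipped
upper = [(a, b, matrix[a][b])
         for i, a in enumerate(names)
         for b in names[i + 1:]]

for a, b, r in upper:
    print(f"{a} vs {b}: r = {r:.3f}")
```

Three variables give a 3 × 3 matrix of nine values, but only three of them are distinct pairwise results; the diagonal is each variable correlated with itself, and the bottom-left half mirrors the top-right.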