## Sampling

Sampling is the process of collecting data from a subset of sites or people in order to draw conclusions about the whole population. You should explain how representative your sample is, because this affects how far the findings can be generalised.

### Sample

A limited number of things, such as a group of 100 people or 50 pebbles on a beach.

### Population

The total number of things, such as all residents of a city or all pebbles on a beach.

### Representative

How closely the relevant characteristics of the sample match the characteristics of the population.

### Bias

An inclination or prejudice towards or against a specific finding or outcome.

## How big should your sample be?

If you are planning to carry out statistical analysis of your results, you need to take enough measurements.

The minimum number of replicates is often set by the number needed to carry out a valid statistical test.

There is no maximum number of replicates; as a general rule, more is better.

### Running mean: justifying your sample size

It’s always a good thing to be able to say why you took a certain number of measurements. Why, for instance, did you count things in 30 rather than 20 quadrats?

The running mean is a simple technique for judging whether or not you have taken enough measurements or counts.

By taking a number of repeat readings in a single location, you can determine how many samples are needed to give an average that takes into account the natural variation that may occur at each sample point.

Begin by finding the mean of your first two readings, then the mean of the first three readings, then the mean of the first four readings and so on. The mean values will fluctuate each time, but will gradually settle within a closer limit, until the point is reached where adding to the sample only has a very small effect on the mean. You can assume at this point that the number of repeats is adequate.

#### Worked example

A student measured 15 seaweed fronds and collected the following data:

Length in metres (m): 1.2, 0.7, 1.9, 1.4, 1.3, 1.6, 1.8, 1.3, 0.8, 1.3, 0.8, 1.2, 1.6, 1.3, 1.4

Is this an adequate sample? Recalculate the mean for each increase in sample size:

Reading | Calculation | Running mean
---|---|---
1.2 | – | –
0.7 | (1.2 + 0.7)/2 | 0.95
1.9 | (1.2 + 0.7 + 1.9)/3 | 1.27
1.4 | (1.2 + 0.7 + 1.9 + 1.4)/4 | 1.30
1.3 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3)/5 | 1.30
1.6 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6)/6 | 1.35
1.8 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8)/7 | 1.41
1.3 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8 + 1.3)/8 | 1.40
0.8 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8 + 1.3 + 0.8)/9 | 1.33
1.3 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8 + 1.3 + 0.8 + 1.3)/10 | 1.33
0.8 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8 + 1.3 + 0.8 + 1.3 + 0.8)/11 | 1.28
1.2 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8 + 1.3 + 0.8 + 1.3 + 0.8 + 1.2)/12 | 1.28
1.6 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8 + 1.3 + 0.8 + 1.3 + 0.8 + 1.2 + 1.6)/13 | 1.30
1.3 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8 + 1.3 + 0.8 + 1.3 + 0.8 + 1.2 + 1.6 + 1.3)/14 | 1.30
1.4 | (1.2 + 0.7 + 1.9 + 1.4 + 1.3 + 1.6 + 1.8 + 1.3 + 0.8 + 1.3 + 0.8 + 1.2 + 1.6 + 1.3 + 1.4)/15 | 1.31

A good way to show this data is in the form of a graph of sample size against running mean.
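The running-mean calculation in the worked example can be done with a short script. This is a sketch using the frond lengths above; printing the values lets you see where the mean settles down.

```python
# Running mean for the seaweed-frond data in the worked example.
lengths = [1.2, 0.7, 1.9, 1.4, 1.3, 1.6, 1.8, 1.3,
           0.8, 1.3, 0.8, 1.2, 1.6, 1.3, 1.4]

running_means = []
total = 0.0
for i, value in enumerate(lengths, start=1):
    total += value                  # cumulative sum of the first i readings
    running_means.append(total / i)  # mean of the first i readings

for n, mean in enumerate(running_means, start=1):
    print(f"n = {n:2d}  running mean = {mean:.2f} m")
```

From about the 12th reading onwards the mean only moves by a hundredth of a metre or so, which supports the conclusion that 15 readings is an adequate sample here.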

## Variables

The things that you are interested in measuring are called variables. There are two types:

- The **independent variable** is not affected by other things. It is *independent* of other variables.
- The **dependent variable** is affected by other things. It is *dependent* on other variables.

An independent variable causes a change in a dependent variable. A dependent variable cannot cause a change in an independent variable.

There are four measurement scales for variables:

- **Nominal**: variables that are not numerical, e.g. *categories* like gender and ethnicity.
- **Ordinal**: variables where order has meaning, but the difference between values is not important, e.g. *ranks* like 1st, 2nd and 3rd, or the ACFOR scale.
- **Interval**: variables where the difference between values is important, e.g. *actual numbers* like the temperature in °C.
- **Ratio**: interval data with a natural (absolute) zero point, e.g. time in seconds. Temperature in °C is not ratio data, since 0 °C does not mean no heat.

### Matched and unmatched data

Your data is matched if a piece of data from one set goes with only one piece of data from the other set. For example you might be measuring temperature of the sea with depth. A specific temperature recording would only be associated with one specific depth.

Your data is unmatched if there is no reason to associate a piece of data from one set with any particular piece of data from the other set. For example you might be measuring the heights of vegetation on trampled and untrampled parts of a path. There is no connection between any of the measurements from the trampled part and the untrampled part.

## Statistical tests

### T test

A T test will tell you whether the means of two sets of normally distributed, unmatched, continuous data, measured at interval level, are significantly different from one another.

For any T test, the null hypothesis will be: there is no significant difference between the means of the two sets of data.
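As a sketch, the T test statistic for two unmatched samples can be computed with the standard library alone. The limpet data below are invented for illustration; in practice you would compare the result against a critical-value table (or use a statistics package).

```python
import statistics

def t_statistic(sample_a, sample_b):
    """Student's t for two unmatched samples (assumes roughly equal variances)."""
    n1, n2 = len(sample_a), len(sample_b)
    mean1, mean2 = statistics.mean(sample_a), statistics.mean(sample_b)
    var1, var2 = statistics.variance(sample_a), statistics.variance(sample_b)
    # Pooled variance combines the spread of both samples.
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    return (mean1 - mean2) / (pooled * (1 / n1 + 1 / n2)) ** 0.5

# Hypothetical data: limpet shell heights (mm) on two shores.
sheltered = [12.1, 13.4, 11.8, 12.9, 13.0, 12.5]
exposed = [10.2, 11.1, 10.8, 11.5, 10.9, 11.0]
t = t_statistic(sheltered, exposed)
# Compare |t| against the critical value for n1 + n2 - 2 = 10
# degrees of freedom at p = 0.05 (about 2.23 from tables).
```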

### Spearman’s rank correlation coefficient

Spearman’s rank correlation coefficient will tell you whether 2 variables are correlated. In other words, does one variable change as the other one changes?

It will tell you whether the relationship is positive (both go up together) or negative (one goes up as the other goes down) and the strength of any correlation. It assumes that any relationship is roughly monotonic, i.e. consistently rising or consistently falling.

For any Spearman’s Rank test, the null hypothesis will be: there is no significant correlation between the 2 variables.

### Chi-squared test

A chi-squared test can see if an observed set of data (which has to be counts of things in categories, or *frequencies*) differs significantly from what might be expected.

For any chi-squared test, the null hypothesis will be: There is no significant difference between the observed and the expected frequencies.

### Mann-Whitney U test

The Mann-Whitney U test tells you whether the median values of two sets of data are significantly different from one another.

It has the advantage that the data does not have to be normally distributed and you can use it on smallish quantities of count data.

For any Mann-Whitney U test, the null hypothesis will be: There is no significant difference between the medians of the two sets of data.

## Spearman’s Rank Correlation Test

Spearman’s Rank Correlation is a statistical test of whether there is a significant relationship between two sets of data.

The Spearman’s Rank Correlation test can only be used if there are at least 10 (ideally 15 or more) pairs of data.

There are three steps to take when using the Spearman’s Rank Correlation Test:

### Step 1. State the null hypothesis

There is no significant relationship between _______ and _______

### Step 2. Calculate the Spearman’s Rank Correlation Coefficient

\(r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)}\)

where:

- \(r_s\) = Spearman’s Rank correlation coefficient
- \(D\) = difference between ranks
- \(n\) = number of pairs of measurements

### Step 3. Test the significance of the result

Compare the value of \(r_s\) that you have calculated against the critical value for \(r_s\) at a confidence level of 95% / significance value of p = 0.05.

If \(r_s\) (ignoring any sign) is equal to or above the critical value (p = 0.05), then REJECT the null hypothesis. There is a SIGNIFICANT relationship between the 2 variables.

A positive sign for \(r_s\) indicates a significant positive relationship and a negative sign indicates a significant negative relationship.

If \(r_s\) (ignoring any sign) is less than the critical value, ACCEPT the null hypothesis. There is NO SIGNIFICANT relationship between the 2 variables.
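The three steps above can be sketched in a short script. The transect data are invented for illustration; note that the simple \(r_s\) formula strictly assumes no tied values, and giving ties the average of their ranks (as below) is the usual classroom workaround.

```python
def ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    sorted_vals = sorted(values)
    result = []
    for v in values:
        first = sorted_vals.index(v) + 1        # rank of first occurrence
        count = sorted_vals.count(v)            # number of tied values
        result.append(first + (count - 1) / 2)  # average rank over the ties
    return result

def spearman_rs(x, y):
    """r_s = 1 - 6*sum(D^2) / (n(n^2 - 1)), D = difference in ranks."""
    n = len(x)
    rx, ry = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d_squared) / (n * (n ** 2 - 1))

# Hypothetical data: distance along a transect (m) vs. vegetation height (cm).
distance = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45]
height = [2, 3, 5, 4, 8, 9, 12, 11, 15, 14]
rs = spearman_rs(distance, height)
# Compare rs against the critical value for n = 10 at p = 0.05
# (0.648 from tables): here rs exceeds it, so reject the null hypothesis.
```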

## Chi-squared test

Chi squared is a statistical test used to test for a significant difference, goodness of fit, or an association between observed and expected values.

\(\chi^2 = \sum \frac{(O - E)^2}{E}\)

The chi squared test can only be used if:

- the data are in the form of frequencies in a number of categories (i.e. nominal data)
- there are more than 20 observations in total
- the observations are independent: one observation does not affect another

There are three steps to take when using the chi squared test:

### Step 1. State the null hypothesis

There is no significant association between _______ and _______

### Step 2. Calculate the chi squared statistic

\(\chi^2 = \sum \frac{(O - E)^2}{E}\)

where:

- \(\chi^2\) = chi squared statistic
- \(O\) = observed values
- \(E\) = expected values

### Step 3. Test the significance of the result

Compare your calculated value of \(\chi^2\) against the critical value for \(\chi^2\) at a confidence level of 95% / significance value of P = 0.05, and appropriate degrees of freedom.

\(\mathsf{Degrees\;of\;freedom = (number\;of\;rows - 1) \times (number\;of\;columns - 1)}\)

If chi squared is equal to or greater than the critical value, REJECT the null hypothesis. There is a SIGNIFICANT difference between the observed and expected values.

If chi squared is less than the critical value, ACCEPT the null hypothesis. There is NO SIGNIFICANT difference between the observed and expected values.
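The chi squared statistic is a one-line sum once the expected values are known. The snail counts below are invented for illustration; this is a one-way goodness-of-fit example, where degrees of freedom = number of categories − 1 (the rows × columns formula above applies to contingency tables).

```python
def chi_squared(observed, expected):
    """Chi squared = sum((O - E)^2 / E) over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical data: counts of a snail species in four habitat zones.
# Null hypothesis: the snails are evenly distributed, so each zone's
# expected count is the total divided by four.
observed = [30, 14, 34, 45]
total = sum(observed)
expected = [total / 4] * 4
chi2 = chi_squared(observed, expected)
# Degrees of freedom = 4 - 1 = 3; the critical value at p = 0.05 is
# 7.81, so a chi squared above that rejects the null hypothesis.
```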

## Mann Whitney U test

Mann Whitney U is a statistical test used to test whether there is a significant difference between the medians of two sets of data.

The Mann Whitney U test can only be used if there are at least 6 values in each of the two data sets. It does not require a normal distribution.

There are three steps to take when using the Mann Whitney U test:

### Step 1. State the null hypothesis

There is no significant difference between _______ and _______

### Step 2. Calculate the Mann Whitney U statistic

\(U_1 = n_1 \times n_2 + 0.5\,n_2(n_2 + 1) - \sum R_2\)

\(U_2 = n_1 \times n_2 + 0.5\,n_1(n_1 + 1) - \sum R_1\)

where:

- \(n_1\) is the number of values of \(x_1\)
- \(n_2\) is the number of values of \(x_2\)
- \(R_1\) is the ranks given to \(x_1\)
- \(R_2\) is the ranks given to \(x_2\)

### Step 3. Test the significance of the result

Compare the value of U against the critical value for U at a confidence level of 95% / significance value of P = 0.05.

If U (the smaller of \(U_1\) and \(U_2\)) is equal to or smaller than the critical value (p = 0.05), then REJECT the null hypothesis. There is a SIGNIFICANT difference between the 2 data sets.

If U is greater than the critical value, then ACCEPT the null hypothesis. There is NOT a significant difference between the 2 data sets.
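The rank-sum formulas in Step 2 can be sketched as follows. The vegetation data are invented for illustration (they echo the trampled-path example from the matched/unmatched section), and tied values share the average of their ranks.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistics from the rank-sum formulas above (ties share average ranks)."""
    combined = sorted(sample_a + sample_b)  # rank both samples together

    def rank(v):
        first = combined.index(v) + 1       # rank of first occurrence
        count = combined.count(v)           # number of tied values
        return first + (count - 1) / 2      # average rank over the ties

    n1, n2 = len(sample_a), len(sample_b)
    r1 = sum(rank(v) for v in sample_a)
    r2 = sum(rank(v) for v in sample_b)
    u1 = n1 * n2 + 0.5 * n2 * (n2 + 1) - r2
    u2 = n1 * n2 + 0.5 * n1 * (n1 + 1) - r1
    return u1, u2

# Hypothetical data: vegetation heights (cm) on trampled vs. untrampled ground.
trampled = [2, 3, 5, 4, 6, 3]
untrampled = [9, 12, 8, 11, 10, 7]
u1, u2 = mann_whitney_u(trampled, untrampled)
u = min(u1, u2)  # compare the smaller U against the critical value
# For n1 = n2 = 6 at p = 0.05 the critical value is 5; a U at or
# below it rejects the null hypothesis.
```

A quick sanity check on any calculation: \(U_1 + U_2\) should always equal \(n_1 \times n_2\).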