Getting started with ... Stats

some basic statistics concepts

What is statistics?

We measure things. In practice, there are two major difficulties: 1) there are too many things of the same kind, and it is not possible to measure every one of them; 2) you cannot eliminate measurement error, so different individuals give you different measurements, and even repeated measurements of the same individual yield different results. How, then, do you describe the characteristic being measured for the whole population of such things?

Statistics was invented to solve this problem. In short, it a) describes the similarity and variation of measurements (scores) within a group of individuals (a sample) that you can measure, which is called descriptive statistics; and b) helps us estimate the characteristic in question in a larger group (the population), which we often cannot or should not measure one by one; this is called inferential statistics.

Descriptive statistics

When we obtain scores over a sample, the first things that we need to know are: what are these values? What is the maximum and the minimum value? How large are the differences between the scores? What information do the values tell us as a whole? These questions are answered by descriptive statistics.

Central tendency

Central tendency tells you where the typical or common value lies among your sample’s scores. The most widely used measures of central tendency are (in that order): the mean, the median, and the mode.

The Mean

Just the arithmetic mean of all scores in the sample.

\[ \bar{X} = \frac{\sum_{i=1}^{n}{X_{i}}}{n} \]

where \(n\) is sample size, \(X_{i}\) is the score for the individual \(i\).

In R, the mean is calculated with

mean(x, trim = 0, na.rm = FALSE, ...)
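As a small illustration of the trim and na.rm arguments, here is a sketch on made-up scores (the vector `scores` is an assumption for the example, with one missing value and one extreme value):

```r
# Made-up scores: one missing value (NA) and one extreme value (100).
scores <- c(4, 5, 6, 7, 8, NA, 100)

mean(scores)                            # NA, because a missing value is present
mean(scores, na.rm = TRUE)              # drops the NA first: 130 / 6
mean(scores, trim = 0.2, na.rm = TRUE)  # also drops 20% of scores at each end: 6.5
```

The trimmed mean ignores the extreme value here, which is why it lands much closer to the bulk of the scores.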

The Median

If you sort your sample scores in ascending order and pick the one in the middle (for an odd sample size), or the mean of the two in the middle (for an even sample size), you get the sample’s median. The median is no smaller than half of the scores in the sample and no larger than the other half.

The median is more useful than the mean for locating the typical value if you have some extreme values on either end of the score distribution, or if the scores are ordinal rather than interval values, i.e. the distance between values has no meaning.

In R, the median is calculated with

median(x, na.rm = FALSE)
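A quick sketch of the odd and even cases, using made-up vectors (`odd_sample` and `even_sample` are hypothetical names), which also shows how little an extreme value moves the median compared with the mean:

```r
# Made-up samples; even_sample adds one extreme value to odd_sample.
odd_sample  <- c(3, 1, 9, 7, 5)
even_sample <- c(3, 1, 9, 7, 5, 1000)

median(odd_sample)   # middle of 1 3 5 7 9, i.e. 5
median(even_sample)  # mean of the two middle scores 5 and 7, i.e. 6
mean(even_sample)    # pulled far upward by the extreme value
```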

The Mode

The mode is simply the value that appears most frequently in the sample scores. Note that in some samples, there may be two or more modes. These are called bimodal and multimodal distributions. Also note that the mode is not necessarily close to the mean.

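In R, you can use a function like this to find the mode(s) of a univariate sample:

Mode <- function(x) {
  ux <- unique(x)                # distinct values, in order of first appearance
  tab <- tabulate(match(x, ux))  # how often each distinct value occurs
  ux[tab == max(tab)]            # every value that reaches the highest count
}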
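To see how this behaves, the sketch below repeats the function (so the snippet runs on its own) and applies it to two made-up samples, one with a single mode and one bimodal:

```r
# Same definition as in the text, repeated so this snippet is self-contained.
Mode <- function(x) {
  ux <- unique(x)
  tab <- tabulate(match(x, ux))
  ux[tab == max(tab)]
}

# Made-up samples for illustration.
one_mode  <- c(1, 2, 2, 3, 4)
two_modes <- c(1, 2, 2, 3, 3, 4)

Mode(one_mode)   # 2
Mode(two_modes)  # 2 3 (both values are returned for a bimodal sample)
```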

Alternatively, you can use the modeest package.

Dispersion

Dispersion tells you how large the differences between scores are. Some measures here, namely variance and standard deviation, form the basis of inferential statistics later on: combined with central tendency, they give us an idea of where to expect scores in the population.

The Range

The range is the distance between the largest and the smallest value in the sample. It is often reported with the max/min values.

\[ R = X_{\text{max}} - X_{\text{min}} \]

In R, the range is calculated with

range(..., na.rm = FALSE)

Note that it reports the min and max values in a pair, not the range itself.
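To get the range as a single number, one option is to take the difference of that pair. A minimal sketch on made-up scores:

```r
# Made-up scores for illustration.
scores <- c(12, 7, 3, 18, 9)

min_max <- range(scores)   # the pair: 3 18
diff(min_max)              # the range itself: 15
max(scores) - min(scores)  # same result, written out
```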

The Interquartile Range

The interquartile range is less influenced by extreme values than the range. As when finding the median, you sort the scores, find the median to divide the sample into a lower and an upper half, then find the median of each half. If a half’s median would have to be calculated as the mean of two scores, use the larger score. The distance between the two values you find is the interquartile range.

In R, the interquartile range is calculated with

# assuming continuous sample
IQR(x, na.rm = FALSE, type = 7)
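Note that R does not split the sample into halves by hand: IQR is the distance between the 25% and 75% quantiles, and the type argument picks one of several interpolation rules, so results can differ slightly from a manual calculation. A sketch on made-up scores:

```r
# Made-up scores, already sorted for readability.
scores <- c(2, 4, 4, 5, 7, 8, 9, 11, 12)

# IQR() is the distance between the first and third quartiles.
quartiles <- quantile(scores, probs = c(0.25, 0.75))  # default type = 7
unname(diff(quartiles))  # 5
IQR(scores)              # 5, the same thing

# A different quantile rule can give a different answer on the same data.
IQR(scores, type = 6)
```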

Variance

Variance indicates the average amount of dispersion in a distribution of scores. To calculate it, you might add up the distance of each score from the mean and divide the sum by the population size. But by the definition of the mean, the positive and negative distances cancel out and the sum is always 0. So the differences are squared first.

Therefore, for the population’s variance:

\[ \sigma^{2} = \frac{\sum{(X-\mu)^{2}}}{N} \]

where \(N\) is the size of the population and \(\mu\) is the population mean.

For the sample’s variance, because the sample mean is itself calculated from all the sample scores, the degrees of freedom are 1 less than the sample size:

\[ s^{2} = \frac{\sum{(X-\bar{X})^{2}}}{n - 1} \]

where \(\bar{X}\) is the sample mean, and \(n\) the sample size.

In R, variance is calculated with

var(x, y = NULL, na.rm = FALSE)
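Note that var uses the sample formula, with \(n - 1\) in the denominator. The sketch below works through both formulas by hand on made-up scores and checks the sample version against var:

```r
# Made-up scores for illustration; their mean is 5.
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)
n <- length(scores)

deviations <- scores - mean(scores)
sum(deviations)  # 0: raw deviations cancel out, hence the squaring

population_var <- sum(deviations^2) / n        # 32 / 8 = 4
sample_var     <- sum(deviations^2) / (n - 1)  # 32 / 7

all.equal(sample_var, var(scores))  # TRUE: var() divides by n - 1
```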

Standard Deviation

SD is the typical deviation between individual scores in a distribution and the mean of the distribution.

It is simply calculated by obtaining the square root of variance.

In R, the sample’s SD is calculated with sd from the stats package.

sd(x, na.rm = FALSE)
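A one-line check, on made-up scores, that sd is the square root of the sample variance:

```r
# Made-up scores for illustration.
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

sd(scores)         # sample standard deviation
sqrt(var(scores))  # same value, from the sample variance
```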

Standardization and z-score

A z-score is a quick way to describe how far a score is from the mean, in units of SD.

\[ z = \frac{X - \bar{X}}{SD} \]

So, if \(X\) is 0.5 SD larger than the mean, its z-score is 0.5. If \(X\) is 1 SD smaller than the mean, its z-score is -1.

In R, z-scores can be calculated with

scale(x, center = TRUE, scale = TRUE)
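scale returns a one-column matrix and uses the sample SD. The sketch below, on made-up scores, compares it with the formula applied by hand:

```r
# Made-up scores for illustration.
scores <- c(2, 4, 4, 4, 5, 5, 7, 9)

# The z-score formula applied directly, using the sample SD.
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() returns a matrix with attributes; as.numeric() keeps just the values.
z_scaled <- as.numeric(scale(scores))

all.equal(z_manual, z_scaled)  # TRUE
mean(z_manual)                 # 0 (up to rounding): z-scores are centered
sd(z_manual)                   # 1: and have unit standard deviation
```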

Inferential statistics

Inferential statistics is used when we know some characteristics of the sample, and want to infer whether such characteristics exist in the larger population.

Standard error of the mean

Suppose we have a population. We randomly draw a sample and calculate its mean. Then we put the sampled individuals back into the population, randomly draw another sample, and calculate its mean again. Repeat this a few (hundred) more times, and the means you calculated form a distribution, called the sampling distribution of the mean. The standard error of the mean is the standard deviation of this sampling distribution.

Because of the central limit theorem, the sampling distribution of the mean approaches a normal distribution as the sample size grows. This makes the standard error an indispensable tool in inferential statistics.

Standard error of the mean is calculated from SD and sample size like this:

\[ S_{e} = \frac{SD}{\sqrt{n}} \]

In R, you can use the above equation, or use std.error from the plotrix package.
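The repeated-sampling idea can be simulated. The sketch below is only an illustration under stated assumptions: the population is simulated (normal, mean 50, SD 10), and the sample size of 25 and the 2000 repetitions are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only to make the run reproducible

# Assumed population for illustration: 100,000 normal scores, mean 50, SD 10.
population <- rnorm(100000, mean = 50, sd = 10)
n <- 25

# Draw 2000 samples of size n, keeping each sample's mean.
sample_means <- replicate(2000, mean(sample(population, n)))

# The SD of those means is the empirical standard error of the mean.
empirical_se <- sd(sample_means)

# The formula estimates the same quantity from a single sample.
one_sample   <- sample(population, n)
estimated_se <- sd(one_sample) / sqrt(n)

empirical_se  # close to 10 / sqrt(25) = 2
estimated_se  # an estimate of the same quantity, noisier
```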

t distributions and t-value

t distributions are a family of symmetrical distributions. They describe the probability distributions that arise when estimating the mean of a normally distributed population when the sample size is small and the population SD is unknown. The larger the sample, the more closely the t distribution resembles the normal distribution. When the sample size is around 120, the t distribution is almost identical to the normal distribution.

Similar to the z-score describing a score relative to the mean, the t-value describes how far the sample mean is from the population mean, in units of standard error, given a certain sample size (degrees of freedom). With the t-value, we can look up the probability of obtaining such a sample mean if we know the population mean. In reverse, we can tell how probable it is that the population mean lies within a given distance of the sample mean we obtained. This is what makes it possible to infer population parameters from sample statistics.

The t-value is calculated with the population mean, the sample mean and SE like this:

\[ t = \frac{\bar{X} - \mu}{S_e} \]

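In R, you can use pt in the stats package to calculate the probability of obtaining a t-value at least this extreme by chance in sampling; dt only gives the density (the height of the curve) at a given t-value, not a probability.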
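As a sketch of the whole calculation, the code below computes a t-value by hand and turns it into a two-sided probability with pt, the cumulative distribution function of the t distribution. The scores and the hypothesized population mean `mu` are made up for the example; t.test is used only as a cross-check.

```r
# Made-up sample and a hypothetical population mean.
scores <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2)
mu <- 5

n  <- length(scores)
se <- sd(scores) / sqrt(n)           # standard error of the mean
t_value <- (mean(scores) - mu) / se  # distance from mu in SE units

# Two-sided probability of a t-value at least this extreme, df = n - 1.
p_two_sided <- 2 * pt(-abs(t_value), df = n - 1)

# Cross-check against the built-in one-sample t-test.
check <- t.test(scores, mu = mu)
all.equal(unname(check$statistic), t_value)  # TRUE
all.equal(check$p.value, p_two_sided)        # TRUE
```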

Statistical significance and hypothesis testing

Suppose a theory declares that a measurement of a population should have a mean value of \(\mu\), while a study of a sample from the population yields a mean value of \(\bar{X}\) that is different from \(\mu\). Is the difference due to chance (random sampling error), or does it indicate an error in the theory?

The hypothesis that the observed difference is due to chance is called the null hypothesis, and the other is called the alternative hypothesis. In most cases, only when statistics tells us that the probability of observing a difference at least this large, if the null hypothesis were true, is less than 0.05 can we claim that the observed difference is significant, i.e. reject the null hypothesis in favor of the alternative. To calculate this probability, we calculate the t-value, then look up the corresponding probability. The larger the absolute t-value, the smaller this probability, and the stronger the evidence against the null hypothesis.

Confidence interval of the mean

Suppose a randomly selected sample of size \(n\) yields a mean of \(\bar{X}\). How can we estimate the population mean \(\mu\)? We can almost be certain that \(\bar{X} \neq \mu\). What we care about is how large the range around \(\bar{X}\) should be if we want to be 95% or 99% sure that \(\mu\) is in this range.

This range is called the 95% confidence interval or the 99% confidence interval. The formula to calculate it is

\begin{eqnarray} CI_{95} & = & \bar{X} \pm t_{95}S_e \\ CI_{99} & = & \bar{X} \pm t_{99}S_e \end{eqnarray}

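\(t_{95}\) and other t-values corresponding to the confidence required can be found by looking up a t distribution table. In R, you can use

qt(.975, df)

to get the \(t_{95}\) value, where df is the degrees of freedom (the sample size minus 1). The probability is .975 rather than .95 because the interval is two-sided: the remaining 5% is split between the two tails.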
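Putting the pieces together, here is a sketch of a 95% confidence interval computed by hand on made-up scores. Because the interval is two-sided, 2.5% is left in each tail, so the quantile requested from qt is 0.975, with n - 1 degrees of freedom. The result is cross-checked against t.test.

```r
# Made-up sample for illustration.
scores <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2)

n  <- length(scores)
se <- sd(scores) / sqrt(n)

# Critical t-value for a two-sided 95% interval.
t_95 <- qt(0.975, df = n - 1)

ci_95 <- mean(scores) + c(-1, 1) * t_95 * se  # lower and upper bound

# Cross-check against the interval reported by t.test().
check <- t.test(scores, conf.level = 0.95)
all.equal(ci_95, as.numeric(check$conf.int))  # TRUE
```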

Correlation

When we measure two or more variables, the question of correlation often pops up. In many cases, we raise correlation questions to begin with (“is the number of people entering the mall related to the weather?”).

We calculate correlation coefficients, to see how strong the correlation is between variables. The most used may be the Pearson product-moment correlation coefficient. To calculate it, we first standardize the variables \(X\) and \(Y\), to convert them to z-scores. For each case in the sample, we multiply its \(X\) variable’s z-score by its \(Y\) variable’s z-score, add up the products for all the cases, then divide the sum by the sample size:

\[ r = \frac{\sum{z_{x}z_{y}}}{N} \]

The coefficient will be between -1 and 1. A higher absolute value means a stronger correlation, and 0 means no linear correlation at all. A positive value indicates a positive correlation, and a negative value a negative one.

In R, Pearson correlation coefficient can be calculated with

cor(x, y, method = "pearson")
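The z-score recipe can be checked against cor. One detail matters when doing this in R: scale and sd use the sample SD (denominator \(n - 1\)), in which case the sum of products is divided by \(N - 1\); dividing by \(N\) matches z-scores computed with the population SD. The sketch below shows both variants on made-up data (`pop_sd` is a helper defined here for the example, not a built-in).

```r
# Made-up paired observations.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)
n <- length(x)

# Variant 1: z-scores from scale(), which uses the sample SD (n - 1).
z_x <- as.numeric(scale(x))
z_y <- as.numeric(scale(y))
r_sample_sd <- sum(z_x * z_y) / (n - 1)

# Variant 2: z-scores from the population SD (denominator n), divided by n.
pop_sd <- function(v) sqrt(mean((v - mean(v))^2))
r_pop_sd <- sum(((x - mean(x)) / pop_sd(x)) * ((y - mean(y)) / pop_sd(y))) / n

r_builtin <- cor(x, y, method = "pearson")

all.equal(r_sample_sd, r_builtin)  # TRUE
all.equal(r_pop_sd, r_builtin)     # TRUE
```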

When \(X\) is a continuous variable and \(Y\) a naturally two-category nominal variable, one can use a special case of the Pearson coefficient called the point-biserial coefficient.

In R, the point-biserial coefficient can be calculated with biserial.cor from the ltm package.

If both \(X\) and \(Y\) are dichotomous variables, one can use a phi coefficient, or use chi-square analysis. The phi coefficient is yet another special case of the Pearson coefficient.

In R, you can use phi from the psych package to calculate phi coefficients.

If one of the variables is an ordinal but not interval variable, one should use Spearman’s rho coefficient. It is, you guessed right, another specialized form of the Pearson coefficient.

In R, you can use the same cor function, with method = "spearman".

Significance of correlation

The correlation coefficient \(r\) can tell us whether a correlation exists between two variables in the sample. But is the correlation significant? Can we say the correlation exists in the population? We use the versatile t distributions again to answer the question.

The t value for the correlation is

\[ t = \frac{r - \rho}{s_r} \]

where \(r\) is the sample correlation coefficient, \(\rho\) is the population correlation coefficient, which is 0 for the null hypothesis, and \(s_r\) is the standard error of the sample correlation coefficient.

\(s_r\) can be calculated with the following formula:

\[ s_r = \sqrt{\frac{1 - r^2}{N - 2}} \]

where \(N\) is the sample size.

So, with \(\rho = 0\) under the null hypothesis, the formula for calculating \(t\) can be written as

\[ t = r \sqrt{\frac{N - 2}{1 - r^2}} \]

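Then we can obtain the probability of observing a correlation at least this strong if the null hypothesis were true by looking up a t distribution table with \(N - 2\) degrees of freedom.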
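Instead of a table, pt can supply the probability. The sketch below applies the formula \(t = r\sqrt{(N - 2)/(1 - r^2)}\) to made-up data with \(N - 2\) degrees of freedom, and cross-checks the result against cor.test.

```r
# Made-up paired observations.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 1, 4, 3, 6, 5)
n <- length(x)

r <- cor(x, y)

# t-value for the null hypothesis that the population correlation is 0.
t_value <- r * sqrt((n - 2) / (1 - r^2))

# Two-sided probability with n - 2 degrees of freedom.
p_value <- 2 * pt(-abs(t_value), df = n - 2)

# Cross-check against the built-in test for Pearson correlation.
check <- cor.test(x, y, method = "pearson")
all.equal(unname(check$statistic), t_value)  # TRUE
all.equal(check$p.value, p_value)            # TRUE
```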

The coefficient of determination

\(r^2\) actually denotes how much variance is shared between the two variables, if you look closely. So the value of \(r^2\) is interpreted as the proportion of variance in one variable that can be explained by the variance in the other.

Correlation and causality

Correlation does not imply causality. In many cases, the logical relationship between the two variables is not directly explained by their correlation. Maybe a third, unobserved variable causes both variables to change. Maybe the two have no relationship whatsoever and we are observing “artifacts”.

On the other hand, if you want to prove the existence of causality, you have to first prove there is correlation between the independent variable and the dependent variable.

Independent samples t-test and paired-samples t-test

A common task in statistics is to find out whether the difference observed between two groups of samples results from differences between the populations delineated by a grouping variable, or is just due to chance. If the variable in question is a continuous interval or ratio variable, and the grouping variable is a nominal or categorical variable that separates the samples into independent groups, e.g. men and women, non-smokers and smokers, 3rd-graders and 5th-graders, we can use the independent samples t-test to see if the differences are statistically significant.

The basic idea is the same as estimating the probability of the population parameter falling into a certain interval given the sample statistics. We take the difference between the two groups’ statistics, obtain the t-value by dividing the difference by its standard error, then look up a probability table to see how likely such a t-value is to result from chance alone.

For example, if we need to calculate whether the difference in the mean of two groups of samples is significant, we use the following equation:

\[ t = \frac{\bar{X}_1 - \bar{X}_2}{S_e} \]

\(S_e\) here is the standard error of the difference between the means. From the name you can tell it is a bit more complex than the sample’s standard error of the mean. \(S_e\) is calculated as follows:

\[ S_e = \sqrt{S_{X_1}^2 + S_{X_2}^2} \]

where \(S_{X_1}\) and \(S_{X_2}\) are the two groups’ respective standard errors of the mean … if the two groups of samples are similar in size. In some cases this can be a very big IF. When the two groups differ greatly in size or variance, or the data are not normally distributed, you may want to use a non-parametric alternative such as the Mann-Whitney U test.

When looking up the table, the degrees of freedom are the sum of the two sample sizes minus 2, because two parameters, the means of the two groups, have been estimated:

\[ df = n_1 + n_2 - 2 \]
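As a sketch, the calculation above can be carried out by hand and compared with t.test. The two groups are made up and deliberately equal in size; in that case the formula matches t.test with var.equal = TRUE. Note that t.test defaults to the Welch variant (var.equal = FALSE), which uses a different number of degrees of freedom.

```r
# Made-up scores for two independent groups of equal size.
group_a <- c(5.1, 4.9, 5.6, 5.8, 6.0, 5.3)
group_b <- c(4.4, 4.8, 5.0, 4.6, 5.2, 4.7)

# Each group's standard error of the mean.
se_a <- sd(group_a) / sqrt(length(group_a))
se_b <- sd(group_b) / sqrt(length(group_b))

# Standard error of the difference between the means.
se_diff <- sqrt(se_a^2 + se_b^2)

t_value <- (mean(group_a) - mean(group_b)) / se_diff
df      <- length(group_a) + length(group_b) - 2
p_value <- 2 * pt(-abs(t_value), df = df)

# Cross-check: with equal group sizes this matches the pooled-variance test.
check <- t.test(group_a, group_b, var.equal = TRUE)
all.equal(unname(check$statistic), t_value)  # TRUE
all.equal(check$p.value, p_value)            # TRUE
```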

The paired-samples t-test answers a similar question, but in this case each individual in one group is paired with one individual in the other group in some way. For example, to look at the effect of fathers’ TV watching habits on their eldest children, we take observations of two groups, fathers in one and their children in the other, so the sample comprises father-child pairs. Or, in longitudinal research, we observe some children when they are 3, then measure the same indicators again at the age of 7; in this case the samples are also paired, or dependent.

Again,

\[ t = \frac{\bar{X} - \bar{Y}}{S_e} \]

\(S_e\), the standard error of the difference between dependent sample means, is even more complex to calculate here. You first have to calculate the standard deviation of the differences between paired scores, where \(D\) is the difference between the two scores in each pair:

\[ SD = \sqrt{\frac{\sum{D^2} - \frac{(\sum{D})^2}{N}}{N-1}} \]

and then calculate the standard error in the good old way:

\[ S_e = \frac{SD}{\sqrt{N}} \]

In this and previous equations, \(N\) stands for the number of pairs in the sample. The degrees of freedom in this case are \(N - 1\).
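The paired calculation can be sketched the same way. The before and after scores below are made up; the code follows the SD formula for the pairwise differences, shows that it agrees with sd applied to those differences, and cross-checks the t-value against t.test with paired = TRUE.

```r
# Made-up paired measurements: the same six individuals measured twice.
before <- c(10, 12, 9, 14, 11, 13)
after  <- c(12, 14, 9, 17, 12, 16)

d <- after - before  # one difference per pair
n <- length(d)       # number of pairs

# SD of the differences, following the computational formula.
sd_d <- sqrt((sum(d^2) - sum(d)^2 / n) / (n - 1))
all.equal(sd_d, sd(d))  # TRUE: same as the ordinary sample SD of d

se_d    <- sd_d / sqrt(n)
t_value <- (mean(after) - mean(before)) / se_d
p_value <- 2 * pt(-abs(t_value), df = n - 1)

# Cross-check against the built-in paired test.
check <- t.test(after, before, paired = TRUE)
all.equal(unname(check$statistic), t_value)  # TRUE
all.equal(check$p.value, p_value)            # TRUE
```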
