2b. Correlation and Regression

Please note: the data presented in all course material for the statistics module are computer-generated to demonstrate the methodologies, and should not be confused with actual clinical information.
This page discusses how two measurements may relate to each other, and supports the programs in StatPgm 2b. One Group : Correlation and Regression.
The subsections are
- Correlation: sample size, Pearson's Correlation Coefficient for parametric data, and Spearman's Correlation Coefficient for nonparametric data
- Regression Analysis
- Scatter plot of the data and drawing the regression lines
Introduction
Correlation (co-relation) is how two measurements (x and y) relate to each other. This is usually expressed as the correlation coefficient, abbreviated as r, or the Greek letter rho (ρ).
Correlation coefficients have a value between -1 and 1, as shown in the diagram to the left. When the two measurements (x and y) relate to each other perfectly in the same direction, as in the green data points, the correlation coefficient ρ=1. When they relate to each other perfectly but in opposite directions, as in the red data points, ρ=-1. When there is no relationship between the measurements, as in the blue data points, ρ=0.

In most cases, however, a relationship exists but is not a perfect one, as shown in the plot to the right. This is a plot of the relationship between the crown-heel lengths and the head circumferences of newborn babies. A strong relationship exists, but it is not perfect. In this case ρ=0.6, which is a measure of the strength of the relationship.

Please note that correlation means co-relation, and merely describes the strength of the relationship between two measurements. It does not establish whether one measurement precedes, predicts, or controls the other. In most cases, two measurements are correlated because both depend on, or are controlled by, some other factor. In this example, both head circumference and crown-heel length depend on gestational age.

Pearson's Correlation Coefficient for Parametric Data
Pearson's Correlation Coefficient is calculated on the assumption that the two measurements concerned (x and y) are parametric: continuous measurements that are Normally distributed. The assumption of parametric statistics allows great flexibility in the use of the results, so that the precision of the correlation coefficient (its Standard Error and 95% confidence interval) can be calculated. The coefficient can also be transformed into a normally distributed effect, which can be incorporated into a meta-analysis.
- The two tail model is required if the direction of the correlation (whether the coefficient has a positive or negative value) is unknown at the planning stage, if the upper and lower limits of the 95% confidence interval are to be calculated, or if the coefficient will be used to compare with others or to combine with others in a meta-analysis.
- In most cases, the researcher is interested only in whether a significant correlation exists, i.e. whether the tail towards the null (ρ=0) overlaps the null value. In such cases, only the sample size for the one tail model is required.
- The default sample size calculation from most statistical packages provides the sample size for the one tail model.
Example : We wish to test whether a significant correlation exists between gestational age and birth weight, that this correlation is a positive one (weight increases as gestation advances), and, to be clinically relevant, that the correlation coefficient is 0.6 or higher.

We enter 0.6 into the calculation and find that we need a sample size of 16 cases for the one tail model, and 19 cases for the two tail model.

We are only interested in whether a positive correlation exists, not in whether birth weight decreases as gestation advances. We are also interested only in whether the correlation is significant (whether the 95% confidence interval overlaps the null value of 0), and not in the upper limit of the 95% confidence interval. We therefore require only the one tail model, and the sample size required is 16 cases.
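The example above can be sketched in code. The sketch below uses the widely used Fisher z based approximation, n = ((z_α + z_β)/C)² + 3 with C = 0.5 ln((1+r)/(1-r)); this is an assumption, as the module does not state which algorithm StatPgm uses, and rounding conventions may differ slightly from its output.

```python
import math
from statistics import NormalDist

def sample_size_correlation(r, alpha=0.05, power=0.80, two_tail=False):
    """Approximate sample size to detect a correlation of r or stronger,
    based on Fisher's z transformation (a common approximation)."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2) if two_tail else nd.inv_cdf(1 - alpha)
    z_beta = nd.inv_cdf(power)
    c = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's z of the target r
    return math.ceil(((z_alpha + z_beta) / c) ** 2 + 3)

print(sample_size_correlation(0.6))                 # one tail -> 16
print(sample_size_correlation(0.6, two_tail=True))  # -> 20 (StatPgm reports 19, likely a rounding difference)
```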
- The data are pairs of measurements, crown-heel lengths and head circumferences from newborn babies.
- The data are in two columns separated by spaces. The first column contains the crown-heel lengths and the second the head circumferences.
- Each row contains the pair from each baby.
- Pearson's correlation coefficient ρ=0.6026, Standard Error of ρ=0.1881, with degrees of freedom=18.
- Standard null hypothesis testing :
- Student's t = ρ / SE = 0.6026/0.1881 = 3.2033
- Probability of t=3.2033, df=18, p=0.0049 (two tail). Therefore p<0.05 and the result is statistically significant.
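The Standard Error and t statistic above can be reproduced from r and n alone, using SE = sqrt((1-r²)/(n-2)). A minimal sketch with only the Python standard library (the raw data pairs themselves are in StatPgm 2b and are not reproduced here; the critical value 2.101 is the tabulated two tail t for α=0.05, df=18):

```python
import math

# values reported above: r from 20 pairs of measurements, df = n - 2 = 18
r, n = 0.6026, 20

se = math.sqrt((1 - r**2) / (n - 2))   # Standard Error of r
t = r / se                             # Student's t

print(f"SE = {se:.4f}, t = {t:.4f}")   # SE = 0.1881, t = 3.2036
# tabulated two tail critical value of t for alpha=0.05, df=18 is 2.101
print("significant" if abs(t) > 2.101 else "not significant")
```

The small difference from the reported t=3.2033 comes from rounding r to four decimal places.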
In most cases of biomedical research, this represents a minor distortion of little consequence, particularly as the main concern is whether the correlation coefficient differs significantly from the null (zero) value. As meta-analysis is increasingly performed, however, a more precise representation of the Standard Error and the 95% confidence interval becomes necessary, as with the combination of many studies the accumulation of distortions could result in a misleading interpretation. A more precise representation is obtained by transforming the correlation coefficient into a normally distributed effect. The procedure is Fisher's z transformation -
Z = 0.5 ln((1+r)/(1-r)), and its Standard Error SE = 1/sqrt(n-3)
Using this transformation, for our result of n=20 and r (ρ)=0.6026, Z=0.6972 and its Standard Error=0.2425. As this transformation is based on a population model, the 95% confidence interval is Z±1.96SE = 0.6972±1.96×0.2425 = 0.2218 to 1.1725 in a two tail model. As this confidence interval does not overlap the null (0) value, the result is statistically significant. The Z values can then be reverse transformed to correlation values as follows -
Reverse transformation formula: r = (exp(2Z)-1) / (exp(2Z)+1)

Reverse transformation of Z=0.2218: r = (exp(2×0.2218)-1) / (exp(2×0.2218)+1) = 0.2182
Reverse transformation of Z=1.1725: r = (exp(2×1.1725)-1) / (exp(2×1.1725)+1) = 0.8251

The 95% confidence interval of the correlation coefficient is therefore 0.22 to 0.83. As the interval does not overlap the null (0) value, we can conclude that a significant correlation exists. In a meta-analysis, it is Z and its Standard Error that will be used, as these are normally distributed.
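The transformation, confidence interval, and reverse transformation can be sketched directly from the formulas above (a minimal illustration with the standard library; the reverse transformation is mathematically equivalent to tanh(Z)):

```python
import math

r, n = 0.6026, 20

z = 0.5 * math.log((1 + r) / (1 - r))   # Fisher's z transformation
se = 1 / math.sqrt(n - 3)               # Standard Error of Z

lo, hi = z - 1.96 * se, z + 1.96 * se   # 95% CI on the Z scale

def z_to_r(z):
    """Reverse transformation, equivalent to tanh(z)."""
    return (math.exp(2 * z) - 1) / (math.exp(2 * z) + 1)

print(f"Z = {z:.4f}, SE = {se:.4f}")                          # Z = 0.6972, SE = 0.2425
print(f"95% CI of r: {z_to_r(lo):.2f} to {z_to_r(hi):.2f}")   # 0.22 to 0.83
```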
Spearman's Correlation Coefficient for Nonparametric Data
Where one or both measurements in a relationship to be tested are not parametric (not continuous, normally distributed measurements), Spearman's Correlation Coefficient is used.
Examples of nonparametric measurements are pain scales (0=no pain, 1=minor pain, 2=moderate pain, 3=severe pain), cancer stages (0=no cancer, 1=local and pre-invasion, 2=local invasion, 3=adjacent organ and lymphatic involvement, 4=distant metastasis), Likert Scales (1=strongly disagree, 2=disagree, 3=neutral, 4=agree, 5=strongly agree), stages of labour (0=before any contractions, 1=contractions but cervix not fully dilated, 2=cervix fully dilated but baby not delivered, 3=baby delivered but placenta not delivered). When correlation involves these variables, the Spearman Correlation is used.
- Using raw data as collected :
The data are entered in the same manner as for Pearson's Correlation. The program first ranks the two measurements and creates a table, where the rows represent the ranks of the x measurements, the columns the ranks of the y measurements, and each cell contains the number of cases with those rankings. This table is then used to calculate the statistical significance of the Spearman's Correlation Coefficient, in terms of the probability of Type I Error (α, p). A significant correlation exists if p is less than 0.05.

**Example :** We will use the default example data presented in StatPgm 2b. One Group : Correlation and Regression: a survey of 8 women in the postnatal ward, who were asked to fill in 2 Likert Scales, Q1="My labour was painful" and Q2="I had poor care in labour". Each row of the data is from one mother, and the 2 columns are her responses to the two questions, where 1=Strongly Disagree, 2=Disagree, 3=Neutral, 4=Agree, and 5=Strongly Agree. Spearman's Correlation coefficient r = 0.7975, n = 8, p<0.05, so we can conclude that a significant correlation exists between perceptions of pain and quality of care in labour.
- Using a table of counts : We will again use the default example data presented in StatPgm 2b. One Group : Correlation and Regression.
If the raw data have already been converted into a table of counts, they can be entered as a matrix where all rows have the same number of columns (no blanks). The same data (transformed into a table) are used, and the same results are produced.
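The ranking step can be sketched as follows: Spearman's coefficient is Pearson's coefficient computed on the ranks, with tied values sharing their average rank. The Likert responses below are hypothetical (the actual survey data live in StatPgm 2b and are not reproduced here), and the significance test is omitted for brevity.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    rank = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1              # average of positions i..j, 1-based
        for k in range(i, j + 1):
            rank[order[k]] = avg
        i = j + 1
    return rank

def spearman(x, y):
    """Spearman's rho = Pearson's r on the ranks (handles ties)."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# hypothetical responses from 8 mothers (1=Strongly Disagree .. 5=Strongly Agree)
q1 = [5, 4, 4, 3, 2, 5, 1, 3]   # "My labour was painful"
q2 = [4, 5, 3, 3, 1, 4, 2, 2]   # "I had poor care in labour"
print(round(spearman(q1, q2), 4))
```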
Regression Analysis
Regression analysis produces the formula for a line that relates the x and y variables. The formula is y = constant + (regression coefficient)x. Various abbreviations have been used, such as y=a+bx, y=m+dx, and others, but essentially these are just notations. In this module we will use y=a+bx, where a = constant and b = regression coefficient.

The regression line is calculated to best fit the data points, in the sense that the distance between each data point and the regression line, along the y axis, is minimized. It is therefore important to consider carefully which variable is x and which is y. The x variable is called the independent variable, as how it changes is not dependent on or influenced by y. The y variable is called the dependent variable, as the model assumes that y is dependent on or influenced by x.

The independent (x) variable must be ordinal (ordered), in the sense that 3>2>1. x can therefore be binary (0,1 or 1,2), multiple groups that are ordered (1=young, 2=middle age, 3=old), or an actual measurement (the height of a person). However, x need not be parametric (continuous and normally distributed). The y variable is assumed to be parametric: a continuous measurement that is normally distributed.
Sample size calculation for regression analysis is based on the Analysis of Variance F distribution, so the algorithm differs from that used for calculating sample size for correlation. In most cases, however, similar sample sizes are obtained by either calculation. Instead of covering a new sample size calculation for regression, the module recommends that the sample size for Pearson's correlation be used for regression analysis.
The calculations use the default example data for regression analysis in StatPgm 2b. One Group : Correlation and Regression. The data consist of 22 pairs of measurements in a 2 column table, each pair occupying its own row.
- Column 1 is the x variable, and in the example represents gestational age (weeks)
- Column 2 is the y variable, and in the example data the values are birth weights in grams.

The calculations result in the regression formula Birth weight (y) = -5585 + 230 (Gestation (x)). This tells us the following:
- In our model, the dependent variable (y) is birth weight and the independent variable (x) is gestation. This means we think that birth weight is influenced by gestation, and not the other way around. It also means that we can predict birth weight from knowing the gestation, but not the other way around.
- It tells us that, within the gestational range in our data, birth weight increases by 230 grams per week.
- We are able to predict the average birth weight from gestation. For example,
- at 38 weeks, the average birth weight is -5585 + 230(38) = 3155g
- at 40 weeks, the average birth weight is -5585 + 230(40) = 3615g
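The least squares fit and the predictions above can be sketched with the standard library alone. The three (x, y) points below are hypothetical values lying exactly on the reported line, used only to verify the formulas (the real 22 data pairs are in StatPgm 2b):

```python
def linear_regression(xs, ys):
    """Ordinary least squares fit: returns (a, b) for y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    a = my - b * mx
    return a, b

# hypothetical points generated from the reported line y = -5585 + 230x
gestation = [36, 38, 40]
weight = [2695, 3155, 3615]
a, b = linear_regression(gestation, weight)
print(f"y = {a:.0f} + {b:.0f}x")   # y = -5585 + 230x
print(a + b * 38)                  # predicted average weight at 38 weeks: 3155.0
```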
As well as calculating the regression formula, the program also creates a scatter plot of the data points and draws the regression line. After some editing, the resulting plot should look like the figure above and to the right. Editing the scatter plot is covered in Graphic Editor and MacroPlot : Explanation and Help.