8/8/2019 Correlacion y Regresion 2
Correlation & Regression
Do heavier people burn more energy? Does
wine consumption cause a decrease in heart disease?
These questions reflect a desire to understand the
relationship between two variables.
What we need:
1. A plot/graph to view the relationship
2. Characteristics to describe
3. Measures of the characteristics
4. Method to make inferences about the relationship
The graph: a scatter plot
Y axis: response variable (dependent variable)
X axis: explanatory variable (independent variable)
Do heavier people burn more energy?
Response: metabolic rate
Explanatory: weight or mass
Does wine consumption cause a decrease in heart
disease?
Response: death rate from heart disease
Explanatory: wine consumption
Do heavier people burn more energy?
[Figure: scatter plot of lean body mass vs. metabolic rate. X axis: Mass (kg), 30 to 60. Y axis: Rate (cal), 1000 to 2000.]
Is wine good for your heart?
[Figure: scatter plot of wine consumption vs. heart disease death rate (per 100,000). X axis: Alcohol (wine consumption), 0 to 9. Y axis: hrt_death rate, 100 to 300.]
Interpreting: characteristics to look for
Patterns:
Form (clusters, scatter, linear, ...)
Direction (positive, negative)
Strength (how closely the points follow the form)
Deviations:
Outliers
Interpret the last two scatter plots.
Options to consider:
Adding a categorical variable
Scatter plot: relationship between quantitative variables
Form: linear is probably the most common form
Strength: we need a numerical measure of the strength of a linear relationship because our eyes can deceive us!
[Figure: two scatter plots, each labelled "Strength?"]
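Correlation
Measures the direction and strength of a linear relationship.
Standardised value of each x: $z_x = \dfrac{x - \bar{x}}{s_x}$
Standardised value of each y: $z_y = \dfrac{y - \bar{y}}{s_y}$
Correlation is an average product of standardised values: $r = \dfrac{1}{n-1} \sum z_x z_y$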
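As a hedged illustration of "an average product of standardised values", the sketch below computes r for the wine data. It assumes the usual sample form of the definition (divide by n - 1 and use sample standard deviations). The `alcohol` and `death_rate` lists are copied from the Minitab listing of the 19 observations later in these slides, and the function name `correlation` is only illustrative.

```python
from statistics import mean, stdev

# Wine consumption and heart disease death rate (per 100,000) for the
# 19 observations in the Minitab listing (columns Alcohol and hrt_deat).
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
death_rate = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
              167, 266, 227, 86, 207, 115, 285, 199, 172]


def correlation(xs, ys):
    """Pearson r as the average product of standardised values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    # "Average" here means dividing by n - 1, matching the sample s_x, s_y.
    return sum(products) / (n - 1)


r = correlation(alcohol, death_rate)
print(f"r = {r:.3f}")
```

With these data the result should agree, to three decimals, with the Pearson correlation of -0.843 that Minitab reports for Alcohol and hrt_death rate.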
Correlation = r
Requires quantitative variables
Describes linear relationships only
r has no units
r is always between -1 and 1
Positive r = positive association
Negative r = negative association
r = 0: no linear association
r is influenced by outliers
Correlations: Mass (kg), Rate (cal)
Pearson correlation of Mass(kg) and Rate(cal) = 0.865
P-Value = 0.000
[Figure: scatter plot of lean body mass (kg) vs. metabolic rate (cal). Do heavier people burn more energy?]
[Figure: scatter plot of weight vs. metabolic rate, with males (+) and females (o) plotted separately. X axis: Mass (kg), 30 to 60. Y axis: Rate (cal), 1000 to 2000.]
Correlations: Mass (kg)_F, Rate (cal)_F
Pearson correlation of Mass(kg)_F and Rate(cal)_F = 0.876
Correlations: Mass (kg)_M, Rate (cal)_M
Pearson correlation of Mass (kg)_M and Rate (cal)_M = 0.592
Correlations: Alcohol, hrt_death rate
Pearson correlation of Alcohol and hrt_death rate = -0.843
[Figure: scatter plot of wine consumption vs. heart disease death rate (per 100,000). Is wine good for your heart?]
Correlations: Wine consumption, heart death rate
Pearson correlation of wine consumption and heart death rate = -0.648
[Figure: scatter plot of heart disease death rate vs. wine consumption, outliers removed. X axis: wine consumption, 1 to 4. Y axis: death rate, 150 to 300.]
Linear relationships: using a LINE
[Figure: scatter plot of wine consumption vs. heart disease death rate (per 100,000). Is wine good for your heart?]
We can summarise an overall linear form with a line; the
best line is called the Regression Line.
[Figure: fitted regression line, death rate vs. wine consumption.]
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %
A regression line describes how a response variable changes as an
explanatory variable changes. We can now predict a value of y when
given an x.
What would be the death rate
due to heart disease if the
average daily consumption of
wine was 3 glasses?
191.66 deaths per 100,000
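The prediction above is just the fitted equation evaluated at x = 3. A minimal sketch, using the intercept and slope reported by Minitab on the slide; the constant and function names are illustrative only:

```python
# Coefficients taken from the fitted line reported on the slide:
# death rate = 260.563 - 22.9688 * wine consumption
INTERCEPT = 260.563
SLOPE = -22.9688


def predict_death_rate(wine_consumption):
    """Predicted heart disease deaths per 100,000 for a given consumption."""
    return INTERCEPT + SLOPE * wine_consumption


prediction = predict_death_rate(3)
print(f"{prediction:.2f} deaths per 100,000")
```

The arithmetic is 260.563 - 22.9688 × 3 = 260.563 - 68.9064 = 191.6566, which rounds to the 191.66 quoted on the slide.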
How do we determine the regression line?
We want the vertical distances from the points (observed) to the line (predicted) to be as small as possible; this means our error in predicting y is small.
Calculating the line
We will use the method of least squares to calculate the line.
The least-squares regression line is the line that makes the sum of the
squares of the vertical distances as small as possible.
Equation of the line: $\hat{y} = a + bx$ (read "y hat")
Slope: $b = r \dfrac{s_y}{s_x}$
Intercept: $a = \bar{y} - b\bar{x}$
b is the slope (the change in $\hat{y}$ when x increases by one unit)
a is the y-intercept (the value of $\hat{y}$ when x is 0)
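The two formulas can be checked against the Minitab fit. The sketch below is one possible implementation, not the routine Minitab uses; it assumes sample standard deviations (n - 1) and takes its data from the 19-observation listing later in these slides. The name `least_squares_line` is hypothetical.

```python
from statistics import mean, stdev

# Alcohol (wine consumption) and hrt_deat (deaths per 100,000),
# copied from the Minitab listing of the 19 observations.
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
death_rate = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
              167, 266, 227, 86, 207, 115, 285, 199, 172]


def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x via b = r*s_y/s_x, a = y_bar - b*x_bar."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Pearson correlation written out from its definition.
    r = sum((x - x_bar) * (y - y_bar)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x        # slope
    a = y_bar - b * x_bar    # intercept
    return a, b


a, b = least_squares_line(alcohol, death_rate)
print(f"death rate = {a:.3f} + ({b:.4f}) * wine consumption")
```

Up to rounding, the coefficients should match the regression equation on the slides (intercept 260.563, slope -22.9688).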
[Figure: fitted regression line, death rate vs. wine consumption.]
The regression equation is
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %
Analysis of Variance
Source      DF       SS       MS        F      P
Regression   1  59813.6  59813.6  41.6881  0.000
Error       17  24391.4   1434.8
Total       18  84204.9
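The summary line (S, R-Sq, R-Sq(adj)) and the F statistic can all be recovered from the sums of squares in the table. The sketch below uses the standard textbook definitions of these quantities, which is an assumption about how the output was produced; the variable names are illustrative.

```python
from math import sqrt

# Values copied from the Analysis of Variance table above.
ss_regression = 59813.6
ss_error = 24391.4
ss_total = 84204.9
df_regression, df_error, df_total = 1, 17, 18

ms_regression = ss_regression / df_regression    # mean square, regression
ms_error = ss_error / df_error                   # mean square, error
f_statistic = ms_regression / ms_error           # F = MS(regression) / MS(error)
s = sqrt(ms_error)                               # residual standard deviation S
r_sq = ss_regression / ss_total                  # R-Sq as a fraction
r_sq_adj = 1 - ms_error / (ss_total / df_total)  # R-Sq(adj) as a fraction

print(f"F = {f_statistic:.4f}, S = {s:.4f}")
print(f"R-Sq = {100 * r_sq:.1f} %, R-Sq(adj) = {100 * r_sq_adj:.1f} %")
```

The results should agree with the printed output to the precision shown there: F about 41.69, S about 37.88, R-Sq 71.0 % and R-Sq(adj) 69.3 %.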
Facts about regression
1. There is a clear distinction between the response variable and the explanatory variable.
2. Correlation and slope: a change of one standard deviation in x
corresponds to a change of r standard deviations in y.
3. The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$.
4. Some of the variation (spread) in y can be accounted for by
changes in x when there is a linear relationship. The
square of the correlation coefficient is the fraction of
the variation in y values that is explained by changes in x:
$r^2 = \dfrac{\text{variation in } \hat{y} \text{ due to } x}{\text{total variation in observed } y}$ = coefficient of determination
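Facts 2 to 4 can be checked numerically on the wine data. This is a sketch under the same assumptions as the rest of the slides (least-squares fit, sample standard deviations); the data come from the 19-observation Minitab listing and the helper name `fitted` is hypothetical.

```python
from statistics import mean, stdev

alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
death_rate = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
              167, 266, 227, 86, 207, 115, 285, 199, 172]

n = len(alcohol)
x_bar, y_bar = mean(alcohol), mean(death_rate)
s_x, s_y = stdev(alcohol), stdev(death_rate)
r = sum((x - x_bar) * (y - y_bar)
        for x, y in zip(alcohol, death_rate)) / ((n - 1) * s_x * s_y)
b = r * s_y / s_x
a = y_bar - b * x_bar


def fitted(x):
    """Predicted value y-hat on the least-squares line."""
    return a + b * x


# Fact 2: raising x by one s_x changes y-hat by r standard deviations of y.
change_in_y_hat = fitted(x_bar + s_x) - fitted(x_bar)

# Fact 3: the line passes through (x_bar, y_bar).
y_hat_at_mean = fitted(x_bar)

# Fact 4: r squared is the explained fraction of the variation in y.
ss_total = sum((y - y_bar) ** 2 for y in death_rate)
ss_explained = sum((fitted(x) - y_bar) ** 2 for x in alcohol)
r_squared = ss_explained / ss_total

print(f"change in y-hat = {change_in_y_hat:.2f}, r * s_y = {r * s_y:.2f}")
print(f"y-hat at x-bar = {y_hat_at_mean:.2f}, y-bar = {y_bar:.2f}")
print(f"explained fraction = {r_squared:.3f}, r^2 = {r ** 2:.3f}")
```

Each printed pair should agree, and the explained fraction should be close to the R-Sq of 71.0 % in the Minitab output.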
[Figure: fitted regression line, death rate vs. wine consumption.]
The regression equation is
death rate = 260.563 - 22.9688 wine consumption
S = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %
R-sq can have a value between 0 and 1.
VARIATION OF DEPENDENT Y
Residuals
The leftovers from least-squares regression.
Deviations from the overall pattern are important. The deviations
in regression are the scatter of points about the line. The
vertical distances from the line to the points are called residuals, and they are the left-over variation after a regression line is fit.
Residual = observed y - predicted y
$\text{residual} = y - \hat{y}$
Obs  Alcohol  hrt_deat     Fit  SE Fit  Residual  St Resid
  1     2.50    211.00  203.14    8.89      7.86      0.21
  2     3.90    167.00  170.99    9.23     -3.99     -0.11
  3     2.90    131.00  193.95    8.70    -62.95     -1.71
  4     2.40    191.00  205.44    8.97    -14.44     -0.39
  5     2.90    220.00  193.95    8.70     26.05      0.71
  6     0.80    297.00  242.19   11.76     54.81      1.52
  7     9.10     71.00   51.55   23.29     19.45      0.65 X
  8     0.80    211.00  242.19   11.76    -31.19     -0.87
  9     0.70    300.00  244.49   12.00     55.51      1.55
 10     7.90    107.00   79.11   19.39     27.89      0.86
 11     1.80    167.00  219.22    9.72    -52.22     -1.43
 12     1.90    266.00  216.92    9.57     49.08      1.34
 13     0.80    227.00  242.19   11.76    -15.19     -0.42
 14     6.50     86.00  111.27   15.11    -25.27     -0.73
 15     1.60    207.00  223.81   10.06    -16.81     -0.46
 16     5.80    115.00  127.34   13.15    -12.34     -0.35
 17     1.30    285.00  230.70   10.64     54.30      1.49
 18     1.20    199.00  233.00   10.85    -34.00     -0.94
 19     2.70    172.00  198.55    8.77    -26.55     -0.72
The regression equation is
death rate = 260.563 - 22.9688 wine consumption
s = 37.8786 R-Sq = 71.0 % R-Sq(adj) = 69.3 %
The residuals are listed in the Residual column above.
The mean of the least-squares residuals is always equal to 0.
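The Residual column and the zero-mean property can be reproduced from the Alcohol and hrt_deat columns. A minimal sketch, assuming an ordinary least-squares fit; the slope is computed here as the sum of cross-products over the sum of squares in x, which is algebraically the same as b = r s_y / s_x.

```python
from statistics import mean

# Alcohol and hrt_deat columns from the listing above.
alcohol = [2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1, 0.8, 0.7, 7.9,
           1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7]
death_rate = [211, 167, 131, 191, 220, 297, 71, 211, 300, 107,
              167, 266, 227, 86, 207, 115, 285, 199, 172]

x_bar, y_bar = mean(alcohol), mean(death_rate)

# Least-squares slope and intercept.
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(alcohol, death_rate))
     / sum((x - x_bar) ** 2 for x in alcohol))
a = y_bar - b * x_bar

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(alcohol, death_rate)]

print(f"residual for observation 1: {residuals[0]:.2f}")
print(f"residual for observation 3: {residuals[2]:.2f}")
print(f"|mean residual| below 1e-9: {abs(mean(residuals)) < 1e-9}")
```

The individual residuals should match the Residual column to two decimals (for example 7.86 for observation 1 and -62.95 for observation 3), and their mean is zero up to floating-point rounding.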
Residual Plots
[Figure: Residuals Versus Alcohol (response is hrt_deat). X axis: Alcohol, 0 to 9. Y axis: Residual, -50 to 50.]
Things to look for:
1. A curved pattern means
the relationship is not
linear.
2. Increasing/decreasing
spread about the line
3. Individual points with
large residuals
4. Individual points that are extreme in the x direction
Do we have any influential points here?
Ideal residual pattern
Curvature: a linear fit is not appropriate
Increasing variation
[Figure: four panels.]
Regression Plot, C6 vs. C5: C6 = 280.215 - 33.7666 C5; S = 40.0879, R-Sq = 42.0 %, R-Sq(adj) = 37.5 %
Fitted regression line, death rate vs. wine consumption: death rate = 260.563 - 22.9688 wine consumption; S = 37.8786, R-Sq = 71.0 %, R-Sq(adj) = 69.3 %
Residuals Versus Alcohol (response is hrt_deat)
Residuals Versus C5 (response is C6)
Attention!! Caution!!
1. Correlation and regression describe only linear relationships.
2. r and R-sq are not resistant: outliers can change them substantially.
3. Do not extrapolate!!! What is extrapolation?
4. Correlations based on averages are too high when
applied to individuals. If the data have been averaged,
the values of correlation and regression cannot be used
with un-averaged values (e.g., average alcohol
consumption per country, not individuals).
5. Lurking variables, like the male/female variable in the weight vs. energy data and the possible Mediterranean
variable in the wine data.
6. Correlation/association is not causation.
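Caution 3 can be made concrete with the fitted wine line. Extrapolation means predicting at an x value outside the range of the observed data, which here runs from 0.7 to 9.1 in the 19-observation listing. The sketch below flags such predictions; the value 12 is a hypothetical input chosen only to show what goes wrong, and the names are illustrative.

```python
# Fitted line from the slides: death rate = 260.563 - 22.9688 * wine consumption
INTERCEPT = 260.563
SLOPE = -22.9688

# Smallest and largest Alcohol values in the 19-observation listing.
X_MIN, X_MAX = 0.7, 9.1


def predict_with_check(wine_consumption):
    """Return (predicted death rate, True if this is an extrapolation)."""
    prediction = INTERCEPT + SLOPE * wine_consumption
    is_extrapolation = not (X_MIN <= wine_consumption <= X_MAX)
    return prediction, is_extrapolation


for x in (3, 12):  # 12 is a hypothetical value beyond the observed range
    prediction, is_extrapolation = predict_with_check(x)
    note = "extrapolation, do not trust" if is_extrapolation else "within data range"
    print(f"x = {x}: predicted death rate = {prediction:.2f} ({note})")
```

At x = 12 the line predicts a negative death rate, which is impossible; the linear pattern seen between 0.7 and 9.1 gives no information about what happens outside that range.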