METEO 469
From Meteorology to Mitigation: Understanding Global Warming

Review of Basic Statistical Analysis Methods for Analyzing Data - Part 3


Establishing Relationships Between Two Variables

Another important application of OLS is the comparison of two different data sets. In this case, we can think of one of the time series as constituting the independent variable x and the other constituting the dependent variable y. The methods that we discussed in the previous section for estimating trends in a time series generalize readily, except that our predictor is no longer time but some other variable. Note that the correction for autocorrelation is somewhat more complicated in this case, and the details are beyond the scope of this course. As a general rule, even if the residuals show substantial autocorrelation, the required correction to the statistical degrees of freedom (N') will be small as long as either one of the two time series being compared has low autocorrelation. Nonetheless, any substantial structure in the residuals remains a cause for concern regarding the reliability of the regression results.
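
As a rough illustration of the general rule above, the sketch below uses one commonly cited approximation for the effective sample size when two autocorrelated series are compared: N' = N (1 - r1 r2) / (1 + r1 r2), where r1 and r2 are the lag-1 autocorrelations of the two series. This formula is an assumption brought in for illustration and is not derived in this course; the function name effective_sample_size and the sample autocorrelation values are hypothetical.

```python
def effective_sample_size(n, rho_x, rho_y):
    """Approximate effective sample size N' for comparing two series.

    ASSUMPTION: uses N' = N * (1 - rho_x*rho_y) / (1 + rho_x*rho_y),
    a common approximation, not a formula given in the course text.
    """
    product = rho_x * rho_y
    return n * (1.0 - product) / (1.0 + product)


# Illustrative (made-up) lag-1 autocorrelations with N = 107 years.
for rho_x, rho_y in [(0.0, 0.9), (0.1, 0.9), (0.6, 0.6)]:
    n_eff = effective_sample_size(107, rho_x, rho_y)
    print(f"rho_x={rho_x}, rho_y={rho_y}: N' = {n_eff:.1f}")
```

Under this approximation, the correction depends only on the product of the two autocorrelations, so it stays small whenever either series has low autocorrelation.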

We will investigate this sort of application of OLS with an example, where our independent variable is a measure of El Niño (the so-called Niño 3.4 index) and our dependent variable is the December average temperature in State College, PA.

The demonstration is given in three parts below:

Demo part 1

PRESENTER: Well, now we're going to look at a somewhat different situation where our independent variable is no longer time but it's some quantity. It could be temperature. It could be an index of El Nino, the North Atlantic Oscillation Index.

So let's look at an example of that sort. We are going to now look at the relationship between El Nino and December temperatures in State College, Pennsylvania. And we can plot out that relationship as a scatterplot. So on the vertical axis we have December temperature in State College. On the horizontal axis our independent variable is the Nino 3.4 index. Negative values indicating La Ninas, positive values indicating El Ninos.

And the strength of the relationship between the two is going to be determined by the trend line. That describes how winter temperatures in State College, December temperatures in State College depend on El Nino. And so by fitting the regression, we obtain a slope of 0.7397. That means for each unit change in El Nino, in Nino 3.4, we get a 0.74 unit change in temperature. So for a moderate El Nino event where the Nino 3.4 index is in the range of plus 1, that would imply December temperatures in State College that year are 0.74-- degrees Fahrenheit is the scale here-- 0.74 degrees Fahrenheit warmer than usual. And for a modestly strong La Nina where the Nino 3.4 is on the order of minus 1 or so, State College December temperatures would be about 0.74 degrees colder than normal.
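
The linear scaling described in this part of the demo can be written out as a short sketch. The slope of 0.7397 degrees Fahrenheit per unit of Nino 3.4 is the value quoted by the presenter; the function name and the sample index values are illustrative assumptions.

```python
SLOPE_F_PER_UNIT = 0.7397  # regression slope quoted in the demo


def expected_december_anomaly(nino34_index, slope=SLOPE_F_PER_UNIT):
    """Expected temperature departure (deg F) for a given Nino 3.4 value."""
    return slope * nino34_index


# Hypothetical index values spanning La Nina (negative) to El Nino (positive).
for index in (-2.0, -1.0, 0.0, 1.0, 2.0):
    anomaly = expected_december_anomaly(index)
    print(f"Nino 3.4 = {index:+.1f} -> {anomaly:+.2f} F")
```

This gives only the average response implied by the regression line; individual Decembers scatter widely around it.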

Now the correlation coefficient associated with that linear regression in this case is 0.174. Now we have 107 years here. Our data set, as before, goes from 1888 to 1994. So that's 107 years. We've got a correlation coefficient of 0.174.

So if we use our table and we take N equal 107, r of 0.174, we find that the one-tailed value of P is 0.0365. The two-tailed value is 0.073. So if our threshold for significance were P of 0.05, the 95% significance level, then that relationship, a correlation coefficient of 0.174 with 107 years of information, would be significant for a one-tailed test, but it would not pass the 0.05, the 95% significance threshold for a two-tailed test. So we have to ask the question, which is more appropriate here, a one-tailed test or a two-tailed test?

Now if you had reason to believe that El Nino events warm the Northeastern US, for example, then you might motivate a one-tailed test since only a positive relationship would be consistent with your expectations. But if we didn't know beforehand whether El Ninos had a cooling influence or a warming influence on the Northeastern US, you might argue for a two-tailed test. So whether or not the relationship is significant at the P equal 0.05 level is going to depend on which type of hypothesis test we're able to motivate in this case.
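
For readers without the lookup table at hand, the sketch below estimates the one-tailed and two-tailed p-values directly from r and N. It assumes the standard test for a correlation coefficient, t = r sqrt(N - 2) / sqrt(1 - r^2) with N - 2 degrees of freedom, and integrates the Student's t density numerically so that only the Python standard library is needed. The function names are hypothetical, and this is not necessarily how the course's table was generated.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def correlation_p_values(r, n, steps=2000):
    """Return (one_tailed, two_tailed) p-values for correlation r, sample size n.

    The one-tailed value refers to the tail in the direction of the observed r.
    """
    df = n - 2
    t_stat = abs(r) * math.sqrt(df / (1.0 - r * r))
    # Simpson's rule (steps must be even) for the area between 0 and t_stat.
    h = t_stat / steps
    total = t_pdf(0.0, df) + t_pdf(t_stat, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    one_tailed = 0.5 - area
    return one_tailed, 2 * one_tailed


p_one, p_two = correlation_p_values(0.174, 107)
print(f"one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```

With SciPy available, the same tail probability could be obtained from scipy.stats.t.sf instead of the hand-written integration.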

Demo part 2

PRESENTER: OK, well, let's continue with this analysis. Now, what I'm going to do here is plot instead the temperature as a function of year. That's plot number 1. And we no longer want a trend line here. That's blue. That's the State College December temperatures.

And now, for plot number 2, I am going to plot the Nino 3.4 index for that year. And I'll use axis b to put them on the same scales. So here we can see the two series. Blue is the December temperatures. Red is the Nino 3.4 index. And you can see that in various individual years, there does seem to be a relationship where large positive departures of the Nino 3.4 index are associated with warm Decembers, and large negative departures are associated with cold temperatures.

So we can see visually that relationship that we also saw when we plotted the two variables in a two-dimensional scatterplot, and looked at the slope of the line relating the two data sets. Here now, we're looking at the time series of the two data sets. And we can see some of that positive covariance, if you will, that there does appear to be a positive relationship. Although, we already know it's a fairly weak relationship.

Now, let's do the formal regression. So I'm going to take away the El Nino series. So here we've got State College December temperatures in blue. Now, our regression model is going to use the Nino 3.4 index as our independent variable, as a predictor of State College December temperatures, our dependent variable.

We'll run the linear regression. There is the slope. 0.74 is the coefficient that describes the relationship in how temperature depends on the Nino 3.4 index. It's positive. We already saw that the slope was positive. There's also a constant term that we're not going to worry about too much here. What we're really interested in is the slope of the regression line that describes how changes in temperature depend on changes in the Nino 3.4 index.
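
The regression step shown here can be sketched with the textbook OLS formulas for one predictor. This is a minimal illustration, not the code behind the course's online tool; the function name ols_fit is hypothetical, and the five data pairs are made up purely to show the call, not taken from the State College record.

```python
import math


def ols_fit(x, y):
    """Return (slope, intercept, r) for a one-predictor least-squares fit."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean  # the "constant term"
    r = sxy / math.sqrt(sxx * syy)       # correlation coefficient
    return slope, intercept, r


# Made-up example values (NOT the real data set).
nino34 = [-1.5, -0.5, 0.0, 0.8, 1.9]
december_temp_f = [29.0, 31.5, 28.7, 32.1, 30.9]

slope, intercept, r = ols_fit(nino34, december_temp_f)
print(f"slope = {slope:.3f}, constant = {intercept:.3f}, r = {r:.3f}")
```

The slope carries the units of the dependent variable per unit of the predictor, which is why the demo reads it as degrees Fahrenheit per unit of Nino 3.4.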

And as we've seen, that 0.74 implies that for a unit increase in Nino 3.4, in an anomaly of plus 1 on the Nino 3.4 scale, we get a temperature for December that on average is 0.74 degrees Fahrenheit warmer than average.

The R-squared value is 0.0301. Well, if we take 0.0301 and take the square root of that, that's the r value of 0.1734. And we know it's a positive correlation because the slope is positive.
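
The step from R-squared back to r is a one-line calculation, sketched below. The helper name r_from_r_squared is hypothetical; the inputs are the rounded values quoted in the demo.

```python
import math


def r_from_r_squared(r_squared, slope):
    """Recover the correlation coefficient, taking its sign from the slope."""
    return math.copysign(math.sqrt(r_squared), slope)


r = r_from_r_squared(0.0301, 0.74)
print(f"r = {r:.4f}, fraction of variance explained = {0.0301:.1%}")
```

Starting from the rounded R-squared of 0.0301, this gives about 0.1735, which agrees with the presenter's 0.1734 to within rounding. The presenter's figure presumably comes from an unrounded R-squared.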

We already looked up the statistical significance of that number, and we found that for a one-sided hypothesis test, the relationship is significant at the 0.05 level. But if we were using a two-sided hypothesis test-- that is to say, if we didn't know a priori whether El Ninos warm or cool State College December temperatures-- then the relationship would not quite be statistically significant.

So we've calculated the linear model. We can now plot it. So now, I'm going to plot year and model output on the same scale. And so now, the red curve is showing us the component of variation in the blue curve that can be explained by El Nino.

And we can see it's a fairly small component. It's small compared to the overall level of variability in December State College temperatures, which vary by as much as plus or minus 4 degrees or so Fahrenheit. The standard deviation is close to [AUDIO OUT]

Demo part 3

PRESENTER: OK, so continuing where we left off. The red curve is showing us the component of the variation in December State College temperatures that can be explained by El Nino. In a particularly strong El Nino year, where the Nino 3.4 index is say as large as plus 2, we get a December temperature that's about 1.5 degrees Fahrenheit above average. That is to say, twice the 0.74 degree effect that we get for a one-unit change in Nino 3.4.

For a particularly strong La Nina event, which would correspond to a negative Nino 3.4 anomaly of say negative 2 or so, we would get a 1.5 degree Fahrenheit cooling effect on State College December temperatures. So the influence of El Nino is small compared to the overall variability in the series. But it is statistically significant, at least if we are able to motivate a one-sided hypothesis test.

If we had reason a priori to believe that El Nino events warm State College temperatures in the winter, then the regression gives us a result that's significant at the 0.05 level, the standard threshold for statistical significance.

OK. So it may not be that satisfying. We're not explaining a large amount of variation in the data. But we do appear to be explaining a statistically significant fraction of variability in the data.

Now finally, let's look at the trend. Sorry. Let's look at the residuals from that regression. And what I'll do is, I will get rid of these graphs that we have right now. And I'm just going to plot the model residuals as a function of time.

That's what they look like. There isn't a whole lot of obvious structure. And in fact, if you go back to the regression model tab, and we look at the value of the lag 1 autocorrelation coefficient, we see that it's minus 0.09. That's quite-- it's slightly negative. It's quite small, close to zero.

If we look up the statistical significance, it's not going to be even remotely significant. So we don't have to worry about autocorrelation influencing our estimate of statistical significance. We also don't have much evidence here of the sort of low frequency structure in the residuals that might cause us worry.
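
The residual check described here can be sketched as follows. The estimator below is one common definition of the lag-1 autocorrelation coefficient and may differ slightly from what the course's tool computes; the function name and the short residual series are made up for illustration.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: lagged covariance divided by the variance."""
    n = len(series)
    mean = sum(series) / n
    dev = [value - mean for value in series]
    lagged = sum(dev[i] * dev[i + 1] for i in range(n - 1))
    variance = sum(d * d for d in dev)
    return lagged / variance


# Hypothetical regression residuals (observed minus model output), in deg F.
residuals = [1.2, -0.8, 0.5, -2.1, 1.9, 0.3, -1.4, 0.6]
print(f"lag-1 autocorrelation = {lag1_autocorrelation(residuals):.3f}")
```

Values near zero, like the minus 0.09 found in the demo, suggest that the nominal degrees of freedom need little or no adjustment.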

So the nominal results of our regression analysis appear valid. And again, if we were to invoke a one-sided hypothesis test, we would have found a statistically significant-- albeit weak-- influence of El Nino on State College December temperatures.

You can play around with the data set used in this example using this link: Explore Using the File testdata.txt