Section 2.3: Modeling Linear Data

Should you choose to go into the education field once you graduate, you will find there is a heavy emphasis on "what does the data show?" Educators frequently look at data to determine whether some type of relationship exists between different factors. To see that data visually, a person may use a scatter plot, which graphs a possible "cause and effect" type of relationship between two variables.

Scatter Plots

Every scatter plot is a collection of points \( (x,y) \) in which the $x$-axis represents the explanatory variable, or inputs, and the $y$-axis represents the response variable, or outputs, of the relationship.

Looking at the above scatter plot, we can derive a lot of information from the data collected. We can see that someone was looking to show a relationship between a student's age and the grade they received on the final exam. We can see that the ages range from about 18 to 40, and the grades range from about 60% to 99%. However, from this scatter plot alone, we cannot conclude that there is a relationship between a student's age and their final exam grade. Throughout the remainder of this unit, we will discuss how to determine whether a relationship actually exists. In this section, we will concentrate on linear models.

Recall that a linear model is a function whose graph is a straight line that either increases or decreases. Data are linearly related when the points fall close to such a line.

Example 1

A scientist is looking to determine whether there is a relationship between how often a cricket chirps and the air temperature in degrees Fahrenheit. The table below displays the number of chirps a cricket makes in 15 seconds and the air temperature at the time of each recording. Create a scatter plot and determine whether the data appear to be linearly related.

Data values of the number of chirps in 15 seconds and the corresponding temperature.
Chirps 44 35 20.4 33 31 35 18.5 37 26
Temperature (°F) 80.5 70.5 57 66 68 72 52 73.5 53

Solution

To create this scatter plot, we first need to figure out what will go on the $x$-axis and what will go on the $y$-axis. Are we trying to show that the number of chirps is dependent on the temperature, or that the temperature is dependent on the number of chirps? It would be silly to assume that the temperature outside depends on how many times a cricket chirps. This suggests that our explanatory variable is the temperature and our response variable is the number of chirps. Thus, if we create our scatter plot, we get:

Figure: Temperature vs Chirps. Scatter plot with temperature (°F) on the x-axis and chirps in 15 seconds on the y-axis.

Looking at the scatter plot, we can see that the data trend upward even though the points do not form a perfectly straight line. Real data will rarely fall exactly on a line. From the scatter plot, we can conclude that the data appear to be linearly related.
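Without a plotting tool, the same upward trend can be checked by pairing each temperature with its chirp count and listing the pairs from coldest to warmest. The following is a minimal sketch in Python; the variable names are illustrative and not part of the course materials.

```python
# Cricket data copied from the table in Example 1.
chirps = [44, 35, 20.4, 33, 31, 35, 18.5, 37, 26]
temperature_f = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]

# Pair each temperature (explanatory, x) with its chirp count (response, y),
# then sort by temperature so the trend can be read top to bottom.
points = sorted(zip(temperature_f, chirps))

for temp, count in points:
    print(f"{temp:5.1f} deg F -> {count:4.1f} chirps in 15 s")
```

Reading down the printed list, the chirp counts generally rise with temperature, with a few small dips, which is what we would expect from real data that is roughly but not perfectly linear.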

Writing Linear Models

Once we recognize a need for a linear function to model a given set of data, the natural follow-up question is, "How do we write that model?" The non-technological way would be to select two points on the graph and use the slope formula learned in Algebra I to identify the slope. Then, find the $y$-intercept of the graph and write the equation (model) in slope-intercept form. Let's be honest, this is college, and technology is meant to enhance your skills. While we will cover what the slope and $y$-intercept represent graphically, we are going to allow technology to create that model for us.
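For reference, the by-hand approach can be sketched in a few lines of Python. The function name `line_through` and the choice of the coldest and warmest cricket readings as the two points are assumptions made for illustration; a different pair of points would give a somewhat different line.

```python
def line_through(p1, p2):
    """Return (slope, intercept) of the line through two points with different x-values."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)   # rise over run
    intercept = y1 - slope * x1     # solve y1 = slope * x1 + b for b
    return slope, intercept

# Illustrative choice: the coldest and warmest readings from the cricket table,
# written as (temperature, chirps).
m, b = line_through((52, 18.5), (80.5, 44))
print(f"slope = {m:.4f}, intercept = {b:.3f}")
```

Because the result depends on which two points are picked, this method gives only a rough model, which is one reason to let technology fit the line using all of the data.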

Recall that in high school, or possibly even middle school, you learned about the slope of a line. You were taught that the slope $m=\dfrac{\text{rise}}{\text{run}}=\dfrac{y_2-y_1}{x_2-x_1}$ is the rate of change of the linear function. We've previously discussed the $y$-intercept $(0,b)$, where $b$ is the value at which the line crosses the $y$-axis. Using these two pieces, an equation can be created that will model our data. Typically, when technology finds the model of best fit, or linear regression model, it uses the equation $\hat{y}=mx+b$, where $\hat{y}$ is the estimated value of the response variable $y$, and $x$ is the input or explanatory variable.
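To give a sense of what the technology does behind the scenes, here is a minimal sketch of a least-squares fit in Python. It assumes the trendline is an ordinary least-squares line, and the function name and the four data points are made up for illustration only; they are not from this section's examples.

```python
def least_squares(xs, ys):
    """Return (m, b) for the least-squares line y-hat = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, and how x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x   # the fitted line passes through (mean_x, mean_y)
    return m, b

# Made-up points that lie exactly on y = 2x + 1.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
m, b = least_squares(xs, ys)
print(f"y-hat = {m:.4f}x + {b:.4f}")
```

Unlike the two-point method, this calculation uses every data point, so the resulting line does not depend on which points a person happens to pick.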

For the purpose of this course, we will primarily use Microsoft Excel or Google Sheets to help us find these models. Should you need tutorials on either of these, please refer to the appendix of the book. Let's see how we would read and interpret the model produced by technology.

Example 2

Looking back at our cricket data, identify a linear function that would best model the data. Then identify how many chirps we could expect if the temperature was 84$^\circ$F.

Data values of the number of chirps in 15 seconds and the corresponding temperature.
Chirps 44 35 20.4 33 31 35 18.5 37 26
Temperature (°F) 80.5 70.5 57 66 68 72 52 73.5 53
Figure: Temperature vs Chirps. Scatter plot with temperature (°F) on the x-axis and chirps in 15 seconds on the y-axis.

Solution

First, we would need to put our data into Excel or Google Sheets, placing the Temperature in the first column and the chirps in the second column. From there, you'd insert a scatterplot, add a chart element, and add a linear trendline. Once again, if you're unsure of the steps, please see the appendix. Once you're finished, you'd end up with the following results:

Excel screenshot showing a data table and scatter plot relating temperature to chirp counts. On the left, a two-column table lists Temperature values in the first (left) column and Chirps counts in the second (right) column. On the right, an XY scatter plot graphs Chirps (y-axis) versus Temperature (x-axis), includes a fitted trendline, and displays the regression equation for predicted chirps (y-hat) along with the R-squared value.
Picture of output using Microsoft Excel.

From this Excel result, we can see on the graph that the equation (model) is $\hat{y}=0.7955x-21.297$. This equation tells us that for every one-degree increase in temperature, the predicted number of chirps in 15 seconds increases by 0.7955. The $y$-intercept says that when the temperature is $0^\circ$F, a cricket would chirp $-21.297$ times in 15 seconds, which we know is not possible. Since $0^\circ$F is far colder than any temperature in our data, the model should not be trusted there.
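To see why the slope can be read as the change per degree, compare the model's predictions at a temperature $x$ and at one degree warmer, $x+1$:

$$ \begin{aligned} \hat{y}(x+1)-\hat{y}(x)&=\big[0.7955(x+1)-21.297\big]-\big[0.7955x-21.297\big]\\ &=0.7955 \end{aligned} $$

The intercept cancels, so the difference is 0.7955 chirps no matter which starting temperature is chosen.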

To answer the second question, how many times a cricket will chirp in 15 seconds if the temperature is $84^\circ$F, we substitute $x=84$ into our model.

$$ \begin{aligned} \hat{y}&=0.7955x-21.297\\ \hat{y}&=0.7955(84)-21.297\\ \hat{y}&=66.822-21.297\\ \hat{y}&=45.525\\ \end{aligned} $$

We can expect that on an $84^\circ$F day the average cricket will chirp about $45.525$ times in 15 seconds. Looking back at our original scatter plot, this appears to be a reasonable estimate.
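The same substitution can be written as a small Python function, which is handy when several temperatures need to be checked. This is a sketch only; the function name `predict_chirps` is hypothetical, and the coefficients are the ones read from the Excel trendline.

```python
def predict_chirps(temperature_f):
    """Estimated chirps in 15 seconds from the model y-hat = 0.7955x - 21.297."""
    return 0.7955 * temperature_f - 21.297

estimate = predict_chirps(84)
print(round(estimate, 3))
```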

Relationship Strength

When we are fitting models to data, wouldn't it be nice if there were a magical number that told us whether a model is a good fit? Luckily for us, there is. It is called the Correlation Coefficient, and we denote it with the variable $r$.

Correlation Coefficient
The Correlation Coefficient is a value, $r$, between $-1$ and $1$.
$r > 0$ suggests a positive (increasing) relationship.
$r < 0$ suggests a negative (decreasing) relationship.
$r = 0$ suggests no linear relationship exists.
The closer $|r|$ is to 1, the stronger the relationship.
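For readers curious how $r$ is actually computed, the sketch below uses the standard Pearson correlation formula, which this section does not state and which spreadsheet software handles for us. The function name and both small data sets are made up for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Made-up data: one loosely increasing set, one perfectly decreasing set.
xs = [1, 2, 3, 4, 5]
rising = [2, 4, 5, 4, 5]
falling = [10, 8, 6, 4, 2]

r_rising = correlation(xs, rising)
r_falling = correlation(xs, falling)
print(round(r_rising, 4), round(r_falling, 4))
```

The loosely increasing set gives a positive $r$ below 1, while the set that falls on a perfect decreasing line gives $r=-1$, matching the descriptions in the box above.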

Example 3

Given the different correlation coefficients below, order them from strongest to weakest relationship. $$\{-0.98,-0.85,0.97,0.86,-0.64,0.06,0.54\}$$

Solution

The negative sign has no impact on the strength of the relationship; it only indicates the type of relationship (positive or negative). We therefore look solely at the absolute values and arrange them from largest to smallest. This gives us: $$\{-0.98,0.97,0.86,-0.85,-0.64,0.54,0.06\}$$
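The ordering in Example 3 amounts to sorting by absolute value. A minimal Python sketch, with illustrative variable names:

```python
r_values = [-0.98, -0.85, 0.97, 0.86, -0.64, 0.06, 0.54]

# Sort by |r| so the sign is ignored, largest (strongest) first.
strongest_first = sorted(r_values, key=abs, reverse=True)
print(strongest_first)
```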

Most technology programs will actually give us the Coefficient of Determination, $r^2$, which is the square of the correlation coefficient. The coefficient of determination tells a person what proportion of the variation in the actual values $y$ is accounted for by the model's estimated values $\hat{y}$.

Picture of output using Microsoft Excel.

Looking back at our cricket example, we can see that the coefficient of determination is $r^2=0.9038$. We can find the size of the correlation coefficient $r$ by taking the square root of that number: $\sqrt{0.9038}=0.9507$. Because the slope of our model is positive, $r$ is positive as well, so $r=0.9507$. This tells us that the relationship between temperature and chirps is strong, because $0.9507$ is rather close to 1, and that it is a positive relationship, as can be seen in the graph. Keep in mind that there is no fixed mathematical cutoff for calling a relationship "strong"; that interpretation is a judgment based on how close $|r|$ is to 1.
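Since a square root is never negative, the sign of $r$ has to come from somewhere else, namely the sign of the slope. The sketch below shows one way to combine the two in Python; the function name is hypothetical.

```python
import math

def correlation_from_r_squared(r_squared, slope):
    """Recover r from r^2: the square root gives the size, the slope gives the sign."""
    return math.copysign(math.sqrt(r_squared), slope)

# Values read from the cricket trendline output.
r = correlation_from_r_squared(0.9038, 0.7955)
print(round(r, 4))
```

If the trendline had a negative slope, the same function would return a negative $r$ of the same size.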

What does the coefficient of determination tell us about our data? Let's look at the $(68,31)$ datum from our set. We know from our actual data that when it was $68^\circ$F, the cricket chirped 31 times. If we were to use our model to determine the number of chirps, would it be the same?

$$ \begin{aligned} \hat{y}&=0.7955x-21.297\\ \hat{y}&=0.7955(68)-21.297\\ \hat{y}&=54.094-21.297\\ \hat{y}&=32.797\\ \end{aligned} $$

According to our model, when the temperature is $68^\circ$F, the cricket should chirp 32.797 times, which is slightly off from our actual value of 31. This is where the coefficient of determination plays its part. We can express it as a percentage, $0.9038\times100\%=90.38\%$, which tells us that approximately 90.38% of the variation in the number of chirps is explained by the linear relationship with temperature. The remaining variation shows up as small differences between predicted and actual values, like the one here. Since we are approximating, a prediction that is off slightly is not wrong.
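The gap between an actual value and the model's prediction is often called a residual, a term this section does not use. As a rough sketch in Python, with a hypothetical function name, the gap for the $(68,31)$ datum can be computed like this:

```python
def predict_chirps(temperature_f):
    """Estimated chirps in 15 seconds from the model y-hat = 0.7955x - 21.297."""
    return 0.7955 * temperature_f - 21.297

observed = 31                      # actual chirps recorded at 68 deg F
predicted = predict_chirps(68)     # what the model estimates at 68 deg F
residual = observed - predicted    # negative means the model overestimated

print(round(predicted, 3), round(residual, 3))
```

Here the model overestimates by a little under two chirps, which is small compared with the spread of chirp counts in the data.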

When using the model to make a prediction or estimation, there are several factors that need to be considered:

The purpose of creating a model is so a person can actually use the model to make predictions. When making predictions, if the value that is substituted in is between the lowest $x$-value and the highest $x$-value (that is, within the domain of the data), we call this an example of Interpolation. If the value that is substituted in is not within the domain of the $x$-values, then it is an example of Extrapolation. Extrapolation should really only be done when there is a very strong relationship. Otherwise, you run the risk of the solution being unrealistic and unreasonable. If the $x$-value doesn't fall within the red box below, then it would be an example of extrapolation, not interpolation.

Figure: Temperature Vs Chirps. Scatter plot of chirps in 15 seconds against temperature (degrees Fahrenheit). A red box surrounds the data to illustrate the interpolation and extrapolation regions.

So far in this section, we have evaluated our model at $84^\circ$F and $68^\circ$F. The first, $84^\circ$F, is an example of extrapolation, as the $x$-value of 84 is outside our red box. The second, $68^\circ$F, is an example of interpolation, as the value of 68 is inside our red box, or within the domain of our data. In both cases, we had a strong positive linear relationship, and each of our predictions was reasonable and acceptable.
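Deciding between the two labels comes down to comparing the input with the smallest and largest $x$-values in the data. A minimal Python sketch, where the function name `prediction_type` is an illustrative assumption:

```python
# Temperatures (x-values) from the cricket table.
temperature_f = [80.5, 70.5, 57, 66, 68, 72, 52, 73.5, 53]

def prediction_type(x, observed_x):
    """Label a prediction by whether x lies within the range of the observed x-values."""
    low, high = min(observed_x), max(observed_x)
    return "interpolation" if low <= x <= high else "extrapolation"

print(prediction_type(84, temperature_f))
print(prediction_type(68, temperature_f))
```

The recorded temperatures run from 52 to 80.5, so 84 falls outside that range while 68 falls inside it.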

Example 4

Gasoline consumption in the United States has been steadily increasing. Consumption data (in billions of gallons) from 1994 to 2004 are shown in the table below. Determine whether the trend is linear and, if it is, find a linear model to fit the data. Lastly, use the model to predict the consumption in the year 2008. Let $x$ be the number of years since 1994.

Table displaying the paired data of year and consumption (billions of gallons).
Year 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004
Consumption 113 116 118 119 123 125 126 128 131 133 136

Solution

After we put all of our data into Excel or Google Sheets, column A should be the years, column B should be the years since 1994, and column C should be the consumption. Then, we can create the plot below, using columns B and C:

Scatter plot output from Microsoft Excel displaying the years since 1994 on the x-axis and consumption of gasoline in billions of gallons on the y-axis. Additionally, the linear regression line was added to the plot.
Gasoline consumption (billions of gallons) vs. years since 1994 with linear regression line

Based on the plot, we can see that there is a linear relationship with a coefficient of determination of $r^2=0.9931$, which would give us a correlation coefficient of $r=\sqrt{0.9931}=0.9965$, further supporting that there is a strong positive linear relationship. Additionally, we can see from the plot that the model that best represents the linear relationship is $\hat{y}=2.2091x+113.32$.

Lastly, to find the gasoline consumption in 2008, we need to figure out how many years have passed since 1994 $(2008-1994=14)$ and substitute that number into our model for $x$, which gives us:

$$ \begin{aligned} \hat{y}&=2.2091(14)+113.32\\ \hat{y}&=30.9274+113.32\\ \hat{y}&=144.2474\\ \end{aligned} $$

This is an example of extrapolation, since 2008 is outside the domain of our $x$-values. Since we have an extremely strong relationship, it is relatively safe to estimate outside our data. The estimate of 144.2474 billion gallons is both reasonable and acceptable based on the information we have, especially since the data show 136 billion gallons consumed in 2004 and the model increases by roughly 2.2091 billion gallons per year.
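The two steps in this prediction, converting the calendar year to years since 1994 and then evaluating the model, can be sketched in Python as follows. The names `BASE_YEAR` and `predict_consumption` are hypothetical, and the coefficients are the ones read from the trendline.

```python
BASE_YEAR = 1994  # x = 0 corresponds to 1994

def predict_consumption(year):
    """Estimated gasoline consumption (billions of gallons) from y-hat = 2.2091x + 113.32."""
    x = year - BASE_YEAR   # years since 1994
    return 2.2091 * x + 113.32

estimate_2008 = predict_consumption(2008)
print(round(estimate_2008, 4))
```

Keeping the year conversion inside the function avoids the common mistake of substituting 2008 directly for $x$.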