Fitting Straight Lines to Data by Least Squares

This example uses a tiny data set, with two variables, X and Y, observed on each of four cases. We will consider two different straight lines for describing how Y depends on X, and explore the sense in which one line fits the data better than the other.

WARNING! Using only four data points keeps a pedagogical example simple, while demonstrating the principles that apply just the same with hundreds of data points. Be aware, however, that it is almost always a bad idea to fit a straight line to such a small amount of data.


The X and Y values for our four cases of artificial data are shown in the first two columns of the table below. Also see their plot.

      Data      Line 1: Y = 1.2 + .3X       Line 2: Y = .4 + .6X
      X   Y     Pred.   Err    Sq. Err      Pred.   Err    Sq. Err
      1   1     1.5     -.5     .25         1.0      0      .00
      2   2     1.8     +.2     .04         1.6     +.4     .16
      3   3     2.1     +.9     .81         2.2     +.8     .64
      4   2     2.4     -.4     .16         2.8     -.8     .64
      Sum                      1.26                         1.44


A straight line has an equation of the form Y=a+bX. In this formulation, Y is called the dependent variable, X is called the independent variable, a is called the intercept, and b is called the slope. We are usually interested primarily in the slope, and want to know both its sign and its magnitude.

Actual data almost never lie exactly on a straight line, but we sometimes want to approximate them with a straight line that fits as well as possible. The most commonly used fitting technique is called least squares; it evaluates how well a line fits the data by calculating the sum of the squares of its prediction errors. We will illustrate it with two different straight lines, Y = 1.2 + .3X and Y = .4 + .6X.

The table has three columns for each of the two lines: a column of predicted values of Y, obtained by plugging the observed value of X into the equation; a column of prediction errors, calculated as (Observed Y - Predicted Y); and a column of squared prediction errors, whose sum is the overall index of how poorly the straight line fits the data. We would like the sum of squared errors to be as small as possible, so the sum of 1.26 for Line 1 is preferable to the 1.44 for Line 2, and Line 1 fits better, in the least squares sense, than Line 2.
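The table's computation can be sketched in a few lines of code. This is a minimal illustration, not part of the original example; the function name sse is our own label for the sum of squared errors.

```python
# Compare the fit of two candidate lines by their sum of squared
# prediction errors (SSE), using the four data points from the table.
xs = [1, 2, 3, 4]
ys = [1, 2, 3, 2]

def sse(a, b):
    """Sum of squared prediction errors for the line Y = a + b*X."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

line1 = sse(1.2, 0.3)  # Line 1: Y = 1.2 + .3X  ->  1.26
line2 = sse(0.4, 0.6)  # Line 2: Y = .4 + .6X   ->  1.44
print(line1, line2)
```

The smaller sum, 1.26, again identifies Line 1 as the better fit in the least squares sense.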

Still, there might be some other line that fits better than either of the two we tried. The computational algorithm used in regression programs (such as Stata) systematically determines the coefficients of the very best fitting straight line, so we can concentrate on other aspects of the research problem rather than the computational chores.
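For a simple regression like this one, the best fitting coefficients have a well-known closed form: the slope is the sum of cross-products of deviations from the means divided by the sum of squared deviations of X, and the intercept makes the line pass through the point of means. A minimal sketch, applied to our four points (this is an illustration of the standard formulas, not output from any particular regression program):

```python
# Closed-form least-squares fit of Y = a + b*X for the four data points.
xs = [1, 2, 3, 4]
ys = [1, 2, 3, 2]
n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Slope: sum of cross-products over sum of squared X deviations.
b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
     / sum((x - xbar) ** 2 for x in xs))
# Intercept: forces the line through the point of means (xbar, ybar).
a = ybar - b * xbar

best_sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
print(a, b, best_sse)  # best line is Y = 1.0 + 0.4X
```

For these data the best fitting line is Y = 1.0 + 0.4X, with a sum of squared errors of 1.20, smaller than the 1.26 achieved by Line 1.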