Least Median of Squares and Regression through the Origin
Supporting files online at
http://www.wabash.edu/econexcel/LMSOrigin
By
Humberto Barreto
Department of Economics
Wabash College
Crawfordsville, IN 47933
and
David Maharry
Department of Mathematics and Computer Science
Wabash College
Crawfordsville, IN 47933
The authors thank Michael Axtell, Frank Howland, and anonymous referees
for suggestions and criticisms.
Do not quote without the authors' permission.
Abstract
An exact algorithm is provided for finding the Least Median of Squares (LMS) line for a bivariate regression with no intercept term. It is shown that the popular PROGRESS routine will not, in general, find the LMS slope when the intercept is suppressed.
A Microsoft Excel workbook that provides the code in Visual Basic is made available at www.wabash.edu/econexcel/LMSOrigin.
Keywords: LMS, Robust Regression, PROGRESS
1. Introduction
Rousseeuw [1984] introduced Least Median of Squares (LMS) as a robust regression procedure. Instead of minimizing the sum of squared residuals, coefficients are chosen so as to minimize the median of the squared residuals. Unlike conventional least squares (LS), there is no closed-form solution with which to calculate the LMS line, since the median is an order, or rank, statistic. General non-linear optimization algorithms perform poorly because the surface of the median of squared residuals is so bumpy that mere local minima are often incorrectly reported as the solution.
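The objective itself is easy to write down. The following minimal Python sketch (the data, function name, and numbers are our own illustration, not from the paper or its Visual Basic workbook) shows why the median of squared residuals resists contamination:

```python
import statistics

def lms_objective(a, b, xs, ys):
    """Median of the squared residuals of the fitted line y = a + b*x."""
    return statistics.median((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Illustrative data: four points near y = x plus one gross outlier.
xs = [1, 2, 3, 4, 5]
ys = [1.1, 1.9, 3.2, 4.1, 20.0]

# The line y = x keeps the median squared residual tiny despite the
# outlier; a line dragged toward the outlier does far worse.
print(lms_objective(0, 1, xs, ys))  # about 0.01
print(lms_objective(0, 4, xs, ys))  # about 37.2
```

Because the objective is an order statistic of n parabolas in (a, b), it is piecewise and non-smooth, which is the source of the optimization difficulty described above.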
Although a closed-form solution does not exist and brute force optimization is not reliable, several algorithms are available for fitting the LMS line (or hyperplane). Perhaps the most popular approach is called PROGRESS (from Program for RObust reGRESSion). The program itself is explained in Rousseeuw and Leroy [1987] and the most recent version is available at http://www.agoras.ua.ac.be/. Several software packages, such as SAS/IML (version 6.12 or greater), have an LMS routine based on PROGRESS.
This paper focuses on the special problem of finding the LMS fitted line through the origin in the bivariate case. The next section presents the model and defines the LMS line. Section 3 shows that the PROGRESS algorithm gives an incorrect solution, in general, when the intercept is restricted to zero. Section 4 presents an analytical, exact method for finding the minimum median squared residual for the bivariate, zero intercept case. Finally, a simple example is provided to illustrate the algorithm and show why PROGRESS fails in the zero-intercept case.
It is possible for more than two parabolas to intersect at a point, but the parabola that becomes the new median can be determined by ordering the intersecting parabolas according to their slope and curvature at the point of intersection. In the case of an even number of data points, it is necessary to follow two parabolas, representing the (n/2)th and (n/2+1)th ordered squared deviations, since the median is the average of these two values.
When there are n data points, the worst-case efficiency of this algorithm is O(n² log n). The algorithm requires determining the intersections of each of the parabolas with the current median parabola in order to choose the next intersection to use. Since each parabola might be the median parabola at some point of the algorithm, one may have to determine on the order of n² intersections, and these intersections have to be ordered.
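The geometry behind an exact solution can also be sketched by brute force: since each observation contributes a residual parabola f_i(b) = (y_i - b·x_i)², the pointwise median of the parabolas is piecewise parabolic in b, and its minimum must occur at the vertex of a single parabola, at the vertex of an average of two parabolas (the even-n median), or at a crossing of two parabolas. The Python sketch below enumerates all such candidate slopes; it is our own illustration of the idea, not the authors' O(n² log n) sweep or the Visual Basic code in the workbook:

```python
import statistics

def exact_lms_slope(xs, ys):
    """Brute-force exact LMS slope for the zero-intercept line y = b*x.

    Evaluates the median squared residual at every slope where the
    piecewise-parabolic median can attain its minimum.
    """
    n = len(xs)
    candidates = set()
    for i in range(n):
        if xs[i] != 0:
            candidates.add(ys[i] / xs[i])          # vertex of f_i
        for j in range(i + 1, n):
            if xs[i] != xs[j]:                     # crossing, equal-residual case
                candidates.add((ys[i] - ys[j]) / (xs[i] - xs[j]))
            if xs[i] + xs[j] != 0:                 # crossing, opposite-sign case
                candidates.add((ys[i] + ys[j]) / (xs[i] + xs[j]))
            denom = xs[i] ** 2 + xs[j] ** 2
            if denom != 0:                         # vertex of (f_i + f_j)/2
                candidates.add((xs[i] * ys[i] + xs[j] * ys[j]) / denom)

    def objective(b):
        return statistics.median((y - b * x) ** 2 for x, y in zip(xs, ys))

    return min(candidates, key=objective)

# Illustrative data: four points near y = 2x plus one gross outlier.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 30.0]
print(exact_lms_slope(xs, ys))  # a slope near 2, unmoved by the outlier
```

With O(n²) candidates and an O(n log n) median evaluation at each, this sketch costs O(n³ log n); the sweep described above avoids re-evaluating the full median at every candidate.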
Figure 2 can also be used to show another view of how the PROGRESS algorithm works in the zero-intercept case. For each value of the slope that causes the straight line to pass through a data point, giving a squared residual of zero for that point, the PROGRESS algorithm computes the median squared deviation of the five data points. It then chooses the slope that yields the minimum value in this set of median deviations. The discussion in the previous paragraphs makes clear why this approach fails: the global minimum median squared residual will not, in general, occur at a slope where the squared residual of an individual observation is zero. PROGRESS provides the correct result only in the case where a majority of the data points lie on a straight line through the origin.
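The failure is easy to reproduce. In the Python sketch below (our own three-point illustration, not the paper's Figure 2 data), the PROGRESS-style candidates, slopes that pass exactly through some data point, all miss the true minimum, which lies between two points where no individual residual is zero:

```python
import statistics

def median_sq_resid(b, pts):
    """Median squared residual of the zero-intercept line y = b*x."""
    return statistics.median((y - b * x) ** 2 for x, y in pts)

# Illustrative data: two points bracketing the true line plus one outlier.
pts = [(1, 1), (1, 3), (2, 100)]

# PROGRESS-style candidates: only slopes passing exactly through a point.
progress_best = min((y / x for x, y in pts),
                    key=lambda b: median_sq_resid(b, pts))

# The exact optimum sits midway between the first two points (b = 2),
# where no individual residual is zero.
print(median_sq_resid(progress_best, pts))  # 4.0, the best PROGRESS can do
print(median_sq_resid(2.0, pts))            # 1.0, strictly better
```

At b = 1 or b = 3 one residual is zero but the median is (3-1)² = 4; at b = 2 both of the bracketing residuals equal 1, so the median drops to 1.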
When applying Least Median of Squares, coefficients are chosen so as to minimize the median of the squared residuals. Because the median is not sensitive to extreme values, it can outperform conventional least squares when data are contaminated. This paper makes two contributions to the LMS literature:
(1) PROGRESS, the standard algorithm for fitting the LMS estimator, does not find the true LMS fit when the intercept is suppressed. Any computations based on the estimated slope (such as regression diagnostics and estimated standard errors) are therefore also wrong.
(2) For a bivariate regression with a zero intercept, an exact algorithmic method based on tracking the median squared residual is demonstrated.
References
Barreto, Humberto (2001) “An Introduction to Least Median of Squares,” unpublished manuscript, http://www.wabash.edu/econexcel/LMSOrigin (LMSIntro.doc).
Rousseeuw, Peter J. (1984) “Least Median of Squares Regression,” Journal of the American Statistical Association, 79 (388), 871-880.
Rousseeuw, Peter J. and Annick M. Leroy (1987) Robust Regression and Outlier Detection, John Wiley & Sons: New York.
1 “In simple regression (p=2), it follows from (Steele and Steiger 1986) that if all 2-subsets are used and their intercept is adjusted each time, we obtain the exact LQS.” Rousseeuw and Hubert [1997], p. 9.