Cluster-Based Estimators for Multiple and Multivariate Linear Regression Models

Date
2015-06
Authors
Alih, Ekele
Abstract
In the field of linear regression modelling, the classical least squares (LS) regression is susceptible to even a single outlier, whereas low-breakdown regression estimators such as M regression and bounded influence regression can resist the influence of a small percentage of outliers. High-breakdown estimators such as the least trimmed squares (LTS) and MM regression estimators are resistant to as much as 50% of data contamination. These estimation procedures, however, suffer from enormous computational demands and subsampling variability, severe coefficient susceptibility to very small changes in initial values, internal deviation from the general trend, and limited capability in clean data and in low-breakdown situations. This study proposes a new high-breakdown regression estimator that addresses these problems in multiple regression and multivariate regression models while also providing insightful information about the presence and structure of multivariate outliers. In the multiple regression model, the proposed procedure unifies a concentration step (C-step) phase with a sequential regression phase. A minimum Mahalanobis distance variance concentration algorithm, referred to as the (MMD)-variance algorithm, produces a preliminary estimator. Thereafter, a hierarchical cluster analysis partitions the data into a main cluster (the “half set”) and a minor cluster of one or more groups. An initial least squares regression estimate is obtained from the main cluster, and a difference-in-fit statistic (DFFITS) sequentially activates the minor clusters in a bounded influence regression scenario. In the multivariate regression setting, a minimum Mahalanobis distance covariance determinant concentration algorithm, referred to as the (MMCD)-covariance algorithm, produces a preliminary estimator. Residual distances computed from this preliminary estimator serve as the distance metric for agglomerative hierarchical cluster (AHC) analysis. The AHC then partitions the data into a main cluster (the “half set”) and a minor cluster of one or more groups. An initial least squares estimate is obtained from the main cluster and is thereafter refined using concentration steps that lower the objective function of the residuals at each step. To improve the efficiency of the initial estimate, the DFFITS statistic is used to activate the minor cluster. Since the proposed method blends a clustering phase with a repeated least squares regression phase, it is called the Cluster-based estimators for Multiple and Multivariate Linear regression Models (CLreg for short). CLreg achieves a high breakdown point that can be set by the user. It inherits the asymptotic normality properties of least squares regression and is also equivariant. Case-study comparisons and Monte Carlo simulation experiments demonstrate the performance advantage of CLreg over other high-breakdown methods in terms of coefficient stability. A dendrogram plot obtained from the cluster analysis is used for multivariate outlier detection. Overall, the proposed procedure is a contribution to the area of robust regression, offering a distinct philosophical viewpoint on data analysis and the marriage between estimation and diagnostic summary.
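To make the multiple-regression pipeline described in the abstract concrete, the following is a minimal, illustrative Python sketch (not the author's implementation), assuming NumPy and SciPy are available. The function and parameter names (clreg_sketch, max_clusters) are hypothetical; a plain Mahalanobis-distance step stands in for the (MMD)-variance concentration algorithm, and a simplified DFFITS-style cutoff is used to re-activate minor-cluster points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def ls_fit(X, y):
    """Ordinary least squares with an intercept; returns coefficients."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict(beta, X):
    return beta[0] + X @ beta[1:]

def clreg_sketch(X, y, max_clusters=3):
    n, p = X.shape

    # (1) Stand-in for the (MMD)-variance concentration step: Mahalanobis
    #     distances of the joint observations (x_i, y_i) from their centre.
    Z = np.column_stack([X, y])
    diff = Z - Z.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(Z, rowvar=False))
    d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

    # (2) Agglomerative hierarchical clustering of the distances; the
    #     largest cluster plays the role of the "main cluster".
    labels = fcluster(linkage(d[:, None], method="ward"),
                      t=max_clusters, criterion="maxclust")
    active = labels == np.bincount(labels).argmax()

    # (3) Initial least squares fit on the main cluster.
    beta = ls_fit(X[active], y[active])

    # (4) Sequentially re-activate minor-cluster points whose DFFITS-style
    #     influence stays below the conventional 2*sqrt((p+1)/n) cutoff.
    cutoff = 2.0 * np.sqrt((p + 1) / n)
    for i in np.argsort(d):
        if active[i]:
            continue
        trial = active.copy()
        trial[i] = True
        beta_trial = ls_fit(X[trial], y[trial])
        scale = np.std(y[trial] - predict(beta_trial, X[trial]))
        shift = abs(predict(beta_trial, X[i:i + 1])[0]
                    - predict(beta, X[i:i + 1])[0]) / max(scale, 1e-12)
        if shift < cutoff:
            active, beta = trial, beta_trial
    return beta, active
```

On a toy data set, `beta, active = clreg_sketch(X, y)` returns the final coefficient vector and an indicator of the points retained in the fit. The actual procedure additionally iterates concentration steps on the residual objective function and extends the same clustering-plus-DFFITS idea to the multivariate response case.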
Keywords
Mathematics, Regression analysis