Cluster-Based Estimators For Multiple And Multivariate Linear Regression Models
Date
2015-06
Authors
Alih, Ekele
Abstract
In the field of linear regression modelling, the classical least squares (LS) estimator is
susceptible to even a single outlier, whereas low-breakdown regression estimators such as M regression
and bounded-influence regression can resist the influence of a small percentage
of outliers. High-breakdown estimators such as the least trimmed squares (LTS)
and MM regression estimators are resistant to as much as 50% data contamination.
These estimation procedures, however, suffer from enormous computational
demands, subsampling variability, severe sensitivity of the estimated coefficients to very small
changes in initial values, internal deviation from the general trend, and diminished performance
in clean data and in low-breakdown situations. This study proposes a new high-breakdown
regression estimator that addresses these problems in multiple and
multivariate regression models while also providing insightful information about the
presence and structure of multivariate outliers. In the multiple regression model, the
proposed procedure unifies a concentration-step (C-step) phase with a sequential regression
phase. A minimum Mahalanobis distance variance (MMD-variance)
concentration algorithm produces a preliminary estimator. A hierarchical
cluster analysis is then performed, and the data are partitioned into a main cluster (a “half
set”) and a minor cluster of one or more groups. An initial least squares regression estimate arises from the main cluster, with a difference-in-fit statistic (DFFITS)
that sequentially activates the minor clusters in a bounded-influence regression scenario.
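The multiple-regression phase described above can be sketched in a few lines. This is a minimal illustration, not the thesis's implementation: the crude median-based starting half set stands in for the MMD-variance algorithm, and the cut-off 2√(p/n) is the conventional DFFITS threshold, assumed here.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
y[:6] += 15.0                               # plant a few vertical outliers

def fit_ls(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def concentrate(X, y, n_steps=10):
    """C-steps on residuals: keep the h observations with the smallest
    squared residuals, refit, and repeat until the half set stabilises.
    The median-based start is a stand-in for the MMD-variance phase."""
    n, p = X.shape
    h = (n + p + 1) // 2
    idx = np.argsort(np.abs(y - np.median(y)))[:h]   # crude robust start
    for _ in range(n_steps):
        beta = fit_ls(X[idx], y[idx])
        new = np.argsort((y - X @ beta) ** 2)[:h]
        if np.array_equal(np.sort(new), np.sort(idx)):
            break
        idx = new
    return idx, fit_ls(X[idx], y[idx])

def dffits(X, y):
    """DFFITS of each observation under an OLS fit (internally
    studentised residuals are used here for brevity)."""
    n, p = X.shape
    beta = fit_ls(X, y)
    hat = np.diag(X @ np.linalg.pinv(X.T @ X) @ X.T)
    r = y - X @ beta
    s2 = (r @ r) / (n - p)
    t = r / np.sqrt(s2 * (1.0 - hat))
    return t * np.sqrt(hat / (1.0 - hat))

main, beta0 = concentrate(X, y)             # main cluster ("half set")
cut = 2.0 * np.sqrt(p / n)                  # conventional DFFITS cut-off
active = list(main)
for i in sorted(set(range(n)) - set(main)):
    trial = active + [i]
    if abs(dffits(X[trial], y[trial])[-1]) <= cut:
        active.append(i)                    # activate this minor-cluster point
beta_final = fit_ls(X[active], y[active])
```

The planted outliers survive neither the concentration steps nor the DFFITS gate, so the final fit is computed on (mostly) clean observations.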
In the multivariate regression setting, a minimum Mahalanobis distance covariance
determinant (MMCD-covariance) concentration algorithm produces
the preliminary estimator. Residual distances computed from this preliminary estimator
serve as the distance metric for an agglomerative hierarchical cluster (AHC) analysis. The
AHC then partitions the data into a main cluster (a “half set”) and a minor cluster of
one or more groups. An initial least squares estimate is obtained from the main cluster.
This initial estimate is thereafter optimized using concentration steps that lower
the objective function of the residuals at each step. To improve the efficiency
of the initial estimate, the DFFITS statistic is used
to activate the minor cluster. Since the proposed method blends the cluster phase with
a repeated least squares regression phase, it is called the Cluster-based estimators for
Multiple and Multivariate Linear regression Models (CLreg for short). CLreg achieves
a high breakdown point that can be determined by the user. It inherits the asymptotic
normality of least squares regression and is also equivariant. Case-study
comparisons and Monte Carlo simulation experiments demonstrate the advantage
of CLreg over other high-breakdown methods in coefficient stability. A
dendrogram plot obtained from the cluster analysis is used for multivariate outlier detection.
Overall, the proposed procedure is a contribution to robust regression,
offering a distinct philosophical viewpoint on data analysis and the marriage between
estimation and diagnostic summary.
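The clustering phase common to both settings can be illustrated with a toy stand-in: in one dimension, cutting the sorted residual distances at the largest gap is equivalent to a two-cluster single-linkage agglomerative cut. The synthetic distances, the two-group cut, and the function name are assumptions for illustration; the thesis's AHC partitions into a "half set" plus possibly several minor groups.

```python
import numpy as np

def split_main_minor(dist):
    """Two-group single-linkage cut in one dimension: sort the residual
    distances and split at the largest gap. The lower group plays the
    role of the main cluster, the upper group the minor cluster of
    suspected outliers."""
    order = np.argsort(dist)
    gaps = np.diff(dist[order])
    cut = int(np.argmax(gaps)) + 1
    return order[:cut], order[cut:]

rng = np.random.default_rng(1)
dist = np.abs(rng.normal(size=40))   # residual distances of clean points
dist[:4] += 8.0                      # a well-separated minor cluster
main, minor = split_main_minor(dist)
```

In the full procedure a dendrogram of the same agglomerative merges would expose these four observations as a late-joining branch, which is how the cluster analysis doubles as an outlier diagnostic.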
Keywords
Mathematics, Regression analysis