Data mining for robust tests of spread
Date
2008-11
Authors
Sin Yin, Teh
Abstract
A large quantity of multidimensional data (simulation data sets) was available from the SAS output
listings of 634 robust tests-of-spread procedures conducted by Keselman, Wilcox, Algina,
Othman, and Fradette (in press). The robust procedures that Keselman et al. (in
press) utilized were based on either prior or empirically determined symmetric or asymmetric
trimming strategies. The Levene-type and O'Brien-type transformed scores were used with either
the ANOVA F-test, a robust test due to Lee and Fung (1985), or the Welch (1951) test. P-values
from these tests were then collected. A test is robust if it is not seriously disturbed by the violation
of underlying assumptions. Robust statistical tests are tests that operate well across a wide variety
of distributions. A test can also be considered robust if it provides p-values 'close' to the target
(usually 0.05) in the presence of (slight) departures from its assumptions. To assess the
importance of the features of the procedures for the p-values generated, we collated the
data from the output listings into a large SAS data set of 26,628 records and conducted
data mining on it. We then mined the simulation conditions and the characteristics of the spread
procedures that correspond to the target p-value of 0.05 or to a set of values 'close' to 0.05.
Three data mining methods were used: logistic regression,
discriminant analysis, and a composite method combining the two. We conducted separate
analyses for the 'simulation conditions' and 'characteristics of the procedures'. The simulation
conditions evaluated were seven different distributions by six designs. The characteristics of the
procedures comprised information on central location, type of trimming, transformation, and test
statistic. For each analysis, the data were partitioned into 95% for training and 5% for validation. In
the first analysis, our findings agreed with the conventional expectation that robust Type I error
rates are obtained when procedures are run on the standard normal distribution with a large total
sample size. However, the logistic regression findings indicated that procedures
run on symmetric platykurtic or symmetric leptokurtic distributions with moderately
unequal sample sizes can still perform well in terms of Type I error rates. In the second analysis,
both logistic regression and discriminant analysis revealed that no trimming was needed on Z_ij/R_ij
in order to obtain robust Type I error rates. In logistic regression, Type I error rates falling in
[0.045, 0.050] were observed for the procedures that used group means in the transformation of X_ij,
with no trimming applied to X_ij and no hinge estimator used on X_ij. If asymmetric trimming was carried
out, the best results were observed when 10% trimming was applied to Z_ij/R_ij with H1 as the
hinge estimator and the usual F-test as the test statistic. In discriminant analysis, robust
Type I error rates were observed for the procedures that included the O'Brien transformation on X_ij
and asymmetric trimming applied to X_ij. In terms of the overall percentage of correct classification,
although the accuracy of the logistic regression model (only slightly more than 50%) was higher
than that of discriminant analysis and the composite method, its predictive performance was still very low.
Keywords: Data mining, tests of spread, robust statistical tests, logistic regression, discriminant
analysis, hinge estimator.
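
The labelling and partitioning steps described in the abstract can be sketched roughly as follows. This is only an illustration and not the SAS workflow used in the study: the function names, the record layout, and the synthetic rates are hypothetical, and the interval [0.045, 0.050] is borrowed from the logistic regression result as one possible reading of 'close' to 0.05.

```python
import random


def is_robust(rate, low=0.045, high=0.050):
    """Flag an empirical Type I error rate as 'close' to the 0.05 target.

    The default bounds are an assumption; the abstract does not state
    the exact definition of 'close' used for every analysis.
    """
    return low <= rate <= high


def split_records(records, train_fraction=0.95, seed=0):
    """Shuffle a copy of the records, then split into training and validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical records: (record id, empirical Type I error rate).
# The rates are synthetic placeholders, not results from the study.
records = [(i, 0.030 + 0.001 * (i % 40)) for i in range(26628)]

# Attach the binary target that a classifier would be trained to predict.
labelled = [(rid, rate, is_robust(rate)) for rid, rate in records]

train, valid = split_records(labelled)
print(len(train), len(valid))
```

With 26,628 records, a 95%/5% split of this kind leaves roughly 1,300 records for validation. A classifier such as logistic regression or discriminant analysis would then be fitted on the training part, with the robustness flag as the response.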
Description
Keywords
Robust tests