Calculation of the chi-square statistic in Feature Selection

Article ID: KB0081498

Products Versions
Spotfire Statistica 12.7

Description

Binning of a continuous variable can reduce the predictive power reported for that variable in Feature Selection.

Cause

For categorical dependent variables, the Feature Selection module uses a chi-square test to assess predictor performance. A chi-square test requires two categorical variables, so a continuous predictor must be binned before the test statistic is computed. The bins are generated from the range of the data. If the lower or upper tail contains outliers, the range is stretched and the leftmost or rightmost bin can end up holding almost all of the observations. This reduces the number of populated bins, and therefore the degrees of freedom and the apparent predictive power of the variable.
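The effect can be illustrated with a minimal sketch. It assumes plain equal-width binning over the data range, which may not match the exact binning rule used by the Feature Selection module. The function name `equal_width_counts`, the bin count of 10, and the simulated data are all hypothetical and chosen only for illustration.

```python
import random

def equal_width_counts(values, n_bins):
    """Count observations per bin when the bins split the data range evenly."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    if width == 0:
        # All values are identical, so everything lands in the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum sits on the upper edge, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

random.seed(0)
# Simulated predictor: 1000 well-behaved observations around 50.
values = [random.gauss(50, 5) for _ in range(1000)]
# Three extreme upper-tail outliers stretch the range by orders of magnitude.
values += [5000.0, 7500.0, 10000.0]

counts = equal_width_counts(values, 10)
populated = sum(1 for c in counts if c > 0)
print(counts)
print("populated bins:", populated)
```

In this sketch all 1000 regular observations fall into the first bin, because each bin is roughly 1000 units wide while the regular data spans only a few dozen units. The remaining bins hold at most one outlier each, so the binned variable carries almost no information about the dependent variable.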

 

Issue/Introduction

Calculation of the chi-square statistic in Feature Selection

Resolution

To work around this issue, use one of the following approaches:

  1. Manually bin the data before running Feature Selection.
  2. Use a decision tree to assess how well the variable separates the groups.
  3. Use a t-test to compare the variable's mean between the groups.
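As an illustration of the first approach, the sketch below bins a continuous predictor at equal-frequency (quantile) cut points and then computes the Pearson chi-square statistic against a categorical dependent variable. It is written in plain Python outside Statistica. The helper names (`quantile_edges`, `assign_bin`, `chi_square`), the choice of five bins, and the simulated data are assumptions made for the example and are not part of the product.

```python
import random
from collections import Counter

def quantile_edges(values, n_bins):
    """Return n_bins - 1 interior cut points at roughly equal-frequency quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // n_bins] for i in range(1, n_bins)]

def assign_bin(value, edges):
    """Return the index of the bin that value falls into."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)

def chi_square(bins, labels):
    """Pearson chi-square statistic and degrees of freedom for two categorical lists."""
    observed = Counter(zip(bins, labels))
    row_totals = Counter(bins)
    col_totals = Counter(labels)
    n = len(bins)
    stat = 0.0
    for r, r_total in row_totals.items():
        for c, c_total in col_totals.items():
            expected = r_total * c_total / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    # Only populated rows and columns are counted.
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof

random.seed(1)
# Simulated data: two groups whose predictor means differ by one standard deviation.
labels = ["A"] * 500 + ["B"] * 500
values = [random.gauss(50, 5) for _ in range(500)]
values += [random.gauss(55, 5) for _ in range(500)]
values[0] = 10000.0  # a single extreme outlier

edges = quantile_edges(values, 5)
bins = [assign_bin(v, edges) for v in values]
stat, dof = chi_square(bins, labels)
print(f"chi-square = {stat:.1f}, degrees of freedom = {dof}")
```

Because the cut points follow the ranks of the observations rather than the range, the outlier simply lands in the top bin and every bin stays populated. The binned variable can then be supplied to Feature Selection as a categorical predictor.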