Output of Spark Make Design Matrix node and Variable selection for downstream Spark modelling analysis nodes


Article ID: KB0082895


Product: Spotfire Statistica
Versions: 13.3, 13.3.1

Description

How to interpret the output of the "Spark Make Design Matrix" node, and how to select variables for downstream Spark modelling analysis nodes?

Issue/Introduction

This article introduces the output of the "Spark Make Design Matrix" node and explains variable selection for downstream Spark modelling analysis nodes.

Environment

Windows 7, Windows Server 2012 R2

Resolution

1. "Spark Make Design Matrix" node
  • This node uses the RFormula API to produce a main-effects design matrix. It expects a single dependent variable and any number of continuous and/or categorical predictors; the input data should not contain missing values. The node produces the design matrix as a downstream document that can be fed to further Spark ML analyses.
  • The output of this node contains two new columns named "outfeatures" and "label" (see the figure below). "label" holds the dependent variable chosen in this node. "outfeatures" is a Spark vector (sparse or dense) containing all the predictors added to the design matrix (continuous variables plus one-hot-encoded categorical variables). Note that this "one-hot encoding" is dummy coding: Spark uses reference coding, meaning the last level is dropped (set to zero).
  • It is normal and expected that this node's output spreadsheet table shows "outfeatures" as "[OBJECT]", because the spreadsheet has no representation for sparse vectors. To see the details, append ".show()" to the design matrix in the Scala code, and the contents will be printed in the node report.
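The behavior described above can be sketched in plain Spark ML Scala code. This is a minimal illustration, not the node's actual implementation: the local session, the sample DataFrame, and the formula string are all assumptions for demonstration; only the RFormula API and the "outfeatures"/"label" column names come from the article.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

// Hypothetical local session and data; in Statistica the upstream node
// supplies the DataFrame and the node builds the formula for you.
val spark = SparkSession.builder()
  .appName("DesignMatrixSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1.0, 2.3, "A"),
  (0.0, 1.1, "B"),
  (1.0, 3.7, "A")
).toDF("y", "x1", "cat")

// Main-effects formula: dependent ~ predictors. RFormula one-hot encodes
// categorical columns with reference coding (last level dropped/set to zero).
val design = new RFormula()
  .setFormula("y ~ x1 + cat")
  .setFeaturesCol("outfeatures")
  .setLabelCol("label")
  .fit(df)
  .transform(df)

// .show() prints the sparse/dense vector contents that the spreadsheet
// view can only display as "[OBJECT]".
design.select("label", "outfeatures").show(truncate = false)
```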
2. Variable selection for downstream Spark modelling analysis nodes (linked downstream of the "Spark Make Design Matrix" node)
  • The user just needs to choose "label" as the dependent variable and "outfeatures" as the predictor; the analysis will then use all the variables in the design matrix.
  • The use of "label" as dependent and "outfeatures" as predictor applies to the various downstream Spark modelling analyses, including Spark Linear Regression, Spark Logistic Regression, Spark Generalized Linear Regression, and Spark Random Forest Regression/Classification.
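A downstream model picks up the design matrix simply by pointing its label and features columns at the two columns the upstream node created. The following is a self-contained sketch using Spark Linear Regression; the session, sample data, and formula are illustrative assumptions standing in for what the "Spark Make Design Matrix" node would pass downstream.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DownstreamModelSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical training data standing in for the upstream node's input.
val df = Seq(
  (3.1, 1.0, "A"),
  (4.9, 2.0, "B"),
  (7.2, 3.0, "A"),
  (8.8, 4.0, "B")
).toDF("y", "x1", "cat")

// Equivalent of the design-matrix step: produces "outfeatures" and "label".
val design = new RFormula()
  .setFormula("y ~ x1 + cat")
  .setFeaturesCol("outfeatures")
  .setLabelCol("label")
  .fit(df)
  .transform(df)

// Downstream model: select "label" as dependent and "outfeatures" as
// predictor, and all variables in the design matrix are used automatically.
val lr = new LinearRegression()
  .setLabelCol("label")
  .setFeaturesCol("outfeatures")

val model = lr.fit(design)
println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")
```

The same two `setLabelCol`/`setFeaturesCol` calls apply unchanged to the other estimators named above (e.g. `LogisticRegression`, `GeneralizedLinearRegression`, `RandomForestRegressor`/`RandomForestClassifier`).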
User-added image