Output of Spark Make Design Matrix node and Variable selection for downstream Spark modelling analysis nodes


Article ID: KB0082895


Product: Spotfire Statistica
Versions: 13.3, 13.3.1

Description

How to interpret the output of the "Spark Make Design Matrix" node, and how to select variables for downstream Spark modelling analysis nodes?

Issue/Introduction

This article introduces the output of the "Spark Make Design Matrix" node and explains variable selection for downstream Spark modelling analysis nodes.

Environment

Windows 7, Windows Server 2012 R2

Resolution

1. "Spark Make Design Matrix" node
  • This node uses the RFormula API to produce a main-effects design matrix. It expects a single dependent variable and any number of continuous and/or categorical predictors; the input data should not contain missing values. The node produces the design matrix as a downstream document that can be fed to further Spark ML analyses.
  • The output of this node contains two new columns named "outfeatures" and "label" (see the figure below). "label" holds the dependent variable chosen in this node. "outfeatures" is a Spark vector (sparse or dense) containing all the predictors added to the design matrix (continuous variables plus one-hot-encoded categorical variables). Note that this "one-hot encoding" is dummy coding: Spark uses reference coding, meaning the last level is dropped (set to zero).
  • It is normal and expected that this node's output spreadsheet table shows "outfeatures" as "[OBJECT]", because the spreadsheet has no representation for sparse vectors. To see the details, append ".show()" to the design matrix in the Scala code, and the contents will be printed in the node report.
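The behavior described above can be sketched in plain Spark ML Scala code. This is a minimal illustration, not the node's actual implementation: the local session, the sample DataFrame, and the formula string are all assumptions for demonstration; only the RFormula API and the "outfeatures"/"label" column names come from the article.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.sql.SparkSession

// Hypothetical local session and data; in Statistica the upstream node
// supplies the DataFrame and the node builds the formula for you.
val spark = SparkSession.builder()
  .appName("DesignMatrixSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1.0, 2.3, "A"),
  (0.0, 1.1, "B"),
  (1.0, 3.7, "A")
).toDF("y", "x1", "cat")

// Main-effects formula: dependent ~ predictors. RFormula one-hot encodes
// categorical columns with reference coding (last level dropped/set to zero).
val design = new RFormula()
  .setFormula("y ~ x1 + cat")
  .setFeaturesCol("outfeatures")
  .setLabelCol("label")
  .fit(df)
  .transform(df)

// .show() prints the sparse/dense vector contents that the spreadsheet
// view can only display as "[OBJECT]".
design.select("label", "outfeatures").show(truncate = false)
```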
2. Variable selection for downstream Spark modelling analysis nodes (linked downstream of the "Spark Make Design Matrix" node)
  • The user just needs to choose "label" as the dependent variable and "outfeatures" as the predictor; the analysis will then use all the variables in the design matrix.
  • The use of "label" as dependent and "outfeatures" as predictor applies to the various downstream Spark modelling analyses, including Spark Linear Regression, Spark Logistic Regression, Spark Generalized Linear Regression, and Spark Random Forest Regression/Classification.
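A downstream model picks up the design matrix simply by pointing its label and features columns at the two columns the upstream node created. The following is a self-contained sketch using Spark Linear Regression; the session, sample data, and formula are illustrative assumptions standing in for what the "Spark Make Design Matrix" node would pass downstream.

```scala
import org.apache.spark.ml.feature.RFormula
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("DownstreamModelSketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical training data standing in for the upstream node's input.
val df = Seq(
  (3.1, 1.0, "A"),
  (4.9, 2.0, "B"),
  (7.2, 3.0, "A"),
  (8.8, 4.0, "B")
).toDF("y", "x1", "cat")

// Equivalent of the design-matrix step: produces "outfeatures" and "label".
val design = new RFormula()
  .setFormula("y ~ x1 + cat")
  .setFeaturesCol("outfeatures")
  .setLabelCol("label")
  .fit(df)
  .transform(df)

// Downstream model: select "label" as dependent and "outfeatures" as
// predictor, and all variables in the design matrix are used automatically.
val lr = new LinearRegression()
  .setLabelCol("label")
  .setFeaturesCol("outfeatures")

val model = lr.fit(design)
println(s"Coefficients: ${model.coefficients}, intercept: ${model.intercept}")
```

The same two `setLabelCol`/`setFeaturesCol` calls apply unchanged to the other estimators named above (e.g. `LogisticRegression`, `GeneralizedLinearRegression`, `RandomForestRegressor`/`RandomForestClassifier`).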
User-added image