A brief introduction to the PMML scripts generated from tree-based algorithm nodes in TIBCO Statistica

Products	Versions
Spotfire Statistica	13.5

Description

To explain the PMML code sections, the Basic_DM_Example workspace from the Statistica workspace examples folder is used.

1. After launching Statistica, go to File|Open Examples|Workspaces and open the Basic_DM_Example workspace.

2. In this example, three tree-based models are trained on the training set: "Advanced Classification Trees (C&RT)", "Boosted Classification Trees", and "Random Forest Classification". A screenshot of the workspace is shown below:

User-added image

3. Click Run All to run the workspace. PMML code is output in the PMML Model node for every configured modeling node.

4. Click the Wheel icon at the top left of the PMML Model node to open the node, and select the PMML tab to review the PMML script.

Resolution

General information about Predictive Model Markup Language (PMML) can be found here:
https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language

Its general structure is described on the Data Mining Group (DMG) website:
http://dmg.org/pmml/v4-2-1/GeneralStructure.html

The PMML example script below is taken from the PMML node output of the Boosted Classification Trees model in the Basic_DM_Example workspace. The TreeModel PMML reference (http://dmg.org/pmml/v4-2-1/TreeModel.html#xsdElement_TreeModel) is used as the guideline for the explanation.

------------------------------------------------------------------------------------------------------------
<PMML xmlns="http://www.dmg.org/PMML-4_2" version ="4.2">
<Header copyright="STATISTICA Data Miner, Copyright 1984-2018 TIBCO Software Inc. All rights reserved."></Header>
------------------------------------------------------------------------------------------------------------
This top section displays the PMML version information, and the Header describes the copyright of the application that generated the model, in this case STATISTICA Data Miner.

------------------------------------------------------------------------------------------------------------
<DataDictionary numberOfFields="18">
<DataField name="Credit Rating" optype="categorical" dataType="string">
<Value value="bad"></Value>
<Value value="good"></Value>
</DataField>
<DataField name="Duration of Credit" optype="continuous" dataType="double"></DataField>
<DataField name="Amount of Credit" optype="continuous" dataType="double"></DataField>
<DataField name="Age" optype="continuous" dataType="double"></DataField>
<DataField name="Balance of Current Account" optype="categorical" dataType="string">
<Value value="no running account"></Value>
<Value value="no balance"></Value>
<Value value="&lt;= $300"></Value>
<Value value=">$300"></Value>
......
</DataField>
</DataDictionary>
------------------------------------------------------------------------------------------------------------
The DataDictionary section describes the fields that the user selected for use in the mining models, together with their types and value ranges.
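
As an illustration of how this section can be read programmatically, the following minimal Python sketch lists each DataField with its optype, dataType, and allowed values. It uses only the standard library. The embedded PMML text is a trimmed copy of the excerpt above (two fields only), and the function name list_data_fields is a hypothetical helper, not part of Statistica.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the DataDictionary excerpt (two fields only, for illustration).
PMML_TEXT = """
<PMML xmlns="http://www.dmg.org/PMML-4_2" version="4.2">
  <DataDictionary numberOfFields="2">
    <DataField name="Credit Rating" optype="categorical" dataType="string">
      <Value value="bad"></Value>
      <Value value="good"></Value>
    </DataField>
    <DataField name="Age" optype="continuous" dataType="double"></DataField>
  </DataDictionary>
</PMML>
"""

# PMML 4.2 elements live in this namespace (taken from the root element above).
NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def list_data_fields(pmml_text):
    """Return {field name: (optype, dataType, [allowed values])}."""
    root = ET.fromstring(pmml_text)
    fields = {}
    for field in root.findall("pmml:DataDictionary/pmml:DataField", NS):
        # Continuous fields in the excerpt carry no <Value> children.
        values = [v.get("value") for v in field.findall("pmml:Value", NS)]
        fields[field.get("name")] = (field.get("optype"), field.get("dataType"), values)
    return fields


fields = list_data_fields(PMML_TEXT)
for name, (optype, data_type, values) in fields.items():
    print(name, optype, data_type, values)
```

To inspect a real export, the same function could be given the text copied from the PMML tab of the PMML Model node.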

------------------------------------------------------------------------------------------------------------
<MiningSchema>
<MiningField name="Credit Rating" usageType="predicted" />
...
</MiningSchema>
------------------------------------------------------------------------------------------------------------
The MiningSchema lists the fields used by a model. When multiple models are selected, each model has its own MiningSchema. In contrast, the DataDictionary contains the field definitions, which do not vary by model. Keep in mind that the MiningSchema lists the fields that must be provided in order to apply the model (i.e. the PMML script). A target variable is identified by a usageType of "predicted" or "target". An independent variable is identified by a usageType of "active".
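
The usageType rule can be sketched in a few lines of Python. In the sample below, only the "Credit Rating" entry comes from the excerpt; the "Age" and "Amount of Credit" entries are illustrative stand-ins for the fields elided by "..." and their usage types are assumptions. The helper name split_mining_fields is hypothetical. A MiningField without a usageType attribute is treated as "active", which is the default in the PMML 4.2 schema.

```python
import xml.etree.ElementTree as ET

# "Credit Rating" is from the excerpt; the other two entries are assumed examples.
SCHEMA_TEXT = """
<MiningSchema xmlns="http://www.dmg.org/PMML-4_2">
  <MiningField name="Credit Rating" usageType="predicted" />
  <MiningField name="Age" usageType="active" />
  <MiningField name="Amount of Credit" />
</MiningSchema>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_2"}


def split_mining_fields(schema_text):
    """Return (target field names, input field names) of one MiningSchema."""
    schema = ET.fromstring(schema_text)
    targets, inputs = [], []
    for field in schema.findall("pmml:MiningField", NS):
        # usageType defaults to "active" when the attribute is omitted.
        usage = field.get("usageType", "active")
        if usage in ("predicted", "target"):
            targets.append(field.get("name"))
        elif usage == "active":
            inputs.append(field.get("name"))
    return targets, inputs


targets, inputs = split_mining_fields(SCHEMA_TEXT)
print("targets:", targets)
print("inputs:", inputs)
```

The input list is what a scoring application would have to supply for each record before the model can be applied.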

------------------------------------------------------------------------------------------------------------
<TreeModel modelName="BoostTreeModel" functionName="regression" algorithmName="BoostedTrees" splitCharacteristic="multiSplit">
...
<Node score="-8.52058874993726e-003">
<True></True>
<Node score="-3.55882666891956e-002">
<SimplePredicate field="Most Valuable Assets" operator="equal" value="no assets"></SimplePredicate>
</Node>
<Node score="6.77198232180596e-003">
<CompoundPredicate booleanOperator="or">
<SimplePredicate field="Most Valuable Assets" operator="equal" value="life insurance"></SimplePredicate>
<SimplePredicate field="Most Valuable Assets" operator="equal" value="car"></SimplePredicate>
<SimplePredicate field="Most Valuable Assets" operator="equal" value="ownership of house or land"></SimplePredicate>
</CompoundPredicate>
</Node>
</Node>
</TreeModel>
</Segment>
<Segment id="57">
------------------------------------------------------------------------------------------------------------
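The TreeModel section defines a single tree. Each Node carries a score and a predicate. The root node uses True, which always matches, while the child nodes use a SimplePredicate (one test on one field) or a CompoundPredicate (several tests joined by a boolean operator such as "or") to express the splitting rules on the splitting predictor, here "Most Valuable Assets". The closing Segment tag and the following Segment id="57" tag indicate that this tree is one of several segments in the boosted model.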
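
To make the role of the predicates concrete, here is a minimal Python sketch that scores one record against the tree from the excerpt. It is a simplified reading of the PMML tree-scoring rules under stated assumptions, not Statistica's scoring engine: it handles only the True, False, SimplePredicate ("equal" operator only) and CompoundPredicate ("or"/"and") elements, ignores missing-value strategies, and returns None when no child node matches. The fragment is written without the PMML namespace for brevity, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The tree from the excerpt, without the elided parts and without a namespace.
TREE_TEXT = """
<TreeModel modelName="BoostTreeModel" functionName="regression" algorithmName="BoostedTrees" splitCharacteristic="multiSplit">
  <Node score="-8.52058874993726e-003">
    <True></True>
    <Node score="-3.55882666891956e-002">
      <SimplePredicate field="Most Valuable Assets" operator="equal" value="no assets"></SimplePredicate>
    </Node>
    <Node score="6.77198232180596e-003">
      <CompoundPredicate booleanOperator="or">
        <SimplePredicate field="Most Valuable Assets" operator="equal" value="life insurance"></SimplePredicate>
        <SimplePredicate field="Most Valuable Assets" operator="equal" value="car"></SimplePredicate>
        <SimplePredicate field="Most Valuable Assets" operator="equal" value="ownership of house or land"></SimplePredicate>
      </CompoundPredicate>
    </Node>
  </Node>
</TreeModel>
"""

PREDICATE_TAGS = {"True", "False", "SimplePredicate", "CompoundPredicate"}


def predicate_holds(pred, record):
    """Evaluate the supported subset of PMML predicates against a dict record."""
    if pred.tag == "True":
        return True
    if pred.tag == "False":
        return False
    if pred.tag == "SimplePredicate":
        if pred.get("operator") != "equal":
            raise NotImplementedError("only the 'equal' operator is sketched here")
        return record.get(pred.get("field")) == pred.get("value")
    if pred.tag == "CompoundPredicate":
        results = [predicate_holds(child, record) for child in pred]
        operator = pred.get("booleanOperator")
        if operator == "or":
            return any(results)
        if operator == "and":
            return all(results)
        raise NotImplementedError("unsupported booleanOperator: %s" % operator)
    raise NotImplementedError("unsupported predicate: %s" % pred.tag)


def node_predicate(node):
    """Return the predicate element that belongs to a Node."""
    for child in node:
        if child.tag in PREDICATE_TAGS:
            return child
    raise ValueError("Node without a supported predicate")


def score_record(node, record):
    """Follow the first matching child at each level; return the leaf score."""
    children = node.findall("Node")
    if not children:
        return float(node.get("score"))
    for child in children:
        if predicate_holds(node_predicate(child), record):
            return score_record(child, record)
    return None  # no child predicate matched


def score_tree(tree_text, record):
    root = ET.fromstring(tree_text).find("Node")
    if not predicate_holds(node_predicate(root), record):
        return None
    return score_record(root, record)


print(score_tree(TREE_TEXT, {"Most Valuable Assets": "no assets"}))
print(score_tree(TREE_TEXT, {"Most Valuable Assets": "car"}))
```

A record with "no assets" ends in the first child node, while "life insurance", "car", or "ownership of house or land" ends in the second. Any other value matches neither child in this truncated excerpt.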

------------------------------------------------------------------------------------------------------------
<Output>
<OutputField name="TreePredictedValue156" optype="continuous" dataType="double" feature="predictedValue" />
<OutputField name="UpdatedPredictedValue156" optype="continuous" dataType="double" feature="transformedValue">
<Apply function="+">
<FieldRef field="UpdatedPredictedValue155" />
<FieldRef field="TreePredictedValue156" />
</Apply>
</OutputField>
</Output>
------------------------------------------------------------------------------------------------------------
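The Output section describes the results that a model returns. In this excerpt, TreePredictedValue156 is the value predicted by this tree, and UpdatedPredictedValue156 is calculated by adding it to UpdatedPredictedValue155, so the boosted prediction is accumulated from one tree to the next.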
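
The chaining expressed by the Apply function="+" element can be mimicked with a running sum. The sketch below is only an illustration of that arithmetic: the per-tree scores are made-up values, the starting value of 0.0 is an assumption (the excerpt does not show how the chain begins), and the function name chain_outputs is hypothetical.

```python
def chain_outputs(tree_predicted_values, start=0.0):
    """Return the running totals, one UpdatedPredictedValue per tree.

    Each step mirrors the excerpt:
    UpdatedPredictedValue(k) = UpdatedPredictedValue(k-1) + TreePredictedValue(k)
    """
    updated = []
    running_total = start  # assumed starting point, not shown in the excerpt
    for tree_value in tree_predicted_values:
        running_total = running_total + tree_value
        updated.append(running_total)
    return updated


# Made-up per-tree scores, chosen only to keep the arithmetic easy to follow.
tree_predicted_values = [0.5, -0.25, 0.125]
updated_predicted_values = chain_outputs(tree_predicted_values)
print(updated_predicted_values)
```

The last element of the list plays the role of the final UpdatedPredictedValue field in the chain.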

Issue/Introduction

This article explains a few sections of the PMML code generated by tree-based algorithm nodes in Statistica.

Additional Information

https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
http://dmg.org/pmml/v4-2-1/GeneralStructure.html
http://dmg.org/pmml/v4-2-1/TreeModel.html#xsdElement_TreeModel
