Building a decision tree with scikit-learn is straightforward. If you are not yet familiar with it, we recommend that you take a look at the scikit-learn documentation on decision trees.
Once the decision tree is built, export the model parameters to a .txt file and the tree structure to a .dot file. Both files can then be given to Py2PMML, which generates the equivalent PMML code for your model. The .dot file is the Graphviz representation of the tree structure, which you can export from Python using scikit-learn's export_graphviz exporter.
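As a minimal sketch of the .dot export, the snippet below fits a small DecisionTreeRegressor on the Iris data and writes its Graphviz representation. The file name "TreeModel.dot", the tree depth, and the choice to label the features X0 to X3 through feature_names are assumptions made for illustration, not requirements confirmed for Py2PMML.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeRegressor, export_graphviz

iris = load_iris()

# Assumed naming scheme: X0, X1, ... (one name per input column)
feature_names = [f"X{i}" for i in range(iris.data.shape[1])]

# A shallow tree keeps the exported structure small; depth is arbitrary here
model = DecisionTreeRegressor(max_depth=3, random_state=0)
model.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the DOT source as a string
dot_source = export_graphviz(model, out_file=None, feature_names=feature_names)

# "TreeModel.dot" is a hypothetical file name chosen for this sketch
with open("TreeModel.dot", "w") as handle:
    handle.write(dot_source)
```

Passing a file name directly as out_file would also work; returning the string first simply makes it easy to inspect the tree before writing it.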
The .txt file must list the following items in this exact order:
- Model type: The name of the scikit-learn class used to build the model (string). In this case "DecisionTreeRegressor" (without quotation marks).
- Model name: The name you are giving your model in PMML (string)
- Function type: In this case, "regression" (no quotation marks - string)
- Split characteristic: In this case, "binarySplit" (no quotation marks - string)
- Number of input variables (integer)
- Input fields (one entry per line): name, data type, operational type, missing value replacement, missing value treatment, and invalid value treatment (sequence of strings, comma separated). Make sure each string follows the PMML nomenclature as outlined below.
- Number of target categories (integer)
- The name of each of the target categories (sequence of strings, one name per line)
The following Python code exemplifies the writing of the .txt and .dot files for a simple decision tree built using the Iris dataset. Note that the input features were named X0, X1, X2, ... X9.
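A minimal sketch of the .txt part of such a script is shown here, using only the standard library and following the order listed above. The model name "IrisTree", the comma separator without spaces, the four feature names X0 to X3, the "NA" placeholders, and the Iris species used as target categories are all illustrative assumptions; the sketch only mirrors the listed sequence and does not confirm what Py2PMML accepts.

```python
def build_model_description(model_type, model_name, function_type,
                            split_characteristic, input_fields,
                            target_categories):
    """Return the lines of the .txt file in the order listed above."""
    lines = [
        model_type,
        model_name,
        function_type,
        split_characteristic,
        str(len(input_fields)),       # number of input variables
    ]
    # One comma-separated entry per input field
    lines.extend(",".join(field) for field in input_fields)
    lines.append(str(len(target_categories)))  # number of target categories
    lines.extend(target_categories)            # one name per line
    return lines


# Assumed inputs: four Iris features, no missing or invalid value handling
input_fields = [
    (f"X{i}", "double", "continuous", "NA", "NA", "NA") for i in range(4)
]

lines = build_model_description(
    model_type="DecisionTreeRegressor",
    model_name="IrisTree",            # hypothetical model name
    function_type="regression",
    split_characteristic="binarySplit",
    input_fields=input_fields,
    target_categories=["setosa", "versicolor", "virginica"],
)

with open("TreeModel.txt", "w") as handle:
    handle.write("\n".join(lines) + "\n")
```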
For this code, the resulting .txt file is written to file "TreeModel.txt" which will contain the following information:
PMML Nomenclature and Input Field Information
For each input field, the .txt file is required to have the following important information which must follow the PMML nomenclature:
- Data type: "double", "float", "integer", or "string". This information is required in PMML.
- Operational type: "continuous" or "categorical". This information is required in PMML.
- Missing value replacement: If the field is continuous, this value should be a number. For example, you may want to replace any missing values for input field "age" with the value "31", the mean value for age in your historical data. Use value "NA" if you do not want this information to be part of the PMML code.
- Missing value treatment: PMML uses this attribute for information only. Possible values are: "asMean", "asMode", "asMedian", "asValue", and "asIs". Use value "NA" if you do not want this information to be part of the PMML code.
- Invalid value treatment: Used to deal with invalid values. For example, if an invalid value is encountered, you may want to treat it the same way you are treating missing values. Possible values are: "returnInvalid" (default), "asIs", "asMissing". Use value "NA" if you do not want this information to be part of the PMML code.
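The allowed values above can be checked before the .txt file is handed to Py2PMML. The following is a minimal sketch of such a check; the function name check_input_field and the tolerance for spaces around commas are assumptions for illustration and are not part of Py2PMML.

```python
DATA_TYPES = {"double", "float", "integer", "string"}
OPERATIONAL_TYPES = {"continuous", "categorical"}
MISSING_TREATMENTS = {"asMean", "asMode", "asMedian", "asValue", "asIs", "NA"}
INVALID_TREATMENTS = {"returnInvalid", "asIs", "asMissing", "NA"}


def check_input_field(entry):
    """Return a list of problems found in one comma-separated field entry."""
    parts = [part.strip() for part in entry.split(",")]
    if len(parts) != 6:
        return [f"expected 6 values, found {len(parts)}"]

    name, data_type, op_type, replacement, missing, invalid = parts
    problems = []
    if data_type not in DATA_TYPES:
        problems.append(f"{name}: unknown data type '{data_type}'")
    if op_type not in OPERATIONAL_TYPES:
        problems.append(f"{name}: unknown operational type '{op_type}'")
    if missing not in MISSING_TREATMENTS:
        problems.append(f"{name}: unknown missing value treatment '{missing}'")
    if invalid not in INVALID_TREATMENTS:
        problems.append(f"{name}: unknown invalid value treatment '{invalid}'")

    # A continuous field needs a numeric replacement unless it is "NA"
    if op_type == "continuous" and replacement != "NA":
        try:
            float(replacement)
        except ValueError:
            problems.append(f"{name}: replacement '{replacement}' is not a number")
    return problems


ok_problems = check_input_field("age,double,continuous,31,asMean,asMissing")
bad_problems = check_input_field("age,real,continuous,unknown,asMean,NA")
```

Here ok_problems is empty, while bad_problems reports the unknown data type and the non-numeric replacement value.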
Go ahead and give it a try. Pass the .txt and .dot files through Py2PMML to obtain the equivalent PMML code for the model. With the PMML file in hand, your model can be executed anywhere using one of the Zementis scoring products.