Building a random forest model with scikit-learn is straightforward. If you are not yet familiar with it, we recommend that you first read the scikit-learn documentation on random forests.
Once the model is built, all you need to do is export the model parameters to a .txt file, together with the structure information for every tree in the random forest (represented as several .dot files, one per tree). These files can then be given to Py2PMML, which generates the equivalent PMML code for your model. Note that a .dot file is the Graphviz representation of the tree structure, which you can export from Python using scikit-learn's export_graphviz exporter.
The .txt file needs to follow a strict sequence. This is defined as follows:
- Model type: The name of the scikit-learn class used to build the model (string). In this case, "RandomForestRegressor" (without quotation marks).
- Model name: The name you are giving your model in PMML (string).
- Function type: In this case, "regression" (no quotation marks - string)
- Split characteristic: In this case, "binarySplit" (no quotation marks - string)
- Number of input fields (integer)
- Input fields (one entry per line): name, data type, operational type, missing value replacement, missing value treatment, and invalid value treatment (sequence of strings, comma separated). Make sure each string follows the PMML nomenclature as outlined below.
- Number of target categories (integer)
- The name of each of the target categories (sequence of strings, one name per line)
- Total number of trees in the random forest model (integer)
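The sequence above can be sketched as a small helper that assembles the entries in order. This is only an illustration: the function name `model_description_lines` is hypothetical and not part of Py2PMML, and it assumes that every entry sits on its own line and that the input field values are joined by a comma with no surrounding spaces.

```python
def model_description_lines(model_type, model_name, function_type,
                            split_characteristic, input_fields,
                            target_categories, n_trees):
    """Return the .txt entries in the order listed above, one string per line.

    input_fields is a list of 6-tuples: (name, data type, operational type,
    missing value replacement, missing value treatment,
    invalid value treatment).
    """
    lines = [
        model_type,
        model_name,
        function_type,
        split_characteristic,
        str(len(input_fields)),          # number of input fields
    ]
    for field in input_fields:
        if len(field) != 6:
            raise ValueError(f"expected 6 values per input field, got {field!r}")
        # Assumption: plain comma separator, no extra whitespace.
        lines.append(",".join(str(value) for value in field))
    lines.append(str(len(target_categories)))  # number of target categories
    lines.extend(target_categories)            # one category name per line
    lines.append(str(n_trees))                 # total number of trees
    return lines


# Usage with two illustrative fields; "MyForest" is a made-up model name and
# the empty target category list is an assumption for a regression model.
fields = [
    ("X0", "double", "continuous", "NA", "NA", "NA"),
    ("X1", "double", "continuous", "NA", "NA", "NA"),
]
lines = model_description_lines("RandomForestRegressor", "MyForest",
                                "regression", "binarySplit", fields, [], 10)
print("\n".join(lines))
```

Keeping the ordering in one function makes it harder to swap two entries by accident, which matters because the file is read strictly by position.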
The following Python code illustrates how to write the .txt and .dot files for a random forest model with just 10 trees, built using the Diabetes dataset. Note that the input features are named X0, X1, X2, ..., X9.
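A minimal sketch of such a script is given here. It is not necessarily identical to the original listing and rests on several assumptions: the output directory `rf_export`, the .dot file names `tree_0.dot` to `tree_9.dot`, and the model name `DiabetesForest` are all made up for illustration; every input field is declared as a continuous double with "NA" for the three optional settings; and the number of target categories is written as 0 because the model is a regressor. Check these choices against what Py2PMML expects before relying on them.

```python
from pathlib import Path

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import export_graphviz

# Diabetes dataset: 10 numeric input features and a numeric target.
X, y = load_diabetes(return_X_y=True)
feature_names = [f"X{i}" for i in range(X.shape[1])]  # X0 ... X9

# A small forest with 10 trees; random_state only makes the run repeatable.
model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(X, y)

out_dir = Path("rf_export")  # hypothetical output directory
out_dir.mkdir(exist_ok=True)

# One Graphviz .dot file per tree (the file names are an assumption).
for i, tree in enumerate(model.estimators_):
    export_graphviz(tree, out_file=str(out_dir / f"tree_{i}.dot"),
                    feature_names=feature_names)

# Model description, in the sequence required for the .txt file.
lines = [
    "RandomForestRegressor",    # model type
    "DiabetesForest",           # model name (hypothetical)
    "regression",               # function type
    "binarySplit",              # split characteristic
    str(len(feature_names)),    # number of input fields
]
# Assumption: all inputs are continuous doubles with no extra treatment.
lines += [f"{name},double,continuous,NA,NA,NA" for name in feature_names]
lines += ["0"]                            # assumption: no target categories
lines += [str(len(model.estimators_))]    # total number of trees

(out_dir / "RandomForestModel.txt").write_text("\n".join(lines) + "\n")
```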
For this code, the resulting .txt file is written to "RandomForestModel.txt", which will contain the following information:
PMML Nomenclature and Input Field Information
For each input field, the .txt file must provide the following information, which must follow the PMML nomenclature:
- Data type: "double", "float", "integer", or "string". This information is required in PMML.
- Operational type: "continuous" or "categorical". This information is required in PMML.
- Missing value replacement: If the field is continuous, this value should be a number. For example, you may want to replace any missing values for input field "age" with the value "31", the mean value for age in your historical data. Use value "NA" if you do not want this information to be part of the PMML code.
- Missing value treatment: PMML uses this attribute for information only. Possible values are: "asMean", "asMode", "asMedian", "asValue", and "asIs". Use value "NA" if you do not want this information to be part of the PMML code.
- Invalid value treatment: Used to deal with invalid values. For example, if an invalid value is encountered, you may want to treat it the same way you are treating missing values. Possible values are: "returnInvalid" (default), "asIs", "asMissing". Use value "NA" if you do not want this information to be part of the PMML code.
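Because a misspelled value is easy to miss, it can help to check each input field entry against the allowed values before handing the file to Py2PMML. The following is a hedged sketch of such a check: `check_input_field` is a hypothetical helper, not part of Py2PMML, and the allowed values are simply those listed above, with "NA" accepted for the three optional settings.

```python
DATA_TYPES = {"double", "float", "integer", "string"}
OP_TYPES = {"continuous", "categorical"}
MISSING_TREATMENTS = {"asMean", "asMode", "asMedian", "asValue", "asIs", "NA"}
INVALID_TREATMENTS = {"returnInvalid", "asIs", "asMissing", "NA"}


def check_input_field(line):
    """Return a list of problems found in one comma-separated field entry."""
    parts = [part.strip() for part in line.split(",")]
    if len(parts) != 6:
        return [f"expected 6 comma-separated values, got {len(parts)}"]
    name, dtype, optype, replacement, missing, invalid = parts

    problems = []
    if not name:
        problems.append("empty field name")
    if dtype not in DATA_TYPES:
        problems.append(f"unknown data type: {dtype}")
    if optype not in OP_TYPES:
        problems.append(f"unknown operational type: {optype}")
    if missing not in MISSING_TREATMENTS:
        problems.append(f"unknown missing value treatment: {missing}")
    if invalid not in INVALID_TREATMENTS:
        problems.append(f"unknown invalid value treatment: {invalid}")
    # A continuous field needs a numeric replacement unless it is "NA".
    if optype == "continuous" and replacement != "NA":
        try:
            float(replacement)
        except ValueError:
            problems.append(f"replacement is not a number: {replacement}")
    return problems


ok = check_input_field("age,double,continuous,31,asMean,asMissing")
bad = check_input_field("age,double,continuous,unknown,asMean,drop")
print(ok)
print(bad)
```

An empty list means the entry uses only values from the lists above; it does not guarantee that Py2PMML will accept the file.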
Go ahead and give it a try. Pass the .txt and .dot files (one .dot file per tree; the 10 .dot files are included in the attached zip file) through Py2PMML to obtain the equivalent PMML code for the model. With the PMML file in hand, your model can be executed anywhere using one of the Zementis scoring products.