Building a naive Bayes model with scikit-learn is straightforward. If you are not yet familiar with it, we recommend that you take a look at the following scikit-learn documentation:
Once the naive Bayes model is built, all you need to do is export the model parameters to a .txt file, which you then pass to Py2PMML to generate the equivalent PMML code for your model.
The .txt file must list its entries in a strict order, defined as follows:
- Model type: The name of the scikit-learn class used to build the model (string). In this case, "GaussianNB" (without quotation marks)
- Model name: The name you are giving your model in PMML (string)
- Function type: In this case, "classification" (string, without quotation marks)
- Threshold: A number close to zero (float)
- Number of input fields (integer)
- Input fields (one entry per line): name, data type, operational type, missing value replacement, missing value treatment, and invalid value treatment (sequence of strings, comma separated). Make sure each string follows the PMML nomenclature as outlined below.
- Number of target categories (integer)
- The name of each of the target categories (sequence of strings, one name per line)
- Mean followed by variance (float, one value per line): one pair of values per target category and model input
- The total number of input records per target category (float, one value per line)
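As a reminder of what these exported parameters represent, the textbook Gaussian naive Bayes formulation is sketched below. The symbols are introduced here for illustration only: $\mu_{ik}$ and $\sigma^2_{ik}$ are the mean and variance stored for input $i$ and target category $T_k$, and $n_k$ is the number of input records for $T_k$. How a particular scoring engine applies the threshold is not covered by this sketch.

```latex
% Likelihood of input value x_i given target category T_k
P(x_i \mid T_k) = \frac{1}{\sqrt{2\pi\sigma^2_{ik}}}
                  \exp\!\left(-\frac{(x_i - \mu_{ik})^2}{2\sigma^2_{ik}}\right)

% Prior of each target category, from the record counts
P(T_k) = \frac{n_k}{\sum_j n_j}

% Predicted category
\hat{T} = \arg\max_k \; P(T_k) \prod_i P(x_i \mid T_k)
```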
The following Python code shows how to write the .txt file for a simple naive Bayes model built using the Iris dataset.
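A minimal sketch of such an export is given here, under stated assumptions. The helper name `gaussian_nb_export_lines`, the model name "IrisNB", the field names, and the default threshold of 0.001 are all hypothetical. The mean and variance pairs are assumed to be ordered category by category, then input by input. To keep the sketch runnable without scikit-learn, a small stand-in object with made-up values plays the role of the fitted model; it exposes the same attributes as a fitted `GaussianNB` (`classes_`, `theta_`, `var_`, `class_count_`).

```python
from types import SimpleNamespace


def gaussian_nb_export_lines(model, model_name, input_fields, threshold=0.001):
    """Return the export lines in the order described in the list above.

    `model` must expose the fitted-GaussianNB attributes classes_, theta_
    (means), var_ (variances; called sigma_ before scikit-learn 1.0) and
    class_count_. The default threshold is an assumption, not a Py2PMML value.
    """
    lines = ["GaussianNB", model_name, "classification", repr(float(threshold))]
    lines.append(str(len(input_fields)))
    # One comma-separated entry per input field.
    lines.extend(",".join(field) for field in input_fields)
    lines.append(str(len(model.classes_)))
    lines.extend(str(name) for name in model.classes_)
    # ASSUMPTION: pairs are ordered category by category, then input by input.
    for class_means, class_vars in zip(model.theta_, model.var_):
        for mean, var in zip(class_means, class_vars):
            lines.append(repr(float(mean)))
            lines.append(repr(float(var)))
    lines.extend(repr(float(count)) for count in model.class_count_)
    return lines


# Stand-in for a fitted model; the numbers are made up for illustration.
toy_model = SimpleNamespace(
    classes_=["setosa", "versicolor"],
    theta_=[[5.0, 3.4], [5.9, 2.8]],
    var_=[[0.12, 0.14], [0.26, 0.1]],
    class_count_=[50.0, 50.0],
)
# Hypothetical field names; "NA" omits the optional PMML attributes.
toy_fields = [
    ("sepal_length", "double", "continuous", "NA", "NA", "NA"),
    ("sepal_width", "double", "continuous", "NA", "NA", "NA"),
]

lines = gaussian_nb_export_lines(toy_model, "IrisNB", toy_fields)
with open("GaussianNBModel.txt", "w") as handle:
    handle.write("\n".join(lines) + "\n")
```

With scikit-learn installed, the stand-in can be replaced by a real model, for example `GaussianNB().fit(iris.data, iris.target)` with `iris = sklearn.datasets.load_iris()`, and passed to the same helper unchanged.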
This code writes the resulting .txt file to "GaussianNBModel.txt", which will contain the following information:
PMML Nomenclature and Input Field Information
For each input field, the .txt file must provide the following information, which must follow the PMML nomenclature:
- Data type: "double", "float", "integer", or "string". This information is required in PMML.
- Operational type: "continuous" or "categorical". This information is required in PMML.
- Missing value replacement: If the field is continuous, this value should be a number. For example, you may want to replace any missing values for input field "age" with the value "31", the mean value for age in your historical data. Use value "NA" if you do not want this information to be part of the PMML code.
- Missing value treatment: PMML uses this attribute for information only. Possible values are: "asMean", "asMode", "asMedian", "asValue", and "asIs". Use value "NA" if you do not want this information to be part of the PMML code.
- Invalid value treatment: Used to deal with invalid values. For example, if an invalid value is encountered, you may want to treat it the same way you are treating missing values. Possible values are: "returnInvalid" (default), "asIs", "asMissing". Use value "NA" if you do not want this information to be part of the PMML code.
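The rules above can be checked before the file is handed to Py2PMML. The following is a minimal sketch of such a check; the function name `check_field_entry` and the error messages are hypothetical and not part of Py2PMML. It assumes that categorical fields may use any replacement value, since only the continuous case is specified above.

```python
DATA_TYPES = {"double", "float", "integer", "string"}
OP_TYPES = {"continuous", "categorical"}
MISSING_TREATMENTS = {"asMean", "asMode", "asMedian", "asValue", "asIs", "NA"}
INVALID_TREATMENTS = {"returnInvalid", "asIs", "asMissing", "NA"}


def check_field_entry(entry):
    """Return a list of problems found in one comma-separated field entry."""
    parts = [part.strip() for part in entry.split(",")]
    if len(parts) != 6:
        return [f"expected 6 comma-separated values, got {len(parts)}"]
    name, data_type, op_type, replacement, missing, invalid = parts

    problems = []
    if not name:
        problems.append("empty field name")
    if data_type not in DATA_TYPES:
        problems.append(f"unknown data type: {data_type!r}")
    if op_type not in OP_TYPES:
        problems.append(f"unknown operational type: {op_type!r}")
    # Continuous fields need a numeric replacement unless it is left as "NA".
    if op_type == "continuous" and replacement != "NA":
        try:
            float(replacement)
        except ValueError:
            problems.append(f"replacement is not a number: {replacement!r}")
    if missing not in MISSING_TREATMENTS:
        problems.append(f"unknown missing value treatment: {missing!r}")
    if invalid not in INVALID_TREATMENTS:
        problems.append(f"unknown invalid value treatment: {invalid!r}")
    return problems


# A well-formed entry, followed by one with two mistakes.
print(check_field_entry("age,double,continuous,31,asMean,asMissing"))
print(check_field_entry("age,number,continuous,unknown,asMean,NA"))
```

Returning a list of problems rather than raising on the first one makes it possible to report every issue in an entry at once.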
Go ahead and give it a try. Pass the .txt file through Py2PMML to obtain the equivalent PMML code for the model. With the PMML file in hand, your model can be executed anywhere using one of the Zementis scoring products.
To learn how Zementis supports continuous inputs in PMML for naive Bayes models, please refer to the following article we presented at KDD 2013: