Building a KMeans clustering model with scikit-learn is straightforward. If you are not yet familiar with it, we recommend that you consult the scikit-learn documentation on KMeans clustering.
Once the clustering model is built, all you need to do is export the model parameters to a .txt file, which can then be given to Py2PMML so that it generates the equivalent PMML code for your model.
The .txt file must follow a strict sequence, defined as follows:
- Model type: The name of the scikit-learn class used to build the model (string). In this case, "KMeans" (without quotation marks).
- Model name: The name you are giving your model in PMML (string)
- Number of input variables (integer)
- Input fields (one entry per line): name, data type, operational type, missing value replacement, missing value treatment, and invalid value treatment (sequence of strings, comma separated). Make sure each string follows the PMML nomenclature as outlined below.
- Number of clusters (integer)
- Centroids (float, one value per line; the number of values depends on the number of clusters and the number of model inputs)
- Comparison measure (string)
- Compare function (string)
The following Python code illustrates how to write the .txt file for a simple KMeans clustering model built using the Iris dataset.
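A minimal sketch of such a script, written against the sequence listed above. Several details are assumptions rather than documented Py2PMML behavior: the helper names `write_kmeans_txt` and `export_iris_example`, the model name "IrisKMeans", the input field names, the absence of spaces after the commas in each field entry, the cluster-by-cluster order of the centroid values, and the values "squaredEuclidean" and "absDiff", which come from the PMML vocabulary for comparison measures and compare functions.

```python
def write_kmeans_txt(path, model_name, fields, centroids,
                     comparison_measure="squaredEuclidean",
                     compare_function="absDiff"):
    """Write KMeans parameters to a .txt file in the sequence listed above.

    fields    -- one 6-tuple of strings per input: name, data type,
                 operational type, missing value replacement,
                 missing value treatment, invalid value treatment
    centroids -- one sequence of floats per cluster, one value per input
    """
    n_inputs = len(fields)
    for center in centroids:
        if len(center) != n_inputs:
            raise ValueError("each centroid needs one value per input field")

    lines = ["KMeans", model_name, str(n_inputs)]
    # One comma-separated entry per input field (assumed: no spaces).
    lines += [",".join(field) for field in fields]
    lines.append(str(len(centroids)))
    # Assumed order: all values of cluster 0, then cluster 1, and so on.
    for center in centroids:
        lines += [repr(float(value)) for value in center]
    lines += [comparison_measure, compare_function]

    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    return lines


def export_iris_example(path="ClusteringModel.txt"):
    """Fit KMeans on the Iris dataset and export it (requires scikit-learn)."""
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_iris

    iris = load_iris()
    # Three clusters is an illustrative choice, not a requirement.
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(iris.data)

    # Hypothetical field names; "NA" leaves the optional attributes out.
    names = ["sepal_length", "sepal_width", "petal_length", "petal_width"]
    fields = [(name, "double", "continuous", "NA", "NA", "NA")
              for name in names]
    return write_kmeans_txt(path, "IrisKMeans", fields,
                            kmeans.cluster_centers_.tolist())
```

Calling `export_iris_example()` fits the model and writes the file. Keeping the writer separate from the model fitting means the same function can be reused for any fitted KMeans model by passing its `cluster_centers_` converted to a list.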
This code writes the resulting parameters to the file "ClusteringModel.txt", which will contain the following information:
PMML Nomenclature and Input Field Information
For each input field, the .txt file must contain the following information, which must follow the PMML nomenclature:
- Data type: "double", "float", "integer", or "string". This information is required in PMML.
- Operational type: "continuous" or "categorical". This information is required in PMML.
- Missing value replacement: If the field is continuous, this value should be a number. For example, you may want to replace any missing values for input field "age" with the value "31", the mean value for age in your historical data. Use value "NA" if you do not want this information to be part of the PMML code.
- Missing value treatment: PMML uses this attribute for information only. Possible values are: "asMean", "asMode", "asMedian", "asValue", and "asIs". Use value "NA" if you do not want this information to be part of the PMML code.
- Invalid value treatment: Used to deal with invalid values. For example, if an invalid value is encountered, you may want to treat it the same way you are treating missing values. Possible values are: "returnInvalid" (default), "asIs", "asMissing". Use value "NA" if you do not want this information to be part of the PMML code.
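Because each entry must match the PMML nomenclature exactly, it can help to check the values before writing them. The sketch below builds one comma-separated input field entry and rejects values outside the lists given above. The function name `format_input_field` and the choice to raise `ValueError` are illustrative assumptions; this is not part of Py2PMML.

```python
DATA_TYPES = {"double", "float", "integer", "string"}
OPERATIONAL_TYPES = {"continuous", "categorical"}
MISSING_TREATMENTS = {"asMean", "asMode", "asMedian", "asValue", "asIs", "NA"}
INVALID_TREATMENTS = {"returnInvalid", "asIs", "asMissing", "NA"}


def format_input_field(name, data_type, operational_type,
                       missing_replacement="NA",
                       missing_treatment="NA",
                       invalid_treatment="NA"):
    """Return one input field entry, validated against the lists above."""
    if "," in name:
        raise ValueError("field name must not contain a comma")
    if data_type not in DATA_TYPES:
        raise ValueError("unknown data type: " + data_type)
    if operational_type not in OPERATIONAL_TYPES:
        raise ValueError("unknown operational type: " + operational_type)
    if missing_treatment not in MISSING_TREATMENTS:
        raise ValueError("unknown missing value treatment: " + missing_treatment)
    if invalid_treatment not in INVALID_TREATMENTS:
        raise ValueError("unknown invalid value treatment: " + invalid_treatment)

    # For continuous fields the replacement should be a number or "NA".
    if operational_type == "continuous" and missing_replacement != "NA":
        try:
            float(missing_replacement)
        except ValueError:
            raise ValueError("continuous fields need a numeric replacement")

    return ",".join([name, data_type, operational_type,
                     str(missing_replacement), missing_treatment,
                     invalid_treatment])
```

For the "age" example above, `format_input_field("age", "double", "continuous", "31", "asMean", "asMissing")` produces the entry `age,double,continuous,31,asMean,asMissing`.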
Go ahead and give it a try. Pass the .txt file through Py2PMML to obtain the equivalent PMML code for the model. With the PMML file in hand, your model can be executed anywhere using one of the Zementis scoring products.