What is KNIME? According to knime.com:
KNIME (Konstanz Information Miner) is a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform.
Yes, KNIME is user-friendly, not only because it offers an intuitive GUI for analyzing data, but also because it is open source. KNIME is also standards-friendly. KNIME 2.0, released in 2008, was the first release to offer PMML support. PMML, the Predictive Model Markup Language, is the de facto standard for representing data mining and predictive analytics models. Today, PMML is supported by all the top statistical packages, including SAS, IBM SPSS, KXEN, and R.
Since release 2.0, PMML support in KNIME has matured considerably, from the import and export of predictive models all the way to the pre-processing of input variables. KNIME 2.5, released on December 1, 2011, offers a series of PMML-enabled pre-processing nodes whose operations can be embedded automatically in the final PMML model. All these features are documented in a paper presented at the KDD 2011 PMML Workshop:
Peer-reviewed article: KDD 2011 – PMML Pre-processing in KNIME
To illustrate some of KNIME's capabilities when it comes to PMML, we describe below a workflow we built in KNIME to train a neural network model that classifies the audit data set. This workflow encapsulates the following high-level tasks:
- The reading in of the audit data set (supplied as part of the R Rattle package). This is an artificial data set consisting of fictional clients who have been audited, perhaps for tax refund compliance. For each case an outcome is recorded: whether the taxpayer's claims had to be adjusted or not, represented in the data by 1 (yes) or 0 (no).
- The pre-processing of input variables, which involves creating dummy variables for the categorical variables and normalizing the numerical variables.
- The training and testing of a neural network model.
- The exporting of the resulting PMML file which includes all pre-processing steps as well as the neural network model itself.
KNIME Workflow - Step-by-Step
Below we describe in 8 steps how we went about building such a workflow.
Step 1: We start by reading the audit data set from a csv file. We simply use node "CSV Reader" for that. We then use node "Number To String" to tell KNIME that our predicted variable "TARGET_Adjusted" should be treated as a string.
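For readers who think in code, Step 1 can be approximated outside KNIME. The sketch below is not what the KNIME nodes do internally; it only mimics their effect with Python's standard library. Only the column names ID, IGNORE_Accounts, and TARGET_Adjusted come from the text above; the other columns, all sample values, and the helper names are made up for illustration.

```python
import csv
import io

# Made-up sample rows standing in for the audit CSV file (assumption).
SAMPLE_CSV = """ID,Age,Employment,Income,IGNORE_Accounts,TARGET_Adjusted
1001,38,Private,81838.0,UnitedStates,0
1002,35,Private,72099.0,Jamaica,0
1003,32,Consultant,154676.74,UnitedStates,1
"""


def infer(value):
    """Guess a cell type the way a CSV reader might: int, then float, else string."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def read_csv(text):
    """Parse CSV text into a list of dicts with inferred cell types."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: infer(cell) for name, cell in row.items()} for row in reader]


def number_to_string(rows, column):
    """Turn a numeric column into a string column so it is treated as a class label."""
    return [{**row, column: str(row[column])} for row in rows]


rows = number_to_string(read_csv(SAMPLE_CSV), "TARGET_Adjusted")
print(rows[0])
```

After the conversion, TARGET_Adjusted holds the labels "0" and "1" rather than the numbers 0 and 1, which is what tells a classifier that it is predicting a category and not a quantity.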
Step 2: Since we do not want to use all variables in the data set for training our neural network, we use the node "Column Filter" to filter out variables such as ID and IGNORE_Accounts.
Step 3: We are now ready to start massaging the remaining data. For that we use the new PMML-enabled node "One2Many" to create dummy variables out of the categorical raw input variables. Note that this node comes with a blue port indicating its PMML capabilities. We also use another "Column Filter" node to remove the original categorical variables from our data.
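The effect of Step 3 can be sketched in plain Python as follows. This is an illustration only: the function name, the `column_value` naming scheme for the new columns, and the sample rows are assumptions, and the sketch folds the dummy-variable creation and the removal of the original column into a single function, whereas the workflow uses two separate nodes.

```python
def one_to_many(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the categorical one being expanded.
        new_row = {name: value for name, value in row.items() if name != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up sample rows (assumption).
rows = [
    {"Age": 38, "Employment": "Private"},
    {"Age": 35, "Employment": "Consultant"},
    {"Age": 32, "Employment": "Private"},
]
encoded = one_to_many(rows, "Employment")
for row in encoded:
    print(row)
```

Each row ends up with exactly one indicator set to 1, which is the form a neural network needs for categorical inputs.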
Step 4: We then add PMML-enabled node "Normalizer" to the workflow. This node normalizes all the numerical variables so that they can be presented to the neural network for training. Note that we linked the blue port from the preceding node to this node. This signals KNIME that we would like to have the PMML representation passed between nodes.
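The text does not say which normalization method was configured, so the sketch below assumes min-max scaling to the range [0, 1]. The function name and sample values are hypothetical. Note that it returns the per-column minimum and maximum alongside the data: those parameters are the kind of information a PMML representation of the step has to carry so that the same transformation can be applied to new data at scoring time.

```python
def min_max_normalize(rows, columns):
    """Scale the given numeric columns to [0, 1]; return the data and the parameters used."""
    stats = {}
    for column in columns:
        values = [row[column] for row in rows]
        stats[column] = (min(values), max(values))

    normalized = []
    for row in rows:
        new_row = dict(row)
        for column, (low, high) in stats.items():
            span = high - low
            # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
            new_row[column] = (row[column] - low) / span if span else 0.0
        normalized.append(new_row)
    return normalized, stats


# Made-up sample rows (assumption).
rows = [{"Age": 20, "Hours": 40}, {"Age": 30, "Hours": 40}, {"Age": 40, "Hours": 40}]
normalized, stats = min_max_normalize(rows, ["Age", "Hours"])
print(normalized)
print(stats)
```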
Step 5: We then use the node "Partitioning" to partition the audit data into two data sets, one for training and another for testing.
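A random split of this kind can be sketched as below. The 70/30 ratio, the fixed seed, and the function name are assumptions made for the example; the text does not state which settings were used in the workflow.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into a training and a test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten data rows
train, test = partition(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, so the test set stays unseen during training.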
Step 6: We can now use node "RProp MLP Learner" to train our neural network model. Note that this node is also PMML-enabled and so we link the blue port from node "Normalizer" to it. This ensures that the PMML equivalent of the pre-processing operations is passed to the neural net learner node.
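RProp stands for resilient backpropagation, a training scheme that uses only the sign of each gradient component and adapts a separate step size per weight. The sketch below shows that update rule on a toy two-parameter quadratic. It is not KNIME's implementation and it does not train a neural network. It uses the variant that skips the update right after a gradient sign flip (often called iRprop-), and all names and hyperparameter values are assumptions chosen for the example.

```python
def sign(x):
    """Return -1, 0, or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def rprop_minimize(grad_fn, weights, steps=200, eta_plus=1.2, eta_minus=0.5,
                   delta_init=0.1, delta_min=1e-6, delta_max=50.0):
    """Minimize a function given only its gradient, using sign-based RProp updates."""
    weights = list(weights)
    deltas = [delta_init] * len(weights)   # one step size per weight
    prev_grads = [0.0] * len(weights)
    for _ in range(steps):
        grads = grad_fn(weights)
        for i, g in enumerate(grads):
            agreement = g * prev_grads[i]
            if agreement > 0:
                # Same direction as last time: take a bolder step.
                deltas[i] = min(deltas[i] * eta_plus, delta_max)
            elif agreement < 0:
                # Sign flipped, so we overshot: shrink the step and skip this update.
                deltas[i] = max(deltas[i] * eta_minus, delta_min)
                g = 0.0
            weights[i] -= sign(g) * deltas[i]
            prev_grads[i] = g
    return weights


def toy_gradient(w):
    """Gradient of f(w) = (w[0] - 3)^2 + (w[1] + 1)^2, whose minimum is at (3, -1)."""
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


solution = rprop_minimize(toy_gradient, [0.0, 0.0])
print(solution)
```

In a multilayer perceptron the gradient would come from backpropagation over the training data instead of from a closed-form function, but the per-weight step-size logic is the same idea.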
Step 7: Given that the neural network has been trained, it is time to export the resulting PMML file. For that we use the node "PMML Writer". You can inspect the exported PMML file on your own (see RESOURCES below).
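One way to inspect a PMML file programmatically is to parse it as XML. The sketch below uses Python's standard library on a hand-written, heavily abbreviated PMML skeleton. The skeleton is an assumption made for illustration: it is not the file KNIME exports, it is not schema-complete, and the field names, values, PMML version, and placement of the derived field are made up.

```python
import xml.etree.ElementTree as ET

# Hand-written, abbreviated stand-in for an exported PMML file (assumption).
PMML_TEXT = """
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0">
  <Header/>
  <DataDictionary numberOfFields="2">
    <DataField name="Age" optype="continuous" dataType="integer"/>
    <DataField name="TARGET_Adjusted" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TransformationDictionary>
    <DerivedField name="Age_normalized" optype="continuous" dataType="double">
      <NormContinuous field="Age">
        <LinearNorm orig="17" norm="0"/>
        <LinearNorm orig="90" norm="1"/>
      </NormContinuous>
    </DerivedField>
  </TransformationDictionary>
  <NeuralNetwork functionName="classification" activationFunction="logistic"/>
</PMML>
"""

# Top-level elements that are not models.
NON_MODEL = {"Header", "MiningBuildTask", "DataDictionary",
             "TransformationDictionary", "Extension"}


def local_name(element):
    """Strip the XML namespace from a tag, e.g. '{ns}DataField' -> 'DataField'."""
    return element.tag.split("}", 1)[-1]


def summarize_pmml(text):
    """List the input fields, derived (pre-processed) fields, and models in a PMML document."""
    root = ET.fromstring(text.strip())
    names = lambda tag: [e.get("name") for e in root.iter() if local_name(e) == tag]
    return {
        "version": root.get("version"),
        "data_fields": names("DataField"),
        "derived_fields": names("DerivedField"),
        "models": [local_name(child) for child in root if local_name(child) not in NON_MODEL],
    }


summary = summarize_pmml(PMML_TEXT)
print(summary)
```

Seeing derived fields listed next to the model element is a quick check that the pre-processing steps travelled with the model into the file.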
Step 8: As far as PMML is concerned, we are done. But, to complete the model building process, we must evaluate our model against the test data. For that, we connect the test output of node "Partitioning" to node "MultiLayerPerceptron Predictor". Note that the trained neural network model is communicated from the learner node to the predictor node via a blue PMML port. Finally, we visualize the scoring results using node "Interactive Table". With this step, our data workflow is complete.
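Beyond looking at the scored rows in a table, a simple numerical check is to compare predicted and actual labels. The sketch below computes the confusion counts and the accuracy for a handful of made-up labels; the function names and the data are assumptions and do not come from the workflow's actual results.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy(actual, predicted):
    """Fraction of rows whose predicted label matches the actual label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Made-up labels for six test rows (assumption).
actual = ["0", "0", "1", "1", "0", "1"]
predicted = ["0", "1", "1", "1", "0", "0"]

counts = confusion_counts(actual, predicted)
score = accuracy(actual, predicted)
print(counts)
print(score)
```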
Putting your model to work
Once you have verified that the model works and that it generalizes over the testing data, you can simply upload the resulting PMML file into ADAPA, where it will be made available for execution.
Resources
- Read the article published in the KDD 2011 PMML Workshop: PMML Pre-processing in KNIME.
- Read the white paper that Zementis and KNIME published together: Social Media, Recommendation Engines and Real-Time Model Execution with KNIME and ADAPA.
- Join the PMML discussion group on LinkedIn.
Get products and technologies
- KNIME, a user-friendly and comprehensive open-source data integration, processing, analysis, and exploration platform. From day one, KNIME has been developed using rigorous software engineering practices and is used by professionals in both industry and academia in over 60 countries.
- ADAPA is a revolutionary predictive analytics decision management platform, available as a service in the cloud or on site. It provides a secure, fast, and scalable environment to deploy your data mining models and business logic and put them into actual use.