Formatting your data file for batch-scoring in ADAPA

For batch scoring in ADAPA, upload your data as a CSV (comma-separated values) file. Make sure the data file contains every input field your model uses. If a field is missing, ADAPA will not generate any scores.

Also, the first row should contain the names of the variables.

For example, for the model "TaxAudit_SVM.pmml" available as part of the sample PMML files distributed with ADAPA, the first 6 rows of the CSV data file used to validate the model look like:

AGE,Employment,Education,Marital,Occupation,Income,Sex,Deductions,Hours,Adjusted
38,Private,College,Unmarried,Service,81838,Female,0,72,0
35,Private,Associate,Absent,Transport,72099,Male,0,30,0
32,Private,HSgrad,Divorced,Clerical,154676.74,Male,0,40,0
45,Private,Bachelor,Married,Repair,27743.82,Male,0,55,1
60,Private,College,Married,Executive,7568.23,Male,0,40,0
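Because a missing input field prevents ADAPA from generating scores, it can help to check the header row before uploading. The following is a minimal sketch in Python, not part of ADAPA: the function name missing_fields and the REQUIRED list are hypothetical, and the list simply assumes that every column in the sample header except "Adjusted" is a model input.

```python
import csv
import io

def missing_fields(csv_text, required):
    """Return the required field names that are absent from the header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])  # an empty file yields an empty header
    present = set(header)      # names are compared exactly, blanks included
    return [name for name in required if name not in present]

# Hypothetical list of model inputs, copied from the sample header above.
REQUIRED = ["AGE", "Employment", "Education", "Marital", "Occupation",
            "Income", "Sex", "Deductions", "Hours"]

sample = (
    "AGE,Employment,Education,Marital,Occupation,Income,Sex,Deductions,Hours\n"
    "38,Private,College,Unmarried,Service,81838,Female,0,72\n"
)
print(missing_fields(sample, REQUIRED))  # []
```

An empty list means every required name appears in the header. Replace REQUIRED with the input fields your own model declares.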

ADAPA also supports the use of double quotes around any of the fields (data or field names). Therefore, the following line is also compatible with ADAPA:

"38","Private","College","Unmarried","Service", ...

You should use double quotes to include commas inside a string as shown below:

"Ryan, Private": without double quotes, ADAPA would treat this single value as two strings.

You should also use double quotes to represent blank characters before or after a string. For example:

" AGE", "AGE ", and "AGE" represent different values whereas "AGE" and AGE are the same.

To represent a double quote inside a string, write two adjacent double quotes: "COLOR:""YELLOW""" will be interpreted by ADAPA as COLOR:"YELLOW". Use the two adjacent double quotes only inside a string that is itself surrounded by double quotes.
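The quoting rules above can be applied automatically when the file is generated by a program. The sketch below uses Python's standard csv module with QUOTE_ALL, which surrounds every field with double quotes and doubles any embedded double quote. The helper name to_csv_line is hypothetical, and the sketch assumes that ADAPA reads this output the same way Python's default CSV dialect writes it.

```python
import csv
import io

def to_csv_line(fields):
    """Serialize one record, quoting every field and doubling embedded quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(fields)
    return buffer.getvalue().rstrip("\n")

# A comma, embedded double quotes, and a leading blank, as in the examples above.
print(to_csv_line(["Ryan, Private", 'COLOR:"YELLOW"', " AGE"]))
# "Ryan, Private","COLOR:""YELLOW"""," AGE"
```

Quoting every field is a deliberate choice here: the module's default minimal quoting would leave a value such as " AGE" unquoted, so its leading blank would no longer be protected by double quotes.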

For more on how to represent your .csv file, click here (beware though that ADAPA does not allow fields to contain embedded line-breaks. In ADAPA, a record is represented by a single line).
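Since a record must fit on a single line, a file exported from another tool may need to be checked for line breaks inside quoted fields. This is a rough sketch with a hypothetical helper name, records_with_line_breaks, and it relies on Python's csv module to parse the quoted fields rather than on ADAPA's own parser.

```python
import csv
import io

def records_with_line_breaks(csv_text):
    """Return 1-based record numbers (header = 1) whose fields contain a line break."""
    # newline="" keeps the original line endings so the csv module sees them.
    reader = csv.reader(io.StringIO(csv_text, newline=""))
    bad = []
    for number, record in enumerate(reader, start=1):
        if any("\n" in field or "\r" in field for field in record):
            bad.append(number)
    return bad

text = 'name,note\nA,"one\ntwo"\nB,ok\n'
print(records_with_line_breaks(text))  # [2]
```

Any record number returned here points to a row that should be cleaned up before the file is uploaded.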

Association Rules

Association Rules in ADAPA are supported for rectangular and transaction data files (contact us for details).

Predicted Field

Also, note that in the example above the variable "Adjusted" is actually the predicted field. It is present in the example above since we are using this file for model verification (see below). Obviously, if you are only trying to score your data, you should leave the predicted column out. ADAPA will return computed scores for each entry.
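If you already have a verification file and want to reuse it for plain scoring, the predicted column can be removed with a few lines of Python. The sketch below is only an illustration: drop_column is a hypothetical helper, and the three-column input is a shortened stand-in for the sample file.

```python
import csv
import io

def drop_column(csv_text, column):
    """Return the CSV text with the named column removed from every record."""
    reader = csv.DictReader(io.StringIO(csv_text))
    kept = [name for name in (reader.fieldnames or []) if name != column]
    out = io.StringIO()
    # extrasaction="ignore" silently skips the dropped column in each row.
    writer = csv.DictWriter(out, fieldnames=kept, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return out.getvalue()

verification = "AGE,Income,Adjusted\n38,81838,0\n45,27743.82,1\n"
print(drop_column(verification, "Adjusted"))
```

The result keeps the header row and all input columns, so it can be uploaded for scoring as described above.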

Model Verification

Given that you built your model outside of ADAPA, you want to make sure that both ADAPA and your development environment produce exactly the same results.

ADAPA provides an integrated testing process to make sure your model was uploaded and works as expected. It allows for a test file containing from 1 to thousands of records with all the necessary input variables and the expected result for each record to be uploaded for score matching.

This can be done easily through the ADAPA Console. After processing the file, ADAPA returns statistics on the total number of matched and unmatched records, percentages, etc. (see figure below). If any records fail the matching test, a complete list of all failed records is displayed. You can then look through the computed information for each record to locate where expected and computed values differ and thus pinpoint the source of the problem.

[Figure: screenshot of model verification results in the ADAPA Console]
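To make the idea of score matching concrete, the sketch below compares expected and computed values record by record and reports counts, a percentage, and the failed records. It is not ADAPA's implementation: the function match_scores and its report layout are hypothetical, and it assumes an exact string comparison, whereas numeric scores may call for a tolerance.

```python
def match_scores(expected, computed):
    """Compare expected and computed scores record by record."""
    if len(expected) != len(computed):
        raise ValueError("expected and computed must have the same length")
    # Keep (record number, expected, computed) for every mismatch.
    failed = [
        (number, exp, comp)
        for number, (exp, comp) in enumerate(zip(expected, computed), start=1)
        if exp != comp
    ]
    total = len(expected)
    matched = total - len(failed)
    return {
        "total": total,
        "matched": matched,
        "unmatched": len(failed),
        "matched_pct": 100.0 * matched / total if total else 0.0,
        "failed": failed,
    }

report = match_scores(["0", "0", "1"], ["0", "1", "1"])
print(report["matched"], report["unmatched"], report["failed"])
```

In this toy run, the second record is the only mismatch, so it is the one to inspect first.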

PMML also offers a Model Verification element for similar testing purposes. In this way, verification records are part of the PMML file itself. The PMML "ModelVerification" element has been integrated into ADAPA as of release 3.0. In so doing, ADAPA users have more than one way to test their models.
