AutoML
Gaio uses H2O AutoML (Automatic Machine Learning) technology to create predictive models. Gaio handles the connection to the data and the data processing, delivers the training data and modeling directives to H2O AutoML, retrieves the result of the execution, and presents it in a user-friendly interface. This entire process can be automated within Gaio.
Within Gaio, the process for creating predictive models is very simple.
Click on the table with historical data to train the models
From the Tasks menu, choose AutoML
Define the name of the model that will be saved by Gaio
Define what the response variable will be
Define the time that Gaio will have to search for patterns in the data
Exclude fields that don't make sense in training, such as Customer Code
Click Train or Save. Run the task and wait for the set time.
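The field-exclusion step in the list above can be sketched in Python. This is only an illustration of the idea, not Gaio's implementation; the function name `select_predictors` and the column names are hypothetical.

```python
def select_predictors(columns, response, excluded):
    """Return the columns to train on: everything except the response
    variable and the fields the analyst chose to exclude."""
    if response not in columns:
        raise ValueError(f"response column {response!r} not found")
    skip = set(excluded) | {response}
    # Preserve the original column order.
    return [c for c in columns if c not in skip]


# Hypothetical table layout, for illustration only.
columns = ["customer_code", "age", "plan", "monthly_fee", "cancelled"]
predictors = select_predictors(
    columns, response="cancelled", excluded=["customer_code"]
)
print(predictors)
```

Identifier fields such as a customer code are unique per row, so they carry no pattern that generalizes to new customers.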
The model building interface is very simple and does not require specialized knowledge. However, it is very important that the analyst understands what happens when models are built.
Row Volume
The modeling process is often memory and processing intensive, so pay special attention to the number of rows in the table to be used. A good sample is an excellent strategy: it generally represents the entire data set well, allows more models to be created in less time, and does not overload the server. By default, Gaio limits training to 100 thousand rows. This value can be changed, but be aware of the impact; raising it is only worthwhile when the server is very large.
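The idea of capping the table at a random sample can be sketched as follows. This is a minimal sketch assuming simple uniform sampling without replacement; how Gaio actually selects the rows is not stated here, and `sample_rows` is a hypothetical helper.

```python
import random


def sample_rows(rows, limit=100_000, seed=42):
    """Return at most `limit` rows, drawn uniformly at random
    without replacement."""
    if len(rows) <= limit:
        return list(rows)
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(rows, limit)


rows = list(range(250_000))  # stand-in for the rows of a large table
sample = sample_rows(rows)
print(len(sample))
```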
The automatic modeling process uses several techniques. Each item in the following list links to the official H2O documentation:
GLM: Generalized Linear Model.
XGBoost: Optimized gradient boosting that combines multiple decision trees, with tree construction parallelized.
GBM: Gradient Boosting Machine.
DeepLearning: use of Neural Networks.
Training and validation criteria are applied. Gaio uses cross-validation to evaluate whether the models are accurate. With 5-fold cross-validation, the data is split into 5 random samples of the same size; each model is trained on four of them and validated on the remaining one, rotating until every sample has been used for validation, as shown in the image below:
The criterion for prioritizing the model is Accuracy.
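The 5-fold split described above can be sketched in plain Python. This is an illustration of the general technique, not the code H2O runs internally; `k_fold_indices` is a hypothetical helper.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, valid


splits = list(k_fold_indices(n_rows=100, k=5))
for train, valid in splits:
    print(len(train), len(valid))
```

Every row is used for validation exactly once and for training in the other four rounds, so the quality estimate does not depend on a single lucky or unlucky split.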
Both categorical (text) and numeric columns are accepted as response variables. When the response variable is numeric, Gaio always assumes that the goal is to predict the number itself, not the probability of an event occurring.
If the response variable is, for example, Service Cancellation with values 0 or 1, the values in this column must be transformed into text, for example R0 and R1. In this case the goal is to know the probability that the customer cancels (1) and the probability that the customer does not cancel (0). Because the original column is numeric, however, Gaio would assume that the intention is to predict a number, such as the amount that the customer may purchase. Different techniques are applied to the two types of response variable, and they produce different results.
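The 0/1 to R0/R1 transformation described above can be sketched as follows. The helper name `to_categorical_labels` is hypothetical; inside Gaio the same result would be produced by whatever transformation step builds the training table.

```python
def to_categorical_labels(values, prefix="R"):
    """Turn numeric class codes such as 0/1 into text labels such as
    'R0'/'R1', so the column is treated as categorical."""
    return [f"{prefix}{int(v)}" for v in values]


cancelled = [0, 1, 1, 0]  # hypothetical Service Cancellation column
labels = to_categorical_labels(cancelled)
print(labels)  # ['R0', 'R1', 'R1', 'R0']
```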
After executing the AutoML task, the results are made available in a new object in the process. Below is an example whose response variable is categorical.
A summary of the automatic model building process is generated, and the overall quality of the model is reported.
The variables are ordered by how much they impacted the model. In the example above, Age was the variable that most contributed to predicting the event, with a 57.3% contribution.
The Summary screen is shown by default when the model result is opened and provides the main information about the model chosen as the best.
The confusion matrix indicates the percentages of correct answers for each value of the categorical response variable (see image below).
The list of all models created in the predetermined time is shown with some model quality statistics.
The model hits, where the prediction matched what happened in the past, are circled in green. The red circles mark where the model made a mistake and differed from what happened in the past. In the example above, when the model says (first line) that the customer will not cancel, it is wrong 5 times, so it is 99.2% correct. However, when the model predicts that the customer will cancel, it is wrong 26 times, a 92.4% success rate. Overall, the accuracy (degree of success) is 97.3%.
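The per-line hit rates and the overall accuracy of a confusion matrix can be computed as sketched below. The counts are hypothetical and are not the ones from the example above; `confusion_summary` is an illustrative helper, not part of Gaio or H2O.

```python
def confusion_summary(matrix):
    """Return (per-row hit rates, overall accuracy) for a square
    confusion matrix.

    matrix[i][j] counts the records that fall in row class i and
    column class j; the diagonal holds the cases where both agree.
    """
    row_rates = [row[i] / sum(row) for i, row in enumerate(matrix)]
    hits = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return row_rates, hits / total


# Hypothetical counts, for illustration only.
matrix = [
    [90, 10],  # first line: 90 hits, 10 mistakes
    [20, 80],  # second line: 80 hits, 20 mistakes
]
row_rates, accuracy = confusion_summary(matrix)
print(row_rates, accuracy)
```

The overall accuracy weights each line by its number of records, so it is not the simple average of the per-line rates when the classes have different sizes.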
In this run, 16 different models were generated, which are ordered from best to worst. In the columns on the right, some model quality indicators are presented, including the AUC (Area Under the Curve) and the RMSE (Root Mean Square Error).
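To make the two indicators concrete, the sketch below computes them from their textbook definitions. It is a small illustration with hypothetical data; H2O computes these statistics itself, and its implementation may differ in details such as how ties are handled.

```python
import math


def rmse(actual, predicted):
    """Root Mean Square Error: square root of the mean squared difference."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def auc(labels, scores):
    """Area Under the (ROC) Curve, computed as the fraction of
    (positive, negative) pairs in which the positive record gets the
    higher score. Ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical values, for illustration only.
print(rmse([3, 5, 2], [2, 5, 4]))
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A lower RMSE is better, while an AUC closer to 1 is better; an AUC of 0.5 corresponds to random ranking.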