Cluster

Traditionally used in customer segmentation, cluster analysis has many other applications. Its purpose is to place very similar rows in the same group. As a basic output, a table is generated with a new column that identifies the group assigned to each row.

Gaio uses the K-Means technique to identify groups, and the analysis calculations are performed in H2O, whose documentation can be accessed here.

1. Configuration

To build a cluster analysis, click the table that will be used, open the Tasks menu and choose Cluster.

  1. Set the task name.

  2. Define the name of the table that will be generated from the execution.

  3. Exclude the fields that should not be used to form the groups (clusters).

  4. Determine the maximum time for identifying groups.

  5. As for the number of groups, there are two options. The first is to let the platform identify on its own how many clusters make the most sense for the data used. The goal is to place similar rows in the same cluster. Identical rows are easy to group together; the difficulty begins when different rows have to share a group. As this happens, the "error" increases, and the platform looks for the most homogeneous groups possible without generating a large number of clusters.

  6. The second option is for you, as an analyst, to determine how many groups should be generated. For example, if the company is unable to build differentiated value propositions for more than 5 groups of customers, it may make sense to define 5 clusters.
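The trade-off described in steps 5 and 6 can be illustrated with a minimal, pure-Python K-Means sketch. It is not Gaio's or H2O's implementation: the sample rows, the function names `kmeans` and `suggest_k`, the farthest-point initialisation and the 5% tolerance rule are all assumptions made for illustration. It shows how the "error" (the sum of squared distances from each row to the centre of its group) shrinks as more groups are allowed, and one simple way to stop before the number of clusters grows too large.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain K-Means; returns (labels, error) for the given number of groups."""
    # Deterministic initialisation (an assumption of this sketch): start from
    # the first row, then keep adding the row farthest from the chosen centres.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def nearest(p):
        # Index of the centre closest to row p.
        return min(range(k), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(iterations):
        # Assignment step: every row joins the group of its nearest centre.
        labels = [nearest(p) for p in points]
        # Update step: every centre moves to the mean of its rows.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = tuple(sum(col) / len(members) for col in zip(*members))

    labels = [nearest(p) for p in points]
    # "Error": sum of squared distances between each row and its group centre.
    error = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, error

def suggest_k(errors, tolerance=0.05):
    """Hypothetical rule: smallest k whose error is at most 5% of the k=1 error."""
    baseline = errors[1]
    return min(
        (k for k, err in errors.items() if err <= tolerance * baseline),
        default=max(errors),
    )

# Hypothetical rows with two numeric fields, forming three visible groups.
rows = [
    (1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
    (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
    (9.0, 1.0), (9.1, 1.2), (8.9, 0.9),
]

errors = {k: kmeans(rows, k)[1] for k in range(1, 5)}
suggested = suggest_k(errors)
labels, _ = kmeans(rows, suggested)

for k, err in errors.items():
    print(f"k={k}: error={err:.3f}")
print("suggested number of groups:", suggested)
```

Fixing the number of groups yourself, as in step 6, corresponds to calling `kmeans(rows, 5)` directly instead of relying on `suggest_k`.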

2. Results

As a result, the clusterPredict column will signal which group each row belongs to, in addition to repeating all columns from the source table.

To understand the differences between the groups created, descriptive statistical analyses should be carried out for the numeric and categorical columns, such as:

  1. Numerical: compare the groups using means, minimums, maximums and standard deviations to understand, for example, which groups have higher average salaries.

  2. Categorical: bar charts comparing the clusters across a categorical variable, showing, for example, which cluster has a greater concentration of men and which has a greater concentration of women.
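If the resulting table is exported for analysis outside the platform, the two comparisons above could be sketched as follows. This is only an illustration using the Python standard library: the `salary` and `gender` columns, the sample values and the helper names `profile_numeric` and `profile_categorical` are hypothetical, and only the `clusterPredict` column name comes from the task output.

```python
from collections import Counter, defaultdict
from statistics import mean, stdev

# Hypothetical output rows: source columns plus the clusterPredict column.
rows = [
    {"salary": 3000, "gender": "F", "clusterPredict": 0},
    {"salary": 3200, "gender": "F", "clusterPredict": 0},
    {"salary": 2800, "gender": "M", "clusterPredict": 0},
    {"salary": 8000, "gender": "M", "clusterPredict": 1},
    {"salary": 9000, "gender": "M", "clusterPredict": 1},
    {"salary": 10000, "gender": "F", "clusterPredict": 1},
]

def profile_numeric(rows, column, cluster_column="clusterPredict"):
    """Mean, minimum, maximum and standard deviation of a numeric column per cluster."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[cluster_column]].append(row[column])
    return {
        cluster: {
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
            # The sample standard deviation needs at least two values.
            "std": stdev(values) if len(values) > 1 else 0.0,
        }
        for cluster, values in sorted(groups.items())
    }

def profile_categorical(rows, column, cluster_column="clusterPredict"):
    """Share of each category inside each cluster (the numbers behind a bar chart)."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[cluster_column]][row[column]] += 1
    return {
        cluster: {value: n / sum(counter.values()) for value, n in counter.items()}
        for cluster, counter in sorted(counts.items())
    }

salary_profile = profile_numeric(rows, "salary")
gender_profile = profile_categorical(rows, "gender")

for cluster, stats in salary_profile.items():
    print(cluster, stats)
for cluster, shares in gender_profile.items():
    print(cluster, shares)
```

In this made-up sample, cluster 1 has the higher average salary and the larger share of men, which is the kind of contrast the comparisons are meant to reveal.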

Below are some comparison examples.
