This task generates a new table containing a random sample of the source table. Sampling is widely used when building predictive models, because applying computationally heavy techniques, such as neural networks, to a large data set is expensive. Working on the full data set is also inefficient: with less data, it is possible to run more techniques with more parameterizations, and therefore to find a better model. Furthermore, a good sample is enough to understand the universe under study.

There are two ways to define the size of the sample:

  1. Choose a percentage of rows from the source table.

  2. Choose a specific number of rows that will be in the generated table.
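The two alternatives above can be sketched outside the tool in plain Python. This is only an illustration of the idea, not the task's actual implementation: the function names `sample_by_percentage` and `sample_by_count`, the list-of-dicts table, and the rounding rule for the percentage are all assumptions made for the example.

```python
import random


def sample_by_percentage(rows, percent, rng=None):
    """Return a random sample holding about `percent` % of `rows`.

    Rounding to the nearest whole row is an assumption of this sketch.
    """
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    if rng is None:
        rng = random.Random()
    n = round(len(rows) * percent / 100)
    # random.sample draws without replacement, so no row is repeated.
    return rng.sample(rows, n)


def sample_by_count(rows, n, rng=None):
    """Return a random sample of `n` rows, capped at the table size."""
    if n < 0:
        raise ValueError("n must not be negative")
    if rng is None:
        rng = random.Random()
    return rng.sample(rows, min(n, len(rows)))


# Hypothetical source table: one dict per row.
source = [{"id": i, "score": i * 10} for i in range(1, 1001)]

ten_percent = sample_by_percentage(source, 10)
fixed_200 = sample_by_count(source, 200)
print(len(ten_percent), len(fixed_200))  # 100 200
```

Both functions return whole rows, so only the row count changes. Calling either one again without a fixed `rng` draws a different set of rows.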

Every time this task is run again, a new random set of data will be generated.

In many cases, the goal is to generate a random sample once and work with it for a longer period. If so, generate the table and then immediately delete the Sample task, so that the random table cannot be generated again.
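The same "draw once, then keep" idea can be sketched in plain Python by writing the sample to a file and reusing that file on later runs. This is a hedged analogy only: the helper `load_or_create_sample`, the CSV storage, and the file name are hypothetical and are not part of the tool.

```python
import csv
import random
import tempfile
from pathlib import Path


def load_or_create_sample(source_rows, n, path):
    """Draw the sample only if `path` does not exist yet, then read it back.

    Rows are always returned as read from the CSV file, so every value
    comes back as a string.
    """
    if not source_rows:
        raise ValueError("source_rows must not be empty")
    path = Path(path)
    if not path.exists():
        # First run: draw the random rows and store them.
        sample = random.sample(source_rows, min(n, len(source_rows)))
        with path.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=list(source_rows[0].keys()))
            writer.writeheader()
            writer.writerows(sample)
    # Every run, including the first, reads the stored sample.
    with path.open(newline="") as f:
        return list(csv.DictReader(f))


# Hypothetical source table: one dict per row.
source = [{"id": i, "score": i * 10} for i in range(1, 1001)]

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "frozen_sample.csv"
    first = load_or_create_sample(source, 50, target)
    second = load_or_create_sample(source, 50, target)
    print(first == second)  # True
```

The second call finds the stored file and does not draw again, which mirrors the effect of deleting the Sample task after its first run.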

All columns from the source table will be present in the random table. Only the number of rows will be smaller.
