Most of us have experienced getting different results from successive runs of the same algorithm on the same data.

This can occur because of randomness in the order in which the data is exposed to the model, in how cross-validation folds are drawn, and in the initial weight values, as well as changes to configuration parameters such as the number of hidden layers.

In simple terms, reproducibility means being able to obtain the same final model by rerunning the experiment with the same parameters on the same dataset. The sources of variation mentioned above make this difficult to achieve.

By fixing the random number generator’s seed before constructing each model, we can achieve reproducibility.
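As an illustration, here is a minimal sketch in Python, assuming NumPy is the numerical library in use (the `set_seed` helper and the seed value 42 are our own choices, not a standard API):

```python
import random

import numpy as np


def set_seed(seed: int = 42) -> None:
    """Fix the seeds of the common random number generators so that
    weight initialization and data shuffling are repeatable."""
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's global RNG
    # Deep learning frameworks keep their own generators and must be
    # seeded separately, e.g. torch.manual_seed(seed) in PyTorch or
    # tf.random.set_seed(seed) in TensorFlow.


set_seed(42)
print(np.random.rand(3))  # prints the same three numbers on every run
```

Calling such a helper once, immediately before each model is built, keeps all downstream random operations deterministic.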

When running experiments, we frequently split the data into training, validation, and test sets. When comparing different hyper-parameters or algorithms, the data split should be identical across all evaluations being compared, since we want to measure the differences between the parameters of interest, not the effect of different data splits. It is worth noting that in some circumstances the data has time dependencies. In that case, a time-dependent split should be used rather than a random one.
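A brief sketch of both kinds of split using scikit-learn (the toy data and the `random_state` value are illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

X = np.arange(100).reshape(50, 2)  # toy feature matrix
y = np.arange(50)                  # toy targets

# A fixed random_state means every evaluation sees the identical split,
# so score differences come from the models, not from the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# For time-dependent data, split along the time axis instead: each
# training fold contains only samples that precede its test fold.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    pass  # train on X[train_idx], evaluate on X[test_idx]
```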

We should record the random seed so that there is no unexplained variation in the results when reproducing the experiments.

It is also a good idea to keep track of the machine learning algorithm's parameter configuration if you want your results to be replicable, either by yourself or by others who may be interested in your findings.
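One lightweight way to do this is to persist the configuration alongside the trained model, sketched below with only the standard library (the file name and the parameter dictionary are hypothetical):

```python
import json
from datetime import datetime, timezone

# Hypothetical run configuration: everything needed to rerun the experiment.
config = {
    "algorithm": "random_forest",
    "n_estimators": 200,
    "max_depth": 8,
    "random_seed": 42,
    "test_size": 0.2,
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

# Save the configuration next to the model artifacts so that anyone,
# including your future self, can reproduce the run exactly.
with open("experiment_config.json", "w") as f:
    json.dump(config, f, indent=2)
```

Dedicated experiment-tracking tools offer the same capability with more structure, but even a plain JSON file per run goes a long way.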
