The batch size defines the number of samples that propagate through the neural network before the model parameters are updated. Each batch of samples goes through one full forward pass and one full backward pass.
Suppose you have 3,000 training samples and set the batch size to 128. The algorithm first trains the network on the first 128 samples from the training dataset, then on the next 128 samples, and repeats this process until all samples have been propagated through the network. Because 3,000 is not evenly divisible by 128, that gives 23 full batches plus a final batch of only 56 samples.
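As a minimal sketch of that split (plain Python, using the sample count and batch size from the example above), the batches would look like this:

```python
# Minimal sketch: how 3,000 samples break into batches of 128.
num_samples = 3000
batch_size = 128

# Start index of each batch; the final batch may be smaller than the rest.
batch_starts = range(0, num_samples, batch_size)
batch_sizes = [min(batch_size, num_samples - start) for start in batch_starts]

print(len(batch_sizes))   # 24 batches per epoch
print(batch_sizes[-1])    # 56 samples in the final batch
```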
When the final batch ends up with fewer samples than the other batches, you can either drop those leftover samples or keep a smaller last batch.
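In PyTorch, for example, both options are covered by the DataLoader's drop_last flag; the random dataset below is only a stand-in for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset of 3,000 random samples, just for illustration.
dataset = TensorDataset(torch.randn(3000, 10), torch.randint(0, 2, (3000,)))

# drop_last=True skips the incomplete final batch;
# drop_last=False (the default) keeps it as a smaller batch of 56 samples.
loader = DataLoader(dataset, batch_size=128, shuffle=True, drop_last=True)
print(len(loader))  # 23 batches; the 56 leftover samples are dropped
```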
Finding the right batch size for your use case depends heavily on the number of images you use for training and on the diversity of your data. Depending on your hardware (RAM and GPU memory), you may also be limited to small batch sizes.
Smaller batches mean that each gradient-descent (optimizer) step is computed from fewer samples and may therefore be less accurate, so the algorithm might take longer to converge. However, it has been observed that the model's quality, as measured by its ability to generalize, can suffer significantly with very large batches. If you fit the entire dataset in a single batch and update the weights once per pass, the model may do well on the training data but generalize poorly to unseen data.
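The trade-off can be made concrete with a small sketch of mini-batch gradient descent for linear regression (NumPy, with made-up data and a fixed learning rate as assumptions); setting batch_size to the full dataset turns it into full-batch gradient descent with a single, exact update per epoch:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 10))                              # made-up features
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=3000)    # made-up targets

def train(batch_size, epochs=5, lr=0.01):
    w = np.zeros(10)
    for _ in range(epochs):
        order = rng.permutation(len(X))                # reshuffle every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            # Gradient of the mean squared error on this batch only.
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad                             # one update per batch
    return w

w_noisy = train(batch_size=32)     # many small, noisy updates per epoch
w_exact = train(batch_size=3000)   # one exact, full-batch update per epoch
```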
It is generally good practice to increase the batch size until you saturate your GPU's memory. However, like everything else in training, the batch size is merely a hyperparameter that needs tuning.
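One rough way to find that point is to probe increasing batch sizes until the GPU runs out of memory. The sketch below assumes a small placeholder model in PyTorch; swap in your own network and input shape:

```python
import torch
import torch.nn as nn

# Placeholder model and input size; substitute your own network.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)

def fits_in_memory(batch_size):
    """Run one forward/backward pass and report whether it fits on the GPU."""
    try:
        x = torch.randn(batch_size, 1024, device=device)
        model(x).sum().backward()
        return True
    except RuntimeError as err:            # CUDA reports OOM as a RuntimeError
        if "out of memory" not in str(err):
            raise
        torch.cuda.empty_cache()
        return False

batch_size = 32
while batch_size < 4096 and fits_in_memory(batch_size * 2):
    batch_size *= 2                        # double until the next size no longer fits
print(batch_size)
```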
"For the most part, a batch size of 32 is a decent starting point, but you can also experiment with 64, 128, and 256. Other values (lower or higher) may be appropriate for certain data sets. Again, it completely depends on the problem, but this specified range is usually the best to begin experimenting with."