Thank you for your question.
The time and cost of training a custom speech-to-text (STT) model can vary significantly depending on several factors.
Training time increases with the amount of audio and transcript data. Regions with dedicated training hardware process about 10 hours of audio per day, while other regions handle about 1 hour per day, so training is faster where dedicated hardware is available. More complex models, or models that require more data for fine-tuning, will naturally take longer to train. While the Speech to text FAQ entry "How long does it take to train a custom model with audio data?" doesn't specifically mention adjusting epochs, reducing the number of epochs or other training parameters might help speed up training. However, this could also affect model performance.
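As a rough illustration of the throughput figures above, here is a minimal Python sketch that estimates training duration from the amount of audio. It assumes training time scales linearly with audio length and ignores model complexity, queueing, and any other overhead. The function name `estimate_training_days` and the constants are hypothetical and are not part of any Speech service API.

```python
# Approximate throughput quoted in this answer (hours of audio processed per day).
DEDICATED_HOURS_PER_DAY = 10.0  # regions with dedicated training hardware
OTHER_HOURS_PER_DAY = 1.0       # all other regions


def estimate_training_days(audio_hours: float, dedicated_hardware: bool) -> float:
    """Back-of-the-envelope estimate: audio hours divided by daily throughput.

    Assumes linear scaling; real training time also depends on model
    complexity and other factors, so treat the result as a ballpark only.
    """
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    rate = DEDICATED_HOURS_PER_DAY if dedicated_hardware else OTHER_HOURS_PER_DAY
    return audio_hours / rate


if __name__ == "__main__":
    for hours in (5, 20):
        fast = estimate_training_days(hours, dedicated_hardware=True)
        slow = estimate_training_days(hours, dedicated_hardware=False)
        print(f"{hours} h of audio: ~{fast:.1f} days (dedicated), ~{slow:.1f} days (other regions)")
```

For example, under these assumptions 20 hours of audio would take roughly 2 days in a region with dedicated hardware and roughly 20 days elsewhere, which is why choosing the region can matter more than tuning training parameters.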
I hope this helps. Thank you.