How a TF-IDF/NLP indexer for 1,000+ multimedia files went from 30 seconds to 1.5 seconds on a single GPU. Batch shape mattered more than batch size, torch.compile earned its keep for a reason I didn't expect, and I burned three engineer-days chasing the last 10% before I gave up on it.
GPU, CUDA, PyTorch, Performance