A Python question regarding large file transfers over HTTP
I'm working on a project that involves retrieving large (~2-8 GB) .zip files over HTTP and storing them for later processing. I've written a script that uses an API to look up and generate URLs for a series of needed files, and then streams each file to disk with requests.get() and iter_content().
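The download loop is roughly this shape (simplified, with made-up names; the real script is in the pastebin linked below):

```python
import requests

def download(url, dest_path, chunk_size=1024 * 1024):
    # Stream the response so the multi-GB body is never held in memory at once.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest_path, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                if chunk:  # skip keep-alive chunks
                    fh.write(chunk)
```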
The problem is, my connection isn't perfectly stable (and I'm running this on a laptop which sometimes goes to sleep). When the connection is interrupted, the transfer dies and I need to restart it.
What would be the best way to add resume capability to my file transfer, so that if the script stalls or the connection drops I can pick the download back up from where it failed?
Here's a cleaned version of what I currently have: https://pastebin.com/Tpgqrvdi
@spinflip Did you get the resume working? It might be worth looking into Twisted and the ReconnectingClientFactory.
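The core reconnect pattern looks something like this (a bare-bones sketch at the raw TCP level; the host, port, and filename are placeholders, and a real download would still have to speak HTTP and send Range headers itself):

```python
from twisted.internet import reactor
from twisted.internet.protocol import Protocol, ReconnectingClientFactory

class DownloadProtocol(Protocol):
    def dataReceived(self, data):
        # Append whatever bytes arrive to the output file.
        self.factory.outfile.write(data)

class DownloadFactory(ReconnectingClientFactory):
    def __init__(self, outfile):
        self.outfile = outfile

    def buildProtocol(self, addr):
        self.resetDelay()  # reset the reconnect backoff once a connection succeeds
        proto = DownloadProtocol()
        proto.factory = self
        return proto

    def clientConnectionLost(self, connector, reason):
        # The parent class schedules a reconnect with exponential backoff.
        ReconnectingClientFactory.clientConnectionLost(self, connector, reason)

# Placeholder endpoint and output file.
reactor.connectTCP("example.com", 80, DownloadFactory(open("out.zip", "wb")))
reactor.run()
```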
@drewfer kind of: it worked for one file, but then failed badly on another; it was still downloading and writing when the on-disk file was twice the size of the file being retrieved from the server...
@spinflip can you share the code?
@spinflip tqdm_notebook is probably calling the request multiple times with the same Range header. You're going to have to build an iterator around iter_content() that updates the Range header whenever the underlying connection closes, and then pass that iterator to the notebook.
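Something along these lines, as an untested sketch: it assumes the server reports Content-Length and honours Range requests, and the chunk size and retry limit are arbitrary. (tqdm.notebook.tqdm is the newer import path for tqdm_notebook.)

```python
import os
import requests
from tqdm.notebook import tqdm

def resumable_download(url, dest_path, chunk_size=1024 * 1024, max_retries=10):
    # Get the total size from a HEAD request so the progress bar is accurate.
    total = int(requests.head(url, allow_redirects=True).headers["Content-Length"])
    pos = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0

    with tqdm(total=total, initial=pos, unit="B", unit_scale=True) as bar, \
         open(dest_path, "ab") as fh:
        retries = 0
        while pos < total and retries <= max_retries:
            try:
                # Re-issue the request from the current offset instead of byte 0.
                resp = requests.get(url, headers={"Range": "bytes=%d-" % pos},
                                    stream=True, timeout=60)
                with resp:
                    resp.raise_for_status()
                    if pos and resp.status_code != 206:
                        # A 200 here means the server ignored Range and is resending
                        # from byte 0, which is how a file ends up bigger than the original.
                        raise RuntimeError("server ignored the Range header")
                    for chunk in resp.iter_content(chunk_size=chunk_size):
                        fh.write(chunk)
                        pos += len(chunk)
                        bar.update(len(chunk))
            except requests.exceptions.RequestException:
                retries += 1  # dropped connection; loop around and resume from pos
```

Because the file is opened in append mode and the Range offset is keyed off the on-disk size, re-running the script after a crash should also pick up where it left off.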