A Python question regarding large file transfers over HTTP
I'm working on a project that involves retrieving large (~2-8 GB) .zip files over HTTP and storing them for later processing. I've written a script that uses an API to look up and generate URLs for a series of needed files, then attempts to stream each file to storage using requests.get().iter_content.
The problem is that my connection isn't perfectly stable (and I'm running this on a laptop that sometimes goes to sleep). When the connection is interrupted, the transfer dies and I have to restart it.
What would be the best way to add resume capability to my file transfer, so that if the script stalls or the connection drops, the download can pick up from where it failed?
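A simplified sketch of the kind of streaming loop described above (url, dest_path, and the chunk size are placeholders, not the actual script):

    import requests

    def download(url, dest_path, chunk_size=1024 * 1024):
        # Stream the response to disk in chunks instead of loading it into memory.
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(dest_path, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=chunk_size):
                    if chunk:  # skip keep-alive chunks
                        fh.write(chunk)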
A Python question regarding large file transfers over HTTP
@spinflip Did you get the resume working? It might be worth looking into Twisted and the ReconnectingClientFactory.
A Python question regarding large file transfers over HTTP
@drewfer Kind of: it worked for one file, but then failed badly on another; it was still downloading and writing even when the file on disk was twice the size of the file being retrieved from the server...
A Python question regarding large file transfers over HTTP
@spinflip can you share the code?
A Python question regarding large file transfers over HTTP
@drewfer Sure! here: https://pastebin.com/JmNG2s7B
A Python question regarding large file transfers over HTTP
@spinflip The tqdm_notebook is probably calling the request multiple times with the same Range header. You're going to have to build an iterator around iter_content() that updates the Range header when the underlying connection closes, and then pass that to the notebook.
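A rough sketch of the kind of wrapper described above (the chunk size, the retry-forever behaviour, and the error handling are assumptions, and the server has to honour Range requests):

    import requests

    def resumable_chunks(url, start=0, chunk_size=1024 * 1024):
        # Yield the body of `url` in chunks, reissuing the request with an
        # updated Range header whenever the connection drops mid-transfer.
        pos = start
        while True:
            with requests.get(url, headers={"Range": f"bytes={pos}-"},
                              stream=True, timeout=60) as resp:
                if resp.status_code == 416:  # range starts past the end: nothing left
                    return
                resp.raise_for_status()
                try:
                    for chunk in resp.iter_content(chunk_size=chunk_size):
                        if chunk:
                            pos += len(chunk)
                            yield chunk
                except requests.exceptions.RequestException:
                    continue  # dropped connection: reissue from the current offset
            return  # iterator finished cleanly: the file is complete

A generator like this could then be handed to tqdm_notebook in place of the bare iter_content() iterator.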
A Python question regarding large file transfers over HTTP
@spinflip Hi, I don't have a direct answer to your question; I've never tried to do this before. However, the problem makes me think of mosh ( https://mosh.org/ ), an ssh alternative specifically developed for intermittent connections, and shoop, an scp alternative. Perhaps these could be of use if the normal HTTP method turns out to be difficult.
A Python question regarding large file transfers over HTTP
@spinflip HTTP/1.1 has a Range header. More info here - https://stackoverflow.com/questions/22894211/how-to-resume-file-download-in-python
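A minimal sketch of resuming via the Range header, assuming the server supports it (resume_download and its parameters are illustrative names, and the partial file already on disk is assumed to be intact):

    import os
    import requests

    def resume_download(url, dest_path, chunk_size=1024 * 1024):
        # Ask only for the bytes we don't already have on disk.
        have = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
        headers = {"Range": f"bytes={have}-"} if have else {}
        with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            # 206 Partial Content means the Range header was honoured;
            # a 200 means the server is sending the whole file again.
            mode = "ab" if resp.status_code == 206 else "wb"
            with open(dest_path, mode) as fh:
                for chunk in resp.iter_content(chunk_size=chunk_size):
                    if chunk:
                        fh.write(chunk)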
A Python question regarding large file transfers over HTTP
@drewfer Huh, I think that might be working... Fingers crossed, and I'll find out in a few GB. Thank you!
A Python question regarding large file transfers over HTTP
This is where I shell out to wget, which has a 'continue' (-c) flag.
There might be a Python library that does the same, but I don't know of one.
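A rough sketch of shelling out to wget from Python (the --tries flag and the directory handling are assumptions on top of the -c flag mentioned above):

    import subprocess

    def wget_download(url, dest_dir):
        # -c resumes a partial file, --tries=0 retries indefinitely,
        # -P sets the download directory.
        subprocess.run(["wget", "-c", "--tries=0", "-P", dest_dir, url], check=True)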
A Python question regarding large file transfers over HTTP
Here's a cleaned version of what I currently have: https://pastebin.com/Tpgqrvdi