A Python question regarding large file transfers over HTTP 

I'm working on a project that involves retrieving large (~2-8 GB) .zip files over HTTP and storing them for later processing. I've written a script that uses an API to look up and generate URLs for a series of needed files, and then attempts to stream each file to storage using requests.get() with iter_content().
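
For reference, the download step looks roughly like this (a minimal sketch, not the exact script; `url` and `dest` here stand in for the values the API lookup produces, and the full cleaned-up code is linked further down):

    import requests

    def download(url, dest, chunk_size=1024 * 1024):
        # Stream the response to disk in 1 MiB chunks so the whole
        # multi-gigabyte file never has to fit in memory.
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(dest, "wb") as fh:
                for chunk in resp.iter_content(chunk_size=chunk_size):
                    if chunk:  # skip keep-alive chunks
                        fh.write(chunk)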

The problem is, my connection isn't perfectly stable (and I'm running this on a laptop which sometimes goes to sleep). When the connection is interrupted, the transfer dies and I need to restart it.

What would be the best way to add resume capability to my file transfer, so that if the script stalls or the connection drops, the download can pick up where it left off?

Here's a cleaned version of what I currently have: pastebin.com/Tpgqrvdi

@spinflip Did you get the resume working? It might be worth looking into Twisted and the ReconnectingClientFactory.

@drewfer Kind of: it worked for one file, then failed badly on another; it was still downloading and writing when the on-disk file was twice the size of the file being retrieved from the server...

@spinflip can you share the code?

@spinflip The tqdm_notebook wrapper is probably calling the request multiple times with the same Range header. You're going to have to build an iterator around iter_content() that updates the Range header whenever the underlying connection closes, and then pass that to tqdm_notebook.
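
Something like this rough sketch is the shape I mean (untested against your API; resumable_chunks is a made-up name and the chunk size, timeout, and retry limit are arbitrary):

    import requests

    def resumable_chunks(url, chunk_size=1024 * 1024, max_retries=10):
        # Yield the response body chunk by chunk; if the connection drops,
        # re-issue the request with a Range header starting at the first
        # byte not yet yielded, so nothing gets written to disk twice.
        received = 0
        retries = 0
        while True:
            headers = {"Range": "bytes=%d-" % received} if received else {}
            try:
                with requests.get(url, headers=headers, stream=True, timeout=60) as resp:
                    resp.raise_for_status()
                    if received and resp.status_code != 206:
                        # Server ignored the Range header and is resending the
                        # whole file; bail out rather than corrupt the output.
                        raise RuntimeError("server does not support range requests")
                    for chunk in resp.iter_content(chunk_size=chunk_size):
                        if chunk:
                            received += len(chunk)
                            yield chunk
                return  # stream ended cleanly
            except (requests.exceptions.ConnectionError,
                    requests.exceptions.ChunkedEncodingError,
                    requests.exceptions.ReadTimeout):
                retries += 1
                if retries > max_retries:
                    raise

    # Then wrap the generator with the progress bar and write as usual:
    # with open(dest, "wb") as fh:
    #     for chunk in tqdm_notebook(resumable_chunks(url)):
    #         fh.write(chunk)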
