@jonny In auto-pi-lot, do you ever run into a situation where a zmq request is not replied to and things get stuck? I took a look in node.py and couldn’t tell if you do some type of timeout and message handling.

I’m asking because I’m thinking of rewriting part of our infrastructure to use REQ/REP zmq calls (these are currently local function calls; this would be the first step in moving away from Matlab).
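One common way to keep a request/reply client from getting stuck is to wait for the reply with a timeout and, if nothing arrives, throw the connection away and retry on a fresh one (the ZeroMQ guide calls this the "Lazy Pirate" pattern). The sketch below shows only the retry logic using the standard library; `Channel`, `open_channel`, and `ReplyTimeout` are hypothetical names, not part of zmq or autopilot. With pyzmq, a channel would presumably wrap a `REQ` socket whose `recv` first checks `socket.poll(timeout_ms)`.

```python
from typing import Callable, Protocol


class ReplyTimeout(Exception):
    """Raised when no reply arrives within the timeout."""


class Channel(Protocol):
    # Hypothetical interface standing in for a request socket.
    def send(self, payload: bytes) -> None: ...
    def recv(self, timeout: float) -> bytes: ...  # raises ReplyTimeout
    def close(self) -> None: ...


def request_with_retry(open_channel: Callable[[], Channel], payload: bytes,
                       timeout: float = 0.5, retries: int = 3) -> bytes:
    """Send a request, retrying on a new channel each time the reply times out."""
    for _ in range(retries):
        # A fresh channel per attempt: a request socket that never got its
        # reply cannot simply send again, so the old one is discarded.
        channel = open_channel()
        try:
            channel.send(payload)
            return channel.recv(timeout)
        except ReplyTimeout:
            pass  # fall through to the next attempt
        finally:
            channel.close()
    raise ReplyTimeout(f"no reply after {retries} attempts")
```

The caller still has to decide what to do when all retries fail, and the server has to tolerate receiving the same request more than once.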

@jerlich
Oh yes, tried many strategies. I don't think I got it quite right. To some degree, needing to rewrite a lot of how behavior is built around messaging made me want to spend the time to make a solid networking layer first and then return to autopilot, which is what I'm doing now.

Under normal conditions, when used across a local network as is typical for autopilot, message loss is very rare; I almost never see it. All messaging is async with dealers/routers and tornado, and message handling is callback driven, with each object expected to be able to handle itself without input. So the problem from dropped messages wouldn't be getting stuck, but missing some state transition.

After using them a bunch, nodes worked best as "UDP-like" things that don't try to guarantee reliability themselves (though something else can check for message confirmation by watching the outbox). I split reliability out into a separate station process that just handles communication between raspis, so nodes also don't have to handle routing. Nodes hand off messages to the station as quickly as their queues can clear them, and stations handle resending, heartbeating, etc.
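The "watch the outbox and resend" idea can be sketched roughly as follows. This is a minimal illustration with the standard library, not autopilot's actual station code: the `Outbox` class, its method names, and the retry parameters are all assumptions, and the transport is just a `send` callable supplied by the caller.

```python
import time
import uuid
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Pending:
    payload: bytes
    sent_at: float
    attempts: int = 1


class Outbox:
    """Hypothetical reliability layer: remembers messages until confirmed."""

    def __init__(self, send: Callable[[str, bytes], None],
                 retry_after: float = 0.5, max_attempts: int = 5):
        self._send = send              # fire-and-forget transport ("UDP-like")
        self.retry_after = retry_after
        self.max_attempts = max_attempts
        self.pending: Dict[str, Pending] = {}

    def post(self, payload: bytes, now: Optional[float] = None) -> str:
        """Send a message and keep it in the outbox until it is confirmed."""
        now = time.monotonic() if now is None else now
        msg_id = uuid.uuid4().hex
        self.pending[msg_id] = Pending(payload, now)
        self._send(msg_id, payload)
        return msg_id

    def confirm(self, msg_id: str) -> None:
        """Called when the receiver acknowledges a message."""
        self.pending.pop(msg_id, None)

    def resend_due(self, now: Optional[float] = None) -> List[str]:
        """Resend overdue messages; return ids given up on after max_attempts."""
        now = time.monotonic() if now is None else now
        dropped = []
        for msg_id, p in list(self.pending.items()):
            if now - p.sent_at < self.retry_after:
                continue
            if p.attempts >= self.max_attempts:
                dropped.append(msg_id)
                del self.pending[msg_id]
                continue
            p.attempts += 1
            p.sent_at = now
            self._send(msg_id, p.payload)
        return dropped
```

A station-like process would call `resend_due()` periodically and `confirm()` whenever an acknowledgement comes back, so the sending node itself never blocks waiting for a reply.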

You can see in the docs that I wasn't all too happy with that. The next thing I was going to do was simplify that topology so reliability could be handled more granularly, but then I decided it was worth spending more time on rather than continuing to hack at the same design.

@jonny Interesting. Currently we are using pub/sub, so I don't have to worry about getting out of sync in a req/rep loop. But for some reason Matlab/jeromq does miss messages sometimes, and our current hack is to just send messages meant for Matlab a few times. I'm not really happy with that and would like something more robust. It's low priority as things are working...
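If messages are sent several times, the receiver needs some way to act on each one only once. One possible approach, sketched here under the assumption that every message carries a unique id (the `DedupReceiver` class and the `msg_id` field are hypothetical, not something from the setup described above), is to remember recently seen ids and drop repeats:

```python
from collections import OrderedDict
from typing import Any, Callable


class DedupReceiver:
    """Hypothetical receiver that handles each message id at most once."""

    def __init__(self, handler: Callable[[Any], None], max_remembered: int = 1024):
        self._handler = handler
        self._seen: "OrderedDict[str, None]" = OrderedDict()
        self._max = max_remembered

    def receive(self, msg_id: str, payload: Any) -> bool:
        """Return True if the message was handled, False if it was a repeat."""
        if msg_id in self._seen:
            return False
        self._seen[msg_id] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # forget the oldest id
        self._handler(payload)
        return True


handled = []
rx = DedupReceiver(handled.append)
for _ in range(3):                  # the sender repeats the same message
    rx.receive("m1", "reward")
rx.receive("m2", "timeout")
```

The memory of seen ids is bounded, so a repeat that arrives after more than `max_remembered` newer messages would be handled again.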

@jerlich
One of the little unreasonably effective tricks in autopilot that makes it work better than it has any right to is that the (tornado) messaging handler is actually at the root of the running process, and most of the program logic happens in threads spawned from message callbacks. That pattern seems to keep messaging highly available. Dealers/routers also seem to have a more reliable implementation than pub/sub, which I had tried early on but found was more effective for really broad fanouts and not so great for normal messaging.
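The shape of that pattern, with the message loop at the root and the work pushed into threads, might look something like the sketch below. It swaps tornado and zmq for a plain `queue.Queue` so it runs with only the standard library; the `Node` class here, its `on` method, and the `"STOP"` key are illustrative assumptions rather than autopilot's real API.

```python
import queue
import threading
from typing import Any, Callable, Dict, List


class Node:
    """Hypothetical node: the message loop is the root of the process."""

    def __init__(self) -> None:
        self.inbox: "queue.Queue[tuple]" = queue.Queue()
        self.callbacks: Dict[str, Callable[[Any], None]] = {}
        self.workers: List[threading.Thread] = []

    def on(self, key: str, callback: Callable[[Any], None]) -> None:
        """Register the callback that handles messages with this key."""
        self.callbacks[key] = callback

    def run(self) -> None:
        """Dispatch messages until a STOP message arrives."""
        while True:
            key, value = self.inbox.get()
            if key == "STOP":
                break
            callback = self.callbacks.get(key)
            if callback is None:
                continue  # unknown message type: ignore it
            # Run the callback in its own thread so a slow handler
            # never keeps the loop from picking up the next message.
            worker = threading.Thread(target=callback, args=(value,), daemon=True)
            worker.start()
            self.workers.append(worker)
        for worker in self.workers:
            worker.join()
```

Because `run()` only dequeues and dispatches, a long-running trial handler does not delay the handling of a later message; the cost is that callbacks sharing state have to be thread-safe.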

@jerlich
Not sure how easy it is to do threading/processes in MATLAB, but yeah, having something that can just wait on messages is a pretty good idea.
