Real-time notification stops working

2018-06-28 18:59:00.867Z

Hello!

I've deployed a Talkyard instance for my community, but after a few seconds the real-time notifications stop working and I need to refresh the page.

I've checked the logs and I see a "client prematurely closed connection while sending to client" message right around the time the browser stops receiving notifications.

I tried to find out where it might be coming from, but because I don't fully understand Talkyard yet, or where to look, I couldn't.

Could you help me?

Thanks!


KajMagnus @KajMagnus
core-dev
support-team
2018-06-29 12:15:58.053Z
Hello Tiago! Thanks for including the error log message, i.e. "client prematurely ...". Apparently it's from Nchan, the Nginx module that's being used for real time notifications. I'll look into this now & during the weekend ... I think this problem happens to me too sometimes.

(Interesting that you tried to check where the message comes from. What's your background, if I may ask? Do you do software development sometimes?)
1. Tiago @Tiago
  2018-06-29 12:44:03.221Z
  Actually yes, sometimes. I am doing my Master in Machine Learning and Deep Learning.
  
  I thought about Nchan, but I'm still trying to understand how everything works together, ahah.
  KajMagnus @KajMagnus
  core-dev
  support-team
  2018-06-30 06:22:59.541Z
  Ok :- ) that sounds interesting. I did just a little bit of neural networks long ago, b.t.w. ... before Deep Learning happened.
  
  I've fixed the bug now (or so I think), and it works when I test on localhost. I'll release a new version in a few days and then live notifications should work again.
  
  (The reason for the bug is that long ago I changed from jQuery.ajax to Bliss.fetch, and didn't notice that afterwards theRequest.abort() no longer invoked the error callback in which the next long polling request got sent. ... So, after the first long polling request, no more long polling requests got sent :- P )
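To illustrate the kind of bug described above, here is a minimal sketch of a long-polling loop that does not depend on `abort()` triggering any callback. It is not Talkyard's actual code: `LongPoller`, `StartRequest`, `PendingRequest` and `restart` are hypothetical names, and the transport is assumed to call `onDone` asynchronously, at most once per request.

```typescript
// Hypothetical sketch, not Talkyard's real implementation.
interface PendingRequest {
  abort(): void;
}

// Starts one long-polling request. Assumed to call onDone later (never
// synchronously): with the message body, or with null on error or timeout.
type StartRequest = (onDone: (body: string | null) => void) => PendingRequest;

class LongPoller {
  private current: PendingRequest | null = null;
  private generation = 0;
  private running = false;

  constructor(
    private startRequest: StartRequest,
    private onMessage: (body: string) => void,
  ) {}

  start(): void {
    this.running = true;
    this.sendNext();
  }

  stop(): void {
    this.running = false;
    this.abortCurrent();
  }

  // Aborts the in-flight request and sends a new one right away, instead of
  // hoping that abort() invokes an error callback that sends the next request.
  restart(): void {
    this.abortCurrent();
    this.sendNext();
  }

  private abortCurrent(): void {
    this.generation += 1; // turns a late callback from the old request into a no-op
    const request = this.current;
    this.current = null;
    request?.abort();
  }

  private sendNext(): void {
    if (!this.running) return;
    const myGeneration = ++this.generation;
    this.current = this.startRequest((body) => {
      if (myGeneration !== this.generation) return; // stale: aborted or replaced
      this.current = null;
      if (body !== null) this.onMessage(body);
      this.sendNext(); // keep the loop alive after every completed request
    });
  }
}
```

The design choice here is that the code calling `abort()` is also responsible for sending the next request, so the loop keeps running whether or not the underlying HTTP library reports an abort as an error.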
  
  Tiago @Tiago
  2018-06-30 07:58:15.981Z
  Oh nice! It's amazing to compare the uses neural networks had back then with how the architectures have evolved and become ultra complex. You end up thinking "how the hell does this work?" ahah
  
  Oh, so it was on the client side. I would have spent a few days on that one (I was still reading about and trying to understand Nchan, ahaha).
  
  Thank you very much :D
  
  KajMagnus @KajMagnus
  core-dev
  support-team
  2018-07-20 15:25:31.319Z
  (Sorry for the late reply.) It seems the above-mentioned fix wasn't the only problem. Live notifications now work when I test on localhost, but when the server has been up and running for a while, apparently they stop working. My best guess right now is that there's a bug in Nchan. The Nchan author has been coding a lot lately and fixing bugs, and says he'll release a new version soon, like, in a week. So I'll upgrade to that new version and see if live notifications start working properly then.
  
  In reply to Tiago:
  KajMagnus @KajMagnus
  core-dev
  support-team
  2018-08-02 13:34:47.606Z
  @Tiago Turns out there's another problem too: there's a segfault (C code crash) in an Nginx worker process, from inside a Lua module. When the worker suddenly exits, Nchan's internal state gets messed up, and notifications stop working. I posted about this yesterday over at GitHub, in the Lua module repo: https://github.com/openresty/lua-nginx-module/issues/1361
  
Progress with handling this problem
@KajMagnus marked this topic as Planned 2018-08-05 11:36:54.087Z.
@KajMagnus marked this topic as Started 2018-08-05 11:36:56.353Z.
KajMagnus @KajMagnus
core-dev
support-team
2018-10-15 14:31:37.813Z
This is still a problem — i.e. the Nginx worker segfault mentioned above. Recently the Nchan author fixed a worker crash; hopefully it's the same crash. I'll upgrade to the new version of Nchan:

1.2.2 (Oct. 9 2018) ... fix (security): subscriber may erroneously receive a 400 Bad Request or crash a worker based on data from a previous subscriber
https://github.com/slact/nchan/blob/master/changelog.txt
KajMagnus @KajMagnus
core-dev
support-team
2019-02-07 18:40:57.006Z
I think this has been fixed now ... after 7 months :- P. I changed Nginx to use only 1 worker, and that reportedly avoids the problem: if a worker crashes and there's just one single worker, Nginx will somehow reset its state. I haven't noticed any problems since changing to 1 worker, almost a month ago.

1 worker is more than fast enough. Nevertheless, the long-term plan is to actually remove Nchan. In Talkyard's case, I think it's not really needed. Instead I have in mind to use Server-Sent Events and HTTP/2 directly from Play Framework.

From a GitHub issue: https://github.com/slact/nchan/issues/477#issuecomment-452848234

I wrote:

@neben I have this problem too, that Nchan in effect stops working after a worker crash, and can stay broken until the next restart, which might not be until weeks later (no live notifications until then). How do you detect a crash and send a SIGHUP? You don't happen to have a reusable script or something?

( @slact I suppose it'd be impossibly much work to do this, but anyway, there's Rust for Nginx: https://github.com/nginxinc/ngx-rust (hmm only a proof of concept though) — maybe Rust could be a way to fix all crashes once and for all ... except that ... impossibly much work to port to Rust I suppose.)

@ivanovv replied:

@kajmagnus the easiest fix for this is to have only one worker, then the nginx master process will auto restart it and everything works again. I guess that needs to go into the README, as it is a non-obvious thing and many have had prod servers stuck
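For setups that keep several workers, the crash detection asked about in the quoted question could be sketched roughly as below. This is only an illustration under assumptions, not a script from the thread: it assumes the Nginx error log reports a crashed worker with a line containing "worker process <pid> exited on signal <n>" (check your own error log for the exact wording), and `findWorkerCrashes`, `reloadIfWorkerCrashed` and `reloadNginx` are hypothetical names.

```typescript
// One crashed worker, as parsed from an Nginx error log line.
interface WorkerCrash {
  pid: number;
  signal: number;
}

// Assumed log line shape, e.g.:
//   "... [alert] 1#1: worker process 21 exited on signal 11 (core dumped)"
const CRASH_LINE = /worker process (\d+) exited on signal (\d+)/;

// Returns every worker crash mentioned in the given chunk of log text.
function findWorkerCrashes(logText: string): WorkerCrash[] {
  const crashes: WorkerCrash[] = [];
  for (const line of logText.split("\n")) {
    const match = CRASH_LINE.exec(line);
    if (match) {
      crashes.push({ pid: Number(match[1]), signal: Number(match[2]) });
    }
  }
  return crashes;
}

// Calls reloadNginx once if the new log text contains at least one crash.
// Returns true if a reload was requested.
function reloadIfWorkerCrashed(newLogText: string, reloadNginx: () => void): boolean {
  const crashed = findWorkerCrashes(newLogText).length > 0;
  if (crashed) reloadNginx();
  return crashed;
}
```

In Node.js, `reloadNginx` could be something like `() => process.kill(masterPid, "SIGHUP")`, with `masterPid` read from Nginx's pid file. Where that file lives and how often new log lines are checked depend on the deployment.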
@KajMagnus marked this topic as Done 2019-02-07 18:41:00.335Z.
