Someone on IRC claims we are leaking file descriptors #7
Hmm, seems likely to be a network-transport-tcp thing actually. I'm pretty certain this library isn't doing anything wrong. Will discuss with @agentm what they're seeing and close or update accordingly.
To be fair, I suspect that I am at fault for the leaking descriptors and am trying to figure out how to resolve it. In our case, we have a server process proxying access to a d-p-client-server process. I understand the philosophy of retaining connections, but, in this client-server case, the lifetime for connect/disconnect is clear: the websocket disconnects, so the proxied connection should be disconnected as well. Currently, that's not happening.

If I add the reconnect call to the server-side handler for logging out (via handleCall a la this), then the server process dies and gets unregistered. I suppose I should be catching the exception, but I'm not sure where that makes sense or how to recover on the server side, since I do want the client side to receive the killed exception. Is this a case for monitoring? Is there an example I could follow?

On a side note, I assumed that by using the same remote NodeId to connect (returned from whereisRemote), the same socket would be re-used for all connections, but that appears not to be the case. I'm not sure what I am missing there.
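As a point of reference for the monitoring question above, here is a minimal, hedged sketch using the plain distributed-process monitor API. The `watchClient` name and the `cleanup` argument are hypothetical, not taken from this codebase:

```haskell
module MonitorSketch where

-- Hedged sketch: watch another process and run a caller-supplied cleanup
-- action when it terminates, whatever the exit reason turns out to be.

import Control.Distributed.Process

watchClient :: ProcessId -> Process () -> Process ()
watchClient clientPid cleanup = do
  ref <- monitor clientPid
  -- Block until the monitor notification for this particular client arrives.
  receiveWait
    [ matchIf (\(ProcessMonitorNotification ref' _ _) -> ref' == ref)
              (\(ProcessMonitorNotification _ _ reason) -> do
                  say ("client terminated: " ++ show reason)
                  cleanup)
    ]
```

Spawning this watcher locally, e.g. `spawnLocal (watchClient clientPid closeSession)` (where `closeSession` stands in for whatever teardown the proxy needs), would run the cleanup even when the client is killed rather than logging out cleanly.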
Hi @agentm. First of all, thanks for using the library - sorry I've been AWOL and not maintaining it; I am getting back on it now!
There's no wrapping going on there: you spawn a new managed server and it runs on the DB node, which looks fine.
I'm a little unclear on how this happens. Where is the websocket code that does this? All I can see is this client module.
I'm a bit unclear on this. You wouldn't need to call reconnect here. The general paradigm for managed servers is that the server is long lived and handles multiple clients, which are differentiated internally by tagging client requests with a unique identifier and relying on Cloud Haskell's ordering guarantees between two isolated processes. The server process should keep running until you shut it down by sending it a shutdown signal.
Failures on the server will be automatically detected by clients that are using the call API.
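To make the paradigm described above concrete, here is a minimal, hedged sketch of a long-lived managed server using the distributed-process-client-server API; `CounterRequest` and `counterServer` are illustrative names, not from the project under discussion:

```haskell
{-# LANGUAGE DeriveGeneric #-}
module CounterServer where

-- Hedged sketch of the long-lived managed-server paradigm: one server
-- process serves many clients, and requests are tagged and ordered by
-- Cloud Haskell, so there is no per-client connection to manage here.

import Control.Distributed.Process
import Control.Distributed.Process.Extras.Time (Delay (Infinity))
import Control.Distributed.Process.ManagedProcess
import Data.Binary (Binary)
import Data.Typeable (Typeable)
import GHC.Generics (Generic)

data CounterRequest = Increment | Current
  deriving (Typeable, Generic)

instance Binary CounterRequest

counterServer :: Process ()
counterServer = serve (0 :: Int) initialise processDef
  where
    initialise count = return (InitOk count Infinity)
    processDef = defaultProcess { apiHandlers = [ handleCall handleReq ] }

    -- Each call handler receives the current state and one client request,
    -- replies to that client, and carries the (possibly updated) state on.
    handleReq :: Int -> CounterRequest -> Process (ProcessReply Int Int)
    handleReq count Increment = reply (count + 1) (count + 1)
    handleReq count Current   = reply count count
```

A client would then do something like `n <- call serverPid Increment :: Process Int`; as far as I understand the client-server implementation, `call` monitors the server for the duration of the request, which is how the automatic failure detection mentioned above surfaces on the client side.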
In Cloud Haskell there are heavyweight (TCP/IP) connections, and lightweight connections that multiplex over the TCP/IP connection. All interactions between two processes on separate nodes use the network-transport layer to establish a lightweight connection - a full connection over which to multiplex is created if not already present - and this connection is reused whenever possible. Only network failures between the two nodes should cause the kind of issue that reconnect is meant to address.
Oh wait, I'm quite wrong, I can see what you mean by wrapping now... Although this shouldn't be a problem - viz. one managed process calling another - I wouldn't structure your process tree like that. Let me look through the code a bit more and I'll try to advise.
Right... It's a little unclear what you're doing, but if I've understood it correctly, then I think I've found the problem in the CH code:

```haskell
timeoutOrDie :: Timeout -> IO a -> Process (Either ServerError a)
timeoutOrDie micros act = do
  if micros == 0 then
    liftIO act >>= \x -> pure (Right x)
  else do
    mRes <- liftIO (timeout micros act)
    case mRes of
      Just res -> pure (Right res)
      Nothing  -> pure (Left RequestTimeoutError)
```

That code is, I suspect, where the problem lies. A quick fix would be to handle the timeout differently - the distributed-process-async approach shown below is one option. That's the workaround at least. I should point out here that it's still not clear whether or not leaking Cloud Haskell resources has anything to do with your running out of file descriptors.

Now what I /would/ do if I were you, is re-design the process tree a little bit. First of all, it seems to me that your main (outer) server is mainly there to maintain a list of current active sessions. That's great, so maybe consider abstracting your sessions out into their own processes. If your sessions need to wrap a handle to the underlying database/file-system/etc., then you can use STM to share data between the parent and child processes (as long as the children are spawned locally to the parent process). There is a great API available for doing this, in the distributed-process-async package. For starters, your timeout code would be greatly simplified:

```haskell
testAsyncWaitTimeout result = do
  hAsync <- async $ task (expect :: Process ())
  waitTimeout 100000 hAsync >>= stash result
  cancelWait hAsync >> return ()
```

But I wouldn't even bother with this, tbh. Instead of the session management server performing I/O and/or STM actions, you should break out the logic for an individual session into its own process and have the parent either return an opaque handle to the session (which the websocket server can store in memory for its persistent connection), or you can use the parent process as a dispatcher if you prefer. If child processes need to coordinate around a shared resource then you either isolate that in its own process (be it the parent server process, or some other process) or use a shared STM handle that all the children inherit. There are lots of examples of this that I can provide.

If you choose to isolate sessions in their own child processes, then I'd suggest supervising them. The easiest way to do this would be to spawn a supervision tree linked to the session management server, and add child handles for each new session (you can wrap managed-process startup functions into child specifications for this).

If you're feeling brave, you might also want to take a look at the resource pool implementation here. In particular it handles resource acquisition and both deliberate and accidental client disconnects, all using the managed process API. The latter (monitor notifications of disconnects) are handled by the backing pool (i.e., the implementation the server defers to at runtime - see the example code here). Please note that the pool implementations aren't fit for production use yet - there are some bugs and failing tests - so please don't use these or copy them verbatim; I'm just pointing out techniques you might want to be aware of when constructing complex managed processes using -client-server.

If the -task or -supervisor libraries are failing in CI, it is probably because the travis config is very out of date. I will try and go through and fix these asap.

Do please shout with any further questions - I'm most happy to help, and of course will help hunt for leaking file handles as best I can too.
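As an illustration of the STM-sharing suggestion in the comment above, here is a minimal hedged sketch; the `parentWithSharedState` layout and the shared counter are illustrative, not taken from the project:

```haskell
module StmSharingSketch where

-- Hedged sketch: a parent process creates a TVar and shares it with
-- locally-spawned children. This only works because the children live on
-- the same node as the parent; TVars cannot cross node boundaries.

import Control.Concurrent.STM
import Control.Distributed.Process
import Control.Monad (forM_)

parentWithSharedState :: Process ()
parentWithSharedState = do
  activeSessions <- liftIO (newTVarIO (0 :: Int))
  -- Spawn a few local children that all inherit the same TVar.
  forM_ [1 :: Int .. 3] $ \_ ->
    spawnLocal $ do
      liftIO $ atomically $ modifyTVar' activeSessions (+ 1)
      -- ... per-session work would go here ...
      liftIO $ atomically $ modifyTVar' activeSessions (subtract 1)
  -- The parent can observe the shared counter at any time.
  n <- liftIO (readTVarIO activeSessions)
  say $ "sessions currently registered: " ++ show n
```

The same pattern works with a richer shared handle (for example a TVar holding a map of session identifiers), as long as every process touching it was spawned locally to the process that created it.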
Wow, thanks for pointing out the mistaken assumptions in my code and the additional details. I will definitely want to fix the timeout. I spent a few hours narrowing down the file handle issue specifically and was able to trace it to my misunderstanding of "ownership" of the transport. When a new local node is created, it adds an endpoint to the transport and closes that endpoint when the node is closed - but the transport itself stays open. I am able to resolve the leaking handles by retaining a reference to the transport and closing it separately.
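A minimal hedged sketch of that ownership split - assuming a recent network-transport-tcp where createTransport takes a TCPAddr (older versions take host/port arguments directly), with placeholder host and port values:

```haskell
module Main where

-- Hedged sketch: keep hold of the Transport so it can be closed
-- independently of the local node built on top of it.

import Control.Distributed.Process
import Control.Distributed.Process.Node
import Network.Transport (closeTransport)
import Network.Transport.TCP (createTransport, defaultTCPAddr, defaultTCPParameters)

main :: IO ()
main = do
  Right transport <- createTransport (defaultTCPAddr "127.0.0.1" "10501") defaultTCPParameters
  node <- newLocalNode transport initRemoteTable
  runProcess node $ say "hello from the node"
  -- closeLocalNode releases the node's endpoint, but the listening socket
  -- belongs to the transport, so close that explicitly as well.
  closeLocalNode node
  closeTransport transport
```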
Ah perfect, that makes a good bit of sense and I'm glad you found it! Feel free to shout with any further / future issues or questions!
I think this requires calling reconnect and it should be in here already, but I'll double check. Via @agentm on IRC afaik.