Member

@mhanberg mhanberg commented Nov 3, 2025

This reverts the revert of the epmdless PR and introduces the swarm library to fix the "noproc" problems that occur because epmdless makes the engine node a "hidden" node, which precludes the use of :global.

The tests are currently not passing; I'm trying to figure out why.

However, I am able to build a local release, and it "all works" on my personal Linux desktop. I will try it soon on my work laptop, which usually has more of the IT hurdles that make epmdless necessary.

cc @doorgan I'm going to a conference this week, so please feel free to pull this down to see if you can get the tests working (and also let me know what you think). The code currently has lots of dbg expressions everywhere.

Collaborator

doorgan commented Nov 3, 2025

@mhanberg thanks, this is great, TIL about swarm.
I'll try this out, and I can take it from here if that's ok.

@doorgan doorgan changed the title from "feat: cluster without epmd (epmdess)" to "feat: cluster without epmd (epmdless)" Nov 3, 2025
Collaborator

doorgan commented Nov 3, 2025

I'm checking this out, and:

  • A lot of tests fail because we're using the short name as the node_name when doing :erpc calls or even Node.ping. Using the long name (name@hostname) seems to work there, but part of your issue is that you can't use fully qualified hostnames
  • I see the same issue when I create a release
  • A lot of tests using CompletionCase fail because apparently the test process never receives the "project index ready" message from the engine
  • If I use :longnames then I can get the server and project to connect, but the index store crashes:
19:17:36.614 [error] Task #PID<0.340.0> started from XPEngine.Search.Store terminating
** (stop) exited in: GenServer.call({:via, :xp_swarm, {:ets_search, "rainet"}}, {:reduce, [%{}, #Function<9.115955553/2 in XPEngine.Search.Indexer.do_update_index/2>]}, :infinity)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (elixir 1.17.1) lib/gen_server.ex:1121: GenServer.call/3
    (xp_engine 0.1.0) lib/engine/search/indexer.ex:34: XPEngine.Search.Indexer.do_update_index/2
    (xp_engine 0.1.0) lib/engine/search/indexer.ex:28: XPEngine.Search.Indexer.update_index/2
    (xp_engine 0.1.0) lib/engine/search/store/state.ex:195: anonymous fn/2 in XPEngine.Search.Store.State.prepare_backend_async/2
    (elixir 1.17.1) lib/task/supervised.ex:101: Task.Supervised.invoke_mfa/2
    (elixir 1.17.1) lib/task/supervised.ex:36: Task.Supervised.reply/4
Function: #Function<2.49658960/0 in XPEngine.Search.Store.State.prepare_backend_async/2>
    Args: []

I'll try to fix all those

Member Author

mhanberg commented Nov 4, 2025

I think short names are preferred when you are only clustering with local nodes, which is what I read and why I made the changes (along with that one issue you mentioned).

So you're saying the server and engine aren't connecting for you?

Member Author

mhanberg commented Nov 4, 2025

With short names, I still connected like this in Next LS:

{:ok, host} = :inet.gethostname()
node = :"#{sname}@#{host}"
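
For reference, the two lines above could be wrapped in a small helper. This is only a sketch: the module name `NodeNameSketch` and the function `short_node_name/1` are made up for illustration and are not part of Expert or Next LS. The only library call it relies on is `:inet.gethostname/0`, which returns the local hostname as a charlist.

```elixir
defmodule NodeNameSketch do
  # Hypothetical helper for illustration only; not part of Expert or Next LS.
  # Builds a node name atom of the form :"sname@host" from a short name.
  def short_node_name(sname) do
    # :inet.gethostname/0 returns {:ok, hostname}, where hostname is a charlist.
    {:ok, host} = :inet.gethostname()

    # String interpolation converts the charlist through String.Chars,
    # and the :"..." syntax turns the resulting string into an atom.
    :"#{sname}@#{host}"
  end
end
```

The result can then be passed to something like `Node.ping/1`, which returns `:pong` only if a node with that name is actually running and reachable, and `:pang` otherwise.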

Collaborator

doorgan commented Nov 4, 2025

> So you're saying the server and engine aren't connecting for you?

Yes, which is rather odd. We're starting the net kernel explicitly with :shortnames, so my expectation was that short names would work, but they don't.

Collaborator

doorgan commented Nov 4, 2025

> With short names, I still connected like this in Next LS:
>
> {:ok, host} = :inet.gethostname()
> node = :"#{sname}@#{host}"

I'm doing the same locally and it works. I was wondering whether it would work on your end, given the fully qualified hostname issues you were seeing.

Member Author

mhanberg commented Nov 4, 2025

I think that issue is more about using long names when the host is an IP address, but I could be mistaken.

Member Author

mhanberg commented Nov 4, 2025

Also, to clarify: I don't remember whether I ever hit the FQDN problem myself, but I definitely opened the issue after another user posted a similar one.

Using `:swarm` causes the store to not be discoverable by
the GenServer APIs. I'm not sure this is the right way
to go, but it seems to work when I build a release and
a lot of tests got fixed.

Forge gets compiled as :dev even in tests for both engine and
expert, so we need to make the clustering method configurable.
Furthermore, the engine and expert have different requirements
for how the store is clustered, so we need to set different
values for each.
@doorgan doorgan marked this pull request as ready for review November 7, 2025 07:27
Collaborator

doorgan commented Nov 7, 2025

I'm opening this up for review. I think I fixed all the test issues, and I got Expert to work properly on my machine.

Member Author

mhanberg commented Nov 7, 2025

> I'm opening this up for review. I think I fixed all the test issues, and I got Expert to work properly on my machine.

You're a beast 🙌🏻

I'll be able to test on my work computer this weekend when I get home from traveling.

@doorgan doorgan mentioned this pull request Nov 10, 2025
@mhanberg
Member Author

Going to close this as #205 is going to supersede it.

@mhanberg mhanberg closed this Nov 12, 2025