vtkSocketCommunicator errors: "Could not receive tag" and "Endian handshake failed"

Issue and situation

This post branches off from Reverse client-server connection fails with more than one MPI process. In a reverse connection between client and server on the same network, parallel jobs fall back to serial; see that post for the mechanics of the error.

Search

Earlier I was trying to pinpoint the cause of the error that accompanies the disconnection of an MPI process: "vtkSocketCommunicator::GetVersion() returns different values on the two connecting processes". @mwestphal has pointed out that this message is not informative and is a known issue in itself, see https://gitlab.kitware.com/paraview/paraview/issues/18171. It therefore makes sense to look into the error messages received just before that one.

( 26.150s) [pvserver ]vtkSocketCommunicator.c:808 ERR| vtkSocketCommunicator (0x56537cb0bff0): Could not receive tag. 1010580540
( 26.150s) [pvserver ]vtkSocketCommunicator.c:557 ERR| vtkSocketCommunicator (0x56537cb0bff0): Endian handshake failed.
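
A side note for anyone decoding these two lines: the number printed after "Could not receive tag." looks like a tag value rather than a byte count. The sketch below only does the arithmetic. That the value corresponds to vtkSocketController::ENDIAN_TAG is my reading of the VTK source, not something confirmed in this thread, and the helper name tag_to_hex is made up for illustration.

```shell
# Sketch: print a decimal tag taken from the log as 32-bit hex.
# tag_to_hex is a hypothetical helper, not part of ParaView or VTK.
tag_to_hex() {
  printf '0x%08x\n' "$1"
}

tag_to_hex 1010580540   # prints 0x3c3c3c3c
```

A repeating byte pattern like this suggests a fixed marker. If that reading is right, the receive failed while waiting for the endian-handshake tag, which is consistent with the second log line.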

I found documentation of either error or of both in

which all kind of relate to the same attempts to have pvserver work as expected. But they do not seem to have received an adequate answer, or I could not decipher a relationship to my specifics.

Previous activity on this forum lets me think that

Same tune on Stackexchange https://stackexchange.com/search?q="Could+not+find+tag", if not no tune at all: https://stackexchange.com/search?q="Endian+handshake+failed"

Questions

Any suggestion for fixes and workarounds to have the reverse connection work in parallel?

Coming back to this topic, where pvserver does not work in parallel mode with a reverse connection: I hope the search report given above is exhaustive enough.
To elaborate, the server and client clash at this line of vtkSocketCommunicator (namely in /superbuild/paraview/src/VTK/Parallel/Core/vtkSocketCommunicator.cxx):

554 if (!this->ReceiveTagged(&serverIsBE, static_cast<int>(sizeof(char)), 1,
555       vtkSocketController::ENDIAN_TAG, nullptr))
556 {
557   vtkSocketCommunicatorErrorMacro("Endian handshake failed.");
558   return 0;
559 }

What does the condition on line 554 imply? Is this a question for the developers? Thanks for advising. A reverse connection is the only way for me to make practical use of ParaView, alas.

First, make sure that the VTK submodule is updated on both the client and server:

git submodule update --recursive

I have forgotten to do that and run into similar connection errors.

If that doesn’t work, you can try a make clean in your build directory. If that still doesn’t work, delete everything inside the build directory and start fresh.
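
The sequence above could be sketched as a small script. This is a minimal sketch under assumptions: the source and build paths are placeholders, --init is added on top of the command above so that uninitialized submodules are also fetched, and cmake -S/-B needs CMake 3.13 or newer. The helper names run and fresh_rebuild are made up; set DRY_RUN=1 to print the commands without executing them.

```shell
# Sketch: update submodules, then rebuild from an empty build directory.
# run and fresh_rebuild are hypothetical helpers; paths are placeholders.

run() {
  # Print the command in dry-run mode, otherwise execute it.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

fresh_rebuild() {
  src="$1"
  build="$2"
  # Refuse to run without both paths, so rm -rf never sees an empty argument.
  if [ -z "$src" ] || [ -z "$build" ]; then
    echo "usage: fresh_rebuild SOURCE_DIR BUILD_DIR" >&2
    return 2
  fi
  # Bring the submodules (and anything nested in them) in line with the checkout.
  run git -C "$src" submodule update --init --recursive || return 1
  # Start from an empty build directory instead of relying on make clean.
  run rm -rf "$build" || return 1
  run mkdir -p "$build" || return 1
  # Configure and build out of source.
  run cmake -S "$src" -B "$build" || return 1
  run cmake --build "$build"
}

# Example (paths are assumptions):
# DRY_RUN=1 fresh_rebuild "$HOME/src/paraview" "$HOME/build/paraview"
```

The same steps would need to be run on both the client and the server so that the two builds come from matching sources.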


That’s the solution! I can see four pvserver processes working remotely for my local viewer, as intended. No crashes.


@cory.quammen Please note that in the superbuild instructions at
https://gitlab.kitware.com/paraview/paraview-superbuild/blob/master/README.md#building-a-specific-version the suggested operation is
git submodule update
without the --recursive option. From your experience, do you think the mistake lay in the missing --recursive option? Should that README be aligned with your tip?

The --recursive option just tells the VTK submodule to update its (currently one) submodule, VTK-m, which shouldn’t make a difference with regard to the client/server handshake. But that operation should be updated in the build instructions.

@ben.boeckel?

Reference is made to https://gitlab.kitware.com/paraview/paraview-superbuild/blob/master/README.md#building-a-specific-version

I’m even more confused now. Those build instructions you’ve cited are for the superbuild, not ParaView itself; --recursive is only useful when updating the paraview repository. The superbuild does have a submodule that requires updating, but it does not have nested submodules that would require the --recursive option to be passed to git submodule update. In any case, I am glad it is working for you now. I suspect maybe you had a build in a bad state and rebuilding was the actual solution.

For superbuild, I’m not sure how well make clean works in practice. make clean is always a little dicey when source code might be changing (via git submodule update, etc.) as build rules are added and deleted. It is usually safest to start from an empty build directory if you get into a “weird” state and make clean hasn’t worked.

And I am even more confused now.

The errors were raised by pvserver, and both client and server are little endian; I cannot say, though, which potentially bad build was sending the wrong endianness or reading it wrongly. I believe I ran make clean after each rebuild attempt as a matter of habit.
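
For completeness, a quick way to confirm the byte order on each host, as a minimal sketch (host_byte_order is a made-up helper name). Note that in the snippet quoted earlier the error is raised when ReceiveTagged itself fails, so the message may only mean that the byte could not be received, not that the two byte orders disagree.

```shell
# Sketch: report the byte order of the machine this runs on.
# host_byte_order is a hypothetical helper name.
host_byte_order() {
  # Write the bytes 0x01 0x00 and let od read them back as one 16-bit word
  # in the host's native byte order.
  word=$(printf '\001\000' | od -An -tx2 | tr -d '[:space:]')
  case "$word" in
    0001) echo little ;;
    0100) echo big ;;
    *)    echo unknown ;;
  esac
}

host_byte_order
```

Running this on both the client and the server should give the same answer on two little-endian machines, which would point away from a genuine endianness mismatch.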

I reported my previous investigations in Reverse client-server connection fails with more than one MPI process. On the server side I may have skipped git submodule update when I cloned the repository, the first time or on later occasions, or I may not have cleaned the build directory.
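
One way to check after the fact whether a clone has stale submodules is to look at the prefixes printed by git submodule status. The following is a minimal sketch; filter_stale and stale_submodules are made-up helper names, and the example path is an assumption.

```shell
# Sketch: list submodules whose checkout does not match the superproject.
# filter_stale and stale_submodules are hypothetical helpers.
filter_stale() {
  # git submodule status prefixes a line with '-' when the submodule is not
  # initialized and with '+' when its checked-out commit differs from the
  # commit recorded in the superproject.
  grep -E '^[-+]' || true
}

stale_submodules() {
  git -C "$1" submodule status --recursive | filter_stale
}

# Example (path is an assumption):
# stale_submodules "$HOME/src/paraview"
```

Empty output would mean every submodule matches the commit recorded by the checkout; any line printed names a submodule that git submodule update has not yet brought in line.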

However, after running git submodule update, I had to work around another issue on the client side, namely building version 5.7.0 on cluster (2). There I certainly ran make clean at each attempt. So the alternative answer could be: don’t forget to make clean before rebuilding? Please add that to your answer above if applicable: there is not much recent material around about this endian-handshake glitch, and it can be useful for future readers.

Indeed, either way that issue is solved. Thanks for sparring.

@cory.quammen Would you mind editing your accepted answer to add that running make clean could also be part of the solution? There are not many tips on the internet for sorting out the endian-handshake incident. A small edit can help future readers fix their problem quickly, since the solution is appended to the question. Thanks in advance.

This concludes my elaborations on this post. Thanks again for the support.

Done.