UTF8 everywhere

I have been working on several merge requests in VTK (https://gitlab.kitware.com/vtk/vtk/merge_requests/6122, https://gitlab.kitware.com/vtk/vtk/merge_requests/6291 and https://gitlab.kitware.com/vtk/vtk/merge_requests/6301) which introduce the constraint that all file and path names (and eventually all string data) passed across the VTK API are UTF-8 encoded.

It has come to my attention that some ParaView users rely on extended character sets with non-UTF-8 locales, so I’m wondering how many people would object to this new requirement?


which introduce the constraint that all file and path names (and eventually all string data) passed to/from VTK are utf-8 encoded.

Once again, it is not clear to me where in the code this requirement is enforced. Could you clarify that?

so I’m wondering how many people would object to this new requirement ?

Anyone without a UTF-8 locale will be impacted by the proposed change to ParaView.
Run locale in a terminal on Linux to check your current settings (locale -a lists every installed locale).
On Windows, the information is in the language settings.
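For a check that works the same way on Linux and Windows, here is a minimal Python sketch. It is only an illustration and not part of VTK or ParaView; the helper name is_utf8_name is made up for this example.

```python
import codecs
import locale
import sys


def is_utf8_name(encoding_name: str) -> bool:
    """Return True if encoding_name is any spelling of UTF-8 (hypothetical helper)."""
    try:
        # codecs.lookup normalizes aliases such as "UTF8" or "utf_8".
        return codecs.lookup(encoding_name).name == "utf-8"
    except LookupError:
        # Unknown encoding names are treated as "not UTF-8".
        return False


# Encoding implied by the current locale settings.
preferred = locale.getpreferredencoding(False)
print("preferred encoding:", preferred)
print("filesystem encoding:", sys.getfilesystemencoding())
print("locale is UTF-8:", is_utf8_name(preferred))
```

If the last line reports False, the session is running in a non-UTF-8 locale, which is the case discussed in this thread.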

Anyone using plain ASCII (128-character) file and path names will not be affected, regardless of their locale settings, since the UTF-8 encoding of ASCII text is byte-for-byte identical to ASCII.
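A small Python sketch of that point, with made-up file names and cp1252 chosen only as an example of a legacy Windows code page: an ASCII-only path produces the same bytes in every encoding tried here, while a path with an accented character does not.

```python
def same_bytes_everywhere(text: str, encodings) -> bool:
    """Return True if text encodes to identical bytes in all given encodings."""
    encoded = {text.encode(name) for name in encodings}
    return len(encoded) == 1


ENCODINGS = ("ascii", "utf-8", "cp1252")

# ASCII-only path: identical bytes regardless of encoding.
plain_path = "C:/data/mesh_001.vtk"
print(same_bytes_everywhere(plain_path, ENCODINGS))

# Path with an accented character: UTF-8 and cp1252 disagree.
accented_path = "données/maillage.vtk"
utf8_bytes = accented_path.encode("utf-8")
cp1252_bytes = accented_path.encode("cp1252")
print(utf8_bytes == cp1252_bytes)
```

Only names like the second one are affected when the locale encoding is not UTF-8.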

Why do you say that all file and path names have to be UTF-8?

The requirement arises from the introduction of vtksys for opening files and streams, as a result of a change in MR 6122.

Much clearer now. Thanks.

There is indeed no way around it. However, it means we drop support for non-UTF-8 locales on Windows, so we must be clear about that beforehand.

@utkarsh.ayachit @cory.quammen : thoughts ?

Afraid I don’t know enough about character encodings to contribute. Are the limitations on non-UTF8 locales likely to affect many Windows users?

The purpose of the changes is actually to allow text in any language to be treated consistently and reliably by VTK regardless of the user locale.

Indeed, all these changes will greatly improve support for UTF-8 encodings.

However, all Chinese users may be impacted by this, see here:

I don’t see the problem. Every character from every language can be encoded in UTF-8.


Great! I missed this part! GB 2312, which was historically used before GB 18030, did not support it.
There must still be users in China on GB 2312, but it may be fine to tell them to upgrade.
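For anyone who does hold file names in a legacy encoding, converting them to UTF-8 before handing them to VTK is straightforward. The Python sketch below is only an illustration of the idea: the helper to_utf8 and the file name are made up, and it assumes the caller knows which legacy encoding the bytes are in.

```python
def to_utf8(raw: bytes, legacy_encoding: str) -> bytes:
    """Re-encode bytes from a known legacy encoding into UTF-8 (hypothetical helper)."""
    # Decode to Unicode text first, then encode that text as UTF-8.
    return raw.decode(legacy_encoding).encode("utf-8")


file_name = "中文.vtk"

# The same name as it would be stored under a GB 2312 locale.
legacy_bytes = file_name.encode("gb2312")

# Converted to the encoding the proposed VTK API expects.
utf8_bytes = to_utf8(legacy_bytes, "gb2312")

print(legacy_bytes != utf8_bytes)
print(utf8_bytes.decode("utf-8") == file_name)
```

The byte sequences differ, but the text round-trips without loss, which is why any GB 2312 name can be represented in UTF-8.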