kotlin utf-8
Updated Thu, 01 Sep 2022 00:11:25 GMT

Kotlin prints non-English characters as question marks


I am trying to print Hebrew characters from a Kotlin program (running on the console).

All the Hebrew characters are being output as question marks.

I created the following simple test.kts script file for testing:

println(" ")
// Try to print a simple non-Hebrew character too
println("\u0394") // Greek Delta

The file is properly saved in UTF-8 format.

It prints:

???? ???????
?

I tried running it in Command Prompt, PowerShell (both in its native window and in Windows Terminal), and Git Bash, all of which give the same result. I also tried redirecting the output to a file to rule out display issues in the shells.

To make sure the problem isn't the console itself, I also made simple test.bat, test.ps1, and test.sh files with the following content:

echo " "

All three shells correctly displayed the Hebrew text here, indicating that the problem is in Kotlin's output, not in the shell display. (Though PowerShell requires the file to be saved "UTF-8 with BOM" to display properly, this can't be the issue with Kotlin since Kotlin won't even run a script that is saved with a BOM.)

As far as I can tell, Kotlin should support UTF-8 output by default with no configuration needed.

How can I get the proper output?


Updates:

If I write the output to a file using java.io.File("out.txt").writeText(" "), it works properly.

Also, if I open a new PrintStream using val out = java.io.PrintStream(System.out, true, "UTF-8") and then write to it using out.println(" "), that works properly too.

Only writing to the console with println is broken.
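
Building on the PrintStream observation above, one possible workaround is to install a UTF-8 PrintStream as System.out, so that plain println goes through it. This is only a minimal sketch for a compiled .kt file: the helper names utf8PrintStream and encodeLine are hypothetical, and it assumes the console itself is already able to display UTF-8.

```kotlin
import java.io.ByteArrayOutputStream
import java.io.FileDescriptor
import java.io.FileOutputStream
import java.io.OutputStream
import java.io.PrintStream

// Hypothetical helper: wraps any byte sink in an auto-flushing PrintStream
// that encodes text as UTF-8.
fun utf8PrintStream(target: OutputStream): PrintStream =
    PrintStream(target, true, "UTF-8")

// Hypothetical helper: encodes text through utf8PrintStream and returns the raw
// bytes, which makes it easy to inspect what would be sent to the console.
fun encodeLine(text: String): ByteArray {
    val buffer = ByteArrayOutputStream()
    utf8PrintStream(buffer).use { it.print(text) }
    return buffer.toByteArray()
}

fun main() {
    // Replace the default stdout; on the JVM, Kotlin's println writes to System.out.
    System.setOut(utf8PrintStream(FileOutputStream(FileDescriptor.out)))
    println("\u0394") // Greek Delta, as in the test script
}
```

The stream has to be installed before the first println call, and it only changes the bytes the JVM emits; the shell still needs a matching output encoding and a font that contains the characters.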


System info:

  • Windows 10 2004 (Build 19041.450)
  • Kotlin 1.4.0 (downloaded from GitHub Releases)
  • Tested with JAVA_HOME pointing to both JRE 1.8.0_261 (Oracle) and 11.0.2 (Oracle OpenJDK).



Solution

(Update at bottom)

This is only a partial answer, but I was able to get some Hebrew characters onto the console in both Kotlin and Java. It was very painful. I left some commented-out code in to show other things I tried, in case you run into further hurdles.

Saved Tester.kt as UTF-8 with Notepad.

fun main(args : Array<String>) {
    System.setProperty("file.encoding", "UTF8")
    //val charset = Charsets.UTF_8
    //val byteArray = " ".toByteArray(charset)
    //System.out.printf("%c",byteArray.toString(charset))
    //System.out.println(Charset.defaultCharset())
    System.out.println("")
    
}

Compiled with:

kotlinc.bat .\Tester.kt -include-runtime -d Tester.jar

This led to another mess, which I discovered when I tried to copy and paste Hebrew characters into PowerShell/cmd: the question marks showed up immediately on paste. After digging around a bit, it seems PowerShell ISE is better suited for this (reference below); there, copy and paste worked without any plugins. I then had to run this:

PS> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8

Because on my system, running the following showed:

PS> [Console]::OutputEncoding
IsSingleByte      : True
BodyName          : iso-8859-1
EncodingName      : Western European (Windows)
HeaderName        : Windows-1252
WebName           : Windows-1252
WindowsCodePage   : 1252
IsBrowserDisplay  : True
IsBrowserSave     : True
IsMailNewsDisplay : True
IsMailNewsSave    : True
EncoderFallback   : System.Text.InternalEncoderBestFitFallback
DecoderFallback   : System.Text.InternalDecoderBestFitFallback
IsReadOnly        : True
CodePage          : 1252

Then,

java -D"file.encoding=UTF-8" -jar Tester.jar

and voila, a single Lamedh
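
For what it's worth, the question marks seen before these changes are consistent with how the JVM encodes text for a charset that lacks the character: the unmappable character is replaced by the charset's replacement byte, which is '?' for the single-byte Western charsets. A small sketch of that behaviour follows. It uses ISO-8859-1 as a stand-in for windows-1252, because ISO-8859-1 is guaranteed to exist on every JVM, and the function name roundTrip is hypothetical.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: encodes text with the given charset and decodes it back,
// mimicking what ends up on a console that uses that charset.
fun roundTrip(text: String, charset: Charset): String =
    String(text.toByteArray(charset), charset)

fun main() {
    val lamed = "\u05DC" // Hebrew letter Lamed

    // ISO-8859-1 has no Hebrew letters, so the encoder substitutes '?'.
    println(roundTrip(lamed, Charsets.ISO_8859_1) == "?") // prints "true"

    // UTF-8 can represent the character, so it survives the round trip.
    println(roundTrip(lamed, Charsets.UTF_8) == lamed) // prints "true"
}
```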


Also, the Java route, which may or may not bring more insights:

Tester.java, saved as UTF-8 with Notepad. Some of the imports are redundant, but they show the charset-related classes that stood out:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import static java.nio.charset.StandardCharsets.*;
import java.nio.*;
public class Tester{
    public static void main(String[] args){
        String str1 = " ";
        byte[] ptext = str1.getBytes(UTF_8); 
        String value = new String(ptext, UTF_8); 
        ByteBuffer byteBuffer = StandardCharsets.UTF_8.encode("");
        System.out.println(Charset.defaultCharset());
        System.out.println(" ");
        System.out.println(value);
        System.out.print(byteBuffer.getChar());
        System.out.printf("Value: %s",value);
    }
}

javac would give:

javac .\Tester.java
.\Tester.java:8: error: unmappable character (0x9D) for encoding windows-1252
                System.out.println("? ");

So

javac -encoding UTF-8 .\Tester.java

and voila again, PS ISE only:

PS> java -D"file.encoding=UTF-8" Tester
UTF-8
 
 
Value:  

I think this shows that there are several hurdles, but it can work with Kotlin, and with println, once you make sure that the source file is saved and compiled as UTF-8, that the program is run with the right encoding option, and that the console's output encoding is set correctly. Hebrew may be particularly difficult because it is written right to left; other characters, such as Greek, were easier, I think.
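
When checking these hurdles one at a time, it may help to print what the JVM itself reports. The sketch below is only a diagnostic aid under some assumptions: encodingReport is a hypothetical helper, and the sun.stdout.encoding and stdout.encoding properties are JDK-version dependent, so either may be missing on a given setup.

```kotlin
import java.nio.charset.Charset

// Hypothetical helper: collects the encoding-related settings the JVM reports.
// Any of the system properties may be absent (null), depending on JDK and platform.
fun encodingReport(): Map<String, String?> = linkedMapOf(
    "defaultCharset" to Charset.defaultCharset().name(),
    "file.encoding" to System.getProperty("file.encoding"),
    "sun.stdout.encoding" to System.getProperty("sun.stdout.encoding"),
    "stdout.encoding" to System.getProperty("stdout.encoding")
)

fun main() {
    // Print one "name = value" line per setting.
    for ((name, value) in encodingReport()) {
        println("$name = ${value ?: "(not set)"}")
    }
}
```

Running it once with and once without -D"file.encoding=UTF-8" should show whether the option is actually reaching the JVM.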

No matter what, I feel your pain. Good luck. From what I have read, there may be other bottlenecks, such as sending Hebrew over a network. This opened my eyes to several things, and I will continue to learn about it myself.

(Update) Using the second link in the references, you can make two small changes and get Hebrew in regular PowerShell too (not just ISE):

PS> $OutputEncoding = [console]::InputEncoding = [console]::OutputEncoding = New-Object System.Text.UTF8Encoding

Then change the console font:

Font: Courier New

References:





Comments (1)

  • +0 – It looks like the Kotlin script .kts executor can't handle it, but compiling to JAR with -encoding UTF-8 and running with -D"file.encoding=UTF-8" worked (along with setting $OutputEncoding...). — Sep 02, 2020 at 19:39