The Delphi Compiler and UTF-8 Encoded Source Code Files With no BOM

Posted by on in Tutorial

Last week a (large) customer sent me an email indicating he was experiencing issues when compiling the same project on different machines. Turned out the difference was in the source code files format and the root cause was a unit saved as UTF-8 but without a BOM. The reason? One of the developers is using Visual Studio Code... and the solution is a chancing that or using compiler flag. But before I get to the solution, let me show you the problem with a very simple test case.

Delphi and Source Files Encoding

To test the scenario, we (it was one of the architects who came up with the simple scenario) created a simple VCL application with code like the following:

var 
  strEuro : String = 'Euro=€';

procedure TForm16.Button1Click(Sender: TObject);
begin
  Button1.Caption := 'Hello ' + strEuro;
end;

By default, the editor in the Delphi IDE uses ANSI encoding, and everything works fine. You can use the editor context menu, pick the File Format submenu, select UTF8 (I know, missing -), and everything keeps working as expected. Notice, though the length in bytes of the string changes, as you need multiple bytes to represent the Euro symbol in UTF-8.

Enter Visual Studio Code

While many modern editors use UTF-8 as their standard file format, a nice option we are considering to default to also in RAD Studio, Visual Studio Code (or VSC) is one of the few that prefers using UTF-8 with no BOM. The BOM (Byte Order Mark) is a sequence of bytes (3 bytes for for UTF-8) that marks the file to make it simple for an editor or a tool processing the file (like a compiler) to figure out the internal format. Given the similarities in many cases, you might have to read an entire file to see if it is UTF-8 or ANSI encoded.

When you open a Delphi unit in VSC, it keeps the formatting. If it is UTF-8 with BOM, it remains as such. But for a new file the default is UTF-8 with no BOM. And any file can be saved with that format, as you can see in the status bar options:

Now if you do this, compile the application again, you'll get a nice caption for your button, rather than the Euro symbol:

Use a BOM or a Compiler Flag

Once you realize the issue, the solution is not that complex to achieve.

On one side, you can make sure your editors save the UTF-8 files with a BOM. The compiler sees it, and all works fine.

On the other side, you can tell the compiler to consider the source code file as UTF-8, regardless of the BOM. You can do this using the compiler flag:

--codepage:65001

or setting the codepage in the compiler options to that value:

Good Work Around, Future Solution

I hope this will help you avoid a similar issue. We are researching making the codepage detection automatic, so that the compiler could automatically pick up the right option. But not something we plan to addressing shortly... as the R&D team is busy with the coming release.

 

 

 



About
Gold User, Rank: 7, Points: 457
Delphi and RAD Studio Product Manager at Embarcadero.

Comments

Check out more tips and tricks in this development video: