DPL & Unicode - a toss up.

So far it's looking like a toss-up between folks wanting more information on the Delphi Parallel Library and those wanting more information about the shift to Unicode.  I think both are extremely important and it is no surprise given the feedback.  Since it is still not clear whether or not DPL will make it into the next release, I may opt to begin talking more about Unicode... then again, maybe not :-).

Right now, I'm wrestling with some compiler issues related to debugging when a generic type is instantiated... needless to say it's making the work on DPL a little tough.  This is par for the course when you're trying to retrofit the airplane while it is still in flight :-).  If it takes more than a few days before this is resolved, I'll probably jump back over to Unicode.  That area is working and the team is full speed ahead on it.

Just to clear some things up, I'm going to answer a few of the common questions folks have about the move to Unicode.

Is there a new Unicode string type or are you just using WideString?

Yes, there is a new data type: UnicodeString.  It will be reference-counted just like AnsiString, and unlike WideString, which is a BSTR.  This new data type will be mapped to string which is currently an AnsiString underlying type depending on how it is declared.  Any string declared using the [] operator (string[64]) will still be a ShortString, which is a length-prefixed AnsiChar array.  The UnicodeString payload will match that of the OS, which is UTF16.  This means that characters that fall outside the Basic Multilingual Plane (BMP) will be represented by surrogate pairs, so a single character can occupy two Char elements.
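
To make the surrogate-pair point concrete, here is a small sketch of the UTF16 arithmetic. It is written in Python only because that is easy to run; it is not Delphi code, and the helper name utf16_code_units is made up for this illustration. The encoding rule itself comes from the Unicode standard.

```python
def utf16_code_units(ch):
    """Return the 16-bit UTF-16 code units for one character (code point)."""
    cp = ord(ch)
    if cp < 0x10000:
        # Inside the BMP: one 16-bit code unit is enough.
        return [cp]
    # Outside the BMP: split the 20-bit offset into a high and a low surrogate.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]


bmp_char = "\u00E9"         # LATIN SMALL LETTER E WITH ACUTE, inside the BMP
astral_char = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, outside the BMP

print([hex(u) for u in utf16_code_units(bmp_char)])
print([hex(u) for u in utf16_code_units(astral_char)])
```

The practical consequence is that the number of 16-bit elements in a UTF16 string is not always the number of characters a user would count.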

Will I be able to still use the AnsiString type?

Yes.  No existing types are being taken away.

What about Char and PChar?

Char will be an alias for WideChar and PChar will be an alias for PWideChar.

Will I have to explicitly call the "W" versions of the Windows API?

For all the Windows API header translations that CodeGear provides, your code should not have to change to call the "W" version.  Since it has always been our intent to make this change at some point in the future, we have been specially processing the header translations (since Delphi 2 if you must know ;-) to ease this transition.  If you want more details on how we do this you can visit the JEDI website for guidelines on how to use these tools.  We'll be providing some updates for these tools in order to properly process a header to use the "W" versions.

Why didn't you just use UTF8?  It's more compact than UTF16.

This was considered.  However, this would have forced far more conversions throughout the VCL code as it talks to the Windows API, and it would have introduced a lot of very subtle breakages in much of user code.  While a lot of code out there already handles DBCS (Double-byte character sets), that same code does not correctly handle characters that consist of more than 2 bytes.  In UTF8 a single character can be represented by as many as 4 bytes.  [Correction: this originally said "as many as 6 bytes", which is not the case in true UTF8.  5 and 6 byte sequences are illegal in UTF8 (thanks Aleksander).]  Finally, UTF16 is the native format used internally by Windows itself.  By calling directly to the "W" APIs, the "A" translation layer that Windows has is bypassed and should, in theory, increase performance in some cases.
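
As a rough illustration of the size trade-off, the sketch below compares how many bytes a few sample characters take in UTF8 and in UTF16. It is Python rather than Delphi, used here only because it is easy to run, and the helper name encoded_sizes is an assumption of this example.

```python
def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    return (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))


samples = [
    "A",           # U+0041, ASCII
    "\u00E9",      # U+00E9, Latin-1 range
    "\u20AC",      # U+20AC, EURO SIGN
    "\U0001D11E",  # U+1D11E, outside the BMP
    "\U0010FFFF",  # highest code point Unicode allows
]

for ch in samples:
    utf8_bytes, utf16_bytes = encoded_sizes(ch)
    print("U+%06X  UTF-8: %d bytes  UTF-16: %d bytes" % (ord(ch), utf8_bytes, utf16_bytes))
```

UTF8 is smaller for ASCII-heavy text, but its characters range from 1 to 4 bytes, which is exactly the case that code written for double-byte character sets tends not to expect.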

OMG!!  All my code is going to break!  I can't handle this!!

Now hold on there.  Before you get your knickers in a knot, please take a moment to fully understand the impact of this change and how to best prepare for it today.  As we work on Tiburon, we've been capturing a lot of the common pitfalls and idioms many of you are likely to encounter.  We'll also be working on ways to get this information out to our customers.  Blogs, whitepapers, and other articles will be the vehicles by which we'll provide this information.  We do understand that there are some types of applications that will be affected more than others.  Many of you have written your own handy-dandy library of string processing functions and classes.  The top categories of things you'll need to watch out for are:

  • Assumptions about the size of Char.
  • Use of string as a storage medium for data other than character data.
  • SizeOf(Buffer) <> Length(Buffer) where Buffer: array[0..x] of Char;
  • File I/O (console I/O will still be down converted to Ansi data since it can be redirected)
  • set of Char; should be changed to set of AnsiChar;
    • You should also consider starting to use the new character classification functions in Tiburon.
  • If your code must still operate on Ansi string data, then simply be more explicit about it.  Change the declaration to AnsiString, AnsiChar, and PAnsiChar.  This can be done today and will recompile unchanged in Tiburon.
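
The SizeOf versus Length item in the list above is the easiest one to show with numbers. The sketch below models it in Python with ctypes arrays standing in for Delphi buffers; it is not Delphi code, and the names AnsiBuffer, WideBuffer and capacity_in_chars are invented for this illustration. The assumption is simply that an AnsiChar is 1 byte and a WideChar is 2 bytes.

```python
import ctypes

# Stand-ins for "array[0..63] of AnsiChar" and "array[0..63] of WideChar".
AnsiBuffer = ctypes.c_uint8 * 64
WideBuffer = ctypes.c_uint16 * 64


def capacity_in_chars(buf):
    """Number of elements the buffer holds: byte size divided by element size."""
    return ctypes.sizeof(buf) // ctypes.sizeof(buf._type_)


ansi = AnsiBuffer()
wide = WideBuffer()

# With 1-byte elements the byte size and the element count happen to be equal...
print(ctypes.sizeof(ansi), len(ansi), capacity_in_chars(ansi))

# ...with 2-byte elements the byte size is twice the element count.
print(ctypes.sizeof(wide), len(wide), capacity_in_chars(wide))
```

Code that passes a byte size where a character count is expected (or the other way round) works by coincidence while Char is 1 byte and overruns or truncates once Char is 2 bytes.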

What about the Windows 9x OS?

Not going to happen.  If you absolutely must continue to support those operating systems, RAD Studio 2007 is a great choice.  I realize this may not be a popular decision for some markets, but it is getting harder and harder to support an operating system that is barely even tacitly supported by MS themselves.  We've even looked into MSLU (Microsoft Layer for Unicode) and that is not a very viable option since in order to get it to work with Delphi we'd have to duplicate a lot of the code that is in the COFF based .LIB file that is provided only for VC++.  Yes there is the unicows.dll, but that is not where the "magic" happens.  So, Windows 2000 and newer will be the supported target platforms.

In the coming months, I'll try and show some common code constructs that will need to be modified along with a lot of common code that will just work either way.  It has been pleasantly surprising how much code works as the latter, and how easy it has been to get the former to behave like the latter.

Tags: CodeGear


Comments

  • Guest
    gabr Wednesday, 9 January 2008

    This sounds really interesting but I have big problems with

    "This new data type will be mapped to string which is currently an AnsiString underlying type depending on how it is declared."

    and

    "If your code must still operate on Ansi string data, then simply be more explicit about it. Change the declaration to AnsiString, AnsiChar, and PAnsiChar. This can be done today and will recompile unchanged in Tiburon."

    This is totally unacceptable for us (the company I'm working in). We have to support many applications with millions of source lines of code, some of which can still be compiled with BP7 (with the help of many IFDEFs, of course). There is no way this code can be cleaned up in time for Tiburon. And I'm totally sure it will break if string becomes a UTF-16 datatype.

    What we need is a compiler switch that will default to Ansi mode for existing applications and for Unicode mode for new applications. That way we can still support old code while we can start working from scratch on Unicode-supporting applications.

    I'm pretty sure that we will not upgrade to Tiburon if string will be aliased to UnicodeString.

  • Guest
    Michael Trowe Wednesday, 9 January 2008

    Hi,

    what I miss is a compiler-switch to change the mapping for the string-type.
    It would make sense to choose the old behaviour, where string is mapped to AnsiString. That allows smoother step-by-step migration of existing applications.

    Michael

  • Guest
    Bruce McGee Wednesday, 9 January 2008

    You mention support for Windows 2000 and later. How about NT4?

  • Guest
    Jan Derk Wednesday, 9 January 2008

    Thanks for the insight. For the first time since D7 I am actually excited about a new Delphi release.

    To convert my applications to Unicode I don't mind a little code breaking and it sure does not sound too bad.

    One good thing about getting us Unicode so late is that you do not have to support W98. Two years ago I would have screamed about it, but most of our customers are now on XP.

  • Guest
    Alan Clark Wednesday, 9 January 2008

    For those such as gabr above who say they cannot use unicode strings yet, would a text search and replace from
    : string;
    to
    : AnsiString;
    to change declarations perhaps work?

  • Guest
    Tom Miller Wednesday, 9 January 2008

    Glad to see you are finally not letting backwards compatibility hold the future hostage. I am very excited about the new features coming in Tiburon. It would be even more exciting if the Win64 compiler was included as a preview :-)

  • Guest
    Aleksander Oven Wednesday, 9 January 2008

    In UTF8 a single character can be represented by as many as 6 bytes.

    Not so according to http://www.utf-8.com, so I hope this is not really how Delphi's future UTF-8 algorithms are implemented.

    Maximum allowed byte span for a valid UTF-8 character is 4 bytes, with the following bit pattern:

    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

    This pattern of 21 free bits covers codepoints in range $010000-$1FFFFF, and together with the 1-byte, 2-byte and 3-byte patterns gives a total of 2,097,152 possible encoded characters.

    Not all of these are valid, though - some are reserved UTF-16 surrogate pairs, as defined by the Unicode standard. Any decent UTF-8 parsing algorithm should account for those, too.

  • Guest
    Adrien Reboisson Wednesday, 9 January 2008

    Great news ! However, as a lot of people, I've to support "old" programs where changing all String types to AnsiString would be painful... and also clearly a big waste of time. I seriously hope there will be a flag or a checkbox somewhere that will map automatically the String data type to the AnsiChar one. Seriously. I really want to design new Unicode apps, but backward compatibility is also very important for legacy apps. If nothing is done about this issue, I'll stick to D2007 for a looong time ;-)

  • Guest
    Roddy Wednesday, 9 January 2008

    I know you're not a C++ man, but what's going to happen to the automated HPP generation? Currently, Delphi "string" comes out as AnsiString in the HPPs, which is not terribly helpful. I assume that it will need to come out as String, and then String will be typedef'd to AnsiString or UnicodeString accordingly...?

  • Guest
    Robertus Maximus Wednesday, 9 January 2008

    Woohooo! We've been waiting for proper unicode support in Delphi for ages, good work Codegear! :)

  • Guest
    mike Wednesday, 9 January 2008

    People, are you realistic, asking for such things as inconsistent switch to unicode due to some issues with 15 year old products??

    I agree, there are still applications out in the market which still need to be supported and which need to run on 98. BUT they are not the majority, and these developers will still be able to use D7 or D2007.

    I'm convinced that this switch to Unicode and the decision to change/improve the VCL comes much too late, and that Codegear lost much of its advantages due this delay. We shouldn't try to delay this switch even more if we still hope that Delphi stays a semi-major player instead of degrading it to a niche product for some software relicts.

  • Guest
    Nick Hodges Wednesday, 9 January 2008

    DanB --

    It is not true that DevExpress has dropped C++Builder support. That's totally false.

    Nick

  • Guest
    Tobias Giesen Wednesday, 9 January 2008

    This will be exciting. I am using WideString a lot currently and I will be looking forward to UnicodeString. I'm sure you are making the right decisions in terms of breaking only so much existing code as necessary.

  • Guest
    ahmoy Wednesday, 9 January 2008

    I think some companies cannot simply just use a replace function to rename String to AnsiString since sometimes there is IFDEF inside the code to support legacy system.

    Option to turn on/off the string mapping to unicode will be nice if codegear have time to implement it in the new delphi.

    Of course this is not a problem if codegear push this problem to that company to write a parser to replace this.

  • Guest
    ahmoy Wednesday, 9 January 2008

    - C Johnson,

    "what is the correct datatype for an 8 bit ascii CHAR, if its not CHAR??"

    the 8 bit char is AnsiChar and the 8 bit pchar is PAnsiChar.

  • Guest
    Pavel S Wednesday, 9 January 2008

    We definitely need that compiler-switch to change the mapping for the string-type, leaving old applications with no need of Unicode alone.

  • Guest
    DanB Wednesday, 9 January 2008

    Nick

    I did not say they dropped all C++ builder support, but what I did say is true: Their latest VCL product, ExpressSkins, does not support C++Builder.

    Here is what DevExpress CTO Julian M Bucknall has to say about it on their forums:

    "We decided, at a late stage admittedly, not to support C++Builder with ExpressSkins in the first release"

    While he does not rule out adding support later on, he does say:

    "It does mean though that it is *unlikely* that we'll be adding support
    for C++Builder in our new VCL products. Not unless there's some drastic changes to the product and in the market."

    It sounds like the decision is based on a) the perception that the C++ Builder market is small and b) the compatibility problems that delphi and c++ have in the current product... and c) a lack of effort on codegear's part to help:

    "Another thought and then I'll go. I am the CTO of CodeGear's
    (arguably) largest third-party control partner. Have I received an
    email, a phone call, a visit from the new C++Builder Product Manager
    at CodeGear? That would be no. From Nick Hodges, his Delphi
    equivalent, sure. But from Alisdair Meredith? Complete silence.
    Reflect on that."

  • Guest
    Dimitry Timokhov Wednesday, 9 January 2008

    I'm sure there must be a compiler-switch to change the mapping for the string-type.

  • Guest
    Qian Xu Wednesday, 9 January 2008

    1. I am sure that there is a compiler-switch to change the mapping for the string-type, isn't there?

    2. Is it possible to partial declare string as AnsiString for some component libraries and partial declare string as UnicodeString for the rest code of a project. Because those libraries without source might not be compatible with UnicodeString.

    3. The name UnicodeString is unprofessional. But there seems to be no alternative choice?

  • Guest
    Qian Xu Wednesday, 9 January 2008

    A quick note to 22. comment:
    It looks really strange, when the string type is called Unicodestring and the Char type is called WideChar.
