Delphi in A Unicode World Updated
August 2008
Contents
CHAPTER I: WHAT IS UNICODE, WHY YOU NEED IT, AND HOW TO WORK WITH IT IN DELPHI 2009
  Introduction
  What is Unicode?
  Why Unicode?
  A Word about Terminology
  The New UnicodeString Type
  Conclusion
CHAPTER II: NEW RTL FEATURES AND CLASSES TO SUPPORT UNICODE
  Introduction
  TCharacter Class
  TEncoding Class
  TStringBuilder
  Declaring New String Types
  Additional RTL Support for Unicode
  StringElementSize
  StringCodePage
  Other RTL Features for Unicode
  SetCodePage
  Getting TBytes from Strings
  Conclusion
CHAPTER III: UNICODIFYING YOUR CODE
  Areas That Should Just Work
  General Use of String Types
  The Runtime Library
  The VCL
  String Indexing
  Length/Copy/Delete/SizeOf with Strings
  Pointer Arithmetic on PChar
  ShortString
  Areas That Should be Reviewed
  SaveToFile/LoadFromFile
  Use of the Chr Function
  Sets of Characters
  Using Strings as Data Buffers
  Calls to SizeOf on Buffers
  Use of FillChar
  Using Character Literals
  Calls to Move
  Read/ReadBuffer methods of TStream
  Write/WriteBuffer
  LeadBytes
  TMemoryStream
  TStringStream
  MultiByteToWideChar
  SysUtils.AppendStr
  GetProcAddress
  Use of PChar() casts to enable pointer arithmetic on non-char based pointer types
  Variant open array parameters
  CreateProcessW
  Passing in a string constant
  Passing in a constant expression
  Passing in a string with a Reference Count of -1
  Code to search for
APPENDICES
  Embarcadero and Partner Blog Entries about Unicode
  Embarcadero Developer Network Videos about Unicode
  Get off your ASCII and expand your business to global markets
  Migrating your Projects to Delphi 2009 - It's easy!
  Additional Sources of Delphi 2009 Information
  Additional Sources of Unicode Information
Embarcadero Technologies
INTRODUCTION
The Internet has broken down geographical barriers, enabling worldwide software
distribution. As a result, applications can no longer live in a purely ANSI-based environment.
The world has embraced Unicode as the standard means of transferring text and data. Because it
provides support for virtually any writing system in the world, Unicode text is now the norm
throughout the global technological ecosystem.
WHAT IS UNICODE?
Unicode is a character encoding scheme that allows virtually all alphabets to be encoded into a
single character set. Unicode allows computers to manage and represent text in most of the
world's writing systems. Unicode is managed by The Unicode Consortium and codified in a
standard. More simply put, Unicode is a system for enabling everyone to use each other's
alphabets. Heck, there is even a Unicode version of Klingon.
This article isn't meant to give you a full rundown of exactly what Unicode is and how it works;
instead it is meant to get you going on using Unicode within Delphi 2009. If you want a good
overview of Unicode, Joel Spolsky has a great article entitled "The Absolute Minimum Every
Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!)", which is highly recommended reading. As Joel clearly points out, "IT'S NOT THAT
HARD". This chapter will discuss why Unicode is important, and how Delphi will implement the
new UnicodeString type.
WHY UNICODE?
Among the many new features found in Delphi 2009 is the imbuing of Unicode throughout the
product. The default string in Delphi is now a Unicode-based string. Since Delphi is largely built
with Delphi, the IDE, the compiler, the RTL, and the VCL are all fully Unicode-enabled.
The move to Unicode in Delphi is a natural one. Windows itself is fully Unicode-aware, so it is
only natural that applications built for it use a Unicode string as the default string. And for
Delphi developers, the benefits don't stop merely at being able to use the same string type as
Windows.
The addition of Unicode support provides Delphi developers with a great opportunity. Delphi
developers can now read, write, accept, produce, display, and deal with Unicode data, and it's
all built right into the product. With only a few, or in some cases zero, code changes, your
applications can be ready for any kind of data you, your customers, or your users can throw at them.
Applications that were previously restricted to ANSI-encoded data can be easily modified to handle
almost any character set in the world.
Delphi developers will now be able to serve a global market with their applications -- even if
they don't do anything special to localize or internationalize their applications. Windows itself
supports many different localized versions, and Delphi applications need to be able to adapt
and work on machines running any of the large number of locales that Windows supports,
including the Japanese, Chinese, Greek, or Russian versions of Windows. Users of your software
may be entering non-ANSI text into your application or using non-ANSI based path names.
ANSI-based applications won't always work as desired in those scenarios. Windows applications
built with a fully Unicode-enabled Delphi will be able to handle and work in those situations.
Even if you don't translate your application into any other spoken languages, your application
still needs to be able to work properly -- no matter what the end user's locale is.
For existing ANSI-based Delphi applications, the opportunity to localize applications and
expand the reach of those applications into Unicode-based markets is potentially huge. And if
you do want to localize your applications, Delphi makes that very easy, especially now at
design-time. The Integrated Translation Environment (ITE) enables you to translate, compile,
and deploy an application right in the IDE. If you require external translation services, the IDE
can export your project in a form that translators can use in conjunction with the deployable
External Translation Manager. These tools work together with the Delphi IDE for both Delphi
and C++Builder to make localizing your applications a smooth and easy-to-manage process.
The world is Unicode-based, and now Delphi developers can be a part of that in a native,
organic way. So if you want to be able to handle Unicode data, or if you want to sell your
applications to emerging and global markets, you can do it with Delphi 2009.
either a Unicode-sized character, or an ANSI byte-sized character. (Note that both the
AnsiString and WideString types will remain in place.) The Char and PChar types will
map to WideChar and PWideChar, respectively. Note, as well, that no string types have
disappeared. All the types that developers are used to still exist and work as before.
However, for Delphi 2009, the default string type will be equivalent to UnicodeString. In
addition, the default Char type is WideChar, and the default PChar type is PWideChar.
That is, the following code is declared by the compiler:
type
  string = UnicodeString;
  Char = WideChar;
  PChar = PWideChar;
UnicodeString is assignment compatible with all other string types; however, assignments
between AnsiStrings and UnicodeStrings will do type conversions as appropriate. Thus,
an assignment of a UnicodeString type to an AnsiString type could result in data loss. That
is, if a UnicodeString contains high-order byte data, a conversion of that string to
AnsiString will result in a loss of that high-order byte data. The important thing to note here
is that this new UnicodeString behaves pretty much like strings always have (with the notable
exception of their ability to hold Unicode data, of course). You can still add any string data to
them, you can index them, you can concatenate them with the + sign, etc.
For example, instances of a UnicodeString will still be able to index characters. Consider the
following code:
var
  MyChar: Char;
  MyString: string;
begin
  MyString := 'This is a string';
  MyChar := MyString[1];
end;
The variable MyChar will still hold the character found at the first index position, i.e. 'T'. The
functionality of this code hasn't changed at all. Similarly, if we are handling Unicode data:
var
  MyChar: Char;
  MyString: string;
begin
  MyString := '世界您好';
  MyChar := MyString[1];
end;
The variable MyChar will still hold the character found at the first index position, i.e. '世'.
The RTL provides helper functions that enable users to do explicit conversions between
codepages and element size conversions. If the user is using the Move function on the
character array, they cannot make assumptions about the element size.
As you can imagine, this new string type has ramifications for existing code. With Unicode, it is
no longer true that one Char represents one byte. In fact, it isn't even always true that one
character is equal to two bytes: characters outside the Basic Multilingual Plane are stored as a
surrogate pair of two Chars. As a result, you may have to make some adjustments to your code.
However, we've worked very hard to make the transition a smooth one, and we are confident
that you'll be able to be up and running quite quickly.
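To make the point about element sizes concrete, here is a minimal console sketch. It assumes a Unicode-enabled compiler (Delphi 2009 or later) where string is UnicodeString; the program name and the choice of code point U+1D11E are illustrative only.

```pascal
program SurrogateDemo;
{$APPTYPE CONSOLE}

var
  S: string;
begin
  // U+1D11E lies outside the Basic Multilingual Plane, so UTF-16
  // stores it as the surrogate pair $D834 $DD1E (two Char elements).
  S := #$D834#$DD1E;
  WriteLn('Char elements: ', Length(S));                 // 2
  WriteLn('Bytes per Char: ', SizeOf(Char));             // 2
  WriteLn('Bytes of text: ', Length(S) * SizeOf(Char));  // 4
end.
```

Length counts Char elements rather than user-visible characters, so one character here occupies four bytes.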
Chapters II and III will discuss the new UnicodeString type further, talk about some of the new
features of the RTL that support Unicode enablement, and then discuss specific coding idioms
that you'll want to look for in your code. This series should help make your transition to Unicode
a smooth and painless endeavor.
CONCLUSION
With the addition of Unicode as the default string, Delphi can accept, process, and display
virtually any alphabet or code page in the world. Applications you build with Delphi 2009 will
be able to accept, display, and handle Unicode text with ease, and they will work much better
in almost any Windows locale. Delphi developers can now easily localize and translate their
applications to enter markets that have previously been more difficult to enter. It's a
Unicode world out there, and now your Delphi apps can live in it.
In Chapter II, we'll discuss the changes and updates to the Delphi Runtime Library that will
enable you to work easily with Unicode strings.
INTRODUCTION
In Chapter I, we saw how Unicode support is a huge benefit for Delphi developers by enabling
communication with all character sets in the Unicode universe. We saw the basics of the
UnicodeString type and how it will be used in Delphi.
In this chapter, we'll look at some of the new features of the Delphi Runtime Library that
support Unicode and general string handling.
TCHARACTER CLASS
The Tiburon RTL includes a new class called TCharacter, which is found in the Character
unit. It is a sealed class that consists entirely of static class functions. Developers should not
create instances of TCharacter, but rather merely call its static class methods directly. Those
class functions do a number of things, including:
The Character unit also contains a number of standalone functions that wrap up the
functionality of each class function from TCharacter, so if you prefer a simple function call,
the above can be written as:
uses
  Character;
begin
  if IsLetter(MyChar) then
  begin
    ...
  end;
end;
Thus the TCharacter class can be used to do almost any manipulation or checking of
characters that you might care to do.
In addition, TCharacter contains class methods to determine if a given character is a high or
low surrogate of a surrogate pair.
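The following console sketch shows how these checks might be combined to classify each Char in a string. It assumes the Delphi 2009-era Character unit and its TCharacter class methods IsLetter, IsDigit, IsHighSurrogate, and IsLowSurrogate; the program name and test string are made up for illustration.

```pascal
program CharacterDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

var
  S: string;
  I: Integer;
begin
  // 'A', '7', then U+1D11E encoded as the surrogate pair $D834 $DD1E
  S := 'A7'#$D834#$DD1E;
  for I := 1 to Length(S) do
  begin
    // Print the UTF-16 code unit, then its classification
    Write('U+', IntToHex(Ord(S[I]), 4), ': ');
    if TCharacter.IsLetter(S[I]) then
      WriteLn('letter')
    else if TCharacter.IsDigit(S[I]) then
      WriteLn('digit')
    else if TCharacter.IsHighSurrogate(S[I]) then
      WriteLn('high surrogate')
    else if TCharacter.IsLowSurrogate(S[I]) then
      WriteLn('low surrogate')
    else
      WriteLn('other');
  end;
end.
```

Code that walks a string one Char at a time can use the surrogate checks to avoid splitting a pair in the middle.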
TENCODING CLASS
The Tiburon RTL also includes a new class called TEncoding. Its purpose is to define a specific
type of character encoding so that you can tell the VCL what type of encoding you want used in
specific situations.
For instance, you may have a TStringList instance that contains text that you want to write
out to a file. Previously, you would have written:
begin
  ...
  MyStringList.SaveToFile('SomeFilename.txt');
  ...
end;
and the file would have been written out using the default ANSI encoding. That code will still
work fine; it will write out the file using ANSI string encoding as it always has. But now that
Delphi supports Unicode string data, developers may want to write out string data using a
specific encoding. Thus, SaveToFile (as well as LoadFromFile) now takes an optional second
parameter that defines the encoding to be used:
begin
  ...
  MyStringList.SaveToFile('SomeFilename.txt', TEncoding.Unicode);
  ...
end;
Execute the above code and the file will be written out as a Unicode (UTF-16) encoded text file.
TEncoding will also convert a given set of bytes from one encoding to another, retrieve
information about the bytes and/or characters in a given string or array of characters, convert
any string into an array of byte (TBytes), and other functionality that you may need with
regard to the specific encoding of a given string or array of chars.
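As a rough illustration of those conversions, the sketch below turns a string into UTF-16 bytes, converts those bytes to UTF-8, and decodes them again. It uses TEncoding.GetBytes, TEncoding.Convert, and TEncoding.GetString from SysUtils; the program name and sample text are illustrative.

```pascal
program EncodingDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  Utf16Bytes, Utf8Bytes: TBytes;
begin
  S := 'Delphi';
  // String -> UTF-16 bytes (two bytes per Char for this text)
  Utf16Bytes := TEncoding.Unicode.GetBytes(S);
  // UTF-16 bytes -> UTF-8 bytes (one byte per ASCII character)
  Utf8Bytes := TEncoding.Convert(TEncoding.Unicode, TEncoding.UTF8, Utf16Bytes);
  WriteLn('UTF-16 bytes: ', Length(Utf16Bytes));  // 12
  WriteLn('UTF-8 bytes: ', Length(Utf8Bytes));    // 6
  // UTF-8 bytes -> string again
  WriteLn(TEncoding.UTF8.GetString(Utf8Bytes));   // Delphi
end.
```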
The TEncoding class includes the following class properties that give you singleton access to a
TEncoding instance of the given encoding:
  class property ASCII: TEncoding read GetASCII;
  class property BigEndianUnicode: TEncoding read GetBigEndianUnicode;
  class property Default: TEncoding read GetDefault;
  class property Unicode: TEncoding read GetUnicode;
  class property UTF7: TEncoding read GetUTF7;
  class property UTF8: TEncoding read GetUTF8;
The Default property refers to the ANSI active codepage. The Unicode property refers to
UTF-16.
TEncoding also includes the
class function TEncoding.GetEncoding(CodePage: Integer): TEncoding;
that will return an instance of TEncoding that has the affinity for the code page passed in the
parameter.
In addition, it includes the following function:
function GetPreamble: TBytes;
which will return the correct BOM for the given encoding.
TEncoding is also interface compatible with the .NET class called Encoding.
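To see what GetPreamble returns in practice, the sketch below prints the byte order mark for three of the singleton encodings. ShowPreamble is a hypothetical helper written for this example, not part of the RTL.

```pascal
program PreambleDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: prints the preamble (BOM) bytes of an encoding in hex.
procedure ShowPreamble(const Name: string; Encoding: TEncoding);
var
  Bom: TBytes;
  I: Integer;
begin
  Bom := Encoding.GetPreamble;
  Write(Name, ': ');
  for I := 0 to High(Bom) do
    Write(IntToHex(Bom[I], 2), ' ');
  WriteLn;
end;

begin
  ShowPreamble('UTF-8', TEncoding.UTF8);                  // EF BB BF
  ShowPreamble('UTF-16 LE', TEncoding.Unicode);           // FF FE
  ShowPreamble('UTF-16 BE', TEncoding.BigEndianUnicode);  // FE FF
end.
```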
TSTRINGBUILDER
The RTL now includes a class called TStringBuilder. Its purpose is revealed in its name: it
is a class designed to build up strings. TStringBuilder contains a number of
overloaded functions for adding, replacing, and inserting content into a given string. The string
builder class makes it easy to create single strings out of a variety of different data types. All of
the Append, Insert, and Replace functions return an instance of TStringBuilder, so they
can easily be chained together to create a single string.
For example, you might choose to use a TStringBuilder in place of a complicated Format
statement. For instance, you might write the following code:
procedure TForm86.Button2Click(Sender: TObject);
var
  MyStringBuilder: TStringBuilder;
  Price: Double;
begin
  MyStringBuilder := TStringBuilder.Create('');
  try
    Price := 1.49;
    Label1.Caption := MyStringBuilder.Append('The apples are $')
      .Append(Price)
      .Append(' a pound.')
      .ToString;
  finally
    MyStringBuilder.Free;
  end;
end;
TStringBuilder is also interface compatible with the .NET class called StringBuilder.
And the new String type will be a string with an affinity for the Cyrillic code page.
STRINGELEMENTSIZE
StringElementSize returns the typical size in bytes of an element (code unit) in a given string.
Consider the following code:
procedure TForm88.Button3Click(Sender: TObject);
var
  A: AnsiString;
  U: UnicodeString;
begin
  A := 'This is an AnsiString';
  Memo1.Lines.Add('The ElementSize for an AnsiString is: ' +
    IntToStr(StringElementSize(A)));
  U := 'This is a UnicodeString';
  Memo1.Lines.Add('The ElementSize for a UnicodeString is: ' +
    IntToStr(StringElementSize(U)));
end;
STRINGCODEPAGE
StringCodePage will return the Word value that corresponds to the codepage for a given
string.
Consider the following code:
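A minimal console sketch of StringCodePage follows. It is an illustration written for this section, not the original listing; the CyrillicString type name and the program name are assumptions, and the value reported for a plain AnsiString depends on the system's ANSI code page.

```pascal
program CodePageDemo;
{$APPTYPE CONSOLE}

type
  // A string type with an affinity for code page 1251 (Cyrillic)
  CyrillicString = type AnsiString(1251);

var
  A: AnsiString;
  U: UnicodeString;
  U8: UTF8String;
  C: CyrillicString;
begin
  A := 'abc';
  U := 'abc';
  U8 := 'abc';
  C := 'abc';
  WriteLn('AnsiString: ', StringCodePage(A));      // system ANSI code page, e.g. 1252
  WriteLn('UnicodeString: ', StringCodePage(U));   // 1200 (UTF-16)
  WriteLn('UTF8String: ', StringCodePage(U8));     // 65001
  WriteLn('CyrillicString: ', StringCodePage(C));  // 1251
end.
```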
In addition, the RTL also declares a type called RawByteString, which is a string type with no
encoding affiliated with it:
RawByteString = type AnsiString($FFFF);
The purpose of the RawByteString type is to enable the passing of string data of any code
page without doing any codepage conversions. This is most useful for routines that do not care
about specific encoding, such as byte-oriented string searches. Normally, this would mean that
parameters of routines that process strings without regard for the string's code page should be
of type RawByteString. Declaring variables of type RawByteString should rarely, if ever,
be done, as this can lead to undefined behavior and potential data loss.
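The sketch below shows the kind of byte-oriented routine that suits a RawByteString parameter. CountByte is a hypothetical helper written for this example; because its parameter is a RawByteString, an AnsiString and a UTF8String can both be passed without any code page conversion.

```pascal
program RawByteDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts occurrences of a byte value without caring
// which code page the string carries.
function CountByte(const S: RawByteString; B: Byte): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if Ord(S[I]) = B then
      Inc(Result);
end;

var
  A: AnsiString;
  U8: UTF8String;
begin
  A := 'a,b,c';
  U8 := 'x,y';
  WriteLn(CountByte(A, Ord(',')));   // 2
  WriteLn(CountByte(U8, Ord(',')));  // 1
end.
```

Note that only the parameter is a RawByteString; the variables keep their concrete string types, in line with the advice above.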
In general, string types are assignment compatible with each other.
For instance:
MyUnicodeString := MyAnsiString;
will perform as expected: it will take the contents of the AnsiString and place them into a
UnicodeString. You should in general be able to assign one string type to another, and the
compiler will do the work needed to make the conversions, if possible.
Some conversions, however, can result in data loss, and one must watch out for this when moving
from one string type that includes Unicode data to another that does not. For instance, you can
assign a UnicodeString to an AnsiString, but if the UnicodeString contains characters
that have no mapping in the active ANSI code page at runtime, those characters will be lost in
the conversion. Consider the following code:
procedure TForm88.Button4Click(Sender: TObject);
var
  U: UnicodeString;
  A: AnsiString;
begin
  U := 'This is a UnicodeString';
  A := U;
  Memo1.Lines.Add(A);
  U := 'Добро пожаловать в мир Юникода с использованием Дельфи 2009!!';
  A := U;
  Memo1.Lines.Add(A);
end;
The output of the above when the current OS code page is 1252 is:
This is a UnicodeString
????? ?????????? ? ??? ??????? ? ?????????????? ?????? 2009!!
As you can see, because Cyrillic characters have no mapping in Windows-1252, information was
lost when assigning this UnicodeString to an AnsiString. The UnicodeString contained
characters not representable in the code page of the AnsiString, so those characters were
replaced by question marks in the assignment and the result was gibberish.
SETCODEPAGE
SetCodePage, declared in the System.pas unit as
procedure SetCodePage(var S: RawByteString; CodePage: Word; Convert: Boolean = True);
is a new RTL procedure that sets a new code page for a given AnsiString. The optional
Convert parameter determines if the payload itself of the string should be converted to the
given code page. If the Convert parameter is False, then the code page for the string is
merely altered. If the Convert parameter is True, then the payload of the passed string will be
converted to the given code page.
SetCodePage should be used sparingly and with great care. Note that if the codepage
doesn't actually match the existing payload (i.e. Convert is set to False), then unpredictable
results can occur. Also, if the existing data in the string is converted and the new codepage
doesn't have a representation for a given original character, data loss can occur.
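As a small sketch of the converting case, the program below re-encodes an AnsiString as UTF-8 and checks the code page afterwards. It assumes the parameter is declared as var S: RawByteString, hence the cast; the ASCII-only sample text is chosen so that no characters can be lost in the conversion.

```pascal
program SetCodePageDemo;
{$APPTYPE CONSOLE}

var
  A: AnsiString;
begin
  A := 'plain ASCII text';
  WriteLn('Before: ', StringCodePage(A));  // system ANSI code page, e.g. 1252
  // Convert = True: the payload is re-encoded and the string is tagged as UTF-8.
  // The cast assumes the parameter is declared as var S: RawByteString.
  SetCodePage(RawByteString(A), 65001, True);
  WriteLn('After: ', StringCodePage(A));   // 65001
  WriteLn('Length: ', Length(A));          // 16, ASCII bytes are unchanged
end.
```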
CONCLUSION
Delphi 2009's Runtime Library is now completely capable of supporting the new UnicodeString.
It includes new classes and routines for handling, processing, and converting Unicode strings,
for managing codepages, and for ensuring an easy migration from earlier versions.
In Chapter III, we'll cover the specific code constructs that you'll need to look out for in ensuring
that your code is Unicode ready.
should be reviewed to ensure that those assumptions are not persisted in code. Code that
writes to or reads from persistent storage needs to ensure that the correct number of bytes is
being read or written, as a single byte no longer represents a single character.
Generally, any needed code changes should be straightforward and can be done with a
minimal amount of effort.
THE VCL
The entire VCL is Unicode aware. All existing VCL components work right out of the box just as
they always have. The vast majority of your code using the VCL should continue to work as
normal. We've done a lot of work to ensure that the VCL is both Unicode ready and backwards
compatible. Normal VCL code that doesn't do any specific string manipulation will work as
before.
STRING INDEXING
String indexing works exactly as before, and code that indexes into strings doesn't need to be
changed:
var
  S: string;
  C: Char;
begin
  S := 'This is a string';
  C := S[1]; // C will hold 'T', but of course C is a WideChar
end;
This code will work exactly the same as with previous versions of Delphi, but of course the
types are different: PChar is now a PWideChar and MyString is now a UnicodeString.
SHORTSTRING
ShortString remains unchanged in both functionality and declaration, and will work just as
before.
ShortString declarations allocate a buffer for a specific number of AnsiChars. Consider the
following code:
var
  S: string[26];
begin
  S := 'abcdefghijklmnopqrstuvwxyz';
  WriteLn('Length = ', Length(S));
  WriteLn('SizeOf = ', SizeOf(S));
  WriteLn('TotalBytes = ', Length(S) * SizeOf(S[1]));
  ReadLn;
end.
Note that the total bytes of the alphabet is 26, showing that the variable is holding
AnsiChars.
In addition, consider the following code:
type
  TMyRecord = record
    String1: string[20];
    String2: string[15];
  end;
This record will be laid out in memory exactly as before: it will be a record of two
ShortStrings with AnsiChars in them. If you've got a file of records containing short
strings, then the above code will work as before, and any code reading and writing such a
record will work as before with no changes.
However, remember that Char is now a WideChar, so if you have some code that grabs those
records out of a file and then calls something like:
var
  MyRec: TMyRecord;
  SomeChar: Char;
begin
  // Grab MyRec from a file...
  SomeChar := MyRec.String1[3];
  ...
end;
then you need to remember that the assignment to SomeChar will convert the AnsiChar in
String1[3] to a WideChar. If you want this code to work as before, change the declaration of
SomeChar:
var
  MyRec: TMyRecord;
  SomeChar: AnsiChar; // Now declared as an AnsiChar for the shortstring index
begin
  // Grab MyRec from a file...
  SomeChar := MyRec.String1[3];
  ...
end;
SAVETOFILE/LOADFROMFILE
SaveToFile and LoadFromFile calls could very well go under the "Just Works" section
above, as these calls will read and write just as they did before. However, you may want to
consider using the new overloaded versions of these calls if you are going to be dealing with
Unicode data when using them.
For instance, TStrings now includes the following set of overloaded methods:
procedure SaveToFile(const FileName: string); overload; virtual;
procedure SaveToFile(const FileName: string; Encoding: TEncoding); overload; virtual;
The second method above is the new overload that includes an encoding parameter that
determines how the data will be written out to the file. (See above for an explanation of the
TEncoding type.) If you call the first method above, the string data will be saved as it always
has been: as ANSI data. Therefore, your existing code will work exactly as it always has.
However, if you put some Unicode string data into the text to be written out, you will need to
use the second overload, passing a specific TEncoding type. If you do not, the strings will be
written out as ANSI data, and data loss will likely result.
Therefore, the best idea here would be to review your SaveToFile and LoadFromFile calls,
and add a second parameter to them to indicate how you'd like your data saved. If you don't
think you'll ever be adding or using Unicode strings, though, you can leave things as they are.
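A round trip through the encoding-aware overloads might look like the sketch below. The file name unicode_demo.txt and its location next to the executable are assumptions made for this example; UTF-8 is used here, but any TEncoding instance that can represent the data would do.

```pascal
program SaveLoadDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  Lines: TStringList;
  FileName: string;
begin
  // Hypothetical file name, written next to the executable
  FileName := ExtractFilePath(ParamStr(0)) + 'unicode_demo.txt';
  Lines := TStringList.Create;
  try
    // #$20AC is the Euro sign, which not every ANSI code page can represent
    Lines.Add('Euro sign: ' + #$20AC);
    Lines.SaveToFile(FileName, TEncoding.UTF8);
    Lines.Clear;
    Lines.LoadFromFile(FileName, TEncoding.UTF8);
    // The Euro sign is the 12th character of the line
    WriteLn('Code point read back: U+', IntToHex(Ord(Lines[0][12]), 4));  // U+20AC
  finally
    Lines.Free;
  end;
end.
```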
If code using the Chr function is assigning the result to an AnsiChar, then this error can easily be
removed by replacing the Chr function with a cast to AnsiChar.
So, this code
  MyChar := chr(i);
can be changed to
  MyChar := AnsiChar(i);
SETS OF CHARACTERS
Probably the most common code idiom that will draw the attention of the compiler is the use of
characters in sets. In the past, a character was one byte, so holding characters in a set was no
problem. But now, Char is declared as a WideChar, and thus cannot be held in a set any
longer. So, if you have some code that looks like this:
procedure TDemoForm.Button1Click(Sender: TObject);
var
  C: Char;
begin
  C := Edit1.Text[1];
  if C in ['a'..'z', 'A'..'Z'] then
  begin
    Label1.Caption := 'It is there';
  end;
end;
and you compile it, you'll get a warning that looks something like this:
[DCC Warning] Unit1.pas(40): W1050 WideChar reduced to byte char in set expressions. Consider using 'CharInSet' function in 'SysUtils' unit.
You can, if you like, leave the code that way; the compiler will know what you are trying to
do and generate the correct code. However, if you want to get rid of the warning, you can use
the new CharInSet function:
if CharInSet(C, ['a'..'z', 'A'..'Z']) then
begin
  Label1.Caption := 'It is there';
end;
The CharInSet function will return a Boolean value, and compile without the compiler
warning.
In the above code, Length will return the number of characters in the given string (plus the null
termination character), but SizeOf will return the total number of Bytes used by the array, in
this case 34, i.e. two bytes per character. In previous versions, this code would have returned 17
for both.
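The figures quoted above can be reproduced with a sketch like the following. The 17-element buffer declaration is an assumption chosen to match the numbers in the text, not the original listing.

```pascal
program LengthVsSizeOf;
{$APPTYPE CONSOLE}

var
  // Assumed declaration: 17 elements, so 17 characters in total.
  Buffer: array[0..16] of Char;
begin
  // Number of Char elements in the array.
  Writeln('Length: ', Length(Buffer));
  // Number of bytes: Length * SizeOf(Char).
  Writeln('SizeOf: ', SizeOf(Buffer));
end.
```

With Char declared as WideChar, the second value is twice the first; under an ANSI compiler both values would be equal.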
USE OF FILLCHAR
Calls to FillChar need to be reviewed when used in conjunction with strings or a character.
Consider the following code:
var
  Count: Integer;
  Buffer: array[0..255] of Char;
begin
  // Existing code - incorrect when string = UnicodeString
  Count := Length(Buffer);
  FillChar(Buffer, Count, 0);
  // Correct for Unicode - either one will be correct
  Count := SizeOf(Buffer);                // <<-- Specify buffer size in bytes
  Count := Length(Buffer) * SizeOf(Char); // <<-- Specify buffer size in bytes
  FillChar(Buffer, Count, 0);
end;
Length returns the size in characters but FillChar expects Count to be in bytes. In this case,
SizeOf should be used instead of Length (or Length needs to be multiplied by the size of
Char).
In addition, because SizeOf(Char) is now 2, FillChar no longer fills a character buffer one
Char at a time: it still writes individual bytes, so each Char ends up holding two copies of
the fill byte.
Example:
var
  Buf: array[0..32] of Char;
begin
  FillChar(Buf, Length(Buf), #9);
end;
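This doesn't fill the array with code point $09 but with code point $0909 (and, because the
count is given in characters rather than bytes, only about half of the array is touched). In
order to get the expected result the code needs to be changed to: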
var
  Buf: array[0..32] of Char;
begin
  ..
  // Length(Buf) - 1 leaves room for the null terminator that StrPCopy appends
  StrPCopy(Buf, StringOfChar(#9, Length(Buf) - 1));
  ..
end;
will recognize the Euro symbol and thus evaluate to True in most ANSI codepages. However, it
will evaluate to False in Delphi 2009 because while #128 is the euro sign in most ANSI code
pages, it is a control character in Unicode. In Unicode, the Euro symbol is #$20AC.
Developers should replace any characters in the range #128-#255 with literals when
converting to Delphi 2009, since:
if Edit1.Text[1] = '€' then
will work the same as #128 in ANSI, but also work (i.e., recognize the Euro) in Delphi 2009
(where '€' is #$20AC).
CALLS TO MOVE
Calls to Move need to be reviewed when strings or character arrays are used. Consider the
following code:
var
  Count: Integer;
  Buf1, Buf2: array[0..255] of Char;
begin
  // Existing code - incorrect when string = UnicodeString
  Count := Length(Buf1);
  Move(Buf1, Buf2, Count);
  // Correct for Unicode
  Count := SizeOf(Buf1);   // <<-- Specify buffer size in bytes
  Move(Buf1, Buf2, Count);
end;
Length returns the size in characters but Move expects Count to be in bytes. In this case,
SizeOf should be used instead of Length (or Length needs to be multiplied by the size of
Char).
Note: The solution depends on the format of the data being read. See the new TEncoding
class described above to assist in properly encoding the text in the stream.
WRITE/WRITEBUFFER
As with Read/ReadBuffer, calls to TStream.Write/WriteBuffer need to be reviewed
when strings or character arrays are used. Consider the following code:
var
  S: string;
  Stream: TStream;
  Temp: AnsiString;
begin
  // Existing code - incorrect when string = UnicodeString
  Stream.Write(Pointer(S)^, Length(S));
  // Correct for Unicode data
  Stream.Write(Pointer(S)^, Length(S) * SizeOf(Char)); // <<-- Specify buffer size in bytes
  // Correct for Ansi data
  Temp := S; // <<-- Use temporary AnsiString
  Stream.Write(Pointer(Temp)^, Length(Temp) * SizeOf(AnsiChar)); // <<-- Specify buffer size in bytes
end;
Note: The solution depends on the format of the data being written. See the new TEncoding
class described above to assist in properly encoding the text in the stream.
LEADBYTES
Replace calls like this:
if Str[I] in LeadBytes then
with a call to the IsLeadChar function:
if IsLeadChar(Str[I]) then
TMEMORYSTREAM
In cases where a TMemoryStream is being used to write out a text file, it will be useful to write
out a Byte Order Mark (BOM) as the first entry in the file. Here is an example of writing the
BOM to the file:
var
  BOM: TBytes;
begin
  ...
  BOM := TEncoding.UTF8.GetPreamble;
  Write(BOM[0], Length(BOM));
All writing code will need to be changed to UTF-8 encode the Unicode string:
var
  Temp: Utf8String;
begin
  ...
  Temp := Utf8Encode(Str); // <-- Str is the string being written out to the file.
  Write(Pointer(Temp)^, Length(Temp));
  //Write(Pointer(Str)^, Length(Str)); <-- this is the original call to write the string to the file.
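The two fragments above can be combined into one self-contained sketch. The helper name WriteUtf8Text and the output file name demo_bom.txt are hypothetical, introduced only for illustration.

```pascal
program WriteUtf8WithBom;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: writes a UTF-8 BOM followed by Str encoded as UTF-8.
procedure WriteUtf8Text(Stream: TStream; const Str: string);
var
  BOM: TBytes;
  Temp: UTF8String;
begin
  BOM := TEncoding.UTF8.GetPreamble;
  if Length(BOM) > 0 then
    Stream.WriteBuffer(BOM[0], Length(BOM));
  Temp := UTF8Encode(Str);
  // Length(Temp) is already a byte count, because UTF8String elements are bytes.
  if Temp <> '' then
    Stream.WriteBuffer(Pointer(Temp)^, Length(Temp));
end;

var
  MS: TMemoryStream;
begin
  MS := TMemoryStream.Create;
  try
    WriteUtf8Text(MS, 'abc');
    // 3 BOM bytes + 3 text bytes for this ASCII-only input.
    Writeln('Bytes written: ', MS.Size);
    MS.SaveToFile('demo_bom.txt');
  finally
    MS.Free;
  end;
end.
```

The guards against empty data are a defensive choice in this sketch; indexing BOM[0] or dereferencing a nil string pointer would otherwise be unsafe for encodings without a preamble or for empty strings.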
TSTRINGSTREAM
TStringStream now descends from a new type, TBytesStream. TBytesStream adds a
property named Bytes, which allows direct access to the bytes within a TStringStream.
TStringStream works as it always has, with the exception that the string it holds is a
Unicode-based string.
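As a hedged sketch of the Bytes property, the following assumes the TStringStream constructor overload that accepts an encoding and an ownership flag. The sample text is arbitrary.

```pascal
program StringStreamBytes;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  SS: TStringStream;
  I: Integer;
begin
  // False: the stream does not take ownership of the shared UTF8 encoding object.
  SS := TStringStream.Create('abc', TEncoding.UTF8, False);
  try
    // The Bytes array may be larger than the data, so Size bounds the loop.
    for I := 0 to Integer(SS.Size) - 1 do
      Write(SS.Bytes[I], ' ');
    Writeln;
    // DataString decodes the bytes back into a Unicode string.
    Writeln(SS.DataString);
  finally
    SS.Free;
  end;
end.
```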
MULTIBYTETOWIDECHAR
Calls to MultiByteToWideChar can simply be removed and replaced with a simple
assignment. An example when using MultiByteToWideChar:
procedure TWideCharStrList.AddString(const S: string);
var
  Size, D: Integer;
begin
  Size := Length(S); // character count; SizeOf(S) would only be the pointer size
  D := (Size + 1) * SizeOf(WideChar);
  FList[FUsed] := AllocMem(D);
  MultiByteToWideChar(0, 0, PChar(S), Size, FList[FUsed], D);
  Inc(FUsed);
end;
And after the change to Unicode, this call was changed to support compiling under both ANSI
and Unicode:
procedure TWideCharStrList.AddString(const S: string);
begin
  FList[FUsed] := StrNew(PWideChar(S));
  Inc(FUsed);
end;
SYSUTILS.APPENDSTR
This method is deprecated, and as such, is hard-coded to use AnsiString and no
UnicodeString overload is available.
Replace calls like this:
AppendStr(String1, String2);
with code like this:
String1 := String1 + String2;
Or, better yet, use the new TStringBuilder class to concatenate strings.
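A minimal sketch of both alternatives, assuming two arbitrary sample strings: plain concatenation for a one-off append, and TStringBuilder where many pieces are joined.

```pascal
program AppendAlternatives;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  String1, String2: string;
  Builder: TStringBuilder;
begin
  String1 := 'Hello, ';
  String2 := 'Unicode';

  // Direct replacement for AppendStr(String1, String2).
  String1 := String1 + String2;

  // TStringBuilder variant; Append returns the builder, so calls can be chained.
  Builder := TStringBuilder.Create;
  try
    Builder.Append('Hello, ').Append(String2);
    // Both approaches should yield the same text.
    Writeln(Builder.ToString = String1);
  finally
    Builder.Free;
  end;
end.
```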
GETPROCADDRESS
Calls to GetProcAddress should always use PAnsiChar (there is no W-suffixed function in
the SDK). For example:
Note: Windows.pas will provide an overloaded method that will do this conversion.
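A minimal sketch of such a call, assuming a hypothetical wrapper named FindProc and using kernel32.dll with GetTickCount purely as a convenient, always-loaded example:

```pascal
program GetProcDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical wrapper: converts the Unicode name explicitly before the call.
function FindProc(Module: HMODULE; const ProcName: string): Pointer;
begin
  // Export names are ANSI, so cast to AnsiString first, then to PAnsiChar.
  Result := GetProcAddress(Module, PAnsiChar(AnsiString(ProcName)));
end;

var
  Kernel: HMODULE;
begin
  Kernel := GetModuleHandle('kernel32.dll');
  Writeln('Found: ', Assigned(FindProc(Kernel, 'GetTickCount')));
end.
```

The explicit AnsiString cast makes the narrowing conversion visible and avoids the implicit string cast warning.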
In the above snippet, Node is not actually character data. It is being cast to a PChar merely for
the purpose of using pointer arithmetic to access data that is a certain number of bytes after
Node. This worked previously because SizeOf(Char) = SizeOf(Byte). This is no longer true,
and to ensure the code remains correct, it needs to be changed to use PByte rather than
PChar. Without the change, Result will end up pointing to the incorrect data.
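The PByte approach can be sketched as follows. The record layout and the PayloadPtr helper are hypothetical stand-ins for the node structure discussed above, and {$POINTERMATH ON} is enabled explicitly so the arithmetic does not depend on how PByte is declared.

```pascal
program PByteOffset;
{$APPTYPE CONSOLE}
{$POINTERMATH ON}

type
  // Hypothetical node: a header followed by the data of interest.
  TNode = packed record
    Header: Integer;
    Payload: Integer;
  end;
  PNode = ^TNode;

// Returns a pointer to the data located SizeOf(Integer) bytes after Node.
function PayloadPtr(Node: PNode): PInteger;
begin
  // PByte arithmetic advances in bytes; PChar would now advance two bytes per step.
  Result := PInteger(PByte(Node) + SizeOf(Integer));
end;

var
  N: TNode;
begin
  N.Header := 1;
  N.Payload := 42;
  Writeln(PayloadPtr(@N)^); // reads N.Payload through the byte offset
end.
```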
CREATEPROCESSW
The Unicode version of CreateProcess (CreateProcessW) behaves slightly differently than
the ANSI version. To quote MSDN in reference to the lpCommandLine parameter:
"The Unicode version of this function, CreateProcessW, can modify the contents of this
string. Therefore, this parameter cannot be a pointer to read-only memory (such as a const
variable or a literal string). If this parameter is a constant string, the function may cause an
access violation."
Because of this, some existing code that calls CreateProcess may start giving Access
Violations when compiled in Delphi 2009.
Examples of problematic code:
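As a hedged illustration of the pattern the MSDN quote warns about, the sketch below shows a risky call in a comment and one commonly used workaround: copying the command line into a local string and calling UniqueString so that the buffer is writable. The RunCommand helper and the command line are hypothetical.

```pascal
program CreateProcDemo;
{$APPTYPE CONSOLE}

uses
  Windows, SysUtils;

// Hypothetical helper: launches CommandLine using a writable copy of the text.
function RunCommand(const CommandLine: string): Boolean;
var
  Cmd: string;
  StartInfo: TStartupInfo;
  ProcInfo: TProcessInformation;
begin
  // Risky: CreateProcess(nil, 'cmd.exe /c exit', ...) passes read-only memory.
  Cmd := CommandLine;
  UniqueString(Cmd); // forces a private, writable copy of the string data
  FillChar(StartInfo, SizeOf(StartInfo), 0);
  StartInfo.cb := SizeOf(StartInfo);
  Result := CreateProcess(nil, PChar(Cmd), nil, nil, False, 0, nil, nil,
    StartInfo, ProcInfo);
  if Result then
  begin
    CloseHandle(ProcInfo.hThread);
    CloseHandle(ProcInfo.hProcess);
  end;
end;

begin
  Writeln('Started: ', RunCommand('cmd.exe /c exit'));
end.
```

A plain assignment from a constant is not enough on its own, because the local variable can still share the constant's read-only data; UniqueString is what guarantees a separate copy.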
Search for any uses of "of Char" or "of AnsiChar" to ensure that the buffers are used
correctly for Unicode.
Search for instances of "string[" to ensure that the characters referenced are
placed into Chars (i.e. WideChar).
Check for the explicit use of AnsiString, AnsiChar, and PAnsiChar to see if it is
still necessary and correct.
Search for explicit use of ShortString to see if it is still necessary and correct.
Search for Length( to ensure that it isn't assuming that Length is the same as SizeOf.
Search for Copy(, Seek(, Pointer(, AllocMem(, and GetMem( to ensure that they are
correctly operating on strings or arrays of Chars.
They represent code constructs that could potentially need to be changed to support the new
UnicodeString type.
CONCLUSION
So that sums up the types of code idioms you need to review for correctness in the Unicode
world. In general, most of your code should work. Most of the warnings your code will receive
can be easily fixed up. Most of the code patterns you'll need to review are generally
uncommon, so it is likely that much if not all of your existing code will work just fine.
APPENDICES
EMBARCADERO AND PARTNER BLOG ENTRIES ABOUT UNICODE
The Unicode Shift
http://blogs.embarcadero.com/nickhodges/2008/03/24/39041
Unicode Character Categorization
http://blogs.embarcadero.com/abauer/2008/01/11/38848
Meanwhile, back at the (Unicode) ranch
http://blogs.embarcadero.com/abauer/2008/01/28/38853
DPL & Unicode - a toss up
http://blogs.embarcadero.com/abauer/2008/01/09/38845
Tiburon's LoadFromFile and SaveToFile for Unicode characters
http://blogs.embarcadero.com/davidi/2008/07/15/38898
Delphi 2009 - Unicode in Type Libraries
http://chrisbensen.blogspot.com/2008/09/delphi-2009-unicode-in-type-libraries.html
Unicode database support in Tiburon for Delphi and C++Builder 2009
http://blogs.embarcadero.com/davidi/2008/07/15/38895
Delphi 2009 and Unicode
http://www.jacobthurman.com/?p=30
Delphi 2009 Unicode videos from Marco Cantu
http://blog.digivendo.com/2008/09/delphi-d2009-unicode-videos-from-marco-cantu/
Why you need Delphi 2009
http://compaspascal.blogspot.com/2008/09/why-you-need-delphi-2009.html
Delphi 2009 TStringBuilder (Recap and Benchmark)
http://www.monien.net/blog/index.php/2008/10/delphi-2009-tstringbuilder/
Download
http://windemo1.codegear.com/Tiburon/LaunchReplays/ASCIInew.zip