Globalization Feature Development Specification in CDT 2.0
Internationalization, Localization and Globalization Feature Design Specification.
Author : Tanya Wolff
Revision Date : 16/01/2004 - Version: 0.1.0
Change History : 0.1.0 - Initial Draft

Table of Contents

1. Introduction
2. Design Tasks
    2.1 Externalize Strings

    2.2 Externalize Non-Text Objects
    2.3 Unicode Support
    2.4 GUI Components
    2.5 Bidi Algorithm
    2.6 Sorting
    2.7 Searching
    2.8 Text Files
    2.9 Multilingual Development Tools
    2.10 Packaging
    2.11 Formatting
3. Deviations

4. Glossary

5. References

Appendix A BiDi Test

Appendix B I18N Search 



1. Introduction

Globalization (G11N) of a product occurs only when Internationalization (I18N) has been achieved by development, localization (L10N) has been completed by translation centers, and the product has been packaged as a single executable ready to be used in any Baseline Requirement language, in any Basic Support country, with any Basic Support locale setting. In essence, G11N = I18N + L10N + Multilingual Support.

Internationalization is the responsibility of developers, regardless of whether the product will be translated. Users in different countries may have different preferences for how numbers and text are displayed, which code pages they type in, which IMEs (Input Method Editors) and printers they use, how they expect output to be sorted, and which strings are treated as equivalent in searches, including ignorable or equivalent characters in different code sets. These preferences have a default setting in the locale installed with the operating system, but users may also change this locale preference through the control panel's regional settings. I18N goes far beyond preparing strings for translation. Externalizing strings from the executable code is part of the process of creating a localization pack: requirement 5 of the 8 Globalization Architectural Imperatives (GAI) requirements, where GAI is one of the 5 Baseline Requirements outlined in the Globalization White Paper. 

 
2. Design Tasks
 

This section describes the development tasks for internationalization of CDT. Every item is P1 since without these, globalization cannot be achieved.

 

2.1     Externalize Strings

In order for a product to be translated, all visible strings must be extracted from the executable code and collected into resource bundles. In Java we use <bundle>.properties to collect the English strings. Later, the translated strings go into <bundle>_xx.properties, where xx is the language code. A localization pack is a collection of all the <bundle>_xx.properties files for a particular language, as well as any images, colors and sounds. The requirement to make the localization pack pluggable implies it can be built and installed without rebuilding or reinstalling the base product, and is not bound to the base product until runtime. Also, localization packs for multiple languages can coexist and be used by the base product in any region.

Action Items  

Content that must be extracted can appear in XML files as well as in source code. Perform the following actions to ensure extraction is complete.

1.      If there are any visible strings in the plugin.xml file, create a plugin.properties file containing each string and a key to identify it. Replace the string in the plugin.xml with %<key>.

E.g.

plugin.properties:

myplugin.name = Managed Build

plugin.xml:

<wizard name="%myplugin.name"/>

2.      Externalize strings in java source using Eclipse's "find strings to externalize" source tool: at the plugin level, right-click on the plugin in the navigator and select Source->Find Strings to Externalize; at the file level, right-click on the file in the navigator and select Source->Externalize Strings. Alternatively, this task can be done manually as follows:

a.       For each plugin, create property files organized according to GUI categories. Invent logical key names for each string with dot separated category prefixes. The convention is <classname>.<qualifier>. Do not use computed key names. The property files are loaded into memory as needed and remain as an instance of a ResourceBundle, so the initial loading takes time and space directly proportional to the size of the property file.

b.      Create a dedicated accessor class to the Resource Bundle which includes getString and getFormattedString as static methods.

ContextIds.java:

public class ContextIds {

        private static final String BUNDLE_NAME = "org.eclipse.cdt.myplugin.contextIds"; //$NON-NLS-1$

        private static final ResourceBundle RESOURCE_BUNDLE =

                ResourceBundle.getBundle(BUNDLE_NAME);

private ContextIds() {

        }

        public static String getString(String key) {

                try {

                        return RESOURCE_BUNDLE.getString(key);

                } catch (MissingResourceException e) {

                        return '!' + key + '!';

                }

        }

public static String getFormattedString(String key, Object[] args) {

                try {

                        return MessageFormat.format(RESOURCE_BUNDLE.getString(key), args);

                } catch (MissingResourceException e) {

                        return '!' + key + '!';

                }

        }

}

The getFormattedString method allows you to retrieve a single pattern string from the properties file, instead of concatenating several string parts, which leads to errors after translation. This method is not automatically created by the externalize strings wizard, but should be added manually if there are many formatted strings.

contextIds.properties:

       contextKey.searchResult = The search found {0} files containing “{1}” on disk {2}.

 

contextIds_de.properties:

contextKey.searchResult = Es gibt {0} Dateien auf Platte {2}, die {1} enthalten.

 

MyUI.java:

ContextIds.getFormattedString("contextKey.searchResult", new Object[] {new Integer(23), "classA", "C:\\dev"}); //$NON-NLS-1$ //$NON-NLS-2$ //$NON-NLS-3$

 Alternatively, following some current CDT plugins, create these methods in an existing class, where the constructor sets the RESOURCE_BUNDLE and a shared instance of the plugin. The getString method would then return the static instance’s RESOURCE_BUNDLE.

 c.       Tag any non-translatable strings with //$NON-NLS-n$ on the same line in the java code, where n is the nth non-translatable string on the line.

 3.   Search for keys in property files which are not referenced anywhere in the java source code or plugin.xml files and remove them. There is no easy way to do this. For new plugins following the key guidelines above, a file search can be used, but this may not work for existing plugins. As a rule, keys should not be computed; if old code contains computed keys, it may be impossible to determine whether a key is still in use.

4.  Inspect strings in property files from a translator perspective. If the subject is implied, add a comment so that the verb and adjective forms can be derived during translation. If there is a limit on the width of a text field or table column, or an unwrappable or unscrollable text area, identify the maximum number of characters in a comment in the property files.
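For example, a hypothetical properties entry with such translator comments (the key and values are illustrative, not an existing CDT file):

```properties
# Translator note: "Build" is used here as a noun, not a verb.
# Maximum width: 25 characters (the table column does not wrap or scroll).
builds.columnTitle = Build Configuration
```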

2.2 Externalize Non-Text Objects

Action Items  

1. Separate out presentation-dependent parameters, such as OS-dependent default fonts, or size and layout constraints.

    For example, 

 

     message.properties:

            default.tableWidth = 80

 

     font.properties:

            org.eclipse.jface.defaultfont.0=MS Sans Serif-regular-8

2. Images, Colors and AudioClips should also be externalized and sent to translation centers for review. Although these are not usually translated, this allows the countries to verify that they are within government standards and cultural acceptance, and/or to replace them with their own preferences.

3. For Images, specific action items are as follows:

1. Put all icon and image file resources into an icons/ folder under the plugin directory.

2. Create a dedicated accessor class as a source file called <plugin>Images.java. Create a static ImageRegistry object, an ImageDescriptor for each icon or image, and put each ImageDescriptor into the imageRegistry.

private static ImageRegistry fgImageRegistry = new ImageRegistry();

imageURL = new URL( Plugin.getDescriptor().getInstallURL(), "icons/" + imageFile );

ImageDescriptor result = ImageDescriptor.createFromURL( imageURL );

fgImageRegistry.put( key, result );

 

3. Create a static accessor method for the images.

 

public static Image get(String key) {

    return fgImageRegistry.get(key);

}

4. Alternatively, we could adopt an inheritance strategy as Aurora does with a ResourceManager handling all the accessor methods for strings, fonts, and images. An abstract ResourceManager class exists with static members already set and each plugin extends this class, adding its own Images to its implementation.

 

2.3 Unicode Support

Unicode is not only a coded character set which covers every character and glyph in the world. It also comprises standard encoding forms, each of which maps the coded character set to an integer representation. In addition, Unicode sets standards for locale-sensitive functions such as representing, sorting and parsing data, not limited to the Latin-1 character set. 

Unicode's coded character set comprises the first 17 planes of the first group of ISO-10646-1 characters, covering the code points in the range 0000 to 10FFFF. The characters above FFFF are mapped to surrogate pairs in the UTF-16 encoding form; Java first assigned characters in this range in 1.4.1. UTF-16 is the preferable form of Unicode because all characters are represented as 16 bits (or 2 units for surrogate pairs, whose range of values doesn't conflict with single-unit encoding). UTF-8 is slower for DBCS processing, and UTF-32 takes up a huge amount of memory.

The following are I18N APIs which are locale-sensitive in that they encapsulate the Unicode algorithms. The APIs listed are drawn from the Java 1.4.1 documentation. Some of these support Unicode 3.0 or 2.0 algorithms only.

java.text

Strings

CharacterIterator

BreakIterator

StringCharacterIterator

 

Format & Parsing

Format

FieldPosition

ParsePosition

NumberFormat

DecimalFormat

RuleBasedNumberFormat

DecimalFormatSymbols

DateFormat

SimpleDateFormat

DateFormatSymbols

MessageFormat

ChoiceFormat

 

Collation

Collator

RuleBasedCollator

CollationKey

CollationElementIterator

java.util

Calendar

GregorianCalendar

Locale

ResourceBundle

SimpleTimeZone

 

Action Items  

  1. Use the JRE since it supports Unicode. Character and String types are represented internally using Unicode, so anything entered via different keyboards, the system character map, or different IMEs is valid input to the application. In parsing strings, remember that each character may be more than one byte, that words are not necessarily delimited by whitespace, and that converting a string to upper or lower case may expand or collapse its length. 

  2. When displaying strings, numbers, dates, etc., remember they have different representations depending on the locale. Use the Format classes above, which provide locale-sensitive algorithms and abstract away their details, so that developers won't have to implement locale-sensitive code themselves. More on formatting messages can be found in the Formatting section.

  3. Case Conversion: Convert case at the string level, since a string may contain expandable or collapsible characters. Avoid the toUpperCase() and toLowerCase() methods on the Character class, because there is not always a 1-1 mapping from a lowercase character to an uppercase character: e.g. the German 'ß' character becomes "SS" when String.toUpperCase() is invoked, an expansion which Character.toUpperCase() cannot perform.  
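A minimal sketch of the expansion (the class name is ours):

```java
import java.util.Locale;

public class CaseDemo {
    // String-level conversion applies full case mappings, including expansions.
    public static String upper(String s) {
        return s.toUpperCase(Locale.GERMAN);
    }

    public static void main(String[] args) {
        String word = "stra\u00DFe";          // "straße", 6 characters
        System.out.println(upper(word));      // "STRASSE", 7 characters
        // Character-level conversion cannot expand: 'ß' comes back unchanged.
        System.out.println(Character.toUpperCase('\u00DF') == '\u00DF'); // true
    }
}
```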

2.4  GUI Components

When text is translated, or system font size is changed, or vision-impaired users turn on High Contrast, the text may take up more space. The UI components should grow or shrink to accommodate the new text size, without truncation.

Action Items

  1. Ensure all components are resizable. 

  2. Ensure no text fonts are hard coded. These will be externalized into a font.properties file as mentioned in the section, "Externalize Non-Text Objects".

2.5      Bidi Algorithm

Regardless of whether CDT is translated to bidirectional languages (Arabic, Urdu, Farsi, Hebrew, and Yiddish), internationalization includes presentation of data in any character set. Bidirectional languages display characters in right-to-left (RTL) sequence, but numbers and English (or French) insertions are displayed left to right (LTR). Bahrain, Egypt, Jordan, Kuwait, Lebanon, Oman, Qatar, Saudi Arabia, Syria, the United Arab Emirates and Yemen are Arabic-speaking countries using English as an additional language; Algeria, Morocco and Tunisia are Arabic-speaking countries using French as an additional language.

Text is stored and processed in logical order to make processing, such as copy & paste, feasible. Sorting and searching in text rely on the storage of text in logical order. For display, it must be reordered; the Unicode standard specifies an algorithm for this logical-to-visual reordering. Avoid data loss: data should not be reordered from logical to visual order except for display and printing, since logical-to-visual reordering is a many-to-one function. Bidirectional data should be converted to Unicode and reordered to logical order only once, to avoid round-trip losses.

Action Items

  1. All text input should be checked to determine whether it requires Bidi processing before storing and, if so, a Bidi object should be created. There are three cases of input to be considered to determine if physical-to-logical reordering is necessary.

    1. If each character is stored as it is typed: Here the text is stored correctly, in logical order. Only for redisplaying should a Bidi object be created.
    2. If input is received through a Component’s getText() method, the text is stored in logical order (as it was typed). Processing can be done on it, but in order to display it, it must be reordered. Note: JDT does this nicely in M5. System.out.println displays differently in eclipse console than in windows command console. See Appendix A for a test to make sure the display is correct.
    3. Input received through an InputStream: Must determine how it was sent.

if (Bidi.requiresBidi(text, start, limit)) {

    Bidi newText = new Bidi(…);

}  

  2. All Bidi objects should be reordered for visual display. This includes determining the number of nested runs of RTL and LTR text, and possible overrides. E.g. some numbers can be forced to print RTL instead of the default LTR.
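As a sketch, java.text.Bidi can report the directional runs (the class name and sample string are ours):

```java
import java.text.Bidi;

public class BidiRunsDemo {
    // Counts the directional level runs in a string, with the paragraph
    // direction taken from the first strong character (default LTR).
    public static int countRuns(String s) {
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.getRunCount();
    }

    // True if the text contains any right-to-left characters at all.
    public static boolean needsBidi(String s) {
        return Bidi.requiresBidi(s.toCharArray(), 0, s.length());
    }

    public static void main(String[] args) {
        // LTR text, then Hebrew (RTL), then LTR again: three directional runs.
        String mixed = "abc \u05D0\u05D1\u05D2 def";
        System.out.println(countRuns(mixed));   // 3
        System.out.println(needsBidi("abc"));   // false: pure LTR needs no Bidi
    }
}
```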

2.6      Sorting

Users can expect sorted output to be sorted according to a particular locale setting. How the presented data is sorted varies between languages. Issues concerning accents, conjoined letters and ignorable punctuation are important, as the priorities for these preferences vary from country to country. For instance, French requires that accent differences be compared from the end of the word backwards: "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o". String.compareTo() and String.indexOf() are not locale-sensitive and should be avoided whenever lists to sort may contain non-Latin-1 characters.

Case should always be sorted using a Collator; code-point-based sorting won't work because all capitals come before all lowercase letters, which is not desired even in English.
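A minimal illustration (the class name and the English locale choice are ours):

```java
import java.text.Collator;
import java.util.Locale;

public class SortCaseDemo {
    // True if the Collator orders a before b for the English locale.
    public static boolean collatorSaysBefore(String a, String b) {
        return Collator.getInstance(Locale.ENGLISH).compare(a, b) < 0;
    }

    public static void main(String[] args) {
        // Code-point order puts all capitals first: "Banana" < "apple".
        System.out.println("Banana".compareTo("apple") < 0);       // true
        // A Collator sorts alphabetically regardless of case.
        System.out.println(collatorSaysBefore("apple", "Banana")); // true
    }
}
```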

The C and C++ standards indicate that identifiers as well as string literals can contain universal characters if they are written as \uxxxx or \Uxxxxxxxx. The GNU compiler doesn't support universal character names in identifiers, but Visual Studio 6.0 does.

Action Items

  1. Choose either Collator or CollationKey for the comparison object. For a single comparison, or for sorting a list of 10 strings or fewer, use a Collator to compare Strings a and b or to sort a list:

                Collator coll= Collator.getInstance();

      if (coll.compare(a, b) == 0) ... ;

 

          For longer lists, convert all strings to CollationKeys first. CollationKeys to be compared must be created by the same Collator object.

 

Collator coll = Collator.getInstance();

CollationKey aKey = coll.getCollationKey(a);

if (aKey.compareTo(coll.getCollationKey(b)) == 0) … ;

  2. Set the strength of the comparison. By default, the Collator's strength is TERTIARY, so it finds PRIMARY, SECONDARY and TERTIARY differences. 

PRIMARY         finds primary differences only such as different letters. Ignores case and accents. Note that in French, e and é are considered the same in primary strength, but in Danish, a and å are different letters.

SECONDARY   finds primary and secondary differences. In English and French, accents are secondary differences. Here 'e' and 'é' are considered different in both English and French, but 'e' and 'E' are still the same under secondary strength.

TERTIARY        finds primary, secondary and tertiary differences. Case differences are considered tertiary differences. E and e are different in tertiary strength.

IDENTICAL       finds all differences, even if 2 characters look the same. E.g. an accented character such as "\u00C0" (A-grave) and a base character plus combining accent such as "A\u0300" (A, combining-grave) are different. Note these are only different if the decomposition is set at the default, NO_DECOMPOSITION. If decomposition were set higher, then "\u00C0" (A-grave) would be decomposed before comparison, and then no difference would be found.

 

E.g.

 

Collator coll = Collator.getInstance();

coll.setStrength(Collator.PRIMARY);

 

Ignorable characters. In Java 1.4.1, certain ignorable characters are ignored only when the strength of the comparison is set to PRIMARY. For instance, "black-bird", "blackbird" and "black bird" are considered the same at the PRIMARY level, but different at the SECONDARY level. In the future, Java will incorporate the Unicode Collation Algorithm version 4, which contains a QUATERNARY level. Differences in ignorable characters will then be found at the quaternary level, but not at the primary, secondary or tertiary levels.

 

To add a rule to the collator so that space and '-' are not ignored, but still treated as secondary differences, retrieve the collator's rules and replace the first compare operator with a '<' since every character in the ruleset up to the first '<' is ignored. 

 

E.g.

 

RuleBasedCollator coll = (RuleBasedCollator)Collator.getInstance();

String oldRules = coll.getRules();

coll = new RuleBasedCollator("<"+oldRules.substring(1));

 

  3. Set the decomposition level of the Collator.

NO_DECOMPOSITION                   This is the default and is the fastest, but won’t return the correct result for characters with accents.

CANONICAL_DECOMPOSITION     This is recommended, as composed characters are converted to 2 Unicode characters: the base and the accent. This would return the correct result when comparing two identical characters with accents from different codesets.

FULL_DECOMPOSITION                For Japanese comparisons,  this will recognize that half width and full width variants of the same letter are equal.

E.g.    

 

    coll.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
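For example, canonical decomposition makes the precomposed and combining forms compare equal (a small sketch; the class name and English locale are our choices):

```java
import java.text.Collator;
import java.util.Locale;

public class DecompositionDemo {
    // Compares two strings after canonical decomposition, so that a
    // precomposed accented character equals its base-plus-combining form.
    public static boolean equalWithCanonical(String a, String b) {
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        coll.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return coll.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // "\u00C0" is precomposed A-grave; "A\u0300" is A plus combining grave.
        System.out.println(equalWithCanonical("\u00C0", "A\u0300")); // true
    }
}
```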

 

   4.     Sort the list. Sorting a list requires passing the Collator object to the sort function.

       

           E.g.

 

      Collections.sort(mylist, coll);

 

2.7      Searching

The CDT Search feature must respond to searches of identifiers in different character sets, as well as order the results according to the preferred locale. The issues in language-sensitive sorting also apply to text searching: locale-sensitive priorities on accents, conjoined letters, and ignorable punctuation. For instance, English searches can ignore accents, whereas some French accents are more important than others. Danish searches treat 'å' (\u00e5) and 'aa' as equal. Conjoined letters can be collapsed as well as expanded, e.g. Spanish 'ch' is to be treated as one letter between 'c' and 'd' (in Java 1.4.1 the Collator supports this particular collapse only in the Catalan locale). As well, searching for "blackbird" could return "black-bird".

String.indexOf won’t handle these cases properly. When comparing using Collation, each character is converted to a key containing 4 components: alphanumeric, diacritic, case and special subkeys. After that, a weight is put on each component according to the locale.  See Appendix B for a national language index function.

 

Action Items

1. Create a Collator object as you would for sorting (see the previous section). It must be the RuleBasedCollator type for the next step.

2. Create an iterator for searching in a string:

 

RuleBasedCollator rbc = (RuleBasedCollator) Collator.getInstance();

CollationElementIterator iter = rbc.getCollationElementIterator(text);  

3. Find the index of the substring by iterating through the string. The internationalized indexOf code is included in Appendix B.
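Appendix B contains the full implementation; the following is only a naive O(n^2) sketch of the idea (class and method names are ours), testing each candidate span with a primary-strength Collator:

```java
import java.text.Collator;
import java.util.Locale;

public class I18nSearch {
    // Brute-force locale-sensitive substring search: returns the start of the
    // first span that collates equal to the pattern, or -1 if none does.
    // Fine for short strings only; not the Appendix B implementation.
    public static int indexOf(String text, String pattern, Collator coll) {
        for (int i = 0; i < text.length(); i++) {
            for (int j = i + 1; j <= text.length(); j++) {
                if (coll.equals(text.substring(i, j), pattern)) {
                    return i;
                }
            }
        }
        return -1;
    }

    // Convenience wrapper: accent- and case-insensitive English search.
    public static int primaryIndexOf(String text, String pattern) {
        Collator coll = Collator.getInstance(Locale.ENGLISH);
        coll.setStrength(Collator.PRIMARY);
        return indexOf(text, pattern, coll);
    }

    public static void main(String[] args) {
        // String.indexOf misses the accented form; the collator finds it.
        System.out.println("la c\u00F4te".indexOf("cote"));              // -1
        System.out.println(primaryIndexOf("la c\u00F4te", "cote") >= 0); // true
    }
}
```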

 

2.8      Text Files

Reading and writing text files (and importing source files) require consideration that the local character encoding may be neither ASCII nor Unicode. For instance, on a Chinese operating system Notepad uses GB or Big5 encoding. Java contains APIs which convert between a particular encoding and Unicode on reading and writing. These are not locale-sensitive, just platform-sensitive, and this is a known limitation.

native2ascii is a JDK tool that converts a file from the local character set to ASCII, escaping other characters as \uxxxx, which is useful if the editor's display mode is ASCII.

 

Java is required to support the following encoding schemes: US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE and UTF-16. Other encodings are subject to an UnsupportedEncodingException.

 

Action Items

1. Do not read or write text with raw InputStream and OutputStream; these are byte streams and cannot convert Unicode characters.

2. When reading a file (not created by Eclipse), use InputStreamReader to convert from the default code page to Unicode. An optional encoding parameter for the InputStreamReader constructor is a String following the naming standards set by the IANA Character Registry. If no encoding is specified in the constructor, then the platform's default code page is assumed.  

E.g.

public Reader readArabic(String file) throws IOException {
    InputStream fileIn = new FileInputStream(file);
    return new InputStreamReader(fileIn, "iso-8859-6");
}

3. When reading a Unicode file in UTF-16 encoding, there may exist a Byte Order Mark (BOM) at the beginning which must be removed before processing the file. It is either FEFF for big-endian or FFFE for little-endian.
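A sketch of one way to strip the BOM (the helper name is ours; note that Java's "UTF-16" charset, unlike "UTF-16BE" and "UTF-16LE", consumes the BOM on its own):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackReader;
import java.io.Reader;

public class BomReader {
    // Wraps a byte stream in a reader for the given UTF-16 variant and
    // discards a leading byte order mark if one is present.
    public static Reader open(InputStream in, String charsetName)
            throws IOException {
        PushbackReader reader =
                new PushbackReader(new InputStreamReader(in, charsetName));
        int first = reader.read();
        // \uFEFF is the BOM; \uFFFE means the bytes are byte-swapped.
        if (first != -1 && first != 0xFEFF && first != 0xFFFE) {
            reader.unread(first);   // not a BOM: keep the character
        }
        return reader;
    }
}
```

Reading the bytes FE FF 00 41 through this helper as "UTF-16BE" yields just the character 'A'.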

4. When writing to a file, use OutputStreamWriter, which converts Unicode to the local character encoding. Unicode characters in the range above FFFF are mapped to pairs of 16-bit units called surrogates, drawn from the range \uD800 - \uDFFF. It is illegal to attempt to write a character stream containing malformed surrogate elements (a character with a missing surrogate), so ensure that a Unicode character requiring surrogates is not split.

If the output is intended for the C-compiler and the local character encoding is not ANSI compliant (7 or 8 bit), then OutputStreamWriter can also be used to convert to a specified encoding using the required charset’s canonical name found in the IANA Character Registry.

E.g. 

public Writer writeForCompiler(String file) throws IOException {
    OutputStream fileOut = new FileOutputStream(file);
    return new OutputStreamWriter(fileOut, "iso-8859-1");
}

Keep in mind that any character beyond \u00ff is folded to '?'. Convert such characters first to a string of the form "\uxxxx".
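A possible helper for that conversion (the class name and scope are ours):

```java
public class Latin1Escape {
    // Replaces every character above \u00FF with its "\uxxxx" escape so the
    // result survives an ISO-8859-1 OutputStreamWriter unchanged.
    public static String escape(String s) {
        StringBuffer out = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c > '\u00FF') {
                String hex = Integer.toHexString(c);
                while (hex.length() < 4) {
                    hex = "0" + hex;        // pad to four hex digits
                }
                out.append("\\u").append(hex);
            } else {
                out.append(c);              // Latin-1 passes through as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("a\u4E2Db"));   // a\u4e2db
        System.out.println(escape("caf\u00E9"));  // café (unchanged)
    }
}
```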

2.9      Multilingual Development Tools

Global users of CDT may wish to write internationalized applications in C or C++. Any development tool should make this task easier by providing tools to develop, deploy, and manage multilingual applications.

Action Items

CDT should provide the following tools, or provide interfaces to open source tools so that the application developers can have easy access to such tools.

2.10      Packaging

This activity occurs after a product is internationalized and localized (translated). A national language fragment is created per plugin and named after its plugin. Fragments do not require a rebuild of the base code; they are loaded into the application at run-time by their plugin, just as Eclipse loads all its plugins at runtime. 

Action Items

Create fragments destined for a single zip file to be unzipped in the eclipse/fragments directory as described in the Eclipse article, How to Internationalize your Eclipse Plug-In, Step 5: Create initial translated plug-in fragment. The details for only one plugin's fragment are summarized as follows:

1. Create a fragment with the new fragment wizard: File > New > Project..., select Plugin Development category, then Fragment Project type. Call the project, <pluginid>_nl1, which will become the fragment ID. The default settings will set nl1.jar as the first runtime library, and NL1 Fragment/ as the source folder. The nl1.jar will contain all the translations of the plugin's property files, when they are added to the project in step 4. The fragment.xml manifest will be automatically created.

2. Add the $nl$/ folder to the fragment manifest's runtime information under runtime libraries. This is where resource files that don't carry the language in their names are stored, such as welcome.xml, translated proprietary files, html files, xml files, and doc.zip files. Manually add a directory to nl/ for each language, named by the same standard as the property files' locale names, e.g. de/, fr/, ja/. Manually put the translated non-property files in the appropriate directory. 

3. For each translated properties file, if it has non-ISO-8859-1 characters (non Latin-1), run native2ascii on the file, passing in the source's codepage as a parameter (otherwise, it will take the local machine's codepage). This substitutes double-byte characters with \uxxxx escapes so that java's ResourceBundle class can read these resources.

4. Rename the translated property files for the target plugin according to their locale (i.e. messages_fr.properties) and add them to the NL1 Fragments directory. Right click on the fragment.xml file in the PDE perspective and select "Create Fragment JARs". 

5. Deposit the source jar, the nl/ directory, and the fragment.xml into the <pluginid>_nl1 directory under the eclipse/fragments directory, and your fragment is complete. Continue creating fragments for all other plugins.

 

2.11  Formatting

An internationalized application is sensitive to a country's preference in how their numbers, dates, and times are displayed.  For example, the number 12345.67 is "12,345.67" in the US, "12 345,67" in France and "12.345,67" in Germany. Most format classes have getInstance methods which retrieve a singleton based on either the locale provided as an argument, or the default locale, if none provided. MessageFormat is also locale-sensitive and is instantiated with a pattern and locale, or uses the default locale if not specified. MessageFormat can also use a static method for formatting based on the default locale. ChoiceFormat must be instantiated with a pattern and is not locale-sensitive.  
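These groupings can be verified directly (a minimal sketch; the class name is ours):

```java
import java.text.NumberFormat;
import java.util.Locale;

public class NumberFormatDemo {
    // Formats a number according to the grouping and decimal conventions
    // of the given locale.
    public static String format(double value, Locale locale) {
        return NumberFormat.getInstance(locale).format(value);
    }

    public static void main(String[] args) {
        System.out.println(format(12345.67, Locale.US));      // 12,345.67
        System.out.println(format(12345.67, Locale.GERMANY)); // 12.345,67
    }
}
```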

Action Items

Use the APIs listed in the Unicode Support Section for formatting in the following areas:

1. Number display: All numbers displayed either in the application or written to non-c/c++ source files require the use of NumberFormat class. 

NumberFormat nf = NumberFormat.getInstance();

System.out.println(nf.format(10123));

Plurals in messages often combine ChoiceFormat with NumberFormat. Although ChoiceFormat is not a locale sensitive object, it can be used in conjunction with MessageFormat in retrieving resource bundles. A ChoiceFormat object is instantiated with either a pattern or two arrays: one containing the limits and one containing the formats. Here is an example where a pattern is used:

ChoiceFormat myform = new ChoiceFormat ("0#no files|1#one file|2#many files");

for (int i = -5; i < 10; i++) {

    System.out.println("number of files: " + myform.format(i));

}

2. Dates and Times: In error log timestamps, or configuration details pages, times and dates should be displayed using DateFormat. 

DateFormat theDate = DateFormat.getDateInstance(DateFormat.LONG);

DateFormat theTime = DateFormat.getTimeInstance(DateFormat.SHORT);

Date d = new Date();

System.out.println("The time is "+theTime.format(d)+" on "+theDate.format(d));

3. Messages: If strings to be displayed contain parts which are undetermined until runtime, use MessageFormat to format and arrange the runtime arguments according to locale, retrieving the pattern from the resource bundle as described in the Externalize Strings section. MessageFormat can embed the results of NumberFormat and DateFormat formatting by specifying a pattern either in the static format method or in the MessageFormat constructor. This example shows the common use of the static format method.

Date date = new Date();

String classname = "MyClass";       // example values; undetermined until runtime

String filename = "MyClass.java";

Integer num = new Integer(12345678);

Object[] args = {date, classname, filename, num};

MessageFormat.format("At time {0,time} on {0,date}, the number of occurrences of Class {1} found in file {2} is {3,number,integer}", args);

 

Messages should not assume that sentence parts remain in the same order as English when translated, and thus cannot be assembled by concatenating strings. Use the getFormattedString method explained in the Externalize Strings section.

 

 


3. Deviations
 

1.      Although the C and C++ standards indicate that identifiers can be drawn from any codeset, the CDT compilers will accept universal characters in identifiers only as ASCII strings in the form "\uxxxx" or "\Uxxxxxxxx", and it may not be feasible to convert UTF identifiers to ASCII before each call to make. 

2.      CDT will not fully implement Unicode 3.1 algorithms, as the Java APIs currently support only Unicode 3.0. The JRE level used for CDT is minimally 1.4.1 and does not contain the ICU4J expansion for the most current Unicode algorithms. The affected Java classes support a lower level of Unicode, affecting character input, sorting & searching, and bidirectional text requirements.

3.   CDT will not allow characters in the range \uD800 to \uDFFF to be entered in the source file editor, because the C/C++ coding standards specify that universal character names exclude characters in this region. Also, due to deviation #2, these characters cannot be recognized in dialogs and settings either: Unicode 3.1 is the first version to assign characters in the surrogate zone of UTF-16, and the Character APIs support Unicode 3.0 only.


4. Glossary
 

CodePoint: The hexadecimal value used to represent a character.

UNIX: ISO 8859-x (SBCS), EUC (DBCS)

Windows: Cp1252 (US English), Cp932 (Japanese - New JIS)

DOS, OS/2, FAT: IBM PC code pages, e.g. 819 (Latin-1), 943 (Japanese)

Universal Character Set: contains an encoding for all characters. E.g. Unicode: 256 x 256 x 17 planes; ISO-10646-1: 256 x 256 x 256 planes x 128 groups.

E.g. Unicode code points: Latin-1 0000-00FF, Hiragana 3040-309F, Katakana 30A0-30FF, Arabic 0600-06FF, CJK 4E00-9FAF

 

Globalization (G11N): I18N & L10N & multicultural support (support any customer, anywhere, anytime, anyplace, with one single executable).

 

Globalization White Paper: Lists requirements for flagship products, key products and components. Included are the languages where Basic Support is required and the languages for translation.

 

Internationalization (I18N): The process of designing a program from the ground up so that it can be changed to reflect the expectations of a new user community without having to modify its executable code. Development responsibility.

 

Localization (L10N): the process of translating text, converting images, etc., so that the program conforms to a particular country's expectations.

Includes translation, altering pictures, and so on; this is the work the translation team does. The Translation Center's responsibility.

 

MRI (Machine Readable Information): all the language- and culture-sensitive information exchanged between the product and its users. MRI includes messages, dialog boxes, online manuals, audio output, animations, windows, help text, tutorials, diagnostics, clip art, icons, and any presentation control that is necessary to convey information to users. MRI comes as PII or non-PII.

 

PCI (Presentation Control Information): the invisible set of controls that determine the presentation attributes of the information, such as color, intensity, loudness, and window size.

 

5. References
 

Globalization Central http://eou5.austin.ibm.com/global/global_int.nsf/Publish/982

ICU4J home http://oss.software.ibm.com/icu4j/index.html

Unicode home http://www.unicode.org/

ICU 2.8 classes http://oss.software.ibm.com/icu/apiref/annotated.html

 

Appendix A BiDi Test
 

A simple Java test shows that JDT displays BiDi output correctly in the Eclipse console view:

Run the Testbidi application below from Eclipse. 

Enter "صثقف123خح" in the Testbidi application (from the 101-key US English keyboard, type "wert123op").

Note that the visual left-to-right order of the "text" string in Eclipse is 2 Arabic characters, followed by 3 digits, then 4 Arabic characters. If this is not the case, the viewer you are reading this document with does not render BiDi text properly. The following is displayed in Eclipse, which shows a visual order different from the logical order.

text:صثقف123خح

text(0): ص

text(1): ث

text(2): ق

text(3): ف

text(4): 1

text(5): 2

text(6): 3

text(7): خ

text(8): ح

When run from the Windows command line, however, the following output (setting aside the wrong fonts) shows that the storage order is correct (handled by the JRE), but that the complete string is displayed, incorrectly, in the same order in which it is stored.

text:????123??

text(0): ?

text(1): ?

text(2): ?

text(3): ?

text(4): 1

text(5): 2

text(6): 3

text(7): ?

text(8): ?
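The logical-to-visual reordering seen above can also be examined directly with java.text.Bidi, available since J2SE 1.4. A minimal sketch (the class name BidiDemo is ours):

```java
import java.text.Bidi;

public class BidiDemo {
    public static void main(String[] args) {
        // "abc" followed by two Arabic letters (Alef, Beh) in logical order.
        String s = "abc\u0627\u0628";
        Bidi bidi = new Bidi(s, Bidi.DIRECTION_LEFT_TO_RIGHT);
        // The analysis splits the text into one LTR run and one RTL run;
        // a renderer must display the RTL run right-to-left.
        System.out.println(bidi.isMixed());     // true: both directions present
        System.out.println(bidi.getRunCount()); // 2
        System.out.println(Bidi.requiresBidi(s.toCharArray(), 0, s.length()));
    }
}
```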

 

Testbidi.java:

 

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.JTextField;

public class Testbidi {

    JFrame frame;
    JTextField t;

    public static void main(String[] args) {
        JFrame j = new JFrame("hello");
        new Testbidi(j);
        j.pack();
        j.setVisible(true);
    }

    public Testbidi(JFrame j) {
        frame = j;
        t = new JTextField(30);
        JPanel p = new JPanel();
        p.add(t);
        JButton b = new JButton("display");
        b.addActionListener(new ActionListener() {
            public void actionPerformed(ActionEvent e) {
                // Print the whole string, then each char in logical (storage) order.
                String s = t.getText();
                System.out.println("text:" + s);
                for (int i = 0; i < s.length(); i++)
                    System.out.println("text(" + i + "): " + s.charAt(i));
            }
        });
        p.add(b);
        frame.getContentPane().add(p);
    }
}

 

Appendix B I18N Search
 

This algorithm should be used when the correct index of a Unicode-sensitive substring match is required.

 

TestSearch.java:

 

import java.text.CollationElementIterator;
import java.text.RuleBasedCollator;

public class TestSearch {

    /**
     * Returns the index in searchIn of the first substring that matches
     * searchFor under the given collator, or -1 if there is no match.
     */
    public static int nlIndexOf(String searchFor, String searchIn, RuleBasedCollator coll) {
        int start = -1;
        boolean started = false;
        CollationElementIterator p = coll.getCollationElementIterator(searchIn);
        CollationElementIterator q = coll.getCollationElementIterator(searchFor);
        int e1 = q.next();
        int e2 = p.next();
        boolean triedOnce = false;
        while (e1 != CollationElementIterator.NULLORDER
                && e2 != CollationElementIterator.NULLORDER) {
            if (e1 == e2) {
                // Collation elements match; remember where the match began.
                if (!started) {
                    start = p.getOffset();
                    started = true;
                }
                e1 = q.next();
                e2 = p.next();
            } else if (!triedOnce) {
                // Mismatch: restart the pattern and retry at the current position.
                q.reset();
                e1 = q.next();
                started = false;
                triedOnce = true;
            } else {
                // Still no match: advance in the searched text.
                e2 = p.next();
                triedOnce = false;
            }
        }
        if (e1 == CollationElementIterator.NULLORDER) {
            // The whole pattern matched; getOffset() points just past the first
            // matched element, so step back one to get the start index.
            return start - 1;
        } else {
            return -1;
        }
    }
}
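A small usage sketch follows, repeating the method above in compact form so it compiles standalone; the class name NlSearchDemo is ours, and the expectation is simply that, for plain ASCII text under the default English collator, the collation-based search agrees with String.indexOf.

```java
import java.text.CollationElementIterator;
import java.text.Collator;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class NlSearchDemo {

    // Compact copy of the nlIndexOf algorithm from Appendix B.
    static int nlIndexOf(String searchFor, String searchIn, RuleBasedCollator coll) {
        int start = -1;
        boolean started = false, triedOnce = false;
        CollationElementIterator p = coll.getCollationElementIterator(searchIn);
        CollationElementIterator q = coll.getCollationElementIterator(searchFor);
        int e1 = q.next(), e2 = p.next();
        while (e1 != CollationElementIterator.NULLORDER
                && e2 != CollationElementIterator.NULLORDER) {
            if (e1 == e2) {
                if (!started) { start = p.getOffset(); started = true; }
                e1 = q.next(); e2 = p.next();
            } else if (!triedOnce) {
                q.reset(); e1 = q.next(); started = false; triedOnce = true;
            } else {
                e2 = p.next(); triedOnce = false;
            }
        }
        return (e1 == CollationElementIterator.NULLORDER) ? start - 1 : -1;
    }

    public static void main(String[] args) {
        RuleBasedCollator coll = (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        System.out.println(nlIndexOf("ll", "hello", coll)); // 2, same as "hello".indexOf("ll")
        System.out.println(nlIndexOf("zz", "hello", coll)); // -1, no match
    }
}
```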

 


Last Modified January 9, 2004