International Considerations for Integration Services
Microsoft SQL Server Integration Services supports the parsing and manipulation of multilingual data, supports all Windows locales, and provides special comparison options for sorting and comparing string data.
The Integration Services transformations for text mining and fuzzy matching may not work as well with non-English languages as they do with English. Depending on the language, however, both sets of transformations may still provide useful results with many non-English languages.
Locale-Insensitive Parsing
Integration Services includes locale-insensitive parsing routines that you can use for data that is in certain formats. These parsing routines, collectively called Fast Parse, support only the most frequently used date format representations, do not perform locale-specific parsing, do not recognize special characters in currency data, and cannot convert hexadecimal or scientific representations of integers. Fast Parse can significantly improve the performance of Integration Services packages that do not have locale dependencies. For more information, see Parsing Data.
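To make the trade-off concrete, the following Python sketch shows what a locale-insensitive parser can look like in principle. It is not the Fast Parse implementation, and the function names `fast_parse_iso_date` and `fast_parse_int` are hypothetical. It assumes a single fixed date layout (YYYY-MM-DD) and plain decimal integers, and it rejects everything else rather than consulting regional settings.

```python
from datetime import date


def fast_parse_iso_date(text: str) -> date:
    """Parse a strict YYYY-MM-DD string without consulting any locale.

    Hypothetical helper for illustration; the fixed layout is an assumption.
    """
    if len(text) != 10 or text[4] != "-" or text[7] != "-":
        raise ValueError(f"unsupported date format: {text!r}")
    parts = (text[0:4], text[5:7], text[8:10])
    if not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"unsupported date format: {text!r}")
    year, month, day = (int(p) for p in parts)
    return date(year, month, day)  # raises ValueError for impossible dates


def fast_parse_int(text: str) -> int:
    """Parse an optionally signed decimal integer.

    Hexadecimal, scientific notation, currency symbols, and digit grouping
    separators are all rejected instead of being interpreted.
    """
    digits = text[1:] if text.startswith(("+", "-")) else text
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"unsupported integer format: {text!r}")
    return int(text)
```

Because each routine checks only fixed character positions and ASCII digits, it does far less work than a parser that must consider locale-specific separators, month names, or currency symbols. The cost is that inputs such as "0x1F", "1e3", or "$100" are treated as errors.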
Locale Settings
Integration Services supports locales at the level of the package object, container, task, and data flow component. You can also set the locale of event handlers.
A package can use multiple different locales. For example, the package may use the English (United States) locale while one task in the package uses the German (Germany) locale and another task uses the Japanese (Japan) locale.
You can use any Windows locale in an Integration Services package. You set the locale when you construct the package, and unless the package uses configurations to update locale properties, the package is guaranteed to behave the same when deployed to computers that may use different regional and language options than the development environment.
However, if a package must use different locales when deployed to different servers, you can create configurations that provide the updated locales to use when the package is run. For more information, see Setting Package Properties and SSIS Package Configurations.
Comparison Options
The locale provides the basic rules for comparing string data in a data flow. For example, the locale specifies the sort position of each letter in the alphabet. However, these rules may not be sufficient for the comparisons that you want to perform, and Integration Services supports a set of advanced comparison options that go beyond the comparison rules of a locale. For example, if you choose to ignore nonspacing characters, "a" and "á" are equivalent for comparison purposes. For more information, see Comparing String Data.
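The effect of ignoring nonspacing characters can be approximated outside Integration Services. The Python sketch below is only an illustration of the idea, not how Integration Services implements the option: it decomposes each string with Unicode normalization and drops combining marks before comparing. The function names are hypothetical.

```python
import unicodedata


def strip_nonspacing(text: str) -> str:
    """Remove nonspacing (combining) marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # Unicode category "Mn" is Mark, Nonspacing (for example, U+0301 COMBINING ACUTE ACCENT).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def equal_ignoring_nonspacing(left: str, right: str) -> bool:
    """Compare two strings while treating base letters and accented forms as equal."""
    return strip_nonspacing(left) == strip_nonspacing(right)


print(equal_ignoring_nonspacing("a", "á"))  # True
print(equal_ignoring_nonspacing("a", "b"))  # False
```

Under this approximation, "résumé" and "resume" also compare as equal, which is the kind of behavior to expect when the option is enabled for a comparison.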
Text Mining
The transformations for text mining, Term Extraction and Term Lookup, use their own dictionary. This dictionary is available only in English, so the results from using the text mining transformations with languages other than English may be limited. Microsoft supports the use of these transformations only with English.
However, depending on how similar a language is to English, you may find that the Term Extraction transformation can extract terms in that language, and that the Term Lookup transformation can be used to look up terms and calculate term frequency. The greater the similarity between the languages, the more successful the term mining will be. For example, using the Term Extraction transformation for text mining of Swedish strings could be effective because the Swedish language uses word and sentence delimiters that are similar to those in the English language. On the other hand, using the Term Extraction transformation is not likely to be as successful with Japanese. For more information, see Term Extraction Transformation and Term Lookup Transformation.
Fuzzy Matching
The Fuzzy Grouping and Fuzzy Lookup transformations use fuzzy matching to group similar records in a dataset or to perform lookups in a reference table. Both transformations match most effectively when the text data contains multiple, long words separated by white space or delimiters. The transformations may not be as tolerant of errors in logographic languages, such as Chinese, where words often consist of only a few characters and may not be separated by white space. In general, the transformations are less likely to catch spelling errors, extra words, and missing words in logographic languages. For more information, see Fuzzy Grouping Transformation and Fuzzy Lookup Transformation.
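The following Python sketch illustrates why white space matters for error tolerance. It is a deliberately simplified token-overlap measure (Jaccard similarity over whitespace-separated tokens), not the algorithm used by Fuzzy Grouping or Fuzzy Lookup, and the function name `token_overlap` and the sample strings are hypothetical.

```python
def token_overlap(left: str, right: str) -> float:
    """Return the Jaccard similarity of the whitespace-separated tokens."""
    a = set(left.lower().split())
    b = set(right.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# One misspelled word out of four: three tokens still match.
english = token_overlap(
    "Contoso Bicycle Company Seattle",
    "Contoso Bicycle Compny Seattle",
)

# One wrong character in a string with no white space: the single token no longer matches.
chinese = token_overlap("北京自行车公司", "北京自行车公可")

print(english)  # 0.6
print(chinese)  # 0.0
```

With several delimited words, one error leaves most of the evidence intact. When the whole value is a single undelimited token, the same one-character error removes all of it, which is consistent with the reduced tolerance described above.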