Wednesday 17 September 2008

Data Quality - Data Cleanse

Data Cleanse transform: identifies and isolates specific parts of mixed data, and standardizes your data based on information stored in the parsing and capitalization dictionary files, business rules defined in the rule file, and expressions defined in the pattern file.
  • Parse data: The Data Cleanse transform can identify and isolate a wide variety of data.
  • Standardize data: The Data Cleanse transform can standardize data to make your records more consistent, including case, punctuation, and acronyms.
  • Assign gender and prenames: The Data Cleanse transform can assign a precise gender code to each name: strong male, strong female, weak male, weak female, and ambiguous. For dual names, Data Cleanse also offers additional gender codes: female multi-name, male multi-name, mixed multi-name, and ambiguous multi-name.
  • Create personalized greetings
  • Create a separate output record for each person: If you expect that you have multiple persons, firms, e-mail addresses, and so on in a single record, you may want to split that data out into separate records.
Related components:
  • Data_cleanse: The base Data Cleanse transform that includes a minimal set of options.
  • data_cleanse_en, fr, etc
  • Blueprints:
    • Batch address and data cleanse blueprint
    • Transactional address and data cleanse blueprint
  • Scan and Split
  • Search and Replace
Prepare records for matching: Standardize your data upstream from the match transforms using Data Cleanse.

International data
  • Customize greetings and prenames per country
  • Modify the phone file for other countries: Data Cleanse includes phone number patterns for many countries by default. However, if you find that you need parsing for a country that is not included, you can modify the international phone file (drlphint.dat, a regular-expression file) to enable Data Cleanse to detect phone number patterns that follow a different format.
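To make the idea of per-country phone patterns concrete, here is a minimal Python sketch. It does not reproduce the syntax or contents of drlphint.dat; the pattern labels, the regular expressions, and the `detect_phone_format` helper are all assumptions made up for illustration.

```python
import re

# Hypothetical patterns for illustration only; NOT the contents or syntax of drlphint.dat.
PHONE_PATTERNS = [
    # North American style: (608) 555-1234, 608-555-1234, 608.555.1234
    ("NANP", re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")),
    # French style: 01 23 45 67 89 (five pairs of digits, leading 0)
    ("FR", re.compile(r"0\d(?:[ .]?\d{2}){4}")),
]


def detect_phone_format(text):
    """Return the label of the first pattern that matches the whole string, else None."""
    candidate = text.strip()
    for label, pattern in PHONE_PATTERNS:
        if pattern.fullmatch(candidate):
            return label
    return None


print(detect_phone_format("(608) 555-1234"))
print(detect_phone_format("01 23 45 67 89"))
print(detect_phone_format("12345"))
```

Supporting another country in this sketch means appending one more labeled pattern to the list, which is roughly the role the phone file plays for Data Cleanse.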
User-defined pattern: Data Cleanse can parse any kind of number or alphanumeric for which you can define a pattern.
  • The pattern label is created in the drludpm.dat (regular expressions) file when the pattern is defined.
  • For more, refer to the BusinessObjects Data Quality XI Release 2 Data Cleanse Modifier’s Guide.
Parsing dictionaries:
  • The parsing dictionary identifies and parses name, title, and firm data. The parser looks up words in the parsing dictionary to retrieve information. The parser then uses the dictionary information, as well as the rule file, to identify and parse name, title, and firm data.
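The lookup-then-rule flow described above can be sketched in a few lines of Python. This is only a toy model: the dictionary entries, the classification names, the rule table, and `parse_name` are hypothetical and do not reflect the real dictionary or rule file formats.

```python
# Hypothetical miniature parsing dictionary: word -> classification.
PARSING_DICTIONARY = {
    "mr": "PRENAME",
    "dr": "PRENAME",
    "john": "GIVEN_NAME",
    "mary": "GIVEN_NAME",
    "smith": "FAMILY_NAME",
}

# Hypothetical rules: a sequence of classifications -> output field names.
RULES = {
    ("PRENAME", "GIVEN_NAME", "FAMILY_NAME"): ("prename", "given_name", "family_name"),
    ("GIVEN_NAME", "FAMILY_NAME"): ("given_name", "family_name"),
}


def parse_name(text):
    """Classify each word via the dictionary, then apply the rule matching the sequence."""
    tokens = text.replace(".", "").split()
    classes = tuple(PARSING_DICTIONARY.get(tok.lower(), "UNKNOWN") for tok in tokens)
    fields = RULES.get(classes)
    if fields is None:
        return None  # no rule matches this sequence of classifications
    return dict(zip(fields, tokens))


print(parse_name("Mr. John Smith"))
print(parse_name("Acme Widgets"))
```

The point of the sketch is the division of labor: the dictionary only says what each word could be, and the rules decide how a sequence of such words maps to output fields.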
Improve parsing results: The documentation also offers some tips for improving parsing results.

Define paths for dictionaries: By default, the Data Cleanse dictionary files are installed to the reference_data folder under your Data Quality installation location.

Scan and Split: Scan and Split is a specialized Formatter transform that allows you to split your field data into two or even three parts to better isolate names or other data from within complicated fields.
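As a rough illustration of splitting one field into up to three parts, here is a Python sketch. The `scan_and_split` function and the choice of splitting around a marker string are assumptions for the example, not the transform's actual options.

```python
def scan_and_split(field, marker):
    """Split a field into (before, marker, after) around the first occurrence of marker.

    Hypothetical helper: if the marker is not found, the whole field is returned
    as the first part and the other two parts are empty.
    """
    before, found, after = field.partition(marker)
    if not found:
        return (field.strip(), "", "")
    return (before.strip(), found.strip(), after.strip())


print(scan_and_split("John Smith c/o Acme Inc", "c/o"))
print(scan_and_split("John Smith", "c/o"))
```

The match here is case-sensitive and uses only the first occurrence of the marker.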

Search and Replace: For some less complex data manipulation tasks where speed is a priority, you can use the Search and Replace transform. When you use Search and Replace, you can search for:
  • a substring
  • a word
  • the entire contents of a field
and replace it with another value. The transform's functions include:
  • Convert coded data
  • Search and replace
  • Internal vs. external search and replace values
  • Leading and trailing spaces
  • Search order: Search table entries are sorted only by the number of characters in the search value. If there are multiple entries of the same length, they are not sorted further. This means that the transform will search for and replace the longest values first, followed by any shorter values.
  • Quick Replace.
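
The search order rule (longest search values first, no further ordering among equal lengths) can be sketched in Python as follows. The `search_and_replace` function and the sample table are hypothetical, and plain substring replacement stands in for whatever matching the transform really performs.

```python
def search_and_replace(value, table):
    """Apply (search, replacement) pairs to value, longest search value first.

    sorted() is stable, so entries of equal length keep their original
    relative order, mirroring "not sorted further".
    """
    ordered = sorted(table, key=lambda pair: len(pair[0]), reverse=True)
    for search, replacement in ordered:
        value = value.replace(search, replacement)
    return value


# Hypothetical search table; "St" is a prefix of "Ste".
table = [("St", "Street"), ("Ste", "Suite")]
print(search_and_replace("Ste 5", table))
print(search_and_replace("Main St", table))
```

If the shorter entry ran first in this example, "Ste 5" would become "Streete 5", which is why replacing the longest values first matters.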
