Wednesday 17 September 2008

Data Quality - Data Cleanse

Data Cleanse transform: identifies and isolates specific parts of mixed data, and standardizes your data based on information stored in the parsing and capitalization dictionary files, business rules defined in the rule file, and expressions defined in the pattern file.
  • Parse data: The Data Cleanse transform can identify and isolate a wide variety of data.
  • Standardize data: The Data Cleanse transform can standardize data to make your records more consistent, including case, punctuation, and acronyms.
  • Assign gender and prenames: The Data Cleanse transform can assign a precise gender code to each name: strong male, strong female, weak male, weak female, or ambiguous. For dual names, Data Cleanse also offers additional gender codes: female multi-name, male multi-name, mixed multi-name, and ambiguous multi-name.
  • Create personalized greetings
  • Create a separate output record for each person: If you expect multiple persons, firms, e-mail addresses, and so on in a single record, you may want to split that data out into separate records.
Related components:
  • Data_cleanse: The base Data Cleanse transform that includes a minimal set of options.
  • data_cleanse_en, data_cleanse_fr, etc.
  • Blueprints:
    • Batch address and data cleanse blueprint
    • Transactional address and data cleanse blueprint
  • Scan and Split
  • Search and Replace
Prepare records for matching: Standardize your data upstream from the match transforms using Data Cleanse.

International data
  • Customize greetings and prenames per country
  • Modify the phone file for other countries: Data Cleanse includes phone number patterns for many countries by default. However, if you need parsing for a country that is not included, you can modify the international phone file (drlphint.dat, a regular-expression file) to enable Data Cleanse to detect phone number patterns that follow a different format.
User-defined pattern: Data Cleanse can parse any kind of number or alphanumeric for which you can define a pattern.
  • The pattern label is created in the drludpm.dat file (a regular-expression file) when the pattern is defined.
  • For more information, refer to the BusinessObjects Data Quality XI Release 2 Data Cleanse Modifier’s Guide.
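As a rough illustration of pattern-based parsing, the Python sketch below matches labeled regular expressions against a mixed field. The labels, the expressions, and the `parse_patterns` helper are all hypothetical; they do not reproduce the actual drludpm.dat syntax, only the general idea of attaching a label to a pattern.

```python
import re

# Hypothetical pattern labels and expressions, for illustration only.
# They do not reproduce the drludpm.dat file format.
PATTERNS = {
    "PART_NUMBER": re.compile(r"\b[A-Z]{2}-\d{4}\b"),
    "ACCOUNT_ID": re.compile(r"\bACC\d{6}\b"),
}

def parse_patterns(text):
    """Return a list of (label, matched_text) pairs found in text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
    return found

print(parse_patterns("Order AB-1234 billed to ACC000123"))
# → [('PART_NUMBER', 'AB-1234'), ('ACCOUNT_ID', 'ACC000123')]
```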
Parsing dictionaries:
  • The parsing dictionary identifies and parses name, title, and firm data. The parser looks up words in the parsing dictionary to retrieve information. The parser then uses the dictionary information, as well as the rule file, to identify and parse name, title, and firm data.
Improve parsing results: The documentation also offers tips for improving parsing results.

Define paths for dictionaries: By default, the Data Cleanse dictionary files are installed to the reference_data folder under your Data Quality installation location.

Scan and Split: Scan and Split is a specialized Formatter transform that allows you to split your field data into two or even three parts to better isolate names or other data from within complicated fields.
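To make the idea of splitting a field into up to three parts concrete, here is a minimal Python sketch. It is not the transform's implementation; the `scan_and_split` function and the "c/o" marker are assumptions chosen for illustration.

```python
import re

def scan_and_split(field, marker):
    """Split field around the first case-insensitive occurrence of marker.

    Returns (before, matched_marker, after). If the marker is absent,
    the whole field is returned as the first part.
    """
    match = re.search(re.escape(marker), field, flags=re.IGNORECASE)
    if match is None:
        return field.strip(), "", ""
    before = field[:match.start()].strip()
    after = field[match.end():].strip()
    return before, match.group(), after

print(scan_and_split("Acme Corp c/o John Smith", "c/o"))
# → ('Acme Corp', 'c/o', 'John Smith')
```

Here the firm name lands in the first part and the person's name in the third, which is the kind of isolation a complicated field might need before further parsing.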

Search and Replace: For some less complex data manipulation tasks where speed is a priority, you can use the Search and Replace transform. When you use Search and Replace, you can search for:
  • a substring
  • a word
  • the entire contents of a field
and replace it with another value. The transform's features include:
  • Convert coded data
  • Search and replace
  • Internal vs. external search and replace values
  • Leading and trailing spaces
  • Search order: Search table entries are sorted only by the number of characters in the search value. If there are multiple entries of the same length, they are not sorted further. This means that the transform will search for and replace the longest values first, followed by any shorter values.
  • Quick Replace.
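The search order described above (longest search values first, no further sorting among values of equal length) can be sketched in Python as follows. The `search_and_replace` function and the sample table are hypothetical and show only the ordering rule for substring searches, not the transform's actual behavior.

```python
def search_and_replace(text, table):
    """Apply substring replacements, longest search value first.

    table is a list of (search_value, replacement) pairs. Python's sort is
    stable, so entries of equal length keep their original relative order.
    Replacements run one after another, so a later entry can also match
    text produced by an earlier one.
    """
    ordered = sorted(table, key=lambda entry: len(entry[0]), reverse=True)
    for search_value, replacement in ordered:
        text = text.replace(search_value, replacement)
    return text

table = [("St", "Street"), ("Ste", "Suite")]
print(search_and_replace("Ste 5, Main St", table))
# → Suite 5, Main Street
```

If the shorter value "St" were applied first, "Ste" would become "Streete", which is why replacing the longest values first matters.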

Data Quality - Address cleansing

Address cleansing: gives you back a corrected, complete, and standardized form of your original address data.
  • Verify that the locality, region, and postal codes agree with one another. If you have just a locality and region, the transform usually can add the postal code and vice versa (depending on the country).
  • Standardize the way the address line looks. For example, it can add or strip punctuation, or abbreviate or spell out the primary type (depending on what you want).
  • Identify any undeliverable addresses, such as vacant lots, condemned buildings, and so on (USA records only).
  • Assign diagnostic codes to help you find out why addresses were not assigned or how they had to be corrected. For a listing of these codes for the Global Address Cleanse transform and the USA Regulatory Address Cleanse transform, see the product documentation.
Input: The address cleanse transforms accept discrete, multiline, and hybrid address line formats.
  • Multiline
  • Discrete
  • Hybrid
Output:
  • Parsed address components, which correspond to the input fields, such as locality, region, and postal code.
  • Best address components, which are processed data standardized according to the options set in the transform.
  • Information about whether any data was changed, added, used, or not used in a corrected component.
Transforms:
  • Global Address Cleanse and plugins: Must be used with one or more of the plugins: Australia, Canada, Japan, Multi Country, or USA.
  • USA Regulatory Address Cleanse: Supports DPV, eLOT, EWS, GeoCensus, LACSLink, RDI, suggestion lists (not for certification), and Z4Change. With this transform you can create a USPS Form 3553.
  • Global Suggestion Lists: Offers suggestions for possible address matches for your global address data.
  • Country ID: Identifies the country of destination for the record and outputs an ISO code.
Set up the reference files:
  • Directories
  • Substitution files
Define the standardization options: Standardization changes the way the data is presented after an assignment has been made.