Friday, 19 September 2008

Mentha

Legend has it that the name of mint comes from Greek mythology. Hades, lord of the underworld, fell in love with the beautiful nymph Menthe, and his wife Persephone grew intensely jealous. To make Hades forget Menthe, Persephone turned her into an unremarkable little plant growing by the roadside, to be trampled by anyone passing by. Yet after the strong-willed and kind-hearted Menthe became a plant, she gave off a soothing, cool, enchanting fragrance, one that only grew stronger the more she was crushed and trodden upon. Though she had been turned into a humble herb, she came to be loved by more and more people. People called this herb mint (Mentha).


Thursday, 18 September 2008

A special day, thanks to the special you

Meeting you all was a kind of fate. Over this past year and more, I have felt happiness like never before and the most sincere friendship. Thank you for all the joy you have brought me!! And I wish my friends peace and happiness.

[Photo: IMG_1695]

Wednesday, 17 September 2008

Data Quality - Data Cleanse

Data Cleanse transform: Identifies and isolates specific parts of mixed data, and standardizes your data based on information stored in the parsing and capitalization dictionary files, business rules defined in the rule file, and expressions defined in the pattern file. (A small conceptual sketch follows the list below.)
  • Parse data: The transform can identify and isolate a wide variety of data.
  • Standardize data: The Data Cleanse transform can standardize data to make your records more consistent, including case, punctuation, and acronyms.
  • Assign gender and prenames: The Data Cleanse transform can assign a precise gender code to each name: strong male, strong female, weak male, weak female, and ambiguous. For dual names, Data Cleanse also offers additional gender codes: female multi-name, male multi-name, mixed multi-name, and ambiguous multi-name.
  • Create personalized greetings
  • Create a separate output record for each person: If you expect that you have multiple persons, firms, e-mail addresses, and so on in a single record, you may want to split that data out into separate records.
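A minimal conceptual sketch of these ideas in Python follows. It is not the Data Cleanse transform itself; the prename list, name dictionaries, and rules here are hypothetical stand-ins for the dictionary and rule files described above.

```python
# Conceptual sketch only -- not the actual Data Cleanse transform.
# Parses a mixed name field, standardizes capitalization, and assigns
# a rough gender code from hypothetical name dictionaries.

PRENAMES = {"mr", "mrs", "ms", "dr"}            # hypothetical dictionary data
STRONG_MALE = {"john", "robert", "james"}
STRONG_FEMALE = {"mary", "linda", "susan"}

def parse_name(raw: str) -> dict:
    """Split a raw name string into prename, given name, and family name."""
    tokens = raw.strip().split()
    prename = tokens[0].rstrip(".") if tokens and tokens[0].rstrip(".").lower() in PRENAMES else ""
    rest = tokens[1:] if prename else tokens
    return {
        "prename": prename,
        "given": rest[0] if rest else "",
        "family": rest[-1] if len(rest) > 1 else "",
    }

def standardize(parsed: dict) -> dict:
    """Standardize capitalization (Title Case) for each parsed component."""
    return {key: value.title() for key, value in parsed.items()}

def gender_code(given: str) -> str:
    """Assign a coarse gender code based on the given name."""
    name = given.lower()
    if name in STRONG_MALE:
        return "strong male"
    if name in STRONG_FEMALE:
        return "strong female"
    return "ambiguous"

record = standardize(parse_name("mr JOHN smith"))
print(record, gender_code(record["given"]))
# {'prename': 'Mr', 'given': 'John', 'family': 'Smith'} strong male
```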
Related components:
  • Data_cleanse: The base Data Cleanse transform that includes a minimal set of options.
  • data_cleanse_en, data_cleanse_fr, and so on: language- or region-specific versions of the base transform.
  • Blueprints:
    • Batch address and data cleanse blueprint
    • Transactional address and data cleanse blueprint
  • Scan and Split
  • Search and Replace
Prepare records for matching: Standardize your data upstream from the match transforms using Data Cleanse.

International data
  • Customize greetings and prenames per country
  • Modify the phone file for other countries: Data Cleanse includes phone number patterns for many countries by default. However, if you find that you need parsing for a country that is not included, you can modify the international phone file (drlphint.dat, a regular-expression file) to enable Data Cleanse to detect phone number patterns that follow a different format.
User-defined pattern: Data Cleanse can parse any kind of number or alphanumeric for which you can define a pattern (see the sketch after this list).
  • The pattern label is created in the drludpm.dat (regular expressions) file when the pattern is defined.
  • For more information, refer to the BusinessObjects Data Quality XI Release 2 Data Cleanse Modifier’s Guide.
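As a rough illustration of pattern-based parsing (this is not the drludpm.dat syntax; the label and pattern below are hypothetical):

```python
import re

# Conceptual illustration of pattern-based parsing. The "part number"
# label and its pattern are hypothetical, not the drludpm.dat syntax.
PART_NUMBER = re.compile(r"\b[A-Z]{2}-\d{4}-[A-Z]\d\b")

def label_part_numbers(text: str) -> list:
    """Return every substring of `text` that matches the part-number pattern."""
    return PART_NUMBER.findall(text)

print(label_part_numbers("Ordered AB-1234-C5 and XY-9876-Z1 last week"))
# ['AB-1234-C5', 'XY-9876-Z1']
```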
Parsing dictionaries:
  • The parsing dictionary identifies and parses name, title, and firm data. The parser looks up words in the parsing dictionary to retrieve information. The parser then uses the dictionary information, as well as the rule file, to identify and parse name, title, and firm data.
Improve parsing results: There are also a number of tips for getting better parsing results.

Define paths for dictionaries: By default, the Data Cleanse dictionary files are installed to the reference_data folder under your Data Quality installation location.

Scan and Split: Scan and Split is a specialized Formatter transform that allows you to split your field data into two or even three parts to better isolate names or other data from within complicated fields.

Search and Replace: For some less complex data manipulation tasks where speed is a priority, you can use the Search and Replace transform. When you use Search and Replace, you can search for:
  • a substring
  • a word
  • the entire contents of a field
and replace it with another value. Here are some details about its functionality:
  • Convert coded data
  • Search and replace
  • Internal vs. external search and replace values
  • Leading and trailing spaces
  • Search order: Search table entries are sorted only by the number of characters in the search value. If there are multiple entries of the same length, they are not sorted further. This means that the transform will search for and replace the longest values first, followed by any shorter values (see the sketch after this list).
  • Quick Replace.
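The longest-value-first ordering can be sketched as follows. The search table here is hypothetical, and this only models the described behavior, not the transform's actual implementation.

```python
import re

# Conceptual sketch of longest-value-first search and replace.
# The search table below is hypothetical example data.
search_table = {
    "St": "Street",
    "Ave": "Avenue",
    "N": "North",
}

def search_and_replace(value: str, table: dict) -> str:
    """Apply replacements, longest search value first (whole words only)."""
    for search in sorted(table, key=len, reverse=True):
        value = re.sub(rf"\b{re.escape(search)}\b", table[search], value)
    return value

print(search_and_replace("123 N Main St", search_table))
# 123 North Main Street
```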

Data Quality - Address cleansing

Address cleansing: gives you back a corrected, complete, and standardized form of your original address data.
  • Verify that the locality, region, and postal codes agree with one another. If you have just a locality and region, the transform usually can add the postal code and vice versa (depending on the country). A conceptual sketch follows this list.
  • Standardize the way the address line looks. For example, it can add or strip punctuation or abbreviate or spell out the primary type (depending on what you want).
  • Identify any undeliverable addresses, such as vacant lots, condemned buildings, and so on (USA records only).
  • Assign diagnostic codes to help you find out why addresses were not assigned or how they had to be corrected. For a listing of these codes for the Global Address Cleanse transform and the USA Regulatory Address Cleanse transform, refer to the reference documentation.
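Here is a minimal conceptual sketch of the locality/region/postal-code cross-check. The directory data, diagnostic strings, and function are hypothetical, not the transform's actual directories or codes.

```python
# Conceptual sketch of locality / region / postal-code cross-checking.
# The directory below is a tiny hypothetical sample.
POSTAL_DIRECTORY = {
    ("LA CROSSE", "WI"): "54601",
    ("MADISON", "WI"): "53703",
}

def verify_address(locality: str, region: str, postcode: str = "") -> tuple:
    """Return (postcode, diagnostic) after cross-checking locality, region, and postal code."""
    expected = POSTAL_DIRECTORY.get((locality.upper(), region.upper()))
    if expected is None:
        return postcode, "unassigned: locality/region not found in directory"
    if not postcode:
        return expected, "corrected: postal code added from locality and region"
    if postcode != expected:
        return expected, "corrected: postal code replaced"
    return postcode, "verified"

print(verify_address("La Crosse", "WI"))
# ('54601', 'corrected: postal code added from locality and region')
```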
Input: The address cleanse transforms accept discrete, multiline, and hybrid address line formats.
  • Multiline
  • Discrete
  • Hybrid
Output:
  • Parsed address components, which correspond to the input fields, such as locality, region, and postal code.
  • Best address components, which are processed data standardized according to the options set in the transform.
  • Information about whether any data was changed, added, used, or not used in a corrected component.
Transforms:
  • Global Address Cleanse and plugins: Must be used with one or more of the plugins: Australia, Canada, Japan, Multi Country, or USA.
  • USA Regulatory Address Cleanse: DPV, eLOT, EWS, GeoCensus, LACSLink, RDI, suggestion lists (not for certification), and Z4Change. With this transform you can create a USPS Form 3553.
  • Global Suggestion Lists: Offers suggestions for possible address matches for your global address data.
  • Country ID: Identifies the country of destination for the record and outputs an ISO code.
Set up the reference files:
  • Directories
  • Substitution files
Define the standardization options: Standardization changes the way the data is presented after an assignment has been made.
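To make the idea of standardization options concrete, here is a minimal sketch. The option names and the abbreviation table are hypothetical, not the transform's actual options.

```python
# Conceptual sketch of address-line standardization driven by options.
# Option names and the abbreviation table are hypothetical.
PRIMARY_TYPE_ABBREVIATIONS = {"STREET": "ST", "AVENUE": "AVE", "BOULEVARD": "BLVD"}

def standardize_address_line(line: str, abbreviate: bool = True, upper_case: bool = True) -> str:
    """Apply casing and primary-type abbreviation choices to an address line."""
    tokens = line.upper().split()
    if abbreviate:
        tokens = [PRIMARY_TYPE_ABBREVIATIONS.get(t, t) for t in tokens]
    else:
        spell_out = {v: k for k, v in PRIMARY_TYPE_ABBREVIATIONS.items()}
        tokens = [spell_out.get(t, t) for t in tokens]
    result = " ".join(tokens)
    return result if upper_case else result.title()

print(standardize_address_line("100 Main Street", abbreviate=True, upper_case=False))
# 100 Main St
```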

Tuesday, 16 September 2008

Data Quality - Core Concept

Project: A dataflow project. There are three types of projects (the main difference between them is really the Reader and Writer transforms used; a small conceptual sketch follows the list below).
  • Batch: A project run using batch processing. Batch processing executes a series of non-interactive projects all at one time. Batch processing is particularly useful for operations that require the computer or a peripheral device for an extended period of time. Once batch processing begins, it continues until it is done or until an error occurs. The new project is based on the newproject_batch.xml file, and opens with a Data Manager plugin.
  • Transactional: A project run using transactional processing. Transactional processing usually processes one record at a time. Transactional processing is accomplished with the Data Quality web service. The new project is based on the newproject_transaction.xml file, and opens with transactional Reader and Writer transforms and a Data Manager plugin.
  • Integrated Batch: A project run using batch processing with an Integrated Batch Reader and an Integrated Batch Writer transform. This type of project can be used to pass data to and from an integrated application, including BusinessObjects XI Data Integrator Release 2. The new project is based on the newproject_integratedbatch.xml file, and opens with integrated batch Reader and Writer transforms and a Data Manager plugin. (That is, it can be integrated with Data Integrator.)
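As a rough mental model (not Data Quality's actual API; the function names below are made up), the difference between batch and transactional processing looks roughly like this:

```python
# Conceptual sketch of the batch vs. transactional difference.
# Function names are illustrative, not a real Data Quality API.
def cleanse(record: dict) -> dict:
    """Stand-in for whatever transforms the dataflow applies to one record."""
    return {key: value.strip().upper() for key, value in record.items()}

def run_batch(records: list) -> list:
    """Batch: process an entire, finite set of records in one run."""
    return [cleanse(r) for r in records]

def handle_transaction(record: dict) -> dict:
    """Transactional: process one record per request (e.g. behind a web service)."""
    return cleanse(record)

print(run_batch([{"name": " alice "}, {"name": " bob "}]))
print(handle_transaction({"name": " carol "}))
```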
Transform: A transform consists of a group of options that perform a specific function (address cleansing, address validation, data cleansing, matching, and so on). A transform accepts data input from either a data source or another transform via a pipe, and will also output data to another transform or to a data target.
Compound Transform: A compound transform is a combination of transforms that shows up as a single entity on the canvas.
Plugin: A plugin is a special kind of transform. A plugin must always be associated with a transform of a specific type, or with an entire project.

Shared options: Transforms are made up of various files and subcomponents. With shared options, you can define a set of options for a given transform type, and then reuse those options among all transforms of that type.

Substitution variables: allow you to define a variable and a value for that variable.

Dataflow objects: All of the things you just read about, including projects, transforms, compound transforms, shared options, and so on.

Basic hierarchy of data objects

Levels of data objects

You will find more details in the reference objects panel in DQ.

Note: The override and reuse concept between different levels needs more attention and further reading.

Data record: A data record is a row of data. The data record is constructed at runtime.
Data collection: A data collection is a group of data records. Early in the dataflow process, a fixed number of data records are grouped into data collections. Later, in preparation for the matching process, a variable number of data records are grouped into data collections of candidate matches. Finally, in the matching process, data records are split into data collections of matching records and uniques. (A small conceptual sketch of this life cycle follows below.)
Data collections in the Reader transform: The records in a collection are read at the same time and passed to the next transform.
Data collections in the Match transform: The Match transform receives the new data collections of potential matches one at a time and compares them, and then splits the data collections again into groups of matching and unique data records.
Data collections after the Match transform: All of the transforms downstream from the Match transform operate on the new match or unique data collections one at a time.
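A minimal sketch of that collection life cycle around matching. The break key, grouping, and match rule below are made up and far simpler than the real Match transform.

```python
# Conceptual sketch of data collections around the Match transform.
# The break key and the match rule are deliberately simplistic.
from itertools import groupby

def break_into_candidates(records: list, key: str) -> list:
    """Group records into collections of candidate matches by a break key."""
    records = sorted(records, key=lambda r: r[key])
    return [list(group) for _, group in groupby(records, key=lambda r: r[key])]

def split_matches(collection: list) -> tuple:
    """Split one candidate collection into matching records and uniques
    using a trivial rule: same family name (last token) means a match."""
    by_family = {}
    for rec in collection:
        by_family.setdefault(rec["name"].split()[-1], []).append(rec)
    matches, uniques = [], []
    for group in by_family.values():
        (matches if len(group) > 1 else uniques).extend(group)
    return matches, uniques

records = [
    {"postcode": "54601", "name": "J SMITH"},
    {"postcode": "54601", "name": "JOHN SMITH"},
    {"postcode": "53703", "name": "A JONES"},
]
for collection in break_into_candidates(records, "postcode"):
    print(split_matches(collection))
```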


Low_Watermark and High_Watermark: These settings relate to performance and resource usage, and to synchronizing the processing speed between transforms.

Transactional project rules and tips:
  • One collection in, one collection out: Every record in the collection passes (or fails) the conditions set up in the Filter transform, and the collection is sent to only one output pipe.
  • Transactional Writer: You can only have one transactional Writer in a transactional dataflow.
  • Aggregator transform: You cannot use an Aggregator transform in a transactional project. This transform’s task is to create collections based on criteria you select. In a batch project, there is a finite number of records and the Aggregator knows when the records stop coming. In a transactional environment, the project is left “open,” and therefore the transform would always be waiting for more records.
  • Sorter transform: You can use a Sorter transform if you make sure that the Sort_Mode option is set to Collection_Sort.
  • Flat files. If you are writing to a flat file, this file remains open until the transactional project is closed. Therefore, you may not be able to use this file if the project is still open.
  • Batch Writers. You may find it useful to include a batch Writer in your transactional project to write to your database. You can route data from the transactional Writer to the batch Writer to save time in the future by allowing you to write to your database immediately after a transaction has occurred.
Field name rule (output/input):
field_type.transform_type.class.parent_component.field_name
Field overlapping is important!
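As a purely hypothetical illustration of how the dotted field name is composed (all component values below are made up):

```python
# Hypothetical illustration of the field-name pattern; the values are made up.
components = {
    "field_type": "output",
    "transform_type": "data_cleanse",
    "class": "individual",
    "parent_component": "person1",
    "field_name": "given_name1",
}
print(".".join(components.values()))
# output.data_cleanse.individual.person1.given_name1
```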

Data Profile 1

(From Wikipedia)

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

  1. Find out whether existing data can easily be used for other purposes
  2. Give metrics on data quality including whether the data conforms to company standards
  3. Assess the risk involved in integrating data for new applications, including the challenges of joins.
  4. Track data quality.
  5. Assess whether metadata accurately describes the actual values in the source database.
  6. Understand data challenges early in any data-intensive project, so that late project surprises are avoided. Finding data problems late in the project can incur time delays and project cost overruns.
  7. Have an enterprise view of all data, for uses such as Master Data Management where key data is needed, or Data governance for improving data quality
Some companies also look at data profiling as a way to involve business users in what traditionally has been an IT function. Line of business users can often provide context about the data, giving meaning to columns of data that are poorly defined by metadata and documentation.

Typical types of metadata sought are listed below (a small profiling sketch follows the list):

  • Domain: whether the data in the column conforms to the defined values or range of values it is expected to take
    • for example: ages of children in kindergarten are expected to be between 4 and 5. An age of 7 would be considered out of domain
    • A code for flammable materials is expected to be A, B or C. A code of 3 would be considered out of domain.
  • Type: Alphabetic or numeric
  • Pattern: a North American phone number should be (999)999-9999
  • Frequency counts: most of our customers should be in California; so the largest number of occurrences of state code should be CA
  • Statistics:
    • minimum value
    • maximum value
    • mean value (average)
    • median value
    • modal value(mode)
    • standard deviation
  • Interdependency:
    • Within a table: the zip code field always depends on the country code
    • Between tables: the customer number on an order should always appear in the customer table
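The statistics and pattern checks above are straightforward to compute. Below is a minimal, generic column-profiling sketch in Python; it is not any vendor's tool, and the example data is made up.

```python
# Minimal, generic column-profiling sketch; not any vendor's tool.
import re
import statistics

def profile_column(values: list) -> dict:
    """Compute basic profile metrics for one column of data."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    profile = {
        "count": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "frequency": {v: non_null.count(v) for v in set(non_null)},
    }
    if numeric:
        profile.update({
            "min": min(numeric),
            "max": max(numeric),
            "mean": statistics.mean(numeric),
            "median": statistics.median(numeric),
            "mode": statistics.mode(numeric),
            "stdev": statistics.stdev(numeric) if len(numeric) > 1 else 0.0,
        })
    return profile

def out_of_pattern(values: list, pattern: str) -> list:
    """Return values that do not match an expected pattern, e.g. a phone format."""
    rx = re.compile(pattern)
    return [v for v in values if not rx.fullmatch(v)]

print(profile_column([4, 5, 5, 4, 7, None]))  # kindergarten ages, with one null
print(out_of_pattern(["(608)555-1234", "555-1234"], r"\(\d{3}\)\d{3}-\d{4}"))
# ['555-1234']
```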
Broadly speaking, most vendors who provide data profiling tools also provide data quality tools. They often divide the functionality into three categories. The names for these categories often differ depending on the vendor, but the overall process is in three steps, which must be executed in order:

  • Column Profiling (Including the statistics and domain examples provided above)
  • Dependency Profiling, which identifies intra-table dependencies. Dependency profiling is related to the normalization of a data source, and addresses whether or not there are non-key attributes that determine or are dependent on other non-key attributes. The existence of transitive dependencies here may be evidence that the table is only in second normal form (a transitive dependency violates third normal form).
  • Redundancy Profiling, which identifies overlapping values between tables. This is typically used to identify candidate foreign keys within tables, to validate attributes that should be foreign keys (but that may not have constraints to enforce integrity), and to identify other areas of data redundancy. Example: redundancy analysis could tell the analyst that the ZIP field in table A contained the same values as the ZIP_CODE field in table B 80% of the time (see the sketch after this list).
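That redundancy example reduces to a simple overlap calculation. A minimal sketch, with made-up data:

```python
# Sketch of redundancy profiling: how often values in one column also
# appear in another column. The data below is made up.
def overlap_ratio(column_a: list, column_b: list) -> float:
    """Fraction of values in column_a that also appear in column_b."""
    b_values = set(column_b)
    hits = sum(1 for v in column_a if v in b_values)
    return hits / len(column_a) if column_a else 0.0

zip_a = ["54601", "53703", "10001", "99999", "53703"]
zip_code_b = ["54601", "53703", "10001"]
print(f"{overlap_ratio(zip_a, zip_code_b):.0%}")  # 80%
```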
Column profiling provides critical metadata which is required to perform dependency profiling, and as such must be executed before dependency profiling. Similarly, dependency profiling must be performed before redundancy profiling. While the output of previous steps may not be interesting to an analyst depending on his or her purpose, the analyst will most likely be obliged to move through these steps anyway. Other information delivery mechanisms may exist, depending on the vendor. Some vendors also provide data quality dashboards so that upper management, data governance teams, and C-level executives can track enterprise data quality. Still others provide mechanisms for the analysis to be delivered via XML. Often, these same tools can be used for ongoing monitoring of data quality.

Data profiling is needed to understand the data before creating an ETL process:
  • Check for missing values (NULL)
  • Get possible list of values
  • Visualize the data distribution
  • Find patterns
  • Get data ranges (min, max, average) – identify data domain outliers
  • Uniqueness of data (distinct values)
  • Referential integrity – understand relationships (sketched below)
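The referential-integrity item can be checked with a few lines of pandas. A minimal sketch, assuming hypothetical orders and customers tables:

```python
# Sketch of a referential-integrity check before building an ETL process.
# Table and column names are hypothetical.
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"order_id": [10, 11, 12], "customer_id": [1, 2, 99]})

# Orders whose customer_id has no match in the customers table (orphans).
orphans = orders[~orders["customer_id"].isin(customers["customer_id"])]
print(orphans)
#    order_id  customer_id
# 2        12           99
```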
What can Data Insight do?
  • Efficient, effective, data investigation
  • Interface designed for business users
  • Automated Summary Analysis and ref. integrity testing
  • Flexible and comprehensive column validation
  • Business rule auditing
  • Scheduling, trend analysis and continuous monitoring
  • Alert triggering and notification
  • Flexible reporting: PDF, XML, MS Word, Excel, etc.
  • Communicate business rules to data cleansing user

Wednesday, 10 September 2008

Taobao hiring requirements

Solid fundamentals, conscientious about your work, good at summing things up, full of ideas, able to handle pressure, and with at least one real specialty. That is all it takes.
A very incisive summary. :)
If you are interested, see the link:
http://rdc.taobao.com/blog/dba/html/203_taobao_jobs_2009.html