暗月猫的旅行: 16 Sept 2008

Tuesday, 16 September 2008

Data Quality - Core Concept

Project: A data flow project. There are 3 types of projects (the main difference between them is really just the Reader and Writer they use).

Batch: A project run using batch processing. Batch processing executes a series of non-interactive projects all at one time. Batch processing is particularly useful for operations that require the computer or a peripheral device for an extended period of time. Once batch processing begins, it continues until it is done or until an error occurs. The new project is based on the newproject_batch.xml file, and opens with a Data Manager plugin.
Transactional: A project run using transactional processing. Transactional processing usually processes one record at a time. Transactional processing is accomplished with the Data Quality web service. The new project is based on the newproject_transaction.xml file, and opens with transactional Reader and Writer transforms and a Data Manager plugin.
Integrated Batch: A project run using batch processing with an Integrated Batch Reader and an Integrated Batch Writer transform. This type of project can be used to pass data to and from an integrated application, including BusinessObjects XI Data Integrator Release 2. The new project is based on the newproject_integratedbatch.xml file, and opens with integrated batch Reader and Writer transforms and a Data Manager plugin. (It can be integrated with DI.)

Transform: A transform consists of a group of options that perform a specific function (address cleansing, address validation, data cleansing, matching, and so on). A transform accepts data input from either a data source or another transform via a pipe, and outputs data to another transform or to a data target.
Compound Transform: A compound transform is a combination of transforms that show up as a single entity on the canvas.
Plugin: A plugin is a special kind of transform. A plugin must always be associated with a transform of a specific type, or with an entire project.

Shared options: Transforms are made up of various files and subcomponents. With shared options, you can define a set of options for a given transform type, and then reuse those options among all transforms of that type.

Substitution variables: allow you to define a variable and a value for that variable.

Dataflow objects: All of the things you just read about, including projects, transforms, compound transforms, shared options, and so on.

Basic hierarchy of Data Object

Level of Data Objects

You can find more details in the reference object panel in DQ.

Note: The override and reuse concept between different levels needs more attention and further reading.

Data record: A data record is a row of data. The data record is constructed at runtime.
Data collection: A data collection is a group of data records. Early in the dataflow process, a fixed number of data records are grouped into data collections. Later, in preparation for the matching process, a variable number of data records are grouped into data collections of candidate matches. Finally, in the matching process, data records are split into data collections of matching records and uniques.
Data collections in the Reader transform: The records in a collection are read at the same time and passed to the next transform.
Data collections in the Match transform: The Match transform receives the new data collections of potential matches one at a time and compares them, and then splits the data collections again into groups of matching and unique data records.
Data collections after the Match transform: All of the transforms downstream from the Match transform operate on the new match or unique data collections one at a time.
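
To make the three stages concrete, here is a minimal Python sketch of how records could move through collections. It is only an illustration of the idea, not how Data Quality implements it: the function names, the sample records, the ZIP-based grouping key and the name comparison are all assumptions made up for this example.

```python
from collections import defaultdict
from itertools import islice


def fixed_size_collections(records, size):
    """Early stage: group records into collections of a fixed size."""
    it = iter(records)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk  # the last collection may be shorter


def candidate_collections(records, break_key):
    """Before matching: regroup into variable-size collections of candidates."""
    groups = defaultdict(list)
    for rec in records:
        groups[break_key(rec)].append(rec)
    return list(groups.values())


def split_matches(collection, same):
    """Matching: split one candidate collection into match groups and uniques."""
    groups = []
    for rec in collection:
        for group in groups:
            if same(group[0], rec):
                group.append(rec)
                break
        else:
            groups.append([rec])
    matches = [g for g in groups if len(g) > 1]
    uniques = [g[0] for g in groups if len(g) == 1]
    return matches, uniques


# Hypothetical sample data, not taken from any real project.
records = [
    {"name": "Ann Lee", "zip": "54601"},
    {"name": "ann lee", "zip": "54601"},
    {"name": "Bob Ray", "zip": "54601"},
    {"name": "Cy Dunn", "zip": "10001"},
]

batches = list(fixed_size_collections(records, 3))
candidates = candidate_collections(records, lambda r: r["zip"])
matches, uniques = split_matches(
    candidates[0], lambda a, b: a["name"].lower() == b["name"].lower()
)
```

Here the first candidate collection (ZIP 54601) is split into one match group holding the two "Ann Lee" records and one unique record, "Bob Ray".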

Low_Watermark and High_Watermark: Performance options used to keep the processing speeds of connected transforms in sync with each other.

Transactional project rules and tips:

One collection in, one collection out: Every record in the collection passes (or fails) the conditions set up in the Filter transform, and the collection is sent to only one output pipe.
Transactional Writer: You can only have one transactional Writer in a transactional dataflow.
Aggregator transform: You cannot use an Aggregator transform in a transactional project. This transform’s task is to create collections based on criteria you select. In a batch project, there is a finite number of records and the Aggregator knows when the records stop coming. In a transactional environment, the project is left “open,” and therefore the transform would always be waiting for more records.
Sorter transform: You can use a Sorter transform if you make sure that the Sort_Mode option is set to Collection_Sort.
Flat files: If you are writing to a flat file, this file remains open until the transactional project is closed. Therefore, you may not be able to use this file if the project is still open.
Batch Writers: You may find it useful to include a batch Writer in your transactional project to write to your database. You can route data from the transactional Writer to the batch Writer, which saves time later because the data is written to your database immediately after a transaction has occurred.

Field Name rule(output/input):
field_type.transform_type.class.parent_component.field_name
Field overlapping is important!
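
As a small aid for reading such names, the five-part naming rule can be split into its components with a few lines of Python. This is only a sketch: the helper `parse_field_name`, the `FieldName` tuple and the sample name are hypothetical and not part of Data Quality, and the `class` part is stored as `field_class` because `class` is a reserved word in Python.

```python
from typing import NamedTuple


class FieldName(NamedTuple):
    field_type: str
    transform_type: str
    field_class: str  # the "class" part of the rule
    parent_component: str
    field_name: str


def parse_field_name(qualified: str) -> FieldName:
    """Split a dotted name into the five parts of the naming rule."""
    parts = qualified.split(".")
    if len(parts) != 5:
        raise ValueError(f"expected 5 dot-separated parts, got {len(parts)}")
    return FieldName(*parts)


# Hypothetical name, made up only to show the shape of the rule.
parsed = parse_field_name("input.parser.data.address.city")
```

After parsing, `parsed.field_type` is "input" and `parsed.field_name` is "city"; a name with fewer or more than five parts raises a ValueError.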

Data Profile 1

(From Wikipedia)

Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:

Find out whether existing data can easily be used for other purposes
Give metrics on data quality including whether the data conforms to company standards
Assess the risk involved in integrating data for new applications, including the challenges of joins.
Track data quality.
Assess whether metadata accurately describes the actual values in the source database.
Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can incur time delays and project cost overruns.
Have an enterprise view of all data, for uses such as Master Data Management where key data is needed, or Data governance for improving data quality

Some companies also look at data profiling as a way to involve business users in what traditionally has been an IT function. Line of business users can often provide context about the data, giving meaning to columns of data that are poorly defined by metadata and documentation.

Typical types of metadata sought are:

Domain: whether the data in the column conforms to the defined values or range of values it is expected to take

For example: ages of children in kindergarten are expected to be between 4 and 5. An age of 7 would be considered out of domain.
A code for flammable materials is expected to be A, B or C. A code of 3 would be considered out of domain.

Type: Alphabetic or numeric
Pattern: a North American phone number should be (999)999-9999
Frequency counts: most of our customers should be in California; so the largest number of occurrences of state code should be CA
Statistics:

minimum value
maximum value
mean value (average)
median value
modal value (mode)
standard deviation

Interdependency:

Within a table: the zip code field always depends on the country code
Between tables: the customer number on an order should always appear in the customer table
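
The column-level checks listed above (domain, pattern, frequency counts, statistics) can be tried out with the Python standard library alone. The following is a rough sketch under stated assumptions: the sample values and the helper names `out_of_domain` and `pattern_of` are invented for illustration and do not come from any profiling tool.

```python
import re
import statistics
from collections import Counter


def out_of_domain(values, allowed):
    """Return the values that fall outside the expected domain."""
    return [v for v in values if v not in allowed]


def pattern_of(value):
    """Reduce a string to its pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


# Hypothetical sample columns.
ages = [4, 5, 4, 7, 5, 4]
codes = ["A", "B", "C", "3"]
phones = ["(608)555-0100", "608-555-0101"]
states = ["CA", "CA", "WI", "CA", "NY"]

bad_ages = out_of_domain(ages, range(4, 6))        # domain: 4 to 5
bad_codes = out_of_domain(codes, {"A", "B", "C"})  # domain: A, B or C
bad_phones = [p for p in phones if pattern_of(p) != "(999)999-9999"]
top_state, top_count = Counter(states).most_common(1)[0]

age_stats = {
    "min": min(ages),
    "max": max(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}
```

With this sample data the age 7 and the code "3" are flagged as out of domain, the second phone number fails the pattern check, and CA is the most frequent state code.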

Broadly speaking, most vendors who provide data profiling tools also provide data quality tools. They often divide the functionality into three categories. The names for these categories often differ depending on the vendor, but the overall process is in three steps, which must be executed in order:

Column Profiling (Including the statistics and domain examples provided above)
Dependency Profiling, which identifies intra-table dependencies. Dependency profiling is related to the normalization of a data source, and addresses whether or not there are non-key attributes that determine or are dependent on other non-key attributes. The existence of transitive dependencies here may indicate that the table is only in second normal form, not third.
Redundancy Profiling, which identifies overlapping values between tables. This is typically used to identify candidate foreign keys within tables, to validate attributes that should be foreign keys (but that may not have constraints to enforce integrity), and to identify other areas of data redundancy. Example: redundancy analysis could provide the analyst with the fact that the ZIP field in table A contained the same values as the ZIP_CODE field in table B, 80% of the time.
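
The last two steps can be sketched in Python as well. This is a simplified illustration, not a vendor algorithm: `dependency_holds` tests whether one column determines another within a table, and `overlap_ratio` measures how many values of one column also appear in a column of another table. Both function names and the sample tables are assumptions made for this example.

```python
def dependency_holds(rows, determinant, dependent):
    """True if every determinant value maps to exactly one dependent value."""
    seen = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        if seen.setdefault(key, value) != value:
            return False
    return True


def overlap_ratio(left_values, right_values):
    """Fraction of the left values that also appear among the right values."""
    left = list(left_values)
    if not left:
        return 0.0
    right = set(right_values)
    return sum(v in right for v in left) / len(left)


# Hypothetical sample tables.
table_a = [
    {"zip": "54601", "city": "La Crosse"},
    {"zip": "54601", "city": "La Crosse"},
    {"zip": "54603", "city": "La Crosse"},
    {"zip": "10001", "city": "New York"},
    {"zip": "94105", "city": "San Francisco"},
]
table_b = [
    {"zip_code": "54601"},
    {"zip_code": "54603"},
    {"zip_code": "10001"},
    {"zip_code": "60601"},
]

zip_determines_city = dependency_holds(table_a, "zip", "city")
city_determines_zip = dependency_holds(table_a, "city", "zip")
zip_overlap = overlap_ratio(
    (row["zip"] for row in table_a),
    (row["zip_code"] for row in table_b),
)
```

In this sample, zip determines city but not the other way around, and 4 of the 5 ZIP values in table A also appear as ZIP_CODE in table B, which corresponds to the 80% overlap described above.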

Column profiling provides critical metadata which is required to perform dependency profiling, and as such, must be executed before dependency profiling. Similarly, dependency profiling must be performed before redundancy profiling. While the output of previous steps may not be interesting to an analyst depending on his or her purpose, the analyst will most likely be obliged to move through these steps anyway. Other information delivery mechanisms may exist, depending on the vendor. Some vendors also provide data quality dashboards so that upper management, data governance teams and C-level executives can track enterprise data quality. Still others provide mechanisms for the analysis to be delivered via XML. Often, these same tools can be used for ongoing monitoring of data quality.

Data profiling is needed to understand the data before creating an ETL process:

Check for missing values (NULL)
Get possible list of values
Visualize the data distribution
Find patterns
Get data ranges (min, max, average) – identify data domain outliers
Uniqueness of data (distinct values)
Referential integrity – understand relationships
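
A few of the checks in this list (missing values, distinct values, ranges, referential integrity) could look like the following in plain Python before any ETL job is written. Treat it as a sketch: the table layout, the column names and the helpers `profile_column` and `orphan_keys` are hypothetical.

```python
def profile_column(rows, column):
    """Count NULLs and summarize the non-NULL values of one column."""
    values = [row.get(column) for row in rows]
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "distinct": sorted(set(present)),
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "average": sum(present) / len(present) if present else None,
    }


def orphan_keys(child_rows, child_col, parent_rows, parent_col):
    """Return child key values that have no matching parent row."""
    parents = {row[parent_col] for row in parent_rows}
    return sorted({r[child_col] for r in child_rows if r[child_col] not in parents})


# Hypothetical sample tables.
customers = [{"id": 1}, {"id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1, "amount": 25.0},
    {"order_id": 11, "customer_id": 2, "amount": None},
    {"order_id": 12, "customer_id": 3, "amount": 40.0},
]

amount_profile = profile_column(orders, "amount")
orphans = orphan_keys(orders, "customer_id", customers, "id")
```

Here the amount column has one NULL and ranges from 25.0 to 40.0, and customer_id 3 is reported as an orphan because no such customer exists.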

What can Data Insight do?

Efficient, effective, data investigation
Interface designed for business users
Automated summary analysis and referential integrity testing
Flexible and comprehensive column validation
Business rule auditing
Scheduling, trend analysis and continuous monitoring
Alert triggering and notification
Flexible reporting: PDF, XML, MS Word, Excel, etc.
Communicate business rules to data cleansing user