Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:
- Find out whether existing data can easily be used for other purposes
- Give metrics on data quality including whether the data conforms to company standards
- Assess the risk involved in integrating data for new applications, including the challenges of joins.
- Track data quality.
- Assess whether metadata accurately describes the actual values in the source database.
- Understand data challenges early in any data-intensive project, so that late surprises are avoided; finding data problems late in the project can incur time delays and cost overruns.
- Have an enterprise view of all data, for uses such as Master Data Management where key data is needed, or Data governance for improving data quality
Typical types of metadata sought are listed below (a short profiling sketch in code follows the list):
- Domain: whether the data in the column conforms to the defined values or range of values it is expected to take
  - For example: ages of children in kindergarten are expected to be between 4 and 5. An age of 7 would be considered out of domain.
  - A code for flammable materials is expected to be A, B or C. A code of 3 would be considered out of domain.
- Type: Alphabetic or numeric
- Pattern: a North American phone number should be (999)999-9999
- Frequency counts: if most of our customers are in California, the most frequent state code should be CA
- Statistics:
  - minimum value
  - maximum value
  - mean value (average)
  - median value
  - modal value (mode)
  - standard deviation
- Interdependency:
  - Within a table: the zip code field always depends on the country code
  - Between tables: the customer number on an order should always appear in the customer table
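As a rough illustration, the sketch below computes these statistics and runs the domain, pattern, and frequency checks described above, assuming pandas is available; the sample data, column names, and rules are hypothetical:

```python
import pandas as pd

# Hypothetical sample data; a real run would read the source table instead.
df = pd.DataFrame({
    "age":   [4, 5, 4, 7, 5, 5, None],
    "code":  ["A", "B", "C", "3", "A", "A", "B"],
    "state": ["CA", "CA", "NY", "CA", "TX", "CA", "CA"],
    "phone": ["(415)555-1234", "415-555-1234", "(212)555-0000",
              "(312)555-9999", "(213)555-1111", "(408)555-2222", "(510)555-3333"],
})

# Statistics for a numeric column
age = df["age"]
print("min:", age.min(), "max:", age.max())
print("mean:", age.mean(), "median:", age.median())
print("mode:", age.mode().tolist(), "std dev:", age.std())

# Domain check: kindergarten ages are expected to be 4 or 5
print(df[age.notna() & ~age.isin([4, 5])])

# Domain check: flammable-material codes must be A, B or C
print(df[~df["code"].isin(["A", "B", "C"])])

# Pattern check: North American phone numbers like (999)999-9999
print(df[~df["phone"].str.match(r"\(\d{3}\)\d{3}-\d{4}$")])

# Frequency counts: the most common state code is expected to be CA
print(df["state"].value_counts())
```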
Common data profiling activities include:
- Column Profiling (including the statistics and domain examples provided above)
- Dependency Profiling, which identifies intra-table dependencies. Dependency profiling is related to the normalization of a data source, and addresses whether or not there are non-key attributes that determine, or depend on, other non-key attributes. The existence of such transitive dependencies is evidence that the table is not in third normal form.
- Redundancy Profiling, which identifies overlapping values between tables. This is typically used to identify candidate foreign keys within tables, to validate attributes that should be foreign keys (but that may not have constraints to enforce integrity), and to identify other areas of data redundancy. Example: redundancy analysis could show the analyst that the ZIP field in table A contained the same values as the ZIP_CODE field in table B 80% of the time (see the sketch after this list).
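A minimal sketch of the dependency and redundancy checks, again assuming pandas; the table names (orders, customers) and columns are made up for illustration:

```python
import pandas as pd

orders = pd.DataFrame({"order_id": [1, 2, 3],
                       "customer_id": [10, 11, 12],
                       "zip": ["94105", "10001", "60601"]})
customers = pd.DataFrame({"customer_id": [10, 11, 12],
                          "zip_code": ["94105", "10001", "30301"]})

# Dependency profiling: does column X functionally determine column Y within one table?
def determines(df, x, y):
    # X -> Y holds if every distinct X value maps to exactly one Y value
    return (df.groupby(x)[y].nunique() <= 1).all()

print("customer_id -> zip:", determines(orders, "customer_id", "zip"))

# Redundancy profiling: what fraction of orders.zip values also appear in customers.zip_code?
overlap = orders["zip"].isin(customers["zip_code"]).mean()
print(f"orders.zip found in customers.zip_code: {overlap:.0%}")
```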
Data profiling helps you understand the data before creating an ETL process (a sketch of these checks follows the list):
- Check for missing values (NULL)
- Get possible list of values
- Visualize the data distribution
- Find patterns
- Get data ranges (min, max, average) – identify data domain outliers
- Uniqueness of data (distinct values)
- Referential integrity – understand relationships
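A minimal sketch of these pre-ETL checks for a single hypothetical source table, assuming pandas; the column names, sample values, and range threshold are assumptions for illustration only:

```python
import pandas as pd

src = pd.DataFrame({
    "customer_no": [100, 101, 101, None, 104],
    "state":       ["CA", "CA", "NY", "TX", None],
    "amount":      [25.0, 990.0, 14.5, 7.0, 120.0],
})
customers = pd.DataFrame({"customer_no": [100, 101, 102, 103]})

# Missing values (NULLs) per column
print(src.isnull().sum())

# Possible list of values and data distribution for a code column
print(src["state"].value_counts(dropna=False))

# Data ranges (min, max, mean) and domain outliers for a numeric column
print(src["amount"].describe())       # min, max, mean, quartiles
print(src[src["amount"] > 500])       # rows outside an expected range (example threshold)

# Uniqueness of data (distinct values) for the candidate key
print("distinct customer_no:", src["customer_no"].nunique(), "of", len(src))

# Referential integrity: customer numbers missing from the customer table
orphans = src[~src["customer_no"].isin(customers["customer_no"]) & src["customer_no"].notna()]
print(orphans)
```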
Typical capabilities of a data profiling tool include (a small rule-auditing sketch follows the list):
- Efficient, effective data investigation
- Interface designed for business users
- Automated Summary Analysis and ref. integrity testing
- Flexible and comprehensive column validation
- Business rule auditing
- Scheduling, trend analysis and continuous monitoring
- Alert triggering and notification
- Flexible reporting: PDF, XML, MS Word, Excel, etc.
- Communication of business rules to data cleansing users
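The column validation, business rule auditing, and alerting items above can be approximated with a small script; the rules, threshold, and notification step below are hypothetical placeholders for what a profiling tool would provide out of the box:

```python
import pandas as pd

df = pd.DataFrame({"age": [4, 5, 7, 5], "state": ["CA", "CA", "ZZ", "NY"]})

# Business rules expressed as (name, predicate) pairs; each predicate flags violating rows.
rules = [
    ("age in kindergarten range", lambda d: ~d["age"].between(4, 5)),
    ("state is a known code",     lambda d: ~d["state"].isin(["CA", "NY", "TX"])),
]

def audit(d, rules, alert_threshold=0.0):
    for name, violates in rules:
        rate = violates(d).mean()      # fraction of rows breaking the rule
        print(f"{name}: {rate:.0%} violations")
        if rate > alert_threshold:
            # A real tool would trigger a notification here (e-mail, ticket, dashboard alert)
            print(f"  ALERT: rule '{name}' exceeded threshold")

audit(df, rules)
```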