Step 1:
Determine Business Objectives
·
Customer
meetings.
·
Internal
meetings.
Step 2: Collect and Analyze Information
·
Data
gathering.
·
Business
process analysis.
Step 3: Identify Core Business Processes
(Data Marts)
·
Key
areas which are most important and their processes and relationship.
·
Performance
model.
Step 4: Construct a Conceptual Data Model
·
After
identifying the business processes, you can create a conceptual model of the
data.
·
Subject
areas, Facts and Dimensions.
·
Relate
the dimensions and relationships.
Step 5: Locate Data Sources and Plan Data
Transformations
·
Identify
where the critical information is and how to move it into the data warehouse
structure (Identify the sources).
·
Identify
the Data mapping.
·
Check
for Data Transformation need and develop a plan for the same to track the
changes.
Step 6: Set tracking duration (Storage
space)
·
Make
necessary alerts to maintain the DWH from failure.
Step 7: Implement the Plan
·
After
you've developed the plan, it provides a viable basis for estimating work and
scheduling the project.
·
Define
the phases and system and other requirements.
Building a Data
Warehouse
In general,
building any data warehouse consists of the following steps:
1.
Extracting the transactional data from the data
sources into a staging area
2. Transforming the
transactional data
3. Loading the
transformed data into a dimensional database
4. Building
pre-calculated summary values to speed up report generation
5. Building (or purchasing)
a front-end reporting tool
Data Transformation
·
Making identifiers
unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available
are translated to standard Male/Female/Unknown)
·
Convert null
values into standardized Not Available/Not Provided value
·
Convert phone
numbers, ZIP codes to a standardized form
·
Validate address
fields, convert them into proper naming, e.g. Street/St/St./Str./Str
Data Profiling
Data profiling is the process of examining the data
available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these
statistics may be to:
1. Find out whether existing data can easily be used for other
purposes
2. Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
3. Give metrics on data quality including whether the data conforms to particular standards or
patterns
4. Assess the risk involved in integrating data for new applications, including the challenges of joins
5. Discover metadata of the source database, including value patterns
and distributions, key candidates, foreign-key
candidates, and functional
dependencies
6. Assess whether known metadata accurately describes the actual
values in the source database
7. Understanding data challenges early in any data intensive
project, so that late project surprises are avoided. Finding data problems late
in the project can lead to delays and cost overruns.
8. Have an enterprise view of all data, for uses such as master data
management where key
data is needed, or data governance for improving data quality.
The benefits
of data profiling is to improve data quality, shorten the implementation cycle
of major projects, and improve understanding of data for the users.
Discovering
business knowledge embedded in data itself is one of the significant benefits
derived from data profiling.
Data profiling
is one of the most effective technologies for improving data accuracy in
corporate databases.
What are Slowly Changing Dimensions?
Slowly Changing Dimensions (SCD) - dimensions that change slowly over time, rather than changing on regular schedule, time-base. In Data Warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users assigning proper dimensions attribute value for given date. Example of such dimensions could be: customer, geography, and employee.
There are many approaches how to deal with SCD. The most popular are:
Slowly Changing Dimensions (SCD) - dimensions that change slowly over time, rather than changing on regular schedule, time-base. In Data Warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users assigning proper dimensions attribute value for given date. Example of such dimensions could be: customer, geography, and employee.
There are many approaches how to deal with SCD. The most popular are:
· Type 0 - The passive method
·
Type 1 -
Overwriting the old value
·
Type 2 -
Creating a new additional record
·
Type 3 -
Adding a new column
·
Type 4 -
Using historical table
·
Type 6 -
Combine approaches of types 1,2,3 (1+2+3=6)
Type 0 - The passive method. In this method, no special action is performed upon dimensional changes. Some dimension data can remain the same as it was first time inserted, others may be overwritten.
Type 1 - Overwriting the old value. In this method, no history of dimension changes is kept in the database. The old dimension value is simply overwritten be the new one. This type is easy to maintain and is often use for data which changes are caused by processing corrections (e.g. removal special characters, correcting spelling errors).
Before the change:
Customer_ID
|
Customer_Name
|
Customer_Type
|
1
|
Cust_1
|
Corporate
|
After the change:
Customer_ID
|
Customer_Name
|
Customer_Type
|
1
|
Cust_1
|
Retail
|
Type 2 - Creating a new additional record. In this methodology all history of dimension changes is kept in the database. You capture attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key (or other durable identifier). Also 'effective date' and 'current indicator' columns are used in this method. There could be only one record with current indicator set to 'Y'. For 'effective date' columns, i.e. start_date and end_date, the end_date for current record usually is set to value 9999-12-31. Introducing changes to the dimensional model in type 2 could be very expensive database operation so it is not recommended to use it in dimensions where a new attribute could be added in the future.
Before the change:
Customer_ID
|
Customer_Name
|
Customer_Type
|
Start_Date
|
End_Date
|
Current_Flag
|
1
|
Cust_1
|
Corporate
|
22-07-2010
|
31-12-9999
|
Y
|
After the change:
Customer_ID
|
Customer_Name
|
Customer_Type
|
Start_Date
|
End_Date
|
Current_Flag
|
1
|
Cust_1
|
Corporate
|
22-07-2010
|
17-05-2012
|
N
|
2
|
Cust_1
|
Retail
|
18-05-2012
|
31-12-9999
|
Y
|
Type 3 - Adding
a new column. In this type usually only the current and previous value of
dimension is kept in the database. The new value is loaded into 'current/new'
column and the old one into 'old/previous' column. Generally speaking the
history is limited to the number of column created for storing historical data.
This is the least commonly needed technique .
Before the change:
Before the change:
Customer_ID
|
Customer_Name
|
Current_Type
|
Previous_Type
|
1
|
Cust_1
|
Corporate
|
Corporate
|
After the change:
Customer_ID
|
Customer_Name
|
Current_Type
|
Previous_Type
|
1
|
Cust_1
|
Retail
|
Corporate
|
Type 4 - Using historical table. In this method a separate historical table is used to track all dimensions attribute historical changes for each of the dimension. The 'main' dimension table keeps only the current data e.g. customer and customer history tables.
Current table:
Customer_ID
|
Customer_Name
|
Customer_Type
|
1
|
Cust_1
|
Corporate
|
Historical table:
Customer_ID
|
Customer_Name
|
Customer_Type
|
Start_Date
|
End_Date
|
1
|
Cust_1
|
Retail
|
01-01-2010
|
21-07-2010
|
1
|
Cust_1
|
Oher
|
22-07-2010
|
17-05-2012
|
1
|
Cust_1
|
Corporate
|
18-05-2012
|
31-12-9999
|
Type 6 - Combine approaches of types 1,2,3 (1+2+3=6). In this type we have in dimension table such additional columns as:
· Current_type - for keeping current value of the attribute.
All history records for given item of attribute have the same current value.
·
Historical_type -
for keeping historical value of the attribute. All history records for given
item of attribute could have different values.
·
Start_date - for
keeping start date of 'effective date' of attribute's history.
·
End_date - for
keeping end date of 'effective date' of attribute's history.
·
Current_flag - for
keeping information about the most recent record.
In this method to capture attribute change we add a new record as in type 2.
In this method to capture attribute change we add a new record as in type 2.
The Current_type
information is overwritten with the new one as in type 1. We store the history
in a historical_column as in type 3.
Customer_ID
|
Customer_Name
|
Current_Type
|
Historical_Type
|
Start_Date
|
End_Date
|
Current_Flag
|
1
|
Cust_1
|
Corporate
|
Retail
|
01-01-2010
|
21-07-2010
|
N
|
2
|
Cust_1
|
Corporate
|
Other
|
22-07-2010
|
17-05-2012
|
N
|
3
|
Cust_1
|
Corporate
|
Corporate
|
18-05-2012
|
31-12-9999
|
Y
|
It's RUDE to
Read and Run!
Get involved
and leave your Comments in the Box Below. The more people get involved, the
more we all benefit.
So, leave your thoughts before you leave
the page.
It is a very informative and useful post thanks it is good material to read this post increases my knowledge. You can visit Logistics and Transportation in Malta
ReplyDeleteData migration involving Slowly Changing Dimensions (SCDs) requires careful planning and execution to ensure historical data integrity and consistency in your target system. Here's a breakdown of the key steps involved:
ReplyDelete1. Planning and Analysis:
Big Data Projects For Final Year Students
Image Processing Projects For Final Year
Deep Learning Projects for Final Year
Identify Source and Target Systems: Define the source system holding the data you're migrating and the target system where the data will reside.
Analyze Data Model Differences: Understand the data model structures in both systems, particularly focusing on the dimension tables involved. Identify any discrepancies in how dimensions are handled between source and target.
Choose SCD Approach: Select the appropriate SCD (Slowly Changing Dimension) approach based on your business needs. Common approaches include:
Type 1 (Overwrite): Existing data in the target dimension is overwritten with new data during migration. This approach is suitable for dimensions where historical data isn't critical.
Type 2 (Add New Row): A new row is created in the target dimension table for each new data record. This approach preserves historical data but can lead to table size growth.
Type 3 (Add Historical Marker): A new column is added to the target dimension table to flag the "current" record. This approach balances space efficiency and historical data preservation.