Translate

Data Migration Steps and SCD Changes


Step 1: Determine Business Objectives
·       Customer meetings.
·       Internal meetings.
·       Technical meetings.

Step 2: Collect and Analyze Information

·       Data gathering.
·       Business process analysis.

Step 3: Identify Core Business Processes (Data Marts)

·       Key areas which are most important and their processes and relationship.
·       Performance model.

Step 4: Construct a Conceptual Data Model

·       After identifying the business processes, you can create a conceptual model of the data. 
·       Subject areas, Facts and Dimensions.
·       Relate the dimensions and relationships.

Step 5: Locate Data Sources and Plan Data Transformations

·       Identify where the critical information is and how to move it into the data warehouse structure (Identify the sources).
·       Identify the Data mapping.
·       Check for Data Transformation need and develop a plan for the same to track the changes.

Step 6: Set tracking duration (Storage space)

·       Make necessary alerts to maintain the DWH from failure.


Step 7: Implement the Plan

·       After you've developed the plan, it provides a viable basis for estimating work and scheduling the project.
·       Define the phases and system and other requirements.

Building a Data Warehouse

In general, building any data warehouse consists of the following steps:
1.        Extracting the transactional data from the data sources into a staging area
2.       Transforming the transactional data
3.       Loading the transformed data into a dimensional database
4.      Building pre-calculated summary values to speed up report generation
5.       Building (or purchasing) a front-end reporting tool

Data Transformation
·       Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to standard Male/Female/Unknown)
·       Convert null values into standardized Not Available/Not Provided value
·       Convert phone numbers, ZIP codes to a standardized form
·       Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str

Data Profiling
Data profiling is the process of examining the data available in an existing data source (e.g. a database or a file) and collecting statistics and information about that data. The purpose of these statistics may be to:
1.      Find out whether existing data can easily be used for other purposes
2.     Improve the ability to search the data by tagging it with keywords, descriptions, or assigning it to a category
3.     Give metrics on data quality including whether the data conforms to particular standards or patterns
4.     Assess the risk involved in integrating data for new applications, including the challenges of joins
5.     Discover metadata of the source database, including value patterns and distributionskey candidatesforeign-key candidates, and functional dependencies
6.    Assess whether known metadata accurately describes the actual values in the source database
7.     Understanding data challenges early in any data intensive project, so that late project surprises are avoided. Finding data problems late in the project can lead to delays and cost overruns.
8.    Have an enterprise view of all data, for uses such as master data management where key data is needed, or data governance for improving data quality.

The benefits of data profiling is to improve data quality, shorten the implementation cycle of major projects, and improve understanding of data for the users. 
Discovering business knowledge embedded in data itself is one of the significant benefits derived from data profiling.
Data profiling is one of the most effective technologies for improving data accuracy in corporate databases.

What are Slowly Changing Dimensions?

Slowly Changing Dimensions (SCD) - dimensions that change slowly over time, rather than changing on regular schedule, time-base. In Data Warehouse there is a need to track changes in dimension attributes in order to report historical data. In other words, implementing one of the SCD types should enable users assigning proper dimensions attribute value for given date. Example of such dimensions could be: customer, geography, and employee.

There are many approaches how to deal with SCD. The most popular are:

·       Type 0 - The passive method
·       Type 1 - Overwriting the old value
·       Type 2 - Creating a new additional record
·       Type 3 - Adding a new column
·       Type 4 - Using historical table
·       Type 6 - Combine approaches of types 1,2,3 (1+2+3=6)


Type 0 - The passive method. In this method, no special action is performed upon dimensional changes. Some 
dimension data can remain the same as it was first time inserted, others may be overwritten.

Type 1 - Overwriting the old value. In this method, no history of dimension changes is kept in the database. The old dimension value is simply overwritten be the new one. This type is easy to maintain and is often use for data which changes are caused by processing corrections (e.g. removal special characters, correcting spelling errors).

Before the change: 
Customer_ID
Customer_Name
Customer_Type
1
Cust_1
Corporate

After the change: 
Customer_ID
Customer_Name
Customer_Type
1
Cust_1
Retail


Type 2 - Creating a new additional record. In this methodology all history of dimension changes is kept in the database. You capture attribute change by adding a new row with a new surrogate key to the dimension table. Both the prior and new rows contain as attributes the natural key (or other durable identifier). Also 'effective date' and 'current indicator' columns are used in this method. There could be only one record with current indicator set to 'Y'. For 'effective date' columns, i.e. start_date and end_date, the end_date for current record usually is set to value 9999-12-31. Introducing changes to the dimensional model in type 2 could be very expensive database operation so it is not recommended to use it in dimensions where a new attribute could be added in the future.

Before the change: 
Customer_ID
Customer_Name
Customer_Type
Start_Date
End_Date
Current_Flag
1
Cust_1
Corporate
22-07-2010
31-12-9999
Y

After the change: 
Customer_ID
Customer_Name
Customer_Type
Start_Date
End_Date
Current_Flag
1
Cust_1
Corporate
22-07-2010
17-05-2012
N
2
Cust_1
Retail
18-05-2012
31-12-9999
Y



Type 3 - Adding a new column. In this type usually only the current and previous value of dimension is kept in the database. The new value is loaded into 'current/new' column and the old one into 'old/previous' column. Generally speaking the history is limited to the number of column created for storing historical data. This is the least commonly needed technique .

Before the change: 
Customer_ID
Customer_Name
Current_Type
Previous_Type
1
Cust_1
Corporate
Corporate

After the change: 
Customer_ID
Customer_Name
Current_Type
Previous_Type
1
Cust_1
Retail
Corporate


Type 4 - Using historical table. In this method a separate historical table is used to track all dimensions attribute historical changes for each of the dimension. The 'main' dimension table keeps only the current data e.g. customer and customer history tables.

Current table
Customer_ID
Customer_Name
Customer_Type
1
Cust_1
Corporate

Historical table: 
Customer_ID
Customer_Name
Customer_Type
Start_Date
End_Date
1
Cust_1
Retail
01-01-2010
21-07-2010
1
Cust_1
Oher
22-07-2010
17-05-2012
1
Cust_1
Corporate
18-05-2012
31-12-9999


Type 6 - Combine approaches of types 1,2,3 (1+2+3=6). In this type we have in dimension table such additional columns as:
·       Current_type - for keeping current value of the attribute. All history records for given item of attribute have the same current value.
·       Historical_type - for keeping historical value of the attribute. All history records for given item of attribute could have different values.
·       Start_date - for keeping start date of 'effective date' of attribute's history.
·       End_date - for keeping end date of 'effective date' of attribute's history.
·       Current_flag - for keeping information about the most recent record.
In this method to capture attribute change we add a 
new record as in type 2.

The Current_type information is overwritten with the new one as in type 1. We store the history in a historical_column as in type 3. 

Customer_ID
Customer_Name
Current_Type
Historical_Type
Start_Date
End_Date
Current_Flag
1
Cust_1
Corporate
Retail
01-01-2010
21-07-2010
N
2
Cust_1
Corporate
Other
22-07-2010
17-05-2012
N
3
Cust_1
Corporate
Corporate
18-05-2012
31-12-9999
Y




It's RUDE to Read and Run!
Get involved and leave your Comments in the Box Below. The more people get involved, the more we all benefit.
So, leave your thoughts before you leave the page. 

2 comments:

  1. It is a very informative and useful post thanks it is good material to read this post increases my knowledge. You can visit Logistics and Transportation in Malta

    ReplyDelete
  2. Data migration involving Slowly Changing Dimensions (SCDs) requires careful planning and execution to ensure historical data integrity and consistency in your target system. Here's a breakdown of the key steps involved:

    1. Planning and Analysis:

    Big Data Projects For Final Year Students

    Image Processing Projects For Final Year

    Deep Learning Projects for Final Year


    Identify Source and Target Systems: Define the source system holding the data you're migrating and the target system where the data will reside.
    Analyze Data Model Differences: Understand the data model structures in both systems, particularly focusing on the dimension tables involved. Identify any discrepancies in how dimensions are handled between source and target.
    Choose SCD Approach: Select the appropriate SCD (Slowly Changing Dimension) approach based on your business needs. Common approaches include:
    Type 1 (Overwrite): Existing data in the target dimension is overwritten with new data during migration. This approach is suitable for dimensions where historical data isn't critical.
    Type 2 (Add New Row): A new row is created in the target dimension table for each new data record. This approach preserves historical data but can lead to table size growth.
    Type 3 (Add Historical Marker): A new column is added to the target dimension table to flag the "current" record. This approach balances space efficiency and historical data preservation.

    ReplyDelete