Dimensional modeling in Data Warehouse: An Ultimate Magical Guide (Part 1)

Dimensional Modeling

Dimensional Modeling (DM) is a data structure technique optimized for data storage in a Data warehouse. The purpose of dimensional modeling is to optimize the database for faster retrieval of data. A dimensional model in data warehouse is designed to read, summarize, analyze numeric information like values, balances, counts, weights, etc. In contrast, relation models are optimized for addition, updating and deletion of data in a real-time Online Transaction System. These dimensional and relational models have their unique way of data storage that has specific advantages. For instance, in the relational mode, normalization and ER models reduce redundancy in data. On the contrary, dimensional model arranges data in such a way that it is easier to retrieve information and generate reports.

Elements of Dimensional Data Model

Fact

Facts are the measurements/metrics or facts from business process. For a Sales business process, a measurement would be quarterly sales number.

Dimension

Dimension provides the context surrounding a business process event. In simple terms, they give who, what, where of a fact. In the Sales business process, for the fact quarterly sales number, dimensions would be –

  • Who – Customer Names
  • Where – Location
  • What – Product Name

In other words, a dimension is a window to view information in the facts.

Attributes

The Attributes are the various characteristics of the dimension in dimensional data modeling.

In the Location dimension, the attributes can be

  • State
  • Country
  • Zip code etc.

Attributes are used to search, filter, or classify facts. Dimension Tables contain Attributes

Fact Table

A fact table is a primary table in dimension modelling.
A Fact Table contains –

  1. Measurements/facts
  2. Foreign key to dimension table

fact table contains the numeric measures produced by an operational measurement event in the real world. At the lowest grain, a fact table row corresponds to a measurement event and vice versa. Thus the fundamental design of a fact table is entirely based on a physical activity and is not influenced by the eventual reports that may be produced. In addition to numeric measures, a fact table always contains foreign keys for each of its associated dimensions, as well as optional degenerate dimension keys and date/time stamps. Fact tables are the primary target of computations and dynamic aggregations arising from queries.

Characteristics of the Fact table

The fact table includes numerical values of what we measure. For example, a fact value of 20 might means that 20 widgets have been sold.
Each fact table includes the keys to associated dimension tables. These are known as foreign keys in the fact table.
Fact tables typically include a small number of columns.
When it is compared to dimension tables, fact tables have a large number of rows.

Dimension Table

  • A dimension table contains dimensions of a fact.
  • They are joined to fact table via a foreign key.
  • Dimension tables are de-normalized tables.
  • The Dimension Attributes are the various columns in a dimension table
  • Dimensions offers descriptive characteristics of the facts with the help of their attributes
  • No set limit set for given for number of dimensions
  • The dimension can also contain one or more hierarchical relationships

Characteristics of the Dimension table

Dimension tables contain the details about the facts. That, as an example, enables the business analysts to understand the data and their reports better.
The dimension tables include descriptive data about the numerical values in the fact table. That is, they contain the attributes of the facts. For example, the dimension tables for a marketing analysis function might include attributes such as time, marketing region, and product type.
Since the record in a dimension table is denormalized, it usually has a large number of columns. The dimension tables include significantly fewer rows of information than the fact table.
The attributes in a dimension table are used as row and column headings in a document or query results display.

Additive, Semi-Additive, and Non-Additive Facts

The numeric measures in a fact table fall into three categories. The most flexible and useful facts are fully additive; additive measures can be summed across any of the dimensions associated with the fact table. Semi-additive measures can be summed across some dimensions, but not all; balance amounts are common semi-additive facts because they are additive across all dimensions except time. Finally, some measures are completely non-additive, such as ratios. A good approach for non-additive facts is, where possible, to store the fully additive components  of the non-additive measure and sum these components  into the final answer set before calculating the final non-additive fact. This final calculation is often done in the BI layer or OLAP cube.

Steps of Dimensional Modelling

The accuracy in creating Dimensional modeling determines the success of data warehouse implementation. Here are the steps to create Dimension Model –

  1. Business Process Identification
  2. Granularity (level of detail) Identification
  3. Dimension Identification
  4. Facts
  5. Build Star

The model should describe the Why, How much, When/Where/Who and What of your business process.

#1. Identify the Business Process

Identifying the actual business process a data warehouse should cover. This could be Marketing, Sales, HR, etc. as per the data analysis needs of the organization. The selection of the Business process also depends on the quality of data available for that process. It is the most important step of the Data Modelling process, and a failure here would have cascading and irreparable defects.

#2. Identify the Granularity

The Granularity describes the level of detail for the business problem/solution. It is the process of identifying the lowest level of information for any table in data warehouse. If a table contains sales data for every day, then it should be daily granularity. If a table contains total sales data for each month, then it has monthly granularity.

During this stage, we need to answer questions like –

  1. Do we need to store all the available products or just a few types of products? This decision is based on the business processes selected for Data warehouse.
  2. Do we store the product sales information on a monthly, weekly, daily or hourly basis? This decision depends on the nature of reports requested by executives.
  3. How do the above two choices affect the database size?

Example of Granularity:

The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.
So, the grain is “product sale information by location by the day.”

#3. Identify the Dimensions

Dimensions are nouns like date, store, inventory, etc. These dimensions are where all the data should be stored. For example, the date dimension may contain data like a year, month and weekday.

Example of Dimensions:

The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.
Dimensions: Product, Location and Time
Attributes: For Product: Product key (Foreign Key), Name, Type, Specifications
Hierarchies: For Location: Country, State, City, Street Address, Name

#4. Identify the Fact

This step is co-associated with the business users of the system because this is where they get access to data stored in the data warehouse. Most of the fact table rows are numerical values like price or cost per unit, etc.

Example of Facts:

The CEO at an MNC wants to find the sales for specific products in different locations on a daily basis.
The fact here is Sum of Sales by product by location by time.

#5. Build Schema

In this step, we implement the Dimension Model. A schema is nothing but the database structure (arrangement of tables). There are two popular schemas –

  1. Star Schema

The star schema architecture is easy to design. It is called a star schema because diagram resembles a star, with points radiating from a center. The center of the star consists of the fact table, and the points of the star is dimension tables.

The fact tables in a star schema which is third normal form whereas dimensional tables are de-normalized.

  1. Snowflake Schema

The snowflake schema is an extension of the star schema. In a snowflake schema, each dimension are normalized and connected to more dimension tables.

Rules for Dimensional Modelling

Following are the rules and principles of Dimensional Modeling:

  • Load atomic data into dimensional structures.
  • Build dimensional models around business processes.
  • Need to ensure that every fact table has an associated date dimension table.
  • Ensure that all facts in a single fact table are at the same grain or level of detail.
  • It’s essential to store report labels and filter domain values in dimension tables
  • Need to ensure that dimension tables use a surrogate key
  • Continuously balance requirements and realities to deliver business solution to support their decision-making

Benefits of Dimensional Modeling

  • Standardization of dimensions allows easy reporting across areas of the business.
  • Dimension tables store the history of the dimensional information.
  • It allows to introduce entirely new dimension without major disruptions to the fact table.
  • Dimensional also to store data in such a fashion that it is easier to retrieve the information from the data once the data is stored in the database.
  • Compared to the normalized model dimensional table are easier to understand.
  • Information is grouped into clear and simple business categories.
  • The dimensional model is very understandable by the business. This model is based on business terms, so that the business knows what each fact, dimension, or attribute means.
  • Dimensional models are deformalized and optimized for fast data querying. Many relational database platforms recognize this model and optimize query execution plans to aid in performance.
  • Dimensional modelling in data warehouse creates a schema which is optimized for high performance. It means fewer joins and helps with minimized data redundancy.
  • The dimensional model also helps to boost query performance. It is more denormalized therefore it is optimized for querying.
  • Dimensional models can comfortably accommodate change. Dimension tables can have more columns added to them without affecting existing business intelligence applications using these tables.
Dimensional modeling is simple: 

Dimensional modeling methods make it possible for warehouse designers to create database schemas that business customers can easily hold and comprehend. There is no need for vast training on how to read diagrams, and there is no complicated relationship between different data elements.

Dimensional modeling promotes data quality: 

The star schema enable warehouse administrators to enforce referential integrity checks on the data warehouse. Since the fact information key is a concatenation of the essentials of its associated dimensions, a factual record is actively loaded if the corresponding dimensions records are duly described and also exist in the database.

By enforcing foreign key constraints as a form of referential integrity check, data warehouse DBAs add a line of defense against corrupted warehouses data.

Performance optimization is possible through aggregates: 

As the size of the data warehouse increases, performance optimization develops into a pressing concern. Customers who have to wait for hours to get a response to a query will quickly become discouraged with the warehouses. Aggregates are one of the easiest methods by which query performance can be optimized.

Faster Retrieval of Data:

Dimensional modeling merges the tables in the model itself, which enables users to retrieve data faster by running join queries compared to the other approaches. The denormalized schema of a dimensional model is optimized to run ad hoc queries. As a result, it greatly complements the business intelligence (BI) goals of an organization.

Better Understanding of Business Processes:

The information in a dimensional model is stored in fact and dimension tables. This categorization of data into facts and dimensions, and the entity-relationship structure of a dimensional model, present complex business processes in an easy-to-understand manner to analysts.

Flexible to Change:

Dimensional modeling framework makes the data warehouse design extensible. The design can be easily modified to incorporate any new business requirements or make any adjustments. New entities can be added in the model or layout of the existing ones can be changed to reflect modified business processes.

Degenerate Dimension Tables

degenerate dimensions are not physically implemented data structures. They commonly occur when the fact table’s grain is a single transaction (or transaction line). Degenerate dimension attributes exist in the fact table as a part of the primary key but have no corresponding dimension. Transaction control header numbers assigned by the operational business process are typically degenerate dimensions, such as order, ticket, credit card transaction, or check numbers. These degenerate dimensions are natural keys of the “parents” of the line items.


Even though there is no corresponding dimension table of attributes, degenerate dimensions can be quite useful for grouping together related fact tables rows. For example, retail point-of-sale transaction numbers tie all the individual items purchased together into a single market basket. In health care, degenerate dimensions can group the claims items related to a single hospital stay or episode of care.

Junk Dimension Tables

Junk dimensions are used to reduce the number of dimensions in the dimensional model and reduce the number of columns in the fact table.  A junk dimension combines two or more related low cardinality flags into a single dimension. An example of this may be car color (red, black, blue, etc.) and body style (sedan, van, SUV, etc.) As you can see these are limited in number and, if created as single dimensions, the dimensions would be limited to a single attribute. In order to eliminate these small dimensions, we create a single “junk” dimension which cross joins all possible attributes into a single dimension which will be used in the fact table.

Limitations of Dimensional Data Modelling

Although Dimensional Data Modelling is very crucial to any organisation, it has a few limitations that companies need to take care of when incorporating the concept into their applications. Some of those limitations are given below:

  • Designing and creating Schemas require domain knowledge about the data.
  • To maintain the integrity of Facts and Dimensions, loading the Data Warehouses with a record from various operational systems is complicated.
  • It is severe to modify the Data Warehouse operations if the organisation adopts the Dimensional technique and changes the method in which they do business.

Conclusion

Despite few limitations, the DDM technique has proved to be one of the simplest and efficient techniques to handle data in Data Warehouses till date.