What is star and snowflake schemas in data analysis?

star schema in data analysis

In this article, we will take a close look at the star schema and also touch upon the related snowflake schema when it comes to modeling data for Power BI. And while doing so, we will explore the topics of fact and dimension tables. And to understand this, we will need to make use of an example.

Well, for that, let's consider this table containing sales transactions for various restaurants which are part of a restaurant chain. The first column is a Restaurant ID, and you'll see here that restaurants 1 and 2, each have recorded a single transaction here, but there are two transactions recorded for restaurant 3. The table also records the customer who made the purchase, the date of the transaction, and also the total bill amount. In a star or snowflake schema, this can be considered a fact table.

Understanding fact tables and aggregations:

So what exactly is a fact table? Well, you can think of this as one which contains data with regards to a frequently occurring event. These could be various sales recorded at a retailer, production numbers at a factory, or medical procedures at a hospital. An important feature of a fact table is that it is usually filled with a lot of numbers and dates.

More precisely, information which can be aggregated in Power BI reports. So, if a table contains details about the production of a certain item, this could contain the batch number, the time of production, and maybe even the weight of the produced good. One thing to keep in mind is that a fact table may not contain a unique key. Or if you're familiar with relational databases, it's one which may not contain a primary key. Well, with this in mind, let's head back over to our fact table containing restaurant sales data. So, the Sale Amount column is something which can be aggregated, and this aggregation can be performed along various dimensions.

This can be done on the basis of the Sale Date, a Customer ID, or even the Restaurant ID, or even on the basis of some category present in a related table, such as a city or state. And speaking of related tables, this brings us to the topic of dimension tables, so these give us additional bits of information about the recorded facts in a fact table.

Understanding dimension tables:

A dimension table is related to a fact table, and this could be either a direct relationship or an indirect one via another dimension table.

As an example, think of a Customer table which is related to the Restaurant sales fact table, we just took a look at, which contains customer specific information such as their names, date of birth, and so on. We could also have a Restaurant table which is linked to it containing more details about the restaurants where the sales have been recorded. When it comes to dimension tables. Each row usually represents a specific entity which is unique in the table. That is, such tables can have a primary key. So in a customer table, there will just be one row for a specific customer. Given that dimension tables are related to fact tables, it is possible for us to perform aggregations on the data in the fact tables based on some of the fields in the dimension tables.

For instance, we could aggregate the sales for each customer to figure out the most lucrative customers for a restaurant chain. So heading back now to our fact table, we can see that this includes a Restaurant ID and a Customer ID, and these can be used in order to establish a relationship with a separate restaurant table and a customer table. And, of course, these will represent additional dimensions to this sales data. In fact, let's pull up each of these tables to see what these might look like.

On the right, we have the Sales table and then on the left we have a Restaurant table which contains details such as the restaurant name, city, address, and so on, along with a Customer table just below that which includes a name and a date of birth for each customer.

Cardinality in Data Relationships:

To set up the relationships between these two tables, well, consider that the Sales and Restaurant tables do have a column in common. ID in restaurant maps to Restaurant ID in Sales. Now in the Restaurant table, the ID is unique, that is, there is just one row corresponding to a specific ID. However, this is clearly not the case in the Sales table. Even in our example, there are two occurrences of the Restaurant ID of 3 which means that the relationship between Restaurant and Sales can be thought of as 1:many. Given that for each restaurant, there are multiple sales transactions in the Sales table. This also means that we can aggregate sales transactions for each restaurant. For example, to calculate the revenues recorded at each restaurant. And in fact, given that the Restaurant table contains a City column, we could also get the aggregate sales at each city. Moving along then, to the relationship between Customer and Sales and this happens via Customer ID in Sales, and ID in Customer.

Now in the Customer table, we can consider the ID column as representing a primary key, so just one row for a specific Customer ID. However, a customer may go on and record multiple transactions at the same restaurant or at multiple restaurants belonging to the same chain. This means that this relationship should also be modeled as 1:many. So when we talk about a relationship as 1:1, 1:many, and so on, we are in fact referring to the cardinality of the relationship. Now looking back at the Restaurant table, you will see that this contains a column called Category Code. This category could point to various types of restaurants, for example, delivery only, pick up only, drive-through, or sit down. And for each of these categories, there may be other bits of information which doesn't make sense to store in the Restaurant table. So we may have an entirely different table containing restaurant category details. Let's just expand a schema to get a sense for what the tables might look like, so we have Sales at the heart of everything and this is the fact table, and this has direct relationships with Restaurant and Customer. These are the dimension tables around Sales.

However, these are not the only ones. Restaurant and Customer could be related to other tables of their own. For example, the Restaurant table is linked to Restaurant Category, while a Customer table could be associated with a Loyalty Program table. So while there is just a single fact table in the middle, there are various dimensions to these facts. Starting from the customers and restaurants to restaurant categories, and loyalty programs. This type of design is very common when it comes to Power BI and data analysis and is termed the star schema. This is when there is a single fact table in the middle and is related to a number of different dimensions around it. Strictly speaking, however, the star schema is set to apply when a fact table is directly related to a number of dimension tables. However, when the dimension tables themselves have further dimensions, as in this example. This is referred to as a snowflake schema. With that said, let's take a look at some of the features of a star schema.

First of all, the fact tables which lie at their heart can be very, very large, potentially going into the millions of rows given that they usually contain records of frequently occurring events such as sales transactions at a retailer or the streaming of a video on a video streaming service. On the other hand, the dimension tables which surround them are usually much smaller, since they may contain unique entities.

Conclusion:

The dimension tables are typically meant to help with aggregating the data in the fact tables. In our example, we already discussed that it is possible for us to aggregate the sales for each restaurant or even each city, as long as these fields are present in the dimension tables. Furthermore, a star schema usually makes it much easier to manage large datasets. By separating the various dimensions from the facts themselves, we can avoid duplication of data, for example, by separating the customer information from the fact tables. We don't need to store the customer name for each transaction they have recorded. This usually means less storage and also improve performance overall.

W3google