How to optimize data models for better data analysis?
We will now discuss something that is crucial but often neglected when designing data models for Power BI: building models that are optimal in terms of both simplicity and performance.
To start off, let's take a look at a common practice when designing models, not just for Power BI, but for any tool where data needs to be stored. A data analyst will usually need to gather a lot of data to present for analysis. When doing so, it is tempting to load in all the data you have access to, regardless of whether it is really needed, on the grounds that you might need it at some point. The problem with this approach is that, over time, it makes everything much harder to maintain.
For example, you may need to change the display format of all numeric columns so that they are shown as whole numbers. You will then have to apply this change even to columns that are not referenced by any visualization, simply because you cannot be sure whether a given column is in fact used by a visual.
So, unnecessary data leads to clutter, but more significantly, it can also degrade performance. To understand this, consider that much of the data may be stored in import mode, in which case it is all loaded into memory. If unnecessary data is loaded into memory, it won't be long before visualizations start performing rather slowly, which is why it helps to follow a general rule of thumb: keep everything as simple as possible.
In fact, try to follow a somewhat minimalist approach; it will save a lot of headaches further down the line. Of course, it's easy to say that we need to keep things simple, but what exactly does this translate to for Power BI models? It can be as simple as removing any unnecessary rows or columns from the imported data. When it comes to data analysis, details such as customers' email addresses and phone numbers won't be particularly helpful, so they can simply be eliminated altogether.
Minimizing Relationships for Better Efficiency:
Another factor here is to make sure that data is not duplicated. This is similar to normalization in relational databases, and the justifications for normalization apply to Power BI as well. Duplication of data can lead to inconsistencies over time, where one copy of the data is updated and another is not, which means that any visualizations built on this data could portray incorrect information. Furthermore, it helps to combine tables whenever possible. If your task is to analyze sales data, your sources may come from 12 different files, one for each month. When loading these files, 12 different tables will be created by default, in which case you are best off combining them into a single unit, as sketched below.
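The usual place to append the monthly files is Power Query, so that only the combined table is ever loaded. Purely as a rough sketch of the idea, a DAX calculated table can also union previously loaded tables; the monthly table names here are hypothetical:

```
-- Hypothetical monthly tables; in practice, append the source files in
-- Power Query so that only the combined table reaches the model.
-- UNION requires the tables to have the same number of columns.
Sales =
UNION (
    Sales_January,
    Sales_February,
    Sales_March
    -- ... and so on for the remaining months
)
```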
Another area where models can be simplified is relationships. It is best to minimize the total number of relationships between tables and also to keep the cardinality of each one as low as possible. So, we should avoid many:many relationships when a 1:many relationship will suffice. And on the topic of relationships, we should definitely avoid circular relationships between tables.
Common Performance Bottlenecks:
One very common source of performance problems is overly complex visualizations. If you have far too many categories, and also slicers associated with your visuals, remember that each interaction with a visual or a slicer involves some computational work. Too many visualizations also mean that a lot of queries need to be executed against the data under the hood. Even DAX queries can slow things down if they have not been written carefully.
Just like any code, DAX formulas can be made needlessly complex, with too many nested function calls, filters not being used when they could help, and so on; a short illustration follows this paragraph. Furthermore, relationships that are not properly defined can also slow things down, so if a relationship can be set as 1:many, we should do so, given that many:many relationships perform rather poorly. And then there is Power BI's automatic date/time feature. This auto-detects any date columns in your data and applies date hierarchies to them by creating a hidden date table for each such column, which adds to the size of the model. It can be a very useful feature, but don't use it if you don't need it, especially since it is turned on by default.
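As a minimal sketch of the filtering point, assuming a Sales table with Amount and Region columns (hypothetical names), both measures below compute sales for one region, but the second lets CALCULATE apply a simple column filter instead of iterating the whole table:

```
-- Hypothetical Sales table with Amount and Region columns.

-- Iterating FILTER over the entire table evaluates the condition row by row:
West Sales Slow =
SUMX (
    FILTER ( Sales, Sales[Region] = "West" ),
    Sales[Amount]
)

-- A simple column filter inside CALCULATE usually does far less work:
West Sales Fast =
CALCULATE (
    SUM ( Sales[Amount] ),
    Sales[Region] = "West"
)
```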
So given these potential bottlenecks, what can we do to simplify things and improve performance overall? Well, when it comes to DAX formulas, if you happen to have a rather long and complex one, using variables can help.
Much like in any programming language, you can use a variable to store the result of an operation and then reference that value multiple times within the formula; the alternative is to perform the same operation over and over again, as the sketch below shows. Another important consideration is to do as little work on the data within Power BI as possible. So, if another team sends you a file with your source data, consider asking them to perform whatever aggregations, filtering or trimming you require before sending the file along. That way, you can focus on generating visualizations from the data rather than on preprocessing it.
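Here is a minimal sketch of the variables point, assuming a Sales table with an Amount column and a related 'Date' table (hypothetical names); without the variables, the prior-year expression would have to be written, and evaluated, twice:

```
-- Hypothetical measure: table and column names are assumptions.
Sales YoY Growth =
VAR CurrentSales = SUM ( Sales[Amount] )
VAR PriorSales =
    CALCULATE (
        SUM ( Sales[Amount] ),
        DATEADD ( 'Date'[Date], -1, YEAR )   -- needs a contiguous date column
    )
RETURN
    -- PriorSales is reused here instead of repeating the CALCULATE above
    DIVIDE ( CurrentSales - PriorSales, PriorSales )
```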
And to reiterate a point touched upon earlier, keep relationships simple: avoid connecting a table to multiple other tables unless necessary, and definitely avoid circular relationships, since these can seriously muddle aggregations. To help identify bottlenecks and potentially fix them, we can make use of Power BI's Performance Analyzer utility. It shows the time taken by various operations and guides us towards any optimizations that need to be carried out.
Optimizing Direct Query Mode:
Moving along, let's take a look at some of the optimizations that can be performed in the context of Direct Query. You will recall that Direct Query is a storage mode where the data is not cached by the Power BI service but is instead queried from the underlying source when required. Since the data is retrieved as needed, this ensures that we always get the latest data. Of course, it also means that the underlying data source must be queried each time we ask for the data.
Such a query could be triggered by simply interacting with a visualization or loading a new report. When we are sending queries to the underlying source, very often a relational database, the performance of that source becomes a huge factor. If the database is slow, then querying the data will take some time, which means that our visualizations will take much longer to load. If that database is located on a different machine on the network, network latency also plays a role. So, if your database is located in a different corner of the world, perhaps Direct Query is not the best storage mode. If your visuals are performing poorly and some of the data is retrieved using Direct Query, you should look into this as a possible source of your bottleneck.
If you're unsure, you can always make use of the Performance Analyzer utility. Now, if the freshness of the data is absolutely crucial, then you will need to use Direct Query, in which case it helps to know how such models can be optimized. First of all, make sure that the data source itself is running in an optimal manner. If it happens to be a relational database, make sure that the queried fields are indexed, so that the database does not need to perform a full table scan.
Furthermore, make sure that you only query for the data you really need, so that the quantity of data sent over the network, and with it the impact of network latency, is minimized. Beyond that, we need to consider that interactions with visualizations can be rather compute intensive even when the data has been cached; for visuals built on Direct Query data, where each interaction can hit the source, we should try to minimize interactions as much as possible.
Now, when it comes to visualizations, we can apply filters and slicers in order to get summaries for just a subset of the data. In case you haven't used slicers before, each time you interact with one, it triggers a query on the underlying data. To avoid queries caused by accidental interactions, you can add an Apply button to the slicer, so that you can interact with the slicer freely and a query is triggered only when the Apply button is clicked.
Conclusion:
So, by removing unnecessary data that is not relevant to the analysis, simplifying the relationships between tables, optimizing DAX queries, and improving Direct Query performance, we can build data models that are faster and smoother, which plays a very important role when working with large amounts of data.