Data Reflections course - 8: kellylynch

kellylynch

Data Reflections course - 8

Mar 22, 2022 12:18

10 reflection best practices
Here are some considerations and suggestions when designing your raw and aggregate reflections. By no means are they exhaustive, and each deployment will have nuances in terms of data and resources that strongly influence the appropriate best practices.
In general, Dremio recommends you experiment and measure, and ask questions as you develop your use cases.
1. Organize your queries by pattern
It can be very helpful to understand your query history and analyze groups of queries with common patterns. By grouping queries into common patterns, you might be able to more easily determine the data reflections that will have the biggest impact on the largest set of queries. These groupings can be based on:
- datasets
- types of aggregations
- fields used for filters
- other factors

Keep in mind that a single query can use multiple reflections and a single reflection can serve many queries. Administrators should optimize for the overall workloads based on requirements for cost, storage, response time, and data freshness.
2: Design reflections for query patterns
Dremio supports two types of reflections -- raw and aggregate:
Raw reflections are appropriate for row-level information

Aggregate reflections are appropriate for queries that summarize data based on GROUP BY expressions or aggregations (e.g., SUM, AVG, COUNT, MIN, MAX),
When determining how to accelerate workloads, administrators should consider information provided in the query profile to understand which stages of the query plans are most expensive to perform. For example, it may be the case that joins are the dominant factor in a query plan, and that while aggregations consume some of the resources, they are relatively minor compared to the joins. In such a case, a raw reflection may be more appropriate than an aggregate reflection as it can be used to accelerate a broader range of queries.

3: Understand the cardinality of key fields
For any of the commonly used fields, it is useful to develop a sense of the data cardinality. Fields with low cardinality are good candidates for partitioning if they are frequently used in filters. Fields with low cardinality are also good candidates for use as dimensions in aggregate reflections.
Fields with high cardinality can significantly increase the size of aggregate reflections as the number of unique values has a combinatorial effect on size. Similarly, fields with high cardinality should be avoided for use as a partitioning key. In general, Dremio recommends no more than 10,000 unique values be used for partitioning.

4: Think in terms of several discrete reflections
Data Reflections allow data engineers to be iterative in their approach to performance optimization. Because reflections do not require any change in the behavior of the data user, you can add and refine reflections on an ongoing basis with little to no impact to ongoing workloads.
Generally speaking, Dremio recommends creating a reflection strategy that employs on the order of tens of reflections, each configured for different query patterns, rather than a single uber reflection that works for all queries.

5: Use supporting anchor datasets for general optimizations like joins and calculated fields
As you develop a better understanding of your query patterns and reflections, you can refactor your design to create a smaller number of generalized reflections that make operating Dremio more efficient overall. When undertaking this kind of optimization work, it is useful to organize reflections using supporting anchor datasets which are not part of the user-accessible semantic data model.
For example, you could observe that users employ a common join pattern across three tables. Instead of creating a new VDS for them that joins the tables, you could create a supporting anchor dataset that includes the joins, and build a reflection on this VDS. Dremio will substitute these reflections as appropriate without forcing your users to use a new VDS with the joins pre-defined. The same logic applies to calculated fields, common filtering patterns, and other optimizations.

6: Understand your data freshness requirements
Dremio peforms updates to reflections at a rate determined by a refresh interval configured on the data source, but it can be overridden by settings on the table level. Develop a good understanding of the requirements for your data freshness, with the understanding that more frequent refreshes will be more resource intensive, in two respects:
- The frequency of data access at the source
- The number of compute cycles on your Dremio cluster performing reflection builds
7: Use partitions to optimize queries that apply a selective filter on low cardinality fields
When administrators select a field for partitioning in a reflection, Dremio will physically group records together into a common directory on the file system. For example, when partitioning by COUNTRY where the values of this column are US, UK, DE, CA, etc, Dremio will store data together into directories called US, UK, DE, CA, etc. This optimization allows Dremio to scan a subset of the directories based on the query in an optimization called partition pruning. If a user queries on records that have COUNTRY IN then Dremio can apply partition pruning to scan only the US and UK directories, significantly reducing the total data that is scanned for the query.
When defining a partitioning column for a reflection, consider the following:
- Is the field used in many queries?
- Are there relatively few unique values in the field (low cardinality)

To partition the data, Dremio must first sort all the records, which will consume resources. Accordingly, data should only be partitioned on fields that can be used to optimize queries. In addition, the number of unique values for a field should be relatively small (eg, COUNTRY) to provide a relatively small number of partitions. If all values in a field are unique, in contrast, the work to partition will be a high cost for a relatively small benefit. In general, Dremio recommends the total number of partitions for a dataset be less than 10,000 unique values.
8: Use sorted fields to optimize equality and inequality queries
The sort option is useful for optimizing filters and range queries, especially on columns with high cardinality. Dremio can take advantage of sorted data to more efficiently apply filters and range predicates to queries. If sorting is enabled, during query execution Dremio skips over large blocks of records based on filters on sorted columns.
Typically it is not beneficial to sort on more than one column in a single Data Reflection as this does not improve read performance significantly but it will increase the costs of maintenance tasks.
For workloads that need to support multiple sort options, consider multiple reflections where each is sorted on a single column.
At this time Dremio does not use the sort option to optimize sort operations. While the data is sorted locally within a node (and partition, if applicable), most Data Reflections span multiple nodes and partitions, and there is no global sort applied across all files. Ultimately the data must still be sorted at query time.
9: Use EXPLAIN PLAN to test your data reflections

You can prepend EXPLAIN PLAN FOR to any SQL query in Dremio to view the query plan without running the query. In the job history, the query plan will show you everything you would view on the query plan for an executed query, with a few exceptions. By using EXPLAIN PLAN you can get a sense for how your queries will be executed without having Dremio do all the heavy lifting of running the actual query
10: Think of data reflections independent of your virtual data model

The purpose of spaces, directories, and virtual datasets in Dremio is to allow you to design a semantic model for your data that makes sense for your end users. Dremio separates the logical sense of your data from the physical storage and optimization of the data. The benefit of this approach is that you can be more agile: you can iterate on your reflections without forcing the behavior of your users to change. Through reflections and supporting anchor datasets, you can develop acceleration strategies that improve the experience of your end users on an ongoing basis, as their needs and your resources evolve.

dremio (learning)