
Last updated: April 09, 2024

How to configure data quality checks and rules

Read this guide to learn how to configure data quality checks in DQOps in YAML files, and how to set up the validation rules.

Where are the data quality checks configured?

Data quality checks are configured on monitored tables and columns in the <schema_name>.<table_name>.dqotable.yaml YAML files. These files are placed in the $DQO_USER_HOME/sources/<connection_name>/ folders in the DQOps user home. The role and layout of the DQOps user home folder are described on the DQOps user home page.

Read the configuration of data sources to know how to add data sources. The concept of the table metadata in DQOps describes how to manage the .dqotable.yaml table metadata files.

Configuring data quality checks from the DQOps user interface

Follow the running data quality checks manual to see how to use the user interface to configure the data quality checks.

YAML file structure

The following example of a table metadata file shows the location of data quality check nodes where the parameters for the table-level data quality checks and data quality rules are defined.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  timestamp_columns:
    event_timestamp_column: date
    partition_by_column: date
  incremental_time_window:
    daily_partitioning_recent_days: 7
    monthly_partitioning_recent_months: 1
  profiling_checks: # (1)!
    ...
  monitoring_checks: # (2)!
    daily: # (3)!
      ...
    monthly: # (4)!
      ...
  partitioned_checks: # (5)!
    daily: # (6)!
      ...
    monthly: # (7)!
      ...
  1. The node where profiling checks are configured at a table level.
  2. The node where monitoring checks are configured at a table level.
  3. The node where daily monitoring checks are configured at a table level.
  4. The node where monthly monitoring checks are configured at a table level.
  5. The node where partition checks are configured at a table level.
  6. The node where daily partition checks are configured at a table level.
  7. The node where monthly partition checks are configured at a table level.

Table-level configuration elements

The nodes where the table-level data quality checks are configured in the .dqotable.yaml file are described in the table below.

| Line | Element path | Description |
|------|--------------|-------------|
| 11 | spec.profiling_checks | The node where profiling checks are configured at a table level. |
| 13 | spec.monitoring_checks | The node where monitoring checks are configured at a table level. |
| 14 | spec.monitoring_checks.daily | The node where daily monitoring checks are configured at a table level. |
| 16 | spec.monitoring_checks.monthly | The node where monthly monitoring checks are configured at a table level. |
| 18 | spec.partitioned_checks | The node where partition checks are configured at a table level. |
| 19 | spec.partitioned_checks.daily | The node where daily partition checks are configured at a table level. |
| 21 | spec.partitioned_checks.monthly | The node where monthly partition checks are configured at a table level. |

Configuring table-level checks

The data quality checks can be configured both at a table level and at a column level, depending on the type of the check.

The configuration of data quality checks will be shown in the example of a profiling check.

Table-level profiling checks

The table profiling checks are meant to capture advanced data quality statistics and store the most current value for each month. Their role is to track the overall quality of data, without affecting the data quality KPIs.

The example below shows a configuration of the profile_row_count check with only a warning severity rule that verifies that the table's row count is at least one row. A warning severity issue will be raised when the result of a query similar to SELECT COUNT(*) FROM <monitored_table> is 0, which means that the table is empty.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:

  profiling_checks: # (1)!
    volume: # (2)!
      profile_row_count: # (3)!
        warning: # (4)!
          min_count: 1 # (5)!
    schema: # (6)!
      ...
  ...
  1. The table profiling checks specification where the profiling checks are configured.
  2. A volume category node. Other categories are sibling nodes of the volume.
  3. The configuration of the profile_row_count data quality check.
  4. The configuration of a data quality rule at a warning severity level. This rule will raise a warning severity level data quality issue if the sensor readout does not meet the rule parameter.
  5. The rule parameter for the min_count rule. It is the smallest accepted row count (the sensor readout captured by the data quality check's sensor) that will make the rule pass.
  6. Another category node for the table schema checks.

The elements of the profiling checks configuration are listed in the table below.

| Line | Element path (within the spec node) | Description |
|------|-------------------------------------|-------------|
| 6 | profiling_checks | The table-level profiling checks container where the profiling checks are configured. |
| 7 | profiling_checks.volume | A volume category node. Similar data quality checks are grouped in categories. Other categories are sibling nodes of this node. |
| 8 | profiling_checks.volume.profile_row_count | The configuration of the profile_row_count data quality check. When a node with the name of the data quality check is added to the category node, the check becomes activated. |
| 9 | profiling_checks.volume.profile_row_count.warning | The configuration of a data quality rule at a warning severity level. This rule will raise a warning severity level data quality issue if the sensor readout does not meet the rule parameter. |
| 10 | profiling_checks.volume.profile_row_count.warning.min_count | The rule parameter for the min_count rule. It is the smallest accepted row count (the sensor readout captured by the data quality check's sensor) that will make the rule pass. |
| 11 | profiling_checks.schema | Yet another check category node. |

Table-level monitoring checks

The monitoring checks are the primary type of data quality check used in DQOps for continuous monitoring of the data quality of the data source. The data quality issues raised by these checks are decreasing the data quality KPI.

The monitoring checks are configured in the spec.monitoring_checks node and are divided into daily and monthly monitoring checks. The daily monitoring checks keep only the most current check result for the day when the check was executed. Running the same check again on the same day overrides the previously stored value, but does not delete the results from past days. The same principle is followed by the monthly monitoring checks, which store only one value for each calendar month.

The following example shows a table YAML file with the row count check configured both at daily and monthly periods. Please notice that the names of the checks are different because all data quality check names are unique in DQOps.

The daily monitoring check variant is daily_row_count and the monthly monitoring check variant is monthly_row_count.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:

  monitoring_checks: # (1)!
    daily: # (2)!
      volume:
        daily_row_count: # (3)!
          warning:
            min_count: 1
    monthly: # (4)!
      volume:
        monthly_row_count: # (5)!
          warning:
            min_count: 1
  ...
  1. The table monitoring checks specification where the monitoring checks are configured.
  2. The daily monitoring checks container.
  3. The configuration of the daily_row_count data quality check.
  4. The monthly monitoring checks container.
  5. The configuration of the monthly_row_count data quality check.

Table-level partition checks

Table-level partition checks allow analyzing even petabyte-scale tables that are constantly growing because new data is appended to new partitions. Running data quality checks as queries on such big tables is time-consuming and can generate high charges when the data is hosted in the cloud.

The partition checks require that the spec.timestamp_columns.partition_by_column field is set. The value of this field is a column name that is used as a partitioning key. DQOps is not limited to analyzing tables that are physically date partitioned. It is possible to select any date/datetime/timestamp column that divides the data by date ranges. The partition checks will use a GROUP BY clause, grouping the data by the selected partition_by_column column, enabling detection of data quality issues in time ranges.

This column should be of type date, datetime or timestamp. The actual column data types depend on the monitored database type. However, when the column is not a date type (for example, it is a text column), DQOps supports configuring calculated columns in the table YAML metadata. Calculated columns are virtual columns defined as an SQL expression using the SQL grammar of the target data source.
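
As an illustration, the following sketch shows how a calculated column could provide a date partitioning key when the source column stores dates as text. It is only a sketch: the created_date column name and the CAST expression are hypothetical, and the sql_expression field is assumed to be the calculated column definition described in the configuration of data sources.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  timestamp_columns:
    # the partitioning key points at the calculated column defined below
    partition_by_column: created_date
  columns:
    created_date:
      # hypothetical calculated (virtual) column that converts a text column
      # to a DATE using the SQL grammar of the monitored database
      sql_expression: "CAST({alias}.created_at_text AS DATE)"
      type_snapshot:
        column_type: DATE
        nullable: true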

The configuration of partition checks is similar to the configuration of monitoring checks; they are also divided into daily and monthly partition checks.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  timestamp_columns:
    partition_by_column: date # (1)!
  incremental_time_window:
    daily_partitioning_recent_days: 7 # (2)!
    daily_partitioning_include_today: false
    monthly_partitioning_recent_months: 1 # (3)!
    monthly_partitioning_include_current_month: false
  partitioned_checks: # (4)!
    daily: # (5)!
      volume:
        daily_partition_row_count: # (6)!
          warning:
            min_count: 1
    monthly: # (7)!
      volume:
        monthly_partition_row_count: # (8)!
          warning:
            min_count: 1
  ...
  1. The column name in the table that is used as the date partitioning key. It will be used by the partition checks to calculate the data quality results for daily and monthly time periods. The partitioning column in this example is the column named date from the tested table.
  2. The number of recent days queried when running the daily partition checks.
  3. The number of recent months queried when running the monthly partition checks.
  4. The table partition checks specification where the partition(ed) checks are configured.
  5. The daily partition checks container.
  6. The configuration of the daily_partition_row_count data quality check.
  7. The monthly partition checks container.
  8. The configuration of the monthly_partition_row_count data quality check.

The incremental_time_window section configures how the incremental data quality check execution works on partition data. DQOps generates SQL queries from the Jinja2 sensor templates by adding a WHERE clause that applies a query filter on the table, scanning only the last 7 days for daily checks or the data since the 1st day of the previous month for monthly checks. The additional filter predicate is similar to the following SQL fragment.

WHERE analyzed_table."date" >= DATE_ADD(CURRENT_DATE(), INTERVAL -{{daily_partitioning_recent_days}} DAY)

Because the data may change during the day, the default configuration of incremental partition checks in DQOps excludes today from daily checks and the current month from monthly checks. The default values of the daily_partitioning_include_today and monthly_partitioning_include_current_month parameters are false.
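
Putting the partition grouping and the incremental filter together, the query generated for the daily_partition_row_count check could look roughly like the sketch below. This is only an illustration of the shape of the rendered query; the real SQL is produced from the sensor's Jinja2 template, differs by database type, and the schema and table names are placeholders.

-- illustrative sketch only; the actual query is rendered from the sensor's
-- Jinja2 template and depends on the monitored database type
SELECT
  COUNT(*) AS actual_value,
  CAST(analyzed_table."date" AS DATE) AS time_period
FROM <target_schema>.<target_table> AS analyzed_table
WHERE analyzed_table."date" >= DATE_ADD(CURRENT_DATE(), INTERVAL -7 DAY)
  AND analyzed_table."date" < CURRENT_DATE() -- daily_partitioning_include_today: false
GROUP BY time_period
ORDER BY time_period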

Read the concept of incremental data quality monitoring to learn why running partition data quality checks is superior for append-only tables, financial data, and very big tables.

Configure issue severity levels

The data quality checks in DQOps allow setting different rule thresholds for different data quality issue severity levels. There are three severity levels: warning, error, and fatal. The warning level notifies the user about a potential issue, the error level indicates that the issue should be resolved to maintain the data quality KPI score, and the fatal level represents a serious issue that can cause the data pipeline to stop.

The following example shows how to set three different threshold levels for a row count check.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  monitoring_checks:
    daily:
      volume:
        daily_row_count:
          warning: # (1)!
            min_count: 1000
          error: # (2)!
            min_count: 100
          fatal: # (3)!
            min_count: 1           
  1. The warning severity level rule.
  2. The error severity level rule.
  3. The fatal severity level rule.

It is worth mentioning that the thresholds for the warning, error, and fatal rules should be inclusive. In this example, the warning rule raises a warning severity data quality issue when the row count drops below 1000 rows. When the table has 999 rows, a warning will be raised, but the thresholds for the error and fatal severity issues are lower, so these rules will pass successfully.

Setting up a threshold for the warning severity rule below the threshold for the error severity rule by mistake will make the warning rule ineffective, raising only error severity issues.
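
For example, the following (intentionally misconfigured) fragment swaps the two thresholds from the previous example. Whenever the row count drops below the warning threshold of 100, the error rule with its higher threshold of 1000 has already failed, so the issue is always reported at the error severity and the warning rule never produces a visible warning.

spec:
  monitoring_checks:
    daily:
      volume:
        daily_row_count:
          warning:
            min_count: 100   # never surfaces, the error rule below fails first
          error:
            min_count: 1000  # issues are reported at the error severity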

Checks without rules

Data quality checks without any rules are a special case in DQOps. These checks only capture the sensor readout and store it in both the sensor_readouts and the check_results parquet tables.

The issue severity level for these checks will always be valid, storing the value 0 in the severity column of the check_results parquet table. Checks without any enabled rules are also not included in the data quality KPI because their include_in_kpi flag will be false. The include_in_sla flag will also be false.

Due to a limitation in the YAML format, a node without a value makes the YAML file invalid. However, YAML supports inline JSON values.

The following example shows how to activate a check without setting any rules by setting its value to an empty JSON object {}.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  monitoring_checks:
    daily:
      volume:
        daily_row_count: {}

Rules without parameters

Some data quality rules do not have any parameters (thresholds). Configuring these rules uses the same YAML/JSON trick: setting the value of the rule to an empty JSON object {} enables the rule at the given severity level.

The following example shows the daily_column_count_changed check that compares the column count captured today with yesterday's value (or the most recent known column count), detecting if the number of columns in the table has changed recently. A warning severity issue is raised on the day the change in the number of columns is detected.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  monitoring_checks:
    daily:
      schema:
        daily_column_count_changed:
          warning: {}

Checks with additional sensor parameters

Some data quality sensors are parametrized. The parameters are used in the SQL template and will be rendered in the SQL query that is generated from the sensor.

The following example shows how to use the daily_text_found_in_set_percent check that uses an expected_values parameter, an array of strings with the valid values expected in the column. This check counts the percentage of rows that have one of the expected values stored in the currency column.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    currency:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          accepted_values:
            daily_text_found_in_set_percent:
              parameters:
                expected_values:
                  - "USD" # (1)!
                  - "EUR"
                  - "GBP"
              error:
                min_percent: 100
  1. The value that is expected in the column.

The parameters node is present in every data quality check, but it is not saved to the .dqotable.yaml file when no parameters are specified.

Referencing data dictionaries

Instead of entering the same set of expected values as a parameter to data quality checks that use lists of accepted values, it is possible to move the list of values to a data dictionary. Data dictionaries are CSV files stored in the dictionaries folder of the DQOps user home.

The list of values in the expected_values parameter of the daily_text_found_in_set_percent data quality check can be replaced by a token that references the dictionary file. The token format is ${dictionary://<dictionary_file_name>}. It is advised to wrap the token value in double quotes as shown in the example below.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    currency:
      type_snapshot:
        column_type: STRING
        nullable: true
      monitoring_checks:
        daily:
          accepted_values:
            daily_text_found_in_set_percent:
              parameters:
                expected_values:
                  - "${dictionary://currencies.csv}" # (1)!
              error:
                min_percent: 100
  1. The data dictionary expansion token "${dictionary://<dictionary_file_name>}" that is replaced by the list of dictionary entries when DQOps runs a data quality check.

A data quality check can reference multiple dictionaries on separate lines. DQOps aggregates the values from all dictionaries and also adds any standalone values included in the list, as shown in the sketch below.
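
For illustration, the following fragment of the column configuration shown above mixes a dictionary reference with standalone values in the same expected_values list. The crypto_currencies.csv dictionary name and the extra PLN value are hypothetical.

            daily_text_found_in_set_percent:
              parameters:
                expected_values:
                  - "${dictionary://currencies.csv}"
                  - "${dictionary://crypto_currencies.csv}" # hypothetical second dictionary
                  - "PLN"                                   # a standalone value merged with the dictionary entries
              error:
                min_percent: 100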

The definition of custom data dictionaries is described in the data dictionaries section of the DQOps user home folder concept.

Configuring column-level checks

The list of columns is stored in the spec.columns node in the .dqotable.yaml file. The configuration of the column metadata is described in the configuration of the data sources.

Column profiling checks

Profiling checks are configured on columns the same way as on tables. Only a different set of data quality checks is available because column-level checks must be executed on a column. The column name is included in the generated SQL query rendered from a sensor's template.

The following example shows column profiling checks configured on a column.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    cumulative_confirmed:
      type_snapshot:
        column_type: INT64
        nullable: true
      profiling_checks: # (1)!
        nulls:
          profile_nulls_count: {} # (2)!
          profile_nulls_percent:  # (3)!
            warning:
              max_percent: 0
            error:
              max_percent: 1
        schema:
          profile_column_exists: # (4)!
            warning:
              expected_value: 1 # (5)!
    date:
      type_snapshot:
        column_type: DATE
        nullable: true
  1. The container of the column-level profiling checks.
  2. A profile_nulls_count check without any rules; the check will only capture the number of rows with a null value in the cumulative_confirmed column.
  3. A profile_nulls_percent check that measures a percentage of rows with null values, instead of returning the number of rows. The check is configured to raise a warning severity issue when any null rows were detected (the percentage of null values is above 0%). An error severity issue is raised when the percentage of rows with null values exceeds 1%.
  4. A profile_column_exists check that verifies that the column is present in the table by reading the metadata of the table.
  5. The expected_value rule parameter's value is 1, which means that DQOps requires the column to be found.

Column monitoring checks

Column-level monitoring checks are also configured in a very similar way.

The following example shows using the daily monitoring variants of the profiling checks shown in the previous example.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    cumulative_confirmed:
      type_snapshot:
        column_type: INT64
        nullable: true
      monitoring_checks: # (1)!
        daily: # (2)!
          nulls:
            daily_nulls_count: {}
            daily_nulls_percent:
              warning:
                max_percent: 0
              error:
                max_percent: 1
          schema:
            daily_column_exists:
              warning:
                expected_value: 1
    date:
      type_snapshot:
        column_type: DATE
        nullable: true
  1. The container of the column-level monitoring checks.
  2. The container of the daily monitoring checks.

When the scheduling is enabled, these checks will be executed daily, detecting if any rows contain null values (measuring the completeness of the cumulative_confirmed column). DQOps will also retrieve the table schema from the data source and verify that the column is still present in the table's metadata.

Column partition checks

Configuring column-level partition checks does not differ from configuring them at a table level.

The following example uses the completeness checks at a partition level. Please also notice that there is no daily_partition_column_exists check, because schema checks can only be measured at a whole-table level by monitoring checks. Table schema drift checks cannot operate on partitions.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  timestamp_columns:
    partition_by_column: date # (1)!  
  columns:
    cumulative_confirmed:
      type_snapshot:
        column_type: INT64
        nullable: true
      partitioned_checks: # (2)!
        daily:
          nulls:
            daily_partition_nulls_count: {}
            daily_partition_nulls_percent:
              warning:
                max_percent: 0
              error:
                max_percent: 1
  1. The selection of the column that will be used for date partitioning in the GROUP BY SQL clause.
  2. The container of the column-level partition checks.

Additional configuration

All data quality check nodes in the .dqotable.yaml file also support additional configuration at the level of a single data quality check.

The parameters for a single configured data quality check are shown on the data quality check editor screen in the user interface after clicking the wheel icon.

Data quality check settings panel in DQOps

Disable a data quality check

A data quality check can be disabled. A disabled data quality check is skipped when running data quality checks, but the configuration is preserved in the .dqotable.yaml file.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    cumulative_confirmed:
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              parameters:
                disabled: true
              warning:
                max_percent: 0

Change the data quality dimension

The data quality dimension name that is stored in the data quality parquet result files can be changed, if the default quality dimension name used internally by DQOps does not meet the data quality reporting requirements.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    cumulative_confirmed:
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              parameters:
                quality_dimension: Validity
              warning:
                max_percent: 0

Change the display name

The data quality check name that is used on data quality dashboards can also be modified. Instead of showing the technical name of the data quality check (daily_nulls_percent in this example), a more user-friendly name "cumulative_confirmed has too many null values" will be shown.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    cumulative_confirmed:
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              parameters:
                display_name: "cumulative_confirmed has too many null values"
              warning:
                max_percent: 0

Exclude from data quality KPI formula

Selected data quality checks can be excluded from the calculation of the data quality KPI score.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    cumulative_confirmed:
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              parameters:
                exclude_from_kpi: true
              warning:
                max_percent: 0

Use a different data grouping

A data quality check can apply a custom data grouping to its data quality results by selecting a different data grouping configuration by name.

apiVersion: dqo/v1
kind: table
spec:
  default_grouping_name: group_by_country_and_state
  groupings:
    group_by_country_and_state:
      level_1:
        source: column_value
        column: country
      level_2:
        source: column_value
        column: state
    group_only_by_country:
      level_1:
        source: column_value
        column: country
  columns:
    cumulative_confirmed:
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              parameters:
                grouping: group_only_by_country
              warning:
                max_percent: 0

Custom run schedule

A single data quality check can use its own CRON schedule instead of the default schedule defined at the connection or table level. In the example below, the cron expression 0 2 1 * * runs the check at 2:00 AM on the first day of every month.

apiVersion: dqo/v1
kind: table
spec:
  columns:
    cumulative_confirmed:
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              parameters:
                schedule_override: 
                  cron_expression: 0 2 1 * *
              warning:
                max_percent: 0

Exclude a check from the scheduling

A check can be excluded from scheduled execution when a CRON schedule triggers running all checks on a connection or a table.

apiVersion: dqo/v1
kind: table
spec:
  columns:
    cumulative_confirmed:
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              parameters:
                schedule_override: 
                  disabled: true
              warning:
                max_percent: 0

Applying table filters at a data quality check level

DQOps also supports setting a table filter predicate at a data quality check level. The filter will affect only a single check, while all other checks defined on the table will analyze the whole table or use the table-level filter. The filter predicate is specified in the parameters.filter node inside the check's configuration as shown in the following example.

For further guidance on configuring checks, read the configuring data quality checks article.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  columns:
    date:
      type_snapshot:
        column_type: DATE
        nullable: true
    cumulative_confirmed:
      monitoring_checks:
        daily:
          nulls:
            daily_nulls_percent:
              parameters:
                filter: "{alias}.date >= '2023-11-06'"
              warning:
                max_percent: 0

Adding comments

You can also add custom comments to a configured data quality check. This feature becomes important when multiple users maintain the data quality checks and changes to the check's thresholds are applied over time.

# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  monitoring_checks:
    daily:
      volume:
        daily_row_count:
          comments:
            - date: 2024-01-18T14:57:09.931
              comment_by: user
              comment: "Minimum row count changed to 100 rows because we are expecting to have one row per country"
          warning:
            min_count: 100

Activate multiple checks at once

DQOps also supports alternative methods of activating data quality checks, designed to configure thousands of checks at once by turning on data quality checks using filters on target tables, columns, and the column's data type.

Activate multiple checks with the DQOps shell

The dqo check activate command supports activating multiple checks from the DQOps shell. Ready-to-use command examples for activating data quality checks are provided in the documentation of data quality checks.

The following example activates the daily_row_count check on all fact tables, raising a warning severity issue when the table has fewer than 10000 rows. The -W<rule_parameter_name>= parameters support passing parameters to the warning severity data quality rules.

dqo> check activate -c=connection_name -t=public.fact_* -ch=daily_row_count --enable-warning -Wmin_count=10000

The next example activates the daily_nulls_percent check to verify that columns matching the *_id name pattern contain only non-null values. The check counts the percentage of null values and raises a data quality issue when the percentage is above 0%.

dqo> check activate -c=connection_name -t=public.fact_* -ch=daily_nulls_percent -col=*_id --enable-warning -Wmax_percent=0

Configuring data quality checks to raise error severity issues requires slightly different parameters, as shown in the example below.

dqo> check activate -c=connection_name -t=public.fact_* -ch=daily_nulls_percent -col=*_id --enable-error -Emax_percent=0

Configure multiple checks using the UI

DQOps provides the option to configure multiple checks from the user interface. You can search for target tables and columns to activate data quality checks or review the configuration of rules.

Search for checks on the multiple check editor

To learn more about configuring multiple data quality checks, refer to the manual.

What's next