Last updated: April 09, 2024
Data quality monitoring checks for data observability
This guide shows how data quality monitoring checks in DQOps observe data sources and track data quality with data quality KPIs.
What are data monitoring checks?
The monitoring checks in DQOps are responsible for continuously monitoring the data quality of data sources. The data quality results generated by monitoring checks capture the end-of-day or end-of-month data quality status of the monitored data.
Capturing an end-of-day (or end-of-month) status of the last execution of a data quality check is important for:
- storing an audit log of executed data quality checks, especially when auditing is required for regulatory reasons
- measuring data quality KPIs to prove the trustworthiness of data sources
- tracking the data quality improvement day-by-day, and presenting the progress of data cleansing projects to stakeholders and business sponsors of the data quality initiative
Before activating a data quality monitoring check, you should test the profiling version of that check. Every monitoring and partition data quality check has a profiling version, named profiling_*.
Time scale
Monitoring checks are divided into two groups that contain almost the same data quality checks.
- daily monitoring checks track the end-of-day data quality status
- monthly monitoring checks track the end-of-month data quality status, but do not support anomaly detection checks because one data quality result per month is not enough historical data for prediction
Summary
The following table summarizes the key concepts of monitoring data quality checks in DQOps, divided by daily monitoring and monthly monitoring checks.
Check type | Time scale | Purpose | Time period truncation | Check name prefix |
---|---|---|---|---|
monitoring | daily | The preferred type of checks to detect data quality issues. Daily monitoring checks store the end-of-day data quality status for measuring the data quality KPIs. | One data quality monitoring result captured per day, when a daily monitoring check is run again during the same day, the previous result is replaced. | daily_* |
monitoring | monthly | Capture the last known end-of-month data quality status. Monthly monitoring checks store the end-of-month data quality status for measuring the data quality KPIs. | One data quality monitoring result captured per month, when a monthly monitoring check is run again during the same month, the previous result is replaced. | monthly_* |
Monitoring checks in DQOps user interface
Daily monitoring checks
The following screen shows the data quality results of the daily_row_count data quality check that measures the number of rows in a table using a SELECT COUNT(*) FROM <monitored_table> SQL query.
The error severity rule of the data quality check has a parameter min_count: 1, which raises an error severity issue if the table is empty.
The other threshold, set on the warning severity rule, verifies that the table has at least 500,000 rows and raises a warning severity issue when the table is smaller.
The data quality check details panel on the check editor shows that all recent data quality check runs failed with a warning severity issue, because the table had fewer than 500,000 rows on each of the last 11 days when the data quality check was run. The highest detected row count was 488,478 rows.
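The two thresholds described above can be illustrated with a minimal Python sketch. It is not the DQOps rule implementation: the function name `evaluate_min_count` and the `"passed"` label are hypothetical, and the only values taken from this guide are the `min_count` thresholds (1 and 500,000) and the row count of 488,478.

```python
from typing import Optional


def evaluate_min_count(
    actual_count: int,
    warning_min: Optional[int] = None,
    error_min: Optional[int] = None,
) -> str:
    """Return the most severe level whose min_count threshold is not met.

    Hypothetical helper for illustration only; the severity labels are
    assumptions, not values read from DQOps.
    """
    # The error rule is checked first because it is the more severe one.
    if error_min is not None and actual_count < error_min:
        return "error"
    if warning_min is not None and actual_count < warning_min:
        return "warning"
    return "passed"


# Thresholds from the example: error when the table is empty,
# warning when the table has fewer than 500,000 rows.
print(evaluate_min_count(488_478, warning_min=500_000, error_min=1))  # warning
print(evaluate_min_count(0, warning_min=500_000, error_min=1))        # error
print(evaluate_min_count(500_000, warning_min=500_000, error_min=1))  # passed
```

With these thresholds, a table of 488,478 rows satisfies the error rule but not the warning rule, which matches the warning severity results shown on the screen.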
The Executed At column shows the time when the data quality check was run, and the Checkpoint date column shows the time_period value from the check_results Parquet table used by DQOps to store the data quality results.
Because daily monitoring checks store the end-of-day status (only one result per check per day), the values of the Checkpoint date (the time_period Parquet column) are truncated to the beginning of the day when the check was run.
The same data quality results are also shown on the chart view.
All captured data quality metrics, called data quality sensor readouts in DQOps, are presented on the chart as an Actual value time series.
Because all recent row counts were below the minimum required 500,000 rows, all the results are shown within the yellow zone for warning severity data quality issues.
Monthly monitoring checks
The monthly monitoring checks store the end-of-month data quality status, replacing previously captured results. The following screen shows the result of running the monthly_row_count check at 2024-01-20 16:56:17 (January 20th, 2024).
DQOps stored one result for January 2024, truncating the value of the Checkpoint date (the time_period Parquet column) to the beginning of the month.
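The truncation of the Checkpoint date can be sketched in a few lines of Python. This is only an illustration of the rule described above, assuming a hypothetical helper named `truncate_time_period`; it does not reproduce how DQOps computes the time_period column internally.

```python
from datetime import datetime


def truncate_time_period(executed_at: datetime, time_scale: str) -> datetime:
    """Truncate an execution timestamp to the start of its day or month.

    Hypothetical helper: "daily" and "monthly" mirror the two time scales
    of monitoring checks described in this guide.
    """
    if time_scale == "daily":
        return executed_at.replace(hour=0, minute=0, second=0, microsecond=0)
    if time_scale == "monthly":
        return executed_at.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    raise ValueError(f"Unsupported time scale: {time_scale}")


# A check executed on January 20th, 2024 in the afternoon.
executed_at = datetime(2024, 1, 20, 16, 56, 17)
print(truncate_time_period(executed_at, "daily"))    # 2024-01-20 00:00:00
print(truncate_time_period(executed_at, "monthly"))  # 2024-01-01 00:00:00
```

Every run within the same day (or month) maps to the same truncated value, which is why only one result per check and period can exist.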
Monitoring checks pros and cons
When to use monitoring checks
Use the data monitoring checks to:
- Track and checkpoint the end-of-day (or end-of-month) data quality status for every data source.
- Measure the improvement of the data quality score using data quality KPIs.
- Schedule data quality checks using the DQOps CRON scheduler.
- Run data quality checks from data pipelines, starting run_checks with the DQOps REST API Client, or starting data quality checks from the shell.
- Track the progress of data cleansing projects, measuring the data quality KPI as a percentage of passed daily data quality checks. The data quality KPI can be calculated for each data quality dimension, and for all data sources, tables, and categories of data quality checks.
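The KPI mentioned in the last item, a percentage of passed checks that can be grouped by data quality dimension, can be sketched as follows. The function names and the sample records are hypothetical and exist only for illustration. The sketch also assumes that each result already carries a boolean `passed` flag, so it takes no position on which severity levels DQOps counts as passed.

```python
from collections import defaultdict


def kpi_percentage(results):
    """Percentage of results flagged as passed, or None for empty input."""
    results = list(results)
    if not results:
        return None
    passed = sum(1 for r in results if r["passed"])
    return 100.0 * passed / len(results)


def kpi_by_dimension(results):
    """Group results by data quality dimension and compute a KPI for each."""
    groups = defaultdict(list)
    for r in results:
        groups[r["dimension"]].append(r)
    return {dimension: kpi_percentage(rs) for dimension, rs in groups.items()}


# Made-up sample records, not real DQOps output.
sample_results = [
    {"dimension": "Completeness", "passed": True},
    {"dimension": "Completeness", "passed": False},
    {"dimension": "Validity", "passed": True},
    {"dimension": "Validity", "passed": True},
]
print(kpi_percentage(sample_results))    # 75.0
print(kpi_by_dimension(sample_results))  # {'Completeness': 50.0, 'Validity': 100.0}
```

The same grouping idea applies to data sources, tables, or check categories by swapping the grouping key.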
You can run monitoring checks multiple times during the day
It is safe to run monitoring checks every time new data is loaded into a monitored table, even multiple times during the day. DQOps replaces the last known data quality result for the current day in daily monitoring checks, and for the current month in monthly monitoring checks.
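The replacement behavior can be pictured as an upsert keyed by the check name and the truncated time period. The sketch below uses an in-memory dictionary and a hypothetical function `store_daily_result`; DQOps stores results in Parquet files, so this only illustrates the semantics, not the storage mechanism. The 488,478 row count comes from the example earlier in this guide, and the other values are made up.

```python
from datetime import datetime

# (check_name, time_period) -> latest result for that day.
# An in-memory stand-in used only to illustrate the replace-on-rerun rule.
results = {}


def store_daily_result(check_name: str, executed_at: datetime, actual_value: int) -> None:
    """Store a daily monitoring result, replacing any earlier result of the same day."""
    time_period = executed_at.replace(hour=0, minute=0, second=0, microsecond=0)
    results[(check_name, time_period)] = {
        "executed_at": executed_at,
        "actual_value": actual_value,
    }


# Two runs on the same day: the second run replaces the first.
store_daily_result("daily_row_count", datetime(2024, 1, 20, 8, 0, 0), 480_000)
store_daily_result("daily_row_count", datetime(2024, 1, 20, 16, 56, 17), 488_478)
# A run on the next day creates a new result.
store_daily_result("daily_row_count", datetime(2024, 1, 21, 8, 0, 0), 490_000)

print(len(results))  # 2
print(results[("daily_row_count", datetime(2024, 1, 20))]["actual_value"])  # 488478
```

A monthly monitoring check would follow the same pattern with the key truncated to the first day of the month.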
Limitations of monitoring checks
The results of monitoring data quality checks are used to evaluate the data quality KPI and compliance with data contracts.
- Do not run a monitoring check for the first time before experimenting with the profiling variant of that check. The configuration of accepted profiling checks can be easily converted to monitoring checks. If a misconfigured monitoring check is run and fails, raising data quality issues, the issues will decrease the data quality KPI score. You will have to use the delete data quality results screens to remove these data quality results.
- Monthly monitoring checks do not support anomaly detection data quality checks, because when only one data quality result for each data quality check is stored per month, there is not enough historical data to use prediction.
Monitoring check configuration in DQOps YAML files
The configuration of active data quality monitoring checks is stored in the .dqotable.yaml files. Please review the samples in the configuring table metadata article to learn more.
- configuring table-level monitoring checks shows how to configure monitoring checks at a table level
- configuring column-level monitoring checks shows how to configure monitoring checks at a column level
What's next
- Learn how to assess the initial data quality status using profiling checks.
- Learn how to analyze data quality of partitioned data using partition checks.
- Read the configuring table-level monitoring checks and configuring column-level monitoring checks articles to learn the details of configuring monitoring checks in YAML files.
- Learn how to use monitoring checks in the end-to-end data quality improvement process using DQOps.
- Learn how DQOps calculates the data quality KPI score.