Last updated: April 09, 2024

Data profiling checks

Read this guide to understand the purpose of data profiling checks in DQOps, and how they are used to evaluate the initial data quality status of tables.

What are data profiling checks?

The data profiling checks in DQOps are responsible for assessing the initial data quality score of data sources. They should be activated on new data sources to verify that the dataset meets the minimum data quality requirements.

Profiling checks are also useful for exploring and experimenting with various types of checks to determine the most suitable ones for regular data quality monitoring. Before activating a data quality monitoring check, the user should test the profiling version of that check. Every monitoring and partition data quality check has a profiling version, whose name starts with the profile_* prefix.

Summary

The following table summarizes the key concepts of profiling data quality checks in DQOps.

Check type: profiling
Purpose: Evaluate the initial data quality score of new tables. Experiment with DQOps data quality checks before activating monitoring checks for measuring the data quality KPIs.
Time period truncation: One data quality profiling result is captured per month. When profiling is repeated in the same month, the previous result is replaced.
Check name prefix: profile_*

Data profiling status checkpoints

DQOps stores only one data quality profiling result for each month. If a user runs the same profiling check again during the same month, the previous result is replaced. This behavior is designed for experimenting with data quality rules and tuning their parameters.
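The replacement behavior described above can be pictured as a store keyed by the check name and the month of execution. The following is a minimal sketch under stated assumptions, not the DQOps implementation: `truncate_to_month` and `ProfilingResultStore` are hypothetical names introduced only for illustration, and the row counts are made-up sample values.

```python
from datetime import datetime


def truncate_to_month(executed_at: datetime) -> datetime:
    """Return midnight on the first day of the month containing executed_at."""
    return executed_at.replace(day=1, hour=0, minute=0, second=0, microsecond=0)


class ProfilingResultStore:
    """Hypothetical in-memory store that keeps one result per check and month."""

    def __init__(self) -> None:
        self._results: dict[tuple[str, datetime], float] = {}

    def save(self, check_name: str, executed_at: datetime, value: float) -> None:
        # The key uses the truncated month, so a rerun within the same month
        # overwrites the earlier result instead of adding a new one.
        key = (check_name, truncate_to_month(executed_at))
        self._results[key] = value

    def results(self) -> dict[tuple[str, datetime], float]:
        return dict(self._results)


store = ProfilingResultStore()
store.save("profile_row_count", datetime(2024, 1, 5, 12, 0, 0), 1200.0)
store.save("profile_row_count", datetime(2024, 1, 20, 16, 18, 49), 1350.0)  # replaces January
store.save("profile_row_count", datetime(2024, 2, 3, 12, 0, 0), 1400.0)

for (check, month), value in sorted(store.results().items()):
    print(check, month.date(), value)
```

Three runs leave only two stored results, one for January and one for February, and the January entry holds the value from the later run.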

Profiling checks in DQOps user interface

The following screen shows the profiling results for a profile_row_count data quality check that detects empty tables. By setting the min_count rule parameter to 1, DQOps will raise an error severity data quality issue for empty tables. The data quality issue is stored in the check_results Parquet table.
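To make the rule logic concrete, here is a minimal sketch of how a minimum row count threshold could translate into an issue severity. The function `evaluate_min_count` and the `"passed"` label are hypothetical and chosen for illustration; only the `min_count` parameter name and the error severity come from the description above.

```python
from typing import Optional


def evaluate_min_count(row_count: int, min_count: Optional[int]) -> str:
    """Return "error" when row_count is below min_count, otherwise "passed".

    A min_count of None means the rule is not configured, so nothing is raised.
    """
    if min_count is not None and row_count < min_count:
        return "error"
    return "passed"


# With min_count set to 1, an empty table fails and any non-empty table passes.
print(evaluate_min_count(row_count=0, min_count=1))
print(evaluate_min_count(row_count=25, min_count=1))
```

Setting the threshold to 1 is what turns a generic row count measurement into an empty table detector.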

data profiling screen in DQOps

The screenshot above shows the most recent data profiling results, captured at 2024-01-20 16:18:49. The Executed At column shows when the profiling checks were executed according to a CRON schedule, which is 0 12 * * * (every day at 12:00 PM). The Profile data (local time) column shows the time period (month) for which the profiling result is valid. This date is the execution date truncated to the 1st day of the month in which the profiling checks were run.

Initial data quality KPI score

The data quality dashboards showing the data profiling results are located in the Profiling folder. The Profiling KPIs scorecard - summary dashboard shows the initial data quality KPI score, which is calculated as the percentage of passed data profiling checks for the current month.
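The KPI arithmetic is a simple ratio. The sketch below assumes each profiling result is reduced to a pair of the month it belongs to and a passed flag; the function name `kpi_for_month` and the sample data are hypothetical and do not reflect how the dashboard queries its data.

```python
from datetime import date
from typing import Optional


def kpi_for_month(results: list[tuple[date, bool]], month: date) -> Optional[float]:
    """Return the percentage of passed checks for the given month.

    Returns None when no checks were recorded for that month.
    """
    in_month = [passed for result_month, passed in results if result_month == month]
    if not in_month:
        return None
    return 100.0 * sum(in_month) / len(in_month)


# Made-up sample: four checks in January 2024 (three passed), one in December 2023.
sample = [
    (date(2023, 12, 1), False),
    (date(2024, 1, 1), True),
    (date(2024, 1, 1), True),
    (date(2024, 1, 1), False),
    (date(2024, 1, 1), True),
]

print(kpi_for_month(sample, date(2024, 1, 1)))  # 75.0
```

Results from other months are ignored, so the failed December check does not lower the January score.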

data profiling initial data quality score dashboard

Profiling checks pros and cons

When to use profiling checks

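Use the data profiling checks for:

  • Evaluating the initial data quality score of new tables.

  • Experimenting with DQOps data quality checks and tuning their rule parameters before activating monitoring checks.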

Limitations of profiling checks

Data profiling checks store only one result per month, which limits their use to assessing the initial data quality.

  • Do not use the DQOps REST API Client from data pipelines for running profiling checks, unless there is a requirement to maintain the most recent data profiling status at all times. Run monitoring data quality checks instead to capture a result for each day.

  • Do not activate anomaly detection or relative value lookup checks as profiling checks, because anomaly detection depends on a full history of data quality results captured at least daily. Profiling checks in DQOps store only one result per month, so the history of data quality results is too short. The following screenshot shows how the DQOps check editor presents anomaly detection checks that cannot be run because there is not enough historical data for prediction.

data anomaly detection by data profiling checks
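The gap in history can be illustrated by counting how many distinct results survive over the same period under daily and monthly truncation. This is only a sketch: `history_points` is a hypothetical helper, and it makes no claim about how many data points DQOps anomaly detection actually requires.

```python
from datetime import date, timedelta


def history_points(start: date, days: int, monthly: bool) -> int:
    """Count distinct stored results when one check runs daily for `days` days.

    With monthly=True each run is keyed by the first day of its month,
    so runs within the same month collapse into a single result.
    """
    keys = set()
    for offset in range(days):
        day = start + timedelta(days=offset)
        keys.add(day.replace(day=1) if monthly else day)
    return len(keys)


start = date(2024, 1, 1)
print(history_points(start, 90, monthly=False))  # 90 daily results
print(history_points(start, 90, monthly=True))   # 3 monthly results
```

Ninety daily runs leave ninety data points for a daily check but only three for a profiling check, which is far too sparse a series for detecting anomalies.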

Profiling check configuration in DQOps YAML files

The configuration of active data quality profiling checks is stored in the .dqotable.yaml files. Please review the samples in the configuring table metadata article to learn more.

What's next