Last updated: April 05, 2024
DQOps Data Quality Operations Center concepts
Follow this guide to learn the core concepts of DQOps Data Quality Operations Center and start measuring the data quality of your data sources.
List of DQOps concepts
This article is a dictionary of DQOps terms. Click on the links to learn about every concept.
What is a data quality check
A data quality check detects data quality issues. A check in DQOps is defined as a pair of a sensor that captures a metric from the data source and a rule that verifies the sensor's readout. For example, the nulls_percent check combines the null_percent sensor with the max_percent rule to verify that the percentage of null values in a tested column does not exceed the maximum accepted percentage.
If the percentage of null values in a column rises above the threshold (the maximum accepted percentage), a data quality issue is raised.
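To make the sensor-and-rule pairing concrete, a check of this kind could be configured in a table's YAML file as in the sketch below. The node names follow the DQOps YAML style, but treat them as assumptions to verify against the published schema:

```yaml
# Illustrative fragment of a .dqotable.yaml file: a daily monitoring
# nulls_percent check on a column, with a max_percent rule threshold.
# Node names are assumptions; verify against the DQOps YAML schema.
columns:
  email:
    monitoring_checks:
      daily:
        nulls:
          daily_nulls_percent:
            error:
              max_percent: 5.0   # raise an issue when more than 5% of values are null
```

The sensor captures the actual percentage of nulls, and the max_percent rule compares it against the 5.0 threshold.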
DQOps has three types of data quality checks:
- Profiling checks for measuring the initial data quality score during the profiling stage
- Monitoring checks for observing and measuring the data quality daily or monthly
- Partition checks for analyzing partitioned data incrementally, or for incremental analysis of any table that has a date column
Configuring data sources
DQOps stores the data source configuration and the metadata of imported tables in YAML files.
The files are stored in the current working folder, called the DQOps user home.
The connection parameters to data sources and the metadata of imported tables can be edited in two ways: by using the DQOps user interface or by editing the YAML configuration files directly. Other methods of managing data sources include the command-line interface and the DQOps Python client. Review the list of data sources to see which databases are supported by DQOps.
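As an illustration, a data source configuration file could look like the following sketch for a PostgreSQL connection. The field names are modeled on the DQOps YAML style, but are assumptions to verify against the published schema:

```yaml
# Illustrative sketch of a data source configuration file for a
# hypothetical PostgreSQL connection; field names are assumptions
# to verify against the published DQOps YAML schema.
apiVersion: dqo/v1
kind: source
spec:
  provider_type: postgresql
  postgresql:
    host: localhost
    port: 5432
    database: sales
    user: dqo_user
```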
Configuring table metadata
The metadata and the configuration of all data quality checks are stored in .dqotable.yaml files, following the naming convention sources/&lt;data_source_name&gt;/&lt;schema_name&gt;.&lt;table_name&gt;.dqotable.yaml.
The files can be stored and versioned in your Git repository.
You can also easily add similar tables, or move the data quality check configuration from development tables to production tables by copying and renaming the .dqotable.yaml files.
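For illustration, a minimal .dqotable.yaml skeleton for a hypothetical customers table could look like this; the node names are assumptions to verify against the DQOps YAML schema:

```yaml
# Illustrative skeleton of a file such as
# sources/sales_dwh/public.customers.dqotable.yaml (hypothetical names).
apiVersion: dqo/v1
kind: table
spec:
  columns:
    customer_id:
      type_snapshot:
        column_type: BIGINT
    email:
      type_snapshot:
        column_type: VARCHAR
```

Copying this file and renaming it to match another table's naming convention is enough to reuse the same check configuration on a similar table.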
Configuring data quality checks
Data quality checks are configured by setting alerting thresholds, which are the data quality rule parameters.
DQOps uses YAML files to keep the configuration of data sources and the activated data quality checks on monitored tables. The DQOps YAML file format is fully documented and the YAML schema files are published.
The DQOps YAML schema files enable a first-class editing experience in Visual Studio Code when configuring data quality checks directly in the editor. Code completion, syntax validation, and help hints are shown by Visual Studio Code and many other editors when editing DQOps YAML files.
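Editors typically discover the schema through a yaml-language-server comment at the top of the file, as in the sketch below. The schema URL shown here is illustrative; use the schema location published by DQOps:

```yaml
# The yaml-language-server comment tells VS Code (with the YAML extension)
# which JSON schema to use for completion and validation.
# The URL below is an illustrative assumption, not a guaranteed location.
# yaml-language-server: $schema=https://cloud.dqops.com/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
```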
Running data quality checks
Data quality checks configured for each table and column are executed by targeting the data source, table, column, check name, check type, check category, or even labels assigned to tables or columns.
Data observability
When a new data source is imported into DQOps, a default set of data quality checks is activated. The main purpose of data observability is to detect anomalies in the data, database downtimes, schema changes, uniqueness issues, and an inconsistent growth of the table volume.
DQOps user home
DQOps user home is the most important folder: it is where DQOps stores all configuration and data quality results.
When DQOps is started by running python -m dqops, the current working folder is used as the default DQOps user home.
On the first run, DQOps will set up a folder tree to store the list of monitored data sources and the configuration of data quality checks for all imported tables. The configuration is stored in YAML files for simplicity of editing in Visual Studio Code.
Data quality sensors
The data quality sensors are SQL queries defined as Jinja2 templates. A sensor is called by a data quality check to capture a data quality measure, such as the row count, from the monitored data source. The sensor's measure is called a sensor readout in DQOps.
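A sensor template of this kind can be as simple as a row count query. The sketch below is illustrative; the lib.render_target_table() macro name is an assumption modeled on the Jinja2 macro style that sensor templates use:

```sql
-- Illustrative Jinja2 sensor template capturing a row count.
-- The macro name is an assumption, not the exact DQOps macro API.
SELECT
    COUNT(*) AS actual_value
FROM {{ lib.render_target_table() }} AS analyzed_table
```

When the check runs, the template is rendered into a concrete SQL query for the target database, and the resulting actual_value becomes the sensor readout.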
Data quality rules
Data quality rules in DQOps are Python functions that receive the sensor readout captured by a sensor (the result of an SQL query). The rule verifies whether the sensor readout is valid or a data quality issue should be raised. For example, the max_percent rule verifies whether the result of the null_percent sensor is acceptable.
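A minimal sketch of the idea behind such a rule, written as a plain Python function. The class and field names below are illustrative assumptions, not the exact DQOps rule API:

```python
# Sketch of a max_percent-style rule as a plain Python function.
# RuleParameters and its field names are illustrative assumptions,
# not the exact DQOps rule interface.
from dataclasses import dataclass


@dataclass
class RuleParameters:
    actual_value: float   # the sensor readout, e.g. the percent of null values
    max_percent: float    # the threshold configured on the check


def evaluate_max_percent(params: RuleParameters) -> bool:
    """Return True when the sensor readout passes the rule."""
    return params.actual_value <= params.max_percent


# A 2.5% null rate against a 5% threshold passes the check.
passed = evaluate_max_percent(RuleParameters(actual_value=2.5, max_percent=5.0))
```

The real DQOps rules also support severity levels (warning, error, fatal), but the pass/fail comparison above is the core of the evaluation.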
Data quality KPIs
The data quality is measured by data quality KPIs (Key Performance Indicators). The definition of a data quality KPI in DQOps is a percentage of passed data quality checks out of all executed data quality checks.
The data quality KPIs are calculated on multiple levels:
- per table
- per data source
- per data quality dimension
- or a combination of any other dimensions
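The KPI definition above reduces to a simple percentage, sketched here in Python (the handling of zero executed checks is an assumption):

```python
# Illustrative calculation of a data quality KPI as defined above:
# the percentage of passed data quality checks out of all executed checks.
def data_quality_kpi(passed_checks: int, executed_checks: int) -> float:
    if executed_checks == 0:
        # Assumption: with no executed checks, nothing failed.
        return 100.0
    return 100.0 * passed_checks / executed_checks


kpi = data_quality_kpi(passed_checks=95, executed_checks=100)  # 95.0
```

Grouping the passed and executed counts by table, data source, or data quality dimension before applying this formula yields the multi-level KPIs listed above.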
Incremental data quality monitoring
Learn how partition checks are used to analyze data quality incrementally, even for very big tables, reaching terabyte or petabyte scale.
Partition checks introduced by DQOps make it possible to detect data quality issues very early, as soon as invalid data is loaded in the most recent batch.
Data quality dashboards
DQOps stores the data quality check results locally, but the data is also synchronized to a Data Quality Data Warehouse hosted in the cloud by DQOps for each user.
The data quality dashboards access the Data Quality Data Warehouse and enable reviewing the data quality KPIs or drilling down to the data quality issues. DQOps uses a custom Google Looker Studio Community Connector to access the user's Data Quality Data Warehouse.
Data quality dimensions
The data quality dimensions are the fundamental way to group data quality checks into groups of checks that detect similar issues. The most important data quality dimensions supported by DQOps are:
- Availability watches the tables in the data source, raising a data quality issue when a table is missing or returns errors
- Accuracy checks compare the data to the "source of truth", which means comparing tables between stages and data sources
- Consistency monitors the data over a period of time, looking for anomalies, for example a percent of null values per day that stayed within the regular range until an unusual increase was observed for one day
- Completeness detects missing data, for example columns with too many null values
- Reasonableness identifies values that do not make sense, falling out of the expected range
- Timeliness tracks the freshness of data, measuring the maximum allowed age of data
- Uniqueness finds issues related to duplicate values
- Validity detects common field format issues, such as an email field that does not meet the email format
Auditing time periods
DQOps captures the time period for which the data quality result is valid. This can be the data quality status at the end of the day in daily monitoring checks. Learn how DQOps captures the local timezone of monitored data sources, even if the monitored databases are located in different countries, regions, and continents.
Data grouping
A unique feature of DQOps is the ability to use a GROUP BY clause in the data quality sensors, which allows running data quality checks on multiple groups of rows in the same table.
For example, a table fact_global_sales that aggregates sales fact rows from multiple countries can be tested for each country. A column that identifies a country must be present in the table and data grouping must be configured.
Data grouping allows detecting data quality issues for groups of rows loaded by different data streams, different data pipelines, or received from different vendors or departments.
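As an illustration, a data grouping of this kind could be declared in the table's YAML file as sketched below. The node names are assumptions to verify against the DQOps YAML schema:

```yaml
# Illustrative sketch: group data quality results by the country column,
# so each country's rows are checked separately.
# Node names are assumptions to verify against the DQOps YAML schema.
spec:
  default_grouping_name: by_country
  groupings:
    by_country:
      level_1:
        source: column_value
        column: country
```

With this configuration, the sensors add a GROUP BY country to their queries, and each country receives its own sensor readout and check result.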
User interface overview
The DQOps user interface is a tabbed application that resembles many popular database management tools. Data quality checks on multiple tables can be configured at the same time in separate tabs.
Command-line interface
Command-line access to DQOps is supported by a shell interface. The DQOps shell supports command and table name completion.
Data storage
DQOps stores the configuration of data sources, the configuration of data quality checks activated on tables, and the data quality check execution results locally in the DQOps user home folder.
The data quality results are stored in a $DQO_USER_HOME/.data folder that is a Hive-compliant local data lake. Please read the data storage concept guide to understand the data lake structure.
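The layout of the .data folder follows a Hive-style partitioning scheme. The sketch below is an assumption of the typical structure, with hypothetical connection and table names; the subfolder and partition names may differ between versions:

```
$DQO_USER_HOME/.data/
  sensor_readouts/      # raw sensor measures
    c=sales_dwh/t=public.customers/m=2024-04-01/*.parquet
  check_results/        # check results after rule evaluation
    c=sales_dwh/t=public.customers/m=2024-04-01/*.parquet
  errors/               # sensor and rule execution errors
```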
Deployment architecture
DQOps can be hosted locally, in the cloud, or as a hybrid deployment, running semi-offline DQOps instances on-premise or in the customer's cloud environment.
Check execution flow
Detailed check execution flows show how DQOps executes data quality sensors, data quality rules, and data quality checks, and how the data is stored. Learn how execution errors are stored.
Other topics
Check out the other areas of the DQOps documentation.
Installing DQOps
Learn how to install DQOps using pip, Docker, or a local installation from a release package.
List of data sources
The list of supported data sources and how to register them in DQOps.
DQOps use cases
Review a list of data quality use cases and learn how to detect the most common data quality issues with DQOps. Each use case is a step-by-step guide, starting with the description of a problem, followed by the steps to configure the relevant data quality checks, and finally showing the data quality dashboards.
Working with DQOps
The remaining step-by-step manuals not included in the DQOps basic concepts guide.
Integrations
Find out how DQOps integrates with other systems. How to run data quality checks in Apache Airflow, or how to send data quality incident notifications to Slack.
Command-line interface
DQOps supports running commands directly from the operating system shell or using the DQOps integrated shell. The command-line interface section is a reference of all DQOps commands.
REST API Python Client
Using DQOps is not limited to the user interface or the command-line shell. All operations, such as running data quality checks, are also supported from the DQOps Python client. The REST API Python client reference shows ready-to-use Python code samples.
Data quality checks reference
The reference of all data quality checks that are included in DQOps. The reference of each data quality check includes a YAML configuration fragment and examples of SQL queries for each data source.
Reference
The full reference of all data quality sensors, data quality rules, DQOps YAML files and DQOps Parquet schema.