DQOps Data Quality Operations Center concepts

Follow this guide to learn the core concepts of the DQOps Data Quality Operations Center and start measuring the data quality of your data sources.

List of DQOps concepts

This article is a dictionary of DQOps terms. Click on the links to learn about every concept.

What is a data quality check

A data quality check detects data quality issues. A check in DQOps is defined as a pair of a sensor that captures a metric from the data source and a rule that verifies the sensor's readout. For example, the nulls_percent check combines the null_percent sensor with the max_percent rule to verify that the percentage of null values in a tested column does not exceed the maximum accepted percentage.

If the percentage of null values in a column rises above the threshold (the maximum allowed percentage), a data quality issue is raised.

DQOps has three types of data quality checks: profiling checks, monitoring checks, and partition checks.
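
For illustration, a nulls_percent check activated on a column could be configured in the table's YAML file with a fragment similar to the sketch below. This is a simplified, illustrative layout (the column name customer_email is hypothetical); the exact structure is documented in the DQOps YAML reference.

```yaml
columns:
  customer_email:
    monitoring_checks:
      daily:
        nulls:
          daily_nulls_percent:
            error:
              max_percent: 5.0   # raise a data quality issue when more than 5% of the values are null
```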

Configuring data sources

DQOps stores the data source configuration and the metadata of imported tables in YAML files. The files are stored in the current folder, called the DQOps user home.

The connection parameters to data sources and the metadata of imported tables can be edited in two ways: by using the DQOps user interface, or by changing the YAML configuration files directly. Data sources can also be managed from the command-line interface or with the DQOps Python client. Review the list of supported data sources to see which databases are supported by DQOps.
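
For example, after importing a table, the configuration files could be laid out as shown below. This is an illustrative layout under assumed names (the connection, schema, and table names are hypothetical):

```
<DQOps user home>/
  sources/
    sales_dwh/                                # one folder per data source (connection)
      connection.dqoconnection.yaml           # connection parameters for the data source
      public.fact_global_sales.dqotable.yaml  # metadata and checks for one imported table
```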

Configuring table metadata

The metadata and the configuration of all data quality checks are stored in .dqotable.yaml YAML files, following the naming convention sources/<data_source_name>/<schema_name>.<table_name>.dqotable.yaml. The files can be stored and versioned in your Git repository.

You can also easily add similar tables, or move the data quality check configuration from development tables to production tables by copying and renaming the .dqotable.yaml files.

Configuring data quality checks

Data quality checks are configured by setting alerting thresholds, which are defined as data quality rule parameters.

DQOps uses YAML files to keep the configuration of data sources and the activated data quality checks on monitored tables. The DQOps YAML file format is fully documented and the YAML schema files are published.

The DQOps YAML schema files enable a great coding experience in Visual Studio Code when configuring data quality checks directly in the editor. Code completion, syntax validation, and help hints are shown by Visual Studio Code and many other editors when editing DQOps YAML files.
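
A sketch of what the beginning of such a file looks like is shown below. It is simplified and illustrative; the actual schema URL and the complete structure are documented in the DQOps YAML reference.

```yaml
# yaml-language-server: $schema=<URL of the published DQOps table YAML schema>
apiVersion: dqo/v1
kind: table
spec:
  columns:
    customer_email:
      # column metadata and activated data quality checks are configured here
```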

Running data quality checks

Data quality checks configured for tables and columns can be run selectively by targeting the data source, table, column, check name, check type, check category, or even labels assigned to tables and columns.
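
For example, a targeted run from the DQOps shell could look roughly like the command below. This is an illustrative invocation with hypothetical names; the exact command and parameter names are listed in the command-line interface reference.

```
check run --connection=sales_dwh --table=public.fact_global_sales --check=daily_nulls_percent
```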

Data observability

When a new data source is imported into DQOps, a default set of data quality checks is activated. The main purpose of data observability is to detect anomalies in the data, database downtimes, schema changes, uniqueness issues, and inconsistent growth of the table volume.

DQOps user home

The DQOps user home is the most important folder; it is the place where DQOps stores all configuration files and data quality results. When DQOps is started by running python -m dqops, the current working folder is used as the default DQOps user home.

On the first run, DQOps will set up a folder tree to store the list of monitored data sources and the configuration of data quality checks for all imported tables. The configuration is stored in YAML files for simplicity of editing in Visual Studio Code.
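
As a minimal sketch, assuming DQOps is installed from PyPI as the dqops package (see the installation guide), the first run could look like this:

```bash
# the folder from which DQOps is started becomes the DQOps user home
mkdir dqops_demo && cd dqops_demo
python -m pip install dqops
python -m dqops   # the first run creates the folder tree with YAML configuration files
```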

Data quality sensors

The data quality sensors are SQL queries defined as Jinja2 templates. A sensor is called by a data quality check to capture a data quality measure such as the row count from the monitored source. The sensor's measure is called a sensor readout in DQOps.
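
A simplified sketch of what such a template looks like is shown below. It is illustrative only; the real sensor templates and the macros available in the lib namespace are documented in the sensor reference.

```sql
{# a minimal row-count style sensor template; lib.render_target_table() is assumed to
   expand to the fully qualified name of the monitored table #}
SELECT
    COUNT(*) AS actual_value
FROM {{ lib.render_target_table() }} AS analyzed_table
```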

Data quality rules

Data quality rules in DQOps are Python functions that receive the sensor readout captured by a sensor (the result of an SQL query). The rule verifies whether the sensor readout is valid or whether a data quality issue should be raised. For example, the max_percent rule verifies that the result of the null_percent sensor is acceptable.
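
The idea can be sketched in a few lines of Python. This is a self-contained illustration of the concept, not the actual DQOps rule interface; the real function signature and result object are described in the rule reference.

```python
from dataclasses import dataclass

@dataclass
class RuleExecutionResult:
    """Simplified result of a rule evaluation (illustrative, not the DQOps API)."""
    passed: bool
    expected_value: float

def evaluate_max_percent_rule(sensor_readout: float, max_percent: float) -> RuleExecutionResult:
    """Verifies a sensor readout (e.g. the percent of null values) against a max_percent threshold."""
    return RuleExecutionResult(passed=sensor_readout <= max_percent, expected_value=max_percent)

# example: 7.5% of null values measured, 5% allowed -> a data quality issue is raised
result = evaluate_max_percent_rule(sensor_readout=7.5, max_percent=5.0)
print(result.passed)  # False
```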

Data quality KPIs

Data quality is measured with data quality KPIs (Key Performance Indicators). In DQOps, a data quality KPI is defined as the percentage of passed data quality checks out of all executed data quality checks.

The data quality KPIs are calculated on multiple levels:

  • per table

  • per data source

  • per data quality dimension

  • or any combination of these and other dimensions
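
The calculation itself is simple, as the sketch below shows. It is an illustration of the formula, not DQOps code, and the sample numbers are made up.

```python
def data_quality_kpi(passed_checks: int, executed_checks: int) -> float:
    """Data quality KPI = percentage of passed checks out of all executed checks."""
    # assumed convention: when no checks were executed, nothing failed, so report 100%
    return 100.0 * passed_checks / executed_checks if executed_checks else 100.0

# example: 188 of 200 checks executed for one table passed -> 94% KPI
print(data_quality_kpi(passed_checks=188, executed_checks=200))  # 94.0
```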

Incremental data quality monitoring

Learn how partition checks are used to analyze data quality incrementally, even for very big tables, reaching terabyte or petabyte scale.

Partition checks introduced by DQOps allow detecting data quality issues very early, as soon as invalid data is loaded in the most recent batch.
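
A partition check needs a date or timestamp column by which the analyzed data is partitioned. A simplified configuration sketch is shown below; it is illustrative (the column and check names are examples), and the exact node names are documented in the DQOps YAML reference.

```yaml
spec:
  timestamp_columns:
    partition_by_column: created_date    # daily partitions are analyzed by this column
  columns:
    customer_email:
      partitioned_checks:
        daily:
          nulls:
            daily_partition_nulls_percent:
              error:
                max_percent: 5.0
```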

Data quality dashboards

DQOps stores data quality check results locally, but the data is also synchronized to a Data Quality Data Warehouse hosted in the cloud by DQOps for each user.

The data quality dashboards access the Data Quality Data Warehouse and enable reviewing data quality KPIs or drilling down to individual data quality issues. DQOps uses a custom Google Looker Studio Community Connector to access the user's Data Quality Data Warehouse.

Data quality dimensions

Data quality dimensions are the fundamental way to group data quality checks into groups of checks that detect similar issues. The most important data quality dimensions supported by DQOps are:

  • Availability watches the tables in the data source, raising a data quality issue when the table is missing or returns errors

  • Accuracy checks compare the data to the "source of truth", which means comparing tables between stages and data sources

  • Consistency monitors the data over a period of time, looking for anomalies such as a one-day spike in the percentage of null values in a column that otherwise stays within its regular daily range

  • Completeness detects missing data, for example columns with too many null values

  • Reasonableness identifies values that do not make sense or fall outside the expected range

  • Timeliness tracks freshness of data, measuring the maximum allowed age of data

  • Uniqueness finds issues related to duplicate values

  • Validity detects common field format issues, such as an email field that does not match the expected email format

Auditing time periods

DQOps captures the time period for which the data quality result is valid. This can be the data quality status at the end of the day for daily monitoring checks. Learn how DQOps captures the local time zone of monitored data sources, even if the monitored databases are located in different countries, regions, and continents.

Data grouping

A unique feature of DQOps is the ability to use a GROUP BY clause in the data quality sensors, allowing data quality checks to be run for multiple groups of rows in the same table.

For example, a table fact_global_sales that aggregates sales fact rows from multiple countries can be tested for each country. A column that identifies a country must be present in the table and data grouping must be configured.

Data grouping allows detecting data quality issues for groups of rows loaded by different data streams, different data pipelines, or received from different vendors or departments.
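
Conceptually, the sensor query generated for a grouped check resembles the SQL below. This is an illustration of the idea only, not a query literally generated by DQOps; the table name comes from the example above and the customer_email column is hypothetical.

```sql
SELECT
    country AS grouping_level_1,
    100.0 * SUM(CASE WHEN customer_email IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS actual_value
FROM fact_global_sales
GROUP BY country
```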

User interface overview

The DQOps user interface is a tabbed application that resembles many popular database management tools. Data quality checks can be configured on multiple tables at the same time, each in a separate tab.

Command-line interface

Command-line access to DQOps is supported by a shell interface. The DQOps shell supports command and table name completion.

Data storage

DQOps stores the configuration of data sources, the configuration of data quality checks activated on tables, and the data quality check execution results locally in the DQOps user home folder.

The data quality results are stored in a $DQO_USER_HOME/.data folder that is a Hive-compliant local data lake. Please read the data storage concept guide to understand the data lake structure.
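
For illustration, the layout of the local data lake resembles the tree below. This is a simplified sketch with hypothetical names; the exact Parquet tables and partition structure are described in the data storage concept guide and the Parquet schema reference.

```
$DQO_USER_HOME/.data/
  check_results/                         # Parquet table with data quality check results
    c=sales_dwh/                         # Hive-style partition: connection name
      t=public.fact_global_sales/        # partition: monitored table
        m=2024-04-01/                    # partition: month of the result
          <parquet files>
  sensor_readouts/                       # raw sensor readouts, partitioned the same way
```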

Deployment architecture

DQOps can be hosted locally, in the cloud, or as a hybrid deployment, running semi-offline DQOps instances on-premises or in the customer's cloud environment.

Check execution flow

Review the detailed execution flows that show how DQOps executes data quality sensors, data quality rules, and data quality checks, and how the data is stored. Learn also how execution errors are stored.

Other topics

Check out the other areas of the DQOps documentation.

Installing DQOps

Learn how to install DQOps using pip or Docker, or how to install it locally from a release package.

List of data sources

The list of supported data sources and how to register them in DQOps.

DQOps use cases

Review a list of data quality use cases and learn how to detect the most common data quality issues with DQOps. Each use case is a step-by-step guide, starting with the description of a problem, followed by the steps to configure relevant data quality checks, and finally showing the data quality dashboards.

Working with DQOps

The remaining step-by-step manuals not included in the DQOps basic concepts guide.

Integrations

Find out how DQOps integrates with other systems, for example how to run data quality checks in Apache Airflow or how to send data quality incident notifications to Slack.

Command-line interface

DQOps supports running commands directly from the operating system shell or using the DQOps integrated shell. The command-line interface section is a reference of all DQOps commands.

REST API Python Client

Using DQOps is not limited to the user interface or the command-line shell. All operations, such as running data quality checks, are also supported from the DQOps Python client. The REST API Python Client documentation shows ready-to-use Python code samples.

Data quality checks reference

The reference of all data quality checks that are included in DQOps. The reference of each data quality check has a YAML configuration fragment, and examples of SQL queries for each data source.

Reference

The full reference of all data quality sensors, data quality rules, DQOps YAML files and DQOps Parquet schema.