Last updated: April 09, 2024
Definition of a data quality sensor
Read this guide to understand how data quality sensors in DQOps capture metrics about monitored data sources for detecting data quality issues with rules.
What are sensors in DQOPs?
In DQOps, the data quality sensor and data quality rule form the data quality check.
The data quality sensor reads the value from the data source at a given point in time. Examples of these reads include the number of rows, the percentage of null values in a column, or the current delay between the timestamp of the latest row and the current system time.
Sensor templating
To implement sensors DQOps uses the Jinja2 templating engine which is rendered into a SQL query.
The following examples show a data quality sensor template for various database types that calculates the number of rows in a table.
SQL Template (Jinja2)
{% import '/dialects/bigquery.sql.jinja2' as lib with context -%}
SELECT
COUNT(*) AS actual_value
{{- lib.render_data_stream_projections('analyzed_table') }}
{{- lib.render_time_dimension_projection('analyzed_table') }}
FROM {{ lib.render_target_table() }} AS analyzed_table
{{- lib.render_where_clause() -}}
{{- lib.render_group_by() -}}
{{- lib.render_order_by() -}}
{% import '/dialects/postgresql.sql.jinja2' as lib with context -%}
SELECT
COUNT(*) AS actual_value
{{- lib.render_data_stream_projections('analyzed_table') }}
{{- lib.render_time_dimension_projection('analyzed_table') }}
FROM {{ lib.render_target_table() }} AS analyzed_table
{{- lib.render_where_clause() -}}
{{- lib.render_group_by() -}}
{{- lib.render_order_by() -}}
{% import '/dialects/redshift.sql.jinja2' as lib with context -%}
SELECT
COUNT(*) AS actual_value
{{- lib.render_data_grouping_projections('analyzed_table') }}
{{- lib.render_time_dimension_projection('analyzed_table') }}
FROM {{ lib.render_target_table() }} AS analyzed_table
{{- lib.render_where_clause() -}}
{{- lib.render_group_by() -}}
{{- lib.render_order_by() -}}
{% import '/dialects/snowflake.sql.jinja2' as lib with context -%}
SELECT
COUNT(*) AS actual_value
{{- lib.render_data_grouping_projections('analyzed_table') }}
{{- lib.render_time_dimension_projection('analyzed_table') }}
FROM {{ lib.render_target_table() }} AS analyzed_table
{{- lib.render_where_clause() -}}
{{- lib.render_group_by() -}}
{{- lib.render_order_by() -}}
The file starts with an import of reusable dialects specific to the database.
Macro name | Description |
---|---|
render_target_table |
Adds target table name. |
render_data_grouping_projections |
Adds data grouping columns to the list of columns returned to support the GROUP BY clause. Read the data grouping configuration manual to learn how to configure data grouping. Data grouping configuration allows tracking data quality results for different groups of rows stored in the same table. |
render_time_dimension_projection |
Optional time dimension projection that allows measuring individual data quality results for each time period (hour, day, week, etc.) separately. |
render_where_clause |
The WHERE clause is used to filter records. |
render_group_by |
The GROUP BY statement groups rows by dates, partitions or additional columns. The columns selected for data grouping are also returned. |
render_order_by |
The ORDER BY statement sorts the results from the oldest to the newest daily partitions. |
Sensor types and categories
Sensors are divided into two types: table and column. Each type has several categories and subcategories. A full list of sensors within each category is available at the link.
Sensor category | Description |
---|---|
Table | |
accuracy table sensors | Verifies the correct number of rows between the tested and reference table. |
availability table sensors | Detects problems with data source stability. |
custom sql table sensors | Verifies custom SQL queries at the table level. |
schema table sensors | Detects the schema changes at the table level. |
timeliness table sensors | Verifies freshness, staleness, and ingestion delay. |
volume table sensors | Verifies data volume. |
Column | |
accepted values column sensors | Verifies that values in a column are only from an accepted list of values. |
accuracy column sensors | Verifies the correctness of data between the tested and reference column. |
bool column sensors | Detects issues in columns with boolean-type data. |
conversion column sensors | Detects text value that cannot be parsed and converted to boolean, numeric or date types. |
custom sql column sensors | Verifies custom SQL queries at the column level. |
datatype column sensors | Detects issues in columns with datetime-type data. |
datetime column sensors | Detects issues in columns with datatype data. |
integrity column sensors | Detects integrity issues between different columns. |
nulls column sensors | Detects nulls and not nulls on columns. |
numeric column sensors | Detects issues in columns with numeric data such as negative values and values in sets or in range. Detects whether basic statistic calculations such as max, min, mean, median, sum, and stddev are in range. |
patterns column sensors | Verifies text columns by matching the values to custom or predefined patterns (regular expressions) |
pii column sensors | Detects the presence of sensitive data such as phone number, zip code, e-mail, and IP4 and IP6 addresses. |
range column sensors | Detects min or max values. |
sampling column sensors | Retrieves column value samples. |
schema column sensors | Detects the schema changes at the column level. |
text column sensors | Detects issues in columns with string-type data. |
uniqueness column sensors | Detect uniqueness and duplications. |
whitespace column sensors | Detects columns storing blank and whitespace values, such as 'None', 'n/a', ''. |
Sensor readout data storage
DQOps stores a copy of the sensor data locally on the DQOps instance. The data files are stored as Apache Parquet files in an Apache Hive-compatible folder tree, partitioned by the data source, monitored table name, and the month. The sensor's query results are called sensor readouts in DQOps. The results are stored in a sensor_readouts parquet table as described in the data quality results storage concept.
Configure sensors in UI
You can easily configure sensors in DQOps using the Configuration section of the user interface.
Sensor definition screen
Below is an example of screens with the definition for the text_length_in_range_percent sensor and the Jinja template for the PostgreSQL database which can be modified.
The first tab named the Sensor definition is responsible for editing the specification files for a custom data quality sensor stored in the $DQO_USER_HOME/sensors/**/*.dqosensor.yaml files.
Database specific SQL query template
The Jinja2 query templates are edited on the tabs named as the data sources. The configuration for each data source is stored in two files:
-
sensors/**/<data_source_type>.dqoprovidersensor.yaml specification file with additional configuration for that data source type.
-
Jinja2 file stored in the sensors/**/<data_source_type>.sql.jinja2 file.
What's next
- Learn how DQOps executes data quality sensors and rules.
- Learn how the data quality results are stored.
- Review the full reference of all supported data quality sensors with examples of templated SQL queries for each data source.