
Regex match percent

Description

The column validity check regex_match_percent counts the records matching a provided regex, then divides this number by the row count. Multiplied by 100.0, this figure tells us the percentage of records that follow a certain format.

You can use this check on any column that you can run regular expressions on. Typically, it would be a STRING or a VARCHAR data type.

This check has two optional parameters: named_regex, which uses predefined regexes, and custom_regex, where you provide a regular expression yourself. Both parameters are described in the Parameters section below.
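The computation itself can be sketched in Python (a minimal illustration with hypothetical data, not part of DQO, which runs this as SQL):

```python
import re

def regex_match_percent(values, pattern):
    """Percent of non-null values matching the regex (anchored pattern)."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return None  # mirrors the SQL CASE that returns NULL for zero rows
    matched = sum(1 for v in non_null if re.match(pattern, v))
    return 100.0 * matched / len(non_null)

# Two of the three non-null records follow the 22-NNNNNNNN format:
print(regex_match_percent(
    ["22-00158223", "22-00157769", "bad-key", None],
    r"^22-[0-9]{8}$",
))  # -> 66.66666666666667
```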

BigQuery example with custom regex

First, let's walk through an example of this check step by step.

We refer to the query result as actual_value.

Sample data from BigQuery

The table is bigquery-public-data.austin_311.311_service_requests; here are the first 5 rows:

+-------------+-----------------------------------------------------+--------+--------+----------------------+----------------------+----------------------+------------+------------------------------------------+---------------+----------------------------+--------+--------------+--------+--------------------------+--------------------------+------------+--------------+-------------------------------+-----------------------+----------+----------+
| unique_key  |                complaint_description                | source | status |  status_change_date  |     created_date     |   last_update_date   | close_date |             incident_address             | street_number |        street_name         |  city  | incident_zip | county | state_plane_x_coordinate | state_plane_y_coordinate |  latitude  |  longitude   |           location            | council_district_code | map_page | map_tile |
+-------------+-----------------------------------------------------+--------+--------+----------------------+----------------------+----------------------+------------+------------------------------------------+---------------+----------------------------+--------+--------------+--------+--------------------------+--------------------------+------------+--------------+-------------------------------+-----------------------+----------+----------+
| 22-00158223 | Street Light Issue- Multiple poles/multiple streets | Phone  | Open   | 2022-05-09T06:49:17Z | 2022-05-09T06:44:55Z | 2022-05-09T06:49:17Z |            |                                          |               |                            |        |              |        |                          |                          |            |              |                               |                       |          |          |
| 22-00157769 | Street Light Issue- Multiple poles/multiple streets | E-Mail | Open   | 2022-05-09T06:04:55Z | 2022-05-08T12:03:18Z | 2022-05-09T06:04:55Z |            |                                          |               |                            |        |              |        |                          |                          |            |              |                               |                       |          |          |
| 22-00157406 | Sidewalk/Curb Ramp/Route - NEW                      | Web    | New    | 2022-05-07T20:22:07Z | 2022-05-07T20:22:07Z | 2022-05-07T20:22:07Z |            | "6600 FRONTAGE RD, AUSTIN, TX"           |               | 6600 FRONTAGE RD           | AUSTIN |              |        |               0.00000000 |                      0.0 | 3.44238668 | -105.9831947 | "(3.442386682, -105.9831947)" |                       |          |          |
| 22-00157012 | Parking Machine Issue                               | Phone  | New    | 2022-05-07T11:54:08Z | 2022-05-07T11:54:08Z | 2022-05-07T11:54:08Z |            |                                          |               |                            |        |              |        |                          |                          |            |              |                               |                       |          |          |
| 22-00156881 | Parking Machine Issue                               | Phone  | New    | 2022-05-07T10:04:56Z | 2022-05-07T10:04:56Z | 2022-05-07T10:04:56Z |            |                                          |               |                            |        |              |        |                          |                          |            |              |                               |                       |          |          |
+-------------+-----------------------------------------------------+--------+--------+----------------------+----------------------+----------------------+------------+------------------------------------------+---------------+----------------------------+--------+--------------+--------+--------------------------+--------------------------+------------+--------------+-------------------------------+-----------------------+----------+----------+

and the column we are checking for validity is unique_key. We will check whether the records follow the expected format by using a custom_regex.
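As a quick sanity check (outside of DQO), the sampled keys can be tested against that pattern in Python:

```python
import re

pattern = re.compile(r"^22-[0-9]{8}$")

# All five sampled unique_key values follow the 22-NNNNNNNN format:
keys = ["22-00158223", "22-00157769", "22-00157406", "22-00157012", "22-00156881"]
print(all(pattern.match(k) for k in keys))  # -> True
```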

Check configuration

After adding a connection and importing tables (in this example, the connection, table, and check are already set up, so you can go straight to running the check), you can access the table configuration by running:

table edit -c=conn_bq_7 -t=austin_311.311_service_requests

The YAML configuration looks like this (all the code necessary to define this check is highlighted below):

# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: austin_311
    table_name: 311_service_requests
  time_series:
    mode: current_time
    time_gradient: day
  filter: city = "Austin"
  columns:
    unique_key:
      type_snapshot:
        column_type: STRING
        nullable: true
      checks:
        validity:
          regex_match_percent:
            parameters:
              custom_regex: "^22-[0-9]{8}$"
            rules:
              min_count:
                low:
                  min_value: 98.0
                medium:
                  min_value: 95.0
                high:
                  min_value: 90.0
    ...
    created_date:
      type_snapshot:
        column_type: TIMESTAMP
        nullable: true
    ...
    city:
      type_snapshot:
        column_type: STRING
        nullable: true
    ...

Let's review what this configuration means.

Sensor

The sensor for this check is column/validity/regex_match_percent, which will be executed on the unique_key column. We passed a custom_regex parameter to run the ^22-[0-9]{8}$ regex, so we expect the records to start with 22- followed by 8 digits. This parameter is passed to the rendered query:

SELECT CASE
        WHEN COUNT(analyzed_table.`unique_key`) = 0 THEN NULL
        ELSE 100.0 * SUM(
            CASE
                WHEN REGEXP_CONTAINS(
                    analyzed_table.`unique_key`,
                    r'^22-[0-9]{8}$'
                ) IS TRUE THEN 1
                ELSE 0
            END
        ) / COUNT(analyzed_table.`unique_key`)
    END AS actual_value, CAST(CURRENT_TIMESTAMP() AS date) AS time_period
FROM `bigquery-public-data`.`austin_311`.`311_service_requests` AS analyzed_table
WHERE city = "Austin"
GROUP BY time_period
ORDER BY time_period
Finished executing a sensor for a check regex_match_percent on the table austin_311.311_service_requests using a sensor definition column/validity/regex_match_percent, sensor result count: 2

Results returned by the sensor:
+------------+-----------+
|actual_value|time_period|
+------------+-----------+
|100.0       |2022-05-10 | <--- The timestamp of a data quality snapshot 
+------------+-----------+      (current timestamp)

The table above is exactly the same as the one you would see on the provider's platform (in this case, BigQuery).

The query returned two columns: actual_value, which is the percentage of records with a valid format, and time_period, configured with time_series. With mode=current_time, the purpose of time_period is to record the date of check execution.

Rule

To evaluate check results, we have to define a rule, which is done in the rules section of the check configuration:

rules:
  min_count:
    low:
      min_value: 98.0
    medium:
      min_value: 95.0
    high:
      min_value: 90.0

The min_count rule configuration says that if actual_value >= 98.0, the result is valid; if 98.0 > actual_value >= 95.0, the severity is 1 (low); if 95.0 > actual_value >= 90.0, the severity is 2 (medium); and if 90.0 > actual_value, the severity is 3 (high).
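The rule logic above can be sketched in Python (an illustration of the thresholds, not DQO's implementation):

```python
def min_count_severity(actual_value, low=98.0, medium=95.0, high=90.0):
    """Return 0 (valid), 1 (low), 2 (medium) or 3 (high) for a min_count rule."""
    if actual_value >= low:
        return 0
    if actual_value >= medium:
        return 1
    if actual_value >= high:
        return 2
    return 3

print(min_count_severity(100.0))  # -> 0, the valid result from this example
print(min_count_severity(96.5))   # -> 1, a low severity alert
```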

Finished executing rules (thresholds) for a check regex_match_percent on the table austin_311.311_service_requests, verified rules count: 2

Rule evaluation results:
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+-----------+--------------------+-------------------+-----------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|actual_value|expected_value|time_period     |time_gradient|dimension_id|connection_hash    |connection_name|provider|table_hash         |schema_name|table_name          |column_hash        |column_name|check_hash         |check_name         |quality_dimension|sensor_name                        |executed_at             |duration_ms|severity|rule_hash          |rule_name|high_lower_bound|medium_lower_bound|low_lower_bound|
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+-----------+--------------------+-------------------+-----------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|100.0       |90.0          |2022-05-10T00:00|day          |0           |5183897182689125639|conn_bq_7      |bigquery|1483428340984869430|austin_311 |311_service_requests|6580082490137205204|unique_key |3062396720704446212|regex_match_percent|validity         |column/validity/regex_match_percent|2022-05-12T08:25:57.528Z|2711       |0       |2231697474420509704|min_count|90.0            |95.0              |98.0           |
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+-----------+--------------------+-------------------+-----------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
The other table we receive shows the rule evaluation results. The severity column tells us the assigned severity levels. Because our result is 100.0, which is greater than the low severity threshold (in this case 98.0), the severity is 0 - a valid result.

Check summary

Check evaluation summary briefly informs us about check execution:

Check evaluation summary per table:
+----------+-------------------------------+------+--------------+-------------+------------+---------------+-------------+
|Connection|Table                          |Checks|Sensor results|Valid results|Alerts (low)|Alerts (medium)|Alerts (high)|
+----------+-------------------------------+------+--------------+-------------+------------+---------------+-------------+
|conn_bq_7 |austin_311.311_service_requests|1     |1             |1            |0           |0              |0            |
+----------+-------------------------------+------+--------------+-------------+------------+---------------+-------------+

There is only one check defined, and we received one sensor result, to which severity level 0 (a valid result) was assigned.

Example with built-in regexes

Sample email and phone numbers

Suppose you have the following table:

+----+---------------------+---------------+
| id | phone_numbers       | emails        |
+----+---------------------+---------------+
|  1 | 123456789           | user1@dqo.com |
|  2 | 987654321           | user2@dqo.com |
|  3 | -12AA22134          | user3@dqo.com |
|  4 | + tel993456         | @@usr@dqo.com |
|  5 | 111222333           | 12-.usr,com   |
+----+---------------------+---------------+
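Assuming these five rows, the sensor results shown later can be reproduced with a few lines of Python (illustrative only; the real check runs as SQL on the provider):

```python
import re

# The two built-in regexes listed later in this document.
PHONE_REGEX = r"^[pP]?[tT]?[nN]?[oO]?[0-9]{7,11}$"
EMAIL_REGEX = r"^[A-Za-z]+[A-Za-z0-9.]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$"

def match_percent(values, pattern):
    return 100.0 * sum(bool(re.match(pattern, v)) for v in values) / len(values)

phones = ["123456789", "987654321", "-12AA22134", "+ tel993456", "111222333"]
emails = ["user1@dqo.com", "user2@dqo.com", "user3@dqo.com", "@@usr@dqo.com", "12-.usr,com"]

print(match_percent(phones, PHONE_REGEX))  # -> 60.0 (3 of 5 rows match)
print(match_percent(emails, EMAIL_REGEX))  # -> 60.0 (3 of 5 rows match)
```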

Check configuration

We will check for correct formatting on two columns: phone_numbers and emails, with the predefined regexes phoneNumber and email respectively.

# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: dqo_ai_test_data
    table_name: test_data_regex_sensor_179306422851143075
  time_series:
    mode: current_time
    time_gradient: day
  columns:
    id:
      type_snapshot:
        column_type: INT64
        nullable: true
    phone_numbers:
      type_snapshot:
        column_type: STRING
        nullable: true
      checks:   #(1)
        validity:
          regex_match_percent:
            parameters:
              named_regex: phoneNumber
            rules:
              min_count:
                low:
                  min_value: 90.0
                medium:
                  min_value: 70.0
                high:
                  min_value: 50.0
    emails:
      type_snapshot:
        column_type: STRING
        nullable: true
      checks:   #(2)
        validity:
          regex_match_percent:
            parameters:
              named_regex: email
            rules:
              min_count:
                low:
                  min_value: 90.0
                medium:
                  min_value: 70.0
                high:
                  min_value: 50.0
  1. Here is a check configuration for phone number validation using named_regex = phoneNumber.
  2. Here is a check configuration for email validation using named_regex = email.

Sensor

Here are the rendered queries with predefined regular expressions.

For the following query, the phone number format regex was selected with the named_regex: phoneNumber parameter:

SELECT CASE
        WHEN COUNT(analyzed_table.`phone_numbers`) = 0 THEN NULL
        ELSE 100.0 * SUM(
            CASE
                WHEN REGEXP_CONTAINS(
                    analyzed_table.`phone_numbers`,
                    r'^[pP]?[tT]?[nN]?[oO]?[0-9]{7,11}$'
                ) IS TRUE THEN 1
                ELSE 0
            END
        ) / COUNT(analyzed_table.`phone_numbers`)
    END AS actual_value, CAST(CURRENT_TIMESTAMP() AS date) AS time_period
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`test_data_regex_sensor_179306422851143075` AS analyzed_table
GROUP BY time_period
ORDER BY time_period
and for the email format regex, with named_regex: email:

SELECT CASE
        WHEN COUNT(analyzed_table.`emails`) = 0 THEN NULL
        ELSE 100.0 * SUM(
            CASE
                WHEN REGEXP_CONTAINS(
                    analyzed_table.`emails`,
                    r'^[A-Za-z]+[A-Za-z0-9.]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$'
                ) IS TRUE THEN 1
                ELSE 0
            END
        ) / COUNT(analyzed_table.`emails`)
    END AS actual_value, CAST(CURRENT_TIMESTAMP() AS date) AS time_period
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`test_data_regex_sensor_179306422851143075` AS analyzed_table
GROUP BY time_period
ORDER BY time_period
Finished executing a sensor for a check regex_match_percent on the table dqo_ai_test_data.test_data_regex_sensor_179306422851143075 using a sensor definition column/validity/regex_match_percent, sensor result count: 1

Results returned by the sensor: #(1)
+------------+-----------+
|actual_value|time_period|
+------------+-----------+
|60.0        |2022-05-10 |
+------------+-----------+

Results returned by the sensor: #(2)
+------------+-----------+
|actual_value|time_period|
+------------+-----------+
|60.0        |2022-05-10 |
+------------+-----------+
  1. Result for phone numbers.
  2. Result for emails.

Rule

To evaluate check results, we have to define rules for both checks. They are defined in the rules sections of the YAML configuration above.

rules:
  min_count:
    low:
      min_value: 90.0
    medium:
      min_value: 70.0
    high:
      min_value: 50.0

The min_count rule configuration says that if actual_value >= 90.0, the result is valid; if 90.0 > actual_value >= 70.0, the severity is 1 (low); if 70.0 > actual_value >= 50.0, the severity is 2 (medium); and if 50.0 > actual_value, the severity is 3 (high).

Finished executing rules (thresholds) for a check regex_match_percent on the table dqo_ai_test_data.test_data_regex_sensor_179306422851143075, verified rules count: 1

Rule evaluation results: #(1)
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+----------------+-----------------------------------------+-------------------+-------------------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|actual_value|expected_value|time_period     |time_gradient|dimension_id|connection_hash    |connection_name|provider|table_hash         |schema_name     |table_name                               |column_hash        |column_name        |check_hash         |check_name         |quality_dimension|sensor_name                        |executed_at             |duration_ms|severity|rule_hash          |rule_name|high_lower_bound|medium_lower_bound|low_lower_bound|
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+----------------+-----------------------------------------+-------------------+-------------------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|60.0        |50.0          |2022-05-10T00:00|day          |0           |5914958588201149627|dqo-ai-testing |bigquery|4257768357366869193|dqo_ai_test_data|test_data_regex_sensor_179306422851143075|1236866524037266806|phone_numbers_mixed|5188358281981702970|regex_match_percent|validity         |column/validity/regex_match_percent|2022-05-10T13:54:38.032Z|3540       |2       |1391659706564390862|min_count|50.0            |70.0              |90.0           |
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+----------------+-----------------------------------------+-------------------+-------------------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+

Rule evaluation results: #(2)
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+----------------+-----------------------------------------+-------------------+-----------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|actual_value|expected_value|time_period     |time_gradient|dimension_id|connection_hash    |connection_name|provider|table_hash         |schema_name     |table_name                               |column_hash        |column_name|check_hash         |check_name         |quality_dimension|sensor_name                        |executed_at             |duration_ms|severity|rule_hash          |rule_name|high_lower_bound|medium_lower_bound|low_lower_bound|
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+----------------+-----------------------------------------+-------------------+-----------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
|60.0        |50.0          |2022-05-10T00:00|day          |0           |5914958588201149627|dqo-ai-testing |bigquery|4257768357366869193|dqo_ai_test_data|test_data_regex_sensor_179306422851143075|8879968814725603742|email_mixed|1043048580860418850|regex_match_percent|validity         |column/validity/regex_match_percent|2022-05-10T13:54:42.046Z|1517       |2       |7881665242135922542|min_count|50.0            |70.0              |90.0           |
+------------+--------------+----------------+-------------+------------+-------------------+---------------+--------+-------------------+----------------+-----------------------------------------+-------------------+-----------+-------------------+-------------------+-----------------+-----------------------------------+------------------------+-----------+--------+-------------------+---------+----------------+------------------+---------------+
  1. Rule evaluation for phone numbers.
  2. Rule evaluation for emails.

Check summary

This connection contains more than one table, and the summary is displayed for all of them. The check evaluation summary for our table is highlighted below.

Check evaluation summary per table:
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+
|Connection    |Table                                                                                      |Checks|Sensor |Valid  |Alerts|Alerts  |Alerts|
|              |                                                                                           |      |results|results|(low) |(medium)|(high)|
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+
|dqo-ai-testing|dqo_ai_test_data.continuous_days_date_and_string_formats_7998702180845642887               |0     |0      |0      |0     |0       |0     |
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+
 ...
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+
|dqo-ai-testing|dqo_ai_test_data.test_average_goog_8863335308813025204                                     |0     |0      |0      |0     |0       |0     |
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+
|dqo-ai-testing|dqo_ai_test_data.test_data_regex_sensor_179306422851143075                                 |2     |2      |0      |0     |2       |0     |
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+
|dqo-ai-testing|dqo_ai_test_data.test_data_timeliness_sensors_1287831065858823464                          |0     |0      |0      |0     |0       |0     |
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+
 ...
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+
|dqo-ai-testing|dqo_ai_test_data.test_data_time_series_1901211650245440619                                 |0     |0      |0      |0     |0       |0     |
+--------------+-------------------------------------------------------------------------------------------+------+-------+-------+------+--------+------+

When to use

This check is most useful when you need to validate your data for correct formatting. For example, email addresses are allowed to contain only permitted symbols and are expected to have exactly one @ symbol. That is when you could use named_regex = email, or provide your own regex with custom_regex.

Let's have a look at a ready-to-run example, which is described below.


Used sensor

Regex match percent

Errors detected

Invalid records that do not follow a specified format. This could be incorrect email formatting, or anything else that can be matched with a regex.

Parameters

This check has two optional parameters that configure the sensor:

  • named_regex: str (optional, default = missing)
    a predefined regex; the available regexes are listed in the List of built-in regexes section below
  • custom_regex: str (optional, default = missing)
    used for defining custom regexes, provided as strings, e.g. custom_regex="^[0-9 -]{11}$"

named_regex vs. custom_regex

When defining a check, keep in mind that named_regex overrides custom_regex, meaning that if you specify both named_regex and custom_regex, the query will run with named_regex only. So, for example,

...
  checks:
    validity:
      regex_match_percent:
        parameters:
          named_regex: email
          custom_regex: "^22-[0-9]{8}$"
        rules:
          min_count:
...
will result in
SELECT CASE
        WHEN COUNT(analyzed_table.`col`) = 0 THEN NULL
        ELSE 100.0 * SUM(
            CASE
                WHEN REGEXP_CONTAINS(
                    analyzed_table.`col`,
                    r'^[A-Za-z]+[A-Za-z0-9.]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$'
                ) IS TRUE THEN 1
                ELSE 0
            END
        ) / COUNT(analyzed_table.`col`)
    END AS actual_value, CAST(CURRENT_TIMESTAMP() AS date) AS time_period
FROM `project`.`dataset`.`table` AS analyzed_table
GROUP BY time_period
ORDER BY time_period
where the rendered regex is email, one of the built-in regexes listed below.
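The precedence can be sketched as follows (hypothetical helper names; the actual resolution happens inside DQO's sensor templates):

```python
# Hypothetical lookup table mirroring the built-in regexes listed below.
NAMED_REGEXES = {
    "email": r"^[A-Za-z]+[A-Za-z0-9.]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$",
    "phoneNumber": r"^[pP]?[tT]?[nN]?[oO]?[0-9]{7,11}$",
}

def resolve_regex(named_regex=None, custom_regex=None):
    """named_regex wins whenever both parameters are set."""
    if named_regex is not None:
        return NAMED_REGEXES[named_regex]
    return custom_regex

# Both parameters set: the named regex is the one that runs.
print(resolve_regex(named_regex="email", custom_regex="^22-[0-9]{8}$"))
```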

List of built-in regexes

Name        Regex
email       '^[A-Za-z]+[A-Za-z0-9.]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}$'
phoneNumber '^[pP]?[tT]?[nN]?[oO]?[0-9]{7,11}$'

Info

The regex parameter is required in a certain sense: if no regex is provided, the sensor matches against an empty string, so all the data will be considered valid, giving a score of 100%:

SELECT 100.0* SUM(
        CASE
            WHEN REGEXP_CONTAINS(analyzed_table.`unique_key`, r'') IS TRUE THEN 1
            ELSE 0
        END
    ) / COUNT(*) AS actual_value
FROM `bigquery-public-data`.`austin_311`.`311_service_requests` AS analyzed_table;


+--------------+
| actual_value |
+--------------+
|        100.0 |
+--------------+
SELECT 100.0* SUM(
        CASE
            WHEN REGEXP_LIKE(analyzed_table.`unique_key`, '') THEN 1
            ELSE 0
        END
    ) / COUNT(*) AS actual_value
FROM `bigquery-public-data`.`austin_311`.`311_service_requests` AS analyzed_table;


+--------------+
| actual_value |
+--------------+
|        100.0 |
+--------------+

Errors do not show up if no regex is passed, but this is something you should avoid.
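Python's re module behaves the same way: an empty pattern matches every string, so every record would count as valid:

```python
import re

values = ["22-00158223", "not a key at all", ""]
# re.search with an empty pattern matches any string, even the empty one.
print(all(re.search(r"", v) for v in values))  # -> True
```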


Accepted rules

min_count

Errors detected: the percentage of records matching a certain format falls below a minimal threshold.

Parameters:

If actual_value >= low.min_value, the severity level is 0 - valid.

  • low:
    rule threshold for a low severity (1) alert
    • min_value: float
      minimum accepted value for the actual_value returned by the sensor (inclusive)
  • medium:
    rule threshold for a medium severity (2) alert
    • min_value: float
      minimum accepted value for the actual_value returned by the sensor (inclusive)
  • high:
    rule threshold for a high severity (3) alert
    • min_value: float
      minimum accepted value for the actual_value returned by the sensor (inclusive)

count_equals

Errors detected: the percentage of records matching a certain format falls outside the expected error band.

Parameters:

  • low:
    rule threshold for a low severity (1) alert
    • expectedValue: float
      expected value for the actual_value returned by the sensor. The sensor value should equal expected_value +/- the error_margin.
    • errorMargin: float
      error margin for comparison.
  • medium:
    rule threshold for a medium severity (2) alert
    • expectedValue: float
      expected value for the actual_value returned by the sensor. The sensor value should equal expected_value +/- the error_margin.
    • errorMargin: float
      error margin for comparison.
  • high:
    rule threshold for a high severity (3) alert
    • expectedValue: float
      expected value for the actual_value returned by the sensor. The sensor value should equal expected_value +/- the error_margin.
    • errorMargin: float
      error margin for comparison.
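A possible evaluation of the count_equals rule can be sketched in Python (an illustration under the assumption that bands escalate the same way as in min_count; not DQO's implementation):

```python
def count_equals_severity(actual_value, low, medium, high):
    """Each band is (expected_value, error_margin). Inside the low band the
    result is 0 (valid); inside medium -> 1; inside high -> 2; otherwise -> 3."""
    for severity, (expected, margin) in enumerate((low, medium, high)):
        if abs(actual_value - expected) <= margin:
            return severity
    return 3

bands = ((100.0, 2.0), (100.0, 5.0), (100.0, 10.0))
print(count_equals_severity(99.0, *bands))  # -> 0, within the tightest band
print(count_equals_severity(93.0, *bands))  # -> 2, only within the widest band
```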