Skip to content

Values in set percent

Description

The values_in_set_percent calculates a percentage of records that match provided values. This check accepts two parameters: a list of values to match (values_list) and a desired type of data (values_type).

The check is performed by using SQL function IN(). Depending on the provider, the given values are passed differently to query engine in accordance with provider's rules.

When user specifies a different data type than the one in the column, sensor tries to cast column values to specified type. If a data type is not provided, and data types do not match, the default type to cast is STRING.

Info

In all cases except for DATE, check casts the column for selected data type. In case of DATE, values provided in the list are cast. This is done for a technical reasons, e.g. when table is time partitioned.


When to use

We recommend using this check when you would like to validate the data for certain values. For example when you expect country column to contain records from few countries, e.g. Poland, United States, Germany (see the example).

Warning

Running this check defining the wrong data type for the column might result with an error. This is because casting some types to another is impossible, for e.g. DATE to NUMERIC. That is why we recommend using this query on STRING types, where such errors do not occur.


Used sensor

Values in set percent


Accepted rules

Min count

Count equals


Parameters

This checks has two parameters that configure sensor:

  • values_type: str (Optional)
    predefined values type provided from user, types are listed in enum (default: STRING) BuiltInListFormats
  • values_list: list
    list of values provided by the user

The default type is STRING, which is one of the types of values_type. In case of values_list user has to provide a list of values compatible with values_type(see the examples).


List of built in list formats

Name Type
NUMERIC 'NUMERIC'
STRING 'STRING'
DATE 'DATE'
Name Type
NUMERIC 'NUMERIC'
STRING 'STRING'
DATE 'DATE'

How to use

Default configuration

The default configuration of column validity check values_in_set_percent with min_count rule. If the parameter values_type is not chosen, the default data type for values provided from user is STRING.

Table configuration in YAML file for column check values_in_set_percent on a column countries in table table in test_data dataset in dqo-ai project on BigQuery and min_count rule ,looks like this

# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: dqo_ai_test_data
    table_name: table
  time_series:
    mode: current_time
    time_gradient: day
  columns:
    countries:
      type_snapshot:
        column_type: STRING
        nullable: true
      checks:
        validity:
          values_in_set_percent:
            parameters:
              values_list: ["PL","US","DE"]
            rules:
              min_count:
                low:
                  min_value: 90.0
                medium:
                  min_value: 80.0
                high:
                  min_value: 70.0
For default check configuration and passed values_list: ["PL","US","DE"] the rendered query is
SELECT
    CASE
        WHEN COUNT(*) = 0 THEN NULL
        ELSE
            100.0 * SUM(
                CASE
                    WHEN (analyzed_table.`countries` in ('PL', 'US', 'DE')) IS TRUE THEN 1
                    ELSE 0
                END
            )/COUNT(*)
    END AS actual_value
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`table` AS analyzed_table

String to numeric

Table configuration in YAML file for column check values_in_set_percent on a column mix_string_int in table table in test_data dataset in dqo-ai project on BigQuery and min_count rule, looks like this

# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: dqo_ai_test_data
    table_name: table
  time_series:
    mode: current_time
    time_gradient: day
  columns:
    mix_string_int:
      type_snapshot:
        column_type: STRING
        nullable: true
      checks:
        validity:
          values_in_set_percent:
            parameters:
              values_type: STRING
              values_list: [1,2,3]
            rules:
              min_count:
                low:
                  min_value: 90.0
                medium:
                  min_value: 80.0
                high:
                  min_value: 70.0
In this case user provided values_type as STRING but values_list has numeric values so sensor convert NUMERIC values to STRING. The rendered query is
SELECT
    CASE
        WHEN COUNT(*) = 0 THEN NULL
        ELSE
            100.0 * SUM(
                CASE
                    WHEN (analyzed_table.`mix_string_int` in ('1', '2', '3')) IS TRUE THEN 1
                    ELSE 0
                END
            )/COUNT(*)
    END AS actual_value
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`table` AS analyzed_table

Numeric to numeric

Table configuration in YAML file for column check values_in_set_percent on a column id in table table in test_data dataset in dqo-ai project on BigQuery and min_count rule, looks like this

# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: dqo_ai_test_data
    table_name: table
  time_series:
    mode: current_time
    time_gradient: day
  columns:
    id:
      type_snapshot:
        column_type: INT64
        nullable: true
      checks:
        validity:
          values_in_set_percent:
            parameters:
              values_type: NUMERIC
              values_list: [1,2,3]
            rules:
              min_count:
                low:
                  min_value: 90.0
                medium:
                  min_value: 80.0
                high:
                  min_value: 70.0
In this case values_type is the same to values type into values_list, so sensor tries cast column to provided type. The rendered query is
SELECT
    CASE
        WHEN COUNT(*) = 0 THEN NULL
        ELSE
            100.0 * SUM(
                CASE
                    WHEN (CAST(analyzed_table.`id` AS NUMERIC) in (1, 2, 3)) IS TRUE THEN 1
                    ELSE 0
                END
            )/COUNT(*)
    END AS actual_value
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`table` AS analyzed_table

Date to date

Table configuration in YAML file for column check values_in_set_percent on a column date in table table in test_data dataset in dqo-ai project on BigQuery and min_count rule, looks like this

# yaml-language-server: $schema=https://cloud.dqo.ai/dqo-yaml-schema/TableYaml-schema.json
apiVersion: dqo/v1
kind: table
spec:
  target:
    schema_name: dqo_ai_test_data
    table_name: table
  time_series:
    mode: current_time
    time_gradient: day
  columns:
    date:
      type_snapshot:
        column_type: DATE
        nullable: true
      checks:
        validity:
          values_in_set_percent:
            parameters:
              values_type: DATE
              values_list: ["2020-02-01","2020-02-02"]
            rules:
              min_count:
                low:
                  min_value: 90.0
                medium:
                  min_value: 80.0
                high:
                  min_value: 70.0
In this case user provided values_type as a DATE, so sensor cast provided values to DATE, even then type is equal to DATE. The rendered query is
SELECT
    CASE
        WHEN COUNT(*) = 0 THEN NULL
        ELSE
            100.0 * SUM(
                CASE
                    WHEN (analyzed_table.`date` in (CAST('2020-02-01' AS DATE), CAST('2020-02-02' AS DATE))) IS TRUE THEN 1
                    ELSE 0
                END
            )/COUNT(*)
    END AS actual_value
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`table` AS analyzed_table