Values in set percent
Description
The values_in_set_percent
calculates a percentage of records that match provided values.
This check accepts two parameters: a list of values to match (values_list
) and a desired type of data (values_type
).
The check is performed by using SQL function IN()
. Depending on the provider, the given values are passed
differently to query engine in accordance with provider's rules.
When user specifies a different data type than the one in the column, sensor tries to cast column values to specified
type. If a data type is not provided, and data types do not match, the default type to cast is STRING
.
Info
In all cases except for DATE
, check casts the column for selected data type. In case of DATE
, values provided in
the list are cast. This is done for a technical reasons, e.g. when table is time partitioned.
When to use
We recommend using this check when you would like to validate the data for certain values. For example when you expect
country
column to contain records from few countries, e.g. Poland, United States, Germany (see the
example).
Warning
Running this check defining the wrong data type for the column might result with an error.
This is because casting some types to another is impossible, for e.g. DATE
to NUMERIC
. That is why we
recommend using this query on STRING
types, where such errors do not occur.
Used sensor
Accepted rules
Parameters
This checks has two parameters that configure sensor:
values_type
: str (Optional)
predefined values type provided from user, types are listed in enum (default:STRING
) BuiltInListFormatsvalues_list
: list
list of values provided by the user
The default type is STRING
, which is one of the types of values_type
.
In case of values_list
user has to provide a list of values compatible with values_type
(see the examples).
List of built in list formats
Name | Type |
---|---|
NUMERIC |
'NUMERIC' |
STRING |
'STRING' |
DATE |
'DATE' |
Name | Type |
---|---|
NUMERIC |
'NUMERIC' |
STRING |
'STRING' |
DATE |
'DATE' |
How to use
Default configuration
The default configuration of column validity check values_in_set_percent
with min_count
rule.
If the parameter values_type
is not chosen, the
default data type for values provided from user is STRING
.
Table configuration in YAML file for column check values_in_set_percent
on a column countries
in table
table in test_data
dataset in dqo-ai
project on BigQuery and min_count
rule ,looks like this
values_list: ["PL","US","DE"]
the rendered query is
SELECT
CASE
WHEN COUNT(*) = 0 THEN NULL
ELSE
100.0 * SUM(
CASE
WHEN (analyzed_table.`countries` in ('PL', 'US', 'DE')) IS TRUE THEN 1
ELSE 0
END
)/COUNT(*)
END AS actual_value
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`table` AS analyzed_table
String to numeric
Table configuration in YAML file for column check values_in_set_percent
on a column mix_string_int
in table
table in
test_data
dataset in dqo-ai
project on BigQuery and min_count
rule, looks like this
values_type
as STRING
but values_list
has numeric values so sensor convert NUMERIC
values to STRING
.
The rendered query is
SELECT
CASE
WHEN COUNT(*) = 0 THEN NULL
ELSE
100.0 * SUM(
CASE
WHEN (analyzed_table.`mix_string_int` in ('1', '2', '3')) IS TRUE THEN 1
ELSE 0
END
)/COUNT(*)
END AS actual_value
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`table` AS analyzed_table
Numeric to numeric
Table configuration in YAML file for column check values_in_set_percent
on a column id
in table
table in
test_data
dataset in dqo-ai
project on BigQuery and min_count
rule, looks like this
values_type
is the same to values type into values_list
, so sensor tries cast column to provided type.
The rendered query is
SELECT
CASE
WHEN COUNT(*) = 0 THEN NULL
ELSE
100.0 * SUM(
CASE
WHEN (CAST(analyzed_table.`id` AS NUMERIC) in (1, 2, 3)) IS TRUE THEN 1
ELSE 0
END
)/COUNT(*)
END AS actual_value
FROM `dqo-ai-testing`.`dqo_ai_test_data`.`table` AS analyzed_table
Date to date
Table configuration in YAML file for column check values_in_set_percent
on a column date
in table
table in
test_data
dataset in dqo-ai
project on BigQuery and min_count
rule, looks like this
values_type
as a DATE
, so sensor cast provided values to DATE
, even then type is equal to DATE
.
The rendered query is