start_data_quality_ruleset_evaluation_run

Glue.Client.start_data_quality_ruleset_evaluation_run(**kwargs)

Once you have a ruleset definition (either recommended or your own), call this operation to evaluate the ruleset against a data source (a Glue table). The evaluation computes results that you can retrieve with the GetDataQualityResult API.

Request Syntax

response = client.start_data_quality_ruleset_evaluation_run(
    DataSource={
        'GlueTable': {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string',
            'AdditionalOptions': {
                'string': 'string'
            }
        },
        'DataQualityGlueTable': {
            'DatabaseName': 'string',
            'TableName': 'string',
            'CatalogId': 'string',
            'ConnectionName': 'string',
            'AdditionalOptions': {
                'string': 'string'
            },
            'PreProcessingQuery': 'string'
        }
    },
    Role='string',
    NumberOfWorkers=123,
    Timeout=123,
    ClientToken='string',
    AdditionalRunOptions={
        'CloudWatchMetricsEnabled': True|False,
        'ResultsS3Prefix': 'string',
        'CompositeRuleEvaluationMethod': 'COLUMN'|'ROW'
    },
    RulesetNames=[
        'string',
    ],
    AdditionalDataSources={
        'string': {
            'GlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                }
            },
            'DataQualityGlueTable': {
                'DatabaseName': 'string',
                'TableName': 'string',
                'CatalogId': 'string',
                'ConnectionName': 'string',
                'AdditionalOptions': {
                    'string': 'string'
                },
                'PreProcessingQuery': 'string'
            }
        }
    }
)

Parameters:

DataSource (dict) –
[REQUIRED]

The data source (Glue table) associated with this run.
- GlueTable (dict) –
  
  A Glue table.
  - DatabaseName (string) – [REQUIRED]
    
    A database name in the Glue Data Catalog.
  - TableName (string) – [REQUIRED]
    
    A table name in the Glue Data Catalog.
  - CatalogId (string) –
    
    A unique identifier for the Glue Data Catalog.
  - ConnectionName (string) –
    
    The name of the connection to the Glue Data Catalog.
  - AdditionalOptions (dict) –
    
    Additional options for the table. Currently there are two keys supported:
    - pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.
    - catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.
    - (string) –
      - (string) –
- DataQualityGlueTable (dict) –
  
  A Glue table for Data Quality operations.
  - DatabaseName (string) – [REQUIRED]
    
    A database name in the Glue Data Catalog.
  - TableName (string) – [REQUIRED]
    
    A table name in the Glue Data Catalog.
  - CatalogId (string) –
    
    A unique identifier for the Glue Data Catalog.
  - ConnectionName (string) –
    
    The name of the connection to the Glue Data Catalog.
  - AdditionalOptions (dict) –
    
    Additional options for the table. Currently there are two keys supported:
    - pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.
    - catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.
    - (string) –
      - (string) –
  - PreProcessingQuery (string) –
    
    A SQL query in SparkSQL format that can be used to pre-process the data for the table in the Glue Data Catalog before running the Data Quality operation.
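As a rough illustration of the DataSource structure described above, the following sketch builds the dictionary for a GlueTable entry with an optional pushDownPredicate. The helper name `build_glue_table_source`, the database and table names, and the predicate string are hypothetical placeholders; only the key names come from the parameter list.

```python
from typing import Dict, Optional


def build_glue_table_source(
    database: str,
    table: str,
    push_down_predicate: Optional[str] = None,
) -> Dict[str, dict]:
    """Build a DataSource dict using the GlueTable keys listed above."""
    glue_table: dict = {"DatabaseName": database, "TableName": table}
    if push_down_predicate is not None:
        # AdditionalOptions is a string-to-string map.
        glue_table["AdditionalOptions"] = {"pushDownPredicate": push_down_predicate}
    return {"GlueTable": glue_table}


# Hypothetical names and predicate, for illustration only.
data_source = build_glue_table_source("sales_db", "orders", "year = '2024'")
```

The resulting dictionary could be passed as the `DataSource` argument. The predicate assumes the table is partitioned on a column named `year`, which is an assumption made for this example.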
Role (string) –
[REQUIRED]

An IAM role supplied to encrypt the results of the run.
NumberOfWorkers (integer) – The number of G.1X workers to be used in the run. The default is 5.
Timeout (integer) – The timeout for a run in minutes. This is the maximum time that a run can consume resources before it is terminated and enters TIMEOUT status. The default is 2,880 minutes (48 hours).
ClientToken (string) – Used for idempotency. Setting it to a random ID (such as a UUID) is recommended to avoid creating or starting multiple instances of the same resource.
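One way to produce such a token is Python's standard `uuid` module. This is a minimal sketch; the helper name `new_client_token` is hypothetical, and reusing one token across retries of the same logical request is an assumption based on how idempotency tokens are commonly used, not something this page states.

```python
import uuid


def new_client_token() -> str:
    """Return a random UUID string that could serve as ClientToken."""
    return str(uuid.uuid4())


# Generate the token once per logical request and keep it for any retries
# (assumed usage pattern for an idempotency token).
token = new_client_token()
request_kwargs = {"ClientToken": token}
```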
AdditionalRunOptions (dict) –
Additional run options you can specify for an evaluation run.
- CloudWatchMetricsEnabled (boolean) –
  
  Whether or not to enable CloudWatch metrics.
- ResultsS3Prefix (string) –
  
  Prefix for Amazon S3 to store results.
- CompositeRuleEvaluationMethod (string) –
  
  The evaluation method for composite rules in the ruleset, either COLUMN or ROW.
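The options above can be assembled as in the following sketch, which also rejects values other than the two shown for CompositeRuleEvaluationMethod in the Request Syntax. The helper name `build_run_options` and the bucket and prefix are hypothetical, and the exact format expected for ResultsS3Prefix is an assumption here.

```python
from typing import Optional

# Values shown for CompositeRuleEvaluationMethod in the Request Syntax.
VALID_COMPOSITE_METHODS = ("COLUMN", "ROW")


def build_run_options(
    cloudwatch_metrics: bool = False,
    results_s3_prefix: Optional[str] = None,
    composite_method: Optional[str] = None,
) -> dict:
    """Assemble an AdditionalRunOptions dict, skipping unset optional keys."""
    options: dict = {"CloudWatchMetricsEnabled": cloudwatch_metrics}
    if results_s3_prefix is not None:
        options["ResultsS3Prefix"] = results_s3_prefix
    if composite_method is not None:
        if composite_method not in VALID_COMPOSITE_METHODS:
            raise ValueError(
                f"expected one of {VALID_COMPOSITE_METHODS}, got {composite_method!r}"
            )
        options["CompositeRuleEvaluationMethod"] = composite_method
    return options


# Hypothetical bucket and prefix, for illustration only.
run_options = build_run_options(True, "s3://example-bucket/dq-results/", "ROW")
```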
RulesetNames (list) –
[REQUIRED]

A list of ruleset names.
- (string) –
AdditionalDataSources (dict) –
A map of reference strings to additional data sources you can specify for an evaluation run.
- (string) –
  - (dict) –
    
    A data source (a Glue table) for which you want data quality results.
    - GlueTable (dict) –
      
      A Glue table.
      - DatabaseName (string) – [REQUIRED]
        
        A database name in the Glue Data Catalog.
      - TableName (string) – [REQUIRED]
        
        A table name in the Glue Data Catalog.
      - CatalogId (string) –
        
        A unique identifier for the Glue Data Catalog.
      - ConnectionName (string) –
        
        The name of the connection to the Glue Data Catalog.
      - AdditionalOptions (dict) –
        
        Additional options for the table. Currently there are two keys supported:
        - pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.
        - catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.
        - (string) –
          - (string) –
    - DataQualityGlueTable (dict) –
      
      A Glue table for Data Quality operations.
      - DatabaseName (string) – [REQUIRED]
        
        A database name in the Glue Data Catalog.
      - TableName (string) – [REQUIRED]
        
        A table name in the Glue Data Catalog.
      - CatalogId (string) –
        
        A unique identifier for the Glue Data Catalog.
      - ConnectionName (string) –
        
        The name of the connection to the Glue Data Catalog.
      - AdditionalOptions (dict) –
        
        Additional options for the table. Currently there are two keys supported:
        - pushDownPredicate: to filter on partitions without having to list and read all the files in your dataset.
        - catalogPartitionPredicate: to use server-side partition pruning using partition indexes in the Glue Data Catalog.
        - (string) –
          - (string) –
      - PreProcessingQuery (string) –
        
        A SQL query in SparkSQL format that can be used to pre-process the data for the table in the Glue Data Catalog before running the Data Quality operation.
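Because AdditionalDataSources is a map from reference strings to data source structures, it might be built as in this sketch. The reference strings (`customers_ref`, `products_ref`), the helper name `glue_table`, and the database and table names are all hypothetical.

```python
from typing import Dict


def glue_table(database: str, table: str) -> Dict[str, dict]:
    """Return a data source dict holding only the required GlueTable keys."""
    return {"GlueTable": {"DatabaseName": database, "TableName": table}}


# Each key is a reference string chosen by the caller (hypothetical here);
# each value has the same shape as the DataSource parameter.
additional_sources = {
    "customers_ref": glue_table("sales_db", "customers"),
    "products_ref": glue_table("sales_db", "products"),
}
```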

Return type:

dict

Returns:

Response Syntax

{
    'RunId': 'string'
}

Response Structure

(dict) –
- RunId (string) –
  
  The unique run identifier associated with this run.
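Putting the required parameters together, a call and its response handling might look like the sketch below. The wrapper `start_evaluation`, the stub client class, the role ARN, the ruleset name, and the run ID value are hypothetical; the stub stands in for a real Glue client so the example runs without AWS credentials.

```python
from typing import Any, List


def start_evaluation(
    glue_client: Any,
    database: str,
    table: str,
    role: str,
    ruleset_names: List[str],
) -> str:
    """Start an evaluation run with the required parameters and return RunId."""
    response = glue_client.start_data_quality_ruleset_evaluation_run(
        DataSource={"GlueTable": {"DatabaseName": database, "TableName": table}},
        Role=role,
        RulesetNames=ruleset_names,
    )
    return response["RunId"]


class StubGlueClient:
    """Offline stand-in that records the request and returns a fake RunId."""

    def start_data_quality_ruleset_evaluation_run(self, **kwargs):
        self.last_kwargs = kwargs
        return {"RunId": "dqrun-example"}


stub = StubGlueClient()
run_id = start_evaluation(
    stub,
    "sales_db",
    "orders",
    "arn:aws:iam::123456789012:role/ExampleGlueDataQualityRole",  # placeholder
    ["orders_ruleset"],
)
```

With real credentials, `boto3.client("glue")` would take the place of the stub object.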
