create_dataset

GlueDataBrew.Client.create_dataset(**kwargs)

Creates a new DataBrew dataset.

See also: AWS API Documentation

Request Syntax

response = client.create_dataset(
    Name='string',
    Format='CSV'|'JSON'|'PARQUET'|'EXCEL'|'ORC',
    FormatOptions={
        'Json': {
            'MultiLine': True|False
        },
        'Excel': {
            'SheetNames': [
                'string',
            ],
            'SheetIndexes': [
                123,
            ],
            'HeaderRow': True|False
        },
        'Csv': {
            'Delimiter': 'string',
            'HeaderRow': True|False
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'string',
            'Key': 'string',
            'BucketOwner': 'string'
        },
        'DataCatalogInputDefinition': {
            'CatalogId': 'string',
            'DatabaseName': 'string',
            'TableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            }
        },
        'DatabaseInputDefinition': {
            'GlueConnectionName': 'string',
            'DatabaseTableName': 'string',
            'TempDirectory': {
                'Bucket': 'string',
                'Key': 'string',
                'BucketOwner': 'string'
            },
            'QueryString': 'string'
        },
        'Metadata': {
            'SourceArn': 'string'
        }
    },
    PathOptions={
        'LastModifiedDateCondition': {
            'Expression': 'string',
            'ValuesMap': {
                'string': 'string'
            }
        },
        'FilesLimit': {
            'MaxFiles': 123,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'|'ASCENDING'
        },
        'Parameters': {
            'string': {
                'Name': 'string',
                'Type': 'Datetime'|'Number'|'String',
                'DatetimeOptions': {
                    'Format': 'string',
                    'TimezoneOffset': 'string',
                    'LocaleCode': 'string'
                },
                'CreateColumn': True|False,
                'Filter': {
                    'Expression': 'string',
                    'ValuesMap': {
                        'string': 'string'
                    }
                }
            }
        }
    },
    Tags={
        'string': 'string'
    }
)
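
Only Name and Input are required; every other field in the request syntax is optional. As a minimal sketch, the following creates a dataset from a single CSV object in Amazon S3. The bucket, key, and dataset name are placeholders chosen for illustration, not values prescribed by the API:

import boto3

databrew = boto3.client('databrew')

# Minimal sketch: a CSV dataset backed by one S3 object.
# 'my-databrew-bucket' and 'raw/orders.csv' are placeholders.
response = databrew.create_dataset(
    Name='orders-dataset',
    Format='CSV',
    FormatOptions={
        'Csv': {
            'Delimiter': ',',
            'HeaderRow': True  # the first row holds the column names
        }
    },
    Input={
        'S3InputDefinition': {
            'Bucket': 'my-databrew-bucket',
            'Key': 'raw/orders.csv'
        }
    }
)
print(response['Name'])  # 'orders-dataset'
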
Parameters:
  • Name (string) –

    [REQUIRED]

    The name of the dataset to be created. Valid characters are alphanumeric (A-Z, a-z, 0-9), hyphen (-), period (.), and space.

  • Format (string) – The file format of a dataset that is created from an Amazon S3 file or folder.

  • FormatOptions (dict) –

    Represents a set of options that define the structure of either comma-separated value (CSV), Excel, or JSON input.

    • Json (dict) –

      Options that define how JSON input is to be interpreted by DataBrew.

      • MultiLine (boolean) –

        A value that specifies whether JSON input contains embedded new line characters.

    • Excel (dict) –

      Options that define how Excel input is to be interpreted by DataBrew.

      • SheetNames (list) –

        One or more named sheets in the Excel file that will be included in the dataset.

        • (string) –

      • SheetIndexes (list) –

        One or more sheet numbers in the Excel file that will be included in the dataset.

        • (integer) –

      • HeaderRow (boolean) –

        A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

    • Csv (dict) –

      Options that define how CSV input is to be interpreted by DataBrew.

      • Delimiter (string) –

        A single character that specifies the delimiter being used in the CSV file.

      • HeaderRow (boolean) –

        A variable that specifies whether the first row in the file is parsed as the header. If this value is false, column names are auto-generated.

  • Input (dict) –

    [REQUIRED]

    Represents information on how DataBrew can find data: in the Glue Data Catalog, in Amazon S3, or in a database reached through a Glue connection.

    • S3InputDefinition (dict) –

      The Amazon S3 location where the data is stored.

      • Bucket (string) – [REQUIRED]

        The Amazon S3 bucket name.

      • Key (string) –

        The unique name of the object in the bucket.

      • BucketOwner (string) –

        The Amazon Web Services account ID of the bucket owner.

    • DataCatalogInputDefinition (dict) –

      The Glue Data Catalog parameters for the data.

      • CatalogId (string) –

        The unique identifier of the Amazon Web Services account that holds the Data Catalog that stores the data.

      • DatabaseName (string) – [REQUIRED]

        The name of a database in the Data Catalog.

      • TableName (string) – [REQUIRED]

        The name of a database table in the Data Catalog. This table corresponds to a DataBrew dataset.

      • TempDirectory (dict) –

        Represents an Amazon S3 location where DataBrew can store intermediate results.

        • Bucket (string) – [REQUIRED]

          The Amazon S3 bucket name.

        • Key (string) –

          The unique name of the object in the bucket.

        • BucketOwner (string) –

          The Amazon Web Services account ID of the bucket owner.

    • DatabaseInputDefinition (dict) –

      Connection information for dataset input files stored in a database.

      • GlueConnectionName (string) – [REQUIRED]

        The Glue Connection that stores the connection information for the target database.

      • DatabaseTableName (string) –

        The table within the target database.

      • TempDirectory (dict) –

        Represents an Amazon S3 location (bucket name, bucket owner, and object key) where DataBrew can read input data, or write output from a job.

        • Bucket (string) – [REQUIRED]

          The Amazon S3 bucket name.

        • Key (string) –

          The unique name of the object in the bucket.

        • BucketOwner (string) –

          The Amazon Web Services account ID of the bucket owner.

      • QueryString (string) –

        Custom SQL to run against the provided Glue connection. This SQL will be used as the input for DataBrew projects and jobs.

    • Metadata (dict) –

      Contains additional resource information needed for specific datasets.

      • SourceArn (string) –

        The Amazon Resource Name (ARN) associated with the dataset. Currently, DataBrew only supports ARNs from Amazon AppFlow.

  • PathOptions (dict) –

    A set of options that defines how DataBrew interprets the Amazon S3 path of the dataset. (A usage sketch illustrating these options follows this parameter list.)

    • LastModifiedDateCondition (dict) –

      If provided, this structure defines a date range for matching Amazon S3 objects based on their LastModifiedDate attribute in Amazon S3.

      • Expression (string) – [REQUIRED]

        The expression, which consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables must start with the ':' symbol.

      • ValuesMap (dict) – [REQUIRED]

        The map of substitution variable names to their values used in this filter expression.

        • (string) –

          • (string) –

    • FilesLimit (dict) –

      If provided, this structure imposes a limit on the number of files that are selected.

      • MaxFiles (integer) – [REQUIRED]

        The number of Amazon S3 files to select.

      • OrderedBy (string) –

        The criterion used to sort the Amazon S3 files before they are selected. The default is LAST_MODIFIED_DATE, which is currently the only allowed value.

      • Order (string) –

        The sort order applied to the Amazon S3 files before they are selected. The default is DESCENDING, meaning that the most recently modified files are selected first. The other possible value is ASCENDING.

    • Parameters (dict) –

      A structure that maps names of parameters used in the Amazon S3 path of a dataset to their definitions.

      • (string) –

        • (dict) –

          Represents a dataset parameter that defines the type and conditions for a parameter in the Amazon S3 path of the dataset.

          • Name (string) – [REQUIRED]

            The name of the parameter that is used in the dataset’s Amazon S3 path.

          • Type (string) – [REQUIRED]

            The type of the dataset parameter. Can be one of 'String', 'Number', or 'Datetime'.

          • DatetimeOptions (dict) –

            Additional parameter options such as a format and a timezone. Required for datetime parameters.

            • Format (string) – [REQUIRED]

              Required option that defines the datetime format used for a date parameter in the Amazon S3 path. It should use only supported datetime specifiers and separation characters; all literal a-z or A-Z characters should be escaped with single quotes, for example "MM.dd.yyyy-'at'-HH:mm".

            • TimezoneOffset (string) –

              Optional timezone offset for the datetime parameter value in the Amazon S3 path. It shouldn't be used if the Format for this parameter includes timezone fields. If no offset is specified, UTC is assumed.

            • LocaleCode (string) –

              Optional value for a non-US locale code, needed for correct interpretation of some date formats.

          • CreateColumn (boolean) –

            Optional boolean value that defines whether the captured value of this parameter should be used to create a new column in a dataset.

          • Filter (dict) –

            The optional filter expression structure to apply additional matching criteria to the parameter.

            • Expression (string) – [REQUIRED]

              The expression, which consists of condition names followed by substitution variables, possibly grouped and combined with other conditions. For example, "(starts_with :prefix1 or starts_with :prefix2) and (ends_with :suffix1 or ends_with :suffix2)". Substitution variables must start with the ':' symbol.

            • ValuesMap (dict) – [REQUIRED]

              The map of substitution variable names to their values used in this filter expression.

              • (string) –

                • (string) –

  • Tags (dict) –

    Metadata tags to apply to this dataset.

    • (string) –

      • (string) –
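
As a sketch of the PathOptions described above, the following defines a hypothetical dynamic dataset: the {capture_date} placeholder in the S3 key is declared as a Datetime path parameter, and FilesLimit keeps only the single most recently modified match. The bucket, key pattern, and datetime format string are assumptions chosen for illustration:

import boto3

databrew = boto3.client('databrew')

# Hypothetical dynamic dataset: '{capture_date}' in the Key is resolved
# through the matching entry in PathOptions['Parameters'].
response = databrew.create_dataset(
    Name='daily-events',
    Format='JSON',
    FormatOptions={'Json': {'MultiLine': False}},
    Input={
        'S3InputDefinition': {
            'Bucket': 'my-databrew-bucket',            # placeholder
            'Key': 'events/{capture_date}/part.json'   # placeholder pattern
        }
    },
    PathOptions={
        'FilesLimit': {
            'MaxFiles': 1,
            'OrderedBy': 'LAST_MODIFIED_DATE',
            'Order': 'DESCENDING'   # most recent file first
        },
        'Parameters': {
            'capture_date': {
                'Name': 'capture_date',
                'Type': 'Datetime',
                'DatetimeOptions': {
                    'Format': 'yyyy-MM-dd'   # no offset given, so UTC is assumed
                },
                'CreateColumn': True   # surface the captured date as a column
            }
        }
    }
)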

Return type:

dict

Returns:

Response Syntax

{
    'Name': 'string'
}

Response Structure

  • (dict) –

    • Name (string) –

      The name of the dataset that you created.
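
Putting the request and response together, this sketch creates a dataset backed by a Glue Data Catalog table, prints the returned Name, and catches the botocore ClientError raised if, for example, a dataset with that name already exists. The database, table, and temp-directory values are placeholders:

import boto3
from botocore.exceptions import ClientError

databrew = boto3.client('databrew')

try:
    response = databrew.create_dataset(
        Name='sales-catalog-dataset',
        Input={
            'DataCatalogInputDefinition': {
                'DatabaseName': 'sales_db',       # placeholder
                'TableName': 'transactions',      # placeholder
                'TempDirectory': {
                    'Bucket': 'my-databrew-temp-bucket',
                    'Key': 'databrew/tmp/'
                }
            }
        }
    )
    print('Created dataset:', response['Name'])
except ClientError as err:
    # e.g. the name is already taken or fails validation
    print('create_dataset failed:', err.response['Error']['Code'])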

Exceptions