Glue Triggers

The problem:

Experimenting with Glue Triggers: https://swapnil-bhoite.medium.com/aws-glue-triggers-2a2098f148c0

I tried to have a crawler triggered to run when a job finishes with a success result.

Medium is sometimes blocked for me (I don't know whether it is my ISP or Medium blocking my IP; sometimes the page loads and sometimes it doesn't), so I copied the article above here for easier reading:

AWS Glue Triggers

Glue triggers are used to start one or more crawlers or extract, transform, and load (ETL) jobs. Using triggers, you can design a chain of dependent jobs and crawlers.

Currently, the AWS Glue console supports only jobs, not crawlers, when working with triggers. You can use the AWS CLI or the AWS Glue API to configure triggers for both jobs and crawlers [1]. The create-trigger CLI documentation [2] does not include any example of creating a trigger for a crawler. The intention of this article is to provide an example of creating a trigger for a crawler using the AWS CLI, since triggers for ETL jobs can be created from the console.

Below is an example CLI command to create a trigger for a crawler:

aws glue create-trigger --name testTrigger --type CONDITIONAL --predicate 'Logical=AND,Conditions=[{LogicalOperator=EQUALS,JobName=testJob,State=SUCCEEDED}]' --actions CrawlerName=testCrawler --start-on-creation

The above creates a trigger named ‘testTrigger’ that starts the crawler ‘testCrawler’ once ‘testJob’ has succeeded. Note that testJob itself must be started by a trigger; if it is started manually, testCrawler will not be fired by the trigger. In Glue, dependent jobs or crawlers are only started if the job or crawler that completes was started by a trigger. All jobs or crawlers in a dependency chain must be descendants of a single scheduled or on-demand trigger. This behavior is documented here [3].

The above trigger can also be created using the Python boto3 SDK:

import boto3

# Create a Glue client using the default credentials and region
client = boto3.client('glue')

# Conditional trigger: start testCrawler once testJob has succeeded
response = client.create_trigger(
    Name='testTrigger',
    Type='CONDITIONAL',
    Predicate={
        'Logical': 'AND',
        'Conditions': [
            {
                'LogicalOperator': 'EQUALS',
                'JobName': 'testJob',
                'State': 'SUCCEEDED'
            },
        ]
    },
    Actions=[
        {
            'CrawlerName': 'testCrawler'
        },
    ],
    StartOnCreation=True,
)

Either of the above approaches should help you create a trigger for a Glue crawler.
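After creating the trigger, it can help to confirm that it exists, is activated, and points at the crawler. The following is a minimal sketch, not part of the original article: `summarize_trigger` is a hypothetical helper name, and it assumes a boto3 Glue client is passed in. It relies only on the `get_trigger` call and the `Type`, `State`, and `Actions` fields of its response.

```python
def summarize_trigger(glue, trigger_name):
    """Return the trigger's type, state, and the jobs/crawlers it starts.

    `glue` is assumed to be a boto3 Glue client (boto3.client('glue')).
    `summarize_trigger` is a hypothetical helper, not an AWS API.
    """
    trigger = glue.get_trigger(Name=trigger_name)["Trigger"]
    actions = trigger.get("Actions", [])
    return {
        "type": trigger.get("Type"),
        "state": trigger.get("State"),
        # An action names either a job or a crawler
        "jobs": [a["JobName"] for a in actions if "JobName" in a],
        "crawlers": [a["CrawlerName"] for a in actions if "CrawlerName" in a],
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(summarize_trigger(boto3.client("glue"), "testTrigger"))
```

For the trigger created above with start-on-creation, the state should eventually read ACTIVATED and the crawler list should contain testCrawler.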

References

[1] https://docs.aws.amazon.com/glue/latest/dg/console-triggers.html

[2] https://docs.aws.amazon.com/cli/latest/reference/glue/create-trigger.html

[3] https://docs.aws.amazon.com/glue/latest/dg/about-triggers.html

The issue:

You may find that the crawler does not run even though the job has succeeded. If so, read on!

If your job was started manually (the job shows Job Succeeded but its [Triggered by] field is blank), the crawler does not start after the job succeeds.

For the crawler's trigger to fire, the job itself must be started by another trigger [1].
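The same check can be done from code instead of the console. This is a minimal sketch under assumptions: `trigger_of_latest_run` is a hypothetical helper name, it assumes the console's [Triggered by] field corresponds to the `TriggerName` field returned by `get_job_runs`, and it only inspects the first page of results.

```python
from typing import Optional


def trigger_of_latest_run(glue, job_name: str) -> Optional[str]:
    """Return the name of the trigger that started the most recent run.

    Returns None if the job has no runs or the latest run has no
    TriggerName (assumed here to mean it was started manually).
    `glue` is assumed to be a boto3 Glue client.
    """
    runs = glue.get_job_runs(JobName=job_name).get("JobRuns", [])
    if not runs:
        return None
    # Pick the most recent run explicitly rather than relying on ordering
    latest = max(runs, key=lambda run: run["StartedOn"])
    return latest.get("TriggerName") or None


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   print(trigger_of_latest_run(boto3.client("glue"), "testJob"))
```

If this returns None for a run that succeeded, the conditional trigger on the crawler is not expected to fire for that run.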

How to fix:

Step 1: Create the Glue job and the crawler.
Step 2: Create the trigger for the crawler. Run the job manually and confirm that the job succeeded but the crawler did not run, which reproduces the error you are seeing.
Step 3: Create a trigger for the job. After using that trigger to start the job, the crawler ran successfully.
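Step 3 above can be sketched with boto3 as follows. This is one possible approach, not necessarily what was done in the console: it creates an on-demand trigger for the job and then fires it. `start_job_via_trigger` and the trigger name `testJobTrigger` are hypothetical names chosen for illustration; `create_trigger` and `start_trigger` are the Glue API calls used.

```python
def start_job_via_trigger(glue, job_name, trigger_name):
    """Create an on-demand trigger for `job_name` and fire it once.

    `glue` is assumed to be a boto3 Glue client. Because the job run is
    started by a trigger, a conditional trigger that depends on the job
    (such as testTrigger) can then start its crawler.
    """
    # StartOnCreation is not supported for ON_DEMAND triggers,
    # so the trigger is created first and started explicitly.
    glue.create_trigger(
        Name=trigger_name,
        Type="ON_DEMAND",
        Actions=[{"JobName": job_name}],
    )
    glue.start_trigger(Name=trigger_name)


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   start_job_via_trigger(boto3.client("glue"), "testJob", "testJobTrigger")
```

A scheduled trigger for the job would work as the head of the chain too; an on-demand trigger is simply the closest equivalent to a manual run.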

References:
  [1] docs.aws.amazon.com/glue/latest/dg/about-tr..
  ----
  Jobs or crawlers that run as a result of other jobs or crawlers completing are referred to as dependent. Dependent jobs or crawlers are only started if the job or crawler that completes was started by a trigger. All jobs or crawlers in a dependency chain must be descendants of a single scheduled or on-demand trigger.
  ----