I want to create a zip file of the S3 objects under a prefix and write that file to another bucket using a Python Glue job.

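This can be done using the “write_dynamic_frame.from_options” method [1] and specifying the compression setting in its “connection_options” (“compression” for an S3 target, as in the example below). [2] Note that the “gzip” value writes gzip-compressed output files (.gz), not a single .zip archive.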

An example of what this would look like: (This example was made in AWS Glue Studio under the Jobs section using “Visual with a source and target”)

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={"multiline": False},
    connection_type="s3",
    format="json",
    connection_options={
        "paths": [
            "s3://1st_bucket_name/prefix/prefix_2"
        ]
    },
    transformation_ctx="S3bucket_node1",
)

# Script generated for node S3 Bucket 2
S3Bucket2_node1664580455907 = glueContext.write_dynamic_frame.from_options(
    frame=S3bucket_node1,
    connection_type="s3",
    format="json",
    connection_options={
        "path": "s3://2nd_bucket_name/prefix/prefix_2/",
        "compression": "gzip",
        "partitionKeys": [],
    },
    transformation_ctx="S3Bucket2_node1664580455907",
)

job.commit()

If you want the compressed output written to the S3 bucket as a single file rather than as multiple compressed files, you can add .coalesce(1) to the script before the write operation, like so: [3]

S3bucket_node1 = S3bucket_node1.coalesce(1)

I hope this helps, don’t hesitate to reach out if you need any help or clarification.

AWS DOCUMENTATION:
[1] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader.html#aws-glue-api-crawler-pyspark-extensions-dynamic-frame-reader-from_options
[2] https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-s3
[3] https://sparkbyexamples.com/spark/spark-repartition-vs-coalesce/

Best regards,
John Paul P.
Amazon Web Services

======================

Additional notes:

On zipping the files: if zipping is all that is needed, the best option is to do it on an EC2 instance, where the customer can automate it with their own code. If the requirement is simply to zip files, is not triggered by any other service, and the customer wants it automated, they can write a script and run it with cron on Linux. Running this on Lambda or Glue has the following drawbacks:

By default, a Lambda function has only 512 MB of storage in /tmp (the directory we can write files to). You can use EFS or raise the function's ephemeral storage (currently up to 10 GB), but that is expensive. In addition, a Lambda function can run for at most 15 minutes, so downloading, compressing, and uploading a large volume of data will time out.
A Glue job overcomes Lambda's time limit, but it also incurs a large compute cost, because as I understand it a data-processing project rarely has small files. On top of that, the job still has to loop through every file (what if each file is hundreds of MB and there are hundreds of them?).
Finally, if the requirement is this simple, writing code that runs on Lambda or Glue takes time, whereas on an EC2 instance a single bash script may be enough: download with aws s3 cp, compress, and upload again with aws s3 cp.
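
The download, compress, upload flow described for the EC2 option can be sketched as follows. The notes suggest a bash script with aws s3 cp; this is a rough Python equivalent using boto3, given only as an illustration. The function name zip_prefix, the bucket names, and the archive key are all hypothetical, and the S3 client is passed in as a parameter so the logic can be exercised without AWS access.

```python
import os
import tempfile
import zipfile


def zip_prefix(s3_client, src_bucket, prefix, dst_bucket, dst_key):
    """Download every object under `prefix`, zip them on local disk,
    and upload the archive. Returns the number of objects archived."""
    count = 0
    with tempfile.TemporaryDirectory() as workdir:
        zip_path = os.path.join(workdir, "archive.zip")
        local_path = os.path.join(workdir, "download.tmp")
        paginator = s3_client.get_paginator("list_objects_v2")
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
            for page in paginator.paginate(Bucket=src_bucket, Prefix=prefix):
                for obj in page.get("Contents", []):
                    key = obj["Key"]
                    if key.endswith("/"):
                        continue  # skip "folder" placeholder keys
                    # One object at a time is kept on disk, then removed
                    s3_client.download_file(src_bucket, key, local_path)
                    archive.write(local_path, arcname=key)
                    os.remove(local_path)
                    count += 1
        # The archive is complete once the ZipFile block has exited
        s3_client.upload_file(zip_path, dst_bucket, dst_key)
    return count


def main():
    # boto3 is imported here so zip_prefix itself has no AWS dependency.
    import boto3

    # Hypothetical bucket names and keys, for illustration only.
    zip_prefix(boto3.client("s3"), "source-bucket", "prefix/",
               "target-bucket", "archives/prefix.zip")
```

Calling main() from a script scheduled with cron would cover the automation case mentioned above. Local disk on the instance needs room for the archive plus the largest single object.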

If the customer insists on using a Lambda function or a Glue job, they can consider using streams in Python or Node.js.

With Python, the code would look roughly like this (adapt it to add a handler; as written it is only a script that runs locally):

import boto3
import io
import zipfile
s3 = boto3.resource('s3')

def createZipFileStream(bucketName, bucketFilePath, jobKey, fileExt):
    response = {}
    bucket = s3.Bucket(bucketName)
    # Every object whose key starts with the given prefix
    filesCollection = bucket.objects.filter(Prefix=bucketFilePath)
    # The whole archive is built in memory
    archive = io.BytesIO()

    with zipfile.ZipFile(archive, 'w', zipfile.ZIP_DEFLATED) as zip_archive:
        for file in filesCollection:
            if file.key.endswith('.' + fileExt):
                # Each object's body is read in full and written into the archive
                with zip_archive.open(file.key, 'w') as file1:
                    file1.write(file.get()['Body'].read())

    # Rewind the buffer and upload the archive under the same prefix
    archive.seek(0)
    s3.Object(bucketName, bucketFilePath + '/' + jobKey + '.zip').upload_fileobj(archive)
    archive.close()

    return response

createZipFileStream("csv-parge", "demozip", "demo", "csv")
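
The script above still needs a handler before it can run on Lambda. A minimal sketch of one way to add it is shown below. The event keys ("bucket", "prefix", "job_key", "ext") and the make_handler helper are hypothetical; the real event shape depends on what triggers the function.

```python
def make_handler(zip_func):
    """Wrap a zip function with the signature
    (bucketName, bucketFilePath, jobKey, fileExt) in a Lambda-style handler."""

    def lambda_handler(event, context):
        # Hypothetical event keys; adjust to the actual trigger payload.
        bucket = event["bucket"]
        prefix = event["prefix"]
        job_key = event.get("job_key", "archive")
        ext = event.get("ext", "csv")

        zip_func(bucket, prefix, job_key, ext)

        # Mirrors the key the zip script uploads to: <prefix>/<job_key>.zip
        return {"statusCode": 200, "zipKey": f"{prefix}/{job_key}.zip"}

    return lambda_handler
```

In the Lambda module, the module-level call to createZipFileStream would be replaced by lambda_handler = make_handler(createZipFileStream), with the function's handler setting pointing at that name.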

However, with the zip script above there is a step that has to read each file's contents in order to write them into the archive:

with zip_archive.open(file.key, 'w') as file1:

For large files with a lot of data, this code will run for a very long time, which brings us back to exceeding Lambda's 15-minute limit, or to burning compute resources if it runs on Glue as described above.
With many files, it also has to loop many times.
Reading each file and then writing it is unavoidable, because S3 has no concept of files or folders, only objects and object keys. If you write only the key without reading the object, you get an entry with the right name but 0 KB (no content).

So this processing logic is not workable for large data.
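
One part of the problem can be softened: the script reads each object fully into memory and also builds the archive in memory. The sketch below, given only as an illustration, copies each object into the archive in fixed-size chunks and writes the archive to a temporary file instead. It does not reduce the running time, and on Lambda the temporary file still counts against /tmp storage. The function names are hypothetical, and `bucket` is assumed to be a boto3 s3.Bucket resource.

```python
import shutil
import tempfile
import zipfile


def zip_objects_streaming(objects, out_file, chunk_size=1024 * 1024):
    """Write (key, file-like body) pairs into a zip archive chunk by chunk.
    Returns the number of entries written."""
    count = 0
    with zipfile.ZipFile(out_file, "w", zipfile.ZIP_DEFLATED) as archive:
        for key, body in objects:
            # force_zip64 allows entries over 4 GiB whose size is not known up front
            with archive.open(key, "w", force_zip64=True) as entry:
                shutil.copyfileobj(body, entry, chunk_size)
            count += 1
    return count


def zip_prefix_streaming(bucket, prefix, zip_key):
    """Assumes `bucket` is a boto3 s3.Bucket resource. `zip_key` should sit
    outside `prefix`, otherwise a later run would archive the old zip too."""
    objects = (
        (obj.key, obj.get()["Body"])  # StreamingBody supports read(amt)
        for obj in bucket.objects.filter(Prefix=prefix)
    )
    with tempfile.TemporaryFile() as tmp:
        count = zip_objects_streaming(objects, tmp)
        tmp.seek(0)
        bucket.upload_fileobj(tmp, zip_key)
    return count
```

With this shape, peak memory is roughly one chunk plus the compressor's buffers, regardless of object size.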

If using Node.js, there are two hints to build on:

The following link only zips a set of files, not an entire directory. But if we use code to fetch the names and URIs of all the files under an S3 prefix (a "folder" in everyday terms), it still works. For Node.js I have the following notes and guidance:

Name the Lambda function file index.js, and note that it must have a handler in order to run: https://docs.aws.amazon.com/lambda/latest/dg/nodejs-handler.html
With Node.js, many libraries are not available on AWS Lambda, so you have to install the required libraries with npm install <package> and zip index.js together with the node modules. The drawback is that this produces very large zip files, and you will no longer be able to edit the code directly in the AWS Lambda console, so every code change means deploying a new zip file. Run npm install <package> on Linux: I once used Windows CMD and the deployed package did not run. You can use an EC2 instance to install the modules, upload the zip from the instance to an S3 bucket with aws s3 cp, and then specify the S3 URL in the "Upload from" section of the Lambda function, which saves setting up a more complicated virtual machine. Also, repo 1 uses xml-stream; npm install on Windows cannot install it unless Visual Studio is installed (and this does not work even with the Windows subsystem).
If you want to avoid a deployment package that is too large to edit in the console, see how to use AWS Lambda layers here: https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html
One more note: if you create only the index.js file, running npm install <package> may not produce the node modules files and folders. In that case, create a package.json file in the directory containing index.js with the following content:

[ec2-user@ip-172-31-92-240 code2]$ cat package.json
{
  "name": "without-layers",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "dependencies": {
    "archiver": "^5.3.1",
    "date-fns": "^2.24.0",
    "request": "^2.88.2"
  }
}
[ec2-user@ip-172-31-92-240 code2]$

Then run npm install <package> again and it will work.