Handling duplicate values when loading data from S3 (Excel files) into RDS with ETL

The problem

There are many Excel files containing information about people who signed up for activities. Each activity is worth a different number of points, and the goal is to find out how many activities each person joined and compute that person's total score.

The difficulties are:

Each Excel file has different columns; the only ones they share are user_id and name
The files contain many accented Vietnamese characters
The data is updated frequently: each batch has about 20 files, and each file has about 2,000 rows. Computing the score in each file with Excel functions and then consolidating the results would be very time-consuming

The solution

Move the data from Excel into tables in RDS MySQL so that the calculations become easier.

Define the tables in the RDS MySQL database:

To compute the score, we need 3 tables: participants (the users), activity, and activity_tracking (to track which activities each user joined)

The DDL statements for the tables are as follows:

CREATE TABLE `activity` (
  `activity_id` VARCHAR(255) NOT NULL,
  `activity_name` VARCHAR(255) NOT NULL,
  `score` INT NOT NULL,
  `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`activity_id`)
);

CREATE TABLE `participants` (
  `user_id` VARCHAR(255) NOT NULL,
  `email` VARCHAR(255) NOT NULL,
  `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`user_id`)
);


CREATE TABLE `activity_tracking` (
  `id` INT NOT NULL AUTO_INCREMENT,
  `user_id` VARCHAR(255) NOT NULL,
  `activity_id` VARCHAR(255) NOT NULL,
  `activity_date` VARCHAR(255) NOT NULL,
  `created_at` TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  FOREIGN KEY (`activity_id`) REFERENCES activity(`activity_id`),
  FOREIGN KEY (`user_id`) REFERENCES participants(`user_id`)
);
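As a rough illustration of how these three tables yield a per-user score, the sketch below recreates a trimmed version of the schema in an in-memory SQLite database and sums the scores. SQLite stands in for RDS MySQL only so the example runs anywhere; the sample rows and the date format are made up.

```python
import sqlite3

# SQLite in memory stands in for RDS MySQL here (an assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activity (
    activity_id TEXT PRIMARY KEY,
    activity_name TEXT NOT NULL,
    score INTEGER NOT NULL
);
CREATE TABLE participants (user_id TEXT PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE activity_tracking (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL REFERENCES participants(user_id),
    activity_id TEXT NOT NULL REFERENCES activity(activity_id),
    activity_date TEXT NOT NULL
);
""")

# Made-up sample rows.
conn.executemany("INSERT INTO activity VALUES (?, ?, ?)",
                 [("A1", "Workshop", 10), ("A2", "Blog post", 25)])
conn.executemany("INSERT INTO participants VALUES (?, ?)",
                 [("u1", "u1@example.com"), ("u2", "u2@example.com")])
conn.executemany(
    "INSERT INTO activity_tracking (user_id, activity_id, activity_date) VALUES (?, ?, ?)",
    [("u1", "A1", "12-Aug-2023"), ("u1", "A2", "03-Sep-2023"), ("u2", "A1", "12-Aug-2023")],
)

# One row per user: number of activities joined and the summed score.
totals = conn.execute("""
    SELECT t.user_id, COUNT(*) AS activities, SUM(a.score) AS total_score
    FROM activity_tracking t
    JOIN activity a ON a.activity_id = t.activity_id
    GROUP BY t.user_id
    ORDER BY t.user_id
""").fetchall()
print(totals)
```

Every row in activity_tracking counts once, so any duplicated tracking row would inflate the total; that is why the loading steps have to keep duplicates out.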

Create the S3 bucket:

Create an S3 bucket with the following structure:

The stored data is partitioned by month and year, which makes it more convenient to work with

This bucket holds the data used as input for the Glue Crawler
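A small sketch of how object keys could be built for such a month and year layout. The `year=` / `month=` naming is an assumption (Glue crawlers read Hive-style prefixes like these as partition columns), and the prefix and file name are hypothetical.

```python
from datetime import date


def partition_key(prefix: str, day: date, filename: str) -> str:
    """Build an S3 object key partitioned by year and month (hypothetical layout)."""
    return f"{prefix}/year={day.year}/month={day.month:02d}/{filename}"


key = partition_key("csv", date(2023, 9, 15), "workshop.csv")
print(key)
```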

Because Excel's encoding is somewhat unusual, converting to CSV with Excel itself often garbles the characters. In addition, the raw data has many unnecessary fields, and invalid fields have to be removed, so I use Glue DataBrew for this. Once all the cleaning steps for the raw data are set up, publish the recipe so that DataBrew jobs can reuse it later. Here I create a job that transforms the data and exports it as CSV into the S3 bucket created above, ready for the next steps.

Once the CSV files exist, use a Glue Crawler to crawl them (crawling the Excel files directly is not possible, because the data is still messy and Glue Crawler does not support Excel files). The crawler then creates tables in the Glue database. You can adjust the schema if needed.

Next, use a Glue ETL job to move the data from the Glue database tables into RDS.

At this point, two problems appear:

1/ By default, the Glue ETL job reads the entire Glue table and writes it to RDS => this creates duplicate records

2/ In the database, user_id is the PK and therefore cannot be duplicated, but the data written from the Glue tables may contain duplicates: in August the person with user_id A joins one activity, and in September that same person may join another. Writing straight into the table makes RDS MySQL reject the insert with a duplicate-key error. Glue only supports upsert when the target is Redshift; it does not support it for RDS.

=> Solutions to these problems:

1/ To avoid duplicate data (in September the job would otherwise read the Glue table and write both the August and the September data into RDS), enable job bookmarks for the ETL job. With bookmarks enabled, the Glue job does not rescan data it has already processed. (On the Glue Crawler, I also set the behavior to crawl only newly created folders and not re-crawl old ones.)
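One way to turn bookmarks on for a run is the `--job-bookmark-option` job argument. The sketch below wraps boto3's `start_job_run` call; the helper name and the job name in the usage comment are hypothetical. It also assumes the ETL script itself calls `job.init()` and `job.commit()` and sets a `transformation_ctx` on its sources, since bookmarks rely on those to record state.

```python
def start_bookmarked_run(glue_client, job_name: str) -> str:
    """Start a Glue job run with job bookmarks enabled and return its run id."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--job-bookmark-option": "job-bookmark-enable"},
    )
    return response["JobRunId"]


# Usage (needs boto3 and AWS credentials; "participants-etl" is a made-up job name):
#   import boto3
#   run_id = start_bookmarked_run(boto3.client("glue"), "participants-etl")
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before pointing it at a real account.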

2/ To resolve the RDS MySQL error about a PK that already exists, do the following:

Instead of having the ETL job write directly into participants, create a tmp table with the same structure as participants but without user_id as the PK, and have the ETL job write into that table
Join the temp table with the main table; any record in the temp table whose user_id already exists in the main table is deleted from the temp table
Use INSERT INTO ... SELECT to write the remaining data from the temp table into the main table
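The three steps above can be sketched end to end with an in-memory SQLite database standing in for RDS MySQL (an assumption, so the example runs without a server; SQLite's DELETE syntax differs slightly from MySQL's multi-table DELETE). Collapsing repeated user_id values inside the same batch with GROUP BY is an extra safeguard added in this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for RDS MySQL (assumption)
conn.executescript("""
CREATE TABLE participants (user_id TEXT PRIMARY KEY, email TEXT NOT NULL);
CREATE TABLE participants_tmp (user_id TEXT NOT NULL, email TEXT NOT NULL);  -- no PK
INSERT INTO participants VALUES ('A', 'a@example.com');  -- loaded in August
INSERT INTO participants_tmp VALUES
    ('A', 'a@example.com'),   -- already in the main table
    ('B', 'b@example.com'),   -- new user
    ('B', 'b@example.com');   -- same new user, second activity in the batch
""")

# Step 2: drop staged rows whose user_id already exists in the main table.
conn.execute(
    "DELETE FROM participants_tmp WHERE user_id IN (SELECT user_id FROM participants)"
)

# Step 3: copy the rest, keeping one row per user_id.
conn.execute("""
    INSERT INTO participants (user_id, email)
    SELECT user_id, MIN(email) FROM participants_tmp GROUP BY user_id
""")
conn.commit()

rows = conn.execute("SELECT user_id, email FROM participants ORDER BY user_id").fetchall()
print(rows)
```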

Because there are many small tasks, I use Step Functions to orchestrate them, so that running the state machine performs all of the tasks above.

The Step Functions flow looks like this:

For tasks such as creating the temp table, running the join statements, and dropping the tmp table, I use Lambda functions instead of Glue jobs to save cost.
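A minimal sketch of what such a state machine definition might look like, written as a Python dict that serializes to Amazon States Language. The state names, job names, and function names are all hypothetical, and the ordering is only one plausible reading of the steps described in this post. The crawler step is left out because starting a crawler returns before the crawl finishes, so a real flow would need a polling loop there.

```python
import json


def task(resource, parameters, next_state=None):
    """Build one Amazon States Language Task state."""
    state = {"Type": "Task", "Resource": resource, "Parameters": parameters}
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


# All job and function names below are hypothetical placeholders.
definition = {
    "Comment": "Excel -> CSV -> staging table -> participants (sketch)",
    "StartAt": "CleanWithDataBrew",
    "States": {
        "CleanWithDataBrew": task(
            "arn:aws:states:::databrew:startJobRun.sync",
            {"Name": "excel-to-csv-job"}, "CreateTmpTable"),
        "CreateTmpTable": task(
            "arn:aws:states:::lambda:invoke",
            {"FunctionName": "create-tmp-table"}, "LoadWithGlue"),
        "LoadWithGlue": task(
            "arn:aws:states:::glue:startJobRun.sync",
            {"JobName": "csv-to-rds-job"}, "MergeAndDropTmp"),
        "MergeAndDropTmp": task(
            "arn:aws:states:::lambda:invoke",
            {"FunctionName": "merge-and-drop-tmp"}),
    },
}

print(json.dumps(definition, indent=2))
```

The `.sync` suffix on the DataBrew and Glue resources makes Step Functions wait for the job to finish before moving to the next state.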

Sample code for reference

Create the tmp table:

import pymysql
import pymysql.cursors

def lambda_handler(event, context):
    # Connect to the database using db details or fetch these from Glue connections
    connection = pymysql.connect(host='<rds-endpoint>',
                                 user='<rds-username>',
                                 password='<rds-password>',
                                 database='<rds-database>',
                                 cursorclass=pymysql.cursors.DictCursor)
    with connection:
        with connection.cursor() as cursor:
            # Create the staging table without a primary key so duplicate user_id rows can be loaded
            create_tmp_table_sql = "CREATE TABLE `participants_tmp` (user_id VARCHAR(255) NOT NULL, email VARCHAR(255) NOT NULL, created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP);"
            try: 
                cursor.execute(create_tmp_table_sql)
            except Exception as e:
                print(e)


        # connection is not autocommit by default. So you must commit to save
        # your changes.
        connection.commit()

Use code similar to the above, but run the following statements to remove the duplicate values and drop the temp table:

DELETE P FROM participants_tmp P INNER JOIN participants I ON P.user_id = I.user_id;

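-- Keep one row per user_id in case the same user appears more than once in the batch
INSERT INTO participants (user_id, email, created_at) SELECT user_id, MIN(email), MIN(created_at) FROM participants_tmp GROUP BY user_id;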

drop table if exists participants_tmp;

References:

https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html

Note: because the data contains Vietnamese text, a few RDS parameters need to be changed to avoid garbled characters:

character_set_client          utf8
character_set_connection      utf8
character_set_database        utf8
character_set_filesystem      utf8
character_set_results         utf8
character_set_server          utf8
collation_connection          utf8_general_ci
collation_server              utf8_general_ci
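In MySQL, `utf8` is an alias for `utf8mb3`, which stores up to 3 bytes per character. The quick check below, on a made-up sample name, suggests why that is enough for accented Vietnamese letters while a single-byte Western charset is not. It is only an illustration, not a test of the RDS settings themselves.

```python
import unicodedata

# Made-up sample name, normalized to precomposed (NFC) characters.
name = unicodedata.normalize("NFC", "Nguyễn Thị Hồng Ánh")

# Widest UTF-8 encoding among the characters in the sample.
max_bytes = max(len(ch.encode("utf-8")) for ch in name)
print(max_bytes)

# A single-byte Western charset cannot represent these characters at all.
try:
    name.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
print(latin1_ok)
```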

Notes for using PostgreSQL instead of MySQL:

At the time of writing, with PostgreSQL 14.5 it is not possible to connect from Glue using Glue's default JDBC driver; a custom JDBC driver is required. The Terraform documentation on using a custom driver with a Glue connection is rather thin, so I recommend setting the custom driver on the connection manually and using lifecycle ignore_changes on the connection properties so that re-running Terraform does not overwrite it.

The SQL statements in PostgreSQL are also slightly different:

DELETE FROM participants_tmp P USING participants I WHERE P.user_id = I.user_id;

INSERT INTO participants SELECT DISTINCT ON (user_id) * FROM participants_tmp;

Library used to connect: psycopg2. Note that this library builds a compiled CPython extension file, so it is affected by the Python version of the machine that built it!

Detailed usage: https://www.datacamp.com/tutorial/tutorial-postgresql-python

Install one of the two packages (psycopg2 builds from source, psycopg2-binary ships prebuilt wheels):

pip install psycopg2
pip install psycopg2-binary

import psycopg2

conn = psycopg2.connect(database = "datacamp_courses",
                        user = "datacamp", 
                        host= 'localhost',
                        password = "postgresql_tutorial",
                        port = 5432)

# Open a cursor to perform database operations
cur = conn.cursor()
# Execute a command: create datacamp_courses table
cur.execute("""CREATE TABLE datacamp_courses(
            course_id SERIAL PRIMARY KEY,
            course_name VARCHAR (50) UNIQUE NOT NULL,
            course_instructor VARCHAR (100) NOT NULL,
            topic VARCHAR (20) NOT NULL);
            """)
# Make the changes to the database persistent
conn.commit()
# Close cursor and communication with the database
cur.close()
conn.close()

About the custom driver:

With the default driver, you get an error like the following:

"ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [level] ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [logger] ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [msg] ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [n] ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern. ERROR StatusLogger Unrecognized conversion specifier [thread] starting at position 25 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [level] ERROR StatusLogger Unrecognized conversion specifier [level] starting at position 35 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [logger] ERROR StatusLogger Unrecognized conversion specifier [logger] starting at position 47 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [msg] ERROR StatusLogger Unrecognized conversion specifier [msg] starting at position 54 in conversion pattern. ERROR StatusLogger Unrecognized format specifier [n] ERROR StatusLogger Unrecognized conversion specifier [n] starting at position 56 in conversion pattern."

How to fix it:

Step 1. Download the latest JDBC driver and upload it to S3

Download the latest JDBC driver from https://jdbc.postgresql.org/download/. At the time of writing, I downloaded the latest JDBC driver, 42.6.0 (Java 8 version), whose filename is postgresql-42.6.0.jar.

Then I uploaded it to s3://my-bucket/aws-glue/postgresql-jdbc-driver/postgresql-42.6.0.jar

Step 2. Update the JDBC connection

Update your JDBC connection in AWS Glue -> Connectors with

JDBC Driver Class name: org.postgresql.Driver
JDBC Driver S3 Path: s3://my-bucket/aws-glue/postgresql-jdbc-driver/postgresql-42.6.0.jar

Note that “Test Connection” will still fail with this error log:

Caused by: com.amazonaws.services.glue.exceptions.InvalidInputException: Testing connections with custom drivers is not currently supported. Caused by: com.amazonaws.services.glue.exceptions.InvalidInputException: Testing connections with custom drivers is not currently supported.

Despite this error, the job and the crawler will run successfully
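For those who prefer scripting the change over clicking through the console, the two driver settings map to the `JDBC_DRIVER_CLASS_NAME` and `JDBC_DRIVER_JAR_URI` connection properties. The helper below is a hypothetical sketch that only prepares the property map; the connection name and the placeholder values are assumptions.

```python
def with_custom_driver(connection_properties: dict, jar_s3_uri: str) -> dict:
    """Return a copy of a Glue JDBC connection's properties with a custom driver set."""
    updated = dict(connection_properties)
    updated["JDBC_DRIVER_CLASS_NAME"] = "org.postgresql.Driver"
    updated["JDBC_DRIVER_JAR_URI"] = jar_s3_uri
    return updated


existing = {  # placeholder values, not real settings
    "JDBC_CONNECTION_URL": "jdbc:postgresql://<rds-endpoint>:5432/<rds-database>",
    "USERNAME": "<rds-username>",
}
props = with_custom_driver(
    existing,
    "s3://my-bucket/aws-glue/postgresql-jdbc-driver/postgresql-42.6.0.jar",
)
print(props)

# To apply it (needs boto3 and AWS credentials; "rds-postgres" is a made-up name).
# update_connection replaces the whole definition, so also carry over the existing
# PhysicalConnectionRequirements (subnet, security groups) when the connection has them:
#   import boto3
#   boto3.client("glue").update_connection(
#       Name="rds-postgres",
#       ConnectionInput={"Name": "rds-postgres", "ConnectionType": "JDBC",
#                        "ConnectionProperties": props})
```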

Ref: https://repost.aws/questions/QUpkrhcfkYQtS2adbjpQ7quQ/cannot-connect-from-glue-to-rds-postgres

Additional notes

Handling duplicated records:

===
If duplicates exist:

Find duplicate records (row count per user, activity, and date):

SELECT 
    user_id,
    activity_id,
    activity_date,
    COUNT(*) AS "Count"
FROM activity_tracking
GROUP BY 
    user_id,
    activity_id,
    activity_date
ORDER BY activity_id;

Ordered by date, showing only the duplicated records:
SELECT 
    user_id,
    activity_id,
    activity_date,
    COUNT(*) AS "Count"
FROM activity_tracking
GROUP BY 
    user_id,
    activity_id,
    activity_date
HAVING COUNT(*) > 1
ORDER BY activity_date;

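Delete the duplicated records, keeping the row with the smallest id in each group (this form runs on PostgreSQL; MySQL rejects a subquery that selects from the table being deleted from, so there the inner SELECT has to be wrapped in a derived table):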

DELETE FROM activity_tracking
WHERE id NOT IN
(
SELECT MIN(id)
FROM activity_tracking
GROUP BY user_id, activity_id, activity_date
);
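To see what this DELETE keeps and removes, here is a small self-contained run against an in-memory SQLite database (a stand-in chosen so the example needs no server; the rows are made up).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE activity_tracking (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id TEXT NOT NULL,
    activity_id TEXT NOT NULL,
    activity_date TEXT NOT NULL)""")
conn.executemany(
    "INSERT INTO activity_tracking (user_id, activity_id, activity_date) VALUES (?, ?, ?)",
    [("u1", "A1", "12-Oct-2023"),   # id 1, kept
     ("u1", "A1", "12-Oct-2023"),   # id 2, duplicate of id 1
     ("u2", "A1", "12-Oct-2023"),   # id 3, kept
     ("u1", "A1", "12-Oct-2023")],  # id 4, duplicate of id 1
)

# Keep only the smallest id in each (user, activity, date) group.
deleted = conn.execute("""
    DELETE FROM activity_tracking
    WHERE id NOT IN (
        SELECT MIN(id) FROM activity_tracking
        GROUP BY user_id, activity_id, activity_date)
""").rowcount
remaining = [row[0] for row in conn.execute("SELECT id FROM activity_tracking ORDER BY id")]
print(deleted, remaining)
```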

====

View participants' contributions:

View all:
SELECT distinct activity_tracking.user_id, activity_tracking.activity_date, activity.activity_id, activity.score * COUNT(activity_tracking.activity_id) as point FROM activity_tracking LEFT JOIN activity on activity.activity_id = activity_tracking.activity_id GROUP BY user_id, activity.activity_id, activity_tracking.activity_date order by user_id;

======
By month:
For example, October:
SELECT distinct activity_tracking.user_id, activity_tracking.activity_date, activity.activity_id, activity.score * COUNT(activity_tracking.activity_id) as point                  
FROM activity_tracking LEFT JOIN activity on activity.activity_id = activity_tracking.activity_id where activity_tracking.activity_date like '%Oct%'
GROUP BY user_id, activity.activity_id, activity_tracking.activity_date
order by user_id;

Exporting query results to a CSV file in PostgreSQL:

====
Export to a CSV file:

\copy (SELECT distinct activity_tracking.user_id, activity_tracking.activity_date, activity.activity_id, activity.score * COUNT(activity_tracking.activity_id) as point FROM activity_tracking LEFT JOIN activity on activity.activity_id = activity_tracking.activity_id GROUP BY user_id, activity.activity_id, activity_tracking.activity_date order by user_id) to '/home/ec2-user/environment/leaderboard.csv' with CSV DELIMITER ',' HEADER

===

Note when exporting to CSV: the whole \copy command, query included, must be written on a single line, because psql meta-commands end at the first newline; a line break causes an error.
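If the export is driven from Python instead of psql, the single-line restriction does not apply, since the query is passed as an ordinary string. The sketch below uses psycopg2's `cursor.copy_expert` with `COPY ... TO STDOUT`; the helper name and output path are hypothetical, and the cursor is assumed to come from an open psycopg2 connection.

```python
LEADERBOARD_SQL = """
    SELECT DISTINCT activity_tracking.user_id, activity_tracking.activity_date,
           activity.activity_id,
           activity.score * COUNT(activity_tracking.activity_id) AS point
    FROM activity_tracking
    LEFT JOIN activity ON activity.activity_id = activity_tracking.activity_id
    GROUP BY user_id, activity.activity_id, activity_tracking.activity_date
    ORDER BY user_id
"""


def export_csv(cursor, path: str) -> str:
    """Stream the leaderboard query to a CSV file through COPY ... TO STDOUT."""
    copy_sql = f"COPY ({LEADERBOARD_SQL}) TO STDOUT WITH CSV HEADER"
    with open(path, "w", encoding="utf-8", newline="") as out:
        cursor.copy_expert(copy_sql, out)
    return copy_sql


# Usage (needs psycopg2 and a reachable database):
#   with conn.cursor() as cur:
#       export_csv(cur, "leaderboard.csv")
```

Like \copy, this writes the file on the client side, so no file permissions are needed on the database server.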

Note:

General SQL syntax for deleting duplicate records:

DELETE FROM Table
WHERE ID NOT IN
(
SELECT MIN(ID)
FROM Table
GROUP BY Field1, Field2, Field3, ...
)

Ref: https://database.guide/4-ways-to-select-duplicate-rows-in-postgresql/

https://stackoverflow.com/questions/6025367/t-sql-deleting-all-duplicate-rows-but-keeping-one