I'm trying to create an automated pipeline with AWS. I can get my CSV file into my S3 bucket, and that upload automatically triggers a Lambda function which kicks off my Glue job and passes it the CSV's location. The Glue job then turns the CSV into a DataFrame with PySpark. You cannot use psycopg2, pandas, or SQLAlchemy, or Glue will give an error saying the module doesn't exist. I have a Postgres RDS instance set up in AWS RDS. This is what I have so far.
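For reference, the Lambda trigger is roughly the following sketch (the Glue job name is a placeholder, and the VAL1/VAL2 argument names match the Glue script below):

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 put event: pull out the bucket and object key of the uploaded CSV
    record = event["Records"][0]["s3"]
    bucket_name = record["bucket"]["name"]
    file_name = record["object"]["key"]

    # Start the Glue job and hand it the file location as job arguments
    glue.start_job_run(
        JobName="my-glue-job",  # placeholder: actual Glue job name
        Arguments={
            "--VAL1": file_name,
            "--VAL2": bucket_name,
        },
    )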
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Job arguments passed in by the Lambda trigger (file name and bucket name)
args = getResolvedOptions(sys.argv, ["VAL1", "VAL2"])
file_name = args["VAL1"]
bucket_name = args["VAL2"]
file_path = "s3a://{}/{}".format(bucket_name, file_name)

# Read the CSV from S3 into a Spark DataFrame
df = spark.read.csv(file_path, sep=",", inferSchema=True, header=True)
df = df.drop("index")  # drop() returns a new DataFrame, so reassign it

url = "my rds endpoint link"
df.write...

I have tried almost a dozen solutions before asking on Stack Overflow, so any help would be amazing.
Have you tried the JDBC approach? Spark can write a DataFrame straight to Postgres through its built-in JDBC data source, so you don't need psycopg2, pandas, or SQLAlchemy at all. And yes, extra Python modules such as SQLAlchemy are not available in Glue by default; they have to be added/installed (e.g. via the --additional-python-modules job parameter).
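A minimal sketch of the JDBC write, assuming the Postgres JDBC driver available in Glue's Spark environment; the host, database, table name, and credentials are placeholders you'd replace with your own:

# Build the JDBC URL from the RDS endpoint (note the jdbc:postgresql:// form)
jdbc_url = "jdbc:postgresql://<rds-endpoint>:5432/<database>"

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "my_table")           # target table (placeholder)
   .option("user", "postgres")              # placeholder credentials
   .option("password", "...")
   .option("driver", "org.postgresql.Driver")
   .mode("append")                          # or "overwrite", depending on the load
   .save())

Two things to watch: the URL has to be the JDBC form (jdbc:postgresql://host:port/dbname), not just the bare RDS endpoint, and the Glue job's VPC/security group settings must allow it to reach the RDS instance.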