
I'm trying to create an automated pipeline with AWS. I can get my CSV file into my S3 bucket, and that upload automatically triggers a Lambda function that starts my Glue job and passes it the file's location. The Glue job then turns the CSV into a DataFrame with PySpark. I cannot use psycopg2, pandas, or sqlalchemy, or else Glue gives an error saying the module doesn't exist. I have a Postgres instance set up in AWS RDS. This is what I have so far:

import sys
from awsglue.utils import getResolvedOptions  # getResolvedOptions was used below but never imported
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Job arguments passed in by the Lambda that starts this job
args = getResolvedOptions(sys.argv, ["VAL1", "VAL2"])
file_name = args["VAL1"]
bucket_name = args["VAL2"]

file_path = "s3a://{}/{}".format(bucket_name, file_name)
df = spark.read.csv(file_path, sep=",", inferSchema=True, header=True)
df = df.drop("index")  # drop() returns a new DataFrame, so reassign it

url = "my rds endpoint link"
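For context, a minimal sketch of a Lambda handler that would start this job and pass VAL1/VAL2 from the S3 event (the job name my-glue-job is a placeholder, not taken from the question):

import boto3

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Pull the bucket and object key out of the S3 event notification
    record = event['Records'][0]['s3']
    bucket_name = record['bucket']['name']
    file_name = record['object']['key']

    # Glue job arguments must be prefixed with '--' so that
    # getResolvedOptions can pick them up; 'my-glue-job' is a placeholder
    glue.start_job_run(
        JobName='my-glue-job',
        Arguments={'--VAL1': file_name, '--VAL2': bucket_name},
    )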

I have tried almost a dozen solutions before asking on Stack Overflow, so any help would be amazing.

  • Did you try a df.write... JDBC approach? And yes, some Python modules (e.g. sqlalchemy) have to be added/installed. Commented Apr 15, 2022 at 15:51
  • See this answer for installing additional Python modules: stackoverflow.com/a/71404169/3437504 Commented Apr 15, 2022 at 15:54
  • When I go to add job parameters, "--additional-python-modules" isn't there. Commented Apr 15, 2022 at 16:36
  • You need to add it in the Key box and then add sqlalchemy to the Value box. Commented Apr 15, 2022 at 16:40
  • I set the key to --additional-python-modules and the value to sqlalchemy==3.0, and I'm still getting the "no module found" error. Did I type something wrong? I also put datacompy==0.7.3 in the value slot, as the link you sent suggested, and got the same error. (See the sketch after these comments.) Commented Apr 15, 2022 at 17:02
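A likely cause of the failure in that last comment: sqlalchemy==3.0 is not a released version (the 1.4.x line was current in April 2022), so the pip install inside the Glue job would fail before the module is ever available. A minimal sketch of passing the parameter at job-start time instead, from a boto3 Lambda (the job name my-glue-job and the 1.4.36 pin are assumptions, not taken from the question):

import boto3

glue = boto3.client('glue')

# --additional-python-modules can also be supplied per run;
# 'my-glue-job' is a placeholder, and sqlalchemy==1.4.36 is an
# existing 1.4.x release (sqlalchemy==3.0 does not exist)
glue.start_job_run(
    JobName='my-glue-job',
    Arguments={'--additional-python-modules': 'sqlalchemy==1.4.36'},
)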

1 Answer


I have used this df.write approach before. Starting where you left off with your PySpark DataFrame:

jdbc_url = 'jdbc:postgresql://<instance_name>.xxxxxxxxx.us-west-2.rds.amazonaws.com:5432/<db_name>'

(df.write.format('jdbc')
         .option('url', jdbc_url)  # pass the variable, not the string literal 'jdbc_url'
         .option('user', 'myUsername')
         .option('password', 'myPassword')
         .option('dbtable', 'myTable')
         .option('driver', 'org.postgresql.Driver')
         .mode('append')
         .save())
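
Glue's Spark environment ships with the PostgreSQL JDBC driver, so this approach should not require installing any extra Python modules. To sanity-check the write, the same connector can read the table back (same placeholder URL, credentials, and table name as above):

check_df = (spark.read.format('jdbc')
                      .option('url', jdbc_url)
                      .option('user', 'myUsername')
                      .option('password', 'myPassword')
                      .option('dbtable', 'myTable')
                      .option('driver', 'org.postgresql.Driver')
                      .load())
check_df.show(5)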
