I'm trying to create an automated pipeline with AWS. I can get my CSV file into my S3 bucket, and that upload automatically triggers a Lambda function which kicks off my Glue job and passes it the CSV's location. The Glue job then turns the CSV into a DataFrame with PySpark. You cannot use psycopg2, pandas, or SQLAlchemy, or Glue will give an error saying the module doesn't exist. I have a Postgres RDS instance set up in AWS RDS. This is what I have so far.
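For reference, the Lambda trigger is roughly the following sketch (the Glue job name is a placeholder, and the VAL1/VAL2 argument names match the Glue script below):

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    # S3 put event: pull out the bucket and object key of the uploaded CSV
    record = event["Records"][0]["s3"]
    bucket_name = record["bucket"]["name"]
    file_name = record["object"]["key"]

    # Start the Glue job and hand it the file location as job arguments
    glue.start_job_run(
        JobName="my-glue-job",  # placeholder: actual Glue job name
        Arguments={
            "--VAL1": file_name,
            "--VAL2": bucket_name,
        },
    )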
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Job arguments passed in by the Lambda trigger (file name and bucket name)
args = getResolvedOptions(sys.argv, ["VAL1", "VAL2"])
file_name = args["VAL1"]
bucket_name = args["VAL2"]
file_path = "s3a://{}/{}".format(bucket_name, file_name)

# Read the CSV from S3 into a Spark DataFrame
df = spark.read.csv(file_path, sep=",", inferSchema=True, header=True)
df = df.drop("index")  # drop() returns a new DataFrame, so reassign it

url = "my rds endpoint link"
df.write...

I have tried almost a dozen solutions before asking on Stack Overflow, so any help would be amazing.
Have you tried the JDBC approach? Spark can write a DataFrame straight to Postgres through its built-in JDBC data source, so you don't need psycopg2, pandas, or SQLAlchemy at all. And yes, extra Python modules such as SQLAlchemy are not available in Glue by default; they have to be added/installed (e.g. via the --additional-python-modules job parameter).
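A minimal sketch of the JDBC write, assuming the Postgres JDBC driver available in Glue's Spark environment; the host, database, table name, and credentials are placeholders you'd replace with your own:

# Build the JDBC URL from the RDS endpoint (note the jdbc:postgresql:// form)
jdbc_url = "jdbc:postgresql://<rds-endpoint>:5432/<database>"

(df.write
   .format("jdbc")
   .option("url", jdbc_url)
   .option("dbtable", "my_table")           # target table (placeholder)
   .option("user", "postgres")              # placeholder credentials
   .option("password", "...")
   .option("driver", "org.postgresql.Driver")
   .mode("append")                          # or "overwrite", depending on the load
   .save())

Two things to watch: the URL has to be the JDBC form (jdbc:postgresql://host:port/dbname), not just the bare RDS endpoint, and the Glue job's VPC/security group settings must allow it to reach the RDS instance.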