All Questions
Tagged with amazon-redshift pyspark
122 questions
-1 votes · 2 answers · 30 views
Not able to create a redshift table using glue
Getting error: exception: java.sql.SQLException: Exception thrown in awaitResult:
The table flight_details is getting created, but it contains only one column, dummy, instead of the schema defined in the create ...
1 vote · 1 answer · 32 views
Getting an error when creating a Redshift table from Glue
The error output in the log is:
Failed to create table: An error occurred while calling o105.getSink.
: java.lang.RuntimeException: Temporary directory for redshift not specified. Please verify --...
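This error means the Glue Redshift sink was given no S3 staging location; Redshift reads and writes in Glue go through S3 first. A minimal sketch of supplying one via the connection options; the endpoint, credentials, table, and bucket names below are placeholders, not from the question:

```python
# Hypothetical Glue connection options for a Redshift write. The key fix
# is redshift_tmp_dir, often populated from the job's --TempDir argument.
connection_options = {
    "url": "jdbc:redshift://my-cluster.example:5439/dev",  # placeholder
    "dbtable": "public.flight_details",                    # placeholder
    "user": "awsuser",                                     # placeholder
    "password": "...",
    "redshift_tmp_dir": "s3://my-temp-bucket/redshift-staging/",
}
# In the job itself this dict would be passed to something like:
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf,
#     connection_type="redshift",
#     connection_options=connection_options)
```

With getSink-style code, the same S3 path goes in as redshift_tmp_dir, typically taken from args["TempDir"].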
4 votes · 0 answers · 208 views
Getting : "An error occurred while calling o110.pyWriteDynamicFrame. Exception thrown in awaitResult:" in AWS Glue Job
I am getting "An error occurred while calling o110.pyWriteDynamicFrame. Exception thrown in awaitResult:" in an AWS Glue job.
The size of my source data in s3 is around 60 GB.
I am reading ...
0 votes · 0 answers · 57 views
AWS Redshift parallel query issue in Glue script
I have created a Glue script that is supposed to read data from Redshift. The code works perfectly without hash partitions, but as soon as I try to run parallel queries it throws an error like ...
2 votes · 3 answers · 330 views
How to resolve the following AWS Glue error while writing to Redshift using Spark: "ORA-01722: invalid number"?
I am trying to read from an Oracle database and write to a Redshift table using PySpark.
# Reading data from Oracle
oracle_df = spark.read \
    .format("jdbc") \
    .option("url",...
0 votes · 1 answer · 192 views
Error while reading data from databricks jdbc connection to redshift
We use a Databricks cluster that is shut down after 30 minutes of inactivity (13.3 LTS, which includes Apache Spark 3.4.1 and Scala 2.12).
My objective is to read a Redshift table and write it to Snowflake. I am ...
0 votes · 1 answer · 96 views
Calculate running sum in Spark SQL
I am working on logic where I need to calculate totalscan, last5dayscan, and month2dayscan from a dailyscan count. Today I sum the dailyscan count daily, but the data volume is now making it tough to ...
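Running totals like these are usually computed with window functions rather than re-summing each day. A minimal sketch of what such a Spark SQL query might look like; the table and column names (daily_scans, scan_date, dailyscan) are assumptions, not from the question:

```python
# Hypothetical Spark SQL: cumulative and trailing-window sums over a
# daily scan table. Column/table names are placeholders.
running_sum_sql = """
SELECT scan_date,
       dailyscan,
       SUM(dailyscan) OVER (ORDER BY scan_date
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND CURRENT ROW) AS totalscan,
       SUM(dailyscan) OVER (ORDER BY scan_date
                            ROWS BETWEEN 4 PRECEDING
                                     AND CURRENT ROW) AS last5dayscan
FROM daily_scans
"""
# In a Spark session this would be evaluated with spark.sql(running_sum_sql).
```

The 4 PRECEDING frame plus the current row covers five days, which is one common reading of "last5dayscan".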
0 votes · 1 answer · 167 views
redshift spectrum type conversion from String to Varchar
When I scan the data from S3 using a Glue crawler I get this schema:
{id: integer, value: String}
This is because Spark writes data back in String type and not varchar type. Although there is a ...
0 votes · 0 answers · 217 views
Incremental parquet files
I've got data from an on-prem system that is being loaded into an S3 bucket. The data will be transformed using EMR/PySpark, with a surrogate key added during this process. The parquet files will then get ...
0 votes · 1 answer · 577 views
Parse XML column in redshift table
I have a table in Redshift with an XML column. I am looking for options to parse this XML and extract some values as columns in another Redshift table.
sample xml - ('<person><name>John</...
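Redshift has no native XML functions, so this kind of parsing is typically done outside the database, e.g. in a PySpark UDF. A minimal sketch using the Python standard library; the sample value in the question is truncated, so the completed document below (the age element in particular) is a hypothetical assumption:

```python
import xml.etree.ElementTree as ET

# Hypothetical completion of the truncated sample from the question.
sample = "<person><name>John</name><age>30</age></person>"

root = ET.fromstring(sample)
name = root.findtext("name")       # -> "John"
age = int(root.findtext("age"))    # -> 30
```

Wrapped in a UDF, a function like this can turn the XML column into ordinary columns before the result is written back to Redshift.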
1 vote · 0 answers · 330 views
PySpark: java.sql.SQLException: HOUR_OF_DAY: 0 -> 1
I'm trying to read a table from a MySQL database and then upload the data into Redshift on AWS.
The problem is that when I try to write the rows into Redshift, I get the error:
...
0 votes · 0 answers · 132 views
nested json column parsing in redshift
I have the following Redshift table column and want to parse each key-value pair into separate columns/rows in the table.
{"names":[{"name":"bob","months":8,"...
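Flattening a nested JSON value like this is straightforward with Python's json module (or, at scale, with PySpark's from_json and explode). A small sketch; the column value in the question is truncated, so the completed payload below is a hypothetical assumption:

```python
import json

# Hypothetical completion of the truncated column value from the question.
raw = '{"names":[{"name":"bob","months":8},{"name":"sue","months":3}]}'

# One (name, months) row per array entry.
rows = [(entry["name"], entry["months"]) for entry in json.loads(raw)["names"]]
```

Each tuple in rows corresponds to one row to insert into the target table.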
1 vote · 0 answers · 647 views
Spark job fails in AWS Glue | "An error occurred while calling o86.getSink. The connection attempt failed."
I attempted to migrate data from a CSV file in S3 storage to a table in my Redshift cluster. I took as a reference the autogenerated code produced after I built blocks using visual mode in AWS ...
0 votes · 1 answer · 209 views
Redshift interpreting boolean data type as bit, so unable to move a Hudi table from S3 to Redshift if there is any boolean column
I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. Data is stored in the HUDI format in S3 in parquet files. I've created the Pyspark script for full load transfer and ...
1 vote · 1 answer · 2k views
Pass parameters to query with spark.read.format(jdbc) format
I am trying to execute the following sample query through spark.read.format("jdbc") against Redshift:
query="select * from tableA join tableB on a.id = b.id where a.date > ? and b.date > ? ...
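Spark's JDBC data source does not support `?` placeholders, so parameter values are usually interpolated into the query string before it is pushed down. A sketch under that assumption; the date values and the redshift_jdbc_url name are placeholders, not from the question:

```python
# Build the pushed-down query with the parameter values inlined
# (dates below are example placeholders).
date_a, date_b = "2021-01-01", "2021-06-01"
query = (
    "select * from tableA a join tableB b on a.id = b.id "
    f"where a.date > '{date_a}' and b.date > '{date_b}'"
)
# In a Spark session this would be read with the `query` option:
# df = (spark.read.format("jdbc")
#       .option("url", redshift_jdbc_url)   # placeholder URL variable
#       .option("query", query)
#       .load())
```

Because the values are interpolated as text, they should come from trusted code, not raw user input.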
0 votes · 0 answers · 219 views
Redshift table insert with identity column getting table not found error
I have created a table in Redshift and inserted data as follows:
create table schema_name.employee (
    surrogate_key bigint IDENTITY(1,1),
    first_name varchar(200),
    last_name varchar(200),
    phone_number ...
0 votes · 0 answers · 142 views
Pyspark JDBC write json data into Redshift without backslashes and double quotes
I am migrating Postgres data to Redshift, but I am facing an issue when writing the records into Redshift because of the json data type. It does write into Redshift, but it is adding ...
1 vote · 1 answer · 710 views
Handling of ties in row_number in Pyspark vs SQL
I have a table containing the following columns: year, subject, marks, city.
Let's say it contains the following values:

year  subject  marks  student name
2023  Maths    91     Jon
2023  Maths    71     Dany
2023  ...
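The usual source of surprise here is how the ranking functions treat the two tied marks of 71. A plain-Python illustration of the semantics (not the PySpark API itself): row_number() assigns distinct, arbitrary positions to ties, while rank() gives equal values the same position and skips the next one:

```python
# Marks already sorted descending, matching an ORDER BY marks DESC window.
marks = [91, 71, 71, 65]

# row_number(): every row gets a distinct position, ties broken arbitrarily.
row_numbers = list(range(1, len(marks) + 1))            # [1, 2, 3, 4]

# rank(): position = 1 + number of strictly greater values, so the two
# 71s share rank 2 and rank 3 is skipped.
ranks = [1 + sum(m > x for m in marks) for x in marks]  # [1, 2, 2, 4]
```

In PySpark the same distinction shows up between F.row_number() and F.rank() over a Window ordered by marks; dense_rank() would instead give [1, 2, 2, 3].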
0 votes · 0 answers · 27 views
How to deal with attempt of caching data from a db (Redshift) to Spark?
I have a scenario where there is an attempt to df.persist(MEMORY_AND_DISK_DESER) some tables from the DB, in my case Redshift, before running different queries. Tables can have from a few hundred million ...
1 vote · 0 answers · 455 views
Creating a redshift table via a glue pyspark job
I am following this blog post on using the Redshift integration with Apache Spark in Glue. I am trying to do it without reading the data into a dataframe; I just want to send a simple "create ...
1 vote · 1 answer · 486 views
Unable to write to redshift via PySpark
I am trying to write to Redshift via PySpark. My Spark version is 3.2.0, using Scala version 2.12.15.
I am trying to write as guided here. I have also tried writing via aws_iam_role, as explained in ...
0 votes · 2 answers · 452 views
Read subset of redshift table into glue session
In my normal workflows I read an entire table into glue using the following:
orders = glueContext.create_dynamic_frame_from_options("redshift", connection_options = {
    "url": ...
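One common way to avoid pulling the whole table is to push a filtering query down to Redshift. The sketch below uses Spark's plain JDBC reader and its `query` option rather than the Glue dynamic-frame API from the question; the endpoint, credentials, and table/column names are placeholders:

```python
# Hypothetical pushed-down subset read; only rows matching the WHERE
# clause leave Redshift. All identifiers below are placeholders.
jdbc_options = {
    "url": "jdbc:redshift://my-cluster.example:5439/dev",
    "query": "select * from orders where order_date >= '2023-01-01'",
    "user": "awsuser",
    "password": "...",
}
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

The resulting DataFrame can then be converted back to a DynamicFrame with DynamicFrame.fromDF if the rest of the job expects one.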
0 votes · 0 answers · 242 views
How to write decimal type to redshift using awsglue?
I am trying to write a variety of columns to redshift from a dynamic frame using the DynamicFrameWriter.from_jdbc_conf method, but all DECIMAL fields end up as a column of NULLs.
The ETL pulls in from ...
1 vote · 1 answer · 922 views
java.lang.NoClassDefFoundError: scala/Product$class using read function from PySpark
I'm new to PySpark, and I'm just trying to read a table from my Redshift database.
The code looks like the following:
import findspark
findspark.add_packages("io.github.spark-redshift-community:spark-...
1 vote · 1 answer · 514 views
Best way to process Redshift data on Spark (EMR) via Airflow MWAA?
We have an Airflow MWAA cluster and huge volume of Data in our Redshift data warehouse. We currently process the data directly in Redshift (w/ SQL) but given the amount of data, this puts a lot of ...
0 votes · 1 answer · 1k views
Debugging "String length exceeds DDL Length" error AWS Glue
I'm writing a dynamic frame to Redshift as a table and I'm getting the following error:
An error occurred while calling o3225.pyWriteDynamicFrame. Error (code 1204) while loading data into Redshift: ...
0 votes · 1 answer · 1k views
AWS Glue - PySpark DF to Redshift - How to handle columns with null values
I am using AWS Glue to load MongoDB data to AWS Redshift. Below is the process -
Read from a Mongo collection.
Create a Spark DF - this contains some columns with all or some null values.
Write to a ...
3 votes · 0 answers · 855 views
Error : Requested role is not associated to cluster ,when trying to read redshift table from pyspark in emr
Trying to read a Redshift table from PySpark in EMR, I get a "requested role is not associated to cluster" error.
The role is already attached to Redshift.
df.count() works, but df.show() throws the above ...
0 votes · 3 answers · 429 views
How to subtract min max date's values and calculate with different group by in SQL
I'm trying to populate the right table from the left one.
I was trying to get min and max (month) by group by brand and then get its value according to the month.
select
    brand
    ,month
    ,value
    ...
0 votes · 0 answers · 263 views
How to freeze the historical data?
I have an assignment where I am working with employee data (see the Excel screenshot, for example).
I need to freeze the data every month, which means that if the data changes in August then it should not ...
0 votes · 2 answers · 774 views
Spark overwrite delete redshift table permissions
I'm trying to update the content of a redshift cluster table using pyspark doing the following:
content = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("...
0 votes · 1 answer · 476 views
Rename a redshift SQL table within PySpark Databricks
I want to rename a redshift table within a Python Databricks notebook.
Currently I have a query that pulls in data and creates a table:
redshiftUrl = 'jdbc:redshift://myredshifturl'
redshiftOptions = ...
0 votes · 1 answer · 295 views
Why does Spark need S3 to connect to a Redshift warehouse, while Python pandas can read a Redshift table directly?
Sorry in advance for this dumb question; I am just beginning with AWS and PySpark. I was reviewing the pyspark library and I see PySpark needs a tempdir in S3 to be able to read data from Redshift. My ...
0 votes · 1 answer · 254 views
Error "declared column type INT for column id incompatible with ORC file column type string query" when copy orc to Redshift
Error "declared column type INT for column id incompatible with ORC file column type string query" when copy orc to Redshift using the command:
from 's3://'
iam_role 'role'
format as orc;
1 vote · 0 answers · 239 views
How can I access a table in a AWS KMS encrypted redshift cluster from a glue job using pyspark script?
My requirement:
I want to write a PySpark script to read data from a table in an AWS KMS-encrypted Redshift cluster (with "require SSL" set to true).
How can I retrieve connection details like the password and use them ...
0 votes · 1 answer · 515 views
how to write a Trigger to insert data from aurora to redshift
I have some data in an Aurora MySQL DB, and I would like to do two things:
HISTORICAL DATA:
Read the data from Aurora (say TABLE A), do some processing, and update some columns of a table in Redshift (...
-1 votes · 1 answer · 3k views
pyspark code failing with error An error occurred while calling z:com.amazonaws.services.glue.DynamicFrame.apply. list#5451 []
I am writing an AWS Glue job (PySpark code) using a SQL transformation. I am getting the error scala.MatchError: list#5252 [] (of class org.apache.spark.sql.catalyst.expressions.ListQuery). There is one ...
1 vote · 3 answers · 10k views
PySpark: writing in 'append' mode and overwriting if certain criteria match
I am appending the following Spark dataframe to an existing Redshift database, and I want to use 'month' and 'state' as criteria to check, replacing data in the Redshift table if month = '2021-12' and ...
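A common pattern for this is "delete the matching slice, then append", using the spark-redshift connector's preactions option so the delete runs in the same transaction as the load. A sketch under that assumption; the table name and the state value are placeholders, not from the question:

```python
# Hypothetical delete-then-append: remove the slice being reloaded,
# then append the new rows. Table and state value are placeholders.
month, state = "2021-12", "CA"
preactions = (
    f"delete from public.sales_data "
    f"where month = '{month}' and state = '{state}'"
)
# (df.write
#    .format("io.github.spark_redshift_community.spark.redshift")
#    .option("preactions", preactions)
#    .mode("append")
#    ...  # url, dbtable, tempdir, credentials
#    .save())
```

Because the delete and the COPY happen together, a failed load does not leave the table missing the old slice.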
0 votes · 0 answers · 197 views
Moving data from s3 to redshift
I have a bucket in S3 with many folders inside, and in each folder 5 to 6 files. I want to move these files to Redshift. I am using an AWS crawler and Glue to move the files, but when I ...
-2 votes · 1 answer · 3k views
Select first word of string in Spark.SQL
I am trying to select the first word of the string in column Office_Name of table Office_Address through Spark SQL, using the query below:
select split_part(Office_NAME,' ',1) Office_Alias from ...
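split_part() only became a Spark SQL built-in in Spark 3.4; on earlier versions the same result comes from substring_index(). A sketch using the column and table names from the question, with the sample office name being a made-up placeholder:

```python
# substring_index(str, delim, 1) returns everything before the first
# delimiter, i.e. the first word.
first_word_sql = (
    "select substring_index(Office_NAME, ' ', 1) as Office_Alias "
    "from Office_Address"
)

# The same logic in plain Python, on a placeholder value:
first_word = "Central Park Office".split(" ", 1)[0]  # -> "Central"
```

In a Spark session, spark.sql(first_word_sql) would evaluate the query.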
0 votes · 0 answers · 412 views
Data change Capture in Redshift using AWS Glue script
I have used a "for in" loop in an AWS Glue script to move 70 tables from S3 to Redshift. But when I run the script again and again, the data gets duplicated. I have seen one document as a ...
0 votes · 1 answer · 1k views
AWS Glue Data moving from S3 to Redshift
I have around 70 tables in one S3 bucket and I would like to move them to Redshift using Glue. I could move only a few tables; the rest have data type issues, as Redshift is not accepting some ...
0 votes · 1 answer · 1k views
Incremental data load from Redshift to S3 using Pyspark and Glue Jobs
I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method:
def readFromRedShift(spark: SparkSession, schema, ...
1 vote · 1 answer · 163 views
Pyspark error DDL length exceeded when regexp_replace with added space
This is really strange: this code works when I replace the \n and \r characters with no space. But when I use a space, either " ", or "\s", or "\s", or "[\s]", ...
2 votes · 1 answer · 1k views
How do I insert data in selective columns using PySpark?
I have a table on Redshift into which I want to insert some data using a PySpark dataframe. The Redshift table has this schema:
CREATE TABLE admin.audit_of_all_tables
(
    wh_table_name varchar,
    ...
0 votes · 0 answers · 121 views
AWS Redshift query optimisation
I am trying to insert data into Redshift using the PySpark dataframe write function.
Is there a better way to write the postActions query?
The USING keyword in the postActions part is a bit confusing, and I ...
0 votes · 0 answers · 593 views
Glue Data write to Redshift too slow
I am running a PySpark Glue job with 10 DPUs; the data in S3 is around 45 GB, split into 6 .csv files.
First question:
It's taking a lot of time to write data to Redshift from Glue, even though I am ...
1 vote · 0 answers · 349 views
remove backslash from a .csv file to load data to redshift from s3
I am getting an issue when loading my file: I have backslashes in my CSV file.
What delimiter can I use with my COPY command so that I don't get an
error loading data from S3 to ...
1 vote · 2 answers · 1k views
How do I increase the performance of Dataframe.write?
I am trying to write a PySpark dataframe to AWS Redshift.
I am using the postActions parameter for deletion,
but this snippet is taking a lot of time to complete.
Is there a way to improve the DATAFRAME....
2 votes · 1 answer · 968 views
redshift libs for pyspark
I am facing the following error while running my PySpark program:
: java.lang.ClassNotFoundException: com.amazon.redshift.jdbc42.Driver
at java.base/java.net.URLClassLoader.findClass(URLClassLoader....