
All Questions

-1 votes
2 answers
30 views

Not able to create a Redshift table using Glue

Getting error: java.sql.SQLException: Exception thrown in awaitResult. The table flight_details is getting created, but there is only one column, dummy, in it, and the schema defined in create ...
Prateek Goel
1 vote
1 answer
32 views

Getting an error when creating a Redshift table from Glue

The error output in the log is: Failed to create table: An error occurred while calling o105.getSink. : java.lang.RuntimeException: Temporary directory for redshift not specified. Please verify --...
Prateek Goel
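The "Temporary directory for redshift not specified" error above usually means the job was started without a TempDir: Glue's Redshift connector stages data through S3, so a temporary directory is mandatory. A minimal sketch of the options the connector expects — the endpoint, bucket, and table names are hypothetical placeholders:

```python
def build_redshift_options(jdbc_url, table, temp_dir):
    """Assemble connection_options for Glue's Redshift connector.

    The connector COPYs/UNLOADs via S3, so the temporary S3 directory
    ("redshiftTmpDir") is required; omitting it raises
    "Temporary directory for redshift not specified".
    """
    return {
        "url": jdbc_url,
        "dbtable": table,
        "redshiftTmpDir": temp_dir,  # usually args["TempDir"] from the job arguments
    }

opts = build_redshift_options(
    "jdbc:redshift://example-cluster:5439/dev",  # hypothetical endpoint
    "public.flight_details",
    "s3://my-glue-temp-bucket/temp/",            # hypothetical bucket
)
# The dict would then be passed to, e.g.:
# glueContext.write_dynamic_frame.from_options(frame=dyf,
#     connection_type="redshift", connection_options=opts)
```

In a real job the temp path normally comes from `args["TempDir"]` after `getResolvedOptions`, which is what the error message's `--...` hint refers to.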
4 votes
0 answers
208 views

Getting : "An error occurred while calling o110.pyWriteDynamicFrame. Exception thrown in awaitResult:" in AWS Glue Job

I am getting "An error occurred while calling o110.pyWriteDynamicFrame. Exception thrown in awaitResult:" in an AWS Glue job. The size of my source data in S3 is around 60 GB. I am reading ...
Nikhil Khandelwal
0 votes
0 answers
57 views

AWS Redshift parallel query issue in Glue script

I have created a Glue script which is supposed to read data from Redshift. The code works perfectly without hash partitions, but as soon as I try to run parallel queries it throws an error like ...
Shardul Birje
2 votes
3 answers
330 views

How to resolve the following AWS Glue error while writing to Redshift using Spark: "ORA-01722: invalid number"?

I am trying to read from an Oracle database and write to a Redshift table using PySpark. # Reading data from Oracle oracle_df = spark.read \ .format("jdbc") \ .option("url",...
Shabari nath k
0 votes
1 answer
192 views

Error while reading data from databricks jdbc connection to redshift

We use a Databricks cluster that is shut down after 30 mins of inactivity (13.3 LTS, which includes Apache Spark 3.4.1 and Scala 2.12). My objective is to read a Redshift table and write it to Snowflake. I am ...
Sree51
  • 99
0 votes
1 answer
96 views

Calculate running sum in Spark SQL

I am working on logic where I need to calculate totalscan, last5dayscan, and month2dayscan from the dailyscan count. As of today I sum the dailyscan count daily, but now the data volume is making it tough for ...
mehtat_90
  • 618
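For the running-total question above, a window function lets the engine compute the cumulative sum in one pass instead of re-summing the whole history every day. The same `SUM(...) OVER (ORDER BY ...)` shape works in Spark SQL and Redshift; the sketch below uses sqlite3 (which supports window functions from 3.25) only so it is self-contained, and the table and column names are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE daily_scans (scan_date TEXT, dailyscan INTEGER)")
con.executemany(
    "INSERT INTO daily_scans VALUES (?, ?)",
    [("2024-01-01", 10), ("2024-01-02", 5), ("2024-01-03", 7)],
)

# Cumulative sum via a window function; identical syntax in Spark SQL.
rows = con.execute(
    """
    SELECT scan_date,
           dailyscan,
           SUM(dailyscan) OVER (
               ORDER BY scan_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS totalscan
    FROM daily_scans
    ORDER BY scan_date
    """
).fetchall()
print(rows)  # totals accumulate: 10, 15, 22
```

last5dayscan and month2dayscan follow the same pattern with a `ROWS BETWEEN 4 PRECEDING AND CURRENT ROW` frame or a `PARTITION BY` on the month.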
0 votes
1 answer
167 views

redshift spectrum type conversion from String to Varchar

When I scan the data from S3 using a Glue crawler I get this schema: {id: integer, value: String}. This is because Spark writes data back as String type and not varchar type. Although there is a ...
Neelanjoy B
0 votes
0 answers
217 views

Incremental parquet files

I've got data from an on-prem system that is being loaded into an S3 bucket. The data will be transformed using EMR/PySpark, with a surrogate key added during this process. The parquet files will then get ...
user172839
  • 1,067
0 votes
1 answer
577 views

Parse XML column in redshift table

I have a table in Redshift which has an XML column. I am looking for options to parse this XML and get some values as columns in another Redshift table. Sample XML: ('<person><name>John</...
Bab
  • 181
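Redshift has no built-in XML parser, so a common pattern for the question above is to parse the XML outside the database — for example in a PySpark UDF — and load the extracted columns into the second table. A minimal sketch with Python's standard library; the sample document is a hypothetical completion of the truncated fragment in the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical completion of the truncated sample in the question.
sample = "<person><name>John</name><age>30</age></person>"

def xml_to_columns(xml_text):
    """Flatten one <person> element into a dict of column values.

    This is the kind of function you could register as a PySpark UDF
    and apply before writing the result to the second Redshift table.
    """
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

print(xml_to_columns(sample))  # {'name': 'John', 'age': '30'}
```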
1 vote
0 answers
330 views

PySpark: java.sql.SQLException: HOUR_OF_DAY: 0 -> 1

I'm trying to read a table from a MySQL database and then upload the data into Redshift on AWS. The problem is that when I try to write the rows into Redshift, I get the error: ...
fernando fincatti
0 votes
0 answers
132 views

nested json column parsing in redshift

I have the following Redshift table column and want to parse each key-value pair into separate columns/rows in the table. {"names":[{"name":"bob","months":8,"...
Bab
  • 181
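For the nested-JSON question above, Redshift's json_extract_path_text can pull individual scalar keys, but an array of objects is usually easier to explode outside the database before loading. A sketch in plain Python; the sample value is a hypothetical completion of the truncated one in the question:

```python
import json

# Hypothetical completion of the truncated sample value.
raw = '{"names":[{"name":"bob","months":8},{"name":"alice","months":3}]}'

def explode_names(json_text):
    """Turn the nested "names" array into one flat row per element -
    the shape you would want before loading back into Redshift."""
    doc = json.loads(json_text)
    return [(item["name"], item["months"]) for item in doc["names"]]

print(explode_names(raw))  # [('bob', 8), ('alice', 3)]
```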
1 vote
0 answers
647 views

Spark job fails in AWS Glue | "An error occurred while calling o86.getSink. The connection attempt failed."

I attempted to migrate data from a CSV file in S3 storage to a table in my Redshift cluster. I took reference from autogenerated code produced after I built blocks using Visual mode in AWS ...
Mohit Aswani
0 votes
1 answer
209 views

Redshift interpreting boolean data type as bit, hence not able to move a Hudi table from S3 to Redshift if there is any boolean column

I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. Data is stored in the HUDI format in S3 in parquet files. I've created the Pyspark script for full load transfer and ...
Rahul Kohli
1 vote
1 answer
2k views

Pass parameters to query with spark.read.format(jdbc) format

I am trying to execute the following sample query through spark.read.format("jdbc") in Redshift: query="select * from tableA join tableB on a.id = b.id where a.date > ? and b.date > ?&...
Bab
  • 181
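Regarding the parameterized-query question above: Spark's JDBC reader passes the query text to the database as-is and has no `?` bind-parameter mechanism, so values have to be interpolated into the string before calling `.option("query", ...)`. A sketch, to be used only with trusted or validated values, since this is plain substitution rather than escaping:

```python
def build_query(start_date, end_date):
    """Interpolate parameters into the SQL text, since Spark's JDBC
    reader has no '?' bind-parameter support. Plain string substitution:
    validate inputs first."""
    return (
        "select * from tableA a join tableB b on a.id = b.id "
        f"where a.date > '{start_date}' and b.date > '{end_date}'"
    )

query = build_query("2023-01-01", "2023-06-01")
print(query)
# The result would then be passed to, e.g.:
# spark.read.format("jdbc").option("url", redshift_url).option("query", query)...
```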
0 votes
0 answers
219 views

Redshift table insert with identity column getting table not found error

I have created a table in Redshift and inserted data as follows: create table schema_name.employee ( surrogate_key bigint IDENTITY(1,1), first_name varchar(200), last_name varchar(200), phone_number ...
Bab
  • 181
0 votes
0 answers
142 views

Pyspark JDBC write json data into Redshift without backslash and double quotes

I am migrating Postgres data to Redshift, but I am facing an issue while writing the records into Redshift because of the json data type. It does write into Redshift, but it is adding ...
vish anand
1 vote
1 answer
710 views

Handling of ties in row_number in Pyspark vs SQL

I have a table containing the following columns: year, subject, marks, student name. Let's say it contains the following values:

year | subject | marks | student name
2023 | Maths   | 91    | Jon
2023 | Maths   | 71    | Dany
2023 | ...
shubham tambere
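On the row_number-ties question above: `row_number()` always produces distinct numbers, so tied rows are ordered arbitrarily (and can differ between engines or runs) unless you add a tie-breaker column, while `rank()`/`dense_rank()` give tied rows the same value. The sketch below shows the difference with sqlite3 so it is self-contained; the same distinction holds for PySpark's Window functions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE results (year INT, subject TEXT, marks INT, student TEXT)")
con.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [(2023, "Maths", 91, "Jon"),
     (2023, "Maths", 91, "Dany"),   # tie with Jon
     (2023, "Maths", 71, "Sam")],
)

# ROW_NUMBER gets a tie-breaker (student) to make it deterministic;
# RANK deliberately leaves ties equal (1, 1, 3).
rows = con.execute(
    """
    SELECT student, marks,
           ROW_NUMBER() OVER (ORDER BY marks DESC, student) AS rn,
           RANK()       OVER (ORDER BY marks DESC)          AS rk
    FROM results
    """
).fetchall()
print(rows)
```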
0 votes
0 answers
27 views

How to deal with attempt of caching data from a db (Redshift) to Spark?

I have a scenario where there is an attempt to df.persist(MEMORY_AND_DISK_DESER) some tables from a db, in my case Redshift, before running different queries. Tables can have from a few hundred million ...
Randomize
  • 9,093
1 vote
0 answers
455 views

Creating a redshift table via a glue pyspark job

I am following this blog post on using Redshift integration with Apache Spark in Glue. I am trying to do it without reading the data into a dataframe - I just want to send a simple "create ...
L Xandor
  • 1,821
1 vote
1 answer
486 views

Unable to write to redshift via PySpark

I am trying to write to Redshift via PySpark. My Spark version is 3.2.0, using Scala version 2.12.15. I am trying to write as guided here. I have also tried writing via aws_iam_role as explained in ...
digital_monk
0 votes
2 answers
452 views

Read subset of redshift table into glue session

In my normal workflows I read an entire table into glue using the following: orders = glueContext.create_dynamic_frame_from_options("redshift", connection_options = { "url": &...
ben890
  • 1,123
0 votes
0 answers
242 views

How to write decimal type to redshift using awsglue?

I am trying to write a variety of columns to redshift from a dynamic frame using the DynamicFrameWriter.from_jdbc_conf method, but all DECIMAL fields end up as a column of NULLs. The ETL pulls in from ...
cjmbard
1 vote
1 answer
922 views

java.lang.NoClassDefFoundError: scala/Product$class using read function from PySpark

I'm new to PySpark, and I'm just trying to read a table from my redshift bank. The code looks like the following: import findspark findspark.add_packages("io.github.spark-redshift-community:spark-...
fernando fincatti
1 vote
1 answer
514 views

Best way to process Redshift data on Spark (EMR) via Airflow MWAA?

We have an Airflow MWAA cluster and huge volume of Data in our Redshift data warehouse. We currently process the data directly in Redshift (w/ SQL) but given the amount of data, this puts a lot of ...
val
  • 359
0 votes
1 answer
1k views

Debugging "String length exceeds DDL Length" error AWS Glue

I'm writing a dynamic frame to Redshift as a table and I'm getting the following error: An error occurred while calling o3225.pyWriteDynamicFrame. Error (code 1204) while loading data into Redshift: &...
ben890
  • 1,123
0 votes
1 answer
1k views

AWS Glue - PySpark DF to Redshift - How to handle columns with null values

I am using AWS Glue to load MongoDB data to AWS Redshift. Below is the process - Read from a Mongo collection. Create a Spark DF - this contains some columns with all or some null values. Write to a ...
Sanket Kelkar
3 votes
0 answers
855 views

Error : Requested role is not associated to cluster ,when trying to read redshift table from pyspark in emr

Trying to read a Redshift table from PySpark in EMR, getting a "requested role is not associated to cluster" error. The role is already attached to Redshift. df.count() works, but df.show() throws the above ...
remeezraja abdulrahman
0 votes
3 answers
429 views

How to subtract min max date's values and calculate with different group by in SQL

I'm trying to populate the right table from the left one. I was trying to get the min and max month grouped by brand, and then get its value according to the month. select brand, month, value ...
sopL
  • 61
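One way to approach the min/max-month question above is to pick the value at the earliest and latest month per brand with `first_value`/`last_value` window functions, then reduce with a GROUP BY. Sketched with sqlite3 for a self-contained demo (the same functions exist in Spark SQL and Redshift); the table and data are made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (brand TEXT, month INT, value INT)")
con.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("A", 1, 100), ("A", 3, 140), ("B", 2, 50), ("B", 5, 95)],
)

# first_value/last_value grab the value at the min/max month within each
# brand; the outer GROUP BY collapses to one row per brand and subtracts.
rows = con.execute(
    """
    SELECT brand, MAX(last_v) - MAX(first_v) AS diff
    FROM (
        SELECT brand,
               FIRST_VALUE(value) OVER w AS first_v,
               LAST_VALUE(value)  OVER (PARTITION BY brand ORDER BY month
                   ROWS BETWEEN UNBOUNDED PRECEDING
                            AND UNBOUNDED FOLLOWING) AS last_v
        FROM sales
        WINDOW w AS (PARTITION BY brand ORDER BY month)
    ) t
    GROUP BY brand
    ORDER BY brand
    """
).fetchall()
print(rows)  # [('A', 40), ('B', 45)]
```

Note `LAST_VALUE` needs the explicit full frame; with the default frame it would just return the current row's value.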
0 votes
0 answers
263 views

How to freeze the historical data?

I have this assignment where I have employee data (see the Excel screenshot, for example). I need to freeze the data every month, which means that if the data changes in August then it should not ...
Ajax
  • 179
0 votes
2 answers
774 views

Spark overwrite delete redshift table permissions

I'm trying to update the content of a redshift cluster table using pyspark doing the following: content= spark.read \ .format("com.databricks.spark.redshift") \ .option("...
Loren
  • 333
0 votes
1 answer
476 views

Rename a redshift SQL table within PySpark Databricks

I want to rename a redshift table within a Python Databricks notebook. Currently I have a query that pulls in data and creates a table: redshiftUrl = 'jdbc:redshift://myredshifturl' redshiftOptions = ...
Jacky
  • 750
0 votes
1 answer
295 views

why does spark need S3 to connect Redshift warehouse? Meanwhile python pandas can read Redshift table directly

Sorry in advance for this dumb question. I am just beginning with AWS and PySpark. I was reviewing the pyspark library and I see PySpark needs a tempdir in S3 to be able to read data from Redshift. My ...
Luis M.
0 votes
1 answer
254 views

Error "declared column type INT for column id incompatible with ORC file column type string query" when copy orc to Redshift

Error "declared column type INT for column id incompatible with ORC file column type string query" when copy orc to Redshift using the command: from 's3://' iam_role 'role' format as orc;
lucaspompeun
1 vote
0 answers
239 views

How can I access a table in a AWS KMS encrypted redshift cluster from a glue job using pyspark script?

My requirement: I want to write a pyspark script to read data from a table in an AWS KMS-encrypted Redshift cluster (required SSL is true). How can I retrieve connection details like the password and use it ...
Learner07
0 votes
1 answer
515 views

how to write a Trigger to insert data from aurora to redshift

I have some data in an Aurora MySQL db and I would like to do two things: HISTORICAL DATA: read the data from Aurora (say TABLE A), do some processing, and update some columns of a table in Redshift (...
Keen_Learner
-1 votes
1 answer
3k views

pyspark code failing with error An error occurred while calling z:com.amazonaws.services.glue.DynamicFrame.apply. list#5451 []

I am writing aws glue job (pyspark code) using SQL Transformation. I am getting error with scala.MatchError: list#5252 [] (of class org.apache.spark.sql.catalyst.expressions.ListQuery. There is one ...
Kaustubh K
1 vote
3 answers
10k views

PySpark: writing in 'append' mode and overwrite if certain criteria match

I am appending the following Spark dataframe to an existing Redshift database, and I want to use 'month' and 'state' as criteria to check, replacing data in the Redshift table if month = '2021-12' and ...
Crubal Chenxi Li
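A common pattern for the append-with-selective-overwrite question above is the Redshift connector's `preactions` option: a DELETE that removes the slice being re-appended runs before the write, so the append behaves like a partition overwrite for that month/state. A sketch that only builds the SQL — the table name is hypothetical, and the commented write call shows where it would plug in:

```python
def build_preactions(table, month, state):
    """DELETE the slice that is about to be re-appended, so appending
    acts as an overwrite for exactly that month/state combination.
    Passed to the Redshift connector via its "preactions" option."""
    return f"DELETE FROM {table} WHERE month = '{month}' AND state = '{state}';"

sql = build_preactions("public.metrics", "2021-12", "CA")  # hypothetical table
print(sql)
# df.write.format("io.github.spark_redshift_community.spark.redshift") \
#   .option("preactions", sql) \
#   .mode("append") ...   # sketch; url/dbtable/tempdir options omitted
```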
0 votes
0 answers
197 views

Moving data from s3 to redshift

I have a bucket in S3 with many folders inside, and each folder has 5 to 6 files. I want to move these files to Redshift. I am using an AWS crawler and Glue for moving the files. But when I ...
SRA
  • 3
-2 votes
1 answer
3k views

Select first word of string in Spark.SQL

I am trying to select the first word in the string of column Office_Name in table Office_Address through Spark SQL. I am using the below query: select split_part(Office_NAME,' ',1) Office_Alias from ...
Codegator
  • 637
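On the split_part question above: `split_part` is Redshift/Postgres syntax, and Spark SQL only added it in 3.4; on earlier Spark versions the usual equivalents are `split(Office_NAME, ' ')[0]` or `substring_index(Office_NAME, ' ', 1)`. The underlying logic, sketched in plain Python with a made-up sample value:

```python
def first_word(s):
    """Equivalent of split_part(s, ' ', 1): everything before the
    first space, or the whole string if there is no space."""
    return s.split(" ", 1)[0]

print(first_word("Acme Corporation Pvt Ltd"))  # Acme
```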
0 votes
0 answers
412 views

Data change Capture in Redshift using AWS Glue script

I have used a "for in" loop script in AWS Glue to move 70 tables from S3 to Redshift. But when I run the script again and again, data is duplicated. I have seen one document as a ...
SRA
  • 3
0 votes
1 answer
1k views

AWS Glue Data moving from S3 to Redshift

I have around 70 tables in one S3 bucket and I would like to move them to Redshift using Glue. I could move only a few tables; the rest of them have data type issues. Redshift is not accepting some ...
SRA
  • 3
0 votes
1 answer
1k views

Incremental data load from Redshift to S3 using Pyspark and Glue Jobs

I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method: def readFromRedShift(spark: SparkSession, schema, ...
whatsinthename
1 vote
1 answer
163 views

Pyspark error DDL length exceeded when regexp_replace with added space

This is really strange: this code works when I replace the \n and \r characters with no space. But when I use a space, either " ", or "\s", or "\s", or "[\s]", ...
Ron
  • 207
2 votes
1 answer
1k views

How do I insert data in selective columns using PySpark?

I have a table on Redshift, to which I want to insert some data using pyspark dataframe. The redshift table has schema: CREATE TABLE admin.audit_of_all_tables ( wh_table_name varchar, ...
Sidhant Gupta
0 votes
0 answers
121 views

AWS Redshift query optimisation

Trying to insert data into Redshift using the PySpark dataframe write function. Is there a better way to write the postActions query? The USING keyword in the postActions part is a bit confusing and I ...
asha k
  • 11
0 votes
0 answers
593 views

Glue Data write to Redshift too slow

I am running a PySpark Glue job with 10 DPUs; the data in S3 is around 45 GB split into 6 .csv files. First question: it's taking a lot of time to write data to Redshift from Glue even though I am ...
Aditya Verma
1 vote
0 answers
349 views

remove backslash from a .csv file to load data to redshift from s3

I am getting an issue when loading my file: I have backslashes in my CSV file. What delimiter can I use with my COPY command so that I don't get an error loading data from S3 to ...
Aditya Verma
1 vote
2 answers
1k views

How do I increase the performance of Dataframe.write?

I am trying to write a PySpark dataframe to AWS Redshift. I am using the postActions parameter for deletion, but this snippet is taking a lot of time to complete. Is there a way to improve the DATAFRAME....
asha k
  • 11
2 votes
1 answer
968 views

redshift libs for pyspark

I am facing the following error while running my pyspark program. : java.lang.ClassNotFoundException: com.amazon.redshift.jdbc42.Driver at java.base/java.net.URLClassLoader.findClass(URLClassLoader....
Abhay Dandekar