All Questions
Tagged with amazon-redshift pyspark
122 questions
-1 votes · 2 answers · 30 views
Not able to create a redshift table using glue
Getting error: exception: java.sql.SQLException: Exception thrown in awaitResult:
The table flight_details is getting created, but it contains only one column, dummy, instead of the schema defined in the create ...
1 vote · 1 answer · 32 views
Getting an error when creating a Redshift table from Glue
The error output in the log is:
Failed to create table: An error occurred while calling o105.getSink.
: java.lang.RuntimeException: Temporary directory for redshift not specified. Please verify --...
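This error means the Glue Redshift sink was given no S3 staging location; Redshift reads and writes in Glue go through S3 first. A minimal sketch of supplying one via the connection options; the endpoint, credentials, table, and bucket names below are placeholders, not from the question:

```python
# Hypothetical Glue connection options for a Redshift write. The key fix
# is redshift_tmp_dir, often populated from the job's --TempDir argument.
connection_options = {
    "url": "jdbc:redshift://my-cluster.example:5439/dev",  # placeholder
    "dbtable": "public.flight_details",                    # placeholder
    "user": "awsuser",                                     # placeholder
    "password": "...",
    "redshift_tmp_dir": "s3://my-temp-bucket/redshift-staging/",
}
# In the job itself this dict would be passed to something like:
# glueContext.write_dynamic_frame.from_options(
#     frame=dyf,
#     connection_type="redshift",
#     connection_options=connection_options)
```

With getSink-style code, the same S3 path goes in as redshift_tmp_dir, typically taken from args["TempDir"].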
4 votes · 0 answers · 208 views
Getting : "An error occurred while calling o110.pyWriteDynamicFrame. Exception thrown in awaitResult:" in AWS Glue Job
I am getting "An error occurred while calling o110.pyWriteDynamicFrame. Exception thrown in awaitResult:" in an AWS Glue job.
The size of my source data in s3 is around 60 GB.
I am reading ...
0 votes · 0 answers · 57 views
AWS Redshift parallel query issue in Glue script
I have created a Glue script that is supposed to read data from Redshift. The code works perfectly without hash partitions, but as soon as I try to run parallel queries it throws an error like ...
2 votes · 3 answers · 330 views
How to resolve the following AWS Glue error while writing to Redshift using Spark: "ORA-01722: invalid number"?
I am trying to read from an Oracle database and write to a Redshift table using PySpark.
# Reading data from Oracle
oracle_df = spark.read \
    .format("jdbc") \
    .option("url",...
0 votes · 1 answer · 192 views
Error while reading data from databricks jdbc connection to redshift
We use a Databricks cluster that is shut down after 30 minutes of inactivity (13.3 LTS, which includes Apache Spark 3.4.1 and Scala 2.12).
My objective is to read a Redshift table and write it to Snowflake. I am ...
0 votes · 1 answer · 96 views
Calculate running sum in Spark SQL
I am working on logic where I need to calculate totalscan, last5dayscan, and month2dayscan from a dailyscan count. Today I sum the dailyscan count daily, but the data volume is now making it tough to ...
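Running totals like these are usually computed with window functions rather than re-summing each day. A minimal sketch of what such a Spark SQL query might look like; the table and column names (daily_scans, scan_date, dailyscan) are assumptions, not from the question:

```python
# Hypothetical Spark SQL: cumulative and trailing-window sums over a
# daily scan table. Column/table names are placeholders.
running_sum_sql = """
SELECT scan_date,
       dailyscan,
       SUM(dailyscan) OVER (ORDER BY scan_date
                            ROWS BETWEEN UNBOUNDED PRECEDING
                                     AND CURRENT ROW) AS totalscan,
       SUM(dailyscan) OVER (ORDER BY scan_date
                            ROWS BETWEEN 4 PRECEDING
                                     AND CURRENT ROW) AS last5dayscan
FROM daily_scans
"""
# In a Spark session this would be evaluated with spark.sql(running_sum_sql).
```

The 4 PRECEDING frame plus the current row covers five days, which is one common reading of "last5dayscan".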
0 votes · 1 answer · 167 views
redshift spectrum type conversion from String to Varchar
When I scan the data from S3 using a Glue crawler I get this schema:
{id: integer, value: String}
This is because Spark writes data back in String type and not varchar type. Although there is a ...
0 votes · 0 answers · 217 views
Incremental parquet files
I've got data from an on-prem system that is being loaded into an S3 bucket. The data will be transformed using EMR/PySpark, with a surrogate key added during this process. The parquet files will then get ...
0 votes · 1 answer · 577 views
Parse XML column in redshift table
I have a table in Redshift with an XML column. I am looking for options to parse this XML and extract some values as columns in another Redshift table.
sample xml - ('<person><name>John</...
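Redshift has no native XML functions, so this kind of parsing is typically done outside the database, e.g. in a PySpark UDF. A minimal sketch using the Python standard library; the sample value in the question is truncated, so the completed document below (the age element in particular) is a hypothetical assumption:

```python
import xml.etree.ElementTree as ET

# Hypothetical completion of the truncated sample from the question.
sample = "<person><name>John</name><age>30</age></person>"

root = ET.fromstring(sample)
name = root.findtext("name")       # -> "John"
age = int(root.findtext("age"))    # -> 30
```

Wrapped in a UDF, a function like this can turn the XML column into ordinary columns before the result is written back to Redshift.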
1 vote · 0 answers · 330 views
PySpark: java.sql.SQLException: HOUR_OF_DAY: 0 -> 1
I'm trying to read a table from a MySQL database and then upload the data into Redshift on AWS.
The problem is that when I try to write the rows into Redshift, I get the error:
...
0 votes · 0 answers · 132 views
nested json column parsing in redshift
I have the following Redshift table column and want to parse each key-value pair into separate columns/rows in the table.
{"names":[{"name":"bob","months":8,"...
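Flattening a nested JSON value like this is straightforward with Python's json module (or, at scale, with PySpark's from_json and explode). A small sketch; the column value in the question is truncated, so the completed payload below is a hypothetical assumption:

```python
import json

# Hypothetical completion of the truncated column value from the question.
raw = '{"names":[{"name":"bob","months":8},{"name":"sue","months":3}]}'

# One (name, months) row per array entry.
rows = [(entry["name"], entry["months"]) for entry in json.loads(raw)["names"]]
```

Each tuple in rows corresponds to one row to insert into the target table.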
1 vote · 0 answers · 647 views
Spark job fails in AWS Glue | "An error occurred while calling o86.getSink. The connection attempt failed."
I attempted to migrate data from a CSV file in S3 storage to a table in my Redshift cluster. I took as a reference the autogenerated code produced after I built blocks using visual mode in AWS ...
0 votes · 1 answer · 209 views
Redshift interpreting boolean data type as bit, so unable to move a Hudi table from S3 to Redshift if there is any boolean column
I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. Data is stored in the HUDI format in S3 in parquet files. I've created the Pyspark script for full load transfer and ...
1 vote · 1 answer · 2k views
Pass parameters to query with spark.read.format(jdbc) format
I am trying to execute the following sample query through spark.read.format("jdbc") against Redshift:
query="select * from tableA join tableB on a.id = b.id where a.date > ? and b.date > ? ...
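Spark's JDBC data source does not support `?` placeholders, so parameter values are usually interpolated into the query string before it is pushed down. A sketch under that assumption; the date values and the redshift_jdbc_url name are placeholders, not from the question:

```python
# Build the pushed-down query with the parameter values inlined
# (dates below are example placeholders).
date_a, date_b = "2021-01-01", "2021-06-01"
query = (
    "select * from tableA a join tableB b on a.id = b.id "
    f"where a.date > '{date_a}' and b.date > '{date_b}'"
)
# In a Spark session this would be read with the `query` option:
# df = (spark.read.format("jdbc")
#       .option("url", redshift_jdbc_url)   # placeholder URL variable
#       .option("query", query)
#       .load())
```

Because the values are interpolated as text, they should come from trusted code, not raw user input.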
0 votes · 0 answers · 219 views
Redshift table insert with identity column getting table not found error
I have created a table in Redshift and inserted data as follows:
create table schema_name.employee (
    surrogate_key bigint IDENTITY(1,1),
    first_name varchar(200),
    last_name varchar(200),
    phone_number ...
0 votes · 0 answers · 142 views
Pyspark JDBC write json data into Redshift without backslashes and double quotes
I am migrating Postgres data to Redshift, but I am facing an issue when writing the records into Redshift because of the json data type. It does write into Redshift, but it is adding ...
1 vote · 1 answer · 710 views
Handling of ties in row_number in Pyspark vs SQL
I have a table containing the following columns: year, subject, marks, city.
Let's say it contains the following values:

year  subject  marks  student name
2023  Maths    91     Jon
2023  Maths    71     Dany
2023  ...
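The usual source of surprise here is how the ranking functions treat the two tied marks of 71. A plain-Python illustration of the semantics (not the PySpark API itself): row_number() assigns distinct, arbitrary positions to ties, while rank() gives equal values the same position and skips the next one:

```python
# Marks already sorted descending, matching an ORDER BY marks DESC window.
marks = [91, 71, 71, 65]

# row_number(): every row gets a distinct position, ties broken arbitrarily.
row_numbers = list(range(1, len(marks) + 1))            # [1, 2, 3, 4]

# rank(): position = 1 + number of strictly greater values, so the two
# 71s share rank 2 and rank 3 is skipped.
ranks = [1 + sum(m > x for m in marks) for x in marks]  # [1, 2, 2, 4]
```

In PySpark the same distinction shows up between F.row_number() and F.rank() over a Window ordered by marks; dense_rank() would instead give [1, 2, 2, 3].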
0 votes · 0 answers · 27 views
How to deal with attempt of caching data from a db (Redshift) to Spark?
I have a scenario where there is an attempt to df.persist(MEMORY_AND_DISK_DESER) some tables from the DB, in my case Redshift, before running different queries. Tables can have from a few hundred million ...
1 vote · 0 answers · 455 views
Creating a redshift table via a glue pyspark job
I am following this blog post on using the Redshift integration with Apache Spark in Glue. I am trying to do it without reading the data into a dataframe; I just want to send a simple "create ...
1 vote · 1 answer · 486 views
Unable to write to redshift via PySpark
I am trying to write to Redshift via PySpark. My Spark version is 3.2.0, using Scala version 2.12.15.
I am trying to write as guided here. I have also tried writing via aws_iam_role, as explained in ...
0 votes · 2 answers · 452 views
Read subset of redshift table into glue session
In my normal workflows I read an entire table into glue using the following:
orders = glueContext.create_dynamic_frame_from_options("redshift", connection_options = {
    "url": ...
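One common way to avoid pulling the whole table is to push a filtering query down to Redshift. The sketch below uses Spark's plain JDBC reader and its `query` option rather than the Glue dynamic-frame API from the question; the endpoint, credentials, and table/column names are placeholders:

```python
# Hypothetical pushed-down subset read; only rows matching the WHERE
# clause leave Redshift. All identifiers below are placeholders.
jdbc_options = {
    "url": "jdbc:redshift://my-cluster.example:5439/dev",
    "query": "select * from orders where order_date >= '2023-01-01'",
    "user": "awsuser",
    "password": "...",
}
# df = spark.read.format("jdbc").options(**jdbc_options).load()
```

The resulting DataFrame can then be converted back to a DynamicFrame with DynamicFrame.fromDF if the rest of the job expects one.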
0 votes · 0 answers · 242 views
How to write decimal type to redshift using awsglue?
I am trying to write a variety of columns to redshift from a dynamic frame using the DynamicFrameWriter.from_jdbc_conf method, but all DECIMAL fields end up as a column of NULLs.
The ETL pulls in from ...
1 vote · 1 answer · 922 views
java.lang.NoClassDefFoundError: scala/Product$class using read function from PySpark
I'm new to PySpark, and I'm just trying to read a table from my Redshift database.
The code looks like the following:
import findspark
findspark.add_packages("io.github.spark-redshift-community:spark-...
1 vote · 1 answer · 514 views
Best way to process Redshift data on Spark (EMR) via Airflow MWAA?
We have an Airflow MWAA cluster and huge volume of Data in our Redshift data warehouse. We currently process the data directly in Redshift (w/ SQL) but given the amount of data, this puts a lot of ...
0 votes · 1 answer · 1k views
Debugging "String length exceeds DDL Length" error AWS Glue
I'm writing a dynamic frame to Redshift as a table and I'm getting the following error:
An error occurred while calling o3225.pyWriteDynamicFrame. Error (code 1204) while loading data into Redshift: ...
0 votes · 1 answer · 1k views
AWS Glue - PySpark DF to Redshift - How to handle columns with null values
I am using AWS Glue to load MongoDB data to AWS Redshift. Below is the process -
Read from a Mongo collection.
Create a Spark DF - this contains some columns with all or some null values.
Write to a ...
3 votes · 0 answers · 855 views
Error : Requested role is not associated to cluster ,when trying to read redshift table from pyspark in emr
Trying to read a Redshift table from PySpark in EMR, I get a "requested role is not associated to cluster" error.
The role is already attached to Redshift.
df.count() works, but df.show() throws the above ...
0 votes · 3 answers · 429 views
How to subtract min max date's values and calculate with different group by in SQL
I'm trying to populate the right table from the left one.
I was trying to get min and max (month) by group by brand and then get its value according to the month.
select
    brand
    ,month
    ,value
    ...
0 votes · 0 answers · 263 views
How to freeze the historical data?
I have an assignment where I am working with employee data (see the Excel screenshot, for example).
I need to freeze the data every month, which means that if the data changes in August then it should not ...
0 votes · 2 answers · 774 views
Spark overwrite delete redshift table permissions
I'm trying to update the content of a redshift cluster table using pyspark doing the following:
content = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("...
0 votes · 1 answer · 476 views
Rename a redshift SQL table within PySpark Databricks
I want to rename a redshift table within a Python Databricks notebook.
Currently I have a query that pulls in data and creates a table:
redshiftUrl = 'jdbc:redshift://myredshifturl'
redshiftOptions = ...
0 votes · 1 answer · 295 views
Why does Spark need S3 to connect to a Redshift warehouse, while Python pandas can read a Redshift table directly?
Sorry in advance for this dumb question; I am just beginning with AWS and PySpark. I was reviewing the pyspark library and I see PySpark needs a tempdir in S3 to be able to read data from Redshift. My ...
0 votes · 1 answer · 254 views
Error "declared column type INT for column id incompatible with ORC file column type string query" when copy orc to Redshift
Error "declared column type INT for column id incompatible with ORC file column type string query" when copy orc to Redshift using the command:
from 's3://'
iam_role 'role'
format as orc;
1 vote · 0 answers · 239 views
How can I access a table in a AWS KMS encrypted redshift cluster from a glue job using pyspark script?
My requirement:
I want to write a PySpark script to read data from a table in an AWS KMS-encrypted Redshift cluster (with "require SSL" set to true).
How can I retrieve connection details like the password and use them ...
0 votes · 1 answer · 515 views
how to write a Trigger to insert data from aurora to redshift
I have some data in an Aurora MySQL DB, and I would like to do two things:
HISTORICAL DATA:
Read the data from Aurora (say TABLE A), do some processing, and update some columns of a table in Redshift (...
-1 votes · 1 answer · 3k views
pyspark code failing with error An error occurred while calling z:com.amazonaws.services.glue.DynamicFrame.apply. list#5451 []
I am writing an AWS Glue job (PySpark code) using a SQL transformation. I am getting the error scala.MatchError: list#5252 [] (of class org.apache.spark.sql.catalyst.expressions.ListQuery). There is one ...
1 vote · 3 answers · 10k views
PySpark: writing in 'append' mode and overwriting if certain criteria match
I am appending the following Spark dataframe to an existing Redshift database, and I want to use 'month' and 'state' as criteria to check, replacing data in the Redshift table if month = '2021-12' and ...
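A common pattern for this is "delete the matching slice, then append", using the spark-redshift connector's preactions option so the delete runs in the same transaction as the load. A sketch under that assumption; the table name and the state value are placeholders, not from the question:

```python
# Hypothetical delete-then-append: remove the slice being reloaded,
# then append the new rows. Table and state value are placeholders.
month, state = "2021-12", "CA"
preactions = (
    f"delete from public.sales_data "
    f"where month = '{month}' and state = '{state}'"
)
# (df.write
#    .format("io.github.spark_redshift_community.spark.redshift")
#    .option("preactions", preactions)
#    .mode("append")
#    ...  # url, dbtable, tempdir, credentials
#    .save())
```

Because the delete and the COPY happen together, a failed load does not leave the table missing the old slice.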
0 votes · 0 answers · 197 views
Moving data from s3 to redshift
I have a bucket in S3 with many folders inside, and in each folder 5 to 6 files. I want to move these files to Redshift. I am using an AWS crawler and Glue to move the files, but when I ...
-2 votes · 1 answer · 3k views
Select first word of string in Spark.SQL
I am trying to select the first word of the string in column Office_Name of table Office_Address through Spark SQL, using the query below:
select split_part(Office_NAME,' ',1) Office_Alias from ...
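split_part() only became a Spark SQL built-in in Spark 3.4; on earlier versions the same result comes from substring_index(). A sketch using the column and table names from the question, with the sample office name being a made-up placeholder:

```python
# substring_index(str, delim, 1) returns everything before the first
# delimiter, i.e. the first word.
first_word_sql = (
    "select substring_index(Office_NAME, ' ', 1) as Office_Alias "
    "from Office_Address"
)

# The same logic in plain Python, on a placeholder value:
first_word = "Central Park Office".split(" ", 1)[0]  # -> "Central"
```

In a Spark session, spark.sql(first_word_sql) would evaluate the query.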
0 votes · 0 answers · 412 views
Data change Capture in Redshift using AWS Glue script
I have used a "for in" loop in an AWS Glue script to move 70 tables from S3 to Redshift. But when I run the script again and again, the data gets duplicated. I have seen one document as a ...
0 votes · 1 answer · 1k views
AWS Glue Data moving from S3 to Redshift
I have around 70 tables in one S3 bucket and I would like to move them to Redshift using Glue. I could move only a few tables; the rest have data type issues, as Redshift is not accepting some ...
0 votes · 1 answer · 1k views
Incremental data load from Redshift to S3 using Pyspark and Glue Jobs
I have created a pipeline where the data ingestion takes place between Redshift and S3. I was able to do the complete load using the below method:
def readFromRedShift(spark: SparkSession, schema, ...
1 vote · 1 answer · 163 views
Pyspark error DDL length exceeded when regexp_replace with added space
This is really strange: this code works when I replace the \n and \r characters with no space. But when I use a space, either " ", or "\s", or "\s", or "[\s]", ...
2 votes · 1 answer · 1k views
How do I insert data in selective columns using PySpark?
I have a table on Redshift into which I want to insert some data using a PySpark dataframe. The Redshift table has this schema:
CREATE TABLE admin.audit_of_all_tables
(
    wh_table_name varchar,
    ...
0 votes · 0 answers · 121 views
AWS Redshift query optimisation
I am trying to insert data into Redshift using the PySpark dataframe write function.
Is there a better way to write the postActions query?
The USING keyword in the postActions part is a bit confusing, and I ...
0 votes · 0 answers · 593 views
Glue Data write to Redshift too slow
I am running a PySpark Glue job with 10 DPUs; the data in S3 is around 45 GB, split into 6 .csv files.
First question:
It's taking a lot of time to write data to Redshift from Glue, even though I am ...
1 vote · 0 answers · 349 views
remove backslash from a .csv file to load data to redshift from s3
I am getting an issue when loading my file: I have backslashes in my CSV file.
What delimiter can I use with my COPY command so that I don't get an
error loading data from S3 to ...
1 vote · 2 answers · 1k views
How do I increase the performance of Dataframe.write?
I am trying to write a PySpark dataframe to AWS Redshift.
I am using the postActions parameter for deletion,
but this snippet is taking a lot of time to complete.
Is there a way to improve the DATAFRAME....
2 votes · 1 answer · 968 views
redshift libs for pyspark
I am facing the following error while running my PySpark program:
: java.lang.ClassNotFoundException: com.amazon.redshift.jdbc42.Driver
at java.base/java.net.URLClassLoader.findClass(URLClassLoader....