
All Questions

0 votes
1 answer
60 views

Unable to access public APIs using the requests library in EMR Serverless

Getting the below error when trying to fetch an API using the requests library. Traceback (most recent call last): File "/tmp/spark-39775710-130a-4403-9182-c557003f351b/lib.zip/urllib3/connection.py&...
Ashwini Kumar
0 votes
1 answer
246 views

How can I use yaml file in pyspark code executing on EMR cluster?

In an EMR cluster I am running PySpark code which uses a YAML file, and I am getting a path-not-found error. I am using the following spark-submit: spark-submit --deploy-mode client --executor-...
Vishal Prajapati
1 vote
0 answers
97 views

In EMR Serverless is there a way to set the LD_LIBRARY_PATH so that pyodbc lib can be used in the python virtual env

I am trying to run my existing Python script (native Python, non-Spark) directly in Serverless using EMR Studio. Our goal is to execute native Python scripts directly on EMR Serverless. To achieve ...
itzmenamita
1 vote
1 answer
480 views

Access data on EMR directory from EMR Studio: Workspaces (Notebooks)

I have some data saved on S3 which I want to import while running a Python script on EMR. To do it through Python code in the EMR console, I just create the directories/file on my EMR like this /home/...
Gaurav Singhal
0 votes
0 answers
113 views

AWS EMR Jupyter Notebook not saving: '_xsrf' argument missing from POST

In AWS EMR the Jupyter Notebook won't even attach to a cluster without giving the error: '_xsrf' argument missing from POST. A lot of the solutions I have tried, such as refreshing the page when ...
rch_frnds
0 votes
0 answers
56 views

Running PySpark script on EMR 6.13.0

I am running a PySpark script on EMR version 6.13.0 and have installed the necessary packages using a bootstrap action as below: #!/bin/bash sudo yum update -y sudo python3 -m pip install pandas PyDictionary ...
Supriya R
1 vote
0 answers
266 views

Import Custom Python Modules on EMR Serverless through Spark Configuration

I created a spark_ready.py module that hosts multiple classes that I want to use as a template. I've seen in multiple configurations online that using the "spark.submit.pyFiles" will allow ...
Justine Paul Padayao
0 votes
0 answers
56 views

Can't reach HBase (on S3) from PySpark

I have created a cluster on EMR and I want to use HBase with PySpark. I am new to using distributed systems so I might make amateur mistakes, but connecting to HBase from PySpark feels very hard. My ...
Asfandyar Abbasi
0 votes
1 answer
74 views

Fetching list of tags of an EMR Cluster using AWS Lambda Python

Is there any function to get the list of tags of an EMR cluster, as there is for an S3 bucket? For an S3 bucket we have get_bucket_tagging. I tried using get_list but it does not work. Let me know if there is any ...
Gourav Roy
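A common answer pattern (a sketch, not from the thread): boto3's EMR client has no dedicated tagging getter like S3's get_bucket_tagging; the tags come back inside the describe_cluster response under Cluster.Tags. The helper below keeps the pure dict handling separate from the boto3 call so it can be tried offline; the cluster id and the rest of the response shape are illustrative.

```python
def tags_from_describe_cluster(response):
    """Turn the Tags list of a describe_cluster response into a plain dict."""
    return {t["Key"]: t["Value"] for t in response["Cluster"]["Tags"]}

def get_emr_cluster_tags(cluster_id):
    # Imported lazily so the pure helper above works without AWS access.
    import boto3
    emr = boto3.client("emr")
    return tags_from_describe_cluster(emr.describe_cluster(ClusterId=cluster_id))

# Abridged example of the response shape:
sample = {"Cluster": {"Tags": [{"Key": "team", "Value": "data"},
                               {"Key": "env", "Value": "prod"}]}}
print(tags_from_describe_cluster(sample))  # {'team': 'data', 'env': 'prod'}
```

Inside a Lambda function, `get_emr_cluster_tags("j-XXXXXXXX")` (hypothetical cluster id) would return the same dict shape.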
0 votes
1 answer
65 views

Ensuring File Size Limit is Adhered to When Batch Processing Downloads in PySpark on EMR

I'm working on a PySpark application running on Amazon EMR, where my task involves downloading files based on a url from a DataFrame. The goal is to continuously download these files on an EMR ...
Mughees Asif
0 votes
0 answers
68 views

Is there a way to merge spark-submit command line args with SparkConf() settings in code?

I am trying to load a jar file for my spark job. I had been doing something like... jar_file = "my.jar" spark_config = ( SparkConf() .setAppName("test_app") .set("...
zethw • 353
0 votes
2 answers
620 views

Pyspark error in EMR writing parquet files to S3

I have a process that reads data from S3, processes it and then saves it again to S3 in another location in parquet format. Sometimes I get this error when it is writing: py4j.protocol.Py4JJavaError:...
Shadowtrooper
0 votes
1 answer
198 views

EMR step execution using Airflow failed

I am using Airflow in a local environment to add a step in EMR after creation of the cluster, using the below code block. def add_step(cluster_id,jar_file,step_args): print("The cluster id : {}"....
Aniruddha
0 votes
1 answer
193 views

Can't train sagemaker model using recordio protobuf data saved in S3

I want to preprocess data in Amazon EMR using PySpark and train a machine learning model in SageMaker using pipe mode. The issue I am having now is saving data in S3 and feeding it to the model. ...
Maokai • 204
1 vote
1 answer
267 views

Efficient partial string search on large pyspark dataframes

I'm currently working on a PySpark project where I need to perform a join between two large dataframes. One dataframe contains around 10 million entries with short strings as keywords (2-5 words), ...
pnv • 1,499
0 votes
1 answer
209 views

Redshift interpreting boolean data type as bit and hence not able to move hudi table from S3 to Redshift if there is any boolean data type column

I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. Data is stored in the HUDI format in S3 in parquet files. I've created the Pyspark script for full load transfer and ...
Rahul Kohli
0 votes
1 answer
624 views

PySpark monotonically_increasing_id results differ locally and on AWS EMR

I created a small function that would assign a composite id to each row to essentially group rows into smaller subsets, given a subset size. Locally on my computer the logic works flawlessly. Once I ...
dataviews • 3,040
0 votes
1 answer
2k views

ImportError: Install s3fs to access S3 on amazon EMR 6.3.0

I have the following error on my notebook after setting up an EMR 6.3.0: An error was encountered: Install s3fs to access S3 Traceback (most recent call last): File "/usr/local/lib64/python3.7/...
Airone • 1
1 vote
2 answers
665 views

AWS emr unable to install python library in bootstrap shell script

Using emr-5.33.1 and python3.7.16. Goal is to add petastorm==0.12.1 into EMR. These are the steps to install it in EMR (worked until now) Add all required dependencies of petastorm and itself into s3 ...
haneulkim • 4,908
0 votes
1 answer
404 views

Performance and Data Integrity Issues with Hudi for Long-Term Data Retention

Our project requires that we perform full loads daily, retaining these versions for future queries. Upon implementing Hudi to maintain 6 years of data with the following setup: "hoodie.cleaner....
Luiz • 1,315
1 vote
1 answer
818 views

EMR - Pyspark, No module named 'boto3'

I am running an EMR with the following creation statement: $ aws emr create-cluster \ --name "my_cluster" \ --log-uri "s3n://somebucket/" \ --release-label "emr-6.8.0" ...
Flo • 429
-1 votes
1 answer
193 views

PySpark `monotonically_increasing_id()` returns 0 for each row

This code def foobar(df): return ( df.withColumn("id", monotonically_increasing_id()) .withColumn("foo", lit("bar")) .withColumn("bar&...
hdw3 • 931
-1 votes
2 answers
257 views

Great expectations installation to AWS EMR

I tried to use Great Expectations for data quality purposes. I am running my jobs in an AWS EMR cluster and I am trying to launch a Great Expectations job on AWS EMR as well. I have a bootstrap script for ...
Liu Piu • 45
0 votes
1 answer
657 views

SparkException: Exception thrown in awaitResult for EMR

I tried running my Spark application from EMR, which right now is just the pi calculation in the tutorial doc: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-application.html I uploaded ...
Kei • 621
1 vote
1 answer
751 views

How to create an EMR cluster and submit a spark-submit step using Dagster?

I want to create a Dagster app that creates an EMR cluster and adds a spark-submit step, but due to a lack of documentation or examples I can't figure out how to do that (copilot also struggles with ...
Danylo Liakhovetskyi
1 vote
1 answer
378 views

Dagster PySpark not running on EMR

I am trying to build a pipeline in Dagster which does the following: launch an EMR cluster using the EmrJobRunner class, by using its run_job_flow function; add one or more steps to that cluster to ...
D. Gal • 329
0 votes
1 answer
640 views

Install packages on EMR via bootstrap actions not working in Jupyter notebook

I have an EMR cluster using EMR-6.3.1. I am using the Python3 Kernel. I have a very simple bootstrap script in S3: #!/bin/bash sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 ...
bravinator932421
-2 votes
1 answer
163 views

How to append json in Python List

I want to create a JSON Python list as shown below without repeated code (using Python functions). Expected output: Steps=[ { 'Name': 'Download Config File', '...
Amateur coder
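The usual way to avoid repeating each step dict is a small factory function appended into the list. A sketch: only the 'Download Config File' step name comes from the question; the jar, action, and args are illustrative placeholders.

```python
def make_step(name, args):
    """Build one EMR step definition; jar/action values are illustrative."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

Steps = [
    make_step("Download Config File", ["aws", "s3", "cp", "s3://bucket/config.yml", "/tmp/"]),
    make_step("Run Spark Job", ["spark-submit", "--deploy-mode", "client", "job.py"]),
]
print(Steps[0]["Name"])  # Download Config File
```

Each call to `make_step` appends a fully-formed dict, so adding a new step is one line instead of a copied block.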
0 votes
1 answer
545 views

Running Jupyter PySpark notebook in EMR, module not found, although it is installed

I am trying to run a Jupyter notebook in EMR. The notebook uses the PySpark kernel. All packages needed for the notebook are installed via bootstrap actions, but when the notebook is run it fails with ...
ezamur • 2,172
3 votes
0 answers
343 views

Model inference using pyspark pandas_udf runs twice

Using Python 3.7 and PySpark 2.4.7 on EMR 5.33.1. From a trained Keras model on pandas I'm trying to do distributed inference using PySpark. I have an inference function wrapped with pandas_udf. @pandas_udf(...
haneulkim • 4,908
0 votes
1 answer
390 views

Spark stuck in UNDEFINED status when submit a job in AWS EMR

I am running a spark-submit using the following script on an AWS EMR cluster (version emr-5.26.0) on m4.xlarge instances: #!/bin/sh # Define variables to get script parameters MAIN_SPARK_URI=$1 # ...
Mistapopo • 433
0 votes
1 answer
652 views

How to install additional packages for pyspark in AWS EMR cluster?

I have a PySpark application that uses the boto3 library under the hood. I am trying to launch the application with a built wheel package that contains the application's dependencies. External dependencies like boto3 ...
Liu Piu • 45
1 vote
0 answers
31 views

How to include user config in the SPARK STEPS in airflow?

I want to take the user input from airflow in json format like this: {"type":"adhoc", "start_date":"2022-01-01", "end_date":"2022-05-01"} ...
Tushar Vatsa
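A hedged sketch of the usual approach: the user's JSON arrives at runtime as dag_run.conf, and since the `steps` field of EmrAddStepsOperator is templated, values can be pulled with Jinja, e.g. "{{ dag_run.conf['start_date'] }}". The plain-Python equivalent of assembling step args from that config (the job script name and flag names are illustrative) looks like:

```python
def build_spark_step_args(conf):
    """Assemble spark-submit step args from the user-supplied config dict.
    Flag names and job.py are hypothetical placeholders."""
    return ["spark-submit", "job.py",
            "--type", conf["type"],
            "--start-date", conf["start_date"],
            "--end-date", conf["end_date"]]

# The exact JSON from the question, as it would appear in dag_run.conf:
conf = {"type": "adhoc", "start_date": "2022-01-01", "end_date": "2022-05-01"}
print(build_spark_step_args(conf))
```

Triggering the DAG with that JSON (via the UI or `airflow dags trigger --conf '...'`) makes it available to the templated SPARK_STEPS.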
0 votes
1 answer
479 views

Unable to read data from mongoDB using Pyspark or Python in AWS EMR

I am trying to read data from a 3-node MongoDB cluster (replica set) using PySpark and native Python in AWS EMR. I am facing issues while executing the code within the AWS EMR cluster, as explained below ...
RANJITH JN JN
2 votes
1 answer
676 views

Pyspark streaming from Kafka to Hudi

I'm new to Hudi and I have a problem. I'm working with an EMR in AWS with PySpark and Kafka, and what I want to do is read a topic from the Kafka cluster with PySpark streaming and then move it to ...
Valle1208
0 votes
2 answers
774 views

Submitting a Pyspark job with multiple files in AWS EMR

I have a pyspark job that is distributed in multiple code files in this structure: flexible_clendar - Cache - redis_main.py - Helpers - helpers.py - Spark - spark_main.py ...
Daniel Avigdor
1 vote
1 answer
1k views

Importing custom modules to AWS EMR

I have an S3 repository containing a 'main.py' file, with custom modules that I built (inside 'Cache' and 'Helpers'). My 'main.py' file looks like this: from pyspark.sql import SparkSession from ...
Daniel Avigdor
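A frequently suggested fix, sketched under assumptions: zip the custom packages ('Cache'/'Helpers' come from the question; every path below is otherwise illustrative) and ship the archive to the executors with spark-submit's --py-files, so the modules are importable cluster-wide.

```python
import os
import tempfile
import zipfile

def zip_modules(zip_path, package_dirs):
    """Zip the .py files of the given package directories, preserving
    their relative paths so imports resolve from the archive root."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for pkg in package_dirs:
            for root, _dirs, files in os.walk(pkg):
                for f in files:
                    if f.endswith(".py"):
                        full = os.path.join(root, f)
                        zf.write(full, arcname=full)

# Demonstrate on a tiny fake layout mirroring the question's structure:
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "Helpers"))
open(os.path.join(tmp, "Helpers", "helpers.py"), "w").close()
os.chdir(tmp)
zip_modules("modules.zip", ["Helpers"])
print(zipfile.ZipFile("modules.zip").namelist())

# Then (after uploading the archive next to main.py):
#   spark-submit --py-files s3://bucket/modules.zip s3://bucket/main.py
```

With the archive on the Python path, `from Helpers.helpers import ...` works on the executors the same way it does locally.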
0 votes
1 answer
699 views

Map Reduce program to calculate the average and count

I am trying to calculate taxi trip counts and averages using a MapReduce Python program. In the Map program I have written the following code, where it assigns each row a key. import sys for line in sys.stdin: ...
deepak • 3
1 vote
1 answer
3k views

Virtualenv in aws emr-serverless

I'm trying to run some jobs via the AWS CLI using a virtual environment where I installed some libraries. I followed this guide; the same is here. But when I run the job I have this error: Job execution ...
solopiu • 756
2 votes
0 answers
586 views

EMR pyspark import functions from other notebook (and retain in memory)

I am attempting to call another notebook from a PySpark notebook on AWS EMR. I can successfully call a notebook that runs functions such as f, and the function exists in memory to be called within cell ...
Francis Smart
0 votes
1 answer
2k views

ModuleNotFoundError: No module named 'boto3'

I have an AWS EMR cluster and all the steps are failing with the error: ''' ModuleNotFoundError: No module named 'boto3' ''' Python and pip versions: ''' python --version Python 3.7.10 pip --version ...
Sarat Kota
0 votes
0 answers
1k views

Livy logs location (in S3) for an EMR cluster (debugging "Neither SparkSession nor HiveContext/SqlContext is available")

I'm using AWS SageMaker connected to an EMR cluster via Livy. With a "normal" session (default session config) the connection is created and the Spark context works fine, but when adding spark....
Luis Leal • 3,492
0 votes
1 answer
1k views

Sagemaker notebook to EMR pyspark using yarn-client instead of livy

I know there are good tutorials on connecting Sagemaker notebooks to EMR cluster for running pyspark jobs via the SparkMagic pre-installed kernel, however we want to connect to the cluster using yarn-...
Luis Leal • 3,492
0 votes
2 answers
253 views

Perform preprocessing operations from pandas on Spark dataframe

I have a rather large CSV so I am using AWS EMR to read the data into a Spark dataframe to perform some operations. I have a pandas function that does some simple preprocessing: def clean_census_data(...
justanewb • 133
1 vote
0 answers
758 views

How to run Apache Beam Python code on AWS EMR?

I am just using a Python file for creating an Apache Beam pipeline and trying to run it through AWS EMR. hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-...
Mohit Kumar
4 votes
1 answer
5k views

Airflow 2 XCom in Task Groups

I have two tasks inside a TaskGroup that need to pull xcom values to supply the job_flow_id and step_id. Here's the code: with TaskGroup('execute_my_steps') as execute_my_steps: config = {some ...
fjjones88 • 349
0 votes
0 answers
2k views

Installing Python libraries in AWS EMR Jupyter Notebook

I am trying to use my oft-used python libraries, such as numpy, pandas, and matplotlib, as well as other libraries like scikit-learn and PySpark on a Jupyter Notebook that is connected to an EMR ...
smpark • 1
0 votes
1 answer
468 views

Is there an optimal way in pyspark to write the same dataframe to multiple locations?

I have a dataframe in pyspark and I want to write the same dataframe to two locations in AWS s3. Currently I have the following code running on AWS EMR. # result is the name of the dataframe ...
Ashill Chiranjan
0 votes
1 answer
142 views

Using spark to merge 12 large dataframes together

Say I have one dataframe that looks like this d1 = {'Location+Type':['Census Tract 3, Jefferson County, Alabama: Summary level: 140, state:01> county:073> tract:000300', 'Census Tract 4, ...
justanewb • 133
1 vote
0 answers
84 views

Mismatched numpy versions in a venv on a running cluster with EMR notebooks

Trying to upgrade the numpy version on a running cluster with EMR notebooks. My conf file is: %%configure -f { "conf":{ "spark.pyspark.python": "python3", ...
laila • 1,049
