All Questions
Tagged with amazon-emr python
447 questions
0 votes · 1 answer · 60 views
Unable to access public APIs using the requests library in EMR Serverless
I am getting the error below when trying to fetch an API with the requests library.
Traceback (most recent call last):
File "/tmp/spark-39775710-130a-4403-9182-c557003f351b/lib.zip/urllib3/connection.py&...
0 votes · 1 answer · 246 views
How can I use yaml file in pyspark code executing on EMR cluster?
On the EMR cluster I am running PySpark code that uses a YAML file, and I am getting a path-not-found error.
I am using the following spark-submit command:
spark-submit --deploy-mode client --executor-...
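A common fix for this kind of path error is to ship the YAML file with the job (e.g. spark-submit --files config.yaml) and resolve its local path through SparkFiles; a minimal sketch under that assumption, with config.yaml as a placeholder file name:
# Sketch: read a YAML file distributed to the cluster via `spark-submit --files config.yaml`.
# Assumes PyYAML is available on the nodes; "config.yaml" is a placeholder name.
import yaml
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yaml-config-demo").getOrCreate()

# SparkFiles.get() returns the local path of a file shipped with --files / sc.addFile().
with open(SparkFiles.get("config.yaml")) as f:
    config = yaml.safe_load(f)
print(config)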
1 vote · 0 answers · 97 views
In EMR Serverless is there a way to set the LD_LIBRARY_PATH so that pyodbc lib can be used in the python virtual env
I am trying to run my existing Python script (native Python, non-Spark) directly in EMR Serverless using EMR Studio.
Our goal is to execute native Python scripts directly on EMR serverless. To achieve ...
1 vote · 1 answer · 480 views
Access data on EMR directory from EMR Studio: Workspaces (Notebooks)
I have some data saved on S3 which I want to import while running a Python script on EMR.
To do it through Python code on the EMR console, I just create the directories/file on my EMR like this: /home/...
0 votes · 0 answers · 113 views
AWS EMR Jupyter Notebook not saving: '_xsrf' argument missing from POST
In AWS EMR the Jupyter Notebook won't even attach to a cluster without giving the error:
'_xsrf' argument missing from POST
A lot of the solutions I have tried such as refreshing the page when ...
0 votes · 0 answers · 56 views
Running PySpark script on EMR 6.13.0
I am running a PySpark script on EMR version 6.13.0 and have installed the necessary packages using a bootstrap action, as below:
#!/bin/bash
sudo yum update -y
sudo python3 -m pip install pandas PyDictionary ...
1 vote · 0 answers · 266 views
Import Custom Python Modules on EMR Serverless through Spark Configuration
I created a spark_ready.py module that hosts multiple classes that I want to use as a template. I've seen in multiple configurations online that using the "spark.submit.pyFiles" will allow ...
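For reference, the shape this usually takes is: zip the module, upload it to S3, set spark.submit.pyFiles (or --py-files) to that path, and import normally in the entry point. A minimal sketch of the entry point, where the S3 path and class name are placeholders:
# Sketch of the job's entry-point script. Assumes it was submitted with something like
#   --conf spark.submit.pyFiles=s3://my-bucket/deps/spark_ready.zip   (placeholder path)
# so the archive is placed on the PYTHONPATH of the driver and executors.
from pyspark.sql import SparkSession
from spark_ready import TemplateJob  # TemplateJob is a placeholder class name

spark = SparkSession.builder.getOrCreate()
job = TemplateJob(spark)
job.run()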
0 votes · 0 answers · 56 views
Can't reach HBase (on S3) from PySpark
I have created a cluster on EMR and I want to use HBase with PySpark. I am new to using distributed systems, so I might make amateur mistakes, but connecting to HBase from PySpark feels very hard.
My ...
0 votes · 1 answer · 74 views
Fetching list of tags of an EMR Cluster using AWS Lambda Python
Is there any function to get the list of tags of an EMR cluster, like there is for an S3 bucket? For an S3 bucket we have get_bucket_tagging.
I tried using get_list but it does not work. Let me know if there is any ...
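As far as I know the EMR client has no dedicated tag-listing call; the tags come back as part of describe_cluster. A minimal boto3 sketch for a Lambda handler, with the cluster ID as a placeholder:
# Sketch: list the tags of an EMR cluster from a Lambda handler using boto3.
# "j-XXXXXXXXXXXXX" is a placeholder cluster ID.
import boto3

def lambda_handler(event, context):
    emr = boto3.client("emr")
    cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")
    tags = cluster["Cluster"]["Tags"]  # list of {"Key": ..., "Value": ...} dicts
    return tags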
0 votes · 1 answer · 65 views
Ensuring File Size Limit is Adhered to When Batch Processing Downloads in PySpark on EMR
I'm working on a PySpark application running on Amazon EMR, where my task involves downloading files based on a url from a DataFrame. The goal is to continuously download these files on an EMR ...
0 votes · 0 answers · 68 views
Is there a way to merge spark-submit command line args with SparkConf() settings in code?
I am trying to load a jar file for my spark job.
I had been doing something like...
jar_file = "my.jar"
spark_config = (
SparkConf()
.setAppName("test_app")
.set("...
0 votes · 2 answers · 620 views
PySpark error in EMR writing Parquet files to S3
I have a process that reads data from S3, processes it, and then saves it again to S3 in another location in Parquet format. Sometimes I get this error while it is writing:
py4j.protocol.Py4JJavaError:...
0 votes · 1 answer · 198 views
EMR step execution using Airflow failed
I am using Airflow in a local environment to add a step to EMR after cluster creation, using the code block below.
def add_step(cluster_id,jar_file,step_args):
print("The cluster id : {}"....
0 votes · 1 answer · 193 views
Can't train sagemaker model using recordio protobuf data saved in S3
I want to preprocess data in Amazon EMR using PySpark and train a machine learning model in SageMaker using pipe mode. The issue I am having now is saving data in S3 and feeding it to the model.
...
1 vote · 1 answer · 267 views
Efficient partial string search on large pyspark dataframes
I'm currently working on a PySpark project where I need to perform a join between two large dataframes. One dataframe contains around 10 million entries with short strings as keywords (2-5 words), ...
0 votes · 1 answer · 209 views
Redshift interpreting boolean data type as bit and hence not able to move hudi table from S3 to Redshift if there is any boolean data type column
I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. Data is stored in the HUDI format in S3 in parquet files. I've created the Pyspark script for full load transfer and ...
0 votes · 1 answer · 624 views
PySpark monotonically_increasing_id results differ locally and on AWS EMR
I created a small function that would assign a composite id to each row to essentially group rows into smaller subsets, given a subset size. Locally on my computer the logic works flawlessly. Once I ...
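This is expected behaviour: monotonically_increasing_id() encodes the partition ID in its upper bits, so the values depend on how the data happens to be partitioned and will differ between a single-partition local run and a multi-partition EMR run. When stable, contiguous IDs are needed, a window-based row_number() is the usual workaround; a minimal sketch with a placeholder ordering column:
# Sketch: partitioning-independent, contiguous row IDs via row_number().
# "some_ordering_col" is a placeholder; the ordering must be deterministic,
# and a global (non-partitioned) window pulls all rows through one partition.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["some_ordering_col"])

w = Window.orderBy("some_ordering_col")
df_with_id = df.withColumn("id", F.row_number().over(w) - 1)
df_with_id.show()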
0 votes · 1 answer · 2k views
ImportError: Install s3fs to access S3 on amazon EMR 6.3.0
I have the following error on my notebook after setting up an EMR 6.3.0 cluster:
An error was encountered:
Install s3fs to access S3
Traceback (most recent call last):
File "/usr/local/lib64/python3.7/...
1 vote · 2 answers · 665 views
AWS emr unable to install python library in bootstrap shell script
Using emr-5.33.1 and Python 3.7.16.
The goal is to add petastorm==0.12.1 to EMR. These are the steps to install it on EMR (they worked until now):
Add all required dependencies of petastorm, and petastorm itself, into S3 ...
0 votes · 1 answer · 404 views
Performance and Data Integrity Issues with Hudi for Long-Term Data Retention
Our project requires that we perform full loads daily, retaining these versions for future queries. Upon implementing Hudi to maintain 6 years of data with the following setup:
"hoodie.cleaner....
1 vote · 1 answer · 818 views
EMR - Pyspark, No module named 'boto3'
I am running an EMR cluster with the following creation statement:
$ aws emr create-cluster \
--name "my_cluster" \
--log-uri "s3n://somebucket/" \
--release-label "emr-6.8.0" ...
-1 votes · 1 answer · 193 views
PySpark `monotonically_increasing_id()` returns 0 for each row
This code
def foobar(df):
return (
df.withColumn("id", monotonically_increasing_id())
.withColumn("foo", lit("bar"))
.withColumn("bar&...
-1 votes · 2 answers · 257 views
Great Expectations installation on AWS EMR
I am trying to use Great Expectations for data-quality purposes.
I am running my jobs on an AWS EMR cluster, and I am trying to launch the Great Expectations job on AWS EMR as well.
I have a bootstrap script for ...
0 votes · 1 answer · 657 views
SparkException: Exception thrown in awaitResult for EMR
I tried running my Spark application from EMR, which right now is just the pi calculation in the tutorial doc: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-application.html
I uploaded ...
1 vote · 1 answer · 751 views
How to create an EMR cluster and submit a spark-submit step using Dagster?
I want to create a Dagster app that creates an EMR cluster and adds a spark-submit step, but due to a lack of documentation or examples I can't figure out how to do that (copilot also struggles with ...
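I can't speak to the Dagster wrappers themselves, but the boto3 calls a Dagster op would end up making look roughly like the sketch below; the release label, instance types, roles and S3 paths are all placeholders:
# Sketch: create an EMR cluster and submit a spark-submit step via boto3.
# All names, roles and S3 paths below are placeholders.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="dagster-demo-cluster",
    ReleaseLabel="emr-6.8.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Steps=[{
        "Name": "spark-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", "s3://my-bucket/jobs/main.py"],
        },
    }],
)
print(response["JobFlowId"])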
1 vote · 1 answer · 378 views
Dagster PySpark not running on EMR
I am trying to build a pipeline in Dagster which does the following:
Launch an EMR cluster using the EmrJobRunner class, by using its run_job_flow function.
Add one or more steps to that cluster to ...
0 votes · 1 answer · 640 views
Install packages on EMR via bootstrap actions not working in Jupyter notebook
I have an EMR cluster using EMR-6.3.1.
I am using the Python3 Kernel.
I have a very simple bootstrap script in S3:
#!/bin/bash
sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 ...
-2 votes · 1 answer · 163 views
How to append JSON to a Python list
I want to create the Python list of JSON-style dicts shown below without repeating code (i.e., using Python functions).
Expected Output:
Steps=[
{
'Name': 'Download Config File',
'...
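A small factory function is usually enough to build such a list without repeating the dictionaries; a minimal sketch assuming command-runner.jar steps, with step names and arguments as placeholders:
# Sketch: build a list of EMR step dicts without repeating the boilerplate.
# Step names and args are placeholders.
def make_step(name, args, on_failure="CONTINUE"):
    """Return one EMR step definition for command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

Steps = [
    make_step("Download Config File", ["aws", "s3", "cp", "s3://my-bucket/config.yaml", "/home/hadoop/"]),
    make_step("Run Spark Job", ["spark-submit", "--deploy-mode", "cluster", "s3://my-bucket/main.py"]),
]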
0 votes · 1 answer · 545 views
Running Jupyter PySpark notebook in EMR, module not found, although it is installed
I am trying to run a Jupyter notebook in EMR. The notebook is using PySpark kernel.
All packages needed for the notebook are installed via bootstrap actions, but when the notebook is run, it fails with ...
3 votes · 0 answers · 343 views
Model inference using pyspark pandas_udf runs twice
Using Python 3.7 and PySpark 2.4.7 on EMR 5.33.1.
From a Keras model trained on pandas data, I'm trying to do distributed inference using PySpark.
I have the inference function wrapped with pandas_udf.
@pandas_udf(...
0 votes · 1 answer · 390 views
Spark stuck in UNDEFINED status when submitting a job in AWS EMR
I am running spark-submit using the following script on an AWS EMR cluster (version emr-5.26.0) with m4.xlarge instances:
#!/bin/sh
# Define variables to get script parameters
MAIN_SPARK_URI=$1 # ...
0 votes · 1 answer · 652 views
How to install additional packages for pyspark in AWS EMR cluster?
I have a PySpark application that uses the boto3 library under the hood.
I am trying to launch the application with a built wheel package that contains the application's dependencies.
External dependencies like boto3 ...
1 vote · 0 answers · 31 views
How to include user config in the SPARK STEPS in airflow?
I want to take the user input from airflow in json format like this:
{"type":"adhoc", "start_date":"2022-01-01", "end_date":"2022-05-01"}
...
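Since the steps argument of EmrAddStepsOperator is a templated field, the usual approach is to reference dag_run.conf with Jinja inside SPARK_STEPS so the values are rendered at trigger time; a rough sketch where the keys match the JSON above and everything else is a placeholder:
# Sketch: pass trigger-time config (dag_run.conf) into EMR step args via Jinja.
# The `steps` parameter of EmrAddStepsOperator is templated, so the expressions
# below are rendered when the DAG run starts. Paths are placeholders.
SPARK_STEPS = [{
    "Name": "adhoc-load",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit", "s3://my-bucket/main.py",
            "--type", "{{ dag_run.conf['type'] }}",
            "--start-date", "{{ dag_run.conf['start_date'] }}",
            "--end-date", "{{ dag_run.conf['end_date'] }}",
        ],
    },
}]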
0 votes · 1 answer · 479 views
Unable to read data from MongoDB using PySpark or Python in AWS EMR
I am trying to read data from a 3-node MongoDB cluster (replica set) using PySpark and native Python in AWS EMR. I am facing issues while executing the code within the AWS EMR cluster, as explained below ...
2 votes · 1 answer · 676 views
Pyspark streaming from Kafka to Hudi
I'm new to using Hudi and I have a problem.
I'm working with an EMR cluster in AWS with PySpark and Kafka, and what I want to do is read a topic from the Kafka cluster with PySpark streaming and then move it to ...
0 votes · 2 answers · 774 views
Submitting a Pyspark job with multiple files in AWS EMR
I have a PySpark job that is split across multiple code files in this structure:
flexible_clendar
- Cache
  - redis_main.py
- Helpers
  - helpers.py
- Spark
  - spark_main.py
...
1 vote · 1 answer · 1k views
Importing custom modules to AWS EMR
I have an S3 repository containing a 'main.py' file, with custom modules that I built (inside 'Cache' and 'Helpers'):
My 'main.py' file looks like this:
from pyspark.sql import SparkSession
from ...
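Besides passing --py-files at submit time, the zipped modules can also be attached at runtime; a minimal sketch, assuming 'Cache' and 'Helpers' are zipped as importable packages (with __init__.py files) and uploaded to a placeholder S3 path, and where the imported names are hypothetical:
# Sketch: make custom modules importable by shipping a zip of them at runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("main").getOrCreate()
spark.sparkContext.addPyFile("s3://my-bucket/deps/modules.zip")  # placeholder path

# Imports only work after addPyFile has distributed the archive.
from Cache.redis_main import RedisCache      # hypothetical class
from Helpers.helpers import clean_record     # hypothetical function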
0 votes · 1 answer · 699 views
Map Reduce program to calculate the average and count
I am trying to calculate the average and count of taxi trips using a MapReduce Python program.
In the map program I have written the following code, which assigns each row a key.
import sys
for line in sys.stdin:
...
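For a Hadoop-streaming job, the count and average come together in the reducer, which sees the mapper output sorted by key; a minimal mapper/reducer sketch, where the comma delimiter and column positions are assumptions about the input file:
# mapper.py -- emit "key<TAB>value" pairs; column indexes are placeholders.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) > 4:
        taxi_id, fare = fields[0], fields[4]
        print(f"{taxi_id}\t{fare}")

# reducer.py -- Hadoop streaming sorts mapper output by key before this runs.
import sys

current_key, total, count = None, 0.0, 0

def emit(key, total, count):
    # print count and average for the finished key group
    if key is not None and count:
        print(f"{key}\t{count}\t{total / count:.2f}")

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        emit(current_key, total, count)
        current_key, total, count = key, 0.0, 0
    total += float(value)   # assumes the value field is numeric
    count += 1

emit(current_key, total, count)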
1 vote · 1 answer · 3k views
Virtualenv in AWS EMR Serverless
I'm trying to run some jobs via the AWS CLI using a virtual environment where I installed some libraries. I followed this guide; the same is here.
But when I run the job I have this error:
Job execution ...
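For what it's worth, the pattern I recall from the EMR Serverless documentation is to pack the virtualenv with venv-pack, upload the archive to S3, and point the job at it through sparkSubmitParameters; a rough boto3 sketch under those assumptions, where the application ID, role ARN and S3 paths are placeholders and the conf keys are quoted from memory:
# Rough sketch: start an EMR Serverless Spark job that uses a packed virtualenv.
import boto3

client = boto3.client("emr-serverless")

spark_params = " ".join([
    "--conf spark.archives=s3://my-bucket/venvs/pyspark_venv.tar.gz#environment",
    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python",
    "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
])

response = client.start_job_run(
    applicationId="00example1234567",                                   # placeholder
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",  # placeholder
    jobDriver={"sparkSubmit": {
        "entryPoint": "s3://my-bucket/jobs/main.py",                     # placeholder
        "sparkSubmitParameters": spark_params,
    }},
)
print(response["jobRunId"])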
2 votes · 0 answers · 586 views
EMR pyspark import functions from other notebook (and retain in memory)
I am attempting to call another notebook from a pyspark notebook on AWS EMR. I can successfully call a notebook that runs functions such as f and the function exists in memory to be called within cell ...
0 votes · 1 answer · 2k views
ModuleNotFoundError: No module named 'boto3'
I have an AWS EMR cluster and all the steps are failing with the error:
'''
ModuleNotFoundError: No module named 'boto3'
'''
Python and pip versions:
'''
python --version
Python 3.7.10
pip --version
...
0 votes · 0 answers · 1k views
Livy logs location (in S3) for an EMR cluster (debugging "Neither SparkSession nor HiveContext/SqlContext is available")
I'm using AWS SageMaker connected to an EMR cluster via Livy. With a "normal" session (default session config) the connection is created and the Spark context works fine, but when adding
spark....
0 votes · 1 answer · 1k views
Sagemaker notebook to EMR pyspark using yarn-client instead of livy
I know there are good tutorials on connecting Sagemaker notebooks to EMR cluster for running pyspark jobs via the SparkMagic pre-installed kernel, however we want to connect to the cluster using yarn-...
0 votes · 2 answers · 253 views
Perform preprocessing operations from pandas on Spark dataframe
I have a rather large CSV so I am using AWS EMR to read the data into a Spark dataframe to perform some operations. I have a pandas function that does some simple preprocessing:
def clean_census_data(...
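If the pandas function works batch-by-batch, mapInPandas (Spark 3.x, requires pyarrow on the cluster) is the usual bridge; a minimal sketch assuming clean_census_data takes and returns a pandas DataFrame with the same columns (the S3 path and the dropna body are placeholders):
# Sketch: apply an existing pandas cleaning function to a Spark DataFrame
# batch-by-batch with mapInPandas; the returned columns must match the schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3://my-bucket/census.csv", header=True)  # placeholder path

def clean_census_data(pdf):
    # placeholder for the real pandas preprocessing
    return pdf.dropna()

def clean_batches(batches):
    for pdf in batches:          # each batch arrives as a pandas DataFrame
        yield clean_census_data(pdf)

cleaned = df.mapInPandas(clean_batches, schema=df.schema)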
1 vote · 0 answers · 758 views
How to run Apache Beam Python code on AWS EMR?
I am just using a Python file that creates an Apache Beam pipeline, and I am trying to run it through AWS EMR.
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-...
4 votes · 1 answer · 5k views
Airflow 2 XCom in Task Groups
I have two tasks inside a TaskGroup that need to pull xcom values to supply the job_flow_id and step_id. Here's the code:
with TaskGroup('execute_my_steps') as execute_my_steps:
config = {some ...
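The detail that usually matters here is that tasks created inside a TaskGroup get the group ID prefixed to their task_id, so xcom_pull must use the fully qualified ID; a rough sketch where the import paths depend on the amazon provider version, 'create_cluster' is assumed to be an EmrCreateJobFlowOperator defined outside the group, and the step definitions are placeholders:
# Sketch: XCom pulls inside a TaskGroup need the "<group_id>." prefix on task_ids.
from datetime import datetime
from airflow import DAG
from airflow.utils.task_group import TaskGroup
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEPS = [{"Name": "demo", "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {"Jar": "command-runner.jar", "Args": ["true"]}}]

with DAG("emr_taskgroup_xcom", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    with TaskGroup("execute_my_steps") as execute_my_steps:
        add_steps = EmrAddStepsOperator(
            task_id="add_steps",
            # 'create_cluster' lives outside the group, so its id is not prefixed.
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
            steps=SPARK_STEPS,
        )
        watch_step = EmrStepSensor(
            task_id="watch_step",
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
            # inside the group, add_steps is addressed as 'execute_my_steps.add_steps'
            step_id="{{ task_instance.xcom_pull(task_ids='execute_my_steps.add_steps', key='return_value')[0] }}",
        )
        add_steps >> watch_step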
0 votes · 0 answers · 2k views
Installing Python libraries in AWS EMR Jupyter Notebook
I am trying to use my oft-used python libraries, such as numpy, pandas, and matplotlib, as well as other libraries like scikit-learn and PySpark on a Jupyter Notebook that is connected to an EMR ...
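With the PySpark (Sparkmagic) kernel, notebook-scoped libraries are the usual route on EMR 5.26.0 and later; a minimal sketch, with package versions as placeholders:
# Sketch: notebook-scoped libraries in an EMR notebook using the PySpark kernel.
# `sc` is the SparkContext the kernel provides; versions are placeholders.
sc.install_pypi_package("pandas==1.3.5")
sc.install_pypi_package("matplotlib")
sc.install_pypi_package("scikit-learn")

sc.list_packages()  # show what is now importable in this notebook session

import pandas as pd
print(pd.__version__)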
0 votes · 1 answer · 468 views
Is there an optimal way in pyspark to write the same dataframe to multiple locations?
I have a dataframe in pyspark and I want to write the same dataframe to two locations in AWS s3. Currently I have the following code running on AWS EMR.
# result is the name of the dataframe
...
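There is no single write-to-two-paths API, but persisting the DataFrame before the two writes at least avoids recomputing its lineage twice; a minimal sketch with placeholder paths:
# Sketch: write one DataFrame to two S3 prefixes without recomputing it.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
result = spark.read.parquet("s3://my-bucket/input/")      # placeholder source

result = result.persist(StorageLevel.MEMORY_AND_DISK)
result.count()  # materialize the cache once

result.write.mode("overwrite").parquet("s3://my-bucket/output/location_a/")
result.write.mode("overwrite").parquet("s3://my-bucket/output/location_b/")

result.unpersist()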
0 votes · 1 answer · 142 views
Using spark to merge 12 large dataframes together
Say I have one dataframe that looks like this
d1 = {'Location+Type':['Census Tract 3, Jefferson County, Alabama: Summary level: 140, state:01> county:073> tract:000300', 'Census Tract 4, ...
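With a dozen frames sharing the join key, functools.reduce keeps the code flat; a minimal sketch assuming every DataFrame has the 'Location+Type' column (the sample data is a placeholder):
# Sketch: merge a list of DataFrames on a shared key with functools.reduce.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder frames: each has the join key plus one distinct metric column.
dfs = [
    spark.createDataFrame([("tract-3", i)], ["Location+Type", f"metric_{i}"])
    for i in range(12)
]

merged = reduce(lambda left, right: left.join(right, on="Location+Type", how="outer"), dfs)
merged.show()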
1 vote · 0 answers · 84 views
Mismatched numpy versions in a venv on a running cluster with EMR notebooks
Trying to upgrade the numpy version on a running cluster with EMR notebooks.
My conf file is:
%%configure -f
{ "conf":{
"spark.pyspark.python": "python3",
...