All Questions
Tagged with amazon-emr python
447 questions
0 votes · 1 answer · 60 views
Unable to access public APIs using the requests library in EMR Serverless
I am getting the error below when trying to fetch an API with the requests library.
Traceback (most recent call last):
File "/tmp/spark-39775710-130a-4403-9182-c557003f351b/lib.zip/urllib3/connection.py&...
0 votes · 1 answer · 246 views
How can I use yaml file in pyspark code executing on EMR cluster?
On the EMR cluster I am running PySpark code that uses a YAML file, and I am getting a path-not-found error.
I am using the following spark-submit command:
spark-submit --deploy-mode client --executor-...
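A common fix for this kind of path error is to ship the YAML file with the job (e.g. spark-submit --files config.yaml) and resolve its local path through SparkFiles; a minimal sketch under that assumption, with config.yaml as a placeholder file name:
# Sketch: read a YAML file distributed to the cluster via `spark-submit --files config.yaml`.
# Assumes PyYAML is available on the nodes; "config.yaml" is a placeholder name.
import yaml
from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("yaml-config-demo").getOrCreate()

# SparkFiles.get() returns the local path of a file shipped with --files / sc.addFile().
with open(SparkFiles.get("config.yaml")) as f:
    config = yaml.safe_load(f)
print(config)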
1 vote · 0 answers · 97 views
In EMR Serverless is there a way to set the LD_LIBRARY_PATH so that pyodbc lib can be used in the python virtual env
I am trying to run my existing Python script (native Python, non-Spark) directly in EMR Serverless using EMR Studio.
Our goal is to execute native Python scripts directly on EMR serverless. To achieve ...
1 vote · 1 answer · 480 views
Access data on EMR directory from EMR Studio: Workspaces (Notebooks)
I have some data saved on S3 which I want to import while running a Python script on EMR.
To do it through Python code on the EMR console, I just create the directories/file on my EMR like this: /home/...
0 votes · 0 answers · 113 views
AWS EMR Jupyter Notebook not saving: '_xsrf' argument missing from POST
In AWS EMR the Jupyter Notebook won't even attach to a cluster without giving the error:
'_xsrf' argument missing from POST
A lot of the solutions I have tried such as refreshing the page when ...
0 votes · 0 answers · 56 views
Running PySpark script on EMR 6.13.0
I am running a PySpark script on EMR version 6.13.0 and have installed the necessary packages using a bootstrap action, as below:
#!/bin/bash
sudo yum update -y
sudo python3 -m pip install pandas PyDictionary ...
1 vote · 0 answers · 266 views
Import Custom Python Modules on EMR Serverless through Spark Configuration
I created a spark_ready.py module that hosts multiple classes that I want to use as a template. I've seen in multiple configurations online that using the "spark.submit.pyFiles" will allow ...
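For reference, the shape this usually takes is: zip the module, upload it to S3, set spark.submit.pyFiles (or --py-files) to that path, and import normally in the entry point. A minimal sketch of the entry point, where the S3 path and class name are placeholders:
# Sketch of the job's entry-point script. Assumes it was submitted with something like
#   --conf spark.submit.pyFiles=s3://my-bucket/deps/spark_ready.zip   (placeholder path)
# so the archive is placed on the PYTHONPATH of the driver and executors.
from pyspark.sql import SparkSession
from spark_ready import TemplateJob  # TemplateJob is a placeholder class name

spark = SparkSession.builder.getOrCreate()
job = TemplateJob(spark)
job.run()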
0 votes · 0 answers · 56 views
Can't reach HBase (on S3) from PySpark
I have created a cluster on EMR and I want to use HBase with PySpark. I am new to using distributed systems, so I might make amateur mistakes, but connecting to HBase from PySpark feels very hard.
My ...
0 votes · 1 answer · 74 views
Fetching list of tags of an EMR Cluster using AWS Lambda Python
Is there any function to get the list of tags of an EMR cluster, like there is for an S3 bucket? For an S3 bucket we have get_bucket_tagging.
I tried using get_list but it does not work. Let me know if there is any ...
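As far as I know the EMR client has no dedicated tag-listing call; the tags come back as part of describe_cluster. A minimal boto3 sketch for a Lambda handler, with the cluster ID as a placeholder:
# Sketch: list the tags of an EMR cluster from a Lambda handler using boto3.
# "j-XXXXXXXXXXXXX" is a placeholder cluster ID.
import boto3

def lambda_handler(event, context):
    emr = boto3.client("emr")
    cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")
    tags = cluster["Cluster"]["Tags"]  # list of {"Key": ..., "Value": ...} dicts
    return tags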
0 votes · 1 answer · 65 views
Ensuring File Size Limit is Adhered to When Batch Processing Downloads in PySpark on EMR
I'm working on a PySpark application running on Amazon EMR, where my task involves downloading files based on a url from a DataFrame. The goal is to continuously download these files on an EMR ...
0 votes · 0 answers · 68 views
Is there a way to merge spark-submit command line args with SparkConf() settings in code?
I am trying to load a jar file for my spark job.
I had been doing something like...
jar_file = "my.jar"
spark_config = (
SparkConf()
.setAppName("test_app")
.set("...
0 votes · 2 answers · 620 views
PySpark error in EMR writing Parquet files to S3
I have a process that reads data from S3, processes it, and then saves it again to S3 in another location in Parquet format. Sometimes I get this error while it is writing:
py4j.protocol.Py4JJavaError:...
0 votes · 1 answer · 198 views
EMR step execution using Airflow failed
I am using Airflow in a local environment to add a step to EMR after cluster creation, using the code block below.
def add_step(cluster_id,jar_file,step_args):
print("The cluster id : {}"....
0 votes · 1 answer · 193 views
Can't train sagemaker model using recordio protobuf data saved in S3
I want to preprocess data in Amazon EMR using PySpark and train a machine learning model in SageMaker using pipe mode. The issue I am having now is saving data in S3 and feeding it to the model.
...
1 vote · 1 answer · 267 views
Efficient partial string search on large pyspark dataframes
I'm currently working on a PySpark project where I need to perform a join between two large dataframes. One dataframe contains around 10 million entries with short strings as keywords (2-5 words), ...
0 votes · 1 answer · 209 views
Redshift interpreting boolean data type as bit and hence not able to move hudi table from S3 to Redshift if there is any boolean data type column
I'm creating a data pipeline in AWS for moving data from S3 to Redshift via EMR. Data is stored in the HUDI format in S3 in parquet files. I've created the Pyspark script for full load transfer and ...
0 votes · 1 answer · 624 views
PySpark monotonically_increasing_id results differ locally and on AWS EMR
I created a small function that would assign a composite id to each row to essentially group rows into smaller subsets, given a subset size. Locally on my computer the logic works flawlessly. Once I ...
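This is expected behaviour: monotonically_increasing_id() encodes the partition ID in its upper bits, so the values depend on how the data happens to be partitioned and will differ between a single-partition local run and a multi-partition EMR run. When stable, contiguous IDs are needed, a window-based row_number() is the usual workaround; a minimal sketch with a placeholder ordering column:
# Sketch: partitioning-independent, contiguous row IDs via row_number().
# "some_ordering_col" is a placeholder; the ordering must be deterministic,
# and a global (non-partitioned) window pulls all rows through one partition.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("c",)], ["some_ordering_col"])

w = Window.orderBy("some_ordering_col")
df_with_id = df.withColumn("id", F.row_number().over(w) - 1)
df_with_id.show()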
0 votes · 1 answer · 2k views
ImportError: Install s3fs to access S3 on amazon EMR 6.3.0
I have the following error on my notebook after setting up an EMR 6.3.0 cluster:
An error was encountered:
Install s3fs to access S3
Traceback (most recent call last):
File "/usr/local/lib64/python3.7/...
1 vote · 2 answers · 665 views
AWS emr unable to install python library in bootstrap shell script
Using emr-5.33.1 and Python 3.7.16.
The goal is to add petastorm==0.12.1 to EMR. These are the steps to install it on EMR (they worked until now):
Add all required dependencies of petastorm, and petastorm itself, into S3 ...
0 votes · 1 answer · 404 views
Performance and Data Integrity Issues with Hudi for Long-Term Data Retention
Our project requires that we perform full loads daily, retaining these versions for future queries. Upon implementing Hudi to maintain 6 years of data with the following setup:
"hoodie.cleaner....
1 vote · 1 answer · 818 views
EMR - Pyspark, No module named 'boto3'
I am running an EMR cluster with the following creation statement:
$ aws emr create-cluster \
--name "my_cluster" \
--log-uri "s3n://somebucket/" \
--release-label "emr-6.8.0" ...
-1 votes · 1 answer · 193 views
PySpark `monotonically_increasing_id()` returns 0 for each row
This code
def foobar(df):
return (
df.withColumn("id", monotonically_increasing_id())
.withColumn("foo", lit("bar"))
.withColumn("bar&...
-1 votes · 2 answers · 257 views
Great Expectations installation on AWS EMR
I am trying to use Great Expectations for data-quality purposes.
I am running my jobs on an AWS EMR cluster, and I am trying to launch the Great Expectations job on AWS EMR as well.
I have a bootstrap script for ...
0 votes · 1 answer · 657 views
SparkException: Exception thrown in awaitResult for EMR
I tried running my Spark application from EMR, which right now is just the pi calculation in the tutorial doc: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-application.html
I uploaded ...
1 vote · 1 answer · 751 views
How to create an EMR cluster and submit a spark-submit step using Dagster?
I want to create a Dagster app that creates an EMR cluster and adds a spark-submit step, but due to a lack of documentation or examples I can't figure out how to do that (copilot also struggles with ...
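I can't speak to the Dagster wrappers themselves, but the boto3 calls a Dagster op would end up making look roughly like the sketch below; the release label, instance types, roles and S3 paths are all placeholders:
# Sketch: create an EMR cluster and submit a spark-submit step via boto3.
# All names, roles and S3 paths below are placeholders.
import boto3

emr = boto3.client("emr")
response = emr.run_job_flow(
    Name="dagster-demo-cluster",
    ReleaseLabel="emr-6.8.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    Steps=[{
        "Name": "spark-step",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", "s3://my-bucket/jobs/main.py"],
        },
    }],
)
print(response["JobFlowId"])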
1 vote · 1 answer · 378 views
Dagster PySpark not running on EMR
I am trying to build a pipeline in Dagster which does the following:
Launch an EMR cluster using the EmrJobRunner class, by using its run_job_flow function.
Add one or more steps to that cluster to ...
0 votes · 1 answer · 640 views
Install packages on EMR via bootstrap actions not working in Jupyter notebook
I have an EMR cluster using EMR-6.3.1.
I am using the Python3 Kernel.
I have a very simple bootstrap script in S3:
#!/bin/bash
sudo python3 -m pip install Cython==0.29.4 boto==2.49.0 boto3==1.18.50 ...
-2 votes · 1 answer · 163 views
How to append JSON to a Python list
I want to create the Python list of JSON-style dicts shown below without repeating code (i.e., using Python functions).
Expected Output:
Steps=[
{
'Name': 'Download Config File',
'...
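A small factory function is usually enough to build such a list without repeating the dictionaries; a minimal sketch assuming command-runner.jar steps, with step names and arguments as placeholders:
# Sketch: build a list of EMR step dicts without repeating the boilerplate.
# Step names and args are placeholders.
def make_step(name, args, on_failure="CONTINUE"):
    """Return one EMR step definition for command-runner.jar."""
    return {
        "Name": name,
        "ActionOnFailure": on_failure,
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": args},
    }

Steps = [
    make_step("Download Config File", ["aws", "s3", "cp", "s3://my-bucket/config.yaml", "/home/hadoop/"]),
    make_step("Run Spark Job", ["spark-submit", "--deploy-mode", "cluster", "s3://my-bucket/main.py"]),
]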
0 votes · 1 answer · 545 views
Running Jupyter PySpark notebook in EMR, module not found, although it is installed
I am trying to run a Jupyter notebook in EMR. The notebook is using PySpark kernel.
All packages needed for the notebook are installed via bootstrap actions, but when the notebook is run, it fails with ...
3 votes · 0 answers · 343 views
Model inference using pyspark pandas_udf runs twice
Using Python 3.7 and PySpark 2.4.7 on EMR 5.33.1.
From a Keras model trained on pandas data, I'm trying to do distributed inference using PySpark.
I have the inference function wrapped with pandas_udf.
@pandas_udf(...
0 votes · 1 answer · 390 views
Spark stuck in UNDEFINED status when submitting a job in AWS EMR
I am running spark-submit using the following script on an AWS EMR cluster (version emr-5.26.0) with m4.xlarge instances:
#!/bin/sh
# Define variables to get script parameters
MAIN_SPARK_URI=$1 # ...
0 votes · 1 answer · 652 views
How to install additional packages for pyspark in AWS EMR cluster?
I have a PySpark application that uses the boto3 library under the hood.
I am trying to launch the application with a built wheel package that contains the application's dependencies.
External dependencies like boto3 ...
1 vote · 0 answers · 31 views
How to include user config in the SPARK STEPS in airflow?
I want to take the user input from airflow in json format like this:
{"type":"adhoc", "start_date":"2022-01-01", "end_date":"2022-05-01"}
...
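Since the steps argument of EmrAddStepsOperator is a templated field, the usual approach is to reference dag_run.conf with Jinja inside SPARK_STEPS so the values are rendered at trigger time; a rough sketch where the keys match the JSON above and everything else is a placeholder:
# Sketch: pass trigger-time config (dag_run.conf) into EMR step args via Jinja.
# The `steps` parameter of EmrAddStepsOperator is templated, so the expressions
# below are rendered when the DAG run starts. Paths are placeholders.
SPARK_STEPS = [{
    "Name": "adhoc-load",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",
        "Args": [
            "spark-submit", "s3://my-bucket/main.py",
            "--type", "{{ dag_run.conf['type'] }}",
            "--start-date", "{{ dag_run.conf['start_date'] }}",
            "--end-date", "{{ dag_run.conf['end_date'] }}",
        ],
    },
}]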
0 votes · 1 answer · 479 views
Unable to read data from MongoDB using PySpark or Python in AWS EMR
I am trying to read data from a 3-node MongoDB cluster (replica set) using PySpark and native Python in AWS EMR. I am facing issues while executing the code within the AWS EMR cluster, as explained below ...
2 votes · 1 answer · 676 views
Pyspark streaming from Kafka to Hudi
I'm new to using Hudi and I have a problem.
I'm working with an EMR cluster in AWS with PySpark and Kafka, and what I want to do is read a topic from the Kafka cluster with PySpark streaming and then move it to ...
0 votes · 2 answers · 774 views
Submitting a Pyspark job with multiple files in AWS EMR
I have a PySpark job that is split across multiple code files in this structure:
flexible_clendar
- Cache
  - redis_main.py
- Helpers
  - helpers.py
- Spark
  - spark_main.py
...
1 vote · 1 answer · 1k views
Importing custom modules to AWS EMR
I have an S3 repository containing a 'main.py' file, with custom modules that I built (inside 'Cache' and 'Helpers'):
My 'main.py' file looks like this:
from pyspark.sql import SparkSession
from ...
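Besides passing --py-files at submit time, the zipped modules can also be attached at runtime; a minimal sketch, assuming 'Cache' and 'Helpers' are zipped as importable packages (with __init__.py files) and uploaded to a placeholder S3 path, and where the imported names are hypothetical:
# Sketch: make custom modules importable by shipping a zip of them at runtime.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("main").getOrCreate()
spark.sparkContext.addPyFile("s3://my-bucket/deps/modules.zip")  # placeholder path

# Imports only work after addPyFile has distributed the archive.
from Cache.redis_main import RedisCache      # hypothetical class
from Helpers.helpers import clean_record     # hypothetical function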
0 votes · 1 answer · 699 views
Map Reduce program to calculate the average and count
I am trying to calculate the average and count of taxi trips using a MapReduce Python program.
In the map program I have written the following code, which assigns each row a key.
import sys
for line in sys.stdin:
...
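For a Hadoop-streaming job, the count and average come together in the reducer, which sees the mapper output sorted by key; a minimal mapper/reducer sketch, where the comma delimiter and column positions are assumptions about the input file:
# mapper.py -- emit "key<TAB>value" pairs; column indexes are placeholders.
import sys

for line in sys.stdin:
    fields = line.strip().split(",")
    if len(fields) > 4:
        taxi_id, fare = fields[0], fields[4]
        print(f"{taxi_id}\t{fare}")

# reducer.py -- Hadoop streaming sorts mapper output by key before this runs.
import sys

current_key, total, count = None, 0.0, 0

def emit(key, total, count):
    # print count and average for the finished key group
    if key is not None and count:
        print(f"{key}\t{count}\t{total / count:.2f}")

for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        emit(current_key, total, count)
        current_key, total, count = key, 0.0, 0
    total += float(value)   # assumes the value field is numeric
    count += 1

emit(current_key, total, count)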
1 vote · 1 answer · 3k views
Virtualenv in AWS EMR Serverless
I'm trying to run some jobs via the AWS CLI using a virtual environment where I installed some libraries. I followed this guide; the same is here.
But when I run the job I have this error:
Job execution ...
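For what it's worth, the pattern I recall from the EMR Serverless documentation is to pack the virtualenv with venv-pack, upload the archive to S3, and point the job at it through sparkSubmitParameters; a rough boto3 sketch under those assumptions, where the application ID, role ARN and S3 paths are placeholders and the conf keys are quoted from memory:
# Rough sketch: start an EMR Serverless Spark job that uses a packed virtualenv.
import boto3

client = boto3.client("emr-serverless")

spark_params = " ".join([
    "--conf spark.archives=s3://my-bucket/venvs/pyspark_venv.tar.gz#environment",
    "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python",
    "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python",
])

response = client.start_job_run(
    applicationId="00example1234567",                                   # placeholder
    executionRoleArn="arn:aws:iam::123456789012:role/emr-serverless-job-role",  # placeholder
    jobDriver={"sparkSubmit": {
        "entryPoint": "s3://my-bucket/jobs/main.py",                     # placeholder
        "sparkSubmitParameters": spark_params,
    }},
)
print(response["jobRunId"])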
2 votes · 0 answers · 586 views
EMR pyspark import functions from other notebook (and retain in memory)
I am attempting to call another notebook from a pyspark notebook on AWS EMR. I can successfully call a notebook that runs functions such as f and the function exists in memory to be called within cell ...
0 votes · 1 answer · 2k views
ModuleNotFoundError: No module named 'boto3'
I have an AWS EMR cluster and all the steps are failing with the error:
'''
ModuleNotFoundError: No module named 'boto3'
'''
Python and pip versions:
'''
python --version
Python 3.7.10
pip --version
...
0 votes · 0 answers · 1k views
Livy logs location (in S3) for an EMR cluster (debugging "Neither SparkSession nor HiveContext/SqlContext is available")
I'm using AWS SageMaker connected to an EMR cluster via Livy. With a "normal" session (default session config) the connection is created and the Spark context works fine, but when adding
spark....
0 votes · 1 answer · 1k views
Sagemaker notebook to EMR pyspark using yarn-client instead of livy
I know there are good tutorials on connecting Sagemaker notebooks to EMR cluster for running pyspark jobs via the SparkMagic pre-installed kernel, however we want to connect to the cluster using yarn-...
0 votes · 2 answers · 253 views
Perform preprocessing operations from pandas on Spark dataframe
I have a rather large CSV so I am using AWS EMR to read the data into a Spark dataframe to perform some operations. I have a pandas function that does some simple preprocessing:
def clean_census_data(...
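If the pandas function works batch-by-batch, mapInPandas (Spark 3.x, requires pyarrow on the cluster) is the usual bridge; a minimal sketch assuming clean_census_data takes and returns a pandas DataFrame with the same columns (the S3 path and the dropna body are placeholders):
# Sketch: apply an existing pandas cleaning function to a Spark DataFrame
# batch-by-batch with mapInPandas; the returned columns must match the schema.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.csv("s3://my-bucket/census.csv", header=True)  # placeholder path

def clean_census_data(pdf):
    # placeholder for the real pandas preprocessing
    return pdf.dropna()

def clean_batches(batches):
    for pdf in batches:          # each batch arrives as a pandas DataFrame
        yield clean_census_data(pdf)

cleaned = df.mapInPandas(clean_batches, schema=df.schema)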
1 vote · 0 answers · 758 views
How to run Apache Beam Python code on AWS EMR?
I am just using a Python file that creates an Apache Beam pipeline, and I am trying to run it through AWS EMR.
hadoop jar /var/lib/aws/emr/step-runner/hadoop-jars/command-runner.jar spark-submit --deploy-...
4 votes · 1 answer · 5k views
Airflow 2 XCom in Task Groups
I have two tasks inside a TaskGroup that need to pull xcom values to supply the job_flow_id and step_id. Here's the code:
with TaskGroup('execute_my_steps') as execute_my_steps:
config = {some ...
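The detail that usually matters here is that tasks created inside a TaskGroup get the group ID prefixed to their task_id, so xcom_pull must use the fully qualified ID; a rough sketch where the import paths depend on the amazon provider version, 'create_cluster' is assumed to be an EmrCreateJobFlowOperator defined outside the group, and the step definitions are placeholders:
# Sketch: XCom pulls inside a TaskGroup need the "<group_id>." prefix on task_ids.
from datetime import datetime
from airflow import DAG
from airflow.utils.task_group import TaskGroup
from airflow.providers.amazon.aws.operators.emr import EmrAddStepsOperator
from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

SPARK_STEPS = [{"Name": "demo", "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {"Jar": "command-runner.jar", "Args": ["true"]}}]

with DAG("emr_taskgroup_xcom", start_date=datetime(2023, 1, 1),
         schedule_interval=None, catchup=False) as dag:
    with TaskGroup("execute_my_steps") as execute_my_steps:
        add_steps = EmrAddStepsOperator(
            task_id="add_steps",
            # 'create_cluster' lives outside the group, so its id is not prefixed.
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
            steps=SPARK_STEPS,
        )
        watch_step = EmrStepSensor(
            task_id="watch_step",
            job_flow_id="{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}",
            # inside the group, add_steps is addressed as 'execute_my_steps.add_steps'
            step_id="{{ task_instance.xcom_pull(task_ids='execute_my_steps.add_steps', key='return_value')[0] }}",
        )
        add_steps >> watch_step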
0 votes · 0 answers · 2k views
Installing Python libraries in AWS EMR Jupyter Notebook
I am trying to use my oft-used python libraries, such as numpy, pandas, and matplotlib, as well as other libraries like scikit-learn and PySpark on a Jupyter Notebook that is connected to an EMR ...
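With the PySpark (Sparkmagic) kernel, notebook-scoped libraries are the usual route on EMR 5.26.0 and later; a minimal sketch, with package versions as placeholders:
# Sketch: notebook-scoped libraries in an EMR notebook using the PySpark kernel.
# `sc` is the SparkContext the kernel provides; versions are placeholders.
sc.install_pypi_package("pandas==1.3.5")
sc.install_pypi_package("matplotlib")
sc.install_pypi_package("scikit-learn")

sc.list_packages()  # show what is now importable in this notebook session

import pandas as pd
print(pd.__version__)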
0 votes · 1 answer · 468 views
Is there an optimal way in pyspark to write the same dataframe to multiple locations?
I have a dataframe in pyspark and I want to write the same dataframe to two locations in AWS s3. Currently I have the following code running on AWS EMR.
# result is the name of the dataframe
...
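There is no single write-to-two-paths API, but persisting the DataFrame before the two writes at least avoids recomputing its lineage twice; a minimal sketch with placeholder paths:
# Sketch: write one DataFrame to two S3 prefixes without recomputing it.
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
result = spark.read.parquet("s3://my-bucket/input/")      # placeholder source

result = result.persist(StorageLevel.MEMORY_AND_DISK)
result.count()  # materialize the cache once

result.write.mode("overwrite").parquet("s3://my-bucket/output/location_a/")
result.write.mode("overwrite").parquet("s3://my-bucket/output/location_b/")

result.unpersist()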
0 votes · 1 answer · 142 views
Using spark to merge 12 large dataframes together
Say I have one dataframe that looks like this
d1 = {'Location+Type':['Census Tract 3, Jefferson County, Alabama: Summary level: 140, state:01> county:073> tract:000300', 'Census Tract 4, ...
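With a dozen frames sharing the join key, functools.reduce keeps the code flat; a minimal sketch assuming every DataFrame has the 'Location+Type' column (the sample data is a placeholder):
# Sketch: merge a list of DataFrames on a shared key with functools.reduce.
from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder frames: each has the join key plus one distinct metric column.
dfs = [
    spark.createDataFrame([("tract-3", i)], ["Location+Type", f"metric_{i}"])
    for i in range(12)
]

merged = reduce(lambda left, right: left.join(right, on="Location+Type", how="outer"), dfs)
merged.show()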
1 vote · 0 answers · 84 views
Mismatched numpy versions in a venv on a running cluster with EMR notebooks
Trying to upgrade the numpy version on a running cluster with EMR notebooks.
My conf file is:
%%configure -f
{ "conf":{
"spark.pyspark.python": "python3",
...