
I created a Python shell job that connects to Redshift and fetches data. The program below works fine on my local system. The steps and program are given below.

Program:-

import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker

# >>>>>>>> MAKE CHANGES HERE <<<<<<<<
DATABASE = "#####"
USER = "#####"
PASSWORD = "#####"
HOST = "#####.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "test"  # default is "public"

# ---- connection and session creation ----
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER, PASSWORD, HOST, PORT, DATABASE)
engine = sa.create_engine(connection_string)
Session = sessionmaker(bind=engine)
s = Session()

# set the search path so unqualified table names resolve to our schema
s.execute("SET search_path TO %s" % SCHEMA)

# ---- write queries from here ----
query = "SELECT * FROM test1 LIMIT 2;"
rr = s.execute(query)
all_results = rr.fetchall()

def pretty(all_results):
    for row in all_results:
        print("row start >>>>>>>>>>>>>>>>>>>>")
        for r in row:
            print(" ----", r)
        print("row end >>>>>>>>>>>>>>>>>>>>>>")

pretty(all_results)

# ---- close the session at the end ----
s.close()

Steps:-

  • sudo pip install psycopg2
  • sudo pip install sqlalchemy
  • sudo pip install sqlalchemy-redshift

I uploaded the files psycopg2-2.8.4-cp27-cp27m-win32.whl, Flask_SQLAlchemy-2.4.1-py2.py3-none-any.whl, and sqlalchemy_redshift-0.7.5-py2.py3-none-any.whl to S3 (s3://####/lib/), and mapped that folder in the Python library path of the AWS Glue job.

When I run the program, the following error occurs.

Traceback (most recent call last):
  File "/tmp/runscript.py", line 113, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 56, in download_and_install
    download_from_s3(s3_file_path, local_file_path)
  File "/tmp/runscript.py", line 81, in download_from_s3
    s3.download_file(bucket_name, s3_key, new_file_path)
  File "/usr/local/lib/python2.7/site-packages/boto3/s3/inject.py", line 172, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/site-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/usr/local/lib/python2.7/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/usr/local/lib/python2.7/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

PS:- The Glue Job Role has full access to S3.

Please suggest how to make these libraries available to the program.

  • Can you try passing these files in --extra-py-files and not as library paths? Also pass the absolute path for each file, separated by commas. Commented Dec 22, 2019 at 2:36
  • Hi, I have done this (lib files separated by commas (",")). Now I am getting another issue: "WARNING: The directory '/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag." Commented Dec 23, 2019 at 3:13

2 Answers


You can specify your own Python libraries packaged as .egg or .whl files under the --extra-py-files flag, as shown in the example below.

Command line example:

aws glue create-job --name python-redshift-test-cli --role role --command '{"Name" :  "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' 
     --connections Connections=connection-name --default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg", "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"]}'
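The same job definition can be sketched from Python with boto3 (the bucket, role, connection, and job names are the illustrative placeholders from the CLI example above; note that Glue's DefaultArguments values must be strings, so multiple files go in one comma-separated string rather than a JSON list):

```python
# import boto3  # only needed for the real API call

# Placeholders from the CLI example above; replace with real values.
script_location = "s3://MyBucket/python/library/redshift_test.py"
extra_py_files = ",".join([
    "s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg",
    "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl",
])

job_spec = {
    "Name": "python-redshift-test-cli",
    "Role": "role",
    "Command": {"Name": "pythonshell", "ScriptLocation": script_location},
    "Connections": {"Connections": ["connection-name"]},
    "DefaultArguments": {"--extra-py-files": extra_py_files},
}

# Uncomment to actually create the job (requires AWS credentials):
# boto3.client("glue").create_job(**job_spec)
```

The boto3 call is left commented so the sketch runs without credentials; `create_job` accepts exactly these keyword arguments.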

Reference: Create a glue job with extra python library

  • Hi, I have done this; now getting this error: "WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7fb88dd1db10>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/sqlalchemy/ ERROR: Could not find a version that satisfies the requirement SQLAlchemy>=0.8.0 (from Flask-SQLAlchemy==2.4.1) (from versions: none) ERROR: No matching distribution found for SQLAlchemy>=0.8.0 (from Flask-SQLAlchemy==2.4.1)" Commented Dec 23, 2019 at 5:36
  • Traceback (most recent call last): File "/tmp/runscript.py", line 113, in <module> download_and_install(args.extra_py_files) File "/tmp/runscript.py", line 63, in download_and_install subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path]) File "/usr/local/lib/python2.7/subprocess.py", line 190, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', Commented Dec 23, 2019 at 5:36
  • @DeepeshUniyal could you add a code snippet of your aws cli command. From the above code it looks like it is trying to load sqlalchemy from Flask_SQLAlchemy whl and couldn't find it. Commented Dec 23, 2019 at 5:51
  • @DeepeshUniyal : Also I see on your local environment you are using "pip install sqlalchemy" and in this case you are passing flask_sqlalchemy. Any specific reason for that? I don't see any flask dependency in your code. Commented Dec 23, 2019 at 5:54
  • Hi, I followed the given link and the dependency problem has been solved, but now I am not able to connect to Redshift; it seems to be a security issue. Redshift is already open and I can access it manually from anywhere, but not from the Glue job. Logs: File "/tmp/glue-python-scripts-gm07eo/redshift_test.py", line 3, in <module> File "/glue/lib/installation/redshift_module-0.1-py2.7.egg/redshift_module/pygresql_redshift_common.py", line 8, in: could not connect to server: Connection timed out. Is the server running on host "testap-south-1.redshift.amazonaws.com" (14.236.218.0) and accepting TCP/IP connections on port 5439? Commented Dec 23, 2019 at 6:46

There is a simple way to import Python dependencies using .whl files, which can be found on PyPI for the particular module.

You can also add multiple wheel files from S3, separated by commas.

For example: "s3://xxxxxxxxx/common/glue/glue_whl/fastparquet-0.4.1-cp37-cp37m-macosx_10_9_x86_64.whl,s3://xxxxxx/common/glue/glue_whl/packaging-20.4-py2.py3-none-any.whl,s3://xxxxxx/common/glue/glue_whl/s3fs-0.5.0-py3-none-any.whl"
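The comma-separated value above can be attached to an existing job via boto3's `update_job` as well; a sketch, using the illustrative S3 paths above and a hypothetical job name:

```python
# import boto3  # only needed for the real API call

# Illustrative wheel paths from the example above.
wheels = [
    "s3://xxxxxxxxx/common/glue/glue_whl/fastparquet-0.4.1-cp37-cp37m-macosx_10_9_x86_64.whl",
    "s3://xxxxxx/common/glue/glue_whl/packaging-20.4-py2.py3-none-any.whl",
    "s3://xxxxxx/common/glue/glue_whl/s3fs-0.5.0-py3-none-any.whl",
]
# Glue expects a single comma-separated string, not a list.
extra_py_files = ",".join(wheels)

# Uncomment to apply to a real job (job name is hypothetical):
# boto3.client("glue").update_job(
#     JobName="my-glue-job",
#     JobUpdate={"DefaultArguments": {"--extra-py-files": extra_py_files}},
# )
```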

