84 questions
0
votes
1
answer
87
views
Pyspark error- Invalid argument, not a string or column
I have a DataFrame in PySpark, df_all. It has some data, and I need to do the following:
count = ceil(df_all.count()/1000000)
It gives the following error
TypeError: Invalid argument, not a string or ...
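A common cause of this TypeError is that `ceil` resolves to `pyspark.sql.functions.ceil` (for example after a wildcard import, which the excerpt does not show, so this is an assumption) rather than to `math.ceil`. A minimal sketch of the plain-Python arithmetic, with a hypothetical `row_count` standing in for `df_all.count()`:

```python
import math

# Hypothetical stand-in for df_all.count(), which returns a plain Python int.
row_count = 2_500_001

# math.ceil operates on Python numbers. pyspark.sql.functions.ceil instead
# expects a column name or Column, which matches the error message above.
num_chunks = math.ceil(row_count / 1_000_000)
print(num_chunks)
```

Referring to `math.ceil` by its qualified name keeps it from being shadowed by a Spark function of the same name.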
2
votes
1
answer
794
views
How do you write a presto query to split a string into its own column
Trying to split a string into multiple columns in Qubole using a Presto query.
{"field0":[{"startdate":"2022-07-13","lastnightdate":"2022-07-16","...
0
votes
1
answer
834
views
Presto Pivoting Data
I am really new to Presto and having trouble pivoting data in it.
The method I am using is the following:
select
distinct location_id,
case when role_group = 'IT' then employee_number end as ...
2
votes
1
answer
510
views
need regexp_extract help, beginner
I have a string column "49b8b35e-b62c-4a42-9d73-192d131d127a,03c8a7e0-5153-11ec-873a-0242ac11000a,eec8aee4-0500-4940-b319-15924cc2d248"
This string column has 3 values separated by ","...
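The excerpt is cut off, so the exact goal is not visible. Assuming the aim is to pull out each of the three comma-separated values, the capture-group idea can be sketched in Python. The variable names are illustrative only, and whether the same pattern can be passed unchanged to the engine's `regexp_extract` with a group index has not been tested here:

```python
import re

# Sample value copied from the question.
value = (
    "49b8b35e-b62c-4a42-9d73-192d131d127a,"
    "03c8a7e0-5153-11ec-873a-0242ac11000a,"
    "eec8aee4-0500-4940-b319-15924cc2d248"
)

# One capture group per field; [^,]+ matches everything up to the next comma.
pattern = re.compile(r"^([^,]+),([^,]+),([^,]+)$")

match = pattern.match(value)
if match is None:
    raise ValueError("expected exactly three comma-separated values")

first, second, third = match.groups()
print(first, second, third, sep="\n")
```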
2
votes
1
answer
52
views
Data comparisons in Qubole
I am very new to Qubole. We recently migrated Oracle ebiz data to Salesforce. We have both Ebiz and Salesforce data in the Qubole Data Lake. There are some discrepancies between Ebiz and Salesforce. What ...
1
vote
0
answers
475
views
Insert overwrite doesn't delete all the old data files
We are trying to insert overwrite a Hive table. Most of the time it overwrites as expected, i.e. deleting any old files and replacing them with new files. We are seeing some inconsistencies with this behavior, ...
1
vote
1
answer
970
views
Retrieve value in an array of an array with struct
I have a column in a Hive table with type:
array<array<struct<type:string,value:string,currency:string>>>
Here is the sample of data in the column:
[
[
{
"type":...
0
votes
0
answers
358
views
Query Qubole data in Python
I'm trying to query Qubole data in Python, but running into some issues. Below is my code:
from qds_sdk.qubole import Qubole
Qubole.configure(api_token="api_token", api_url="https://us....
0
votes
1
answer
782
views
How to safely insert parameters into a SQL query and get the resulting query?
I have to use a non-DBAPI-compliant library to interact with a database (qds_sdk for Qubole). This library only allows sending raw SQL queries without parameters. Thus I would like a SQL injection-...
1
vote
1
answer
83
views
Exclude records with certain values in Qubole
Using Qubole
I have
Table A (columns in json parsed...)
ID Recommendation Decision
1 GOOD GOOD
2 BAD BAD
2 GOOD BAD
3 GOOD BAD
4 ...
1
vote
2
answers
126
views
How to connect UiPath to Qubole Hive cluster and run a query
One of the teams using RPA in my company wants to automate reporting that is run in Qubole - Hive environment. The initial approach is to unleash the robot to log in to Okta, then Workbench in Qubole, ...
0
votes
2
answers
320
views
How to get Python in Qubole to save CSV and TXT files to Azure data lake?
I have Qubole connected to Azure data lake, and I can start a spark cluster, and run PySpark on it. However, I can't save any native Python output, like text files or CSVs. I can't save anything other ...
1
vote
2
answers
305
views
Result-set inconsistency between hive and hive-llap
We are using Hive 3.1.x clusters on HDI 4.0, one being LLAP and the other just Hive.
We've created managed tables on both clusters, with a row count of 272409.
Before merge on both ...
0
votes
1
answer
477
views
How to change the timeout value when running commands on QDS
I have a spark-submit command that calls my Python script. The code runs for more than 36 hours; however, because of the QDS timeout limit of 36 hours, my command gets killed after 36 hours.
Can someone help ...
0
votes
1
answer
321
views
Logging and Debugging on Qubole
How does one log on Qubole/access logs from spark on Qubole? The setup I have:
java library (JAR)
Zeppelin Notebook (Scala), simply calling a method from the library
Spark, Yarn cluster
Log4j2 used ...
0
votes
1
answer
293
views
Spark Structured Streaming using spark-acid writeStream (with checkpoint) throwing org.apache.hadoop.fs.FileAlreadyExistsException
In our Spark app, we use Spark Structured Streaming. It uses Kafka as the input stream and HiveAcid as the writeStream to a Hive table.
For HiveAcid, we use an open-source library called spark-acid from Qubole: ...
2
votes
1
answer
11k
views
Pyspark Logging: Printing information at the wrong log level
Thanks for your time!
I'd like to create and print legible summaries of my (hefty) data to my output when debugging my code, but stop creating and printing those summaries once finished to speed ...
0
votes
1
answer
1k
views
Avoid pre-signed URL expiry when IAM role key rotates
In Airflow I have 2 tasks defined that run every day:
the first one creates a zip file and saves it in AWS under s3://{bucket-name}/foo/bar/{date}/archive.zip
the second one pre-signs that url (...
0
votes
3
answers
136
views
How to query table partitions list using
I need to programmatically query Qubole for the list of partitions for a Hive table. I can do this by calling the correct API endpoint as described here, but I would like to use the qds-sdk-java ...
-1
votes
1
answer
243
views
trying to execute s3-sqs qubole connector for spark structured streaming
I am trying to follow https://github.com/qubole/s3-sqs-connector and load the connector, but it seems the connector is not available on Maven, and while generating the build manually the ...
0
votes
1
answer
424
views
Qubole Presto datatype "Map" using the Like Operator
So I am trying to apply a simple like function for a Qubole query on Presto. For a string datatype I can simply do like '%United States of America%'.
However for the column I am trying to apply ...
1
vote
1
answer
329
views
Spark Submit Default Command line options
How can we change the parameters in the Spark Submit Default Command Line Options in Qubole?
There is an option to override the values if needed under "Spark Submit Command Line Options", but this ...
-1
votes
1
answer
89
views
Can I write an HTML script and pass information from the script to a cell on Qubole?
Is it possible to write an HTML script and have the user interact on the HTML script and pass the data back to the zeppelin cell and have it rerun the data passed back?
Thank you!
Update:
Have some ...
0
votes
1
answer
131
views
How to upgrade Python version on Qubole?
The current version on Qubole is 3.5.3, and some packages, like PyMC3 and future XGBoost need higher versions.
How do I upgrade? And would that affect other clusters' settings?
0
votes
1
answer
462
views
Unable to write or read from S3 bucket with Default AWS KMS encryption enabled
I am unable to read or write into a Default AWS KMS encrypted bucket without using the following configuration on my Qubole cluster
fs.s3a.server-side-encryption-algorithm=SSE-KMS
fs.s3a.server-side-...
0
votes
1
answer
219
views
Qubole Kinesis Connector for Spark structured streaming throws an error
We are using Qubole Kinesis Connector (jar) for Spark structured streaming. This used to work fine but suddenly, it is throwing an error "S3 filesystem not found".
We could use the KCL but we need ...
0
votes
2
answers
69
views
Rest api in testdrive account?
Hi, I am using the Qubole trial version with a test drive account,
so I am not getting an API token from the Control Panel's My Accounts tab in Qubole.
Is there a way to access the REST APIs now?
Thanks in advance.
0
votes
2
answers
374
views
Running Scala jobs in Scheduler
My job runs fine in my notebook, but when I copy and paste the script into the Spark Scala scheduled job, I run into errors like "script.scala:15: error: not found: value sqlContext".
What do I need ...
0
votes
1
answer
85
views
PySpark Machine Learning on Wide Data in Qubole
I have a large dataset, with roughly 250 features, that I would like to use in a gradient-boosted trees classifier. I have millions of observations, but I'm having trouble getting the model to work ...
0
votes
1
answer
95
views
Setting up AWS Glue to crawl Qubole
Currently I work with Qubole to access Hive data. I've added metadata from several databases, and want to add all the Hive metadata to AWS Glue. Is this possible? Any help is appreciated.
0
votes
1
answer
109
views
Scale plot size of matplotlib plots in Qubole Notebook
Is it possible to increase the size of a plot drawn using z.showplot() in Qubole notebooks?
import matplotlib.pyplot as plt
plt.figure()
plt.bar(pandas_df_hr_sg[:]['hour'],pandas_df_hr_sg[:]['...
0
votes
2
answers
265
views
How do I upgrade a library in Qubole's Jupyter Notebook, using PySpark?
Is there a way to do it right from a cell in the notebook? similar to pip install ... --upgrade
I didn't know how to do what's instructed on https://docs.qubole.com/en/latest/faqs/general-questions/...
0
votes
1
answer
177
views
How to pass --properties-file to spark-submit in Qubole?
I am using Spark in Qubole by having the clusters created in AWS. In Qubole Workbench, when I execute the below Command Line, it works fine and the command is successful
/usr/lib/spark/bin/spark-...
0
votes
2
answers
161
views
How to import a .py file to Qubole?
I'm connecting to Azure data lake, and I have the file there, but it's in a different path, and I don't know how to import it.
Thank you in advance for your help!
0
votes
1
answer
52
views
In the new Analyze UI, how do I edit the title of my query?
In the new Qubole Analyze UI that came out recently, I cannot seem to find a way to change the title of a command. In the old interface, I could click on the command title and it would become an ...
1
vote
1
answer
692
views
How to create hive external table with avro file on qubole?
Can someone point me to the docs for creating an external table on Qubole based on Avro files?
CREATE TABLE my_table_name
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS ...
0
votes
1
answer
129
views
Performance analysis using Sparklens of Spark Streaming Application
I am trying to get performance analysis of a spark streaming application using sparklens. It is giving results like this
Executor count 1 ( 80%) estimated time 01m 29s and estimated cluster ...
0
votes
0
answers
904
views
How to fix 'Malformed class name' error in Spark Scala?
In Qubole notebook I am trying to get certain string from API response. It seems to be working just fine for sample data but fails when I use the full set. Spark version: 2.3.1; Scala version: 2.11; ...
1
vote
2
answers
201
views
Implement case class inside a class
I am using the below code to run in Qubole Notebook and the code is running successfully.
case class cls_Sch(Id:String, Name:String)
class myClass {
implicit val sparkSession = org.apache.spark....
1
vote
1
answer
741
views
Extracting json field from string in Hive using dataset
I am trying a very basic hive query. I am trying to extract a json field from a dataset but I always get
\N
for the json field, however some_string comes okay
Here is my query :
WITH dataset AS ...
0
votes
1
answer
80
views
retrieve size of data copied with hadoop distcp
I am running a hadoop distcp command as below:
hadoop distcp src-loc target-loc
I want to know the size of the data copied by running this command.
I am planning to run the command on Qubole.
Any ...
2
votes
1
answer
6k
views
How to create external tables from parquet files in s3 using hive 1.2?
I have created an external table in Qubole(Hive) which reads parquet(compressed: snappy) files from s3, but on performing a SELECT * table_name I am getting null values for all columns except the ...
1
vote
1
answer
73
views
Get Qubole data row wise using java
I am trying to run a Hive query using the Qubole SDK. Though I am able to get the desired result as a string, in order to process it better, I am looking to access it row-wise. Something like a list of java ...
1
vote
1
answer
75
views
Recommendation on Performance optimization for SQL code
I have code in Qubole that takes almost 3 hours to execute. I am looking for recommendations to decrease the execution time.
WITH
-- Get latest date - 10 days before as day
d
AS (
...
1
vote
1
answer
520
views
Syncing Qubole HIve table to Snowflake with Struct field
I have a table like the following in Qubole:
use dm;
CREATE EXTERNAL TABLE IF NOT EXISTS fact (
id string,
fact_attr struct<
attr1 : String,
attr2 : String
>
)
STORED AS ...
1
vote
2
answers
200
views
Different results when distinct count by different time periods
I am trying to get a count of unique visitors. I first checked the total without separating it by any time frame.
Main table (big data table sample):
+-----------+----+-------+
|theDateTime|vD | ...
1
vote
1
answer
1k
views
Big files causing shuffle error in hadoop map reduce
I am seeing the following error when I try to process big files (size > 35GB), but it doesn't happen with smaller files (size < 10GB).
App > Error: org.apache.hadoop.mapreduce....
0
votes
1
answer
208
views
Get correct value from array in Hive QL
I have a Wrapped Array and want to only get the corresponding value struct when I query with LATERAL VIEW EXPLODE.
SAMPLE STRUCTURE:
COLUMNNAME: theARRAY
WrappedArray([null,theVal,valTags,[123,...
2
votes
1
answer
137
views
Debug failed shuffles in hadoop map reduces
As the size of the input file increases, I am seeing the number of failed shuffles increase and the job completion time increase non-linearly.
e.g.
75GB took 1h
86GB took 5h
I also see average shuffle time increase 10 ...
0
votes
0
answers
61
views
Convert column in presto from epoch to date [duplicate]
I tried this but that didn't work.
cast(from_unixtime('1532568232662880')) as date
Any other ideas?
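Two things stand out in the expression above: the argument is a quoted string rather than a number, and the `as date` sits outside the `cast(...)` parentheses. The 16-digit value also looks like microseconds rather than seconds. That unit is an assumption, since the question does not state it. Under that assumption, the arithmetic can be sketched in Python before translating it back to SQL:

```python
from datetime import datetime, timezone

# Value from the question, which arrives as a string.
epoch_micros = int("1532568232662880")

# Assumption: the value is microseconds since the Unix epoch (16 digits),
# so divide by one million to get seconds.
epoch_seconds = epoch_micros / 1_000_000

# Interpret the timestamp in UTC and keep only the calendar date.
as_date = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).date()
print(as_date)
```

The same steps (convert to a number, scale to seconds, convert to a timestamp, cast to a date) would be the shape of the query-side fix.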