Spark SQL Hands-On
---------------------
In the Cloudera VM, spark-shell already provides sqlContext, but on Windows we need to create it ourselves.
In Windows:
scala> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
scala> sqlContext
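As a quick sanity check (not part of the original steps), the new context can be exercised by turning a small local collection into a DataFrame; the values and column names below are only illustrative.
scala> import sqlContext.implicits._
scala> val checkdf = Seq((1, "spark"), (2, "sql")).toDF("id", "name")
scala> checkdf.show()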
----------------------------------------
1. Hive Query Execution - on text file
----------------------------------------
There is no need to create a HiveContext explicitly in Cloudera Spark; queries are expressed in HiveQL.
Copy hive-site.xml into Spark's conf directory so that Spark can locate the Hive metastore:
cloudera> sudo -i
cloudera> cp /etc/hive/conf/hive-site.xml /etc/spark/conf
The results of SQL queries are themselves RDDs and support all normal RDD functions
C. Query on Hive table - to fetch females whose salary is less than 2 lakhs (200,000)
------------------------------------------------------------------------------
scala> val resultdf = sqlContext.sql("FROM people_table SELECT FIRST, SALARY, SSN WHERE GENDER = 'F' AND SALARY < 200000 LIMIT 10")
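To inspect the result, and to illustrate the earlier point that query results also behave as RDDs, something like the sketch below can be used (the formatting of each Row is just an example):
scala> resultdf.show()
scala> resultdf.rdd.map(row => row.mkString(",")).take(10).foreach(println)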
-------------------------------------
2. Load a JSON file and perform Join
-------------------------------------
Dataset: a JSON file in which each record holds the department information of a person, and a JSON file in which each record holds information about a person
File names: department.json, people.json
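The DataFrames used below are not created in the listing itself; a minimal sketch of loading them, assuming both files sit in /home/cloudera, would be:
scala> val deptdf = sqlContext.read.json("file:/home/cloudera/department.json")
scala> val ppldf = sqlContext.read.json("file:/home/cloudera/people.json")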
scala> deptdf.printSchema()
scala> deptdf.select("ssn","dept").show()
scala> ppldf.printSchema()
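joinresult is also not defined above; assuming both files share an ssn column, a sketch of the join would be:
scala> val joinresult = ppldf.join(deptdf, "ssn")
scala> joinresult.printSchema()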
scala> joinresult.registerTempTable("people_dept")
scala> val fdocdf = sqlContext.sql("SELECT city, count(*) AS cnt FROM people_dept WHERE gender = 'F' AND dept = 'Doctor' GROUP BY city ORDER BY cnt DESC LIMIT 10")
scala> fdocdf.show()
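fdocrdd is not created above; one way to obtain it, formatting each row as comma-separated text before saving, could be:
scala> val fdocrdd = fdocdf.rdd.map(row => row.mkString(","))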
scala> fdocrdd.saveAsTextFile("file:/home/cloudera/sparkout/jsonout")
cloudera> ls sparkout/jsonout
cloudera> cat sparkout/jsonout/part-00000