Working with Microsoft HDInsight from the Linux Shell: storage manipulation and Pig submission

 

Introduction

This blog post is targeted at Linux users who want to use the Linux shell to work with a Microsoft HDInsight cluster: uploading data and script files and submitting Pig Latin jobs directly from the shell, without going through any other interface. A few pieces of information will need to be gathered from the Azure management portal in order to successfully run the scripts below.

Software requirements

To work with Azure from Linux you need to install Node.js. First, make sure the build dependencies are available:

$ sudo apt-get install g++ curl libssl-dev apache2-utils

$ sudo apt-get install git-core

Then fetch the Node.js source and build and install it:

$ git clone git://github.com/ry/node.git
$ cd node
$ ./configure
$ make
$ sudo make install

For more details, see http://howtonode.org/how-to-install-nodejs

After installing Node.js, we need to install the Azure management package (azure-cli) for working with an Azure account:

$ sudo npm install -g azure-cli

The last few lines of the output should confirm that azure-cli was installed successfully.

Then use the following commands to download your Azure publish-settings file and import it, which connects the CLI to your subscription:

$ azure account download
$ azure account import <path-to-downloaded-publishsettings-file>

Working with WASB from Linux shell

Now, after installing azure-cli on Linux, you'll be able to start working with Azure blob storage (WASB). Before uploading and downloading, though, you need to set these two environment variables in the shell as follows:

$ export AZURE_STORAGE_ACCOUNT='<StorageAccountName>'

$ export AZURE_STORAGE_ACCESS_KEY='<StorageAccessKey>'
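A quick sanity check is worthwhile here, because curly quotes pasted from a web page or a typo in the name will make every storage command fail with an authentication error. A minimal sketch, where the account name and key are placeholders:

```shell
# Placeholder values -- substitute your real account name and access key.
export AZURE_STORAGE_ACCOUNT='mystorageaccount'
export AZURE_STORAGE_ACCESS_KEY='ZmFrZWtleQ=='

# Fail fast if either variable ended up empty (e.g. a typo or stray quote):
: "${AZURE_STORAGE_ACCOUNT:?AZURE_STORAGE_ACCOUNT must be set}"
: "${AZURE_STORAGE_ACCESS_KEY:?AZURE_STORAGE_ACCESS_KEY must be set}"
echo "Using storage account: $AZURE_STORAGE_ACCOUNT"
```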

The access key can be found by logging in to the Azure management portal > Storage > choose your storage account and click on it, then click Manage Access Keys in the lower strip.

Now you'll be able to easily manipulate the storage account you've just configured. Let's look at some examples.

i. Upload files to blob storage

$ azure storage blob upload [File] [Container] [blob]

[File]: the name of the local file on your system

[Container]: name of the container of the storage account you want to upload to

[Blob]: name of the blob, i.e. the name the file will have once uploaded

ii. Download file from blob storage

$ azure storage blob download [Container] [Blob] [File]

iii. List all blobs available in a container

$ azure storage blob list [Container]
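Putting the three operations together, a small wrapper (hypothetical; the container and file names below are made up) can compose the full command line so you can review it before running it for real:

```shell
# Hypothetical helper: composes an azure-cli blob command and prints it,
# so the full command line can be reviewed before execution.
# Drop the leading "echo" inside the function to execute instead of preview.
blob_cmd() {
  local action=$1; shift
  echo azure storage blob "$action" "$@"
}

blob_cmd upload run.pig scripts run.pig      # local file -> container/blob
blob_cmd download scripts run.pig copy.pig   # container/blob -> local file
blob_cmd list scripts                        # enumerate blobs in a container
```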

Much more detail can be found at the following URL:

http://azure.microsoft.com/en-us/documentation/articles/command-line-tools/#Commands_to_manage_your_Storage_objects

Submitting Pig Latin jobs from Linux shell

To submit Pig Latin jobs from the Linux shell I used cURL, which is available in every major Linux distribution. cURL calls the WebHCat REST APIs that work with Pig and Hive on the Hadoop cluster. The full documentation of the WebHCat REST APIs can be found here: http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-1.2.1/bk_dataintegration/content/ch_using_hcatalog_1.html

1. To check that your connectivity is OK and the WebHCat server is up, use the following command:

$ curl -i 'https://[clustername].azurehdinsight.net/templeton/v1/status' -u [username]:[password]

[clustername]: name of the provisioned cluster

Note that because HDInsight uses SSL for accessing the Templeton (WebHCat) REST APIs, you'll need to submit the username and password of the cluster.

You should receive an HTTP 200 response indicating that the server is running.
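Since every WebHCat call shares the same base URL, it is convenient to build it once from the cluster name. A sketch with placeholder cluster name and credentials (the command is echoed so it can be inspected; drop the `echo` to actually issue the request):

```shell
# Placeholders -- replace with your own cluster name and credentials.
CLUSTER=mycluster
HDUSER=admin
HDPASS=secret

# Every WebHCat call shares this base URL, so build it once:
TEMPLETON="https://${CLUSTER}.azurehdinsight.net/templeton/v1"

# Preview the status call; drop the "echo" to actually issue the request.
echo curl -i "${TEMPLETON}/status" -u "${HDUSER}:${HDPASS}"
```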

2. To submit a Pig job, first upload your Pig script file to blob storage, then use the following command if your Pig script is in the default storage account set when provisioning the cluster:

$ curl -d file=wasb:///filename -u [username]:[password] 'https://[clustername].azurehdinsight.net/templeton/v1/pig' -d user.name=admin

The user.name parameter identifies the user that the Pig script will run as in the underlying MapReduce job.

3. The following command submits a Pig job where the script is located in another blob storage account:

$ curl -d file=wasb://<containername>@<storageaccountname>.blob.core.windows.net/run.pig -u [username]:[password] 'https://[clustername].azurehdinsight.net/templeton/v1/pig' -d user.name=admin
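The two submission forms differ only in the wasb:// URI of the script, so a small function (hypothetical, with placeholder names throughout) can cover both; again the command is echoed for inspection rather than executed:

```shell
# Placeholders -- replace with your cluster name and credentials.
CLUSTER=mycluster
HDUSER=admin
HDPASS=secret

# submit_pig <wasb-uri>: composes the WebHCat Pig submission call.
# Drop the leading "echo" to actually submit the job.
submit_pig() {
  echo curl -d "file=$1" -d "user.name=${HDUSER}" \
       -u "${HDUSER}:${HDPASS}" \
       "https://${CLUSTER}.azurehdinsight.net/templeton/v1/pig"
}

submit_pig 'wasb:///run.pig'                                          # default storage
submit_pig 'wasb://scripts@mystorage.blob.core.windows.net/run.pig'   # another account
```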

4. After executing, you'll receive a Job ID back at the cursor; you can use this ID to check the status of your script:

$ curl -u [username]:[password] -s 'https://[clustername].azurehdinsight.net/templeton/v1/queue/[jobid]?user.name=[username]'
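The returned Job ID plugs into the queue URL. A sketch that composes the status-check call (cluster name, credentials, and the Job ID below are all placeholders; drop the `echo` to actually poll):

```shell
# Placeholders -- replace with your cluster, credentials, and real Job ID.
CLUSTER=mycluster
HDUSER=admin
HDPASS=secret
JOBID=job_201401010000_0001   # hypothetical ID; use the one returned at submission

STATUS_URL="https://${CLUSTER}.azurehdinsight.net/templeton/v1/queue/${JOBID}?user.name=${HDUSER}"

# Preview the poll; drop the "echo" to run it. The JSON response includes a
# status object whose state field ends up as SUCCEEDED or FAILED when done.
echo curl -s -u "${HDUSER}:${HDPASS}" "$STATUS_URL"
```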

This summarizes all the activities needed to work with HDInsight from shell scripting alone.

Connecting R to HDInsight through Hive

With the powerful big data platform that Microsoft provides through Azure HDInsight, and with the wide range of data scientists and statisticians utilizing R, this post shows how to bring the best of both together and connect R to HDInsight through the Hive connector, so that you can analyze Hive tables in R while they reside on the Azure HDInsight cluster. Let's see the steps.

  1. download the Microsoft Hive ODBC driver from here
  2. install the Microsoft Hive ODBC driver, using either the x86 or the x64 version (take care to match the architecture of your R installation)
  3. configure your DSN in the ODBC Data Sources
    1. go to Control Panel > Administrative Tools > ODBC Data Sources (64-bit)
    2. open System DSN click add
    3. choose Microsoft Hive ODBC Driver and click Finish
    4. enter the fields
      1. Data Source Name: the data source name we'll use in R, so name it anything; I'll call it HiveOnAzure here
      2. Description: write your description
      3. Host: get it from your Azure management site: [yourclustername].azurehdinsight.net
      4. port: leave it 443
      5. Database: leave it “default”
      6. Hive server type: use Hive Server 2
      7. Mechanism: Windows Azure HDInsight Service (it automatically configures the Port and Database fields above)
      8. HTTP Path: leave it blank
      9. username: your username that you entered while creating the cluster
      10. password: your password that you entered while creating the cluster
      11. then test the connectivity; you should receive a "connection successfully established" message
    5. Now after establishing the ODBC driver connectivity to Azure we’ll shift to R
  4. Open RStudio (make sure to use the same x64 or x86 version as you've configured in the ODBC driver)
    1. install the RODBC package: install.packages("RODBC")
    2. load it: library(RODBC)
    3. create the ODBC connection in R: myconn <- odbcConnect("HiveOnAzure", uid="[YOUR_USERNAME_HERE]", pwd="[YOUR_PASSWORD_HERE]")
    4. run your HiveQL query and return the data into a data frame: alldata <- sqlQuery(myconn, "select * from hivesampletable") (note that sqlFetch takes a bare table name, not a query, so sqlQuery is the right call here)
    5. inspect the retrieved data: head(alldata, 10)

Now you've successfully connected R to Hive on HDInsight on Azure, ready to pass your HiveQL queries and start doing the analysis you want to create.