Using Sqoop and Hive for data analysis lets you reuse your SQL skills to take advantage of the Hadoop ecosystem without learning a new language and a lot of new processes. This article assumes you already have the Hortonworks Data Platform installed on your machine or virtual host. It shows how to take data from a SQL store, import it into Hive, and run a basic Hive query.
First, we’ll import a test data set with plenty of records in it: not a small 100-row file, but 100K records of generated user info and 50K records of generated order data. The data set was assembled using https://www.generatedata.com/ and made available as a public gist. It can be accessed as follows:
This will make a local copy of the SQL file and save it in a directory called
sqoop import --connect jdbc:mysql://localhost:3306/company_info --table user_info --username root --hive-import

sqoop import --connect jdbc:mysql://localhost:3306/company_info --table orders --username root --hive-import
After the imports finish, we can boot up the Hive CLI by typing hive at the command line. From there we can execute a query. For something interesting involving a time series, we could run the following:
select user_id,
       avg(unix_timestamp(created_on) - unix_timestamp(lag_created_on)) / 86400
from (select user_id, created_on,
             lag(created_on) over (partition by user_id order by created_on) as lag_created_on
      from orders) q
group by user_id;
This should return each user_id along with the average time between that user's orders, in days: the lag window function pairs each order with the previous one for the same user, and dividing the timestamp difference by 86400 (seconds per day) converts the average gap to days.
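To make the query's logic concrete, here is a small Python sketch that mirrors what Hive computes: group orders by user, take the gaps between consecutive order timestamps (the lag), and average them in days. The sample rows and the function name are illustrative assumptions, not part of the imported data set.

```python
from itertools import groupby

# Hypothetical sample rows standing in for the orders table:
# (user_id, created_on as a Unix timestamp in seconds).
orders = [
    (1, 0),        # day 0
    (1, 86400),    # day 1
    (1, 259200),   # day 3
    (2, 0),        # day 0
    (2, 43200),    # day 0.5
]

def avg_days_between_orders(rows):
    """Mirror the Hive query: per user, average the differences between
    consecutive order timestamps and convert seconds to days."""
    result = {}
    for user, group in groupby(sorted(rows), key=lambda r: r[0]):
        ts = [t for _, t in group]
        gaps = [b - a for a, b in zip(ts, ts[1:])]  # lag differences
        if gaps:  # a user with a single order averages to NULL in Hive
            result[user] = sum(gaps) / len(gaps) / 86400
    return result

print(avg_days_between_orders(orders))  # {1: 1.5, 2: 0.5}
```

User 1's gaps are one day and two days, averaging 1.5; user 2 has a single half-day gap.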
This short article covered a few simple steps to get started with Sqoop and Hive. More advanced functionality, such as incremental (append) imports and other query types, is covered in other articles.