Big Data Hadoop

Sunday, May 20, 2018

Installing Hive 2.2.0 With Derby

5:45 AM Unknown

Download Hive 2.2.0

https://hive.apache.org/downloads.html

Click "Download a release now".

Follow the mirror link http://www-eu.apache.org/dist/hive/

Select Hive version 2.2.0 (or another release).

“Extracting Hive”

$ tar zxvf apache-hive-2.2.0-bin.tar.gz

“Copying files to /usr/local/hive directory”

# cd /home/user/Download

# mv apache-hive-2.2.0-bin /usr/local/hive

Setting up environment for Hive

Syntax:- nano ~/.bashrc

Add the lines below

export HIVE_HOME=/usr/local/hive

export PATH=$PATH:$HIVE_HOME/bin

export CLASSPATH=$CLASSPATH:/usr/local/hadoop/lib/*:.

export CLASSPATH=$CLASSPATH:/usr/local/hive/lib/*:.

$ source ~/.bashrc
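The edit above can also be scripted. The sketch below is one possible way to do it, not an official Hive step: append_once and add_hive_env are hypothetical helper names, the /usr/local/hive path is the one used in this post, and only the HIVE_HOME and PATH lines are written. A line that is already present is skipped, so the helper can be re-run safely.

```shell
# append_once: add a line to a file only if that exact line is not there yet.
# Hypothetical helper for illustration; $1 = line to add, $2 = target file.
append_once() {
  touch "$2"
  grep -qxF -e "$1" "$2" || printf '%s\n' "$1" >> "$2"
}

# add_hive_env: write the Hive variables from this post into the given rc file.
# $1 = rc file to update (for example ~/.bashrc).
add_hive_env() {
  append_once 'export HIVE_HOME=/usr/local/hive' "$1"
  append_once 'export PATH=$PATH:$HIVE_HOME/bin' "$1"
}
```

Typical use would be add_hive_env ~/.bashrc followed by source ~/.bashrc.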

“Configuring Hive”

To configure Hive with Hadoop, edit the hive-env.sh file in the $HIVE_HOME/conf directory. The following commands change to the Hive config directory and create hive-env.sh from its template:

$ cd $HIVE_HOME/conf

$ cp hive-env.sh.template hive-env.sh

Edit hive-env.sh

Syntax:- nano hive-env.sh

Add the line below

export HADOOP_HOME=/usr/local/hadoop

“Downloading & Installing Apache Derby”

$ wget http://archive.apache.org/dist/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz

“Extracting Derby”

$ tar zxvf db-derby-10.4.2.0-bin.tar.gz

“Copying files to /usr/local/derby directory”

Moving files into /usr/local requires superuser privileges, so switch to the superuser first with “su -”. The following commands move the extracted directory to /usr/local/derby:

# cd /home/user

# mv db-derby-10.4.2.0-bin /usr/local/derby

“Setting up environment for Derby”

Syntax:- nano ~/.bashrc

Add the lines below

export DERBY_HOME=/usr/local/derby

export PATH=$PATH:$DERBY_HOME/bin

export CLASSPATH=$CLASSPATH:$DERBY_HOME/lib/derby.jar:$DERBY_HOME/lib/derbytools.jar

$ source ~/.bashrc

“Configuring Metastore of Hive”

Configuring Metastore means specifying to Hive where the database is stored. You can do this by editing the hive-site.xml file, which is in the $HIVE_HOME/conf directory. First of all, copy the template file using the following command:

$ cd $HIVE_HOME/conf

$ cp hive-default.xml.template hive-site.xml

Edit hive-site.xml

Syntax:- nano hive-site.xml

Add the property below, replacing PORT with the port your Derby network server listens on

<property>

   <name>javax.jdo.option.ConnectionURL</name>

   <value>jdbc:derby://localhost:PORT/metastore_db;create=true</value>

   <description>JDBC connect string for a JDBC metastore</description>

</property>
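The port in the connection URL is whichever port the Derby network server listens on; Derby's default is 1527. As a small sketch, the snippet below prints the property with a concrete port so it can be pasted into hive-site.xml. metastore_property is a hypothetical helper name, and falling back to 1527 is an assumption about your setup.

```shell
# metastore_property: print the ConnectionURL property for a given Derby port.
# $1 = Derby network server port (optional; falls back to Derby's default, 1527).
metastore_property() {
  port="${1:-1527}"
  cat <<EOF
<property>
   <name>javax.jdo.option.ConnectionURL</name>
   <value>jdbc:derby://localhost:${port}/metastore_db;create=true</value>
   <description>JDBC connect string for a JDBC metastore</description>
</property>
EOF
}
```

Running metastore_property 1527 prints the block for the default port; pass another number if your Derby server was started on a different one.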

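Now create the Hive directories in HDFS and make them group-writable (-p also creates missing parent directories):

$ hadoop fs -mkdir -p /tmp

$ hadoop fs -mkdir -p /user/hive/warehouse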

$ hadoop fs -chmod g+w /tmp

$ hadoop fs -chmod g+w /user/hive/warehouse

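Initialize the metastore schema (needed once on Hive 2.1 and later, before the first launch):

$ schematool -dbType derby -initSchema

Command to verify the Hive installation.

$ hive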

Command to display all the tables.

hive> show tables;

OK

Time taken: 2.798 seconds

hive>

Tuesday, December 19, 2017

Apache-Hive Slide PPT

7:10 AM Unknown

Wednesday, April 5, 2017

Cassandra – Installation

6:23 AM Unknown

Download link

http://cassandra.apache.org/download/

$ wget http://supergsego.com/apache/cassandra/2.1.2/apache-cassandra-2.1.2-bin.tar.gz

· Unpack Cassandra using tar zxvf as shown below

$ tar zxvf apache-cassandra-2.1.2-bin.tar.gz

Setting the path

· To set up the CASSANDRA_HOME and PATH variables, add the two lines below to the ~/.bashrc file. Note that the shell does not allow spaces around the = sign.

[hadoop@linux ~]$ nano ~/.bashrc

export CASSANDRA_HOME=~/cassandra

export PATH=$PATH:$CASSANDRA_HOME/bin

Making the Directory

· Create a new directory named cassandra and move the contents of the extracted folder into it as shown below (optional).

· Alternatively, you can simply download and unpack it directly on the Desktop.

$ mkdir cassandra

$ mv apache-cassandra-2.1.2/* cassandra

Giving the Permissions

· Give read-write permissions to the Cassandra data and log directories as shown below.

[root@linux /]# chmod 777 /var/lib/cassandra

[root@linux /]# chmod 777 /var/log/cassandra
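The chmod commands assume that both directories already exist. The sketch below creates them first, assuming the /var/lib/cassandra and /var/log/cassandra locations used above. prepare_cassandra_dirs is a hypothetical helper name, and its optional prefix argument is there only so the function can be tried in a scratch directory without root.

```shell
# prepare_cassandra_dirs: create the data and log directories, then open them up.
# $1 = optional path prefix; leave it empty to use the real system paths (needs root).
# Mode 777 mirrors the commands in this post.
prepare_cassandra_dirs() {
  for d in "${1:-}/var/lib/cassandra" "${1:-}/var/log/cassandra"; do
    mkdir -p "$d" && chmod 777 "$d" || return 1
  done
}
```

As root, calling prepare_cassandra_dirs with no argument acts on the real paths.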

Starting Cassandra

· To start Cassandra, open a terminal window, navigate to the Cassandra home directory where you unpacked Cassandra, and run the following command to start the Cassandra server.

$ cd $CASSANDRA_HOME

$ ./bin/cassandra -f

· The -f option tells Cassandra to stay in the foreground instead of running as a background process. If everything goes fine, you will see the Cassandra server starting.

Tuesday, March 28, 2017

KAFKA INSTALLATION

5:35 AM Unknown

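· Install ZooKeeper

· Install Kafka

· Set the home path in the .bashrc file

· Start ZooKeeper (run from the ZooKeeper installation directory):-

sudo bin/zkServer.sh start

· Start the Kafka server (this and the following commands run from the Kafka installation directory):-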

sudo bin/kafka-server-start.sh config/server.properties

· Create the Topic:-

bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic hellokafka

· Start Producer to Send Messages:-

bin/kafka-console-producer.sh --broker-list localhost:9092 --topic hellokafka

· Start Consumer to Receive Messages:-

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic hellokafka --from-beginning

Sunday, January 29, 2017

Travel Data Analysis Using spark (use case)

2:47 AM Unknown

In this blog, we will analyze a travel dataset and gain insights from it using Apache Spark.

The travel dataset is publicly available and its contents are detailed under the heading ‘Travel Sector Dataset Description’.

Based on the data, we will find the top 20 destinations people travel to the most, the top 20 locations people travel from the most, and the top 20 cities that generate high airline revenues for travel, all based on booked trip count.

Travel Sector Dataset Description:-

Column 1: City pair (Combination of from and to): String

Column 2: From location: String

Column 3: To Location: String

Column 4: Product type: Integer (1=Air, 2=Car, 3=Air+Car, 4=Hotel, 5=Air+Hotel, 6=Hotel+Car, 7=Air+Hotel+Car)

Column 5: Adults traveling: Integer

Column 6: Seniors traveling: Integer

Column 7: Children traveling: Integer

Column 8: Youth traveling: Integer

Column 9: Infant traveling: Integer

Column 10: Date of travel: String

Column 11: Time of travel: String

Column 12: Date of Return: String

Column 13: Time of Return: String

Column 14: Price of booking: Float

Column 15: Hotel name: String

You can download the dataset from the link below:-

https://drive.google.com/file/d/0B_l-oTQTRH1naS15dHlKOEFWOXc/view?usp=sharing

Problem Statement 1:-

Top 20 destinations people travel to the most: Based on the given data, we can find the most popular destinations that people travel to frequently. There are many destinations, out of which we will take only the first 20, based on the trips booked for each destination.

Source Code:-

val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")

val split = textFile.map(lines=>lines.split('\t')).map(x=>(x(2),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)

Description of the above code:-

Line 1: We are creating an RDD by loading a new dataset which is in HDFS.

Line 2: We split each record on the tab delimiter because the data is tab separated. We create key-value pairs where the key is the destination, which is in the 3rd column (index 2, since indexing starts at 0), and the value is 1. Since we need to count how often each city appears, we use the reduceByKey method to sum the values per key. After counting, we swap each pair so that the count becomes the key. The sortByKey method sorts the data by key, and false stands for descending order. Once the sorting is complete, take(20) returns the top 20 destinations.

Output

(396,MIA), (290,SFO), (202,LAS), (162,LAX), (102,DFW), (64,DEN), (57,ORD), (54,PHL), (50,IAH), (45,JFK), (44,PHX), (40,FLL), (36,ATL), (31,BOS), (31,MCO), (27,SAN), (25,WAS), (24,CUN), (22,AUS), (22,LON)
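As a quick cross-check outside Spark, the same count can be sketched with standard shell tools on a local copy of the file. This is a different technique from the Spark job, not a replacement for it. top_by_column is a hypothetical helper name, the local file is an assumption (it could be fetched with hadoop fs -get), and the column number is 1-based here, so Spark's x(2) corresponds to column 3.

```shell
# top_by_column: count the values in one tab-separated column, most frequent first.
# $1 = 1-based column number, $2 = input file, $3 = rows to show (default 20).
top_by_column() {
  tab=$(printf '\t')
  # Tally each distinct value, then sort by count (descending) and format
  # the rows like the Spark output: (count,value).
  awk -F'\t' -v c="$1" '{ n[$c]++ } END { for (k in n) print n[k] "\t" k }' "$2" |
    sort -t "$tab" -k1,1nr |
    head -n "${3:-20}" |
    awk -F'\t' '{ print "(" $1 "," $2 ")" }'
}
```

For example, top_by_column 3 TravelData.txt lists the destinations, and top_by_column 2 TravelData.txt lists the starting locations. The order of rows with equal counts may differ from Spark's.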

Problem Statement 2:-

Top 20 locations from where people travel the most: We can find the places from where most of the trips are undertaken, based on the booked trip count.

Source Code:-

val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")

val split = textFile.map(lines=>lines.split('\t')).map(x=>(x(1),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)

Description of the above code:-

Line 1: We are creating an RDD by loading a new dataset which is in HDFS.

Line 2: We split each record on the tab delimiter since the data is tab separated. We create key-value pairs where the key is the location people start from, which is in the 2nd column (index 1), and the value is 1. Since we need to count how often each starting location appears, we use the reduceByKey method to sum the values per key. After counting, we swap each pair so that the count becomes the key. The sortByKey method sorts the data by key, and false stands for descending order. Once the sorting is complete, take(20) returns the top 20 locations from where people undertake trips.

Output:-

(504,DFW), (293,MIA), (272,LAS), (167,BOM), (131,SFO), (101,ORD), (72,LAX), (55,DEN), (41,PHL), (37,IAH), (35,FLL), (33,PHX), (31,JFK), (24,WAS), (19,HOU), (19,ATL), (18,DXB), (17,SAN), (17,BOS), (17,BCN)

Problem Statement 3:-

Top 20 cities that generate high airline revenues for travel, so that the site can concentrate on offering discounts on bookings to those cities to attract more bookings.

Source Code:-

val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")

val fil = textFile.map(x => x.split('\t')).filter(x => x(3) == "1")

val cnt = fil.map(x=>(x(2),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)

Description of the above code:-

Line 1: We are creating an RDD by loading a new dataset which is in HDFS.

Line 2: We split each record on the tab delimiter, as the data is tab separated. From this, we filter the records by product type. Here, we need only the bookings made by flight, which are denoted by 1 (1=Air, 2=Car, 3=Air+Car, 4=Hotel, 5=Air+Hotel, 6=Hotel+Car, 7=Air+Hotel+Car).

Line 3: We create key-value pairs for the Air bookings, where the key is the destination, which is in the 3rd column, and the value is 1. Since we need to count the popular cities, we count them using the reduceByKey method. After counting, we swap the key-value pairs. We use the sortByKey method to sort the data by key, where false stands for descending order. Once sorting is completed, we take the top 20 cities. Note that this ranks cities by the number of Air bookings, which stands in for airline revenue here; the price column is not summed.

Output:

(84,MIA), (68,SFO), (54,LAS), (42,LAX), (24,IAH), (23,DFW), (18,PHX), (17,BOS), (15,ORD), (13,NYC), (9,DCA), (8,WAS), (8,AUS), (7,DEN), (7,MEM), (7,JFK), (6,SYD), (6,PHL), (6,ATL), (5,RIC)
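The Spark job above ranks destinations by the number of Air bookings, not by money. If actual revenue is wanted, one option is to sum the price column (column 14) instead. The sketch below does this outside Spark with awk on a local copy of the file. It is an alternative technique, not the Spark code from this post: air_revenue_by_destination and the local file are assumptions, and it assumes column 14 holds a plain decimal number.

```shell
# air_revenue_by_destination: sum the booking price (column 14) per destination
# (column 3) for Air-only bookings (column 4 equal to "1"), highest total first.
# $1 = tab-separated input file, $2 = rows to show (default 20).
air_revenue_by_destination() {
  tab=$(printf '\t')
  awk -F'\t' '$4 == "1" { total[$3] += $14 }
       END { for (k in total) printf "%.2f\t%s\n", total[k], k }' "$1" |
    sort -t "$tab" -k1,1nr |
    head -n "${2:-20}"
}
```

For example, air_revenue_by_destination TravelData.txt prints one line per destination with the summed price followed by the destination code.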

We hope this blog was useful. Keep visiting my blog https://hadoopbigdata123.blogspot.in/ for more updates on BigData and other technologies.
