Wednesday, April 5, 2017
Cassandra – Installation
Downloading Cassandra
$ wget http://supergsego.com/apache/cassandra/2.1.2/apache-cassandra-2.1.2-bin.tar.gz
· Unzip Cassandra using the tar zxvf command as shown below.
$ tar zxvf apache-cassandra-2.1.2-bin.tar.gz
Setting the path
· To set up the PATH and CASSANDRA_HOME variables, add the following two lines to the ~/.bashrc file.
[hadoop@linux ~]$ sudo nano ~/.bashrc
export CASSANDRA_HOME=~/cassandra
export PATH=$PATH:$CASSANDRA_HOME/bin
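· After saving the file, reload it so the new variables take effect in the current shell, for example:
$ source ~/.bashrc
$ echo $CASSANDRA_HOME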
Making the Directory
· Create a new directory named cassandra and move the contents of the extracted folder into it as shown below (optional).
· Alternatively, you can extract it directly on the Desktop.
$ mkdir cassandra
$ mv apache-cassandra-2.1.2/* cassandra
Giving the Permissions
· Give read-write permissions to the newly created folders as shown below.
[root@linux /]# chmod 777 /var/lib/cassandra
[root@linux /]# chmod 777 /var/log/cassandra
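· If these directories do not exist yet, create them first before changing their permissions, for example:
[root@linux /]# mkdir -p /var/lib/cassandra
[root@linux /]# mkdir -p /var/log/cassandra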
Starting Cassandra
· To start Cassandra, open a terminal window, navigate to the Cassandra home directory (where you unpacked Cassandra), and run the following commands to start your Cassandra server.
$ cd $CASSANDRA_HOME
$ ./bin/cassandra -f
· The -f option tells Cassandra to stay in the foreground instead of running as a background process. If everything goes fine, you will see the Cassandra server starting.
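· Once the server is up, you can verify it from another terminal window; a minimal check, assuming the default configuration:
$ ./bin/nodetool status
$ ./bin/cqlsh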
Tuesday, March 28, 2017
KAFKA INSTALLATION
· Install zookeeper
· Install Kafka
· Set the home path in the .bashrc file (see the sketch after this list)
· Start zookeeper
sudo bin/zkServer.sh start
· To start Kafka server:-
sudo bin/kafka-server-start.sh config/server.properties
· Create the Topic:-
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic hellokafka
· Start Producer to Send Messages:-
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic hellokafka
· Start Consumer to Receive Messages:-
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic hellokafka --from-beginning
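· As referenced above, setting the home path in the .bashrc file can look like the following sketch; the install location ~/kafka is an assumption, so adjust it to wherever you extracted Kafka:
$ sudo nano ~/.bashrc
export KAFKA_HOME=~/kafka
export PATH=$PATH:$KAFKA_HOME/bin
$ source ~/.bashrc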
Sunday, January 29, 2017
Travel Data Analysis Using spark (use case)
In this blog, we will discuss the analysis of a travel dataset and gain insights from it using Apache Spark.
The travel dataset is publicly available and its contents are detailed under the heading ‘Travel Sector Dataset Description’.
Based on the data, we will find the top 20 destinations people travel to the most, the top 20 locations from which people travel the most, and the top 20 cities that generate the highest airline revenue, all based on the booked trip count.
Travel Sector Dataset Description:-
Column 1: City pair (Combination of from and to): String
Column 2: From location: String
Column 3: To Location: String
Column 4: Product type: Integer (1 = Air, 2 = Car, 3 = Air+Car, 4 = Hotel, 5 = Air+Hotel, 6 = Hotel+Car, 7 = Air+Hotel+Car)
Column 5: Adults traveling: Integer
Column 6: Seniors traveling: Integer
Column 7: Children traveling: Integer
Column 8: Youth traveling: Integer
Column 9: Infant traveling: Integer
Column 10: Date of travel: String
Column 11: Time of travel: String
Column 12: Date of Return: String
Column 13: Time of Return: String
Column 14: Price of booking: Float
Column 15: Hotel name: String
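The source code below refers to these columns by zero-based array index after splitting each line on tabs, so Column 2 becomes x(1), Column 3 becomes x(2), and Column 4 becomes x(3). A minimal sketch of that mapping (the sample record here is made up purely for illustration):
val sample = "DFW_MIA\tDFW\tMIA\t1\t2\t0\t0\t0\t0\t1/2/2015\t10:00\t1/9/2015\t18:00\t345.50\tHilton"
val f = sample.split('\t')
// Zero-based indexing: f(1) = From location, f(2) = To location, f(3) = Product type
println(s"from=${f(1)}, to=${f(2)}, product=${f(3)}")   // prints: from=DFW, to=MIA, product=1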
You can download the dataset from the link below:-
Problem Statement 1:-
Top 20 destinations people travel to the most: Based on the given data, we can find the most popular destinations that people travel to frequently. Out of the many destinations, we will find only the top 20, based on the number of trips booked for each destination.
Source Code:-
val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")
val split = textFile.map(lines=>lines.split('\t')).map(x=>(x(2),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)
Description of the above code:-
Line 1: We are creating an RDD by loading a new dataset which is in HDFS.
Line 2: We split each record on the tab delimiter because the data is tab-separated. We then create key-value pairs, where the key is the destination (the 3rd column) and the value is 1. Since we need to count which destinations are popular, we use the reduceByKey method to count them. After counting the destinations, we swap the key-value pairs so that the count becomes the key. The sortByKey method sorts the data by key, and false stands for descending order. Once the sorting is complete, we take the top 20 destinations.
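The same result can also be obtained a bit more directly with RDD.sortBy, which sorts on the count without swapping the pair first; a minimal equivalent sketch (note it returns (destination, count) pairs instead of (count, destination)):
val top20 = textFile.map(_.split('\t')).map(x=>(x(2),1)).reduceByKey(_+_).sortBy(_._2, ascending = false).take(20)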
Output
(396,MIA), (290,SFO), (202,LAS), (162,LAX), (102,DFW), (64,DEN), (57,ORD), (54,PHL), (50,IAH), (45,JFK), (44,PHX), (40,FLL), (36,ATL), (31,BOS), (31,MCO), (27,SAN), (25,WAS), (24,CUN), (22,AUS), (22,LON)
Problem Statement 2:-
Top 20 locations from which people travel the most: We can find the places from which most trips are undertaken, based on the booked trip count.
Source Code:-
val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")
val split = textFile.map(lines=>lines.split('\t')).map(x=>(x(1),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)
Description of the above code:-
Line 1: We are creating an RDD by loading a new dataset which is in HDFS.
Line 2: We split each record on the tab delimiter since the data is tab-separated. We then create key-value pairs, where the key is the starting location (the 2nd column) and the value is 1. Since we need to count the locations from which people most often undertake trips, we use the reduceByKey method to count them. After counting the locations, we swap the key-value pairs and use the sortByKey method, which sorts the data by key, with false standing for descending order. Once the sorting is complete, we take the top 20 locations from which people undertake the most trips.
Output:-
(504,DFW), (293,MIA), (272,LAS), (167,BOM), (131,SFO), (101,ORD), (72,LAX), (55,DEN), (41,PHL), (37,IAH), (35,FLL), (33,PHX), (31,JFK), (24,WAS), (19,HOU), (19,ATL), (18,DXB), (17,SAN), (17,BOS), (17,BCN)
Problem Statement 3:-
Top 20 cities that generate the highest airline revenue, so that the site can concentrate on offering booking discounts for those cities to attract more bookings.
Source Code:-
val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")
val fil = textFile.map(x=>x.split('\t')).filter(x => x(3).matches("1"))
val cnt = fil.map(x=>(x(2),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)
Description of the above code:-
Line 1: We are creating an RDD by loading a new dataset which is in HDFS.
Line 2: We split each record on the tab delimiter as the data is tab-separated. From this, we filter the records based on the mode of travel. Here, we need only the people who traveled by air, which is denoted by 1 (1 = Air, 2 = Car, 3 = Air+Car, 4 = Hotel, 5 = Air+Hotel, 6 = Hotel+Car, 7 = Air+Hotel+Car).
Line 3: We create key-value pairs for the people who traveled by air, where the key is the destination (the 3rd column) and the value is 1. Since we need to count the popular cities, we count them using the reduceByKey method. After counting the destinations, we swap the key-value pairs and use the sortByKey method to sort the data by key, with false standing for descending order. Once sorting is complete, we take the top 20 cities that generate the highest airline revenue.
Output:
(84,MIA), (68,SFO), (54,LAS), (42,LAX), (24,IAH), (23,DFW), (18,PHX), (17,BOS), (15,ORD), (13,NYC), (9,DCA), (8,WAS), (8,AUS), (7,DEN), (7,MEM), (7,JFK), (6,SYD), (6,PHL), (6,ATL), (5,RIC)
We hope this blog was useful. Keep visiting my blog https://hadoopbigdata123.blogspot.in/ for more updates on BigData and other technologies.
Sunday, January 1, 2017
Introduction To Apache Flume
What is Flume?
Apache Flume is a tool/service/data-ingestion mechanism for collecting, aggregating, and transporting large amounts of streaming data, such as log files and events, from various sources to a centralized data store.
Flume is a highly reliable, distributed, and configurable tool. It is principally designed to copy streaming data (log data) from various web servers to HDFS.
Assume an e-commerce web application wants to analyze the customer behavior from a particular region. To do so, they would need to move the available log data into Hadoop for analysis. Here, Apache Flume comes to our rescue.
Flume is used to move the log data generated by application servers into HDFS at a higher speed.
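As an illustration of how such a flow is wired together, a minimal single-agent configuration might tail an application log into HDFS. This is only a sketch; the agent name, file names, and paths below are assumptions, not part of the original setup:
agent.sources = r1
agent.channels = c1
agent.sinks = k1
# Source: tail an application log file (exec source)
agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /var/log/app/app.log
agent.sources.r1.channels = c1
# Channel: buffer events in memory
agent.channels.c1.type = memory
# Sink: write the events to HDFS
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://localhost:9000/flume/events
agent.sinks.k1.channel = c1
The agent can then be started with:
bin/flume-ng agent --conf conf --conf-file conf/flume.conf --name agent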
Advantages of Flume:-
· Using Apache Flume we can store the data into any of the centralized stores (HBase, HDFS).
· When the rate of incoming data exceeds the rate at which data can be written to the destination, Flume acts as a mediator between data producers and the centralized stores and provides a steady flow of data between them.
· Flume provides the feature of contextual routing.
· The transactions in Flume are channel-based, with two transactions (one sender and one receiver) maintained for each message. This guarantees reliable message delivery.
· Flume is reliable, fault tolerant, scalable, manageable, and customizable.
Features of Flume:-
· Flume ingests log data from multiple web servers into a centralized store (HDFS, HBase) efficiently.
· Using Flume, we can get the data from multiple servers immediately into Hadoop.
· Along with the log files, Flume is also used to import huge volumes of event data produced by social networking sites like Facebook and Twitter, and e-commerce websites like Amazon and Flipkart.
· Flume supports a large set of source and destination types.
· Flume supports multi-hop flows, fan-in and fan-out flows, contextual routing, etc.
· Flume can be scaled horizontally.