Travel Data Analysis Using Spark (use case) ~ Big Data Hadoop

In this post, we analyze a travel dataset with Apache Spark and draw insights from it.

The travel dataset is publicly available, and its contents are detailed under the heading ‘Travel Sector Dataset Description’.

Based on the data, we will find the top 20 destinations people travel to the most, the top 20 locations people travel from the most, and the top 20 cities that generate high airline revenues. All three rankings are based on booked trip count.

Travel Sector Dataset Description:-

Column 1: City pair (Combination of from and to): String

Column 2: From location: String

Column 3: To location: String

Column 4: Product type: Integer (1=Air, 2=Car, 3=Air+Car, 4=Hotel, 5=Air+Hotel, 6=Hotel+Car, 7=Air+Hotel+Car)

Column 5: Adults traveling: Integer

Column 6: Seniors traveling: Integer

Column 7: Children traveling: Integer

Column 8: Youth traveling: Integer

Column 9: Infant traveling: Integer

Column 10: Date of travel: String

Column 11: Time of travel: String

Column 12: Date of Return: String

Column 13: Time of Return: String

Column 14: Price of booking: Float

Column 15: Hotel name: String
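
The column layout above can be sketched as a small record type. This is only an illustration in plain Scala (no Spark needed): the names Booking and parseBooking are hypothetical, and only a few of the 15 columns are kept. It also shows why the Spark code in this post uses x(1), x(2) and x(3): array indexes start at 0, so column N is index N-1.

```scala
// Hypothetical field names; only the column order comes from the dataset description.
final case class Booking(
  cityPair: String,  // column 1  -> index 0
  from: String,      // column 2  -> index 1
  to: String,        // column 3  -> index 2
  productType: Int,  // column 4  -> index 3
  price: Float       // column 14 -> index 13
)

// Parse one tab-separated line; returns None for short or malformed rows.
def parseBooking(line: String): Option[Booking] = {
  val f = line.split("\t", -1) // -1 keeps trailing empty fields
  if (f.length < 14) None
  else
    for {
      product <- scala.util.Try(f(3).trim.toInt).toOption
      price   <- scala.util.Try(f(13).trim.toFloat).toOption
    } yield Booking(f(0), f(1), f(2), product, price)
}
```

Returning an Option is a design choice made for this sketch, so that a bad row is skipped instead of throwing an exception in the middle of a job.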

You can download the dataset from the link below:-

https://drive.google.com/file/d/0B_l-oTQTRH1naS15dHlKOEFWOXc/view?usp=sharing

Problem Statement 1:-

Top 20 destinations people travel to the most: Based on the given data, we can find the destinations that people travel to most frequently. There are many destinations, so we keep only the top 20, ranked by the number of trips booked to each one.

Source Code:-

val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")

val split = textFile.map(lines=>lines.split('\t')).map(x=>(x(2),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)

Description of the above code:-

Line 1: We create an RDD by loading the dataset, which is stored in HDFS.

Line 2: We split each record on the tab character because the data is tab separated. We then create key-value pairs in which the key is the destination, taken from the 3rd column (index 2, since array indexes start at 0), and the value is 1. The reduceByKey method adds up the 1s for each destination, which gives the number of trips booked to it. After counting, we swap each pair to (count, destination) so that sortByKey can sort on the count; the false argument requests descending order. Finally, take(20) returns the top 20 destinations.
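
To make the steps described above easier to follow, here is a minimal local sketch of the same count, swap and sort logic using plain Scala collections instead of an RDD. The sample rows are made up for illustration and are not taken from the dataset, and groupBy stands in for reduceByKey, which does not exist on ordinary collections.

```scala
// Made-up sample rows (tab-separated); only the first three columns are filled in.
val sampleLines = Seq(
  "DFW-MIA\tDFW\tMIA",
  "LAS-MIA\tLAS\tMIA",
  "BOM-SFO\tBOM\tSFO",
  "DFW-MIA\tDFW\tMIA",
  "MIA-SFO\tMIA\tSFO",
  "DFW-LAS\tDFW\tLAS"
)

val topDestinations: Seq[(Int, String)] =
  sampleLines
    .map(_.split('\t'))                    // split each record on tabs
    .map(fields => (fields(2), 1))         // key = destination (3rd column, index 2)
    .groupBy(_._1)                         // local stand-in for reduceByKey
    .toSeq                                 // leave the Map so equal counts do not collide
    .map { case (dest, ones) => (ones.map(_._2).sum, dest) } // sum the 1s, swap to (count, dest)
    .sortBy { case (count, _) => -count }  // descending by count
    .take(20)
```

On this sample, MIA is counted three times, SFO twice and LAS once, so the result has the same (count, destination) shape as the output shown in this post.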

Output:-

(396,MIA), (290,SFO), (202,LAS), (162,LAX), (102,DFW), (64,DEN), (57,ORD), (54,PHL), (50,IAH), (45,JFK), (44,PHX), (40,FLL), (36,ATL), (31,BOS), (31,MCO), (27,SAN), (25,WAS), (24,CUN), (22,AUS), (22,LON)

The screenshot below shows the same output.

Problem Statement 2:-

Top 20 locations from which people travel the most: We can find the places where most trips originate, based on the booked trip count.

Source Code:-

val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")

val split = textFile.map(lines=>lines.split('\t')).map(x=>(x(1),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)

Description of the above code:-

Line 1: We create an RDD by loading the dataset, which is stored in HDFS.

Line 2: We split each record on the tab character since the data is tab separated. We then create key-value pairs in which the key is the location people start from, taken from the 2nd column (index 1), and the value is 1. The reduceByKey method adds up the 1s for each starting location, which gives the number of trips that begin there. After counting, we swap each pair to (count, location) so that sortByKey can sort on the count; the false argument requests descending order. Finally, take(20) returns the top 20 locations from which people undertake trips.
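
Problem statements 1 and 2 differ only in the column that is used as the key. One way to express that, sketched here in plain Scala without Spark, is a small helper that takes the column index as a parameter. The name topByColumn and its signature are hypothetical and are not part of Spark or of the original code.

```scala
// Hypothetical helper: count how often each value appears in column `index`
// (zero-based) and return the `n` most frequent as (count, value) pairs.
def topByColumn(lines: Seq[String], index: Int, n: Int): Seq[(Int, String)] =
  lines
    .map(_.split('\t'))
    .collect { case f if f.length > index => f(index) } // skip rows that are too short
    .groupBy(identity)
    .toSeq
    .map { case (value, hits) => (hits.size, value) }
    .sortBy { case (count, value) => (-count, value) }  // count descending, then name for ties
    .take(n)
```

With this helper, topByColumn(lines, 1, 20) would rank starting locations and topByColumn(lines, 2, 20) would rank destinations, assuming lines holds the tab-separated records.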

Output:-

(504,DFW), (293,MIA), (272,LAS), (167,BOM), (131,SFO), (101,ORD), (72,LAX), (55,DEN), (41,PHL), (37,IAH), (35,FLL), (33,PHX), (31,JFK), (24,WAS), (19,HOU), (19,ATL), (18,DXB), (17,SAN), (17,BOS), (17,BCN)

The screenshot below shows the same output.

Problem Statement 3:-

Top 20 cities that generate high airline revenues for travel: Identifying these cities lets the site concentrate on offering booking discounts for them in order to attract more bookings.

Source Code:-

val textFile = sc.textFile("hdfs://localhost:9000/TravelData.txt")

val fil = textFile.map(x=>x.split('\t')).filter(x => x(3) == "1")

val cnt = fil.map(x=>(x(2),1)).reduceByKey(_+_).map(item => item.swap).sortByKey(false).take(20)

Description of the above code:-

Line 1: We create an RDD by loading the dataset, which is stored in HDFS.

Line 2: We split each record on the tab character, as the data is tab separated, and then filter the records by product type. Here we need only the bookings made for air travel, which is denoted by 1 (1=Air, 2=Car, 3=Air+Car, 4=Hotel, 5=Air+Hotel, 6=Hotel+Car, 7=Air+Hotel+Car).

Line 3: For the air bookings, we create key-value pairs in which the key is the destination, taken from the 3rd column (index 2), and the value is 1. The reduceByKey method adds up the 1s for each destination. After counting, we swap each pair to (count, destination) so that sortByKey can sort on the count; the false argument requests descending order. Finally, take(20) returns the top 20 cities, which we treat as the cities that generate high airline revenues, measured here by booked trip count.
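
The filter and count steps can be sketched locally in plain Scala as shown below. The function names are hypothetical. The code in this post ranks cities by the number of air bookings; the last function is an assumed variant, not part of the original analysis, that sums the booking price (column 14, index 13) in case revenue is wanted literally.

```scala
// Each row is assumed to be already split on tabs; indexes follow the dataset description.
def airOnly(rows: Seq[Array[String]]): Seq[Array[String]] =
  rows.filter(f => f.length > 13 && f(3).trim == "1") // product type 1 = Air

// Ranking used in the post: number of air bookings per destination.
def topAirByBookings(rows: Seq[Array[String]], n: Int): Seq[(Int, String)] =
  airOnly(rows)
    .groupBy(_(2)).toSeq
    .map { case (dest, hits) => (hits.size, dest) }
    .sortBy { case (count, dest) => (-count, dest) }
    .take(n)

// Assumed variant: total booking price per destination; rows with an unparsable price are skipped.
def topAirByRevenue(rows: Seq[Array[String]], n: Int): Seq[(Double, String)] =
  airOnly(rows)
    .flatMap(f => scala.util.Try(f(13).trim.toDouble).toOption.map(p => (f(2), p)))
    .groupBy(_._1).toSeq
    .map { case (dest, prices) => (prices.map(_._2).sum, dest) }
    .sortBy { case (total, dest) => (-total, dest) }
    .take(n)
```

The two rankings can differ: a city with a few expensive bookings may rank high by price total but low by booking count.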

Output:-

(84,MIA), (68,SFO), (54,LAS), (42,LAX), (24,IAH), (23,DFW), (18,PHX), (17,BOS), (15,ORD), (13,NYC), (9,DCA), (8,WAS), (8,AUS), (7,DEN), (7,MEM), (7,JFK), (6,SYD), (6,PHL), (6,ATL), (5,RIC)

The screenshot below shows the same output.

We hope this post was useful. Keep visiting my blog at https://hadoopbigdata123.blogspot.in/ for more updates on Big Data and other technologies.

Sunday, January 29, 2017