DSLab2
UNIVERSITY OF TECHNOLOGY
FACULTY OF COMPUTER SCIENCE AND ENGINEERING
Contents
1 Task 1: Filtering Wrong Records
2 Task 2: Preprocessing
Step 3: Create a function that filters records according to the validity criteria and a function that converts the time format:
>>> from datetime import datetime
>>> import pytz
>>> def filter_correct_records(line):
...     # A record is valid if it has 7 fields, a non-negative first field,
...     # a positive integer last field, and a third field other than "-".
...     fields = line.split(" ")
...     criteria = (len(fields) == 7 and float(fields[0]) >= 0
...                 and fields[6].isdigit() and int(fields[6]) > 0
...                 and fields[2] != "-")
...     return criteria
...
>>> filtered_data = log_data.filter(filter_correct_records)
>>> def convert_time(line):
...     # Fields 3 and 4 together hold the bracketed timestamp and its UTC offset.
...     fields = line.split(" ")
...     input_data = fields[3] + " " + fields[4]
...     time_format = "[%d/%b/%Y:%H:%M:%S %z]"
...     try:
...         # replace() relabels the parsed time as UTC; it does not convert it.
...         timestamp = datetime.strptime(input_data, time_format).replace(tzinfo=pytz.UTC).timestamp()
...         return timestamp
...     except ValueError:
...         return None
...
>>> filtered_data = filtered_data.filter(lambda x: convert_time(x) is not None)
>>> sorted_data = filtered_data.sortBy(convert_time)
>>> filtered_data.count()
1303227
Listing 1: Total number of records
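One detail of convert_time worth checking is the .replace(tzinfo=pytz.UTC) call: it relabels the parsed time as UTC instead of converting it, so the offset written in the log line is ignored. A minimal sketch of the difference follows. It assumes the standard library's timezone.utc as a stand-in for pytz.UTC, and the timestamp string is made up for illustration rather than taken from the lab data.

```python
from datetime import datetime, timezone

TIME_FORMAT = "[%d/%b/%Y:%H:%M:%S %z]"

# Hypothetical timestamp in the same bracketed layout as fields[3] + " " + fields[4].
sample = "[10/Dec/2018:09:00:00 +0700]"

# strptime with %z yields a timezone-aware datetime carrying the +07:00 offset.
parsed = datetime.strptime(sample, TIME_FORMAT)

# Keeps the parsed offset: 09:00 at +07:00 corresponds to 02:00 UTC.
epoch_with_offset = parsed.timestamp()

# Overwrites the offset: the wall-clock time 09:00 is treated as 09:00 UTC.
epoch_relabelled = parsed.replace(tzinfo=timezone.utc).timestamp()

# The two epoch values differ by exactly the discarded offset, in seconds.
offset_seconds = epoch_relabelled - epoch_with_offset
print(offset_seconds)
```

If every record carries the same offset, neither the record count nor the sort order changes; only the absolute epoch values shift.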
593586
Listing 2: Number of wrong records
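A count of wrong records can be obtained as the complement of the Step 3 filter. The sketch below does this on a plain Python list instead of an RDD. It also guards the float() call, which would otherwise raise ValueError on a non-numeric first field. The helper name is_correct_record and the three sample lines are hypothetical and are not part of the lab data.

```python
def is_correct_record(line):
    """Same criteria as the Step 3 filter, but tolerant of a non-numeric first field."""
    fields = line.split(" ")
    if len(fields) != 7:
        return False
    try:
        first_ok = float(fields[0]) >= 0
    except ValueError:
        # A first field such as "abc" marks the record as wrong instead of crashing.
        return False
    return (
        first_ok
        and fields[6].isdigit()
        and int(fields[6]) > 0
        and fields[2] != "-"
    )


# Invented sample lines in a 7-field layout (not from the lab data set).
sample_lines = [
    "0.012 10.0.0.1 HIT [10/Dec/2018:09:00:00 +0700] /a.mp4 1024",  # valid
    "0.034 10.0.0.2 - [10/Dec/2018:09:00:01 +0700] /b.mp4 2048",    # third field is "-"
    "abc 10.0.0.3 MISS [10/Dec/2018:09:00:02 +0700] /c.mp4 0",      # non-numeric first field
]

correct = [line for line in sample_lines if is_correct_record(line)]
wrong_count = len(sample_lines) - len(correct)
print(wrong_count)
```

On the RDDs used above, the analogous computation would presumably be log_data.count() - filtered_data.count(), or a filter with the negated predicate followed by count().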
2 Task 2: Preprocessing
Step 1: Create a function to classify services:
The classified log data RDD:
• Apply the classification function to each line of the filtered data RDD produced in Task 1.
• Create a new RDD of key-value pairs (stored in the service counts variable) whose key is the
service group and whose value is the number of records in that group.
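The two bullets above amount to a map step followed by a per-key count. The sketch below illustrates that pattern on a plain Python list. The classify_service rules (grouping by the file extension of the sixth field) and the sample lines are purely hypothetical placeholders, since the real service groups are defined by the lab's own classification function.

```python
from collections import Counter


def classify_service(line):
    """Hypothetical classifier: groups a record by the extension of its sixth field."""
    name = line.split(" ")[5]
    if name.endswith((".mp4", ".ts")):
        return "video"
    if name.endswith((".html", ".js", ".css")):
        return "web"
    return "other"


# Invented records in the 7-field layout assumed by the Task 1 filter.
filtered_lines = [
    "0.012 10.0.0.1 HIT [10/Dec/2018:09:00:00 +0700] /a.mp4 1024",
    "0.020 10.0.0.2 HIT [10/Dec/2018:09:00:01 +0700] /index.html 512",
    "0.031 10.0.0.3 MISS [10/Dec/2018:09:00:02 +0700] /b.ts 2048",
    "0.015 10.0.0.1 HIT [10/Dec/2018:09:00:03 +0700] /logo.png 256",
]

# Map each line to a (group, 1) pair, then sum the values per key.
pairs = [(classify_service(line), 1) for line in filtered_lines]
service_counts = Counter()
for group, one in pairs:
    service_counts[group] += one

# PySpark equivalent, assuming filtered_data is an RDD of lines:
#   filtered_data.map(lambda l: (classify_service(l), 1)).reduceByKey(lambda a, b: a + b)
print(dict(service_counts))
```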
Step 2: Print the list of unique IPs by creating a function that extracts the IP from each record in the RDD.
>>> def extract_ip(line):
...     ip = line.split(" ")[1]
...     return ip
...
>>> unique_ips = filtered_data.map(extract_ip).distinct()
>>> unique_ips.count()
3952
>>> hcm_records = enriched_log_data.filter(lambda log: log[2] == "Ho Chi Minh City")
>>> print(f"Number of records from Ho Chi Minh City: {hcm_records.count()}")
Number of records from Ho Chi Minh City: 217212
Listing 5: Number of records from Ho Chi Minh City
>>> hanoi_traffic = enriched_log_data.filter(lambda log: log[2] == "Hanoi").map(lambda log: log[5]).reduce(lambda a, b: a + b)
>>> print(f"Total traffic from Hanoi: {hanoi_traffic}")
Total traffic from Hanoi: 204245300091
Listing 6: Total traffic from Hanoi
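Listings 5 and 6 follow a filter, map, reduce pattern over the enriched records. A local sketch of the same aggregation is given below. It assumes, as the lambdas in those listings suggest, that index 2 holds the city and index 5 the traffic size. The tuples are invented, the remaining positions are placeholders, and total_traffic is a hypothetical helper name.

```python
from functools import reduce

# Invented enriched records; only positions 2 (city) and 5 (size) matter here.
enriched = [
    (0.012, "10.0.0.1", "Hanoi", "HIT", "/a.mp4", 1024),
    (0.034, "10.0.0.2", "Ho Chi Minh City", "MISS", "/b.mp4", 2048),
    (0.020, "10.0.0.3", "Hanoi", "HIT", "/c.mp4", 4096),
]


def total_traffic(records, city):
    """Sum the size field (index 5) over all records whose city (index 2) matches."""
    sizes = (rec[5] for rec in records if rec[2] == city)
    # The initial value 0 keeps the result well-defined when no record matches.
    return reduce(lambda a, b: a + b, sizes, 0)


hanoi_total = total_traffic(enriched, "Hanoi")
hcm_count = sum(1 for rec in enriched if rec[2] == "Ho Chi Minh City")
danang_total = total_traffic(enriched, "Da Nang")
print(hanoi_total, hcm_count, danang_total)
```

On an RDD, reduce without an initial value fails when the filtered RDD is empty, so fold(0, ...) or sum() may be a safer choice for cities that might have no records.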