Multiple portion selection of a string in Python

Question

I have a log file as below:

12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:27 +0330 SOCK5.6699 00094 user156 32.99.193.2:51242 1.1.1.1:443 715 388 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40048 1.1.1.1:443 18105 29029 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40070 1.1.1.1:443 674 26805 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:24 +0330 SOCK5.6699 00000 user143 112.199.63.119:60682 1.1.1.1:443 475 445 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:37 +0330 SOCK5.6699 00000 user105 191.184.66.98:40102 1.1.1.1:443 12913 18780 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:42 +0330 SOCK5.6699 00000 user143 112.199.63.119:60688 1.1.1.1:443 4530 34717 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:44 +0330 SOCK5.6699 00000 user127 212.167.145.49:2972 1.1.1.1:443 827 267 0 CONNECT 1.1.1.1:443

My goal is to extract two portions of this log file:

Username
IP address of the user source

Below is a sample line containing the portions of data needed.

12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443

So I wrote a Python script to extract both items, store them in separate lists, and then join them with the zip function.

import pprint
import collections
iplist=[]
for l in data:
    ip_port=l[53:71]
    iplist.append(ip_port.split(':')[0])
userlist=[]
for u in data:
    user=u[42:52]
    userlist.append(user.replace(" ", ""))
a=list(zip(iplist,userlist))
most_ip=collections.Counter(a).most_common(5)
pprint.pprint(most_ip)

This code works fine, and I'm able to get the most frequently used IPs with their corresponding usernames. I should also mention that I didn't use the re module, since it was also listing the second IP (the destination IP, 1.1.1.1, which I don't care about).

Question: Is there a neater way to write this code?
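
On the re point above: a pattern can be anchored to the start of each line so that only the first address (the source) is captured and the destination is never looked at. This is a minimal sketch, assuming every line follows the field layout of the sample lines; the pattern and the name `extract` are illustrative and not part of the original script.

```python
import re

# Assumed layout, taken from the sample lines: date, time, UTC offset,
# service, code, username, source ip:port, destination ip:port, ...
LINE_RE = re.compile(
    r"^(?:\S+\s+){5}"                        # skip the first five fields
    r"(?P<user>\S+)\s+"                      # sixth field: username
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3}):\d+"   # seventh field: source IP, port dropped
)


def extract(line):
    """Return (ip, user) for one log line, or None if the line does not match."""
    m = LINE_RE.match(line)
    return (m.group("ip"), m.group("user")) if m else None


sample = ("12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 "
          "97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443")
print(extract(sample))
```

Because `match` only searches from the beginning of the line and the pattern stops after the seventh field, the later 1.1.1.1 addresses are not captured.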

@dirkt That is a Linux/Unix-based command; I'm trying to use Python, since I want to run the script on some non-Unix systems as well. – Zareh Kasparian, Feb 13 '22 at 09:23
This is probably a better fit for StackOverflow since it's about programming. Not sure if it's an answer to your actual problem, but there are lots of tools for parsing logs out there, such as Elastic's Filebeat utility, among many others. You could also look at PyGrok. – shearn89, Feb 14 '22 at 08:58
Also, you're doing two iterations through the data, which is slow. Do one, split each line on spaces, pull out the fields you need by index, and add them to a dictionary. You'll do it in half the time. – shearn89, Feb 14 '22 at 09:00
@shearn89 Thanks shearn89, you made a good point. I have edited my code; it looks simpler and much clearer now. – Zareh Kasparian, Feb 14 '22 at 17:49

score 1 · Answer 1 · answered Feb 14 '22 at 17:52

Following the suggestion from shearn89, I have edited my code as below.

It is much simpler, with a single iteration:

import collections
import pprint
userlist=[]
iplist=[]
for i in data:
    ip=i.split(' ')[6].split(':')[0]
    user=i.split(' ')[5]
    iplist.append(ip)
    userlist.append(user)
top_used=collections.Counter(zip(iplist,userlist)).most_common(5)
pprint.pprint(top_used)

score 1 · Accepted Answer · answered Feb 17 '22 at 23:36

There are still several ways to optimize your new code. The two things that stand out most to me:

Do not call split() more than once for each line of the log. Call it once and store the result in a variable, because every call takes some time (not much, but it adds up the more data you process).

s = i.split(' ')
ip=s[6].split(':')[0]
user=s[5]
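
One way to check the cost of the repeated split() is the standard timeit module. This is only a rough sketch: the function names `split_twice` and `split_once` are made up for illustration, and the actual timings depend on the machine and the data.

```python
import timeit

line = ("12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 "
        "97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443")


def split_twice(i):
    # Splits the whole line two times, as in the previous version.
    ip = i.split(' ')[6].split(':')[0]
    user = i.split(' ')[5]
    return ip, user


def split_once(i):
    # Splits the line once and reuses the resulting list.
    s = i.split(' ')
    return s[6].split(':')[0], s[5]


# Both variants return the same values; only the amount of work differs.
print(split_twice(line) == split_once(line))
print("twice:", timeit.timeit(lambda: split_twice(line), number=100_000))
print("once: ", timeit.timeit(lambda: split_once(line), number=100_000))
```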

Why create two lists and then zip them together afterwards? Just store the tuples directly in a list:

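import collections

pairs = []
for i in data:
    s = i.split(' ')           # split each line only once
    ip = s[6].split(':')[0]    # source address without the port
    user = s[5]
    pairs.append((ip, user))   # store the (ip, user) tuple directly
top_used = collections.Counter(pairs).most_common(5)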
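
Putting both points together, a self-contained version might look like the sketch below. It assumes `data` is an iterable of log lines (here four lines copied from the question) and that the username and source address are always the sixth and seventh whitespace-separated fields. The helper name `ip_user_pairs` is hypothetical. A generator feeds the Counter directly, so no intermediate list is built.

```python
import collections
import pprint

# Assumption: `data` is any iterable of log lines; an open file object
# would work the same way as this list.
data = [
    "12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443",
    "12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40048 1.1.1.1:443 18105 29029 0 CONNECT 1.1.1.1:443",
    "12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40070 1.1.1.1:443 674 26805 0 CONNECT 1.1.1.1:443",
    "12-02-2022 15:20:24 +0330 SOCK5.6699 00000 user143 112.199.63.119:60682 1.1.1.1:443 475 445 0 CONNECT 1.1.1.1:443",
]


def ip_user_pairs(lines):
    """Yield one (ip, user) tuple per log line."""
    for line in lines:
        fields = line.split()            # split once per line
        if len(fields) < 7:
            continue                     # skip blank or truncated lines
        user = fields[5]
        ip = fields[6].split(':')[0]     # drop the port
        yield ip, user


top_used = collections.Counter(ip_user_pairs(data)).most_common(5)
pprint.pprint(top_used)
```

Calling split() without arguments, instead of split(' '), also tolerates runs of several spaces or tabs between fields, which may or may not match the real log format.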

Misc08

Thanks for your code. Is using a tuple in this case just for speeding up the code? – Zareh Kasparian Feb 18 '22 at 10:31
@ZarehKasparian Indeed, creating the tuples directly speeds up the code, since you no longer need the zip function, which basically creates tuples from those two lists; see https://docs.python.org/3/library/functions.html#zip – Misc08 Feb 18 '22 at 13:48
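
To illustrate the point about zip, here is a small sketch with values taken from the sample log (the variable names `zipped`, `direct` and `counts` are illustrative). Zipping two parallel lists yields exactly the tuples that could have been appended directly. Apart from speed, tuples are hashable, which is what lets Counter use them as keys.

```python
import collections

iplist = ["97.251.107.125", "191.184.66.98", "191.184.66.98"]
userlist = ["user144", "user105", "user105"]

# What zip() produces from the two parallel lists.
zipped = list(zip(iplist, userlist))

# The same tuples, built one by one without zip().
direct = []
for idx in range(len(iplist)):
    direct.append((iplist[idx], userlist[idx]))

print(zipped == direct)

# Tuples are hashable, so Counter can count identical (ip, user) pairs.
counts = collections.Counter(direct)
print(counts.most_common(1))
```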
