I have a log file as below:

12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:27 +0330 SOCK5.6699 00094 user156 32.99.193.2:51242 1.1.1.1:443 715 388 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40048 1.1.1.1:443 18105 29029 0 CONNECT 1.1.1.1:443
12-02-2022 15:18:56 +0330 SOCK5.6699 00000 user105 191.184.66.98:40070 1.1.1.1:443 674 26805 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:24 +0330 SOCK5.6699 00000 user143 112.199.63.119:60682 1.1.1.1:443 475 445 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:37 +0330 SOCK5.6699 00000 user105 191.184.66.98:40102 1.1.1.1:443 12913 18780 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:42 +0330 SOCK5.6699 00000 user143 112.199.63.119:60688 1.1.1.1:443 4530 34717 0 CONNECT 1.1.1.1:443
12-02-2022 15:20:44 +0330 SOCK5.6699 00000 user127 212.167.145.49:2972 1.1.1.1:443 827 267 0 CONNECT 1.1.1.1:443

My goal is to extract two portions of this log file:

  1. Username
  2. IP address of the user source

Below is a sample line; the portions needed are the username (user144) and the source IP address (97.251.107.125).

12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443

So I wrote a Python script to extract both items, store them in separate lists, and then join them with the zip function.

import pprint
import collections

iplist=[]
for l in data:
    ip_port=l[53:71]
    iplist.append(ip_port.split(':')[0])

userlist=[]
for u in data:
    user=u[42:52]
    userlist.append(user.replace(" ", ""))

a=list(zip(iplist,userlist))
most_ip=collections.Counter(a).most_common(5)
pprint.pprint(most_ip)

This code works fine, and I'm able to get the most-used IPs with their corresponding usernames. I should also mention that I didn't use the re module, since it was also matching the second IP (the destination IP, 1.1.1.1), which I don't care about.

Question: Is there any other (neater) way to write this code?
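On the remark about the re module: a pattern that is anchored to the field positions can pick up only the source address and ignore the destination. The following is a minimal sketch, assuming every line has five whitespace-separated fields before the username and that the source is an IPv4 address followed by a port. The names LINE_RE and extract_user_and_source_ip are made up for illustration.

```python
import re

# Skip the first five fields, then capture the username and the source IP.
# Assumption: fields are whitespace-separated and the source is "a.b.c.d:port".
LINE_RE = re.compile(r"(?:\S+\s+){5}(\S+)\s+([\d.]+):\d+")

def extract_user_and_source_ip(line):
    """Return (user, source_ip), or None if the line does not match."""
    match = LINE_RE.match(line)  # match() anchors at the start of the line
    return match.groups() if match else None

sample = ("12-02-2022 15:18:22 +0330 SOCK5.6699 00000 user144 "
          "97.251.107.125:38605 1.1.1.1:443 51766 169369 0 CONNECT 1.1.1.1:443")
print(extract_user_and_source_ip(sample))
```

Because the pattern stops after the first address, the destination 1.1.1.1 never enters the result.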

  • You could have used cut (commandline tool). – dirkt Feb 12 '22 at 20:03
  • @dirkt That is a Linux/Unix-based command; I'm trying to use Python, since I want to use the script on some non-Unix systems as well. – Zareh Kasparian Feb 13 '22 at 09:23
  • This is probably a better fit for StackOverflow since it's about programming. Not sure if it's an answer to your actual problem but there are lots of tools to parse logs out there, such as the Elastic FileBeats utility, among many others. You could also look at PyGrok. – shearn89 Feb 14 '22 at 08:58
  • Also, you're doing 2 iterations through the data which is slow. Do one, split each line on spaces, pull out the fields you need by index and add them to the dictionary. You'll do it in half the time. – shearn89 Feb 14 '22 at 09:00
  • @shearn89 Thanks shearn89, you made a good point. I have edited my code; it looks simpler and much clearer now. – Zareh Kasparian Feb 14 '22 at 17:49

2 Answers

1

Following shearn89's suggestion, I have edited my code as below. It is much simpler, with a single iteration:

userlist=[]
iplist=[]
for i in data:
    ip=i.split(' ')[6].split(':')[0]
    user=i.split(' ')[5]
    iplist.append(ip)
    userlist.append(user)

top_used=collections.Counter(zip(iplist,userlist)).most_common(5)
pprint.pprint(top_used)
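One caveat about split(' '): the fixed slice offsets in the original script suggest the real log may pad its columns with several spaces. That is an assumption, since the sample shown here uses single spaces. If it holds, splitting on a single space yields empty strings and shifts the field indexes, while split() with no argument splits on any run of whitespace. A small sketch with a made-up padded fragment:

```python
# Hypothetical fragment of a log line with a padded username column.
padded = "00000 user144    97.251.107.125:38605"

by_single_space = padded.split(' ')  # every single space is a separator
by_whitespace = padded.split()       # runs of whitespace count as one separator

print(by_single_space)
print(by_whitespace)
```

With split(), the username and source address stay at the same index whether or not the columns are padded.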

1

There are still several ways to optimize your new code. These are the two things that stand out most to me:

Do not execute split() more than once for each line of the log. Execute split() once and store the result in a variable, because each call to this function takes some time (not much, but it adds up the more data you process).

s = i.split(' ')
ip=s[6].split(':')[0]
user=s[5]

Why create two lists and then zip them together afterwards? Just store the tuples directly in a list:

l = []
for i in data:
    s = i.split(' ')
    ip=s[6].split(':')[0]
    user=s[5]
    l.append((ip, user))
top_used=collections.Counter(l).most_common(5)
Misc08
  • Thanks for your code. Is using a tuple in this case just for speeding up the code? – Zareh Kasparian Feb 18 '22 at 10:31
  • @ZarehKasparian Indeed, creating the tuples directly speeds up the code, since you don't need the zip function anymore, which basically creates tuples from those two lists; see https://docs.python.org/3/library/functions.html#zip – Misc08 Feb 18 '22 at 13:48