Counting How Often the Same Elements Appear Across Web Pages with Python

No cage in the world should be able to hold a real man, with one exception: the girl you like.

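Preface

When verifying vulnerabilities, you sometimes need to work out which signatures the vulnerable pages have in common (for example, how many times the word svn appears across those pages). Doing this by hand is slow and tiring, so I wrote about twenty lines of code that count how often the same elements appear across multiple websites.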

Topics covered

  1. Word segmentation with jieba
  2. Counter from the collections module
  3. Formatted output with PrettyTable

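Program logic

The script reads a list of URLs from a text file, downloads each page, segments the page text with jieba, counts how often each token occurs across all pages, and prints the counts as a table that is also appended to result.txt.

Code walkthrough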

import sys
import jieba
import random
import requests
from collections import Counter
from prettytable import PrettyTable
import time
import os
# Python 2 only: make UTF-8 the default codec for implicit str/unicode conversions
reload(sys)
sys.setdefaultencoding('utf-8')

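Search-engine-mode segmentation

ll=[]
# The full code below picks a random User-Agent per request; a fixed one keeps this snippet self-contained
headers = {'User-Agent': 'Mozilla/5.0'}
def scan(url):
    try:
        # Decode the body as UTF-8, then re-encode as GBK, dropping characters that do not convert
        r = requests.get(url=url, headers=headers, timeout=5).content.decode("utf8","ignore").encode("gbk","ignore")
        # cut_for_search is jieba's search-engine mode, which also emits the shorter words inside long ones
        for x in jieba.cut_for_search(r,HMM=True):
            ll.append(x)
    except Exception as e:
        print e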
Load the URL list, then scan and segment each page

url_input = raw_input('Path to the text file of URLs to scan: ')
with open(url_input,'r') as f:
    # Strip whitespace, skip blank lines and de-duplicate the URLs
    list_url = list(set(line.strip() for line in f if line.strip()))
for url in list_url:
    print 'Scan: '+(url)
    scan(url)
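
The loading step above can also be sketched in Python 3. This is a minimal sketch, not the script's own code: `load_urls` is a hypothetical helper name, and it assumes the URL file is UTF-8 text with one URL per line. Unlike `set`, it keeps the first-seen order, so the scan follows the order of the file.

```python
def load_urls(path):
    """Return the unique, non-empty lines of a text file in first-seen order."""
    seen = set()
    urls = []
    # Assumption: one URL per line, file encoded as UTF-8
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            url = line.strip()  # drop the trailing newline and surrounding whitespace
            if url and url not in seen:
                seen.add(url)
                urls.append(url)
    return urls
```

Calling `load_urls("urls.txt")` (a hypothetical file name) would return a list that can be passed straight to the scanning loop.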

Count word frequencies and format the output

# A dict keyed by count keeps only one element per count, so rank with most_common() instead
x = PrettyTable(["Count", "Element"])
for word, count in Counter(ll).most_common():
    x.add_row([count, word])
with open ('result.txt','a+')as a:
    a.write(str(x)+'\n')
print x
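
A point worth checking when turning counts into table rows: a dict keyed by count can hold only one element per count, so elements that tie would overwrite each other. The Python 3 sketch below uses `Counter.most_common()`, which returns every (element, count) pair sorted from most to least frequent. The token list and the `rank_tokens` and `format_rows` helper names are made up for illustration, and plain string formatting stands in for PrettyTable so the snippet needs only the standard library.

```python
from collections import Counter

def rank_tokens(tokens, top=None):
    """Return (token, count) pairs, most frequent first."""
    return Counter(tokens).most_common(top)

def format_rows(rows):
    """Render (token, count) pairs as a plain two-column text table."""
    lines = ["{:>5}  {}".format("Count", "Element")]
    for token, count in rows:
        lines.append("{:>5}  {}".format(count, token))
    return "\n".join(lines)

if __name__ == "__main__":
    # Hypothetical tokens standing in for the output of the segmentation step
    tokens = ["svn", "git", "svn", "admin", "git", "svn", "login"]
    counts = Counter(tokens)
    # Keying by count loses entries: 'admin' and 'login' both occur once
    by_count = dict(zip(counts.values(), counts.keys()))
    print(len(counts), "distinct tokens, but only", len(by_count), "survive keyed by count")
    print(format_rows(rank_tokens(tokens)))
```

Passing `top=20` to `rank_tokens` would limit the table to the twenty most frequent elements, which may be easier to read when the pages are large.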

Full code

# -*- coding: utf-8 -*-
# @Time    : 2018/4/23 0023 13:17
# @Author  : Langzi
# @Blog    : www.langzi.fun
# @File    : find_vlue.py
# @Software: PyCharm
import sys
import jieba
import random
import requests
from collections import Counter
from prettytable import PrettyTable
import time
import os
reload(sys)
sys.setdefaultencoding('utf-8')
print '''

|    __   __   __  
|_, (__( |  ) (__| 
               __/ 

'''
time.sleep(5)
headerss = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
    "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
    "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"]
ll = []
def scan(url):
    try:
        UA = random.choice(headerss)
        headers = {'User-Agent': UA}
        r = requests.get(url=url, headers=headers, timeout=5).content.decode("utf8","ignore").encode("gbk","ignore")
        for x in jieba.cut_for_search(r,HMM=True):
            ll.append(x)
    except Exception as e:
        print e
url_input = raw_input('Path to the text file of URLs to scan: ')
with open(url_input,'r') as f:
    # Strip whitespace, skip blank lines and de-duplicate the URLs
    list_url = list(set(line.strip() for line in f if line.strip()))
for url in list_url:
    print 'Scan: '+(url)
    scan(url)
# A dict keyed by count keeps only one element per count, so rank with most_common() instead
x = PrettyTable(["Count", "Element"])
for word, count in Counter(ll).most_common():
    x.add_row([count, word])
with open ('result.txt','a+')as a:
    a.write(str(x)+'\n')
print x
time.sleep(10)
def get(data_str):
    for xx in list_url:
        # Handle errors per URL so one failed request does not end the whole search
        try:
            print 'Scan:' + xx
            UA = random.choice(headerss)
            headers = {'User-Agent': UA}
            r1 = requests.get(url=xx, headers=headers, timeout=5).content.decode("utf8","ignore").encode("gbk","ignore")
            if data_str in r1:
                # Append each matching URL to a file named after the keyword
                with open(str(data_str+'.txt'),'a+')as aa:
                    aa.write(xx+'\n')
        except Exception as e:
            print e
while 1:
    data_str=raw_input('Keyword to look up across the scanned sites: ')
    get(data_str)
    print 'Finished scanning for the current keyword....'
    time.sleep(10)

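Closing thoughts

When doing scan verification, reproducing vulnerabilities and hunting for samples is tedious, and picking signatures is tiring work too. The tool is not fully accurate, but it is far faster than doing the job by hand.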

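I did not originally plan to use jieba. My first idea was to split the string with split(' '), use set operations to pick out the elements the pages have in common, and then count how often each shared element occurs. That did not work as hoped: splitting on spaces is not accurate, and once the set step has found the shared elements, the data has to be traversed several more times, which wastes time and memory.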
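
The difference between the two approaches can be sketched in Python 3. This is an illustrative reconstruction under assumptions, not the discarded code itself: both function names are made up, and `str.split()` on whitespace stands in for a real tokenizer. The set-based version needs one pass to build the sets and further passes to count each shared token, while `Counter` tallies everything in one pass over the tokens.

```python
from collections import Counter

def shared_counts_with_sets(pages):
    """Find tokens present on every page, then count them in extra passes."""
    if not pages:
        return {}
    token_lists = [page.split() for page in pages]
    shared = set.intersection(*(set(tokens) for tokens in token_lists))
    # Each shared token triggers another scan of every token list
    return {tok: sum(tokens.count(tok) for tokens in token_lists) for tok in shared}

def counts_single_pass(pages):
    """Tally every token across all pages in one pass."""
    counts = Counter()
    for page in pages:
        counts.update(page.split())
    return counts
```

Note that the two functions do not answer quite the same question: the set-based one reports only tokens found on every page, while the `Counter` version reports all tokens and leaves the filtering to whoever reads the table.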

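From a tool-building point of view, merely finding the words that appear across multiple pages is not the most a tool can offer, so more features can be added. For example, after the scan finishes the program waits a moment (the text originally says 60 seconds; the full code above sleeps for 10) and then lets the user look up the URLs that contain a given word. Say the user notices in the results that 'svn' appears 15 times and wants to know which URLs contain 'SVN'. The program prompts for the word, searches the sites for it, and saves the matching URLs.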
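
The keyword lookup described above could be sketched in Python 3 as follows. This is a sketch under assumptions rather than the script's own implementation: it supposes the page bodies were kept in a dict during the first scan (the full code above downloads each page again instead), the `urls_containing` name and the example.com URLs are hypothetical, and matching is made case-insensitive so that 'svn' and 'SVN' find the same pages.

```python
def urls_containing(keyword, pages):
    """Return the URLs whose page text contains keyword, ignoring case.

    pages maps each URL to the text already downloaded for it.
    """
    needle = keyword.lower()
    if not needle:
        # An empty keyword would match every page, so treat it as no match
        return []
    return sorted(url for url, text in pages.items() if needle in text.lower())

if __name__ == "__main__":
    # Hypothetical cache of page bodies built during the first scan
    cached_pages = {
        "http://a.example.com": "<a href='/.svn/entries'>entries</a>",
        "http://b.example.com": "<h1>hello</h1>",
        "http://c.example.com": "<title>SVN repository</title>",
    }
    for url in urls_containing("svn", cached_pages):
        print(url)
```

Keeping the bodies in memory trades RAM for fewer requests, which may or may not suit a large URL list.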


Copyright notice

LangZi_Blog's by Jy Xie is licensed under a Creative Commons BY-NC-ND 4.0 International License
This article was first published on the Langzi_Blog's blog ( http://langzi.fun ). All rights reserved.
