Python Spider: 北邮用电查询系统数据

速写爬虫的一些经历

愿总有阳光照进回忆里:学生购电用电查询系统

感冒了,难得会被人挂念。谢谢,总有阳光照进回忆里,温暖的气息让人难以忘记。


某天,一位大神师兄 @Zhaoking 神秘兮兮地给我看一个网站 http://ydcx.bupt.edu.cn/

师兄说,你可以试着把它的数据爬下来。

当时开着八个线程跟着机器学习、回归分析、统计推断、探索性图形分析一堆数据分析的课程,觉得似乎把用电数据爬下来可以玩玩。 于是谨听师兄命令……

ydcx.png

首先,就是研究研究网站逻辑。打开firebug,输入1-101,回车。

ydcx1.png

显然,首先是一个post操作,但返回了一个302重定向,所以,第二个请求应该是正确的请求

试试看看

In [1]: import requests

In [2]: dorm_num = '1-101'

In [3]: r = requests.get('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num)

检查下r.content,确认在里头看到了电量和加电信息。

ydcx2.png

那么,还有这么多页这么办?我们点点看看。页码1已经不能点了,点2。

ydcx3.png

竟然变成一个post了,而且post的数据这么多,你可以抱着试试看的心理不加上这些post的数据试试。

In [5]: __EVENTTARGET = 'GridView1'

In [6]: __EVENTARGUMENT = 'Page$2'

In [7]: data = {'__EVENTTARGET': __EVENTTARGET,  '__EVENTARGUMENT': __EVENTARGUMENT}

In [9]: r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=data)

In [10]: r
Out[10]: <Response [500]>

500-–—internal error……这个错误一般表示服务器内部处理出现错误。怎么回事呢,直觉告诉我们,问题就在于那些Post的参数。

加上试试,

 In [15]: data = {'__EVENTTARGET': __EVENTTARGET,  '__EVENTARGUMENT': __EVENTARGUMENT, '__EVENTVALIDATION': '/wEWCwLMxp63CAKtsp64BAKtsuK4BAKtsva4BAKtsvq4BAKtsu64BAKtsvK4BAKtssa4BAKtssq4BALDuK6IBQKokYx1/Loh35D537/CRr+++EgM74nLP5E=', '__VIEWSTATE': '/wEPDwULLTIwNzMwNzAxOTAPZBYCAgMPZBYQZg8PFgIeBFRleHQFATFkZAIBDw8WAh8ABQEgZGQCAg8PFgIfAAUyMS0xMDEgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBkZAIDDw8WAh8ABQbnlLXku7dkZAIEDw8WAh8ABQwwMDAwMDAwMzg5NTVkZAIFDw8WAh8ABQ0xMC4yMTAuOTYuMjEwZGQCBg88KwANAQAPFgQeC18hRGF0YUJvdW5kZx4LXyFJdGVtQ291bnQCjQNkFgJmD2QWFgIBD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTI0IDA6MDA6MDBkZAICDw8WAh8ABQMyMTVkZAICD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIzIDA6MDA6MDBkZAICDw8WAh8ABQMyMThkZAIDD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIyIDA6MDA6MDBkZAICDw8WAh8ABQMyMjBkZAIED2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIxIDA6MDA6MDBkZAICDw8WAh8ABQMyMjNkZAIFD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIwIDA6MDA6MDBkZAICDw8WAh8ABQMyMjVkZAIGD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE5IDA6MDA6MDBkZAICDw8WAh8ABQMyMjhkZAIHD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE4IDA6MDA6MDBkZAICDw8WAh8ABQMyMzBkZAIID2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE3IDA6MDA6MDBkZAICDw8WAh8ABQMyMzJkZAIJD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE2IDA6MDA6MDBkZAICDw8WAh8ABQMyMzRkZAIKD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE1IDA6MDA6MDBkZAICDw8WAh8ABQMyMzZkZAILDw8WAh4HVmlzaWJsZWhkZAIIDzwrAA0BAA8WBB8BZx8CAgVkFgJmD2QWDgIBD2QWDmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTcgMTQ6MTM6NDBkZAICDw8WAh8ABQMyNTBkZAIDDw8WAh8ABQMxMjBkZAIEDw8WAh8ABQfotK0g55S1ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCAg9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFETIwMTQtMi0xOCA4OjI3OjQ2ZGQCAg8PFgIfAAUDMjgwZGQCAw8PFgIfAAUBMGRkAgQPDxYCHwAFB+WFjSDotLlkZAIFDw8WAh8ABQzliqDnlLXlrozmiJBkZAIGDw8WAh8ABQnlvKDogIHluIhkZAIDD2QWDmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAUQMjAxMy05LTMgOTo1NTowOGRkAgIPDxYCHwAFAzI4MGRkAgMPDxYCHwAFATBkZAIEDw8WAh8ABQflhY0g6LS5ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCBA9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFEjIwMTMtMy0yMCAxNjozMzoyNWRkAgIPDxYCHwAFAzE4MGRkAgMPDxYCHwAFATBkZAIEDw8WAh8ABQflhY0g6LS5ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCBQ9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFETIwMTMtMy0xIDE1OjM5OjIxZGQCAg8PFgIfAAUDMTAwZGQCAw8PFgIfAAUBMGRkAgQPDxYCHwAFB+WFjSDotLlkZAIFDw8WAh8ABQzliqDnlLXlrozmiJBkZAIGDw8WAh8ABQnlvKDogIHluIhkZAIGDw8WAh8DaGRkAgcPDxYCHwNoZGQYAgUJR3JpZFZpZXcyDzwrAAoBCAIBZAUJR3JpZFZpZXcxDzwrAAoBCAIoZG6dKHZt7NdjJRdl8NOMCRx8QVCP'} 

In [16]: r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=data)

In [17]: r
Out[17]: <Response [200]>

成功的返回了需要的用电数据,你可以检查下r.content看看

接下来的问题在于,这些参数从哪里来的?

简单谷歌下和检查下网页源代码,可以看到

ydcx6.png

那么逻辑就清晰了,先get请求某个寝室的页面,获取对应的post参数,然后一页一页往后翻就是。

下面讲讲如何从html源码中提取信息,用re当然可以,但是xpath这种东西也许更好用。比如

## parse the content
root = fromstring(r.content)
## extract __VIEWSTATE
__VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
## extract __EVENTVALIDATION
__EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]

相比写这么个正则简单优雅很多,不过所谓quick and dirty,不管黑猫白猫……:

re.search('id="__VIEWSTATE" value="([^"]+)", r.content).group(1)

接下来的问题在于,页码总数我不知道,最后的一个…符号可以进入下一个十页。

ydcx4.png
ydcx5.png

那么,反正最后经过trial and error以一种很quick and dirty的方式work around页码这个问题……

我的解决方案,一个循环,页码不断加一,并且读取当前页右下角所有显示页码。

如果当前页面的那些页码cpn中最后一个,比我下一个想要抓取的np页码小一,或者当前页码个数不是十个,就肯定是最后几页了,就可以抓取然后break不继续往后抓取了。

if (int(cpn[-1]) = np - 1) or (int(cpn[-1]) % 10 ! 0):

反正……最后可以运行……

ydcx7.png

多么不堪入目的实现,随便看看玩

#!/usr/bin/env python
# -*- coding: utf-8 -*-

"""
用电信息
http://ydcx.bupt.edu.cn/Default.aspx
"""

__author__ = "Reverland"

import socket
from gevent import monkey
monkey.patch_all()
import requests
from lxml.html import fromstring
import re
import os

timeout = 30
socket.setdefaulttimeout(timeout)

try:
    os.makedirs('data')
except:
    pass


def buy_info(dorm_num, root):
    # skip for downloaded
    if os.path.exists('data/' + dorm_num + '.csv'):
        print dorm_num + 'buy info downloaded...'
        return
    data = root.xpath('//table[@id="GridView2"]/tr/td/font/text()')
    # header
    with open('data/' + dorm_num + '_buy.csv', 'wb') as f:
        f.write('timestamp, electric_increase, money, charge_bool, state, operator\n')
    for i in range(len(data)):
        # timestamp, electric_increase, money, charge_bool, state, operator
        if i % 7 < 6:
            with open('data/' + dorm_num + '_buy.csv', 'a') as f:
                f.write(data[i].encode('iso-8859-1').strip() + ',')
        if i % 7 == 6:
            with open('data/' + dorm_num + '_buy.csv', 'a') as f:
                f.write(data[i].encode('iso-8859-1').strip() + '\n')



def parse_dorm_elec(dorm_num):
    # skip for downloaded
    if os.path.exists('data/' + dorm_num + '.csv'):
        print dorm_num + ' downloaded...'
        return
    print "parsing " + dorm_num + ' now...'
    data = []
    r = requests.get('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num)
    ## parse the content
    root = fromstring(r.content)
    ## extract __VIEWSTATE
    __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
    ## extract __EVENTVALIDATION
    __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
    ## __EVENTTARGET is fixed
    __EVENTTARGET = 'GridView1'
    buy_info(dorm_num, root)
    ## define headers
    header = {
    'Content-Type': 'application/x-www-form-urlencoded'
    }
    np = 2

    while (1):
        # First extract some value
        # print "Retrieving data from page", np
        __EVENTARGUMENT = 'Page$' + str(np)
        payload = {
        '__EVENTARGUMENT': __EVENTARGUMENT,
        '__EVENTTARGET': __EVENTTARGET,
        '__EVENTVALIDATION': __EVENTVALIDATION,
        '__VIEWSTATE': __VIEWSTATE}
        r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
        ## get electric data
        data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
        ## get current page numbers
        cpn = re.findall('\)"><font color="#4A3C8C">([\d]+)', r.content)
        if (int(cpn[-1]) == np - 1) or (int(cpn[-1]) % 10 != 0):
            cpn = range(np+1, int(cpn[-1]) + 1)
            ## next page
            for p in cpn:
                ## parse the content
                root = fromstring(r.content)
                ## extract __VIEWSTATE
                __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
                ## extract __EVENTVALIDATION
                __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
                __EVENTARGUMENT = 'Page$'+ str(p)
                payload = {
                '__EVENTARGUMENT': __EVENTARGUMENT,
                '__EVENTTARGET': __EVENTTARGET,
                '__EVENTVALIDATION': __EVENTVALIDATION,
                '__VIEWSTATE': __VIEWSTATE}
                r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
                data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
                # print "Retrieving data from page", p
            break
        ## next page
        for p in cpn:
            ## parse the content
            root = fromstring(r.content)
            ## extract __VIEWSTATE
            __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
            ## extract __EVENTVALIDATION
            __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
            __EVENTARGUMENT = 'Page$' + p
            payload = {
            '__EVENTARGUMENT': __EVENTARGUMENT,
            '__EVENTTARGET': __EVENTTARGET,
            '__EVENTVALIDATION': __EVENTVALIDATION,
            '__VIEWSTATE': __VIEWSTATE}
            r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
            data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
            # print "Retrieving data from page", p
        # check if reach end
        np = int(cpn[-1]) + 1
    # write header
    with open('data/' + dorm_num + '.csv', 'wb') as f:
        f.write('timestamp' + ',' + 'electric_remain\n')
    for i in range(len(data)):
        if i % 3 == 1:
            with open('data/' + dorm_num + '.csv', 'a') as f:
                f.write(data[i] + ',')
        if i % 3 == 2:
            with open('data/' + dorm_num + '.csv', 'a') as f:
                f.write(data[i] + '\n')
    return data
    ## for example
    # __EVENTVALIDATION = '/wEWCwKhncT7DwKtsp64BAKtsuK4BAKtsva4BAKtsvq4BAKtsu64BAKtsvK4BAKtssa4BAKtssq4BALDuK6IBQKokYx1vYI/vd4i5UCVuVYMSo/l30x1o/g='
    # __EVENTARGUMENT = 'Page$2'

#data = parse_dorm_elec(dorm_num)
def parse_dorm_elec_wrap(dorm_num):
    try:
        parse_dorm_elec(dorm_num)
    except:
        print "parsing " + dorm_num + " error..."
        pass
# ----------------------------------------------------
# # dorm 10
# dorms = []
# # layer 1
# dorms += map(lambda x: '10-1' + str(x).zfill(2), range(1, 17))
# # layer 2
# dorms += map(lambda x: '10-2' + str(x).zfill(2), range(1, 21))
# # layer 3 to 7
# for l in range(3, 8):
#     dorms += map(lambda x: '10-' + str(l) + str(x).zfill(2), range(1, 72))
# # layer 8 to 13
# for l in range(8, 14):
#     dorms += map(lambda x: '10-' + str(l) + str(x).zfill(2) , range(1, 53))
# ------------------------------------------------------

#--------------------
# dorm 1
dorms = []
# layer 1
dorms += map(lambda x: '1-' + str(1) + str(x).zfill(2), range(1, 24))
# layer 2,5
for l in range(2, 6):
    dorms += map(lambda x: '1-' + str(l) + str(x).zfill(2), range(1, 27))
#--------------------

# for d in dorms:
#     try:
#         parse_dorm_elec_wrap(d)
#     except:
#         try:
#             parse_dorm_elec_wrap(d)
#         except:
#             print "Error with ", d
#             pass

from gevent.pool import Pool
pool = Pool(5)
pool.join(timeout=timeout)
pool.map(parse_dorm_elec_wrap, dorms)
#
# print dorms
# parse_dorm_elec('10-1220')