Python Spider: Data from BUPT's Electricity Usage Query System

Published on Apr 03, 2015

May sunshine always find its way into my memories: the student electricity purchase and usage query system

I have caught a cold, and it is a rare thing to be thought of by someone. Thank you; may sunshine always find its way into my memories, for that warmth is hard to forget.


One day, a guru of a senior, @Zhaoking, mysteriously showed me a website: http://ydcx.bupt.edu.cn/

He said: you could try scraping its data.

At the time I was running eight threads of data-analysis coursework (machine learning, regression analysis, statistical inference, exploratory graphical analysis), and it seemed the electricity usage data might be fun to play with. So I dutifully followed the senior's orders...

ydcx.png

First, figure out how the site works. Open Firebug, type in 1-101, and hit Enter.

ydcx1.png

Clearly the first request is a POST, but it returns a 302 redirect, so the second request should be the one that actually carries the data.

Let's give it a try:

    In [1]: import requests

    In [2]: dorm_num = '1-101'

    In [3]: r = requests.get('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num)

Inspect r.content and confirm that the electricity balance and recharge records are in there.

ydcx2.png
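
To check programmatically instead of eyeballing the HTML, you can grep r.content for the table cells that hold the readings. The regex below is the same one used in the full script further down; the only assumption is that page 1 renders its rows the same way as the later pages:

    import re

    # each reading row renders its cells as <td align="center"><font color="#4A3C8C">...</font>
    cells = re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
    print cells[:6]   # room number, timestamp, remaining electricity for the first rows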

But there are still all these pages; what to do? Let's click around. Page 1 can no longer be clicked, so click 2.

ydcx3.png

It has turned into a POST, and with a lot of POST data at that. Just to see what happens, you can try leaving all of that POST data out.

    In [5]: __EVENTTARGET = 'GridView1'

    In [6]: __EVENTARGUMENT = 'Page$2'

    In [7]: data = {'__EVENTTARGET': __EVENTTARGET,  '__EVENTARGUMENT': __EVENTARGUMENT}

    In [9]: r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=data)

    In [10]: r
    Out[10]: <Response [500]>

500, Internal Server Error. This generally means the server hit an error while processing the request. What went wrong? Intuition says the problem lies in those POST parameters.

Add them and try again:

     In [15]: data = {'__EVENTTARGET': __EVENTTARGET,  '__EVENTARGUMENT': __EVENTARGUMENT, '__EVENTVALIDATION': '/wEWCwLMxp63CAKtsp64BAKtsuK4BAKtsva4BAKtsvq4BAKtsu64BAKtsvK4BAKtssa4BAKtssq4BALDuK6IBQKokYx1/Loh35D537/CRr+++EgM74nLP5E=', '__VIEWSTATE': '/wEPDwULLTIwNzMwNzAxOTAPZBYCAgMPZBYQZg8PFgIeBFRleHQFATFkZAIBDw8WAh8ABQEgZGQCAg8PFgIfAAUyMS0xMDEgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBkZAIDDw8WAh8ABQbnlLXku7dkZAIEDw8WAh8ABQwwMDAwMDAwMzg5NTVkZAIFDw8WAh8ABQ0xMC4yMTAuOTYuMjEwZGQCBg88KwANAQAPFgQeC18hRGF0YUJvdW5kZx4LXyFJdGVtQ291bnQCjQNkFgJmD2QWFgIBD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTI0IDA6MDA6MDBkZAICDw8WAh8ABQMyMTVkZAICD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIzIDA6MDA6MDBkZAICDw8WAh8ABQMyMThkZAIDD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIyIDA6MDA6MDBkZAICDw8WAh8ABQMyMjBkZAIED2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIxIDA6MDA6MDBkZAICDw8WAh8ABQMyMjNkZAIFD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTIwIDA6MDA6MDBkZAICDw8WAh8ABQMyMjVkZAIGD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE5IDA6MDA6MDBkZAICDw8WAh8ABQMyMjhkZAIHD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE4IDA6MDA6MDBkZAICDw8WAh8ABQMyMzBkZAIID2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE3IDA6MDA6MDBkZAICDw8WAh8ABQMyMzJkZAIJD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE2IDA6MDA6MDBkZAICDw8WAh8ABQMyMzRkZAIKD2QWBmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTE1IDA6MDA6MDBkZAICDw8WAh8ABQMyMzZkZAILDw8WAh4HVmlzaWJsZWhkZAIIDzwrAA0BAA8WBB8BZx8CAgVkFgJmD2QWDgIBD2QWDmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAURMjAxNC04LTcgMTQ6MTM6NDBkZAICDw8WAh8ABQMyNTBkZAIDDw8WAh8ABQMxMjBkZAIEDw8WAh8ABQfotK0g55S1ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCAg9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFETIwMTQtMi0xOCA4OjI3OjQ2ZGQCAg8PFgIfAAUDMjgwZGQCAw8PFgIfAAUBMGRkAgQPDxYCHwAFB+WFjSDotLlkZAIFDw8WAh8ABQzliqDnlLXlrozmiJBkZAIGDw8WAh8ABQnlvKDogIHluIhkZAIDD2QWDmYPDxYCHwAFMjEtMTAxICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgZGQCAQ8PFgIfAAUQMjAxMy05LTMgOTo1NTowOGRkAgIPDxYCHwAFAzI4MGRkAgMPDxYCHwAFATBkZAIEDw8WAh8ABQflhY0g6LS5ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCBA9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFEjIwMTMtMy0yMCAxNjozMzoyNWRkAgIPDxYCHwAFAzE4MGRkAgMPDxYCHwAFATBkZAIEDw8WAh8ABQflhY0g6LS5ZGQCBQ8PFgIfAAUM5Yqg55S15a6M5oiQZGQCBg8PFgIfAAUJ5byg6ICB5biIZGQCBQ9kFg5mDw8WAh8ABTIxLTEwMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIGRkAgEPDxYCHwAFETIwMTMtMy0xIDE1OjM5OjIxZGQCAg8PFgIfAAUDMTAwZGQCAw8PFgIfAAUBMGRkAgQPDxYCHwAFB+WFjSDotLlkZAIFDw8WAh8ABQzliqDnlLXlrozmiJBkZAIGDw8WAh8ABQnlvKDogIHluIhkZAIGDw8WAh8DaGRkAgcPDxYCHwNoZGQYAgUJR3JpZFZpZXcyDzwrAAoBCAIBZAUJR3JpZFZpZXcxDzwrAAoBCAIoZG6dKHZt7NdjJRdl8NOMCRx8QVCP'} 

    In [16]: r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=data)

    In [17]: r
    Out[17]: <Response [200]>

It successfully returns the electricity data we need; you can inspect r.content to verify.

The next question is: where do these parameters come from?

A quick Google search and a look at the page source reveal the answer:

ydcx6.png
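
You can also list the hidden form fields straight from the response instead of reading them off the page source. A minimal sketch, continuing the requests session above (the field names are simply whatever the page happens to contain):

    from lxml.html import fromstring

    r = requests.get('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num)
    root = fromstring(r.content)
    hidden = {e.get('name'): e.get('value')
              for e in root.xpath('//input[@type="hidden"]')}
    print hidden.keys()   # __VIEWSTATE and __EVENTVALIDATION should be among them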

Now the logic is clear: first send a GET request for a given dorm room's page, pull out the corresponding POST parameters, and then flip through the pages one by one.

Next, how to extract information from the HTML source. re certainly works, but something like XPath may be handier. For example:

    ## parse the content
    root = fromstring(r.content)
    ## extract __VIEWSTATE
    __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
    ## extract __EVENTVALIDATION
    __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]

That is a lot simpler and more elegant than writing the equivalent regex, but quick and dirty also does the job (black cat or white cat...):

    re.search('id="__VIEWSTATE" value="([^"]+)"', r.content).group(1)
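
Putting the two steps together, fetching an arbitrary results page for one room might look like the following. This is a simplified sketch rather than the script used below, and it assumes the hidden fields are always present in the previous response:

    import requests
    from lxml.html import fromstring

    BASE = 'http://ydcx.bupt.edu.cn/see.aspx?useid='

    def fetch_page(dorm_num, page, prev_response):
        """POST the form for `page`, reusing the hidden fields of `prev_response`."""
        root = fromstring(prev_response.content)
        payload = {
            '__EVENTTARGET': 'GridView1',
            '__EVENTARGUMENT': 'Page$' + str(page),
            '__VIEWSTATE': root.xpath('//input[@name="__VIEWSTATE"]/@value')[0],
            '__EVENTVALIDATION': root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0],
        }
        return requests.post(BASE + dorm_num, data=payload)

    # usage: GET page 1 first, then walk forward page by page
    r = requests.get(BASE + '1-101')
    r2 = fetch_page('1-101', 2, r)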

The next problem is that I don't know the total number of pages, and the trailing ... link jumps to the next block of ten pages.

ydcx4.png
ydcx5.png

Anyway, after some trial and error I worked around the paging problem in a rather quick and dirty way...

My solution: a loop that keeps incrementing the page number and reads all of the page numbers shown at the bottom right of the current page.

If the last of those page numbers, cpn, is one less than the next page np I want to fetch, or the number of page links shown is not ten, then these must be the last few pages: grab them and break instead of crawling any further.

    if (int(cpn[-1]) == np - 1) or (int(cpn[-1]) % 10 != 0):
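
To make the stopping rule easier to reason about on its own, here is a tiny helper, hypothetical and not part of the original script, that implements the same condition:

    def is_last_block(cpn, np):
        """True if the page links `cpn` (a list of digit strings) show that the
        page `np` just fetched sits in the final block of pages."""
        last_linked = int(cpn[-1])
        # either np itself is the last page (no link beyond np - 1), or the
        # block of links is a partial one that does not end on a multiple of 10
        return last_linked == np - 1 or last_linked % 10 != 0

    # examples (hypothetical page-link lists):
    # is_last_block(['1', '2', '3'], 4)  -> True,  page 4 is the last page
    # is_last_block(['1', '3', '4', '5', '6', '7', '8', '9', '10'], 2) -> False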

Anyway... in the end it runs...

ydcx7.png

What an eyesore of an implementation; have a look just for fun:

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    """
    用电信息
    http://ydcx.bupt.edu.cn/Default.aspx
    """

    __author__ = "Reverland"

    import socket
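    # gevent's monkey patching should run as early as possible, before importing
    # requests, so that the sockets it uses become cooperative under gevent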
    from gevent import monkey
    monkey.patch_all()
    import requests
    from lxml.html import fromstring
    import re
    import os

    timeout = 30
    socket.setdefaulttimeout(timeout)

    try:
        os.makedirs('data')
    except:
        pass


    def buy_info(dorm_num, root):
        # skip for downloaded
        if os.path.exists('data/' + dorm_num + '_buy.csv'):
            print dorm_num + ' buy info downloaded...'
            return
        data = root.xpath('//table[@id="GridView2"]/tr/td/font/text()')
        # header
        with open('data/' + dorm_num + '_buy.csv', 'wb') as f:
            f.write('timestamp, electric_increase, money, charge_bool, state, operator\n')
        for i in range(len(data)):
            # timestamp, electric_increase, money, charge_bool, state, operator
            if i % 7 < 6:
                with open('data/' + dorm_num + '_buy.csv', 'a') as f:
                    f.write(data[i].encode('iso-8859-1').strip() + ',')
            if i % 7 == 6:
                with open('data/' + dorm_num + '_buy.csv', 'a') as f:
                    f.write(data[i].encode('iso-8859-1').strip() + '\n')



    def parse_dorm_elec(dorm_num):
        # skip for downloaded
        if os.path.exists('data/' + dorm_num + '.csv'):
            print dorm_num + ' downloaded...'
            return
        print "parsing " + dorm_num + ' now...'
        data = []
        r = requests.get('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num)
        ## parse the content
        root = fromstring(r.content)
        ## extract __VIEWSTATE
        __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
        ## extract __EVENTVALIDATION
        __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
        ## __EVENTTARGET is fixed
        __EVENTTARGET = 'GridView1'
        buy_info(dorm_num, root)
        ## define headers
        header = {
        'Content-Type': 'application/x-www-form-urlencoded'
        }
        np = 2

        while (1):
            # First extract some value
            # print "Retrieving data from page", np
            __EVENTARGUMENT = 'Page$' + str(np)
            payload = {
            '__EVENTARGUMENT': __EVENTARGUMENT,
            '__EVENTTARGET': __EVENTTARGET,
            '__EVENTVALIDATION': __EVENTVALIDATION,
            '__VIEWSTATE': __VIEWSTATE}
            r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
            ## get electric data
            data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
            ## get current page numbers
            cpn = re.findall('\)"><font color="#4A3C8C">([\d]+)', r.content)
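            # we are in the final block of pages if the highest page link shown is
            # one less than the page np just fetched (np itself is the last page),
            # or if it is not a multiple of ten (a partial, and therefore last, block)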
            if (int(cpn[-1]) == np - 1) or (int(cpn[-1]) % 10 != 0):
                cpn = range(np+1, int(cpn[-1]) + 1)
                ## next page
                for p in cpn:
                    ## parse the content
                    root = fromstring(r.content)
                    ## extract __VIEWSTATE
                    __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
                    ## extract __EVENTVALIDATION
                    __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
                    __EVENTARGUMENT = 'Page$'+ str(p)
                    payload = {
                    '__EVENTARGUMENT': __EVENTARGUMENT,
                    '__EVENTTARGET': __EVENTTARGET,
                    '__EVENTVALIDATION': __EVENTVALIDATION,
                    '__VIEWSTATE': __VIEWSTATE}
                    r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
                    data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
                    # print "Retrieving data from page", p
                break
            ## next page
            for p in cpn:
                ## parse the content
                root = fromstring(r.content)
                ## extract __VIEWSTATE
                __VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]/@value')[0]
                ## extract __EVENTVALIDATION
                __EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]/@value')[0]
                __EVENTARGUMENT = 'Page$' + p
                payload = {
                '__EVENTARGUMENT': __EVENTARGUMENT,
                '__EVENTTARGET': __EVENTTARGET,
                '__EVENTVALIDATION': __EVENTVALIDATION,
                '__VIEWSTATE': __VIEWSTATE}
                r = requests.post('http://ydcx.bupt.edu.cn/see.aspx?useid=' + dorm_num, data=payload, headers=header)
                data += re.findall('<td align="center"><font color="#4A3C8C">([^<]+)', r.content)
                # print "Retrieving data from page", p
            # check if reach end
            np = int(cpn[-1]) + 1
        # write header
        with open('data/' + dorm_num + '.csv', 'wb') as f:
            f.write('timestamp' + ',' + 'electric_remain\n')
        for i in range(len(data)):
            if i % 3 == 1:
                with open('data/' + dorm_num + '.csv', 'a') as f:
                    f.write(data[i] + ',')
            if i % 3 == 2:
                with open('data/' + dorm_num + '.csv', 'a') as f:
                    f.write(data[i] + '\n')
        return data
        ## for example
        # __EVENTVALIDATION = '/wEWCwKhncT7DwKtsp64BAKtsuK4BAKtsva4BAKtsvq4BAKtsu64BAKtsvK4BAKtssa4BAKtssq4BALDuK6IBQKokYx1vYI/vd4i5UCVuVYMSo/l30x1o/g='
        # __EVENTARGUMENT = 'Page$2'

    #data = parse_dorm_elec(dorm_num)
    def parse_dorm_elec_wrap(dorm_num):
        try:
            parse_dorm_elec(dorm_num)
        except:
            print "parsing " + dorm_num + " error..."
            pass
    # ----------------------------------------------------
    # # dorm 10
    # dorms = []
    # # layer 1
    # dorms += map(lambda x: '10-1' + str(x).zfill(2), range(1, 17))
    # # layer 2
    # dorms += map(lambda x: '10-2' + str(x).zfill(2), range(1, 21))
    # # layer 3 to 7
    # for l in range(3, 8):
    #     dorms += map(lambda x: '10-' + str(l) + str(x).zfill(2), range(1, 72))
    # # layer 8 to 13
    # for l in range(8, 14):
    #     dorms += map(lambda x: '10-' + str(l) + str(x).zfill(2) , range(1, 53))
    # ------------------------------------------------------

    #--------------------
    # dorm 1
    dorms = []
    # layer 1
    dorms += map(lambda x: '1-' + str(1) + str(x).zfill(2), range(1, 24))
    # layer 2,5
    for l in range(2, 6):
        dorms += map(lambda x: '1-' + str(l) + str(x).zfill(2), range(1, 27))
    #--------------------

    # for d in dorms:
    #     try:
    #         parse_dorm_elec_wrap(d)
    #     except:
    #         try:
    #             parse_dorm_elec_wrap(d)
    #         except:
    #             print "Error with ", d
    #             pass

    from gevent.pool import Pool
    pool = Pool(5)
    # map blocks until every dorm has been processed
    pool.map(parse_dorm_elec_wrap, dorms)
    pool.join(timeout=timeout)
    #
    # print dorms
    # parse_dorm_elec('10-1220')