Python Spider: 迎新系统学生信息爬取

Published on Apr 05, 2015

每一颗眼泪，是一万道光：迎新系统学生信息爬取

You don't get over the fear. You run towards it, with your knees buckling.

—Amin Ariana, Technical Founder, hacker and advisor at several ventures

有多少次，希望那短暂平凡的一刻又一刻定格到永恒。

简简单单就是幸福

忘乎所有只有热爱

去年8月，来跪邮写得第一个程序。在学十还略显空荡的房间，空荡荡的桌面，床上没有被子只有个睡袋，惨白惨白的灯光和兴奋的新同学们。

这次，依然是selenium专场。让程序操作浏览器。

首先，依然是研究整个流程。

打开 http://welcome.bupt.edu.cn

看看怎么登录

一切显而易见，输入用户名密码，点击登录按钮。

进入界面

这时候看到有个选框，发现可以选择研究生或者本科生。在这里我不讨论这个问题，留作读者自己思考。

我们随便翻翻看看

注意到左下角有几个页码，左边还有个=3136/210=之类的东西。

大概研究下猜想，3136是学生总数，210是总页数。

同时注意到页码是一次显示5页，通过点击=>=翻入下个5页。

为了得到我们要翻多少页，需要提取出210这个数。我们已经讲过如何用xpath来索引到对应的元素。

紧接着，抓取，点击下一页，每翻五页，点击=>=，然后继续重复以上步骤。

直到把210页全翻完。

我们要提取的信息在class=porlet-table=中

接着，一切都显而易见了，用selenium自动化这个步骤。

我是直接在ipython中一点点试验这个过程，最后把历史记录摘录出来写成程序，最后在ipython中打开pdb自动捕捉异常的功能或者设置断点来运行调试。

当然，也许你使用自己的方式。

#!/usr/bin/env python
# -*- coding: utf-8 -*-


from selenium import webdriver
#from selenium.webdriver.common.keys import Keys
import time
import random

__author__ = 'Reverland'

"""
You know what it is...
With NO warranty.
At your OWN risk.

A quick and dirty spider implemented with selenium webdriver
to dump the students' dorm data

条码号  姓名    院系    专业    学号    班级    宿舍校区    宿舍区  宿舍楼
宿舍房号    床号    入住情况
"""

driver = webdriver.Firefox()
driver.get("http://welcome.bupt.edu.cn")
username = driver.find_element_by_id("username")
password = driver.find_element_by_id("password")
username.send_keys("2xxxxxx")
password.send_keys("xxxxx")
submit = driver.find_element_by_name("submit")
submit.click()

time.sleep(2)

# login success
# y
p = 1
xpath_p_last = '//div[@class="pagination-info clearFix"]/span'
n_pages = driver.find_element_by_xpath(xpath_p_last)
p_last = int(n_pages.text.split('/')[1])
n_student = int(n_pages.text.split('/')[0])
print "Number of Students Found: ", n_student
while (p <= p_last):
    time.sleep(random.randint(3, 5))
    table = driver.find_element_by_class_name("portlet-table")
    # remove headers
    text = table.text[45::] + '\n'
    print text
    with open("bupt_students_yan.txt", 'a') as f:
        f.write(text.encode('utf-8'))
    p += 1
    if p > p_last:
        print "finished"
        break
    if p % 5 == 1:
        driver.find_element_by_link_text(">").click()
    else:
        driver.find_element_by_link_text(str(p)).click()

# b
option = driver.find_element_by_xpath('//select/option[@value="serieN10B"]')
option.click()
p = 1
xpath_p_last = '//div[@class="pagination-info clearFix"]/span'
n_pages = driver.find_element_by_xpath(xpath_p_last)
p_last = int(n_pages.text.split('/')[1])
n_student = int(n_pages.text.split('/')[0])
print "Number of Students Found: ", n_student
while (p <= p_last):
    time.sleep(random.randint(3, 5))
    table = driver.find_element_by_class_name("portlet-table")
    # remove headers
    text = table.text[45::] + '\n'
    print text
    with open("bupt_students_ben.txt", 'a') as f:
        f.write(text.encode('utf-8'))
    p += 1
    if p > p_last:
        print "finished"
        break
    if p % 5 == 1:
        driver.find_element_by_link_text(">").click()
    else:
        driver.find_element_by_link_text(str(p)).click()

Happy hacking~