Scraping Images from Early Zhihu Pages with a Python Script

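Driven by basic male instinct, I spent a lunch break writing a script dedicated to grabbing the images under a Zhihu question page, and then posted my code as an answer to that very question. A few days later, my answer was collapsed...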

The script targets Python 2.7 and uses two libraries: requests and pyquery.

The Python source is as follows:

(dl_with_names.py)

#!/usr/bin/env python
# coding:utf8

import os
import sys
import requests
import re
from pyquery import PyQuery as pq

entity = {}
images = []


def main():
    try:
        get_url()
        load_page()
        save_imgs()
    except Exception, e:
        print e


def get_url():
    args = sys.argv
    # Raise an exception unless the script is invoked as `dl.py zhihu_url`
    if len(args) != 2:
        raise Exception(u"Wrong number for args, please use Zhihu question url!")

    zhihu_url = args[1]
    # Raise an exception if zhihu_url does not match the question page URL format
    re_exp = re.compile(ur"^https://www\.zhihu\.com/question/(\d+)")
    match = re_exp.match(zhihu_url)
    if not match:
        raise Exception(u"Zhihu url is invalid!")

    entity['url'] = zhihu_url
    entity['question'] = match.groups()[0]

    print entity


def load_page():
    header = {
        ur'User-Agent': ur'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.116 Safari/537.36',
        ur'Host': ur'www.zhihu.com',
        ur'Accept': ur'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        ur'Accept-Language': ur'zh-CN,zh;q=0.8,en;q=0.6',
        ur'Accept-Encoding': ur'gzip, deflate, sdch',
        ur'Connection': ur'keep-alive',
        ur'Cache-Control': ur'max-age=0'
    }

    resp = requests.get(entity['url'], headers=header)
    if resp.status_code != 200:
        raise Exception(u"Http error!")

    d = pq(resp.content)
    title = d('title').text()
    entity['title'] = title.split(u" ")[0]
    imgs = d("img.origin_image.zh-lightbox-thumb.lazy")
    for ele in imgs:
        images.append(pq(ele).attr("data-original"))


def save_imgs():
    dest_dir = os.path.dirname(os.path.abspath(__file__)) + "/images/" + entity['question'] + \
               entity['title']
    print dest_dir
    if not os.path.exists(dest_dir):
        os.makedirs(dest_dir)

    for img in images:
        res = requests.get(img)
        filename = os.path.basename(img)
        fp = open(dest_dir + "/" + filename, "wb")
        fp.write(res.content)
        fp.close()
        print img + " done."


if __name__ == "__main__":
    main()

Usage is simple:

python dl_with_names.py https://www.zhihu.com/question/37709992
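
To make the URL check in get_url() concrete, here is a minimal standalone sketch of the same regular expression applied to the example URL. It is written for Python 3 rather than the 2.7 used by the script, and the helper name parse_question_id is hypothetical, introduced only for illustration.

```python
import re

# Same pattern as in get_url(): anchored at the start, captures the numeric id.
QUESTION_URL = re.compile(r"^https://www\.zhihu\.com/question/(\d+)")


def parse_question_id(url):
    """Return the question id as a string, or None if the URL does not match."""
    match = QUESTION_URL.match(url)
    return match.group(1) if match else None


if __name__ == "__main__":
    # A valid question URL yields its id; anything else yields None.
    print(parse_question_id("https://www.zhihu.com/question/37709992"))
    print(parse_question_id("http://example.com/question/1"))
```

Because the pattern is only anchored at the start, a URL with a trailing path such as /answer/... still matches and yields the same question id.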

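Well, Zhihu has since redesigned its pages, so this script can no longer download anything. The only option now is to deal with the login state separately and, once the question page is actually visible, figure out some other way to scrape the images.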
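
For reference, the extraction step in load_page() depended on the old markup: img tags carrying the classes origin_image, zh-lightbox-thumb and lazy, with the full-size URL in the data-original attribute. The sketch below reproduces that selection offline using only the Python 3 standard library instead of pyquery. The HTML fragment, the example.com URLs, and the names OriginImageParser and extract_image_urls are all invented for illustration, and unlike the original script it skips tags that lack data-original.

```python
from html.parser import HTMLParser  # Python 3 standard library

# Classes required by the selector img.origin_image.zh-lightbox-thumb.lazy
REQUIRED_CLASSES = {"origin_image", "zh-lightbox-thumb", "lazy"}


class OriginImageParser(HTMLParser):
    """Collect data-original from <img> tags that carry all required classes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        classes = set((attr_map.get("class") or "").split())
        url = attr_map.get("data-original")
        # Keep the tag only if every required class is present and a URL exists.
        if REQUIRED_CLASSES <= classes and url:
            self.urls.append(url)


def extract_image_urls(html):
    """Return the data-original URLs found in an HTML string, in document order."""
    parser = OriginImageParser()
    parser.feed(html)
    return parser.urls


# Invented fragment imitating the old question page markup.
SAMPLE_HTML = """
<div>
  <img class="origin_image zh-lightbox-thumb lazy" data-original="https://example.com/a_r.jpg">
  <img class="avatar" src="https://example.com/avatar.png">
  <img class="origin_image zh-lightbox-thumb lazy">
</div>
"""

if __name__ == "__main__":
    for url in extract_image_urls(SAMPLE_HTML):
        print(url)
```

Run against the sample fragment, only the first image qualifies: the second lacks the required classes and the third has no data-original attribute.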

The approach above suits people with some command-line experience. For everyone else, a simpler option is to use JavaScript.

(dl.js)

/**
 * Created by caiknife on 16/9/9.
 */
$(function () {
    $('<div id="showImg"></div>').prependTo($("body"));
    $('img.origin_image.zh-lightbox-thumb.lazy').each(function () {
        $("#showImg").append($(this).data('original') + "<br/>");
    });
});

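First type javascript: into the address bar, then paste the JS code above after it. After you press Enter, the links to the images in the question page appear at the top of the page.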

However, Zhihu no longer uses jQuery, so this code has stopped working as well...

So I am writing this post to commemorate it anyway.

