Python Crawler 1 | 木叶清风, C++, Python, Java, C#, Unity3d, Unreal4, UE4, tech log

# Python Web Crawler Basics

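A crawler is an automated program that fetches data from the web. Picture the web as a spider's web: the data stored at each node is the target, and crawlers are small programs that act as spiders collecting it. A crawl has three stages: fetching, parsing, and storing.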
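
The three stages can be sketched with the standard library alone. This is only a minimal illustration, not the approach used in the rest of this post: the HTML snippet, the `TitleParser` class, and the `store` helper are all hypothetical, and the fetch stage is replaced by a hard-coded string so the sketch runs offline.

```python
import csv
import io
from html.parser import HTMLParser

# Stage 1 (fetch): a real crawler would download this over HTTP;
# a hard-coded page stands in for the response here (assumption).
SAMPLE_HTML = """
<ul>
  <li><a href="/a" title="First listing">A</a></li>
  <li><a href="/b" title="Second listing">B</a></li>
</ul>
"""


class TitleParser(HTMLParser):
    """Stage 2 (parse): collect the title attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            title = dict(attrs).get("title")
            if title:
                self.titles.append(title)


def store(titles, out):
    """Stage 3 (store): write one title per CSV row to a file-like object."""
    writer = csv.writer(out)
    for title in titles:
        writer.writerow([title])


parser = TitleParser()
parser.feed(SAMPLE_HTML)
buffer = io.StringIO()
store(parser.titles, buffer)
print(buffer.getvalue())
```

In a real crawler the `io.StringIO` buffer would typically be an open file or a database connection.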

# Basic Crawling

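The example below uses requests, BeautifulSoup (the beautifulsoup4 package), and the lxml parser. Install them in your Python environment:

pip install requests beautifulsoup4 lxml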

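import requests
from bs4 import BeautifulSoup

# Fetch the rental listing page and fail fast on HTTP errors.
response = requests.get("https://tj.lianjia.com/zufang/", timeout=10)
response.raise_for_status()
# Parse the HTML with the lxml parser.
page = BeautifulSoup(response.text, 'lxml')
# Each listing sits in a div with class "content__list--item".
linkdiv = page.find_all('div', class_="content__list--item")
links = [div.a.get("title") for div in linkdiv if div.a]
# Print the price text of every listing.
prices = page.find_all("span", class_="content__list--item-price")
for one in prices:
    print(one.text)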


In the code above, requests fetches the page, and BeautifulSoup turns the HTML document into a tree in which every node is a Python object. These objects fall into four types:

Tag
NavigableString
BeautifulSoup
Comment
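
A small sketch of how the four object types show up in practice. The HTML string is hand-written for illustration (it is not taken from the crawled site), and the built-in `html.parser` is used so the snippet runs even without lxml.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# A tiny hand-written document (assumption, not from the crawled site).
html = "<p class='price'>3000<!-- monthly rent --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p                      # Tag: an HTML element
text = p.contents[0]            # NavigableString: text inside a tag
comment = p.contents[1]         # Comment: a special kind of NavigableString

print(type(soup).__name__)      # BeautifulSoup: the whole document
print(type(p).__name__, p.name, p["class"])
print(type(text).__name__, str(text))
print(type(comment).__name__, str(comment))
```

Because `Comment` is a subclass of `NavigableString`, code that collects text nodes may need to filter comments out explicitly.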

# Getting Started with Scrapy

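Earlier we built simple crawlers with the requests, urllib, and re modules. More complex jobs call for a mature framework such as Scrapy. It builds on many packages, including Twisted (an event-driven networking engine), lxml (XML and HTML processing), and cssselect (which translates CSS selectors to XPath for extracting data from HTML pages). Let's start with a simple example: collecting the list of companies from the 58.com (58同城) Tianjin company directory.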

First install Scrapy (pymysql is used later to write to the database):

pip install scrapy pymysql

Create a new project:

scrapy startproject MyFirst

The project contains several important files:

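scrapy.cfg: the project-wide configuration file.
items.py: defines the data transfer objects the crawler uses, usually a description of the scraped fields.
pipelines.py: processes the scraped items, for example writing them to a database or a file.
spiders: the directory that holds the spiders.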

Now let's implement the goal.

Define the transfer object by adding the following to items.py:

import scrapy


class CompanyNameItem(scrapy.Item):
    # A single field holding the scraped company name.
    companyName = scrapy.Field()


Next, decide what to do with the scraped data. Here we store it in a database. Open pipelines.py:

import pymysql


class MyfirstPipeline(object):
    # Connection settings for the database server.
    database = {
        'host': 'maria',
        'database': 'pythonrepo',
        'user': 'root',
        'password': '666666',
        'charset': 'utf8'
    }

    def __init__(self):
        # Open one connection and cursor for the lifetime of the pipeline.
        self.db = pymysql.connect(**self.database)
        self.cursor = self.db.cursor()

    def close_spider(self, spider):
        # Called by Scrapy when the spider finishes.
        print("------closing database resources------")
        self.cursor.close()
        self.db.close()

    def process_item(self, item, spider):
        # Default behavior: pass the item through unchanged.
        #if spider.name == "companyName":
        print("hi,this is common pipelines")
        return item


class CompanyNamePipeline(MyfirstPipeline):
    def process_item(self, item, spider):
        # Pass the parameters as a one-element tuple (note the trailing comma).
        self.cursor.execute("insert into test2(name) values(%s)", (item['companyName'],))
        self.db.commit()
        return item


Register the pipeline in the settings by uncommenting these lines in settings.py:
```
ITEM_PIPELINES = {
    'MyFirst.pipelines.MyfirstPipeline': 300,
}
```

Add a spider file by running the following command in the project directory:

scrapy genspider companyName qy.58.com

Fill in the spider:

import scrapy
from MyFirst.items import CompanyNameItem


class CompanynameSpider(scrapy.Spider):
    name = 'companyName'
    allowed_domains = ['qy.58.com']
    start_urls = ['http://qy.58.com/tj/']

    # Use the database pipeline for this spider only.
    custom_settings = {
        'ITEM_PIPELINES': {'MyFirst.pipelines.CompanyNamePipeline': 300}
    }

    # Number of the page currently being crawled.
    searchPage = 1

    def parse(self, response):
        # Company names are the link texts inside the listing.
        result = response.xpath('//div[@class="compList"]/ul/li/span/a/text()').getall()
        if result:
            self.searchPage += 1

            for one in result:
                item = CompanyNameItem()
                item["companyName"] = one
                yield item

            # Alternative: follow the "next" link instead of counting pages.
            # new_links = response.xpath('//div[@class="pager"]/a[@class="next"]/@href').getall()
            # if new_links:
            #     yield scrapy.Request(new_links[0], callback=self.parse)
            # else:
            #     print("new_links is null!!!")
            yield scrapy.Request(self.start_urls[0] + 'pn' + str(self.searchPage), callback=self.parse)
        else:
            # An empty result means we have run past the last page.
            print("----------stopped after reaching page %s---------" % self.searchPage)
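
The spider builds each next-page URL by appending `pn<N>` to the start URL, while the commented-out alternative follows the "next" link, whose `href` may be relative. A small sketch of both ideas using only the standard library; `page_url` and `resolve_next` are hypothetical helpers, and the example `href` is made up.

```python
from urllib.parse import urljoin

START_URL = "http://qy.58.com/tj/"


def page_url(page_number):
    """Build the URL of a listing page the way the spider does: .../pn<N>."""
    return START_URL + "pn" + str(page_number)


def resolve_next(current_url, href):
    """Turn a possibly relative 'next' link into an absolute URL."""
    return urljoin(current_url, href)


print(page_url(2))
print(resolve_next("http://qy.58.com/tj/pn2", "/tj/pn3"))
```

Inside a Scrapy callback, `response.urljoin(href)` or `response.follow(href, callback=self.parse)` performs the same resolution against the current response URL.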


Create the database pythonrepo and, in it, a table test2 with a single name column of type varchar. Then run the spider from the project directory:

scrapy crawl companyName
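
The table described above could be created from Python before the first run. This is a sketch under assumptions: `create_company_table` is a hypothetical helper, `VARCHAR(255)` is an assumed column length (the post only says varchar), and the database pythonrepo must already exist.

```python
def create_company_table(connection):
    """Create the test2 table used by CompanyNamePipeline if it is missing.

    `connection` is any DB-API 2.0 connection, for example the result of
    pymysql.connect(...). VARCHAR(255) is an assumed column length.
    """
    cursor = connection.cursor()
    try:
        cursor.execute(
            "CREATE TABLE IF NOT EXISTS test2 (name VARCHAR(255))"
        )
        connection.commit()
    finally:
        cursor.close()
```

With the settings from pipelines.py it might be called as `create_company_table(pymysql.connect(**MyfirstPipeline.database))`.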

# Debugging Tips

Sometimes you need to debug a crawl. The Scrapy shell lets you do that interactively:

scrapy shell -s USER_AGENT="Mozilla/5.0" https://baidu.com
