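Web crawlers are an important tool for collecting data from the internet, and Python's rich library ecosystem and concise syntax have made it a first-choice language for writing them. Among the many crawling tools available, Scrapy and Beautiful Soup each have distinct strengths, and used together they handle data collection and processing efficiently.
Scrapy is a full-featured crawling framework suited to large-scale, structured data collection.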
```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Extract data from the current page
        title = response.css('h1::text').get()
        yield {'title': title}

        # Follow every link on the page and parse it with the same callback
        for link in response.css('a::attr(href)').getall():
            yield response.follow(link, self.parse)
```
Beautiful Soup is a flexible HTML/XML parsing library suited to small-scale scraping or irregularly structured pages.
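```python
from bs4 import BeautifulSoup
import requests

# Fetch the page; a timeout keeps the request from hanging indefinitely
html = requests.get('http://example.com', timeout=10).text
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser requires the lxml package

# First <div> whose class is "content" (class_ avoids the reserved word `class`)
soup.find('div', class_='content')
# CSS selector: <p> elements that are direct children of div.content
soup.select('div.content > p')
# All text nodes exactly equal to the given string
soup.find_all(string='specific text')
```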
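```python
import re
from datetime import datetime


def clean_text(text):
    # Collapse runs of whitespace into a single space and trim both ends
    return re.sub(r'\s+', ' ', text).strip()


def normalize_date(date_str):
    # Try each known format in turn; return None if none matches
    formats = ['%Y-%m-%d', '%d/%m/%Y', '%b %d, %Y']
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).date()
        except ValueError:
            continue
    return None
```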
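Scraped prices often need the same kind of normalization as text and dates. The sketch below is one possible helper and is not part of the original code: `parse_price` is a hypothetical name, and it assumes prices use `,` as a thousands separator and `.` as the decimal point.

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(raw):
    """Return the price as a Decimal, or None if no number can be parsed.

    Assumes ',' is a thousands separator and '.' is the decimal point.
    """
    # Drop currency symbols, spaces, and thousands separators
    cleaned = re.sub(r'[^\d.\-]', '', raw)
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None
```

Returning `None` instead of raising lets the caller decide whether to drop the record or keep it with a missing price. `Decimal` is used here to avoid binary floating-point rounding on monetary values; `float` would also work if exactness does not matter.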
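```python
from cerberus import Validator

schema = {
    'title': {'type': 'string', 'required': True},
    'price': {'type': 'float', 'min': 0},
    # Cerberus matches the pattern against the whole value, so `.+` is
    # needed to cover the rest of the URL
    'url': {'type': 'string', 'regex': '^https?://.+'},
}
validator = Validator(schema)

# Example record; in practice this comes from the parsing step
data = {'title': 'Example', 'price': 9.99, 'url': 'http://example.com/item'}

if validator.validate(data):
    # The data is valid
    pass
else:
    # validator.errors maps each failing field to its error messages
    print(validator.errors)
```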
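```python
import csv
import sqlite3

# Example records; in practice these come from the validation step
data_list = [{'title': 'Example', 'price': 9.99, 'url': 'http://example.com/item'}]
fieldnames = ['title', 'price', 'url']

# Store to CSV
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(data_list)

# Store to SQLite
conn = sqlite3.connect('data.db')
cursor = conn.cursor()
cursor.execute('''CREATE TABLE IF NOT EXISTS items
                  (title TEXT, price REAL, url TEXT)''')
cursor.executemany('INSERT INTO items VALUES (:title, :price, :url)', data_list)
conn.commit()
conn.close()
```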
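Data collection layer (Scrapy/Requests) → Data parsing layer (Beautiful Soup) →
Data processing layer (cleaning/validation) → Data storage layer (database/files) →
Data API layer (RESTful interface)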
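The layered flow above can be sketched as a chain of small functions. This is a minimal illustration under stated assumptions: the function names (`clean`, `is_valid`, `store`, `run_pipeline`) and the record fields are hypothetical, the collection and parsing layers are replaced by already-parsed records, and the API layer is omitted.

```python
import sqlite3


def clean(record):
    # Processing layer: trim whitespace and coerce the price to float
    return {
        'title': record['title'].strip(),
        'price': float(record['price']),
        'url': record['url'].strip(),
    }


def is_valid(record):
    # Processing layer: basic sanity checks before storage
    return (
        bool(record['title'])
        and record['price'] >= 0
        and record['url'].startswith(('http://', 'https://'))
    )


def store(conn, records):
    # Storage layer: persist validated records to SQLite
    conn.execute('CREATE TABLE IF NOT EXISTS items (title TEXT, price REAL, url TEXT)')
    conn.executemany('INSERT INTO items VALUES (:title, :price, :url)', records)
    conn.commit()


def run_pipeline(raw_records, conn):
    # Each layer only sees the output of the layer before it
    cleaned = [clean(r) for r in raw_records]
    valid = [r for r in cleaned if is_valid(r)]
    store(conn, valid)
    return len(valid)


if __name__ == '__main__':
    # Stand-in for the output of the collection and parsing layers
    raw = [
        {'title': '  Widget ', 'price': '9.99', 'url': 'http://example.com/w'},
        {'title': 'Broken', 'price': '-1', 'url': 'http://example.com/b'},
    ]
    conn = sqlite3.connect(':memory:')
    print(run_pipeline(raw, conn))
```

Keeping each layer as a separate function means a layer can be swapped (for example, replacing SQLite with another store) without touching the others.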
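```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import aiohttp

# Asynchronous requests with aiohttp
async def fetch(url):
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()
# Run with: asyncio.run(fetch('http://example.com'))

def process_data(item):
    return item  # placeholder for the real per-item processing

data_list = ['a', 'b', 'c']  # example input

# Process items in parallel on a thread pool
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(process_data, data_list))
```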
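When many URLs are fetched concurrently, it is common to cap the number of requests in flight. The sketch below shows one way to do that with `asyncio.Semaphore` and `asyncio.gather`. To stay runnable without network access, `fake_fetch` is a hypothetical stand-in for a real aiohttp request, and `fetch_all` and the limit of 5 are illustrative choices.

```python
import asyncio


async def fake_fetch(url):
    # Hypothetical stand-in for an aiohttp request; simulates network latency
    await asyncio.sleep(0.01)
    return f'<html>{url}</html>'


async def fetch_all(urls, limit=5):
    # The semaphore allows at most `limit` fetches to run at the same time
    semaphore = asyncio.Semaphore(limit)

    async def bounded_fetch(url):
        async with semaphore:
            return await fake_fetch(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded_fetch(u) for u in urls))


if __name__ == '__main__':
    urls = [f'http://example.com/page/{i}' for i in range(20)]
    pages = asyncio.run(fetch_all(urls))
    print(len(pages))
```

With a real client, a single `aiohttp.ClientSession` would typically be created once and shared across all fetches rather than opened per request.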
```python
import scrapy
from bs4 import BeautifulSoup
from datetime import datetime


class PriceMonitorSpider(scrapy.Spider):
    name = 'pricemonitor'

    def start_requests(self):
        urls = ['http://example.com/products']
        for url in urls:
            yield scrapy.Request(url, callback=self.parse_list)

    def parse_list(self, response):
        # Hand the downloaded HTML to Beautiful Soup for parsing
        soup = BeautifulSoup(response.text, 'lxml')
        products = soup.select('.product-item')
        for product in products:
            item = {
                'name': product.select_one('.name').text.strip(),
                'price': float(product.select_one('.price').text.replace('¥', '')),
                'url': response.urljoin(product.select_one('a')['href']),
                'crawltime': datetime.now().isoformat()
            }
            yield item

# Configure the item pipeline in settings.py
```
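The pipeline mentioned in the closing comment could look roughly like the following. It is a sketch, not the author's implementation: the class name `SQLitePipeline`, the database file `prices.db`, the table layout, and the module path in the `ITEM_PIPELINES` comment are all assumptions. The methods `open_spider`, `process_item`, and `close_spider` follow Scrapy's item pipeline interface.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical item pipeline that stores each scraped item in SQLite."""

    def __init__(self, db_path='prices.db'):
        self.db_path = db_path  # assumed file name
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS prices '
            '(name TEXT, price REAL, url TEXT, crawltime TEXT)'
        )

    def process_item(self, item, spider):
        # Called for every item the spider yields
        self.conn.execute(
            'INSERT INTO prices VALUES (:name, :price, :url, :crawltime)',
            dict(item),
        )
        self.conn.commit()
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # Called once when the spider finishes
        self.conn.close()


# In settings.py (the module path is an assumption about the project layout):
# ITEM_PIPELINES = {'myproject.pipelines.SQLitePipeline': 300}
```

Committing after every item keeps the sketch simple; for large crawls, batching inserts and committing periodically would likely be faster.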
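Scrapy and Beautiful Soup are a well-matched pair in the Python crawling ecosystem. Scrapy suits building complete crawler projects and provides full framework support; Beautiful Soup excels at small-scale, quick-to-develop tasks. Combining the strengths of both with a sensible data processing flow makes it possible to build efficient, stable data collection and processing services.
In practice, choose tools according to the specific requirements, pay attention to code maintainability and extensibility, and comply with applicable laws, regulations, and website terms of use so that data collection remains sustainable.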
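One concrete way to respect a site's crawling rules is to check its robots.txt before fetching a page. The sketch below uses the standard library's `urllib.robotparser`; the rules, the user agent name `my-crawler`, and the URLs are made up for illustration, and a real crawler would load the file with `set_url()` and `read()` instead of parsing an inline string. In a Scrapy project, enabling the `ROBOTSTXT_OBEY` setting applies this check automatically.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt):
    # Parse robots.txt content that has already been downloaded
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
print(parser.can_fetch('my-crawler', 'http://example.com/products'))
print(parser.can_fetch('my-crawler', 'http://example.com/private/data'))
```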