Selenium with Scrapy for dynamic pages

I'm trying to scrape product information from a webpage using Scrapy. The page I want to scrape looks like this:

  • it starts with a product_list page with 10 products
  • a click on the "next" button loads the next 10 products (the URL doesn't change between the two pages)
  • I use LinkExtractor to follow each product link into the product page and get all the information I need

I tried to replicate the next-button AJAX call but couldn't get it working, so I'm giving Selenium a try. I can run Selenium's webdriver in a separate script, but I don't know how to integrate it with Scrapy. Should I put the Selenium part inside my Scrapy spider?

My spider is fairly standard, as shown below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.log import INFO


class ProductSpider(CrawlSpider):
    name = "product_spider"
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/shanghai']
    rules = [
        Rule(SgmlLinkExtractor(restrict_xpaths='//div[@id="productList"]//dl[@class="t2"]//dt'),
             callback='parse_product'),
    ]

    def parse_product(self, response):
        self.log("parsing product %s" % response.url, level=INFO)
        hxs = HtmlXPathSelector(response)
        # actual data follows

Any ideas would be appreciated, thanks!


It really depends on how you need to scrape the site and what data you want to get.

Here's an example of how you can follow pagination on eBay using Scrapy + Selenium:

import scrapy
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException


class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['ebay.com']
    start_urls = ['http://www.ebay.com/sch/i.html?_odkw=books&_osacat=0&_trksid=p2045573.m570.l1313.TR0.TRC0.Xpython&_nkw=python&_sacat=0&_from=R40']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                next = self.driver.find_element_by_xpath('//td[@class="pagn-next"]/a')
                next.click()

                # get the data and write it to scrapy items
            except NoSuchElementException:
                # no "next" link left: we've reached the last page
                break

        self.driver.close()
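
Where the comment says to get the data, one option is to feed the Selenium-rendered HTML back into Scrapy's selectors inside the while loop. A minimal sketch, assuming from scrapy import Selector at the top of the file (the XPath is a placeholder, not eBay's actual markup):

        # inside the while loop, after the next page has loaded
        sel = Selector(text=self.driver.page_source)
        for title in sel.xpath('//h3[@class="lvtitle"]/a/text()').extract():
            yield {'title': title}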

There is also an alternative to using Selenium with Scrapy. In some cases, the ScrapyJS middleware is enough to handle the dynamic parts of a page.
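
For reference, a minimal sketch of wiring this up with scrapy-splash (ScrapyJS's successor), assuming a Splash instance is running on localhost:8050:

# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# in the spider: yield a SplashRequest so Splash renders the JavaScript first
from scrapy_splash import SplashRequest

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, self.parse, args={'wait': 2})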

If the URL doesn't change between the two pages, you should add dont_filter=True to your scrapy.Request(), or Scrapy will treat the URL as a duplicate after processing the first page.
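
For example, when re-requesting the same URL for the next batch of products (a sketch; the callback is whatever parses your product list):

yield scrapy.Request(
    response.url,        # same URL as the page just processed
    callback=self.parse,
    dont_filter=True,    # otherwise the dupefilter drops this request
)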

If you need to render pages with JavaScript you should use scrapy-splash; you can also use a Scrapy middleware that handles JavaScript pages with Selenium, or you can do it by launching any headless browser.
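
If you stay with Selenium, a common integration point is a custom downloader middleware that renders each page in the browser and hands the resulting HTML to Scrapy. A minimal sketch (the class name is illustrative; error handling and driver teardown are omitted):

from scrapy.http import HtmlResponse
from selenium import webdriver


class SeleniumMiddleware(object):
    def __init__(self):
        self.driver = webdriver.Firefox()

    def process_request(self, request, spider):
        self.driver.get(request.url)
        # Return the rendered DOM so the spider parses it like a normal response.
        return HtmlResponse(
            url=self.driver.current_url,
            body=self.driver.page_source,
            encoding='utf-8',
            request=request,
        )

Enable it through the DOWNLOADER_MIDDLEWARES setting, and remember to quit the driver when the spider closes.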

But a more effective and faster solution is to inspect your browser and see what requests are made when submitting a form or triggering a certain event. Try to simulate the same requests your browser sends. If you can replicate the request(s) correctly, you will get the data you need.

Here is an example:

import json

from scrapy import Spider, Request

from myproject.items import QuoteItem  # hypothetical items module; QuoteItem isn't shown in the post


class ScrollScraper(Spider):
    name = "scrollingscraper"

    quote_url = "http://quotes.toscrape.com/api/quotes?page="
    start_urls = [quote_url + "1"]

    def parse(self, response):
        data = json.loads(response.body)
        for item in data.get('quotes', []):
            # create a fresh item per quote instead of mutating a shared one
            quote_item = QuoteItem()
            quote_item['author'] = item.get('author', {}).get('name')
            quote_item['quote'] = item.get('text')
            quote_item['tags'] = item.get('tags')
            yield quote_item

        if data['has_next']:
            next_page = data['page'] + 1
            yield Request(self.quote_url + str(next_page))

When the pagination URL is the same for every page and uses a POST request, you can use scrapy.FormRequest() instead of scrapy.Request(). They behave the same, but FormRequest adds a new argument (formdata=) to the constructor.

Here is another spider example from this post:

import json

import scrapy
from scrapy import Selector, FormRequest


class SpiderClass(scrapy.Spider):
    # spider name and all
    name = 'ajax'
    page_incr = 1
    start_urls = ['http://www.pcguia.pt/category/reviews/#paginated=1']
    pagination_url = 'http://www.pcguia.pt/wp-content/themes/flavor/functions/ajax.php'

    def parse(self, response):
        sel = Selector(response)

        if self.page_incr > 1:
            # AJAX pages come back as JSON with the HTML in the 'content' key
            json_data = json.loads(response.body)
            sel = Selector(text=json_data.get('content', ''))

        # your code here

        # pagination code starts here
        if sel.xpath('//div[@class="panel-wrapper"]'):
            self.page_incr += 1
            formdata = {
                'sorter': 'recent',
                'location': 'main loop',
                'loop': 'main loop',
                'action': 'sort',
                'view': 'grid',
                'columns': '3',
                'paginated': str(self.page_incr),
                'currentquery[category_name]': 'reviews'
            }
            yield FormRequest(url=self.pagination_url, formdata=formdata, callback=self.parse)
        else:
            return