Web scraping in a Google Chrome extension (JavaScript + Chrome APIs)

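What is the best option for scraping a page that is not currently open in a tab, from within a Google Chrome extension using JavaScript, and what other techniques are available?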

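It is important to mask the scraping so that it behaves like a normal web request. There should be no indication of AJAX or XMLHttpRequest, such as X-Requested-With: XMLHttpRequest or an Origin header.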

The scraped content must be accessible from JavaScript for further manipulation and presentation within the extension, most likely as a string.

Are there any hooks in any of the WebKit/Chrome-specific APIs that can be used to make a normal web request and get the result back for manipulation?

var pageContent = getPageContent(url); // TODO: Implement
var items = $(pageContent).find('.item');
// Display items with further selections

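Bonus points for making it work from a local file on disk, for initial debugging. But if that is the only thing stopping a solution, disregard the bonus points.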


I'm not sure it's entirely possible with just JavaScript, but if you can set up a dedicated PHP script for your extension that uses cURL to fetch the HTML for a page, the PHP script could scrape the page for you and your extension could read it in through an AJAX request.

The actual page being scraped wouldn't know it's an AJAX request, however, because it is being accessed through cURL.
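The extension side of this approach might look like the sketch below. It assumes a hypothetical endpoint, https://example.com/scrape.php, that takes the target address in a `url` query parameter and returns the fetched HTML as plain text. Neither the endpoint nor the parameter name comes from the answer above, and the extension would need host permission for wherever the script is hosted.

```javascript
// Hypothetical proxy endpoint (an assumption); point this at the PHP/cURL script.
const PROXY_ENDPOINT = "https://example.com/scrape.php";

// Build the proxy request URL, percent-encoding the target address.
function buildProxyUrl(targetUrl, endpoint = PROXY_ENDPOINT) {
  const proxyUrl = new URL(endpoint);
  proxyUrl.searchParams.set("url", targetUrl);
  return proxyUrl.toString();
}

// Ask the proxy to fetch the page and resolve with the HTML as a string.
async function getPageContent(targetUrl) {
  const response = await fetch(buildProxyUrl(targetUrl));
  if (!response.ok) {
    throw new Error("Proxy request failed with status " + response.status);
  }
  return response.text();
}
```

Because `getPageContent` returns a promise here, the snippet from the question would become `getPageContent(url).then(function (html) { var items = $(html).find('.item'); })`.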

Couldn't you just do some iframe trickery? If you load the URL into a dedicated frame, you have the DOM in a document object and can do your jQuery selections, no?

Attempt to use XHR2 responseType = "document" and fall back on (new DOMParser).parseFromString(responseText, getResponseHeader("Content-Type")) with my text/html patch. See https://gist.github.com/1138724 for an example of how I detect responseType = "document" support (synchronously checking response === null on an object URL created from a text/html blob).
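A minimal sketch of that idea, not the code from the gist: it assumes it runs in an extension page where XMLHttpRequest and DOMParser exist and the extension has host permission for the target URL. The names `getPageDocument` and `toParserMimeType` are made up for illustration. The helper is there because a raw Content-Type header such as "text/html; charset=utf-8" is not a value parseFromString accepts, so it is reduced to a bare MIME type first.

```javascript
// MIME types that DOMParser.parseFromString accepts.
const PARSER_TYPES = [
  "text/html", "text/xml", "application/xml",
  "application/xhtml+xml", "image/svg+xml"
];

// Reduce a Content-Type header to a type DOMParser accepts,
// defaulting to "text/html" for anything else (including a missing header).
function toParserMimeType(contentTypeHeader) {
  const mime = String(contentTypeHeader || "").split(";")[0].trim().toLowerCase();
  return PARSER_TYPES.includes(mime) ? mime : "text/html";
}

// Fetch a URL and resolve with a parsed Document.
function getPageDocument(url) {
  return new Promise(function (resolve, reject) {
    const xhr = new XMLHttpRequest();
    xhr.open("GET", url);
    xhr.responseType = "document"; // ignored by engines that do not support it
    const nativeDocument = xhr.responseType === "document";
    xhr.onload = function () {
      if (nativeDocument) {
        // responseText is not readable in this mode, so there is no text fallback here.
        if (xhr.response) resolve(xhr.response);
        else reject(new Error("Response could not be parsed as a document"));
        return;
      }
      // Fallback: parse the response text ourselves.
      const mime = toParserMimeType(xhr.getResponseHeader("Content-Type"));
      resolve(new DOMParser().parseFromString(xhr.responseText, mime));
    };
    xhr.onerror = function () {
      reject(new Error("Network error while fetching " + url));
    };
    xhr.send();
  });
}
```

Usage would be along the lines of `getPageDocument(url).then(function (doc) { return doc.querySelectorAll('.item'); })`.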

Use the Chrome WebRequest API to hide X-Requested-With, etc. headers.
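As a rough sketch of the header stripping, the listener below removes X-Requested-With and Origin from outgoing XHR requests. It assumes the blocking form of chrome.webRequest, which needs the "webRequest" and "webRequestBlocking" permissions plus host permissions, and which ordinary extensions generally cannot use under Manifest V3. Newer Chrome versions may also require "extraHeaders" in the last argument before Origin becomes visible to the listener. Note too that a plain XMLHttpRequest does not add X-Requested-With by itself; libraries such as jQuery add it. The helper name `stripHeaders` is made up for illustration.

```javascript
// Header names to remove, compared case-insensitively.
const HEADERS_TO_STRIP = ["x-requested-with", "origin"];

// Return a copy of the header list without the unwanted headers.
function stripHeaders(requestHeaders) {
  return requestHeaders.filter(function (header) {
    return !HEADERS_TO_STRIP.includes(header.name.toLowerCase());
  });
}

// Register the listener only where the extension API is present.
if (typeof chrome !== "undefined" && chrome.webRequest) {
  chrome.webRequest.onBeforeSendHeaders.addListener(
    function (details) {
      return { requestHeaders: stripHeaders(details.requestHeaders) };
    },
    { urls: ["<all_urls>"], types: ["xmlhttprequest"] },
    ["blocking", "requestHeaders"]
  );
}
```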

If you are fine with looking beyond a Google Chrome plugin, look at PhantomJS, which uses Qt WebKit in the background and runs just like a browser, including making AJAX requests. You can call it a headless browser, as it doesn't display its output on a screen and can quietly work in the background while you are doing other stuff. If you want, you can export images or PDFs of the pages it fetches. It provides a JS interface for loading pages, clicking buttons and so on, much like you have in a browser. You can also inject custom JS, for example jQuery, into any of the pages you want to scrape and use it to access the DOM and export the desired data. As it's using WebKit, its rendering behaviour is very close to Google Chrome's.
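A small PhantomJS script along these lines might look like the sketch below. The URL is a placeholder and the `.item` selector is borrowed from the question; `collectItemTexts` is a made-up name. The script is written in ES5 style because it runs under PhantomJS rather than Chrome, and the function handed to page.evaluate has to be self-contained since it executes inside the loaded page.

```javascript
// Runs inside the loaded page via page.evaluate, so it only touches `document`.
function collectItemTexts() {
  var nodes = document.querySelectorAll(".item");
  var texts = [];
  for (var i = 0; i < nodes.length; i++) {
    texts.push(nodes[i].textContent.trim());
  }
  return texts;
}

// Only run when executed by PhantomJS, where the global `phantom` object exists.
if (typeof phantom !== "undefined") {
  var page = require("webpage").create();
  page.open("http://example.com/", function (status) {
    if (status === "success") {
      // The evaluated result is passed back to this script as plain data.
      console.log(JSON.stringify(page.evaluate(collectItemTexts)));
    } else {
      console.log("Failed to load page");
    }
    phantom.exit();
  });
}
```

Saved as, say, scrape.js, it would be run with `phantomjs scrape.js`.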

Another option would be to use Aptana Jaxer, which is based on the Mozilla engine and is a very good concept in itself. It can be used as a simple scraping tool as well.

Web scraping is kind of convoluted in a Chrome Extension. Some points:

  • You run content scripts for access to the DOM.
  • Background pages (a single instance per extension, shared across the browser) can exchange messages with content scripts. That is, you can run a content script that sets up an RPC endpoint and fires a specified callback in the context of the background page as a response.
  • You can execute content scripts in all frames of a webpage, then stitch the document tree (composed of the 1..N frames that the page contains) together.
  • As S.K. suggested, your background page can send the data as an XMLHttpRequest to some kind of lightweight HTTP server that listens locally.
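The content script and background page round trip described above could be sketched as follows. It uses the current chrome.runtime and chrome.tabs messaging calls rather than whatever was available when this was written. The two halves would live in separate files, and the file names, the "scrape" message type and the helper names are all made up for illustration.

```javascript
// --- content script (hypothetical content.js) ---

// Collect the outer HTML of every element matching the selector.
function extractItems(doc, selector) {
  return Array.from(doc.querySelectorAll(selector), function (el) {
    return el.outerHTML;
  });
}

// Answer "scrape" requests coming from the background page.
if (typeof chrome !== "undefined" && chrome.runtime && chrome.runtime.onMessage) {
  chrome.runtime.onMessage.addListener(function (message, sender, sendResponse) {
    if (message && message.type === "scrape") {
      sendResponse({ items: extractItems(document, message.selector) });
    }
  });
}

// --- background page (hypothetical background.js) ---

// Ask the content script in a given tab for its matching items.
function requestItems(tabId, selector, callback) {
  chrome.tabs.sendMessage(tabId, { type: "scrape", selector: selector }, function (response) {
    // response is undefined when no content script answered in that tab.
    callback(response ? response.items : []);
  });
}
```

The items come back as HTML strings, which matches the question's wish to handle the scraped content as a string in the extension.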

I think you can start from this example.

So basically you can try using an extension + plugin combination. The extension would have access to the DOM (including the plugin) and drive the process, and the plugin would send the actual HTTP requests.

I can recommend using FireBreath as a cross-platform Chrome/Firefox plugin platform. In particular, take a look at this example: FireBreath, "Making HTTP Requests with SimpleStreamsHelper".

A lot of tools have been released since this question was asked.

artoo.js is one of them. It's a piece of JavaScript code meant to be run in your browser's console to provide you with some scraping utilities. It can also be used as a Chrome extension.