How does Firefox Reader View operate?

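Summary

I am looking for the criteria by which I can create a web page and be fairly sure that it will appear in Firefox Reader View if the user wants it.

Some sites have this option and some do not. Some pages with more text lack the option, while others with less text have it. Stack Overflow, for instance, displays only the question in Reader View and none of the answers.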

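Question

I have just updated Firefox from 38.0.1 to 38.0.5 and found a new feature called Reader View: a kind of overlay that removes "page clutter" and makes text easier to read. Reader View appears as a clickable icon at the right-hand side of the address bar on certain pages.

This is fine, but from a programming point of view I want to know how Reader View works and what criteria determine which pages it applies to. I explored the Mozilla Firefox website but found no clear answers (none of the programming-related answers I found were any good), and of course I searched this site as well, which only turned up information about Firefox add-ons. This is not an add-on but an integral part of the new version of Firefox.

I assumed that Reader View relied on HTML5 and would extract the contents of <article>, but this is not the case: it works on Wikipedia, which does not use <article> or similar HTML5 tags. Instead, Reader View extracts certain <div> elements and displays them on their own. The feature works on some HTML5 pages, such as Wikipedia, but not on others.

If anyone knows how Firefox Reader View actually operates, and how website developers can make use of it, could you share? Or, if you know where this information can be found, could you point me in the right direction? I have not been able to find it.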

30167 views

Reading through the GitHub code this morning, the process appears to be that page elements are listed in order of likelihood, with <section>, <p>, <div>, and <article> at the top of the list (i.e. most likely to hold content).

Each of these "nodes" is then given a score based on things such as comma counts and the class names that apply to the node. This is a somewhat multi-faceted process: scores are added for chunks of text, but they also seem to be reduced for invalid parts or syntax. Scores in the sub-parts of a node are reflected in the score of the node as a whole, i.e. the parent element contains the scores of all lower elements, I think.

This score value decides whether the HTML page can be shown in Reader View in Firefox.
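To make the class-name part of the scoring concrete, here is a minimal sketch of how a node's class and id names might raise or lower its score. The pattern lists and the weight of 25 are my own illustrative assumptions, not the exact values used by Readability.js, and plain objects stand in for DOM elements so the snippet runs outside a browser.

```javascript
// Illustrative only: these patterns and the +/-25 weight are assumptions,
// not the exact lists or values used by Readability.js.
const POSITIVE_HINT = /article|content|main|post|story/i;
const NEGATIVE_HINT = /comment|footer|sidebar|sponsor|widget/i;

// Score adjustment derived from a node's class and id names.
function classWeight(node) {
  let weight = 0;
  for (const name of [node.className, node.id]) {
    if (!name) continue;
    if (NEGATIVE_HINT.test(name)) weight -= 25;
    if (POSITIVE_HINT.test(name)) weight += 25;
  }
  return weight;
}

// Plain objects stand in for DOM elements so the sketch runs without a browser.
const examples = [
  { className: "post-content", id: "" },   // looks like content
  { className: "sidebar widget", id: "" }, // looks like page clutter
  { className: "", id: "main" },           // id hints at content
  { className: "box", id: "" },            // no hint either way
];
for (const node of examples) {
  console.log(JSON.stringify(node), "->", classWeight(node));
}
```

Under these assumed patterns the four examples score 25, -25, 25 and 0, which is the kind of nudge that would sit on top of the text-based score.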

I am not absolutely clear if the score value is set by Firefox or by the readability function.

JavaScript is really not my strong point, and I think someone else should check over the link provided by Richard (https://github.com/mozilla/readability) and see if they can provide a more thorough answer.

What I did not see, but expected to see, was a score based on the amount of text content in a <p>, <div>, or other relevant tag.

If you have any improvements to this question or answer, please share!

EDIT: Images in <div> or <figure> tags (HTML5) within the <p> element appear to be retained in the Reader View when the page text content is valid.

You need at least one <p> tag around the text that you want to see in Reader View, and at least 516 characters in 7 words inside the text.

For example, this will trigger Reader View:

<body>
<p>
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789
123456789 123456
</p>
</body>

See my example at https://stackoverflow.com/a/30750212/1069083
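The numbers quoted above can be checked against the example paragraph with a few lines of JavaScript. This is only a sketch: the thresholds of 516 characters and 7 words are taken from this answer rather than from the Firefox source, counting whitespace between words as characters is my assumption, and the helper names are hypothetical.

```javascript
// Rebuild the text of the example paragraph: five lines of 99 digits plus a
// final line containing two shorter "words".
const line = "1234567890".repeat(9) + "123456789"; // 99 characters
const text = [line, line, line, line, line, "123456789 123456"].join("\n");

// Count characters (including the whitespace between words) and
// whitespace-separated words.
function measure(rawText) {
  const trimmed = rawText.trim();
  const words = trimmed === "" ? [] : trimmed.split(/\s+/);
  return { characters: trimmed.length, words: words.length };
}

// Thresholds quoted in the answer above; assumed, not checked against Firefox.
const MIN_CHARACTERS = 516;
const MIN_WORDS = 7;

function meetsQuotedThreshold(rawText) {
  const { characters, words } = measure(rawText);
  return characters >= MIN_CHARACTERS && words >= MIN_WORDS;
}

console.log(measure(text));              // { characters: 516, words: 7 }
console.log(meetsQuotedThreshold(text)); // true
```

Counted this way, the example sits exactly on the quoted limit. The limit itself may well differ between Firefox versions.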

I followed Martin's link to the Readability.js GitHub repository, and had a look at the source code. Here's what I make of it.

The algorithm works with paragraph tags. First of all, it tries to identify parts of the page which are definitely not content - like forms and so on - and removes them. Then it goes through the paragraph nodes on the page and assigns a score based on content-richness: it gives them points for things like number of commas, length of content, etc. Notice that a paragraph with fewer than 25 characters is immediately discarded.
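As a rough illustration of that paragraph-scoring step, here is a minimal sketch. The 25-character cutoff comes from the description above; the base point, the one-point-per-comma rule and the length bonus capped at three are my assumptions about the weights, and the real code may differ in detail.

```javascript
// Assumed scoring rule, based on the description above: a base point, one
// point per comma, and up to three points for length. Real weights may differ.
const MIN_PARAGRAPH_LENGTH = 25;

function scoreParagraph(text) {
  const trimmed = text.trim();
  // Very short paragraphs are discarded outright.
  if (trimmed.length < MIN_PARAGRAPH_LENGTH) return null;
  const base = 1;
  const commas = (trimmed.match(/,/g) || []).length;
  const lengthBonus = Math.min(Math.floor(trimmed.length / 100), 3);
  return base + commas + lengthBonus;
}

console.log(scoreParagraph("Too short.")); // null
console.log(
  scoreParagraph("Reader View strips clutter, keeps the text, and restyles it.")
); // 3 (1 base + 2 commas + 0 length bonus)
```

The point is that prose-like text (long, with commas) scores higher than short fragments such as menu labels or button captions.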

Scores then "bubble up" the DOM tree: each paragraph adds part of its score to all of its parent nodes. A direct parent gets the full score added to its total, a grandparent only half, a great-grandparent a third, and so on. This allows the algorithm to identify higher-level elements which are likely to be the main content section.
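The bubbling step can be sketched as follows. The divisors (1, 2, 3) follow the description above. The limit of three ancestor levels and the node shape (plain objects with a parent pointer) are my assumptions, and the actual divisors in Readability.js may differ between versions.

```javascript
// Add a paragraph's score to its ancestors: full score to the parent, half to
// the grandparent, a third to the great-grandparent, as described above.
// Assumption: the exact divisors and depth limit in Readability.js may differ.
function bubbleUp(paragraph, score, maxLevels = 3) {
  let ancestor = paragraph.parent;
  for (let level = 1; ancestor && level <= maxLevels; level++) {
    ancestor.score = (ancestor.score || 0) + score / level;
    ancestor = ancestor.parent;
  }
}

// Minimal stand-in for a DOM tree: body > article > section > p.
const body = { name: "body", parent: null };
const article = { name: "article", parent: body };
const section = { name: "section", parent: article };
const paragraphs = [
  { name: "p", parent: section, score: 6 },
  { name: "p", parent: section, score: 12 },
];

for (const p of paragraphs) bubbleUp(p, p.score);

for (const node of [section, article, body]) {
  console.log(node.name, node.score);
}
// section 18
// article 9
// body 6
```

The element closest to the paragraphs collects the most points, which is why a tight wrapper around the content tends to win.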

Though this is just Firefox's algorithm, my guess is if it works well for Firefox, it'll work well for other browsers too.

In order for these Reader View algorithms to work for your website, you want them to correctly identify the content-heavy sections of your page. This means you want the more content-heavy nodes on your page to get high scores in the algorithm.

So here are some rules of thumb to improve the quality of the page in the eyes of these algorithms:

  1. Use paragraph tags in your content! Many people tend to overlook them in favor of <br /> tags. While it may look similar, many content-related algorithms (not only Reader View ones) rely heavily on them.
  2. Use HTML5 semantic elements in your markup, like <article>, <nav>, <section>, <aside>. Even though they're not the only criterion (as you noted in the question), these are very useful to computers reading your page (not just Reader View) to distinguish different sections of your content. Readability.js uses them to guess which nodes are likely or unlikely to contain important content.
  3. Wrap your main content in one container, like an <article> or <div> element. This will receive score points from all the paragraph tags inside it, and be identified as the main content section.
  4. Keep your DOM tree shallow in content-dense areas. If you have a lot of elements breaking your content up, you're only making life harder for the algorithm: there won't be a single element that stands out as being parent of a lot of content-heavy paragraphs, but many separate ones with low scores.
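To see why rules 3 and 4 matter, here is a small sketch comparing two layouts under a simplified version of the scoring described earlier. The fixed score of 10 per paragraph, the score / level rule for ancestors and the helper names are all assumptions made for illustration, not the behaviour of any particular browser.

```javascript
// Assumption: each paragraph is worth a fixed score and contributes
// score / level to its ancestors (parent = level 1). Both are simplifications.
const PARAGRAPH_SCORE = 10;

// Build a node and attach its children with a parent pointer.
function el(name, ...children) {
  const node = { name, children, parent: null, score: 0 };
  for (const child of children) child.parent = node;
  return node;
}

// Walk the tree, letting every <p> add to the scores of its ancestors, and
// return all non-paragraph nodes as candidate containers.
function scoreContainers(root) {
  const containers = [];
  const visit = (node) => {
    if (node.name === "p") {
      let ancestor = node.parent;
      for (let level = 1; ancestor; level++) {
        ancestor.score += PARAGRAPH_SCORE / level;
        ancestor = ancestor.parent;
      }
      return;
    }
    containers.push(node);
    node.children.forEach(visit);
  };
  visit(root);
  return containers;
}

// The container with the highest score is treated as the main content.
function topCandidate(root) {
  return scoreContainers(root).reduce((best, node) =>
    node.score > best.score ? node : best
  );
}

const p = () => el("p");

// One wrapper around all paragraphs versus one wrapper per paragraph.
const shallow = el("body", el("article", p(), p(), p(), p()));
const fragmented = el("body", el("div", p()), el("div", p()), el("div", p()), el("div", p()));

for (const [label, tree] of [["shallow", shallow], ["fragmented", fragmented]]) {
  const best = topCandidate(tree);
  console.log(label, "->", best.name, best.score);
}
// shallow -> article 40
// fragmented -> body 20
```

With a single wrapper, the <article> clearly stands out. When every paragraph sits in its own <div>, no content container beats the <body>, so the algorithm has nothing specific to pick.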