Exception: Read timed out

When I try to parse a large number of HTML documents with Jsoup, I get a SocketTimeoutException.

For example, I have a list of links:

<a href="www.domain.com/url1.html">link1</a>
<a href="www.domain.com/url2.html">link2</a>
<a href="www.domain.com/url3.html">link3</a>
<a href="www.domain.com/url4.html">link4</a>

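For each link, I parse the document at the linked URL (taken from the href attribute) to get further information from those pages.

I can imagine that this takes a lot of time, but how do I get rid of this exception? Here is the full stack trace: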

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(Unknown Source)
at java.io.BufferedInputStream.fill(Unknown Source)
at java.io.BufferedInputStream.read1(Unknown Source)
at java.io.BufferedInputStream.read(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTPHeader(Unknown Source)
at sun.net.www.http.HttpClient.parseHTTP(Unknown Source)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(Unknown Source)
at java.net.HttpURLConnection.getResponseCode(Unknown Source)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:381)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:364)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:143)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:132)
at app.ForumCrawler.crawl(ForumCrawler.java:50)
at Main.main(Main.java:15)

I think you can do

Jsoup.connect("...").timeout(10 * 1000).get();

which sets the timeout to 10 seconds.
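When crawling many links, a single slow page can still exceed whatever timeout is chosen, so it may help to retry a fetch that timed out instead of aborting the whole crawl. Below is a minimal sketch of that idea using only the JDK. The class TimeoutRetry, the method withRetry, and the parameters maxAttempts and pauseMillis are hypothetical names for illustration and are not part of Jsoup.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Jsoup: retries a task that timed out.
final class TimeoutRetry {

    private TimeoutRetry() {}

    /**
     * Runs the task and retries only when it throws SocketTimeoutException.
     * Any other exception is propagated immediately. Once maxAttempts tries
     * have all timed out, the last SocketTimeoutException is rethrown.
     */
    static <T> T withRetry(Callable<T> task, int maxAttempts, long pauseMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SocketTimeoutException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (SocketTimeoutException e) {
                last = e;
                if (attempt < maxAttempts) {
                    // Brief pause before the next try so a slow server is not hammered.
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}
```

With Jsoup on the classpath, this could presumably be called as TimeoutRetry.withRetry(() -> Jsoup.connect(url).timeout(10 * 1000).get(), 3, 1000L), where url is one of the hrefs from the list in the question. The retry count and pause are arbitrary values and would need tuning for a real crawl.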

Ok - so, I tried to offer this as an edit to MarcoS's answer, but the edit was rejected. Nevertheless, the following information may be useful to future visitors:

According to the javadocs, the default timeout for an org.jsoup.Connection is 30 seconds.

As has already been mentioned, this can be changed using timeout(int millis).

As the OP notes in the edit, it can also be set to timeout(0). However, as the javadocs state:

A timeout of zero is treated as an infinite timeout.

Set the timeout when connecting with Jsoup.

There is a mistake at https://jsoup.org/apidocs/org/jsoup/Connection.html. The default timeout is not 30 seconds; it is 3 seconds. Just look at the Javadoc in the source code, which says 3000 ms.

I had the same error:

java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)

and only setting .userAgent("Opera") worked for me.

So I used the userAgent(String userAgent) method of the Connection class to set the Jsoup user agent.

Something like:

Jsoup.connect("link").userAgent("Opera").get();
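For reference, the stack trace in the question shows the read going through java.net.HttpURLConnection, so the two settings discussed in this thread (a timeout and a User-Agent header) can also be sketched with the JDK alone. This is only an illustration: the class PlainHttpSettings and the method prepare are hypothetical names, and how Jsoup maps its own timeout onto the connect and read timeouts is not shown in this thread and may differ between versions.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper showing JDK-level counterparts of a timeout and a user agent.
final class PlainHttpSettings {

    private PlainHttpSettings() {}

    /** Prepares, but does not open, an HTTP connection with explicit timeouts and a User-Agent header. */
    static HttpURLConnection prepare(String url, int timeoutMillis, String userAgent) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setConnectTimeout(timeoutMillis); // 0 is interpreted as an infinite timeout
        conn.setReadTimeout(timeoutMillis);    // 0 is interpreted as an infinite timeout
        conn.setRequestProperty("User-Agent", userAgent);
        // No network traffic happens until connect() or getInputStream() is called.
        return conn;
    }
}
```

Calling prepare("http://www.domain.com/url1.html", 10 * 1000, "Opera") would give a connection whose reads fail with SocketTimeoutException after 10 seconds of silence from the server, which is the same exception reported in the question.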

This should work: Jsoup.connect(url.toLowerCase()).timeout(0).get();