How do you programmatically download a web page in Java?

I would like to be able to fetch a web page's HTML and save it to a String, so I can do some processing on it. Also, how should I handle various types of compression?

How would I go about doing that in Java?


You can use the built-in classes such as URL and URLConnection, but they don't give you very much control.

Personally, I'd go with the Apache HttpClient library.

Edit: HttpClient has reached end of life at Apache. The replacement is Apache HttpComponents: http://hc.apache.org/
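As a rough sketch of what that looks like with HttpComponents (HttpClient 4.x), assuming the dependency is on the classpath; the class name and URL below are just placeholders:

import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class HttpComponentsExample {
    public static void main(String[] args) throws Exception {
        try (CloseableHttpClient client = HttpClients.createDefault()) {
            HttpGet request = new HttpGet("http://example.com/");
            try (CloseableHttpResponse response = client.execute(request)) {
                // EntityUtils.toString buffers the whole response body into a String
                String html = EntityUtils.toString(response.getEntity());
                System.out.println(html);
            }
        }
    }
}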

On a Unix/Linux box you could just run 'wget', but this is not really an option if you're writing a cross-platform client. Of course, this assumes that you don't really want to do much with the data between the point of downloading it and it hitting the disk.
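If you do go the wget route on a platform that has it, one hedged way to drive it from Java is ProcessBuilder; the output file name and URL below are placeholders, and the sketch assumes wget is on the PATH:

public class WgetExample {
    public static void main(String[] args) throws Exception {
        // Shell out to wget; only works where wget is installed and on the PATH.
        ProcessBuilder pb = new ProcessBuilder("wget", "-O", "page.html", "http://example.com/");
        pb.inheritIO();                          // forward wget's console output to this process
        int exitCode = pb.start().waitFor();     // 0 means wget reported success
        System.out.println("wget exited with " + exitCode);
    }
}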

Here's some tested code using Java's URL class. I'd recommend doing a better job than I do here of handling exceptions or passing them up the call stack, though.

public static void main(String[] args) {
    URL url;
    InputStream is = null;
    BufferedReader br;
    String line;

    try {
        url = new URL("http://stackoverflow.com/");
        is = url.openStream();  // throws an IOException
        br = new BufferedReader(new InputStreamReader(is));

        while ((line = br.readLine()) != null) {
            System.out.println(line);
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    } finally {
        try {
            if (is != null) is.close();
        } catch (IOException ioe) {
            // nothing to see here
        }
    }
}
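Since the question asks for the HTML in a String, a small variation on the loop above collects the lines into a StringBuilder instead of printing them. A sketch using try-with-resources (Java 7+), to be placed in a method that handles or declares IOException:

StringBuilder sb = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new URL("http://stackoverflow.com/").openStream()))) {
    String line;
    while ((line = reader.readLine()) != null) {
        sb.append(line).append('\n');   // keep line breaks, readLine() strips them
    }
}
String html = sb.toString();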

Bill's answer is very good, but you may want to do some things with the request like compression or user-agents. The code below shows how you can enable various types of compression for your requests.

URL url = new URL(urlStr);
HttpURLConnection conn = (HttpURLConnection) url.openConnection(); // cast shouldn't fail
HttpURLConnection.setFollowRedirects(true);
// allow both GZip and Deflate (ZLib) encodings
conn.setRequestProperty("Accept-Encoding", "gzip, deflate");
String encoding = conn.getContentEncoding();
InputStream inStr = null;

// create the appropriate stream wrapper based on
// the encoding type
if (encoding != null && encoding.equalsIgnoreCase("gzip")) {
    inStr = new GZIPInputStream(conn.getInputStream());
} else if (encoding != null && encoding.equalsIgnoreCase("deflate")) {
    inStr = new InflaterInputStream(conn.getInputStream(),
            new Inflater(true));
} else {
    inStr = conn.getInputStream();
}

To also set the user agent, add the following code:

conn.setRequestProperty("User-agent", "my agent name");
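To then turn the (possibly decompressed) stream into a String, you can read from inStr in the usual way. A rough sketch, assuming the page is UTF-8 encoded; real code should take the charset from the Content-Type header:

StringBuilder page = new StringBuilder();
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(inStr, "UTF-8"))) {   // UTF-8 is an assumption here
    String line;
    while ((line = reader.readLine()) != null) {
        page.append(line).append('\n');
    }
}
String html = page.toString();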

I'd use a decent HTML parser like Jsoup. Then it's as easy as:

String html = Jsoup.connect("http://stackoverflow.com").get().html();

It handles GZIP and chunked responses and character encoding fully transparently. It offers more advantages as well, like HTML traversal and manipulation by CSS selectors, like jQuery can do. You only have to grab it as a Document, not as a String.

Document document = Jsoup.connect("http://google.com").get();

Really, do not expect to process HTML with basic String methods or even regular expressions.
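For example, once you have the Document, extraction with CSS selectors is a one-liner per query; the selectors below are only illustrative:

Document document = Jsoup.connect("http://stackoverflow.com").get();

String title = document.title();               // text of the <title> element
Elements links = document.select("a[href]");   // all anchors that have an href
for (Element link : links) {
    // abs:href resolves relative links against the page's base URI
    System.out.println(link.attr("abs:href") + " -> " + link.text());
}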


None of the approaches mentioned above download the web page text as it looks in the browser. These days a lot of data is loaded into the browser through scripts in HTML pages. None of the techniques above support scripts; they just download the HTML text. HtmlUnit supports JavaScript. So if you are looking to download the web page text as it looks in the browser, then you should use HtmlUnit.
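A minimal sketch with HtmlUnit 2.x (package com.gargoylesoftware.htmlunit; the URL is a placeholder) could look like this; getPage() executes the page's JavaScript before you read the resulting DOM:

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        try (WebClient webClient = new WebClient()) {
            // be lenient about JavaScript errors on real-world pages
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage("http://example.com/");
            String html = page.asXml();   // the DOM after JavaScript has run
            System.out.println(html);
        }
    }
}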

Jetty has an HTTP client which can be used to download a web page.

package com.zetcode;

import org.eclipse.jetty.client.HttpClient;
import org.eclipse.jetty.client.api.ContentResponse;

public class ReadWebPageEx5 {

    public static void main(String[] args) throws Exception {

        HttpClient client = null;

        try {
            client = new HttpClient();
            client.start();

            String url = "http://example.com";

            ContentResponse res = client.GET(url);

            System.out.println(res.getContentAsString());

        } finally {

            if (client != null) {
                client.stop();
            }
        }
    }
}

The example prints the contents of a simple web page.

In the Reading a web page in Java tutorial, I have written six examples of downloading a web page programmatically in Java using URL, JSoup, HtmlCleaner, Apache HttpClient, Jetty HttpClient, and HtmlUnit.

I used the accepted answer of this post (the URL-based one) and wrote the output to a file.

package test;

import java.io.*;
import java.net.*;

public class PDFTest {
    public static void main(String[] args) throws Exception {
        try {
            URL oracle = new URL("http://www.fetagracollege.org");
            BufferedReader in = new BufferedReader(new InputStreamReader(oracle.openStream()));

            String fileName = "D:\\a_01\\output.txt";

            PrintWriter writer = new PrintWriter(fileName, "UTF-8");
            String inputLine;

            // echo each line to the console and to the output file
            while ((inputLine = in.readLine()) != null) {
                System.out.println(inputLine);
                writer.println(inputLine);
            }
            in.close();
            writer.close();   // flush and release the output file
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

This class may help: it fetches the page source and filters out some of the information.

public class MainActivity extends AppCompatActivity {

    EditText url;

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.activity_main);

        url = (EditText) findViewById(R.id.editText);
        DownloadCode obj = new DownloadCode();

        try {
            String tag1 = "<div class=\"description\">";
            String l = obj.execute("http://www.nu.edu.pk/Campus/Chiniot-Faisalabad/Faculty").get();

            url.setText(l);
            url.setText(" ");

            // split around the description div and show only the extracted part
            String[] t1 = l.split(tag1);
            String[] t2 = t1[0].split("</div>");
            url.setText(t2[0]);
        } catch (Exception e) {
            Toast.makeText(this, e.toString(), Toast.LENGTH_SHORT).show();
        }
    }

    // input, extra function run in parallel, output
    class DownloadCode extends AsyncTask<String, Void, String> {
        @Override
        protected String doInBackground(String... WebAddress) // web addresses separated by ','
        {
            String htmlcontent = " ";
            try {
                URL url = new URL(WebAddress[0]);
                HttpURLConnection c = (HttpURLConnection) url.openConnection();
                c.connect();
                InputStream input = c.getInputStream();
                InputStreamReader reader = new InputStreamReader(input);

                // read the response one character at a time and append it to the result
                int data = reader.read();
                while (data != -1) {
                    char content = (char) data;
                    htmlcontent += content;
                    data = reader.read();
                }
            } catch (Exception e) {
                Log.i("Status : ", e.toString());
            }
            return htmlcontent;
        }
    }
}

You will most likely need to extract code from a secure web page (https protocol). In the following example, the HTML file is saved to c:\temp\filename.html.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;

import javax.net.ssl.HttpsURLConnection;

/**
 * <b>Get the Html source from the secure url</b>
 */
public class HttpsClientUtil {
    public static void main(String[] args) throws Exception {
        String httpsURL = "https://stackoverflow.com";
        String FILENAME = "c:\\temp\\filename.html";
        BufferedWriter bw = new BufferedWriter(new FileWriter(FILENAME));
        URL myurl = new URL(httpsURL);
        HttpsURLConnection con = (HttpsURLConnection) myurl.openConnection();
        con.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0");
        InputStream ins = con.getInputStream();
        InputStreamReader isr = new InputStreamReader(ins, "Windows-1252");
        BufferedReader in = new BufferedReader(isr);
        String inputLine;

        // Write each line into the file
        while ((inputLine = in.readLine()) != null) {
            System.out.println(inputLine);
            bw.write(inputLine);
        }
        in.close();
        bw.close();
    }
}

Do it with the powerful NIO.2 method Files.copy(InputStream in, Path target):

URL url = new URL("http://download.me/");
Files.copy(url.openStream(), Paths.get("downloaded.html"));
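If you want a String rather than a file, on Java 9+ you can read the same stream directly; a sketch assuming the page is UTF-8 encoded:

URL url = new URL("http://download.me/");
try (InputStream in = url.openStream()) {
    // readAllBytes() is available since Java 9
    String html = new String(in.readAllBytes(), StandardCharsets.UTF_8);
    System.out.println(html);
}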