How to find text inside a div in a web page's source code using C#

How can I get the HTML code from a website, save it, and then find some text in it using a LINQ expression?

I use the following code to get the source code of the web page:


public static String code(string Url)
{
    HttpWebRequest myRequest = (HttpWebRequest)WebRequest.Create(Url);
    myRequest.Method = "GET";
    WebResponse myResponse = myRequest.GetResponse();
    StreamReader sr = new StreamReader(myResponse.GetResponseStream(),
                                       System.Text.Encoding.UTF8);
    string result = sr.ReadToEnd();
    sr.Close();
    myResponse.Close();

    return result;
}

How do I find the text inside a div in the page source?


To get the HTML code from a website, you can use code like this:

string urlAddress = "http://google.com";

HttpWebRequest request = (HttpWebRequest)WebRequest.Create(urlAddress);
HttpWebResponse response = (HttpWebResponse)request.GetResponse();

if (response.StatusCode == HttpStatusCode.OK)
{
    Stream receiveStream = response.GetResponseStream();
    StreamReader readStream = null;

    if (String.IsNullOrWhiteSpace(response.CharacterSet))
        readStream = new StreamReader(receiveStream);
    else
        readStream = new StreamReader(receiveStream,
                                      Encoding.GetEncoding(response.CharacterSet));

    string data = readStream.ReadToEnd();

    response.Close();
    readStream.Close();
}

This will give you the HTML returned from the website. But finding text via LINQ is not that easy. Perhaps it is better to use a regular expression, but regular expressions do not play well with HTML.
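For illustration only, here is a minimal regular-expression sketch (the id "test" is a placeholder, and it assumes the div contains no nested div elements); it mostly shows why this route is fragile rather than recommending it:

using System.Text.RegularExpressions;

// Minimal sketch, NOT robust: assumes the target element is <div id="test">
// and that it contains no nested <div> elements. "data" is the HTML string
// downloaded above.
Match m = Regex.Match(data,
    "<div[^>]*\\bid=\"test\"[^>]*>(.*?)</div>",
    RegexOptions.Singleline | RegexOptions.IgnoreCase);
string divContent = m.Success ? m.Groups[1].Value : null;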

The best thing to use is the HTML Agility Pack. You can also look into using Fizzler or CsQuery, depending on your needs for selecting elements from the retrieved page. Using LINQ or regular expressions is just too error-prone, especially when the HTML can be malformed, missing closing tags, have nested child elements, etc.

You need to stream the page into an HtmlDocument object and then select your required element.

// Call the page and get the generated HTML
var doc = new HtmlAgilityPack.HtmlDocument();
HtmlAgilityPack.HtmlNode.ElementsFlags["br"] = HtmlAgilityPack.HtmlElementFlag.Empty;
doc.OptionWriteEmptyNodes = true;

try
{
    var webRequest = HttpWebRequest.Create(pageUrl);
    Stream stream = webRequest.GetResponse().GetResponseStream();
    doc.Load(stream);
    stream.Close();
}
catch (System.UriFormatException uex)
{
    Log.Fatal("There was an error in the format of the url: " + pageUrl, uex);
    throw;
}
catch (System.Net.WebException wex)
{
    Log.Fatal("There was an error connecting to the url: " + pageUrl, wex);
    throw;
}

// Get the div by id and then get its inner HTML
string testDivSelector = "//div[@id='test']";
var divString = doc.DocumentNode.SelectSingleNode(testDivSelector).InnerHtml;
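If you only need the text content rather than the markup, HtmlNode also exposes InnerText; a one-line variant of the selection above:

// Plain text only (tags stripped), same node as above
string divText = doc.DocumentNode.SelectSingleNode(testDivSelector).InnerText;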

[EDIT] Actually, scrap that. The simplest method is to use FizzlerEx, an updated jQuery/CSS3-selectors implementation of the original Fizzler project.

Code sample directly from their site:

using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;


//get the page
var web = new HtmlWeb();
var document = web.Load("http://example.com/page.html");
var page = document.DocumentNode;


// Loop through all div tags with the "item" CSS class
foreach (var item in page.QuerySelectorAll("div.item"))
{
    var title = item.QuerySelector("h3:not(.share)").InnerText;
    var date = DateTime.Parse(item.QuerySelector("span:eq(2)").InnerText);
    var description = item.QuerySelector("span:has(b)").InnerHtml;
}

I don't think it can get any simpler than that.
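Applied to the original question, a minimal sketch (the id "test" and the URL are assumptions) that grabs a single div by id and reads its text:

using HtmlAgilityPack;
using Fizzler.Systems.HtmlAgilityPack;

// Assumed: the page at this URL contains <div id="test">...</div>
var web = new HtmlWeb();
var page = web.Load("http://example.com/page.html").DocumentNode;
var testDiv = page.QuerySelector("div#test");        // CSS id selector
string text = testDiv != null ? testDiv.InnerText : null;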

Better yet, you can use the WebClient class to simplify your task:

using System.Net;


using (WebClient client = new WebClient())
{
    string htmlCode = client.DownloadString("http://somesite.com/default.html");
}
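If the site is served in a specific character set, WebClient also lets you set its Encoding property before downloading; a small sketch (the UTF-8 choice here is an assumption, match it to the actual page):

using System.Net;
using System.Text;

using (WebClient client = new WebClient())
{
    client.Encoding = Encoding.UTF8;   // assumption: the page is UTF-8 encoded
    string htmlCode = client.DownloadString("http://somesite.com/default.html");
}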

Here's an example of using the HttpWebRequest class to fetch a URL:

private void button1_Click(object sender, EventArgs e)
{
    String url = TextBox_url.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader sr = new StreamReader(response.GetResponseStream());
    richTextBox1.Text = sr.ReadToEnd();
    sr.Close();
}

Try this solution. It works fine.

try
{
    String url = textBox1.Text;
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
    HttpWebResponse response = (HttpWebResponse)request.GetResponse();
    StreamReader sr = new StreamReader(response.GetResponseStream());
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.Load(sr);
    var aTags = doc.DocumentNode.SelectNodes("//a");
    int counter = 1;
    if (aTags != null)
    {
        foreach (var aTag in aTags)
        {
            richTextBox1.Text += aTag.InnerHtml + "\n";
            counter++;
        }
    }
    sr.Close();
}
catch (Exception ex)
{
    MessageBox.Show("Failed to retrieve related keywords." + ex);
}

I am using AngleSharp and have been very satisfied with it.

Here is a simple example of how to fetch a page:

var config = Configuration.Default.WithDefaultLoader();
var document = await BrowsingContext.New(config).OpenAsync("https://www.google.com");

And now you have the web page in the document variable. Then you can easily query it with LINQ or other methods. For example, if you want to get a string value from an HTML table:

var someStringValue = document.All.Where(m =>
    m.LocalName == "td" &&
    m.HasAttribute("class") &&
    m.GetAttribute("class").Contains("pid-1-bid")
).ElementAt(0).TextContent.ToString();

To use CSS selectors, please see the AngleSharp examples.
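For the div in the original question, a minimal CSS-selector sketch (the id "test" is assumed):

// Assumed: the loaded page contains <div id="test">...</div>
var div = document.QuerySelector("div#test");
string divText = div?.TextContent;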

You can use WebClient to download the HTML for any URL. Once you have the HTML, you can use a third-party library like HtmlAgilityPack to look up values in it, as in the code below:

public static string GetInnerHtmlFromDiv(string url)
{
    string HTML;
    using (var wc = new WebClient())
    {
        HTML = wc.DownloadString(url);
    }

    var doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(HTML);

    HtmlNode element = doc.DocumentNode.SelectSingleNode("//div[@id='<div id here>']");
    if (element != null)
    {
        return element.InnerHtml;
    }
    return null;
}