You will definitely want to start with a good web scraping framework. Later on you may decide that they are too limiting and you can put together your own stack of libraries but without a lot of scraping experience your design will be much worse than pjscrape or scrapy.
Why do you need to work out what data is in html and what data comes in from an ajax call? Managing all that session and cookie data? You don't have to when you browse a site, the browser and the site javascript do that. That's the whole point.
现在有两个选项,让它输出完成的 html 并解析它,或者在页面中注入一些 javascript,这些 javascript 执行解析和数据格式化并输出数据(可能是 json 格式)。您也可以自由地混合这两个选项。
哪种方法最好?
That depends, you will need to be familiar and comfortable with the low level approach for sure. The embedded browser approach works for anything, it will be much easier to implement and will make some of the trickiest problems in scraping disappear. It's also quite a complex piece of machinery that you will need to understand. It's not just HTTP requests and responses, it's requests, embedded browser rendering, site javascript, injected javascript, your own code and 2-way interaction with the embedded browser process.
The embedded browser is also much slower at scale because of the rendering overhead but that will almost certainly not matter unless you are scraping a lot of different domains. Your need to rate limit your requests will make the rendering time completely negligible in the case of a single domain.
Identify your requests with a user agent string that identifies your bot and have a webpage for your bot explaining it's purpose. This url goes in the agent string.
The second part of your tests involve caching and change detection. While writing your code you don't want to be hammering the server for the same page over and over again for no reason. While running your unit tests you want to know if your tests are failing because you broke your code or because the website has been redesigned. Run your unit tests against a cached copy of the urls involved. A caching proxy is very useful here but tricky to configure and use properly.