What Python libraries can you use for web scraping?
Web scraping has three simple steps:
Step 1: Access the webpage
Step 2: Locate and parse the items to be scraped
Step 3: Save the scraped items to a file

The top Python libraries for web scraping are: requests, selenium, beautiful soup, pandas, and scrapy. Today, we will only cover the first four and save the fifth one, scrapy, for another post (it requires more documentation and is relatively complex). Our goal here is to quickly understand how the libraries work and to try them for ourselves.
As a practice project, we will use this 20 dollar job post from Upwork:
There are two links that the client wants to scrape, and we will focus on the second one. It's a webpage for publicly traded companies listed on NASDAQ:
According to the client, he wants to scrape the stock symbols and stock names listed on this webpage. Then he wants to save both sets of data in one JSON file. There are hundreds of job posts like this on Upwork, and this is a good example of how you can make money using Python.
Before we start, keep in mind that there are ethical and legal issues around web scraping. Be mindful of how the data you scrape will be used.
Accessing a webpage is as easy as typing a URL on a browser. Only this time, we have to remove the human element in the process. We can use requests or selenium to do this. Here’s how they work:
Requests allows us to send HTTP requests to a server. This library gets the job done, especially for static websites that immediately render their HTML contents. However, most sites are laced with JavaScript code that keeps the full HTML contents from rendering. For example, sometimes we need to tick a box or scroll down to a specific section to completely load a page, and that is JavaScript in action. Some websites also have anti-scraper code that detects HTTP requests sent through the requests library. For these reasons, we explore other libraries.
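Here's the basic pattern (a minimal sketch; the URL is the NASDAQ listing page we'll be scraping below):

import requests

url = "http://eoddata.com/stocklist/NASDAQ/A.htm"
page = requests.get(url)     # send an HTTP GET request to the server
print(page.status_code)      # 200 means the page was fetched successfully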
Selenium was originally created to help engineers test web applications. When we use Selenium to access a page, Selenium literally has to open a browser and interact with the page's elements (e.g., buttons, links, forms). If a website requires a login session, we can easily pass our credentials through Selenium. This is why it's more robust than the requests library.
Here’s an example of Selenium opening the webpage for me:
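Roughly, the code looks like this (a sketch; the chromedriver path is just an example based on my Downloads folder, and the calls follow the older Selenium 3 style used throughout this post):

from selenium import webdriver

url = "http://eoddata.com/stocklist/NASDAQ/A.htm"

# Point Selenium to the Chrome driver you downloaded (example path)
driver = webdriver.Chrome('/Users/me/Downloads/chromedriver')
driver.get(url)   # Chrome opens and loads the page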
Chrome lets me know that it is being controlled by test software (Selenium). But Selenium cannot do all of this on its own without help from the browser. If you examine my code above, you can see that I called the Chrome driver from my Downloads folder so that Selenium could open the URL in Chrome for me.
The information we want to scrape is rendered on the webpage via HTML. This is the part where we locate and parse it. To do so, we can still use Selenium or, if we don't want to install a Chrome driver, beautiful soup:
According to this documentation, there are many ways that Selenium can locate HTML elements for us. I will use the find_elements_by_css_selector command just because it’s convenient.
stock_symbol = driver.find_elements_by_css_selector('#ctl00_cph1_divSymbols > table > tbody > tr > td:nth-child(1) > a')

How did I know what CSS selector to pass? Easy. I just inspected a sample stock symbol (AACG) and copied its selector. Then I tweaked the code a little bit so that all symbols would be parsed (not just AACG).
Here’s the full code so far:
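Piecing together everything we've used so far (the chromedriver path is an example):

from selenium import webdriver

url = "http://eoddata.com/stocklist/NASDAQ/A.htm"

driver = webdriver.Chrome('/Users/me/Downloads/chromedriver')   # example path
driver.get(url)

# Grab every stock symbol link in the first column of the table
stock_symbol = driver.find_elements_by_css_selector('#ctl00_cph1_divSymbols > table > tbody > tr > td:nth-child(1) > a')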
This returns a list of selenium objects. We need to access the text inside these objects to see the symbol and compile them in a list:
symbol = []
for x in stock_symbol:
    sym = x.text
    symbol.append(sym)

There you go! Now that we have the stock symbols, we just need to repeat the process to get the stock names:
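Something along these lines should do it (a sketch; the second-column selector is my assumption about where the names sit in the same table):

# Same approach, but target the second column of the table (assumed to hold the names)
stock_name = driver.find_elements_by_css_selector('#ctl00_cph1_divSymbols > table > tbody > tr > td:nth-child(2)')

names = []
for x in stock_name:
    names.append(x.text)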
Looking good!
Aside from Selenium, we can use beautiful soup to locate and parse HTML items. It's often used together with the requests library to avoid the need to install a Chrome driver, which is a requirement for Selenium. Recall this code from the requests library:
import requests

url = "http://eoddata.com/stocklist/NASDAQ/A.htm"
page = requests.get(url)

From here, all we need to do is import and call beautiful soup's HTML parser:
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.text, 'html.parser')

This will grab all the HTML elements of the page:
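If you want to see what was parsed, you can print it out:

print(soup.prettify())   # dumps the parsed HTML so you can inspect its structure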
From this documentation, this is how beautiful soup parses HTML items:
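Something along these lines (illustrative examples in the style of the docs):

soup.title                # the page's <title> tag
soup.title.string         # just the text inside it
soup.find_all('a')        # every <a> (link) element on the page
soup.find_all('td')       # every table cell on the page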
It's not as straightforward as using a CSS selector, so we have to use some for loops to map and store the HTML items:
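One way to write those loops (a sketch; it assumes the table inside the #ctl00_cph1_divSymbols container holds the symbol in the first cell and the name in the second cell of each row, with a header row to skip):

# Find the quotes table inside the container we identified earlier
table = soup.find(id='ctl00_cph1_divSymbols').find('table')

symbol = []
names = []
for row in table.find_all('tr')[1:]:                   # skip the header row (assumed)
    cells = row.find_all('td')
    if len(cells) >= 2:
        symbol.append(cells[0].get_text(strip=True))   # stock symbol
        names.append(cells[1].get_text(strip=True))    # stock name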
Not bad!
In the code snippets above, you can see that we stored the stock symbols and names in lists called symbol and names, respectively. From here we can use the pandas library to put these lists into a dataframe and output them as a JSON file.
import pandas as pd

df = pd.DataFrame(index=None)
df['stock_symbol'] = symbol
df['stock_name'] = names

Perfect! Now there is one more thing we need to do. In the job post, the client mentioned he wants to set the stock symbol as a key and the stock name as a value. This pandas code should do it:
df.set_index('stock_symbol', inplace=True)

Finally, let's save the file in JSON format as requested by the client:
df.to_json('NASDAQ Stock List')

Ca-ching! That was an easy $20!
If you enjoyed this, then you might want to stay tuned for the Scrapy tutorial. With Scrapy, we can create more powerful and flexible web scrapers.
Translated from: https://medium.com/python-in-plain-english/4-python-libraries-to-help-you-make-money-from-webscraping-57ba6d8ce56d