Deep Learning Image Datasets
For the last three months I have worked predominantly in NLP, and it has been a long time since I worked with image data. So I decided to build an image classifier as a personal learning project.
One thing I really miss during the current pandemic is traveling. These days I watch a lot of travel vlogs and browse travel pictures on Instagram, wondering when we will get back to normal.
That gave me the idea of building an image classifier with five classes: Mountain, Beach, Desert, Lake, and Museum. However, I did not have an image dataset to train on and could not find a suitable one through Google. One option is to scrape the images manually, but that takes time. I then came across google images download and bing image downloader, and found that they make it very easy to build a custom image dataset.
I plan to use transfer learning, so I need only a small number of images. I therefore decided to collect 100 images per class using google images download. This post explains how to build a custom image dataset with google images download and bing image downloader.
For simplicity, I will build the dataset for only two classes: ‘Mountain’ and ‘Beach’.
Initially, I tried pip install google_image_download, but it did not work. Following a Stack Overflow answer, I installed the library using JoeClinton’s GitHub link instead.
You can check the official google images download page here.
The next step is to import google images download and instantiate its downloader class as an object called response.
Now we need to pass our arguments. I need mountain and beach images, so I pass ‘Mountain’ and ‘Beach’ as the keywords.
format: the file type to download. I am looking for jpg files. According to the documentation, gif, png, bmp, svg, webp, ico, and raw are also supported.
limit: the number of images to download. The default is 100. To download more than 100 images, you need to install Selenium along with ChromeDriver. I have not tried this, as I need only 100 images.
Sometimes we get fewer than 100 images because of occasional errors during download.
print_urls: prints the URLs of the images that are extracted.
Other arguments, such as color and aspect ratio, are also available. Please check the documentation and give them a try.
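The import, the arguments, and the download call described above can be sketched roughly as follows. This is a minimal sketch, not the original notebook code: the helper names `build_arguments` and `download_images` are hypothetical, and the install command in the comment is an assumption based on the GitHub fork mentioned earlier.

```python
# Assumed install, from the GitHub fork rather than PyPI:
#   pip install git+https://github.com/Joeclinton1/google-images-download.git


def build_arguments(keywords, limit=100, image_format="jpg", print_urls=True):
    """Build the argument dict described above (hypothetical helper)."""
    return {
        "keywords": ",".join(keywords),  # comma-separated search terms
        "limit": limit,                  # number of images per keyword
        "format": image_format,          # file type to download
        "print_urls": print_urls,        # print each image URL
    }


def download_images(arguments):
    """Run the download; needs the library installed and network access."""
    # Imported lazily so build_arguments works without the library.
    from google_images_download import google_images_download

    response = google_images_download.googleimagesdownload()
    return response.download(arguments)
```

Calling `download_images(build_arguments(["Mountain", "Beach"]))` would then start the download for both classes.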
The flow chart below explains the process: the library takes the query (arguments), runs the search, downloads the raw HTML page, scrapes all the image links, and then downloads and saves the images.
https://github.com/Joeclinton1/google-images-download/blob/patch-1/images/flow-chart.png
All the images are downloaded and stored in a folder called “downloads”, with the subfolders “Mountain” and “Beach”. Mountain images are stored in the Mountain folder and beach images in the Beach folder.
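One quick way to confirm the folder layout, and to spot classes that ended up with fewer images than requested, is to count the files in each subfolder. The sketch below uses only the standard library. The function name and the list of suffixes are my own choices, and the root folder is passed in rather than assumed.

```python
from pathlib import Path

# File suffixes treated as images (an assumption; extend as needed).
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images_per_class(root):
    """Return {class folder name: number of image files inside it}."""
    counts = {}
    for class_dir in sorted(Path(root).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(
                1
                for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
    return counts
```

For example, `count_images_per_class("downloads")` returns a dictionary with one entry per class folder.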
We can see that all the images have been stored in their respective folders. The images come in different sizes, so they need to be resized before being fed into the model. This library is very useful if you want to build a custom image dataset for an image classifier.
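The resizing step is not shown in this post. One possible way to do it, assuming the Pillow library is installed, is sketched below. The 224×224 target size is an assumption (a common input size for pretrained models), and `resize_images` is a hypothetical helper name.

```python
from pathlib import Path

from PIL import Image  # Pillow, assumed to be installed


def resize_images(src_dir, dst_dir, size=(224, 224)):
    """Save a resized RGB JPEG copy of every readable image in src_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    saved = []
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        try:
            with Image.open(path) as img:
                out = dst / (path.stem + ".jpg")
                # Convert to RGB so palette/alpha images can be saved as JPEG.
                img.convert("RGB").resize(size).save(out, "JPEG")
                saved.append(out)
        except OSError:
            # Skip files Pillow cannot read, such as broken downloads.
            continue
    return saved
```

Note that this stretches every image to the target size without preserving its aspect ratio. Cropping or padding first may suit some models better.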
Bing image downloader is a Python library used to download images in bulk from bing.com. Please check here for more information.
Here, I am going to download only mountain images, so I create a local directory called ‘mountain’ to store them.
Now we import bing image downloader and pass the arguments. We need mountain images, so I pass ‘mountain’ as the search string.
limit: the number of images to download. Bing image downloader can download images in bulk. I limit it to 200 because of my RAM; you can try a higher number and check.
output_dir: the name of the output directory. It is optional. I created a directory called mountain and store all the images there. If you don’t specify a directory, the library falls back to its default output location.
adult_filter_off: disables the adult content filter. It is True by default.
force_replace: deletes the folder if it already exists and starts the download afresh.
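Put together, the call with these arguments might look like the sketch below. The helper names are hypothetical, and the `timeout` value is an assumption that the text above does not cover. The actual download needs the library installed and network access, so it is kept inside a function.

```python
def build_bing_options(limit=200, output_dir="mountain"):
    """Collect the keyword arguments described above (hypothetical helper)."""
    return {
        "limit": limit,              # number of images to download
        "output_dir": output_dir,    # directory that receives the images
        "adult_filter_off": True,    # disable the adult content filter
        "force_replace": False,      # keep an existing folder instead of deleting it
        "timeout": 60,               # seconds per request (assumed value)
    }


def download_from_bing(query, **options):
    """Download images for one search string with bing image downloader."""
    # Imported lazily so build_bing_options works without the library.
    from bing_image_downloader import downloader

    downloader.download(query, **options)
```

With these helpers, `download_from_bing("mountain", **build_bing_options())` would fetch up to 200 mountain images.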
Next, we check the image files in the mountain directory.
200 image files are stored in the directory. Let’s view some of them using IPython.
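A small sketch of how a few files could be picked and displayed in a notebook is shown below. Both helper names are hypothetical. The search is recursive on the assumption that the downloader may place the images in a subfolder of the output directory, and the display part requires IPython.

```python
from pathlib import Path


def first_images(directory, n=2, suffixes=(".jpg", ".jpeg", ".png")):
    """Return the first n image paths found under directory, in sorted order."""
    files = [
        p
        for p in sorted(Path(directory).rglob("*"))
        if p.is_file() and p.suffix.lower() in suffixes
    ]
    return files[:n]


def show_images(paths):
    """Display the given image files inline (works inside a notebook)."""
    # Imported lazily so first_images works without IPython installed.
    from IPython.display import Image, display

    for p in paths:
        display(Image(filename=str(p)))
```

In a notebook, `show_images(first_images("mountain"))` would render the first two downloaded images.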
We can download images in bulk with bing image downloader. However, getting images that accurately match the query is sometimes challenging.
Please check the terms before using these images for any commercial purpose, as doing so may violate copyright. Neither Google nor the bing downloader owns the copyright of the images; it belongs to their original creators.
Thanks for reading. Keep learning and stay tuned for more!
Thanks to Anirudh Koul
Translated from: https://medium.com/towards-artificial-intelligence/building-a-custom-image-dataset-for-deep-learning-projects-7f759d069877
