Big Data and Hadoop

Technology 2025-03-26

To understand what Big Data and Hadoop are, you first need to understand what data is.

What is data?

In general, data is any set of characters that is gathered and translated for some purpose, usually analysis.

What is Big Data?

Big Data is also data, but of enormous size. The term describes a collection of data that is huge in volume and keeps growing over time. In short, such data is so large that traditional data management techniques cannot store and process it efficiently.

Big Data Examples

Big data is getting bigger every minute in almost every sector, and the volume of data being processed is mind-boggling. Here are some figures to give you an idea:

The weather channels receive 18,055,555 forecast requests every minute.

Netflix users stream 97,222 hours of video every minute.

Twitter users post 473,400 tweets every minute.

Facebook generates 4 new petabytes of data per day.

A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, the data generated reaches many petabytes.

You can check the stats here: https://www.internetlivestats.com/
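
To get a feel for the scale, the per-minute figures quoted above can be converted to per-day totals. This is only a small arithmetic sketch; the dictionary keys are labels chosen here for illustration, and the rates are simply the numbers from the list, assumed to hold steadily over a whole day.

```python
MINUTES_PER_DAY = 24 * 60

# Per-minute figures quoted in the list above.
per_minute = {
    "weather forecast requests": 18_055_555,
    "Netflix hours streamed": 97_222,
    "tweets posted": 473_400,
}

# Scale each per-minute rate to a full day (assumes a constant rate).
per_day = {name: rate * MINUTES_PER_DAY for name, rate in per_minute.items()}

for name, total in per_day.items():
    print(f"{name}: {total:,} per day")
```

At these rates the forecast requests alone come to roughly 26 billion per day.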

Types of Big Data

Big Data can be found in three forms:

Structured

Unstructured

Semi-structured

Structured

Any data that can be stored, accessed, and processed in a fixed format is termed ‘structured’ data. Over time, computer scientists have had great success in developing techniques for working with this kind of data (where the format is well known in advance) and for deriving value from it.

Unstructured

Any data whose form or structure is unknown is classified as unstructured data. In addition to its huge size, unstructured data poses multiple challenges when it is processed to derive value from it. A typical example is a data source containing a mix of plain text files, images, videos, and so on.

Semi-structured

Semi-structured data can contain both forms of data, structured and unstructured.
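
The difference between the three forms is easier to see with a tiny example. The sketch below is illustrative only: the sample records are made up, and treating CSV as structured and JSON as semi-structured is a common convention assumed here, not something the text above specifies.

```python
import csv
import io
import json

# Structured: fixed columns known in advance (hypothetical sample rows).
structured_csv = "id,name,phone\n1,Asha,555-0100\n2,Ravi,555-0101\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing keys, but records need not share fields.
semi_structured = json.loads(
    '[{"id": 1, "name": "Asha", "tags": ["vip"]},'
    ' {"id": 2, "name": "Ravi", "note": "prefers email"}]'
)

# Unstructured: free text with no schema; any structure must be inferred.
unstructured = "Asha called on Monday and asked about her order."

def field_names(records):
    """Return the set of field names used by each record."""
    return [set(r.keys()) for r in records]

structured_fields = field_names(rows)
semi_fields = field_names(semi_structured)

# Every structured row has the same fields; semi-structured rows may not.
same_schema_structured = all(f == structured_fields[0] for f in structured_fields)
same_schema_semi = all(f == semi_fields[0] for f in semi_fields)

print(same_schema_structured, same_schema_semi)
```

The structured rows all share one schema, while the semi-structured records carry their own field names and differ from one another; the free-text string has no fields to compare at all.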

Characteristics of Big Data

Volume

Volume refers to the enormous amounts of information generated every second by social media, cell phones, cars, credit cards, and so on. To cope with it, we currently use distributed systems, which store data in several locations and bring it together through a software framework like Hadoop.

Variety

Big Data comes in many varieties. Compared with traditional data such as phone numbers and addresses, much of today's data takes the form of photos, videos, audio, and more, which makes about 80% of the data completely unstructured.

Veracity

Veracity basically means the degree of reliability that the data has to offer. Since a major part of the data is unstructured and irrelevant, Big Data systems need a way to filter it or translate it out, because reliable data is crucial to business development.

Value

Value is the major issue that we need to concentrate on. What matters is not just the amount of data that we store or process, but the amount of valuable, reliable, and trustworthy data that needs to be stored, processed, and analyzed to find insights.

Velocity

The term ‘velocity’ refers to the speed at which data is generated. How fast the data is generated and processed to meet demand determines the real potential of the data.

What are the solutions to the problem of Big Data?

The best solution available nowadays, used by almost all companies, is distributed storage.

Distributed storage is a concept in which data is split across multiple physical servers, often in more than one data center. It typically takes the form of a cluster of storage units. The way such a cluster is arranged is called its topology, and the topology used here is known as the master-slave node topology.

Consider that we have 1000 GB of data but limited resources to store it. One might think we should simply buy 1000 GB of storage and put all the data on it. That would store the data, but processing it would take more time, because all reads and writes go through a single device and I/O becomes the bottleneck.

So the best solution for this kind of use case is to divide the 1000 GB into 10 parts of 100 GB each and store them in 10 different storage centers. Because of this, the data can be stored efficiently, which addresses the volume problem, and it is stored in less time, since the parts can be written in parallel, which addresses the velocity problem.
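
The time saving can be estimated with simple arithmetic. In the sketch below, only the 1000 GB total and the 10-way split come from the text; the throughput of 100 MB/s per storage device is an assumed figure chosen for illustration, and network and coordination overhead are ignored.

```python
# Only the 1000 GB total and the 10-way split come from the text;
# the per-device throughput is an assumption for illustration.
TOTAL_GB = 1000
PARTS = 10
THROUGHPUT_MB_PER_S = 100  # assumed sustained write speed of one storage device

def write_seconds(size_gb, throughput_mb_per_s):
    """Time to write size_gb to one device, using 1 GB = 1000 MB."""
    return size_gb * 1000 / throughput_mb_per_s

part_gb = TOTAL_GB / PARTS

# One device writes everything sequentially.
single_device_s = write_seconds(TOTAL_GB, THROUGHPUT_MB_PER_S)

# Ten devices each write one part at the same time, so the elapsed
# time is the time for a single part (network and coordination ignored).
parallel_s = write_seconds(part_gb, THROUGHPUT_MB_PER_S)

print(f"part size: {part_gb:.0f} GB")
print(f"single device: {single_device_s:.0f} s")
print(f"{PARTS} devices in parallel: {parallel_s:.0f} s")
```

Under these assumptions the parallel write finishes in one tenth of the time, which is the intuition behind splitting the data across storage centers.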

The storage centers across which we distribute the data are known as Slave Nodes, and the node from which the data is distributed to them is known as the Master Node. Together, all these nodes form an infrastructure called a cluster. In the Big Data world, it is known as a Distributed Storage Cluster.
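
The master-slave arrangement can be pictured with a toy model. This is a minimal sketch, not how Hadoop is implemented: the class names `MasterNode` and `SlaveNode`, the `distribute` method, the choice of three slave nodes, and the round-robin assignment are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SlaveNode:
    """A storage node that holds the parts assigned to it."""
    name: str
    parts: list = field(default_factory=list)

@dataclass
class MasterNode:
    """Splits incoming data and hands the parts to slave nodes."""
    slaves: list

    def distribute(self, total_gb, part_gb):
        # Cut the data into equal parts, then assign them round-robin.
        # Round-robin is an illustrative choice, not a Hadoop detail.
        n_parts = total_gb // part_gb
        for i in range(n_parts):
            slave = self.slaves[i % len(self.slaves)]
            slave.parts.append((i, part_gb))
        return n_parts

# A hypothetical cluster: one master and three slave nodes.
cluster = MasterNode([SlaveNode(f"slave-{n}") for n in range(1, 4)])
n_parts = cluster.distribute(total_gb=1000, part_gb=100)

for s in cluster.slaves:
    stored = sum(size for _, size in s.parts)
    print(s.name, "holds", len(s.parts), "parts,", stored, "GB")
```

The master only decides where each part goes; the slave nodes hold the data, and together they make up the cluster.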

The problem described above can be solved with Hadoop.

What is Hadoop?

Hadoop is open-source software for storing data and running applications on clusters of commodity hardware. It provides massive storage for any kind of data, enormous processing power, and the ability to handle virtually limitless concurrent tasks or jobs.

Conclusion

So, we learned how multinational companies like Google, Facebook, etc. solve the challenges of Big Data using this concept.

I learned this much in just 2 days of the ARTH journey.

“Thanks to Mr. Vimal Daga sir for sharing this great information on Big Data!”

Translated from: https://medium.com/@venkateshpensalwar/big-data-hadoop-edf6572b2232
