For data lakes in the Hadoop ecosystem, the HDFS file system is used. However, most cloud providers have replaced it with their own deep storage systems such as S3 or GCS. When using deep storage, choosing the right file format is crucial.
These file systems or deep storage systems are cheaper than databases but provide only basic storage and no strong ACID guarantees.
You will need to choose the right storage for your use case based on your needs and budget. For example, you may use a database for ingestion if your budget permits and then, once the data is transformed, store it in your data lake for OLAP analysis. Or you may store everything in deep storage and keep only a small subset of hot data in a fast storage system such as a relational database.
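As a rough illustration of the second pattern, here is a hedged PySpark sketch that keeps the full history in deep storage and copies only a small hot subset into a relational database. The bucket path, JDBC URL, table name and the event_date column are all hypothetical.

```python
# Hypothetical sketch: full history stays in the data lake, the last 7 days
# are copied into Postgres for fast serving. Paths and credentials are made up.
import os
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("hot-subset").getOrCreate()

events = spark.read.parquet("s3://my-lake/curated/events/")  # full history

# Keep only recent data as the "hot" subset (event_date assumed to be a date column).
hot = events.where(F.col("event_date") >= F.date_sub(F.current_date(), 7))

(hot.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/analytics")
    .option("dbtable", "hot_events")
    .option("user", "analytics")
    .option("password", os.environ["DB_PASSWORD"])  # assume a secret injected at runtime
    .mode("overwrite")
    .save())
```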
Note that deep storage systems store the data as files, and different file formats and compression algorithms provide benefits for certain use cases. How you store the data in your data lake is critical: you need to consider the format, the compression and, especially, how you partition your data.
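To make this concrete, here is a minimal PySpark sketch showing how format, compression and partitioning are all decided at write time. The bucket paths and the event_date column are assumptions for illustration only.

```python
# Minimal sketch: read raw landing data and write it back as partitioned,
# snappy-compressed Parquet. All paths and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-write").getOrCreate()

events = spark.read.json("s3://my-lake/landing/events/")  # raw ingested data

(events.write
    .mode("overwrite")
    .partitionBy("event_date")        # creates folders like event_date=2024-01-01/
    .option("compression", "snappy")  # fast codec, moderate file size
    .parquet("s3://my-lake/curated/events/"))
```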
The most common formats are CSV, JSON, AVRO, Protocol Buffers, Parquet, and ORC.
File Format Options

Some things to consider when choosing the format are:
The structure of your data: Some formats accept nested data such as JSON, Avro or Parquet and others do not. Even the ones that do may not be highly optimized for it. Avro is the most efficient format for nested data; I recommend not using Parquet nested types because they are very inefficient. Processing nested JSON is also very CPU intensive. In general, it is recommended to flatten the data when ingesting it (see the sketch after this list).
Performance: Some formats such as Avro and Parquet perform better than others such as JSON. Even between Avro and Parquet, one will be better than the other depending on the use case. For example, since Parquet is a column-based format, it is great for querying your data lake using SQL, whereas Avro is better for ETL row-level transformations.
Easy to read: Consider whether you need people to read the data or not. JSON or CSV are text formats and are human readable, whereas more performant formats such as Parquet or Avro are binary.
Compression: Some formats offer higher compression rates than others.
Schema evolution: Adding or removing fields is far more complicated in a data lake than in a database. Some formats like Avro or Parquet provide some degree of schema evolution, which allows you to change the data schema and still query the data. Tools such as the Delta Lake format provide even better support for dealing with schema changes.
Compatibility: JSON or CSV are widely adopted and compatible with almost any tool, while more performant options have fewer integration points.
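Since flattening on ingestion comes up repeatedly, here is a small hedged sketch of what that can look like in PySpark. The input path and the nested field names (user.id, user.address.city, signup_ts) are hypothetical.

```python
# Hypothetical sketch: flatten nested JSON while landing it as Parquet.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("flatten-ingest").getOrCreate()

raw = spark.read.json("s3://my-lake/landing/users/")  # nested JSON records

# Pull the nested fields we care about up to the top level.
flat = raw.select(
    F.col("user.id").alias("user_id"),
    F.col("user.address.city").alias("city"),
    F.col("signup_ts"),
)

flat.write.mode("overwrite").parquet("s3://my-lake/curated/users/")
```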
CSV: Good option for compatibility, spreadsheet processing and human-readable data. The data must be flat. It is not efficient and cannot handle nested data. There may be problems with the separator, which can lead to data quality issues. Use this format for exploratory analysis, POCs or small data sets.
JSON: Heavily used in APIs. Nested format. It is widely adopted and human readable, but it can be difficult to read if there are lots of nested fields. Great for small data sets, landing data or API integration. If possible, convert to a more efficient format before processing large amounts of data.
Avro: Great for storing row data, very efficient. It has a schema and supports evolution. Great integration with Kafka. Supports file splitting. Use it for row-level operations or in Kafka. Great for writing data, slower to read.
Protocol Buffers: Great for APIs, especially for gRPC. Supports schemas and it is very fast. Use it for APIs or machine learning.
Parquet: Columnar storage. It has schema support. It works very well with Hive and Spark as a way to store columnar data in deep storage that is queried using SQL. Because it stores data in columns, query engines read only the selected columns rather than the entire data set, as opposed to Avro. Use it as a reporting layer (see the sketch after this list).
ORC: Similar to Parquet, it offers better compression. It also provides better schema evolution support, but it is less popular.
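To illustrate the column-pruning point made for Parquet above, here is a hedged sketch using pyarrow. The file name and column names are made up; the key idea is that only the requested columns are read, whereas a row-oriented format such as Avro would deserialize whole records.

```python
# Hypothetical sketch: read just two columns from a Parquet file.
import pyarrow.parquet as pq

# Only the "order_id" and "amount" column chunks are fetched from storage.
table = pq.read_table("sales.parquet", columns=["order_id", "amount"])
print(table.num_rows, table.schema)
```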
Lastly, you also need to consider how to compress the data, weighing the trade-off between file size and CPU cost. Some compression algorithms are faster but produce bigger files, while others are slower but achieve better compression rates. For more details check this article.
Compression options (image by author)

I recommend using snappy for streaming data since it does not require too much CPU power. For batch, bzip2 is a great option.
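If you want to see the trade-off on your own data, a rough sketch like the one below can help. It uses pyarrow to write the same made-up table with several Parquet codecs and compares file sizes; note that bzip2 is more common for plain-text files in Hadoop than for Parquet, so only Parquet-specific codecs are shown here.

```python
# Rough sketch: write the same synthetic table with different Parquet codecs
# and compare file sizes. Data and file names are made up for illustration.
import os
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"user_id": range(100_000),
                   "value": [x * 0.5 for x in range(100_000)]})
table = pa.Table.from_pandas(df)

for codec in ["snappy", "gzip", "zstd"]:
    path = f"sample_{codec}.parquet"
    pq.write_table(table, path, compression=codec)
    print(codec, os.path.getsize(path), "bytes")
```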
As we can see, CSV and JSON are easy-to-use, human-readable and common formats, but they lack many of the capabilities of other formats, making them too slow for querying the data lake. ORC and Parquet are widely used in the Hadoop ecosystem to query data, whereas Avro is also used outside of Hadoop, especially together with Kafka for ingestion; it is very good for row-level ETL processing. Row-oriented formats have better schema evolution capabilities than column-oriented formats, making them a great option for data ingestion.
I hope you enjoyed this article. Feel free to leave a comment or share this post. Follow me for future posts.
Translated from: https://towardsdatascience.com/big-data-file-formats-explained-dfaabe9e8b33