Parquet vs ORC vs ORC with Snappy

I am running a few tests on the storage formats available with Hive, using Parquet and ORC as the main options. I included ORC once with default compression and once with Snappy.
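
For reference, a minimal HiveQL sketch of how such copies can be created (source_table and the target table names are placeholders, not the actual schema):

    -- ORC with the default codec (ZLIB in Hive)
    CREATE TABLE table_b STORED AS ORC
    AS SELECT * FROM source_table;

    -- ORC with Snappy, selected via the orc.compress table property
    CREATE TABLE table_c STORED AS ORC
    TBLPROPERTIES ("orc.compress"="SNAPPY")
    AS SELECT * FROM source_table;

    -- Parquet with its default codec
    CREATE TABLE table_d STORED AS PARQUET
    AS SELECT * FROM source_table;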

I have read many documents stating that Parquet is better than ORC in time/space complexity, but my tests show the opposite of what I read.

Here are some details of my data.

Table A - Text File Format - 2.5 GB
Table B - ORC - 652 MB
Table C - ORC with Snappy - 802 MB
Table D - Parquet - 1.9 GB

Parquet gave the worst compression for my table.

My tests on the above tables gave the following results.

Row count operation

Text Format Cumulative CPU - 123.33 sec
Parquet Format Cumulative CPU - 204.92 sec
ORC Format Cumulative CPU - 119.99 sec
ORC with SNAPPY Cumulative CPU - 107.05 sec

Sum of a column operation

Text Format Cumulative CPU - 127.85 sec
Parquet Format Cumulative CPU - 255.2 sec
ORC Format Cumulative CPU - 120.48 sec
ORC with SNAPPY Cumulative CPU - 98.27 sec

Average of a column operation

Text Format Cumulative CPU - 128.79 sec
Parquet Format Cumulative CPU - 211.73 sec
ORC Format Cumulative CPU - 165.5 sec
ORC with SNAPPY Cumulative CPU - 135.45 sec

Selecting 4 columns from a given range using a where clause

Text Format Cumulative CPU - 72.48 sec
Parquet Format Cumulative CPU - 136.4 sec
ORC Format Cumulative CPU - 96.63 sec
ORC with SNAPPY Cumulative CPU - 82.05 sec
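
(For reference, the four tests correspond to queries of roughly the following shape; table and column names are placeholders, not the actual schema.)

    -- Row count
    SELECT COUNT(*) FROM table_d;

    -- Sum of a column
    SELECT SUM(amount) FROM table_d;

    -- Average of a column
    SELECT AVG(amount) FROM table_d;

    -- 4 columns from a given range using a where clause
    SELECT col1, col2, col3, col4
    FROM table_d
    WHERE amount BETWEEN 100 AND 500;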

Does this mean ORC is faster than Parquet? Or is there something I can do to make it work better in terms of query response time and compression ratio?

Thanks!

I would say that both of these formats have their own advantages.

Parquet might be better if you have highly nested data, because it stores its elements as a tree, like Google Dremel does (see here).
Apache ORC might be better if your file structure is flattened.

And as far as I know, Parquet does not support indexes yet. ORC comes with a lightweight index and, since Hive 0.14, an additional Bloom filter, which might help improve query response time, especially when it comes to sum operations.
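
As an illustration, ORC Bloom filters are declared per column through table properties when the table is written; this is only a sketch, and the table and column names below are placeholders:

    -- Build ORC Bloom filters for selected columns at write time
    CREATE TABLE table_b_bf STORED AS ORC
    TBLPROPERTIES (
      "orc.compress"="ZLIB",
      "orc.bloom.filter.columns"="customer_id,order_date",  -- placeholder columns
      "orc.bloom.filter.fpp"="0.05"                          -- target false-positive rate
    )
    AS SELECT * FROM table_b;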

The Parquet default compression is SNAPPY. Are Tables A, B, C, and D holding the same dataset? If so, it looks like there is something shady about it when it only compresses to 1.9 GB.

You are seeing this because:

  • Hive has a vectorized ORC reader but no vectorized Parquet reader.

  • Spark has a vectorized Parquet reader but no vectorized ORC reader.

  • Spark performs best with Parquet; Hive performs best with ORC.

I've seen similar differences when running ORC and Parquet with Spark.

Vectorization means that rows are decoded in batches, dramatically improving memory locality and cache utilization.

(correct as of Hive 2.0 and Spark 2.1)
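
For illustration, these are the switches that control the vectorized paths described above (a sketch; the Spark flag is on by default in Spark 2.x):

    -- Hive: process ORC rows in batches (1024 rows at a time by default)
    SET hive.vectorized.execution.enabled=true;
    SET hive.vectorized.execution.reduce.enabled=true;

    -- Spark SQL: the vectorized Parquet reader is controlled by this flag
    SET spark.sql.parquet.enableVectorizedReader=true;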

We did some benchmarks comparing the different file formats (Avro, JSON, ORC, and Parquet) in different use cases.

https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

The data is all publicly available and benchmark code is all open source at:

https://github.com/apache/orc/tree/branch-1.4/java/bench

Both of them have their advantages. We use Parquet at work together with Hive and Impala, but I just wanted to point out a few advantages of ORC over Parquet: during long-running queries, when Hive queries ORC tables, GC is called about 10 times less frequently. That might be nothing for many projects, but might be crucial for others.

ORC also takes much less time when you need to select just a few columns from the table. Some other queries, especially with joins, also take less time because of vectorized query execution, which is not available for Parquet.

Also, ORC compression is sometimes a bit random, while Parquet compression is much more consistent. It looks like when an ORC table has many numeric columns, it doesn't compress as well. This affects both zlib and snappy compression.

Both Parquet and ORC have their own advantages and disadvantages. But I simply try to follow a simple rule of thumb: "How nested is your data, and how many columns are there?" If you follow the Google Dremel paper, you can see how Parquet is designed. It uses a hierarchical tree-like structure to store data; the deeper the nesting, the deeper the tree.

ORC, on the other hand, is designed for a flattened file store. So if your data is flattened with fewer columns, you can go with ORC; otherwise, Parquet would be fine for you. Compression on flattened data works amazingly well in ORC.

We did some benchmarking with a larger flattened file, converted it to a Spark DataFrame, stored it in both Parquet and ORC format in S3, and queried it with Redshift Spectrum.

Size of the file in Parquet: ~7.5 GB, and it took 7 minutes to write
Size of the file in ORC: ~7.1 GB, and it took 6 minutes to write
Queries seem faster on the ORC files.

Soon we will do some benchmarking for nested data and update the results here.

The default file format for Spark is Parquet, but for Hive it is ORC. As far as I know (maybe I'm wrong), the compression ratio with zlib is higher than with snappy, but it requires more CPU. Snappy, on the other hand, is a great "decent" compression format when you don't want too much CPU consumption.
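
To illustrate the trade-off, these are session-level settings Hive consults when writing new files (a sketch; the codec can also be set per table as shown earlier):

    -- ORC: ZLIB gives a higher ratio, SNAPPY uses less CPU
    SET hive.exec.orc.default.compress=SNAPPY;

    -- Parquet: codec used by Hive when writing Parquet files
    SET parquet.compression=SNAPPY;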

I haven't tried the Parquet API to write/read files, but I have some experience doing that with ORC. The ORC format is great, but it has what seems to be a bottleneck when you're trying to write files at the same time from different threads of the same JVM process. It also has some memory problems. I had to make some minor changes in the classes

org.apache.hadoop.hive.ql.io.orc.MemoryManager
org.apache.hadoop.hive.ql.io.orc.WriterImpl

in order to make it work better and faster (HDP 2.6.4.0).

As previous answers have said, it all depends on your data structure, the API or framework you're using to read the data, and what you're trying to do with that data.

ORC files have statistics at different levels (file, stripe, and row) that can greatly improve performance when you're filtering data, for example.
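
A small sketch of what that looks like in practice (placeholder table and column names); with predicate pushdown enabled, stripes whose statistics cannot match the filter are skipped without being read:

    -- Let Hive push the WHERE predicate down to ORC file/stripe/row statistics
    SET hive.optimize.index.filter=true;

    SELECT col_a, col_b
    FROM table_b
    WHERE order_date >= '2018-01-01';   -- stripes outside this range can be skipped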

ORC also has some improvements at write time if your columns have null values or if the same value repeats often.

A benchmark is not that useful when what you're really trying to do has nothing to do with what the benchmark is testing.