By default, parquet-tools looks in the local filesystem, so to point it at HDFS you need to prefix the file path with hdfs://. In your case you can do something like this:
parquet-tools head hdfs://localhost/<hdfs-path> | less
I had the same issue and this worked for me. There is no need to download the file locally first.
I'd rather use the HDFS NFS Gateway plus autofs for easy HDFS file investigation.
My setup:
HDFS NFS Gateway service running on the namenode.
The distribution-bundled autofs service enabled, with the following change made to auto.master:
/net -hosts nobind
With this in place I can easily run the following commands to investigate any HDFS file:
head /net/<namenodeIP>/path/to/hdfs/file
parquet-tools head /net/<namenodeIP>/path/to/hdfs/par-file
rsync -rv /local/directory/ /net/<namenodeIP>/path/to/hdfs/parentdir/
> parquet-tools dump -m -c make part-00000-fc34f237-c985-4ebc-822b-87fa446f6f70.c000.snappy.parquet | head -20
BINARY make
--------------------------------------------------------------------------------
*** row group 1 of 1, values 1 to 148192 ***
value 1: R:0 D:1 V:HYDA
value 2: R:0 D:1 V:NISS
value 3: R:0 D:1 V:NISS
value 4: R:0 D:1 V:TOYO
value 5: R:0 D:1 V:AUDI
value 6: R:0 D:1 V:MERC
value 7: R:0 D:1 V:LEX
value 8: R:0 D:1 V:BMW
value 9: R:0 D:1 V:GMC
value 10: R:0 D:1 V:HOND
value 11: R:0 D:1 V:TOYO
value 12: R:0 D:1 V:NISS
value 13: R:0 D:1 V:
value 14: R:0 D:1 V:THOR
value 15: R:0 D:1 V:DODG
value 16: R:0 D:1 V:DODG
value 17: R:0 D:1 V:HOND
If you're using HDFS, the following commands are very useful since they come up frequently (left here for future reference):
hadoop jar parquet-tools-1.9.0.jar schema hdfs://path/to/file.snappy.parquet
hadoop jar parquet-tools-1.9.0.jar head -n5 hdfs://path/to/file.snappy.parquet
Initially I tried brew install parquet-tools, but this did not appear to work under my install of WSL.
Windows 10 + MSVC
Same as above. Use CMake to generate the Visual Studio 2019 project, then build.
git clone https://github.com/apache/arrow
cd arrow
cd cpp
mkdir buildmsvc
cd buildmsvc
cmake .. -DPARQUET_BUILD_EXECUTABLES=ON -DARROW_PARQUET=ON -DARROW_WITH_SNAPPY=ON -DARROW_WITH_BROTLI=ON -DPARQUET_BUILD_EXAMPLES=ON -DARROW_CSV=ON
# Then open the generated .sln file in MSVC and build. Everything should build perfectly.
Troubleshooting:
In case any libraries were missing, I pointed CMake at my install of vcpkg: I ran vcpkg integrate install, then copied the CMake toolchain flag it prints to the end of the CMake line.
This installs everything into the current directory. You will have to add this directory manually to the path, or run parq.exe from within this directory.
My other answer builds parquet-reader from source. This utility looks like it does much the same job.
Actually, I found out that pandas already supports parquet files, as long as you've installed pyarrow or fastparquet as its backend. Check out read_parquet:
import pandas as pd
df = pd.read_parquet('your-file.parquet')
df.head(10)
...
Previous answer:
Might be late to the party, but I just learnt that pyarrow already supports reading parquet, and it's quite powerful. Chances are you already have pyarrow and pandas installed, so you can read parquet just like this:
from pyarrow import parquet
import pandas
p = parquet.read_table('/path/to/your/xxxxx.parquet')
df = p.to_pandas()
df.head(10)
...
In case anyone else comes to this looking for an easy way to inspect a parquet file from the command line, I wrote the tool clidb to do this.
It doesn't generate JSON like the OP wanted, but instead shows the parquet data as a table and allows SQL snippets to be run against it. It should work with:
DuckDB has a CLI tool (prebuilt binaries for Linux, Windows, and macOS) that can be used to query parquet data from the command line.
PS C:\Users\nsuser\dev\standalone_executable_binaries> ./duckdb
v0.5.1 7c111322d
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
Read parquet data using SQL queries
D SELECT * FROM READ_PARQUET('C:\Users\nsuser\dev\sample_files\userdata1.parquet') limit 3;
┌─────────────────────┬────┬────────────┬───────────┬─────────────────────────┬────────┬────────────────┬──────────────────┬───────────┬───────────┬───────────┬─────────────────────┬──────────┐
│ registration_dttm │ id │ first_name │ last_name │ email │ gender │ ip_address │ cc │ country │ birthdate │ salary │ title │ comments │
├─────────────────────┼────┼────────────┼───────────┼─────────────────────────┼────────┼────────────────┼──────────────────┼───────────┼───────────┼───────────┼─────────────────────┼──────────┤
│ 2016-02-03 07:55:29 │ 1 │ Amanda │ Jordan │ ajordan0@com.com │ Female │ 1.197.201.2 │ 6759521864920116 │ Indonesia │ 3/8/1971 │ 49756.53 │ Internal Auditor │ 1E+02 │
│ 2016-02-03 17:04:03 │ 2 │ Albert │ Freeman │ afreeman1@is.gd │ Male │ 218.111.175.34 │ │ Canada │ 1/16/1968 │ 150280.17 │ Accountant IV │ │
│ 2016-02-03 01:09:31 │ 3 │ Evelyn │ Morgan │ emorgan2@altervista.org │ Female │ 7.161.136.94 │ 6767119071901597 │ Russia │ 2/1/1960 │ 144972.51 │ Structural Engineer │ │
└─────────────────────┴────┴────────────┴───────────┴─────────────────────────┴────────┴────────────────┴──────────────────┴───────────┴───────────┴───────────┴─────────────────────┴──────────┘
pip install parquet-cli          # install via pip
parq filename.parquet            # view metadata
parq filename.parquet --schema   # view the schema
parq filename.parquet --head 10  # view top n rows