Pandas: Reading the first n rows from a parquet file?


I have a parquet file and I want to read the first n rows from it into a pandas DataFrame.
What I tried:

df = pd.read_parquet(path="filepath", nrows = 10)

It did not work and gave me this error:

TypeError: read_table() got an unexpected keyword argument 'nrows'

I tried the skiprows argument as well, but it gave me the same error.

Alternatively, I can read the complete parquet file and keep only the first n rows, but that requires more computation, which I want to avoid.

Is there any way to achieve this?

After exploring and getting in touch with the pandas dev team, the conclusion is that pandas does not support the nrows or skiprows arguments when reading a parquet file.

The reason is that pandas uses the pyarrow or fastparquet engine to process parquet files, and at the time of writing pyarrow had no support for reading a file partially or for skipping rows while reading (not sure about fastparquet). Below is the link to the issue on the pandas GitHub where this is discussed.

https://github.com/pandas-dev/pandas/issues/24511

The accepted answer is out of date. It is now possible to read only the first few rows of a parquet file into pandas, though it is a bit messy and backend dependent.

To read using PyArrow as the backend, do the following:

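from pyarrow.parquet import ParquetFile
import pyarrow as pa

pf = ParquetFile('file_name.pq')
# Pull only the first batch of at most 10 rows instead of the whole file
first_ten_rows = next(pf.iter_batches(batch_size=10))
# Wrap the batch in a Table and convert it to a pandas DataFrame
df = pa.Table.from_batches([first_ten_rows]).to_pandas()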

Change the batch_size argument to however many rows you want to read in.

Parquet is a column-oriented storage format, designed for that… So it's normal to load the whole file to access just one row.


The answers/resolutions are collected from Stack Overflow and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.