Each Answer to this Q is separated by one/two green lines.
I am exploring switching to python and pandas as a long-time SAS user.
However, when running some tests today, I was surprised that python ran out of memory when trying to
pandas.read_csv() a 128mb csv file. It had about 200,000 rows and 200 columns of mostly numeric data.
With SAS, I can import a csv file into a SAS dataset and it can be as large as my hard drive.
Is there something analogous in
I regularly work with large files and do not have access to a distributed computing network.
Wes is of course right! I’m just chiming in to provide a little more complete example code. I had the same issue with a 129 Mb file, which was solved by:
import pandas as pd tp = pd.read_csv('large_dataset.csv', iterator=True, chunksize=1000) # gives TextFileReader, which is iterable with chunks of 1000 rows. df = pd.concat(tp, ignore_index=True) # df is DataFrame. If errors, do `list(tp)` instead of `tp`
In principle it shouldn’t run out of memory, but there are currently memory problems with
read_csv on large files caused by some complex Python internal issues (this is vague but it’s been known for a long time: http://github.com/pydata/pandas/issues/407).
At the moment there isn’t a perfect solution (here’s a tedious one: you could transcribe the file row-by-row into a pre-allocated NumPy array or memory-mapped file–
np.mmap), but it’s one I’ll be working on in the near future. Another solution is to read the file in smaller pieces (use
iterator=True, chunksize=1000) then concatenate then with
pd.concat. The problem comes in when you pull the entire text file into memory in one big slurp.
This is an older thread, but I just wanted to dump my workaround solution here. I initially tried the
chunksize parameter (even with quite small values like 10000), but it didn’t help much; had still technical issues with the memory size (my CSV was ~ 7.5 Gb).
Right now, I just read chunks of the CSV files in a for-loop approach and add them e.g., to an SQLite database step by step:
import pandas as pd import sqlite3 from pandas.io import sql import subprocess # In and output file paths in_csv = '../data/my_large.csv' out_sqlite="../data/my.sqlite" table_name="my_table" # name for the SQLite database table chunksize = 100000 # number of lines to process at each iteration # columns that should be read from the CSV file columns = ['molecule_id','charge','db','drugsnow','hba','hbd','loc','nrb','smiles'] # Get number of lines in the CSV file nlines = subprocess.check_output('wc -l %s' % in_csv, shell=True) nlines = int(nlines.split()) # connect to database cnx = sqlite3.connect(out_sqlite) # Iteratively read CSV and dump lines into the SQLite table for i in range(0, nlines, chunksize): df = pd.read_csv(in_csv, header=None, # no header, define column header manually later nrows=chunksize, # number of rows to read at each iteration skiprows=i) # skip rows that were already read # columns to read df.columns = columns sql.to_sql(df, name=table_name, con=cnx, index=False, # don't use CSV file index index_label="molecule_id", # use a unique column from DataFrame as index if_exists="append") cnx.close()
Below is my working flow.
import sqlalchemy as sa import pandas as pd import psycopg2 count = 0 con = sa.create_engine('postgresql://postgres:[email protected]:00001/r') #con = sa.create_engine('sqlite:///XXXXX.db') SQLite chunks = pd.read_csv('..file', chunksize=10000, encoding="ISO-8859-1", sep=',', error_bad_lines=False, index_col=False, dtype="unicode")
Base on your file size, you’d better optimized the chunksize.
for chunk in chunks: chunk.to_sql(name="Table", if_exists="append", con=con) count += 1 print(count)
After have all data in Database, You can query out those you need from database.
If you want to load huge csv files, dask might be a good option. It mimics the pandas api, so it feels quite similar to pandas
You can use Pytable rather than pandas df.
It is designed for large data sets and the file format is in hdf5.
So the processing time is relatively fast.