Read a zipped file as a pandas DataFrame

Each Answer to this Q is separated by one/two green lines.

I’m trying to unzip a csv file and pass it into pandas so I can work on the file.
The code I have tried so far is:

import requests, zipfile, StringIO
r = requests.get('http://data.octo.dc.gov/feeds/crime_incidents/archive/crime_incidents_2013_CSV.zip')
z = zipfile.ZipFile(StringIO.StringIO(r.content))
crime2013 = pandas.read_csv(z.read('crime_incidents_2013_CSV.csv'))

After the last line, although python is able to get the file, I get a “does not exist” at the end of the error.

Can someone tell me what I’m doing incorrectly?

If you want to read a zipped or a tar.gz file into pandas dataframe, the read_csv methods includes this particular implementation.

df = pd.read_csv('filename.zip')

Or the long form:

df = pd.read_csv('filename.zip', compression='zip', header=0, sep=',', quotechar=""")

Description of the compression argument from the docs:

compression : {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’
For on-the-fly decompression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompression). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no decompression.

New in version 0.18.1: support for ‘zip’ and ‘xz’ compression.

I think you want to open the ZipFile, which returns a file-like object, rather than read:

In [11]: crime2013 = pd.read_csv(z.open('crime_incidents_2013_CSV.csv'))

In [12]: crime2013
Out[12]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24567 entries, 0 to 24566
Data columns (total 15 columns):
CCN                            24567  non-null values
REPORTDATETIME                 24567  non-null values
SHIFT                          24567  non-null values
OFFENSE                        24567  non-null values
METHOD                         24567  non-null values
LASTMODIFIEDDATE               24567  non-null values
BLOCKSITEADDRESS               24567  non-null values
BLOCKXCOORD                    24567  non-null values
BLOCKYCOORD                    24567  non-null values
WARD                           24563  non-null values
ANC                            24567  non-null values
DISTRICT                       24567  non-null values
PSA                            24567  non-null values
NEIGHBORHOODCLUSTER            24263  non-null values
BUSINESSIMPROVEMENTDISTRICT    3613  non-null values
dtypes: float64(4), int64(1), object(10)

It seems you don’t even have to specify the compression any more. The following snippet loads the data from filename.zip into df.

import pandas as pd
df = pd.read_csv('filename.zip')

(Of course you will need to specify separator, header, etc. if they are different from the defaults.)

For “zip” files, you can use import zipfile and your code will be working simply with these lines:

import zipfile
import pandas as pd
with zipfile.ZipFile("Crime_Incidents_in_2013.zip") as z:
   with z.open("Crime_Incidents_in_2013.csv") as f:
      train = pd.read_csv(f, header=0, delimiter="\t")
      print(train.head())    # print the first 5 rows

And the result will be:

X,Y,CCN,REPORT_DAT,SHIFT,METHOD,OFFENSE,BLOCK,XBLOCK,YBLOCK,WARD,ANC,DISTRICT,PSA,NEIGHBORHOOD_CLUSTER,BLOCK_GROUP,CENSUS_TRACT,VOTING_PRECINCT,XCOORD,YCOORD,LATITUDE,LONGITUDE,BID,START_DATE,END_DATE,OBJECTID
0  -77.054968548763071,38.899775938598317,0925135...                                                                                                                                                               
1  -76.967309569035052,38.872119553647011,1003352...                                                                                                                                                               
2  -76.996184958456539,38.927921847721443,1101010...                                                                                                                                                               
3  -76.943077541353617,38.883686046653935,1104551...                                                                                                                                                               
4  -76.939209158039446,38.892278093281632,1125028...

I guess what your looking is the following

from io import BytesIO
import requests
import pandas as pd

result = requests.get("https://www.xxx.zzz/file.zip")
df = pd.read_csv(BytesIO(result.content),compression='zip', header=0, sep=',', quotechar=""")

Read these article to understand why: https://medium.com/dev-bits/ultimate-guide-for-working-with-i-o-streams-and-zip-archives-in-python-3-6f3cf96dca50

https://www.kaggle.com/jboysen/quick-gz-pandas-tutorial

Please follow this link.

import pandas as pd
traffic_station_df = pd.read_csv('C:\\Folders\\Jupiter_Feed.txt.gz', compression='gzip',
                                 header=1, sep='\t', quotechar=""")

#traffic_station_df['Address'] = 'address'

#traffic_station_df.append(traffic_station_df)
print(traffic_station_df)


The answers/resolutions are collected from stackoverflow, are licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0 .

Leave a Reply

Your email address will not be published.