Tutorial by Examples | RIP Tutorial

Reading csv file into DataFrame

Example for reading a file data_file.csv such as:
File:
index,header1,header2,header3
1,str_data,12,1.4
3,str_data,22,42.33
4,str_data,2,3.44
2,str_data,43,43.34
7, str_data, 25, 23.32
Code:
pd.read_csv('data_file.csv')
Output:
index header1 header2 header3
0 1 str_dat...

pandas • Pandas IO tools (reading and saving data sets)

Basic saving to a csv file

import pandas as pd
raw_data = {'first_name': ['John', 'Jane', 'Jim'],
            'last_name': ['Doe', 'Smith', 'Jones'],
            'department': ['Accounting', 'Sales', 'Engineering']}
df = pd.DataFrame(raw_data, columns=raw_data.keys())
df.to_csv('data_file.csv')
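As a rough check of what the call above writes, the following sketch saves the same frame to a temporary file and reads it back. The temporary directory and the `restored` and `header_line` names are illustrative assumptions, not part of the original example. It shows that `to_csv` writes the row index as an unnamed first column by default, which `index_col=0` restores on reading.

```python
import os
import tempfile

import pandas as pd

raw_data = {
    "first_name": ["John", "Jane", "Jim"],
    "last_name": ["Doe", "Smith", "Jones"],
    "department": ["Accounting", "Sales", "Engineering"],
}
df = pd.DataFrame(raw_data)

# Write to a throwaway directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_file.csv")
    df.to_csv(path)

    # The first line of the file starts with an empty field: the index column.
    with open(path) as fh:
        header_line = fh.readline().strip()

    # index_col=0 turns that first column back into the row index.
    restored = pd.read_csv(path, index_col=0)

print(header_line)
print(restored)
```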


Parsing dates when reading from csv

You can specify the columns that contain dates so that pandas parses them automatically when reading from the csv:
pandas.read_csv('data_file.csv', parse_dates=['date_column'])
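A minimal sketch of the effect of `parse_dates`, assuming a small in-memory CSV in place of data_file.csv (the sample rows and the `value` column are made up for illustration). The same text is read twice so the resulting dtypes can be compared.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for data_file.csv.
csv_text = """date_column,value
2016-06-10,1
2016-07-11,2
2013-10-12,3
"""

# Without parse_dates the column is read as plain text.
plain = pd.read_csv(io.StringIO(csv_text))

# With parse_dates the column is converted to a datetime dtype.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date_column"])

print(plain.dtypes)
print(parsed.dtypes)
```

Once the column is a datetime column, accessors such as `parsed["date_column"].dt.year` become available.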


Spreadsheet to dict of DataFrames

with pd.ExcelFile('path_to_file.xls') as xl:
    d = {sheet_name: xl.parse(sheet_name) for sheet_name in xl.sheet_names}


Read a specific sheet

pd.read_excel('path_to_file.xls', sheet_name='Sheet1')
There are many parsing options for read_excel (similar to the options in read_csv).
pd.read_excel('path_to_file.xls', sheet_name='Sheet1', header=[0, 1, 2], skiprows=3, index_col=0) # etc.


Testing read_csv

import pandas as pd
import io
temp=u"""index; header1; header2; header3
1; str_data; 12; 1.4
3; str_data; 22; 42.33
4; str_data; 2; 3.44
2; str_data; 43; 43.34
7; str_data; 25; 23.32"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io....
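The excerpt above is cut off before the `read_csv` arguments, so the following is only one plausible way to finish it. The choice of `sep=";"`, `skipinitialspace=True` and `index_col=0` is an assumption made here to match the sample text, not necessarily what the original used.

```python
import io

import pandas as pd

temp = """index; header1; header2; header3
1; str_data; 12; 1.4
3; str_data; 22; 42.33
4; str_data; 2; 3.44
2; str_data; 43; 43.34
7; str_data; 25; 23.32"""

# io.StringIO wraps the string in a file-like object, so read_csv can be
# tried out without creating a file on disk.
df = pd.read_csv(
    io.StringIO(temp),
    sep=";",                # fields are separated by semicolons
    skipinitialspace=True,  # drop the space that follows each semicolon
    index_col=0,            # use the first column as the row index
)

print(df)
```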


List comprehension

All files are in the folder files. First create a list of DataFrames and then concat them:
import pandas as pd
import glob
#a.csv
#a,b
#1,2
#5,8
#b.csv
#a,b
#9,6
#6,4
#c.csv
#a,b
#4,3
#7,0
files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp) for fp in files]
#duplicated ind...
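To illustrate the duplicated index that the truncated comment above appears to refer to, here is a small sketch that uses in-memory strings in place of the files in the folder (the `sources` dict is a stand-in assumed for this example). It contrasts a plain `pd.concat` with `ignore_index=True`.

```python
import io

import pandas as pd

# In-memory stand-ins for a.csv, b.csv and c.csv from the comments above.
sources = {
    "a.csv": "a,b\n1,2\n5,8\n",
    "b.csv": "a,b\n9,6\n6,4\n",
    "c.csv": "a,b\n4,3\n7,0\n",
}

# One DataFrame per source, built with a list comprehension.
dfs = [pd.read_csv(io.StringIO(text)) for text in sources.values()]

# A plain concat keeps each frame's own 0..n-1 index, so labels repeat.
stacked = pd.concat(dfs)

# ignore_index=True renumbers the rows from 0 in the combined frame.
renumbered = pd.concat(dfs, ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```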


Read in chunks

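import pandas as pd
chunksize = 10 ** 5  # number of rows per chunk (placeholder value)
# filename and process() are placeholders for your own path and per-chunk logic
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)
    del chunk  # release the chunk before the next one is read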
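A runnable sketch of the same pattern, assuming a tiny in-memory CSV and a running sum as the per-chunk work (both are illustrative choices, not from the original). Each iteration yields an ordinary DataFrame with at most `chunksize` rows.

```python
import io

import pandas as pd

# Ten rows of sample data: the numbers 0..9 in a single "value" column.
csv_text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
chunk_lengths = []

# With chunksize set, read_csv returns an iterator of DataFrames
# instead of loading the whole file at once.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    chunk_lengths.append(len(chunk))
    total += int(chunk["value"].sum())

print(chunk_lengths)
print(total)
```

Aggregating per chunk like this keeps only one chunk in memory at a time, which is the usual reason for reading in chunks.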


Save to CSV file

Save with default parameters:
df.to_csv(file_name)
Write specific columns:
df.to_csv(file_name, columns=['col'])
Default delimiter is ',' - to change it:
df.to_csv(file_name, sep="|")
Write without the header:
df.to_csv(file_name, header=False)
Write with a given head...
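The options above can be tried without touching the disk, because `to_csv` returns the CSV text as a string when no path is given. The frame below and the use of `index=False` are assumptions for illustration; `index=False` does not appear in the visible part of the excerpt.

```python
import pandas as pd

df = pd.DataFrame({"col": [1, 2], "other": ["x", "y"]})

# Default parameters: comma delimiter, header row, index as first column.
default_lines = df.to_csv().splitlines()

# Custom delimiter, and no index column.
piped_lines = df.to_csv(sep="|", index=False).splitlines()

# Only one column, without header and without index.
one_column_lines = df.to_csv(
    columns=["col"], header=False, index=False
).splitlines()

print(default_lines)
print(piped_lines)
print(one_column_lines)
```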


Parsing date columns with read_csv

Dates come in many different formats; they can be parsed using a specific date parsing function. This input.csv:
2016 06 10 20:30:00 foo
2016 07 11 19:45:30 bar
2013 10 12 4:30:00 foo
Can be parsed like this:
mydateparser = lambda x: pd.datetime.strptime(x, "%Y %m %d %H:%M:%S...
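Since the original call is cut off, the sketch below shows one alternative way to get the same result rather than reconstructing it. It assumes the fields in input.csv are separated by whitespace, reads every field as text, and then builds the timestamps with the standard library `datetime.strptime` (used here instead of `pd.datetime`, which recent pandas releases no longer expose). The column names are made up for illustration.

```python
import io
from datetime import datetime

import pandas as pd

raw = """2016 06 10 20:30:00 foo
2016 07 11 19:45:30 bar
2013 10 12 4:30:00 foo
"""

# Hypothetical column names; the file itself has no header row.
columns = ["year", "month", "day", "time", "label"]

# Read every field as text so that values like "06" keep their leading zero.
df = pd.read_csv(
    io.StringIO(raw), sep=r"\s+", header=None, names=columns, dtype=str
)

# Re-join the date parts and parse them with an explicit format string.
stamp_text = df["year"] + " " + df["month"] + " " + df["day"] + " " + df["time"]
df["when"] = [datetime.strptime(text, "%Y %m %d %H:%M:%S") for text in stamp_text]

print(df[["when", "label"]])
```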


Read & merge multiple CSV files (with the same structure) into one DF

import os
import glob
import pandas as pd
def get_merged_csv(flist, **kwargs):
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)
path = 'C:/Users/csvfiles'
fmask = os.path.join(path, '*mask*.csv')
df = get_merged_csv(glob.glob(fmask), index_col=None, use...
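To see `get_merged_csv` in action without depending on a real folder, the sketch below writes three small files into a temporary directory and merges only those matching the mask. The file names and contents are invented for this example, and `sorted` is added because `glob.glob` does not guarantee an order.

```python
import glob
import os
import tempfile

import pandas as pd


def get_merged_csv(flist, **kwargs):
    # Read each file with the same keyword arguments and stack the results.
    return pd.concat([pd.read_csv(f, **kwargs) for f in flist], ignore_index=True)


# Hypothetical files: two match the '*mask*.csv' pattern, one does not.
samples = [
    ("part_mask_1.csv", "a,b\n1,2\n"),
    ("part_mask_2.csv", "a,b\n3,4\n"),
    ("other.csv", "a,b\n9,9\n"),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    for name, body in samples:
        with open(os.path.join(tmp_dir, name), "w") as fh:
            fh.write(body)

    fmask = os.path.join(tmp_dir, "*mask*.csv")
    merged = get_merged_csv(sorted(glob.glob(fmask)), index_col=None)

print(merged)
```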


Reading csv file into a pandas data frame when there is no header row

If the file does not contain a header row,
File:
1;str_data;12;1.4
3;str_data;22;42.33
4;str_data;2;3.44
2;str_data;43;43.34
7; str_data; 25; 23.32
you can use the keyword names to provide column names:
df = pandas.read_csv('data_file.csv', sep=';', index_col=0, ski...
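The call above is truncated, so this is a sketch of how `names` might be used with the sample data rather than the original code. The column names are borrowed from the first example on this page, and `header=None` and `skipinitialspace=True` are assumptions added here.

```python
import io

import pandas as pd

raw = """1;str_data;12;1.4
3;str_data;22;42.33
4;str_data;2;3.44
2;str_data;43;43.34
7; str_data; 25; 23.32
"""

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    header=None,  # the first line is data, not column names
    names=["index", "header1", "header2", "header3"],  # assumed names
    index_col=0,  # use the first named column as the row index
    skipinitialspace=True,  # tolerate the spaces in the last row
)

print(df)
```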


Using HDFStore

import string
import numpy as np
import pandas as pd
# generate sample DF with various dtypes
df = pd.DataFrame({
    'int32': np.random.randint(0, 10**6, 10),
    'int64': np.random.randint(10**7, 10**9, 10).astype(np.int64)*10,
    'float': np.random.rand(10),
    'string': ...


Read Nginx access log (multiple quotechars)

For multiple quotechars use regex in place of sep:
df = pd.read_csv(log_file,
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    usecols=[0, 3, 4, 5, 6, 7, 8],
    names=['ip', 'time', 'request', 'statu...
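To see what the separator regex does, it can be applied to a single line with the standard library `re.split`. The log line below is a made-up example in the common combined format, not taken from the original. The pattern splits on whitespace only when an even number of double quotes follows and the position is not inside square brackets.

```python
import re

# Same pattern as the sep argument above.
SEP = r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])'

# Hypothetical access-log line for illustration.
line = (
    '127.0.0.1 - - [10/Oct/2016:13:55:36 +0000] '
    '"GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (X11; Linux)"'
)

parts = re.split(SEP, line)

# Keep the same positions as usecols=[0, 3, 4, 5, 6, 7, 8].
selected = [parts[i] for i in (0, 3, 4, 5, 6, 7, 8)]

for field in selected:
    print(field)
```

The bracketed timestamp and the quoted request and user agent stay in one piece each, with their brackets and quotes still attached.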
