Ch7 Excel File Manipulation with pandas¶

Case Study: Excel Reporting¶

import pandas as pd

df = pd.read_excel("sales_data/new/January.xlsx")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9493 entries, 0 to 9492
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   transaction_id    9493 non-null   object        
 1   store             9493 non-null   object        
 2   status            9493 non-null   object        
 3   transaction_date  9493 non-null   datetime64[ns]
 4   plan              9493 non-null   object        
 5   contract_type     9493 non-null   object        
 6   amount            9493 non-null   float64       
dtypes: datetime64[ns](1), float64(1), object(5)
memory usage: 519.3+ KB

Reading Excel Files with pandas¶

df = pd.read_excel("xl/stores.xlsx",
                   sheet_name="2019", skiprows=1, usecols="B:F")
df

	Store	Employees	Manager	Since	Flagship
0	New York	10	Sarah	2018-07-20	False
1	San Francisco	12	Neriah	2019-11-02	MISSING
2	Chicago	4	Katelin	2020-01-31	NaN
3	Boston	5	Georgiana	2017-04-01	True
4	Washington DC	3	Evan	NaT	False
5	Las Vegas	11	Paul	2020-01-06	False

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Store      6 non-null      object        
 1   Employees  6 non-null      int64         
 2   Manager    6 non-null      object        
 3   Since      5 non-null      datetime64[ns]
 4   Flagship   5 non-null      object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 368.0+ bytes

def fix_missing(x):
  return False if x in ["", "MISSING"] else x

df = pd.read_excel("xl/stores.xlsx",
                   sheet_name="2019", skiprows=1, usecols="B:F",
                   converters={"Flagship": fix_missing})
df

	Store	Employees	Manager	Since	Flagship
0	New York	10	Sarah	2018-07-20	False
1	San Francisco	12	Neriah	2019-11-02	False
2	Chicago	4	Katelin	2020-01-31	False
3	Boston	5	Georgiana	2017-04-01	True
4	Washington DC	3	Evan	NaT	False
5	Las Vegas	11	Paul	2020-01-06	False

# The Flagship column now has Dtype "bool"
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Store      6 non-null      object        
 1   Employees  6 non-null      int64         
 2   Manager    6 non-null      object        
 3   Since      5 non-null      datetime64[ns]
 4   Flagship   6 non-null      bool          
dtypes: bool(1), datetime64[ns](1), int64(1), object(2)
memory usage: 326.0+ bytes

sheets = pd.read_excel("xl/stores.xlsx", sheet_name=["2019", "2020"],
                       skiprows=1, usecols=["Store", "Employees"])
sheets["2019"].head(2)

	Store	Employees
0	New York	10
1	San Francisco	12

df = pd.read_excel("xl/stores.xlsx", sheet_name=0,
                   skiprows=2, skipfooter=3,
                   usecols="B:C,F", header=None,
                   names=["Branch", "Employee_Count", "Is_Flagship"])
df

	Branch	Employee_Count	Is_Flagship
0	New York	10	False
1	San Francisco	12	MISSING
2	Chicago	4	NaN

df = pd.read_excel("xl/stores.xlsx", sheet_name="2019",
                   skiprows=1, usecols="B,C,F", skipfooter=2,
                   na_values="MISSING", keep_default_na=False)
df

	Store	Employees	Flagship
0	New York	10	False
1	San Francisco	12	NaN
2	Chicago	4
3	Boston	5	True

f = open("output.txt", "w")
f.write("Some text")
f.close()

Context Managers and the with Statement¶

with open("output.txt", "w") as f:
  f.write("Some text")

with pd.ExcelFile("xl/stores.xls") as f:
  df1 = pd.read_excel(f, "2019", skiprows=1, usecols="B:F", nrows=2)
  df2 = pd.read_excel(f, "2020", skiprows=1, usecols="B:F", nrows=2)

df1

	Store	Employees	Manager	Since	Flagship
0	New York	10	Sarah	2018-07-20	False
1	San Francisco	12	Neriah	2019-11-02	MISSING

stores = pd.ExcelFile("xl/stores.xlsx")
stores.sheet_names

['2019', '2020', '2019-2020']

url = ("https://raw.githubusercontent.com/fzumstein/"
       "python-for-excel/1st-edition/xl/stores.xlsx")
pd.read_excel(url, skiprows=1, usecols="B:E", nrows=2)

	Store	Employees	Manager	Since
0	New York	10	Sarah	2018-07-20
1	San Francisco	12	Neriah	2019-11-02

Writing Excel Files with pandas¶

import numpy as np
import datetime as dt

data = [[dt.datetime(2020, 1, 1, 10, 13), 2.222, 1, True],
        [dt.datetime(2020, 1, 2), np.nan, 2, False],
        [dt.datetime(2020, 1, 2), np.inf, 3, True]]
df = pd.DataFrame(data=data,
                  columns=["Dates", "Floats", "Integers", "Booleans"])
df.index.name = "index"
df

	Dates	Floats	Integers	Booleans
index
0	2020-01-01 10:13:00	2.222	1	True
1	2020-01-02 00:00:00	NaN	2	False
2	2020-01-02 00:00:00	inf	3	True

df.to_excel("written_with_pandas.xlsx", sheet_name="Output",
            startrow=1, startcol=1, index=True, header=True,
            na_rep="<NA>", inf_rep="<INF>")

with pd.ExcelWriter("written_with_pandas2.xlsx") as writer:
  df.to_excel(writer, sheet_name="Sheet1", startrow=1, startcol=1)
  df.to_excel(writer, sheet_name="Sheet1", startrow=10, startcol=1)
  df.to_excel(writer, sheet_name="Sheet2")

Ch06 Time Series Analysis with pandas Ch08 Excel File Manipulation with Reader and Writer