Introductory Pandas Notes

2024-01-17

1. Simple ways to hard-code a DataFrame

Use a Python dict:

import pandas as pd

fruit_info = pd.DataFrame(
    data = {'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink']
          })
fruit_info

Or pass a list of row tuples together with explicit column names:

fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")],
    columns = ["color", "fruit"])
fruit_info2

Both hold the same data. The second displays as shown below; the first has the columns in the opposite order (fruit, then color):

	color	fruit
0	red	apple
1	orange	orange
2	yellow	banana
3	pink	raspberry

Use df.shape to get the shape of a DataFrame as a (rows, columns) tuple.
You can convert the entire DataFrame into a two-dimensional NumPy array using df.values (or df.to_numpy()).
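
As a quick check of the points above, here is a minimal sketch that builds both frames, compares them after aligning the column order, and converts one to a NumPy array. The variable names same and arr are only illustrative.

```python
import pandas as pd

fruit_info = pd.DataFrame(
    data={'fruit': ['apple', 'orange', 'banana', 'raspberry'],
          'color': ['red', 'orange', 'yellow', 'pink']})

fruit_info2 = pd.DataFrame(
    [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
     ("pink", "raspberry")],
    columns=["color", "fruit"])

# Both frames have 4 rows and 2 columns
print(fruit_info.shape)    # (4, 2)

# Same data, different column order: reorder the columns before comparing
same = fruit_info[["color", "fruit"]].equals(fruit_info2)
print(same)                # True

# Convert to a two-dimensional NumPy array
arr = fruit_info2.to_numpy()
print(arr.shape)           # (4, 2)
```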

2. Simple manipulations on a DataFrame

To add a column:

fruit_info['rank1'] = [2, 4, 1, 3]
fruit_info

Output:

	fruit	color	rank1
0	apple	red	2
1	orange	orange	4
2	banana	yellow	1
3	raspberry	pink	3

Or you can use df.loc[:, 'new column name'] = ... The brackets take two values: the first selects rows and the second selects columns.

fruit_info.loc[:, 'rank2'] = [2, 4, 1, 3]
fruit_info

Output:

	fruit	color	rank1	rank2
0	apple	red	2	2
1	orange	orange	4	4
2	banana	yellow	1	1
3	raspberry	pink	3	3

To drop a row or column:

# "axis = 1" means to drop columns while "axis = 0" means to drop rows

fruit_info_original = fruit_info.drop(labels=['rank1', 'rank2'], axis=1)
fruit_info_original

Output:

	fruit	color
0	apple	red
1	orange	orange
2	banana	yellow
3	raspberry	pink
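
To make the axis argument concrete, here is a small sketch on a made-up three-row frame. It drops one column and one row, and shows that drop returns a new DataFrame rather than changing the original. The names no_rank and no_first are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple', 'orange', 'banana'],
                   'color': ['red', 'orange', 'yellow'],
                   'rank1': [2, 3, 1]})

no_rank = df.drop(labels=['rank1'], axis=1)   # axis=1: drop a column
no_first = df.drop(labels=[0], axis=0)        # axis=0: drop the row labelled 0

print(list(no_rank.columns))   # ['fruit', 'color']
print(list(no_first.index))    # [1, 2]
print(list(df.columns))        # ['fruit', 'color', 'rank1'], original unchanged
```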

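To rename the columns (below, each column name is capitalized):

# inplace=True modifies fruit_info_original itself, so fruit_info_caps
# is just a second name for the same DataFrame
fruit_info_original.rename(lambda x: x.capitalize(), axis=1, inplace=True)
fruit_info_caps = fruit_info_original
fruit_info_caps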

Output:

	Fruit	Color
0	apple	red
1	orange	orange
2	banana	yellow
3	raspberry	pink
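
An alternative that leaves the original frame untouched is to pass columns= to rename and keep the returned copy. This is a sketch on a one-row toy frame; the 'colour' spelling is just an illustrative target name.

```python
import pandas as pd

df = pd.DataFrame({'fruit': ['apple'], 'color': ['red']})

# A callable is applied to every column label
caps = df.rename(columns=str.capitalize)

# A dict renames only the labels it lists
partial = df.rename(columns={'color': 'colour'})

print(list(caps.columns))      # ['Fruit', 'Color']
print(list(partial.columns))   # ['fruit', 'colour']
print(list(df.columns))        # ['fruit', 'color'], original unchanged
```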

3. Load and build a DataFrame from a zip file

import zipfile

# namesbystate_path is the path to the zip archive (its value is not shown in these notes)
zf = zipfile.ZipFile(namesbystate_path, 'r')

column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    with zf.open(f) as fh:
        return pd.read_csv(fh, header=None, names=column_labels)

states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x:x.filename)
    if f.filename.endswith('.TXT')
]

# Concatenate all per-state frames at once and renumber the rows 0..n-1
baby_names = pd.concat(states, ignore_index=True)
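
Since the real archive is not included in these notes, the following sketch runs the same steps on a tiny zip file built in memory. The file names and the row values inside it are made up for illustration only.

```python
import io
import zipfile
import pandas as pd

# Build a small archive in memory (hypothetical file names and data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf_out:
    zf_out.writestr('AK.TXT', 'AK,F,1910,Mary,14\nAK,F,1910,Annie,12\n')
    zf_out.writestr('AL.TXT', 'AL,F,1910,Mary,20\n')
    zf_out.writestr('README.pdf', 'not a data file')

column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

buffer.seek(0)
frames = []
with zipfile.ZipFile(buffer, 'r') as zf:
    # Visit members in name order and keep only the .TXT data files
    for info in sorted(zf.filelist, key=lambda x: x.filename):
        if info.filename.endswith('.TXT'):
            with zf.open(info) as fh:
                frames.append(pd.read_csv(fh, header=None, names=column_labels))

# One concat call; ignore_index=True renumbers the rows from 0
baby_names = pd.concat(frames, ignore_index=True)
print(baby_names.shape)         # (3, 5)
print(list(baby_names.index))   # [0, 1, 2]
```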

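About with ... as (explanation from Copilot):

In this code, with and as are the keywords used for context management.

The with statement:

The with statement runs a block of code under a context manager, which can perform actions when the block is entered and when it is exited.

In this code, the with statement opens one file inside the ZIP archive so that its contents can be processed.

The as keyword:

In a with statement, the name after as is bound to the object that the context manager provides.

Here fh is a file handle for a file inside the ZIP archive, opened with zf.open(f).

The as keyword assigns this file handle to the variable fh so that it can be used inside the with block.

In summary, the with statement manages resources and ensures they are released properly after use, while the as keyword gives us the object from the context manager so that we can work with it directly.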
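
To see the enter and exit steps in order, here is a minimal sketch using contextlib.contextmanager from the standard library. The managed_resource function and the events list are hypothetical names made up for this illustration; they are not part of zipfile or pandas.

```python
from contextlib import contextmanager

events = []

@contextmanager
def managed_resource(name):
    events.append(f'open {name}')        # runs when the with block is entered
    try:
        yield name.upper()               # this value is what "as" binds
    finally:
        events.append(f'close {name}')   # runs on exit, even after an error

with managed_resource('file') as handle:
    events.append(f'use {handle}')

print(events)   # ['open file', 'use FILE', 'close file']
```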

4. Basic inspection of data

len(df): the number of rows in the DataFrame
df.head(): the first five rows of the DataFrame
df.shape: the shape of the DataFrame as a (rows, columns) tuple
df.describe(): basic statistics (count, mean, std, min, quartiles, max) for the columns that hold numeric data
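
A short sketch of these four calls on a small frame. The names come from the baby-names output in these notes, but the frame itself and some of its counts are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mary', 'Annie', 'Anna', 'Margaret', 'Helen', 'Elsie'],
                   'Count': [14, 12, 10, 8, 7, 6]})

print(len(df))               # 6
print(df.shape)              # (6, 2)
print(len(df.head()))        # 5, head() defaults to five rows
print(len(df.head(2)))       # 2, or pass the number of rows you want

stats = df.describe()        # only the numeric column is summarized
print(list(stats.columns))   # ['Count']
print(stats.loc['mean', 'Count'])   # 9.5
```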

5. Two ways to slice a DataFrame

To access a specific piece of data, there are two different accessors, loc and iloc:

loc: selection by label (index and column names)
iloc: selection by integer position
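
One difference that is easy to miss: a loc slice includes its end label, while an iloc slice excludes its end position. A minimal sketch on a toy frame with the default integer index (the frame is made up, so Name sits at column position 0 here):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mary', 'Annie', 'Anna', 'Margaret', 'Helen', 'Elsie'],
                   'Year': [1910] * 6})

by_label = df.loc[2:5, ['Name']]    # labels 2 through 5, end included
by_position = df.iloc[2:5, [0]]     # positions 2 through 4, end excluded

print(len(by_label))      # 4
print(len(by_position))   # 3
```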

Examples:

1. baby_names.head()

output:

	State	Sex	Year	Name	Count
0	AK	F	1910	Mary	14
1	AK	F	1910	Annie	12
2	AK	F	1910	Anna	10
3	AK	F	1910	Margaret	8
4	AK	F	1910	Helen	7

2. baby_names.loc[2:5, ['Name']]

output:

	Name
2	Anna
3	Margaret
4	Helen
5	Elsie

3. baby_names.iloc[2:5, ['Name']]

output: IndexError: .iloc requires numeric indexers, got ['Name']

4. baby_names.iloc[2:5, [3]]

output:

	Name
2	Anna
3	Margaret
4	Helen

5. df = baby_names[:5].set_index("Name")

6. df

output (the Name column is now the index):

	State	Sex	Year	Count
Name
Mary	AK	F	1910	14
Annie	AK	F	1910	12
Anna	AK	F	1910	10
Margaret	AK	F	1910	8
Helen	AK	F	1910	7

7. df.loc[['Mary', 'Anna'], :]

output:

	State	Sex	Year	Count
Name
Mary	AK	F	1910	14
Anna	AK	F	1910	10

However, if we still want to access rows by position rather than by label, we need to use the integer-location accessor, iloc:

8. df.iloc[1:4, 2:3]

output:

	Year
Name
Annie	1910
Anna	1910
Margaret	1910
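
The examples above can be reproduced without the full dataset. This sketch rebuilds a five-row frame with only three of the columns, so the column positions differ from the real baby_names frame, and shows label-based and position-based selection on a string index side by side.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mary', 'Annie', 'Anna', 'Margaret', 'Helen'],
                   'State': ['AK'] * 5,
                   'Count': [14, 12, 10, 8, 7]}).set_index('Name')

# loc works with the new string labels
print(df.loc['Anna', 'Count'])                  # 10
print(list(df.loc['Annie':'Margaret'].index))   # ['Annie', 'Anna', 'Margaret']

# iloc still works with integer positions, whatever the index is
print(list(df.iloc[1:4].index))                 # ['Annie', 'Anna', 'Margaret']

# get_loc translates a column label into its integer position
print(df.columns.get_loc('Count'))              # 1
```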

To be continued…