1. Simple ways to hard-code a DataFrame

  • Use a Python dict that maps each column name to a list of values:
    import pandas as pd

    fruit_info = pd.DataFrame(
        data={'fruit': ['apple', 'orange', 'banana', 'raspberry'],
              'color': ['red', 'orange', 'yellow', 'pink']}
    )
    fruit_info
  • Use a list of row tuples together with an explicit list of column names:
    fruit_info2 = pd.DataFrame(
        [("red", "apple"), ("orange", "orange"), ("yellow", "banana"),
         ("pink", "raspberry")],
        columns=["color", "fruit"])
    fruit_info2

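Both DataFrames hold the same data. fruit_info2 displays as shown here; fruit_info lists the fruit column first, because columns keep the order in which they were given:

    color      fruit
0     red      apple
1  orange     orange
2  yellow     banana
3    pink  raspberry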
  • Call df.shape to get the dimensions as a (rows, columns) tuple.
  • Convert the entire DataFrame into a two-dimensional NumPy array with df.values (df.to_numpy() is the form the pandas documentation now recommends).
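The two inspection calls above can be tried on the fruit table. This is a small sketch that rebuilds fruit_info from the dict example; the variable names shape and as_array are only illustrative.

```python
import pandas as pd

fruit_info = pd.DataFrame({
    'fruit': ['apple', 'orange', 'banana', 'raspberry'],
    'color': ['red', 'orange', 'yellow', 'pink'],
})

shape = fruit_info.shape          # (number of rows, number of columns)
as_array = fruit_info.to_numpy()  # 2-D NumPy array; fruit_info.values holds the same data here

print(shape)           # (4, 2)
print(as_array.shape)  # (4, 2)
print(as_array[0])     # the first row as a 1-D array
```

Note that the array keeps only the values: the column names and the row index are not part of it.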

2. Simple manipulations on a DataFrame

  • To add a column, assign a list with one value per row to a new column name:

    fruit_info['rank1'] = [2, 4, 1, 3]
    fruit_info

    Output:

           fruit   color  rank1
    0      apple     red      2
    1     orange  orange      4
    2     banana  yellow      1
    3  raspberry    pink      3
    • Alternatively, use df.loc[:, 'new column name'] = …. The brackets take two selectors: the first picks rows (":" means all rows) and the second picks columns.
      fruit_info.loc[:, 'rank2'] = [2, 4, 1, 3]
      fruit_info
      Output:
             fruit   color  rank1  rank2
      0      apple     red      2      2
      1     orange  orange      4      4
      2     banana  yellow      1      1
      3  raspberry    pink      3      3
  • To drop rows or columns:

    # axis=1 drops columns; axis=0 (the default) drops rows
    fruit_info_original = fruit_info.drop(labels=['rank1', 'rank2'], axis=1)
    fruit_info_original

    Output:

           fruit   color
    0      apple     red
    1     orange  orange
    2     banana  yellow
    3  raspberry    pink
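  • To rename columns (below, each column name is capitalized):

    # The function is applied to every column label; inplace=True modifies fruit_info_original itself
    fruit_info_original.rename(lambda x: x.capitalize(), axis=1, inplace=True)
    # This binds a second name to the same DataFrame; it does not make a copy
    fruit_info_caps = fruit_info_original
    fruit_info_caps

    Output:

           Fruit   Color
    0      apple     red
    1     orange  orange
    2     banana  yellow
    3  raspberry    pink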
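The examples above drop columns and rename in place. As a complementary sketch, the snippet below drops a row with axis=0 and renames with a dict without inplace, so the original DataFrame is left untouched. The names without_orange and renamed are made up for this illustration.

```python
import pandas as pd

fruit_info = pd.DataFrame({
    'fruit': ['apple', 'orange', 'banana', 'raspberry'],
    'color': ['red', 'orange', 'yellow', 'pink'],
})

# axis=0 drops rows; labels are index labels, so this removes the row labelled 1
without_orange = fruit_info.drop(labels=[1], axis=0)

# rename with a dict and without inplace returns a new DataFrame
renamed = fruit_info.rename(columns={'fruit': 'Fruit', 'color': 'Color'})

print(list(without_orange.index))  # [0, 2, 3]
print(list(renamed.columns))       # ['Fruit', 'Color']
print(list(fruit_info.columns))    # ['fruit', 'color'] (original unchanged)
```

Returning a new DataFrame instead of using inplace=True makes it easier to keep both the original and the modified version around.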

3. Load a DataFrame from a zip archive

import zipfile
import pandas as pd

# namesbystate_path (the path to the zip archive) must be defined beforehand
zf = zipfile.ZipFile(namesbystate_path, 'r')

column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

def load_dataframe_from_zip(zf, f):
    # Open one archive member and parse it as a header-less CSV
    with zf.open(f) as fh:
        return pd.read_csv(fh, header=None, names=column_labels)

# One DataFrame per state file, in filename order
states = [
    load_dataframe_from_zip(zf, f)
    for f in sorted(zf.filelist, key=lambda x: x.filename)
    if f.filename.endswith('.TXT')
]

# Stack the per-state frames and renumber the rows from 0
baby_names = pd.concat(states, ignore_index=True)
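To try the same pattern without the real archive, here is a self-contained sketch that builds a tiny zip in memory and reads it back. The member names (AK.TXT, XX.TXT, README.pdf) and the XX row are placeholders invented for the demonstration, not real data.

```python
import io
import zipfile

import pandas as pd

column_labels = ['State', 'Sex', 'Year', 'Name', 'Count']

# Build a small archive in memory (placeholder file names; the XX row is made up)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('AK.TXT', 'AK,F,1910,Mary,14\nAK,F,1910,Annie,12\n')
    zf.writestr('XX.TXT', 'XX,F,1910,Example,1\n')
    zf.writestr('README.pdf', 'not a data file')

# Read every .TXT member back, in filename order, and stack the results
buffer.seek(0)
frames = []
with zipfile.ZipFile(buffer, 'r') as zf:
    for info in sorted(zf.filelist, key=lambda x: x.filename):
        if info.filename.endswith('.TXT'):
            with zf.open(info) as fh:
                frames.append(pd.read_csv(fh, header=None, names=column_labels))

sample_names = pd.concat(frames, ignore_index=True)
print(sample_names.shape)        # (3, 5)
print(list(sample_names.index))  # [0, 1, 2]
```

Passing ignore_index=True to pd.concat gives the combined frame a fresh 0-based index, so no separate reset_index step is needed.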

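About with and as (explanation from Copilot):

In this code, with and as are the keywords used for context management:

  1. The with statement
    • The with statement runs a block of code under a context manager, an object that performs setup when the block is entered and cleanup when it is exited.
    • Here, with opens one file inside the ZIP archive and guarantees that it is closed when the block ends, even if an error occurs.
  2. The as keyword
    • In a with statement, the name after as is bound to the object that the context manager hands back on entry.
    • Here, fh is a file handle for the archive member opened by zf.open(f).
    • Binding it to fh lets the code inside the with block read from it.

In short, with manages a resource so that it is released correctly after use, and as gives that resource a name so the block can work with it directly.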
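The enter/exit behaviour described above can be observed with a toy context manager. The Resource class below is hypothetical and exists only to log what happens; real file objects implement the same two methods.

```python
class Resource:
    """Hypothetical resource that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')
        return self  # this return value is what `as` binds

    def __exit__(self, exc_type, exc, tb):
        self.events.append('exit')
        return False  # do not swallow exceptions


resource = Resource()
try:
    with resource as handle:
        handle.events.append('work')
        raise ValueError('simulated failure')
except ValueError:
    pass

print(resource.events)  # ['enter', 'work', 'exit']
```

The 'exit' entry is recorded even though the block raised an error, which is the cleanup guarantee that makes with the usual way to open files.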

4. Basic inspection of data

  • len(df) returns the number of rows in the DataFrame.
  • df.head() returns the first five rows.
  • df.shape returns the dimensions as a (rows, columns) tuple.
  • df.describe() returns summary statistics such as count, mean, std, min, and max for the numeric columns.
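As a quick check of these four calls, the sketch below types in the first five rows of the baby-names data by hand, so it runs without the zip archive. The variable names sample and summary are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'State': ['AK'] * 5,
    'Sex': ['F'] * 5,
    'Year': [1910] * 5,
    'Name': ['Mary', 'Annie', 'Anna', 'Margaret', 'Helen'],
    'Count': [14, 12, 10, 8, 7],
})

print(len(sample))     # 5
print(sample.shape)    # (5, 5)
print(sample.head(2))  # head() takes an optional row count; the default is 5

summary = sample.describe()  # only the numeric columns are summarized
print(list(summary.columns))         # ['Year', 'Count']
print(summary.loc['mean', 'Count'])  # 10.2
```

The text columns (State, Sex, Name) are left out of the summary because describe() looks only at numeric columns by default.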

5. Two ways to slice a DataFrame

To access a specific piece of data, pandas offers two accessors, loc and iloc:

  • loc: selection by label (index labels and column names)
  • iloc: selection by integer position

E.g.

1. baby_names.head()

output:

  State Sex  Year      Name  Count
0    AK   F  1910      Mary     14
1    AK   F  1910     Annie     12
2    AK   F  1910      Anna     10
3    AK   F  1910  Margaret      8
4    AK   F  1910     Helen      7

2. baby_names.loc[2:5, ['Name']]

output (a loc slice includes both end labels, so row 5 appears):

       Name
2      Anna
3  Margaret
4     Helen
5     Elsie

3. baby_names.iloc[2:5, ['Name']]

output: IndexError: .iloc requires numeric indexers, got ['Name']

iloc accepts only integer positions, so a column name is rejected.

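4. baby_names.iloc[2:5, [3]]

output (Name is the column at position 3; an iloc slice stops before the end position, so row 5 is excluded):

       Name
2      Anna
3  Margaret
4     Helen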
5. df = baby_names[:5].set_index("Name")

6. df

output (the Name column is now the index):

         State Sex  Year  Count
Name
Mary        AK   F  1910     14
Annie       AK   F  1910     12
Anna        AK   F  1910     10
Margaret    AK   F  1910      8
Helen       AK   F  1910      7

7. df.loc[['Mary', 'Anna'], :]

output:

     State Sex  Year  Count
Name
Mary    AK   F  1910     14
Anna    AK   F  1910     10

To select rows by position even though the index now holds names, use the integer-location accessor, iloc:

8. df.iloc[1:4, 2:3]

output:

          Year
Name
Annie     1910
Anna      1910
Margaret  1910
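The difference between the two accessors shows up most clearly when the same slice is passed to both. The following sketch uses a hand-typed five-row sample with only the Name and Count columns; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    'Name': ['Mary', 'Annie', 'Anna', 'Margaret', 'Helen'],
    'Count': [14, 12, 10, 8, 7],
})

by_label = sample.loc[1:3, ['Name']]  # labels 1, 2, 3: end label included
by_position = sample.iloc[1:3, [0]]   # positions 1, 2: end position excluded

print(list(by_label['Name']))     # ['Annie', 'Anna', 'Margaret']
print(list(by_position['Name']))  # ['Annie', 'Anna']

# With names as the index, loc takes names while iloc still takes positions
indexed = sample.set_index('Name')
print(indexed.loc['Anna', 'Count'])  # 10
print(indexed.iloc[2, 0])            # 10
```

With the default 0-based index, labels and positions happen to be the same numbers, which is why the inclusive and exclusive end points are easy to mix up.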

To be continued…