How can I use pandas on an Excel file to generate a list of unique values (a text enumeration)?

Reply 1: 黃索遠

Thanks for the invite. Yes, this can be done. pandas provides duplicated and drop_duplicates, which cover what the asker needs.

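For example:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
   ...:                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
   ...:                    # constructor arguments reconstructed from the output in Out[4]
   ...:                    'c': [0.320745, 1.404111, 0.442200, 0.592355,
   ...:                          0.284760, 0.138029, 1.406531]})

In [4]: df
Out[4]:
       a  b         c
0    one  x  0.320745
1    one  y  1.404111
2    two  x  0.442200
3    two  y  0.592355
4    two  x  0.284760
5  three  x  0.138029
6   four  x  1.406531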

In [5]: df.duplicated('a')
Out[5]:
0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [6]: df.duplicated('a', keep='last')
Out[6]:
0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [8]: df.duplicated('a', keep=False)
Out[8]:
0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [9]: df.drop_duplicates('a')
Out[9]:
       a  b         c
0    one  x  0.320745
2    two  x  0.442200
5  three  x  0.138029
6   four  x  1.406531

In [10]: df.drop_duplicates('a', keep='last')
Out[10]:
       a  b         c
1    one  y  1.404111
4    two  x  0.284760
5  three  x  0.138029
6   four  x  1.406531

In [11]: df.drop_duplicates('a', keep=False)
Out[11]:
       a  b         c
5  three  x  0.138029
6   four  x  1.406531

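Both methods take as an argument the column label(s) used to identify duplicate rows; in the example that is 'a'.

duplicated returns a boolean vector whose length is the number of rows; each value says whether that row is a duplicate.

drop_duplicates removes the duplicate rows.

By default the first occurrence in each group of duplicates is kept, but the keep parameter lets you choose a different strategy.

keep='first' (default): mark/drop every duplicate except the first occurrence;

keep='last': mark/drop every duplicate except the last occurrence;

keep=False: mark/drop all rows that have a duplicate.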
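To make the relationship between the two methods concrete, here is a small sketch, assuming the same 'a' and 'b' columns as the example above (column 'c' is left out because it plays no role). It checks that filtering with the negated duplicated mask gives the same rows as drop_duplicates for each keep setting. The variable names are only illustrative.

```python
import pandas as pd

# Same 'a' and 'b' columns as in the example above.
df = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two", "three", "four"],
    "b": ["x", "y", "x", "y", "x", "x", "x"],
})

kept = {}
for keep in ("first", "last", False):
    mask = df.duplicated("a", keep=keep)          # True marks a row to discard
    filtered = df[~mask]                          # manual filtering
    dropped = df.drop_duplicates("a", keep=keep)  # built-in shortcut
    assert filtered.equals(dropped)
    kept[keep] = list(dropped.index)              # row labels that survive

print(kept)
```

The surviving row labels agree with Out[9], Out[10] and Out[11] above.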

You can also pass the labels of several columns:

In [12]: df.duplicated(['a', 'b'])
Out[12]:
0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

In [13]: df.drop_duplicates(['a', 'b'])
Out[13]:
       a  b         c
0    one  x  0.320745
1    one  y  1.404111
2    two  x  0.442200
3    two  y  0.592355
5  three  x  0.138029
6   four  x  1.406531
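If the goal is a plain Python list of the distinct combinations rather than a DataFrame, something along these lines may work. This is a sketch that assumes the same 'a' and 'b' columns as above; the names unique_pairs and pairs are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "a": ["one", "one", "two", "two", "two", "three", "four"],
    "b": ["x", "y", "x", "y", "x", "x", "x"],
})

# Keep one row per distinct (a, b) pair, in order of first appearance.
unique_pairs = df[["a", "b"]].drop_duplicates()

# Turn the remaining rows into a plain list of tuples.
pairs = list(unique_pairs.itertuples(index=False, name=None))
print(pairs)
```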

To remove rows with duplicated index values, use Index.duplicated and filter with the result; it supports the same keep parameter.

In [14]: df = pd.DataFrame({'a': range(6),
    ...:                    # data argument reconstructed from the output in Out[15]
    ...:                    'b': [1.777118, -0.881011, 0.456924,
    ...:                          0.922302, -0.490238, 2.925533]},
    ...:                   index=['a', 'a', 'b', 'c', 'b', 'a'])

In [15]: df
Out[15]:
   a          b
a  0   1.777118
a  1  -0.881011
b  2   0.456924
c  3   0.922302
b  4  -0.490238
a  5   2.925533

In [16]: df.index.duplicated()
Out[16]: array([False,  True, False, False,  True,  True], dtype=bool)

In [17]: df[~df.index.duplicated()]
Out[17]:
   a         b
a  0  1.777118
b  2  0.456924
c  3  0.922302

In [18]: df[~df.index.duplicated(keep='last')]
Out[18]:
   a          b
c  3   0.922302
b  4  -0.490238
a  5   2.925533

In [19]: df[~df.index.duplicated(keep=False)]
Out[19]:
   a         b
c  3  0.922302
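The index filtering above can be wrapped in a small helper if it is needed repeatedly. The following is only a sketch: drop_duplicate_index is a hypothetical function name, not part of pandas, and the frame carries just the integer column 'a' with the same index labels as the example.

```python
import pandas as pd

df = pd.DataFrame({"a": range(6)}, index=["a", "a", "b", "c", "b", "a"])

def drop_duplicate_index(frame, keep="first"):
    """Return the rows whose index label is not flagged as a duplicate."""
    return frame[~frame.index.duplicated(keep=keep)]

first = drop_duplicate_index(df)               # first row per label
last = drop_duplicate_index(df, keep="last")   # last row per label
none = drop_duplicate_index(df, keep=False)    # only labels that occur once

# If only the distinct labels are needed, Index.unique() is enough.
labels = df.index.unique().tolist()
print(labels)
```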

One more thing: the asker should work on their search skills. All of this can be found in the official documentation.

Reply 2: Ziqiao

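df['your_column'].unique()

If your column originally holds array(['a','b','a','a','c']), this statement returns array(['a','b','c']). To turn that into a list, just append .tolist().

Not sure whether this is what the asker is after.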
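Putting this together for the Excel case, a minimal sketch might look like the following. It uses an in-memory DataFrame as a stand-in for the workbook; in practice the frame would presumably come from pd.read_excel, and both the file name in the comment and the column name your_column are placeholders. Blank cells are assumed to arrive as missing values, so they are dropped before deduplicating.

```python
import pandas as pd

# Stand-in for data loaded from a workbook, e.g.
# df = pd.read_excel("your_file.xlsx")   # hypothetical file name
df = pd.DataFrame({"your_column": ["a", "b", "a", "a", "c", None]})

# Drop blank cells, then keep each value once in order of first appearance.
names = df["your_column"].dropna().unique().tolist()
print(names)
```

unique() preserves the order in which values first appear, so wrap the result in sorted() if an alphabetical list is wanted.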
