pandas basics

  1. Series
    1. Creation
      1. obj = Series([4, 7, -5, 3])
      2. obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

    2. Getting the index
      1. ind = obj.index (the type of ind is 'pandas.core.index.Int64Index'; newer pandas returns a RangeIndex for the default index)
      2. vvv = ind[0] (the type of vvv is 'numpy.int64'; it can be converted with int())
    3. Getting the values (list)
      1. val = obj.values (the type of val is 'numpy.ndarray')
      2. al = val.tolist() (converts to a Python list; under Python 2 each value is a long)
    4. Getting a single value
      1. obj['index_label']
    5. Other operations:
      1. Filtering:
        1. obj2[obj2 > 0]

      2. Arithmetic:
        1. obj2 * 2

      3. Membership test:
        1. 'b' in obj2                  (returns True if 'b' exists in the index)
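The Series operations above can be sketched end to end (a minimal, self-contained example reusing the obj2 defined earlier):

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = obj2[obj2 > 0]   # filtering: keep only values greater than 0
doubled = obj2 * 2          # element-wise arithmetic
has_b = 'b' in obj2         # membership tests the index, not the values
```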

  2. DataFrame
    1. Creation
      1. data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
        frame = DataFrame(data)
      2. dict1 = { \
        'timestamp' : str(timestamp), \
        'guifei' : float(itemList[16]) \
        }
        buyDataList.append(dict1)
        ### use a 'list of dicts' to construct a DataFrame
        buyDf = pd.DataFrame(buyDataList, columns=['timestamp', 'guifei'])
      3. Create a DataFrame from a list of tuples
        ## create df with a list of tuples

        data1 = []
        for item in resList:
            l1 = item.split(',')
            l1.insert(0, str('sz002466'))
            tu1 = tuple(l1)
            data1.append(tu1)

        df = pd.DataFrame(data1, columns=eastTableColumns)
      4. Create a DataFrame from a list of lists
        ## create df with a list of lists
        df4 = pd.DataFrame([list6], columns=eastTableColumns)
      5. Append a list to an existing DataFrame as a Series
        ### append a list to an existing df as a Series
        tt7 = pd.Series(list1, index=df3.columns)   ### the Series index must match the df columns (the original `index=list` shadowed the builtin)
        tt7.name = ('sz002407', '2016-08-02')
        df3 = df3.append(tt7)   ### DataFrame.append was removed in pandas 2.0
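Since DataFrame.append was removed in pandas 2.0, the same row append can be done with pd.concat. A minimal sketch (df3's contents and the row values here are made up for illustration):

```python
import pandas as pd

# a small stand-in for an existing df3
df3 = pd.DataFrame([[1.0, 2.0]], columns=['open', 'close'],
                   index=[('sz002407', '2016-08-01')])

# the Series index must match the DataFrame's columns
tt7 = pd.Series([3.0, 4.0], index=df3.columns)
tt7.name = ('sz002407', '2016-08-02')

# DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
df3 = pd.concat([df3, tt7.to_frame().T])
```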


      6. Create DataFrames with MultiIndexes and combine them
        #### create df by list of list and combine df
        arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
                  np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
        df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

        narr = [np.array(['bar', 'baz', 'foo', 'qux']),
                np.array(['three', 'three', 'three', 'three'])]

        l1 = [1, 3, 5, 7]
        l2 = [2, 3, 4, 5]
        l3 = [5, 7, 5, 6]
        l4 = [24, 4, 6, 7]
        alll = [l1, l2, l3, l4]
        df1 = pd.DataFrame(alll, index=narr)

        frames = [df, df1]
        res = pd.concat(frames)
    2. Getting the index
      1. ind = obj.index (the type of ind is 'pandas.core.index.Int64Index'; newer pandas returns a RangeIndex for the default index)
      2. vvv = ind[0] (the type of vvv is 'numpy.int64'; it can be converted with int())
    3. Setting the index
      1. df1 = df.set_index(['code', 'date'])
    4. Getting a single column (list)
      1. t2 = df1['code'] (the type of t2 is Series)
      2. t3 = df1.code
      3. t4 = df1[['code', 'name']] ### select multiple columns
    5. Getting a subset of rows
      1. df[0:3]
      2. data[data['three'] > 5]
    6. Ways to select rows and columns by index:
      1. .loc
        1. Select rows and columns by index label
        2. Examples:
          1. dfl.loc['20130102':'20130104']     #### select a range of rows

          2. df.loc[:, ['A', 'B']]    #### select all rows of columns A and B
          3. df.loc['20130102':'20130104', ['A', 'B']]    ### specify both rows and columns
      2. .iloc (slices include the start but exclude the end)
        1. Select rows and columns by integer position
        2. Examples:
          1. s1.iloc[:3]       ### the first 3 rows
          2. s1.iloc[3]        ### the 4th row
          3. df1.iloc[1:5, 2:4]     #### 1:5 are row positions, 2:4 are column positions
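A runnable contrast of the two accessors (with made-up labels r1..r4; note that .loc slices are end-inclusive while .iloc slices are end-exclusive):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
                  index=['r1', 'r2', 'r3', 'r4'])

by_label = df.loc['r2':'r3', ['A', 'B']]   # label slice: r2 and r3 both included
by_pos = df.iloc[1:3, 0:1]                 # position slice: rows 1,2 / column 0 only
```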
      3. .ix (removed in pandas 1.0; use .loc or .iloc instead)
        1. Select rows and columns by index label or by position
        2. Example:
          1. df.ix[[('bar', 'two'), ('qux', 'one')]]    ### each tuple is a composite (MultiIndex) row label with 2 parts
      4. Boolean indexing
        1. Select rows by a condition
        2. Examples:
          1. df[df.A > 0]
          2. df[df > 0]
          3. df.code.isin(['002407'])
          4. df.loc[df.code.isin(['002407'])]
          5. df[(df.one >= 1) & (df.one …)]      ### combine conditions with &; wrap each condition in parentheses

          6. df1.loc[df1.Per == '-', 'Per'] = 0
          7. dfwithnostop = dfndaylow[~dfndaylow[1].isin(dfstop1.drop_duplicates().tolist())]

          8. #### the negation operator in pandas is ~


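The boolean-indexing patterns above, in one runnable sketch (the code/Per columns are made-up sample data):

```python
import pandas as pd

df = pd.DataFrame({'code': ['002407', '002466', '600000'],
                   'Per':  ['1.5', '-', '2.0']})

matched = df.loc[df.code.isin(['002407'])]    # rows whose code is in the list
not_matched = df[~df.code.isin(['002407'])]   # ~ negates a boolean mask
df.loc[df.Per == '-', 'Per'] = 0              # conditional assignment
```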
    7. Accessing a MultiIndex
      1. Examples:
        1. arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
        2.           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
        3. df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
        4. df.ix[('qux', 'one'), 0] = 3.5
        5. df.ix[('qux', 'one'), (0, 1, 2, 3)] = [3.5, 3.6, 3.7, 3.8]
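.ix was removed in pandas 1.0; the same MultiIndex writes work with .loc. A sketch using the arrays above:

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

# address a cell, then a whole row, by the (level0, level1) label tuple
df.loc[('qux', 'one'), 0] = 3.5
df.loc[('qux', 'one'), [0, 1, 2, 3]] = [3.5, 3.6, 3.7, 3.8]
```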


    8. Getting a single value
      1. obj['index_label']
      2. dfOne = selDf[start:end]

        # get the first value of each column; .iloc[0] selects by position
        # (.get(0) would look up the index *label* 0, which may not exist after slicing)
        code = dfOne['code'].iloc[0]
        tstamp = dfOne['timestamp'].iloc[0]
        price = float( dfOne['price'].iloc[0] )
        amount = abs( int( dfOne['amount'].iloc[0] ) )

    9. Other operations:
      1. Modifying columns:
        1. frame2['debt'] = 16.5         (if there is no 'debt' column yet, this adds one, filled with 16.5)

        2. frame2['eastern'] = frame2.state == 'Ohio'
        3. (creates a new column whose values are the results of the comparison (frame2.state == 'Ohio'))
        4. del frame2['eastern']        (deletes the column)

      2. Modifying a single value
        1. Prefer the df.loc[row_index, col_indexer] form.
        2. It reduces/avoids view-vs-copy problems caused by the (unpredictable) memory layout.
      3. Transpose (swap row and column labels):
        1. vv = frame.T   (the result is still a DataFrame)

      4. Selecting the head and tail:
        1. df.head(3)                  (the first 3 rows)

        2. df.tail(3)
      5. Basic statistics with the describe() function
        1. df2 = df.describe()   (the result is still a DataFrame, with rows [u'count', u'mean', u'std', u'min', u'25%', u'50%', u'75%', u'max'])
      6. sort functions:
        1. Sort by several columns in order; each column can be ascending or descending
          1. Example: df.sort_values(by=['c1', 'c2'], ascending=[False, True])   ### df.sort was removed in newer pandas
          2. dfNew = dfTable.sort_index(ascending=False) ### sort by index
      7. The groupby function:
        1. Groups rows by the values of one or more columns, usually followed by an aggregation (sum, mean, count, ...)
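A minimal groupby sketch, reusing the state/pop data from the DataFrame-creation example:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                   'pop':   [1.5, 1.7, 2.4]})

# group rows by 'state' and aggregate each group's 'pop'
pop_sum = df.groupby('state')['pop'].sum()
```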
      8. Computing rank values
        1. frame['sum_order'] = frame['sum'].rank()
        2. frame['sum_sq_order'] = frame['sum_sq'].rank()
        3. frame['max_order'] = frame['max'].rank()
        4. frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
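A runnable version of the ranking recipe above (with made-up sum/max columns; the sum_sq column is omitted for brevity):

```python
import pandas as pd

frame = pd.DataFrame({'sum': [3, 1, 2], 'max': [9, 7, 8]})

# rank() assigns 1 to the smallest value by default
frame['sum_order'] = frame['sum'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'max_order']].mean(axis=1)
```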

  • Common operations:
    1. data < 5 ## filter
    2. data[ data < 5 ] = 0 ## replace
    3. df[24] = df[24].str[:10] ## keep only the first 10 characters


    kbdf = buy_df.copy()   # .copy() gives an independent copy; later writes won't touch buy_df



    sdf = sdf.sort_values(by=['date', 'time'])   ### df.sort was removed in newer pandas; use sort_values

    sortedBdf = matchBuyDf.sort_values(by=['amount', 'price'], ascending=[False, True])

    df2 = df.sort_values(by=['timestamp'], ascending=[False])



    # find the index (set) of the rows matching a condition, then match them with the loc function

    t1 = bdf[ (bdf['date'] > lastSellDate) | ( (bdf['date'] == lastSellDate) & (bdf['time'] > lastSellTime) ) ].index

    df.loc[t1, 'status'] = 2

    # for a single row of a DF, use its "name" attribute

    df.loc[sortedBdf.iloc[0].name,'status'] = 1

    df = df[ (df['flag'] == sell_flag) | (df['flag'] == buy_flag) ]



    # select a subset of columns from a DF

    df = df[list(df.columns[:7]) + list(df.columns[10:])]

    df2[ [4 , 10, 14, 18, 22, 23 ] ]



    # iterate over a DF; sdf_row is the current row as a Series

    for sell_df_index, sdf_row in sdf.iterrows():

        sdf_row['price'] # access fields by column name (Series style)
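A self-contained iterrows sketch (the price/amount values are made up):

```python
import pandas as pd

sdf = pd.DataFrame({'price': [10.0, 11.5], 'amount': [100, 200]})

total = 0.0
for sell_df_index, sdf_row in sdf.iterrows():
    # sdf_row is a Series holding one row; fields are accessed by column name
    total += sdf_row['price'] * sdf_row['amount']
```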



    # read the file with pd.read_csv

    # mycols is a user-defined list of column names, including code and amount

    # the dtype parameter specifies column data types

    df = pd.read_csv(pandasInputFile, sep=',', skiprows=5, index_col=False,
                     names=mycols, dtype={'code': str, 'amount': np.int16})

    ## as above, mycols can be a list, and dtype can specify the data type of each column contained in mycols

    df2.to_csv('reffiles/stBasic.txt', sep='|', encoding='utf-8')
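A round-trip sketch of the same read_csv/to_csv pattern, reading from an in-memory io.StringIO instead of a real file (the two-column sample data is made up):

```python
import io
import numpy as np
import pandas as pd

csv_text = "000001,500\n000002,1200\n"
mycols = ['code', 'amount']

# dtype keeps 'code' as a string, so leading zeros survive
df = pd.read_csv(io.StringIO(csv_text), sep=',', names=mycols,
                 dtype={'code': str, 'amount': np.int16})

out = df.to_csv(sep='|', index=False)
```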





    # string operations in a df

    df.loc[ df[1].str.startswith('6') , 1] = 'sh' + df[1].astype(str)

    df2[[4,10]] = df2[[4,10]].applymap(lambda x : x.strip('%') ).apply( pd.to_numeric ) ### strip the trailing percent sign, then convert to numeric (applymap was renamed to DataFrame.map in pandas 2.1)
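Both string patterns in one runnable sketch (sample codes and percentage strings are made up; str.rstrip('%') plus pd.to_numeric replaces the deprecated applymap form):

```python
import pandas as pd

# prefix Shanghai codes (those starting with '6') with 'sh'
df = pd.DataFrame({1: ['600000', '002466']})
df.loc[df[1].str.startswith('6'), 1] = 'sh' + df[1].astype(str)

# strip the trailing '%' from every cell, then convert to numbers
df2 = pd.DataFrame({4: ['5.2%', '3.1%'], 10: ['1.0%', '2.5%']})
df2[[4, 10]] = df2[[4, 10]].apply(lambda col: pd.to_numeric(col.str.rstrip('%')))
```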

    pandas Merge, join, and concatenate

    Merging, joining, and concatenating

    http://pandas.pydata.org/pandas-docs/stable/merging.html

    pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
              keys=None, levels=None, names=None, verify_integrity=False,
              copy=True)
    • objs : a sequence or mapping of Series, DataFrame, or Panel objects. If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
    • axis : {0, 1, …}, default 0. The axis to concatenate along.
    • join : {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer for union and inner for intersection.
    • ignore_index : boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, …, n – 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
    • join_axes : list of Index objects. Specific indexes to use for the other n – 1 axes instead of performing inner/outer set logic.
    • keys : sequence, default None. Construct hierarchical index using the passed keys as the outermost level. If multiple levels passed, should contain tuples.
    • levels : list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
    • names : list, default None. Names for the levels in the resulting hierarchical index.
    • verify_integrity : boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
    • copy : boolean, default True. If False, do not copy data unnecessarily.

    frames  = [ df1, df2, df3 ]

    Direction options for concat:
      1. axis = 0 (the default): rows follow rows
      2. axis = 1: columns follow columns

    Join modes for concat:
      1. join='outer' (the default): take the union, so no data (information) is lost
      2. join='inner': take the intersection, i.e. only the index values present in every df
      3. join_axes=[df1.index]: use the given df's index as the reference; labels inside it are merged, others are dropped (join_axes was removed in pandas 1.0; reindex the result instead)
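The two axis options in a minimal runnable form (df1/df2 are tiny made-up frames):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[2, 3])

rows = pd.concat([df1, df2])          # axis=0 (default): stack rows
cols = pd.concat([df1, df1], axis=1)  # axis=1: place columns side by side
```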

    Append


    Merge joins on column values, not on the index

    pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
             left_index=False, right_index=False, sort=True,
             suffixes=('_x', '_y'), copy=True, indicator=False)
    • left: A DataFrame object
    • right: Another DataFrame object
    • on: Columns (names) to join on. Must be found in both the left and right DataFrame objects. If not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys
    • left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
    • right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
    • left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame
    • right_index: Same usage as left_index for the right DataFrame
    • how: One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method
    • sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases
    • suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to (‘_x’, ‘_y’).
    • copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.
    • indicator: Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame, right_only for observations whose merge key only appears in ‘right’ DataFrame, and both if the observation’s merge key is found in both.

    Example of using the how parameter:

    Example of using the indicator parameter:
       In [50]: pd.merge(df1, df2, on='col1', how='outer', indicator=True)
       Out[50]:
          col1 col_left  col_right      _merge
       0     0        a        NaN   left_only
       1     1        b        2.0        both
       2     2      NaN        2.0  right_only
       3     2      NaN        2.0  right_only
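The In[50]/Out[50] example above can be reproduced with this sketch (df1/df2 reconstructed from the shown output):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2.0, 2.0, 2.0]})

# indicator=True adds a _merge column telling where each row came from
merged = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
```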

    Join merges selected columns from two dfs whose indexes may differ
