pandas basics

  1. Series
    1. Creation
      1. obj = Series([4, 7, -5, 3])
      2. obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

    2. Getting the index
      1. ind = obj.index (the type of ind is 'pandas.core.index.Int64Index'; newer pandas returns a RangeIndex for the default index)
      2. vvv = ind[0] (the type of vvv is 'numpy.int64'; it can be converted with int())
    3. Getting the values (list)
      1. val = obj.values (the type of val is 'numpy.ndarray')
      2. al = val.tolist() (converts to a Python list; under Python 2 each value is a long)
    4. Getting a single value
      1. obj['index_label']
    5. Other operations:
      1. Filtering:
        1. obj2[obj2 > 0]

      2. Arithmetic:
        1. obj2 * 2

      3. Membership test:
        1. 'b' in obj2                  (returns True if 'b' exists in the index)
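The Series operations above can be sketched end to end (a minimal, self-contained example reusing the obj2 defined earlier):

```python
import pandas as pd

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = obj2[obj2 > 0]   # filtering: keep only values greater than 0
doubled = obj2 * 2          # element-wise arithmetic
has_b = 'b' in obj2         # membership tests the index, not the values
```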

  2. DataFrame
    1. Creation
      1. data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
        frame = DataFrame(data)
      2. dict1 = { \
        'timestamp' : str(timestamp), \
        'guifei' : float(itemList[16]) \
        }
        buyDataList.append(dict1)
        ### use a 'list of dicts' to construct a DataFrame
        buyDf = pd.DataFrame(buyDataList, columns=['timestamp', 'guifei'])
      3. Create a DataFrame from a list of tuples
        ## create df with a list of tuples

        data1 = []
        for item in resList:
            l1 = item.split(',')
            l1.insert(0, str('sz002466'))
            tu1 = tuple(l1)
            data1.append(tu1)

        df = pd.DataFrame(data1, columns=eastTableColumns)
      4. Create a DataFrame from a list of lists
        ## create df with a list of lists
        df4 = pd.DataFrame([list6], columns=eastTableColumns)
      5. Append a list to an existing DataFrame as a Series
        ### append a list to an existing df as a Series
        tt7 = pd.Series(list1, index=df3.columns)   ### the Series index must match the df columns (the original `index=list` shadowed the builtin)
        tt7.name = ('sz002407', '2016-08-02')
        df3 = df3.append(tt7)   ### DataFrame.append was removed in pandas 2.0
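Since DataFrame.append was removed in pandas 2.0, the same row append can be done with pd.concat. A minimal sketch (df3's contents and the row values here are made up for illustration):

```python
import pandas as pd

# a small stand-in for an existing df3
df3 = pd.DataFrame([[1.0, 2.0]], columns=['open', 'close'],
                   index=[('sz002407', '2016-08-01')])

# the Series index must match the DataFrame's columns
tt7 = pd.Series([3.0, 4.0], index=df3.columns)
tt7.name = ('sz002407', '2016-08-02')

# DataFrame.append was removed in pandas 2.0; concat a one-row frame instead
df3 = pd.concat([df3, tt7.to_frame().T])
```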


      6. Create DataFrames with MultiIndexes and combine them
        #### create df by list of list and combine df
        arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
                  np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
        df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

        narr = [np.array(['bar', 'baz', 'foo', 'qux']),
                np.array(['three', 'three', 'three', 'three'])]

        l1 = [1, 3, 5, 7]
        l2 = [2, 3, 4, 5]
        l3 = [5, 7, 5, 6]
        l4 = [24, 4, 6, 7]
        alll = [l1, l2, l3, l4]
        df1 = pd.DataFrame(alll, index=narr)

        frames = [df, df1]
        res = pd.concat(frames)
    2. Getting the index
      1. ind = obj.index (the type of ind is 'pandas.core.index.Int64Index'; newer pandas returns a RangeIndex for the default index)
      2. vvv = ind[0] (the type of vvv is 'numpy.int64'; it can be converted with int())
    3. Setting the index
      1. df1 = df.set_index(['code', 'date'])
    4. Getting a single column (list)
      1. t2 = df1['code'] (the type of t2 is Series)
      2. t3 = df1.code
      3. t4 = df1[['code', 'name']] ### select multiple columns
    5. Getting a subset of rows
      1. df[0:3]
      2. data[data['three'] > 5]
    6. Ways to select rows and columns by index:
      1. .loc
        1. Select rows and columns by index label
        2. Examples:
          1. dfl.loc['20130102':'20130104']     #### select a range of rows

          2. df.loc[:, ['A', 'B']]    #### select all rows of columns A and B
          3. df.loc['20130102':'20130104', ['A', 'B']]    ### specify both rows and columns
      2. .iloc (slices include the start but exclude the end)
        1. Select rows and columns by integer position
        2. Examples:
          1. s1.iloc[:3]       ### the first 3 rows
          2. s1.iloc[3]        ### the 4th row
          3. df1.iloc[1:5, 2:4]     #### 1:5 are row positions, 2:4 are column positions
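A runnable contrast of the two accessors (with made-up labels r1..r4; note that .loc slices are end-inclusive while .iloc slices are end-exclusive):

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [10, 20, 30, 40]},
                  index=['r1', 'r2', 'r3', 'r4'])

by_label = df.loc['r2':'r3', ['A', 'B']]   # label slice: r2 and r3 both included
by_pos = df.iloc[1:3, 0:1]                 # position slice: rows 1,2 / column 0 only
```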
      3. .ix (removed in pandas 1.0; use .loc or .iloc instead)
        1. Select rows and columns by index label or by position
        2. Example:
          1. df.ix[[('bar', 'two'), ('qux', 'one')]]    ### each tuple is a composite (MultiIndex) row label with 2 parts
      4. Boolean indexing
        1. Select rows by a condition
        2. Examples:
          1. df[df.A > 0]
          2. df[df > 0]
          3. df.code.isin(['002407'])
          4. df.loc[df.code.isin(['002407'])]
          5. df[(df.one >= 1) & (df.one …)]      ### combine conditions with &; wrap each condition in parentheses

          6. df1.loc[df1.Per == '-', 'Per'] = 0
          7. dfwithnostop = dfndaylow[~dfndaylow[1].isin(dfstop1.drop_duplicates().tolist())]

          8. #### the negation operator in pandas is ~


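The boolean-indexing patterns above, in one runnable sketch (the code/Per columns are made-up sample data):

```python
import pandas as pd

df = pd.DataFrame({'code': ['002407', '002466', '600000'],
                   'Per':  ['1.5', '-', '2.0']})

matched = df.loc[df.code.isin(['002407'])]    # rows whose code is in the list
not_matched = df[~df.code.isin(['002407'])]   # ~ negates a boolean mask
df.loc[df.Per == '-', 'Per'] = 0              # conditional assignment
```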
    7. Accessing a MultiIndex
      1. Examples:
        1. arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
        2.           np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
        3. df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
        4. df.ix[('qux', 'one'), 0] = 3.5
        5. df.ix[('qux', 'one'), (0, 1, 2, 3)] = [3.5, 3.6, 3.7, 3.8]
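.ix was removed in pandas 1.0; the same MultiIndex writes work with .loc. A sketch using the arrays above:

```python
import numpy as np
import pandas as pd

arrays = [np.array(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux']),
          np.array(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'])]
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)

# address a cell, then a whole row, by the (level0, level1) label tuple
df.loc[('qux', 'one'), 0] = 3.5
df.loc[('qux', 'one'), [0, 1, 2, 3]] = [3.5, 3.6, 3.7, 3.8]
```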


    8. Getting a single value
      1. obj['index_label']
      2. dfOne = selDf[start:end]

        # get the first value of each column; .iloc[0] selects by position
        # (.get(0) would look up the index *label* 0, which may not exist after slicing)
        code = dfOne['code'].iloc[0]
        tstamp = dfOne['timestamp'].iloc[0]
        price = float( dfOne['price'].iloc[0] )
        amount = abs( int( dfOne['amount'].iloc[0] ) )

    9. Other operations:
      1. Modifying columns:
        1. frame2['debt'] = 16.5         (if there is no 'debt' column yet, this adds one, filled with 16.5)

        2. frame2['eastern'] = frame2.state == 'Ohio'
        3. (creates a new column whose values are the results of the comparison (frame2.state == 'Ohio'))
        4. del frame2['eastern']        (deletes the column)

      2. Modifying a single value
        1. Prefer the df.loc[row_index, col_indexer] form.
        2. It reduces/avoids view-vs-copy problems caused by the (unpredictable) memory layout.
      3. Transpose (swap row and column labels):
        1. vv = frame.T   (the result is still a DataFrame)

      4. Selecting the head and tail:
        1. df.head(3)                  (the first 3 rows)

        2. df.tail(3)
      5. Basic statistics with the describe() function
        1. df2 = df.describe()   (the result is still a DataFrame, with rows [u'count', u'mean', u'std', u'min', u'25%', u'50%', u'75%', u'max'])
      6. sort functions:
        1. Sort by several columns in order; each column can be ascending or descending
          1. Example: df.sort_values(by=['c1', 'c2'], ascending=[False, True])   ### df.sort was removed in newer pandas
          2. dfNew = dfTable.sort_index(ascending=False) ### sort by index
      7. The groupby function:
        1. Groups rows by the values of one or more columns, usually followed by an aggregation (sum, mean, count, ...)
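A minimal groupby sketch, reusing the state/pop data from the DataFrame-creation example:

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Nevada'],
                   'pop':   [1.5, 1.7, 2.4]})

# group rows by 'state' and aggregate each group's 'pop'
pop_sum = df.groupby('state')['pop'].sum()
```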
      8. Computing rank values
        1. frame['sum_order'] = frame['sum'].rank()
        2. frame['sum_sq_order'] = frame['sum_sq'].rank()
        3. frame['max_order'] = frame['max'].rank()
        4. frame['mean_order'] = frame[['sum_order', 'sum_sq_order', 'max_order']].mean(axis=1)
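A runnable version of the ranking recipe above (with made-up sum/max columns; the sum_sq column is omitted for brevity):

```python
import pandas as pd

frame = pd.DataFrame({'sum': [3, 1, 2], 'max': [9, 7, 8]})

# rank() assigns 1 to the smallest value by default
frame['sum_order'] = frame['sum'].rank()
frame['max_order'] = frame['max'].rank()
frame['mean_order'] = frame[['sum_order', 'max_order']].mean(axis=1)
```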

  • Common operations:
    1. data < 5 ## filter
    2. data[ data < 5 ] = 0 ## replace
    3. df[24] = df[24].str[:10] ## keep only the first 10 characters


    kbdf = buy_df.copy()   # .copy() gives an independent copy; later writes won't touch buy_df



    sdf = sdf.sort_values(by=['date', 'time'])   ### df.sort was removed in newer pandas; use sort_values

    sortedBdf = matchBuyDf.sort_values(by=['amount', 'price'], ascending=[False, True])

    df2 = df.sort_values(by=['timestamp'], ascending=[False])



    # find the index (set) of the rows matching a condition, then match them with the loc function

    t1 = bdf[ (bdf['date'] > lastSellDate) | ( (bdf['date'] == lastSellDate) & (bdf['time'] > lastSellTime) ) ].index

    df.loc[t1, 'status'] = 2

    # for a single row of a DF, use its "name" attribute

    df.loc[sortedBdf.iloc[0].name,'status'] = 1

    df = df[ (df['flag'] == sell_flag) | (df['flag'] == buy_flag) ]



    # select a subset of columns from a DF

    df = df[list(df.columns[:7]) + list(df.columns[10:])]

    df2[ [4 , 10, 14, 18, 22, 23 ] ]



    # iterate over a DF; sdf_row is the current row as a Series

    for sell_df_index, sdf_row in sdf.iterrows():

        sdf_row['price'] # access fields by column name (Series style)
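A self-contained iterrows sketch (the price/amount values are made up):

```python
import pandas as pd

sdf = pd.DataFrame({'price': [10.0, 11.5], 'amount': [100, 200]})

total = 0.0
for sell_df_index, sdf_row in sdf.iterrows():
    # sdf_row is a Series holding one row; fields are accessed by column name
    total += sdf_row['price'] * sdf_row['amount']
```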



    # read the file with pd.read_csv

    # mycols is a user-defined list of column names, including code and amount

    # the dtype parameter specifies column data types

    df = pd.read_csv(pandasInputFile, sep=',', skiprows=5, index_col=False,
                     names=mycols, dtype={'code': str, 'amount': np.int16})

    ## as above, mycols can be a list, and dtype can specify the data type of each column contained in mycols

    df2.to_csv('reffiles/stBasic.txt', sep='|', encoding='utf-8')
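A round-trip sketch of the same read_csv/to_csv pattern, reading from an in-memory io.StringIO instead of a real file (the two-column sample data is made up):

```python
import io
import numpy as np
import pandas as pd

csv_text = "000001,500\n000002,1200\n"
mycols = ['code', 'amount']

# dtype keeps 'code' as a string, so leading zeros survive
df = pd.read_csv(io.StringIO(csv_text), sep=',', names=mycols,
                 dtype={'code': str, 'amount': np.int16})

out = df.to_csv(sep='|', index=False)
```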





    # string operations in a df

    df.loc[ df[1].str.startswith('6') , 1] = 'sh' + df[1].astype(str)

    df2[[4,10]] = df2[[4,10]].applymap(lambda x : x.strip('%') ).apply( pd.to_numeric ) ### strip the trailing percent sign, then convert to numeric (applymap was renamed to DataFrame.map in pandas 2.1)
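Both string patterns in one runnable sketch (sample codes and percentage strings are made up; str.rstrip('%') plus pd.to_numeric replaces the deprecated applymap form):

```python
import pandas as pd

# prefix Shanghai codes (those starting with '6') with 'sh'
df = pd.DataFrame({1: ['600000', '002466']})
df.loc[df[1].str.startswith('6'), 1] = 'sh' + df[1].astype(str)

# strip the trailing '%' from every cell, then convert to numbers
df2 = pd.DataFrame({4: ['5.2%', '3.1%'], 10: ['1.0%', '2.5%']})
df2[[4, 10]] = df2[[4, 10]].apply(lambda col: pd.to_numeric(col.str.rstrip('%')))
```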

    pandas Merge, join, and concatenate

    Merging, joining, and concatenating

    http://pandas.pydata.org/pandas-docs/stable/merging.html

    pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
              keys=None, levels=None, names=None, verify_integrity=False,
              copy=True)
    • objs : a sequence or mapping of Series, DataFrame, or Panel objects. If a dict is passed, the sorted keys will be used as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.
    • axis : {0, 1, …}, default 0. The axis to concatenate along.
    • join : {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer for union and inner for intersection.
    • ignore_index : boolean, default False. If True, do not use the index values on the concatenation axis. The resulting axis will be labeled 0, …, n – 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.
    • join_axes : list of Index objects. Specific indexes to use for the other n – 1 axes instead of performing inner/outer set logic.
    • keys : sequence, default None. Construct hierarchical index using the passed keys as the outermost level. If multiple levels passed, should contain tuples.
    • levels : list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.
    • names : list, default None. Names for the levels in the resulting hierarchical index.
    • verify_integrity : boolean, default False. Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.
    • copy : boolean, default True. If False, do not copy data unnecessarily.

    frames  = [ df1, df2, df3 ]

    Direction options for concat:
      1. axis = 0 (the default): rows follow rows
      2. axis = 1: columns follow columns

    Join modes for concat:
      1. join='outer' (the default): take the union, so no data (information) is lost
      2. join='inner': take the intersection, i.e. only the index values present in every df
      3. join_axes=[df1.index]: use the given df's index as the reference; labels inside it are merged, others are dropped (join_axes was removed in pandas 1.0; reindex the result instead)
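The two axis options in a minimal runnable form (df1/df2 are tiny made-up frames):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'A': [3, 4]}, index=[2, 3])

rows = pd.concat([df1, df2])          # axis=0 (default): stack rows
cols = pd.concat([df1, df1], axis=1)  # axis=1: place columns side by side
```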

    Append


    Merge joins on column values, not on the index

    pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
             left_index=False, right_index=False, sort=True,
             suffixes=('_x', '_y'), copy=True, indicator=False)
    • left: A DataFrame object
    • right: Another DataFrame object
    • on: Columns (names) to join on. Must be found in both the left and right DataFrame objects. If not passed and left_index and right_index are False, the intersection of the columns in the DataFrames will be inferred to be the join keys
    • left_on: Columns from the left DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
    • right_on: Columns from the right DataFrame to use as keys. Can either be column names or arrays with length equal to the length of the DataFrame
    • left_index: If True, use the index (row labels) from the left DataFrame as its join key(s). In the case of a DataFrame with a MultiIndex (hierarchical), the number of levels must match the number of join keys from the right DataFrame
    • right_index: Same usage as left_index for the right DataFrame
    • how: One of ‘left’, ‘right’, ‘outer’, ‘inner’. Defaults to inner. See below for more detailed description of each method
    • sort: Sort the result DataFrame by the join keys in lexicographical order. Defaults to True, setting to False will improve performance substantially in many cases
    • suffixes: A tuple of string suffixes to apply to overlapping columns. Defaults to (‘_x’, ‘_y’).
    • copy: Always copy data (default True) from the passed DataFrame objects, even when reindexing is not necessary. Cannot be avoided in many cases but may improve performance / memory usage. The cases where copying can be avoided are somewhat pathological but this option is provided nonetheless.
    • indicator: Add a column to the output DataFrame called _merge with information on the source of each row. _merge is Categorical-type and takes on a value of left_only for observations whose merge key only appears in ‘left’ DataFrame, right_only for observations whose merge key only appears in ‘right’ DataFrame, and both if the observation’s merge key is found in both.

    Example of using the how parameter:

    Example of using the indicator parameter:
       In [50]: pd.merge(df1, df2, on='col1', how='outer', indicator=True)
       Out[50]:
          col1 col_left  col_right      _merge
       0     0        a        NaN   left_only
       1     1        b        2.0        both
       2     2      NaN        2.0  right_only
       3     2      NaN        2.0  right_only
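The In[50]/Out[50] example above can be reproduced with this sketch (df1/df2 reconstructed from the shown output):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [0, 1], 'col_left': ['a', 'b']})
df2 = pd.DataFrame({'col1': [1, 2, 2], 'col_right': [2.0, 2.0, 2.0]})

# indicator=True adds a _merge column telling where each row came from
merged = pd.merge(df1, df2, on='col1', how='outer', indicator=True)
```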

    Join merges selected columns from two dfs whose indexes may differ
