The DataFeed Module

I. Data Feed

1. Data Feed Overview

  • The Feeds module in Backtrader provides flexible data loading and processing. It supports multiple data sources and formats, and one or more instruments can be added.

  • A Feed is a data-source object that supplies time-series data to strategies, such as a stock's open, close and volume. Each Feed object represents one data source, which can be a local CSV file or a live stock-data stream. Backtrader ships with several commonly used Feed classes:

    • GenericCSVData: loads generic CSV-format data.
    • YahooFinanceData: downloads data from Yahoo Finance.
    • PandasData: loads data from a Pandas DataFrame.
    • IBData: fetches live data from the Interactive Brokers API.
  • PandasData is loaded as follows:

     import backtrader as bt
     import pandas as pd
     from datetime import datetime

     # read OHLCV data with the date column parsed as a DatetimeIndex
     stock_hfq_df = pd.read_csv("../data/sh000300.csv", index_col='date', parse_dates=True)
     start_date = datetime(2021, 9, 1)    # backtest start date
     end_date = datetime(2021, 9, 30)     # backtest end date
     data = bt.feeds.PandasData(dataname=stock_hfq_df, fromdate=start_date, todate=end_date)  # build the data feed
    
  • Pandas makes data preprocessing convenient and is the de facto format for quant data. For easy, direct access to the data later on, PandasData is used throughout the rest of this article.
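
  • For instance, a minimal sketch of feeding this data into Cerebro (strategy setup omitted; the feed name 'sh000300' is just an example):

     cerebro = bt.Cerebro()
     cerebro.adddata(data, name='sh000300')  # register the feed; the name can later be used with getdatabyname
     cerebro.run()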

2. Data Storage

  • Backtrader treats each instrument's data as a table over a time dimension and an indicator dimension. self.datas collects the data of multiple instruments, forming a three-dimensional data source with: a table dimension, a time dimension, and an indicator dimension.


  • self.datas in Data Feeds is a list; each Data Feed is a table with a time dimension and an indicator dimension. Feeds are indexed in the order they were added: the first feed imported gets index 0, and so on.

  • The table dimension is the list itself, collecting all added instruments. Each instrument is a table made of the time and indicator dimensions, accessed via self.datas[N].

  • The indicator dimension holds the indicators used during the backtest; besides the common ones, custom indicators can be added. Indicator values are accessed via self.datas[N].lines.xxx[M], and all line names can be listed with self.data.lines.getlinealiases() (see the sketch at the end of this section).

    Field         Type    Description
    datetime      float   date; to print it, use self.data.datetime.date(0)
    open          float   opening price
    high          float   high price
    low           float   low price
    close         float   closing price
    volume        float   volume
    openinterest  float   open interest
    (extended)    --      custom or extended indicators, e.g. pe, pb
  • The time dimension is the backtest window, between fromdate and todate; it is accessed via self.datas[N].lines.datetime.date(M).

  • The number and order of indicator columns in the imported table need not match the predefined layout exactly; just tell GenericCSVData, PandasData or PandasDirectData the position of each indicator in the data source, or set the position to -1 if the indicator is absent.
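
  • A minimal strategy sketch that walks the three dimensions (assuming backtrader is imported as bt and two feeds were added):

    class InspectStrategy(bt.Strategy):
        def next(self):
            for i, d in enumerate(self.datas):
                # indicator dimension: every line name of feed i
                print(i, d.lines.getlinealiases())
                # time + indicator dimensions: current date and close
                print(d.lines.datetime.date(0), d.lines.close[0])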

3. Data Indexing

  • self.datas is a list and can be indexed in several ways:

    • Subscript indexing: self.datas[N]; indices 0 to N-1 run forward, -1 to -N run backward.

    • Shorthand indexing: self.dataN (note: data, not datas), with N from 0 to N-1.

    • Name indexing: self.getdatabyname('name'), where name is the table name set via adddata(data_feed, name=code) when the data was added.

    • First feed: self.datas[0] is equivalent to self.data0 and to self.data.

      # access the close line of the first data feed
      self.data.lines.close   # lines can be shortened away: self.data.close
      self.data.lines_close   # lines can be shortened away: self.data_close
      # access the close line of the second data feed
      self.data1.lines.close  # lines can be shortened away: self.data1.close
      self.data1.lines_close  # lines can be shortened away: self.data1_close
      # note: lines may only be omitted on lines reached from self.datas;
      # it cannot be omitted when accessing a line of an indicator
      
  • Dates are indexed via self.datas[N].lines.datetime.date(M); the other fields are indexed via self.datas[N].lines.<field>[M] (open, high, low, close, volume).

  • datetime is stored as a float. To read it, convert with xxx.date(N), or use bt.num2date() to turn the float into a datetime of the form "xxxx-xx-xx xx:xx:xx".
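
  • For example, inside next() the following expressions all describe the same bar time (a sketch):

    dt_float = self.data.datetime[0]        # raw float storage
    dt_date = self.data.datetime.date(0)    # converted to a datetime.date
    dt_full = bt.num2date(dt_float)         # full datetime, e.g. 2021-09-01 00:00:00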

4. Slicing

  • Slices of a line are taken with the get method:

    self.data1.lines.close.get(ago=N, size=M)
    
    • ago: index at which the slice ends
    • size: slice length
    • return value: array [close[N-(M-1)], ..., close[N-1], close[N]]
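
  • For example, averaging the last 5 closes inside next() (a sketch; the guard avoids slicing before 5 bars exist):

    if len(self.data) >= 5:
        closes = self.data.close.get(ago=0, size=5)  # [close[-4], ..., close[-1], close[0]]
        print(sum(closes) / len(closes))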

5. Data Flow in a Strategy

  • Total backtest length: N = self.data.buflen()

  • Length backtested so far: len(self.data)

  • Index 0 means different things in __init__() and in next():

    • In __init__(), index 0 refers to the end of the backtest window (todate). __init__ runs only once, which makes it the place for expensive work such as computing indicators and buy/sell signals, preparing everything next() will need.
    • In next(), index 0 is the current backtest time. next() runs once per bar along the time dimension, and index 0 always points at the bar currently being processed.
  • In __init__(), index 0 is todate and index 1 is fromdate; both forward and backward access are supported:

    • forward indices: 1, 2, ..., N
    • backward indices: 0, -1, -2, ..., -(N-1)
  • In next(), index 0 is always the current time node and keeps moving as the backtest loops along the time dimension. backward is what has already been backtested; forward is what has not been reached yet.

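  • A sketch of the two lengths and the moving index 0 (progress printed on every bar):

    class ProgressStrategy(bt.Strategy):
        def __init__(self):
            # runs once; with preloaded data buflen() is the full backtest length
            print('total bars:', self.data.buflen())

        def next(self):
            # len(self.data) grows by one per bar; index 0 is always "now"
            print(len(self.data), self.data.datetime.date(0), self.data.close[0])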

6. Custom Data-Loading Classes

  • Passing parameters to describe the column layout on every load is tedious. Instead, define a custom loader by subclassing GenericCSVData or PandasData and setting the parameters once in the new class:

    import datetime
    import backtrader as bt

    class My_PandasData(bt.feeds.PandasData):
        params = (
            ('fromdate', datetime.datetime(2019, 1, 2)),
            ('todate', datetime.datetime(2021, 1, 28)),
            ('nullvalue', 0.0),
            ('dtformat', ('%Y-%m-%d')),
            ('datetime', 0),
            ('time', -1),
            ('high', 3),
            ('low', 4),
            ('open', 2),
            ('close', 5),
            ('volume', 6),
            ('openinterest', -1)
        )
    
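  • Usage then needs no per-call column parameters (a sketch; the CSV path is a placeholder and its layout is assumed to match the column numbers above, with the date in column 0):

    df = pd.read_csv("../data/sh000300.csv", parse_dates=[0])
    data = My_PandasData(dataname=df)
    cerebro.adddata(data)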

II. PandasData

1. PandasData Instantiation

PandasData instantiation
  • PandasData's inheritance chain acquires its metaclass starting from AbstractDataBase. When PandasData is instantiated, MetaBase's __call__ runs first; its code is:

    class MetaBase(type):
        def doprenew(cls, *args, **kwargs):
            return cls, args, kwargs
    
        def donew(cls, *args, **kwargs):
            _obj = cls.__new__(cls, *args, **kwargs)
            return _obj, args, kwargs
    
        def dopreinit(cls, _obj, *args, **kwargs):
            return _obj, args, kwargs
    
        def doinit(cls, _obj, *args, **kwargs):
            _obj.__init__(*args, **kwargs)
            return _obj, args, kwargs
    
        def dopostinit(cls, _obj, *args, **kwargs):
            return _obj, args, kwargs
    
        def __call__(cls, *args, **kwargs):
            cls, args, kwargs = cls.doprenew(*args, **kwargs)
            _obj, args, kwargs = cls.donew(*args, **kwargs)
            _obj, args, kwargs = cls.dopreinit(_obj, *args, **kwargs)
            _obj, args, kwargs = cls.doinit(_obj, *args, **kwargs)
            _obj, args, kwargs = cls.dopostinit(_obj, *args, **kwargs)
            return _obj
    
    • __call__ executes doprenew, donew, dopreinit, doinit and dopostinit in order. At doprenew there is no instance yet, so only cls is passed along; the instance is created in donew, and the later hooks receive the object.

    • doprenew: no MetaBase subclass overrides doprenew, so MetaBase's own doprenew runs.

    • donew: MetaLineSeries overrides donew:

          def donew(cls, *args, **kwargs):
              '''
              Intercept instance creation, take over lines/plotinfo/plotlines
              class attributes by creating corresponding instance variables and add
              aliases for "lines" and the "lines" held within it
              '''
              # _obj.plotinfo shadows the plotinfo (class) definition in the class
              plotinfo = cls.plotinfo()
      
              for pname, pdef in cls.plotinfo._getitems():
                  setattr(plotinfo, pname, kwargs.pop(pname, pdef))
      
              # Create the object and set the params in place
              _obj, args, kwargs = super(MetaLineSeries, cls).donew(*args, **kwargs)
      
              # set the plotinfo member in the class
              _obj.plotinfo = plotinfo
      
              # _obj.lines shadows the lines (class) definition in the class
              _obj.lines = cls.lines()
      
              # _obj.plotinfo shadows the plotinfo (class) definition in the class
              _obj.plotlines = cls.plotlines()
      
              # add aliases for lines and for the lines class itself
              _obj.l = _obj.lines
              if _obj.lines.fullsize():
                  _obj.line = _obj.lines[0]
      
              for l, line in enumerate(_obj.lines):
                  setattr(_obj, 'line_%s' % l, _obj._getlinealias(l))
                  setattr(_obj, 'line_%d' % l, line)
                  setattr(_obj, 'line%d' % l, line)
      
              # Parameter values have now been set before __init__
              return _obj, args, kwargs
      
    • MetaLineSeries calls its parent's donew for the actual instantiation and parameter mapping.

    • MetaLineSeries' parent is MetaLineRoot, whose donew is:

      class MetaLineRoot(metabase.MetaParams):
          '''
          Once the object is created (effectively pre-init) the "owner" of this
          class is sought
          '''
      
          def donew(cls, *args, **kwargs):
              _obj, args, kwargs = super(MetaLineRoot, cls).donew(*args, **kwargs)
      
              # Find the owner and store it
              # startlevel = 4 ... to skip intermediate call stacks
              ownerskip = kwargs.pop('_ownerskip', None)
              _obj._owner = metabase.findowner(_obj,
                                               _obj._OwnerCls or LineMultiple,
                                               skip=ownerskip)
      
              # Parameter values have now been set before __init__
              return _obj, args, kwargs
      
    • MetaLineRoot calls its parent's donew; the parent is MetaParams, which ends up in MetaBase's donew. There the PandasData class is instantiated and its parameters are mapped onto attributes.

    • With the parent-class instantiation done, the Lines are instantiated next.
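
    • The hook order can be illustrated with a toy metaclass of the same shape (illustrative only, not backtrader code):

      class TraceMeta(type):
          # condensed version of MetaBase.__call__: five phases
          def __call__(cls, *args, **kwargs):
              print('doprenew')              # no instance yet, only cls
              obj = cls.__new__(cls)         # donew: the instance is created here
              print('donew')
              print('dopreinit')             # instance exists, __init__ not run yet
              obj.__init__(*args, **kwargs)  # doinit
              print('dopostinit')
              return obj

      class Demo(metaclass=TraceMeta):
          def __init__(self):
              print('Demo.__init__')

      Demo()  # prints: doprenew, donew, dopreinit, Demo.__init__, dopostinit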

Lines instantiation
  • Lines is an ordinary class: instantiation goes through __new__ and initialization through __init__, whose code is:

    class Lines(object):
    
        def __init__(self, initlines=None):
            '''
            Create the lines recording during "_derive" or else use the
            provided "initlines"
            '''
            self.lines = list()
            for line, linealias in enumerate(self._getlines()):
                kwargs = dict()
                self.lines.append(LineBuffer(**kwargs))
    
            # Add the required extralines
            for i in range(self._getlinesextra()):
                if not initlines:
                    self.lines.append(LineBuffer())
                else:
                    self.lines.append(initlines[i])
    
    • The lines container is initialized, and a LineBuffer is instantiated for each line. The initial lines are close, low, high, open, volume and openinterest, plus datetime.
    • Any extra lines are also instantiated as LineBuffer objects.
LineBuffer instantiation
  • LineBuffer also carries the metaclass, so its instantiation is controlled by MetaBase. doprenew runs first; nothing in the LineBuffer hierarchy overrides it, so MetaBase's doprenew is called.

  • The donew step runs MetaLineRoot's donew, already quoted in the previous subsection.

    • MetaLineRoot calls its parent MetaParams' donew, which in turn calls MetaBase's donew to instantiate the LineBuffer class and map its parameters onto attributes.
    • findowner locates the owner that created the LineBuffer instance: the PandasData instance.
  • dopreinit is not overridden anywhere in the LineBuffer hierarchy, so MetaBase's dopreinit runs and simply returns.

  • doinit calls LineBuffer's __init__ method:

        def __init__(self):
            self.lines = [self]
            self.mode = self.UnBounded
            self.bindings = list()
            self.reset()
            self._tz = None
    
    • The buffer first adds itself to lines.
    • Attributes are initialized.
    • reset re-initializes the in-memory storage structure and index.
  • PandasData instantiation is complete.

2. PandasData Initialization

  • After donew finishes the instantiation, dopreinit runs to begin initialization. Among PandasData's ancestors, MetaAbstractDataBase overrides dopreinit:

        def dopreinit(cls, _obj, *args, **kwargs):
            _obj, args, kwargs = \
                super(MetaAbstractDataBase, cls).dopreinit(_obj, *args, **kwargs)
    
            # Find the owner and store it
            _obj._feed = metabase.findowner(_obj, FeedBase)
    
            _obj.notifs = collections.deque()  # store notifications for cerebro
    
            _obj._dataname = _obj.p.dataname
            _obj._name = ''
            return _obj, args, kwargs
    
    • The parent's dopreinit is called, ending in MetaBase's dopreinit.
    • The owner of the PandasData instance is looked up and comes back empty: PandasData initiated the instantiation, so it has no owner.
    • notifs is initialized to store notifications destined for Cerebro.
    • The name _name is set to the empty string.
  • After dopreinit comes doinit, i.e. PandasData's own __init__ function:

        def __init__(self):
            super(PandasData, self).__init__()
    
            # these "colnames" can be strings or numeric types
            colnames = list(self.p.dataname.columns.values)
            if self.p.datetime is None:
                # datetime is expected as index col and hence not returned
                pass
    
            # try to autodetect if all columns are numeric
            cstrings = filter(lambda x: isinstance(x, string_types), colnames)
            colsnumeric = not len(list(cstrings))
    
            # Where each datafield find its value
            self._colmapping = dict()
    
            # Build the column mappings to internal fields in advance
            for datafield in self.getlinealiases():
                defmapping = getattr(self.params, datafield)
    
                if isinstance(defmapping, integer_types) and defmapping < 0:
                    # autodetection requested
                    for colname in colnames:
                        if isinstance(colname, string_types):
                            if self.p.nocase:
                                found = datafield.lower() == colname.lower()
                            else:
                                found = datafield == colname
    
                            if found:
                                self._colmapping[datafield] = colname
                                break
    
                    if datafield not in self._colmapping:
                        # autodetection requested and not found
                        self._colmapping[datafield] = None
                        continue
                else:
                    # all other cases -- used given index
                    self._colmapping[datafield] = defmapping
    
    • PandasData's parent __init__ is called (via MetaAbstractDataBase).
    • The column names of the dataname input are recorded in colnames.
    • The datetime parameter normally need not be given; it indicates which column holds datetime, which is usually the index (column 0).
    • The code checks whether any column names are numeric. Numeric names go into _colmapping directly: when a parameter already specifies the column position, the system maps by number rather than by name.
    • The Pandas DataFrame column names are mapped to PandasData's data fields. The default fields are ['datetime', 'open', 'high', 'low', 'close', 'volume', 'openinterest']. The mapping is stored in the _colmapping dict, e.g.: {'close': 'close', 'low': 'low', 'high': 'high', 'open': 'open', 'volume': 'volume'}.
  • PandasData initialization is complete.
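
  • The resulting mapping can be reproduced with plain pandas (a sketch of the autodetection branch only; nocase handling and numeric overrides omitted):

    import pandas as pd

    df = pd.read_csv("../data/sh000300.csv", index_col='date', parse_dates=True)
    fields = ['datetime', 'open', 'high', 'low', 'close', 'volume', 'openinterest']

    # default parameter is -1 for the OHLCV fields: autodetect by column name
    colmapping = {f: (f if f in df.columns else None) for f in fields}
    print(colmapping)  # e.g. {'close': 'close', ..., 'openinterest': None, 'datetime': None}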

3. PandasData Data Loading

  • If no preloaded data was supplied when Cerebro was set up, Cerebro preloads the data during run (in Cerebro's runstrategies function). Before preloading, each data feed must be reset. In PandasData's hierarchy, Lines implements reset by iterating over its lines (the LineBuffer instances for close, low, high, open, volume, openinterest and datetime) and resetting each one. reset essentially initializes an array.array that will store the values.

    if not predata:
        for data in self.datas:
            data.reset()
            if self._exactbars < 1:  # datas can be full length
                data.extend(size=self.params.lookahead)
                data._start()
                if self._dopreload:
                    data.preload()
    
  • The _start function is defined in PandasData's parent class AbstractDataBase; it calls start() and, on the first run, _start_finish() (quoted in full further below):

        def _start(self):
            self.start()

            if not self._started:
                self._start_finish()
    
  • The start method is defined on the PandasData class itself:

        def start(self):
            super(PandasData, self).start()
    
            # reset the length with each start
            self._idx = -1
    
            # Transform names (valid for .ix) into indices (good for .iloc)
            if self.p.nocase:
                colnames = [x.lower() for x in self.p.dataname.columns.values]
            else:
                colnames = [x for x in self.p.dataname.columns.values]
    
            for k, v in self._colmapping.items():
                if v is None:
                    continue  # special marker for datetime
                if isinstance(v, string_types):
                    try:
                        if self.p.nocase:
                            v = colnames.index(v.lower())
                        else:
                            v = colnames.index(v)
                    except ValueError as e:
                        defmap = getattr(self.params, k)
                        if isinstance(defmap, integer_types) and defmap < 0:
                            v = None
                        else:
                            raise e  # let user now something failed
    
                self._colmapping[k] = v
    
    • The parent's start method is called.
    • The index is initialized to -1, so the first +1 yields the starting index 0.
    • colnames stores the column names of the original pandas.DataFrame.
    • During initialization _colmapping held the original column names for each PandasData field; start replaces them with column indices: {'close': 4, 'low': 3, 'high': 2, 'open': 1, 'volume': 5, 'openinterest': None, 'datetime': None}. The last two have no matching column; datetime is missing from colnames because the date serves directly as the DataFrame index.
  • The parent start methods propagate up the hierarchy and finally reach the start defined in AbstractDataBase:

    def start(self):
        self._barstack = collections.deque()
        self._barstash = collections.deque()
        self._laststatus = self.CONNECTED
    
  • The _start_finish function is defined in PandasData's parent class AbstractDataBase as follows:

        def _start_finish(self):
            # A live feed (for example) may have learnt something about the
            # timezones after the start and that's why the date/time related
            # parameters are converted at this late stage
            # Get the output timezone (if any)
            self._tz = self._gettz()
            # Lines have already been create, set the tz
            self.lines.datetime._settz(self._tz)
    
            # This should probably be also called from an override-able method
            self._tzinput = bt.utils.date.Localizer(self._gettzinput())
    
            # Convert user input times to the output timezone (or min/max)
            if self.p.fromdate is None:
                self.fromdate = float('-inf')
            else:
                self.fromdate = self.date2num(self.p.fromdate)
    
            if self.p.todate is None:
                self.todate = float('inf')
            else:
                self.todate = self.date2num(self.p.todate)
    
            # FIXME: These two are never used and could be removed
            self.sessionstart = time2num(self.p.sessionstart)
            self.sessionend = time2num(self.p.sessionend)
    
            self._calendar = cal = self.p.calendar
            if cal is None:
                self._calendar = self._env._tradingcal
            elif isinstance(cal, string_types):
                self._calendar = PandasMarketCalendar(calendar=cal)
    
            self._started = True
    
    • The data's time zone is set, along with the time zone of the datetime line.
    • The user-supplied times are converted to numeric form: midnight of January 1 of year 1 counts as 1, each day adds 1, and partial days are fractional (noon is .5). Turning each time into a unique number makes the data fast to process.
    • The calendar information is recorded in _calendar.
    • start is marked as finished.
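
    • A round-trip sketch of this numeric form (date2num/num2date live in backtrader.utils.date):

      from datetime import datetime
      from backtrader.utils.date import date2num, num2date

      num = date2num(datetime(2021, 9, 1, 12, 0))
      # integer part: days since 0001-01-01; fractional part: time of day
      print(num % 1)         # 0.5 -> noon
      print(num2date(num))   # 2021-09-01 12:00:00
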
  • The preload function is defined in PandasData's parent class AbstractDataBase; preload calls load in a loop to pull in the data:

    def preload(self):
        while self.load():
            pass
    
        self._last()
        self.home()
            
    def load(self):
        while True:
            # move data pointer forward for new bar
            self.forward()

            if self._fromstack():  # bar is available
                return True

            if not self._fromstack(stash=True):
                _loadret = self._load()
                if not _loadret:  # no bar use force to make sure in exactbars
                    # the pointer is undone this covers especially (but not
                    # uniquely) the case in which the last bar has been seen
                    # and a backwards would ruin pointer accounting in the
                    # "stop" method of the strategy
                    self.backwards(force=True)  # undo data pointer

                    # return the actual returned value which may be None to
                    # signal no bar is available, but the data feed is not
                    # done. False means game over
                    return _loadret

            # Get a reference to current loaded time
            dt = self.lines.datetime[0]

            # A bar has been loaded, adapt the time
            if self._tzinput:
                # Input has been converted at face value but it's not UTC in
                # the input stream
                dtime = num2date(dt)  # get it in a naive datetime
                # localize it
                dtime = self._tzinput.localize(dtime)  # pytz compatible-ized
                self.lines.datetime[0] = dt = date2num(dtime)  # keep UTC val

            # Check standard date from/to filters
            if dt < self.fromdate:
                # discard loaded bar and carry on
                self.backwards()
                continue
            if dt > self.todate:
                # discard loaded bar and break out
                self.backwards(force=True)
                break

            # Pass through filters
            retff = False
            for ff, fargs, fkwargs in self._filters:
                # previous filter may have put things onto the stack
                if self._barstack:
                    for i in range(len(self._barstack)):
                        self._fromstack(forward=True)
                        retff = ff(self, *fargs, **fkwargs)
                else:
                    retff = ff(self, *fargs, **fkwargs)

                if retff:  # bar removed from systemn
                    break  # out of the inner loop

            if retff:  # bar removed from system - loop to get new bar
                continue  # in the greater loop

            # Checks let the bar through ... notify it
            return True

        # Out of the loop ... no more bars or past todate
        return False
    
    • forward is called to advance the pointer.
    • _fromstack tries to fetch a bar from _barstack or _barstash; nothing can be fetched here because both were created empty in _start.
    • Once a bar is loaded, if an input time zone was supplied, the time is localized and the datetime line is updated with the new numeric date.
    • If the bar's date is earlier than the fromdate parameter or later than the todate parameter, backwards is called to discard it.
  • The forward function is defined in PandasData's parent class LineSeries:

    def forward(self, value=NAN, size=1):
        '''
            Proxy line operation
            '''
        for line in self.lines:
            line.forward(value, size=size)
    
  • LineBuffer's forward function is implemented as follows:

    def forward(self, value=NAN, size=1):
        ''' Moves the logical index foward and enlarges the buffer as much as needed
    
            Keyword Args:
                value (variable): value to be set in new positins
                size (int): How many extra positions to enlarge the buffer
            '''
        self.idx += size
        self.lencount += size
    
        for i in range(size):
            self.array.append(value)
    
    • The index is advanced by 1 (the default stride). idx starts at -1, so the first forward call makes it 0.
    • The length is increased by 1.
    • NAN placeholders (the initial value) are appended to the array.
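
    • The pointer arithmetic can be pictured with a stripped-down buffer (illustrative only; the real LineBuffer adds modes, bindings and time zones):

      class TinyLineBuffer:
          def __init__(self):
              self.array, self.idx, self.lencount = [], -1, 0

          def forward(self, value=float('nan'), size=1):
              self.idx += size            # logical index moves to the new slot
              self.lencount += size
              self.array.extend([value] * size)

          def __getitem__(self, ago):     # [0] is the current bar
              return self.array[self.idx + ago]

      buf = TinyLineBuffer()
      buf.forward()                       # idx: -1 -> 0, array: [nan]
      buf.array[buf.idx] = 10.5           # what _load does via line[0] = value
      print(buf[0])                       # 10.5
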
  • The PandasData class overrides the _load function:

        def _load(self):
            self._idx += 1
    
            if self._idx >= len(self.p.dataname):
                # exhausted all rows
                return False
    
            # Set the standard datafields
            for datafield in self.getlinealiases():
                if datafield == 'datetime':
                    continue
    
                colindex = self._colmapping[datafield]
                if colindex is None:
                    # datafield signaled as missing in the stream: skip it
                    continue
    
                # get the line to be set
                line = getattr(self.lines, datafield)
    
                # indexing for pandas: 1st is colum, then row
                line[0] = self.p.dataname.iloc[self._idx, colindex]
    
            # datetime conversion
            coldtime = self._colmapping['datetime']
    
            if coldtime is None:
                # standard index in the datetime
                tstamp = self.p.dataname.index[self._idx]
            else:
                # it's in a different column ... use standard column index
                tstamp = self.p.dataname.iloc[self._idx, coldtime]
    
            # convert to float via datetime and store it
            dt = tstamp.to_pydatetime()
            dtnum = date2num(dt)
            self.lines.datetime[0] = dtnum
    
            # Done ... return
            return True
    
    • The index is advanced first, starting from 0. Once it exceeds the number of rows in the original data, loading is finished.
    • For each line alias of the data (initially close, low, high, open, volume, openinterest), the corresponding column number is looked up in _colmapping and the value from that column is written into the line's array.array.
    • Finally datetime is handled: it usually serves as the index (the first column), so the value is taken from there, converted with date2num and stored in the data's datetime line.
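
    • After preloading, the filled lines can be inspected directly (a sketch; assumes a Cerebro run with preloading enabled and stock_hfq_df from the earlier example):

      data = bt.feeds.PandasData(dataname=stock_hfq_df)
      cerebro = bt.Cerebro()
      cerebro.adddata(data)
      cerebro.run()  # preloads: every line now holds one float per DataFrame row

      print(len(data.close.array))                # number of loaded bars (plus any lookahead padding)
      print(bt.num2date(data.datetime.array[0]))  # first bar's timestamp
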
  • backwards on LineSeries proxies straight through to LineBuffer's backwards function:

        def backwards(self, size=1, force=False):
            ''' Moves the logical index backwards and reduces the buffer as much as needed
    
            Keyword Args:
                size (int): How many extra positions to rewind and reduce the
                buffer
            '''
            # Go directly to property setter to support force
            self.set_idx(self._idx - size, force=force)
            self.lencount -= size
            for i in range(size):
                self.array.pop()
    
    • idx is rewound first; when the first bar was being added it was 0, so it goes back to -1.
    • The length is reduced by the rewind stride.
    • The most recently appended values are popped off.
  • When Cerebro calls preload in its runstrategies function, one condition must hold:

    if self._dopreload:
        data.preload()

  • Two situations prevent preloading:

    • the data sources include live data;
    • the data sources include resampled or replayed data.
  • In those cases the data is loaded inside next instead.

4. Resampling

Resampling
  • Resampling re-samples fine-grained data into coarser-grained data, for example turning daily bars into weekly bars.

  • Cerebro's resampledata is defined as follows:

    def resampledata(self, dataname, name=None, **kwargs):
        '''
        Adds a ``Data Feed`` to be resample by the system

        If ``name`` is not None it will be put into ``data._name`` which is
        meant for decoration/plotting purposes.

        Any other kwargs like ``timeframe``, ``compression``, ``todate`` which
        are supported by the resample filter will be passed transparently
        '''
        if any(dataname is x for x in self.datas):
            dataname = dataname.clone()

        dataname.resample(**kwargs)
        self.adddata(dataname, name=name)
        self._doreplay = True

        return dataname
    
    • If the data passed in is one already added to Cerebro, an identical clone of it is made.
    • The data's resample function is then called.
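
    • Typical usage keeps the original daily feed and adds a weekly view of the same data (a sketch):

      cerebro.adddata(data)                  # original daily bars
      cerebro.resampledata(data, timeframe=bt.TimeFrame.Weeks,
                           compression=1)    # cloned internally and resampled to weekly
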
  • The data's resample method is defined in AbstractDataBase:

    def resample(self, **kwargs):
        self.addfilter(Resampler, **kwargs)

    def replay(self, **kwargs):
        self.addfilter(Replayer, **kwargs)

    def addfilter(self, p, *args, **kwargs):
        if inspect.isclass(p):
            pobj = p(self, *args, **kwargs)
            self._filters.append((pobj, [], {}))

            if hasattr(pobj, 'last'):
                self._ffilters.append((pobj, [], {}))

        else:
            self._filters.append((p, args, kwargs))
    
    • addfilter registers the filter class (Resampler); its input may be either a class or an instance.
    • The data added through resampling is identical to the original data, except that a Resampler object is attached to process bars as they are loaded.
Resampler instantiation
  • Resampler inherits from _BaseResampler, which is built with the metaclass MetaParams; MetaParams inherits from MetaBase (shown earlier). Instantiating Resampler therefore also runs MetaBase's doprenew and MetaParams' donew, and the donew override in MetaParams maps the parameters onto attributes and completes the instantiation. The relevant skeleton:

    class MetaParams(MetaBase):
        def __new__(meta, name, bases, dct):
            ...  # gathers params declarations across the class hierarchy

        def donew(cls, *args, **kwargs):
            ...  # instantiates params and maps them onto the new object

    class _BaseResampler(with_metaclass(metabase.MetaParams, object)):
        params = (
            ('bar2edge', True),
            ('adjbartime', True),
            ('rightedge', True),
            ('boundoff', 0),
    
            ('timeframe', TimeFrame.Days),
            ('compression', 1),
    
            ('takelate', True),
    
            ('sessionend', True),
        )
    
        def __init__(self, data):
            self.subdays = TimeFrame.Ticks < self.p.timeframe < TimeFrame.Days
            self.subweeks = self.p.timeframe < TimeFrame.Weeks
            self.componly = (not self.subdays and
                             data._timeframe == self.p.timeframe and
                             not (self.p.compression % data._compression))
    
            self.bar = _Bar(maxdate=True)  # bar holder
            self.compcount = 0  # count of produced bars to control compression
            self._firstbar = True
            self.doadjusttime = (self.p.bar2edge and self.p.adjbartime and
                                 self.subweeks)
    
            self._nexteos = None
    
            # Modify data information according to own parameters
            data.resampling = 1
            data.replaying = self.replaying
            data._timeframe = self.p.timeframe
            data._compression = self.p.compression
    
            self.data = data
            
    class Resampler(_BaseResampler):
        params = (
            ('bar2edge', True),
            ('adjbartime', True),
            ('rightedge', True),
        )
        replaying = False

        def last(self, data):
            ...  # delivers any pending bar when the feed ends

        def __call__(self, data, fromcheck=False, forcedata=None):
            ...  # applied as a filter to each loaded bar
    
Resampler initialization
  • After instantiation comes initialization. _BaseResampler's __init__ mainly initializes the parameters and binds the resampler to the data object. The parameters are:

    • adjbartime (default True): use the boundary time as the timestamp of the resampled bar rather than the last timestamp seen. With a 5-second target granularity the time is adjusted to hh:mm:05 even if the last bar inside the window was stamped hh:mm:04.33.
    • bar2edge (default True): resample toward time boundaries. Resampling ticks at a 5-second granularity produces bars aligned at xx:00, xx:05, xx:10.
    • boundoff (default 0): shift the resampling window forward by this many source bars. When compressing 1-minute bars into 15-minute bars the default window runs from 00:01:00 to 00:15:00 (fifteen 1-minute bars per 15-minute bar); with boundoff=1 the window moves one bar earlier, from 00:00:00 to 00:14:00.
    • compression (default 1): compression ratio; compression=2 means two fine-grained bars are compressed into one target bar.
    • rightedge (default True): use the right edge of the time boundary as the resampled timestamp. With a 5-second target, seconds falling in hh:mm:00-hh:mm:04 are stamped hh:mm:00 (the window start) when False, and hh:mm:05 (the window end) when True.
Resampler data loading
  • The Resampler's data loading resembles that of an ordinary data feed, except that what gets loaded is a DataClone. runstrategies calls the _start function that DataClone overrides:

    class DataClone(AbstractDataBase):
        _clone = True
    
        def __init__(self):
            self.data = self.p.dataname
            self._dataname = self.data._dataname
    
            # Copy date/session parameters
            self.p.fromdate = self.p.fromdate
            self.p.todate = self.p.todate
            self.p.sessionstart = self.data.p.sessionstart
            self.p.sessionend = self.data.p.sessionend
    
            self.p.timeframe = self.data.p.timeframe
            self.p.compression = self.data.p.compression
    
        def _start(self):
            # redefine to copy data bits from guest data
            self.start()
    
            # Copy tz infos
            self._tz = self.data._tz
            self.lines.datetime._settz(self._tz)
    
            self._calendar = self.data._calendar
    
            # input has already been converted by guest data
            self._tzinput = None  # no need to further converr
    
            # Copy dates/session infos
            self.fromdate = self.data.fromdate
            self.todate = self.data.todate
    
            # FIXME: if removed from guest, remove here too
            self.sessionstart = self.data.sessionstart
            self.sessionend = self.data.sessionend
    
        def start(self):
            super(DataClone, self).start()
            self._dlen = 0
            self._preloading = False
    
    • DataClone's parent AbstractDataBase start is called.
    • The lines' time zone is set.
    • The start and end dates are recorded.
  • DataClone's next is called, which in turn calls DataClone's _load function:

    def _load(self):
        # assumption: the data is in the system
        # simply copy the lines
        if self._preloading:
            # data is preloaded, we are preloading too, can move
            # forward until have full bar or data source is exhausted
            self.data.advance()
            if len(self.data) > self.data.buflen():
                return False

            for line, dline in zip(self.lines, self.data.lines):
                line[0] = dline[0]

            return True

        # Not preloading
        if not (len(self.data) > self._dlen):
            # Data not beyond last seen bar
            return False

        self._dlen += 1

        for line, dline in zip(self.lines, self.data.lines):
            line[0] = dline[0]

        return True
    
Resampling use cases
  • In the Backtrader framework, resampling adapts data to different time intervals or frequencies.
    • Time-interval conversion: the original interval may not suit the strategy or analysis. Data fetched per minute can be resampled to a longer interval, such as hourly or daily, with cerebro.resampledata().
    • Data smoothing: resampling to a larger interval reduces volatility in the series, making market trends and patterns easier to observe.
    • Cross-market alignment: when a strategy trades several markets whose data arrive at different intervals, cerebro.resampledata() can align them.
    • Prediction models: strategies built on forecasts may need inputs at a specific frequency, e.g. daily closes rather than higher-frequency data; cerebro.resampledata() reshapes the data to the model's input requirements.

5. Custom Data Lines

  • The lines are close, low, high, open, volume, openinterest and datetime. If stock selection needs more data, such as PE, ROE or turnover, define a data class that inherits from PandasData:

    import backtrader as bt

    class MyCustomdata(bt.feeds.PandasData):
        lines = ('turnover',)
        params = (('turnover', -1),)
    
    • One line is added (more are possible); the other lines are inherited from PandasData.
    • A parameter is added indicating which column of the original pandas.DataFrame feeds the turnover line. With -1, the system matches against the DataFrame's column names instead.
  • The custom class is used as follows:

    import pandas as pd
    from datetime import datetime

    stock_hfq_df = pd.read_csv("../data/sh600000.csv", index_col='date', parse_dates=True)
    start_date = datetime(2021, 9, 1)   # backtest start date
    end_date = datetime(2021, 9, 30)    # backtest end date
    data = MyCustomdata(dataname=stock_hfq_df, fromdate=start_date, todate=end_date)

    # inside a strategy, the new line is read like any built-in line
    # (self.log is assumed to be a user-defined logging helper)
    def next(self):
        self.log('Close: %.3f' % self.data0.close[0])
        self.log('turnover: %.8f' % self.data0.turnover[0])
    

III. Data Storage Formats

  • Commonly used storage formats for Pandas data in Python include CSV, HDF5, Parquet, Feather and Pickle.

  • The following script tests Pandas read/write performance across these formats:

    import os
    import pandas as pd
    import time
    
    
    if __name__ == "__main__":
        
        start_time = time.time()
        data = pd.read_hdf("/home/samba/test/Market/stocks/stocks_post_1min/000004.XSHE.h5")
        end_time = time.time()
        print(data.shape)
        print("read_hdf ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_parquet("/home/samba/test/000004.XSHE.parquet")
        end_time = time.time()
        print("to_parquet ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_parquet("/home/samba/test/000004.XSHE.parquet")
        end_time = time.time()
        print("read_parquet ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_pickle("/home/samba/test/000004.XSHE.pickle")
        end_time = time.time()
        print("to_pickle ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_pickle("/home/samba/test/000004.XSHE.pickle")
        end_time = time.time()
        print("read_pickle ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
    
        start_time = time.time()
        data = data.reset_index()
        data.to_feather("/home/samba/test/000004.XSHE.feather")
        end_time = time.time()
        print("to_feather ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_feather("/home/samba/test/000004.XSHE.feather")
        end_time = time.time()
        print("read_feather ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
    
        start_time = time.time()
        data.to_csv("/home/samba/test/000004.XSHE.csv", chunksize=20000)
        end_time = time.time()
        print("to_csv ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_csv("/home/samba/test/000004.XSHE.csv")
        end_time = time.time()
        print("read_csv ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_hdf("/home/samba/test/000004.XSHE.h5", key='data', mode='w', complevel=9, data_columns=True)
        end_time = time.time()
        print("to_hdf ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
    
        start_time = time.time()
        data = pd.read_hdf("/home/samba/test/Market/factors/factors_post_5min/roc96_sp1000.h5", key='roc96')
        end_time = time.time()
        print(data.shape)
        print("read_hdf ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_parquet("/home/samba/test/roc96_sp1000.parquet")
        end_time = time.time()
        print("to_parquet ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_parquet("/home/samba/test/roc96_sp1000.parquet")
        end_time = time.time()
        print("read_parquet ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_pickle("/home/samba/test/roc96_sp1000.pickle")
        end_time = time.time()
        print("to_pickle ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_pickle("/home/samba/test/roc96_sp1000.pickle")
        end_time = time.time()
        print("read_pickle ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = data.reset_index()
        data.to_feather("/home/samba/test/roc96_sp1000.feather")
        end_time = time.time()
        print("to_feather ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_feather("/home/samba/test/roc96_sp1000.feather")
        end_time = time.time()
        print("read_feather ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_csv("/home/samba/test/roc96_sp1000.csv", chunksize=20000)
        end_time = time.time()
        print("to_csv ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_csv("/home/samba/test/roc96_sp1000.csv")
        end_time = time.time()
        print("read_csv ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_hdf("/home/samba/test/roc96_sp1000.h5", key='roc96', mode='w', complevel=9, data_columns=True)
        end_time = time.time()
        print("to_hdf ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
  • The Pandas read/write performance results:

[Figure: Pandas read/write benchmark results]

  • Weighing the read/write and compression performance of Pandas on both the narrow table and the wide table, the Feather format was chosen as the storage format.
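
  • A minimal Feather round trip (a sketch; Feather wants a default integer index, hence the reset_index/set_index pair, just as the benchmark above resets the index before to_feather):

    import pandas as pd

    df = pd.read_csv("../data/sh000300.csv", index_col='date', parse_dates=True)
    df.reset_index().to_feather("sh000300.feather")
    df2 = pd.read_feather("sh000300.feather").set_index('date')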