The DataFeed Module

I. Data Feed

1. Data Feed Overview

  • The Feeds module in Backtrader provides flexible data loading and processing. It supports multiple data sources and formats, and one or more instruments can be added.

  • A Feed is a data-source object that supplies time-series data to strategies, such as a stock's open, close and volume. Each Feed object represents one data source, which can be a local CSV file or a live stock-data stream. Backtrader ships with several commonly used Feed classes:

    • GenericCSVData: loads generic CSV-format data.
    • YahooFinanceData: downloads data from Yahoo Finance.
    • PandasData: loads data from a Pandas DataFrame.
    • IBData: fetches live data from the Interactive Brokers API.
  • PandasData is loaded as follows:

     import backtrader as bt
     import pandas as pd
     from datetime import datetime

     # read OHLCV data with the date column parsed as a DatetimeIndex
     stock_hfq_df = pd.read_csv("../data/sh000300.csv", index_col='date', parse_dates=True)
     start_date = datetime(2021, 9, 1)    # backtest start date
     end_date = datetime(2021, 9, 30)     # backtest end date
     data = bt.feeds.PandasData(dataname=stock_hfq_df, fromdate=start_date, todate=end_date)  # build the data feed
    
  • Pandas makes data preprocessing convenient and is the de facto format for quant data. For easy, direct access to the data later on, PandasData is used throughout the rest of this article.
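
  • For instance, a minimal sketch of feeding this data into Cerebro (strategy setup omitted; the feed name 'sh000300' is just an example):

     cerebro = bt.Cerebro()
     cerebro.adddata(data, name='sh000300')  # register the feed; the name can later be used with getdatabyname
     cerebro.run()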

2. Data Storage

  • Backtrader treats each instrument's data as a table over a time dimension and an indicator dimension. self.datas collects the data of multiple instruments, forming a three-dimensional data source with: a table dimension, a time dimension, and an indicator dimension.


  • self.datas in Data Feeds is a list; each Data Feed is a table with a time dimension and an indicator dimension. Feeds are indexed in the order they were added: the first feed imported gets index 0, and so on.

  • The table dimension is the list itself, collecting all added instruments. Each instrument is a table made of the time and indicator dimensions, accessed via self.datas[N].

  • The indicator dimension holds the indicators used during the backtest; besides the common ones, custom indicators can be added. Indicator values are accessed via self.datas[N].lines.xxx[M], and all line names can be listed with self.data.lines.getlinealiases() (see the sketch at the end of this section).

    Field         Type    Description
    datetime      float   date; to print it, use self.data.datetime.date(0)
    open          float   opening price
    high          float   high price
    low           float   low price
    close         float   closing price
    volume        float   volume
    openinterest  float   open interest
    (extended)    --      custom or extended indicators, e.g. pe, pb
  • The time dimension is the backtest window, between fromdate and todate; it is accessed via self.datas[N].lines.datetime.date(M).

  • The number and order of indicator columns in the imported table need not match the predefined layout exactly; just tell GenericCSVData, PandasData or PandasDirectData the position of each indicator in the data source, or set the position to -1 if the indicator is absent.
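
  • A minimal strategy sketch that walks the three dimensions (assuming backtrader is imported as bt and two feeds were added):

    class InspectStrategy(bt.Strategy):
        def next(self):
            for i, d in enumerate(self.datas):
                # indicator dimension: every line name of feed i
                print(i, d.lines.getlinealiases())
                # time + indicator dimensions: current date and close
                print(d.lines.datetime.date(0), d.lines.close[0])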

3. Data Indexing

  • self.datas is a list and can be indexed in several ways:

    • Subscript indexing: self.datas[N]; indices 0 to N-1 run forward, -1 to -N run backward.

    • Shorthand indexing: self.dataN (note: data, not datas), with N from 0 to N-1.

    • Name indexing: self.getdatabyname('name'), where name is the table name set via adddata(data_feed, name=code) when the data was added.

    • First feed: self.datas[0] is equivalent to self.data0 and to self.data.

      # access the close line of the first data feed
      self.data.lines.close   # lines can be shortened away: self.data.close
      self.data.lines_close   # lines can be shortened away: self.data_close
      # access the close line of the second data feed
      self.data1.lines.close  # lines can be shortened away: self.data1.close
      self.data1.lines_close  # lines can be shortened away: self.data1_close
      # note: lines may only be omitted on lines reached from self.datas;
      # it cannot be omitted when accessing a line of an indicator
      
  • Dates are indexed via self.datas[N].lines.datetime.date(M); the other fields are indexed via self.datas[N].lines.<field>[M] (open, high, low, close, volume).

  • datetime is stored as a float. To read it, convert with xxx.date(N), or use bt.num2date() to turn the float into a datetime of the form "xxxx-xx-xx xx:xx:xx".
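
  • For example, inside next() the following expressions all describe the same bar time (a sketch):

    dt_float = self.data.datetime[0]        # raw float storage
    dt_date = self.data.datetime.date(0)    # converted to a datetime.date
    dt_full = bt.num2date(dt_float)         # full datetime, e.g. 2021-09-01 00:00:00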

4. Slicing

  • Slices of a line are taken with the get method:

    self.data1.lines.close.get(ago=N, size=M)
    
    • ago: index at which the slice ends
    • size: slice length
    • return value: array [close[N-(M-1)], ..., close[N-1], close[N]]
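
  • For example, averaging the last 5 closes inside next() (a sketch; the guard avoids slicing before 5 bars exist):

    if len(self.data) >= 5:
        closes = self.data.close.get(ago=0, size=5)  # [close[-4], ..., close[-1], close[0]]
        print(sum(closes) / len(closes))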

5. Data Flow in a Strategy

  • Total backtest length: N = self.data.buflen()

  • Length backtested so far: len(self.data)

  • Index 0 means different things in __init__() and in next():

    • In __init__(), index 0 refers to the end of the backtest window (todate). __init__ runs only once, which makes it the place for expensive work such as computing indicators and buy/sell signals, preparing everything next() will need.
    • In next(), index 0 is the current backtest time. next() runs once per bar along the time dimension, and index 0 always points at the bar currently being processed.
  • In __init__(), index 0 is todate and index 1 is fromdate; both forward and backward access are supported:

    • forward indices: 1, 2, ..., N
    • backward indices: 0, -1, -2, ..., -(N-1)
  • In next(), index 0 is always the current time node and keeps moving as the backtest loops along the time dimension. backward is what has already been backtested; forward is what has not been reached yet.

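  • A sketch of the two lengths and the moving index 0 (progress printed on every bar):

    class ProgressStrategy(bt.Strategy):
        def __init__(self):
            # runs once; with preloaded data buflen() is the full backtest length
            print('total bars:', self.data.buflen())

        def next(self):
            # len(self.data) grows by one per bar; index 0 is always "now"
            print(len(self.data), self.data.datetime.date(0), self.data.close[0])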

6. Custom Data-Loading Classes

  • Passing parameters to describe the column layout on every load is tedious. Instead, define a custom loader by subclassing GenericCSVData or PandasData and setting the parameters once in the new class:

    import datetime
    import backtrader as bt

    class My_PandasData(bt.feeds.PandasData):
        params = (
            ('fromdate', datetime.datetime(2019, 1, 2)),
            ('todate', datetime.datetime(2021, 1, 28)),
            ('nullvalue', 0.0),
            ('dtformat', ('%Y-%m-%d')),
            ('datetime', 0),
            ('time', -1),
            ('high', 3),
            ('low', 4),
            ('open', 2),
            ('close', 5),
            ('volume', 6),
            ('openinterest', -1)
        )
    
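  • Usage then needs no per-call column parameters (a sketch; the CSV path is a placeholder and its layout is assumed to match the column numbers above, with the date in column 0):

    df = pd.read_csv("../data/sh000300.csv", parse_dates=[0])
    data = My_PandasData(dataname=df)
    cerebro.adddata(data)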

II. PandasData

1. PandasData Instantiation

PandasData instantiation
  • PandasData's inheritance chain acquires its metaclass starting from AbstractDataBase. When PandasData is instantiated, MetaBase's __call__ runs first; its code is:

    class MetaBase(type):
        def doprenew(cls, *args, **kwargs):
            return cls, args, kwargs
    
        def donew(cls, *args, **kwargs):
            _obj = cls.__new__(cls, *args, **kwargs)
            return _obj, args, kwargs
    
        def dopreinit(cls, _obj, *args, **kwargs):
            return _obj, args, kwargs
    
        def doinit(cls, _obj, *args, **kwargs):
            _obj.__init__(*args, **kwargs)
            return _obj, args, kwargs
    
        def dopostinit(cls, _obj, *args, **kwargs):
            return _obj, args, kwargs
    
        def __call__(cls, *args, **kwargs):
            cls, args, kwargs = cls.doprenew(*args, **kwargs)
            _obj, args, kwargs = cls.donew(*args, **kwargs)
            _obj, args, kwargs = cls.dopreinit(_obj, *args, **kwargs)
            _obj, args, kwargs = cls.doinit(_obj, *args, **kwargs)
            _obj, args, kwargs = cls.dopostinit(_obj, *args, **kwargs)
            return _obj
    
    • __call__ executes doprenew, donew, dopreinit, doinit and dopostinit in order. At doprenew there is no instance yet, so only cls is passed along; the instance is created in donew, and the later hooks receive the object.

    • doprenew: no MetaBase subclass overrides doprenew, so MetaBase's own doprenew runs.

    • donew: MetaLineSeries overrides donew:

          def donew(cls, *args, **kwargs):
              '''
              Intercept instance creation, take over lines/plotinfo/plotlines
              class attributes by creating corresponding instance variables and add
              aliases for "lines" and the "lines" held within it
              '''
              # _obj.plotinfo shadows the plotinfo (class) definition in the class
              plotinfo = cls.plotinfo()
      
              for pname, pdef in cls.plotinfo._getitems():
                  setattr(plotinfo, pname, kwargs.pop(pname, pdef))
      
              # Create the object and set the params in place
              _obj, args, kwargs = super(MetaLineSeries, cls).donew(*args, **kwargs)
      
              # set the plotinfo member in the class
              _obj.plotinfo = plotinfo
      
              # _obj.lines shadows the lines (class) definition in the class
              _obj.lines = cls.lines()
      
              # _obj.plotinfo shadows the plotinfo (class) definition in the class
              _obj.plotlines = cls.plotlines()
      
              # add aliases for lines and for the lines class itself
              _obj.l = _obj.lines
              if _obj.lines.fullsize():
                  _obj.line = _obj.lines[0]
      
              for l, line in enumerate(_obj.lines):
                  setattr(_obj, 'line_%s' % l, _obj._getlinealias(l))
                  setattr(_obj, 'line_%d' % l, line)
                  setattr(_obj, 'line%d' % l, line)
      
              # Parameter values have now been set before __init__
              return _obj, args, kwargs
      
    • MetaLineSeries calls its parent's donew for the actual instantiation and parameter mapping.

    • MetaLineSeries' parent is MetaLineRoot, whose donew is:

      class MetaLineRoot(metabase.MetaParams):
          '''
          Once the object is created (effectively pre-init) the "owner" of this
          class is sought
          '''
      
          def donew(cls, *args, **kwargs):
              _obj, args, kwargs = super(MetaLineRoot, cls).donew(*args, **kwargs)
      
              # Find the owner and store it
              # startlevel = 4 ... to skip intermediate call stacks
              ownerskip = kwargs.pop('_ownerskip', None)
              _obj._owner = metabase.findowner(_obj,
                                               _obj._OwnerCls or LineMultiple,
                                               skip=ownerskip)
      
              # Parameter values have now been set before __init__
              return _obj, args, kwargs
      
    • MetaLineRoot calls its parent's donew; the parent is MetaParams, which ends up in MetaBase's donew. There the PandasData class is instantiated and its parameters are mapped onto attributes.

    • With the parent-class instantiation done, the Lines are instantiated next.
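
    • The hook order can be illustrated with a toy metaclass of the same shape (illustrative only, not backtrader code):

      class TraceMeta(type):
          # condensed version of MetaBase.__call__: five phases
          def __call__(cls, *args, **kwargs):
              print('doprenew')              # no instance yet, only cls
              obj = cls.__new__(cls)         # donew: the instance is created here
              print('donew')
              print('dopreinit')             # instance exists, __init__ not run yet
              obj.__init__(*args, **kwargs)  # doinit
              print('dopostinit')
              return obj

      class Demo(metaclass=TraceMeta):
          def __init__(self):
              print('Demo.__init__')

      Demo()  # prints: doprenew, donew, dopreinit, Demo.__init__, dopostinit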

Lines instantiation
  • Lines is an ordinary class: instantiation goes through __new__ and initialization through __init__, whose code is:

    class Lines(object):
    
        def __init__(self, initlines=None):
            '''
            Create the lines recording during "_derive" or else use the
            provided "initlines"
            '''
            self.lines = list()
            for line, linealias in enumerate(self._getlines()):
                kwargs = dict()
                self.lines.append(LineBuffer(**kwargs))
    
            # Add the required extralines
            for i in range(self._getlinesextra()):
                if not initlines:
                    self.lines.append(LineBuffer())
                else:
                    self.lines.append(initlines[i])
    
    • The lines container is initialized, and a LineBuffer is instantiated for each line. The initial lines are close, low, high, open, volume and openinterest, plus datetime.
    • Any extra lines are also instantiated as LineBuffer objects.
LineBuffer instantiation
  • LineBuffer also carries the metaclass, so its instantiation is controlled by MetaBase. doprenew runs first; nothing in the LineBuffer hierarchy overrides it, so MetaBase's doprenew is called.

  • The donew step runs MetaLineRoot's donew, already quoted in the previous subsection.

    • MetaLineRoot calls its parent MetaParams' donew, which in turn calls MetaBase's donew to instantiate the LineBuffer class and map its parameters onto attributes.
    • findowner locates the owner that created the LineBuffer instance: the PandasData instance.
  • dopreinit is not overridden anywhere in the LineBuffer hierarchy, so MetaBase's dopreinit runs and simply returns.

  • doinit calls LineBuffer's __init__ method:

        def __init__(self):
            self.lines = [self]
            self.mode = self.UnBounded
            self.bindings = list()
            self.reset()
            self._tz = None
    
    • The buffer first adds itself to lines.
    • Attributes are initialized.
    • reset re-initializes the in-memory storage structure and index.
  • PandasData instantiation is complete.

2. PandasData Initialization

  • After donew finishes the instantiation, dopreinit runs to begin initialization. Among PandasData's ancestors, MetaAbstractDataBase overrides dopreinit:

        def dopreinit(cls, _obj, *args, **kwargs):
            _obj, args, kwargs = \
                super(MetaAbstractDataBase, cls).dopreinit(_obj, *args, **kwargs)
    
            # Find the owner and store it
            _obj._feed = metabase.findowner(_obj, FeedBase)
    
            _obj.notifs = collections.deque()  # store notifications for cerebro
    
            _obj._dataname = _obj.p.dataname
            _obj._name = ''
            return _obj, args, kwargs
    
    • The parent's dopreinit is called, ending in MetaBase's dopreinit.
    • The owner of the PandasData instance is looked up and comes back empty: PandasData initiated the instantiation, so it has no owner.
    • notifs is initialized to store notifications destined for Cerebro.
    • The name _name is set to the empty string.
  • After dopreinit comes doinit, i.e. PandasData's own __init__ function:

        def __init__(self):
            super(PandasData, self).__init__()
    
            # these "colnames" can be strings or numeric types
            colnames = list(self.p.dataname.columns.values)
            if self.p.datetime is None:
                # datetime is expected as index col and hence not returned
                pass
    
            # try to autodetect if all columns are numeric
            cstrings = filter(lambda x: isinstance(x, string_types), colnames)
            colsnumeric = not len(list(cstrings))
    
            # Where each datafield find its value
            self._colmapping = dict()
    
            # Build the column mappings to internal fields in advance
            for datafield in self.getlinealiases():
                defmapping = getattr(self.params, datafield)
    
                if isinstance(defmapping, integer_types) and defmapping < 0:
                    # autodetection requested
                    for colname in colnames:
                        if isinstance(colname, string_types):
                            if self.p.nocase:
                                found = datafield.lower() == colname.lower()
                            else:
                                found = datafield == colname
    
                            if found:
                                self._colmapping[datafield] = colname
                                break
    
                    if datafield not in self._colmapping:
                        # autodetection requested and not found
                        self._colmapping[datafield] = None
                        continue
                else:
                    # all other cases -- used given index
                    self._colmapping[datafield] = defmapping
    
    • PandasData's parent __init__ is called (via MetaAbstractDataBase).
    • The column names of the dataname input are recorded in colnames.
    • The datetime parameter normally need not be given; it indicates which column holds datetime, which is usually the index (column 0).
    • The code checks whether any column names are numeric. Numeric names go into _colmapping directly: when a parameter already specifies the column position, the system maps by number rather than by name.
    • The Pandas DataFrame column names are mapped to PandasData's data fields. The default fields are ['datetime', 'open', 'high', 'low', 'close', 'volume', 'openinterest']. The mapping is stored in the _colmapping dict, e.g.: {'close': 'close', 'low': 'low', 'high': 'high', 'open': 'open', 'volume': 'volume'}.
  • PandasData initialization is complete.
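
  • The resulting mapping can be reproduced with plain pandas (a sketch of the autodetection branch only; nocase handling and numeric overrides omitted):

    import pandas as pd

    df = pd.read_csv("../data/sh000300.csv", index_col='date', parse_dates=True)
    fields = ['datetime', 'open', 'high', 'low', 'close', 'volume', 'openinterest']

    # default parameter is -1 for the OHLCV fields: autodetect by column name
    colmapping = {f: (f if f in df.columns else None) for f in fields}
    print(colmapping)  # e.g. {'close': 'close', ..., 'openinterest': None, 'datetime': None}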

3. PandasData Data Loading

  • If no preloaded data was supplied when Cerebro was set up, Cerebro preloads the data during run (in Cerebro's runstrategies function). Before preloading, each data feed must be reset. In PandasData's hierarchy, Lines implements reset by iterating over its lines (the LineBuffer instances for close, low, high, open, volume, openinterest and datetime) and resetting each one. reset essentially initializes an array.array that will store the values.

    if not predata:
        for data in self.datas:
            data.reset()
            if self._exactbars < 1:  # datas can be full length
                data.extend(size=self.params.lookahead)
                data._start()
                if self._dopreload:
                    data.preload()
    
  • The _start function is defined in PandasData's parent class AbstractDataBase; it calls start() and, on the first run, _start_finish() (quoted in full further below):

        def _start(self):
            self.start()

            if not self._started:
                self._start_finish()
    
  • The start method is defined on the PandasData class itself:

        def start(self):
            super(PandasData, self).start()
    
            # reset the length with each start
            self._idx = -1
    
            # Transform names (valid for .ix) into indices (good for .iloc)
            if self.p.nocase:
                colnames = [x.lower() for x in self.p.dataname.columns.values]
            else:
                colnames = [x for x in self.p.dataname.columns.values]
    
            for k, v in self._colmapping.items():
                if v is None:
                    continue  # special marker for datetime
                if isinstance(v, string_types):
                    try:
                        if self.p.nocase:
                            v = colnames.index(v.lower())
                        else:
                            v = colnames.index(v)
                    except ValueError as e:
                        defmap = getattr(self.params, k)
                        if isinstance(defmap, integer_types) and defmap < 0:
                            v = None
                        else:
                            raise e  # let user now something failed
    
                self._colmapping[k] = v
    
    • The parent's start method is called.
    • The index is initialized to -1, so the first +1 yields the starting index 0.
    • colnames stores the column names of the original pandas.DataFrame.
    • During initialization _colmapping held the original column names for each PandasData field; start replaces them with column indices: {'close': 4, 'low': 3, 'high': 2, 'open': 1, 'volume': 5, 'openinterest': None, 'datetime': None}. The last two have no matching column; datetime is missing from colnames because the date serves directly as the DataFrame index.
  • The parent start methods propagate up the hierarchy and finally reach the start defined in AbstractDataBase:

    def start(self):
        self._barstack = collections.deque()
        self._barstash = collections.deque()
        self._laststatus = self.CONNECTED
    
  • The _start_finish function is defined in PandasData's parent class AbstractDataBase as follows:

        def _start_finish(self):
            # A live feed (for example) may have learnt something about the
            # timezones after the start and that's why the date/time related
            # parameters are converted at this late stage
            # Get the output timezone (if any)
            self._tz = self._gettz()
            # Lines have already been create, set the tz
            self.lines.datetime._settz(self._tz)
    
            # This should probably be also called from an override-able method
            self._tzinput = bt.utils.date.Localizer(self._gettzinput())
    
            # Convert user input times to the output timezone (or min/max)
            if self.p.fromdate is None:
                self.fromdate = float('-inf')
            else:
                self.fromdate = self.date2num(self.p.fromdate)
    
            if self.p.todate is None:
                self.todate = float('inf')
            else:
                self.todate = self.date2num(self.p.todate)
    
            # FIXME: These two are never used and could be removed
            self.sessionstart = time2num(self.p.sessionstart)
            self.sessionend = time2num(self.p.sessionend)
    
            self._calendar = cal = self.p.calendar
            if cal is None:
                self._calendar = self._env._tradingcal
            elif isinstance(cal, string_types):
                self._calendar = PandasMarketCalendar(calendar=cal)
    
            self._started = True
    
    • The data's time zone is set, along with the time zone of the datetime line.
    • The user-supplied times are converted to numeric form: midnight of January 1 of year 1 counts as 1, each day adds 1, and partial days are fractional (noon is .5). Turning each time into a unique number makes the data fast to process.
    • The calendar information is recorded in _calendar.
    • start is marked as finished.
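
    • A round-trip sketch of this numeric form (date2num/num2date live in backtrader.utils.date):

      from datetime import datetime
      from backtrader.utils.date import date2num, num2date

      num = date2num(datetime(2021, 9, 1, 12, 0))
      # integer part: days since 0001-01-01; fractional part: time of day
      print(num % 1)         # 0.5 -> noon
      print(num2date(num))   # 2021-09-01 12:00:00
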
  • The preload function is defined in PandasData's parent class AbstractDataBase; preload calls load in a loop to pull in the data:

    def preload(self):
        while self.load():
            pass
    
        self._last()
        self.home()
            
    def load(self):
        while True:
            # move data pointer forward for new bar
            self.forward()

            if self._fromstack():  # bar is available
                return True

            if not self._fromstack(stash=True):
                _loadret = self._load()
                if not _loadret:  # no bar use force to make sure in exactbars
                    # the pointer is undone this covers especially (but not
                    # uniquely) the case in which the last bar has been seen
                    # and a backwards would ruin pointer accounting in the
                    # "stop" method of the strategy
                    self.backwards(force=True)  # undo data pointer

                    # return the actual returned value which may be None to
                    # signal no bar is available, but the data feed is not
                    # done. False means game over
                    return _loadret

            # Get a reference to current loaded time
            dt = self.lines.datetime[0]

            # A bar has been loaded, adapt the time
            if self._tzinput:
                # Input has been converted at face value but it's not UTC in
                # the input stream
                dtime = num2date(dt)  # get it in a naive datetime
                # localize it
                dtime = self._tzinput.localize(dtime)  # pytz compatible-ized
                self.lines.datetime[0] = dt = date2num(dtime)  # keep UTC val

            # Check standard date from/to filters
            if dt < self.fromdate:
                # discard loaded bar and carry on
                self.backwards()
                continue
            if dt > self.todate:
                # discard loaded bar and break out
                self.backwards(force=True)
                break

            # Pass through filters
            retff = False
            for ff, fargs, fkwargs in self._filters:
                # previous filter may have put things onto the stack
                if self._barstack:
                    for i in range(len(self._barstack)):
                        self._fromstack(forward=True)
                        retff = ff(self, *fargs, **fkwargs)
                else:
                    retff = ff(self, *fargs, **fkwargs)

                if retff:  # bar removed from systemn
                    break  # out of the inner loop

            if retff:  # bar removed from system - loop to get new bar
                continue  # in the greater loop

            # Checks let the bar through ... notify it
            return True

        # Out of the loop ... no more bars or past todate
        return False
    
    • forward is called to advance the pointer.
    • _fromstack tries to fetch a bar from _barstack or _barstash; nothing can be fetched here because both were created empty in _start.
    • Once a bar is loaded, if an input time zone was supplied, the time is localized and the datetime line is updated with the new numeric date.
    • If the bar's date is earlier than the fromdate parameter or later than the todate parameter, backwards is called to discard it.
  • The forward function is defined in PandasData's parent class LineSeries:

    def forward(self, value=NAN, size=1):
        '''
            Proxy line operation
            '''
        for line in self.lines:
            line.forward(value, size=size)
    
  • LineBuffer's forward function is implemented as follows:

    def forward(self, value=NAN, size=1):
        ''' Moves the logical index foward and enlarges the buffer as much as needed
    
            Keyword Args:
                value (variable): value to be set in new positins
                size (int): How many extra positions to enlarge the buffer
            '''
        self.idx += size
        self.lencount += size
    
        for i in range(size):
            self.array.append(value)
    
    • The index is advanced by 1 (the default stride). idx starts at -1, so the first forward call makes it 0.
    • The length is increased by 1.
    • NAN placeholders (the initial value) are appended to the array.
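
    • The pointer arithmetic can be pictured with a stripped-down buffer (illustrative only; the real LineBuffer adds modes, bindings and time zones):

      class TinyLineBuffer:
          def __init__(self):
              self.array, self.idx, self.lencount = [], -1, 0

          def forward(self, value=float('nan'), size=1):
              self.idx += size            # logical index moves to the new slot
              self.lencount += size
              self.array.extend([value] * size)

          def __getitem__(self, ago):     # [0] is the current bar
              return self.array[self.idx + ago]

      buf = TinyLineBuffer()
      buf.forward()                       # idx: -1 -> 0, array: [nan]
      buf.array[buf.idx] = 10.5           # what _load does via line[0] = value
      print(buf[0])                       # 10.5
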
  • The PandasData class overrides the _load function:

        def _load(self):
            self._idx += 1
    
            if self._idx >= len(self.p.dataname):
                # exhausted all rows
                return False
    
            # Set the standard datafields
            for datafield in self.getlinealiases():
                if datafield == 'datetime':
                    continue
    
                colindex = self._colmapping[datafield]
                if colindex is None:
                    # datafield signaled as missing in the stream: skip it
                    continue
    
                # get the line to be set
                line = getattr(self.lines, datafield)
    
                # indexing for pandas: 1st is colum, then row
                line[0] = self.p.dataname.iloc[self._idx, colindex]
    
            # datetime conversion
            coldtime = self._colmapping['datetime']
    
            if coldtime is None:
                # standard index in the datetime
                tstamp = self.p.dataname.index[self._idx]
            else:
                # it's in a different column ... use standard column index
                tstamp = self.p.dataname.iloc[self._idx, coldtime]
    
            # convert to float via datetime and store it
            dt = tstamp.to_pydatetime()
            dtnum = date2num(dt)
            self.lines.datetime[0] = dtnum
    
            # Done ... return
            return True
    
    • The index is advanced first, starting from 0. Once it exceeds the number of rows in the original data, loading is finished.
    • For each line alias of the data (initially close, low, high, open, volume, openinterest), the corresponding column number is looked up in _colmapping and the value from that column is written into the line's array.array.
    • Finally datetime is handled: it usually serves as the index (the first column), so the value is taken from there, converted with date2num and stored in the data's datetime line.
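
    • After preloading, the filled lines can be inspected directly (a sketch; assumes a Cerebro run with preloading enabled and stock_hfq_df from the earlier example):

      data = bt.feeds.PandasData(dataname=stock_hfq_df)
      cerebro = bt.Cerebro()
      cerebro.adddata(data)
      cerebro.run()  # preloads: every line now holds one float per DataFrame row

      print(len(data.close.array))                # number of loaded bars (plus any lookahead padding)
      print(bt.num2date(data.datetime.array[0]))  # first bar's timestamp
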
  • backwards on LineSeries proxies straight through to LineBuffer's backwards function:

        def backwards(self, size=1, force=False):
            ''' Moves the logical index backwards and reduces the buffer as much as needed
    
            Keyword Args:
                size (int): How many extra positions to rewind and reduce the
                buffer
            '''
            # Go directly to property setter to support force
            self.set_idx(self._idx - size, force=force)
            self.lencount -= size
            for i in range(size):
                self.array.pop()
    
    • idx is rewound first; when the first bar was being added it was 0, so it goes back to -1.
    • The length is reduced by the rewind stride.
    • The most recently appended values are popped off.
  • When Cerebro calls preload in its runstrategies function, one condition must hold:

    if self._dopreload:
        data.preload()

  • Two situations prevent preloading:

    • the data sources include live data;
    • the data sources include resampled or replayed data.
  • In those cases the data is loaded inside next instead.

4. Resampling

Resampling
  • Resampling re-samples fine-grained data into coarser-grained data, for example turning daily bars into weekly bars.

  • Cerebro's resampledata is defined as follows:

    def resampledata(self, dataname, name=None, **kwargs):
        '''
        Adds a ``Data Feed`` to be resample by the system

        If ``name`` is not None it will be put into ``data._name`` which is
        meant for decoration/plotting purposes.

        Any other kwargs like ``timeframe``, ``compression``, ``todate`` which
        are supported by the resample filter will be passed transparently
        '''
        if any(dataname is x for x in self.datas):
            dataname = dataname.clone()

        dataname.resample(**kwargs)
        self.adddata(dataname, name=name)
        self._doreplay = True

        return dataname
    
    • If the data passed in is one already added to Cerebro, an identical clone of it is made.
    • The data's resample function is then called.
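
    • Typical usage keeps the original daily feed and adds a weekly view of the same data (a sketch):

      cerebro.adddata(data)                  # original daily bars
      cerebro.resampledata(data, timeframe=bt.TimeFrame.Weeks,
                           compression=1)    # cloned internally and resampled to weekly
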
  • The data's resample method is defined in AbstractDataBase:

    def resample(self, **kwargs):
        self.addfilter(Resampler, **kwargs)

    def replay(self, **kwargs):
        self.addfilter(Replayer, **kwargs)

    def addfilter(self, p, *args, **kwargs):
        if inspect.isclass(p):
            pobj = p(self, *args, **kwargs)
            self._filters.append((pobj, [], {}))

            if hasattr(pobj, 'last'):
                self._ffilters.append((pobj, [], {}))

        else:
            self._filters.append((p, args, kwargs))
    
    • addfilter registers the filter class (Resampler); its input may be either a class or an instance.
    • The data added through resampling is identical to the original data, except that a Resampler object is attached to process bars as they are loaded.
Resampler instantiation
  • Resampler inherits from _BaseResampler, which is built with the metaclass MetaParams; MetaParams inherits from MetaBase (shown earlier). Instantiating Resampler therefore also runs MetaBase's doprenew and MetaParams' donew, and the donew override in MetaParams maps the parameters onto attributes and completes the instantiation. The relevant skeleton:

    class MetaParams(MetaBase):
        def __new__(meta, name, bases, dct):
            ...  # gathers params declarations across the class hierarchy

        def donew(cls, *args, **kwargs):
            ...  # instantiates params and maps them onto the new object

    class _BaseResampler(with_metaclass(metabase.MetaParams, object)):
        params = (
            ('bar2edge', True),
            ('adjbartime', True),
            ('rightedge', True),
            ('boundoff', 0),
    
            ('timeframe', TimeFrame.Days),
            ('compression', 1),
    
            ('takelate', True),
    
            ('sessionend', True),
        )
    
        def __init__(self, data):
            self.subdays = TimeFrame.Ticks < self.p.timeframe < TimeFrame.Days
            self.subweeks = self.p.timeframe < TimeFrame.Weeks
            self.componly = (not self.subdays and
                             data._timeframe == self.p.timeframe and
                             not (self.p.compression % data._compression))
    
            self.bar = _Bar(maxdate=True)  # bar holder
            self.compcount = 0  # count of produced bars to control compression
            self._firstbar = True
            self.doadjusttime = (self.p.bar2edge and self.p.adjbartime and
                                 self.subweeks)
    
            self._nexteos = None
    
            # Modify data information according to own parameters
            data.resampling = 1
            data.replaying = self.replaying
            data._timeframe = self.p.timeframe
            data._compression = self.p.compression
    
            self.data = data
            
    class Resampler(_BaseResampler):
        params = (
            ('bar2edge', True),
            ('adjbartime', True),
            ('rightedge', True),
        )
        replaying = False

        def last(self, data):
            ...  # delivers any pending bar when the feed ends

        def __call__(self, data, fromcheck=False, forcedata=None):
            ...  # applied as a filter to each loaded bar
    
Resampler initialization
  • After instantiation comes initialization. _BaseResampler's __init__ mainly initializes the parameters and binds the resampler to the data object. The parameters are:

    • adjbartime (default True): use the boundary time as the timestamp of the resampled bar rather than the last timestamp seen. With a 5-second target granularity the time is adjusted to hh:mm:05 even if the last bar inside the window was stamped hh:mm:04.33.
    • bar2edge (default True): resample toward time boundaries. Resampling ticks at a 5-second granularity produces bars aligned at xx:00, xx:05, xx:10.
    • boundoff (default 0): shift the resampling window forward by this many source bars. When compressing 1-minute bars into 15-minute bars the default window runs from 00:01:00 to 00:15:00 (fifteen 1-minute bars per 15-minute bar); with boundoff=1 the window moves one bar earlier, from 00:00:00 to 00:14:00.
    • compression (default 1): compression ratio; compression=2 means two fine-grained bars are compressed into one target bar.
    • rightedge (default True): use the right edge of the time boundary as the resampled timestamp. With a 5-second target, seconds falling in hh:mm:00-hh:mm:04 are stamped hh:mm:00 (the window start) when False, and hh:mm:05 (the window end) when True.
Resampler data loading
  • The Resampler's data loading resembles that of an ordinary data feed, except that what gets loaded is a DataClone. runstrategies calls the _start function that DataClone overrides:

    class DataClone(AbstractDataBase):
        _clone = True
    
        def __init__(self):
            self.data = self.p.dataname
            self._dataname = self.data._dataname
    
            # Copy date/session parameters
            self.p.fromdate = self.p.fromdate
            self.p.todate = self.p.todate
            self.p.sessionstart = self.data.p.sessionstart
            self.p.sessionend = self.data.p.sessionend
    
            self.p.timeframe = self.data.p.timeframe
            self.p.compression = self.data.p.compression
    
        def _start(self):
            # redefine to copy data bits from guest data
            self.start()
    
            # Copy tz infos
            self._tz = self.data._tz
            self.lines.datetime._settz(self._tz)
    
            self._calendar = self.data._calendar
    
            # input has already been converted by guest data
            self._tzinput = None  # no need to further converr
    
            # Copy dates/session infos
            self.fromdate = self.data.fromdate
            self.todate = self.data.todate
    
            # FIXME: if removed from guest, remove here too
            self.sessionstart = self.data.sessionstart
            self.sessionend = self.data.sessionend
    
        def start(self):
            super(DataClone, self).start()
            self._dlen = 0
            self._preloading = False
    
    • DataClone's parent AbstractDataBase start is called.
    • The lines' time zone is set.
    • The start and end dates are recorded.
  • DataClone's next is called, which in turn calls DataClone's _load function:

    def _load(self):
        # assumption: the data is in the system
        # simply copy the lines
        if self._preloading:
            # data is preloaded, we are preloading too, can move
            # forward until have full bar or data source is exhausted
            self.data.advance()
            if len(self.data) > self.data.buflen():
                return False

            for line, dline in zip(self.lines, self.data.lines):
                line[0] = dline[0]

            return True

        # Not preloading
        if not (len(self.data) > self._dlen):
            # Data not beyond last seen bar
            return False

        self._dlen += 1

        for line, dline in zip(self.lines, self.data.lines):
            line[0] = dline[0]

        return True
    
Resampling use cases
  • In the Backtrader framework, resampling adapts data to different time intervals or frequencies.
    • Time-interval conversion: the original interval may not suit the strategy or analysis. Data fetched per minute can be resampled to a longer interval, such as hourly or daily, with cerebro.resampledata().
    • Data smoothing: resampling to a larger interval reduces volatility in the series, making market trends and patterns easier to observe.
    • Cross-market alignment: when a strategy trades several markets whose data arrive at different intervals, cerebro.resampledata() can align them.
    • Prediction models: strategies built on forecasts may need inputs at a specific frequency, e.g. daily closes rather than higher-frequency data; cerebro.resampledata() reshapes the data to the model's input requirements.

5. Custom Data Lines

  • The lines are close, low, high, open, volume, openinterest and datetime. If stock selection needs more data, such as PE, ROE or turnover, define a data class that inherits from PandasData:

    import backtrader as bt

    class MyCustomdata(bt.feeds.PandasData):
        lines = ('turnover',)
        params = (('turnover', -1),)
    
    • One line is added (more are possible); the other lines are inherited from PandasData.
    • A parameter is added indicating which column of the original pandas.DataFrame feeds the turnover line. With -1, the system matches against the DataFrame's column names instead.
  • The custom class is used as follows:

    import pandas as pd
    from datetime import datetime

    stock_hfq_df = pd.read_csv("../data/sh600000.csv", index_col='date', parse_dates=True)
    start_date = datetime(2021, 9, 1)   # backtest start date
    end_date = datetime(2021, 9, 30)    # backtest end date
    data = MyCustomdata(dataname=stock_hfq_df, fromdate=start_date, todate=end_date)

    # inside a strategy, the new line is read like any built-in line
    # (self.log is assumed to be a user-defined logging helper)
    def next(self):
        self.log('Close: %.3f' % self.data0.close[0])
        self.log('turnover: %.8f' % self.data0.turnover[0])
    

III. Data Storage Formats

  • Commonly used storage formats for Pandas data in Python include CSV, HDF5, Parquet, Feather and Pickle.

  • The following script tests Pandas read/write performance across these formats:

    import os
    import pandas as pd
    import time
    
    
    if __name__ == "__main__":
        
        start_time = time.time()
        data = pd.read_hdf("/home/samba/test/Market/stocks/stocks_post_1min/000004.XSHE.h5")
        end_time = time.time()
        print(data.shape)
        print("read_hdf ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_parquet("/home/samba/test/000004.XSHE.parquet")
        end_time = time.time()
        print("to_parquet ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_parquet("/home/samba/test/000004.XSHE.parquet")
        end_time = time.time()
        print("read_parquet ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_pickle("/home/samba/test/000004.XSHE.pickle")
        end_time = time.time()
        print("to_pickle ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_pickle("/home/samba/test/000004.XSHE.pickle")
        end_time = time.time()
        print("read_pickle ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
    
        start_time = time.time()
        data = data.reset_index()
        data.to_feather("/home/samba/test/000004.XSHE.feather")
        end_time = time.time()
        print("to_feather ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_feather("/home/samba/test/000004.XSHE.feather")
        end_time = time.time()
        print("read_feather ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
    
        start_time = time.time()
        data.to_csv("/home/samba/test/000004.XSHE.csv", chunksize=20000)
        end_time = time.time()
        print("to_csv ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_csv("/home/samba/test/000004.XSHE.csv")
        end_time = time.time()
        print("read_csv ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_hdf("/home/samba/test/000004.XSHE.h5", key='data', mode='w', complevel=9, data_columns=True)
        end_time = time.time()
        print("to_hdf ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
    
        start_time = time.time()
        data = pd.read_hdf("/home/samba/test/Market/factors/factors_post_5min/roc96_sp1000.h5", key='roc96')
        end_time = time.time()
        print(data.shape)
        print("read_hdf ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_parquet("/home/samba/test/roc96_sp1000.parquet")
        end_time = time.time()
        print("to_parquet ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_parquet("/home/samba/test/roc96_sp1000.parquet")
        end_time = time.time()
        print("read_parquet ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_pickle("/home/samba/test/roc96_sp1000.pickle")
        end_time = time.time()
        print("to_pickle ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_pickle("/home/samba/test/roc96_sp1000.pickle")
        end_time = time.time()
        print("read_pickle ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = data.reset_index()
        data.to_feather("/home/samba/test/roc96_sp1000.feather")
        end_time = time.time()
        print("to_feather ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_feather("/home/samba/test/roc96_sp1000.feather")
        end_time = time.time()
        print("read_feather ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_csv("/home/samba/test/roc96_sp1000.csv", chunksize=20000)
        end_time = time.time()
        print("to_csv ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
        
        start_time = time.time()
        data = pd.read_csv("/home/samba/test/roc96_sp1000.csv")
        end_time = time.time()
        print("read_csv ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
        start_time = time.time()
        data.to_hdf("/home/samba/test/roc96_sp1000.h5", key='roc96', mode='w', complevel=9, data_columns=True)
        end_time = time.time()
        print("to_hdf ", end_time - start_time, "s ", len(data)/(end_time - start_time), "row/s")
    
  • The Pandas read/write performance results:

[Figure: Pandas read/write benchmark results]

  • Weighing the read/write and compression performance of Pandas on both the narrow table and the wide table, the Feather format was chosen as the storage format.
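
  • A minimal Feather round trip (a sketch; Feather wants a default integer index, hence the reset_index/set_index pair, just as the benchmark above resets the index before to_feather):

    import pandas as pd

    df = pd.read_csv("../data/sh000300.csv", index_col='date', parse_dates=True)
    df.reset_index().to_feather("sh000300.feather")
    df2 = pd.read_feather("sh000300.feather").set_index('date')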