I. Data Feed
1. Introduction to Data Feed
The Feeds module in Backtrader provides flexible data loading and processing. It supports multiple data sources and formats, and one or more instruments can be added.
A Feed is a data-source object that supplies time-series data to a strategy, such as open price, close price and volume. Each Feed object represents one data source, which can be a local CSV file or a live market-data stream. Backtrader ships with several commonly used Feed classes:
- GenericCSVData: loads generic CSV-format data.
- YahooFinanceData: downloads data from Yahoo Finance.
- PandasData: loads data from a Pandas DataFrame.
- IBData: fetches live data from the Interactive Brokers API.
Loading data through PandasData looks like this:

```python
import backtrader as bt
import pandas as pd
from datetime import datetime

stock_hfq_df = pd.read_csv("../data/sh000300.csv", index_col='date', parse_dates=True)
start_date = datetime(2021, 9, 1)    # backtest start
end_date = datetime(2021, 9, 30)     # backtest end
data = bt.feeds.PandasData(dataname=stock_hfq_df, fromdate=start_date, todate=end_date)  # load the data
```
Pandas makes data preprocessing convenient and is the de facto format for quantitative data. For consistency, the rest of this article uses PandasData in all examples.
2. Data Storage
- Backtrader treats each instrument's data as a table spanning a time dimension and a field dimension. self.datas collects the data of multiple instruments, forming a three-dimensional data source: the table dimension, the time dimension and the field dimension.
self.datas in Data Feeds is a list; each Data Feed is one table of time and field dimensions. Feeds are indexed in the order they were added: the first feed added gets index 0, and the index increases from there.
The table dimension is the list itself: it collects every instrument's data that has been added. Each feed is a table made of the time and field dimensions and is accessed via self.datas[N].
The field dimension holds the fields used during the backtest; besides the standard fields, custom fields can be defined. Field data is accessed via self.datas[N].lines.xxx[M], and self.data.lines.getlinealiases() returns all field names.
| Field | Type | Description |
|---|---|---|
| datetime | float | date; to print it, use datetime.date(0) |
| open | float | open price |
| high | float | highest price |
| low | float | lowest price |
| close | float | close price |
| volume | float | traded volume |
| openinterest | float | open interest |
| extended fields | | custom or extended fields, e.g. pe, pb |
The time dimension is the backtest window, from fromdate to todate, and is accessed via self.datas[N].lines.datetime.date(M).
The number and order of fields in the imported table do not have to match the predefined layout exactly; you only need to tell GenericCSVData, PandasData or PandasDirectData where each field sits in the data source, and set the position to -1 if a field is absent.
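As a minimal sketch of passing these positions (the file name and column layout are assumptions, not from the original article):

```python
import backtrader as bt
from datetime import datetime

# hypothetical CSV layout: date,open,high,low,close,volume (no open-interest column)
data = bt.feeds.GenericCSVData(
    dataname='../data/sh000300.csv',   # assumed path
    dtformat='%Y-%m-%d',
    datetime=0, open=1, high=2, low=3, close=4, volume=5,
    openinterest=-1,                   # -1: field not present in the file
    fromdate=datetime(2021, 9, 1),
    todate=datetime(2021, 9, 30),
)
```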
3. Data Indexing
self.datas is a list and can be indexed in several ways:
- Positional index: self.datas[N]; N from 0 to N-1 indexes forward, and -1 to -N indexes backward.
- Shorthand index: self.dataN (note: data, not datas), with N from 0 to N-1.
- Name index: self.getdatabyname('name'), where name is the table name set when the feed was added via adddata(data_feed, name=code).
- First feed: self.datas[0], self.data0 and self.data are equivalent.
```python
# access the close line of the first data feed
self.data.lines.close    # "lines" may be omitted: self.data.close
self.data.lines_close    # may be shortened to: self.data_close

# access the open line of the second data feed
self.data1.lines.open    # "lines" may be omitted: self.data1.open
self.data1.lines_open    # may be shortened to: self.data1_open

# Note: "lines" can only be omitted when accessing lines of self.datas;
# it cannot be omitted when accessing the lines of an indicator
```
Dates are indexed via self.datas[N].lines.datetime.date(M); the other fields (open, high, low, close, volume) are indexed via self.datas[N].lines.<field>[M].
datetime is stored as a float. To read it, convert with xxx.date(N), or use bt.num2date() to turn the float back into a datetime of the form "xxxx-xx-xx xx:xx:xx".
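A minimal round-trip sketch of this float encoding (the sample date is an assumption):

```python
from datetime import datetime
from backtrader.utils.date import date2num, num2date

dt = datetime(2021, 9, 1, 12, 0)   # noon
num = date2num(dt)                 # days since 0001-01-01; noon adds 0.5
print(num)                         # 738034.5
print(num2date(num))               # 2021-09-01 12:00:00
```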
4. Slicing
A slice of a line is obtained with the get method:

```python
self.data1.lines.close.get(ago=N, size=M)
```
- ago: the reference index at which the slice ends
- size: the number of values in the slice
- return value: an array [close[N-(M-1)], …, close[N-1], close[N]]

A usage sketch follows.
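A minimal sketch of slicing inside a strategy's next() (the 5-bar window is an assumption):

```python
import backtrader as bt

class SliceDemo(bt.Strategy):
    def next(self):
        # wait until at least 5 bars have been processed
        if len(self.data) >= 5:
            closes = self.data.close.get(ago=0, size=5)  # 5 closes ending at the current bar
            print(self.data.datetime.date(0), list(closes))
```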
5. Strategy Data Flow
- Total backtest length: N = self.data.buflen()
- Length already processed: len(self.data)
Index 0 means different things in __init__() and in next():
- In __init__(), index 0 corresponds to the end of the backtest window (todate). __init__() runs only once, so it is the place for expensive work such as computing indicators and trading signals, preparing everything next() will need.
- In next(), index 0 is the bar currently being processed; next() runs once for every step of the time dimension, and index 0 always points at the current time node.
In __init__(), index 0 is todate and index 1 is fromdate; both forward and backward access are supported:
- forward indices run 1, 2, …, N
- backward indices run 0, -1, -2, …, -(N-1)
In next(), index 0 is always the current time node; as the loop advances through the time dimension, index 0 moves with it. backward is the part already processed; forward is the part not yet reached.
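A minimal sketch pulling these pieces together (the SMA period is an assumption):

```python
import backtrader as bt

class FlowDemo(bt.Strategy):
    def __init__(self):
        # runs once: build indicators ahead of time for next()
        self.sma = bt.indicators.SMA(self.data.close, period=5)

    def next(self):
        # runs once per bar: index 0 is the bar being processed
        print('bar %d of %d, close=%.2f, sma=%.2f'
              % (len(self.data), self.data.buflen(),
                 self.data.close[0], self.sma[0]))
```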
6. Custom Data Classes
Passing the field positions as parameters on every load is tedious. Instead, you can subclass a loader such as GenericCSVData or PandasData and set the parameters once in the new class:
```python
class My_PandasData(bt.feeds.PandasData):
    params = (
        ('fromdate', datetime.datetime(2019, 1, 2)),
        ('todate', datetime.datetime(2021, 1, 28)),
        ('nullvalue', 0.0),
        ('dtformat', '%Y-%m-%d'),
        ('datetime', 0),
        ('time', -1),
        ('high', 3),
        ('low', 4),
        ('open', 2),
        ('close', 5),
        ('volume', 6),
        ('openinterest', -1),
    )
```
II. PandasData
1. PandasData Instantiation
PandasData Instantiation
PandasData picks up its metaclass through the inheritance chain starting at AbstractDataBase. When PandasData is instantiated, MetaBase's __call__ runs first; its code is:

```python
class MetaBase(type):
    def doprenew(cls, *args, **kwargs):
        return cls, args, kwargs

    def donew(cls, *args, **kwargs):
        _obj = cls.__new__(cls, *args, **kwargs)
        return _obj, args, kwargs

    def dopreinit(cls, _obj, *args, **kwargs):
        return _obj, args, kwargs

    def doinit(cls, _obj, *args, **kwargs):
        _obj.__init__(*args, **kwargs)
        return _obj, args, kwargs

    def dopostinit(cls, _obj, *args, **kwargs):
        return _obj, args, kwargs

    def __call__(cls, *args, **kwargs):
        cls, args, kwargs = cls.doprenew(*args, **kwargs)
        _obj, args, kwargs = cls.donew(*args, **kwargs)
        _obj, args, kwargs = cls.dopreinit(_obj, *args, **kwargs)
        _obj, args, kwargs = cls.doinit(_obj, *args, **kwargs)
        _obj, args, kwargs = cls.dopostinit(_obj, *args, **kwargs)
        return _obj
```
__call__ executes doprenew, donew, dopreinit, doinit and dopostinit in order. At doprenew there is no instance yet, so only cls is returned; the instance is created in donew, and the later hooks receive the object.
- doprenew: no subclass of MetaBase overrides doprenew, so MetaBase's own version runs.
- donew: MetaLineSeries overrides donew:
```python
def donew(cls, *args, **kwargs):
    '''
    Intercept instance creation, take over lines/plotinfo/plotlines
    class attributes by creating corresponding instance variables
    and add aliases for "lines" and the "lines" held within it
    '''
    # _obj.plotinfo shadows the plotinfo (class) definition in the class
    plotinfo = cls.plotinfo()

    for pname, pdef in cls.plotinfo._getitems():
        setattr(plotinfo, pname, kwargs.pop(pname, pdef))

    # Create the object and set the params in place
    _obj, args, kwargs = super(MetaLineSeries, cls).donew(*args, **kwargs)

    # set the plotinfo member in the class
    _obj.plotinfo = plotinfo

    # _obj.lines shadows the lines (class) definition in the class
    _obj.lines = cls.lines()

    # _obj.plotinfo shadows the plotinfo (class) definition in the class
    _obj.plotlines = cls.plotlines()

    # add aliases for lines and for the lines class itself
    _obj.l = _obj.lines
    if _obj.lines.fullsize():
        _obj.line = _obj.lines[0]

    for l, line in enumerate(_obj.lines):
        setattr(_obj, 'line_%s' % l, _obj._getlinealias(l))
        setattr(_obj, 'line_%d' % l, line)
        setattr(_obj, 'line%d' % l, line)

    # Parameter values have now been set before __init__
    return _obj, args, kwargs
```
- MetaLineSeries calls its parent's donew to create the instance and map the parameters.
- The parent of MetaLineSeries is MetaLineRoot, whose donew is:
```python
class MetaLineRoot(metabase.MetaParams):
    '''
    Once the object is created (effectively pre-init) the "owner" of this
    class is sought
    '''

    def donew(cls, *args, **kwargs):
        _obj, args, kwargs = super(MetaLineRoot, cls).donew(*args, **kwargs)

        # Find the owner and store it
        # startlevel = 4 ... to skip intermediate call stacks
        ownerskip = kwargs.pop('_ownerskip', None)
        _obj._owner = metabase.findowner(_obj,
                                         _obj._OwnerCls or LineMultiple,
                                         skip=ownerskip)

        # Parameter values have now been set before __init__
        return _obj, args, kwargs
```
- MetaLineRoot then calls its parent's donew. Its parent is MetaParams, which in turn reaches MetaBase's donew; donew finally instantiates the PandasData class and maps the parameters onto class attributes.
- After the parent classes have done their part, the Lines object still has to be instantiated.
Lines Instantiation

Lines is an ordinary class: instantiation goes through __new__ and then __init__, whose code is:

```python
class Lines(object):
    def __init__(self, initlines=None):
        '''
        Create the lines recording during "_derive" or else use the
        provided "initlines"
        '''
        self.lines = list()
        for line, linealias in enumerate(self._getlines()):
            kwargs = dict()
            self.lines.append(LineBuffer(**kwargs))

        # Add the required extralines
        for i in range(self._getlinesextra()):
            if not initlines:
                self.lines.append(LineBuffer())
            else:
                self.lines.append(initlines[i])
```
- The lines container is initialized, and a LineBuffer is instantiated for each line. The initial lines are close, low, high, open, volume and openinterest, plus datetime.
- Any extra lines are likewise instantiated as LineBuffer objects. A quick check follows.
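As a quick check (assuming data is the PandasData feed built earlier), the default line aliases can be listed with getlinealiases():

```python
# list the line aliases held by a data feed
print(data.getlinealiases())
# typically: ('close', 'low', 'high', 'open', 'volume', 'openinterest', 'datetime')
```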
LineBuffer Instantiation
LineBuffer also has a metaclass, so its instantiation is driven by MetaBase. doprenew runs first; nothing in the LineBuffer hierarchy overrides it, so MetaBase's doprenew is used.
donew executes MetaLineRoot's donew, already listed above.
- MetaLineRoot's parent MetaParams runs its donew, which again reaches MetaBase's donew; this instantiates the LineBuffer class and maps the parameters onto class attributes.
- findowner is called to locate the owner of the new LineBuffer instance, which is the PandasData instance.
dopreinit is not overridden anywhere in the LineBuffer hierarchy, so MetaBase's dopreinit runs and returns immediately.
doinit invokes LineBuffer's __init__:

```python
def __init__(self):
    self.lines = [self]
    self.mode = self.UnBounded
    self.bindings = list()
    self.reset()
    self._tz = None
```
- The buffer first adds itself to lines.
- Plain attributes are initialized.
- reset() re-creates the in-memory storage structure and the index.
With that, PandasData instantiation is complete.
2. PandasData Initialization
After donew completes the instantiation, dopreinit runs next. Among PandasData's ancestors, MetaAbstractDataBase overrides dopreinit:

```python
def dopreinit(cls, _obj, *args, **kwargs):
    _obj, args, kwargs = \
        super(MetaAbstractDataBase, cls).dopreinit(_obj, *args, **kwargs)

    # Find the owner and store it
    _obj._feed = metabase.findowner(_obj, FeedBase)

    _obj.notifs = collections.deque()  # store notifications for cerebro

    _obj._dataname = _obj.p.dataname
    _obj._name = ''
    return _obj, args, kwargs
```
- The parent's dopreinit is called first, ultimately reaching MetaBase's dopreinit.
- findowner looks for PandasData's owner and returns None: PandasData initiated the instantiation, so it has no owner.
- notifs is initialized as a deque that stores notifications destined for Cerebro.
- The name _name is set to an empty string.
After dopreinit comes doinit, which runs PandasData's own __init__:

```python
def __init__(self):
    super(PandasData, self).__init__()

    # these "colnames" can be strings or numeric types
    colnames = list(self.p.dataname.columns.values)
    if self.p.datetime is None:
        # datetime is expected as index col and hence not returned
        pass

    # try to autodetect if all columns are numeric
    cstrings = filter(lambda x: isinstance(x, string_types), colnames)
    colsnumeric = not len(list(cstrings))

    # Where each datafield find its value
    self._colmapping = dict()

    # Build the column mappings to internal fields in advance
    for datafield in self.getlinealiases():
        defmapping = getattr(self.params, datafield)

        if isinstance(defmapping, integer_types) and defmapping < 0:
            # autodetection requested
            for colname in colnames:
                if isinstance(colname, string_types):
                    if self.p.nocase:
                        found = datafield.lower() == colname.lower()
                    else:
                        found = datafield == colname

                    if found:
                        self._colmapping[datafield] = colname
                        break

            if datafield not in self._colmapping:
                # autodetection requested and not found
                self._colmapping[datafield] = None
                continue
        else:
            # all other cases -- used given index
            self._colmapping[datafield] = defmapping
```
- PandasData's parent __init__ (via the MetaAbstractDataBase chain) is called first.
- The column names of the input dataname are recorded in colnames.
- The datetime parameter usually does not need to be supplied; it indicates which column holds the datetime, normally the first column (index 0).
- If any column names are numeric, they are recorded directly into _colmapping: when a parameter specifies the column position, the system uses the number instead of matching by name.
- The DataFrame column names are mapped to PandasData's data fields. The default fields are ['datetime', 'open', 'high', 'low', 'close', 'volume', 'openinterest']. The mapping is stored in the _colmapping dict, giving e.g.: {'close': 'close', 'low': 'low', 'high': 'high', 'open': 'open', 'volume': 'volume'}.
PandasData initialization is now complete.
3. PandasData Data Loading
If no preloaded data was given when Cerebro was initialized, Cerebro preloads the data at run time (in Cerebro's runstrategies function). Before preloading, each feed must be reset. In the PandasData hierarchy, Lines implements reset by iterating over its lines (the LineBuffer instances for close, low, high, open, volume, openinterest and datetime) and resetting each one; reset essentially creates a fresh array.array to store the data.

```python
if not predata:
    for data in self.datas:
        data.reset()
        if self._exactbars < 1:  # datas can be full length
            data.extend(size=self.params.lookahead)

        data._start()
        if self._dopreload:
            data.preload()
```
The _start function is defined in PandasData's ancestor AbstractDataBase (its helper _start_finish is listed in full below):

```python
def _start(self):
    self.start()

    if not self._started:
        self._start_finish()
```
The start method is defined on the PandasData class itself:

```python
def start(self):
    super(PandasData, self).start()

    # reset the length with each start
    self._idx = -1

    # Transform names (valid for .ix) into indices (good for .iloc)
    if self.p.nocase:
        colnames = [x.lower() for x in self.p.dataname.columns.values]
    else:
        colnames = [x for x in self.p.dataname.columns.values]

    for k, v in self._colmapping.items():
        if v is None:
            continue  # special marker for datetime
        if isinstance(v, string_types):
            try:
                if self.p.nocase:
                    v = colnames.index(v.lower())
                else:
                    v = colnames.index(v)
            except ValueError as e:
                defmap = getattr(self.params, k)
                if isinstance(defmap, integer_types) and defmap < 0:
                    v = None
                else:
                    raise e  # let user now something failed

        self._colmapping[k] = v
```
- The parent's start method is called.
- The row index is initialized to -1, so the first increment yields the starting index 0.
- colnames stores the column names of the original Pandas DataFrame.
- During initialization _colmapping recorded the source column names; start rewrites them as column indices, e.g. {'close': 4, 'low': 3, 'high': 2, 'open': 1, 'volume': 5, 'openinterest': None, 'datetime': None}. The last two have no matching column; datetime is absent from colnames because date serves as the index of the source data.
The call to the parent's start propagates up the hierarchy and finally reaches the start defined in AbstractDataBase:

```python
def start(self):
    self._barstack = collections.deque()
    self._barstash = collections.deque()
    self._laststatus = self.CONNECTED
```
The _start_finish function is defined in PandasData's ancestor AbstractDataBase as:

```python
def _start_finish(self):
    # A live feed (for example) may have learnt something about the
    # timezones after the start and that's why the date/time related
    # parameters are converted at this late stage

    # Get the output timezone (if any)
    self._tz = self._gettz()
    # Lines have already been create, set the tz
    self.lines.datetime._settz(self._tz)

    # This should probably be also called from an override-able method
    self._tzinput = bt.utils.date.Localizer(self._gettzinput())

    # Convert user input times to the output timezone (or min/max)
    if self.p.fromdate is None:
        self.fromdate = float('-inf')
    else:
        self.fromdate = self.date2num(self.p.fromdate)

    if self.p.todate is None:
        self.todate = float('inf')
    else:
        self.todate = self.date2num(self.p.todate)

    # FIXME: These two are never used and could be removed
    self.sessionstart = time2num(self.p.sessionstart)
    self.sessionend = time2num(self.p.sessionend)

    self._calendar = cal = self.p.calendar
    if cal is None:
        self._calendar = self._env._tradingcal
    elif isinstance(cal, string_types):
        self._calendar = PandasMarketCalendar(calendar=cal)

    self._started = True
```
- The feed's timezone is set, along with the timezone of the datetime line.
- The user-supplied from/to dates are converted to numeric form: January 1 of year 1 at 00:00 counts as 1, each day adds 1, and fractions of a day are encoded proportionally (noon is 0.5). Encoding a timestamp as a unique float makes the data fast to process.
- The calendar information is recorded in _calendar.
- The feed is flagged as started.
The preload function is defined in AbstractDataBase; it calls load in a loop until the data is exhausted:

```python
def preload(self):
    while self.load():
        pass

    self._last()
    self.home()

def load(self):
    while True:
        # move data pointer forward for new bar
        self.forward()

        if self._fromstack():  # bar is available
            return True

        if not self._fromstack(stash=True):
            _loadret = self._load()
            if not _loadret:
                # no bar use force to make sure in exactbars
                # the pointer is undone this covers especially (but not
                # uniquely) the case in which the last bar has been seen
                # and a backwards would ruin pointer accounting in the
                # "stop" method of the strategy
                self.backwards(force=True)  # undo data pointer

                # return the actual returned value which may be None to
                # signal no bar is available, but the data feed is not
                # done. False means game over
                return _loadret

        # Get a reference to current loaded time
        dt = self.lines.datetime[0]

        # A bar has been loaded, adapt the time
        if self._tzinput:
            # Input has been converted at face value but it's not UTC in
            # the input stream
            dtime = num2date(dt)  # get it in a naive datetime
            # localize it
            dtime = self._tzinput.localize(dtime)  # pytz compatible-ized
            self.lines.datetime[0] = dt = date2num(dtime)  # keep UTC val

        # Check standard date from/to filters
        if dt < self.fromdate:
            # discard loaded bar and carry on
            self.backwards()
            continue
        if dt > self.todate:
            # discard loaded bar and break out
            self.backwards(force=True)
            break

        # Pass through filters
        retff = False
        for ff, fargs, fkwargs in self._filters:
            # previous filter may have put things onto the stack
            if self._barstack:
                for i in range(len(self._barstack)):
                    self._fromstack(forward=True)
                    retff = ff(self, *fargs, **fkwargs)
            else:
                retff = ff(self, *fargs, **fkwargs)

            if retff:  # bar removed from systemn
                break  # out of the inner loop

        if retff:  # bar removed from system - loop to get new bar
            continue  # in the greater loop

        # Checks let the bar through ... notify it
        return True

    # Out of the loop ... no more bars or past todate
    return False
```
- forward is called to move the data pointer ahead.
- _fromstack tries to fetch a bar from _barstack or _barstash; right after _start both are empty, so nothing can be fetched yet.
- Once a bar is loaded, if an input timezone was supplied the time is localized and the datetime line is updated with the new numeric date.
- If the bar falls before fromdate or after todate, backwards is called to discard it.
The forward function is defined in PandasData's ancestor LineSeries:

```python
def forward(self, value=NAN, size=1):
    ''' Proxy line operation '''
    for line in self.lines:
        line.forward(value, size=size)
```
LineBuffer's forward is implemented as:

```python
def forward(self, value=NAN, size=1):
    ''' Moves the logical index foward and enlarges the buffer as much as needed

    Keyword Args:
        value (variable): value to be set in new positins
        size (int): How many extra positions to enlarge the buffer
    '''
    self.idx += size
    self.lencount += size

    for i in range(size):
        self.array.append(value)
```
- The index advances by 1 (the default step); idx starts at -1, so the first forward call brings it to 0.
- The length counter grows by the same amount.
- NAN placeholders are appended to the array as initial values.
The PandasData class overrides _load:

```python
def _load(self):
    self._idx += 1

    if self._idx >= len(self.p.dataname):
        # exhausted all rows
        return False

    # Set the standard datafields
    for datafield in self.getlinealiases():
        if datafield == 'datetime':
            continue

        colindex = self._colmapping[datafield]
        if colindex is None:
            # datafield signaled as missing in the stream: skip it
            continue

        # get the line to be set
        line = getattr(self.lines, datafield)

        # indexing for pandas: 1st is colum, then row
        line[0] = self.p.dataname.iloc[self._idx, colindex]

    # datetime conversion
    coldtime = self._colmapping['datetime']

    if coldtime is None:
        # standard index in the datetime
        tstamp = self.p.dataname.index[self._idx]
    else:
        # it's in a different column ... use standard column index
        tstamp = self.p.dataname.iloc[self._idx, coldtime]

    # convert to float via datetime and store it
    dt = tstamp.to_pydatetime()
    dtnum = date2num(dt)
    self.lines.datetime[0] = dtnum

    # Done ... return
    return True
```
- The row index is incremented first, starting from 0; once it reaches the number of rows in the source data, loading is finished.
- For each line alias (close, low, high, open, volume, openinterest), the matching source column index is looked up in _colmapping and the value from that column is written into the line's array.
- datetime is handled separately: it normally sits in the DataFrame index, so the timestamp is taken from the index (or from its mapped column), converted to a float via date2num, and stored in the datetime line.
The backwards call goes through LineSeries straight to LineBuffer's backwards:

```python
def backwards(self, size=1, force=False):
    ''' Moves the logical index backwards and reduces the buffer as much as needed

    Keyword Args:
        size (int): How many extra positions to rewind and reduce the buffer
    '''
    # Go directly to property setter to support force
    self.set_idx(self._idx - size, force=force)
    self.lencount -= size
    for i in range(size):
        self.array.pop()
```
- idx is rewound; when the first bar was being added it was 0, so it goes back to -1.
- The length counter shrinks by the same step.
- The most recently appended value is popped off the array.
Cerebro only calls preload in runstrategies when a condition holds:

```python
if self._dopreload:
    data.preload()
```
There are two cases in which data is not preloaded:
- the feeds include live data;
- the feeds include resampled or replayed data.
In those cases the data is loaded bar by bar inside next.
4. Resampling
Resampling
Resampling converts fine-grained data into coarser-grained data, for example daily bars into weekly bars.
Cerebro's resampledata is defined as:

```python
def resampledata(self, dataname, name=None, **kwargs):
    '''
    Adds a ``Data Feed`` to be resample by the system

    If ``name`` is not None it will be put into ``data._name`` which is
    meant for decoration/plotting purposes.

    Any other kwargs like ``timeframe``, ``compression``, ``todate`` which
    are supported by the resample filter will be passed transparently
    '''
    if any(dataname is x for x in self.datas):
        dataname = dataname.clone()

    dataname.resample(**kwargs)
    self.adddata(dataname, name=name)
    self._doreplay = True

    return dataname
```
- If the feed to be resampled is already among the added feeds, an identical clone of it is created.
- The feed's resample method is then called.
The feed's resample method is defined in AbstractDataBase:

```python
def resample(self, **kwargs):
    self.addfilter(Resampler, **kwargs)

def replay(self, **kwargs):
    self.addfilter(Replayer, **kwargs)

def addfilter(self, p, *args, **kwargs):
    if inspect.isclass(p):
        pobj = p(self, *args, **kwargs)
        self._filters.append((pobj, [], {}))

        if hasattr(pobj, 'last'):
            self._ffilters.append((pobj, [], {}))
    else:
        self._filters.append((p, args, kwargs))
```
- addfilter installs the filter class (Resampler); it accepts either a class or an instance.
- The resampled feed is identical to the original; the only difference is the attached Resampler object, which processes the bars as they are loaded. A usage sketch follows.
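A minimal usage sketch resampling a daily feed to weekly bars (the file path and dates are assumptions):

```python
import backtrader as bt
import pandas as pd
from datetime import datetime

cerebro = bt.Cerebro()
df = pd.read_csv('../data/sh000300.csv', index_col='date', parse_dates=True)  # assumed path
daily = bt.feeds.PandasData(dataname=df,
                            fromdate=datetime(2021, 1, 1),
                            todate=datetime(2021, 12, 31))

cerebro.adddata(daily, name='daily')                  # original daily bars
cerebro.resampledata(daily, name='weekly',
                     timeframe=bt.TimeFrame.Weeks,
                     compression=1)                   # clone resampled to weekly
cerebro.run()
```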
Resampler Instantiation
Resampler inherits from _BaseResampler, whose metaclass is metabase.MetaParams; MetaParams derives from MetaBase. Resampler instantiation therefore also runs through MetaBase's doprenew (listed earlier) and MetaParams's donew, which maps the parameters onto attributes and completes the instantiation.

```python
class _BaseResampler(with_metaclass(metabase.MetaParams, object)):
    params = (
        ('bar2edge', True),
        ('adjbartime', True),
        ('rightedge', True),
        ('boundoff', 0),

        ('timeframe', TimeFrame.Days),
        ('compression', 1),

        ('takelate', True),

        ('sessionend', True),
    )

    def __init__(self, data):
        self.subdays = TimeFrame.Ticks < self.p.timeframe < TimeFrame.Days
        self.subweeks = self.p.timeframe < TimeFrame.Weeks

        self.componly = (not self.subdays and
                         data._timeframe == self.p.timeframe and
                         not (self.p.compression % data._compression))

        self.bar = _Bar(maxdate=True)  # bar holder
        self.compcount = 0  # count of produced bars to control compression

        self._firstbar = True
        self.doadjusttime = (self.p.bar2edge and self.p.adjbartime and
                             self.subweeks)

        self._nexteos = None

        # Modify data information according to own parameters
        data.resampling = 1
        data.replaying = self.replaying
        data._timeframe = self.p.timeframe
        data._compression = self.p.compression

        self.data = data


class Resampler(_BaseResampler):
    params = (
        ('bar2edge', True),
        ('adjbartime', True),
        ('rightedge', True),
    )

    replaying = False

    def last(self, data):
        ...  # body omitted in the source

    def __call__(self, data, fromcheck=False, forcedata=None):
        ...  # body omitted in the source
```
Resampler Initialization
After instantiation comes initialization. _BaseResampler's __init__ mainly initializes the parameters and binds the resampler to the data object.

| Parameter | Default | Meaning |
|---|---|---|
| adjbartime | True | Stamp the resampled bar with the boundary time instead of the last seen timestamp. With a 5-second target, the bar time becomes hh:mm:05 even if the last bar inside the window was stamped hh:mm:04.33. |
| bar2edge | True | Resample toward time boundaries. Resampling ticks to a 5-second granularity aligns the output at xx:00, xx:05, xx:10. |
| boundoff | 0 | Shift the resampling window forward by a number of units. When compressing 1-minute bars into 15-minute bars, the default window is 00:01:00 to 00:15:00 (15 one-minute bars per 15-minute bar); with boundoff=1 the window shifts one unit, so 00:00:00 to 00:14:00 likewise yields one 15-minute bar. |
| compression | 1 | Compression ratio; e.g. compression=2 compresses 2 source bars into 1 target bar. |
| rightedge | True | Use the right edge of the time boundary as the resampled timestamp. With a 5-second target covering hh:mm:00 to hh:mm:04: False stamps the bar hh:mm:00 (start of the boundary), True stamps it hh:mm:05 (end of the boundary). |
Resampler Data Loading
Data loading through a Resampler is similar to a normal feed, except that a DataClone is loaded. runstrategies calls the _start function that DataClone overrides:

```python
class DataClone(AbstractDataBase):
    _clone = True

    def __init__(self):
        self.data = self.p.dataname
        self._dataname = self.data._dataname

        # Copy date/session parameters
        self.p.fromdate = self.p.fromdate
        self.p.todate = self.p.todate
        self.p.sessionstart = self.data.p.sessionstart
        self.p.sessionend = self.data.p.sessionend

        self.p.timeframe = self.data.p.timeframe
        self.p.compression = self.data.p.compression

    def _start(self):
        # redefine to copy data bits from guest data
        self.start()

        # Copy tz infos
        self._tz = self.data._tz
        self.lines.datetime._settz(self._tz)

        self._calendar = self.data._calendar

        # input has already been converted by guest data
        self._tzinput = None  # no need to further converr

        # Copy dates/session infos
        self.fromdate = self.data.fromdate
        self.todate = self.data.todate

        # FIXME: if removed from guest, remove here too
        self.sessionstart = self.data.sessionstart
        self.sessionend = self.data.sessionend

    def start(self):
        super(DataClone, self).start()
        self._dlen = 0
        self._preloading = False
```
- DataClone's parent AbstractDataBase start is called.
- The timezone of the lines is set.
- The start/end dates and session information are copied over.
DataClone's next is invoked, which in turn calls DataClone's _load:

```python
def _load(self):
    # assumption: the data is in the system
    # simply copy the lines
    if self._preloading:
        # data is preloaded, we are preloading too, can move
        # forward until have full bar or data source is exhausted
        self.data.advance()
        if len(self.data) > self.data.buflen():
            return False

        for line, dline in zip(self.lines, self.data.lines):
            line[0] = dline[0]

        return True

    # Not preloading
    if not (len(self.data) > self._dlen):
        # Data not beyond last seen bar
        return False

    self._dlen += 1

    for line, dline in zip(self.lines, self.data.lines):
        line[0] = dline[0]

    return True
```
Resampling Use Cases
- Within the Backtrader framework, resampling is mainly used to adapt data to a different time interval or frequency.
- Interval conversion: the interval of the raw data may not suit the strategy or the analysis. The raw data may be minute bars while the strategy needs a longer interval, such as hourly or daily bars; cerebro.resampledata() resamples the data to the required interval.
- Smoothing: resampling to a larger interval reduces the data's volatility, making market trends and patterns easier to observe.
- Cross-market alignment: when a strategy trades several markets whose data arrive at different intervals, cerebro.resampledata() can bring them onto a common interval.
- Prediction models: forecast-driven strategies may need input at a specific frequency, such as daily closes rather than higher-frequency data; cerebro.resampledata() adapts the data to the model's input requirements.
5. Custom Data Fields
The default lines are close, low, high, open, volume, openinterest and datetime. If stock selection needs more data, such as PE, ROE or turnover, you can define a data class that inherits from PandasData:

```python
class MyCustomdata(PandasData):
    lines = ('turnover',)
    params = (('turnover', -1),)
```
- One line is added here (several can be added at once); the remaining lines are inherited from PandasData.
- A parameter is added indicating which column of the source pandas.DataFrame feeds the turnover line. A value of -1 tells the system to match the column by name against the DataFrame's column names.
The custom class is used like this:

```python
stock_hfq_df = pd.read_csv("../data/sh600000.csv", index_col='date', parse_dates=True)
start_date = datetime(2021, 9, 1)    # backtest start
end_date = datetime(2021, 9, 30)     # backtest end
data = MyCustomdata(dataname=stock_hfq_df,
                    fromdate=start_date,
                    todate=end_date)

# inside a strategy:
def next(self):
    self.log('Close:%.3f' % self.data0.close[0])
    self.log('turnover, %.8f' % self.data0.turnover[0])
```
III. Data Storage Formats
Common storage formats for Pandas data include CSV, HDF5, Parquet, Feather and Pickle.
The read/write performance of the different formats can be benchmarked as follows:

```python
import time
import pandas as pd

def timed(label, func, nrows):
    """Run func once, print elapsed seconds and throughput in rows/s."""
    start = time.time()
    result = func()
    elapsed = time.time() - start
    print(label, elapsed, "s ", nrows / elapsed, "row/s")
    return result

def bench(data, stem, hdf_key):
    """Round-trip one DataFrame through Parquet, Pickle, Feather, CSV and HDF5."""
    n = len(data)
    timed("to_parquet ", lambda: data.to_parquet(stem + ".parquet"), n)
    data = timed("read_parquet ", lambda: pd.read_parquet(stem + ".parquet"), n)
    timed("to_pickle ", lambda: data.to_pickle(stem + ".pickle"), n)
    data = timed("read_pickle ", lambda: pd.read_pickle(stem + ".pickle"), n)
    data = data.reset_index()  # feather requires a default RangeIndex
    timed("to_feather ", lambda: data.to_feather(stem + ".feather"), n)
    data = timed("read_feather ", lambda: pd.read_feather(stem + ".feather"), n)
    timed("to_csv ", lambda: data.to_csv(stem + ".csv", chunksize=20000), n)
    data = timed("read_csv ", lambda: pd.read_csv(stem + ".csv"), n)
    timed("to_hdf ", lambda: data.to_hdf(stem + ".h5", key=hdf_key, mode='w',
                                         complevel=9, data_columns=True), n)

if __name__ == "__main__":
    # narrow table: 1-minute bars of a single stock
    start = time.time()
    data = pd.read_hdf("/home/samba/test/Market/stocks/stocks_post_1min/000004.XSHE.h5")
    print(data.shape)
    print("read_hdf ", len(data) / (time.time() - start), "row/s")
    bench(data, "/home/samba/test/000004.XSHE", hdf_key='data')

    # wide table: a 5-minute factor matrix
    start = time.time()
    data = pd.read_hdf("/home/samba/test/Market/factors/factors_post_5min/roc96_sp1000.h5",
                       key='roc96')
    print(data.shape)
    print("read_hdf ", len(data) / (time.time() - start), "row/s")
    bench(data, "/home/samba/test/roc96_sp1000", hdf_key='roc96')
```
- Pandas read/write performance: (the benchmark results were not preserved in the source).
- Based on the read/write and compression performance over both the narrow and the wide table, Feather was selected as the data storage format. A minimal workflow sketch follows.
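A minimal sketch of the resulting workflow (the file path and column names are assumptions): store bars as Feather, then load them into Backtrader via PandasData:

```python
import backtrader as bt
import pandas as pd
from datetime import datetime

# one-time conversion: CSV -> Feather (feather needs a default integer index)
df = pd.read_csv('../data/sh000300.csv', parse_dates=['date'])  # assumed path/columns
df.to_feather('../data/sh000300.feather')

# fast load at backtest time
df = pd.read_feather('../data/sh000300.feather').set_index('date')
data = bt.feeds.PandasData(dataname=df,
                           fromdate=datetime(2021, 9, 1),
                           todate=datetime(2021, 9, 30))
```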