Turtle Trading Rules
Chapter 23 The Statistical Basis of Historical Testing
Trading poorly is like learning to juggle while standing on a boat in a storm. It can certainly be done, but it is much easier to juggle standing on solid ground.
Now that you know some of the main reasons why test results based on historical data do not match reality, you may be thinking, "In that case, how can I know how much return to expect?" or "How can I avoid all the problems described in Chapter 11?" or "How do I test my system the right way?" This chapter discusses the general principles of historical testing. To master it, you must have a solid understanding of the root causes of the forecasting biases described in the previous chapter, so if you only skimmed that chapter, I suggest you re-read it carefully first.
When looking at historical simulation results, you get at best a rough sense of future performance. Fortunately, even a rough sense can give a good trader a big enough edge. To understand the factors that affect the margin of error (or roughness) of that sense, you need to grasp a few basic statistical concepts that form the theoretical basis of historical testing. I don't like books that are full of mathematical formulas and long expositions, so I will use as little math as possible and keep the discussion concise.
Validity of Test Samples
Appropriate testing takes into account both the statistical concepts that affect the explanatory power of a test and the inherent limitations of those interpretations. Improper testing can create false confidence when in reality there is little or no guarantee of the predictive value of the test results. In fact, bad tests can give completely wrong answers.
Most of the reasons why a historical simulation is at best a rough estimate of the future have already been covered. This chapter looks at how to improve the predictive value of your tests and get the best rough estimate possible.
Inferring the characteristics of a population from the characteristics of a sample is a branch of statistics, and it is the theoretical basis for the predictive value of historical test results. The core idea is that if you have a large enough sample, you can use that sample to approximate the population. Therefore, if you have studied enough of the historical trading record of a particular strategy, you can draw conclusions about the future potential of that kind of system. Pollsters use this method to infer the views of the general public. For example, they might survey 500 random people in a state to infer the views of voters across that state. Similarly, scientists can judge the effectiveness of a drug for a disease based on a relatively small group of patients, because such conclusions are statistically valid.
The statistical validity of sample analysis is affected by two factors: the size of the sample and how representative the sample is of the population. Conceptually, many traders and newcomers to system testing know what sample size means, but they think it refers only to the number of trades they are testing. They do not understand that if a rule or concept applies to only a handful of trades, then even a test covering thousands of trades is not enough to ensure statistical validity.
They also often ignore how representative the sample is of the population, because this is a complex issue that is hard to measure without some subjective analysis. System testers assume that past conditions are representative of future conditions. If that is true and we have a large enough sample, we can draw conclusions from the past and apply them to future trades. But if our sample is not representative of the future, then our tests are useless and tell us nothing about the system's future performance, so this assumption is crucial. A representative sample of 500 people with a margin of error of less than 2 percent may be enough to tell us who the next president will be, but does a random sample of 500 people taken at the Democratic National Convention reflect the will of the entire American electorate? Of course not. The sample is not representative of the population: it includes only Democrats, while the actual electorate also includes many Republicans, and who the Republicans vote for may not match your poll results. Make a sampling error like this and you can still reach a conclusion, perhaps the conclusion you want to see, but not necessarily the correct one.
Pollsters know that how representative a sample is of the population is the key question, and pollsters who draw inaccurate conclusions from unrepresentative samples get fired. In the world of trading this is also a key question, but unfortunately traders are not like pollsters: most pollsters understand sampling statistics, and most traders do not. Traders' recency bias is perhaps the most common sign of this. Traders focus only on recent trades, or run historical tests using only recent data, which is just like polling voters at the Democratic convention.
The problem with short-term testing is that the market may pass through only one or two states over a short period, rather than all four market states discussed earlier. For example, if the market has been in a stable and volatile state, mean-reversion and counter-trend strategies work very well. But if the state of the market changes, the method you tested may no longer work as well. So your testing method must do everything possible to make the samples you test representative of the future.
Measuring Robustness of Metrics
In system testing, what you are doing is observing relative performance, analyzing future potential, and deciding whether a particular idea is worthwhile. The problem is that the generally accepted performance measures are not very stable; they are not robust enough. This makes it very difficult to judge the relative merit of an idea, because small changes in just a few trades can have a huge impact on the values of these volatile indicators. That instability can lead testers to overestimate an idea, or to blindly discard an idea with real potential because unstable measures keep it from showing the potential it should.
A statistical measure is said to be robust if a slight change in the data does not significantly affect it. The commonly used indicators, however, are too sensitive to changes in the data, so they are not robust. Because of this, when we run historical simulations of trading systems, slight changes in parameter values can lead to large changes in the values of some measures. These measures are inherently unstable: any factor that affects the data has an outsized impact on the test results, which invites curve fitting and can easily mislead you with unrealistic results. To test the Turtle approach effectively, the first thing we need to do is overcome this problem and find robust performance measures.
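To make the idea of robustness concrete, here is a classic illustration from robust statistics (my own sketch, not an example from the book): the mean of a small series is dragged around by a single extreme value, while the median barely moves.

```python
# A classic robust-statistics illustration: the mean reacts strongly to one
# outlier, while the median (a robust measure) barely changes.
from statistics import mean, median

returns = [0.8, 1.2, 0.9, 1.1, 1.0, 0.7, 1.3]   # hypothetical monthly returns, in percent
shocked = returns[:-1] + [9.0]                   # swap the last value for an outlier

print(f"mean:   {mean(returns):.2f} -> {mean(shocked):.2f}")      # 1.00 -> 2.10
print(f"median: {median(returns):.2f} -> {median(shocked):.2f}")  # 1.00 -> 1.00
```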
Bill Eckhardt asked me this question during my initial interview for the Turtle program: "Do you know what a robust statistical measure is?" I sat blankly for a few seconds and then confessed, "I don't know." Now I can answer the question. In fact, there is a branch of mathematics that deals with incomplete information and flawed assumptions; it is called robust statistics.
It is clear from this question that Bill had a clear understanding of the imperfect nature of testing and of research based on historical data, and of the study of uncertainty, which was not only valuable then but is still valuable today. I believe this is one of the reasons Bill was able to achieve such impressive performance.
This is yet more proof of just how far ahead of its time Rich and Bill's research and thinking was. The more I learn, the more I am in awe of their contributions to the field. But I was also surprised to find that the trading industry has not progressed much since Rich and Bill met in 1983.
The previous chapters used the MAR ratio, CAGR (compound annual growth rate), and the Sharpe ratio as measures of relative performance. But these measures are not robust, because they are very sensitive to the start and end dates of the test period. This is especially true for tests shorter than 10 years. Let's see what happens if we shift the start and end dates of a test by a few months. Suppose we begin the test on February 1, 1996, instead of January 1, and run it through April 30, 2006, instead of June 30. In other words, we drop the first month and the last two months.
During the initial test period, the triple moving average system produced a return of 43.2%, a MAR ratio of 1.39, and a Sharpe ratio of 1.25. After the start and end dates were modified, the return rose to 46.2%, the MAR ratio improved to 1.61, and the Sharpe ratio improved to 1.37. The initial test of the ATR channel breakout system produced a return of 51.7%, a MAR ratio of 1.31, and a Sharpe ratio of 1.39. With the adjusted start and end dates, the return climbed to 54.9%, the MAR ratio rose to 1.49, and the Sharpe ratio increased to 1.47.
These three measures are so sensitive because the return component is highly sensitive to the start and end of the test period, and return is an element of both the MAR ratio and the Sharpe ratio (the MAR ratio uses CAGR, while the Sharpe ratio uses the average monthly return). The maximum drawdown measure is also highly sensitive to the start and end dates whenever the drawdown occurs near the beginning or end of the test period. This makes the MAR ratio particularly sensitive, because both its numerator and its denominator are sensitive to the test start and end dates, and the two effects compound in the calculation.
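Here is a minimal sketch of the two ratios as just described; the exact conventions behind the book's test results, such as the risk-free rate, the annualization factor, and the drawdown definition, are my assumptions.

```python
import numpy as np

def mar_ratio(equity: np.ndarray, years: float) -> float:
    """MAR ratio = CAGR divided by the maximum drawdown of the equity curve."""
    cagr = (equity[-1] / equity[0]) ** (1.0 / years) - 1.0
    running_peak = np.maximum.accumulate(equity)
    max_drawdown = np.max((running_peak - equity) / running_peak)
    return cagr / max_drawdown

def sharpe_ratio(monthly_returns: np.ndarray, risk_free_monthly: float = 0.0) -> float:
    """Annualized Sharpe ratio built from the average monthly return and its volatility."""
    excess = monthly_returns - risk_free_monthly
    return np.sqrt(12) * excess.mean() / excess.std(ddof=1)

# Hypothetical one-year monthly equity curve.
equity = np.array([100, 92, 98, 105, 112, 118, 114, 123, 131, 128, 138, 146, 150.0])
print(mar_ratio(equity, years=1.0))
print(sharpe_ratio(np.diff(equity) / equity[:-1]))
```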
The reason CAGR is sensitive to the start and end dates is that it equals the slope of the line connecting the beginning and end of the equity curve on a logarithmic scale, and changing the start and end dates can greatly change the slope of that line. We can see this effect in Figure 12-1.
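As a rough sketch of this endpoint effect (my own illustration with a made-up equity series, not the book's data): because CAGR uses only the first and last points, trimming a poor month off the front of the series changes the measure noticeably.

```python
import numpy as np

def cagr(equity: np.ndarray, bars_per_year: int = 12) -> float:
    """CAGR depends only on the first and last equity points: it is the slope of
    the straight line joining the two endpoints on a logarithmic scale."""
    years = (len(equity) - 1) / bars_per_year
    return (equity[-1] / equity[0]) ** (1.0 / years) - 1.0

# Hypothetical monthly equity curve with a drawdown in the first month.
equity = np.array([100, 92, 98, 105, 112, 118, 114, 123, 131, 128, 138, 146, 150.0])

print(f"full period        : {cagr(equity):.1%}")     # about 50%
print(f"first month dropped: {cagr(equity[1:]):.1%}")  # jumps to about 70%
```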
The line labeled "Revised Test Dates" has a higher slope than the line labeled "Initial Test Dates".During the initial tests, there was one fade in January 1996 and another in May and June 1.Therefore, after we cut off the beginning and end of the test period, we also removed these two declines.This can be seen clearly in Figure 2006-12: After removing the fading at the front and rear ends, the slope of the connecting line representing CAGR is greatly improved.
Regressed Annual Return
The two lines above are very different, but if we run a simple linear regression on all the points of the curve, we get a better measure of return. If you don't like mathematics, all you need to know is that a regression line is the line that best fits the trend of a scatter of points, sometimes called the line of best fit. You can think of it as a straight line passing through the center of all the points. The regression process is like grabbing the two ends of the scatter plot and stretching it, keeping its overall direction unchanged, until all the ups and downs disappear and the points collapse into a straight line.
The linear regression line and the return it represents give us a new measure, which I call RAR (regressed annual return). This measure is far less sensitive to changes in the start and end dates of the test period than CAGR. In Figure 12-2 we can see that the slope of the regression line changes much less when the start and end dates of the test are changed.
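A minimal sketch of this idea (my own illustration, not the author's exact procedure; fitting the logarithm of the equity curve and the function name are assumptions on my part): fit a least-squares line through every point of log equity and convert its slope into an annualized return.

```python
import numpy as np

def regressed_annual_return(equity: np.ndarray, bars_per_year: int = 12) -> float:
    """Fit a straight line to log(equity) and convert its slope to an annual return.
    Because the fit uses every point, trimming a month off either end moves the
    slope less than it moves the endpoint-based CAGR."""
    t = np.arange(len(equity))
    slope, _intercept = np.polyfit(t, np.log(equity), 1)  # log-linear least squares
    return np.exp(slope * bars_per_year) - 1.0

equity = np.array([100, 92, 98, 105, 112, 118, 114, 123, 131, 128, 138, 146, 150.0])
print(f"RAR, full period     : {regressed_annual_return(equity):.1%}")
print(f"RAR, first month cut : {regressed_annual_return(equity[1:]):.1%}")
```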
If we compare RAR before and after the change in test period, just as we did with CAGR, we find that RAR is much less sensitive to the change in start and end dates, because the difference in slope between the two regression lines is far smaller. The RAR of the initial test is 54.67%; after the start and end dates are modified, it becomes 54.78%, an increase of only 0.11%. In contrast, CAGR increased from 43.2% to 46.2%, a change of 3.0%. For this test, CAGR is almost 30 times more sensitive to the test start and end dates than RAR.
The monthly return measure used in the Sharpe ratio is also very sensitive to such changes: since the three months we removed (one at the start, two at the end) were poor months, the average monthly return is of course affected, although not to the same degree as CAGR. It is better to use RAR in the numerator of the Sharpe ratio.
As mentioned earlier, the maximum drawdown used to compute the MAR ratio is also highly sensitive to changes in the test start and end dates. Whenever the maximum drawdown occurs near either end of the test period, the MAR ratio is strongly affected. The maximum drawdown is also just a single point on the equity curve, so you are not seeing other valuable data. A measure that incorporates more drawdowns is better than this one. Suppose one system's five largest drawdowns are 32%, 34%, 35%, 35%, and 36%, while another system's five largest drawdowns are 20%, 25%, 26%, 29%, and 36%. The second system is clearly better than the first.
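Below is a minimal sketch of how the largest drawdowns and their lengths might be extracted from an equity curve. The conventions here, such as measuring an episode from peak to recovery and ignoring an unrecovered final drawdown, are my assumptions, not the author's exact definitions.

```python
import numpy as np

def drawdowns(equity: np.ndarray):
    """Return (depth, length_in_bars) for each peak-to-recovery drawdown episode.
    An episode that has not yet recovered to a new high is ignored."""
    peak_idx = 0
    episodes = []
    for i in range(1, len(equity)):
        if equity[i] >= equity[peak_idx]:
            if i - peak_idx > 1:  # a drawdown episode just ended
                trough = equity[peak_idx:i].min()
                depth = (equity[peak_idx] - trough) / equity[peak_idx]
                episodes.append((depth, i - peak_idx))
            peak_idx = i
    return episodes

equity = np.array([100, 96, 102, 99, 95, 104, 110, 103, 98, 112, 120, 111, 125.0])
worst_five = sorted(drawdowns(equity), reverse=True)[:5]
avg_depth = np.mean([d for d, _ in worst_five])
avg_length = np.mean([n for _, n in worst_five])
print(f"average depth of worst drawdowns: {avg_depth:.1%}, average length: {avg_length:.1f} bars")
```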
In addition, drawdown depth is only a one-dimensional measure; not all 30% drawdowns mean the same thing. I don't care much if a system moves on to new highs after only a two-month decline, but a two-year decline is a different story. The length of the recovery time, or of the drawdown period itself, is also very important.
Robust Risk/Reward Ratio
To take all of this into account, I invented a new risk/reward measure, which I call the robust risk/reward ratio. I also like to call it R-cubed, because I am still a bit of a techie at heart and am used to terms like that. The numerator of R-cubed is RAR, and the denominator is also a new measure, which I call the length-adjusted average maximum drawdown. This denominator has two elements: the average maximum drawdown and a length adjustment.
The average maximum drawdown is the average of the five largest drawdowns. The length adjustment is made by dividing the average length in days of those drawdown periods by 365 and multiplying the average maximum drawdown by that ratio. The average drawdown length is computed on the same principle as the average drawdown depth: the lengths in days of the five drawdown periods are added up and divided by 5. So if the RAR is 50%, the average maximum drawdown is 25%, and the average drawdown length is one year (365 days), then R-cubed equals 2.0, that is, 50% / (25% × 365/365). As a risk/reward measure, R-cubed considers risk from two angles, depth and duration. The measures it uses are not as sensitive to changes in the test start and end dates, so it is more robust than the MAR ratio; that is, it is less prone to large swings in response to small changes in the data.
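Under those definitions, here is a minimal sketch of the calculation; the helper's signature is hypothetical, and only the formula follows the description above.

```python
def r_cubed(rar: float, worst_depths: list[float], worst_lengths_days: list[float]) -> float:
    """R-cubed = RAR / (average max drawdown * (average drawdown length in days / 365))."""
    avg_depth = sum(worst_depths) / len(worst_depths)
    avg_length_years = sum(worst_lengths_days) / len(worst_lengths_days) / 365.0
    return rar / (avg_depth * avg_length_years)

# The worked example from the text: RAR 50%, average maximum drawdown 25%,
# average drawdown length one year (365 days) -> R-cubed of 2.0.
print(r_cubed(0.50, [0.25] * 5, [365.0] * 5))  # prints 2.0
```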