How to Do a Proper Self-Experiment, and Why Your “N” Doesn’t Technically Equal “1”

Aravind recently suggested in the comments that I write a blog post about a discussion he and I had in the hallway at the Ancestral Health Symposium about “n=1 experiments.”  The thrust of this discussion was that if you want to do a true self-experiment where you can definitively demonstrate cause and effect, you can actually conduct a randomized, controlled trial on yourself where your n is equal to the number of repeated observations rather than “1,” although we can still casually call them “n-of-one experiments.”  Anything less than this provides interesting information, but not necessarily a demonstration of cause and effect.

First I'll describe why we should perform self-experiments in this way (if we're going to perform them at all), then how to go about doing so, and finally what to do in the cases where such rigorous self-experimentation is obviously impractical.

Just to be clear, I'm not suggesting everyone should actually go forth and begin performing experiments in this manner.  But it's useful to understand the theoretical principles, and for those who are interested in seeing how certain foods affect their blood sugar, blood pressure, or some other parameter, this post will be of practical import.

There are two elements of importance here, each of which is essential to demonstrating cause and effect in any experimental situation:
  • Repeated observations.
  • Randomization.

The reason for repeated observations is simple: if I want to show that my response to two different foods is different, I need to show that the variation between them is greater than the variation within them.

Say I want to know whether bananas spike my blood sugar more than strawberries.  To address this question, I eat bananas for breakfast on Monday and my blood sugar goes up to 130 mg/dL, and then I eat strawberries for breakfast on Tuesday and my blood sugar goes up to only 125 mg/dL.  Does this support my hypothesis?  Not really.  The reason is that I have no idea what my blood sugar would have gone up to if I had eaten either fruit a second or third time.  If I ate strawberries again on Wednesday and my blood sugar went up to 135 mg/dL, suddenly my conclusions would fall apart.

I can avoid this problem entirely by repeating my strawberry trial a few times and my banana trial a few times so I can assess the natural variation in my responses to each fruit.  If the difference in the average response to each fruit is large enough or the variation within my responses to each fruit is small enough, I can conclude that one raises my blood sugar more than the other.  I'll describe how to make that decision below.
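As a rough illustration of weighing the variation between fruits against the variation within each fruit, here is a minimal Python sketch. The readings are made-up numbers chosen only for illustration (they are not real measurements), and the names `bananas`, `strawberries`, and `summarize` are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical peak blood sugar readings in mg/dL, made up for illustration.
bananas = [130, 134, 128]
strawberries = [125, 135, 127]

def summarize(readings):
    """Return the average response and its spread (sample standard deviation)."""
    return mean(readings), stdev(readings)

banana_mean, banana_sd = summarize(bananas)
strawberry_mean, strawberry_sd = summarize(strawberries)

# Variation between the fruits: the gap between the two averages.
between = banana_mean - strawberry_mean

print(f"bananas:      mean {banana_mean:.1f} mg/dL, spread {banana_sd:.1f}")
print(f"strawberries: mean {strawberry_mean:.1f} mg/dL, spread {strawberry_sd:.1f}")
print(f"difference between the averages: {between:.1f} mg/dL")
```

In this made-up example the gap between the averages is smaller than the spread within either fruit, which is exactly the situation where a single Monday-versus-Tuesday comparison would mislead.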

The reason for randomization is a little less obvious.  Since the secret's already out that this blog is just a front for the military-industrial complex, I'll defer to my good friend* Donald Rumsfeld:


Hat tip to Daniel Kirsner for the video.

Randomization is a way to control for the unknowns, and especially for the unknown unknowns.  

If we were going to divide people into two groups for a controlled clinical trial, we would have to allocate them randomly.  In our self-experiment, we have to allocate the order of trials randomly. In other words, I can't test the effect of bananas five times this week and then test the effect of strawberries five times next week.  I have to alternate between bananas and strawberries in random order.

The simple reason is that time is a confounder.  Time, in fact, is the worst confounder of all because it indirectly introduces a whole host of unknowns, both of the known and unknown variety.  We could all make lists of things that can change with time.  The lists might look very different from one another and if we pooled them all into one list it would be ginormous.  The confounders we didn't include because none of us thought of them would be more numerous still. In principle, randomizing the order of trials controls for them all by taking time out of the equation entirely.

How to Randomize

The easiest way to randomize the order of our self-experiment would be to use a random number generator.  If we hop over to Random.Org we can generate random numbers within a certain range.  A simple way of randomizing would be to have “0” code for doing strawberries first and bananas second, and have “1” code for the opposite.  We could randomly generate a few zeroes and ones and then we'd be done. Since we are only making a simple comparison between two fruits, we could opt instead to just flip a coin.
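For those who would rather script it, the 0/1 coding above can be sketched in a few lines of Python. This is only one possible approach: Python's built-in pseudo-random generator stands in for Random.Org or a coin, and the function name `random_schedule` and its `seed` parameter are hypothetical.

```python
import random

def random_schedule(n_pairs, seed=None):
    """Build a trial order in pairs: 0 means strawberries first, 1 means bananas first."""
    rng = random.Random(seed)  # pass a seed only if you want a reproducible schedule
    schedule = []
    for _ in range(n_pairs):
        if rng.randint(0, 1) == 0:
            schedule += ["strawberries", "bananas"]
        else:
            schedule += ["bananas", "strawberries"]
    return schedule

# Example: three pairs, so three trials of each fruit.
for trial, fruit in enumerate(random_schedule(3), start=1):
    print(f"Trial {trial}: {fruit}")
```

Randomizing within pairs guarantees an equal number of trials for each fruit, which a string of independent coin flips would not.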

How to Choose the Number of Trials

Our ultimate goal is to determine, in this example, whether my average blood sugar response to one fruit is different from my average response to the other.  If my response to each fruit is very consistent, I might get by with just three measurements for each fruit.  If it's very inconsistent, it will be harder for me to estimate my average response and making this estimation will require a greater number of trials.  This will become clearer below.

How to Tell If the Responses Are Different

So how do we tell if my response to bananas is different from my response to strawberries?  The short answer is I should plug the data into some simple statistical software and run a t-test.  You can do this for free here:

GraphPad's Free T-Test Calculator
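For the curious, here is a minimal sketch of the arithmetic behind an unpaired t-test, assuming the ordinary pooled-variance (Student's) version. The readings are made up for illustration, the helper name `unpaired_t` is hypothetical, and instead of an exact p-value the sketch simply compares the result with the standard two-tailed 5% critical values of the t distribution.

```python
from math import sqrt
from statistics import mean, variance

# Two-tailed 5% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_05 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
             7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def unpaired_t(a, b):
    """Student's unpaired t-test with pooled variance; returns (t, degrees of freedom)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / na + 1 / nb))  # standard error of the difference
    return (mean(a) - mean(b)) / se, df

# Hypothetical peak blood sugar readings in mg/dL, made up for illustration.
bananas = [140, 143, 138]
strawberries = [125, 128, 124]

t, df = unpaired_t(bananas, strawberries)
significant = abs(t) > T_CRIT_05[df]
print(f"t = {t:.2f} on {df} degrees of freedom; significant at 5%: {significant}")
```

With these made-up, very consistent readings, t comes out near 7.8 on 4 degrees of freedom, well past the critical value of 2.776, so three trials per fruit would be enough here.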

If my response to each fruit is consistent, I should only need to do about three tests with each of them.

If my response to each fruit is more variable, I might have to do more.  As a good rule of thumb, we could start with three and see if there's a significant difference.  If not, we could run a couple more tests and see if it gets closer to significance.  There are more rigorous ways to determine the sample size we need, but sheesh we're not trying to justify ourselves at the feet of some bureaucracy or publish a paper here, so I think we can cut a few corners.  We just need to be careful of bias — we don't want to keep performing the experiment until we get the result we want and then stop.

If we want to be really careful about this, though, we could perform a few tests to guesstimate the “n” we need and then ignore all these results and start afresh, committing ourselves to a specific number of observations and then patting ourselves on the back for our objectivity.
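To see why testing, adding a couple more trials, and stopping as soon as the result looks significant can mislead us, here is a small simulation sketch. It assumes both fruits produce exactly the same response (readings drawn from one normal distribution with an arbitrary mean of 130 and spread of 5 mg/dL), so every "significant" result is a false alarm. The function names, the schedule of looks at 3, 5, 7, and 9 trials per fruit, and all the numbers are assumptions made for illustration.

```python
import random
from math import sqrt
from statistics import mean, variance

# Two-tailed 5% critical values of Student's t for the degrees of freedom used below.
T_CRIT_05 = {4: 2.776, 8: 2.306, 12: 2.179, 16: 2.120}

def is_significant(a, b):
    """Unpaired pooled-variance t-test for equal-sized groups, judged at the 5% level."""
    n = len(a)
    se = sqrt((variance(a) + variance(b)) / n)
    return abs(mean(a) - mean(b)) / se > T_CRIT_05[2 * n - 2]

def simulate(n_experiments=2000, looks=(3, 5, 7, 9), seed=0):
    """Return (false alarm rate when peeking, false alarm rate with a fixed n)."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(n_experiments):
        # Both "fruits" come from the same distribution: there is no true difference.
        a = [rng.gauss(130, 5) for _ in range(looks[-1])]
        b = [rng.gauss(130, 5) for _ in range(looks[-1])]
        # Peeking: stop and declare victory at the first look that is significant.
        if any(is_significant(a[:n], b[:n]) for n in looks):
            peeking_hits += 1
        # Fixed n: commit to the full number of trials and test once.
        if is_significant(a, b):
            fixed_hits += 1
    return peeking_hits / n_experiments, fixed_hits / n_experiments

peeking_rate, fixed_rate = simulate()
print(f"false alarms when peeking: {peeking_rate:.1%}")
print(f"false alarms with fixed n: {fixed_rate:.1%}")
```

Because the final look is one of the looks, the peeking strategy can never produce fewer false alarms than the single committed test on the same data, and in practice it produces more.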

In order to try to maintain as little variation as possible, and thus be able to get away with fewer trials of each fruit, we should attempt to keep any conditions we can think of as consistent as possible.  For example, we should conduct the test at the same time of day, having fasted for a similar length of time since our last meal.  Random differences in such conditions will not destroy the interpretation of the experiment but will decrease our statistical precision and require us to repeat more observations.

A Few Technical Considerations

There are two technical problems that could arise related to the independence of the trials.  We want to minimize any effect that one trial might have on another.  We can imagine a couple situations where that could be a problem.

For example, say we are taking a vitamin supplement.  The supplement might take a few days to clear from our system, so we would want to separate the trials by at least a few days.  This is called a wash-out period. Having a sufficient wash-out period between trials can help guarantee their independence.

The second problem is that there could be a time-dependent trend.  For example, if we are eating a low-carbohydrate diet and suddenly we start to run tests on our blood sugar response to different fruits, we may steadily adapt to eating fruit over several weeks and our blood sugar responses may steadily improve.  In this case, we can increase our statistical precision by using a paired t-test.  To do this we simply pair the first two trials, then the second two, the third two, and so on.  How to do this should be apparent after clicking on the above link to use the free t-test program.
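To illustrate why pairing helps when there is a trend, here is a minimal sketch that runs both a paired and an unpaired t-test on the same made-up data, in which responses to both fruits drift steadily downward over time. The readings, the helper names `paired_t` and `unpaired_t`, and the use of tabulated 5% critical values in place of an exact p-value are all assumptions for illustration.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Two-tailed 5% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_05 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 6: 2.447,
             7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def paired_t(a, b):
    """Paired t-test on matched trials; returns (t, degrees of freedom)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1

def unpaired_t(a, b):
    """Unpaired pooled-variance t-test for equal-sized groups; returns (t, df)."""
    n = len(a)
    return (mean(a) - mean(b)) / sqrt((variance(a) + variance(b)) / n), 2 * n - 2

# Hypothetical readings in mg/dL. Position i in each list belongs to the same pair
# of adjacent trials, and both fruits drift downward as the weeks go by.
bananas = [150, 144, 139, 133]
strawberries = [141, 137, 130, 126]

t_paired, df_paired = paired_t(bananas, strawberries)
t_unpaired, df_unpaired = unpaired_t(bananas, strawberries)

print(f"paired:   t = {t_paired:.2f}, df = {df_paired}, "
      f"significant: {abs(t_paired) > T_CRIT_05[df_paired]}")
print(f"unpaired: t = {t_unpaired:.2f}, df = {df_unpaired}, "
      f"significant: {abs(t_unpaired) > T_CRIT_05[df_unpaired]}")
```

With these made-up numbers the paired test gives t of about 13.9 on 3 degrees of freedom, while the unpaired test gives only about 1.6 on 6 degrees of freedom. The drift inflates the spread within each fruit, but it largely cancels out of the within-pair differences.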

We Don't Need to Know Everything

It would, quite clearly, be foolish to rest on definitive demonstrations of cause-and-effect for everything we do.  This would be paralyzing.  It's quite clear that if someone wants to go gluten-free for six months, they're not going to repeat this three or five times, randomly alternating with a six-month gluten-gobbling period.

A randomized, controlled self-experiment is the ideal form of self-experimentation but this doesn't mean we should ignore the rest of our personal experience.  We can, at a minimum, demonstrate that making a given dietary change is at least consistent with improved health simply by experiencing such an improvement in health after making such a dietary change.  We only have one life to live, and the most sensible thing might be to stick with what seems to work and move on.

Even so, understanding the essential role of randomization and repeated observations in demonstrating cause and effect can help us interpret that experience.  Realizing that many of our past experiences may not provide us with definitive cause-and-effect information can help us infuse some flexibility into our dietary theories and make the changes we might need to make now or in the future rather than getting trapped into dietary dogmatism.

Where practical, however, a randomized, controlled self-experiment can provide valuable information.  In the future, I'll conduct a few of these on myself and post about them.

* Mr. Rumsfeld and I go way back. One time in the 1990s when we were working on the Dole campaign together, he got so angry at a mouse who chewed through all his packets of NutraSweet in the middle of the night that he wanted to blow through a nest it had burrowed in the wall with a nuclear warhead. I reasoned with him that this could backfire and create a public relations disaster, and he backed down. I always found Rumsfeld's temper disturbing, but the compelling simplicity of his approach to statistical analysis remains beyond reproach to this day. I often wonder how the world would be different if Mr. Rumsfeld had chosen this discipline as his profession, but as he would always say to me, “One can never randomize the universe to alternative histories or futures with an n of 1.” Or as others say, you only have one life to live.

Acknowledgment: Special thanks to NYC-based statistical consultant Karen A. Buck for discussing this concept with me.



  1. Hi Chris,
    Stumbled upon this old post of yours, and I found it incredibly useful. The addition of the paired t-test is a fantastic suggestion. I think self-experimenting should be fun, and that life is too short to stress on the semantics!
    Thanks again,

  2. My point on randomization is that it really doesn't matter what you do on the second week, since the first week is a destructive test. The second week results are more of a function of what happened during the first week, and not the order in which the fruit is eaten later.
    And, randomization absolutely does not eliminate any time-dependent effects, it just smooshes that factor around, and hopefully more evenly.
    I could never do those Jimmy Moore experiments without an extremely long wash-out period. Really. Sometimes three months. This is the biggest problem, and where I seriously disagree with your friend.

  3. Hi Exceptionally Brash!


    I was getting worried because *I* thought this was a back-and-forth thing and I didn't understand why I seemed to be repeating myself to no avail. Haha.

    Well, I'm glad we've made some agreement. I don't see the point of doing ANOVA here, but thank you for your thoughts.


  4. Oh, people reading this are probably thinking this is a back-and-forth thing. I just wrote the three posts in a row, without realizing that you had already answered.
    Anyway, I realize that doing experiments with several people is not the same as doing a bunch of stuff on yourself.
    I also realize that one need not do ANOVA, but that's easier for some than a t test. It will give you the same results for the very simple model you suggest, it's just with ANOVA, you have so much more flexibility to deal with all the things that may have been introduced.
    By "better result" I am thinking more in line with strengthening the signal and reducing the noise.

  5. Hi Exceptionally Brash,

    I will try this one more time, but after that if we make no progress I think we should just agree to disagree.

    Yes, virtually anything you do will have transcriptional effects. In general, these occur over hours or days, sometimes weeks. These could contribute to time-dependent effects.

    Let's assume for the sake of argument these effects exist in this model.

    If you do not randomize the order, then the entire experiment is confounded by the time-dependent effects. If you randomize the order, then although the time-dependent effect exists, it has no more influence over strawberries than bananas. Thus, it does not introduce bias in favor of one or the other fruit. Remember, the whole purpose is simply to answer the question, "Do I respond differently to these two fruits?"

    It may decrease your statistical power by increasing the variability. This can be avoided by pairing.

    So, even if we generously assume that you are correct that there will be all kinds of "wonky" transcriptional effects, it doesn't matter. The test is designed to control for them.


  6. ..Now, I am assuming through all this that things are not independent, that the order of the fruits doesn't matter after the first week, because things have already gotten wonky. In other words, the test is a destructive test. As an example, think of giving someone SSRI's or something like that, several months instead of a day in the fruit example. Then take it away. Is the washout period really as long as it takes to get the drug out? No way. You have to re-up-regulate and down-regulate everything, if that can be done. Why would people doing research on a new drug limit the study to people who haven't been on similar drugs before? Why would Guyenet design a study looking for participants who hadn't already tried another diet? It's because they really don't want to waste their time doing an experiment on someone who is already wonky from experimenting.

  7. Yes, within one person, there will be variability. This is not a problem. The purpose of the t-test is to assess the variability, and to determine whether the difference in means indicates a truly different response to the two fruits (in this two-fruit example) or represents random variability.

    In testing a blood sugar response, each trial should last three hours, not one week.

    You may indeed have an effect of time. This is why you randomize the order. This eliminates any time-dependent effects. You can further do, as I indicated in my comment and revision, a paired t-test where you pair the first two trials, the next two, the next two, and so on, which will increase your statistical precision if there is a time-dependent effect.

    Thus, time is not a confounder in this model.

    Including other people would not give you a "better result." It would ask a completely different question and leave the initial question unanswered. This is a self-experiment. The question is "how do I respond?" It gives you no information about anyone else. Including other people gives you information about the population mean response, but gives you no information about yourself.

    Neither is "better" — they are just each answering a different question.


  8. My reason for making it a bit more complicated was to point out that if you were to just draw graphically person 1's responses to different fruits vs. control, they may see large differences in the response compared to another person. But, (and this is the problem), WITHIN that person 1's responses, there might also be considerable variability. And, it can be the case that the differences vary widely from trial to trial.
    If each trial lasts a week, you would have within-week variability and between-week variability. My point in adding the other person was to try to say that person 1's total variability may completely dwarf any differences there may be between 2 people. So, doing the experiment only one week with your friends would give you a better result than doing it over and over again on yourself.

  9. Hi Exceptionally Brash,

    I still don't follow you. The comparison is between the responses to the two fruits. A baseline isn't a "control" but rather an integral component of the calculation of the blood sugar response. The best way to do this would be to calculate the AUC, so you have one data point for each trial. You could use numerous other methods of calculation, such as maximum concentration, increase from baseline, etc, but AUC would be preferable.

    There is no need to do an ANOVA, because you are only comparing two things. If you were comparing three things, you could do an ANOVA. But it is hardly true to say that you might as well throw your friends in. First, that would complicate things to a further degree because now suddenly you have to make adjustments for having repeated measures and numerous subjects. You have to deal with the fact that your observations in each person are not independent. And second, because now you have muddied the inference. In one case, you are making an inference about a person, without generalizing to others. In the other case, you are making an inference about some population, and what is the population? Your "friends" aren't a random sample of anything. If it's a random sample of your friends, you could make an inference about the population of "your friends," but this would be awkward since there is no reason to postulate that your friends constitute a distinct population in some biologically meaningful way.

    In any case, what I described is, as I indicated in the comments above, a technique recognized in a large body of statistical literature. It's not uniquely my idea.


  10. My knee-jerk response to your experiment was that it is already a more complicated model since you have two fruits and a control. (You were going to do the baseline, weren't you?) So, instead of a paired t-test, I am already thinking an ANOVA. Once you go there, it's easy to add all your friends' results and back out any blocks you may have introduced, so that you can get a bigger better error term. (Er, smaller error term..oh well, you know what I mean here…)

  11. Chris,

    This is not exactly relevant to the post but I didn't know where else to ask. I was discussing the role of Accutane with a nutritionally inclined friend and he said that he read from you that it's possible that Accutane causes a Vitamin A deficiency of sorts? How would this work?

    I am curious because I took Accutane for Acne about 4 years ago and honestly it gave me weirder, harder to correct skin problems such as folliculitis, flaky skin, sensitive skin etc etc.

    Fermented Skate Liver oil liquid seems to make my skin look beautiful in super high doses… haven't tried super high doses of cod liver oil.

  12. I talked this over with a very high-profile statistician, whose email signature said that the contents of the email should be kept private, so I'm keeping this anonymous.

    He told me that the basic scheme here is correct, and that the trials are, in fact, independent because you are only making an inference about the particular person they are conducted on. He said that they are indeed called "n-of-one" trials even though technically the "n" used for calculating the standard error is the number of repeated observations. He further said that there is "a big literature on it, including at least one book," and that an early advocate was Dave Sackett (see below).

    He suggested using a paired t-test if there is a time-dependent trend, so that one conducts numerous pairs of tests with the pairs in either A-B or B-A orders in a random sequence, and each pair is "paired" within the t-test. He suggested using a wash-out period between trials whenever a measurement in a second trial would be taken before the effect of the first trial is gone. As an example, he used drug trials where the wash-out period would have to be at least three times the half-life of the drug.

    As a result, I have revised the title and several sentences in the post to reflect that the "n=1" terminology is legitimate even though the "n" is technically more than one. I have also included a "technical note" on wash-out periods and paired t-tests.

    I googled "Dave Sackett n-of-one" and came up with this:

    FAQ on Levels of Evidence

    It states that "A group of us" — where the group included "Chris Ball, Dave Sackett, Bob Phillips, Sharon Straus and Brian Haynes of the Centre for Evidence-Based Medicine at Oxford"– "started from the shared objective of wanting to maximise the help and minimise the harm we do to patients by basing our clinical decisions on the sorts of evidence that are least likely to be wrong."

    Considering "N-of-1 trials," it states the following:

    "We do N-of-1 trials to find out the best treatment for an individual patient, not for the average patient. We neither know nor care whether their result can be generalised and they rank as case reports (i.e. level 5)."

    It further states, however, that randomized crossover trials are essentially the composite of multiple N-of-1 trials and as such are generalizable to others and when properly randomized are of the highest level of evidence (along with randomized group trials):

    "Where would we place cross-over studies? If randomised, they can be level 1 (if multiple N-of-1's are conducted on the same intervention with the same outcomes for the same disorder, they sum to a multiple cross-over trial)."

  13. I think I understand your points quite well and that we are actually quite close. It is my fault for talking in code, and for also mixing statistical metaphors in with real statistics. I'll add more later, trying now to get out of the blue light.

  14. Responses to Josh, Aravind, and Exceptionally Brash.

    Hi Josh,

    That's a great point, and an unfortunate problem. However, it's a nice problem to have in that your problem is you don't have any problems. I hope it stays that way for you for a long time to come.


    I think you did have a point about independence. The independence of the trials is quite ambiguous and depends on how you apply the definition. A sufficient wash-out period should render the trials independent in the sense that one is not biochemically affecting the next. The person will not lose consciousness of the trials and in that sense there is some independence that can never be gained, but the randomization should remove any bias due to it by spreading it evenly across the things you are comparing. There is some independence lost in the fact that all the measurements are being taken from one person, but I think this is rendered irrelevant by the fact that the "population" you are making an inference about is the total responses of that person. The other point about pathological and non-pathological variation is also a good one.

    Exceptionally Brash,

    I'm very confused by your comments and I don't think you understood the main points of my post. This is probably my fault for not explaining them in sufficient detail. First, I very much agree that for certain trials a "wash-out" period would be required in order to make the trials independent of one another. I'm not sure what you mean by order-dependent effects not "evening out." It is not the order-dependent effect that evens out across the trials, but the trials that one deliberately distributes randomly across the order. If the trials are distributed randomly across the order, then it is impossible for order-dependent effects to bias the result in favor of one or the other dietary factor being tested. That is not because the order-dependent effect is eliminated but because the factor being tested is divorced entirely from the order during randomization.

    I don't know what you mean by comparing "within-person" to "between-person" variation. There can be no "between-person" variation at all because the experiment being discussed only involves a single person.

    The fact that within-person variation might itself vary between individuals is irrelevant, because the within-person variation is being directly measured in this scenario.

    I do not know what you mean by "add the total trials." I didn't suggest adding anything. I suggested performing a t-test, which takes the mean of each trial, and calculates a standard error that takes into account the number of observations and the amount of variation, which is an estimate of the precision with which one is estimating the "true mean." The t-test concludes by determining whether one can have a certain percentage confidence that the "true means" are different. There is no summation in this, except insofar as taking a sum is inevitably part of calculating a mean.

    Does this make any more sense?


  15. The other issue with summing N=1's is probably best explained in the language of variance components. In such an analysis, while you are looking at some main effect sources of variability, such as bananas or no bananas effect on some response, there is also the within-person variability, presumably lower than the between-person variability. There is also a common assumption that any within-person variability is fairly consistent from person to person. As I alluded to in my previous post, I believe that for some experiments, my own variability would be quite high compared with someone who can eat whatever they want. A better way would be to pool the paired comparisons within each person, not directly add the total trials.
    (And this would probably make much more sense if I had inherited Tim Russert's white board instead of trying to use words, so sorry about that.)

  16. Hello Chris! The problem with your randomization explanation is that the "within-person" errors might not be spread out evenly. I am thinking of the many cases where the effects of a diet experiment would be hormonally-mediated, and involve a long settling-out process before doing another trial. I know that in my own case, a weekend of banana debauchery might take me a couple of months to get back to the place where I was before. Testing anything within that recovery time-frame would be confounded. The different trials are not independent they are somewhat recursive, and any effects are order-dependent and will not "even out".

  17. Hi Howard,

    Exactly. If your arthritis is gone and you don't have definitive proof of why, your time is probably much better spent doing all the things the arthritis was preventing you from enjoying than spending enormous amounts of time proving a theory. I hope the good health persists.


  18. Sometimes the n=1 is enough. E.g., for me, cutting grain from my diet and seeing almost immediate disappearance of my arthritis was sufficient. However, I did notice several years later that "low-carb" wheat products would cause a recurrence of the hand pain, so while I wasn't 100% sure it was gluten to begin with, my confidence level in that conclusion is getting close to 100%. Close enough to avoid gluten, anyway.

  19. I agree that it was probably not the appropriate use of the term. Also, I was switching between the concept of statistical independence and the notion of something being pathological vs. variation within "acceptable" physiological limits


  20. It's actually kind of frustrating that I mostly don't notice differences in how I feel based on what I eat, unless I eat a bunch of garbage. I guess that's one of the benefits of being young, but it would be nice to figure out what I "do best on" before I start to notice it.

  21. Hi Elizabeth,

    That's a great example, and one I can relate to. The nice thing about randomization is it corrects for all of that. If you randomly allocate a bunch of people to a control and treatment group, each person is different. Randomization spreads the error introduced by these differences evenly across both groups and thus makes it irrelevant. The same would be true in a randomized self-experiment. Any error from the rest of your life would be spread evenly across the things you are comparing. Not always practical, not always worth it, but in cases where it matters it can be the one thing that can give you a definitive interpretation.


  22. Hi Aravind,

    I agree with you about the OGTT. I'm not sure "independence" is the best word, but the example you give is one of the first that I thought of in this example scenario. If I conducted all my banana results first, they could easily affect my results for strawberries the following week, especially if I had been up to that point eating a diet very low in carbs or fruit and suddenly change that. But this becomes a non-issue if the order is randomized because a strawberry trial is no more likely to affect a banana trial than a banana trial is to affect a strawberry trial. To whatever extent the other things you eat the same day or on other days introduce some error, the error is spread evenly across the things you are comparing and thus becomes irrelevant.

    I think if it is practical to introduce an element of randomization into the Whole30 or any other food elimination/introduction protocol, it should be done. It is the difference between showing something is true and concluding that something seems possibly true. But as I said in the last section, it is often impractical to introduce randomization, and in those cases you necessarily have to forego much of your interpretive power. But it's ultimately a subjective choice of value: is it worth it to be rigorous? Why, why not? Each individual has to answer that for her or his own situation.

    I think your last question falls more into the purview of my dietary dogmatism post, but it does affect this one too. It's a wholly different problem though: you need to have a good guesstimate of how long your trial needs to be. The acute effect of bananas on my blood sugar, for example, is a totally different question from what would happen to me if I ate 30 bananas a day for a month. I think the first thing you need to do is see if there is any science behind the claim that a certain duration is necessary. For example it is quite reasonable based on current science to say that a couple weeks would be required to fully adapt to fat or carbs. If there is some guy on the internet you don't know who has a story about how he felt like crap for six months on a diet and then he felt the best ever, I think you need to keep in mind "ok, this is some guy I don't know, and I have no idea if this is true or if it applies to me. am I so desperate that I'm willing to feel like crap for six months to get better?" In some cases, this might actually be worth it, but in most cases a little common sense would suggest settling on something with more evidence behind it that your own experience affirms.

    Warm regards back atcha!

  23. Interesting way of putting it. I've experimented with several different approaches in terms of diet, but in the end I've always found it fairly difficult to determine what factors are ultimately causing the most influence.

    Perhaps the biggest problem with n=1 experimentation is how the rest of our lives can't really be controlled. I noticed a lot of my personal diet experiments coincide with other things going on in my life (more sleep or less sleep, stressful events, times of the year, etc.). Who can really tell how these factors affect the results as I interpreted them?

    I hear this a lot when people discuss their personal experiments, i.e. "Well, cod liver oil really cleared up my acne but I also went gluten-free at the same time and started doing a lot of volunteer work outdoors at the same time so I don't really know if it was the cod liver oil after all."

  24. Hello Chris,

    Thanks for writing this. Good stuff. For sports fans out there, you could use this link too – 🙂

    One thought that comes to mind is the consideration of independence between trials. The assumption here is that results of each trial are in fact independent of one another. In many experiments (unlike the fair coin toss), that might not be the case and so people need to be mindful of this.

    Take OGTT results vis-a-vis carbohydrate intake. If you are consistently LC and take an OGTT, you would have higher readings than if you were HC, all other things being equal, as I understand it. But eat HC for just 3-4 days straight, and you might get a pretty substantially different result if our random number generator happened to tell us to eat HC for several days in a row. Peter at Hyperlipid has written about this. Anyway, the point being independence (or lack of) should be considered too, in this example so as to not jump to the conclusion that you are pathologically insulin resistant.

    So what are your thoughts about a protocol like Whole 30 where you have many days of non randomized trials and concluding your tolerance to certain foods?

    Last point – in our hallway discussion, the other part that I thought was interesting was the notion of people feeling like crap on certain protocols, yet pushing through since they think the experience is "normal". I think our example was people eating LC and feeling awful yet persisting because of the assumption that the feeling will eventually pass.

    Warm regards,

  25. I added this to the intro:

    "Just to be clear, I'm not suggesting everyone should actually go out and begin performing experiments in this manner. But it's useful to understand the theoretical principles, and for those who are interested in seeing how certain foods affect their blood sugar, blood pressure, or some other parameter, this post will be of practical importance."
