Showing posts with label China Study. Show all posts

Monday, January 16, 2012

The China Study II: Wheat’s total effect on mortality is significant, complex, and highlights the negative effects of low animal fat diets

The graph below shows the results of a multivariate nonlinear WarpPLS analysis including the variables listed below. Each row in the dataset refers to a county in China, from the publicly available China Study II dataset. As always, I thank Dr. Campbell and his collaborators for making the data publicly available. Other analyses based on the same dataset are also available.
    - Wheat: wheat flour consumption in g/d.
    - Aprot: animal protein consumption in g/d.
    - PProt: plant protein consumption in g/d.
    - %FatCal: percentage of calories coming from fat.
    - Mor35_69: number of deaths per 1,000 people in the 35-69 age range.
    - Mor70_79: number of deaths per 1,000 people in the 70-79 age range.


Below are the total effects of wheat flour consumption, along with the number of paths used to calculate them, and the respective P values (i.e., the probabilities of seeing effects at least this strong if they were due to chance alone). Total effects are calculated by considering all of the paths connecting two variables. Identifying each path is a bit like solving a maze puzzle; you have to follow the arrows connecting the two variables. Version 3.0 of WarpPLS (soon to be released) does that automatically, and also calculates the corresponding P values.


To the best of my knowledge, this is the first time that total effects have been calculated for this dataset. As you can see, the total effects of wheat flour consumption on mortality in the 35-69 and 70-79 age ranges are both significant, and fairly complex in this model, each relying on 7 paths. The P value for mortality in the 35-69 age range is 0.038; in other words, if there were no real effect, an effect at least this strong would show up by chance only 3.8 percent of the time. The P value for mortality in the 70-79 age range is 0.024; a 2.4 percent chance under the same assumption.
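The maze-like path following described above can be sketched in a few lines. In a linear path model, the total effect of one variable on another is the sum, over every path connecting them, of the product of the coefficients along that path. The coefficients below are made up for illustration and are not the ones from the model in this post; the function name `total_effect` is hypothetical, and this is only a sketch of the general idea, not WarpPLS's own implementation.

```python
# Hypothetical path coefficients (NOT the values from the model in this post).
# Each key is a (source, target) arrow; each value is the coefficient on that arrow.
paths = {
    ("Wheat", "PProt"): 0.5,
    ("PProt", "Aprot"): -0.4,
    ("Aprot", "Mor70_79"): -0.3,
    ("Wheat", "Mor70_79"): 0.2,
}

def total_effect(paths, source, target):
    """Return (total effect, number of paths) from source to target.

    The total effect is the sum, over all directed paths, of the product
    of the coefficients along each path. Assumes the model has no cycles.
    """
    if source == target:
        return 1.0, 1  # empty path: empty product, counted as one path
    effect, n_paths = 0.0, 0
    for (a, b), coef in paths.items():
        if a == source:
            sub_effect, sub_paths = total_effect(paths, b, target)
            effect += coef * sub_effect
            n_paths += sub_paths
    return effect, n_paths

effect, n_paths = total_effect(paths, "Wheat", "Mor70_79")
print(n_paths, round(effect, 3))
```

With these made-up numbers there are two paths: the direct one (0.2) and the one through plant and animal protein (0.5 x -0.4 x -0.3 = 0.06), for a total of 0.26.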

Note that in the model the effects of wheat flour consumption on mortality in both age ranges are hypothesized to be mediated by animal protein consumption, plant protein consumption, and fat consumption. These mediating effects have been suggested by previous analyses discussed on this blog. The strongest individual paths are between wheat flour consumption and plant protein consumption, plant protein consumption and animal protein consumption, as well as animal protein consumption and fat consumption.

So wheat flour consumption contributes to plant protein consumption, probably by being a main source of plant protein (through gluten). Plant protein consumption in turn decreases animal protein consumption, which significantly decreases fat consumption. From this latter connection we can tell that most of the fat consumed likely came from animal sources.

How much fat and protein are we talking about? The graphs below tell us how much, and these graphs are quite interesting. They suggest that, in this dataset, daily protein consumption tended to be on average 60 g, whatever the source. If more protein came from plant foods, the proportion from animal foods went down, and vice-versa.


The more animal protein consumed, the more fat is also consumed in this dataset. And that is animal fat, which comes mostly in the form of saturated and monounsaturated fats, in roughly equal amounts. How do I know that it is animal fat? Because of the strong association with animal protein. By the way, with a few exceptions (e.g., some species of fatty fish) animal foods in general provide only small amounts of polyunsaturated fats – omega-3 and omega-6.

Individually, animal protein and wheat flour consumption have the strongest direct effects on mortality in both age ranges. Animal protein consumption is protective, and wheat flour consumption detrimental.

Does the connection between animal protein, animal fat, and longevity mean that a diet high in saturated and monounsaturated fats is healthy for most people? Not necessarily, at least without extrapolation, although the results do not suggest otherwise. Look at the amounts of fat consumed per day. They range from a little less than 20 g/d to a little over 90 g/d. By comparison, one steak of top sirloin (about 380 g of meat, cooked) trimmed to almost no visible fat gives you about 37 g of fat.

These results do suggest that consumption of animal fats, primarily saturated and monounsaturated fats, is likely to be particularly healthy in the context of a low fat diet. Or, said in a different way, these results suggest that longevity is decreased by diets that are low in animal fats.

How much fat should one eat? In this dataset, the more fat was consumed together with animal protein (i.e., the more animal fat was consumed), the better in terms of longevity. In other words, in this dataset the lowest levels of mortality were associated with the highest levels of animal fat consumption. The highest level of fat consumption in the dataset was a little over 90 g/d.

What about higher fat intake contexts? Well, we know that men on a high fat diet such as a variation of the Optimal Diet can consume on average a little over 170 g/d of animal fat (130 g/d for women), and their health markers remain generally good.

One of the critical limiting factors, in terms of health, seems to be the amount of animal fat that one can eat and still remain relatively lean. Dietary saturated and monounsaturated fats are healthy. But when accumulated as excess body fat, beyond a certain level, they become pro-inflammatory.

Saturday, November 5, 2011

The China Study II: How gender takes us to the elusive and deadly factor X

The graph below shows the mortality in the 35-69 and 70-79 age ranges for men and women for the China Study II dataset. I discussed other results in my two previous posts, all taking us to this post. The full data for the China Study II is publicly available. The mortality numbers are actually averages of male and female deaths per 1,000 people in each of several counties, in each of the two age ranges.


Men do tend to die earlier than women, but the difference above is too large.

Generally speaking, when you look at a set time period that is long enough for a good number of deaths (not to be confused with “a number of good deaths”) to be observed, you tend to see around 5-10 percent more deaths among men than among women. This is when other variables are controlled for, or when men and women do not adopt dramatically different diets and lifestyles. One of many examples is a study in Finland; you have to go beyond the abstract on this one.

As you can see from the graph above, in the China Study II dataset this difference in deaths is around 50 percent!

This huge difference could be caused by there being significantly more men than women per county included in the dataset. But if you take a careful look at the description of the data collection methods employed, this does not seem to be the case. In fact, the methodology descriptions suggest that the researchers tried to have approximately the same number of women and men studied in each county. The numbers reported also support this assumption.

As I said before, this is a well executed research project, for which Dr. Campbell and his collaborators should be commended. I may not agree with all of their conclusions, but this does not detract even a bit from the quality of the data they have compiled and made available to us all.

So there must be another factor X causing this enormous difference in mortality (and thus longevity) between men and women in the China Study II dataset.

What could be this factor X?

This situation helps me illustrate a point that I have made here before, mostly in the comments under other posts. Sometimes a variable, and its effects on other variables, are mostly a reflection of another unmeasured variable. Gender is a variable that is often involved in this type of situation. Frequently men and women do things very differently in a given population due to cultural reasons (as opposed to biological reasons), and those things can have a major effect on their health.

So, the search for our factor X is essentially a search for a health-relevant variable that is reflected by gender but that is not strictly due to the biological aspects that make men and women different (these can explain only a 5-10 percent difference in mortality). That is, we are looking for a variable that shows a lot of variation between men and women, that is behavioral, and that has a clear impact on health. Moreover, as should be clear from my last post, we are looking for a variable that is unrelated to wheat flour and animal protein consumption.

As it turns out, the best candidate for the factor X is smoking, particularly cigarette smoking.

The second best candidate for factor X is alcohol abuse. Alcohol abuse can be just as bad for one’s health as smoking is, if not worse, but it may not be as good a candidate for factor X because the difference in prevalence between men and women does not appear to be as large in China. But it is still large enough for us to consider it a close second as a candidate for factor X, or a component of a more complex factor X – a composite of smoking, alcohol abuse and a few other coexisting factors that may be reflected by gender.

I have had some discussions about this with a few colleagues and doctoral students who are Chinese (thanks William and Wei), and they mentioned stress to me, based on anecdotal evidence. Moreover, they pointed out that stressful lifestyles, smoking, and alcohol abuse tend to happen together - with a much higher prevalence among men than women.

What an anti-climax for this series of posts, eh?

With all the talk on the Internetz about safe and unsafe starches, animal protein, wheat bellies, and whatnot! C’mon Ned, give me a break! What about insulin!? What about leucine deficiency … or iron overload!? What about choline!? What about something truly mysterious, related to an obscure or emerging biochemistry topic; a hormone du jour like leptin perhaps? Whatever, something cool!

Smoking and alcohol abuse!? These are way too obvious. This is NOT cool at all!

Well, reality is often less mysterious than we want to believe it is.

Let me focus on smoking from here on, since it is the top candidate for factor X, although much of the following applies to alcohol abuse and a combination of the two as well.

One gets different statistics on cigarette smoking in China depending on the time period studied, but one thing seems to be a common denominator in these statistics. Men tend to smoke in much, much higher numbers than women in China. And this is not a recent phenomenon.

For example, a study conducted in 1996 states that “smoking continues to be prevalent among more men (63%) than women (3.8%)”, and notes that these results are very similar to those in 1984, around the time when the China Study II data was collected.

A 1995 study reports similar percentages: “A total of 2279 males (67%) but only 72 females (2%) smoke”. Another study notes that in 1976 “56% of the men and 12% of the women were ever-smokers”, which together with other results suggests that the gap increased significantly in the 1980s, with many more men than women smoking. And, most importantly, smoking industrial cigarettes.

So we are possibly talking about a gigantic difference here; the prevalence of industrial cigarette smoking among men may have been over 30 times the prevalence among women in the China Study II dataset.
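The arithmetic behind that estimate can be checked directly from the percentages quoted above. The short sketch below just divides the male prevalence by the female prevalence for each study; the labels are mine, and the numbers are the ones quoted in this post.

```python
# Smoking prevalence (percent of men, percent of women), as quoted in the text above.
studies = {
    "1996 study": (63.0, 3.8),
    "1995 study": (67.0, 2.0),
    "1976 figures (ever-smokers)": (56.0, 12.0),
}

# Ratio of male to female prevalence for each study.
ratios = {name: men / women for name, (men, women) in studies.items()}

for name, ratio in ratios.items():
    print(f"{name}: prevalence among men is {ratio:.1f} times that among women")
```

The 1995 figures give a ratio of about 33, which is where "over 30 times" comes from; the 1996 figures give a ratio closer to 17, and the 1976 ever-smoker figures a ratio under 5.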

Given the above, it is reasonable to conclude that the variable “SexM1F2” reflects very strongly the variable “Smoking”, related to industrial cigarette smoking, and in an inverse way. I did something that, grossly speaking, made the mysterious factor X explicit in the WarpPLS model discussed in my previous post. I replaced the variable “SexM1F2” in the model with the variable “Smoking” by using a reverse scale (i.e., 1 and 2, but reversing the codes used for “SexM1F2”). The results of the new WarpPLS analysis are shown on the graph below. This is of course far from ideal, but gives a better picture to readers of what is going on than sticking with the variable “SexM1F2”.
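The reverse coding described above is a one-line transformation. Here is a minimal sketch with a few hypothetical rows; the variable names mirror the ones in the model, but the data is made up.

```python
# SexM1F2 is coded 1 for male and 2 for female.
# Reversing a two-level code that uses 1 and 2 is simply 3 minus the code.
sex_m1f2 = [1, 2, 1, 2]  # hypothetical rows: male, female, male, female

# "Smoking" on the reversed scale: 2 for the heavy-smoking group (men), 1 for women.
smoking = [3 - s for s in sex_m1f2]

print(smoking)  # [2, 1, 2, 1]
```

Because the new variable is an exact mirror image of the old one, the strength of every association stays the same; only the signs of the associations involving this variable flip.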


With this revised model, the associations of smoking with mortality in the 35-69 and 70-79 age ranges are a lot stronger than those of animal protein and wheat flour consumption. The R-squared coefficients for mortality in both ranges are higher than 20 percent, which is a sign that this model has decent explanatory power. Animal protein and wheat flour consumption are still significantly associated with mortality, even after we control for smoking; animal protein seems protective and wheat flour detrimental. And smoking’s association with the amount of animal protein and wheat flour consumed is practically zero.

Replacing “SexM1F2” with “Smoking” would be particularly far from ideal if we were analyzing this data at the individual level. It could lead to some outlier-induced errors; for example, due to the possible existence of a minority of female chain smokers. But this variable replacement is not as harmful when we look at county-level data, as we are doing here.

In fact, this is as good and parsimonious a model of mortality as I’ve ever seen based on county-level China Study II data.

Now, here is an interesting thing. Does the original China Study II analysis of univariate correlations show smoking as a major problem in terms of mortality? Not really.

The table below, from the China Study II report, shows ALL of the statistically significant (P<0.05) univariate correlations with mortality in the 70-79 age range. I highlighted the only measure that is directly related to smoking; that is “dSMOKAGEm”, listed as “questionnaire AGE MALE SMOKERS STARTED SMOKING (years)”.


The high positive correlation with “dSMOKAGEm” does not even make a lot of sense, as one would expect a negative correlation here – i.e., the earlier in life folks start smoking, the higher should be the mortality. But this reverse-signed correlation may be due to smokers who get an early start dying in disproportionally high numbers before they reach age 70, and thus being captured by another age range mortality variable. The fact that other smoking-related variables are not showing up on the table above is likely due to distortions caused by inter-correlations, as well as measurement problems like the one just mentioned.

As one looks at these univariate correlations, most of them make sense, although several can be and probably are distorted by correlations with other variables, even unmeasured variables. And some unmeasured variables may turn out to be critical. Remember what I said in my previous post – the variable “SexM1F2” was introduced by me; it was not in the original dataset. “Smoking” is this variable, but reversed, to account for the fact that men are heavy smokers and women are not.

Univariate correlations are calculated without adjustments or control. To correct this problem one can adjust a variable based on other variables; as in “adjusting for age”. This is not such a good technique, in my opinion; it tends to be time-consuming to implement, and prone to errors. One can alternatively control for the effects of other variables; a better technique, employed in multivariate statistical analyses. This latter technique is the one employed in WarpPLS analyses.

Why don’t more smoking-related variables show up on the univariate correlations table above? The reason is that the table summarizes associations calculated based on data for both sexes. Since the women in the dataset smoked very little, including them in the analysis together with men lowers the strength of smoking-related associations, which would probably be much stronger if only men were included. It lowers the strength of the associations to the point that their P values become higher than 0.05, leading to their exclusion from tables like the one above. This is where the aggregation process that may lead to ecological fallacy rears its ugly head.
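One way this dilution can arise is sketched below with made-up numbers for six hypothetical counties (none of this is China Study II data). Male mortality tracks male smoking perfectly, while women barely smoke and their mortality varies for unrelated reasons. Averaging the two sexes into county totals, which is my assumption about how the aggregation works here, weakens the correlation.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Made-up values for six hypothetical counties.
men_smoke = [50, 55, 60, 65, 70, 75]   # percent of men who smoke
men_mort = [10, 11, 12, 13, 14, 15]    # male deaths per 1,000
women_smoke = [2, 3, 2, 3, 2, 3]       # percent of women who smoke
women_mort = [8, 5, 9, 4, 8, 6]        # female deaths per 1,000

# County "totals": simple averages of the male and female figures.
both_smoke = [(m + w) / 2 for m, w in zip(men_smoke, women_smoke)]
both_mort = [(m + w) / 2 for m, w in zip(men_mort, women_mort)]

r_men = pearson(men_smoke, men_mort)
r_both = pearson(both_smoke, both_mort)
print(round(r_men, 2), round(r_both, 2))
```

In this toy example the men-only correlation is essentially perfect, while the correlation based on the aggregated figures is much weaker. With real data and a modest number of counties, a drop like that can be enough to push a P value above 0.05.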

No one can blame Dr. Campbell for not issuing warnings about smoking, even as they came mixed with warnings about animal food consumption. The former warnings, about smoking, make a lot of sense based on the results of the analyses in this and the last two posts.

The latter warnings, about animal food consumption, seem increasingly ill-advised. Animal food consumption may actually be protective in regard to the factor X, as it seems to be protective in terms of wheat flour consumption.

Monday, October 31, 2011

The China Study II: Gender, mortality, and the mysterious factor X

WarpPLS and HealthCorrelator for Excel were used to do the analyses below. For other China Study analyses, many using WarpPLS as well as HealthCorrelator for Excel, click here. For the dataset used, visit the HealthCorrelator for Excel site and check under the sample datasets area. As always, I thank Dr. T. Colin Campbell and his collaborators for making the data publicly available for independent analyses.

In my previous post I mentioned some odd results that led me to additional analyses. Below is a screen snapshot summarizing one such analysis, of the ordered associations between mortality in the 35-69 and 70-79 age ranges and all of the other variables in the dataset. As I said before, this is a subset of the China Study II dataset, which does not include all of the variables for which data was collected. The associations shown below were generated by HealthCorrelator for Excel.


The top associations are positive and with mortality in the other range (the “M006 …” and “M005 …” variables). This is to be expected if ecological fallacy is not a big problem in terms of conclusions drawn from this dataset. In other words, the same things cause mortality to go up in the two age ranges, uniformly across counties. This is reassuring from a quantitative analysis perspective.

The second highest association in both age ranges is with the variable “SexM1F2”. This variable is a “dummy” variable coded as 1 for male sex and 2 for female, which I added to the dataset myself – it did not exist in the original dataset. The association in both age ranges is negative, meaning that being female is protective. It reflects in part the role of gender on mortality, more specifically the biological aspects of being female, since we have seen before in previous analyses that being female is generally health-protective.

I was able to add a gender-related variable to the model because the data was originally provided for each county separately for males and females, as well as through “totals” that were calculated by aggregating data from both males and females. So I essentially de-aggregated the data by using data from males and females separately, in which case the totals were not used (otherwise I would have artificially reduced the variance in all variables, also possibly adding uniformity where it did not belong). Using data from males and females separately is the reverse of the aggregation process that can lead to ecological fallacy problems.
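The de-aggregation step described above amounts to turning each county record into two rows, one per sex, and tagging each row with the dummy variable. Below is a minimal sketch; the field names (`male_mor`, `female_mor`, `total_mor`) and the values are hypothetical and do not come from the actual dataset.

```python
# Hypothetical county records with separate male and female figures,
# plus a "total" that aggregates both sexes (all values are made up).
counties = [
    {"county": "A", "male_mor": 14.0, "female_mor": 9.0, "total_mor": 11.5},
    {"county": "B", "male_mor": 12.0, "female_mor": 8.0, "total_mor": 10.0},
]

rows = []
for c in counties:
    # One row per county per sex; the aggregated total is deliberately left out.
    rows.append({"county": c["county"], "SexM1F2": 1, "mor": c["male_mor"]})
    rows.append({"county": c["county"], "SexM1F2": 2, "mor": c["female_mor"]})

for r in rows:
    print(r)
```

The resulting table has twice as many rows as there are counties, and the spread in mortality between the male and female rows is preserved instead of being averaged away.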

Anyway, the associations with the variable “SexM1F2” got me thinking about a possibility. What if females consumed significantly less wheat flour and more animal protein in this dataset? This could be one of the reasons behind these strong associations between being female and living longer. So I built a more complex WarpPLS model than the one in my previous post, and ran a linear multivariate analysis on it. The results are shown below.


What do these results suggest? They suggest no strong associations between gender and wheat flour or animal protein consumption. That is, when you look at county averages, men and women consumed about the same amounts of wheat flour and animal protein. Also, the results suggest that animal protein is protective and wheat flour is detrimental, in terms of longevity, regardless of gender. The associations between animal protein and wheat flour are essentially the same as the ones in my previous post. The beta coefficients are a bit lower, but some P values improved (i.e., decreased); the latter most likely due to better resample set stability after including the gender-related variable.

Most importantly, there is a very strong protective effect associated with being female, and this effect is independent of what the participants ate.

Now, if you are a man, don’t rush to take hormones to become a woman with the goal of living longer just yet. This advice is not only due to the likely health problems related to becoming a transgender person; it is also due to a little problem with these associations. The problem is that the protective effect suggested by the coefficients of association between gender and mortality seems too strong to be due to men "being women with a few design flaws".

There is a mysterious factor X somewhere in there, and it is not gender per se. We need to find a better candidate.

One interesting thing to point out here is that the above model has good explanatory power in regards to mortality. I'd say unusually good explanatory power given that people die for a variety of reasons, and here we have a model explaining a lot of that variation. The model explains 45 percent of the variance in mortality in the 35-69 age range, and 28 percent of the variance in the 70-79 age range.

In other words, the model above explains nearly half of the variance in mortality in the 35-69 age range. It could form the basis of a doctoral dissertation in nutrition or epidemiology with important implications for public health policy in China. But first the factor X must be identified, and it must be somehow related to gender.

Next post coming up soon ...

Monday, October 24, 2011

The China Study II: Animal protein, wheat, and mortality … there is something odd here!

WarpPLS and HealthCorrelator for Excel were used in the analyses below. For other China Study analyses, many using WarpPLS and HealthCorrelator for Excel, click here. For the dataset used, visit the HealthCorrelator for Excel site and check under the sample datasets area. I thank Dr. T. Colin Campbell and his collaborators at the University of Oxford for making the data publicly available for independent analyses.

The graph below shows the results of a multivariate linear WarpPLS analysis including the following variables: Wheat (wheat flour consumption in g/d), Aprot (animal protein consumption in g/d), Mor35_69 (number of deaths per 1,000 people in the 35-69 age range), and Mor70_79 (number of deaths per 1,000 people in the 70-79 age range).


Just a technical comment here, regarding the possibility of ecological fallacy. I am not going to get into this in any depth now, but let me say that the patterns in the data suggest that, with the possible exception of some variables (e.g., blood glucose, gender; the latter will get us going in the next few posts), ecological fallacy due to county aggregation is not a big problem. The threat of ecological fallacy exists, here and in many other datasets, but it is generally overstated (often by those whose previous findings are contradicted by aggregated results).

I have not included plant protein consumption in the analysis because plant protein consumption is very strongly and positively associated with wheat flour consumption. The reason is simple. Almost all of the plant protein consumed by the participants in this study was probably gluten, from wheat products. Fruits and vegetables have very small amounts of protein. Keeping that in mind, what the graph above tells us is that:

- Wheat flour consumption is significantly and negatively associated with animal protein consumption. This is probably due to those eating more wheat products tending to consume less animal protein.

- Wheat flour consumption is positively associated with mortality in the 35-69 age range. The P value (P=0.06) is just shy of the 5 percent (i.e., P=0.05) that most researchers would consider to be the threshold for statistical significance. More consumption of wheat in a county, more deaths in this age range.

- Wheat flour consumption is significantly and positively associated with mortality in the 70-79 age range. More consumption of wheat in a county, more deaths in this age range.

- Animal protein consumption is not significantly associated with mortality in the 35-69 age range.

- Animal protein consumption is significantly and negatively associated with mortality in the 70-79 age range. More consumption of animal protein in a county, fewer deaths in this age range.

Let me tell you, from my past experience analyzing health data (as well as other types of data, from different fields), that these coefficients of association do not suggest super-strong associations. Actually this is also indicated by the R-squared coefficients, which vary from 3 to 7 percent. These are the variances explained by the model on the variables above the R-squared coefficients. They are low, which means that the model has weak explanatory power.
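For readers who want to see what "variance explained" means in practice, here is a minimal sketch that computes the R-squared of a best-fitting line through made-up data. It uses a single predictor for simplicity, whereas the model in this post has more than one variable pointing at each mortality variable; the numbers are hypothetical.

```python
def r_squared(x, y):
    """R-squared of the least-squares line predicting y from x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    # Variation left over after fitting the line, versus total variation in y.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

# Made-up values for five hypothetical counties.
wheat = [100, 200, 300, 400, 500]   # wheat flour, g/d
mortality = [12, 9, 13, 10, 14]     # deaths per 1,000

r2 = r_squared(wheat, mortality)
print(round(r2, 2))
```

In this toy example the line explains roughly 15 percent of the variation in mortality, which leaves about 85 percent to everything else.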

R-squared coefficients of 20 percent and above would be more promising. I hate to disappoint hardcore carnivores and the fans of the “wheat is murder” theory, but these coefficients of association and variance explained are probably way less than what we would expect to see if animal protein was humanity's salvation and wheat its demise.

Moreover, the lack of association between animal protein consumption and mortality in the 35-69 age range is a bit strange, given that there is an association suggestive of a protective effect in the 70-79 age range.

Of course death happens for all kinds of reasons, not only what we eat. Still, let us take a look at some other graphs involving these foodstuffs to see if we can form a better picture of what is going on here. Below is a graph showing mortality at the two age ranges for different levels of animal protein consumption. The results are organized in quintiles.


As you can see, the participants in this study consumed relatively little animal protein. The lowest mortality in the 70-79 age range, arguably the range of higher vulnerability, was for the 28 to 35 g/d quintile of consumption. That was the highest consumption quintile. About a quarter to a third of 1 lb/d of beef, and less of seafood (in general), would give you that much animal protein.
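A quintile breakdown like the one discussed above can be sketched as follows. This version sorts the counties by consumption and splits them into five equal-sized groups, which is the textbook meaning of quintile. That is an assumption on my part: the ranges quoted in this post (e.g., 21-28 and 28-35 g/d) look like equal-width intervals, so the grouping behind the graph may have been done differently. The data below is made up.

```python
def quintile_means(x, y):
    """Sort observations by x, split them into five equal-sized groups,
    and return the mean of y in each group (lowest-x group first).
    Assumes at least five observations."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    means = []
    for q in range(5):
        group = pairs[q * n // 5:(q + 1) * n // 5]
        means.append(sum(v for _, v in group) / len(group))
    return means

# Made-up values for ten hypothetical counties.
aprot = [2, 5, 8, 11, 14, 17, 20, 23, 26, 29]        # animal protein, g/d
mor70_79 = [60, 58, 57, 59, 55, 56, 61, 63, 52, 50]  # deaths per 1,000

means = quintile_means(aprot, mor70_79)
print(means)
```

Each number in the result is the average mortality for one fifth of the counties, ordered from the lowest to the highest consumption group.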

Keep in mind that the unit of analysis here is the county, and that these results are based on county averages. I wish I had access to data on individual participants! Still I stand by my comment earlier on ecological fallacy. Don't worry too much about it just yet.

Clearly the above results and graphs contradict claims that animal protein consumption makes people die earlier, and go somewhat against the notion that animal protein consumption causes things that make people die earlier, such as cancer. But they do so in a messy way - that spike in mortality in the 70-79 age range for 21-28 g/d of animal protein is a bit strange.

Below is a graph showing mortality at the two age ranges (i.e., 35-69 and 70-79) for different levels of wheat flour consumption. Again, the results are shown in quintiles.


Without a doubt the participants in this study consumed a lot of wheat flour. The lowest mortality in the 70-79 age range, which is the range of higher vulnerability, was for the 300 to 450 g/d quintile of wheat flour consumption. The high end of this range is about 1 lb/d of wheat flour! How many slices of bread would this be equivalent to? I don’t know, but my guess is that it would be many.

Well, this is not exactly the smoking gun linking wheat with early death, a connection that has been reaching near mythical proportions on the Internetz lately. Overall, the linear trend seems to be one of decreased longevity associated with wheat flour consumption, as suggested by the WarpPLS results, but the relationship between these two variables is messy and somewhat weak. It is not even clearly nonlinear, at least in terms of the ubiquitous J-curve relationship.

Frankly, there is something odd about these results.

This oddity led me to explore, using HealthCorrelator for Excel, all ordered associations between mortality in the 35-69 and 70-79 age ranges and all of the other variables in the dataset. That in turn led me to a more complex WarpPLS analysis, which I’ll talk about in my next post, which is still being written.

I can tell you right now that there will be more oddities there, which will eventually take us to what I refer to as the mysterious factor X. Ah, by the way, that factor X is not gender - but gender leads us to it.

Monday, May 23, 2011

The China Study II: Wheat may not be so bad if you eat 221 g or more of animal food daily

In previous posts on this blog covering the China Study II data we’ve looked at the competing effects of various foods, including wheat and animal foods. Unfortunately we have had to stick to the broad group categories available from the specific data subset used; e.g., animal foods, instead of categories of animal foods such as dairy, seafood, and beef. This is still a problem, until I can find the time to get more of the China Study II data in a format that can be reliably used for multivariate analyses.

What we haven’t done yet, however, is to look at moderating effects. And that is something we can do now. A moderating effect is the effect of a variable on the effect of another variable on a third. Sounds complicated, but WarpPLS makes it very easy to test moderating effects. All you have to do is to make a variable (e.g., animal food intake) point at a direct link (e.g., between wheat flour intake and mortality). The moderating effect is shown on the graph as a dashed arrow going from a variable to a link between two variables.

The graph below shows the results of an analysis where animal food intake (Afoods) is hypothesized to moderate the effects of wheat flour intake (Wheat) on mortality in the 35 to 69 age range (Mor35_69) and mortality in the 70 to 79 age range (Mor70_79). A basic linear algorithm was used, whereby standardized partial regression coefficients, both moderating and direct, are calculated based on the equations of best-fitting lines.


From the graph above we can tell that wheat flour intake increases mortality significantly in both age ranges; in the 35 to 69 age range (beta=0.17, P=0.05), and in the 70 to 79 age range (beta=0.24, P=0.01). This is a finding that we have seen before in previous posts, and that has been one of the main findings of Denise Minger’s analysis of the China Study data. Denise and I used different data subsets and analysis methods, and reached essentially the same results.

But here is what is interesting about the moderating effects analysis results summarized on the graph above. They suggest that animal food intake significantly reduces the negative effect of wheat flour consumption on mortality in the 70 to 79 age range (beta=-0.22, P<0.01). This is a relatively strong moderating effect. The moderating effect of animal food intake is not significant for the 35 to 69 age range (beta=-0.00, P=0.50); the beta here is negative but very low, suggesting a very weak protective effect.

Below are two standardized plots showing the relationships between wheat flour intake and mortality in the 70 to 79 age range when animal food intake is low (left plot) and high (right plot). As you can see, the best-fitting line is flat on the right plot, meaning that wheat flour intake has no effect on mortality in the 70 to 79 age range when animal food intake is high. When animal food intake is low (left plot), the effect of wheat flour intake on mortality in this range is significant; its strength is indicated by the upward slope of the best-fitting line.
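The idea behind the two plots can be illustrated with a rough sketch: split the counties at the median of the moderator (animal food intake) and fit a best-fitting line for wheat versus mortality in each half. This is only an analogy for what a moderating effect looks like, not the calculation WarpPLS performs, and the data below is made up and not standardized.

```python
from statistics import median

def slope(x, y):
    """Slope of the least-squares (best-fitting) line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# Made-up (wheat g/d, animal food g/d, mortality) values for eight hypothetical counties.
counties = [
    (100, 50, 50), (200, 60, 54), (300, 70, 58), (400, 80, 62),      # low animal food
    (100, 250, 52), (200, 260, 51), (300, 270, 53), (400, 280, 52),  # high animal food
]

# Split at the median animal food intake, then fit a line in each half.
cutoff = median(c[1] for c in counties)
low = [c for c in counties if c[1] < cutoff]
high = [c for c in counties if c[1] >= cutoff]

slope_low = slope([c[0] for c in low], [c[2] for c in low])
slope_high = slope([c[0] for c in high], [c[2] for c in high])
print(round(slope_low, 3), round(slope_high, 3))
```

In this toy example the line slopes clearly upward when animal food intake is low and is nearly flat when it is high, which is the pattern a negative moderating effect produces.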


What these results seem to be telling us is that wheat flour consumption contributes to early death for several people, perhaps those who are most sensitive or intolerant to wheat. These people are represented in the variable measuring mortality in the 35 to 69 age range, and not in the 70 to 79 age range, since they died before reaching the age of 70.

Those in the 70 to 79 age range may be the least sensitive ones, for whom animal food intake seems to be protective, but only if that intake is above a certain level. This is not a ringing endorsement of wheat, but it certainly helps explain wheat consumption in long-living groups around the world, including the French.

How much animal food does it take for the protective effect to be observed? In the China Study II sample, it is about 221 g/day or more. That is approximately the intake level above which the relationship between wheat flour intake and mortality in the 70 to 79 age range becomes statistically indistinguishable from zero. That is a little less than ½ lb, or about 7.8 oz, of animal food intake per day.

Monday, April 4, 2011

The China Study II: Carbohydrates, fat, calories, insulin, and obesity

The “great blogosphere debate” rages on regarding the effects of carbohydrates and insulin on health. A lot of action has been happening recently on Peter’s blog, with knowledgeable folks chiming in, such as Peter himself, Dr. Harris, Dr. B.G. (my sista from anotha mista), John, Nigel, CarbSane, Gunther G., Ed, and many others.

I like to see open debate among people who hold different views consistently, are willing to back them up with at least some evidence, and keep on challenging each other’s views. It is very unlikely that any one person holds the whole truth regarding health matters. Unfortunately this type of debate also confuses a lot of people, particularly those blog lurkers who want to get all of their health information from one single source.

Part of that “great blogosphere debate” hinges on the effect of low or high carbohydrate dieting on total calorie consumption. Well, let us see what the China Study II data can tell us about that, and about a few other things.

WarpPLS was used to do the analyses below. For other China Study analyses, many using WarpPLS as well as HealthCorrelator for Excel, click here. For the dataset used here, visit the HealthCorrelator for Excel site and check under the sample datasets area.

The two graphs below show the relationships between various foods, carbohydrates as a percentage of total calories, and total calorie consumption. A basic linear analysis was employed here. As carbohydrates as a percentage of total calories go up, the diet generally becomes a high carbohydrate diet. As it goes down, we see a move to the low carbohydrate end of the scale.


The left parts of the two graphs above are very similar. They tell us that wheat flour consumption is very strongly and negatively associated with rice consumption; i.e., wheat flour displaces rice. They tell us that fruit consumption is positively associated with rice consumption. They also tell us that high wheat flour consumption is strongly and positively associated with being on a high carbohydrate diet.

Neither rice nor fruit consumption has a statistically significant influence on whether the diet is high or low in carbohydrates, with rice having some effect and fruit practically none. But wheat flour consumption does. Increases in wheat flour consumption lead to a clear move toward the high carbohydrate diet end of the scale.

People may find the above results odd, but they should realize that white glutinous rice is only 20 percent carbohydrate, whereas wheat flour products are usually 50 percent carbohydrate or more. Someone consuming 400 g of white rice per day, and no other carbohydrates, will be consuming only 80 g of carbohydrates per day. Someone consuming 400 g of wheat flour products will be consuming 200 g of carbohydrates per day or more.

Fruits generally have much less carbohydrate than white rice, even very sweet fruits. For example, an apple is about 12 percent carbohydrate.
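The arithmetic behind these comparisons is easy to reproduce. The short sketch below uses the percentages quoted in the text; the function name is just an illustrative choice, and the 400 g apple figure is added only to put fruit on the same footing as the other two foods.

```python
def carbohydrate_grams(food_grams, carbohydrate_percent):
    """Grams of carbohydrate in a given weight of food."""
    return food_grams * carbohydrate_percent / 100


# Percentages as quoted in the text.
rice_carbs = carbohydrate_grams(400, 20)    # white rice
wheat_carbs = carbohydrate_grams(400, 50)   # wheat flour products (50 percent or more)
apple_carbs = carbohydrate_grams(400, 12)   # apples

print(f"400 g of white rice:     {rice_carbs:.0f} g of carbohydrates")
print(f"400 g of wheat products: {wheat_carbs:.0f} g of carbohydrates or more")
print(f"400 g of apples:         {apple_carbs:.0f} g of carbohydrates")
```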

There is a measure that reflects the above differences somewhat. That measure is the glycemic load of a food; not to be confused with the glycemic index.

The right parts of the graphs above tell us something else. They tell us that the percentage of carbohydrates in one’s diet is strongly associated with total calorie consumption, and that this is not the case with percentage of fat in one’s diet.

Given the above, one may be interested in looking at the contribution of individual foods to total calorie consumption. The graph below focuses on that. The results take nonlinearity into consideration; they were generated using the Warp3 algorithm option of WarpPLS.


As you can see, wheat flour consumption is more strongly associated with total calories than rice; both associations being positive. Animal food consumption is negatively associated, somewhat weakly but statistically significantly, with total calories. Let me repeat for emphasis: negatively associated. This means that, as animal food consumption goes up, total calories consumed go down.

These results may seem paradoxical, but keep in mind that animal foods displace wheat flour in this dataset. Note that I am not saying that wheat flour consumption is a confounder; it is controlled for in the model above.

What does this all mean?

Increases in both wheat flour and rice consumption lead to increases in total caloric intake in this dataset. Wheat has a stronger effect. One plausible mechanism for this is abnormally high blood glucose elevations promoting abnormally high insulin responses. Refined carbohydrate-rich foods are particularly good at raising blood glucose fast and keeping it elevated, because they usually contain a lot of easily digestible carbohydrates. The amounts here are significantly higher than anything our body is “designed” to handle.

In normoglycemic folks, that could lead to a “lite” version of reactive hypoglycemia, leading to hunger again after a few hours following food consumption. Insulin drives calories, as fat, into adipocytes. It also keeps those calories there. If insulin is abnormally elevated for longer than it should be, one becomes hungry while storing fat; the fat that should have been released to meet the energy needs of the body. Over time, more calories are consumed; and they add up.

The above interpretation is consistent with the result that the percentage of fat in one’s diet has a statistically non-significant effect on total calorie consumption. That association, although non-significant, is negative. Again, this looks paradoxical, but in this sample animal fat displaces wheat flour.

Moreover, fat leads to no insulin response. If it comes from animal foods, fat is satiating not only because so much in our body is made of fat and/or requires fat to run properly; but also because animal fat contains micronutrients, and helps with the absorption of those micronutrients.

Fats from oils, even the healthy ones like coconut oil, just do not have the latter properties to the same extent as unprocessed fats from animal foods. Think slow-cooking meat with some water, making it release its fat, and then consuming all that fat as a sauce together with the meat.

In the absence of industrialized foods, typically we feel hungry for those foods that contain nutrients that our body needs at a particular point in time. This is a subconscious mechanism, which I believe relies in part on past experience; the reason why we have “acquired tastes”.

Incidentally, fructose leads to no insulin response either. Fructose is naturally found mostly in fruits, in relatively small amounts when compared with industrial foods rich in refined sugars.

And no, the pancreas does not get “tired” from secreting insulin.

The more refined a carbohydrate-rich food is, the more carbohydrates it tends to pack per unit of weight. Carbohydrates also contribute calories; about 4 calories per g. Thus more carbohydrates should translate into more calories.

If someone consumes 50 g of carbohydrates per day in excess of caloric needs, that will translate into about 22.2 g of body fat being stored. Over a month, that will be approximately 666.7 g. Over a year, that will be 8 kg, or 17.6 lbs. Over 5 years, that will be 40 kg, or 88 lbs. This is only from carbohydrates; it does not consider other macronutrients.
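The numbers above can be checked with a few lines of code. This is only a back-of-the-envelope sketch: it assumes about 9 calories per g of fat, a 30-day month, a 12-month year, and that all excess calories end up stored as body fat. Those assumptions are not stated in the text; they are the ones that reproduce its figures.

```python
KCAL_PER_G_CARBOHYDRATE = 4  # from the text
KCAL_PER_G_FAT = 9           # assumption: usual approximation for fat
DAYS_PER_MONTH = 30          # assumption
MONTHS_PER_YEAR = 12         # assumption (i.e., a 360-day year)
LB_PER_KG = 2.20462

excess_carbohydrate_g_per_day = 50

# Excess calories per day, expressed as grams of stored fat.
fat_g_per_day = (
    excess_carbohydrate_g_per_day * KCAL_PER_G_CARBOHYDRATE / KCAL_PER_G_FAT
)
fat_g_per_month = fat_g_per_day * DAYS_PER_MONTH
fat_kg_per_year = fat_g_per_month * MONTHS_PER_YEAR / 1000
fat_kg_5_years = fat_kg_per_year * 5

print(f"per day:   {fat_g_per_day:.1f} g")
print(f"per month: {fat_g_per_month:.1f} g")
print(f"per year:  {fat_kg_per_year:.1f} kg ({fat_kg_per_year * LB_PER_KG:.1f} lbs)")
print(f"5 years:   {fat_kg_5_years:.0f} kg ({fat_kg_5_years * LB_PER_KG:.0f} lbs)")
```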

There is no need to resort to the “tired pancreas” theory of late-onset insulin resistance to explain obesity in this context. Insulin resistance is, more often than not, a direct result of obesity. Type 2 diabetes is by far the most common type of diabetes; and most type 2 diabetics become obese or overweight before they become diabetic. There is clearly a genetic effect here as well, which seems to moderate the relationship between body fat gain and liver as well as pancreas dysfunction.

It is not that hard to become obese consuming refined carbohydrate-rich foods. It seems to be much harder to become obese consuming animal foods, or fruits.

Monday, March 7, 2011

The China Study II: Fruit consumption and mortality

I ran several analyses on the effects of fruit consumption on mortality on the China Study II dataset using WarpPLS. For other China Study analyses, many using WarpPLS as well as HCE, click here.

The results are pretty clear – fruit consumption has no significant effect on mortality.

The bar charts figure below shows what seems to be a slight downward trend in mortality, in the 35-69 and 70-79 age ranges, apparently due to fruit consumption.


As it turns out, that slight trend may be due to something else: in the China Study II dataset, fruit consumption is positively associated with both animal protein and fat consumption. And, as we have seen from previous analyses (e.g., this one), the latter two seem to be protective.

So, if you like to eat fruit, maybe you should also make sure that you eat animal protein and fat as well.

Monday, February 21, 2011

The China Study II: Wheat, dietary fat, and mortality

In this post on the China Study II data we have seen that wheat apparently displaces dietary fat a lot, primarily fat from animal sources. We have also seen in that post that wheat is strongly and positively associated with mortality in both the 35-69 and 70-79 age ranges, whereas dietary fat is strongly and negatively associated with mortality in those ranges.

This opens the door for the hypothesis that wheat increased mortality in the China Study II sample mainly by displacing dietary fat, and not necessarily by being a primary cause of health problems. In fact, given the strong displacement effect discussed in the previous post, I thought that this hypothesis was quite compelling. I was partly wrong, as you’ll see below.

A counterintuitive hypothesis no doubt, given that wheat is unlikely to have been part of the diet of our Paleolithic ancestors, and thus the modern human digestive tract may be maladapted to it. Moreover, wheat’s main protein (gluten) is implicated in celiac disease, and wheat contains plant toxins such as wheat germ agglutinin.

Still, we cannot completely ignore this hypothesis because: (a) the data points in its general direction; and (b) wheat-based foods are found in way more than trivial amounts in the diets of populations that have relatively high longevity, such as the French.

Testing the hypothesis essentially amounts to testing the significance of two mediating effects; of fat as a mediator of the effects of wheat on mortality, in both the 35-69 and 70-79 age ranges. There are two main approaches for doing this. One is the classic test discussed by Baron & Kenny (1986). The other is the modern test discussed by Preacher & Hayes (2004), and extended by Hayes & Preacher (2010) for nonlinear relationships.

I tested the mediating effects using both approaches, including the nonlinear variation. I used the software WarpPLS for this; the results below are from WarpPLS outputs. Other analyses of the China Study data using WarpPLS can be found here (calorie restriction and longevity), and here (wheat, rice, and cardiovascular disease). For yet other studies, click here.
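To make the idea of a mediating effect more concrete, here is a minimal linear sketch. It is not what WarpPLS does internally, and the data are synthetic rather than from the China Study II; the assumed path strengths are made up. With standardized variables, the total effect of wheat on mortality (c) splits into a direct effect (c') and an indirect effect through fat (a times b). The significance tests discussed in the papers cited above are left out; only the point estimates are shown.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Synthetic data (hypothetical): wheat lowers fat intake, fat lowers
# mortality, and wheat also raises mortality directly.
rng = random.Random(7)
n = 2000
wheat = [rng.gauss(0, 1) for _ in range(n)]
fat = [-0.4 * w + rng.gauss(0, 1) for w in wheat]
mortality = [0.2 * w - 0.3 * f + rng.gauss(0, 1) for w, f in zip(wheat, fat)]

r_wf = pearson(wheat, fat)
r_wm = pearson(wheat, mortality)
r_fm = pearson(fat, mortality)

# Standardized coefficients for the two-predictor regression of
# mortality on wheat and fat, written in terms of correlations.
a = r_wf                                          # wheat -> fat
b = (r_fm - r_wm * r_wf) / (1 - r_wf ** 2)        # fat -> mortality, wheat controlled
c_prime = (r_wm - r_fm * r_wf) / (1 - r_wf ** 2)  # wheat -> mortality, fat controlled
c = r_wm                                          # total effect of wheat
indirect = a * b                                  # effect of wheat via fat

print(f"total effect c:          {c:.3f}")
print(f"direct effect c':        {c_prime:.3f}")
print(f"indirect effect a*b:     {indirect:.3f}")
```

If the direct effect stays clearly away from zero after the mediator is added, the mediator is only a partial one.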

The graphs below show the path coefficients and chance probabilities of two models. The one at the top-left suggests that wheat flour consumption seems to be associated with a statistically significant increase in mortality in the 70-79 age range (beta=0.23; P=0.04). The effect in the 35-69 age range is almost statistically significant (beta=0.22; P=0.09); the likelihood that it is due to chance is 9 percent (this is the meaning of the P=0.09=9/100=9%).


The graph at the bottom-right suggests that the variable “FatCal”, which is the percentage of calories coming from dietary fat, is indeed a significant mediator of the relationships above between wheat and mortality, in both ranges. But “FatCal” is only a partial mediator.

The reason why “FatCal” is not a “perfect” mediator is that the direct effects of wheat on mortality in both ranges are still relatively strong after “FatCal” is added to the model (i.e., controlled for). In fact, the effects of wheat on mortality don’t change that much with the introduction of the variable “FatCal”.

This analysis suggests that, in the China Study II sample, one of wheat’s main sins might indeed have been to displace dietary fat from animal sources. Wheat consumption is strongly and negatively associated with dietary fat (beta=-0.37; P<0.01), and dietary fat is relatively strongly and negatively associated with mortality in both ranges (more in the 70-79 age range).

Why is dietary fat more protective in the 70-79 than in the 35-69 age range, with the latter effect only being significant at the P=0.10 level (a 10 percent chance probability)? My interpretation is that, as with almost any dietary habit, it takes years for a chronically low fat diet to lead to problems. See graph below; fat was not a huge contributor to the total calorie intake in this sample.


The analysis suggests that wheat also caused problems via other paths. What are they? We can’t say for sure based on this dataset. Perhaps the paths involve lectins and/or gluten. One way or another, the relationship is complex. As you can see from the graph below, the relationship between wheat consumption and mortality is nonlinear for the 70-79 age range, most likely due to confounding factors. The effect size is small for the 35-69 age range, even though it looks linear or quasi-linear in that range.


As you might recall from this post, rice does NOT displace dietary fat, and it seems to be associated with increased longevity. Carbohydrate content per se does not appear to be the problem here. Both rice and wheat foods are rich in carbohydrates, and have a high glycemic index. Wheat products tend to have a higher glycemic load though.

And why is dietary fat so important as to be significantly associated with increased longevity? This is not a trivial question, because if too much of that fat is stored as body fat it will actually decrease longevity. Dietary fat is very calorie-dense, and can be easily stored as body fat.

Dietary fat is important for various reasons, and probably some that we don’t know about yet. It leads to the formation of body fat, which is not only found in adipocytes or used only as a store of energy. Fat is a key component of a number of important tissues, including 60 percent of our brain. Since fat in the human body undergoes constant turnover, more in some areas than others, lack of dietary fat may compromise the proper functioning of various organs.

Without dietary fat, the very important fat-soluble vitamins (A, D, E and K) cannot be properly absorbed. Taking these vitamins in supplemental form will not work if you don’t consume fat as well. A very low fat diet is almost by definition a diet deficient in fat-soluble vitamins, even if those vitamins are consumed in large amounts via supplements.

Moreover, animals store fat-soluble vitamins in their body fat (as well as in organs), so we get these vitamins in one of their most natural and potent forms when we consume animal fat. Consuming copious amounts of olive and/or coconut oil will not have quite the same effect.

References

Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality & Social Psychology, 51(6), 1173-1182.

Preacher, K.J., & Hayes, A.F. (2004). SPSS and SAS procedures for estimating indirect effects in simple mediation models. Behavior Research Methods, Instruments, & Computers, 36 (4), 717-731.

Hayes, A. F., & Preacher, K. J. (2010). Quantifying and testing indirect effects in simple mediation models when the constituent paths are nonlinear. Multivariate Behavioral Research, 45(4), 627-660.

Sunday, January 30, 2011

The China Study II: A look at mortality in the 35-69 and 70-79 age ranges

This post is based on an analysis of a subset of the China Study II data, using HealthCorrelator for Excel (HCE), which is publicly available for download and use on a free trial basis. You can access the original data on the HCE web site, under “Sample datasets”.

HCE was designed to be used with small and individual personal datasets, but it can also be used with larger datasets for multiple individuals.

This analysis focuses on two main variables from the China Study II data: mortality in the 35-69 age range, and mortality in the 70-79 range. The table below shows the coefficients of association calculated by HCE for those two variables. The original variable labels are shown.


One advantage of looking at mortality in these ranges is that they are more likely to reflect the impact of degenerative diseases. Infectious diseases likely killed a lot of children in China at the time the data was being collected. Heart disease, on the other hand, is likely to have killed more people in the 35-69 and 70-79 ranges.

It is also good to have data for both ranges, because factors that likely increased longevity were those that were associated with decreased mortality in both ranges. For example, a factor that was strongly associated with mortality in the 35-69 range, but not the 70-79 range, might simply be very deadly in the former range.

The mortalities in both ranges are strongly correlated with each other, which is to be expected. Next, at the very top for both ranges, is sex. Being female is by far the variable with the strongest, and negative, association with mortality.

While I would expect females to live longer, the strengths of the associations make me think that there is something else going on here. Possibly different dietary or behavioral patterns displayed by females. Maybe smoking cigarettes or alcohol abuse was a lot less prevalent among them.

Markedly different lifestyle patterns between males and females may be a major confounding variable in the China Study sample.

Some of the variables are redundant; meaning that they are highly correlated and seem to measure the same thing. This is clear when one looks at the other coefficients of association generated by HCE.

For example, plant food consumption is strongly and negatively correlated with animal food consumption; so strongly that you could use either one of these two variables to measure the other, after inverting the scale. The same is true for consumption of rice and white flour.

Plant food consumption is not strongly correlated with plant protein consumption; many plant foods have little protein in them. The ones that have high protein content are typically industrialized and seed-based. The type of food most strongly associated with plant protein consumption is white flour, by far. The correlation is .645.

The figure below is based on the table above. I opened a separate instance of Excel, and copied the coefficients generated by HCE into it. Then I built two bar charts with them. The variable labels were replaced with more suggestive names, and some redundant variables were removed. Only the top 7 variables are shown, ordered from left to right on the bar charts in order of strength of association. The ones above the horizontal axis possibly increase mortality in each age range, whereas the ones at the bottom possibly decrease it.


When you look at these results as a whole, a few things come to mind.

White flour consumption doesn’t seem to be making people live longer; nor does plant food consumption in general. For white flour, it is quite the opposite. Plant food consumption reflects white flour consumption to a certain extent, especially in counties where rice consumption is low. These conclusions are consistent with previous analyses using more complex statistics.

Total food is positively associated with mortality in the 35-69 range, but not the 70-79 range. This may reflect the fact that folks who reach the age of 70 tend to naturally eat in moderation, so you don’t see wide variations in food consumption among those folks.

Eating in moderation does not mean practicing severe calorie restriction. This post suggests that calorie restriction doesn't seem to be associated with increased longevity in this sample. Eating well, but not too much, is.

The bar for rice (consumption) on the left chart is likely a mirror reflection of the white flour consumption, so it may appear to be good in the 35-69 range simply because it reflects reduced white flour consumption in that range.

Green vegetables seem to be good when you consider the 35-69 range, but not the 70-79 range.

Neither rice nor green vegetables seem to be bad either. For overall longevity they may well be neutral, with the benefits likely coming from their replacement of white flour in the diet.

Dietary fat seems protective overall, particularly together with animal foods in the 70-79 range. This may simply reflect a delayed protective effect of animal fat and protein consumption.

The protective effect of dietary fat becomes clear when we look at the relationship between carbohydrate calories and fat calories. Their correlation is -.957, which essentially means that carbohydrate intake seriously displaces fat intake.

Carbohydrates themselves may not be the problem, even if coming from high glycemic foods (except wheat flour, apparently). This post shows that they are relatively benign if coming from high glycemic rice, even at high intakes of 206 to 412 g/day. The problem seems to be caused by carbohydrates displacing nutrient-dense animal foods.

Interestingly, rice does not displace animal foods or fat in the diet. It is positively correlated with them. Wheat flour, on the other hand, displaces those foods. Wheat flour is negatively and somewhat strongly correlated with consumption of animal foods, as well as with animal fat and protein.

There are certainly several delayed effects here, which may be distorting the results somewhat.  Degenerative diseases don’t develop fast and kill folks right away. They often require many years of eating and doing the wrong things to be fatal.

Tuesday, October 5, 2010

The China Study II: Does calorie restriction increase longevity?

The idea that calorie restriction extends human life comes largely from studies of other species. The most relevant of those studies have been conducted with primates, where it has been shown that primates that eat a restricted calorie diet live longer and healthier lives than those that are allowed to eat as much as they want.

There are two main problems with many of the animal studies of calorie restriction. One is that, as natural lifespan decreases, it becomes progressively easier to experimentally obtain major relative lifespan extensions. (That is, it seems much easier to double the lifespan of an organism whose natural lifespan is one day than an organism whose natural lifespan is 80 years.) The second, and main problem in my mind, is that the studies often compare obese with lean animals.

Obesity clearly reduces lifespan in humans, but that is a different claim than the one that calorie restriction increases lifespan. It has often been claimed that Asian countries and regions where calorie intake is reduced display increased lifespan. And this may well be true, but the question remains as to whether this is due to calorie restriction increasing lifespan, or because the rates of obesity are much lower in countries and regions where calorie intake is reduced.

So, what can the China Study II data tell us about the hypothesis that calorie restriction increases longevity?

As it turns out, we can conduct a preliminary test of this hypothesis based on a key assumption. Let us say we compared two populations (e.g., counties in China), based on the following ratio: number of deaths at or after age 70 divided by number of deaths before age 70. Let us call this the “ratio of longevity” of a population, or RLONGEV. The assumption is that the population with the highest RLONGEV would be the population with the highest longevity of the two. The reason is that, as longevity goes up, one would expect to see a shift in death patterns, with progressively more people dying old and fewer people dying young.

The 1989 China Study II dataset has two variables that we can use to estimate RLONGEV. They are coded as M005 and M006, and refer to the mortality rates from 35 to 69 and 70 to 79 years of age, respectively. Unfortunately there is no variable for mortality after 79 years of age, which limits the scope of our results somewhat. (This does not totally invalidate the results because we are using a ratio as our measure of longevity, not the absolute number of deaths from 70 to 79 years of age.) Take a look at these two previous China Study II posts (here, and here) for other notes, most of which apply here as well. The notes are at the end of the posts.
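As a minimal sketch, the ratio can be computed as below. The function name and the two pairs of mortality values are hypothetical, picked only to show how two counties would be compared; they are not taken from the dataset.

```python
def rlongev(mortality_70_79, mortality_35_69):
    """Ratio of longevity: mortality at 70-79 (M006) over mortality at 35-69 (M005)."""
    if mortality_35_69 <= 0:
        raise ValueError("mortality in the 35-69 range must be positive")
    return mortality_70_79 / mortality_35_69


# Hypothetical counties (made-up mortality rates).
county_a = rlongev(mortality_70_79=60.0, mortality_35_69=15.0)
county_b = rlongev(mortality_70_79=45.0, mortality_35_69=15.0)

print(f"county A: RLONGEV = {county_a:.1f}")
print(f"county B: RLONGEV = {county_b:.1f}")
```

Under the assumption described above, the county with the higher ratio would be taken as the one with the higher longevity.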

All of the results reported here are from analyses conducted using WarpPLS. Below is a model with coefficients of association; it is a simple model, since the hypothesis that we are testing is also simple. (Click on it to enlarge. Use the "Ctrl" and "+" keys to zoom in, and "Ctrl" and "-" to zoom out.) The arrows explore associations between variables, which are shown within ovals. The meaning of each variable is the following: TKCAL = total calorie intake per day; RLONGEV = ratio of longevity; SexM1F2 = sex, with 1 assigned to males and 2 to females.



As one would expect, being female is associated with increased longevity, but the association is just shy of being statistically significant in this dataset (beta=0.14; P=0.07). The association between total calorie intake and longevity is trivial, and statistically indistinguishable from zero (beta=-0.04; P=0.39). Moreover, even though this very weak association is overall negative (or inverse), the sign of the association here does not fully reflect the shape of the association. The shape is that of an inverted J-curve; a.k.a. inverted U-curve. When we split the data into total calorie intake terciles we get a better picture:


The second tercile, which refers to a total daily calorie intake of 2193 to 2844 calories, is the one associated with the highest longevity. The first tercile (with the lowest range of calories) is associated with a higher longevity than the third tercile (with the highest range of calories). These results need to be viewed in context. The average weight in this dataset was about 116 lbs. A conservative estimate of the number of calories needed to maintain this weight without any physical activity would be about 1740. Add about 700 calories to that, for a reasonable and healthy level of physical activity, and you get 2440 calories needed daily for weight maintenance. That is right in the middle of the second tercile.
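A tercile split like this one is simple to sketch in code. The example below sorts data points by calorie intake, cuts them into three equally sized groups, and averages the ratio of longevity in each group. The nine data points are hypothetical, made up so that the middle group comes out on top; they are not rows from the China Study II dataset.

```python
def tercile_means(pairs):
    """Mean RLONGEV for the low, middle, and high calorie intake terciles.

    pairs: list of (total_calories, rlongev) tuples.
    """
    ordered = sorted(pairs)  # sorts by calories first
    n = len(ordered)
    cut1, cut2 = n // 3, 2 * n // 3
    groups = [ordered[:cut1], ordered[cut1:cut2], ordered[cut2:]]
    return [sum(r for _, r in group) / len(group) for group in groups]


# Hypothetical (calories, RLONGEV) data points.
hypothetical = [
    (1900, 1.0), (2000, 1.2), (2100, 1.1),
    (2300, 1.5), (2500, 1.6), (2700, 1.4),
    (2900, 0.9), (3100, 0.8), (3300, 1.0),
]

low, middle, high = tercile_means(hypothetical)
print(f"first tercile:  {low:.2f}")
print(f"second tercile: {middle:.2f}")
print(f"third tercile:  {high:.2f}")
```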

In simple terms, the China Study II data seems to suggest that those who eat well, but not too much, live the longest. Those who eat little have slightly lower longevity. Those who eat too much seem to have the lowest longevity, perhaps because of the negative effects of excessive body fat.

Because these trends are all very weak from a statistical standpoint, we have to take them with caution. What we can say with more confidence is that the China Study II data does not seem to support the hypothesis that calorie restriction increases longevity.

Reference

Kock, N. (2010). WarpPLS 1.0 User Manual. Laredo, Texas: ScriptWarp Systems.

Notes

- The path coefficients (indicated as beta coefficients) reflect the strength of the relationships; they are a bit like standard univariate (or Pearson) correlation coefficients, except that they take into consideration multivariate relationships (they control for competing effects on each variable). Whenever nonlinear relationships were modeled, the path coefficients were automatically corrected by the software to account for nonlinearity.

- Only two data points per county were used (for males and females). This increased the sample size of the dataset without artificially reducing variance, which is desirable since the dataset is relatively small (each county, not individual, is a separate data point in this dataset). This also allowed for the test of commonsense assumptions (e.g., the protective effects of being female), which is always a good idea in multivariate analyses because violation of commonsense assumptions may suggest data collection or analysis error. On the other hand, it required the inclusion of a sex variable as a control variable in the analysis, which is no big deal.

- Mortality from schistosomiasis infection (MSCHIST) does not confound the results presented here. Only counties where no deaths from schistosomiasis infection were reported have been included in this analysis. The reason for this is that mortality from schistosomiasis infection can severely distort the results in the age ranges considered here. On the other hand, removal of counties with deaths from schistosomiasis infection reduced the sample size, and thus decreased the statistical power of the analysis.

Friday, September 17, 2010

Strong causation can exist without any correlation: The strange case of the chain smokers, and a note about diet

Researchers like to study samples of data and look for associations between variables. Often those associations are represented in the form of correlation coefficients, which go from -1 to 1. Another popular measure of association is the path coefficient, which usually has a narrower range of variation. What many researchers seem to forget is that the associations they find depend heavily on the sample they are looking at, and on the ranges of variation of the variables being analyzed.

A forgotten warning: Causation without correlation

Often those who conduct multivariate statistical analyses on data are unaware of certain limitations. Many times this is due to lack of familiarity with statistical tests. One warning we do see a lot though is: Correlation does not imply causation. This is, of course, absolutely true. If you take my weight from 1 to 20 years of age, and the price of gasoline in the US during that period, you will find that they are highly correlated. But common sense tells me that there is no causation whatsoever between these two variables.

So correlation does not imply causation alright, but there is another warning that is rarely seen: There can be strong causation without any correlation. Of course this can lead to even more bizarre conclusions than the “correlation does not imply causation” problem. If there is strong causation between variables B and Y, and it is not showing as a correlation, another variable A may “jump in” and “steal” that “unused correlation”; so to speak.

The chain smokers “study”

To illustrate this point, let us consider the following fictitious case, a study of “100 cities”. The study focuses on the effect of smoking and genes on lung cancer mortality. Smoking significantly increases the chances of dying from lung cancer; it is a very strong causative factor. Here are a few more details. Between 35 and 40 percent of the population are chain smokers. And there is a genotype (a set of genes), found in a small percentage of the population (around 7 percent), which is protective against lung cancer. All of those who are chain smokers die from lung cancer unless they die from other causes (e.g., accidents). Dying from other causes is a lot more common among those who have the protective genotype.

(I created this fictitious data with these associations in mind, using equations. I also added uncorrelated error into the equations, to make the data look a bit more realistic. For example, random deaths occurring early in life would reduce slightly any numeric association between chain smoking and cancer deaths in the sample of 100 cities.)

The table below shows part of the data, and gives an idea of the distribution of percentage of smokers (Smokers), percentage with the protective genotype (Pgenotype), and percentage of lung cancer deaths (MLCancer). (Click on it to enlarge. Use the "CTRL" and "+" keys to zoom in, and "CTRL" and "-" to zoom out.) Each row corresponds to a city. The rest of the data, up to row 100, has a similar distribution.


The graphs below show the distribution of lung cancer deaths against: (a) the percentage of smokers, at the top; and (b) the percentage with the protective genotype, at the bottom. Correlations are shown at the top of each graph. (They can vary from -1 to 1. The closer they are to -1 or 1, the stronger is the association, negative or positive, between the variables.) The correlation between lung cancer deaths and percentage of smokers is slightly negative and statistically insignificant (-0.087). The correlation between lung cancer deaths and percentage with the protective genotype is negative, strong, and statistically significant (-0.613).


Even though smoking significantly increases the chances of dying from lung cancer, the correlations tell us otherwise. The correlations tell us that smoking does not seem to cause lung cancer deaths, and that having the protective genotype seems to significantly decrease cancer deaths. Why?

If there is no variation, there is no correlation

The reason is that the “researchers” collected data only about chain smokers. That is, the variable “Smokers” includes only chain smokers. If this was not a fictitious case, focusing the study on chain smokers could be seen as a clever strategy employed by researchers funded by tobacco companies. The researchers could say something like this: “We focused our analysis on those most likely to develop lung cancer.” Or, this could have been the result of plain stupidity when designing the research project.

By restricting their study to chain smokers the researchers dramatically reduced the variability in one particular variable: the extent to which the study participants smoked. Without variation, there can be no correlation. No matter what statistical test or software is used, no significant association will be found between lung cancer deaths and percentage of smokers based on this dataset. No matter what statistical test or software is used, a significant and strong association will be found between lung cancer deaths and percentage with the protective genotype.

Of course, this could lead to a very misleading conclusion. Smoking does not cause lung cancer; the real cause is genetic.
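The range-restriction effect described in this section can be reproduced with a small simulation. The sketch below is not the procedure used to generate the fictitious dataset above; the causal equation, its coefficients, the noise level, and the function names are all assumptions chosen for illustration. It generates 100 hypothetical cities twice, once with the percentage of smokers confined to 35-40 and once with it spread over 0-60, and compares the resulting correlations.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(smokers_low, smokers_high, n_cities=100, seed=1):
    """Return (r_smokers, r_genotype) for one simulated set of cities.

    Hypothetical causal model (an assumption, not the post's equations):
    deaths rise with the percentage of smokers, fall with the percentage
    carrying the protective genotype, and include uncorrelated error.
    """
    rng = random.Random(seed)
    smokers, genotype, deaths = [], [], []
    for _ in range(n_cities):
        s = rng.uniform(smokers_low, smokers_high)  # % chain smokers
        g = rng.uniform(4.0, 10.0)                  # % with protective genotype
        d = 25.0 + 0.5 * s - 2.0 * g + rng.gauss(0.0, 2.0)
        smokers.append(s)
        genotype.append(g)
        deaths.append(d)
    return pearson(smokers, deaths), pearson(genotype, deaths)


# Same causal equation in both runs; only the range of "smokers" differs.
r_s_narrow, r_g_narrow = simulate(35.0, 40.0)
r_s_wide, r_g_wide = simulate(0.0, 60.0)

print(f"Smokers 35-40%: r(smokers)={r_s_narrow:.3f}, r(genotype)={r_g_narrow:.3f}")
print(f"Smokers  0-60%: r(smokers)={r_s_wide:.3f}, r(genotype)={r_g_wide:.3f}")
```

The causal effect of smoking (the 0.5 in the equation) is identical in both runs. Only the range of variation differs, and with it the size of the correlation.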

A note about diet

Consider the analogy between smoking and consumption of a particular food, and you will probably see what this means for the analysis of observational data regarding dietary choices and disease. This applies to almost any observational study, including the China Study. (Studies employing experimental control manipulations would presumably ensure enough variation in the variables studied.) In the China Study, data from dozens of counties were collected. One may find a significant association between consumption of food A and disease Y.

There may be a much stronger association between food B and disease Y, but that association may not show up in statistical analyses at all, simply because there is little variation in the data regarding consumption of food B. For example, all those sampled may have eaten food B; about the same amount. Or none. Or somewhere in between, within a rather small range of variation.

Statistical illiteracy, bad choices, and taxation

Statistics is a “necessary evil”. It is useful to go from small samples to large ones when we study any possible causal association. By doing so, one can find out whether an observed effect really applies to a larger percentage of the population, or is actually restricted to a small group of individuals. The problem is that we humans are very bad at inferring actual associations from simply looking at large tables with numbers. We need statistical tests for that.

However, ignorance about basic statistical phenomena, such as the one described here, can be costly. A group of people may eliminate food A from their diet based on coefficients of association resulting from what seem to be very clever analyses, replacing it with food B. The problem is that food B may be equally harmful, or even more harmful. And that effect may not show up in statistical analyses unless they have enough variation in the consumption of food B.

Readers of this blog may wonder why we explicitly use terms like “suggests” when we refer to a relationship that is suggested by a significant coefficient of association (e.g., a linear correlation). This is why, among other reasons.

One does not have to be a mathematician to understand basic statistical concepts. And doing so can be very helpful in one’s life in general, not only in diet and lifestyle decisions. Even in simple choices, such as what to bet on. We are always betting on something. For example, any investment is essentially a bet. Some outcomes are much more probable than others.

Once I had an interesting conversation with a high-level officer of a state government. I was part of a consulting team working on an information technology project. We were talking about the state lottery, which was a big source of revenue for the state, comparing it with state taxes. He told me something to this effect:

Our lottery is essentially a tax on the statistically illiterate.

Sunday, September 12, 2010

The China Study II: Wheat flour, rice, and cardiovascular disease

In my last post on the China Study II, I analyzed the effect of total and HDL cholesterol on mortality from all cardiovascular diseases. The main conclusion was that total and HDL cholesterol were protective. Total and HDL cholesterol usually increase with intake of animal foods, and particularly of animal fat. The lowest mortality from all cardiovascular diseases was in the highest total cholesterol range, 172.5 to 180; and the highest mortality in the lowest total cholesterol range, 120 to 127.5. The difference was quite large; the mortality in the lowest range was approximately 3.3 times higher than in the highest.

This post focuses on the intake of two main plant foods, namely wheat flour and rice intake, and their relationships with mortality from all cardiovascular diseases. After many exploratory multivariate analyses, wheat flour and rice emerged as the plant foods with the strongest associations with mortality from all cardiovascular diseases. Moreover, wheat flour and rice have a strong and inverse relationship with each other, which suggests a “consumption divide”. Since the data is from China in the late 1980s, it is likely that consumption of wheat flour is even higher now. As you’ll see, this picture is alarming.

The main model and results

All of the results reported here are from analyses conducted using WarpPLS. Below is the model with the main results of the analyses. (Click on it to enlarge. Use the "CTRL" and "+" keys to zoom in, and "CTRL" and "-" to zoom out.) The arrows explore associations between variables, which are shown within ovals. The meaning of each variable is the following: SexM1F2 = sex, with 1 assigned to males and 2 to females; MVASC = mortality from all cardiovascular diseases (ages 35-69); TKCAL = total calorie intake per day; WHTFLOUR = wheat flour intake (g/day); and RICE = rice intake (g/day).


The variables to the left of MVASC are the main predictors of interest in the model. The one to the right is a control variable – SexM1F2. The path coefficients (indicated as beta coefficients) reflect the strength of the relationships. A negative beta means that the relationship is negative; i.e., an increase in a variable is associated with a decrease in the variable that it points to. The P values indicate the statistical significance of the relationship; a P lower than 0.05 generally means a significant relationship (95 percent or higher likelihood that the relationship is “real”).
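For readers who want to see how controlling for a competing predictor changes a coefficient, here is a minimal linear sketch. It is not WarpPLS's algorithm, which also models nonlinear relationships; it uses the textbook formula for standardized regression coefficients with two predictors. The function name and the three correlations passed in are made up for illustration and are not results from the China Study II data.

```python
def standardized_betas(r_y1, r_y2, r_12):
    """Standardized coefficients for an outcome regressed on two predictors.

    r_y1, r_y2: correlations of the outcome with predictors 1 and 2.
    r_12: correlation between the two predictors.
    """
    denom = 1.0 - r_12 ** 2
    if denom == 0.0:
        raise ValueError("predictors are perfectly correlated")
    beta_1 = (r_y1 - r_y2 * r_12) / denom
    beta_2 = (r_y2 - r_y1 * r_12) / denom
    return beta_1, beta_2


# Hypothetical correlations, for illustration only:
# outcome vs. wheat = 0.5, outcome vs. rice = -0.5, wheat vs. rice = -0.8
beta_wheat, beta_rice = standardized_betas(r_y1=0.5, r_y2=-0.5, r_12=-0.8)
print(f"beta(wheat)={beta_wheat:.3f}, beta(rice)={beta_rice:.3f}")
```

With these made-up inputs, each coefficient is smaller in magnitude than the corresponding raw correlation, because part of each predictor's association with the outcome is shared with the other predictor.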

In summary, the model above seems to be telling us that:

- As rice intake increases, wheat flour intake decreases significantly (beta=-0.84; P<0.01). This relationship would be the same if the arrow pointed in the opposite direction. It suggests that there is a sharp divide between rice-consuming and wheat flour-consuming regions.

- As wheat flour intake increases, mortality from all cardiovascular diseases increases significantly (beta=0.32; P<0.01). This is after controlling for the effects of rice and total calorie intake. That is, wheat flour seems to have some inherent properties that make it bad for one’s health, even if one doesn’t consume that many calories.

- As rice intake increases, mortality from all cardiovascular diseases decreases significantly (beta=-0.24; P<0.01). This is after controlling for the effects of wheat flour and total calorie intake. That is, this effect is not entirely due to rice being consumed in place of wheat flour. Still, as you’ll see later in this post, this relationship is nonlinear. Excessive rice intake does not seem to be very good for one’s health either.

- Increases in wheat flour and rice intake are significantly associated with increases in total calorie intake (betas=0.25, 0.33; P<0.01). This may be due to wheat flour and rice intake: (a) being themselves, in terms of their own caloric content, main contributors to the total calorie intake; or (b) causing an increase in calorie intake from other sources. The former is more likely, given the effect below.

- The effect of total calorie intake on mortality from all cardiovascular diseases is insignificant when we control for the effects of rice and wheat flour intakes (beta=0.08; P=0.35). This suggests that neither wheat flour nor rice exerts an effect on mortality from all cardiovascular diseases by increasing total calorie intake from other food sources.

- Being female is significantly associated with a reduction in mortality from all cardiovascular diseases (beta=-0.24; P=0.01). This is to be expected. In other words, men are women with a few design flaws, so to speak. (This situation reverses itself a bit after menopause.)

Wheat flour displaces rice

The graph below shows the shape of the association between wheat flour intake (WHTFLOUR) and rice intake (RICE). The values are provided in standardized format; e.g., 0 is the mean (a.k.a. average), 1 is one standard deviation above the mean, and so on. The curve is the best-fitting U curve obtained by the software. It actually has the shape of an exponential decay curve, which can be seen as a section of a U curve. This suggests that wheat flour consumption has strongly displaced rice consumption in several regions in China, and also that wherever rice consumption is high wheat flour consumption tends to be low.
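As a brief aside on the standardized format used in these graphs, the sketch below converts raw values to standardized ones. The intake values are hypothetical, and the use of the sample standard deviation is an assumption; the software may use a slightly different convention.

```python
import statistics


def standardize(values):
    """Convert raw values to z-scores: 0 is the mean, 1 is one SD above it."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]


rice_g_per_day = [50, 120, 200, 310, 420, 560]  # hypothetical county values
z_scores = standardize(rice_g_per_day)
for raw, z in zip(rice_g_per_day, z_scores):
    print(f"{raw:>4} g/day -> {z:+.2f}")
```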


As wheat flour intake goes up, so does cardiovascular disease mortality

The graphs below show the shapes of the association between wheat flour intake (WHTFLOUR) and mortality from all cardiovascular diseases (MVASC). In the first graph, the values are provided in standardized format; e.g., 0 is the mean (or average), 1 is one standard deviation above the mean, and so on. In the second graph, the values are provided in unstandardized format and organized in terciles (each of three equal intervals).



The curve in the first graph is the best-fitting U curve obtained by the software. It is a quasi-linear relationship. The higher the consumption of wheat flour in a county, the higher seems to be the mortality from all cardiovascular diseases. The second graph suggests that mortality in the third tercile, which represents a consumption of wheat flour of 501 to 751 g/day (a lot!), is 69 percent higher than mortality in the first tercile (0 to 251 g/day).
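The tercile comparison can be sketched as follows, assuming (as the ranges quoted above suggest) that the terciles are three equal-width intervals over the observed range of intake. The function name and the data are hypothetical; they only show the mechanics of the calculation.

```python
def tercile_means(xs, ys):
    """Mean of ys within each of three equal-width intervals of xs.

    Returns a list of three means (None for an interval with no data).
    """
    lo, hi = min(xs), max(xs)
    if hi == lo:
        raise ValueError("xs must vary to form intervals")
    width = (hi - lo) / 3
    sums = [0.0, 0.0, 0.0]
    counts = [0, 0, 0]
    for x, y in zip(xs, ys):
        idx = min(int((x - lo) / width), 2)  # the maximum falls in the last bin
        sums[idx] += y
        counts[idx] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Hypothetical county-level values (not from the China Study II dataset)
wheat = [0, 100, 200, 300, 400, 500, 600, 751]   # g/day
mortality = [10, 12, 11, 14, 15, 16, 18, 19]     # deaths per 1,000

means = tercile_means(wheat, mortality)
pct_higher = (means[2] / means[0] - 1.0) * 100.0
print("Mean mortality by tercile:", means)
print(f"Third tercile vs. first: {pct_higher:.0f} percent higher")
```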

Rice seems to be protective, as long as intake is not too high

The graphs below show the shapes of the association between rice intake (RICE) and mortality from all cardiovascular diseases (MVASC). In the first graph, the values are provided in standardized format. In the second graph, the values are provided in unstandardized format and organized in terciles.



Here the relationship is more complex. The lowest mortality is clearly in the second tercile (206 to 412 g/day). There is a lot of variation in the first tercile, as suggested by the first graph with the U curve. (Remember, as rice intake goes down, wheat flour intake tends to go up.) The U curve here looks similar to the exponential decay curve shown earlier in the post, for the relationship between rice and wheat flour intake.

In fact, the shape of the association between rice intake and mortality from all cardiovascular diseases looks a bit like an “echo” of the shape of the relationship between rice and wheat flour intake. Here is what is creepy. This echo looks somewhat like the first curve (between rice and wheat flour intake), but with wheat flour intake replaced by “death” (i.e., mortality from all cardiovascular diseases).

What does this all mean?

- Wheat flour displacing rice does not look like a good thing. Wheat flour intake seems to have strongly displaced rice intake in the counties where it is heavily consumed. Generally speaking, that does not seem to have been a good thing. It looks like this is generally associated with increased mortality from all cardiovascular diseases.

- High glycemic index food consumption does not seem to be the problem here. Wheat flour and rice have very similar glycemic indices (but generally not glycemic loads; see below). Both lead to blood glucose and insulin spikes. Yet, rice consumption seems protective when it is not excessive. This is true in part (but not entirely) because it largely displaces wheat flour. Moreover, neither rice nor wheat flour consumption seems to be significantly associated with cardiovascular disease via an increase in total calorie consumption. This is a bit of a blow to the theory that high glycemic carbohydrates necessarily cause obesity, diabetes, and eventually cardiovascular disease.

- The problem with wheat flour is … hard to pinpoint, based on the results summarized here. Maybe it is the fact that it is an ultra-refined carbohydrate-rich food; less refined forms of wheat could be healthier. In fact, the glycemic loads of less refined carbohydrate-rich foods tend to be much lower than those of more refined ones. (Also, boiled brown rice has a glycemic load that is about three times lower than that of whole wheat bread; whereas the glycemic indices are about the same.) Maybe the problem is wheat flour's gluten content. Maybe it is a combination of various factors, including these.

Reference

Kock, N. (2010). WarpPLS 1.0 User Manual. Laredo, Texas: ScriptWarp Systems.

Acknowledgment and notes

- Many thanks are due to Dr. Campbell and his collaborators for collecting and compiling the data used in this analysis. The data is from this site, created by those researchers to disseminate their work in connection with a study often referred to as the “China Study II”. It has already been analyzed by other bloggers. Notable analyses have been conducted by Ricardo at Canibais e Reis, Stan at Heretic, and Denise at Raw Food SOS.

- The path coefficients (indicated as beta coefficients) reflect the strength of the relationships; they are a bit like standard univariate (or Pearson) correlation coefficients, except that they take into consideration multivariate relationships (they control for competing effects on each variable). Whenever nonlinear relationships were modeled, the path coefficients were automatically corrected by the software to account for nonlinearity.

- The software used here identifies non-cyclical and mono-cyclical relationships such as logarithmic, exponential, and hyperbolic decay relationships. Once a relationship is identified, data values are corrected and coefficients calculated. This is not the same as log-transforming data prior to analysis, which is widely used but only works if the underlying relationship is logarithmic. Otherwise, log-transforming data may distort the relationship even more than assuming that it is linear, which is what is done by most statistical software tools.

- The R-squared values reflect the percentage of explained variance for certain variables; the higher they are, the better the model fit with the data. In complex and multi-factorial phenomena such as health-related phenomena, many would consider an R-squared of 0.20 as acceptable. Still, such an R-squared would mean that 80 percent of the variance for a particular variable is unexplained by the model.

- The P values have been calculated using a nonparametric technique, a form of resampling called jackknifing, which does not require the assumption that the data is normally distributed to be met. This and other related techniques also tend to yield more reliable results for small samples, and samples with outliers (as long as the outliers are “good” data, and are not the result of measurement error).

- Only two data points per county were used (for males and females). This increased the sample size of the dataset without artificially reducing variance, which is desirable since the dataset is relatively small. This also allowed for the test of commonsense assumptions (e.g., the protective effects of being female), which is always a good idea in a complex analysis because violation of commonsense assumptions may suggest data collection or analysis error. On the other hand, it required the inclusion of a sex variable as a control variable in the analysis, which is no big deal.

- Since all the data was collected around the same time (late 1980s), this analysis assumes a somewhat static pattern of consumption of rice and wheat flour. In other words, let us assume that variations in consumption of a particular food do lead to variations in mortality. Still, that effect will typically take years to manifest itself. This is a major limitation of this dataset and any related analyses.

- Mortality from schistosomiasis infection (MSCHIST) does not confound the results presented here. Only counties where no deaths from schistosomiasis infection were reported have been included in this analysis. Mortality from all cardiovascular diseases (MVASC) was measured using the variable M059 ALLVASCc (ages 35-69). See this post for other notes that apply here as well.
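The jackknifing mentioned in the notes above can be sketched for a simple correlation coefficient. This is a generic illustration, not the exact procedure implemented in WarpPLS: the statistic is recomputed with each observation left out in turn, the spread of those estimates gives a standard error, and a two-sided P value is obtained from a normal approximation (an assumption made here for simplicity). The data and function names are hypothetical.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def jackknife_p_value(xs, ys):
    """Return (estimate, jackknife standard error, two-sided P value)."""
    n = len(xs)
    estimate = pearson(xs, ys)
    # Recompute the statistic n times, leaving one observation out each time.
    leave_one_out = [
        pearson(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]) for i in range(n)
    ]
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((r - mean_loo) ** 2 for r in leave_one_out)
    std_error = math.sqrt(variance)
    # Normal approximation: P = 2 * (1 - Phi(|estimate / std_error|)).
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return estimate, std_error, p_value


# Hypothetical data: a linear relationship plus random error
rng = random.Random(7)
xs = [float(i) for i in range(1, 21)]
ys = [x + rng.gauss(0.0, 3.0) for x in xs]

estimate, std_error, p_value = jackknife_p_value(xs, ys)
print(f"r={estimate:.3f}, jackknife SE={std_error:.3f}, P={p_value:.4g}")
```

Because the standard error comes from the data themselves rather than from a formula that assumes normality, this kind of resampling does not require normally distributed variables.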