MULTIPLE REGRESSION: EVOLUTION AND ANALYSIS
ABSTRACT:
This article attempts to explain the evolution and analysis of multiple
regression. It also describes the history, the present, and the future of
multivariate analysis. The future is described in more general terms, because
reality is chaotic even when it appears orderly from a distance.
At the time of this writing, multiple regression appears to be losing
prestige, although its use may still help guarantee publication. Simpler measures
should be included, and monitored replication may be necessary. Without
significant changes, the social sciences may find even less trust in social
science or behavioral methodologies. Multiple regression is a limited tool
and must be monitored.
INTRODUCTION:
Social sciences findings shaky
“The reliability of social science studies has been thrown into question,
after an attempt to replicate project 21 high profile experiments yielded
only 13 successful reproductions. The two year research project
centered on studies that were published between 2010 and 2015 in
highly respected journals Science and Nature…” THE WEEK, September
14, 2018, 20
Sir James George Frazer in 1890 (Editors, 2011) described the evolution of
validity as moving from 1) magic to 2) religion to 3) science. Multiple regression
comes from science. C. F. Gauss in 1809 (Editors, 2019) published the method of
least squares, on which multiple regression rests. Years later, it still took desk
calculators days to complete a regression. This was true in the 50’s and 60’s. By
the early 70’s, the author was present during the “Big Data Days” (Snell, J. &
M. Marsh, 2012), when multiple regression became quick and comprehensive.
However, both nominal and ordinal numbers were troubling. Kerlinger, F. &
E. Pedhazur (1973) warned against mixing nominal and ordinal numbers with
interval and ratio numbers. So multiple regression became offensive both to those
fearful of the change and to those who questioned mixing nominal, ordinal,
interval, and ratio numbers on the grounds that doing so did not support hard
number theory.
Some 50 years later, Ioannidis, J. P. A. (2005) tried to replicate numerous studies
from Nature and Science and found that many results differed from the original
findings. His publication had over 1 million hits. Another study, Goertzel, T.
(2002), showed that a number of multiple regression studies were not salient.
This article was published in 36 languages. Nine years later, Ioannidis (2014)
made some suggestions, but there was still trouble.
The technology was there to do such work, but there was still trouble in terms
of accuracy. To quote Hornbrook, T. (2009): “We don’t live in a Gaussian world of
order and stability. David Li using a variation of multiple regression helped bring
down Wall Street in 2008.” Robert Merton and his “Portfolio Theory” nearly
destroyed the market in the 90’s (Taleb, N., 2007). Its origins are in multiple
regression. Weedmark (2018) indicates problems with regression analysis
blurring flawed or incomplete data, although he does make some positive remarks.
Card, D. and S. Srivastava (2014) criticize not only multiple regression but also
its awkward cousin, meta-analysis. See also Snell, J. & M. Marsh (2003) and
Snell, J. (2010). Staff (2018) marches across disciplines to highlight that multiple
regression does not replicate. Staff (2014) adds that most medical research is
wrong due to the use of multiple regression. Goertzel, T. (2002) calls multiple
regression “junk science.” Taleb (2007) adds to the criticism. Hornbrook, T.
(2014), Chamber, S. C. (2014), and Freedman, S. (2010) concur.
However, Snell and Marsh (2012) suggest that multiple regression has value and
should continue to be used in social science research. There is support for
multiple regression. It is an ambivalent tool. Further, in a fairly exhaustive
search, the author found an article that offers powerful support for multiple
regression: see Jagger and Leek (2014), cited in Card, D. & S. Srivastava (2014).
They make a strong case for using analysis to correct errors in multiple
regression. Further, there are hundreds of descriptions that could be read as
affirmative, because individuals choose to work with this strategy and explain
it. That could be thought of as a positive.
On the other hand, chi-square as a strategic application is a joy (Editors, 2014).
Easy to understand, it takes all subjects and gathers them into multi-tabular
form, or cross tabs. It is of course much less powerful within the family of
empirical measures, but it is there to do its labor. One should read the cross
tabs, which are among the first statistics printed by the software.
Looking at multiple regression, nominal numbers can be treated as interval when
coded as a dummy variable with the values 0 and 1, whose midpoint is
(0 + 1)/2 = .5 (Trochim, S., 2006).
On the other hand, ordinal numbers have been ordered into rank. Ordinal
numbers can be used when the average ranks of 2 groups are measured against
each other, so that one need not rely on median values alone. So John Doe’s
supporters can have a higher rank on Likert scores than Sam Smith’s supporters.
We cannot say that one group is “happier” if an ordinal Likert scale on
“Happiness” is used. That is the work of interval and ratio numbers. Often, the
mean, mode, and median will give the same results at the ordinal level of
measurement. Not all agree on this controversial measure; however, when earlier
work on smaller samples indicates validity, we are on common ground. There are
other supportive statements for using the mean of ordinal data (Saur, J., 2016).
Winship, C. & R. Mare (1984) found little trouble applying interval strategies to
ordinal numbers.
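The rank comparison described above can be sketched in a few lines. This is only
an illustration, assuming hypothetical 5-point Likert scores for the two groups of
supporters; the function names and the data are invented for the example, and
mean ranks stand in for the "higher rank" comparison.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mean_ranks(group_a, group_b):
    """Pool both groups, rank the pooled scores, return each group's mean rank."""
    ranks = average_ranks(list(group_a) + list(group_b))
    return mean(ranks[:len(group_a)]), mean(ranks[len(group_a):])


# Hypothetical Likert responses (1 = low, 5 = high); not data from the article.
doe_supporters = [5, 4, 4, 5, 3]
smith_supporters = [2, 3, 1, 3, 2]

doe_rank, smith_rank = mean_ranks(doe_supporters, smith_supporters)
print(f"Mean rank, Doe supporters:   {doe_rank:.1f}")
print(f"Mean rank, Smith supporters: {smith_rank:.1f}")
```

The comparison only says that one group ranks higher than the other. Consistent
with the text, it does not say how much "happier" that group is.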
Interval numbers do not have a zero baseline but are very suitable for
computation. Ratio numbers that become part of the formula for the study are
the empirical peak in multiple regression. They have a 0 baseline, equal distance
between the numbers, and equal weight for each individual number. Thus,
multiple regression is workable. Special emphasis should be placed on finding
the mean of ordinal numbers. The sources cited above appear to support this.
This is important because averages need at least interval calculation. If a dummy
variable is treated nominally and a mean cannot be constructed from ordinal
variables, the impact of multiple regression is lessened. Thus, the basis for this
statistical strategy is interval numbers. All variables must be interval or above,
and anything less than that will sour the findings (Snell, J. and M. Marsh, 2014).
That means that the four levels, if correctly created and conducted, become
comparable, so that perhaps 5 or more independent variables can be “regressed”
on a dependent variable. Of the 5, some have higher beta weights and thus more
explained variance, so that some are more powerful than others. Further, a
dummy variable can be used because, although it appears to be nominal, it is an
interval measure ((0 + 1)/2 = .5). See S. Trochim above.
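As a rough illustration of beta weights and explained variance, the sketch below
standardizes every variable and solves the normal equations directly. It is a
minimal sketch, not the procedure of any particular statistics package: the
variable names and all data values are hypothetical, one predictor is a 0/1
dummy variable, and a Likert item is treated as interval, which is exactly the
assumption debated in the text.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardize values to mean 0 and standard deviation 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def beta_weights(predictors, outcome):
    """Return (standardized coefficients, R squared) for a least squares fit."""
    n = len(outcome)
    z_x = [zscores(col) for col in predictors]
    z_y = zscores(outcome)
    # Correlations among the predictors, and of each predictor with the outcome.
    r_xx = [[sum(p * q for p, q in zip(ci, cj)) / n for cj in z_x] for ci in z_x]
    r_xy = [sum(p * q for p, q in zip(ci, z_y)) / n for ci in z_x]
    betas = solve(r_xx, r_xy)
    r_squared = sum(b * r for b, r in zip(betas, r_xy))
    return betas, r_squared


# Hypothetical data for 8 respondents; none of it comes from the article.
names = ["income (ratio)", "Likert item (treated as interval)", "member (0/1 dummy)"]
income = [20, 35, 50, 65, 80, 95, 110, 125]
likert = [2, 1, 4, 3, 5, 2, 4, 5]
member = [0, 1, 0, 1, 1, 0, 1, 0]
score = [12, 18, 25, 30, 41, 38, 52, 55]

demo_betas, demo_r2 = beta_weights([income, likert, member], score)
for name, beta in zip(names, demo_betas):
    print(f"beta weight for {name}: {beta:.3f}")
print(f"explained variance (R squared): {demo_r2:.3f}")
```

Because every variable is standardized first, the coefficients are on a common
scale, which is what allows one predictor to be called more powerful than another.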
Incidentally, multicollinearity is reduced with a stepwise solution. Noted social
neuroscientist John Hibbing defended multiple regression, and using this strategy
has placed him and his research group in the upper reaches of academia
(personal correspondence).
DESCRIPTION
The question for this article is: what is souring regression analysis, and what
can be done to improve it? What is at stake is that when significant numbers of
studies are not replicated, this strategy loses the faith of others as well as
funding.
We begin with that which makes multivariate analysis questionable.
FRAUD This one is very clear. We first encounter and are dazzled by the best
linear unbiased estimator, dummy variables, multicollinearity, beta weights, and
related terms of multiple regression. Let’s assume that hypothetical author(s)
want to support a new strategy in some area. They cheat using the mosaic of
multiple regression. The most noted case is Andrew Wakefield, whose study
supported the position that a vaccine would cause a serious childhood disease
such as autism (Freedman, S., 2010). The problem caused by his fraud is still with
us today. His motive was to develop a new vaccine with heavy funding from a
second party. Thus, he could create new patients and increase his profits by
selling a new product (the alternative vaccine). He was caught and can no longer
practice medicine in the USA and the UK. As this is being written, there is an
outbreak of measles. In the main, this disease was once considered preventable
by a multiple vaccine (MMR). In the meantime, a social movement blossoms and
the value of science and medicine is reduced. Conspiracies abound.
INTERPRETATION
A study does not replicate. This can happen for many reasons, including a
misunderstanding. But a new study may be created, found credible, and then
replicated. The replication supports the study; indeed, it becomes a valuable
portion of information. Thus, the review and interpretation are what caused the
problem. It was not caught by the editor or the editorial board.
Editor flaw can become a problem when a questionable article is surrounded with
multisyllabic terminology, a field of distorted numbers, Greek symbols, or
tangential information that is meant to blur, not clarify. When formulas are
written in the most obtuse way, the editor is fooled, as is his or her statistics
editor. Nearly 70 researchers have been caught falsifying or changing data to fit
their hypotheses (Carey, B., 2015).
SAMPLE SIZE Zetterberg (1965) discussed validity at two levels. One is the level
of discovery. This means a researcher may have found a surprise finding from a
small sample. It needs replication. A large, random sample may indicate that
there is no correlation. This is the level of verification.
Except in special cases, an N of 200 is the accepted number. For a proportion,
that is a margin of error of roughly .06 to .07, depending on the confidence
level. Further, the sample is not random but purposive. The author is
suggesting that using cross tabs or chi-square is better for small numbers.
Although one loses power in the measure, it is safer and more cautious.
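The margin of error for a sample of 200 can be checked with the usual normal
approximation for a proportion. This is a sketch under stated assumptions: simple
random sampling, the most conservative proportion p = .5, and a function name
chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, confidence=0.95, p=0.5):
    """Half-width of the normal-approximation interval for a sample proportion."""
    # Two-sided critical value, e.g. about 1.96 for 95 percent confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sqrt(p * (1 - p) / n)


for level in (0.90, 0.95):
    print(f"N = 200, confidence {level:.0%}: margin of error = "
          f"{margin_of_error(200, level):.3f}")
```

With N = 200 this gives roughly .07 at 95 percent confidence and roughly .06 at
90 percent confidence. The formula assumes a random sample, so for a purposive
sample it is at best a rough guide.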
RELATED
We now look at other strategies and materials that have yet to be introduced.
Rather than define them, we list them. They are not the focus of this article.
Pre-modern is serendipity. It is a surprise finding or trial and error. It could
distort a finding with regression analysis. It may be more cautious and clearer
to use descriptive statistics.
Modernism is Gauss, and it is fundamental to regression. It assumes an orderly
world. Further, it does take into account fumbling, stumbling, and distorting,
and yet we can still arrive at order. This is the heart of chaos theory (Snell et
al., 2006). Modernism is a multiple regression phenomenon, and in the future we
may complete modernism and introduce meta-modernism. Please see Snell, J.,
Cangemi, J. & C. Kowalski (2008).
From pre-modern to modern, we now arrive at meta-modernism. The first stage
or chapter of this theory was post-modernism. However, it was too scrambled
and sometimes foolish. Its major power was in the deconstruction of slick and
obtuse modernism. The founder is Mandelbrot, and its characteristics include
“off label,” “deconstructionism,” “outliers,” “black swans,” “chaotic
butterflies,” “urban legends,” “paradoxes,” “alternative paradigms,”
“misinterpretation,” and “conundrums” (Snell, 2016). We are not yet ready to
define these terms completely. We have some of it. However, it would be better
that we first finish and improve the later stages of modernism.
We can then make a clearer evolution from pre-modernism to modernism
and should be prepared for meta-modernism. However, we are still in the middle
of the ending of modernism, and we should prepare so that all can be used at
some future date. Multiple regression will be in the mix.
CORRECTIONS
How can we reduce replication errors? The list below follows the list
above.
FRAUD This can be reduced where a website is created on which “file
drawer” material, as well as outright fraud, is available to the reader
and reviewer. This can be the spot to incorporate such material.
Additionally, replication can unearth obvious strategies used by the
author(s) to create information. Accidents should be discovered
but not labeled as conscious fraud. The ranking of the journal may
have something to do with this area. However, the author says this
humbly. Some very credible articles with an important finding are
published in 4th quartile journals of the SCImago Journal & Country
Rank. We also need “professional replicators.” They may
have as much power as a senior board member. A finding may indeed
suggest credibility when a manuscript is both refereed and
professionally replicated. Replicators can be so important that they
may review many computational articles from many different journals.
In a sense Ioannidis is that person now. WE NEED MORE.
INTERPRETATION
The cause of a wrong answer lies with the editor, the editorial board,
or the author. It is generally not meant as false publication. It would
seem to this writer that articles with calculations and statistics should
be given as much priority as straight interpretative and theory-building
discussion.
META-MODERNISM
Roughly, meta-modernism was introduced between 2005 and 2015. It comes
from the Hegelian-Marxist struggle to validate. This new modernism has
boundaries. It comes from deconstructed bundles of information that
form and oscillate from one side to another. Deconstruction, multiple
sources, and boundaries are the key to this approach (Snell, J., 2016).
Its major importance is clarity. How can a large number of people use it
and find validity? What the majority do not want is more fog and
intertwined verbiage. Further, it can bleed into other fields (Snell, 2016).
VALIDITY
Replication! Replication! Replication! Refereeing through one prestige
journal is not enough. The author’s next manuscript will, hopefully,
describe a “Professional Replicator” who analyzes the material before it
is allowed to be distributed to the public and the wider population.
SUMMARY AND CONCLUSION
The three modernisms may be best portrayed as follows. Pre-modernism
displays word description and some raw numbers or percentages with
descriptive statistics. Modernism has the bell curve as the symbol of
its presentation. Meta-modernism is a concentric circle and spiral of
numerous bundles of sources of information that oscillate relative to
time.
In the meantime, multiple regression will probably survive if heavily
monitored by an independent third party. Analysis and statistics need
no longer be lost on most readers. The taxpayer who funds these
projects is not impressed with stats that are not understandable and
that replicate poorly. It is probable that no one variable will suffice
and that ongoing research can, hopefully, get us to a clearer picture of
what is called “reality.”
REFERENCES CITED
Card, D. & S. Srivastava (2014) Findings Are False, Statistics Journal
Club, 36-825.
Carey, B. (2015) Stanford Researchers Uncover Patterns in How
Scientists Lie About Their Data, news.stanford.edu/
Chamber, S. C. (2014) “Physics Envy”: Do Hard Sciences Hold the
Solution to the Replication Crisis in Psychology? The Guardian.com
Editors (2014) Chi-Squared Test, Wikipedia.org.
Editors (2019) Gauss, the German Mathematician. Update,
Encyclopedia Britannica.com/
Frazer, J. (1890) The Golden Bough, update in Encyclopedia
Britannica.com, update 2011.
Freedman, S. (2010) “Lies, Damned Lies, and Medical Science,” The
Atlantic.com
Goertzel, T. (2002) Myths of Murder and Multiple Regression, The
Skeptical Inquirer, 19-23.
Goertzel, L. (2004) Adding the Dummy Variable, The Skeptical Inquirer,
20-34.
Hollbrok, M. (2009) Was David Li the Guy Who Blew Up Wall Street? CBC
News, 1-7.
Hornbook, T. (2014) “Multicollinearity and Regression Analysis,” Stat
Trek.com/
Ioannidis, J. P. A. (2005) Why Most Published Research Findings Are
False, PLoS Medicine 2, 8.
Ioannidis, J. P. A. (2014) Contradicted and Initially Stronger Effects in
Highly Cited Clinical Research, Journal of the American Medical
Association, 294, 218-224.
Jagger and Leek (2014) Why Most Findings Are True, from Card, D. &
S. Srivastava (see above).
Kerlinger, F. & E. Pedhazur (1973) Theory, Application, and Regression
Analysis, Multiple Regression in Behavioral Research, New York: Holt,
Rinehart, & Winston, 445.
Snell, J. & M. Marsh (2003) Meta-Cognitive Analysis: An Alternative for
the Sciences, Education, Vol. 124.
Snell, J., Cangemi, J. & C. Kowalski (2008) Social Essays on Chaos
Theory, Boston: McGraw Hill.
Snell, J. (2010) Moderating Meta-Analysis, Journal of Instructional
Psychology.
Snell, J. & M. Marsh (2011) Meta-Analytic Derivation, Journal of
Instructional Psychology.
Snell, J. & M. Marsh (2012) Multiple Regression and Its Discontents,
Education, Spring, 517-521.
Snell, J. (2014) Deconstructing Statistical Analysis, Education.
Snell, J. (2016) Meta-Modernism: An Introduction, Education, 201-203.
Taleb, N. (2007) The Black Swan, New York: Random House.
Weedmark, D. (2018) The Advantages and Disadvantages of the
Multiple Regression Model, Sciencing.com/
Winship, C. & R. Mare (1984) Regression Models with Ordinal
Variables, American Sociological Review, 8, 512-515.
Special note:
(It is the author’s opinion that the social sciences have stretched themselves into believing that the
fields (economics, political science, sociology, psychology, and others) are sciences. The first
chapter of most textbooks makes that claim. This note will address the problem only indirectly. We are
trying to indicate that the sciences use ratio numbers nearly all the time. The variables are more
homogeneous, and one is able to manipulate numbers more accurately. Studies in the sciences can be
replicated and generally return the highest levels of validity and reliability (accuracy and consistency).
Practitioners use the scientific method.
At the other end is art. Original art reflects personal nuances and that grain of sand that holds the
universe within it. It may be replicated, but the original generally is not and it is ranked using very
subjective criteria.
That which is in the middle is the social methodologies. They use many strategies to assess validity and
reliability along with association or independence. The cause(s) and effect(s) are likely to be more
heterogeneous.
Early on, in the struggle for prestige and legitimacy, the social methodologies replicated the sciences in
terms of strategy and terminology. The fields have had tremendous value, and it is this author’s
opinion that using the word science is no longer needed. The rest of the argument will be left to others, as
this essay is in support of using simpler measures along with the more complicated ones that accompanied
the rise of the computer. Regression analysis and its related and supportive theories came before the
computer, but using these measures with computers made the calculations so much easier. However,
the outcomes may be sullied by complexity.
THOUGHT EXPERIMENT
Two manuscripts from two roughly equally prestigious schools are sent to a mid-size journal. The first
uses chi-square. The essence of the manuscript is a bivariate, non-parametric analysis of 50,000 votes:
the voting totals (in real numbers) of Jones (10,000 votes) and Smith (40,000 votes). The main issues are
that Smith wants to “improve the public schools” and Jones wants to privatize all schools by
neighborhood. Smith is the big winner. Votes are analyzed in a 2 by 2 table.
To shorten things, Jones wants to segregate the schools in a new way. The predominantly black town
feels its children will be shortchanged. Chi-square is calculated and is significant beyond the .05 level.
Further, cross tabs are used, and the reader can see the independence of the numbers when controlling
for other demographics. By eyeballing the data, it appears that a middle-class coalition of both colors
was the winning strategy. In the study, chi-square is a nominal test; it can, however, easily handle
ordinal, interval, and ratio data. What is sacrificed is the robust power of the test. Further, a nominal
test can handle all levels up to ratio, but the opposite cannot be done: nominal numbers cannot be made
into ratio numbers. Additionally, chi-square can mix number power properties. Most do not do this
because of the loss of power. Further, it is a loss of prestige. The author may want to dress the formula
in Greek letters and English acronyms, but the formula is fairly simple.
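The first manuscript's test can be sketched as follows. The vote totals (10,000 for Jones, 40,000 for
Smith) come from the thought experiment; the split of those votes between two voter groups is
hypothetical, invented only so that there is a 2 by 2 table to analyze. The p-value relies on the fact
that, with 1 degree of freedom, chi-square is the square of a standard normal variable.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Columns: Jones, Smith. Rows: two hypothetical voter groups.
# Column totals match the text (10,000 and 40,000); the row split is invented.
votes = [
    [7000, 13000],  # hypothetical group A
    [3000, 27000],  # hypothetical group B
]

chi2, p_value = chi_square_2x2(votes)
print(f"chi-square = {chi2:.1f}, p = {p_value:.3g}")
print("significant beyond .05" if p_value < 0.05 else "not significant at .05")
```

The whole calculation is a comparison of observed and expected counts, which is why the formula can
be called fairly simple.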
Now let’s change the thought experiment as little as possible. The names of the candidates in the
example are changed and so is the issue. All else is the same. This time stepwise multiple regression is
used, and nearly all the variables are treated as hard number ratio data, which they are not. The formula
is included in the manuscript and it is elegant. It includes numbers, Greek letters, and English
acronyms as well. Nearly 5 independent variables of all number powers are each regressed on the
candidate vote. Each is given a beta weight, and the outcome is that the variable with the best explained
variance is color, then class; most others are of little value. A dummy variable is added to control for
related, but still tangential, information.
In other words, the end of the study is roughly the same. Chi-square follows the statistical rules, and the
stepwise regression hovers near illegitimacy but is still within bounds. Which one will be published?
Both are sent within a week of each other.
Parenthetically, in an earlier era, before the rise of the computer, writing science was quite an art.
Multisyllabic terminology, underlining, italicizing, and the like were used. Long paragraphs were embedded
into footnotes. Long, lengthy titles complemented the manuscript. At times new paragraphs were
introduced in a footnote, which then referred to its place in the manuscript. It was hard to read and hard
to understand; thus it was scientific.
My guess is that the study with stepwise multiple regression is the one published. It is harder to
understand. It uses a more challenging test to discriminate variables. The editor understands a lot of it,
but s(he) is not so sure, so the manuscript is sent to a number-crunching reviewer. This person
believes in research methodology, but he has soured on the idea that numbers alone can be assigned to
social phenomena and that the number power levels can be mixed. He likes both manuscripts and says so.
As he ages, he sees that behind the numbers is a sense of connectedness that can be assessed down a
number of rows. In other words, given that this is research methodology, the relationship is close enough.
The last reviewer knows little of statistics. He got his doctorate when he overwhelmed his committee on
the impact serendipity had on discovery science.
What the author is suggesting is that ALL strategies can be used, but nominal-based strategies and
triangulation are the easiest to understand and could be superior. It is a paradox of life.)
Joel Charles Snell M.A.
Emeritus Professor
Kirkwood College
Cedar Rapids, Iowa
joelsnell@hotmail.com/
3105 Alleghany Dr. NE
Apple wood Hills
319/366-0063