How to Use Dummy Variables in Regression Analysis

by Zach Bobbitt. Last updated on May 31, 2021

Linear regression is a method we can use to quantify the relationship between one or more predictor variables and a response variable.

Typically we use linear regression with quantitative variables. Sometimes referred to as “numeric” variables, these are variables that represent a measurable quantity. Examples include:

Number of square feet in a house
Population size of a city
Age of an individual

However, sometimes we wish to use categorical variables as predictor variables. These are variables that take on names or labels and can fit into categories. Examples include:

Eye color (e.g. “blue”, “green”, “brown”)
Gender (e.g. “male”, “female”)
Marital status (e.g. “married”, “single”, “divorced”)

When using categorical variables, we can’t simply assign numeric codes like 1, 2, 3 to values like “blue”, “green”, and “brown”, because those codes imply an order and a spacing that don’t exist. It makes no sense to say that green is twice as colorful as blue or that brown is three times as colorful as blue.

Instead, the solution is to use dummy variables. These are variables that we create specifically for regression analysis that take on one of two values: zero or one.

Dummy Variables: Numeric variables used in regression analysis to represent categorical data. Each dummy variable can take on only one of two values: zero or one.

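The number of dummy variables we must create is equal to k-1, where k is the number of different values that the categorical variable can take on. The remaining value serves as the baseline and is represented by setting every dummy variable to zero.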
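
As a minimal sketch of the k-1 rule, the plain-Python helper below builds one 0/1 column for every value except a chosen baseline. The function name make_dummies and the eye color data are hypothetical, made up for illustration only.

```python
def make_dummies(values, baseline):
    """Return k-1 dummy columns (a dict of lists) for a categorical variable.

    `baseline` is the category that gets no column of its own; rows with
    that value are represented by zeros in every dummy column.
    """
    # Keep the non-baseline categories in first-seen order
    levels = [v for v in dict.fromkeys(values) if v != baseline]
    return {level: [1 if v == level else 0 for v in values] for level in levels}


# Made-up eye color data with k = 3 distinct values
eye_color = ["blue", "green", "brown", "brown", "blue"]
dummies = make_dummies(eye_color, baseline="brown")

print(len(dummies))      # k - 1 = 2 dummy variables
print(dummies["blue"])   # [1, 0, 0, 0, 1]
print(dummies["green"])  # [0, 1, 0, 0, 0]
```

If pandas is available, pandas.get_dummies with drop_first=True produces a similar k-1 encoding, although it drops the first category level rather than the most frequent one.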

The following examples illustrate how to create dummy variables for different datasets.

Example 1: Create a Dummy Variable with Only Two Values

Suppose we have the following dataset and we would like to use gender and age to predict income:

To use gender as a predictor variable in a regression model, we must convert it into a dummy variable.

Since it is currently a categorical variable that can take on two different values (“Male” or “Female”), we only need to create k-1 = 2-1 = 1 dummy variable.

To create this dummy variable, we can choose one of the values (“Male” or “Female”) to represent 0 and the other to represent 1.

We usually represent the most frequently occurring value with a 0, which would be “Male” in this dataset.

Thus, here’s how we would convert gender into a dummy variable:

We could then use Age and Gender_Dummy as predictor variables in a regression model.
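
Here is a minimal sketch of this conversion in Python. The rows below are hypothetical values invented for illustration, not the article's dataset. “Male” is coded 0 and “Female” is coded 1, matching the choice described above.

```python
# Hypothetical rows: (income, age, gender). Invented for illustration.
rows = [
    (45000, 23, "Male"),
    (48000, 25, "Male"),
    (54000, 24, "Female"),
    (57000, 29, "Male"),
    (65000, 38, "Female"),
]

# Gender_Dummy: 0 for the baseline value ("Male"), 1 for "Female"
gender_dummy = [1 if gender == "Female" else 0 for _, _, gender in rows]

# Predictor rows with columns Age and Gender_Dummy, plus the response Income
X = [[age, d] for (_, age, _), d in zip(rows, gender_dummy)]
y = [income for income, _, _ in rows]

print(gender_dummy)  # [0, 0, 1, 0, 1]
print(X[2])          # [24, 1]
```

The lists X and y are in the shape most regression routines expect: one row of predictors per observation and one response value per row.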

Example 2: Create a Dummy Variable with Multiple Values

Suppose we have the following dataset and we would like to use marital status and age to predict income:

To use marital status as a predictor variable in a regression model, we must convert it into dummy variables.

Since it is currently a categorical variable that can take on three different values (“Single”, “Married”, or “Divorced”), we need to create k-1 = 3-1 = 2 dummy variables.

To create these dummy variables, we can let “Single” be our baseline value since it occurs most often. Thus, here’s how we would convert marital status into dummy variables:

We could then use Age, Married, and Divorced as predictor variables in a regression model.
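
The same idea can be sketched in Python for a three-valued variable. The marital status column below is hypothetical, invented for illustration. The baseline is picked as the most frequent value, and one 0/1 column is created for each of the other two values.

```python
from collections import Counter

# Hypothetical marital status column, invented for illustration
status = ["Single", "Married", "Single", "Divorced", "Single", "Married"]

# Use the most frequently occurring value as the baseline
baseline = Counter(status).most_common(1)[0][0]

# One 0/1 column per non-baseline value: k - 1 = 3 - 1 = 2 columns
married = [1 if s == "Married" else 0 for s in status]
divorced = [1 if s == "Divorced" else 0 for s in status]

print(baseline)  # Single
print(married)   # [0, 1, 0, 0, 0, 1]
print(divorced)  # [0, 0, 0, 1, 0, 0]
```

Every “Single” row has a 0 in both columns, which is why a third dummy variable for “Single” would carry no extra information.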

How to Interpret Regression Output with Dummy Variables

Suppose we fit a multiple linear regression model using the dataset in the previous example with Age, Married, and Divorced as the predictor variables and Income as the response variable.

Here’s the regression output:

[Image: regression output table, captioned “How to interpret dummy variables in regression output”]

The fitted regression line is defined as:

Income = 14,276.21 + 1,471.67*(Age) + 2,479.75*(Married) – 8,397.40*(Divorced)

We can use this equation to find the estimated income for an individual based on their age and marital status. For example, an individual who is 35 years old and married is estimated to have an income of $68,264:

Income = 14,276.21 + 1,471.67*(35) + 2,479.75*(1) – 8,397.40*(0) = $68,264
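
The calculation above can be checked with a short Python sketch. The coefficients are copied from the fitted equation; the function name predict_income is a hypothetical helper chosen for illustration.

```python
def predict_income(age, married, divorced):
    """Estimated income from the fitted equation.

    `married` and `divorced` are dummy variables (0 or 1); a single
    individual has both set to 0.
    """
    return 14276.21 + 1471.67 * age + 2479.75 * married - 8397.40 * divorced


# A 35-year-old married individual
estimate = predict_income(35, married=1, divorced=0)
print(round(estimate))  # 68264
```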

Here is how to interpret the regression coefficients from the table:

Intercept: The intercept represents the average income for a single individual who is zero years old. Obviously you can’t be zero years old, so it doesn’t make sense to interpret the intercept by itself in this particular regression model.
Age: Each one-year increase in age is associated with an average increase of $1,471.67 in income, holding marital status constant. Since the p-value (.00) is less than .05, age is a statistically significant predictor of income.
Married: A married individual, on average, earns $2,479.75 more than a single individual of the same age. Since the p-value (0.80) is not less than .05, this difference is not statistically significant.
Divorced: A divorced individual, on average, earns $8,397.40 less than a single individual of the same age. Since the p-value (0.53) is not less than .05, this difference is not statistically significant.
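
One way to see why the dummy coefficients are read as differences from the baseline is to compare predictions at a fixed age. The sketch below uses the coefficients from the fitted equation; the names COEF and predict are hypothetical, chosen for illustration.

```python
# Coefficients copied from the fitted equation
COEF = {"Intercept": 14276.21, "Age": 1471.67, "Married": 2479.75, "Divorced": -8397.40}


def predict(age, status):
    """Estimated income; "Single" is the baseline, so both dummies are 0."""
    married = 1 if status == "Married" else 0
    divorced = 1 if status == "Divorced" else 0
    return (COEF["Intercept"] + COEF["Age"] * age
            + COEF["Married"] * married + COEF["Divorced"] * divorced)


for age in (25, 40, 60):
    married_gap = predict(age, "Married") - predict(age, "Single")
    divorced_gap = predict(age, "Divorced") - predict(age, "Single")
    # The gaps match the Married and Divorced coefficients at every age
    print(age, round(married_gap, 2), round(divorced_gap, 2))
```

Because the model has no interaction terms, the gap between a category and the baseline is the same at every age, which is what the coefficient measures.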

Since neither dummy variable was statistically significant, we could drop marital status as a predictor from the model because it doesn’t appear to add any predictive value for income.

Additional Resources

Qualitative vs. Quantitative Variables
The Dummy Variable Trap
How to Read and Interpret a Regression Table
An Explanation of P-Values and Statistical Significance

10 Replies to “How to Use Dummy Variables in Regression Analysis”

elif (July 8, 2022 at 10:55 am): This was very useful, thank you!
A_H (January 31, 2023 at 10:54 am): Dear Zach, thank you for the great and simple explanation. I am new to this field and your notes really help me a lot to gain an initial understanding!
rancho (September 24, 2023 at 12:56 am): Thanks for this wonderful content.
Manohar (February 20, 2024 at 2:08 am): What if some, but not all, of the k-1 levels of a categorical variable are significant? Do we drop those that are not significant? The categorical variable is represented by k-1 variables and a reference, so dropping some doesn’t seem right.
  Reply from Fred (May 1, 2024 at 8:40 am): Yes, drop the statistically insignificant dummy variables and re-run the regression to obtain new regression estimates. The dummy variables that are statistically insignificant are no different from the category that was omitted in the k-1 choice. For example, in the example discussed above, the fact that “Married” and “Divorced” have insignificant coefficients means that they are no different from the “Single” category. On the other hand, if, say, “Divorced” had been significant but “Married” was not, the regression should be rerun with the “Divorced” dummy variable omitted to get a new coefficient for “Married”. This would be a more precise measure of the effect of “Married” relative to both “Single” and “Divorced”.
Mariza (February 27, 2024 at 2:22 am): Well explained!
hamza maroof (June 18, 2024 at 9:14 am): Amazing content, teacher, keep it up.
  Reply from James Carmichael (June 18, 2024 at 6:40 pm): Thank you hamza! We appreciate your feedback and support!
John (December 29, 2024 at 9:14 am): Hello. We have the number 111188.00 obtained by multiplying three components: a*d*c = 111188.00. What is the probability of such a set of digits in a number?
  Reply from James Carmichael (December 30, 2024 at 3:53 pm): To determine the probability of such a set of digits appearing in a number like 111188.00, we need to clarify what you mean by “the probability of such a set of digits.” There are a few interpretations we could consider:
  1. Probability of a specific digit composition (e.g., 1 appears 4 times, 8 appears 2 times, 0 appears 2 times): If you are asking for the probability that a randomly chosen number has this exact composition of digits, the analysis would depend on the range of numbers (e.g., are we considering all possible numbers, only numbers with 8 digits, etc.) and the distribution of digits in your context (uniform, natural, etc.). If digits are equally likely (uniform distribution), the probability of having exactly four 1s, two 8s, and two 0s in an 8-digit number can be computed using combinatorics. For a uniform random 8-digit number:
  - There are 8!/(4!·2!·2!) = 420 ways to arrange 4 ones, 2 eights, and 2 zeros.
  - The total number of possible 8-digit combinations is 10^8 (assuming leading zeros are allowed).
  Thus, the probability is approximately 420/10^8 = 4.2 × 10^-6.
  2. Probability of this number specifically (111188.00): If you are asking for the probability of a single number like 111188.00, assuming a uniform distribution of numbers, the probability of drawing that exact number from all possible numbers in a given range is one divided by the number of possible values in that range. For example, if the range includes all real numbers with two decimal places, there are infinitely many possibilities, making the probability effectively zero.
  3. Probability based on the product a*d*c: If a, d, and c are random variables with known distributions, we could compute the probability of a*d*c producing the product 111188.00. This requires:
  - Defining the distributions of a, d, and c.
  - Solving for the joint probability that their product equals 111188.00, which often involves integrating over possible values.
  Can you clarify which interpretation aligns with your question?

