Linear Regression in Pricing Analysis, Essential Things to Know

Tags: AmesHousing dplyr equatiomatic broom

Pricing is hard.

Price is Right Contestant… struggling

This is particularly true with large, complicated products, common in business-to-business (B2B) sales. B2B sellers may lack important information (e.g. accurate estimates of customer budgets or ‘street’ prices for the various configurations of their products – the Pricing challenges section discusses other internal and external limitations in setting prices). However, organizations typically do have historical data on internal sales transactions, as well as leadership with a strong desire for insights into pricing behavior. For now I’ll put aside the question of how to use econometric approaches to set ideal prices. Instead, I’ll walk through some statistical methods that rely only on historical sales information and that can be used for analyzing differences, trends, and abnormalities in your organization’s pricing.

With internal data1 you can still support answers to many important questions and provide a starting place for more sophisticated pricing strategies or analyses. I will be writing a series of posts on pricing (see the Future pricing posts section for likely topics). In this post, I will focus on the basic ideas and considerations that matter when using regression models to understand prices.

I will use data from the Ames, Iowa housing market. See the Dataset considerations section for why I use the ames dataset as an analogue for B2B selling / pricing scenarios (as well as problems with this choice). My examples were built using the R programming language; you can find the source code on my github page.

What influences price?

Products have features. These features2 can be used to train a model to estimate price. For a linear model, the estimated coefficients associated with these features can act as proxies for the expected change in dollars per unit change in the component3 (ceteris paribus). In pricing contexts, the idea that regression coefficients relate to the value (i.e. ‘implicit price’) of the constituent components of the product is sometimes called hedonic modeling4. An assumption in hedonic modeling is that our model includes all variables that matter to price5. This assumption is important in that it suggests:

  1. Regression modeling of price is not well suited to contexts where you cannot explain a reasonably high proportion of the variance in the price of your product.
  2. You should be particularly thoughtful regarding the variables you include in your model and avoid including variables that represent overlapping/duplicated information about your product.

For a fuller discussion of hedonic modeling6 see the Handbook on Residential Property Prices Indices. In this post I will build very simple models that obviously don’t represent all relevant factors or meet some of the strong assumptions in hedonic modeling. Instead, my focus is on illustrating some basic considerations in regression that are particularly important in pricing contexts.

Simple linear regression model

Let’s build a model that uses just house square footage, represented by Gr_Liv_Area7, as a feature for predicting home price.
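
As a minimal sketch of how such a model might be fit, assuming the data are loaded via AmesHousing::make_ames() (the column names below match that version of the dataset):

```r
library(AmesHousing)

# Processed Ames data; Sale_Price and Gr_Liv_Area are columns in make_ames()
ames <- make_ames()

# Simple linear regression: price as a function of above-ground square footage
fit_slr <- lm(Sale_Price ~ Gr_Liv_Area, data = ames)

# Coefficient estimates (an equation like the one below can be rendered with
# equatiomatic::extract_eq(fit_slr, use_coefs = TRUE))
summary(fit_slr)
```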

\[ \operatorname{\widehat{Sale\_Price}} = 13290 + 112(\operatorname{Gr\_Liv\_Area}) \]

The coefficient of 112 on Gr_Liv_Area is a measure of the expected change in dollars per additional square foot. If you build the model without an intercept8, the coefficient more directly equates to dollars per square foot9. However, it’s typically more appropriate to leave the intercept in the model10.

Inference and challenges

In evaluating the impact of a component on the price, we don’t want just an estimate of the magnitude of the impact. Instead, we want a measure of the likely range within which this estimate falls. The traditional way to compute this is by using the standard error associated with our estimate.

term          estimate   std.error
(Intercept)   13289.6    3269.7
Gr_Liv_Area   111.7      2.1

We can take {coefficient estimate} +/- 2\(\cdot\){standard error of estimate} to get a 95% confidence interval for where we believe the ‘true’ coefficient on Gr_Liv_Area falls. In this case, this means that across our observations, the mean price change per square foot (taking into account only this variable) is roughly between 108 and 11611.
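
As a sketch, the same kind of interval can be pulled directly from the fitted model (continuing with fit_slr from above; broom is one of the packages tagged on this post):

```r
library(broom)

# Coefficient estimates with 95% confidence intervals
tidy(fit_slr, conf.int = TRUE, conf.level = 0.95)

# Base-R equivalent
confint(fit_slr, level = 0.95)
```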

Violation of model assumptions

Linear regression has a number of model assumptions. Satisfying these matters less when the model is used for prediction than when it is used for inference12. However, if you are interpreting the coefficients as representations of the value associated with components of a product (as in our case), model assumptions matter13. I will leave it up to you and Google to read more on model assumptions14.

The tug-of-war between collinear inputs

Let’s add another variable to our regression model: the number of bathrooms, represented by the bathrooms variable.
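
The raw Ames data does not include a single bathrooms column, so below is a sketch of one plausible construction (simply full plus half baths above ground, which is an assumption about how the variable was defined) along with the two-variable fit:

```r
library(dplyr)

# Derive a combined bathroom count (an assumed definition: full plus half baths)
ames <- ames %>%
  mutate(bathrooms = Full_Bath + Half_Bath)

fit_two <- lm(Sale_Price ~ Gr_Liv_Area + bathrooms, data = ames)
summary(fit_two)
```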

\[ \operatorname{\widehat{Sale\_Price}} = 5491 + 94(\operatorname{Gr\_Liv\_Area}) + 19555(\operatorname{bathrooms}) \]

term          estimate   std.error
(Intercept)   5491.2     3356.2
Gr_Liv_Area   94.0       2.9
bathrooms     19555.3    2284.9

The coefficient on square footage has decreased – this is because the number of bathrooms and the square footage of a home are correlated (they have a correlation of 0.71). Some of the impact on home price that previously existed entirely in the coefficient for Gr_Liv_Area is now shared with the related bathrooms variable. Also, the standard error on Gr_Liv_Area has increased – representing greater uncertainty as to the mean impact of the variable within the model (compared to the prior simple linear regression example).

Let’s consider a model with another variable added: TotRms_AbvGrd, the total number of rooms (above ground and excluding bathrooms) in the home. This variable is also correlated with Gr_Liv_Area and the number of bathrooms (correlations of ~0.8 and ~0.6, respectively).

\[ \begin{aligned} \operatorname{\widehat{Sale\_Price}} &= 35600 + 122(\operatorname{Gr\_Liv\_Area})\ + \\ &\quad 20411(\operatorname{bathrooms}) - 11389(\operatorname{TotRms\_AbvGrd}) \end{aligned} \]

term            estimate   std.error
(Intercept)     35600.0    4384.3
Gr_Liv_Area     121.8      3.9
bathrooms       20410.7    2245.6
TotRms_AbvGrd   -11389.4   1093.6

Notice the coefficient on TotRms_AbvGrd is negative, at -11389.4. This doesn’t mean houses with more rooms are associated with negative home prices. Rather, it suggests that a house with the same square footage and number of bathrooms will be less expensive if it has more rooms15.
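
A quick sketch of how this shared information can be checked numerically, using pairwise correlations and variance inflation factors (car::vif() is one common option):

```r
library(car)

fit_three <- lm(Sale_Price ~ Gr_Liv_Area + bathrooms + TotRms_AbvGrd, data = ames)

# Pairwise correlations among the three predictors
cor(ames[, c("Gr_Liv_Area", "bathrooms", "TotRms_AbvGrd")])

# Variance inflation factors: values well above 1 indicate overlapping information
vif(fit_three)
```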

Theoretical challenge:

Pretend we put in another variable: half_bathrooms, representing the number of half bathrooms in the home. Our previous variable bathrooms already included both full and half bathrooms. This presents a theoretical problem for the model: bathrooms would be represented in two different variables that have a necessary overlap16 with one another. Our understanding of the value of a bathroom, as expressed by its coefficient, would become less clear17.

Beyond this theoretical challenge, duplicated or highly associated inputs also create numeric challenges. The remainder of this post will be focused on numeric challenges and considerations in fitting regression models. These lessons can be applied broadly across inferential contexts but are particularly important in pricing analysis.

Numeric challenge:

Linear regression models feature this ‘tug-of-war’ between the magnitudes of coefficients, whereby correlated variables share influence in the model. At times this causes similar variables to appear to have opposing impacts18. When evaluating coefficients for pricing analysis exercises19, this competition between coefficients has potential drawbacks:

  • As you increase the number of variables in the model, collinearity can make for models with a high degree of instability / variance in the parameter estimates – meaning that the coefficients in your model (and your resulting predictions) could change dramatically even from small changes in the training data20, which undermines confidence in the estimates (the bootstrap sketch after this list illustrates this).
  • You may want to limit methods that result in models with unintuitive variable relationships (e.g. where related factors have coefficients that appear to act in opposing directions).
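
As a rough sketch of the first point above, refitting the three-variable model on bootstrap resamples shows how much the coefficients move when the training data changes slightly:

```r
set.seed(123)

# Refit the three-variable model on 200 bootstrap resamples of the data
boot_coefs <- t(replicate(200, {
  idx <- sample(nrow(ames), replace = TRUE)
  coef(lm(Sale_Price ~ Gr_Liv_Area + bathrooms + TotRms_AbvGrd, data = ames[idx, ]))
}))

# Standard deviation of each coefficient across resamples: a rough
# measure of instability in the estimates
apply(boot_coefs, 2, sd)
```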

Improving model fit, considerations

I do not discuss the topic of variable selection, but highly recommend the associated chapter in the online textbook Feature Engineering and Selection by Max Kuhn and Kjell Johnson.

Data transformations

Before modeling, transformations to the underlying data are often applied for one of several reasons:

  • To help satisfy model assumptions or to minimize the impact of outliers and influential points on estimates.
  • To improve the fit of the model.
  • To help with model interpretation21.
  • To facilitate preprocessing requirements important to the fitting procedure22.

Importantly for pricing contexts, transformations of the data alter the meaning of the coefficients23. Data transformations may improve model fit, but may complicate coefficient interpretability. In some cases this trade-off is worthwhile; in other cases it is not – it all depends on the aims of the model and the types of interpretations the analyst is hoping to make24. As part of an internal presentation given at NetApp on pricing, I describe some common variable transformations and how they affect the resulting interpretation of the coefficients.
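
For instance, as a small sketch of the log transformations described in footnote 23:

```r
# Log on the target only: coefficient * 100 is roughly the % change in price
# per additional square foot
fit_logy <- lm(log(Sale_Price) ~ Gr_Liv_Area, data = ames)
coef(fit_logy)

# Log on both sides ("elasticity" specification): coefficient is roughly the
# % change in price per 1% change in square footage
fit_loglog <- lm(log(Sale_Price) ~ log(Gr_Liv_Area), data = ames)
coef(fit_loglog)
```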

More sophisticated Machine Learning Methods:

When using more sophisticated machine learning techniques, the term data transformation is sometimes supplanted by the term feature engineering (though the latter typically suggests more numerous or more complicated changes to input data). Some machine learning techniques25 can also identify hard-to-find relationships or obviate the need for complex data transformations that would be required to produce good model fits with a linear model26. This may save an analyst time or allow them to produce models with a better fit, but may come at a cost to ease of model interpretability. For a brief discussion, see the section Interpretability of machine learning methods. For this post, I will stick to linear models.

Fitting procedures

Alternatives to the standard optimization technique for linear regression, Ordinary Least Squares (OLS), may be more robust to violations of model assumptions and to influential points, or may tend to produce more stable estimates27. A few options:

  • Regularization: puts constraints on the linear model that discourage high levels of variance in your coefficient estimates. See the section Regularization and collinear variables for a fuller discussion of how L1 & L2 penalties affect estimates for collinear inputs differently28.
  • Bayesian approaches: can use priors and rigorous estimation procedures to limit overfitting and subdue extreme estimates.
  • Robust regression: typically refers to weighted least squares (or similar methods), which allow different amounts of weight to be given to observations (typically to reduce the weight of extreme and influential points).

Each of these fitting procedures has different advantages and disadvantages and will modulate coefficient estimates differently29.
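
As one sketch of the robust-regression option, MASS::rlm() fits the same specification by iteratively reweighted least squares, down-weighting extreme observations:

```r
library(MASS)

# Robust (M-estimation) fit of the three-variable model; influential
# observations receive lower weights than under OLS
fit_robust <- rlm(Sale_Price ~ Gr_Liv_Area + bathrooms + TotRms_AbvGrd, data = ames)

# Compare coefficient estimates against the OLS fit from earlier
cbind(OLS = coef(fit_three), robust = coef(fit_robust))
```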

Closing notes and tips

You can use regression models to evaluate the impact of different factors on price. However, it is important to consider how coefficient estimates will respond to your particular input data (e.g. multicollinearity of your inputs or violations of your model assumptions) and to use techniques that will produce an appropriate model fit for your needs. In pricing contexts in particular, you should consider the types of inferences you will be asked to make and build your model in a way that fits your business requirements.

Some tips for building models for inference in pricing contexts:

  • If your model doesn’t explain a high proportion of the variance in price, be careful what you say to stakeholders about the respective value of components30.
  • Getting a good model fit should be a driving force. However, in a similar way to how you may prefer fewer variables or a simpler modeling technique, you may also prefer fewer and less complicated variable transformations31.
  • When evaluating the influence of the components of your product, review the variability in your coefficient estimates and not just the estimates themselves.
  • Consider building linear models using multiple model fit techniques32 33.
  • Even if you plan on using a linear model, a generic, more complex machine learning model can be helpful as a sanity check (see the sketch below). If model performance is not substantially different between your models, you are fine; if it is, there may be an important relationship you are missing and need to identify.
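
A sketch of that last sanity check, using a generic random forest (randomForest here is just one convenient choice) and a simple holdout split:

```r
library(randomForest)

set.seed(123)
train_idx <- sample(nrow(ames), size = floor(0.8 * nrow(ames)))
train <- ames[train_idx, ]
test  <- ames[-train_idx, ]

# Fit a linear model and a generic random forest on the same three features
fit_lm <- lm(Sale_Price ~ Gr_Liv_Area + bathrooms + TotRms_AbvGrd, data = train)
fit_rf <- randomForest(Sale_Price ~ Gr_Liv_Area + bathrooms + TotRms_AbvGrd, data = train)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# If the random forest performs much better, the linear model may be
# missing an important relationship
rmse(test$Sale_Price, predict(fit_lm, newdata = test))
rmse(test$Sale_Price, predict(fit_rf, newdata = test))
```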

Stay tuned for Future pricing posts on related topics.

Appendix

Pricing challenges

The final price paid by a customer may vary substantially even within a given product. This variability is often due in part to a high degree of complexity inherent in the product and different configurations between customers34. Fluctuations in product demand and macroeconomic factors are other important influences, as are factors associated with the buyer’s / seller’s negotiation skill and ability to use market information to leverage a higher or lower discount.

The final price paid may also be influenced by a myriad of competing internal interests. Sales representatives may want leniency in price guidelines so they can hit their quota. Leadership may be concerned about potential brand erosion that often comes with lowering prices. Equity holders may be focused on immediate profitability or may be willing to sacrifice margin in order to expand market share. Effectively setting price guidelines requires the application of various economic, mathematical, and sociological principles35, which may not be feasible to set up36. Implementation also requires reliable data, which could be lacking because:

  • Market information may be inaccurate or unavailable37.
  • Total costs of production may not be accessible (from your position in the organization).
  • Current organizational goals may not be well defined.
  • Information on successful deals may be more reliable than information on missed deals.

These (or a host of other gaps in information) may make it difficult to define an objective function for identifying optimal price guidelines.

Future pricing posts

In a series of posts I will tackle a variety of questions stakeholders may ask regarding organizational pricing. Some likely topics include:

  1. How do differences in product components associate with differences in price? What is the magnitude of the influence of these factors?
  2. How have these factors changed over time?
  3. Which customers fall outside the ‘normal’ behavior in regard to the price they are receiving?
  4. How can complexities in pricing strategy be captured by a statistically rigorous modeling framework (e.g. when volume dictates price)?

Dataset considerations

The relevant qualities of a dataset I was looking for were:

  1. Multiple years of data
  2. Many features, with a few key variables associated with a large proportion of the variance

The ames housing dataset meets these qualifications and I was already familiar with it. Evaluating home prices can serve as a practical analogue for our problem; both home sales and business-to-business sales often represent large purchases with many features influencing price. You can pretend that individual rows represent B2B transactions for a large corporation selling a complicated product line (rather than individual home sales).

There are also many important differences between home sales and B2B sales that make this a weaker analogue. To name a few:

  • In B2B contexts, repeat sales are typically more important than initial sales. In the housing market, repeat sales don’t exist.
  • Information on home prices and prior home sales is accessible to both the buyer and seller – meaning there are no options for targeted pricing.
  • In B2B contexts, an influential buyer may be able to leverage the possibility of a partnership of some kind in order to secure a better deal on a large purchase38.
  • Volume selling schemes and other pricing strategies may have less of an impact on house prices compared to in B2B settings.

For the purposes of this first post, these differences don’t matter.

Interpretability of machine learning methods

In some pricing scenarios tree-based methods may be particularly helpful in modeling price – especially in contexts where the price of a product can be well defined by if-then statements. This may be useful in cases where there is volume pricing – e.g. the pricing approach differs depending on the amount being purchased. Perhaps better, though, would be Cubist models, which start as decision trees but terminate in individual linear models (allowing for different linear models based off pre-defined if-then statements).

(Ignoring figuring out the ideal type of model or feature engineering regimen39 for your problem) the typical juxtaposition between linear models and more sophisticated machine learning techniques is in how easy they are to interpret. Sophisticated machine learning methods (sometimes described as ‘black boxes’40) can be made to be interpretable. Interpretation typically involves some approach that evaluates how the predictions change in relation to some change in the underlying data. This prediction-focused way of interpreting a model has the advantage of being more standard across model types. The argument goes that regardless of the structure of the model, you always get predictions, hence you should use these predictions to drive your interpretations of the model. This enables you to compare models (across things other than just raw performance) regardless of the type of model you use.

The advantage linear models have is that the model form itself is highly interpretable. Unlike other models, the parameters of linear models are directly aggregatable. With a linear model you can more easily say how much value a component of a product adds to the price. With other types of models this translation is usually more difficult.

Linear models can be understood by a wider audience and also may be viewed as more logical or fair41 42. However, if you build a linear model with highly complicated transformations, interactions, or non-linear terms, notions of this ‘interpretability advantage’ start to deteriorate43.

In summary, the breakdown of linear regression vs complicated machine learning models in pricing contexts may be similar to that in other problem spaces:

  • If you only care about accuracy of your predictions (i.e. pricing estimates) or want to save time on complex feature engineering, more sophisticated machine learning techniques may be valuable.
  • If you care about interpretability or have audit requirements regarding prices, linear models have certain advantages.

Regularization and collinear variables

Regularization typically comes in a few flavors: an L1 penalty (lasso regression), an L2 penalty (ridge regression), or some combination of the two (elastic net) is applied to the linear model. These penalties provide a cost for larger coefficients, which acts to decrease the variance in our estimates44. In conditions of collinear inputs, these two penalties act differently on coefficient estimates of collinear features:

  • Lasso regression tends to choose a ‘best’ variable (among a subset of collinear variables) whose coefficient ‘survives’, while the other associated variables’ coefficients are pushed towards zero
  • For ridge regression, coefficients of similar variables gravitate to a similar value
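
A sketch of this contrast with glmnet, where alpha = 1 gives the lasso and alpha = 0 gives ridge, and the penalty weight is chosen by cross-validation:

```r
library(glmnet)

# Matrix of the three collinear inputs (glmnet standardizes internally by default)
x <- as.matrix(ames[, c("Gr_Liv_Area", "bathrooms", "TotRms_AbvGrd")])
y <- ames$Sale_Price

set.seed(123)
lasso_fit <- cv.glmnet(x, y, alpha = 1)  # L1 penalty
ridge_fit <- cv.glmnet(x, y, alpha = 0)  # L2 penalty

# Compare how the two penalties distribute influence across correlated inputs
lasso_coef <- as.matrix(coef(lasso_fit, s = "lambda.1se"))
ridge_coef <- as.matrix(coef(ridge_fit, s = "lambda.1se"))
cbind(lasso = lasso_coef[, 1], ridge = ridge_coef[, 1])
```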

Coefficients of a regularized model

Variable inputs are usually standardized before applying regularization. Hence, because inputs are all (essentially) put on the same scale, the coefficient estimates can be directly compared with one another as measures of their relative influence on the target (home price). This ease of comparison may be convenient. However, if our goal is interpreting the coefficient estimates in terms of dollar change per unit increase, we will need to back-transform the coefficients.
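
A small sketch of that back-transformation, using a manually standardized input in a plain lm() for illustration:

```r
# Coefficient on a standardized input is in "dollars per one SD of square footage"
fit_std <- lm(Sale_Price ~ scale(Gr_Liv_Area), data = ames)

# Back-transform to dollars per square foot by dividing by the input's SD ...
coef(fit_std)[2] / sd(ames$Gr_Liv_Area)

# ... which matches the coefficient from the unstandardized fit
coef(lm(Sale_Price ~ Gr_Liv_Area, data = ames))[2]
```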


  1. Internal sales data alone is limited in that it’s focused on only a component of sales, rather than the full picture – this puts the analyst in the familiar position of having incomplete information and a constrained scope of influence.↩︎

  2. The dataset should be structured such that each feature is a column and each row an observation, e.g. a sale.↩︎

  3. Sort of, and under certain contexts…↩︎

  4. https://en.wikipedia.org/wiki/Hedonic_regression↩︎

  5. Missing important components or misattributing influence of price can cause bias in the model (omitted variable bias).↩︎

  6. and how it can also be used for things like creating price indices↩︎

  7. Does not include the basement.↩︎

  8. I.e. force the intercept to zero so that the expected value of a house with 0 square feet is $0.↩︎

  9. In this case, the coefficient for the model becomes 119.7 if the intercept is set to zero.↩︎

  10. Hedonic modeling also has a variety of approaches associated with evaluating changes in the intercept term between models that (again) can be read about in the Handbook on Residential Property Prices Indices.↩︎

  11. Note there are more modern approaches to estimating this range using Bayesian or simulation based methods.↩︎

  12. At least to the extent that satisfying them doesn’t improve your predictions, or suggest a different model type may be more appropriate.↩︎

  13. Although some would argue that you don’t need to worry too much about any of your assumptions except that your observations are independent of one another.↩︎

  14. The model assumptions of linear regression by Ordinary Least Squares are already covered extensively in essentially every tutorial and Introduction to Statistics textbook on regression.↩︎

  15. Perhaps representing a preference for larger rooms or open space among buyers, or a confounding effect with some other variable. For the purposes of this post I simply want to point out how coefficient values can vary under conditions of collinearity.↩︎

  16. (though not perfect collinearity)↩︎

  17. Hence the importance of being particularly thoughtful of the variables you input into the model and avoiding variables that roughly duplicate one another.↩︎

  18. A common rule of thumb for when variables are ‘too correlated’ is 0.90 – at least in regression contexts and cases where you are focused on inference. In other contexts (e.g. those that appear in Kaggle prediction competitions) this threshold can be much higher. However, as discussed, lower levels of correlation can still contribute to instability in your coefficient estimates.↩︎

  19. Where you care about the individual parameter estimates and want them to be meaningful.↩︎

  20. This is what “variance” means in the bias-variance trade-off common in model development. This may also be referred to as instability in the model or parameter estimates.↩︎

  21. An example of this may be standardizing the underlying data so that the coefficient estimates may be more directly compared to one another (as the underlying data is all on the same scale).↩︎

  22. Standardizing the data is also important for many fitting methods, e.g. regularization.↩︎

  23. E.g. a log transform on an input alters the interpretation of the coefficient to be something closer to dollars per percentage point change in the input. A log on the target means percentage change in price per unit change in the input. If you take the log of both your inputs and your target, the coefficient represents the percent change in price associated with a percent change in the input, also known as an ‘elasticity’ model in econometrics.↩︎

  24. There may be a preference to speak in the simplest terms: change in price as a function of unit change in component – which may put pressure on the analyst to limit data transformations. It is the analyst’s job then to strike the correct balance between producing a model that fits the data and one that can be understood by stakeholders.↩︎

  25. Neural networks in particular.↩︎

  26. Many non-linear methods still have sophisticated preprocessing requirements, though these are sometimes more generic – meaning less work to customize between problems to reach at least some minimum level of fit (again, in some contexts).↩︎

  27. Some of what I read on hedonic modeling seemed to discourage the use of methods other than Ordinary Least Squares (e.g. Weighted Least Squares) but I’ve found other methods to be helpful.↩︎

  28. Regularization with an L1 penalty provides the added bonus of also doing variable selection.↩︎

  29. For robust methods and regularization, the methods for producing confidence intervals are less established. You may need to use simulation methods (which are more computationally intensive).↩︎

  30. Generally it is a good idea to report the mean error or some other measure of fit, so that stakeholders can get a sense of how closely the model fits the data, or whether the effects you are describing are general trends that are not particularly useful for predictions.↩︎

  31. And a preference for transformations that retain an intuitive interpretability for the model.↩︎

  32. Then review the coefficient estimates across them.↩︎

  33. I tend to rely heavily on regularization techniques.↩︎

  34. A variety of factors though push organizations to simplify their products and this process – for the purposes of this post though, I’ll assume a complicated product portfolio.↩︎

  35. For a fuller discussion of these concepts, see the UVA Coursera specialization on Cost and Economics in Pricing Strategy.↩︎

  36. Organizations may lack the money or the will.↩︎

  37. Maybe your company doesn’t want to pay the expensive prices that data vendors set for this information (this may especially be a problem if you are a small organization with a small budget).↩︎

  38. While a home seller may be more sympathetic to some buyers over others (e.g. a newly wedded couple looking to start a family over a real-estate mogul looking for investment properties), such preferences likely impact price less than the analogue in B2B contexts, where sellers seek to strike deals with popular brands as a means of establishing product relevance and enabling further marketing and potentially collaboration opportunities. It is important to note, though, that the ‘Clayton Act’ and ‘Robinson-Patman Act’ make price discrimination in B2B contexts illegal (except in certain circumstances).↩︎

  39. PCA or factor analysis seems like a potentially useful approach in pricing contexts in cases where the variables you have do not clearly represent discrete components of the product – hopefully PCA would help to identify these implicit components.↩︎

  40. Due to the lack of transparency into how they produce predictions.↩︎

  41. Or at least in places where a price seems unfair, it may be easier to quickly identify where the issue lies.↩︎

  42. There is a book Weapons of Math Destruction by Cathy O’Neil that points to a lack of interpretability as one of the chief concerns with modern learning algorithms.↩︎

  43. May as well use a Machine Learning method at this point.↩︎

  44. In both cases non-informative regressors will tend towards zero (in the case of ridge regression, they will never quite reach zero). These approaches typically require tuning to identify the ideal weight (i.e. pressure) assigned to the penalty.↩︎