King County Housing Prices

Bonny Nichol
Feb 28, 2020
Image Credit: https://breteuilimmo.com

For this project, I used a flat dataset of King County housing prices in Washington State, covering sales from 2014 to 2015. After importing the .csv file, we are presented with a long list of potential features, not all of which will be needed for the regression model later.

Looking for NaN values in the dataset

With a quick data check, we can observe there is missing data.
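As a rough illustration of that check, here is a minimal pandas sketch. The tiny DataFrame and its column names are stand-ins I made up for the example, not the real file; in practice the data would come from `pd.read_csv` on the project's .csv.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the housing data; the column names and values are assumptions.
df = pd.DataFrame({
    "price": [221900.0, 538000.0, 180000.0, 604000.0],
    "sqft_basement": ["0.0", "400.0", "?", "910.0"],
    "waterfront": [np.nan, 0.0, 0.0, np.nan],
    "yr_renovated": [0.0, 1991.0, np.nan, 0.0],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = df.isna().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Placeholder strings such as "?" are not NaN, so they need a separate count.
placeholder_counts = (df == "?").sum()
print(placeholder_counts[placeholder_counts > 0])
```

Note that `isna()` alone would miss the "?" entry, which is why the second count is worth running before any cleaning starts.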

Scrub, Scrub, Scrub

Cleaning the data is the most time-consuming part of the project. It includes replacing missing values with the median (the mean is sensitive to outliers) and replacing placeholder values like “?” that would otherwise affect our model later. In addition to handling missing data, we bring the feature distributions closer to normal by removing outliers and applying a log transformation to some features.
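The cleaning steps described above could look roughly like this in pandas. This is a sketch on made-up data: the column names are assumptions, and the 1.5 × IQR rule for outliers is one common choice that I picked for the example, not necessarily the cutoff used in the project.

```python
import numpy as np
import pandas as pd

# Small made-up sample; column names and values are assumptions.
df = pd.DataFrame({
    "price": [200000.0, 250000.0, 300000.0, 350000.0, 400000.0, 9000000.0],
    "sqft_basement": ["0.0", "400.0", "?", "910.0", "0.0", "300.0"],
    "view": [0.0, np.nan, 2.0, 0.0, 0.0, 4.0],
})

# Coerce the column to numeric so that "?" becomes NaN.
df["sqft_basement"] = pd.to_numeric(df["sqft_basement"], errors="coerce")

# Fill missing values with each column's median (less sensitive to outliers than the mean).
for col in ["sqft_basement", "view"]:
    df[col] = df[col].fillna(df[col].median())

# Drop price outliers with the 1.5 * IQR rule (an assumed cutoff for this sketch).
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = df["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
cleaned = df[in_range].copy()

# Log-transform the right-skewed price so its distribution is closer to normal.
cleaned["log_price"] = np.log(cleaned["price"])
print(cleaned)
```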

Scatterplot/Histogram Visualization

Additionally, the scatterplots show that certain features would perform better in the model as categorical variables. (In the scatterplot, these are generally the features whose points line up in vertical bands.) One of the best visualizations for data exploration I discovered in this project was the scatterplot/histogram combo. It efficiently shows the relationship between two features (a positive or negative linear trend in the scatterplot) and, in the histogram, whether a feature is normally distributed.
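One way to build a scatterplot/histogram combo is sketched below with matplotlib. The helper name `scatter_hist_grid`, the column names, and the random demo data are all hypothetical; the plots in this post may well have been produced differently.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def scatter_hist_grid(df, features, target):
    """Draw one row per feature: a scatterplot against the target, then a histogram."""
    fig, axes = plt.subplots(
        len(features), 2, figsize=(10, 4 * len(features)), squeeze=False
    )
    for row, feature in zip(axes, features):
        # Left panel: relationship between the feature and the target.
        row[0].scatter(df[feature], df[target], alpha=0.4)
        row[0].set_xlabel(feature)
        row[0].set_ylabel(target)
        # Right panel: distribution of the feature itself.
        row[1].hist(df[feature], bins=20)
        row[1].set_xlabel(feature)
        row[1].set_ylabel("count")
    fig.tight_layout()
    return fig


# Made-up demo data: one continuous feature and one that takes only a few values.
rng = np.random.default_rng(0)
sqft = rng.normal(2000, 500, 200)
demo = pd.DataFrame({
    "sqft_living": sqft,
    "floors": rng.choice([1.0, 1.5, 2.0, 3.0], 200),
    "price": 150 * sqft + rng.normal(0, 50000, 200),
})
fig = scatter_hist_grid(demo, ["sqft_living", "floors"], "price")
```

Calling `plt.show()` afterwards displays the figure. In the demo, `floors` produces the vertical bands that hint at a categorical variable, while `sqft_living` shows a linear trend.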

Looking for Multicollinearity

Explorations

After cleaning the data, it is important to look for multicollinearity among our features. If the correlation between two features is too high (for example, a coefficient above 0.75), it can hurt our regression model by causing large fluctuations in the coefficients.
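A simple way to flag such pairs is to scan the correlation matrix, as in this sketch. The helper name `high_correlation_pairs` and the demo columns are my own assumptions; the 0.75 threshold is the example value from the paragraph above.

```python
import numpy as np
import pandas as pd


def high_correlation_pairs(df, threshold=0.75):
    """Return feature pairs whose absolute Pearson correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)


# Made-up demo data: sqft_above is built to track sqft_living closely.
rng = np.random.default_rng(1)
sqft_living = rng.normal(2000, 500, 200)
features = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": 0.9 * sqft_living + rng.normal(0, 50, 200),
    "bedrooms": rng.integers(1, 6, 200).astype(float),
})
pairs = high_correlation_pairs(features)
print(pairs)
```

When a pair is flagged, a common follow-up is to drop one of the two features before fitting the model.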

Regression and Feature Selection

Interesting relationship between age of house and sales price

Finally, we can create a regression model using stepwise feature selection. The model yields a coefficient for each feature, which tells us which features most affect the dependent variable (Price).
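To make the idea concrete, here is a simplified sketch of forward stepwise selection using only NumPy and pandas. It is a stand-in under stated assumptions, not the exact procedure from this project: it adds features one at a time based on the gain in adjusted R², whereas stepwise routines are often driven by p-values instead. The function names, the `min_gain` parameter, and the demo data are all hypothetical.

```python
import numpy as np
import pandas as pd


def adjusted_r2(X, y):
    """Fit OLS with an intercept and return (adjusted R^2, coefficients)."""
    design = np.column_stack([np.ones(len(X)), X])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coefs
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    n, p = len(y), design.shape[1] - 1
    r2 = 1 - ss_res / ss_tot
    return 1 - (1 - r2) * (n - 1) / (n - p - 1), coefs


def forward_select(df, target, min_gain=0.005):
    """Greedily add the feature that most improves adjusted R^2 until the gain is small."""
    y = df[target].to_numpy(dtype=float)
    remaining = [c for c in df.columns if c != target]
    selected, best_score = [], -np.inf
    while remaining:
        scores = {
            c: adjusted_r2(df[selected + [c]].to_numpy(dtype=float), y)[0]
            for c in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_score < min_gain:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    # Refit on the chosen features to get the final coefficients.
    _, coefs = adjusted_r2(df[selected].to_numpy(dtype=float), y)
    return selected, pd.Series(coefs, index=["intercept"] + selected)


# Made-up demo data: log_price depends on two features; the third is pure noise.
rng = np.random.default_rng(2)
n = 300
demo = pd.DataFrame({
    "sqft_living": rng.normal(2000, 500, n),
    "grade": rng.integers(5, 11, n).astype(float),
    "noise_feature": rng.normal(0, 1, n),
})
demo["log_price"] = (
    12 + 0.0004 * demo["sqft_living"] + 0.1 * demo["grade"] + rng.normal(0, 0.05, n)
)
selected, coefs = forward_select(demo, "log_price")
print(selected)
print(coefs)
```

Because the features are on different scales here, the raw coefficients are not directly comparable; standardizing the features first is one way to compare their effects on price.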

Business Value

The most important part of this exercise is to recognize the business value of studying datasets and building regression models in order to learn the most from our data. Understanding which features drive a company’s sales and being able to predict the sale price of a house are very useful to real estate companies and to individuals interested in specific housing markets.

