Northwind Database Analysis

Bonny Nichol
3 min readNov 4, 2020
Image Credit: https://www.infineon.com

The Northwind database, a free and open-source database created by Microsoft, was created for a fictional company for the purpose of practicing SQL queries and statistical analysis.

The goal of the project is to query the database and to perform statistical analysis and hypothesis testing to generate analytical insights that can be of value to the company. From the database we are able to gather information about employees, orders, shipping performance, and other valuable information to provide advice to the company.

The methodology of this project included general data exploration, Welch’s T-Test, Cohen’s D, and ANOVA. Hypothesis testing for NULL and Alternative hypothesis were done to search for evidence of statistical significance of various metrics.

Data Exploration

Data Exploration and querying of the database provided general information about the Northwind company and the orders contained in the database. This exploration also provided the relationships amongst the data within the database. This information included important data about the size of Northwind, the shipping suppliers, the orders (number of orders, quantity) in the database, and more. By exploring the data, hypothesis were able to be created to be able to provide analytical insight for the company.

Hypothesis Testing

The project addressed 4 questions:

  1. Does discount amount have a statistically significant effect on the quantity of a product in an order? If so, at what level(s) of discount?
  2. Is there is a statistical difference between the performances of the shipping companies?
  3. Does the time of year (first half of the year or second half of the year) have an impact on the quantity of orders?
  4. Is there a statistically significant difference between performance of UK employees and US employees?

For each question a Null hypothesis and an Alternative Hypothesis were set up. The null hypothesis basically states that there is no evidence of statistical significance. The alternative hypothesis states that there is statistical significance. The purpose of the hypothesis testing is to see if there is enough evidence to reject or to not reject the null hypothesis.

Methodology

For each question, the data was queried from the database and an initial plot was made to show probability distributions to visually see sample differences in the mean and standard deviation. The graph below is for the first question above and displays there is a visual difference in the means. The distribution is skewed to the right and with a non normal distribution although they seem to be similar shapes.

Welch’s T-test

After calculating the mean of each population and variances of the populations, it is evident that they are different and therefore require Welch’s T-Test for hypothesis testing. This two sample test is used when there are two populations to see if the populations have equal means.

Cohen’s D

Cohen’s D is used to accompany other statistical testings like Welch’s t-test or ANOVA to calculate the effect size to calculate the magnitude of difference (effect size).

ANOVA

ANOVA tests multiple pairwise comparisons and creates a chart output to show significance of each factor.

Conclusions

From our hypothesis testing we were able to reject the null hypothesis for the question of if discounts had a significant effect on the quantity of the order. Therefore there is strong evidence that discount does play a role in how much someone orders a product. For the other three questions, there was no significant evidence that allowed to reject the null hypothesis. The performances of shipping companies, the time of year vs the quantity of orders, and the differences of performances of UK and US employees did all not yield any significant statistical differences.

--

--

Bonny Nichol

Creative Data Enthusiast ✏️ | Passionate for data & science | Coffee lover ☕️ | Let’s get in touch: linkedin.com/in/bonny-nichol/