Data Overview

Dataset Introduction

The "Online Retail" dataset contains all transactions recorded by a UK-based online gift retailer between December 1, 2010 and December 9, 2011. It has 541,909 rows and 8 columns; each row records one product line on an invoice.

Key Variables:

  • InvoiceNo: Invoice number identifying each transaction (shared by all product lines on the same invoice)
  • StockCode: Product code
  • Description: Product description
  • Quantity: Quantity purchased of each product
  • InvoiceDate: Date and time of purchase
  • UnitPrice: Unit price for each product
  • CustomerID: Unique identifier for each customer
  • Country: Country where the customer resides

Data Cleaning Process

The raw dataset required several cleaning steps to prepare it for analysis:

Issues Addressed:

  • Removed 5,268 duplicate rows
  • Handled missing values in Description (0.27%)
  • Handled missing values in CustomerID (24.93%)
  • Identified 10,587 return transactions (negative quantities)
  • Identified 2,512 transactions with zero or negative prices
  • Handled outliers in Quantity and UnitPrice
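
The counts listed above can be reproduced with a few pandas operations. What follows is a minimal sketch rather than the pipeline actually used: the function name clean_transactions and the sample rows are hypothetical, and it assumes the percentages are measured after duplicates are dropped. Outlier handling is omitted because the method is not stated here.

```python
import pandas as pd


def clean_transactions(df):
    """Drop exact duplicate rows and collect basic data-quality counts."""
    deduped = df.drop_duplicates()
    report = {
        "duplicates_removed": len(df) - len(deduped),
        # Share of rows with a missing value, as a percentage.
        "missing_description_pct": deduped["Description"].isna().mean() * 100,
        "missing_customer_pct": deduped["CustomerID"].isna().mean() * 100,
        # Returns are identified by a negative quantity.
        "returns": int((deduped["Quantity"] < 0).sum()),
        "nonpositive_prices": int((deduped["UnitPrice"] <= 0).sum()),
    }
    return deduped, report


# Made-up sample rows for illustration only, not taken from the dataset.
sample = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003", "1004"],
    "StockCode": ["A", "A", "B", "A", "C"],
    "Description": ["MUG", "MUG", None, "MUG", "BAG"],
    "Quantity": [6, 6, 2, -1, 3],
    "UnitPrice": [2.5, 2.5, 4.0, 2.5, 0.0],
    "CustomerID": [17850.0, 17850.0, None, 17850.0, 13047.0],
})

cleaned, report = clean_transactions(sample)
print(report)
```

Rows are only flagged here, not deleted; whether returns and zero-priced rows are kept depends on the analysis that follows.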

Additional Variables Created:

  • TotalPrice: Quantity × UnitPrice
  • Year, Month, Day, Hour: Time components
  • DayOfWeek: Day of the week (0-6)
  • IsReturn: Flag for return transactions
  • RFM metrics: Recency, Frequency, Monetary value
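
One way to derive these variables with pandas is sketched below. The helper names add_features and rfm_table, the snapshot date, and the sample rows are hypothetical. The sketch assumes that Frequency counts distinct invoices, that Monetary sums TotalPrice with returns included, and that DayOfWeek follows the pandas convention of Monday = 0; the original definitions may differ.

```python
import pandas as pd


def add_features(df):
    """Add the price, time-component, and return-flag columns."""
    out = df.copy()
    out["InvoiceDate"] = pd.to_datetime(out["InvoiceDate"])
    out["TotalPrice"] = out["Quantity"] * out["UnitPrice"]
    out["Year"] = out["InvoiceDate"].dt.year
    out["Month"] = out["InvoiceDate"].dt.month
    out["Day"] = out["InvoiceDate"].dt.day
    out["Hour"] = out["InvoiceDate"].dt.hour
    out["DayOfWeek"] = out["InvoiceDate"].dt.dayofweek  # Monday=0 ... Sunday=6
    out["IsReturn"] = out["Quantity"] < 0
    return out


def rfm_table(df, snapshot):
    """Per-customer Recency (days), Frequency (invoices), Monetary (sum)."""
    known = df.dropna(subset=["CustomerID"])  # RFM needs a customer key
    grouped = known.groupby("CustomerID")
    return pd.DataFrame({
        "Recency": (snapshot - grouped["InvoiceDate"].max()).dt.days,
        "Frequency": grouped["InvoiceNo"].nunique(),
        "Monetary": grouped["TotalPrice"].sum(),
    })


# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "Quantity": [6, 2, -1, 3],
    "UnitPrice": [2.5, 4.0, 2.5, 1.0],
    "InvoiceDate": ["2011-12-01 10:15", "2011-12-01 10:15",
                    "2011-12-05 12:30", "2011-12-08 09:00"],
    "CustomerID": [17850.0, 17850.0, 17850.0, 13047.0],
})

features = add_features(sample)
rfm = rfm_table(features, snapshot=pd.Timestamp("2011-12-10"))
print(rfm)
```

Rows without a CustomerID are excluded from the RFM table, which is why the missing CustomerID rate matters for customer-level analysis.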

Data Distribution

Quantity Distribution

Most transactions involve small quantities, with a long tail of larger orders

Unit Price Distribution

Most products are priced under £10, with a few premium items

Geographic Distribution

Transactions by Country

The United Kingdom dominates with 91.8% of all transactions

Key Geographic Insights:

  • United Kingdom: 91.8% of total revenue
  • Netherlands: 3.2% of total revenue
  • EIRE (Ireland): 3.1% of total revenue
  • Germany: 2.5% of total revenue
  • France: 2.3% of total revenue
  • 38 countries in total represented in the dataset
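
A country's share of transactions and its share of revenue are different measures and generally give different percentages. The sketch below computes both side by side; the function name country_shares and the sample figures are hypothetical, and it assumes a TotalPrice column (Quantity × UnitPrice) is already present.

```python
import pandas as pd


def country_shares(df):
    """Percentage of revenue and percentage of rows per country."""
    revenue = df.groupby("Country")["TotalPrice"].sum()
    rows = df["Country"].value_counts()
    shares = pd.DataFrame({
        "RevenuePct": revenue / revenue.sum() * 100,
        "RowPct": rows / rows.sum() * 100,
    })
    return shares.sort_values("RevenuePct", ascending=False)


# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "Germany"],
    "TotalPrice": [10.0, 20.0, 30.0, 40.0],
})

shares = country_shares(sample)
print(shares)
```

In this toy sample the United Kingdom accounts for 75% of rows but only 60% of revenue, which shows why the two figures should be labelled separately.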

Temporal Distribution

Transactions Over Time

Transaction volume increases toward the end of the year (holiday season)

Temporal Patterns:

  • Best month: November (pre-holiday shopping)
  • Best day of week: Thursday
  • Peak hour: 12 PM (noon)
  • Lowest activity: Weekends and early mornings
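
Patterns like these can be found by counting rows per month, weekday, and hour. The sketch below assumes "best" means the highest number of transaction rows; ranking by revenue instead would require summing TotalPrice. The function name busiest_periods and the sample timestamps are hypothetical.

```python
import pandas as pd


def busiest_periods(df):
    """Return the month, weekday, and hour with the most rows."""
    when = pd.to_datetime(df["InvoiceDate"])
    return {
        "month": when.dt.month.value_counts().idxmax(),
        "day_of_week": when.dt.day_name().value_counts().idxmax(),
        "hour": when.dt.hour.value_counts().idxmax(),
    }


# Made-up sample timestamps for illustration only.
sample = pd.DataFrame({
    "InvoiceDate": ["2011-11-03 12:05", "2011-11-10 12:40",
                    "2011-11-17 15:00", "2011-10-04 09:30"],
})

peaks = busiest_periods(sample)
print(peaks)
```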

Data Timespan:

  • Start date: December 1, 2010
  • End date: December 9, 2011
  • Total days: 374 (counting both the start and end dates)
  • Complete months: 12
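
The timespan figures follow from simple date arithmetic, sketched here with the standard library. The helper complete_months is a hypothetical name; it counts calendar months that fall entirely inside the date range.

```python
import calendar
from datetime import date


def complete_months(start, end):
    """Count calendar months fully contained in [start, end]."""
    count = 0
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        first = date(year, month, 1)
        last = date(year, month, calendar.monthrange(year, month)[1])
        if first >= start and last <= end:
            count += 1
        # Advance to the next calendar month.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return count


start = date(2010, 12, 1)
end = date(2011, 12, 9)

elapsed_days = (end - start).days   # days between the two dates
calendar_days = elapsed_days + 1    # counting both endpoints
months = complete_months(start, end)
print(elapsed_days, calendar_days, months)
```

The range spans 373 elapsed days, or 374 calendar days when both endpoints are counted. December 2011 is only partially covered, so the 12 complete months run from December 2010 through November 2011.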

Data Quality Summary

Metric                  Value      Notes
Total Rows (Original)   541,909    Before cleaning
Total Rows (Cleaned)    536,641    After removing duplicates
Missing CustomerID      24.93%     Affects customer-level analysis
Missing Description     0.27%      Minimal impact
Return Transactions     10,587     Identified by negative quantities
Zero/Negative Prices    2,512      Potential data entry errors
Unique Products         4,070      Based on StockCode
Unique Customers        4,339      With valid CustomerID