Data Overview
Dataset Introduction
The "Online Retail" dataset contains all transactions recorded by a UK-based online gift retailer between December 1, 2010 and December 9, 2011. It consists of 541,909 rows and 8 columns. Each row records one product line on an invoice, so a single invoice can span several rows.
Key Variables:
- InvoiceNo: Invoice number identifying each transaction; all rows belonging to the same invoice share it
- StockCode: Product code
- Description: Product description
- Quantity: Quantity purchased of each product
- InvoiceDate: Date and time of purchase
- UnitPrice: Unit price for each product
- CustomerID: Unique identifier for each customer
- Country: Country where the customer resides
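As a reading aid, the eight variables can be written down as a typed record. This is only a sketch: the class name `Transaction`, the helper `parse_row`, the timestamp layout, and the sample row are assumptions made for illustration and are not taken from the dataset or from the original analysis.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Transaction:
    """One row of the dataset, typed according to the variable list."""
    invoice_no: str
    stock_code: str
    description: Optional[str]   # missing in a small share of rows
    quantity: int                # negative values mark returns
    invoice_date: datetime
    unit_price: float
    customer_id: Optional[str]   # missing in roughly a quarter of rows
    country: str


def parse_row(raw: dict) -> Transaction:
    """Convert a dict of strings (e.g. from csv.DictReader) into a Transaction."""
    return Transaction(
        invoice_no=raw["InvoiceNo"],
        stock_code=raw["StockCode"],
        description=raw["Description"] or None,
        quantity=int(raw["Quantity"]),
        # Assumed timestamp layout; adjust to the actual export format.
        invoice_date=datetime.strptime(raw["InvoiceDate"], "%Y-%m-%d %H:%M:%S"),
        unit_price=float(raw["UnitPrice"]),
        customer_id=raw["CustomerID"] or None,
        country=raw["Country"],
    )


# Made-up example row, not taken from the dataset.
example = parse_row({
    "InvoiceNo": "A0001", "StockCode": "X100", "Description": "SAMPLE ITEM",
    "Quantity": "6", "InvoiceDate": "2010-12-01 08:26:00",
    "UnitPrice": "2.55", "CustomerID": "", "Country": "United Kingdom",
})
print(example)
```

Empty strings are mapped to `None` so that missing Description and CustomerID values stay distinguishable from real ones in later steps.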
Data Cleaning Process
The raw dataset required several cleaning steps to prepare it for analysis:
Issues Addressed:
- Removed 5,268 duplicate rows
- Handled missing values in Description (0.27%)
- Handled missing values in CustomerID (24.93%)
- Identified 10,587 return transactions (negative quantities)
- Identified 2,512 transactions with zero or negative prices
- Handled outliers in Quantity and UnitPrice
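The checks listed above could be reproduced along the following lines. This is a minimal sketch on made-up rows using only the Python standard library; it is not the code used for the original analysis. It leaves out outlier handling, whose method is not stated here, and it computes the missing-value shares after deduplication, which is an assumption.

```python
# Toy rows (made up) in the dataset's column order.
COLUMNS = ("InvoiceNo", "StockCode", "Description", "Quantity",
           "InvoiceDate", "UnitPrice", "CustomerID", "Country")
raw_rows = [
    ("A1", "X1", "MUG", 6, "2010-12-01 08:26", 2.55, "C1", "United Kingdom"),
    ("A1", "X1", "MUG", 6, "2010-12-01 08:26", 2.55, "C1", "United Kingdom"),  # exact duplicate
    ("A2", "X2", None, 1, "2010-12-01 09:00", 0.0, None, "United Kingdom"),    # zero price
    ("A3", "X1", "MUG", -2, "2010-12-02 10:15", 2.55, "C1", "United Kingdom"), # return
    ("A4", "X3", "CARD", 12, "2010-12-02 11:30", 0.42, None, "France"),
]

# 1. Drop exact duplicate rows, keeping the first occurrence and the original order.
rows = list(dict.fromkeys(raw_rows))
duplicates_removed = len(raw_rows) - len(rows)


def missing_share(rows, column):
    """Fraction of rows whose value in `column` is missing (None)."""
    idx = COLUMNS.index(column)
    return sum(r[idx] is None for r in rows) / len(rows)


# 2. Share of missing values per column of interest.
missing_description = missing_share(rows, "Description")
missing_customer = missing_share(rows, "CustomerID")

# 3. Flag returns (negative quantity) and zero or negative prices.
qty, price = COLUMNS.index("Quantity"), COLUMNS.index("UnitPrice")
returns = [r for r in rows if r[qty] < 0]
bad_prices = [r for r in rows if r[price] <= 0]

print(duplicates_removed, missing_description, missing_customer,
      len(returns), len(bad_prices))
```

Rows are only flagged here, not dropped, which matches the wording "Identified" for returns and non-positive prices.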
Additional Variables Created:
- TotalPrice: Quantity × UnitPrice
- Year, Month, Day, Hour: Time components
- DayOfWeek: Day of the week (0-6)
- IsReturn: Flag for return transactions
- RFM metrics: Recency, Frequency, Monetary value
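The derived variables can be illustrated with a small sketch on made-up rows. Several choices below are assumptions, because the report does not state them: DayOfWeek uses Python's convention (0 = Monday), Recency is measured in days from a snapshot taken one day after the last transaction, Frequency counts distinct invoices (returns included), and Monetary is the sum of TotalPrice.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Made-up rows for illustration only.
rows = [
    {"InvoiceNo": "A1", "Quantity": 6, "UnitPrice": 2.5,
     "InvoiceDate": datetime(2011, 11, 3, 12, 5), "CustomerID": "C1"},
    {"InvoiceNo": "A1", "Quantity": 2, "UnitPrice": 1.0,
     "InvoiceDate": datetime(2011, 11, 3, 12, 5), "CustomerID": "C1"},
    {"InvoiceNo": "A2", "Quantity": -1, "UnitPrice": 2.5,
     "InvoiceDate": datetime(2011, 11, 10, 9, 30), "CustomerID": "C1"},
    {"InvoiceNo": "A3", "Quantity": 10, "UnitPrice": 0.5,
     "InvoiceDate": datetime(2011, 12, 1, 15, 0), "CustomerID": "C2"},
    {"InvoiceNo": "A4", "Quantity": 3, "UnitPrice": 1.0,
     "InvoiceDate": datetime(2011, 11, 20, 10, 0), "CustomerID": None},
]

# Row-level variables.
for r in rows:
    ts = r["InvoiceDate"]
    r["TotalPrice"] = r["Quantity"] * r["UnitPrice"]
    r["Year"], r["Month"], r["Day"], r["Hour"] = ts.year, ts.month, ts.day, ts.hour
    r["DayOfWeek"] = ts.weekday()      # 0 = Monday ... 6 = Sunday (assumed convention)
    r["IsReturn"] = r["Quantity"] < 0

# Customer-level RFM metrics; rows without a CustomerID are skipped.
snapshot = max(r["InvoiceDate"] for r in rows) + timedelta(days=1)
last_purchase = {}
invoices = defaultdict(set)
monetary = defaultdict(float)
for r in rows:
    cid = r["CustomerID"]
    if cid is None:
        continue
    last_purchase[cid] = max(last_purchase.get(cid, r["InvoiceDate"]), r["InvoiceDate"])
    invoices[cid].add(r["InvoiceNo"])
    monetary[cid] += r["TotalPrice"]

rfm = {
    cid: {
        "Recency": (snapshot - last_purchase[cid]).days,
        "Frequency": len(invoices[cid]),
        "Monetary": monetary[cid],
    }
    for cid in last_purchase
}
print(rfm)
```

Because RFM is computed per customer, the rows with a missing CustomerID cannot contribute, which is why that gap matters for customer-level analysis.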
Data Distribution
Quantity Distribution

Most transactions involve small quantities, with a long tail of larger orders
Unit Price Distribution

Most products are priced under £10, with a few premium items
Geographic Distribution
Transactions by Country

The United Kingdom dominates with 91.8% of all transactions
Key Geographic Insights:
- United Kingdom: 91.8% of total revenue
- Netherlands: 3.2% of total revenue
- EIRE (Ireland): 3.1% of total revenue
- Germany: 2.5% of total revenue
- France: 2.3% of total revenue
- 38 countries represented in total
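A country's share of transactions and its share of revenue are different measures and generally give different numbers. The sketch below computes both on made-up rows; the figures are illustrative only, and revenue is assumed to mean the sum of Quantity × UnitPrice.

```python
from collections import defaultdict

# Made-up (country, quantity, unit_price) rows for illustration only.
rows = [
    ("United Kingdom", 10, 4.0),
    ("United Kingdom", 20, 2.0),
    ("France", 4, 2.5),
    ("Germany", 2, 5.0),
]

revenue = defaultdict(float)
row_count = defaultdict(int)
for country, quantity, unit_price in rows:
    revenue[country] += quantity * unit_price   # revenue = Quantity x UnitPrice
    row_count[country] += 1

total_revenue = sum(revenue.values())
revenue_share = {c: v / total_revenue for c, v in revenue.items()}
row_share = {c: n / len(rows) for c, n in row_count.items()}

# Print countries from largest to smallest revenue share.
for country in sorted(revenue_share, key=revenue_share.get, reverse=True):
    print(f"{country}: {revenue_share[country]:.1%} of revenue, "
          f"{row_share[country]:.1%} of rows")
```

In this toy data the United Kingdom holds half of the rows but a larger share of revenue, which shows why it helps to state the measure next to each percentage.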
Temporal Distribution
Transactions Over Time

Transaction volume increases toward the end of the year (holiday season)
Temporal Patterns:
- Best month: November (pre-holiday shopping)
- Best day of week: Thursday
- Peak hour: 12 PM (noon)
- Lowest activity: Weekends and early mornings
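Patterns like these can be found by counting transactions per month, weekday, and hour. The following sketch uses made-up timestamps and assumes that "best" and "peak" refer to transaction counts rather than revenue, which the report does not specify.

```python
from collections import Counter
from datetime import datetime

# Made-up invoice timestamps for illustration only.
timestamps = [
    datetime(2011, 11, 3, 12, 10),   # Thursday
    datetime(2011, 11, 3, 12, 45),   # Thursday
    datetime(2011, 11, 10, 12, 0),   # Thursday
    datetime(2011, 11, 15, 9, 30),   # Tuesday
    datetime(2011, 10, 6, 14, 20),   # Thursday
]

by_month = Counter(ts.month for ts in timestamps)
by_weekday = Counter(ts.weekday() for ts in timestamps)   # 0 = Monday ... 6 = Sunday
by_hour = Counter(ts.hour for ts in timestamps)

# most_common(1) returns a list with the single (value, count) pair of the top bucket.
best_month = by_month.most_common(1)[0][0]
best_weekday = by_weekday.most_common(1)[0][0]
peak_hour = by_hour.most_common(1)[0][0]

print(best_month, best_weekday, peak_hour)   # 11 3 12
```

Ranking by revenue instead would only require summing TotalPrice per bucket in place of counting rows.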
Data Timespan:
- Start date: December 1, 2010
- End date: December 9, 2011
- Total days: 374 (counting both the start and end dates)
- Complete months: 12 (December 2010 through November 2011)
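The timespan figures follow from simple date arithmetic. The sketch below counts the days inclusively and treats a month as complete when both its first and last day fall inside the range; the helper name `complete_months` is illustrative.

```python
from datetime import date, timedelta

start, end = date(2010, 12, 1), date(2011, 12, 9)

# Inclusive day count: the difference is 373 days, plus one for the start date itself.
total_days = (end - start).days + 1


def complete_months(start, end):
    """Count calendar months that lie entirely within [start, end]."""
    count = 0
    year, month = start.year, start.month
    while date(year, month, 1) <= end:
        first = date(year, month, 1)
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        last = date(next_year, next_month, 1) - timedelta(days=1)
        if first >= start and last <= end:
            count += 1
        year, month = next_year, next_month
    return count


print(total_days, complete_months(start, end))   # 374 12
```

December 2011 is not counted because the data ends on December 9.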
Data Quality Summary
| Metric | Value | Notes |
|---|---|---|
| Total Rows (Original) | 541,909 | Before cleaning |
| Total Rows (Cleaned) | 536,641 | After removing duplicates |
| Missing CustomerID | 24.93% | Affects customer-level analysis |
| Missing Description | 0.27% | Minimal impact |
| Return Transactions | 10,587 | Identified by negative quantities |
| Zero/Negative Prices | 2,512 | Potential data entry errors |
| Unique Products | 4,070 | Based on StockCode |
| Unique Customers | 4,339 | With valid CustomerID |
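The row counts in the table can be cross-checked with a few lines of arithmetic. The counts are taken from the table; expressing the flagged rows as a share of the cleaned row count is an assumption about the denominator, not something the report states.

```python
# Figures from the Data Quality Summary table.
original_rows = 541_909
duplicate_rows = 5_268
cleaned_rows_reported = 536_641
return_rows = 10_587
non_positive_price_rows = 2_512

# The cleaned row count equals the original count minus the duplicates.
cleaned_rows = original_rows - duplicate_rows
assert cleaned_rows == cleaned_rows_reported

# Shares relative to the cleaned row count (assumed denominator).
return_share = return_rows / cleaned_rows
non_positive_price_share = non_positive_price_rows / cleaned_rows

print(f"returns: {return_share:.2%}, "
      f"zero/negative prices: {non_positive_price_share:.2%}")
```

The match between the reported and computed row counts suggests that only duplicate removal changed the number of rows, and that the other cleaning steps flagged or adjusted values without dropping rows.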