Big Data Project
Background:
You are hired by a leading e-commerce company to enhance its operations and improve the
overall user experience. The company's platform generates massive amounts of data daily,
including user activities, purchases, and product reviews. The data requirements for each
category are as follows:
User Activity Data:
Fields:
UserID (String)
Activity Type (String, e.g., View, Click, Add to Cart)
ProductID (String, if applicable)
Timestamp (Datetime)
Volume: Millions of records per day
Velocity: Real-time streaming
Purchase Data:
Fields:
UserID (String)
ProductID (String)
Quantity (Integer)
Price (Float)
Timestamp (Datetime)
Volume: Hundreds of thousands of records per day
Velocity: Near real-time
Product Review Data:
Fields:
UserID (String)
ProductID (String)
Rating (Integer, 1 to 5)
Review (String)
Timestamp (Datetime)
Volume: Tens of thousands of records per day
Velocity: Batch processing
1. Ingestion Layer:
a. Task: Design the architecture for ingesting and processing streaming data from the
e-commerce platform.
b. Deliverable: Create a detailed architectural diagram illustrating the components and
flow of data in the ingestion layer. An accompanying paragraph should explain the
rationale behind your design choices.
2. Storage Layer:
a. Task: Choose a suitable database or storage system for storing integrated e-commerce
data.
b. Deliverable: Provide a diagram representing the chosen storage architecture. Include a
paragraph justifying your selection based on scalability, data retrieval, and
compatibility with different data types.
3. Big Data Analytics:
a. Task: Formulate analytical queries to extract meaningful insights from user activities,
purchases, and product reviews.
b. Deliverable: Present SQL-like queries or code snippets for analytics tasks.
Additionally, create visual representations (diagrams or charts) showcasing sample
results. Accompany these visuals with a paragraph explaining the significance of the
chosen analytics.
4. Data Visualizations:
a. Task: Create visualizations that represent key metrics and trends in user behavior,
product popularity, and sales performance.
b. Deliverable: Share visualizations in the form of charts, graphs, or dashboards using
visualization tools like Tableau or Power BI. Include a paragraph describing how
these visuals effectively communicate insights and KPIs.
1. Ingestion Layer
Identification
Identify all of the data sources that need to be ingested: user activity data,
purchase data, and product review data.
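As a rough illustration, the three sources could be modeled with the fields given in the
project brief. This is only a sketch: the class and attribute names below are assumptions
chosen for readability, not part of the brief.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class UserActivity:
    user_id: str
    activity_type: str           # e.g. "View", "Click", "Add to Cart"
    product_id: Optional[str]    # None when no product applies
    timestamp: datetime


@dataclass(frozen=True)
class Purchase:
    user_id: str
    product_id: str
    quantity: int
    price: float
    timestamp: datetime


@dataclass(frozen=True)
class ProductReview:
    user_id: str
    product_id: str
    rating: int                  # expected range: 1 to 5
    review: str
    timestamp: datetime


# Illustrative record, not real data.
example_purchase = Purchase("u1", "p9", 2, 19.99, datetime(2024, 1, 5, 10, 30))
```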
Filtration
Once the data sources have been identified, they need to be filtered to remove any
unwanted or irrelevant data. The ingestion layer may filter out:
a. User activity data that is older than a defined retention period
b. Purchase data for abandoned shopping carts
c. Product review data that is not relevant to the products that the e-commerce
company sells
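The three filters above might be sketched as follows. The 90-day retention window, the
`status` field used to mark abandoned carts, and the `catalog_ids` set are all
assumptions made for this example; none of them appears in the project brief.

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=90)  # assumed retention window


def filter_stale_activities(records, now, max_age=MAX_AGE):
    """Keep only activity records no older than max_age."""
    return [r for r in records if now - r["timestamp"] <= max_age]


def filter_abandoned_carts(purchases):
    """Drop purchases flagged as abandoned (hypothetical 'status' field)."""
    return [p for p in purchases if p.get("status") != "abandoned"]


def filter_reviews(reviews, catalog_ids):
    """Keep only reviews for products the company actually sells."""
    return [r for r in reviews if r["product_id"] in catalog_ids]
```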
Validation
Validate the data to ensure that it is accurate and complete. The ingestion layer may
validate:
a. User activity data to make sure that the user IDs are valid
b. Purchase data to make sure that the product IDs and quantities are valid
c. Product review data to make sure that the ratings and reviews are valid
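A minimal sketch of these checks is shown below. It only tests the shape of each record;
a real pipeline would probably also look up user and product IDs in reference tables,
which is outside this example. The function names and error messages are illustrative.

```python
def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def _non_empty_str(value):
    return isinstance(value, str) and value.strip() != ""


def validate_activity(rec):
    """Return a list of problems found in a user activity record."""
    errors = []
    if not _non_empty_str(rec.get("user_id")):
        errors.append("missing user_id")
    return errors


def validate_purchase(rec):
    """Return a list of problems found in a purchase record."""
    errors = []
    if not _non_empty_str(rec.get("product_id")):
        errors.append("missing product_id")
    quantity = rec.get("quantity")
    if not _is_int(quantity) or quantity <= 0:
        errors.append("quantity must be a positive integer")
    return errors


def validate_review(rec):
    """Return a list of problems found in a product review record."""
    errors = []
    rating = rec.get("rating")
    if not _is_int(rating) or not 1 <= rating <= 5:
        errors.append("rating must be an integer from 1 to 5")
    if not _non_empty_str(rec.get("review")):
        errors.append("review text is empty")
    return errors
```

An empty list means the record passed every check, so callers can route records with a
non-empty result to a reject queue.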
Noise Reduction
Identify and remove any noise or errors from the data. The ingestion layer may
remove:
a. Duplicate records
b. Records with missing values
c. Records with inconsistent data
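Items (a) and (b) above could be handled in a single pass, as in the sketch below.
Treating a record as a duplicate when all of its required fields match an earlier record
is an assumption; consistency checks (item c) depend on business rules and are not
covered here.

```python
def drop_noise(records, required):
    """Remove records with missing required fields, then exact duplicates.

    `required` is a tuple of field names that must be present and that
    together identify a record.
    """
    seen = set()
    clean = []
    for rec in records:
        # Skip records where any required field is absent, None, or empty.
        if any(rec.get(field) in (None, "") for field in required):
            continue
        key = tuple(rec[field] for field in required)
        if key in seen:
            continue  # duplicate of an earlier record
        seen.add(key)
        clean.append(rec)
    return clean
```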
Transformation
Transform the data into a format that is compatible with the downstream processing
systems. The ingestion layer may:
a. Convert all of the date and time fields to a consistent format
b. Aggregate the user activity data to create daily, weekly, and monthly summaries
c. Enrich the product review data by adding sentiment analysis and other features
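Items (a) and (b) above could look roughly like this. The sketch assumes that source
timestamps carry no time zone and can be treated as UTC, and it shows only the daily
summary; weekly and monthly summaries would group on a different key. Sentiment
enrichment (item c) needs a model and is left out.

```python
from collections import Counter
from datetime import datetime, timezone


def to_utc_iso(raw, fmt):
    """Parse a timestamp string in format `fmt` and return ISO 8601 text.

    Assumption: the input has no time zone and is already in UTC.
    """
    parsed = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    return parsed.isoformat()


def daily_activity_counts(records):
    """Count activities per (calendar day, activity type)."""
    counts = Counter()
    for rec in records:
        day = rec["timestamp"].date()
        counts[(day, rec["activity_type"])] += 1
    return counts
```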
Compression
Compress the data to reduce storage space and bandwidth requirements. The
ingestion layer may compress the data using gzip.
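As a small illustration, a batch of records can be serialized as newline-delimited JSON
and compressed with Python's standard `gzip` module. The JSON-lines layout is an
assumption made for this sketch; it also requires every field to be JSON-serializable,
so timestamps would need to be strings at this point.

```python
import gzip
import json


def compress_records(records):
    """Serialize records as JSON lines and gzip the result."""
    payload = "\n".join(json.dumps(r) for r in records).encode("utf-8")
    return gzip.compress(payload)


def decompress_records(blob):
    """Reverse compress_records: gunzip and parse each JSON line."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]
```

Repetitive data such as activity logs tends to compress well, because the same field
names and activity types recur in every record.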
Integration
The final step is to integrate the processed data with the downstream processing
systems. The ingestion layer may:
a. Integrate the transformed user activity data with a recommendation engine
b. Integrate the transformed purchase data with a data warehouse for analytics
c. Integrate the transformed product review data with a machine learning model to
predict product ratings
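One way to picture this hand-off is a small router that sends each record type to its
downstream consumer. The sketch below is purely illustrative: the sinks are plain
callables standing in for the recommendation engine, data warehouse, and rating model,
and the record-type keys are assumed names.

```python
def make_router(sinks):
    """Build a function that forwards a record to the sink for its type.

    `sinks` maps a record type (e.g. "activity") to a callable that
    accepts one record.
    """
    def route(kind, record):
        if kind not in sinks:
            raise ValueError(f"no sink registered for {kind!r}")
        sinks[kind](record)

    return route


# Stand-in sinks: lists collect what each downstream system would receive.
recommendation_inbox, warehouse_inbox = [], []
route = make_router({
    "activity": recommendation_inbox.append,
    "purchase": warehouse_inbox.append,
})
route("activity", {"user_id": "u1", "activity_type": "View"})
```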
2. Storage Layer
HDFS Architecture
HDFS is well suited to storing large volumes of data, which is important for
e-commerce companies that generate massive amounts of data daily.
3. Big Data Analytics
4. Data Visualizations
A. User Behavior
User Activity by Device
A pie chart showing the percentage of user activities by device (desktop, mobile,
tablet). This chart can help identify which devices are most popular and ensure
that the e-commerce platform is optimized for those devices.
B. Product Popularity
Product Views by Category
A bar chart showing the number of product views by category. This chart can help
identify which categories are most popular and which ones may need more
attention.
C. Sales Performance
Sales by Product
A bar chart showing the total sales for each product. This chart can help identify
which products are selling well and which ones may need to be promoted or
discontinued.
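The figures behind the Sales by Product chart could be derived from the purchase data
along these lines. This is a sketch that assumes total sales means quantity multiplied
by price, summed per product; the brief does not define the metric.

```python
from collections import defaultdict


def sales_by_product(purchases):
    """Return (product_id, total_sales) pairs, highest total first."""
    totals = defaultdict(float)
    for p in purchases:
        totals[p["product_id"]] += p["quantity"] * p["price"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

The sorted pairs can be exported to Tableau or Power BI, or passed to any charting
library, to draw the bar chart.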