Social network / Course project

PennBook

A distributed social network with a Spark-based news pipeline, media uploading, and DynamoDB data modeling.

Pipeline artifact

Spark Data Pipeline

Status: Completed | EMR Cluster: Idle
[INFO] [10:00:01] Starting automated news pipeline on AWS EMR...
[INFO] [10:00:03] Downloading News_Category_Dataset_v2.json from S3 bucket...
[INFO] [10:00:15] Seeding DynamoDB Tables (Users, Articles, Follows)...
[INFO] [10:00:22] Compiling Apache Spark job (mvn clean package)...
[INFO] [10:00:45] Uploading target/news-pipeline-1.0.jar to S3...
[INFO] [10:00:50] Submitting job to Livy server (ec2-xxx.compute-1.amazonaws.com)...
[INFO] [10:01:10] Job ID 42: Running Spark context...
[INFO] [10:02:30] Phase 1: Filtering stop words and extracting NLP keywords...
[INFO] [10:03:15] Phase 2: Building inverted indices for fast search queries...
[INFO] [10:04:00] Phase 3: Executing PageRank over user follow graphs...
[INFO] [10:05:42] Phase 4: Generating personalized news feed recommendations...
[SUCCESS] [10:06:05] Pipeline finished. 85,000+ articles indexed and updated in DynamoDB.
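
Phases 1 and 2 in the log above (stop-word filtering, then inverted-index construction) can be illustrated on a single machine. The following is a minimal sketch in plain Python, not the project's Spark job: the stop-word list, the `extract_keywords` and `build_inverted_index` helpers, and the sample headlines are all hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list; the pipeline's actual list is not shown in the source.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "on", "for", "is"}


def extract_keywords(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_inverted_index(articles):
    """Map each keyword to the sorted list of article IDs whose headline contains it."""
    index = defaultdict(set)
    for article_id, headline in articles.items():
        for keyword in extract_keywords(headline):
            index[keyword].add(article_id)
    return {keyword: sorted(ids) for keyword, ids in index.items()}


# Made-up sample data, standing in for the news dataset.
articles = {
    "a1": "The Senate Votes on the Budget",
    "a2": "Budget Cuts Hit Local Schools",
}
index = build_inverted_index(articles)
```

A search for a keyword then becomes a single dictionary lookup, for example `index["budget"]`, instead of a scan over every article. In a distributed setting the same idea would presumably be expressed as a map over articles followed by a group-by on the keyword.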

Problem

A scalable social network needs resilient storage for friendships, posts, and media. It also needs batch processing, here PageRank and inverted indices, to generate relevant news recommendations without bottlenecking the main server.

Approach

We built a Node.js API backed by a DynamoDB schema, with direct AWS S3 integration for profile pictures and music. For the analytics backend, we implemented an automated Apache Spark pipeline on AWS EMR, orchestrated via Livy, that processes a large news dataset, computes PageRank scores over the follow graph, and builds inverted indices for fast search.
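
To make the PageRank step concrete, here is a minimal single-machine sketch in plain Python. It is not the project's Spark implementation: the `pagerank` function, the damping factor of 0.85, the fixed iteration count, and the three-user follow graph are assumptions made for illustration.

```python
def pagerank(follows, damping=0.85, iterations=50):
    """Iterative PageRank over a follow graph.

    follows maps each user to the list of users they follow.
    Returns a dict mapping user -> rank, with ranks summing to 1.
    """
    users = set(follows)
    for targets in follows.values():
        users.update(targets)
    n = len(users)
    ranks = {u: 1.0 / n for u in users}

    for _ in range(iterations):
        # Rank held by users who follow nobody is spread evenly over everyone.
        dangling = sum(ranks[u] for u in users if not follows.get(u))
        new_ranks = {u: (1 - damping) / n + damping * dangling / n for u in users}
        # Each user passes an equal share of their rank to everyone they follow.
        for user, targets in follows.items():
            if targets:
                share = damping * ranks[user] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Made-up follow graph: alice and bob follow carol, carol follows alice.
follows = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]}
ranks = pagerank(follows)
```

In this toy graph carol, who is followed by two users, ends up ranked above alice, and bob, whom nobody follows, ranks last. Scores like these could then be used to weight whose activity influences a recommendation, though how the project combines them with article data is not described here.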

Result

The deployed system provides a full social experience, including authentication, real-time posts, and an embedded music player that streams directly from S3, alongside daily personalized news feeds generated asynchronously by the Spark backend.