Three-Project Series

Stream Processing with Kafka and Spark

prerequisites
intermediate Scala • basic shell • basic Kafka • basic Spark
skills learned
set up Kafka Cluster • write Kafka Producer • connect Spark to Kafka • basic stream processing • complex stream processing
Gaurav Bhardwaj
3 weeks · 5-7 hours per week average · BEGINNER

Welcome to Free Power Corporation Limited (FPCL), a London-based energy company looking for a solution to deal with surging energy costs. FPCL has installed Smart Meters, which generate energy readings every thirty minutes, in households across London. As a data engineer for FPCL, you’ll create a Kafka cluster and ingest the real-time Smart Meter data into it. You’ll use Spark to read, clean, join, and process the data, adding logic to handle potential real-world problems like data loss and duplicate data. To meet the different business requirements of various FPCL teams, you’ll also perform advanced stream processing on the data streams. By the end of this series of liveProjects, you’ll have the experience and skills to ingest large amounts of data and perform complex analysis on it in real time using Apache Kafka and Spark.

These projects are designed for learning purposes and are not complete, production-ready applications or solutions.

The project is taking a very good progressive way to bring the user from basics to advanced covering the foundations of Kafka.

Georges Michel, founder and president, Paaneah, LLC.

here's what's included

Project 1 Ingest Consumer Data

As a first step in dealing with surging energy prices, Free Power Corporation Limited (FPCL) has installed Smart Meters, which generate energy readings every thirty minutes, in households across London in order to analyze consumers’ energy usage. As a new data engineer for the power company, your task is to ingest the data from the Smart Meter readings and stream it to FPCL data centers for processing. Using the Kafka command-line tool, you’ll create topics in a Kafka cluster for storing the data, and you’ll create partitions for distributing the load within the topics. You’ll add logic to deal with potential problems such as data loss and duplicate records, and you’ll add a method to convert the energy readings to the widely used, easy-to-parse JSON format before the final step of ingesting the data. When you’re finished, FPCL will have pertinent data for analyzing energy consumption patterns, and you’ll have practical experience using Kafka to ingest large amounts of data.
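The two data-quality steps described above, dropping duplicate records and converting readings to JSON, can be sketched without any Kafka dependency. This is a minimal illustration in plain Python, not the project's own solution (the projects themselves use Scala): the field names `meter_id`, `timestamp`, and `kwh`, the meter ID, and the sample values are all hypothetical.

```python
import json

def to_json(meter_id, timestamp, kwh):
    """Serialize one reading as a JSON string (field names are hypothetical)."""
    return json.dumps(
        {"meter_id": meter_id, "timestamp": timestamp, "kwh": kwh},
        sort_keys=True,
    )

def deduplicate(readings):
    """Yield each (meter_id, timestamp) pair once, keeping the first occurrence."""
    seen = set()
    for meter_id, timestamp, kwh in readings:
        key = (meter_id, timestamp)
        if key in seen:
            continue  # same meter and same half-hour slot: treat as a duplicate
        seen.add(key)
        yield meter_id, timestamp, kwh

# Made-up sample data: the third record repeats the first one.
raw = [
    ("MAC000002", "2013-01-01T00:00:00", 0.219),
    ("MAC000002", "2013-01-01T00:30:00", 0.241),
    ("MAC000002", "2013-01-01T00:00:00", 0.219),
]

# These strings are what a producer would send as message values.
messages = [to_json(*reading) for reading in deduplicate(raw)]
```

Keying the duplicate check on meter and timestamp is one plausible choice, since each meter reports once per thirty-minute slot. A real producer would also need to bound the size of the `seen` set.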

Project 2 Real-time Data Processing

As part of an endeavor to better handle surging energy prices, Free Power Corporation Limited (FPCL) has a Kafka cluster that ingests large amounts of consumer energy data. As a data engineer for FPCL, you’re already familiar with the data, so the London-based power company has tasked you with building a streaming solution that processes the data as soon as it’s available. Using Apache Spark, you’ll create an application to read the data from the Kafka streams, and you’ll save the streams to a data lake. Using a Spark API, you’ll prepare the data for analysis by performing aggregation on the fly. You’ll join the real-time stream with the static data, enriching it with customer details and enabling FPCL’s research team to gain insights about customer energy consumption patterns. When you’re done, FPCL will be better equipped to deal with rising energy costs, and you’ll have hands-on experience building a real-time data processing solution using Apache Spark and Kafka.
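The enrichment and aggregation ideas above can be illustrated on a small in-memory batch. The following is a library-free Python sketch under stated assumptions, not the Spark API the project uses: the customer lookup table, the `tariff` field, and the reading fields are hypothetical. It mimics a left join of streamed readings against static customer details, followed by a per-meter daily total.

```python
from collections import defaultdict

def enrich(reading, customers):
    """Left-join one reading with static customer details, if any exist."""
    details = customers.get(reading["meter_id"], {})
    return {**reading, **details}

def daily_totals(readings):
    """Sum kWh per (meter_id, day), where day is the date part of the timestamp."""
    totals = defaultdict(float)
    for r in readings:
        day = r["timestamp"][:10]
        totals[(r["meter_id"], day)] += r["kwh"]
    return dict(totals)

# Hypothetical static data and a made-up micro-batch of readings.
customers = {"M1": {"tariff": "standard"}}
readings = [
    {"meter_id": "M1", "timestamp": "2013-01-01T00:00:00", "kwh": 0.5},
    {"meter_id": "M1", "timestamp": "2013-01-01T00:30:00", "kwh": 0.25},
    {"meter_id": "M2", "timestamp": "2013-01-01T00:00:00", "kwh": 1.0},
]

enriched = [enrich(r, customers) for r in readings]
totals = daily_totals(enriched)
```

Readings from meters with no matching customer record pass through unchanged here, which is the behavior of a left join. An inner join would drop them instead.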

Project 3 Advanced Stream Processing

You’re the star data engineer at Free Power Corporation Limited (FPCL). The London-based power company is interested in gaining insight into its customers’ energy usage patterns, and it’s up to you to deliver a data-rich solution that satisfies the requirements of FPCL’s various teams. You’ll create a streaming Spark application to read the consumer event stream from Kafka, you’ll add information that helps the teams determine when data was generated, ingested, and processed, and you’ll write logic to reorder any late or out-of-order data. To provide vital household energy consumption statistics to the sales and electrical engineering teams, you’ll join Kafka data streams and perform complex computations on the resulting stream. To be sure your solution is ready for the teams to use, you’ll test it on the local Spark cluster. When you’ve finished, you’ll have learned advanced stream processing skills that empower you to meet the different business requirements of various enterprise departments.
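Handling late and out-of-order data usually involves buffering events and deciding how long to wait for stragglers. The sketch below is a plain Python illustration, loosely in the spirit of event-time watermarking and not the project's Spark implementation. The function name `reorder`, the `allowed_lateness` parameter, and the sample events are hypothetical. Events are buffered until the watermark (the latest event time seen minus the allowed lateness) passes them, and events that arrive behind the watermark are set aside.

```python
import heapq

def reorder(events, allowed_lateness):
    """Emit (event_time, payload) pairs in event-time order.

    The watermark is the largest event time seen so far minus allowed_lateness.
    Events older than the watermark on arrival are collected as dropped.
    """
    buffer = []      # min-heap ordered by event time, then arrival order
    max_seen = None
    emitted, dropped = [], []
    for seq, (event_time, payload) in enumerate(events):
        if max_seen is not None and event_time < max_seen - allowed_lateness:
            dropped.append((event_time, payload))
            continue
        heapq.heappush(buffer, (event_time, seq, payload))
        max_seen = event_time if max_seen is None else max(max_seen, event_time)
        watermark = max_seen - allowed_lateness
        # Release everything the watermark has passed.
        while buffer and buffer[0][0] <= watermark:
            t, _, p = heapq.heappop(buffer)
            emitted.append((t, p))
    # End of input: flush whatever is still buffered, in order.
    while buffer:
        t, _, p = heapq.heappop(buffer)
        emitted.append((t, p))
    return emitted, dropped

# Made-up events as (minutes since midnight, label), arriving out of order.
events = [(0, "a"), (60, "c"), (30, "b"), (120, "e"), (90, "d"), (10, "late")]
emitted, dropped = reorder(events, allowed_lateness=60)
```

A larger `allowed_lateness` tolerates more disorder at the cost of holding events longer before they are released.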

book resources

When you start each of the projects in this series, you'll get full access to the following book for 90 days.

It's a very good project to learn Spark Streaming with Kafka. Very well executed with simple steps.

Rambabu Posa, data engineer, Sai Aashika Consultancy Limited

For me, it is a definite game changer as I can now say I have real-life experience with event streaming as my current company (due to regulatory constraints) cannot adopt new technologies on a whim.

Monil Chheda, engineering manager, eClinicalWorks

project author

Gaurav Bhardwaj

Gaurav Bhardwaj has almost two decades of experience designing and developing enterprise software for large-scale data processing and machine learning. Currently working as a Big Data Architect for an IT consulting firm, he helps clients build mature data platforms (on-prem, cloud, and hybrid) and develops solutions involving large-scale data processing, data management and governance, machine learning models, and more. He has also authored the official documentation for Apache HBase coprocessors.

Prerequisites

These liveProjects are for intermediate Scala developers and data engineers with basic knowledge of distributed computing technologies such as Apache Spark. To begin these liveProjects, you’ll need to be familiar with the following:

TOOLS
  • Basic Apache Kafka
  • Basic Scala
TECHNIQUES
  • Data Ingestion
  • Real-time stream processing

you will learn

In this liveProject series, you’ll learn to use Kafka and Spark to ingest, stream, and process large amounts of data.

  • Install a local Kafka cluster
  • Create a topic in the Kafka cluster
  • Determine and create the appropriate number of partitions for each topic
  • Configure Spark Streaming to read data from Kafka
  • Write and run streaming jobs
  • Save a data stream to a data lake
  • Enrich a data stream
  • Write a Spark stream to a Kafka topic to be used by other systems
  • Handle late data and data arriving out of order
  • Join two streams
  • Use arbitrary stateful processing for advanced stream processing
  • Deploy and test on a local cluster

features

Self-paced
You choose the schedule and decide how much time to invest as you build your project.
Project roadmap
Each project is divided into several achievable steps.
Get Help
While within the liveProject platform, get help from other participants and our expert mentors.
Compare with others
For each step, compare your deliverable to the solutions by the author and other participants.
Book resources
Get full access to select books for 90 days. Permanent access to excerpts from Manning products is also included, as well as references to other resources.