Big Data Warehousing teaches you new techniques for common data warehousing tasks such as data ingest, SQL queries, and report generation in a big data environment. You’ll get a quick tour of using Hive and Impala to query and analyze large semi-structured datasets and learn how to build an Extract, Transform, and Load (ETL) workflow. You’ll explore data extraction with Sqoop and address the practical question of schemas for modeling and transforming big data. As you progress through the book, you’ll survey data governance with Falcon, building dataflows with Oozie, approaches to data processing, writing queries with SparkSQL, and data security using Apache Sentry and Knox.
Data warehouses, once the exclusive domain of large enterprises, are becoming increasingly commonplace as businesses shift to data-driven decision making. However, the traditional tools and approaches to building data warehouses can no longer cost-effectively handle the amount of data that even a modest-sized business can capture. The new ecosystem of big data tools surrounding Spark and Hadoop, on the other hand, not only handles these data volumes but is also accessible to a wide range of users with diverse needs, including business analysts, data scientists, and application developers.
This book assumes you're familiar with SQL-based data warehousing technologies and patterns. You do not need to be familiar with Java or Scala programming, but it helps.
Karthik Ramachandran is a software engineer and big data expert who makes big data technologies and machine learning accessible to business users. He has extensive experience with both traditional enterprise data warehousing solutions and the Hadoop ecosystem. Istvan Szegedi is a senior technical solutions architect working with enterprise data technologies and Hadoop. Richard Saltzer is a software engineer on Cloudera's internal data platform team, where he builds scalable ingestion pipelines with Impala.