Data Science on the Google Cloud Platform: Implementing End-to-End Real-Time Data Pipelines: From Ingest to Machine Learning
Original price: $64.99. Current price: $22.49.
Learn how easy it is to apply sophisticated statistical and machine learning methods to real-world problems when you build on top of the Google Cloud Platform (GCP). This hands-on guide shows developers entering the data science field how to implement an end-to-end data pipeline, using statistical and machine learning methods and tools on GCP. Through the course of the book, you'll work through a sample business decision by employing a variety of data science approaches.
Follow along by implementing these statistical and machine learning solutions in your own project on GCP, and discover how this platform provides a transformative and more collaborative way of doing data science.
You'll learn how to:
Automate and schedule data ingest, using an App Engine application
Create and populate a dashboard in Google Data Studio
Build a real-time analysis pipeline to carry out streaming analytics
Conduct interactive data exploration with Google BigQuery
Create a Bayesian model on a Cloud Dataproc cluster
Build a logistic regression machine-learning model with Spark
Compute time-aggregate features with a Cloud Dataflow pipeline
Create a high-performing prediction model with TensorFlow
Use your deployed model as a microservice you can access from both batch and real-time pipelines
From the Publisher
From the Preface
In this book, we walk through an example of this new transformative, more collaborative way of doing data science. You will learn how to implement an end-to-end data pipeline: we will begin with ingesting the data in a serverless way and work our way through data exploration, dashboards, relational databases, and streaming data all the way to training and making operational a machine learning model. I cover all these aspects of data-based services because data engineers will be involved in designing the services, developing the statistical and machine learning models, and implementing them in large-scale production and in real time.
Who This Book Is For
If you use computers to work with data, this book is for you. You might go by the title of data analyst, database administrator, data engineer, data scientist, or systems programmer today. Although your role might be narrower today (perhaps you do only data analysis, or only model building, or only DevOps), you want to stretch your wings a bit: you want to learn how to create data science models as well as how to implement them at scale in production systems.
Google Cloud Platform is designed to make you forget about infrastructure. The marquee data services (Google BigQuery, Cloud Dataflow, Cloud Pub/Sub, and Cloud ML Engine) are all serverless and autoscaling. When you submit a query to BigQuery, it is run on thousands of nodes, and you get your result back; you don’t spin up a cluster or install any software. Similarly, in Cloud Dataflow, when you submit a data pipeline, and in Cloud Machine Learning Engine, when you submit a machine learning job, you can process data at scale and train models at scale without worrying about cluster management or failure recovery. Cloud Pub/Sub is a global messaging service that autoscales to the throughput and number of subscribers and publishers without any work on your part. Even when you’re running open source software like Apache Spark that’s designed to operate on a cluster, Google Cloud Platform makes it easy. Leave your data on Google Cloud Storage, not in HDFS, and spin up a job-specific cluster to run the Spark job. After the job completes, you can safely delete the cluster. Because of this job-specific infrastructure, there’s no need to fear overprovisioning hardware or running out of capacity to run a job when you need it. Plus, data is encrypted, both at rest and in transit, and kept secure. As a data scientist, not having to manage infrastructure is incredibly liberating.
The reason that you can afford to forget about virtual machines and clusters when running on Google Cloud Platform comes down to networking. The network bisection bandwidth within a Google Cloud Platform datacenter is 1 PBps, and so sustained reads off Cloud Storage are extremely fast. What this means is that you don’t need to shard your data as you would with traditional MapReduce jobs. Instead, Google Cloud Platform can autoscale your compute jobs by shuffling the data onto new compute nodes as needed. Hence, you’re liberated from cluster management when doing data science on Google Cloud Platform.
These autoscaled, fully managed services make it easier to implement data science models at scale, which is why data scientists no longer need to hand off their models to data engineers. Instead, they can write a data science workload, submit it to the cloud, and have that workload executed automatically in an autoscaled manner. At the same time, data science packages are becoming simpler and simpler. So, it has become extremely easy for an engineer to slurp in data and use a canned model to get an initial (and often very good) model up and running. With well-designed packages and easy-to-consume APIs, you don’t need to know the esoteric details of data science algorithms, only what each algorithm does and how to link algorithms together to solve realistic problems. This convergence between data science and data engineering is why you can stretch your wings beyond your current role.
Rather than simply read this book cover-to-cover, I strongly encourage you to follow along with me by also trying out the code. The full source code for the end-to-end pipeline I build in this book is on GitHub. Create a Google Cloud Platform project and after reading each chapter, try to repeat what I did by referring to the code and to the Readme file in each folder of the GitHub repository.
Publisher : O’Reilly Media; 1st edition (February 6, 2018)
Language : English
Paperback : 402 pages
ISBN-10 : 1491974567
ISBN-13 : 978-1491974568
Item Weight : 1.44 pounds
Dimensions : 7.25 x 1 x 9.25 inches
hroark –
A very good book on data science, that also covers Google Cloud Platform
I knew this book was for me just a few pages into the first chapter. This book by Lak is unlike many other books on data science and particular technologies that just enumerate the how-to’s of the technology. Lak starts with a concrete user problem strongly anchored in probabilistic outcomes, and then steps through a typical data science process of discovery, refinement, and then converting to a production pipeline. While teaching about GCP technologies along the way, the book stays strongly anchored in the original user problem. There is not a corner of GCP needed for a full production data science product that goes untouched in this book. The material is well covered, with pointers to deeper material and user manuals. I received the first edition. As GCP technology evolved, Lak posted updates to his blog on Medium so that everyone could understand the changes to GCP and how to use them. I was pleasantly surprised by these updates, which made having the book that much more valuable.
Douglas –
Data analysis and engineering is democratized for all
Wow. A true tour of data science and engineering on the cloud. It’s been a few years since I’ve worked with tools in this field, but this book was a clear, level-headed view for data engineers looking to derive and drive insights from data. Using a core example use case and following it end to end through the entire book (and indeed cloud tools integrated with each other) helped me keep track of what was going on, and kept things from becoming a book on theory rather than one of accomplishment and answers. The purpose and process for each tool was clear, and I also appreciated the explanations of trade-offs and the value added for the choices made. The practice of data science is a LOT easier now with cloud/serverless tools than eight or nine years ago, and I feel this brought me back to the state of the art.
James White –
If all you care about is the answer and not why, buy another book
While Lak's conversational style can be a turn-off to some who just want an answer and don't care about how, I liked this book. Many times with books like this you get an answer or a recipe and you're done. What happens when your answer or recipe isn't right for the situation? I'm glad Lak explains his rationale and lets it be known that there's more than one way to do it. Could the book have been condensed without the explanations? Yes. Would it have been like almost every other book in the space? Yes. Check out this book if you want a well-thought-out answer and maybe alternatives. If you just want the "right answer", then buy something else.
David L –
Needs updating
I do not understand the high reviews for this book, especially ones written in 2020. I’m only into chapter 2 and the code to download the files fails. There is a supplement on the github page that allowed me to copy the bucket. But, the explanation, like many things is vague and not accurate (you don’t provide the path to your bucket, but just the name of the bucket). I assumed this book was an introduction to using the Google Cloud Platform for data science. So I am expecting an introduction. This book has detail where it doesn’t need it, and lacks detail where it does. It just assumes you have already been using GCP, but if that were the case this book isn’t really needed then.
Major Problems:
1. Code is not working.
2. Code is not explained in any detail.
3. Vague details about how to navigate GCP (chapter one has you create a bucket, but doesn’t explain what a bucket is, and how to create it, yet there are three pages about the definition of a data engineer).
4. Inconsistent assumptions about your background knowledge.
Good parts:
1. The use of a case study for learning.
Burak –
Great resource for data scientists beginning on GCP
The book is easy to follow with detailed descriptions of each step followed to build a project from start to end on the Google Cloud Platform. The book is also accompanied by a code repository which lets the readers try out the project themselves. Strongly recommended for data scientists learning to use the platform.
L. Fischer –
Narrative structure, working code
Narrative structure in a technical book is hard to find, and this was executed masterfully, with lots of code examples for you to follow along with on your own. Highly recommended.
Volney –
Book covers exactly what the title says.
Excellent book for learning which GCP services can be used for which portions of data analytics pipelines, from data acquisition all the way to model revalidation.
Jacob Moore –
Not a reference book, if you don’t work through the first 99 pages you won’t understand the 100th
This product is more akin to a course than a reference book. I tried flipping over to the chapter on Cloud SQL (actually, the author only goes into BigQuery, so I ended up scrolling through Stack Overflow anyway). When I finally found the relevant chapter, it was impossible to disentangle the SQL code from the class objects built in the preceding 6 chapters. Do not buy this book if you have any intention other than reading every single page in order. Otherwise, you’ll end up doing what I did, which was reading Stack Overflow and Medium articles to mixed effect.
KMoreno8 –
Anyone who works in the data field will potentially use Google Cloud. This book gives you a good foundation for that.
Amazon Customer –
Great
Chandra Shekhar Singh –
It sets a very clear direction for aspiring data engineers and scientists, and makes clear what is expected of them.
å®æå¥½ã –
A good product. It is in English, but it is written so that you can learn all the basics. As someone who wants to do data science on Google Cloud, I found it a good introductory book.
Wolfgang Giersche –
Very knowledgeable author. Balanced and at times beautiful reasoning, presented in an understandable way. Definitely a must-read for Google Cloud practitioners.