Skip to the content.
GCP Associate Cloud Engineer (ACE)
Big Data and Machine Learning
Google Cloud Big Data Services
- They are serverless services, managed buy Google
- Services:
- Cloud Dataproc (managed Hadoop):
- Is an open-source framework for big data
- It is based on the map-reduce programming model
- Dataproc is a fast and easy managed way to run Hadoop, Spark, Hive and Pig on GCP
- A cluster can be created in 90 seconds or less
- We can scale up and down jobs on the fly
- We pay only for hardware resources used during the life of the cluster (billing is done down to second)
- We can save money be using preemptible compute instances (up to 80% cheaper)
- Cloud Dataflow:
- It is an unified programming model and managed service
- It can be used for ETL, batch processing, stream processing
- We used Dataflow to create data pipelines used for batch processing a streaming purposes
- Orchestration: we can create pipelines that coordinate services, including external services
- Dataflow integrates with Cloud Storage, Pub/Sub, BigQuery and BigTable
- BigQuery:
- Ad-hoc SQL queries on massive datasets
- Provides near real-time interactive analysis of massive datasets
- It is a fully managed, low cost, petabyte scale data warehouse
- No cluster maintenance is required
- It lets us specify the region where the data will be kept, compute and storage is separated with terabit network in-between
- We only pay for storage and processing used
- Provides automatic discount for long-term data storage (after 90 days)
- Cloud Pub/Subs:
- It is meant to serve as a simple, reliable, scalable foundation for stream analytics
- Pub: publisher, sub: subscriber
- Receiving messages does have to synchronous
- It is great for decoupling systems
- It is designed to provide at least once delivery at low latency (some messages might be deliver more than once)
- It is recommended to be used of systems where data arrives at high and unpredictable rates (example IOT)
- We can configure subscribers to receive messages on a push or pull basis
- Cloud Datalab:
- Built on project Jupyter, it provides managed lab notebooks
- It lets us create web based notebooks containing Python code
- We only pay for the resources used for the compute
- Data can be visualized with Google Charts of Matplotlib
- Google Machine learning solutions as a managed service
- Tensorflow:
- Open source tool to build and run neural network models
- It has wide platform support
- We can run Tensorflow on GCP ML Platform
- Tensorflow can take access of Tensor processing units provided by GCP
- Google Cloud Machine Learning Engine:
- Fully managed machine learning service
- Optimized ofr Google infrastructure, integrates with BigQuery and Cloud Storage
- Cloud Vision API:
- Managed platform used to understand the content of an image
- Can quickly classify images
- Can provide sentiment analysis and text extraction
- Cloud Speech API:
- Convert audio to text
- We can transcribe audio files to text
- Cloud Natural Language API:
- Uses machine learning models to reveal structure and meaning from text
- Can extract information about items mentioned in text documents, news articles and blog posts
- Cloud Translation API:
- Translate arbitrary text into another language
- Cloud Video Intelligence API:
- Can be used to annotate contents of videos
- Can detect scene changes
- Can be used to flag inappropriate content
- Supports a variety of video formats