When the smaller table is significantly smaller than the larger table
When both tables are very small
When both tables are moderately sized or large
Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)
Shuffle writes
Shuffle reads
Incompatible data formats
Data Skew
What is a stage boundary?
When all of the slots or available units of processing have to sync with one another
A narrow transformation
An action caused by a SQL query predicate
Any transition between Spark tasks
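A stage boundary occurs at a shuffle: every task in the current stage must finish and write its shuffle output before the next stage can begin. A minimal PySpark sketch of a wide (shuffle-inducing) transformation, with illustrative names:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(1_000_000)                        # narrow: no shuffle
    counts = (df
              .withColumn("bucket", F.col("id") % 10)  # narrow transformation
              .groupBy("bucket")                       # wide transformation -> shuffle
              .count())                                # stage boundary sits at that shuffle
    counts.collect()                                   # action: runs both stages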
What happens when Spark code is executed in local mode?
A cluster of virtual machines is used rather than physical machines
The code is executed against a local cluster
The executor and driver are on the same machine
The code is executed in the cloud
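In local mode the driver and executor share a single machine, which is what makes it handy for development and testing. A minimal sketch of starting such a session; the app name is arbitrary:

    from pyspark.sql import SparkSession

    # "local[*]" runs Spark in local mode with one worker thread per core;
    # the driver and the executor live on the same machine, in the same process.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-mode-example")
             .getOrCreate())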
Answers 02

Question   Answer
1          i, ii, iii, iv
2          iv
3          ii, iv
4          iv
5          i, ii
6          iii
7          iv
8          i, ii, iv
9          i
10         iii
Quiz 03
Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)
It results in copies of the data in case of a data center outage
Resources are isolated and therefore more manageable and debuggable
It allows for elastic resources so larger storage or compute resources are used only when needed
It makes updates to new software versions easier
You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?
CPU: computation is more demanding than the data transfer
CPU: the transfer of data is more demanding than the computation
IO: the transfer of data is more demanding than the computation
IO: computation is more demanding than the data transfer
Processing virtual shopping cart orders in real time is an example of…
Online Analytical Processing (OLAP)
Online Transaction Processing (OLTP)
When are BLOB stores an appropriate place to store data? (Select all that apply.)
For online transaction processing on a website
For cheap storage
For storing large files
For a “data lake” of largely unstructured data
JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?
Specify the number of partitions using COALESCE. Spark then creates one parallel connection for each partition.
Specify the number of partitions using REPARTITION. Spark then creates one parallel connection for each partition.
Specify a column, number of partitions, and the column’s minimum and maximum values. Spark then divides that range of values between parallel connections.
Specify the numPartitions configuration setting. Spark then creates one parallel connection for each partition.
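The partitioned JDBC read works by giving Spark a numeric column, its minimum and maximum values, and a partition count; Spark then splits that value range across parallel connections. A minimal sketch with placeholder URL, credentials, table, and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")  # placeholder URL
          .option("dbtable", "orders")                            # placeholder table
          .option("user", "reader")                               # placeholder credentials
          .option("password", "secret")
          .option("partitionColumn", "order_id")  # numeric column to split on
          .option("lowerBound", "1")              # column's minimum value
          .option("upperBound", "1000000")        # column's maximum value
          .option("numPartitions", "8")           # 8 parallel JDBC connections
          .load())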
What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)
Compression
Parallelism
Corruptible
Columnar
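Parquet is columnar, compressed by default, and written as multiple part files that Spark can read and write in parallel. A minimal sketch contrasting the two formats; the paths and toy DataFrame are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "order_id")  # toy DataFrame

    # Parquet: columnar layout, snappy compression by default, schema stored in the files
    df.write.mode("overwrite").parquet("/tmp/orders_parquet")

    # CSV: row-oriented plain text; schema and compression must be handled separately
    df.write.mode("overwrite").option("header", "true").csv("/tmp/orders_csv")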
SQL is normally used to query tabular (or “structured”) data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)
It allows for data change over time
It allows for easy joins between relational JSON tables
It allows for complex data types
It does not need a formal structure
It allows for missing data
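One way to see why semi-structured data suits big data work: Spark infers a schema from JSON that tolerates nested types, arrays, and missing fields. A minimal sketch with made-up records:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Two records with different shapes: a nested struct, an array, and a missing field
    json_lines = [
        '{"user": {"id": 1, "name": "Ada"}, "tags": ["new"], "age": 36}',
        '{"user": {"id": 2, "name": "Grace"}, "tags": ["vip", "returning"]}',
    ]
    df = spark.read.json(spark.sparkContext.parallelize(json_lines))
    df.printSchema()  # nested struct, array, and a nullable column for the missing field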
Data writes in Spark can happen in serial or in parallel. What controls this parallelism?
The number of stages in a write operation
The number of data partitions in a DataFrame
The numPartitions setting in the Spark configuration
The number of jobs in a write operation
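A write runs one task per DataFrame partition, so changing the partition count changes how many files are written in parallel. A minimal sketch; the paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    df.repartition(8).write.mode("overwrite").parquet("/tmp/eight_parts")  # 8 tasks write 8 part files
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/one_part")        # 1 task writes serially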
Fill in the blanks with the appropriate response below:
A ___ table manages ___ and a DROP TABLE command will result in data loss.
Managed, both the data and metadata such as the schema and data location
Unmanaged, only the metadata such as the schema and data location
Unmanaged, both the data and metadata such as the schema and data location
Managed, only the metadata such as the schema and data location
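As background: a managed table lets Spark own both the metadata and the underlying files, so dropping it deletes the data, while an unmanaged (external) table only registers metadata over a path you supply. A minimal sketch; the table names and path are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)

    # Managed table: Spark controls both metadata and data files
    df.write.saveAsTable("managed_numbers")

    # Unmanaged (external) table: Spark only tracks metadata for files at the given path
    df.write.option("path", "/tmp/numbers").saveAsTable("unmanaged_numbers")

    spark.sql("DROP TABLE managed_numbers")    # data files are deleted
    spark.sql("DROP TABLE unmanaged_numbers")  # metadata dropped; /tmp/numbers remains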
Answers 03

Question   Answer
1          ii, iii, iv
2          iii
3          ii
4          ii, iii, iv
5          iii
6          i, ii, iv
7          i, ii, v
8          ii
9          i
Quiz 04
Machine learning is suited to solve which of the following tasks? (Select all that apply.)
Image Recognition
A/B Testing
Fraud Detection
Churn Analysis
Natural Language Processing
Reporting
Financial Forecasting
Is a model that is 99% accurate at predicting breast cancer a good model?
Likely yes because it accounts for false negatives and we’d want to make sure we catch every case of cancer
Likely yes because this is generally a high score
Likely no because there are too many false positives
Likely no because there are not many cases of cancer in a general population
What is an appropriate baseline model to compare a machine learning solution to?
The average of the dataset
The minimum value of the dataset
Zero
What is Machine Learning? (Select all that apply.)
Statistical moments calculated against a dataset
Learning patterns in your data without being explicitly programmed
Hand-coded logic
A function that maps features to an output
(Fill in the blanks with the appropriate answer below.)
Predicting whether a website user is fraudulent or not is an example of ___ machine learning. It is a ____ task.
unsupervised, classification
supervised, classification
supervised, regression
unsupervised, regression
(Fill in the blanks with the appropriate answer below.)
Grouping similar users together based on past activity is an example of ___ machine learning. It is a _______ task.
unsupervised, clustering
unsupervised, classification
supervised, clustering
supervised, classification
Predicting a company’s earnings for the next quarter is an example of…
Reinforcement
Clustering
Classification
Semi-supervised
Regression
Why do we want to perform a train/test split before we train a machine learning model? (Select all that apply.)
To calculate a baseline model
To evaluate how our model performs on unseen data
To give us subsets of our data so we can compare a model trained on one versus the model trained on the other
To keep the model from “overfitting” where it memorizes the data it has seen
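In Spark the split is usually done with randomSplit, holding out rows the model never sees during training. A minimal sketch; the 80/20 ratio and seed are arbitrary, and the DataFrame stands in for real features and labels:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10_000)  # stand-in for a feature/label DataFrame

    # ~80% for training, ~20% held out for evaluation on unseen data
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)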
What is a linear regression model learning about your data?
The best split points in a decision tree
The average of the data
The value of the closest points to the one you’re trying to predict
The formula for the line of best fit
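Linear regression learns the coefficients of the line of best fit, i.e. the slope(s) and intercept that minimize the error of y = b0 + b1*x. A minimal Spark ML sketch with made-up data and column names:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])

    # Assemble the feature vector, then fit; the model holds the learned line of best fit
    features = VectorAssembler(inputCols=["x"], outputCol="features").transform(data)
    model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
    print(model.coefficients, model.intercept)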
How do you define a custom function not already part of core Spark?