When the smaller table is significantly smaller than the larger table
When both tables are very small
When both tables are moderately sized or large
Which of the following are bottlenecks you can detect with the Spark UI? (Select all that apply.)
Shuffle writes
Shuffle reads
Incompatible data formats
Data Skew
What is a stage boundary?
When all of the slots or available units of processing have to sync with one another
A narrow transformation
An action caused by a SQL query predicate
Any transition between Spark tasks
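A stage boundary occurs at a shuffle: every task in the current stage must finish and write its shuffle output before the next stage can begin. A minimal PySpark sketch of a wide (shuffle-inducing) transformation, with illustrative names:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()

    df = spark.range(1_000_000)                        # narrow: no shuffle
    counts = (df
              .withColumn("bucket", F.col("id") % 10)  # narrow transformation
              .groupBy("bucket")                       # wide transformation -> shuffle
              .count())                                # stage boundary sits at that shuffle
    counts.collect()                                   # action: runs both stages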
What happens when Spark code is executed in local mode?
A cluster of virtual machines is used rather than physical machines
The code is executed against a local cluster
The executor and driver are on the same machine
The code is executed in the cloud
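In local mode the driver and executor share a single machine, which is what makes it handy for development and testing. A minimal sketch of starting such a session; the app name is arbitrary:

    from pyspark.sql import SparkSession

    # "local[*]" runs Spark in local mode with one worker thread per core;
    # the driver and the executor live on the same machine, in the same process.
    spark = (SparkSession.builder
             .master("local[*]")
             .appName("local-mode-example")
             .getOrCreate())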
Answers 02

Question   Answer
1          i, ii, iii, iv
2          iv
3          ii, iv
4          iv
5          i, ii
6          iii
7          iv
8          i, ii, iv
9          i
10         iii
Quiz 03
Decoupling storage and compute means storing data in one location and processing it using a separate resource. What are the benefits of this design principle? (Select all that apply.)
It results in copies of the data in case of a data center outage
Resources are isolated and therefore more manageable and debuggable
It allows for elastic resources so larger storage or compute resources are used only when needed
It makes updates to new software versions easier
You want to run a report entailing summary statistics on a large dataset sitting in a database. What is the main resource limitation of this task?
CPU: computation is more demanding than the data transfer
CPU: the transfer of data is more demanding than the computation
IO: the transfer of data is more demanding than the computation
IO: computation is more demanding than the data transfer
Processing virtual shopping cart orders in real time is an example of…
Online Analytical Processing (OLAP)
Online Transaction Processing (OLTP)
When are BLOB stores an appropriate place to store data? (Select all that apply.)
For online transaction processing on a website
For cheap storage
For storing large files
For a “data lake” of largely unstructured data
JDBC is the standard protocol for interacting with databases in the Java environment. How do parallel connections work between Spark and a database using JDBC?
Specify the number of partitions using COALESCE. Spark then creates one parallel connection for each partition.
Specify the number of partitions using REPARTITION. Spark then creates one parallel connection for each partition.
Specify a column, number of partitions, and the column’s minimum and maximum values. Spark then divides that range of values between parallel connections.
Specify the numPartitions configuration setting. Spark then creates one parallel connection for each partition.
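The partitioned JDBC read works by giving Spark a numeric column, its minimum and maximum values, and a partition count; Spark then splits that value range across parallel connections. A minimal sketch with placeholder URL, credentials, table, and column names:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")  # placeholder URL
          .option("dbtable", "orders")                            # placeholder table
          .option("user", "reader")                               # placeholder credentials
          .option("password", "secret")
          .option("partitionColumn", "order_id")  # numeric column to split on
          .option("lowerBound", "1")              # column's minimum value
          .option("upperBound", "1000000")        # column's maximum value
          .option("numPartitions", "8")           # 8 parallel JDBC connections
          .load())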
What are some of the advantages of the file format Parquet over CSV? (Select all that apply.)
Compression
Parallelism
Corruptible
Columnar
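Parquet is columnar, compressed by default, and written as multiple part files that Spark can read and write in parallel. A minimal sketch contrasting the two formats; the paths and toy DataFrame are illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1000).withColumnRenamed("id", "order_id")  # toy DataFrame

    # Parquet: columnar layout, snappy compression by default, schema stored in the files
    df.write.mode("overwrite").parquet("/tmp/orders_parquet")

    # CSV: row-oriented plain text; schema and compression must be handled separately
    df.write.mode("overwrite").option("header", "true").csv("/tmp/orders_csv")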
SQL is normally used to query tabular (or “structured”) data. Semi-structured data like JSON is common in big data environments. Why? (Select all that apply.)
It allows for data change over time
It allows for easy joins between relational JSON tables
It allows for complex data types
It does not need a formal structure
It allows for missing data
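One way to see why semi-structured data suits big data work: Spark infers a schema from JSON that tolerates nested types, arrays, and missing fields. A minimal sketch with made-up records:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Two records with different shapes: a nested struct, an array, and a missing field
    json_lines = [
        '{"user": {"id": 1, "name": "Ada"}, "tags": ["new"], "age": 36}',
        '{"user": {"id": 2, "name": "Grace"}, "tags": ["vip", "returning"]}',
    ]
    df = spark.read.json(spark.sparkContext.parallelize(json_lines))
    df.printSchema()  # nested struct, array, and a nullable column for the missing field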
Data writes in Spark can happen in serial or in parallel. What controls this parallelism?
The number of stages in a write operation
The number of data partitions in a DataFrame
The numPartitions setting in the Spark configuration
The number of jobs in a write operation
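A write runs one task per DataFrame partition, so changing the partition count changes how many files are written in parallel. A minimal sketch; the paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000)

    df.repartition(8).write.mode("overwrite").parquet("/tmp/eight_parts")  # 8 tasks write 8 part files
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/one_part")        # 1 task writes serially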
Fill in the blanks with the appropriate response below:
A ___ table manages ___ and a DROP TABLE command will result in data loss.
Managed, both the data and metadata such as the schema and data location
Unmanaged, only the metadata such as the schema and data location
Unmanaged, both the data and metadata such as the schema and data location
Managed, only the metadata such as the schema and data location
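As background: a managed table lets Spark own both the metadata and the underlying files, so dropping it deletes the data, while an unmanaged (external) table only registers metadata over a path you supply. A minimal sketch; the table names and path are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100)

    # Managed table: Spark controls both metadata and data files
    df.write.saveAsTable("managed_numbers")

    # Unmanaged (external) table: Spark only tracks metadata for files at the given path
    df.write.option("path", "/tmp/numbers").saveAsTable("unmanaged_numbers")

    spark.sql("DROP TABLE managed_numbers")    # data files are deleted
    spark.sql("DROP TABLE unmanaged_numbers")  # metadata dropped; /tmp/numbers remains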
Answers 03

Question   Answer
1          ii, iii, iv
2          iii
3          ii
4          ii, iii, iv
5          iii
6          i, ii, iv
7          i, ii, v
8          ii
9          i
Quiz 04
Machine learning is suited to solve which of the following tasks? (Select all that apply.)
Image Recognition
A/B Testing
Fraud Detection
Churn Analysis
Natural Language Processing
Reporting
Financial Forecasting
Is a model that is 99% accurate at predicting breast cancer a good model?
Likely yes because it accounts for false negatives and we’d want to make sure we catch every case of cancer
Likely yes because this is generally a high score
Likely no because there are too many false positives
Likely no because there are not many cases of cancer in a general population
What is an appropriate baseline model to compare a machine learning solution to?
The average of the dataset
The minimum value of the dataset
Zero
What is Machine Learning? (Select all that apply.)
Statistical moments calculated against a dataset
Learning patterns in your data without being explicitly programmed
Hand-coded logic
A function that maps features to an output
(Fill in the blanks with the appropriate answer below.)
Predicting whether a website user is fraudulent or not is an example of ___ machine learning. It is a ____ task.
unsupervised, classification
supervised, classification
supervised, regression
unsupervised, regression
(Fill in the blanks with the appropriate answer below.)
Grouping similar users together based on past activity is an example of ___ machine learning. It is a _______ task.
unsupervised, clustering
unsupervised, classification
supervised, clustering
supervised, classification
Predicting a company’s earnings for the next quarter is an example of…
Reinforcement
Clustering
Classification
Semi-supervised
Regression
Why do we want to perform a train/test split before we train a machine learning model? (Select all that apply.)
To calculate a baseline model
To evaluate how our model performs on unseen data
To give us subsets of our data so we can compare a model trained on one versus the model trained on the other
To keep the model from “overfitting” where it memorizes the data it has seen
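In Spark the split is usually done with randomSplit, holding out rows the model never sees during training. A minimal sketch; the 80/20 ratio and seed are arbitrary, and the DataFrame stands in for real features and labels:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10_000)  # stand-in for a feature/label DataFrame

    # ~80% for training, ~20% held out for evaluation on unseen data
    train_df, test_df = df.randomSplit([0.8, 0.2], seed=42)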
What is a linear regression model learning about your data?
The best split points in a decision tree
The average of the data
The value of the closest points to the one you’re trying to predict
The formula for the line of best fit
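Linear regression learns the coefficients of the line of best fit, i.e. the slope(s) and intercept that minimize the error of y = b0 + b1*x. A minimal Spark ML sketch with made-up data and column names:

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    data = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)], ["x", "y"])

    # Assemble the feature vector, then fit; the model holds the learned line of best fit
    features = VectorAssembler(inputCols=["x"], outputCol="features").transform(data)
    model = LinearRegression(featuresCol="features", labelCol="y").fit(features)
    print(model.coefficients, model.intercept)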
How do you define a custom function not already part of core Spark?