2023 Updated Verified Pass Professional-Data-Engineer Exam - Real Questions & Answers [Q162-Q177]

Dumps Money-Back Guarantee - Professional-Data-Engineer Dumps Approved Dumps

NEW QUESTION # 162
Which of these statements about BigQuery caching is true?

  • A. There is no charge for a query that retrieves its results from cache.
  • B. BigQuery caches query results for 48 hours.
  • C. Query results are cached even if you specify a destination table.
  • D. By default, a query's results are not cached.

Answer: A

Explanation:
When query results are retrieved from a cached results table, you are not charged for the query.
BigQuery caches query results for 24 hours, not 48 hours.
Query results are not cached if you specify a destination table.
A query's results are always cached except under certain conditions, such as if you specify a destination table.
Reference: https://cloud.google.com/bigquery/querying-data#query-caching
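
As a rough illustration of the rules in this explanation, here is a toy Python model. It is not the BigQuery API: the function name, its parameters, and the sample timestamps are made up for this sketch, and it only encodes the 24-hour lifetime and the destination-table exception mentioned above.

```python
from datetime import datetime, timedelta

# Assumption: results stay cached for about 24 hours, as stated in the explanation.
CACHE_TTL = timedelta(hours=24)

def served_from_cache(cached_at, now, has_destination_table=False):
    """Toy rule check: True if a repeated query would be answered from cache."""
    if has_destination_table:
        # Results written to a destination table are not cached.
        return False
    if cached_at is None:
        # The query has never been run, so nothing is cached yet.
        return False
    return now - cached_at < CACHE_TTL

first_run = datetime(2023, 1, 1, 9, 0)
print(served_from_cache(first_run, first_run + timedelta(hours=2)))   # True
print(served_from_cache(first_run, first_run + timedelta(hours=30)))  # False
print(served_from_cache(first_run, first_run + timedelta(hours=2),
                        has_destination_table=True))                  # False
```

Only the first call would be a free cache hit; the other two would run (and be billed) as normal queries.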


NEW QUESTION # 163
Government regulations in the banking industry mandate the protection of clients' personally identifiable information (PII). Your company requires PII to be access controlled, encrypted, and compliant with major data protection standards. In addition to using Cloud Data Loss Prevention (Cloud DLP), you want to follow Google-recommended practices and use service accounts to control access to PII. What should you do?

  • A. Use Cloud Storage to comply with major data protection standards. Use multiple service accounts attached to IAM groups to grant the appropriate access to each group.
  • B. Use one service account to access a Cloud SQL database and use separate service accounts for each human user.
  • C. Use Cloud Storage to comply with major data protection standards. Use one service account shared by all users.
  • D. Assign the required Identity and Access Management (IAM) roles to every employee, and create a single service account to access protected resources.

Answer: A


NEW QUESTION # 164
Why do you need to split a machine learning dataset into training data and test data?

  • A. So you can try two different sets of features
  • B. To allow you to create unit tests in your code
  • C. So you can use one dataset for a wide model and one for a deep model
  • D. To make sure your model is generalized for more than just the training data

Answer: D

Explanation:
The flaw with evaluating a predictive model on training data is that it does not inform you on how well the model has generalized to new unseen data. A model that is selected for its accuracy on the training dataset rather than its accuracy on an unseen test dataset is very likely to have lower accuracy on an unseen test dataset. The reason is that the model is not as generalized. It has specialized to the structure in the training dataset. This is called overfitting.
Reference: https://machinelearningmastery.com/a-simple-intuition-for-overfitting/
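
The effect can be seen in a small, self-contained Python sketch. The data set, the 20% label noise, and the 1-nearest-neighbour "model" are all invented for illustration; the point is only that a model which memorizes its training data scores perfectly there and usually scores lower on held-out data.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical data: the true rule is label = 1 when x >= 100,
# but 20% of the labels are flipped to simulate noise.
points = list(range(200))
random.shuffle(points)
data = [(x, int(x >= 100) ^ int(random.random() < 0.2)) for x in points]

train, test = data[:150], data[150:]  # hold out 25% as unseen test data

def predict(x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest_x, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label

def accuracy(dataset):
    """Fraction of examples whose predicted label matches the stored label."""
    return sum(predict(x) == label for x, label in dataset) / len(dataset)

train_accuracy = accuracy(train)
test_accuracy = accuracy(test)
print(f"train accuracy: {train_accuracy:.2f}")
print(f"test accuracy:  {test_accuracy:.2f}")
```

Because every training point is its own nearest neighbour, the training accuracy is always 1.00 here, noise included. Only the test split shows how well the model handles data it has not seen.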


NEW QUESTION # 165
You create an important report for your large team in Google Data Studio 360. The report uses Google
BigQuery as its data source. You notice that visualizations are not showing data that is less than 1 hour
old. What should you do?

  • A. Disable caching by editing the report settings.
  • B. Clear your browser history for the past hour then reload the tab showing the visualizations.
  • C. Refresh your browser tab showing the visualizations.
  • D. Disable caching in BigQuery by editing table details.

Answer: A

Explanation:
Reference: https://support.google.com/datastudio/answer/7020039?hl=en


NEW QUESTION # 166
Your organization has been collecting and analyzing data in Google BigQuery for 6 months. The majority of the data analyzed is placed in a time-partitioned table named events_partitioned. To reduce the cost of queries, your organization created a view called events, which queries only the last 14 days of data. The view is described in legacy SQL. Next month, existing applications will be connecting to BigQuery to read the events data via an ODBC connection. You need to ensure the applications can connect. Which two actions should you take? (Choose two.)

  • A. Create a service account for the ODBC connection to use for authentication
  • B. Create a Google Cloud Identity and Access Management (Cloud IAM) role for the ODBC connection and shared "events"
  • C. Create a new view over events_partitioned using standard SQL
  • D. Create a new partitioned table using a standard SQL query
  • E. Create a new view over events using standard SQL

Answer: A,C


NEW QUESTION # 167
Your company is running their first dynamic campaign, serving different offers by analyzing real-time data
during the holiday season. The data scientists are collecting terabytes of data that rapidly grows every
hour during their 30-day campaign. They are using Google Cloud Dataflow to preprocess the data and
collect the feature (signals) data that is needed for the machine learning model in Google Cloud Bigtable.
The team is observing suboptimal performance with reads and writes of their initial load of 10 TB of data.
They want to improve this performance while minimizing cost. What should they do?

  • A. Redesign the schema to use row keys based on numeric IDs that increase sequentially per user
    viewing the offers.
  • B. Redesign the schema to use a single row key to identify values that need to be updated frequently in
    the cluster.
  • C. The performance issue should be resolved over time as the size of the Bigtable cluster is increased.
  • D. Redefine the schema by evenly distributing reads and writes across the row space of the table.

Answer: D
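
To see why sequential row keys concentrate load, consider this toy Python sketch. It is not Bigtable code: the "bucket" function is a stand-in for a contiguous key range, and reversing the ID is just one simple way (assumed here for illustration) to spread new keys across the key space.

```python
from collections import Counter

def bucket(row_key):
    """Toy stand-in for a contiguous key range: group keys by first character."""
    return row_key[0]

# Hypothetical burst of 1,000 new writes with sequentially increasing numeric IDs.
new_ids = range(9000, 10000)

sequential_keys = [f"{i:04d}" for i in new_ids]
reversed_keys = [f"{i:04d}"[::-1] for i in new_ids]  # e.g. "9001" -> "1009"

sequential_load = Counter(bucket(k) for k in sequential_keys)
reversed_load = Counter(bucket(k) for k in reversed_keys)

print("sequential:", dict(sequential_load))  # all 1,000 writes hit bucket '9'
print("reversed:  ", dict(sorted(reversed_load.items())))  # 100 writes per bucket
```

With sequential keys, every new write lands in the same range, so one node does all the work. Keys that vary in their leading characters spread the same writes evenly.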


NEW QUESTION # 168
You are working on a sensitive project involving private user data. You have set up a project on Google Cloud Platform to house your work internally. An external consultant is going to assist with coding a complex transformation in a Google Cloud Dataflow pipeline for your project. How should you maintain users' privacy?

  • A. Create a service account and allow the consultant to log on with it.
  • B. Grant the consultant the Cloud Dataflow Developer role on the project.
  • C. Create an anonymized sample of the data for the consultant to work with in a different project.
  • D. Grant the consultant the Viewer role on the project.

Answer: C


NEW QUESTION # 169
Your company is using wildcard tables to query data across multiple tables with similar names. The SQL statement is currently failing with the following error:
# Syntax error : Expected end of statement but got "-" at [4:11]
SELECT age
FROM
bigquery-public-data.noaa_gsod.gsod
WHERE
age != 99
AND _TABLE_SUFFIX = '1929'
ORDER BY
age DESC
Which table name will make the SQL statement work correctly?

  • A. 'bigquery-public-data.noaa_gsod.gsod'*
  • B. 'bigquery-public-data.noaa_gsod.gsod'
  • C. `bigquery-public-data.noaa_gsod.gsod*`
  • D. bigquery-public-data.noaa_gsod.gsod*

Answer: C
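
In BigQuery standard SQL, a wildcard table name has to be enclosed in backticks so that the `*` is read as part of the table name. The Python sketch below only builds the query string and does not call BigQuery; `quote_table` is a hypothetical helper written for this example.

```python
def quote_table(table_id):
    """Wrap a table ID in backticks so the * wildcard is parsed as part of the name."""
    return f"`{table_id}`"

table = quote_table("bigquery-public-data.noaa_gsod.gsod*")

# Same statement as in the question, with the table name quoted.
query = f"""
SELECT age
FROM {table}
WHERE age != 99
  AND _TABLE_SUFFIX = '1929'
ORDER BY age DESC
"""
print(query)
```

The `_TABLE_SUFFIX` filter then restricts the scan to the table whose name ends in `1929`.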


NEW QUESTION # 170
By default, which of the following windowing behaviors does Dataflow apply to unbounded data sets?

  • A. Windows at every 1 minute
  • B. Windows at every 10 minutes
  • C. Windows at every 100 MB of data
  • D. Single, Global Window

Answer: D

Explanation:
Dataflow's default windowing behavior is to assign all elements of a PCollection to a single, global window, even for unbounded PCollections.
Reference: https://cloud.google.com/dataflow/model/pcollection
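
The difference between the default global window and an explicit fixed window can be sketched in plain Python. This is a toy model, not the Dataflow or Apache Beam API; the function names, the 60-second window size, and the event timestamps are assumptions made for this example.

```python
def global_window(timestamp):
    """Default behaviour: every element lands in the same single window."""
    return "GlobalWindow"

def fixed_window(timestamp, size_seconds=60):
    """Alternative: the window [start, start + size) that contains the timestamp."""
    start = timestamp - timestamp % size_seconds
    return (start, start + size_seconds)

# Hypothetical event times, in seconds.
event_times = [5, 42, 61, 119, 120, 305]

print({global_window(t) for t in event_times})
# {'GlobalWindow'}
print(sorted({fixed_window(t) for t in event_times}))
# [(0, 60), (60, 120), (120, 180), (300, 360)]
```

With the global window, a grouping step over unbounded data would never see a "complete" window, which is why streaming pipelines usually set a non-global window or a trigger.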


NEW QUESTION # 171
For the best possible performance, what is the recommended zone for your Compute Engine instance and Cloud Bigtable instance?

  • A. Place the Compute Engine instance and the Cloud Bigtable instance in different zones.
  • B. Place the Cloud Bigtable instance in the same zone as all of the consumers of your data.
  • C. Place the Compute Engine instance in the zone furthest from the Cloud Bigtable instance.
  • D. Place both the Compute Engine instance and the Cloud Bigtable instance in the same zone.

Answer: D

Explanation:
It is recommended to create your Compute Engine instance in the same zone as your Cloud Bigtable instance for the best possible performance. If it's not possible to create an instance in the same zone, you should create your instance in another zone within the same region. For example, if your Cloud Bigtable instance is located in us-central1-b, you could create your instance in us-central1-f. This change may result in several milliseconds of additional latency for each Cloud Bigtable request.
It is recommended to avoid creating your Compute Engine instance in a different region from your Cloud Bigtable instance, which can add hundreds of milliseconds of latency to each Cloud Bigtable request.
Reference: https://cloud.google.com/bigtable/docs/creating-compute-instance


NEW QUESTION # 172
You need to choose a database to store time series CPU and memory usage for millions of computers.
You need to store this data in one-second interval samples. Analysts will be performing real-time, ad hoc analytics against the database. You want to avoid being charged for every query executed and ensure that the schema design will allow for future growth of the dataset. Which database and data model should you choose?

  • A. Create a narrow table in Cloud Bigtable with a row key that combines the Compute Engine computer identifier with the sample time at each second
  • B. Create a table in BigQuery, and append the new samples for CPU and memory to the table
  • C. Create a wide table in Cloud Bigtable with a row key that combines the computer identifier with the sample time at each minute, and combine the values for each second as column data.
  • D. Create a wide table in BigQuery, create a column for the sample value at each second, and update the row with the interval for each second

Answer: A

Explanation:
https://cloud.google.com/bigtable/docs/schema-design-time-series
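
A row key that combines the computer identifier with the sample time might be built as in the Python sketch below. The `vm-…` identifiers, the `#` separator, and the timestamp format are assumptions for illustration, not a prescribed Bigtable format.

```python
from datetime import datetime, timezone

def row_key(machine_id, sample_time):
    """Build a key like 'vm-0042#20230101T000001' (one row per machine per second)."""
    return f"{machine_id}#{sample_time.strftime('%Y%m%dT%H%M%S')}"

# Hypothetical samples, deliberately out of order.
samples = [
    ("vm-0042", datetime(2023, 1, 1, 0, 0, 1, tzinfo=timezone.utc)),
    ("vm-0007", datetime(2023, 1, 1, 0, 0, 0, tzinfo=timezone.utc)),
    ("vm-0042", datetime(2023, 1, 1, 0, 0, 0, tzinfo=timezone.utc)),
]

keys = sorted(row_key(machine, moment) for machine, moment in samples)
for key in keys:
    print(key)
# vm-0007#20230101T000000
# vm-0042#20230101T000000
# vm-0042#20230101T000001
```

Because rows are stored in key order, leading with the machine identifier keeps each machine's samples contiguous and in time order, so a time-range scan for one machine reads a single block of rows.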


NEW QUESTION # 173
MJTelco is building a custom interface to share data. They have these requirements:
* They need to do aggregations over their petabyte-scale datasets.
* They need to scan specific time range rows with a very fast response time (milliseconds).
Which combination of Google Cloud Platform products should you recommend?

  • A. BigQuery and Cloud Bigtable
  • B. BigQuery and Cloud Storage
  • C. Cloud Bigtable and Cloud SQL
  • D. Cloud Datastore and Cloud Bigtable

Answer: A


NEW QUESTION # 174
MJTelco Case Study
Company Overview
MJTelco is a startup that plans to build networks in rapidly growing, underserved markets around the world.
The company has patents for innovative optical communications hardware. Based on these patents, they can create many reliable, high-speed backbone links with inexpensive hardware.
Company Background
Founded by experienced telecom executives, MJTelco uses technologies originally developed to overcome communications challenges in space. Fundamental to their operation, they need to create a distributed data infrastructure that drives real-time analysis and incorporates machine learning to continuously optimize their topologies. Because their hardware is inexpensive, they plan to overdeploy the network allowing them to account for the impact of dynamic regional politics on location availability and cost.
Their management and operations teams are situated all around the globe, creating many-to-many relationships between data consumers and providers in their system. After careful consideration, they decided public cloud is the perfect environment to support their needs.
Solution Concept
MJTelco is running a successful proof-of-concept (PoC) project in its labs. They have two primary needs:
* Scale and harden their PoC to support significantly more data flows generated when they ramp to more than 50,000 installations.
* Refine their machine-learning cycles to verify and improve the dynamic models they use to control topology definition.
MJTelco will also use three separate operating environments - development/test, staging, and production - to meet the needs of running experiments, deploying new features, and serving production customers.
Business Requirements
* Scale up their production environment with minimal cost, instantiating resources when and where needed in an unpredictable, distributed telecom user community.
* Ensure security of their proprietary data to protect their leading-edge machine learning and analysis.
* Provide reliable and timely access to data for analysis from distributed research workers
* Maintain isolated environments that support rapid iteration of their machine-learning models without affecting their customers.
Technical Requirements
* Ensure secure and efficient transport and storage of telemetry data.
* Rapidly scale instances to support between 10,000 and 100,000 data providers with multiple flows each.
* Allow analysis and presentation against data tables tracking up to 2 years of data storing approximately 100m records/day.
* Support rapid iteration of monitoring infrastructure focused on awareness of data pipeline problems both in telemetry flows and in production learning cycles.
CEO Statement
Our business model relies on our patents, analytics and dynamic machine learning. Our inexpensive hardware is organized to be highly reliable, which gives us cost advantages. We need to quickly stabilize our large distributed data pipelines to meet our reliability and capacity commitments.
CTO Statement
Our public cloud services must operate as advertised. We need resources that scale and keep our data secure.
We also need environments in which our data scientists can carefully study and quickly adapt our models.
Because we rely on automation to process our data, we also need our development and test environments to work as we iterate.
CFO Statement
The project is too large for us to maintain the hardware and software required for the data and analysis. Also, we cannot afford to staff an operations team to monitor so many data feeds, so we will rely on automation and infrastructure. Google Cloud's machine learning will allow our quantitative researchers to work on our high-value problems instead of problems with our data pipelines.
You need to compose visualizations for operations teams with the following requirements:
* The report must include telemetry data from all 50,000 installations for the most recent 6 weeks (sampling once every minute).
* The report must not be more than 3 hours delayed from live data.
* The actionable report should only show suboptimal links.
* Most suboptimal links should be sorted to the top.
* Suboptimal links can be grouped and filtered by regional geography.
* User response time to load the report must be <5 seconds.
Which approach meets the requirements?

  • A. Load the data into Google Sheets, use formulas to calculate a metric, and use filters/sorting to show only suboptimal links in a table.
  • B. Load the data into Google BigQuery tables, write Google Apps Script that queries the data, calculates the metric, and shows only suboptimal rows in a table in Google Sheets.
  • C. Load the data into Google BigQuery tables, write a Google Data Studio 360 report that connects to your data, calculates a metric, and then uses a filter expression to show only suboptimal rows in a table.
  • D. Load the data into Google Cloud Datastore tables, write a Google App Engine Application that queries all rows, applies a function to derive the metric, and then renders results in a table using the Google charts and visualization API.

Answer: C
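
A quick back-of-the-envelope calculation, using only the figures in the requirements, shows how much data the report has to cover:

```python
installations = 50_000
samples_per_hour = 60        # one sample every minute
hours = 6 * 7 * 24           # six weeks

rows = installations * samples_per_hour * hours
print(f"{rows:,} rows")      # 3,024,000,000 rows
```

At roughly three billion rows, the data has to be stored, filtered, and aggregated in a system built for that scale. Pulling every row into the presentation layer would not meet the 5-second load requirement.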


NEW QUESTION # 175
You need to create a data pipeline that copies time-series transaction data so that it can be queried from within BigQuery by your data science team for analysis. Every hour, thousands of transactions are updated with a new status. The size of the initial dataset is 1.5 PB, and it will grow by 3 TB per day. The data is heavily structured, and your data science team will build machine learning models based on this data. You want to maximize performance and usability for your data science team. Which two strategies should you adopt?
Choose 2 answers.

  • A. Preserve the structure of the data as much as possible.
  • B. Use BigQuery UPDATE to further reduce the size of the dataset.
  • C. Copy a daily snapshot of transaction data to Cloud Storage and store it as an Avro file. Use BigQuery's support for external data sources to query.
  • D. Develop a data pipeline where status updates are appended to BigQuery instead of updated.
  • E. Denormalize the data as much as possible.

Answer: C,D
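
The append-instead-of-update pattern from option D can be sketched in plain Python. The field names and sample rows are hypothetical, and a list stands in for the table: each status change is added as a new row, and the current status is derived at read time.

```python
# Hypothetical append-only log: each status change is a new row.
status_log = [
    {"txn_id": "t1", "status": "PENDING", "updated_at": 1},
    {"txn_id": "t2", "status": "PENDING", "updated_at": 2},
    {"txn_id": "t1", "status": "SETTLED", "updated_at": 3},
]

def latest_status(log):
    """Return the most recent status per transaction without mutating any row."""
    latest = {}
    for row in sorted(log, key=lambda r: r["updated_at"]):
        latest[row["txn_id"]] = row["status"]  # later rows overwrite earlier ones
    return latest

# A new status arrives: append a row instead of updating the old one.
status_log.append({"txn_id": "t2", "status": "FAILED", "updated_at": 4})

print(latest_status(status_log))  # {'t1': 'SETTLED', 't2': 'FAILED'}
```

The full history stays available for analysis, and writes never have to locate and rewrite existing rows.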


NEW QUESTION # 176
Which is not a valid reason for poor Cloud Bigtable performance?

  • A. The table's schema is not designed correctly.
  • B. There are issues with the network connection.
  • C. The workload isn't appropriate for Cloud Bigtable.
  • D. The Cloud Bigtable cluster has too many nodes.

Answer: D

Explanation:
The Cloud Bigtable cluster doesn't have enough nodes. If your Cloud Bigtable cluster is overloaded, adding more nodes can improve performance. Use the monitoring tools to check whether the cluster is overloaded.


NEW QUESTION # 177
......


The Google Professional-Data-Engineer exam consists of multiple-choice and multiple-select questions, and candidates are given two hours to complete it. The exam covers a range of topics, including designing data processing systems, building and maintaining data structures and databases, data analysis, machine learning, and data visualization.


To prepare for the Google Professional-Data-Engineer exam, candidates can take advantage of various resources provided by Google. These include online courses, study guides, practice exams, and hands-on labs. Additionally, candidates can also take advantage of various third-party resources, such as books, videos, and online communities, to enhance their knowledge and skills.

 

Updated PDF (New 2023) Actual Google Professional-Data-Engineer Exam Questions: https://www.actualvce.com/Google/Professional-Data-Engineer-valid-vce-dumps.html

Verified Professional-Data-Engineer Exam Dumps PDF [2023] Access using ActualVCE: https://drive.google.com/open?id=1rGX5kCfI9T_tatEr_X_zkQfRWP_fnreu