Study Material for Interview
Table A (logs) — num values listed in id order:
1, 1, 2, 2, 2, 1, 3, 3, 3, 3, 4, 4, 5, 5, 5, 4

Numbers that appear consecutively at least 3 times:

select distinct num
from (
    select num,
           lag(num, 1) over (order by id) = num
           and lag(num, 2) over (order by id) = num as c3
    from logs
) x
where c3;
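
For reference, a minimal test setup the query above can be run against (table and column names are taken from the query; the sequential id values are an assumption):

-- Hypothetical test data for the consecutive-numbers query (PostgreSQL-style syntax)
create table logs (
    id  int primary key,
    num int
);

insert into logs (id, num) values
(1, 1), (2, 1), (3, 2), (4, 2), (5, 2), (6, 1),
(7, 3), (8, 3), (9, 3), (10, 3), (11, 4), (12, 4),
(13, 5), (14, 5), (15, 5), (16, 4);

-- Expected result: 2, 3, 5 (the values that appear at least three times in a row)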

Return each route only once, regardless of direction:

Source      Destination   Distance
Mumbai      Goa           500
Goa         Mumbai        500
Pune        Delhi         600
Delhi       Pune          600
Hyderabad   Bangalore     500
Bangalore   Hyderabad     500

-- 'routes' stands in for the table name; keep each city pair only once
select source, destination, distance
from routes
where source < destination
union all
select destination, source, distance
from routes d
where source > destination
  and not exists (
    select 1 from routes d2
    where d2.source = d.destination
      and d2.destination = d.source
  );
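
Where LEAST and GREATEST are available, an alternative (assuming, as in the sample data, that both directions store the same distance; the routes table name is the same assumption as above) is to normalize each pair and deduplicate:

-- Normalize each route so the alphabetically smaller city comes first, then deduplicate
select least(source, destination)    as source,
       greatest(source, destination) as destination,
       max(distance)                 as distance
from routes
group by least(source, destination), greatest(source, destination);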
Can you explain the process of designing and implementing a data warehouse?

I would begin by gathering the business requirements and designing the data model, typically a dimensional model with fact and dimension tables. Once the data model is ready, I would proceed with data extraction from the source systems using ETL tools such as Informatica, or the native loading utilities of a target platform such as Snowflake. I would transform the data, clean it, and load it into the data warehouse using the appropriate transformations and business rules. Finally, I would implement the necessary security measures and set up data access controls to ensure data privacy and integrity.
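
As an illustration only (the schema and names below are hypothetical, not from a specific project), a minimal star-schema design for a sales warehouse could look like this:

-- Dimension tables hold descriptive attributes with surrogate keys
create table dim_customer (
    customer_key  int primary key,        -- surrogate key
    customer_id   varchar(20),            -- natural key from the source system
    customer_name varchar(100),
    region        varchar(50)
);

create table dim_date (
    date_key  int primary key,            -- e.g. 20240131
    full_date date,
    month     int,
    year      int
);

-- The fact table holds numeric measures and foreign keys to the dimensions
create table fact_sales (
    sale_id      bigint primary key,
    customer_key int references dim_customer (customer_key),
    date_key     int references dim_date (date_key),
    quantity     int,
    sales_amount decimal(12, 2)
);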

How do you handle the extraction and transformation of large volumes of data in a data
warehouse?
Extracting and transforming large volumes of data is a common challenge in a data warehouse.
To address this, I would first analyze the data sources and their characteristics to identify any
performance bottlenecks. Then, I would optimize data extraction by using techniques such as
incremental loading, partitioning, or multi-threading to improve the extraction speed. Similarly, I
would optimize data transformation by using parallel processing and optimizing SQL queries.
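
As a simple illustration of incremental extraction (the source table, watermark table, and column names are assumptions), only rows changed since the last successful load are pulled:

-- Extract only rows changed since the last recorded load time
select *
from source_orders s
where s.last_updated_at > (
    select max(last_extracted_at)
    from etl_watermark
    where table_name = 'source_orders'
);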

How do you handle data quality issues?

I understand the importance of data quality in a data warehouse. To ensure data quality, I would
start by implementing data validation checks during the ETL process to identify any
inconsistencies or errors in the data. These checks can include checking data types, referential
integrity, or running data profiling scripts. I would also collaborate with the business users and
data owners to define data quality rules and metrics. In case of data quality issues, I would
investigate the root cause, fix the issues, and re-run the ETL process to ensure the data
warehouse contains accurate and reliable information. Additionally, I would regularly monitor
data quality using data profiling and monitoring tools.
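
A few examples of the kind of validation checks described above (table and column names are illustrative only):

-- Referential integrity: fact rows whose customer is missing from the dimension
select count(*) as orphan_rows
from fact_sales f
left join dim_customer c on f.customer_key = c.customer_key
where c.customer_key is null;

-- Completeness: mandatory columns that arrived empty
select count(*) as null_customer_ids
from stg_customers
where customer_id is null;

-- Uniqueness: duplicate natural keys in the staging area
select customer_id, count(*) as occurrences
from stg_customers
group by customer_id
having count(*) > 1;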

Can you explain the difference between a dimensional data model and a normalized data
model?
A dimensional data model is designed to support efficient analytical queries and reporting. It
organizes data into dimension tables and fact tables, where dimension tables contain descriptive
attributes and fact tables contain numeric measures. This model allows for easy data aggregation
and enables fast query performance. On the other hand, a normalized data model follows the
principles of normalization to eliminate data redundancy and reduce data anomalies. It organizes
data into multiple smaller tables, reducing data duplication. While a dimensional model is
optimized for querying, a normalized model is optimized for data integrity and flexibility in
transactional systems. Depending on the requirements, a data warehouse may use either a
dimensional or a normalized data model, or a combination of both.

How do you handle data versioning and historical data in a data warehouse?

Data versioning and handling historical data are important aspects of a data warehouse. To
handle data versioning, I would maintain a metadata repository that tracks the versioning
information for each data element. This repository would capture the source system timestamp,
the extraction timestamp, and a unique identifier for each record. This allows for tracking
changes over time and auditing. To handle historical data, I would implement slowly changing
dimensions in the data model, which allow for preserving historical records. Using techniques
like Type 2 or Type 4 slowly changing dimensions, I can track changes to dimension attributes
over time and link facts to the appropriate version of the dimension records. By maintaining
historical data, the data warehouse can provide valuable insights into trends and historical
analysis.
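
A simplified sketch of a Type 2 update, written as separate UPDATE and INSERT steps (the column names, is_current flag, and 9999-12-31 high-date convention are assumptions, not a fixed standard):

-- Step 1: close the current version of any customer whose tracked attribute changed
update dim_customer d
set    effective_to = current_date,
       is_current   = 'N'
where  d.is_current = 'Y'
  and exists (
      select 1
      from stg_customers s
      where s.customer_id = d.customer_id
        and s.region     <> d.region          -- tracked attribute changed
  );

-- Step 2: insert a new current version for changed or brand-new customers
insert into dim_customer (customer_id, customer_name, region, effective_from, effective_to, is_current)
select s.customer_id, s.customer_name, s.region, current_date, date '9999-12-31', 'Y'
from stg_customers s
left join dim_customer d
       on d.customer_id = s.customer_id and d.is_current = 'Y'
where d.customer_id is null
   or s.region <> d.region;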

How do you optimize the performance of SQL queries in a data warehouse?

To optimize the performance of SQL queries in a data warehouse, I employ various techniques.
Firstly, I ensure that appropriate indexes are created on the relevant columns to speed up data
retrieval. I also utilize techniques such as query rewriting and query optimization using tools like
Explain Plan or Query Execution Plans to improve query performance. Partitioning tables based
on specific criteria like date or region helps to reduce the amount of data scanned during query
execution. Additionally, I leverage techniques like materialized views or summary tables to pre-
aggregate data and reduce query execution time. Regular performance tuning and monitoring of
queries are performed to identify bottlenecks and optimize query performance.
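
For example, a pre-aggregated summary table of the kind mentioned above (names are illustrative) lets dashboards read a small table instead of scanning the detail fact table:

-- Nightly-refreshed summary: daily sales per region
create table agg_daily_sales as
select d.full_date,
       c.region,
       sum(f.sales_amount) as total_sales,
       count(*)            as order_count
from fact_sales f
join dim_date d     on f.date_key = d.date_key
join dim_customer c on f.customer_key = c.customer_key
group by d.full_date, c.region;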

How do you ensure data security and privacy in a data warehouse?


A.
Sample answer
As a Data Warehouse Developer/Engineer, ensuring data security and privacy is one of my top
priorities. I keep the following measures in place: Firstly, I implement strict access controls and
user management for the data warehouse, granting privileges only to authorized users. Data
encryption techniques like Transparent Data Encryption (TDE) or Secure Sockets Layer (SSL) are
employed to protect data in transit and at rest. I also monitor and log access to the data
warehouse, enabling auditing and tracking of any unauthorized activities. Regular vulnerability
assessments and security patches are performed to mitigate potential risks. Data masking or
anonymization techniques are used to protect sensitive data during development or testing
processes. Compliance with relevant data protection regulations like GDPR or HIPAA is ensured
to safeguard user privacy and maintain legal compliance.

What techniques do you use for data extraction and loading in a data warehouse?
A.
Sample answer
As a Data Warehouse Developer/Engineer, I am well-versed in various techniques for data
extraction and loading. For data extraction, I use SQL queries to extract data from relational
databases like Oracle or SQL Server. I also have experience with ETL tools like Informatica
PowerCenter, which provide functionalities for data extraction from different sources such as
files, APIs, or web services. I perform data cleaning and transformation using ETL tools and
apply business rules during the loading process. I also leverage parallel processing and bulk
loading techniques to optimize the loading speed and performance of the data warehouse.

How do you handle incremental data loading in a data warehouse?


A.
Sample answer
Incremental data loading is a common requirement in data warehousing to keep the data
warehouse up-to-date. To handle this, I implement change data capture (CDC) techniques to
identify and capture the changed or new data from the source systems. I use CDC
functionalities provided by ETL tools like Informatica or CDC mechanisms offered by database
systems like Oracle. Incremental data loading can be achieved by comparing source system
timestamps with the last load timestamp and only extracting and loading the changed data. I
also leverage incremental loading strategies like delta loads or using CDC flags to efficiently
update the data warehouse with minimal impact on system resources. Monitoring and
validating incremental data loads ensure the data warehouse remains consistent and accurate.
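
A simplified delta-load step using MERGE (syntax varies by platform, and the table and column names here are assumptions):

-- Upsert the changed rows captured by CDC into the warehouse table
merge into dw_orders tgt
using stg_orders_delta src
   on (tgt.order_id = src.order_id)
when matched then
    update set tgt.status          = src.status,
               tgt.amount          = src.amount,
               tgt.last_updated_at = src.last_updated_at
when not matched then
    insert (order_id, status, amount, last_updated_at)
    values (src.order_id, src.status, src.amount, src.last_updated_at);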
Can you explain the concept of data partitioning and its benefits in a data warehouse?
Data partitioning is the technique of dividing large tables into smaller, more manageable units
called partitions based on specific criteria such as range or list. The benefits of data partitioning
in a data warehouse include improved query performance by reducing the amount of data
scanned during query execution. It enables parallel processing by allowing queries to be
processed in parallel on individual partitions, leading to faster query response times.
Partitioning also simplifies data management tasks like data loading, data archiving, or purging,
as these operations can be performed at the partition level instead of the entire table. By
optimizing data retrieval and management, data partitioning enhances the overall
performance and maintainability of a data warehouse.
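
For example, range-partitioning a sales table by month (PostgreSQL-style declarative partitioning shown; the exact syntax differs by database):

create table sales (
    sale_id   bigint,
    sale_date date not null,
    amount    decimal(12, 2)
) partition by range (sale_date);

-- Each month gets its own partition, so queries filtered on sale_date scan only
-- the relevant partitions, and old months can be archived or dropped cheaply
create table sales_2024_01 partition of sales
    for values from ('2024-01-01') to ('2024-02-01');

create table sales_2024_02 partition of sales
    for values from ('2024-02-01') to ('2024-03-01');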

Can you explain the concept of data staging in a data warehouse environment?
A.
Sample answer
Data staging is the process of preparing and storing data before it is loaded into the data
warehouse. It acts as an intermediate storage area between the source systems and the data
warehouse. During the staging process, data is transformed, cleaned, and standardized to
ensure its quality and consistency. It involves extracting data from the source systems,
applying business rules and transformations, and then loading the prepared data into staging
tables. Staging allows for validation and reconciliation of data before it enters the data
warehouse, ensuring that only quality data is loaded. It also decouples the extraction and
transformation processes from the data warehouse, providing flexibility and ease of
maintenance.

How do you ensure data consistency across multiple data sources in a data warehouse?
A.
Sample answer
Ensuring data consistency across multiple data sources in a data warehouse involves a careful
approach. Firstly, I perform data profiling and analysis to understand the data structure,
business rules, and key fields in each source system. Then, I map and align the data elements
from different sources to ensure consistency in terms of naming, semantics, and data types. I
also implement data integration techniques like data cleansing, data merging, or data
transformation to standardize and reconcile the data from different sources. Regular data
validation and reconciliation processes are conducted to identify and resolve any
inconsistencies in the data warehouse. By maintaining data consistency, the data warehouse
provides a single source of truth for decision-making and reporting.

What steps do you follow to ensure data quality in a data warehouse?


Data profiling techniques, such as statistical analysis or data histogram, are used to analyze the
data and identify anomalies or outliers. I also perform root cause analysis and implement
corrective actions to resolve data quality issues. Regular monitoring and reporting of data
quality metrics allow for continuous improvement and maintenance of data integrity.

How do you handle data transformation and cleansing in a data warehouse?


A.
Sample answer
Data transformation and cleansing are crucial steps in preparing data for a data warehouse. To
transform data, I leverage ETL tools like Informatica or SQL queries to apply various operations
like aggregation, joining, splitting, or reshaping the data. I also use scripting languages like
Python or PySpark for more complex transformations. Cleansing involves removing or
correcting any inconsistencies, errors, or duplicates in the data. I implement data validation
checks, business rules, and data standardization techniques to ensure data quality. Techniques
like data profiling and outlier detection are used to identify and handle data anomalies.
Through a combination of ETL tools and custom scripting, I ensure that the data entering the
data warehouse is accurate, consistent, and reliable.
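
One common cleansing pattern, shown here with illustrative names, keeps only the most recent record per natural key while standardizing values:

-- Deduplicate on customer_id, keeping the latest row, and standardize formatting
select customer_id,
       trim(upper(customer_name))  as customer_name,
       coalesce(region, 'UNKNOWN') as region
from (
    select s.*,
           row_number() over (partition by customer_id
                              order by last_updated_at desc) as rn
    from stg_customers s
) t
where rn = 1;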

What strategies do you employ for data archiving and purging in a data warehouse?
A.
Sample answer
To manage data growth and optimize the performance of a data warehouse, I employ data
archiving and purging strategies. Firstly, I identify and classify the data based on its relevance
and usage patterns. I define retention policies to determine how long each type of data should
be retained. Once the retention period expires, I archive the data by moving it to a separate
storage tier or system. Archiving reduces the data size in the active production environment,
improving query performance and reducing storage costs. For data that is no longer required, I
implement data purging processes to permanently remove the data from the data warehouse.
By archiving and purging data, the data warehouse maintains optimal performance and
efficiency.
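
A simple archive-then-purge sketch (the retention period, table names, and interval syntax are assumptions and vary by database):

-- Move rows older than the retention window to an archive table...
insert into sales_history_archive
select *
from sales_history
where order_date < current_date - interval '7' year;

-- ...then remove them from the active warehouse
delete from sales_history
where order_date < current_date - interval '7' year;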

Can you explain the importance of version control in data warehousing development? How
have you used version control in your previous projects?
A.
Sample answer
Version control is crucial in data warehousing development as it allows for tracking and
managing changes made to the data warehouse over time. It ensures that there is a history of
all modifications, making it easier to revert to a previous version if needed. In my previous
projects, I have used version control systems like Git to manage code changes in the data
warehouse. I would create a repository for the project and commit changes regularly,
providing descriptive commit messages to track the purpose of each change. By using version
control, I was able to collaborate effectively with other team members and easily track the
evolution of the data warehouse.

Can you explain the concept of branching in version control and how it can be applied in a
data warehousing project?
A.
Sample answer
Branching is a powerful feature in version control that allows developers to create separate
lines of development within a project. In a data warehousing project, branching can be applied
in various scenarios. For example, when working on a new feature, I would create a branch
specifically for that feature. This allows me to isolate my changes from the main codebase until
they are thoroughly tested and ready to be merged. Similarly, if I need to fix a bug in the data
warehouse, I would create a separate branch for the bug fix to prevent unintentional changes
to other parts of the project. Branching enables parallel development and reduces the risk of
conflicts among developers. It provides a controlled environment to experiment and iterate
without affecting the stability of the main codebase.

Have you used any specific version control tools or platforms in your data warehousing
projects? How do you ensure that the chosen tool meets the project's requirements?
A.
Sample answer
In my data warehousing projects, I have primarily used Git as the version control tool. Git is a
widely adopted and versatile tool that fulfills the requirements of most data warehousing
projects. However, the selection of the version control tool depends on the specific needs of
the project. Before choosing a tool, I carefully analyze the project requirements, including
factors such as team size, collaboration needs, and integration capabilities with other
development tools. By evaluating these factors, I ensure that the chosen version control tool
aligns with the project's requirements and enhances the efficiency and effectiveness of the
data warehousing development process.

How do you ensure the security and confidentiality of sensitive data in a data warehousing
project when using version control?
A.
Sample answer
Ensuring the security and confidentiality of sensitive data is of utmost importance in a data
warehousing project. When using version control, I adopt several measures to protect sensitive
information. Firstly, I strictly adhere to access control policies and restrict access to sensitive
data only to authorized personnel. This involves setting up appropriate user permissions and
roles within the version control system. Secondly, I enforce encryption of sensitive data at rest
and in transit. This ensures that even if unauthorized access occurs, the data remains
encrypted and unusable. Additionally, I make sure to avoid committing or storing sensitive
data, such as passwords or API keys, directly in the version control repository. Instead, I utilize
configuration files or secure key storage systems. By implementing these security measures, I
safeguard the confidentiality and integrity of sensitive data within the data warehousing
project.

How have you utilized Oracle SQL in your role as a Data warehouse Developer / Engineer?
A.
Sample answer
In my role as a Data warehouse Developer / Engineer, I have extensively utilized Oracle SQL for
querying, manipulating, and analyzing data stored in Oracle databases. I have experience in
writing complex SQL queries involving multiple tables, joins, and aggregations to extract
relevant information. I have used Oracle SQL functions and expressions to derive calculated
fields and transform raw data. Additionally, I have leveraged Oracle SQL's indexing and
performance optimization techniques to enhance query execution speed. Oracle SQL's rich
feature set and robustness have been instrumental in delivering efficient and scalable data
warehousing solutions.
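
A representative example of the kind of Oracle SQL involved, based on the employees/departments schema used elsewhere in this document:

-- Each employee's salary compared with their department's average (analytic function)
select e.employee_id,
       e.first_name,
       d.department_name,
       e.salary,
       round(avg(e.salary) over (partition by e.department_id), 2) as dept_avg_salary
from employees e
join departments d
  on e.department_id = d.department_id;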

How have you utilized Pyspark in your role as a Data warehouse Developer / Engineer?
A.
Sample answer
As a Data warehouse Developer / Engineer, I have extensively utilized Pyspark for processing
and analyzing large datasets. Pyspark provides a Python API for Apache Spark, enabling me to
leverage the power of distributed computing. I have experience in writing Pyspark scripts to
perform data transformations, aggregations, and machine learning tasks. Pyspark's integration
with Apache Spark's libraries, such as Spark SQL and MLlib, allows me to efficiently manipulate
and analyze data at scale. Additionally, I have utilized Pyspark's parallel processing capabilities
to optimize data processing workflows. Pyspark has been a valuable tool in implementing high-
performance data processing pipelines in our data warehousing projects.

How have you utilized Tableau in your role as a Data warehouse Developer / Engineer?
A.
Sample answer
In my role as a Data warehouse Developer / Engineer, I have extensively utilized Tableau for
data visualization and analysis. Tableau allows me to connect to various data sources, including
data warehouses, and create interactive dashboards and reports. I have experience in
designing visually appealing and insightful visualizations using Tableau's rich set of features and
functionalities. I utilize Tableau's drag-and-drop interface to quickly explore and analyze data
from multiple dimensions. Additionally, I have expertise in creating calculated fields,
hierarchies, and advanced visualizations within Tableau. Tableau has been instrumental in
enabling data-driven decision-making and enhancing data visibility within our organization.

How have you utilized UNIX in your role as a Data warehouse Developer / Engineer?
A.
Sample answer
In my role as a Data warehouse Developer / Engineer, I have extensively utilized UNIX for
various tasks related to data extraction, processing, and automation. I have proficiency in UNIX
shell scripting, which allows me to automate repetitive tasks and schedule data integration
workflows. I have experience in conducting data transfers and file manipulations using UNIX
commands and utilities. Additionally, I have utilized crontab to schedule batch jobs and
perform regular data updates. UNIX's command-line interface and robust scripting capabilities
have been crucial in managing and manipulating data in the data warehouse environment.

How have you utilized SQL Server in your role as a Data warehouse Developer / Engineer?
A.
Sample answer
In my role as a Data warehouse Developer / Engineer, I have extensively utilized SQL Server for
various data warehousing tasks. I have experience in designing and optimizing database
schemas, writing complex SQL queries, and creating efficient stored procedures. SQL Server
Integration Services (SSIS) has been my go-to tool for designing and executing ETL workflows.
SSIS provides a visual development environment, which allows for easy integration with other
SQL Server components. Additionally, I have utilized SQL Server Reporting Services (SSRS) for
creating interactive reports and dashboards to visualize data. SQL Server's robustness and
integration capabilities make it a valuable asset in data warehousing projects.

How have you utilized ETL concepts and methodology in your role as a Data warehouse
Developer / Engineer?
A.
Sample answer
As a Data warehouse Developer / Engineer, I have extensive experience with ETL (Extract,
Transform, Load) concepts and methodology. I understand the importance of extracting
relevant data from various sources, transforming it into a suitable format, and loading it into
the data warehouse. I have applied various transformation techniques such as filtering, joining,
aggregating, and data cleansing to ensure data integrity and accuracy. Additionally, I have
applied business rules and calculations during the transformation phase to generate
meaningful insights. I have experience with scheduling and monitoring ETL workflows to
ensure timely and efficient data integration. Overall, my expertise in ETL concepts and
methodology has been instrumental in delivering successful data warehousing solutions.

How have you utilized ETL tools like Informatica in your role as a Data warehouse
Developer / Engineer?
A.
Sample answer
In my role as a Data warehouse Developer / Engineer, I have extensively utilized Informatica
for ETL (Extract, Transform, Load) processes. For example, I have used Informatica
PowerCenter to extract data from various source systems, apply necessary transformations
and business rules, and load it into the data warehouse. Informatica provides a user-friendly
interface to design, develop, and schedule ETL workflows, making it efficient and easy to use.
By utilizing Informatica, I have been able to automate complex data integration tasks, ensuring
the accuracy and timeliness of data in the data warehouse.

How do you utilize Python in your role as a Data warehouse Developer / Engineer?
A.
Sample answer
As a Data warehouse Developer / Engineer, I leverage Python for various tasks related to data
processing, analysis, and automation. For example, I use Python's libraries such as Pandas and
NumPy to perform data transformations, data cleansing, and data aggregations. Python's
flexibility and extensive libraries make it a powerful tool for handling large datasets. I also
utilize Python for automating repetitive tasks, such as generating reports or data validation
scripts. Python's integration with SQL databases allows me to efficiently interact with the data
warehouse and perform complex data manipulations.

How do you handle the deployment of code changes in a data warehousing project to
minimize downtime and ensure smooth transition?
A.
Sample answer
Minimizing downtime and ensuring a smooth transition during the deployment of code
changes is critical in a data warehousing project. To achieve this, I follow a well-defined
deployment plan. Firstly, I schedule deployments during off-peak hours to minimize the impact
on users and avoid disruption to ongoing operations. Secondly, I perform extensive testing and
validation of the changes in a staging environment that closely mirrors the production
environment. This helps in identifying and resolving any issues or conflicts prior to deployment.
Thirdly, I create a rollback plan in case any unexpected issues arise during the deployment. This
includes taking necessary backups and establishing a clear process to revert to the previous
stable state. By following these practices, I ensure minimal downtime and a seamless transition
during code deployments in the data warehousing project.

Q.
What strategies do you implement to ensure efficient collaboration among team members in
a data warehousing project using version control?
A.
Sample answer
Efficient collaboration among team members is essential in a data warehousing project to
ensure smooth development and successful delivery. When using version control, I employ
several strategies to enhance collaboration. Firstly, I establish clear communication channels,
such as regular team meetings or dedicated communication platforms, to facilitate information
exchange and issue resolution. Secondly, I encourage the use of branching in version control to
allow parallel development and prevent conflicts. This enables team members to work on
separate features or bug fixes without disrupting each other's progress. Thirdly, I emphasize
the importance of proper code documentation and commit messages to enhance
understanding and knowledge sharing among team members. Lastly, I promote a culture of
constructive feedback and code reviews, which helps in identifying improvement opportunities
and maintaining code quality. These strategies foster effective collaboration in a data
warehousing project using version control.
In a data warehousing project, how do you ensure traceability and accountability of changes
made to the codebase using version control?
A.
Sample answer
Ensuring traceability and accountability of changes made to the codebase is essential in a data
warehousing project. When using version control, I implement several measures to achieve
this. Firstly, I enforce the use of descriptive commit messages that clearly state the purpose
and impact of each change. This makes it easier to trace the intent behind modifications.
Secondly, I maintain a centralized repository where all changes are logged and tracked,
allowing for easy access to the commit history. Thirdly, I establish a code review process to
ensure that changes are thoroughly reviewed by peers. This adds an additional layer of
accountability and helps in identifying potential issues or improvement opportunities. By
combining these practices, I ensure traceability and accountability of changes made to the
codebase in a data warehousing project.

What steps do you take to ensure data consistency and accuracy during the deployment of
code changes in a data warehousing project using version control?
A.
Sample answer
Deploying code changes in a data warehousing project requires careful consideration to
maintain data consistency and accuracy. In my approach, I follow a structured deployment
process. Firstly, I perform thorough testing of the changes in a controlled environment that
closely resembles the production environment. This includes executing end-to-end tests to
verify the correctness of the modifications and data integrity. Secondly, I create deployment
scripts or packages that encapsulate the changes and ensure proper execution. By automating
the deployment process, I reduce the risk of manual errors and inconsistencies. Thirdly, before
deploying to the production environment, I perform a final round of testing to validate the
changes' impact on system performance and data accuracy. This includes checking for any
potential conflicts or regressions. By following these steps, I ensure that data consistency and
accuracy are maintained during the deployment of code changes in the data warehousing
project.

How do you track and manage dependencies in a data warehousing project when using
version control?
A.
Sample answer
Tracking and managing dependencies is crucial in a data warehousing project to ensure that
changes are implemented in the correct order and do not introduce conflicts. When using
version control, I employ several techniques to handle dependencies effectively. Firstly, I
maintain a documentation or diagram that outlines the dependencies among different
components or modules of the data warehouse. This provides a clear visual representation and
helps in planning the sequence of changes. Secondly, I leverage the branching and merging
capabilities of the version control tool to manage the implementation of dependent changes
systematically. By creating separate branches for each change and merging them in the correct
order, I ensure that dependencies are resolved appropriately. Additionally, I communicate and
coordinate with other team members involved to align the implementation of changes and
identify any potential conflicts or issues. These practices enable accurate tracking and seamless
management of dependencies in a data warehousing project.
-- Top five earners in each department
with cte as (
    select e.first_name, d.department_name as dept_name, e.salary,
           row_number() over (partition by d.department_id order by e.salary desc) as rnk
    from employees e
    join departments d
      on e.department_id = d.department_id
)
select *
from cte
where rnk <= 5;

1. Write a query to calculate the median salary of employees in a table.
2. Identify products that were sold in all regions.
3. Retrieve the name of the manager who supervises the most employees.
4. Write a query to group employees by age ranges (e.g., 20–30, 31–40) and count the number of employees in each group.
5. Display the cumulative percentage of total sales for each product.
6. Write a query to retrieve the first order placed by each customer.
7. Identify employees who have never received a performance review.
8. Find the most common value (mode) in a specific column.
9. Display all months where sales exceeded the average monthly sales.
10. Write a query to identify the employee(s) whose salary is closest to the average salary of the company.

Answers:

Solution 1

SELECT AVG(salary) AS median_salary
FROM (
  SELECT salary
  FROM employees
  ORDER BY salary
  LIMIT 2 - (SELECT COUNT(*) FROM employees) % 2
  OFFSET (SELECT (COUNT(*) - 1) / 2 FROM employees)
) subquery;
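
This LIMIT/OFFSET approach relies on the database accepting subqueries in LIMIT and OFFSET (e.g., SQLite or PostgreSQL). On databases with ordered-set aggregates, such as Oracle or PostgreSQL, the median can also be computed directly:

SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;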

Solution 2

SELECT product_id
FROM sales
GROUP BY product_id
HAVING COUNT(DISTINCT region_id) = (SELECT COUNT(*) FROM regions);

Solution 3

SELECT manager_id, COUNT(*) AS num_employees
FROM employees
GROUP BY manager_id
ORDER BY num_employees DESC
LIMIT 1;

Solution 4

SELECT CASE
         WHEN age BETWEEN 20 AND 30 THEN '20-30'
         WHEN age BETWEEN 31 AND 40 THEN '31-40'
         WHEN age BETWEEN 41 AND 50 THEN '41-50'
         ELSE '50+'
       END AS age_range,
       COUNT(*) AS num_employees
FROM employees
GROUP BY age_range;

Solution 5

SELECT product_id,
       SUM(sales) AS product_sales,
       SUM(SUM(sales)) OVER (ORDER BY SUM(sales) DESC) * 100.0 /
       SUM(SUM(sales)) OVER () AS cumulative_percentage
FROM sales_table
GROUP BY product_id;

Solution 6

SELECT customer_id, MIN(order_date) AS first_order_date
FROM orders
GROUP BY customer_id;

Solution 7

SELECT *
FROM employees
WHERE employee_id NOT IN (SELECT employee_id FROM performance_reviews);

Solution 8

SELECT column_name, COUNT(*) AS frequency
FROM table_name
GROUP BY column_name
ORDER BY frequency DESC
LIMIT 1;
Solution 9

SELECT month, SUM(sales) AS monthly_sales
FROM sales
GROUP BY month
HAVING SUM(sales) > (
  SELECT AVG(monthly_total)
  FROM (
    SELECT SUM(sales) AS monthly_total
    FROM sales
    GROUP BY month
  ) t
);

Solution 10

SELECT employee_id, salary
FROM employees
ORDER BY ABS(salary - (SELECT AVG(salary) FROM employees)) ASC
LIMIT 1;

Tableau

What is data visualization in Tableau?


Data visualization is a way of representing data that is visually appealing and interactive. With advancements in technology, the number of business intelligence tools has grown, helping users understand data sets, data points, charts, and graphs, and focus on the insights rather than on the tool itself.

What is the difference between various BI tools and Tableau?

The basic difference between traditional BI tools and Tableau lies in efficiency and speed. The architecture of traditional BI tools has hardware limitations, while Tableau does not have such dependencies. Traditional BI tools work on complex technologies, while Tableau uses a simple associative search to stay dynamic. Traditional BI tools do not support multi-threaded, in-memory, or multi-core computing, while Tableau supports all of these. Traditional BI tools offer a pre-defined data view, while Tableau performs predictive analysis for business operations.


What are different Tableau products?

Tableau, like other BI tools, has a range of products:

Tableau Desktop: Used to build visualizations and dashboards. It generates optimized queries from your drag-and-drop actions, so you can explore data without writing code. Tableau Desktop brings data from various sources into its data engine and creates interactive dashboards.

Tableau Server: Once dashboards have been published from Tableau Desktop, Tableau Server helps share them throughout the organization. It is an enterprise-level product installed on a Windows or Linux server.

Tableau Reader: A free desktop application that lets you open and view data visualizations. You can filter or drill down into the data, but you cannot edit formulas or perform other authoring actions.

Tableau Online: A paid offering that does not need its own installation; it is a hosted service used to share published dashboards anywhere and everywhere.

Tableau Public: A free offering for publishing your visualizations as worksheets or workbooks to the public Tableau Public site.


What is a parameter in Tableau?

A parameter is a variable (number, string, or date) created to replace a constant value in calculations, filters, or reference lines. For example, you create a field that returns true if sales are greater than 30,000 and false otherwise. A parameter can replace that constant (30,000 in this case) so the value can be set dynamically during calculations. Parameters accept values in the following options:
All: Simple text field
List: List of possible values to select from
Range: Select values from a specified range

Tell me something about measures and dimensions?

In Tableau, when we connect to a new data source, each field in the


data source is either mapped as measures or dimensions. These
fields are the columns defined in the data source. Each field is
assigned a dataType (integer, string, etc.) and a role (discrete
dimension or continuous measure). Measures contain numeric
values that are analyzed by a dimension table. Measures are stored
in a table that allows storage of multiple records and contains
foreign keys referring uniquely to the associated dimension tables.
While Dimensions contain qualitative values (name, dates,
geographical data) to define comprehensive attributes to
categorize, segment, and reveal the data details.


What are continuous and discrete field types?

Tableau's specialty lies in displaying data either in continuous or discrete format. Both are mathematical terms used to describe data: continuous means without interruptions, and discrete means individually separate and distinct. Blue indicates a discrete field, and green indicates a continuous field. A discrete field defines headers and can be easily sorted, while a continuous field defines an axis in a graph view and cannot be sorted.

What is aggregation and disaggregation of data?

Aggregation of data means displaying measures and dimensions in an aggregated form. The aggregate functions available in Tableau include:
SUM (expression): Adds up all the values used in the expression. Used only for numeric values.
AVG (expression): Calculates the average of all the values used in the expression. Used only for numeric values.
MEDIAN (expression): Calculates the median of all the values across all the records used in the expression. Used only for numeric values.
COUNT (expression): Returns the number of values in the set of expressions. Excludes null values.
COUNTD (expression): Returns the number of unique values in the set of expressions.
Tableau also lets you alter the aggregation type for a view.

Disaggregation of data means displaying each and every data field separately.

What are the different types of joins in Tableau?

Tableau is pretty similar to SQL, so the types of joins in Tableau are similar:
Left Outer Join: Returns all the records from the left table and the matching rows from the right table.
Right Outer Join: Returns all the records from the right table and the matching rows from the left table.
Full Outer Join: Returns the records from both the left and right tables; unmatched rows are filled with NULL values.
Inner Join: Returns only the matching records from both tables.

Tell me the different connections to make with a dataset?
There are two types of data connections in Tableau:
LIVE: A live connection is a dynamic way to work with real-time data by connecting directly to the data source. Tableau issues queries against the database and retrieves the results into the workbook.
EXTRACT: An extract is a snapshot of the data; the extract file (.tde or .hyper) contains data copied from a source such as a relational database or an Excel spreadsheet. You can schedule refreshes of the snapshot using Tableau Server, and no permanent connection to the database is needed.

What are the supported file extensions in Tableau?

The supported file extensions used in Tableau Desktop are:
Tableau Workbook (TWB): contains all worksheets, story points, dashboards, etc.
Tableau Data Source (TDS): contains connection information and metadata about your data source.
Tableau Data Extract (TDE): contains data that has been extracted from other data sources.
Tableau Packaged Workbook (TWBX): contains a combination of the workbook, connection data, metadata, and the data itself in the form of a TDE; it can be zipped and shared.
Tableau Packaged Data Source (TDSX): contains a combination of different files.
Tableau Bookmark (TBM): used to earmark a specific worksheet.


What are the supported data types in Tableau?

The following data types are supported in Tableau:

Data Type             Possible Values
Boolean               True / False
Date                  Date value (December 28, 2016)
Date & Time           Date & timestamp values (December 28, 2016 06:00:00 PM)
Geographical Values   Geographical mapping (Beijing, Mumbai)
Text/String           Text / string
Number (decimal)      Decimal (8.00)
Number (whole)        Whole number (5)


What are sets?

Sets are custom fields created as a subset of the data in Tableau Desktop. Sets can be computed based on conditions or created manually based on the dimensions of the data source. For example, a set of customers that earned revenue above some value. Set membership may update dynamically based on the conditions applied.


What are groups in Tableau?

Groups are created to visualize larger memberships using


dimensions. Groups can create their own fields to categorize values
in that specific dimension.


What are shelves?

Tableau worksheets contain various named elements like columns,


rows, marks, filters, pages, etc. which are called shelves. You can
place fields on shelves to create visualizations, increase the level of
detail, or add context to it.

Tell me something about Data blending in Tableau?

Data blending is viewing and analyzing data from multiple sources


in one place. Primary and secondary are two types of data sources
that are involved in data blending.

How do you generally perform load testing in Tableau?

Load testing in Tableau is done to understand the server's capacity with respect to its environment, data, workload, and use. It is preferable to conduct load testing at least 3-4 times a year, because with every new user, upgrade, or piece of content authored, the usage, data, and workload change. TabJolt was created by Tableau to conduct point-and-run load and performance testing specifically for Tableau Server. TabJolt automates the process of running user-specified loads, eliminates dependency on script development or script maintenance, and scales linearly with an increase in load by adding more nodes to the cluster.

Why would someone not use Tableau?

The limitations of using Tableau are:
Not cost-effective: Tableau is not that cost-effective compared with other available data visualization tools. In addition, it involves costs for software upgrades, proper deployment, maintenance, and training people to use the tool.
Not so secure: When it comes to data, everyone is extra cautious. Tableau focuses on security issues but fails to provide centralized data-level security. It pushes for row-level security and creates an account for every user, which makes it more prone to security glitches.
BI capabilities are not enough: Tableau lacks basic BI capabilities like large-scale reporting, building data tables, or creating static layouts. It has limited result-sharing capabilities, email notification configuration is limited to admins, and the vendor doesn't support trigger-based notifications.


What is Tableau data engine?

An analytical database that computes instant query responses,


predictive analysis of the server, and integrated data. The data
engine is useful when you need to create, refresh, or query extracts.
It can be used for cross-database joins as well.

What are the various types of filters in Tableau?

Tableau has 6 different types of filters:
Extract Filter: retrieves a subset of data from the data source into the extract.
Dimension Filter: applies to non-aggregated (discrete) data.
Data Source Filter: restricts users from viewing sensitive information and reduces data feeds.
Context Filter: creates a filtered dataset that other filters are applied on top of.
Measure Filter: applies to aggregated values using operations like sum, median, avg, etc.
Table Calculation Filter: is applied after the view has been created.


What are dual axes?


Dual axes are used to analyze two different measures at two
different scales in the same graph. This lets you compare multiple
attributes on one graph with two independent axes layered one
above the other. To add a measure as a dual-axis, drag the field to
the right side of the view and drop it when you see a black dashed
line appear. You can also right-click (control-click on Mac) the
measure on the Columns or Rows shelf and select Dual Axis.

What is the difference between a treemap and a heat map?

Both help in analyzing data. A heat map visualizes and compares different categories of data, while a treemap displays a hierarchical structure of data in rectangles. A heat map depicts measures against dimensions using color, similar to a text table with values shown in different colors. A treemap visualizes the hierarchy of data in nested rectangles, with hierarchy levels displayed from larger rectangles to smaller ones; for example, a treemap can show aggregated sales totals across a range of product categories.

What are extracts and schedules in Tableau Server?

Data extracts are the subsets of data created from data sources.
Schedules are scheduled refreshes made on extracts after
publishing the workbook. This keeps the data up-to-date. Schedules
are strictly managed by the server administrators.


What are the components in a dashboard?

The components displayed in a dashboard are:
Horizontal: A horizontal layout container lets you combine worksheets and dashboard elements from left to right and edit the height of the elements.
Vertical: A vertical layout container lets you combine worksheets and dashboard elements from top to bottom and edit the width of the elements.
Text: All the textual fields.
Image Extract: To extract an image, Tableau applies some code, extracts the image, and saves it in the workbook in XML format.
Web URL: A hyperlink that points to a web page, file, or other web resource outside of Tableau.


What is a TDE file?

TDE stands for Tableau Data Extract, with the extension .tde. A TDE file contains data extracted from external sources like MS Excel, MS Access, or CSV files, stored in Tableau's columnar data engine, which makes it faster to analyze and explore the data.


What is the story in Tableau?

A story in Tableau is a sheet that combines a sequence of worksheets or dashboards to convey a narrative to viewers. To create a story: click New Story on the dashboard; choose the size of the story from the bottom-left corner or set a custom size; build the story by double-clicking a sheet to add it as a story point; add a caption by clicking Add a caption; and update the highlights by clicking Update in the toolbar. You can also add layout options, format the story, or fit the story to your dashboard.


What are different Tableau files?

Workbooks: Workbooks contain one or more worksheets and


dashboard elements. Bookmarks: Contains a single worksheet that
is easier to share. Packaged Workbooks: Contains a workbook along
with supporting local file data and background images. Data
Extraction Files: Extract files that contain a subset of data. Data
Connection Files: Small XML file with various connection
information.


How do you embed views into webpages?


You can easily integrate interactive views from your Tableau Server
or Tableau online onto webpages, blogs, web applications, or
internet portals. But to have a look at the views, the permissions
demand the viewer to create an account on the Tableau Server. To
embed views, click the Share button on the top of the view and copy
the embed code to paste it on the web page. You can also
customize the embedded code or Tableau Javascript APIs to embed
views.

What is the maximum number of rows Tableau can utilize at one time?

The maximum number of rows or columns is indefinite because


even though Tableau contains petabytes of data, it intelligently uses
only those rows and columns which you need to extract for your
purpose.

What is the difference between published data sources and embedded data sources in Tableau?

Connection information is the details of the data that you want to bring into Tableau; before publishing, you can create an extract of it.
Published Data Source: contains connection information that is independent of any workbook and can be reused by multiple workbooks.
Embedded Data Source: contains connection information that is saved inside a single workbook.


What is the DRIVE Program Methodology?

DRIVE program methodology creates a structure around data


analytics derived from enterprise deployments. The drive
methodology is iterative in nature and includes agile methods that
are faster and effective.


How to use groups in a calculated field?


Add the GROUP BY clause to SQL queries, or create a calculated field in the data window, to group fields.
Using groups in a calculation: you cannot reference ad-hoc groups in a calculation.
Blending data using groups created in the secondary data source: only calculated groups can be used in data blending if the group was created in the secondary data source.
Using a group in another workbook: you can easily replicate a group in another workbook by copying and pasting a calculation.

Explain when you would use Joins vs Blending in Tableau?

While the two terms may sound similar, they differ in meaning and use in Tableau: a join is used to combine two or more tables within the same data source, whereas blending is used to combine data from multiple data sources such as Oracle, Excel, or SQL Server.


What is Assume referential integrity?

In some cases, you can improve query performance by selecting the


option to Assume Referential Integrity from the Data menu. When
you use this option, Tableau will include the joined table in the
query only if it is specifically referenced by fields in the view.

What is a Calculated Field and How Will You Create One?

Calculated fields are created using formulas based on other fields. These fields do not exist in the data source but are created by you. You can create them to: segment data; convert the data type of a field, such as converting a string to a date; aggregate data; filter results; or calculate ratios.

There are three main types of calculations you can create:
Basic Calculations: transform values of the data fields at the source level.
Level of Detail (LOD) Expressions: transform values of the data fields at the source level, like basic calculations, but with more granular control.
Table Calculations: transform values of the data fields only at the visualization level.

To create a calculated field: in Tableau, navigate to Analysis > Create Calculated Field, enter the formula in the calculation editor, and you are done.

How Can You Display the Top Five and Bottom Five Sales in the Same View?

You can show the top five and bottom five sales as follows: drag Customer Name to Rows and Sales to Columns; sort SUM(Sales) in descending order; create a calculated field that ranks the sales (for example, using RANK or INDEX); and then filter on that rank to keep only the first five and last five values.


What is the Rank Function in Tableau?

Rank function is used to give positions (rank) to any measure in the


data set. Tableau can rank measure in the following ways: Rank:
The rank function in Tableau accepts two arguments: aggregated
measure and ranking order (optional) with a default value of desc.
Rank_dense: The rank_dense also accepts the two arguments:
aggregated measure and ranking order. This assigns the same rank
to the same values but doesn’t stop there and keeps incrementing
with the other values. For instance, if you have values 10, 20, 20,
30, then ranks will be 1, 2, 2, 3. Rank_modified: The rank_modified
assigns the same rank to similar values. Rank_unique: The
rank_unique assigns a unique rank to each and every value. For
example, If the values are 10, 20, 20, 30 then the assigned ranks
will be 1,2,3,4 respectively.

What is the difference between Tableau and other similar tools like QlikView or IBM Cognos?

Tableau differs from QlikView or IBM Cognos for several reasons. Tableau is an intuitive data visualization tool that simplifies story creation with simple drag-and-drop techniques, whereas BI tools like QlikView or Cognos convert data into metadata to let users explore data relationships. If your work centers on presenting data in aesthetic visualizations, opt for Tableau; if you need a full BI platform, go for Cognos or QlikView. Extracting and exploring data details is also easier in Tableau than in extensive BI tools like Cognos: with Tableau, any team member, even someone from sales, can easily read the data and draw insights, whereas Cognos generally requires members with extensive tool knowledge.

BIG QUERY

What is BigQuery, and how does it fit into the data engineering ecosystem?

BigQuery is a fully managed, serverless data warehouse solution provided by Google Cloud Platform (GCP). It allows users to analyze and query large datasets using SQL, with high scalability and performance.

How does BigQuery handle data storage and processing?

BigQuery uses a distributed architecture for data storage and


processing. It separates storage and compute, allowing users to
scale each independently. Data is stored in Capacitor, a proprietary
storage system, while processing is handled by Dremel, a
distributed query execution engine.

What are the key advantages of using BigQuery?

Some advantages of BigQuery include:
- Scalability: It can handle massive datasets and query volumes.
- Cost-effectiveness: Users only pay for the queries and storage they use.
- Serverless architecture: No infrastructure management is required.
- Integration with other GCP services: BigQuery can easily integrate with other GCP tools for data ingestion and processing.

What is the difference between BigQuery and traditional relational databases?

BigQuery is a cloud-based, columnar data warehouse, whereas


traditional relational databases are usually on-premises and row-
based. BigQuery offers near-infinite scalability, while traditional
databases have limitations based on hardware and storage
capacity.

Explain the concept of partitioning in BigQuery.

Partitioning in BigQuery involves dividing tables into smaller, more


manageable parts based on specific criteria, such as a time range or
key value. This helps improve query performance by reducing the
amount of data that needs to be scanned.

What is clustering, and how does it optimize query performance?

Clustering in BigQuery involves organizing data within partitions


based on the values of one or more columns. It improves
performance by physically grouping related data together, allowing
the query engine to skip irrelevant data during the execution of
certain queries.
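
For example (the dataset, table, and column names below are hypothetical):

-- Partition by day on the event timestamp and cluster by customer_id
CREATE TABLE my_dataset.events
(
  event_id    STRING,
  customer_id STRING,
  event_ts    TIMESTAMP,
  payload     STRING
)
PARTITION BY DATE(event_ts)
CLUSTER BY customer_id;

-- Queries that filter on the partition column scan only the matching partitions
SELECT COUNT(*) AS event_count
FROM my_dataset.events
WHERE DATE(event_ts) = '2024-01-31'
  AND customer_id = 'C-1001';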

How do you load data into BigQuery?


Data can be loaded into BigQuery using various methods, including:
- Batch loading: Using the BigQuery web UI, command-line tools like bq, or API calls.
- Streaming: Pushing individual records or small batches in real-time using the BigQuery streaming API.
- Data transfer: Using services like the Cloud Storage transfer service or Dataflow to load data into BigQuery.

What are the different data export options in BigQuery?

BigQuery provides several options for exporting data, such as:
- Exporting query results to Google Cloud Storage or a BigQuery table.
- Exporting data to a Cloud Storage bucket using the BigQuery Data Transfer Service.
- Exporting data to other Google Cloud services, such as Bigtable or Google Sheets.

Explain the concept of federated queries in BigQuery.

Federated queries allow users to query data stored outside of


BigQuery, such as in Google Sheets or Cloud SQL, directly from
within BigQuery. It enables users to combine and analyze data from
multiple sources without having to move or replicate it.
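
For example, a federated query against a Cloud SQL database through a BigQuery connection (the connection name, dataset, and inner query are placeholders):

-- Join live Cloud SQL data with a native BigQuery table in one statement
SELECT o.order_id, o.amount, c.customer_name
FROM EXTERNAL_QUERY(
       'us.my_cloudsql_connection',                          -- BigQuery connection resource
       'SELECT order_id, amount, customer_id FROM orders'    -- executed in Cloud SQL
     ) AS o
JOIN my_dataset.customers AS c
  ON o.customer_id = c.customer_id;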

What are the best practices for optimizing query performance in BigQuery?

Some best practices for query performance optimization in BigQuery include:
- Designing an optimal schema and choosing appropriate column types.
- Partitioning and clustering tables based on query patterns.
- Avoiding SELECT * and fetching only the required columns.
- Using appropriate JOIN and GROUP BY techniques.
- Leveraging caching and materialized views where applicable.

How does BigQuery handle data security?

BigQuery provides several security features, including:
- Encryption at rest: Data stored in BigQuery is encrypted using Google's default encryption keys.
- Encryption in transit: Data transfers between clients and BigQuery are encrypted using HTTPS/TLS.
- IAM integration: Access to BigQuery resources can be controlled using IAM roles and policies.
- Audit logs: BigQuery logs and tracks all user and system activity, providing an audit trail.

What is the difference between a table and a view in BigQuery?

A table in BigQuery represents a structured collection of data,


whereas a view is a virtual table derived from a query. Views do not
store data themselves but instead provide a way to present data in
a particular format or subset.

Explain the concept of nested and repeated fields in BigQuery.

Nested fields allow for hierarchical structures within a table, where a


column can contain another record or a struct. Repeated fields, on
the other hand, allow for arrays or lists within a column, where
multiple values can be stored.
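
A small illustration (names are hypothetical): an orders table where each row carries a nested customer record and a repeated array of item structs:

CREATE TABLE my_dataset.orders
(
  order_id STRING,
  customer STRUCT<id STRING, name STRING>,         -- nested (record) field
  items    ARRAY<STRUCT<sku STRING, qty INT64>>    -- repeated field of structs
);

-- UNNEST flattens the repeated field so individual items can be queried
SELECT o.order_id, i.sku, i.qty
FROM my_dataset.orders AS o,
     UNNEST(o.items) AS i
WHERE i.qty > 1;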

How can you schedule and automate jobs in BigQuery?
BigQuery provides several ways to schedule and automate jobs, including:
- BigQuery scheduled queries: You can schedule queries to run at specified intervals using the BigQuery web UI or API.
- Cloud Scheduler: Use Cloud Scheduler to trigger queries at specific times or intervals.
- Cloud Functions: You can create Cloud Functions that are triggered by events and execute BigQuery jobs.

What is the role of BigQuery Data Transfer Service?

BigQuery Data Transfer Service allows you to automate and schedule data transfers from external data sources, such as Google
Ads or YouTube, into BigQuery. It simplifies the process of loading
data into BigQuery from various platforms.

How does BigQuery handle data ingestion from streaming sources?

BigQuery can ingest data from streaming sources using the BigQuery streaming API. It enables near real-time data processing
by allowing you to push individual records or small batches of data
directly into BigQuery.

What are the limitations or constraints of using BigQuery?

Some limitations of using BigQuery include:

 Query costs: Large or complex queries can result in higher costs.
 DML operations: UPDATE, DELETE, and MERGE are supported but are quota-limited and not intended for frequent, row-by-row OLTP-style changes.
 Data consistency: BigQuery is designed for analytical
workloads and does not provide strong transactional
consistency.
 Schema changes: Modifying the schema of a large table can
be time-consuming and requires careful planning.

How can you monitor and optimize BigQuery costs?

To monitor and optimize BigQuery costs, you can:

 Use BigQuery's query history and explain functionality to analyze query costs.
 Enable BigQuery query auditing and review usage patterns.
 Set up budgets and alerts to track costs.
 Utilize BigQuery's slot reservations for more predictable
pricing.
 Optimize data storage by removing unused tables and
partitions.
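
One hedged way to inspect costs in SQL is through the jobs metadata views; the region qualifier, lookback window, and limit below are assumptions:

-- Most expensive recent queries by bytes billed (region and thresholds are placeholders).
SELECT user_email, job_id,
       total_bytes_billed / POW(1024, 3) AS gib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 10;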

Explain the difference between BigQuery slots and slot reservations.

In BigQuery, slots represent the computational resources allocated to execute queries. Slots are used to measure and bill for query
processing. Slot reservations allow you to reserve a specific number
of slots for your project, providing more predictable and cost-
effective query execution.

Can you share your experience with implementing data pipelines in BigQuery?

The interviewer expects the candidate to share their practical experience and challenges faced when implementing data pipelines
in BigQuery. The candidate can discuss topics like data ingestion,
transformation, orchestration, and monitoring in BigQuery.

What is the difference between a view and a materialized view in BigQuery?
A materialized view in BigQuery is a precomputed table that stores
the results of a query, while a view is a virtual table that derives its
data from the underlying tables at query time.
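
A minimal sketch contrasting the two (the source table and view names are hypothetical):

-- Logical view: computed at query time.
CREATE VIEW mydataset.daily_sales_v AS
SELECT DATE(order_ts) AS order_date, SUM(amount) AS revenue
FROM mydataset.orders
GROUP BY order_date;

-- Materialized view: results are precomputed and refreshed incrementally.
CREATE MATERIALIZED VIEW mydataset.daily_sales_mv AS
SELECT DATE(order_ts) AS order_date, SUM(amount) AS revenue
FROM mydataset.orders
GROUP BY order_date;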

How does BigQuery handle data partitioning and clustering?

BigQuery supports partitioning tables based on a specific column's values, which improves query performance by reducing the amount
of data scanned. Clustering, on the other hand, physically organizes
data within partitions based on one or more columns, further
enhancing query performance.

Can you explain the concept of data sharding in BigQuery?

Data sharding in BigQuery involves dividing large datasets into smaller, more manageable pieces called shards, typically based on
a shard key. It helps distribute data across multiple nodes and can
improve query performance when querying specific shards.

How does BigQuery handle schema changes for large tables?

Certain schema changes to large tables in BigQuery (for example, changing a column's type or mode) can be time-consuming because they effectively require rewriting the table. To minimize
impact, it's recommended to create a new table with the desired
schema, load the data into it, and then swap the old and new tables.

What are the benefits of using partitioned tables in BigQuery?

Partitioned tables in BigQuery offer several benefits, including faster query performance by reducing the amount of data scanned, cost
optimization by querying specific partitions, and simplified data
lifecycle management through efficient data archiving and deletion.

How can you control access and permissions in BigQuery?

Access and permissions in BigQuery can be controlled through Identity and Access Management (IAM) roles and policies. You can
assign specific roles to users, groups, or service accounts to control
their ability to perform actions on BigQuery resources.
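
BigQuery also supports SQL DCL statements for dataset- and table-level access; a hedged sketch, where the principal and dataset are placeholders:

-- Grant read access on a dataset to a user (names are hypothetical).
GRANT `roles/bigquery.dataViewer`
ON SCHEMA mydataset
TO 'user:analyst@example.com';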

What is the role of service accounts in BigQuery?

Service accounts in BigQuery are used to authenticate and authorize applications and processes to access and interact with
BigQuery resources. They provide a way to grant permissions to
non-human entities, such as data pipelines or automated processes.

Can you explain the concept of slots in BigQuery?

In BigQuery, slots represent computational resources allocated to execute queries. Slots are used to measure and bill for query
processing. The number of slots determines the query's maximum
concurrency and affects its performance.

What is the purpose of BigQuery reservations?

BigQuery reservations allow you to allocate a specific number of slots to your project, ensuring that the slots are available when
needed and providing more predictable and cost-effective query
execution.
How can you optimize query performance in
BigQuery?

To optimize query performance in BigQuery, you can follow best practices such as minimizing data scanned by filtering partitions and
clustering columns, using appropriate data types, leveraging cache
and materialized views, and optimizing joins and aggregations.

How does BigQuery handle data encryption?

BigQuery provides encryption at rest, where data stored in BigQuery is automatically encrypted using Google's default encryption keys.
Additionally, it supports encryption in transit through the use of
HTTPS/TLS for data transfers.

Can you explain the concept of query caching in BigQuery?

BigQuery automatically caches the results of recent queries to improve performance and reduce costs. If a subsequent query can
use the cached results, it is served directly from the cache without
incurring additional processing costs.

How can you export BigQuery query results to a file?

You can export BigQuery query results to a file by specifying the destination file format, such as CSV or JSON, and the destination
location, such as Google Cloud Storage. BigQuery then exports the
results to the specified file format and location.
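
For example, the EXPORT DATA statement can write query results straight to Cloud Storage (the bucket path, table, and filter are placeholders):

EXPORT DATA OPTIONS (
  uri = 'gs://my-bucket/exports/sales-*.csv',
  format = 'CSV',
  overwrite = true,
  header = true
) AS
SELECT order_id, amount
FROM mydataset.orders
WHERE order_ts >= '2024-01-01';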

What is the purpose of the BigQuery Data Transfer Service?
The BigQuery Data Transfer Service allows you to automate and
schedule data transfers from various external data sources, such as
Google Marketing Platform or SaaS applications, into BigQuery,
simplifying the process of loading data into BigQuery.

Can you explain the concept of streaming inserts in BigQuery?

Streaming inserts in BigQuery enable near real-time data ingestion by allowing you to push individual records or small batches of data
directly into BigQuery through the streaming API. The data is
immediately available for querying.

What is the difference between a table decorator and a snapshot decorator in BigQuery?

A table decorator in BigQuery allows you to query a specific point in time within a table's history, based on a timestamp or an
expression. A snapshot decorator, on the other hand, allows you to
query a consistent snapshot of all tables in a dataset.

How does BigQuery handle data deduplication?

BigQuery does not provide built-in data deduplication functionality. However, you can deduplicate data during the data ingestion
process by leveraging unique keys or by using other data processing
tools or frameworks before loading the data into BigQuery.
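
A common dedup pattern, sketched here with assumed table and column names, keeps the latest row per key using ROW_NUMBER:

CREATE OR REPLACE TABLE mydataset.customers_dedup AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY updated_at DESC) AS rn
  FROM mydataset.customers_raw
)
WHERE rn = 1;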

Can you explain the concept of the streaming buffer in BigQuery?

When data is streamed into BigQuery, it initially lands in a streaming buffer. The streaming buffer holds the data temporarily
until it is written to permanent storage, and the data in the buffer is
available for querying but subject to certain limitations.
What are the limitations of using BigQuery
streaming inserts?

Some limitations of BigQuery streaming inserts include higher costs compared to batch loading, the limit on the number of rows per
second and per table, and the inability to update or delete individual
records once they are streamed.

How does BigQuery handle nested and repeated fields in JSON data?

BigQuery supports nested and repeated fields in JSON data by mapping nested objects to STRUCT (RECORD) columns and JSON arrays to REPEATED fields, so the hierarchy is preserved in the table schema and can be queried with dot notation and UNNEST.

Can you explain the concept of the BigQuery Data Catalog?

Data Catalog is Google Cloud's centralized metadata management service, and it integrates closely with BigQuery. It allows you to register, search, and
discover datasets, tables, views, and other resources across your
organization, promoting data discoverability and governance.

How can you optimize data storage costs in BigQuery?

To optimize data storage costs in BigQuery, you can partition tables and set table or partition expiration, take advantage of long-term storage pricing for data that has not been modified for 90 days, and regularly review and archive or delete unused or outdated data.
What is the purpose of the
INFORMATION_SCHEMA in BigQuery?

The INFORMATION_SCHEMA in BigQuery is a virtual database schema that provides access to metadata about datasets, tables,
views, columns, and other database objects. It allows users to query
and retrieve information about the BigQuery resources.
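
For example (the dataset name is a placeholder), listing the columns and their types in a dataset:

SELECT table_name, column_name, data_type
FROM mydataset.INFORMATION_SCHEMA.COLUMNS
ORDER BY table_name, ordinal_position;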

Can you explain the concept of data lineage in BigQuery?

Data lineage in BigQuery refers to the ability to trace the origin and
transformation history of a particular dataset or table. It helps users
understand where the data comes from, how it was derived, and the
dependencies between different datasets.

How does BigQuery handle nested data types like arrays and structs?

BigQuery supports nested data types like arrays and structs by allowing you to create tables with columns that contain nested
fields. You can query and manipulate the nested data using dot
notation or by using appropriate SQL functions.

What is the purpose of the BigQuery ML service?

BigQuery ML is a service within BigQuery that allows you to build and execute machine learning models using SQL queries. It provides
a simplified interface for data engineers and analysts to perform
machine learning tasks without leaving BigQuery.

How can you monitor and troubleshoot query performance in BigQuery?
You can monitor and troubleshoot query performance in BigQuery by analyzing query execution statistics, using the query execution details (query plan and stage timings) in the BigQuery console, querying the INFORMATION_SCHEMA jobs views, and reviewing slot utilization and errors in Cloud Monitoring and Cloud Logging.

Can you explain the concept of table clustering and its benefits?

Table clustering in BigQuery involves physically organizing data within partitions based on one or more columns. Clustering
improves query performance by reducing the amount of data that
needs to be scanned, resulting in faster query execution and cost
savings.

How does BigQuery handle query optimization and query execution?

BigQuery's query optimizer automatically optimizes query execution by analyzing the query's structure, data distribution, and available
indexes. It chooses the most efficient execution plan based on
factors such as data location, query complexity, and available
resources.

What is the purpose of BigQuery BI Engine?

The BigQuery BI Engine is an in-memory analysis service that complements BigQuery. It provides highly interactive and low-latency query performance for BI tools, allowing for real-time data exploration and visualization on large datasets.

(Alternative solution to the earlier consecutive-numbers puzzle, using the gaps-and-islands technique:)

select distinct num as consecutivenumber
from (
  select *,
         row_number() over (order by id)
         - row_number() over (partition by num order by id) as rnk
  from logs
) as cons
group by num, rnk
having count(*) >= 3;

Can you explain the concept of wildcard tables in BigQuery?

Wildcard tables in BigQuery allow you to query multiple tables that match a specific pattern using a single query. They are useful when
working with partitioned or date-sharded tables, enabling efficient
querying of data across multiple tables.
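
A short sketch against date-sharded tables (the table prefix events_YYYYMMDD is hypothetical):

-- Query all 2024 shards in one statement.
SELECT COUNT(*) AS rows_in_2024
FROM `mydataset.events_2024*`
WHERE _TABLE_SUFFIX BETWEEN '0101' AND '1231';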

What are the different data ingestion options in BigQuery?

BigQuery provides several data ingestion options, including batch loading using the BigQuery web UI, command-line tools like bq, or
API calls. It also supports real-time data ingestion through the
streaming API or data transfer services for specific data sources.

How does BigQuery handle data deduplication during batch loading?

BigQuery does not provide built-in data deduplication during batch loading. However, you can preprocess your data to remove
duplicates using data cleaning techniques or leverage external data
processing tools before loading the data into BigQuery.

Can you explain the concept of clustering keys in BigQuery?

Clustering keys in BigQuery determine how data is physically organized within partitions. They are used to define the order in
which data is stored and improve query performance by allowing
the query engine to skip irrelevant data during execution.
What are the best practices for data modeling
in BigQuery?

Some best practices for data modeling in BigQuery include denormalizing data to minimize JOIN operations, using appropriate
column types and compression, optimizing partitioning and
clustering, and designing schemas based on query patterns and
performance requirements.

How does BigQuery handle data backup and recovery?

BigQuery provides built-in data redundancy and backup mechanisms. Data is automatically replicated across multiple
storage locations within a region for durability, and snapshots of
table data can be created for point-in-time recovery or restoring
previous states of the data.

Can you explain the concept of materialized views in BigQuery?

Materialized views in BigQuery are precomputed results of queries that are stored as physical tables. They can be used to accelerate
query performance by caching the results and updating them
incrementally as the underlying data changes.

How does BigQuery handle data export to external services?

BigQuery provides various options to export data to external services. You can export query results to Google Cloud Storage or
other cloud storage platforms, export data to Cloud Pub/Sub, or use
data transfer services for specific integrations with other Google
Cloud services.
What is the purpose of BigQuery ML's CREATE
MODEL statement?

The CREATE MODEL statement in BigQuery ML is used to create a machine learning model based on a specified algorithm and training
data. It allows you to build predictive models directly within
BigQuery using SQL syntax.
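
A minimal hedged example (the table, feature columns, and label are assumptions) of training a logistic regression model in SQL:

CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (
  model_type = 'logistic_reg',
  input_label_cols = ['churned']
) AS
SELECT tenure_months, monthly_spend, churned
FROM mydataset.customer_features;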

Can you explain the concept of geographic data types in BigQuery?

BigQuery supports geographic data types for representing spatial data, such as points, lines, and polygons. These types enable
storage and querying of location-based information and provide
functions for spatial analysis and calculations.
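
For instance, computing the distance between two points with geography functions (the coordinates below are illustrative):

SELECT ST_DISTANCE(
         ST_GEOGPOINT(72.8777, 19.0760),   -- Mumbai (longitude, latitude)
         ST_GEOGPOINT(73.8567, 18.5204)    -- Pune (longitude, latitude)
       ) AS distance_in_meters;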

How does BigQuery handle data privacy and security?

BigQuery provides various security features, including data encryption at rest and in transit, fine-grained access controls
through IAM, audit logs for tracking activity, and integration with
other Google Cloud services like Cloud Key Management Service for
additional encryption options.

Can you explain the concept of slot reservations in BigQuery?

Slot reservations in BigQuery allow you to reserve a specific number of query execution slots for your project. Reservations provide more
predictable query performance and pricing, ensuring that resources
are available when needed.

What are the different types of pricing models available for BigQuery?
BigQuery offers on-demand pricing, where you pay for the storage
used and the amount of data processed by queries. It also provides
flat-rate pricing with BigQuery slots, allowing for predictable costs
and increased concurrency.

How can you automate BigQuery tasks using Cloud Composer?

Cloud Composer, a managed workflow orchestration service, can be used to automate BigQuery tasks by creating and scheduling
workflows that include BigQuery operations, such as query
execution, data loading, or data export.

Can you explain the concept of BigQuery Omni?

BigQuery Omni is an extension of BigQuery that allows you to analyze data across multiple clouds, including Google Cloud, AWS,
and Azure, using a unified interface. It provides a consistent
experience for querying and analyzing data stored in different cloud
platforms.

What is the purpose of the BigQuery Storage API?

The BigQuery Storage API enables high-performance read and write access to data stored in BigQuery. It allows for efficient data
ingestion, faster data exports, and integration with external tools
and services that need direct access to BigQuery data.

How can you handle schema evolution in BigQuery?

BigQuery can handle schema evolution by allowing you to add new columns to existing tables without modifying the existing data. It
also supports schema inference when querying data, automatically
detecting new columns added to a table.
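
For example, adding a nullable column to an existing table (names are placeholders) does not rewrite the existing rows:

ALTER TABLE mydataset.orders
ADD COLUMN discount_pct NUMERIC;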

Can you explain the concept of time travel in BigQuery?

Time travel in BigQuery allows you to query data at specific points in time within a defined retention period. It provides the ability to
analyze historical data or recover from accidental changes or
deletions within the specified time window.
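
A short sketch (the table name and interval are assumptions) querying a table as it existed one hour ago:

SELECT *
FROM mydataset.orders
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);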

What is the purpose of the BigQuery ML TRANSFORM clause?

The TRANSFORM clause in BigQuery ML, used inside a CREATE MODEL statement, performs feature engineering and data transformation tasks within the
context of machine learning models. It allows you to preprocess
data and create new features before training the ML model.

How does BigQuery handle data consistency in distributed queries?

BigQuery is designed for eventual consistency in distributed queries, meaning that query results may not reflect the latest changes in the
underlying data immediately. However, BigQuery ensures that
queries are consistent within a single table or partition.

Can you explain the concept of BigQuery's query cache?

The query cache in BigQuery stores the results of recent queries and can serve subsequent identical queries directly from the cache,
reducing the need for reprocessing. The cache is automatically
managed by BigQuery and helps improve query performance and
reduce costs.
What is the purpose of the BigQuery Data
Transfer Service for SaaS?

The BigQuery Data Transfer Service for SaaS enables automatic data transfers from supported SaaS applications, such as Salesforce
or Marketo, into BigQuery. It simplifies the process of extracting and
loading data from these sources for analysis and reporting.

How can you monitor and troubleshoot streaming data pipelines in BigQuery?

To monitor and troubleshoot streaming data pipelines in BigQuery, you can review the streaming buffer statistics, monitor streaming
API errors and quotas, use BigQuery's monitoring and logging
integrations, and leverage Cloud Monitoring and Cloud Logging for
more detailed analysis.

Can you explain the concept of BigQuery federated queries?

BigQuery federated queries allow you to query data stored in external sources, such as Google Cloud Storage, Cloud SQL, or Google Sheets, without loading the data into a BigQuery table. It provides
a unified interface for querying both external and internal data
sources.

What is the purpose of the BigQuery Data QnA service?

The BigQuery Data QnA service is a natural language interface that allows users to query and explore data in BigQuery using
conversational language. It leverages machine learning techniques
to understand user queries and provide relevant results.
Can you explain the concept of BigQuery's
workload management?

Workload management in BigQuery allows you to allocate and prioritize resources for different types of queries or workloads. You
can define query priorities, set concurrency limits, and manage
resources to ensure optimal performance and resource allocation.

How does BigQuery handle data skew and hotspots in queries?

BigQuery's query optimizer automatically handles data skew and hotspots by redistributing data during query execution. It
dynamically adjusts the data distribution to ensure balanced
processing across multiple nodes, improving query performance.

What is the purpose of BigQuery ML's ML.EVALUATE function?

The ML.EVALUATE function in BigQuery ML is used to evaluate the performance of a machine learning model by comparing its
predictions against known labels. It provides metrics such as
accuracy, precision, recall, and others to assess the model's quality.
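
A hedged example (the model and holdout table names are hypothetical) of evaluating a trained model against held-out data:

SELECT *
FROM ML.EVALUATE(
  MODEL mydataset.churn_model,
  (SELECT tenure_months, monthly_spend, churned
   FROM mydataset.customer_features_holdout)
);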

Can you explain the concept of BigQuery's billing export?

Cloud Billing export allows you to export detailed Google Cloud billing data into BigQuery tables. It provides granular
information about resource usage, costs, and usage trends, enabling
better cost management and analysis.

How can you automate BigQuery tasks using Cloud Functions?
Cloud Functions, a serverless compute platform, can be used to
automate BigQuery tasks by triggering functions based on events,
such as new data arriving in a storage bucket or a schedule. Cloud
Functions can execute BigQuery queries or perform other actions.
