Take Assessment: Exercise 6: Index Choice and Query Optimization

Take Assessment: Exercise 6

Please answer the following question(s).



Index Choice and Query Optimization

Background

In this exercise, you will gain hands-on experience with query optimization using indexes.
The data you will be working with comes from a simple real estate information system that
stores information about customers, lots, and the lots owned by the customers. The SQL
schema for these tables is:

Customer( customer_id, customer_first_name, customer_last_name )


Lot ( lot_id, lot_description, lot_size, lot_district, lot_value,
lot_street_address )
Customer_lot (lot_id, customer_id)

You must design the access methods for the database such that the best possible
performance is achieved under a variety of operating conditions. These operating conditions
are the different types of queries that will run on your system. The queries are listed
below.

1. Selecting the lot_id for all lots in a given size range. An example of such a query
would be select lot_id from lot where lot_size between 300 and 15000;
2. Selecting the lot_id for all lots in a given value range. An example of such a query
would be select lot_id from lot where lot_value between 3000 and 15000;
3. Selecting all of the information for a specific customer. An example of such a query
would be select * from customer where customer_id=12;
4. Inserting new customer or lot data. An example of such a query would be insert into
customer values (250001, 'Vince', 'Smith' );
5. Deleting a row of customer or lot data. An example of such a query would be delete
from customer where customer_id='250001';
6. Updating a row of customer or lot data. An example of such a query would be
update customer set customer_first_name='Vinny' where customer_id='249001';
7. Selecting the average lot size of all lots. An example of such a query would be
select avg(lot_size) from lot;

Your Tasks

1. Optimize the real estate tables above. Perform the following operations.
1. Build and populate these tables using the instructions for setting up the real
estate database.
2. Analyze the storage characteristics of the tables when stored in the
database. You are to submit the most accurate estimate of the number of
tuples for each table and the number of disk pages that are used to store
each table. Submit the code necessary to gather this information as well as
the output of these queries. What do you calculate as the blocking factor for
each table?
3. Analyze the run time characteristics of each query given above. Report both
PostgreSQL's estimate and the actual total cost expressed in the number of
disk accesses and the amount of time that each query takes to run. Submit
the code necessary for gathering this data.
4. For each query above, suggest an index, if applicable, to improve
performance. To answer this question completely, you must state what
columns you are indexing and what index you will use. You must fully
defend your choice with a complete explanation. If an index is not
appropriate for a query, clearly state why. Note that you may only use
indexes supported in PostgreSQL, e.g. hash or b-tree. You may also use
clustering.
5. Implement the indexes that you proposed in the previous step and analyze the run-time
performance of the same queries that you ran in step three. Submit a table showing the
estimated and actual total cost, expressed in the number of disk page accesses, and the
query run times after the index was used. Also compute the percentage increase or
decrease in performance (based on run time) that results from indexing the tables. If no
performance benefit was gained from the index, you must explain why. You may use the
following table as a basis:
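       |          Without Index           |            With Index            |
Query  | Estimated | Actual    | Actual   | Estimated | Actual    | Actual   | Performance
Number | Disk      | Disk      | Run Time | Disk      | Disk      | Run Time | Improvement
       | Accesses  | Accesses  |          | Accesses  | Accesses  |          |
-------+-----------+-----------+----------+-----------+-----------+----------+------------
1      |           |           |          |           |           |          |
2      |           |           |          |           |           |          |
3      |           |           |          |           |           |          |
4      |           |           |          |           |           |          |
5      |           |           |          |           |           |          |
6      |           |           |          |           |           |          |
7      |           |           |          |           |           |          |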
6. Given the data that you have now obtained, which queries do index
structures slow down and which do they speed up? If query types 1, 2, and
3 are common (occurring 75% of the time) and type 4 is uncommon
(occurring 25% of the time), does the increase in performance for some
queries outweigh the decrease for others? How would your opinion change
if the ratio were closer to 50% for queries 1, 2, and 3, and 50% for type 4?
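
As a rough sketch of how the measurements for task 1 might be gathered: the statements below assume the tables were created with the lowercase names customer, lot, and customer_lot, and that no other schema contains tables with the same names. The index name lot_size_idx is hypothetical, and the b-tree on lot_size is shown only to illustrate the before-and-after workflow, not as the expected answer.

```sql
-- Refresh planner statistics so reltuples and relpages are current.
ANALYZE customer;
ANALYZE lot;
ANALYZE customer_lot;

-- Estimated tuple count, page count, and blocking factor (tuples per page).
-- NULLIF avoids a division by zero for an empty table.
SELECT relname,
       reltuples,
       relpages,
       reltuples / NULLIF(relpages, 0) AS blocking_factor
FROM pg_class
WHERE relname IN ('customer', 'lot', 'customer_lot');

-- Planner estimate plus actual run time for query 1, before indexing.
EXPLAIN ANALYZE
SELECT lot_id FROM lot WHERE lot_size BETWEEN 300 AND 15000;

-- Hypothetical b-tree index on the range column (the name is an assumption).
CREATE INDEX lot_size_idx ON lot USING btree (lot_size);
ANALYZE lot;

-- Same query after indexing, for the "With Index" columns of the table.
EXPLAIN ANALYZE
SELECT lot_id FROM lot WHERE lot_size BETWEEN 300 AND 15000;
```

One common way to express the improvement is (time without index - time with index) / time without index * 100, so a negative value indicates a slowdown.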
2. You are designing a database to store sensor records for a research study. For the
first year, the database will mostly be populated via insert statements gathered
from these sensors. After the first year, the database will mainly be queried for data
via select statements. You must decide whether to index the database initially or
wait until after the first year. From what you have witnessed, would you initially
index the table or wait until after the first year? Explain.
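
One way to explore this trade-off empirically is sketched below. Everything in it is an assumption made for illustration: the table sensor_reading, its columns, the index name, and the row count are hypothetical, and generate_series is used only to synthesize test rows on a reasonably recent PostgreSQL.

```sql
-- Hypothetical sensor table; names and types are assumptions.
CREATE TABLE sensor_reading (
    sensor_id   integer          NOT NULL,
    recorded_at timestamp        NOT NULL,
    reading     double precision NOT NULL
);

-- Option A: load into the unindexed table, then build the index once.
-- In psql, \timing on reports the elapsed time of each statement.
INSERT INTO sensor_reading (sensor_id, recorded_at, reading)
SELECT n % 50,
       timestamp '2004-01-01 00:00:00' + n * interval '1 second',
       random()
FROM generate_series(1, 100000) AS g(n);

CREATE INDEX sensor_reading_time_idx ON sensor_reading (recorded_at);

-- Option B: empty the table (the index definition stays in place) and
-- reload, so that every inserted row also has to update the index.
TRUNCATE sensor_reading;

INSERT INTO sensor_reading (sensor_id, recorded_at, reading)
SELECT n % 50,
       timestamp '2004-01-01 00:00:00' + n * interval '1 second',
       random()
FROM generate_series(1, 100000) AS g(n);
```

Comparing the load time of option B against the combined load and index-build time of option A gives a concrete number to support whichever answer you argue for.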
3. You are employed by a local hospital as one of a team of database professionals. A
colleague has implemented a table in PostgreSQL that stores information about patients.
You are asked to obtain all records for female patients, so you write the query
select * from patient where gender='f'; and notice that its performance is poor.
Your colleague who implemented the patient table is currently away, so it is your
responsibility to determine what is wrong. You first describe the patient table:

hospital=# \d patient
Table "patient"
Column | Type | Modifiers
--------------------+--------------+-------------
id | integer | not null
firstname | text | not null
lastname | text | not null
title | text |
admissiondate | date |
address | text |
gender | char | default 'f'
Indexes: patient_gender,
patient_id,
patient_firstname,
patient_lastname
Primary key: patient_pkey

You notice that the gender column is indexed, but you are still unsure why query
performance is poor. You decide to run explain on the above query.
The output generated by explain is:

hospital=# explain SELECT * FROM patient WHERE gender='f';


NOTICE: QUERY PLAN:

Seq Scan on patient (cost=0.00..173.07 rows=6406 width=70)

EXPLAIN

You now understand why the query performance is suffering. Even though there is
an index on gender, it appears as though it is not being used. You issue a query to
count the number of males and females in the table. You find that the distribution is
almost exactly 50% males and 50% females. Explain why the index is not being
used. Will clustering the patient_gender index help performance? Please explain.
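
The checks described in this scenario could be reproduced along the following lines. This is a sketch that assumes the patient table shown above and a PostgreSQL version that provides the pg_stats view and the enable_seqscan setting. It does not answer the question, but it exposes the numbers the planner is working from.

```sql
-- Count rows per gender to see how selective the predicate is.
SELECT gender, count(*) AS patients
FROM patient
GROUP BY gender;

-- The planner's own view of the column, as recorded by the last ANALYZE.
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'patient' AND attname = 'gender';

-- For comparison only: discourage sequential scans in this session and
-- look at the cost the planner assigns to the alternative plan.
SET enable_seqscan = off;
EXPLAIN SELECT * FROM patient WHERE gender = 'f';
RESET enable_seqscan;
```

Comparing the cost reported with enable_seqscan turned off against the 173.07 of the sequential scan shows how the planner ranks the two access paths for this predicate.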

4. Recall that PostgreSQL stores statistics about tables in the system table called
pg_class. The query planner accesses this table for every query. These statistics
may only be updated using the analyze command. If the analyze command is not
run often, the statistics in this table may not be accurate and the query planner
may make poor decisions which can degrade system performance. Another strategy
is for the query planner to generate these statistics for each query (including
selects, inserts, updates, and deletes). This approach would allow the query planner
to have the most up-to-date statistics possible. Why doesn't PostgreSQL do this?
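
To get a feel for what refreshing the statistics involves, a sketch like the following can be run in psql. It assumes the customer table from the first task is present. \timing is a psql meta-command, so it is mentioned only in a comment.

```sql
-- Statistics as currently stored for the table.
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'customer';

-- Refreshing them reads a sample of the table, which is work in itself.
-- In psql, \timing on shows how long this statement takes.
ANALYZE VERBOSE customer;

-- Stored values after the refresh.
SELECT relname, reltuples, relpages
FROM pg_class
WHERE relname = 'customer';
```

On installations where the autovacuum daemon is enabled, the stored values may already have been refreshed in the background, so the before and after numbers can match.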

Submission

Submit answers to all of these questions in a file named indexes.txt.

To help yourself do your best on this assessment, consult this general list of grading
guidelines.

© Copyright 2004 iCarnegie, Inc. All rights reserved.


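Exercise 10: Index Choice and Query Optimization

Background
In this exercise, you will gain hands-on experience with query optimization using indexes.
The data you will work with comes from a simple real estate information system that stores
information about customers, lots, and the lots owned by the customers. The SQL schema
for these tables is: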
Customer( customer_id, customer_first_name,
customer_last_name )
Lot ( lot_id, lot_description, lot_size, lot_district,
lot_value, lot_street_address )
Customer_lot (lot_id, customer_id)
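You must design the access methods for the database so that the best possible
performance is achieved under a variety of operating conditions. These operating conditions
are the different types of queries that run on your system. The queries are listed below.
1. Selecting the lot_id for all lots in a given size range. An example of such a query
would be select lot_id from lot where lot_size between 300 and 15000;
2. Selecting the lot_id for all lots in a given value range. An example of such a query
would be select lot_id from lot where lot_value between 300 and 15000;
3. Selecting all of the information for a specific customer. An example of such a query
would be select * from customer where customer_id=12;
4. Inserting new customer or lot data. An example of such a query would be insert into
customer values (250001, 'Vince', 'Smith' );
5. Deleting a row of customer or lot data. An example of such a query would be delete
from customer where customer_id='250001';
6. Updating a row of customer or lot data. An example of such a query would be
update customer set customer_first_name='Vinny' where customer_id='249001';
7. Selecting the average lot size of all lots. An example of such a query would be
select avg(lot_size) from lot;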
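Your Tasks
1. Optimize the real estate tables above. Perform the following operations.
1. Build and populate these tables using the instructions for setting up the real
estate database.
2. Analyze the storage characteristics of the tables when stored in the database. Submit
the most accurate estimate of the number of tuples in each table and the number of disk
pages used to store each table. Submit the code needed to gather this information as well
as the output of these queries. What do you calculate as the blocking factor for each
table?
3. Analyze the run-time characteristics of each query given above. Report PostgreSQL's
estimate, the actual number of disk accesses, and the total time each query takes to run.
Submit the code needed to gather this data.
4. For each query above, suggest an index, if applicable, to improve performance. To
answer this question completely, you must state which columns you are indexing and which
index you will use. You must fully defend your choice with a complete explanation. If an
index is not appropriate for a query, clearly state why. Note that you may only use
indexes supported in PostgreSQL, e.g. hash or b-tree. You may also use clustering.
5. Implement the indexes you proposed and analyze the run-time performance of the queries
you ran in step 3. Submit a table showing the estimated and actual total number of disk
page accesses and the query run times after the index was used. Based on run time, compute
the percentage increase or decrease in performance that results from indexing the tables.
If performance did not improve because of the index, you must explain why. You may use
the following table as a basis.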
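       |          Without Index           |            With Index            |
Query  | Estimated | Actual    | Actual   | Estimated | Actual    | Actual   | Performance
Number | Disk      | Disk      | Run Time | Disk      | Disk      | Run Time | Improvement
       | Accesses  | Accesses  |          | Accesses  | Accesses  |          |
-------+-----------+-----------+----------+-----------+-----------+----------+------------
1      |           |           |          |           |           |          |
2      |           |           |          |           |           |          |
3      |           |           |          |           |           |          |
4      |           |           |          |           |           |          |
5      |           |           |          |           |           |          |
6      |           |           |          |           |           |          |
7      |           |           |          |           |           |          |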

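6. Given the data you have obtained, which queries do index structures slow down and
which do they speed up? If query types 1, 2, and 3 are common (occurring 75% of the time)
and type 4 is uncommon (occurring 25% of the time), does the increase in performance for
some queries outweigh the decrease for others? How would your opinion change if the ratio
were closer to 50% for queries 1, 2, and 3, and 50% for type 4?
2. You are designing a database to store sensor records for a research study. During the
first year, the database is populated via insert statements with records gathered from
the sensors. After the first year, the database is mainly queried via select statements.
You must decide whether to index the database initially or wait until after the first
year. From what you have observed, would you index the database initially or wait until
after the first year? Explain.
3. Suppose you are employed by a local hospital as a member of a team of database
professionals. A colleague has implemented a table in PostgreSQL that stores patient
information. You are asked to obtain the records of all female patients. You write the
query select * from patient where gender='f'; and notice that query performance is poor.
The colleague who implemented the patient table is currently away, so it is your
responsibility to determine what is wrong. You first describe the patient table.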
hospital=# \d patient
Table "patient"
Column | Type | Modifiers
--------------------+--------------+-------------
id | integer | not null
firstname | text | not null
lastname | text | not null
title | text |
admissiondate | date |
address | text |
gender | char | default 'f'
Indexes: patient_gender,
patient_id,
patient_firstname,
patient_lastname
Primary key: patient_pkey
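You notice that the gender column is indexed, but you are still unsure why query
performance is poor. You decide to run explain on the above query. The output generated
by explain is: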
hospital=# explain SELECT * FROM patient WHERE gender='f';
NOTICE: QUERY PLAN:
Seq Scan on patient (cost=0.00..173.07 rows=6406 width=70)
EXPLAIN
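You now understand why query performance is so poor. Even though there is an index on
gender, it does not appear to be used. You issue a query to count the number of male and
female patients in the table. You find that the distribution is 50% each. Explain why the
index is not being used. Will clustering on patient_gender help performance? Please
explain.
4. Recall that PostgreSQL stores table statistics in a system table called pg_class. The
query planner accesses this table for every query. These statistics can only be updated
using the analyze command. If the analyze command is not run often, the statistics in the
table may be inaccurate and the query planner may make poor decisions that degrade system
performance. Another strategy is to generate these statistics for every query (including
selects, inserts, updates, and deletes). This approach would give the query planner the
most up-to-date statistics. Why doesn't PostgreSQL do this?
Submission
Submit the answers to all of the questions in a file named indexes.txt.
Complete this assessment independently; you may consult the guidelines.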
