|
SQL Optimization
I. Overview
After learning the fundamentals of SQL, it's fairly easy to create
queries that allow you to selectively view the contents of one or more
tables. However, even though these queries may have taken very little
time to create, you may end up being surprised at how long they can
take to execute. In fact, it's very commonplace for the initial
delight of SQL to vanish quickly when the first query takes hours to
complete! This is due to two factors. First of all, SQL access through
an ODBC Driver will never be quite as fast as native manipulation of
the database. This makes sense as the ODBC Driver adds another layer
(or middleman) to the whole process. The more layers that are added,
the slower the process will be. However, it's also very common to
over-estimate this impact. Most of the time, the poor performance is
actually due to poorly optimized queries rather than a slow ODBC
Driver. The lack of optimization in a query is the number one reason
for those lengthy queries. It's very possible for an optimized query
to run over 6000% faster than a non-optimized query. Therefore, as
speed (and therefore usability) is always an issue, it's well worth it
to spend some time examining and optimizing your queries. Because it
is so important, the process of creating a query should include
careful planning up front and optimization verification before the
process is complete.
Because there are only a few techniques for optimization queries,
it would seem to be an easy job at first. After all, there are only a
few things to check, right? Although that's correct, some of the
techniques can be very time-consuming and require a good understanding
of the tables, Data Dictionary, and index theory.
The techniques for optimization are outlined below.
II. Ensuring Proper Use of Indices
Effectively using a table's indices is probably the most difficult
technique to employ. However, it also produces the best results and
is the most important technique for this reason. Before continuing,
it's worthwhile to quickly go over an index's purpose.
Index overview:
An index, or a key, is a field or combination of fields used to
identify a record. There are two main types of indexes - a primary
index and a secondary index. A primary index uniquely identifies a
record. In other words, no other record in the table will have that
same index. A non-unique or secondary index does not uniquely identify
a record. Two classic examples illustrate the point. A customer number
field uniquely identifies the customer, as no two customers can have
the same customer number. Therefore, the customer number field can be
used for the primary index. A customer's Zip Code, on the other hand,
is not unique. This makes sense, as many customers can be located
within the same Zip Code. Therefore, the customer's Zip Code may be
used as a secondary index.
Now that we've covered the two types of indices, what are they used
for? In general indices are used to enable the database engine to find
and sort records quickly. An index allows the engine to jump directly
to where the data is located, as opposed to reading every single
record to see if it matches your search criteria. As an example, most
books have an index. A book's index allows you to look up a particular
word, and informs you of every page that the word exists on.
Therefore, using a book's index allows you to quickly find all
occurrences of that particular word in the book. It's much quicker to
find the word this way than it would be to read the entire book,
marking every page that contains the word. It's very similar with
databases. An index allows the database engine to find records
quickly, as it enables the engine to jump to the locations in the file
that contain a desired record. Without the use of indices, the engine
would have to read every record to see if it matched the required
criteria. Armed with this knowledge, it's easy to see that indices are
imperative for good performance of a database, particularly with a
relational database or when dealing with a very large table.
Indices can vastly improve the speed of database operations - but
how and when they are implemented plays an integral part in realizing
performance gains. Because their purpose is to enable the database
engine to find records and sort them quickly, the likely candidates
for indexing are those fields that are frequently searched, sorted, or
used to join tables. However, it's still a good idea to be somewhat
judicious with the choice of fields to index. Whenever a record is
added or modified, the database engine has more work to do if there
are many indexed fields. This is because the indices must be updated
every time this sort of action takes place. If most of the table
access is spent querying, this won't impact performance. However, if
the table's records are frequently modified, there should be a
reasonable limit to the number of fields indexed.
Indices may be comprised of a single field, or multiple
fields. Which type of index to use should be determined by the types
of SELECTs, ORDER BYs, and JOINs that are used most often. In other
words, multiple field indices are useful when searching on multiple
fields (as in a LastName+FirstName example), and single field indices
are useful when searching on a single field (as in a Customer_Number
example).
Once the table's indices have been defined, queries to the table
can be optimized to take advantage of these indices. Generally
speaking, all of your SQL queries should be optimized so that the
database engine can use existing indices to find, sort, and join
records and tables quickly. This means that indexed fields should
always be used in the SQL statement's WHERE and ORDER BY clauses. If
the WHERE clause references an index, the records can be retrieved
from the file quickly, as in the book's index example. If the result
set is to be sorted, then the ORDER BY clause should also reference an
indexed field, as the data will be sorted much more quickly that way.
After the query has been created, the next step is to execute it
and verify that the database engine was able to optimize the query by
using the proper indices. With the BASIS ODBC Driver, this process is
accomplished by examining the log file after a query has been
performed. To illustrate this, we can perform both an indexed and
non-indexed query:
Indexed query:
SELECT * from CALL_LOG where ID > '0000022000' and TIME < 9
This will select all of the
records from the CALL_LOG table where the call came in before
9:00 AM and the call ID is greater than '0000022000' Because the
primary index is based on the call ID, the ODBC Driver can
retrieve the records quickly. Here's the excerpt from the log
file:
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/call_log
order_knum=-1
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=0, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
Note that the log file contains several pieces of information to
indicate the optimization strategy used. First of all, it says
[SELECTED] to indicate that it was able to select an index to optimize
on. Secondly, it lists two possible options to optimize off of - one
for the ID field and one for the TIME field. The ID field is listed
first, and the knum=0, kseg=1 indicates that it is the primary key (as
it's the first key in the first keychain). The second line listed is
for the TIME field. As its knum and kseg are -1, this indicates that
there aren't any indices that are based on the TIME field. In the
above example, the ODBC Driver used the primary index based on the
call ID field to optimize on. This is indicated by the '!' preceding
its line.
Non-Indexed query:
SELECT * from CALL_LOG where TIME < 9
This will select all of the records from the CALL_LOG table where
the call came in before 9:00 AM. As there aren't any indexes based on
the time field, the ODBC Driver isn't able to optimize the
query. Here's the excerpt from the log file:
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/call_log
order_knum=-1
Predicate: 1 constraints
* : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
Note that it doesn't have a
[SELECTED] or a '!' to indicated a selected index.
A more complex example:
SELECT CALL_TRN.TECH
CALL_TRN.TRN_DATE from CALL_LOG, CALL_TRN where CALL_LOG.ID
between '0000020000' and '0000022000 'and CALL_LOG.TIME <
9 and CALL_LOG.TECH='ABC' and CALL_LOG.ID=CALL_TRN.ID order
by CALL_TRN.TECH
This query is more complex than the previous examples as it does a
join on the CALL_LOG and CALL_TRN tables, it references the
CALL_LOG.ID field twice (for the between), and it does an order
by. Here's the log file excerpt:
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/call_log
order_knum=-1
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=0, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=0, kseg=1) ID (bracket tail)
Predicate: 1 constraints
* : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
Predicate: 1 constraints
* : (file 1) (knum=0, kseg=1) ID (partial key: knum=0)
(file 2) = f:/odbc/Local_Call_Hist/data/call_trn
order_knum=-1
[Selected] Predicate: 2 constraints
: (file 1) (knum=0, kseg=1) ID (primary key)
*!: (file 2) (knum=0, kseg=1) ID (partial key: knum=0)
The logfile shows that the ODBC Driver did more work with this
query in order to figure out the best field to optimize on. In fact,
it checked almost every field mentioned in the query. The first two
entries for the first file relate to the ID field. There are two
entries for this field, a bracket head and a bracket tail. This is
because the field was used twice for comparison. Additionally, note
that that all entries list the primary index (ID) after their knum and
kseg portion to indicate the primary key. The third entry could not be
used as there isn't an index associated with that field, as mentioned
above. The fourth entry says that it could have chosen to do a partial
key read based on the primary key. For the second file, it lists two
entries. The first is the primary key for the first table. The second
entry (which was picked) shows that it was able to do a partial key
read off of the primary key. Note that this is quite a bit different
from the first table. For the first table, it keyed directly off of
ID, but it says that it can only to a partial key read off of ID for
the second table. The reason for this is that the primary key for the
second file is based off of multiple fields. So, since CALL_TRN's
primary index was comprised of the ID field followed by two others,
the ODBC Driver couldn't key directly off of the primary key. Instead,
it could do a partial key read because the query referenced the first
portion of the primary key.
Comparison of indexed vs. non-indexed queries:
To demonstrate the importance of indices, two identical queries
were performed on a table and timed. The only difference between the
tests is that the first query referenced the primary index, and the
second query did not reference any indices. Here is the query:
select * from CALL_TRN where ID > '0000022670'
To query on the primary index, the table was defined in the Data
Dictionary to have an index for the ID field, and the physical data
file used the ID field as its primary key. Before the query was run a
second time, the Data Dictionary was modified so that the table's
index used a different field. The physical data file was also
re-written so that the ID field was not specified as a key. Here are
the results:
Query on indexed field: 0.25 seconds
Query on non-indexed field: 14.71 seconds
This means that the indexed query was 5884% faster than the
non-indexed query. Quite impressive, isn't it? It's worthwhile to
mention that this query was structured specifically to provide a large
discrepancy in the result times. In other words, performance increases
aren't going to always be in the 6000% range - it can vary greatly
depending on the query and the data in question. For this example, the
select will only return records where the record ID is greater than
'0000022670'. When looking at the data file, it turns out that only
239 records out of 48,145 match this criterion. This means that the
query will only return about one half of one percent of the records -
a fairly small amount. As the indexed query can jump directly to the
records in the data file that match the criterion, it only has to deal
with 239 records. The non-indexed query, on the other hand, has to
wade through 48,145 records to see if they match the criterion.
Multiple-field indices:
If the index for a table is comprised of multiple fields that have
been concatenated together, then some optimization can take place if
one rule is adhered to:
In order to include a field that is part of the index, you have to
include every field before it
As an example, let's say the index is comprised of three fields:
Customer_Number, Last_Name, and First_Name. The ODBC driver could not
perform any optimizations if the WHERE clause just included the
Last_Name field. If a query has to be done on the Last_Name field,
then the Customer_Number field should be included as well. Likewise,
if the query used the First_Name field it should include the other two
in order to ensure proper optimization.
What about adding indices to a Single Keyed MKEYED file?
With a multi-keyed MKEYED file, it possible to have several indices
defined. However, with a single-keyed MKEYED file, only one index is
allowed. This means all queries run against that table must reference
one of the fields referenced by the primary key according to the rule
outlined above. If a query is executed that does not follow that rule,
no optimization can take place. There are two possible workarounds to
this problem. The first is to start using multi-keyed MKEYED files
instead, as they are not limited to a single index. The second method
involves creating another file. This new file will have an index
defined using the first table's non-indexed field and one of its
indexed fields. The idea is that the new query can do a join on the
old table and the new table. The select will reference the desired
field in the new table, which is part of its index. As it will also
return the value of an indexed field, this information can be used to
find the desired records in the original table via the index.
Structuring the query properly:
Even though the query may be constructed to take advantage of
indexed fields, there are two other parameters that determine how
quickly the query will be able to be performed. The two parameters are
the order of the fields in the WHERE clause, and how simple the
comparison clause is. This is best demonstrated by an example. The
goal of the query is to find all of the information for each call
during a particular data range where the call is of a particular
priority level. Here's the first query:
select * from hist_log where
PRIORITY= '4' and CALL_DATE between {d '1997-07-07'} and {d
'1997-07-07'}
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/hist_log
order_knum=-1
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=7, kseg=1) ID (Alternate key: knum=7)
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket tail)
From the log file, we can tell that it chose the PRIORITY field to
optimize on. For this particular file, both the PRIORITY field and the
CALL_DATE field are secondary indices. The ODBC Driver chose the
PRIORITY field for two reasons
- That field was listed first in the WHERE clause
- The comparison for that field
was much simpler than for the CALL_DATE field
The latter reason is particularly
important. Because the CALL_DATE field was used for a more
complex comparison (thanks to the BETWEEN), the PRIORITY field
was chosen for optimization. This is made clearer by the
following query:
select * from hist_log where
CALL_DATE between {d '1997-07-07'} and {d '1997-07-07'} and
PRIORITY = '4'
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/hist_log
order_knum=-1
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket tail)
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=7, kseg=1) ID (Alternate key: knum=7)
In this case, the PRIORITY field was listed last in the WHERE
clause, but was still chosen over the CALL_DATE field as the
comparison was much simpler.
Now, what does all of this have to do with optimization? The answer
is that even though these two indices appear equal (as they are both
secondary indices), we can get far better performance by optimizing
off of the CALL_DATE field. The reason is simple - there are thousands
of records in the data file with a priority of '4', but only a couple
dozen within the specified date range. In other words, if the
optimization is based on the PRIORITY field, the ODBC Driver will end
up reading and comparing thousands of records to the specified
date. However, if the optimization is based on the CALL_DATE field,
the ODBC Driver will read and compare about 30 records to the
specified priority level. To put this in perspective, the query is
tried a third time:
select * from hist_log where
CALL_DATE between {d '1997-07-07'} and {d '1997-07-07'} and
PRIORITY like '4%'
OPTIMIZATION STRATEGY:
(file 1) = f:/odbc/Local_Call_Hist/data/hist_log
order_knum=-1
[Selected] Predicate: 1 constraints
*!: (file 1) (knum=2, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=2, kseg=1) ID (bracket tail)
Predicate: 1 constraints
* : (file 1) (knum=-1, kseg=-1) ID (no bracketing)
Predicate: 1 constraints
* : (file 1) (knum=7, kseg=1) ID (bracket head)
Predicate: 1 constraints
* : (file 1) (knum=7, kseg=1) ID (bracket tail)
This time, the query was able to 'force' the ODBC driver into
selecting the CALL_DATE field for optimization. To accomplish this,
the CALL_DATE field was listed first in the WHERE clause. However, as
demonstrated in the second query, that wasn't quite enough. The final
step was to make the PRIORITY field look much less attractive to the
ODBC Driver. This was done by making the comparison more complex - it
was changed from a equals to a like. Because of this, the CALL_DATE
field was chosen for optimization. As mentioned above, this can
drastically reduce the amount of work that the ODBC Driver has to
do. Here are the timing results:
Query optimized on PRIORITY: 3.25 seconds
Query optimized on CALL_DATE: 0.05 seconds
The timings indicate that tuning the query resulting in a 6500%
increase in speed - quite a dramatic improvement! In fact, the
performance gain was better with this query than it was for the query
that was comparing an indexed and non-indexed WHERE clause. This is
important because it points out that even though a query may be
referencing and optimizing off of indexed fields, it still may not be
performing as well as it could be. This type of tweaking can have
profound effect on the speed of the query, but also requires a great
deal of knowledge about the table and the data that is being queried.
The previous example illustrated that nicely - the performance gain
was only realized because the ODBC Driver was forced to optimize on a
particular field. This means that the user had to do some extra work
(and thinking!) to make the ODBC Driver optimize off of a particular
field. Additionally, the user knew ahead of time that the CALL_DATE
field would be a far better choice, as it would substantially limit
the amount of data that the ODBC Driver had to deal with. Without a
good knowledge of the table's contents, this type of optimization
would have never taken place.
Review:
To recap, referencing indexed fields in your queries is the most
effective way to ensure fast performance. If frequently run queries
can't be optimized because they don't reference indexed fields, it may
be worthwhile to restructure the table by adding indices to ensure
faster performance. Note that this is normally a two-step process. The
first step is to create the index in the Data Dictionary, and the next
is to re-key the physical BBx data file so that it has the new field
defined as a key. When done via PRO/5, this involves creating a new
file with a new key definition in the MKEYED statement that includes
an index for the desired field. Once the new file has been created, a
short program must be run - one that reads all of the records from the
old file and writes them into the new file. With the BASIS ODBC
Driver, however, it is possible to complete this operation in a single
step via the CREATE INDEX sql statement. When this type of sql
statement is executed, the ODBC Driver will not only modify the Data
Dictionary to include the new index, but it will also re-key the file.
III. Limiting the Amount of Data Returned
A often-overlooked way of improving query times is to only return
the data that is actually needed. It's very common to browse the
entire table, or simply do a 'SELECT * from table'. While it may
sometimes be necessary to view every single field of every single
record, in many cases it's done just because it's easier. Regardless
of whether all of the data is going to be used or not. Therefore,
before doing a query it's wise to plan out what fields and records are
actually needed, and have the query return only the necessary
data. This means that the WHERE clause should be constructed to filter
out as much un-needed data as possible. Additionally, if only a
portion of the fields are required, stay away from the common 'SELECT
* ...' type of query. Instead, have the SELECT reference only the
necessary fields, resulting in much less data being processed and
faster queries. This is especially important when the table has a
large number of columns, or columns that are defined as large
character fields. If some of these columns are not necessary for the
result table, eliminating them from the query will result in much less
data being transferred - which translates into faster
queries. Following is an example that demonstrates this:
The goal of this query is to find out all of the call reference
numbers that were handled by a particular technician. This query could
be accomplished two ways:
select * from hist_log where TECH = 'ABC'"
select id from hist_log where TECH = 'ABC'"
The first way asks for every field for each record returned. The
goal of the query was just to find out which calls were taken by the
technician, though, so only the ID field needs to be returned. The
second query takes advantage of that fact. Here are the result times:
Selecting all fields: 8.09 seconds
Selecting the ID field: 1.28 seconds
In this case, retrieving only the necessary data made the query run
632% faster!
IV. SQLPASSTHROUGH Mode
If at all possible, a SQLPASSTHROUGH type query should be
performed. This is usually an option for the type of query that the
3rd party application (like MS Access or MS Visual Basic) will
perform. With this type of query, the BASIS ODBC Driver is in charge
of optimizing the query instead of the other application. As the ODBC
Driver has a better knowledge of the PRO/5 file's indices and its own
capabilities for optimization, it will usually do a much better job of
selecting and sorting data than the other application would.
V. Data Source Configurations
The 'Fast Connect' option can help reduce query time, especially
with larger Data Dictionaries. Turning on this option does several
things to improve the connection time to the data source. The option
that makes the biggest time difference is the 'No Shadow Data
Dictionary Consistency Check', which is also available as a separate
option.
The shadow DD files represent a form of the Data Dictionary that is
customized for the ODBC Driver. They present the data in a format that
the ODBC Driver is more readily able to digest. They store type
information about the DD and information about the data types that are
supported. They also leave out a lot of information from the DD that
is Taos-specific. Because of this, the shadow DD files are typically
much smaller than the full-blown DD files.
When a BASIS ODBC Data source is initially connected to, the ODBC
Driver will read information from the Shadow DD files to find out
information about the database's tables, fields, indices, etc. If
neither the 'No Shadow Data Dictionary Consistency Check' or the 'Fast
Connect' options are checked, then the ODBC driver will re-create the
Shadow DD files. This means that it will read the entire DD, figure
out all of the pertinent information for the Shadow DD files, then
create and populate the Shadow DD files. Needless to say, this process
can take some time. The larger the Data Dictionary, the longer it
takes to create the Shadow DD files. So, if the ODBC Driver is
allowed to bypass the create process, it will save quite a bit of time
on the initial connect. The Shadow DD files only need to be recreated
if the original DD changes. In general, the Shadow DD files shouldn't
be created on every connection to the Data Source. Overall, it's best
(from the performance standpoint) to only recreate the Shadow DD files
when necessary. Therefore, this option should only be turned off if
the original DD was modified (for example, if tables, indices, or
fields have been added, dropped or modified).
VI. Data Server versus Mapped Drives
If the data that is being queried resides on a remote machine, it's
usually possible to access it either via the BASIS Data Server or via
a mapped drive. When possible, the BASIS Data Server should be used
instead of a mapped drive, as queries will be several times faster via
the Data Server. This is because the Data Server can reduce the amount
of data that is sent over the network. When reading information from a
share (a mapped drive), protocol demands that locking information be
sent back and forth so that the ODBC driver can read the
information. However, when the Data Server is used, it handles the
locking calls. This means that all of that handshaking is done locally
(to that remote machine), and the Data Server will send far less
information over the network. As the network link is relatively slow,
reducing the amount of information passed through this channel can
greatly improve response times.
VII. Machine Speed
ODBC queries can be quite demanding on the workstation's
resources. In particular, the machine's CPU speed and hard disk speed
(for local queries and ORDER BYs) can have a large impact on the
length of time it takes for a query to complete. As a test, a report
consisting of ten queries was run from two machines. The first machine
was a 486 DX2-66 and it took 117 seconds to complete the report. The
second machine was a Pentium 133 and it took 56 seconds to
complete. For this test, both machines were on the same network so
that didn't play a part in the speed increase, which was more than
200%. The deciding factor for this test was the CPU and hard disk
speed. This is due to the fact that the queries polled large amounts
of data, and ORDER BYs and GROUP BYs were used.
|
|