SQLMDX

Eventually it's all about data

Domain indexes. Quick dive.

Posted by sqlmdx on March 15, 2020

Extensible indexing interface

Domain indexes are used to speed up data retrieval for specific domains like text, spatial or image data. Oracle has built in indexes for above mentioned domains but developers can create their own using extensible indexing interface (ODCIIndex).

Let’s consider specific task when this can be useful and discover process end-to-end.

If we have a table with million rows and want to find those containing particular string then there is no way to get a result without scanning each and every row. Well, if table has multiple columns then we can create an index so that Oracle scans entire index instead of entire table but that will not help much.

Of course, we can use Oracle Text technology (which is free and included in all editions) but let’s create our own index to get better understanding how it works under the hood.

Table is populated with a sample data using below script

exec dbms_random.seed(666);

create table t(str) as
select dbms_random.string('A', 20)
from dual connect by level  select * from t where lower(str) like '%oracl%';

STR
--------------------
DRNXIrhioCpVQZtOrACL
pcwGSORACLVlEHCDsvnj

So how to avoid full scan?

In many cases the idea for domain indexing is that original data re-organized and stored in a separate table with B-tree index (usually in index organized table) in a way that it’s possible to take advantage of a B-tree index.

For our specific task we can extract n-char long substrings from every row starting with the first char and moving to the end of the string and store this data along with original rowids in index organized table. In this case auxiliary table will have number of rows equal to average length of the string * number of rows in original table.

Most important, we can use predicates in a form like 'value%' on auxiliary table instead of like '%value%' or instr(…, 'value') on the original one and thus leverage B-tree index access.

If we use domain indexes then Oracle takes care of all the maintenance of auxiliary tables for us. We just need to specify what happens after truncate and DMLs on original table as well as required actions for index create and drop. All this logic is encapsulated in so called implementation type. Also, this type is used to specify how to retrieve rowids for original rows from auxiliary tables for given predicate containing domain index operator.

Source code for implementation type is here: tp_instr.

In order to create domain index, we need to create index type which associates operator and implementation type and operator in turn requires a function. All necessary statements are below.

create or replace function f_instr(col      varchar2,
                                   value    varchar2,
                                   indexctx in sys.odciindexctx,
                                   scanctx  in out tp_instr,
                                   scanflg  in number) return number as
  -- this function is called only when domain index implementation type is not used to evaluate predicate
begin
  if (indexctx.indexinfo is not null) then
    -- this code may be reached when there is an equality predicate by rowid
    -- for example: 
    -- select rowid from t where op_instr(str, 'oracl') = 1 and rowid = 'AAAfLwAAHAAAPGtAET'
    -- raise_application_error(-20000, '!');
    return sign(instr(lower(col), lower(value)));
  else
    -- this code may be reached when table column appears in an expression in domain index operator
    -- for example:
    -- select rowid from t where op_instr(str || '+', 'oracl') = 1
    -- raise_application_error(-20101, 'There must be a domain index of idxtp_instr indextype on specified column');
    return sign(instr(lower(col), lower(value)));
  end if;
end f_instr;
/

create or replace operator op_instr
binding (varchar2, varchar2) return number
with index context, scan context tp_instr using f_instr;
     
create or replace indextype idxtp_instr for 
op_instr (varchar2, varchar2)
using tp_instr;

create index idx_t_str on t(str) indextype is idxtp_instr parallel 8 parameters ('4');

Create index statement creates a service IOT table in this case and that can be done in parallel. In order to handle parallelism we need to take ia.indexparadegree into account in odciindexcreate function.

Also, we provide the token length in service table as ODCI parameters. These parameters supplied only to odciindexcreate function. get_len function was created to get this information from index metadata so that length can be used in other interface routines where it’s needed (odciindexinsert/odciindexstart).

Even length equal to 3 may be sufficient to get candidate rowids for main table from the service table. For given data the expected number of rows when we look for 3-chat substring is 1e6 * (20 – 3 + 1) / power(26, 3) or 100 * (20 – 3 + 1) / power(26,3) = 0.1% of rows.

If we look for 2-char substrings then domain index will not give any noticeable benefit and if we look for a specific char then full scan is definitely better than domain index access.

Actually, operator and function can be simplified and defined without index and scan contexts so it works perfectly fine with below definitions

create or replace function f_instr_(col varchar2, value varchar2)
  return number as
begin
  return sign(instr(lower(col), lower(value)));
end f_instr_;
/

create or replace operator op_instr_ binding
(varchar2, varchar2) return number
using f_instr_;

create or replace indextype idxtp_instr_ for 
op_instr_ (varchar2, varchar2)
using tp_instr;

create index idx_t_str_ on t(str) indextype is idxtp_instr_ parallel 8;

Extensible optimizer interface

Unfortunately, optimizer cannot choose different plans based on literal value for the parameter in operator. It either uses implementation type to evaluate operator or does not regardless of actual literal value or predicate selectivity.

select count(*) from t where op_instr(str, 'o') = 1;

------------------------------------------------------------------------------
| Id  | Operation        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |           |     1 |    21 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE  |           |     1 |    21 |            |          |
|*  2 |   DOMAIN INDEX   | IDX_T_STR | 10000 |   205K|            |          |
------------------------------------------------------------------------------

Explain plan shows 10000 estimated rows because default cardinality is 1%.

To provide better cardinality estimates we can associate statistics with a function used in domain index operator. Selectivity is calculated based on the length of the value using the formula above. This is implemented in a type using ODCIStats interface.

create or replace type tp_instr_stat as object
(
  dummy int,
  static function odcigetinterfaces(ifclist out sys.odciobjectlist)
    return number,

  static function odcistatsselectivity(pred sys.odcipredinfo,
                                       sel  out number,
                                       args sys.odciargdesclist,
                                       strt number,
                                       stop number,
                                       ---
                                       col   varchar2,
                                       value varchar2,
                                       ---
                                       env sys.odcienv) return number
)
/

create or replace type body tp_instr_stat is
  static function odcigetinterfaces(ifclist out sys.odciobjectlist)
    return number is
  begin
    ifclist := sys.odciobjectlist(sys.odciobject('SYS', 'ODCISTATS2'));
    return odciconst.success;
  end odcigetinterfaces;

  static function odcistatsselectivity(pred sys.odcipredinfo,
                                       sel  out number,
                                       args sys.odciargdesclist,
                                       strt number,
                                       stop number,
                                       ---
                                       col   varchar2,
                                       value varchar2,
                                       ---
                                       env sys.odcienv) return number is
    c_avg_len     int := 20;
    c_unique_char int := 26;
    cur_len       int := length(value);
  begin
    sel := 100 * (c_avg_len - cur_len + 1) / power(c_unique_char, cur_len);
    return odciconst.success;
  end;
end;
/

Eventually statistics type is associated with the function.

associate statistics with functions f_instr using tp_instr_stat;
select count(*) from t where op_instr(str, 'orac') = 1;

------------------------------------------------------------------------------
| Id  | Operation        | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------
|   0 | SELECT STATEMENT |           |     1 |    21 |     3   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE  |           |     1 |    21 |            |          |
|*  2 |   DOMAIN INDEX   | IDX_T_STR |    37 |   777 |            |          |
------------------------------------------------------------------------------

Formula gives 37 rows while actual number of rows for this predicate is 38.

As mentioned above Oracle optimizer uses DOMAIN INDEX access regardless the estimated number of rows. For example, the estimated number of rows is 769K for predicate op_instr(str, 'o') = 1 and it’s clear that full scan is better but CBO goes for DOMAIN INDEX.

A couple of examples when Oracle will use TABLE ACCESS BY USER ROWID/TABLE ACCESS FULL instead of DOMAIN INDEX are provided as comments in f_instr. In such cases there must be a logic to evaluate operator without implementation type thus you can see an expression sign(instr(lower(col), lower(value))) in a f_instr. If we want to always rely on implementation type when domain index operator is used then the function may have one-line body with raise_application_error so that an exception is thrown when implementation type is not used.

Additional details about extensible indexing interface and extensible optimizer interface can be found at Database Data Cartridge Developer’s Guide.
Simple example of extensible indexing is available in SQL reference – Using Extensible Indexing.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

 
%d bloggers like this: