Home » Server Options » Text & interMedia » Oracle Text Performance (Oracle Database 12c Enterprise Edition Release 12.1.0.2.0 - 64bit Production)
Oracle Text Performance [message #670763] |
Tue, 24 July 2018 20:19 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
Hi,
This is more of a general question, and I'm using an example that I've seen on this forum before as the basis (and also used when I asked my first and only other question on this forum):
SCOTT@orcl_11gR2> SELECT SCORE(1), user_desc, emp_name
2 FROM emp_master, menu_user_d
3 WHERE CONTAINS (emp_name, '?' || REPLACE (user_desc, ' ', ',?'), 1) > 0
4 ORDER BY SCORE(1) DESC
5 /
SCORE(1) USER_DESC EMP_NAME
---------- ------------- -------------
4 Wajahat Wajahat
4 Imd Imad El Kane
2 Mohammed Arif Md.Arif
3 rows selected.
I am doing something similar to the above for my own project, and it works just fine functionally. However, the tables I'm using (equivalent to emp_master and menu_user_d) are huge -- one of them has around 1 million records, and the other around 1 thousand. Given that the above query essentially forms a cartesian product between the two tables, with my data the performance is quite slow. I am essentially trying to convert fuzzy name matching functionality we currently have in a C++ engine to using Oracle Text instead, and in the C++ code we have various short-circuiting logic (e.g. discarding match-pairs whose initials don't match and such from the processing). I'm sure I could write some sort of UDF in PL/SQL to achieve something similar, but before re-inventing the wheel, I wanted to see if there is any built-in filtering functionality I might be missing, or any other best practices anybody can advise on -- this area isn't my expertise (my background is the C/C++ world), so I apologize in advance if this is a ridiculous question, but I'd really appreciate any pointers. Thanks!
|
|
|
|
Re: Oracle Text Performance [message #670786 is a reply to message #670764] |
Wed, 25 July 2018 15:47 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
Thank you for the quick response, and apologies for my delay in answering your questions. The example was mainly to illustrate the join -- I'm not really using the fuzzy operator, I'm using NDATA functionality. My query looks more like this:
select candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains(target_name, 'ndata(name, ' || candidate_name || ')',1 ) > 0
AND SCORE(1) > 70
ORDER BY candidate_name
There is some preceding set up, but I didn't want to inundate you -- if you want everything I'm doing, let me know and I'll paste it all. But the bottom line is that I agree with you -- comparing these two tables IS extreme. That's what I'm trying to avoid. When you say "loop through one, comparing to the other" wouldn't that still involve a comparison of every record from one table to every record from the other? Or are you saying that by looping through one at a time, you would apply some logic at that point to avoid comparing to every record from the other table? I do have a PL/SQL block that loops through one at a time:
SET SERVEROUTPUT ON
declare
v_target_name target_name_stage.target_name%type;
v_type_cd target_name_stage.type_cd%type;
n_score NUMBER(3);
v_candidate_name candidate_name_stage.candidate_name%type;
begin
for c in (select candidate_name from candidate_name_stage) loop
begin
select c.candidate_name into v_candidate_name from dual;
select target_name, type_cd, score(1) as fuzzy_match_score
into v_target_name, v_type_cd, n_score
from target_name_stage
where contains(target_name, 'ndata(name, ' || c.candidate_name || ')',1 ) > 0
AND SCORE(1) > 70
order by score(1) desc;
exception
when NO_DATA_FOUND then
NULL;
when TOO_MANY_ROWS then
NULL;
when OTHERS then
NULL;
end;
end loop;
dbms_output.put_line(v_target_name);
dbms_output.put_line(v_type_cd);
dbms_output.put_line(n_score);
exception
when NO_DATA_FOUND then
dbms_output.put_line ('No data found for ' || v_candidate_name);
when TOO_MANY_ROWS then
dbms_output.put_line ('Your SELECT statement retrieved multiple rows for ' || v_candidate_name || ' '|| v_target_name || ' ' || v_type_cd || ' ' || n_score);
when OTHERS then
dbms_output.put_line ('some other exception hapenned for ' || v_candidate_name || ' '|| v_target_name || ' ' || v_type_cd || ' ' || n_score);
end;
/
But that doesn't avoid comparing every record to every record. I was essentially looking for common practices to filter some records out to avoid so many comparisons (similar to what I described I was doing in my C++ code). To give you a better understanding of the use case -- essentially I have a list of customer names from our clients (financial institutions), and then I try to see if there are any reasonable fuzzy matches (above a certain score threshold) with these names against whatever target watch list the customer wants to use (e.g. World Check) and then flag the associated account accordingly if so. What I am doing so far is comparing every candidate name to every target name but only returning the ones whose score came back greater than a certain threshold, but I'm sure there is a better way to do this...thanks for any thoughts you may have!
|
|
|
Re: Oracle Text Performance [message #670787 is a reply to message #670786] |
Wed, 25 July 2018 16:39 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
You might try using FUZZY and SOUNDEX and experimenting with the parameters and compare to your NDATA results and speed. The FUZZY and SOUNDEX also do not require a lot of setup or consume a lot of space. I know that NDATA is specifically intended for names, but it is still new and I don't know that they have all of the bugs worked out. You might get similar results and better performance from FUZZY and SOUNDEX.
I suggested looping through one table, comparing to the other, as your PL/SQL block does, because, according to Oracle, Oracle Text is not designed to handle such comparisons. In earlier versions, a query like you are running would work with a small data set, but a large data set would result in a syntax error and not even run, apparently exceeding some limit. They may have fixed that in recent versions. If the query alone is working for you, then there is no need for the PL/SQL block and looping.
You previously mentioned comparing initials to limit the results. If you have those initials stored in columns that you can create a regular index on or you can create a function-based index on that derives the intials for comparison, then you can add that to your filter conditions and the optimizer can use that index to rapidly and probably drastically reduce the result set that it then has to use the Text index on and should drastically increase performance. There are also some cases where structured data comparisons can be combined with non-structured text comparisons in one text index, using SDATA, so that the optimizer can process it with one index hit. If you don't understand or need an example, then please post a realistic example that includes data and how the initials are stored or derived and what results you want based on the comparisons.
|
|
|
Re: Oracle Text Performance [message #670789 is a reply to message #670787] |
Wed, 25 July 2018 21:50 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
Quote:You might get similar results and better performance from FUZZY and SOUNDEX.
Thank you for the above suggestion. I've already evaluated SOUNDEX, and unfortunately this won't work, as my understanding is that SOUNDEX is really only applicable to English, and I need to support any language that can be represented in UTF-8.
Quote:If the query alone is working for you, then there is no need for the PL/SQL block and looping.
Yes, the query alone is in fact working.
Quote:If you don't understand or need an example, then please post a realistic example that includes data and how the initials are stored or derived and what results you want based on the comparisons.
I am pasting a simplistic (but realistic, of course not in data volume) example followed by an explanation:
DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage
values ('John Smith');
DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage
values ('John Smythe', 'P');
insert into target_name_stage
values ('Mary Robinson', 'P');
exec ctx_ddl.drop_preference ('my_ds')
exec ctx_ddl.create_preference ('my_ds', 'MULTI_COLUMN_DATASTORE')
exec ctx_ddl.set_attribute ('my_ds', 'COLUMNS', '''<name>''||target_name||''</name>''')
exec ctx_ddl.drop_section_group ('my_secgrp')
exec ctx_ddl.create_section_group('my_secgrp', 'BASIC_SECTION_GROUP')
exec ctx_ddl.add_ndata_section ('my_secgrp', 'name', 'name')
exec ctx_ddl.drop_preference ('my_lex')
exec ctx_ddl.create_preference('my_lex', 'BASIC_LEXER');
exec ctx_ddl.drop_preference ('my_wl')
exec ctx_ddl.create_preference('my_wl', 'BASIC_WORDLIST');
exec ctx_ddl.drop_stoplist('my_sl');
exec ctx_ddl.create_stoplist('my_sl', 'BASIC_STOPLIST');
drop index target_name_idx
CREATE INDEX target_name_idx ON target_name_stage(target_name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('datastore my_ds section group my_secgrp lexer my_lex wordlist my_wl stoplist my_sl memory 1m sync(on commit)')
/
select count(*) from ctx_user_index_errors;
select candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains(target_name, 'ndata(name, ' || candidate_name || ')',1 ) > 0
AND SCORE(1) > 70
ORDER BY candidate_name
So, in this example, "John Smith" will get compared to "John Smythe", but it will also get compared to "Mary Robinson" which does not make sense to do. What I had mentioned earlier about comparing initials is something I had implemented previously with some hairy C/C++ code, which I'm not sure is directly applicable to what I can do here, but the upshot is that I create a tree structure based on initials of the target names. So, based on the above example, there would be the following trees with the following contents:
J: John Smythe
S: John Smythe
JS: John Smythe
M: Mary Robinson
R: Mary Robinson
MR: Mary Robinson
When the candidate name of "John Smith" comes in, it goes through a hashing function that tells it to go look in the JS tree (not M, R, or MR tree) to avoid the unnecessary comparisons with "Mary Robinson." I'm not necessarily trying to replicate this exact functionality here, but trying to figure out the best approach in a similar vein to just reduce the number of unnecessary comparisons for performance purposes. Does that give you what you were looking for in order to provide an example of the suggestions you mentioned?
|
|
|
Re: Oracle Text Performance [message #670795 is a reply to message #670789] |
Thu, 26 July 2018 01:44 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
You can simplify
exec ctx_ddl.set_attribute ('my_ds', 'COLUMNS', '''<name>''||target_name||''</name>''')
to
exec ctx_ddl.set_attribute ('my_ds', 'COLUMNS', 'target_name name')
All you need to do is provide the column alias and the multi_column_datastore automatically adds the tags behind the scenes. There is no need for duplicate tags.
Instead of
where contains(...,1)>0
and score(1)>70
just use
where contains(...,1)>70
In the following demonstration, I have used virtual columns and indexes on those columns, then compared the virtual columns. Another method would be to create function-based indexes on the original columns and compare using the functions. Using where cinits=tinits limits the result set to just those with matching intials, then the text comparison only has to be done on that smaller result set, so it should increase your performance.
-- tables and data that you provided:
SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> DROP TABLE target_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
Table created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John Smythe', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('Mary Robinson', 'P');
1 row created.
-- Oracle Text datastore, section group, ndata section, and index:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference ('my_ds')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference ('my_ds', 'MULTI_COLUMN_DATASTORE')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute ('my_ds', 'COLUMNS', 'target_name name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group ('my_secgrp')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('my_secgrp', 'BASIC_SECTION_GROUP')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section ('my_secgrp', 'name', 'name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX target_name_idx ON target_name_stage(target_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('datastore my_ds section group my_secgrp');
Index created.
-- virtual columns and additional indexes:
SCOTT@orcl_12.1.0.2.0> ALTER TABLE candidate_name_stage ADD (cinits VARCHAR2(4000)
2 GENERATED ALWAYS AS (upper(regexp_replace(candidate_name,'(^| )([^ ])([^ ])*','\2'))) VIRTUAL);
Table altered.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX cns_inits_idx ON candidate_name_stage (cinits);
Index created.
SCOTT@orcl_12.1.0.2.0> ALTER TABLE target_name_stage ADD (tinits VARCHAR2(4000)
2 GENERATED ALWAYS AS (upper(regexp_replace(target_name,'(^| )([^ ])([^ ])*','\2'))) VIRTUAL);
Table altered.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX tns_inits_idx ON target_name_stage (tinits);
Index created.
-- suggested query:
SCOTT@orcl_12.1.0.2.0> column candidate_name format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> column target_name format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> select candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
2 from candidate_name_stage, target_name_stage
3 where cinits = tinits
4 and contains(target_name, 'ndata(name, ' || candidate_name || ')',1 ) > 70
5 ORDER BY candidate_name
6 /
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John Smith John Smythe P 88
1 row selected.
|
|
|
|
Re: Oracle Text Performance [message #670803 is a reply to message #670801] |
Thu, 26 July 2018 13:56 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
So I've gone back through your example in detail, and I have good news and bad news -- the good news is that I have performed the same test in my environment, and I get the same result as you -- so thanks very much for that. The bad news is that when I use my own test data (not the few dummy names I provided in my example test), it no longer works -- I get the following error that I'm not sure how to debug:
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 13
20000. 00000 - "%s"
*Cause: The stored procedure 'raise_application_error'
was called which causes this error to be generated.
*Action: Correct the problem as described in the error message or contact
the application administrator or DBA for more information.
So, I have a few thoughts:
1.) I'd like to give you my sample data if you don't mind to see if the same thing happens on your end -- it's just two simple text files with a few hundred names each -- I just don't know if it's possible to attach those files to a message in this forum?
2.) While unlikely, I thought perhaps I was reaching the limit you were referring to earlier with the query, so I decided to try my PL/SQL procedure instead. Interestingly, running that procedure in SQLDeveloper gives me this error: "
ORA-29909: label for the ancillary operator is not a literal number
ORA-06512: at line 5
29909. 00000 - "label for the ancillary operator is not a literal number"
*Cause: The label used for invoking the ancillary operator is not
a literal number.
*Action: Use a literal number as the label."
But, running the same exact block in SQLPlus works just fine. Do you know why this would be? The only issue with SQLPlus is that it's only returning one result, whereas I'm expecting a few thousand. I suspect it is due to this in my code:
exception
when NO_DATA_FOUND then
NULL;
when TOO_MANY_ROWS then
NULL;
when OTHERS then
NULL;
The reason I had to add this is that I kept getting errors when there were multiple matches. So, for example if my candidate name was "Tom Smith" and it matched two target names "Tim Smith" and "Tomm Smith", I would get an error. It would only work if one of those target names was around. Do you know why this would be? Also, I'd like to have the option in such cases of only returning the highest scored match (so if I get the above to work, it would return "Tom Smith" and "Tom Smith" matched at 100 rather than "Tom Smith" and "Tom Smith" matched at 100 AND "Tom Smith" and "Tim Smith" matched at 85. I thought this would be a simple modification to the query such as doing a MAX on the score and GROUP BY on the candidate name, but I couldn't get it to work -- I'm not sure if there's something I'm missing about doing such things with Oracle Text.
I feel like I'm so close here! If you are able to help me debug these few remaining issues, I think your suggestion should get me functionally and performance-wise exactly where I need to be -- thank you again for your help!
|
|
|
Re: Oracle Text Performance [message #670804 is a reply to message #670803] |
Thu, 26 July 2018 19:38 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
ORA-20000 is a user-defined error that Oracle Text uses in its applications and is called using raise_application_error to display the message "Oracle Text error", so that you know that the error originated from something to do with Oracle Text. The other error message in the stack DRG-50901 tells you that there was an error in text query parser syntax. This error can be due to a lot of things and is the error message that is raised when your tables are too big for the query. It is also raised when a value that you have used cannot be parsed. For example, Oracle Text does not know what to do with a single space as a value to search for and raises that error.
When you use
select ... score(1) where contains (..., ..., 1) > 0
the 1 is the label for the ancillary operator. They can be any whole numbers, but the number used in the score must match the number used in the contains clause. If you do not use one in the contains clause, then you cannot reference it in the score. You cannot use letters within quotes or some such thing. Some violation of this is why you got "label for the ancillary operator is not a literal number", so you need to fix your query.
When you use exception when others then null it obscures all errors. You need to either handle the errors or remove the exception block, so that you can see where the problems are.
You cannot select multiple values from multiple rows into one scalar variable, which is what your block tries to do, so that raises an error. You need to loop through the rows. Even when you only use the highest score, which can be done by selecting the rows having the score equal to the maximum value, then you need to loop through them.
In order to display errors caused by values in individual rows, then continue with the rest, you can enclose things within begin and end with an exception block.
Please see the following demonstration, in which I have used some slightly different data and a PL/SQL block to demonstrate a bunch of things. I did not repost the rest, since it was the same as in my last post. I included names of just a space and comma to deliberately raise an exception, so that you can see how that value is displayed along with the accompanying error message, then continues with the rest. I also included two matching highest values, so that you can see how duplicates are displayed.
I have used dbms_output for convenience and since that is what you were using. I don't know how you return things to your appllication, but there are other ways to do this. For example, you could return your data set using a pipelined table function and insert your errors into an error logging table.
You can upload files as attachments, however many people will not open them due to concern of viruses, so it is usually better to post inline. You need to make sure that you do not post any confidential information. You can also send files via PM and email through this website. However, I do not think this is necessary at this point. Using a pl/sql block like I posted, you should be able to identify the problems. It is usually better if you post a copy and paste of an actual run, with any confidential information removed. Most errors are then readily visible to most of us reading, rather than stating various possibilities based on the error messages.
-- test data:
SCOTT@orcl_12.1.0.2.0> select candidate_name from candidate_name_stage order by candidate_name
2 /
CANDIDATE_NAME
-------------------------
,
John Smith
Mary Robison
Tom Smith
4 rows selected.
SCOTT@orcl_12.1.0.2.0> select target_name from target_name_stage order by target_name
2 /
TARGET_NAME
-------------------------
,
John Smythe
Mari Robison
Mary Robinson
Tim Smith
Tom Smith
Tomm Smith
7 rows selected.
-- PL/SQL block:
SET SERVEROUTPUT ON FORMAT WRAPPED
SCOTT@orcl_12.1.0.2.0> begin
2 for c in
3 (select candidate_name, cinits from candidate_name_stage order by candidate_name)
4 loop
5 dbms_output.put_line ('----------------------------------------');
6 dbms_output.put_line ('CANDIDATE NAME: ' || c.candidate_name);
7 begin
8 for t in
9 (select target_name, type_cd, fuzzy_match_score
10 from (select target_name, type_cd, score(1) as fuzzy_match_score
11 from target_name_stage
12 where tinits = c.cinits
13 and contains(target_name, 'ndata(name, ' || c.candidate_name || ')',1 ) > 70
14 order by score(1) desc)
15 where fuzzy_match_score =
16 (select max (score(1))
17 from target_name_stage
18 where tinits = c.cinits
19 and contains(target_name, 'ndata(name, ' || c.candidate_name || ')',1 ) > 70))
20 loop
21 begin
22 dbms_output.put_line (' ----------------------------------------');
23 dbms_output.put_line (' TARGET NAME: ' || t.target_name);
24 dbms_output.put_line (' TYPE CD: ' || t.type_cd);
25 dbms_output.put_line (' SCORE: ' || t.fuzzy_match_score);
26 exception
27 when others then
28 dbms_output.put_line (' ' || sqlcode || ': ' || sqlerrm);
29 end;
30 end loop;
31 exception
32 when others then
33 dbms_output.put_line (sqlcode || ': ' || sqlerrm);
34 end;
35 end loop;
36 dbms_output.put_line ('----------------------------------------');
37 end;
38 /
----------------------------------------
CANDIDATE NAME: ,
-20000: ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error
on line 1, column 14
----------------------------------------
CANDIDATE NAME: John Smith
----------------------------------------
TARGET NAME: John Smythe
TYPE CD: P
SCORE: 88
----------------------------------------
CANDIDATE NAME: Mary Robison
----------------------------------------
TARGET NAME: Mari Robison
TYPE CD:
SCORE: 94
----------------------------------------
TARGET NAME: Mary Robinson
TYPE CD: P
SCORE: 94
----------------------------------------
CANDIDATE NAME: Tom Smith
----------------------------------------
TARGET NAME: Tom Smith
TYPE CD:
SCORE: 100
----------------------------------------
PL/SQL procedure successfully completed.
|
|
|
Re: Oracle Text Performance [message #670866 is a reply to message #670804] |
Mon, 30 July 2018 22:44 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
Hi, thank you for your latest reply. I have gotten a chance to give it a try, and I've made a few observations:
1.) Functionally, it works against my data, so that's good.
2.) Surprisingly, I never got any exception messages printed out when I ran against my data -- more on this later, but it made me conclude that the DRG-50901 error generated by my single query execution wasn't because Oracle Text couldn't parse something in my data but perhaps had to do with the amount of data, which turned out to be correct (with a caveat).
3.) Also surprisingly, performance-wise there wasn't really any difference. The PL/SQL block with the "cinit = tinit" filter ran in about ~40 minutes. The single query I was running before without the filter also ran in about ~40 minutes. I concluded this might be due to generating output for each match in the PL/SQL block, so I went back to trying my single query with the "cinit = tinit" that was throwing the DRG-50901 error and trying to determine what the issue was.
What I did was insert all ~900 rows of my test candidate data into candidate_name_stage, and just 1 row into target_name_stage, and ran the following:
select candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where cinits = tinits
and contains(target_name, 'ndata(name, ' || candidate_name || ')',1 ) > 70
ORDER BY candidate_name
This ran just fine. So, then I piece-meal added 100 rows at a time of the ~700 rows of my test target data and kept running the above query in between. It executed just fine until there were around ~400 rows in target_name_stage, and then the DRG-50901 started getting thrown. Initially I thought perhaps it was something in the last batch of rows inserted causing it, but that turned out not to be the case -- taking the first batch of 100 rows that works just fine but inserting it 4 times into the table causes the DRG-50901 error to get thrown. Clearing out the target table but inserting the same clean record 400 times and then running the above query also generates that error. This would lead one to believe that I am running into some size limitation, but interestingly, taking the cinits = tinits out causes everything to work fine as it did before (but that's what I need for the performance fix)! That is, this works:
select candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
--where cinits = tinits
where contains(target_name, 'ndata(name, ' || candidate_name || ')',1 ) > 70
ORDER BY candidate_name
In fact, even if I put close to 1 million records in my target_name_stage table, the above works just fine. But as soon as I add that "cinits = tinits" back in, I get the DRG-50901 error again. As I mentioned, this isn't due to some sort of bad data in "cinits" or "tinits" -- even putting in "Mary" with tinit simply of "M" 400 times causes the query to bomb with that "cinits = tinits" filter (but not so if I take it out). Any idea why this might be?
I know you had mentioned not to send across any files due to virus concerns, which is fine. I don't want to put a few hundred records in-line, so what I'm going to do is just paste some of it -- it is all dummy data -- no confidential customer data -- and you can just load it into the example tables multiple times. Below is an example run of where I've done this -- once with just 1 name in the candidate table (no error), once with 900 of the same name in the candidate table (error), and once with 900 of the same name in the candidate table but with 'cinits = tinits' taken out (no error). I'm hoping you will be able to reproduce this on your end:
-- test data:
SQL> DROP TABLE candidate_name_stage;
Table dropped.
SQL> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SQL> INSERT INTO candidate_name_stage values ('Mary Eckhart');
1 row created.
SQL> DROP TABLE target_name_stage;
Table dropped.
SQL> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
Table created.
Please take the following names and load into target_name_stage 4 times (so that you get ~400 records):
ALI
IZZY N STITCH
FOURTH BANK OF LORTON VA
SECOND BANK OF VIRGINIA BR 3
KYLE MCCALLALAHAN
LIBBY K POSTHILL
LISA BACCUS
MARY ECKHART
BRCU-24
CREEK MARSHAL
DAVID HIGH ROSS
DEBBI K MORGAN
DLLUSPLAZA
ELLISE TRUMAN
FIRST BANK OF CLEARWATER VA BR 2
PAYNE MARSHAL
PETER S BAKER
VENKAT RAMARAO
WENDELL K OLDONE
YESBANK
KANGAL BANK
SUSAN SWEENY
AKMALEXTERNAL
ANDREW KERRY
AUSTIN GOLD
BARKER CHESTERFIELD
ABHIJEET CHOWDHURY
BENJAMIN
Mukunda
FRANK LESTER
FRANSICA N BABARINO
HARRIET M SWANN
JOHN MARSHAL
FIRST NATIONAL BANK OF COLUMBIA VA
JULLY SHALINI
KANTH
KELLY INTERNATIONAL HEATING
BRCU-1
BRCU-8
BUCKMINSTER TRUMAN
CHRISTOPHER N ADAMS
CODBANK
DARRIUS REOTKEY
MOTTLEY HARRY
OTTO PLASMUT
VERONICA - PASSPORT #
SALLY FIELD
SUSAN TRUNDLEBUND
ABN
SALMONSETH
HIGHWAY HOOKER
JACK GREEN
JOAN KUTER
NUMERO UNO BANK
AXIS BANK
EXHIDRLSHCUXX-2
KAREN FULLER
MARY HARSHBURGER
CALIBRCU-1
CARLOS JOSE ESTRELLA
DENNIS M JUMPER
EFRAM Q WEST
ERIN W BLACK
YOHANAEXTERNAL
CASPER BANK
SECOND BANK OF VIRGINIA
SIXTH NATIONAL BANK OF VA
ADAM HALLWAY
JANICE DAVIS
JILL GREEN
JIM BURTON
JUDAS EVERCLEAR
FIRST BANK OF MILLFORD NJ
FOURTH NATIONAL BANK OF DC
Laurentian Bank of Canada
MABK01MR
National Bank of Canada
LEONARDO DE CAPRIO
MARY STEVEN
MAXIMILLIAM GRADIUS
BRCU-9
BURTON MARGO
NANCY S LIST
NEO A MATRIX
RACHEL B MIAMI
SCOTT WIESMANN
THE HELMUT TRUST INC
ARPITA
BARRY STONE JOHNSON
ELISSA BRADLEY
VIVEKHAR
ORKUT VYUH G
FIRST NATIONAL BANK ONE
FRANCIS BACON
JENNIFER JOHNSTON
JOHANN FREDICKERSON
FREDERICKSBURG BANK OF VA
JMAMBA0221X
KAREN R WATERMAN
MARIAH PENNY CENTS
BILL CROOK
BRCU-15
BRCU-23
CALIBRCU-6
DANIEL SAUNDERS
NAIDU
RON I SELMOUR
SALLY BLACKWORTH
WILIM BAFFLE
-- Oracle Text datastore, section group, ndata section, and index:
SQL> exec ctx_ddl.drop_preference ('my_ds');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.create_preference ('my_ds', 'MULTI_COLUMN_DATASTORE');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.set_attribute ('my_ds', 'COLUMNS', 'target_name name');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.drop_section_group ('my_secgrp');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.create_section_group('my_secgrp', 'BASIC_SECTION_GROUP');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.add_ndata_section ('my_secgrp', 'name', 'name');
PL/SQL procedure successfully completed.
SQL> CREATE INDEX target_name_idx ON target_name_stage(target_name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('datastore my_ds section group my_secgrp'); 2
Index created.
-- virtual columns and additional indexes:
SQL> ALTER TABLE candidate_name_stage ADD (cinits VARCHAR2(4000)
GENERATED ALWAYS AS (upper(regexp_replace(candidate_name,'(^| )([^ ])([^ ])*','\2'))) VIRTUAL);
2
Table altered.
SQL> CREATE INDEX cns_inits_idx ON candidate_name_stage (cinits);
Index created.
SQL> ALTER TABLE target_name_stage ADD (tinits VARCHAR2(4000)
GENERATED ALWAYS AS (upper(regexp_replace(target_name,'(^| )([^ ])([^ ])*','\2'))) VIRTUAL);
2
Table altered.
SQL> CREATE INDEX tns_inits_idx ON target_name_stage (tinits);
Index created.
-- suggested query:
SQL> column candidate_name format a25 word_wrapped;
SQL> column target_name format a25 word_wrapped;
SQL> select candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where cinits = tinits
and contains(target_name, 'ndata(name, ' || candidate_name || ')',1 ) > 70
ORDER BY candidate_name; 2 3 4 5
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
Mary Eckhart MARY ECKHART 100
Mary Eckhart MARY ECKHART 100
Mary Eckhart MARY ECKHART 100
Mary Eckhart MARY ECKHART 100
Now, let's do the same thing, but add 'Mary Eckhart' to the candidate table 900 times instead of 1, and we'll see it throw the error with 'cinits=tinits', but not without it (note, the same thing happens with my actual candidate data, I am just using 'Mary Eckhart' duplicated 900 times to avoid having to post another long set of names here):
-- test data:
SQL> DROP TABLE candidate_name_stage;
Table dropped.
SQL> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SQL> INSERT INTO candidate_name_stage (candidate_name) select 'Mary Eckhart' from dual connect by rownum <= 900;
900 rows created.
SQL> DROP TABLE target_name_stage;
Table dropped.
SQL> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
Table created.
Please load target_name_stage in the same manner I described in the previous test.
-- Oracle Text datastore, section group, ndata section, and index:
SQL> exec ctx_ddl.drop_preference ('my_ds');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.create_preference ('my_ds', 'MULTI_COLUMN_DATASTORE');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.set_attribute ('my_ds', 'COLUMNS', 'target_name name');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.drop_section_group ('my_secgrp');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.create_section_group('my_secgrp', 'BASIC_SECTION_GROUP');
PL/SQL procedure successfully completed.
SQL> exec ctx_ddl.add_ndata_section ('my_secgrp', 'name', 'name');
PL/SQL procedure successfully completed.
SQL> CREATE INDEX target_name_idx ON target_name_stage(target_name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('datastore my_ds section group my_secgrp'); 2
Index created.
-- virtual columns and additional indexes:
SQL> ALTER TABLE candidate_name_stage ADD (cinits VARCHAR2(4000)
GENERATED ALWAYS AS (upper(regexp_replace(candidate_name,'(^| )([^ ])([^ ])*','\2'))) VIRTUAL);
2
Table altered.
SQL> CREATE INDEX cns_inits_idx ON candidate_name_stage (cinits);
Index created.
SQL> ALTER TABLE target_name_stage ADD (tinits VARCHAR2(4000)
GENERATED ALWAYS AS (upper(regexp_replace(target_name,'(^| )([^ ])([^ ])*','\2'))) VIRTUAL);
2
Table altered.
SQL> CREATE INDEX tns_inits_idx ON target_name_stage (tinits);
Index created.
-- suggested query:
SQL> column candidate_name format a25 word_wrapped;
SQL> column target_name format a25 word_wrapped;
SQL> select candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where cinits = tinits
and contains(target_name, 'ndata(name, ' || candidate_name || ')',1 ) > 70
ORDER BY candidate_name; 2 3 4 5
from candidate_name_stage, target_name_stage
*
ERROR at line 2:
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 13
SQL> select candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
--where cinits = tinits
where contains(target_name, 'ndata(name, ' || candidate_name || ')',1 ) > 70
ORDER BY candidate_name;
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
Mary Eckhart MARY ECKHART 100
Mary Eckhart MARY ECKHART 100
Mary Eckhart MARY ECKHART 100
...
I apologize for the length of this post, but I wanted to make sure I was more direct in order to avoid you having to speculate on the various possibilities as you had indicated in your last message. Thank you again for continuing to work me on this.
|
|
|
Re: Oracle Text Performance [message #670869 is a reply to message #670866] |
Tue, 31 July 2018 03:53 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
The following uses SDATA and NDATA to combine the searches for matching initials and names into one text index. SDATA cannot be created on virtual columns, so I have used regular columns. I have used FILTER BY to create the SDATA portion of the text index. I have added distinct to the query to eliminate listing the duplicate values. This may not be necessary with your actual data and adds time to the query. I have also demonstrated an additional query using dense_rank to select only the rows with the maximum score. The text index handles the SDATA and the NDATA well for a lot of rows in the target_name_stage table. The problem is with looping through all of the rows in the candidate_name_stage table.
-- tables and data:
SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('Tom Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('Mary Robison');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('MARY ECKHART');
1 row created.
SCOTT@orcl_12.1.0.2.0> DROP TABLE target_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
Table created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John Smythe', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('Mary Robinson', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tim Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tomm Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tom Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Mari Robison');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('MARY ECKHART');
1 row created.
SCOTT@orcl_12.1.0.2.0> create or replace directory my_dir as 'c:\my_oracle_files'
2 /
Directory created.
SCOTT@orcl_12.1.0.2.0> drop table test_names
2 /
Table dropped.
SCOTT@orcl_12.1.0.2.0> create table test_names
2 (name varchar2(350))
3 ORGANIZATION external
4 (TYPE oracle_loader
5 DEFAULT DIRECTORY my_dir
6 ACCESS PARAMETERS
7 (RECORDS DELIMITED BY NEWLINE
8 LOGFILE 'test.log_xt'
9 FIELDS TERMINATED BY "," LDRTRIM
10 MISSING FIELD VALUES ARE NULL
11 REJECT ROWS WITH ALL NULL FIELDS
12 (name))
13 location ('names.dat'))
14 REJECT LIMIT UNLIMITED
15 /
Table created.
SCOTT@orcl_12.1.0.2.0> select count(*) from test_names
2 /
COUNT(*)
----------
109
1 row selected.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) select * from test_names
2 /
109 rows created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) select * from test_names
2 /
109 rows created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) select * from test_names
2 /
109 rows created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) select * from test_names
2 /
109 rows created.
SCOTT@orcl_12.1.0.2.0> begin
2 for i in 1 .. 900 loop
3 insert into candidate_name_stage (candidate_name) values ('MARY ECKHART');
4 end loop;
5 end;
6 /
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> select count(*) from candidate_name_stage
2 /
COUNT(*)
----------
904
1 row selected.
SCOTT@orcl_12.1.0.2.0> select count(*) from target_name_stage
2 /
COUNT(*)
----------
443
1 row selected.
-- additional columns and updates (new data could be automatically updated with a trigger or procedure):
SCOTT@orcl_12.1.0.2.0> ALTER TABLE candidate_name_stage ADD (cinits VARCHAR2(40))
2 /
Table altered.
SCOTT@orcl_12.1.0.2.0> UPDATE candidate_name_stage SET cinits = upper(regexp_replace(candidate_name,'(^| )([^ ])([^ ])*','\2'))
2 /
904 rows updated.
SCOTT@orcl_12.1.0.2.0> ALTER TABLE target_name_stage ADD (tinits VARCHAR2(40))
2 /
Table altered.
SCOTT@orcl_12.1.0.2.0> UPDATE target_name_stage SET tinits = upper(regexp_replace(target_name,'(^| )([^ ])([^ ])*','\2'))
2 /
443 rows updated.
-- indexes:
SCOTT@orcl_12.1.0.2.0> CREATE INDEX cns_inits_idx ON candidate_name_stage (cinits);
Index created.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference ('my_ds')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference ('my_ds', 'MULTI_COLUMN_DATASTORE')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute ('my_ds', 'COLUMNS', 'target_name name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group ('my_secgrp')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('my_secgrp', 'BASIC_SECTION_GROUP')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section ('my_secgrp', 'name', 'name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> -- text index uses filter by tinits
SCOTT@orcl_12.1.0.2.0> CREATE INDEX target_name_idx ON target_name_stage(target_name) INDEXTYPE IS CTXSYS.CONTEXT
2 FILTER BY tinits
3 parameters ('datastore my_ds section group my_secgrp')
4 /
Index created.
-- gather statistics:
SCOTT@orcl_12.1.0.2.0> exec dbms_stats.gather_table_stats (USER, 'CANDIDATE_NAME_STAGE')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec dbms_stats.gather_table_stats (USER, 'TARGET_NAME_STAGE')
PL/SQL procedure successfully completed.
-- queries combining sdata and ndata:
SCOTT@orcl_12.1.0.2.0> column candidate_name format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> column target_name format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
2 from candidate_name_stage, target_name_stage
3 where contains
4 (target_name,
5 'sdata (tinits = ''' || cinits || ''') and
6 ndata (name, ' || candidate_name || ')',
7 1) > 70
8 order by candidate_name, score(1) desc
9 /
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John Smith John Smythe P 88
MARY ECKHART MARY ECKHART 100
Mary Robison Mary Robinson P 94
Mary Robison Mari Robison 94
Tom Smith Tom Smith 100
Tom Smith Tim Smith 93
Tom Smith Tomm Smith 93
7 rows selected.
SCOTT@orcl_12.1.0.2.0> select candidate_name, target_name, type_cd, fuzzy_match_score
2 from (select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score,
3 dense_rank () over (partition by candidate_name order by score(1) desc) as dr
4 from candidate_name_stage, target_name_stage
5 where contains
6 (target_name,
7 'sdata (tinits = ''' || cinits || ''') and
8 ndata (name, ' || candidate_name || ')',
9 1) > 70)
10 where dr = 1
11 order by candidate_name
12 /
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John Smith John Smythe P 88
MARY ECKHART MARY ECKHART 100
Mary Robison Mary Robinson P 94
Mary Robison Mari Robison 94
Tom Smith Tom Smith 100
5 rows selected.
[Updated on: Tue, 31 July 2018 04:02] Report message to a moderator
|
|
|
Re: Oracle Text Performance [message #670881 is a reply to message #670869] |
Tue, 31 July 2018 14:08 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
Thank you so much for the response. This is similar to the PL/SQL block just using NDATA that you proposed previously -- I can get it to work functionally, but the performance somehow remains in that 40 minute range. I find this somewhat confusing because the 'cinit = tinit' filter should reduce the runtime, but it doesn't. To be clear, it returns within seconds with the few hundred records we have been doing in these test runs. But if you recall from my initial post, we are expecting to get upward of 1 million names in the candidate table, and 1 thousand names in the target table, and that's the test that takes 40 minutes to return. You had said:
Quote:The text index handles the SDATA and the NDATA well for a lot of rows in the target_name_stage table. The problem is with looping through all of the rows in the candidate_name_stage table.
I'm wondering if perhaps that's my issue. In my use case, the candidate list is the huge one, and the target list is relatively small. I'm wondering if "reversing" the way this is done would help, or if I need to just say that ~40 minutes would be our baseline benchmark for our worst-case scenario data volume.
|
|
|
Re: Oracle Text Performance [message #670882 is a reply to message #670881] |
Tue, 31 July 2018 18:42 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
vka2b wrote on Tue, 31 July 2018 12:08
... the candidate list is the huge one, and the target list is relatively small. I'm wondering if "reversing" the way this is done would help ...
Yes, absolutely! I did not realize which is the bigger table. Oracle Text is actually intended to compare one value to a table of values. So, whether you force it to loop through one table in PL/SQL, comparing one value at a time to the other table or use a SQL query, it does the same thing. Unfortunately, it appears that efforts to compare initials does not speed things up, so you might as well eliminate the extra columns and indexes and things in the query to do that.
In the following demonstration, I have created text indexes on target_name_stage(target_name) and candidate_name_stage(candidate_name) and included the timing and explained plans for queries using each index, that show that the bottleneck is the full table scan of the other table. So, the one that reverses things as you suggested is far better, even on a small scale.
There are a few other things that can be done, such as adding a first_rows hint to begin displaying the first results sooner, while still calculating the others. Here is a link to Oracle Text tuning in general for your version:
https://docs.oracle.com/database/121/CCAPP/GUID-96F4B2AC-59E5-4B02-A85B-A7B2C20C4BDF.htm#CCAPP0500
-- tables and data:
SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('Tom Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('Mary Robison');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('MARY ECKHART');
1 row created.
SCOTT@orcl_12.1.0.2.0> DROP TABLE target_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
Table created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John Smythe', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('Mary Robinson', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tim Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tomm Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tom Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Mari Robison');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('MARY ECKHART');
1 row created.
SCOTT@orcl_12.1.0.2.0> create or replace directory my_dir as 'c:\my_oracle_files'
2 /
Directory created.
SCOTT@orcl_12.1.0.2.0> drop table test_names
2 /
Table dropped.
SCOTT@orcl_12.1.0.2.0> create table test_names
2 (name varchar2(350))
3 ORGANIZATION external
4 (TYPE oracle_loader
5 DEFAULT DIRECTORY my_dir
6 ACCESS PARAMETERS
7 (RECORDS DELIMITED BY NEWLINE
8 LOGFILE 'test.log_xt'
9 FIELDS TERMINATED BY "," LDRTRIM
10 MISSING FIELD VALUES ARE NULL
11 REJECT ROWS WITH ALL NULL FIELDS
12 (name))
13 location ('names.dat'))
14 REJECT LIMIT UNLIMITED
15 /
Table created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) select * from test_names
2 /
109 rows created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) select * from test_names
2 /
109 rows created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) select * from test_names
2 /
109 rows created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) select * from test_names
2 /
109 rows created.
SCOTT@orcl_12.1.0.2.0> begin
2 for i in 1 .. 900 loop
3 insert into candidate_name_stage (candidate_name) values ('MARY ECKHART');
4 end loop;
5 end;
6 /
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> select count(*) from candidate_name_stage
2 /
COUNT(*)
----------
904
1 row selected.
SCOTT@orcl_12.1.0.2.0> select count(*) from target_name_stage
2 /
COUNT(*)
----------
443
1 row selected.
-- text index on target_name_stage:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference ('t_ds')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference ('t_ds', 'MULTI_COLUMN_DATASTORE')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute ('t_ds', 'COLUMNS', 'target_name name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group ('t_secgrp')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('t_secgrp', 'BASIC_SECTION_GROUP')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section ('t_secgrp', 'name', 'name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX target_name_idx ON target_name_stage(target_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('datastore t_ds section group t_secgrp')
3 /
Index created.
text index on candidate_name_stage:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference ('c_ds')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group ('c_secgrp')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('datastore c_ds section group c_secgrp')
3 /
Index created.
-- queries and execution plans:
SCOTT@orcl_12.1.0.2.0> set linesize 130
SCOTT@orcl_12.1.0.2.0> set timing on
SCOTT@orcl_12.1.0.2.0> set autotrace on explain
SCOTT@orcl_12.1.0.2.0> column candidate_name format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> column target_name format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
2 from candidate_name_stage, target_name_stage
3 where contains (target_name, 'ndata (name, ' || candidate_name || ')', 1) > 70
4 order by candidate_name, score(1) desc
5 /
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John Smith John Smythe P 88
MARY ECKHART MARY ECKHART 100
Mary Robison Mary Robinson P 94
Mary Robison Mari Robison 94
Tom Smith Tom Smith 100
Tom Smith Tim Smith 93
Tom Smith Tomm Smith 93
7 rows selected.
Elapsed: 00:00:02.39
Execution Plan
----------------------------------------------------------
Plan hash value: 586206428
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 200 | 73600 | 3672 (1)| 00:00:01 |
| 1 | SORT UNIQUE | | 200 | 73600 | 3671 (1)| 00:00:01 |
| 2 | NESTED LOOPS | | 200 | 73600 | 3670 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL | CANDIDATE_NAME_STAGE | 904 | 156K| 3 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| TARGET_NAME_STAGE | 1 | 191 | 3670 (0)| 00:00:01 |
|* 5 | DOMAIN INDEX | TARGET_NAME_IDX | | | 4 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("CTXSYS"."CONTAINS"("TARGET_NAME",'ndata (name, '||"CANDIDATE_NAME"||')',1)>70)
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
SCOTT@orcl_12.1.0.2.0> select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
2 from candidate_name_stage, target_name_stage
3 where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
4 order by candidate_name, score(1) desc
5 /
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John Smith John Smythe P 82
MARY ECKHART MARY ECKHART 100
Mary Robison Mary Robinson P 95
Mary Robison Mari Robison 94
Tom Smith Tom Smith 100
Tom Smith Tomm Smith 94
Tom Smith Tim Smith 93
7 rows selected.
Elapsed: 00:00:00.98
Execution Plan
----------------------------------------------------------
Plan hash value: 2210785759
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 200 | 73600 | 1825 (1)| 00:00:01 |
| 1 | SORT UNIQUE | | 200 | 73600 | 1824 (1)| 00:00:01 |
| 2 | NESTED LOOPS | | 200 | 73600 | 1823 (0)| 00:00:01 |
| 3 | TABLE ACCESS FULL | TARGET_NAME_STAGE | 443 | 79297 | 3 (0)| 00:00:01 |
| 4 | TABLE ACCESS BY INDEX ROWID| CANDIDATE_NAME_STAGE | 1 | 189 | 1823 (0)| 00:00:01 |
|* 5 | DOMAIN INDEX | CANDIDATE_NAME_IDX | | | 4 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
5 - access("CTXSYS"."CONTAINS"("CANDIDATE_NAME",'ndata (name, '||"TARGET_NAME"||')',1)>70)
Note
-----
- dynamic statistics used: dynamic sampling (level=2)
SCOTT@orcl_12.1.0.2.0> set autotrace off
SCOTT@orcl_12.1.0.2.0> set timing off
SCOTT@orcl_12.1.0.2.0> set linesize 80
|
|
|
Re: Oracle Text Performance [message #670883 is a reply to message #670882] |
Tue, 31 July 2018 19:30 |
|
vka2b
Messages: 21 Registered: June 2018
|
Junior Member |
|
|
Thanks for the reply. I actually did try doing the reverse situation -- even before I came to this forum for help -- but I abandoned it because I just couldn't get past this error. Running your test verbatim in my environment produces the following:
select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order by candidate_name, score(1) desc
ORA-29902: error in executing ODCIIndexStart() routine
ORA-20000: Oracle Text error:
DRG-50901: text query parser syntax error on line 1, column 14
29902. 00000 - "error in executing ODCIIndexStart() routine"
*Cause: The execution of ODCIIndexStart routine caused an error.
*Action: Examine the error messages produced by the indextype code and
take appropriate action.
It doesn't matter what is in the candidate table -- even a single row of 'Mary Eckhart' produces this error. Going back to the original query works fine:
select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains (target_name, 'ndata (name, ' || candidate_name || ')', 1) > 70
order by candidate_name, score(1) desc
I'm going to keep trying to troubleshoot on my end, but I wanted to throw this out there in case you already know what the issue is! Thank you again.
|
|
|
Re: Oracle Text Performance [message #670884 is a reply to message #670883] |
Tue, 31 July 2018 21:43 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
That is odd, especially since we are using the same version. The slightest mistyping of something can result in an error. Just to rule that out, I have provided a simple test script that you can copy and paste, followed by a run of the same script on my system. Please make sure you are running from SQL*Plus, not SQL*Developer or some other interface that adds a whole other layer of potential problems.
-- script:
-- tables and data:
DROP TABLE candidate_name_stage;
CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
INSERT INTO candidate_name_stage values ('John Smith');
INSERT INTO candidate_name_stage values ('Tom Smith');
INSERT INTO candidate_name_stage values ('Mary Robison');
INSERT INTO candidate_name_stage values ('MARY ECKHART');
DROP TABLE target_name_stage;
CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
insert into target_name_stage values ('John Smythe', 'P');
insert into target_name_stage values ('Mary Robinson', 'P');
insert into target_name_stage (target_name) values ('Tim Smith');
insert into target_name_stage (target_name) values ('Tomm Smith');
insert into target_name_stage (target_name) values ('Tom Smith');
insert into target_name_stage (target_name) values ('Mari Robison');
insert into target_name_stage (target_name) values ('MARY ECKHART');
-- index on candidate_name_stage:
exec ctx_ddl.drop_preference ('c_ds')
exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE')
exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name')
exec ctx_ddl.drop_section_group ('c_secgrp')
exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP')
exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name')
CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
parameters ('datastore c_ds section group c_secgrp')
/
-- query:
column candidate_name format a25 word_wrapped
column target_name format a25 word_wrapped
select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
from candidate_name_stage, target_name_stage
where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
order by candidate_name, score(1) desc
/
-- run of above script on my system:
SCOTT@orcl_12.1.0.2.0> -- tables and data:
SCOTT@orcl_12.1.0.2.0> DROP TABLE candidate_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE candidate_name_stage (candidate_name VARCHAR2(350));
Table created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('John Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('Tom Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('Mary Robison');
1 row created.
SCOTT@orcl_12.1.0.2.0> INSERT INTO candidate_name_stage values ('MARY ECKHART');
1 row created.
SCOTT@orcl_12.1.0.2.0> DROP TABLE target_name_stage;
Table dropped.
SCOTT@orcl_12.1.0.2.0> CREATE TABLE target_name_stage (target_name VARCHAR2(350), type_cd VARCHAR2(1));
Table created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('John Smythe', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage values ('Mary Robinson', 'P');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tim Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tomm Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Tom Smith');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('Mari Robison');
1 row created.
SCOTT@orcl_12.1.0.2.0> insert into target_name_stage (target_name) values ('MARY ECKHART');
1 row created.
SCOTT@orcl_12.1.0.2.0> -- index on candidate_name_stage:
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_preference ('c_ds')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_preference ('c_ds', 'MULTI_COLUMN_DATASTORE')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.set_attribute ('c_ds', 'COLUMNS', 'candidate_name name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.drop_section_group ('c_secgrp')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.create_section_group('c_secgrp', 'BASIC_SECTION_GROUP')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> exec ctx_ddl.add_ndata_section ('c_secgrp', 'name', 'name')
PL/SQL procedure successfully completed.
SCOTT@orcl_12.1.0.2.0> CREATE INDEX candidate_name_idx ON candidate_name_stage(candidate_name) INDEXTYPE IS CTXSYS.CONTEXT
2 parameters ('datastore c_ds section group c_secgrp')
3 /
Index created.
SCOTT@orcl_12.1.0.2.0> -- query:
SCOTT@orcl_12.1.0.2.0> column candidate_name format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> column target_name format a25 word_wrapped
SCOTT@orcl_12.1.0.2.0> select distinct candidate_name, target_name, type_cd, score(1) as fuzzy_match_score
2 from candidate_name_stage, target_name_stage
3 where contains (candidate_name, 'ndata (name, ' || target_name || ')', 1) > 70
4 order by candidate_name, score(1) desc
5 /
CANDIDATE_NAME TARGET_NAME T FUZZY_MATCH_SCORE
------------------------- ------------------------- - -----------------
John Smith John Smythe P 82
MARY ECKHART MARY ECKHART 100
Mary Robison Mary Robinson P 95
Mary Robison Mari Robison 94
Tom Smith Tom Smith 100
Tom Smith Tomm Smith 94
Tom Smith Tim Smith 93
7 rows selected.
|
|
|
|
|
Re: Oracle Text Performance [message #671652 is a reply to message #671642] |
Tue, 11 September 2018 13:09 |
|
Barbara Boehmer
Messages: 9100 Registered: November 2002 Location: California, USA
|
Senior Member |
|
|
vka2b wrote on Tue, 11 September 2018 07:29Hi...I need to resurrect this thread to ask a hopefully quick functional question on the example we were using for this discussion. If I modify the sample data such that I have "Tom" as a candidate name and "Tom Smith" as a target name, I get a score of 60 back, which seems reasonable. But if I do the reverse (i.e. "Tom Smith" as the candidate name and "Tom" as the target name), I get a score of 100, which does not make sense to me. I would have expected 60 still. Do you know why this is and what I could do about it? Thank you!
When you search for target name "Tom" in candidate name "Tom Smith", you are asking Oracle text if the target name "Tom" is contained within the candidate name "Tom Smith", and it is, "Tom" matches "Tom" exactly, so it scores 100%. The "contains" queries are not similarity queries. If you were to search for "Tom S" you would get a lower score, because it would search for both "Tom" and "S" and "S" does not match "Smith" exactly. This is where your comparison of initials is useful.
[Updated on: Tue, 11 September 2018 13:11] Report message to a moderator
|
|
|
|
Goto Forum:
Current Time: Tue Nov 12 19:13:46 CST 2024
|