We're using a Oracle Text CTXSYS.CONTEXT index to index about half a million rows containing metainformation. The information is spread over two tables that are combined by a procedure that the indexer calls at runtime (functional index).
When I run the CREATE INDEX on my local machine (simple dualcore notebook) the index is built in about 3 minutes. On our DB server which runs on Solaris with 8 cores and 16G of RAM it takes abozt 24 hours to create an index for the same (exactly the same) data.
Sample code: This is our index feeder for two tables and 3 columns:
create or replace procedure docmeta_revisions_text_feeder
( p_rowid in rowid , p_clob in out nocopy clob) as v_clob CLOB begin
FOR c1 IN (select DM.DID, DM.XDESCRIB || ' ' || DM.XAUTHOR AS data
from DOCMETA DM
WHERE ROWID = p_rowid)
LOOP
v_clob := v_clob || c1.data;
FOR c2 IN (
SELECT ' ' || RV.DDOCTITLE AS data
FROM REVISIONS RV
WHERE RV.DID = c1.DID)
LOOP
v_clob := v_clob || c2.data;
END LOOP;
END LOOP;
p_clob := v_clob;
end docmeta_revisions_text_feeder
These are the preferences
BEGIN
CTX_DDL.CREATE_PREFERENCE ('concat_DM_RV_DS', 'USER_DATASTORE');
CTX_DDL.SET_ATTRIBUTE ('concat_DM_RV_DS', 'PROCEDURE',
'docmeta_revisions_text_feeder');
END;
Now we create the index
CREATE INDEX concat_DM_RV_idx ON DOCMETA (FULLTEXTIDX_DUMMY)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('datastore concat_DM_RV_DS
section group CTXSYS.AUTO_SECTION_GROUP
') PARALLEL 4;
The data mostly consists of a simple title or author name + a short description with < 1k text.
I tried to play a little bit with the involved memory settings and the PARALLEL parameter but haven't any success. So here come my questions:
We finally figured out how to do a splitted sync of the index. Here are some basic steps that show what we did:
CREATE INDEX concat_DM_RV_idx ON DOCMETA (FULLTEXTIDX_DUMMY)
INDEXTYPE IS CTXSYS.CONTEXT
PARAMETERS ('datastore concat_DM_RV_DS section group CTXSYS.AUTO_SECTION_GROUP
NOPOPULATE
');
see the NOPOPULATE parameter? that tells the indexer that it shouldn't start the populating / indexing process. If you're on 11g you now have a very nice CTX_DDL feature at hand that populates the index at will, namely the procedure "POPULATE_PENDING". Calling it on your index name will populate the CTXSYS table that holds rows that have been modified and therefore are out of sync. Note that after calling this method the indexer still hasn't started anything. Since 10g (?) the according CTX_DDL.SYNC_INDEX procedure has several additional parameters, e.g. the "maxtime" parameter. Provide it with, say, 4H and your indexer will start to sync pending rows for about 4 hours. You repeat that procedure by schedule and are done.
That doesn't work in 9i unfortunately. So we tried successfully to "simulate" the Oracle POPULATE_PENDING process. The only restriction on this method is: you need some kind of unique row identifier to be able to query chunks of the same content from your table. Here's what we did:
1.) Create the index with NOPOPULATE (see above) 2.) Become SYS / DBA / CTXSYS (yes, you might call your admin for that). Find out the ID that your freshly created index has by querying the index meta table:
SELECT IDX_ID FROM CTXSYS.CTX_INDEXES WHERE IDX_NAME ='concat_DM_RV_idx';
3.) note the index ID this is yielding on a yellow snippet of paper and execute this insertion statement as CTXSYS role and replace the <> with your index id and the <> with the name of the table that the index is built on. The unique row identifer can be some kind of document ID or any kind of countable statement that creates a unique chunk of data of your table :
INSERT INTO CTXSYS.DR$PENDING (PND_CID,PND_PID,PND_ROWID,PND_TIMESTAMP)
SELECT <<your index id>>, 0, <<basetable name>>.ROWID, CURRENT_DATE
FROM gsms.DOCMETA
WHERE <<basetable unique row identifier>> < 50000;
COMMIT; -- Dont forget the COMMIT! DONT FORGET IT!!! WE MEAN IT!
The "50.000" marks the number of rows depending on the scarceness of your basetabel that'll be inserted in the pending rows table as payload for the indexer. Adjust it for your own needs.
4.) Now we are setup to let the indexer loose.
CALL CTX_DDL.SYNC_INDEX(
'CONCAT_DM_RV_IDX', -- your index name here
'100M', -- memory count
NULL, -- param for partitioned idxes
2 -- parallel count
);
will start the indexing process on whatever count of rows you have inserted in step 3.) To run the next chunk repeat step 3.) with the next 50.000 or so rows ("where id between 50.000 and 100.000")
If you accidentally run the indexer on the same set of rows the index will strongly fragment. The only way to clean it up is to optimize the index with a REBUILD parameter. On our local machine that was extremely fast since the indexer doesn’t have to run but only rearranges the index tables' contents:
CALL CTX_DDL.OPTIMIZE_INDEX('CONCAT_DM_RV_IDX', 'REBUILD');
If you need some meta information about the indexing status and size you can ask the CTX_REPORT package:
SELECT CTX_REPORT.INDEX_SIZE('CONCAT_DM_RV_IDX') FROM DUAL;
And if you forgot which parameters you chose on indexing time:
SELECT * FROM CTXSYS.CTX_PARAMETERS;
Happy indexing!