What is a good model, using Open Icecat, to import data and populate products in a catalog's database?

rhaag71 · Nov 28, 2011

I am working on building a catalog of products based on Open Icecat (the Icecat Open Catalog), and I am looking for advice from someone who may have experience with this, or possibly with another similar service (like CNET, maybe).

My question is, what is a good model for populating a product catalog's database?

Here is what I have so far...

  1. I GET the XML feed of the entire catalog.
  2. I extract the data about the products that I need, based on the category ID.
  3. At this point I insert all that data into a table, so now I have a table for, say, the printer categories; it contains the URL of the image and the ID of the XML file for each product in the category... easy enough (one way to do the extraction is sketched right after this list).
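For step 2, a streaming parse keeps memory flat even though the full index is huge. A minimal sketch, assuming the full-catalog feed is Icecat's files.index.xml and that each <file> element carries path, Product_ID and Catid attributes (verify against your copy of the feed; the attribute names vary by export version):

# Stream the full-catalog index and keep only the wanted categories.
import xml.etree.ElementTree as ET

WANTED_CATS = {"234", "195"}    # hypothetical Icecat category IDs

def product_refs(index_path):
    """Yield (product_id, xml_path) for every product in a wanted category."""
    for _, elem in ET.iterparse(index_path, events=("end",)):
        if elem.tag.endswith("file"):
            if elem.get("Catid") in WANTED_CATS:
                yield elem.get("Product_ID"), elem.get("path")
            elem.clear()        # discard each element as we go -- the index is huge

for pid, path in product_refs("files.index.xml"):
    print(pid, path)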

Here is where I run into a question/concern... It is easy to write an ad-hoc script that issues a GET request for each XML file and image and dumps them into directories, but Icecat does NOT want you to pull very large amounts of data. My categories contain thousands of products (over 40k, for instance).

What I feel I need to do is GET the XML for each product, grab its image, and store them both. I feel this way because it's the obvious solution, and it's what the client keeps asking for... which doesn't mean it's correct, though. Then I could parse each product's XML to extract the description, SKU, etc. into a table, so I can build the catalog (for use with Magento, say), later adding/changing things as needed (prices, related products, etc.).

Seems easy enough, but after around 3-4k GET requests I get booted for pulling too much data. Once I have the entire catalog (my catalog of wanted categories), it will be easy enough to grab the update files (XML, and small by comparison) and make changes accordingly. That would be a great point to be at, but first I need to get all the data and build the product table(s). A throttled, resumable downloader (sketched below) is one way to stay under the limit.
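This is not Icecat's sanctioned bulk method, just one way to stay under the request limit: throttle the loop and make it resumable, so getting booted mid-run costs nothing but time. The base URL and auth details are assumptions to adapt:

# Throttled, resumable batch download of per-product XML files.
import os
import time

import requests

BASE = "https://data.icecat.biz/"   # assumed base for the 'path' values in the index
AUTH = ("icecat_user", "icecat_pass")   # placeholder Open Icecat login (HTTP basic auth)
OUT_DIR = "xml_cache"
DELAY = 4.0                         # seconds between requests, roughly 900/hour

def fetch_all(paths):
    os.makedirs(OUT_DIR, exist_ok=True)
    for path in paths:
        local = os.path.join(OUT_DIR, os.path.basename(path))
        if os.path.exists(local):   # resume: skip files from earlier runs
            continue
        resp = requests.get(BASE + path, auth=AUTH, timeout=30)
        if resp.status_code == 200:
            with open(local, "wb") as f:
                f.write(resp.content)
        time.sleep(DELAY)           # stay under the rate that gets you booted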

So here is what I kick around...

One idea is to get the data in real time, as needed, but neither the client nor I want that. The client wants the catalog, understandably... and I notice that real-time fetching adds a performance hit and does not plug in (easily) to many solutions. But, expanding on the 'real-time' idea: use a real-time GET of the XML data, and store the data as it comes in, with logic like 'if it's not present locally, go get it and then store it; if it is present locally, check whether the info is up to date, and update it if not.' Of course, if I'm going to check whether it's up to date on every request anyway, there's seemingly no point in storing the data at all; I may as well just fetch it and throw it away, which seems inefficient (see the conditional-GET sketch below for one way around this).
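One thing that softens the "fetch it every time anyway" objection is an HTTP conditional GET: if the server honors If-Modified-Since (an assumption worth verifying against Icecat's servers), an up-to-date check returns a tiny 304 instead of the full document, so the local copy still pays for itself. A sketch:

# "Fetch if missing, revalidate if present" via HTTP conditional GET.
import email.utils
import os

import requests

def get_cached(url, local, auth):
    headers = {}
    if os.path.exists(local):
        # Ask the server to send the body only if it changed since our copy.
        mtime = os.path.getmtime(local)
        headers["If-Modified-Since"] = email.utils.formatdate(mtime, usegmt=True)
    resp = requests.get(url, auth=auth, headers=headers, timeout=30)
    if resp.status_code == 304:     # cheap answer: local copy is still current
        with open(local, "rb") as f:
            return f.read()
    resp.raise_for_status()
    with open(local, "wb") as f:    # store or refresh the local copy
        f.write(resp.content)
    return resp.content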

-or-

Maybe everything is in real time: the products are fetched and displayed in real time, the admin's view of products for manipulation is presented in real time, etc., always grabbing what's needed based on the metadata already populated in the database from the 'main' catalog file that describes the entire catalog available from Icecat. But this doesn't plug into many solutions, takes a performance hit, and some hosts won't let us GET anyhow... so there are many limitations here. It does sound like an awesome way to be sure you always have super-current info (which is not needed here, though).

Here is where I am sort of headed already...

I have the metadata based on the main catalog (over 500K items), and I have already populated tables based on the desired categories. Now I am headed towards building an app (tool) that refines what I am working with down to, say, a single category. Then produce a job: 'use the category ID and get all the XML feeds'; then 'use the category ID (probably the same one again) and fetch the images'; then take the same category ID and build the products by grabbing the SKU, description, image filename, etc. At that point in the workflow I have all the info and can use the SKU (or whatever is needed) to grab prices etc. from the other feed, manipulate descriptions, rename images if needed (SEO), or whatever. A sketch of the per-product build step follows.
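A sketch of that per-product build step. The attribute names (Prod_id, Title, HighPic, LongDesc) follow the usual Open Icecat product XML layout, but verify them against the files you actually receive:

# Pull SKU, title, image URL and description from one product XML
# and upsert it into a products table.
import sqlite3
import xml.etree.ElementTree as ET

def extract_product(xml_file):
    root = ET.parse(xml_file).getroot()
    prod = root.find(".//Product")           # assumes one Product per file
    desc = prod.find("ProductDescription")
    return {
        "sku": prod.get("Prod_id"),
        "title": prod.get("Title"),
        "image": prod.get("HighPic"),
        "long_desc": desc.get("LongDesc") if desc is not None else "",
    }

db = sqlite3.connect("catalog.db")
db.execute("""CREATE TABLE IF NOT EXISTS products
              (sku TEXT PRIMARY KEY, title TEXT, image TEXT, long_desc TEXT)""")
row = extract_product("xml_cache/1198270.xml")   # hypothetical downloaded file
db.execute("INSERT OR REPLACE INTO products VALUES (:sku, :title, :image, :long_desc)", row)
db.commit()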

Then I will need to build a model for updating prices and shipping weights from a different feed (Synnex, in this case), but that seems much easier, because shipping and price should be real-time... so it's a different story, and far less data at once; only what's in the cart, I'm thinking (a tiny sketch below).
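The cart-only idea keeps the lookups small: only the SKUs currently in the cart get checked against the distributor feed. The field names here are placeholders, since the Synnex feed format isn't shown:

# Cart-time lookup: price and weight for just the SKUs in the cart.
# 'feed_rows' stands in for parsed Synnex records with 'sku', 'price'
# and 'weight' keys -- the real field names will differ.
def price_and_weight(cart_skus, feed_rows):
    wanted = set(cart_skus)
    return {r["sku"]: (r["price"], r["weight"])
            for r in feed_rows if r["sku"] in wanted}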

I'm still not sure how to go about this. Supposedly others have built a catalog like this for the same client by ripping the Icecat repository, but they never made/provided tools for future manipulation, which is where I am headed. Plus, the old catalog is very old/stale, and I have never seen proof that they actually ripped the data and built a catalog; not the full set, anyhow.

OK, to help with the confusion...

The source I am using has a repository of over 600,000 products in many categories. I only need about 45,000 products (spread over several categories). As it is, it takes hours to download the XML file for each product; around 1,000 per hour (we know this from past experience).

Part of the problem is that not every XML file is exactly the same, and we need different information from different categories. These requirements are very likely to change (probably more so at first), so I cannot just have a single schema to store them all. Once the 45,000 (or so) files are downloaded, we only have to get the changes/updates in the future. So what I am really trying to do is build a local repository of only the categories that we need, so we can work with them more efficiently. We don't plan to use the related categories right away either, so I want those files locally for when we go back to do that too. One way to cope with the schema problem is sketched below.
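One way to handle category-specific fields without committing to a rigid schema up front: keep the raw XML alongside a few universal columns, and re-extract category-specific attributes later as the requirements firm up. A sketch using SQLite; table and column names are illustrative:

# Keep the raw XML next to a few universal columns so fields can be
# re-extracted later without re-downloading anything.
import sqlite3

db = sqlite3.connect("catalog.db")
db.execute("""CREATE TABLE IF NOT EXISTS raw_products (
                  product_id  TEXT PRIMARY KEY,
                  category_id TEXT,
                  fetched_at  TEXT DEFAULT CURRENT_TIMESTAMP,
                  xml         BLOB    -- full source document, re-parsable any time
              )""")

def store_raw(product_id, category_id, xml_bytes):
    db.execute("INSERT OR REPLACE INTO raw_products (product_id, category_id, xml) "
               "VALUES (?, ?, ?)", (product_id, category_id, xml_bytes))
    db.commit()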

Answer

deefactorial · Nov 13, 2012

Did you ever get this resolved? Have you tried installing http://sourceforge.net/projects/icecatpim/ ? It's very difficult to get going, but it allows you to download the product database into a local MySQL database. Once you have the local DB, you could write a cron script to periodically update it as needed, then access/manipulate the data using your own program.
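For the periodic update, a crontab entry along these lines would do; the script path is a placeholder for whatever update job you end up writing:

# Hypothetical crontab entry: refresh the local Icecat DB nightly at 03:00.
0 3 * * * /usr/bin/python3 /opt/icecat/update_catalog.py >> /var/log/icecat-update.log 2>&1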

There is also this project, which could help: http://code.google.com/p/icecat-import/ . It has code to write data to a local DB; it was originally designed to download the full DB, but there is a lot of data missing.

I've posted a patch for an InnoDB version of the database: http://sourceforge.net/tracker/?func=detail&aid=3578405&group_id=312187&atid=1314070 . I hope that helps. There are a lot of bad references in the database that you have to watch out for.

Here is a cheat sheet for checking and fixing DB referential integrity.

-- Check: does the DB hold referential integrity? Lists rows in Aclass
-- whose key has no matching row in Bclass.
SELECT * FROM Aclass a LEFT JOIN Bclass b ON a.column = b.column WHERE b.column IS NULL;

-- Emulate SET NULL on update: blank out the orphaned keys.
UPDATE Aclass a LEFT JOIN Bclass b ON a.column = b.column SET a.column = NULL WHERE b.column IS NULL;

-- Emulate cascade on delete: remove the orphaned rows instead.
DELETE a FROM Aclass a LEFT JOIN Bclass b ON a.column = b.column WHERE b.column IS NULL;