How does "DHT search engine" work?

user2025043 · Jan 30, 2013 · Viewed 11.1k times

I'm interested in Btdigg.org, which is called a "DHT search engine". According to this article, it doesn't store any content and doesn't even have a database. Then how does it work? Doesn't it need to gather metadata and store it in a database like other normal search engines? Or, after a user submits a query, does it scan the DHT network and return the results in "real time"? Is this possible?

Answer

Arvid · Mar 17, 2013

I don't have specific insight into BTDigg, but I believe the claim that there is no database (or something that acts like a database) is false. The author of that article might have been referring to something more specific that you would encounter on a traditional torrent site, where the actual .torrent files are stored, for instance.

This is how a BTDigg-like site works:

  1. Run a bunch of DHT nodes, specifically for the purpose of "eavesdropping" on DHT traffic, so that you are introduced to the info-hashes people are talking about.
  2. Join those swarms and download the metadata (the .torrent file) using the ut_metadata extension.
  3. Index the information you find in there and map it to the info-hash.
  4. Provide a front-end for that index.
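
To make step 1 a bit more concrete, here is a minimal sketch of how an eavesdropping node might pull an info-hash out of an incoming DHT query. It assumes the KRPC message layout from BEP 5 (a bencoded dictionary where `get_peers` and `announce_peer` queries carry a 20-byte `info_hash` argument). The function names `bdecode` and `observed_info_hash` are made up for this illustration, and this is not a claim about how BTDigg itself does it.

```python
def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                        # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                        # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                        # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    # Otherwise a byte string: <length>:<bytes>
    colon = data.index(b":", pos)
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length


def observed_info_hash(packet: bytes):
    """Return the info-hash named in a get_peers/announce_peer query, else None."""
    try:
        msg, _ = bdecode(packet)
    except (ValueError, IndexError, TypeError, RecursionError):
        return None                         # not valid bencoding, ignore it
    if not isinstance(msg, dict) or msg.get(b"y") != b"q":
        return None                         # only queries are interesting here
    if msg.get(b"q") not in (b"get_peers", b"announce_peer"):
        return None
    args = msg.get(b"a")
    if not isinstance(args, dict):
        return None
    info_hash = args.get(b"info_hash")
    if isinstance(info_hash, bytes) and len(info_hash) == 20:
        return info_hash
    return None
```

In a real crawler you would call something like `observed_info_hash` on every UDP datagram your DHT nodes receive and queue each new hash for the metadata download in step 2.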

If you want to luxury it up a bit, you can also periodically scrape the info-hashes you know about to gather stats over time, and maybe also to figure out when swarms have died out and should be removed from the index.

So the claim that such a site stores neither .torrent files nor any content can still be true; what it keeps is the index built from the metadata.

It is not realistic to search the DHT in real time, because the DHT is not organized around keyword searches: it can only look things up by info-hash. You need to build and maintain the index continuously, "in the background".
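
As a rough illustration of what that background index could look like, here is a small in-memory inverted index that maps keywords from torrent names to info-hashes. The class `TorrentIndex` and its methods are hypothetical names for this sketch; a real site would presumably use a proper search backend and index more fields than the name.

```python
import re
from collections import defaultdict


class TorrentIndex:
    """Toy keyword index: torrent name tokens -> info-hashes."""

    def __init__(self):
        self.names = {}                     # info-hash -> torrent name
        self.postings = defaultdict(set)    # keyword -> set of info-hashes

    @staticmethod
    def tokenize(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, info_hash, name):
        """Index a torrent whose metadata was just downloaded."""
        self.names[info_hash] = name
        for token in self.tokenize(name):
            self.postings[token].add(info_hash)

    def remove(self, info_hash):
        """Drop a torrent, e.g. once its swarm has died out."""
        name = self.names.pop(info_hash, None)
        if name is None:
            return
        for token in self.tokenize(name):
            self.postings[token].discard(info_hash)

    def search(self, query):
        """Return info-hashes whose names contain every query keyword."""
        tokens = self.tokenize(query)
        if not tokens:
            return []
        # .get avoids creating empty entries for unknown keywords.
        matches = [self.postings.get(token, set()) for token in tokens]
        return sorted(set.intersection(*matches))
```

The point is that a query only touches this local structure; the DHT is consulted while building the index, not while answering searches.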

EDIT:

Since this answer was written, an optimization (BEP 51) has been implemented in some DHT clients that lets you ask a node for a sample of the info-hashes it is storing, significantly reducing the cost of indexing.
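
For reference, here is a sketch of what building such a BEP 51 request might look like. It assumes the `sample_infohashes` query as described in BEP 51 (arguments `id` and `target`, with the response carrying a `samples` string of concatenated 20-byte info-hashes). The helpers `bencode`, `sample_infohashes_query` and `split_samples` are names invented for this example, and the code only builds and parses payloads; it does not send anything over the network.

```python
def bencode(value) -> bytes:
    """Encode ints, byte strings, lists and dicts (with bytes keys) as bencoding."""
    if isinstance(value, int):
        return b"i%de" % value
    if isinstance(value, bytes):
        return b"%d:%s" % (len(value), value)
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Bencoded dictionaries list their keys in sorted byte order.
        body = b"".join(bencode(key) + bencode(value[key]) for key in sorted(value))
        return b"d" + body + b"e"
    raise TypeError("cannot bencode %r" % type(value))


def sample_infohashes_query(node_id: bytes, target: bytes, tid: bytes = b"aa") -> bytes:
    """Build a KRPC sample_infohashes query (BEP 51) as a UDP payload."""
    return bencode({
        b"t": tid,                          # transaction id, echoed in the reply
        b"y": b"q",                         # message type: query
        b"q": b"sample_infohashes",
        b"a": {b"id": node_id, b"target": target},
    })


def split_samples(samples: bytes):
    """Split the response's 'samples' string into 20-byte info-hashes."""
    usable = len(samples) - len(samples) % 20   # ignore a trailing partial hash
    return [samples[i:i + 20] for i in range(0, usable, 20)]
```

An indexer would send this payload to nodes that support BEP 51, feed each returned hash into the same metadata-download step as before, and honor the `interval` the node reports before asking it again.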