The tremendous growth of the web has led to a proliferation of information, making it very difficult to find relevant information on the web. Search engines are used for this purpose, and the Tamil search engine described here searches Tamil web pages in response to a user query.
Architecture of a Search Engine
A "search engine" is software that searches for documents on the Internet dealing with a specific topic. The basic architecture of a search engine is shown in Fig. 1.
A search engine consists of two parts, viz. a back-end database (server side) and a GUI (client side) that lets the user type the search term. On the server side, the process involves creating a database and updating it periodically, which is done by a program called the spider. The spider "crawls" web pages periodically and indexes the crawled pages in the database. The hyperlinked nature of the Internet makes it possible for the spider to traverse the web. The interface between the client and server sides matches the posted query against the entries in the database and returns the matched URLs to the user's machine.
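The traversal described above can be sketched as a breadth-first walk over hyperlinks. This is only an illustrative sketch, not the engine's actual spider: the `PAGES` dictionary stands in for the web (a real spider would fetch pages over HTTP), and the URLs, the `crawl` function and the simple `href` pattern are all assumptions made for the example.

```python
from collections import deque
import re

# Hypothetical in-memory "web": URL -> HTML text (assumption for illustration).
PAGES = {
    "http://example.org/a": '<a href="http://example.org/b">B</a> '
                            '<a href="http://example.org/c">C</a>',
    "http://example.org/b": '<a href="http://example.org/c">C</a>',
    "http://example.org/c": '<a href="http://example.org/a">A</a>',
}

# Simplified link pattern; real HTML needs a proper parser.
HREF = re.compile(r'href="([^"]+)"')


def crawl(seed, fetch=PAGES.get):
    """Visit every page reachable from `seed` once, in breadth-first order."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved: skip it
            continue
        order.append(url)
        for link in HREF.findall(html):
            if link not in seen:  # avoid revisiting pages and looping forever
                seen.add(link)
                queue.append(link)
    return order
```

The `seen` set is what keeps the spider from looping when pages link back to each other, as the three sample pages do.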
As stated earlier, the spider crawls the web pages through the hyperlinks. In this process it extracts the 'title', 'keywords', and any other related information needed for the database from the HTML document. Sometimes the entire content of the HTML document (except for stop words, that is, very common words such as "for" and "is") is extracted and indexed in the database. This is based on the idea that a page dealing with a particular issue will have relevant words throughout. Thus, indexing all the words in a document increases the probability of retrieving the relevant URLs for a query. One point is worth noting here: before the query words are searched for in the database, their morphological inflections are removed. The spider is also known by other names, such as "Robot", "Crawler" and "Indexer".
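As a rough illustration of the extraction step, the sketch below pulls the title, the `keywords` meta tag and the body words out of an HTML page with Python's standard `html.parser` module, drops stop words, and records the remaining terms in a simple in-memory index. The stop-word list, the `PageExtractor` class and the `index_page` function are assumptions for the example; the engine's own database layout is not described here, and removal of morphological inflections is not attempted.

```python
from html.parser import HTMLParser

# Illustrative stop-word list (assumption); a real list would be language specific.
STOP_WORDS = {"for", "is", "the", "a", "of", "and"}
PUNCTUATION = ".,;:!?\"'()"


class PageExtractor(HTMLParser):
    """Collects the title, the meta keywords and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.keywords = []
        self.text = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords = [k.strip() for k in content.split(",") if k.strip()]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.text.append(data)


def tokenize(text):
    """Split on whitespace, trim punctuation, lowercase, drop stop words."""
    tokens = (word.strip(PUNCTUATION).lower() for word in text.split())
    return [t for t in tokens if t and t not in STOP_WORDS]


def index_page(url, html, index):
    """Add every non-stop-word term of the page to `index` (term -> set of URLs)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    terms = (tokenize(parser.title)
             + tokenize(" ".join(parser.keywords))
             + tokenize(" ".join(parser.text)))
    for term in terms:
        index.setdefault(term, set()).add(url)
    return parser
```

Splitting on whitespace rather than on a word-character pattern is a deliberate choice here: Tamil vowel signs are combining marks, and a pattern such as `\w+` would break Tamil words apart at those marks.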
Fig. 1 Search Engine Architecture
The database consists of a number of tables arranged to aid quick retrieval of the data. As the number of sites increases, it is common for search engines to maintain more than one database server. For a Tamil search engine, a single database server is sufficient, as the number of sites is comparatively small.
When the user submits a query, the query term(s) are searched for in the database, and the sites in which they occur are identified. These sites are then ranked by their relevance to the user query. The ranked sites are displayed with links and a short description taken from each site, so as to give the user an idea of what the site contains.
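The lookup, ranking and display steps can be sketched as follows. The text does not say how relevance is computed, so this example simply assumes that a page scores one point for every occurrence of a query term; the `build_index` and `search` functions, the sample pages and the fixed-length description are likewise hypothetical.

```python
from collections import Counter


def build_index(pages):
    """Map each term to a Counter of {url: number of occurrences}."""
    index = {}
    for url, text in pages.items():
        for term in text.lower().split():
            index.setdefault(term, Counter())[url] += 1
    return index


def search(index, pages, query, snippet_len=40):
    """Return (url, score, description) tuples, highest score first."""
    scores = Counter()
    for term in query.lower().split():
        for url, count in index.get(term, {}).items():
            scores[url] += count  # assumed relevance: total term occurrences
    # Sort by descending score; break ties by URL so the order is stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    # The description is just the start of the page text.
    return [(url, score, pages[url][:snippet_len]) for url, score in ranked]


# Hypothetical sample data.
pages = {
    "u1": "tamil news tamil music",
    "u2": "tamil cinema",
    "u3": "english news",
}
results = search(build_index(pages), pages, "tamil news")
```

With the sample data, `u1` ranks first because it contains both query terms and one of them twice. A production engine would normally weight terms by rarity and by where they appear (title, keywords, body) rather than counting raw occurrences.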
You can see a sample query term and the corresponding search results in the GUI here.
Supported Encoding Schemes
A major stumbling block to the growth of Tamil in IT is the lack of standardization of fonts, keyboards and related conventions. Although the Tamil Nadu government has standardized the TAMxxx and TABxxx fonts, many other fonts are in wide use. Another popular font is TSCII. The search engine being developed searches only pages created with TAB, TAM and TSCII fonts.
The search engine can be extended in future by adding an English-to-Tamil machine translation system. This would enable the user to search for data in English sites by giving a Tamil query. The search results could then be translated into Tamil before being displayed. However, this also needs a limited amount of Tamil-to-English translation, as the user query has to be translated. This is the most useful extension that can be made, as the user would also be given search results from English sites. Such translation services are already available in some search engines, such as AltaVista and Google, for European and some East Asian languages.