  • Web Search Engines from a Quality Point of View, Intellectual Search Engine
  • Author:www.adsplan.net   Source:Free Articles  


Internet achievements and new problems

The world is on the eve of a new information explosion. During the 1980s and 1990s, computers and information networks rescued the world from a flood of paper. When the Internet first appeared, it was a field of activity for a narrow circle of people. Now the Internet has rejuvenated many sectors, such as shopping, services, manufacturing, health care, education, and much more. In this advanced society, almost every family has a computer, and about 70% have access to the Web via different kinds of networks, including telephone lines, fiber networks, cellular networks, radio, TV, and satellite networks. Today people old and young, from 4 to 100, homemakers and athletes, teachers, actors, and so on, are involved in this process. New technologies, special devices, and tools now expand the abilities of every person, so that anybody can be an artist, writer, or musician. With a computer's help it becomes very easy to draw a picture (on a tablet PC, or on a regular PC with special tools); to write an essay, an article, or a book (with the help of a word processor); to create animation, cartoons, or movies; to compose or mix music; and to publish all of this on the Internet via a blog, clip, article, or website. Similarly, anybody can sell anything without having a store, or buy anything online. New Internet technologies have prepared the soil for new types of social communication and social networking. Where people once gathered in person to discuss something or to write petitions, they can now hold virtual meetings in chat rooms, forums, conferences, or social networks. All of this has laid the groundwork for the new information explosion.

Internet search technology

It is obvious that search is a basic part of the Internet. This part can become a bottleneck for the next generation of Internet technology. Web search engines (SEs) play the role of gatekeepers in the sea of information on the web. Internet search technology rests on an advanced industry: there are thousands of search engines and catalogues; thousands of computers and servers supporting them; thousands, maybe millions, of databases and knowledge bases; and all of this serves millions of users per hour. But we need to understand that search technology is still in its infancy. This is why Internet users still encounter many problems. Let's describe these problems:
1. The growth in the volume of different kinds of material (text and non-text) indexed by search systems has led to a situation where practically any query returns a great many matching documents (sometimes thousands or even millions). Usually this exceeds the maximum that one person is capable of processing. For a professional analyst the maximum is several hundred documents; for a nonprofessional the limit is naturally much lower.
2. As a consequence, the Search Engine Result Page (SERP) shows only a limited number of document titles, so the user can see only part of the results. For example, suppose you search for "Computer Buyers Guide" with Google. There are 12,500 titles, but you can see only 3,000, i.e. about one fourth of them. This approach is sometimes right, but in some cases the user needs to see all of the information.
3. The main search engines, such as Google, Yahoo, or MSN, are highways for delivering and receiving information. Have you ever seen dirty highways, or highways strewn with garbage? These search engines return vast amounts of garbage in their SERPs. You can find a lot of obsolete and meaningless information. Sometimes when you click on a chosen item you get a 404 Not Found error, or you find many documents with the same content.
4. Today the web position (the ordinal number of a title in the list of web pages in the SERP) is defined by the SE's algorithm. Under this algorithm (which differs from one SE to another), web position depends indirectly on the content of the page, first of all on the number and location of keywords in the web page. If content is spread over several pages of a website, the website and its pages can receive a worse position than a single page in which the keywords are concentrated.
5. Web position can vary at any time and may change without visible reason. Keeping a web page in the same position, even for a short time, is very often a Sisyphean task.
6. Because the user usually does not know, or does not use, the keywords needed to retrieve the material, he or she has to try synonyms, keyword phrases, and so on. This is why an average user cannot find the needed information in one attempt. Usually he or she needs to make many searches with different keywords, keyword phrases, and sometimes pictures, using several search engines.
7. The meta tag was invented to help search engines classify a web page. It has now lost this role.
8. Practically any website has many web positions, depending on the keywords used.
9. Many search engines use link popularity as an indirect method of evaluating the quality of websites. Some search engines require at least one link pointing to a website; otherwise they drop it from their index. Search engines such as Google use a special link analysis system to rank web pages. The idea behind link popularity is that the more important a site is, the more links it has, while content-poor sites have few links. The link popularity criterion also assumes that not all incoming links are equal: the quality of incoming links counts for more than their number. This method has many disadvantages:
· Common sense suggests that this method correlates only indirectly with the real quality of a website, because many companies exist to collect such links; the supposed direct correlation has not been proven in reality. Companies spend large sums on marketing or use spamdexing to obtain more links. Suppose that a Mrs. X created a website with very good and useful content for moms all around the world. Most moms obviously don't have websites, so a website with close to ideal content will not get a sufficient number of back links and, as a result, will not reach the number 1 position. The best its visitors can do is bookmark it. To put it plainly, using link popularity is like evaluating the quality of a man or woman by the number of their love affairs.
· Link popularity is the root of the artificial pollution of the Internet, because a majority of sites, blogs, and advertisements are created with the goal of inflating this criterion; in fact they are junk advertisement pages designed for SEO or PPC purposes.
· Collecting back links takes time, sometimes several years. This means a good-quality site can rank highly only after several years, and many companies die before then.
· Dividing links into good and bad is very subjective.
· In many search engines the quality of a website is evaluated by special programs called soft robots, crawlers, or bots (Googlebot, Yahoo Slurp, MSNBot, and so on). No intelligent program has an intellect equal to a human's. It follows that any evaluation made with such a program contains mistakes. This is one of the reasons why we have so much garbage on the web.
10. As is known, the Czech dramatist Karel Čapek introduced the term "robot" in 1920 in his science fiction drama R.U.R. (Rossum's Universal Robots) to describe its mechanical people. His intelligent machines were intended as servants for their human creators, but they end up taking over the world and destroying humanity. Now soft robots judge humans, and most people don't like this. Maybe this is the beginning of a local war with the robots?
11. Modern search engines have no means of discovering the real polysemy (multiple meanings) of keywords, and cannot search in any depth. The main reason for these shortcomings lies in the obsolete algorithms used to find the needed information in a short time. Many search algorithms are also tied to commercial interests, with the goal of creating maximum profit for the search system.

Quality of Internet service

According to the theory of quality, the quality of a product or service is the degree to which it meets the customer's expectations. Quality has no specific meaning unless it is related to a specific function and/or object. If you have evaluation criteria, you can calculate a quality value for each criterion and so compare the quality of any products or services. If an SE helps you find what you wanted in a short time, that is good: the less time, the higher the quality of the SE. If you need only a minimum of actions to find what you wanted, that is also good: the fewer actions (for example, number of clicks, number of keywords used, and so on), the higher the quality of the SE. If you found nothing after many efforts, the quality of the SE from your point of view is zero. So we can measure the quality of an SE on a scale, for example from zero to ten, or from zero to 100%.

Quality of search

What is the quality of a search, and how can it be measured? It is now a common opinion that search engines satisfy a little more than 50 percent of the needs of most searchers. But this is the wrong approach to evaluation. The quality of a search can be measured more accurately with the help of the following criteria:
· quality of information
· relevance
· average time of search for one attempt (click)
· number of attempts to find a needed result
· robustness of the search to keywords and their synonyms
· stability of the results over some period of time
· novelty of information
· depth of search
· and so on.
Only under a special agreement between the owners of the main SEs could the quality of an SE be evaluated as a special function of the criteria above. Comparing the values of this function would help to find the best SE.
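One possible form of such a function is a weighted average of the criteria listed above. The sketch below is only an illustration: the criterion names, the weights, and the assumption that each criterion has already been scored between 0 and 1 are all hypothetical and are not taken from any real search engine.

```python
# Hypothetical weights for the criteria listed above (they sum to 1.0).
# The article does not specify weights; these numbers are assumptions.
WEIGHTS = {
    "information_quality": 0.25,
    "relevance": 0.25,
    "speed": 0.10,
    "few_attempts": 0.15,
    "synonym_robustness": 0.10,
    "stability": 0.05,
    "novelty": 0.05,
    "depth": 0.05,
}


def engine_quality(scores, weights=WEIGHTS):
    """Combine per-criterion scores (each in [0, 1]) into a value in [0, 100]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing criteria: {sorted(missing)}")
    for name in weights:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must be between 0 and 1")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * scores[name] for name in weights)
    return 100.0 * weighted_sum / total_weight


def best_engine(engines):
    """Return the name of the engine with the highest quality value.

    `engines` maps an engine name to its dictionary of criterion scores.
    """
    return max(engines, key=lambda name: engine_quality(engines[name]))
```

With scores collected for several engines under the same agreed criteria, `best_engine` simply picks the one with the largest value of the function. The hard part in practice is agreeing on the criteria and weights, not the arithmetic.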

Quality of retrieved information

To separate genuine information from second-rate information or outright garbage, information must be ranked by quality. This is very important today, because on the Internet a serious book and a single phrase posted in a forum several years ago can be treated as having the same quality. What can indicate the quality of information? First, the quantity of information, measured in bits or bytes. Obviously any book should rank higher than an article or an ad, and any video or movie higher than text. Second, the number of categories (not keywords, but categories from, for example, the DMOZ classification) covered in the text and in the headers. Third, the completeness of the information. If you are looking for a computer for yourself, you need to know its features (or at least its main features). Therefore items in the search database that list more features should have higher web positions. Finally, there are the novelty and reliability of the information.

Indicator of the quality of retrieved information

Today the best indicator of the quality of information is the web position number. Unfortunately there is a harsh reality: commercial search. Search systems must earn income, and for this purpose they include sponsored ads and other types of ads in the SERP. Therefore the web position number is not an accurate indicator. High-quality search and the existence of commercial search systems are poorly compatible.

Relevance

The result of a search must be highly relevant to the goal of the information retrieval. It is impossible to measure relevance by keywords with 100% accuracy, because relevance is a subjective category. What is relevant for one requester (a person or a soft robot) might not be relevant for another. How can the degree of relevance really be measured? In many search algorithms relevance is defined with the help of keywords and their location. Search engines mostly evaluate relevance in a one-dimensional space, either by keywords or by keyword phrases. In practice, two parameters are used for an overall evaluation of relevance: precision (the fraction of returned documents that are relevant) and recall (the fraction of all relevant material that is returned by the search, where "all" includes relevant material that was not found). Search systems such as eBay make wide use of user feedback or polls to evaluate relevance.
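As a small illustration of the two parameters, the sketch below computes precision and recall for one query. It assumes that someone has already judged which documents are relevant; the document identifiers are made up for the example.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for one query.

    returned -- identifiers of the documents the search engine returned
    relevant -- identifiers of all documents judged relevant, found or not
    """
    returned = set(returned)
    relevant = set(relevant)
    hits = returned & relevant  # relevant documents that were actually returned
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: four documents returned, three relevant overall.
precision, recall = precision_recall(
    returned=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d5"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the four returned documents are relevant, so precision is one half, and two of the three relevant documents were found, so recall is two thirds.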

Speed of search

Search today is very fast, because when we perform a search we are not actually searching the web pages themselves but a kind of copy of them (the index), created and stored by the crawler at some point in the past. But for some kinds of information the user can wait: for him or her, the quality of the information received is more important than the speed of the search.
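The idea of searching a stored copy instead of the live pages can be sketched with a toy inverted index, a common structure for this purpose. This is an assumption made for illustration; the article does not say which structure any particular search engine uses, and the URLs and page texts below are invented.

```python
from collections import defaultdict


def build_index(pages):
    """Build a word -> set of URLs mapping from already crawled pages.

    pages -- dict mapping a URL to the page text saved by the crawler
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def lookup(index, word):
    """Answer a one-word query from the index, without touching the pages."""
    return index.get(word.lower(), set())


# Hypothetical crawl results.
crawled = {
    "example.com/a": "Best laptop computer",
    "example.com/b": "Laptop bag for travel",
}
index = build_index(crawled)
print(sorted(lookup(index, "laptop")))
```

The lookup is a single dictionary access, which is why the answer arrives quickly. The price is that the index reflects the pages as they were when crawled, so it can contain obsolete entries.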

Reliability of information

The reliability of information depends on the reliability of its source. Today we have some tools for evaluating the trustworthiness of sources on the Internet, such as Whois, Alexa, BBB reliability seals, and so on. Unfortunately this information does not affect page rank, and it does not even prevent absolutely obsolete and false or spam information from circulating on the Internet. I was personally familiar with a company called Digitalsquare, Inc. It went out of business 5 years ago, but to this day anyone can find its site on the Internet, along with more than 300 links in Yahoo and other search engines.

Search engine optimization

Search engine optimization (SEO) was once practiced almost exclusively by those who wanted their sites to be found and listed by search engines and directories. Now search is one of the hottest topics in online marketing. Search engine marketing is critical to sales, visibility, advertising, and even branding. Today the general goal of SEO is to be within the top 10 positions. All methods of promoting sites in search systems can be divided into two groups: methods that exploit flaws in search algorithms (group A), and methods of real optimization (group B). The methods of group A were invented to deceive the search system. One example is cloaking, which is technically simple to implement. When a request arrives at the web server, a prepared cloaking script runs instead of the public version being served directly. The script compares the IP address of the requester with a database of IP addresses of known spiders. If the IP address belongs to a spider, the corresponding optimized version is served. If the address is not in the database, the public version is shown. The methods of group B are described in many books and articles. To apply these methods fully, one needs to know the positioning algorithm. Because this is impossible for most search engines, it looks like an unfair game without rules. The methods of real optimization can play their part in the progress of search systems.

Criteria of quality for Web positioning

Traditionally, the scoring and ranking of documents is based on how well the content of each document matches the query. Obviously the quality requirements must differ for different types of information: one set for patents, another for material concerning children, another for pornography or gambling, and so on. Here is a list of quality components that may have an effect:
· weight of keyword metrics, including relevance,
· trust grade of the website, blog, and so on,
· weight of the age of the material,
· weight of topicality,
· and so on.
On the basis of these components, a goal function can be composed for mathematical optimization methods.

Stability of web position

Nothing is constant in our world, especially web position. One can spend thousands of dollars to get a website into the top 10, and just a month later that buyer may no longer find the website in the top 10, or sometimes even in the top 100. The reason lies in the absence of evaluation by quality. We all know some evergreen books and articles by classic writers or scientists in literature, cybernetics, or mathematics. The web position of really valuable material should stay the same for a long time (though of course not forever).

Technology of sequential multi-dimensional search

Existing search technologies are built on the basic principle of looking for a needle in a haystack. This is the main reason why we get millions of titles for the words "computer", "hi-fi", and so on. These keyword-based technologies usually cannot find what the user is looking for, for the many reasons described above, and also sometimes because he or she cannot explicitly formulate what they want or does not know what to call it (for example, monitor or display, laptop or notebook, "speed of computer" or "frequency of computer", and so on). The more items there are in the search engines' databases, the less effective this approach becomes. To narrow the search, the user can write a keyword phrase or use operators such as OR, AND, +, and so on, but this usually does not work for long keyword phrases. I am sure that different search algorithms must be used for different kinds of information retrieval. Below I would like to draw attention to algorithms of sequential multi-dimensional orthogonal search for a given array of keywords. With this type of search algorithm, relevance can be calculated over many keywords and can therefore be higher. How can this model work? You can compare it with existing models. Suppose that you want to find the best new computer for yourself. For reference, more than 5 million types of computers exist on the US market. If you enter the phrase "best computer" without quotation marks, you will receive about 800,000,000 results in Google, and about 1,000,000 with quotation marks. Let us continue the search using quotation marks only. Then, after some thought, you realize that what you really want is the best laptop computer. For this phrase Google returns about 15,000 results. You can narrow the search further by entering "best new laptop computer". You will receive only one result: a Dell computer (January 2007). You needed only 4 searches to get what you wanted.
But this result was not entirely accurate, because existing search systems do not really allow the search space to be narrowed and are designed for search phrases of 2-3 words. In traditional search systems, when the user searches by several keywords, the search engine tries to find the answer in a one-dimensional space for each keyword. If the main table of the database has 1000 rows and the user enters X keywords, the SE needs to check all 1000 rows X times. Because the user does not know the value of X in advance, he or she may make Y searches, and Y can be much greater than X. The idea of sequential multi-dimensional search is to divide the search space into subspaces. In other words, the best way to find a needle in a haystack is to divide the haystack into parts, where possible estimating which part is most likely to contain the needle (though sometimes without such an estimate). In a sequential search the SE looks for the answer in a multi-dimensional space across the keywords. The difference is this: if the SE finds, for example, 100 rows in the database for the first keyword, the search by the next keyword is carried out within these 100 rows; if that search finds 50 answers, the search by the next keyword is carried out within these 50 rows; and so on. The search time should obviously be shorter. What are the advantages of sequential search from the points of view of searchers, webmasters, and advertisers?
From the point of view of searchers:
· on average, fewer attempts should be needed to find the desired item,
· the number of items held in the search engine's RAM for one search session should become much smaller,
· the user will receive more relevant items in the results,
· if the user cannot define what he or she wants at the beginning of the search, it becomes easier to find the next keyword at the second or third step.
From the point of view of webmasters and advertisers:
· webmasters and advertisers will concentrate on the real benefits of their products in order to receive a higher web position,
· they will stop their Sisyphean SEO labor of searching for the best keywords for a top 10 ranking and collecting unneeded back links. By the way, nobody has calculated how much time and money is spent on this labor, for example per year in the USA. I think it amounts to millions of dollars, and maybe much more than the cost of operating the search engines,
· webmasters and advertisers won't create junk advertisement pages designed for SEO or PPC purposes,
· we won't have such discrimination as Google's supplemental index.
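The narrowing step of the sequential search described above (first 100 rows, then 50 of those, and so on) can be sketched as follows. This is a minimal illustration under simplifying assumptions: the database is a plain list of text rows, a row matches a keyword when it contains that word, and the sample rows are invented.

```python
def sequential_search(rows, keywords):
    """Filter rows keyword by keyword, scanning only the survivors each time.

    Returns the final candidate rows and the number of candidates left
    after each keyword, so the shrinking search space is visible.
    """
    candidates = list(rows)
    sizes = []
    for keyword in keywords:
        keyword = keyword.lower()
        # Only the rows that survived the previous keywords are checked.
        candidates = [row for row in candidates if keyword in row.lower().split()]
        sizes.append(len(candidates))
        if not candidates:
            break  # nothing left to narrow
    return candidates, sizes


# Hypothetical database rows.
database = [
    "best new laptop computer",
    "best desktop computer",
    "used laptop computer",
    "laptop bag",
    "best new phone",
]
found, sizes = sequential_search(database, ["computer", "laptop", "new"])
print(found, sizes)
```

In this toy run the first keyword is checked against all five rows, the second against the three survivors, and the third against two, instead of checking every row for every keyword. Showing the intermediate candidates to the user at each step is also what would make it easier to choose the next keyword.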

Intellectual search engine

With the help of an intellectual dialog system, this type of search engine can solve 2 main problems of traditional search technology, completeness and lexical polysemy, by combining different mechanisms in the process of formulating queries: lexicographical, frequency analysis, and catalogue. The lexicographical mechanism makes it possible to discover all the multiple meanings and to build another query list using synonyms of the keywords originally entered by the user, plus additional words that explain the other meanings of those keywords. Algorithms based on the frequency mechanism use a list of frequency clusters that include the keywords originally entered by the user. A frequency cluster is a list of the words that most often occur together with the given keywords in texts. The catalogue mechanism makes it possible, on the basis of the lexical values of the entered keywords, to switch over to a list of existing catalogue categories. We now have all the preconditions for creating and developing such systems.

Dr. Yuri Iserlis is president of Clever Ace (former president of Intellectual systems, Inc. in Russia). He can be reached at yiserlis@cleverace.com


