 The Problem Of Demoting Spam On The Internet: Yahoo!’s Trustrank Approach
Stephen Taylor

TrustRank is an attempt to counter the web spamming activities that threaten to deceive search engines' ranking algorithms. It propagates trust among web pages in much the same way that PageRank propagates authority. Tests show, however, that combining trust and distrust values demotes spam sites more effectively than using trust values alone.



The Assumption

A link between two pages carries an implied conveyance of trust from the source page to the target page. Linking to a page is a vote of confidence from the source that the target can provide content of value to the user. The model rests on the ideal that good sites point only to similarly good sites and will not knowingly refer people to spam sites. People's trust in these good sites is then propagated through the link structure of the web.

TrustRank hopes to use a set of highly trusted seed sites to help demote web spam. The approach assigns a non-zero initial trust score to each seed site and an initial score of zero to every other site. A biased PageRank algorithm then propagates these initial trust scores along outgoing links, so that after convergence good sites are expected to hold decent trust scores while spam sites are likely to hold low ones.

The possibility that a page points to a spam page increases with the number of links it carries, so it has been proposed that the trust score of a parent page be split equally among its child pages. When a page has multiple parents, the question arises of how to reconcile the differing trust contributions; TrustRank resolves it by simple summation, which has not been entirely effective in curtailing spam sites' efforts to raise their rankings. A sketch of the scheme follows.
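To make the propagation concrete, here is a minimal Python sketch of the biased-PageRank scheme just described: trust starts at the seed pages, is split equally among a page's out-links, summed at each target, and decays with distance from the seed set through the damping factor. The toy graph and parameter values are illustrative assumptions, not Yahoo!'s actual implementation.

Code:
def trustrank(out_links, seeds, beta=0.85, iterations=20):
    """out_links: page -> list of pages it links to.
    seeds: set of hand-checked trustworthy pages (assumed given)."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    # Biased jump vector: non-zero only on the trusted seed pages.
    d = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    t = dict(d)  # initial trust scores
    for _ in range(iterations):
        nxt = {p: (1 - beta) * d[p] for p in pages}
        for p, qs in out_links.items():
            if qs:
                share = beta * t[p] / len(qs)  # split trust equally among children
                for q in qs:
                    nxt[q] += share            # simple summation over parents
        t = nxt
    return t

# Toy web: 'good' is the trusted seed; the spam pages are never
# linked from the good region, so no trust reaches them.
graph = {
    "good":  ["a", "b"],
    "a":     ["b"],
    "b":     ["good"],
    "spam":  ["spam2"],
    "spam2": ["spam"],
}
scores = trustrank(graph, seeds={"good"})
print(sorted(scores.items(), key=lambda kv: -kv[1]))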

The conveyance of distrust emerged as a natural extension of the conveyance of trust between links. Distrust may indicate a lack of confidence in a source page because of its link to an untrustworthy page: once a page is found linking to a known spam page, its trust judgments can no longer be considered valid.
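One way to realize this, often called Anti-TrustRank in the literature, is to propagate distrust backwards from known spam pages, since linking to spam casts doubt on the linker. The hedged sketch below reuses trustrank() and graph from the previous sketch; the spam seed set and the combination weight are illustrative assumptions, chosen only to show how trust and distrust scores can be combined as the article's opening paragraph suggests.

Code:
def reverse(out_links):
    """Build the reversed link graph: page -> pages that link to it."""
    rev = {}
    for p, qs in out_links.items():
        rev.setdefault(p, [])
        for q in qs:
            rev.setdefault(q, []).append(p)
    return rev

# Distrust flows backwards from a known spam seed.
distrust = trustrank(reverse(graph), seeds={"spam"})
# Combined score: trust minus (assumed) weighted distrust.
combined = {p: scores[p] - 0.5 * distrust.get(p, 0.0) for p in scores}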

TrustRank, as originally conceived, proposed that trust should diminish as we move further away from the seed set of trusted pages. The limited number of seed pages, however, makes it impossible for propagation to touch the whole web; a well-performing algorithm is needed to produce trust judgments for a larger fraction of web pages.

The seed sets used may not sufficiently represent the different topics of the web. TrustRank tends to show a bias towards larger communities, which can be remedied by using topical information to divide the seed set and calculate trust scores separately for each topic. Pages listed in well-maintained topic directories can help resolve the coverage issue, and seed filtering can remove low-quality or even spam pages that may have inadvertently been included in the pool of seed pages.

Much work is being done on methods that do not rely heavily on human judgment to identify spam-free pages. As it is, searchers are hard pressed to locate pages that serve their needs rather than pages engineered for high search-engine rankings, and sites that provide no value to users are simply too numerous to ignore.

Semantic Cloaking on the Web

Semantics is the study of meaning in language: it takes words and compares them with other words or symbols to determine their relevance and relationships. Semantic cloaking is the practice of supplying different versions of a web page to search engines and to browsers, with the aim of hiding the page's real content from search engines. The difference in meaning between the versions is intended to deceive search engines' ranking algorithms. Cloaking is one type of search engine spamming technique that allows non-relevant pages to occupy top rankings in searches.

People use search engines to find the most relevant responses to their queries. Since users typically view only the first page of results, sites compete hard for the top rankings, particularly for popular queries, and for a commercial website increased traffic translates directly into profit.

Reputable content providers work hard to produce high-quality web pages in pursuit of high rankings. Unfortunately, not all content providers share that view; some instead try to reach high rankings by manipulating the page features that search engines' ranking algorithms rely on.

Ranking algorithms assume that page content is real, that is, that the content seen by search engines is identical to what actual users see in their browsers. The web spamming technique of cloaking breaks this assumption by supplying different versions, causing considerable confusion and disappointment for users.

Cloaking falls under the page-hiding category of search engine spamming techniques, though some cloaking behavior is considered acceptable. Cloaking comes in two types, syntactic and semantic. Syntactic cloaking covers all situations in which different content is sent to a crawler and to a real user; semantic cloaking is the subset that exploits differences in meaning between the versions to deceive search engines' ranking algorithms.

Syntactic cloaking may be acceptable in cases such as web servers embedding session identifiers in the URLs of copies sent to browsers but not in copies sent to crawlers; the server is merely using the identifiers to differentiate its users, though a search engine may interpret them as a change in the page. The cloaking behavior that needs to be penalized is semantic cloaking.
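As a hypothetical illustration of the session-identifier case, the sketch below normalizes URLs by dropping common session parameters so that a crawler can treat the browser's copy and its own as the same page. The parameter names are common conventions assumed for the example, not a standard list.

Code:
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}  # assumed names

def normalize(url):
    """Drop session-identifier query parameters from a URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(normalize("http://example.com/page?sid=abc123&topic=42"))
# -> http://example.com/page?topic=42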

There are various proposals for countering the problem. One suggests comparing copies of a page from both the browser's perspective and the crawler's perspective; two or more copies from each side may be needed to detect cloaking. Another suggests a two-step process that requires fewer resources: the first step applies a heuristic filter to eliminate web pages that cannot exhibit cloaking, and every page that survives goes through a second step in which features are extracted from about four copies and a classifier determines whether semantic cloaking is taking place. The reality remains, however, that no ideal solution has yet been found to curb semantic cloaking effectively. It is a technique that no one who wants to maintain good business ethics should practice, yet it continues to undermine search engines' attempts to provide users with the information they actually need.
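Here is a minimal sketch of the first, filtering step, under the assumption that the two perspectives are simulated with different User-Agent strings and that a simple word-overlap threshold serves as the heuristic; neither detail is drawn from the proposals themselves.

Code:
import re
import urllib.request

BROWSER_UA = "Mozilla/5.0"     # assumed browser identity
CRAWLER_UA = "ExampleBot/1.0"  # assumed crawler identity

def fetch_words(url, user_agent):
    """Fetch a page under the given identity and return its word set."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return set(re.findall(r"[a-z0-9]+", html.lower()))

def looks_cloaked(url, threshold=0.5):
    """Flag pages whose browser and crawler copies overlap too little."""
    a = fetch_words(url, BROWSER_UA)
    b = fetch_words(url, CRAWLER_UA)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 1.0
    return jaccard < threshold  # low overlap: send to the second step

# Pages that survive this filter would then have several copies fetched
# from each side, with a trained classifier making the final decision.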

About The Author:
http://www.theinternetone.net
