 The Problem Of Demoting Spam On The Internet: Yahoo!’s Trustrank Approach
Stephen Taylor

TrustRank is an attempt to counter the web spamming activities that threaten to deceive search engines' ranking algorithms. It propagates trust among web pages in much the same way that PageRank propagates authority. Tests show, however, that combining trust and distrust values demotes spam sites more effectively than using trust values alone.



The Assumption

A link between two pages carries an implied conveyance of trust from the source page to the target page. Linking to a page is a vote of confidence from the source that the target can provide content of value to the user. The model rests on the ideal that good sites point only to similarly good sites and will not knowingly refer people to spam sites. People's trust in these good sites is then propagated through the link structure of the web.

TrustRank hopes to use a set of highly trusted seed sites to help demote web spam. The approach assigns a non-zero initial trust score to each seed site and an initial score of zero to every other site. A biased PageRank algorithm then propagates these initial trust scores along outgoing links, so that after convergence good sites are expected to hold decent trust scores while spam sites are likely to hold low ones.

The possibility that a page points to a spam page increases with the number of links it carries, so it has been proposed that the trust score of a parent page be split equally among its child pages. When a page has multiple parents, the question arises of how to reconcile the differing trust contributions; TrustRank resolves it by simple summation, which has not been entirely effective in curtailing spam sites' efforts to raise their rankings. A sketch of the scheme follows.
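To make the propagation concrete, here is a minimal Python sketch of the biased-PageRank scheme just described: trust starts at the seed pages, is split equally among a page's out-links, summed at each target, and decays with distance from the seed set through the damping factor. The toy graph and parameter values are illustrative assumptions, not Yahoo!'s actual implementation.

Code:
def trustrank(out_links, seeds, beta=0.85, iterations=20):
    """out_links: page -> list of pages it links to.
    seeds: set of hand-checked trustworthy pages (assumed given)."""
    pages = set(out_links) | {q for qs in out_links.values() for q in qs}
    # Biased jump vector: non-zero only on the trusted seed pages.
    d = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    t = dict(d)  # initial trust scores
    for _ in range(iterations):
        nxt = {p: (1 - beta) * d[p] for p in pages}
        for p, qs in out_links.items():
            if qs:
                share = beta * t[p] / len(qs)  # split trust equally among children
                for q in qs:
                    nxt[q] += share            # simple summation over parents
        t = nxt
    return t

# Toy web: 'good' is the trusted seed; the spam pages are never
# linked from the good region, so no trust reaches them.
graph = {
    "good":  ["a", "b"],
    "a":     ["b"],
    "b":     ["good"],
    "spam":  ["spam2"],
    "spam2": ["spam"],
}
scores = trustrank(graph, seeds={"good"})
print(sorted(scores.items(), key=lambda kv: -kv[1]))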

The conveyance of distrust emerged as a natural extension of the conveyance of trust between links. Distrust may indicate a lack of confidence in a source page because of its link to an untrustworthy page: once a page is found linking to a known spam page, its trust judgments can no longer be considered valid.
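One way to realize this, often called Anti-TrustRank in the literature, is to propagate distrust backwards from known spam pages, since linking to spam casts doubt on the linker. The hedged sketch below reuses trustrank() and graph from the previous sketch; the spam seed set and the combination weight are illustrative assumptions, chosen only to show how trust and distrust scores can be combined as the article's opening paragraph suggests.

Code:
def reverse(out_links):
    """Build the reversed link graph: page -> pages that link to it."""
    rev = {}
    for p, qs in out_links.items():
        rev.setdefault(p, [])
        for q in qs:
            rev.setdefault(q, []).append(p)
    return rev

# Distrust flows backwards from a known spam seed.
distrust = trustrank(reverse(graph), seeds={"spam"})
# Combined score: trust minus (assumed) weighted distrust.
combined = {p: scores[p] - 0.5 * distrust.get(p, 0.0) for p in scores}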

TrustRank, as originally conceived, proposed that trust should diminish as we move further away from the seed set of trusted pages. The limited number of seed pages, however, makes it impossible for propagation to touch the whole web; a well-performing algorithm is needed to produce trust judgments for a larger fraction of web pages.

The seed sets used may not sufficiently represent the different topics of the web. TrustRank tends to show a bias towards larger communities, which can be remedied by using topical information to divide the seed set and calculate trust scores separately for each topic. Pages listed in well-maintained topic directories can help resolve the coverage issue, and seed filtering can remove low-quality or even spam pages that may have inadvertently been included in the pool of seed pages.

Much work is being done on methods that do not rely heavily on human judgment to identify spam-free pages. As it is, searchers are hard pressed to locate pages that serve their needs rather than pages engineered for high search-engine rankings, and sites that provide no value to users are simply too numerous to ignore.

Semantic Cloaking on the Web

Semantics is the study of meaning in language: it takes words and compares them with other words or symbols to determine their relevance and relationships. Semantic cloaking is the practice of supplying different versions of a web page to search engines and to browsers, with the aim of hiding the page's real content from search engines. The difference in meaning between the versions is intended to deceive search engines' ranking algorithms. Cloaking is one type of search engine spamming technique that allows non-relevant pages to occupy top rankings in searches.

People use search engines to find the most relevant responses to their queries. Since users typically view only the first page of results, sites compete hard for the top rankings, particularly for popular queries, and for a commercial website increased traffic translates directly into profit.

Reputable content providers work hard to produce high-quality web pages in pursuit of high rankings. Unfortunately, not all content providers share that view; some instead try to reach high rankings by manipulating the page features that search engines' ranking algorithms rely on.

Ranking algorithms assume that page content is real, that is, that the content seen by search engines is identical to what actual users see in their browsers. The web spamming technique of cloaking breaks this assumption by supplying different versions, causing considerable confusion and disappointment for users.

Cloaking falls under the page-hiding category of search engine spamming techniques, though some cloaking behavior is considered acceptable. Cloaking comes in two types, syntactic and semantic. Syntactic cloaking covers all situations in which different content is sent to a crawler and to a real user; semantic cloaking is the subset that exploits differences in meaning between the versions to deceive search engines' ranking algorithms.

Syntactic cloaking may be acceptable in cases such as web servers embedding session identifiers in the URLs of copies sent to browsers but not in copies sent to crawlers; the server is merely using the identifiers to differentiate its users, though a search engine may interpret them as a change in the page. The cloaking behavior that needs to be penalized is semantic cloaking.
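As a hypothetical illustration of the session-identifier case, the sketch below normalizes URLs by dropping common session parameters so that a crawler can treat the browser's copy and its own as the same page. The parameter names are common conventions assumed for the example, not a standard list.

Code:
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

SESSION_PARAMS = {"sid", "sessionid", "phpsessid", "jsessionid"}  # assumed names

def normalize(url):
    """Drop session-identifier query parameters from a URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(normalize("http://example.com/page?sid=abc123&topic=42"))
# -> http://example.com/page?topic=42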

There are various proposals for countering the problem. One suggests comparing copies of a page from both the browser's perspective and the crawler's perspective; two or more copies from each side may be needed to detect cloaking. Another suggests a two-step process that requires fewer resources: the first step applies a heuristic filter to eliminate web pages that cannot exhibit cloaking, and every page that survives goes through a second step in which features are extracted from about four copies and a classifier determines whether semantic cloaking is taking place. The reality remains, however, that no ideal solution has yet been found to curb semantic cloaking effectively. It is a technique that no one who wants to maintain good business ethics should practice, yet it continues to undermine search engines' attempts to provide users with the information they actually need.
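Here is a minimal sketch of the first, filtering step, under the assumption that the two perspectives are simulated with different User-Agent strings and that a simple word-overlap threshold serves as the heuristic; neither detail is drawn from the proposals themselves.

Code:
import re
import urllib.request

BROWSER_UA = "Mozilla/5.0"     # assumed browser identity
CRAWLER_UA = "ExampleBot/1.0"  # assumed crawler identity

def fetch_words(url, user_agent):
    """Fetch a page under the given identity and return its word set."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return set(re.findall(r"[a-z0-9]+", html.lower()))

def looks_cloaked(url, threshold=0.5):
    """Flag pages whose browser and crawler copies overlap too little."""
    a = fetch_words(url, BROWSER_UA)
    b = fetch_words(url, CRAWLER_UA)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 1.0
    return jaccard < threshold  # low overlap: send to the second step

# Pages that survive this filter would then have several copies fetched
# from each side, with a trained classifier making the final decision.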

About The Author:
http://www.theinternetone.net
