Tuesday, April 26, 2011

Anchor Text Identification from Web Pages (or from Our Web Crawler Data)



Anchor Text Identification
Anchor Text
The anchor text is the visible, clickable text in a hyperlink. The words in the anchor text can influence the ranking that search engines give the page. Some web browsers can show a tooltip for a hyperlink before it is selected. Not all links have anchor text, because the context in which a link is used may already make it obvious where the link leads. Anchor texts normally stay below 60 characters, and different browsers display them differently.
Anchor text usually gives the user relevant descriptive or contextual information about the content of the link's destination. The anchor text may or may not be related to the actual text of the URL of the link. For example, a hyperlink with anchor text might take this form:
<a href="http://www.adventureescapades.co.za/activity-hanggliding"> Hang Gliding</a>

The anchor text in this example is “Hang Gliding”; the unwieldy URL http://www.adventureescapades.co.za/activity-hanggliding displays on the web page as Hang Gliding, contributing to clean, easy-to-read text.
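
To make the definition concrete, here is a minimal sketch that pulls (href, anchor text) pairs out of an HTML fragment using Python's standard html.parser module. The class and function names (AnchorTextParser, extract_anchor_texts) are our own illustrative choices and are not part of the identifier described in this post.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, anchor text) pairs from an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> that is currently open, if any
        self._chunks = []   # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        # Only <a> tags that carry an href are treated as hyperlinks.
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._chunks = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is collected as well.
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace and trim the ends.
            text = " ".join("".join(self._chunks).split())
            self.links.append((self._href, text))
            self._href = None


def extract_anchor_texts(html):
    """Return a list of (href, anchor text) pairs found in html."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling extract_anchor_texts on the Hang Gliding link above yields one pair: the URL and the text "Hang Gliding" with the leading space trimmed.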

Why Do We Need Anchor Text?
For the single-profile parser in Business Search, the company name is the anchor text, so we need the anchor text to obtain the company name. For a single profile, the anchor text is the same as the company name, as shown in Figure 1.

Figure 1: Sample Web Page
When we click any URL in the above figure, it leads to a single profile. Figure 2 shows the profile for the first URL; its anchor markup is:
<a href="compdet_company.asp?screen=Show&amp;iC_ID=27337" target="_parent">Altech Netstar Fleet Management Services (Pty) Ltd</a>

Figure 2: Sample Profile page
Architecture
The system architecture of the anchor text identifier is shown in Figure 3.


Figure 3: System Architecture of Anchor Text

Approach
Input: URLs
Output: Anchor Text of corresponding URL
The anchor text of a profile page is found on a page at the previous depth, i.e. a page that links to the profile.
If the URL is at depth n, then
            its anchor text will be found in the depth (n-1) pages of the current HOST.
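
The depth rule above can be sketched as a simple filter. This is only an illustration: the function name candidate_parent_pages and the (url, depth) pair layout of crawled_pages are assumptions made for the example, not the crawler's real data structures.

```python
from urllib.parse import urlsplit


def candidate_parent_pages(profile_url, profile_depth, crawled_pages):
    """Return URLs on the same host that sit at depth profile_depth - 1.

    crawled_pages is an iterable of (url, depth) pairs (hypothetical layout).
    """
    host = urlsplit(profile_url).netloc.lower()
    wanted_depth = profile_depth - 1
    return [
        url
        for url, depth in crawled_pages
        if depth == wanted_depth and urlsplit(url).netloc.lower() == host
    ]
```

Only the pages returned by this filter need to be scanned for a hyperlink that points at the profile URL.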


Directory Sorted by PHONE
To obtain the DataFoundURL information we need DirectorySortedByPhone. Sample Directory Sorted By Phone information is shown in Figure 4.

Figure 4: Directory Sorted By PHONE





The DataFoundURL information is shown in Figure 5.


Figure 5: DataFoundURL

In Figure 5, the marked areas show how many phone numbers were found for each URL and the corresponding data location. If the number of phones is 1, we can say that the page is a single profile.







Depth Information
Depth information is found in the crawled data, which is stored in HostData\WebSite\LinkDB. Its format is shown in Figure 6.

Figure 6: Download Link Path

F:\HostData/Website/content/www.100hills.kzn.org.za means location
2.html means page id
CRAWLER_DEPTH=1 means current depth
1 means index number of content
2261859 means start byte of url
2261904 means end byte of url
2261905 means start byte of content
2347596 means end byte of content
1 means ignore or not (ignore flag)
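
As a rough sketch of how such an entry could be read, the snippet below parses the nine fields listed above into a small record. The separator is an assumption (the real delimiter is only visible in Figure 6), and the names LinkRecord and parse_link_record are hypothetical. The ignore flag is kept as a raw integer because its exact meaning is not spelled out here.

```python
from dataclasses import dataclass


@dataclass
class LinkRecord:
    location: str       # e.g. F:\HostData/Website/content/<host>
    page_id: str        # e.g. 2.html
    depth: int          # value of CRAWLER_DEPTH
    content_index: int  # index number of content
    url_start: int      # start byte of url
    url_end: int        # end byte of url
    content_start: int  # start byte of content
    content_end: int    # end byte of content
    ignore_flag: int    # raw "ignore or not" value


def parse_link_record(line, sep="\t"):
    """Parse one LinkDB entry; sep is an assumed field separator."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    location, page_id, depth_field = fields[:3]
    key, _, value = depth_field.partition("=")
    if key != "CRAWLER_DEPTH":
        raise ValueError(f"unexpected depth field: {depth_field!r}")
    # The remaining six fields are plain integers in the order listed above.
    numbers = [int(field) for field in fields[3:]]
    return LinkRecord(location, page_id, int(value), *numbers)
```

With the depth parsed this way, the pages at depth n-1 for a given host can be selected without opening the page content itself.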





Code for Relative URL Maker
To generate an absolute URL from a relative URL and a host, we need some rules. The code for generating the absolute URL is shown in Figures 7 and 8.


Figure 7: Making Relative URL (Cont…)



Figure 8: Making Relative URL
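
For readers who cannot see the figures, the following is a minimal sketch of the same idea using urljoin from Python's standard library. It is not the code from Figures 7 and 8: urljoin applies the standard relative-reference resolution rules, while our rules were built up case by case. The function name make_absolute and the page URLs in the usage note are hypothetical.

```python
from urllib.parse import urljoin, urlsplit


def make_absolute(page_url, href):
    """Resolve href against the page it was found on.

    Returns None when the result is not an http(s) URL
    (for example mailto: or javascript: links).
    """
    absolute = urljoin(page_url, href.strip())
    if urlsplit(absolute).scheme not in ("http", "https"):
        return None
    return absolute
```

For example, resolving /search/Bearings against a page such as http://www.adventureescapades.co.za/activities/index.html gives http://www.adventureescapades.co.za/search/Bearings, and a bare file name such as compdet_company.asp?screen=Show&iC_ID=27337 is resolved relative to the directory of the page that contains it.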








Result of Anchor Text
A screenshot of sample anchor text output is shown in Figure 9. The selected part of Figure 9 shows the necessary information for one anchor text: <OUT_DEGREE> is the hyperlink found in a web page, <URL> is the address of the current page, and <ANCHOR_TEXT> is the anchor text of <OUT_DEGREE>.


Figure 9: Result of Anchor Text


Index of Anchor Container
Figure 10 shows the index of the Anchor Container.

Figure 10a: Index of Anchor Container

Figure 10b: Index of Anchor Container
Comparison
The URLs found in the anchor container are matched against DataFoundURL.xml; the data-found URLs come from the directory detector.
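
The matching step can be sketched as a dictionary lookup after light URL normalization. This is an illustration under assumptions: the anchor container is modeled as a plain dict from URL to anchor text, the data-found URLs are passed in as a list (reading DataFoundURL.xml is left out because its layout is only shown in Figure 5), and the names normalize and match_anchor_texts are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default the path to /."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",
    ))


def match_anchor_texts(anchor_index, data_found_urls):
    """Map each data-found URL to its anchor text when one was indexed."""
    normalized = {normalize(url): text for url, text in anchor_index.items()}
    matches = {}
    for url in data_found_urls:
        text = normalized.get(normalize(url))
        if text is not None:
            matches[url] = text
    return matches
```

URLs that have no entry in the anchor container are simply left out of the result, so they can be reported or retried separately.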
Challenges
Identifying anchor text is a challenging task. An anchor tag starts with <a and the hyperlink attribute starts with href, for example href="http://www.adventureescapades.co.za/activity-airevents". But there are a lot of variations on the right-hand side of href, such as:
href=http://www.adventureescapades.co.za/activity-airevents
href= http://www.adventureescapades.co.za/activity-airevent's

Different pages use different syntax styles for href, and sometimes one style conflicts with another. In some pages the URL starts and ends with '. But there are also href definitions such as:
 href= http://www.adventureescapades.co.za/activity-airevent's

where ' is not an end marker.
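
One way to cope with these variations is to decide how the value ends by looking at its first character: a value that opens with a quote runs to the matching quote, and an unquoted value runs to the next whitespace or >. The regular expression below is a minimal sketch of that rule, not the rule set we actually use; HREF_RE and extract_hrefs are illustrative names.

```python
import re

# Three alternatives: "double quoted", 'single quoted', or a bare value
# that runs until whitespace or the closing > of the tag.
HREF_RE = re.compile(
    r"""href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""",
    re.IGNORECASE,
)


def extract_hrefs(html):
    """Return the href values found in html, in document order."""
    hrefs = []
    for double_quoted, single_quoted, bare in HREF_RE.findall(html):
        # Exactly one alternative matched; the other groups are empty strings.
        hrefs.append(double_quoted or single_quoted or bare)
    return hrefs
```

Because the unquoted alternative does not stop at a quote character, the trailing 's in the last variation stays part of the URL instead of being read as an end marker.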

Some web pages also use scripts that generate URLs at run time. For example:
BannerTag.href  = Btargets[curBanner]
This is solved by omitting scripts and stylesheets.
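
A minimal sketch of this omission step, assuming the page is available as a single string: script and style blocks are removed before any href is searched for, so assignments like the one above are never seen by the link extractor. The names SCRIPT_STYLE_RE and strip_scripts_and_styles are hypothetical, and a regular expression is only a rough approximation of real HTML parsing.

```python
import re

# Matches a whole <script>...</script> or <style>...</style> block,
# including attributes on the opening tag and newlines in the body.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)


def strip_scripts_and_styles(html):
    """Return html with script and style blocks removed."""
    return SCRIPT_STYLE_RE.sub("", html)
```

Running the stripper first keeps the href rules simple, at the cost of missing links that exist only after the script has run.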

In many cases the hyperlink does not contain the full URL; it may contain a relative URL. For example:
href="/ search/Bearings"

Finding the absolute URL from a relative URL is one of the most challenging tasks in anchor text identification.

There are also many variations in relative URLs, but we have general rules for finding the absolute URL, like those of the Web Crawler. Sample code is shown in Figures 7 and 8.




Done So Far:
There are many unknown rules for hyperlinks. We are following a “trial and error” approach to making absolute URLs and identifying anchor text.
