Monday, April 25, 2011

Business Profile Extraction from Webpages or Business Directories, Second Report: Intelligent Profile Parser (Multiple-Profiles-Page) (Front End)

Intelligent Profile Parser (Multiple-Profiles-Page)(Front End)
There are two parts in the design of the Intelligent Multiple-Profile Parser:
1)     Back End:
Input:
HTML files with the possibility of containing semi-structured multiple business profiles.
Output:
1)     Analysis statistics: The Back End analyzes the input HTML and identifies the html-tag-chains and their frequencies. (An html-tag-chain is the tag path from the root (<html> tag) to the container (<td>, <div>, etc.) of a piece of text.)
2)     A more structured intermediate file: This file contains all the possible business data as separate entries, grouped by their corresponding html-tag-chain.
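As a rough illustration of the analysis statistics, the sketch below counts html-tag-chains using Python's standard html.parser module. The class name TagChainCounter, the function tag_chain_frequencies, the "/" separator and the list of void tags are assumptions made for this example; the actual Back End may represent chains differently.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags that never have a closing tag; they are kept out of the chain.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagChainCounter(HTMLParser):
    """Counts how often each root-to-container tag path holds text."""

    def __init__(self):
        super().__init__()
        self.stack = []          # tags that are currently open
        self.chains = Counter()  # html-tag-chain -> frequency

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        # Only non-blank text inside at least one open tag is counted.
        if data.strip() and self.stack:
            self.chains["/".join(self.stack)] += 1


def tag_chain_frequencies(html):
    """Return a Counter mapping each html-tag-chain to its frequency."""
    parser = TagChainCounter()
    parser.feed(html)
    parser.close()
    return parser.chains
```

Calling tag_chain_frequencies(html).most_common() lists the chains from most to least frequent, which is roughly the shape of the statistics the Front End consumes.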

 The tasks of the back end can be summarized as follows:
-          Get rid of unwanted text.
-          Impose a structure upon the semi-structured or unstructured text.
-          Detect relations among the texts.
-          Identify the importance of the texts.
-          Parse the HTML and classify the texts into groups according to their html-tag-chains.
-          Generate the intermediate file that will be parsed by the Front End.


2)     Front End:
Input:
1)     The analysis statistics.
2)     The intermediate file.
Output:
Structured xml file with formatted business profiles.
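To make the output format concrete, here is a minimal sketch of writing profiles as XML with Python's standard xml.etree.ElementTree module. The element names (profiles, profile, and one child element per entity type) and the function name profiles_to_xml are assumptions; the real output schema is not specified in this report.

```python
import xml.etree.ElementTree as ET


def profiles_to_xml(profiles):
    """Serialize a list of {entity_type: value} dicts as an XML string.

    Entity types must be valid XML element names (hypothetical schema).
    """
    root = ET.Element("profiles")
    for profile in profiles:
        node = ET.SubElement(root, "profile")
        for entity_type, value in profile.items():
            # Special characters such as "&" are escaped on serialization.
            ET.SubElement(node, entity_type).text = value
    return ET.tostring(root, encoding="unicode")
```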
            The steps of the Front End can be summarized as follows:
a)      From the analysis statistics, process the chains one by one. For each chain:
1)     From the intermediate file, parse the texts bounded by that particular chain.
2)     Detect the entity type (Company, Phone, Cell, email, host, Unknown_data, etc.) of each entry.
3)     If no address was detected in step 2, run a linguistic analysis on the unknown_data entries to detect the address entry.
4)     If no description was detected in step 2, run a linguistic analysis on the unknown_data entries to detect the description entry.
5)     Output each profile after converting it to XML.
b)     Detect the correct chain that contains the business profiles.
c)      Validate the profiles.
d)     Convert the profiles to XML format.
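The entity-type detection in step a.2 could be approximated with simple pattern matching, as sketched below. The regular expressions and the function name detect_entity_type are assumptions for illustration only. They cover just the email, phone and host types and leave everything else as unknown_data for the later linguistic analysis; they are not the parser's actual rules.

```python
import re

# Illustrative patterns only; real directory data will need broader rules.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")
PHONE_RE = re.compile(r"^\+?[\d\s().-]{7,20}$")
HOST_RE = re.compile(r"^(https?://)?(www\.)?[\w-]+(\.[\w-]+)+(/\S*)?$",
                     re.IGNORECASE)


def detect_entity_type(text):
    """Return a coarse entity type for one entry of the intermediate file."""
    value = text.strip()
    if EMAIL_RE.match(value):
        return "email"
    # Require at least seven digits so short numeric strings are not phones.
    if PHONE_RE.match(value) and sum(c.isdigit() for c in value) >= 7:
        return "phone"
    if HOST_RE.match(value):
        return "host"
    # Company names, addresses and descriptions all end up here.
    return "unknown_data"
```

Entries that fall through to unknown_data are the ones that steps a.3 and a.4 would examine further.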



Figure 1: A sample webpage segment containing multiple business profiles.





Figure 2: This is the HTML code of the sample page from Figure 1. Each <tr> tag in the region highlighted in yellow contains a single business profile.


Figure 3: This is the HTML expansion of a single <tr> tag (mentioned in Figure 2).



Figure 4: Part of the analysis statistics produced for the sample HTML from Figure 1. This is generated by the back end and used by the front end. The string highlighted in yellow is the html-tag-chain with the highest frequency, 28 (highlighted in red).






Figure 5: This is a part of the intermediate file generated by the back end. The strings highlighted in red rectangles are instances of a single html-tag-chain that delimits the profile segments. The texts highlighted in yellow are our target business data. One profile is assumed to be contained between two consecutive instances of an html-tag-chain. Detecting the correct html-tag-chain is one of the challenges we face.
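The segmentation assumption in the caption above can be sketched as follows. The layout of the intermediate file is not specified here, so this example assumes it has already been read into a list of (chain, text) pairs in document order. It also assumes that the text found at the delimiting chain opens the profile. The function name split_profiles is hypothetical.

```python
def split_profiles(entries, delimiter_chain):
    """Group (chain, text) entries into profiles.

    A new profile starts at every occurrence of delimiter_chain and runs
    until the next occurrence. Entries before the first occurrence are
    discarded.
    """
    profiles = []
    current = None
    for chain, text in entries:
        if chain == delimiter_chain:
            if current:
                profiles.append(current)
            current = [text]       # the delimiter's text opens the profile
        elif current is not None:
            current.append(text)   # belongs to the profile in progress
    if current:
        profiles.append(current)
    return profiles
```

Running this once per candidate chain yields one set of candidate profiles per chain, which is why choosing the right chain matters so much.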


Figure 6: This is an example of the output produced by the front end after processing the intermediate file (Figure 5). The profiles are highlighted in yellow rectangles. These profiles were parsed with the chain highlighted in red rectangles. Detecting the company name, address and description is among the most challenging parts of our R&D. We are using some linguistic analysis to detect them, which needs to be improved significantly.




Figure 7: This is a more complex example of a business directory webpage. The profiles are highlighted in red rectangles.




Figure 8: This is the HTML of the page from Figure 7. Each <table> (highlighted in red) contains a business profile. The yellow region is the expansion of the first <table>.
Figure 9: This is the analysis statistics produced by the back end after analyzing the webpage from Figure 7. Here, all the tag chains highlighted in red produce the same profiles. Sometimes the same profile is parsed in two different forms by two different chains, so it becomes challenging to identify the correct profiles. The chain highlighted in yellow has the highest frequency, but it does not contain any profiles.
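Because the most frequent chain is not necessarily the one that holds the profiles, frequency alone cannot drive the choice. The sketch below shows one possible heuristic, which is not necessarily the one used in our parser: prefer the chain whose candidate profiles most often contain a recognized contact detail. The input format (a dict mapping each chain to a list of {entity_type: value} profiles), the set CONTACT_TYPES and the function names are assumptions for this example.

```python
# Entity types that count as contact details (an assumption for this sketch).
CONTACT_TYPES = {"phone", "cell", "email", "host"}


def plausible_profile_count(profiles):
    """Number of profiles that contain at least one contact detail."""
    return sum(1 for profile in profiles
               if CONTACT_TYPES.intersection(profile))


def pick_profile_chain(profiles_by_chain):
    """Return the chain with the most plausible profiles, or None."""
    best_chain, best_score = None, 0
    for chain, profiles in profiles_by_chain.items():
        score = plausible_profile_count(profiles)
        if score > best_score:
            best_chain, best_score = chain, score
    return best_chain
```

A navigation-menu chain may occur more often than the profile chain, but none of its entries carries a phone number or email address, so it scores zero under this heuristic.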



Figure 10: This is the intermediate file generated by the back end after analyzing the HTML of Figure 7.
Figure 11: Business profile output for the HTML of Figure 7.



Figure 12: This is an example of a webpage that contains human profiles. It is challenging to distinguish them from business profiles.



Figure 10: This is the profile output for the human-profile page shown in the preceding figure. Note that the names of the people are treated as company names. We have considered this problem in our research and made some important progress, which will be implemented in the future.



Figure 11: This is an example of a page with a structure that challenges our Intelligent Profile Parser. Note that, in each row, the first entry is an address and the second entry is a company name. In such cases it is sometimes hard to distinguish between the address and the company name. In our parser we assume (so far) that no company name is repeated within a single page. This assumption leads us to ignore unknown strings that are repeated. As a result, some addresses are ignored and some addresses are treated as company names. This problem can be solved by a more complex analysis of the linguistic features and the HTML structure.
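The repeated-string assumption described above can be sketched in a few lines. The function name drop_repeated_unknowns and the case-insensitive comparison are assumptions for illustration; the sample strings other than "Alberton" are invented.

```python
from collections import Counter


def drop_repeated_unknowns(unknown_texts):
    """Keep only the unknown strings that occur exactly once on the page.

    Rationale (per the assumption above): a company name is not repeated
    within a single page, so a repeated unknown string cannot be one.
    """
    def normalize(text):
        return text.strip().lower()

    counts = Counter(normalize(text) for text in unknown_texts)
    return [text for text in unknown_texts if counts[normalize(text)] == 1]
```

For the input ["Alberton", "Acme Motors", "Alberton", "Beta Bakery", "5 Voortrekker Rd"], the repeated "Alberton" is dropped, but the unrepeated address "5 Voortrekker Rd" survives and can still be mistaken for a company name. This is exactly the failure mode described above.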


Figure 12: This is the HTML of the page from Figure 11. Each of the <tr> tags highlighted in yellow contains a single profile.



Figure 13: This is the intermediate file for the page of Figure 11. The texts highlighted in yellow are all addresses, and the texts in red rectangles are company names.



Figure 14: These are the business profiles extracted from the page of Figure 11. The company names in yellow rectangles are correctly identified, because the string “Alberton” preceding them was ignored owing to its multiple occurrences. The company names in red rectangles, however, are actually addresses; they were not repeated, so they were not ignored. This problem will be solved in the future.



Figure 15: This page is a sample multiple-profile-page. The texts highlighted in red are the company name entries. The texts highlighted in yellow are description entries.



Figure 16: These are the output profiles parsed by the parser from the page in Figure 15. Here all the entries are correctly identified. Description detection is still in the research phase, so the description entries are skipped. These profiles were parsed with the tag-chain enclosed in the red rectangle.

 

Figure 17: These are some incorrectly parsed profiles from the page in Figure 15. Here, the description entries are detected as company names (highlighted in blue rectangles). Detecting and discarding these profiles is a major challenge in our R&D. We expect to address this problem with the style-sheet analysis that we plan to start in the future.


Figure 18: In this profile, the company-name entry contains both the name of the company (highlighted in yellow) and the description/category of the company (highlighted in green). We shall try to split them into two separate entries in our future research.
