Don't think you can't

Saturday, June 29, 2013

mysql-python install problem using virtualenv, windows, pip

Today I was working with Django and MySQL using "ENVIRONMENTVARIABLE". But when I was trying to sync my MySQL db. It was generating a error. something like this.

(test_south) C:\Users\a\Envs\test_south\test_project>python manage.py syncdb
Traceback (most recent call last):
File "manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
File "C:\Users\a\Envs\test_south\lib\site-packages\django\core\management\__init__.py", line 453, in execute_from_command_line
    utility.execute()
File "C:\Users\a\Envs\test_south\lib\site-packages\django\core\management\__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
File "C:\Users\a\Envs\test_south\lib\site-packages\django\core\management\__init__.py", line 272, in fetch_command
    klass = load_command_class(app_name, subcommand)
File "C:\Users\a\Envs\test_south\lib\site-packages\django\core\management\__init__.py", line 77, in load_command_class
    module = import_module('%s.management.commands.%s' % (app_name, name))
File "C:\Users\a\Envs\test_south\lib\site-packages\django\utils\importlib.py", line 35, in import_module
    __import__(name)
File "C:\Users\a\Envs\test_south\lib\site-packages\south\management\commands\__init__.py", line 12, in <module>
    from south.hacks import hacks
File "C:\Users\a\Envs\test_south\lib\site-packages\south\hacks\__init__.py", line 8, in <module>
    from south.hacks.django_1_0 import Hacks
File "C:\Users\a\Envs\test_south\lib\site-packages\south\hacks\django_1_0.py", line 6, in <module>
    from django.db.backends.creation import BaseDatabaseCreation
File "C:\Users\a\Envs\test_south\lib\site-packages\django\db\__init__.py", line 40, in <module>
    backend = load_backend(connection.settings_dict['ENGINE'])
File "C:\Users\a\Envs\test_south\lib\site-packages\django\db\__init__.py", line 34, in __getattr__
    return getattr(connections[DEFAULT_DB_ALIAS], item)
File "C:\Users\a\Envs\test_south\lib\site-packages\django\db\utils.py", line 93, in __getitem__
    backend = load_backend(db['ENGINE'])
File "C:\Users\a\Envs\test_south\lib\site-packages\django\db\utils.py", line 27, in load_backend
    return import_module('.base', backend_name)
File "C:\Users\a\Envs\test_south\lib\site-packages\django\utils\importlib.py", line 35, in import_module
    __import__(name)
File "C:\Users\a\Envs\test_south\lib\site-packages\django\db\backends\mysql\base.py", line 17, in <module>
    raise ImproperlyConfigured("Error loading MySQLdb module: %s" % e)
django.core.exceptions.ImproperlyConfigured: Error loading MySQLdb module: No module named _mysql

Then I was trying to install MySQLdb using pip command. My pip command was: pip install MySQL-python. But at time I faced a another problem. Like

building '_mysql' extension
C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -Dversion_info=(1,2,4,'final',1) -D__versi
on__=1.2.4 "-IC:\Program Files\MySQL\Connector C++ 1.1.3\include" -IC:\Python27\include -IC:\Users\a\Envs\test_south\PC /Tc_mysql.c /Fobuild\temp.win-
amd64-2.7\Release\_mysql.obj /Zl
_mysql.c
_mysql.c(42) : fatal error C1083: Cannot open include file: 'config-win.h': No such file or directory
error: command '"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\BIN\amd64\cl.exe"' failed with exit status 2

How could I solved this problem?

1] First I went to the http://www.codegood.com/download/11/ this location to download the binary of mysql-python64bit. Which is MySQL-python-1.2.3.win-amd64-py2.7.exe
2] Then I installed the binary in my C:\Python27 folder location
3] Then I went to the Lib\site-package folder and copy all the files and folder which are related to the mysql

4] Then I just paste the files in my environmentvariable location. which is

then pressed the syncdb command. And it was working :-)

Tuesday, April 26, 2011

+ Anchor Text Identification from web pages or (from our web crawler data)

Anchor Text Identification

Anchor Text

The anchor text is the visible, clickable text in a hyperlink. The words contained in the anchor text can determine the ranking that the page will receive by search engines. Some web browsers have added the ability to show a tooltip for a hyperlink before it is selected. Not all links have anchor texts because it may be obvious where the link will lead due to the context in which it is used. Anchor texts normally remain below 60 characters. Different browsers will display anchor texts differently.

Anchor text usually gives the user relevant descriptive or contextual information about the content of the link's destination. The anchor text may or may not be related to the actual text of the URL of the link. For example, a hyperlink with anchor text might take this form:

<a href="http://www.adventureescapades.co.za/activity-hanggliding"> Hang Gliding</a>

The anchor text in this example is “Hang Gliding"; the unwieldy URL http://www.adventureescapades.co.za/activity-hanggliding displays on the web page as Hang Gliding, contributing to clean, easy-to-read text

Why We Need Anchor Text?

For Single profile parser in Business Search, The Company name is the anchor text. So we need anchor text for Company name. Anchor Text is same as the company name for single profile as shown in figure 1.

Figure 1: Sample Web Page

When we click any url in the above figure, it goes a single

profile. Figure 2, shows the profile of first url and it’s Anchor

 Text formation is: <ahref="compdet_company.asp?screen=Show&amp;iC_ID=27337"

target="_parent">Altech Netstar Fleet Management Services (Pty) Ltd</a>

Figure 2: Sample Profile page

Architecture

System architecture of Anchor Text identifier is shown in Figure 3.

Figure 3: System Architecture of Anchor Text

Approach

Input: URLs

Output: Anchor Text of corresponding URL

Anchor Text exists in previous depth of the page, containing profile.

If the URL contains n-depth then

Anchor Text will be found in the (n-1) depth’s pages of current HOST.

Directory Sorted by PHONE

For DataFoundURL info we need DirectorySortedByPhone. Sample information of Directory Sorted By Phone is shown in figure 4.

Figure 4: Directory Sorted By PHONE

Information of DataFoundURL is shown in figure 5.

Figure 5: DataFoundURL

In figure 5, marked areas are how many phone found and corresponding data location of that URL. If the Phone number is 1, we can say that, it is a single profile.

Depth Information

Depth information is found in Crawled Data. That is stored in HostData\WebSite\LinkDB. It’s format is shown in figure 6.

Figure 6: Download Link Path

F:\HostData/Website/content/www.100hills.kzn.org.za means location

2.html means page id

CRAWLER_DEPTH=1 means current depth

1 means index number of content

2261859 start byte of url

2261904 end byte of url

2261905 start byte of content

2347596 end byte of content

1 means igonore or not

Code for Relative URL Maker

For generating absolute url from relative url and host, we need some rules. The code part for generate absolute url is shown in figure 7 and 8.

Figure 7: Making Relative URL (Cont…)

Figure 8: Making Relative URL

Result of Anchor Text

Screen shot of sample output of Anchor Text is shown in Figure 9. Here selected part of Figure 9 is showing necessary information of one anchor text. <OUT_DEGREE> means the hyperlink found in a web page, <URL> means the address of current page, <ANCHOR_TEXT> is the anchor Text of <OUT_DEGREE>.

Figure 9: Result of Anchor Text

Index of Anchor Container

Figure 10 shows index of Anchor Container.

Figure 10a: Index of Anchor Container

Figure 10b: index of Anchor Container

Comparison

Matching found-URL in anchor-container with DataFoundURL.xml, and the data found url comes from directory detector.

Challenges

Identifying Anchor test is a challenging task. Here Anchor tag start with <a and Hyperlink attributes start with href. For example href=”http://www.adventureescapades.co.za/activity-airevents”. But there are lot of variations in right hand side of href, likes:

href=”http://www.adventureescapades.co.za/activity-airevents”

href=’http://www.adventureescapades.co.za/activity-airevents’

href=http://www.adventureescapades.co.za/activity-airevents

href= http://www.adventureescapades.co.za/activity-airevent’s

Different pages have different style of syntax for href. Some of the time one’s definition conflict with others, like:

href=’http://www.adventureescapades.co.za/activity-airevents’

here url start and end with ’. But there are some definition of href:

href= http://www.adventureescapades.co.za/activity-airevent’s

where ‘ is not end marker.

There are also scripts for generating url in web pages and it is produced run-time. For example:

BannerTag.href = Btargets[curBanner]

That is solved by omitting scripts and stylesheet.

In many cases, the hyperlink does not contain full url, it may contain relative url. For example:

href=”/ search/Bearings”

Finding absolute url from relative url is one of the most challenging task for anchor text identification.

There are also lots of variations in relative urls. But we have a general rules for finding absolute url as like Web Crawler. Sample Code is shown in figure 7 and 8.

Done So Far:

There are lots of unknown rules for hyperlink. We are following a “trail and error” basis to making absolute url and identifying anchor text.

Monday, April 25, 2011

+ Business Profile extraction from webpages or business directory report second (Intelligent Profile Parser (Multiple-Profiles-Page)(Front End))

Intelligent Profile Parser (Multiple-Profiles-Page)(Front End)

There are two parts in the design of the Intelligent Multiple-Profile Parser:

1) Back End:

Input:

HTML files with the possibility of containing semi-structured multiple business profiles.

Output:

1) An Analysis Statistics: The Back End analyzes the input html and identifies the html-tag-chains and their frequencies. (html-tag-chain is the tag-path from the root (<html> tag) to the container (<td>, <div> etc) of the text)

2) A more structured intermediate file: This file contains all the, possible business data as separate entries. They are grouped with their corresponding html-tag-chain.

The tasks of the back end can be summarized as follows:

- Get rid of unwanted text.

- Impose a structure upon the semi-structured or unstructured text.

- Detect relation among the texts.

- Identify the importance of the texts.

- Parse the html and classify the texts in groups according to different html-tag-chain.

- Generate the intermediate file that will be parsed by the Front End.

2) Front End:

Input:

1) The analysis statistics.

2) The intermediate file.

Output:

Structured xml file with formatted business profiles.

The steps of the Front End can be summarized as follows:

a) From the analysis report, process each chain one by one and for each chain:

1) From the intermediate file, parse the texts bounded by that particular chain.

2) Detect the entity types (Company, Phone, Cell, email, host, Unknown_data etc.) for each entry.

3) Run a linguistic analysis process on the unknown_data entries to detect the address entry, if the address was not detected in step 2.

4) Run a linguistic analysis process on the unknown_data entries to detect the description entry, if it was not detected in step 2.

5) Output each profile after converting to xml.

b) Detect the correct chain that contains the business profiles.

c) Validate the profiles.

d) Convert the profiles in xml format.

Figure 1: A sample webpage segment, containing multiple business profiles.

Figure 2: This is the html code of the sample html page from figure 1. Each single <tr> tag in the highlighted (in yellow) region contains a single business profile.

Figure 3: This is the html expansion of a single <tr> tag (mentioned in Figure 2).

Figure 4: part of the analysis statistics after analyzing the sample html from Figure 1. This is generated by the back end and used by the front end. The highlight (yellow) string is a html-tag-chain that has the highest frequency of 28 (highlighted in red).

Figure 5: This is a part of the intermediate file, generated by the back end. The highlighted (in red rectangle) strings are the instance of a single html-tag-chain that defines the profile segments. The highlighted (in yellow) texts are our target business data. One profile is assumed to be contained between two instances of an html-tag-chain. Detecting the correct html-tag-chain is one of the challenges we have.

Figure 6: This is an example of the output, produced by the back end after processing the intermediate file (Figure 5). The profiles are highlighted in yellow rectangles. These profiles are parsed with the highlighted chain (in red rectangles). Detecting the company name, address and description are some of the challenging parts in our R&D. We are using some linguistic analysis to detect them, which need to be improved significantly.

Figure 7: This is a more complex example of a business directory webpage. The profiles are highlighted in red rectangle.

Figure 8: This is the html of the page from Figure 7. Each <table> (highlighted in red) contains a business profile. The yellow region is the expansion of the first <table>.

Figure 9: This is the analysis statistics produced by the back end after analyzing the webpage from Figure 7. Here, all the red tags produce the same profiles. Sometimes, the same profile is parsed in two different forms for two different chains. So, it becomes challenging to identify the correct profiles. The yellow tag is the tag with the highest frequency but it doesn’t contain any profiles.

Figure 10: This is the intermediate file generated by the back end, after analyzing the html of Figure 7.

Figure 11: Business Profile output for the html of figure 7.

Figure 12: This is an example of an webpage that contains human-profiles. It is challenging to distinguish them from business-profiles.

Figure 10: This is the profile output for the page from Figure 9. Note that, the name of the human are treated as company-name. Hopefully, we have considered this problem in our Research and made some important progress which would be implemented in future.

Figure 11: This is an example of a page with a structure that challenges our Intelligent Profile Parser. Note that, on each row, the first entry is an address and the second entry is a company name. Sometimes in these cases, it becomes hard to distinguish between the address and the company name. In our parser, we are assuming (so far) that, there would be no repeated company name in a single page. This assumption leads us to ignore the unknown repeated strings. In that case, some addresses are ignored and some addresses gets considered as company name. This problem can be solved by some complex analysis of the linguistic features and html structure.

Figure 12: This is the html of the page from figure 11. Each of the yellow <tr> tag contains a single profile.

Figure 13: This is the intermediate file for the page of figure 11. The yellow texts are all addresses and the texts in red rectangles are company name.

Figure 14: These are the business profiles extracted from the page of Figure 11. The company names in yellow rectangles are correctly identified because we ignored the string “Alberton” before them because of its multiple occurrences. But, the company names in red rectangles are actually addresses. They were not repeated. This problem will be solved in future.

Figure 15: This page is a sample multiple-profile-page. The texts highlighted in red are the company name entries. The texts highlighted in yellow are description entries.

Figure 16: These are the output profiles parsed by the parser from the page in Figure 15. Here all the entries are correctly identified. The description-detection part is still in the research phase, so they are skipped. These profiles are parsed with the tag-chain (enclosed in red rectangle).

Figure 17: These are some incorrectly parsed profiles from the page in Figure 15. Here, the description entries are detected as company name (highlighted in blue rectangles). Detecting and discarding these profiles is a major challenge in our R&D. This problem would be solved by the Style-Sheet-Analysis that we shall start in future.

Figure 18: In this profile, the company name contains both the name of the company (highlighted in yellow) and the description/category of the company (highlighted in green). We shall try to separate them in two separate entries in our future research.

Sunday, February 27, 2011

+ C++ Red Black Tree implementation

RedBlackTree.h
------------------------------------------------------------------------------------------------
#include <stdio.h>
#include <string.h>
#include <vector>

using namespace std;

/*
    Each child of the Red Black Tree
*/
class Node
{
public:
    char *data; /*data*/
    char color;    /*color*/
    Node *left, *right, *parent; /*lelf, right, parent*/

    /*copy constructor*/
    Node(char *data)
    {
        this->data = new char[strlen(data) + 1];
        strcpy(this->data, data);
        color = NULL;
    }

    /*constructor*/
    Node()
    {
        this->data = new char[1];
        color = NULL;
    }

    /*destructor*/
    ~Node()
    {
        delete (this->data);
        color = NULL;
    }
};

/*
    This is a class of Red Black Tre
*/
class RedBlackTree
{
private:
    Node    *Root;    /*root*/
    Node    *Nil;    /*leaf*/

public:
    /*constructor*/
    RedBlackTree            ();

    /*destructor*/
    ~RedBlackTree            ();

    Node*    tree_successor    ( Node* x );

            /*left rotate*/
    int        leftRotate        ( Node *x );

            /*right rotate*/
    int        rightRotate        ( Node *x );

            /*insert key*/
    int        insert_key        ( char *data );

            /*fix up*/
    int        fixup            ( Node *z );

            /*delete key*/
    int        delete_key        ( Node *z );

            /*delete fixup*/
    int        delete_fixup    ( Node *x );

            /*in order traverse*/
    void    inOrderTraverse    ( Node *n );

            /*post order delete*/
    void    postOrderDelete    ( Node *n );
};
RedBlackTree.cpp
------------------------------------------------------------------------------------------
#include "RedBlackTree.h"

/**
*    tree successor
*    @param: x is Node type data
*    @return: return y is a another Node type pointer
*/
Node* RedBlackTree::tree_successor( Node* x )
{
    if ( x->right != Nil )
    {
        x = x->right;

        while ( x->left != Nil )
        {
            x = x->left;
        }

        return x;
    }

    Node* y = x->parent;

    while ( y != Nil && x == y->right )
    {
        x = y;
        y = y->parent;
    }

    return y;
}

/**
*    left rotate
*    @param: x is a node type data
*    @return: 0 by default
*/
int RedBlackTree::leftRotate( Node *x )
{
    Node *y;
    y = x->right;
    x->right = y->left;
    y->left->parent = x;
    y->parent = x->parent;

    if ( x->parent == Nil )
        Root = y;
    else if ( x == x->parent->left )
        x->parent->left = y;
    else
        x->parent->right = y;
    y->left = x;
    x->parent = y;
    return 0;
}

/**
*    right rotate
*    @param: x is a Node type pointer
*    @return: 0 by default
*/
int RedBlackTree::rightRotate( Node *x )
{
    Node *y;

    y = x->left;
    x->left = y->right;
    y->right->parent = x;
    y->parent = x->parent;

    if ( x->parent == Nil )
        Root = y;
    else if ( x == x->parent->right )
        x->parent->right = y;
    else
        x->parent->left = y;
    y->right = x;
    x->parent = y;
    return 0;
}

/**
*    insert key
*    @param: data is a char type pointer
*    @return: 0 by default
*    @see: fixup()
*/
int RedBlackTree::insert_key( char *data )
{
    Node *x, *y, *z;

    y = Nil;
    x = Root;

    while ( x != Nil )
    {
        y = x;
        if ( strcmp( data, x->data ) < 0 )
            x = x->left;
        else if ( strcmp( data, x->data ) > 0 )
            x = x->right;
        else
            return -1;
    }

    z = new Node(data);
    z->parent = y;
    z->color = 'R';
    z->left = z->right = Nil;


    if ( y == Nil )
        Root = z;
    else
    {
        if ( strcmp( data, y->data ) <= 0 )
            y->left = z;
        else
            y->right = z;
    }

    fixup( z );

    x = y = NULL;
    return 0;
}

/**
*    fixup the node
*    @param:        z is a Node type pointer
*    @return:    0 by default
*    @see:        leftRotate(Node*)
*    @see:        rightRotate(Node*)
*/
int RedBlackTree::fixup( Node *z )
{
    Node *y;

    while ( z->parent->color == 'R' )
    {
        if ( z->parent == z->parent->parent->left )
        {
            y = z->parent->parent->right;

            if ( y->color == 'R' )
            {
                z->parent->color = 'B';
                y->color = 'B';
                z->parent->parent->color = 'R';
                z = z->parent->parent;
            }
            else
            {
                if ( z == z->parent->right )
                {
                    z = z->parent;
                    leftRotate( z );
                }

                z->parent->color = 'B';
                z->parent->parent->color = 'R';
                rightRotate( z->parent->parent );
            }
        }
        else
        {
            y = z->parent->parent->left;

            if ( y->color == 'R' )
            {
                z->parent->color = 'B';
                y->color = 'B';
                z->parent->parent->color = 'R';
                z = z->parent->parent;
            }
            else
            {
                if ( z == z->parent->left )
                {
                    z = z->parent;
                    rightRotate( z );
                }
                z->parent->color = 'B';
                z->parent->parent->color = 'R';
                leftRotate( z->parent->parent );
            }
        }
    }

    Root->color = 'B';

    y = NULL;
    return 0;
}

/**
*    delete key
*    @param:    z is a Node type param
*    @see:    tree_successor(Node*)
*    @see:    delete_fixup(Node*)
*    @return:0 by default
*/
int RedBlackTree::delete_key( Node *z )
{
    Node *x, *y;

    if ( z->left == Nil || z->right == Nil )
    {
        y = z;
    }
    else
    {
        y = this->tree_successor( z );
    }

    if ( y->left != Nil )
    {
        x = y->left;
    }
    else
    {
        x = y->right;
    }

    x->parent = y->parent;

    if ( y->parent == Nil )
    {
        Root = x;
    }
    else
    {
        if ( y == y->parent->left )
        {
            y->parent->left = x;
        }
        else
        {
            y->parent->right = x;
        }
    }

    if ( y != z )
    {
        strcpy( y->data, z->data );
        //z->data = y->data;
    }

    if ( y->color == 'B' )
    {
        delete_fixup( x );
    }

    return 0;
}

/**
*    delete fixup
*    @param: x is a Node type param
*    @return: 0 by default
*    @see:    leftRotate(Node*)
*    @see:    rightRotate(Node*)
*/
int RedBlackTree::delete_fixup( Node *x )
{
    Node *w;

    while ( x != Root && x->color == 'B' )
    {
        if ( x == x->parent->left )
        {
            w = x->parent->right;

            if ( w->color == 'R' )
            {
                w->color = 'B';
                x->parent->color = 'R';
                leftRotate( x->parent );
                w = x->parent->right;
            }

            if ( w->left->color == 'B' && w->right->color == 'B' )
            {
                w->color = 'R';
                x = x->parent;
            }
            else if ( w->right->color == 'B' )
            {
                w->left->color = 'B';
                w->color = 'R';
                rightRotate( w );
                w = x->parent->right;
            }

            w->color = x->parent->color;
            x->parent->color = 'B';
            w->right->color = 'B';
            leftRotate( x->parent );
            x = Root;
        }
        else
        {
            w = x->parent->left;

            if ( w->color == 'R' )
            {
                w->color = 'B';
                x->parent->color = 'R';
                rightRotate( x->parent );
                w = x->parent->left;
            }

            if ( w->right->color == 'B' && w->left->color == 'B' )
            {
                w->color = 'R';
                x = x->parent;
            }
            else if ( w->left->color == 'B' )
            {
                w->right->color = 'B';
                w->color = 'R';
                leftRotate( w );
                w = x->parent->left;
            }

            w->color = x->parent->color;
            x->parent->color = 'B';
            w->left->color = 'B';
            rightRotate( x->parent );
            x = Root;
        }
    }

    x->color = 'B';

    return 0;
}

/**
*    in order traverse
*    @param:    n is a Node type param
*    @desc:    recursive calling
*/
void RedBlackTree::inOrderTraverse( Node *n )
{
    if ( n != Nil )
    {
        inOrderTraverse( n->left );
        puts( n->data );
        inOrderTraverse( n->right );
    }
}

/**
*    post order delete
*    @param: n is a Node type
*    @desc:    recursive calling
*/
void RedBlackTree::postOrderDelete( Node *n )
{
    if ( n != Nil )
    {
        postOrderDelete( n->left );
        postOrderDelete( n->right );
        delete( n );
    }
}

/**
*    Constructor
*    Nil equal Node
*    Root equal Nil
*/
RedBlackTree::RedBlackTree()
{
    this->Nil = new Node();
    this->Root = Nil;
}

/**
* destructor
* @see: postOrderDelete(Node*)
*/
RedBlackTree::~RedBlackTree()
{
    postOrderDelete( this->Root );
    delete (this->Nil);
}

Wednesday, February 23, 2011

+ Project work flow and description (Business Profile extraction) at the very first time.

BusinessSearch Project work flow and description

Objective of BusinessSearch Project:

Main goal of this project is to collect the business profile in a automated way and try to develop a system that can update the profile in quarter or half year basis.

Flow Description:

The target of our project is to collect maximum business profile. But the business profile is fully depends on host name. If we have a single host name then we are able to create a profile by using it’s home page, contact us, about us pages. But we do not have enough host names to extract Business profile for our project . That is why our main target is to collect maximum host name and convert it to a perfect business profile.

Current / available attributes for Business profile are:

· Fax

· Email

· Website

· Zip code

· Address

· Phone no

· GPS location

· Contract person

· Contact person’s designation

Upcoming/ future attributes are:

· Business type

· Business Description

· Owner/ Proprietor/ Director

· Establishment date

· Registration Date

· Logo

Challenges: WebCrawler

· Robustness

· Mirror

· Hashing

· Unexpected bug

Work done so far in Business Search:

· Fax : 80- 90% accuracy

· Phone: 80-90%

· Zip Code: 70-80%

· Website: 85-90%

· Email: 85-90%

· Address: 45-50%

· Company Name: 45-50%

· Contact Person: 50-60%

· Contact Designation: 60-70%

· GPS Cordinate: 60-70% but (we will consider it later with the support of iSearch/Lucene)

· Branch: 70-80%

Yet to Develop:

· Business type – (Machine Learning)

· Business Description – (Snippet)

· Owner/ Proprietor/ Director (Natural Language Processing)

· Establishment date - Parsing

· Registration Date – Parsing

· Logo – Paring + Tricks + Crawling

+ Business Profile extraction Process flow diagram

Here I'm giving you a Business Profile extraction process flow diagram. But this is not a flow diagram. Actually I'm giving you a basic idea of a modules name in our project.

+ Business Profile extraction from webpages or business directory report

Few days ago, we have tried to develop a business profile extraction parser. But if you want to write down a business profile parser then you have to do a proper R&D over that.

Right now I'm giving you a my first R&D report of a business profile parser.

Current R&D Target:

Developing an efficient intelligent profile parsing system that is able to parse profiles from at least 80% - 90% Business Directory WebPages.

Progress:

Our target Business-Directory-WebPages are classified into two groups:

1. Multiple-Profile Pages

2. Single-Profile Pages

Multiple Profile Pages:

- These WebPages contains Multiple Business Profiles on a single page and there are common patterns in the html structure of those profiles.

- Our intelligent parser would identify this repeated pattern and parse the profiles accordingly.

- There is a big possibility that, some invalid data would be parse as business profiles, but our profile validator will discard them.

Single Profile Pages:

- This case is a little harder than the previous case.

- But, the approach is almost similar to the previous one.

- In maximum business directories, different Single-Profile-Pages maintain the same pattern.

Currently we are focusing on the first case (Multiple-Profile Page). We are using selected pages from our crawled data.

Challenges:

The main challenge is to define and identify the pattern. Most of the html pages are semi-structured. So, there is no universal pattern that works for each one.

Here are some sample patterns:

Figure 1:

This is a part of the html, from a multiple-profile page. The highlighted (yellow) region contains muiltiple business profiles.

Figure 2:

In this html, each profile is contained in the <div class=”listing”> tag.

Figure 3:

This is a more complex pattern. Here, each of the 3 consecutive (highlighted) <tr> tags contains a single profile.

Figure 4:

The previous examples imply that, identifying repeated patterns allows us to parse multiple profiles. But, this figure shows a complex counter-example. Here, the pattern in the yellow region is repeated, but they don’t contain any business profiles. All the business profiles are contained in the <tr> tag (marked with a red rectangle). The next image is the expansion of this <tr> tag.

Figure 5:

This is the expansion of the <tr> tag (marked with red rectangle) from the previous example (Figure 4). Each of the table in the yellow region contains a business profile.

Figure 6:

This image shows the expansion of the <table> tags from the previous example (Figure 5).

Figure 7: This is the expansion of the 3 consecutive <tr> tags from the figure 3.

Approach:

After analyzing many business directory websites, we have found some possible techniques to identify the profile patterns.

Each business profile contains multiple entries (Company Name, Email, Address etc.). Each of these entries has a unique html tag-chain. Depending on these tag-chains we can identify the possible data segments in the following way:

- At first, we have to identify the chain that signifies the start of a profile segment. We call it, the main-chain.

- Initially, we considered the most frequent chain as the main-chain. But, this heuristic fails for some cases.

- So, we consider all the chains in the non-increasing order of frequency and check for the validity of the result. But, the validity checking has many hard challenges that are part of Natural Language Processing.

Example:

Here is an example of a main-chain for a webpage:

Most Frequent Chain: <tr>||<td width="64" valign="top">||</td>||<td width="102" valign="top">||</td>||<td width="368" valign="top">||</td>||</tr>||

Frequency: 167

Using this chain we could parse profiles successfully. But, this is not always possible.

Counter Example:

Here are 2 of the tag-chains from the Webpage of figure 4:

Chain 1: <tr>||<td width="100" align="right" class="SideMenuLink1">||</td>||<td class="SideMenuLink1">||</td>||</tr>||

Frequency: 80

Chain 2: <tr>||<td colspan="2" class="SideMenuTitle">||</td>||</tr>||

Frequency: 31

Here, chain 2 is the main-chain, which is not the most-frequent chain.

Figure 8:

This is an example of a simple html-parse-tree. Every vertex represents a single html-tag. The red-vertices contain the profile-data. The path from the root to a leaf is the tag-chain that we previously described. But, most of the html-parse-trees are more complex.

But we are not going to describe those complex cases which has been skiped by us.