How to do Web Crawling and Text Mining in Statistica?

Article ID: KB0078133

Products Versions
Spotfire Statistica 13.3.0 and later

Description

This article gives general instructions for performing Web Crawling in Statistica and then running a Text Mining analysis on the crawled HTML files.

Issue/Introduction

How to do Web Crawling and Text Mining in Statistica?

Environment

Windows

Resolution

To do Web Crawling:
1. Go to the Statistica menu "Data Mining | Web Crawling". In the dialog, enter the web URLs you want to crawl in the "Target" text box.
2. Choose the "File Filter".
3. Click "Add to Crawl". The URL will be listed in the "Crawling tree".
4. Click "Start" to begin the crawling process. Once it finishes, the sub URLs will be listed under "Crawling tree".
5. Select the sub URLs under "Crawling tree" and add them to the "Document list".
6. Select the documents of interest under "Document list" in the right panel.
7. Click "Create a Spreadsheet from the document list" to get a list of the crawled URLs.
8. Browse to a "Content folder" where you want to save the crawled HTML files.
9. Click "Load web contents from the list to local folder".
The crawled HTML files are saved to the folder you specified. At this point, web crawling is complete.
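
For readers who want a mental model of what the crawl produces, the following Python sketch mimics the idea outside Statistica: collect the sub URLs linked from a page, keep those matching a file filter, and save each page into a content folder. It is not Statistica's implementation and does not use any Statistica API. The names `extract_sub_urls`, `save_page`, and `fetch`, and the `.html`/`.htm` suffix filter, are assumptions made for illustration only.

```python
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_sub_urls(html, base_url, suffixes=(".html", ".htm")):
    """Return absolute sub URLs whose path ends with one of the suffixes.

    The suffix check is a rough stand-in for a file filter (an assumption).
    """
    collector = LinkCollector()
    collector.feed(html)
    found = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)
        path = urlparse(absolute).path.lower()
        if path.endswith(suffixes) and absolute not in found:
            found.append(absolute)
    return found


def save_page(content_folder, url, html):
    """Write one page into the content folder and return the file path."""
    folder = Path(content_folder)
    folder.mkdir(parents=True, exist_ok=True)
    name = Path(urlparse(url).path).name or "index.html"
    target = folder / name
    target.write_text(html, encoding="utf-8")
    return target


def fetch(url):
    """Download a page as text (requires network access)."""
    with urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Offline demonstration with an inline page; example.com is a placeholder.
    page = '<a href="a.html">A</a> <a href="/img/logo.png">logo</a>'
    print(extract_sub_urls(page, "https://example.com/docs/"))
```

In a real run, `fetch` would supply the page text for the target URL, and each sub URL returned by `extract_sub_urls` would be fetched and passed to `save_page`, which roughly corresponds to steps 4 through 9 above.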

To do Text Mining of the crawled HTML files:
1. Go to the Statistica menu "Data Mining | Text Mining".
2. On the Quick tab of the Text Mining dialog, select "Files | Browse documents | Add Files" and browse to the folder where you saved the crawled HTML files during Web Crawling.
3. Select the HTML files and add them to the "Select documents" list.
4. Once done, click "Index" to run the text mining analysis.
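
As a rough illustration of what indexing a folder of HTML files involves, the Python sketch below strips the markup from each saved file and counts word frequencies per document. This is only a conceptual stand-in, not the algorithm Statistica's "Index" button runs. The function names `html_to_text`, `word_counts`, and `index_folder` are hypothetical, and the tokenizer (lowercase ASCII letter runs) is a deliberately simple assumption.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Keeps visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())


def word_counts(html):
    """Count lowercase word occurrences in the visible text of a page."""
    words = re.findall(r"[a-z]+", html_to_text(html).lower())
    return Counter(words)


def index_folder(content_folder):
    """Map each .html file name in the folder to its word counts."""
    return {
        path.name: word_counts(path.read_text(encoding="utf-8", errors="replace"))
        for path in sorted(Path(content_folder).glob("*.html"))
    }
```

Calling `index_folder` on the content folder chosen during Web Crawling would give one word-frequency table per saved page, which is the kind of document-by-term summary a text mining index is built from.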

Important Notice:
Text mining requires you to first install a third-party tool (lynx) for Statistica in order to mine web URLs. This tool is no longer bundled with Statistica due to group license policy.
Before running the text mining analysis, install and configure lynx by following this article:
Text Miner/Web Crawling error: Failed to index document while pointing it to URLS
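
Before troubleshooting inside Statistica, it can help to confirm that lynx is installed at all. The Python sketch below checks whether an executable named `lynx` is on the system PATH and, if so, uses lynx's `-dump` option to render a saved HTML file as plain text. It only verifies that lynx can be found and run from the command line. It does not check or change the Statistica-side configuration described in the article referenced above. The function names `find_tool` and `dump_with_lynx` are hypothetical.

```python
import shutil
import subprocess


def find_tool(name="lynx"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(name)


def dump_with_lynx(html_file):
    """Render a local HTML file as plain text with `lynx -dump`.

    Raises FileNotFoundError when lynx is not on PATH.
    """
    lynx_path = find_tool("lynx")
    if lynx_path is None:
        raise FileNotFoundError("lynx was not found on PATH")
    result = subprocess.run(
        [lynx_path, "-dump", str(html_file)],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    location = find_tool("lynx")
    if location:
        print("lynx found at", location)
    else:
        print("lynx is not on PATH; install it before indexing web URLs")
```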