How to do Web Crawling and Text Mining in Statistica?
Article ID: KB0078133
Products
Spotfire Statistica
Versions
13.3.0 and later
Description
This article gives general instructions for performing Web Crawling and Text Mining analysis of the crawled HTML files in Statistica.
Issue/Introduction
How to do Web Crawling and Text Mining in Statistica?
Environment
Windows
Resolution
To do Web Crawling:
1. Go to the Statistica menu "Data Mining | Web Crawling". In the dialog, enter the web URLs you want to crawl in the "Target" text box.
2. Choose the "File Filter".
3. Click "Add to Crawl". The URL will be listed on the "Crawling tree".
4. Click "Start" to begin the crawling process. Once it finishes, the sub URLs will be listed under "Crawling tree".
5. Select the sub URLs under "Crawling tree" and add them to the "Document list".
6. Select the documents of interest under "Document list" on the right panel.
7. Click "Create a Spreadsheet from the document list" to get a list of the crawled URLs.
8. Browse to a "Content folder" where you want to save the crawled HTML files.
9. Click "Load web contents from the list to local folder".
The crawled HTML files are saved to the folder you specified. At this point, web crawling is complete.
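To make the idea behind the "Crawling tree" and "File Filter" steps more concrete, the following is a minimal Python sketch of how sub URLs can be collected from a page and filtered by file extension. It is only an illustration and not how Statistica implements crawling; the names `LinkCollector` and `extract_sub_urls`, the sample HTML, and the `example.com` URL are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_sub_urls(page_html, base_url, extensions=(".html", ".htm")):
    """Return absolute sub URLs whose path ends with one of the extensions.

    The extensions tuple is a hypothetical stand-in for a file filter.
    """
    collector = LinkCollector()
    collector.feed(page_html)
    sub_urls = []
    for href in collector.hrefs:
        absolute = urljoin(base_url, href)  # resolve relative links
        path = urlparse(absolute).path.lower()
        if path.endswith(extensions) and absolute not in sub_urls:
            sub_urls.append(absolute)
    return sub_urls


# Hypothetical page content with two HTML links and one image link.
sample = '<a href="a.html">A</a> <a href="/docs/b.htm">B</a> <a href="logo.png">img</a>'
print(extract_sub_urls(sample, "https://example.com/start/index.html"))
```

In this sketch the image link is dropped by the extension filter and the two remaining links are resolved against the starting URL, which mirrors the way only matching sub URLs would be of interest for the document list.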
To do Text Mining of the crawled HTML files:
1. Go to the Statistica menu "Data Mining | Text Mining".
2. On the Quick tab of the Text mining dialog, select "Files | Browse documents | Add Files" and browse to the folder where you saved the crawled HTML files during the Web Crawling process.
3. Select the HTML files and add them to the "Select documents" list.
4. Once done, click "Index" to run the text mining analysis.
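As a rough picture of what indexing a folder of saved HTML files involves, the sketch below strips the markup from each file and counts word occurrences. It uses only the Python standard library and is an assumption-laden illustration, not a description of the Statistica Text Mining engine; `TextExtractor`, `word_counts`, and `index_folder` are hypothetical names, and the simple tokenizer keeps ASCII letters only.

```python
import re
from collections import Counter
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Keeps visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def word_counts(page_html):
    """Return a Counter of lowercase words found in the visible text."""
    extractor = TextExtractor()
    extractor.feed(page_html)
    extractor.close()
    text = " ".join(extractor.parts).lower()
    return Counter(re.findall(r"[a-z]+", text))


def index_folder(folder):
    """Map each .html file name in the folder to its word counts."""
    index = {}
    for path in sorted(Path(folder).glob("*.html")):
        html = path.read_text(encoding="utf-8", errors="ignore")
        index[path.name] = word_counts(html)
    return index
```

Calling `index_folder` on the content folder chosen during crawling would give one word-frequency table per document, which is the kind of raw material a text mining analysis starts from.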
Important Notice: To mine web URLs, Text Mining requires a third-party tool (lynx) to be installed for Statistica first. Because of group license policy, this tool is no longer bundled with Statistica. Install and configure lynx before running the text mining analysis by following this article: "Text Miner/Web Crawling error: Failed to index document while pointing it to URLS".
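Before following the linked article, it can be useful to check whether a lynx executable is already visible on the machine. The Python sketch below looks for lynx on the system PATH and builds a command that would dump a saved HTML file as plain text. It assumes lynx is reachable through PATH, which may differ from how Statistica locates the tool, so treat the linked article as the authority for configuration; `find_lynx` and `build_dump_command` are hypothetical helper names.

```python
import shutil
import subprocess


def find_lynx():
    """Return the full path of the lynx executable, or None if not on PATH."""
    return shutil.which("lynx")


def build_dump_command(lynx_path, html_file):
    """Build a command that prints the rendered page as plain text.

    -dump writes the formatted text to standard output;
    -nolist omits the list of links that -dump appends by default.
    """
    return [lynx_path, "-dump", "-nolist", html_file]


if __name__ == "__main__":
    lynx = find_lynx()
    if lynx is None:
        print("lynx was not found on PATH")
    else:
        # Ask lynx for its version as a quick check that it runs.
        result = subprocess.run([lynx, "-version"], capture_output=True, text=True)
        lines = result.stdout.splitlines()
        print(lines[0] if lines else "lynx found at " + lynx)
```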