Web Issue Analysis – A practical guide to the HTML text analysis component

The text analysis component of web issue analysis identifies words that are unusually common in web pages matching a search query. For example, to gain insights into Integrated Water Resources Management, it would be possible to identify words that occur unusually often in pages returned by Google for the search “Integrated Water Resources Management”. Instructions for carrying out this kind of analysis with LexiURL Searcher and SocSciBot are given below.

The first stage is to construct a query matching the topic being analysed. The query should be designed to capture as many issue-relevant pages as possible while returning only a low proportion of irrelevant pages. This can be achieved by trial and error: test candidate queries in a search engine, note how many results each returns, and scan the results list for irrelevant matches to check that virtually all matching pages are relevant.

Once the query is ready, it can be used to generate a list of up to 1,000 results via LexiURL Searcher (lexiurl.wlv.ac.uk). After downloading LexiURL Searcher, create a text file containing the query using Windows Notepad or a similar plain text editor. The query should be on a line of its own at the top of the file. Next, start LexiURL Searcher and choose the classic interface option. From the classic interface screen, select “Run all searches in file” from the search menu and, when asked, select the text file just created. After about a minute the searches will be finished. The long results file, created in the same folder as the original text file, contains a list of up to 1,000 matching URLs from Live Search.
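The query file can also be created with a short script instead of a text editor. The following is a minimal sketch only: the function name write_query_file, the file name queries.txt and the UTF-8 encoding are illustrative assumptions, not requirements stated for LexiURL Searcher.

```python
from pathlib import Path


def write_query_file(path, queries):
    """Write one query per line to a plain text file and return the lines written.

    Blank entries are dropped so that every line of the file holds a query.
    The UTF-8 encoding is an assumption; use whatever encoding the tool expects.
    """
    lines = [q.strip() for q in queries if q.strip()]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


if __name__ == "__main__":
    # The phrase query from the example above, on a line of its own.
    write_query_file("queries.txt", ['"Integrated Water Resources Management"'])
```

The resulting file can then be selected in LexiURL Searcher in the same way as a file created by hand.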

These URLs (up to 1,000) can be downloaded by SocSciBot for a text analysis. Follow the instructions for downloading and installing SocSciBot 4 at socscibot.wlv.ac.uk. Start SocSciBot 4 and set up a new project by entering a new project name in wizard step 1. In wizard step 2, instead of entering a single URL to crawl, check the “Download Multiple Sites/URLs” option and click the “Crawl Site with SocSciBot” button. At the main SocSciBot screen that appears, select the “I have a list of up to 10000 URLs to crawl” option and load the long results file by clicking the “Load List of URLs to Crawl” button. Start the downloading by clicking the “Crawl above list of Sites/URLs” button.

Once the downloading is complete, close SocSciBot 4 and start it again, selecting the new project. Now select the “Analyse TEXT in project with Cyclist” option. This runs algorithms to extract the text from all the downloaded web pages and may take several minutes. Once this is complete, get a list of the most common words by selecting “Save Page Word Frequency file” from the Info menu. This lists all words in all pages, together with their frequency and the number of different pages in which they occurred. This list can be compared with word frequency lists from a range of other texts to find unusually common words, i.e. those with a significantly higher rank in the first list than in the second.
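The rank comparison described above can be sketched in a few lines of code. This is an illustration under stated assumptions rather than part of SocSciBot: the function names rank_words and unusually_common are hypothetical, both word lists are assumed to be already loaded as word-to-frequency dictionaries (the layout of the saved file is not covered here), and a simple rank difference stands in for a formal significance test.

```python
def rank_words(freqs):
    """Map each word to its rank (1 = most frequent); ties are broken alphabetically."""
    ordered = sorted(freqs.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


def unusually_common(topic_freqs, reference_freqs, min_rank_gain=1):
    """Return (word, rank_gain) pairs for words ranked higher in the topic list.

    rank_gain is the reference rank minus the topic rank, so a large value
    means the word is far more prominent in the topic pages. Words missing
    from the reference list are given a rank one past its end (an assumption).
    """
    topic_ranks = rank_words(topic_freqs)
    ref_ranks = rank_words(reference_freqs)
    default_rank = len(ref_ranks) + 1
    gains = []
    for word, topic_rank in topic_ranks.items():
        gain = ref_ranks.get(word, default_rank) - topic_rank
        if gain >= min_rank_gain:
            gains.append((word, gain))
    # Largest gain first; alphabetical order within equal gains.
    return sorted(gains, key=lambda wg: (-wg[1], wg[0]))


if __name__ == "__main__":
    # Made-up frequencies, purely for illustration.
    topic = {"the": 100, "water": 50, "management": 30, "of": 20}
    reference = {"the": 1000, "of": 800, "water": 5, "management": 2}
    for word, gain in unusually_common(topic, reference):
        print(word, gain)
```

In practice a higher min_rank_gain would be needed to filter out small, chance differences in rank, and the reference list should come from texts large and varied enough to represent general usage.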