Have you ever wondered what separates the UK Regional Development Agencies? Are all the Regional Economic Strategies (RES) the same? Who would disagree with the promotion of higher incomes, investment in innovative technology and cohesive communities? How do the unique assets and histories of each UK region affect its RES? How can every region pursue higher-than-average output?
This analysis answers some of these questions by delving into semantic technologies. I present a visualisation of keywords (a tag cloud) and a statistical analysis of keywords (ngram ranking).
Tag Clouds
The information contained within is drawn from Regional Economic Strategies and the Regional Development Agencies' own websites. This data has been recorded on this site in a web-book on Regional Economic Strategies. Tags have been applied to excerpts of the text and then aggregated.
The RES book is not comprehensive, since it only includes a few of the strategies, but the analysis is still illustrative.
This tag cloud illustrates the frequency of economic development topics across the regions:
Advanced Technology Assets Attractive built environment Business City-Regions cohesive communities community communties Competition Competitive Advantage Conservation Culture Deprived Areas Dynamism Economy Education Efficiency Employment energy enterprise environment equality Global Competitiveness Growth GVA health inclusion Income Infrastructure Innovation International Economy Investment Inward Investment Jobs Knowledge land Learning Liveability People Physical Private Investment Productivity Quality of Life Regional Economic Strategy renewal Rural Safeguard Skills Social Justice Society Sustainable Development Sustainable Prosperity Towns Training Transport Unemployment Urban Work World Class
N-Gram Ranking
An n-gram is a collection of co-located words, where "n" is the number of words. This analysis compares pairs of words, or "bigrams".
I've used Perl modules from the N-Gram Statistics Package to parse the text. The script looks for pairs of words found within a four-word window. Some common conjunctions ("and", "or", etc.) have been ignored. The frequencies of the bigrams (the memes in the RES) are counted in histograms. The bigrams in each histogram are then ranked by log-likelihood ratio.
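To make the steps above concrete, here is a minimal Python sketch of windowed bigram counting and log-likelihood ranking. It is not the Perl code from the N-Gram Statistics Package: the function names, the whitespace tokeniser and the two-word stop list are all assumptions made for illustration, and the package's own counting conventions may differ in detail.

```python
import math
from collections import Counter

# Assumed stop list; the post only says "and/ or etc" were ignored.
STOPWORDS = {"and", "or"}


def windowed_bigrams(tokens, window=4):
    """Yield ordered pairs (w1, w2) where w2 follows w1 within `window` tokens."""
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + window]:
            yield (w1, w2)


def log_likelihood(n11, n1p, np1, npp):
    """G^2 statistic for the 2x2 contingency table of one bigram.

    n11: count of the bigram; n1p: bigrams starting with its first word;
    np1: bigrams ending with its second word; npp: all bigrams.
    """
    observed = [n11, n1p - n11, np1 - n11, npp - n1p - np1 + n11]
    expected = [
        n1p * np1 / npp,
        n1p * (npp - np1) / npp,
        (npp - n1p) * np1 / npp,
        (npp - n1p) * (npp - np1) / npp,
    ]
    # Empty cells contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def rank_bigrams(text, window=4):
    """Return (bigram, score) pairs sorted from highest to lowest score."""
    # Naive whitespace tokeniser (an assumption); case is preserved.
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    pairs = Counter(windowed_bigrams(tokens, window))
    left, right = Counter(), Counter()
    for (w1, w2), n in pairs.items():
        left[w1] += n
        right[w2] += n
    total = sum(pairs.values())
    scores = {
        bigram: log_likelihood(n, left[bigram[0]], right[bigram[1]], total)
        for bigram, n in pairs.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Calling `rank_bigrams(text)` on the text of one strategy gives a ranked list of the kind summarised in the table that follows.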
The top ten bi-grams for each RES are as follows:
| AWM | EEDA | EMDA | LDA | NWDA | ONE | SEEDA | SWRDA | YF |
| West Midlan | East England | East Midland | London s | England aver | North East | South East | South West | Yorkshire Hu |
| region s | per cent | sub area | GLA Group | Themed Cha | One NorthEa | Regional Eco | region s | Northern Way |
| low carbon | Regional Str | region s | ECONOMIC D | Lake District | set out | Economic Str | Regional Str | such as |
| climate chan | Economic Str | : PSA | ECONOMIC S | Economic Str | region s | Regional Str | Regional Eco | long term |
| Regional Str | Regional Eco | Evidence Bas | DEVELOPMEN | Regional Str | ECONOMIC S | region s | Economic Str | e g |
| R D | region s | such as | EVIDENCE BA | Economic 20 | REGIONAL ST | Smart Growth | southwestrda | region s |
| working age | such as | Local authori | Sustaining S | Strategy 200 | REGIONAL EC | quality life | 2006 2015 | REGIONAL EC |
| such as | R D | long term | such as | page Theme | Tees Valley | sustainable | Economic 20 | YORKSHIRE |
| per head | climate chan | Regional Str | Sustaining D | page Chapte | such as | New Action | Strategy 201 | 2006 2015 |
The list throws up some interesting results:
- "Climate Change" is only a top 10 issue for AWM and EEDA (NB: just because these RDAs mention this term frequently doesn't mean that other RDAs aren't as concerned about the environment, if not more so).
- "The Northern Way", an agreement between the three northern RDAs, is only in YF's top 10.
- "Long Term" thinking is only apparent in EMDA and YF's top 10.
The lists demonstrate that my method needs some work. Ideally we would like to remove the stray "s" (left over from possessives such as "region's") and stylistic elements like "such as", but I'm satisfied with the results for the time being.
The rankings of the different RES documents are compared using Spearman's rank correlation coefficient. The correlation matrix of RES memes is as follows.
| | awm | eeda | emda | lda | nwda | one | seeda | swrda | yf |
| awm | | 56.12% | 56.19% | 54.79% | 54.78% | 57.08% | 55.22% | 58.18% | 57.66% |
| eeda | | | 57.79% | 57.41% | 57.88% | 60.08% | 61.09% | 57.60% | 59.15% |
| emda | | | | 55.47% | 54.46% | 59.50% | 57.90% | 55.08% | 60.00% |
| lda | | | | | 51.24% | 58.64% | 57.58% | 54.19% | 58.28% |
| nwda | | | | | | 55.76% | 57.63% | 52.97% | 56.33% |
| one | | | | | | | 59.18% | 56.99% | 62.51% |
| seeda | | | | | | | | 58.90% | 58.96% |
| swrda | | | | | | | | | 54.89% |
| yf | | | | | | | | | |
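Each cell in the matrix above is a Spearman coefficient between two ranked bigram lists. A minimal sketch of that comparison in Python follows. It assumes that only bigrams appearing in both documents are compared and that tied scores share an average rank; the function names are hypothetical, and the package's rank module may treat these cases differently.

```python
import math


def average_ranks(scores):
    """Map each key to its rank (1 = highest score); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of keys tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for key in ordered[i : j + 1]:
            ranks[key] = shared
        i = j + 1
    return ranks


def spearman(scores_a, scores_b):
    """Spearman's rho over the bigrams that both documents contain."""
    common = sorted(set(scores_a) & set(scores_b))
    if len(common) < 2:
        raise ValueError("need at least two bigrams in common")
    ranks_a = average_ranks({k: scores_a[k] for k in common})
    ranks_b = average_ranks({k: scores_b[k] for k in common})
    mean = (len(common) + 1) / 2  # mean of the ranks 1..n, with or without ties
    cov = sum((ranks_a[k] - mean) * (ranks_b[k] - mean) for k in common)
    var_a = sum((ranks_a[k] - mean) ** 2 for k in common)
    var_b = sum((ranks_b[k] - mean) ** 2 for k in common)
    if var_a == 0 or var_b == 0:
        raise ValueError("all scores tied; correlation is undefined")
    return cov / math.sqrt(var_a * var_b)
```

Here `scores_a` and `scores_b` are dictionaries mapping each bigram to its log-likelihood score for one document. The result lies between -1 and 1, so the percentages in the matrix correspond to values such as 0.56.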
Since the correlation coefficients cluster around the "half-similar" 50% mark, I've prepared another matrix that subtracts 0.5 (50 percentage points) from each. The matrix below clearly shows that the most similar RESs are those of neighbours Yorkshire Forward and One North East, and those of the South East and East of England Development Agencies. The two least similar pairs both involve the Northwest Development Agency, matched with its distant cousins in London and the South West.
| | awm | eeda | emda | lda | nwda | one | seeda | swrda | yf |
| awm | | 6% | 6% | 5% | 5% | 7% | 5% | 8% | 8% |
| eeda | | | 8% | 7% | 8% | 10% | 11% | 8% | 9% |
| emda | | | | 5% | 4% | 10% | 8% | 5% | 10% |
| lda | | | | | 1% | 9% | 8% | 4% | 8% |
| nwda | | | | | | 6% | 8% | 3% | 6% |
| one | | | | | | | 9% | 7% | 13% |
| seeda | | | | | | | | 9% | 9% |
| swrda | | | | | | | | | 5% |
| yf | | | | | | | | | |
Again, this method isn't perfect. First, it would benefit from improved input data as described above. Second, we could extend the analysis by applying a statistical significance test to the correlation coefficients, to confirm whether the regions really are similar or different.
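One way such a significance test might look is a permutation test, sketched below in Python. This is a suggestion, not something that has been run on the RES data: the function names, the number of trials and the fixed seed are all assumptions, and the simple formula used for the coefficient is only valid when the ranks contain no ties.

```python
import random


def spearman_from_ranks(rank_a, rank_b):
    """Spearman's rho for two tie-free rank lists of equal length."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))


def permutation_p_value(rank_a, rank_b, trials=2000, seed=0):
    """Estimate how often shuffled ranks correlate at least as strongly as observed."""
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    observed = abs(spearman_from_ranks(rank_a, rank_b))
    shuffled = list(rank_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(spearman_from_ranks(rank_a, shuffled)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

A small p-value would suggest that two rankings agree (or disagree) more than chance alone would explain.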
Moreover, the N-Gram method itself isn't an ideal means of comparison. Much of the similarity between RESs is inevitable - all the documents are written to a similar brief. The analysis is also prone to bias - similar writing styles do not always mean similar content. Indeed, the N-Gram Statistics Package that I've used was not designed to compare different documents - the rank module is supposed to be used to compare the statistical rankings assigned by different measures, e.g. log-likelihood vs. pointwise mutual information.
Please let me know if you'd like me to continue with the analysis, perhaps with the improvements I've hinted at, or by extending the research to cover the Welsh Development Agency or Scottish Enterprise. You may also have your own recommendations. If you've got a more serious requirement, I'm open to commissions!