Sustainable Tourism Index
In this notebook I create a composite index to measure sustainability in the tourism industry. The Sustainable Tourism Index is composed using factor analysis, in which various observable variables are combined to form an unobservable factor, "Sustainable Tourism". I explain the importance and relevance of such an index before discussing the method used to create it. I find that factor analysis is well suited to creating such a composite index, since differential weighting attributes weight to each variable in a representative and realistic way. Finally, instead of using several individual yearly indices, it may be more valuable to calculate one index for all years, so that scores can be compared across years more reliably.
I attempt to create such an index with data that is freely available to the public.
In recent decades tourism has become increasingly accessible to people around the world; with travel costs decreasing and disposable income steadily rising, it has become more affordable for people to travel and discover new places. In fact, in the 1950s 25 million international arrivals were recorded, whereas by 2015 this number had risen to a staggering 1.2 billion arrivals (Sharpley and Telfer, 2014; WTTC, 2016). With the tourism industry now contributing almost 10% to global GDP annually, and employing over 300 million people worldwide, it is not surprising that it continues to have a tremendous impact on economic development. Besides GDP and employment, the industry also directly affects investment, exports, consumption, and even wealth distribution. Distinct from many other sectors, the tourism industry provides an immensely diverse range of activities and products, generating an extensive value chain which impacts multiple other sectors.
Unfortunately, the development of tourism is paired with various production externalities which can damage long-term growth. Over-exploitation of and competition for natural resources directly affect the livelihood of local communities and stress regional biodiversity. Creaco and Querini (2003) note that the negative and irreversible effects of unplanned and uncontrolled growth actually destroy the unique natural and social resource foundation of tourism. Consequently, it is important to balance the social, environmental, and economic factors of tourism. With this in mind, the tourism industry is increasingly becoming more sustainable, and global institutions are focussing on stimulating this sustainable development through specific programmes such as the Sustainable Tourism – Eliminating Poverty (ST-EP) Initiative and the 10-Year Framework of Programmes on Sustainable Consumption and Production Patterns Sustainable Tourism Programme (UNWTOb, 2016; UNEP, 2016).
from IPython.display import YouTubeVideo
YouTubeVideo("JFbbKbdqoJg")
With the UN declaring 2017 the International Year of Sustainable Tourism (Video Link for IY 2017), focus is increasingly on measuring sustainability in the industry and on tracking its development. Since sustainability is a complex and multivariate concept, different methods have been proposed to quantify a composite index for the level of sustainable tourism. Such a composite would properly encapsulate the multitude of elements inherent to sustainability, and has been attempted in the following cases: the Tourism Penetration Index, the Barometer of Tourism Sustainability, the Sustainable Tourism Index, the Sustainable Tourism Benchmarking Tool, and the Vectorial Dynamic Composite Indicator (McElroy & De Albuquerque, 1998; Ko, 2005; Pulido Fernández & Sánchez Rivero, 2009; Cernat & Gourdon, 2012; Blancas et al., 2016). Unfortunately, however, many of the variables needed to create these composites are not (well) measured by data-gathering institutions. With this in mind, I attempt to create a new index based on variables that are currently measured and available.
After reviewing various data sources I decided that data from the World Economic Forum (WEF) and Yale were most relevant for this analysis. These organisations have collected data on the sustainable development of the tourism industry and on environmental performance, respectively, and are reliable and valid sources. The WEF publishes a Travel and Tourism Competitiveness Report every few years, which includes specific data on the tourism industry and sustainability. Yale developed the Environmental Performance Index (EPI), an index composed of environmental indicators, and publishes new country-level scores each year. Data for 144 countries from 2007, 2008, 2009, 2011, 2013, and 2015 will be used.
Factor analysis is a method often used to create composite indices, in particular for concepts of a complex social nature. It has the capacity to analyze correlations between observable variables and an unobservable variable, or factor. Factor analysis commences with observed variables, which are assumed to have some linear relation to the common factor, and then derives the likely component variables (Mulaik, 2010). Instead of simply weighting all variables equally, factor analysis filters out variables which are not relevant and attributes more realistic and accurate weightings to each of them. See these webpages for more information (Short Introduction to Factor Analysis, A Beginner's Guide to Factor Analysis, Confirmatory Factor Analysis, Factor Analysis in Python).
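As a quick illustration of this filtering behaviour, the following sketch runs scikit-learn's FactorAnalysis on synthetic data (the data and variable structure here are invented for illustration; they are not taken from this study):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import scale

rng = np.random.RandomState(0)
factor = rng.normal(size=200)                    # the unobserved common factor
observed = np.column_stack([
    0.9 * factor + 0.1 * rng.normal(size=200),   # strongly related variable
    0.5 * factor + 0.5 * rng.normal(size=200),   # moderately related variable
    rng.normal(size=200),                        # pure noise, unrelated to the factor
])

fa = FactorAnalysis(n_components=1)
fa.fit(scale(observed))
print(fa.components_)  # loadings: large for the first variable, near zero for the noise
```

The fitted loadings mirror how strongly each observed variable relates to the factor, which is exactly the differential weighting described above: the noise variable is effectively filtered out.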
This study will use a confirmatory approach, combining observable variables which are expected to represent the common factor ‘Sustainable Tourism’. This factor will be composed using differentially weighted variables, similar to the ST index and Sustainable Tourism Benchmarking Tool created by Pulido-Fernandez & Sanchez-Rivero (2009) and Cernat & Gourdon (2012), respectively. This index will be based on different observable variables which are currently available for research. The number of selected variables is large during this stage, but will be reduced during the factor analysis, when their individual correlation is determined.
Sustainability incorporates three common dimensions: economic, environmental, and social. This index will include the environmental and social aspects only, to avoid endogeneity when relating the index to economic development (this is a recommendation for the future which will be discussed later). Additionally, other relevant WEF report variables are included which could contribute to the factor, but which are not necessarily attributable to a specific dimension. The following equation will represent the final factor of Sustainable Tourism:
$$ Sustainable \ Tourism \ Index_{ij} = \beta_1 \ S_{1ij} + \beta_2 \ S_{2ij} + ... + \beta_n \ S_{nij}$$
where i and j respectively denote country and year, and where $S_n$ represents one of the observable variables mentioned above. In this equation, $\beta_n$ represents the so-called factor loadings, which are the correlations between the observable variables and the unobservable factor.
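The weighted sum above can be illustrated with a toy computation (the loadings and scores below are made up for illustration; they are not the estimated values):

```python
import numpy as np

loadings = np.array([0.9, 0.7, 0.3])        # beta_1 ... beta_n (illustrative values)
country_scores = np.array([5.0, 4.0, 6.0])  # S_1ij ... S_nij for one country-year
index = loadings @ country_scores           # beta_1*S_1 + beta_2*S_2 + beta_3*S_3
print(index)                                # 0.9*5 + 0.7*4 + 0.3*6 = 9.1
```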
The following factors are included in the dataset I composed:
I find that factor analysis works well as a method for creating a composite index with differential weighting. It can process large datasets, with many observations per variable and country. The resulting ST Index is realistic and representative of its many underlying variables. Using the same loading factors over the same variables each year would provide an ST index which is comparable across many years.
Below I first import the necessary packages and load the data. I also convert the data to numeric, which is necessary for further analysis. The table below shows the 2015 data, as an example of the individual datasets. Columns represent the variables mentioned earlier, and rows represent countries: row 0 is Albania, row 143 is Zimbabwe.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.decomposition import PCA, FactorAnalysis
def load_year(path):
    # Read one year's sheet and coerce every column to numeric.
    # pd.to_numeric operates on a Series, not on an ExcelFile,
    # so it is applied column by column after parsing.
    xls = pd.ExcelFile(path)
    df = xls.parse('Blad1')
    return df.apply(pd.to_numeric, errors='coerce')

base = "C:/Users/asus/Documents/Master/Thesis/Data/FactorAnalysis per Year/"
Sustain2007 = load_year(base + "PythonSustainable2007.xlsx")
Sustain2008 = load_year(base + "PythonSustainable2008.xlsx")
Sustain2009 = load_year(base + "PythonSustainable2009.xlsx")
Sustain2011 = load_year(base + "PythonSustainable2011.xlsx")
Sustain2013 = load_year(base + "PythonSustainable2013.xlsx")
Sustain2015 = load_year(base + "PythonSustainable2015.xlsx")
Sustain2015
Sustain2007
Sustain2007.describe()
Sustain2008
Sustain2008.describe()
Sustain2009
Sustain2009.describe()
Sustain2011
Sustain2011.describe()
Sustain2013
Sustain2013.describe()
Sustain2015
Sustain2015.describe()
From the descriptives above we can see that most datapoints lie between 0 and 7, because they originate from survey questions measured on a Likert scale. This is useful, since it means the ST index itself will have a balanced score. However, there are also extreme outliers which will affect the ST index later on. The variables ProtectArea, EPI, and GenderIneq all show much higher values, and will significantly alter the ST index scores when included in the calculation. This is not necessarily a problem, but it must be taken into account when interpreting ST Index scores and comparing them over multiple years.
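Because the variables sit on such different scales, they are standardised before the factor analysis. A small sketch with invented values shows what `preprocessing.scale` does to a Likert-like column next to an EPI-like column:

```python
import numpy as np
from sklearn.preprocessing import scale

# First column mimics a 1-7 Likert item; second mimics a 0-100 index like EPI.
raw = np.array([[4.2, 61.0],
                [5.1, 88.0],
                [3.3, 45.0],
                [6.0, 92.0]])
standardised = scale(raw)         # each column is rescaled to mean 0, std 1
print(standardised.mean(axis=0))  # approximately [0, 0]
print(standardised.std(axis=0))   # [1, 1]
```

After standardisation neither column dominates the factor analysis purely because of its units, although the raw-scale differences still matter when the final index is computed from unstandardised values, as noted above.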
As I mentioned earlier, to create a composite index I must first perform the factor analysis. This analysis results in a list of factor loadings, which represent the different weights I can attribute to each variable. Again, I only print the output for 2015, since the method is the same in each period.
from sklearn import decomposition, preprocessing
import numpy as np
import pandas as pd
data2007 = Sustain2007
data2007 = data2007[~np.isnan(data2007).any(axis=1)] # take out values of NaN to ensure only numeric data remains
data_normal2007 = preprocessing.scale(data2007) #normalisation
fa2007 = decomposition.FactorAnalysis(n_components = 1) # decomposition
fa2007.fit(data_normal2007) # Factor analysis
data2008 = Sustain2008
data2008 = data2008[~np.isnan(data2008).any(axis=1)]
data_normal2008 = preprocessing.scale(data2008)
fa2008 = decomposition.FactorAnalysis(n_components = 1)
fa2008.fit(data_normal2008)
data2009 = Sustain2009
data2009 = data2009[~np.isnan(data2009).any(axis=1)]
data_normal2009 = preprocessing.scale(data2009)
fa2009 = decomposition.FactorAnalysis(n_components = 1)
fa2009.fit(data_normal2009)
data2011 = Sustain2011
data2011 = data2011[~np.isnan(data2011).any(axis=1)]
data_normal2011 = preprocessing.scale(data2011)
fa2011 = decomposition.FactorAnalysis(n_components = 1)
fa2011.fit(data_normal2011)
data2013 = Sustain2013
data2013 = data2013[~np.isnan(data2013).any(axis=1)]
data_normal2013 = preprocessing.scale(data2013)
fa2013 = decomposition.FactorAnalysis(n_components = 1)
fa2013.fit(data_normal2013)
data2015 = Sustain2015
data2015 = data2015[~np.isnan(data2015).any(axis=1)]
data_normal2015 = preprocessing.scale(data2015)
fa2015 = decomposition.FactorAnalysis(n_components = 1)
fa2015.fit(data_normal2015)
for score in fa2015.score_samples(data_normal2015):
    print(-score)  # per-observation log-likelihood under the fitted factor model
From the fitted factor models above, I can extract the factor loadings. These loadings represent the differential weights to be used in the final computation of the ST Index.
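One detail worth noting: the sign of a factor is arbitrary (flipping every loading gives an equivalent model), which is why a minus sign appears when the loadings are printed. A small sketch of this sign indeterminacy, with invented loading values:

```python
import numpy as np

loadings = np.array([-0.9, -0.7, -0.3])  # one valid fitted solution
flipped = -loadings                      # the mirror-image solution

# Both imply exactly the same common-variance structure among the
# observed variables, since the outer product lambda * lambda^T is
# unchanged when every loading flips sign.
print(np.allclose(np.outer(loadings, loadings), np.outer(flipped, flipped)))
```

Negating `components_` therefore changes nothing statistically; it only orients the loadings so that higher index scores read as "more sustainable".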
# Factor loadings per year. components_ holds one row of loadings per factor;
# the sign is flipped so the loadings point in the intuitive direction.
columns2007 = ('ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv')
columns2008 = ('ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd')
columns2013 = ('EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd')
columns2015 = ('ProtectArea', 'EPI', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd')
for year, fa, cols in [(2007, fa2007, columns2007), (2008, fa2008, columns2008),
                       (2009, fa2009, columns2008), (2011, fa2011, columns2008),
                       (2013, fa2013, columns2013), (2015, fa2015, columns2015)]:
    print(year)
    print(pd.DataFrame(-fa.components_, columns=cols))  # the factor loadings
import plotly.plotly as py
from plotly.tools import FigureFactory as FF
import plotly.tools as tls
tls.set_credentials_file(username='JoelleDuff', api_key='YOUR_API_KEY')  # replace with your own Plotly API key
import pandas as pd
data_matrix = [['Year', 'ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
['2007', 0.286224, 0.986312, 0.018201, 0.331698, 0.709648, 0.018977, 0.01186, 0.777657, 'NaN'],
['2008', -0.08715, 0.243535, 0.68323, 0.180948, -0.064265, 0.04019, 0.878366, 0.552252, 0.992939],
['2009', -0.153331, 0.162599, 0.746392, 0.164256, -0.042939, 0.028881, 0.941884, 0.444508, 0.973355],
['2011', -0.155709, 0.349313, 0.667062, 0.211852, 0.069243, 0.050142, 0.828213, 0.604534, 0.970997],
['2013', 'NaN', 0.381605, 0.571026, 0.237333, 0.166761, 0.167314, 0.777796, 0.685753, 0.970045],
['2015', 0.339649, 0.979355, 'NaN', 0.231035, 0.812514, -0.121435, 0.230918, 0.678779, 0.345624]]
table = FF.create_table(data_matrix, index=True, index_title='Year')
py.iplot(table, filename='Loading Factors')
The table above shows the loading factors for each year. To make it clearer which variables are most promising to include in the index, I create the following scatterplot:
import plotly.plotly as py
import plotly.graph_objs as go
trace2007 = go.Scatter(
x=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
y=[0.286224, 0.986312, 0.018201, 0.331698, 0.709648, 0.018977, 0.01186, 0.777657, None],
text=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
mode='markers',
name=2007,
marker=dict(
size=[100,100,100,100,100,100,100,100,100],
sizemode='area',
)
)
trace2008 = go.Scatter(
x=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
y=[-0.08715, 0.243535, 0.68323, 0.180948, -0.064265, 0.04019, 0.878366, 0.552252, 0.992939],
text=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
mode='markers',
name=2008,
marker=dict(
size=[100,100,100,100,100,100,100,100,100],
sizemode='area',
)
)
trace2009 = go.Scatter(
x=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
y=[-0.153331, 0.162599, 0.746392, 0.164256, -0.042939, 0.028881, 0.941884, 0.444508, 0.973355],
text=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
mode='markers',
name=2009,
marker=dict(
size=[100,100,100,100,100,100,100,100,100],
sizemode='area',
)
)
trace2011= go.Scatter(
x=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
y=[-0.155709, 0.349313, 0.667062, 0.211852, 0.069243, 0.050142, 0.828213, 0.604534, 0.970997],
text=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
mode='markers',
name=2011,
marker=dict(
size=[100,100,100,100,100,100,100,100,100],
sizemode='area',
)
)
trace2013= go.Scatter(
x=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
y=[None, 0.381605, 0.571026, 0.237333, 0.166761, 0.167314, 0.777796, 0.685753, 0.970045],
text=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
mode='markers',
name=2013,
marker=dict(
size=[100,100,100,100,100,100,100,100,100],
sizemode='area',
)
)
trace2015= go.Scatter(
x=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
y=[0.339649, 0.979355, None, 0.231035, 0.812514, -0.121435, 0.230918, 0.678779, 0.345624],
text=['ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
mode='markers',
name=2015,
marker=dict(
size=[100,100,100,100,100,100,100,100,100],
sizemode='area',
)
)
data = [trace2007, trace2008, trace2009, trace2011, trace2013, trace2015]
py.iplot(data, filename='bubblechart-size-ref')
This scatterplot shows the distribution of the loadings for each variable and each year. A method often used to decide which variables to include is to keep those with loadings greater than $0.3$. Furthermore, the scatterplot shows that several variables are particularly important within the factor, as they recur each year: Satisfaction, GovPrior, StringEnv, and SustInd are often above $0.5$.
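The $0.3$ cut-off rule can be sketched in code using the 2007 loadings from the table above (the values are the ones reported there):

```python
import pandas as pd

loadings_2007 = pd.Series({'ProtectArea': 0.286224, 'EPI': 0.986312,
                           'Satisfaction': 0.018201, 'Heritage': 0.331698,
                           'HealthCap': 0.709648, 'GenderIneq': 0.018977,
                           'GovPrior': 0.01186, 'StringEnv': 0.777657})

# Keep only variables whose absolute loading exceeds the 0.3 threshold.
selected = loadings_2007[loadings_2007.abs() > 0.3]
print(selected.index.tolist())  # ['EPI', 'Heritage', 'HealthCap', 'StringEnv']
```

The surviving variables are exactly those used to build the 2007 index.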
Now I construct the Sustainable Tourism Index scores, based on the loading factors given earlier. The method is repeated for each year, using the corresponding loading factors in the table below.
data_loadings = [['Indicator', '2007', 'Indicator', '2008', 'Indicator', '2009', 'Indicator', '2011', 'Indicator', '2013', 'Indicator', '2015'],
['EPI', 0.986, 'SustInd', 0.993, 'SustInd', 0.973, 'SustInd', 0.971, 'SustInd', 0.970, 'EPI', 0.979],
['StringEnv', 0.778, 'GovPrior', 0.878, 'GovPrior', 0.942, 'GovPrior', 0.828, 'GovPrior', 0.778, 'HealthCap', 0.812],
['HealthCap', 0.710, 'Satisfaction', 0.683, 'Satisfaction', 0.746, 'Satisfaction', 0.667, 'StringEnv', 0.686, 'StringEnv', 0.679],
['Heritage', 0.332, 'StringEnv', 0.552, 'StringEnv', 0.445, 'StringEnv', 0.605, 'Satisfaction', 0.571, 'SustInd', 0.346],
['-', '-', '-', '-', '-', '-', 'EPI', 0.349, 'EPI', 0.382, 'ProtectArea', 0.340]]
table2 = FF.create_table(data_loadings)
py.iplot(table2, filename='LoadingFactorsReduced')
These are the corresponding functions for each Sustainable Tourism Index:

$$Sustainable \ Tourism \ Index \ (2007) = 0.986 \ EPI + 0.778 \ StringEnv + 0.710 \ HealthCap + 0.332 \ Heritage$$
$$Sustainable \ Tourism \ Index \ (2008) = 0.993 \ SustInd + 0.878 \ GovPrior + 0.683 \ Satisfaction + 0.552 \ StringEnv$$
$$Sustainable \ Tourism \ Index \ (2009) = 0.973 \ SustInd + 0.942 \ GovPrior + 0.746 \ Satisfaction + 0.445 \ StringEnv$$
$$Sustainable \ Tourism \ Index \ (2011) = 0.971 \ SustInd + 0.828 \ GovPrior + 0.667 \ Satisfaction + 0.605 \ StringEnv + 0.349 \ EPI$$
$$Sustainable \ Tourism \ Index \ (2013) = 0.970 \ SustInd + 0.778 \ GovPrior + 0.686 \ StringEnv + 0.571 \ Satisfaction + 0.382 \ EPI$$
$$Sustainable \ Tourism \ Index \ (2015) = 0.979 \ EPI + 0.812 \ HealthCap + 0.679 \ StringEnv + 0.346 \ SustInd + 0.340 \ ProtectArea$$

Below I create matrices for the loading factors and datasets, such that $STVar2007$ represents the element-wise combination of $MatrixST2007$ (the loading factors) and $MatrixVar2007$ (the variables of the dataset which are included). The separate columns of $STVar2007$ are then summed to create the ST index, $ST2007$. The same method is used for each period.
MatrixST2007 = np.array([0.986, 0.778, 0.71, 0.332])  # (4,) vector of loading factors
MatrixVar2007 = Sustain2007[['EPI', 'StringEnv', 'HealthCap', 'Heritage']].values  # (144,4) matrix of variables per country
STVar2007 = MatrixST2007 * MatrixVar2007  # weight each column, broadcasting over the 144 countries
ST2007 = STVar2007.sum(axis=1).reshape(144, 1)  # sum the weighted columns into a (144,1) vector of scores

MatrixST2008 = np.array([0.993, 0.878, 0.683, 0.552])
MatrixVar2008 = Sustain2008[['SustInd', 'GovPrior', 'Satisfaction', 'StringEnv']].values
ST2008 = (MatrixST2008 * MatrixVar2008).sum(axis=1).reshape(144, 1)

MatrixST2009 = np.array([0.973, 0.942, 0.746, 0.445])
MatrixVar2009 = Sustain2009[['SustInd', 'GovPrior', 'Satisfaction', 'StringEnv']].values
ST2009 = (MatrixST2009 * MatrixVar2009).sum(axis=1).reshape(144, 1)

MatrixST2011 = np.array([0.971, 0.828, 0.667, 0.605, 0.349])
MatrixVar2011 = Sustain2011[['SustInd', 'GovPrior', 'Satisfaction', 'StringEnv', 'EPI']].values
ST2011 = (MatrixST2011 * MatrixVar2011).sum(axis=1).reshape(144, 1)

MatrixST2013 = np.array([0.97, 0.778, 0.686, 0.571, 0.382])
MatrixVar2013 = Sustain2013[['SustInd', 'GovPrior', 'StringEnv', 'Satisfaction', 'EPI']].values
ST2013 = (MatrixST2013 * MatrixVar2013).sum(axis=1).reshape(144, 1)

MatrixST2015 = np.array([0.979, 0.812, 0.679, 0.346, 0.34])
MatrixVar2015 = Sustain2015[['EPI', 'HealthCap', 'StringEnv', 'SustInd', 'ProtectArea']].values
ST2015 = (MatrixST2015 * MatrixVar2015).sum(axis=1).reshape(144, 1)
ST=np.concatenate([ST2007,ST2008], axis=1) # Combine all ST index scores into one matrix containing all years
ST=np.concatenate([ST,ST2009], axis=1)
ST=np.concatenate([ST,ST2011], axis=1)
ST=np.concatenate([ST,ST2013], axis=1)
ST=np.concatenate([ST,ST2015], axis=1)
print(ST)
The table printed below shows the final ST Index scores per year, per country.
STFull = pd.ExcelFile("C:/Users/asus/Documents/Master/AEA I/ST Score Python.xlsx")
STFull = STFull.parse('Blad1')
STFull = STFull.apply(pd.to_numeric, errors='coerce')  # coerce column by column
print(STFull)
To show the results more clearly, I create a line graph below showing the trend in ST scores for several countries. For reasons of space, only the first few and last few countries in the dataset are shown (Albania, Barbados, Burundi, Singapore, Sweden, and the United Kingdom).
import plotly.plotly as py
import plotly.graph_objs as go
Albania = go.Scatter(
x = ['2007', '2008', '2009', '2011', '2013', '2015'],
y = [55.70450, 13.22739, 14.63330, 32.94164, 40.22614, 78.964873],
text=['2007', '2008', '2009', '2011', '2013', '2015'],
name='Albania',
line=dict(
shape='spline')
)
Barbados = go.Scatter(
x = ['2007', '2008', '2009', '2011', '2013', '2015'],
y = [None, 18.31559, 19.27034, 34.19201, 38.69009, 61.994398],
text=['2007', '2008', '2009', '2011', '2013', '2015'],
name='Barbados',
line=dict(
shape='spline')
)
Burundi = go.Scatter(
x = ['2007', '2008', '2009', '2011', '2013', '2015'],
y = [None, 11.91487, 12.78301, 20.38951, 27.02672, 46.622036],
text=['2007', '2008', '2009', '2011', '2013', '2015'],
name='Burundi',
line=dict(
shape='spline')
)
Singapore = go.Scatter(
x = ['2007', '2008', '2009', '2011', '2013', '2015'],
y = [None, 19.60847, 20.05713, 47.77074, 52.65162, 94.220296],
text=['2007', '2008', '2009', '2011', '2013', '2015'],
name='Singapore',
line=dict(
shape='spline')
)
Sweden = go.Scatter(
x = ['2007', '2008', '2009', '2011', '2013', '2015'],
y = [88.43342, 17.49914, 18.43893, 45.34532, 50.95157, 100.942513],
text=['2007', '2008', '2009', '2011', '2013', '2015'],
name='Sweden',
line=dict(
shape='spline')
)
UnitedKingdom = go.Scatter(
x = ['2007', '2008', '2009', '2011', '2013', '2015'],
y = [90.59648, 16.94227, 17.04695, 43.27964, 50.48223, 103.090195],
text=['2007', '2008', '2009', '2011', '2013', '2015'],
name='United Kingdom',
line=dict(
shape='spline')
)
data = [Albania, Barbados, Burundi, Singapore, Sweden, UnitedKingdom]
layout = dict(title = 'Sustainable Tourism Index Trend',
xaxis = dict(title = 'Year'),
yaxis = dict(title = 'ST Index (Score)'),
)
fig = dict(data=data, layout=layout)
py.iplot(fig, filename='STindex scores')
To check whether the results are realistic and accurate, I compare them with those I found earlier using the statistical program R, where I performed the same analysis. The loading factors resemble those calculated by R: although they are not exactly the same, they are often very similar and point in the same direction. This comparison helps support the results found in this Python analysis.
Rloadings = pd.ExcelFile("C:/Users/asus/Documents/Master/AEA I/Rdata.xlsx")
Rloadings = Rloadings.parse('Blad2')
Rloadings = Rloadings.apply(pd.to_numeric, errors='coerce')  # coerce column by column
print(Rloadings)
data_matrix = [['Year', 'ProtectArea', 'EPI', 'Satisfaction', 'Heritage', 'HealthCap', 'GenderIneq', 'GovPrior', 'StringEnv', 'SustInd'],
['2007', 0.286224, 0.986312, 0.018201, 0.331698, 0.709648, 0.018977, 0.01186, 0.777657, 'NaN'],
['2008', -0.08715, 0.243535, 0.68323, 0.180948, -0.064265, 0.04019, 0.878366, 0.552252, 0.992939],
['2009', -0.153331, 0.162599, 0.746392, 0.164256, -0.042939, 0.028881, 0.941884, 0.444508, 0.973355],
['2011', -0.155709, 0.349313, 0.667062, 0.211852, 0.069243, 0.050142, 0.828213, 0.604534, 0.970997],
['2013', 'NaN', 0.381605, 0.571026, 0.237333, 0.166761, 0.167314, 0.777796, 0.685753, 0.970045],
['2015', 0.339649, 0.979355, 'NaN', 0.231035, 0.812514, -0.121435, 0.230918, 0.678779, 0.345624]]
table = FF.create_table(data_matrix, index=True, index_title='Year')
py.iplot(table, filename='Loading Factors')
The results from this analysis in Python are similar to those from R. From this, I conclude that factor analysis is a good method for creating a composite index with many underlying variables. The ST index seems to work well for each year, and is well equipped to give quick insight into the sustainability of a country's tourism industry. So, to answer the research question: yes!
There are some caveats to my approach, however, which should be considered in future research. Some datasets were missing certain variables, which meant the factor loadings differed from year to year. In particular, the 2007, 2013, and 2015 datasets differed quite a lot, as did their ST index scores. Fortunately, in 2008, 2009, and 2011 the same variables were included in the ST index, making the indices for those years more comparable with each other. The 2007, 2013, and 2015 scores, however, included variables such as ProtectArea, EPI, and GenderIneq, which, as mentioned earlier, have much larger values than the other indicators. For this reason, the ST index scores for 2007, 2013, and 2015 are much higher than those of the other years.
In the future it would be better to use only those variables for the index which are included in every dataset, and perhaps to exclude certain years if they lack important variables. The indices would then be comparable across years, and important insights could be gained. A second option would be to make a single ST index, combining the individual ones by averaging the loading factors. This would make the scores even more comparable, since they would all be calculated in the same way.
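The second option can be sketched as follows, averaging each variable's loadings over the years in which it appears. The loadings are taken from the earlier table; only three variables are shown for brevity, and missing years are skipped in the average:

```python
import numpy as np
import pandas as pd

# Loadings per year for three recurring variables (from the table above).
loadings = pd.DataFrame(
    {'GovPrior':  [0.01186, 0.878366, 0.941884, 0.828213, 0.777796, 0.230918],
     'StringEnv': [0.777657, 0.552252, 0.444508, 0.604534, 0.685753, 0.678779],
     'SustInd':   [np.nan, 0.992939, 0.973355, 0.970997, 0.970045, 0.345624]},
    index=[2007, 2008, 2009, 2011, 2013, 2015])

# mean() is NaN-aware, so SustInd is averaged over the five years it appears in.
pooled = loadings.mean()  # one year-invariant weight per variable
print(pooled.round(3))
```

Applying these pooled weights to every year's data would yield ST index scores that are calculated identically in each period, and hence directly comparable over time.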