Meetup Data Analysis for Countries trending in Innovation
Meetup API key
External Libraries Required: pycountry
Internet Connection
The Wikipedia API is used to obtain the list of countries trending in innovation.
The Meetup API, accessed with an API key, is used to analyze and compare statistics for these countries. The API key is listed below.
Analyzing Meetup data trends for countries interested in innovation over the last year.
api_key = "29191b3c116b165929763e304f2897e"
The data behind these APIs changes over time; for analysis purposes, data from the following period is used:
Wikipedia Data:
Revision Date: 06-01-2018
Page Title: Global_Innovation_Index
Meetup Data:
Recent Data
import pandas as pd
from pandas.io.json import json_normalize
import urllib.request
import json
import time
import os
# this is a general class which can be used by all the other classes for fetching data
class api_request:
def request_data(self,url):
'''
Parameters: url
return: data fetched from api in json format
----------
To request data from a URL through an HTTP request and parse it into JSON format
'''
try:
response = urllib.request.urlopen(url) #Hit URL: url
raw_json = response.read().decode() #Decode response
data = json.loads(raw_json)
return data #return data
except Exception as h:
print("Cannot find data at given URL\nError:"+str(h)) # Handle exception and return nothing
return "" # if data is not found at given URL
class Wiki(api_request):
def __init__(self):
self.__apiurl = "https://en.wikipedia.org/w/api.php" # API Url
self.__datatype = "json" # default format to Fetch data: json
def get_wiki_page(self,page_title="Global_Innovation_Index"):
'''
Parameters: page_title (page title of wikipedia article)
return: json response
----------
To get the data from WIKI API using the parameters: pagetitle and format
'''
url_params = "?action=parse&prop=text&page="+page_title+"&format="+self.__datatype
url= self.__apiurl + url_params
return self.request_data(url)
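For reference, the request above resolves to a plain MediaWiki parse-API URL. This minimal sketch (no network access; `build_wiki_url` is an illustrative helper, not part of the classes above) shows how the URL is assembled, using `urlencode` so the page title is escaped safely:

```python
from urllib.parse import urlencode

# Build the same MediaWiki parse-API URL the Wiki class constructs,
# using urlencode so the page title is escaped safely.
def build_wiki_url(page_title="Global_Innovation_Index", fmt="json"):
    base = "https://en.wikipedia.org/w/api.php"
    params = {"action": "parse", "prop": "text", "page": page_title, "format": fmt}
    return base + "?" + urlencode(params)

url = build_wiki_url()
print(url)
```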
# meetup-api is a wrapper on top of the Meetup HTTP API; it uses an API key to establish a connection and get responses
import meetup.api
class Meetup(api_request):
def __init__(self):
self.__apiurl = "https://api.meetup.com"
self.__apikey = "29191b3c116b165929763e304f2897e" #API key for meetup used to get the data
self.__client = meetup.api.Client(self.__apikey,overlimit_wait=True) #get client from API key to execute API
def refresh_connection(self):
'''Refresh the client if expired. This is required to make sure
data is fetched consistently and the code does not stop on an exception from the API
'''
self.__client = meetup.api.Client(self.__apikey,overlimit_wait=True)
def get_categories(self):
'''
return: dataframe
--------
Method to get all the meetup categories
'''
a = self.__client.GetCategories()
categories_df = pd.read_json(json.dumps(a.results))
return categories_df
def get_cities(self, code):
'''
Parameters: country code
return: dataframe
---------
Method to get cities for the country using parameter: country alpha_2 code
'''
cities = self.__client.GetCities(country=code)
cities_df = pd.read_json(json.dumps(cities.results))
return cities_df
def get_groups(self, cat_id, lat, long, offset_val=0, retrycount=0):
'''
Parameters: category id, latitude, longitude, offset value and retry count(max allowed 2)
return: normalized json
--------
Method to get all groups of the city
'''
try:
if retrycount==2:
print("Max retries reached") # Check if maximum retries reached, return nothing in that case
return ""
retrycount = retrycount + 1
group = self.__client.GetGroups(category_id = cat_id, lat=lat, lon= long, offset = offset_val)
return json_normalize(group.results)
except:
print("Trying to re-establish connection")
self.refresh_connection()
return self.get_groups(cat_id, lat, long, offset_val, retrycount)
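The retry logic in `get_groups` follows a general pattern: attempt the call, refresh the client on failure, and give up after a bounded number of retries. A standalone sketch of that pattern, with `fetch` and `refresh` as hypothetical stand-ins for the meetup.api client calls (no real API involved):

```python
import time

# Generic sketch of the retry pattern used by get_groups: try the call,
# refresh the client on failure, and stop after max_retries attempts.
def fetch_with_retry(fetch, refresh, max_retries=2, delay=0.0):
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            refresh()          # e.g. re-create the API client
            time.sleep(delay)  # optional pause before retrying
    return ""                  # mirror get_groups: empty result once retries are exhausted

# Demo with a stub that fails once, then succeeds:
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("transient")
    return "ok"

result = fetch_with_retry(flaky, lambda: None)
print(result)  # → ok
```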
from bs4 import BeautifulSoup
import pycountry
class ParseData:
def __init__(self):
'''Initialize a {country: country code} dictionary using the pycountry module
'''
self.__countries_dict={country.name.lower():country.alpha_2.lower() for country in pycountry.countries }
def wiki_content(self, html):
'''
Parameters: html content
returns: dataframe
---------
Method to parse the Wikipedia HTML content to get the top innovative countries
'''
parsed_html = BeautifulSoup(html,"lxml")
table = parsed_html.body.find_all('table', attrs={'class':'wikitable'})[0]
links = []
links.extend(table.select('td a'))
countries_wiki = []
for link in links:
countries_wiki.append(link.text.lower())
df = pd.DataFrame([countries_wiki],index=['geoName']).T
df =df.set_index('geoName')
# After getting the list of countries,
# creating a dataframe by joining the country list and their alpha-2 code dictionary
country_df = df.join(pd.DataFrame(self.__countries_dict,index=['code']).rename_axis('geoName').T , how='inner')
return country_df
def get_countries_dict(self):
'''
returns a {country code: country} dictionary using the pycountry module
'''
return {country.alpha_2.upper():country.name for country in pycountry.countries }
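The inner join in `wiki_content` is the key step: scraped country names are matched against a name-to-alpha-2 mapping, and names without a known code are dropped. A minimal sketch with a hand-rolled dictionary standing in for the pycountry-derived one:

```python
import pandas as pd

# Sketch of the inner join wiki_content performs: scraped country names are
# matched against a {name: alpha-2} mapping; a small hand-rolled dict stands
# in for the pycountry-derived dictionary.
codes = {"switzerland": "ch", "sweden": "se", "netherlands": "nl"}
code_df = pd.Series(codes, name="code").to_frame().rename_axis("geoName")

scraped = pd.DataFrame({"geoName": ["switzerland", "sweden", "atlantis"]}).set_index("geoName")

# The inner join drops names with no known code ("atlantis" here).
country_df = scraped.join(code_df, how="inner")
print(country_df)
```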
class DataStore:
def __init__(self):
self.__datapath = os.path.join(os.getcwd(), "data")
self.__pathObj = {
'wiki' : os.path.join(self.__datapath, "wiki"),
'meetup': os.path.join(self.__datapath, "meetup")
}
#Create required directories for meetup data and wikipedia data
for k,v in self.__pathObj.items():
os.makedirs(v, exist_ok=True)
def df_to_csv(self, df, data_dir, filename, index=True):
'''
Parameters: data frame, directory name, file name and if index column is required or not
-----------
Method to write dataframe to csv,
exception handled if file locked by some other operation or could not write to file
'''
try:
print("Write to file:\n"+os.path.join(self.__pathObj[data_dir], filename))
if(index):
df.to_csv(self.__pathObj[data_dir]+'/'+filename, sep=',', encoding='utf-8')
else:
df.to_csv(self.__pathObj[data_dir]+'/'+filename, sep=',', encoding='utf-8',index=False)
except PermissionError:
print("\nCannot write to file\n\nPlease close the file, if file is open")
except Exception as h:
print("Cannot write to file \nError:"+str(h))
def create_directory(self,data_dir, dirname):
'''
Parameters: base directory(data_dir), directory name
-----------
Method to create folder
'''
os.makedirs(self.__pathObj[data_dir]+'/'+dirname, exist_ok=True)
def read_from_csv(self,data_dir, filename):
'''
Parameters: base directory(data_dir), file name
return: dataframe
-----------
Method to read csv into dataframe
'''
df = pd.read_csv(self.__pathObj[data_dir]+'/'+filename)
return df
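The `DataStore` round trip can be sketched with portable paths: `os.path.join` works on both Windows and POSIX, unlike hard-coded backslash separators. An illustrative stand-alone version using a temporary directory, not the class above:

```python
import os
import tempfile
import pandas as pd

# Round-trip sketch for the DataStore idea, using portable paths.
base = tempfile.mkdtemp()
meetup_dir = os.path.join(base, "data", "meetup")
os.makedirs(meetup_dir, exist_ok=True)

df = pd.DataFrame({"city": ["Zurich", "Geneva"], "lat": [47.37, 46.2]})
path = os.path.join(meetup_dir, "cities.csv")
df.to_csv(path, index=False, encoding="utf-8")

roundtrip = pd.read_csv(path)
print(roundtrip.equals(df))  # → True
```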
from IPython.display import display #For displaying beautiful dataframes, supported by IPython
parse = ParseData()
sd = DataStore()
#Fetch Wikipedia content
wiki = Wiki();
data = wiki.get_wiki_page()
# The HTML content retrieved from Wikipedia (a string) is parsed, and the country list is written to countries.csv in the wiki folder
if(data!=""):
html = data['parse']['text']['*']
# Parse and Store countries list in file
country_df = parse.wiki_content(html)
sd.df_to_csv(country_df, 'wiki', 'countries.csv')
print("\n\nSample Innovative Countries Dataframe:")
selected_countries = country_df[0:2] #Only top 2 countries have been chosen
display(selected_countries)
else:
selected_countries = pd.DataFrame()
#Sample output of the dataframe has been shown below:
import meetup.api
meetupObj = Meetup();
The countries fetched from Wikipedia include many where Meetup groups exist. To request further data about groups and categories, city data is needed; the latitude and longitude of each city are used later to fetch group data.
#Get country codes for the selected countries; if no country is found, use Ireland by default
if not selected_countries.empty:
codes = selected_countries['code'].tolist()
else:
codes = ['ie']
codes
delay_time = 1.0
#to iteratively get cities for all the selected countries
for code in codes:
print("Fetching cities for "+code+". . . Please wait . . .")
cities_df = meetupObj.get_cities(code)
time.sleep(delay_time)
sd.create_directory('meetup','cities')
sd.df_to_csv(cities_df, 'meetup', 'cities/'+code+'.csv', index=False)
display(cities_df[:3])
print("Total Records Fetched for "+ code +": "+str(len(cities_df.index)) )
To request group data, category IDs are also required.
#Fetch Meetup Categories
df = meetupObj.get_categories().set_index('id')
sd.df_to_csv(df, 'meetup', 'categories.csv')
print("\n\nSample Categories:")
display(df[0:5])
cat_id = df.index.tolist()
cat_id[0:5]
Step 1: Once we have city data (latitude and longitude), we can fetch group data for a more granular level of information. It includes the fields required for a comparative analysis of the top two innovative countries.
The group information is stored locally as multiple CSVs. Given the large size of the data, files are written per category ID, which is unique within a country.
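The fetch loop below pages through results by incrementing an offset until an empty page comes back. The pattern in isolation, with `fake_api` as a hypothetical stand-in for `meetupObj.get_groups`:

```python
# Sketch of the offset pagination used when fetching groups: request
# successive pages until an empty page comes back.
PAGES = [["grp-a", "grp-b"], ["grp-c"], []]  # three server-side "pages"

def fake_api(offset):
    return PAGES[offset] if offset < len(PAGES) else []

def fetch_all(max_pages=10000):
    results = []
    for offset in range(max_pages):
        page = fake_api(offset)
        if len(page) == 0:  # an empty page means there is no more data
            break
        results.extend(page)
    return results

all_groups = fetch_all()
print(all_groups)  # → ['grp-a', 'grp-b', 'grp-c']
```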
# Getting group data for each city based on lat, long and category
for code in codes:
print(code)
df = sd.read_from_csv('meetup','/cities/'+code+'.csv')
df1 = df[['lat','lon']]
for i in cat_id:
print(i)
groups = pd.DataFrame([])
#Iterate over cities lat lon
for city_itr in range(len(df1)):
lat = df1.loc[city_itr].lat
lon = df1.loc[city_itr].lon
#keep paginating until no more results are returned
for j in range(10000):
result = meetupObj.get_groups(i,lat,lon,j)
if len(result) != 0:
groups = groups.append(result)
else:
break
# Get data in dataframe from multiple CSVs
sd.create_directory('meetup','groups')
sd.create_directory('meetup','groups/'+code)
sd.df_to_csv(groups, 'meetup', 'groups/'+code+'/'+str(i) +'.csv', index=False)
Sample files are present in the 'data' folder. The files inside './data/group' do not contain all the groups, since the full dataset is very large, but all files share the same structure.
Please contact me for more data
Step 2: Select, combine, and create a master dataframe from the group data extracted for the different countries. The API data must be stored locally to enable future processing. The resulting dataframe is still unclean, and only a limited part of the data will be selected.
print("executing")
cwd = os.getcwd()
group_df_arr = pd.DataFrame([])
frames = []
for code in codes:
data_list = []
for i in cat_id:
try:
df = pd.read_csv(cwd+'/data/meetup/groups/'+code+'/'+str(i)+'.csv', index_col = False, header=0, dtype='unicode')
df.drop_duplicates(subset=None, inplace=True)
data = df
except Exception as h:
print("Exception: "+str(h))
print(cwd+'/data/meetup/groups/'+code+'/'+str(i)+'.csv')
continue # skip this category so a stale or undefined 'data' is not appended
data_list.append(data)
group_df_arr = pd.concat(data_list)
print(len(group_df_arr))
frames.append(group_df_arr)
result = pd.concat(frames)
display(result[0:2])
Raw data has been collected in the 'result' dataframe
def parse_group_data(result):
'''
Parameters: dataframe with raw data (result)
--------
return: cleaned dataframe which will be used for analysis
'''
grouped_Df = result.reset_index(drop=True)
# Within the default search radius around a city's latitude and longitude, results can include groups from other countries too.
# Only the relevant data is kept: a filter on the fetched country codes is applied first,
# and then only the required columns are selected
grouped_Df = grouped_Df[grouped_Df['country'].str.lower().isin([x.lower() for x in codes])][['id','city','country','created','category.id','category.name','category.shortname','rating','members']]
#change the date format from timestamp to date string
grouped_Df = grouped_Df.assign(
date_created=pd.Series(
grouped_Df['created']
.apply(lambda x: datetime.datetime.fromtimestamp(int(x) / 1e3))
.dt.strftime('%Y-%m-%d'))
.values)
countries_dict = parse.get_countries_dict()
grouped_Df = grouped_Df.assign(country_name=pd.Series(grouped_Df['country'].apply(lambda x: countries_dict[x])).values)
grouped_Df.drop(['created'], axis=1, inplace=True)
return grouped_Df
import datetime
grouped_Df = parse_group_data(result)
grouped_Df[0:5]
#Sample output for parsed data
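The `created` field is a Unix timestamp in milliseconds, which the lambda in `parse_group_data` converts to a date string. A vectorized equivalent (interpreting the timestamps as UTC, which may differ from the local-time `fromtimestamp` used above):

```python
import pandas as pd

# The "created" field is a Unix timestamp in milliseconds; unit="ms" lets
# pandas convert the raw integers directly to datetimes.
created_ms = pd.Series(["1514764800000", "1517443200000"])  # raw string values

dates = pd.to_datetime(created_ms.astype("int64"), unit="ms").dt.strftime("%Y-%m-%d")
print(dates.tolist())  # → ['2018-01-01', '2018-02-01']
```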
Finally, as part of pre-processing, check whether any further handling is needed for missing data
#look for missing data
grouped_Df.isnull().sum() # no missing values in the reduced dataset
grouped_Df.dtypes.value_counts()
grouped_Df.isnull().values.any()
There are no nulls in the data collected from the raw dataframe, and no "NaN" or blank values exist in any column. This indicates the data is now clean and all values are pre-processed.
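The same null checks in miniature, on a toy frame that does contain a gap, to show what each one reports:

```python
import pandas as pd

# A tiny frame with one missing city, to demonstrate the checks above.
df = pd.DataFrame({"city": ["Zurich", None], "members": [120, 45]})

print(df.isnull().sum().to_dict())     # per-column counts of missing values
print(bool(df.isnull().values.any()))  # → True
```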
#grouped_Df[0:5]
print("Descriptive Stats:\n")
display(grouped_Df.describe())
import matplotlib
import matplotlib.pyplot as plt
import squarify # Squarify is used to get the rectangle coordinates for tree map
import seaborn as sns
%matplotlib inline
temp = grouped_Df # A temporary copy is created to perform further actions using the copy of dataframe
temp = temp.apply(pd.to_numeric, errors='ignore')
Aim: To compare how different meetup groups have grown in the two countries over the years.
Plots Used: Line Chart and Area Chart
country_time_series = []
countries_dict = parse.get_countries_dict()
for c in range(len(codes)):
country_time_series.append(grouped_Df[grouped_Df['country'] == codes[c].upper()][['date_created']])
country_time_series[c]['date_created'] =pd.to_datetime(country_time_series[c].date_created).dt.year
country_time_series[c].reset_index().drop(['index'],axis=1)
country_time_series[c] = country_time_series[c].groupby(['date_created'])[['date_created']].agg('count')
country_time_series[c] = country_time_series[c].rename(index=str, columns={"date_created": countries_dict[codes[c].upper()]})
country_time_series_df = pd.concat(country_time_series, axis=1).fillna(value=0)
display(country_time_series_df[0:5])
country_time_series_df.plot()
plt.title("Growth in number of groups on meetup in the two countries",fontsize=20)
plt.ylabel("No. of Groups")
plt.xlabel("Time")
plt.savefig('graphs/groups_creation_country_wise-line.png',dpi=100, bbox_inches='tight')
This graph shows the yearly rise in the number of groups in the two countries as a line chart. The X-axis represents years and the Y-axis the number of groups formed.
country_time_series_df.plot.area(stacked=False)
plt.suptitle("Growth in number of groups on meetup in the two countries",fontsize=20)
plt.ylabel("No. of Groups")
plt.xlabel("Time")
plt.savefig('graphs/groups_creation_country_wise-area.png',dpi=100, bbox_inches='tight')
This graph shows the same yearly rise in the number of groups as an overlaid (unstacked) area chart for the two countries. The X-axis represents years and the Y-axis the number of groups formed.
Aim: To compare which month sees the highest growth in meetup groups.
Plots Used: Bar Charts
group_month_data = pd.DataFrame([])
group_month_data = grouped_Df[['date_created']]
group_month_data = group_month_data.assign(Month_created=pd.to_datetime(group_month_data.date_created).dt.strftime('%b'))
group_month_data.reset_index().drop(['index'],axis=1)
group_month_data = group_month_data.groupby(['Month_created'])[['Month_created']].agg('count')
group_month_data = group_month_data.rename(index=str, columns={"Month_created": "Count"}).reset_index()
# Using temp month number to sort rows
months = {datetime.datetime(2000,i,1).strftime("%b"): i for i in range(1, 13)}
group_month_data["month_number"] = group_month_data["Month_created"].map(months)
group_month_data = group_month_data.fillna(value=0).sort_values(by=['month_number']).set_index('Month_created')
group_month_data = group_month_data.drop(['month_number'], axis=1)
display(group_month_data[0:5])
p = group_month_data.plot.bar(figsize=(6,5),fontsize=14)
p.set_xlabel("Month",fontsize=14)
p.set_ylabel("Group Count",fontsize=14)
plt.suptitle("Growth in number of groups - monthly data",fontsize=20)
plt.savefig('graphs/groups_growth_monthly.png',dpi=100, bbox_inches='tight')
This graph shows the monthly rise in the number of groups as a bar chart. The X-axis represents the months from January to December and the Y-axis the count of groups formed.
Aim: To find out how members across different categories have grown since 2003.
Plots Used: Small Multiple Bar Charts
temp = grouped_Df
temp = temp.apply(pd.to_numeric, errors='ignore')
#display(temp [0:5])
members_df = temp.groupby(['category.name'])[['members']].agg('sum').sort_values(by=['members'], ascending=False).reset_index()
display(members_df[0:5])
#getting top 9 categories based on popularity
popular_categories = members_df[0:9]['category.name'].tolist()
cat_time_series = []
#creating array of dataframes for different categories
for p in range(len(popular_categories)):
cat_time_series.append(grouped_Df[grouped_Df['category.name'].str.lower() == popular_categories[p].lower()][['date_created']])
cat_time_series[p]['date_created'] =pd.to_datetime(cat_time_series[p].date_created).dt.year
cat_time_series[p].reset_index().drop(['index'],axis=1)
cat_time_series[p] = cat_time_series[p].groupby(['date_created'])[['date_created']].agg('count')
cat_time_series[p] = cat_time_series[p].rename(index=str, columns={"date_created": popular_categories[p]})
# Once the per-category series are ready, join them on the index (year)
# Years in which a category saw no new groups are assigned a default count of 0
cat_time_series_df = pd.concat(cat_time_series, axis=1).fillna(value=0)
display(cat_time_series_df[0:5])
ax1 = cat_time_series_df.plot(kind='bar', subplots=True, layout=(3,3), figsize=(12,10), sharey=True)
plt.suptitle("Growth in number of groups in meetup for Top Category",fontsize=20)
plt.savefig('graphs/groups_creation_category_wise.png',dpi=100, bbox_inches='tight')
These are small-multiple bar charts representing the number of groups formed per category. For representational purposes only the top 9 categories have been selected; more can be selected to visualize the data. The X-axis represents years and the Y-axis the count of groups formed.
Aim: To find the most popular groups. Popularity is based on the count of members who have joined a group.
Plots Used: Treemap Charts
# Custom Plot function to draw tree map
def get_treemap(sizes, norm_x=100, norm_y=100, label=None, value=None, ax=None, **kwargs):
if ax is None:
ax = plt.gca()
# create a color palette, mapped to these values
cmap = matplotlib.cm.Wistia
mini=min(sizes)
maxi=max(sizes)
norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
color = [cmap(norm(value)) for value in sizes]
# Gives Normalized values over area
normed = squarify.normalize_sizes(sizes, norm_x, norm_y)
# Gives rectangle coordinates from normalized values over the area
rects = squarify.squarify(normed, 0, 0, norm_x, norm_y)
x = [rect['x'] for rect in rects]
y = [rect['y'] for rect in rects]
dx = [rect['dx'] for rect in rects]
dy = [rect['dy'] for rect in rects]
ax.bar(x, dy, width=dx, bottom=y, color=color, edgecolor='white', linewidth=1.0,
label=label, align='edge', **kwargs)
#Show Values (Member Count) in the center
if value is not None:
va = 'center' if label is None else 'top'
#Iterate over values to add labels to axes
for v, r in zip(value, rects):
fz=r['dy']/2.5
x, y, dx, dy = r['x'], r['y'], r['dx'], r['dy']
ax.text(x + dx / 2, y + dy / 2, v, va=va, ha='center', fontsize=fz)
#Show label (category name) in the center
if label is not None:
va = 'center' if value is None else 'bottom'
#Iterate over label names to add labels to axes
for l, r in zip(label, rects):
fz=r['dy']/2.5
x, y, dx, dy = r['x'], r['y'], r['dx'], r['dy']
ax.text(x + dx / 2, y + dy / 2, l, va=va, ha='center',fontsize=fz)
ax.set_xlim(0, norm_x)
ax.set_ylim(0, norm_y)
return ax
members_list=members_df['members'].tolist()
label_list = members_df['category.name'].tolist()
fig1 = plt.figure(figsize=(12,10))
ax1 = fig1.add_subplot(1, 1, 1)
title = 'Tree Map for Popularity (Most joined category by members) in ' + ",".join(grouped_Df['country_name'].unique().tolist())
fig1.suptitle(title, fontsize=20)
get_treemap(sizes=members_list, label=label_list, value=members_list, alpha=0.9, ax=ax1)
#ax1.set_axis_off()
fig1.savefig('graphs/treeMap.png', dpi=100, bbox_inches='tight')
plt.show(fig1)
This tree map shows the popularity of the categories based on the number of members who joined them. The data has been normalized on a (100, 100) scale to get uniform boxes.
temp = grouped_Df
temp = temp.apply(pd.to_numeric, errors='ignore')
#display(temp [0:5])
members_country = temp.groupby(['category.name','country_name'])[['members']].agg('sum').sort_values(by=['members'], ascending=False).reset_index()
display(members_country[0:5])
countries = members_country['country_name'].unique().tolist()
fig2, ax = plt.subplots(2)
title = 'Tree Map for Popularity Comparison (Most joined category by members)'
fig2.suptitle(title, fontsize=20)
members_list = []
label_list = []
for i in range(len(countries)):
member = members_country[members_country['country_name'] == countries[i]]
members_list=member['members'].tolist()
label_list = member['category.name'].tolist()
get_treemap(sizes=members_list, label=label_list, value=members_list , alpha=0.9, ax=ax[i])
ax[i].set_title(countries[i], fontsize=20)
#ax[i].set_axis_off()
ax[i].figure.set_size_inches(19, 10)
fig2.savefig('graphs/treeMap-comapre.png', dpi=100, bbox_inches='tight')
plt.show(fig2)
This tree map shows the popularity of the categories based on the number of members who joined them, using subplots to compare the data for the two countries. The data has been normalized on a (100, 100) scale to get uniform boxes.
Aim: To compare the rating trends in the different countries by category.
Plots Used: Bar Plot and Stacked Bar Plot
#display(temp [0:5])
category_rating = temp.groupby(['category.name','country_name'])[['rating']].agg('mean').sort_values(by=['rating'], ascending=False).reset_index()
display(category_rating[0:5])
# Min-Max Normalization to create a new column normalized_rating
# By creating new column, one can easily compare values before and after normalization
category_rating['normalized_rating'] = (category_rating['rating']-category_rating['rating'].min())/(category_rating['rating'].max()-category_rating['rating'].min())
category_rating[0:5]
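Min-max normalization rescales a column to [0, 1] so ratings from different countries become directly comparable. The formula in isolation (`min_max` is an illustrative helper, not part of the notebook's API):

```python
import pandas as pd

# Min-max normalization as used above: rescale a column to [0, 1].
def min_max(s):
    return (s - s.min()) / (s.max() - s.min())

ratings = pd.Series([2.0, 3.0, 5.0])
normalized = min_max(ratings)
print(normalized.tolist())  # → [0.0, 0.3333333333333333, 1.0]
```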
countries = category_rating['country_name'].unique().tolist()
# Modify the structure of data and make different columns for the different countries
df_arr = []
for i in range(len(countries)):
df_arr.append(category_rating[category_rating['country_name']==countries[i]])
df_arr[i] = df_arr[i].drop(['rating','country_name'], axis=1).set_index('category.name').rename(index=str, columns={"normalized_rating": countries[i]})
cat_rating_df = pd.concat(df_arr, axis=1, join="inner")
display(cat_rating_df[0:5])
axes = cat_rating_df.plot.bar(figsize=(15,5))
axes.set_xlabel("Categories",fontsize=14)
axes.set_ylabel("Rating",fontsize=14)
axes.set_title("Ratings vs Categories Comparison for two countries",fontsize=20)
plt.savefig('graphs/category_ratings.png',dpi=100, bbox_inches='tight')
This graph presents a rating-vs-categories bar plot comparing the two countries. The ratings have been normalized using min-max normalization and then plotted against the categories to look for patterns. The X-axis represents categories and the Y-axis the normalized ratings.
axes = cat_rating_df.plot.bar(stacked=True,figsize=(15,5))
axes.set_xlabel("Categories",fontsize=14)
axes.set_ylabel("Rating",fontsize=14)
axes.set_title("Ratings vs Categories Comparison for two countries",fontsize=20)
plt.savefig('graphs/category_ratings_stacked.png',dpi=100, bbox_inches='tight')
This graph presents a rating-vs-categories stacked bar plot comparing the two countries. The ratings have been normalized using min-max normalization and then plotted against the categories to look for patterns. The X-axis represents categories and the Y-axis the normalized ratings.
Aim: To find the relation between ratings, members per group, and group counts.
Plots Used: Scatter Plot, Scatter Matrix(Small Multiples), Line Chart(Small Multiples), Dual Axis Chart
category_rating_dual = temp.groupby(['category.name'])[['rating']].agg('mean').reset_index().set_index('category.name').rename(index=str, columns={"rating": "mean_rating"})
groups_df = temp.groupby(['category.name'])[['id']].agg('count').reset_index().set_index('category.name').rename(index=str, columns={"id": "group_count"})
members_df_copy = members_df.set_index('category.name')
cat_data_unnormalized = pd.concat([members_df_copy, category_rating_dual,groups_df], axis=1)
display(cat_data_unnormalized[0:5])
# Normalization of data using min-max normalization
cat_data = pd.DataFrame(cat_data_unnormalized.index.tolist()).set_index(0).rename_axis('Categories')
cat_data['mean_rating']=(cat_data_unnormalized['mean_rating']-cat_data_unnormalized['mean_rating'].min())/(cat_data_unnormalized['mean_rating'].max()-cat_data_unnormalized['mean_rating'].min())
cat_data['members']=(cat_data_unnormalized['members']-cat_data_unnormalized['members'].min())/(cat_data_unnormalized['members'].max()-cat_data_unnormalized['members'].min())
cat_data['group_count']=(cat_data_unnormalized['group_count']-cat_data_unnormalized['group_count'].min())/(cat_data_unnormalized['group_count'].max()-cat_data_unnormalized['group_count'].min())
display(cat_data[0:5])
ax = cat_data.plot.scatter(x="members", y="group_count", c='mean_rating', cmap = matplotlib.cm.jet, alpha=0.4 , s=250)
ax.grid(True,linestyle='-',color='0.75')
ax.text(0.75, -0.1, 'Number of Members',
verticalalignment='bottom', horizontalalignment='right',
transform=ax.transAxes)
plt.suptitle("Categories Data Scatter Plot",fontsize=14)
plt.savefig('graphs/scatter_plot_categories.png',dpi=100, bbox_inches='tight')
This graph presents members vs. group count as a scatter plot, with the mean rating mapped to a color scale. The X-axis represents the number of members and the Y-axis the group count; the point color encodes the mean rating.
sns.pairplot(cat_data)
ax = plt.gca()
plt.suptitle("Small Multiples for category dimensions via Scatter Plot",fontsize=14, y=1.08)
plt.savefig('graphs/scatter_categories_small_multiple.png',dpi=100, bbox_inches='tight')
This graph represents the scatter-matrix plot for mean rating, number of members, and group count. The data has already been min-max normalized. The two halves of the matrix are mirror images of each other.
axes = cat_data.plot(rot=90,subplots=True, figsize=(6, 6),x_compat=True);
plt.xticks(range(len(cat_data.index.tolist())), cat_data.index.tolist(), size='small')
plt.suptitle("Small Multiples for category dimensions via Line Chart",fontsize=14)
plt.savefig('graphs/line_categories_small_multiple.png',dpi=100, bbox_inches='tight')
This graph represents small multiples for mean rating, number of members, and group count using line charts. The data has already been min-max normalized. The X-axis represents categories and the Y-axis the normalized value.
ax = cat_data.plot(rot=90,secondary_y=['group_count'])
ax.set_ylabel("Members Count")
ax.right_ax.set_ylabel("Group Count")
plt.xticks(range(len(cat_data.index.tolist())), cat_data.index.tolist(), size='small')
plt.title("Dual Axis Line Chart for different categories\n")
plt.xlabel("Categories")
plt.savefig('graphs/line_categories_dual_axes.png',dpi=100, bbox_inches='tight')
Aim: To find out which city has the most groups.
Plots Used: Bar Chart (frequency distribution)
from collections import Counter
city_data = temp[['city','country_name']]
countries = city_data['country_name'].unique().tolist()
for i in range(len(countries)):
a = city_data.loc[city_data['country_name'] == countries[i]]['city'].tolist()
letter_counts = Counter(a)
df = pd.DataFrame.from_dict(letter_counts, orient='index')
df.columns = ['count']
df['count']=(df['count']-df['count'].min())/(df['count'].max()-df['count'].min())
threshold = 0.005
# Remove rows less than the threshold
df = df[df['count'] > threshold].sort_values(by=['count'])
df.plot(kind='bar')
plt.suptitle("Frequency Distribution of groups among top cities of "+countries[i],fontsize=20)
plt.ylabel("Normalized Frequency")
plt.xlabel("Cities")
plt.savefig("graphs/groups_distribution_city_wise"+countries[i]+".png",dpi=100, bbox_inches='tight')
These graphs represent the most active cities, i.e. those with the maximum group counts. A threshold of 0.005 on the normalized values was used to remove cities that are less active or have fewer groups. The X-axis represents the cities and the Y-axis the normalized frequency of groups in these cities.
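The per-city tally above is driven by `collections.Counter`; in isolation:

```python
from collections import Counter

# Counter counts how many groups each city contributes; low-frequency
# cities are later dropped by the normalized threshold.
cities = ["zurich", "zurich", "geneva", "zurich", "basel"]
counts = Counter(cities)
print(counts.most_common(1))  # → [('zurich', 3)]
```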
Several conclusions can be drawn from the graphs above:
Plot 1: Comparison of growth of groups for two countries
It can be clearly seen from the graphs that both innovative countries show growth in group count over the years. Notably, Switzerland is ranked first in the Global Innovation Index and Sweden second; the group count is higher in Switzerland than in Sweden, and the gap is increasing. Considering that 2018 is not yet complete, the trend suggests the number of groups will increase further in 2018.
Plot 2: Growth in number of groups - monthly data
There is no clear trend in the growth of groups across the different months. However, the starting months January, February, and March saw the major growth.
Plot 3: Growth in number of groups category-wise
It is clear that the innovative countries are focusing more on the technology field and are joining it in ever greater numbers. People in these countries are also focused on career/business; although this field is behind technology, the gap in member numbers between career/business and the remaining categories is significant. Apart from those, health/well-being saw good growth in 2017.
Plot 4: Categories Popularity in top Global Innovative Countries
Like the previous graphs, we are now comparing categories based on member count:
- The top joined categories are: tech, socializing, career/business and outdoor adventures
- Also, comparing the previous graph with this one, we find that even though the socializing category does not have many groups, it still has a lot of members
Comparison between Switzerland and Sweden:
- Sweden shows a bias towards tech. Tech is also the most-joined category in Switzerland, but its people join other groups as well, and the gap is smaller.
This suggests that a country focusing on all fields is more innovative than one biased towards a single field.
Plot 5: Compare ratings of different categories for the selected countries
This data shows no clear trend in the ratings provided by users. The low member counts could be the reason for the irregular trends.
However, one point to note is that the categories with the highest member counts in the two countries, such as tech, socializing, career/business, and outdoor adventures, share similar or only slightly different ratings.
Plot 6: Members vs Rating vs No. of Groups to Rating for Different Categories
As in the previous graphs, these plots show no clear picture for ratings, but one trend is strong: as the number of groups increases, the member count also increases, as seen in the dual-axis chart for members and group count and in the line-chart small multiples.
Plot 7: Frequency Distribution for groups in different cities of one country
The plot represents the frequency of groups in the different cities of each country; a threshold of 0.005 was used to select the cities. It shows which cities are most active on Meetup: Zurich and Stockholm. Stockholm being the capital of Sweden might explain its popularity, but Zurich, although not the capital of Switzerland, is still very popular.
Any country that wants to improve its rank in the Global Innovation Index can take insights from this experiment and encourage more groups, which leads to innovation.
The higher the number of groups, the more members there are.
The higher the number of groups, the higher the Global Innovation Index.
More statistical modelling can be applied to this data to establish that there is a strong relation between the number of groups and members, and that countries with a high Global Innovation Index are more involved in meetups. Currently, only two countries were selected.
Only limited data was collected, since resources and time were limited. More data could be gathered for multiple countries around the world, and statistical modelling performed to compare them.
Member details and events around the world are high-volume data and could not be collected here due to memory constraints. Since the data includes coordinates (latitude, longitude), the success of a group could be studied in more detail, which companies could use when sharing their innovations; before organizing an event, they could analyze its likely success based on the numbers.
Meetup provides more set of open APIs.
For example:
Member details and how many groups each member has joined: this can be used to track member activity and trends and to give members tailored recommendations. Countries can also promote categories among members if they are lagging behind in a category that is popular in innovative countries.
It also provides venue data. The number of members attending an event may depend on the venue, so this data can be analyzed to select the venues with the highest turnout ratios.