Meetup Data Analysis¶

Problem Statement¶

Meetup Data Analysis for Countries trending in Innovation

Pre-requisites¶

Meetup API key
External Libraries Required: pycountry
Internet Connection

1. Task - Choosing one or more suitable web API¶

1.1. API Used:¶

Wikipedia API has been used to find out the countries list trending on Innovation.
Meetup API is used using an API key to analyze and compare statistics of these countries. The API key is listed below.
Analyzing Meetups Data Trends for countries interested in innovation in the last one year.

api_key = "29191b3c116b165929763e304f2897e"

1.2. Constraints:¶

The use of Wikipedia API is limited and used for finding out the countries listed as most innovative in 2017.
The data from Wikipedia is subject to change and also the format may change with a change in revision.
This project involves Well-Supported API for wikipedia, see details below
These API's data keeps on changing and for analysis purposes, data for below period is used:

Wikipedia Data:
```
  Revision Date: 06-01-2018
  Page Title: Global_Innovation_Index
```
Meetup Data:
```
  Recent Data   
```

1.3. Pre Installations:¶

pip install pycountry
pip install meetup-api
pip install squarify

import pandas as pd
from pandas.io.json import json_normalize
import urllib.request
import json
import time
import os

2. Task - Collect data from One or more choosen Web API(s)¶

Import Librabies for fetching data from Rest API¶

# this is a general class which can be used by all the other classes for fetching data

class api_request:
    def request_data(self,url):
        '''
        Parameters: url
        return: data fetched from api in json format
        ----------
        To request data from a URL though HTTP request and parse it into json format
        '''
        try:
            response = urllib.request.urlopen(url)       #Hit URL: url
            raw_json = response.read().decode()          #Decode response
            data = json.loads(raw_json)                  
            return data                                  #return data
        
        except Exception as h:
            print("Cannot find data at given URL\nError:"+str(h))      # Handle exception and return noting if 
            return ""                                                  # data is not found at given URL

2.1. Data Fetch from Wikipedia Page - Global Innovation Index ¶

(Custom Class Used Later)¶

class Wiki(api_request):
    def __init__(self):
        self.__apiurl = "https://en.wikipedia.org/w/api.php"     # API Url 
        self.__datatype = "json"                                 # default format to Fetch data: json
        
    
    def get_wiki_page(self,page_title="Global_Innovation_Index"):
        '''
        Parameters: page_title (page title of wikipedia article)
        return: json response
        ----------
        To get the data from WIKI API using the parameters: pagetitle and format
        
        '''
        url_params = "?action=parse&prop=text&page="+page_title+"&format="+self.__datatype 
        url= self.__apiurl +  url_params
        return self.request_data(url)

2.2. Data Fetch from Meet up - Various Content¶

(Custom Class Used Later)¶

# meetup-api is wrapper on top of the meetup Http api and it uses an api key to establish connection and get response

class Meetup(api_request):
    def __init__(self):       
        self.__apiurl = "https://api.meetup.com"
        self.__apikey = "29191b3c116b165929763e304f2897e"                      #API key for meetup used to get the data
        self.__client = meetup.api.Client(self.__apikey,overlimit_wait=True)   #get client from API key to execute API
        
    def refresh_connection(self):
        '''refresh client if expired, this was required to make sure 
           data is fetched consistently and the code does not stop in case of any exception from api
        '''
        self.__client = meetup.api.Client(self.__apikey,overlimit_wait=True)
    
    def get_categories(self):
        ''' 
        return: dataframe
        --------
        Method to get all the meetup categories
        '''
        a = self.__client.GetCategories()                             
        categories_df = pd.read_json(json.dumps(a.results))
        return categories_df                                          
    
    def get_cities(self, code):
        '''
        Parameters: country code
        return: dataframe
        ---------
        Method to get cities for the country using parameter: country alpha_2 code
        '''
        cities = self.__client.GetCities(country=code)             
        cities_df = pd.read_json(json.dumps(cities.results))
        return cities_df
    
    def get_groups(self, cat_id, lat, long, offset_val=0, retrycount=0):
        ''' 
        Parameters: category id, latitude, longitude, offset value and retry count(max allowed 2)
        return: normalized json
        --------
        Method to get all groups of the city
        '''
        try:
            if retrycount==2:
                print("Max retries reached")            # Check if maximum retries reached, return nothing in that case                
                return ""
            retrycount = retrycount + 1            
            group = self.__client.GetGroups(category_id = cat_id, lat=lat, lon= long, offset = offset_val)
            return json_normalize(group.results)
        except:
            print("Trying to re-establish connection")
            self.refresh_connection()
            get_groups(self, cat_id, lat, long, offset_val, retrycount)

3. Task - Parse data, and store it in an appropriate file format¶

3.1. Parse collected data ¶

(Custom Class Used Later)¶

from bs4 import BeautifulSoup
import pycountry

class ParseData:
    
    def __init__(self):
        '''Initialize {country: country code} disctionary using pycountry module
        '''
        self.__countries_dict={country.name.lower():country.alpha_2.lower() for country in pycountry.countries }
    
    def wiki_content(self, html):    
        ''' 
        Parameters: html content
        returns: dataframe
        ---------
        Method to parse wikipedia  html content to get top countries which are innovative 
        '''
        parsed_html = BeautifulSoup(html,"lxml")
        table = parsed_html.body.find_all('table', attrs={'class':'wikitable'})[0]

        list=[]
        list.extend(table.select('td a'))
        countries_wiki=[]
        
        for l in list:
            countries_wiki.append(l.text.lower())

        df = pd.DataFrame([countries_wiki],index=['geoName']).T
        df =df.set_index('geoName')    
        
        # After getting the list of countries, 
        # creating a dataframe by joining the country list and their aplha 2 code dictionary
        
        country_df = df.join(pd.DataFrame(self.__countries_dict,index=['code']).rename_axis('geoName').T , how='inner')
        return country_df

    def get_countries_dict(self):
        '''
        returns {country code: country} disctionary using pycountry module
        '''
        return {country.alpha_2.upper():country.name for country in pycountry.countries }

3.2. Data Store¶

(Custom Class Used Later)¶

class DataStore:
    
    def __init__(self):
        self.__datapath = os.getcwd() + "\data"
        self.__pathObj = {
            'wiki' : self.__datapath + "\wiki",
            'meetup': self.__datapath + "\meetup"
        }
        #Create required directories for meetup data and wikipedia data 
        for k,v in self.__pathObj.items():
            os.makedirs(v, exist_ok=True)
    
    def df_to_csv(self, df, data_dir, filename, index=True):
        ''' 
        Parameters: data frame, directory name, file name and if index column is required or not
        -----------
        Method to write dataframe to csv,
        exception handled if file locked by some other operation or could not write to file
        '''
        try:
            print("Write to file:\n"+self.__pathObj[data_dir]+'\\'+filename)
            if(index):
                df.to_csv(self.__pathObj[data_dir]+'/'+filename, sep=',', encoding='utf-8')
            else:
                df.to_csv(self.__pathObj[data_dir]+'/'+filename, sep=',', encoding='utf-8',index=False)
        except PermissionError:
            print("\nCannot write to file\n\nPlease close the file, if file is open")
        except Exception as h:
            print("Cannot write to file \nError:"+str(h))
            
    def create_directory(self,data_dir, dirname):
        ''' 
        Parameters: base directory(data_dir), directory name
        -----------
        Method to create folder
        '''
        os.makedirs(self.__pathObj[data_dir]+'/'+dirname, exist_ok=True)

    def read_from_csv(self,data_dir, filename):
        ''' 
        Parameters: base directory(data_dir), file name
        return: dataframe
        -----------
        Method to read csv into dataframe
        '''
        df = pd.read_csv(self.__pathObj[data_dir]+'/'+filename)
        return df

3.3. Using Objects to fetch, parse and store data¶

3.3.1. Initializations and Object Creation¶

from IPython.display import display #For displaying beautiful dataframes, supported by IPython
parse = ParseData()    
sd = DataStore()

3.3.2. Fetch, Parse and Store first five countries with highest global innovation index¶

Using Custom Class "Wiki" object "wiki"¶

#Fetch Wikipedia content
wiki = Wiki();
data = wiki.get_wiki_page()

Using object to parse and store wikipedia data¶

Important: The two countries selected from fetched wikipedia data are the top ranked in global innovation will be used further¶

# the HTML content retreived from wikipedia as string and write them to a file countries.csv in wiki folder
if(data!=""):
    html = data['parse']['text']['*']

    # Parse and Store countries list in file
    country_df = parse.wiki_content(html)
    sd.df_to_csv(country_df, 'wiki', 'countries.csv')
    
    print("\n\nSample Innovative Countries Dataframe:")
    selected_countries = country_df[0:2]                    #Only top 2 countries have been chosen
    display(selected_countries)
else:
    selected_countries = pd.DataFrame()
    
#Sample output of the dataframe has been shown below:

Write to file:
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API\data\wiki\countries.csv


Sample Innovative Countries Dataframe:

3.3.3. Fetch, Parse and Store Meetup Data¶

import meetup.api
meetupObj = Meetup();

Cities of fetched countries

The countries fetched from wikipedia have multiple countries where Meetup groups exist. To get further data about groups and categories cities are required. Latitude and longiude of cities is used later to get groups data.

#Get country codes for the selected countries, if no country is found, use ireland by default
if not selected_countries.empty:
    codes = selected_countries['code'].tolist()
else:
    codes = ['ie']

codes

['ch', 'se']

delay_time = 1.0

#to iteratively get cities for all the selected countries

for code in codes:
    print("Fetching cities for "+code+". . . Please wait . . .")
    cities_df = meetupObj.get_cities(code)
    
    time.sleep(delay_time)
    sd.create_directory('meetup','cities')
    sd.df_to_csv(cities_df, 'meetup', 'cities/'+code+'.csv', index=False)
    display(cities_df[:3])
    print("Total Records Fetched for "+ code +": "+str(len(cities_df.index)) )

Fetching cities for ch. . . Please wait . . .
28/30 (2 seconds remaining)
Write to file:
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API\data\meetup\cities/ch.csv

Total Records Fetched for ch: 200
Fetching cities for se. . . Please wait . . .
27/30 (0 seconds remaining)
Write to file:
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API\data\meetup\cities/se.csv

Total Records Fetched for se: 107

Meetup Categories

To further request groups data, categories id are also required.

#Fetch Meetup Categories

df = meetupObj.get_categories().set_index('id')

sd.df_to_csv(df, 'meetup', 'categories.csv')
print("\n\nSample Categories:")
display(df[0:5])

29/30 (10 seconds remaining)
Write to file:
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API\data\meetup\categories.csv


Sample Categories:

cat_id = df.index.tolist()
cat_id[0:5]

[1, 18, 2, 3, 4]

Meetup Groups based on Category ids

Step 1: Fetch Once we have city data (lat and long), we can now fetch groups data to get more granular level of information. It includes the fields that are required for comparitive analysis of top two innovative countries.

The groups information will be stored in local folers in the form of Multiple CSVs. Considering the large size of data, the files are written by category id which will be unique for one country.

# Getting group data for each city based on lat, long and category 

for code in codes:
    print(code)
    df = sd.read_from_csv('meetup','/cities/'+code+'.csv')
    df1 = df[['lat','lon']]
    
    for i in cat_id:
        print(i)          
        groups = pd.DataFrame([])
        
        #Iterate over cities lat lon
        for city_itr in range(len(df1)):
            lat = df1.loc[city_itr].lat
            lon = df1.loc[city_itr].lon
            #run again untill we have results
            for j in range(10000):    
                result = meetupObj.get_groups(i,lat,lon,j)
                if len(result) != 0:
                    groups = groups.append(result)
                else:
                    break
        # Get data in dataframe from multiple CSVs             
        sd.create_directory('meetup','groups')
        sd.create_directory('meetup','groups/'+code)
        sd.df_to_csv(groups, 'meetup', 'groups/'+code+'/'+str(i) +'.csv', index=False)

Kindly Note: The data from this API returns redundant data even when offset was used and it takes a lot of time since it is an iterative process and need to fetch records for every possible category in every possible city.¶

Also, only sample has been shown below and a sample file is attached with the project.¶

Sample files are present in 'data' folder and the files inside the './data/group' folder does not contain all the groups since it was a huge data but all the files shared same structure.

Please contact me for more data

Step 2: Select, combine and create a master dataframe from group data extracted for different countries. The API data must be stored locally to enable future processing. The created dataframe is still unclean and we will select only a limited part of the data.

print("executing")
cwd = os.getcwd()

group_df_arr = pd.DataFrame([])
frames = []
for code in codes:
    data_list = []
    for i in cat_id:
        try:
            df = pd.read_csv(cwd+'/data/meetup/groups/'+code+'/'+str(i)+'.csv', index_col = False, header=0, dtype='unicode')
            df.drop_duplicates(subset=None, inplace=True)
            data = df
        except Exception as h:
            print("Exception: "+str(h))
            print(cwd+'/data/meetup/groups/'+code+'/'+str(i)+'.csv');
            pass
        data_list.append(data)
    group_df_arr = pd.concat(data_list)
    print(len(group_df_arr))
    frames.append(group_df_arr)
    
result = pd.concat(frames)

display(result[0:2])

executing
Exception: No columns to parse from file
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API/data/meetup/groups/ch/24.csv
3413
2070

Raw data has been collected in the 'result' dataframe

4. Task - Load and represent your data as a Pandas DataFrame. Apply any pre-processing and quality checking steps that may be required to clean and filter the data before analysis¶

4.1. Load data into dataframe, filter data and select only required columns¶

def parse_group_data(result):
    '''
    Parameters: dataframe with raw data (result)
    --------
    result: cleaned dataframe which will be used for analysis
    '''
     
    grouped_Df = result.reset_index(drop=True)   
    
    # Since in the default border radius of one city's latitude and longitude, it has data from different countries also.
    # The data which is relevant will only be choosen. So a filter is applied on fetched country codes 
    # and then again more filters are applied to get only required columns
    grouped_Df = grouped_Df[grouped_Df['country'].str.lower().isin([x.lower() for x in codes])][['id','city','country','created','category.id','category.name','category.shortname','rating','members']]
    
    #change the date format from timestamp to date string
    grouped_Df = grouped_Df.assign(
                        date_created=pd.Series(
                            grouped_Df['created']
                            .apply(lambda x: datetime.datetime.fromtimestamp(int(x) / 1e3))
                            .dt.strftime('%Y-%m-%d'))
                        .values)
    countries_dict = parse.get_countries_dict()
    grouped_Df = grouped_Df.assign(country_name=pd.Series(grouped_Df['country'].apply(lambda x: countries_dict[x])).values)
    grouped_Df.drop(['created'], axis=1, inplace=True)
    return grouped_Df

import datetime
grouped_Df = parse_group_data(result)
grouped_Df[0:5]

#Sample output for parsed data

4.2. Missing Data¶

Finally as a part of Pre-Processing, check if further pre-processing is needed for missing data

#look for missing data
grouped_Df.isnull().sum() # no missing values in the reduced dataset

id                    0
city                  0
country               0
category.id           0
category.name         0
category.shortname    0
rating                0
members               0
date_created          0
country_name          0
dtype: int64

grouped_Df.dtypes.value_counts()

object    10
dtype: int64

grouped_Df.isnull().values.any()

False

There are no Null's in the data collected from raw dataframe and no values exist such as "NaN" or blank values in any column. This indicates the data is clean now and all values are pro-processed now.

5. Task - Analyse and summarise the cleaned dataset, using tables and plots where appropriate. Based on your results, what interpretations or insights can be made about the dataset? What further analysis might be done on the data? Detail this using Markdown cells in your notebook.¶

5.1 Analyse and Summarise the cleanest dataset using tables and plots¶

Descriptive Statistics¶

#grouped_Df[0:5]

print("Descriptive Stats:\n")
display(grouped_Df.describe())

Descriptive Stats:

Plots using Matplotlib and Pandas Graph¶

Initializations¶

import matplotlib
import matplotlib.pyplot as plt
import squarify                 # Squarify is used to get the rectangle coordinates for tree map
import seaborn as sns
%matplotlib inline

temp = grouped_Df               # A temporary copy is created to perform further actions using the copy of dataframe    
temp = temp.apply(pd.to_numeric, errors='ignore')

Plot 1: Comparison of growth of groups for two countries¶

Aim: The aim here is to compare, how different meetup groups have grown in the countries over the years.

Plots Used: Line Chart and Area Chart

country_time_series = []
countries_dict = parse.get_countries_dict()

for c in range(len(codes)):
    country_time_series.append(grouped_Df[grouped_Df['country'] == codes[c].upper()][['date_created']])
    country_time_series[c]['date_created'] =pd.to_datetime(country_time_series[c].date_created).dt.year
    country_time_series[c].reset_index().drop(['index'],axis=1)
    country_time_series[c] = country_time_series[c].groupby(['date_created'])[['date_created']].agg('count')
    country_time_series[c] = country_time_series[c].rename(index=str, columns={"date_created": countries_dict[codes[c].upper()]})

country_time_series_df = pd.concat(country_time_series, axis=1).fillna(value=0)
display(country_time_series_df[0:5])

Plot 1(a): Growth of groups comparison country-wise¶

Line Chart¶

country_time_series_df.plot()
plt.title("Growth in number of groups on meetup in the two countries",fontsize=20)
plt.ylabel("No. of Groups")
plt.xlabel("Time")
plt.savefig('graphs/groups_creation_country_wise-line.png',dpi=100, bbox_inches='tight')

This graph is showing the rise in the number of groups on yearly basis using line chart plot in the two countries. X-axis represents years and Y axis represents the number of groups formed.

Plot 1(b): Growth of groups comparison country-wise¶

Area Chart¶

country_time_series_df.plot.area(stacked=False)
plt.suptitle("Growth in number of groups on meetup in the two countries",fontsize=20)
plt.ylabel("No. of Groups")
plt.xlabel("Time")
plt.savefig('graphs/groups_creation_country_wise-area.png',dpi=100, bbox_inches='tight')

This graph is showing the rise in the number of groups on yearly basis using stacked area chart for the two countries. X-axis represents years and Y axis represents the number of groups formed.

Plot 2: Growth in number of groups - monthly data¶

Aim: The aim here is to compare, which month sees highest growth in meetup groups.

Plots Used: Bar Charts

group_month_data = pd.DataFrame([])
group_month_data = grouped_Df[['date_created']]
group_month_data = grouped_Df.assign(Month_created=pd.to_datetime(group_month_data.date_created).dt.strftime('%b'))

group_month_data.reset_index().drop(['index'],axis=1)
group_month_data = group_month_data.groupby(['Month_created'])[['Month_created']].agg('count')
group_month_data = group_month_data.rename(index=str, columns={"Month_created": "Count"}).reset_index()


# Using temp month number to sort rows
months = {datetime.datetime(2000,i,1).strftime("%b"): i for i in range(1, 13)}
group_month_data["month_number"] = group_month_data["Month_created"].map(months)
group_month_data = group_month_data.fillna(value=0).sort_values(by=['month_number']).set_index('Month_created')

group_month_data = group_month_data.drop(['month_number'], axis=1)

display(group_month_data[0:5])

p = group_month_data.plot.bar(figsize=(6,5),fontsize=14)
p.set_xlabel("Month",fontsize=14)
p.set_ylabel("Group Count",fontsize=14)

plt.suptitle("Growth in number of groups - monthly data",fontsize=20)
plt.savefig('graphs/groups_growth_monthly.png',dpi=100, bbox_inches='tight')

This graph is showing the rise in the number of groups on monthly basis using bar chart plot. X-axis represents months from january to december and Y axis represents the count of number of groups formed.

Plot 3: Growth in number of groups category-wise¶

Aim: To find out how the member accross different categories have grown since 2003

Plots Used: Small Multiple Bar Charts

temp = grouped_Df
temp = temp.apply(pd.to_numeric, errors='ignore')

#display(temp [0:5])
members_df = temp.groupby(['category.name'])[['members']].agg('sum').sort_values(by=['members'], ascending=False).reset_index()

display(members_df[0:5])

#getting top 9 categories based on popularity

popular_categories = members_df[0:9]['category.name'].tolist()

cat_time_series = []

#creating array of dataframes for different categories
for p in range(len(popular_categories)):
    cat_time_series.append(grouped_Df[grouped_Df['category.name'].str.lower() == popular_categories[p]][['date_created']])
    cat_time_series[p]['date_created'] =pd.to_datetime(cat_time_series[p].date_created).dt.year
    cat_time_series[p].reset_index().drop(['index'],axis=1)
    cat_time_series[p] = cat_time_series[p].groupby(['date_created'])[['date_created']].agg('count')
    cat_time_series[p] = cat_time_series[p].rename(index=str, columns={"date_created": popular_categories[p]})

    
# Once categories are ready join the on the basis of index which is category name
# The groups which were existent in one country but not present in the other have been assigned a default count of 0 
cat_time_series_df = pd.concat(cat_time_series, axis=1).fillna(value=0)
display(cat_time_series_df[0:5])

ax1 = cat_time_series_df.plot(kind='bar', subplots=True, layout=(3,3), figsize=(12,10), sharey=True)
plt.suptitle("Growth in number of groups in meetup for Top Category",fontsize=20)
plt.savefig('graphs/groups_creation_category_wise.png',dpi=100, bbox_inches='tight')

These are small multiples bar chart to represent the number of groups formed category wise. For representational purposes only top 9 countries have been selected and more can be selected to visualize data. X-axis represents years and Y -axis represents count of groups formed.

Plot 4: Categories Popularity in top Global Innovative Countries¶

Aim: The aim is to find out the most popular groups. Popularity is based on the count of members that have joined a group

Plots Used: Treemap Charts

# Custom Plot function to draw tree map
def get_treemap(sizes, norm_x=100, norm_y=100, label=None, value=None, ax=None,  **kwargs):
    if ax is None:
        ax = plt.gca()

    # create a color palette, mapped to these values
    cmap = matplotlib.cm.Wistia
    mini=min(sizes)
    maxi=max(sizes)
    norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
    color = [cmap(norm(value)) for value in sizes]

    
    # Gives Normalized values over area 
    normed = squarify.normalize_sizes(sizes, norm_x, norm_y)
    
    # Gives rectange coordinates from Normalized values over area 
    rects = squarify.squarify(normed, 0, 0, norm_x, norm_y)
    
    x = [rect['x'] for rect in rects]
    y = [rect['y'] for rect in rects]
    dx = [rect['dx'] for rect in rects]
    dy = [rect['dy'] for rect in rects]

    ax.bar(x, dy, width=dx, bottom=y, color=color, edgecolor='white', linewidth=1.0,
       label=label, align='edge', **kwargs)

    #Show Values (Member Count) in the center
    if not value is None:
        va = 'center' if label is None else 'top'
        
        #Iterate over values to add labels to axes
        for v, r in zip(value, rects):
            fz=r['dy']/2.5
            x, y, dx, dy = r['x'], r['y'], r['dx'], r['dy']
            ax.text(x + dx / 2, y + dy / 2, v, va=va, ha='center', fontsize=fz)
            
    #Show label (category name) in the center        
    if not label is None:
        va = 'center' if value is None else 'bottom'
        
        #Iterate over label names to add labels to axes
        for l, r in zip(label, rects):
            fz=r['dy']/2.5
            x, y, dx, dy = r['x'], r['y'], r['dx'], r['dy']
            ax.text(x + dx / 2, y + dy / 2, l, va=va, ha='center',fontsize=fz)
    
    ax.set_xlim(0, norm_x)
    ax.set_ylim(0, norm_y)
    return ax

Plot 4(a): Comparison of Categories Popularity cummulatively for Top 2 Global Innovative Countries¶

members_list=members_df['members'].tolist()
label_list = members_df['category.name'].tolist()

fig1 = plt.figure(figsize=(12,10))
ax1 = fig1.add_subplot(1, 1, 1)
title = 'Tree Map for Popularity (Most joined category by members) in ' + ",".join(grouped_Df['country_name'].unique().tolist())
fig1.suptitle(title, fontsize=20)
get_treemap(sizes=members_list, label=label_list, value=members_list, alpha=0.9, ax=ax1)
#ax1.set_axis_off()
fig1.savefig('graphs/treeMap.png', dpi=100, bbox_inches='tight')
plt.show(fig1)

This tree map shows the popularity of the categories based on the number of members who joined it. The data has been normalized on the scale of (100,100) to get the uniform boxes.

Plot 4(b): Comparison of Top 2 Global Innovative Countries Categories Popularity¶

temp = grouped_Df

temp = temp.apply(pd.to_numeric, errors='ignore')

#display(temp [0:5])
members_country = temp.groupby(['category.name','country_name'])[['members']].agg('sum').sort_values(by=['members'], ascending=False).reset_index()

display(members_country[0:5])

countries = members_country['country_name'].unique().tolist()

fig2, ax = plt.subplots(2)

title = 'Tree Map for Popularity Comparison (Most joined category by members)'

fig2.suptitle(title, fontsize=20)
member = pd.DataFrame([])    
member = []
members_list = []
label_list=[]

for i in range(len(countries)):    
    member = members_country[members_country['country_name'] == countries[i]]
    members_list=member['members'].tolist()
    label_list = member['category.name'].tolist()
    
    get_treemap(sizes=members_list, label=label_list, value=members_list , alpha=0.9, ax=ax[i])
    ax[i].set_title(countries[i], fontsize=20)
    #ax[i].set_axis_off()
    ax[i].figure.set_size_inches(19, 10)

fig2.savefig('graphs/treeMap-comapre.png', dpi=100, bbox_inches='tight')
plt.show(fig2)

This tree map shows the popularity of the categories based on the number of members who joined it but using subplots to comapre data for two countries. The data has been normalized on the scale of (100,100) to get the uniform boxes.

Plot 5: Compare ratings of different categories for the selected countries¶

Aim: The idea is to compare what are the trends for ratings in the differnt countries based on categories

Plots Used: Bar Plot and Stacked Bar Plot

#display(temp [0:5])
category_rating = temp.groupby(['category.name','country_name'])[['rating']].agg('mean').sort_values(by=['rating'], ascending=False).reset_index()

display(category_rating[0:5])

# Min-Max Normalization to create a new column normalized_rating
# By creating new column, one can easily compare values before and after normalization
category_rating['normalized_rating']=(category_rating['rating']-category_rating['rating'].min())/(category_rating['rating'].max()-category_rating['rating'].min())
category_rating[0:5]

countries = category_rating['country_name'].unique().tolist()

# Modify the structure of data and make different columns for the different countries
df_arr = []
for i in range(len(countries)):
    df_arr.append(category_rating[category_rating['country_name']==countries[i]])
    df_arr[i] = df_arr[i].drop(['rating','country_name'], axis=1).set_index('category.name').rename(index=str, columns={"normalized_rating": countries[i]})
    
cat_rating_df = pd.concat(df_arr, axis=1, join="inner")
display(cat_rating_df[0:5])

Plot 5(a): Bar Plot¶

axes = cat_rating_df.plot.bar(figsize=(15,5))
axes.set_xlabel("Categories",fontsize=14)
axes.set_ylabel("Rating",fontsize=14)
axes.set_title("Ratings vs Categories Comaparison for two countries",fontsize=20)
plt.savefig('graphs/category_ratings.png',dpi=100, bbox_inches='tight')

This graph represnts the rating vs categories bar plot for comparing two countries. The rartings have been normalized using min-max normalization. Once this has been done, the ratings are now plotted against the categories to find out patterns. X-axis represents categories and Y -axis represents normalized ratings.

Plot 5(b): Stacked Bar Plot¶

axes = cat_rating_df.plot.bar(stacked=True,figsize=(15,5))
axes.set_xlabel("Categories",fontsize=14)
axes.set_ylabel("Rating",fontsize=14)
axes.set_title("Ratings vs Categories Comaparison for two countries",fontsize=20)
plt.savefig('graphs/category_ratings_stacked.png',dpi=100, bbox_inches='tight')

This graph represnts the rating vs categories stacked bar plot for comparing two countries. The rartings have been normalized using min-max normalization. Once this has been done, the ratings are now plotted against the categories to find out patterns. X-axis represents categories and Y -axis represents normalized ratings.

Plot 6: Members vs Rating vs No. of Groups to Rating for Different Categories¶

Aim: The aim is to find out the relation between ratings, members in group and groups count

Plots Used: Scatter Plot, Scatter Matrix(Small Multiples), Line Chart(Small Multiples), Dual Axis Chart

category_rating_dual = temp.groupby(['category.name'])[['rating']].agg('mean').reset_index().set_index('category.name').rename(index=str, columns={"rating": "mean_rating"})
groups_df = temp.groupby(['category.name'])[['id']].agg('count').reset_index().set_index('category.name').rename(index=str, columns={"id": "group_count"})
members_df_copy = members_df.set_index('category.name')

cat_data_unnormalized = pd.concat([members_df_copy, category_rating_dual,groups_df], axis=1)

display(cat_data_unnormalized[0:5])

# Normalization of data using min-max normalization

cat_data = pd.DataFrame(cat_data.index.tolist()).set_index(0).rename_axis('Categories')


cat_data['mean_rating']=(cat_data_unnormalized['mean_rating']-cat_data_unnormalized['mean_rating'].min())/(cat_data_unnormalized['mean_rating'].max()-cat_data_unnormalized['mean_rating'].min())
cat_data['members']=(cat_data_unnormalized['members']-cat_data_unnormalized['members'].min())/(cat_data_unnormalized['members'].max()-cat_data_unnormalized['members'].min())
cat_data['group_count']=(cat_data_unnormalized['group_count']-cat_data_unnormalized['group_count'].min())/(cat_data_unnormalized['group_count'].max()-cat_data_unnormalized['group_count'].min())

display(cat_data[0:5])

Plot 6(a): Scatter Plot¶

ax = cat_data.plot.scatter(x="members", y="group_count", c='mean_rating', cmap = matplotlib.cm.jet, alpha=0.4 , s=250)
ax.grid(True,linestyle='-',color='0.75')

ax.text(0.75, -0.1, 'Number of Members',
        verticalalignment='bottom', horizontalalignment='right',
        transform=ax.transAxes)

plt.suptitle("Categories Data Scatter Plot",fontsize=14)
plt.savefig('graphs/scatter_plot_categories.png',dpi=100, bbox_inches='tight')

This graph represnts the rating vs categories vs members using scatter plot. The ratings have been normalized to a color scale based on mean ratings for the data. Once this has been done, the ratings are now plotted against the categories to find out patterns. X-axis represents number of members and Y-axis represents normalized ratings.

Plot 6(b): Scatter Matrix¶

sns.pairplot(cat_data)
ax = plt.gca()
plt.suptitle("Small Multiples for category dimensions via Scatter Plot",fontsize=14, y=1.08)
plt.savefig('graphs/scatter_categories_small_multiple.png',dpi=100, bbox_inches='tight')

This graph represents the scatter matrix plot for mean rating, number of members and group count. The data has been normalized already using min-max normalization. The other half of the graph is identical.

Plot 6(c): Small Multiples Line Chart¶

axes = cat_data.plot(rot=90,subplots=True, figsize=(6, 6),x_compat=True);
plt.xticks(range(len(cat_data.index.tolist())), cat_data.index.tolist(), size='small')

plt.suptitle("Small Multiples for category dimensions via Line Chart",fontsize=14)
plt.savefig('graphs/line_categories_small_multiple.png',dpi=100, bbox_inches='tight')

This graph represents the small multiples for mean rating, number of members and group count using line chart. The data has been normalized already using min-max normalization. The other half of the graph is identical, so we are interested X-axis represents categories and Y-axis represents value.

Plot 6(d): Dual Axes Line Chart¶

ax = cat_data.plot(rot=90,secondary_y=['group_count'])
ax.set_ylabel("Members Count")
ax.right_ax.set_ylabel("Group Count")
plt.xticks(range(len(cat_data.index.tolist())), cat_data.index.tolist(), size='small')
plt.title("Dual Axis Line Chart for different categories\n")
plt.xlabel("Time")

plt.savefig('graphs/line_categories_dual_axes.png',dpi=100, bbox_inches='tight')

Plot 7: Frequency Distribution for groups in different cities of one country¶

Aim: The aim is to find out which city has more number of groups

Plots Used: Bar Chart - frequeny distribution

from collections import Counter

city_data = temp[['city','country_name']]
countries = city_data['country_name'].unique().tolist()

for i in range(len(countries)):
    a = city_data.loc[city_data['country_name'] == countries[i]]['city'].tolist()
    letter_counts = Counter(a)
    df = pd.DataFrame.from_dict(letter_counts, orient='index')
    df.columns = ['count']
    df['count']=(df['count']-df['count'].min())/(df['count'].max()-df['count'].min())

    threshold = 0.005
    # Remove rows less than the threshold
    df = df[df['count'] > threshold].sort_values(by=['count'])
    df.plot(kind='bar')
    plt.suptitle("Frequency Distribution of groups among top cities of "+countries[i],fontsize=20)
    plt.ylabel("Normalized Frequency")
    plt.xlabel("Cities")
    plt.savefig("graphs/groups_distribution_city_wise"+countries[i]+".png",dpi=100, bbox_inches='tight')

These graphs represents the most active countries with maximum number of group counts. A threshold of 0.05 was used over normalized values to remove the cities which as less active or has lesser number of groups. x-axis represents the cities and y axis represents the frequency of groups in these cities.

5.2. Interpretations and Data Insights¶

There are many conclusions that can be drawn from the given graphs before:

Plot 1: Comparison of growth of groups for two countries

It can be clearly seen from the graphs that both the innovative countries are showing growth in the group count over the years. But the point specifically to be noted is Switzerland is ranked first in Global innovation index and Sweden as second. The groups count is more in Switzerland than in Sweden and the gap in increasing. Considering, year 2018 is not complete yet, trend says, in 2018 the number groups will increase
Plot 2: Growth in number of groups - monthly data

There is no clear trend for the increase in the groups growth accross the different months. However, the starting months January, February and March saw the major growth.
Plot 3: Growth in number of groups category-wise

It can be clearly figured out that the innovative countries are focussing more on Technology field and have been joining it in more and more numbers. Also, people in these countries are more focussed on career / business. Though this field is behind technology but the gap in number of members in career/ business and other categories is significant. Apart from them, health/ well being has seen a good growth in year 2017.

Plot 4: Categories Popularity in top Global Innovative Countries

 Like the previous graphs, we are now comparing categories based on member count:

 - The top joined categories are: tech, socializing, career/business and outdoor adventures
 - Also, if we compare the previous graph with this one, we will find out that even though socializing category does not have too many groups but still it has a lot of members

 Comparison between Switzerland and Sweden:

 - Sweden is showing a bias towards tech. Tech is also the most joined category in Switzerland but it's people are still joining other groups as well and the gap is smaller.

This means country focussing on all the fields is more innovative than others which is biased towards one field.

Plot 5: Compare ratings of different categories for the selected countries

This data shows that there is no clear trend in the ratings provided by the users. Less number of members could be the reason for irregular trends. However, one point to be noted is that the categories which have the highest number of members in the two countries share the similar or slightly different rating for those categories like tech, socializing, career/business and outdoor adventures
Plot 6: Members vs Rating vs No. of Groups to Rating for Different Categories

As in the previous graphs, these graphs also depict no clear picture for rating, but the trends is strong, whcih suggests as the number groups increases, member counts also increases. As seen in the dual axes chart for members and group count or in Small Multiple of line chart.
Plot 7: Frequency Distribution for groups in different cities of one country

The plot represents the frequency of groups in different cities of the country. A threshold of 0.05 was used to select the cities. It depicts which cities are most active on meetup. Zurich and Stockholm are the most active cities in the two countries. Stockholm being the capital of Sweden might be the reason for it popularity but Zurich is not the capital of Switzerland still has a lot of popularity.

Tentative Conclusion:¶

Any country which wants to increase it's rank in the global innovation index can take insights from this experiement and encourage more groups which leads to innovation.

Higher the number of groups, more are the members.
Higher the number of groups, more is the global innovation index.

5.3. Future Scope¶

More statisfical modelling can be applied to this data to find out that there is strong relation between number of groups and members and the countries with high global innovation index are involved more in meetups. Currently, only two countries were selected.

But Limited data was collected since the resources and time was limited. More data can be picked from around the world for multiple countries and staistical modelling can be perfromed on the various countries data to compare data.

The members details and the events around the world is a high volumetric data and could not be picked since it require high memory constraints. Also, further since it give coordinates(latitude, longitude), success of a group can be discovered using more detailed data which can be used by companies to share their new innovations. Companies can also benefit from the data, because before organizing any event they might analyze success of the event based on numbers.

Meetup provides more set of open APIs.

For example:

Member details and how many groups they have joined. This can be used to track the activity and trends for the members and give then tailored recommendations. Also, countries can promote categories among members if the country is lagging behind in a category which is highly joined by innovative countries.
It also gives venue data. So, the numbers members attending an event may depend on venues and further data can be analyzed to select top venues having a high turnout ratio.

References:¶

Slides by Mr. Derek Greene for Module Data Science in Python
http://pandas.pydata.org/
https://github.com/laserson/squarify/
https://matplotlib.org/
https://seaborn.pydata.org/examples/scatterplot_matrix.html

	city	country	distance	id	lat	localized_country_name	lon	member_count	ranking	zip
0	Zürich	ch	805.048730	1005076	47.380001	Switzerland	8.54	4272	0	meetup1
1	Genève	ch	770.239392	1005077	46.209999	Switzerland	6.14	1225	1	meetup2
2	Lausanne	ch	772.070427	1005080	46.520000	Switzerland	6.62	635	2	meetup5

	city	country	distance	id	lat	localized_country_name	lon	member_count	ranking	zip
0	Stockholm	se	952.186356	1037701	59.330002	Sweden	18.07	3302	0	meetup1
1	Göteborg	se	738.934201	1037702	57.720001	Sweden	12.01	603	1	meetup2
2	Malmö	se	768.409393	1037703	55.610001	Sweden	13.02	347	2	meetup3

	name	shortname	sort_name
id
1	Arts & Culture	Arts	Arts & Culture
18	Book Clubs	Book Clubs	Book Clubs
2	Career & Business	Business	Career & Business
3	Cars & Motorcycles	Auto	Cars & Motorcycles
4	Community & Environment	Community	Community & Environment

	category.id	category.name	category.shortname	city	country	created	description	group_photo.base_url	group_photo.highres_link	group_photo.photo_id	...	organizer.photo.thumb_link	organizer.photo.type	rating	state	timezone	topics	urlname	utc_offset	visibility	who
0	1	fine arts/culture	arts-culture	Zürich	CH	1360710014000	<p>We are creative people and artists getting ...	https://secure.meetupstatic.com	https://secure.meetupstatic.com/photos/event/c...	466610507.0	...	https://secure.meetupstatic.com/photos/member/...	member	4.89	NaN	Europe/Zurich	[{'urlkey': 'figuredrawing', 'name': 'Figure D...	Paper_pencil_caffeine	3600000	public	Members
1	1	fine arts/culture	arts-culture	Zürich	CH	1416469038000	<p>Hello music makers (beginners included!) an...	NaN	NaN	NaN	...	https://secure.meetupstatic.com/photos/member/...	member	4.43	NaN	Europe/Zurich	[{'urlkey': 'musicians', 'name': 'Musicians', ...	Open-Mic-every-Sunday-at-Kennedys-Irish-Pub-Zu...	3600000	public	Musicmakers/ Music Appreciators

	id	city	country	category.id	category.name	category.shortname	rating	members	date_created	country_name
0	7157472	Zürich	CH	1	fine arts/culture	arts-culture	4.89	985	2013-02-12	Switzerland
1	18202384	Zürich	CH	1	fine arts/culture	arts-culture	4.43	773	2014-11-20	Switzerland
2	18502560	Baden	CH	1	fine arts/culture	arts-culture	4.85	432	2015-03-15	Switzerland
3	18815796	Zürich	CH	1	fine arts/culture	arts-culture	4.98	464	2015-08-09	Switzerland
4	18995561	Zürich	CH	1	fine arts/culture	arts-culture	4.81	1214	2015-10-04	Switzerland

	id	city	country	category.id	category.name	category.shortname	rating	members	date_created	country_name
count	4568	4568	4568	4568	4568	4568	4568	4568	4568	4568
unique	4451	174	2	33	33	33	115	961	1630	2
top	26127445	Zürich	CH	34	tech	tech	0.0	50	2018-01-22	Switzerland
freq	2	1345	3090	1343	1343	1343	2067	126	15	3090

	category.name	members
0	tech	408900
1	career/business	183689
2	socializing	125094
3	outdoors/adventure	122833
4	language/ethnic identity	105040

	tech	career/business	socializing	outdoors/adventure	language/ethnic identity	health/wellbeing	food/drink	fitness
2003	0.0	1.0	0.0	0.0	1.0	0.0	0.0	0.0
2004	1.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
2006	1.0	1.0	1.0	0.0	0.0	0.0	1.0	0.0
2007	0.0	4.0	5.0	2.0	0.0	0.0	1.0	1.0
2008	1.0	1.0	2.0	0.0	2.0	1.0	0.0	0.0

	category.name	country_name	members
0	tech	Sweden	240087
1	tech	Switzerland	168813
2	outdoors/adventure	Switzerland	109286
3	career/business	Switzerland	99586
4	career/business	Sweden	84103

	category.name	country_name	rating
0	sci-fi/fantasy	Switzerland	4.890000
1	book clubs	Switzerland	4.254444
2	dancing	Sweden	4.154444
3	writing	Sweden	3.828000
4	photography	Sweden	3.770556

	Switzerland	Sweden
category.name
sci-fi/fantasy	1.000000	0.659168
book clubs	0.870030	0.546779
singles	0.682734	0.442740
tech	0.622019	0.627446
movies/film	0.601113	0.397702

	members	mean_rating	group_count
LGBT	4282	1.205313	32
alternative lifestyle	449	1.357143	14
book clubs	4802	3.242800	25
career/business	183689	2.423424	663
cars/motorcycles	628	1.690909	11

	mean_rating	members	group_count
Categories
LGBT	0.297119	0.010358	0.022371
alternative lifestyle	0.334546	0.000983	0.008949
book clubs	0.799376	0.011630	0.017151
career/business	0.597393	0.449164	0.492916
cars/motorcycles	0.416822	0.001421	0.006711

	code
switzerland	ch
sweden	se

	Count
Month_created
Jan	490
Feb	470
Mar	505
Apr	333
May	375