Meetup Data Analysis

Problem Statement

Meetup Data Analysis for Countries trending in Innovation

Pre-requisites

  • Meetup API key

  • External Libraries Required: pycountry

  • Internet Connection

1. Task - Choosing one or more suitable web API

1.1. API Used:

  • Wikipedia API has been used to find out the countries list trending on Innovation.

  • Meetup API is used using an API key to analyze and compare statistics of these countries. The API key is listed below.

  • Analyzing Meetups Data Trends for countries interested in innovation in the last one year.

In [276]:
api_key = "29191b3c116b165929763e304f2897e"

1.2. Constraints:

  • The use of Wikipedia API is limited and used for finding out the countries listed as most innovative in 2017.
  • The data from Wikipedia is subject to change and also the format may change with a change in revision.
  • This project involves Well-Supported API for wikipedia, see details below
  • These API's data keeps on changing and for analysis purposes, data for below period is used:

    Wikipedia Data:

      Revision Date: 06-01-2018
      Page Title: Global_Innovation_Index
    
    

    Meetup Data:

      Recent Data   

1.3. Pre Installations:

  • pip install pycountry
  • pip install meetup-api
  • pip install squarify
In [1196]:
import pandas as pd
from pandas.io.json import json_normalize
import urllib.request
import json
import time
import os

2. Task - Collect data from One or more choosen Web API(s)

Import Librabies for fetching data from Rest API

In [854]:
# this is a general class which can be used by all the other classes for fetching data

class api_request:
    def request_data(self,url):
        '''
        Parameters: url
        return: data fetched from api in json format
        ----------
        To request data from a URL though HTTP request and parse it into json format
        '''
        try:
            response = urllib.request.urlopen(url)       #Hit URL: url
            raw_json = response.read().decode()          #Decode response
            data = json.loads(raw_json)                  
            return data                                  #return data
        
        except Exception as h:
            print("Cannot find data at given URL\nError:"+str(h))      # Handle exception and return noting if 
            return ""                                                  # data is not found at given URL 

2.1. Data Fetch from Wikipedia Page - Global Innovation Index

(Custom Class Used Later)
In [1198]:
class Wiki(api_request):
    def __init__(self):
        self.__apiurl = "https://en.wikipedia.org/w/api.php"     # API Url 
        self.__datatype = "json"                                 # default format to Fetch data: json
        
    
    def get_wiki_page(self,page_title="Global_Innovation_Index"):
        '''
        Parameters: page_title (page title of wikipedia article)
        return: json response
        ----------
        To get the data from WIKI API using the parameters: pagetitle and format
        
        '''
        url_params = "?action=parse&prop=text&page="+page_title+"&format="+self.__datatype 
        url= self.__apiurl +  url_params
        return self.request_data(url)
    

2.2. Data Fetch from Meet up - Various Content

(Custom Class Used Later)
In [1200]:
# meetup-api is wrapper on top of the meetup Http api and it uses an api key to establish connection and get response

class Meetup(api_request):
    def __init__(self):       
        self.__apiurl = "https://api.meetup.com"
        self.__apikey = "29191b3c116b165929763e304f2897e"                      #API key for meetup used to get the data
        self.__client = meetup.api.Client(self.__apikey,overlimit_wait=True)   #get client from API key to execute API
        
    def refresh_connection(self):
        '''refresh client if expired, this was required to make sure 
           data is fetched consistently and the code does not stop in case of any exception from api
        '''
        self.__client = meetup.api.Client(self.__apikey,overlimit_wait=True)
    
    def get_categories(self):
        ''' 
        return: dataframe
        --------
        Method to get all the meetup categories
        '''
        a = self.__client.GetCategories()                             
        categories_df = pd.read_json(json.dumps(a.results))
        return categories_df                                          
    
    def get_cities(self, code):
        '''
        Parameters: country code
        return: dataframe
        ---------
        Method to get cities for the country using parameter: country alpha_2 code
        '''
        cities = self.__client.GetCities(country=code)             
        cities_df = pd.read_json(json.dumps(cities.results))
        return cities_df
    
    def get_groups(self, cat_id, lat, long, offset_val=0, retrycount=0):
        ''' 
        Parameters: category id, latitude, longitude, offset value and retry count(max allowed 2)
        return: normalized json
        --------
        Method to get all groups of the city
        '''
        try:
            if retrycount==2:
                print("Max retries reached")            # Check if maximum retries reached, return nothing in that case                
                return ""
            retrycount = retrycount + 1            
            group = self.__client.GetGroups(category_id = cat_id, lat=lat, lon= long, offset = offset_val)
            return json_normalize(group.results)
        except:
            print("Trying to re-establish connection")
            self.refresh_connection()
            get_groups(self, cat_id, lat, long, offset_val, retrycount)

3. Task - Parse data, and store it in an appropriate file format

3.1. Parse collected data

(Custom Class Used Later)
In [857]:
from bs4 import BeautifulSoup
import pycountry

class ParseData:
    
    def __init__(self):
        '''Initialize {country: country code} disctionary using pycountry module
        '''
        self.__countries_dict={country.name.lower():country.alpha_2.lower() for country in pycountry.countries }
    
    def wiki_content(self, html):    
        ''' 
        Parameters: html content
        returns: dataframe
        ---------
        Method to parse wikipedia  html content to get top countries which are innovative 
        '''
        parsed_html = BeautifulSoup(html,"lxml")
        table = parsed_html.body.find_all('table', attrs={'class':'wikitable'})[0]

        list=[]
        list.extend(table.select('td a'))
        countries_wiki=[]
        
        for l in list:
            countries_wiki.append(l.text.lower())

        df = pd.DataFrame([countries_wiki],index=['geoName']).T
        df =df.set_index('geoName')    
        
        # After getting the list of countries, 
        # creating a dataframe by joining the country list and their aplha 2 code dictionary
        
        country_df = df.join(pd.DataFrame(self.__countries_dict,index=['code']).rename_axis('geoName').T , how='inner')
        return country_df

    def get_countries_dict(self):
        '''
        returns {country code: country} disctionary using pycountry module
        '''
        return {country.alpha_2.upper():country.name for country in pycountry.countries }
    

3.2. Data Store

(Custom Class Used Later)
In [1201]:
class DataStore:
    
    def __init__(self):
        self.__datapath = os.getcwd() + "\data"
        self.__pathObj = {
            'wiki' : self.__datapath + "\wiki",
            'meetup': self.__datapath + "\meetup"
        }
        #Create required directories for meetup data and wikipedia data 
        for k,v in self.__pathObj.items():
            os.makedirs(v, exist_ok=True)
    
    def df_to_csv(self, df, data_dir, filename, index=True):
        ''' 
        Parameters: data frame, directory name, file name and if index column is required or not
        -----------
        Method to write dataframe to csv,
        exception handled if file locked by some other operation or could not write to file
        '''
        try:
            print("Write to file:\n"+self.__pathObj[data_dir]+'\\'+filename)
            if(index):
                df.to_csv(self.__pathObj[data_dir]+'/'+filename, sep=',', encoding='utf-8')
            else:
                df.to_csv(self.__pathObj[data_dir]+'/'+filename, sep=',', encoding='utf-8',index=False)
        except PermissionError:
            print("\nCannot write to file\n\nPlease close the file, if file is open")
        except Exception as h:
            print("Cannot write to file \nError:"+str(h))
            
    def create_directory(self,data_dir, dirname):
        ''' 
        Parameters: base directory(data_dir), directory name
        -----------
        Method to create folder
        '''
        os.makedirs(self.__pathObj[data_dir]+'/'+dirname, exist_ok=True)

    def read_from_csv(self,data_dir, filename):
        ''' 
        Parameters: base directory(data_dir), file name
        return: dataframe
        -----------
        Method to read csv into dataframe
        '''
        df = pd.read_csv(self.__pathObj[data_dir]+'/'+filename)
        return df

3.3. Using Objects to fetch, parse and store data

3.3.1. Initializations and Object Creation

In [1202]:
from IPython.display import display #For displaying beautiful dataframes, supported by IPython
parse = ParseData()    
sd = DataStore()

3.3.2. Fetch, Parse and Store first five countries with highest global innovation index

Using Custom Class "Wiki" object "wiki"
In [1204]:
#Fetch Wikipedia content
wiki = Wiki();
data = wiki.get_wiki_page()
Using object to parse and store wikipedia data
Important: The two countries selected from fetched wikipedia data are the top ranked in global innovation will be used further
In [1205]:
# the HTML content retreived from wikipedia as string and write them to a file countries.csv in wiki folder
if(data!=""):
    html = data['parse']['text']['*']

    # Parse and Store countries list in file
    country_df = parse.wiki_content(html)
    sd.df_to_csv(country_df, 'wiki', 'countries.csv')
    
    print("\n\nSample Innovative Countries Dataframe:")
    selected_countries = country_df[0:2]                    #Only top 2 countries have been chosen
    display(selected_countries)
else:
    selected_countries = pd.DataFrame()
    
#Sample output of the dataframe has been shown below:    
Write to file:
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API\data\wiki\countries.csv


Sample Innovative Countries Dataframe:
code
switzerland ch
sweden se

3.3.3. Fetch, Parse and Store Meetup Data

In [862]:
import meetup.api
meetupObj = Meetup();


  • Cities of fetched countries

The countries fetched from wikipedia have multiple countries where Meetup groups exist. To get further data about groups and categories cities are required. Latitude and longiude of cities is used later to get groups data.

In [863]:
#Get country codes for the selected countries, if no country is found, use ireland by default
if not selected_countries.empty:
    codes = selected_countries['code'].tolist()
else:
    codes = ['ie']

codes 
Out[863]:
['ch', 'se']
In [288]:
delay_time = 1.0

#to iteratively get cities for all the selected countries

for code in codes:
    print("Fetching cities for "+code+". . . Please wait . . .")
    cities_df = meetupObj.get_cities(code)
    
    time.sleep(delay_time)
    sd.create_directory('meetup','cities')
    sd.df_to_csv(cities_df, 'meetup', 'cities/'+code+'.csv', index=False)
    display(cities_df[:3])
    print("Total Records Fetched for "+ code +": "+str(len(cities_df.index)) )
Fetching cities for ch. . . Please wait . . .
28/30 (2 seconds remaining)
Write to file:
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API\data\meetup\cities/ch.csv
city country distance id lat localized_country_name lon member_count ranking zip
0 Zürich ch 805.048730 1005076 47.380001 Switzerland 8.54 4272 0 meetup1
1 Genève ch 770.239392 1005077 46.209999 Switzerland 6.14 1225 1 meetup2
2 Lausanne ch 772.070427 1005080 46.520000 Switzerland 6.62 635 2 meetup5
Total Records Fetched for ch: 200
Fetching cities for se. . . Please wait . . .
27/30 (0 seconds remaining)
Write to file:
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API\data\meetup\cities/se.csv
city country distance id lat localized_country_name lon member_count ranking zip
0 Stockholm se 952.186356 1037701 59.330002 Sweden 18.07 3302 0 meetup1
1 Göteborg se 738.934201 1037702 57.720001 Sweden 12.01 603 1 meetup2
2 Malmö se 768.409393 1037703 55.610001 Sweden 13.02 347 2 meetup3
Total Records Fetched for se: 107


  • Meetup Categories

To further request groups data, categories id are also required.

In [866]:
#Fetch Meetup Categories

df = meetupObj.get_categories().set_index('id')

sd.df_to_csv(df, 'meetup', 'categories.csv')
print("\n\nSample Categories:")
display(df[0:5])
29/30 (10 seconds remaining)
Write to file:
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API\data\meetup\categories.csv


Sample Categories:
name shortname sort_name
id
1 Arts & Culture Arts Arts & Culture
18 Book Clubs Book Clubs Book Clubs
2 Career & Business Business Career & Business
3 Cars & Motorcycles Auto Cars & Motorcycles
4 Community & Environment Community Community & Environment
In [867]:
cat_id = df.index.tolist()
cat_id[0:5]
Out[867]:
[1, 18, 2, 3, 4]
  • Meetup Groups based on Category ids

Step 1: Fetch Once we have city data (lat and long), we can now fetch groups data to get more granular level of information. It includes the fields that are required for comparitive analysis of top two innovative countries.

The groups information will be stored in local folers in the form of Multiple CSVs. Considering the large size of data, the files are written by category id which will be unique for one country.

In [ ]:
# Getting group data for each city based on lat, long and category 

for code in codes:
    print(code)
    df = sd.read_from_csv('meetup','/cities/'+code+'.csv')
    df1 = df[['lat','lon']]
    
    for i in cat_id:
        print(i)          
        groups = pd.DataFrame([])
        
        #Iterate over cities lat lon
        for city_itr in range(len(df1)):
            lat = df1.loc[city_itr].lat
            lon = df1.loc[city_itr].lon
            #run again untill we have results
            for j in range(10000):    
                result = meetupObj.get_groups(i,lat,lon,j)
                if len(result) != 0:
                    groups = groups.append(result)
                else:
                    break
        # Get data in dataframe from multiple CSVs             
        sd.create_directory('meetup','groups')
        sd.create_directory('meetup','groups/'+code)
        sd.df_to_csv(groups, 'meetup', 'groups/'+code+'/'+str(i) +'.csv', index=False)

Kindly Note: The data from this API returns redundant data even when offset was used and it takes a lot of time since it is an iterative process and need to fetch records for every possible category in every possible city.

Also, only sample has been shown below and a sample file is attached with the project.

Sample files are present in 'data' folder and the files inside the './data/group' folder does not contain all the groups since it was a huge data but all the files shared same structure.

Please contact me for more data

Step 2: Select, combine and create a master dataframe from group data extracted for different countries. The API data must be stored locally to enable future processing. The created dataframe is still unclean and we will select only a limited part of the data.

In [930]:
print("executing")
cwd = os.getcwd()

group_df_arr = pd.DataFrame([])
frames = []
for code in codes:
    data_list = []
    for i in cat_id:
        try:
            df = pd.read_csv(cwd+'/data/meetup/groups/'+code+'/'+str(i)+'.csv', index_col = False, header=0, dtype='unicode')
            df.drop_duplicates(subset=None, inplace=True)
            data = df
        except Exception as h:
            print("Exception: "+str(h))
            print(cwd+'/data/meetup/groups/'+code+'/'+str(i)+'.csv');
            pass
        data_list.append(data)
    group_df_arr = pd.concat(data_list)
    print(len(group_df_arr))
    frames.append(group_df_arr)
    
result = pd.concat(frames)

display(result[0:2])
executing
Exception: No columns to parse from file
G:\Mega\UCD\Modules\Sem2\6-Data-Science-in-Python\Assignment-API/data/meetup/groups/ch/24.csv
3413
2070
category.id category.name category.shortname city country created description group_photo.base_url group_photo.highres_link group_photo.photo_id ... organizer.photo.thumb_link organizer.photo.type rating state timezone topics urlname utc_offset visibility who
0 1 fine arts/culture arts-culture Zürich CH 1360710014000 <p>We are creative people and artists getting ... https://secure.meetupstatic.com https://secure.meetupstatic.com/photos/event/c... 466610507.0 ... https://secure.meetupstatic.com/photos/member/... member 4.89 NaN Europe/Zurich [{'urlkey': 'figuredrawing', 'name': 'Figure D... Paper_pencil_caffeine 3600000 public Members
1 1 fine arts/culture arts-culture Zürich CH 1416469038000 <p>Hello music makers (beginners included!) an... NaN NaN NaN ... https://secure.meetupstatic.com/photos/member/... member 4.43 NaN Europe/Zurich [{'urlkey': 'musicians', 'name': 'Musicians', ... Open-Mic-every-Sunday-at-Kennedys-Irish-Pub-Zu... 3600000 public Musicmakers/ Music Appreciators

2 rows × 36 columns

Raw data has been collected in the 'result' dataframe

4. Task - Load and represent your data as a Pandas DataFrame. Apply any pre-processing and quality checking steps that may be required to clean and filter the data before analysis

4.1. Load data into dataframe, filter data and select only required columns

In [1218]:
def parse_group_data(result):
    '''
    Parameters: dataframe with raw data (result)
    --------
    result: cleaned dataframe which will be used for analysis
    '''
     
    grouped_Df = result.reset_index(drop=True)   
    
    # Since in the default border radius of one city's latitude and longitude, it has data from different countries also.
    # The data which is relevant will only be choosen. So a filter is applied on fetched country codes 
    # and then again more filters are applied to get only required columns
    grouped_Df = grouped_Df[grouped_Df['country'].str.lower().isin([x.lower() for x in codes])][['id','city','country','created','category.id','category.name','category.shortname','rating','members']]
    
    #change the date format from timestamp to date string
    grouped_Df = grouped_Df.assign(
                        date_created=pd.Series(
                            grouped_Df['created']
                            .apply(lambda x: datetime.datetime.fromtimestamp(int(x) / 1e3))
                            .dt.strftime('%Y-%m-%d'))
                        .values)
    countries_dict = parse.get_countries_dict()
    grouped_Df = grouped_Df.assign(country_name=pd.Series(grouped_Df['country'].apply(lambda x: countries_dict[x])).values)
    grouped_Df.drop(['created'], axis=1, inplace=True)
    return grouped_Df
In [1219]:
import datetime
grouped_Df = parse_group_data(result)
grouped_Df[0:5]

#Sample output for parsed data
Out[1219]:
id city country category.id category.name category.shortname rating members date_created country_name
0 7157472 Zürich CH 1 fine arts/culture arts-culture 4.89 985 2013-02-12 Switzerland
1 18202384 Zürich CH 1 fine arts/culture arts-culture 4.43 773 2014-11-20 Switzerland
2 18502560 Baden CH 1 fine arts/culture arts-culture 4.85 432 2015-03-15 Switzerland
3 18815796 Zürich CH 1 fine arts/culture arts-culture 4.98 464 2015-08-09 Switzerland
4 18995561 Zürich CH 1 fine arts/culture arts-culture 4.81 1214 2015-10-04 Switzerland

4.2. Missing Data

Finally as a part of Pre-Processing, check if further pre-processing is needed for missing data

In [1220]:
#look for missing data
grouped_Df.isnull().sum() # no missing values in the reduced dataset 
Out[1220]:
id                    0
city                  0
country               0
category.id           0
category.name         0
category.shortname    0
rating                0
members               0
date_created          0
country_name          0
dtype: int64
In [1221]:
grouped_Df.dtypes.value_counts() 
Out[1221]:
object    10
dtype: int64
In [1222]:
grouped_Df.isnull().values.any()
Out[1222]:
False

There are no Null's in the data collected from raw dataframe and no values exist such as "NaN" or blank values in any column. This indicates the data is clean now and all values are pro-processed now.

5. Task - Analyse and summarise the cleaned dataset, using tables and plots where appropriate. Based on your results, what interpretations or insights can be made about the dataset? What further analysis might be done on the data? Detail this using Markdown cells in your notebook.

5.1 Analyse and Summarise the cleanest dataset using tables and plots

Descriptive Statistics

In [1225]:
#grouped_Df[0:5]

print("Descriptive Stats:\n")
display(grouped_Df.describe())
Descriptive Stats:

id city country category.id category.name category.shortname rating members date_created country_name
count 4568 4568 4568 4568 4568 4568 4568 4568 4568 4568
unique 4451 174 2 33 33 33 115 961 1630 2
top 26127445 Zürich CH 34 tech tech 0.0 50 2018-01-22 Switzerland
freq 2 1345 3090 1343 1343 1343 2067 126 15 3090

Plots using Matplotlib and Pandas Graph

Initializations
In [1434]:
import matplotlib
import matplotlib.pyplot as plt
import squarify                 # Squarify is used to get the rectangle coordinates for tree map
import seaborn as sns
%matplotlib inline

temp = grouped_Df               # A temporary copy is created to perform further actions using the copy of dataframe    
temp = temp.apply(pd.to_numeric, errors='ignore')

Plot 1: Comparison of growth of groups for two countries

Aim: The aim here is to compare, how different meetup groups have grown in the countries over the years.

Plots Used: Line Chart and Area Chart

In [1061]:
country_time_series = []
countries_dict = parse.get_countries_dict()

for c in range(len(codes)):
    country_time_series.append(grouped_Df[grouped_Df['country'] == codes[c].upper()][['date_created']])
    country_time_series[c]['date_created'] =pd.to_datetime(country_time_series[c].date_created).dt.year
    country_time_series[c].reset_index().drop(['index'],axis=1)
    country_time_series[c] = country_time_series[c].groupby(['date_created'])[['date_created']].agg('count')
    country_time_series[c] = country_time_series[c].rename(index=str, columns={"date_created": countries_dict[codes[c].upper()]})

country_time_series_df = pd.concat(country_time_series, axis=1).fillna(value=0)
display(country_time_series_df[0:5])
Switzerland Sweden
2003 2 0.0
2004 1 2.0
2006 4 1.0
2007 14 1.0
2008 6 3.0

Plot 1(a): Growth of groups comparison country-wise

Line Chart
In [1423]:
country_time_series_df.plot()
plt.title("Growth in number of groups on meetup in the two countries",fontsize=20)
plt.ylabel("No. of Groups")
plt.xlabel("Time")
plt.savefig('graphs/groups_creation_country_wise-line.png',dpi=100, bbox_inches='tight')

This graph is showing the rise in the number of groups on yearly basis using line chart plot in the two countries. X-axis represents years and Y axis represents the number of groups formed.

Plot 1(b): Growth of groups comparison country-wise

Area Chart
In [1424]:
country_time_series_df.plot.area(stacked=False)
plt.suptitle("Growth in number of groups on meetup in the two countries",fontsize=20)
plt.ylabel("No. of Groups")
plt.xlabel("Time")
plt.savefig('graphs/groups_creation_country_wise-area.png',dpi=100, bbox_inches='tight')

This graph is showing the rise in the number of groups on yearly basis using stacked area chart for the two countries. X-axis represents years and Y axis represents the number of groups formed.

Plot 2: Growth in number of groups - monthly data

Aim: The aim here is to compare, which month sees highest growth in meetup groups.

Plots Used: Bar Charts

In [1291]:
group_month_data = pd.DataFrame([])
group_month_data = grouped_Df[['date_created']]
group_month_data = grouped_Df.assign(Month_created=pd.to_datetime(group_month_data.date_created).dt.strftime('%b'))

group_month_data.reset_index().drop(['index'],axis=1)
group_month_data = group_month_data.groupby(['Month_created'])[['Month_created']].agg('count')
group_month_data = group_month_data.rename(index=str, columns={"Month_created": "Count"}).reset_index()


# Using temp month number to sort rows
months = {datetime.datetime(2000,i,1).strftime("%b"): i for i in range(1, 13)}
group_month_data["month_number"] = group_month_data["Month_created"].map(months)
group_month_data = group_month_data.fillna(value=0).sort_values(by=['month_number']).set_index('Month_created')

group_month_data = group_month_data.drop(['month_number'], axis=1)

display(group_month_data[0:5])
Count
Month_created
Jan 490
Feb 470
Mar 505
Apr 333
May 375
In [1425]:
p = group_month_data.plot.bar(figsize=(6,5),fontsize=14)
p.set_xlabel("Month",fontsize=14)
p.set_ylabel("Group Count",fontsize=14)

plt.suptitle("Growth in number of groups - monthly data",fontsize=20)
plt.savefig('graphs/groups_growth_monthly.png',dpi=100, bbox_inches='tight')

This graph is showing the rise in the number of groups on monthly basis using bar chart plot. X-axis represents months from january to december and Y axis represents the count of number of groups formed.

Plot 3: Growth in number of groups category-wise

Aim: To find out how the member accross different categories have grown since 2003

Plots Used: Small Multiple Bar Charts

In [1082]:
temp = grouped_Df
temp = temp.apply(pd.to_numeric, errors='ignore')

#display(temp [0:5])
members_df = temp.groupby(['category.name'])[['members']].agg('sum').sort_values(by=['members'], ascending=False).reset_index()

display(members_df[0:5])
category.name members
0 tech 408900
1 career/business 183689
2 socializing 125094
3 outdoors/adventure 122833
4 language/ethnic identity 105040
In [1228]:
#getting top 9 categories based on popularity

popular_categories = members_df[0:9]['category.name'].tolist()

cat_time_series = []

#creating array of dataframes for different categories
for p in range(len(popular_categories)):
    cat_time_series.append(grouped_Df[grouped_Df['category.name'].str.lower() == popular_categories[p]][['date_created']])
    cat_time_series[p]['date_created'] =pd.to_datetime(cat_time_series[p].date_created).dt.year
    cat_time_series[p].reset_index().drop(['index'],axis=1)
    cat_time_series[p] = cat_time_series[p].groupby(['date_created'])[['date_created']].agg('count')
    cat_time_series[p] = cat_time_series[p].rename(index=str, columns={"date_created": popular_categories[p]})

    
# Once categories are ready join the on the basis of index which is category name
# The groups which were existent in one country but not present in the other have been assigned a default count of 0 
cat_time_series_df = pd.concat(cat_time_series, axis=1).fillna(value=0)
display(cat_time_series_df[0:5])
tech career/business socializing outdoors/adventure language/ethnic identity health/wellbeing food/drink fitness fine arts/culture
2003 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2004 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2006 1.0 1.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
2007 0.0 4.0 5.0 2.0 0.0 0.0 1.0 1.0 0.0
2008 1.0 1.0 2.0 0.0 2.0 1.0 0.0 0.0 0.0
In [1426]:
ax1 = cat_time_series_df.plot(kind='bar', subplots=True, layout=(3,3), figsize=(12,10), sharey=True)
plt.suptitle("Growth in number of groups in meetup for Top Category",fontsize=20)
plt.savefig('graphs/groups_creation_category_wise.png',dpi=100, bbox_inches='tight')

These are small multiples bar chart to represent the number of groups formed category wise. For representational purposes only top 9 countries have been selected and more can be selected to visualize data. X-axis represents years and Y -axis represents count of groups formed.

Plot 4: Categories Popularity in top Global Innovative Countries

Aim: The aim is to find out the most popular groups. Popularity is based on the count of members that have joined a group

Plots Used: Treemap Charts

In [1254]:
# Custom Plot function to draw tree map
def get_treemap(sizes, norm_x=100, norm_y=100, label=None, value=None, ax=None,  **kwargs):
    if ax is None:
        ax = plt.gca()

    # create a color palette, mapped to these values
    cmap = matplotlib.cm.Wistia
    mini=min(sizes)
    maxi=max(sizes)
    norm = matplotlib.colors.Normalize(vmin=mini, vmax=maxi)
    color = [cmap(norm(value)) for value in sizes]

    
    # Gives Normalized values over area 
    normed = squarify.normalize_sizes(sizes, norm_x, norm_y)
    
    # Gives rectange coordinates from Normalized values over area 
    rects = squarify.squarify(normed, 0, 0, norm_x, norm_y)
    
    x = [rect['x'] for rect in rects]
    y = [rect['y'] for rect in rects]
    dx = [rect['dx'] for rect in rects]
    dy = [rect['dy'] for rect in rects]

    ax.bar(x, dy, width=dx, bottom=y, color=color, edgecolor='white', linewidth=1.0,
       label=label, align='edge', **kwargs)

    #Show Values (Member Count) in the center
    if not value is None:
        va = 'center' if label is None else 'top'
        
        #Iterate over values to add labels to axes
        for v, r in zip(value, rects):
            fz=r['dy']/2.5
            x, y, dx, dy = r['x'], r['y'], r['dx'], r['dy']
            ax.text(x + dx / 2, y + dy / 2, v, va=va, ha='center', fontsize=fz)
            
    #Show label (category name) in the center        
    if not label is None:
        va = 'center' if value is None else 'bottom'
        
        #Iterate over label names to add labels to axes
        for l, r in zip(label, rects):
            fz=r['dy']/2.5
            x, y, dx, dy = r['x'], r['y'], r['dx'], r['dy']
            ax.text(x + dx / 2, y + dy / 2, l, va=va, ha='center',fontsize=fz)
    
    ax.set_xlim(0, norm_x)
    ax.set_ylim(0, norm_y)
    return ax

Plot 4(a): Comparison of Categories Popularity cummulatively for Top 2 Global Innovative Countries

In [1427]:
members_list=members_df['members'].tolist()
label_list = members_df['category.name'].tolist()

fig1 = plt.figure(figsize=(12,10))
ax1 = fig1.add_subplot(1, 1, 1)
title = 'Tree Map for Popularity (Most joined category by members) in ' + ",".join(grouped_Df['country_name'].unique().tolist())
fig1.suptitle(title, fontsize=20)
get_treemap(sizes=members_list, label=label_list, value=members_list, alpha=0.9, ax=ax1)
#ax1.set_axis_off()
fig1.savefig('graphs/treeMap.png', dpi=100, bbox_inches='tight')
plt.show(fig1)

This tree map shows the popularity of the categories based on the number of members who joined it. The data has been normalized on the scale of (100,100) to get the uniform boxes.

Plot 4(b): Comparison of Top 2 Global Innovative Countries Categories Popularity

In [913]:
temp = grouped_Df

temp = temp.apply(pd.to_numeric, errors='ignore')

#display(temp [0:5])
members_country = temp.groupby(['category.name','country_name'])[['members']].agg('sum').sort_values(by=['members'], ascending=False).reset_index()

display(members_country[0:5])
category.name country_name members
0 tech Sweden 240087
1 tech Switzerland 168813
2 outdoors/adventure Switzerland 109286
3 career/business Switzerland 99586
4 career/business Sweden 84103
In [1428]:
countries = members_country['country_name'].unique().tolist()

fig2, ax = plt.subplots(2)

title = 'Tree Map for Popularity Comparison (Most joined category by members)'

fig2.suptitle(title, fontsize=20)
member = pd.DataFrame([])    
member = []
members_list = []
label_list=[]

for i in range(len(countries)):    
    member = members_country[members_country['country_name'] == countries[i]]
    members_list=member['members'].tolist()
    label_list = member['category.name'].tolist()
    
    get_treemap(sizes=members_list, label=label_list, value=members_list , alpha=0.9, ax=ax[i])
    ax[i].set_title(countries[i], fontsize=20)
    #ax[i].set_axis_off()
    ax[i].figure.set_size_inches(19, 10)

fig2.savefig('graphs/treeMap-comapre.png', dpi=100, bbox_inches='tight')
plt.show(fig2)

This tree map shows the popularity of the categories based on the number of members who joined it but using subplots to comapre data for two countries. The data has been normalized on the scale of (100,100) to get the uniform boxes.

Plot 5: Compare ratings of different categories for the selected countries

Aim: The idea is to compare what are the trends for ratings in the differnt countries based on categories

Plots Used: Bar Plot and Stacked Bar Plot

In [1066]:
#display(temp [0:5])
category_rating = temp.groupby(['category.name','country_name'])[['rating']].agg('mean').sort_values(by=['rating'], ascending=False).reset_index()

display(category_rating[0:5])
category.name country_name rating
0 sci-fi/fantasy Switzerland 4.890000
1 book clubs Switzerland 4.254444
2 dancing Sweden 4.154444
3 writing Sweden 3.828000
4 photography Sweden 3.770556
In [1255]:
# Min-Max Normalization to create a new column normalized_rating
# By creating new column, one can easily compare values before and after normalization
category_rating['normalized_rating']=(category_rating['rating']-category_rating['rating'].min())/(category_rating['rating'].max()-category_rating['rating'].min())
category_rating[0:5]
Out[1255]:
category.name country_name rating normalized_rating
0 sci-fi/fantasy Switzerland 4.890000 1.000000
1 book clubs Switzerland 4.254444 0.870030
2 dancing Sweden 4.154444 0.849580
3 writing Sweden 3.828000 0.782822
4 photography Sweden 3.770556 0.771075
In [1256]:
countries = category_rating['country_name'].unique().tolist()

# Modify the structure of data and make different columns for the different countries
df_arr = []
for i in range(len(countries)):
    df_arr.append(category_rating[category_rating['country_name']==countries[i]])
    df_arr[i] = df_arr[i].drop(['rating','country_name'], axis=1).set_index('category.name').rename(index=str, columns={"normalized_rating": countries[i]})
    
cat_rating_df = pd.concat(df_arr, axis=1, join="inner")
display(cat_rating_df[0:5])
Switzerland Sweden
category.name
sci-fi/fantasy 1.000000 0.659168
book clubs 0.870030 0.546779
singles 0.682734 0.442740
tech 0.622019 0.627446
movies/film 0.601113 0.397702
Plot 5(a): Bar Plot
In [1429]:
axes = cat_rating_df.plot.bar(figsize=(15,5))
axes.set_xlabel("Categories",fontsize=14)
axes.set_ylabel("Rating",fontsize=14)
axes.set_title("Ratings vs Categories Comaparison for two countries",fontsize=20)
plt.savefig('graphs/category_ratings.png',dpi=100, bbox_inches='tight')

This graph represnts the rating vs categories bar plot for comparing two countries. The rartings have been normalized using min-max normalization. Once this has been done, the ratings are now plotted against the categories to find out patterns. X-axis represents categories and Y -axis represents normalized ratings.

Plot 5(b): Stacked Bar Plot
In [1430]:
axes = cat_rating_df.plot.bar(stacked=True,figsize=(15,5))
axes.set_xlabel("Categories",fontsize=14)
axes.set_ylabel("Rating",fontsize=14)
axes.set_title("Ratings vs Categories Comaparison for two countries",fontsize=20)
plt.savefig('graphs/category_ratings_stacked.png',dpi=100, bbox_inches='tight')

This graph represnts the rating vs categories stacked bar plot for comparing two countries. The rartings have been normalized using min-max normalization. Once this has been done, the ratings are now plotted against the categories to find out patterns. X-axis represents categories and Y -axis represents normalized ratings.

Plot 6: Members vs Rating vs No. of Groups to Rating for Different Categories

Aim: The aim is to find out the relation between ratings, members in group and groups count

Plots Used: Scatter Plot, Scatter Matrix(Small Multiples), Line Chart(Small Multiples), Dual Axis Chart

In [1431]:
category_rating_dual = temp.groupby(['category.name'])[['rating']].agg('mean').reset_index().set_index('category.name').rename(index=str, columns={"rating": "mean_rating"})
groups_df = temp.groupby(['category.name'])[['id']].agg('count').reset_index().set_index('category.name').rename(index=str, columns={"id": "group_count"})
members_df_copy = members_df.set_index('category.name')

cat_data_unnormalized = pd.concat([members_df_copy, category_rating_dual,groups_df], axis=1)

display(cat_data_unnormalized[0:5])
members mean_rating group_count
LGBT 4282 1.205313 32
alternative lifestyle 449 1.357143 14
book clubs 4802 3.242800 25
career/business 183689 2.423424 663
cars/motorcycles 628 1.690909 11
In [1332]:
# Normalization of data using min-max normalization

cat_data = pd.DataFrame(cat_data.index.tolist()).set_index(0).rename_axis('Categories')


cat_data['mean_rating']=(cat_data_unnormalized['mean_rating']-cat_data_unnormalized['mean_rating'].min())/(cat_data_unnormalized['mean_rating'].max()-cat_data_unnormalized['mean_rating'].min())
cat_data['members']=(cat_data_unnormalized['members']-cat_data_unnormalized['members'].min())/(cat_data_unnormalized['members'].max()-cat_data_unnormalized['members'].min())
cat_data['group_count']=(cat_data_unnormalized['group_count']-cat_data_unnormalized['group_count'].min())/(cat_data_unnormalized['group_count'].max()-cat_data_unnormalized['group_count'].min())

display(cat_data[0:5])
mean_rating members group_count
Categories
LGBT 0.297119 0.010358 0.022371
alternative lifestyle 0.334546 0.000983 0.008949
book clubs 0.799376 0.011630 0.017151
career/business 0.597393 0.449164 0.492916
cars/motorcycles 0.416822 0.001421 0.006711

Plot 6(a): Scatter Plot

In [1432]:
ax = cat_data.plot.scatter(x="members", y="group_count", c='mean_rating', cmap = matplotlib.cm.jet, alpha=0.4 , s=250)
ax.grid(True,linestyle='-',color='0.75')

ax.text(0.75, -0.1, 'Number of Members',
        verticalalignment='bottom', horizontalalignment='right',
        transform=ax.transAxes)

plt.suptitle("Categories Data Scatter Plot",fontsize=14)
plt.savefig('graphs/scatter_plot_categories.png',dpi=100, bbox_inches='tight')

This graph represnts the rating vs categories vs members using scatter plot. The ratings have been normalized to a color scale based on mean ratings for the data. Once this has been done, the ratings are now plotted against the categories to find out patterns. X-axis represents number of members and Y-axis represents normalized ratings.

Plot 6(b): Scatter Matrix

In [1455]:
sns.pairplot(cat_data)
ax = plt.gca()
plt.suptitle("Small Multiples for category dimensions via Scatter Plot",fontsize=14, y=1.08)
plt.savefig('graphs/scatter_categories_small_multiple.png',dpi=100, bbox_inches='tight')

This graph represents the scatter matrix plot for mean rating, number of members and group count. The data has been normalized already using min-max normalization. The other half of the graph is identical.

Plot 6(c): Small Multiples Line Chart

In [1442]:
axes = cat_data.plot(rot=90,subplots=True, figsize=(6, 6),x_compat=True);
plt.xticks(range(len(cat_data.index.tolist())), cat_data.index.tolist(), size='small')

plt.suptitle("Small Multiples for category dimensions via Line Chart",fontsize=14)
plt.savefig('graphs/line_categories_small_multiple.png',dpi=100, bbox_inches='tight')

This graph represents the small multiples for mean rating, number of members and group count using line chart. The data has been normalized already using min-max normalization. The other half of the graph is identical, so we are interested X-axis represents categories and Y-axis represents value.

Plot 6(d): Dual Axes Line Chart

In [1441]:
ax = cat_data.plot(rot=90,secondary_y=['group_count'])
ax.set_ylabel("Members Count")
ax.right_ax.set_ylabel("Group Count")
plt.xticks(range(len(cat_data.index.tolist())), cat_data.index.tolist(), size='small')
plt.title("Dual Axis Line Chart for different categories\n")
plt.xlabel("Time")

plt.savefig('graphs/line_categories_dual_axes.png',dpi=100, bbox_inches='tight')

Plot 7: Frequency Distribution for groups in different cities of one country

Aim: The aim is to find out which city has more number of groups

Plots Used: Bar Chart - frequeny distribution

In [1440]:
from collections import Counter

city_data = temp[['city','country_name']]
countries = city_data['country_name'].unique().tolist()

for i in range(len(countries)):
    a = city_data.loc[city_data['country_name'] == countries[i]]['city'].tolist()
    letter_counts = Counter(a)
    df = pd.DataFrame.from_dict(letter_counts, orient='index')
    df.columns = ['count']
    df['count']=(df['count']-df['count'].min())/(df['count'].max()-df['count'].min())

    threshold = 0.005
    # Remove rows less than the threshold
    df = df[df['count'] > threshold].sort_values(by=['count'])
    df.plot(kind='bar')
    plt.suptitle("Frequency Distribution of groups among top cities of "+countries[i],fontsize=20)
    plt.ylabel("Normalized Frequency")
    plt.xlabel("Cities")
    plt.savefig("graphs/groups_distribution_city_wise"+countries[i]+".png",dpi=100, bbox_inches='tight')

These graphs represents the most active countries with maximum number of group counts. A threshold of 0.05 was used over normalized values to remove the cities which as less active or has lesser number of groups. x-axis represents the cities and y axis represents the frequency of groups in these cities.

5.2. Interpretations and Data Insights

There are many conclusions that can be drawn from the given graphs before:

  1. Plot 1: Comparison of growth of groups for two countries

    It can be clearly seen from the graphs that both the innovative countries are showing growth in the group count over the years. But the point specifically to be noted is Switzerland is ranked first in Global innovation index and Sweden as second. The groups count is more in Switzerland than in Sweden and the gap in increasing. Considering, year 2018 is not complete yet, trend says, in 2018 the number groups will increase

  2. Plot 2: Growth in number of groups - monthly data

    There is no clear trend for the increase in the groups growth accross the different months. However, the starting months January, February and March saw the major growth.

  3. Plot 3: Growth in number of groups category-wise

    It can be clearly figured out that the innovative countries are focussing more on Technology field and have been joining it in more and more numbers. Also, people in these countries are more focussed on career / business. Though this field is behind technology but the gap in number of members in career/ business and other categories is significant. Apart from them, health/ well being has seen a good growth in year 2017.

  4. Plot 4: Categories Popularity in top Global Innovative Countries

     Like the previous graphs, we are now comparing categories based on member count:
    
     - The top joined categories are: tech, socializing, career/business and outdoor adventures
     - Also, if we compare the previous graph with this one, we will find out that even though socializing category does not have too many groups but still it has a lot of members
    
     Comparison between Switzerland and Sweden:
    
     - Sweden is showing a bias towards tech. Tech is also the most joined category in Switzerland but it's people are still joining other groups as well and the gap is smaller.
    
    


    This means country focussing on all the fields is more innovative than others which is biased towards one field.

  5. Plot 5: Compare ratings of different categories for the selected countries

    This data shows that there is no clear trend in the ratings provided by the users. Less number of members could be the reason for irregular trends. However, one point to be noted is that the categories which have the highest number of members in the two countries share the similar or slightly different rating for those categories like tech, socializing, career/business and outdoor adventures

  6. Plot 6: Members vs Rating vs No. of Groups to Rating for Different Categories

    As in the previous graphs, these graphs also depict no clear picture for rating, but the trends is strong, whcih suggests as the number groups increases, member counts also increases. As seen in the dual axes chart for members and group count or in Small Multiple of line chart.

  7. Plot 7: Frequency Distribution for groups in different cities of one country

    The plot represents the frequency of groups in different cities of the country. A threshold of 0.05 was used to select the cities. It depicts which cities are most active on meetup. Zurich and Stockholm are the most active cities in the two countries. Stockholm being the capital of Sweden might be the reason for it popularity but Zurich is not the capital of Switzerland still has a lot of popularity.

Tentative Conclusion:

Any country which wants to increase it's rank in the global innovation index can take insights from this experiement and encourage more groups which leads to innovation.

Higher the number of groups, more are the members.
Higher the number of groups, more is the global innovation index.

5.3. Future Scope

More statisfical modelling can be applied to this data to find out that there is strong relation between number of groups and members and the countries with high global innovation index are involved more in meetups. Currently, only two countries were selected.

But Limited data was collected since the resources and time was limited. More data can be picked from around the world for multiple countries and staistical modelling can be perfromed on the various countries data to compare data.

The members details and the events around the world is a high volumetric data and could not be picked since it require high memory constraints. Also, further since it give coordinates(latitude, longitude), success of a group can be discovered using more detailed data which can be used by companies to share their new innovations. Companies can also benefit from the data, because before organizing any event they might analyze success of the event based on numbers.

Meetup provides more set of open APIs.

For example:

  1. Member details and how many groups they have joined. This can be used to track the activity and trends for the members and give then tailored recommendations. Also, countries can promote categories among members if the country is lagging behind in a category which is highly joined by innovative countries.

  2. It also gives venue data. So, the numbers members attending an event may depend on venues and further data can be analyzed to select top venues having a high turnout ratio.

In [ ]: