How to Utilize Asyncio to Conduct Company Research

Overview

One day Mei asked me to recommend some tech companies for her next job. While I am more familiar with data science/ML roles at tech companies, she is looking for software engineer opportunities with a strong software/engineering focus. I tried my best to brush up my memory of startups and big companies in the Bay Area, but after an hour I had come up with no more than 10 companies, either FAANG or ML-related companies like H2O.ai. Apparently, I am not familiar with this field; the first drafted list was neither comprehensive nor relevant.

Once I found this curated list of engineering blogs, I told myself this was the way to go. Companies, big or small, are likely operating with good engineering practices if they maintain good engineering blogs. This is a better starting point than random career pages and LinkedIn ads, since the eng blogs tell a lot more about what is happening under the hood in the engineering department. Fortunately, I still remember how to build crawlers, and I learned asyncio from my coworker Vincent. Why not use these tools and the Internet to enhance the list and build a good engineering company list?

Introduction to Asyncio

Though it might not be necessary to use asyncio with Python 3.7 for such a trivial case, it is a good exercise and an expandable solution to go with. Before diving into the details of the use case, let's have a look at asynchronous I/O and its implementation in Python 3.7.

At a high level, asyncio is a library for writing concurrent code using the async/await syntax in Python 3.

High Level APIs

Coroutines

Coroutines are a high-level API/abstraction declared with the async/await syntax, and they are the preferred way of writing asyncio applications (Python 3.7+).

Awaitables

We say that an object is an awaitable object if it can be used in an await expression. Many asyncio APIs are designed to accept awaitables.

There are three main types of awaitable objects:

  • Coroutines
  • Tasks: used to schedule coroutines concurrently
  • Futures: low-level awaitable objects that represent the eventual result of an asynchronous operation

Some examples are listed below.

>>> import asyncio
>>> async def main():
...     print('hello')
...     await asyncio.sleep(1)
...     print('world')
>>> asyncio.run(main())
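The snippet above runs a single coroutine. To illustrate the Tasks mentioned in the list, here is a minimal sketch (my own example, not from the original post) that schedules two coroutines concurrently with asyncio.create_task() and asyncio.gather():

import asyncio

async def say_after(delay, message):
    # Sleep without blocking the event loop, then print.
    await asyncio.sleep(delay)
    print(message)

async def main():
    # Wrapping coroutines in Tasks lets them run concurrently.
    task1 = asyncio.create_task(say_after(1, 'hello'))
    task2 = asyncio.create_task(say_after(1, 'world'))
    # Both finish after roughly 1 second total, not 2.
    await asyncio.gather(task1, task2)

asyncio.run(main())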

Learn More about Companies from Google and Wikipedia

You can easily extract a list of company names and eng blog URLs from the GitHub repos (repo1, repo2) after some basic regex parsing. Check out https://regexone.com/ or https://regex101.com/ if you need some extra help.
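As an illustration, here is a hedged sketch that assumes the README lists blogs as markdown links such as `* [Company Name](https://blog.example.com/)`; the actual line format in those repos may differ slightly:

import re

# Hypothetical README excerpt; the real repos may format entries differently.
readme_text = """
* [Airbnb](https://medium.com/airbnb-engineering)
* [Databricks](https://databricks.com/blog/category/engineering)
"""

# Capture the link text (company name) and the URL from markdown list items.
pattern = re.compile(r'\*\s*\[([^\]]+)\]\((https?://[^)]+)\)')
companies = [{'name': name, 'url': url} for name, url in pattern.findall(readme_text)]
print(companies)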

Get Locations from the Google Places API

The new Google Places API requires signing up and entering credit card information, but it is free for minimal usage in the first year. Check out https://developers.google.com/places/web-service/intro for more details, and be cautious not to expose your tokens and to control your cost.
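One simple way to avoid exposing the token is to read it from an environment variable rather than hard-coding it. A minimal sketch (the variable name GOOGLE_PLACES_API_KEY is my own assumption):

import os

# Hypothetical variable name; export GOOGLE_PLACES_API_KEY in your shell before running.
GOOGLE_API_KEY = os.environ.get('GOOGLE_PLACES_API_KEY')
if not GOOGLE_API_KEY:
    raise RuntimeError('GOOGLE_PLACES_API_KEY is not set')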

For our use case, it is important to find the location, latitude, and longitude of each tech company's offices. The Find Place search endpoint (findplacefromtext) is the one to go with. It is a bit tricky: if you search for the company name by itself in the Places API, you may get no result. To increase the match rate, extra location context such as USA, Bay Area, or Seattle needs to be combined with the company name. Including keywords like office or headquarters also helps.

import aiohttp
import asyncio
import json

GOOGLE_PLACE_API = 'https://maps.googleapis.com/maps/api/place/findplacefromtext/json'


async def fetch(session, url, params):
    # Issue a non-blocking GET and return the response body as text.
    async with session.get(url, params=params) as response:
        return await response.text()


async def fetch_company(session, url=GOOGLE_PLACE_API, company=None):
    # Plain spaces in the query text: aiohttp URL-encodes query parameters for us.
    params = {'input': '{} office Bay Area usa'.format(company['name']),
              'inputtype': 'textquery',
              'fields': 'formatted_address,name,geometry',
              'key': 'YOUR_API_KEY'
              }
    resp = await fetch(session, url, params)
    company.update({'place_response': json.loads(resp)})
    return company


async def main(companies):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for comp in companies:
            tasks.append(fetch_company(session, GOOGLE_PLACE_API, comp))
        res = await asyncio.gather(*tasks)
    # write output to json lines once everything is done
    with open('company_google_place_search.jsonl', 'w') as f:
        for r in res:
            f.write(json.dumps(r) + '\n')


if __name__ == '__main__':
    company_sample = [{'name': 'facebook', 'url': 'https://code.fb.com/'},
                      {'name': 'databricks', 'url': 'https://databricks.com/blog/category/engineering'}]
    asyncio.run(main(company_sample))

After running the script above, we get detailed place information about each company we search for. Here is an expanded JSON Lines record.

{
  "name": "databricks",
  "url": "https://databricks.com/blog/category/engineering",
  "place_response": {
    "candidates": [
      {
        "formatted_address": "160 Spear St 13th floor, San Francisco, CA 94105, USA",
        "geometry": {
          "location": {
            "lat": 37.791365,
            "lng": -122.393741
          },
          "viewport": {
            "northeast": {
              "lat": 37.79293997989272,
              "lng": -122.3921091701073
            },
            "southwest": {
              "lat": 37.79024032010727,
              "lng": -122.3948088298927
            }
          }
        },
        "name": "Databricks Inc."
      }
    ],
    "status": "OK"
  }
}

Get Firmographic Data from Wikipedia

After getting the location, we need more detailed information about the companies. Wikipedia is a good source of information for well-known public and private companies, with useful headcount, revenue, industry, and founder information.

For comparison purposes, two methods are presented below: one synchronous() and one asynchronous().

Synchronous Method & Asynchronous Method

import aiohttp
import asyncio
import json
import requests
from scrapy.http import HtmlResponse
from time import perf_counter as pc

WIKIPEDIA_API = 'https://en.wikipedia.org/w/api.php'
WIKIPEDIA_PAGE_FORMAT = 'https://en.wikipedia.org/api/rest_v1/page/html/{}'


def get_infobox(response):
    # Shared by both methods: pull the template params out of the infobox table.
    for node in response.xpath('//table[contains(@class, "infobox")]'):
        raw_data = node.xpath('@data-mw').extract()
        if not raw_data:
            continue
        for item in raw_data:
            item = json.loads(item)
            try:
                return item['parts'][0]['template']['params']
            except Exception as e:
                print(e)


# --- Synchronous version (requests) ---

def get_title(name):
    params = {'action': 'query',
              'origin': '*',
              'format': 'json',
              'generator': 'search',
              'gsrnamespace': 0,
              'gsrlimit': 1,
              'gsrsearch': name
              }
    response = requests.get(WIKIPEDIA_API, params=params)
    res = response.json()
    if res.get('query', {}).get('pages', {}):
        return list(res['query']['pages'].values())[0]['title']


def get_wiki_info_box(name):
    title = get_title(name)
    print(title, end=', ')
    if title:
        url = WIKIPEDIA_PAGE_FORMAT.format(title)
        body = requests.get(url).content
        response = HtmlResponse(url=url, body=body)
        return get_infobox(response)


def synchronous(names):
    res = [get_wiki_info_box(name) for name in names]
    return res


# --- Asynchronous version (aiohttp) ---

async def fetch(session, url, params):
    async with session.get(url, params=params) as response:
        return await response.text()


async def fetch_wiki_title_res(session, url=WIKIPEDIA_API, name=None):
    params = {'action': 'query',
              'origin': '*',
              'format': 'json',
              'generator': 'search',
              'gsrnamespace': 0,
              'gsrlimit': 1,
              'gsrsearch': name
              }
    res = json.loads(await fetch(session, url, params))
    if res.get('query', {}).get('pages', {}):
        return list(res['query']['pages'].values())[0]['title']


async def fetch_wiki_info_box(session, format_url=WIKIPEDIA_PAGE_FORMAT, name=None):
    title = await fetch_wiki_title_res(session, url=WIKIPEDIA_API, name=name)
    print(title, end=', ')
    if title:
        url = format_url.format(title)
        resp = await fetch(session, url=url, params=None)
        response = HtmlResponse(url=url, body=resp, encoding='utf-8')
        return get_infobox(response)


async def asynchronous(names):
    tasks = []
    async with aiohttp.ClientSession() as session:
        for name in names:
            tasks.append(fetch_wiki_info_box(session, WIKIPEDIA_PAGE_FORMAT, name))
        res = await asyncio.gather(*tasks)
    return res
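Once the infobox params come back, you can pull out individual fields. Below is a hedged sketch of my own: the parameter names (num_employees, revenue, industry, founder) vary by article, and the nested 'wt' wikitext key is an assumption about the Parsoid data-mw format, so treat this as illustrative rather than definitive:

def extract_fields(infobox_params, fields=('num_employees', 'revenue', 'industry', 'founder')):
    """Pull selected raw wikitext values out of the infobox params dict, if present."""
    result = {}
    if not infobox_params:
        return result
    for field in fields:
        value = infobox_params.get(field, {})
        # Each param is typically a dict with a 'wt' (wikitext) entry.
        if isinstance(value, dict) and 'wt' in value:
            result[field] = value['wt']
    return result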

Comparison

To evaluate the sync vs. async methods, 100 company names are used for benchmarking. More specifically, the same data are passed into the synchronous() and asynchronous() methods defined above in the same environment (macOS, CPU: 2.2 GHz Intel Core i7, Memory: 16 GB 1600 MHz DDR3). The asynchronous() method is 7.7x faster than the synchronous() method on this small sample, where a total of 200 HTTP requests are involved in each case. Though both methods include some non-trivial synchronous work such as HTML parsing, it is clear that the asyncio + aiohttp non-blocking workflow makes the whole Wikipedia extraction process much faster.

if __name__ == '__main__':
    # 100 sample names
    sample_names = ['Netflix', 'Google', 'Yahoo', 'Uber', 'Adobe', 'Lyft', 'Facebook', 'DiDi', 'Stripe', 'Salesforce'] * 10
    total = len(sample_names)
    print('Synchronous:')
    print('Wiki titles of the first ten results:')
    t0 = pc()
    synchronous(sample_names)
    print('finished extracting {} wiki records in {:.2f} seconds'.format(total, pc() - t0))
    print('Asynchronous:')
    print('Wiki titles of the first ten results:')
    print()
    t1 = pc()
    asyncio.run(asynchronous(sample_names))
    print()
    print('finished extracting {} wiki records in {:.2f} seconds'.format(total, pc() - t1))

Here are the results for reference. Another interesting observation: the completion order of the asynchronous requests is not guaranteed, which is why the titles are printed out of order as each request finishes, whereas the synchronous method processes and prints results in the exact order of the input. (The list returned by asyncio.gather() still matches the input order; the small sketch after the output below makes that distinction concrete.)

Synchronous:
- finished extracting 100 wiki records in 54.54 seconds
- Wiki titles of the results:
Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com, Netflix, Google, Yahoo!, Uber, Adobe, Lyft, Facebook, DiDi, Stripe, Salesforce.com,
===
Asynchronous:
- finished extracting 100 wiki records in 7.06 seconds
- Wiki titles of the results:
Lyft, Uber, Yahoo!, Salesforce.com, Lyft, Lyft, Lyft, Lyft, Lyft, Salesforce.com, Yahoo!, Stripe, Lyft, Uber, Salesforce.com, Adobe, Uber, Salesforce.com, Stripe, Netflix, Salesforce.com, Netflix, Salesforce.com, Uber, Netflix, Adobe, Google, Salesforce.com, Salesforce.com, Netflix, Lyft, Uber, Netflix, Adobe, Adobe, Uber, Stripe, Stripe, Salesforce.com, Netflix, Uber, Netflix, Stripe, Stripe, Stripe, Stripe, Adobe, DiDi, Stripe, Adobe, Google, Netflix, DiDi, Adobe, Google, Netflix, Google, Uber, Google, Yahoo!, Uber, Uber, Netflix, DiDi, Adobe, Yahoo!, Yahoo!, Facebook, Salesforce.com, DiDi, Adobe, DiDi, Facebook, Google, Stripe, Google, Google, Facebook, Yahoo!, DiDi, Yahoo!, Yahoo!, Facebook, Yahoo!, Facebook, Facebook, DiDi, DiDi, Google, DiDi, Facebook, Google, Lyft, Lyft, Adobe, Facebook, Facebook, Facebook, DiDi, Yahoo!,
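A minimal sketch of my own, independent of the Wikipedia code, illustrating that completion order varies while the gathered results keep the input order:

import asyncio
import random

async def work(i):
    # Finish after a random delay, so completion order is unpredictable.
    await asyncio.sleep(random.random())
    print('completed', i)  # printed in completion order
    return i

async def main():
    results = await asyncio.gather(*(work(i) for i in range(5)))
    print('gathered:', results)  # always [0, 1, 2, 3, 4], i.e. input order

asyncio.run(main())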

Next Steps

How far is it from my location?

There are lots of things to play with once you have the data from Google Places and Wikipedia. One thing you could try is to calculate the distance between a target company and your current location via the geopy library. This is super useful for filtering out offices that are too far away, or for ranking companies by proximity, as the snippets below sketch out.

import geopy.distance

# vincenty() was removed in geopy 2.0; geodesic() is the modern equivalent.
company_distance = geopy.distance.geodesic((lat1, lng1), (lat2, lng2)).miles
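Building on the JSON Lines output from the Places step, here is a hedged sketch of ranking companies by distance from a home location. The home coordinates are an arbitrary example and the field names mirror the earlier output, so treat this as illustrative:

import json
import geopy.distance

HOME = (37.7749, -122.4194)  # example home location: downtown San Francisco

def company_distance_miles(record, home=HOME):
    """Distance in miles from home to the first Place API candidate, if any."""
    candidates = record.get('place_response', {}).get('candidates', [])
    if not candidates:
        return None
    loc = candidates[0]['geometry']['location']
    return geopy.distance.geodesic(home, (loc['lat'], loc['lng'])).miles

with open('company_google_place_search.jsonl') as f:
    records = [json.loads(line) for line in f]

# Rank companies with a resolved location by proximity to home.
ranked = sorted(
    (r for r in records if company_distance_miles(r) is not None),
    key=company_distance_miles,
)
for r in ranked:
    print(r['name'], round(company_distance_miles(r), 1), 'miles')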

What are those companies talking about in the engineering groups?

For this trivial case, crawlers are not necessary. However, if we want to extend the work to cover every page of each company's eng blog, some form of crawler system is a must. A simple, Pythonic way to do this is to use Scrapy to crawl the relevant eng blog pages, then apply text analytics to generate word clouds or group companies by technology. That would be a fun exercise, with interesting facts to discover; a minimal spider sketch follows.
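As a starting point, here is a hedged sketch of a Scrapy spider that collects post titles from an engineering blog. The start URL and CSS selectors are assumptions for illustration; each real blog needs its own selectors:

import scrapy

class EngBlogSpider(scrapy.Spider):
    name = 'eng_blog'
    # Hypothetical start URL; swap in the blog pages collected earlier.
    start_urls = ['https://databricks.com/blog/category/engineering']

    def parse(self, response):
        # Placeholder selectors; inspect each blog's markup before crawling.
        for post in response.css('article'):
            yield {
                'title': post.css('h2 a::text').get(),
                'url': post.css('h2 a::attr(href)').get(),
            }
        # Follow pagination links, if present.
        next_page = response.css('a.next::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)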

More information please!

There are always more data sources out there that provide detailed information about public/private companies. Here are some websites and services to start with:

Final Words

Here is part of the curated list I got from this exercise. I hope you enjoyed the post.

Final List

References