Overview
One day Mei asked me to recommend some tech companies for her next job. While I am more familiar with data science/ML jobs in tech companies, she is looking for software engineer opportunities with a strong software engineering focus. I tried my best to brush up my memory of startups and big companies in the Bay Area; however, after an hour I had come up with no more than 10 companies, either FAANG or ML-related companies like H2O.ai. Apparently, I am not familiar with this field. The first draft of the list was neither comprehensive nor relevant.
Once I found this curated list of engineering blogs, I said to myself that this was the way to go. Companies, big or small, are likely to follow good engineering practices if they maintain good engineering blogs. This is actually a better starting point than random career pages and LinkedIn ads, since the eng blogs tell a lot more about what is happening under the hood in the engineering department. Fortunately, I still remember how to build crawlers, and I learned asyncio from my coworker Vincent. Why not use these tools and the Internet to enhance the list and build a good engineering company list?
Introduction to Asyncio
Though it might not be necessary to use asyncio with Python 3.7 for this trivial case, it is a good exercise and an extensible solution to go with. Before diving into the details of the use case, let’s take a look at asynchronous I/O and its implementation in Python 3.7.
At a high level, asyncio is a library for writing concurrent code using the async/await syntax in Python 3.
High Level APIs
Coroutines
Coroutines declared with the async/await syntax are the preferred way of writing asyncio applications (Python 3.7+).
Awaitables
We say that an object is an awaitable object if it can be used in an await expression. Many asyncio APIs are designed to accept awaitables.
There are three main types of awaitable objects:
- coroutines
- Tasks: used to schedule coroutines concurrently.
- Futures: a low-level awaitable object that represents an eventual result of an asynchronous operation
Some examples are listed below.
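As a minimal sketch of the three awaitable types (the function names fetch_label and main are made up for illustration, and asyncio.sleep stands in for real I/O), something like the following should run on Python 3.7+:

```python
import asyncio


async def fetch_label(name, delay):
    """A coroutine: calling it returns a coroutine object that must be awaited."""
    await asyncio.sleep(delay)  # stands in for a non-blocking I/O call
    return f"{name} done"


async def main():
    # 1. Coroutine: awaited directly, so it finishes before the next line runs.
    first = await fetch_label("coroutine", 0.01)

    # 2. Task: wraps a coroutine and schedules it on the event loop right away.
    task = asyncio.create_task(fetch_label("task", 0.01))

    # 3. Future: a low-level placeholder whose result is set later by a callback.
    loop = asyncio.get_running_loop()
    future = loop.create_future()
    loop.call_later(0.01, future.set_result, "future done")

    second = await task
    third = await future
    return [first, second, third]


results = asyncio.run(main())
print(results)
```

In everyday code you will mostly write coroutines and tasks; futures tend to show up only when bridging callback-based code.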
Learn more about Companies from Google and Wikipedia
You can easily extract a list of company names and eng blog URLs from the GitHub repos (repo1, repo2) after some basic regex parsing. Check out https://regexone.com/ or https://regex101.com/ if you need some extra help.
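Here is a rough sketch of that regex parsing step. It assumes each entry in the README is a bullet of the form "* Company Name https://...", which may not match every line in the actual repos; the pattern, the function name parse_blog_list, and the sample text are all illustrative.

```python
import re

# Assumed line format: "* Company Name https://example.com/blog"
LINE_PATTERN = re.compile(r"^\s*[*-]\s+(?P<name>.+?)\s+(?P<url>https?://\S+)\s*$")


def parse_blog_list(markdown_text):
    """Return (company, url) pairs for every line that matches the assumed format."""
    companies = []
    for line in markdown_text.splitlines():
        match = LINE_PATTERN.match(line)
        if match:
            companies.append((match.group("name"), match.group("url")))
    return companies


# Hypothetical sample input, not taken from the real repos.
sample = """\
#### A companies
* Acme Corp https://acme.example.com/engineering
* Beta Labs https://blog.betalabs.example.org/
Not a list item
"""

for name, url in parse_blog_list(sample):
    print(name, url)
```

Lines that do not fit the pattern (section headers, free text) are simply skipped, so it is worth checking how many entries were dropped before trusting the output.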
Get Locations from Google Place API
The new Google Places API requires signing up and filling in credit card information, but it’s free for minimal usage in the first year. Check out https://developers.google.com/places/web-service/intro for more details, and be cautious not to expose your tokens and to keep your costs under control.
For our use case, it’s important to find out the location, latitude, and longitude of the tech companies’ offices. The search endpoint is the one to go with. It is a bit tricky: if you search for the company name by itself in the Google Places API, you will get no result. To increase the match rate, more location information such as USA, Bay Area, or Seattle needs to be added to the company name. Including keywords like office or headquarters also helps.
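The query-building idea above can be sketched as follows. This is not the original script: it assumes the Text Search variant of the search endpoint, uses only the standard library, and the helper names (build_query, build_url, extract_locations, search_company) are made up for illustration. The response fields read here (results, name, formatted_address, geometry.location) follow the documented Text Search response shape.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Text Search flavour of the Places search endpoint.
TEXT_SEARCH_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"


def build_query(company, region="Bay Area", keyword="office"):
    """Combine the company name with a keyword and a region to raise the match rate."""
    return f"{company} {keyword} {region}"


def build_url(query, api_key):
    """Build the request URL; keep the real API key out of source control."""
    return f"{TEXT_SEARCH_URL}?{urlencode({'query': query, 'key': api_key})}"


def extract_locations(payload):
    """Pull name, address, latitude and longitude out of a decoded response."""
    locations = []
    for place in payload.get("results", []):
        point = place.get("geometry", {}).get("location", {})
        locations.append({
            "name": place.get("name"),
            "address": place.get("formatted_address"),
            "lat": point.get("lat"),
            "lng": point.get("lng"),
        })
    return locations


def search_company(company, api_key):
    """Perform one blocking request; this needs network access and a valid key."""
    url = build_url(build_query(company), api_key)
    with urlopen(url, timeout=10) as response:
        return extract_locations(json.load(response))
```

Calling search_company("H2O.ai", api_key) would issue a single billable request, so it makes sense to cache the raw responses while experimenting.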
After running the script above, we can get the detailed place information about the company we are searching for. Here is an expanded json line output.
Get Firmographic Data from Wikipedia
After getting the location, we need more detailed information about the companies. Wikipedia is a good source of information for some well-known public/private companies with useful headcount, revenue, industry and founder information.
For comparison purposes, two methods are presented below: one is synchronous() and the other is asynchronous().
Synchronous Method & Asynchronous Method
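The original code for the two methods is not reproduced here. As a stand-in, the sketch below imitates their shape with simulated latency (time.sleep and asyncio.sleep) instead of real Wikipedia requests, so the function bodies, the DELAY constant, and the timed helper are all assumptions made for illustration. A real asynchronous version would use aiohttp for the HTTP calls.

```python
import asyncio
import time

DELAY = 0.05  # simulated latency per request, in seconds (assumption)


def fetch_sync(name):
    time.sleep(DELAY)  # blocks the whole program while "waiting on the network"
    return name.upper()


async def fetch_async(name):
    await asyncio.sleep(DELAY)  # yields control so other requests can proceed
    return name.upper()


def synchronous(names):
    # Requests run one after another: total time is roughly len(names) * DELAY.
    return [fetch_sync(name) for name in names]


async def asynchronous(names):
    # Requests overlap on the event loop: total time is roughly one DELAY.
    return await asyncio.gather(*(fetch_async(name) for name in names))


def timed(func, *args):
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


names = [f"company-{i}" for i in range(10)]
sync_result, sync_seconds = timed(synchronous, names)
async_result, async_seconds = timed(lambda: asyncio.run(asynchronous(names)))
print(f"sync: {sync_seconds:.2f}s, async: {async_seconds:.2f}s")
```

Because the waiting is simulated, the speedup printed here says nothing about real-world numbers; it only shows why overlapping I/O waits helps.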
Comparison
To evaluate the sync vs. async methods, 100 company names are used for benchmarking. More specifically, the same data are passed into the synchronous() and asynchronous() methods defined above in the same environment (macOS, CPU: 2.2 GHz Intel Core i7, Memory: 16 GB 1600 MHz DDR3). The asynchronous() method is 7.7X faster than the synchronous() method in a small sample where a total of 200 HTTP requests are involved in each case. Though there are some non-trivial synchronous operations in both methods, such as HTML parsing, it is clear that the asyncio + aiohttp non-blocking workflow makes the whole Wikipedia extraction process much faster.
Here are the results for reference. Another interesting point is that the order of the asynchronous results is not guaranteed, whereas the synchronous method returns the results in exactly the same order as the input.
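Whether the asynchronous results come back out of order depends on how they are collected. The sketch below (with made-up names and simulated delays) contrasts asyncio.as_completed, which yields results as they finish, with asyncio.gather, which returns them in input order. Which of the two the original asynchronous() relied on is not shown in this post, so treat this as an illustration of the general behavior only.

```python
import asyncio


async def lookup(name, delay):
    await asyncio.sleep(delay)  # simulated request latency
    return name


async def in_completion_order(jobs):
    # as_completed yields awaitables in the order the tasks finish.
    tasks = [asyncio.create_task(lookup(name, delay)) for name, delay in jobs]
    finished = []
    for next_done in asyncio.as_completed(tasks):
        finished.append(await next_done)
    return finished


async def in_input_order(jobs):
    # gather returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(lookup(name, delay) for name, delay in jobs))


jobs = [("slow", 0.15), ("fast", 0.01), ("medium", 0.08)]
print(asyncio.run(in_completion_order(jobs)))
print(asyncio.run(in_input_order(jobs)))
```

If the order matters downstream, returning the input key together with each result is a simple way to re-align them.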
Next Steps
How far is it from my location?
There are lots of things to play around with once you have the data from Google Places and Wikipedia. One thing you could try is to calculate the distance between the target company and your current location with the geopy library. This information is super useful for filtering out offices that are too far from your location or for ranking companies by proximity.
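To illustrate the idea without an extra dependency, the sketch below swaps geopy for a plain haversine (great-circle) formula from the standard library. The office names are hypothetical and the coordinates are approximate city centers, not real office locations.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(origin, destination):
    """Great-circle distance in km between two (latitude, longitude) pairs in degrees."""
    lat1, lon1 = map(math.radians, origin)
    lat2, lon2 = map(math.radians, destination)
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def rank_by_distance(home, offices):
    """Return (name, distance_km) pairs sorted from nearest to farthest."""
    distances = [(name, haversine_km(home, coords)) for name, coords in offices.items()]
    return sorted(distances, key=lambda pair: pair[1])


home = (37.7749, -122.4194)  # approximate center of San Francisco
offices = {  # hypothetical offices at approximate city centers
    "Office in Seattle": (47.6062, -122.3321),
    "Office in San Jose": (37.3382, -121.8863),
}

for name, km in rank_by_distance(home, offices):
    print(f"{name}: {km:.0f} km")
```

The haversine formula treats the Earth as a sphere, so it is slightly less precise than geopy's geodesic distance, which should be fine for coarse filtering.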
What are those companies talking about in the engineering groups?
In this trivial case, crawlers are not necessary. However, if we want to extend the work to cover all the pages of each company’s eng blog, some form of crawler system is a must. A simple Python way to do this is to use Scrapy to crawl the related eng blog pages, and then use text analytics to generate word clouds or group companies by technology. That would be a fun exercise with interesting facts to discover.
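The crawling part is left out here, but the text analytics step could start from something as simple as a term-frequency count, which is the raw input a word cloud needs. The stop-word list, the function name top_terms, and the sample text below are placeholders, not output from any real blog.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "our", "of", "to", "in", "we", "is"}


def top_terms(text, limit=5):
    """Return the most frequent non-stop-word terms as (term, count) pairs."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(word for word in words if word not in STOP_WORDS)
    return counts.most_common(limit)


# Hypothetical blog text for illustration only.
post = "Kafka and Kafka Streams power our pipeline. The pipeline uses Kafka."
print(top_terms(post, limit=2))
```

Running the same count per company and comparing the top terms is one rough way to group companies by the technologies they write about.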
More information please!
There are always more data sources out there that provide detailed information about public/private companies. Here are some websites and services to start with:
Final Words
Here is a part of the curated list I got from this exercise. Hope you enjoy the post.
References
- https://docs.python.org/3/library/asyncio.html
- https://github.com/kilimchoi/engineering-blogs
- https://github.com/sumodirjo/engineering-blogs
- https://magic.io/blog/uvloop-blazing-fast-python-networking/
- https://scraperwiki.com/2011/12/how-to-scrape-and-parse-wikipedia/
- https://codesnippet.io/wikipedia-api-tutorial/
- https://developers.google.com/places/web-service/search