My Background
Firstly, a little bit about my background before the internship. I had just finished my 2 years of National Service and was delighted to find an internship role at Ohmyhome. During my time in National Service, I often dabbled in programming, learning some programming languages and building some passion projects. However, joining a proper tech company felt like a huge step, and I didn’t feel like I was ready.
Everything I had programmed before then was mostly for fun. It was a hobby! I often started projects I didn’t finish, and my code was extremely messy and unrefined. Most of my programming knowledge came from online courses in Web Development and Game Development. Lucky for me, I had a huge interest in programming, and having the opportunity to intern at a tech company sounded like a great chance to test my abilities.
One of the game development courses I took on Udemy.
I was also informed that most of my tasks would require me to code in Python. This led me to try to improve my Python as much as I could before I began the internship. For anyone trying to get a good grasp of the Python language, I strongly recommend following the W3Schools Python tutorial. It covers very basic yet essential Python topics, from Variables to Classes. I will also be releasing a series of articles on Python basics soon, which will cover everything you need to know to be a competent Python engineer, so do stay tuned!
Anyways, I was also informed that I was going to build some Web Scraping projects, something I had touched on in the past but had mostly forgotten by then. This left me feeling a little anxious before my first day of work.
Ohmyhome
Now, what type of company is Ohmyhome? Ohmyhome is Singapore’s first proptech company and helps to streamline the property transaction system with the use of technology. In this article, I won’t go too in-depth into the operations of Ohmyhome; I will mostly talk about the working environment. As an intern, I was grateful to be under great mentors who were super welcoming and friendly. The data team I was part of was extremely well-knit, and it was easy to ask for assistance. Anytime I was struggling with my tasks, my peers would lend me a helping hand.
Furthermore, I felt that our team had a very clear direction, allowing us to focus on our tasks while also being able to give suggestions to improve anything that we felt was inefficient. Even as an intern, I felt that my voice was always heard and respected, which gave me the confidence to push for more responsibilities and harder tasks.
A clear team direction helps everyone focus!
It was also my first time programming as part of a team. Although most of my tasks were assigned to me alone, it was a very different experience programming while being able to ask for help. I felt the urge to always keep my code clean and concise so that my team members would not have a hard time debugging it!
Another thing I was introduced to was Version Control. Previously, I had never felt that Version Control was necessary. I always coded for myself and never needed to let anyone else view my code. My projects were also all relatively small, which meant I never felt the need to go back to a previous version of my code. To anyone looking to intern in the tech industry, I strongly encourage you to pick up Git. Udacity has a great Git course that is free!
One example of VCS is Git!
It was also a great feeling when I realised people in other departments were using my data. It made me feel like my work was actually useful to the company and that I wasn't wasting my time writing useless code. Data was something that I never appreciated before my internship, but seeing how the company used the data for a multitude of reasons made me realise the value of having data!
Data Collection
My main role at Ohmyhome was Data Collection, specifically Web Scraping. Over the 3 months I spent there, I built many Web Scrapers to help collect data for the company, which allowed me to get familiar with some of Python’s most popular modules, namely Selenium and BeautifulSoup (both of which I wrote articles on). One thing I learnt from my Web Scraping experience is that with the power of these two modules (and Python’s requests module), you can pretty much scrape anything off the internet!
I’m not kidding! After experimenting with lots of websites online, Selenium and BeautifulSoup provided most of what I needed to scrape data. BeautifulSoup helped to scrape simpler, static websites, while Selenium covered websites that loaded dynamic content or required a bit of user intervention (e.g. logging in). Once you’ve mastered these two modules, the code looks pretty much the same for any website you want to scrape!
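To give you a feel for the simpler side, here’s a minimal sketch of scraping a static page with requests and BeautifulSoup (the URL and the h1 tag are just placeholders):

```python
import requests
from bs4 import BeautifulSoup

# Fetch a static page (placeholder URL)
response = requests.get("https://example.com")
response.raise_for_status()

# Parse the HTML and pull out the elements we care about
soup = BeautifulSoup(response.text, "html.parser")
for heading in soup.find_all("h1"):
    print(heading.get_text(strip=True))
```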
In fact, the difficult part of Web scraping isn’t the actual scraping, it’s trying not to get blocked! I learnt a great deal about IP addresses and even dabbled with using Docker containers to hide my script’s automation, which was super interesting to me.
Web scraping can sometimes make you feel like a hacker!
How some websites try to protect their information, and how I could “evade” their defenses, was something I never thought I would learn! Playing this “cat-and-mouse” game of scraping and getting blocked was really exciting, and when I finally made a script robust enough to avoid getting blocked, it was extremely satisfying!
Of course, you should only scrape websites that allow for Web scraping :)
Another thing I learnt about was Threading! When using Python, it’s important to know about the Global Interpreter Lock (GIL). Python’s GIL means that at any one point in time, only one thread can be executing Python code (in other words, only one thing can happen at a time!). This may be a small concern for most Python programmers, but when dealing with Data Collection, speed is often a big issue.
Python Threading lets us work around the GIL: whenever a thread is waiting (for example, on a network response), the GIL is released and another thread can run. This is extremely useful because it cuts down the time required for our script to finish by a ton! Especially for tasks that require little CPU processing and involve a lot of waiting (like Web Scraping), Threading is extremely effective!
Threading in Python helps speed our processes up!
For example, imagine we are trying to scrape a 100-page website that takes 1 second per page. A normal Python script that has not implemented Threading will require 100 seconds! However, if we were able to run 10 Python threads that each scraped 10 pages (e.g. Thread 1 scrapes pages 1 - 10, Thread 2 scrapes pages 11 - 20 …), we would be done in just 10 seconds! That’s the power of Threading!
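Here’s a minimal sketch of that exact scenario using Python’s threading module (scrape_pages is a stand-in that just sleeps for 1 second per page):

```python
import threading
import time

def scrape_pages(start, end):
    # Stand-in for real scraping: each page "takes" 1 second of waiting
    for page in range(start, end + 1):
        time.sleep(1)

# Spin up 10 threads, each handling a block of 10 pages
threads = []
for i in range(10):
    t = threading.Thread(target=scrape_pages, args=(i * 10 + 1, (i + 1) * 10))
    t.start()
    threads.append(t)

# Wait for all threads to finish: ~10 seconds instead of ~100
for t in threads:
    t.join()
```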
I will be writing another article that will cover Python Threading more comprehensively, while also going through how to actually implement it so please stay tuned!
Additionally, I also learnt about Python Scheduling. Basically, Schedulers in Python allow us to set a specific time or interval at which our functions will trigger! It’s fairly simple and very useful too! For my Web Scraping scripts, I had to keep my data constantly updated, which meant I had to keep scraping the websites whenever their data changed. The Python Scheduler module really helps with this!
Python Scheduler helps us to time when our functions run.
I can simply key in what time (or at what interval) I want my function to run, and the Scheduler will trigger my function! For example, if I realise that my data source is updated every Monday at 8am, I can simply set my scraper to scrape every Monday at 8:05am! This ensures my database will always have the most updated information!
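Here’s a minimal sketch of that Monday-morning setup, assuming the third-party schedule library (the scrape function is just a placeholder):

```python
import time

import schedule

def scrape():
    # Stand-in for the actual scraping job
    print("Scraping the latest data...")

# Run every Monday at 8:05am, just after the source updates at 8am
schedule.every().monday.at("08:05").do(scrape)

# The script has to keep running for the scheduler to fire
while True:
    schedule.run_pending()
    time.sleep(60)  # check once a minute
```

Notice the infinite loop at the bottom, which brings me to the catch.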
However, one thing about the Python Scheduler module is that it requires your script to be constantly running. This may not be the smartest way of triggering your Python scripts, and many cloud computing services (such as Google Cloud Platform) offer much more efficient alternatives!
Cloud Computing
During the internship, I was also fortunate enough to use Google Cloud Platform (GCP) and the suite of services it provides. Before I go into the details, what exactly is GCP? This was a question I struggled with when I was first introduced to the platform. Verbatim from the GCP website: “Google Cloud consists of a set of physical assets, such as computers and hard disk drives, and virtual resources, such as virtual machines (VMs), that are contained in Google’s data centers around the globe.” Simply put, GCP (and any cloud computing platform) is a service that offers a large range of tools for companies to use!
For example, if you needed a place to store all your data, GCP offers Cloud Storage! If you need a Virtual Machine to run your processes, GCP offers Compute Engine! It’s just a place where you can use all the tools that the platform offers, but of course at a price.
Google Cloud Platform
The services that I dealt with during my internship were Compute Engine and Apache Airflow! I’ll briefly go through both:
Compute Engine is simply a place for you to spin up Virtual Machines. A Virtual Machine is basically a computer without its own hardware (imagine a computer, but *VIRTUAL*)! Virtual Machines and Virtualisation have a ton of benefits, including running multiple operating systems on one machine or safely handling potentially malicious software. Personally, I used VMs because I wanted to keep my local IP address hidden, ensuring that I could Web Scrape without a hitch!
Virtualisation is an amazing advancement in the programming world!
Apache Airflow is a platform for building workflows for any of your operations. Simply put, it’s a place where you can lay out tasks that are carried out one after another. Tasks can be Python functions, SQL queries, etc. Imagine your Airflow workflow as a factory pipeline churning out toys, with each task being one stop in the pipeline. You may start off with a task to build the toy, then another task to colour the toy, then a task to package the toy, and lastly a task to send it off to shipping!
Building your Airflow is like building a factory Pipeline!
One of the main reasons people use Airflow is the Operators and Hooks it provides. These Operators and Hooks are almost like modules: pre-written code that you can use to finish tasks more efficiently. For example, one very useful Operator is the Postgres Operator, which allows you to execute Postgres queries within one of your tasks!
Another reason people use Airflow is that it provides a very convenient visual representation of your processes. After defining your workflow, you can view all your tasks in a Tree view, which gives you lots of information, like whether a task has failed or is still running! This is extremely useful because without the Tree view, we would have to constantly log the status of our tasks, which can be hard to read and extremely difficult to manage.
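To make the factory analogy concrete, here’s a minimal sketch of what the toy pipeline might look like as an Airflow DAG (assuming Airflow 2’s PythonOperator; the task functions are just placeholders):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Each function is one "stop" in the factory pipeline
def build_toy():
    print("Building the toy...")

def colour_toy():
    print("Colouring the toy...")

def package_toy():
    print("Packaging the toy...")

with DAG(
    dag_id="toy_factory",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    build = PythonOperator(task_id="build", python_callable=build_toy)
    colour = PythonOperator(task_id="colour", python_callable=colour_toy)
    package = PythonOperator(task_id="package", python_callable=package_toy)

    # Each task runs only after the previous one succeeds
    build >> colour >> package
```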
Conclusion
Overall, I thoroughly enjoyed my time at Ohmyhome. Thinking back, I’m amazed by how much I was able to learn from my internship! I was introduced to concepts I didn’t even know existed and built projects that the old me wouldn’t have been able to comprehend! The mentorship and guidance I had at Ohmyhome was also something I greatly appreciated. Ohmyhome truly provided a safe space for me to hone my software development skills.
If you are someone thinking of starting a software internship at a tech company, please do! However, I do feel it is important that you do your own research into the company you are applying to, because not all companies have the same working environment. Remember that as an intern, your job is to soak up as much knowledge and as many skills as you can! If you need any tips for your tech internship or are feeling a little anxious, feel free to contact me! Stay cool, cucumbers!