Summary: the basic local environment to learn Data Engineering and Data Science without overspending: laptop (ThinkPad X220 or similar class), 16GB RAM and 120GB SSD, Windows 10 with the OS language set to English (United States), WSL (Windows Subsystem for Linux) enabled, Python 3, venv, Notepad++, vi. Total budget: hardware $250, software free.
In my Big Data in 30 hours training class, we cover a number of Data Engineering environments and Data Science platforms. We start with barebones Linux text processing, move through relational databases (SQLite, Oracle), data warehouses and BI, Hadoop, Spark, Amazon AWS and streaming (Kafka), and end up with data science Python environments: Jupyter, NumPy, pandas, scikit-learn, TensorFlow, Keras and more. Meanwhile, we touch on software engineering practices involving version control (git, svn) and containers (Docker, Kubernetes).
I have designed the class to minimize financial entry barriers for students. We use laptops. Below is my recommendation for the basic hardware and software environment setup. While I am specifically recommending this to my Big Data class students, I believe this list can be helpful to other people considering making first steps in Data Engineering / Data Science without overspending. Below is all you need.
Yes, you will need a laptop :) The preferred operating system is Windows 10. We will be installing A LOT of stuff. When you install stuff, you sometimes break other stuff. You need to decide whether you want to do this on your primary laptop, where you hold your private data, family pictures and bank passwords. One option is to be very careful, use isolation and containers whenever possible, and make frequent backups. The other option (recommended) is to use a completely separate machine for experimenting. It is actually quite cheap to do.
Since this is just a sandbox machine, with no critical data or systems, you do not need a new laptop. Assuming your budget is as tight as $250, here is a solid second-hand setup I have tested and recommend.
- An old Lenovo ThinkPad X220. Second-hand price is $150. Before buying, make sure it comes with Windows 7, 8 or 8.1 installed. You will not need that Windows version itself, but you will use its installation key to upgrade to Windows 10.
- One necessary upgrade is 16 GB of RAM. A new 2x 8GB DDR3 RAM set costs $50, depending on manufacturer (for Goodram or Kingston you may pay a bit more). If you read up on this online, you will find a discussion about whether the X220 can handle 16GB of RAM. I can confirm it can.
- Your second best investment is an SSD drive. A new 2.5’’ mSATA drive (120 GB) costs $30.
- Two extra goodies that help your comfort but aren’t really necessary: Get yourself a 600mAh Li-Ion extra-capacity battery, allowing for ~6 hours of work without charging. That’s $25. A DisplayPort-to-HDMI adapter is $8, and that’s only if you want to connect to an external screen (the X220 has a DisplayPort but no HDMI port).
- You will spend a day disassembling your laptop, manually installing the memory and/or SSD, and upgrading Windows. If you can afford the time, it is a good investment, and it is fun to take off your laptop’s keyboard and look under the hood. Don’t take it to a repair shop. Do it yourself. Yes, there is a chance that you break it, so try not to. Here are some good instructions. You can also follow more detailed YouTube videos.
Here you go, all set with an aspiring Data Engineer hardware setup for about $250. Of course, you may use any other laptop model, just make sure it has enough memory and optimally an SSD. And yes, your environment isn’t perfect. You cannot perform meaningful Spark or Hadoop operations on one machine. Your tensor operations will be slow without an NVIDIA GeForce GTX 1080 GPU. Do not worry. You are fine. If your budget is $250, this is the way.
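As a quick sanity check of the "about $250" figure, here is a small sketch that adds up the prices quoted in the list above. The item labels are just shorthand for this example, and real second-hand prices will vary.

```python
# Prices in USD, copied from the shopping list above.
core = {
    "ThinkPad X220 (second-hand)": 150,
    "2x 8GB DDR3 RAM": 50,
    "120 GB SSD": 30,
}
optional_extras = {
    "extra-capacity battery": 25,
    "DisplayPort-to-HDMI adapter": 8,
}

core_total = sum(core.values())
full_total = core_total + sum(optional_extras.values())

print(f"Core setup: ${core_total}")              # Core setup: $230
print(f"With optional extras: ${full_total}")    # With optional extras: $263
```

So the necessary parts land a little under the budget, and the two comfort items push it slightly over.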
2. Operating system… language
Install Windows 10 with system language set to English (United States).
Why not your mother tongue? We will be dealing with a lot of software configuration. You will often get stuck with stuff. When stuck, the online professional community is where you are most likely to get help: Stack Overflow, Kaggle, How-To Geek, Quora, asktom, and many more niche interest groups on LinkedIn, Facebook, or Slack.
Having the error messages in English will make your interactions with the community effective. If instead you run the OS in your local language, whatever that is, your online search for guidance in moments when you stumble will be cumbersome and frustrating. Polish readers will smile remembering that the 7zip Extract command translates in Polish to the absurd term Wyodrębnij, rather than the commonly used Rozpakuj. Apparently some folks working in the translation and i18n (internationalization) business had some obscure reasons to make life difficult for us. When hitting a technical error, I don’t want to waste time guessing what the translator meant and why. My recommendation: improve your mental hygiene and set your OS to English (United States).
You will soon need to install a lot of software, including various, often conflicting versions of the same libraries. The Docker container model helps you isolate those environments and saves you from much of the sysadmin headache. Install Docker, and then install any additional software inside a Docker container.
You need access to a Linux shell environment (Linux terminal), and this should be the bash shell. Assuming you own a laptop with Windows on it, you can achieve this in a variety of ways.
- One way is to install barebones Linux as the main operating system on your laptop. Popular distros (distributions) are Ubuntu, Red Hat, SuSE, Mint, Fedora, CentOS, Debian and many more. If you are not sure which one, go for Ubuntu. You can still keep Windows on the same machine with a dual-boot configuration. However, I do not recommend this way of installing Linux unless you know what you are doing.
- You may install Linux inside a virtual machine such as Microsoft Hyper-V, Oracle VirtualBox, or VMware. That’s one reasonable option.
- Ubuntu may also be run inside a Docker container. It is a reasonable option. Access to the host filesystem might require some additional config, but not much.
- The option I recommend is to use the Windows Subsystem for Linux (WSL) on Windows 10. WSL provides native support for Linux, which isn’t perfect from a sysadmin point of view, but for our purposes it has all we need. The good part is that from your WSL Linux terminal you can access your regular Windows filesystem and all the files. The trick is that WSL support is disabled by default, so you need to enable it before installing. This Windows Central article describes how to perform the installation.
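To make the WSL filesystem point concrete: by default, WSL exposes each Windows drive under `/mnt/<drive letter>`, so `C:\` shows up as `/mnt/c`. The sketch below assumes that default mount location. The helper names `is_wsl` and `windows_to_wsl_path` are made up for this example, and the WSL check is only a common heuristic (looking for "microsoft" in the kernel release string), not an official API.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath


def is_wsl() -> bool:
    """Heuristic: WSL kernels usually report 'microsoft' in their release string."""
    return "microsoft" in platform.uname().release.lower()


def windows_to_wsl_path(windows_path: str) -> str:
    """Translate an absolute path like C:\\Users\\me\\x.csv to /mnt/c/Users/me/x.csv.

    Assumes the default WSL convention of mounting drives under /mnt.
    """
    p = PureWindowsPath(windows_path)
    # Accept only absolute paths with a plain drive letter such as "C:".
    if len(p.drive) != 2 or not p.root:
        raise ValueError(f"expected an absolute drive path, got: {windows_path!r}")
    drive_letter = p.drive[0].lower()
    # p.parts[0] is the anchor ("C:\\"); the rest are the folder and file names.
    return str(PurePosixPath("/mnt", drive_letter, *p.parts[1:]))


print(windows_to_wsl_path(r"C:\Users\me\data.csv"))  # /mnt/c/Users/me/data.csv
print("Running inside WSL?", is_wsl())
```

The example path is hypothetical; substitute a file that actually exists on your Windows drive.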
That’s it for Windows. For Mac users, there is no need to install anything, as the Mac has a Unix console built in. Mac OS X includes a Terminal application, which provides a text window in which you can run Unix commands. This terminal is also often referred to as the command line, the shell or the shell window.
While there is a passionate discussion going on online regarding which Python to use, 2.7 or 3.5, I will cut it short. Use the latest stable version of Python 3. Download and install it. Before installing any further libraries (numpy, jupyter…), wait. You should immediately think about isolating the environment, as you are likely to get stuck in conflicting dependencies that break your install. First of all, get hold of an environment manager such as venv, virtualenv or conda (any of them works). Then and only then proceed to any additional installs. Summary of best practices:
- If you are doing this in Linux (WSL), never sudo pip.
- Also, do not use pip as a shell command. Use python3 -m pip.
- Install venv (or another environment manager) first, before installing anything else.
- On WSL, do not create venv environments on Windows mounted drives. It won’t work. Venv environments work best in the virtual Linux filesystem that WSL provides, for instance in your home directory ~.
- Other packages, including jupyter, need to be installed inside a venv environment, never before setting one up.
- In addition, many authors recommend double isolation: a virtual env inside Docker.
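The practices above can be sketched as a few guard functions. This is only an illustration under stated assumptions: the function names are invented for this example, the `/mnt/<letter>` check assumes the default WSL drive mounts, and the comparison of `sys.prefix` with `sys.base_prefix` is the usual way to detect an environment created by venv. In the shell, the corresponding steps would be `python3 -m venv ~/envs/bigdata` followed by `source ~/envs/bigdata/bin/activate` (the directory name is just an example).

```python
import sys
from pathlib import PurePosixPath
from typing import List


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv-style environment."""
    return sys.prefix != sys.base_prefix


def on_windows_mount(path: str) -> bool:
    """True for paths under /mnt/<drive letter>, where WSL mounts Windows drives by default."""
    parts = PurePosixPath(path).parts
    return len(parts) >= 3 and parts[:2] == ("/", "mnt") and len(parts[2]) == 1


def pip_install_command(*packages: str, require_venv: bool = True) -> List[str]:
    """Build a 'python -m pip install ...' command for the current interpreter.

    Using sys.executable with '-m pip' (instead of a bare 'pip') makes sure
    the packages land in the interpreter that is actually running.
    """
    if require_venv and not in_virtualenv():
        raise RuntimeError("activate a virtual environment before installing packages")
    return [sys.executable, "-m", "pip", "install", *packages]


print(on_windows_mount("/mnt/c/Users/me/env"))  # True: a poor place for a venv on WSL
print(on_windows_mount("/home/me/envs/bigdata"))  # False: inside the Linux filesystem
print("Inside a virtual environment?", in_virtualenv())
```

The returned list can be passed to `subprocess.run` once you are inside an activated environment.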
Programmers typically don’t use Notepad. Better options include: Atom, Sublime Text, Brackets, Notepad++, vi, vim, Emacs, and many more. You can use anything. My recommendation 1: install Notepad++ and learn to use it. It is reasonably lightweight and has many excellent features such as line numbering, syntax highlighting, tabs, macros and plugins. Recommendation 2: also learn the basic operations of vi, which will be useful when logging in to a remote server in text mode.
Note that we do not require a full-fledged IDE, but you might choose to use one, such as PyCharm.
You are set.
That’s it for the basic setup! We will also need to install tons of additional software, including databases, middleware, Python and data analytics packages, but this will be covered in the lectures and articles to follow.
Bonus read: Anki
One more recommendation, but this time it has nothing to do with the environment. We will cover a lot (A LOT) of material. Here is my opinion, which you may find controversial. I believe a lot of this material should be learned by heart. This does not mean that you should not understand stuff! You should do both: understand how things work, and memorize them. The problem you will be facing as a data specialist is that there are too many technologies out there. Even as a full-stack developer, with time you may become an expert in two, three or five platforms at most, but not fifty. Face it: you will never have time for more. Hence, one option to remain literate in the technical areas for which you don’t have time is to memorize the key terms and regularly follow the professional press to track the new concepts that appear.
For memorizing stuff, consider using Anki. It is a flashcard-type application, open source and free. Anki implements spaced repetition, and hence helps you memorize stuff with less effort and time. Anki synchronizes your material between your laptop and Android, so you can keep learning on either. Commercial alternatives are Memrise, Quizlet, Supermemo, Studyblue, Revunote, Reflect, Brainscape. I did quite thorough research on all of them one day, so I’ll save you the work and summarize it: use Anki. It works.
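To give a feel for what spaced repetition means, here is a deliberately simplified sketch. It is not Anki's actual scheduler. It only assumes a toy rule: every time you remember a card, the gap until the next review doubles, and when you forget it, the gap resets to one day. The function names are made up for this example.

```python
from datetime import date, timedelta
from typing import Tuple


def next_interval(current_days: int, remembered: bool) -> int:
    """Toy rule: double the interval on success, reset to 1 day on failure."""
    if not remembered:
        return 1
    return max(1, current_days) * 2


def next_review(today: date, current_days: int, remembered: bool) -> Tuple[date, int]:
    """Return the next review date together with the new interval in days."""
    interval = next_interval(current_days, remembered)
    return today + timedelta(days=interval), interval


# A card answered correctly four times in a row, then forgotten once.
interval = 1
for remembered in [True, True, True, True, False]:
    interval = next_interval(interval, remembered)
    print(interval)  # prints 2, 4, 8, 16, 1 on successive lines
```

The point is only that well-known cards come back less and less often, so your daily review time goes to the material you actually struggle with.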