Information about whitepapers is available here.
Specifics about the final assignment are available here, too.
Here's a useful tutorial on how to set up "shared folders" within VirtualBox.
And! Bonus! A file that will come in handy during the tutorial. Go ahead and grab a copy of universities.csv.
Not so hard, eh? We're almost done. Let's install git and curlicue.
You can find notes from the last two weeks of class and assistance with the assignment on this [updated! again!] page of notes about working with files in Linux. It contains assignment information for the 29 March class meeting.
If you have time prior to class on 22 March (and you've already gotten VirtualBox and Vagrant up and running [see below if you're having trouble there]), then please take a look at section 3.c. in this text, Data Science at the Command Line. NB: Please do not distribute this text beyond our course.
Data Science PDF
Last updated: 6:50PM, 21 March. If you don't see something here that you need, please let me know. Otherwise, check back when you're able.
The instructions for doing this are available in the book Data Science at the Command Line, but the book does a poor job of explaining the process, which is why I'm explaining it again here. I hope my explanation is clearer.
Our first step is to download the VirtualBox software. VirtualBox is made by Oracle, the same company that owns the computer language Java. VirtualBox is an app like Photoshop or Word, but where Photoshop can open an image and hold it in memory while you work on it, VirtualBox can open an operating system and hold it in memory while you work in it. Recall that one of the most important aspects of computation -- at least, as we understand it now -- is that one computer should be able to emulate another to an exacting degree. This is why you can visit Minecraft and see players building working models of 8-bit computers from the 1980s: They are very slow, and roughly the size of a football field, but they work.
Here is a simplified map of how that works: Each element "contains" all of the other elements.
Your hardware > Your OS > Java > Minecraft > 8-bit PC
And while it might seem weird to be inside one operating system that is inside another, from the computer's perspective, it really doesn't add anything new: The only changes worth noting are (1) the computer itself is a lot busier, and (2) it needs a lot more memory.
To get a copy of VirtualBox, visit:
And download the VirtualBox that is particular to your system.
DISCLAIMER: VirtualBox is not consumer-grade software. It is software for hackers, programmers, coders, and specialists. People like you. VirtualBox does not hold your hand. It is not "user-friendly." In fact, it typically ignores you, even if there is a problem. That's just the way it is: For the most part, the only software that is worth using in this world is not "user-friendly". "User-friendly" software is like art produced by a committee. It might be easy to understand, but who would want to?
Put another way: Microsoft Word is like a boyfriend who will probably wear only khakis and short-sleeve dress shirts for the rest of his life. Don't marry him! Keep your options open!
Did you get an error when you tried to install? Thanks, ghost of Steve Jobs. You'll need to go into your System Preferences and choose the Security icon. From there, you'll get a set of tabbed panels: The first panel will probably make reference to the fact you just tried to open VirtualBox, and it will warn you against doing so. But tell it you want to live dangerously, and that Word is just not enough for you: Choose "Allow apps downloaded from..." (or maybe "Install anyway" or "Ignore warning" or whatever it says) -- and your OS should take you back to the installer and let you finish the process. If it doesn't, try starting the installer again.
Now we need Vagrant. Vagrant is what is called "middleware" -- a very sexy term! Think of Vagrant as a shipping container that is stuffed with at least one operating system. Or think of Vagrant as a vector, a means of delivery. Vagrant is going to "deliver" a Linux operating system to VirtualBox, and VirtualBox will open it up and run it inside the operating system we're already running. Imagine, for example, Dr. Who's TARDIS on a hilltop. While we're on the hilltop, we're in our regular operating system. But open the door to the TARDIS and look inside: It is very different. When we "SSH" our way into this new operating system, everything will look different. It will seem like a world unto itself... just like it seems when we're inside the TARDIS. But we're also still in the old world: We're in the TARDIS -- that is, Linux, but we're still on the hilltop -- that is, Windows or MacOS.
So, to install Vagrant, visit:
Download the version that best suits your OS.
(Hint: If you're using a recent version of Windows, it is almost certainly 64-bit. If it is older, or if you're using an older laptop, 32-bit may be the better choice.)
Install that software. If it asks you for installation information, it is, in practice, ALWAYS better to go with the defaults it offers you -- it makes it much easier to fix or remove the software later.
Remember that macOS typically installs applications in the Applications folder at the directory root:
But you can often more easily install things in your own Applications folder (easier to get to, usually). In my case, it is:
The tilde (~) is a shortcut that means (in my case):
On Windows, your applications are typically in one of two directories:
C:/Program Files (x86)/
The latter is where 32-bit applications are stored (e.g., aging, out-of-date programs). Since you're probably installing the 64-bit version of Vagrant, you'll likely want to choose the first of those two.
So now that we've got Vagrant on our system, let's use it to install the Data Science Toolbox that our author prepared for us.
On Windows, open your terminal by searching the system for CMD.exe; if you have PowerShell installed you can use that, too. Run CMD.exe. You should see an empty text window open on the screen, and there will be a cursor near the top, blinking.
On a Mac, launch Terminal from this location:
Now you're in your terminal. When you're in terminal, you're always standing inside some folder or another. Whether you're in Linux, or Windows, or MacOS, terminal typically dumps you into your USER DIRECTORY. On my Mac, it is:
Or, if I write it out:
This is where gigabytes of data are stored that are specific to me: My documents, my music, my videos; it is where my browser history is kept, where my calendar records information, etc. It is specially permissioned so that other users can neither READ nor WRITE to this giant folder.
We're going to put our stuff here in the root of our user folder -- side-by-side with our documents folders, music folders, library folders, etc. (Note that it is usually better not to fill this directory with files: Wherever possible, stick with folders only.)
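If you'd like to see your user directory for yourself, here is a small sketch. It assumes a Mac or Linux terminal (these commands are not meant for CMD.exe). It prints the full path of your user directory, shows that the tilde points to the same place, and then moves you there.

```shell
# Print the full path of your user directory
echo "$HOME"

# The tilde expands to the same path
echo ~

# Change into your user directory and confirm where you are standing
cd ~
pwd
```

The first two lines of output should match each other, and the last line should match them once you've changed directory.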
NB: Vagrant expects this data to be letter-perfect, including spaces, capitals, etc. So type it exactly as indicated below.
Both MacOS and Windows use similar commands for this section!
At the $ prompt (the $ is another way of saying "I'm ready for your input"), type:
Hit enter. If there is an error, it will tell you. If it doesn't say anything in response, assume everything worked well.
Now change into that directory. Type:
Voila. Now you're in that directory. Take a look by typing:
On Windows, type:
On your Mac, type:
IMPORTANT: Before you initialize vagrant, make sure you are INSIDE the directory called "MyDataScienceToolbox". If you are not there, you will have problems later. Do not run vagrant in the step below unless you are in the MyDataScienceToolbox directory.
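Taken together, the directory steps might look like the sketch below on a Mac. It assumes the directory is named MyDataScienceToolbox, as in the note above, and that it sits in the root of your user folder. The check at the end is just a convenience for your own peace of mind; it is not something Vagrant requires.

```shell
# Make the directory in the root of your user folder
# (-p means: don't complain if it already exists)
mkdir -p ~/MyDataScienceToolbox

# Step inside it
cd ~/MyDataScienceToolbox

# Look around; a brand-new directory has nothing to list
ls

# Double-check where you are standing before you run vagrant
if [ "$(basename "$PWD")" = "MyDataScienceToolbox" ]; then
  echo "OK: you are inside MyDataScienceToolbox"
else
  echo "Not there yet: you are in $PWD"
fi
```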
Whether you are on a Mac or a PC, you should see an empty directory. Let's fill it with stuff. At the $ prompt, type:
vagrant init data-science-toolbox/data-science-at-the-command-line
1. When you work from the command line, the first word always refers to a program or command. In this case, typing vagrant is effectively the same as double-clicking on the vagrant icon. Even when you type "cd" to change directory, you're running a very small command called "cd" (this one happens to be built into the terminal itself rather than stored as a separate application).
2. The next part, "init data-science-toolbox/data-science-at-the-command-line", is made up of what we call "arguments." They are data that gets fed into vagrant when it begins to run (which is why we use the command line: It is inconvenient to try to supply an icon with arguments when you double-click it). In this case, init is probably a few lines of code inside vagrant that knows how to set up our working environment; data-science-toolbox/data-science-at-the-command-line is the name of a shipping container that is stored somewhere on vagrant's servers. Vagrant makes a note of that name now; it will go look for the container and bring it back to your computer the first time you start things up.
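To see how a program receives its arguments, here is a small sketch using a made-up shell function called show_args (it is my own illustration, not part of Vagrant). It simply reports what it was handed, which is roughly the view vagrant has of the words you type after its name.

```shell
# show_args is a hypothetical helper: it prints each argument it receives
show_args() {
  echo "I received $# arguments"
  for arg in "$@"; do
    echo "argument: $arg"
  done
}

# The same words we gave to vagrant, minus the program name itself
show_args init data-science-toolbox/data-science-at-the-command-line
```

Run as written, it reports two arguments: init, and the name of the shipping container. Notice that the spaces are what separate one argument from the next, which is one reason the command has to be typed letter-perfect.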
That may take a few minutes. When that is finished, you can look around. Type:
You should see a blank icon indicating a file called Vagrantfile -- it is blank because it doesn't have a filetype suffix (.exe, .jpg, .html, etc.). That's ok, those are a lot less important than you'd think. Anyway, that is our configuration file -- it's where we can set variables that are important on launching. You'll use those a lot in Linux. Ignore it for now.
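If you're curious anyway, you don't need an icon or a suffix to read the Vagrantfile: it is plain text. The sketch below uses a hypothetical helper, show_settings (my own name, not a vagrant command), that prints only the lines of a file that are neither blank nor comments. It assumes a Mac or Linux terminal.

```shell
# show_settings is a hypothetical helper, not a vagrant command.
# It prints the lines of a file that are neither blank nor comments (#).
show_settings() {
  grep -v -E '^[[:space:]]*(#|$)' "$1"
}

# Only try it if there is a Vagrantfile in the current directory
if [ -f Vagrantfile ]; then
  show_settings Vagrantfile
else
  echo "No Vagrantfile in $PWD"
fi
```

On a freshly initialized toolbox you should see only a handful of lines, among them one that sets config.vm.box to the name you passed to vagrant init.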
Instead, let's get things up and running. From inside that directory, type:
If that worked, it means we've successfully started the engines on our virtual environment, which in turn successfully launched a version of Linux that was stored inside the data-science-toolbox package. So, in addition to loading and running Linux inside your OS, it will also make lots of programs immediately available to us for data grabbing and manipulation. (If we just installed a generic Linux OS through vagrant, instead of the data-science-toolbox package our author built for us, then we'd have to install dozens of data-oriented programs by hand -- which takes forever. This is why they invented Vagrant in the first place!)
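Because starting vagrant from the wrong directory causes so much grief, here is a cautious sketch of a wrapper you could define in a Mac or Linux terminal. The name start_toolbox is hypothetical. It assumes the standard Vagrant subcommands vagrant up (start the machine) and vagrant status (report its state), and it refuses to do anything unless there is a Vagrantfile where you are standing.

```shell
# start_toolbox is a hypothetical wrapper, not part of Vagrant.
start_toolbox() {
  # Vagrant reads its settings from the Vagrantfile in the current directory
  if [ ! -f Vagrantfile ]; then
    echo "No Vagrantfile in $PWD; cd into your toolbox directory first." >&2
    return 1
  fi
  # Make sure the vagrant program can be found at all
  if ! command -v vagrant >/dev/null 2>&1; then
    echo "vagrant is not on your PATH; check the installation." >&2
    return 1
  fi
  # Start the virtual machine, then report its state
  vagrant up && vagrant status
}
```

Once the function is defined, you would call it by typing start_toolbox from inside your toolbox directory.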
You'll probably see lots of warnings and security caveats. Now we need to get inside the operating system.
Finally! Your VirtualBox is now running in the background. At this point, it isn't really doing much, though, and it certainly isn't very entertaining from the outside. So we have to get in. Windows users and macOS users will get to the entrance in two very different vehicles, but we're both going in through the same door.
MacOS users have an advantage: They have an SSH client built into their operating system. SSH stands for Secure SHell: It replaces older protocols like Telnet and FTP because they are not sufficiently secure. So we need to establish a Secure SHell connection to our little virtual operating system. In a sense, when you use SSH, you're tricking your computer into believing that you're linking to another computer, but you don't need an internet connection for this at all.
What you do need is an SSH client. If you're on Windows, consider the venerable PuTTY -- as no-nonsense a piece of software as you'll ever meet. Visit:
If PuTTY isn't to your liking, there are dozens of others: Some are free, some are not. But an SSH client is something everyone should own. Look around online for alternatives.
Once you've installed PuTTY, you need to connect to your machine. Launch PuTTY and enter this address information:
HOST (IP ADDRESS): 127.0.0.1
Connection Type: SSH
Click on the button labeled "OPEN". The new operating system will demand your name and your password. Your username is:
Similarly, your password is:
If you're on a Mac, your process of logging into the virtual OS is much easier than that of your Windows-using friends. In your terminal window, type:
If it asks for a username and password, go ahead and supply the word vagrant for each.
This will open up an SSH link to the virtual OS. You'll see a greeting with some copyright data and a reference to the Ubuntu OS from which our Data Science Toolbox is derived. That's it! You're in!
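Both routes lead to the same door. As a rough sketch, the PuTTY settings above correspond to a plain ssh command like the one printed by this hypothetical helper, ssh_command. The address 127.0.0.1 and the user name vagrant come from the instructions above; port 2222 is an assumption on my part (it is Vagrant's usual default for the forwarded SSH port, but yours may differ).

```shell
# ssh_command is a hypothetical helper: it only prints a command, it does not connect.
# Arguments (all optional): host, port, user
ssh_command() {
  host="${1:-127.0.0.1}"   # the address used in the PuTTY settings
  port="${2:-2222}"        # ASSUMPTION: Vagrant's usual forwarded SSH port
  user="${3:-vagrant}"     # the user name given above
  echo "ssh -p $port $user@$host"
}

# Print the command using the defaults
ssh_command
```

If the port turns out to be different on your machine, running vagrant ssh-config from the toolbox directory should print the host and port Vagrant is actually using.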
To shut down your virtual OS, make sure you're in the main Data Science Toolbox directory (the one that is immediately inside your home directory; in my case, that is:
When you are there, you can shut down the Toolbox by typing:
You can delete everything and start fresh, too. Just type:
That deletes everything that you'd installed and all the work you've done so far, so be careful!
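Since one of these choices is harmless and the other is not, here is a sketch of a small wrapper that keeps them clearly apart. The name toolbox is hypothetical. It assumes the standard Vagrant subcommands vagrant halt (shut down, keep everything) and vagrant destroy (delete the virtual machine), and, as before, it refuses to run outside a directory that holds a Vagrantfile.

```shell
# toolbox is a hypothetical wrapper around two standard vagrant subcommands.
toolbox() {
  case "$1" in
    stop)   action="halt" ;;     # shut the virtual OS down; your work is kept
    delete) action="destroy" ;;  # remove the virtual OS and everything in it
    *)      echo "usage: toolbox stop|delete" >&2; return 2 ;;
  esac
  # Refuse to run anywhere but a directory that holds a Vagrantfile
  if [ ! -f Vagrantfile ]; then
    echo "No Vagrantfile in $PWD; cd into your toolbox directory first." >&2
    return 1
  fi
  vagrant "$action"
}
```

You would type toolbox stop at the end of a work session, and toolbox delete only when you really do want to start over.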