Some Basic Tips When Using a MacBook for
Scientific Computing / "Data Science" / Machine Learning


  • General General Stuff

    • USE A PACKAGE MANAGER. Most people use Homebrew. This lets you install lots of software (usually command-line tools, libraries, etc.) in a sane way instead of relying on whatever the Mac came with. Just "brew install" what you need and it will handle everything.

  • General Desktop Stuff

    • Use the command-space spotlight search thing a lot. like to change an option, you better be typing "command-space" then "s" and then it will probably autocomplete-suggest "system preferences" which is what you want, so you can just press enter and get there in 4 keystrokes.

    • SizeUp lets you easily move windows to occupy the left or right half of the screen, quadrants of the screen, maximize, send to other desktops, monitors, etc., using hotkeys. It's great and makes me way more productive, can't recommend it enough. It costs like 13 bucks, just buy it, or use the trial version, but seriously, it's 13 bucks. It doesn't seem that useful, but being able to consistently operate in splitscreen mode is great when reading documentation/papers/math and coding at the same time, etc.

    • Karabiner-Elements lets you remap keys. I did this because as a PC programmer I was used to the ctrl key (which on Windows does what the command key does on a Mac) sitting under my left pinky, where Mac laptops put the "fn" key, and the hand contortions for "command-x/c/v" felt much more unnatural. I mapped fn to left_command so it sits under my left pinky like on a PC, left_command to left_option, and left_option to fn. Karabiner-Elements is free (I think).

    • Sublime Text 3 is the correct text editor to use. Don't use it to write code for languages that have a good IDE (usually that means an IDE made by JetBrains, e.g. IntelliJ and all its plugins, like the one for Scala. PyCharm is basically just the IntelliJ Python plugin bundled as a different application). Install Package Control on it to easily install new packages, and use it to edit things like config files, LaTeX (if you're not collaborating on Overleaf), this document, etc. It costs money and will bug you occasionally if you don't pay. Depends on how much money you have.

    • Skim is a great PDF reader with support for annotating documents and an API for syncing up with other applications, such as tex editing plugins for Sublime.

    • I use the app QuickRes to let me hack around with crazy window resolutions on my retina display. I don't know how necessary this is on the modern OSX in 2018 vs in 2012 when you needed it, but you can technically run at like native retina resolution if you're insane. Generally the highest resolution you can stand is the best, because you have more screen real estate to work with, you can split screens with SizeUp, and be more productive.

    • Go to System Preferences > Trackpad and enable:

      • Under the Point & Click tab:

        • Tap to click. If you're physically depressing your trackpad to click stuff I hate you.

        • Secondary click (right click) should be "click or tap with 2 fingers".

      • Under Scroll & Zoom tab:

        • Enable Scroll direction: Natural. It's sooooo much better. It's like you're pushing and pulling the page: "Get away from me, internet! Get over here, internet!".

        • Enable pinch to zoom.

        • I don't care about the others.

      • Under More Gestures tab:

        • I enable all of them but to be honest, the ones I really use are:

          • Swipe up with 3 fingers for Mission Control (see all windows).

          • Swipe down with 3 fingers for App Exposé (see all windows of current application).

    • In most text situations, using the arrow keys plus holding option will let you skip over entire words at a time. I CAN'T EMPHASIZE ENOUGH HOW MUCH MORE PRODUCTIVE THIS MAKES YOU. Especially if you hold down shift, you can use it to quickly highlight specific words or lines to copy and paste and cut.

  • General Terminal ("Shell") Stuff:

    • Using the shell is important and very powerful. To demonstrate this power, the following one-liner downloads the lyrics of the song Juicy by Biggie Smalls and reads it over your speakers (warning, lots of explicit lyrics):

      curl -Ls http://genius.com/The-notorious-big-juicy-lyrics | xmllint --html --xpath "//div[@class='lyrics']"  - 2> /dev/null | tr $'\n' ' ' | sed 's/<[^>]*>//g' | sed 's/\[[^]]*\]//g' | say
    • Get a good .bash_profile file that colors things differently, makes your prompt nice, etc. Set your terminal to white on black. Add the alias ll for "ls -lha" because you'll use it a lot.

    • I've also heard good things about zsh so maybe do that (with someone's monster config script)

    • Navigation:

      • ls shows the contents of the current folder.

      • ls -lha is like ls but shows sizes, hidden files, more information, etc. It's good and should be aliased to ll.

      • ~ is an alias for your home directory, . is an alias for the current directory, .. is an alias for the parent directory.

      • cd DIR or pushd DIR changes directories. pushd is nice because it lets you return to where you came from later with popd, and nests.

      • pwd prints the full path of the current directory.

      • Often when typing a path you can double-tap tab for a list of options (files/subdirectories of the current part of the path you're typing, or those matching a prefix you've typed). If there's only one option, like when going through long nested code directories, or only one option given the prefix you've typed, pressing tab once will auto-fill it.
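
A minimal sketch of the pushd/popd nesting described above, assuming bash (or zsh). The directory names are made up and everything happens in a throwaway temp folder, so nothing real gets touched.

```shell
# Sketch: pushd/popd keep a stack of directories, so jumps can nest.
# Assumes bash or zsh; the directories under $demo are hypothetical.
demo=$(mktemp -d)                  # throwaway sandbox
mkdir -p "$demo/code/project" "$demo/data"

cd "$demo"
start=$(pwd -P)                    # pwd prints the full path of where we are

pushd code/project > /dev/null     # remember $demo, jump into code/project
pushd "$demo/data" > /dev/null     # remember code/project, jump into data
deepest=$(pwd -P)

popd > /dev/null                   # back to code/project
middle=$(pwd -P)
popd > /dev/null                   # back to where we started
final=$(pwd -P)

echo "$deepest"
echo "$middle"
echo "$final"
```

Each popd undoes the most recent pushd, which is why the second one lands back at the starting directory.
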
    • Basic commands to interact with files and chain commands:

      • man COMMAND prints the help page for a given COMMAND

      • cat FILE prints a file to the screen (standard out or stdout)

      • The | separator (pipe) between two commands pipes the stdout of one to the stdin of another. Many utilities can accept input from stdin (standard in) so this lets you chain cool commands.

      • The > separator, used as COMMAND > filename, writes the stdout of COMMAND to that file, overwriting whatever was there. Use >> instead to append. Very useful.

      • echo FOO prints FOO to the screen. echo $FOO prints the value of the environment variable FOO to the screen. This is a good way to check your path (echo $PATH), the ordered list of folders that your shell looks in when you type a command.

      • touch FILENAME creates a new empty file called FILENAME. Nice to add a blank README.md to a new git repository or whatever.

      • mv SOURCE DEST moves SOURCE to DEST.

      • cp SOURCE DEST copies SOURCE to DEST. If source is a directory, you need cp -r for recursive.

      • ln -s TARGET LINK makes a shortcut ("symlink" or "soft link") called LINK that points at TARGET, without actually moving or copying anything. Very useful, and dangerous, because the order of arguments is easy to get backwards: the existing thing being pointed at comes first, the new link second. You can mess a ton of stuff up.

      • rm FILE removes a file. If it's a directory, you need "rm -r DIR".

      • wc -l FILENAME counts the number of lines in a file. You can also count words, etc. Even if you couldn't, you could use tr to change " " to "\n" and pipe that to wc -l (see "slightly more advanced" section).

      • du -h -s gives the disk space usage of the current directory. du -h -s DIR gives it for DIR.

      • head -n NUM FILE gives the first NUM lines of FILE. There's a similar command, tail, for the last lines. Some versions of head also accept negative counts (GNU head does; the BSD one on Macs may not), so check the man page. If you don't give it a file, it will use stdin. This is true of most commands, which is why you can chain them together with pipes.
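
A small sketch tying a few of the commands above together: redirecting into a file, chaining with a pipe, and getting the ln -s argument order right. The file names (notes.txt, latest.txt) are made up, and it all runs in a temp directory.

```shell
# Sketch: redirect, pipe, and symlink inside a throwaway directory.
# notes.txt and latest.txt are hypothetical names for illustration.
work=$(mktemp -d)
cd "$work"

printf 'alpha\nbeta\ngamma\ndelta\n' > notes.txt   # > overwrites the file
echo 'epsilon' >> notes.txt                         # >> appends to it

line_count=$(wc -l < notes.txt | tr -d ' ')        # tr strips the padding BSD wc adds
second_line=$(head -n 2 notes.txt | tail -n 1)     # pipe head into tail
last_one=$(tail -n 1 notes.txt)

# ln -s takes the existing file first and the new link name second.
ln -s notes.txt latest.txt
via_link=$(head -n 1 latest.txt)                   # reading the link reads notes.txt

echo "$line_count lines; second: $second_line; last: $last_one; via link: $via_link"
```
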

    • Text/stream manipulation (useful for navigation too since you can manipulate results of ls, etc):

      • sort sorts lines in ascending order by default, in dictionary order as characters. -n makes it sort in numeric order, -r gives descending, -k lets you pick a column as a key to sort by, -t lets you pick a field delimiter to create columns.

      • uniq gives the unique lines, but only by collapsing neighboring elements because generally we are stream processing. So, to do a true unique count, you want to sort first. cat FILE | sort | uniq | wc -l gives the number of unique lines in a file.

      • sed lets you edit streams with regular expressions. Macs ship a crufty old BSD sed, so type sed -E to get extended regular expressions and make it act more like what Linux examples expect. Usually something like cat file | sed -E 's/REGEX/REPLACEMENT/g' will get the job done.

      • grep lets you search with regular expressions. Use grep -E to get extended regular expressions (+, ?, |, grouping); plain grep only does basic ones, on Macs and Linux alike. The -v option inverts the match. This is useful for finding things in a text file or finding a file in a directory, etc. For example, cat FILENAME | grep -E '^foo' gets the lines starting with "foo".

      • awk is really nice for manipulating tabular data and more, but a bit arcane. It can be useful to do stuff like figure out the longest line, or just pluck out specific columns.

        • If you want the length of the longest line in characters, that would be something like cat tempfile | awk 'BEGIN {FS=""} {print NF}' | sort -rn | head -n 1, whereas the length of the longest line in words (things separated by a space) would be cat tempfile | awk 'BEGIN {FS=" "} {print NF}' | sort -rn | head -n 1. FS means "field separator" and NF means "number of fields". An empty FS splits each line into single characters in most awks.

        • If you want the index (counting from 0) of the line with the most fields, that could be something gross and imperative (but shows the power of awk) like

          cat tempfile | awk 'BEGIN {FS=" ";idx=0}  {print idx,NF;idx+=1}' | sort -rnk 2 | head -n 1 | cut -d' ' -f 1

          cut -d' ' -f 1 is roughly equivalent to awk 'BEGIN{FS=" "} {print $1}'. It sets a delimiter and selects a column or subset of columns.

        • Don't hesitate to ignore this stuff once it gets really hairy and just write a quick Python script. Doing complicated things in bash is often a fool's errand.
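
A hedged sketch of the text tools above on a tiny made-up table (a word and a number per line). For the longest line it uses awk's length() instead of the empty-FS trick, which is a swapped-in alternative, not what the text uses.

```shell
# Sketch: sort, uniq, grep, sed and awk on a small made-up data file.
data=$(mktemp)
printf 'foo 3\nbar 10\nfoo 3\nbaz 2\nfood 7\n' > "$data"

# uniq only collapses neighbors, so sort first for a true unique count.
unique_lines=$(sort "$data" | uniq | wc -l | tr -d ' ')

foo_lines=$(grep -E '^foo' "$data" | wc -l | tr -d ' ')     # lines starting with foo
not_foo=$(grep -v -E '^foo' "$data" | wc -l | tr -d ' ')    # -v inverts the match

# Space-delimited columns, sort on column 2, numeric, descending.
biggest=$(sort -t ' ' -k 2,2 -rn "$data" | head -n 1)

# Extended-regex edit: lines starting with bar or baz now start with qux.
renamed=$(sed -E 's/^ba(r|z)/qux/' "$data" | grep -c '^qux')

# Length of the longest line in characters.
longest=$(awk '{ print length($0) }' "$data" | sort -rn | head -n 1)

echo "unique=$unique_lines foo=$foo_lines not_foo=$not_foo biggest=$biggest renamed=$renamed longest=$longest"
```
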

    • Text editors

      • For a shell-based text editor, which you should avoid, but sometimes must, use nano. vim and emacs are madness.

      • If you accidentally open vim, you can quit it by pressing escape, then typing ":", then "q", then enter (":q!" if it complains about unsaved changes). No kidding.

      • If you accidentally open emacs, you can quit by holding control and pressing x, then c.

      • If you must pick one, pick emacs.

    • Downloading/Uploading stuff:

      • Use curl or wget. I don't think wget is on Macs by default, but if you are using a package manager WHICH YOU SHOULD BE you can grab it as easily as brew install wget.

      • Use scp (ugh) or rsync (yea!) to upload/download stuff from remote sites. rsync is the best because it uses a sophisticated diffing algorithm to send minimal data, which is useful if it's something you keep updating a lot but only incrementally, like a code directory.

    • Multiprocessing

      • Pressing ctrl-z while a command is running (imagine a long-running process like a big download or search) pauses that process and gives you your prompt back. Not very useful on its own. But the bg command lets it continue running in the background while you keep using this terminal. The fg command brings it back to the foreground, putting you back in the action and unable to use your terminal until it finishes. You can do this with more than one process. It's often useful when you want to download a few things at once, etc.

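      • For more, check out ending a command with & (it runs in the background from the start), plus jobs and wait. fork is the system call underneath all this, not something you type at the shell.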
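
The ctrl-z / bg / fg routine is interactive, so here is a sketch of the scripted cousin: a trailing & starts a command in the background right away, and wait blocks until the background jobs finish. The two "downloads" are faked with sleep so it runs offline, and the output file names are made up.

```shell
# Sketch: run two slow commands at once with & and collect them with wait.
# The sleeps stand in for long downloads; a.txt and b.txt are hypothetical.
out=$(mktemp -d)

( sleep 1; echo "first done"  > "$out/a.txt" ) &
( sleep 1; echo "second done" > "$out/b.txt" ) &

echo "both jobs are running; the shell is free for other work"
wait                               # returns once every background job has exited

finished=$(cat "$out/a.txt" "$out/b.txt" | wc -l | tr -d ' ')
echo "$finished jobs finished"
```

Because the two jobs overlap, the whole thing takes about one second rather than two.
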

    • Slightly more advanced / miscellaneous:

      • xargs is very slept on, but it lets you take a stream of incoming lines and repeatedly execute a command on them. With the -I option it gives a lambda-like syntax where you introduce a free variable. For example, if you want to copy a bunch of files in the current directory, matching a certain prefix, to the target directory DEST, you could do ls | grep -E '^PREFIX' | xargs -I X cp X DEST.

      • tr will translate specific characters one to another.

      • join can do SQL style joins on specific columns and stuff.

      • check out cut, paste etc. I do most of it with awk or Python.
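
A minimal sketch of xargs -I, tr and cut from this section, on made-up files in temp directories. It uses {} as the placeholder where the text uses X, only because a randomly named temp path could happen to contain a capital X.

```shell
# Sketch: xargs -I, tr and cut. File names and directories are hypothetical.
src=$(mktemp -d)
dest=$(mktemp -d)
cd "$src"
touch run_1.log run_2.log notes.txt

# Copy only the files starting with "run_"; xargs fills in {} once per line.
ls | grep -E '^run_' | xargs -I {} cp {} "$dest"
copied=$(ls "$dest" | wc -l | tr -d ' ')

# tr turns spaces into newlines, so wc -l ends up counting words.
words=$(echo 'the quick brown fox' | tr ' ' '\n' | wc -l | tr -d ' ')

# cut plucks one delimited column out of each line.
second=$(echo 'a,b,c' | cut -d ',' -f 2)

echo "copied=$copied words=$words second=$second"
```
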

  • General coding stuff

    • Use git repositories for everything (even non-code stuff like this document) and commit more often than you think you should. It's always nice to have a full history.

    • The command line git is hard to use. Use SourceTree, it's free, incredibly slow, and lets you do all kinds of git stuff like revert files and commit things without losing your mind learning git commands. Branching is also good to work on new features without mauling your existing code, but I've been known to just copy-paste instead.

    • Put your projects in a folder that is automatically backed up remotely. I currently use Dropbox, which I hate, and should probably switch to Google Drive or whatever doesn't take days for a fresh sync.

    • Use IDEs, not text editors. Even the most die-hard emacs fan will eventually crumble upon seeing the power of IntelliJ and the like. IntelliJ Community Edition is free to all, or if you have a student email you can get the professional edition for free. I don't think I've ever used any of the features of the professional edition. One of the best features (besides type-based autocomplete that actually works) is being able to "Go To Definition" of functions that are only in libraries you are referencing. Looking at library code to see what it does is often more useful than googling a bunch of confusing documentation. USE "GO TO DEFINITION" A LOT I CANNOT OVEREMPHASIZE HOW MUCH EASIER THIS WILL MAKE YOUR LIFE. The IDE will also point out syntax errors, type errors, etc., before you even run the code.

    • If a language doesn't have an IDE or an IntelliJ plugin, use Sublime Text with the appropriate packages, or maybe there's a Jupyter notebook addon that handles the language, which can be very cool.

    • Learn the shortcuts of your IDE to do things like rename symbols, search for symbols, classes, files, look up definitions of functions/classes or binding points of variables, find the usages of a function/class/variable. Hotkeys are good and don't hesitate to customize them. A lot of them are not good to begin with. Why is "go to definition" command-b and not command-d for "definition"? etc.

    • Use Jupyter notebooks. There are more languages supported now than just Python, and the interactivity is really awesome, and there are now plugins to basically build web apps with gui widgets that interact with code.

    • Python

      • Use Jupyter notebooks a lot to develop new ideas, interactively explore, and make little algorithms and peek at data before moving it into a more rigorously curated home in an IntelliJ project. Do this iteratively, you can import external Python modules into Jupyter.

      • Use PyCharm for your main development. I think it's basically just the IntelliJ Python plugin packaged as a standalone product. The community edition is free, professional edition with a student email is free. All rules from "general coding stuff" apply.

      • Use the pip package manager, or Anaconda, or both (recommended), to get libraries and manage Python package environments. virtualenv is another option for environments; I just use Anaconda instead, but virtualenv could be good for distribution. This will make your life much easier. More detail in the data science section.

    • Scala

      • Use the IntelliJ Scala plugin.

      • Don't get too cute with the module system and make yourself a total mess of code, but don't get too uncute either, otherwise you might as well be using Java. Higher kinded types are wild, as are first class modules.

      • The collections are slow. For loops are slow. Many things are compiled to methods that should be simple member lookups, like private vars in traits.

  • Scientific computing / "data science" / machine learning

    • Python

      • We like Python for data science because of libraries like scipy, numpy, matplotlib, tensorflow, pandas, jupyter, lots of other stuff. Otherwise we'd be using a language whose for loops weren't glacial and stuff. It's all about the tooling and libraries. I would be using Scala or OCaml all day otherwise.

      • I recommend installing Anaconda as a combination environment manager / package manager specifically for scientific computing. It lets you create different environments with their own Python versions and package versions, etc. Useful for checking out what package upgrade broke your project, as well as generally having complete control over your Python, especially in situations where you don't have root, like on clusters. Sometimes Anaconda doesn't have an up-to-date package. You can always use pip in this case, or sometimes the package will make you build it from source and install it yourself. This should still happen safely in your current Anaconda environment. Anaconda rules.

      • Learn to love numpy. Get comfortable with vectorizing your code (turning it into batch operations on (pseudo)tensors). It's the way numerical algorithms are programmed these days, when possible, to allow GPU acceleration.

      • scipy has crazy functions you won't believe. Never assume some wild polygamma function or something hasn't been implemented.

      • matplotlib is great for exploring data, visualizing. Especially in combination with jupyter.

      • I use TensorFlow for most machine learning since I use neural networks a lot. Some people use PyTorch. It has its advantages, but TensorFlow's "eager" mode is an attempt to capture those advantages too. I will bet on the enormous team run by Jeff Dean to outpace everything in the end, and it's the only library that can run on the Google TPU cloud, I imagine.

      • Scikit-learn is a great library for general machine learning that has some good implementations of gnarly algorithms and can be very useful. For example when doing data analysis I wanted to try some sparse inverse covariance estimation, and Scikit-learn did it with no problem using its GLasso implementation.

      • graphviz can let you do some cool visualizations of graph data, especially inside jupyter notebooks.

      • I don't currently use pandas much but apparently it is amazing for manipulating and exploring data.

    • LaTeX

      • Any good scientist is going to have to write up some gnarly papers at some point, that's what tex is for.

      • I recommend the Sublime Text tex plugins when editing locally, paired with Skim to view the PDFs. Skim can sync up with the editor when the document renders, scroll to the right spot, etc.

      • Usually though, you might as well edit your tex in the cloud. Overleaf is great for this. It even works for collaborative editing like Google Docs and will render previews in the browser.

      • When you just want to write little snippets of math without worrying about a document, LaTeXiT is an amazing tool. You can also easily drag/export/copy created formulae out of it in various formats, including vectorized. Very useful when typesetting math for presentations that you'll end up doing in Keynote or PowerPoint, too, or pasting into emails.

    • Non-LaTeX Documentation

      • Sometimes you just need to write some notes. I recommend writing them in a straight up text file, a shared Google Doc, or maybe in markdown like this one. I'm editing this in markdown in jupyter and you can use jupyter to render it to html, etc.