Blog

datasciY.com

What's New?

PyTorch Deep Learning from Scratch - SciPy-2020 tutorial

I am writing this post in the middle of this tutorial. SciPy virtual conference is going great! :-D It's amazing how well it is going, given that it's the first virtual conference during the Covid-19 pandemic. YouTube will have the corresponding video open to the public, maybe in a few days. There are 3 pre-recorded sets of videos already available from Enthought on YouTube. More playlists on biology will be released tonight at 6PM ET. Enjoy! :-D

Did the Big Bang never happen?

By chance I came upon videos giving persuasive arguments that The Big Bang may never have happened! I thought the Big Bang theory was gospel since around 1990. It's upsetting to think that we could have been so wrong. Accepting fuzzy numbers from experiments when they matched our expectations, and throwing out results after over-zealous scrutiny when they contradicted our expectations, seem to have been the culprit once again. This is extremely disheartening. Since I know how extremely accurate the experiments are, I can't believe we could have made such a gigantic error over several decades. Wow, if even astrophysics experiments can be distorted by confirmation bias, then I don't know what hope we have for more mundane experiments in economics, public health policy, and politics. Sigh...

Andrew Ng interview by Lex Fridman (MIT)

A good, recent interview of Andrew Ng. He is a co-founder of Coursera.org. He has taught DeepLearning.AI classes on Coursera during the past several years. I found it interesting that he likes to take handwritten notes to learn. He summarizes what he is listening to rather than writing everything down verbatim. This helps him slow down and actively use his mind to make each concept concrete. The interview was posted on YouTube on Feb 20, 2020.

A Good Book about the daily life of a software developer

There are many YouTube videos out there that say you can become a software developer or a data scientist with concentrated self-study in 9 months. There are also many boot camps that promise a $100,000+ coding job after finishing their 6-9 month program that costs $20,000 - $30,000. How realistic is the 6-9 month self-study plan or boot camp program? And even after you have done all that work, paid the money, and got the job, how do you know if you will be happy with your decision?

That's where this book comes in. YouTube videos can't provide the kind of deep, fact-filled analysis that a full-length book can. The author uses his own experience from working at various programming jobs after graduating from CMU with an engineering degree. He also uses the experience of his college friends and work friends to tell a more holistic story. I listened to the entire book over 2 days. I found it very helpful. While I am focused on a data science career, which is somewhat different from software development, I found it easy to apply the book's lessons to my situation. If you are contemplating a career in software or related fields, I highly recommend this book, Software Developer Life by David Xiang.

OK, now this is my own summary and opinion after having watched many, many YouTube videos on getting a job in software or data science. If you are most interested in getting a $100,000+ job in the shortest time possible, and you have a non-technical background (i.e., you don't have an engineering, statistics, math, or finance degree, a computer science bachelor's or master's, or a PhD in math or physics), then your best chance comes from moving to Silicon Valley and going to a well-known boot camp for Front-End Web Developers. Next, using your newly acquired Silicon Valley network, send out job applications to large companies with plenty of money, e.g., Uber, Netflix, YouTube (Google), Facebook, and Apple. Your success rate will be about 50%, as of the 2018 boot camp graduating class, and declining. But this is still the best success rate available for non-technical people willing to study really hard during the boot camp and hustle like mad afterwards to land that first job. And if you get hired, you will make $100,000 to $115,000 as your starting salary. This option may not be possible for many people. The living cost in Palo Alto or Mountain View is astronomical. It REALLY helps if you can sleep on someone's couch for free. A tent in someone's backyard costs $1,000 to $2,000 a month to rent. Oh, and the average job search time post boot camp seems to be 4-6 months (for the 50% who were successful), and you will need to continue to support yourself during that time.

Those with a PhD in Math or Physics seem to go for a Data Science position. It is the highest-paying option that does not require moving to Silicon Valley. They seem to take about 1 to 1.5 years for study and job search, with about 6-9 months of it full-time after quitting their job. Their starting salary seems to be around $90,000 to $125,000 and is primarily dependent on location. Big cities pay more. There is no information on the number of people with PhDs who failed to land a job as a Data Scientist after putting in the effort. These are highly intelligent and motivated people, but most of them already have jobs. So I would guess the successful transition rate will be lower than for the boot camp students who are all-in. Maybe 30% are successful?

For someone between the two above options, there is the low-paying or slow route. Many people with no technical degrees have successfully transitioned into Front-End Developer jobs after 1 year of full-time study, mostly via some form of formal schooling. Starting salary outside of Silicon Valley ranges from $35,000 to $50,000. Big cities pay more here too. If you want to become a Data Scientist and you have not programmed before, it will take longer. I am guessing about 2 years of full-time study to become competent in standard Python and Python data science libraries and machine learning concepts and related math. If you have not taken classes in calculus, linear algebra, and probability, you may need to add 6 months. By that time you will know enough to build a data science portfolio, which may take another 6 months. Job search will also take about 6 months, but some of this can be done concurrently while building up your portfolio.

PyData NY 2019 Conference

I attended the PyData NY 2019 conference this week. It was a wonderful learning opportunity. I especially enjoyed the tutorials. (See below my selfies at PyData and Times Square, New York City.)

Me at PyData. Me at Times Square.

Tutorials of Note:

  1. Michoel Snow, Hacking the Data Science Challenge (interviewing): Abstract 1 and Github 1.
  2. Stanley van der Merwe and Petr Wolf, From Raw Recruit Scripts to Perfect Python in 90 minutes: Abstract 2, Github 2, YouTube link, and tutorials-pdf.
  3. Carlos Afonso, Visualizing the 2019 Measles Outbreak in NYC: Abstract 3 and Github 3.

Bayesian Statistics:

I also heard about a new Bayesian approach to Statistical Inference that sounded interesting. There is a free class taught by Professor Richard McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan. I had a friend who was doing his PhD on the Bayesian Approach to Statistics. What I remember is that this approach makes a richer use of prior knowledge and estimates and finds ways to use them directly in the statistical model. Professor McElreath emphasizes that being able to reject a null hypothesis does not necessarily lead to our goal of positively accepting our research hypothesis.

"Rethinking: Is null hypothesis significance testing (NHST) falsificationist?
NHST is often identified with the falsificationist, or Popperian, philosophy of science. However, usually NHST is used to falsify a null hypothesis, not the actual research hypothesis. So the falsification is being done to something other than the explanatory model. This seems the reverse from Karl Popper’s philosophy." (See Statistical Rethinking book, page 5.)
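As a small illustration of what "using prior knowledge directly in the statistical model" can look like, here is a minimal sketch of the textbook Beta-Binomial update. It is not taken from Professor McElreath's course (which uses R and Stan); the function names and the numbers (a Beta(2, 2) prior, 7 successes, 3 failures) are made up for this example.

```python
from fractions import Fraction


def beta_binomial_update(prior_alpha, prior_beta, successes, failures):
    """Combine a Beta(prior_alpha, prior_beta) prior with observed counts.

    For a Beta prior and binomial data, the posterior is again a Beta
    distribution whose parameters are the prior parameters plus the counts.
    """
    return prior_alpha + successes, prior_beta + failures


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return Fraction(alpha, alpha + beta)


# Hypothetical prior belief: the success rate is probably near 50%,
# but we are not very sure about it.
prior = (2, 2)

# Hypothetical data: 7 successes and 3 failures.
posterior = beta_binomial_update(*prior, successes=7, failures=3)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", beta_mean(*posterior))
```

The prior pulls the estimate a little toward 50%, so the posterior mean lands between the prior mean and the raw success rate in the data. With more data, the prior matters less.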

Advances in Financial Machine Learning book

While I was at the NYC PyData Conference, I discussed a book for using Machine Learning to make money in 2019. The book is called Advances in Financial Machine Learning, by Marcos Lopez de Prado, c 2018 from Wiley.

It is a very opinionated book. And I don't agree with many of the author's views. But it offers an interesting look inside the mind of a hedge fund manager trying to use Machine Learning to make money in 2018. The author pans the use of natural language processing to conduct sentiment analysis on earnings calls, and of satellite image processing to obtain product delivery or sales quantity estimates. The author thinks the low-hanging fruit from those methods is gone. He believes in processing raw trading data from the exchanges to discover trading fingerprints of humans vs algo traders, as well as other classes of traders. This can be used to create spread trading strategies where one class of traders can be expected to outperform another class of traders, under specific economic conditions. The Machine Learning part is used to automate raw data processing, where huge volumes are processed for the small nuggets of silver.

The code used in the book's examples seems to be Python, but without PEP8 styling. A group of people have tried to translate the author's code examples into fully finished coding exercises. See Github link above.

Update on Long-Term Capital failure:

Originally in my chat with other attendees at PyData NY where I discussed this book, I was also explaining my take on the Long-Term Capital failure, and why I thought that they failed in an unusual way. Since then, I already got 2 posts from people commenting on the Slack channel.

In brief, I read in an article that Fischer Black told someone close to him that the reason he decided not to join Long-Term Capital was that he thought their strategy boiled down to shorting liquidity. I have no way to verify whether he said this. However, after many years of thinking about it, I came to agree that LT Capital failed primarily because they were short liquidity. This is unusual. Most failures are primarily due to market risk or credit risk. Although almost all failures do have a liquidity risk component, this is a short-term effect caused by deteriorating asset values. In most failures, the main invested assets are later discovered to be fundamentally flawed and lose significant value. This did not happen in the LT Capital failure. The fund ran out of time, but the bulk of the positions were later sold at a profit. I was working at the SEC when LT Capital failed and was involved in the wrap-up. I think we can only have an imperfect understanding of what happened then, even though this case has been extensively studied and reported on.

Incidentally, you may also be interested in the book, When Genius Failed: The Rise and Fall of Long-Term Capital Management. Amazon link.

Hacktoberfest 2019

Woohoo! It's Octoberfest for Hackers again. Register and submit 4 pull requests. Get started on creating a habit of frequent GitHub commits. The first 50,000 participants to finish get a free T-shirt from DigitalOcean. Last year, I got reacquainted with GitHub through Hacktoberfest and a little help from NOVA Women Who Code.

Update: I finished my four pull requests on October 25, 2019, and was able to get my Hacktoberfest 2019 t-shirt from DigitalOcean. This year about 60,000 people finished the challenge, and only the first 50,000 got t-shirts. So I was lucky to get one. (My t-shirt arrived yesterday, Nov 14, 2019, yeah!!!)

I also submitted a pull-request to geosnap, which is a neighborhood economic analysis package. I had more trouble with this repo because I am unfamiliar with the code base. See Github geosnap repo.

Understanding Convolutions - Otavio Good's talk on Word Lens

While studying deep learning with fast.ai, I came across a really good video that demonstrates a convolution in action. In a convolution layer, a small grid crawls across the source image to produce an output image layer that is a combination of the source and a filter array. Each output value is the dot product of a small scanned area of the input image and the filter. The filter (also called a kernel) is most often a 3 x 3 array of numbers. These frequently act as vertical, horizontal, or diagonal edge detectors. The filters themselves are products of previous machine learning steps. Watch Otavio Good demonstrate how a convolution layer recognizes the letter "A."
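To make the "small grid crawling across the image" idea concrete, here is a minimal NumPy sketch. It is my own toy example, not code from the video or from fast.ai. The 5 x 5 image and the vertical-edge filter values are hand-picked for illustration, whereas in a real network the filter values are learned. It uses no padding and a stride of 1, and like most deep learning libraries it does not flip the filter (strictly speaking, that makes it a cross-correlation).

```python
import numpy as np


def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` and take a dot product at each position.

    No padding and a stride of 1, so the output is smaller than the input.
    """
    kernel_h, kernel_w = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The small area of the image currently under the filter.
            patch = image[i:i + kernel_h, j:j + kernel_w]
            # Element-wise multiply and sum = dot product of patch and filter.
            output[i, j] = np.sum(patch * kernel)
    return output


# Toy image: dark (0) on the left, bright (1) on the right.
image = np.array([
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
], dtype=float)

# Hand-picked 3 x 3 filter that responds to dark-to-bright vertical edges.
vertical_edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

feature_map = convolve2d_valid(image, vertical_edge_kernel)
print(feature_map)
```

The output is large where the filter sits over the dark-to-bright edge and zero where the image under the filter is uniform.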

Google acquired the Word Lens app and its development team in May 2014. It's now part of Google Translate. It can translate written signs and text (not handwriting) using your phone's camera in real time. It's really handy when travelling and trying to make sense of foreign-language public signs on the street and at museums. ;-)

My Books in August 2019

Books

This is the current state of my coding bookshelf. On the top shelf there are financial coding, Python & R machine learning, coding interview, algorithms, and financial modeling books. The bottom shelf has references for R and C++ coding, financial risk management, and statistics, probability, and stochastic calculus. I like to study from several different books on the same topic. I find that different authors have varying approaches, and they work best in combination. Jake VanderPlas's Python Data Science Handbook (c 2017) is still my best book for learning Python data science libraries. It's my go-to book for NumPy, Matplotlib, Scikit-Learn, and Jupyter Notebook (for %magic and !shell commands).

Quantum Computing, 2019 National Academies Study

Quantum Computing Report cover
  • Date: March 30, 2019
  • National Academies of Sciences, Engineering, and Medicine, 2019, Quantum Computing: Progress and Prospects, Washington, DC.   https://doi.org/10.17226/25196

Scientists estimate the time to a working commercial quantum computer at 10 years to maybe never. Error correction needs of qubits pose unknown challenges. A free downloadable study on the state of quantum computing is available from The National Academies Press.

Easy Explanation on How A Quantum Computer Works

This is an old video from 2013, but it has an easy-to-understand explanation of how a quantum computer works. For n qubits, 2^n is the number of bit combinations that can theoretically be held in superposition at once. 2^300 is supposed to be a greater number than the number of [atoms] in the universe. But this is only useful for calculations that can make use of the superposition state. Also, for reading the final result, the quantum computer must drop back out of the superposition state into the normal state. For normal calculations, the quantum computer is projected to be slower than a regular computer.
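A quick sanity check on the 2^300 claim, using Python's arbitrary-size integers. The figure of 10^80 atoms is my assumption here: it is a commonly quoted rough estimate for the observable universe, not a number taken from the video.

```python
n_qubits = 300

# Number of bit combinations for 300 two-state units.
n_combinations = 2 ** n_qubits

# Assumption: a commonly quoted rough estimate of the number of atoms
# in the observable universe.
atoms_estimate = 10 ** 80

print("decimal digits in 2^300:", len(str(n_combinations)))
print("2^300 > 10^80:", n_combinations > atoms_estimate)
```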


Beginning of Probability Measure Theory

I think one of the most confusing and difficult parts of learning probability measure theory comes at the very beginning! Obviously this project is going to be very opinionated. :-) DeMorgan's Laws and other rules for calculating probabilities, which come after the beginning, are not that different from normal algebra. I think most people can follow along and understand the other parts, if they do not make the mistake of getting forever stuck on the starting definitions! We need to rename "probability space", "sigma-algebra", and all those Greek letters, to something more English-like and easier to remember. Anyway, I plan to post a very opinionated translation from Greek-Math-speak to Normal-English-speak.
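Until then, here is a tiny sketch of what those starting definitions amount to for a single roll of a fair die. This is my own toy example with made-up names: the "sample space" is the set of possible outcomes, the "sigma-algebra" is the collection of events we are allowed to ask about (here simply every subset), and the "probability measure" is a function that assigns each event a number. It also checks DeMorgan's Laws on two events.

```python
from fractions import Fraction
from itertools import chain, combinations

# Sample space: every possible outcome of one roll of a die.
sample_space = frozenset({1, 2, 3, 4, 5, 6})


def power_set(outcomes):
    """Every subset of `outcomes`, each one an 'event'."""
    items = list(outcomes)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return {frozenset(subset) for subset in subsets}


# Sigma-algebra: for a finite sample space we can use all subsets.
events = power_set(sample_space)


def prob(event):
    """Probability measure for a fair die: favorable outcomes / all outcomes."""
    return Fraction(len(event), len(sample_space))


def complement(event):
    """The event 'not `event`'."""
    return sample_space - event


even = frozenset({2, 4, 6})
high = frozenset({4, 5, 6})

# DeMorgan's Laws: not (A or B) == (not A) and (not B), and vice versa.
de_morgan_union = complement(even | high) == complement(even) & complement(high)
de_morgan_intersection = complement(even & high) == complement(even) | complement(high)

print("number of events:", len(events))
print("P(even or high):", prob(even | high))
print("DeMorgan's Laws hold:", de_morgan_union and de_morgan_intersection)
```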

k-Means Clustering Talk

ISLR book cover

I gave my first short talk on a data science subject to a local Meetup group this week. Here's a shout-out to the group, Serious Data Science. Thanks Deborah, Julius, Elsa, Peter, Dan and others. You guys are so supportive and kind! I don't think I would have read the ISLR book with such attention without all of you helping to keep my motivation high! :-) If you, Reader, live near Sterling, Virginia, please come and join this wonderful Meetup group. We meet monthly, on the evening of the 2nd Tuesday, at the REI Systems Inc. building.

GARP 20th Conference in NYC

I will be in NYC attending the 20th GARP Risk Conference. The agenda has several sessions on machine learning and AI along with the usual risk topics. I am interested in learning more about how data science and AI are being used by financial institutions. I will also catch up with my former colleagues from the SEC while I am there. Glad the scheduling worked out.

PyData DC 2018 and SciPy Austin 2018

I attended the PyData DC 2018 conference in Tysons Corner, VA over the weekend. I thoroughly enjoyed it. Everybody was very nice and welcoming towards relatively new programmers, like myself. I will post a write-up about several talks/software that caught my attention. This conference was more accessible for me than SciPy in July 2018 at Austin, TX. I come from a business background and have been learning Python and Data Science for only about 1.5 years. Many of the people I talked to at PyData had similar backgrounds. The SciPy community was more deeply into core Python package development, and its members were more advanced programmers. The majority seemed to have PhDs in a hard science or math field. For me personally, I learned more at the SciPy conference, in a "tough love" way. But I felt more of a sense of belonging and was happier at the PyData conference. I will also have a write-up of a couple of the tools/talks I found most useful from the SciPy 2018 conference.

Future Topics:

What I am working on: