Welcome to my data science projects portfolio.
This portfolio is very much a work in progress. When it is complete, my goal is for it to cover the full spectrum of the data science process using Python, SQL, Excel-VBA, Amazon Web Services (AWS), and Google Colaboratory (Colab).
My main area of interest is in applying data science tools to bring value to the financial derivative securities industry, the financial risk management industry, and the economic policy and financial regulation industries. I am also interested in visual deep learning as it is applied to brain segmentation image analysis (e.g., Janelia.org), to geospatial intelligence analysis (e.g., NGA.mil), and to geospatial image analysis for climate change (e.g., Planet Labs).
Some of the specific types of analyses I will be performing include: decision trees and random forests, principal component analysis (PCA), k-means clustering, sentiment analysis, natural language processing (NLP), linear regression, logistic regression, time-series analysis, econometrics, big data cloud computing, deep learning and convolutional neural networks (CNNs), and image recognition.
I am working on a deep learning project to study the Amazon rainforest using satellite images from Planet Labs. Semi-annual high-resolution data is provided free of charge for the 2019 to 2022 period thanks to special funding. This will be an end-to-end deep learning project. I have two collaborators (Dan and Peter) who are providing me with feedback. I started a wiki on GitHub; see the link above. I decided to document this project as I go, because I feel the need to organize my work as the project gets large and unwieldy. I also think it would be too much work to document everything at once if I left it to the end. =.P
Practice exercises using NumPy n-dimensional arrays.
I am working on a demo of Matplotlib commands that clearly separates the "ax" (object-oriented) method from the "plt" (MATLAB-style) method. I found this area confusing when I was learning it, so perhaps other students can benefit from following my trail.
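Until the full demo is ready, here is a minimal sketch of the two styles side by side. The data and titles are made up for illustration, and the "Agg" backend is only an assumption so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumed headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Made-up sample data for illustration
x = [0, 1, 2, 3, 4]
y = [v ** 2 for v in x]

# "plt" (MATLAB-style) method: pyplot keeps track of a "current" figure
# and axes, and every plt.* call acts on them implicitly.
plt.figure()
plt.plot(x, y)
plt.title("pyplot style")
plt.xlabel("x")
plt.ylabel("x squared")
plt_axes = plt.gca()  # the implicit axes that the calls above drew on

# "ax" (object-oriented) method: keep explicit Figure and Axes objects
# and call methods on them directly.
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title("object-oriented style")
ax.set_xlabel("x")
ax.set_ylabel("x squared")
```

Both halves draw the same line. The difference is that the "ax" version always says which axes it is modifying, which tends to matter once a figure has more than one subplot.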
Demo to follow.
Dash allows you to build a web app with your own customized sliders, radio buttons, text input, and user-selected data sorting and filtering, while Plotly has built-in default chart types with zoom, pan, expand/collapse, and data filtering already included.
More to follow.
Passenger information from the Titanic ship is a common data set used in machine learning (ML). Here I use Python and data science libraries to find patterns in the data and build a prediction model. Then I use various visualization libraries to create pretty figures.
To be continued later.
A picture showing conceptually the bias-variance tradeoff in machine learning.
A test result with a bias problem is one where the machine learning model's predictions systematically miss the true mean. See bottom-left target in image above. A test result with a variance problem is one where the predictions are too widely distributed to provide a meaningful indicator to the decision maker. See top-right target. In most modeling situations, there is a tradeoff between hitting the true mean and reducing the variability around that true mean. Generally, it is not possible to optimize both at once. See top-left target. But it is possible to get poor results on both measures through a poor choice of model parameters. See bottom-right target.
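The four cases can also be illustrated numerically. The sketch below is not taken from the course; it simulates hypothetical predictions around an assumed true value of 0, with made-up offsets and spreads, and reports the bias, the variance, and the mean squared error for each case.

```python
import random
import statistics

random.seed(42)   # fixed seed so the made-up numbers are repeatable
TRUE_VALUE = 0.0  # assumed true mean that the model is trying to hit


def simulate(offset, spread, n=2000):
    """Draw n hypothetical predictions centered at TRUE_VALUE + offset."""
    return [random.gauss(TRUE_VALUE + offset, spread) for _ in range(n)]


# Offsets and spreads are illustrative assumptions, not fitted values
cases = {
    "low bias, low variance": simulate(0.0, 0.5),
    "high bias, low variance": simulate(3.0, 0.5),
    "low bias, high variance": simulate(0.0, 3.0),
    "high bias, high variance": simulate(3.0, 3.0),
}

summary = {}
for name, preds in cases.items():
    bias = statistics.mean(preds) - TRUE_VALUE   # how far off on average
    variance = statistics.pvariance(preds)       # scatter around own mean
    mse = statistics.mean((p - TRUE_VALUE) ** 2 for p in preds)
    summary[name] = (bias, variance, mse)
    print(f"{name}: bias={bias:.2f} variance={variance:.2f} mse={mse:.2f}")
```

In each case the mean squared error equals the squared bias plus the variance, so either problem alone is enough to make the predictions unreliable.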
Source: Pierian Data, Udemy.com, Python Machine Learning Data Science Boot Camp
A demonstration of reading a text file using the 'rt' (read text) and 'rb' (read binary) modes. For reverse-order and non-sequential reads, 'rb' binary mode is usually faster. Seeking relative to the end of the file or to the current position (with a nonzero offset) is only available in binary mode.
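A minimal sketch of the difference, using a small temporary file with made-up content (the file name and text are assumptions for illustration):

```python
import io
import os
import tempfile

# Write a small sample file to a temporary folder (hypothetical content)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wb") as f:
    f.write(b"first line\nsecond line\nthird line\n")

# 'rt' (read text): bytes are decoded to str
with open(path, "rt", encoding="utf-8") as f:
    first_text = f.readline()
    try:
        f.seek(-5, os.SEEK_END)  # nonzero seek relative to end of file
        text_relative_seek_ok = True
    except io.UnsupportedOperation:
        text_relative_seek_ok = False  # text mode refuses this seek

# 'rb' (read binary): raw bytes, relative seeks are allowed
with open(path, "rb") as f:
    f.seek(-11, os.SEEK_END)  # jump to 11 bytes before the end
    tail = f.read()           # the last line, as bytes
    f.seek(0)                 # back to the start
    f.seek(6, os.SEEK_CUR)    # move 6 bytes forward from current position
    word = f.read(4)

print(repr(first_text), text_relative_seek_ok, tail, word)
```

The relative seeks are what make reverse-order and non-sequential reads practical, since the program can jump straight to the bytes it needs without reading the whole file first.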
Later, I will cover reading from JSON, CSV, Excel, and SQL formats. I will also cover scraping data directly from websites.
While studying Andrew Ng's Deep Learning AI classes on Coursera.org, I needed a way to download an entire class at once, with all of the supporting images and data files. Using the GUI, I had to click and download each file one by one, and some of the Jupyter notebooks have many supporting image files. Large datasets can't be downloaded at all due to a download size limit on the server. Maybe Coursera will offer an easier download solution in the future. But for now, the "tar" and "cat" Linux commands work on Coursera's Linux server and in my computer's Linux terminal. (I use Windows Subsystem for Linux.) It's important that "tar" and "cat" are default tools built into Linux, since I don't have permission to install any new tools on Coursera's server. This method may work for non-Coursera classes as well, wherever you have access to a server-side Linux terminal or a Jupyter notebook with "!" Linux shell command capability.
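The exact commands are in the notebook. As a rough illustration of the idea only, here is a Python sketch that uses the standard library instead of the shell tools. It assumes a workflow of bundling a folder into one archive, splitting it into parts that fit under a size limit, and rejoining the parts after download (the rejoining step is what "cat" does with a list of files). The folder name, file names, and chunk size are all made up.

```python
import os
import tarfile
import tempfile

# Build a small hypothetical class folder to stand in for the real one
work = tempfile.mkdtemp()
course_dir = os.path.join(work, "course")
os.makedirs(os.path.join(course_dir, "images"))
with open(os.path.join(course_dir, "notebook.ipynb"), "w") as f:
    f.write("{}")
with open(os.path.join(course_dir, "images", "fig1.png"), "wb") as f:
    f.write(os.urandom(5000))  # random bytes as a placeholder image

# Step 1: bundle the whole folder into one compressed archive
archive = os.path.join(work, "course.tar.gz")
with tarfile.open(archive, "w:gz") as tar:
    tar.add(course_dir, arcname="course")

# Step 2: split the archive into parts below an assumed size limit
CHUNK_BYTES = 2048  # made-up limit for illustration
parts = []
with open(archive, "rb") as src:
    while True:
        chunk = src.read(CHUNK_BYTES)
        if not chunk:
            break
        part_path = f"{archive}.part{len(parts):03d}"
        with open(part_path, "wb") as dst:
            dst.write(chunk)
        parts.append(part_path)

# Step 3: after downloading the parts, join them back in order
rejoined = os.path.join(work, "rejoined.tar.gz")
with open(rejoined, "wb") as dst:
    for part_path in parts:
        with open(part_path, "rb") as src:
            dst.write(src.read())

# Step 4: confirm that the rejoined archive lists the original files
with tarfile.open(rejoined, "r:gz") as tar:
    names = sorted(tar.getnames())
print(len(parts), names)
```

On a server where Python is available but new tools can't be installed, this uses only the standard library, in the same spirit as relying on tools that are already built in.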
This is an example of a Jupyter notebook with "Open in Colab" and "Run in Colab" badges. Google Colab is a free machine learning resource using the Jupyter notebook interface. You can run deep learning models on GPU machines and lighter machine learning models on CPU machines. The badges open my example notebook stored on GitHub.com. You can see the badge code by changing the format of the cell containing the badge to "raw". Feel free to copy it into your own notebooks. Change the href link to point to your own GitHub account and file path.
The example notebook shows Matplotlib plotting functions. The first figure is an example of a 3-dimensional projection of sine and cosine waves. The second figure is a histogram with three overlapping data series combined into one image. Alpha transparency is set to moderate opacity.
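For readers who want to try something similar without opening the notebook, here is a minimal sketch of both kinds of figure. It is not the notebook's code: the data is made up, the 3D curve is just one way to plot sine and cosine together, and the alpha value of 0.5 is an assumed stand-in for "moderate opacity".

```python
import math
import random

import matplotlib
matplotlib.use("Agg")  # assumed headless backend; drop this line in a notebook
import matplotlib.pyplot as plt

# Figure 1: sine and cosine plotted against t on 3D axes
t = [i * 0.1 for i in range(200)]
sin_t = [math.sin(v) for v in t]
cos_t = [math.cos(v) for v in t]

fig1 = plt.figure()
ax3d = fig1.add_subplot(projection="3d")
ax3d.plot(t, sin_t, cos_t)
ax3d.set_xlabel("t")
ax3d.set_ylabel("sin(t)")
ax3d.set_zlabel("cos(t)")

# Figure 2: three overlapping histograms on one set of axes
random.seed(0)
series = {
    "series A": [random.gauss(-2.0, 1.0) for _ in range(500)],
    "series B": [random.gauss(0.0, 1.0) for _ in range(500)],
    "series C": [random.gauss(2.0, 1.0) for _ in range(500)],
}

fig2, ax = plt.subplots()
for label, values in series.items():
    # alpha below 1 lets the overlapping bars show through each other
    ax.hist(values, bins=30, alpha=0.5, label=label)
ax.legend()
```

Without the alpha setting, the last series drawn would hide the parts of the other two that it overlaps.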