Welcome to my data science coding projects portfolio, which is very much a work in progress. When the portfolio is complete, my goal is for it to cover the full spectrum of the data science process using Python, Excel and VBA, SQL databases, Amazon Web Services (AWS) and Google Cloud.
My main area of interest is in applying data science tools to bring value to the financial derivative securities industry, the financial risk management industry and the economic policy setting industry. I am also interested in visual deep learning as it is applied to brain segmentation image analysis (e.g., Janelia.org) and to geospatial intelligence analysis (e.g., NGA.org).
Some of the specific types of analyses I will be performing include: decision trees and random forests, principal component analysis (PCA), k-means clustering, sentiment analysis, natural language processing (NLP), linear regression, logistic regression, time-series analysis, econometrics, big data cloud computing, deep learning, and convolutional neural networks (CNNs).
Dash allows you to build a web app with your own customized sliders, radio buttons, text input, and user-selected data sorting and filtering, while Plotly provides built-in default chart types with zoom, pan, expand/collapse and data filtering already included.
To be added.
I will post exercises using NumPy and Pandas to clean and explore input data. I will cover reading from JSON, CSV and Excel file formats. I will also cover scraping data directly from websites, from either HTML-formatted tables or PDF-formatted tables.
Algorithm efficiency is studied using Big-O notation. Generally, an order of log(n) is preferred over an order of n, n*log(n), n**3, or n!. An O(n) algorithm is one where, as the number of inputs grows, the time to execute grows linearly. For sorting, O(n) is rarely achieved: comparison-based sorts need at least on the order of n*log(n) comparisons in the worst case. In my sorting algorithm, I use a binary tree with a central pivot point and recursive function calls. I use this algorithm to study Big-O behavior.
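As a rough illustration of how such a study might look, here is a minimal sketch of a recursive sort that partitions around the middle element (a quicksort-style approach) and tallies how many elements it examines. This is an assumption about what a "central pivot" sort could look like, not the portfolio's actual code; the names `pivot_sort` and `count_comparisons` are hypothetical.

```python
import math
import random


def pivot_sort(items, counter):
    """Sort a list by recursively partitioning around the middle element.

    `counter` is a one-element list; counter[0] is incremented once for
    every element examined during a partition pass.
    """
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]  # central pivot
    smaller, equal, larger = [], [], []
    for value in items:
        counter[0] += 1
        if value < pivot:
            smaller.append(value)
        elif value > pivot:
            larger.append(value)
        else:
            equal.append(value)
    # Recurse on the two sides; elements equal to the pivot are already placed.
    return pivot_sort(smaller, counter) + equal + pivot_sort(larger, counter)


def count_comparisons(n, seed=0):
    """Sort n random floats and return the number of elements examined."""
    rng = random.Random(seed)
    data = [rng.random() for _ in range(n)]
    counter = [0]
    result = pivot_sort(data, counter)
    assert result == sorted(data)  # sanity check against the built-in sort
    return counter[0]


if __name__ == "__main__":
    # If the work grows like n*log(n), the last column should stay roughly flat.
    for n in (1_000, 2_000, 4_000, 8_000):
        work = count_comparisons(n)
        print(n, work, round(work / (n * math.log2(n)), 2))
```

On random input the ratio in the last column should stay roughly constant as n doubles, which is the empirical signature of n*log(n) growth. Adversarial inputs can still push a pivot-based sort toward n**2.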
Example to post later.
Passenger information from the Titanic is a common data set used in machine learning (ML). Here I use Python and its data science libraries to find patterns in the data and build a prediction model. Then I use various visualization libraries to create pretty figures.
To be continued later.
A picture showing conceptually the bias-variance tradeoff in machine learning.
A test result with a bias problem refers to a case where the machine learning model's predictions systematically miss the true mean. See the bottom-left target in the image above. A test result with a variance problem refers to a case where the machine learning predictions are too widely distributed to provide a meaningful indicator to the decision maker. See the top-right target. In most modeling situations, there is a tradeoff between hitting the true mean and reducing the variability around that true mean. Generally, it is not possible to optimize both at once; the top-left target shows the ideal of low bias and low variance. But it is possible to get poor results on both counts from a poor choice of model parameters. See the bottom-right target.
Source: Pierian Data, Udemy.com, Python Machine Learning Data Science Boot Camp
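The four targets can also be mimicked numerically. The sketch below is a one-dimensional stand-in for the picture, not code from the course: each hypothetical "model" predicts a true value of 0 with a systematic offset (bias) and a random spread (variance). The offsets, spreads, and the function name `summarize` are made up for illustration. It also checks the standard identity that mean squared error equals bias squared plus variance.

```python
import random
import statistics

TRUE_VALUE = 0.0  # the bullseye

# (label, systematic offset, spread): hypothetical numbers chosen for illustration
SCENARIOS = [
    ("low bias, low variance (top-left)", 0.0, 0.2),
    ("low bias, high variance (top-right)", 0.0, 2.0),
    ("high bias, low variance (bottom-left)", 3.0, 0.2),
    ("high bias, high variance (bottom-right)", 3.0, 2.0),
]


def summarize(offset, spread, n=10_000, seed=0):
    """Simulate n predictions and return (bias, variance, mse)."""
    rng = random.Random(seed)
    predictions = [TRUE_VALUE + offset + rng.gauss(0.0, spread) for _ in range(n)]
    bias = statistics.fmean(predictions) - TRUE_VALUE
    variance = statistics.pvariance(predictions)  # spread around the predictions' own mean
    mse = statistics.fmean([(p - TRUE_VALUE) ** 2 for p in predictions])
    return bias, variance, mse


if __name__ == "__main__":
    for label, offset, spread in SCENARIOS:
        bias, variance, mse = summarize(offset, spread)
        # mse should match bias**2 + variance up to floating-point error
        print(f"{label}: bias={bias:.2f} variance={variance:.2f} "
              f"mse={mse:.2f} bias^2+var={bias**2 + variance:.2f}")
```

Because the total error splits into these two terms, a model can score badly by being consistently off target, by being scattered, or both.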