datasciY.com

- Author: Jennifer Yoon
- Contact email: "datasciY.info@gmail.com"
- Resume: data science-PDF
- LinkedIn Profile: JenniferYoon
- GitHub Profile: JennEYoon
- GitHub DatasciY repo: datasciY
- GitHub Machine Learning and Deep Learning repo: learn-mldl
- Stack Overflow Profile: Jennifer E. Yoon, user:4693491

Welcome to my data science coding projects portfolio. This portfolio is very much a work in progress. When it is complete, my goal is for it to cover the full spectrum of the data science process, using Python, Excel and VBA, SQL databases, Amazon Web Services (AWS) and Google Cloud.

My main area of interest is in applying data science tools to bring value to the financial derivative securities industry, the financial risk management industry and the economic policy setting industry. I am also interested in visual deep learning as it is applied to brain segmentation image analysis (e.g., Janelia.org) and to geospatial intelligence analysis (e.g., NGA.org).

Some of the specific types of analyses I will be performing include: decision trees and random forests, principal component analysis (PCA), k-means clustering, sentiment analysis, natural language processing (NLP), linear regression, logistic regression, time-series analysis, econometrics, big data cloud computing, deep learning, and convolutional neural networks (CNNs).

WIP

Dash is an interactive charting app for the web that can be built using Python. No JavaScript required. Dash is built on top of the Plotly chart definitions. Python developers can use many of Plotly's chart styles in their default mode to create beautiful, interactive charts. Website visitors can zoom in or out of the chart, seeing details or a summary view. Full customization is available via Plotly's open-source GitHub repo.

Dash allows you to build a **web app** with your own customized sliders, radio buttons, text input, and user-selected data sorting and filtering, while Plotly provides **built-in default chart types** with zoom, pan, expand/collapse and data filtering already included.

To be added.

I will post exercises using NumPy and Pandas to clean and explore input data. I will cover reading from JSON, CSV and Excel file formats. I will also cover scraping data directly from websites, from HTML-formatted or PDF-formatted tables.

- NumPy Exercise 1 (html), GitHub
- NumPy Exercise 2 (py), GitHub
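
Until the full exercises are posted, here is a minimal sketch of the kind of read-and-clean step described above. It is not taken from the linked exercises. The inline CSV and JSON text, the column names, and the `load_and_clean` helper are all hypothetical stand-ins for real files.

```python
from io import StringIO

import pandas as pd

# Hypothetical inline data standing in for real CSV and JSON files.
CSV_TEXT = """name,age,city
Ann,34,Boston
Bob,,Chicago
Cara,29,
"""

JSON_TEXT = '[{"name": "Ann", "score": 88}, {"name": "Bob", "score": 92}]'


def load_and_clean(csv_text: str, json_text: str) -> pd.DataFrame:
    """Read both sources, fill in missing values, and join them on 'name'."""
    people = pd.read_csv(StringIO(csv_text))
    scores = pd.read_json(StringIO(json_text))

    # Replace a missing age with the column median and a missing city with a label.
    people["age"] = people["age"].fillna(people["age"].median())
    people["city"] = people["city"].fillna("unknown")

    # A left join keeps every person, even those without a score.
    return people.merge(scores, on="name", how="left")


merged = load_and_clean(CSV_TEXT, JSON_TEXT)
print(merged)
```

With real files, the `StringIO` wrappers would be replaced by file paths passed to `pd.read_csv` and `pd.read_json`.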

Algorithm efficiency is studied using Big-O notation. Generally, a lower order of growth is preferred: O(log(n)) beats O(n), which in turn beats O(n*log(n)), O(n**3), or O(n!). For sorting, the ideal order is **O(n)**, but this is rarely achieved; comparison-based sorts need on the order of n*log(n) comparisons in the general case. O(n) means that as the number of inputs grows, the time to execute grows linearly. My sorting algorithm uses a binary tree with a central pivot point and recursive calls to itself. I use this algorithm to study Big-O behavior.

Example to post later.
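
In the meantime, here is a rough sketch of a recursive sort with a central pivot, in the spirit of the description above. It is a plain quicksort written for illustration, not the portfolio's own implementation. The `quicksort` function and its `counter` argument are assumptions made for this example. Counting how many values are checked against a pivot gives a simple way to watch the work grow with input size.

```python
import random


def quicksort(items, counter):
    """Return a sorted copy of items; counter[0] tallies values checked against a pivot."""
    if len(items) <= 1:
        return list(items)
    pivot = items[len(items) // 2]  # central pivot
    smaller, equal, larger = [], [], []
    for value in items:
        counter[0] += 1  # one value placed relative to the pivot
        if value < pivot:
            smaller.append(value)
        elif value > pivot:
            larger.append(value)
        else:
            equal.append(value)
    # Each recursive call handles one side of the pivot, like a branch of a binary tree.
    return quicksort(smaller, counter) + equal + quicksort(larger, counter)


random.seed(0)
for n in (100, 1_000, 10_000):
    data = [random.random() for _ in range(n)]
    counter = [0]
    result = quicksort(data, counter)
    assert result == sorted(data)
    print(n, counter[0])
```

On random input the printed counts should grow somewhat faster than n but far slower than n**2, which is the pattern expected of an O(n*log(n)) algorithm.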

- Functions - Pass functions as inputs to another function. Function Exercise 1
- Functions - *args, **kwargs, defaults, and argument order. To do.
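
As a placeholder for the two function topics above, here is a small sketch. The names `apply_twice`, `add_three`, and `describe` are made up for illustration and are not from Function Exercise 1.

```python
def apply_twice(func, value):
    """Call func on value, then call it again on the result."""
    return func(func(value))


def add_three(x):
    """A simple function to pass into apply_twice."""
    return x + 3


def describe(name, greeting="Hello", *args, **kwargs):
    """Show where each kind of argument ends up.

    Parameter order: required positional, defaults, *args, then **kwargs.
    """
    return {"name": name, "greeting": greeting, "args": args, "kwargs": kwargs}


print(apply_twice(add_three, 10))
print(describe("Ann"))
print(describe("Ann", "Hi", 1, 2, lang="en"))
```

In the last call, `"Hi"` overrides the default greeting, the extra positional values land in `args`, and `lang="en"` lands in `kwargs`.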

Reference: www.geeksforgeeks.org

- Class Objects - Spaceship Class, Asteroid Class. To do.
- To do -- Pandas as Class Objects, NumPy as Class Objects.

Passenger information from the Titanic ship is a common data set used in machine learning (ML). Here I use Python and data science libraries to find patterns in the data and build a prediction model. Then I use various visualization libraries to create pretty figures.

- View html version of Jupyter notebook: Titanic-NB-HTML
- Download from GitHub, full Jupyter notebook: GitHub Titanic-NB
- Tags: exploratory data analysis (EDA), machine learning (ML), graphics, logistic regression
- Data: Kaggle Titanic data, https://www.kaggle.com/c/titanic/data
- Reference: Rossant, Cyrille, *IPython Interactive Computing and Visualization Cookbook*, 2nd ed., Packt Publishing, 2018, pp. 299-304.

To be continued later.

A picture showing conceptually the bias-variance tradeoff in machine learning.

A test result with a bias problem is one where the machine learning model's predictions consistently miss the true mean. See the bottom-left target in the image above. A test result with a variance problem is one where the predictions are too widely scattered to give the decision maker a meaningful indicator. See the top-right target. In most modeling situations, there is a tradeoff between hitting the true mean and reducing the variability around it. The ideal is low bias together with low variance, as in the top-left target, but it is generally not possible to optimize both at once. It is, however, possible to get poor results on both measures from a poor choice of model parameters. See the bottom-right target.
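
The four targets can be imitated with a small simulation. This is only a sketch: the `simulate` function, the offsets, and the spreads are hypothetical values chosen for illustration, not figures from the course. Each scenario draws predictions around a shifted center and reports the bias, the variance, and the mean squared error, which equals the bias squared plus the variance.

```python
import random
import statistics

TRUE_MEAN = 0.0  # the bullseye


def simulate(offset, spread, n=2000, seed=0):
    """Draw n predictions centered at TRUE_MEAN + offset with the given spread."""
    rng = random.Random(seed)
    shots = [rng.gauss(TRUE_MEAN + offset, spread) for _ in range(n)]
    bias = statistics.fmean(shots) - TRUE_MEAN
    variance = statistics.pvariance(shots)
    mse = statistics.fmean([(s - TRUE_MEAN) ** 2 for s in shots])
    return bias, variance, mse


# Hypothetical (offset, spread) pairs, one per target in the picture.
scenarios = {
    "top-left: low bias, low variance": (0.0, 0.2),
    "top-right: low bias, high variance": (0.0, 2.0),
    "bottom-left: high bias, low variance": (3.0, 0.2),
    "bottom-right: high bias, high variance": (3.0, 2.0),
}

for label, (offset, spread) in scenarios.items():
    bias, variance, mse = simulate(offset, spread)
    print(f"{label}: bias={bias:.2f} variance={variance:.2f} mse={mse:.2f}")
```

The mean squared error is small only in the first scenario, where both the bias and the variance are small.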

Source: Pierian Data, Udemy.com, Python Machine Learning Data Science Boot Camp

- Udemy class link
- Image from section 16 link. Log in to access.