2024 Summer Research Reading
2024.06.07, Medium post
How to Build a RAG System with a Self-Querying Retriever in LangChain
Idea: Why can't I use natural language to query a movie based more on the vibe or the substance of the movie, rather than just a title or actor?
e.g. search: I liked “Everything Everywhere All at Once”, give me a similar film, but darker.
Action: Build Film Search, a RAG-based system that takes a user's query, embeds it, and does a similarity search to find similar films. Different from vanilla RAG: it uses a self-querying retriever.
Ability: allow filtering movies by their meta ...
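Below is a minimal sketch of what this setup looks like in LangChain. Exact import paths shift between LangChain versions, and the toy film documents, metadata fields, and model choices are my own illustrative assumptions, not the post's actual code (the self-query translator also needs the `lark` package installed):

```python
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Toy corpus: each film is a Document whose metadata can be filtered on.
docs = [
    Document(
        page_content="A laundromat owner is swept into a multiverse adventure.",
        metadata={"title": "Everything Everywhere All at Once", "year": 2022, "genre": "sci-fi"},
    ),
    Document(
        page_content="A dark, mind-bending thriller about identity and doubles.",
        metadata={"title": "Enemy", "year": 2013, "genre": "thriller"},
    ),
]
vectorstore = Chroma.from_documents(docs, OpenAIEmbeddings())

# Describe the metadata fields so the LLM can turn natural language
# into a structured filter on top of the semantic search.
metadata_field_info = [
    AttributeInfo(name="title", description="Film title", type="string"),
    AttributeInfo(name="year", description="Release year", type="integer"),
    AttributeInfo(name="genre", description="Film genre", type="string"),
]

retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(temperature=0),
    vectorstore=vectorstore,
    document_contents="Brief plot summary of a film",
    metadata_field_info=metadata_field_info,
)

docs_found = retriever.invoke(
    "I liked Everything Everywhere All at Once, give me a similar film, but darker"
)
```

The difference from vanilla RAG is that the LLM first rewrites the query into a semantic search string plus a structured metadata filter, so constraints like genre or year are applied exactly rather than being left to the embedding.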
Organizations to Publish Papers
ACL: The Association for Computational Linguistics (ACL) is a scientific and professional organization for people working on natural language processing. Paper submission deadline: mid-February each year.
EMNLP: Empirical Methods in Natural Language Processing (EMNLP) is a leading conference in the area of natural language processing and artificial intelligence.
ICML: The International Conference on Machine Learning (ICML) is a leading international conference in machine learning.
Number of Clusters in K-means
K-Means: to form k clusters, first randomly choose k centroids, then assign each point to the cluster whose centroid is nearest, then recompute each centroid as the mean of the points in its cluster, and repeat until the assignments stabilize.
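A plain-NumPy sketch of those steps (my own illustration, not from the book):

```python
import numpy as np

def kmeans(X: np.ndarray, k: int, n_iter: int = 100, seed: int = 0):
    """K-means exactly as described above: random init, assign, re-average."""
    rng = np.random.default_rng(seed)
    # Randomly choose k points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to the cluster with the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its cluster's points
        # (keep the old centroid if a cluster comes up empty).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # converged
            break
        centroids = new_centroids
    return labels, centroids
```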
How will you define the number of clusters in a clustering algorithm?
Why is this important? Clustering is a technique for grouping objects with similar characteristics; clustering algorithms group data points into clusters based on their similarities.
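One common way to choose k (offered here as a standard answer, not necessarily the book's) is the elbow method: run K-means for a range of k and look for the point where the within-cluster sum of squares stops dropping sharply.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters, just for illustration.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

ks = range(1, 10)
inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in ks
]

# The "elbow" in this curve suggests a reasonable k.
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("k (number of clusters)")
plt.ylabel("within-cluster sum of squares (inertia)")
plt.show()
```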
Book Notes: Be the Outlier (How to Ace Data Science Interviews)
Notes from the book Be the Outlier: How to Ace Data Science Interviews.
Chapter 4: Modeling and Machine Learning Questions
Practice Question 1 — Overfitting in Predictive Models
Practice Question 2 — Number of Clusters in K-means
Overfitting in Predictive Models
Why is overfitting important?
We always look at prediction error when building a prediction model. Prediction error can be decomposed into bias and variance errors.
Bias: the difference between the forecast ($\hat{y}$) and the actual value ($y$) we are trying to predict.
Variance: the variability of the forecasted value; it gives an estimate of the spread of the model's predictions.
Underfitting: high bias, low variance. The model cannot capture the trend of the data; this can happen due to insufficient data or too few features. ...
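A small experiment (my own sketch, not from the book) makes the trade-off concrete: fitting polynomials of increasing degree to noisy data, degree 1 underfits (high bias), while a very high degree overfits (train error keeps falling while test error climbs).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine data: the true trend is nonlinear.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    print(
        f"degree={degree:2d}  "
        f"train MSE={mean_squared_error(y_tr, model.predict(X_tr)):.3f}  "
        f"test MSE={mean_squared_error(y_te, model.predict(X_te)):.3f}"
    )
```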
JHU NLP CS 601-671 Notes Collection
Below are the notes I took from the CS 601-671 course, NLP: Self-Supervised Models. Semester taken: Spring 2024.
Connecting Language to the World
NLP: Connecting Language to the World
connecting vision and language
generative vision-language models
others (speech, audio)
from language to code
from language to action
1. Connecting vision and language
History:
1960s: first CV project.
2000s: shallow classifiers and feature engineering.
2012: deep learning revolution; CNNs on ImageNet; unification of architectures; rise of image generation (VAEs, GANs, etc.).
2020s: era of vision transformers.
How to encode images? Vision Transformers (ViT). Image to patches (matrices; e.g., you have differen ...
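A minimal NumPy sketch of that patching step (assuming a 224x224x3 image and 16x16 patches as in ViT-Base; this is my illustration, not course code):

```python
import numpy as np

def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """(H, W, C) image -> (num_patches, patch_size * patch_size * C) matrix."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    P = patch_size
    # Cut into a grid of P x P tiles, then flatten each tile to a vector.
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

image = np.random.rand(224, 224, 3)   # dummy image
tokens = patchify(image, 16)
print(tokens.shape)                   # (196, 768): 14*14 patches, 16*16*3 dims
```

Each flattened patch is then linearly projected to the model dimension and treated as a token, exactly as words are in a text transformer.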
01-Conversion Rate
Since the courses need to be purchased to get the dataset, I will study all the projects in this series from JifuZhao, who shared the related ipynb work on GitHub. Reference link for this one: https://github.com/JifuZhao/DS-Take-Home/blob/master/01.%20Conversion%20Rate.ipynb
The usual structure for data analysis work (a code sketch follows this list):
1 State the Issue/Target
2 Collect the Data
3 Data Cleaning and Preprocessing
4 Exploratory Data Analysis
5 Feature Engineering
6 Build a Model to Solve the Issue
7 ...
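A hedged skeleton of that workflow for this project; the conversion_data.csv file name and the age/converted columns are assumptions based on the linked notebook, not verified here:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# 2. Collect the data.
df = pd.read_csv("conversion_data.csv")

# 3. Clean: drop duplicates and implausible rows (e.g. extreme ages).
df = df.drop_duplicates()
df = df[df["age"] < 100]

# 4. EDA (abbreviated): check how imbalanced the target is.
print(df["converted"].value_counts(normalize=True))

# 5. Feature engineering: one-hot encode the categorical columns.
X = pd.get_dummies(df.drop(columns="converted"))
y = df["converted"]

# 6. Build a model to solve the issue.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)
print(classification_report(y_te, clf.predict(X_te)))
```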
Some Applied Statistics Notes about Outliers
From the JHU AMS class EN.553.613 ASDA, Fall 2023.
Noticing outliers:
Plot residuals ($e_i = y_i - \hat{y}_i$) vs. $X$ or $\hat{y}$
box plot
dot plot
stem plot
Flag observation $i$ as an outlier if $|e_i^*| > 4$, where
$e_i^* = \frac{e_i - \bar{e}}{\sqrt{MSE}} = \frac{e_i}{\sqrt{MSE}}$ is the semistudentized residual, and
$MSE = \frac{SSE}{n-p} = \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n-p}$,
where $p$ is the number of estimated parameters.
We have
$\mathrm{Var}(e_i) = \sigma^2(1 - h_{ii})$,
$\mathrm{Cov}(e_i, e_j) = -\sigma^2 h_{ij}$,
$h_{ij} = X_i(X^TX)^{-1}X_j^T$,
$h_{ii}$ (leverage) ...
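A NumPy sketch of these diagnostics on synthetic data (my own illustration, not from the course):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 2                                 # p parameters: intercept + slope
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + rng.normal(scale=1.0, size=n)
y[5] += 15                                   # plant one outlier

X = np.column_stack([np.ones(n), x])         # design matrix with intercept
beta = np.linalg.lstsq(X, y, rcond=None)[0]  # OLS fit
y_hat = X @ beta
e = y - y_hat                                # residuals e_i

H = X @ np.linalg.inv(X.T @ X) @ X.T         # hat matrix: h_ij = X_i (X'X)^-1 X_j'
mse = e @ e / (n - p)                        # MSE = SSE / (n - p)
e_star = e / np.sqrt(mse)                    # semistudentized residuals

print("flagged outliers (|e*| > 4):", np.where(np.abs(e_star) > 4)[0])
print("max leverage h_ii:", H.diagonal().max())
```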