Unpopular opinion: Causality is **not relevant** in the majority of #quantfinance modeling applications! “Successful prediction does not require correct causal identification.” Causal relationships are important if you want to **intervene** in a system. Quant traders are not
I'm #hiring a #Quantitative Analyst. Please see the link in the next post. We are a small #investment team working on a heterogeneous portfolio. We work with tabular, time series, and text data to support/reject investment views, boost investment team efficiency, measure and
Julia is the #DataScience and #MachineLearning language of the future. Look how easy it is to parallelize an expensive (say, feature engineering) function across columns. Incredible. #JuliaLang
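The Julia snippet itself isn't reproduced here; as a rough Python analogue (function and column names invented for illustration), mapping a per-column transform over a pool looks like:

```python
from concurrent.futures import ThreadPoolExecutor

def expensive_feature(col):
    # stand-in for an expensive per-column feature transform
    return [x * x for x in col]

columns = {"a": [1, 2, 3], "b": [4, 5, 6]}

# Map the transform over all columns in parallel.
# (For CPU-bound pure-Python work you'd reach for
# ProcessPoolExecutor instead of threads.)
with ThreadPoolExecutor() as pool:
    features = dict(zip(columns, pool.map(expensive_feature, columns.values())))
```

The appeal of the Julia version is that the same one-liner parallelism comes without the GIL caveats.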
Can the entire Python and PyData community please instantaneously decide and agree that we will all use `lets-plot` as the one and only plotting backend? Admit defeat: ggplot and the grammar-of-graphics approach are far superior, and nothing, until lets-plot, in the Python
@ChristophMolnar Classical ML techniques like SVM/SVR and KNN are making somewhat of a comeback these days due to NVIDIA's cuML library. What's old is new. For example,
I am hiring a financial data scientist! Lots of fun and interesting things to work on (NLP, noisy time series stuff, small data problems, graphical models, risk modeling, and, yes, some dashboarding). Please take a look at this posting! 📈🦾🙏 In NYC...
@__mharrison__ after a cell throws an error, execute %debug in the *next* cell and you get dropped into pdb at the point of error; nbdime for notebook diffs; mixing bash and Python, like `this_dir = !pwd`
I’ve been looking at sophisticated imputation strategies (probabilistic PCA, generalized low rank models; both loved by academics). The LightGBM iterative imputer by @analokmaus shown in Rob’s session blows those away. Amazing stuff to be found hidden in @kaggle notebooks.
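The core loop of any iterative imputer is simple; here is a minimal numpy-only sketch of the idea, with plain least squares standing in where the real imputer uses a LightGBM model per column:

```python
import numpy as np

def iterative_impute(X, n_iter=10):
    """Round-robin imputation: regress each column that has missing
    values on the remaining columns, refining the fills each pass.
    (Sketch only; real iterative imputers swap the linear model
    for something like LightGBM.)"""
    X = X.astype(float).copy()
    mask = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[mask] = np.take(col_means, np.where(mask)[1])  # warm start with column means
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            miss = mask[:, j]
            if not miss.any():
                continue
            others = np.delete(X, j, axis=1)
            # fit on observed rows (with an intercept), predict the missing ones
            A = np.column_stack([others[~miss], np.ones((~miss).sum())])
            coef, *_ = np.linalg.lstsq(A, X[~miss, j], rcond=None)
            B = np.column_stack([others[miss], np.ones(miss.sum())])
            X[miss, j] = B @ coef
    return X
```

Swapping the per-column model for gradient-boosted trees is what lets the Kaggle version capture nonlinear relationships that PPCA-style methods miss.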
🚀 This is tomorrow at 5pm CET! Learn all about handling missing values in tabular data from Kaggle Grandmaster Rob Mulla! 🎉 Here is the YouTube live link:
There will also be Q&A from the audience!
@sh_reya SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
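The warning is telling you to replace chained indexing with a single `.loc` call; a toy example of the fix:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

# Chained indexing like df[df["x"] > 1]["y"] = 0 writes to a
# possibly-temporary copy and triggers SettingWithCopyWarning
# (and may silently do nothing). One .loc call with
# (row mask, column) targets df itself:
df.loc[df["x"] > 1, "y"] = 0
```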
This is an excellent talk. Two hours of gold; no time wasted. Could be called the Zen of Pandas. @dontusethiscode makes great content for an intermediate audience. If you use @pandas_dev seriously and are frustrated 90%+ of the time (all of us), watch this.
Quant researchers often build strategies and test signals using rank IC; i.e., the ability to sort the universe effectively based on future performance. Here is a great new paper about using ML techniques from the “learning to rank” domain in this process.
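For concreteness, rank IC is just the Spearman correlation between a signal's cross-sectional ranks and the ranks of subsequent returns; a numpy-only sketch (argsort-of-argsort ranking, which ignores ties; `scipy.stats.spearmanr` handles ties properly):

```python
import numpy as np

def rank_ic(signal, fwd_returns):
    """Rank information coefficient: Spearman correlation between a
    signal's cross-sectional ranks and the ranks of forward returns.
    +1 means the signal sorts the universe perfectly."""
    r_sig = np.argsort(np.argsort(signal))
    r_ret = np.argsort(np.argsort(fwd_returns))
    return np.corrcoef(r_sig, r_ret)[0, 1]
```

Learning-to-rank methods optimize this sorting objective directly rather than minimizing pointwise prediction error.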
@SamAltsMan I read the health paper and it’s not encouraging. There is a short-lived benefit but after three years there is no difference between the health outcomes of the treatment and control groups. Right?
What's something that @kaggle competitors know but is not well appreciated by #machinelearning practitioners in industry? "Adversarial Validation": use it to reduce overfitting and make a model generalize better. I made a notebook on it for the Ubiquant competition.
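The idea fits in a few lines: label train rows 0 and test rows 1, fit a classifier to tell them apart, and read the AUC. A minimal numpy-only sketch (real pipelines typically use gradient-boosted trees instead of this hand-rolled logistic regression):

```python
import numpy as np

def adversarial_auc(X_train, X_test, n_steps=500, lr=0.1):
    """Adversarial validation: AUC near 0.5 means train and test look
    alike; AUC near 1.0 means distribution shift, so plain CV scores
    will be too optimistic at test time."""
    X = np.vstack([X_train, X_test]).astype(float)
    y = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-9)   # standardize
    X = np.column_stack([X, np.ones(len(X))])           # bias term
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):                            # logistic regression via GD
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    scores = X @ w
    s_train, s_test = scores[y == 0], scores[y == 1]
    # AUC = P(test score > train score), ties counted half
    gt = (s_test[:, None] > s_train[None, :]).mean()
    eq = (s_test[:, None] == s_train[None, :]).mean()
    return gt + 0.5 * eq
```

Features the classifier leans on most are the ones that drifted; dropping or reweighting them is the usual follow-up.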
TIL that *inside the Bloomberg terminal* you can launch a @ProjectJupyter Lab Python session, write queries against arbitrary Bloomberg data, build screens, run alpha factors, etc., and visualize. Very nice integration by @TechAtBloomberg
I finished in the top 16% in the @kaggle M5 Accuracy competition. No medal, but an enjoyable effort over the past few weeks. Plus I did my submission 💯 in #JuliaLang — very performant, nice for EDA, dead simple parallelism. Looking forward to the next competition. 😊
My new @kaggle kernel for trade selection in Jane Street: I train a @PyTorchLightnin model simultaneously on *multiple return targets* with horizons up to the final horizon. This is a corollary to @lopezdeprado's triple barrier method. 1/N
I took this class this past Fall. It’s outstanding. It goes from rigorous theory tracing the history of consensus from the 1980s to today, progressing all the way up to DeFi and bleeding-edge topics like layer 2/scaling, optimism, zk, validium (note: no coding).
Lecture 1 of my Foundations of Blockchains lecture series is now available: (Will try to post one new lecture a week for the next 2-3 months.)
tl;dr thread below:
1/12
@amasad I absolutely love Replit and support lifelong learning but this particular example seems like a recipe for disaster. Massive technical debt build up incoming. “Non coder” who thinks devs are slow (because they are writing tests, thinking about maintainability, considering design
I am doing some research on MEV and came across a YouTube video which promises "$1200/day in profits with Frontrun Bot on Uniswap Mempool". Just copy his code, connect Metamask, deploy with Remix, deposit ETH into the contract, and click "Start". What could go wrong??? 1/N
Anyone looking at the @kaggle Jane Street competition? I’m working on a kernel to make sense of the anonymized features. A hierarchical rank-correlation matrix maps clusters to feature metadata tags. Some clues emerging.
@Thom_Wolf @AnthropicAI @cohere Command R+ is great. We need a better term for “open-source model but with a highly restrictive license”. A true open-source model is MIT- or Apache-licensed.
Join Jonathan Larkin at • NUMERCON • 1 April 2022 • San Francisco •
@jonathanrlarkin
is a Managing Director at Columbia Investment Management Co., LLC.
Register for in-person and remote:
I'm slowly digesting, internalizing, reading and re-reading, watching YouTube, etc., content on #causalinference, both general (e.g., Book of Why, Statistical Rethinking) and specific to finance (e.g., the LdP causal factor paper). This is a different paradigm to me, so it's slow
I'm "all-in" on foundation models (LLMs/diffusion models). Their abilities have surpassed all expectation; anyone who says otherwise is moving goal posts. To remain grounded I remind myself of Weizenbaum's distinction between deciding and choosing. FMs are deciding not choosing.
I published my first public @kaggle kernel! Can you infer the risk model used to residualize returns given raw data and the residual? I explore this with the latest @twosigma competition data. #Kaggle #KernelsAward
This is such an obvious winner. The Python data scientist is expected to know all sorts of devops stuff and how to scale models to the cloud. JuliaHub’s forthcoming one-click cluster deployment is 🔥 and lets data scientists focus on...data science. #JuliaLang
We still haven't made JuliaHub's new compute capabilities available broadly. But every day I use it internally, I feel like I have a supercomputer attached to my local VS Code #julialang session. Learn more by signing up for the webinar.
🔥 New (1h56m) video lecture: "Let's build GPT: from scratch, in code, spelled out."
We build and train a Transformer following the "Attention Is All You Need" paper in the language modeling setting and end up with the core of nanoGPT.
This paper by Tucker Balch et al. is 🔥! Portfolio inference: given only the time series of a fund's returns, learn which stocks the strategy held??!! Novel application of #machinelearning in finance. "Sequential Oscillating Selection" solves a 500-choose-30 problem in seconds.
Looking at some portfolio construction stuff closely after a long absence. This package is spectacular and faithful to how a proper institutional quant thinks about the process.
“This paper applies a denoising filter to the whole time series before predicting it, meaning that each point has information from the future in it. And the authors also added trading costs to their PL” and other gems 😂🎁
In finance, data is small and signal is low. Does #machinelearning work in such a setting? In deep learning we see overparameterized models memorize the training set and *not* overfit. 🤔 Is double descent applicable to the financial domain? Read this.
Looking thru some old code today. Came across my implementation of long/short portfolio optimization under a historical CVaR (expected shortfall) constraint. Love these kinds of problems! #quantfinance
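For reference, historical CVaR at level alpha is just the mean loss over the worst (1 - alpha) tail of the empirical distribution; the optimization bounds this quantity. A minimal numpy sketch:

```python
import numpy as np

def historical_cvar(returns, alpha=0.95):
    """Historical CVaR (expected shortfall): the average loss in the
    worst (1 - alpha) tail of the empirical return distribution,
    reported as a positive number."""
    losses = -np.asarray(returns, dtype=float)
    var = np.quantile(losses, alpha)      # historical VaR threshold
    tail = losses[losses >= var]          # losses at or beyond VaR
    return tail.mean()
```

Because the historical CVaR constraint is convex in portfolio weights, the full long/short problem can be written as a linear program (the Rockafellar-Uryasev formulation).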
A causal DAG can be very useful in *some* financial applications, e.g., trade execution, where your action changes the state (i.e., the limit order book). But in longer-horizon problems where the agent is a price taker, not so much.
Transfer learning applied to quant trading! “In a few big regional markets, such as S&P 500, ...., QuantNet showed 2-10 times order of magnitude improvement in Sharpe and Calmar” #MachineLearning #quantitative #finance
This is one of the most exciting areas of quant finance research right now. If synthetic data can work, it’s a game changer for alpha discovery and finding the optimal policy in reinforcement learning for portfolio management.
@eliasbareinboim Wow, thank you for such a thoughtful and complete response. Twitter/X hasn’t typically been a forum for such dialogue. I’ll do my best to work through the sources you noted! Cheers, Jonathan
This paper has been making the rounds. While LLMs will almost surely be impactful in assisting investors, a significant red flag here is that all the alpha comes from the short side. This is often an indicator that the alpha is a mirage and can’t be captured in practice.
A ChatGPT model generated a 500% return in the stock market (trading options) over a 15 month period by assigning a sentiment score to news articles about publicly traded companies.
Research by University of Florida's Dept. of Finance ↓
@BreveStonder That’s funny. A (non-technical, finance) colleague asked me what single thing they could do to become baseline-literate as a data analyst and I recommended the excellent @datacarpentry class
@evalparse This is a great thread. This is one of the key reasons I’ve been spending time with #JuliaLang: the promise of being able to modify the internals of an ML algorithm directly w/out touching C/C++ or Cython.
"Multiple comparisons bias and p-hacking" (bad!) vs "model selection via cross validation" (good!)??? Why isn't CV, which is trying N models in an automated way, just as bad as trying N models...manually? Finally groked this by reading
Fascinating #pydatanyc talk: HDF5 vs Zarr... pros/cons; chunked/compressed out-of-core data packages. “HDF5 codebase is almost as old as me“ 😂 @__qualname__ has a way of going super deep into low-level CS complexities but presenting in a way where I (sort of) understand!
The Ubiquant @kaggle competition is a good one. It's faithful to what a strategist/portfolio manager in a large quant firm does (in some business models). I've been working on some ideas. Please check them out and comment. #quantfinance
"The Man Who Solved the Market":
@GZuckerman
quotes Jim Simons “astrophysicists make great [
#finance
] quants bc they can’t do live experiments—they work with
#data
.” Example: great
@PyData
keynote by
@profsaraseager
: finding signal of exoplanets in noise
@dingding_peng Philip … Pip for short. In Great Expectations, the kid was named Philip, and called Pip. Then when you feed him, you can say things like “Pip, install food”. 🤷‍♂️
Excellent #pydatanyc talk by @Sasamos: Uncertainty in #MachineLearning. Want uncertainty estimates? Want to use your favorite model? Use the `quantile` loss function. Also, `predict_proba(...)` most often doesn't give you proper probabilities...Calibrate first.
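The quantile (pinball) loss mentioned in the talk is short enough to write out; a numpy sketch:

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Quantile (pinball) loss: an asymmetric penalty minimized when
    y_pred is the q-th conditional quantile of y_true. Train the same
    model at q=0.05 and q=0.95 to get a 90% uncertainty band."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * err, (q - 1.0) * err)))
```

At q=0.9, under-predicting by 1 costs 0.9 while over-predicting by 1 costs only 0.1, which is what pushes the fitted prediction up toward the 90th percentile.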
I like this alpha research approach to mitigate p-hacking... elegant idea: just calculate all the permutations of choices you can make! The distribution of the results shows how robust (or not) your alpha is.
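The mechanics are a few lines of stdlib Python; a sketch with hypothetical research choices (names invented) that enumerates every combination instead of cherry-picking one:

```python
from itertools import product

# Each combination of these (hypothetical) choices is one backtest.
choices = {
    "lookback":  [20, 60, 120],
    "weighting": ["equal", "vol_scaled"],
    "universe":  ["top500", "top1000"],
}

configs = [dict(zip(choices, combo)) for combo in product(*choices.values())]
# Run every config and study the *distribution* of results, rather
# than reporting only the best one (which is p-hacking).
```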
@therealcritiq @tszzl This is a great paper which should be getting much more visibility: a robot uses Stable Diffusion to hallucinate a scene and then creates the scene IRL. Truly embodied intelligence. More than just an LLM.
As a transition from debunking disinformation to kaggling, here is a thread debunking several myths about Kaggle, including lack of relevance to the real world, overfitting, AutoML performance on Kaggle, etc. Bear with me. 1/N
In 2004 the fastest super computer *in the world* (IBM Blue Gene/L) clocked in at 70.7 tflops. My machine learning workstation with dual RTX 3090’s finally arrived today… 71.1 tflops. Moore’s law in action.