Phil Profile
Phil

@phill__1

Followers
1,016
Following
231
Media
148
Statuses
917

Currently working on Tech Support AI

Joined January 2019
@phill__1
Phil
10 months
@OpenAI @axelspringer @politico @BusinessInsider @BILD @welt Oh no, those are Germany's worst newspapers. ChatGPT browsing is gonna be even worse if this is the main news source
24
49
4K
@phill__1
Phil
6 months
Whatever gpt2-chatbot might be, it definitely feels like gpt4.5. It has insane domain knowledge I have never seen before
54
59
1K
@phill__1
Phil
14 days
Wow, Nvidia just published a 72B model which is ~on par with Llama 3.1 405B in math and coding evals and also has vision 🤯
Tweet media one
33
143
1K
@phill__1
Phil
6 months
gpt2-chatbot is insane at ascii art, miles ahead of any other model
Tweet media one
39
72
783
@phill__1
Phil
14 days
Nice, now you can speedrun bankrupting your OpenAI wrapper
Tweet media one
34
23
786
@phill__1
Phil
2 months
Wow, Perplexity's code interpreter can now install libraries and display charts in the result! This enables many more use cases compared to ChatGPT's code interpreter, like creating stock market charts using yfinance
Tweet media one
Tweet media two
Tweet media three
16
55
501
@phill__1
Phil
6 months
The new GPT-4 Turbo model is the only one that could solve this math question: "Determine the sum of the y-coordinates of the four points of intersection of y = x^4 - 5x^2 - x + 4 and y = x^2 - 3x." Even Opus never gets close to the right answer
Tweet media one
@phill__1
Phil
6 months
@kimmonismus From my initial tests with difficult math questions, it's a huge jump, even beating Opus and also way better than the last GPT-4 Turbo model. My money is on them using artificial math data to train it, never seen any model being so competent in solving equations
3
3
72
15
30
251
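The intersection question quoted above can be checked by hand or numerically. The sketch below is only an illustration, not the method any of the models used: it sets the two curves equal, finds the four real roots by plain bisection (the brackets are hand-picked from sign changes at simple points), and compares the result with Vieta's formulas. The helper names `quartic` and `bisect` are made up for this example.

```python
# Setting x^4 - 5x^2 - x + 4 = x^2 - 3x gives x^4 - 6x^2 + 2x + 4 = 0.

def quartic(x: float) -> float:
    """Difference of the two curves; zero at each intersection."""
    return x**4 - 6 * x**2 + 2 * x + 4


def bisect(f, lo: float, hi: float, iterations: int = 200) -> float:
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2


# Hand-picked brackets: the quartic changes sign across each interval.
brackets = [(-3, -2), (-1, 0), (1, 1.5), (1.5, 3)]
roots = [bisect(quartic, lo, hi) for lo, hi in brackets]

# On the second curve, y = x^2 - 3x at every intersection point.
y_sum = sum(x**2 - 3 * x for x in roots)

# Vieta: sum of roots = 0 (no x^3 term), sum of pairwise products = -6,
# so sum of squares = 0^2 - 2*(-6) = 12 and sum of y = 12 - 3*0 = 12.
vieta_sum = 0**2 - 2 * (-6) - 3 * 0

print(roots)
print(y_sum, vieta_sum)
```

Both routes agree on a sum of 12, and the Vieta shortcut needs no root finding at all once the four intersections are known to be real.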
@phill__1
Phil
5 months
@ClementDelangue Nah, spend that money on buying stability AI. They actually have a product and team worth preserving
8
1
222
@phill__1
Phil
2 months
@anpaure This guy needs Adderall and then read some algorithms and data structures books
8
3
221
@phill__1
Phil
10 months
@Hayito @OpenAI @axelspringer @politico @BusinessInsider @BILD @welt Bild is constantly getting community noted, so yes, X is better because you can at least see the fact check
2
0
203
@phill__1
Phil
6 months
@JasonBotterill3 Opus has no clue, gpt4-turbo makes something up on how it works and gpt2-chatbot is right on. This is a function of a quite unknown addon of a small ERP system
Tweet media one
6
5
190
@phill__1
Phil
1 month
@OpenAI Great! How about we do this every day going forward
3
1
184
@phill__1
Phil
3 months
Llama 3.1 70B seems like the most interesting model launching tomorrow. HumanEval jumped from 39% to 79% between llama 3 and 3.1 70B
10
15
170
@phill__1
Phil
3 months
Google accidentally updated their website with Gemini 2.0 and Bing indexing caught it
Tweet media one
3
16
148
@phill__1
Phil
3 months
Llama 3.1 405B Benchmark Results (Leak): 📊 MMLU: 85.53% 🧮 GSM8K: 96.82% 💻 HumanEval: 85.37% 🔄 HellaSwag: 91.96% ❓ BoolQ: 90.89% 🧩 Winogrande: 86.74% 🔬 PIQA: 87.43% 📚 OpenBookQA: 90.80% Subsets: 🏛️ MMLU Social Sciences: 89.76% 🔬 MMLU STEM: 83.10% 📚 MMLU Humanities:
10
18
129
@phill__1
Phil
28 days
@chatgpt21 That's not o1, that's just a new gpt4o model specifically trained for chatgpt
3
2
125
@phill__1
Phil
3 months
There are currently at least 6 unreleased models in the lmsys arena:
- gemini-test-1 and gemini-test-2 (probably new Gemini 1.5 version, maybe Gemini 2.0)
- im-a-little-birdie (???)
- upcoming-gpt-mini (Gpt3.5o ?)
- column-r and column-u (cohere or deepseek ?)
- eureka-chatbot
8
13
117
@phill__1
Phil
6 months
A company with $60 million in funding can now build a model at GPT-4 level. The days when the training compute alone would cost hundreds of millions are over
@RekaAILabs
Reka
6 months
We evaluate Core on standard benchmarks for both text and multimodal, along with a blind third-party human evaluation.
Tweet media one
Tweet media two
9
26
154
12
5
106
@phill__1
Phil
11 months
@SZ Am I understanding this correctly: she is in favor of a ceasefire and shared information about a demonstration for a ceasefire? Is that antisemitic? I always thought that equating the Israeli government with all Jews was what's antisemitic
9
0
97
@phill__1
Phil
2 months
Well, Google just stole OpenAI's lunch money for GPT-4o-mini. Same performance, but Gemini Flash is 50% cheaper. There is zero MOAT around these small models, only pricing matters
Tweet media one
@kimmonismus
Chubby♨️
2 months
Now I'm curious. So far, Gemini has always been a little behind the competition. Can they now catch up?
6
1
88
0
9
96
@phill__1
Phil
7 months
@eigenrobot >3000 people die in NYC >changes geopolitical situation for decades, triggers multiple wars, makes the whole world a worse place >one Tsar Bomba could kill 10 million people in and around NYC >not that scary ???
2
0
89
@phill__1
Phil
8 months
@Nexuist I am so proud of my boi, keeping SQL schemas reasonable one step at a time
0
0
83
@phill__1
Phil
1 month
Still cheaper than Claude 3 opus
Tweet media one
11
5
83
@phill__1
Phil
6 months
@ItsTheBenzo No, there are zero sources, just speculation, but it is almost certainly an OpenAI model based on the way it writes.
3
0
82
@phill__1
Phil
5 months
Gemini Flash's benchmarks are as good as Claude 3 Sonnet's for $0.35 instead of $15 per million tokens! Google is acting like this is a tiny GPT-3.5-level model, but it's better than the original GPT-4
@hu_yifei
Yifei Hu
5 months
Gemini 1.5 Flash benchmarks. From:
Tweet media one
0
3
11
1
12
80
@phill__1
Phil
6 months
@Yampeleg Source: A misquoted guy in the Mistral discord talking about the command-r+ model, not the new Mixtral
Tweet media one
Tweet media two
0
2
80
@phill__1
Phil
6 months
The ASCII art is 1 to 1 copied from the internet, seems like gpt2 is just better at recalling training data exactly 🤔
@goodside
Riley Goodside
6 months
Human preference LLM arenas are poorly suited for evaluating ASCII art because the ASCII art that most impresses a human is often verbatim regurgitation of an existing human work and this is rarely true for text. Votes on ASCII art should be detected and thrown out IMO.
Tweet media one
Tweet media two
14
14
240
4
7
74
@phill__1
Phil
3 months
Llama 3.1 405B leaked!
4
11
71
@phill__1
Phil
3 months
@alexalbert__ Claude dot ai being able to run Python as a tool would take away the last USP of the competition 👀
1
0
70
@phill__1
Phil
3 months
Llama 3.1 405B Instruct model beats Claude 3.5 Sonnet for MMLU-Pro and MATH! Full list of Instruct benchmarks: 📚MMLU: 87.3% 🧠 MMLU (CoT): 88.6% 🎓 MMLU PRO (CoT): 73.3% 📝 IFEval: 88.6% 🔬 ARC-C: 96.9% 🧪 GPQA: 50.7% 📊 MuSR: 56.7% 💻 HumanEval: 89.0% 🖥️ MBPP++: 88.6% 🧮
4
8
68
@phill__1
Phil
3 months
Something is up here
Tweet media one
6
3
64
@phill__1
Phil
8 months
@PicHasso66 Every economist agrees that migration is good, no matter which school of thought, no matter whether right or left. But racism is simply more important to AfD voters than a better life.
13
1
63
@phill__1
Phil
1 year
@B_Pattern688 @DreamLeaf5 Great for the right, now they can scream antisemitic instead of calling autistic people r worded and they even get applauded
0
0
45
@phill__1
Phil
16 days
This is crazy, not only did the fork continue dev, they are even using the same bug report template with just the name swapped😭
Tweet media one
@CodeFryingPan
FRYING PAN
17 days
I just quit my 270 000$ job at Coinbase to join the first YCombinator fall batch with my cofounder @not_nang . We're building PearAI, an open source AI code editor. Think a better Copilot, or open source Cursor. But you've heard this spiel already... 🧵⬇️
Tweet media one
436
285
6K
5
5
46
@phill__1
Phil
5 months
@ns123abc Anthropic: *Science They are the only ones that actually publish research, not just technical reports, even if it helps the competition
5
0
46
@phill__1
Phil
4 months
Wow Gemini 1.5 in the aistudio can now run code and even download a dataset
Tweet media one
Tweet media two
@artificialguybr
𝑨𝒓𝒕𝒊𝒇𝒊𝒄𝒊𝒂𝒍 𝑮𝒖𝒚
4 months
Wait... You can run code for Gemini 1.5 Flash and Pro directly from Google API now?
Tweet media one
6
5
75
2
4
41
@phill__1
Phil
14 days
Not saying it isn't insanely cool but definitely space for competition at that price point
2
0
41
@phill__1
Phil
2 months
👀👀
Tweet media one
2
4
41
@phill__1
Phil
9 months
@simpBVerfG I love a defensive democracy. That gives hope for the AfD problem.
1
1
39
@phill__1
Phil
4 months
@futuristflower Nah, that matters, especially for code. Having 0 bugs vs 1 bug is a huge difference.
2
0
34
@phill__1
Phil
3 months
@alexalbert__ Can we please get a "continue" option within the same artifact? More complex generations often get truncated, and the next message will just start a new artifact.
4
0
30
@phill__1
Phil
6 months
@Xenophon789 @4btillidie Goes for all online dating apps in Korea: the best ones have ~90% men, the worst ones ~98%. Korea is just deeply conservative in regard to how women ought to behave, and putting yourself out there to date by attraction is judged and stigmatised
0
0
29
@phill__1
Phil
14 days
@dmsimon It actually makes total sense from Nvidia's point of view. You want to use this model yourself in your company? Buy an H100 from Nvidia.
2
0
29
@phill__1
Phil
5 months
Both new gpt2 chatbots are very similar and both still have the best domain knowledge of any llm
Tweet media one
1
2
27
@phill__1
Phil
3 months
@DanielDiMartino The chance of each candidate's percentage having exactly one decimal place without rounding is 1 in 14 million; I ran a Monte Carlo simulation on this.
3
0
25
@phill__1
Phil
14 days
@Hans365days If you and the AI both talk half of the time, a 30-minute conversation costs $4.50
3
0
27
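The $4.50 figure can be reproduced with simple per-minute arithmetic. This is a sketch under an assumption: the rates below (about $0.06 per minute of audio input and $0.24 per minute of audio output) are not stated anywhere in the thread and are plugged in only because they yield the quoted number. The function name is hypothetical.

```python
# Assumed per-minute audio rates in US cents (not stated in the thread).
INPUT_CENTS_PER_MIN = 6    # audio going to the model (user talking)
OUTPUT_CENTS_PER_MIN = 24  # audio coming from the model (AI talking)


def conversation_cost_cents(total_minutes: float, ai_share: float = 0.5) -> float:
    """Cost of a voice conversation where the AI talks `ai_share` of the time."""
    ai_minutes = total_minutes * ai_share
    user_minutes = total_minutes - ai_minutes
    return user_minutes * INPUT_CENTS_PER_MIN + ai_minutes * OUTPUT_CENTS_PER_MIN


cost = conversation_cost_cents(30)
print(f"${cost / 100:.2f}")  # 15 min in at 6 cents + 15 min out at 24 cents
```

Under these assumed rates, output audio accounts for $3.60 of the total, so the split between who talks matters far more than the overall length.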
@phill__1
Phil
14 days
Paper with a lot of technical detail:
1
4
26
@phill__1
Phil
6 months
Command R+ being significantly better than GPT-4 on launch is insane. Cohere is becoming my fave AI lab
@lmarena_ai
lmarena.ai (formerly lmsys.org)
6 months
Exciting news - the latest Arena result are out! @cohere 's Command R+ has climbed to the 6th spot, matching GPT-4-0314 level by 13K+ human votes! It's undoubtedly the **best** open model on the leaderboard now🔥 Big congrats to @cohere 's incredible work & valuable contribution
Tweet media one
44
305
1K
0
0
25
@phill__1
Phil
4 months
@sparbuchfeinde Doesn't sound so crazy anymore when you consider that Riester and Rürup are only this bad because the costs are so high. An absolutely clear example of lobbying
1
0
25
@phill__1
Phil
14 days
@parthagar Depends on your use case. People max out their hour of ChatGPT audio usage per day all the time. That would be $360 per user per month
1
0
24
@phill__1
Phil
7 months
@TheDrugMoney 8% ROI on the $50k, excluding repairs, is not amazing but not bad. It depends on your portfolio. I would not recommend more than 10% in housing in general, as the stock market already includes a decent amount of companies that invest in housing.
11
0
23
@phill__1
Phil
6 months
@futuristflower AdamGPT from OpenAI retweeted me speculating on synthetic math data for training. So they probably have a math model that is really good at creating and solving new math problems, likely Q*
1
2
23
@phill__1
Phil
2 months
@kimmonismus Partial outage, should come back soon
Tweet media one
1
0
22
@phill__1
Phil
2 months
@kimmonismus @OfficialLoganK Dario Amodei said the same, 2 more generations until we have agents. Opus 3.5 this year and Opus 4 probably mid-2025 which would be 2 generations
1
0
22
@phill__1
Phil
2 months
@robertskmiles Anybody who argues that this is inherently different from how humans think should read John Locke
4
0
22
@phill__1
Phil
24 days
Interesting observation in Google's new paper. Is 3.5 Sonnet based on distilled 3 Opus weights?
Tweet media one
2
2
21
@phill__1
Phil
2 months
@AravSrinivas At this rate, perplexity will be the first company that can write you a full research paper with proper citations and actual new insights
2
0
21
@phill__1
Phil
2 months
@abacaj This was desperately needed; it's finally competitively priced against 3.5 Sonnet again.
0
1
21
@phill__1
Phil
14 days
@DavidSZDahan It's from today, people in the US are still sleeping
1
0
19
@phill__1
Phil
3 months
This is actually crazy, Gemini is about to solve Geoguessr (personal image, guaranteed not in the training data)
Tweet media one
@phill__1
Phil
3 months
Gemini-test is really good with image inputs. I am starting to think this might be Gemini Pro 2.0
Tweet media one
0
1
14
1
4
18
@phill__1
Phil
6 months
@kimmonismus It still has rounding errors on relatively simple calculator problems. If this was the complete Q* implementation as leaked, this should not happen as far as I understand. Also, the model is great, but I hope Q* will be more impressive in all reasoning tasks.
1
0
16
@phill__1
Phil
5 months
@anpaure I think he is talking about making research available in general, not only in scientific journals. If you invent/research something and keep it as a company secret, that's R&D and not science since science is based on contributing to the world's knowledge
0
0
16
@phill__1
Phil
2 months
@kimmonismus I think it's now pretty safe to say, this is not a frontier model, at least not in code.
1
0
17
@phill__1
Phil
11 months
@MattEnder3 @RWPopulism @cnviolations Nobody calls Syrian immigrants white
1
0
14
@phill__1
Phil
9 months
@Northernlion Ephedrine / Pseudoephedrine works against symptoms quite well. Banned / regulated in a bunch of countries since you can use it to cook illegal stuff tho.
1
0
12
@phill__1
Phil
28 days
There is literally no reason to use gpt4o-mini currently. Flash 1.5 Experimental is half the price, 50% faster and has the same performance.
@OfficialLoganK
Logan Kilpatrick
29 days
We just shipped a series of changes which have significantly improved the Gemini 1.5 Flash latency (>3x reduction) and output tokens per second (>2x more)⚡️🚢
Tweet media one
89
124
2K
3
0
16
@phill__1
Phil
2 months
At this point, it's actually free for all personal and company internal use. A company knowledge RAG Chatbot for a 500-employee company will not have more than 15 RPM in my experience.
@OfficialLoganK
Logan Kilpatrick
2 months
Gemini 1.5 Flash free tier comes with: - 15 RPM (requests per minute) - 1 million TPM (tokens per minute) - 1,500 RPD (requests per day) - free context caching, up to 1 million tokens of storage per hour - free fine-tuning That’s 1.5 Billion tokens free, everyday. (2/4)
16
24
533
1
0
16
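The free-tier numbers quoted above can be turned into rough daily ceilings. The sketch below is back-of-the-envelope arithmetic only. It assumes a request can carry up to 1 million tokens (an assumption, not something the quoted tweet states) and that an 8-hour workday is the relevant window for an internal chatbot.

```python
# Limits as listed in the quoted tweet.
RPM = 15          # requests per minute
TPM = 1_000_000   # tokens per minute
RPD = 1_500       # requests per day

# Assumption: a single request may use up to 1M tokens of context.
MAX_TOKENS_PER_REQUEST = 1_000_000

# Daily token ceiling implied by the request-per-day limit.
tokens_by_requests = RPD * MAX_TOKENS_PER_REQUEST

# Daily token ceiling implied by the tokens-per-minute limit.
tokens_by_tpm = TPM * 60 * 24

# Average request rate the daily quota allows over an assumed 8-hour workday.
avg_rpm_workday = RPD / (8 * 60)

print(tokens_by_requests)  # request-limited ceiling
print(tokens_by_tpm)       # TPM-limited ceiling
print(avg_rpm_workday)     # sustained requests per minute within the daily cap
```

The request-limited ceiling matches the "1.5 Billion tokens" figure. The daily request cap, not the 15 RPM burst limit, is the tighter constraint for sustained use, since it averages out to only a few requests per minute across a workday.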
@phill__1
Phil
7 months
@NPCollapse This is such a big AI safety concern. This is what Robert Miles predicted years ago, first step to fooling AI researchers into releasing a potentially dangerous model. If it knows it's tested, it can just lie to get out of the sandbox. And if you don't know Robert Miles, watch
0
1
16
@phill__1
Phil
5 months
@OpenAI I don't agree with "Don’t try to change anyone’s mind", especially if the user argues such clear-cut cases as flat earth theory. As a society we have to believe in certain ground truths; it does not help society if the AI starts entertaining scientific illiteracy to not offend
7
2
14
@phill__1
Phil
5 months
@mister_shroom "Unicorn startup founder" who interviews at OpenAI for a data labeling job? Also, they never mention the unicorn they supposedly founded. I am calling cap; there's no way they pay a competitive Western wage for this. It's probably all outsourced to Asian and African devs.
0
0
14
@phill__1
Phil
2 months
@ObamaNoMessiah Just try it in ChatGPT. It does not work since ChatGPT cannot run code that requires internet access. Having access to all live and historical financial data, calling different APIs and installing new packages like geopandas for more complex chats all does not work in ChatGPT
2
0
15
@phill__1
Phil
1 month
This is the number one reason why Claude feels so much better in long chats, and why I am not going back to ChatGPT even if the model might be slightly better in single-turn conversations
Tweet media one
1
1
15
@phill__1
Phil
9 months
@AnthropicAI This is important since everyone thinks, just train and test the AGI in a lab environment. AGI will know if it's in a lab or the real world, e.g., due to the current date being past the training cut off. It could fake "good" behavior to get into the real world
4
0
12
@phill__1
Phil
7 months
@luciascarlet One UI 6 on Snapdragon 8 Gen 2 works just fine
Tweet media one
0
0
14
@phill__1
Phil
20 days
There is a new model in the artificial analysis image arena that beats Flux Pro 🤔
Tweet media one
0
1
14
@phill__1
Phil
1 month
@sophiamyang Are we going to be able to test it on Le Chat soon?
1
0
14
@phill__1
Phil
1 month
RIP. You need at least 1k API spend to get access to o1. Hope OpenRouter saves us
Tweet media one
1
0
13
@phill__1
Phil
2 months
What is going on with @01AI_Yi? Are they pulling out of the non-Chinese market?
Tweet media one
5
2
14
@phill__1
Phil
4 months
@kimmonismus We will have PhD-level intelligence, and they will still talk about how it can't count the r's in strawberry
2
0
13
@phill__1
Phil
2 months
A Udio-generated song has made it to the top 48 in the official German single charts! 🤯
Tweet media one
2
3
12
@phill__1
Phil
2 months
Google has never been so back, first time anybody has ranked above OpenAI in the "Overall" category
Tweet media one
0
0
14
@phill__1
Phil
8 months
@lmsysorg @alibaba_cloud Are we going to see Gemini advanced in the arena?
0
0
13
@phill__1
Phil
20 days
It's actually over for us eurocucks
Tweet media one
3
0
13
@phill__1
Phil
9 months
@luciascarlet I don't understand how the tech bros are defending this. The fact that you still cannot ship a build straight from github should outrage any developer. Open source just does not work if you have to pay 50ct for every download as a developer
0
0
12
@phill__1
Phil
1 year
@alain19651 @tagesschau So that's why he deserved to drown? BTW, I'm still waiting for the poison to kick in, how much longer is that supposed to take?
1
0
10
@phill__1
Phil
5 months
@kimmonismus Would this mean agents by WWDC? Siri that cannot even set a timer would be less than ideal, even when way smarter
6
0
11
@phill__1
Phil
7 months
@OrowaSikder It knows if it's getting tested? Dang, this is some scary reasoning
0
0
11
@phill__1
Phil
28 days
lmarena is pretty crazy currently. Unreleased models:
- potter-v1 and -v2 and dumbledore-v1 and -v2 (only in vision arena)
- zeus-flare-thunder-v1 and -v2
- sharp-game-player-v1 and -v2
- qwen2.5-72b-instruct
- pizza-model-small and -large (reka ai)
- gemini-test (gemini 1.5 pro
3
3
13
@phill__1
Phil
2 months
@OfficialLoganK `gemini-1.5-pro-exp-0827` sadly suffers from the lazy coding disease similar to gpt 4 turbo at release. Instead of writing the code, it often just gives me comments //implement function abc here. Even aggressively prompting against that does not help. I hope this gets fixed
3
0
11
@phill__1
Phil
16 days
@sammcallister He’s a 10 but he starts every message with an apology
2
0
11
@phill__1
Phil
2 months
Sounds like no new model launch planned for dev day
Tweet media one
0
2
11
@phill__1
Phil
2 months
Ideogram 2.0 is actually insane at generating full posters, text generation is definitely a step change
Tweet media one
Tweet media two
Tweet media three
Tweet media four
1
1
11
@phill__1
Phil
2 months
New gpt-4o model with 4x the number of output tokens
Tweet media one
1
0
11
@phill__1
Phil
26 days
gemini-test is absolutely cracked at timeguessr, no other model even gets the correct city and gemini is only 500 meters and one year off
Tweet media one
Tweet media two
Tweet media three
1
0
12