A Powerful Generic TypeScript Function for Generating Valuable RAG Texts

In a Nutshell: Given X, what’s the probability of Y occurring?

Photo by ThisIsEngineering: https://www.pexels.com/photo/code-projected-over-woman-3861969/

The full code example used in this post is on TypeScript Playground.

Generating Deterministic Statistics for LLMs That are Bad at Deterministic Things

I’m currently hard at work on my latest SaaS product, AMT JOY, a historical, statistical, and probabilistic tool covering futures trading sessions with a focus on Auction Market Theory (AMT). One part of the product is what we are calling “AMT GPT”: a chat-based tool where you can ask for information about any historical session, with the eventual goal of planning intraday plays in real time. This is where I discovered embeddings and Retrieval Augmented Generation (RAG). Skeptical as always about the accuracy of LLMs, I fully expected poor results, and that is indeed what I saw at first. For example, I asked the very simple question “What trading session had the highest return?” and got a completely different answer each time! That’s when I realized:

This was completely my fault.

I first had to learn how RAG works. In the background, RAG first retrieves a small number of matching documents and then generates a response from them. I only had per-day session descriptions, in which the session return is a small part of each file, so the initial retrieval was faulty and the LLM could only report on whichever sessions were returned. As soon as I included a sort of “overall stats” file, it started working much more reliably. This stats file literally had the line “The session with the highest return occurred on …”, and this was almost certainly what the RAG was finding in subsequent queries. That’s when I realized that, to make the RAG work for a wide variety of statistical and probabilistic questions, I needed mounds and mounds of my data converted into human-readable, natural-language documents, and crucially not randomly generated data but highly structured, statistically accurate data. Only then would AMT GPT start to be useful.
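
To make the retrieval step concrete, here is a minimal sketch of top-k retrieval. It assumes a toy word-overlap score in place of a real embedding model, and the names `tokenize`, `score`, `retrieveTopK` and the sample documents are hypothetical, not part of AMT GPT. It only illustrates why a document that literally states the aggregate fact tends to be retrieved first.

```typescript
// Toy stand-in for embedding similarity: the fraction of query words found in the document.
// Real RAG systems compare embedding vectors; this is only an illustration.
const tokenize = (text: string): string[] =>
  text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((word) => word.length > 0);

const score = (query: string, doc: string): number => {
  const queryWords = tokenize(query);
  const docWords = tokenize(doc);
  if (queryWords.length === 0) return 0;
  const hits = queryWords.filter((word) => docWords.indexOf(word) !== -1).length;
  return hits / queryWords.length;
};

// Rank all documents by score and keep the k best ones.
const retrieveTopK = (query: string, docs: string[], k: number): string[] =>
  docs
    .map((doc) => ({ doc, s: score(query, doc) }))
    .sort((a, b) => b.s - a.s)
    .slice(0, k)
    .map((entry) => entry.doc);

// Hypothetical documents: two per-day descriptions and one "overall stats" line.
const sampleDocs: string[] = [
  "Session 2023-01-03 opened above the prior close and closed inside the initial balance.",
  "Session 2023-01-04 gapped down, filled the gap in B period, and returned 0.42.",
  "The session with the highest return occurred on 2023-01-05.",
];

const topDocs = retrieveTopK(
  "What trading session had the highest return?",
  sampleDocs,
  1
);
console.log(topDocs[0]); // the "overall stats" line
```

With only the per-day descriptions available, no document would match the words “highest return”, so whichever sessions happened to rank first would be handed to the LLM.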

Generating the Data

At first, I started with a sort of alpha, using string interpolation to format various files, like my ‘stats’ file:

...

return `# Stats overview Data for ${data[0].symbol}
The session with the highest return was ${sessionWithHighestReturn.formattedDate} (${sessionWithHighestReturn.sessionId}) with a return of ${sessionWithHighestReturn.sessionReturn.toFixed(2)}.
The session with the lowest return was ${sessionWithLowestReturn.formattedDate} (${sessionWithLowestReturn.sessionId}) with a return of ${sessionWithLowestReturn.sessionReturn.toFixed(2)}.
The session with the highest volume was ${sessionWithHighestVolume.formattedDate} (${sessionWithHighestVolume.sessionId}) with a volume of ${sessionWithHighestVolume.totalVolume}.
The session with the lowest volume was ${sessionWithLowestVolume.formattedDate} (${sessionWithLowestVolume.sessionId}) with a volume of ${sessionWithLowestVolume.totalVolume}.
The session with the highest A period return was ${sessionWithHighestAPeriodReturn.formattedDate} (${sessionWithHighestAPeriodReturn.sessionId}) with a return of ${sessionWithHighestAPeriodReturn.candles[0].r.toFixed(2)}.
The session with the lowest A period return was ${sessionWithLowestAPeriodReturn.formattedDate} (${sessionWithLowestAPeriodReturn.sessionId}) with a return of ${sessionWithLowestAPeriodReturn.candles[0].r.toFixed(2)}.`;
};

...
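
The elided part of the snippet has to pick out the sessions it interpolates. A minimal sketch of one way to do that with a single generic helper follows; `pickExtreme`, the trimmed `SessionSummary` type, and the sample values are assumptions for illustration, not the actual AMT JOY code.

```typescript
// Trimmed, hypothetical stand-in for ISessionStats with only the fields used here.
interface SessionSummary {
  sessionId: string;
  formattedDate: string;
  sessionReturn: number;
  totalVolume: number;
}

// Returns the item with the largest (or smallest) value of a numeric selector.
// Returns undefined for an empty array so callers must handle missing data.
const pickExtreme = <T>(
  items: T[],
  selector: (item: T) => number,
  mode: "max" | "min"
): T | undefined => {
  if (items.length === 0) return undefined;
  return items.reduce((best, item) => {
    const isBetter =
      mode === "max"
        ? selector(item) > selector(best)
        : selector(item) < selector(best);
    return isBetter ? item : best;
  });
};

// Made-up sample data, not real session results.
const sampleSessions: SessionSummary[] = [
  { sessionId: "S1", formattedDate: "Day 1", sessionReturn: 0.42, totalVolume: 1200 },
  { sessionId: "S2", formattedDate: "Day 2", sessionReturn: -1.1, totalVolume: 3400 },
  { sessionId: "S3", formattedDate: "Day 3", sessionReturn: 1.75, totalVolume: 900 },
];

const highestReturn = pickExtreme(sampleSessions, (s) => s.sessionReturn, "max");
const lowestVolume = pickExtreme(sampleSessions, (s) => s.totalVolume, "min");

if (highestReturn) {
  console.log(
    `The session with the highest return was ${highestReturn.formattedDate} (${highestReturn.sessionId}) with a return of ${highestReturn.sessionReturn.toFixed(2)}.`
  );
}
```

Each line of the stats file then needs only a different selector and mode, rather than a separate hand-written search.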

However, I realized that far more correlations were possible. Since potentially thousands of traders will use AMT JOY, a correlation that I consider useless or unnecessary, or that never even occurs to me, may in fact be an essential part of another trader’s strategy! Just look at our ISessionStats interface and consider how many “Given … Then” correlations are possible:

interface ISessionStats {
index: number;
symbol: string;
sessionId: string;
formattedDate: string;
dayOfWeek: string;
month: string;
year: number;
candles: Candle[];
dailyOTF: string;
weeklyOTF: string;
monthlyOTF: string;
orTrend: string;
orExtensionUp: string;
orExtensionDown: string;
ibTrend: string;
trTrend: string;
ibExtensionUp: string;
ibExtensionDown: string;
sessionType: string;
sessionAnnotation: string;
openingRange: Range;
initialBalance: Range;
tradingRange: Range;
vwap: VWAPWithStdDev[];
timeAboveOR: number;
timeInOR: number;
timeBelowOR: number;
timeAboveIB: number;
timeInIB: number;
timeBelowIB: number;
timeAboveTR: number;
timeInTR: number;
timeBelowTR: number;
timeAboveSD1: number;
timeWithinSD1: number;
timeBelowSD1: number;
timeAboveSD2: number;
timeWithinSD2: number;
timeBelowSD2: number;
timeAboveSD3: number;
timeWithinSD3: number;
timeBelowSD3: number;
timeAboveSD4: number;
timeWithinSD4: number;
timeBelowSD4: number;
vwapCrossesUp: number;
vwapCrossesDown: number;
closeInRelationToOR: string;
closeInRelationToIB: string;
candleClosesBelowBelowORMidpoint: number;
candleClosesAboveAboveORMidpoint: number;
candleClosesBelowBelowIBMidpoint: number;
candleClosesAboveAboveIBMidpoint: number;
crossesORHighPeriodUp: string[];
crossesORHighPeriodDown: string[];
crossesORLowPeriodDown: string[];
crossesORLowPeriodUp: string[];
crossesIBHighPeriodUp: string[];
crossesIBHighPeriodDown: string[];
crossesIBHighPeriodDownCount: number;
crossesIBHighPeriodUpCount: number;
crossesIBLowPeriodDown: string[];
crossesIBLowPeriodUp: string[];
crossesIBLowPeriodUpCount: number;
crossesIBLowPeriodDownCount: number;
openingDrive: string;
gapName: string;
gapPercent: number;
gapFillLevel: number;
gapFilled: string;
gapFillPeriod: string;
sessionReturn: number;
totalVolume: number;
totalVolumePercentile: number;
ibHighFromOpenPercentChange: number;
ibLowFromOpenPercentChange: number;
lowestLevelFromOpenPercentChange: number;
lowestLevelPeriod: string;
highestLevelFromOpenPercentChange: number;
highestLevelPeriod: string;
trueRange: number;
averageTrueRange: number;
aPeriodTrend: string;
bPeriodTrend: string;
cPeriodTrend: string;
dPeriodTrend: string;
ePeriodTrend: string;
fPeriodTrend: string;
gPeriodTrend: string;
hPeriodTrend: string;
iPeriodTrend: string;
jPeriodTrend: string;
kPeriodTrend: string;
lPeriodTrend: string;
mPeriodTrend: string;
phod: number;
plod: number;
prevClose: number;
mostSimilarSessionsByReturn: Array<ISessionCorrelation>;
}

In any case, a generic, or at the very least abstract, approach is required to efficiently generate as many statistics-based files as possible. The more combinations of stats we can make, the more powerful our RAG.

Given This, Then…

The first function I wrote takes an array of your objects (of any type; that’s the power of generics!) and generates sentences of the form “given X, the probability of Y occurring is Z”. There are two parts. First, we count the occurrences of every value of each metric on its own. Only then can we combine the metrics pairwise to build all our Given / Then sentences:

const calculatePropertyStats = <T>(
  objects: T[],
  metrics: IProbabilityMetric<T>[]
): IPropertyStats<T> => {
  const propertyStats: IPropertyStats<T> = {
    totalCount: 0,
    uniqueValues: new Map<keyof T, Set<T[keyof T]>>(),
    valueCounts: new Map<keyof T, Map<T[keyof T], number>>(),
  };

  for (const obj of objects) {
    propertyStats.totalCount++;

    for (const metric of metrics) {
      const propertyKey = metric.property;
      const propertyValue = obj[propertyKey];

      // First time we see this property: create its value set and counter map.
      if (!propertyStats.uniqueValues.has(propertyKey)) {
        propertyStats.uniqueValues.set(propertyKey, new Set<T[keyof T]>());
        propertyStats.valueCounts.set(
          propertyKey,
          new Map<T[keyof T], number>()
        );
      }

      propertyStats.uniqueValues.get(propertyKey)!.add(propertyValue);

      // First time we see this value: start its counter at zero.
      if (!propertyStats.valueCounts.get(propertyKey)!.has(propertyValue)) {
        propertyStats.valueCounts.get(propertyKey)!.set(propertyValue, 0);
      }

      // Increment the counter for this value.
      propertyStats.valueCounts
        .get(propertyKey)!
        .set(
          propertyValue,
          propertyStats.valueCounts.get(propertyKey)!.get(propertyValue)! + 1
        );
    }
  }

  return propertyStats;
};
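
The post does not show `IProbabilityMetric` or `IPropertyStats`. Inferred from how the two functions use them, they could look roughly like the sketch below; treat the exact shapes as an assumption, not the original definitions. The `Visit` type and the hand-built `exampleStats` value are hypothetical and only show what the two maps hold after counting.

```typescript
// Assumed shape: a property to analyze plus the phrase used for it in a sentence.
interface IProbabilityMetric<T> {
  label: string;
  property: keyof T;
}

// Assumed shape: per-property unique values and how often each value occurs.
interface IPropertyStats<T> {
  totalCount: number;
  uniqueValues: Map<keyof T, Set<T[keyof T]>>;
  valueCounts: Map<keyof T, Map<T[keyof T], number>>;
}

// Hypothetical record type for illustration.
interface Visit {
  weekday: string;
  returning: boolean;
}

const visitMetric: IProbabilityMetric<Visit> = {
  label: "the weekday",
  property: "weekday",
};

// Stats as they would look after counting three visits:
// two on Monday and one on Tuesday (only the weekday metric is tracked).
const exampleStats: IPropertyStats<Visit> = {
  totalCount: 3,
  uniqueValues: new Map(),
  valueCounts: new Map(),
};

exampleStats.uniqueValues.set(
  visitMetric.property,
  new Set<Visit[keyof Visit]>(["Monday", "Tuesday"])
);

const weekdayCounts = new Map<Visit[keyof Visit], number>();
weekdayCounts.set("Monday", 2);
weekdayCounts.set("Tuesday", 1);
exampleStats.valueCounts.set(visitMetric.property, weekdayCounts);

// Share of visits that fell on a Monday: 2 of 3.
const mondayShare =
  exampleStats.valueCounts.get("weekday")!.get("Monday")! /
  exampleStats.totalCount;
console.log(mondayShare.toFixed(2)); // "0.67"
```

Typing `property` as `keyof T` means a misspelled property name in a metric is rejected at compile time.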

Building on this function, we can write the parent function, generateProbabilitySentences:

const generateProbabilitySentences = <T>(
  objects: T[],
  metrics: IProbabilityMetric<T>[]
): string[] => {
  const propertyKeys = metrics.map((metric) => metric.property);
  const propertyStats: IPropertyStats<T> = calculatePropertyStats(
    objects,
    metrics
  );

  const sentences: string[] = [];

  // "?? []" keeps the loops safe when objects is empty and no values were recorded.
  for (const n1PropertyKey of propertyKeys) {
    for (const n1Value of propertyStats.uniqueValues.get(n1PropertyKey) ?? []) {
      for (const n2PropertyKey of propertyKeys) {
        for (const n2Value of propertyStats.uniqueValues.get(n2PropertyKey) ?? []) {
          // How often the "given" value occurs on its own.
          const n1ValueCount = propertyStats.valueCounts
            .get(n1PropertyKey)!
            .get(n1Value)!;

          // How often the "given" and "then" values occur together.
          const intersectionCount = objects.filter(
            (obj) =>
              obj[n1PropertyKey] === n1Value && obj[n2PropertyKey] === n2Value
          ).length;

          // Conditional probability as a percentage string with two decimals.
          const probability =
            n1ValueCount > 0
              ? ((intersectionCount / n1ValueCount) * 100).toFixed(2)
              : "0.00";

          const label1 = metrics.find(
            (metric) => metric.property === n1PropertyKey
          )?.label;
          const label2 = metrics.find(
            (metric) => metric.property === n2PropertyKey
          )?.label;

          const propIndex1 = propertyKeys.indexOf(n1PropertyKey);
          const propIndex2 = propertyKeys.indexOf(n2PropertyKey);

          // Pair each metric only with the metrics listed after it.
          if (label1 !== label2 && propIndex1 < propIndex2) {
            sentences.push(
              `Given ${label1} is ${n1Value}, the probability of ${label2} ${n2Value} is ${probability}%.`
            );
          }
        }
      }
    }
  }

  return sentences;
};
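
Because of the `propIndex1 < propIndex2` check, each pair of metrics is used in one direction only, so (assuming every metric has a distinct label) the number of sentences is the sum, over all metric pairs, of the product of their unique-value counts. A small sketch of that arithmetic, with `expectedSentenceCount` as a hypothetical helper name:

```typescript
// uniqueCounts[i] is the number of distinct values seen for the i-th metric.
// Every earlier metric is paired with every later one, one sentence per value pair.
const expectedSentenceCount = (uniqueCounts: number[]): number => {
  let total = 0;
  for (let i = 0; i < uniqueCounts.length; i++) {
    for (let j = i + 1; j < uniqueCounts.length; j++) {
      total += uniqueCounts[i] * uniqueCounts[j];
    }
  }
  return total;
};

// A metric with 7 distinct values followed by one with 2 distinct values.
console.log(expectedSentenceCount([7, 2])); // 14

// Three metrics with 7, 3, and 2 distinct values: 7*3 + 7*2 + 3*2.
console.log(expectedSentenceCount([7, 3, 2])); // 41
```

The count grows with the product of the value counts, so properties with many distinct values (raw prices, for example) produce a very large number of sentences unless they are bucketed first.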

Example

Observe the following example data shape, ISalesData:

interface ISalesData {
dayOfWeek: string
productSold: string
usedCoupon: boolean
}

and the data, salesData:

const salesData: ISalesData[] = [
{ dayOfWeek: "Monday", productSold: "Product A", usedCoupon: true },
{ dayOfWeek: "Tuesday", productSold: "Product B", usedCoupon: false },
{ dayOfWeek: "Wednesday", productSold: "Product C", usedCoupon: true },
{ dayOfWeek: "Thursday", productSold: "Product A", usedCoupon: false },
{ dayOfWeek: "Friday", productSold: "Product B", usedCoupon: true },
{ dayOfWeek: "Saturday", productSold: "Product C", usedCoupon: false },
{ dayOfWeek: "Sunday", productSold: "Product A", usedCoupon: true },
{ dayOfWeek: "Monday", productSold: "Product B", usedCoupon: false },
{ dayOfWeek: "Tuesday", productSold: "Product C", usedCoupon: true },
{ dayOfWeek: "Wednesday", productSold: "Product A", usedCoupon: false },
{ dayOfWeek: "Thursday", productSold: "Product B", usedCoupon: true },
{ dayOfWeek: "Friday", productSold: "Product C", usedCoupon: false },
{ dayOfWeek: "Saturday", productSold: "Product A", usedCoupon: true },
{ dayOfWeek: "Sunday", productSold: "Product B", usedCoupon: false },
{ dayOfWeek: "Monday", productSold: "Product C", usedCoupon: true },
{ dayOfWeek: "Tuesday", productSold: "Product A", usedCoupon: false },
{ dayOfWeek: "Wednesday", productSold: "Product B", usedCoupon: true },
{ dayOfWeek: "Thursday", productSold: "Product C", usedCoupon: false },
{ dayOfWeek: "Friday", productSold: "Product A", usedCoupon: true },
{ dayOfWeek: "Saturday", productSold: "Product B", usedCoupon: false },
];

What if we want to know the probability that a customer buys a certain product on a given day? Or the chance they used a coupon on a certain day? Easy: we just pass each of these given / then scenarios into generateProbabilitySentences:

// day of week and coupon
let metrics: IProbabilityMetric<ISalesData>[] = [
  {
    label: "the day of week",
    property: "dayOfWeek",
  },
  {
    label: "the customer using a coupon being",
    property: "usedCoupon",
  },
];
let sentences = generateProbabilitySentences(salesData, metrics);

// log all sentences to the console
sentences.forEach((sentence) => console.log(sentence));

// day of week and product
metrics = [
  {
    label: "the day of week",
    property: "dayOfWeek",
  },
  {
    label: "the customer purchasing the product",
    property: "productSold",
  },
];
sentences = generateProbabilitySentences(salesData, metrics);

// log all sentences to the console
sentences.forEach((sentence) => console.log(sentence));

and… drumroll please… our amazing output:

Given the day of week is Monday, the probability of the customer using a coupon being true is 66.67%.
Given the day of week is Monday, the probability of the customer using a coupon being false is 33.33%.
Given the day of week is Tuesday, the probability of the customer using a coupon being true is 33.33%.
Given the day of week is Tuesday, the probability of the customer using a coupon being false is 66.67%.
Given the day of week is Wednesday, the probability of the customer using a coupon being true is 66.67%.
Given the day of week is Wednesday, the probability of the customer using a coupon being false is 33.33%.
Given the day of week is Thursday, the probability of the customer using a coupon being true is 33.33%.
Given the day of week is Thursday, the probability of the customer using a coupon being false is 66.67%.
Given the day of week is Friday, the probability of the customer using a coupon being true is 66.67%.
Given the day of week is Friday, the probability of the customer using a coupon being false is 33.33%.
Given the day of week is Saturday, the probability of the customer using a coupon being true is 33.33%.
Given the day of week is Saturday, the probability of the customer using a coupon being false is 66.67%.
Given the day of week is Sunday, the probability of the customer using a coupon being true is 50.00%.
Given the day of week is Sunday, the probability of the customer using a coupon being false is 50.00%.
Given the day of week is Monday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Monday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Monday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Tuesday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Tuesday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Tuesday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Wednesday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Wednesday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Wednesday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Thursday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Thursday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Thursday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Friday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Friday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Friday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Saturday, the probability of the customer purchasing the product Product A is 33.33%.
Given the day of week is Saturday, the probability of the customer purchasing the product Product B is 33.33%.
Given the day of week is Saturday, the probability of the customer purchasing the product Product C is 33.33%.
Given the day of week is Sunday, the probability of the customer purchasing the product Product A is 50.00%.
Given the day of week is Sunday, the probability of the customer purchasing the product Product B is 50.00%.
Given the day of week is Sunday, the probability of the customer purchasing the product Product C is 0.00%.

I can guarantee the statistics are accurate :)
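
One way to spot-check a sentence is to recompute the conditional probability directly from the raw rows. The sketch below does that for the Monday and Sunday rows of salesData, copied by hand; `conditionalProbability` and the `CouponRow` type are illustrative names, not part of the code above.

```typescript
// Only the two fields needed for this check.
interface CouponRow {
  dayOfWeek: string;
  usedCoupon: boolean;
}

// P(outcome | given) = count(given AND outcome) / count(given), as a percentage.
// Returns undefined when no row matches the "given" condition.
const conditionalProbability = <T>(
  rows: T[],
  given: (row: T) => boolean,
  outcome: (row: T) => boolean
): number | undefined => {
  const givenRows = rows.filter(given);
  if (givenRows.length === 0) return undefined;
  return (givenRows.filter(outcome).length / givenRows.length) * 100;
};

// The Monday and Sunday rows from salesData, in their original order.
const checkRows: CouponRow[] = [
  { dayOfWeek: "Monday", usedCoupon: true },
  { dayOfWeek: "Sunday", usedCoupon: true },
  { dayOfWeek: "Monday", usedCoupon: false },
  { dayOfWeek: "Sunday", usedCoupon: false },
  { dayOfWeek: "Monday", usedCoupon: true },
];

const mondayCoupon = conditionalProbability(
  checkRows,
  (row) => row.dayOfWeek === "Monday",
  (row) => row.usedCoupon
);
const sundayCoupon = conditionalProbability(
  checkRows,
  (row) => row.dayOfWeek === "Sunday",
  (row) => row.usedCoupon
);

console.log(mondayCoupon !== undefined ? mondayCoupon.toFixed(2) : "n/a"); // "66.67"
console.log(sundayCoupon !== undefined ? sundayCoupon.toFixed(2) : "n/a"); // "50.00"
```

Both values agree with the Monday and Sunday coupon sentences in the output above.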

You could then put sentences like these into a document, embed it into your favorite vector store (like Pinecone) with the embedding model of your choice, and query it through an LLM (like GPT-4)! Your queries are then far more likely to be matched with accurate, rather than hallucinated, data.
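
Before embedding, the sentences need to be grouped into documents. Here is a minimal sketch of that step, assuming fixed-size chunks with a short title line in the same style as the stats file above. `toDocuments`, the `RagDocument` shape, and the chunk size are hypothetical, and the upload to a vector store is left out because it depends on the provider's client.

```typescript
// Hypothetical document shape: an id for the vector store and the text to embed.
interface RagDocument {
  id: string;
  text: string;
}

// Splits sentences into chunks of chunkSize and prefixes each chunk with a title line.
const toDocuments = (
  title: string,
  sentences: string[],
  chunkSize: number
): RagDocument[] => {
  if (chunkSize < 1) throw new Error("chunkSize must be at least 1");
  const result: RagDocument[] = [];
  for (let start = 0; start < sentences.length; start += chunkSize) {
    const chunk = sentences.slice(start, start + chunkSize);
    result.push({
      id: `${title}-${result.length + 1}`,
      text: [`# ${title}`, ...chunk].join("\n"),
    });
  }
  return result;
};

// Three sentences taken from the output above, two per document.
const exampleDocuments = toDocuments(
  "coupon-by-day",
  [
    "Given the day of week is Saturday, the probability of the customer using a coupon being false is 66.67%.",
    "Given the day of week is Sunday, the probability of the customer using a coupon being true is 50.00%.",
    "Given the day of week is Sunday, the probability of the customer using a coupon being false is 50.00%.",
  ],
  2
);

console.log(exampleDocuments.length); // 2
console.log(exampleDocuments[1].id); // "coupon-by-day-2"
```

Smaller chunks keep each retrieved document focused on a few facts, at the cost of more documents to embed and store.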

Of course, this is a toy example, and you can see that this function works best with properties that behave like string enums, i.e. a fixed set of discrete values. You would need to define additional rules for things like sums, time frames (“show me all sales for June / July / August”), averages, max, or min, and this is exactly what we’re working on!
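
As one possible direction for numeric properties, here is a hedged sketch that produces average, minimum, and maximum sentences per group. `generateNumericSentences`, the `Sale` type, and the `amount` field are hypothetical; this is not the set of rules being built for the product.

```typescript
// Hypothetical record with one categorical and one numeric property.
interface Sale {
  weekday: string;
  amount: number;
}

const generateNumericSentences = <T>(
  rows: T[],
  groupLabel: string,
  getGroup: (row: T) => string,
  valueLabel: string,
  getValue: (row: T) => number
): string[] => {
  // Collect the numeric values per group, preserving first-seen group order.
  const groups = new Map<string, number[]>();
  for (const row of rows) {
    const key = getGroup(row);
    const values = groups.get(key);
    if (values) {
      values.push(getValue(row));
    } else {
      groups.set(key, [getValue(row)]);
    }
  }

  // One sentence per group with average, minimum, and maximum.
  const result: string[] = [];
  groups.forEach((values, key) => {
    const sum = values.reduce((acc, value) => acc + value, 0);
    const average = (sum / values.length).toFixed(2);
    const min = Math.min(...values).toFixed(2);
    const max = Math.max(...values).toFixed(2);
    result.push(
      `Given ${groupLabel} is ${key}, the average ${valueLabel} is ${average}, the minimum is ${min}, and the maximum is ${max}.`
    );
  });
  return result;
};

// Made-up sample data.
const sampleSales: Sale[] = [
  { weekday: "Monday", amount: 10 },
  { weekday: "Monday", amount: 20 },
  { weekday: "Tuesday", amount: 5 },
];

const numericSentences = generateNumericSentences(
  sampleSales,
  "the day of week",
  (sale) => sale.weekday,
  "sale amount",
  (sale) => sale.amount
);
numericSentences.forEach((sentence) => console.log(sentence));
```

The sentences follow the same “Given … is …” pattern as the probability sentences, so both kinds can sit side by side in the same documents.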

More Coming!

In the coming weeks and months, I’m looking to scale this tool out into a fully separate product. Essentially, you’ll be able to put in your organization’s data, whatever type it might be, and we’ll generate all possible probabilities and statistics. You can then use RAG with any LLM of your choice against it to extract mathematically true values. The goal is to replace the need for separate big data analysis and to operate through an extremely human-like interface. Think of it like a friendly human assistant that has memorized the entirety of your org’s data set, including probabilities and statistics that you may not even have thought of yourself!

Also, if you’ve got this far, do you know of any tools working on similar problems? Or what this field of technology is called? I’d love to research it.

Thanks & Cheers

If this helped you at all in developing your RAGs, give it a few claps! I’m not one of those snake oil salesmen that puts all this good stuff behind the Medium paywall. This post is on my blog as well.

Cheers,

-Chris

